Inputting Business Logic Using Constraints

Do you have rules in your dataset that every row in the data must follow? You can use constraints to describe business logic to any of the SDV single table models.

The SDV has predefined constraints that are commonly found in datasets. For example:

  • Fixing combinations. Your table might have two different columns for city and country. The values in those columns should not be shuffled because that would result in incorrect locations (e.g. Paris, USA or London, Italy).

  • Comparing inequalities. Your table might have two different columns for an employee’s start_date and end_date that are related to each other: The start_date must always come before the end_date.

In this guide, we’ll walk through the usage of each predefined constraint.

Load a Tabular Demo

To illustrate some of the constraints, let’s load a small table that contains some details about employees from several companies.

In [1]: from sdv.demo import load_tabular_demo

In [2]: employees = load_tabular_demo()

In [3]: employees
Out[3]: 
     company     department  employee_id  age  age_when_joined  years_in_the_company     salary  annual_bonus  prior_years_experience  full_time  part_time  contractor
0       Pear          Sales            1   44               40                     4  157000.00       8500.00                       1        1.0        0.0         0.0
1       Pear         Design            5   46               39                     7   99922.78       8187.43                       2        0.0        0.0         1.0
2    Glasses             AI            1   44               41                     3   46000.00      19000.00                       1        1.0        0.0         0.0
3    Glasses  Search Engine            7   36               30                     6   93859.75      23479.83                       3        0.0        0.0         1.0
4   Cheerper        BigData            6   34               30                     4   84500.00      22000.00                       4        0.0        1.0         0.0
5   Cheerper        Support           11   38               29                     9  140000.00      14000.00                       2        0.0        1.0         0.0
6       Pear          Sales           28   42               33                     9  127500.00      15500.00                       5        1.0        0.0         0.0
7       Pear         Design           75   36               35                     1  102819.98      13044.93                       1        0.0        0.0         1.0
8    Glasses             AI           33   34               32                     2  121500.00       6500.00                       1        1.0        0.0         0.0
9    Glasses  Search Engine           56   48               45                     3  121048.62      17242.98                       5        0.0        0.0         1.0
10  Cheerper        BigData           42   32               28                     4   71000.00       8500.00                       5        0.0        1.0         0.0
11  Cheerper        Support           80   48               39                     9   32500.00      15000.00                       5        0.0        1.0         0.0

This table contains a few rules that can be written as predefined constraints. We’ll use it as an example when describing the constraints.

Predefined Constraints

Unique

The Unique constraint enforces that the values in a column (or set of columns) are unique within the entire table.

In our demo table, there is a Unique constraint: Within a company, all the employee ids must be unique.

Enforce this by creating a Unique constraint. This object accepts a list of 1 or more column names.

In [4]: from sdv.constraints import Unique

In [5]: unique_employee_id_company_constraint = Unique(
   ...:     column_names=['employee_id', 'company']
   ...: )
   ...: 
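
To see why the constraint names both columns, the short check below uses (company, employee_id) pairs copied from rows 0-5 of the demo table. It is plain Python and does not call the SDV; the helper name has_duplicates is made up for this illustration.

```python
from collections import Counter

# (company, employee_id) pairs copied from rows 0-5 of the demo table
rows = [
    ("Pear", 1), ("Pear", 5), ("Glasses", 1),
    ("Glasses", 7), ("Cheerper", 6), ("Cheerper", 11),
]

def has_duplicates(values):
    """Return True if any value appears more than once (hypothetical helper)."""
    return any(count > 1 for count in Counter(values).values())

# employee_id 1 is used by both Pear and Glasses, so the id alone repeats
ids_only = [employee_id for _, employee_id in rows]
print(has_duplicates(ids_only))  # True

# the (company, employee_id) pair never repeats
print(has_duplicates(rows))      # False
```

A Unique constraint on employee_id alone would therefore not describe this table, while the two-column version does.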

Note

The SDV already ensures that primary keys are unique in the dataset. You do not need to add a Unique constraint on these columns.

FixedCombinations

The FixedCombinations constraint enforces that the combinations between a set of columns are fixed. That is, the synthetic data may only contain combinations that are already observed in the real data; no other permutation or shuffling is allowed.

In our demo table, there is a FixedCombinations constraint: Each company has a fixed set of departments. The company and department values should not be shuffled in the synthetic data.

Enforce this by creating a FixedCombinations constraint. This object accepts a list of 2 or more column names.

In [6]: from sdv.constraints import FixedCombinations

In [7]: fixed_company_department_constraint = FixedCombinations(
   ...:     column_names=['company', 'department']
   ...: )
   ...: 
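
As a rough illustration of what "fixed" means here, the sketch below collects the (company, department) pairs seen in the demo table and tests candidate pairs against them. It is plain Python rather than SDV code, and is_allowed is a hypothetical helper name.

```python
# (company, department) pairs observed in the demo table
observed = {
    ("Pear", "Sales"), ("Pear", "Design"),
    ("Glasses", "AI"), ("Glasses", "Search Engine"),
    ("Cheerper", "BigData"), ("Cheerper", "Support"),
}

def is_allowed(company, department):
    """A combination is valid only if it was seen in the real data."""
    return (company, department) in observed

print(is_allowed("Pear", "Design"))  # True: observed in the real data
print(is_allowed("Pear", "AI"))      # False: both values exist, but never together
```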

Inequality

The Inequality constraint enforces an inequality relationship between a pair of columns. For every row, the value in one column must be greater than or equal to the value in another.

In our demo table, there is an Inequality constraint: The current age of an employee must be greater than or equal to the age they were when they joined.

Enforce this by creating an Inequality constraint. This object accepts column names for the high and low columns. The columns can be either numerical or datetime.

In [8]: from sdv.constraints import Inequality

In [9]: age_gt_age_when_joined_constraint = Inequality(
   ...:     low_column_name='age_when_joined',
   ...:     high_column_name='age'
   ...: )
   ...: 

ScalarInequality

The ScalarInequality constraint enforces that all values in a column are greater or less than a fixed (scalar) value. That is, it enforces a lower or upper bound to the synthetic data.

In our demo table, we can define a ScalarInequality constraint: All employees must be 18 or older.

Enforce this by creating a ScalarInequality constraint. This object accepts a numerical or datetime column name and value. It also expects an inequality relation that must be one of “>”, “>=”, “<” or “<=”.

In [10]: from sdv.constraints import ScalarInequality

In [11]: age_gt_18 = ScalarInequality(
   ....:     column_name='age',
   ....:     relation='>=',
   ....:     value=18
   ....: )
   ....: 

Note

All SDV tabular models have an enforce_min_max_values parameter that you can set to enforce bounds on all columns. This constraint is redundant if you set this model parameter.

Positive and Negative

The Positive and Negative constraints are shortcuts to the ScalarInequality constraint when the column’s values must be >0 or <0.

In our demo table, we can define a Positive constraint: All employee ages must be positive.

Enforce this by creating a Positive constraint. This object accepts a numerical column name. (The Negative constraint works the same way.)

In [12]: from sdv.constraints import Positive

In [13]: age_positive = Positive(column_name='age')

Note

All SDV tabular models have an enforce_min_max_values parameter that you can set to enforce bounds on all columns. This constraint is redundant if you set this model parameter.

OneHotEncoding

The OneHotEncoding constraint enforces that a set of columns follows a one hot encoding scheme. That is, exactly one of the columns must contain a value of 1 while all the others must be 0.

In our demo table, we have a OneHotEncoding constraint: An employee can only be one of: full time, part time or contractor. That is, exactly one of these columns must be 1 while the others must be 0.

Enforce this by creating a OneHotEncoding constraint. The object accepts a list of column names that, together, are part of the one hot encoding scheme.

In [14]: from sdv.constraints import OneHotEncoding

In [15]: job_category_constraint = OneHotEncoding(
   ....:     column_names=['full_time', 'part_time', 'contractor']
   ....: )
   ....: 

FixedIncrements

The FixedIncrements constraint enforces that all the values in a column are increments of a particular, fixed value. That is, all the data must be divisible by the value.

We do not have a FixedIncrements constraint in our demo table. But we can imagine a table where all the salary values must be divisible by 500.

Enforce this by creating a FixedIncrements constraint. This object accepts a numerical column name and an increment value that must be an integer greater than 1.

In [16]: from sdv.constraints import FixedIncrements

# this constraint does not actually exist in the demo dataset
In [17]: salary_divisible_by_500 = FixedIncrements(
   ....:     column_name='salary',
   ....:     increment_value=500
   ....: )
   ....: 
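
The divisibility rule itself can be checked in a few lines. The sketch below is plain Python, not SDV code. It applies the rule to the salary values from rows 0-3 of the demo table, which shows why the demo data does not follow this constraint. Decimal is used here only to avoid floating point surprises with the remainder.

```python
from decimal import Decimal

# salary values copied from rows 0-3 of the demo table
salaries = [Decimal("157000.00"), Decimal("99922.78"),
            Decimal("46000.00"), Decimal("93859.75")]

increment = 500

# a value follows the rule when dividing by the increment leaves no remainder
follows_rule = [salary % increment == 0 for salary in salaries]
print(follows_rule)  # [True, False, True, False]
```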

Range

The Range constraint enforces that for all rows, the value of one of the columns is bounded by the values in the other two columns.

We do not have a Range constraint in our demo table. But we can imagine a table where an employee’s age is bounded by the age when they first started working and an age when they will retire.

Enforce this by creating a Range constraint. This object accepts high, middle and low column names. The columns can be either numerical or datetime.

In [18]: from sdv.constraints import Range

# this constraint does not actually exist in the demo dataset
In [19]: age_btwn_joined_retirement = Range(
   ....:     low_column_name='age_started_working',
   ....:     middle_column_name='age_today',
   ....:     high_column_name='age_when_retiring'
   ....: )
   ....: 

Note

This constraint assumes strict bounds between the low, middle and high column names. That is: low < middle < high. You can express other business logic using multiple Inequality and ScalarInequality constraints.

ScalarRange

The ScalarRange constraint enforces that all the values in a column are in between two known, fixed values. That is, it enforces upper and lower bounds to the data.

In our demo table, we can define a ScalarRange constraint: All employees must be between the ages of 18 and 100.

Enforce this by creating a ScalarRange constraint. This object accepts a numerical or datetime column name and the low and high values. It also accepts a boolean that describes whether the ranges are strict (exclusive) or not (inclusive).

In [20]: from sdv.constraints import ScalarRange

In [21]: age_btwn_18_100 = ScalarRange(
   ....:     column_name='age',
   ....:     low_value=18,
   ....:     high_value=100,
   ....:     strict_boundaries=False
   ....: )
   ....: 

Note

All SDV tabular models have an enforce_min_max_values parameter that you can set to enforce bounds on all columns. This constraint is redundant if you set this model parameter.

Applying the Constraints

Once you have defined the constraints, you can use them in any SDV single table model (TabularPreset, GaussianCopula, CopulaGAN, CTGAN and TVAE). Use the constraints parameter to pass in the objects as a list.

In [22]: from sdv.tabular import GaussianCopula

In [23]: constraints = [
   ....:     unique_employee_id_company_constraint,
   ....:     fixed_company_department_constraint,
   ....:     age_gt_age_when_joined_constraint,
   ....:     job_category_constraint,
   ....:     age_btwn_18_100
   ....: ]
   ....: 

In [24]: model = GaussianCopula(constraints=constraints, enforce_min_max_values=False)

Then you can fit the model using the real data. During this process, the SDV ensures that the model learns the constraints.

In [25]: model.fit(employees)

Warning

The constraints must accurately describe the data. Constraints are business rules that must be followed by every row of your data. If the real data does not fully meet a constraint, the model will not be able to learn it and the SDV will throw an error.

Finally, you can sample synthetic data. Observe that every row in the synthetic data adheres to the constraints.

In [26]: synthetic_data = model.sample(num_rows=10)

In [27]: synthetic_data
Out[27]: 
    company     department  employee_id  age  age_when_joined  years_in_the_company     salary  annual_bonus  prior_years_experience  full_time  part_time  contractor
0      Pear         Design            7   30               28                     2  138288.53      13760.33                       2        1.0        0.0         0.0
1  Cheerper        BigData           54   42               34                     8   56533.72      14649.08                       4        1.0        0.0         0.0
2   Glasses  Search Engine           21   41               38                     3   73930.87      14313.96                       3        1.0        0.0         0.0
3      Pear          Sales            4   31               29                     2  112134.04      12518.74                       1        1.0        0.0         0.0
4   Glasses  Search Engine           22   39               30                     9   96812.95      16629.63                       5        1.0        0.0         0.0
5  Cheerper        BigData           35   42               34                     8   78989.60      10539.48                       2        1.0        0.0         0.0
6   Glasses  Search Engine           37   39               36                     3   74568.04       9465.79                       2        0.0        0.0         1.0
7   Glasses             AI            9   42               39                     3  121196.43      22518.88                       3        0.0        0.0         1.0
8  Cheerper        BigData           38   40               36                     5  108572.76      21520.80                       2        0.0        1.0         0.0
9  Cheerper        BigData           37   42               35                     8   79930.04      17756.40                       4        0.0        1.0         0.0
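
One way to spot-check this claim is to test individual rows against the rules by hand. The sketch below copies a few columns from rows 0, 6 and 8 of the synthetic output above into plain dictionaries and checks three of the constraints. The helper row_is_valid is hypothetical and is not part of the SDV.

```python
# a few columns copied from rows 0, 6 and 8 of the synthetic output above
sample = [
    {"age": 30, "age_when_joined": 28, "full_time": 1.0, "part_time": 0.0, "contractor": 0.0},
    {"age": 39, "age_when_joined": 36, "full_time": 0.0, "part_time": 0.0, "contractor": 1.0},
    {"age": 40, "age_when_joined": 36, "full_time": 0.0, "part_time": 1.0, "contractor": 0.0},
]

def row_is_valid(row):
    """Check the Inequality, ScalarRange and OneHotEncoding rules for one row."""
    flags = [row["full_time"], row["part_time"], row["contractor"]]
    return (
        row["age"] >= row["age_when_joined"]   # Inequality
        and 18 <= row["age"] <= 100            # ScalarRange (inclusive bounds)
        and sorted(flags) == [0.0, 0.0, 1.0]   # OneHotEncoding: exactly one 1
    )

print(all(row_is_valid(row) for row in sample))  # True
```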

FAQs

Warning

Constraints may slow down the synthetic data model & leak privacy. Before adding a constraint to your model, carefully consider whether it is necessary. Here are a few questions to ask:

  • How do I plan to use the synthetic data? Without the constraint, the rule may still be valid a majority of the time. Only add the constraint if you require 100% adherence.

  • Who do I plan to share the synthetic data with? Consider whether they will be able to use the business rule to uncover sensitive information about the real data.

  • How did the rule come to be? In some cases, there may be other data sources available that do not have the extra columns and rules.

In the ideal case, there are only a handful of constraints you are applying to your model.

When do constraints affect the modeling & sampling performance?

In most cases, the time it takes to fit the model and sample synthetic data should not be significantly affected if you add a few constraints. However, there are certain scenarios where you may notice a slow-down:

  • You have a large number of constraints that overlap. That is, multiple constraints are referencing the same column(s) in the data.

  • Your constrained data has a high cardinality. For example, you have a categorical column with hundreds of possible categories that you are using in a FixedCombinations constraint.

  • You are using conditional sampling on a constrained column. This requires some special processing and it may not always be possible to efficiently create conditional synthetic data.

For any questions or feature requests related to performance, please create an issue describing your data, constraints and sampling needs.

What happened to Rounding and ColumnFormula?

Rounding and ColumnFormula constraints were available in older versions of the SDV. These constraints are no longer included as predefined constraints because there are other ways to achieve the same logic:

  • Rounding: All SDV single table models contain a ‘rounding’ parameter. By default, they learn the number of decimal digits in your data and enforce that the synthetic data has the same.

  • ColumnFormula: In this version of the SDV, you can implement a formula as a CustomConstraint. See the Defining Custom Constraints guide for more details.

Why am I getting a ConstraintsNotMetError when I try to fit my data?

A constraint should describe a rule that is true for every row in your real data. If any rows in the real data violate the rule, the SDV will throw a ConstraintsNotMetError. Since the constraint is not true in your real data, the model will not be able to learn it.

If you see this error, you have two options:

  • (recommended) Remove the constraint. This ensures the model learns patterns that exist in the real data. You can use conditional sampling later to generate synthetic data with specific values.

  • Clean your input dataset. If you remove the violating rows from the real data, then you will be able to apply the constraint. This is not recommended because even if the model can learn the constraint, it is not truly representative of the full, original dataset.
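
Before choosing between these options, it can help to see which rows break the rule. The sketch below is a plain Python illustration, not an SDV feature: find_violations and age_rule are hypothetical names, and the third row is made up to show a violation.

```python
def find_violations(rows, rule):
    """Return the indices of the rows for which the rule is False."""
    return [index for index, row in enumerate(rows) if not rule(row)]

rows = [
    {"age": 44, "age_when_joined": 40},  # row 0 of the demo table
    {"age": 46, "age_when_joined": 39},  # row 1 of the demo table
    {"age": 25, "age_when_joined": 31},  # made-up row that breaks the rule
]

def age_rule(row):
    """The rule behind the Inequality constraint on the two age columns."""
    return row["age"] >= row["age_when_joined"]

print(find_violations(rows, age_rule))  # [2]
```

If only a handful of rows are listed, they may point to data entry errors; if many are, the rule probably does not describe the data and the constraint should be removed.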

How does the SDV handle the constraints?

Under the hood, the SDV uses a combination of strategies to ensure that the synthetic data always follows the constraints. These strategies are:

  1. Transformation: Most of the time, it’s possible to transform the data in a way that guarantees the models will be able to learn the constraint. This is paired with a reverse transformation to ensure the synthetic data looks like the original.

  2. Reject Sampling: Another strategy is to model and sample synthetic data as usual, and then throw away any rows in the synthetic data that violate the constraints.

Transformation is the most efficient strategy, but it is not always possible to use. For example, multiple constraints might be attempting to transform the same column, or the logic itself may not be possible to achieve through transformation.

In such cases, the SDV will fall back to using reject sampling. You’ll get a warning when this happens. Reject sampling may slow down the sampling process but there will be no other effect on the synthetic data’s quality or validity.
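
The two strategies can be sketched in a few lines of plain Python. This is an illustration of the general ideas only and makes no claim about how the SDV implements them: the log-of-the-gap transformation, the function names and the random stand-in for a fitted model are all assumptions made for the example, which uses a strict rule (high > low).

```python
import math
import random

# --- Strategy 1: transformation (sketch for a strict rule, high > low) ---
def transform(low, high):
    """Replace `high` with the log of the gap, which can be any real number."""
    return low, math.log(high - low)

def reverse_transform(low, log_gap):
    """exp() is always positive, so the rebuilt `high` is always above `low`."""
    return low, low + math.exp(log_gap)

# even an arbitrary "modeled" value for the gap column yields a valid row
low, high = reverse_transform(30, -2.5)
print(high > low)  # True

# --- Strategy 2: reject sampling ---
def reject_sample(num_rows, generate_row, is_valid):
    """Keep drawing rows and discard the ones that violate the rule."""
    accepted = []
    while len(accepted) < num_rows:
        row = generate_row()
        if is_valid(row):
            accepted.append(row)
    return accepted

rng = random.Random(0)  # stand-in for a fitted model

def generate_row():
    return {"age_when_joined": rng.randint(18, 60), "age": rng.randint(18, 60)}

def is_valid(row):
    return row["age"] > row["age_when_joined"]

rows = reject_sample(5, generate_row, is_valid)
print(len(rows), all(is_valid(row) for row in rows))  # 5 True
```

In the first strategy every generated value maps back to a valid row, so nothing is wasted. In the second, each discarded row costs an extra draw, which is where the slow-down comes from.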