Danger

You are looking at the documentation for an older version of the SDV! We are no longer supporting or maintaining this version of the software.

Custom Constraints

If you have business logic that cannot be represented using Predefined Constraints, you can define custom logic. In this guide, we’ll walk through the process for defining a custom constraint and using it.

Defining your custom constraint

To define your custom constraint you need to write some functionality in a separate Python file. This includes:

  • Validity Check: A test that determines whether a row in the data meets the rule, and

  • (optional) Transformation Functions: Functions to modify the data before & after modeling

The SDV then uses the functionality you provided, as shown in the diagram below.

[Diagram: custom_constraint.png]

Each function (validity, transform and reverse transform) must accept the same inputs:

  • column_names: The names of the columns involved in the constraints

  • data: The full dataset, represented as a pandas.DataFrame

  • <other parameters>: Any other parameters that are necessary for your logic
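As a rough illustration of the shared signature, the skeleton below declares all three functions with the same inputs. The extra parameter `my_param` and the placeholder bodies are hypothetical and stand in for your own logic. The sketch assumes pandas is installed.

```python
import pandas as pd


def is_valid(column_names, data, my_param):
    # Placeholder logic: mark every row as valid
    return pd.Series(True, index=data.index)


def transform(column_names, data, my_param):
    # Placeholder logic: return the dataset unchanged
    return data


def reverse_transform(column_names, transformed_data, my_param):
    # Placeholder logic: nothing to undo, so return the dataset unchanged
    return transformed_data


# Hypothetical two-row dataset, used only to exercise the signatures
demo = pd.DataFrame({'salary': [121500.0, 45000.0]})
validity = is_valid(['salary'], demo, my_param=500)
print(validity.tolist())
```

All three functions take the column names first, the dataset second, and any custom parameters after that.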

Example

Let’s demonstrate this using our demo dataset.

In [1]: from sdv.demo import load_tabular_demo

In [2]: employees = load_tabular_demo()

In [3]: employees
Out[3]: 
     company     department  employee_id  age  age_when_joined  years_in_the_company     salary  annual_bonus  prior_years_experience  full_time  part_time  contractor
0       Pear          Sales            1   45               39                     6  121500.00      10000.00                       1        1.0        0.0         0.0
1       Pear         Design            5   31               26                     5   88289.72      13272.35                       2        0.0        0.0         1.0
2    Glasses             AI            1   34               29                     5  114500.00      11500.00                       5        1.0        0.0         0.0
3    Glasses  Search Engine            7   40               31                     9   85333.88      14879.06                       5        0.0        0.0         1.0
4   Cheerper        BigData            6   35               33                     2  112500.00      22500.00                       4        0.0        1.0         0.0
5   Cheerper        Support           11   32               26                     6   45000.00      15500.00                       3        0.0        1.0         0.0
6       Pear          Sales           28   42               40                     2   73500.00      10000.00                       2        1.0        0.0         0.0
7       Pear         Design           75   30               29                     1   80771.20      18952.46                       5        0.0        0.0         1.0
8    Glasses             AI           33   32               29                     3  101000.00       9500.00                       3        1.0        0.0         0.0
9    Glasses  Search Engine           56   38               32                     6   85909.15      18161.58                       5        0.0        0.0         1.0
10  Cheerper        BigData           42   46               38                     8   52500.00      11000.00                       5        0.0        1.0         0.0
11  Cheerper        Support           80   33               31                     2   91500.00      10500.00                       4        0.0        1.0         0.0

The dataset contains basic details about employees in some fictional companies. Many of the rules in the dataset can be described using predefined constraints. However, there is one complex rule that needs a custom constraint:

  • If the employee is not a contractor (contractor == 0), then the salary must be divisible by 500

  • Otherwise if the employee is a contractor (contractor == 1), then this rule does not apply

Note

This is similar to the predefined FixedIncrements constraint, with the addition of an exclusion criterion: the constraint check is skipped if the employee is a contractor.

Validity Check

The validity check should return a pandas.Series of True/False values that determine whether each row is valid.

Let’s code the logic up using parameters:

  • column_names will be a single-item list containing the column that must be divisible (e.g. salary)

  • data will be the full dataset

  • Custom parameter: increment describes the numerical increment (e.g. 500)

  • Custom parameter: exclusion_column describes the column with the exclusion criterion (e.g. contractor)

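def is_valid(column_names, data, increment, exclusion_column):
    column_name = column_names[0]

    # A row is valid if its value is a multiple of the increment ...
    is_divisible = (data[column_name] % increment == 0)
    # ... or if the row is excluded from the check (e.g. contractor == 1)
    is_excluded = (data[exclusion_column] > 0)

    return (is_divisible | is_excluded)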
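To see what the check returns, here is a small sketch that runs it on a hand-built DataFrame. The first two rows copy salary and contractor values from the demo dataset above. The third row is a hypothetical invalid example that does not appear in the data. The function is repeated so the snippet runs on its own, and pandas is assumed to be installed.

```python
import pandas as pd


def is_valid(column_names, data, increment, exclusion_column):
    column_name = column_names[0]
    is_divisible = (data[column_name] % increment == 0)
    is_excluded = (data[exclusion_column] > 0)
    return (is_divisible | is_excluded)


rows = pd.DataFrame({
    # 121500.00: non-contractor, multiple of 500
    # 88289.72:  contractor, so the rule does not apply
    # 45250.00:  hypothetical non-contractor salary that breaks the rule
    'salary': [121500.00, 88289.72, 45250.00],
    'contractor': [0.0, 1.0, 0.0],
})

result = is_valid(['salary'], rows, increment=500, exclusion_column='contractor')
print(result.tolist())  # [True, True, False]
```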

Transformations

The transformations must return the full dataset with particular columns transformed. We can modify, delete or add columns as long as we can reverse the transformation later.

In our case, the transformation can just divide each of the values in the column by the increment.

def transform(column_names, data, increment, exclusion_column):
    column_name = column_names[0]
    data[column_name] = data[column_name] / increment
    return data

Reversing the transformation is trickier. If we simply multiply every value by the increment, the salaries won’t necessarily be divisible by 500, because the values coming back from the model are not guaranteed to be whole numbers. Instead we should:

  • First, round values to whole numbers whenever the employee is not a contractor, and then

  • Multiply every value by the increment (500)

def reverse_transform(column_names, transformed_data, increment, exclusion_column):
    column_name = column_names[0]

    # Round only the rows the rule applies to (non-contractors).
    # .loc is used because .at is meant for single scalar cells.
    is_included = (transformed_data[exclusion_column] == 0)
    rounded_data = transformed_data.loc[is_included, column_name].round()
    transformed_data.loc[is_included, column_name] = rounded_data

    # Scale every row back up by the increment
    transformed_data[column_name] *= increment
    return transformed_data
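The round trip can be sketched on two rows taken from the demo dataset. The snippet repeats both functions so it runs on its own, using `.loc` for the masked assignment. The `+ 0.3` nudge is a hypothetical stand-in for the non-integer values a model might produce. It is not something the SDV does explicitly. pandas is assumed to be installed.

```python
import pandas as pd


def transform(column_names, data, increment, exclusion_column):
    column_name = column_names[0]
    data[column_name] = data[column_name] / increment
    return data


def reverse_transform(column_names, transformed_data, increment, exclusion_column):
    column_name = column_names[0]
    is_included = (transformed_data[exclusion_column] == 0)
    transformed_data.loc[is_included, column_name] = (
        transformed_data.loc[is_included, column_name].round()
    )
    transformed_data[column_name] *= increment
    return transformed_data


rows = pd.DataFrame({'salary': [121500.00, 88289.72], 'contractor': [0.0, 1.0]})

# Pass copies so the original rows stay untouched
transformed = transform(['salary'], rows.copy(), 500, 'contractor')

# Hypothetical modeling noise on the non-contractor row: 243.0 becomes 243.3
transformed.loc[0, 'salary'] += 0.3

restored = reverse_transform(['salary'], transformed.copy(), 500, 'contractor')
print(restored)
```

The non-contractor salary snaps back to a multiple of 500, while the contractor salary is only rescaled and keeps its fractional part.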

Creating your class

Finally, we can put all the functionality together to create a class that describes our constraint. Use the create_custom_constraint factory method to do this. It accepts your functions as inputs and returns a class that’s ready to use.

You can name this class whatever you’d like. Since our constraint is similar to FixedIncrements, let’s call it FixedIncrementsWithExclusion.

In [4]: from sdv.constraints import create_custom_constraint

In [5]: FixedIncrementsWithExclusion = create_custom_constraint(
   ...:     is_valid_fn=is_valid,
   ...:     transform_fn=transform, # optional
   ...:     reverse_transform_fn=reverse_transform # optional
   ...: )
   ...: 

Using your custom constraint

Now that you have a class, you can use it like any other predefined constraint. Create an object by putting in the parameters you defined. Note that you do not need to input the data.

You can apply the same constraint to other columns by creating a different object. In our case the annual_bonus column also follows the same logic.

In [6]: salary_divis_500 = FixedIncrementsWithExclusion(
   ...:    column_names=['salary'],
   ...:    increment=500,
   ...:    exclusion_column='contractor'
   ...: )
   ...: 

In [7]: bonus_divis_500 = FixedIncrementsWithExclusion(
   ...:    column_names=['annual_bonus'],
   ...:    increment=500,
   ...:    exclusion_column='contractor'
   ...: )
   ...: 

Finally, input these constraints into your model using the constraints parameter just like you would for predefined constraints.

In [8]: from sdv.tabular import GaussianCopula

In [9]: constraints = [
   ...:   salary_divis_500,
   ...:   bonus_divis_500
   ...: ]
   ...: 

In [10]: model = GaussianCopula(constraints=constraints, enforce_min_max_values=False)

In [11]: model.fit(employees)

Now, when you sample from the model, all rows of the synthetic data will follow the custom constraint.

In [12]: synthetic_data = model.sample(num_rows=10)

In [13]: synthetic_data
Out[13]: 
    company     department  employee_id  age  age_when_joined  years_in_the_company     salary  annual_bonus  prior_years_experience  full_time  part_time  contractor
0   Glasses        BigData           52   39               28                     7   77910.27      20541.24                       4        0.0        1.0         1.0
1   Glasses             AI           44   37               31                     5   83100.15      15155.37                       4        0.0        0.0         1.0
2   Glasses  Search Engine           35   34               30                     6   80254.31      15252.24                       5        0.0        0.0         1.0
3      Pear             AI           24   32               27                     2   77032.39      10892.66                       3        1.0        0.0         1.0
4   Glasses         Design           21   36               36                     4  111106.75      16617.82                       5        1.0        0.0         1.0
5   Glasses  Search Engine           40   33               28                     5   76595.23      11736.41                       4        0.0        0.0         1.0
6  Cheerper        Support           42   40               26                     8   55987.56      14224.90                       5        0.0        1.0         1.0
7  Cheerper        Support           79   37               27                     8   49544.59       9597.30                       5        0.0        1.0         1.0
8      Pear         Design            6   40               35                     8   93637.60      17795.99                       3        0.0        0.0         1.0
9   Glasses             AI           63   35               35                     3   62200.67      19326.75                       5        0.0        1.0         1.0
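One way to confirm this is to run the validity check over the sampled rows. The sketch below copies the first three rows of the output above instead of calling the model, and repeats the validity check so it runs on its own. pandas is assumed to be installed. Every row shown in this particular sample has contractor == 1.0, so the rows pass through the exclusion rather than through divisibility.

```python
import pandas as pd


def is_valid(column_names, data, increment, exclusion_column):
    column_name = column_names[0]
    is_divisible = (data[column_name] % increment == 0)
    is_excluded = (data[exclusion_column] > 0)
    return (is_divisible | is_excluded)


# First three sampled rows, restricted to the columns the check reads
sample_rows = pd.DataFrame({
    'salary': [77910.27, 83100.15, 80254.31],
    'annual_bonus': [20541.24, 15155.37, 15252.24],
    'contractor': [1.0, 1.0, 1.0],
})

salary_ok = is_valid(['salary'], sample_rows, 500, 'contractor')
bonus_ok = is_valid(['annual_bonus'], sample_rows, 500, 'contractor')

# True only if every row satisfies both constraints
all_rows_ok = bool((salary_ok & bonus_ok).all())
print(all_rows_ok)  # True
```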