Custom Constraints

If you have business logic that cannot be represented using Predefined Constraints, you can define custom logic. In this guide, we’ll walk through the process for defining a custom constraint and using it.

Defining your custom constraint

To define your custom constraint you need to write some functionality in a separate Python file. This includes:

  • Validity Check: A test that determines whether a row in the data meets the rule, and

  • (optional) Transformation Functions: Functions to modify the data before and after modeling

The SDV then uses the functionality you provided, as shown in the diagram below.

[Diagram: how the SDV uses the functions you provide (custom_constraint.png)]

Each function (validity, transform and reverse transform) must accept the same inputs:

  • column_names: The names of the columns involved in the constraints

  • data: The full dataset, represented as a pandas.DataFrame

  • <other parameters>: Any other parameters that are necessary for your logic

Example

Let’s demonstrate this using our demo dataset.

In [1]: from sdv.demo import load_tabular_demo

In [2]: employees = load_tabular_demo()

In [3]: employees
Out[3]: 
     company     department  employee_id  age  age_when_joined  years_in_the_company     salary  annual_bonus  prior_years_experience  full_time  part_time  contractor
0       Pear          Sales            1   33               28                     5   93500.00      17500.00                       5        1.0        0.0         0.0
1       Pear         Design            5   36               34                     2  115354.24      22312.91                       2        0.0        0.0         1.0
2    Glasses             AI            1   36               29                     7   89500.00      12500.00                       4        1.0        0.0         0.0
3    Glasses  Search Engine            7   37               35                     2  129958.17      23256.47                       1        0.0        0.0         1.0
4   Cheerper        BigData            6   43               34                     9  137000.00      18500.00                       4        0.0        1.0         0.0
5   Cheerper        Support           11   44               39                     5   76000.00      11500.00                       2        0.0        1.0         0.0
6       Pear          Sales           28   43               37                     6   83000.00      10000.00                       3        1.0        0.0         0.0
7       Pear         Design           75   33               25                     8  154187.38      23568.57                       2        0.0        0.0         1.0
8    Glasses             AI           33   48               46                     2   38000.00      10500.00                       1        1.0        0.0         0.0
9    Glasses  Search Engine           56   31               24                     7   87494.58      21256.79                       5        0.0        0.0         1.0
10  Cheerper        BigData           42   34               31                     3   38500.00       9500.00                       1        0.0        1.0         0.0
11  Cheerper        Support           80   34               28                     6  138500.00      19500.00                       3        0.0        1.0         0.0

The dataset contains basic details about employees in some fictional companies. Many of the rules in the dataset can be described using predefined constraints. However, there is one complex rule that needs a custom constraint:

  • If the employee is not a contractor (contractor == 0), then the salary must be divisible by 500

  • Otherwise if the employee is a contractor (contractor == 1), then this rule does not apply

Note

This is similar to the predefined FixedIncrements constraint with the addition of an exclusion criterion (skip the constraint check if the employee is a contractor).

Validity Check

The validity check should return a pandas.Series of True/False values that determine whether each row is valid.

Let’s code the logic up using parameters:

  • column_names will be a single-item list containing the column that must be divisible (e.g. salary)

  • data will be the full dataset

  • Custom parameter: increment describes the numerical increment (e.g. 500)

  • Custom parameter: exclusion_column describes the column with the exclusion criterion (e.g. contractor)

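def is_valid(column_names, data, increment, exclusion_column):
    column_name = column_names[0]

    # A row passes if its value is a multiple of the increment ...
    is_divisible = (data[column_name] % increment == 0)
    # ... or if the exclusion column flags it (e.g. contractor == 1)
    is_excluded = (data[exclusion_column] > 0)

    return (is_divisible | is_excluded)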
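
To see what the validity check returns, it can help to run it on a few hand-picked rows. The sketch below repeats the function so it runs on its own; the three-row `sample` frame is a hypothetical illustration and is not part of the demo dataset.

```python
import pandas as pd


def is_valid(column_names, data, increment, exclusion_column):
    column_name = column_names[0]
    is_divisible = (data[column_name] % increment == 0)
    is_excluded = (data[exclusion_column] > 0)
    return is_divisible | is_excluded


# Hypothetical rows chosen to cover each case of the rule
sample = pd.DataFrame({
    'salary': [93500.00, 115354.24, 93250.00],
    'contractor': [0.0, 1.0, 0.0],
})

valid = is_valid(['salary'], sample, increment=500, exclusion_column='contractor')
print(valid.tolist())  # [True, True, False]
```

The first row is a non-contractor with a salary divisible by 500, the second is a contractor (so the rule is skipped), and the third is a non-contractor whose salary leaves a remainder of 250.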

Transformations

The transformations must return the full dataset with particular columns transformed. We can modify, delete or add columns as long as we can reverse the transformation later.

In our case, the transformation can just divide each of the values in the column by the increment.

def transform(column_names, data, increment, exclusion_column):
    column_name = column_names[0]
    data[column_name] = data[column_name] / increment
    return data
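
As a quick illustration of the division step, here is a minimal sketch on a hypothetical two-row frame. Because the function assigns into the frame it receives, the sketch passes a copy; whether the SDV itself passes a copy is not stated here, so treat that as an assumption.

```python
import pandas as pd


def transform(column_names, data, increment, exclusion_column):
    column_name = column_names[0]
    data[column_name] = data[column_name] / increment
    return data


# Hypothetical rows: one non-contractor, one contractor
sample = pd.DataFrame({
    'salary': [93500.00, 115354.24],
    'contractor': [0.0, 1.0],
})

# Pass a copy so the original frame keeps its salaries
transformed = transform(['salary'], sample.copy(), increment=500, exclusion_column='contractor')

# The non-contractor salary becomes a whole number (93500 / 500 = 187),
# while the contractor salary becomes a fractional value
print(transformed['salary'].iloc[0])  # 187.0
```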

Reversing the transformation is trickier. If we multiply every value by the increment, the salaries won’t necessarily be divisible by 500. Instead we should:

  • First, round values to whole numbers whenever the employee is not a contractor, and then

  • Multiply every value by the increment (500)

def reverse_transform(column_names, transformed_data, increment, exclusion_column):
    column_name = column_names[0]

    # Round only the rows where the rule applies (non-contractors)
    is_included = (transformed_data[exclusion_column] == 0)
    rounded_data = transformed_data.loc[is_included, column_name].round()
    transformed_data.loc[is_included, column_name] = rounded_data

    # Scale every row back to the original units
    transformed_data[column_name] *= increment
    return transformed_data
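
The round-then-multiply logic can be checked on values that resemble what a model might produce in the transformed space. This is a sketch under assumptions: the `modeled` frame is hypothetical, and the function here uses `.loc` for the boolean-mask assignment, since `.loc` is the pandas indexer that accepts a mask.

```python
import pandas as pd


def reverse_transform(column_names, transformed_data, increment, exclusion_column):
    column_name = column_names[0]

    # Round only the rows where the rule applies (non-contractors)
    is_included = (transformed_data[exclusion_column] == 0)
    rounded = transformed_data.loc[is_included, column_name].round()
    transformed_data.loc[is_included, column_name] = rounded

    # Scale every row back to the original units
    transformed_data[column_name] *= increment
    return transformed_data


# Hypothetical model output in the transformed (divided) space
modeled = pd.DataFrame({
    'salary': [187.3, 230.70848],
    'contractor': [0.0, 1.0],
})

restored = reverse_transform(['salary'], modeled.copy(), increment=500, exclusion_column='contractor')

# The non-contractor value 187.3 is rounded to 187 before scaling
print(restored['salary'].iloc[0])  # 93500.0
```

The contractor row is left unrounded, so it comes back as roughly 115354.24 rather than a multiple of 500.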

Creating your class

Finally, we can put all the functionality together to create a class that describes our constraint. Use the create_custom_constraint factory method to do this. It accepts your functions as inputs and returns a class that’s ready to use.

You can name this class whatever you’d like. Since our constraint is similar to FixedIncrements, let’s call it FixedIncrementsWithExclusion.

In [4]: from sdv.constraints import create_custom_constraint

In [5]: FixedIncrementsWithExclusion = create_custom_constraint(
   ...:     is_valid_fn=is_valid,
   ...:     transform_fn=transform, # optional
   ...:     reverse_transform_fn=reverse_transform # optional
   ...: )
   ...: 

Using your custom constraint

Now that you have a class, you can use it like any other predefined constraint. Create an object by putting in the parameters you defined. Note that you do not need to input the data.

You can apply the same constraint to other columns by creating a different object. In our case the annual_bonus column also follows the same logic.

In [6]: salary_divis_500 = FixedIncrementsWithExclusion(
   ...:    column_names=['salary'],
   ...:    increment=500,
   ...:    exclusion_column='contractor'
   ...: )
   ...: 

In [7]: bonus_divis_500 = FixedIncrementsWithExclusion(
   ...:    column_names=['annual_bonus'],
   ...:    increment=500,
   ...:    exclusion_column='contractor'
   ...: )
   ...: 

Finally, input these constraints into your model using the constraints parameter just like you would for predefined constraints.

In [8]: from sdv.tabular import GaussianCopula

In [9]: constraints = [
   ...:   salary_divis_500,
   ...:   bonus_divis_500
   ...: ]
   ...: 

In [10]: model = GaussianCopula(constraints=constraints, enforce_min_max_values=False)

In [11]: model.fit(employees)

Now, when you sample from the model, all rows of the synthetic data will follow the custom constraint.

In [12]: synthetic_data = model.sample(num_rows=10)

In [13]: synthetic_data
Out[13]: 
   company     department  employee_id  age  age_when_joined  years_in_the_company     salary  annual_bonus  prior_years_experience  full_time  part_time  contractor
0     Pear             AI           68   42               35                     6  106013.87      18804.93                       3        1.0        0.0         1.0
1  Glasses  Search Engine           46   31               24                     8  122991.22      15928.16                       5        0.0        1.0         1.0
2  Glasses          Sales           11   45               41                     3   77638.80      18600.16                       1        1.0        0.0         1.0
3     Pear        BigData           48   31               24                     5  119001.38      21867.99                       5        0.0        0.0         1.0
4     Pear         Design           33   39               33                     3  110384.21      23132.88                       4        1.0        0.0         1.0
5  Glasses         Design            7   32               25                     8  150072.79      22553.98                       5        0.0        0.0         1.0
6  Glasses             AI           19   32               26                     6  140877.29      23144.70                       5        0.0        0.0         1.0
7     Pear         Design           32   35               28                     5  136226.27      22490.30                       4        0.0        0.0         1.0
8     Pear          Sales           21   32               25                     7  115194.57      15995.88                       5        1.0        0.0         1.0
9  Glasses  Search Engine           32   34               28                     5  138753.71      23010.28                       3        0.0        0.0         1.0
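
One way to sanity-check sampled data is to re-run the validity function on it and count the rows that fail. The sketch below is self-contained: it rebuilds the first two sampled rows shown above (relevant columns only) instead of calling the model, and the `violations` dictionary is a hypothetical helper, not part of the SDV.

```python
import pandas as pd


def is_valid(column_names, data, increment, exclusion_column):
    column_name = column_names[0]
    is_divisible = (data[column_name] % increment == 0)
    is_excluded = (data[exclusion_column] > 0)
    return is_divisible | is_excluded


# First two rows of the sampled output above, relevant columns only
synthetic_sample = pd.DataFrame({
    'salary': [106013.87, 122991.22],
    'annual_bonus': [18804.93, 15928.16],
    'contractor': [1.0, 1.0],
})

# Count the rows that break the rule for each constrained column
violations = {
    column: int((~is_valid([column], synthetic_sample, 500, 'contractor')).sum())
    for column in ['salary', 'annual_bonus']
}
print(violations)  # {'salary': 0, 'annual_bonus': 0}
```

Both rows are contractors, so they are valid even though their salaries and bonuses are not multiples of 500.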