Try the new SDV 1.0 Beta! We are transitioning to a new version of SDV with improved workflows, new features and an updated documentation site. Click here to go to the new docs pages.
If you have business logic that cannot be represented using Predefined Constraints, you can define custom logic. In this guide, we’ll walk through the process for defining a custom constraint and using it.
To define your custom constraint you need to write some functionality in a separate Python file. This includes:
Validity Check: A test that determines whether a row in the data meets the rule, and
(optional) Transformation Functions: Functions to modify the data before & after modeling
The SDV then uses the functionality you provided, as shown in the diagram below.
Each function (validity, transform and reverse transform) must accept the same inputs:
column_names: The names of the columns involved in the constraints
data: The full dataset, represented as a pandas.DataFrame
<other parameters>: Any other parameters that are necessary for your logic
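As a rough sketch of that shared signature, the three functions below all take column_names, the data, and one extra parameter. The rule itself (a value must not exceed threshold) and the names threshold, score and demo are hypothetical, chosen only to show the shape of the functions; they are not part of the SDV or of the demo dataset.

```python
import pandas as pd


def is_valid(column_names, data, threshold):
    """Return a boolean Series: True where the row satisfies the rule."""
    # Hypothetical rule: the first listed column must not exceed `threshold`.
    return data[column_names[0]] <= threshold


def transform(column_names, data, threshold):
    """Return the full dataset with the constrained column rescaled."""
    data = data.copy()
    data[column_names[0]] = data[column_names[0]] / threshold
    return data


def reverse_transform(column_names, transformed_data, threshold):
    """Undo transform() and return the full dataset in its original scale."""
    transformed_data = transformed_data.copy()
    transformed_data[column_names[0]] = transformed_data[column_names[0]] * threshold
    return transformed_data


# Invented data, used only to exercise the three functions.
demo = pd.DataFrame({"score": [10.0, 40.0, 80.0]})
valid_mask = is_valid(["score"], demo, threshold=50)
round_trip = reverse_transform(
    ["score"], transform(["score"], demo, threshold=50), threshold=50
)
print(valid_mask.tolist())
```

Note that every function receives and returns the whole DataFrame (or, for the validity check, one True/False value per row), even though only one column is involved in the rule.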
Let’s demonstrate this using our demo dataset.
In [1]: from sdv.demo import load_tabular_demo

In [2]: employees = load_tabular_demo()

In [3]: employees
Out[3]:
     company     department  employee_id  age  age_when_joined  years_in_the_company     salary  annual_bonus  prior_years_experience  full_time  part_time  contractor
 0      Pear          Sales            1   32               25                     7   48500.00      13000.00                       3        1.0        0.0         0.0
 1      Pear         Design            5   43               41                     2   82676.46      19821.56                       2        0.0        0.0         1.0
 2   Glasses             AI            1   30               25                     5   79500.00       9000.00                       2        1.0        0.0         0.0
 3   Glasses  Search Engine            7   46               38                     8   60534.36       7012.80                       3        0.0        0.0         1.0
 4  Cheerper        BigData            6   45               37                     8  109500.00       8500.00                       5        0.0        1.0         0.0
 5  Cheerper        Support           11   36               35                     1   77000.00       6500.00                       4        0.0        1.0         0.0
 6      Pear          Sales           28   31               24                     7   82500.00       9500.00                       1        1.0        0.0         0.0
 7      Pear         Design           75   36               31                     5   90556.79      23939.90                       5        0.0        0.0         1.0
 8   Glasses             AI           33   40               38                     2   78500.00      22000.00                       4        1.0        0.0         0.0
 9   Glasses  Search Engine           56   37               33                     4   42767.74      23949.45                       4        0.0        0.0         1.0
10  Cheerper        BigData           42   44               42                     2   66000.00       5500.00                       2        0.0        1.0         0.0
11  Cheerper        Support           80   44               39                     5  101000.00      23500.00                       1        0.0        1.0         0.0
The dataset contains basic details about employees in some fictional companies. Many of the rules in the dataset can be described using predefined constraints. However, there is one complex rule that needs a custom constraint:
If the employee is not a contractor (contractor == 0), then the salary must be divisible by 500
Otherwise if the employee is a contractor (contractor == 1), then this rule does not apply
Note
This is similar to the predefined FixedIncrements constraint, with the addition of an exclusion criterion: the constraint check is skipped if the employee is a contractor.
The validity check should return a pandas.Series of True/False values that determine whether each row is valid.
Let’s code the logic up using parameters:
column_names will be a single-item list containing the column that must be divisible (e.g. salary)
data will be the full dataset
Custom parameter: increment describes the numerical increment (e.g. 500)
Custom parameter: exclusion_column describes the column with the exclusion criterion (e.g. contractor)
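def is_valid(column_names, data, increment, exclusion_column):
    column_name = column_names[0]

    # True where the value is an exact multiple of the increment
    is_divisible = (data[column_name] % increment == 0)

    # True where the row is exempt from the rule (e.g. contractor == 1)
    is_excluded = (data[exclusion_column] > 0)

    # A row is valid if it is divisible or exempt
    return (is_divisible | is_excluded)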
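To see what the check returns, here is a small sketch that runs it on a hand-made DataFrame. The first two rows reuse salary and contractor values from the demo table; the third row (a non-contractor earning 48750) is invented to show a failing case.

```python
import pandas as pd


def is_valid(column_names, data, increment, exclusion_column):
    # Same logic as the validity check in this guide.
    column_name = column_names[0]
    is_divisible = data[column_name] % increment == 0
    is_excluded = data[exclusion_column] > 0
    return is_divisible | is_excluded


# Rows 0 and 1 come from the demo table; row 2 is invented for illustration.
sample = pd.DataFrame({
    "salary": [48500.00, 82676.46, 48750.00],
    "contractor": [0.0, 1.0, 0.0],
})

valid_mask = is_valid(["salary"], sample, increment=500, exclusion_column="contractor")
print(valid_mask.tolist())  # [True, True, False]
```

The second row passes only because the employee is a contractor; the third fails because 48750 is not a multiple of 500 and no exclusion applies.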
The transformations must return the full dataset with particular columns transformed. We can modify, delete or add columns as long as we can reverse the transformation later.
In our case, the transformation can just divide each of the values in the column by the increment.
def transform(column_names, data, increment, exclusion_column):
    column_name = column_names[0]

    # Express the column in units of the increment (e.g. 48500 becomes 97.0)
    data[column_name] = data[column_name] / increment
    return data
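A minimal sketch of what this division does to two rows from the demo table. The copy inside the function is an addition for this demo only, so that the input frame is left untouched; the variable names are illustrative.

```python
import pandas as pd


def transform(column_names, data, increment, exclusion_column):
    # Same logic as the guide's transform; the copy only protects this demo's input.
    data = data.copy()
    column_name = column_names[0]
    data[column_name] = data[column_name] / increment
    return data


# Two rows taken from the demo table shown earlier in this guide.
employees_sample = pd.DataFrame({
    "salary": [48500.00, 82676.46],
    "contractor": [0.0, 1.0],
})

transformed = transform(
    ["salary"], employees_sample, increment=500, exclusion_column="contractor"
)

# Whole number for the non-contractor, fractional value for the contractor.
staff_value = transformed.loc[0, "salary"]
contractor_value = transformed.loc[1, "salary"]
print(staff_value, staff_value.is_integer())
print(contractor_value.is_integer())
```

After the transformation, a valid non-contractor salary is a whole number of increments, which is the property the reverse transformation has to restore.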
Reversing the transformation is trickier. The model can produce fractional values, so if we simply multiply every value by the increment, the salaries won't necessarily be divisible by 500. Instead we should:
First, round the values to whole numbers for every employee who is not a contractor, and then
Multiply every value by the increment (500)
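def reverse_transform(column_names, transformed_data, increment, exclusion_column):
    column_name = column_names[0]

    # Only non-contractors must land on an exact multiple of the increment
    is_included = (transformed_data[exclusion_column] == 0)
    rounded_data = transformed_data[is_included][column_name].round()

    # Use .loc for assignment through a boolean mask (.at only accesses a single value)
    transformed_data.loc[is_included, column_name] = rounded_data

    # Convert back from units of the increment to the original scale
    transformed_data[column_name] *= increment
    return transformed_data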
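The effect of rounding before multiplying can be sketched on two invented rows that stand in for what a model might sample in units of 500. This version assigns through .loc and works on a copy; both are choices made for this demo, and the sampled values are not real model output.

```python
import pandas as pd


def reverse_transform(column_names, transformed_data, increment, exclusion_column):
    transformed_data = transformed_data.copy()  # keep the demo input untouched
    column_name = column_names[0]
    is_included = transformed_data[exclusion_column] == 0
    # Masked assignment goes through .loc; only non-contractor rows are rounded.
    transformed_data.loc[is_included, column_name] = (
        transformed_data.loc[is_included, column_name].round()
    )
    transformed_data[column_name] *= increment
    return transformed_data


# Invented values standing in for what a model might sample in "units of 500".
sampled = pd.DataFrame({
    "salary": [97.3, 165.35292],
    "contractor": [0.0, 1.0],
})

restored = reverse_transform(
    ["salary"], sampled, increment=500, exclusion_column="contractor"
)
print(restored["salary"].tolist())
```

Without the rounding step, the non-contractor's 97.3 would become 48650, which is not a multiple of 500. With it, the value is rounded to 97 and restored as 48500, while the contractor's salary keeps its fractional part.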
Finally, we can put all the functionality together to create a class that describes our constraint. Use the create_custom_constraint factory method to do this. It accepts your functions as inputs and returns a class that’s ready to use.
You can name this class whatever you’d like. Since our constraint is similar to FixedIncrements, let’s call it FixedIncrementsWithExclusion.
In [4]: from sdv.constraints import create_custom_constraint

In [5]: FixedIncrementsWithExclusion = create_custom_constraint(
   ...:     is_valid_fn=is_valid,
   ...:     transform_fn=transform,  # optional
   ...:     reverse_transform_fn=reverse_transform  # optional
   ...: )
   ...:
Now that you have a class, you can use it like any other predefined constraint. Create an object by putting in the parameters you defined. Note that you do not need to input the data.
You can apply the same constraint to other columns by creating a different object. In our case the annual_bonus column also follows the same logic.
In [6]: salary_divis_500 = FixedIncrementsWithExclusion(
   ...:     column_names=['salary'],
   ...:     increment=500,
   ...:     exclusion_column='contractor'
   ...: )
   ...:

In [7]: bonus_divis_500 = FixedIncrementsWithExclusion(
   ...:     column_names=['annual_bonus'],
   ...:     increment=500,
   ...:     exclusion_column='contractor'
   ...: )
   ...:
Finally, input these constraints into your model using the constraints parameter just like you would for predefined constraints.
In [8]: from sdv.tabular import GaussianCopula

In [9]: constraints = [
   ...:     salary_divis_500,
   ...:     bonus_divis_500
   ...: ]
   ...:

In [10]: model = GaussianCopula(constraints=constraints, enforce_min_max_values=False)

In [11]: model.fit(employees)
Now, when you sample from the model, all rows of the synthetic data will follow the custom constraint.
In [12]: synthetic_data = model.sample(num_rows=10)

In [13]: synthetic_data
Out[13]:
    company  department  employee_id  age  age_when_joined  years_in_the_company    salary  annual_bonus  prior_years_experience  full_time  part_time  contractor
0   Glasses     BigData           67   45               36                     3  87281.26      15898.94                       3        0.0        1.0         1.0
1   Glasses          AI           78   33               31                     2  77932.51      23300.65                       5        0.0        0.0         1.0
2  Cheerper     BigData           66   44               34                     5  88414.14      12958.92                       3        0.0        1.0         1.0
3   Glasses     BigData           22   40               27                     4  90313.15      18061.04                       4        0.0        0.0         1.0
4   Glasses          AI           24   35               26                     6  83627.65      18974.23                       4        0.0        0.0         1.0
5   Glasses     BigData           78   42               41                     4  73852.29      23344.88                       4        0.0        0.0         1.0
6      Pear      Design            8   36               34                     7  72200.20      13935.21                       4        0.0        0.0         1.0
7   Glasses          AI           30   42               35                     5  67735.93      12969.91                       3        1.0        0.0         1.0
8      Pear      Design           33   32               27                     4  66552.37      23711.83                       4        1.0        0.0         1.0
9  Cheerper     Support           80   45               35                     4  82976.29      23397.54                       2        0.0        1.0         1.0
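One way to confirm this is to count the rows that break the rule. The sketch below does so with plain pandas on the first three rows of the output above, restricted to the relevant columns; the names synthetic_sample and violations are illustrative.

```python
import pandas as pd

# First three rows of the sampled output above (only the relevant columns).
synthetic_sample = pd.DataFrame({
    "salary": [87281.26, 77932.51, 88414.14],
    "annual_bonus": [15898.94, 23300.65, 12958.92],
    "contractor": [1.0, 1.0, 1.0],
})

increment = 500
is_excluded = synthetic_sample["contractor"] > 0

# For each constrained column, count rows that are neither divisible nor excluded.
violations = {
    column: int((~((synthetic_sample[column] % increment == 0) | is_excluded)).sum())
    for column in ["salary", "annual_bonus"]
}
print(violations)  # {'salary': 0, 'annual_bonus': 0}
```

In these rows every employee is a contractor, so the rule is satisfied through the exclusion rather than through divisibility by 500.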