Danger
You are looking at the documentation for an older version of the SDV! We are no longer supporting or maintaining this version of the software.
Click here to go to the new docs pages.
If you have business logic that cannot be represented using Predefined Constraints, you can define custom logic. In this guide, we’ll walk through the process for defining a custom constraint and using it.
To define your custom constraint you need to write some functionality in a separate Python file. This includes:
Validity Check: A test that determines whether a row in the data meets the rule, and
(optional) Transformation Functions: Functions to modify the data before & after modeling
The SDV then uses the functionality you provided, as shown in the diagram below.
Each function (validity, transform and reverse transform) must accept the same inputs:
column_names: The names of the columns involved in the constraints
data: The full dataset, represented as a pandas.DataFrame
<other parameters>: Any other parameters that are necessary for your logic
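Putting those inputs together, every function you write follows the same general shape. Here is a minimal sketch with a hypothetical validity check (the min_value parameter and the toy data are illustrative, not part of the demo dataset):

```python
import pandas as pd

# Hypothetical example: a validity check that values in the first
# constrained column are at least ``min_value`` (a custom parameter).
def is_valid(column_names, data, min_value):
    column_name = column_names[0]
    return data[column_name] >= min_value

data = pd.DataFrame({'age': [45, 31, -2]})
print(is_valid(['age'], data, min_value=0).tolist())  # → [True, True, False]
```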
Let’s demonstrate this using our demo dataset.
In [1]: from sdv.demo import load_tabular_demo

In [2]: employees = load_tabular_demo()

In [3]: employees
Out[3]:
     company     department  employee_id  age  age_when_joined  years_in_the_company     salary  annual_bonus  prior_years_experience  full_time  part_time  contractor
0       Pear          Sales            1   45               39                     6  121500.00      10000.00                       1        1.0        0.0         0.0
1       Pear         Design            5   31               26                     5   88289.72      13272.35                       2        0.0        0.0         1.0
2    Glasses             AI            1   34               29                     5  114500.00      11500.00                       5        1.0        0.0         0.0
3    Glasses  Search Engine            7   40               31                     9   85333.88      14879.06                       5        0.0        0.0         1.0
4   Cheerper        BigData            6   35               33                     2  112500.00      22500.00                       4        0.0        1.0         0.0
5   Cheerper        Support           11   32               26                     6   45000.00      15500.00                       3        0.0        1.0         0.0
6       Pear          Sales           28   42               40                     2   73500.00      10000.00                       2        1.0        0.0         0.0
7       Pear         Design           75   30               29                     1   80771.20      18952.46                       5        0.0        0.0         1.0
8    Glasses             AI           33   32               29                     3  101000.00       9500.00                       3        1.0        0.0         0.0
9    Glasses  Search Engine           56   38               32                     6   85909.15      18161.58                       5        0.0        0.0         1.0
10  Cheerper        BigData           42   46               38                     8   52500.00      11000.00                       5        0.0        1.0         0.0
11  Cheerper        Support           80   33               31                     2   91500.00      10500.00                       4        0.0        1.0         0.0
The dataset contains basic details about employees in some fictional companies. Many of the rules in the dataset can be described using predefined constraints. However, there is one complex rule that needs a custom constraint:
If the employee is not a contractor (contractor == 0), then the salary must be divisible by 500
Otherwise if the employee is a contractor (contractor == 1), then this rule does not apply
Note
This is similar to the predefined FixedIncrements constraint, with the addition of an exclusion criterion (the constraint check is skipped if the employee is a contractor).
The validity check should return a pandas.Series of True/False values that determine whether each row is valid.
Let’s code up the logic using these parameters:
column_names will be a single-item list containing the column that must be divisible (e.g. salary)
data will be the full dataset
Custom parameter: increment describes the numerical increment (e.g. 500)
Custom parameter: exclusion_column describes the column with the exclusion criteria (e.g. contractor)
def is_valid(column_names, data, increment, exclusion_column):
    column_name = column_names[0]
    is_divisible = (data[column_name] % increment == 0)
    is_excluded = (data[exclusion_column] > 0)
    return (is_divisible | is_excluded)
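To sanity-check the logic, here it is applied to a few illustrative rows (not the full demo dataset):

```python
import pandas as pd

def is_valid(column_names, data, increment, exclusion_column):
    column_name = column_names[0]
    is_divisible = (data[column_name] % increment == 0)
    is_excluded = (data[exclusion_column] > 0)
    return (is_divisible | is_excluded)

# Three illustrative rows: divisible, excluded (contractor), and invalid
data = pd.DataFrame({
    'salary': [121500.00, 88289.72, 88289.72],
    'contractor': [0.0, 1.0, 0.0],
})
print(is_valid(['salary'], data, 500, 'contractor').tolist())  # → [True, True, False]
```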
The transformations must return the full dataset with particular columns transformed. We can modify, delete or add columns as long as we can reverse the transformation later.
In our case, the transformation can just divide each of the values in the column by the increment.
def transform(column_names, data, increment, exclusion_column):
    column_name = column_names[0]
    data[column_name] = data[column_name] / increment
    return data
Reversing the transformation is trickier. If we multiply every value by the increment, the salaries won’t necessarily be divisible by 500. Instead we should:
First, round values to whole numbers wherever the employee is not a contractor, and then
Multiply every value by the increment (500)
def reverse_transform(column_names, transformed_data, increment, exclusion_column):
    column_name = column_names[0]
    is_included = (transformed_data[exclusion_column] == 0)
    rounded_data = transformed_data.loc[is_included, column_name].round()
    transformed_data.loc[is_included, column_name] = rounded_data
    transformed_data[column_name] *= increment
    return transformed_data
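A quick round trip over a couple of illustrative rows shows the two functions restoring divisibility for non-contractors. This snippet repeats both functions so it runs standalone, using .loc for the boolean-mask assignment:

```python
import pandas as pd

def transform(column_names, data, increment, exclusion_column):
    column_name = column_names[0]
    data[column_name] = data[column_name] / increment
    return data

def reverse_transform(column_names, transformed_data, increment, exclusion_column):
    column_name = column_names[0]
    is_included = (transformed_data[exclusion_column] == 0)
    rounded = transformed_data.loc[is_included, column_name].round()
    transformed_data.loc[is_included, column_name] = rounded
    transformed_data[column_name] *= increment
    return transformed_data

# One non-contractor with an off-increment salary, one contractor
data = pd.DataFrame({
    'salary': [121510.0, 88289.72],
    'contractor': [0.0, 1.0],
})
out = reverse_transform(['salary'], transform(['salary'], data, 500, 'contractor'),
                        500, 'contractor')
print(out['salary'].iloc[0])  # → 121500.0 (snapped to the nearest increment)
```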
Finally, we can put all the functionality together to create a class that describes our constraint. Use the create_custom_constraint factory method to do this. It accepts your functions as inputs and returns a class that’s ready to use.
You can name this class whatever you’d like. Since our constraint is similar to FixedIncrements, let’s call it FixedIncrementsWithExclusion.
In [4]: from sdv.constraints import create_custom_constraint

In [5]: FixedIncrementsWithExclusion = create_custom_constraint(
   ...:     is_valid_fn=is_valid,
   ...:     transform_fn=transform,  # optional
   ...:     reverse_transform_fn=reverse_transform  # optional
   ...: )
Now that you have a class, you can use it like any other predefined constraint. Create an object by passing in the parameters you defined. Note that you do not need to input the data.
You can apply the same constraint to other columns by creating a different object. In our case the annual_bonus column also follows the same logic.
In [6]: salary_divis_500 = FixedIncrementsWithExclusion(
   ...:     column_names=['salary'],
   ...:     increment=500,
   ...:     exclusion_column='contractor'
   ...: )

In [7]: bonus_divis_500 = FixedIncrementsWithExclusion(
   ...:     column_names=['annual_bonus'],
   ...:     increment=500,
   ...:     exclusion_column='contractor'
   ...: )
Finally, input these constraints into your model using the constraints parameter just like you would for predefined constraints.
In [8]: from sdv.tabular import GaussianCopula

In [9]: constraints = [
   ...:     salary_divis_500,
   ...:     bonus_divis_500
   ...: ]

In [10]: model = GaussianCopula(constraints=constraints, enforce_min_max_values=False)

In [11]: model.fit(employees)
Now, when you sample from the model, all rows of the synthetic data will follow the custom constraint.
In [12]: synthetic_data = model.sample(num_rows=10)

In [13]: synthetic_data
Out[13]:
    company     department  employee_id  age  age_when_joined  years_in_the_company     salary  annual_bonus  prior_years_experience  full_time  part_time  contractor
0   Glasses        BigData           52   39               28                     7   77910.27      20541.24                       4        0.0        1.0         1.0
1   Glasses             AI           44   37               31                     5   83100.15      15155.37                       4        0.0        0.0         1.0
2   Glasses  Search Engine           35   34               30                     6   80254.31      15252.24                       5        0.0        0.0         1.0
3      Pear             AI           24   32               27                     2   77032.39      10892.66                       3        1.0        0.0         1.0
4   Glasses         Design           21   36               36                     4  111106.75      16617.82                       5        1.0        0.0         1.0
5   Glasses  Search Engine           40   33               28                     5   76595.23      11736.41                       4        0.0        0.0         1.0
6  Cheerper        Support           42   40               26                     8   55987.56      14224.90                       5        0.0        1.0         1.0
7  Cheerper        Support           79   37               27                     8   49544.59       9597.30                       5        0.0        1.0         1.0
8      Pear         Design            6   40               35                     8   93637.60      17795.99                       3        0.0        0.0         1.0
9   Glasses             AI           63   35               35                     3   62200.67      19326.75                       5        0.0        1.0         1.0
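As a final sanity check, you can re-apply the validity function to the sampled rows. Here it runs over an illustrative stand-in DataFrame, since real sampled output varies from run to run:

```python
import pandas as pd

def is_valid(column_names, data, increment, exclusion_column):
    column_name = column_names[0]
    is_divisible = (data[column_name] % increment == 0)
    is_excluded = (data[exclusion_column] > 0)
    return (is_divisible | is_excluded)

# Illustrative stand-in for model.sample() output
synthetic_data = pd.DataFrame({
    'salary': [121500.00, 88289.72, 45000.00],
    'contractor': [0.0, 1.0, 0.0],
})
print(is_valid(['salary'], synthetic_data, 500, 'contractor').all())  # → True
```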