Handling Constraints

A very common scenario that we face when working with tabular data is finding columns that have very particular relationships between them which are very hard to model and easily confuse the Tabular Models.

Some simple examples of these scenarios include:

  • A table that has the columns country and city: In such scenario, it might be very hard to learn which country each city belongs to, and when sampling probabilistically, the model is likely to end up generating invalid country/city combinations.

  • A table that contains both the age and the date of birth of a user. The model will learn the age and date of birth distributions and mostly generate valid combinations, but in some cases it might end up giving back ages that do not correspond to the given date of birth.

These kind of special relationships between columns are called Constraints, and SDV provides a very powerful and flexible mechanism to take them into account and guarantee that the sampled data always respects them.

Let us explore a few Constraint examples and learn how to handle them:

Load a Tabular Demo

We will start by loading a small table that contains data with some constraints:

In [1]: from sdv.demo import load_tabular_demo

In [2]: employees = load_tabular_demo()

In [3]: employees
Out[3]: 
     company     department                 name                                            address  age  age_when_joined  years_in_the_company     salary  prior_years_experience  full_time  part_time  contractor
0       Pear          Sales           Angel Bray             2645 Ritter Hill\nDanielview, NV 30074   40               39                     1  113251.24                       5        1.0        0.0         0.0
1       Pear         Design       Stacy Sandoval                          USNV Weaver\nFPO AA 44239   41               38                     3  120010.01                       3        0.0        0.0         1.0
2    Glasses             AI         Miranda Cook   016 Torres Plains Suite 776\nKaylaberg, TN 65995   30               26                     4  127590.71                       4        1.0        0.0         0.0
3    Glasses  Search Engine     Justin Frederick  7308 Williams Village Suite 402\nPort Jennifer...   47               43                     4   67378.35                       2        1.0        0.0         0.0
4   Cheerper        BigData         Summer Ellis  22299 Rosario Burg Suite 228\nAndrewsville, DC...   40               36                     4   80377.27                       4        0.0        1.0         0.0
5   Cheerper        Support      Ashley Mitchell  306 Wilkerson Pines Apt. 481\nNorth Garrettche...   36               27                     9   41108.45                       4        0.0        1.0         0.0
6       Pear          Sales          Lauren King  39497 Aguirre Fields Suite 392\nPort Sheila, P...   44               35                     9   82441.32                       3        1.0        0.0         0.0
7       Pear         Design       Tiffany Nelson    7678 Morales Mount\nWest Patriciafort, AK 09161   34               25                     9   51947.44                       4        0.0        0.0         1.0
8    Glasses             AI           Mary Olson            757 Aaron Dam\nEast Elizabeth, PA 90453   49               45                     4   69391.95                       3        1.0        0.0         0.0
9    Glasses  Search Engine         Tyler Murphy             034 Gill Keys\nCassandrafurt, TN 45774   47               40                     7   78785.02                       1        1.0        0.0         0.0
10  Cheerper        BigData          Kelly Smith                         USCGC Taylor\nFPO AE 04780   47               43                     4   48509.85                       2        0.0        1.0         0.0
11  Cheerper        Support  Dr. Claudia Collier  959 Munoz Drives Apt. 174\nCristinahaven, NH 6...   33               24                     9  144091.34                       2        0.0        1.0         0.0

This step loaded a simple table that gives us some basic details about simulated employees from several companies.

If we observe the data closely we will find a few constraints:

  1. Each company has employees from two or more departments, but department names are different across companies. This implies that a company should only be paired with its own departments and never with the departments of other companies.

  2. We have an age column that represents the age of the employee at the date when the data was created and an age_when_joined that represents the age of the employee when they joined the company. Since all of them joined the company before the data was created, the age_when_joined will always be equal or lower than the age column.

  3. We have a years_in_the_company column that indicates how many years passed since they joined the company, which means that the years_in_the_company will always be equal to the age minus the age_when_joined.

  4. We have a salary column that should always be rounded to 2 decimal points.

  5. The age column is bounded, since realistically an employee can only be so old (or so young).

  6. The full_time, part_time and contractor columns are related in such a way that one of them will always be one and the others zero, since the employee must be part of one of the three categories.

How does SDV Handle Constraints?

SDV handles constraints using two different strategies:

Transform Strategy

When using this strategy, SDV applies a transformation to the data before learning it in a way that allows the model to better capture the data properties. For example, if we have one column that needs to be always greater than the other one, SDV can do the following:

  1. Replace the higher column with the difference between the two columns, which will always be positive.

  2. Model the transformed data and sample new values.

  3. Recompute the value of the high column by adding the values of the lower column to it.

The Transform strategy is very efficient and does not affect the speed of the modeling and sampling process, but in some cases might affect the quality of the learning process or simply not be possible.

Reject Sampling Strategy

In the cases where applying a Transform strategy is not possible or may affect the quality of the learning process, SDV can apply a Reject Sampling strategy.

When using this strategy, SDV validates the sampled rows, discards the ones that do not adjust to the constraint, and re-samples them. This process is repeated until enough rows have been sampled.

Defining Constraints

Let us go back to the demo data that we loaded before and define Constraints that indicate SDV how to work with this data.

UniqueCombinations Constraint

The first constraint that we will explore is the UniqueCombinations constraint.

This Constraint class can handle the situation number 1 indicated above, in which the values of a set of columns can only be combined exactly as seen in the original data, and new combinations are not accepted. In order to use this constraint we will need to import it from the sdv.constraints module and create an instance of it indicating:

  • the names of the affected columns

  • which strategy we want to use: transform or reject_sampling

In [4]: from sdv.constraints import UniqueCombinations

In [5]: unique_company_department_constraint = UniqueCombinations(
   ...:     columns=['company', 'department'],
   ...:     handling_strategy='transform'
   ...: )
   ...: 

GreaterThan Constraint

The second constraint that we need for our data is the GreaterThan constraint. This constraint guarantees that one column is always greater than the other one. In order to use it, we need to create an instance passing:

  • the name of the low column

  • the name of the high column

  • the handling strategy that we want to use

In [6]: from sdv.constraints import GreaterThan

In [7]: age_gt_age_when_joined_constraint = GreaterThan(
   ...:     low='age_when_joined',
   ...:     high='age',
   ...:     handling_strategy='reject_sampling'
   ...: )
   ...: 

The GreaterThan constraint can also be used to guarantee a column is greater than a scalar value or specific datetime value instead of another column. To use this functionality, we can pass:

  • the scalar value for low

  • the scalar value for high

  • a boolean indicating low or high is a scalar

In [8]: salary_gt_30000_constraint = GreaterThan(
   ...:     low=30000,
   ...:     high='salary',
   ...:     handling_strategy='reject_sampling'
   ...: )
   ...: 

Positive and Negative Constraints

Similar to the GreaterThan constraint, we can use the Positive or Negative constraints. These constraints enforce that a specified column is always positive or negative. We can create an instance passing:

  • the name of the low column for Negative or the name of the high column for Positive

  • the handling strategy that we want to use

  • a boolean specifying whether to make the data strictly above or below 0, or include 0 as a possible value

In [9]: from sdv.constraints import Positive

In [10]: positive_prior_exp_constraint = Positive(
   ....:     high='prior_years_experience',
   ....:     strict=False,
   ....:     handling_strategy='reject_sampling'
   ....: )
   ....: 

ColumnFormula Constraint

In some cases, one column will need to be computed based on the other columns using a custom formula. This is, for example, what happens with the years_in_the_company column in our demo data, which will always need to be computed based on the age and age_when_joined columns by subtracting them. In these cases, we need to define a custom function that defines how to compute the value of the column:

In [11]: def years_in_the_company(data):
   ....:     return data['age'] - data['age_when_joined']
   ....: 

Once we have defined this function, we can use the ColumnFormula constraint by passing it:

  • the name of the column that we want to generate

  • the function that generates the column values

  • the handling strategy that we want to use

In [12]: from sdv.constraints import ColumnFormula

In [13]: years_in_the_company_constraint = ColumnFormula(
   ....:     column='years_in_the_company',
   ....:     formula=years_in_the_company,
   ....:     handling_strategy='transform'
   ....: )
   ....: 

Rounding Constraint

In order for data to be realistic, we also might want to round data to a certain number of digits. To do this, we can use the Rounding Constraint. We will pass this constraint:

  • the name of the column(s) that should be rounded.

  • the number of digits each column should be rounded to.

  • the handling strategy that we want to use

  • (optional) if reject sampling, we can customize the threshold of the sampled values.

In [14]: from sdv.constraints import Rounding

In [15]: salary_rounding_constraint = Rounding(
   ....:     columns='salary',
   ....:     digits=2,
   ....:     handling_strategy='transform'
   ....: )
   ....: 

Between Constraint

Another possibility is the Between constraint. It guarantees that one column is always in between two other columns/values. For example, the age column in our demo data is realistically bounded to the ages of 15 and 90 since acual employees won’t be too young or too old.

In order to use it, we need to create an instance passing:

  • the name of the low column or a scalar value to be used as the lower bound

  • the name of the high column or a scalar value to be used as the upper bound

  • the handling strategy that we want to use

In [16]: from sdv.constraints import Between

In [17]: reasonable_age_constraint = Between(
   ....:     column='age',
   ....:     low=15,
   ....:     high=90,
   ....:     handling_strategy='transform'
   ....: )
   ....: 

OneHotEncoding Constraint

Another constraint available is the OneHotEncoding constraint. This constraint allows the user to specify a list of columns where each row is a one hot vector. Then, the constraint will make sure that the output of the model is transformed so that the column with the largest value is set to 1 while all other columns are set to 0. To apply the constraint we need to create an instance passing:

  • A list of the names of the columns of interest

  • The strategy we want to use (transform is recommended)

In [18]: from sdv.constraints import OneHotEncoding

In [19]: one_hot_constraint = OneHotEncoding(
   ....:     columns=['full_time', 'part_time', 'contractor']
   ....: )
   ....: 

Using the Constraints

Now that we have defined the constraints needed to properly describe our dataset, we can pass them to the Tabular Model of our choice. For example, let us create a GaussianCopula model passing it the constraints that we just defined as a list:

In [20]: from sdv.tabular import GaussianCopula

In [21]: constraints = [
   ....:     unique_company_department_constraint,
   ....:     age_gt_age_when_joined_constraint,
   ....:     years_in_the_company_constraint,
   ....:     salary_gt_30000_constraint,
   ....:     positive_prior_exp_constraint,
   ....:     salary_rounding_constraint,
   ....:     reasonable_age_constraint,
   ....:     one_hot_constraint
   ....: ]
   ....: 

In [22]: gc = GaussianCopula(constraints=constraints)

After creating the model, we can just fit and sample as usual:

In [23]: gc.fit(employees)

In [24]: sampled = gc.sample(10)

And observe that the sampled rows really adjust to the constraints that we defined:

In [25]: sampled
Out[25]: 
   company     department                 name                                            address  age  age_when_joined  years_in_the_company     salary  prior_years_experience  full_time  part_time  contractor
0     Pear         Design       Stacy Sandoval                          USNV Weaver\nFPO AA 44239   40               36                     4   82735.14                       4        1.0        0.0         0.0
1     Pear         Design          Kelly Smith                         USCGC Taylor\nFPO AE 04780   33               33                     0  117191.01                       4        1.0        0.0         0.0
2  Glasses             AI         Tyler Murphy             034 Gill Keys\nCassandrafurt, TN 45774   36               30                     6   65096.33                       3        1.0        0.0         0.0
3     Pear          Sales  Dr. Claudia Collier  959 Munoz Drives Apt. 174\nCristinahaven, NH 6...   33               25                     8  141853.68                       3        0.0        1.0         0.0
4     Pear          Sales          Kelly Smith                         USCGC Taylor\nFPO AE 04780   47               41                     6  112648.30                       5        1.0        0.0         0.0
6  Glasses  Search Engine          Lauren King  39497 Aguirre Fields Suite 392\nPort Sheila, P...   48               44                     4   45294.96                       1        1.0        0.0         0.0
7  Glasses  Search Engine         Tyler Murphy             034 Gill Keys\nCassandrafurt, TN 45774   32               30                     2   50939.44                       2        1.0        0.0         0.0
8  Glasses             AI           Mary Olson            757 Aaron Dam\nEast Elizabeth, PA 90453   47               41                     6   53311.42                       4        1.0        0.0         0.0
0  Glasses             AI           Mary Olson            757 Aaron Dam\nEast Elizabeth, PA 90453   47               37                    10  114132.29                       5        1.0        0.0         0.0
0  Glasses             AI         Tyler Murphy             034 Gill Keys\nCassandrafurt, TN 45774   34               30                     4  136161.59                       1        1.0        0.0         0.0