Handling Constraints

A common scenario when working with tabular data is finding columns whose relationships to each other are hard to model and easily confuse the Tabular Models.

Some simple examples of these scenarios include:

  • A table that has the columns country and city: In such a scenario, it might be very hard for the model to learn which country each city belongs to, and when sampling probabilistically, it is likely to end up generating invalid country/city combinations.

  • A table that contains both the age and the date of birth of a user. The model will learn the age and date of birth distributions and mostly generate valid combinations, but in some cases it might end up giving back ages that do not correspond to the given date of birth.

These kinds of special relationships between columns are called Constraints, and SDV provides a powerful and flexible mechanism to take them into account and guarantee that the sampled data always respects them.

Let us explore a few Constraint examples and learn how to handle them:

Load a Tabular Demo

We will start by loading a small table that contains data with some constraints:

In [1]: from sdv.demo import load_tabular_demo

In [2]: employees = load_tabular_demo()

In [3]: employees
Out[3]: 
     company     department              name                                            address  age  age_when_joined  years_in_the_company
0       Pear          Sales     Pamela Rhodes    911 Webster Curve Suite 667\nOscarton, WA 88746   44               39                     5
1       Pear         Design      Emily Duarte         7166 Stevens Vista\nPhillipsport, IN 27743   30               26                     4
2    Glasses             AI       Chris Smith  6660 Philip Locks Apt. 286\nNorth Thomas, ME 2...   39               32                     7
3    Glasses  Search Engine     Lindsay Wells  93641 Nancy Views Apt. 393\nSouth Steven, SD 5...   30               24                     6
4   Cheerper        BigData  Steven Hernandez  9567 Rivera Road Suite 732\nSouth James, KY 45557   40               33                     7
5   Cheerper        Support         Amy Parks  14564 Bonnie Canyon Apt. 323\nSouth Debra, WA ...   47               46                     1
6       Pear          Sales    Mary Maldonado           1343 Brewer Plaza\nNew Jessica, DC 53742   35               26                     9
7       Pear         Design   Gregory Wells V  9596 Robert Orchard Apt. 472\nMartinchester, M...   44               41                     3
8    Glasses             AI        Juan Hayes  6061 Collins Knolls Suite 981\nDanastad, DC 74055   47               42                     5
9    Glasses  Search Engine   Catherine Watts       903 Brooks Crossroad\nPort Melissa, CA 14036   49               49                     0
10  Cheerper        BigData    Carrie Salazar        5512 Steele Orchard\nNorth Carrie, CO 57018   41               36                     5
11  Cheerper        Support     Mary Robinson  662 Williams Trace Suite 273\nWest Christopher...   47               38                     9

This step loaded a simple table that gives us some basic details about simulated employees from several companies.

If we observe the data closely we will find a few constraints:

  1. Each company has employees from two or more departments, but department names are different across companies. This implies that a company should only be paired with its own departments and never with the departments of other companies.

  2. We have an age column that represents the age of the employee on the date when the data was created and an age_when_joined column that represents the age of the employee when they joined the company. Since all of them joined the company before the data was created, age_when_joined will always be equal to or lower than age.

  3. We have a years_in_the_company column that indicates how many years passed since they joined the company, which means that the years_in_the_company will always be equal to the age minus the age_when_joined.
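As a quick sanity check, the three constraints above can be verified with plain pandas. This is only an illustrative sketch: it rebuilds the first six rows of the demo table by hand (omitting name and address) instead of calling load_tabular_demo, and the variable names are chosen for this example.

```python
import pandas as pd

# First six rows of the demo table shown above (name and address omitted).
employees = pd.DataFrame({
    'company': ['Pear', 'Pear', 'Glasses', 'Glasses', 'Cheerper', 'Cheerper'],
    'department': ['Sales', 'Design', 'AI', 'Search Engine', 'BigData', 'Support'],
    'age': [44, 30, 39, 30, 40, 47],
    'age_when_joined': [39, 26, 32, 24, 33, 46],
    'years_in_the_company': [5, 4, 7, 6, 7, 1],
})

# 1. Every department appears in exactly one company.
companies_per_department = employees.groupby('department')['company'].nunique()
departments_are_exclusive = bool((companies_per_department == 1).all())

# 2. age_when_joined is never higher than age.
joined_before_now = bool((employees['age_when_joined'] <= employees['age']).all())

# 3. years_in_the_company equals age minus age_when_joined.
expected_years = employees['age'] - employees['age_when_joined']
years_are_consistent = bool((employees['years_in_the_company'] == expected_years).all())

print(departments_are_exclusive, joined_before_now, years_are_consistent)
```

All three checks hold for these rows; a model that ignores the constraints gives no such guarantee for the rows it samples.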

How does SDV Handle Constraints?

SDV handles constraints using two different strategies:

Transform Strategy

When using this strategy, SDV applies a transformation to the data before learning it in a way that allows the model to better capture the data properties. For example, if we have one column that must always be greater than another one, SDV can do the following:

  1. Replace the high column with the difference between the two columns, which is never negative.

  2. Model the transformed data and sample new values.

  3. Recompute the high column by adding the values of the low column back to the sampled difference.
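The idea behind these steps can be sketched with two small pandas helpers. The functions transform and reverse_transform below are hypothetical names written for this illustration; they are not SDV's implementation, which may transform the difference further before modeling it.

```python
import pandas as pd


def transform(data, low, high):
    """Step 1: replace the high column with its difference from the low column."""
    transformed = data.copy()
    transformed[high] = data[high] - data[low]
    return transformed


def reverse_transform(data, low, high):
    """Step 3: recover the high column by adding the low column back."""
    restored = data.copy()
    restored[high] = data[high] + data[low]
    return restored


table = pd.DataFrame({'age_when_joined': [39, 26, 49], 'age': [44, 30, 49]})

# The 'age' column now holds the non-negative differences 5, 4 and 0.
transformed = transform(table, low='age_when_joined', high='age')

# In a real workflow, step 2 would fit a model on `transformed` and sample
# new rows from it; here we simply reverse the original rows.
restored = reverse_transform(transformed, low='age_when_joined', high='age')
```

As long as the values in the difference column stay non-negative, the reversed high column can never fall below the low column, which is why no rows need to be discarded.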

The Transform strategy is very efficient and does not affect the speed of the modeling and sampling process, but in some cases might affect the quality of the learning process or simply not be possible.

Reject Sampling Strategy

In the cases where applying a Transform strategy is not possible or may affect the quality of the learning process, SDV can apply a Reject Sampling strategy.

When using this strategy, SDV validates the sampled rows, discards the ones that do not satisfy the constraint, and re-samples them. This process is repeated until enough valid rows have been sampled.
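A minimal sketch of such a loop is shown below. It is not SDV's code: sample_unconstrained is a hypothetical stand-in for a fitted model that draws both ages independently, and is_valid, reject_sample and the max_rounds safety limit are assumptions made for this example.

```python
import numpy as np
import pandas as pd


def sample_unconstrained(num_rows, rng):
    """Hypothetical stand-in for a fitted model: draws both ages independently."""
    return pd.DataFrame({
        'age_when_joined': rng.integers(20, 50, size=num_rows),
        'age': rng.integers(20, 50, size=num_rows),
    })


def is_valid(data):
    """Row-wise check of the constraint age_when_joined <= age."""
    return data['age_when_joined'] <= data['age']


def reject_sample(num_rows, seed=0, max_rounds=100):
    """Sample, discard invalid rows and re-sample until num_rows are valid."""
    rng = np.random.default_rng(seed)
    accepted = []
    num_accepted = 0
    for _ in range(max_rounds):
        # Only request as many rows as are still missing.
        candidates = sample_unconstrained(num_rows - num_accepted, rng)
        valid = candidates[is_valid(candidates)]
        accepted.append(valid)
        num_accepted += len(valid)
        if num_accepted >= num_rows:
            break
    else:
        raise RuntimeError('Could not sample enough valid rows')
    return pd.concat(accepted, ignore_index=True)


sampled_ages = reject_sample(10)
```

The cost of this approach depends on how often the model produces invalid rows: the more rows are rejected, the more sampling rounds are needed.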

Defining Constraints

Let us go back to the demo data that we loaded before and define Constraints that tell SDV how to work with this data.

UniqueCombinations Constraint

The first constraint that we will explore is the UniqueCombinations constraint.

This Constraint class handles situation 1 described above, in which the values of a set of columns can only be combined exactly as seen in the original data, and new combinations are not accepted. To use this constraint, we need to import it from the sdv.constraints module and create an instance of it indicating:

  • the names of the affected columns

  • which strategy we want to use: transform or reject_sampling

In [4]: from sdv.constraints import UniqueCombinations

In [5]: unique_company_department_constraint = UniqueCombinations(
   ...:     columns=['company', 'department'],
   ...:     handling_strategy='transform'
   ...: )
   ...: 

GreaterThan Constraint

The second constraint that we need for our data is the GreaterThan constraint. This constraint guarantees that one column is always greater than the other one. In order to use it, we need to create an instance passing:

  • the name of the low column

  • the name of the high column

  • the handling strategy that we want to use

In [6]: from sdv.constraints import GreaterThan

In [7]: age_gt_age_when_joined_constraint = GreaterThan(
   ...:     low='age_when_joined',
   ...:     high='age',
   ...:     handling_strategy='reject_sampling'
   ...: )
   ...: 

ColumnFormula Constraint

In some cases, one column needs to be computed from other columns using a custom formula. This is, for example, what happens with the years_in_the_company column in our demo data, which always needs to be computed by subtracting age_when_joined from age. In these cases, we need to define a custom function that computes the value of the column:

In [8]: def years_in_the_company(data):
   ...:     return data['age'] - data['age_when_joined']
   ...: 

Once we have defined this function, we can use the ColumnFormula constraint by passing it:

  • the name of the column that we want to generate

  • the function that generates the column values

  • the handling strategy that we want to use

In [9]: from sdv.constraints import ColumnFormula

In [10]: years_in_the_company_constraint = ColumnFormula(
   ....:     column='years_in_the_company',
   ....:     formula=years_in_the_company,
   ....:     handling_strategy='transform'
   ....: )
   ....: 

Using the Constraints

Now that we have defined the constraints needed to properly describe our dataset, we can pass them to the Tabular Model of our choice. For example, let us create a GaussianCopula model passing it the constraints that we just defined as a list:

In [11]: from sdv.tabular import GaussianCopula

In [12]: constraints = [
   ....:     unique_company_department_constraint,
   ....:     age_gt_age_when_joined_constraint,
   ....:     years_in_the_company_constraint
   ....: ]
   ....: 

In [13]: gc = GaussianCopula(constraints=constraints)

After creating the model, we can just fit and sample as usual:

In [14]: gc.fit(employees)

In [15]: sampled = gc.sample(10)

And observe that the sampled rows satisfy the constraints that we defined:

In [16]: sampled
Out[16]: 
    company     department              name                                            address  age  age_when_joined  years_in_the_company
0      Pear          Sales     Pamela Rhodes    911 Webster Curve Suite 667\nOscarton, WA 88746   47               40                     7
1   Glasses             AI     Mary Robinson  662 Williams Trace Suite 273\nWest Christopher...   39               31                     8
2      Pear          Sales    Carrie Salazar        5512 Steele Orchard\nNorth Carrie, CO 57018   40               40                     0
3   Glasses  Search Engine     Pamela Rhodes    911 Webster Curve Suite 667\nOscarton, WA 88746   43               41                     2
4      Pear         Design   Catherine Watts       903 Brooks Crossroad\nPort Melissa, CA 14036   32               28                     4
5      Pear          Sales    Mary Maldonado           1343 Brewer Plaza\nNew Jessica, DC 53742   37               30                     7
6  Cheerper        BigData    Carrie Salazar        5512 Steele Orchard\nNorth Carrie, CO 57018   40               28                    12
7   Glasses             AI  Steven Hernandez  9567 Rivera Road Suite 732\nSouth James, KY 45557   40               33                     7
8   Glasses             AI        Juan Hayes  6061 Collins Knolls Suite 981\nDanastad, DC 74055   50               43                     7
9   Glasses  Search Engine    Carrie Salazar        5512 Steele Orchard\nNorth Carrie, CO 57018   41               39                     2
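Rather than inspecting the output by eye, the sampled rows can also be checked programmatically against the original data. The sketch below uses plain pandas and a hypothetical helper, find_violations, written for this example; it is not part of SDV. The two tables are rebuilt by hand from a few of the rows shown above so that the snippet runs on its own.

```python
import pandas as pd

# Company/department pairs from the original demo table.
real = pd.DataFrame({
    'company': ['Pear', 'Pear', 'Glasses', 'Glasses', 'Cheerper', 'Cheerper'],
    'department': ['Sales', 'Design', 'AI', 'Search Engine', 'BigData', 'Support'],
})

# Sampled rows 0, 1, 3, 4 and 6 from the output above (name and address omitted).
synthetic = pd.DataFrame({
    'company': ['Pear', 'Glasses', 'Glasses', 'Pear', 'Cheerper'],
    'department': ['Sales', 'AI', 'Search Engine', 'Design', 'BigData'],
    'age': [47, 39, 43, 32, 40],
    'age_when_joined': [40, 31, 41, 28, 28],
    'years_in_the_company': [7, 8, 2, 4, 12],
})


def find_violations(real, synthetic):
    """Return the synthetic rows that break any of the three constraints."""
    known_pairs = real[['company', 'department']].drop_duplicates()
    # indicator=True adds a '_merge' column marking rows without a match.
    merged = synthetic.merge(
        known_pairs, on=['company', 'department'], how='left', indicator=True
    )
    new_pair = merged['_merge'] == 'left_only'
    wrong_order = merged['age_when_joined'] > merged['age']
    wrong_years = (
        merged['years_in_the_company'] != merged['age'] - merged['age_when_joined']
    )
    return merged[new_pair | wrong_order | wrong_years].drop(columns='_merge')


violations = find_violations(real, synthetic)
print(len(violations))
```

An empty result means that every checked row uses a company/department pair seen in the original data, has an age_when_joined that does not exceed age, and has a consistent years_in_the_company value.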