GaussianCopula Model

In this guide we will go through a series of steps that will let you discover functionalities of the GaussianCopula model, including how to:

  • Create an instance of a GaussianCopula.

  • Fit the instance to your data.

  • Generate synthetic versions of your data.

  • Use GaussianCopula to anonymize PII information.

  • Customize the data transformations to improve the learning process.

  • Specify the column distributions to improve the output quality.

What is GaussianCopula?

The sdv.tabular.GaussianCopula model is based on copula functions.

In mathematical terms, a Gaussian copula is a distribution over the unit cube \([0,1]^d\) which is constructed from a multivariate normal distribution over \(\mathbb{R}^d\) by using the probability integral transform. Intuitively, a copula is a mathematical function that allows us to describe the joint distribution of multiple random variables by analyzing the dependencies between their marginal distributions.
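For reference, the standard textbook expression of the Gaussian copula (not taken from the SDV documentation) is:

\[
C_\Sigma(u_1, \ldots, u_d) = \Phi_\Sigma\bigl(\Phi^{-1}(u_1), \ldots, \Phi^{-1}(u_d)\bigr)
\]

where \(\Phi^{-1}\) is the quantile function (inverse CDF) of the standard normal distribution and \(\Phi_\Sigma\) is the joint CDF of a multivariate normal distribution with correlation matrix \(\Sigma\).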

Let’s now discover how to learn a dataset and later on generate synthetic data with the same format and statistical properties by using the GaussianCopula model.

Quick Usage

We will start by loading one of our demo datasets, the student_placements, which contains information about MBA students that applied for placements during the year 2020.

In [1]: from sdv.demo import load_tabular_demo

In [2]: data = load_tabular_demo('student_placements')

In [3]: data.head()
Out[3]: 
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec  mba_perc   salary  placed start_date   end_date duration
0       17264      M        67.00      91.00  Commerce        58.00    Sci&Tech            False                 0                55.0   Mkt&HR     58.80  27000.0    True 2020-07-23 2020-10-12      3.0
1       17265      M        79.33      78.33   Science        77.48    Sci&Tech             True                 1                86.5  Mkt&Fin     66.28  20000.0    True 2020-01-11 2020-04-09      3.0
2       17266      M        65.00      68.00      Arts        64.00   Comm&Mgmt            False                 0                75.0  Mkt&Fin     57.80  25000.0    True 2020-01-26 2020-07-13      6.0
3       17267      M        56.00      52.00   Science        52.00    Sci&Tech            False                 0                66.0   Mkt&HR     59.43      NaN   False        NaT        NaT      NaN
4       17268      M        85.80      73.60  Commerce        73.30   Comm&Mgmt            False                 0                96.8  Mkt&Fin     55.50  42500.0    True 2020-07-04 2020-09-27      3.0

As you can see, this table contains information about students which includes, among other things:

  • Their id and gender

  • Their grades and specializations

  • Their work experience

  • The salary that they were offered

  • The duration and dates of their placement

You will notice that the data has the following characteristics:

  • There are float, integer, boolean, categorical and datetime values.

  • There are some variables that have missing data. In particular, all the data related to the placement details is missing in the rows where the student was not placed.

Let us use the GaussianCopula to learn this data and then sample synthetic data about new students to see how well the model captures the characteristics indicated above. In order to do this you will need to:

  • Import the sdv.tabular.GaussianCopula class and create an instance of it.

  • Call its fit method passing our table.

  • Call its sample method indicating the number of synthetic rows that you want to generate.

In [4]: from sdv.tabular import GaussianCopula

In [5]: model = GaussianCopula()

In [6]: model.fit(data)

Note

Notice that the model fitting process took care of transforming the different fields using the appropriate Reversible Data Transforms to ensure that the data has a format that the GaussianMultivariate model can handle.

Generate synthetic data from the model

Once the modeling has finished you are ready to generate new synthetic data by calling the sample method of your model, passing the number of rows that you want to generate.

In [7]: new_data = model.sample(200)

This will return a table with the same format as the one the model was fitted on, but filled with new data that resembles the original.

In [8]: new_data.head()
Out[8]: 
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc        salary  placed start_date   end_date duration
0       17445      M    71.502694  66.540612   Science    71.228192   Comm&Mgmt            False                 0           57.663848  Mkt&Fin  63.393554  26904.930523    True 2020-03-14 2020-09-03      3.0
1       17367      F    63.324587  55.223690  Commerce    64.185648   Comm&Mgmt            False                -1           65.277799   Mkt&HR  61.486636           NaN   False        NaT        NaT      NaN
2       17281      F    74.268674  69.399272   Science    72.559820    Sci&Tech            False                 0           86.657122  Mkt&Fin  65.642716  36267.041374    True 2020-02-29 2020-09-23     12.0
3       17372      M    87.791407  93.223229   Science    70.333096    Sci&Tech            False                 0           79.222248   Mkt&HR  65.873716  23750.433128    True 2020-02-17 2020-05-26      3.0
4       17380      F    68.274953  63.540454  Commerce    77.060372    Sci&Tech            False                 1           51.943410   Mkt&HR  62.874340  28801.338637   False 2020-01-04 2020-05-26      NaN

Note

You can control the number of rows by passing the desired number of samples to model.sample(<num_rows>). To test this, try model.sample(10000). Note that the original table only had ~200 rows.

Save and Load the model

In many scenarios it will be convenient to generate synthetic versions of your data directly in systems that do not have access to the original data source. For example, you may want to generate testing data on the fly inside a testing environment that does not have access to your production database. In these scenarios, fitting the model with real data every time that you need to generate new data is not feasible, so you will need to fit a model in your production environment, save the fitted model into a file, send this file to the testing environment and then load it there to be able to sample from it.

Let’s see how this process works.

Save and share the model

Once you have fitted the model, all you need to do is call its save method passing the name of the file in which you want to save the model. Note that the extension of the filename is not relevant, but we will be using the .pkl extension to highlight that the serialization protocol used is pickle.

In [9]: model.save('my_model.pkl')

This will have created a file called my_model.pkl in the same directory in which you are running SDV.

Important

If you inspect the generated file you will notice that its size is much smaller than the size of the data that you used to generate it. This is because the serialized model contains no information about the original data, other than the parameters it needs to generate synthetic versions of it. This means that you can safely share this my_model.pkl file without the risk of disclosing any of your real data!
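As a quick sanity check (this snippet is not part of the original session), you can compare the size of the serialized model file with the memory footprint of the original table:

import os

# The pickled model is typically only a few kilobytes, while the original
# table can be arbitrarily large.
print(os.path.getsize('my_model.pkl'), 'bytes on disk')
print(data.memory_usage(deep=True).sum(), 'bytes of original data in memory')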

Load the model and generate new data

The file you just generated can be sent over to the system where the synthetic data will be generated. Once it is there, you can load it using the GaussianCopula.load method, and then you are ready to sample new data from the loaded instance:

In [10]: loaded = GaussianCopula.load('my_model.pkl')

In [11]: new_data = loaded.sample(200)

Warning

Notice that the system where the model is loaded needs to also have sdv installed, otherwise it will not be able to load the model and use it.

Specifying the Primary Key of the table

One of the first things that you may have noticed when looking at the demo data is that there is a student_id column which acts as the primary key of the table, and which is supposed to have unique values. Indeed, if we look at the number of times that each value appears, we see that all of them appear at most once:

In [12]: data.student_id.value_counts().max()
Out[12]: 1

However, if we look at the synthetic data that we generated, we observe that there are some values that appear more than once:

In [13]: new_data[new_data.student_id == new_data.student_id.value_counts().index[0]]
Out[13]: 
     student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc        salary  placed start_date   end_date duration
101       17451      F    53.777573  77.469202  Commerce    58.886594   Comm&Mgmt            False                 0           55.789568   Mkt&HR  66.764646           NaN   False        NaT        NaT      NaN
122       17451      F    48.609547  56.208351   Science    72.103867   Comm&Mgmt            False                 0           51.141349  Mkt&Fin  62.086654           NaN   False        NaT        NaT      NaN
129       17451      M    66.295968  64.914855  Commerce    69.609510   Comm&Mgmt            False                 0           65.348814  Mkt&Fin  52.330106  24487.618792    True 2020-04-13 2020-09-05      3.0
150       17451      M    60.642834  54.973978   Science    60.465220    Sci&Tech            False                 1           71.296884   Mkt&HR  54.353150           NaN   False        NaT        NaT      NaN
165       17451      M    72.720120  50.695996  Commerce    66.451778   Comm&Mgmt            False                 0           73.942398   Mkt&HR  54.780965  29516.728716    True 2020-07-20 2020-08-04      3.0
189       17451      F    73.104277  65.423069   Science    75.147736   Comm&Mgmt            False                 0           76.765632  Mkt&Fin  71.146266  29548.021837    True 2020-01-12 2020-06-07      3.0

This happens because the model was never told that the student_id column had to be unique, so when it generates new data it will eventually produce collisions. In order to solve this, we can pass the argument primary_key to our model when we create it, indicating the name of the column that acts as the index of the table.

In [14]: model = GaussianCopula(
   ....:     primary_key='student_id'
   ....: )
   ....: 

In [15]: model.fit(data)

In [16]: new_data = model.sample(200)

In [17]: new_data.head()
Out[17]: 
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc        salary  placed start_date   end_date duration
0           0      F    75.954437  65.874155   Science    73.071839   Comm&Mgmt            False                 1           62.245229  Mkt&Fin  59.854641  25272.228737    True 2020-03-21 2020-08-09      3.0
1           1      M    76.199565  69.003440   Science    74.994287    Sci&Tech            False                 1           89.078363  Mkt&Fin  71.451042  36267.842700    True 2020-02-08 2020-09-07      3.0
2           2      M    47.637773  55.690358   Science    58.478495    Sci&Tech            False                 1           55.157416   Mkt&HR  57.272402           NaN   False        NaT        NaT      NaN
3           3      M    63.262929  63.947319   Science    71.171869   Comm&Mgmt            False                 0           88.587588  Mkt&Fin  62.559554  22892.135506    True 2020-01-05 2020-06-20      3.0
4           4      M    78.148608  56.609793   Science    73.367579    Sci&Tech            False                 0           84.840520  Mkt&Fin  63.818865  29655.467846    True 2020-03-27 2020-08-21      3.0

As a result, the model will learn that this column must be unique and generate a unique sequence of values for the column:

In [18]: new_data.student_id.value_counts().max()
Out[18]: 1

Anonymizing Personally Identifiable Information (PII)

There will be many cases where the data will contain Personally Identifiable Information which we cannot disclose. In these cases, we will want our Tabular Models to replace the information within these fields with fake, simulated data that looks similar to the real one but does not contain any of the original values.

Let’s load a new dataset that contains a PII field, the student_placements_pii demo, and try to generate synthetic versions of it that do not contain any of the PII fields.

Note

The student_placements_pii dataset is a modified version of the student_placements dataset with one new field, address, which contains PII information about the students. Notice that this additional address field has been simulated and does not correspond to data from the real users.

In [19]: data_pii = load_tabular_demo('student_placements_pii')

In [20]: data_pii.head()
Out[20]: 
   student_id                                            address gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec  mba_perc   salary  placed start_date   end_date duration
0       17264        70304 Baker Turnpike\nEricborough, MS 15086      M        67.00      91.00  Commerce        58.00    Sci&Tech            False                 0                55.0   Mkt&HR     58.80  27000.0    True 2020-07-23 2020-10-12      3.0
1       17265    805 Herrera Avenue Apt. 134\nMaryview, NJ 36510      M        79.33      78.33   Science        77.48    Sci&Tech             True                 1                86.5  Mkt&Fin     66.28  20000.0    True 2020-01-11 2020-04-09      3.0
2       17266        3702 Bradley Island\nNorth Victor, FL 12268      M        65.00      68.00      Arts        64.00   Comm&Mgmt            False                 0                75.0  Mkt&Fin     57.80  25000.0    True 2020-01-26 2020-07-13      6.0
3       17267                   Unit 0879 Box 3878\nDPO AP 42663      M        56.00      52.00   Science        52.00    Sci&Tech            False                 0                66.0   Mkt&HR     59.43      NaN   False        NaT        NaT      NaN
4       17268  96493 Kelly Canyon Apt. 145\nEast Steven, NC 3...      M        85.80      73.60  Commerce        73.30   Comm&Mgmt            False                 0                96.8  Mkt&Fin     55.50  42500.0    True 2020-07-04 2020-09-27      3.0

If we use our tabular model on this new data we will see how the synthetic data that it generates discloses the addresses from the real students:

In [21]: model = GaussianCopula(
   ....:     primary_key='student_id',
   ....: )
   ....: 

In [22]: model.fit(data_pii)

In [23]: new_data_pii = model.sample(200)

In [24]: new_data_pii.head()
Out[24]: 
   student_id                                            address gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc        salary  placed start_date   end_date duration
0           0  61900 Monica Stream Suite 028\nPort Michael, M...      M    63.261295  59.026588   Science    62.386083   Comm&Mgmt            False                 0           96.785469   Mkt&HR  62.152346  35384.025614    True 2020-05-07 2020-09-19      NaN
1           1        201 Rhodes Isle\nPort Robertmouth, MS 27570      M    45.378136  47.840825  Commerce    61.841753   Comm&Mgmt            False                 0           57.550855   Mkt&HR  57.988714           NaN   False        NaT        NaT      NaN
2           2   9976 James Crest Apt. 125\nStevenhaven, GA 30830      M    52.900469  65.948894  Commerce    52.293776   Comm&Mgmt            False                 0           92.892410   Mkt&HR  62.259721  22265.435572    True 2020-03-03 2020-09-24      3.0
3           3                   Unit 4181 Box 7016\nDPO AP 47399      F    57.967665  60.289538  Commerce    55.759392   Comm&Mgmt            False                 0           83.009944  Mkt&Fin  58.468285           NaN   False        NaT        NaT      NaN
4           4  96493 Kelly Canyon Apt. 145\nEast Steven, NC 3...      M    83.905387  73.193487   Science    69.023676   Comm&Mgmt            False                -1           97.500080  Mkt&Fin  58.657225           NaN   False        NaT        NaT      NaN

More specifically, we can see how all the addresses that have been generated actually come from the original dataset:

In [25]: new_data_pii.address.isin(data_pii.address).sum()
Out[25]: 200

In order to solve this, we can pass an additional argument, anonymize_fields, to our model when we create the instance. This anonymize_fields argument needs to be a dictionary that contains:

  • The name of each field that we want to anonymize, as the key.

  • The category of fake data that we want to generate for it, as the value.

The complete list of possible categories can be seen on the Faker Providers page, and it contains a huge list of concepts such as:

  • name

  • address

  • country

  • city

  • ssn

  • credit_card_number

  • credit_card_expire

  • credit_card_security_code

  • email

  • telephone

In this case, since the field contains street addresses, we will pass a dictionary indicating the category address:

In [26]: model = GaussianCopula(
   ....:     primary_key='student_id',
   ....:     anonymize_fields={
   ....:         'address': 'address'
   ....:     }
   ....: )
   ....: 

In [27]: model.fit(data_pii)

As a result, we can see how the real address values have been replaced by other fake addresses:

In [28]: new_data_pii = model.sample(200)

In [29]: new_data_pii.head()
Out[29]: 
   student_id                                            address gender  second_perc   high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc        salary  placed start_date   end_date duration
0           0        0454 Martin Ridges\nNew Ritahaven, GA 67609      F    72.703674   57.257009   Science    69.682157    Sci&Tech            False                 0           75.155014   Mkt&HR  58.372490  24187.634324    True 2020-02-20 2020-06-13      3.0
1           1  1962 Leon Islands Suite 710\nFrancistown, NJ 5...      F    64.055928  132.110982   Science    73.123287   Comm&Mgmt            False                 1           71.687703  Mkt&Fin  66.351226  20490.139595    True 2020-04-26 2020-10-14      3.0
2           2  80198 James Greens Apt. 219\nPort Alicia, WV 8...      F    82.012310   62.888033  Commerce    67.821035   Comm&Mgmt            False                 1           56.923800  Mkt&Fin  65.323453  28392.759021    True 2020-01-08 2020-10-25     12.0
3           3          034 Stevens Island\nNorth Robin, NE 02835      M    60.280532   74.174304  Commerce    56.581816   Comm&Mgmt            False                 0           62.978577   Mkt&HR  61.944776           NaN   False        NaT        NaT      NaN
4           4   1878 Ward Rue Apt. 566\nEast Markburgh, VT 60004      M    75.491247   62.711577   Science    73.754758   Comm&Mgmt            False                 0           88.443999  Mkt&Fin  58.990644  27134.947928    True 2020-01-10 2020-07-26     12.0

This means that none of the original addresses can be found in the sampled data:

In [30]: data_pii.address.isin(new_data_pii.address).sum()
Out[30]: 0

Advanced Usage

Now that we have discovered the basics, let’s go over a few more advanced usage examples and see the different arguments that we can pass to our GaussianCopula Model in order to customize it to our needs.

How to set transforms to use?

One thing that you may have noticed when executing the previous steps is that the fitting process took much longer on the student_placements_pii dataset than it did on the version without the student address. This happens because the address field is interpreted as a categorical variable, which the GaussianCopula one-hot encodes, generating 215 new columns that it then has to learn.

This transformation, which in this case was very inefficient, happens because the Tabular Models apply Reversible Data Transforms under the hood to transform all the non-numerical variables, which the underlying models cannot handle, into numerical representations which they can properly work with. In the case of the GaussianCopula, the default transformation is a One-Hot encoding, which can work very well with variables that have a small number of different values, but which is very inefficient in cases where there is a large number of values.
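We can confirm the source of this overhead with a quick check (again, not part of the original session): counting the distinct values of the address column shows how many extra one-hot columns the default transformer has to create.

# Each distinct address becomes its own one-hot column under the default
# transformer, so this count matches the number of extra columns to model.
data_pii.address.nunique()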

For this reason, the Tabular Models have an additional argument called field_transformers that lets you select which transformer to apply to each column. This field_transformers argument must be passed as a dict that maps the names of the fields for which we want to use a non-default transformer to the names of the transformers that we want to use.

Possible transformer names are:

  • integer: Uses a NumericalTransformer of dtype int.

  • float: Uses a NumericalTransformer of dtype float.

  • categorical: Uses a CategoricalTransformer without gaussian noise.

  • categorical_fuzzy: Uses a CategoricalTransformer adding gaussian noise.

  • one_hot_encoding: Uses a OneHotEncodingTransformer.

  • label_encoding: Uses a LabelEncodingTransformer.

  • boolean: Uses a BooleanTransformer.

  • datetime: Uses a DatetimeTransformer.

NOTE: For additional details about each one of the transformers, please visit the RDT documentation.

Let’s now try to improve the previous fitting process by changing the transformer that we use for the address field to something other than the default. As an example, we will use the label_encoding transformer, which, instead of generating one column for each possible value, simply replaces each value with a unique integer.
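Conceptually, label encoding can be illustrated with plain pandas; this sketch shows the idea only and is not the actual RDT transformer:

import pandas as pd

# Label encoding: each distinct category gets an integer code, so a single
# numeric column replaces the one-column-per-category one-hot representation.
addresses = pd.Series(['70304 Baker Turnpike', '805 Herrera Avenue', '70304 Baker Turnpike'])
codes, uniques = pd.factorize(addresses)
print(codes)    # [0 1 0]
print(uniques)  # Index(['70304 Baker Turnpike', '805 Herrera Avenue'], dtype='object')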

In [31]: model = GaussianCopula(
   ....:     primary_key='student_id',
   ....:     anonymize_fields={
   ....:         'address': 'address'
   ....:     },
   ....:     field_transformers={
   ....:         'address': 'label_encoding'
   ....:     }
   ....: )
   ....: 

In [32]: model.fit(data_pii)

In [33]: new_data_pii = model.sample(200)

In [34]: new_data_pii.head()
Out[34]: 
   student_id                                            address gender  second_perc   high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc        salary  placed start_date   end_date duration
0           0              5139 Suzanne Way\nCarolport, PA 02727      M    45.763635   65.665957  Commerce    59.476828   Comm&Mgmt            False                 1           83.823511  Mkt&Fin  56.427125  25304.517813    True 2020-01-14 2020-06-25      NaN
1           1      65896 Franklin Station\nCamposshire, AZ 78118      M    85.057499   75.530829   Science    76.665179    Sci&Tech            False                 1           92.332921  Mkt&Fin  72.020314  47549.272263    True 2020-01-24 2020-07-16      3.0
2           2    387 Gregory Dam Suite 757\nRussohaven, SD 50616      M    86.130538  104.161759  Commerce    83.399081   Comm&Mgmt            False                 1           90.106108   Mkt&HR  72.840935  30295.239108    True 2020-01-27 2020-07-15      3.0
3           3       53255 Aaron Dam\nNew Kathleenhaven, RI 91431      F    68.170330   66.916735   Science    67.605387   Comm&Mgmt            False                 0           71.414406   Mkt&HR  71.746166  28016.888923    True 2020-01-17 2020-09-09      3.0
4           4  728 Zachary Point Apt. 282\nNew Michaeltown, F...      F    68.929501   65.213088  Commerce    66.461774   Comm&Mgmt            False                 0           89.353622  Mkt&Fin  66.328419           NaN   False        NaT        NaT      NaN

Exploring the Probability Distributions

During the previous steps, every time we fitted the GaussianCopula it performed the following operations:

  1. Learn the format and data types of the passed data

  2. Transform the non-numerical and null data using Reversible Data Transforms to obtain a fully numerical representation of the data from which we can learn the probability distributions.

  3. Learn the probability distribution of each column from the table

  4. Transform the values of each numerical column by converting them to their marginal distribution CDF values and then applying an inverse CDF transformation of a standard normal on them.

  5. Learn the correlations of the newly generated random variables.

After this, when we used the model to generate new data for our table using the sample method, it performed the following steps (a minimal sketch of the resulting CDF round trip follows the list below):

  1. Sample from a Multivariate Standard Normal distribution with the learned correlations.

  2. Revert the sampled values by computing their standard normal CDF and then applying the inverse CDF of their marginal distributions.

  3. Revert the RDT transformations to go back to the original data format.
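
The essence of fit step 4 and sampling step 2 can be sketched for a single column with scipy. This is only an illustration of the CDF round trip under a simplifying assumption (a plain normal marginal), not the actual SDV/copulas implementation:

import numpy as np
from scipy import stats

column = data['employability_perc'].dropna().to_numpy()

# Fit a marginal distribution; a plain normal is used here purely for illustration.
marginal = stats.norm(*stats.norm.fit(column))

# Fit, step 4: marginal CDF, then inverse CDF of the standard normal.
standardized = stats.norm.ppf(marginal.cdf(column))

# Sample, step 2 (the reverse direction): standard normal CDF, then inverse marginal CDF.
recovered = marginal.ppf(stats.norm.cdf(standardized))

print(np.allclose(column, recovered))  # True, up to numerical precision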

As you can see, during these steps the Marginal Probability Distributions play a very important role, since the GaussianCopula has to learn and reproduce the individual distribution of each column in our table. We can explore the distributions which the GaussianCopula used to model each column using its get_distributions method:

In [35]: model = GaussianCopula(
   ....:     primary_key='student_id'
   ....: )
   ....: 

In [36]: model.fit(data)

In [37]: distributions = model.get_distributions()

This will return a dict which contains the name of the distribution class used for each column:

In [38]: distributions
Out[38]: 
{'second_perc': 'copulas.univariate.truncated_gaussian.TruncatedGaussian',
 'high_perc': 'copulas.univariate.log_laplace.LogLaplace',
 'degree_perc': 'copulas.univariate.student_t.StudentTUnivariate',
 'work_experience': 'copulas.univariate.student_t.StudentTUnivariate',
 'experience_years': 'copulas.univariate.gaussian.GaussianUnivariate',
 'employability_perc': 'copulas.univariate.truncated_gaussian.TruncatedGaussian',
 'mba_perc': 'copulas.univariate.gamma.GammaUnivariate',
 'placed': 'copulas.univariate.gamma.GammaUnivariate',
 'gender#0': 'copulas.univariate.gaussian.GaussianUnivariate',
 'gender#1': 'copulas.univariate.gaussian.GaussianUnivariate',
 'high_spec#0': 'copulas.univariate.gaussian.GaussianUnivariate',
 'high_spec#1': 'copulas.univariate.gamma.GammaUnivariate',
 'high_spec#2': 'copulas.univariate.gaussian.GaussianUnivariate',
 'degree_type#0': 'copulas.univariate.student_t.StudentTUnivariate',
 'degree_type#1': 'copulas.univariate.gaussian.GaussianUnivariate',
 'degree_type#2': 'copulas.univariate.gaussian.GaussianUnivariate',
 'mba_spec#0': 'copulas.univariate.gamma.GammaUnivariate',
 'mba_spec#1': 'copulas.univariate.gaussian.GaussianUnivariate',
 'salary#0': 'copulas.univariate.gamma.GammaUnivariate',
 'salary#1': 'copulas.univariate.gaussian.GaussianUnivariate',
 'start_date#0': 'copulas.univariate.gamma.GammaUnivariate',
 'start_date#1': 'copulas.univariate.gaussian.GaussianUnivariate',
 'end_date#0': 'copulas.univariate.gamma.GammaUnivariate',
 'end_date#1': 'copulas.univariate.gaussian.GaussianUnivariate',
 'duration#0': 'copulas.univariate.gaussian.GaussianUnivariate',
 'duration#1': 'copulas.univariate.gaussian.GaussianUnivariate',
 'duration#2': 'copulas.univariate.student_t.StudentTUnivariate',
 'duration#3': 'copulas.univariate.gaussian.GaussianUnivariate'}

Note

In this output we see multiple distributions for some of the columns in our data. This is because the RDT transformations used to encode the data numerically often use more than one column to represent a single input variable.

Let’s explore the individual distribution of one of the columns in our data to better understand how the GaussianCopula processed them and see if we can improve the results by manually specifying a different distribution. For example, let’s explore the experience_years column by looking at the frequency of its values within the original data:

In [39]: data.experience_years.value_counts()
Out[39]: 
0    141
1     65
2      8
3      1
Name: experience_years, dtype: int64

In [40]: data.experience_years.hist();
[figure: histogram of experience_years in the original data]

By observing the data we can see that the behavior of the values in this column is very similar to a Gamma or even some types of Beta distribution, where the majority of the values are 0 and the frequency decreases as the values increase.

Was the GaussianCopula able to capture this distribution on its own?

In [41]: distributions['experience_years']
Out[41]: 'copulas.univariate.gaussian.GaussianUnivariate'

It seems that it was not; instead, it decided that the behavior was closer to a Gaussian distribution. As a result, we can see that the generated values contain negative values, which are invalid for this column:

In [42]: new_data.experience_years.value_counts()
Out[42]: 
 0    99
 1    85
-1    13
 2     3
Name: experience_years, dtype: int64

In [43]: new_data.experience_years.hist();
[figure: histogram of experience_years in the synthetic data sampled with the default distribution]

Let’s see how we can improve this situation by passing the GaussianCopula the exact distribution that we want it to use for this column.

Setting distributions for individual variables

The GaussianCopula class offers the possibility of indicating which distribution to use for each of the columns in the table, in order to solve situations like the one that we just described. In order to do this, we need to pass a field_distributions argument with a dict that indicates the distribution that we want to use for each column.

Possible values for the distributions are:

  • univariate: Let copulas select the optimal univariate distribution. This may result in non-parametric models being used.

  • parametric: Let copulas select the optimal univariate distribution, but restrict the selection to parametric distributions only.

  • bounded: Let copulas select the optimal univariate distribution, but restrict the selection to bounded distributions only. This may result in non-parametric models being used.

  • semi_bounded: Let copulas select the optimal univariate distribution, but restrict the selection to semi-bounded distributions only. This may result in non-parametric models being used.

  • parametric_bounded: Let copulas select the optimal univariate distribution, but restrict the selection to parametric and bounded distributions only.

  • parametric_semi_bounded: Let copulas select the optimal univariate distribution, but restrict the selection to parametric and semi-bounded distributions only.

  • gaussian: Use a Gaussian distribution.

  • gamma: Use a Gamma distribution.

  • beta: Use a Beta distribution.

  • student_t: Use a Student T distribution.

  • gaussian_kde: Use a GaussianKDE distribution. This model is non-parametric, so using this will make get_parameters unusable.

  • truncated_gaussian: Use a Truncated Gaussian distribution.

Let’s see what happens if we make the GaussianCopula use the gamma distribution for our column.

In [44]: from sdv.tabular import GaussianCopula

In [45]: model = GaussianCopula(
   ....:     primary_key='student_id',
   ....:     field_distributions={
   ....:         'experience_years': 'gamma'
   ....:     }
   ....: )
   ....: 

In [46]: model.fit(data)

After this, we can see how the GaussianCopula used the indicated distribution for the experience_years column:

In [47]: model.get_distributions()['experience_years']
Out[47]: 'copulas.univariate.gamma.GammaUnivariate'

As a result, we can see how the generated data now behaves more like the original data and always stays within the valid value range.

In [48]: new_data = model.sample(len(data))

In [49]: new_data.experience_years.value_counts()
Out[49]: 
0    196
1     17
2      2
Name: experience_years, dtype: int64

In [50]: new_data.experience_years.hist();
[figure: histogram of experience_years in the synthetic data sampled with the gamma distribution]

Note

Even though there are situations like the one shown above where manually choosing a distribution seems to give better results, in most cases the GaussianCopula will be able to find the optimal distribution on its own, making this manual selection of marginal distributions necessary only on rare occasions.

Conditional Sampling

As the name implies, conditional sampling allows us to sample from a conditional distribution using the GaussianCopula model, which means we can generate only values that satisfy certain conditions. These conditional values can be passed to the conditions parameter in the sample method either as a dataframe or a dictionary.

In case a dictionary is passed, the model will generate as many rows as requested, all of which will satisfy the specified conditions, such as gender = M.

In [51]: conditions = {
   ....:     'gender': 'M'
   ....: }
   ....: 

In [52]: model.sample(5, conditions=conditions)
Out[52]: 
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc        salary  placed start_date   end_date duration
0           0      M    72.651376  66.487230  Commerce    74.424150   Comm&Mgmt            False                 0           75.440880  Mkt&Fin  61.821936  25849.913028    True 2020-04-10 2021-02-04      3.0
1           1      M    66.975467  64.790703  Commerce    62.838066    Sci&Tech            False                 1           71.604492  Mkt&Fin  61.700825  31159.896684    True 2020-01-06 2020-04-19      3.0
2           2      M    73.739052  65.219144   Science    68.569188   Comm&Mgmt            False                 0           63.904228  Mkt&Fin  65.630012  22888.220833    True 2020-01-28 2020-09-15     12.0
3           3      M    49.152340  62.286996   Science    57.748447      Others            False                 0           59.840260  Mkt&Fin  60.105272           NaN   False        NaT        NaT      NaN
4           4      M    58.477255  63.351342   Science    60.109075   Comm&Mgmt            False                 0           60.809020  Mkt&Fin  53.584537  34176.618625    True 2020-02-09 2020-07-15      3.0

It’s also possible to condition on multiple columns, such as gender = M and experience_years = 0.

In [53]: conditions = {
   ....:     'gender': 'M',
   ....:     'experience_years': 0
   ....: }
   ....: 

In [54]: model.sample(5, conditions=conditions)
Out[54]: 
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc       salary  placed start_date   end_date duration
0           0      M    55.712659  62.498415  Commerce    59.805716   Comm&Mgmt            False                 0           61.996033   Mkt&HR  51.976394  24879.36516    True 2020-03-11 2020-08-26      3.0
1           1      M    57.854473  52.561250   Science    52.895935    Sci&Tech            False                 0           79.969430   Mkt&HR  50.558409          NaN   False        NaT        NaT      NaN
2           2      M    48.701860  52.395212  Commerce    60.060996   Comm&Mgmt            False                 0           50.453369   Mkt&HR  51.118056          NaN   False        NaT        NaT      NaN
3           3      M    53.711590  60.035696  Commerce    56.347530   Comm&Mgmt            False                 0           59.975297   Mkt&HR  58.521588          NaN   False        NaT        NaT      NaN
4           4      M    46.070352  55.996227  Commerce    60.301035   Comm&Mgmt            False                 0           67.119200   Mkt&HR  61.774106          NaN    True        NaT        NaT      NaN

The conditions can also be passed as a dataframe. In that case, the model will generate one sample for each row of the dataframe, in the same order. Since the model already knows how many samples to generate, passing it as a parameter is unnecessary. For example, if we want to generate three samples where gender = M and three samples with gender = F, we can do the following:

In [55]: import pandas as pd

In [56]: conditions = pd.DataFrame({
   ....:     'gender': ['M', 'M', 'M', 'F', 'F', 'F'],
   ....: })
   ....: 

In [57]: model.sample(conditions=conditions)
Out[57]: 
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc        salary  placed start_date   end_date duration
0           0      M    70.818890  69.342371  Commerce    67.548007   Comm&Mgmt            False                 0           72.276878  Mkt&Fin  56.892232  36970.365953    True 2020-01-07 2020-07-25     12.0
1           1      M    49.922037  66.670714   Science    69.971812   Comm&Mgmt            False                 0           77.018489  Mkt&Fin  65.067512           NaN   False        NaT        NaT      NaN
2           2      M    45.419798  64.168603   Science    57.754671   Comm&Mgmt            False                 0           80.573616   Mkt&HR  60.313179           NaN   False        NaT        NaT      NaN
3           3      F    87.593852  92.957886   Science    79.076768   Comm&Mgmt            False                 0           90.939686  Mkt&Fin  69.235631  31000.803506    True 2020-01-10 2020-07-12      3.0
4           4      F    74.269341  67.334386  Commerce    67.513388   Comm&Mgmt            False                 0           55.237230   Mkt&HR  62.589396  25389.699012    True 2020-09-13 2020-08-17      3.0
5           5      F    70.281451  63.549215  Commerce    62.878898   Comm&Mgmt            False                 0           80.286559  Mkt&Fin  69.414610           NaN   False        NaT        NaT      NaN

GaussianCopula also supports conditioning on continuous values, as long as the values are within the range seen in the training data. For example, if all the values of a column lie between 0 and 1, GaussianCopula will not be able to condition on a value of 1000.

In [58]: conditions = {
   ....:     'degree_perc': 70.0
   ....: }
   ....: 

In [59]: model.sample(5, conditions=conditions)
Out[59]: 
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc        salary  placed start_date   end_date duration
0           0      F    61.883342  62.337618   Science         70.0   Comm&Mgmt            False                 0           63.140252   Mkt&HR  64.516206  29776.417801    True 2020-01-20 2020-08-01      3.0
1           1      F    85.021481  78.270952   Science         70.0    Sci&Tech            False                 0           57.910443   Mkt&HR  70.524223  27145.663946    True 2020-07-21 2020-11-11      3.0
2           2      M    69.163341  62.547299   Science         70.0    Sci&Tech            False                 0           59.390953  Mkt&Fin  65.931951  32271.970695    True 2020-03-22 2020-10-10      3.0
3           3      M    60.440550  46.592372   Science         70.0    Sci&Tech            False                 0           81.656548  Mkt&Fin  58.761903  27897.376907    True 2020-04-03 2020-06-26      3.0
4           4      M    47.121051  63.963479   Science         70.0   Comm&Mgmt            False                 0           63.894390  Mkt&Fin  60.215117  32288.365572    True 2020-01-17 2020-06-13      3.0

Note

Currently, conditional sampling works through a rejection sampling process, where rows are sampled repeatedly until one that satisfies the conditions is found. In case you are running into a Could not get enough valid rows within x trials error or simply wish to optimize the results, there are three parameters that can be fine-tuned: max_rows_multiplier, max_retries and float_rtol. More information about these parameters can be found in the API section.
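For example, assuming these tuning parameters are accepted as keyword arguments by sample, as the note above suggests (check the API reference for the exact signature and defaults in your SDV version), they could be passed like this:

# Hypothetical tuning values, shown only for illustration.
model.sample(
    5,
    conditions={'gender': 'M'},
    max_rows_multiplier=20,  # draw more candidate rows per rejection-sampling batch
    max_retries=200,         # allow more retries before giving up
    float_rtol=0.05,         # relative tolerance when matching float conditions
)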

How do I specify constraints?

If you look closely at the data you may notice that some properties were not completely captured by the model. For example, you may have seen that sometimes the model produces an experience_years number greater than 0 while also indicating that work_experience is False. These types of properties are what we call Constraints and can also be handled using SDV. For further details about them please visit the Handling Constraints guide.

Can I evaluate the Synthetic Data?

A very common question when someone starts using SDV to generate synthetic data is: “How good is the data that I just generated?”

In order to answer this question, SDV has a collection of metrics and tools that allow you to compare the real data that you provided with the synthetic data that you generated using SDV or any other tool.
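For example, assuming the evaluate helper from the sdv.evaluation module described in that guide, a single aggregate score can be obtained by comparing the sampled table against the real one:

from sdv.evaluation import evaluate

# Returns an aggregate score where higher values mean the synthetic data is
# statistically more similar to the real data.
score = evaluate(new_data, data)
print(score)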

You can read more about this in the Synthetic Data Evaluation guide.