CopulaGAN Model

In this guide we will go through a series of steps that will let you discover functionalities of the CopulaGAN model, including how to:

  • Create an instance of CopulaGAN.

  • Fit the instance to your data.

  • Generate synthetic versions of your data.

  • Use CopulaGAN to anonymize PII information.

  • Customize the data transformations to improve the learning process.

  • Specify the column distributions to improve the output quality.

  • Specify hyperparameters to improve the output quality.

What is CopulaGAN?

The sdv.tabular.CopulaGAN model is a variation of the CTGAN Model which takes advantage of the CDF-based transformation applied by Gaussian Copulas to make it easier for the underlying CTGAN model to learn the data.

Let’s now discover how to learn a dataset and later on generate synthetic data with the same format and statistical properties by using the CopulaGAN class from SDV.

Quick Usage

We will start by loading one of our demo datasets, the student_placements, which contains information about MBA students that applied for placements during the year 2020.

Warning

In order to follow this guide you need to have ctgan installed on your system. If you have not done it yet, please install it now by executing the command pip install sdv in a terminal, which installs ctgan as a dependency.

In [1]: from sdv.demo import load_tabular_demo

In [2]: data = load_tabular_demo('student_placements')

In [3]: data.head()
Out[3]: 
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec  mba_perc   salary  placed start_date   end_date duration
0       17264      M        67.00      91.00  Commerce        58.00    Sci&Tech            False                 0                55.0   Mkt&HR     58.80  27000.0    True 2020-07-23 2020-10-12      3.0
1       17265      M        79.33      78.33   Science        77.48    Sci&Tech             True                 1                86.5  Mkt&Fin     66.28  20000.0    True 2020-01-11 2020-04-09      3.0
2       17266      M        65.00      68.00      Arts        64.00   Comm&Mgmt            False                 0                75.0  Mkt&Fin     57.80  25000.0    True 2020-01-26 2020-07-13      6.0
3       17267      M        56.00      52.00   Science        52.00    Sci&Tech            False                 0                66.0   Mkt&HR     59.43      NaN   False        NaT        NaT      NaN
4       17268      M        85.80      73.60  Commerce        73.30   Comm&Mgmt            False                 0                96.8  Mkt&Fin     55.50  42500.0    True 2020-07-04 2020-09-27      3.0

As you can see, this table contains information about students which includes, among other things:

  • Their id and gender

  • Their grades and specializations

  • Their work experience

  • The salary that they were offered

  • The duration and dates of their placement

You will notice that there is data with the following characteristics:

  • There are float, integer, boolean, categorical and datetime values.

  • There are some variables that have missing data. In particular, all the data related to the placement details is missing in the rows where the student was not placed.

Let us use CopulaGAN to learn this data and then sample synthetic data about new students to see how well the model captures the characteristics indicated above. In order to do this you will need to:

  • Import the sdv.tabular.CopulaGAN class and create an instance of it.

  • Call its fit method passing our table.

  • Call its sample method indicating the number of synthetic rows that you want to generate.

In [4]: from sdv.tabular import CopulaGAN

In [5]: model = CopulaGAN()

In [6]: model.fit(data)

Note

Notice that the model fitting process took care of transforming the different fields using the appropriate Reversible Data Transforms to ensure that the data has a format that the underlying CTGANSynthesizer class can handle.

Generate synthetic data from the model

Once the modeling has finished you are ready to generate new synthetic data by calling the sample method from your model, passing the number of rows that you want to generate.

In [7]: new_data = model.sample(200)

This will return a table with the same structure as the one the model was fitted on, but filled with new data that resembles the original.

In [8]: new_data.head()
Out[8]: 
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc        salary  placed start_date   end_date duration
0       17343      M    57.675716  67.052617      Arts    75.331502   Comm&Mgmt            False                 0           54.631293  Mkt&Fin  71.562876  27208.435796    True 2020-01-15 2020-06-17      3.0
1       17477      F    45.677374  42.234641      Arts    63.132638   Comm&Mgmt            False                 0           70.753065  Mkt&Fin  67.442271  18380.314323    True        NaT 2020-09-28      NaN
2       17265      F    73.766502  65.216742   Science    74.964961   Comm&Mgmt             True                 0           53.731417   Mkt&HR  64.151859  29320.612490   False 2020-03-13 2020-09-06      3.0
3       17461      F    67.086296  51.806403  Commerce    87.203741   Comm&Mgmt             True                 0           64.555617   Mkt&HR  64.658355  24349.805374    True 2020-03-22 2020-07-04      3.0
4       17392      M    89.059970  60.128504   Science    74.592259   Comm&Mgmt            False                 0           53.585064   Mkt&HR  56.903326  26930.163403   False 2020-01-20        NaT     12.0

Note

You can control the number of rows by passing the desired number of samples to model.sample(<num_rows>). To test, try model.sample(10000). Note that the original table only has ~200 rows.

Save and Load the model

In many scenarios it will be convenient to generate synthetic versions of your data directly in systems that do not have access to the original data source. For example, you may want to generate testing data on the fly inside a testing environment that does not have access to your production database. In these scenarios, fitting the model with real data every time that you need to generate new data is not feasible, so you will need to fit a model in your production environment, save the fitted model into a file, send this file to the testing environment and then load it there to be able to sample from it.

Let’s see how this process works.

Save and share the model

Once you have fitted the model, all you need to do is call its save method passing the name of the file in which you want to save the model. Note that the extension of the filename is not relevant, but we will be using the .pkl extension to highlight that the serialization protocol used is pickle.

In [9]: model.save('my_model.pkl')

This will have created a file called my_model.pkl in the same directory in which you are running SDV.

Important

If you inspect the generated file you will notice that its size is much smaller than the size of the data that you used to generate it. This is because the serialized model contains no information about the original data, other than the parameters it needs to generate synthetic versions of it. This means that you can safely share this my_model.pkl file without the risk of disclosing any of your real data!
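As a quick, illustrative way to compare the two sizes for your own data (the calls below are standard library and pandas helpers, not SDV APIs):

import os

# Size of the serialized model on disk, in bytes.
model_size = os.path.getsize('my_model.pkl')

# Approximate in-memory size of the original table, in bytes.
data_size = data.memory_usage(deep=True).sum()

print(f'model file: {model_size} bytes / original data: {data_size} bytes')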

Load the model and generate new data

The file you just generated can be sent over to the system where the synthetic data will be generated. Once it is there, you can load it using the CopulaGAN.load method, and then you are ready to sample new data from the loaded instance:

In [10]: loaded = CopulaGAN.load('my_model.pkl')

In [11]: new_data = loaded.sample(200)

Warning

Notice that the system where the model is loaded needs to also have sdv and ctgan installed, otherwise it will not be able to load the model and use it.

Specifying the Primary Key of the table

One of the first things that you may have noticed when looking at the demo data is that there is a student_id column which acts as the primary key of the table, and which is supposed to have unique values. Indeed, if we look at the number of times that each value appears, we see that all of them appear at most once:

In [12]: data.student_id.value_counts().max()
Out[12]: 1

However, if we look at the synthetic data that we generated, we observe that there are some values that appear more than once:

In [13]: new_data[new_data.student_id == new_data.student_id.value_counts().index[0]]
Out[13]: 
     student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc        salary  placed start_date   end_date duration
39        17478      M    42.999117  61.146830  Commerce    79.106691   Comm&Mgmt            False                 4           66.817192  Mkt&Fin  64.745617  17997.757626   False        NaT 2020-08-24      NaN
53        17478      F    69.863648  58.435742   Science    73.083194    Sci&Tech            False                 1           59.696847  Mkt&Fin  68.451614  27376.159521    True 2020-08-05 2021-03-14      NaN
56        17478      F    66.290662  81.304770      Arts    83.028636    Sci&Tech            False                 1           51.126993  Mkt&Fin  63.472818  19532.753808   False        NaT 2020-09-22      6.0
64        17478      M    43.634751  70.764316  Commerce    82.027700   Comm&Mgmt             True                 2           51.628523   Mkt&HR  64.704598  28575.246367    True 2020-01-18        NaT      NaN
151       17478      M    53.434521  28.853579  Commerce    64.886096   Comm&Mgmt             True                 0           50.729228  Mkt&Fin  74.157459  26521.406728   False        NaT        NaT      3.0
169       17478      F    48.090778  65.048094   Science    68.511165   Comm&Mgmt            False                 1           53.098110  Mkt&Fin  57.132024           NaN   False 2020-01-05 2020-09-23     12.0
182       17478      F    55.620425  32.735862  Commerce    70.662312   Comm&Mgmt            False                 1           64.273674   Mkt&HR  62.161264  26804.949476   False 2020-03-14 2020-06-10      NaN
194       17478      M    89.400000  30.748189  Commerce    84.278267   Comm&Mgmt            False                 1           54.146131   Mkt&HR  79.546356           NaN    True 2020-04-03 2020-07-28      6.0

This happens because the model was not notified at any point that student_id had to be unique, so when it generates new data it will eventually produce collisions. In order to solve this, we can pass the argument primary_key to our model when we create it, indicating the name of the column that acts as the primary key of the table.

In [14]: model = CopulaGAN(
   ....:     primary_key='student_id'
   ....: )
   ....: 

In [15]: model.fit(data)

In [16]: new_data = model.sample(200)

In [17]: new_data.head()
Out[17]: 
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc        salary  placed start_date   end_date duration
0           0      M    78.148814  66.139058   Science    61.435598   Comm&Mgmt            False                 0           54.860183  Mkt&Fin  62.917992  20231.965709    True 2020-04-28 2020-08-04     12.0
1           1      M    89.358878  64.437661  Commerce    76.154126   Comm&Mgmt             True                 1           83.879527  Mkt&Fin  74.483029  27731.856224   False        NaT 2020-04-27      3.0
2           2      F    71.541120  43.413513  Commerce    62.419453   Comm&Mgmt            False                 0           93.163297  Mkt&Fin  63.038829  30785.551143    True 2020-02-22 2020-04-06      3.0
3           3      F    70.017385  68.449887   Science    81.995885   Comm&Mgmt            False                 1           63.660266   Mkt&HR  75.350619  21465.277145    True 2020-01-07        NaT     12.0
4           4      M    78.132191  70.935037  Commerce    86.443417   Comm&Mgmt            False                 0           94.083981  Mkt&Fin  80.809315  31765.145261    True        NaT 2020-11-19      3.0

As a result, the model will learn that this column must be unique and generate a unique sequence of values for the column:

In [18]: new_data.student_id.value_counts().max()
Out[18]: 1

Anonymizing Personally Identifiable Information (PII)

There will be many cases where the data will contain Personally Identifiable Information which we cannot disclose. In these cases, we will want our Tabular Models to replace the information within these fields with fake, simulated data that looks similar to the real one but does not contain any of the original values.

Let’s load a new dataset that contains a PII field, the student_placements_pii demo, and try to generate synthetic versions of it that do not contain any of the PII fields.

Note

The student_placements_pii dataset is a modified version of the student_placements dataset with one new field, address, which contains PII information about the students. Notice that this additional address field has been simulated and does not correspond to data from the real users.

In [19]: data_pii = load_tabular_demo('student_placements_pii')

In [20]: data_pii.head()
Out[20]: 
   student_id                                            address gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec  mba_perc   salary  placed start_date   end_date duration
0       17264        70304 Baker Turnpike\nEricborough, MS 15086      M        67.00      91.00  Commerce        58.00    Sci&Tech            False                 0                55.0   Mkt&HR     58.80  27000.0    True 2020-07-23 2020-10-12      3.0
1       17265    805 Herrera Avenue Apt. 134\nMaryview, NJ 36510      M        79.33      78.33   Science        77.48    Sci&Tech             True                 1                86.5  Mkt&Fin     66.28  20000.0    True 2020-01-11 2020-04-09      3.0
2       17266        3702 Bradley Island\nNorth Victor, FL 12268      M        65.00      68.00      Arts        64.00   Comm&Mgmt            False                 0                75.0  Mkt&Fin     57.80  25000.0    True 2020-01-26 2020-07-13      6.0
3       17267                   Unit 0879 Box 3878\nDPO AP 42663      M        56.00      52.00   Science        52.00    Sci&Tech            False                 0                66.0   Mkt&HR     59.43      NaN   False        NaT        NaT      NaN
4       17268  96493 Kelly Canyon Apt. 145\nEast Steven, NC 3...      M        85.80      73.60  Commerce        73.30   Comm&Mgmt            False                 0                96.8  Mkt&Fin     55.50  42500.0    True 2020-07-04 2020-09-27      3.0

If we use our tabular model on this new data we will see how the synthetic data that it generates discloses the addresses from the real students:

In [21]: model = CopulaGAN(
   ....:     primary_key='student_id',
   ....: )
   ....: 

In [22]: model.fit(data_pii)

In [23]: new_data_pii = model.sample(200)

In [24]: new_data_pii.head()
Out[24]: 
   student_id                                            address gender  second_perc   high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc        salary  placed start_date   end_date duration
0           0    6994 White Falls\nLake Cynthiaborough, CT 15835      F    88.299259   65.067945   Science    59.582102    Sci&Tech            False                 1           52.561624  Mkt&Fin  69.336242  28008.887338    True 2020-09-12 2020-06-28      NaN
1           1  29166 Tammy Crest Apt. 839\nSouth Lindaside, F...      F    82.424339   82.751783      Arts    67.876132   Comm&Mgmt            False                 1           66.760665  Mkt&Fin  57.032038  25533.175475    True        NaT 2020-06-20      6.0
2           2  8585 Jennifer Road Apt. 853\nNorth Paulside, T...      M    80.768921   72.051864  Commerce    62.181795   Comm&Mgmt            False                 0           73.995943  Mkt&Fin  65.785328  51545.407394    True        NaT 2020-07-19      3.0
3           3   0917 Shawn Grove Suite 337\nWhiteville, KS 92476      M    73.083471  110.089528  Commerce    62.232031    Sci&Tech            False                 1           52.501414  Mkt&Fin  65.539283           NaN    True 2020-02-24 2020-12-15      NaN
4           4       29282 Kelley Orchard\nMartinezview, CT 78137      M    88.011639  106.433397   Science    69.887740   Comm&Mgmt            False                 0           57.791789   Mkt&HR  53.004735  45542.795803    True        NaT        NaT      NaN

More specifically, we can see how all the addresses that have been generated actually come from the original dataset:

In [25]: new_data_pii.address.isin(data_pii.address).sum()
Out[25]: 200

In order to solve this, we can pass an additional argument anonymize_fields to our model when we create the instance. This anonymize_fields argument will need to be a dictionary that contains:

  • The name of the field that we want to anonymize.

  • The category of the field that we want to use when we generate fake values for it.

The complete list of possible categories can be seen in the Faker Providers page, and it contains a huge list of concepts such as:

  • name

  • address

  • country

  • city

  • ssn

  • credit_card_number

  • credit_card_expire

  • credit_card_security_code

  • email

  • telephone

In this case, since the field contains an address, we will pass a dictionary indicating the category address:

In [26]: model = CopulaGAN(
   ....:     primary_key='student_id',
   ....:     anonymize_fields={
   ....:         'address': 'address'
   ....:     }
   ....: )
   ....: 

In [27]: model.fit(data_pii)

As a result, we can see how the real address values have been replaced by fake addresses that were not taken from the real data that the model learned.

In [28]: new_data_pii = model.sample(200)

In [29]: new_data_pii.head()
Out[29]: 
   student_id                                            address gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc        salary  placed start_date end_date duration
0           0        61244 Bryan Fort\nPort Derekmouth, IN 37997      F    48.598021  53.000171   Science    80.691675    Sci&Tech            False                 0           97.191270  Mkt&Fin  72.761389           NaN   False 2020-01-04      NaT      6.0
1           1         43155 Perry Island\nEast Melissa, VT 44821      F    72.866875  87.156717  Commerce    77.068867   Comm&Mgmt            False                 1           83.429756  Mkt&Fin  64.143684           NaN   False        NaT      NaT      3.0
2           2  46668 David Mission Suite 507\nOlsonborough, W...      M    73.393634  62.651596   Science    93.250913   Comm&Mgmt            False                 1           78.143185  Mkt&Fin  69.721572  28495.619069    True        NaT      NaT      3.0
3           3  29159 Mccarthy Village Apt. 880\nNorth Jennife...      F    44.975299  98.821769  Commerce    75.560103   Comm&Mgmt            False                 0           97.999915   Mkt&HR  65.658346           NaN   False 2020-01-13      NaT      3.0
4           4  166 Nielsen Divide Apt. 524\nFigueroashire, KY...      F    71.084187  95.284468  Commerce    87.164345   Comm&Mgmt            False                 0           84.849699  Mkt&Fin  71.587327           NaN   False 2020-01-04      NaT     12.0

Which means that none of the original addresses can be found in the sampled data:

In [30]: data_pii.address.isin(new_data_pii.address).sum()
Out[30]: 0

Advanced Usage

Now that we have discovered the basics, let’s go over a few more advanced usage examples and see the different arguments that we can pass to our CopulaGAN Model in order to customize it to our needs.

Exploring the Probability Distributions

During the previous steps, every time we fitted the CopulaGAN it performed the following operations:

  1. Learn the format and data types of the passed data

  2. Transform the non-numerical and null data using Reversible Data Transforms to obtain a fully numerical representation of the data from which we can learn the probability distributions.

  3. Learn the probability distribution of each column from the table

  4. Transform the values of each numerical column by converting them to their marginal distribution CDF values and then applying an inverse CDF transformation of a standard normal on them.

  5. Fit a CTGAN model on the transformed data, which learns how each column is correlated to the others.

After this, when we used the model to generate new data for our table using the sample method, it did:

  1. Sample rows from the CTGAN model.

  2. Revert the sampled values by computing their standard normal CDF and then applying the inverse CDF of their marginal distributions.

  3. Revert the RDT transformations to go back to the original data format.
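To make the CDF-based transformation more concrete, here is a minimal sketch of the forward step used during fitting (step 4 above) and its reversal used during sampling (step 2 above). It uses scipy and a plain normal marginal on a single column purely as an illustration; it is not the internal SDV implementation, which selects the marginal distribution per column.

import numpy as np
from scipy import stats

# One numerical column and an illustrative fitted marginal (a plain normal).
column = data['degree_perc'].dropna().to_numpy()
loc, scale = stats.norm.fit(column)

# Fitting, step 4: marginal CDF, then inverse CDF of a standard normal.
uniform = stats.norm.cdf(column, loc, scale)
gaussianized = stats.norm.ppf(uniform)

# Sampling, step 2: standard normal CDF, then inverse CDF of the marginal.
recovered = stats.norm.ppf(stats.norm.cdf(gaussianized), loc, scale)
print(np.allclose(recovered, column))  # True: the transformation is reversible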

As you can see, during these steps the Marginal Probability Distributions have a very important role, since the CopulaGAN had to learn and reproduce the individual distributions of each column in our table. We can explore the distributions which the CopulaGAN used to model each column using its get_distributions method:

In [31]: model = CopulaGAN(
   ....:     primary_key='student_id'
   ....: )
   ....: 

In [32]: model.fit(data)

In [33]: distributions = model.get_distributions()

This will return a dict which contains the name of the distribution class used for each column:

In [34]: distributions
Out[34]: 
{'second_perc': 'copulas.univariate.truncated_gaussian.TruncatedGaussian',
 'high_perc': 'copulas.univariate.log_laplace.LogLaplace',
 'degree_perc': 'copulas.univariate.student_t.StudentTUnivariate',
 'work_experience': 'copulas.univariate.student_t.StudentTUnivariate',
 'experience_years': 'copulas.univariate.gaussian.GaussianUnivariate',
 'employability_perc': 'copulas.univariate.truncated_gaussian.TruncatedGaussian',
 'mba_perc': 'copulas.univariate.gamma.GammaUnivariate',
 'placed': 'copulas.univariate.gamma.GammaUnivariate',
 'salary#0': 'copulas.univariate.gamma.GammaUnivariate',
 'salary#1': 'copulas.univariate.gaussian.GaussianUnivariate',
 'start_date#0': 'copulas.univariate.gamma.GammaUnivariate',
 'start_date#1': 'copulas.univariate.gaussian.GaussianUnivariate',
 'end_date#0': 'copulas.univariate.gamma.GammaUnivariate',
 'end_date#1': 'copulas.univariate.gaussian.GaussianUnivariate'}

Note

In this dict we can see multiple distributions for some of the columns in our data. This is because the RDT transformations used to encode the data numerically often use more than one column to represent each one of the input variables.

Let’s explore the individual distribution of one of the columns in our data to better understand how the CopulaGAN processed them and see if we can improve the results by manually specifying a different distribution. For example, let’s explore the experience_years column by looking at the frequency of its values within the original data:

In [35]: data.experience_years.value_counts()
Out[35]: 
0    141
1     65
2      8
3      1
Name: experience_years, dtype: int64

In [36]: data.experience_years.hist();
../../_images/copulagan_experience_years_1.png

By observing the data we can see that the behavior of the values in this column is very similar to a Gamma or even some types of Beta distribution, where the majority of the values are 0 and the frequency decreases as the values increase.

Was the CopulaGAN able to capture this distribution on its own?

In [37]: distributions['experience_years']
Out[37]: 'copulas.univariate.gaussian.GaussianUnivariate'

It seems that it was not: the model rather thought that the behavior was closer to a Gaussian distribution. And, as a result, we can see how the generated values do not follow the original frequencies closely and even include values, such as 4, that never appeared in the original data:

In [38]: new_data.experience_years.value_counts()
Out[38]: 
0    130
1     51
3     11
2      5
4      3
Name: experience_years, dtype: int64

In [39]: new_data.experience_years.hist();
../../_images/copulagan_experience_years_2.png

Let’s see how we can improve this situation by passing the CopulaGAN the exact distribution that we want it to use for this column.

Setting distributions for individual variables

The CopulaGAN class offers the possibility to indicate which distribution to use for each one of the columns in the table, in order to solve situations like the one that we just described. In order to do this, we need to pass a field_distributions argument with a dict that maps each column name to the distribution that we want to use for it.

Possible values for the distribution argument are:

  • univariate: Let copulas select the optimal univariate distribution. This may result in non-parametric models being used.

  • parametric: Let copulas select the optimal univariate distribution, but restrict the selection to parametric distributions only.

  • bounded: Let copulas select the optimal univariate distribution, but restrict the selection to bounded distributions only. This may result in non-parametric models being used.

  • semi_bounded: Let copulas select the optimal univariate distribution, but restrict the selection to semi-bounded distributions only. This may result in non-parametric models being used.

  • parametric_bounded: Let copulas select the optimal univariate distribution, but restrict the selection to parametric and bounded distributions only.

  • parametric_semi_bounded: Let copulas select the optimal univariate distribution, but restrict the selection to parametric and semi-bounded distributions only.

  • gaussian: Use a Gaussian distribution.

  • gamma: Use a Gamma distribution.

  • beta: Use a Beta distribution.

  • student_t: Use a Student T distribution.

  • gaussian_kde: Use a GaussianKDE distribution. This model is non-parametric, so using this will make get_parameters unusable.

  • truncated_gaussian: Use a Truncated Gaussian distribution.
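For reference, several of these values can be combined in a single field_distributions dict. The column-to-distribution pairing below is only an illustrative sketch, not a recommendation for this dataset:

model = CopulaGAN(
    primary_key='student_id',
    field_distributions={
        'degree_perc': 'truncated_gaussian',       # bounded percentage column
        'mba_perc': 'beta',                        # another bounded column
        'employability_perc': 'parametric_bounded' # let copulas pick a parametric, bounded model
    }
)
model.fit(data)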

Let’s see what happens if we make the CopulaGAN use the gamma distribution for our column.

In [40]: model = CopulaGAN(
   ....:     primary_key='student_id',
   ....:     field_distributions={
   ....:         'experience_years': 'gamma'
   ....:     }
   ....: )
   ....: 

In [41]: model.fit(data)

After this, we can see how the CopulaGAN used the indicated distribution for the experience_years column:

In [42]: model.get_distributions()['experience_years']
Out[42]: 'copulas.univariate.gamma.GammaUnivariate'

And, as a result, we can now see how the generated data has a behavior which is closer to the original data and always stays within the valid value range.

In [43]: new_data = model.sample(len(data))

In [44]: new_data.experience_years.value_counts()
Out[44]: 
0    126
2     23
1     23
3     22
4     17
5      4
Name: experience_years, dtype: int64

In [45]: new_data.experience_years.hist();
../../_images/copulagan_experience_years_3.png

Note

Even though there are situations like the one shown above where manually choosing a distribution seems to give better results, in most cases the CopulaGAN will be able to find the optimal distribution on its own, making this manual search of the marginal distributions necessary only on rare occasions.

How to modify the CopulaGAN Hyperparameters?

Apart from the arguments explained above, CopulaGAN has a number of additional hyperparameters that control its learning behavior and can impact the performance of the model, both in terms of the quality of the generated data and the computational time:

  • epochs and batch_size: these arguments control the number of iterations that the model will perform to optimize its parameters, as well as the number of samples used in each step. Their default values are 300 and 500 respectively, and batch_size always needs to be a multiple of 10.

    These hyperparameters have a very direct effect on how long the training process lasts, but also on the quality of the generated data, so for new datasets you might want to start by setting low values on both of them to see how long the training process takes on your data, and later increase them in order to improve the results.

  • log_frequency: Whether to use log frequency of categorical levels in conditional sampling. It defaults to True. This argument affects how the model processes the frequencies of the categorical values that are used to condition the rest of the values. In some cases, changing it to False could lead to better performance.

  • embedding_dim (int): Size of the random sample passed to the Generator. Defaults to 128.

  • generator_dim (tuple or list of ints): Size of the output samples for each one of the Residuals. A Residual Layer will be created for each one of the values provided. Defaults to (256, 256).

  • discriminator_dim (tuple or list of ints): Size of the output samples for each one of the Discriminator Layers. A Linear Layer will be created for each one of the values provided. Defaults to (256, 256).

  • generator_lr (float): Learning rate for the generator. Defaults to 2e-4.

  • generator_decay (float): Generator weight decay for the Adam Optimizer. Defaults to 1e-6.

  • discriminator_lr (float): Learning rate for the discriminator. Defaults to 2e-4.

  • discriminator_decay (float): Discriminator weight decay for the Adam Optimizer. Defaults to 1e-6.

  • discriminator_steps (int): Number of discriminator updates to do for each generator update. From the WGAN paper: https://arxiv.org/abs/1701.07875. WGAN paper default is 5. Default used is 1 to match original CTGAN implementation.

  • verbose: Whether to print fit progress on stdout. Defaults to False.

Warning

Notice that the value that you set on the batch_size argument must always be a multiple of 10!
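The worked example below focuses on epochs, batch_size and the layer sizes; as a hedged sketch, the optimizer-related arguments from the list above could be adjusted in the same way (the values shown are illustrative, not tuned recommendations):

model = CopulaGAN(
    primary_key='student_id',
    epochs=100,
    batch_size=100,             # must always be a multiple of 10
    generator_lr=2e-4,          # learning rate of the generator
    discriminator_lr=2e-4,      # learning rate of the discriminator
    generator_decay=1e-6,       # Adam weight decay for the generator
    discriminator_decay=1e-6,   # Adam weight decay for the discriminator
    discriminator_steps=1,      # discriminator updates per generator update
    verbose=True,               # print fit progress on stdout
)
model.fit(data)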

As an example, we will try to fit the CopulaGAN model slightly increasing the number of epochs, reducing the batch_size, and adding one additional layer to both the generator and the discriminator.

Before we start, we will evaluate the quality of the previously generated data using the sdv.evaluation.evaluate function

In [46]: from sdv.evaluation import evaluate

In [47]: evaluate(new_data, data)
Out[47]: 0.4293634499496008

Afterwards, we create a new instance of the CopulaGAN model with the hyperparameter values that we want to use

In [48]: model = CopulaGAN(
   ....:     primary_key='student_id',
   ....:     epochs=500,
   ....:     batch_size=100,
   ....:     generator_dim=(256, 256, 256),
   ....:     discriminator_dim=(256, 256, 256)
   ....: )
   ....: 

And fit it to our data.

In [49]: model.fit(data)

Finally, we are ready to generate new data and evaluate the results.

In [50]: new_data = model.sample(len(data))

In [51]: evaluate(new_data, data)
Out[51]: 0.44570271728859173

As we can see, in this case these modifications changed the obtained results slightly, but they did not introduce dramatic changes in performance.

Conditional Sampling

As the name implies, conditional sampling allows us to sample from a conditional distribution using the CopulaGAN model, which means we can generate only values that satisfy certain conditions. These conditional values can be passed to the conditions parameter in the sample method either as a dataframe or a dictionary.

In case a dictionary is passed, the model will generate as many rows as requested, all of which will satisfy the specified conditions, such as gender = M.

In [52]: conditions = {
   ....:     'gender': 'M'
   ....: }
   ....: 

In [53]: model.sample(5, conditions=conditions)
Out[53]: 
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc        salary  placed start_date   end_date duration
0           0      M    62.824691  89.994451  Commerce    79.616881    Sci&Tech             True                 1           80.313493  Mkt&Fin  62.488066  33928.559018   False 2020-03-02        NaT      3.0
1           2      M    51.421063  94.999539  Commerce    63.707466   Comm&Mgmt             True                 1           76.051231   Mkt&HR  59.646386  24479.723607    True        NaT 2020-09-21      NaN
2           4      M    73.327214  67.459283  Commerce    64.802464   Comm&Mgmt            False                 0           87.144877  Mkt&Fin  57.198298           NaN    True 2020-04-05 2020-07-06      NaN
3           0      M    62.473799  66.504021   Science    54.757391   Comm&Mgmt             True                 0           81.076903  Mkt&Fin  58.118415  23146.773791   False 2020-09-14 2020-10-06      3.0
4           2      M    84.812985  63.508492   Science    79.304501   Comm&Mgmt            False                 0           75.208023  Mkt&Fin  51.563275  30131.412004    True        NaT 2020-12-06      3.0

It’s also possible to condition on multiple columns, such as gender = M and experience_years = 0.

In [54]: conditions = {
   ....:     'gender': 'M',
   ....:     'experience_years': 0
   ....: }
   ....: 

In [55]: model.sample(5, conditions=conditions)
Out[55]: 
   student_id gender  second_perc   high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc        salary  placed start_date   end_date duration
0           1      M    67.080566   76.559468   Science    81.830111    Sci&Tech            False                 0           94.280770  Mkt&Fin  58.870468  25109.025716    True        NaT        NaT      3.0
1           2      M    79.033351  110.711037   Science    68.900305   Comm&Mgmt            False                 0           89.802652   Mkt&HR  70.056052  24073.464084    True 2020-01-30 2020-08-13      3.0
2           4      M    42.711550   64.964331   Science    74.785012    Sci&Tech             True                 0           60.703893  Mkt&Fin  58.350176  24626.037112    True 2020-01-18 2020-02-09      NaN
3           1      M    41.139236   62.862007  Commerce    47.060809    Sci&Tech            False                 0           84.811220   Mkt&HR  53.093891           NaN    True 2020-02-26 2020-08-19      6.0
4           1      M    78.033437   78.561666   Science    78.033286    Sci&Tech             True                 0           55.852691  Mkt&Fin  73.215443  28218.308047    True        NaT 2020-06-18     12.0

The conditions can also be passed as a dataframe. In that case, the model will generate one sample for each row of the dataframe, in the same order. Since the model already knows how many samples to generate, it is not necessary to pass the number of rows as a parameter. For example, if we want to generate three samples where gender = M and three samples with gender = F, we can do the following:

In [56]: import pandas as pd

In [57]: conditions = pd.DataFrame({
   ....:     'gender': ['M', 'M', 'M', 'F', 'F', 'F'],
   ....: })
   ....: 

In [58]: model.sample(conditions=conditions)
Out[58]: 
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc        salary  placed start_date   end_date duration
0           0      M    87.433534  62.038618   Science    72.119368   Comm&Mgmt            False                 0           56.510793   Mkt&HR  72.399082  21962.974001    True        NaT        NaT     12.0
1           2      M    87.772965  68.389526   Science    64.863942   Comm&Mgmt            False                 1           98.000000   Mkt&HR  55.841082  28432.974984    True 2020-01-06 2020-06-02      3.0
2           0      M    45.575536  55.042077  Commerce    80.583900   Comm&Mgmt             True                 0           83.856533   Mkt&HR  67.077801  47637.934860    True 2020-03-25        NaT      6.0
3           1      F    42.227821  58.204189  Commerce    75.330320   Comm&Mgmt            False                 0           75.372946  Mkt&Fin  64.055297           NaN    True 2020-02-23 2020-07-22      3.0
4           0      F    73.056691  89.878789  Commerce    81.565637   Comm&Mgmt            False                 1           65.926540   Mkt&HR  53.553933  20715.466898    True        NaT        NaT      3.0
5           1      F    56.742777  52.504886  Commerce    61.510134   Comm&Mgmt            False                 1           93.652418   Mkt&HR  69.574096  28118.452860    True 2020-11-23 2020-07-31      3.0

CopulaGAN also supports conditioning on continuous values, as long as the values are within the range of values seen during training. For example, if all the values of a column are between 0 and 1, CopulaGAN will not be able to set that column to 1000.

In [59]: conditions = {
   ....:     'degree_perc': 70.0
   ....: }
   ....: 

In [60]: model.sample(5, conditions=conditions)
Out[60]: 
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc        salary  placed start_date   end_date duration
0           8      F    78.024897  91.492040   Science         70.0   Comm&Mgmt             True                 0           67.199453   Mkt&HR  50.694556  74445.440718    True        NaT 2020-06-10      NaN
1          17      M    57.464505  37.361945  Commerce         70.0    Sci&Tech             True                 0           72.781478  Mkt&Fin  67.654225  23886.987968    True 2020-01-10 2020-07-04      NaN
2          23      M    57.807262  73.206394  Commerce         70.0   Comm&Mgmt            False                 0           97.675316  Mkt&Fin  59.201454           NaN    True 2020-01-20 2021-02-02      3.0
3           4      M    77.997035  43.882148  Commerce         70.0   Comm&Mgmt            False                 2           97.923731  Mkt&Fin  66.747637  26339.359603    True 2020-01-22 2020-06-04      3.0
4           6      F    69.974488  64.250549  Commerce         70.0    Sci&Tech            False                 0           74.059313  Mkt&Fin  49.244701           NaN    True 2020-07-03        NaT      3.0

Note

Currently, conditional sampling works through a rejection sampling process, where rows are sampled repeatedly until one that satisfies the conditions is found. In case you are running into a Could not get enough valid rows within x trials error, or simply wish to optimize the results, there are three parameters that can be fine-tuned: max_rows_multiplier, max_retries and float_rtol. More information about these parameters can be found in the API section.
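As a hedged sketch, assuming these three names are accepted as keyword arguments of the sample method (check the API section for the exact signature of your SDV version), the tuning could look like this:

conditions = {
    'degree_perc': 70.0
}

# Illustrative values only; see the API section for the exact behavior.
sampled = model.sample(
    5,
    conditions=conditions,
    max_rows_multiplier=20,   # sample more candidate rows per attempt
    max_retries=200,          # allow more rejection-sampling retries
    float_rtol=0.05,          # tolerance when matching continuous conditions
)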

How do I specify constraints?

If you look closely at the data you may notice that some properties were not completely captured by the model. For example, you may have seen that sometimes the model produces an experience_years number greater than 0 while also indicating that work_experience is False. These types of properties are what we call Constraints and can also be handled using SDV. For further details about them please visit the Handling Constraints guide.
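For example, a quick pandas check (purely illustrative, not an SDV API) can count how many sampled rows show this inconsistency:

# Rows that report no work experience but a positive number of experience years.
violations = (
    (new_data['work_experience'] == False)
    & (new_data['experience_years'] > 0)
).sum()
print(violations)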

Can I evaluate the Synthetic Data?

A very common question when someone starts using SDV to generate synthetic data is: “How good is the data that I just generated?”

In order to answer this question, SDV has a collection of metrics and tools that allow you to compare the real data that you provided with the synthetic data that you generated using SDV or any other tool.

You can read more about this in the Synthetic Data Evaluation guide.