CTGAN Model

In this guide we will go through a series of steps that will let you discover the functionalities of the CTGAN model, including how to:

  • Create an instance of CTGAN.

  • Fit the instance to your data.

  • Generate synthetic versions of your data.

  • Use CTGAN to anonymize Personally Identifiable Information (PII).

  • Customize the data transformations to improve the learning process.

  • Specify hyperparameters to improve the output quality.

What is CTGAN?

The sdv.tabular.CTGAN model is based on the GAN-based Deep Learning data synthesizer presented at the NeurIPS 2019 conference in the paper titled Modeling Tabular data using Conditional GAN.

Let’s now discover how to learn a dataset and later on generate synthetic data with the same format and statistical properties by using the CTGAN class from SDV.

Quick Usage

We will start by loading one of our demo datasets, the student_placements, which contains information about MBA students that applied for placements during the year 2020.

Warning

In order to follow this guide you need to have ctgan installed on your system. If you have not done it yet, please install it now by executing the command pip install sdv in a terminal (ctgan is installed as a dependency of sdv).

In [1]: from sdv.demo import load_tabular_demo

In [2]: data = load_tabular_demo('student_placements')

In [3]: data.head()
Out[3]: 
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec  mba_perc   salary  placed start_date   end_date duration
0       17264      M        67.00      91.00  Commerce        58.00    Sci&Tech            False                 0                55.0   Mkt&HR     58.80  27000.0    True 2020-07-23 2020-10-12      3.0
1       17265      M        79.33      78.33   Science        77.48    Sci&Tech             True                 1                86.5  Mkt&Fin     66.28  20000.0    True 2020-01-11 2020-04-09      3.0
2       17266      M        65.00      68.00      Arts        64.00   Comm&Mgmt            False                 0                75.0  Mkt&Fin     57.80  25000.0    True 2020-01-26 2020-07-13      6.0
3       17267      M        56.00      52.00   Science        52.00    Sci&Tech            False                 0                66.0   Mkt&HR     59.43      NaN   False        NaT        NaT      NaN
4       17268      M        85.80      73.60  Commerce        73.30   Comm&Mgmt            False                 0                96.8  Mkt&Fin     55.50  42500.0    True 2020-07-04 2020-09-27      3.0

As you can see, this table contains information about students which includes, among other things:

  • Their id and gender

  • Their grades and specializations

  • Their work experience

  • The salary that they were offered

  • The duration and dates of their placement

You will notice that there is data with the following characteristics:

  • There are float, integer, boolean, categorical and datetime values.

  • There are some variables that have missing data. In particular, all the data related to the placement details is missing in the rows where the student was not placed.

Let us use CTGAN to learn this data and then sample synthetic data about new students to see how well the model captures the characteristics indicated above. In order to do this you will need to:

  • Import the sdv.tabular.CTGAN class and create an instance of it.

  • Call its fit method passing our table.

  • Call its sample method indicating the number of synthetic rows that you want to generate.

In [4]: from sdv.tabular import CTGAN

In [5]: model = CTGAN()

In [6]: model.fit(data)

Note

Notice that the model fitting process took care of transforming the different fields using the appropriate Reversible Data Transforms to ensure that the data has a format that the underlying CTGANSynthesizer class can handle.
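The transformations can also be customized. If you want more control over how a particular field is prepared before training, the SDV tabular models accept a field_transformers argument that maps column names to transformer names. The snippet below is only a minimal sketch of this idea; the specific transformer names used ('label_encoding' and 'float') are assumptions, so check the Tabular Models API for the exact names supported by your SDV version.

from sdv.tabular import CTGAN

# Minimal sketch: override the transformers used for a couple of columns.
# The transformer names below are assumptions; check the Tabular Models API
# for the exact options supported by your SDV version.
model = CTGAN(
    field_transformers={
        'high_spec': 'label_encoding',  # encode this categorical column as integer labels
        'mba_perc': 'float',            # treat this column as a plain float
    }
)
model.fit(data)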

Generate synthetic data from the model

Once the modeling has finished you are ready to generate new synthetic data by calling the sample method from your model, passing the number of rows that you want to generate.

In [7]: new_data = model.sample(200)

This will return a table with the same format as the one the model was fitted on, but filled with new data that resembles the original one.

In [8]: new_data.head()
Out[8]: 
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc        salary  placed start_date   end_date duration
0       17415      M    58.996629  71.638601  Commerce    63.810096   Comm&Mgmt            False                 0           87.798829  Mkt&Fin  48.215954           NaN   False        NaT 2020-10-21      3.0
1       17469      F    74.317264  59.589704      Arts    51.299096    Sci&Tech            False                 0           53.164798  Mkt&Fin  59.164475           NaN    True 2020-03-27 2020-04-13     12.0
2       17391      F    70.444916  75.801224  Commerce    60.143831      Others             True                 0           63.634851  Mkt&Fin  58.667847           NaN    True 2020-03-18        NaT      3.0
3       17474      M    62.365941  39.238885   Science    71.710117   Comm&Mgmt            False                 0           44.994612  Mkt&Fin  59.745547  17107.444836    True        NaT        NaT      NaN
4       17394      M    69.564289  59.864980   Science    54.205378    Sci&Tech            False                 0           50.756422   Mkt&HR  58.645687  20401.720283    True 2020-02-25 2020-09-08      NaN

Note

You can control the number of generated rows by passing it to model.sample(<num_rows>). To test this, try model.sample(10000). Note that the original table only had ~200 rows.

Save and Load the model

In many scenarios it will be convenient to generate synthetic versions of your data directly in systems that do not have access to the original data source. For example, you may want to generate testing data on the fly inside a testing environment that does not have access to your production database. In these scenarios, fitting the model with real data every time that you need to generate new data is not feasible, so you will need to fit a model in your production environment, save the fitted model into a file, send this file to the testing environment and then load it there to be able to sample from it.

Let’s see how this process works.

Save and share the model

Once you have fitted the model, all you need to do is call its save method passing the name of the file in which you want to save the model. Note that the extension of the filename is not relevant, but we will be using the .pkl extension to highlight that the serialization protocol used is pickle.

In [9]: model.save('my_model.pkl')

This will have created a file called my_model.pkl in the same directory in which you are running SDV.

Important

If you inspect the generated file you will notice that its size is much smaller than the size of the data that you used to generate it. This is because the serialized model contains no information about the original data, other than the parameters it needs to generate synthetic versions of it. This means that you can safely share this my_model.pkl file without the risk of disclosing any of your real data!
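As a quick sanity check, you can compare the size of the serialized file with the approximate in-memory size of the original table. This is a minimal sketch that only uses the standard library and pandas:

import os

# Size of the serialized model on disk, in bytes.
print(os.path.getsize('my_model.pkl'))

# Approximate in-memory size of the original table, in bytes.
print(data.memory_usage(deep=True).sum())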

Load the model and generate new data

The file you just generated can be sent over to the system where the synthetic data will be generated. Once it is there, you can load it using the CTGAN.load method, and then you are ready to sample new data from the loaded instance:

In [10]: loaded = CTGAN.load('my_model.pkl')

In [11]: new_data = loaded.sample(200)

Warning

Notice that the system where the model is loaded needs to also have sdv and ctgan installed, otherwise it will not be able to load the model and use it.

Specifying the Primary Key of the table

One of the first things that you may have noticed when looking at the demo data is that there is a student_id column which acts as the primary key of the table, and which is supposed to have unique values. Indeed, if we look at the number of times that each value appears, we see that all of them appear only once:

In [12]: data.student_id.value_counts().max()
Out[12]: 1

However, if we look at the synthetic data that we generated, we observe that there are some values that appear more than once:

In [13]: new_data[new_data.student_id == new_data.student_id.value_counts().index[0]]
Out[13]: 
     student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc        salary  placed start_date   end_date duration
11        17432      M    90.417936  53.326898   Science    73.203085    Sci&Tech            False                 1           70.048738  Mkt&Fin  61.033321  26934.622888    True 2020-01-16        NaT      NaN
50        17432      M    88.758628  55.821830  Commerce    60.342289   Comm&Mgmt            False                 1           88.821372   Mkt&HR  61.119847  11309.279225    True 2020-01-10        NaT      3.0
101       17432      M    56.512322  66.586250  Commerce    73.761405    Sci&Tech            False                 1           63.200270   Mkt&HR  64.833396  40088.295446    True 2020-03-04 2020-07-16     12.0
141       17432      M    82.855050  97.057838   Science    66.780161   Comm&Mgmt             True                 0           80.473435  Mkt&Fin  61.648138  -7748.213461    True 2020-01-06        NaT      NaN

This happens because the model was not notified at any point about the fact that the student_id had to be unique, so when it generates new data it will eventually produce duplicate values. In order to solve this, we can pass the argument primary_key to our model when we create it, indicating the name of the column that acts as the index of the table.

In [14]: model = CTGAN(
   ....:     primary_key='student_id'
   ....: )
   ....: 

In [15]: model.fit(data)

In [16]: new_data = model.sample(200)

In [17]: new_data.head()
Out[17]: 
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc        salary  placed start_date   end_date duration
0           0      M    92.745022  74.901954  Commerce    60.241747   Comm&Mgmt            False                 0           69.254793  Mkt&Fin  55.137732           NaN    True 2020-01-30 2020-07-29      NaN
1           1      M    67.226143  85.044403  Commerce    76.553777    Sci&Tech            False                 0           50.063479  Mkt&Fin  51.811657  30981.871859    True        NaT 2020-04-23      3.0
2           2      F    94.588261  45.233045  Commerce    47.259991   Comm&Mgmt            False                 0           88.192169  Mkt&Fin  55.022099           NaN    True 2020-01-06 2020-09-14     12.0
3           3      M    87.751559  48.008146   Science    79.227244    Sci&Tech            False                 0           55.297518   Mkt&HR  51.081906  28552.524283   False        NaT 2020-09-02      3.0
4           4      M    73.505802  91.218619   Science    71.372142   Comm&Mgmt            False                 0           78.962284  Mkt&Fin  54.207192           NaN   False 2020-03-06 2020-08-13      NaN

As a result, the model will learn that this column must be unique and generate a unique sequence of values for the column:

In [18]: new_data.student_id.value_counts().max()
Out[18]: 1

Anonymizing Personally Identifiable Information (PII)

There will be many cases where the data will contain Personally Identifiable Information which we cannot disclose. In these cases, we will want our Tabular Models to replace the information within these fields with fake, simulated data that looks similar to the real one but does not contain any of the original values.

Let’s load a new dataset that contains a PII field, the student_placements_pii demo, and try to generate synthetic versions of it that do not contain any of the PII fields.

Note

The student_placements_pii dataset is a modified version of the student_placements dataset with one new field, address, which contains PII information about the students. Notice that this additional address field has been simulated and does not correspond to data from the real users.

In [19]: data_pii = load_tabular_demo('student_placements_pii')

In [20]: data_pii.head()
Out[20]: 
   student_id                                            address gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec  mba_perc   salary  placed start_date   end_date duration
0       17264        70304 Baker Turnpike\nEricborough, MS 15086      M        67.00      91.00  Commerce        58.00    Sci&Tech            False                 0                55.0   Mkt&HR     58.80  27000.0    True 2020-07-23 2020-10-12      3.0
1       17265    805 Herrera Avenue Apt. 134\nMaryview, NJ 36510      M        79.33      78.33   Science        77.48    Sci&Tech             True                 1                86.5  Mkt&Fin     66.28  20000.0    True 2020-01-11 2020-04-09      3.0
2       17266        3702 Bradley Island\nNorth Victor, FL 12268      M        65.00      68.00      Arts        64.00   Comm&Mgmt            False                 0                75.0  Mkt&Fin     57.80  25000.0    True 2020-01-26 2020-07-13      6.0
3       17267                   Unit 0879 Box 3878\nDPO AP 42663      M        56.00      52.00   Science        52.00    Sci&Tech            False                 0                66.0   Mkt&HR     59.43      NaN   False        NaT        NaT      NaN
4       17268  96493 Kelly Canyon Apt. 145\nEast Steven, NC 3...      M        85.80      73.60  Commerce        73.30   Comm&Mgmt            False                 0                96.8  Mkt&Fin     55.50  42500.0    True 2020-07-04 2020-09-27      3.0

If we use our tabular model on this new data we will see how the synthetic data that it generates discloses the addresses from the real students:

In [21]: model = CTGAN(
   ....:     primary_key='student_id',
   ....: )
   ....: 

In [22]: model.fit(data_pii)

In [23]: new_data_pii = model.sample(200)

In [24]: new_data_pii.head()
Out[24]: 
   student_id                                            address gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc        salary  placed start_date   end_date duration
0           0                          USS Sanchez\nFPO AA 16841      F    80.248517  88.042206  Commerce    74.932701   Comm&Mgmt            False                 2           64.127380   Mkt&HR  79.166379  16083.692267    True        NaT 2020-08-22      NaN
1           1  8398 Seth Vista Apt. 266\nSouth Lauraberg, VA ...      F    53.744461  82.460488   Science    72.244045   Comm&Mgmt            False                 0           82.151231   Mkt&HR  81.977538  17228.860302    True 2019-12-29        NaT      6.0
2           2        8034 Freeman Meadows\nSouth Bryce, NJ 14719      F    85.566425  56.259839  Commerce    74.140825   Comm&Mgmt            False                 0           52.029352  Mkt&Fin  71.427603           NaN    True 2020-01-12        NaT      6.0
3           3             8368 Sarah Well\nNewmanville, WA 69934      F    68.441639  62.165037   Science    62.218357    Sci&Tech            False                 0           74.854105  Mkt&Fin  72.715475           NaN    True 2020-06-06        NaT     12.0
4           4          5376 Amanda Terrace\nSouth Glen, ID 04884      F    75.812732  83.023668      Arts    60.087456    Sci&Tech             True                 0           73.093631  Mkt&Fin  69.314887  26434.832457    True 2020-02-23 2020-08-18      6.0

More specifically, we can see how all the addresses that have been generated actually come from the original dataset:

In [25]: new_data_pii.address.isin(data_pii.address).sum()
Out[25]: 200

In order to solve this, we can pass an additional argument anonymize_fields to our model when we create the instance. This anonymize_fields argument will need to be a dictionary that contains:

  • The name of the field that we want to anonymize.

  • The category of the field that we want to use when we generate fake values for it.

The complete list of possible categories can be seen in the Faker Providers page, and it contains a huge list of concepts such as:

  • name

  • address

  • country

  • city

  • ssn

  • credit_card_number

  • credit_card_expire

  • credit_card_security_code

  • email

  • telephone

In this case, since the field contains an address, we will pass a dictionary indicating the category address:

In [26]: model = CTGAN(
   ....:     primary_key='student_id',
   ....:     anonymize_fields={
   ....:         'address': 'address'
   ....:     }
   ....: )
   ....: 

In [27]: model.fit(data_pii)

As a result, we can see how the real address values have been replaced by other fake addresses:

In [28]: new_data_pii = model.sample(200)

In [29]: new_data_pii.head()
Out[29]: 
   student_id                                            address gender  second_perc   high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc        salary  placed start_date   end_date duration
0           0  6826 Leblanc Harbors Suite 864\nLake Garyfort,...      F    61.669164   76.039806   Science    60.627468   Comm&Mgmt            False                 0           44.465228  Mkt&Fin  57.092623  26749.864738    True 2020-01-28 2020-07-09     12.0
1           1   5620 Monica Ports\nSouth Baileyborough, WA 89720      F    48.226669   76.861218  Commerce    72.447540   Comm&Mgmt            False                 0           48.119016   Mkt&HR  49.886540           NaN   False 2019-12-17        NaT      NaN
2           2     2723 Sandra Parkway\nEast Garrettton, MI 64638      M    56.963847   49.730330   Science    60.236265    Sci&Tech            False                 0           69.418668  Mkt&Fin  52.712834  28265.921442    True        NaT 2020-07-14      6.0
3           3    3796 Diana Curve Apt. 786\nEast Megan, OH 48006      M    84.105554   78.782977  Commerce    63.126444   Comm&Mgmt            False                 0           64.808281  Mkt&Fin  50.562953  27197.574172   False        NaT 2020-10-31      NaN
4           4  85565 Ward Knoll Suite 358\nVincentborough, CO...      M    59.002053  107.006692  Commerce    47.618939   Comm&Mgmt             True                 0           71.231515   Mkt&HR  49.360704           NaN   False        NaT        NaT      NaN

Which means that none of the original addresses can be found in the sampled data:

In [30]: data_pii.address.isin(new_data_pii.address).sum()
Out[30]: 0

Advanced Usage

Now that we have discovered the basics, let’s go over a few more advanced usage examples and see the different arguments that we can pass to our CTGAN Model in order to customize it to our needs.

How to modify the CTGAN Hyperparameters?

Apart from the common Tabular Model arguments, CTGAN has a number of additional hyperparameters that control its learning behavior and can impact the performance of the model, both in terms of the quality of the generated data and the computational time:

  • epochs and batch_size: these arguments control the number of iterations that the model will perform to optimize its parameters, as well as the number of samples used in each step. Their default values are 300 and 500 respectively, and batch_size must always be a multiple of 10.

    These hyperparameters have a very direct effect on how long the training process takes, as well as on the quality of the generated data, so for new datasets you might want to start by setting low values on both of them to see how long training takes on your data, and later increase them to improve the results (a quick first pass sketch is shown below).

  • log_frequency: Whether to use log frequency of categorical levels in conditional sampling. It defaults to True. This argument affects how the model processes the frequencies of the categorical values that are used to condition the rest of the values. In some cases, changing it to False could lead to better performance.

  • embedding_dim (int): Size of the random sample passed to the Generator. Defaults to 128.

  • generator_dim (tuple or list of ints): Size of the output samples for each one of the Residuals. A Residual Layer will be created for each one of the values provided. Defaults to (256, 256).

  • discriminator_dim (tuple or list of ints): Size of the output samples for each one of the Discriminator Layers. A Linear Layer will be created for each one of the values provided. Defaults to (256, 256).

  • generator_lr (float): Learning rate for the generator. Defaults to 2e-4.

  • generator_decay (float): Generator weight decay for the Adam Optimizer. Defaults to 1e-6.

  • discriminator_lr (float): Learning rate for the discriminator. Defaults to 2e-4.

  • discriminator_decay (float): Discriminator weight decay for the Adam Optimizer. Defaults to 1e-6.

  • discriminator_steps (int): Number of discriminator updates to do for each generator update. From the WGAN paper: https://arxiv.org/abs/1701.07875. WGAN paper default is 5. Default used is 1 to match original CTGAN implementation.

  • verbose: Whether to print fit progress on stdout. Defaults to False.

  • cuda (bool or str): If True, use CUDA. If a str, use the indicated device. If False, do not use cuda at all.

Warning

Notice that the value that you set on the batch_size argument must always be a multiple of 10!
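Before committing to a long training run, a quick first pass can help you gauge how long fitting takes on your data. The following is a minimal sketch; the specific values are illustrative, not recommendations:

# Quick first pass: very few epochs and verbose progress output, just to time the process.
quick_model = CTGAN(
    primary_key='student_id',
    epochs=10,         # a handful of optimization passes
    batch_size=100,    # must be a multiple of 10
    verbose=True,      # print fit progress on stdout
    cuda=False         # force CPU; pass True or a device string to use CUDA
)
quick_model.fit(data)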

As an example, we will try to fit the CTGAN model slightly increasing the number of epochs, reducing the batch_size, and adding one additional layer to each of the models involved.

Before we start, we will evaluate the quality of the previously generated data using the sdv.evaluation.evaluate function:

In [31]: from sdv.evaluation import evaluate

In [32]: evaluate(new_data, data)
Out[32]: 0.47358137945705026

Afterwards, we create a new instance of the CTGAN model with the hyperparameter values that we want to use:

In [33]: model = CTGAN(
   ....:     primary_key='student_id',
   ....:     epochs=500,
   ....:     batch_size=100,
   ....:     generator_dim=(256, 256, 256),
   ....:     discriminator_dim=(256, 256, 256)
   ....: )
   ....: 

And fit it to our data:

In [34]: model.fit(data)

Finally, we are ready to generate new data and evaluate the results.

In [35]: new_data = model.sample(len(data))

In [36]: evaluate(new_data, data)
Out[36]: 0.44572545994097723

As we can see, in this case these modifications changed the obtained results slightly, but they did not introduce dramatic changes in the performance.

Conditional Sampling

As the name implies, conditional sampling allows us to sample from a conditional distribution using the CTGAN model, which means we can generate only values that satisfy certain conditions. These conditional values can be passed to the conditions parameter in the sample method either as a dataframe or a dictionary.

If a dictionary is passed, the model will generate as many rows as requested, all of which will satisfy the specified conditions, such as gender = M:

In [37]: conditions = {
   ....:     'gender': 'M'
   ....: }
   ....: 

In [38]: model.sample(5, conditions=conditions)
Out[38]: 
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc        salary  placed start_date   end_date duration
0           0      M    28.290712  51.797002  Commerce    40.841523    Sci&Tech            False                 0           72.042408   Mkt&HR  45.010863           NaN   False        NaT        NaT      NaN
1           2      M    81.696535  57.077085      Arts    62.232806   Comm&Mgmt             True                 0           53.189994   Mkt&HR  46.540070  31618.989479   False 2020-03-06 2020-10-05      6.0
2           3      M   102.229980  54.580840  Commerce    94.554095    Sci&Tech            False                 0           77.068197  Mkt&Fin  64.216779  22072.326240    True 2019-12-14 2020-02-27      3.0
3           4      M    76.060251  50.293824  Commerce    62.611387   Comm&Mgmt            False                 0           78.817311  Mkt&Fin  70.411247  25677.625230    True 2020-02-24 2020-09-08      3.0
4           0      M    59.691580  51.596113  Commerce    70.332767   Comm&Mgmt             True                 0           63.206175  Mkt&Fin  74.856288  31858.330887    True        NaT 2020-07-02      6.0

It’s also possible to condition on multiple columns, such as gender = M and experience_years = 0:

In [39]: conditions = {
   ....:     'gender': 'M',
   ....:     'experience_years': 0
   ....: }
   ....: 

In [40]: model.sample(5, conditions=conditions)
Out[40]: 
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc        salary  placed start_date   end_date duration
0           1      M    73.485919  53.261342      Arts    51.141654    Sci&Tech            False                 0           61.800959  Mkt&Fin  60.679747  83144.812556    True        NaT 2020-09-18      NaN
1           2      M    37.619412  22.677742  Commerce    49.101590   Comm&Mgmt            False                 0           79.849598   Mkt&HR  49.733512           NaN   False        NaT        NaT      NaN
2           3      M    61.126271  61.059981  Commerce    41.514696      Others            False                 0           77.409094  Mkt&Fin  56.506821  29141.343673   False 2020-02-21        NaT      3.0
3           4      M    35.964559  40.950881   Science    47.332896    Sci&Tech            False                 0          111.736064   Mkt&HR  56.955224           NaN   False 2020-02-26 2020-09-02      NaN
4           0      M    46.587756  70.120633  Commerce    42.758214   Comm&Mgmt            False                 0           83.202785   Mkt&HR  59.626345           NaN   False 2020-05-26 2021-04-18      NaN

The conditions can also be passed as a dataframe. In that case, the model will generate one sample for each row of the dataframe, in the same order. Since the model already knows how many samples to generate, there is no need to pass the number of rows as a parameter. For example, if we want to generate three samples where gender = M and three samples with gender = F, we can do the following:

In [41]: import pandas as pd

In [42]: conditions = pd.DataFrame({
   ....:     'gender': ['M', 'M', 'M', 'F', 'F', 'F'],
   ....: })
   ....: 

In [43]: model.sample(conditions=conditions)
Out[43]: 
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc        salary  placed start_date   end_date duration
0           0      M    64.252020  40.272738   Science    53.938619    Sci&Tech            False                 1           75.168903   Mkt&HR  53.546725  47493.231618   False        NaT        NaT      NaN
1           2      M    28.381880  54.668176  Commerce    43.432589   Comm&Mgmt            False                 0           72.623675  Mkt&Fin  47.134094           NaN   False        NaT        NaT      NaN
2           0      M    61.836885  58.355751  Commerce    43.628577   Comm&Mgmt             True                 0           52.310129   Mkt&HR  58.811533           NaN   False        NaT 2020-12-05      NaN
3           1      F    86.923627  56.830623  Commerce    73.887249   Comm&Mgmt            False                 0           53.546200   Mkt&HR  53.215357  30987.865703   False 2020-06-09 2021-03-15      6.0
4           2      F    87.185306  32.133005   Science    52.832368    Sci&Tech            False                 1           84.644799  Mkt&Fin  56.782870           NaN    True        NaT        NaT      3.0
5           6      F   101.236967  65.974442  Commerce    62.257487   Comm&Mgmt             True                 0          103.640437  Mkt&Fin  64.550289  28561.590801    True 2020-01-22 2020-03-23      3.0

CTGAN also supports conditioning on continuous values, as long as the values are within the range seen during training. For example, if all the values of a column lie between 0 and 1, CTGAN will not be able to set this column to 1000 in a condition.

In [44]: conditions = {
   ....:     'degree_perc': 70.0
   ....: }
   ....: 

In [45]: model.sample(5, conditions=conditions)
Out[45]: 
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc        salary  placed start_date   end_date duration
0           1      M    84.392593  49.867336  Commerce         70.0    Sci&Tech            False                 0           98.363184  Mkt&Fin  63.443452  23946.566121    True 2020-02-24 2020-02-26      6.0
1           5      F    62.009780  53.225228  Commerce         70.0   Comm&Mgmt            False                 1           74.374311   Mkt&HR  76.868128  24741.656224    True 2020-02-12 2020-09-27      NaN
2          17      F    91.806180  58.870241  Commerce         70.0   Comm&Mgmt             True                 0           73.740159   Mkt&HR  66.317381  25320.011979    True 2020-06-07 2020-09-30     12.0
3          21      M    61.018639  61.444294  Commerce         70.0   Comm&Mgmt            False                 2           76.174053  Mkt&Fin  71.433675           NaN    True 2020-02-05 2020-07-26      3.0
4          24      F    91.603024  41.019216  Commerce         70.0   Comm&Mgmt             True                 0           59.449923  Mkt&Fin  65.359545  31963.759897    True 2020-08-04 2020-12-19      3.0

Note

Currently, conditional sampling works through a rejection sampling process, where rows are sampled repeatedly until one that satisfies the conditions is found. If you are running into a Could not get enough valid rows within x trials error, or simply wish to optimize the results, there are three parameters that can be fine-tuned: max_rows_multiplier, max_retries and float_rtol. More information about these parameters can be found in the API section.
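As an illustration, this is a minimal sketch of passing those parameters to sample; their exact behavior and defaults are described in the API section:

# Sketch: relax the rejection sampling limits when the conditions are hard to satisfy.
# The parameter names come from the note above; see the API section for details.
samples = model.sample(
    5,
    conditions=conditions,
    max_rows_multiplier=20,   # sample more candidate rows per trial
    max_retries=200,          # allow more trials before giving up
    float_rtol=0.05           # accept continuous values within 5% of the condition
)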

How do I specify constraints?

If you look closely at the data you may notice that some properties were not completely captured by the model. For example, you may have seen that sometimes the model produces an experience_years number greater than 0 while also indicating that work_experience is False. These types of properties are what we call Constraints and can also be handled using SDV. For further details about them please visit the Handling Constraints guide.
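As a brief illustration of the idea, the following sketch assumes the GreaterThan constraint from sdv.constraints and the constraints argument of the tabular models, and uses them to require that a placement's end_date never precedes its start_date; see the Handling Constraints guide for the constraints that are actually available and for how to handle cases like the work_experience one above.

from sdv.constraints import GreaterThan
from sdv.tabular import CTGAN

# Sketch: require end_date to be greater than start_date in the synthetic data.
# GreaterThan and its handling_strategy argument are assumptions here; see the
# Handling Constraints guide for the exact constraints available in your version.
placement_dates = GreaterThan(
    low='start_date',
    high='end_date',
    handling_strategy='reject_sampling'
)

model = CTGAN(
    primary_key='student_id',
    constraints=[placement_dates]
)
model.fit(data)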

Can I evaluate the Synthetic Data?

A very common question when someone starts using SDV to generate synthetic data is: “How good is the data that I just generated?”

In order to answer this question, SDV has a collection of metrics and tools that allow you to compare the real data that you provided and the synthetic data that you generated using SDV or any other tool.

You can read more about this in the Synthetic Data Evaluation guide.
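For example, here is a minimal sketch of requesting a per-metric breakdown instead of a single aggregated score; the aggregate argument is an assumption, so check the evaluation guide for the options available in your SDV version.

from sdv.evaluation import evaluate

# Sketch: request individual metric scores instead of a single aggregated value.
# The aggregate argument is an assumption; see the Synthetic Data Evaluation guide.
scores = evaluate(new_data, data, aggregate=False)
print(scores)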