TVAE Model

In this guide we will go through a series of steps that will let you discover functionalities of the TVAE model, including how to:

  • Create an instance of TVAE.

  • Fit the instance to your data.

  • Generate synthetic versions of your data.

  • Use TVAE to anonymize PII information.

  • Customize the data transformations to improve the learning process.

  • Specify hyperparameters to improve the output quality.

What is TVAE?

The sdv.tabular.TVAE model is based on the VAE-based Deep Learning data synthesizer which was presented at the NeurIPS 2019 conference in the paper titled Modeling Tabular data using Conditional GAN.

Let’s now discover how to learn a dataset and later on generate synthetic data with the same format and statistical properties by using the TVAE class from SDV.

Quick Usage

We will start by loading one of our demo datasets, the student_placements, which contains information about MBA students that applied for placements during the year 2020.

Warning

In order to follow this guide you need to have sdv installed on your system. If you have not done so yet, please install it now by executing the command pip install sdv in a terminal.

In [1]: from sdv.demo import load_tabular_demo

In [2]: data = load_tabular_demo('student_placements')

In [3]: data.head()
Out[3]: 
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec  mba_perc   salary  placed start_date   end_date duration
0       17264      M        67.00      91.00  Commerce        58.00    Sci&Tech            False                 0                55.0   Mkt&HR     58.80  27000.0    True 2020-07-23 2020-10-12      3.0
1       17265      M        79.33      78.33   Science        77.48    Sci&Tech             True                 1                86.5  Mkt&Fin     66.28  20000.0    True 2020-01-11 2020-04-09      3.0
2       17266      M        65.00      68.00      Arts        64.00   Comm&Mgmt            False                 0                75.0  Mkt&Fin     57.80  25000.0    True 2020-01-26 2020-07-13      6.0
3       17267      M        56.00      52.00   Science        52.00    Sci&Tech            False                 0                66.0   Mkt&HR     59.43      NaN   False        NaT        NaT      NaN
4       17268      M        85.80      73.60  Commerce        73.30   Comm&Mgmt            False                 0                96.8  Mkt&Fin     55.50  42500.0    True 2020-07-04 2020-09-27      3.0

As you can see, this table contains information about students which includes, among other things:

  • Their id and gender

  • Their grades and specializations

  • Their work experience

  • The salary that they were offered

  • The duration and dates of their placement

You will notice that there is data with the following characteristics:

  • There are float, integer, boolean, categorical and datetime values.

  • There are some variables that have missing data. In particular, all the data related to the placement details is missing in the rows where the student was not placed.

Let us use TVAE to learn this data and then sample synthetic data about new students to see how well the model captures the characteristics indicated above. In order to do this you will need to:

  • Import the sdv.tabular.TVAE class and create an instance of it.

  • Call its fit method passing our table.

  • Call its sample method indicating the number of synthetic rows that you want to generate.

In [4]: from sdv.tabular import TVAE

In [5]: model = TVAE()

In [6]: model.fit(data)

Note

Notice that the model fitting process took care of transforming the different fields using the appropriate Reversible Data Transforms to ensure that the data has a format that the underlying TVAESynthesizer class can handle.

Generate synthetic data from the model

Once the modeling has finished you are ready to generate new synthetic data by calling the sample method of your model, passing the number of rows that you want to generate.

In [7]: new_data = model.sample(200)

This will return a table with the same structure as the one the model was fitted on, but filled with new data which resembles the original.

In [8]: new_data.head()
Out[8]: 
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc        salary  placed start_date   end_date duration
0       17448      M    64.604568  62.756413   Science    64.169343   Comm&Mgmt            False                 0           50.921246  Mkt&Fin  67.036185  28901.185358    True 2020-01-11 2020-07-24      NaN
1       17440      M    60.254406  63.002046  Commerce    58.809934   Comm&Mgmt            False                 0           63.586521   Mkt&HR  62.984031           NaN   False        NaT        NaT      NaN
2       17273      M    62.917068  62.103602  Commerce    69.852155   Comm&Mgmt            False                 0           62.484459  Mkt&Fin  60.113145  26628.116120    True 2020-01-16 2020-09-23      3.0
3       17308      M    83.066175  87.046848  Commerce    73.745576   Comm&Mgmt            False                 0           80.738820  Mkt&Fin  58.691415  20791.080136    True 2020-01-11 2020-08-17      3.0
4       17314      M    84.239590  78.028746   Science    66.501422    Sci&Tech             True                 0           89.699066  Mkt&Fin  65.412941  29194.174051    True 2020-01-29 2020-05-02      3.0

Note

You can control the number of rows by passing the number of samples to model.sample(<num_rows>). To test this, try model.sample(10000). Note that the original table only had ~200 rows.

Save and Load the model

In many scenarios it will be convenient to generate synthetic versions of your data directly in systems that do not have access to the original data source. For example, you may want to generate testing data on the fly inside a testing environment that does not have access to your production database. In these scenarios, fitting the model with real data every time that you need to generate new data is not feasible, so you will need to fit a model in your production environment, save the fitted model into a file, send this file to the testing environment and then load it there to be able to sample from it.

Let’s see how this process works.

Save and share the model

Once you have fitted the model, all you need to do is call its save method passing the name of the file in which you want to save the model. Note that the extension of the filename is not relevant, but we will be using the .pkl extension to highlight that the serialization protocol used is pickle.

In [9]: model.save('my_model.pkl')

This will have created a file called my_model.pkl in the same directory in which you are running SDV.

Important

If you inspect the generated file you will notice that its size is much smaller than the size of the data that you used to generate it. This is because the serialized model contains no information about the original data, other than the parameters it needs to generate synthetic versions of it. This means that you can safely share this my_model.pkl file without the risk of disclosing any of your real data!

Load the model and generate new data

The file you just generated can be sent over to the system where the synthetic data will be generated. Once it is there, you can load it using the TVAE.load method, and then you are ready to sample new data from the loaded instance:

In [10]: loaded = TVAE.load('my_model.pkl')

In [11]: new_data = loaded.sample(200)

Warning

Notice that the system where the model is loaded also needs to have sdv installed, otherwise it will not be able to load the model and use it.
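The save and load round trip described above can be illustrated with a minimal, standard-library sketch. The `fit_toy_model` and `sample_toy_model` helpers are hypothetical stand-ins and not part of SDV; they only show the idea of persisting fitted parameters with pickle and sampling from them elsewhere, using a mean and a standard deviation instead of network weights.

```python
import os
import pickle
import random
import statistics
import tempfile


def fit_toy_model(values):
    """Return only summary parameters, never the training rows themselves."""
    return {'mean': statistics.mean(values), 'std': statistics.stdev(values)}


def sample_toy_model(params, num_rows, seed=None):
    """Draw new values from a normal distribution with the fitted parameters."""
    rng = random.Random(seed)
    return [rng.gauss(params['mean'], params['std']) for _ in range(num_rows)]


# "Production" side: fit on real values and write the parameters to disk.
params = fit_toy_model([58.8, 66.28, 57.8, 59.43, 55.5])
model_path = os.path.join(tempfile.mkdtemp(), 'toy_model.pkl')
with open(model_path, 'wb') as output:
    pickle.dump(params, output)

# "Testing" side: load the file and sample without access to the real values.
with open(model_path, 'rb') as source:
    loaded_params = pickle.load(source)

restored_rows = sample_toy_model(loaded_params, 3, seed=0)
```

Because unpickling can execute arbitrary code, pickle files should only be loaded when they come from a source you trust.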

Specifying the Primary Key of the table

One of the first things that you may have noticed when looking at the demo data is that there is a student_id column which acts as the primary key of the table, and which is supposed to have unique values. Indeed, if we look at the number of times that each value appears, we see that all of them appear at most once:

In [12]: data.student_id.value_counts().max()
Out[12]: 1

However, if we look at the synthetic data that we generated, we observe that there are some values that appear more than once:

In [13]: new_data[new_data.student_id == new_data.student_id.value_counts().index[0]]
Out[13]: 
     student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc        salary  placed start_date   end_date duration
23        17410      M    63.975182  63.253162  Commerce    58.398604   Comm&Mgmt            False                 0           69.725629  Mkt&Fin  56.864663  29327.658650    True 2020-01-19 2020-07-28      3.0
124       17410      M    62.452962  67.957529  Commerce    68.598872   Comm&Mgmt             True                 0           64.005992  Mkt&Fin  53.336265  28063.788199    True 2020-01-19 2020-08-14      3.0
148       17410      M    55.855973  60.858664  Commerce    67.972296   Comm&Mgmt             True                 0           58.530322  Mkt&Fin  60.233213  24751.285346    True 2020-01-16 2020-08-02      3.0
163       17410      M    47.054126  48.841317  Commerce    61.390821   Comm&Mgmt            False                 0           62.838622   Mkt&HR  61.163114           NaN   False        NaT        NaT      NaN
185       17410      M    62.067503  58.200434  Commerce    55.240622   Comm&Mgmt            False                 0           57.418334   Mkt&HR  57.288601           NaN   False        NaT        NaT      NaN

This happens because the model was never told that student_id has to be unique, so when it generates new data it will produce collisions sooner or later. In order to solve this, we can pass the argument primary_key to our model when we create it, indicating the name of the column that is the primary key of the table.

In [14]: model = TVAE(
   ....:     primary_key='student_id'
   ....: )
   ....: 

In [15]: model.fit(data)

In [16]: new_data = model.sample(200)

In [17]: new_data.head()
Out[17]: 
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc        salary  placed start_date   end_date duration
0           0      M    49.375402  52.138793  Commerce    63.680203   Comm&Mgmt             True                 1           80.793547  Mkt&Fin  61.972025  29207.473171    True 2020-01-15 2020-07-18      3.0
1           1      M    88.459997  64.282219   Science    63.036313   Comm&Mgmt             True                 0           85.363994  Mkt&Fin  60.766776  24493.389236    True 2020-01-23 2020-09-25      3.0
2           2      M    64.311707  69.186836  Commerce    56.413013   Comm&Mgmt            False                 0           77.267950  Mkt&Fin  61.601944  29872.886877    True 2020-01-15 2020-08-15      3.0
3           3      M    67.044219  62.052711  Commerce    71.160371   Comm&Mgmt            False                 0          111.481894   Mkt&HR  58.999803  21448.940097    True 2020-02-02 2020-07-24      3.0
4           4      M    55.072510  66.956499   Science    66.121258   Comm&Mgmt             True                 1           78.442342  Mkt&Fin  65.516219           NaN   False        NaT        NaT      NaN

As a result, the model will learn that this column must be unique and generate a unique sequence of values for the column:

In [18]: new_data.student_id.value_counts().max()
Out[18]: 1
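The uniqueness check above only reports the highest count. If you also want to know which key values collide, a small standard-library helper can list them. This is a sketch under the assumption that the keys have been extracted into a plain list (for example with `new_data.student_id.tolist()`); `duplicated_keys` is a hypothetical helper, not an SDV function.

```python
from collections import Counter


def duplicated_keys(keys):
    """Return the key values that appear more than once, with their counts."""
    counts = Counter(keys)
    return {key: count for key, count in counts.items() if count > 1}


# Keys sampled without a primary key: 17410 collides.
sampled_ids = [17448, 17410, 17273, 17410, 17314, 17410]
collisions = duplicated_keys(sampled_ids)

# Keys generated as a unique sequence: no collisions.
sequential_ids = list(range(6))
no_collisions = duplicated_keys(sequential_ids)
```

An empty result means the column can safely be used as a primary key.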

Anonymizing Personally Identifiable Information (PII)

There will be many cases where the data will contain Personally Identifiable Information which we cannot disclose. In these cases, we will want our Tabular Models to replace the information within these fields with fake, simulated data that looks similar to the real one but does not contain any of the original values.

Let’s load a new dataset that contains a PII field, the student_placements_pii demo, and try to generate synthetic versions of it that do not contain any of the PII fields.

Note

The student_placements_pii dataset is a modified version of the student_placements dataset with one new field, address, which contains PII information about the students. Notice that this additional address field has been simulated and does not correspond to data from the real users.

In [19]: data_pii = load_tabular_demo('student_placements_pii')

In [20]: data_pii.head()
Out[20]: 
   student_id                                            address gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec  mba_perc   salary  placed start_date   end_date duration
0       17264        70304 Baker Turnpike\nEricborough, MS 15086      M        67.00      91.00  Commerce        58.00    Sci&Tech            False                 0                55.0   Mkt&HR     58.80  27000.0    True 2020-07-23 2020-10-12      3.0
1       17265    805 Herrera Avenue Apt. 134\nMaryview, NJ 36510      M        79.33      78.33   Science        77.48    Sci&Tech             True                 1                86.5  Mkt&Fin     66.28  20000.0    True 2020-01-11 2020-04-09      3.0
2       17266        3702 Bradley Island\nNorth Victor, FL 12268      M        65.00      68.00      Arts        64.00   Comm&Mgmt            False                 0                75.0  Mkt&Fin     57.80  25000.0    True 2020-01-26 2020-07-13      6.0
3       17267                   Unit 0879 Box 3878\nDPO AP 42663      M        56.00      52.00   Science        52.00    Sci&Tech            False                 0                66.0   Mkt&HR     59.43      NaN   False        NaT        NaT      NaN
4       17268  96493 Kelly Canyon Apt. 145\nEast Steven, NC 3...      M        85.80      73.60  Commerce        73.30   Comm&Mgmt            False                 0                96.8  Mkt&Fin     55.50  42500.0    True 2020-07-04 2020-09-27      3.0

If we use our tabular model on this new data we will see how the synthetic data that it generates discloses the addresses from the real students:

In [21]: model = TVAE(
   ....:     primary_key='student_id',
   ....: )
   ....: 

In [22]: model.fit(data_pii)

In [23]: new_data_pii = model.sample(200)

In [24]: new_data_pii.head()
Out[24]: 
   student_id                                            address gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc        salary  placed start_date   end_date duration
0           0  0530 James Trafficway Apt. 877\nDavidshire, DE...      M    78.326404  62.466473  Commerce    63.747327   Comm&Mgmt             True                 0           56.314808  Mkt&Fin  68.740134  26444.853751    True 2020-01-20 2020-08-02      3.0
1           1  083 Robinson Points Suite 667\nLake Stephaniem...      M    62.810922  65.436810  Commerce    66.723399   Comm&Mgmt            False                 0           51.540207  Mkt&Fin  59.616739  24107.001262    True 2020-01-21 2020-07-25      6.0
2           2  793 Rebecca Isle Apt. 327\nSouth Nicoleport, H...      M    60.127629  62.953011  Commerce    57.523345   Comm&Mgmt            False                 0           61.888203  Mkt&Fin  52.577698  29606.105280    True 2020-01-24 2020-08-08      6.0
3           3       54911 Gloria Island\nLake Veronica, WA 91035      M    82.976943  68.494418   Science    81.844226   Comm&Mgmt            False                 0           60.784650   Mkt&HR  63.340848  24303.617125    True 2020-01-16 2020-05-31      3.0
4           4         0478 Sanders Turnpike\nEricafurt, KY 50273      M    69.399889  62.232802  Commerce    73.683809   Comm&Mgmt            False                 0           59.984279  Mkt&Fin  59.767815  26543.501594    True 2020-01-12 2020-07-07      6.0

More specifically, we can see how all the addresses that have been generated actually come from the original dataset:

In [25]: new_data_pii.address.isin(data_pii.address).sum()
Out[25]: 200

In order to solve this, we can pass an additional argument anonymize_fields to our model when we create the instance. This anonymize_fields argument will need to be a dictionary that contains:

  • The name of the field that we want to anonymize.

  • The category of the field that we want to use when we generate fake values for it.

The complete list of possible categories can be seen in the Faker Providers page, and it contains a huge list of concepts such as:

  • name

  • address

  • country

  • city

  • ssn

  • credit_card_number

  • credit_card_expire

  • credit_card_security_code

  • email

  • telephone

In this case, since the field is a postal address, we will pass a dictionary indicating the category address:

In [26]: model = TVAE(
   ....:     primary_key='student_id',
   ....:     anonymize_fields={
   ....:         'address': 'address'
   ....:     }
   ....: )
   ....: 

In [27]: model.fit(data_pii)

As a result, we can see how the real address values have been replaced by other fake addresses:

In [28]: new_data_pii = model.sample(200)

In [29]: new_data_pii.head()
Out[29]: 
   student_id                                            address gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc        salary  placed start_date   end_date duration
0           0  8805 Deborah Wells Suite 429\nRhondashire, MN ...      M    80.061312  60.323540  Commerce    61.763486   Comm&Mgmt            False                 0           52.504712   Mkt&HR  56.588960  26094.309653    True 2020-01-24 2020-08-02      3.0
1           1                   PSC 9641, Box 2763\nAPO AA 01441      F    57.101802  67.989106  Commerce    61.042251   Comm&Mgmt            False                 0           63.073393   Mkt&HR  56.120996  22495.509401    True 2020-01-29 2020-06-25      6.0
2           2  38308 Ramirez Road Apt. 083\nNorth Aaron, WY 1...      M    82.022477  60.850982  Commerce    64.521683   Comm&Mgmt            False                 0           66.052834  Mkt&Fin  59.686279  23511.028943    True 2020-01-16 2020-10-26      3.0
3           3               458 Oliver Forks\nLaraberg, MS 22222      M    65.591655  85.153962  Commerce    71.729380   Comm&Mgmt            False                 0           82.760528  Mkt&Fin  60.016939  23212.096853    True 2020-01-25 2021-01-09      3.0
4           4  77465 Cynthia Station Suite 847\nMaryburgh, GA...      M    65.925046  68.524287   Science    64.878277   Comm&Mgmt            False                 0           72.728818  Mkt&Fin  63.635244  24517.205815    True 2020-01-16 2020-10-07      3.0

Which means that none of the original addresses can be found in the sampled data:

In [30]: data_pii.address.isin(new_data_pii.address).sum()
Out[30]: 0
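The same leak check can be expressed without pandas as a set intersection, which also tells you which values leaked rather than only how many. The sketch below is illustrative: `leaked_values` is a hypothetical helper, and the short lists reuse a few of the addresses shown in the outputs above.

```python
def leaked_values(real_values, synthetic_values):
    """Return the real values that reappear verbatim in the synthetic data."""
    return set(real_values) & set(synthetic_values)


real_addresses = [
    '70304 Baker Turnpike\nEricborough, MS 15086',
    'Unit 0879 Box 3878\nDPO AP 42663',
]

# Synthetic rows produced without anonymization reuse real addresses.
leaky_sample = [
    'Unit 0879 Box 3878\nDPO AP 42663',
    'Unit 0879 Box 3878\nDPO AP 42663',
]

# Synthetic rows produced with anonymization contain only fake addresses.
anonymized_sample = [
    '458 Oliver Forks\nLaraberg, MS 22222',
    'PSC 9641, Box 2763\nAPO AA 01441',
]

leaks_before = leaked_values(real_addresses, leaky_sample)
leaks_after = leaked_values(real_addresses, anonymized_sample)
```

Here `leaks_before` contains the one reused address, while `leaks_after` is empty.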

Advanced Usage

Now that we have discovered the basics, let’s go over a few more advanced usage examples and see the different arguments that we can pass to our TVAE model in order to customize it to our needs.

How to modify the TVAE Hyperparameters?

Apart from the common Tabular Model arguments, TVAE has a number of additional hyperparameters that control its learning behavior and can impact the performance of the model, both in terms of quality of the generated data and computational time.

  • epochs and batch_size: these arguments control the number of iterations that the model will perform to optimize its parameters, as well as the number of samples used in each step. Their default values are 300 and 500 respectively, and batch_size must always be a multiple of 10.

    These hyperparameters have a very direct effect on how long the training process lasts, but also on the quality of the generated data. For new datasets, you might want to start by setting a low value for both of them to see how long the training process takes on your data, and later increase them to acceptable values in order to improve the quality.

  • log_frequency: Whether to use log frequency of categorical levels in conditional sampling. It defaults to True. This argument affects how the model processes the frequencies of the categorical values that are used to condition the rest of the values. In some cases, changing it to False could lead to better performance.

  • embedding_dim (int): Size of the random sample passed to the Generator. Defaults to 128.

  • compress_dims (tuple or list of ints): Size of each hidden layer in the encoder. Defaults to (128, 128).

  • decompress_dims (tuple or list of ints): Size of each hidden layer in the decoder. Defaults to (128, 128).

  • l2scale (float): Regularization term (weight decay). Defaults to 1e-5.

  • batch_size (int): Number of data samples to process in each step.

  • loss_factor (int): Multiplier for the reconstruction error. Defaults to 2.

  • cuda (bool or str): If True, use CUDA. If a str, use the indicated device. If False, do not use cuda at all.

Warning

Notice that the value that you set on the batch_size argument must always be a multiple of 10!
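If the batch size is computed rather than typed in by hand (for example, derived from the number of rows), it can help to adjust it to a valid value before creating the model. The helper below is a hypothetical sketch and not part of SDV; it simply rounds a requested value down to a positive multiple of 10.

```python
def nearest_valid_batch_size(requested, multiple=10):
    """Round ``requested`` down to a positive multiple of ``multiple``."""
    if requested < 1:
        raise ValueError('batch size must be a positive integer')
    return max(multiple, (requested // multiple) * multiple)


# 57 is rounded down to 50.
small = nearest_valid_batch_size(57)

# Values below 10 are raised to the minimum valid value, 10.
tiny = nearest_valid_batch_size(5)

# 500 is already a multiple of 10 and is returned unchanged.
default = nearest_valid_batch_size(500)
```

The result can then be passed as the batch_size argument when creating the model.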

As an example, we will try to fit the TVAE model while increasing the number of epochs and using three wider hidden layers, instead of the default two, in both the encoder and the decoder.

Before we start, we will evaluate the quality of the previously generated data using the sdv.evaluation.evaluate function:

In [31]: from sdv.evaluation import evaluate

In [32]: evaluate(new_data, data)
Out[32]: 0.4354929566577206

Afterwards, we create a new instance of the TVAE model with the hyperparameter values that we want to use:

In [33]: model = TVAE(
   ....:     primary_key='student_id',
   ....:     epochs=500,
   ....:     compress_dims=(256, 256, 256),
   ....:     decompress_dims=(256, 256, 256)
   ....: )
   ....: 

And fit it to our data.

In [34]: model.fit(data)

Finally, we are ready to generate new data and evaluate the results.

In [35]: new_data = model.sample(len(data))

In [36]: evaluate(new_data, data)
Out[36]: 0.47822058238115456

As we can see, in this case these modifications changed the obtained results slightly, but they did not introduce dramatic changes in the performance.

Conditional Sampling

As the name implies, conditional sampling allows us to sample from a conditional distribution using the TVAE model, which means we can generate only values that satisfy certain conditions. These conditional values can be passed to the conditions parameter in the sample method either as a dataframe or a dictionary.

In case a dictionary is passed, the model will generate as many rows as requested, all of which will satisfy the specified conditions, such as gender = M.

In [37]: conditions = {
   ....:     'gender': 'M'
   ....: }
   ....: 

In [38]: model.sample(5, conditions=conditions)
Out[38]: 
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc        salary  placed start_date   end_date duration
0           0      M    66.740587  64.639258  Commerce    66.344518   Comm&Mgmt            False                 0           47.046731  Mkt&Fin  51.028056  25067.113133    True 2020-01-27 2020-07-27      3.0
1           1      M    65.541779  55.015753  Commerce    65.235016   Comm&Mgmt             True                 0           59.123779   Mkt&HR  50.795305  29632.381492    True 2020-01-13 2020-08-18      NaN
2           2      M    79.230967  66.110557  Commerce    76.863025   Comm&Mgmt            False                 0           91.595577  Mkt&Fin  57.036491  21800.890949    True 2020-01-16 2020-10-14      3.0
3           3      M    56.394430  60.594170  Commerce    72.471679   Comm&Mgmt            False                 0           68.994952   Mkt&HR  60.349849           NaN   False        NaT        NaT      NaN
4           4      M    85.331716  70.315460  Commerce    73.586865   Comm&Mgmt             True                 1           91.273467  Mkt&Fin  68.134973  23328.949962    True 2020-01-18 2020-03-30      3.0

It’s also possible to condition on multiple columns, such as gender = M and experience_years = 0.

In [39]: conditions = {
   ....:     'gender': 'M',
   ....:     'experience_years': 0
   ....: }
   ....: 

In [40]: model.sample(5, conditions=conditions)
Out[40]: 
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc        salary  placed start_date   end_date duration
0           0      M    68.306172  62.816627  Commerce    70.257203   Comm&Mgmt            False                 0           59.410180  Mkt&Fin  54.124450  26985.097795    True 2020-01-14 2020-08-04      6.0
1           1      M    66.918590  63.532007  Commerce    66.885183   Comm&Mgmt            False                 0           60.501667   Mkt&HR  61.972308  23032.882335    True 2020-01-18 2020-08-10      3.0
2           2      M    52.796393  77.050693  Commerce    65.648619   Comm&Mgmt            False                 0           62.306843   Mkt&HR  58.308859           NaN   False        NaT        NaT      NaN
3           3      M    75.108056  66.247568  Commerce    79.960783   Comm&Mgmt            False                 0           83.853711  Mkt&Fin  67.281500  24215.100538    True 2020-01-31 2020-04-23      3.0
4           4      M    64.530749  62.434780  Commerce    58.056132   Comm&Mgmt            False                 0           60.492196   Mkt&HR  62.678857  28997.974726    True 2020-03-14 2020-08-16      NaN

The conditions can also be passed as a dataframe. In that case, the model will generate one sample for each row of the dataframe, sorted in the same order. Since the model already knows how many samples to generate, passing it as a parameter is unnecessary. For example, if we want to generate three samples where gender = M and three samples with gender = F, we can do the following:

In [41]: import pandas as pd

In [42]: conditions = pd.DataFrame({
   ....:     'gender': ['M', 'M', 'M', 'F', 'F', 'F'],
   ....: })
   ....: 

In [43]: model.sample(conditions=conditions)
Out[43]: 
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc        salary  placed start_date   end_date duration
0           0      M    70.690960  70.145581  Commerce    67.446533   Comm&Mgmt            False                 0           54.494398  Mkt&Fin  54.226159  20434.444092    True 2020-01-18 2020-10-12      3.0
1           1      M    79.314546  79.630019   Science    77.742661   Comm&Mgmt             True                 1           75.576171  Mkt&Fin  63.080210  28835.820273    True 2020-01-23 2020-04-22      3.0
2           2      M    61.467316  77.271953  Commerce    67.656423   Comm&Mgmt             True                 0           61.220382  Mkt&Fin  57.334080  21962.896474    True 2020-01-15 2020-07-21      6.0
3           3      F    76.458818  69.121699   Science    73.480272   Comm&Mgmt            False                 0           92.909345  Mkt&Fin  60.851382  22235.439066    True 2020-01-23 2020-07-14      6.0
4           4      F    64.898366  86.396013  Commerce    58.042495   Comm&Mgmt            False                 0           57.650022  Mkt&Fin  55.088576  25539.554270    True 2020-01-13 2020-07-23      6.0
5           5      F    51.107861  52.584110  Commerce    66.941111   Comm&Mgmt            False                 0           53.789168   Mkt&HR  61.889249           NaN   False        NaT        NaT      NaN

TVAE also supports conditioning on continuous values, as long as the values are within the range of seen numbers. For example, if all the values of a column in the dataset are between 0 and 1, TVAE will not be able to set this value to 1000.

In [44]: conditions = {
   ....:     'degree_perc': 70.0
   ....: }
   ....: 

In [45]: model.sample(5, conditions=conditions)
Out[45]: 
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc        salary  placed start_date   end_date duration
0           1      M    54.493246  58.925478  Commerce         70.0   Comm&Mgmt             True                 0           64.663612  Mkt&Fin  65.549845  26657.467097    True 2020-01-19 2021-01-01      3.0
1           9      M    64.621070  67.810251  Commerce         70.0   Comm&Mgmt             True                 1           78.186189  Mkt&Fin  72.235053  26135.701145    True 2020-01-18 2020-12-14      3.0
2          19      M    60.138242  71.069280  Commerce         70.0   Comm&Mgmt            False                 0           60.828465  Mkt&Fin  62.808776  23971.723420    True 2020-01-18 2020-07-26      6.0
3          25      M    74.573986  72.484648   Science         70.0   Comm&Mgmt             True                 0           63.259568  Mkt&Fin  54.903955  26384.348928    True 2020-01-21 2020-08-16      3.0
4           4      M    61.453360  55.616071  Commerce         70.0   Comm&Mgmt            False                 0           66.068565   Mkt&HR  57.491851  21157.870958    True 2020-01-16 2020-08-16      3.0
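Since a continuous condition only works inside the range of values seen during fitting, it can be worth checking the requested value before sampling. The following is a small sketch with a hypothetical `within_seen_range` helper (not an SDV function); the example values are the degree_perc entries from the first rows of the demo table.

```python
def within_seen_range(seen_values, requested):
    """Check that ``requested`` lies between the observed minimum and maximum."""
    return min(seen_values) <= requested <= max(seen_values)


degree_perc_seen = [58.0, 77.48, 64.0, 52.0, 73.3]

# 70.0 lies between 52.0 and 77.48, so it is a reasonable condition.
ok = within_seen_range(degree_perc_seen, 70.0)

# 1000 is far outside the observed range.
too_large = within_seen_range(degree_perc_seen, 1000)
```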

Note

Currently, conditional sampling works through a rejection sampling process, where rows are sampled repeatedly until one that satisfies the conditions is found. In case you are running into a Could not get enough valid rows within x trials error, or simply wish to optimize the results, there are three parameters that can be fine-tuned: max_rows_multiplier, max_retries and float_rtol. More information about these parameters can be found in the API section.
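To make the rejection sampling idea concrete, here is a minimal, standard-library sketch. It is not SDV's implementation: the parameter names mirror the ones mentioned above, but their exact semantics here are an assumption, and `draw_row` is a hypothetical unconditional sampler standing in for a fitted model.

```python
import math
import random


def sample_with_conditions(draw_row, num_rows, conditions,
                           max_retries=100, max_rows_multiplier=10,
                           float_rtol=0.01):
    """Rejection-sample rows from ``draw_row`` until enough match ``conditions``."""

    def matches(row):
        for column, expected in conditions.items():
            value = row[column]
            if isinstance(expected, float):
                # Floats are compared with a relative tolerance (assumed meaning).
                if not math.isclose(value, expected, rel_tol=float_rtol):
                    return False
            elif value != expected:
                return False
        return True

    valid_rows = []
    for _ in range(max_retries):
        # Oversample in each trial, then keep only the rows that match.
        candidates = [draw_row() for _ in range(num_rows * max_rows_multiplier)]
        valid_rows.extend(row for row in candidates if matches(row))
        if len(valid_rows) >= num_rows:
            return valid_rows[:num_rows]

    raise ValueError(
        'Could not get enough valid rows within {} trials'.format(max_retries))


rng = random.Random(0)


def draw_row():
    # Hypothetical unconditional sampler standing in for a fitted model.
    return {'gender': rng.choice(['M', 'F']), 'degree_perc': rng.uniform(50, 90)}


rows = sample_with_conditions(draw_row, 5, {'gender': 'M'})
```

In this sketch, a rare or impossible condition exhausts the retries and raises the error, which is why increasing the number of retries, the oversampling factor or the float tolerance makes it easier to collect enough valid rows.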

How do I specify constraints?

If you look closely at the data you may notice that some properties were not completely captured by the model. For example, you may have seen that sometimes the model produces an experience_years number greater than 0 while also indicating that work_experience is False. These types of properties are what we call Constraints and can also be handled using SDV. For further details about them please visit the Handling Constraints guide.
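Before reaching for constraints, you can measure how often the sampled data breaks such a rule. The sketch below uses plain dictionaries as rows and a hypothetical `violates_experience_rule` helper; it is only a way to count violations, not the SDV constraints API.

```python
def violates_experience_rule(row):
    """True when a row reports experience years without any work experience."""
    return row['experience_years'] > 0 and not row['work_experience']


synthetic_rows = [
    {'work_experience': True, 'experience_years': 1},
    {'work_experience': False, 'experience_years': 0},
    {'work_experience': False, 'experience_years': 1},  # inconsistent row
]

violations = [row for row in synthetic_rows if violates_experience_rule(row)]
```

With a pandas table, the rows could first be obtained with `new_data.to_dict('records')`.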

Can I evaluate the Synthetic Data?

A very common question when someone starts using SDV to generate synthetic data is: “How good is the data that I just generated?”

In order to answer this question, SDV has a collection of metrics and tools that allow you to compare the real data that you provided and the synthetic data that you generated using SDV or any other tool.

You can read more about this in the Synthetic Data Evaluation guide.