TVAE Model

In this guide we will go through a series of steps that will let you discover functionalities of the TVAE model, including how to:

  • Create an instance of TVAE.

  • Fit the instance to your data.

  • Generate synthetic versions of your data.

  • Use TVAE to anonymize Personally Identifiable Information (PII).

  • Customize the data transformations to improve the learning process.

  • Specify hyperparameters to improve the output quality.

What is TVAE?

The sdv.tabular.TVAE model is based on the VAE-based Deep Learning data synthesizer which was presented at the NeurIPS 2019 conference in the paper titled Modeling Tabular data using Conditional GAN.

Let’s now discover how to learn a dataset and later on generate synthetic data with the same format and statistical properties by using the TVAE class from SDV.

Quick Usage

We will start by loading one of our demo datasets, the student_placements, which contains information about MBA students that applied for placements during the year 2020.

Warning

In order to follow this guide you need to have sdv installed on your system. If you have not done it yet, please install it now by executing the command pip install sdv in a terminal.

In [1]: from sdv.demo import load_tabular_demo

In [2]: data = load_tabular_demo('student_placements')

In [3]: data.head()
Out[3]: 
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec  mba_perc   salary  placed start_date   end_date duration
0       17264      M        67.00      91.00  Commerce        58.00    Sci&Tech            False                 0                55.0   Mkt&HR     58.80  27000.0    True 2020-07-23 2020-10-12      3.0
1       17265      M        79.33      78.33   Science        77.48    Sci&Tech             True                 1                86.5  Mkt&Fin     66.28  20000.0    True 2020-01-11 2020-04-09      3.0
2       17266      M        65.00      68.00      Arts        64.00   Comm&Mgmt            False                 0                75.0  Mkt&Fin     57.80  25000.0    True 2020-01-26 2020-07-13      6.0
3       17267      M        56.00      52.00   Science        52.00    Sci&Tech            False                 0                66.0   Mkt&HR     59.43      NaN   False        NaT        NaT      NaN
4       17268      M        85.80      73.60  Commerce        73.30   Comm&Mgmt            False                 0                96.8  Mkt&Fin     55.50  42500.0    True 2020-07-04 2020-09-27      3.0

As you can see, this table contains information about students which includes, among other things:

  • Their id and gender

  • Their grades and specializations

  • Their work experience

  • The salary that they were offered

  • The duration and dates of their placement

You will notice that the data has the following characteristics:

  • There are float, integer, boolean, categorical and datetime values.

  • There are some variables that have missing data. In particular, all the data related to the placement details is missing in the rows where the student was not placed.

Let us use TVAE to learn this data and then sample synthetic data about new students to see how well the model captures the characteristics indicated above. In order to do this you will need to:

  • Import the sdv.tabular.TVAE class and create an instance of it.

  • Call its fit method passing our table.

  • Call its sample method indicating the number of synthetic rows that you want to generate.

In [4]: from sdv.tabular import TVAE

In [5]: model = TVAE()

In [6]: model.fit(data)

Note

Notice that the model fitting process took care of transforming the different fields using the appropriate Reversible Data Transforms to ensure that the data has a format that the underlying TVAESynthesizer class can handle.
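
To give a rough idea of what a reversible transform does, here is a conceptual sketch in plain pandas (an illustration only, not SDV's actual implementation) that encodes a categorical column as numeric codes and then reverses the encoding:

import pandas as pd

# Hypothetical illustration of a reversible transform: categorical labels
# are mapped to integer codes for the model, and mapped back afterwards.
column = pd.Categorical(['Sci&Tech', 'Comm&Mgmt', 'Sci&Tech', 'Others'])

# Forward transform: labels -> integer codes the synthesizer can consume
codes = column.codes

# Reverse transform: integer codes -> the original labels
restored = pd.Categorical.from_codes(codes, column.categories)

assert list(restored) == list(column)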

Generate synthetic data from the model

Once the modeling has finished you are ready to generate new synthetic data by calling the sample method of your model, passing the number of rows that you want to generate.

In [7]: new_data = model.sample(200)

This will return a table with the same format as the one that the model was fitted on, but filled with new data that resembles the original.

In [8]: new_data.head()
Out[8]: 
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc        salary  placed start_date   end_date duration
0       17420      F    73.101956  65.227723      Arts    64.477218      Others             True                 0           91.607941  Mkt&Fin  64.044536  25610.365246   False        NaT        NaT      6.0
1       17350      F    82.720206  56.908505      Arts    64.209844   Comm&Mgmt            False                 0           80.835069  Mkt&Fin  58.412540           NaN   False        NaT 2020-10-07      3.0
2       17314      F    62.598960  63.034191      Arts    62.527316   Comm&Mgmt            False                 1           80.795522  Mkt&Fin  58.836303  25287.391663   False        NaT        NaT      3.0
3       17336      F    79.074552  61.143598  Commerce    69.073348   Comm&Mgmt             True                 0           87.452836  Mkt&Fin  65.356574  25388.356955   False        NaT        NaT      6.0
4       17317      F    83.496966  62.227672   Science    67.126816   Comm&Mgmt            False                 0           61.036979   Mkt&HR  59.735393  24061.944418   False        NaT 2020-09-29      6.0

Note

You can control the number of rows by specifying the desired number of samples in model.sample(<num_rows>). To test, try model.sample(10000). Note that the original table only had ~200 rows.

Save and Load the model

In many scenarios it will be convenient to generate synthetic versions of your data directly in systems that do not have access to the original data source. For example, you may want to generate testing data on the fly inside a testing environment that does not have access to your production database. In these scenarios, fitting the model with real data every time that you need to generate new data is not feasible, so you will need to fit a model in your production environment, save the fitted model into a file, send this file to the testing environment and then load it there to be able to sample from it.

Let’s see how this process works.

Save and share the model

Once you have fitted the model, all you need to do is call its save method passing the name of the file in which you want to save the model. Note that the extension of the filename is not relevant, but we will be using the .pkl extension to highlight that the serialization protocol used is pickle.

In [9]: model.save('my_model.pkl')

This will have created a file called my_model.pkl in the same directory in which you are running SDV.

Important

If you inspect the generated file you will notice that its size is much smaller than the size of the data that you used to generate it. This is because the serialized model contains no information about the original data, other than the parameters it needs to generate synthetic versions of it. This means that you can safely share this my_model.pkl file without the risk of disclosing any of your real data!
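
For example, you can compare the size of the serialized model on disk with the in-memory footprint of the original table (a quick sanity check; the exact numbers will vary on your system):

import os

# Size of the serialized model file, in bytes
model_size = os.path.getsize('my_model.pkl')

# Approximate in-memory size of the original table, in bytes
data_size = data.memory_usage(deep=True).sum()

print(model_size, data_size)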

Load the model and generate new data

The file you just generated can be sent over to the system where the synthetic data will be generated. Once it is there, you can load it using the TVAE.load method, and then you are ready to sample new data from the loaded instance:

In [10]: loaded = TVAE.load('my_model.pkl')

In [11]: new_data = loaded.sample(200)

Warning

Notice that the system where the model is loaded also needs to have sdv installed, otherwise it will not be able to load the model and use it.

Specifying the Primary Key of the table

One of the first things that you may have noticed when looking at the demo data is that there is a student_id column which acts as the primary key of the table, and which is supposed to have unique values. Indeed, if we look at the number of times that each value appears, we see that all of them appear at most once:

In [12]: data.student_id.value_counts().max()
Out[12]: 1

However, if we look at the synthetic data that we generated, we observe that there are some values that appear more than once:

In [13]: new_data[new_data.student_id == new_data.student_id.value_counts().index[0]]
Out[13]: 
     student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc        salary  placed start_date   end_date duration
29        17320      F    65.623244  59.543834   Science    70.931281   Comm&Mgmt            False                 0           83.719507  Mkt&Fin  60.181747           NaN    True        NaT 2020-07-16      6.0
39        17320      F    65.144786  60.473493      Arts    64.213700      Others             True                 0           79.935333  Mkt&Fin  55.221993  23725.230328   False        NaT        NaT      3.0
67        17320      F    72.877853  64.474467      Arts    70.096766    Sci&Tech            False                 0           87.189243  Mkt&Fin  57.094211  25341.892094   False        NaT 2020-10-29      3.0
72        17320      F    69.038406  60.437968   Science    65.937777    Sci&Tech             True                 0           79.523541  Mkt&Fin  56.612522  24864.864828   False        NaT        NaT     12.0
104       17320      F    77.798796  66.233184   Science    67.394602   Comm&Mgmt            False                 0           61.984805  Mkt&Fin  59.399406  25646.532785   False        NaT        NaT     12.0
122       17320      F    85.326715  69.665371  Commerce    62.602816   Comm&Mgmt            False                 0           84.138310   Mkt&HR  58.785227  24575.512262   False        NaT        NaT     12.0
127       17320      F    73.319358  65.961327   Science    64.781834   Comm&Mgmt             True                 1           82.530500   Mkt&HR  59.582095  24353.443680   False        NaT        NaT      3.0
190       17320      F    70.502959  64.973758   Science    65.669293   Comm&Mgmt             True                 0           84.765201   Mkt&HR  59.017874  26143.039221    True        NaT        NaT     12.0
198       17320      F    73.058822  61.412487      Arts    60.221149   Comm&Mgmt             True                 2           60.301887   Mkt&HR  60.399314  27672.095195   False        NaT        NaT      3.0

This happens because the model was not notified at any point about the fact that the student_id had to be unique, so when it generates new data it will eventually produce collisions. In order to solve this, we can pass the argument primary_key to our model when we create it, indicating the name of the column that acts as the index of the table.

In [14]: model = TVAE(
   ....:     primary_key='student_id'
   ....: )
   ....: 

In [15]: model.fit(data)

In [16]: new_data = model.sample(200)

In [17]: new_data.head()
Out[17]: 
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc        salary  placed start_date end_date duration
0           0      M    64.092492  62.921274  Commerce    71.771255      Others             True                 0           80.381349  Mkt&Fin  56.851139           NaN   False 2020-01-07      NaT      3.0
1           1      M    67.551962  62.388292  Commerce    61.188514      Others             True                 0           71.025927  Mkt&Fin  71.266530           NaN   False 2020-01-15      NaT      6.0
2           2      M    53.206351  61.427396   Science    68.673206      Others             True                 0           77.744350   Mkt&HR  64.093158  53775.470963   False 2020-06-25      NaT      NaN
3           3      M    56.728756  64.797254   Science    65.394216   Comm&Mgmt            False                 0           73.843538   Mkt&HR  61.316548  28743.240149   False        NaT      NaT      3.0
4           4      M    64.996623  60.546220  Commerce    66.972814      Others            False                 0           79.015321  Mkt&Fin  59.622942  28615.130299   False 2020-07-16      NaT      6.0

As a result, the model will learn that this column must be unique and generate a unique sequence of values for the column:

In [18]: new_data.student_id.value_counts().max()
Out[18]: 1

Anonymizing Personally Identifiable Information (PII)

There will be many cases where the data will contain Personally Identifiable Information which we cannot disclose. In these cases, we will want our Tabular Models to replace the information within these fields with fake, simulated data that looks similar to the real one but does not contain any of the original values.

Let’s load a new dataset that contains a PII field, the student_placements_pii demo, and try to generate synthetic versions of it that do not contain any of the PII fields.

Note

The student_placements_pii dataset is a modified version of the student_placements dataset with one new field, address, which contains PII information about the students. Notice that this additional address field has been simulated and does not correspond to data from the real users.

In [19]: data_pii = load_tabular_demo('student_placements_pii')

In [20]: data_pii.head()
Out[20]: 
   student_id                                            address gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec  mba_perc   salary  placed start_date   end_date duration
0       17264        70304 Baker Turnpike\nEricborough, MS 15086      M        67.00      91.00  Commerce        58.00    Sci&Tech            False                 0                55.0   Mkt&HR     58.80  27000.0    True 2020-07-23 2020-10-12      3.0
1       17265    805 Herrera Avenue Apt. 134\nMaryview, NJ 36510      M        79.33      78.33   Science        77.48    Sci&Tech             True                 1                86.5  Mkt&Fin     66.28  20000.0    True 2020-01-11 2020-04-09      3.0
2       17266        3702 Bradley Island\nNorth Victor, FL 12268      M        65.00      68.00      Arts        64.00   Comm&Mgmt            False                 0                75.0  Mkt&Fin     57.80  25000.0    True 2020-01-26 2020-07-13      6.0
3       17267                   Unit 0879 Box 3878\nDPO AP 42663      M        56.00      52.00   Science        52.00    Sci&Tech            False                 0                66.0   Mkt&HR     59.43      NaN   False        NaT        NaT      NaN
4       17268  96493 Kelly Canyon Apt. 145\nEast Steven, NC 3...      M        85.80      73.60  Commerce        73.30   Comm&Mgmt            False                 0                96.8  Mkt&Fin     55.50  42500.0    True 2020-07-04 2020-09-27      3.0

If we use our tabular model on this new data we will see how the synthetic data that it generates discloses the addresses from the real students:

In [21]: model = TVAE(
   ....:     primary_key='student_id',
   ....: )
   ....: 

In [22]: model.fit(data_pii)

In [23]: new_data_pii = model.sample(200)

In [24]: new_data_pii.head()
Out[24]: 
   student_id                                            address gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc        salary  placed start_date   end_date duration
0           0             8368 Sarah Well\nNewmanville, WA 69934      M    69.601288  64.089683  Commerce    61.850612      Others            False                 0           58.315535  Mkt&Fin  60.667745           NaN    True        NaT 2020-04-16      3.0
1           1  49460 Jeremy Unions Suite 915\nLake Coryboroug...      M    76.357697  47.815611  Commerce    67.809625      Others             True                 0           76.304454  Mkt&Fin  58.400210           NaN   False        NaT 2020-04-08      3.0
2           2                   PSC 2994, Box 1804\nAPO AE 05822      F    69.174713  58.372999   Science    64.464850      Others             True                 1           60.039985  Mkt&Fin  59.470149  29065.171805   False 2020-02-22 2020-04-13     12.0
3           3                   PSC 2994, Box 1804\nAPO AE 05822      M    77.966934  53.600451      Arts    70.515518      Others             True                 0           69.310361  Mkt&Fin  69.100697           NaN   False 2020-02-28        NaT     12.0
4           4                   Unit 8706 Box 8587\nDPO AP 31555      F    78.766469  64.966222  Commerce    64.597489      Others             True                 0           76.562658  Mkt&Fin  57.086433           NaN   False 2020-03-02 2020-09-05      3.0

More specifically, we can see how all the addresses that have been generated actually come from the original dataset:

In [25]: new_data_pii.address.isin(data_pii.address).sum()
Out[25]: 200

In order to solve this, we can pass an additional argument anonymize_fields to our model when we create the instance. This anonymize_fields argument will need to be a dictionary that contains:

  • The name of the field that we want to anonymize.

  • The category of the field that we want to use when we generate fake values for it.

The complete list of possible categories can be seen in the Faker Providers page, which contains a long list of concepts such as the following (see the short Faker sketch after this list):

  • name

  • address

  • country

  • city

  • ssn

  • credit_card_number

  • credit_card_expire

  • credit_card_security_code

  • email

  • telephone
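
These categories correspond to Faker providers, so you can preview the kind of values a given category produces by calling the matching provider method directly (a quick sketch, assuming the faker package is installed):

from faker import Faker

fake = Faker()

# Each category maps to a provider method that generates fake values
print(fake.address())
print(fake.name())
print(fake.ssn())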

In this case, since the field contains an address, we will pass a dictionary indicating the category address:

In [26]: model = TVAE(
   ....:     primary_key='student_id',
   ....:     anonymize_fields={
   ....:         'address': 'address'
   ....:     }
   ....: )
   ....: 

In [27]: model.fit(data_pii)

As a result, we can see how the real address values have been replaced by other fake addresses:

In [28]: new_data_pii = model.sample(200)

In [29]: new_data_pii.head()
Out[29]: 
   student_id                                            address gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc        salary  placed start_date end_date duration
0           0                   Unit 7402 Box 9999\nDPO AP 80379      F    70.300066  74.766468      Arts    71.038388   Comm&Mgmt             True                 0           78.357462  Mkt&Fin  54.975943           NaN   False 2020-01-20      NaT      6.0
1           1  444 Charlene Hill Suite 347\nWest Jamiestad, M...      F    75.595756  75.677733  Commerce    67.918597   Comm&Mgmt            False                 1           83.935857   Mkt&HR  58.926559  25182.200417   False        NaT      NaT      NaN
2           2   14087 Wang Viaduct Suite 155\nAlexberg, FL 60308      F    74.162473  63.649446      Arts    67.439641   Comm&Mgmt             True                 1           84.652581   Mkt&HR  67.532252           NaN   False        NaT      NaT      6.0
3           3  847 Herman Land Apt. 224\nMatthewchester, NV 9...      F    76.778371  68.227541   Science    73.236282      Others             True                 0           77.974505  Mkt&Fin  66.508842           NaN   False 2020-01-30      NaT      NaN
4           4               4752 Kyle Shore\nLopezport, OH 71000      F    74.035485  63.382243   Science    71.893993      Others             True                 0           82.159270  Mkt&Fin  59.808962           NaN   False 2020-01-27      NaT      6.0

This means that none of the original addresses can be found in the sampled data:

In [30]: data_pii.address.isin(new_data_pii.address).sum()
Out[30]: 0

How do I specify constraints?

If you look closely at the data you may notice that some properties were not completely captured by the model. For example, you may have seen that sometimes the model produces an experience_years number greater than 0 while also indicating that work_experience is False. These types of properties are what we call Constraints, and they can also be handled using SDV. For further details about them please visit the Handling Constraints guide.
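
For instance, you can count how many of the sampled rows exhibit this particular inconsistency (a quick check against the new_data table sampled above):

# Rows that claim no work experience but report positive experience years
inconsistent = new_data[
    ~new_data['work_experience'] & (new_data['experience_years'] > 0)
]
print(len(inconsistent))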

Can I evaluate the Synthetic Data?

A very common question when someone starts using SDV to generate synthetic data is: “How good is the data that I just generated?”

In order to answer this question, SDV has a collection of metrics and tools that allow you to compare the real data that you provided and the synthetic data that you generated using SDV or any other tool.
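
As a minimal sketch, assuming the sdv.evaluation module shipped with the same SDV release, you can compute an aggregate similarity score between the two tables:

from sdv.evaluation import evaluate

# Returns an aggregate score; higher values indicate the synthetic data
# is statistically more similar to the real data.
score = evaluate(new_data, data)
print(score)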

You can read more about this in the Synthetic Data Evaluation guide.