In this guide we will go through a series of steps that introduce the main features of the TVAE model, including how to:
Create an instance of TVAE.
Fit the instance to your data.
Generate synthetic versions of your data.
Use TVAE to anonymize PII.
Customize the data transformations to improve the learning process.
Specify hyperparameters to improve the output quality.
The sdv.tabular.TVAE model is based on the VAE-based deep learning data synthesizer presented at the NeurIPS 2019 conference in the paper Modeling Tabular data using Conditional GAN.
Let’s now discover how to learn a dataset and later on generate synthetic data with the same format and statistical properties by using the TVAE class from SDV.
We will start by loading one of our demo datasets, student_placements, which contains information about MBA students who applied for placements during the year 2020.
Warning
In order to follow this guide you need to have sdv installed on your system. If you have not done it yet, please install it now by executing the command pip install sdv in a terminal.
In [1]: from sdv.demo import load_tabular_demo

In [2]: data = load_tabular_demo('student_placements')

In [3]: data.head()
Out[3]:
   student_id gender second_perc high_perc high_spec degree_perc degree_type work_experience experience_years employability_perc mba_spec mba_perc salary placed start_date end_date duration
0 17264 M 67.00 91.00 Commerce 58.00 Sci&Tech False 0 55.0 Mkt&HR 58.80 27000.0 True 2020-07-23 2020-10-12 3.0
1 17265 M 79.33 78.33 Science 77.48 Sci&Tech True 1 86.5 Mkt&Fin 66.28 20000.0 True 2020-01-11 2020-04-09 3.0
2 17266 M 65.00 68.00 Arts 64.00 Comm&Mgmt False 0 75.0 Mkt&Fin 57.80 25000.0 True 2020-01-26 2020-07-13 6.0
3 17267 M 56.00 52.00 Science 52.00 Sci&Tech False 0 66.0 Mkt&HR 59.43 NaN False NaT NaT NaN
4 17268 M 85.80 73.60 Commerce 73.30 Comm&Mgmt False 0 96.8 Mkt&Fin 55.50 42500.0 True 2020-07-04 2020-09-27 3.0
As you can see, this table contains information about students which includes, among other things:
Their id and gender
Their grades and specializations
Their work experience
The salary that they were offered
The duration and dates of their placement
You will notice that there is data with the following characteristics:
There are float, integer, boolean, categorical and datetime values.
There are some variables that have missing data. In particular, all the data related to the placement details is missing in the rows where the student was not placed.
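The missing-data pattern described above can be checked directly. The sketch below is a minimal, pure-Python illustration so that it runs without SDV or pandas installed: the two rows are copied from the data.head() output, None stands in for NaN/NaT, and the helper name placement_details_missing is hypothetical.

```python
# Columns that describe the placement itself.
PLACEMENT_FIELDS = ('salary', 'start_date', 'end_date', 'duration')

# Two rows copied from the data.head() output; None stands in for NaN/NaT.
rows = [
    {'student_id': 17264, 'placed': True, 'salary': 27000.0,
     'start_date': '2020-07-23', 'end_date': '2020-10-12', 'duration': 3.0},
    {'student_id': 17267, 'placed': False, 'salary': None,
     'start_date': None, 'end_date': None, 'duration': None},
]


def placement_details_missing(row):
    """Return True when every placement field of the row is missing."""
    return all(row[field] is None for field in PLACEMENT_FIELDS)


# Students that were not placed should have no placement details at all.
unplaced_ok = all(placement_details_missing(r) for r in rows if not r['placed'])
print(unplaced_ok)
```

On the full table, a pandas expression along the lines of data.loc[~data.placed, list(PLACEMENT_FIELDS)].isna().all().all() should perform the same check, assuming the placed column has a boolean dtype.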
Let us use TVAE to learn this data and then sample synthetic data about new students to see how well the model captures the characteristics indicated above. In order to do this you will need to:
Import the sdv.tabular.TVAE class and create an instance of it.
Call its fit method passing our table.
Call its sample method indicating the number of synthetic rows that you want to generate.
In [4]: from sdv.tabular import TVAE

In [5]: model = TVAE()

In [6]: model.fit(data)
Note
Notice that the model fitting process took care of transforming the different fields using the appropriate Reversible Data Transforms to ensure that the data has a format that the underlying TVAESynthesizer class can handle.
Once the modeling has finished you are ready to generate new synthetic data by calling the sample method of your model, passing the number of rows that you want to generate.
In [7]: new_data = model.sample(200)
This will return a table with the same format as the one which the model was fitted on, but filled with new data which resembles the original one.
In [8]: new_data.head()
Out[8]:
   student_id gender second_perc high_perc high_spec degree_perc degree_type work_experience experience_years employability_perc mba_spec mba_perc salary placed start_date end_date duration
0 17420 F 73.101956 65.227723 Arts 64.477218 Others True 0 91.607941 Mkt&Fin 64.044536 25610.365246 False NaT NaT 6.0
1 17350 F 82.720206 56.908505 Arts 64.209844 Comm&Mgmt False 0 80.835069 Mkt&Fin 58.412540 NaN False NaT 2020-10-07 3.0
2 17314 F 62.598960 63.034191 Arts 62.527316 Comm&Mgmt False 1 80.795522 Mkt&Fin 58.836303 25287.391663 False NaT NaT 3.0
3 17336 F 79.074552 61.143598 Commerce 69.073348 Comm&Mgmt True 0 87.452836 Mkt&Fin 65.356574 25388.356955 False NaT NaT 6.0
4 17317 F 83.496966 62.227672 Science 67.126816 Comm&Mgmt False 0 61.036979 Mkt&HR 59.735393 24061.944418 False NaT 2020-09-29 6.0
You can control the number of rows by passing the number of samples that you want to model.sample(<num_rows>). To test, try model.sample(10000). Note that the original table only had about 200 rows.
In many scenarios it is convenient to generate synthetic versions of your data directly in systems that do not have access to the original data source. For example, you may want to generate testing data on the fly inside a testing environment that does not have access to your production database. In these scenarios, fitting the model with real data every time that you need to generate new data is not feasible, so you will need to fit a model in your production environment, save the fitted model into a file, send this file to the testing environment and then load it there to be able to sample from it.
Let’s see how this process works.
Once you have fitted the model, all you need to do is call its save method passing the name of the file in which you want to save the model. Note that the extension of the filename is not relevant, but we will be using the .pkl extension to highlight that the serialization protocol used is pickle.
In [9]: model.save('my_model.pkl')
This creates a file called my_model.pkl in the same directory in which you are running SDV.
Important
If you inspect the generated file you will notice that its size is much smaller than the size of the data that you used to generate it. This is because the serialized model contains no information about the original data, other than the parameters it needs to generate synthetic versions of it. This means that you can safely share this my_model.pkl file without the risk of disclosing any of your real data!
The file you just generated can be sent over to the system where the synthetic data will be generated. Once it is there, you can load it using the TVAE.load method, and then you are ready to sample new data from the loaded instance:
In [10]: loaded = TVAE.load('my_model.pkl')

In [11]: new_data = loaded.sample(200)
Notice that the system where the model is loaded also needs to have sdv installed; otherwise it will not be able to load the model and use it.
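To make the save-and-load round trip concrete without requiring SDV, the sketch below repeats the same workflow with Python's standard pickle module on a toy stand-in. The functions fit_toy_model and sample_toy_model are hypothetical and only mimic the idea of a fitted model that stores summary parameters rather than rows; they do not reflect how TVAE is implemented. The input values are the employability_perc figures from the first rows of the demo table.

```python
import os
import pickle
import random
import tempfile


def fit_toy_model(values):
    """Hypothetical 'fit': keep only summary parameters, not the rows."""
    return {'low': min(values), 'high': max(values)}


def sample_toy_model(params, num_rows, seed=None):
    """Hypothetical 'sample': draw new values from the stored range."""
    rng = random.Random(seed)
    return [rng.uniform(params['low'], params['high']) for _ in range(num_rows)]


# "Production" side: fit on real values, then serialize the parameters.
params = fit_toy_model([55.0, 86.5, 75.0, 66.0, 96.8])
path = os.path.join(tempfile.mkdtemp(), 'my_model.pkl')
with open(path, 'wb') as f:
    pickle.dump(params, f)

# "Testing" side: deserialize and sample without access to the real values.
with open(path, 'rb') as f:
    loaded = pickle.load(f)
samples = sample_toy_model(loaded, 3, seed=0)
print(loaded, len(samples))
```

Because unpickling can execute arbitrary code, pickle files should only be loaded when they come from a source you trust.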
One of the first things that you may have noticed when looking at the demo data is that there is a student_id column which acts as the primary key of the table, and which is supposed to have unique values. Indeed, if we look at the number of times that each value appears, we see that each of them appears at most once:
In [12]: data.student_id.value_counts().max()
Out[12]: 1
However, if we look at the synthetic data that we generated, we observe that there are some values that appear more than once:
In [13]: new_data[new_data.student_id == new_data.student_id.value_counts().index[0]]
Out[13]:
   student_id gender second_perc high_perc high_spec degree_perc degree_type work_experience experience_years employability_perc mba_spec mba_perc salary placed start_date end_date duration
29 17320 F 65.623244 59.543834 Science 70.931281 Comm&Mgmt False 0 83.719507 Mkt&Fin 60.181747 NaN True NaT 2020-07-16 6.0
39 17320 F 65.144786 60.473493 Arts 64.213700 Others True 0 79.935333 Mkt&Fin 55.221993 23725.230328 False NaT NaT 3.0
67 17320 F 72.877853 64.474467 Arts 70.096766 Sci&Tech False 0 87.189243 Mkt&Fin 57.094211 25341.892094 False NaT 2020-10-29 3.0
72 17320 F 69.038406 60.437968 Science 65.937777 Sci&Tech True 0 79.523541 Mkt&Fin 56.612522 24864.864828 False NaT NaT 12.0
104 17320 F 77.798796 66.233184 Science 67.394602 Comm&Mgmt False 0 61.984805 Mkt&Fin 59.399406 25646.532785 False NaT NaT 12.0
122 17320 F 85.326715 69.665371 Commerce 62.602816 Comm&Mgmt False 0 84.138310 Mkt&HR 58.785227 24575.512262 False NaT NaT 12.0
127 17320 F 73.319358 65.961327 Science 64.781834 Comm&Mgmt True 1 82.530500 Mkt&HR 59.582095 24353.443680 False NaT NaT 3.0
190 17320 F 70.502959 64.973758 Science 65.669293 Comm&Mgmt True 0 84.765201 Mkt&HR 59.017874 26143.039221 True NaT NaT 12.0
198 17320 F 73.058822 61.412487 Arts 60.221149 Comm&Mgmt True 2 60.301887 Mkt&HR 60.399314 27672.095195 False NaT NaT 3.0
This happens because the model was never told that student_id had to be unique, so when it generates new data it will produce collisions sooner or later. In order to solve this, we can pass the argument primary_key to our model when we create it, indicating the name of the column that is the index of the table.
In [14]: model = TVAE(
   ....:     primary_key='student_id'
   ....: )
   ....:

In [15]: model.fit(data)

In [16]: new_data = model.sample(200)

In [17]: new_data.head()
Out[17]:
   student_id gender second_perc high_perc high_spec degree_perc degree_type work_experience experience_years employability_perc mba_spec mba_perc salary placed start_date end_date duration
0 0 M 64.092492 62.921274 Commerce 71.771255 Others True 0 80.381349 Mkt&Fin 56.851139 NaN False 2020-01-07 NaT 3.0
1 1 M 67.551962 62.388292 Commerce 61.188514 Others True 0 71.025927 Mkt&Fin 71.266530 NaN False 2020-01-15 NaT 6.0
2 2 M 53.206351 61.427396 Science 68.673206 Others True 0 77.744350 Mkt&HR 64.093158 53775.470963 False 2020-06-25 NaT NaN
3 3 M 56.728756 64.797254 Science 65.394216 Comm&Mgmt False 0 73.843538 Mkt&HR 61.316548 28743.240149 False NaT NaT 3.0
4 4 M 64.996623 60.546220 Commerce 66.972814 Others False 0 79.015321 Mkt&Fin 59.622942 28615.130299 False 2020-07-16 NaT 6.0
As a result, the model will learn that this column must be unique and generate a unique sequence of values for the column:
In [18]: new_data.student_id.value_counts().max()
Out[18]: 1
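The uniqueness check used in this section, value_counts().max(), returns how many times the most frequent value appears. The sketch below reproduces that computation in plain Python so it can be run without SDV or pandas; the helper name max_occurrences and the ID lists are illustrative values, not output from the model.

```python
from collections import Counter


def max_occurrences(values):
    """Return how many times the most frequent value appears (0 if empty)."""
    counts = Counter(values)
    return max(counts.values(), default=0)


# Illustrative IDs: repeated values, as sampled without a primary key.
ids_without_key = [17320, 17264, 17320, 17350, 17320]
# Illustrative IDs: a unique sequence, as sampled with primary_key set.
ids_with_key = [0, 1, 2, 3, 4]

collisions = max_occurrences(ids_without_key)
unique = max_occurrences(ids_with_key)
print(collisions, unique)
```

A result of 1 means that every value is unique, which is the condition a primary key column has to satisfy.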
There will be many cases where the data will contain Personally Identifiable Information which we cannot disclose. In these cases, we will want our Tabular Models to replace the information within these fields with fake, simulated data that looks similar to the real one but does not contain any of the original values.
Let’s load a new dataset that contains a PII field, the student_placements_pii demo, and try to generate synthetic versions of it that do not contain any of the PII fields.
The student_placements_pii dataset is a modified version of the student_placements dataset with one new field, address, which contains PII about the students. Notice that this additional address field has been simulated and does not correspond to data from the real users.
In [19]: data_pii = load_tabular_demo('student_placements_pii')

In [20]: data_pii.head()
Out[20]:
   student_id address gender second_perc high_perc high_spec degree_perc degree_type work_experience experience_years employability_perc mba_spec mba_perc salary placed start_date end_date duration
0 17264 70304 Baker Turnpike\nEricborough, MS 15086 M 67.00 91.00 Commerce 58.00 Sci&Tech False 0 55.0 Mkt&HR 58.80 27000.0 True 2020-07-23 2020-10-12 3.0
1 17265 805 Herrera Avenue Apt. 134\nMaryview, NJ 36510 M 79.33 78.33 Science 77.48 Sci&Tech True 1 86.5 Mkt&Fin 66.28 20000.0 True 2020-01-11 2020-04-09 3.0
2 17266 3702 Bradley Island\nNorth Victor, FL 12268 M 65.00 68.00 Arts 64.00 Comm&Mgmt False 0 75.0 Mkt&Fin 57.80 25000.0 True 2020-01-26 2020-07-13 6.0
3 17267 Unit 0879 Box 3878\nDPO AP 42663 M 56.00 52.00 Science 52.00 Sci&Tech False 0 66.0 Mkt&HR 59.43 NaN False NaT NaT NaN
4 17268 96493 Kelly Canyon Apt. 145\nEast Steven, NC 3... M 85.80 73.60 Commerce 73.30 Comm&Mgmt False 0 96.8 Mkt&Fin 55.50 42500.0 True 2020-07-04 2020-09-27 3.0
If we use our tabular model on this new data we will see how the synthetic data that it generates discloses the addresses from the real students:
In [21]: model = TVAE(
   ....:     primary_key='student_id',
   ....: )
   ....:

In [22]: model.fit(data_pii)

In [23]: new_data_pii = model.sample(200)

In [24]: new_data_pii.head()
Out[24]:
   student_id address gender second_perc high_perc high_spec degree_perc degree_type work_experience experience_years employability_perc mba_spec mba_perc salary placed start_date end_date duration
0 0 8368 Sarah Well\nNewmanville, WA 69934 M 69.601288 64.089683 Commerce 61.850612 Others False 0 58.315535 Mkt&Fin 60.667745 NaN True NaT 2020-04-16 3.0
1 1 49460 Jeremy Unions Suite 915\nLake Coryboroug... M 76.357697 47.815611 Commerce 67.809625 Others True 0 76.304454 Mkt&Fin 58.400210 NaN False NaT 2020-04-08 3.0
2 2 PSC 2994, Box 1804\nAPO AE 05822 F 69.174713 58.372999 Science 64.464850 Others True 1 60.039985 Mkt&Fin 59.470149 29065.171805 False 2020-02-22 2020-04-13 12.0
3 3 PSC 2994, Box 1804\nAPO AE 05822 M 77.966934 53.600451 Arts 70.515518 Others True 0 69.310361 Mkt&Fin 69.100697 NaN False 2020-02-28 NaT 12.0
4 4 Unit 8706 Box 8587\nDPO AP 31555 F 78.766469 64.966222 Commerce 64.597489 Others True 0 76.562658 Mkt&Fin 57.086433 NaN False 2020-03-02 2020-09-05 3.0
More specifically, we can see how all the addresses that have been generated actually come from the original dataset:
In [25]: new_data_pii.address.isin(data_pii.address).sum()
Out[25]: 200
In order to solve this, we can pass an additional argument anonymize_fields to our model when we create the instance. This anonymize_fields argument will need to be a dictionary that contains:
The name of the field that we want to anonymize.
The category of the field that we want to use when we generate fake values for it.
The complete list of possible categories can be seen in the Faker Providers page, and it contains a huge list of concepts such as:
name
country
city
ssn
credit_card_number
credit_card_expire
credit_card_security_code
email
telephone
…
In this case, since the field is an address, we will pass a dictionary indicating the category address:
In [26]: model = TVAE(
   ....:     primary_key='student_id',
   ....:     anonymize_fields={
   ....:         'address': 'address'
   ....:     }
   ....: )
   ....:

In [27]: model.fit(data_pii)
As a result, we can see how the real address values have been replaced by other fake addresses:
In [28]: new_data_pii = model.sample(200)

In [29]: new_data_pii.head()
Out[29]:
   student_id address gender second_perc high_perc high_spec degree_perc degree_type work_experience experience_years employability_perc mba_spec mba_perc salary placed start_date end_date duration
0 0 Unit 7402 Box 9999\nDPO AP 80379 F 70.300066 74.766468 Arts 71.038388 Comm&Mgmt True 0 78.357462 Mkt&Fin 54.975943 NaN False 2020-01-20 NaT 6.0
1 1 444 Charlene Hill Suite 347\nWest Jamiestad, M... F 75.595756 75.677733 Commerce 67.918597 Comm&Mgmt False 1 83.935857 Mkt&HR 58.926559 25182.200417 False NaT NaT NaN
2 2 14087 Wang Viaduct Suite 155\nAlexberg, FL 60308 F 74.162473 63.649446 Arts 67.439641 Comm&Mgmt True 1 84.652581 Mkt&HR 67.532252 NaN False NaT NaT 6.0
3 3 847 Herman Land Apt. 224\nMatthewchester, NV 9... F 76.778371 68.227541 Science 73.236282 Others True 0 77.974505 Mkt&Fin 66.508842 NaN False 2020-01-30 NaT NaN
4 4 4752 Kyle Shore\nLopezport, OH 71000 F 74.035485 63.382243 Science 71.893993 Others True 0 82.159270 Mkt&Fin 59.808962 NaN False 2020-01-27 NaT 6.0
This means that none of the original addresses can be found in the sampled data:
In [30]: data_pii.address.isin(new_data_pii.address).sum()
Out[30]: 0
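The isin(...).sum() checks in this section count how many values of one column also appear in another. As a minimal sketch of the same idea in plain Python (runnable without SDV or pandas), the hypothetical helper count_leaked below counts synthetic entries that reuse a real value. The two real addresses are copied from the demo table; the anonymized entries are made-up placeholders.

```python
def count_leaked(real_values, synthetic_values):
    """Count synthetic entries whose value also appears in the real data."""
    real = set(real_values)
    return sum(1 for value in synthetic_values if value in real)


real_addresses = [
    '70304 Baker Turnpike\nEricborough, MS 15086',
    'Unit 0879 Box 3878\nDPO AP 42663',
]

# Without anonymization the model can only reuse addresses it has seen.
leaky_sample = [
    'Unit 0879 Box 3878\nDPO AP 42663',
    '70304 Baker Turnpike\nEricborough, MS 15086',
]
# Placeholder values standing in for freshly generated fake addresses.
anonymized_sample = ['1 Example Street\nSampletown, ZZ 00000', '2 Example Street\nSampletown, ZZ 00000']

leaked_before = count_leaked(real_addresses, leaky_sample)
leaked_after = count_leaked(real_addresses, anonymized_sample)
print(leaked_before, leaked_after)
```

A count of zero shows that no real value was copied verbatim; it does not by itself establish that the rest of the synthetic table is free of sensitive information.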
If you look closely at the data you may notice that some properties were not completely captured by the model. For example, you may have seen that sometimes the model produces an experience_years number greater than 0 while also indicating that work_experience is False. These types of properties are what we call Constraints and can also be handled using SDV. For further details about them please visit the Handling Constraints guide.
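Before reaching for Constraints, it can help to measure how often the rule is actually broken. The sketch below is a plain-Python illustration, runnable without SDV or pandas; the helper name find_violations and the three rows are hypothetical examples, not model output.

```python
def find_violations(rows):
    """Return rows that report experience years without work experience."""
    return [
        row for row in rows
        if not row['work_experience'] and row['experience_years'] > 0
    ]


# Illustrative rows; only the last one breaks the rule.
rows = [
    {'student_id': 0, 'work_experience': True, 'experience_years': 1},
    {'student_id': 1, 'work_experience': False, 'experience_years': 0},
    {'student_id': 2, 'work_experience': False, 'experience_years': 1},
]

violations = find_violations(rows)
print([row['student_id'] for row in violations])
```

On a sampled DataFrame, a filter such as new_data[(~new_data.work_experience) & (new_data.experience_years > 0)] should select the same kind of rows, assuming work_experience has a boolean dtype.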
A very common question when someone starts using SDV to generate synthetic data is: “How good is the data that I just generated?”
In order to answer this question, SDV has a collection of metrics and tools that allow you to compare the real data that you provided with the synthetic data that you generated using SDV or any other tool.
You can read more about this in the Synthetic Data Evaluation guide.
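As a rough illustration of what comparing real and synthetic data can look like, the sketch below measures the gap between the mean of one column in both tables. This is a crude sanity check written for this guide, not one of SDV's metrics, and the helper name mean_gap is hypothetical. The values are the mba_perc figures from the first five real rows and the first five rows of the first synthetic sample shown earlier.

```python
from statistics import mean

# mba_perc values from data.head() and from the first new_data.head().
real_mba_perc = [58.80, 66.28, 57.80, 59.43, 55.50]
synthetic_mba_perc = [64.044536, 58.412540, 58.836303, 65.356574, 59.735393]


def mean_gap(real, synthetic):
    """Absolute difference between the means of two numeric columns."""
    return abs(mean(real) - mean(synthetic))


gap = mean_gap(real_mba_perc, synthetic_mba_perc)
print(round(gap, 3))
```

Five rows are far too few to draw conclusions from; a meaningful evaluation would use the full tables and look at more than a single summary statistic.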