In this guide we will go through a series of steps that will let you discover functionalities of the TVAE model, including how to:
Create an instance of TVAE.
Fit the instance to your data.
Generate synthetic versions of your data.
Use TVAE to anonymize PII information.
Specify hyperparameters to improve the output quality.
The sdv.tabular.TVAE model is based on the VAE-based Deep Learning data synthesizer presented at the NeurIPS 2019 conference in the paper Modeling Tabular data using Conditional GAN.
Let’s now discover how to learn a dataset and later on generate synthetic data with the same format and statistical properties by using the TVAE class from SDV.
We will start by loading one of our demo datasets, the student_placements, which contains information about MBA students that applied for placements during the year 2020.
Warning
In order to follow this guide you need to have sdv installed on your system. If you have not done it yet, please install it now by executing the command pip install sdv in a terminal.
In [1]: from sdv.demo import load_tabular_demo

In [2]: data = load_tabular_demo('student_placements')

In [3]: data.head()
Out[3]:
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec  mba_perc   salary placed start_date   end_date  duration
0       17264      M        67.00      91.00  Commerce        58.00    Sci&Tech            False                 0                55.0   Mkt&HR     58.80  27000.0   True 2020-07-23 2020-10-12       3.0
1       17265      M        79.33      78.33   Science        77.48    Sci&Tech             True                 1                86.5  Mkt&Fin     66.28  20000.0   True 2020-01-11 2020-04-09       3.0
2       17266      M        65.00      68.00      Arts        64.00   Comm&Mgmt            False                 0                75.0  Mkt&Fin     57.80  25000.0   True 2020-01-26 2020-07-13       6.0
3       17267      M        56.00      52.00   Science        52.00    Sci&Tech            False                 0                66.0   Mkt&HR     59.43      NaN  False        NaT        NaT       NaN
4       17268      M        85.80      73.60  Commerce        73.30   Comm&Mgmt            False                 0                96.8  Mkt&Fin     55.50  42500.0   True 2020-07-04 2020-09-27       3.0
As you can see, this table contains information about students which includes, among other things:
Their id and gender
Their grades and specializations
Their work experience
The salary that they were offered
The duration and dates of their placement
You will notice that there is data with the following characteristics:
There are float, integer, boolean, categorical and datetime values.
There are some variables that have missing data. In particular, all the data related to the placement details is missing in the rows where the student was not placed.
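You can double-check these characteristics with a couple of standard pandas calls. This is just an illustrative sanity check, not part of the TVAE workflow:

data.dtypes        # a mix of integer, float, boolean, object and datetime columns
data.isna().sum()  # missing values concentrated in salary, start_date, end_date and duration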
Let us use TVAE to learn this data and then sample synthetic data about new students to see how well the model captures the characteristics indicated above. In order to do this you will need to:
Import the sdv.tabular.TVAE class and create an instance of it.
Call its fit method passing our table.
Call its sample method indicating the number of synthetic rows that you want to generate.
In [4]: from sdv.tabular import TVAE

In [5]: model = TVAE()

In [6]: model.fit(data)
Note
Notice that the model fitting process took care of transforming the different fields using the appropriate Reversible Data Transforms to ensure that the data has a format that the underlying TVAE class can handle.
Once the modeling has finished you are ready to generate new synthetic data by calling the sample method of your model and passing the number of rows that you want to generate. The number of rows (num_rows) is a required parameter.
In [7]: new_data = model.sample(num_rows=200)
This will return a table with the same structure as the one the model was fitted on, but filled with new data that resembles the original.
In [8]: new_data.head()
Out[8]:
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec  mba_perc   salary placed start_date   end_date  duration
0       17403      M        54.40      58.43   Science        68.71   Comm&Mgmt             True                 0               54.74   Mkt&HR     59.40  27485.0   True 2020-01-23 2021-02-09       3.0
1       17326      M        66.10      64.85  Commerce        66.59   Comm&Mgmt            False                 0               57.81  Mkt&Fin     52.33  24878.0   True 2020-01-13 2020-07-30       6.0
2       17304      M        63.41      66.88  Commerce        76.96   Comm&Mgmt             True                 1               84.54  Mkt&Fin     70.11  21977.0   True 2020-01-14 2020-11-26      12.0
3       17436      M        62.18      72.81   Science        61.40   Comm&Mgmt             True                 1               82.41  Mkt&Fin     67.69  23368.0   True 2020-01-13 2020-07-23       6.0
4       17383      M        65.87      65.11   Science        65.77    Sci&Tech             True                 1               67.51  Mkt&Fin     62.93  30350.0   True 2020-01-08 2020-12-25       3.0
There are a number of other parameters in this method that you can use to optimize the process of generating synthetic data. Use output_file_path to write results directly to a CSV file, batch_size to break the sampling into smaller pieces and track their progress, and randomize_samples to control whether the same synthetic data is generated on every call. See the API section (https://sdv.dev/SDV/api_reference/tabular/api/sdv.tabular.ctgan.TVAE.sample) for more details.
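For instance, a minimal sketch combining these arguments might look like the following (the CSV filename is illustrative):

new_data = model.sample(
    num_rows=200,
    output_file_path='synthetic_students.csv',  # write the sampled rows to this CSV file
    batch_size=50,                              # sample in chunks of 50 rows and show progress per batch
    randomize_samples=False,                    # use a fixed seed so repeated calls return the same data
)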
In many scenarios it will be convenient to generate synthetic versions of your data directly in systems that do not have access to the original data source. For example, you may want to generate testing data on the fly inside a testing environment that does not have access to your production database. In these scenarios, fitting the model with real data every time you need to generate new data is not feasible, so you will need to fit a model in your production environment, save the fitted model into a file, send this file to the testing environment, and then load it there to sample from it.
Let’s see how this process works.
Once you have fitted the model, all you need to do is call its save method passing the name of the file in which you want to save the model. Note that the extension of the filename is not relevant, but we will be using the .pkl extension to highlight that the serialization protocol used is cloudpickle.
In [9]: model.save('my_model.pkl')
This will have created a file called my_model.pkl in the same directory in which you are running SDV.
Important
If you inspect the generated file you will notice that its size is much smaller than the size of the data that you used to generate it. This is because the serialized model contains no information about the original data, other than the parameters it needs to generate synthetic versions of it. This means that you can safely share this my_model.pkl file without the risk of disclosing any of your real data!
The file you just generated can be sent over to the system where the synthetic data will be generated. Once it is there, you can load it using the TVAE.load method, and then you are ready to sample new data from the loaded instance:
In [10]: loaded = TVAE.load('my_model.pkl')

In [11]: new_data = loaded.sample(num_rows=200)
Notice that the system where the model is loaded also needs to have sdv installed, otherwise it will not be able to load the model and use it.
One of the first things that you may have noticed when looking at the demo data is that there is a student_id column which acts as the primary key of the table, and which is supposed to have unique values. Indeed, if we look at the number of times that each value appears, we see that all of them appear at most once:
In [12]: data.student_id.value_counts().max()
Out[12]: 1
However, if we look at the synthetic data that we generated, we observe that there are some values that appear more than once:
In [13]: new_data[new_data.student_id == new_data.student_id.value_counts().index[0]]
Out[13]:
     student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec  mba_perc   salary placed start_date   end_date  duration
21        17331      M        54.46      68.58  Commerce        74.01   Comm&Mgmt             True                 0               64.33  Mkt&Fin     65.77      NaN  False        NaT        NaT       NaN
26        17331      M        87.42      87.81  Commerce        63.71   Comm&Mgmt             True                 1               94.22  Mkt&Fin     61.18  25632.0   True 2020-01-23 2020-04-13       3.0
68        17331      M        68.27      78.71   Science        80.63   Comm&Mgmt             True                 1               97.74  Mkt&Fin     69.17  23282.0   True 2020-01-20 2020-12-28      12.0
159       17331      M        65.85      66.74  Commerce        62.25   Comm&Mgmt            False                 0               80.43  Mkt&Fin     70.50  29989.0   True 2020-01-11 2020-08-09       5.0
160       17331      M        75.06      68.42  Commerce        70.26   Comm&Mgmt             True                 1               88.32  Mkt&Fin     63.24  25797.0   True 2020-01-16 2020-10-10       3.0
169       17331      M        68.11      82.98  Commerce        66.51   Comm&Mgmt            False                 0               97.03  Mkt&Fin     68.51  29203.0   True 2020-01-14 2020-08-08       6.0
This happens because the model was never told that the student_id had to be unique, so when it generates new data it will sooner or later produce collisions. In order to solve this, we can pass the argument primary_key to our model when we create it, indicating the name of the column that is the index of the table.
In [14]: model = TVAE(
   ....:     primary_key='student_id'
   ....: )

In [15]: model.fit(data)

In [16]: new_data = model.sample(200)

In [17]: new_data.head()
Out[17]:
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec  mba_perc   salary placed start_date   end_date  duration
0           0      M        65.65      79.47  Commerce        63.34   Comm&Mgmt            False                 0               58.09  Mkt&Fin     63.27  24486.0   True 2020-01-18 2020-10-06       3.0
1           1      M        69.33      81.15  Commerce        68.00   Comm&Mgmt            False                 0               73.66  Mkt&Fin     59.54  22551.0   True 2020-01-18 2020-07-09       6.0
2           2      M        72.56      80.71  Commerce        62.48   Comm&Mgmt            False                 0               50.00  Mkt&Fin     57.85  24411.0   True 2020-01-17 2020-08-03       3.0
3           3      F        65.29      57.06   Science        70.42   Comm&Mgmt            False                 0               66.57  Mkt&Fin     56.60  28789.0   True 2020-01-18 2020-08-09       6.0
4           4      M        73.12      72.42  Commerce        59.97   Comm&Mgmt             True                 0               57.73  Mkt&Fin     62.47  25617.0   True 2020-01-07 2020-07-16       3.0
As a result, the model will learn that this column must be unique and generate a unique sequence of values for the column:
In [18]: new_data.student_id.value_counts().max()
Out[18]: 1
There will be many cases where the data will contain Personally Identifiable Information which we cannot disclose. In these cases, we will want our Tabular Models to replace the information within these fields with fake, simulated data that looks similar to the real one but does not contain any of the original values.
Let’s load a new dataset that contains a PII field, the student_placements_pii demo, and try to generate synthetic versions of it that do not contain any of the PII fields.
The student_placements_pii dataset is a modified version of the student_placements dataset with one new field, address, which contains PII information about the students. Notice that this additional address field has been simulated and does not correspond to data from the real users.
In [19]: data_pii = load_tabular_demo('student_placements_pii')

In [20]: data_pii.head()
Out[20]:
   student_id                                            address gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec  mba_perc   salary placed start_date   end_date  duration
0       17264       70304 Baker Turnpike\nEricborough, MS 15086      M        67.00      91.00  Commerce        58.00    Sci&Tech            False                 0                55.0   Mkt&HR     58.80  27000.0   True 2020-07-23 2020-10-12       3.0
1       17265    805 Herrera Avenue Apt. 134\nMaryview, NJ 36510      M        79.33      78.33   Science        77.48    Sci&Tech             True                 1                86.5  Mkt&Fin     66.28  20000.0   True 2020-01-11 2020-04-09       3.0
2       17266        3702 Bradley Island\nNorth Victor, FL 12268      M        65.00      68.00      Arts        64.00   Comm&Mgmt            False                 0                75.0  Mkt&Fin     57.80  25000.0   True 2020-01-26 2020-07-13       6.0
3       17267                   Unit 0879 Box 3878\nDPO AP 42663      M        56.00      52.00   Science        52.00    Sci&Tech            False                 0                66.0   Mkt&HR     59.43      NaN  False        NaT        NaT       NaN
4       17268  96493 Kelly Canyon Apt. 145\nEast Steven, NC 3...      M        85.80      73.60  Commerce        73.30   Comm&Mgmt            False                 0                96.8  Mkt&Fin     55.50  42500.0   True 2020-07-04 2020-09-27       3.0
If we use our tabular model on this new data we will see how the synthetic data that it generates discloses the addresses from the real students:
In [21]: model = TVAE(
   ....:     primary_key='student_id',
   ....: )

In [22]: model.fit(data_pii)

In [23]: new_data_pii = model.sample(200)

In [24]: new_data_pii.head()
Out[24]:
   student_id                                            address gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec  mba_perc   salary placed start_date   end_date  duration
0           0    33435 Vazquez Via\nSouth Kristinaberg, FL 98070      M        71.56      60.40   Science        70.28   Comm&Mgmt             True                 1               66.51  Mkt&Fin     60.93  28127.0   True 2020-01-21 2020-08-15       6.0
1           1  9812 Tiffany Alley Suite 788\nLopezburgh, DC 2...      M        52.99      79.37  Commerce        67.48   Comm&Mgmt            False                 0               64.38  Mkt&Fin     68.60      NaN  False        NaT        NaT       NaN
2           2  557 Stephanie Knolls Apt. 110\nLake Rachel, MN...      F        63.19      62.42  Commerce        65.71   Comm&Mgmt            False                 0               80.52   Mkt&HR     65.45      NaN  False        NaT        NaT       NaN
3           3  1729 Thomas Islands Apt. 583\nSanchezview, ND ...      M        66.77      66.97  Commerce        73.70   Comm&Mgmt            False                 0               60.93  Mkt&Fin     60.66  28801.0   True 2020-03-07 2020-08-13       3.0
4           4  3534 Martinez Parks Suite 682\nLake Anthony, N...      M        59.40      53.69  Commerce        63.94   Comm&Mgmt            False                 0               60.59   Mkt&HR     61.26      NaN  False        NaT        NaT       NaN
More specifically, we can see how all the addresses that have been generated actually come from the original dataset:
In [25]: new_data_pii.address.isin(data_pii.address).sum()
Out[25]: 200
In order to solve this, we can pass an additional argument anonymize_fields to our model when we create the instance. This anonymize_fields argument will need to be a dictionary that contains:
The name of the field that we want to anonymize.
The category of the field that we want to use when we generate fake values for it.
The complete list of possible categories can be seen in the Faker Providers page, which contains a huge list of concepts such as:
name
country
city
ssn
credit_card_number
credit_card_expire
credit_card_security_code
email
telephone
…
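As an illustrative sketch, several fields can be anonymized at once by listing them all in the dictionary. Note that the name and email columns below are hypothetical and do not exist in the demo dataset:

model = TVAE(
    primary_key='student_id',
    anonymize_fields={
        'name': 'name',    # hypothetical column, would be replaced with fake names
        'email': 'email',  # hypothetical column, would be replaced with fake emails
    }
)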
In this case, since the field is an address, we will pass a dictionary indicating the category address:
In [26]: model = TVAE(
   ....:     primary_key='student_id',
   ....:     anonymize_fields={
   ....:         'address': 'address'
   ....:     }
   ....: )

In [27]: model.fit(data_pii)
As a result, we can see how the real address values have been replaced by other fake addresses:
In [28]: new_data_pii = model.sample(200)

In [29]: new_data_pii.head()
Out[29]:
   student_id                                            address gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec  mba_perc   salary placed start_date   end_date  duration
0           0   3228 Orozco Forks Apt. 136\nJoneshaven, GA 79281      M        64.57      75.45  Commerce        64.24   Comm&Mgmt            False                 0               54.38   Mkt&HR     58.11  23228.0   True 2020-01-19 2020-10-22       3.0
1           1            72960 Calvin Road\nKathyshire, CA 39426      M        74.32      69.91  Commerce        62.89   Comm&Mgmt            False                 0               83.59  Mkt&Fin     57.29  27194.0   True 2020-01-14 2020-07-20       3.0
2           2   8979 Cody Inlet Apt. 429\nGriffinburgh, ME 86950      M        60.74      49.07  Commerce        63.31   Comm&Mgmt            False                 0               51.39   Mkt&HR     58.92      NaN  False        NaT        NaT       NaN
3           3  6324 Smith Junctions Suite 891\nTarafort, NJ 2...      M        73.00      76.30  Commerce        71.56   Comm&Mgmt            False                 0               62.08  Mkt&Fin     66.44  29463.0   True 2020-01-23 2020-08-04       3.0
4           4  782 James Manors Apt. 160\nNew Angelaport, OR ...      M        71.79      61.81   Science        61.46   Comm&Mgmt            False                 0               57.33   Mkt&HR     61.04      NaN  False        NaT        NaT       NaN
Which means that none of the original addresses can be found in the sampled data:
In [30]: data_pii.address.isin(new_data_pii.address).sum()
Out[30]: 0
Now that we have discovered the basics, let’s go over a few more advanced usage examples and see the different arguments that we can pass to our TVAE model in order to customize it to our needs.
Apart from the common Tabular Model arguments, TVAE has a number of additional hyperparameters that control its learning behavior and can impact the performance of the model, both in terms of the quality of the generated data and the computational time.
epochs and batch_size: these arguments control the number of iterations that the model will perform to optimize its parameters, as well as the number of samples used in each step. Their default values are 300 and 500 respectively, and batch_size always needs to be a multiple of 10.
These hyperparameters have a very direct effect on how long the training process takes, but also on the quality of the generated data. For new datasets, you might want to start by setting low values on both of them to see how long the training takes on your data, and later increase them to acceptable values in order to improve the results.
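For example, a minimal sketch of such a quick exploratory run could look like this (the exact values are just a starting point, not a recommendation):

model = TVAE(
    epochs=10,       # a small number of epochs to gauge training time first
    batch_size=100,  # must be a multiple of 10
)
model.fit(data)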
log_frequency: Whether to use log frequency of categorical levels in conditional sampling. It defaults to True. This argument affects how the model processes the frequencies of the categorical values that are used to condition the rest of the values. In some cases, changing it to False could lead to better performance.
embedding_dim (int): Size of the random sample passed to the Generator. Defaults to 128.
compress_dims (tuple or list of ints): Size of each hidden layer in the encoder. Defaults to (128, 128).
decompress_dims (tuple or list of ints): Size of each hidden layer in the decoder. Defaults to (128, 128).
l2scale (float): Regularization term. Defaults to 1e-5.
batch_size (int): Number of data samples to process in each step.
loss_factor (int): Multiplier for the reconstruction error. Defaults to 2.
cuda (bool or str): If True, use CUDA. If a str, use the indicated device. If False, do not use cuda at all.
Notice that the value that you set on the batch_size argument must always be a multiple of 10!
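As a small sketch of the cuda argument described above:

model = TVAE(cuda=False)     # do not use CUDA; train on CPU
model = TVAE(cuda='cuda:0')  # or pass a device string to use the indicated GPU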
As an example, we will try to fit the TVAE model slightly increasing the number of epochs and adding one additional layer to the encoder and decoder networks.
Before we start, we will evaluate the quality of the previously generated data using the sdv.evaluation.evaluate function:
In [31]: from sdv.evaluation import evaluate

In [32]: evaluate(new_data, data)
Out[32]: 0.7653505767533744
Afterwards, we create a new instance of the TVAE model with the hyperparameter values that we want to use:
In [33]: model = TVAE(
   ....:     primary_key='student_id',
   ....:     epochs=500,
   ....:     compress_dims=(256, 256, 256),
   ....:     decompress_dims=(256, 256, 256)
   ....: )
And fit it to our data.
In [34]: model.fit(data)
Finally, we are ready to generate new data and evaluate the results.
In [35]: new_data = model.sample(len(data))

In [36]: evaluate(new_data, data)
Out[36]: 0.7652229212729833
As we can see, these modifications changed the obtained results only slightly and did not introduce any dramatic change in performance.
As the name implies, conditional sampling allows us to sample from a conditional distribution using the TVAE model, which means we can generate only values that satisfy certain conditions. These conditional values can be passed to the sample_conditions method as a list of sdv.sampling.Condition objects or to the sample_remaining_columns method as a dataframe.
When specifying a sdv.sampling.Condition object, we can pass in the desired conditions as a dictionary, as well as specify the number of desired rows for that condition.
In [37]: from sdv.sampling import Condition

In [38]: condition = Condition({
   ....:     'gender': 'M'
   ....: }, num_rows=5)

In [39]: model.sample_conditions(conditions=[condition])
Out[39]:
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec  mba_perc   salary placed start_date   end_date  duration
0           0      M        57.66      66.40  Commerce        55.21   Comm&Mgmt            False                 0               56.41   Mkt&HR     59.92      NaN  False        NaT        NaT       NaN
1           1      M        57.94      73.50   Science        59.44   Comm&Mgmt             True                 1               83.87  Mkt&Fin     62.43  23151.0   True 2020-01-20 2020-10-12       3.0
2           2      M        59.41      76.95  Commerce        59.33   Comm&Mgmt            False                 0               75.30  Mkt&Fin     56.20  24154.0   True 2020-01-09 2020-08-06      12.0
3           3      M        60.63      82.15  Commerce        62.19   Comm&Mgmt            False                 0               69.39   Mkt&HR     53.13      NaN  False        NaT        NaT       NaN
4           4      M        49.96      47.27  Commerce        59.63   Comm&Mgmt            False                 0               62.30   Mkt&HR     59.92      NaN  False        NaT        NaT       NaN
It’s also possible to condition on multiple columns, such as gender = M and experience_years = 0.
In [40]: condition = Condition({
   ....:     'gender': 'M',
   ....:     'experience_years': 0
   ....: }, num_rows=5)

In [41]: model.sample_conditions(conditions=[condition])
Out[41]:
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec  mba_perc   salary placed start_date   end_date  duration
0           0      M        57.80      56.26  Commerce        59.09   Comm&Mgmt            False                 0               82.39   Mkt&HR     58.57      NaN  False        NaT        NaT       NaN
1           1      M        61.69      72.30  Commerce        64.74   Comm&Mgmt            False                 0               91.38  Mkt&Fin     57.26  22267.0   True 2020-01-20 2020-11-28       3.0
2           2      M        57.58      76.01  Commerce        66.85   Comm&Mgmt            False                 0               61.41  Mkt&Fin     56.23  21814.0   True 2020-01-20 2020-08-02       6.0
3           3      M        65.25      57.48  Commerce        69.30   Comm&Mgmt            False                 0               54.70   Mkt&HR     51.21  29446.0   True 2020-03-12 2020-11-12       3.0
4           4      M        57.49      52.80   Science        59.08   Comm&Mgmt            False                 0               82.37   Mkt&HR     57.17  28011.0   True 2020-01-19 2020-10-28       3.0
In the sample_remaining_columns method, conditions is passed as a dataframe. In that case, the model will generate one sample for each row of the dataframe, in the same order. Since the model already knows how many samples to generate, passing num_rows as a parameter is unnecessary. For example, if we want to generate three samples where gender = M and three samples with gender = F, we can do the following:
In [42]: import pandas as pd

In [43]: conditions = pd.DataFrame({
   ....:     'gender': ['M', 'M', 'M', 'F', 'F', 'F'],
   ....: })

In [44]: model.sample_remaining_columns(conditions)
Out[44]:
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec  mba_perc   salary placed start_date   end_date  duration
0           0      M        65.18      77.35  Commerce        78.44   Comm&Mgmt             True                 1               83.66  Mkt&Fin     61.97  20647.0   True 2020-01-09 2021-01-01      12.0
1           1      M        60.00      58.13  Commerce        71.00   Comm&Mgmt            False                 0               57.32   Mkt&HR     56.60      NaN  False        NaT        NaT       NaN
2           2      M        56.13      66.75  Commerce        55.95   Comm&Mgmt            False                 0               69.41   Mkt&HR     58.17      NaN  False        NaT        NaT       NaN
3           3      F        58.18      70.09  Commerce        66.65   Comm&Mgmt            False                 0               63.27   Mkt&HR     61.97      NaN  False        NaT        NaT       NaN
4           4      F        56.80      73.19  Commerce        72.12   Comm&Mgmt            False                 0               78.70   Mkt&HR     59.50      NaN  False        NaT        NaT       NaN
5           5      F        60.01      75.07  Commerce        54.57   Comm&Mgmt            False                 0               50.00   Mkt&HR     58.79  28935.0   True 2020-03-09 2020-08-22       3.0
TVAE also supports conditioning on continuous values, as long as the values are within the range seen during training. For example, if all the values of a column in the dataset are between 0 and 1, TVAE will not be able to set this value to 1000.
In [45]: condition = Condition({
   ....:     'degree_perc': 70.0
   ....: }, num_rows=5)

In [46]: model.sample_conditions(conditions=[condition])
Out[46]:
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec  mba_perc   salary placed start_date   end_date  duration
0          14      M        50.75      64.81  Commerce         70.0   Comm&Mgmt             True                 1               53.65  Mkt&Fin     64.80  25580.0   True 2020-01-18 2020-08-06       6.0
1          16      M        68.56      57.73  Commerce         70.0   Comm&Mgmt            False                 0               90.57  Mkt&Fin     57.48  25772.0   True 2020-01-16 2020-10-03       3.0
2          24      M        74.92      69.43  Commerce         70.0   Comm&Mgmt            False                 0               98.00  Mkt&Fin     65.36  23354.0   True 2020-01-05 2020-04-07       3.0
3           8      M        53.16      48.68  Commerce         70.0   Comm&Mgmt            False                 0               56.65   Mkt&HR     57.45      NaN  False        NaT        NaT       NaN
4          28      F        56.98      75.72  Commerce         70.0   Comm&Mgmt            False                 0               58.12   Mkt&HR     57.86      NaN  False        NaT        NaT       NaN
Conditional sampling works through a rejection sampling process, where rows are sampled repeatedly until one that satisfies the conditions is found. If you are not able to sample enough valid rows, try increasing max_tries_per_batch. More information about this parameter can be found in the API section.
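For instance, a sketch of raising that budget (the value 500 is arbitrary):

samples = model.sample_conditions(
    conditions=[condition],
    max_tries_per_batch=500,  # allow more rejection-sampling attempts per batch of rows
)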
If you have many conditions that cannot easily be satisfied, consider switching to the GaussianCopula model, which is able to handle conditional sampling more efficiently, as shown in the sketch below.
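A minimal sketch of that switch, assuming GaussianCopula shares the same Tabular Model interface used throughout this guide:

from sdv.tabular import GaussianCopula

model = GaussianCopula(primary_key='student_id')
model.fit(data)
samples = model.sample_conditions(conditions=[condition])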
If you look closely at the data you may notice that some properties were not completely captured by the model. For example, you may have seen that sometimes the model produces an experience_years number greater than 0 while also indicating that work_experience is False. These types of properties are what we call Constraints and can also be handled using SDV. For further details about them please visit the Constraints guide.
After creating synthetic data, you may be wondering how you can evaluate it against the original data. You can use the SDMetrics library to get more insights, generate reports and visualize the data. This library is automatically installed with SDV.
To get started, visit: https://docs.sdv.dev/sdmetrics/