In this guide we will go through a series of steps that will let you discover functionalities of the CTGAN model, including how to:
Create an instance of CTGAN.
Fit the instance to your data.
Generate synthetic versions of your data.
Use CTGAN to anonymize Personally Identifiable Information (PII).
Customize the data transformations to improve the learning process.
Specify hyperparameters to improve the output quality.
The sdv.tabular.CTGAN model is based on the GAN-based Deep Learning data synthesizer presented at the NeurIPS 2019 conference in the paper titled Modeling Tabular data using Conditional GAN.
Let’s now discover how to learn a dataset and later on generate synthetic data with the same format and statistical properties by using the CTGAN class from SDV.
We will start by loading one of our demo datasets, student_placements, which contains information about MBA students who applied for placements during the year 2020.
Warning
In order to follow this guide you need to have ctgan installed on your system. If you have not done so yet, please install it now by executing the command pip install sdv in a terminal (ctgan is installed as a dependency of sdv).
In [1]: from sdv.demo import load_tabular_demo

In [2]: data = load_tabular_demo('student_placements')

In [3]: data.head()
Out[3]:
   student_id gender second_perc high_perc high_spec degree_perc degree_type work_experience experience_years employability_perc mba_spec mba_perc salary placed start_date end_date duration
0 17264 M 67.00 91.00 Commerce 58.00 Sci&Tech False 0 55.0 Mkt&HR 58.80 27000.0 True 2020-07-23 2020-10-12 3.0
1 17265 M 79.33 78.33 Science 77.48 Sci&Tech True 1 86.5 Mkt&Fin 66.28 20000.0 True 2020-01-11 2020-04-09 3.0
2 17266 M 65.00 68.00 Arts 64.00 Comm&Mgmt False 0 75.0 Mkt&Fin 57.80 25000.0 True 2020-01-26 2020-07-13 6.0
3 17267 M 56.00 52.00 Science 52.00 Sci&Tech False 0 66.0 Mkt&HR 59.43 NaN False NaT NaT NaN
4 17268 M 85.80 73.60 Commerce 73.30 Comm&Mgmt False 0 96.8 Mkt&Fin 55.50 42500.0 True 2020-07-04 2020-09-27 3.0
As you can see, this table contains information about students which includes, among other things:
Their id and gender
Their grades and specializations
Their work experience
The salary that they were offered
The duration and dates of their placement
You will notice that the data has the following characteristics:
There are float, integer, boolean, categorical and datetime values.
There are some variables that have missing data. In particular, all the data related to the placement details is missing in the rows where the student was not placed (the quick check below confirms this).
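A quick way to verify both points is a couple of plain pandas calls (not SDV-specific):

data.dtypes          # shows the float, integer, boolean, object and datetime columns
data.isnull().sum()  # number of missing values in each column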
Let us use CTGAN to learn this data and then sample synthetic data about new students to see how well the model captures the characteristics indicated above. In order to do this you will need to:
Import the sdv.tabular.CTGAN class and create an instance of it.
Call its fit method passing our table.
Call its sample method indicating the number of synthetic rows that you want to generate.
In [4]: from sdv.tabular import CTGAN

In [5]: model = CTGAN()

In [6]: model.fit(data)
Note
Notice that the model fitting process took care of transforming the different fields using the appropriate Reversible Data Transforms to ensure that the data has a format that the underlying CTGANSynthesizer class can handle.
Once the modeling has finished you are ready to generate new synthetic data by calling the sample method of your model, passing the number of rows that you want to generate.
In [7]: new_data = model.sample(200)
This will return a table with the same format as the one which the model was fitted on, but filled with new data that resembles the original.
In [8]: new_data.head()
Out[8]:
   student_id gender second_perc high_perc high_spec degree_perc degree_type work_experience experience_years employability_perc mba_spec mba_perc salary placed start_date end_date duration
0 17415 M 58.996629 71.638601 Commerce 63.810096 Comm&Mgmt False 0 87.798829 Mkt&Fin 48.215954 NaN False NaT 2020-10-21 3.0
1 17469 F 74.317264 59.589704 Arts 51.299096 Sci&Tech False 0 53.164798 Mkt&Fin 59.164475 NaN True 2020-03-27 2020-04-13 12.0
2 17391 F 70.444916 75.801224 Commerce 60.143831 Others True 0 63.634851 Mkt&Fin 58.667847 NaN True 2020-03-18 NaT 3.0
3 17474 M 62.365941 39.238885 Science 71.710117 Comm&Mgmt False 0 44.994612 Mkt&Fin 59.745547 17107.444836 True NaT NaT NaN
4 17394 M 69.564289 59.864980 Science 54.205378 Sci&Tech False 0 50.756422 Mkt&HR 58.645687 20401.720283 True 2020-02-25 2020-09-08 NaN
You can control the number of generated rows by passing it to model.sample(<num_rows>). To test this, try model.sample(10000); note that the original table only had ~200 rows.
In many scenarios it will be convenient to generate synthetic versions of your data directly in systems that do not have access to the original data source. For example, you may want to generate testing data on the fly inside a testing environment that does not have access to your production database. In these scenarios, fitting the model with real data every time that you need to generate new data is not feasible, so you will need to fit a model in your production environment, save the fitted model into a file, send this file to the testing environment and then load it there to be able to sample from it.
Let’s see how this process works.
Once you have fitted the model, all you need to do is call its save method passing the name of the file in which you want to save the model. Note that the extension of the filename is not relevant, but we will be using the .pkl extension to highlight that the serialization protocol used is pickle.
In [9]: model.save('my_model.pkl')
This will have created a file called my_model.pkl in the same directory in which you are running SDV.
Important
If you inspect the generated file you will notice that its size is much smaller than the size of the data that you used to generate it. This is because the serialized model contains no information about the original data, other than the parameters it needs to generate synthetic versions of it. This means that you can safely share this my_model.pkl file without the risk of disclosing any of your real data!
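You can verify the size yourself with a quick check (plain Python, independent of SDV):

import os

os.path.getsize('my_model.pkl')  # size of the serialized model, in bytes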
The file you just generated can be sent over to the system where the synthetic data will be generated. Once it is there, you can load it using the CTGAN.load method, and then you are ready to sample new data from the loaded instance:
In [10]: loaded = CTGAN.load('my_model.pkl')

In [11]: new_data = loaded.sample(200)
Notice that the system where the model is loaded needs to also have sdv and ctgan installed, otherwise it will not be able to load the model and use it.
One of the first things that you may have noticed when looking at the demo data is that there is a student_id column which acts as the primary key of the table, and which is supposed to have unique values. Indeed, if we look at the number of times that each value appears, we see that all of them appear at most once:
In [12]: data.student_id.value_counts().max()
Out[12]: 1
However, if we look at the synthetic data that we generated, we observe that there are some values that appear more than once:
In [13]: new_data[new_data.student_id == new_data.student_id.value_counts().index[0]]
Out[13]:
    student_id gender second_perc high_perc high_spec degree_perc degree_type work_experience experience_years employability_perc mba_spec mba_perc salary placed start_date end_date duration
11 17432 M 90.417936 53.326898 Science 73.203085 Sci&Tech False 1 70.048738 Mkt&Fin 61.033321 26934.622888 True 2020-01-16 NaT NaN
50 17432 M 88.758628 55.821830 Commerce 60.342289 Comm&Mgmt False 1 88.821372 Mkt&HR 61.119847 11309.279225 True 2020-01-10 NaT 3.0
101 17432 M 56.512322 66.586250 Commerce 73.761405 Sci&Tech False 1 63.200270 Mkt&HR 64.833396 40088.295446 True 2020-03-04 2020-07-16 12.0
141 17432 M 82.855050 97.057838 Science 66.780161 Comm&Mgmt True 0 80.473435 Mkt&Fin 61.648138 -7748.213461 True 2020-01-06 NaT NaN
This happens because the model was never informed that student_id had to be unique, so when generating new data it will sooner or later produce collisions. In order to solve this, we can pass the argument primary_key to our model when we create it, indicating the name of the column that is the index of the table.
In [14]: model = CTGAN(
   ....:     primary_key='student_id'
   ....: )
   ....:

In [15]: model.fit(data)

In [16]: new_data = model.sample(200)

In [17]: new_data.head()
Out[17]:
   student_id gender second_perc high_perc high_spec degree_perc degree_type work_experience experience_years employability_perc mba_spec mba_perc salary placed start_date end_date duration
0 0 M 92.745022 74.901954 Commerce 60.241747 Comm&Mgmt False 0 69.254793 Mkt&Fin 55.137732 NaN True 2020-01-30 2020-07-29 NaN
1 1 M 67.226143 85.044403 Commerce 76.553777 Sci&Tech False 0 50.063479 Mkt&Fin 51.811657 30981.871859 True NaT 2020-04-23 3.0
2 2 F 94.588261 45.233045 Commerce 47.259991 Comm&Mgmt False 0 88.192169 Mkt&Fin 55.022099 NaN True 2020-01-06 2020-09-14 12.0
3 3 M 87.751559 48.008146 Science 79.227244 Sci&Tech False 0 55.297518 Mkt&HR 51.081906 28552.524283 False NaT 2020-09-02 3.0
4 4 M 73.505802 91.218619 Science 71.372142 Comm&Mgmt False 0 78.962284 Mkt&Fin 54.207192 NaN False 2020-03-06 2020-08-13 NaN
As a result, the model will learn that this column must be unique and generate a unique sequence of values for the column:
In [18]: new_data.student_id.value_counts().max()
Out[18]: 1
There will be many cases where the data will contain Personally Identifiable Information which we cannot disclose. In these cases, we will want our Tabular Models to replace the information within these fields with fake, simulated data that looks similar to the real one but does not contain any of the original values.
Let’s load a new dataset that contains a PII field, the student_placements_pii demo, and try to generate synthetic versions of it that do not contain any of the PII fields.
The student_placements_pii dataset is a modified version of the student_placements dataset with one new field, address, which contains PII information about the students. Notice that this additional address field has been simulated and does not correspond to data from the real users.
In [19]: data_pii = load_tabular_demo('student_placements_pii')

In [20]: data_pii.head()
Out[20]:
   student_id address gender second_perc high_perc high_spec degree_perc degree_type work_experience experience_years employability_perc mba_spec mba_perc salary placed start_date end_date duration
0 17264 70304 Baker Turnpike\nEricborough, MS 15086 M 67.00 91.00 Commerce 58.00 Sci&Tech False 0 55.0 Mkt&HR 58.80 27000.0 True 2020-07-23 2020-10-12 3.0
1 17265 805 Herrera Avenue Apt. 134\nMaryview, NJ 36510 M 79.33 78.33 Science 77.48 Sci&Tech True 1 86.5 Mkt&Fin 66.28 20000.0 True 2020-01-11 2020-04-09 3.0
2 17266 3702 Bradley Island\nNorth Victor, FL 12268 M 65.00 68.00 Arts 64.00 Comm&Mgmt False 0 75.0 Mkt&Fin 57.80 25000.0 True 2020-01-26 2020-07-13 6.0
3 17267 Unit 0879 Box 3878\nDPO AP 42663 M 56.00 52.00 Science 52.00 Sci&Tech False 0 66.0 Mkt&HR 59.43 NaN False NaT NaT NaN
4 17268 96493 Kelly Canyon Apt. 145\nEast Steven, NC 3... M 85.80 73.60 Commerce 73.30 Comm&Mgmt False 0 96.8 Mkt&Fin 55.50 42500.0 True 2020-07-04 2020-09-27 3.0
If we use our tabular model on this new data we will see how the synthetic data that it generates discloses the addresses from the real students:
In [21]: model = CTGAN(
   ....:     primary_key='student_id',
   ....: )
   ....:

In [22]: model.fit(data_pii)

In [23]: new_data_pii = model.sample(200)

In [24]: new_data_pii.head()
Out[24]:
   student_id address gender second_perc high_perc high_spec degree_perc degree_type work_experience experience_years employability_perc mba_spec mba_perc salary placed start_date end_date duration
0 0 USS Sanchez\nFPO AA 16841 F 80.248517 88.042206 Commerce 74.932701 Comm&Mgmt False 2 64.127380 Mkt&HR 79.166379 16083.692267 True NaT 2020-08-22 NaN
1 1 8398 Seth Vista Apt. 266\nSouth Lauraberg, VA ... F 53.744461 82.460488 Science 72.244045 Comm&Mgmt False 0 82.151231 Mkt&HR 81.977538 17228.860302 True 2019-12-29 NaT 6.0
2 2 8034 Freeman Meadows\nSouth Bryce, NJ 14719 F 85.566425 56.259839 Commerce 74.140825 Comm&Mgmt False 0 52.029352 Mkt&Fin 71.427603 NaN True 2020-01-12 NaT 6.0
3 3 8368 Sarah Well\nNewmanville, WA 69934 F 68.441639 62.165037 Science 62.218357 Sci&Tech False 0 74.854105 Mkt&Fin 72.715475 NaN True 2020-06-06 NaT 12.0
4 4 5376 Amanda Terrace\nSouth Glen, ID 04884 F 75.812732 83.023668 Arts 60.087456 Sci&Tech True 0 73.093631 Mkt&Fin 69.314887 26434.832457 True 2020-02-23 2020-08-18 6.0
More specifically, we can see how all the addresses that have been generated actually come from the original dataset:
In [25]: new_data_pii.address.isin(data_pii.address).sum()
Out[25]: 200
In order to solve this, we can pass an additional argument, anonymize_fields, to our model when we create the instance. The anonymize_fields argument needs to be a dictionary that contains:
The name of the field that we want to anonymize.
The category of the field that we want to use when we generate fake values for it.
The complete list of possible categories can be seen in the Faker Providers page, and it contains a huge list of concepts such as:
name
country
city
ssn
credit_card_number
credit_card_expire
credit_card_security_code
email
telephone
…
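If you are unsure what values a given category produces, you can preview them directly with the faker library, which provides these categories (a minimal sketch; the output shown in the comment is illustrative):

from faker import Faker

fake = Faker()
fake.address()  # e.g. '123 Elm Street\nLakeside, CA 92040'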
In this case, since the field contains addresses, we will pass a dictionary indicating the category address:
In [26]: model = CTGAN(
   ....:     primary_key='student_id',
   ....:     anonymize_fields={
   ....:         'address': 'address'
   ....:     }
   ....: )
   ....:

In [27]: model.fit(data_pii)
As a result, we can see how the real address values have been replaced by other fake addresses:
In [28]: new_data_pii = model.sample(200)

In [29]: new_data_pii.head()
Out[29]:
   student_id address gender second_perc high_perc high_spec degree_perc degree_type work_experience experience_years employability_perc mba_spec mba_perc salary placed start_date end_date duration
0 0 6826 Leblanc Harbors Suite 864\nLake Garyfort,... F 61.669164 76.039806 Science 60.627468 Comm&Mgmt False 0 44.465228 Mkt&Fin 57.092623 26749.864738 True 2020-01-28 2020-07-09 12.0
1 1 5620 Monica Ports\nSouth Baileyborough, WA 89720 F 48.226669 76.861218 Commerce 72.447540 Comm&Mgmt False 0 48.119016 Mkt&HR 49.886540 NaN False 2019-12-17 NaT NaN
2 2 2723 Sandra Parkway\nEast Garrettton, MI 64638 M 56.963847 49.730330 Science 60.236265 Sci&Tech False 0 69.418668 Mkt&Fin 52.712834 28265.921442 True NaT 2020-07-14 6.0
3 3 3796 Diana Curve Apt. 786\nEast Megan, OH 48006 M 84.105554 78.782977 Commerce 63.126444 Comm&Mgmt False 0 64.808281 Mkt&Fin 50.562953 27197.574172 False NaT 2020-10-31 NaN
4 4 85565 Ward Knoll Suite 358\nVincentborough, CO... M 59.002053 107.006692 Commerce 47.618939 Comm&Mgmt True 0 71.231515 Mkt&HR 49.360704 NaN False NaT NaT NaN
Which means that none of the original addresses can be found in the sampled data:
In [30]: data_pii.address.isin(new_data_pii.address).sum()
Out[30]: 0
Now that we have discovered the basics, let’s go over a few more advanced usage examples and see the different arguments that we can pass to our CTGAN Model in order to customize it to our needs.
Apart from the common Tabular Model arguments, CTGAN has a number of additional hyperparameters that control its learning behavior and can impact the performance of the model, both in terms of the quality of the generated data and in terms of computational time.
epochs and batch_size: these arguments control the number of iterations that the model will perform to optimize its parameters, as well as the number of samples used in each step. Their default values are 300 and 500 respectively, and batch_size must always be a multiple of 10.
These hyperparameters have a very direct effect on how long the training process takes and on the quality of the generated data, so for new datasets you might want to start by setting low values on both of them to see how long the training takes on your data, and later increase them in order to improve the output quality, as in the sketch below.
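For instance, a first exploratory fit might use deliberately small values (the numbers here are illustrative, not recommendations):

model = CTGAN(
    epochs=10,       # a short run, just to gauge training time
    batch_size=100,  # must be a multiple of 10
)
model.fit(data)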
log_frequency: Whether to use log frequency of categorical levels in conditional sampling. It defaults to True. This argument affects how the model processes the frequencies of the categorical values that are used to condition the rest of the values. In some cases, changing it to False could lead to better performance, as in the sketch below.
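If the categorical frequencies in your synthetic data look off, this is a cheap experiment to try (an experiment, not a guaranteed improvement):

model = CTGAN(log_frequency=False)
model.fit(data)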
embedding_dim (int): Size of the random sample passed to the Generator. Defaults to 128.
generator_dim (tuple or list of ints): Size of the output samples for each one of the Residuals. A Residual Layer will be created for each one of the values provided. Defaults to (256, 256).
discriminator_dim (tuple or list of ints): Size of the output samples for each one of the Discriminator Layers. A Linear Layer will be created for each one of the values provided. Defaults to (256, 256).
generator_lr (float): Learning rate for the generator. Defaults to 2e-4.
generator_decay (float): Generator weight decay for the Adam Optimizer. Defaults to 1e-6.
discriminator_lr (float): Learning rate for the discriminator. Defaults to 2e-4.
discriminator_decay (float): Discriminator weight decay for the Adam Optimizer. Defaults to 1e-6.
discriminator_steps (int): Number of discriminator updates to do for each generator update. From the WGAN paper: https://arxiv.org/abs/1701.07875. WGAN paper default is 5. Default used is 1 to match original CTGAN implementation.
verbose: Whether to print fit progress on stdout. Defaults to False.
cuda (bool or str): If True, use CUDA. If a str, use the indicated device. If False, do not use CUDA at all.
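For example (hypothetical device settings; adjust them to your machine):

model = CTGAN(cuda=False)      # force CPU training
model = CTGAN(cuda='cuda:1')   # train on the second GPU, if one is available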
Notice that the value that you set on the batch_size argument must always be a multiple of 10!
As an example, we will try to fit the CTGAN model slightly increasing the number of epochs, reducing the batch_size, and adding one additional layer to both the generator and the discriminator.
Before we start, we will evaluate the quality of the previously generated data using the sdv.evaluation.evaluate function:
In [31]: from sdv.evaluation import evaluate

In [32]: evaluate(new_data, data)
Out[32]: 0.47358137945705026
Afterwards, we create a new instance of the CTGAN model with the hyperparameter values that we want to use:
In [33]: model = CTGAN(
   ....:     primary_key='student_id',
   ....:     epochs=500,
   ....:     batch_size=100,
   ....:     generator_dim=(256, 256, 256),
   ....:     discriminator_dim=(256, 256, 256)
   ....: )
   ....:
And fit it to our data:
In [34]: model.fit(data)
Finally, we are ready to generate new data and evaluate the results.
In [35]: new_data = model.sample(len(data))

In [36]: evaluate(new_data, data)
Out[36]: 0.44572545994097723
As we can see, in this case these modifications changed the obtained results slightly, but they did not introduce any dramatic change in performance.
As the name implies, conditional sampling allows us to sample from a conditional distribution using the CTGAN model, which means we can generate only values that satisfy certain conditions. These conditional values can be passed to the conditions parameter in the sample method either as a dataframe or a dictionary.
If a dictionary is passed, the model will generate as many rows as requested, all of which will satisfy the specified conditions, such as gender = M.
In [37]: conditions = {
   ....:     'gender': 'M'
   ....: }
   ....:

In [38]: model.sample(5, conditions=conditions)
Out[38]:
   student_id gender second_perc high_perc high_spec degree_perc degree_type work_experience experience_years employability_perc mba_spec mba_perc salary placed start_date end_date duration
0 0 M 28.290712 51.797002 Commerce 40.841523 Sci&Tech False 0 72.042408 Mkt&HR 45.010863 NaN False NaT NaT NaN
1 2 M 81.696535 57.077085 Arts 62.232806 Comm&Mgmt True 0 53.189994 Mkt&HR 46.540070 31618.989479 False 2020-03-06 2020-10-05 6.0
2 3 M 102.229980 54.580840 Commerce 94.554095 Sci&Tech False 0 77.068197 Mkt&Fin 64.216779 22072.326240 True 2019-12-14 2020-02-27 3.0
3 4 M 76.060251 50.293824 Commerce 62.611387 Comm&Mgmt False 0 78.817311 Mkt&Fin 70.411247 25677.625230 True 2020-02-24 2020-09-08 3.0
4 0 M 59.691580 51.596113 Commerce 70.332767 Comm&Mgmt True 0 63.206175 Mkt&Fin 74.856288 31858.330887 True NaT 2020-07-02 6.0
It’s also possible to condition on multiple columns, such as gender = M and experience_years = 0.
In [39]: conditions = {
   ....:     'gender': 'M',
   ....:     'experience_years': 0
   ....: }
   ....:

In [40]: model.sample(5, conditions=conditions)
Out[40]:
   student_id gender second_perc high_perc high_spec degree_perc degree_type work_experience experience_years employability_perc mba_spec mba_perc salary placed start_date end_date duration
0 1 M 73.485919 53.261342 Arts 51.141654 Sci&Tech False 0 61.800959 Mkt&Fin 60.679747 83144.812556 True NaT 2020-09-18 NaN
1 2 M 37.619412 22.677742 Commerce 49.101590 Comm&Mgmt False 0 79.849598 Mkt&HR 49.733512 NaN False NaT NaT NaN
2 3 M 61.126271 61.059981 Commerce 41.514696 Others False 0 77.409094 Mkt&Fin 56.506821 29141.343673 False 2020-02-21 NaT 3.0
3 4 M 35.964559 40.950881 Science 47.332896 Sci&Tech False 0 111.736064 Mkt&HR 56.955224 NaN False 2020-02-26 2020-09-02 NaN
4 0 M 46.587756 70.120633 Commerce 42.758214 Comm&Mgmt False 0 83.202785 Mkt&HR 59.626345 NaN False 2020-05-26 2021-04-18 NaN
The conditions can also be passed as a dataframe. In that case, the model will generate one sample for each row of the dataframe, in the same order. Since the model already knows how many samples to generate, passing the number as a parameter is unnecessary. For example, if we want to generate three samples where gender = M and three samples with gender = F, we can do the following:
In [41]: import pandas as pd

In [42]: conditions = pd.DataFrame({
   ....:     'gender': ['M', 'M', 'M', 'F', 'F', 'F'],
   ....: })
   ....:

In [43]: model.sample(conditions=conditions)
Out[43]:
   student_id gender second_perc high_perc high_spec degree_perc degree_type work_experience experience_years employability_perc mba_spec mba_perc salary placed start_date end_date duration
0 0 M 64.252020 40.272738 Science 53.938619 Sci&Tech False 1 75.168903 Mkt&HR 53.546725 47493.231618 False NaT NaT NaN
1 2 M 28.381880 54.668176 Commerce 43.432589 Comm&Mgmt False 0 72.623675 Mkt&Fin 47.134094 NaN False NaT NaT NaN
2 0 M 61.836885 58.355751 Commerce 43.628577 Comm&Mgmt True 0 52.310129 Mkt&HR 58.811533 NaN False NaT 2020-12-05 NaN
3 1 F 86.923627 56.830623 Commerce 73.887249 Comm&Mgmt False 0 53.546200 Mkt&HR 53.215357 30987.865703 False 2020-06-09 2021-03-15 6.0
4 2 F 87.185306 32.133005 Science 52.832368 Sci&Tech False 1 84.644799 Mkt&Fin 56.782870 NaN True NaT NaT 3.0
5 6 F 101.236967 65.974442 Commerce 62.257487 Comm&Mgmt True 0 103.640437 Mkt&Fin 64.550289 28561.590801 True 2020-01-22 2020-03-23 3.0
CTGAN also supports conditioning on continuous values, as long as the values fall within the range seen in the training data. For example, if all the values of a column lie between 0 and 1, CTGAN will not be able to set this value to 1000.
In [44]: conditions = {
   ....:     'degree_perc': 70.0
   ....: }
   ....:

In [45]: model.sample(5, conditions=conditions)
Out[45]:
   student_id gender second_perc high_perc high_spec degree_perc degree_type work_experience experience_years employability_perc mba_spec mba_perc salary placed start_date end_date duration
0 1 M 84.392593 49.867336 Commerce 70.0 Sci&Tech False 0 98.363184 Mkt&Fin 63.443452 23946.566121 True 2020-02-24 2020-02-26 6.0
1 5 F 62.009780 53.225228 Commerce 70.0 Comm&Mgmt False 1 74.374311 Mkt&HR 76.868128 24741.656224 True 2020-02-12 2020-09-27 NaN
2 17 F 91.806180 58.870241 Commerce 70.0 Comm&Mgmt True 0 73.740159 Mkt&HR 66.317381 25320.011979 True 2020-06-07 2020-09-30 12.0
3 21 M 61.018639 61.444294 Commerce 70.0 Comm&Mgmt False 2 76.174053 Mkt&Fin 71.433675 NaN True 2020-02-05 2020-07-26 3.0
4 24 F 91.603024 41.019216 Commerce 70.0 Comm&Mgmt True 0 59.449923 Mkt&Fin 65.359545 31963.759897 True 2020-08-04 2020-12-19 3.0
Currently, conditional sampling works through a rejection sampling process, where rows are sampled repeatedly until one that satisfies the conditions is found. If you run into a Could not get enough valid rows within x trials error, or simply wish to optimize the results, there are three parameters that can be fine-tuned: max_rows_multiplier, max_retries and float_rtol, as sketched below. More information about these parameters can be found in the API section.
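For instance, a sketch of loosening these limits when a condition is hard to satisfy (the values are illustrative, and we assume, per the API section, that sample accepts these parameters):

model.sample(
    5,
    conditions={'degree_perc': 70.0},
    max_retries=500,          # keep trying longer before raising the error
    max_rows_multiplier=20,   # sample larger candidate batches per trial
    float_rtol=0.05,          # accept continuous values within 5% of the target
)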
If you look closely at the data you may notice that some properties were not completely captured by the model. For example, you may have seen that sometimes the model produces an experience_years value greater than 0 while also indicating that work_experience is False. These types of properties are what we call Constraints, and they can also be handled using SDV. For further details about them please visit the Handling Constraints guide.
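For example, you can quantify how often the sampled data violates the work-experience property above with a plain pandas filter (a quick diagnostic, not an SDV feature):

# rows that claim no work experience but report a positive number of years
invalid = new_data[(~new_data.work_experience) & (new_data.experience_years > 0)]
len(invalid)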
A very common question when someone starts using SDV to generate synthetic data is: “How good is the data that I just generated?”
In order to answer this question, SDV has a collection of metrics and tools that allow you to compare the real data that you provided and the synthetic data that you generated using SDV or any other tool.
You can read more about this in the Synthetic Data Evaluation guide.