In this guide we will go through a series of steps that will let you discover functionalities of the CopulaGAN model, including how to:
Create an instance of CopulaGAN.
Fit the instance to your data.
Generate synthetic versions of your data.
Use CopulaGAN to anonymize Personally Identifiable Information (PII).
Customize the data transformations to improve the learning process.
Specify the column distributions to improve the output quality.
Specify hyperparameters to improve the output quality.
The sdv.tabular.CopulaGAN model is a variation of the CTGAN model that takes advantage of the CDF-based transformation applied by Gaussian copulas to make it easier for the underlying CTGAN model to learn the data.
Let’s now discover how to learn a dataset and later on generate synthetic data with the same format and statistical properties by using the CopulaGAN class from SDV.
We will start by loading one of our demo datasets, the student_placements, which contains information about MBA students that applied for placements during the year 2020.
Warning
In order to follow this guide you need to have ctgan installed on your system. If you have not done so yet, please install ctgan now by executing the command pip install sdv in a terminal.
In [1]: from sdv.demo import load_tabular_demo

In [2]: data = load_tabular_demo('student_placements')

In [3]: data.head()
Out[3]:
   student_id gender second_perc high_perc high_spec degree_perc degree_type work_experience experience_years employability_perc mba_spec mba_perc salary placed start_date end_date duration
0  17264  M  67.00  91.00  Commerce  58.00  Sci&Tech   False  0  55.0  Mkt&HR   58.80  27000.0  True   2020-07-23  2020-10-12  3.0
1  17265  M  79.33  78.33  Science   77.48  Sci&Tech   True   1  86.5  Mkt&Fin  66.28  20000.0  True   2020-01-11  2020-04-09  3.0
2  17266  M  65.00  68.00  Arts      64.00  Comm&Mgmt  False  0  75.0  Mkt&Fin  57.80  25000.0  True   2020-01-26  2020-07-13  6.0
3  17267  M  56.00  52.00  Science   52.00  Sci&Tech   False  0  66.0  Mkt&HR   59.43  NaN      False  NaT         NaT         NaN
4  17268  M  85.80  73.60  Commerce  73.30  Comm&Mgmt  False  0  96.8  Mkt&Fin  55.50  42500.0  True   2020-07-04  2020-09-27  3.0
As you can see, this table contains information about students which includes, among other things:
Their id and gender
Their grades and specializations
Their work experience
The salary that they were offered
The duration and dates of their placement
You will notice that there is data with the following characteristics:
There are float, integer, boolean, categorical and datetime values.
There are some variables that have missing data. In particular, all the data related to the placement details is missing in the rows where the student was not placed.
Let us use CopulaGAN to learn this data and then sample synthetic data about new students to see how well the model captures the characteristics indicated above. In order to do this you will need to:
Import the sdv.tabular.CopulaGAN class and create an instance of it.
Call its fit method passing our table.
Call its sample method indicating the number of synthetic rows that you want to generate.
In [4]: from sdv.tabular import CopulaGAN

In [5]: model = CopulaGAN()

In [6]: model.fit(data)
Note
Notice that the model fitting process took care of transforming the different fields using the appropriate Reversible Data Transforms to ensure that the data has a format that the underlying CTGANSynthesizer class can handle.
Once the modeling has finished you are ready to generate new synthetic data by calling the sample method of your model, passing the number of rows that you want to generate.
In [7]: new_data = model.sample(200)
This will return a table with the same structure as the one the model was fitted on, but filled with new data that resembles the original.
In [8]: new_data.head()
Out[8]:
   student_id gender second_perc high_perc high_spec degree_perc degree_type work_experience experience_years employability_perc mba_spec mba_perc salary placed start_date end_date duration
0  17351  M  84.485656  66.433633  Science   46.186459  Comm&Mgmt  False  0  63.074348  Mkt&HR   66.110219  NaN           True   2020-01-05  2021-04-07  12.0
1  17265  F  84.774956  56.081792  Commerce  55.557085  Sci&Tech   False  0  88.474646  Mkt&Fin  69.359913  NaN           True   2020-01-03  NaT         12.0
2  17359  F  85.350866  72.875801  Science   56.984539  Comm&Mgmt  False  0  97.327876  Mkt&HR   58.536167  29660.243675  True   2020-05-04  2020-06-07  6.0
3  17455  M  89.331350  63.229224  Commerce  52.493689  Comm&Mgmt  False  0  80.057421  Mkt&Fin  72.216692  28666.936764  False  2020-01-06  NaT         3.0
4  17271  F  82.276915  63.903086  Science   49.583515  Comm&Mgmt  False  0  97.824780  Mkt&HR   63.868849  26773.233200  True   2020-01-06  NaT         6.0
You can control the number of rows by specifying the number of samples in the call to model.sample(<num_rows>). To test, try model.sample(10000). Note that the original table only had ~200 rows.
In many scenarios it will be convenient to generate synthetic versions of your data directly in systems that do not have access to the original data source. For example, you may want to generate testing data on the fly inside a testing environment that does not have access to your production database. In these scenarios, fitting the model with real data every time that you need to generate new data is not an option, so you will need to fit a model in your production environment, save the fitted model into a file, send this file to the testing environment, and then load it there to be able to sample from it.
Let’s see how this process works.
Once you have fitted the model, all you need to do is call its save method passing the name of the file in which you want to save the model. Note that the extension of the filename is not relevant, but we will be using the .pkl extension to highlight that the serialization protocol used is pickle.
In [9]: model.save('my_model.pkl')
This will have created a file called my_model.pkl in the same directory in which you are running SDV.
Important
If you inspect the generated file you will notice that its size is much smaller than the size of the data that you used to generate it. This is because the serialized model contains no information about the original data, other than the parameters it needs to generate synthetic versions of it. This means that you can safely share this my_model.pkl file without the risk of disclosing any of your real data!
The file you just generated can be sent over to the system where the synthetic data will be generated. Once it is there, you can load it using the CopulaGAN.load method, and then you are ready to sample new data from the loaded instance:
In [10]: loaded = CopulaGAN.load('my_model.pkl')

In [11]: new_data = loaded.sample(200)
Notice that the system where the model is loaded needs to also have sdv and ctgan installed, otherwise it will not be able to load the model and use it.
One of the first things that you may have noticed when looking at the demo data is that there is a student_id column which acts as the primary key of the table and which is supposed to have unique values. Indeed, if we look at the number of times that each value appears, we see that all of them appear at most once:
In [12]: data.student_id.value_counts().max()
Out[12]: 1
However, if we look at the synthetic data that we generated, we observe that there are some values that appear more than once:
In [13]: new_data[new_data.student_id == new_data.student_id.value_counts().index[0]]
Out[13]:
     student_id gender second_perc high_perc high_spec degree_perc degree_type work_experience experience_years employability_perc mba_spec mba_perc salary placed start_date end_date duration
23   17264  M  89.173055  70.578807  Science   53.422448  Comm&Mgmt  False  0  57.528153  Mkt&Fin  65.605384  28878.887639  True   2020-06-18  2020-10-23  NaN
74   17264  M  64.973284  86.128437  Science   50.712619  Comm&Mgmt  False  0  91.562319  Mkt&HR   64.830177  19614.046522  False  2020-02-21  2021-03-09  NaN
86   17264  M  62.783847  44.388466  Commerce  63.914130  Comm&Mgmt  False  0  96.888726  Mkt&Fin  55.464247  NaN           True   NaT         2020-03-24  NaN
102  17264  M  89.364595  63.380975  Commerce  61.938792  Sci&Tech   False  0  83.312928  Mkt&Fin  53.825561  26273.420154  True   2020-01-19  2020-08-22  NaN
123  17264  M  85.611299  62.916063  Science   50.103949  Sci&Tech   False  0  88.481223  Mkt&Fin  66.161783  23248.629562  True   2020-01-06  2020-11-14  NaN
131  17264  F  70.329905  82.097816  Science   68.840351  Comm&Mgmt  False  0  96.203923  Mkt&HR   54.615575  23141.950297  True   2020-01-13  2020-09-04  6.0
149  17264  M  89.155153  65.996167  Commerce  69.357639  Comm&Mgmt  False  0  73.153274  Mkt&Fin  67.405330  35514.830452  True   2020-03-01  2020-08-15  3.0
150  17264  M  76.767618  71.328677  Science   57.438650  Sci&Tech   False  0  77.225140  Mkt&HR   69.670076  19631.951726  True   2020-02-14  2021-02-22  6.0
162  17264  F  87.595619  63.887684  Science   64.153801  Comm&Mgmt  True   0  85.300731  Mkt&HR   67.123881  18264.849956  True   2020-01-08  2020-09-16  NaN
This happens because the model was never told that the student_id column had to be unique, so when it generates new data it will eventually produce collisions. In order to solve this, we can pass the argument primary_key to our model when we create it, indicating the name of the column that acts as the index of the table.
In [14]: model = CopulaGAN(
   ....:     primary_key='student_id'
   ....: )
   ....:

In [15]: model.fit(data)

In [16]: new_data = model.sample(200)

In [17]: new_data.head()
Out[17]:
   student_id gender second_perc high_perc high_spec degree_perc degree_type work_experience experience_years employability_perc mba_spec mba_perc salary placed start_date end_date duration
0  0  M  55.587215  61.309286  Arts      48.913129  Comm&Mgmt  False  0  53.155713  Mkt&Fin  71.485060  52165.972432  True  2020-09-17  NaT         3.0
1  1  M  58.903340  66.514610  Commerce  74.931265  Sci&Tech   False  1  50.449068  Mkt&HR   70.587148  31183.673850  True  2020-01-01  2020-04-10  3.0
2  2  M  60.307064  65.957274  Commerce  64.846480  Others     True   0  50.000000  Mkt&HR   62.515730  28796.496159  True  2020-02-10  2020-04-07  6.0
3  3  M  44.994790  58.688502  Commerce  68.418327  Others     False  1  55.054244  Mkt&Fin  59.036430  21939.590723  True  NaT         NaT         6.0
4  4  M  70.675238  58.357250  Science   50.840781  Comm&Mgmt  False  1  55.191078  Mkt&HR   74.590156  NaN           True  2020-03-14  2020-10-17  12.0
As a result, the model will learn that this column must be unique and generate a unique sequence of values for the column:
In [18]: new_data.student_id.value_counts().max()
Out[18]: 1
There will be many cases where the data will contain Personally Identifiable Information which we cannot disclose. In these cases, we will want our Tabular Models to replace the information within these fields with fake, simulated data that looks similar to the real one but does not contain any of the original values.
Let’s load a new dataset that contains a PII field, the student_placements_pii demo, and try to generate synthetic versions of it that do not contain any of the PII fields.
The student_placements_pii dataset is a modified version of the student_placements dataset with one new field, address, which contains PII information about the students. Notice that this additional address field has been simulated and does not correspond to data from the real users.
In [19]: data_pii = load_tabular_demo('student_placements_pii')

In [20]: data_pii.head()
Out[20]:
   student_id  address                                            gender second_perc high_perc high_spec degree_perc degree_type work_experience experience_years employability_perc mba_spec mba_perc salary placed start_date end_date duration
0  17264  70304 Baker Turnpike\nEricborough, MS 15086          M  67.00  91.00  Commerce  58.00  Sci&Tech   False  0  55.0  Mkt&HR   58.80  27000.0  True   2020-07-23  2020-10-12  3.0
1  17265  805 Herrera Avenue Apt. 134\nMaryview, NJ 36510     M  79.33  78.33  Science   77.48  Sci&Tech   True   1  86.5  Mkt&Fin  66.28  20000.0  True   2020-01-11  2020-04-09  3.0
2  17266  3702 Bradley Island\nNorth Victor, FL 12268         M  65.00  68.00  Arts      64.00  Comm&Mgmt  False  0  75.0  Mkt&Fin  57.80  25000.0  True   2020-01-26  2020-07-13  6.0
3  17267  Unit 0879 Box 3878\nDPO AP 42663                    M  56.00  52.00  Science   52.00  Sci&Tech   False  0  66.0  Mkt&HR   59.43  NaN      False  NaT         NaT         NaN
4  17268  96493 Kelly Canyon Apt. 145\nEast Steven, NC 3...   M  85.80  73.60  Commerce  73.30  Comm&Mgmt  False  0  96.8  Mkt&Fin  55.50  42500.0  True   2020-07-04  2020-09-27  3.0
If we use our tabular model on this new data we will see how the synthetic data that it generates discloses the addresses from the real students:
In [21]: model = CopulaGAN(
   ....:     primary_key='student_id',
   ....: )
   ....:

In [22]: model.fit(data_pii)

In [23]: new_data_pii = model.sample(200)

In [24]: new_data_pii.head()
Out[24]:
   student_id  address                                            gender second_perc high_perc high_spec degree_perc degree_type work_experience experience_years employability_perc mba_spec mba_perc salary placed start_date end_date duration
0  0  1350 Tyler Hollow\nNew Jacquelineport, OH 59348     M  85.536004  73.867407   Science   59.903343  Comm&Mgmt  False  0  64.923488  Mkt&HR   72.281152  NaN           True  NaT         2020-09-21  3.0
1  1  65737 Meyer Junction Suite 154\nWest Steven, N...   M  78.318478  110.525878  Science   43.466670  Comm&Mgmt  False  0  51.961181  Mkt&HR   61.041276  77172.161950  True  2020-02-09  NaT         3.0
2  2  43733 Sara Forges Suite 447\nWest Sarahmouth, ...   M  52.378793  96.849175   Commerce  51.270064  Comm&Mgmt  False  0  53.147880  Mkt&Fin  56.829103  29086.093800  True  2020-01-11  2020-09-12  3.0
3  3  65737 Meyer Junction Suite 154\nWest Steven, N...   M  64.891355  104.975062  Science   63.136568  Comm&Mgmt  False  0  52.623693  Mkt&Fin  59.333750  NaN           True  2020-06-20  2020-10-31  3.0
4  4  081 Carrie Square Apt. 439\nJohnsontown, NM 66991   M  87.327417  92.797528   Commerce  63.796187  Sci&Tech   False  0  50.000000  Mkt&Fin  62.752135  27771.632648  True  NaT         2021-03-24  3.0
More specifically, we can see how all the addresses that have been generated actually come from the original dataset:
In [25]: new_data_pii.address.isin(data_pii.address).sum()
Out[25]: 200
In order to solve this, we can pass an additional argument anonymize_fields to our model when we create the instance. This anonymize_fields argument will need to be a dictionary that contains:
The name of the field that we want to anonymize, as the key.
The category of fake data that we want to use when generating values for it, as the value.
The complete list of possible categories can be seen in the Faker Providers page, which contains a huge list of concepts such as:
name
country
city
ssn
credit_card_number
credit_card_expire
credit_card_security_code
email
telephone
…
In this case, since the field contains an address, we will pass a dictionary indicating the category address:
In [26]: model = CopulaGAN(
   ....:     primary_key='student_id',
   ....:     anonymize_fields={
   ....:         'address': 'address'
   ....:     }
   ....: )
   ....:

In [27]: model.fit(data_pii)
As a result, we can see how the real address values have been replaced by other fake addresses that were not taken from the real data that the model learned from.
In [28]: new_data_pii = model.sample(200)

In [29]: new_data_pii.head()
Out[29]:
   student_id  address                                            gender second_perc high_perc high_spec degree_perc degree_type work_experience experience_years employability_perc mba_spec mba_perc salary placed start_date end_date duration
0  0  379 Simon Hills Apt. 864\nSouth Dustin, ID 18410    F  83.876052  60.481488  Science   71.626776  Comm&Mgmt  False  0  86.234711  Mkt&HR   66.282367  NaN           True   2020-02-11  NaT         NaN
1  1  23147 Kenneth Springs\nEast Jesse, ND 59627         F  65.610008  66.048397  Commerce  73.520366  Comm&Mgmt  True   0  62.880333  Mkt&Fin  62.044864  37945.093834  True   NaT         NaT         NaN
2  2  654 Sharon Views Apt. 098\nFrederickberg, FL 4...   F  88.878580  67.264524  Commerce  56.067286  Comm&Mgmt  False  0  60.812253  Mkt&HR   66.486290  NaN           True   NaT         2020-08-15  3.0
3  3  97978 Joanne Curve\nSouth Jose, NC 50021            M  88.862827  69.394390  Arts      63.674363  Comm&Mgmt  False  0  75.627544  Mkt&Fin  72.064379  25031.505218  False  2020-02-03  2020-08-24  3.0
4  4  671 Paul Neck Suite 109\nGarrettton, CO 39564       F  72.149129  61.380893  Arts      80.519019  Comm&Mgmt  False  0  89.945357  Mkt&HR   64.307747  28144.703032  False  NaT         NaT         3.0
Which means that none of the original addresses can be found in the sampled data:
In [30]: data_pii.address.isin(new_data_pii.address).sum()
Out[30]: 0
Now that we have discovered the basics, let’s go over a few more advanced usage examples and see the different arguments that we can pass to our CopulaGAN Model in order to customize it to our needs.
During the previous steps, every time we fitted the CopulaGAN it performed the following operations:
Learn the format and data types of the passed data
Transform the non-numerical and null data using Reversible Data Transforms to obtain a fully numerical representation of the data from which we can learn the probability distributions.
Learn the probability distribution of each column from the table
Transform the values of each numerical column by converting them to their marginal distribution CDF values and then applying the inverse CDF of a standard normal to them (a minimal sketch of this pair of transformations is shown after the sampling steps below).
Fit a CTGAN model on the transformed data, which learns how each column is correlated to the others.
After this, when we used the model to generate new data for our table using the sample method, it performed the following steps:
Sample rows from the CTGAN model.
Revert the sampled values by computing their standard normal CDF and then applying the inverse CDF of their marginal distributions.
Revert the RDT transformations to go back to the original data format.
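To make the CDF steps above more concrete, here is a minimal sketch of the forward and reverse transformations for a single numeric column. This is not SDV's internal code; it only illustrates the idea with scipy, using a Gamma marginal chosen arbitrarily as an example.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
column = rng.gamma(shape=2.0, scale=10.0, size=1000)  # stand-in for a numeric column

# Fit a marginal distribution to the column (a Gamma here, purely for illustration).
params = stats.gamma.fit(column)

# Forward transform (applied before fitting CTGAN):
# marginal CDF, then inverse CDF (ppf) of a standard normal.
u = np.clip(stats.gamma.cdf(column, *params), 1e-6, 1 - 1e-6)
gaussianized = stats.norm.ppf(u)

# Reverse transform (applied after sampling from CTGAN):
# standard normal CDF, then inverse CDF of the marginal.
recovered = stats.gamma.ppf(stats.norm.cdf(gaussianized), *params)

print(np.allclose(column, recovered))  # should print True, up to numerical precision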
As you can see, during these steps the Marginal Probability Distributions have a very important role, since the CopulaGAN had to learn and reproduce the individual distributions of each column in our table. We can explore the distributions which the CopulaGAN used to model each column using its get_distributions method:
In [31]: model = CopulaGAN(
   ....:     primary_key='student_id'
   ....: )
   ....:

In [32]: model.fit(data)

In [33]: distributions = model.get_distributions()
This will return a dict that contains the name of the distribution class used for each column:
In [34]: distributions
Out[34]:
{'second_perc': 'copulas.univariate.truncated_gaussian.TruncatedGaussian',
 'high_perc': 'copulas.univariate.log_laplace.LogLaplace',
 'degree_perc': 'copulas.univariate.student_t.StudentTUnivariate',
 'work_experience': 'copulas.univariate.student_t.StudentTUnivariate',
 'experience_years': 'copulas.univariate.gaussian.GaussianUnivariate',
 'employability_perc': 'copulas.univariate.truncated_gaussian.TruncatedGaussian',
 'mba_perc': 'copulas.univariate.gamma.GammaUnivariate',
 'salary#0': 'copulas.univariate.gamma.GammaUnivariate',
 'salary#1': 'copulas.univariate.gaussian.GaussianUnivariate',
 'placed': 'copulas.univariate.gamma.GammaUnivariate',
 'start_date#0': 'copulas.univariate.gamma.GammaUnivariate',
 'start_date#1': 'copulas.univariate.gaussian.GaussianUnivariate',
 'end_date#0': 'copulas.univariate.gamma.GammaUnivariate',
 'end_date#1': 'copulas.univariate.gaussian.GaussianUnivariate'}
In this dict you will notice that some of the columns from our data appear more than once (for example, salary#0 and salary#1). This is because the RDT transformations used to encode the data numerically often use more than one column to represent a single input variable.
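As a purely conceptual illustration of why this happens (this is not the actual RDT implementation, and the fill strategy below is an arbitrary choice), a numeric column containing missing values can be encoded as two numeric columns: one with the nulls filled in and one flagging which entries were null.

import numpy as np
import pandas as pd

salary = pd.Series([27000.0, 20000.0, np.nan, 42500.0], name='salary')

encoded = pd.DataFrame({
    'salary#0': salary.fillna(salary.mean()),  # numeric values, nulls replaced
    'salary#1': salary.isna().astype(float),   # 1.0 where the original value was null
})
print(encoded)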
Let’s explore the individual distribution of one of the columns in our data to better understand how the CopulaGAN processed them and see if we can improve the results by manually specifying a different distribution. For example, let’s explore the experience_years column by looking at the frequency of its values within the original data:
In [35]: data.experience_years.value_counts()
Out[35]:
0    141
1     65
2      8
3      1
Name: experience_years, dtype: int64

In [36]: data.experience_years.hist();
By observing the data we can see that the behavior of the values in this column is very similar to a Gamma or even some types of Beta distribution, where the majority of the values are 0 and the frequency decreases as the values increase.
Was the CopulaGAN able to capture this distribution on its own?
In [37]: distributions['experience_years']
Out[37]: 'copulas.univariate.gaussian.GaussianUnivariate'
It seems that it was not, since it assumed that the behavior was closer to a Gaussian distribution. As a result, the generated values do not follow the original distribution closely and can even include invalid values, such as negative numbers:
In [38]: new_data.experience_years.value_counts()
Out[38]:
1    115
0     83
2      2
Name: experience_years, dtype: int64

In [39]: new_data.experience_years.hist();
Let’s see how we can improve this situation by passing the CopulaGAN the exact distribution that we want it to use for this column.
The CopulaGAN class offers the possibility to indicate which distribution to use for each one of the columns in the table, in order to solve situations like the one that we just described. In order to do this, we need to pass a field_distributions argument with a dict that indicates the distribution that we want to use for each column.
Possible values for the distribution argument are:
univariate: Let copulas select the optimal univariate distribution. This may result in non-parametric models being used.
parametric: Let copulas select the optimal univariate distribution, but restrict the selection to parametric distributions only.
bounded: Let copulas select the optimal univariate distribution, but restrict the selection to bounded distributions only. This may result in non-parametric models being used.
semi_bounded: Let copulas select the optimal univariate distribution, but restrict the selection to semi-bounded distributions only. This may result in non-parametric models being used.
parametric_bounded: Let copulas select the optimal univariate distribution, but restrict the selection to parametric and bounded distributions only.
parametric_semi_bounded: Let copulas select the optimal univariate distribution, but restrict the selection to parametric and semi-bounded distributions only.
gaussian: Use a Gaussian distribution.
gamma: Use a Gamma distribution.
beta: Use a Beta distribution.
student_t: Use a Student T distribution.
gaussian_kde: Use a GaussianKDE distribution. This model is non-parametric, so using this will make get_parameters unusable.
truncated_gaussian: Use a Truncated Gaussian distribution.
Let’s see what happens if we make the CopulaGAN use the gamma distribution for our column.
In [40]: model = CopulaGAN(
   ....:     primary_key='student_id',
   ....:     field_distributions={
   ....:         'experience_years': 'gamma'
   ....:     }
   ....: )
   ....:

In [41]: model.fit(data)
After this, we can see that the CopulaGAN used the indicated distribution for the experience_years column:
In [42]: model.get_distributions()['experience_years']
Out[42]: 'copulas.univariate.gamma.GammaUnivariate'
As a result, we can now see that the generated data behaves more like the original data and always stays within the valid range of values.
In [43]: new_data = model.sample(len(data))

In [44]: new_data.experience_years.value_counts()
Out[44]:
0    178
1     27
2      9
4      1
Name: experience_years, dtype: int64

In [45]: new_data.experience_years.hist();
Even though there are situations like the one shown above where manually choosing a distribution seems to give better results, in most cases the CopulaGAN will be able to find the optimal distribution on its own, making this manual search of the marginal distributions necessary only on rare occasions.
Apart from the arguments explained above, CopulaGAN has a number of additional hyperparameters that control its learning behavior and can have an impact on the performance of the model, both in terms of the quality of the generated data and of computational time:
epochs and batch_size: these arguments control the number of iterations that the model will perform to optimize its parameters, as well as the number of samples used in each step. Their default values are 300 and 500 respectively, and batch_size must always be a multiple of 10.
These hyperparameters have a direct effect on how long the training process takes as well as on the quality of the generated data, so for new datasets you might want to start by setting low values for both of them to see how long training takes on your data, and later increase them to improve the results (a minimal example of such a quick first pass is shown after this list of hyperparameters).
log_frequency: Whether to use log frequency of categorical levels in conditional sampling. It defaults to True. This argument affects how the model processes the frequencies of the categorical values that are used to condition the rest of the values. In some cases, changing it to False could lead to better performance.
embedding_dim (int): Size of the random sample passed to the Generator. Defaults to 128.
generator_dim (tuple or list of ints): Size of the output samples for each one of the Residuals. A Residual Layer will be created for each one of the values provided. Defaults to (256, 256).
discriminator_dim (tuple or list of ints): Size of the output samples for each one of the Discriminator Layers. A Linear Layer will be created for each one of the values provided. Defaults to (256, 256).
generator_lr (float): Learning rate for the generator. Defaults to 2e-4.
generator_decay (float): Generator weight decay for the Adam Optimizer. Defaults to 1e-6.
discriminator_lr (float): Learning rate for the discriminator. Defaults to 2e-4.
discriminator_decay (float): Discriminator weight decay for the Adam Optimizer. Defaults to 1e-6.
discriminator_steps (int): Number of discriminator updates to do for each generator update. From the WGAN paper: https://arxiv.org/abs/1701.07875. WGAN paper default is 5. Default used is 1 to match original CTGAN implementation.
verbose: Whether to print fit progress on stdout. Defaults to False.
Notice that the value that you set on the batch_size argument must always be a multiple of 10!
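As suggested earlier, a quick, low-cost first pass can help you gauge how long training takes on your data before committing to longer runs. The values below are just an arbitrary starting point, not a recommendation:

from sdv.tabular import CopulaGAN

quick_model = CopulaGAN(
    primary_key='student_id',
    epochs=10,       # only a few iterations, just to measure training time
    batch_size=100,  # must be a multiple of 10
)
quick_model.fit(data)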
As an example, we will try to fit the CopulaGAN model slightly increasing the number of epochs, reducing the batch_size, and adding one additional layer to both the generator and the discriminator.
Before we start, we will evaluate the quality of the previously generated data using the sdv.evaluation.evaluate function:
In [46]: from sdv.evaluation import evaluate

In [47]: evaluate(new_data, data)
Out[47]: 0.5098086509468629
Afterwards, we create a new instance of the CopulaGAN model with the hyperparameter values that we want to use:
In [48]: model = CopulaGAN(
   ....:     primary_key='student_id',
   ....:     epochs=500,
   ....:     batch_size=100,
   ....:     generator_dim=(256, 256, 256),
   ....:     discriminator_dim=(256, 256, 256)
   ....: )
   ....:
And fit it to our data.
In [49]: model.fit(data)
Finally, we are ready to generate new data and evaluate the results.
In [50]: new_data = model.sample(len(data))

In [51]: evaluate(new_data, data)
Out[51]: 0.5094647272650655
As we can see, in this case these modifications changed the results slightly, but they did not introduce any dramatic change in performance.
If you look closely at the data you may notice that some properties were not completely captured by the model. For example, you may have seen that sometimes the model produces an experience_years value greater than 0 while also indicating that work_experience is False. These types of properties are what we call Constraints, and they can also be handled using SDV. For further details about them please visit the Handling Constraints guide.
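Until constraints are configured, a simple post-hoc check can reveal how often this particular property is violated in the sampled data. The snippet below is plain pandas, not the SDV Constraints API, and it assumes work_experience was sampled as a boolean column:

# Rows that claim years of experience while work_experience is False.
inconsistent = new_data[(new_data.experience_years > 0) & (~new_data.work_experience)]
print(len(inconsistent), 'out of', len(new_data), 'sampled rows violate this property')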
A very common question when someone starts using SDV to generate synthetic data is: “How good is the data that I just generated?”
In order to answer this question, SDV has a collection of metrics and tools that allow you to compare the real data that you provided and the synthetic data that you generated using SDV or any other tool.
You can read more about this in the Synthetic Data Evaluation guide.