# CopulaGAN Model

In this guide we will go through a series of steps that will let you
discover functionalities of the `CopulaGAN` model, including how to:

- Create an instance of `CopulaGAN`.
- Fit the instance to your data.
- Generate synthetic versions of your data.
- Use `CopulaGAN` to anonymize PII information.
- Customize the data transformations to improve the learning process.
- Specify the column distributions to improve the output quality.
- Specify hyperparameters to improve the output quality.

## What is CopulaGAN?

The `sdv.tabular.CopulaGAN` model is a variation of the CTGAN Model
which applies the CDF-based transformation used by the GaussianCopula
models, so that the underlying CTGAN model has an easier task learning the data.

Let’s now discover how to learn a dataset and later on generate
synthetic data with the same format and statistical properties by using
the `CopulaGAN` class from SDV.

## Quick Usage

We will start by loading one of our demo datasets, `student_placements`,
which contains information about MBA students
that applied for placements during the year 2020.

Warning

In order to follow this guide you need to have `ctgan` installed on
your system. If you have not done it yet, please install `ctgan` now
by executing the command `pip install sdv` in a terminal.

```
In [1]: from sdv.demo import load_tabular_demo
In [2]: data = load_tabular_demo('student_placements')
In [3]: data.head()
Out[3]:
student_id gender second_perc high_perc high_spec degree_perc degree_type work_experience experience_years employability_perc mba_spec mba_perc salary placed start_date end_date duration
0 17264 M 67.00 91.00 Commerce 58.00 Sci&Tech False 0 55.0 Mkt&HR 58.80 27000.0 True 2020-07-23 2020-10-12 3.0
1 17265 M 79.33 78.33 Science 77.48 Sci&Tech True 1 86.5 Mkt&Fin 66.28 20000.0 True 2020-01-11 2020-04-09 3.0
2 17266 M 65.00 68.00 Arts 64.00 Comm&Mgmt False 0 75.0 Mkt&Fin 57.80 25000.0 True 2020-01-26 2020-07-13 6.0
3 17267 M 56.00 52.00 Science 52.00 Sci&Tech False 0 66.0 Mkt&HR 59.43 NaN False NaT NaT NaN
4 17268 M 85.80 73.60 Commerce 73.30 Comm&Mgmt False 0 96.8 Mkt&Fin 55.50 42500.0 True 2020-07-04 2020-09-27 3.0
```

As you can see, this table contains information about students which includes, among other things:

- Their id and gender
- Their grades and specializations
- Their work experience
- The salary that they were offered
- The duration and dates of their placement

You will notice that there is data with the following characteristics:

- There are float, integer, boolean, categorical and datetime values.
- There are some variables that have missing data. In particular, all the data related to the placement details is missing in the rows where the student was not placed.

Let us use `CopulaGAN` to learn this data and then sample synthetic data
about new students to see how well the model captures the characteristics
indicated above. In order to do this you will need to:

1. Import the `sdv.tabular.CopulaGAN` class and create an instance of it.
2. Call its `fit` method passing our table.
3. Call its `sample` method indicating the number of synthetic rows that you want to generate.

```
In [4]: from sdv.tabular import CopulaGAN
In [5]: model = CopulaGAN()
In [6]: model.fit(data)
```

Note

Notice that the model `fitting` process took care of transforming the
different fields using the appropriate Reversible Data
Transforms to ensure that the data
has a format that the underlying CTGANSynthesizer class can handle.

### Generate synthetic data from the model

Once the modeling has finished you are ready to generate new synthetic
data by calling the `sample` method of your model, passing the number
of rows that you want to generate.

```
In [7]: new_data = model.sample(200)
```

This will return a table with the same format as the one the model was fitted on, but filled with new data that resembles the original.

```
In [8]: new_data.head()
Out[8]:
student_id gender second_perc high_perc high_spec degree_perc degree_type work_experience experience_years employability_perc mba_spec mba_perc salary placed start_date end_date duration
0 17343 M 57.675716 67.052617 Arts 75.331502 Comm&Mgmt False 0 54.631293 Mkt&Fin 71.562876 27208.435796 True 2020-01-15 2020-06-17 3.0
1 17477 F 45.677374 42.234641 Arts 63.132638 Comm&Mgmt False 0 70.753065 Mkt&Fin 67.442271 18380.314323 True NaT 2020-09-28 NaN
2 17265 F 73.766502 65.216742 Science 74.964961 Comm&Mgmt True 0 53.731417 Mkt&HR 64.151859 29320.612490 False 2020-03-13 2020-09-06 3.0
3 17461 F 67.086296 51.806403 Commerce 87.203741 Comm&Mgmt True 0 64.555617 Mkt&HR 64.658355 24349.805374 True 2020-03-22 2020-07-04 3.0
4 17392 M 89.059970 60.128504 Science 74.592259 Comm&Mgmt False 0 53.585064 Mkt&HR 56.903326 26930.163403 False 2020-01-20 NaT 12.0
```

Note

You can control the number of rows by specifying the number of
`samples` in the `model.sample(<num_rows>)` call. To test, try
`model.sample(10000)`. Note that the original table only had ~200
rows.

### Save and Load the model

In many scenarios it will be convenient to generate synthetic versions
of your data directly in systems that do not have access to the original
data source. For example, you may want to generate testing data on
the fly inside a testing environment that does not have access to your
production database. In these scenarios, fitting the model with real
data every time that you need to generate new data is not feasible, so you
will need to fit a model in your production environment, save the fitted
model into a file, send this file to the testing environment and then
load it there to be able to `sample` from it.

Let’s see how this process works.
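The `load` call in the next section expects a file named `my_model.pkl`, which implies a matching save step on the machine that holds the real data. With SDV's tabular models that step is presumably `model.save('my_model.pkl')`; treat this as an assumption if your version differs. The sketch below illustrates the same fit, save, ship and load round trip using only the standard `pickle` module. The `fit`, `save`, `load` and `sample` functions are hypothetical stand-ins, not SDV's implementation.

```python
import pickle
import random
import tempfile
from pathlib import Path


def fit(values):
    """Hypothetical stand-in for model fitting: learn only the value range."""
    return {"low": min(values), "high": max(values)}


def save(fitted, path):
    # Serialize the fitted parameters; the training data itself is not stored.
    with open(path, "wb") as handle:
        pickle.dump(fitted, handle)


def load(path):
    # Restore the fitted parameters from the file.
    with open(path, "rb") as handle:
        return pickle.load(handle)


def sample(fitted, num_rows, seed=0):
    # Draw new values using only what was learned during fitting.
    rng = random.Random(seed)
    return [rng.uniform(fitted["low"], fitted["high"]) for _ in range(num_rows)]


# "Production" side: fit on real data and save the result to a file.
mba_perc = [58.80, 66.28, 57.80, 59.43, 55.50]  # first rows of the demo table
fitted = fit(mba_perc)
model_path = Path(tempfile.mkdtemp()) / "my_model.pkl"
save(fitted, model_path)

# "Testing" side: only the file is needed, not the original data.
loaded = load(model_path)
new_values = sample(loaded, 3)
```

Since pickle files can execute code when loaded, only load model files that come from a source you trust.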

#### Load the model and generate new data

The file you just generated can be sent over to the system where the
synthetic data will be generated. Once it is there, you can load it
using the `CopulaGAN.load` method, and then you are ready to sample new
data from the loaded instance:

```
In [10]: loaded = CopulaGAN.load('my_model.pkl')
In [11]: new_data = loaded.sample(200)
```

Warning

Notice that the system where the model is loaded needs to also have
`sdv` and `ctgan` installed, otherwise it will not be able to load
the model and use it.

### Specifying the Primary Key of the table

One of the first things that you may have noticed when looking at the demo
data is that there is a `student_id` column which acts as the primary
key of the table, and which is supposed to have unique values. Indeed,
if we look at the number of times that each value appears, we see that
all of them appear at most once:

```
In [12]: data.student_id.value_counts().max()
Out[12]: 1
```

However, if we look at the synthetic data that we generated, we observe that there are some values that appear more than once:

```
In [13]: new_data[new_data.student_id == new_data.student_id.value_counts().index[0]]
Out[13]:
student_id gender second_perc high_perc high_spec degree_perc degree_type work_experience experience_years employability_perc mba_spec mba_perc salary placed start_date end_date duration
39 17478 M 42.999117 61.146830 Commerce 79.106691 Comm&Mgmt False 4 66.817192 Mkt&Fin 64.745617 17997.757626 False NaT 2020-08-24 NaN
53 17478 F 69.863648 58.435742 Science 73.083194 Sci&Tech False 1 59.696847 Mkt&Fin 68.451614 27376.159521 True 2020-08-05 2021-03-14 NaN
56 17478 F 66.290662 81.304770 Arts 83.028636 Sci&Tech False 1 51.126993 Mkt&Fin 63.472818 19532.753808 False NaT 2020-09-22 6.0
64 17478 M 43.634751 70.764316 Commerce 82.027700 Comm&Mgmt True 2 51.628523 Mkt&HR 64.704598 28575.246367 True 2020-01-18 NaT NaN
151 17478 M 53.434521 28.853579 Commerce 64.886096 Comm&Mgmt True 0 50.729228 Mkt&Fin 74.157459 26521.406728 False NaT NaT 3.0
169 17478 F 48.090778 65.048094 Science 68.511165 Comm&Mgmt False 1 53.098110 Mkt&Fin 57.132024 NaN False 2020-01-05 2020-09-23 12.0
182 17478 F 55.620425 32.735862 Commerce 70.662312 Comm&Mgmt False 1 64.273674 Mkt&HR 62.161264 26804.949476 False 2020-03-14 2020-06-10 NaN
194 17478 M 89.400000 30.748189 Commerce 84.278267 Comm&Mgmt False 1 54.146131 Mkt&HR 79.546356 NaN True 2020-04-03 2020-07-28 6.0
```

This happens because the model was not told at any point that the
`student_id` had to be unique, so when it generates new
data it will produce collisions sooner or later. In order to solve this,
we can pass the argument `primary_key` to our model when we create it,
indicating the name of the column that is the index of the table.

```
In [14]: model = CopulaGAN(
....: primary_key='student_id'
....: )
....:
In [15]: model.fit(data)
In [16]: new_data = model.sample(200)
In [17]: new_data.head()
Out[17]:
student_id gender second_perc high_perc high_spec degree_perc degree_type work_experience experience_years employability_perc mba_spec mba_perc salary placed start_date end_date duration
0 0 M 78.148814 66.139058 Science 61.435598 Comm&Mgmt False 0 54.860183 Mkt&Fin 62.917992 20231.965709 True 2020-04-28 2020-08-04 12.0
1 1 M 89.358878 64.437661 Commerce 76.154126 Comm&Mgmt True 1 83.879527 Mkt&Fin 74.483029 27731.856224 False NaT 2020-04-27 3.0
2 2 F 71.541120 43.413513 Commerce 62.419453 Comm&Mgmt False 0 93.163297 Mkt&Fin 63.038829 30785.551143 True 2020-02-22 2020-04-06 3.0
3 3 F 70.017385 68.449887 Science 81.995885 Comm&Mgmt False 1 63.660266 Mkt&HR 75.350619 21465.277145 True 2020-01-07 NaT 12.0
4 4 M 78.132191 70.935037 Commerce 86.443417 Comm&Mgmt False 0 94.083981 Mkt&Fin 80.809315 31765.145261 True NaT 2020-11-19 3.0
```

As a result, the model will learn that this column must be unique and generate a unique sequence of values for the column:

```
In [18]: new_data.student_id.value_counts().max()
Out[18]: 1
```
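To see why collisions are expected without a primary key, here is a minimal sketch. It assumes, as a simplification, that ids are drawn independently from the range of values seen in the real data. It uses only the standard library, and `draw_ids_independently`, `draw_ids_as_sequence` and `max_repetitions` are hypothetical helpers, not SDV functions.

```python
import random
from collections import Counter


def draw_ids_independently(low, high, num_rows, seed=0):
    """Draw each id on its own, ignoring ids that were already used."""
    rng = random.Random(seed)
    return [rng.randint(low, high) for _ in range(num_rows)]


def draw_ids_as_sequence(num_rows):
    """Generate ids as a sequence, which is unique by construction."""
    return list(range(num_rows))


def max_repetitions(ids):
    """Stdlib equivalent of `ids.value_counts().max()` in pandas."""
    return max(Counter(ids).values())


# 215 real ids starting at 17264 and 200 synthetic rows, as in the demo above.
independent = draw_ids_independently(17264, 17264 + 214, 200)
sequence = draw_ids_as_sequence(200)

print(max_repetitions(independent))  # greater than 1: some ids collide
print(max_repetitions(sequence))     # 1: every id appears exactly once
```

With 200 independent draws from only 215 possible values, repeated ids are practically unavoidable, while a generated sequence cannot repeat.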

### Anonymizing Personally Identifiable Information (PII)

There will be many cases where the data will contain Personally Identifiable Information which we cannot disclose. In these cases, we will want our Tabular Models to replace the information within these fields with fake, simulated data that looks similar to the real one but does not contain any of the original values.

Let’s load a new dataset that contains a PII field, the
`student_placements_pii` demo, and try to generate synthetic versions
of it that do not contain any of the PII fields.

Note

The `student_placements_pii` dataset is a modified version of the
`student_placements` dataset with one new field, `address`, which
contains PII information about the students. Notice that this additional
`address` field has been simulated and does not correspond to data
from the real users.

```
In [19]: data_pii = load_tabular_demo('student_placements_pii')
In [20]: data_pii.head()
Out[20]:
student_id address gender second_perc high_perc high_spec degree_perc degree_type work_experience experience_years employability_perc mba_spec mba_perc salary placed start_date end_date duration
0 17264 70304 Baker Turnpike\nEricborough, MS 15086 M 67.00 91.00 Commerce 58.00 Sci&Tech False 0 55.0 Mkt&HR 58.80 27000.0 True 2020-07-23 2020-10-12 3.0
1 17265 805 Herrera Avenue Apt. 134\nMaryview, NJ 36510 M 79.33 78.33 Science 77.48 Sci&Tech True 1 86.5 Mkt&Fin 66.28 20000.0 True 2020-01-11 2020-04-09 3.0
2 17266 3702 Bradley Island\nNorth Victor, FL 12268 M 65.00 68.00 Arts 64.00 Comm&Mgmt False 0 75.0 Mkt&Fin 57.80 25000.0 True 2020-01-26 2020-07-13 6.0
3 17267 Unit 0879 Box 3878\nDPO AP 42663 M 56.00 52.00 Science 52.00 Sci&Tech False 0 66.0 Mkt&HR 59.43 NaN False NaT NaT NaN
4 17268 96493 Kelly Canyon Apt. 145\nEast Steven, NC 3... M 85.80 73.60 Commerce 73.30 Comm&Mgmt False 0 96.8 Mkt&Fin 55.50 42500.0 True 2020-07-04 2020-09-27 3.0
```

If we use our tabular model on this new data we will see how the synthetic data that it generates discloses the addresses from the real students:

```
In [21]: model = CopulaGAN(
....: primary_key='student_id',
....: )
....:
In [22]: model.fit(data_pii)
In [23]: new_data_pii = model.sample(200)
In [24]: new_data_pii.head()
Out[24]:
student_id address gender second_perc high_perc high_spec degree_perc degree_type work_experience experience_years employability_perc mba_spec mba_perc salary placed start_date end_date duration
0 0 6994 White Falls\nLake Cynthiaborough, CT 15835 F 88.299259 65.067945 Science 59.582102 Sci&Tech False 1 52.561624 Mkt&Fin 69.336242 28008.887338 True 2020-09-12 2020-06-28 NaN
1 1 29166 Tammy Crest Apt. 839\nSouth Lindaside, F... F 82.424339 82.751783 Arts 67.876132 Comm&Mgmt False 1 66.760665 Mkt&Fin 57.032038 25533.175475 True NaT 2020-06-20 6.0
2 2 8585 Jennifer Road Apt. 853\nNorth Paulside, T... M 80.768921 72.051864 Commerce 62.181795 Comm&Mgmt False 0 73.995943 Mkt&Fin 65.785328 51545.407394 True NaT 2020-07-19 3.0
3 3 0917 Shawn Grove Suite 337\nWhiteville, KS 92476 M 73.083471 110.089528 Commerce 62.232031 Sci&Tech False 1 52.501414 Mkt&Fin 65.539283 NaN True 2020-02-24 2020-12-15 NaN
4 4 29282 Kelley Orchard\nMartinezview, CT 78137 M 88.011639 106.433397 Science 69.887740 Comm&Mgmt False 0 57.791789 Mkt&HR 53.004735 45542.795803 True NaT NaT NaN
```

More specifically, we can see how all the addresses that have been generated actually come from the original dataset:

```
In [25]: new_data_pii.address.isin(data_pii.address).sum()
Out[25]: 200
```

In order to solve this, we can pass an additional argument
`anonymize_fields` to our model when we create the instance. This
`anonymize_fields` argument will need to be a dictionary that
contains:

- The name of the field that we want to anonymize, as the key.
- The category of the field that we want to use when we generate fake values for it, as the value.

The complete list of possible categories can be seen in the Faker Providers page, and it contains a huge list of concepts such as:

- name
- address
- country
- city
- ssn
- credit_card_number
- credit_card_expire
- credit_card_security_code
- email
- telephone
- …

In this case, since the field is an address, we will pass a
dictionary indicating the category `address`:

```
In [26]: model = CopulaGAN(
....: primary_key='student_id',
....: anonymize_fields={
....: 'address': 'address'
....: }
....: )
....:
In [27]: model.fit(data_pii)
```

As a result, we can see how the real `address` values have been
replaced by other fake addresses that were not taken from the real data
that we learned.

```
In [28]: new_data_pii = model.sample(200)
In [29]: new_data_pii.head()
Out[29]:
student_id address gender second_perc high_perc high_spec degree_perc degree_type work_experience experience_years employability_perc mba_spec mba_perc salary placed start_date end_date duration
0 0 61244 Bryan Fort\nPort Derekmouth, IN 37997 F 48.598021 53.000171 Science 80.691675 Sci&Tech False 0 97.191270 Mkt&Fin 72.761389 NaN False 2020-01-04 NaT 6.0
1 1 43155 Perry Island\nEast Melissa, VT 44821 F 72.866875 87.156717 Commerce 77.068867 Comm&Mgmt False 1 83.429756 Mkt&Fin 64.143684 NaN False NaT NaT 3.0
2 2 46668 David Mission Suite 507\nOlsonborough, W... M 73.393634 62.651596 Science 93.250913 Comm&Mgmt False 1 78.143185 Mkt&Fin 69.721572 28495.619069 True NaT NaT 3.0
3 3 29159 Mccarthy Village Apt. 880\nNorth Jennife... F 44.975299 98.821769 Commerce 75.560103 Comm&Mgmt False 0 97.999915 Mkt&HR 65.658346 NaN False 2020-01-13 NaT 3.0
4 4 166 Nielsen Divide Apt. 524\nFigueroashire, KY... F 71.084187 95.284468 Commerce 87.164345 Comm&Mgmt False 0 84.849699 Mkt&Fin 71.587327 NaN False 2020-01-04 NaT 12.0
```

Which means that none of the original addresses can be found in the sampled data:

```
In [30]: data_pii.address.isin(new_data_pii.address).sum()
Out[30]: 0
```
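The two `isin` checks above each count overlapping values in one direction. A small, library-free way to express the same leakage check is a set intersection. In this sketch `leaked_values` is a hypothetical helper, and the addresses are copied from the listings above.

```python
def leaked_values(real_values, synthetic_values):
    """Return the real values that also appear in the synthetic data."""
    return set(real_values) & set(synthetic_values)


real_addresses = [
    "70304 Baker Turnpike\nEricborough, MS 15086",
    "3702 Bradley Island\nNorth Victor, FL 12268",
    "Unit 0879 Box 3878\nDPO AP 42663",
]

# Synthetic rows that reuse real values (what happens without anonymization).
copied = [real_addresses[2], real_addresses[0]]

# Synthetic rows with freshly generated fake values (after anonymization).
anonymized = [
    "61244 Bryan Fort\nPort Derekmouth, IN 37997",
    "43155 Perry Island\nEast Melissa, VT 44821",
]

print(len(leaked_values(real_addresses, copied)))      # 2
print(len(leaked_values(real_addresses, anonymized)))  # 0
```

Because the intersection is symmetric, the result does not depend on which table is checked against which.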

## Advanced Usage

Now that we have discovered the basics, let’s go over a few more
advanced usage examples and see the different arguments that we can pass
to our `CopulaGAN` Model in order to customize it to our needs.

### Exploring the Probability Distributions

During the previous steps, every time we fitted the `CopulaGAN` it performed the following operations:

1. Learn the format and data types of the passed data.
2. Transform the non-numerical and null data using Reversible Data Transforms to obtain a fully numerical representation of the data from which we can learn the probability distributions.
3. Learn the probability distribution of each column from the table.
4. Transform the values of each numerical column by converting them to their marginal distribution CDF values and then applying an inverse CDF transformation of a standard normal on them.
5. Fit a CTGAN model on the transformed data, which learns how each column is correlated to the others.

After this, when we used the model to generate new data for our table
using the `sample` method, it did:

1. Sample rows from the CTGAN model.
2. Revert the sampled values by computing their standard normal CDF and then applying the inverse CDF of their marginal distributions.
3. Revert the RDT transformations to go back to the original data format.
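The two CDF-based steps above (to a standard normal when fitting, and back when sampling) can be sketched with the standard library alone. This is a minimal illustration that assumes an exponential marginal with a known rate in place of the distribution that `copulas` would fit. The function names are hypothetical and this is not SDV's code.

```python
import math
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)
RATE = 2.0  # assumed parameter of the exponential marginal


def marginal_cdf(x):
    """CDF of the assumed marginal: Exponential(RATE)."""
    return 1.0 - math.exp(-RATE * x)


def marginal_inv_cdf(u):
    """Inverse CDF (quantile function) of Exponential(RATE)."""
    return -math.log(1.0 - u) / RATE


def to_normal_space(x):
    """Fit-time step: value -> marginal CDF -> inverse standard normal CDF."""
    return STANDARD_NORMAL.inv_cdf(marginal_cdf(x))


def from_normal_space(z):
    """Sample-time step: value -> standard normal CDF -> inverse marginal CDF."""
    return marginal_inv_cdf(STANDARD_NORMAL.cdf(z))


skewed = [0.05, 0.2, 0.5, 1.0, 2.0]
transformed = [to_normal_space(x) for x in skewed]
recovered = [from_normal_space(z) for z in transformed]
```

The transformation is monotonic, so it preserves the ordering of the values, and the median of the marginal maps to 0 in normal space. The quality of the round trip depends entirely on how well the chosen marginal matches the column, which is why the distributions matter in what follows.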

As you can see, during these steps the *Marginal Probability
Distributions* have a very important role, since the `CopulaGAN`
had to learn and reproduce the individual distributions of each column
in our table. We can explore the distributions which the
`CopulaGAN` used to model each column using its
`get_distributions` method:

```
In [31]: model = CopulaGAN(
....: primary_key='student_id'
....: )
....:
In [32]: model.fit(data)
In [33]: distributions = model.get_distributions()
```

This will return a `dict` which contains the name of the
distribution class used for each column:

```
In [34]: distributions
Out[34]:
{'second_perc': 'copulas.univariate.truncated_gaussian.TruncatedGaussian',
'high_perc': 'copulas.univariate.log_laplace.LogLaplace',
'degree_perc': 'copulas.univariate.student_t.StudentTUnivariate',
'work_experience': 'copulas.univariate.student_t.StudentTUnivariate',
'experience_years': 'copulas.univariate.gaussian.GaussianUnivariate',
'employability_perc': 'copulas.univariate.truncated_gaussian.TruncatedGaussian',
'mba_perc': 'copulas.univariate.gamma.GammaUnivariate',
'placed': 'copulas.univariate.gamma.GammaUnivariate',
'salary#0': 'copulas.univariate.gamma.GammaUnivariate',
'salary#1': 'copulas.univariate.gaussian.GaussianUnivariate',
'start_date#0': 'copulas.univariate.gamma.GammaUnivariate',
'start_date#1': 'copulas.univariate.gaussian.GaussianUnivariate',
'end_date#0': 'copulas.univariate.gamma.GammaUnivariate',
'end_date#1': 'copulas.univariate.gaussian.GaussianUnivariate'}
```

Note

In this list we will see multiple distributions for some of the columns that we have in our data, such as `salary#0` and `salary#1`. This is because the RDT transformations used to encode the data numerically often use more than one column to represent each one of the input variables.
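One plausible reading of paired entries such as `salary#0` and `salary#1` is a value column plus a null-indicator column for fields that contain missing data. That interpretation is an assumption here, not something the output above states. A minimal sketch of such a reversible encoding, with hypothetical helper names and the mean as an assumed fill value:

```python
from statistics import mean


def encode_with_null_flag(values):
    """Split a nullable column into (filled values, null indicator)."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    filled = [fill if v is None else v for v in values]
    is_null = [1.0 if v is None else 0.0 for v in values]
    return filled, is_null


def decode_with_null_flag(filled, is_null):
    """Reverse the encoding: put the missing values back."""
    return [None if flag > 0.5 else v for v, flag in zip(filled, is_null)]


salary = [27000.0, 20000.0, 25000.0, None, 42500.0]  # first rows of the demo table
filled, is_null = encode_with_null_flag(salary)

print(filled[3])   # 28625.0, the mean of the observed salaries
print(is_null)     # [0.0, 0.0, 0.0, 1.0, 0.0]
print(decode_with_null_flag(filled, is_null) == salary)  # True
```

Under this reading, each of the two numerical columns gets its own marginal distribution, which would explain the two entries per field.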

Let’s explore the individual distribution of one of the columns in our
data to better understand how the `CopulaGAN` processed them and
see if we can improve the results by manually specifying a different
distribution. For example, let’s explore the `experience_years` column
by looking at the frequency of its values within the original data:

```
In [35]: data.experience_years.value_counts()
Out[35]:
0 141
1 65
2 8
3 1
Name: experience_years, dtype: int64
In [36]: data.experience_years.hist();
```

By observing the data we can see that the behavior of the values in this column is very similar to a Gamma or even some types of Beta distribution, where the majority of the values are 0 and the frequency decreases as the values increase.

Was the `CopulaGAN` able to capture this distribution on its own?

```
In [37]: distributions['experience_years']
Out[37]: 'copulas.univariate.gaussian.GaussianUnivariate'
```

It seems that it was not, as it considered the behavior to be closer to a Gaussian distribution. As a result, the generated values follow the original frequencies only loosely and can fall outside the valid range for this column. In the sample below, for instance, the value `4` appears even though it never occurs in the real data:

```
In [38]: new_data.experience_years.value_counts()
Out[38]:
0 130
1 51
3 11
2 5
4 3
Name: experience_years, dtype: int64
In [39]: new_data.experience_years.hist();
```
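To get a feel for why a Gaussian marginal is a poor match for this column, the sketch below fits a normal distribution to the original `experience_years` counts shown above. It then measures how much probability mass falls below -0.5, that is, on values that would round to a negative number of years. It uses only `statistics.NormalDist` and is an illustration, not the fitting procedure used by `copulas`.

```python
from statistics import NormalDist

# Frequencies from `data.experience_years.value_counts()` above.
counts = {0: 141, 1: 65, 2: 8, 3: 1}
values = [value for value, count in counts.items() for _ in range(count)]

# Fit a Gaussian by matching the sample mean and standard deviation.
gaussian = NormalDist.from_samples(values)

# Mass that a Gaussian marginal assigns to values rounding below zero.
negative_mass = gaussian.cdf(-0.5)

print(f"mean={gaussian.mean:.3f} stdev={gaussian.stdev:.3f}")
print(f"P(x < -0.5) = {negative_mass:.1%}")
```

The fitted Gaussian places roughly 6% of its mass on values that would round to a negative number, even though the real column has no such values. A distribution that is bounded below does not have this problem.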

Let’s see how we can improve this situation by passing the
`CopulaGAN` the exact distribution that we want it to use for
this column.

### Setting distributions for individual variables

The `CopulaGAN` class offers the possibility to indicate which
distribution to use for each one of the columns in the table, in order
to solve situations like the one that we just described. In order to do
this, we need to pass a `field_distributions` argument with a `dict` that
indicates the distribution that we want to use for each column.

Possible values for the distribution argument are:

- `univariate`: Let `copulas` select the optimal univariate distribution. This may result in non-parametric models being used.
- `parametric`: Let `copulas` select the optimal univariate distribution, but restrict the selection to parametric distributions only.
- `bounded`: Let `copulas` select the optimal univariate distribution, but restrict the selection to bounded distributions only. This may result in non-parametric models being used.
- `semi_bounded`: Let `copulas` select the optimal univariate distribution, but restrict the selection to semi-bounded distributions only. This may result in non-parametric models being used.
- `parametric_bounded`: Let `copulas` select the optimal univariate distribution, but restrict the selection to parametric and bounded distributions only.
- `parametric_semi_bounded`: Let `copulas` select the optimal univariate distribution, but restrict the selection to parametric and semi-bounded distributions only.
- `gaussian`: Use a Gaussian distribution.
- `gamma`: Use a Gamma distribution.
- `beta`: Use a Beta distribution.
- `student_t`: Use a Student T distribution.
- `gaussian_kde`: Use a GaussianKDE distribution. This model is non-parametric, so using this will make `get_parameters` unusable.
- `truncated_gaussian`: Use a Truncated Gaussian distribution.

Let’s see what happens if we make the `CopulaGAN` use the
`gamma` distribution for our column.

```
In [40]: model = CopulaGAN(
....: primary_key='student_id',
....: field_distributions={
....: 'experience_years': 'gamma'
....: }
....: )
....:
In [41]: model.fit(data)
```

After this, we can see how the `CopulaGAN` used the indicated
distribution for the `experience_years` column:

```
In [42]: model.get_distributions()['experience_years']
Out[42]: 'copulas.univariate.gamma.GammaUnivariate'
```

As a result, the generated data should now behave more like the original data and stay within the valid range of values.

```
In [43]: new_data = model.sample(len(data))
In [44]: new_data.experience_years.value_counts()
Out[44]:
0 126
2 23
1 23
3 22
4 17
5 4
Name: experience_years, dtype: int64
In [45]: new_data.experience_years.hist();
```

Note

Even though there are situations like the one shown above where manually
choosing a distribution seems to give better results, in most cases the
`CopulaGAN` will be able to find the optimal distribution on its
own, making this manual search of the marginal distributions necessary
only on rare occasions.

### How to modify the CopulaGAN Hyperparameters?

Apart from the arguments explained above, `CopulaGAN` has a number
of additional hyperparameters that control its learning behavior and can
impact the performance of the model, both in terms of quality of the
generated data and computational time:

- `epochs` and `batch_size`: these arguments control the number of iterations that the model will perform to optimize its parameters, as well as the number of samples used in each step. Their default values are `300` and `500` respectively, and `batch_size` always needs to be a multiple of `10`.

  These hyperparameters have a very direct effect on the time the training process takes, but also on the performance of the model, so for new datasets you might want to start by setting a low value on both of them to see how long the training process takes on your data, and later on increase them to acceptable values in order to improve the performance.

- `log_frequency`: Whether to use log frequency of categorical levels in conditional sampling. It defaults to `True`. This argument affects how the model processes the frequencies of the categorical values that are used to condition the rest of the values. In some cases, changing it to `False` could lead to better performance.
- `embedding_dim` (int): Size of the random sample passed to the Generator. Defaults to 128.
- `generator_dim` (tuple or list of ints): Size of the output samples for each one of the Residuals. A Residual Layer will be created for each one of the values provided. Defaults to (256, 256).
- `discriminator_dim` (tuple or list of ints): Size of the output samples for each one of the Discriminator Layers. A Linear Layer will be created for each one of the values provided. Defaults to (256, 256).
- `generator_lr` (float): Learning rate for the generator. Defaults to 2e-4.
- `generator_decay` (float): Generator weight decay for the Adam Optimizer. Defaults to 1e-6.
- `discriminator_lr` (float): Learning rate for the discriminator. Defaults to 2e-4.
- `discriminator_decay` (float): Discriminator weight decay for the Adam Optimizer. Defaults to 1e-6.
- `discriminator_steps` (int): Number of discriminator updates to do for each generator update. From the WGAN paper: https://arxiv.org/abs/1701.07875. WGAN paper default is 5. Default used is 1 to match original CTGAN implementation.
- `verbose`: Whether to print fit progress on stdout. Defaults to `False`.

Warning

Notice that the value that you set on the `batch_size` argument must always be a
multiple of `10`!
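A small sanity check can catch an invalid `batch_size` before a long training run starts. The sketch below also estimates how many optimization steps a configuration implies, assuming one step per full batch and at least one step per epoch. That step count is an assumption about the training loop, not something this guide specifies, and both helpers are hypothetical.

```python
def validate_batch_size(batch_size):
    """Raise early if `batch_size` is not a positive multiple of 10."""
    if batch_size <= 0 or batch_size % 10 != 0:
        raise ValueError(
            f"batch_size must be a positive multiple of 10, got {batch_size}"
        )
    return batch_size


def estimated_steps(num_rows, epochs, batch_size):
    """Rough number of training steps: epochs times batches per epoch."""
    validate_batch_size(batch_size)
    steps_per_epoch = max(num_rows // batch_size, 1)
    return epochs * steps_per_epoch


# The demo table has 215 rows.
print(estimated_steps(215, epochs=300, batch_size=500))  # 300
print(estimated_steps(215, epochs=500, batch_size=100))  # 1000
```

Under these assumptions, a smaller batch size combined with more epochs multiplies the number of steps, which is one reason to time a short run first.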

As an example, we will try to fit the `CopulaGAN` model slightly
increasing the number of epochs, reducing the `batch_size` and adding one
additional layer to the models involved.

Before we start, we will evaluate the quality of the previously
generated data using the `sdv.evaluation.evaluate` function:

```
In [46]: from sdv.evaluation import evaluate
In [47]: evaluate(new_data, data)
Out[47]: 0.4293634499496008
```

Afterwards, we create a new instance of the `CopulaGAN` model with the
hyperparameter values that we want to use:

```
In [48]: model = CopulaGAN(
....: primary_key='student_id',
....: epochs=500,
....: batch_size=100,
....: generator_dim=(256, 256, 256),
....: discriminator_dim=(256, 256, 256)
....: )
....:
```

And fit to our data.

```
In [49]: model.fit(data)
```

Finally, we are ready to generate new data and evaluate the results.

```
In [50]: new_data = model.sample(len(data))
In [51]: evaluate(new_data, data)
Out[51]: 0.44570271728859173
```

As we can see, in this case these modifications changed the obtained results slightly, but they did not introduce dramatic changes in the performance either.

### Conditional Sampling

As the name implies, conditional sampling allows us to sample from a conditional
distribution using the `CopulaGAN` model, which means we can generate only values that
satisfy certain conditions. These conditional values can be passed to the `conditions`
parameter in the `sample` method either as a dataframe or a dictionary.

In case a dictionary is passed, the model will generate as many rows as requested,
all of which will satisfy the specified conditions, such as `gender = M`.

```
In [52]: conditions = {
....: 'gender': 'M'
....: }
....:
In [53]: model.sample(5, conditions=conditions)
Out[53]:
student_id gender second_perc high_perc high_spec degree_perc degree_type work_experience experience_years employability_perc mba_spec mba_perc salary placed start_date end_date duration
0 0 M 62.824691 89.994451 Commerce 79.616881 Sci&Tech True 1 80.313493 Mkt&Fin 62.488066 33928.559018 False 2020-03-02 NaT 3.0
1 2 M 51.421063 94.999539 Commerce 63.707466 Comm&Mgmt True 1 76.051231 Mkt&HR 59.646386 24479.723607 True NaT 2020-09-21 NaN
2 4 M 73.327214 67.459283 Commerce 64.802464 Comm&Mgmt False 0 87.144877 Mkt&Fin 57.198298 NaN True 2020-04-05 2020-07-06 NaN
3 0 M 62.473799 66.504021 Science 54.757391 Comm&Mgmt True 0 81.076903 Mkt&Fin 58.118415 23146.773791 False 2020-09-14 2020-10-06 3.0
4 2 M 84.812985 63.508492 Science 79.304501 Comm&Mgmt False 0 75.208023 Mkt&Fin 51.563275 30131.412004 True NaT 2020-12-06 3.0
```

It’s also possible to condition on multiple columns, such as
`gender = M, experience_years = 0`.

```
In [54]: conditions = {
....: 'gender': 'M',
....: 'experience_years': 0
....: }
....:
In [55]: model.sample(5, conditions=conditions)
Out[55]:
student_id gender second_perc high_perc high_spec degree_perc degree_type work_experience experience_years employability_perc mba_spec mba_perc salary placed start_date end_date duration
0 1 M 67.080566 76.559468 Science 81.830111 Sci&Tech False 0 94.280770 Mkt&Fin 58.870468 25109.025716 True NaT NaT 3.0
1 2 M 79.033351 110.711037 Science 68.900305 Comm&Mgmt False 0 89.802652 Mkt&HR 70.056052 24073.464084 True 2020-01-30 2020-08-13 3.0
2 4 M 42.711550 64.964331 Science 74.785012 Sci&Tech True 0 60.703893 Mkt&Fin 58.350176 24626.037112 True 2020-01-18 2020-02-09 NaN
3 1 M 41.139236 62.862007 Commerce 47.060809 Sci&Tech False 0 84.811220 Mkt&HR 53.093891 NaN True 2020-02-26 2020-08-19 6.0
4 1 M 78.033437 78.561666 Science 78.033286 Sci&Tech True 0 55.852691 Mkt&Fin 73.215443 28218.308047 True NaT 2020-06-18 12.0
```

The `conditions` can also be passed as a dataframe. In that case, the model
will generate one sample for each row of the dataframe, sorted in the same
order. Since the model already knows how many samples to generate, passing
it as a parameter is unnecessary. For example, if we want to generate three
samples where `gender = M` and three samples with `gender = F`, we can do the
following:

```
In [56]: import pandas as pd
In [57]: conditions = pd.DataFrame({
....: 'gender': ['M', 'M', 'M', 'F', 'F', 'F'],
....: })
....:
In [58]: model.sample(conditions=conditions)
Out[58]:
student_id gender second_perc high_perc high_spec degree_perc degree_type work_experience experience_years employability_perc mba_spec mba_perc salary placed start_date end_date duration
0 0 M 87.433534 62.038618 Science 72.119368 Comm&Mgmt False 0 56.510793 Mkt&HR 72.399082 21962.974001 True NaT NaT 12.0
1 2 M 87.772965 68.389526 Science 64.863942 Comm&Mgmt False 1 98.000000 Mkt&HR 55.841082 28432.974984 True 2020-01-06 2020-06-02 3.0
2 0 M 45.575536 55.042077 Commerce 80.583900 Comm&Mgmt True 0 83.856533 Mkt&HR 67.077801 47637.934860 True 2020-03-25 NaT 6.0
3 1 F 42.227821 58.204189 Commerce 75.330320 Comm&Mgmt False 0 75.372946 Mkt&Fin 64.055297 NaN True 2020-02-23 2020-07-22 3.0
4 0 F 73.056691 89.878789 Commerce 81.565637 Comm&Mgmt False 1 65.926540 Mkt&HR 53.553933 20715.466898 True NaT NaT 3.0
5 1 F 56.742777 52.504886 Commerce 61.510134 Comm&Mgmt False 1 93.652418 Mkt&HR 69.574096 28118.452860 True 2020-11-23 2020-07-31 3.0
```

`CopulaGAN` also supports conditioning on continuous values, as long as the values
are within the range of seen numbers. For example, if all the values of the
dataset are within 0 and 1, `CopulaGAN` will not be able to set this value to 1000.

```
In [59]: conditions = {
....: 'degree_perc': 70.0
....: }
....:
In [60]: model.sample(5, conditions=conditions)
Out[60]:
student_id gender second_perc high_perc high_spec degree_perc degree_type work_experience experience_years employability_perc mba_spec mba_perc salary placed start_date end_date duration
0 8 F 78.024897 91.492040 Science 70.0 Comm&Mgmt True 0 67.199453 Mkt&HR 50.694556 74445.440718 True NaT 2020-06-10 NaN
1 17 M 57.464505 37.361945 Commerce 70.0 Sci&Tech True 0 72.781478 Mkt&Fin 67.654225 23886.987968 True 2020-01-10 2020-07-04 NaN
2 23 M 57.807262 73.206394 Commerce 70.0 Comm&Mgmt False 0 97.675316 Mkt&Fin 59.201454 NaN True 2020-01-20 2021-02-02 3.0
3 4 M 77.997035 43.882148 Commerce 70.0 Comm&Mgmt False 2 97.923731 Mkt&Fin 66.747637 26339.359603 True 2020-01-22 2020-06-04 3.0
4 6 F 69.974488 64.250549 Commerce 70.0 Sci&Tech False 0 74.059313 Mkt&Fin 49.244701 NaN True 2020-07-03 NaT 3.0
```

Note

Currently, conditional sampling works through a rejection sampling process,
where rows are sampled repeatedly until one that satisfies the conditions is
found. In case you are running into a `Could not get enough valid rows within
x trials` error or simply wish to optimize the results, there are three parameters
that can be fine-tuned: `max_rows_multiplier`, `max_retries` and `float_rtol`.
More information about these parameters can be found in the API section.
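The rejection sampling process described in the note can be sketched as follows. This is a simplified illustration, not SDV's implementation. `sample_unconditioned` stands in for the model, and the `max_retries`, `max_rows_multiplier` and `float_rtol` arguments only mirror the names mentioned above, with semantics assumed for the example.

```python
import math
import random


def sample_unconditioned(num_rows, rng):
    """Stand-in for a fitted model: rows with a category and a float."""
    return [
        {
            "gender": rng.choice(["M", "F"]),
            "degree_perc": rng.choice([60.0, 70.0, 80.0]),
        }
        for _ in range(num_rows)
    ]


def matches(row, conditions, float_rtol):
    """Check a row against the conditions, comparing floats with a tolerance."""
    for column, expected in conditions.items():
        actual = row[column]
        if isinstance(expected, float):
            if not math.isclose(actual, expected, rel_tol=float_rtol):
                return False
        elif actual != expected:
            return False
    return True


def sample_conditioned(num_rows, conditions, max_retries=100,
                       max_rows_multiplier=10, float_rtol=0.01, seed=0):
    """Keep sampling and discarding rows until enough valid ones are found."""
    rng = random.Random(seed)
    valid = []
    for _ in range(max_retries):
        # Oversample on each trial, since many rows will be rejected.
        candidates = sample_unconditioned(num_rows * max_rows_multiplier, rng)
        valid.extend(
            row for row in candidates if matches(row, conditions, float_rtol)
        )
        if len(valid) >= num_rows:
            return valid[:num_rows]
    raise ValueError(
        f"Could not get enough valid rows within {max_retries} trials"
    )


rows = sample_conditioned(5, {"gender": "M", "degree_perc": 70.0})
```

The sketch makes the trade-off visible: rare conditions reject most candidates, so raising the number of retries or the oversampling factor buys more chances at the cost of more sampling work. A condition that the model can never produce fails no matter how the parameters are set.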

### How do I specify constraints?

If you look closely at the data you may notice that some properties were
not completely captured by the model. For example, you may have seen
that sometimes the model produces an `experience_years` number greater
than `0` while also indicating that `work_experience` is `False`.
These types of properties are what we call `Constraints` and can also
be handled using `SDV`. For further details about them please visit
the Handling Constraints guide.
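As a minimal illustration of what such a constraint expresses, the rule "no experience years without work experience" can be written as a plain predicate and checked against sampled rows. This is not the SDV constraints API, which the Handling Constraints guide covers. The rows below reuse `work_experience` and `experience_years` pairs from the synthetic listings above, and `is_consistent` is a hypothetical helper.

```python
def is_consistent(row):
    """Valid if experience_years is 0 whenever work_experience is False."""
    return row["work_experience"] or row["experience_years"] == 0


sampled = [
    {"work_experience": False, "experience_years": 4},  # violates the rule
    {"work_experience": False, "experience_years": 1},  # violates the rule
    {"work_experience": True, "experience_years": 2},
    {"work_experience": False, "experience_years": 0},
]

violations = [row for row in sampled if not is_consistent(row)]
kept = [row for row in sampled if is_consistent(row)]

print(len(violations))  # 2
print(len(kept))        # 2
```

Filtering after sampling is only one way to enforce a rule like this, and it discards rows. Counting violations is still a quick way to find out whether a constraint is worth declaring.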

### Can I evaluate the Synthetic Data?

A very common question when someone starts using **SDV** to generate
synthetic data is: *“How good is the data that I just generated?”*

In order to answer this question, **SDV** has a collection of metrics
and tools that allow you to compare the *real* data that you provided and the
*synthetic* data that you generated using **SDV** or any other tool.

You can read more about this in the Synthetic Data Evaluation guide.