# GaussianCopula Model

In this guide we will go through a series of steps that will let you
discover the functionalities of the `GaussianCopula` model, including how to:

- Create an instance of a `GaussianCopula`.
- Fit the instance to your data.
- Generate synthetic versions of your data.
- Use `GaussianCopula` to anonymize PII.
- Customize the data transformations to improve the learning process.
- Specify the column distributions to improve the output quality.

## What is GaussianCopula?

The `sdv.tabular.GaussianCopula` model is based on copula functions.

In mathematical terms, a *Gaussian copula* is a distribution over the unit
cube \([0,1]^{d}\) which is constructed from a multivariate normal
distribution over \(\mathbb{R}^{d}\) by using the probability integral
transform. Intuitively, a *copula* is a mathematical function that allows
us to describe the joint distribution of multiple random variables by
analyzing the dependencies between their marginal distributions.
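The construction described above can be illustrated with a minimal sketch that uses only the Python standard library. It draws correlated standard normal pairs and pushes each coordinate through the normal CDF (the probability integral transform), which yields points in the unit square. The correlation `rho = 0.8` and the helper name `sample_gaussian_copula` are assumptions made for this illustration and are not part of SDV.

```python
import random
from statistics import NormalDist

random.seed(0)
rho = 0.8  # assumed correlation between the two latent normal variables
std_normal = NormalDist(0.0, 1.0)


def sample_gaussian_copula(n, rho):
    """Draw n points from a bivariate Gaussian copula with correlation rho."""
    points = []
    for _ in range(n):
        # Correlated standard normals built from two independent draws.
        z1 = random.gauss(0.0, 1.0)
        z2 = rho * z1 + (1.0 - rho ** 2) ** 0.5 * random.gauss(0.0, 1.0)
        # Probability integral transform: the normal CDF maps each
        # coordinate into [0, 1] while keeping the dependence structure.
        points.append((std_normal.cdf(z1), std_normal.cdf(z2)))
    return points


copula_points = sample_gaussian_copula(1000, rho)
```

Each coordinate on its own is approximately uniform on the unit interval, while the pairs remain dependent: large values of the first coordinate tend to come with large values of the second.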

Let’s now discover how to learn a dataset and later on generate
synthetic data with the same format and statistical properties by using
the `GaussianCopula` model.

## Quick Usage

We will start by loading one of our demo datasets, `student_placements`,
which contains information about MBA students who applied for placements
during the year 2020.

```
In [1]: from sdv.demo import load_tabular_demo
In [2]: data = load_tabular_demo('student_placements')
In [3]: data.head()
Out[3]:
student_id gender second_perc high_perc high_spec degree_perc degree_type work_experience experience_years employability_perc mba_spec mba_perc salary placed start_date end_date duration
0 17264 M 67.00 91.00 Commerce 58.00 Sci&Tech False 0 55.0 Mkt&HR 58.80 27000.0 True 2020-07-23 2020-10-12 3.0
1 17265 M 79.33 78.33 Science 77.48 Sci&Tech True 1 86.5 Mkt&Fin 66.28 20000.0 True 2020-01-11 2020-04-09 3.0
2 17266 M 65.00 68.00 Arts 64.00 Comm&Mgmt False 0 75.0 Mkt&Fin 57.80 25000.0 True 2020-01-26 2020-07-13 6.0
3 17267 M 56.00 52.00 Science 52.00 Sci&Tech False 0 66.0 Mkt&HR 59.43 NaN False NaT NaT NaN
4 17268 M 85.80 73.60 Commerce 73.30 Comm&Mgmt False 0 96.8 Mkt&Fin 55.50 42500.0 True 2020-07-04 2020-09-27 3.0
```

As you can see, this table contains information about students, including, among other things:

- Their id and gender
- Their grades and specializations
- Their work experience
- The salary that they were offered
- The duration and dates of their placement

You will notice that the data has the following characteristics:

- There are float, integer, boolean, categorical and datetime values.
- Some variables have missing data. In particular, all the data related to the placement details is missing in the rows where the student was not placed.

Let us use the `GaussianCopula` to learn this data and then sample
synthetic data about new students to see how well the model captures the
characteristics indicated above. In order to do this you will need to:

- Import the `sdv.tabular.GaussianCopula` class and create an instance of it.
- Call its `fit` method, passing our table.
- Call its `sample` method, indicating the number of synthetic rows that you want to generate.

```
In [4]: from sdv.tabular import GaussianCopula
In [5]: model = GaussianCopula()
In [6]: model.fit(data)
```

Note

Notice that the model fitting process took care of transforming the
different fields using the appropriate Reversible Data Transforms to
ensure that the data has a format that the underlying
`GaussianMultivariate` model can handle.

### Generate synthetic data from the model

Once the modeling has finished you are ready to generate new synthetic
data by calling the `sample` method from your model, passing the number
of rows that you want to generate.

```
In [7]: new_data = model.sample(200)
```

This will return a table with the same columns and format as the one the model was fitted on, but filled with new data that resembles the original.

```
In [8]: new_data.head()
Out[8]:
student_id gender second_perc high_perc high_spec degree_perc degree_type work_experience experience_years employability_perc mba_spec mba_perc salary placed start_date end_date duration
0 17445 M 71.502694 66.540612 Science 71.228192 Comm&Mgmt False 0 57.663848 Mkt&Fin 63.393554 26904.930523 True 2020-03-14 2020-09-03 3.0
1 17367 F 63.324587 55.223690 Commerce 64.185648 Comm&Mgmt False -1 65.277799 Mkt&HR 61.486636 NaN False NaT NaT NaN
2 17281 F 74.268674 69.399272 Science 72.559820 Sci&Tech False 0 86.657122 Mkt&Fin 65.642716 36267.041374 True 2020-02-29 2020-09-23 12.0
3 17372 M 87.791407 93.223229 Science 70.333096 Sci&Tech False 0 79.222248 Mkt&HR 65.873716 23750.433128 True 2020-02-17 2020-05-26 3.0
4 17380 F 68.274953 63.540454 Commerce 77.060372 Sci&Tech False 1 51.943410 Mkt&HR 62.874340 28801.338637 False 2020-01-04 2020-05-26 NaN
```

Note

You can control the number of rows by passing the number of samples to
`model.sample(<num_rows>)`. To test, try `model.sample(10000)`. Note that
the original table only had ~200 rows.

### Save and Load the model

In many scenarios it will be convenient to generate synthetic versions
of your data directly in systems that do not have access to the original
data source. For example, you may want to generate testing data on
the fly inside a testing environment that does not have access to your
production database. In these scenarios, fitting the model with real
data every time that you need to generate new data is not feasible, so
you will need to fit a model in your production environment, save the
fitted model into a file, send this file to the testing environment and
then load it there to be able to `sample` from it.

Let’s see how this process works.
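The model file has to be written before it can be loaded elsewhere. In SDV this is typically done by calling the fitted model's `save` method with a file path, for example `model.save('my_model.pkl')`, which is the file name used by the loading step. The sketch below imitates that round trip with the standard `pickle` module and a plain dictionary standing in for the fitted model, so it runs without SDV or any data. The helper names `save_model` and `load_model` and the contents of `fitted_state` are hypothetical; this is an illustration of the workflow, not SDV's implementation.

```python
import os
import pickle
import tempfile


def save_model(fitted_state, path):
    # Serialize everything that was learned into a single file.
    with open(path, 'wb') as output:
        pickle.dump(fitted_state, output)


def load_model(path):
    # Rebuild the learned state; the original data is not needed here.
    with open(path, 'rb') as source:
        return pickle.load(source)


# Hypothetical "fitted model": only learned parameters, no real rows.
fitted_state = {'columns': ['salary'], 'means': {'salary': 28000.0}}

path = os.path.join(tempfile.mkdtemp(), 'my_model.pkl')
save_model(fitted_state, path)
restored_state = load_model(path)
```

The point of the design is that only the saved file travels between environments, so the receiving system never needs access to the production data.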

#### Load the model and generate new data

The file you just generated can be sent over to the system where the
synthetic data will be generated. Once it is there, you can load it
using the `GaussianCopula.load` method, and then you are ready to
sample new data from the loaded instance:

```
In [10]: loaded = GaussianCopula.load('my_model.pkl')
In [11]: new_data = loaded.sample(200)
```

Warning

Notice that the system where the model is loaded also needs to have
`sdv` installed, otherwise it will not be able to load the model and
use it.

### Specifying the Primary Key of the table

One of the first things that you may have noticed when looking at the demo
data is that there is a `student_id` column which acts as the primary
key of the table, and which is supposed to have unique values. Indeed,
if we look at the number of times that each value appears, we see that
all of them appear at most once:

```
In [12]: data.student_id.value_counts().max()
Out[12]: 1
```

However, if we look at the synthetic data that we generated, we observe that there are some values that appear more than once:

```
In [13]: new_data[new_data.student_id == new_data.student_id.value_counts().index[0]]
Out[13]:
student_id gender second_perc high_perc high_spec degree_perc degree_type work_experience experience_years employability_perc mba_spec mba_perc salary placed start_date end_date duration
101 17451 F 53.777573 77.469202 Commerce 58.886594 Comm&Mgmt False 0 55.789568 Mkt&HR 66.764646 NaN False NaT NaT NaN
122 17451 F 48.609547 56.208351 Science 72.103867 Comm&Mgmt False 0 51.141349 Mkt&Fin 62.086654 NaN False NaT NaT NaN
129 17451 M 66.295968 64.914855 Commerce 69.609510 Comm&Mgmt False 0 65.348814 Mkt&Fin 52.330106 24487.618792 True 2020-04-13 2020-09-05 3.0
150 17451 M 60.642834 54.973978 Science 60.465220 Sci&Tech False 1 71.296884 Mkt&HR 54.353150 NaN False NaT NaT NaN
165 17451 M 72.720120 50.695996 Commerce 66.451778 Comm&Mgmt False 0 73.942398 Mkt&HR 54.780965 29516.728716 True 2020-07-20 2020-08-04 3.0
189 17451 F 73.104277 65.423069 Science 75.147736 Comm&Mgmt False 0 76.765632 Mkt&Fin 71.146266 29548.021837 True 2020-01-12 2020-06-07 3.0
```

This happens because the model was never told that `student_id` had to be
unique, so when it generates new data it will produce collisions sooner
or later. To solve this, we can pass the `primary_key` argument to our
model when we create it, indicating the name of the column that is the
index of the table.

```
In [14]: model = GaussianCopula(
....: primary_key='student_id'
....: )
....:
In [15]: model.fit(data)
In [16]: new_data = model.sample(200)
In [17]: new_data.head()
Out[17]:
student_id gender second_perc high_perc high_spec degree_perc degree_type work_experience experience_years employability_perc mba_spec mba_perc salary placed start_date end_date duration
0 0 F 75.954437 65.874155 Science 73.071839 Comm&Mgmt False 1 62.245229 Mkt&Fin 59.854641 25272.228737 True 2020-03-21 2020-08-09 3.0
1 1 M 76.199565 69.003440 Science 74.994287 Sci&Tech False 1 89.078363 Mkt&Fin 71.451042 36267.842700 True 2020-02-08 2020-09-07 3.0
2 2 M 47.637773 55.690358 Science 58.478495 Sci&Tech False 1 55.157416 Mkt&HR 57.272402 NaN False NaT NaT NaN
3 3 M 63.262929 63.947319 Science 71.171869 Comm&Mgmt False 0 88.587588 Mkt&Fin 62.559554 22892.135506 True 2020-01-05 2020-06-20 3.0
4 4 M 78.148608 56.609793 Science 73.367579 Sci&Tech False 0 84.840520 Mkt&Fin 63.818865 29655.467846 True 2020-03-27 2020-08-21 3.0
```

As a result, the model will learn that this column must be unique and generate a unique sequence of values for the column:

```
In [18]: new_data.student_id.value_counts().max()
Out[18]: 1
```
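To see why repeated ids are expected when no primary key is declared, consider the following minimal sketch. It treats the id column as independent draws from a limited set of values, which is a simplification of what the model does, and compares that with a generated unique sequence. The range of 215 consecutive ids starting at 17264 is an assumption based on the demo table.

```python
import random
from collections import Counter

random.seed(0)
known_ids = list(range(17264, 17264 + 215))  # assumed: 215 consecutive ids

# Without a primary key, ids are drawn like any other column,
# so the same value can be produced more than once.
sampled_ids = random.choices(known_ids, k=200)
max_repeats = max(Counter(sampled_ids).values())

# With a primary key, the column is generated as a unique sequence instead.
unique_ids = list(range(200))
max_repeats_with_key = max(Counter(unique_ids).values())
```

With 200 independent draws from 215 possible values, at least one repeated id is practically guaranteed, while the generated sequence never repeats.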

### Anonymizing Personally Identifiable Information (PII)

There will be many cases where the data contains Personally Identifiable Information which we cannot disclose. In these cases, we will want our Tabular Models to replace the information within these fields with fake, simulated data that looks similar to the real data but does not contain any of the original values.

Let’s load a new dataset that contains a PII field, the
`student_placements_pii` demo, and try to generate synthetic versions
of it that do not contain any of the original PII values.

Note

The `student_placements_pii` dataset is a modified version of the
`student_placements` dataset with one new field, `address`, which
contains PII about the students. Notice that this additional `address`
field has been simulated and does not correspond to data from the real
users.

```
In [19]: data_pii = load_tabular_demo('student_placements_pii')
In [20]: data_pii.head()
Out[20]:
student_id address gender second_perc high_perc high_spec degree_perc degree_type work_experience experience_years employability_perc mba_spec mba_perc salary placed start_date end_date duration
0 17264 70304 Baker Turnpike\nEricborough, MS 15086 M 67.00 91.00 Commerce 58.00 Sci&Tech False 0 55.0 Mkt&HR 58.80 27000.0 True 2020-07-23 2020-10-12 3.0
1 17265 805 Herrera Avenue Apt. 134\nMaryview, NJ 36510 M 79.33 78.33 Science 77.48 Sci&Tech True 1 86.5 Mkt&Fin 66.28 20000.0 True 2020-01-11 2020-04-09 3.0
2 17266 3702 Bradley Island\nNorth Victor, FL 12268 M 65.00 68.00 Arts 64.00 Comm&Mgmt False 0 75.0 Mkt&Fin 57.80 25000.0 True 2020-01-26 2020-07-13 6.0
3 17267 Unit 0879 Box 3878\nDPO AP 42663 M 56.00 52.00 Science 52.00 Sci&Tech False 0 66.0 Mkt&HR 59.43 NaN False NaT NaT NaN
4 17268 96493 Kelly Canyon Apt. 145\nEast Steven, NC 3... M 85.80 73.60 Commerce 73.30 Comm&Mgmt False 0 96.8 Mkt&Fin 55.50 42500.0 True 2020-07-04 2020-09-27 3.0
```

If we use our tabular model on this new data we will see how the synthetic data that it generates discloses the addresses from the real students:

```
In [21]: model = GaussianCopula(
....: primary_key='student_id',
....: )
....:
In [22]: model.fit(data_pii)
In [23]: new_data_pii = model.sample(200)
In [24]: new_data_pii.head()
Out[24]:
student_id address gender second_perc high_perc high_spec degree_perc degree_type work_experience experience_years employability_perc mba_spec mba_perc salary placed start_date end_date duration
0 0 61900 Monica Stream Suite 028\nPort Michael, M... M 63.261295 59.026588 Science 62.386083 Comm&Mgmt False 0 96.785469 Mkt&HR 62.152346 35384.025614 True 2020-05-07 2020-09-19 NaN
1 1 201 Rhodes Isle\nPort Robertmouth, MS 27570 M 45.378136 47.840825 Commerce 61.841753 Comm&Mgmt False 0 57.550855 Mkt&HR 57.988714 NaN False NaT NaT NaN
2 2 9976 James Crest Apt. 125\nStevenhaven, GA 30830 M 52.900469 65.948894 Commerce 52.293776 Comm&Mgmt False 0 92.892410 Mkt&HR 62.259721 22265.435572 True 2020-03-03 2020-09-24 3.0
3 3 Unit 4181 Box 7016\nDPO AP 47399 F 57.967665 60.289538 Commerce 55.759392 Comm&Mgmt False 0 83.009944 Mkt&Fin 58.468285 NaN False NaT NaT NaN
4 4 96493 Kelly Canyon Apt. 145\nEast Steven, NC 3... M 83.905387 73.193487 Science 69.023676 Comm&Mgmt False -1 97.500080 Mkt&Fin 58.657225 NaN False NaT NaT NaN
```

More specifically, we can see how all the addresses that have been generated actually come from the original dataset:

```
In [25]: new_data_pii.address.isin(data_pii.address).sum()
Out[25]: 200
```

In order to solve this, we can pass an additional argument
`anonymize_fields` to our model when we create the instance. This
`anonymize_fields` argument will need to be a dictionary that contains:

- The name of the field that we want to anonymize.
- The category of the field that we want to use when we generate fake values for it.

The complete list of possible categories can be seen in the Faker Providers page, and it contains a huge list of concepts such as:

- name
- address
- country
- city
- ssn
- credit_card_number
- credit_card_expire
- credit_card_security_code
- email
- telephone
- …

In this case, since the field is an address, we will pass a
dictionary indicating the category `address`:

```
In [26]: model = GaussianCopula(
....: primary_key='student_id',
....: anonymize_fields={
....: 'address': 'address'
....: }
....: )
....:
In [27]: model.fit(data_pii)
```

As a result, we can see how the real `address` values have been
replaced by other fake addresses:

```
In [28]: new_data_pii = model.sample(200)
In [29]: new_data_pii.head()
Out[29]:
student_id address gender second_perc high_perc high_spec degree_perc degree_type work_experience experience_years employability_perc mba_spec mba_perc salary placed start_date end_date duration
0 0 0454 Martin Ridges\nNew Ritahaven, GA 67609 F 72.703674 57.257009 Science 69.682157 Sci&Tech False 0 75.155014 Mkt&HR 58.372490 24187.634324 True 2020-02-20 2020-06-13 3.0
1 1 1962 Leon Islands Suite 710\nFrancistown, NJ 5... F 64.055928 132.110982 Science 73.123287 Comm&Mgmt False 1 71.687703 Mkt&Fin 66.351226 20490.139595 True 2020-04-26 2020-10-14 3.0
2 2 80198 James Greens Apt. 219\nPort Alicia, WV 8... F 82.012310 62.888033 Commerce 67.821035 Comm&Mgmt False 1 56.923800 Mkt&Fin 65.323453 28392.759021 True 2020-01-08 2020-10-25 12.0
3 3 034 Stevens Island\nNorth Robin, NE 02835 M 60.280532 74.174304 Commerce 56.581816 Comm&Mgmt False 0 62.978577 Mkt&HR 61.944776 NaN False NaT NaT NaN
4 4 1878 Ward Rue Apt. 566\nEast Markburgh, VT 60004 M 75.491247 62.711577 Science 73.754758 Comm&Mgmt False 0 88.443999 Mkt&Fin 58.990644 27134.947928 True 2020-01-10 2020-07-26 12.0
```

This means that none of the original addresses can be found in the sampled data:

```
In [30]: data_pii.address.isin(new_data_pii.address).sum()
Out[30]: 0
```
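The idea behind anonymization can be sketched without SDV or Faker: values for the anonymized column are generated from scratch instead of being reused from the real data, so none of the originals can leak. In the sketch below, `fake_address` is a hypothetical toy stand-in for a Faker `address` provider, and the three real addresses are shortened versions of rows from the demo table.

```python
import random

random.seed(0)
real_addresses = [
    '70304 Baker Turnpike',
    '805 Herrera Avenue Apt. 134',
    '3702 Bradley Island',
]


def fake_address():
    """Toy stand-in for a Faker 'address' provider (hypothetical helper)."""
    number = random.randint(1, 99999)
    street = random.choice(['Maple', 'Cedar', 'Willow', 'Juniper'])
    kind = random.choice(['Street', 'Road', 'Lane'])
    return f'{number} {street} {kind}'


# Anonymized columns are generated from scratch, ignoring the real values.
synthetic_addresses = [fake_address() for _ in range(200)]
leaked = sum(address in real_addresses for address in synthetic_addresses)
```

Because the fake values never look at the real column, the overlap check mirrors the `isin(...).sum()` check above and reports zero leaked addresses.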

## Advanced Usage

Now that we have discovered the basics, let’s go over a few more
advanced usage examples and see the different arguments that we can pass
to our `GaussianCopula` model in order to customize it to our needs.

### How to set transforms to use?

One thing that you may have noticed when executing the previous steps is
that the fitting process took much longer on the `student_placements_pii`
dataset than it took on the previous version that did not contain the
student `address`. This happens because the `address` field is
interpreted as a categorical variable, which the `GaussianCopula`
one-hot encoded, generating 215 new columns that it then had to learn.

This transformation, which in this case was very inefficient, happens
because the Tabular Models apply Reversible Data Transforms under the
hood to transform all the non-numerical variables, which the underlying
models cannot handle, into numerical representations which they can
properly work with. In the case of the `GaussianCopula`, the default
transformation is a One-Hot encoding, which can work very well with
variables that have a small number of different values, but which is
very inefficient in cases where there is a large number of values.
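The cost difference between encodings can be made concrete with a small standard-library sketch. It encodes a column of 215 distinct values both ways and compares the number of numerical columns produced. The placeholder values are assumptions for illustration, and the encoders are written by hand here rather than taken from RDT.

```python
values = [f'address {i}' for i in range(215)]  # assumed: all values distinct

categories = sorted(set(values))

# One-hot encoding: one 0/1 column per distinct category.
one_hot = [
    [int(value == category) for category in categories]
    for value in values
]

# Label encoding: a single column holding one integer per category.
label_of = {category: index for index, category in enumerate(categories)}
labels = [label_of[value] for value in values]

one_hot_columns = len(one_hot[0])
label_columns = 1
```

For a column where nearly every value is unique, one-hot encoding grows with the number of rows, while label encoding always stays at a single column.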

For this reason, the Tabular Models have an additional argument called
`field_transformers` that lets you select which transformer to apply to
each column. This `field_transformers` argument must be passed as a
`dict` which contains the names of the fields for which we want to use
a transformer different from the default, and the name of the
transformer that we want to use.

Possible transformer names are:

- `integer`: Uses a `NumericalTransformer` of dtype `int`.
- `float`: Uses a `NumericalTransformer` of dtype `float`.
- `categorical`: Uses a `CategoricalTransformer` without gaussian noise.
- `categorical_fuzzy`: Uses a `CategoricalTransformer` adding gaussian noise.
- `one_hot_encoding`: Uses a `OneHotEncodingTransformer`.
- `label_encoding`: Uses a `LabelEncodingTransformer`.
- `boolean`: Uses a `BooleanTransformer`.
- `datetime`: Uses a `DatetimeTransformer`.

**NOTE**: For additional details about each one of the transformers,
please visit the RDT documentation.

Let’s now try to improve the previous fitting process by changing the
transformer that we use for the `address` field to something other
than the default. As an example, we will use the `label_encoding`
transformer, which, instead of generating one column for each possible
value, just replaces each value with a unique integer.

```
In [31]: model = GaussianCopula(
....: primary_key='student_id',
....: anonymize_fields={
....: 'address': 'address'
....: },
....: field_transformers={
....: 'address': 'label_encoding'
....: }
....: )
....:
In [32]: model.fit(data_pii)
In [33]: new_data_pii = model.sample(200)
In [34]: new_data_pii.head()
Out[34]:
student_id address gender second_perc high_perc high_spec degree_perc degree_type work_experience experience_years employability_perc mba_spec mba_perc salary placed start_date end_date duration
0 0 5139 Suzanne Way\nCarolport, PA 02727 M 45.763635 65.665957 Commerce 59.476828 Comm&Mgmt False 1 83.823511 Mkt&Fin 56.427125 25304.517813 True 2020-01-14 2020-06-25 NaN
1 1 65896 Franklin Station\nCamposshire, AZ 78118 M 85.057499 75.530829 Science 76.665179 Sci&Tech False 1 92.332921 Mkt&Fin 72.020314 47549.272263 True 2020-01-24 2020-07-16 3.0
2 2 387 Gregory Dam Suite 757\nRussohaven, SD 50616 M 86.130538 104.161759 Commerce 83.399081 Comm&Mgmt False 1 90.106108 Mkt&HR 72.840935 30295.239108 True 2020-01-27 2020-07-15 3.0
3 3 53255 Aaron Dam\nNew Kathleenhaven, RI 91431 F 68.170330 66.916735 Science 67.605387 Comm&Mgmt False 0 71.414406 Mkt&HR 71.746166 28016.888923 True 2020-01-17 2020-09-09 3.0
4 4 728 Zachary Point Apt. 282\nNew Michaeltown, F... F 68.929501 65.213088 Commerce 66.461774 Comm&Mgmt False 0 89.353622 Mkt&Fin 66.328419 NaN False NaT NaT NaN
```

### Exploring the Probability Distributions

During the previous steps, every time we fitted the `GaussianCopula`
it performed the following operations:

1. Learn the format and data types of the passed data.
2. Transform the non-numerical and null data using Reversible Data Transforms to obtain a fully numerical representation of the data from which we can learn the probability distributions.
3. Learn the probability distribution of each column from the table.
4. Transform the values of each numerical column by converting them to their marginal distribution CDF values and then applying an inverse CDF transformation of a standard normal on them.
5. Learn the correlations of the newly generated random variables.

After this, when we used the model to generate new data for our table
using the `sample` method, it did the following:

1. Sample from a Multivariate Standard Normal distribution with the learned correlations.
2. Revert the sampled values by computing their standard normal CDF and then applying the inverse CDF of their marginal distributions.
3. Revert the RDT transformations to go back to the original data format.
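The numerical core of the fitting and sampling steps can be sketched for two columns using only the standard library. This is a simplified illustration, not the code used by SDV or `copulas`: it assumes both marginals are exponential (fitted through their mean), skips the RDT transformation steps, and uses hypothetical function names `fit` and `sample`.

```python
import math
import random
from statistics import NormalDist, mean

random.seed(0)
STD_NORMAL = NormalDist(0.0, 1.0)
EPS = 1e-9  # keeps CDF values strictly inside (0, 1)


def clip(u):
    return min(max(u, EPS), 1.0 - EPS)


def pearson(a, b):
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def fit(col_a, col_b):
    # Learn each marginal (assumed exponential, described by its mean).
    scale_a, scale_b = mean(col_a), mean(col_b)
    # Marginal CDF first, then the inverse standard normal CDF.
    z_a = [STD_NORMAL.inv_cdf(clip(1.0 - math.exp(-x / scale_a))) for x in col_a]
    z_b = [STD_NORMAL.inv_cdf(clip(1.0 - math.exp(-y / scale_b))) for y in col_b]
    # Learn the correlation of the transformed variables.
    return scale_a, scale_b, pearson(z_a, z_b)


def sample(model, num_rows):
    scale_a, scale_b, rho = model
    rows = []
    for _ in range(num_rows):
        # Sample a standard bivariate normal with the learned correlation.
        z_a = random.gauss(0.0, 1.0)
        z_b = rho * z_a + math.sqrt(1.0 - rho ** 2) * random.gauss(0.0, 1.0)
        # Standard normal CDF first, then the inverse marginal CDF.
        u_a, u_b = clip(STD_NORMAL.cdf(z_a)), clip(STD_NORMAL.cdf(z_b))
        rows.append((-scale_a * math.log(1.0 - u_a),
                     -scale_b * math.log(1.0 - u_b)))
    return rows


# Toy data: the second column depends positively on the first.
col_a = [random.expovariate(1.0) for _ in range(500)]
col_b = [0.5 * a + random.expovariate(2.0) for a in col_a]
toy_model = fit(col_a, col_b)
synthetic_rows = sample(toy_model, 200)
```

Because sampled values are mapped back through the inverse of each marginal CDF, every synthetic value stays inside the support of its assumed marginal (non-negative here), which is why the choice of marginal distribution matters so much.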

As you can see, during these steps the *Marginal Probability
Distributions* have a very important role, since the `GaussianCopula`
had to learn and reproduce the individual distributions of each column
in our table. We can explore the distributions which the
`GaussianCopula` used to model each column using its
`get_distributions` method:

```
In [35]: model = GaussianCopula(
....: primary_key='student_id'
....: )
....:
In [36]: model.fit(data)
In [37]: distributions = model.get_distributions()
```

This will return a `dict` which contains the name of the
distribution class used for each column:

```
In [38]: distributions
Out[38]:
{'second_perc': 'copulas.univariate.truncated_gaussian.TruncatedGaussian',
'high_perc': 'copulas.univariate.log_laplace.LogLaplace',
'degree_perc': 'copulas.univariate.student_t.StudentTUnivariate',
'work_experience': 'copulas.univariate.student_t.StudentTUnivariate',
'experience_years': 'copulas.univariate.gaussian.GaussianUnivariate',
'employability_perc': 'copulas.univariate.truncated_gaussian.TruncatedGaussian',
'mba_perc': 'copulas.univariate.gamma.GammaUnivariate',
'placed': 'copulas.univariate.gamma.GammaUnivariate',
'gender#0': 'copulas.univariate.gaussian.GaussianUnivariate',
'gender#1': 'copulas.univariate.gaussian.GaussianUnivariate',
'high_spec#0': 'copulas.univariate.gaussian.GaussianUnivariate',
'high_spec#1': 'copulas.univariate.gamma.GammaUnivariate',
'high_spec#2': 'copulas.univariate.gaussian.GaussianUnivariate',
'degree_type#0': 'copulas.univariate.student_t.StudentTUnivariate',
'degree_type#1': 'copulas.univariate.gaussian.GaussianUnivariate',
'degree_type#2': 'copulas.univariate.gaussian.GaussianUnivariate',
'mba_spec#0': 'copulas.univariate.gamma.GammaUnivariate',
'mba_spec#1': 'copulas.univariate.gaussian.GaussianUnivariate',
'salary#0': 'copulas.univariate.gamma.GammaUnivariate',
'salary#1': 'copulas.univariate.gaussian.GaussianUnivariate',
'start_date#0': 'copulas.univariate.gamma.GammaUnivariate',
'start_date#1': 'copulas.univariate.gaussian.GaussianUnivariate',
'end_date#0': 'copulas.univariate.gamma.GammaUnivariate',
'end_date#1': 'copulas.univariate.gaussian.GaussianUnivariate',
'duration#0': 'copulas.univariate.gaussian.GaussianUnivariate',
'duration#1': 'copulas.univariate.gaussian.GaussianUnivariate',
'duration#2': 'copulas.univariate.student_t.StudentTUnivariate',
'duration#3': 'copulas.univariate.gaussian.GaussianUnivariate'}
```

Note

In this list we will see multiple distributions for some of the columns that we have in our data. This is because the RDT transformations used to encode the data numerically often use more than one column to represent each one of the input variables.

Let’s explore the individual distribution of one of the columns in our
data to better understand how the `GaussianCopula` processed them and
see if we can improve the results by manually specifying a different
distribution. For example, let’s explore the `experience_years` column
by looking at the frequency of its values within the original data:

```
In [39]: data.experience_years.value_counts()
Out[39]:
0 141
1 65
2 8
3 1
Name: experience_years, dtype: int64
In [40]: data.experience_years.hist();
```

By observing the data we can see that the behavior of the values in this column is very similar to a Gamma or even some types of Beta distribution, where the majority of the values are 0 and the frequency decreases as the values increase.

Was the `GaussianCopula` able to capture this distribution on its own?

```
In [41]: distributions['experience_years']
Out[41]: 'copulas.univariate.gaussian.GaussianUnivariate'
```

It seems that it was not: it judged the behavior to be closer to a Gaussian distribution. As a result, the generated data contains negative values, which are invalid for this column:

```
In [42]: new_data.experience_years.value_counts()
Out[42]:
0 99
1 85
-1 13
2 3
Name: experience_years, dtype: int64
In [43]: new_data.experience_years.hist();
```
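A quick standard-library check shows why a Gaussian marginal produces negative values for this column. The sketch fits a normal distribution to the original frequencies listed earlier and computes how much probability falls below -0.5, under the assumption that sampled values are rounded to the nearest integer.

```python
from statistics import NormalDist

# Frequencies of experience_years taken from the original data.
counts = {0: 141, 1: 65, 2: 8, 3: 1}
values = [value for value, count in counts.items() for _ in range(count)]

gaussian = NormalDist.from_samples(values)

# A sampled value is rounded to -1 or lower when it falls below -0.5
# (rounding to the nearest integer is an assumption of this sketch).
share_negative = gaussian.cdf(-0.5)
```

Under these assumptions roughly 6% of the sampled values would end up negative, which is in line with the share of `-1` values seen in the synthetic data.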

Let’s see how we can improve this situation by passing the
`GaussianCopula` the exact distribution that we want it to use for
this column.

### Setting distributions for individual variables

The `GaussianCopula` class offers the possibility to indicate which
distribution to use for each one of the columns in the table, in order
to solve situations like the one that we just described. In order to do
this, we need to pass a `field_distributions` argument with a `dict`
that indicates the distribution that we want to use for each column.

Possible values for the distribution argument are:

- `univariate`: Let `copulas` select the optimal univariate distribution. This may result in non-parametric models being used.
- `parametric`: Let `copulas` select the optimal univariate distribution, but restrict the selection to parametric distributions only.
- `bounded`: Let `copulas` select the optimal univariate distribution, but restrict the selection to bounded distributions only. This may result in non-parametric models being used.
- `semi_bounded`: Let `copulas` select the optimal univariate distribution, but restrict the selection to semi-bounded distributions only. This may result in non-parametric models being used.
- `parametric_bounded`: Let `copulas` select the optimal univariate distribution, but restrict the selection to parametric and bounded distributions only.
- `parametric_semi_bounded`: Let `copulas` select the optimal univariate distribution, but restrict the selection to parametric and semi-bounded distributions only.
- `gaussian`: Use a Gaussian distribution.
- `gamma`: Use a Gamma distribution.
- `beta`: Use a Beta distribution.
- `student_t`: Use a Student T distribution.
- `gaussian_kde`: Use a GaussianKDE distribution. This model is non-parametric, so using this will make `get_parameters` unusable.
- `truncated_gaussian`: Use a Truncated Gaussian distribution.

Let’s see what happens if we make the `GaussianCopula` use the
`gamma` distribution for our column.

```
In [44]: from sdv.tabular import GaussianCopula
In [45]: model = GaussianCopula(
....: primary_key='student_id',
....: field_distributions={
....: 'experience_years': 'gamma'
....: }
....: )
....:
In [46]: model.fit(data)
```

After this, we can see how the `GaussianCopula` used the indicated
distribution for the `experience_years` column:

```
In [47]: model.get_distributions()['experience_years']
Out[47]: 'copulas.univariate.gamma.GammaUnivariate'
```

And, as a result, we can see how the generated data now have a behavior which is closer to the original data and always stays within the valid values range.

```
In [48]: new_data = model.sample(len(data))
In [49]: new_data.experience_years.value_counts()
Out[49]:
0 196
1 17
2 2
Name: experience_years, dtype: int64
In [50]: new_data.experience_years.hist();
```
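The reason a Gamma marginal keeps the values valid is that its support is non-negative. The following standard-library sketch fits a Gamma distribution to the original frequencies with method-of-moments estimates, which is an assumption of this illustration and not necessarily the estimator used by `copulas`, and then draws and rounds new values.

```python
import random
from statistics import mean, variance

random.seed(0)

# Frequencies of experience_years taken from the original data.
counts = {0: 141, 1: 65, 2: 8, 3: 1}
values = [value for value, count in counts.items() for _ in range(count)]

# Method-of-moments estimates for the Gamma shape and scale
# (an assumption for this sketch; other estimators exist).
shape = mean(values) ** 2 / variance(values)
scale = variance(values) / mean(values)

gamma_samples = [random.gammavariate(shape, scale) for _ in range(len(values))]
rounded = [round(value) for value in gamma_samples]
```

Every draw is non-negative by construction, so rounding can never produce the invalid `-1` values that appeared with the Gaussian marginal.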

Note

Even though there are situations like the one shown above where manually
choosing a distribution seems to give better results, in most cases the
`GaussianCopula` will be able to find the optimal distribution on its
own, making this manual search of the marginal distributions necessary
only on rare occasions.

### Conditional Sampling

As the name implies, conditional sampling allows us to sample from a conditional
distribution using the `GaussianCopula` model, which means we can generate only
values that satisfy certain conditions. These conditional values can be passed to
the `conditions` parameter in the `sample` method either as a dataframe or a
dictionary.

If a dictionary is passed, the model will generate as many rows as requested,
all of which will satisfy the specified conditions, such as `gender = M`.

```
In [51]: conditions = {
....: 'gender': 'M'
....: }
....:
In [52]: model.sample(5, conditions=conditions)
Out[52]:
student_id gender second_perc high_perc high_spec degree_perc degree_type work_experience experience_years employability_perc mba_spec mba_perc salary placed start_date end_date duration
0 0 M 72.651376 66.487230 Commerce 74.424150 Comm&Mgmt False 0 75.440880 Mkt&Fin 61.821936 25849.913028 True 2020-04-10 2021-02-04 3.0
1 1 M 66.975467 64.790703 Commerce 62.838066 Sci&Tech False 1 71.604492 Mkt&Fin 61.700825 31159.896684 True 2020-01-06 2020-04-19 3.0
2 2 M 73.739052 65.219144 Science 68.569188 Comm&Mgmt False 0 63.904228 Mkt&Fin 65.630012 22888.220833 True 2020-01-28 2020-09-15 12.0
3 3 M 49.152340 62.286996 Science 57.748447 Others False 0 59.840260 Mkt&Fin 60.105272 NaN False NaT NaT NaN
4 4 M 58.477255 63.351342 Science 60.109075 Comm&Mgmt False 0 60.809020 Mkt&Fin 53.584537 34176.618625 True 2020-02-09 2020-07-15 3.0
```

It’s also possible to condition on multiple columns, such as
`gender = M, experience_years = 0`.

```
In [53]: conditions = {
....: 'gender': 'M',
....: 'experience_years': 0
....: }
....:
In [54]: model.sample(5, conditions=conditions)
Out[54]:
student_id gender second_perc high_perc high_spec degree_perc degree_type work_experience experience_years employability_perc mba_spec mba_perc salary placed start_date end_date duration
0 0 M 55.712659 62.498415 Commerce 59.805716 Comm&Mgmt False 0 61.996033 Mkt&HR 51.976394 24879.36516 True 2020-03-11 2020-08-26 3.0
1 1 M 57.854473 52.561250 Science 52.895935 Sci&Tech False 0 79.969430 Mkt&HR 50.558409 NaN False NaT NaT NaN
2 2 M 48.701860 52.395212 Commerce 60.060996 Comm&Mgmt False 0 50.453369 Mkt&HR 51.118056 NaN False NaT NaT NaN
3 3 M 53.711590 60.035696 Commerce 56.347530 Comm&Mgmt False 0 59.975297 Mkt&HR 58.521588 NaN False NaT NaT NaN
4 4 M 46.070352 55.996227 Commerce 60.301035 Comm&Mgmt False 0 67.119200 Mkt&HR 61.774106 NaN True NaT NaT NaN
```

The `conditions` can also be passed as a dataframe. In that case, the model
will generate one sample for each row of the dataframe, sorted in the same
order. Since the model already knows how many samples to generate, passing
it as a parameter is unnecessary. For example, if we want to generate three
samples where `gender = M` and three samples with `gender = F`, we can do the
following:

```
In [55]: import pandas as pd
In [56]: conditions = pd.DataFrame({
....: 'gender': ['M', 'M', 'M', 'F', 'F', 'F'],
....: })
....:
In [57]: model.sample(conditions=conditions)
Out[57]:
student_id gender second_perc high_perc high_spec degree_perc degree_type work_experience experience_years employability_perc mba_spec mba_perc salary placed start_date end_date duration
0 0 M 70.818890 69.342371 Commerce 67.548007 Comm&Mgmt False 0 72.276878 Mkt&Fin 56.892232 36970.365953 True 2020-01-07 2020-07-25 12.0
1 1 M 49.922037 66.670714 Science 69.971812 Comm&Mgmt False 0 77.018489 Mkt&Fin 65.067512 NaN False NaT NaT NaN
2 2 M 45.419798 64.168603 Science 57.754671 Comm&Mgmt False 0 80.573616 Mkt&HR 60.313179 NaN False NaT NaT NaN
3 3 F 87.593852 92.957886 Science 79.076768 Comm&Mgmt False 0 90.939686 Mkt&Fin 69.235631 31000.803506 True 2020-01-10 2020-07-12 3.0
4 4 F 74.269341 67.334386 Commerce 67.513388 Comm&Mgmt False 0 55.237230 Mkt&HR 62.589396 25389.699012 True 2020-09-13 2020-08-17 3.0
5 5 F 70.281451 63.549215 Commerce 62.878898 Comm&Mgmt False 0 80.286559 Mkt&Fin 69.414610 NaN False NaT NaT NaN
```

`GaussianCopula` also supports conditioning on continuous values, as long as the
values are within the range of seen numbers. For example, if all the values of a
column in the dataset are between 0 and 1, `GaussianCopula` will not be able to
set this value to 1000.

```
In [58]: conditions = {
....: 'degree_perc': 70.0
....: }
....:
In [59]: model.sample(5, conditions=conditions)
Out[59]:
student_id gender second_perc high_perc high_spec degree_perc degree_type work_experience experience_years employability_perc mba_spec mba_perc salary placed start_date end_date duration
0 0 F 61.883342 62.337618 Science 70.0 Comm&Mgmt False 0 63.140252 Mkt&HR 64.516206 29776.417801 True 2020-01-20 2020-08-01 3.0
1 1 F 85.021481 78.270952 Science 70.0 Sci&Tech False 0 57.910443 Mkt&HR 70.524223 27145.663946 True 2020-07-21 2020-11-11 3.0
2 2 M 69.163341 62.547299 Science 70.0 Sci&Tech False 0 59.390953 Mkt&Fin 65.931951 32271.970695 True 2020-03-22 2020-10-10 3.0
3 3 M 60.440550 46.592372 Science 70.0 Sci&Tech False 0 81.656548 Mkt&Fin 58.761903 27897.376907 True 2020-04-03 2020-06-26 3.0
4 4 M 47.121051 63.963479 Science 70.0 Comm&Mgmt False 0 63.894390 Mkt&Fin 60.215117 32288.365572 True 2020-01-17 2020-06-13 3.0
```

Note

Currently, conditional sampling works through a rejection sampling process,
where rows are sampled repeatedly until one that satisfies the conditions is
found. If you run into a `Could not get enough valid rows within x trials`
error, or simply wish to optimize the results, there are three parameters
that can be fine-tuned: `max_rows_multiplier`, `max_retries` and `float_rtol`.
More information about these parameters can be found in the API section.
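Rejection sampling itself is easy to sketch with the standard library. In the example below, `draw_row` is a hypothetical unconditional sampler standing in for the fitted model, and the arguments `max_retries` and `float_rtol` only borrow their names from the parameters mentioned above; their exact semantics in SDV are not reproduced here, and `max_rows_multiplier` is not modelled at all.

```python
import math
import random

random.seed(0)


def draw_row():
    """Hypothetical unconditional sampler standing in for the fitted model."""
    return {
        'gender': random.choice(['M', 'F']),
        'experience_years': random.choice([0, 0, 0, 1, 2]),
    }


def matches(row, conditions, float_rtol=0.01):
    for column, expected in conditions.items():
        value = row[column]
        if isinstance(expected, float):
            # Floats are compared with a relative tolerance.
            if not math.isclose(value, expected, rel_tol=float_rtol):
                return False
        elif value != expected:
            return False
    return True


def sample_conditional(num_rows, conditions, max_retries=100):
    accepted = []
    for _ in range(max_retries):
        # Draw a batch and keep only the rows that satisfy the conditions.
        batch = [draw_row() for _ in range(num_rows)]
        accepted.extend(row for row in batch if matches(row, conditions))
        if len(accepted) >= num_rows:
            return accepted[:num_rows]
    raise ValueError(
        f'Could not get enough valid rows within {max_retries} trials'
    )


conditional_rows = sample_conditional(5, {'gender': 'M', 'experience_years': 0})
```

The sketch also shows why rare conditions are expensive: the less likely a condition is under the unconditional model, the more batches are rejected before enough valid rows are collected.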

### How do I specify constraints?

If you look closely at the data you may notice that some properties were
not completely captured by the model. For example, you may have seen
that sometimes the model produces an `experience_years` number greater
than `0` while also indicating that `work_experience` is `False`.
These types of properties are what we call `Constraints` and can also
be handled using `SDV`. For further details about them please visit
the Handling Constraints guide.
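Before reaching for constraints, it can help to measure how often the property is violated. The sketch below checks the rule described above on a few hand-written rows; the rows and the helper name `is_consistent` are made up for illustration and do not use the SDV constraints API.

```python
# Hand-written example rows (not taken from the model output).
rows = [
    {'work_experience': False, 'experience_years': 0},
    {'work_experience': False, 'experience_years': 1},  # violates the rule
    {'work_experience': True, 'experience_years': 2},
]


def is_consistent(row):
    """experience_years may only be positive when work_experience is True."""
    return row['work_experience'] or row['experience_years'] == 0


violations = [row for row in rows if not is_consistent(row)]
```

Running the same check on a sampled table gives a simple count of inconsistent rows, which can be compared before and after a constraint is added.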

### Can I evaluate the Synthetic Data?

A very common question when someone starts using **SDV** to generate
synthetic data is: *“How good is the data that I just generated?”*

In order to answer this question, **SDV** has a collection of metrics
and tools that allow you to compare the *real* data that you provided and
the *synthetic* data that you generated using **SDV** or any other tool.

You can read more about this in the Synthetic Data Evaluation guide.
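As a rough idea of what such a comparison can look like, the sketch below implements a two-sample Kolmogorov-Smirnov statistic with the standard library and turns it into a score between 0 and 1 for a single numerical column. This is only one possible metric, chosen for illustration; it is not claimed to be what SDV computes, and the toy columns are randomly generated.

```python
import bisect
import random

random.seed(0)


def ks_statistic(real, synthetic):
    """Largest gap between the empirical CDFs of two samples."""
    real_sorted, synthetic_sorted = sorted(real), sorted(synthetic)
    largest_gap = 0.0
    for value in real_sorted + synthetic_sorted:
        cdf_real = bisect.bisect_right(real_sorted, value) / len(real_sorted)
        cdf_synth = bisect.bisect_right(synthetic_sorted, value) / len(synthetic_sorted)
        largest_gap = max(largest_gap, abs(cdf_real - cdf_synth))
    return largest_gap


# Toy columns: one synthetic column matches the real one, the other is shifted.
real_column = [random.gauss(65.0, 10.0) for _ in range(200)]
good_synthetic = [random.gauss(65.0, 10.0) for _ in range(200)]
poor_synthetic = [random.gauss(90.0, 10.0) for _ in range(200)]

# Higher is better: 1.0 means the two empirical distributions coincide.
good_score = 1.0 - ks_statistic(real_column, good_synthetic)
poor_score = 1.0 - ks_statistic(real_column, poor_synthetic)
```

A per-column score like this only looks at marginal distributions, so it says nothing about whether relationships between columns were preserved.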