sdv.tabular.copulagan.CopulaGAN

class sdv.tabular.copulagan.CopulaGAN(field_names=None, field_types=None, field_transformers=None, anonymize_fields=None, primary_key=None, constraints=None, table_metadata=None, embedding_dim=128, generator_dim=(256, 256), discriminator_dim=(256, 256), generator_lr=0.0002, generator_decay=1e-06, discriminator_lr=0.0002, discriminator_decay=1e-06, batch_size=500, discriminator_steps=1, log_frequency=True, verbose=False, epochs=300, cuda=True, field_distributions=None, default_distribution=None, rounding='auto', min_value='auto', max_value='auto')
Combination of GaussianCopula transformation and GANs.
This model extends the CTGAN model to add the flexibility of the GaussianCopula transformations provided by the GaussianCopulaTransformer from RDT.
Overall, the fitting process consists of the following steps:
1. Transform each non-categorical variable from the input data using a GaussianCopulaTransformer:
   - If not specified, find the distribution that each variable in the input dataset follows.
   - Transform each variable to a standard normal space by applying the CDF of the corresponding distribution and then the inverse CDF of a standard normal distribution.
2. Fit CTGAN with the transformed table.
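The forward transformation in the fitting steps can be illustrated with a minimal, standard-library sketch. It assumes, purely for illustration, that the variable follows an exponential distribution with a known scale; the helper name `to_standard_normal` is hypothetical and is not part of SDV or RDT, which select and fit the distribution for you.

```python
import math
from statistics import NormalDist


def to_standard_normal(values, scale):
    """Map exponential-distributed values to standard normal space.

    Hypothetical helper: applies the CDF of the assumed marginal
    (exponential with the given scale), then the inverse CDF of a
    standard normal distribution.
    """
    std_normal = NormalDist(mu=0.0, sigma=1.0)
    transformed = []
    for x in values:
        # CDF of the assumed exponential marginal.
        u = 1.0 - math.exp(-x / scale)
        # Keep u strictly inside (0, 1) so the inverse CDF is defined.
        u = min(max(u, 1e-12), 1.0 - 1e-12)
        transformed.append(std_normal.inv_cdf(u))
    return transformed


values = [0.1, 0.5, 1.0, 2.0, 4.0]
z = to_standard_normal(values, scale=1.0)
```

Because both functions are monotonic, the ordering of the values is preserved, and the median of the original distribution maps to 0 in the normal space.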
And the process of sampling is:
1. Sample using CTGAN.
2. Reverse the previous transformation by applying the CDF of a standard normal distribution and then inverting the CDF of the distribution that corresponds to each variable.
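The reverse step used during sampling can be sketched in the same spirit. This example again assumes an exponential marginal with a known scale, chosen only because its inverse CDF has a closed form; `from_standard_normal` is a hypothetical name, not an SDV or RDT function.

```python
import math
from statistics import NormalDist


def from_standard_normal(z_values, scale):
    """Map standard normal values back to the assumed exponential marginal.

    Hypothetical helper: applies the standard normal CDF, then the
    inverse CDF of an exponential distribution with the given scale.
    """
    std_normal = NormalDist(mu=0.0, sigma=1.0)
    restored = []
    for z in z_values:
        # CDF of the standard normal gives a probability in (0, 1).
        u = std_normal.cdf(z)
        # Guard against log(0) for extreme inputs.
        u = min(u, 1.0 - 1e-12)
        # Inverse CDF of the exponential distribution.
        restored.append(-scale * math.log(1.0 - u))
    return restored


restored = from_standard_normal([-1.0, 0.0, 1.0], scale=2.0)
```

A value of 0 in the normal space maps back to the median of the assumed marginal, which for this exponential is `scale * ln(2)`.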
The arguments of this model are the same as for CTGAN except for two additional arguments, field_distributions and default_distribution, which make it possible to define specific transformations for individual fields as well as which distribution to use by default if no specific distribution has been selected.
Distributions can be passed as a copulas univariate instance or as one of the following string values:
univariate: Let copulas select the optimal univariate distribution. This may result in non-parametric models being used.
parametric: Let copulas select the optimal univariate distribution, but restrict the selection to parametric distributions only.
bounded: Let copulas select the optimal univariate distribution, but restrict the selection to bounded distributions only. This may result in non-parametric models being used.
semi_bounded: Let copulas select the optimal univariate distribution, but restrict the selection to semi-bounded distributions only. This may result in non-parametric models being used.
parametric_bounded: Let copulas select the optimal univariate distribution, but restrict the selection to parametric and bounded distributions only.
parametric_semi_bounded: Let copulas select the optimal univariate distribution, but restrict the selection to parametric and semi-bounded distributions only.
gaussian: Use a Gaussian distribution.
gamma: Use a Gamma distribution.
beta: Use a Beta distribution.
student_t: Use a Student T distribution.
gaussian_kde: Use a GaussianKDE distribution. This model is non-parametric, so using this will make get_parameters unusable.
truncated_gaussian: Use a Truncated Gaussian distribution.
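As a rough illustration of how these two arguments might be prepared, the sketch below checks distribution names against the string values listed above before building a model. The helper names `check_distribution_names` and `build_model`, and the field names `age` and `income`, are hypothetical; only the `CopulaGAN` class and its `field_distributions` and `default_distribution` arguments come from this page. The import is done lazily so the validation part runs without SDV installed.

```python
# String values documented for field_distributions / default_distribution.
ALLOWED_DISTRIBUTIONS = (
    "univariate", "parametric", "bounded", "semi_bounded",
    "parametric_bounded", "parametric_semi_bounded", "gaussian", "gamma",
    "beta", "student_t", "gaussian_kde", "truncated_gaussian",
)


def check_distribution_names(field_distributions, default_distribution=None):
    """Raise ValueError for any string that is not a documented name."""
    candidates = dict(field_distributions or {})
    if default_distribution is not None:
        candidates["<default>"] = default_distribution
    for field, dist in candidates.items():
        # Non-string values are assumed to be copulas univariate instances.
        if isinstance(dist, str) and dist not in ALLOWED_DISTRIBUTIONS:
            raise ValueError(f"Unknown distribution {dist!r} for {field!r}")
    return True


def build_model(field_distributions, default_distribution="parametric"):
    """Validate the names, then construct a CopulaGAN (requires sdv)."""
    from sdv.tabular.copulagan import CopulaGAN

    check_distribution_names(field_distributions, default_distribution)
    return CopulaGAN(
        field_distributions=field_distributions,
        default_distribution=default_distribution,
    )


# Hypothetical field names, for illustration only.
field_distributions = {"age": "truncated_gaussian", "income": "gamma"}
is_valid = check_distribution_names(field_distributions, "parametric")
```

Fields that are not listed in the dictionary fall back to the default distribution, so only the columns that need special treatment have to be named.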
Parameters
field_names (list[str]) – List of names of the fields that need to be modeled and included in the generated output data. Any additional fields found in the data will be ignored and will not be included in the generated output. If None, all the fields found in the data are used.
field_types (dict[str, dict]) – Dictionary specifying the data types and subtypes of the fields that will be modeled. Field type and subtype combinations must be compatible with the SDV Metadata Schema.
field_transformers (dict[str, str]) – Dictionary specifying which transformers to use for each field. Available transformers are:
integer: Uses a NumericalTransformer of dtype int.
float: Uses a NumericalTransformer of dtype float.
categorical: Uses a CategoricalTransformer without gaussian noise.
categorical_fuzzy: Uses a CategoricalTransformer adding gaussian noise.
one_hot_encoding: Uses a OneHotEncodingTransformer.
label_encoding: Uses a LabelEncodingTransformer.
boolean: Uses a BooleanTransformer.
datetime: Uses a DatetimeTransformer.
anonymize_fields (dict[str, str]) – Dict specifying which fields to anonymize and what faker category they belong to.
primary_key (str) – Name of the field which is the primary key of the table.
constraints (list[Constraint, dict]) – List of Constraint objects or dicts.
table_metadata (dict or metadata.Table) – Table metadata instance or dict representation. If given alongside any other metadata-related arguments, an exception will be raised. If not given at all, it will be built using the other arguments or learned from the data.
log_frequency (boolean) – Whether to use log frequency of categorical levels in conditional sampling. Defaults to True.
embedding_dim (int) – Size of the random sample passed to the Generator. Defaults to 128.
generator_dim (tuple or list of ints) – Size of the output samples for each one of the Residuals. A Residual Layer will be created for each one of the values provided. Defaults to (256, 256).
discriminator_dim (tuple or list of ints) – Size of the output samples for each one of the Discriminator Layers. A Linear Layer will be created for each one of the values provided. Defaults to (256, 256).
batch_size (int) – Number of data samples to process in each step.
verbose (bool) – Whether to print fit progress on stdout. Defaults to False.
epochs (int) – Number of training epochs. Defaults to 300.
cuda (bool or str) – If True, use CUDA. If a str, use the indicated device. If False, do not use CUDA at all.
field_distributions (dict) – Optionally specify a dictionary that maps the name of each field to the distribution that must be used for it. Fields that are not specified in the input dict will be modeled using the default distribution. Defaults to None.
default_distribution (copulas.univariate.Univariate or str) – Distribution to use on the fields for which no specific distribution has been given. Defaults to parametric.
rounding (int, str or None) – Define the rounding scheme for NumericalTransformer. If set to an int, values will be rounded to that number of decimal places. If None, values will not be rounded. If set to 'auto', the transformer will round to the maximum number of decimal places detected in the fitted data. Defaults to 'auto'.
min_value (int, str or None) – Specify the minimum value the NumericalTransformer should use. If an integer is given, sampled data will be greater than or equal to it. If the string 'auto' is given, the minimum will be the minimum value seen in the fitted data. If None is given, there won't be a minimum. Defaults to 'auto'.
max_value (int, str or None) – Specify the maximum value the NumericalTransformer should use. If an integer is given, sampled data will be less than or equal to it. If the string 'auto' is given, the maximum will be the maximum value seen in the fitted data. If None is given, there won't be a maximum. Defaults to 'auto'.
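The semantics described for rounding, min_value and max_value can be pictured with a small pure-Python sketch. This is an assumption-laden illustration of the documented behaviour, not the NumericalTransformer's actual code: the helpers `count_decimals` and `post_process` are hypothetical, out-of-range values are assumed to be clipped to the bounds, and decimal places are detected from the float's shortest representation (values in scientific notation are treated as having none).

```python
def count_decimals(value):
    """Count decimal places in the shortest repr of a float (simplified)."""
    text = repr(float(value))
    if "e" in text or "." not in text:
        return 0
    fraction = text.split(".")[1]
    return 0 if fraction == "0" else len(fraction)


def post_process(sampled, fitted, rounding="auto", min_value="auto",
                 max_value="auto"):
    """Apply the documented 'auto' / explicit / None options to samples."""
    # 'auto' bounds come from the fitted data; None disables the bound.
    lower = min(fitted) if min_value == "auto" else min_value
    upper = max(fitted) if max_value == "auto" else max_value
    # 'auto' rounding uses the most decimal places seen in the fitted data.
    if rounding == "auto":
        digits = max(count_decimals(v) for v in fitted)
    else:
        digits = rounding

    result = []
    for x in sampled:
        if lower is not None:
            x = max(x, lower)  # assumed clipping to the lower bound
        if upper is not None:
            x = min(x, upper)  # assumed clipping to the upper bound
        if digits is not None:
            x = round(x, digits)
        result.append(x)
    return result


fitted = [1.5, 2.25, 10.0]
sampled = [0.2, 3.14159, 12.7]
bounded = post_process(sampled, fitted)
unbounded = post_process(sampled, fitted, rounding=None,
                         min_value=None, max_value=None)
```

With the defaults, every output stays within the range seen during fitting and carries at most two decimal places; passing None for all three options leaves the samples untouched.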

__init__(field_names=None, field_types=None, field_transformers=None, anonymize_fields=None, primary_key=None, constraints=None, table_metadata=None, embedding_dim=128, generator_dim=(256, 256), discriminator_dim=(256, 256), generator_lr=0.0002, generator_decay=1e-06, discriminator_lr=0.0002, discriminator_decay=1e-06, batch_size=500, discriminator_steps=1, log_frequency=True, verbose=False, epochs=300, cuda=True, field_distributions=None, default_distribution=None, rounding='auto', min_value='auto', max_value='auto')
Initialize self. See help(type(self)) for accurate signature.
Methods
__init__([field_names, field_types, …]) – Initialize self.
fit(data) – Fit this model to the data.
Get the marginal distributions used by this CopulaGAN.
Get metadata about the table.
get_parameters() – Get the parameters learned from the data.
load(path) – Load a TabularModel instance from a given path.
sample([num_rows, max_retries, …]) – Sample rows from this table.
save(path) – Save this model instance to the given path using pickle.
set_parameters(parameters) – Regenerate a previously learned model from its parameters.
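A typical round trip through the methods listed above might look like the following sketch. It assumes that `data` is a pandas DataFrame, that `load` can be called on the class, and that SDV is installed when the function is actually called; the wrapper name `fit_sample_roundtrip` is hypothetical. The constructor arguments shown simply restate the documented defaults.

```python
def fit_sample_roundtrip(data, path, num_rows=10):
    """Fit a CopulaGAN on `data`, sample rows, then save and reload it."""
    # Imported lazily so this sketch can be read without sdv installed.
    from sdv.tabular.copulagan import CopulaGAN

    model = CopulaGAN(epochs=300, batch_size=500, verbose=False)
    model.fit(data)

    # Draw synthetic rows from the fitted model.
    synthetic = model.sample(num_rows=num_rows)

    # Persist the model with pickle and restore it from disk.
    model.save(path)
    restored = CopulaGAN.load(path)
    return synthetic, restored.sample(num_rows=num_rows)
```

Since save uses pickle, a saved model should only be loaded from a trusted source and with a compatible version of the library.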
Attributes
DEFAULT_DISTRIBUTION