Danger: You are looking at the documentation for an older version of the SDV! We are no longer supporting or maintaining this version of the software. Click here to go to the new docs pages.
sdv.tabular.copulagan.CopulaGAN
Combination of GaussianCopula transformation and GANs.
This model extends the CTGAN model to add the flexibility of the GaussianCopula transformations provided by the GaussianNormalizer from RDT.
Overall, the fitting process consists of the following steps:
Transform each non-categorical variable from the input data using a GaussianNormalizer:
If not specified, determine the distribution of each variable in the input dataset.
Transform each variable to a standard normal space by applying the CDF of the corresponding distribution and then the inverse CDF of a standard normal distribution.
Fit CTGAN with the transformed table.
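The forward transformation in the steps above can be sketched with the Python standard library alone. This is a minimal illustration, not the GaussianNormalizer implementation: it assumes an exponential distribution as a stand-in for the fitted marginal (chosen only because its CDF has a closed form), and the column values and the helper names `fit_exponential` and `to_standard_normal` are hypothetical.

```python
import math
from statistics import NormalDist


def fit_exponential(values):
    """Estimate the rate of an exponential distribution by maximum likelihood."""
    return len(values) / sum(values)


def to_standard_normal(values, rate):
    """Apply the fitted CDF, then the inverse CDF of a standard normal."""
    std_normal = NormalDist(mu=0.0, sigma=1.0)
    eps = 1e-9  # keep probabilities strictly inside (0, 1) for inv_cdf
    transformed = []
    for x in values:
        u = 1.0 - math.exp(-rate * x)      # exponential CDF
        u = min(max(u, eps), 1.0 - eps)    # clamp to the open interval
        transformed.append(std_normal.inv_cdf(u))
    return transformed


waiting_times = [0.2, 0.5, 1.1, 1.9, 3.4]  # hypothetical numerical column
rate = fit_exponential(waiting_times)
z = to_standard_normal(waiting_times, rate)
```

Because both the CDF and the inverse normal CDF are monotone, the transformed values keep the ordering of the original ones, and the median of the fitted distribution maps to 0.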
And the process of sampling is:
Sample using CTGAN
Reverse the previous transformation by applying the CDF of a standard normal distribution and then the inverse CDF of the distribution that corresponds to each variable.
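The reverse step can be sketched in the same spirit. As before, this is an illustration under assumptions rather than the library's code: an exponential marginal with a hypothetical rate stands in for the fitted distribution, and the `forward` helper is included only to check the round trip.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def forward(x, rate):
    """Original space -> standard normal space (fitting direction)."""
    u = 1.0 - math.exp(-rate * x)       # exponential CDF
    return STD_NORMAL.inv_cdf(u)


def reverse(z, rate):
    """Standard normal space -> original space (sampling direction)."""
    u = STD_NORMAL.cdf(z)               # CDF of a standard normal
    return -math.log(1.0 - u) / rate    # inverse CDF of the exponential


rate = 0.7                       # hypothetical fitted parameter
sampled_z = [-1.0, 0.0, 1.5]     # stand-in for values sampled in normal space
recovered = [reverse(z, rate) for z in sampled_z]
```

Applying `reverse` to the output of `forward` returns the original value up to floating-point error, which is what lets sampled data be mapped back to the original scale.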
The arguments of this model are the same as for CTGAN, except for two additional arguments, field_distributions and default_distribution, which make it possible to define specific transformations for individual fields and to choose the distribution used by default when no specific distribution has been selected.
Distributions can be passed as a copulas univariate instance or as one of the following string values:
gaussian: Use a Gaussian distribution.
gamma: Use a Gamma distribution.
beta: Use a Beta distribution.
student_t: Use a Student T distribution.
gaussian_kde: Use a GaussianKDE distribution. This model is non-parametric, so using this will make get_parameters unusable.
truncated_gaussian: Use a Truncated Gaussian distribution.
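As a sketch of how the two distribution arguments might be passed, the keyword arguments can be prepared as a plain dictionary. The column names `age` and `income` are hypothetical; only the keyword names and the string values come from the list above. The `build_model` helper is an assumption for illustration and needs the legacy SDV release that provides `sdv.tabular`.

```python
# String values accepted for distributions, as listed in this page.
SUPPORTED = {
    "gaussian", "gamma", "beta", "student_t",
    "gaussian_kde", "truncated_gaussian",
}

# Hypothetical column names mapped to the distribution to use for each one.
distribution_kwargs = {
    "field_distributions": {
        "age": "beta",
        "income": "gamma",
    },
    # Used for every field not listed in field_distributions.
    "default_distribution": "gaussian",
}


def build_model(kwargs):
    """Create the model; requires the legacy SDV release with sdv.tabular."""
    from sdv.tabular import CopulaGAN
    return CopulaGAN(**kwargs)
```

Calling `build_model(distribution_kwargs)` would then create a model that fits `age` with a Beta distribution, `income` with a Gamma distribution, and every other numerical field with a Gaussian.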
field_names (list[str]) – List of names of the fields that need to be modeled and included in the generated output data. Any additional fields found in the data will be ignored and will not be included in the generated output. If None, all the fields found in the data are used.
field_types (dict[str, dict]) – Dictionary specifying the data types and subtypes of the fields that will be modeled. Field type and subtype combinations must be compatible with the SDV Metadata Schema.
field_transformers (dict[str, str]) – Dictionary specifying which transformers to use for each field. Available transformers are:
FloatFormatter: Uses a FloatFormatter for numerical data.
FrequencyEncoder: Uses a FrequencyEncoder without gaussian noise.
FrequencyEncoder_noised: Uses a FrequencyEncoder adding gaussian noise.
OneHotEncoder: Uses a OneHotEncoder.
LabelEncoder: Uses a LabelEncoder without gaussian noise.
LabelEncoder_noised: Uses a LabelEncoder adding gaussian noise.
BinaryEncoder: Uses a BinaryEncoder.
UnixTimestampEncoder: Uses a UnixTimestampEncoder.
anonymize_fields (dict[str, str]) – Dict specifying which fields to anonymize and what faker category they belong to.
primary_key (str) – Name of the field which is the primary key of the table.
constraints (list[Constraint, dict]) – List of Constraint objects or dicts.
table_metadata (dict or metadata.Table) – Table metadata instance or dict representation. If given alongside any other metadata-related arguments, an exception will be raised. If not given at all, it will be built using the other arguments or learned from the data.
log_frequency (boolean) – Whether to use log frequency of categorical levels in conditional sampling. Defaults to True.
embedding_dim (int) – Size of the random sample passed to the Generator. Defaults to 128.
generator_dim (tuple or list of ints) – Size of the output samples for each one of the Residuals. A Residual Layer will be created for each one of the values provided. Defaults to (256, 256).
discriminator_dim (tuple or list of ints) – Size of the output samples for each one of the Discriminator Layers. A Linear Layer will be created for each one of the values provided. Defaults to (256, 256).
batch_size (int) – Number of data samples to process in each step.
verbose (bool) – Whether to print fit progress on stdout. Defaults to False.
epochs (int) – Number of training epochs. Defaults to 300.
cuda (bool or str) – If True, use CUDA. If a str, use the indicated device. If False, do not use CUDA at all.
field_distributions (dict) – Optionally specify a dictionary that maps the name of each field to the distribution that must be used in it. Fields that are not specified in the input dict will be modeled using the default distribution. Defaults to None.
default_distribution (copulas.univariate.Univariate or str) – Distribution to use on the fields for which no specific distribution has been given. Defaults to truncated_gaussian.
learn_rounding_scheme (bool) – Whether the FloatFormatter should learn which decimal place to round to from the data seen during fit. If True, the data returned by reverse_transform will be rounded to that place. Defaults to True.
enforce_min_max_values (bool) – Specify whether or not to clip the data returned by reverse_transform of the numerical transformer, FloatFormatter, to the min and max values seen during fit. Defaults to True.
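The effect of the last two parameters can be illustrated with a simplified sketch. It is not the FloatFormatter implementation: the helpers `learn_bounds_and_decimals` and `postprocess` are hypothetical, and counting decimal places through the string form of each value is an assumption made to keep the example short.

```python
from decimal import Decimal


def learn_bounds_and_decimals(values):
    """Record the min, max and largest number of decimal places seen."""
    decimals = max(
        max(-Decimal(str(v)).as_tuple().exponent, 0) for v in values
    )
    return min(values), max(values), decimals


def postprocess(samples, low, high, decimals,
                enforce_min_max_values=True, learn_rounding_scheme=True):
    """Clip and round sampled values the way the two flags describe."""
    out = []
    for s in samples:
        if enforce_min_max_values:
            s = min(max(s, low), high)   # clip to the range seen during fit
        if learn_rounding_scheme:
            s = round(s, decimals)       # round to the learned place
        out.append(s)
    return out


training = [1.25, 3.5, 2.75]             # hypothetical training column
low, high, decimals = learn_bounds_and_decimals(training)
cleaned = postprocess([0.9, 2.3333, 4.2], low, high, decimals)
```

With both flags enabled, values outside the training range are pulled back to its bounds and every value is rounded to two decimal places, the finest precision present in the hypothetical training column.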
__init__
Initialize self. See help(type(self)) for accurate signature.
Methods
__init__([field_names, field_types, …])
Initialize self.
fit(data)
Fit this model to the data.
get_distributions()
Get the marginal distributions used by this CopulaGAN.
get_metadata()
Get metadata about the table.
get_parameters()
Get the parameters learned from the data.
load(path)
Load a TabularModel instance from a given path.
sample(num_rows[, randomize_samples, …])
Sample rows from this table.
sample_conditions(conditions[, …])
Sample rows from this table with the given conditions.
sample_remaining_columns(known_columns[, …])
save(path)
Save this model instance to the given path using cloudpickle.
set_parameters(parameters)
Regenerate a previously learned model from its parameters.
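A typical workflow with the methods listed above might look like the following sketch. It assumes the legacy SDV release that provides `sdv.tabular` is installed and that `data` is a pandas DataFrame; the wrapper `train_and_sample` and its argument names are hypothetical.

```python
def train_and_sample(data, model_path, num_rows=100):
    """Fit a CopulaGAN, save it, reload it and sample new rows.

    Requires the legacy SDV release that provides sdv.tabular.
    """
    from sdv.tabular import CopulaGAN

    # epochs and verbose are set to their documented defaults.
    model = CopulaGAN(epochs=300, verbose=False)
    model.fit(data)

    # Inspect the marginal distributions used for the numerical fields.
    print(model.get_distributions())

    # Persist the fitted model and load it back from disk.
    model.save(model_path)
    loaded = CopulaGAN.load(model_path)

    return loaded.sample(num_rows)
```

Importing inside the function keeps the module importable even where the legacy package is not installed; the import error would only surface when the function is called.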
Attributes
DEFAULT_DISTRIBUTION