tgan.data module

Data related functionalities.

This module contains the tools to prepare the data, from the raw CSV files to the DataFlow objects that will be used to fit our models.

class tgan.data.MultiModalNumberTransformer(num_modes=5)

Bases: object

Reversible transform for multimodal data.

To effectively sample values from a multimodal distribution, we cluster the values of a numerical variable using a sklearn.mixture.GaussianMixture model (GMM).

We train a GMM with n components for each numerical variable \(C_i\). A GMM models a distribution as a weighted sum of n Gaussian distributions. The means and standard deviations of the n Gaussian distributions are \(\eta^{(1)}_{i}, \ldots, \eta^{(n)}_{i}\) and \(\sigma^{(1)}_{i}, \ldots, \sigma^{(n)}_{i}\).
We compute the probability of \(c_{i,j}\) coming from each of the n Gaussian distributions as a vector \(u^{(1)}_{i,j}, \ldots, u^{(n)}_{i,j}\). \(u_{i,j}\) is a normalized probability distribution over the n Gaussian distributions.
We normalize \(c_{i,j}\) as \(v_{i,j} = (c_{i,j} - \eta^{(k)}_{i}) / (2\sigma^{(k)}_{i})\), where \(k = \arg\max_k u^{(k)}_{i,j}\). We then clip \(v_{i,j}\) to \([-0.99, 0.99]\).

Then we use \(u_i\) and \(v_i\) to represent \(c_i\). For simplicity, we cluster all the numerical features, i.e. both uni-modal and multi-modal features are clustered to n = 5 Gaussian distributions.

The simplification is fair because GMM automatically weighs n components. For example, if a variable has only one mode and fits some Gaussian distribution, then GMM will assign a very low probability to n − 1 components and only 1 remaining component actually works, which is equivalent to not clustering this feature.
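The normalization described above can be sketched for a single value as follows. This is a minimal illustration in plain Python, not the library's implementation: the helper names `gaussian_pdf` and `normalize_value` are hypothetical, and the mixture weights, means and standard deviations are made-up values standing in for the parameters of a fitted GMM.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def normalize_value(c, weights, means, stds):
    """Return (v, u) for one value c, given assumed GMM parameters."""
    # Weighted density of c under each Gaussian component.
    weighted = [w * gaussian_pdf(c, m, s)
                for w, m, s in zip(weights, means, stds)]
    total = sum(weighted)
    # u: normalized probability of c coming from each component.
    u = [p / total for p in weighted]
    # k: index of the most likely component.
    k = max(range(len(u)), key=u.__getitem__)
    # v: distance to the chosen mean in units of 2 * std, clipped.
    v = (c - means[k]) / (2.0 * stds[k])
    v = max(-0.99, min(0.99, v))
    return v, u


# Hypothetical two-component mixture (illustrative values only).
weights = [0.5, 0.5]
means = [0.0, 10.0]
stds = [1.0, 2.0]

v, u = normalize_value(11.0, weights, means, stds)
print(v, u)
```

Here the value 11.0 is assigned almost entirely to the second component, so it is encoded relative to that component's mean and standard deviation. Values further than about two standard deviations from the chosen mean are clipped.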

Parameters: num_modes (int) – Number of modes on given data.

num_modes

Number of components in the sklearn.mixture.GaussianMixture model.

Type: int

static inverse_transform(data, info)

Reverse the clustering of values.

Parameters

data (numpy.ndarray) – Transformed data to restore.
info (dict) – Metadata.

Returns

Values in the original space.

Return type

numpy.ndarray
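Solving the normalization formula from the class description for \(c_{i,j}\) gives \(c_{i,j} = 2\sigma^{(k)}_{i} v_{i,j} + \eta^{(k)}_{i}\). The sketch below applies that to one value. The helper `restore_value` and its arguments are hypothetical and do not mirror the actual `(data, info)` signature; the means and standard deviations are assumed to come from the fitted mixture.

```python
def restore_value(v, u, means, stds):
    """Map a normalized value v back to the original space.

    u is the probability vector over components; the component with
    the highest probability is the one used for the normalization.
    """
    # Pick the component that was used when normalizing.
    k = max(range(len(u)), key=u.__getitem__)
    # Invert v = (c - mean_k) / (2 * std_k).
    return v * 2.0 * stds[k] + means[k]


# Illustrative values only: second component selected, v = 0.25.
restored = restore_value(0.25, [0.0, 1.0], [0.0, 10.0], [1.0, 2.0])
print(restored)
```

Because the forward transform clips \(v_{i,j}\) to \([-0.99, 0.99]\), values that were clipped cannot be recovered exactly by this inversion.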

transform(data, *args, **kwargs)

Cluster values using a sklearn.mixture.GaussianMixture model.

Parameters: data (numpy.ndarray) – Values to cluster in array of shape (n,1).
Returns: Tuple containing the features, probabilities, averages and stds of the given data.
Return type: tuple[numpy.ndarray, numpy.ndarray, list, list]

class tgan.data.Preprocessor(continuous_columns=None, metadata=None)

Bases: object

Transform human-readable data into TGAN numerical features and back.

Parameters

continuous_columns (list) – List of columns to be considered continuous.
metadata (dict) – Metadata to initialize the object.

continuous_columns

Same as constructor argument.

Type: list

metadata

Information about the transformations applied to the data and its format.

Type: dict

continous_transformer

Transformer for columns in continuous_columns.

Type: MultiModalNumberTransformer

categorical_transformer

Transformer for categorical columns.

Type: CategoricalTransformer

columns

List of column labels.

Type: list

fit(data)

Initialize the internal state of the object using data.

Parameters: data (pandas.DataFrame) – Data to fit the object.

fit_transform(data, fitting=True)

Transform human-readable data into TGAN numerical features.

Parameters

data (pandas.DataFrame) – Data to transform.
fitting (bool) – Whether or not to update self.metadata.

Returns

Model features

Return type

pandas.DataFrame

reverse_transform(data)

Transform TGAN numerical features back into human-readable data.

Parameters

data (pandas.DataFrame) – Data to transform.

Returns

Human-readable data

Return type

pandas.DataFrame

transform(data)

Transform the given dataframe without generating new metadata.

Parameters: data (pandas.DataFrame) – Data to transform.

class tgan.data.RandomZData(shape)

Bases: tensorpack.dataflow.base.DataFlow

Random dataflow.

Parameters: shape (tuple) – Shape of the array to return on get_data().

get_data(): Yield random normal vectors of shape shape.

class tgan.data.TGANDataFlow(data, metadata, shuffle=True)

Bases: tensorpack.dataflow.base.RNGDataFlow

Subclass of tensorpack.RNGDataFlow prepared to work with numpy.ndarray.

shuffle

Whether or not to shuffle the data.

Type: bool

metadata

Metadata for the given data.

Type: dict

num_features

Number of features in given data.

Type: int

data

Prepared data from filename.

Type: list

distribution

DeprecationWarning?

Type: list

get_data()

Yield the rows from data.

Yields: tuple – Row of data.

size()

Return the number of rows in data.

Returns: Number of rows in data.
Return type: int

tgan.data.check_inputs(function)

Validate inputs for functions whose first argument is a numpy.ndarray with shape (n,1).

Parameters: function (callable) – Method to validate.
Returns: Will check the inputs before calling function.
Return type: callable
Raises: ValueError – If first argument is not a valid numpy.ndarray of shape (n, 1).
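
A decorator with the behaviour described here could look roughly like the sketch below. It is an assumption-laden illustration rather than the library's code: the names `check_inputs_sketch` and `column_mean` are hypothetical, and the sketch wraps a plain function whose first positional argument is the array.

```python
import functools

import numpy as np


def check_inputs_sketch(function):
    """Validate that the first argument is an ndarray of shape (n, 1)."""

    @functools.wraps(function)
    def wrapper(data, *args, **kwargs):
        is_valid = (
            isinstance(data, np.ndarray)
            and data.ndim == 2
            and data.shape[1] == 1
        )
        if not is_valid:
            raise ValueError(
                "The first argument must be a numpy.ndarray of shape (n, 1)."
            )
        # Input is valid: delegate to the wrapped function.
        return function(data, *args, **kwargs)

    return wrapper


@check_inputs_sketch
def column_mean(data):
    """Hypothetical example function operating on an (n, 1) array."""
    return float(data.mean())


print(column_mean(np.array([[1.0], [3.0]])))
```

Calling `column_mean` with a one-dimensional array, or with a list, raises ValueError before the wrapped function runs.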

tgan.data.check_metadata(metadata)

Check that the given metadata has correct types for all its members.

Parameters: metadata (dict) – Description of the inputs.
Returns: None
Raises: AssertionError – If any of the details is not valid.

tgan.data.load_demo_data(name, header=None)

Fetch, load and prepare a dataset.

If name is one of the demo datasets

Parameters

name (str) – Name or path of the dataset.
header() – Header parameter when executing pandas.read_csv