tgan.data module

Data related functionalities.

This module contains the tools to prepare the data, from the raw CSV files to the DataFlow objects that will be used to fit our models.

class tgan.data.MultiModalNumberTransformer(num_modes=5)[source]

Bases: object

Reversible transform for multimodal data.

To effectively sample values from a multimodal distribution, we cluster the values of a numerical variable using a sklearn.mixture.GaussianMixture model (GMM).

  • We train a GMM with n components for each numerical variable \(C_i\). GMM models a distribution with a weighted sum of n Gaussian distributions. The means and standard deviations of the n Gaussian distributions are \({\eta}^{(1)}_{i}, ..., {\eta}^{(n)}_{i}\) and \({\sigma}^{(1)}_{i}, ...,{\sigma}^{(n)}_{i}\).

  • We compute the probability of \(c_{i,j}\) coming from each of the n Gaussian distributions as a vector \({u}^{(1)}_{i,j}, ..., {u}^{(n)}_{i,j}\). \(u_{i,j}\) is a normalized probability distribution over the n Gaussian distributions.

  • We normalize \(c_{i,j}\) as \(v_{i,j} = (c_{i,j} - {\eta}^{(k)}_{i}) / (2{\sigma}^{(k)}_{i})\), where \(k = \arg\max_k {u}^{(k)}_{i,j}\). We then clip \(v_{i,j}\) to \([-0.99, 0.99]\).

Then we use \(u_i\) and \(v_i\) to represent \(c_i\). For simplicity, we cluster all the numerical features, i.e. both uni-modal and multi-modal features are clustered to n = 5 Gaussian distributions.

The simplification is fair because the GMM automatically weighs the n components. For example, if a variable has only one mode and fits some Gaussian distribution, then the GMM will assign a very low probability to n - 1 components, so only the 1 remaining component is effectively used, which is equivalent to not clustering this feature.
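The three steps above can be traced for a single value with a hand-specified mixture. The sketch below is illustrative only: the weights, means and standard deviations are made-up stand-ins for what a fitted GMM would provide, and `gaussian_pdf` and `normalize_value` are hypothetical helpers, not part of tgan.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def normalize_value(c, weights, means, stds):
    """Return (v, u) for one value c under a given Gaussian mixture.

    u is the normalized probability of c coming from each component;
    v is c normalized by the most likely component, then clipped.
    Hypothetical helper for illustration; not part of tgan.
    """
    likelihoods = [w * gaussian_pdf(c, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(likelihoods)
    u = [value / total for value in likelihoods]

    k = max(range(len(u)), key=u.__getitem__)  # arg max over components
    v = (c - means[k]) / (2.0 * stds[k])
    v = max(-0.99, min(0.99, v))  # clip to [-0.99, 0.99]
    return v, u


# Made-up two-component mixture standing in for a fitted GMM.
weights = [0.5, 0.5]
means = [0.0, 10.0]
stds = [1.0, 2.0]

# 11.0 is closest to the second component, so v = (11 - 10) / (2 * 2) = 0.25.
v, u = normalize_value(11.0, weights, means, stds)
```

A value far from every mean, such as 20.0 with the mixture above, normalizes to 2.5 and is therefore clipped to 0.99.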

Parameters

num_modes (int) – Number of modes on given data.

num_modes

Number of components in the sklearn.mixture.GaussianMixture model.

Type

int

static inverse_transform(data, info)[source]

Reverse the clustering of values.

Parameters
  • data (numpy.ndarray) – Transformed data to restore.

  • info (dict) – Metadata.

Returns

Values in the original space.

Return type

numpy.ndarray
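As a rough illustration of the reversal, the sketch below inverts the normalization \(v = (c - \eta^{(k)}) / (2\sigma^{(k)})\) for a single row. It assumes that a row holds the normalized value followed by the component probabilities, and that `info` carries `means` and `stds` lists. Both layouts are assumptions made for this example, not something this reference specifies, and `restore_value` is a hypothetical name.

```python
def restore_value(row, info):
    """Map one transformed row back to the original space.

    Assumed layout: row[0] is the normalized value v and row[1:] are the
    component probabilities u; info holds per-component means and stds.
    """
    v, probs = row[0], row[1:]
    k = max(range(len(probs)), key=probs.__getitem__)  # most likely component
    return v * 2.0 * info["stds"][k] + info["means"][k]


# Made-up metadata for a two-component mixture.
info = {"means": [0.0, 10.0], "stds": [1.0, 2.0]}

# v = 0.25 under the second component maps back to 0.25 * 2 * 2 + 10 = 11.0.
restored = restore_value([0.25, 0.01, 0.99], info)
```

Values that were clipped to the [-0.99, 0.99] range during the forward transform cannot be recovered exactly by this reversal.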

transform(data, *args, **kwargs)[source]

Cluster values using a sklearn.mixture.GaussianMixture model.

Parameters

data (numpy.ndarray) – Values to cluster in array of shape (n,1).

Returns

Tuple containing the features, probabilities, means and standard deviations of the given data.

Return type

tuple[numpy.ndarray, numpy.ndarray, list, list]

class tgan.data.Preprocessor(continuous_columns=None, metadata=None)[source]

Bases: object

Transform back and forth human-readable data into TGAN numerical features.

Parameters
  • continuous_columns (list) – List of columns to be considered continuous.

  • metadata (dict) – Metadata to initialize the object.

continuous_columns

Same as constructor argument.

Type

list

metadata

Information about the transformations applied to the data and its format.

Type

dict

continous_transformer

Transformer for columns in continuous_columns

Type

MultiModalNumberTransformer

categorical_transformer

Transformer for categorical columns.

Type

CategoricalTransformer

columns

List of column labels.

Type

list

fit(data)[source]

Initialize the internal state of the object using data.

Parameters

data (pandas.DataFrame) – Data to fit the object.

fit_transform(data, fitting=True)[source]

Transform human-readable data into TGAN numerical features.

Parameters
  • data (pandas.DataFrame) – Data to transform.

  • fitting (bool) – Whether or not to update self.metadata.

Returns

Model features

Return type

pandas.DataFrame

reverse_transform(data)[source]

Transform TGAN numerical features back into human-readable data.

Parameters

data (pandas.DataFrame) – Data to transform.

Returns

Human-readable data.

Return type

pandas.DataFrame

transform(data)[source]

Transform the given dataframe without generating new metadata.

Parameters

data (pandas.DataFrame) – Data to transform.

class tgan.data.RandomZData(shape)[source]

Bases: tensorpack.dataflow.base.DataFlow

Random dataflow.

Parameters

shape (tuple) – Shape of the array to return on get_data()

get_data()[source]

Yield random normal vectors of shape shape.

class tgan.data.TGANDataFlow(data, metadata, shuffle=True)[source]

Bases: tensorpack.dataflow.base.RNGDataFlow

Subclass of tensorpack.RNGDataFlow prepared to work with numpy.ndarray.

shuffle

Whether or not to shuffle the data.

Type

bool

metadata

Metadata for the given data.

Type

dict

num_features

Number of features in given data.

Type

int

data

Prepared data from filename.

Type

list

distribution

DeprecationWarning?

Type

list

get_data()[source]

Yield the rows from data.

Yields

tuple – Row of data.

size()[source]

Return the number of rows in data.

Returns

Number of rows in data.

Return type

int
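The behaviour described by get_data() and size() can be pictured with a plain-Python stand-in. The sketch below does not use tensorpack and is not the TGANDataFlow implementation; `RowFlow` and its `seed` argument are hypothetical names introduced only to show rows being yielded as tuples, optionally in shuffled order.

```python
import random


class RowFlow:
    """Hypothetical stand-in that yields rows of a dataset as tuples."""

    def __init__(self, rows, shuffle=True, seed=None):
        self.rows = list(rows)
        self.shuffle = shuffle
        self.rng = random.Random(seed)  # private generator, like an RNG-backed flow

    def size(self):
        """Return the number of rows."""
        return len(self.rows)

    def get_data(self):
        """Yield each row once, in shuffled order when shuffle is set."""
        order = list(range(self.size()))
        if self.shuffle:
            self.rng.shuffle(order)
        for index in order:
            yield tuple(self.rows[index])


rows = [[1, 2], [3, 4], [5, 6]]
ordered = list(RowFlow(rows, shuffle=False).get_data())
shuffled = list(RowFlow(rows, shuffle=True, seed=0).get_data())
```

With shuffle disabled the rows come back in their original order; with shuffle enabled each row still appears exactly once per pass.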

tgan.data.check_inputs(function)[source]

Validate inputs for functions whose first argument is a numpy.ndarray with shape (n,1).

Parameters

function (callable) – Method to validate.

Returns

Wrapped function that checks its inputs before calling function.

Return type

callable

Raises

ValueError – If the first argument is not a valid numpy.ndarray of shape (n, 1).
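A decorator of this kind might look like the sketch below. It is an assumption-laden illustration rather than the tgan implementation: `check_column` and `Doubler` are hypothetical names, and the sketch assumes the decorated callable is a method, so that the validated array is the first argument after `self`.

```python
import functools

import numpy as np


def check_column(function):
    """Hypothetical decorator that validates an (n, 1) array argument."""

    @functools.wraps(function)
    def wrapper(self, data, *args, **kwargs):
        valid = isinstance(data, np.ndarray) and data.ndim == 2 and data.shape[1] == 1
        if not valid:
            raise ValueError("The given data must be a numpy.ndarray of shape (n, 1).")
        return function(self, data, *args, **kwargs)

    return wrapper


class Doubler:
    """Toy class used only to exercise the decorator."""

    @check_column
    def transform(self, data):
        return data * 2


# A column vector of shape (2, 1) passes validation.
doubled = Doubler().transform(np.array([[1.0], [2.0]]))
```

Passing a one-dimensional array such as `np.array([1.0, 2.0])` to `Doubler().transform` raises ValueError before the wrapped method runs.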

tgan.data.check_metadata(metadata)[source]

Check that the given metadata has correct types for all its members.

Parameters

metadata (dict) – Description of the inputs.

Returns

None

Raises

AssertionError – If any of the details is not valid.

tgan.data.load_demo_data(name, header=None)[source]

Fetch, load and prepare a dataset.

If name is one of the demo datasets

Parameters
  • name (str) – Name or path of the dataset.

  • header() – Header parameter when executing pandas.read_csv