tgan.data module¶
Data related functionalities.
This module contains the tools to prepare the data, from the raw CSV files to the DataFlow objects that will be used to fit our models.
-
class
tgan.data.
MultiModalNumberTransformer
(num_modes=5)[source]¶ Bases:
object
Reversible transform for multimodal data.
To effectively sample values from a multimodal distribution, we cluster values of a numerical variable using a sklearn.mixture.GaussianMixture model (GMM).
We train a GMM with n components for each numerical variable \(C_i\). GMM models a distribution with a weighted sum of n Gaussian distributions. The means and standard deviations of the n Gaussian distributions are \(\eta^{(1)}_{i}, ..., \eta^{(n)}_{i}\) and \(\sigma^{(1)}_{i}, ..., \sigma^{(n)}_{i}\).
We compute the probability of \(c_{i,j}\) coming from each of the n Gaussian distributions as a vector \(u^{(1)}_{i,j}, ..., u^{(n)}_{i,j}\). \(u_{i,j}\) is a normalized probability distribution over n Gaussian distributions.
We normalize \(c_{i,j}\) as \(v_{i,j} = (c_{i,j} - \eta^{(k)}_{i}) / (2\sigma^{(k)}_{i})\), where \(k = \arg\max_k u^{(k)}_{i,j}\). We then clip \(v_{i,j}\) to [−0.99, 0.99].
Then we use \(u_i\) and \(v_i\) to represent \(c_i\). For simplicity, we cluster all the numerical features, i.e. both uni-modal and multi-modal features are clustered to n = 5 Gaussian distributions.
The simplification is fair because GMM automatically weighs n components. For example, if a variable has only one mode and fits some Gaussian distribution, then GMM will assign a very low probability to n − 1 components and only 1 remaining component actually works, which is equivalent to not clustering this feature.
- Parameters
num_modes (int) – Number of modes on given data.
-
num_modes
¶ Number of components in the sklearn.mixture.GaussianMixture model.
- Type
int
-
static
inverse_transform
(data, info)[source]¶ Reverse the clustering of values.
- Parameters
data (numpy.ndarray) – Transformed data to restore.
info (dict) – Metadata.
- Returns
Values in the original space.
- Return type
numpy.ndarray
-
transform
(data, *args, **kwargs)[source]¶ Cluster values using a sklearn.mixture.GaussianMixture model.
- Parameters
data (numpy.ndarray) – Values to cluster in array of shape (n,1).
- Returns
Tuple containing the features, probabilities, averages and stds of the given data.
- Return type
tuple[numpy.ndarray, numpy.ndarray, list, list]
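The normalization described for this class can be sketched with NumPy alone. This is a minimal illustration, not the library's implementation: the function names `normalize_with_modes` and `denormalize_with_modes` are hypothetical, and the means, standard deviations and weights are passed in by hand as stand-ins for what a fitted sklearn.mixture.GaussianMixture would provide. The GMM fit itself is omitted.

```python
import numpy as np


def normalize_with_modes(values, means, stds, weights):
    """Hypothetical helper: return (v, u) for values of shape (n, 1).

    means, stds and weights have shape (m,) and are assumed to come from
    an already fitted Gaussian mixture with m components.
    """
    values = np.asarray(values, dtype=float)
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    weights = np.asarray(weights, dtype=float)

    # Weighted Gaussian density of every value under every component, shape (n, m).
    z = (values - means) / stds
    dens = weights * np.exp(-0.5 * z ** 2) / (stds * np.sqrt(2.0 * np.pi))

    # Normalize each row so that u[j] is a probability vector over components.
    u = dens / dens.sum(axis=1, keepdims=True)

    # k is the most likely component; v = (c - eta_k) / (2 * sigma_k), then clip.
    k = u.argmax(axis=1)
    v = (values[:, 0] - means[k]) / (2.0 * stds[k])
    v = np.clip(v, -0.99, 0.99)
    return v.reshape(-1, 1), u


def denormalize_with_modes(v, u, means, stds):
    """Hypothetical inverse: map (v, u) back to the original space."""
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    k = np.asarray(u).argmax(axis=1)
    restored = np.asarray(v, dtype=float)[:, 0] * 2.0 * stds[k] + means[k]
    return restored.reshape(-1, 1)
```

Values that fall outside the clipping range are not recovered exactly by the inverse, and a value extremely far from every component could underflow to a zero density in this simplified version.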
-
class
tgan.data.
Preprocessor
(continuous_columns=None, metadata=None)[source]¶ Bases:
object
Transform human-readable data into TGAN numerical features and back.
- Parameters
continuous_columns (list) – List of columns to be considered continuous.
metadata (dict) – Metadata to initialize the object.
-
continuous_columns
¶ Same as constructor argument.
- Type
list
-
metadata
¶ Information about the transformations applied to the data and its format.
- Type
dict
-
continous_transformer
¶ Transformer for columns in continuous_columns.
-
categorical_transformer
¶ Transformer for categorical columns.
- Type
CategoricalTransformer
-
columns
¶ List of column labels.
- Type
list
-
fit
(data)[source]¶ Initialize the internal state of the object using data.
- Parameters
data (pandas.DataFrame) – Data to fit the object.
-
fit_transform
(data, fitting=True)[source]¶ Transform human-readable data into TGAN numerical features.
- Parameters
data (pandas.DataFrame) – Data to transform.
fitting (bool) – Whether or not to update self.metadata.
- Returns
Model features
- Return type
pandas.DataFrame
-
class
tgan.data.
RandomZData
(shape)[source]¶ Bases:
tensorpack.dataflow.base.DataFlow
Random dataflow.
- Parameters
shape (tuple) – Shape of the array to return on get_data().
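As a rough illustration of what a random dataflow with a fixed shape produces, the sketch below uses a plain Python generator instead of a tensorpack DataFlow. The name `random_z_stream`, the `seed` argument and the choice of a standard normal distribution are assumptions made for this example, not details confirmed by the documentation above. Each datapoint is wrapped in a list, following the tensorpack convention that a datapoint is a list of components.

```python
import numpy as np


def random_z_stream(shape, seed=None):
    """Hypothetical stand-in for a random dataflow: yield noise arrays forever."""
    rng = np.random.default_rng(seed)
    while True:
        # One datapoint with a single component of the requested shape.
        # Standard normal noise is an assumption of this sketch.
        yield [rng.normal(0.0, 1.0, size=shape)]


# Usage: draw one datapoint of shape (4, 2).
stream = random_z_stream((4, 2), seed=0)
z = next(stream)[0]
```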
-
class
tgan.data.
TGANDataFlow
(data, metadata, shuffle=True)[source]¶ Bases:
tensorpack.dataflow.base.RNGDataFlow
Subclass of tensorpack.RNGDataFlow prepared to work with numpy.ndarray.
-
shuffle
¶ Whether or not to shuffle the data.
- Type
bool
-
num_features
¶ Number of features in given data.
- Type
int
-
data
¶ Prepared data from filename.
- Type
list
-
distribution
¶ DeprecationWarning?
- Type
list
-
-
tgan.data.
check_inputs
(function)[source]¶ Validate inputs for functions whose first argument is a numpy.ndarray with shape (n,1).
- Parameters
function (callable) – Method to validate.
- Returns
Will check the inputs before calling function.
- Return type
callable
- Raises
ValueError – If first argument is not a valid numpy.array of shape (n, 1).
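A validation decorator of this kind could look roughly like the sketch below. It assumes the decorated callable is a method, so the array is the first argument after `self`. The names `check_inputs_sketch` and `Doubler`, and the exact error message, are hypothetical and only serve the illustration.

```python
import functools

import numpy as np


def check_inputs_sketch(function):
    """Hypothetical decorator: require `data` to be a numpy.ndarray of shape (n, 1)."""

    @functools.wraps(function)
    def decorated(self, data, *args, **kwargs):
        valid = (
            isinstance(data, np.ndarray)
            and data.ndim == 2
            and data.shape[1] == 1
        )
        if not valid:
            raise ValueError("The argument `data` must be a numpy.ndarray with shape (n, 1).")
        return function(self, data, *args, **kwargs)

    return decorated


class Doubler:
    """Toy class used only to demonstrate the decorator."""

    @check_inputs_sketch
    def transform(self, data):
        return data * 2


# Usage: a column vector passes validation, a flat array raises ValueError.
doubled = Doubler().transform(np.ones((3, 1)))
```

Using functools.wraps keeps the name and docstring of the wrapped method, which matters when the documentation is generated from docstrings.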