Synthetic Data for Machine Learning

In this tutorial, we’ll demonstrate how to generate a synthetic copy of the classic Boston housing prices dataset. We will then train a simple linear model on the synthetic data and show that it performs well not only on the synthetic dataset but also on the real one.

Loading the dataset

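The Boston housing prices dataset is available through sklearn. We’ll import it here and divide it into training and test sets. Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this code needs an older release to run as written.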

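import warnings

# Silence the deprecation warning that load_boston emits in scikit-learn 1.0 and 1.1.
warnings.filterwarnings("ignore")

from sklearn.datasets import load_boston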
from sklearn.model_selection import train_test_split

X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
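
By default, train_test_split shuffles the rows and holds out a quarter of them for testing, so the split changes from run to run unless a random_state is passed. The standard-library sketch below illustrates the idea on toy data. The helper simple_split and its parameters are hypothetical names introduced here for illustration; this is not scikit-learn's implementation.

```python
import math
import random


def simple_split(rows, targets, test_fraction=0.25, seed=None):
    """Shuffle row indices and hold out a fraction of them for testing.

    Hypothetical helper for illustration only: it mimics the idea behind
    train_test_split, not its exact behaviour.
    """
    if len(rows) != len(targets):
        raise ValueError("rows and targets must have the same length")

    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # a fixed seed makes the split repeatable

    n_test = math.ceil(test_fraction * len(rows))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    rows_train = [rows[i] for i in train_idx]
    rows_test = [rows[i] for i in test_idx]
    targets_train = [targets[i] for i in train_idx]
    targets_test = [targets[i] for i in test_idx]
    return rows_train, rows_test, targets_train, targets_test


# Toy data: 8 rows with one feature each; the target is twice the feature.
toy_rows = [[float(i)] for i in range(8)]
toy_targets = [2.0 * i for i in range(8)]
split = simple_split(toy_rows, toy_targets, seed=0)
```

Each row stays paired with its target because the same shuffled indices are used for both lists, which is the property the real function also guarantees for X and y.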

Generating synthetic data

Next, we’ll use a Gaussian copula to generate a synthetic training set. This simulates a scenario where a company is unwilling to share the real dataset but is willing to release a synthetic copy that preserves many of the real dataset’s properties for researchers to use.

import numpy as np

from copulas.multivariate import GaussianMultivariate

def create_synthetic(X, y):
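    """
    This function combines X and y into a single dataset D, models it
    using a Gaussian copula, and generates a synthetic dataset S. It
    returns the new, synthetic versions of X and y.
    """
    # Append y as the last column so features and target are modeled jointly.
    dataset = np.concatenate([X, np.expand_dims(y, 1)], axis=1)

    # Fit the copula to the combined dataset before sampling from it.
    model = GaussianMultivariate()
    model.fit(dataset)

    # Draw as many synthetic rows as there are real rows.
    synthetic = model.sample(len(dataset))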

    X = synthetic.values[:, :-1]
    y = synthetic.values[:, -1]

    return X, y

X_synthetic, y_synthetic = create_synthetic(X_train, y_train)
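
To give a feel for what a Gaussian copula does, here is a minimal two-column sketch that uses only the standard library. It maps each column to normal scores through its ranks, estimates the correlation of those scores, samples correlated normals, and maps them back onto the observed values. This is a simplification under stated assumptions: it uses empirical marginals and handles only two columns, so it should not be read as how the copulas library is implemented. All function names here are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()


def to_normal_scores(column):
    """Map a column to standard-normal scores through its empirical ranks."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        # rank / (n + 1) stays strictly inside (0, 1), so inv_cdf is defined.
        scores[i] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores


def correlation(a, b):
    """Pearson correlation of two equally long sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def empirical_quantile(sorted_column, u):
    """Return the observed value at quantile u (simple rank lookup)."""
    index = min(int(u * len(sorted_column)), len(sorted_column) - 1)
    return sorted_column[index]


def sample_two_column_copula(col_a, col_b, n_samples, seed=None):
    """Fit a two-column Gaussian copula and draw synthetic (a, b) pairs."""
    rng = random.Random(seed)
    rho = correlation(to_normal_scores(col_a), to_normal_scores(col_b))
    sorted_a, sorted_b = sorted(col_a), sorted(col_b)

    samples = []
    for _ in range(n_samples):
        # Draw two standard normals and correlate them with coefficient rho.
        e1, e2 = rng.gauss(0, 1), rng.gauss(0, 1)
        z1 = e1
        z2 = rho * e1 + math.sqrt(max(0.0, 1 - rho ** 2)) * e2
        # Push them through the normal CDF, then back onto each column's scale.
        a = empirical_quantile(sorted_a, STD_NORMAL.cdf(z1))
        b = empirical_quantile(sorted_b, STD_NORMAL.cdf(z2))
        samples.append((a, b))
    return rho, samples


# Toy data: b grows with a, plus some noise.
toy_rng = random.Random(0)
real_a = [toy_rng.uniform(0, 10) for _ in range(200)]
real_b = [3 * a + toy_rng.gauss(0, 2) for a in real_a]
rho, synthetic_pairs = sample_two_column_copula(real_a, real_b, 200, seed=1)
```

Because the sketch resamples observed values, each synthetic column contains only values seen in the real data, while the pairing between columns is new. A model with fitted parametric marginals can instead produce values that never appeared in the original data.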

Training a linear model

Now we can train a simple linear model using the synthetic dataset.

from sklearn.linear_model import ElasticNet

model = ElasticNet()
model.fit(X_synthetic, y_synthetic)

Now we can take this model, which was trained on the synthetic training set, and evaluate its performance on the real test set.

print(model.score(X_test, y_test))
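
For scikit-learn regressors such as ElasticNet, score returns the coefficient of determination, R² = 1 - SS_res / SS_tot. A value of 1.0 means perfect predictions, 0.0 matches always predicting the mean of the targets, and negative values are worse than that baseline. The sketch below computes the same quantity by hand on toy numbers; r_squared is a hypothetical helper written for illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: how far the predictions are from the targets.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: how far the targets are from their own mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


toy_true = [3.0, 5.0, 7.0, 9.0]

perfect = r_squared(toy_true, toy_true)           # predictions equal the targets
mean_only = r_squared(toy_true, [6.0] * 4)        # always predict the mean (6.0)
close = r_squared(toy_true, [2.5, 5.5, 6.5, 9.5])  # each prediction off by 0.5
```

Reading the printed scores in this tutorial on that scale makes the comparison between the two models easier to interpret.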

For comparison, here’s a model that’s trained on the real training set.

model = ElasticNet()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))

The two models perform similarly on the real test set, although the exact scores vary from run to run because the train/test split and the synthetic sample are both random. This suggests that the Gaussian copula has captured the statistical properties of the dataset that matter for solving this regression problem.