Synthetic Data for Machine Learning

In this tutorial, we’ll demonstrate how to generate a synthetic copy of the diabetes dataset that ships with scikit-learn. We will train a simple linear model on the synthetic data and show that the model performs competitively on the real dataset, not just on the synthetic one.

Loading the dataset

The diabetes dataset is available through sklearn. We’ll import it here and split it into training and test sets.

[9]:
import warnings

warnings.filterwarnings('ignore')

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

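# Load the features and the target as NumPy arrays.
X, y = load_diabetes(return_X_y=True)
# Default split: 75% train, 25% test. No random_state is set, so the split
# (and the scores reported below) will vary from run to run.
X_train, X_test, y_train, y_test = train_test_split(X, y)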

Generating synthetic data

Next, we’ll use a Gaussian copula to generate a synthetic training set. This simulates a scenario where a company is unwilling to share the real dataset but is willing to release a synthetic copy that preserves many of the real dataset’s properties for researchers to use.

[10]:
import numpy as np

from copulas.multivariate import GaussianMultivariate

def create_synthetic(X, y):
    """
    This function combines X and y into a single dataset D, models it
    using a Gaussian copula, and generates a synthetic dataset S. It
    returns the new, synthetic versions of X and y.
    """
    dataset = np.concatenate([X, np.expand_dims(y, 1)], axis=1)

    model = GaussianMultivariate()
    model.fit(dataset)

    synthetic = model.sample(len(dataset))

    X = synthetic.values[:, :-1]
    y = synthetic.values[:, -1]

    return X, y

X_synthetic, y_synthetic = create_synthetic(X_train, y_train)
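
To give a rough idea of what a Gaussian copula does, here is a minimal sketch using only NumPy and the standard library. It is not the implementation used by the copulas library: the function name gaussian_copula_sample is hypothetical, and the sketch assumes empirical (rank-based) marginals rather than fitted parametric distributions. The idea is to map each column to normal scores, estimate the correlation between those scores, sample new correlated normals, and map them back through each column's empirical quantiles.

```python
from statistics import NormalDist

import numpy as np

# Vectorized standard normal CDF and inverse CDF from the standard library.
_STD_NORMAL = NormalDist()
_ppf = np.vectorize(_STD_NORMAL.inv_cdf)
_cdf = np.vectorize(_STD_NORMAL.cdf)


def gaussian_copula_sample(data, n_samples, rng):
    """Hypothetical helper: sample rows that mimic `data` via a Gaussian copula."""
    data = np.asarray(data, dtype=float)
    n_rows, n_cols = data.shape

    # Step 1: rank each column and rescale the ranks into the open interval (0, 1).
    ranks = np.argsort(np.argsort(data, axis=0), axis=0) + 1
    uniform = ranks / (n_rows + 1)

    # Step 2: turn the uniform values into standard normal scores.
    normal_scores = _ppf(uniform)

    # Step 3: the correlation of the normal scores captures the dependence structure.
    corr = np.corrcoef(normal_scores, rowvar=False)

    # Step 4: draw new correlated normals and map them back to (0, 1).
    z_new = rng.multivariate_normal(np.zeros(n_cols), corr, size=n_samples)
    u_new = _cdf(z_new)

    # Step 5: invert each column's empirical CDF using quantiles of the real data.
    columns = [np.quantile(data[:, j], u_new[:, j]) for j in range(n_cols)]
    return np.column_stack(columns)


# Toy data (an assumption for illustration): two correlated columns, one skewed.
rng = np.random.default_rng(0)
real = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=1000)
real[:, 1] = np.exp(real[:, 1])
fake = gaussian_copula_sample(real, 1000, rng)

print("real correlation:", np.corrcoef(real, rowvar=False)[0, 1])
print("fake correlation:", np.corrcoef(fake, rowvar=False)[0, 1])
```

Because the sketch inverts the empirical CDF, the sampled values always stay within the range of the real data. A library implementation that fits parametric marginals can extrapolate beyond that range.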

Training a linear model

Now we can train a simple linear model using the synthetic dataset.

[11]:
from sklearn.linear_model import ElasticNet

model = ElasticNet()
model.fit(X_synthetic, y_synthetic)
[11]:
ElasticNet()

Now we can take this model, which was trained on the synthetic training set, and evaluate its performance on the real test set. For a regressor, the score method returns the coefficient of determination (R²).

[12]:
print(model.score(X_test, y_test))
0.010323153621473069
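
For reference, the value printed by score for a scikit-learn regressor is the coefficient of determination, R² = 1 - SS_res / SS_tot. The sketch below recomputes it by hand with NumPy; the helper name r_squared and the toy numbers are assumptions for illustration only.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Hypothetical helper: coefficient of determination, 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Residual sum of squares: error of the model's predictions.
    ss_res = np.sum((y_true - y_pred) ** 2)
    # Total sum of squares: error of always predicting the mean of y_true.
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Perfect predictions give 1.0; always predicting the mean gives 0.0.
print(r_squared([1, 2, 3], [1, 2, 3]))
print(r_squared([1, 2, 3], [2, 2, 2]))
```

An R² close to zero therefore means the model does only slightly better than predicting the mean of the test targets.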

For comparison, here’s a model that’s trained on the real training set.

[13]:
model = ElasticNet()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
0.008700399284224392

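The two models perform similarly on the real test set (R² of about 0.010 and 0.009), suggesting that our Gaussian copula has captured the statistical properties of the dataset that matter for this regression problem. Both scores are close to zero in absolute terms, so the result shows parity between the two training sets rather than strong predictive performance.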
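
One way to check the "statistical properties" claim more directly is to compare simple summary statistics of the real and synthetic data. The following is a minimal sketch under that assumption; compare_summary is a hypothetical helper, not part of sklearn or copulas, and it only looks at means, standard deviations, and pairwise correlations.

```python
import numpy as np


def compare_summary(real, synthetic):
    """Hypothetical helper: largest absolute gaps in basic summary statistics."""
    real = np.asarray(real, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)
    # Gap between column means.
    mean_gap = np.abs(real.mean(axis=0) - synthetic.mean(axis=0))
    # Gap between column standard deviations.
    std_gap = np.abs(real.std(axis=0) - synthetic.std(axis=0))
    # Gap between the pairwise correlation matrices.
    corr_gap = np.abs(
        np.corrcoef(real, rowvar=False) - np.corrcoef(synthetic, rowvar=False)
    )
    return {
        "mean": float(mean_gap.max()),
        "std": float(std_gap.max()),
        "corr": float(corr_gap.max()),
    }


# Toy check (illustrative data): shifting every value by 1 changes the means
# but leaves the spread and the correlations essentially untouched.
rng = np.random.default_rng(0)
toy = rng.normal(size=(200, 3))
print(compare_summary(toy, toy + 1.0))
```

In this tutorial, the helper could be applied to np.column_stack([X_train, y_train]) and np.column_stack([X_synthetic, y_synthetic]); small gaps would support the conclusion above, while large ones would point to properties the copula did not capture.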