Synthetic Data for Machine Learning¶
In this tutorial, we’ll demonstrate how to generate a synthetic copy of the classic Boston housing prices dataset. We will train a simple linear model on the synthetic data and demonstrate that the model’s performance is competitive not just on the synthetic dataset but also the real dataset.
Loading the dataset¶
The Boston housing prices dataset is available through sklearn
. We’ll import it here and divide it into a train/test set.
[1]:
import warnings
warnings.filterwarnings('ignore')
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
Generating synthetic data¶
Next, we’ll use a Gaussian copula to generate a synthetic training set. This simulates a scenario where a company may be unwilling to share the real dataset but is willing to release a synthetic copy which preserves many of the real dataset’s properties for researchers to use.
[2]:
import numpy as np
from copulas.multivariate import GaussianMultivariate
def create_synthetic(X, y):
"""
This function combines X and y into a single dataset D, models it
using a Gaussian copula, and generates a synthetic dataset S. It
returns the new, synthetic versions of X and y.
"""
dataset = np.concatenate([X, np.expand_dims(y, 1)], axis=1)
model = GaussianMultivariate()
model.fit(dataset)
synthetic = model.sample(len(dataset))
X = synthetic.values[:, :-1]
y = synthetic.values[:, -1]
return X, y
X_synthetic, y_synthetic = create_synthetic(X_train, y_train)
Training a linear model¶
Now we can train a simple linear model using the synthetic dataset.
[3]:
from sklearn.linear_model import ElasticNet
model = ElasticNet()
model.fit(X_synthetic, y_synthetic)
Now, we can take this model - which is trained on the synthetic training set - and evaluate it’s performance on the real test set.
[4]:
print(model.score(X_test, y_test))
0.574182672682398
For comparison, here’s a model that’s trained on the real training set.
[5]:
model = ElasticNet()
model.fit(X_train, y_train);
print(model.score(X_test, y_test))
0.5949580051414429
The two models perform similarly on the real test set, suggesting that our Gaussian copula has successfully captured the statistical properties of the dataset that are important for solving this regression problem.