Synthetic Data for Machine Learning¶
In this tutorial, we’ll demonstrate how to generate a synthetic copy of the classic Boston housing prices dataset. We will train a simple linear model on the synthetic data and demonstrate that the model’s performance is competitive not just on the synthetic dataset but also the real dataset.
Loading the dataset¶
The Boston housing prices dataset is available through
sklearn. We’ll import it here and divide it into a train/test set.
import warnings warnings.filterwarnings('ignore') from sklearn.datasets import load_boston from sklearn.model_selection import train_test_split X, y = load_boston(return_X_y=True) X_train, X_test, y_train, y_test = train_test_split(X, y)
Generating synthetic data¶
Next, we’ll use a Gaussian copula to generate a synthetic training set. This simulates a scenario where a company may be unwilling to share the real dataset but is willing to release a synthetic copy which preserves many of the real dataset’s properties for researchers to use.
import numpy as np from copulas.multivariate import GaussianMultivariate def create_synthetic(X, y): """ This function combines X and y into a single dataset D, models it using a Gaussian copula, and generates a synthetic dataset S. It returns the new, synthetic versions of X and y. """ dataset = np.concatenate([X, np.expand_dims(y, 1)], axis=1) model = GaussianMultivariate() model.fit(dataset) synthetic = model.sample(len(dataset)) X = synthetic.values[:, :-1] y = synthetic.values[:, -1] return X, y X_synthetic, y_synthetic = create_synthetic(X_train, y_train)
Training a linear model¶
Now we can train a simple linear model using the synthetic dataset.
from sklearn.linear_model import ElasticNet model = ElasticNet() model.fit(X_synthetic, y_synthetic)
Now, we can take this model - which is trained on the synthetic training set - and evaluate it’s performance on the real test set.
For comparison, here’s a model that’s trained on the real training set.
model = ElasticNet() model.fit(X_train, y_train); print(model.score(X_test, y_test))
The two models perform similarly on the real test set, suggesting that our Gaussian copula has successfully captured the statistical properties of the dataset that are important for solving this regression problem.