Benchmarking of synthetic data generators for single-table datasets is
done using the Synthetic Data Gym (SDGym),
a library from the Synthetic Data Vault Project. It offers a collection of both
real and simulated datasets to work on, Machine Learning Efficacy and Bayesian
Likelihood based metrics, and a number of classical and novel synthetic data generators
to use as baselines to compare against.
The SDGym library is not installed by default alongside sdv.
If you want to use it, install it with pip:
pip install sdgym
SDGym evaluates the performance of Synthesizers.
A Synthesizer is a Python function (or class method) that takes as input
a numpy matrix with some data, which we call the real data, and
outputs another numpy matrix with the same shape, filled with new
synthetic data that has similar mathematical properties as the real data.
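This input/output contract can be illustrated with a trivial synthesizer that simply copies the real data. The function and variable names below are illustrative; only the three-argument signature follows the synthesizer example shown later in this section:

```python
import numpy as np

def identity_synthesizer(real_data, categorical_columns, ordinal_columns):
    """Trivial synthesizer: returns an unchanged copy of the real data.

    A real synthesizer would fit a model to ``real_data`` and sample
    new rows from it; this only demonstrates the shape contract.
    """
    return real_data.copy()

real = np.random.rand(100, 5)  # a fake "real" data matrix
synthetic = identity_synthesizer(real, categorical_columns=[], ordinal_columns=[])
print(synthetic.shape)  # (100, 5), same shape as the input
```

Any function with this shape-preserving behavior can be handed to the benchmark.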
Apart from the benchmark functionality and the SDV Tabular synthesizers, SDGym
implements a collection of Synthesizers which are either custom demo synthesizers
or re-implementations of synthesizers that have been presented in other publications.
More details about the implemented synthesizers can be found in the SDGym Synthesizers
documentation.
SDGym evaluates the performance of Synthetic Data Generators
using datasets that fall into three families:
Simulated data generated using Gaussian Mixtures
Simulated data generated using Bayesian Networks
Real world datasets
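For intuition, sampling from a Gaussian Mixture amounts to picking a component at random and then drawing from that component's Gaussian. The sketch below uses illustrative parameters of my own choosing, not the ones SDGym's simulated datasets were actually built with:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative two-component mixture in 2 dimensions.
weights = np.array([0.3, 0.7])              # component probabilities
means = np.array([[0.0, 0.0], [5.0, 5.0]])  # per-component means
stds = np.array([1.0, 0.5])                 # per-component std deviations

n_samples = 1000
# Choose a component for each row, then sample from that component's Gaussian.
components = rng.choice(len(weights), size=n_samples, p=weights)
data = rng.normal(loc=means[components], scale=stds[components, None])
print(data.shape)  # (1000, 2)
```

Bayesian Network simulation follows the same idea, but samples each column conditionally on its parents in the network.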
Further details about how these datasets were generated can be found in
the Modeling Tabular Data using Conditional
GAN paper and in the SDGym datasets documentation.
This is a summary of the current SDGym leaderboard showing the number
of datasets in which each Synthesizer obtained the best score.
Detailed leaderboard results for all the releases are available in this
The easiest and recommended way to install SDGym is using
pip install sdgym
This will pull and install the latest stable release from PyPI.
All you need to do in order to use the SDGym Benchmark is to import
sdgym and call its run function, passing it your synthesizer
function and the settings that you want to use for the evaluation.
For example, if we want to evaluate a simple synthesizer function on the
adult dataset we can execute:
import numpy as np
import sdgym

def my_synthesizer_function(real_data, categorical_columns, ordinal_columns):
    """Dummy synthesizer that just returns a permutation of the real data."""
    return np.random.permutation(real_data)

scores = sdgym.run(synthesizers=my_synthesizer_function, datasets=['adult'])
The output of the sdgym.run function will be a pd.DataFrame
containing the results obtained by your synthesizer on each dataset, as
well as the results obtained previously by the SDGym synthesizers:
adult/accuracy adult/f1 ... ring/test_likelihood
IndependentSynthesizer 0.56530 0.134593 ... -1.958888
UniformSynthesizer 0.39695 0.273753 ... -2.519416
IdentitySynthesizer 0.82440 0.659250 ... -1.705487
... ... ... ... ...
my_synthesizer_function 0.64865 0.210103 ... -1.964966
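Since the output is a regular pandas DataFrame, it can be sliced and ranked with standard pandas operations. The frame below is a mock built from the sample values in the table above, not a real sdgym.run result:

```python
import pandas as pd

# Mock scores frame mimicking the structure of the sdgym.run output above.
scores = pd.DataFrame(
    {
        'adult/accuracy': [0.56530, 0.39695, 0.82440, 0.64865],
        'adult/f1': [0.134593, 0.273753, 0.659250, 0.210103],
    },
    index=['IndependentSynthesizer', 'UniformSynthesizer',
           'IdentitySynthesizer', 'my_synthesizer_function'],
)

# Rank synthesizers by accuracy on the adult dataset.
best = scores['adult/accuracy'].idxmax()
print(best)  # IdentitySynthesizer
```

Higher is better for accuracy and f1 columns, while the test_likelihood columns are log-likelihoods, so values closer to zero are better.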
If you want to run the SDGym benchmark on the SDGym Synthesizers you can
directly pass the corresponding class, or a list of classes, to the
synthesizers argument.
For example, if you want to run the complete benchmark suite to evaluate
all the existing synthesizers you can run:
import sdgym
from sdgym.synthesizers import (
    CLBNSynthesizer, CTGANSynthesizer, IdentitySynthesizer, IndependentSynthesizer,
    MedganSynthesizer, PrivBNSynthesizer, TableganSynthesizer, TVAESynthesizer,
)

all_synthesizers = [
    CLBNSynthesizer, CTGANSynthesizer, IdentitySynthesizer, IndependentSynthesizer,
    MedganSynthesizer, PrivBNSynthesizer, TableganSynthesizer, TVAESynthesizer,
]
scores = sdgym.run(synthesizers=all_synthesizers)
This will take A LOT of time to run on a single machine!
For further details about all the arguments and possibilities that the
benchmark function offers, please refer to the SDGym benchmark documentation.