Date: Jul 22, 2022 Version: 0.16.0
The Synthetic Data Vault (SDV) is an ecosystem of synthetic data generation
libraries. It lets users learn models of single-table, multi-table and timeseries
datasets and then generate new synthetic data that has the same format and
statistical properties as the original dataset.
Synthetic data can then be used to supplement, augment and in some cases
replace real data when training Machine Learning models. Additionally,
it enables the testing of Machine Learning or other data dependent
software systems without the risk of exposure that comes with data disclosure.
Under the hood, SDV uses several probabilistic graphical modeling and
deep learning based techniques. To enable a variety of data storage
structures, we employ unique hierarchical generative modeling and
recursive sampling techniques.
Synthetic data generators for single-table datasets with the following features:
Using Copulas and Deep Learning based models.
Handling of multiple data types and missing data with minimal user input.
Support for pre-defined and custom constraints and data validation.
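As a rough illustration of the copula idea behind the single-table models, the sketch below fits a two-column Gaussian copula using only the Python standard library. It is not SDV's implementation: the function names (`fit_copula`, `sample_copula`, and the helpers) and the toy age/income data are assumptions made for this example. Each column is mapped to normal scores through its empirical distribution, the correlation of those scores is estimated, and new rows are drawn by reversing the mapping.

```python
import random
from bisect import bisect_right
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()  # standard normal, used for the copula transform


def empirical_cdf(sorted_values, x):
    """Rank of x in the column, squeezed into (0, 1) so inv_cdf stays finite."""
    return (bisect_right(sorted_values, x) - 0.5) / len(sorted_values)


def empirical_quantile(sorted_values, u):
    """Inverse of the empirical CDF: pick the observed value at quantile u."""
    idx = min(int(u * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[idx]


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def fit_copula(rows):
    """Hypothetical helper: learn the marginals and the normal-score correlation."""
    raw_columns = list(zip(*rows))
    sorted_columns = [sorted(col) for col in raw_columns]
    scores = [
        [STD_NORMAL.inv_cdf(empirical_cdf(sorted_col, v)) for v in raw_col]
        for sorted_col, raw_col in zip(sorted_columns, raw_columns)
    ]
    return sorted_columns, pearson(scores[0], scores[1])


def sample_copula(model, num_rows, rng):
    """Draw correlated normal scores and map them back through the marginals."""
    sorted_columns, rho = model
    rows = []
    for _ in range(num_rows):
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1)
        u1, u2 = STD_NORMAL.cdf(z1), STD_NORMAL.cdf(z2)
        rows.append((empirical_quantile(sorted_columns[0], u1),
                     empirical_quantile(sorted_columns[1], u2)))
    return rows


# Toy "real" table with two correlated columns: (age, income).
rng = random.Random(0)
real = []
for _ in range(500):
    age = rng.randint(20, 65)
    real.append((age, round(1000 * age + rng.gauss(0, 5000))))

model = fit_copula(real)
synthetic = sample_copula(model, 500, rng)
```

Because the sampled values come from the observed marginals, the synthetic rows keep the original format, while the copula carries over the dependence between the two columns.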
Synthetic data generators for complex, multi-table, relational datasets
with the following features:
Definition of entire multi-table datasets metadata with a custom
and flexible JSON schema.
Using Copulas and recursive modeling techniques.
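To make the idea of recursive sampling over a multi-table schema more concrete, here is a minimal sketch that uses only the standard library. The JSON layout below is a hypothetical schema invented for this example, not SDV's actual metadata format, and `sample_table` is an illustrative helper. It covers only the key bookkeeping (each sampled parent row triggers sampling of its child rows, which inherit the parent's key) and leaves out the statistical modeling of column values.

```python
import json
import random

# Hypothetical metadata: two tables linked by a parent/child relationship.
metadata = json.loads("""
{
  "tables": {
    "users": {"primary_key": "user_id"},
    "sessions": {"primary_key": "session_id",
                 "parent": "users",
                 "foreign_key": "user_id"}
  }
}
""")


def children_of(metadata, parent):
    """Names of the tables that declare `parent` as their parent table."""
    return [name for name, spec in metadata["tables"].items()
            if spec.get("parent") == parent]


def sample_table(metadata, table, num_rows, rng, out, parent_key=None):
    """Sample `num_rows` rows of `table`, then recurse into its child tables."""
    spec = metadata["tables"][table]
    rows = out.setdefault(table, [])
    for _ in range(num_rows):
        row = {spec["primary_key"]: len(rows)}  # next unused primary key
        if parent_key is not None:
            row[spec["foreign_key"]] = parent_key  # link back to the parent row
        rows.append(row)
        for child in children_of(metadata, table):
            # Assumption: the number of children per parent is uniform in 0..3.
            sample_table(metadata, child, rng.randint(0, 3), rng, out,
                         parent_key=row[spec["primary_key"]])
    return out


rng = random.Random(0)
synthetic = sample_table(metadata, "users", 5, rng, {})
```

Every row in `synthetic["sessions"]` references an existing row in `synthetic["users"]`, so referential integrity holds by construction.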
Synthetic data generators for multi-type, multi-variate timeseries datasets
with the following features:
Using statistical, Autoregressive and Deep Learning models.
Conditional sampling based on contextual attributes.
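The combination of an autoregressive model and conditional sampling on contextual attributes can be sketched with a simple AR(1) process. This is a stand-in chosen for brevity, not the model SDV uses; the helper names and the "small"/"large" context values are assumptions made for this example. One AR(1) model is fitted per context value, and new sequences are generated for a requested context.

```python
import random
from statistics import mean


def fit_ar1(sequences):
    """Least-squares fit of x[t] = c + phi * x[t-1] + noise over all sequences."""
    prev = [s[t - 1] for s in sequences for t in range(1, len(s))]
    curr = [s[t] for s in sequences for t in range(1, len(s))]
    mean_prev, mean_curr = mean(prev), mean(curr)
    phi = (sum((p - mean_prev) * (c - mean_curr) for p, c in zip(prev, curr))
           / sum((p - mean_prev) ** 2 for p in prev))
    c0 = mean_curr - phi * mean_prev
    residuals = [c - (c0 + phi * p) for p, c in zip(prev, curr)]
    sigma = (sum(r * r for r in residuals) / len(residuals)) ** 0.5
    return {"c": c0, "phi": phi, "sigma": sigma,
            "start": mean(s[0] for s in sequences)}


def fit_by_context(data):
    """data is a list of (context, sequence); fit one model per context value."""
    groups = {}
    for context, sequence in data:
        groups.setdefault(context, []).append(sequence)
    return {context: fit_ar1(seqs) for context, seqs in groups.items()}


def sample_sequence(models, context, length, rng):
    """Generate a new sequence conditioned on the given context value."""
    m = models[context]
    sequence = [m["start"]]
    while len(sequence) < length:
        sequence.append(m["c"] + m["phi"] * sequence[-1] + rng.gauss(0, m["sigma"]))
    return sequence


rng = random.Random(0)


def make_series(level, n):
    """Toy 'real' series that fluctuates around `level`."""
    x = [level]
    for _ in range(n - 1):
        x.append(level + 0.5 * (x[-1] - level) + rng.gauss(0, 1))
    return x


data = ([("small", make_series(10, 50)) for _ in range(5)]
        + [("large", make_series(100, 50)) for _ in range(5)])
models = fit_by_context(data)
synthetic_small = sample_sequence(models, "small", 30, rng)
synthetic_large = sample_sequence(models, "large", 30, rng)
```

Sequences sampled for the "large" context stay near the level learned for that context, which is the behavior conditional sampling is meant to provide.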
Metrics for Synthetic Data Evaluation, including:
An easy-to-use Evaluation Framework to evaluate the quality of your synthetic
data with a single line of code.
Metrics for multiple data modalities, including Single Table Metrics and
Multi Table Metrics.
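To give a sense of what a single-table quality metric can look like, the sketch below scores one numerical column as one minus the two-sample Kolmogorov-Smirnov statistic, so that 1.0 means the real and synthetic distributions coincide and 0.0 means they do not overlap. The function names are assumptions for this example; this is not SDV's Evaluation Framework or its API.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of two samples."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def similarity_score(real, synthetic):
    """Hypothetical column-level score in [0, 1]; higher means more similar."""
    return 1.0 - ks_statistic(real, synthetic)


real_column = [1, 2, 3, 4, 5, 6, 7, 8]
close_column = [1, 2, 3, 4, 5, 6, 7, 9]
far_column = [101, 102, 103, 104, 105, 106, 107, 108]

close_score = similarity_score(real_column, close_column)
far_score = similarity_score(real_column, far_column)
```

A full evaluation would typically aggregate scores like this across columns and add metrics for other data types and modalities.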
A Benchmarking Framework to easily compare multiple synthetic data generators, including:
Dozens of ready-to-use datasets covering multiple data modalities.
Tools to easily add new synthetic data generators and datasets.
Distributed computing to reduce computing times.
Comprehensive results presented in multiple leaderboard formats.
If you want to be part of the SDV community to receive announcements of the latest releases,
ask questions, suggest new features or participate in the development meetings, please join
our Slack Workspace!
The Synthetic Data Vault Project was first created at MIT’s Data to AI Lab in 2016. After 4 years of research and traction with enterprise, we
created DataCebo in 2020 with the goal of growing the project.
Today, DataCebo is the proud developer of SDV, the largest ecosystem for synthetic data generation
& evaluation. It is home to multiple libraries that support synthetic data, including:
🔄 Data discovery & transformation. Reverse the transforms to reproduce realistic data.
🧠 Multiple machine learning models – ranging from Copulas to Deep Learning – to create tabular,
multi table and time series data.
📊 Measuring quality and privacy of synthetic data, and comparing different synthetic data generators.
Get started using the SDV package – a fully
integrated solution and your one-stop shop for synthetic data. Or, use the standalone libraries
for specific needs.