Try the new SDV 1.0 Beta! We are transitioning to a new version of SDV with improved workflows, new features and an updated documentation site. Click here to go to the new docs pages.
Date: Mar 13, 2023 Version: 0.18.0
The Synthetic Data Vault (SDV) is an ecosystem of Synthetic Data Generation libraries that lets users learn single-table, multi-table and timeseries datasets and then generate new Synthetic Data with the same format and statistical properties as the original dataset.
Synthetic data can then be used to supplement, augment and, in some cases, replace real data when training Machine Learning models. It also enables the testing of Machine Learning or other data-dependent software systems without the risk of exposure that comes with data disclosure.
Under the hood, SDV uses several probabilistic graphical modeling and deep learning based techniques. To support a variety of data storage structures, it employs hierarchical generative modeling and recursive sampling techniques.
Synthetic data generators for single table datasets with the following features:
Using Copulas and Deep Learning based models.
Handling of multiple data types and missing data with minimum user input.
Support for pre-defined and custom constraints and data validation.
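To give a feel for what a copula-based single-table generator does, here is a minimal, standard-library-only sketch of the Gaussian copula idea for two numerical columns. It is not SDV's implementation or API: the function names (`to_normal_scores`, `fit_and_sample`, and so on) and the toy `ages`/`incomes` columns are hypothetical, and the marginals are modeled with plain empirical quantiles as a simplifying assumption.

```python
import math
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used to move between scales


def to_normal_scores(values):
    """Map a column to standard-normal scores through its empirical ranks."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, idx in enumerate(order):
        # (rank + 0.5) / n keeps the probability strictly inside (0, 1).
        scores[idx] = _STD_NORMAL.inv_cdf((rank + 0.5) / n)
    return scores


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def empirical_quantile(sorted_values, p):
    """Return the observed value at probability p of the empirical distribution."""
    idx = min(int(p * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[idx]


def fit_and_sample(col_a, col_b, num_rows, seed=0):
    """Learn the dependence between two columns and sample new rows."""
    rng = random.Random(seed)
    # "Fit": estimate the correlation of the columns on the normal scale.
    rho = pearson(to_normal_scores(col_a), to_normal_scores(col_b))
    residual_scale = math.sqrt(max(0.0, 1.0 - rho ** 2))
    sorted_a, sorted_b = sorted(col_a), sorted(col_b)
    rows = []
    for _ in range(num_rows):
        # "Sample": draw correlated normals, then map back to each marginal.
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + residual_scale * rng.gauss(0, 1)
        rows.append((
            empirical_quantile(sorted_a, _STD_NORMAL.cdf(z1)),
            empirical_quantile(sorted_b, _STD_NORMAL.cdf(z2)),
        ))
    return rows


# Toy "real" table with two correlated columns (hypothetical data).
ages = [20 + i for i in range(50)]
incomes = [1000 * age + (i % 7) * 500 for i, age in enumerate(ages)]
synthetic_rows = fit_and_sample(ages, incomes, num_rows=200, seed=1)
```

Because this sketch maps samples back through empirical quantiles, it can only reproduce values already seen in the real columns. A full generator would typically fit parametric marginal distributions instead, and would also handle categorical columns and missing data.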
Synthetic data generators for complex, multi-table, relational datasets with the following features:
Definition of the metadata for entire multi-table datasets with a custom, flexible JSON schema.
Using Copulas and recursive modeling techniques.
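As a rough illustration of how a JSON schema can describe a multi-table dataset, the sketch below builds a two-table description as a Python dict, serializes it, and lists the parent/child relationships it encodes. The table names, field names and key names (`tables`, `fields`, `primary_key`, `ref`) are assumptions made for this example; refer to the User Guides for the schema SDV actually expects.

```python
import json

# Hypothetical two-table dataset: each session belongs to exactly one user.
metadata = {
    "tables": {
        "users": {
            "primary_key": "user_id",
            "fields": {
                "user_id": {"type": "id", "subtype": "integer"},
                "country": {"type": "categorical"},
                "age": {"type": "numerical", "subtype": "integer"},
            },
        },
        "sessions": {
            "primary_key": "session_id",
            "fields": {
                "session_id": {"type": "id", "subtype": "integer"},
                "user_id": {
                    "type": "id",
                    "subtype": "integer",
                    # Foreign key pointing back at the parent table.
                    "ref": {"table": "users", "field": "user_id"},
                },
                "started_at": {"type": "datetime", "format": "%Y-%m-%d"},
            },
        },
    }
}


def foreign_keys(meta):
    """List (child_table, child_field, parent_table, parent_field) tuples."""
    relations = []
    for table_name, table in meta["tables"].items():
        for field_name, field in table["fields"].items():
            ref = field.get("ref")
            if ref is not None:
                relations.append(
                    (table_name, field_name, ref["table"], ref["field"])
                )
    return relations


# The same description as a JSON document that could be stored on disk.
metadata_json = json.dumps(metadata, indent=4)
relations = foreign_keys(metadata)
```

Keeping the relationships inside the metadata is what lets a recursive modeling approach walk from parent tables to their children when learning and sampling.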
Synthetic data generators for multi-type, multi-variate timeseries datasets with the following features:
Using statistical, Autoregressive and Deep Learning models.
Conditional sampling based on contextual attributes.
Metrics for Synthetic Data Evaluation, including:
An easy-to-use Evaluation Framework to evaluate the quality of your synthetic data with a single line of code.
Metrics for multiple data modalities, including single_table_metrics and multi_table_metrics.
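To make the idea of a single-table quality metric concrete, here is a minimal sketch of one common statistical-similarity score: the complement of the two-sample Kolmogorov-Smirnov statistic for a numerical column. The function name `ks_complement` and the toy columns are hypothetical, and this is not the SDV evaluation API; it only illustrates the kind of score such a framework might report.

```python
from bisect import bisect_right


def ks_complement(real, synthetic):
    """Return 1 - KS statistic; 1.0 means identical empirical distributions."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    # The empirical CDFs only change at observed values, so checking
    # those points is enough to find the largest gap between them.
    points = sorted(set(real_sorted) | set(synth_sorted))
    max_gap = 0.0
    for x in points:
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return 1.0 - max_gap


# Hypothetical real and synthetic versions of the same numerical column.
real_column = [1, 2, 3, 4]
synthetic_column = [3, 4, 5, 6]
score = ks_complement(real_column, synthetic_column)
```

A score close to 1.0 suggests the synthetic column follows the real column's distribution closely, while a score close to 0.0 suggests the two barely overlap.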
A Benchmarking Framework to easily compare multiple synthetic data generators, including:
Dozens of datasets of multiple data modalities already prepared to be run on.
Tools to easily add new synthetic data generators and datasets.
Distributed computing to reduce computing times.
Comprehensive results presented in multiple leaderboard formats.
If you want to quickly discover SDV, simply click the button below and follow the tutorials!
If you want to be part of the SDV community to receive announcements of the latest releases, ask questions, suggest new features or participate in the development meetings, please join our Slack Workspace!
Getting Started
User Guides
API Reference
Developer Guides
Release Notes
The Synthetic Data Vault Project was first created at MIT’s Data to AI Lab in 2016. After four years of research and traction with enterprises, we created DataCebo in 2020 with the goal of growing the project. Today, DataCebo is the proud developer of SDV, the largest ecosystem for synthetic data generation and evaluation. It is home to multiple libraries that support synthetic data, including:
🔄 Data discovery & transformation. Reverse the transforms to reproduce realistic data.
🧠 Multiple machine learning models, ranging from Copulas to Deep Learning, to create tabular, multi-table and time series data.
📊 Measuring quality and privacy of synthetic data, and comparing different synthetic data generation models.
Get started using the SDV package – a fully integrated solution and your one-stop shop for synthetic data. Or, use the standalone libraries for specific needs.