Danger

You are looking at the documentation for an older version of SDV. This version of the software is no longer supported or maintained.

See the new docs pages for the current version.



Date: Mar 28, 2023 Version: 0.18.0

Overview

The Synthetic Data Vault (SDV) is an ecosystem of libraries for synthetic data generation. It lets users learn a model from single-table, multi-table and time series datasets, and then generate new synthetic data that has the same format and statistical properties as the original dataset.
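The learn-then-generate workflow described above can be illustrated with a deliberately simple sketch. This is not SDV's API or implementation: the `fit` and `sample` functions and the example columns are hypothetical, only the Python standard library is used, and every column is modeled independently, so relationships between columns are ignored.

```python
import random
import statistics


def fit(rows):
    """Learn a per-column summary from a list of dict rows (toy model)."""
    model = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        is_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        )
        if is_numeric:
            # Numeric column: keep only the mean and standard deviation.
            model[column] = ("numeric", statistics.mean(values), statistics.pstdev(values))
        else:
            # Any other column: keep the observed values to preserve frequencies.
            model[column] = ("categorical", values)
    return model


def sample(model, num_rows, seed=None):
    """Generate new rows with the same columns as the data passed to fit()."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(num_rows):
        row = {}
        for column, spec in model.items():
            if spec[0] == "numeric":
                row[column] = rng.gauss(spec[1], spec[2])
            else:
                row[column] = rng.choice(spec[1])
        synthetic.append(row)
    return synthetic


# Hypothetical "real" table used only for illustration.
real_rows = [
    {"age": 34, "plan": "basic"},
    {"age": 45, "plan": "premium"},
    {"age": 29, "plan": "basic"},
    {"age": 52, "plan": "basic"},
]
model = fit(real_rows)
synthetic_rows = sample(model, num_rows=5, seed=0)
```

The synthetic rows share the column names and value types of the original table, but none of them has to be a copy of a real row.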

Synthetic data can then be used to supplement, augment and, in some cases, replace real data when training Machine Learning models. It also makes it possible to test Machine Learning or other data-dependent software systems without the risk of exposure that comes with data disclosure.

Under the hood, SDV uses several techniques based on probabilistic graphical modeling and deep learning. To support a variety of data storage structures, we employ hierarchical generative modeling and recursive sampling techniques.
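To give a rough intuition for recursive sampling over related tables, the sketch below walks a small table hierarchy from the top down: it creates a parent row, then recurses into each child table to create the rows that reference it. This is a toy illustration under stated assumptions, not SDV's algorithm. The `SCHEMA` dictionary, the table names, and the observed child counts are all hypothetical, and rows carry only an id and a reference to their parent.

```python
import itertools
import random

# Hypothetical schema: each table lists its child tables together with the
# number of child rows observed per parent row in the (imaginary) real data.
SCHEMA = {
    "users": {"children": {"sessions": [1, 2, 2, 3]}},
    "sessions": {"children": {"transactions": [0, 1, 1, 4]}},
    "transactions": {"children": {}},
}


def sample_rows(table, parent_id, rng, ids, output):
    """Create one row of `table`, then recurse into each of its child tables."""
    row_id = next(ids[table])
    output[table].append({"id": row_id, "parent_id": parent_id})
    for child_table, observed_counts in SCHEMA[table]["children"].items():
        # Draw how many children this row gets from the observed counts.
        for _ in range(rng.choice(observed_counts)):
            sample_rows(child_table, row_id, rng, ids, output)


def sample_dataset(root, num_root_rows, seed=None):
    """Sample the whole hierarchy, starting from the root table."""
    rng = random.Random(seed)
    ids = {table: itertools.count() for table in SCHEMA}
    output = {table: [] for table in SCHEMA}
    for _ in range(num_root_rows):
        sample_rows(root, None, rng, ids, output)
    return output


dataset = sample_dataset("users", num_root_rows=3, seed=0)
```

Because each child row is created while its parent is being processed, every `parent_id` in the output refers to a row that actually exists, which is the referential integrity a multi-table dataset needs.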

Current functionality and features:

Try it out now!

To get a quick feel for SDV, launch the tutorials on Binder and follow along.

Join our Slack Workspace

If you want to be part of the SDV community to receive announcements of the latest releases, ask questions, suggest new features or participate in the development meetings, please join our Slack Workspace!


Explore SDV




The Synthetic Data Vault Project was first created at MIT’s Data to AI Lab in 2016. After 4 years of research and traction with enterprise, we created DataCebo in 2020 with the goal of growing the project. Today, DataCebo is the proud developer of SDV, the largest ecosystem for synthetic data generation & evaluation. It is home to multiple libraries that support synthetic data, including:

  • 🔄 Data discovery & transformation. Reverse the transforms to reproduce realistic data.

  • 🧠 Multiple machine learning models – ranging from Copulas to Deep Learning – to create tabular, multi table and time series data.

  • 📊 Measuring quality and privacy of synthetic data, and comparing different synthetic data generation models.
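As a small illustration of what measuring the quality of synthetic data can mean for a single numeric column, the sketch below compares the empirical distributions of a real and a synthetic column using the two-sample Kolmogorov-Smirnov statistic and reports one minus that statistic as a score. This is one simple possibility written with the standard library only; the function names are hypothetical and it is not presented as the metric any SDV library computes.

```python
import bisect


def ks_statistic(real, synthetic):
    """Largest gap between the two empirical CDFs (two-sample KS statistic)."""
    real, synthetic = sorted(real), sorted(synthetic)
    max_gap = 0.0
    for x in sorted(set(real) | set(synthetic)):
        # Fraction of each sample that is less than or equal to x.
        cdf_real = bisect.bisect_right(real, x) / len(real)
        cdf_synthetic = bisect.bisect_right(synthetic, x) / len(synthetic)
        max_gap = max(max_gap, abs(cdf_real - cdf_synthetic))
    return max_gap


def similarity_score(real, synthetic):
    """Score in [0, 1]; 1.0 means the two empirical distributions coincide."""
    return 1.0 - ks_statistic(real, synthetic)


identical = similarity_score([1, 2, 3], [1, 2, 3])    # 1.0
disjoint = similarity_score([1, 2, 3], [10, 11, 12])  # 0.0
```

A score like this can be computed for the output of several generation models on the same real column, which gives a basic way to compare them.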

Get started using the SDV package – a fully integrated solution and your one-stop shop for synthetic data. Or, use the standalone libraries for specific needs.