Tabular Preset

The TabularPreset is a tabular model that comes with pre-configured settings. It is meant for users who want to get started with synthetic data quickly, without spending time worrying about which model to choose or how to tune its parameters.

Note

Our speed-optimized machine learning preset is currently in Beta. Help us by testing the model and filing issues for any bugs or feature requests you may have.

What is the FAST_ML preset?

The FAST_ML preset is our first preset. It uses machine learning (ML) to model your data while optimizing for the modeling time. This is a great choice if it’s your first time using the SDV for a large custom dataset or if you’re exploring the benefits of using ML to create synthetic data.

What will you get with this preset?

  • This preset optimizes for the modeling time while still applying machine learning to model and generate synthetic data.

  • Your synthetic data will capture correlations that exist between the columns of the original data.

  • Your synthetic data will adhere to the basic statistical properties of the original columns: min/max values, averages and standard deviations.

While other SDV models may create higher quality synthetic data, they will also take longer. The FAST_ML preset lets you start using ML to create synthetic data right away.

Quick Usage

Preparation

To use this preset, you must have:

  1. Your data, loaded as a pandas DataFrame, and

  2. (Optional but strongly recommended) A metadata file that describes the columns of your dataset

For this guide, we’ll load the demo data and metadata from the SDV. This data contains information about students, including their grades, major and work experience.

In [1]: from sdv.demo import load_tabular_demo

In [2]: metadata, data = load_tabular_demo('student_placements', metadata=True)

In [3]: data.head()
Out[3]: 
   student_id gender  second_perc  ...  start_date   end_date  duration
0       17264      M        67.00  ...  2020-07-23 2020-10-12       3.0
1       17265      M        79.33  ...  2020-01-11 2020-04-09       3.0
2       17266      M        65.00  ...  2020-01-26 2020-07-13       6.0
3       17267      M        56.00  ...         NaT        NaT       NaN
4       17268      M        85.80  ...  2020-07-04 2020-09-27       3.0

[5 rows x 17 columns]

If you want to use your custom dataset, you can load it using pandas. For example, if your data is available as a CSV file, you can use the pandas.read_csv function.
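A minimal sketch (the file name here is hypothetical):

import pandas as pd

# load your own table as a pandas DataFrame
data = pd.read_csv('my_dataset.csv')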

You can write your metadata as a dictionary. Follow the Metadata guide to create a dictionary for a single table. For example, the metadata for our table looks something like this:

{
    'fields': {
        'start_date': {'type': 'datetime', 'format': '%Y-%m-%d'},
        'end_date': {'type': 'datetime', 'format': '%Y-%m-%d'},
        'salary': {'type': 'numerical', 'subtype': 'integer'},
        'duration': {'type': 'categorical'},
        'student_id': {'type': 'id', 'subtype': 'integer'},
        'high_perc': {'type': 'numerical', 'subtype': 'float'},
        'high_spec': {'type': 'categorical'},
        'mba_spec': {'type': 'categorical'},
        'second_perc': {'type': 'numerical', 'subtype': 'float'},
        'gender': {'type': 'categorical'},
        'degree_perc': {'type': 'numerical', 'subtype': 'float'},
        'placed': {'type': 'boolean'},
        'experience_years': {'type': 'numerical', 'subtype': 'integer'},
        'employability_perc': {'type': 'numerical', 'subtype': 'float'},
        'mba_perc': {'type': 'numerical', 'subtype': 'float'},
        'work_experience': {'type': 'boolean'},
        'degree_type': {'type': 'categorical'}
    },
    'constraints': [],
    'primary_key': 'student_id'
}

Modeling

Pass in your metadata to create the TabularPreset FAST_ML model.

In [4]: from sdv.lite import TabularPreset

# Use the FAST_ML preset to optimize for modeling time
In [5]: model = TabularPreset(name='FAST_ML', metadata=metadata)

Then, simply pass in your data to train the model.

In [6]: model.fit(data)

The modeling step is optimized for speed. The exact time it takes depends on several factors, including the number of rows, the number of columns, and the number of distinct categories in categorical columns. As a rough benchmark, our analysis shows that:

  • Datasets with around 100K rows and 50-100 columns will take a few minutes to model

  • Larger datasets with around 1M rows and hundreds of columns may take closer to an hour

After you are finished modeling, you can save the fitted model and load it in again for future use.

# save the model in a new file
In [7]: model.save('fast_ml_model.pkl')

# later, you can load it in again
In [8]: model = TabularPreset.load('fast_ml_model.pkl')

Sampling

Once you have your model, you can begin to create synthetic data. Use the sample method and pass in the number of rows you want to synthesize.

In [9]: synthetic_data = model.sample(num_rows=100)

In [10]: synthetic_data.head()
Out[10]: 
   student_id gender  ...                      end_date  duration
0           0      F  ...                           NaT       NaN
1           1      F  ... 2020-08-21 16:48:26.298002176       NaN
2           2      F  ... 2020-10-21 11:20:34.264411136       NaN
3           3      F  ...                           NaT       6.0
4           4      M  ... 2020-08-02 21:05:52.004792832       3.0

[5 rows x 17 columns]
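As a quick sanity check that the synthetic data preserves the basic statistical properties of the original columns, you can compare summary statistics with ordinary pandas (this check is not part of the SDV API); the values should be broadly similar:

# compare basic statistics of a numerical column in the real and synthetic data
print(data['second_perc'].describe())
print(synthetic_data['second_perc'].describe())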

For creating large amounts of synthetic data, provide a batch_size. This breaks up the sampling into multiple batches and shows a progress bar. Use the output_file_path parameter to write results to a file.

In [11]: model.sample(num_rows=1_000_000, batch_size=10_000, output_file_path='synthetic_data.csv')
Out[11]: 
        student_id gender  ...                      end_date  duration
0                0      M  ...                           NaT      12.0
1                1      M  ... 2020-07-27 23:51:23.136863744       6.0
2                2      M  ... 2020-11-23 00:18:03.228760064       3.0
3                3      M  ...                           NaT       3.0
4                4      M  ...                           NaT       6.0
...            ...    ...  ...                           ...       ...
999995        9995      M  ...                           NaT       NaN
999996        9996      M  ... 2020-10-21 17:21:25.821657344      12.0
999997        9997      F  ...                           NaT       6.0
999998        9998      F  ... 2020-07-19 18:00:02.299542784       NaN
999999        9999      M  ...                           NaT       NaN

[1000000 rows x 17 columns]

Conditional Sampling

The model generates entirely new synthetic rows that do not correspond to rows in the original data. But sometimes you may want to fix the values in certain columns.

For example, you might only be interested in synthesizing science and commerce students with work experience. Using conditional sampling, you can specify the exact, fixed values that you need. The SDV model will then synthesize the rest of the data.

First, use the Condition object to specify the values you want to fix. Pass in a dictionary that maps each column name to its fixed value, along with the number of rows to synthesize.

In [12]: from sdv.sampling.tabular import Condition

# 100 science students with work experience
In [13]: science_students = Condition(
   ....:    column_values={'high_spec': 'Science', 'work_experience': True}, num_rows=100)
   ....: 

# 200 commerce students with work experience
In [14]: commerce_students = Condition(
   ....:    column_values={'high_spec': 'Commerce', 'work_experience': True}, num_rows=200)
   ....: 

You can now use the sample_conditions method and pass in a list of conditions.

In [15]: all_conditions = [science_students, commerce_students]

In [16]: model.sample_conditions(conditions=all_conditions)
Out[16]: 
     student_id gender  ...                      end_date  duration
0             0      M  ...                           NaT       3.0
1             1      F  ...                           NaT       3.0
2             2      M  ... 2020-07-17 02:35:37.029714688       NaN
3             3      F  ... 2020-03-02 23:12:10.715367424       NaN
4             4      M  ...                           NaT      12.0
..          ...    ...  ...                           ...       ...
291         291      F  ... 2020-09-16 08:49:52.684590848       3.0
292         292      M  ... 2021-02-26 12:01:38.025448960       NaN
293         293      M  ...                           NaT       NaN
294         294      M  ... 2020-12-27 00:10:25.697350656       6.0
295         295      F  ... 2020-09-12 08:23:07.567118592       NaN

[296 rows x 17 columns]
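If you want to inspect the result, assign it to a variable and check the conditioned columns with ordinary pandas (a quick sanity check, not part of the SDV API):

conditioned = model.sample_conditions(conditions=all_conditions)

# expect 100 'Science' rows and 200 'Commerce' rows, all with work_experience=True
print(conditioned['high_spec'].value_counts())
print(conditioned['work_experience'].value_counts())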

Advanced Usage

Adding Constraints

A constraint is a logical business rule that must be met by every row in your dataset.

In most cases, the preset is able to learn a general trend and create synthetic data where most of the rows follow the rule. Use a constraint if you want to enforce that all of the rows must follow the rule.

In our dataset, we have a constraint: If experience_years=0, then work_experience=False. Otherwise, work_experience=True. We can describe this using a ColumnFormula constraint.

In [17]: from sdv.constraints import ColumnFormula

# define the formula for computing work experience
In [18]: def calculate_work_experience(data):
   ....:     return data['experience_years'] > 0
   ....: 

# use the formula when defining the constraint
In [19]: work_constraint = ColumnFormula(
   ....:     column='work_experience',
   ....:     formula=calculate_work_experience,
   ....: )
   ....: 

Pass the constraints into the preset when creating your model.

In [20]: constrained_model = TabularPreset(
   ....:     name='FAST_ML',
   ....:     metadata=metadata,
   ....:     constraints=[work_constraint],
   ....: )
   ....: 

In [21]: constrained_model.fit(data)

When you sample from the model, the synthetic data will follow the constraints.

In [22]: constrained_synthetic_data = constrained_model.sample(num_rows=1_000)

In [23]: constrained_synthetic_data.head(10)
Out[23]: 
   student_id gender  ...                      end_date  duration
0           0      F  ...                           NaT       3.0
1           1      M  ... 2020-06-30 10:35:05.327057152      12.0
2           2      F  ...                           NaT       NaN
3           3      F  ...                           NaT       6.0
4           4      F  ... 2020-05-01 16:10:38.467646464       3.0
5           5      F  ...                           NaT       NaN
6           6      M  ...                           NaT       NaN
7           7      M  ...                           NaT       NaN
8           8      F  ... 2020-08-25 05:55:29.006303488       6.0
9           9      M  ... 2020-04-12 09:09:57.670391296       6.0

[10 rows x 17 columns]
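Since the ColumnFormula constraint computes work_experience directly from experience_years, you can verify that every sampled row follows the rule with a quick pandas check (again, not part of the SDV API):

# every row must satisfy: work_experience == (experience_years > 0)
expected = constrained_synthetic_data['experience_years'] > 0
assert (constrained_synthetic_data['work_experience'] == expected).all()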

To read more about defining constraints, see the Handling Constraints User Guide.

Resources

The SDV (Synthetic Data Vault) is an open source project built & maintained by DataCebo. It is free to use under the MIT License.

For other resources, see our GitHub, Docs, and Blog.