Tabular Preset

The TabularPreset is a tabular model that comes with pre-configured settings. It is meant for users who want to get started with synthetic data quickly and spend less time worrying about which model to choose or how to tune its parameters.

Note

Our speed-optimized machine learning preset is currently in Beta. Help us by testing the model and filing issues for any bugs or feature requests you may have.

What is the FAST_ML preset?

The FAST_ML preset is our first preset. It uses machine learning (ML) to model your data while optimizing for modeling time. It is a great choice if this is your first time using the SDV on a large custom dataset, or if you're exploring the benefits of using ML to create synthetic data.

What will you get with this preset?

  • This preset optimizes for modeling time while still applying machine learning to model and generate synthetic data.

  • Your synthetic data will capture correlations that exist between the columns of the original data.

  • Your synthetic data will adhere to the basic statistical properties of the original columns: min/max values, averages and standard deviations.

While other SDV models may create higher quality synthetic data, they also take longer to train. Using the FAST_ML preset allows you to get started with ML-created synthetic data right away.

Quick Usage

Preparation

To use this preset, you must have:

  1. Your data, loaded as a pandas DataFrame, and

  2. (Optional but strongly recommended) Metadata that describes the columns of your dataset

For this guide, we’ll load the demo data and metadata from the SDV. This data contains information about students, including their grades, major and work experience.

In [1]: from sdv.demo import load_tabular_demo

In [2]: metadata, data = load_tabular_demo('student_placements', metadata=True)

In [3]: data.head()
Out[3]: 
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec  mba_perc   salary  placed start_date   end_date  duration
0       17264      M        67.00      91.00  Commerce        58.00    Sci&Tech            False                 0                55.0   Mkt&HR     58.80  27000.0    True 2020-07-23 2020-10-12       3.0
1       17265      M        79.33      78.33   Science        77.48    Sci&Tech             True                 1                86.5  Mkt&Fin     66.28  20000.0    True 2020-01-11 2020-04-09       3.0
2       17266      M        65.00      68.00      Arts        64.00   Comm&Mgmt            False                 0                75.0  Mkt&Fin     57.80  25000.0    True 2020-01-26 2020-07-13       6.0
3       17267      M        56.00      52.00   Science        52.00    Sci&Tech            False                 0                66.0   Mkt&HR     59.43      NaN   False        NaT        NaT       NaN
4       17268      M        85.80      73.60  Commerce        73.30   Comm&Mgmt            False                 0                96.8  Mkt&Fin     55.50  42500.0    True 2020-07-04 2020-09-27       3.0

If you want to use your custom dataset, you can load it using pandas. For example, if your data is available as a CSV file, you can use the read_csv method.
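For instance, a minimal sketch, assuming your data lives in a hypothetical CSV file called my_dataset.csv:

import pandas as pd

# load your own data into a pandas DataFrame
data = pd.read_csv('my_dataset.csv')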

You can write your metadata as a dictionary. Follow the Metadata guide to create a dictionary for a single table. For example, the metadata for our table looks something like this:

{
    'fields': {
        'start_date': {'type': 'datetime', 'format': '%Y-%m-%d'},
        'end_date': {'type': 'datetime', 'format': '%Y-%m-%d'},
        'salary': {'type': 'numerical', 'subtype': 'integer'},
        'duration': {'type': 'categorical'},
        'student_id': {'type': 'id', 'subtype': 'integer'},
        'high_perc': {'type': 'numerical', 'subtype': 'float'},
        'high_spec': {'type': 'categorical'},
        'mba_spec': {'type': 'categorical'},
        'second_perc': {'type': 'numerical', 'subtype': 'float'},
        'gender': {'type': 'categorical'},
        'degree_perc': {'type': 'numerical', 'subtype': 'float'},
        'placed': {'type': 'boolean'},
        'experience_years': {'type': 'numerical', 'subtype': 'integer'},
        'employability_perc': {'type': 'numerical', 'subtype': 'float'},
        'mba_perc': {'type': 'numerical', 'subtype': 'float'},
        'work_experience': {'type': 'boolean'},
        'degree_type': {'type': 'categorical'}
    },
    'constraints': [],
    'primary_key': 'student_id'
}
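If you wrote your own metadata as a dictionary like the one above, you can pass that dictionary in when creating the model later on. A minimal sketch, assuming the dictionary is stored in a variable called metadata_dict and that the metadata parameter accepts a plain dictionary as well as a Metadata object:

from sdv.lite import TabularPreset

# metadata_dict is the dictionary shown above (assumed to be accepted directly)
model = TabularPreset(name='FAST_ML', metadata=metadata_dict)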

Modeling

Pass in your metadata to create the TabularPreset FAST_ML model.

In [4]: from sdv.lite import TabularPreset

# Use the FAST_ML preset to optimize for modeling time
In [5]: model = TabularPreset(name='FAST_ML', metadata=metadata)

Then, simply pass in your data to train the model.

In [6]: model.fit(data)

The modeling step is optimized for speed. The exact time it takes depends on several factors, including the number of rows, the number of columns, and the number of distinct categories in the categorical columns. As a rough benchmark, our analysis shows that:

  • Datasets with around 100K rows and 50-100 columns will take a few minutes to model

  • Larger datasets with around 1M rows and hundreds of columns may take closer to an hour
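If you want to benchmark the modeling time on your own dataset, a rough sketch using Python's standard time module:

import time

# time how long the FAST_ML preset takes to fit your data
start = time.perf_counter()
model.fit(data)
print(f'Modeling took {time.perf_counter() - start:.1f} seconds')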

After you are finished modeling, you can save the fitted model and load it in again for future use.

# save the model in a new file
In [7]: model.save('fast_ml_model.pkl')

# later, you can load it in again
In [8]: model = TabularPreset.load('fast_ml_model.pkl')

Sampling

Once you have your model, you can begin to create synthetic data. Use the sample method and pass in the number of rows you want to synthesize.

In [9]: synthetic_data = model.sample(num_rows=100)

In [10]: synthetic_data.head()
Out[10]: 
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc   salary  placed start_date   end_date  duration
0           0      F    53.605866  58.142934      Arts    52.876905    Sci&Tech            False                 0           71.081351   Mkt&HR  67.529873      NaN   False 2020-01-06 2020-02-22       3.0
1           1      F    77.091223  66.554675   Science    80.062358   Comm&Mgmt             True                 1           63.838468   Mkt&HR  68.891074  26653.0    True 2020-07-16 2020-10-23       NaN
2           2      M    69.861386  68.273059  Commerce    66.249745   Comm&Mgmt            False                 0           89.928674  Mkt&Fin  58.006479      NaN   False        NaT 2020-08-07       NaN
3           3      F    59.787943  64.598799  Commerce    74.574091   Comm&Mgmt            False                 1           50.000000  Mkt&Fin  66.758063  20000.0    True        NaT        NaT       NaN
4           4      M    61.922911  76.016704  Commerce    62.459966   Comm&Mgmt            False                 0           50.000000  Mkt&Fin  62.518600      NaN    True 2020-09-11        NaT       NaN

To create large amounts of synthetic data, provide a batch_size. This breaks the sampling up into multiple batches and shows a progress bar. Use the output_file_path parameter to write the results to a file.

In [11]: model.sample(num_rows=1_000_000, batch_size=10_000, output_file_path='synthetic_data.csv')
Out[11]: 
        student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc   salary  placed start_date   end_date  duration
0                0      M    40.890000  59.556471  Commerce    61.312905   Comm&Mgmt            False                 0           84.580967  Mkt&Fin  57.063071      NaN    True 2020-03-16        NaT       NaN
1                1      M    76.624268  66.669565  Commerce    70.026584   Comm&Mgmt             True                 1           69.654434  Mkt&Fin  66.279355      NaN    True 2020-01-28 2020-06-15       NaN
2                2      F    89.400000  80.288157  Commerce    88.647683   Comm&Mgmt            False                 0           51.353619  Mkt&Fin  62.055338      NaN    True        NaT        NaT       NaN
3                3      M    70.596387  70.461620   Science    71.737315    Sci&Tech             True                 1           76.865223  Mkt&Fin  63.159530      NaN    True 2019-11-09        NaT       NaN
4                4      M    52.386387  50.402889   Science    66.228059   Comm&Mgmt            False                 0           62.037355   Mkt&HR  52.006703  33533.0   False 2020-05-02        NaT      12.0
...            ...    ...          ...        ...       ...          ...         ...              ...               ...                 ...      ...        ...      ...     ...        ...        ...       ...
999995        9995      M    67.829527  53.572014   Science    56.615745   Comm&Mgmt            False                 0           75.006729   Mkt&HR  52.764367  28330.0   False 2020-02-02 2020-05-28       NaN
999996        9996      M    43.902776  59.514344  Commerce    67.788189   Comm&Mgmt            False                 0           58.528543   Mkt&HR  54.160158  28838.0   False 2020-06-07 2020-06-22       NaN
999997        9997      F    73.886157  60.684399  Commerce    74.132715   Comm&Mgmt            False                 0           73.592714  Mkt&Fin  64.299279      NaN    True        NaT        NaT       NaN
999998        9998      M    53.786529  68.609708   Science    57.962045   Comm&Mgmt             True                 1           67.717798  Mkt&Fin  62.076065      NaN   False 2019-12-15        NaT       6.0
999999        9999      M    57.695809  52.530790   Science    58.597656   Comm&Mgmt             True                 1           69.925830  Mkt&Fin  51.210000      NaN    True 2020-02-01        NaT       NaN

[1000000 rows x 17 columns]

Conditional Sampling

The model generates entirely new synthetic rows that do not correspond to any rows in the original data. Sometimes, however, you may want to fix the values in certain columns.

For example, you might only be interested in synthesizing science and commerce students with work experience. Using conditional sampling, you can specify the exact, fixed values that you need. The SDV model will then synthesize the rest of the data.

First, use the Condition object to specify the exact values you want to fix. Pass in a dictionary mapping column names to the fixed values, along with the number of rows to synthesize.

In [12]: from sdv.sampling.tabular import Condition

# 100 science students with work experience
In [13]: science_students = Condition(
   ....:    column_values={'high_spec': 'Science', 'work_experience': True}, num_rows=100)
   ....: 

# 200 commerce students with work experience
In [14]: commerce_students = Condition(
   ....:    column_values={'high_spec': 'Commerce', 'work_experience': True}, num_rows=200)
   ....: 

You can now use the sample_conditions method and pass in a list of conditions.

In [15]: all_conditions = [science_students, commerce_students]

In [16]: model.sample_conditions(conditions=all_conditions)
Out[16]: 
     student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc   salary  placed start_date   end_date  duration
0             0      M    74.146528  54.064282   Science    63.344975   Comm&Mgmt             True                 2           75.018431  Mkt&Fin  67.088993      NaN    True 2020-03-03 2020-07-28       NaN
1             1      F    76.474854  64.676705   Science    69.875049      Others             True                 1           50.000000  Mkt&Fin  67.690054  43116.0    True 2020-02-15        NaT       NaN
2             2      M    65.990529  65.259392   Science    73.335155    Sci&Tech             True                 1           86.753870   Mkt&HR  57.928501      NaN    True        NaT        NaT       NaN
3             3      M    84.786714  77.534926   Science    51.719152   Comm&Mgmt             True                 1           69.417172  Mkt&Fin  54.064789      NaN    True        NaT        NaT       3.0
4             4      M    70.076205  56.027004   Science    60.660175    Sci&Tech             True                 2           97.596509   Mkt&HR  56.209663  50589.0    True 2020-04-12 2020-12-04      12.0
..          ...    ...          ...        ...       ...          ...         ...              ...               ...                 ...      ...        ...      ...     ...        ...        ...       ...
295         295      F    80.925749  92.063829  Commerce    77.085518   Comm&Mgmt             True                 1           73.592947  Mkt&Fin  75.879022  45416.0    True        NaT        NaT       NaN
296         296      F    65.843065  55.644624  Commerce    67.526574    Sci&Tech             True                 1           52.094756  Mkt&Fin  66.854837  21718.0    True 2020-02-17        NaT       3.0
297         297      M    68.030281  75.072779  Commerce    74.347777   Comm&Mgmt             True                 1           80.119017  Mkt&Fin  64.803793      NaN    True 2020-01-12        NaT       6.0
298         298      F    61.236386  62.097566  Commerce    70.906764   Comm&Mgmt             True                 1           72.567293  Mkt&Fin  67.501697  20000.0    True 2020-05-07 2020-07-03       NaN
299         299      M    77.151771  77.696541  Commerce    66.925670   Comm&Mgmt             True                 1           62.772069  Mkt&Fin  58.988102  21336.0    True 2020-01-29 2020-07-04       NaN

[300 rows x 17 columns]

Advanced Usage

Adding Constraints

A constraint is a logical business rule that must be met by every row in your dataset.

In most cases, the preset is able to learn a general trend and create synthetic data where most of the rows follow the rule. Use a constraint if you want to enforce that all of the rows must follow the rule.

In our dataset, we have a constraint: All the numerical values in the duration column must be divisible by 3. We can describe this using a FixedIncrements constraint.

In [17]: from sdv.constraints import FixedIncrements

# duration values must occur in increments of 3
In [18]: duration_constraint = FixedIncrements(
   ....:     column_name='duration',
   ....:     increment_value=3,
   ....: )
   ....: 

You can pass constraints into the preset when creating your model.

In [19]: constrained_model = TabularPreset(
   ....:     name='FAST_ML',
   ....:     metadata=metadata,
   ....:     constraints=[duration_constraint],
   ....: )
   ....: 

In [20]: constrained_model.fit(data)

When you sample from the model, the synthetic data will follow the constraint.

In [21]: constrained_synthetic_data = constrained_model.sample(num_rows=1_000)

In [22]: constrained_synthetic_data.head(10)
Out[22]: 
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc   salary  placed start_date   end_date  duration
0           0      M    62.251380  55.286421  Commerce    64.005795   Comm&Mgmt             True                 2           71.245081  Mkt&Fin  60.137717  37325.0    True 2020-04-14 2020-11-23       NaN
1           1      M    77.703265  61.395523   Science    60.732887    Sci&Tech            False                 0           59.133639   Mkt&HR  59.986495  25934.0    True 2020-07-25 2020-09-14       NaN
2           2      M    52.745563  61.137938  Commerce    59.391134   Comm&Mgmt            False                 0           79.642317   Mkt&HR  59.044961      NaN   False        NaT        NaT       3.0
3           3      F    48.583359  68.132515  Commerce    60.589847   Comm&Mgmt            False                 0           78.002896   Mkt&HR  70.833748      NaN   False        NaT        NaT       3.0
4           4      F    63.775845  75.028441  Commerce    63.090938   Comm&Mgmt            False                 1           61.736851   Mkt&HR  65.741610  32463.0    True 2020-03-31 2020-08-19       NaN
5           5      M    69.058240  65.143928  Commerce    69.011776   Comm&Mgmt            False                 0           71.344402   Mkt&HR  61.797667  20000.0    True 2020-05-06 2020-07-19       NaN
6           6      M    63.607294  73.752060      Arts    70.714755    Sci&Tech            False                 0           73.778373  Mkt&Fin  58.135275  31567.0    True 2019-11-09 2020-09-02       6.0
7           7      F    49.845707  50.301398   Science    53.192406   Comm&Mgmt            False                 0           58.883293  Mkt&Fin  55.476283  29285.0    True 2020-04-13 2021-01-07       NaN
8           8      F    60.026655  64.417167  Commerce    61.545889   Comm&Mgmt            False                 0           62.141131   Mkt&HR  57.001747      NaN   False        NaT        NaT       NaN
9           9      F    57.576395  68.378680   Science    57.673501   Comm&Mgmt            False                 0           72.483824   Mkt&HR  62.478950      NaN   False        NaT        NaT       6.0
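If you want to double-check this, a quick sketch with pandas confirms that every non-null value in the duration column is a multiple of 3:

# all non-null duration values should be divisible by the increment of 3
assert (constrained_synthetic_data['duration'].dropna() % 3 == 0).all()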

To read more about defining constraints, see the Handling Constraints User Guide.

Resources

The SDV (Synthetic Data Vault) is an open source project built and maintained by DataCebo. It is free to use under the MIT License.

For other resources, see our GitHub, Docs, and Blog.