
Tabular Preset

The TabularPreset is a tabular model that comes with pre-configured settings. It is meant for users who want to get started with synthetic data and spend less time worrying about which model to choose or how to tune its parameters.

Note

Our speed-optimized machine learning preset is currently in Beta. Help us by testing the model and filing issues for any bugs or feature requests you may have.

What is the FAST_ML preset?

The FAST_ML preset is our first preset. It uses machine learning (ML) to model your data while optimizing for the modeling time. This is a great choice if it’s your first time using the SDV for a large custom dataset or if you’re exploring the benefits of using ML to create synthetic data.

What will you get with this preset?

  • This preset optimizes for the modeling time while still applying machine learning to model and generate synthetic data.

  • Your synthetic data will capture correlations that exist between the columns of the original data.

  • Your synthetic data will adhere to the basic statistical properties of the original columns: min/max values, averages and standard deviations.

While other SDV models may create higher quality synthetic data, they will take longer. Using the FAST_ML preset allows you to get started with ML to create synthetic data right now.

Quick Usage

Preparation

To use this preset, you must have:

  1. Your data, loaded as a pandas DataFrame, and

  2. (Optional but strongly recommended) A metadata file that describes the columns of your dataset

For this guide, we’ll load the demo data and metadata from the SDV. This data contains information about students, including their grades, major and work experience.

In [1]: from sdv.demo import load_tabular_demo

In [2]: metadata, data = load_tabular_demo('student_placements', metadata=True)

In [3]: data.head()
Out[3]: 
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec  mba_perc   salary  placed start_date   end_date  duration
0       17264      M        67.00      91.00  Commerce        58.00    Sci&Tech            False                 0                55.0   Mkt&HR     58.80  27000.0    True 2020-07-23 2020-10-12       3.0
1       17265      M        79.33      78.33   Science        77.48    Sci&Tech             True                 1                86.5  Mkt&Fin     66.28  20000.0    True 2020-01-11 2020-04-09       3.0
2       17266      M        65.00      68.00      Arts        64.00   Comm&Mgmt            False                 0                75.0  Mkt&Fin     57.80  25000.0    True 2020-01-26 2020-07-13       6.0
3       17267      M        56.00      52.00   Science        52.00    Sci&Tech            False                 0                66.0   Mkt&HR     59.43      NaN   False        NaT        NaT       NaN
4       17268      M        85.80      73.60  Commerce        73.30   Comm&Mgmt            False                 0                96.8  Mkt&Fin     55.50  42500.0    True 2020-07-04 2020-09-27       3.0

If you want to use your own custom dataset, you can load it with pandas. For example, if your data is available as a CSV file, you can use the pandas read_csv function, as in the sketch below.
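
For instance, assuming your table lives in a hypothetical file named my_dataset.csv:

import pandas as pd

# load the CSV into a pandas DataFrame; the file name here is only an example
data = pd.read_csv('my_dataset.csv')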

You can write your metadata as a dictionary. Follow the Metadata guide to create a dictionary for a single table. For example, the metadata for our table looks something like this:

{
    'fields': {
        'start_date': {'type': 'datetime', 'format': '%Y-%m-%d'},
        'end_date': {'type': 'datetime', 'format': '%Y-%m-%d'},
        'salary': {'type': 'numerical', 'subtype': 'integer'},
        'duration': {'type': 'categorical'},
        'student_id': {'type': 'id', 'subtype': 'integer'},
        'high_perc': {'type': 'numerical', 'subtype': 'float'},
        'high_spec': {'type': 'categorical'},
        'mba_spec': {'type': 'categorical'},
        'second_perc': {'type': 'numerical', 'subtype': 'float'},
        'gender': {'type': 'categorical'},
        'degree_perc': {'type': 'numerical', 'subtype': 'float'},
        'placed': {'type': 'boolean'},
        'experience_years': {'type': 'numerical', 'subtype': 'integer'},
        'employability_perc': {'type': 'numerical', 'subtype': 'float'},
        'mba_perc': {'type': 'numerical', 'subtype': 'float'},
        'work_experience': {'type': 'boolean'},
        'degree_type': {'type': 'categorical'}
    },
    'constraints': [],
    'primary_key': 'student_id'
}
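
If you wrote your metadata as a dictionary like the one above, you can pass that dictionary to the preset in place of the demo metadata object. A short sketch, assuming the preset accepts a plain metadata dictionary and using a hypothetical variable name metadata_dict for the dictionary above:

from sdv.lite import TabularPreset

# metadata_dict holds the metadata dictionary shown above (hypothetical variable name)
model = TabularPreset(name='FAST_ML', metadata=metadata_dict)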

Modeling

Pass in your metadata to create the TabularPreset FAST_ML model.

In [4]: from sdv.lite import TabularPreset

# Use the FAST_ML preset to optimize for modeling time
In [5]: model = TabularPreset(name='FAST_ML', metadata=metadata)

Then, simply pass in your data to train the model.

In [6]: model.fit(data)

The modeling step is optimized for speed. The exact time it takes depends on several factors, including the number of rows, the number of columns, and the number of distinct categories in categorical columns. As a rough benchmark, our analysis shows that:

  • Datasets with around 100K rows and 50-100 columns will take a few minutes to model

  • Larger datasets with around 1M rows and hundreds of columns may take closer to an hour
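
If you want to benchmark the modeling time on your own dataset, a minimal sketch using the standard library timer (this simply refits the model and reports the elapsed time):

import time

start = time.perf_counter()
model.fit(data)  # fit the preset on your DataFrame
print(f'Modeling took {time.perf_counter() - start:.1f} seconds')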

After you are finished modeling, you can save the fitted model and load it in again for future use.

# save the model in a new file
In [7]: model.save('fast_ml_model.pkl')

# later, you can load it in again
In [8]: model = TabularPreset.load('fast_ml_model.pkl')

Sampling

Once you have your model, you can begin to create synthetic data. Use the sample method and pass in the number of rows you want to synthesize.

In [9]: synthetic_data = model.sample(num_rows=100)

In [10]: synthetic_data.head()
Out[10]: 
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc   salary  placed start_date   end_date  duration
0           0      M    77.856718  72.820654  Commerce    62.668844   Comm&Mgmt             True                 2           63.072758  Mkt&Fin  66.975043      NaN    True        NaT        NaT       NaN
1           1      F    48.974286  58.695613  Commerce    60.767601   Comm&Mgmt            False                 0           75.802412  Mkt&Fin  71.636136      NaN   False        NaT        NaT      12.0
2           2      M    76.844911  77.697082   Science    63.887735      Others             True                 1           66.080579  Mkt&Fin  59.494052      NaN    True 2020-02-06 2020-05-30       NaN
3           3      F    60.830093  72.588383   Science    74.019364   Comm&Mgmt             True                 1           60.130465  Mkt&Fin  65.954341      NaN    True        NaT        NaT       3.0
4           4      F    68.373581  73.035314  Commerce    65.815315   Comm&Mgmt            False                 0           60.689325  Mkt&Fin  64.818992  21004.0    True 2019-12-30 2020-09-30      12.0

To create large amounts of synthetic data, provide a batch_size. This breaks the sampling up into multiple batches and shows a progress bar. Use the output_file_path parameter to write the results to a file.

In [11]: model.sample(num_rows=1_000_000, batch_size=10_000, output_file_path='synthetic_data.csv')
Out[11]: 
        student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc   salary  placed start_date   end_date  duration
0                0      F    66.076808  70.167282  Commerce    61.123129   Comm&Mgmt            False                 0           90.352363  Mkt&Fin  52.009704  37292.0    True 2020-03-20        NaT       NaN
1                1      M    81.378810  84.665910  Commerce    70.314368   Comm&Mgmt            False                 1           76.510793   Mkt&HR  64.003891  29712.0    True 2020-03-18 2020-07-02       3.0
2                2      M    74.672915  48.998928  Commerce    65.581655   Comm&Mgmt            False                 0           75.429694   Mkt&HR  55.613480  20998.0    True 2020-02-26 2019-12-24       3.0
3                3      M    80.600229  86.230726  Commerce    64.245977   Comm&Mgmt            False                 0           86.365788  Mkt&Fin  68.133274      NaN    True        NaT        NaT       6.0
4                4      M    52.836250  56.600862   Science    50.000000   Comm&Mgmt            False                 0           68.515163  Mkt&Fin  58.932920  25903.0   False        NaT 2020-03-15       NaN
...            ...    ...          ...        ...       ...          ...         ...              ...               ...                 ...      ...        ...      ...     ...        ...        ...       ...
999995        9995      M    80.032156  86.618673   Science    68.168072      Others            False                 0           76.156154  Mkt&Fin  73.580754  20000.0    True 2019-12-28 2020-08-15       NaN
999996        9996      M    66.834672  68.129867   Science    62.890542   Comm&Mgmt            False                 1           86.144812  Mkt&Fin  56.229785  29926.0    True 2019-11-17 2020-08-14       NaN
999997        9997      M    79.327632  56.517645  Commerce    74.299468   Comm&Mgmt            False                 0           76.707949  Mkt&Fin  59.972965  38806.0    True 2020-02-13 2020-08-16       NaN
999998        9998      M    52.243179  57.016449  Commerce    71.633991    Sci&Tech            False                 0           52.177788  Mkt&Fin  58.332636  20774.0   False        NaT 2020-07-10       6.0
999999        9999      F    72.408903  63.813712   Science    72.117249   Comm&Mgmt            False                 0           98.000000  Mkt&Fin  72.292758      NaN    True        NaT        NaT       NaN

[1000000 rows x 17 columns]
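
Because output_file_path writes the sampled rows to a CSV file, you can reload the results later with pandas. A short sketch, using the same file name as the call above:

import pandas as pd

# read back the synthetic rows that were written during sampling
synthetic_from_disk = pd.read_csv('synthetic_data.csv')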

Conditional Sampling

The model generates entirely new synthetic rows that do not refer to the original data. But sometimes you may want to fix the values in certain columns.

For example, you might only be interested in synthesizing science and commerce students with work experience. Using conditional sampling, you can specify the exact, fixed values that you need. The SDV model will then synthesize the rest of the data.

First, use the Condition object to specify the exact values you want: a dictionary mapping column names to the fixed value for each, along with the number of rows to synthesize.

In [12]: from sdv.sampling.tabular import Condition

# 100 science students with work experience
In [13]: science_students = Condition(
   ....:    column_values={'high_spec': 'Science', 'work_experience': True}, num_rows=100)
   ....: 

# 200 commerce students with work experience
In [14]: commerce_students = Condition(
   ....:    column_values={'high_spec': 'Commerce', 'work_experience': True}, num_rows=200)
   ....: 

You can now use the sample_conditions method and pass in a list of conditions.

In [15]: all_conditions = [science_students, commerce_students]

In [16]: model.sample_conditions(conditions=all_conditions)
Out[16]: 
     student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc   salary  placed start_date   end_date  duration
0             0      F    52.532758  49.561970   Science    51.140416   Comm&Mgmt             True                 1           74.854705  Mkt&Fin  61.185634  23372.0   False 2019-12-23        NaT       NaN
1             1      M    67.391777  67.272080   Science    70.670060    Sci&Tech             True                 1           60.662458  Mkt&Fin  60.500428      NaN    True 2020-06-22 2020-08-21       NaN
2             2      M    75.295397  73.627289   Science    75.030994   Comm&Mgmt             True                 1           75.446925  Mkt&Fin  66.893770  24174.0    True 2020-04-04 2020-07-31       NaN
3             3      M    43.837698  49.099764   Science    50.000000   Comm&Mgmt             True                 1           61.788954   Mkt&HR  52.526468      NaN   False        NaT        NaT       6.0
4             4      F    64.153389  57.316688   Science    64.189365   Comm&Mgmt             True                 1           72.684235  Mkt&Fin  75.674615  31780.0   False        NaT        NaT       NaN
..          ...    ...          ...        ...       ...          ...         ...              ...               ...                 ...      ...        ...      ...     ...        ...        ...       ...
295         295      F    80.681077  70.690053  Commerce    68.522117   Comm&Mgmt             True                 1           83.895190   Mkt&HR  66.331646  30946.0   False 2020-01-24 2020-04-12       3.0
296         296      M    71.552698  63.254244  Commerce    72.997475   Comm&Mgmt             True                 1           85.743597  Mkt&Fin  63.853970  31098.0    True 2020-01-02        NaT       NaN
297         297      F    79.950761  82.388268  Commerce    61.265158   Comm&Mgmt             True                 1           75.722719  Mkt&Fin  59.528437      NaN    True        NaT 2020-09-10       NaN
298         298      M    68.525135  78.184983  Commerce    76.519461   Comm&Mgmt             True                 1           78.799635  Mkt&Fin  62.282153  30818.0    True 2019-12-21 2020-07-22       NaN
299         299      M    43.719921  41.350703  Commerce    58.875628   Comm&Mgmt             True                 1           67.797344  Mkt&Fin  51.898341  28077.0   False 2020-02-18        NaT       NaN

[300 rows x 17 columns]
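
As a quick sanity check, you can store the result and confirm that the fixed columns came back exactly as requested. A sketch using an illustrative variable name conditional_data:

# store the conditional sample and count the fixed values
conditional_data = model.sample_conditions(conditions=all_conditions)
print(conditional_data['high_spec'].value_counts())  # expect 100 Science and 200 Commerce rows
print(conditional_data['work_experience'].all())     # expect True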

Advanced Usage

Adding Constraints

A constraint is a logical business rule that must be met by every row in your dataset.

In most cases, the preset is able to learn a general trend and create synthetic data where most of the rows follow the rule. Use a constraint if you want to enforce that all of the rows must follow the rule.

In our dataset, we have a constraint: All the numerical values in the duration column must be divisible by 3. We can describe this using a FixedIncrements constraint.

In [17]: from sdv.constraints import FixedIncrements

# values in the duration column must be multiples of 3
In [18]: duration_constraint = FixedIncrements(
   ....:     column_name='duration',
   ....:     increment_value=3,
   ....: )
   ....: 

You can pass constraints into the preset when creating your model.

In [19]: constrained_model = TabularPreset(
   ....:     name='FAST_ML',
   ....:     metadata=metadata,
   ....:     constraints=[duration_constraint],
   ....: )
   ....: 

In [20]: constrained_model.fit(data)

When you sample from the model, the synthetic data will follow the constraints.

In [21]: constrained_synthetic_data = constrained_model.sample(num_rows=1_000)

In [22]: constrained_synthetic_data.head(10)
Out[22]: 
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc   salary  placed start_date   end_date  duration
0           0      M    64.592829  73.630461  Commerce    74.138359   Comm&Mgmt            False                 0           87.675480  Mkt&Fin  66.020486  34853.0    True 2020-03-03 2020-05-12       3.0
1           1      M    70.870053  85.546354  Commerce    70.087021   Comm&Mgmt            False                 1           50.000000  Mkt&Fin  61.551202  22689.0    True 2020-04-29 2020-09-03       3.0
2           2      M    66.417049  68.707972  Commerce    68.907223   Comm&Mgmt             True                 1           84.787113  Mkt&Fin  61.278468  20004.0    True 2020-03-19 2020-09-14       6.0
3           3      M    78.246489  70.774004   Science    62.064204    Sci&Tech             True                 0           96.113512  Mkt&Fin  73.842500      NaN   False        NaT        NaT       3.0
4           4      M    58.706323  58.070675  Commerce    64.180914   Comm&Mgmt            False                 0           74.148656   Mkt&HR  54.104841      NaN   False        NaT        NaT       3.0
5           5      M    51.897824  60.005120  Commerce    54.509286   Comm&Mgmt            False                 0           74.693255   Mkt&HR  55.548106      NaN   False        NaT        NaT       NaN
6           6      M    81.948772  89.730430  Commerce    81.614766   Comm&Mgmt            False                 0           94.340478  Mkt&Fin  71.759414  31500.0    True 2020-02-17 2020-07-28       3.0
7           7      M    78.950939  70.579714   Science    77.318894   Comm&Mgmt             True                 1           50.000000  Mkt&Fin  62.273381  21770.0    True 2020-04-03 2020-11-09       NaN
8           8      M    71.176417  64.074905  Commerce    72.918041   Comm&Mgmt            False                 0           52.857051  Mkt&Fin  58.704327  20000.0    True 2020-01-23 2020-04-11       3.0
9           9      M    75.734755  68.945450   Science    63.842584   Comm&Mgmt            False                 0           56.267666   Mkt&HR  71.709611  33248.0    True 2019-10-03 2020-06-14       NaN
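
You can verify this with a quick pandas check: every non-missing duration value should be a multiple of 3.

# check the FixedIncrements constraint on the sampled data
durations = constrained_synthetic_data['duration'].dropna()
print((durations % 3 == 0).all())  # expect True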

To read more about defining constraints, see the Handling Constraints User Guide.

Resources

The SDV (Synthetic Data Vault) is an open source project built & maintained by DataCebo. It is free to use under the MIT License.

For other resources, see our GitHub, Docs, and Blog.