Tabular Preset

The TabularPreset is a tabular model that comes with pre-configured settings. This is meant for users who want to get started with using synthetic data and spend less time worrying about which model to choose or how to tune its parameters.

Note

We are currently in Beta testing our speed-optimized machine learning preset. Help us by testing the model and filing issues for any bugs or feature requests you may have.

What is the FAST_ML preset?

The FAST_ML preset is our first preset. It uses machine learning (ML) to model your data while optimizing for the modeling time. This is a great choice if it’s your first time using the SDV for a large custom dataset or if you’re exploring the benefits of using ML to create synthetic data.

What will you get with this preset?

  • This preset optimizes for the modeling time while still applying machine learning to model and generate synthetic data.

  • Your synthetic data will capture correlations that exist between the columns of the original data.

  • Your synthetic data will adhere to the basic statistical properties of the original columns: min/max values, averages and standard deviations.

While other SDV models may create higher quality synthetic data, they will take longer. Using the FAST_ML preset allows you to get started with ML to create synthetic data right now.

Quick Usage

Preparation

To use this preset, you must have:

  1. Your data, loaded as a pandas DataFrame, and

  2. (Optional but strongly recommended) A metadata file that describes the columns of your dataset

For this guide, we’ll load the demo data and metadata from the SDV. This data contains information about students, including their grades, major and work experience.

In [1]: from sdv.demo import load_tabular_demo

In [2]: metadata, data = load_tabular_demo('student_placements', metadata=True)

In [3]: data.head()
Out[3]: 
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec  mba_perc   salary  placed start_date   end_date  duration
0       17264      M        67.00      91.00  Commerce        58.00    Sci&Tech            False                 0                55.0   Mkt&HR     58.80  27000.0    True 2020-07-23 2020-10-12       3.0
1       17265      M        79.33      78.33   Science        77.48    Sci&Tech             True                 1                86.5  Mkt&Fin     66.28  20000.0    True 2020-01-11 2020-04-09       3.0
2       17266      M        65.00      68.00      Arts        64.00   Comm&Mgmt            False                 0                75.0  Mkt&Fin     57.80  25000.0    True 2020-01-26 2020-07-13       6.0
3       17267      M        56.00      52.00   Science        52.00    Sci&Tech            False                 0                66.0   Mkt&HR     59.43      NaN   False        NaT        NaT       NaN
4       17268      M        85.80      73.60  Commerce        73.30   Comm&Mgmt            False                 0                96.8  Mkt&Fin     55.50  42500.0    True 2020-07-04 2020-09-27       3.0

If you want to use your custom dataset, you can load it using pandas. For example, if your data is available as a CSV file, you can use the read_csv method.

You can write your metadata as a dictionary. Follow the Metadata guide to create a dictionary for a single table. For example, the metadata for our table looks something like this:

{
    'fields': {
        'start_date': {'type': 'datetime', 'format': '%Y-%m-%d'},
        'end_date': {'type': 'datetime', 'format': '%Y-%m-%d'},
        'salary': {'type': 'numerical', 'subtype': 'integer'},
        'duration': {'type': 'categorical'},
        'student_id': {'type': 'id', 'subtype': 'integer'},
        'high_perc': {'type': 'numerical', 'subtype': 'float'},
        'high_spec': {'type': 'categorical'},
        'mba_spec': {'type': 'categorical'},
        'second_perc': {'type': 'numerical', 'subtype': 'float'},
        'gender': {'type': 'categorical'},
        'degree_perc': {'type': 'numerical', 'subtype': 'float'},
        'placed': {'type': 'boolean'},
        'experience_years': {'type': 'numerical', 'subtype': 'integer'},
        'employability_perc': {'type': 'numerical', 'subtype': 'float'},
        'mba_perc': {'type': 'numerical', 'subtype': 'float'},
        'work_experience': {'type': 'boolean'},
        'degree_type': {'type': 'categorical'}
    },
    'constraints': [],
    'primary_key': 'student_id'
}

Modeling

Pass in your metadata to create the TabularPreset FAST_ML model.

In [4]: from sdv.lite import TabularPreset

# Use the FAST_ML preset to optimize for modeling time
In [5]: model = TabularPreset(name='FAST_ML', metadata=metadata)

Then, simply pass in your data to train the model.

In [6]: model.fit(data)

The modeling step is optimized for speed. The exact time it takes depends on several factors including the number of rows, columns and distinct categories in categorical columns. As a rough benchmark, our analysis shows that:

  • Datasets with around 100K rows and 50-100 columns will take a few minutes to model

  • Larger datasets with around 1M rows and hundreds of columns may take closer to an hour

After you are finished modeling, you can save the fitted model and load it in again for future use.

# save the model in a new file
In [7]: model.save('fast_ml_model.pkl')

# later, you can load it in again
In [8]: model = TabularPreset.load('fast_ml_model.pkl')

Sampling

Once you have your model, you can begin to create synthetic data. Use the sample method and pass in the number of rows you want to synthesize.

In [9]: synthetic_data = model.sample(num_rows=100)

In [10]: synthetic_data.head()
Out[10]: 
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc   salary  placed start_date   end_date  duration
0           0      F    70.931153  76.468415   Science    65.282425    Sci&Tech            False                 0           77.852752  Mkt&Fin  68.294430      NaN   False        NaT        NaT       NaN
1           1      F    87.103279  74.117240  Commerce    81.222904   Comm&Mgmt             True                 1           93.909728   Mkt&HR  72.412533      NaN    True        NaT        NaT       NaN
2           2      F    69.737860  79.646059   Science    70.455549   Comm&Mgmt            False                 0           91.204830  Mkt&Fin  66.028447  23859.0    True 2020-03-07 2020-04-28       3.0
3           3      F    50.731160  52.004504  Commerce    50.067599   Comm&Mgmt            False                 0           50.000000   Mkt&HR  53.152374      NaN   False        NaT        NaT      12.0
4           4      F    77.721778  66.850638      Arts    70.059706      Others             True                 1           65.048467  Mkt&Fin  69.058651      NaN    True        NaT 2021-01-05       NaN

For creating large amounts of synthetic data, provide a batch_size. This breaks up the sampling into multiple batches and shows a progress bar. Use the output_file_path parameter to write results to a file.

In [11]: model.sample(num_rows=1_000_000, batch_size=10_000, output_file_path='synthetic_data.csv')
Out[11]: 
        student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc   salary  placed start_date   end_date  duration
0                0      M    58.765336  86.773910  Commerce    72.517250   Comm&Mgmt             True                 1           81.598581  Mkt&Fin  61.976661  22889.0    True        NaT 2020-08-09       NaN
1                1      M    74.581980  73.973572   Science    70.714832   Comm&Mgmt            False                 0           85.005090   Mkt&HR  66.172339  39333.0    True 2020-03-03        NaT       NaN
2                2      M    67.954566  78.790246   Science    64.994648   Comm&Mgmt            False                 0           62.481884  Mkt&Fin  58.806014  36094.0    True 2020-05-07 2020-12-22       NaN
3                3      F    64.050518  60.348688  Commerce    61.679316    Sci&Tech             True                 1           51.037958  Mkt&Fin  66.630859      NaN   False        NaT        NaT       3.0
4                4      M    61.928477  67.966838   Science    64.011626   Comm&Mgmt             True                 1           51.495443  Mkt&Fin  59.060444      NaN    True 2020-04-15 2020-08-12       3.0
...            ...    ...          ...        ...       ...          ...         ...              ...               ...                 ...      ...        ...      ...     ...        ...        ...       ...
999995        9995      M    56.749616  49.843545   Science    56.351886    Sci&Tech             True                 1           80.384783  Mkt&Fin  61.642184  21894.0   False        NaT 2020-03-03       NaN
999996        9996      F    56.664455  47.212490  Commerce    69.523461   Comm&Mgmt            False                 0           52.064115   Mkt&HR  60.590673  24173.0   False        NaT 2020-07-23       NaN
999997        9997      F    49.928757  63.003384  Commerce    71.180204   Comm&Mgmt            False                 0           76.878479  Mkt&Fin  66.520026      NaN   False        NaT 2020-05-07       NaN
999998        9998      M    53.186535  72.079803  Commerce    59.661398   Comm&Mgmt            False                 0           86.267425  Mkt&Fin  54.540022  26119.0   False 2020-02-06 2020-07-10      12.0
999999        9999      F    79.624475  60.794820   Science    73.978941    Sci&Tech            False                 0           69.582562   Mkt&HR  65.929394      NaN    True        NaT        NaT       3.0

[1000000 rows x 17 columns]

Conditional Sampling

The model generates new synthetic data – synthetic rows that do not refer to the original. But sometimes you may want to fix some values.

For example, you might only be interested in synthesizing science and commerce students with work experience. Using conditional sampling, you can specify the exact, fixed values that you need. The SDV model will then synthesize the rest of the data.

First, use the Condition object to specify the exact values you want. You specify a dictionary of column names and the exact value you want, along with the number of rows to synthesize.

In [12]: from sdv.sampling.tabular import Condition

# 100 science students with work experience
In [13]: science_students = Condition(
   ....:    column_values={'high_spec': 'Science', 'work_experience': True}, num_rows=100)
   ....: 

# 200 commerce students with work experience
In [14]: commerce_students = Condition(
   ....:    column_values={'high_spec': 'Commerce', 'work_experience': True}, num_rows=200)
   ....: 

You can now use the sample_conditions function and pass in a list of conditions.

In [15]: all_conditions = [science_students, commerce_students]

In [16]: model.sample_conditions(conditions=all_conditions)
Out[16]: 
     student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc   salary  placed start_date   end_date  duration
0             0      M    79.969908  77.066978   Science    63.870095   Comm&Mgmt             True                 1           58.165325  Mkt&Fin  66.574283  40712.0    True 2020-01-18        NaT       NaN
1             1      M    54.372487  48.110956   Science    52.429576    Sci&Tech             True                 1           73.154065  Mkt&Fin  64.212517      NaN   False 2019-11-12 2020-03-03      12.0
2             2      F    70.405915  61.636377   Science    71.205577   Comm&Mgmt             True                 1           78.817061  Mkt&Fin  66.935122  32731.0    True        NaT 2020-10-10       NaN
3             3      M    49.655230  64.452432   Science    61.719666   Comm&Mgmt             True                 1           80.301731  Mkt&Fin  63.675328      NaN    True        NaT        NaT       NaN
4             4      M    59.417214  69.366414   Science    57.014539    Sci&Tech             True                 1           82.315317  Mkt&Fin  61.427436      NaN    True 2019-12-29 2020-06-30       NaN
..          ...    ...          ...        ...       ...          ...         ...              ...               ...                 ...      ...        ...      ...     ...        ...        ...       ...
295         295      M    64.869409  71.994898  Commerce    56.459347   Comm&Mgmt             True                 2           50.219674  Mkt&Fin  61.169114      NaN    True        NaT 2020-11-05       NaN
296         296      M    71.398536  73.647993  Commerce    65.327414   Comm&Mgmt             True                 1           75.004980   Mkt&HR  65.147772      NaN    True        NaT 2020-05-05       3.0
297         297      M    48.225334  64.863022  Commerce    55.592551    Sci&Tech             True                 1           69.602957  Mkt&Fin  61.689859  27959.0   False 2019-07-26 2019-12-14       6.0
298         298      M    70.669173  80.278429  Commerce    59.188725   Comm&Mgmt             True                 1           61.605968  Mkt&Fin  62.446519      NaN    True        NaT        NaT       NaN
299         299      M    62.123104  76.862480  Commerce    67.303464    Sci&Tech             True                 1           62.134860  Mkt&Fin  59.850985  20790.0    True        NaT        NaT       6.0

[300 rows x 17 columns]

Advanced Usage

Adding Constraints

A constraint is a logical business rule that must be met by every row in your dataset.

In most cases, the preset is able to learn a general trend and create synthetic data where most of the rows follow the rule. Use a constraint if you want to enforce that all of the rows must follow the rule.

In our dataset, we have a constraint: All the numerical values in the duration column must be divisible by 3. We can describe this using a FixedIncrements constraint.

In [17]: from sdv.constraints import FixedIncrements

# use the formula when defining the constraint
In [18]: duration_constraint = FixedIncrements(
   ....:     column_name='duration',
   ....:     increment_value=3,
   ....: )
   ....: 

You can input constraints into the presets when creating your model.

In [19]: constrained_model = TabularPreset(
   ....:     name='FAST_ML',
   ....:     metadata=metadata,
   ....:     constraints=[duration_constraint],
   ....: )
   ....: 

In [20]: constrained_model.fit(data)

When you sample from the model, the synthetic data will follow the constraints

In [21]: constrained_synthetic_data = constrained_model.sample(num_rows=1_000)

In [22]: constrained_synthetic_data.head(10)
Out[22]: 
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc   salary  placed start_date   end_date  duration
0           0      M    48.340661  54.355252  Commerce    62.174106   Comm&Mgmt            False                 0           75.421465   Mkt&HR  57.191062      NaN   False        NaT        NaT       6.0
1           1      M    60.919328  60.810806  Commerce    69.100238   Comm&Mgmt             True                 1           50.000000  Mkt&Fin  66.367218  20462.0    True 2020-02-24 2020-08-04       NaN
2           2      M    81.948235  65.639248   Science    72.353788    Sci&Tech            False                 1           82.635021   Mkt&HR  72.649535  29500.0    True 2020-01-27 2020-06-10       3.0
3           3      M    69.997129  77.165532  Commerce    60.683959   Comm&Mgmt             True                 0           69.988601  Mkt&Fin  57.738733  25576.0    True 2020-02-26 2020-07-29       NaN
4           4      F    69.385325  50.097808  Commerce    72.128338   Comm&Mgmt            False                 1           70.677518  Mkt&Fin  53.189432  24719.0    True 2020-05-24 2020-07-06       3.0
5           5      F    66.158020  51.053796   Science    59.836460   Comm&Mgmt            False                 0           57.377028  Mkt&Fin  57.205191  21383.0    True 2020-01-13 2020-05-02       NaN
6           6      M    66.955703  44.357827      Arts    55.783784   Comm&Mgmt            False                 0           70.024011   Mkt&HR  52.163880      NaN   False        NaT        NaT       NaN
7           7      M    73.826523  81.360940  Commerce    72.212622   Comm&Mgmt             True                 1           81.838492   Mkt&HR  61.081182  36114.0    True 2020-08-07 2020-12-13       3.0
8           8      M    69.127094  75.031103  Commerce    74.053763   Comm&Mgmt             True                 2           79.873829  Mkt&Fin  67.450824      NaN   False        NaT        NaT       NaN
9           9      M    65.359324  72.711009  Commerce    64.367225   Comm&Mgmt            False                 0           82.818489   Mkt&HR  64.008653      NaN   False        NaT        NaT       3.0

To read more about defining constraints, see the Handling Constraints User Guide.

Resources

The SDV (Synthetic Data Vault) is an open source project built & maintained by DataCebo. It is free to use under the MIT License.

For other resources see our: GitHub, Docs, Blog.