The TabularPreset is a tabular model that comes with pre-configured settings. It is meant for users who want to get started with synthetic data quickly, without spending time worrying about which model to choose or how to tune its parameters.
Note
We are currently Beta testing our speed-optimized machine learning preset. Help us by testing the model and filing issues for any bugs or feature requests you may have.
The FAST_ML preset is our first preset. It uses machine learning (ML) to model your data while optimizing for the modeling time. This is a great choice if it’s your first time using the SDV for a large custom dataset or if you’re exploring the benefits of using ML to create synthetic data.
This preset optimizes for the modeling time while still applying machine learning to model and generate synthetic data.
Your synthetic data will capture correlations that exist between the columns of the original data.
Your synthetic data will adhere to the basic statistical properties of the original columns: min/max values, averages and standard deviations.
While other SDV models may create higher quality synthetic data, they will take longer to train. The FAST_ML preset lets you get started with ML-generated synthetic data right away.
To use this preset, you must have:
Your data, loaded as a pandas DataFrame, and
(Optional but strongly recommended) A metadata file that describes the columns of your dataset
For this guide, we’ll load the demo data and metadata from the SDV. This data contains information about students, including their grades, major and work experience.
In [1]: from sdv.demo import load_tabular_demo

In [2]: metadata, data = load_tabular_demo('student_placements', metadata=True)

In [3]: data.head()
Out[3]:
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec  mba_perc   salary  placed start_date   end_date  duration
0       17264      M        67.00      91.00  Commerce        58.00    Sci&Tech            False                 0                55.0   Mkt&HR     58.80  27000.0    True 2020-07-23 2020-10-12       3.0
1       17265      M        79.33      78.33   Science        77.48    Sci&Tech             True                 1                86.5  Mkt&Fin     66.28  20000.0    True 2020-01-11 2020-04-09       3.0
2       17266      M        65.00      68.00      Arts        64.00   Comm&Mgmt            False                 0                75.0  Mkt&Fin     57.80  25000.0    True 2020-01-26 2020-07-13       6.0
3       17267      M        56.00      52.00   Science        52.00    Sci&Tech            False                 0                66.0   Mkt&HR     59.43      NaN   False        NaT        NaT       NaN
4       17268      M        85.80      73.60  Commerce        73.30   Comm&Mgmt            False                 0                96.8  Mkt&Fin     55.50  42500.0    True 2020-07-04 2020-09-27       3.0
If you want to use your own dataset, you can load it using pandas. For example, if your data is available as a CSV file, you can use the read_csv function, as in the sketch below.
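Here is a minimal sketch with pandas; the file name my_dataset.csv and the parse_dates columns are placeholders for your own data:

import pandas as pd

# load your own dataset from a CSV file;
# parse_dates is optional and only needed if your file has datetime columns
data = pd.read_csv('my_dataset.csv', parse_dates=['start_date', 'end_date'])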
You can write your metadata as a dictionary. Follow the Metadata guide to create a dictionary for a single table. For example, the metadata for our table looks something like this:
{
    'fields': {
        'start_date': {'type': 'datetime', 'format': '%Y-%m-%d'},
        'end_date': {'type': 'datetime', 'format': '%Y-%m-%d'},
        'salary': {'type': 'numerical', 'subtype': 'integer'},
        'duration': {'type': 'categorical'},
        'student_id': {'type': 'id', 'subtype': 'integer'},
        'high_perc': {'type': 'numerical', 'subtype': 'float'},
        'high_spec': {'type': 'categorical'},
        'mba_spec': {'type': 'categorical'},
        'second_perc': {'type': 'numerical', 'subtype': 'float'},
        'gender': {'type': 'categorical'},
        'degree_perc': {'type': 'numerical', 'subtype': 'float'},
        'placed': {'type': 'boolean'},
        'experience_years': {'type': 'numerical', 'subtype': 'integer'},
        'employability_perc': {'type': 'numerical', 'subtype': 'float'},
        'mba_perc': {'type': 'numerical', 'subtype': 'float'},
        'work_experience': {'type': 'boolean'},
        'degree_type': {'type': 'categorical'}
    },
    'constraints': [],
    'primary_key': 'student_id'
}
Pass in your metadata to create the TabularPreset FAST_ML model.
In [4]: from sdv.lite import TabularPreset

# Use the FAST_ML preset to optimize for modeling time
In [5]: model = TabularPreset(name='FAST_ML', metadata=metadata)
Then, simply pass in your data to train the model.
In [6]: model.fit(data)
The modeling step is optimized for speed. The exact time it takes depends on several factors including the number of rows, columns and distinct categories in categorical columns. As a rough benchmark, our analysis shows that:
Datasets with around 100K rows and 50-100 columns will take a few minutes to model
Larger datasets with around 1M rows and hundreds of columns may take closer to an hour
After you are finished modeling, you can save the fitted model and load it in again for future use.
# save the model in a new file
In [7]: model.save('fast_ml_model.pkl')

# later, you can load it in again
In [8]: model = TabularPreset.load('fast_ml_model.pkl')
Once you have your model, you can begin to create synthetic data. Use the sample method and pass in the number of rows you want to synthesize.
In [9]: synthetic_data = model.sample(num_rows=100)

In [10]: synthetic_data.head()
Out[10]:
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc   salary  placed start_date   end_date  duration
0           0      F    70.931153  76.468415   Science    65.282425    Sci&Tech            False                 0           77.852752  Mkt&Fin  68.294430      NaN   False        NaT        NaT       NaN
1           1      F    87.103279  74.117240  Commerce    81.222904   Comm&Mgmt             True                 1           93.909728   Mkt&HR  72.412533      NaN    True        NaT        NaT       NaN
2           2      F    69.737860  79.646059   Science    70.455549   Comm&Mgmt            False                 0           91.204830  Mkt&Fin  66.028447  23859.0    True 2020-03-07 2020-04-28       3.0
3           3      F    50.731160  52.004504  Commerce    50.067599   Comm&Mgmt            False                 0           50.000000   Mkt&HR  53.152374      NaN   False        NaT        NaT      12.0
4           4      F    77.721778  66.850638      Arts    70.059706      Others             True                 1           65.048467  Mkt&Fin  69.058651      NaN    True        NaT 2021-01-05       NaN
For creating large amounts of synthetic data, provide a batch_size. This breaks up the sampling into multiple batches and shows a progress bar. Use the output_file_path parameter to write results to a file.
In [11]: model.sample(num_rows=1_000_000, batch_size=10_000, output_file_path='synthetic_data.csv')
Out[11]:
        student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc   salary  placed start_date   end_date  duration
0                0      M    58.765336  86.773910  Commerce    72.517250   Comm&Mgmt             True                 1           81.598581  Mkt&Fin  61.976661  22889.0    True        NaT 2020-08-09       NaN
1                1      M    74.581980  73.973572   Science    70.714832   Comm&Mgmt            False                 0           85.005090   Mkt&HR  66.172339  39333.0    True 2020-03-03        NaT       NaN
2                2      M    67.954566  78.790246   Science    64.994648   Comm&Mgmt            False                 0           62.481884  Mkt&Fin  58.806014  36094.0    True 2020-05-07 2020-12-22       NaN
3                3      F    64.050518  60.348688  Commerce    61.679316    Sci&Tech             True                 1           51.037958  Mkt&Fin  66.630859      NaN   False        NaT        NaT       3.0
4                4      M    61.928477  67.966838   Science    64.011626   Comm&Mgmt             True                 1           51.495443  Mkt&Fin  59.060444      NaN    True 2020-04-15 2020-08-12       3.0
...            ...    ...          ...        ...       ...          ...         ...              ...               ...                 ...      ...        ...      ...     ...        ...        ...       ...
999995        9995      M    56.749616  49.843545   Science    56.351886    Sci&Tech             True                 1           80.384783  Mkt&Fin  61.642184  21894.0   False        NaT 2020-03-03       NaN
999996        9996      F    56.664455  47.212490  Commerce    69.523461   Comm&Mgmt            False                 0           52.064115   Mkt&HR  60.590673  24173.0   False        NaT 2020-07-23       NaN
999997        9997      F    49.928757  63.003384  Commerce    71.180204   Comm&Mgmt            False                 0           76.878479  Mkt&Fin  66.520026      NaN   False        NaT 2020-05-07       NaN
999998        9998      M    53.186535  72.079803  Commerce    59.661398   Comm&Mgmt            False                 0           86.267425  Mkt&Fin  54.540022  26119.0   False 2020-02-06 2020-07-10      12.0
999999        9999      F    79.624475  60.794820   Science    73.978941    Sci&Tech            False                 0           69.582562   Mkt&HR  65.929394      NaN    True        NaT        NaT       3.0

[1000000 rows x 17 columns]
The model generates new synthetic data: rows that do not correspond to any rows in the original data. But sometimes you may want to fix the values of certain columns.
For example, you might only be interested in synthesizing science and commerce students with work experience. Using conditional sampling, you can specify the exact, fixed values that you need. The SDV model will then synthesize the rest of the data.
First, use the Condition object to specify the exact values you want. Pass a dictionary that maps column names to fixed values, along with the number of rows to synthesize.
In [12]: from sdv.sampling.tabular import Condition

# 100 science students with work experience
In [13]: science_students = Condition(
   ....:     column_values={'high_spec': 'Science', 'work_experience': True},
   ....:     num_rows=100)
   ....:

# 200 commerce students with work experience
In [14]: commerce_students = Condition(
   ....:     column_values={'high_spec': 'Commerce', 'work_experience': True},
   ....:     num_rows=200)
   ....:
You can now use the sample_conditions method and pass in a list of conditions.
In [15]: all_conditions = [science_students, commerce_students]

In [16]: model.sample_conditions(conditions=all_conditions)
Out[16]:
     student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc   salary  placed start_date   end_date  duration
0             0      M    79.969908  77.066978   Science    63.870095   Comm&Mgmt             True                 1           58.165325  Mkt&Fin  66.574283  40712.0    True 2020-01-18        NaT       NaN
1             1      M    54.372487  48.110956   Science    52.429576    Sci&Tech             True                 1           73.154065  Mkt&Fin  64.212517      NaN   False 2019-11-12 2020-03-03      12.0
2             2      F    70.405915  61.636377   Science    71.205577   Comm&Mgmt             True                 1           78.817061  Mkt&Fin  66.935122  32731.0    True        NaT 2020-10-10       NaN
3             3      M    49.655230  64.452432   Science    61.719666   Comm&Mgmt             True                 1           80.301731  Mkt&Fin  63.675328      NaN    True        NaT        NaT       NaN
4             4      M    59.417214  69.366414   Science    57.014539    Sci&Tech             True                 1           82.315317  Mkt&Fin  61.427436      NaN    True 2019-12-29 2020-06-30       NaN
..          ...    ...          ...        ...       ...          ...         ...              ...               ...                 ...      ...        ...      ...     ...        ...        ...       ...
295         295      M    64.869409  71.994898  Commerce    56.459347   Comm&Mgmt             True                 2           50.219674  Mkt&Fin  61.169114      NaN    True        NaT 2020-11-05       NaN
296         296      M    71.398536  73.647993  Commerce    65.327414   Comm&Mgmt             True                 1           75.004980   Mkt&HR  65.147772      NaN    True        NaT 2020-05-05       3.0
297         297      M    48.225334  64.863022  Commerce    55.592551    Sci&Tech             True                 1           69.602957  Mkt&Fin  61.689859  27959.0   False 2019-07-26 2019-12-14       6.0
298         298      M    70.669173  80.278429  Commerce    59.188725   Comm&Mgmt             True                 1           61.605968  Mkt&Fin  62.446519      NaN    True        NaT        NaT       NaN
299         299      M    62.123104  76.862480  Commerce    67.303464    Sci&Tech             True                 1           62.134860  Mkt&Fin  59.850985  20790.0    True        NaT        NaT       6.0

[300 rows x 17 columns]
A constraint is a logical business rule that must be met by every row in your dataset.
In most cases, the preset is able to learn a general trend and create synthetic data where most of the rows follow the rule. Use a constraint if you want to enforce that all of the rows must follow the rule.
In our dataset, we have a constraint: All the numerical values in the duration column must be divisible by 3. We can describe this using a FixedIncrements constraint.
In [17]: from sdv.constraints import FixedIncrements

# every value in the duration column must be an increment of 3
In [18]: duration_constraint = FixedIncrements(
   ....:     column_name='duration',
   ....:     increment_value=3,
   ....: )
   ....:
You can pass constraints into the preset when creating your model.
In [19]: constrained_model = TabularPreset(
   ....:     name='FAST_ML',
   ....:     metadata=metadata,
   ....:     constraints=[duration_constraint],
   ....: )
   ....:

In [20]: constrained_model.fit(data)
When you sample from the model, the synthetic data will follow the constraint.
In [21]: constrained_synthetic_data = constrained_model.sample(num_rows=1_000)

In [22]: constrained_synthetic_data.head(10)
Out[22]:
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc   salary  placed start_date   end_date  duration
0           0      M    48.340661  54.355252  Commerce    62.174106   Comm&Mgmt            False                 0           75.421465   Mkt&HR  57.191062      NaN   False        NaT        NaT       6.0
1           1      M    60.919328  60.810806  Commerce    69.100238   Comm&Mgmt             True                 1           50.000000  Mkt&Fin  66.367218  20462.0    True 2020-02-24 2020-08-04       NaN
2           2      M    81.948235  65.639248   Science    72.353788    Sci&Tech            False                 1           82.635021   Mkt&HR  72.649535  29500.0    True 2020-01-27 2020-06-10       3.0
3           3      M    69.997129  77.165532  Commerce    60.683959   Comm&Mgmt             True                 0           69.988601  Mkt&Fin  57.738733  25576.0    True 2020-02-26 2020-07-29       NaN
4           4      F    69.385325  50.097808  Commerce    72.128338   Comm&Mgmt            False                 1           70.677518  Mkt&Fin  53.189432  24719.0    True 2020-05-24 2020-07-06       3.0
5           5      F    66.158020  51.053796   Science    59.836460   Comm&Mgmt            False                 0           57.377028  Mkt&Fin  57.205191  21383.0    True 2020-01-13 2020-05-02       NaN
6           6      M    66.955703  44.357827      Arts    55.783784   Comm&Mgmt            False                 0           70.024011   Mkt&HR  52.163880      NaN   False        NaT        NaT       NaN
7           7      M    73.826523  81.360940  Commerce    72.212622   Comm&Mgmt             True                 1           81.838492   Mkt&HR  61.081182  36114.0    True 2020-08-07 2020-12-13       3.0
8           8      M    69.127094  75.031103  Commerce    74.053763   Comm&Mgmt             True                 2           79.873829  Mkt&Fin  67.450824      NaN   False        NaT        NaT       NaN
9           9      M    65.359324  72.711009  Commerce    64.367225   Comm&Mgmt            False                 0           82.818489   Mkt&HR  64.008653      NaN   False        NaT        NaT       3.0
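As a quick sanity check, you can confirm that the rule holds using plain pandas; this is not part of the SDV API, just a verification sketch against the synthetic data sampled above:

# every non-null duration value in the synthetic data should be a multiple of 3
durations = constrained_synthetic_data['duration'].dropna()
assert (durations % 3 == 0).all()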
To read more about defining constraints, see the Handling Constraints User Guide.
The SDV (Synthetic Data Vault) is an open source project built & maintained by DataCebo. It is free to use under the MIT License.
For other resources, see our GitHub, Docs, and Blog.