In this short tutorial, we will guide you through a series of steps that will help you get started with SDV.
To model a multi-table, relational dataset, we follow two steps. In the first step, we load the data and configure the metadata. In the second step, we use the SDV API to fit and save a hierarchical model. We will cover these two steps in this section using an example dataset.
SDV comes with a toy dataset to play with, which can be loaded using the sdv.load_demo function:
In [1]: from sdv import load_demo

In [2]: metadata, tables = load_demo(metadata=True)
This will return two objects:
A Metadata object with all the information that SDV needs to know about the dataset.
In [3]: metadata
Out[3]:
Metadata
  root_path: .
  tables: ['users', 'sessions', 'transactions']
  relationships:
    sessions.user_id -> users.user_id
    transactions.session_id -> sessions.session_id

In [4]: metadata.visualize();
For more details about how to build the Metadata for your own dataset, please refer to the Relational Metadata guide.
A dictionary containing three pandas.DataFrames with the tables described in the metadata object.
In [5]: tables
Out[5]:
{'users':    user_id country gender  age
 0        0      US      M   34
 1        1      UK      F   23
 2        2      ES   None   44
 3        3      UK      M   22
 4        4      US      F   54
 5        5      DE      M   57
 6        6      BG      F   45
 7        7      ES   None   41
 8        8      FR      F   23
 9        9      UK   None   30,
 'sessions':    session_id  user_id  device       os
 0           0        0  mobile  android
 1           1        1  tablet      ios
 2           2        1  tablet  android
 3           3        2  mobile  android
 4           4        4  mobile      ios
 5           5        5  mobile  android
 6           6        6  mobile      ios
 7           7        6  tablet      ios
 8           8        6  mobile      ios
 9           9        8  tablet      ios,
 'transactions':    transaction_id  session_id           timestamp  amount  approved
 0               0           0 2019-01-01 12:34:32   100.0      True
 1               1           0 2019-01-01 12:42:21    55.3      True
 2               2           1 2019-01-07 17:23:11    79.5      True
 3               3           3 2019-01-10 11:08:57   112.1     False
 4               4           5 2019-01-10 21:54:08   110.0     False
 5               5           5 2019-01-11 11:21:20    76.3      True
 6               6           7 2019-01-22 14:44:10    89.5      True
 7               7           8 2019-01-23 10:14:09   132.1     False
 8               8           9 2019-01-27 16:09:17    68.0      True
 9               9           9 2019-01-29 12:10:48    99.9      True}
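The relationships listed in the metadata imply that every sessions.user_id value must exist in users.user_id, and every transactions.session_id value in sessions.session_id. The following is a minimal sketch of that check. It uses plain Python lists copied from the demo output above rather than the DataFrames themselves, so that it has no dependencies, and `orphaned_keys` is a hypothetical helper written for this illustration, not part of SDV.

```python
# Key columns copied from the demo tables shown above.
users_user_id = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
sessions_session_id = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
sessions_user_id = [0, 1, 1, 2, 4, 5, 6, 6, 6, 8]
transactions_session_id = [0, 0, 1, 3, 5, 5, 7, 8, 9, 9]


def orphaned_keys(child_keys, parent_keys):
    """Return the sorted child key values that have no matching parent key.

    Hypothetical helper for illustration only; it is not an SDV function.
    """
    return sorted(set(child_keys) - set(parent_keys))


# Check both relationships declared in the metadata.
orphans_sessions = orphaned_keys(sessions_user_id, users_user_id)
orphans_transactions = orphaned_keys(transactions_session_id, sessions_session_id)
print(orphans_sessions, orphans_transactions)
```

An empty list for each relationship means that every child row points at an existing parent row.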
First, we build a hierarchical statistical model of the data using SDV. For this we will create an instance of the sdv.SDV class and use its fit method.
During this process, SDV will traverse across all the tables in your dataset following the primary key-foreign key relationships and learn the probability distributions of the values in the columns.
In [6]: from sdv import SDV

In [7]: sdv = SDV()

In [8]: sdv.fit(metadata, tables)
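To make "following the primary key-foreign key relationships" more concrete, the sketch below derives an order in which every parent table comes before its children, using only the two relationships printed by the metadata object. It illustrates the dependency structure only. This guide does not state the order in which SDV itself visits the tables, and the tuple layout used for `relationships` is an assumption made for this example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# (child_table, foreign_key, parent_table, primary_key), as printed by the metadata.
# This tuple layout is an assumption for the sketch, not SDV's internal format.
relationships = [
    ("sessions", "user_id", "users", "user_id"),
    ("transactions", "session_id", "sessions", "session_id"),
]

# Map each table to the set of parent tables it depends on.
parents = {"users": set(), "sessions": set(), "transactions": set()}
for child, _foreign_key, parent, _primary_key in relationships:
    parents[child].add(parent)

# A topological order lists every parent before any of its children.
parent_first = list(TopologicalSorter(parents).static_order())
print(parent_first)
```

For the demo dataset the relationships form a single chain, so only one such order exists: users, then sessions, then transactions.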
Once the modeling has finished, you are ready to generate new synthetic data using the sdv instance that you have.
To do so, call the sample_all method of your instance, optionally passing the number of rows that you want to generate. The call below omits that argument and relies on the method's default:
In [9]: sampled = sdv.sample_all()
This will return a dictionary with the same structure as the tables dictionary that we passed to the SDV instance for learning, filled with new synthetic data.
Note
Only the parent tables of your dataset will have the specified number of rows, as the number of child rows that each row in the parent table has is also sampled following the original distribution of your dataset.
In [10]: sampled
Out[10]:
{'users':    user_id country gender  age
 0        0      ES      M   39
 1        1      US      M   55
 2        2      US    NaN   37
 3        3      ES      F   52
 4        4      ES      F   40
 5        5      ES      F   54
 6        6      ES      M   43
 7        7      UK      M   26
 8        8      ES      F   38
 9        9      DE    NaN   40,
 'sessions':     session_id  user_id  device       os
 0            0        0  mobile      ios
 1            1        1  mobile  android
 2            2        2  tablet      ios
 3            3        2  tablet      ios
 4            4        3  mobile      ios
 5            5        4  mobile      ios
 6            6        4  mobile      ios
 7            7        5  mobile      ios
 8            8        6  mobile  android
 9            9        7  tablet      ios
 10          10        8  mobile      ios
 11          11        8  mobile      ios,
 'transactions':    transaction_id  session_id           timestamp      amount  approved
 0               0           0 2019-01-18 10:26:18  100.068818     False
 1               1           0 2019-01-18 10:26:18  100.072386     False
 2               2           1 2019-01-07 02:02:52  101.303152     False
 3               3           4 2019-01-14 12:38:24   88.105085      True
 4               4           7 2019-01-11 22:05:08   90.699881      True
 5               5           8 2019-01-03 02:59:17   92.933921      True
 6               6          10 2019-01-09 18:44:37   93.508463      True
 7               7          11 2019-01-09 18:44:38   93.509013      True}
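The note above explains why the child tables do not have a fixed size: the number of child rows per parent row is itself sampled. The sketch below illustrates the idea with the demo data, counting sessions per user and then drawing counts for new users from that empirical distribution. This is a deliberate simplification and an assumption of this example. SDV fits a statistical model, and this guide does not describe how it models child counts internally.

```python
import random
from collections import Counter

# Key columns copied from the demo tables shown earlier.
users_user_id = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
sessions_user_id = [0, 1, 1, 2, 4, 5, 6, 6, 6, 8]

# Number of sessions per user in the original data, including users with none.
per_user = Counter(sessions_user_id)
child_counts = [per_user.get(user_id, 0) for user_id in users_user_id]

# Draw a child count for each of 5 new parent rows from the observed counts.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
sampled_counts = rng.choices(child_counts, k=5)
total_child_rows = sum(sampled_counts)
print(child_counts, sampled_counts, total_child_rows)
```

Because each parent row gets its own sampled count, the total number of child rows varies from one run to the next unless the seed is fixed, which matches the differing sessions table sizes in the outputs of this guide.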
In some cases, you might want to save the fitted SDV instance to be able to generate synthetic data from it later or on a different system.
To do so, use the save method of your fitted SDV instance.
In [11]: sdv.save('sdv.pkl')
The generated pkl file will not include any of the original data in it, so it can be safely sent to where the synthetic data will be generated without any privacy concerns.
Later on, in order to sample data from the fitted model, we will first need to load it from its pkl file.
In [12]: sdv = SDV.load('sdv.pkl')
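The save and load steps follow the usual pattern of serializing a fitted object to disk and restoring it elsewhere. The sketch below shows that round trip with Python's standard pickle module, on a toy "model" that keeps only summary statistics of the users.age column. The function `fit_toy_model` and the dictionary it returns are hypothetical stand-ins for this illustration. They say nothing about what an SDV pkl file actually contains.

```python
import os
import pickle
import statistics
import tempfile


def fit_toy_model(values):
    """Hypothetical stand-in for fitting: keep summary statistics, not rows."""
    return {"mean": statistics.mean(values), "std": statistics.pstdev(values)}


ages = [34, 23, 44, 22, 54, 57, 45, 41, 23, 30]  # users.age from the demo table
model = fit_toy_model(ages)

# Save the fitted parameters to a pkl file, then load them back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.pkl")
    with open(path, "wb") as pkl_file:
        pickle.dump(model, pkl_file)
    with open(path, "rb") as pkl_file:
        loaded = pickle.load(pkl_file)

# The restored object carries the learned parameters only.
print(loaded)
```

The file written here holds two numbers rather than the ten original ages, which is the property the guide describes for the saved SDV instance.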
After loading the instance, we can sample synthetic data using its sample_all method like before.
In [13]: sampled = sdv.sample_all(5)

In [14]: sampled
Out[14]:
{'users':    user_id country gender  age
 0       10      US      F   34
 1       11      UK      M   21
 2       12      UK    NaN   29
 3       13      ES      F   30
 4       14      FR      F   28,
 'sessions':    session_id  user_id  device       os
 0          12       10  mobile      ios
 1          13       10  mobile      ios
 2          14       12  tablet  android
 3          15       13  tablet      ios
 4          16       13  tablet      ios
 5          17       14  tablet      ios
 6          18       14  tablet      ios,
 'transactions':    transaction_id  session_id           timestamp      amount  approved
 0               8          12 2019-01-04 07:23:39   81.271372      True
 1               9          13 2019-01-04 07:26:09   81.328094      True
 2              10          14 2019-01-07 03:38:30  102.661007     False
 3              11          15 2019-01-20 13:24:31   49.658808      True
 4              12          15 2019-01-26 08:21:40  529.961810      True
 5              13          16 2019-01-24 04:47:43  -17.853099      True
 6              14          16 2019-01-22 22:34:51   23.305019      True
 7              15          17 2019-01-18 06:49:22   86.159834      True
 8              16          18 2019-01-18 06:49:18   86.134607      True}