Danger: You are looking at the documentation for an older version of the SDV! We are no longer supporting or maintaining this version of the software. Click here to go to the new docs pages.
In this short tutorial we will guide you through a series of steps that will help you get started with SDV.
To model a multi-table, relational dataset, we follow two steps. In the first step, we load the data and configure the metadata. In the second step, we use the SDV API to fit and save a hierarchical model. We will cover these two steps in this section using an example dataset.
SDV comes with a toy dataset to play with, which can be loaded using the sdv.load_demo function:
In [1]: from sdv import load_demo

In [2]: metadata, tables = load_demo(metadata=True)
This will return two objects:
A Metadata object with all the information that SDV needs to know about the dataset.
In [3]: metadata
Out[3]:
Metadata
  root_path: .
  tables: ['users', 'sessions', 'transactions']
  relationships:
    sessions.user_id -> users.user_id
    transactions.session_id -> sessions.session_id

In [4]: metadata.visualize();
For more details about how to build the Metadata for your own dataset, please refer to the Relational Metadata guide.
A dictionary containing three pandas.DataFrames with the tables described in the metadata object.
In [5]: tables
Out[5]:
{'users':    user_id  country  gender  age
 0        0       US       M   34
 1        1       UK       F   23
 2        2       ES    None   44
 3        3       UK       M   22
 4        4       US       F   54
 5        5       DE       M   57
 6        6       BG       F   45
 7        7       ES    None   41
 8        8       FR       F   23
 9        9       UK    None   30,
 'sessions':    session_id  user_id  device       os  minutes
 0           0        0  mobile  android       23
 1           1        1  tablet      ios       12
 2           2        2  tablet  android        8
 3           3        3  mobile  android       13
 4           4        4  mobile      ios        9
 5           5        5  mobile  android       32
 6           6        6  mobile      ios        7
 7           7        7  tablet      ios       21
 8           8        8  mobile      ios       29
 9           9        9  tablet      ios       34,
 'transactions':    transaction_id  session_id            timestamp  amount  cancelled
 0               0           0  2019-01-01 12:34:32   100.0      False
 1               1           1  2019-01-01 12:42:21    55.3      False
 2               2           2  2019-01-07 17:23:11    79.5      False
 3               3           3  2019-01-10 11:08:57   112.1       True
 4               4           4  2019-01-10 21:54:08   110.0       True
 5               5           5  2019-01-11 11:21:20    76.3      False
 6               6           6  2019-01-22 14:44:10    89.5      False
 7               7           7  2019-01-23 10:14:09   132.1       True
 8               8           8  2019-01-27 16:09:17    68.0      False
 9               9           9  2019-01-29 12:10:48    99.9      False}
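The relationships printed by the metadata imply that every foreign key value in a child table should also appear as a primary key in its parent table. As a rough illustration, independent of SDV, the sketch below checks this with plain Python over a few rows copied from the demo tables. The helper name `find_orphans` and the list-of-dicts layout are assumptions made for this example; they are not part of the SDV API.

```python
# Plain-Python stand-ins for a few rows of the demo tables (not pandas objects).
tables = {
    "users": [
        {"user_id": 0, "country": "US"},
        {"user_id": 1, "country": "UK"},
    ],
    "sessions": [
        {"session_id": 0, "user_id": 0, "device": "mobile"},
        {"session_id": 1, "user_id": 1, "device": "tablet"},
    ],
    "transactions": [
        {"transaction_id": 0, "session_id": 0, "amount": 100.0},
        {"transaction_id": 1, "session_id": 1, "amount": 55.3},
    ],
}

# (child table, foreign key, parent table, primary key), as listed in the metadata.
relationships = [
    ("sessions", "user_id", "users", "user_id"),
    ("transactions", "session_id", "sessions", "session_id"),
]


def find_orphans(tables, relationships):
    """Return (child, foreign_key, value) for every foreign key with no parent row."""
    orphans = []
    for child, fk, parent, pk in relationships:
        parent_keys = {row[pk] for row in tables[parent]}
        for row in tables[child]:
            if row[fk] not in parent_keys:
                orphans.append((child, fk, row[fk]))
    return orphans


orphans = find_orphans(tables, relationships)
print(orphans)  # an empty list means every child row points at an existing parent
```

A check of this kind can be useful before fitting, since a model that follows the relationships presumably expects each child row to resolve to a parent row.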
First, we build a hierarchical statistical model of the data using SDV. For this we will create an instance of the sdv.SDV class and use its fit method.
During this process, SDV will traverse across all the tables in your dataset following the primary key-foreign key relationships and learn the probability distributions of the values in the columns.
In [6]: from sdv import SDV

In [7]: sdv = SDV()

In [8]: sdv.fit(metadata, tables)
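The traversal described above follows the primary key-foreign key links between tables. How SDV orders this traversal internally is not covered here, so the following is only a minimal sketch of one way to visit tables so that each parent comes before its children, using the relationships from the demo metadata. The function name `parents_first` is hypothetical and the approach (Kahn's topological sort) is an assumption chosen for illustration.

```python
from collections import deque

table_names = ["users", "sessions", "transactions"]

# (child table, parent table) pairs taken from the demo metadata.
relationships = [
    ("sessions", "users"),         # sessions.user_id -> users.user_id
    ("transactions", "sessions"),  # transactions.session_id -> sessions.session_id
]


def parents_first(table_names, relationships):
    """Order tables so that each parent appears before its children."""
    children = {name: [] for name in table_names}
    pending_parents = {name: 0 for name in table_names}
    for child, parent in relationships:
        children[parent].append(child)
        pending_parents[child] += 1

    # Start from the tables that have no parent at all.
    queue = deque(name for name in table_names if pending_parents[name] == 0)
    order = []
    while queue:
        table = queue.popleft()
        order.append(table)
        for child in children[table]:
            pending_parents[child] -= 1
            if pending_parents[child] == 0:
                queue.append(child)

    if len(order) != len(table_names):
        raise ValueError("relationships contain a cycle")
    return order


order = parents_first(table_names, relationships)
print(order)  # ['users', 'sessions', 'transactions']
```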
Once the modeling has finished you are ready to generate new synthetic data using the sdv instance that you have.
For this, call the sample_all method of your instance, optionally passing the number of rows that you want to generate:
In [9]: sampled = sdv.sample_all()
This will return a dictionary with the same structure as the tables dictionary that we passed to the SDV instance for learning, filled in with new synthetic data.
Note
Only the parent tables of your dataset will have the specified number of rows, as the number of child rows that each row in the parent table has is also sampled following the original distribution of your dataset.
In [10]: sampled
Out[10]:
{'users':    user_id  country  gender  age
 0        0       DE       F   26
 1        1       UK       M   24
 2        2       ES       F   33
 3        3       UK     NaN   25
 4        4       ES     NaN   24
 5        5       UK       F   29
 6        6       UK       M   31
 7        7       FR       M   49
 8        8       ES       M   41
 9        9       US       F   22,
 'sessions':    session_id  user_id  device       os  minutes
 0           0        0  mobile      ios       20
 1           1        1  tablet      ios       16
 2           2        2  mobile      ios       11
 3           3        3  mobile  android       34
 4           4        4  tablet      ios       31
 5           5        5  tablet      ios        7
 6           6        6  mobile  android       12
 7           7        7  mobile  android       13
 8           8        8  mobile      ios       13
 9           9        9  mobile      ios       13,
 'transactions':    transaction_id  session_id            timestamp  amount  cancelled
 0               0           0  2019-01-29 05:13:58    73.3      False
 1               1           1  2019-01-01 12:34:32    68.1      False
 2               2           2  2019-01-19 07:47:50   125.2       True
 3               3           3  2019-01-15 08:26:51   106.8      False
 4               4           4  2019-01-27 14:14:41    74.9      False
 5               5           5  2019-01-01 12:34:32    80.7       True
 6               6           6  2019-01-08 11:05:22    87.2       True
 7               7           7  2019-01-19 19:12:45    70.7      False
 8               8           8  2019-01-17 12:27:03   100.8       True
 9               9           9  2019-01-08 02:57:15    99.4      False}
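The note above says that the number of child rows per parent row is itself sampled from the original data. SDV uses its own statistical model for this; as a simplified stand-in, the sketch below resamples the observed child counts directly. The foreign key values are invented for the example (in the demo data every parent has exactly one child, which would make the idea hard to see), and the function names are hypothetical.

```python
import random
from collections import Counter

# Invented example data, NOT the demo tables: four parent rows and the
# foreign key of each child row.
parent_ids = [0, 1, 2, 3]
child_foreign_keys = [0, 0, 1, 3, 3, 3]


def child_counts(parent_ids, child_foreign_keys):
    """Number of child rows observed for each parent row, including zeros."""
    counts = Counter(child_foreign_keys)
    return [counts.get(pid, 0) for pid in parent_ids]


def sample_child_counts(observed_counts, num_parents, rng):
    """Draw a child count for each new parent from the observed counts."""
    return [rng.choice(observed_counts) for _ in range(num_parents)]


observed = child_counts(parent_ids, child_foreign_keys)  # [2, 1, 0, 3]
rng = random.Random(0)
new_counts = sample_child_counts(observed, num_parents=5, rng=rng)

# The parent table gets exactly 5 rows; the child table gets sum(new_counts) rows.
print(new_counts, sum(new_counts))
```

This is why only the parent tables are guaranteed to have the requested number of rows: the size of each child table is the sum of the sampled counts.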
In some cases, you might want to save the fitted SDV instance so that you can generate synthetic data from it later or on a different system.
To do so, call the save method of your instance.
In [11]: sdv.save('sdv.pkl')
The generated pkl file will not include any of the original data in it, so it can be safely sent to where the synthetic data will be generated without any privacy concerns.
Later on, in order to sample data from the fitted model, we will first need to load it from its pkl file.
In [12]: sdv = SDV.load('sdv.pkl')
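To make the save and load round trip concrete, here is a minimal sketch using Python's standard pickle module. `ToyModel` is a hypothetical stand-in, not the SDV class, and the assumption that a fitted model can be reduced to a few learned parameters is made purely for illustration; SDV's actual file contents are not described here.

```python
import os
import pickle
import tempfile


class ToyModel:
    """Hypothetical stand-in for a fitted model: it keeps summary statistics only."""

    def __init__(self):
        self.mean_age = None

    def fit(self, ages):
        # Store an aggregate parameter; the training rows themselves are not kept.
        self.mean_age = sum(ages) / len(ages)

    def save(self, path):
        # Pickle a plain dict of learned parameters.
        with open(path, "wb") as handle:
            pickle.dump({"mean_age": self.mean_age}, handle)

    @classmethod
    def load(cls, path):
        with open(path, "rb") as handle:
            params = pickle.load(handle)
        model = cls()
        model.mean_age = params["mean_age"]
        return model


model = ToyModel()
model.fit([34, 23, 44, 22, 54])  # first five ages from the demo users table

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "toy.pkl")
    model.save(path)
    restored = ToyModel.load(path)

print(restored.mean_age == model.mean_age)  # True
```

The restored object carries the learned parameter but none of the rows it was fitted on, which mirrors the idea that the saved file can be moved to another system without the original data.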
After loading the instance, we can sample synthetic data using its sample_all method like before.
In [13]: sampled = sdv.sample_all(5)

In [14]: sampled
Out[14]:
{'users':    user_id  country  gender  age
 0       10       ES       F   57
 1       11       BG       F   22
 2       12       UK     NaN   22
 3       13       US       M   48
 4       14       ES       M   50,
 'sessions':    session_id  user_id  device       os  minutes
 0          10       10  mobile      ios       28
 1          11       11  tablet      ios       25
 2          12       12  tablet  android       10
 3          13       13  mobile  android       17
 4          14       14  tablet  android       25,
 'transactions':    transaction_id  session_id            timestamp  amount  cancelled
 0              10          10  2019-01-22 17:18:27   100.2      False
 1              11          11  2019-01-29 10:57:33    55.3      False
 2              12          12  2019-01-01 12:34:32    63.8      False
 3              13          13  2019-01-04 22:30:47   105.3       True
 4              14          14  2019-01-10 08:58:21    56.7      False}