Danger
You are looking at the documentation for an older version of the SDV! We are no longer supporting or maintaining this version of the software.
Click here to go to the new docs pages.
In this guide we will go through a series of steps that will let you discover the functionalities of the HMA1 class.
Note
Is the HMA1 algorithm suited for my dataset?
The HMA1 algorithm can be used on various multi-table dataset schemas. Make sure you do not have any cyclical dependencies or missing references.
The HMA1 is designed to capture correlations between different tables with high quality. The algorithm is optimized for datasets with around 5 tables and 2 levels of depth (e.g. a parent and its child table). You may find that the modeling time increases if you have multiple levels of tables and more columns.
In most uses, we’ve found that a small set of tables and columns are ideal for successfully deploying a synthetic data application. If you are looking for solutions with a larger schema, please contact us at info@sdv.dev.
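Before fitting, it can be useful to verify that your schema meets these requirements. The following is a minimal sketch of such a check; `check_schema` is a hypothetical helper written for this guide, not part of the SDV API, and it assumes relationships are given as `(child_table, foreign_key, parent_table, primary_key)` tuples.

```python
import pandas as pd

# Hypothetical helper (not part of SDV): verify that a multi-table schema
# has no missing references and no cyclic dependencies before fitting HMA1.
def check_schema(tables, relationships):
    """tables: dict of DataFrames; relationships: list of
    (child_table, foreign_key, parent_table, primary_key) tuples."""
    # 1. Missing references: every foreign key value must exist in the parent.
    for child, fk, parent, pk in relationships:
        missing = ~tables[child][fk].isin(tables[parent][pk])
        if missing.any():
            raise ValueError(f"{child}.{fk} has {int(missing.sum())} missing references")

    # 2. Cyclic dependencies: walk parent links; reaching a table twice means a cycle.
    parents = {}
    for child, _, parent, _ in relationships:
        parents.setdefault(child, set()).add(parent)

    def visit(table, seen):
        if table in seen:
            raise ValueError(f"cyclic dependency involving {table!r}")
        for p in parents.get(table, ()):
            visit(p, seen | {table})

    for table in tables:
        visit(table, set())

users = pd.DataFrame({"user_id": [0, 1]})
sessions = pd.DataFrame({"session_id": [0], "user_id": [0]})
check_schema(
    {"users": users, "sessions": sessions},
    [("sessions", "user_id", "users", "user_id")],
)  # passes silently: all references resolve and there are no cycles
```

A schema with a dangling foreign key, or with two tables that reference each other, would raise a `ValueError` instead.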
The sdv.relational.HMA1 class implements a Hierarchical Modeling Algorithm: it recursively walks through a relational dataset and applies tabular models across all the tables in a way that lets the models learn how the fields from all the tables are related.
Let’s now discover how to use the HMA1 class.
We will start by loading and exploring one of our demo datasets.
In [1]: from sdv import load_demo

In [2]: metadata, tables = load_demo(metadata=True)
This will return two objects:
A Metadata object with all the information that SDV needs to know about the dataset.
In [3]: metadata
Out[3]:
Metadata
    root_path: .
    tables: ['users', 'sessions', 'transactions']
    relationships:
        sessions.user_id -> users.user_id
        transactions.session_id -> sessions.session_id

In [4]: metadata.visualize();
For more details about how to build the Metadata for your own dataset, please refer to the Relational Metadata Guide.
A dictionary containing three pandas.DataFrames with the tables described in the metadata object.
In [5]: tables
Out[5]:
{'users':
    user_id country gender  age
 0        0      US      M   34
 1        1      UK      F   23
 2        2      ES   None   44
 3        3      UK      M   22
 4        4      US      F   54
 5        5      DE      M   57
 6        6      BG      F   45
 7        7      ES   None   41
 8        8      FR      F   23
 9        9      UK   None   30,
 'sessions':
    session_id  user_id  device       os  minutes
 0           0        0  mobile  android       23
 1           1        1  tablet      ios       12
 2           2        2  tablet  android        8
 3           3        3  mobile  android       13
 4           4        4  mobile      ios        9
 5           5        5  mobile  android       32
 6           6        6  mobile      ios        7
 7           7        7  tablet      ios       21
 8           8        8  mobile      ios       29
 9           9        9  tablet      ios       34,
 'transactions':
    transaction_id  session_id            timestamp  amount  cancelled
 0               0           0  2019-01-01 12:34:32   100.0      False
 1               1           1  2019-01-01 12:42:21    55.3      False
 2               2           2  2019-01-07 17:23:11    79.5      False
 3               3           3  2019-01-10 11:08:57   112.1       True
 4               4           4  2019-01-10 21:54:08   110.0       True
 5               5           5  2019-01-11 11:21:20    76.3      False
 6               6           6  2019-01-22 14:44:10    89.5      False
 7               7           7  2019-01-23 10:14:09   132.1       True
 8               8           8  2019-01-27 16:09:17    68.0      False
 9               9           9  2019-01-29 12:10:48    99.9      False}
Let us now use the HMA1 class to model this data so that we will be ready to sample synthetic data about new users. In order to do this you will need to:
Import the sdv.relational.HMA1 class and create an instance of it passing the metadata that we just loaded.
Call its fit method passing the tables dict.
In [6]: from sdv.relational import HMA1

In [7]: model = HMA1(metadata)

In [8]: model.fit(tables)
During the previous steps SDV walked through all the tables in the dataset following the relationships specified by the metadata, learned each table using a GaussianCopula Model and then augmented the parent tables using the copula parameters before learning them. By doing this, each copula model was able to learn how the child table rows were related to their parent tables.
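The augmentation idea can be illustrated with a highly simplified sketch. This is not the actual HMA1 implementation: where HMA1 derives copula parameters from a fitted child model, the stand-ins below are simple per-parent summary statistics computed with pandas. The point is only to show how child-table structure can be folded into extra parent columns.

```python
import pandas as pd

# Simplified illustration of the augmentation step (not actual HMA1 code):
# summarize each user's sessions into per-parent parameters and attach them
# as extra columns, so a model fitted on the extended parent table also
# captures how child rows relate to their parents.
users = pd.DataFrame({"user_id": [0, 1, 2], "age": [34, 23, 44]})
sessions = pd.DataFrame({
    "user_id": [0, 0, 1, 2, 2, 2],
    "minutes": [23, 12, 8, 13, 9, 32],
})

# Stand-ins for the learned child-model parameters: row count, mean and std.
params = sessions.groupby("user_id")["minutes"].agg(
    num_sessions="count", minutes_mean="mean", minutes_std="std"
)
extended_users = users.join(params, on="user_id")
print(extended_users)
```

Fitting a single tabular model on `extended_users` then captures both the parent columns and a summary of each parent's children, which is the essence of the hierarchical approach.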
Once the training process has finished you are ready to generate new synthetic data by calling the sample method from your model.
In [9]: new_data = model.sample()
This will return a dictionary of tables with the same structure as the one on which the model was fitted, but filled with new data that resembles the original.
In [10]: new_data
Out[10]:
{'users':
    user_id country gender  age
 0        0      ES      M   52
 1        1      UK    NaN   22
 2        2      ES      F   45
 3        3      BG      F   49
 4        4      DE      M   53
 5        5      UK      F   41
 6        6      UK      F   22
 7        7      US    NaN   39
 8        8      ES    NaN   36
 9        9      ES    NaN   52,
 'sessions':
    session_id  user_id  device       os  minutes
 0           0        0  tablet      ios        7
 1           1        1  mobile  android       25
 2           2        2  mobile      ios        8
 3           3        3  mobile      ios        7
 4           4        4  mobile  android       29
 5           5        5  mobile  android       18
 6           6        6  tablet      ios       31
 7           7        7  mobile  android       23
 8           8        8  mobile      ios       23
 9           9        9  tablet      ios       28,
 'transactions':
    transaction_id  session_id            timestamp  amount  cancelled
 0               0           0  2019-01-20 10:50:26   132.1       True
 1               1           1  2019-01-12 14:33:36   124.0       True
 2               2           2  2019-01-15 19:43:55    94.2      False
 3               3           3  2019-01-13 19:21:56    80.2      False
 4               4           4  2019-01-14 11:42:23    66.7      False
 5               5           5  2019-01-01 12:34:32    82.6      False
 6               6           6  2019-01-28 06:08:47    92.1      False
 7               7           7  2019-01-11 16:15:11   105.9      False
 8               8           8  2019-01-21 17:32:39   113.5       True
 9               9           9  2019-01-12 02:10:43    96.7      False}
In many scenarios it will be convenient to generate synthetic versions of your data directly in systems that do not have access to the original data source. For example, you may want to generate testing data on the fly inside a testing environment that does not have access to your production database. In these scenarios, fitting the model with real data every time that you need to generate new data is not feasible, so you will need to fit a model in your production environment, save the fitted model into a file, send this file to the testing environment and then load it there to be able to sample from it.
Let’s see how this process works.
Once you have fitted the model, all you need to do is call its save method passing the name of the file in which you want to save the model. Note that the extension of the filename is not relevant, but we will be using the .pkl extension to highlight that the serialization protocol used is cloudpickle.
In [11]: model.save('my_model.pkl')
This will have created a file called my_model.pkl in the same directory in which you are running SDV.
Important
If you inspect the generated file you will notice that its size is much smaller than the size of the data that you used to generate it. This is because the serialized model contains no information about the original data, other than the parameters it needs to generate synthetic versions of it. This means that you can safely share this my_model.pkl file without the risk of disclosing any of your real data!
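The size difference can be illustrated without SDV at all. The sketch below is not SDV internals; it simply contrasts pickling a large table with pickling a small dictionary of distribution parameters, which is all a fitted model conceptually needs to keep.

```python
import pickle

import numpy as np
import pandas as pd

# Illustration (not SDV internals): a fitted model stores distribution
# parameters rather than the training rows, so its serialized size stays
# small no matter how large the training data was.
data = pd.DataFrame(
    {"amount": np.random.default_rng(0).normal(100, 20, 100_000)}
)
params = {"amount": {"loc": data["amount"].mean(),
                     "scale": data["amount"].std()}}

data_bytes = len(pickle.dumps(data))    # hundreds of kilobytes
model_bytes = len(pickle.dumps(params)) # a few hundred bytes
print(model_bytes, "<<", data_bytes)
```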
The file you just generated can be sent over to the system where the synthetic data will be generated. Once it is there, you can load it using the HMA1.load method, and then you are ready to sample new data from the loaded instance:
In [12]: loaded = HMA1.load('my_model.pkl')

In [13]: new_data = loaded.sample()

In [14]: new_data.keys()
Out[14]: dict_keys(['users', 'sessions', 'transactions'])
Warning
Notice that the system where the model is loaded also needs to have sdv installed; otherwise it will not be able to load the model and use it.
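The save / send / load workflow can be sketched with a stand-in model object and the stdlib pickle module (SDV itself serializes with cloudpickle, but the shape of the workflow is the same; `StandInModel` is an invented class for this illustration).

```python
import os
import pickle
import tempfile

# Stand-in for a fitted SDV model: it carries only learned parameters
# and can sample from them.
class StandInModel:
    def __init__(self, params):
        self.params = params

    def sample(self):
        return {"users": [self.params["mean"]]}

model = StandInModel({"mean": 42})

path = os.path.join(tempfile.mkdtemp(), "my_model.pkl")
with open(path, "wb") as f:    # analogous to model.save('my_model.pkl')
    pickle.dump(model, f)

with open(path, "rb") as f:    # analogous to HMA1.load('my_model.pkl')
    loaded = pickle.load(f)

print(loaded.sample())         # the loaded instance samples as before
```

This also shows why the warning above matters: unpickling needs the defining class (here `StandInModel`, in SDV's case the classes inside the sdv package) to be importable on the loading system.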
In the steps above we did not tell the model at any moment how many rows we wanted to sample, so it produced as many rows as there were in the original dataset.
If you want to produce a different number of rows you can pass it as the num_rows argument and it will produce the indicated number of rows:
In [15]: model.sample(num_rows=5)
Out[15]:
{'users':
    user_id country gender  age
 0       10      ES    NaN   32
 1       11      UK      F   46
 2       12      US    NaN   55
 3       13      FR      F   57
 4       14      UK      M   33,
 'sessions':
    session_id  user_id  device       os  minutes
 0          10       10  tablet      ios       27
 1          11       11  mobile      ios       34
 2          12       12  tablet  android       30
 3          13       13  tablet      ios        9
 4          14       14  tablet      ios        7,
 'transactions':
    transaction_id  session_id            timestamp  amount  cancelled
 0              10          10  2019-01-27 07:23:48   126.0       True
 1              11          11  2019-01-09 17:13:16   110.0       True
 2              12          12  2019-01-03 09:09:53    88.9      False
 3              13          13  2019-01-04 17:39:27    62.7      False
 4              14          14  2019-01-18 03:27:33   116.2       True}
Notice that the root table users has the indicated number of rows but some of the other tables do not. This is because the number of rows in the child tables is sampled based on the values from the parent table, which means that only the root table of the dataset is affected by the passed num_rows argument.
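A toy sketch (not SDV code) makes the behavior concrete: the root table gets exactly `num_rows` rows, while each parent row draws its own child count, so the child totals vary. Here `random.randint` stands in for whatever per-parent count distribution the model learned.

```python
import random

# Toy illustration of why num_rows only fixes the root table: child row
# counts are drawn per parent row, so child totals are not controlled
# directly by num_rows.
random.seed(0)

num_rows = 5
parents = list(range(num_rows))          # exactly num_rows root rows
children = []
for parent_id in parents:
    # per-parent child count, standing in for the learned distribution
    for _ in range(random.randint(0, 3)):
        children.append({"user_id": parent_id})

print(len(parents), len(children))  # parent count is fixed; child count varies
```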
On some occasions you will not be interested in generating rows for the entire dataset and would rather generate data for only one table and its children.
To do this you can simply pass the name of the table that you want to sample.
For example, if you pass the name sessions to the sample method, the model will only generate data for the sessions table and its child table, transactions.
In [16]: model.sample('sessions', num_rows=5)
Out[16]:
{'sessions':
    session_id  user_id  device       os  minutes
 0          15       15  mobile  android       16
 1          16       15  tablet      ios       14
 2          17       17  tablet      ios       16
 3          18       16  mobile  android       17
 4          19       17  mobile  android       29,
 'transactions':
    transaction_id  session_id            timestamp  amount  cancelled
 0              15          15  2019-01-02 11:13:27    76.3      False
 1              16          16  2019-01-24 22:58:11    94.2      False
 2              17          17  2019-01-09 18:06:22    95.3       True
 3              18          18  2019-01-01 12:34:32    66.9      False
 4              19          19  2019-01-13 10:29:46    73.3      False}
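The "table and its children" selection amounts to a walk down the relationship graph. The sketch below is not SDV internals, just a minimal traversal over a hand-written child map matching the demo schema.

```python
# Sketch (not SDV internals) of collecting a table together with all of its
# descendants, mirroring what sampling 'sessions' does in the example above.
children = {
    "users": ["sessions"],
    "sessions": ["transactions"],
    "transactions": [],
}

def table_and_descendants(table):
    found = [table]
    for child in children[table]:
        found.extend(table_and_descendants(child))
    return found

print(table_and_descendants("sessions"))  # ['sessions', 'transactions']
```

Starting from 'users' would instead select all three tables, which is why the default sample() call generates the whole dataset.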
If you want to further restrict the sampling process to only one table and also skip its child tables, you can add the argument sample_children=False.
For example, you can sample data from the table users only without producing any rows for the tables sessions and transactions.
In [17]: model.sample('users', num_rows=5, sample_children=False)
Out[17]:
   user_id country gender  age
0       20      ES      M   44
1       21      DE    NaN   45
2       22      ES      F   50
3       23      UK    NaN   25
4       24      US      M   23
In this case, since we are only producing a single table, the output is given directly as a pandas.DataFrame instead of a dictionary.
After creating synthetic data, you may be wondering how you can evaluate it against the original data. You can use the SDMetrics library to get more insights, generate reports and visualize the data. This library is automatically installed with SDV.
To get started, visit: https://docs.sdv.dev/sdmetrics/