
HMA1 Class

In this guide we will go through a series of steps that will let you discover the functionality of the HMA1 class.

Note

Is the HMA1 algorithm suited for my dataset?

The HMA1 algorithm can be used on various multi-table dataset schemas. Make sure you do not have any cyclical dependencies or missing references.

The HMA1 is designed to capture correlations between different tables with high quality. The algorithm is optimized for datasets with around 5 tables and 2 levels of depth (e.g. a parent and its child tables). You may find that modeling time increases if you have more levels of tables or more columns.

In most uses, we’ve found that a small set of tables and columns is ideal for successfully deploying a synthetic data application. If you are looking for solutions with a larger schema, please contact us at info@sdv.dev.
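If you are not sure whether your tables reference each other cleanly, a quick pandas check can help. The following is a minimal sketch that assumes your tables are loaded as a dict of pandas.DataFrames (as in the Quick Usage section below) and that you describe each relationship as a (child table, foreign key, parent table, primary key) tuple; adapt the names to your own schema:

# Hypothetical relationship list matching the demo schema used below;
# replace these tuples with the relationships of your own dataset.
relationships = [
    ('sessions', 'user_id', 'users', 'user_id'),
    ('transactions', 'session_id', 'sessions', 'session_id'),
]

def find_missing_references(tables, relationships):
    """Return the foreign key values that have no matching parent row."""
    missing = {}
    for child, fk, parent, pk in relationships:
        orphans = set(tables[child][fk]) - set(tables[parent][pk])
        if orphans:
            missing[(child, fk)] = orphans
    return missing

print(find_missing_references(tables, relationships))  # {} if all references resolve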

What is HMA1?

The sdv.relational.HMA1 class implements a Hierarchical Modeling Algorithm: it recursively walks through a relational dataset and applies tabular models across all the tables, in a way that lets the models learn how the fields from all the tables are related.

Let’s now discover how to use the HMA1 class.

Quick Usage

We will start by loading and exploring one of our demo datasets.

In [1]: from sdv import load_demo

In [2]: metadata, tables = load_demo(metadata=True)

This will return two objects:

  1. A Metadata object with all the information that SDV needs to know about the dataset.

In [3]: metadata
Out[3]: 
Metadata
  root_path: .
  tables: ['users', 'sessions', 'transactions']
  relationships:
    sessions.user_id -> users.user_id
    transactions.session_id -> sessions.session_id


In [4]: metadata.visualize();
[Figure: graph of the metadata, showing the users, sessions and transactions tables and the relationships between them]

For more details about how to build the Metadata for your own dataset, please refer to the Relational Metadata Guide.
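If you are building the Metadata for your own tables, the process looks roughly like the sketch below. It uses the Metadata.add_table method from this SDV version, with the primary_key, parent and foreign_key arguments that describe the demo schema; adapt the names to your own dataset:

from sdv import Metadata

# Start from an empty Metadata object and describe each table.
metadata = Metadata()
metadata.add_table(name='users', data=tables['users'], primary_key='user_id')

# Child tables also declare their parent and the foreign key linking them.
metadata.add_table(name='sessions', data=tables['sessions'],
                   primary_key='session_id', parent='users', foreign_key='user_id')
metadata.add_table(name='transactions', data=tables['transactions'],
                   primary_key='transaction_id', parent='sessions', foreign_key='session_id')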

  2. A dictionary containing three pandas.DataFrames with the tables described in the metadata object.

In [5]: tables
Out[5]: 
{'users':    user_id country gender  age
 0        0      US      M   34
 1        1      UK      F   23
 2        2      ES   None   44
 3        3      UK      M   22
 4        4      US      F   54
 5        5      DE      M   57
 6        6      BG      F   45
 7        7      ES   None   41
 8        8      FR      F   23
 9        9      UK   None   30,
 'sessions':    session_id  user_id  device       os  minutes
 0           0        0  mobile  android       23
 1           1        1  tablet      ios       12
 2           2        2  tablet  android        8
 3           3        3  mobile  android       13
 4           4        4  mobile      ios        9
 5           5        5  mobile  android       32
 6           6        6  mobile      ios        7
 7           7        7  tablet      ios       21
 8           8        8  mobile      ios       29
 9           9        9  tablet      ios       34,
 'transactions':    transaction_id  session_id           timestamp  amount  cancelled
 0               0           0 2019-01-01 12:34:32   100.0      False
 1               1           1 2019-01-01 12:42:21    55.3      False
 2               2           2 2019-01-07 17:23:11    79.5      False
 3               3           3 2019-01-10 11:08:57   112.1       True
 4               4           4 2019-01-10 21:54:08   110.0       True
 5               5           5 2019-01-11 11:21:20    76.3      False
 6               6           6 2019-01-22 14:44:10    89.5      False
 7               7           7 2019-01-23 10:14:09   132.1       True
 8               8           8 2019-01-27 16:09:17    68.0      False
 9               9           9 2019-01-29 12:10:48    99.9      False}

Let us now use the HMA1 class to model this data so that we can sample synthetic data about new users. In order to do this you will need to:

  • Import the sdv.relational.HMA1 class and create an instance of it passing the metadata that we just loaded.

  • Call its fit method passing the tables dict.

In [6]: from sdv.relational import HMA1

In [7]: model = HMA1(metadata)

In [8]: model.fit(tables)

Note

During the previous steps SDV walked through all the tables in the dataset following the relationships specified by the metadata, learned each child table using a GaussianCopula model, and then augmented the parent tables with the learned copula parameters before modeling them as well. By doing this, each copula model was able to learn how the child table rows were related to their parent tables.
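To make this idea concrete, here is a deliberately simplified sketch of the augmentation step for a single parent/child pair. It is illustrative only, not SDV's actual implementation: the toy fit_table helper stands in for the GaussianCopula model, and its per-column means stand in for the copula parameters:

# Toy stand-in for a tabular model: the learned "parameters" are just
# the means of the numeric columns. SDV fits a GaussianCopula instead.
def fit_table(df):
    return df.select_dtypes('number').mean()

def fit_parent_with_child(parent, child, primary_key, foreign_key):
    # Learn the child "parameters" separately for each parent row, then
    # attach them to the parent as extra columns before modeling it.
    child_params = child.groupby(foreign_key).apply(fit_table)
    augmented = parent.join(child_params, on=primary_key, rsuffix='_child')
    return fit_table(augmented)

# For the demo data: model 'users' augmented with its 'sessions' rows.
params = fit_parent_with_child(tables['users'], tables['sessions'],
                               primary_key='user_id', foreign_key='user_id')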

Generate synthetic data from the model

Once the training process has finished you are ready to generate new synthetic data by calling the sample method from your model.

In [9]: new_data = model.sample()

This will return a dictionary of tables with the same structure as the one which the model was fitted on, but filled with new data which resembles the original.

In [10]: new_data
Out[10]: 
{'users':    user_id country gender  age
 0        0      ES      M   52
 1        1      UK    NaN   22
 2        2      ES      F   45
 3        3      BG      F   49
 4        4      DE      M   53
 5        5      UK      F   41
 6        6      UK      F   22
 7        7      US    NaN   39
 8        8      ES    NaN   36
 9        9      ES    NaN   52,
 'sessions':    session_id  user_id  device       os  minutes
 0           0        0  tablet      ios        7
 1           1        1  mobile  android       25
 2           2        2  mobile      ios        8
 3           3        3  mobile      ios        7
 4           4        4  mobile  android       29
 5           5        5  mobile  android       18
 6           6        6  tablet      ios       31
 7           7        7  mobile  android       23
 8           8        8  mobile      ios       23
 9           9        9  tablet      ios       28,
 'transactions':    transaction_id  session_id           timestamp  amount  cancelled
 0               0           0 2019-01-20 10:50:26   132.1       True
 1               1           1 2019-01-12 14:33:36   124.0       True
 2               2           2 2019-01-15 19:43:55    94.2      False
 3               3           3 2019-01-13 19:21:56    80.2      False
 4               4           4 2019-01-14 11:42:23    66.7      False
 5               5           5 2019-01-01 12:34:32    82.6      False
 6               6           6 2019-01-28 06:08:47    92.1      False
 7               7           7 2019-01-11 16:15:11   105.9      False
 8               8           8 2019-01-21 17:32:39   113.5       True
 9               9           9 2019-01-12 02:10:43    96.7      False}
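You can quickly confirm that the sampled tables have the same structure as the originals, for example by comparing their columns and row counts:

for name, original in tables.items():
    synthetic = new_data[name]
    print(name, list(synthetic.columns) == list(original.columns), len(synthetic))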

Save and Load the model

In many scenarios it will be convenient to generate synthetic versions of your data directly in systems that do not have access to the original data source. For example, you may want to generate testing data on the fly inside a testing environment that does not have access to your production database. In these scenarios, fitting the model with real data every time that you need to generate new data is not feasible, so you will need to fit a model in your production environment, save the fitted model into a file, send this file to the testing environment and then load it there to be able to sample from it.

Let’s see how this process works.

Save and share the model

Once you have fitted the model, all you need to do is call its save method passing the name of the file in which you want to save the model. Note that the extension of the filename is not relevant, but we will be using the .pkl extension to highlight that the serialization protocol used is cloudpickle.

In [11]: model.save('my_model.pkl')

This will have created a file called my_model.pkl in the same directory in which you are running SDV.

Important

If you inspect the generated file you will notice that its size is much smaller than the size of the data that you used to generate it. This is because the serialized model contains no information about the original data, other than the parameters it needs to generate synthetic versions of it. This means that you can safely share this my_model.pkl file without the risk of disclosing any of your real data!
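If you want to verify this yourself, you can compare the size of the saved file against the in-memory size of the original tables. Note that with this ten-row demo dataset the difference will not be dramatic; the savings show up on real-sized data:

import os

model_size = os.path.getsize('my_model.pkl')  # bytes on disk
data_size = sum(df.memory_usage(deep=True).sum() for df in tables.values())
print(model_size, data_size)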

Load the model and generate new data

The file you just generated can be sent over to the system where the synthetic data will be generated. Once it is there, you can load it using the HMA1.load method, and then you are ready to sample new data from the loaded instance:

In [12]: loaded = HMA1.load('my_model.pkl')

In [13]: new_data = loaded.sample()

In [14]: new_data.keys()
Out[14]: dict_keys(['users', 'sessions', 'transactions'])

Warning

Notice that the system where the model is loaded also needs to have sdv installed; otherwise it will not be able to load the model and use it.

How to control the number of rows?

In the steps above we did not tell the model at any moment how many rows we wanted to sample, so it produced as many rows as there were in the original dataset.

If you want to produce a different number of rows you can pass it as the num_rows argument and it will produce the indicated number of rows:

In [15]: model.sample(num_rows=5)
Out[15]: 
{'users':    user_id country gender  age
 0       10      ES    NaN   32
 1       11      UK      F   46
 2       12      US    NaN   55
 3       13      FR      F   57
 4       14      UK      M   33,
 'sessions':    session_id  user_id  device       os  minutes
 0          10       10  tablet      ios       27
 1          11       11  mobile      ios       34
 2          12       12  tablet  android       30
 3          13       13  tablet      ios        9
 4          14       14  tablet      ios        7,
 'transactions':    transaction_id  session_id           timestamp  amount  cancelled
 0              10          10 2019-01-27 07:23:48   126.0       True
 1              11          11 2019-01-09 17:13:16   110.0       True
 2              12          12 2019-01-03 09:09:53    88.9      False
 3              13          13 2019-01-04 17:39:27    62.7      False
 4              14          14 2019-01-18 03:27:33   116.2       True}

Note

Notice that the root table users has the indicated number of rows but some of the other tables do not. This is because the number of rows in the child tables is sampled based on the values from the parent table, which means that only the root table of the dataset is affected by the passed num_rows argument.
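You can see this by printing the number of rows per table in a sampled output:

sampled = model.sample(num_rows=5)
# Only 'users' is guaranteed to have exactly 5 rows; the row counts of
# 'sessions' and 'transactions' depend on the sampled parent rows.
print({name: len(df) for name, df in sampled.items()})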

Can I sample a subset of the tables?

On some occasions you will not be interested in generating rows for the entire dataset, but would rather generate data for only one table and its children.

To do this you can simply pass the name of the table that you want to sample.

For example, if you pass the name sessions to the sample method, the model will only generate data for the sessions table and its child table, transactions.

In [16]: model.sample('sessions', num_rows=5)
Out[16]: 
{'sessions':    session_id  user_id  device       os  minutes
 0          15       15  mobile  android       16
 1          16       15  tablet      ios       14
 2          17       17  tablet      ios       16
 3          18       16  mobile  android       17
 4          19       17  mobile  android       29,
 'transactions':    transaction_id  session_id           timestamp  amount  cancelled
 0              15          15 2019-01-02 11:13:27    76.3      False
 1              16          16 2019-01-24 22:58:11    94.2      False
 2              17          17 2019-01-09 18:06:22    95.3       True
 3              18          18 2019-01-01 12:34:32    66.9      False
 4              19          19 2019-01-13 10:29:46    73.3      False}

If you want to further restrict the sampling process to only one table and also skip its child tables, you can add the argument sample_children=False.

For example, you can sample data from the users table only, without producing any rows for the sessions and transactions tables.

In [17]: model.sample('users', num_rows=5, sample_children=False)
Out[17]: 
   user_id country gender  age
0       20      ES      M   44
1       21      DE    NaN   45
2       22      ES      F   50
3       23      UK    NaN   25
4       24      US      M   23

Note

In this case, since we are only producing a single table, the output is given directly as a pandas.DataFrame instead of a dictionary.

Can I evaluate the Synthetic Data?

After creating synthetic data, you may be wondering how you can evaluate it against the original data. You can use the SDMetrics library to get more insights, generate reports and visualize the data. This library is automatically installed with SDV.

To get started, visit: https://docs.sdv.dev/sdmetrics/
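As a starting point, a multi-table quality report can be generated roughly as follows. This is a sketch that assumes your installed SDMetrics version exposes the multi-table QualityReport, and metadata_dict is a placeholder for the dataset metadata expressed as a plain dictionary in the format SDMetrics expects (see its documentation for the exact schema):

from sdmetrics.reports.multi_table import QualityReport

# metadata_dict is assumed to describe the tables and relationships in
# the dictionary format documented by SDMetrics.
report = QualityReport()
report.generate(tables, new_data, metadata_dict)
print(report.get_score())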