Try the new SDV 1.0 Beta! We are transitioning to a new version of SDV with improved workflows, new features and an updated documentation site. Click here to go to the new docs pages.
Try the new SDV 1.0 Beta!
We are transitioning to a new version of SDV with improved workflows, new features and an updated documentation site.
Click here to go to the new docs pages.
In order to work with complex dataset structures you will need to pass additional information about your data to SDV using Metadata.
Metadata
Let’s go over an example to see how to use it.
In order to follow this tutorial you will need to load one of the demo datasets included in SDV by calling the sdv.load_demo function.
sdv.load_demo
In [1]: from sdv import load_demo In [2]: tables = load_demo()
The output of this function will be a dict that contains several tables as pandas.DataFarmes.
pandas.DataFarmes
In [3]: tables Out[3]: {'users': user_id country gender age 0 0 US M 34 1 1 UK F 23 2 2 ES None 44 3 3 UK M 22 4 4 US F 54 5 5 DE M 57 6 6 BG F 45 7 7 ES None 41 8 8 FR F 23 9 9 UK None 30, 'sessions': session_id user_id device os minutes 0 0 0 mobile android 23 1 1 1 tablet ios 12 2 2 2 tablet android 8 3 3 3 mobile android 13 4 4 4 mobile ios 9 5 5 5 mobile android 32 6 6 6 mobile ios 7 7 7 7 tablet ios 21 8 8 8 mobile ios 29 9 9 9 tablet ios 34, 'transactions': transaction_id session_id timestamp amount cancelled 0 0 0 2019-01-01 12:34:32 100.0 False 1 1 1 2019-01-01 12:42:21 55.3 False 2 2 2 2019-01-07 17:23:11 79.5 False 3 3 3 2019-01-10 11:08:57 112.1 True 4 4 4 2019-01-10 21:54:08 110.0 True 5 5 5 2019-01-11 11:21:20 76.3 False 6 6 6 2019-01-22 14:44:10 89.5 False 7 7 7 2019-01-23 10:14:09 132.1 True 8 8 8 2019-01-27 16:09:17 68.0 False 9 9 9 2019-01-29 12:10:48 99.9 False}
You can observe that this dataset has the following structure:
A users table with:
users
A user_id field which acts as the table primary key.
user_id
Two categorical fields, country and gender, and an integer column age.
country
gender
age
A sessions table with:
sessions
A session_id field which acts as the table primary key.
session_id
A user_id field which is a foreign key to the users table and defines a One-to-Many relationship with it (one user can have many sessions).
user
Two categorical fields device and os.
device
os
A transactions table with:
transactions
A transaction_id field which acts as the table primary key.
transaction_id
A session_id field which is a foreign key to the sessions table and defines a One-to-Many relationship with it (one user can make many transactions within a single session).
session
A datetime field, timestamp.
timestamp
A float field, amount.
amount
A boolean field, approved.
approved
Let us now see how to build a Metadata object that represents this dataset.
In SDV, the structure of a dataset is represented using the class sdv.Metadata, which contains all the information that SDV needs in order to optimally learn the dataset and generate synthetic versions of it.
sdv.Metadata
In order to create a Metadata for our dataset, you will first need to import the class and create an empty instance:
In [4]: from sdv import Metadata In [5]: metadata = Metadata() In [6]: metadata Out[6]: Metadata root_path: . tables: [] relationships:
Once you have your Metadata instance ready you can start adding tables.
In this example, you will add the table users, which is the parent table of your dataset, indicating which is its Primary Key field, user_id.
Note that indicating the Primary Key is optional and can be skipped if your table has none, but if a table does not have one, you will not be able to add any child tables to it.
In [7]: metadata.add_table( ...: name='users', ...: data=tables['users'], ...: primary_key='user_id' ...: ) ...:
At this point, our metadata only contains one table and, of course, no relationships:
In [8]: metadata Out[8]: Metadata root_path: . tables: ['users'] relationships:
However, the Metadata instance will have already analyzed all the columns in the passed table and identified the different data types and subtypes, and will properly indicate that the user_id column is the table primary key.
You can see so by calling the visualize method of your metadata instance:
visualize
metadata
In [9]: metadata.visualize();
Or you can obtain this information in a machine-readable format by calling the get_table_meta method:
get_table_meta
In [10]: metadata.get_table_meta('users') Out[10]: {'fields': {'user_id': {'type': 'id', 'subtype': 'integer'}, 'country': {'type': 'categorical'}, 'gender': {'type': 'categorical'}, 'age': {'type': 'numerical', 'subtype': 'integer'}}, 'primary_key': 'user_id'}
Now that you have one parent table, you will add its child table sessions, which is related to the users table in a parent-child relationship where each user can have multiple sessions, and each session belongs to one and only one user.
In order to specify this, while creating the sessions table you have to indicate the name of the parent table, users, and the field from the sessions table that acts as the foreign key, which in this case is called user_id.
With this, apart from analyzing all the columns and indicating the primary key like in the previous step, the Metadata instance will specify a relationship between the two tables by adding a property to the user_id field that indicates that it is related to the user_id field in the users table.
In [11]: metadata.add_table( ....: name='sessions', ....: data=tables['sessions'], ....: primary_key='session_id', ....: parent='users', ....: foreign_key='user_id' ....: ) ....:
Now we can see how the table and the relationship have been registered:
In [12]: metadata Out[12]: Metadata root_path: . tables: ['users', 'sessions'] relationships: sessions.user_id -> users.user_id
There are situations where the Metadata analysis is not able to figure out some data types or subtypes, or to deduce some properties of the field such as the datetime format.
In these situations, you can pass a dictionary with the exact metadata of those fields, which will overwrite the deductions from the analysis process.
In this next example, you will be adding a transactions table, which is related to the previous sessions table, and contains a datetime field which needs to have the datetime format specified.
datetime
In [13]: transactions_fields = { ....: 'timestamp': { ....: 'type': 'datetime', ....: 'format': '%Y-%m-%d' ....: } ....: } ....: In [14]: metadata.add_table( ....: name='transactions', ....: data=tables['transactions'], ....: fields_metadata=transactions_fields, ....: primary_key='transaction_id', ....: parent='sessions' ....: ) ....:
Let’s see what our Metadata looks like right now:
In [15]: metadata Out[15]: Metadata root_path: . tables: ['users', 'sessions', 'transactions'] relationships: sessions.user_id -> users.user_id transactions.session_id -> sessions.session_id In [16]: metadata.to_dict() Out[16]: {'tables': {'users': {'fields': {'user_id': {'type': 'id', 'subtype': 'integer'}, 'country': {'type': 'categorical'}, 'gender': {'type': 'categorical'}, 'age': {'type': 'numerical', 'subtype': 'integer'}}, 'primary_key': 'user_id'}, 'sessions': {'fields': {'session_id': {'type': 'id', 'subtype': 'integer'}, 'user_id': {'type': 'id', 'subtype': 'integer', 'ref': {'table': 'users', 'field': 'user_id'}}, 'device': {'type': 'categorical'}, 'os': {'type': 'categorical'}, 'minutes': {'type': 'numerical', 'subtype': 'integer'}}, 'primary_key': 'session_id'}, 'transactions': {'fields': {'timestamp': {'type': 'datetime', 'format': '%Y-%m-%d'}, 'transaction_id': {'type': 'id', 'subtype': 'integer'}, 'session_id': {'type': 'id', 'subtype': 'integer', 'ref': {'table': 'sessions', 'field': 'session_id'}}, 'amount': {'type': 'numerical', 'subtype': 'float'}, 'cancelled': {'type': 'boolean'}}, 'primary_key': 'transaction_id'}}} In [17]: metadata.visualize();
The Metadata objects can also be saved as a JSON file and later on loaded from them.
In order to save the current metadata as a JSON file, all you need to do is call the to_json method passing the path to the JSON file that you want to create.
to_json
In [18]: metadata.to_json('demo_metadata.json')
You can see that the contents of the created file are very similar to the dict representation of the metadata:
dict
In [19]: with open('demo_metadata.json') as meta_file: ....: print(meta_file.read()) ....: { "tables": { "users": { "fields": { "user_id": { "type": "id", "subtype": "integer" }, "country": { "type": "categorical" }, "gender": { "type": "categorical" }, "age": { "type": "numerical", "subtype": "integer" } }, "primary_key": "user_id" }, "sessions": { "fields": { "session_id": { "type": "id", "subtype": "integer" }, "user_id": { "type": "id", "subtype": "integer", "ref": { "table": "users", "field": "user_id" } }, "device": { "type": "categorical" }, "os": { "type": "categorical" }, "minutes": { "type": "numerical", "subtype": "integer" } }, "primary_key": "session_id" }, "transactions": { "fields": { "timestamp": { "type": "datetime", "format": "%Y-%m-%d" }, "transaction_id": { "type": "id", "subtype": "integer" }, "session_id": { "type": "id", "subtype": "integer", "ref": { "table": "sessions", "field": "session_id" } }, "amount": { "type": "numerical", "subtype": "float" }, "cancelled": { "type": "boolean" } }, "primary_key": "transaction_id" } } }
After creating the JSON file, loading it back as a metadata object is as simple as passing it to the Metadata constructor:
In [20]: loaded = Metadata('demo_metadata.json') In [21]: loaded Out[21]: Metadata root_path: tables: ['users', 'sessions', 'transactions'] relationships: sessions.user_id -> users.user_id transactions.session_id -> sessions.session_id