Try the new SDV 1.0 Beta! We are transitioning to a new version of SDV with improved workflows, new features and an updated documentation site. Click here to go to the new docs pages.
This section explains the format of the metadata JSON file.
At the topmost level of the Metadata dictionary, there is only one element:
Mapping of tables in the dataset, each one represented as a sub-document, with the table name as the corresponding key.
A table node should be defined for each table in your dataset. It contains the configuration for handling that table and has the following elements:
"tables": {
    "users": {
        "fields": {...},
        "path": "users.csv",
        "primary_key": "user_id"
    },
    ...
}
Mapping of fields in the table.
Name of the table.
Relative path to the .csv file from the data root folder. This can be skipped if the data is being passed as pandas.DataFrames.
Name of the field that acts as the primary key of the table.
Optional. If set to false, skip this table when modeling and sampling the dataset.
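As a minimal sketch, the table-level structure described above can be assembled as a Python dictionary and serialized to JSON. The table and field names are illustrative assumptions, not part of any real dataset.

```python
import json

# Hypothetical single-table dataset; all names are illustrative only.
metadata = {
    "tables": {
        "users": {
            "fields": {
                "user_id": {"type": "id"},
                "country": {"type": "categorical"},
            },
            # Relative to the data root folder; can be skipped when the
            # data is passed as pandas.DataFrames.
            "path": "users.csv",
            "primary_key": "user_id",
        }
    }
}

# Serialize to the JSON text that would be stored in the metadata file.
serialized = json.dumps(metadata, indent=4)

# Reading it back yields the same structure.
restored = json.loads(serialized)
```

Writing `serialized` to a file gives a metadata JSON document with `tables` as its single top-level element.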
Each field within a table needs to have its type specified. Additionally, some field types need further details, such as the subtype or other properties.
The available types and subtypes are in this table:
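| Type        | Subtype | Additional Properties          |
|-------------|---------|--------------------------------|
| numerical   | integer |                                |
| numerical   | float   |                                |
| datetime    |         | format                         |
| categorical |         | pii, pii_category, pii_locales |
| boolean     |         |                                |
| id          |         | ref                            |
| id          | string  | ref, regex                     |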
"tables": {
    "users": {
        "fields": {
            "country": {
                "type": "categorical"
            },
            ...
        },
        ...
    },
    ...
}
The type of the field.
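The type names listed in the table above can be used for a quick sanity check of a `fields` mapping before handing it to the library. The sketch below is a hypothetical helper, not part of SDV, and it only checks the `type` key; subtypes and additional properties are left alone.

```python
# Type names transcribed from the table of available types.
KNOWN_TYPES = {"numerical", "datetime", "categorical", "boolean", "id"}


def find_unknown_types(fields):
    """Return {field_name: declared_type} for fields whose type is not listed."""
    return {
        name: spec.get("type")
        for name, spec in fields.items()
        if spec.get("type") not in KNOWN_TYPES
    }


# Illustrative field definitions; "text" is deliberately not a listed type.
fields = {
    "country": {"type": "categorical"},
    "age": {"type": "numerical", "subtype": "integer"},
    "notes": {"type": "text"},
}

unknown = find_unknown_types(fields)
```

An empty result means every field declares one of the listed types.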
For datetime types, a format key should be included containing the date format using strftime format.
"tables": {
    "transactions": {
        "fields": {
            "timestamp": {
                "type": "datetime",
                "format": "%Y-%m-%d"
            },
            ...
        },
        ...
    },
    ...
}
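To illustrate what the `format` value means, the sketch below uses Python's standard `datetime` module to check whether raw values agree with a declared strftime format. The sample values and the `matches_format` helper are assumptions made for illustration.

```python
from datetime import datetime

# Field definition as it would appear in the metadata.
field = {"type": "datetime", "format": "%Y-%m-%d"}


def matches_format(value, fmt):
    """Return True if the string can be parsed with the given strftime format."""
    try:
        datetime.strptime(value, fmt)
    except ValueError:
        return False
    return True


# A value written as year-month-day parses; a day/month/year value does not.
ok = matches_format("2021-03-15", field["format"])
bad = matches_format("15/03/2021", field["format"])

# Parsing and formatting with the same format round-trips the string.
parsed = datetime.strptime("2021-03-15", field["format"])
roundtrip = parsed.strftime(field["format"])
```

Running such a check over a column before modeling can reveal values that do not agree with the declared format.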
For categorical types, there is an option to anonymize data labeled as Personally Identifiable Information (pii) while keeping its statistical properties. To anonymize a field, set the pii key to true and specify a pii_category:
"tables": {
    "users": {
        "fields": {
            "social_security_number": {
                "type": "categorical",
                "pii": true,
                "pii_category": "ssn"
            },
            ...
        },
        ...
    },
    ...
}
The most common supported values of pii_category are in the following table, but any value supported by faker can be used:
pii_category
name
first_name
last_name
phone_number
ssn
credit_card_number
credit_card_security_code
For a full list of available categories, please check the Faker documentation site.
Note
Sometimes Faker categories accept a type, which can be passed as an additional argument. In that case, set pii_category to a list containing both the category and the type instead of a single string: 'pii_category': ['credit_card_number', 'visa']
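The two accepted shapes of pii_category, a plain string or a list of category plus type, can be normalized as in the sketch below. The `split_pii_category` helper is hypothetical and is not SDV's internal implementation; it only shows how the list form separates the Faker category from its extra argument.

```python
def split_pii_category(pii_category):
    """Return (category, args) for either a string or a [category, type] list."""
    if isinstance(pii_category, str):
        return pii_category, []
    category, *args = pii_category
    return category, args


# String form: just the category name.
simple = split_pii_category("ssn")

# List form: the category followed by the type passed as an extra argument.
typed = split_pii_category(["credit_card_number", "visa"])

# With Faker installed, the pair could then be used roughly like this
# (not executed here):
#     from faker import Faker
#     category, args = typed
#     value = getattr(Faker(), category)(*args)
```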
To generate anonymized values in specific languages, set the pii_locales parameter to a list of localizations that include their country codes. A list of all possible localizations can be found on the Faker documentation site.
"tables": {
    "users": {
        "fields": {
            "address": {
                "type": "categorical",
                "pii": true,
                "pii_category": "address",
                "pii_locales": ["sv_SE", "en_US"]
            },
            ...
        },
        ...
    },
    ...
}
Specifying localizations and using Faker categories may result in an error if the defined pii_category is not available for all specified languages.
If a field is specified as a primary_key of the table, then the field must be of type id:
"tables": {
    "users": {
        "fields": {
            "user_id": {
                "name": "user_id",
                "type": "id"
            },
            ...
        },
        ...
    },
    ...
}
If the subtype of the primary key is string, an optional regular expression can be passed to generate keys that match it:
"tables": {
    "users": {
        "fields": {
            "user_id": {
                "name": "user_id",
                "type": "id",
                "subtype": "string",
                "regex": "[a-zA-Z]{10}"
            },
            ...
        },
        ...
    },
    ...
}
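To see which keys a pattern such as `[a-zA-Z]{10}` admits, the sketch below checks candidate values against the field's `regex` with Python's standard `re` module. The `is_valid_key` helper and the sample values are assumptions for illustration.

```python
import re

# Primary key definition as it would appear in the metadata.
field = {"type": "id", "subtype": "string", "regex": "[a-zA-Z]{10}"}

pattern = re.compile(field["regex"])


def is_valid_key(value):
    """Return True if the whole string matches the field's regular expression."""
    return pattern.fullmatch(value) is not None


ten_letters = is_valid_key("abcdefghij")      # exactly ten letters
with_digits = is_valid_key("abc1234567")      # digits are not allowed
eleven_letters = is_valid_key("abcdefghijk")  # one character too long
```

`fullmatch` is used rather than `search` so that keys longer than the pattern are rejected.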
If a field is a foreign key to another table, then it must also be of type id and define a relationship using the ref field:
"tables": {
    "sessions": {
        "fields": {
            "user_id": {
                "type": "id",
                "ref": {
                    "field": "user_id",
                    "table": "users"
                }
            },
            ...
        },
        ...
    },
    ...
}
table: Name of the parent table.
field: Name of the referenced field in the parent table.
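A metadata dictionary can be checked for dangling relationships by following each ref to its parent table and field. The sketch below is a hypothetical validation helper, not part of SDV, and the two-table dataset is an illustrative assumption.

```python
# Hypothetical parent/child dataset; names are illustrative only.
metadata = {
    "tables": {
        "users": {
            "fields": {"user_id": {"type": "id"}},
            "primary_key": "user_id",
        },
        "sessions": {
            "fields": {
                "session_id": {"type": "id"},
                "user_id": {
                    "type": "id",
                    "ref": {"table": "users", "field": "user_id"},
                },
            },
            "primary_key": "session_id",
        },
    }
}


def find_broken_refs(meta):
    """Return (table, field) pairs whose ref points to a missing table or field."""
    tables = meta["tables"]
    broken = []
    for table_name, table in tables.items():
        for field_name, field in table["fields"].items():
            ref = field.get("ref")
            if ref is None:
                continue  # not a foreign key
            parent = tables.get(ref["table"])
            if parent is None or ref["field"] not in parent["fields"]:
                broken.append((table_name, field_name))
    return broken


broken = find_broken_refs(metadata)
```

For the dataset above the result is an empty list, since `sessions.user_id` refers to an existing field in `users`.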