Danger

You are looking at the documentation for an older version of the SDV! We are no longer supporting or maintaining this version of the software

Click here to go to the new docs pages.

Metadata Schema

This section explains the format of the metadata JSON file.

Top Level

At the topmost level of the Metadata dictionary, there is only one element:

Tables

Mapping of tables in the dataset, each one represented as a sub-document, with the table name as the corresponding key.

Table

A node table should be made for each table in your dataset. It contains the configuration on how to handle this table. It has the following elements:

"tables": {
    "users": {
        "fields": {...},
        "path": "users.csv",
        "primary_key": "user_id"
    },
    ...
}
Fields

Mapping of fields in the table.

Name

Name of the table.

Path

Relative path to the .csv file from the data root folder. This can be skipped if the data is being passed as pandas.DataFrames.

Primary_key

Name of the field that act as a primary key of the table.

Use

Optional. If set to false, skip this table when modeling and sampling the dataset.

Field details

Each field within a table needs to have its type specified, Additionally, some field types need additional details, such as the subtype or other properties.

The available types and subtypes are in this table:

Type

Subtype

Additional Properties

numerical

integer

integer

numerical

float

float

datetime

format

categorical

pii, pii_category, pii_locales

boolean

id

integer

ref

id

string

ref, regex

"tables": {
    "users": {
        "fields": {
            "country": {
                "type": "categorical"
            },
            ...
        },
        ...
    },
    ...
}
Type

The type of the field.

Datetime fields

For datetime types, a format key should be included containing the date format using strftime format.

"tables": {
    "transactions": {
        "fields": {
            "timestamp": {
                "type": "datetime",
                "format": "%Y-%m-%d"
            },
            ...
        },
        ...
    },
    ...
}

Categorical fields (Data anonymization)

For categorical types, there is an option to anonymize data labeled as Personally Identifiable Information, pii, but keeping its statistical properties. To anonymize a field, you should use the following keys.

"tables": {
    "users": {
        "fields": {
            "social_security_number": {
                "type": "categorical",
                "pii": true,
                "pii_category": "ssn"
            },
            ...
        },
        ...
    },
    ...
}

The most common supported values of pii_category are in the following table, but any value supported by faker can be used:

name

first_name

last_name

phone_number

ssn

credit_card_number

credit_card_security_code

For a full list of available categories please check the Faker documentation site

Note

Sometime Faker categories admit a type, which can be passed as an additional argument. If that is the case, you set a list containing both the category and the type instead of only the string: 'pii_category': ['credict_card_number', 'visa']

Localized data anonymization

To use localizations in anonymization to get values in the specified language, the pii_locales parameter must be set. It must be passed the localizations with their country codes as a list. A list of all possible localizations can be found on the Faker documentation site.

"tables": {
    "users": {
        "fields": {
            "address": {
                "type": "categorical",
                "pii": true,
                "pii_category": "address"
                "pii_locales": ["sv_SE", "en_US"]
            },
            ...
        },
        ...
    },
    ...
}

Note

Specifying localizations and using Faker categories may result in an error if the defined pii_category is not available for all specified languages.

Primary key fields

If a field is specified as a primary_key of the table, then the field must be of type id:

"tables": {
    "users": {
        "fields": {
            "user_id": {
                "name": "user_id"
            },
            ...
        },
        ...
    },
    ...
}

If the subtype of the primary key is string, an optional regular expression can be passed to generate keys that match it:

"tables": {
    "users": {
        "fields": {
            "user_id": {
                "name": "user_id",
                "type": "id",
                "subtype": "string",
                "regex": "[a-zA-Z]{10}"
            },
            ...
        },
        ...
    },
    ...
}

Foreign key fields

If a field is a foreign key to another table, then it has to also be of type id, and define define a relationship using the ref field:

"tables": {
    "sessions": {
        "fields": {
            "user_id": {
                "type": "id"
                "ref": {
                    "field": "user_id",
                    "table": "users"
                },
            },
            ...
        },
        ...
    },
    ...
}]
table

Parent table name.

field

Parent table field name.