Danger
You are looking at the documentation for an older version of the SDV! We are no longer supporting or maintaining this version of the software. Click here to go to the new docs pages.
In this guide we will go through a series of steps that will let you discover the functionality of the PAR model for timeseries data.
PAR
The PAR class is an implementation of a Probabilistic AutoRegressive model that learns multi-type, multivariate timeseries data and later generates new synthetic data with the same format and properties as the learned data.
Additionally, the PAR model has the ability to generate new synthetic timeseries conditioned on the properties of the entity to which this timeseries data would be associated.
Note
The PAR model is under active development. Please use it, try it on your data and give us feedback through a GitHub issue or our Slack workspace.
We will start by loading one of our demo datasets, nasdaq100_2019, which contains daily stock market data from the NASDAQ 100 companies during the year 2019.
In [1]: from sdv.demo import load_timeseries_demo

In [2]: data = load_timeseries_demo()

In [3]: data.head()
Out[3]:
  Symbol       Date       Open      Close     Volume     MarketCap      Sector                Industry
0   AAPL 2018-12-31  39.632500  39.435001  140014000  7.378734e+11  Technology  Computer Manufacturing
1   AAPL 2019-01-02  38.722500  39.480000  148158800  7.378734e+11  Technology  Computer Manufacturing
2   AAPL 2019-01-03  35.994999  35.547501  365248800  7.378734e+11  Technology  Computer Manufacturing
3   AAPL 2019-01-04  36.132500  37.064999  234428400  7.378734e+11  Technology  Computer Manufacturing
4   AAPL 2019-01-07  37.174999  36.982498  219111200  7.378734e+11  Technology  Computer Manufacturing
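If you want a feel for the scope of the table beyond these first rows, a quick check with plain pandas is enough:

# Sketch: how many companies and what date range the demo data covers.
data['Symbol'].nunique()          # number of distinct companies
data['Date'].agg(['min', 'max'])  # first and last dates in the table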
As you can see, this table contains information about multiple Tickers, including:
The Symbol of the Ticker.
The Date associated with the stock market values.
The opening (Open) and closing (Close) prices for the day.
The Volume of transactions of the day.
The MarketCap of the company.
The Sector and the Industry in which the company operates.
This is a very common and well-known format for timeseries data, which includes 4 types of columns: entity columns, context columns, a sequence index and data columns.
Entity columns indicate how the rows are associated with external, abstract entities. The group of rows associated with each entity_id forms a time series sequence, where the order of the rows matters and where inter-row dependencies exist. However, the rows of different entities are completely independent of each other.
In this case, the external entity is the company, and the identifier of the company within our data is the Symbol column.
In [4]: entity_columns = ['Symbol']
In some cases, the datasets do not contain any entity_columns because the rows are not associated with any external entity. In these cases, the entity_columns specification can be omitted and the complete dataset will be interpreted as a single timeseries sequence, as we will see at the end of this guide.
The timeseries datasets may also have one or more context_columns: variables that provide information about the entities associated with the timeseries in the form of attributes, and which may condition how the timeseries variables evolve.
For example, in our stock market case, the MarketCap, Sector and Industry variables are all contextual attributes associated with each company, and they have a great impact on what each timeseries looks like.
In [5]: context_columns = ['MarketCap', 'Sector', 'Industry']
The context_columns are attributes associated with the entities which do not change over time. For this reason, since each timeseries sequence is associated with a single entity, the values of the context_columns are expected to remain constant for each combination of entity_columns values.
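Since the model relies on this assumption, a minimal pandas sketch can verify that it holds on the demo data:

# Sketch: each context column should take exactly one value per Symbol.
nunique_per_entity = data.groupby('Symbol')[context_columns].nunique()
assert (nunique_per_entity == 1).all().all()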
By definition, timeseries datasets have inter-row dependencies for which the order of the rows matters. In most cases, this order will be indicated by a sequence_index column that contains sortable values such as integers, floats or datetimes. In other cases there may be no sequence_index, which means that the rows are assumed to be given already in the right order.
In this case, the column that indicates the order of the rows within each sequence is the Date column:
In [6]: sequence_index = 'Date'
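If you want to confirm that the rows of each sequence are already sortable by this index, a minimal pandas sketch:

# Sketch: check whether the rows of each Symbol are already ordered by Date.
already_sorted = data.groupby('Symbol')['Date'].apply(
    lambda dates: dates.is_monotonic_increasing
).all()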
Finally, the rest of the columns of the dataset are what we call the data_columns: the columns that our PAR model will learn to generate synthetically, conditioned on the values of the context_columns.
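Although PAR infers the data_columns automatically, a minimal sketch shows how they could be derived by hand:

# Sketch: the data columns are whatever remains after removing the entity
# columns, the context columns and the sequence index.
data_columns = [
    column for column in data.columns
    if column not in entity_columns + context_columns + [sequence_index]
]
# For this dataset: ['Open', 'Close', 'Volume']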
Let’s now see how to use the PAR class to learn this timeseries dataset and generate new synthetic timeseries that replicate its properties.
For this, you will need to:
Import the sdv.timeseries.PAR class and create an instance of it, passing the variables that we just created.
Call its fit method passing the timeseries data.
Call its sample method indicating the number of sequences that we want to generate.
In [7]: from sdv.timeseries import PAR

In [8]: model = PAR(
   ...:     entity_columns=entity_columns,
   ...:     context_columns=context_columns,
   ...:     sequence_index=sequence_index,
   ...: )
   ...:

In [9]: model.fit(data)
Notice that the model fitting process took care of transforming the different fields using the appropriate Reversible Data Transforms to ensure that the data has a format that the underlying models can handle.
Once the modeling has finished, you are ready to generate new synthetic data by calling the sample method of your model, passing the number of sequences that we want to generate.
Let’s start by generating a single sequence.
In [10]: new_data = model.sample(1)
This will return a table identical in format to the one the model was fitted on, but filled with new synthetic data that resembles the original.
In [11]: new_data.head()
Out[11]:
  Symbol       Date        Open       Close   Volume     MarketCap      Sector        Industry
0      a 2019-01-10  199.622722  143.839686  6569659  1.482650e+11  Technology  Semiconductors
1      a 2019-01-06  117.551088  206.462597  1500066  1.482650e+11  Technology  Semiconductors
2      a 2019-01-05  135.209666  181.992995  9651562  1.482650e+11  Technology  Semiconductors
3      a 2019-01-07  177.369814  174.824025  6903771  1.482650e+11  Technology  Semiconductors
4      a 2019-01-05  152.971563  179.566565  8011418  1.482650e+11  Technology  Semiconductors
Notice how the model generated a random string for the Symbol identifier which does not look like the regular Ticker symbols that we saw in the original data. This is because you need to tell the model how these symbols should be generated, by providing a regular expression that it can use. We will see how to do this in a later section.
In many scenarios it will be convenient to generate synthetic versions of your data directly in systems that do not have access to the original data source. For example, you may want to generate testing data on the fly inside a testing environment that does not have access to your production database. In these scenarios, fitting the model with real data every time that you need to generate new data is not feasible, so you will need to fit a model in your production environment, save the fitted model into a file, send this file to the testing environment and then load it there to be able to sample from it.
Let’s see how this process works.
Once you have fitted the model, all you need to do is call its save method passing the name of the file in which you want to save the model. Note that the extension of the filename is not relevant, but we will be using the .pkl extension to highlight that the serialization protocol used is cloudpickle.
In [12]: model.save('my_model.pkl')
This will have created a file called my_model.pkl in the same directory in which you are running SDV.
If you inspect the generated file you will notice that its size is much smaller than the size of the data that you used to generate it. This is because the serialized model contains no information about the original data, other than the parameters it needs to generate synthetic versions of it. This means that you can safely share this my_model.pkl file without the risk of disclosing any of your real data!
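You can verify the size difference with a couple of lines of Python:

# Sketch: compare the size of the serialized model with the training data.
import os
print(os.path.getsize('my_model.pkl'))     # bytes of the fitted model on disk
print(data.memory_usage(deep=True).sum())  # bytes of the training data in memory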
The file you just generated can be sent over to the system where the synthetic data will be generated. Once it is there, you can load it using the PAR.load method, and then you are ready to sample new data from the loaded instance:
In [13]: loaded = PAR.load('my_model.pkl')

In [14]: loaded.sample(num_sequences=1).head()
Out[14]:
  Symbol       Date        Open       Close    Volume     MarketCap Sector            Industry
0      a 2019-01-01  183.437196  411.817139  10478359  2.338333e+11    NaN  Auto Manufacturing
1      a 2019-01-04  324.311156  297.204942   6569659  2.338333e+11    NaN  Auto Manufacturing
2      a 2019-01-04  397.921350  374.417958  10532509  2.338333e+11    NaN  Auto Manufacturing
3      a 2019-01-08  487.052894  479.283685   6569659  2.338333e+11    NaN  Auto Manufacturing
4      a 2019-01-07  477.386230  434.422596  14427427  2.338333e+11    NaN  Auto Manufacturing
Warning
Notice that the system where the model is loaded also needs to have sdv installed; otherwise it will not be able to load the model and use it.
In the previous examples we had the model generate random values to populate the context_columns and the entity_columns. In order to do this, the model learned the context and entity values using a GaussianCopula, which was later used to sample new realistic values for them. This is fine for cases in which we do not have any constraints regarding the type of data that we generate, but in some cases we might want to control the values of the contextual columns to force the model into generating data of a certain type.
In order to achieve this, we will first have to create a pandas.DataFrame with the expected values.
As an example, let’s generate values for two companies in the Technology and Health Care sectors.
In [15]: import pandas as pd

In [16]: context = pd.DataFrame([
   ....:     {
   ....:         'Symbol': 'AAAA',
   ....:         'MarketCap': 1.2345e+11,
   ....:         'Sector': 'Technology',
   ....:         'Industry': 'Electronic Components'
   ....:     },
   ....:     {
   ....:         'Symbol': 'BBBB',
   ....:         'MarketCap': 4.5678e+10,
   ....:         'Sector': 'Health Care',
   ....:         'Industry': 'Medical/Nursing Services'
   ....:     },
   ....: ])
   ....:

In [17]: context
Out[17]:
  Symbol     MarketCap       Sector                  Industry
0   AAAA  1.234500e+11   Technology     Electronic Components
1   BBBB  4.567800e+10  Health Care  Medical/Nursing Services
Once you have created this dataframe, you can simply pass it as the context argument to the sample method.
In [18]: new_data = model.sample(context=context)
And we can now see the data generated for the two companies:
In [19]: new_data[new_data.Symbol == 'AAAA'].head()
Out[19]:
  Symbol       Date        Open       Close   Volume     MarketCap      Sector               Industry
0   AAAA 2019-01-01  171.047410  -85.435700 -4490714  1.234500e+11  Technology  Electronic Components
1   AAAA 2019-01-01   96.964487  118.335012  -417776  1.234500e+11  Technology  Electronic Components
2   AAAA 2019-01-02  109.761573   90.862008  4507659  1.234500e+11  Technology  Electronic Components
3   AAAA 2019-01-04   95.927067   78.049088  4489002  1.234500e+11  Technology  Electronic Components
4   AAAA 2019-01-06   76.499953   77.663281   552803  1.234500e+11  Technology  Electronic Components
In [20]: new_data[new_data.Symbol == 'BBBB'].head()
Out[20]:
    Symbol       Date        Open       Close   Volume     MarketCap       Sector                  Industry
234   BBBB 2019-01-04  183.437196  183.531971  6569659  4.567800e+10  Health Care  Medical/Nursing Services
235   BBBB 2019-01-03  215.484922  262.793707  2779874  4.567800e+10  Health Care  Medical/Nursing Services
236   BBBB 2019-01-06  270.670779  217.897493  3412299  4.567800e+10  Health Care  Medical/Nursing Services
237   BBBB 2019-01-06  279.633303  270.174154  3872601  4.567800e+10  Health Care  Medical/Nursing Services
238   BBBB 2019-01-07  272.854060  265.463051  -257462  4.567800e+10  Health Care  Medical/Nursing Services
Now that we have discovered the basics, let’s go over a few more advanced usage examples and see the different arguments that we can pass to our PAR Model in order to customize it to our needs.
In the previous examples we saw how the Symbol values were generated as random strings that do not look like those typically seen for Tickers, which are usually strings made of between 2 and 4 uppercase letters.
In order to fix this and force the model to generate values that are valid for the field, we can use the field_types argument to indicate the characteristics of each field by passing a dictionary that follows the Metadata field specification.
For this case in particular, we will indicate that the Symbol field needs to be generated using the regular expression [A-Z]{2,4}.
In [21]: field_types = {
   ....:     'Symbol': {
   ....:         'type': 'id',
   ....:         'subtype': 'string',
   ....:         'regex': '[A-Z]{2,4}'
   ....:     }
   ....: }
   ....:

In [22]: model = PAR(
   ....:     entity_columns=entity_columns,
   ....:     context_columns=context_columns,
   ....:     sequence_index=sequence_index,
   ....:     field_types=field_types
   ....: )
   ....:

In [23]: model.fit(data)
After this, we can observe how the new Symbol values are generated as indicated.
In [24]: model.sample(num_sequences=1).head()
Out[24]:
  Symbol       Date        Open       Close    Volume  MarketCap             Sector        Industry
0     AA 2019-01-01  183.437196  -43.295255   1811487        NaN  Consumer Services  Semiconductors
1     AB 2019-01-01  183.437196  138.909068   2763962        NaN  Consumer Services  Semiconductors
2     AC 2019-01-03  123.192806  115.580938  13801948        NaN  Consumer Services  Semiconductors
3     AD 2019-01-04   98.096835   93.613023   4078450        NaN  Consumer Services  Semiconductors
4     AE 2019-01-05  104.297568  114.087184   7631698        NaN  Consumer Services  Semiconductors
Notice how in this case we only specified the properties of the Symbol field and the PAR model was able to handle the other fields appropriately without needing any indication from us.
When learning the data, the PAR model also learned the distribution of the lengths of the sequences, so each generated sequence may have a different length:
In [25]: model.sample(num_sequences=5).groupby('Symbol').size()
Out[25]:
Symbol
AA     1
AAA    1
AAB    1
AAC    1
AAD    1
      ..
ZV     1
ZW     1
ZX     1
ZY     1
ZZ     1
Length: 1067, dtype: int64
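For comparison, a one-line pandas sketch shows the distribution of sequence lengths in the original data, which is what the model learned:

# Sketch: distribution of sequence lengths across the real companies.
data.groupby('Symbol').size().describe()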
If we want to force a specific length for the generated sequences, we can pass the sequence_length argument to the sample method:
In [26]: model.sample(num_sequences=5, sequence_length=100).groupby('Symbol').size()
Out[26]:
Symbol
AA    1
AB    1
AC    1
AD    1
AE    1
     ..
TB    1
TC    1
TD    1
TE    1
TF    1
Length: 500, dtype: int64
Sometimes the timeseries datasets do not provide any properties about the entities associated with each sequence other than the unique identifier of the entity.
Let’s simulate this situation by dropping the context columns from our data.
In [27]: no_context = data[['Symbol', 'Date', 'Open', 'Close', 'Volume']].copy()

In [28]: no_context.head()
Out[28]:
  Symbol       Date       Open      Close     Volume
0   AAPL 2018-12-31  39.632500  39.435001  140014000
1   AAPL 2019-01-02  38.722500  39.480000  148158800
2   AAPL 2019-01-03  35.994999  35.547501  365248800
3   AAPL 2019-01-04  36.132500  37.064999  234428400
4   AAPL 2019-01-07  37.174999  36.982498  219111200
In this case, we can simply skip the context columns when creating the model, and PAR will learn the timeseries without imposing any conditions on them.
In [29]: model = PAR(
   ....:     entity_columns=entity_columns,
   ....:     sequence_index=sequence_index,
   ....:     field_types=field_types,
   ....: )
   ....:

In [30]: model.fit(no_context)

In [31]: model.sample(num_sequences=1).head()
Out[31]:
  Symbol       Date        Open       Close   Volume
0     AA 2019-01-10  -52.709471  164.446656  6569659
1     AB 2019-01-07  107.524217  189.898354 -3062656
2     AC 2019-01-06   97.581017  134.702230  5829555
3     AD 2019-01-03  108.395348  127.642501  4693104
4     AE 2019-01-05  135.928757  113.947397  -376928
In this case, of course, we are not able to sample new sequences conditioned on any value, but we are still able to force the symbols that we want on the generated data by passing them in a pandas.DataFrame:
In [32]: symbols = pd.DataFrame({
   ....:     'Symbol': ['TSLA']
   ....: })
   ....:

In [33]: model.sample(context=symbols).head()
Out[33]:
  Symbol       Date        Open       Close    Volume
0     AA 2019-01-01  183.437196  223.071735   6569659
1     AB 2019-01-08  358.006582  183.531971   3436776
2     AC 2019-01-03  273.697157  253.144183  10948505
3     AD 2019-01-03  296.006444  322.496527   7283934
4     AE 2019-01-03  312.183596  299.348282   3342744
In some cases the timeseries datasets are made of a single timeseries sequence with no identifiers of external entities. For example, suppose we only had the data from one company:
In [34]: tsla = no_context[no_context.Symbol == 'TSLA'].copy()

In [35]: del tsla['Symbol']

In [36]: tsla.head()
Out[36]:
           Date       Open      Close    Volume
1008 2018-12-31  67.557999  66.559998  31511500
1009 2019-01-02  61.220001  62.023998  58293000
1010 2019-01-03  61.400002  60.071999  34826000
1011 2019-01-04  61.200001  63.537998  36970500
1012 2019-01-07  64.344002  66.991997  37756000
In this case, we can simply omit the entity_columns argument when creating our PAR instance:
In [37]: model = PAR(
   ....:     sequence_index=sequence_index,
   ....: )
   ....:

In [38]: model.fit(tsla)

In [39]: model.sample()
Out[39]:
          Date       Open      Close    Volume
0   2018-12-31  54.552286  53.287831  21315647
1   2019-01-02  53.660515  63.726171  45715575
2   2019-01-02  64.056969  62.119500  14363078
3   2019-01-03  66.276769  61.056126 -13010637
4   2019-01-05  63.646733  62.292669  37837522
..         ...        ...        ...       ...
247 2019-12-29  51.734084  50.575215  61102714
248 2019-12-30  49.019854  50.786711  69760369
249 2019-12-30  46.749204  47.545879  80108194
250 2019-12-31  47.083506  46.950503  53793516
251 2020-01-01  54.552286  47.479569  56385846

[252 rows x 4 columns]
After creating synthetic data, you may be wondering how you can evaluate it against the original data. You can use the SDMetrics library to get more insights, generate reports and visualize the data. This library is automatically installed with SDV.
To get started, visit: https://docs.sdv.dev/sdmetrics/
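As a starting point, here is a minimal sketch assuming the sdv.evaluation.evaluate helper that shipped with SDV versions of this era; it aggregates several SDMetrics scores into a single value between 0 (worst) and 1 (best):

# Sketch: aggregate similarity score between synthetic and real data
# (assumes the sdv.evaluation.evaluate helper from this SDV version).
from sdv.evaluation import evaluate
score = evaluate(new_data, data)  # synthetic data first, real data second
print(score)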