PAR Model

In this guide we will go through a series of steps that will let you discover the functionalities of the PAR model for timeseries data.

What is PAR?

The PAR class is an implementation of a Probabilistic AutoRegressive model that allows learning multi-type, multivariate timeseries data and later on generating new synthetic data that has the same format and properties as the learned one.

Additionally, the PAR model has the ability to generate new synthetic timeseries conditioned on the properties of the entity with which the timeseries data is associated.

Note

The PAR model is under active development. Please use it, try it on your data and give us feedback in a GitHub issue or on our Slack workspace.

Quick Usage

We will start by loading one of our demo datasets, the nasdaq100_2019, which contains daily stock market data from the NASDAQ 100 companies during the year 2019.

In [1]: from sdv.demo import load_timeseries_demo

In [2]: data = load_timeseries_demo()

In [3]: data.head()
Out[3]: 
  Symbol       Date       Open      Close     Volume     MarketCap      Sector                Industry
0   AAPL 2018-12-31  39.632500  39.435001  140014000  7.378734e+11  Technology  Computer Manufacturing
1   AAPL 2019-01-02  38.722500  39.480000  148158800  7.378734e+11  Technology  Computer Manufacturing
2   AAPL 2019-01-03  35.994999  35.547501  365248800  7.378734e+11  Technology  Computer Manufacturing
3   AAPL 2019-01-04  36.132500  37.064999  234428400  7.378734e+11  Technology  Computer Manufacturing
4   AAPL 2019-01-07  37.174999  36.982498  219111200  7.378734e+11  Technology  Computer Manufacturing

As you can see, this table contains information about multiple Tickers, including:

  • The Symbol of the Ticker.

  • The Date associated with the stock market values.

  • The opening and closing prices for the day.

  • The Volume of transactions of the day.

  • The MarketCap of the company.

  • The Sector and the Industry in which the company operates.

This is a very common and well-known format for timeseries data, which includes 4 types of columns:

Entity Columns

These are columns that indicate how the rows are associated with external, abstract entities. The group of rows associated with each entity_id forms a time series sequence, where the order of the rows matters and where inter-row dependencies exist. However, the rows of different entities are completely independent of each other.

In this case, the external entity is the company, and the identifier of the company within our data is the Symbol column.

In [4]: entity_columns = ['Symbol']
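
If you want to see how the rows group into these sequences, you can check it with plain pandas. The following snippet is only a sanity sketch, not an SDV API:

# Sanity sketch (plain pandas): count the rows that form the
# time series sequence of each entity.
sequence_sizes = data.groupby(entity_columns).size()
print(sequence_sizes.head())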

Note

In some cases, the datasets do not contain any entity_columns because the rows are not associated with any external entity. In these cases, the entity_columns specification can be omitted and the complete dataset will be interpreted as a single timeseries sequence.

Context

Timeseries datasets may have one or more context_columns: variables that provide information about the entities associated with the timeseries in the form of attributes, and which may condition how the timeseries variables evolve.

For example, in our stock market case, the MarketCap, the Sector and the Industry variables are all contextual attributes associated with each company and have a great impact on what each timeseries looks like.

In [5]: context_columns = ['MarketCap', 'Sector', 'Industry']

Note

The context_columns are attributes that are associated with the entities and which do not change over time. For this reason, since each timeseries sequence has a single entity associated with it, the values of the context_columns are expected to remain constant across all rows of each combination of entity_columns values.
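
A quick way to validate this assumption on your own data is to count the distinct values that each context column takes within each entity; every count should be exactly 1. Here is a minimal sketch using plain pandas, not an SDV call:

# Sanity sketch: each context column should take a single value
# within every entity sequence.
constant = data.groupby(entity_columns)[context_columns].nunique()
assert (constant == 1).all().all(), 'context columns vary within an entity'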

Sequence Index

By definition, timeseries datasets have inter-row dependencies for which the order of the rows matters. In most cases, this order will be indicated by a sequence_index column that contains sortable values such as integers, floats or datetimes. In some other cases there may be no sequence_index, which means that the rows are assumed to be already given in the right order.

In this case, the column that indicates the order of the rows within each sequence is the Date column:

In [6]: sequence_index = 'Date'
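
If you want to make sure that the rows of each entity are already sorted by the sequence_index, a small pandas check like the following can help. Again, this is only a sanity sketch, not part of SDV:

# Sanity sketch: verify that Date increases monotonically
# within each Symbol sequence.
ordered = data.groupby(entity_columns)[sequence_index].apply(
    lambda dates: dates.is_monotonic_increasing
)
print(ordered.all())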

Data Columns

Finally, the rest of the columns of the dataset are what we call the data_columns: the columns that our PAR model will learn to generate synthetically, conditioned on the values of the context_columns.
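
There is no need to pass the data_columns explicitly: they are simply whatever remains after taking out the entity, context and sequence index columns. For reference, they could be computed with a sketch like this (plain Python, not an SDV call):

# Sketch: the data columns are the columns that are left after
# removing the entity, context and sequence_index columns.
data_columns = [
    column
    for column in data.columns
    if column not in entity_columns + context_columns + [sequence_index]
]
print(data_columns)  # ['Open', 'Close', 'Volume']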

Let's now see how to use the PAR class to learn this timeseries dataset and generate new synthetic timeseries that replicate its properties.

For this, you will need to:

  • Import the sdv.timeseries.PAR class and create an instance of it passing the variables that we just created.

  • Call its fit method passing the timeseries data.

  • Call its sample method indicating the number of sequences that we want to generate.

In [7]: from sdv.timeseries import PAR

In [8]: model = PAR(
   ...:     entity_columns=entity_columns,
   ...:     context_columns=context_columns,
   ...:     sequence_index=sequence_index,
   ...: )
   ...: 

In [9]: model.fit(data)

Note

Notice that the model fitting process took care of transforming the different fields using the appropriate Reversible Data Transforms to ensure that the data has a format that the underlying models can handle.

Generate synthetic data from the model

Once the modeling has finished you are ready to generate new synthetic data by calling the sample method of your model, passing the number of sequences that we want to generate.

Let’s start by generating a single sequence.

In [10]: new_data = model.sample(1)

This will return a table identical in format to the one on which the model was fitted, but filled with new synthetic data which resembles the original one.

In [11]: new_data.head()
Out[11]: 
  Symbol       Date        Open       Close   Volume     MarketCap       Sector        Industry
0  lBLMR 2019-01-03  183.437196  162.633317  6569659  9.666717e+09  Health Care  Hotels/Resorts
1  lBLMR 2019-01-01  157.959279  135.457768  7026877  9.666717e+09  Health Care  Hotels/Resorts
2  lBLMR 2019-01-02  137.603519  183.531971  2810058  9.666717e+09  Health Care  Hotels/Resorts
3  lBLMR 2019-01-05  124.584762  108.496213 -1345603  9.666717e+09  Health Care  Hotels/Resorts
4  lBLMR 2019-01-07  109.572716  107.096618  5505609  9.666717e+09  Health Care  Hotels/Resorts

Note

Notice how the model generated a random string for the Symbol identifier which does not look like the regular Ticker symbols that we saw in the original data. This is because the model needs you to tell it how these symbols need to be generated by providing a regular expression that it can use. We will see how to do this in a later section.

Save and Load the model

In many scenarios it will be convenient to generate synthetic versions of your data directly in systems that do not have access to the original data source. For example, you may want to generate testing data on the fly inside a testing environment that does not have access to your production database. In these scenarios, fitting the model with real data every time that you need to generate new data is not feasible, so you will need to fit a model in your production environment, save the fitted model into a file, send this file to the testing environment and then load it there to be able to sample from it.

Let’s see how this process works.

Save and share the model

Once you have fitted the model, all you need to do is call its save method passing the name of the file in which you want to save the model. Note that the extension of the filename is not relevant, but we will be using the .pkl extension to highlight that the serialization protocol used is pickle.

In [12]: model.save('my_model.pkl')

This will have created a file called my_model.pkl in the same directory in which you are running SDV.

Note

If you inspect the generated file you will notice that its size is much smaller than the size of the data that you used to generate it. This is because the serialized model contains no information about the original data, other than the parameters it needs to generate synthetic versions of it. This means that you can safely share this my_model.pkl file without the risk of disclosing any of your real data!
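
For example, you could compare the size of the serialized model with the in-memory footprint of the training data. This is just an illustrative sketch using the standard library and pandas:

import os

# Sketch: the pickled model is typically much smaller than the
# training data it was fitted on.
model_size = os.path.getsize('my_model.pkl')
data_size = data.memory_usage(deep=True).sum()
print(f'model: {model_size} bytes, data: {data_size} bytes')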

Load the model and generate new data

The file you just generated can be sent over to the system where the synthetic data will be generated. Once it is there, you can load it using the PAR.load method, and then you are ready to sample new data from the loaded instance:

In [13]: loaded = PAR.load('my_model.pkl')

In [14]: loaded.sample(num_sequences=1).head()
Out[14]: 
  Symbol       Date        Open       Close    Volume     MarketCap      Sector Industry
0  rSFXA 2019-01-02  119.835386   86.233735 -10242526  1.274792e+11  Technology      NaN
1  rSFXA 2019-01-01  150.552350   28.424140   8748450  1.274792e+11  Technology      NaN
2  rSFXA 2019-01-04  102.031792  145.341994   9301296  1.274792e+11  Technology      NaN
3  rSFXA 2019-01-04  110.715567  100.367299   8528024  1.274792e+11  Technology      NaN
4  rSFXA 2019-01-04   94.241970   87.031027   2813231  1.274792e+11  Technology      NaN

Warning

Notice that the system where the model is loaded needs to also have sdv installed, otherwise it will not be able to load the model and use it.

Conditional Sampling

In the previous examples we had the model generate random values to populate the context_columns and the entity_columns. In order to do this, the model learned the context and entity values using a GaussianCopula, which it later used to sample new realistic values for them. This is fine for cases in which we do not have any constraints regarding the type of data that we generate, but in some cases we might want to control the values of the contextual columns to force the model into generating data of a certain type.

In order to achieve this, we will first have to create a pandas.DataFrame with the expected values.

As an example, let’s generate values for two companies in the Technology and Health Care sectors.

In [15]: import pandas as pd

In [16]: context = pd.DataFrame([
   ....:     {
   ....:         'Symbol': 'AAAA',
   ....:         'MarketCap': 1.2345e+11,
   ....:         'Sector': 'Technology',
   ....:         'Industry': 'Electronic Components'
   ....:     },
   ....:     {
   ....:         'Symbol': 'BBBB',
   ....:         'MarketCap': 4.5678e+10,
   ....:         'Sector': 'Health Care',
   ....:         'Industry': 'Medical/Nursing Services'
   ....:     },
   ....: ])
   ....: 

In [17]: context
Out[17]: 
  Symbol     MarketCap       Sector                  Industry
0   AAAA  1.234500e+11   Technology     Electronic Components
1   BBBB  4.567800e+10  Health Care  Medical/Nursing Services

Once you have created this, you can simply pass the dataframe as the context argument to the sample method.

In [18]: new_data = model.sample(context=context)

And we can now see the data generated for the two companies:

In [19]: new_data[new_data.Symbol == 'AAAA'].head()
Out[19]: 
  Symbol       Date        Open       Close   Volume     MarketCap      Sector               Industry
0   AAAA 2019-01-02   96.483844  157.333170  2227595  1.234500e+11  Technology  Electronic Components
1   AAAA 2019-01-01  114.172031  183.531971  6373000  1.234500e+11  Technology  Electronic Components
2   AAAA 2019-01-03   68.572796  119.880745  6750367  1.234500e+11  Technology  Electronic Components
3   AAAA 2019-01-04   96.037921  100.140726  6608301  1.234500e+11  Technology  Electronic Components
4   AAAA 2019-01-04   80.157030   97.647850  8306543  1.234500e+11  Technology  Electronic Components
In [20]: new_data[new_data.Symbol == 'BBBB'].head()
Out[20]: 
    Symbol       Date        Open       Close   Volume     MarketCap       Sector                  Industry
252   BBBB 2018-12-26  135.107118  183.531971   985106  4.567800e+10  Health Care  Medical/Nursing Services
253   BBBB 2018-12-31  116.476343  162.663746  6569659  4.567800e+10  Health Care  Medical/Nursing Services
254   BBBB 2019-01-02  147.097433  105.826264  3743148  4.567800e+10  Health Care  Medical/Nursing Services
255   BBBB 2019-01-04  131.116133   89.455276  3267651  4.567800e+10  Health Care  Medical/Nursing Services
256   BBBB 2019-01-03  108.584836  110.737609  6569659  4.567800e+10  Health Care  Medical/Nursing Services

Advanced Usage

Now that we have discovered the basics, let’s go over a few more advanced usage examples and see the different arguments that we can pass to our PAR Model in order to customize it to our needs.

How to customize the generated IDs?

In the previous examples we saw how the Symbol values were generated as random strings that do not look like the ones typically seen for Tickers, which usually are strings made of between 2 and 4 uppercase letters.

In order to fix this and force the model to generate values that are valid for the field, we can use the field_types argument to indicate the characteristics of each field by passing a dictionary that follows the Metadata field specification.

For this case in particular, we will indicate that the Symbol field needs to be generated using the regular expression [A-Z]{2,4}.

In [21]: field_types = {
   ....:     'Symbol': {
   ....:         'type': 'id',
   ....:         'subtype': 'string',
   ....:         'regex': '[A-Z]{2,4}'
   ....:     }
   ....: }
   ....: 

In [22]: model = PAR(
   ....:     entity_columns=entity_columns,
   ....:     context_columns=context_columns,
   ....:     sequence_index=sequence_index,
   ....:     field_types=field_types
   ....: )
   ....: 

In [23]: model.fit(data)

After this, we can observe how the new Symbols are generated as indicated.

In [24]: model.sample(num_sequences=1).head()
Out[24]: 
  Symbol       Date        Open       Close    Volume     MarketCap       Sector                           Industry
0   YUDO 2019-01-10  430.981049  183.531971  16689013  5.687423e+10  Health Care  Consumer Electronics/Video Chains
1   YUDO 2019-01-02  260.410628  183.531971  10744115  5.687423e+10  Health Care  Consumer Electronics/Video Chains
2   YUDO 2019-01-05  215.418684  267.536210   7992211  5.687423e+10  Health Care  Consumer Electronics/Video Chains
3   YUDO 2019-01-05  169.709982  213.594150   7953143  5.687423e+10  Health Care  Consumer Electronics/Video Chains
4   YUDO 2019-01-06  207.237724  203.424979  12281735  5.687423e+10  Health Care  Consumer Electronics/Video Chains

Note

Notice how in this case we only specified the properties of the Symbol field and the PAR model was able to handle the other fields appropriately without needing any indication from us.

Can I control the length of the sequences?

When learning the data, the PAR model also learned the distribution of the lengths of the sequences, so each generated sequence may have a different length:

In [25]: model.sample(num_sequences=5).groupby('Symbol').size()
Out[25]: 
Symbol
HBV    210
II     252
IK     201
SL     252
XV     252
dtype: int64

If we want to force a specific length on the generated sequences we can pass the sequence_length argument to the sample method:

In [26]: model.sample(num_sequences=5, sequence_length=100).groupby('Symbol').size()
Out[26]: 
Symbol
AURZ    100
DE      100
ES      100
OA      100
TVZF    100
dtype: int64

Can I use timeseries without context?

Sometimes timeseries datasets do not provide any additional properties about the entities associated with each sequence, other than the unique identifier of the entity.

Let’s simulate this situation by dropping the context columns from our data.

In [27]: no_context = data[['Symbol', 'Date', 'Open', 'Close', 'Volume']].copy()

In [28]: no_context.head()
Out[28]: 
  Symbol       Date       Open      Close     Volume
0   AAPL 2018-12-31  39.632500  39.435001  140014000
1   AAPL 2019-01-02  38.722500  39.480000  148158800
2   AAPL 2019-01-03  35.994999  35.547501  365248800
3   AAPL 2019-01-04  36.132500  37.064999  234428400
4   AAPL 2019-01-07  37.174999  36.982498  219111200

In these cases, we can simply skip the context columns when creating the model, and PAR will be able to learn the timeseries without imposing any conditions on them.

In [29]: model = PAR(
   ....:     entity_columns=entity_columns,
   ....:     sequence_index=sequence_index,
   ....:     field_types=field_types,
   ....: )
   ....: 

In [30]: model.fit(no_context)

In [31]: model.sample(num_sequences=1).head()
Out[31]: 
  Symbol       Date        Open       Close    Volume
0    HBD 2019-01-02  169.758773  227.099346  11176886
1    HBD 2019-01-03  179.312567  181.278781   6569659
2    HBD 2019-01-03  143.004130  171.029659  10421704
3    HBD 2019-01-04  130.277449  124.273505  14246370
4    HBD 2019-01-06  128.771455  114.999140   6909551

In this case, of course, we are not able to sample new sequences conditioned on any value, but we are still able to force the symbols that we want on the generated data by passing them in a pandas.DataFrame:

In [32]: symbols = pd.DataFrame({
   ....:     'Symbol': ['TSLA']
   ....: })
   ....: 

In [33]: model.sample(context=symbols).head()
Out[33]: 
  Symbol       Date        Open       Close    Volume
0   TSLA 2019-01-05  196.234829  170.849768   6569659
1   TSLA 2019-01-02  142.237312   41.421629   1489970
2   TSLA 2019-01-03  116.910000   87.269713   8973659
3   TSLA 2019-01-04   82.959771  107.220750  12012336
4   TSLA 2019-01-05   79.715661  183.531971   5069630

What happens if there are no entity_columns either?

In some cases the timeseries datasets are made of a single timeseries sequence with no identifiers of external entities. For example, suppose we only had the data from one company:

In [34]: tsla = no_context[no_context.Symbol == 'TSLA'].copy()

In [35]: del tsla['Symbol']

In [36]: tsla.head()
Out[36]: 
           Date       Open      Close    Volume
1008 2018-12-31  67.557999  66.559998  31511500
1009 2019-01-02  61.220001  62.023998  58293000
1010 2019-01-03  61.400002  60.071999  34826000
1011 2019-01-04  61.200001  63.537998  36970500
1012 2019-01-07  64.344002  66.991997  37756000

In this case, we can simply omit the entity_columns argument when creating our PAR instance:

In [37]: model = PAR(
   ....:     sequence_index=sequence_index,
   ....: )
   ....: 

In [38]: model.fit(tsla)

In [39]: model.sample()
Out[39]: 
          Date       Open      Close    Volume
0   2018-12-31  66.461746  57.755902  29625731
1   2019-01-01  60.756997  63.157357   7094902
2   2019-01-03  61.597657  67.387280  17201318
3   2019-01-06  66.536039  64.665756  78443889
4   2019-01-07  64.066290  65.938488  17335426
..         ...        ...        ...       ...
247 2019-12-31  45.701326  45.894419   -472075
248 2020-01-01  46.591638  45.420758  60369104
249 2020-01-03  48.115160  47.642942  48284582
250 2020-01-05  46.043755  47.621902  30147361
251 2020-01-07  54.552286  49.160568  14100418

[252 rows x 4 columns]