Multivariate Distributions

In addition to the Univariate distributions, which model a single random variable, the Copulas library provides several Multivariate distributions that model multiple random variables at once, taking into account the dependencies that may exist between them.

These distributions are implemented by the Multivariate subclasses defined in the copulas.multivariate package:

  • copulas.multivariate.GaussianMultivariate: Implements a multivariate distribution by combining the marginal univariate distributions with a Gaussian Copula.

  • copulas.multivariate.VineCopula: Implements a multivariate distribution using Vine Copulas.

Gaussian Multivariate

In this example we will use the GaussianMultivariate class, which implements a multivariate distribution by using a Gaussian Copula to combine marginal distributions estimated with Univariate distributions.

First of all, let’s load the data that we will use in the examples below.

This is a toy dataset with three columns following these distributions:

  • x: Beta distribution with a=0.1 and b=0.1

  • y: Beta distribution with a=0.1 and b=0.5

  • z: Normal distribution + 10 times y

[2]:
from copulas.datasets import sample_trivariate_xyz

data = sample_trivariate_xyz()
[3]:
data.head()
[3]:
x y z
0 9.004177e-05 2.883992e-06 0.638689
1 8.819273e-01 2.911979e-07 1.058121
2 5.003865e-01 4.886504e-04 0.372506
3 1.838544e-12 5.392802e-02 0.687370
4 1.627915e-01 1.634269e-08 -0.881068
[4]:
from copulas.visualization import scatter_3d

scatter_3d(data)
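
For intuition, the three columns described above can be imitated with the standard library alone. This is only a sketch: it assumes the "Normal distribution" in z is a standard normal, and the helper name make_toy_rows is made up for this illustration, not part of Copulas.

```python
import random


def make_toy_rows(n, seed=0):
    """Return n (x, y, z) tuples that follow the column description above."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.betavariate(0.1, 0.1)     # x ~ Beta(a=0.1, b=0.1)
        y = rng.betavariate(0.1, 0.5)     # y ~ Beta(a=0.1, b=0.5)
        z = rng.gauss(0.0, 1.0) + 10 * y  # assumed standard normal noise plus 10 times y
        rows.append((x, y, z))
    return rows


rows = make_toy_rows(1000)
```

Because z is built from y, the y and z columns are strongly dependent, while x is independent of both. That dependence is what a multivariate model has to capture.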

Fitting a Model and Generating Synthetic Data

The first step in using our GaussianMultivariate model is to fit it by passing the data to its fit method.

[5]:
from copulas.multivariate import GaussianMultivariate

dist = GaussianMultivariate()
dist.fit(data)

During this process, the GaussianMultivariate class will:

  • Search for the Univariate distribution that best describes each column in the data.

  • Fit the corresponding Univariate distributions to each column.

  • Learn the joint distribution based on the correlations between the marginal distributions.
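
The idea behind these steps can be sketched with the standard library. The sketch below is not the library's implementation: instead of fitting a parametric Univariate to each column, it uses the empirical CDF (ranks) as a stand-in marginal, maps each column to standard normal scores, and measures the correlation between those scores. The helper names normal_scores and pearson are made up for this illustration.

```python
import random
from statistics import NormalDist


def normal_scores(values):
    """Map values to standard normal scores through their empirical CDF."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    std_normal = NormalDist()
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        u = rank / (n + 1)  # strictly inside (0, 1), so inv_cdf is defined
        scores[i] = std_normal.inv_cdf(u)
    return scores


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / (var_a * var_b) ** 0.5


# Two dependent toy columns, similar in spirit to y and z above.
rng = random.Random(0)
y = [rng.betavariate(0.1, 0.5) for _ in range(500)]
z = [rng.gauss(0.0, 1.0) + 10 * v for v in y]

# The correlation of the normal scores is the copula parameter for this pair.
rho = pearson(normal_scores(y), normal_scores(z))
```

Working on the normal scores rather than the raw values separates the shape of each column (the marginals) from the dependence between columns (the correlation).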

After the model has been fitted, we can sample new data from it like we did with the Univariate distributions.

[6]:
sampled = dist.sample(1000)
[7]:
sampled.head()
[7]:
x y z
0 4.414737e-09 2.141581e-06 -0.539025
1 5.798618e-01 7.154227e-07 -0.586024
2 5.170681e-03 7.811971e-02 3.069053
3 9.515144e-01 7.970429e-04 0.435794
4 6.785048e-03 2.043194e-02 8.043096
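
Conceptually, sampling from a Gaussian copula model runs the fitting steps in reverse: draw correlated standard normal values, turn them into uniforms with the normal CDF, and push each uniform through the inverse CDF of its marginal. The two-column sketch below uses only the standard library. The marginals (an exponential and a uniform on [0, 10]), the correlation of 0.7 and all function names are assumptions chosen for the illustration, not taken from the library.

```python
import math
import random
from statistics import NormalDist


def sample_gaussian_copula(n, rho, inverse_cdfs, seed=0):
    """Draw n pairs whose dependence follows a Gaussian copula with correlation rho."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    inv_a, inv_b = inverse_cdfs
    rows = []
    for _ in range(n):
        # Correlated standard normals built from two independent draws.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        # The normal CDF maps each value to a uniform on (0, 1).
        u1, u2 = std_normal.cdf(z1), std_normal.cdf(z2)
        # The inverse marginal CDFs map the uniforms to the data scale.
        rows.append((inv_a(u1), inv_b(u2)))
    return rows


def exponential_inverse_cdf(u):
    """Inverse CDF of an exponential distribution with rate 1."""
    return -math.log(1.0 - u)


def uniform_0_10_inverse_cdf(u):
    """Inverse CDF of a uniform distribution on [0, 10]."""
    return 10.0 * u


samples = sample_gaussian_copula(
    1000, 0.7, (exponential_inverse_cdf, uniform_0_10_inverse_cdf)
)
```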

We can now compare the distribution of the real data to the sampled one by plotting them side by side.

[8]:
from copulas.visualization import compare_3d

compare_3d(data, sampled)

Specifying Column Distributions

More advanced users can choose to manually specify the marginal distributions if they have additional information about the data.

This can be done by specifying a single distribution that will be used for all the columns.

[9]:
from copulas.univariate import GaussianUnivariate

dist = GaussianMultivariate(distribution=GaussianUnivariate)
dist.fit(data)
sampled = dist.sample(1000)

compare_3d(data, sampled)

Or by specifying the distribution that needs to be used in each column.

[10]:
from copulas.univariate import BetaUnivariate, GaussianKDE, GaussianUnivariate

dist = GaussianMultivariate(distribution={
    "x": BetaUnivariate,
    "y": GaussianKDE,
    "z": GaussianUnivariate,
})
dist.fit(data)
sampled = dist.sample(1000)

compare_3d(data, sampled)

Or even by specifying a family of Univariates.

[11]:
from copulas.univariate import ParametricType, Univariate

# Select the best PARAMETRIC univariate
univariate = Univariate(parametric=ParametricType.PARAMETRIC)

dist = GaussianMultivariate(distribution=univariate)
dist.fit(data)
sampled = dist.sample(1000)

compare_3d(data, sampled)

In general, however, letting Univariate select the best model for each marginal distribution tends to produce the best results.
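
One common way to compare candidate marginals is the Kolmogorov-Smirnov statistic, the largest gap between the empirical CDF of a column and the CDF of a fitted candidate. The sketch below illustrates that general technique with the standard library. It is an assumption-laden illustration, not a description of how Univariate makes its choice, and the names ks_statistic, candidates and best are made up for it.

```python
import math
import random
from statistics import NormalDist, fmean


def ks_statistic(values, cdf):
    """Largest distance between the empirical CDF of values and a model CDF."""
    ordered = sorted(values)
    n = len(ordered)
    worst = 0.0
    for i, v in enumerate(ordered):
        model = cdf(v)
        # Compare against the empirical CDF just after and just before v.
        worst = max(worst, abs((i + 1) / n - model), abs(model - i / n))
    return worst


# A toy column that is exponential by construction.
rng = random.Random(0)
column = [rng.expovariate(2.0) for _ in range(500)]

# Fit each candidate with simple moment estimates.
normal = NormalDist.from_samples(column)
rate = 1.0 / fmean(column)
candidates = {
    "normal": normal.cdf,
    "exponential": lambda v: 1.0 - math.exp(-rate * v) if v > 0 else 0.0,
}

scores = {name: ks_statistic(column, cdf) for name, cdf in candidates.items()}
best = min(scores, key=scores.get)  # candidate with the smallest KS distance
```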

Probability Density and Cumulative Distribution

The probability density and cumulative distribution can be computed for an array of data points using the probability_density and cumulative_distribution methods respectively.

[12]:
probability_density = dist.probability_density(sampled)
[13]:
probability_density[0:5]
[13]:
array([0.01742569, 0.00370378, 0.04514849, 0.03004175, 0.00317649])
[14]:
cumulative_distribution = dist.cumulative_distribution(sampled)
[15]:
cumulative_distribution[0:5]
[15]:
array([7.29750584e-01, 6.95784204e-04, 5.85808565e-02, 7.81918856e-02,
       3.54056835e-02])
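
For reference, the textbook joint density of a two-variable Gaussian copula model is the copula density multiplied by the two marginal densities. The sketch below implements that textbook formula with the standard library. It is not taken from the library's source, so the values returned by the methods above may be defined or scaled differently. The function names are made up for this illustration.

```python
import math
from statistics import NormalDist


def gaussian_copula_density(u, v, rho):
    """Density of the bivariate Gaussian copula at (u, v), with 0 < u, v < 1."""
    std_normal = NormalDist()
    a, b = std_normal.inv_cdf(u), std_normal.inv_cdf(v)
    det = 1.0 - rho ** 2
    exponent = -(rho ** 2 * (a * a + b * b) - 2.0 * rho * a * b) / (2.0 * det)
    return math.exp(exponent) / math.sqrt(det)


def joint_density(x, y, rho, marginal_x, marginal_y):
    """Copula density times the marginal densities.

    marginal_x and marginal_y are any objects exposing cdf and pdf methods,
    for example statistics.NormalDist instances.
    """
    copula_term = gaussian_copula_density(marginal_x.cdf(x), marginal_y.cdf(y), rho)
    return copula_term * marginal_x.pdf(x) * marginal_y.pdf(y)


# With standard normal marginals the result is the bivariate normal density.
std = NormalDist()
value = joint_density(0.0, 0.0, 0.5, std, std)
```

With rho equal to 0 the copula term is 1 everywhere, so the joint density reduces to the product of the marginals, which is what independence means.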

to_dict and from_dict

Like Univariate, Multivariate classes implement a to_dict method that returns all the parameters that define the distribution in a Python dictionary.

[16]:
parameters = dist.to_dict()
parameters.keys()
[16]:
dict_keys(['correlation', 'univariates', 'columns', 'type'])

In the case of GaussianMultivariate, this contains information about the correlation that defines the joint distribution:

[17]:
parameters['correlation']
[17]:
[[1.0, -0.021508199560644915, -0.03904683794009123],
 [-0.021508199560644915, 1.0, 0.709864802930119],
 [-0.03904683794009123, 0.709864802930119, 1.0]]

And the parameters of the univariates used for each column:

[18]:
parameters['univariates']
[18]:
[{'loc': 1.3765984634140687e-23,
  'scale': 1.0000000000000002,
  'a': 0.09657613485947575,
  'b': 0.10226371906555584,
  'type': 'copulas.univariate.beta.BetaUnivariate'},
 {'loc': 2.619420845777311e-49,
  'scale': 0.999986655706105,
  'a': 0.11353643335098354,
  'b': 0.5621697120496405,
  'type': 'copulas.univariate.beta.BetaUnivariate'},
 {'df': 1.4006628830710697,
  'loc': 0.4691588432897164,
  'scale': 1.247945331065269,
  'type': 'copulas.univariate.student_t.StudentTUnivariate'}]

Finally, this parameters dictionary can later be passed to the Multivariate.from_dict class method, which creates an instance of our model with the same parameters as before.

[19]:
from copulas.multivariate import Multivariate

new_dist = Multivariate.from_dict(parameters)
[20]:
new_dist
[20]:
GaussianMultivariate()
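
Since the dictionary shown above holds only numbers, strings and lists, one plausible use is to store it as JSON between sessions. The sketch below assumes that everything to_dict returns is JSON-compatible, which the printed output suggests but does not guarantee. A hand-written, partial stand-in with values copied from the output above is used in place of dist.to_dict(), so the block runs without the library.

```python
import json

# Partial stand-in for dist.to_dict(), copied from the output shown above.
parameters = {
    "correlation": [
        [1.0, -0.021508199560644915, -0.03904683794009123],
        [-0.021508199560644915, 1.0, 0.709864802930119],
        [-0.03904683794009123, 0.709864802930119, 1.0],
    ],
    "univariates": [
        {
            "loc": 1.3765984634140687e-23,
            "scale": 1.0000000000000002,
            "a": 0.09657613485947575,
            "b": 0.10226371906555584,
            "type": "copulas.univariate.beta.BetaUnivariate",
        },
    ],
}

serialized = json.dumps(parameters)  # text that could be written to a file
restored = json.loads(serialized)    # dictionary to hand to from_dict later
```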

Vine Copulas

Copulas also implements Vine Copulas. Vine copulas work by building a vine (a set of trees) over the columns in the dataset and estimating the pairwise (i.e. bivariate) relationship between the nodes on every edge.

There are three types of Vine copulas: center (C-vine), regular (R-vine), and direct (D-vine).

[21]:
from copulas.multivariate import VineCopula

center = VineCopula('center')
regular = VineCopula('regular')
direct = VineCopula('direct')

center.fit(data)
regular.fit(data)
direct.fit(data)

center_samples = center.sample(1000)
regular_samples = regular.sample(1000)
direct_samples = direct.sample(1000)
[22]:
scatter_3d(data, title='Real Data')
[23]:
scatter_3d(center_samples, title='C-Vine')
[24]:
scatter_3d(regular_samples, title='R-Vine')
[25]:
scatter_3d(direct_samples, title='D-Vine')
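
To make the difference between vine types more concrete, the sketch below builds the first tree of a C-vine and of a D-vine. In a C-vine each tree is a star around one central node, while in a D-vine each tree is a path. An R-vine allows any tree and is not shown. Four hypothetical columns are used because with three columns a star and a path have the same shape. The choice of root and ordering here is arbitrary, whereas a real implementation would typically derive them from the data. The function names are made up for this illustration.

```python
def star_edges(columns, root):
    """First tree of a C-vine: every other column is linked to the root."""
    return [(root, other) for other in columns if other != root]


def path_edges(columns):
    """First tree of a D-vine: each column is linked to the next one."""
    return list(zip(columns, columns[1:]))


columns = ["a", "b", "c", "d"]  # hypothetical column names

c_vine_first_tree = star_edges(columns, root="b")
d_vine_first_tree = path_edges(columns)
```

Each edge is a pair of columns whose bivariate relationship would be estimated. Both trees have one edge fewer than the number of columns.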