Introduction to Copulas

Probability Review

Let’s start by reviewing some basic probability concepts.

We’ll focus specifically on continuous random variables, which is what the Copulas library is primarily intended to support.

Probability Density Function

A probability density function \(f(x)\) captures the likelihood that a random sample from the distribution is equal to \(x\). For example, the probability density function for the standard normal distribution is given by

\begin{equation} f(x) = \frac{1}{2 \pi} e^{-x^2/2} \end{equation}

Note that the probability density function does not return a probability but rather a “relative likelihood” which can take on values in the interval \([0, \infty)\); however, the integral over the probability density function from \(-\infty\) to \(\infty\) must be equal to one.

Cumulative Distribution Function

In many cases, the probability density function can be hard to work with directly. Instead, we will use the cumulative distribution function \(F(x)\) which is defined as the integral of the probability density function

\begin{equation} F(x) = \int_{-\infty}^x f(x) \end{equation}

The below figure shows the probability density function \(f(x)\) and the cumulative distribution function \(F(x)\) for a normal standard distribution with mean \(0.0\) and variance \(1\).

[2]:

import plotly.graph_objects as go
from plotly.subplots import make_subplots
import numpy as np
import scipy.stats as stats

def plot_cdf_pdf_plotly():
    # Generate 10000 evenly distributed values from -4 to 4
    x = np.linspace(-4.0, 4.0, 10000)

    # Compute their Probability Densities and Cumulative Distributions
    pdf = stats.norm.pdf(x)
    cdf = stats.norm.cdf(x)

    fig = make_subplots(rows=1, cols=2, subplot_titles=("PDF", "CDF"))

    fig.add_trace(
        go.Scatter(x=x, y=pdf),
        row=1, col=1
    )
    fig.update_xaxes(title_text="x", row=1, col=1)
    fig.update_yaxes(title_text="f(x)", row=1, col=1)

    fig.add_trace(
        go.Scatter(x=x, y=cdf),
        row=1, col=2
    )
    fig.update_xaxes(title_text="x", row=1, col=2)
    fig.update_yaxes(title_text="F(x)", row=1, col=2)

    # Update yaxis properties

    fig.update_layout(height=400, width=900, showlegend=False)
    fig.show()

plot_cdf_pdf_plotly()

Probability Integral Transform

The probability integral transform is a key component in our toolkit for working with probability distributions. Suppose we have a random variable \(X\) that comes from a distribution with cumulative density function \(F(X)\). Then, we can define a random variable \(Y\) as

\begin{equation} Y = F(X) \end{equation}

and prove that \(Y\) follows a uniform distribution over the interval \([0.0, 1.0]\).

The figure below shows an example of this. We sample some data from a normal distribution and plot it on the left. Then, we use the CDF of the normal distribution to transform the data, plot it on the right, and observe that it resembles an uniform distribution.

[3]:

from scipy import stats

X = stats.norm.rvs(size=10000)
X_pit = stats.norm.cdf(X)

fig = make_subplots(rows=1, cols=2, subplot_titles=("Samples", "Transformed Samples"))

fig.add_trace(
    go.Histogram(x=X),
    row=1, col=1
)

fig.add_trace(
    go.Histogram(x=X_pit),
    row=1, col=2
)

fig.update_layout(height=400, width=900, showlegend=False)
fig.show()

Copulas

The key intuition underlying copula functions is the idea that marginal distributions can be modeled independently from the joint distribution. For example, consider a dataset with two columns containing age and income. A copula-based modeling approach would:

Model age and income independently, transforming them into uniform distributions using the probability integral transform explained above.
Model the relationship between the transformed variables using the copula function.

In this section, we demonstrate a simplified example of a Gaussian copula.

[4]:

from copulas.datasets import sample_bivariate_age_income

df = sample_bivariate_age_income()
df.head()

[4]:

	age	income
0	48.935913	399.161393
1	39.234323	364.225531
2	55.659901	406.475105
3	31.810637	341.276022
4	65.342336	414.347815

[5]:

from copulas.visualization import scatter_2d

scatter_2d(df)

Here’s what the age and income variables look like separately.

[6]:

from copulas.visualization import dist_1d

dist_1d(df['age'], title='Age')

[7]:

dist_1d(df['income'], title='Income')

To model this using a Gaussian copula, we can simply run the following:

[8]:

from copulas.multivariate import GaussianMultivariate

copula = GaussianMultivariate()
copula.fit(df)

The GaussianMultivariate class will automatically transform the columns using the best available distribution; let’s take a look at what the transformed age and income variables look like.

[9]:

age_cdf = copula.univariates[0].cdf(df['age'])
dist_1d(age_cdf, title='Age')

[10]:

income_cdf = copula.univariates[1].cdf(df['income'])
dist_1d(income_cdf, title='income')

Note that this transformed data looks much more uniform than the original values. Using this transformed data, we can then model the relationship between age and income more easily and generate some synthetic data.

[11]:

synthetic = copula.sample(len(df))

[12]:

synthetic.head()

[12]:

	age	income
0	41.413145	403.587899
1	69.353723	407.674323
2	62.039631	376.736375
3	30.928629	17.809401
4	41.420123	403.134000

[13]:

from copulas.visualization import compare_2d

compare_2d(df, synthetic)