Method 2 — Copulas with marginal distributions
In the previous section, we looked at a multivariate normal distribution. Both chocolate and vanilla ice cream sales followed a normal distribution each and they are correlated. This made it easy to represent them in a multivariate normal distribution.
Sometimes we may have 2 or more variables that come from different statistical distributions and there is no known multivariate distribution that can explain their combined distribution. If we do know the individual (marginal) distributions of the variables + we know the correlations between them, then copulas can help.
A detailed explanation of copulas is out of scope, but simply put…
Suppose the lowest possible value a variable can have is 0 and the highest is 1. Copulas generate a combination of values in this range such that the relationship implied by the correlation between the variables is maintained. Since our variables are chocolate and vanilla ice cream sales and they are positively correlated, copulas will draw values for growth in chocolate ice cream sales closer to 1 (highest value) when vanilla also has a high value and vice verse.
There are existing libraries that can be installed to utilize copulas, but below we’ll use a hack to construct a copula (a Gaussian one) from scratch and it might help to understand the process better.
First, we need to recalculate the correlation between our 2 variables, chocolate and vanilla sales growth, because copulas are based on rank correlation. In the previous section we calculated the Pearson Correlation Coefficient, but now we will use Kendall’s Tau which is a measure of rank correlation. Rank correlation is based on calculating the correlation between the ranks of the values rathen than the values themselves.
corr = ss.kendalltau(
df.dropna().iloc[:, 0], df.dropna().iloc[:, 1])corr
Now we can construct our copula…
# We set variances to 1 because the covariance matrix we
# are constructing will be used with a multivariate normal
# distribution of means 0 and std 1 to derive a copula
cov_matrix = [[1, corr],
[corr, 1]]# We will draw 1000 combinations from the distribution
random_vals = ss.multivariate_normal(cov=cov).rvs(
1000, random_state=1)# Finally a cumulative density function for a distribution
# ranges between 0 to 1 so it will be used to convert the
# drawn samples to the values we need for our copula
copula = ss.norm.cdf(random_vals)
We can see the copula has 1000 correlated entries across 2 variables and they range from 0 to 1.
Below is a scatter plot of the correlated copula values for our 2 variables.
sns.scatterplot(x=copula[:, 0], y=copula[:, 1])
The copula values between 0 and 1 capture the correlation between the variables, so now we just have to convert the 0–1 values to the actual values we want to use for the sales growth rates of chocolate and vanilla ice cream.
We use percent point functions for the distributions of each variable to the conversion. We have already estimated the distributions in the independent parametric simulation section (“Wrong method 1”). The difference is that we now don’t just randomly draw values from each distribution, instead, we use the copula values and percent point functions to draw correlated random values from each distribution.
# distribution_choc and distribution_van area already
# calculated in previous sections
growth_chocolate = distribution_choc.ppf(copula[:, 0])
growth_vanilla = distribution_van.ppf(copula[:, 1])
Finally, let’s pass those growth rates to our function to check the likelihood of vanilla sales exceeding chocolate…
The probability of 11% is similar to the probability from the previous section and confirms that we have grossly overestimated the probability in the 2 wrong approaches we applied at first.
There are several steps in this approach and probably a more thorough explanation of different concepts is needed. The one thing to keep in mind though is that the values between 0 and 1 that we generated using our copula can be used with any combination of statistical distributions.
Our example is easy where both variables’ distributions are normal, so the approach in the previous section (multivariate normal distribution) is sufficient and actually should yield the same results as this section. But imagine a different scenario where variable A is normally distribution, variable B is binomial etc. In such cases, copulas would be needed.
A hypothetical scenario would be forecasting movie ticket sales and popcorn sales where we model the number of cinema visitors as a Poisson distribution and then we model the popcorn sales using a Binomial distribution where every cinema visitor who bought a ticket has a certain probability of buying popcorn.