
Kudos on making it to the final part of Mathematics for Machine Learning! You’ve worked hard to get to this point, and we shall wind this up in the next few minutes!
If you haven’t gone through Part-1, Part-2, Part-3 and Part-4 yet, do it right now!
Probability and Statistics are the foundational pillars of Machine Learning and Data Science. In fact, the underlying principle of machine learning and artificial intelligence is nothing but statistical mathematics and linear algebra.
You will often need to read research papers that involve a lot of maths in order to understand a particular topic, so if you want to get better at it, a strong mathematical understanding is imperative.
- Descriptive Statistics
- Dispersion
- Random Variables
- Probability Distributions
Getting Started
We shall first look at what data analysis is and at the measures of central tendency in Python: mean, median, and mode. We will then discuss dispersion and descriptive statistics with pandas, including how to calculate the variance, that is, the variability of a set of values.
For more on this, refer to: Statistical Analysis in Python using Pandas
Data Analysis
With data analysis, we use two main statistical methods: Descriptive and Inferential.
- Descriptive statistics uses tools like mean and standard deviation on a sample to summarize data.
- Inferential statistics, on the other hand, looks at data that can randomly vary, and then draws conclusions from it.
Some such variations include observational errors and sampling variation.
Descriptive Statistics in Python
Descriptive statistics describe the basic features of the data in a study. They deliver summaries of the sample and its measures, and do not use the data to learn about the population it represents.
Two sets of properties fall under descriptive statistics: central tendency and dispersion. Central tendency characterizes one central value for the entire distribution. Measures under this include mean, median, and mode.
Dispersion characterizes how far apart the members of the distribution are from the center and from each other. Variance/Standard Deviation is one such measure of variability.
Implementation:
We shall begin by importing Python’s statistics library:
import statistics as st
mean()
This function returns the arithmetic average of the data it operates on.
nums=[1,2,3,5,7,9]
st.mean(nums)
Out:
4.5
mode()
This function returns the most common value in a set of data. This gives us an idea of the value around which the data cluster most often.
nums=[1,2,3,5,7,9,7,2,7,6]
st.mode(nums)
Out:
7
median()
For data of odd length, this returns the middle item; for that of even length, it returns the average of the two middle items.
st.median(nums)  # the two middle items of the sorted data are 5 and 6: (5+6)/2
Out:
5.5
harmonic_mean()
This function returns the harmonic mean of the data. For three values a, b, and c, the harmonic mean is 3/(1/a + 1/b + 1/c).
It is a measure of the centre suited to averaging rates; one such example would be speed.
st.harmonic_mean([2,4,9.7])
Out:
3.516616314199396
For the same set of data, the arithmetic mean would give us a value of 5.233.
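To make the speed example concrete, here is a minimal sketch. The trip is hypothetical: the same distance is driven once at 40 km/h and once at 60 km/h. The harmonic mean gives the true average speed (total distance divided by total time), while the arithmetic mean overstates it.

```python
import statistics as st

# Hypothetical trip: the same distance is covered once at 40 km/h and once at 60 km/h.
speeds = [40, 60]

average_speed = st.harmonic_mean(speeds)  # total distance / total time
naive_average = st.mean(speeds)           # arithmetic mean, overstates the average speed

print(average_speed, naive_average)
```

The harmonic mean is lower because more time is spent travelling at the slower speed.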
median_low()
When the data is of an even length, this returns the lower of the two middle values. Otherwise, it returns the middle value.
st.median_low([1,2,4])
Out:
2
median_high()
Like median_low, this returns the higher of the two middle values when the data is of an even length. Otherwise, it returns the middle value.
st.median_high([1,2,4])
Out:
2
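The two examples above use data of odd length, so both functions return the same middle value. A small sketch with an even-length list (the values are chosen only for illustration) shows where they differ:

```python
import statistics as st

even_data = [1, 2, 4, 6]  # even length: the two middle values are 2 and 4

low = st.median_low(even_data)    # lower of the two middle values
high = st.median_high(even_data)  # higher of the two middle values
mid = st.median(even_data)        # average of the two middle values

print(low, high, mid)
```

Unlike median(), both median_low() and median_high() always return a value that actually occurs in the data.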
median_grouped()
This function uses interpolation to return the median of grouped continuous data. This is the 50th percentile.
st.median([1,3,3,5,7])
Out:
3
st.median_grouped([1,3,3,5,7],interval=1)
Out:
3.25
st.median_grouped([1,3,3,5,7],interval=2)
Out:
3.5
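To see where 3.25 and 3.5 come from, the sketch below applies the usual textbook interpolation for grouped data: each value is treated as the midpoint of a class of width interval, and the median is placed proportionally inside the class that contains it. The helper grouped_median is a hypothetical name written for illustration, and it is assumed here (not taken from the library source) that median_grouped follows the same formula.

```python
import statistics as st

def grouped_median(data, interval=1):
    """Interpolated median, treating each value as the midpoint of a class of width `interval`."""
    values = sorted(data)
    n = len(values)
    x = values[n // 2]                  # midpoint of the class containing the median
    lower = x - interval / 2            # lower boundary of that class
    below = sum(v < x for v in values)  # number of observations in lower classes
    freq = values.count(x)              # number of observations in the median class
    return lower + interval * (n / 2 - below) / freq

data = [1, 3, 3, 5, 7]
print(grouped_median(data, interval=1), st.median_grouped(data, interval=1))
print(grouped_median(data, interval=2), st.median_grouped(data, interval=2))
```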
Python Descriptive Statistics — Dispersion in Python
Dispersion/spread gives us an idea of how the data strays from the typical value.
variance()
This returns the variance of the sample, computed by dividing the sum of squared deviations from the mean by n−1. A larger value denotes a rather spread-out set of data. You can use this when your data is a sample out of a population.
st.variance(nums)
Out:
7.433333333333334
pvariance()
This returns the population variance of data. Use this to calculate the variance from an entire population.
st.pvariance(nums)
Out:
6.69
stdev()
This returns the standard deviation for the sample. This is equal to the square root of the sample variance.
st.stdev(nums)
Out:
2.7264140062238043
pstdev()
This returns the population standard deviation. This is the square root of the population variance.
st.pstdev(nums)
Out:
2.5865034312755126
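The four dispersion functions differ only in the divisor and in whether a square root is taken. As a rough sketch, the same numbers can be reproduced by hand from the sum of squared deviations:

```python
import math
import statistics as st

nums = [1, 2, 3, 5, 7, 9, 7, 2, 7, 6]

n = len(nums)
mean = sum(nums) / n
squared_deviations = sum((x - mean) ** 2 for x in nums)

sample_variance = squared_deviations / (n - 1)  # divisor n - 1, as in st.variance
population_variance = squared_deviations / n    # divisor n, as in st.pvariance

print(sample_variance, math.sqrt(sample_variance))
print(population_variance, math.sqrt(population_variance))
```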
Pandas with Descriptive Statistics in Python
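We can do the same things using pandas too (see Statistical Analysis in Python using Pandas). Note that pandas’ std() defaults to the sample standard deviation (ddof=1), so it matches st.stdev() rather than st.pstdev().
import pandas as pd
df=pd.DataFrame(nums)
df.mean()
df.mode()
df.std()
df.skew()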
Random Variables
A random variable assigns a numerical value to each possible outcome of a random experiment. There are two kinds of random variables, continuous and discrete.
Events such as coin tosses, dice throws and card games can be represented using discrete random variables, while values such as body temperature, atmospheric pressure and student grade point averages (GPA) are represented by continuous random variables.
Another function that often pops up in the literature, and which you should know about, is the cumulative distribution function (CDF). All random variables (discrete and continuous) have a cumulative distribution function.
It is a function giving the probability that the random variable X is less than or equal to x, for every value x.
For a discrete random variable, the cumulative distribution function is found by summing up the probabilities of all values less than or equal to x.
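As a small illustration of that summing, the sketch below builds the cumulative distribution function of a fair six-sided die. The die is an example chosen here, and exact fractions are used so the running sums are easy to read.

```python
from fractions import Fraction
from itertools import accumulate

# Fair six-sided die: each face has probability 1/6.
faces = [1, 2, 3, 4, 5, 6]
pmf = [Fraction(1, 6)] * 6

# The CDF at each face is the running sum of the probabilities up to that face.
cdf = dict(zip(faces, accumulate(pmf)))

print(cdf[3])  # P(X <= 3)
print(cdf[6])  # P(X <= 6)
```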
Probability Distributions
1. Uniform Distribution
Perhaps one of the simplest and most useful distributions is the uniform distribution. The probability density function of the continuous uniform distribution on the interval (a,b) is:
f(x) = 1/(b−a) for a ≤ x ≤ b, and f(x) = 0 otherwise
Since the area under the curve must be equal to 1, the length of the interval determines the height of the curve. The following figure shows a uniform distribution in the interval (a,b).
Notice that, since the area needs to be 1 and the width of the interval is b−a, the height is set to 1/(b−a).
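The height-times-width argument can be checked with a few lines of plain Python. The helper uniform_pdf and the interval (10, 30) are illustrative choices, not part of scipy:

```python
def uniform_pdf(x, a, b):
    """Density of the continuous uniform distribution on [a, b]."""
    return 1 / (b - a) if a <= x <= b else 0.0

a, b = 10, 30                   # illustrative interval
height = uniform_pdf(20, a, b)  # constant height inside the interval
area = height * (b - a)         # rectangle: height times width

print(height, area)
```

Widening the interval lowers the height, so the area stays at 1.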
You can visualize the uniform distribution in Python with the help of a random number generator acting over an interval of numbers (a,b). You need to import the uniform function from the scipy.stats module.
# import uniform distribution
from scipy.stats import uniform

# random numbers from uniform distribution
n = 10000
start = 10
width = 20
data_uniform = uniform.rvs(size=n, loc = start, scale=width)
The uniform function generates a uniform continuous variable over the interval specified via its loc and scale arguments. The distribution is constant between loc and loc + scale. The size argument sets the number of random variates. If you want to maintain reproducibility, include a random_state argument assigned to a number.
You can use Seaborn’s distplot to plot the histogram of the distribution you just created. Note that distplot has been deprecated since seaborn 0.11 in favour of histplot and displot, so newer versions may warn or no longer provide it. Seaborn’s distplot takes in multiple arguments to customize the plot. You first create a plot object ax. Here, you can specify the number of bins in the histogram, specify the color of the histogram, and specify the density plot option with kde and the linewidth option with hist_kws. You can also set labels for the x and y axes using the xlabel and ylabel arguments.
For more on Seaborn, refer to: Data Visualization using Python Part-II
import seaborn as sns

ax = sns.distplot(data_uniform,
                  bins=100,
                  kde=True,
                  color='skyblue',
                  hist_kws={"linewidth": 15,'alpha':1})
ax.set(xlabel='Uniform Distribution ', ylabel='Frequency')
Out:
[Text(0,0.5,u'Frequency'), Text(0.5,0,u'Uniform Distribution ')]
2. Normal Distribution
Normal Distribution, also known as Gaussian distribution, is ubiquitous in Data Science.
You will encounter it in many places, especially in topics of statistical inference. It is also one of the assumptions of many data science algorithms.
A normal distribution has a bell-shaped density curve described by its mean μ and standard deviation σ. The density curve is symmetrical, centred about its mean, with its spread determined by its standard deviation showing that data near the mean are more frequent in occurrence than data far from the mean.
The probability density function of a normal distribution with mean μ and standard deviation σ at a given point x is given by:
f(x | μ, σ) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²))
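As a quick sanity check, the normal density f(x) = exp(−(x − μ)² / (2σ²)) / (σ√(2π)) can be written out directly and compared with NormalDist from Python's standard statistics module (available from Python 3.8). The helper normal_pdf is a name made up for this sketch.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Normal density written out from the formula."""
    coefficient = 1 / (sigma * math.sqrt(2 * math.pi))
    return coefficient * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

standard = NormalDist(mu=0, sigma=1)

peak = normal_pdf(0, 0, 1)                             # height of the curve at the mean
within_one_sigma = standard.cdf(1) - standard.cdf(-1)  # probability mass within one sigma of the mean

print(peak, standard.pdf(0), within_one_sigma)
```

Roughly 68% of the probability mass lies within one standard deviation of the mean, which is what "data near the mean are more frequent" means in numbers.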
Below is the figure describing what the distribution looks like:
Similar to the uniform distribution, we shall now proceed to the implementation:
from scipy.stats import norm
# generate random numbers from N(0,1)
data_normal = norm.rvs(size=10000,loc=0,scale=1)
Visualizing using Seaborn:
ax = sns.distplot(data_normal,
                  bins=100,
                  kde=True,
                  color='skyblue',
                  hist_kws={"linewidth": 15,'alpha':1})
ax.set(xlabel='Normal Distribution', ylabel='Frequency')
Out:
[Text(0,0.5,u'Frequency'), Text(0.5,0,u'Normal Distribution')]
3. Gamma Distribution
The gamma distribution is a two-parameter family of continuous probability distributions.
Although it is rarely used in its raw form, other popular distributions such as the exponential, chi-squared, and Erlang distributions are special cases of the gamma distribution.
The gamma distribution can be parameterized in terms of a shape parameter α and an inverse scale parameter β=1/θ, called the rate parameter. Γ denotes the gamma function; for a positive integer n, Γ(n) = (n−1)!. The probability density function is:
f(x; α, β) = β^α x^(α−1) e^(−βx) / Γ(α) for x > 0
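Two of the claims above can be checked numerically: that Γ(n) equals (n−1)! for positive integers, and that the exponential distribution is a special case of the gamma distribution. The sketch assumes the standard shape/rate form of the density, f(x; α, β) = β^α x^(α−1) e^(−βx) / Γ(α), and gamma_pdf is a helper written only for this illustration.

```python
import math

def gamma_pdf(x, alpha, beta):
    """Gamma density with shape alpha and rate beta, for x > 0 (assumed shape/rate form)."""
    return beta ** alpha * x ** (alpha - 1) * math.exp(-beta * x) / math.gamma(alpha)

# Gamma(n) equals (n - 1)! for positive integers n.
factorial_check = all(
    math.isclose(math.gamma(n), math.factorial(n - 1)) for n in range(1, 10)
)

# With shape alpha = 1 the gamma density reduces to the exponential density.
rate, x = 2.0, 0.7  # illustrative values
exponential_density = rate * math.exp(-rate * x)

print(factorial_check, gamma_pdf(x, 1, rate), exponential_density)
```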
A typical gamma distribution looks like:
Implementation:
from scipy.stats import gamma
data_gamma = gamma.rvs(a=5, size=10000)
Visualizing using Seaborn:
ax = sns.distplot(data_gamma,
                  kde=True,
                  bins=100,
                  color='skyblue',
                  hist_kws={"linewidth": 15,'alpha':1})
ax.set(xlabel='Gamma Distribution', ylabel='Frequency')
Out:
[Text(0,0.5,u'Frequency'), Text(0.5,0,u'Gamma Distribution')]
4. Exponential Distribution
The exponential distribution describes the time between events in a Poisson point process, i.e., a process in which events occur continuously and independently at a constant average rate. It has a parameter λ called the rate parameter, and its probability density function is:
f(x; λ) = λ e^(−λx) for x ≥ 0, and f(x; λ) = 0 for x < 0
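The role of the rate parameter can be illustrated with the standard library alone. The sketch below assumes the usual density λ e^(−λx) and an illustrative rate of λ = 2; with that rate, the average simulated waiting time should land close to 1/λ. When using scipy instead, keep in mind that expon is parameterized by scale = 1/λ rather than by λ itself.

```python
import math
import random

def exponential_pdf(x, lam):
    """Exponential density with rate lam."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

lam = 2.0               # illustrative rate: two events per unit time on average
rng = random.Random(0)  # seeded so the run is reproducible

# Simulated waiting times between consecutive events.
waits = [rng.expovariate(lam) for _ in range(100_000)]
average_wait = sum(waits) / len(waits)

print(exponential_pdf(0, lam), average_wait, 1 / lam)
```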
A decreasing exponential distribution looks like :
Implementation:
from scipy.stats import expon
data_expon = expon.rvs(scale=1,loc=0,size=1000)
Visualizing using Seaborn:
ax = sns.distplot(data_expon,
                  kde=True,
                  bins=100,
                  color='skyblue',
                  hist_kws={"linewidth": 15,'alpha':1})
ax.set(xlabel='Exponential Distribution', ylabel='Frequency')
Out:
[Text(0,0.5,u'Frequency'), Text(0.5,0,u'Exponential Distribution')]
5. Poisson Distribution
A Poisson random variable is typically used to model the number of times an event happens in a time interval. For example, the number of users visiting a website in an interval can be thought of as a Poisson process.
The Poisson distribution is described in terms of the rate at which the events happen. An event can occur 0, 1, 2, … times in an interval. The average number of events in an interval is designated λ (lambda); scipy calls this parameter mu.
Lambda is the event rate, also called the rate parameter. The probability of observing k events in an interval is given by the equation:
P(k events in interval) = λ^k e^(−λ) / k!
The normal distribution is a limiting case of the Poisson distribution as the parameter λ→∞. Also, if the times between random events follow an exponential distribution with rate λ, then the total number of events in a time period of length t follows the Poisson distribution with parameter λt.
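A minimal sketch of the probability of observing k events, assuming the standard Poisson formula P(k) = λ^k e^(−λ) / k! and an illustrative rate of λ = 3. It also checks that the probabilities sum to (almost) 1 and that the mean comes out as λ.

```python
import math

def poisson_pmf(k, lam):
    """Probability of observing exactly k events when the average count is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

lam = 3  # illustrative average number of events per interval
probabilities = [poisson_pmf(k, lam) for k in range(50)]

total = sum(probabilities)                               # close to 1
mean = sum(k * p for k, p in enumerate(probabilities))   # close to lam

print(poisson_pmf(3, lam), total, mean)
```

The sum is truncated at k = 49, which is far enough into the tail for λ = 3 that the missing mass is negligible.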
The following figure shows a typical Poisson distribution:
Implementation:
from scipy.stats import poisson
data_poisson = poisson.rvs(mu=3, size=10000)
Visualizing using Seaborn:
ax = sns.distplot(data_poisson,
                  bins=30,
                  kde=False,
                  color='skyblue',
                  hist_kws={"linewidth": 15,'alpha':1})
ax.set(xlabel='Poisson Distribution', ylabel='Frequency')
Out:
[Text(0,0.5,u'Frequency'), Text(0.5,0,u'Poisson Distribution')]
6. Binomial Distribution
A distribution where only two outcomes are possible in each trial, such as success or failure, gain or loss, win or lose, and where the probability of success is the same for all the trials is called a Binomial Distribution.
The two outcomes need not be equally likely, and the trials are independent of each other.
The parameters of a binomial distribution are n and p, where n is the total number of trials and p is the probability of success in each trial.
Its probability mass function is given by:
P(X = k) = C(n, k) p^k (1 − p)^(n − k), for k = 0, 1, …, n
where: C(n, k) = n! / (k! (n − k)!) is the number of ways to choose k successes out of n trials.
Implementation:
from scipy.stats import binom
data_binom = binom.rvs(n=10,p=0.8,size=10000)
Visualizing using Seaborn:
ax = sns.distplot(data_binom,
                  kde=False,
                  color='skyblue',
                  hist_kws={"linewidth": 15,'alpha':1})
ax.set(xlabel='Binomial Distribution', ylabel='Frequency')
Out:
[Text(0,0.5,u'Frequency'), Text(0.5,0,u'Binomial Distribution')]
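Since the probability of success was greater than 0.5, most of the probability mass sits on the right side of the plot and the longer tail extends to the left; in statistical terms, the distribution is left-skewed (negatively skewed).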
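The shape of this distribution can be quantified without sampling. The sketch below assumes the standard binomial probability mass function, C(n, k) p^k (1 − p)^(n − k), and computes the mean, variance and skewness directly from it for n=10 and p=0.8. The helper binomial_pmf is written for illustration and uses math.comb (Python 3.8+).

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10, 0.8
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

mean = sum(k * q for k, q in enumerate(pmf))                    # equals n * p
variance = sum((k - mean) ** 2 * q for k, q in enumerate(pmf))  # equals n * p * (1 - p)
skewness = sum((k - mean) ** 3 * q for k, q in enumerate(pmf)) / variance ** 1.5

print(mean, variance, skewness)
```

A negative skewness means the longer tail points toward smaller values, which is what happens whenever p is greater than 0.5.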
Also, Poisson distribution is a limiting case of a binomial distribution under the following conditions:
- The number of trials is indefinitely large or n→∞.
- The probability of success for each trial is the same and indefinitely small, or p→0.
- np=λ, is finite.
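The three conditions above can be watched at work numerically. In this sketch, n grows while p = λ/n shrinks so that np stays fixed at an illustrative λ = 3, and the largest gap between the binomial and Poisson probabilities is recorded for each n. Both helper functions assume the standard formulas and are written only for this example.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Probability of exactly k events when the average count is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

lam = 3  # illustrative fixed value of n * p
gaps = {}
for n in (10, 100, 10_000):
    p = lam / n  # p shrinks as n grows, keeping n * p equal to lam
    gaps[n] = max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(11))

print(gaps)
```

The gap shrinks as n increases, which is the sense in which the Poisson distribution is a limiting case of the binomial.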
Normal distribution is another limiting form of binomial distribution under the following conditions:
- The number of trials is indefinitely large, n→∞.
- Neither p nor q (where q = 1−p) is indefinitely small.
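A rough numerical illustration of the normal limit: for a large n and a p that is not close to 0 or 1, the binomial probabilities are close to the density of a normal distribution with mean np and standard deviation √(npq). The values n=1000 and p=0.5 are arbitrary choices for this sketch, which uses NormalDist from the standard library (Python 3.8+).

```python
import math
from statistics import NormalDist

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 1000, 0.5  # illustrative values: many trials, p far from 0 and 1
q = 1 - p
approximation = NormalDist(mu=n * p, sigma=math.sqrt(n * p * q))

# Compare the exact binomial probability with the normal density at a few points.
comparison = {k: (binomial_pmf(k, n, p), approximation.pdf(k)) for k in (480, 500, 520)}

for k, (exact, approx) in comparison.items():
    print(k, exact, approx)
```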
7. Bernoulli Distribution
A Bernoulli distribution has only two possible outcomes, namely 1 (success) and 0 (failure), and a single trial, for example, a coin toss.
So the random variable X which has a Bernoulli distribution can take value 1 with the probability of success, p, and the value 0 with the probability of failure, q or 1−p.
The probabilities of success and failure need not be equally likely.
The Bernoulli distribution is a special case of the binomial distribution where a single trial is conducted (n=1).
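The special-case relationship can be checked directly. The sketch assumes the standard probability mass functions, p^k (1 − p)^(1 − k) for the Bernoulli distribution and C(n, k) p^k (1 − p)^(n − k) for the binomial. Both helpers are illustrative names, and p = 0.6 is an arbitrary choice.

```python
import math

def bernoulli_pmf(k, p):
    """P(X = k) for a single trial, with k equal to 0 (failure) or 1 (success)."""
    return p ** k * (1 - p) ** (1 - k)

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

p = 0.6  # illustrative probability of success
for k in (0, 1):
    print(k, bernoulli_pmf(k, p), binomial_pmf(k, 1, p))
```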
Its probability mass function is given by:
P(X = k) = p^k (1 − p)^(1 − k), for k ∈ {0, 1}
Implementation:
from scipy.stats import bernoulli
data_bern = bernoulli.rvs(size=10000,p=0.6)
Visualizing using Seaborn:
ax = sns.distplot(data_bern,
                  kde=False,
                  color="skyblue",
                  hist_kws={"linewidth": 15,'alpha':1})
ax.set(xlabel='Bernoulli Distribution', ylabel='Frequency')
Out:
[Text(0,0.5,u'Frequency'), Text(0.5,0,u'Bernoulli Distribution')]
That’s all for Mathematics for Machine Learning! I know that’s a lot to take in at once! But you made it until the end! Kudos on that!
There are a lot of other good resources if you’re still interested in getting the most out of this topic —
For the complete implementation, do check out my GitHub Repository —
*link to be inserted shortly*
To contact, or for further queries, feel free to drop a mail at — [email protected]