## Warm-up topics for review in preparation for a data science journey in 2021

If you are considering beginning your data science journey in 2021, you may be wondering what prerequisites you need before starting a program. To give you an idea of the background knowledge required, below are examples of universities that offer online master’s degree programs in data science or business analytics, along with their admission requirements:

**Duke University**: A bachelor’s degree in science, technology, engineering, mathematics, business, economics, or an equivalent quantitative major by the start of the program.

**Kansas State University**: Completion of STAT 350 and STAT 351 (or equivalent statistics courses). Familiarity with computer programming and applications is highly recommended.

**Columbia University**: A bachelor’s degree in a related field of engineering or science.

**Harvard University**: There are no prerequisite courses required, but quantitative experience will be vital to your success in the program.

**Syracuse University**: An undergraduate degree in business, statistics, math, engineering, finance, information technology, physics, supply chain, or economics.

**University of Tulsa**: A four-year degree in business, engineering, or science.

**Walden University**: A bachelor’s degree in data science, computer science, information technology, or an equivalent subject.

From the requirements listed above, we see that the essential prerequisite for data science is background knowledge in a quantitative discipline. The good thing about most data science training programs is that you acquire the data science skills as you go. So besides some basic knowledge of math and programming, you don’t need any other prerequisites to succeed in the program.

If you plan to start your data science journey in 2021 and would like to familiarize yourself with some background knowledge before officially beginning your program, below are some warm-up topics to review.

## (I) Statistics and Probability

In data science and machine learning, statistics and probability are used for feature visualization, data preprocessing, feature transformation, data imputation, dimensionality reduction, feature engineering, model evaluation, etc. Here are the topics you need to be familiar with:

a) Mean

b) Median

c) Mode

d) Standard deviation/variance

e) Correlation coefficient and the covariance matrix

f) Probability distributions (Binomial, Poisson, Normal)

g) p-value

h) MSE (mean square error)

i) R² score

j) Bayes’ Theorem (Precision, Recall, Positive Predictive Value, Negative Predictive Value, Confusion Matrix, ROC Curve)

k) A/B Testing

l) Monte Carlo Simulation
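Several of the quantities listed above can be computed in a few lines. The following is a minimal sketch using only Python's standard library; the data values and the helper names `pearson_r`, `mse`, and `r2_score` are made up for illustration and are not taken from any particular course or library.

```python
import math
import statistics

# Hypothetical data, made up for illustration only.
hours_studied = [1.0, 2.0, 3.0, 4.0, 5.0]
exam_scores = [52.0, 58.0, 65.0, 70.0, 80.0]

# Descriptive statistics
mean_score = statistics.mean(exam_scores)
median_score = statistics.median(exam_scores)
std_score = statistics.stdev(exam_scores)      # sample standard deviation
var_score = statistics.variance(exam_scores)   # sample variance


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    mean_true = statistics.mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Predictions from the least-squares line fitted to the data above.
predictions = [44.6 + 6.8 * h for h in hours_studied]

print("mean:", mean_score, "median:", median_score)
print("std:", std_score, "variance:", var_score)
print("correlation:", pearson_r(hours_studied, exam_scores))
print("MSE:", mse(exam_scores, predictions))
print("R^2:", r2_score(exam_scores, predictions))
```

For a straight-line least-squares fit like this one, the R² score equals the square of the correlation coefficient, which is a handy sanity check when reviewing these topics.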

## (II) Multivariable Calculus

Most machine learning models are built on data sets with several features or predictors, so familiarity with multivariable calculus is important for building them. Here are the topics you need to be familiar with:

a) Functions of several variables

b) Derivatives and gradients

c) Step function, Sigmoid function, Logit function, ReLU (Rectified Linear Unit) function

d) Cost function

e) Plotting of functions

f) Minimum and Maximum values of a function
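To make a few of these topics concrete, here is a small sketch of the activation functions named above, together with a numerical gradient of a function of two variables. The example function `f` and the helper `numerical_gradient` are hypothetical and chosen only for illustration.

```python
import math


def step(z):
    """Step function: 1 for non-negative input, 0 otherwise."""
    return 1.0 if z >= 0 else 0.0


def sigmoid(z):
    """Sigmoid function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Logit function: the inverse of the sigmoid, defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def relu(z):
    """Rectified Linear Unit: zero for negative input, identity otherwise."""
    return max(0.0, z)


def f(x, y):
    """A hypothetical function of two variables with its minimum at (0, 0)."""
    return x ** 2 + 3.0 * y ** 2


def numerical_gradient(func, x, y, h=1e-6):
    """Approximate the gradient (df/dx, df/dy) with central differences."""
    df_dx = (func(x + h, y) - func(x - h, y)) / (2 * h)
    df_dy = (func(x, y + h) - func(x, y - h)) / (2 * h)
    return df_dx, df_dy


# The analytic gradient of f is (2x, 6y), so at (1, 2) it is (2, 12).
grad = numerical_gradient(f, 1.0, 2.0)
print("sigmoid(0):", sigmoid(0.0))
print("relu(-3):", relu(-3.0))
print("gradient of f at (1, 2):", grad)
```

Comparing a numerical gradient against the analytic one, as done here, is a common way to check a derivative you have worked out by hand.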

## (III) Linear Algebra

Linear algebra is the most important math skill in machine learning. A data set is represented as a matrix. Linear algebra is used in data preprocessing, data transformation, and model evaluation. Here are the topics you need to be familiar with:

a) Vectors

b) Matrices

c) Transpose of a matrix

d) The inverse of a matrix

e) The determinant of a matrix

f) Dot product

g) Eigenvalues

h) Eigenvectors
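The operations listed above map directly onto NumPy calls. The following is a minimal sketch, assuming NumPy is installed; the matrix `A` and vector `v` are arbitrary values picked for illustration.

```python
import numpy as np

# A hypothetical 2x2 matrix and a vector, chosen only for illustration.
A = np.array([[2.0, 0.0],
              [1.0, 3.0]])
v = np.array([1.0, 2.0])

A_T = A.T                      # transpose
det_A = np.linalg.det(A)       # determinant
A_inv = np.linalg.inv(A)       # inverse (exists because det_A is non-zero)
dot_vv = np.dot(v, v)          # dot product of v with itself
Av = A @ v                     # matrix-vector product

# Columns of `eigenvectors` are the eigenvectors; eigenvalues[i] pairs
# with eigenvectors[:, i].
eigenvalues, eigenvectors = np.linalg.eig(A)

print("transpose:\n", A_T)
print("determinant:", det_A)
print("inverse:\n", A_inv)
print("v . v:", dot_vv)
print("A v:", Av)
print("eigenvalues:", eigenvalues)
```

Each eigenpair satisfies `A @ x == lambda * x` up to floating-point error, which is a quick way to verify the output of `np.linalg.eig`.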

## (IV) Optimization Methods

Most machine learning algorithms perform predictive modeling by minimizing an objective function, thereby learning the weights that must be applied to the testing data in order to obtain the predicted labels. Here are the topics you need to be familiar with:

a) Cost function/Objective function

b) Likelihood function

c) Error function

d) Gradient Descent Algorithm and its variants (e.g. Stochastic Gradient Descent Algorithm)

Find out more about the gradient descent algorithm here: **Machine Learning: How the Gradient Descent Algorithm Works**.
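As a rough illustration of the idea, the sketch below uses plain gradient descent to fit a straight line by minimizing the mean squared error. It is one simple instance of the technique, not a reference implementation; the function name `gradient_descent`, the learning rate, the step count, and the data are all assumptions made for this example.

```python
def gradient_descent(xs, ys, learning_rate=0.05, n_steps=5000):
    """Fit y = w * x + b by minimizing the mean squared error (the cost function)."""
    w, b = 0.0, 0.0            # initial weights
    n = len(xs)
    for _ in range(n_steps):
        # Prediction errors under the current weights
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the MSE with respect to w and b
        grad_w = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / n) * sum(errors)
        # Step against the gradient to reduce the cost
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical, noise-free data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = gradient_descent(xs, ys)
print("learned slope:", w, "learned intercept:", b)
```

Stochastic gradient descent follows the same update rule but estimates the gradient from one sample (or a small batch) per step instead of the full data set.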

Programming skills are essential in data science and machine learning. Since Python and R are considered the two most popular programming languages in data science, working knowledge of both is valuable, although some organizations may require skills in only one of them.

## (I) Skills in Python

Be familiar with basic programming in Python. Here are the most important packages you should learn to use:

a) NumPy

b) Pandas

c) Matplotlib

d) Seaborn

e) Scikit-learn

f) PyTorch

## (II) Skills in R

a) tidyverse

b) dplyr

c) ggplot2

d) caret

e) stringr

## (III) Skills in Other Tools and Languages

Skills in the following tools and languages may be required by some organizations or industries:

a) Excel

b) Tableau

c) Hadoop

d) SQL

e) Spark

Data visualization is one of the most important branches of data science. Simply put, data visualization involves the representation of data using charts and graphs. A good data visualization is made up of several components that have to be pieced together to produce the end product:

a) **Data Component**: An important first step in deciding how to visualize data is to know what type of data it is, e.g. categorical data, discrete data, continuous data, time series data, etc.

b) **Geometric Component:** Here is where you decide what kind of visualization is suitable for your data, e.g. scatter plot, line graphs, barplots, histograms, qqplots, smooth densities, boxplots, pairplots, heatmaps, etc.

c) **Mapping Component:** Here you need to decide what variable to use as your *x-variable (independent or predictor variable)* and what to use as your *y-variable (dependent or target variable)*. This is important especially when your dataset is multi-dimensional with several features.

d) **Scale Component:** Here you decide what kind of scales to use, e.g. linear scale, log scale, etc.

e) **Labels Component:** This includes things like axes labels, titles, legends, font size to use, etc.

f) **Ethical Component**: Here, you want to make sure your visualization tells the true story. You need to be aware of your actions when cleaning, summarizing, manipulating, and producing a data visualization and ensure you aren’t using your visualization to mislead or manipulate your audience.

Some examples of data visualizations with links to the code used for generating the plots can be found here: Examples of data visualization.
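The components described above can be seen together in a single short plot. This is a minimal sketch, assuming Matplotlib is installed; the data values, labels, and output file name `revenue_growth.png` are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Data component: hypothetical yearly values that grow roughly exponentially.
years = [1, 2, 3, 4, 5, 6]
revenue = [10, 30, 90, 270, 810, 2430]

fig, ax = plt.subplots()

# Geometric and mapping components: a line plot with year on the x-axis
# (independent variable) and revenue on the y-axis (dependent variable).
ax.plot(years, revenue, marker="o", label="Revenue")

# Scale component: a log scale suits values spanning several orders of magnitude.
ax.set_yscale("log")

# Labels component: axis labels, title, and legend.
ax.set_xlabel("Year")
ax.set_ylabel("Revenue (arbitrary units, log scale)")
ax.set_title("Hypothetical revenue growth")
ax.legend()

# Ethical component: the y-axis label states that the scale is logarithmic,
# so readers are not misled about the growth rate.
fig.savefig("revenue_growth.png")
```

Switching `set_yscale("log")` to `"linear"` on the same data shows how much the choice of scale changes the story a chart tells.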

## 1. YouTube

YouTube hosts many educational videos and tutorials that teach the essential math and programming skills required in data science, as well as data science tutorials for beginners. A simple search turns up many video tutorials and lectures. A good introduction to linear algebra is found here: Linear Algebra by Gilbert Strang (Professor at MIT), YouTube course.

## 2. Khan Academy

Khan Academy is also a great website for learning the basic math, statistics, calculus, and linear algebra skills required in data science.

## 3. Learning from a Textbook

Learning from a textbook provides more refined, in-depth knowledge than you get from online courses. This book provides a great introduction to data science and machine learning, with code included: **“Python Machine Learning”, by Sebastian Raschka**. https://github.com/rasbt/python-machine-learning-book-3rd-edition