Want to learn data science in 2021? Here’s the internet’s best curriculum

Data Science Ethics

Next, you’ll learn how to navigate the ethical dilemmas that arise when you exercise your new data skills. The main resource is H.V. Jagadish’s University of Michigan course, which covers informed consent, data ownership, privacy, anonymity, data validity, and algorithmic fairness. You’ll then learn how to use deon, a command-line tool that adds an ethics checklist to your data science projects. Finally, you’ll study ethics in AI at a deeper level.

Scalable Data Science

You’ll then learn how to scale up your work to “big data” using parallel computing and GPUs. I selected Dask and BlazingSQL for this curriculum because they are the easiest to learn given the Python skills you’ve acquired thus far, they have strong development teams, and they are gaining industry adoption.

Dask scales up the existing Python ecosystem to multi-core machines and distributed clusters. It allows you to use your NumPy, Pandas, and Scikit-Learn skills on big data, instead of having to learn a new programming style, as you would with a big data tool like Spark or a language like Scala.
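To make the idea of scaling up concrete, here is a minimal sketch of the split-apply-combine pattern that tools like Dask automate. It deliberately uses only Python’s standard library, not Dask’s API, and the helper names (`chunked`, `partial_sum_and_count`, `parallel_mean`) are made up for illustration. Threads are used only to keep the sketch self-contained; a real tool would spread the chunks across cores or machines.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(values, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def partial_sum_and_count(chunk):
    """Work done independently on each chunk."""
    return sum(chunk), len(chunk)


def parallel_mean(values, chunk_size=1000, workers=4):
    """Compute a mean by combining per-chunk partial results."""
    chunks = chunked(values, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum_and_count, chunks))
    # Combine step: only the small partial results are brought together.
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


# Hypothetical data: the integers 0..9999 stand in for a large dataset.
data = list(range(10_000))
result = parallel_mean(data)
```

The point of the pattern is that no single worker ever needs the whole dataset, which is what lets the same logic grow from a laptop to a cluster.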

BlazingSQL provides a high-performance distributed SQL engine in Python. Like Dask, it will feel natural for Python users. A quote from Dask co-creator Matthew Rocklin:

One of the common requests we get for Dask is, “Hey, do you support SQL? I love that [with Dask] I can do some custom Python manipulation, but then I want to hand it off to a SQL engine.” And my answer has always been, “No, there is no good SQL system in Python.” But now there is — if you have GPUs.

Built with the PyData ecosystem in mind, Dask and BlazingSQL work nicely together.

Dask and BlazingSQL are both cutting-edge tools that aren’t yet taught at most schools and companies. I consulted on the “How to learn Dask in 2021” curriculum, and personally compiled the resources in the BlazingSQL post.

Cloud Computing

Next, you’ll learn how to use cloud computing to scale up your work even further. First, you’ll learn how cloud computing works in the context of the industry’s major players — Amazon Web Services (AWS), Microsoft Azure, and Google Cloud. You’ll then learn how to use Coiled, a startup that aims to make cloud computing easy for Python and Dask users.

Note that “Coiled runs on AWS today, with Azure support coming soon.” Google Cloud is on the roadmap, and Google’s Head of Decision Intelligence Cassie Kozyrkov is excited about that.

Learning a tool that is still being built out shows the benefits of an opinionated curriculum curated by an individual. I can be a little more agile than a school or company — compiling online resources only takes a few hours. I can also take a little more “tool risk.” In this case, I believe the risk is worth the reward. Plus, you’ll still learn the basic mechanics of scaling to the cloud with Coiled.

Time Series Analysis

Next, you’ll hop back into developing your analyst skills. A time series is a series of data points indexed in time order. This type of data is ubiquitous, particularly in finance and applied science disciplines. First, you’ll learn how to handle time series data, then you’ll learn how to forecast based on that data.
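As a small illustration of both halves (handling, then forecasting), here is a sketch using only the standard library. The dates and values are invented, and the moving-average forecast is just one simple baseline, not the method taught in the courses.

```python
from datetime import date, timedelta

# Hypothetical daily observations, indexed in time order.
start = date(2021, 1, 1)
values = [10.0, 12.0, 11.0, 13.0, 14.0, 15.0]
series = [(start + timedelta(days=i), value) for i, value in enumerate(values)]


def moving_average_forecast(points, window=3):
    """Forecast the next value as the mean of the last `window` values."""
    if len(points) < window:
        raise ValueError("need at least `window` observations")
    recent = [value for _, value in points[-window:]]
    return sum(recent) / window


# The forecast applies to the day after the last observation.
next_day = series[-1][0] + timedelta(days=1)
forecast = moving_average_forecast(series, window=3)
```

In practice you would likely use Pandas for the handling step, but the structure is the same: ordered timestamps paired with values, and a rule that maps the recent past to a prediction.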

Text Analysis

You’ll then develop your text analysis skills, learning the basics of regular expressions and natural language processing.
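Here is a short sketch of what those two skills look like together, using Python’s built-in `re` module. The sample text is made up, and the email pattern is deliberately simple rather than a complete validator.

```python
import re
from collections import Counter

# Hypothetical input text.
text = ("Contact us at info@example.com or support@example.org. "
        "We love data, data, and more DATA!")

# Regular expression: find things that look like email addresses.
emails = re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", text)

# Basic language processing: lowercase, split into word tokens, count them.
tokens = re.findall(r"[a-z]+", text.lower())
counts = Counter(tokens)
most_common_word, frequency = counts.most_common(1)[0]
```

Tokenizing and counting is usually the first step before anything more sophisticated, so it is worth getting comfortable with it early.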

Other Fun Stuff

You’ll wrap up the program by learning skills that don’t have obvious curriculum categories. First, you’ll experience common machine learning pitfalls and learn how to fix them in real-life workflows. Then you’ll learn A/B testing, a critical skill for running successful online experiments, followed by web scraping, a hacky but effective way of collecting data from the internet. Next, you’ll learn how to analyze data that has a geographic component. Finally, you’ll learn an exciting new data analysis tool.

Siuba, born in 2019, is a new library that emulates an R library called dplyr that you’ll learn in ModernDive. Though Siuba doesn’t have much adoption yet, I’m including it because doing EDA in a dplyr-like way in Python would be a massive addition to an analyst’s toolbox, and early feedback on it is positive. Plus, the creator of the package built an online course (and the software to deliver that course) to promote adoption.
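Of the skills in this section, A/B testing is the easiest to show in a few lines. The sketch below runs a two-proportion z-test, one common way to compare conversion rates between two variants. The visitor and conversion counts are invented, and the function name is only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical experiment: 1,000 visitors per variant,
# 100 conversions for variant A and 130 for variant B.
z, p_value = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=130, n_b=1000)
```

A small p-value suggests the difference between the variants is unlikely to be noise alone, though a real experiment also needs a sample size and stopping rule chosen in advance.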

Interspersed between the resources above are blog posts and YouTube videos. These high-level resources frame your new skills in the context of the data industry in real life.

For example, after your introduction to data science, you’ll read a piece called, “Is data science a bubble?” You’ll gain an appreciation for where the industry is today and where the author thinks it is going.

That piece is a Cassie Kozyrkov creation, as are most of the Frame resources I selected. She’s an excellent communicator, and her pieces strike a nice balance between informative and humorous. She also has major industry experience, so her opinions carry weight.

Other examples of blog posts of hers that you’ll read include:

An excerpt from the last linked piece to get a sense of her writing:

Today’s data science tool ecosystem is so fragmented and messy that it might make even Marie Kondo faint. If you’re thinking about making more tools, focus on making tools that spark joy. Make it easy to fold them all into one place. (Right, Marie?)

…

Don’t build tools for their own sake, build them to fulfill your users’ needs and make your users happy. Focus on integration — it’s important to make these tools play well with the rest of the ecosystem, because no one wants to stop what they’re doing to give your tool special treatment unless it’s a cure-all.

I’ve personally learned a lot from Cassie’s pieces. They’ve also shaped many of my decisions for this curriculum. I think you’ll find them valuable, too.

After you learn a new skill and frame it, you’ll assess how proficient you are at it. You’ll use DataCamp Signal, an adaptive testing tool launched in 2019. You’ll mainly use this tool to:

See if you need to revisit any of the Learn resources before starting your project.
Create a digital transcript using your test scores to prove what you learned.
Track your scores throughout the curriculum to visualize and gamify your progress.

Here’s how your score is presented:

From the DataCamp Signal white paper: “Assessment results include a score (0–200), a percentile (0%-100%), and an associated knowledge level (Novice, Intermediate, Advanced).”

Each assessment is a series of 15 challenges. The difficulty of your next challenge changes based on how well you’ve scored up until that point. The entire assessment takes 5–10 minutes total.
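To give a feel for what “adaptive” means, here is a toy sketch of a staircase rule: difficulty goes up after a correct answer and down after a wrong one. This is an assumption made for illustration only, not DataCamp’s actual algorithm, and the difficulty scale of 1 to 10 is invented.

```python
def run_adaptive_assessment(answers_correct, start_difficulty=5,
                            min_difficulty=1, max_difficulty=10):
    """Return the difficulty level presented for each challenge.

    `answers_correct` holds one boolean per challenge. Difficulty moves
    up one step after a correct answer and down one step after a wrong
    one, clamped to the allowed range (a simple staircase rule).
    """
    difficulty = start_difficulty
    presented = []
    for correct in answers_correct:
        presented.append(difficulty)
        step = 1 if correct else -1
        difficulty = max(min_difficulty, min(max_difficulty, difficulty + step))
    return presented


# Hypothetical learner answering 15 challenges.
answers = [True, True, False, True, True, True, False, False,
           True, True, True, True, False, True, True]
levels = run_adaptive_assessment(answers)
```

Even with a rule this simple, the sequence of difficulty levels settles around the learner’s ability, which is why a short test can still be informative.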

The screen before you start DataCamp’s Python Programming assessment.

I strategically interspersed the following assessments within the curriculum to leverage a memory phenomenon called the spacing effect, which describes how our brains learn more effectively when we space out our learning over time.

At points in these adaptive tests, you’ll encounter some skills that you haven’t learned yet, and that’s okay. Again, these tests are designed to adapt to your skill level. Skip those questions or give your best guess. You’ll come back to that assessment later and you’ll be able to visualize your progress.

At the end of Term 1 and Term 2, you’ll revisit all of the skill assessments you’ve completed up until that point. These scores will provide a quantitative gauge of how prepared you are for the analyst role (Term 1) and the analyst-ML expert hybrid role (Term 2).

My Python skills measured over time. June 8th: A little rusty. June 9th: After refreshing my skills, I scored 149 (95th percentile). December 24th: Rusty again (plus a little tired). Just like any skill, your data skills can erode over time if you don’t keep them sharp! I expect your chart to look more like this during this curriculum: 📈

Note that I didn’t include any R assessments because learners are unlikely to score well on those even if they master the R resources I recommend. Learners should, however, be able to score well on the SQL assessment.

Here’s where you’ll set yourself apart from the crowd.

A self-directed project is a project with no defined end goal, no starter code or dataset, and no templated grading. These projects, in my opinion, are the only kind of projects that employers and clients truly want to see.

You’ll use your newly acquired skills to create something unique on a subject that you’re passionate about. There are eight projects spread throughout the curriculum (four in each term) and a capstone project at the end. I recommend spending two days on each regular project, and four days on the capstone.

You’ll feature some or all of the skills you learned in the courses immediately preceding each project. Here’s one potential outcome:

In January 2021, I will launch a blog post called, “What makes a good data science project?” I’ll also post an example project. You can check out my projects in the meantime.

You can tailor the curriculum to an industry you’d like to target. Interested in Bitcoin? Interested in fashion? Interested in healthcare? Find a dataset (using Google’s Dataset Search, for example) and create a project on it. You can dedicate all of your projects to that industry if you’d like!

You’ll include all of these projects in your digital transcript with your skill assessment scores to prove what you learned.

The main drawbacks of self-directed projects are:

What happens if I get stuck?
Grading is hard. How do I know if my work is correct?

For the first one, DataCamp’s adaptive tests will help. Right before you start a project, you’ll get quizzed on your new skills. You’ll receive a score and a diagnosis of your skill gaps. You can revisit learning materials if necessary. If you score well, these will serve as 10-minute skill refreshers that will make starting your project a little less daunting.

DataCamp Signal telling me my current strengths and skill gaps for Python programming.

The community will also help mitigate these concerns.

First, I’ve set up a Circle community with dedicated spaces for each project.

I’ve also set up a Deepnote team. Deepnote (the tool) is a new kind of data science notebook with real-time collaboration. Think Google Docs, but for data science.

How we’ll collaborate in Deepnote.

If you get stuck, post in the community and someone (me, a fellow learner, or a community mentor) can help you debug in Circle and/or Deepnote. In 2021, my main priority will be solving the grading problem with these tools.

I’m excited to see the projects that you create.

The final piece that weaves the curriculum together is Build a Career in Data Science by Emily Robinson and Jacqueline Nolis. Published in March 2020, it’s comprehensive and up-to-date. It even has an accompanying podcast.

The book is divided into four parts, which are spread evenly throughout this curriculum.

Part 1: Getting Started with Data Science
Part 2: Finding Your Data Science Job
Part 3: Settling Into Data Science
Part 4: Growing in Your Data Science Role

I will also experiment with additional career services, such as resume reviews and interview coaching, as part of the paid community throughout 2021.