- Introduction
- Automated Preprocessing
- Remote Work
- New Boosting Algorithm
- New Educational Experiences
- Airflow
- Summary
- References
The events of 2020 normalized different ways of learning and working. Some may revert to how things used to be, while others are here to stay. Some of the things we can look forward to in 2021 do not depend on 2020 at all but are new technologies that will most likely become prominent. I will discuss both the changes brought about by the events of 2020 and the possible rise in popularity of certain technologies. Keep reading if you would like to learn more about five things to look forward to as a Data Scientist in 2021. I will cover automated preprocessing, remote work, a new boosting algorithm, new educational experiences, and Airflow, along with their respective pros and cons and a general discussion of what to expect.
One of the more dreaded steps in Data Science is preparing your data to be used in your model. Isolating features for your Machine Learning algorithm can be entertaining and interesting, but transforming them can be difficult and overwhelming. New Python libraries and packages are emerging that automate this step.
Here are some pros and cons of automated preprocessing:
Pros
- More time to focus on features
- More time to focus on model error metrics
- Easy to use
- Can handle numeric, categorical, and text features, as well as features with a huge amount of variation, like IDs
- Decreased error
Cons
- Can be black-box like, where you are unaware of how your features are transformed
CatBoost Example
CatBoost [3] is one of those libraries that can automatically preprocess your features. It randomly permutes the input rows and converts the label from a float to an integer. The algorithm can transform categorical variables to numeric ones in the following ways:
- Borders
- Buckets
- BinarizedTargetMeanValue
- Counter
The main benefits of this automatic preprocessing can include the ease-of-use, increased speed, and decreased error.
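To make the idea above more concrete, here is a minimal sketch of how a random permutation combined with a running target statistic can turn a categorical column into numbers. This is a simplified illustration written in plain Python, not CatBoost's actual implementation. The function name `ordered_target_encode`, the `prior` value, and the toy data are all assumptions made for the example.

```python
import random


def ordered_target_encode(categories, labels, prior=0.5, seed=0):
    """Encode each category using only rows that precede it in a random permutation.

    Simplified illustration only; this is not CatBoost's internal code.
    """
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)  # random permutation of the rows

    positives_seen = {}  # category -> number of positive labels seen so far
    rows_seen = {}       # category -> number of rows seen so far
    encoded = [0.0] * len(categories)

    for idx in order:
        cat = categories[idx]
        # Smoothed mean of the labels of earlier rows with the same category.
        encoded[idx] = (positives_seen.get(cat, 0) + prior) / (rows_seen.get(cat, 0) + 1)
        # Update the running counts only after encoding, so a row never sees its own label.
        positives_seen[cat] = positives_seen.get(cat, 0) + labels[idx]
        rows_seen[cat] = rows_seen.get(cat, 0) + 1

    return encoded


# Hypothetical toy data: a categorical feature and an integer (0/1) label.
colors = ["red", "blue", "red", "red", "blue"]
clicked = [1, 0, 1, 0, 1]
encoded = ordered_target_encode(colors, clicked)
print(encoded)
```

Because each row is encoded before its own label is counted, the encoded value never leaks that row's target, which is the main reason for permuting the rows first.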
As a result of 2020, more and more companies have had to work remotely, and some plan to stay remote for the foreseeable future. Working remotely used to be viewed somewhat negatively, whereas now it is seen as not only safer but just as productive as working in person at an office, if not more so. There are a few reasons why you might be more productive working at home.
Here are some of the benefits of working remotely:
- no travel time
- no stress from driving in traffic
- more time to work if you need to
- taking breaks can be more restful (walking around your neighborhood vs walking in a busy downtown or monotonous company park)
- not being distracted by others in the office
This next point is more general and is a prediction. It seems as though every year a new boosting algorithm, or Machine Learning algorithm in general, comes on the scene and becomes the most popular choice in Data Science competitions. Random Forest has always been reliable, and then XGBoost became the most popular, winning competitions left and right and offering benefits like speed, accuracy, and ease of use. CatBoost has recently become another popular Machine Learning algorithm, so it will certainly be interesting to see what emerges in 2021 as the newest top Machine Learning algorithm.
Here are some of the most popular and competitive Machine Learning algorithms:
- Random Forest
- XGBoost
- CatBoost
With remote work becoming popular, so has learning remotely. Online courses were already increasingly prevalent before 2020, and they are even more common now. Another positive is that online courses and remote learning are viewed as more reputable than they have ever been. Aside from attending an undergraduate or graduate program in person, there are countless ways to learn remotely. Some of these methods are also completely free, saving you both money and time, since you can focus purely on Data Science or Machine Learning.
Here are some of the ways to learn remotely:
- online courses
- online certifications
- YouTube
- Medium or other article sites
- Kaggle
- GitHub
Airflow [8] is becoming more and more common among Data Scientists, Software Engineers, and Machine Learning Engineers, and 2021 will be no exception to this growing awareness. Overall, Airflow is a great platform for scheduling and executing Data Science processes, especially with DAGs (Directed Acyclic Graphs).
Here are some of the benefits that Airflow highlights:
- scalable
- dynamic
- extensible
- elegant
- Python
- UI is easy to use
- integrate easily
- open source
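If DAGs are new to you, the short sketch below may help show why they suit scheduling: each task runs only after the tasks it depends on have finished. Note that this is not Airflow code. It uses Python's standard-library `graphlib` module (Python 3.9+) to illustrate the concept, and the task names and the `run_pipeline` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical Data Science pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "extract": set(),
    "preprocess": {"extract"},
    "train": {"preprocess"},
    "evaluate": {"train"},
    "report": {"evaluate", "preprocess"},
}


def run_pipeline(dag, tasks):
    """Run each task once, only after all of its dependencies have completed."""
    completed = []
    # static_order() yields a valid execution order and raises CycleError on a cycle.
    for name in TopologicalSorter(dag).static_order():
        tasks[name]()
        completed.append(name)
    return completed


# Stand-in task bodies that only record that they ran.
log = []
tasks = {name: (lambda n=name: log.append(f"ran {n}")) for name in pipeline}

order = run_pipeline(pipeline, tasks)
print(order)
```

A real scheduler such as Airflow layers scheduling intervals, retries, and a UI on top of this dependency ordering, but the underlying structure is the same kind of graph.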
As you can see, 2020 has changed how we approach work and studying. I have listed some of the main differences I think will stay in 2021. Additionally, we have discussed some new technologies that may become more prominent in 2021. Overall, there are new ways of learning, existing ways of learning that are now more common, and new technologies that we can expect to become more popular in 2021. What do you think will become new or more common for Data Scientists in 2021? Do you agree with the points I listed, or are there more things to consider?
Here are those five points summarized:
- Automated Preprocessing
- Remote Work
- New Boosting Algorithm
- New Educational Experiences
- Airflow
Thank you for reading! I hope you found my article useful and interesting. Please feel free to comment down below on what you expect to stay or leave in 2021, and on what entirely new things might emerge for Data Scientists. Feel free to check out more of my articles on my profile and reach out to me if you have any questions. Thank you!
I am not affiliated with any of these companies.
[1] Photo by Moritz Knöringer on Unsplash, (2020)
[2] Photo by Sai Kiran Anagani on Unsplash, (2018)
[3] CatBoost, CatBoost, (2020)
[4] Photo by Wouter Beijert on Unsplash, (2020)
[5] Photo by veeterzy on Unsplash, (2016)
[6] Photo by Sam McGhee on Unsplash, (2017)
[7] Photo by Mila Young on Unsplash, (2017)
[8] The Apache Software Foundation, Apache Airflow, (2020)