These are the tools, packages, and libraries that my colleagues and I use to boost productivity in Machine Learning pipeline development and production deployment. What follows is a snapshot of our favorites as of December 24, 2020.
We used Python predominantly (95%) over the last seven years because:
- Almost all new Machine Learning models, cloud services, GPUs, and much else are available through a Python API;
- The assortment and number of free code and packages is the largest we have seen;
- Native Python is 20+ times slower than C, but almost all Python packages run at near-C speed because they are thin Python APIs over compiled C code (CPython extensions) or use some other speedup technique.
- The Python GIL's lack of concurrency on multicore machines is bypassed more and more each day by the cloud, Spark, and package implementations (e.g., XGBoost releases the GIL in its native code). Python itself also keeps maturing, with type hinting introduced in Python 3.5.
We used C to speed up Python when Numba could not be used. We tried Go, but it did not work out.
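Where Numba does apply, the speedup is nearly free. A minimal sketch (the function is our own invented example): the @njit decorator compiles the Python loop to machine code on first call, closing most of the gap with C.

```python
# A minimal Numba sketch: @njit compiles the loop to machine code on
# first call; the same loop in plain Python runs many times slower.
import numpy as np
from numba import njit

@njit
def array_sum(a):
    total = 0.0
    for i in range(a.shape[0]):  # explicit loop, compiled by Numba
        total += a[i]
    return total

print(array_sum(np.arange(1_000_000, dtype=np.float64)))
```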
Python’s runtime speed seems to draw the majority of the criticism. Much of that criticism may disappear if a practical way is found to compile Python. Meanwhile, Python is the predominant choice for machine learning.
We used EMACS for 15 years. We were those people who learned computer science and accidentally absorbed some software engineering along the way coding in LISP.
We stopped porting EMACS (or using someone else’s port) to each new hardware and OS platform. We started using other IDEs as we worked with Java, Scala, R, MATLAB, Go, and Python.
We discuss only Python-related tools, such as IDEs, for the rest of this blog. We think Python will eventually drop in popularity as the first choice for Machine Learning, just not in the next few years.
I think there are three good choices for a Python IDE.
Jupyter Notebook or JupyterLab
Jupyter Notebook enables you to embed text and code and to run that code interactively. It is modeled on the scientist’s lab notebook.
Project Jupyter exists to develop open-source software, open standards, and services for interactive computing across dozens of programming languages. — Jupyter Project
fast.ai coded a complete set of Jupyter Notebook tools. Please look them over.
PyCharm or VSCode
PyCharm and VSCode are the most popular IDEs (Integrated Development Environments) for Python.
We use PyCharm (or VSCode) to develop, document, test, and debug. Both integrate with inline documentation formatting, version control (git or GitHub), testing packages, coverage, linters, type hint checkers, and code formatters.
The Python IDE for Professional Developers — JetBrains
Black reformats your code into a consistent, opinionated style that complies with PEP-8 (Black describes its style as a strict subset of PEP-8). We use it to format all code files in a project, triggered by PyCharm, VSCode, or GitHub Actions.
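Black is usually run from the command line or an IDE hook, but it also exposes a small Python API. A minimal sketch using black.format_str (the ugly snippet is our own invented example):

```python
# A minimal sketch of Black's Python API; we normally let the IDE or a
# GitHub Action trigger Black rather than calling it in code.
import black

ugly = "def add( a,b ):return a+b"
pretty = black.format_str(ugly, mode=black.FileMode())
print(pretty)
# def add(a, b):
#     return a + b
```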
Codacy is currently our favorite “pain-in-the-***” (PITA) development tool. It catches more errors and suspect code than pylint, though we ignore some of its stylistic warnings. We think of it today as an automated code review tool. As Codacy states in its tagline: Automate code reviews on your commits and pull requests.
Coverage.py is the tool we use for measuring how much of our code is exercised by our Pytest test suite.
We use git for local file version control. Once unit tests pass on one of the local machines, we push out our code to our repo on the GitHub cloud.
Mypy type checks programs that have type annotations conforming to PEP 484. mypy is often used in Continuous Integration to prevent type errors. Mypy joins our other developer tools such as pytest, black, pylint, and Codacy.
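To illustrate, a small invented example of the kind of mistake mypy catches before the code ever runs (the error message in the comment is paraphrased):

```python
# mypy flags the bad call below at check time; plain CPython would not
# complain until the arithmetic fails at runtime.
from typing import List

def mean(values: List[float]) -> float:
    return sum(values) / len(values)

mean(["a", "b"])  # mypy error: list item has type "str", expected "float"
```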
We use pylint to find errors and suspect code on our local coding node. Today we use it for “linting” the full project’s code before pushes to the GitHub repo.
Over our careers, we have used many different frameworks. We settled on Pytest for unit testing primarily because it requires minimal boilerplate.
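A minimal invented example of that lack of boilerplate: a plain function whose name starts with test_ is all Pytest needs.

```python
# test_inc.py -- run with `pytest`; no classes or setup code required.
def inc(x: int) -> int:
    return x + 1

def test_inc():
    assert inc(3) == 4
```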
We use Scalene instead of the built-in Python profiler.
Scalene is a high-performance CPU and memory profiler for Python that does several things that other Python profilers cannot do. It runs orders of magnitude faster than other profilers while delivering far more detailed information.
We use Kubeflow to build Machine Learning pipelines of Kubernetes pods. We use Kubeflow only on the Google Cloud Platform (GCP). It is open-sourced by Google and should work on other cloud platforms. However, GCP packages Kubeflow as a cloud PaaS (Platform as a Service). As a GCP PaaS, Kubeflow has a convenient GUI template and DAG (Directed Acyclic Graph) display of the Machine Learning pipeline.
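A hedged sketch of what a pipeline definition looks like with the Kubeflow Pipelines (kfp v1) SDK; the step names and container images below are placeholders we invented:

```python
# A hedged Kubeflow Pipelines sketch: two container steps wired into a
# DAG. The images are hypothetical placeholders, not real components.
import kfp.dsl as dsl

@dsl.pipeline(name="train-pipeline", description="preprocess then train")
def train_pipeline():
    preprocess = dsl.ContainerOp(
        name="preprocess",
        image="gcr.io/my-project/preprocess:latest",  # hypothetical image
    )
    train = dsl.ContainerOp(
        name="train",
        image="gcr.io/my-project/train:latest",  # hypothetical image
    )
    train.after(preprocess)  # the edge that makes this a DAG
```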
Continuous Development is the first part of the traditional Continuous Integration process. Continuous Deployment follows Continuous Integration (CD/CI/CD).
An introduction and overview of DevOps can be found at:
DevOps (Development Operations) was created for the computer language code lifecycle. MLOps (Machine Learning Operations) extends DevOps for the Machine Learning pipeline lifecycle.
An introduction and overview of MLOps can be found at:
We found MLFlow effective for specifying Machine Learning projects with five or fewer steps, where some other framework performed any data preprocessing. We used MLFlow for the Machine Learning pipeline that followed a Spark-based distributed data pre-processing frontend.
Perhaps we were using MLFlow wrong, but inputs and outputs seemed to be file-based. However, we are enamored with the MLFlow Tracking component. The MLflow Tracking component enables us to log model parameters, code versions, metrics, and output files for display in dashboards built with Streamlit (discussed later).
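A minimal sketch of the Tracking API as we use it; the parameter, metric, and artifact names are invented for illustration:

```python
# A minimal MLflow Tracking sketch: everything logged inside the run is
# recorded for later display. Names and values here are invented.
import mlflow

with mlflow.start_run():
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("accuracy", 0.93)
    mlflow.log_artifact("model_summary.txt")  # hypothetical output file
```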
Note: We rarely use MLFlow now. We use Photon.ai for quick experiments and Kubeflow for production Machine Learning pipelines on the cloud.
Photon.ai incorporates Scikit-Learn, pyclustering, and other Machine Learning (ML) or deep learning (DL) frameworks under one unifying paradigm. Photon.ai adopts Scikit-Learn’s Estimator and Transformer class method architecture.
Photon.ai adds code that reduces manual coding and errors by turning pre- and post-learner algorithms into elements, each with its own argument signature. Examples of elements are several data cleaners, scalers, imputers, class balancers, cross-validators, hyper-parameter tuners, and ensembles.
Photon.ai chains elements into a Machine Learning pipeline. Two or more pipelines can be composed with a conjunctive (and) or disjunctive (or) operator to create a DAG (directed acyclic graph). The original Photon.ai source code, the extended Photon.ai source code, examples, arXiv paper, and documentation are found by clicking on the desired link.
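A hedged sketch of chaining elements, based on Photon.ai’s documented Hyperpipe API; the element names ('StandardScaler', 'SVC') come from Photon.ai’s element registry, and the dataset is scikit-learn’s breast-cancer sample:

```python
# A hedged Photon.ai sketch: a Hyperpipe chains a scaler element and a
# learner element, with cross-validation and grid search built in.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from photonai.base import Hyperpipe, PipelineElement

pipe = Hyperpipe(
    "basic_pipe",
    optimizer="grid_search",
    metrics=["accuracy"],
    best_config_metric="accuracy",
    outer_cv=KFold(n_splits=3),
    inner_cv=KFold(n_splits=3),
)
pipe += PipelineElement("StandardScaler")  # scaler element
pipe += PipelineElement("SVC", hyperparameters={"kernel": ["rbf", "linear"]})

X, y = load_breast_cancer(return_X_y=True)
pipe.fit(X, y)
```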
You can look at how photon.ai extends scikit-learn into an MLOps tool in the following blog:
Our experience leads us to predict that GitHub Actions will be a significant choice for Continuous Development, Continuous Integration, and Continuous Deployment (CD/CI/CD) on and off GitHub.
Continuous Development is when any push from a local repo goes to the project version control Dev repo. At the GitHub repo, CD/CI scripts run for PEP-8 formatting compliance, unit testing, documentation testing, and code quality reviews.
This is the core of the GitHub Actions script that we show in:
Docker creates an image of an application and its dependencies as a complete stand-alone component that can be moved onto most cloud vendor offerings, Linux, Windows, and macOS.
Docker-Compose is used to manage several containers at the same time for the same application. It offers the same features as Docker but supports more complex, multi-container applications.
A Docker image is similar to a Photon.ai element; the significant difference is that Kubernetes load-balances Docker by image replication and manages a DAG in which Docker images are the nodes, spread across a distributed system.
Building Docker images is detailed in the following blogs:
Diagrams enables you to create high-quality architecture DAGs (Directed Acyclic Graphs). Diagrams has a concept referred to as nodes: each node is an icon for a cloud service or an on-premise component, and the package organizes these icons into groups by provider. A minimal sketch follows.
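The three-node GCP pipeline below is invented for illustration; the icon classes come from the package’s GCP node set.

```python
# A hedged `diagrams` sketch: three GCP nodes chained into a small DAG
# and rendered to a PNG file. The pipeline itself is invented.
from diagrams import Cluster, Diagram
from diagrams.gcp.analytics import BigQuery
from diagrams.gcp.compute import GKE
from diagrams.gcp.ml import AIPlatform

with Diagram("ML Pipeline", show=False):  # writes ml_pipeline.png
    with Cluster("GCP"):
        BigQuery("features") >> GKE("training") >> AIPlatform("serving")
```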
The rendering of high-quality architecture diagrams for Azure, AWS, and GCP is shown in the following blog:
HiPlot is an interactive visualization tool that enables us to discover correlations and patterns in high-dimensional data. HiPlot uses a technique known as parallel plots, which we use to visualize and filter high-dimensional data.
Why not use Tableau? We do if the customer has a license. HiPlot is open-sourced by Facebook and thus requires no license fee. We can use HiPlot anywhere we go. We think it is better at displaying high-dimensional data than Tableau.
HiPlot integrates with Streamlit, our favorite replacement for Flask, Django, and the other GUI front-ends used for Machine Learning displays. You can dive more in-depth in the tutorial: HiPlot component for Streamlit.
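A minimal sketch of HiPlot in a notebook; the hyper-parameter records are invented:

```python
# A minimal HiPlot sketch: each dict is one experiment; display() renders
# an interactive parallel plot in a Jupyter notebook. Data is invented.
import hiplot as hip

experiments = [
    {"lr": 0.001, "dropout": 0.1, "accuracy": 0.88},
    {"lr": 0.010, "dropout": 0.2, "accuracy": 0.91},
    {"lr": 0.100, "dropout": 0.3, "accuracy": 0.84},
]
hip.Experiment.from_iterable(experiments).display()
```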
Python has the tried-and-true built-in logging package. A good read on the logging package is in this article.
However, I choose to use the recently released loguru package because it is easier to use than logging, and loguru is process- and thread-safe, while logging is not process-safe out-of-the-box. Ref: loguru project.
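A minimal sketch of why loguru is easier; the sink file name and rotation size are our own choices, and enqueue=True is what makes the file sink safe across processes:

```python
# loguru needs no handler/formatter boilerplate: import and log.
from loguru import logger

# Optional file sink; enqueue=True routes records through a queue so
# logging is process-safe. The name and rotation size are our choices.
logger.add("pipeline.log", rotation="10 MB", enqueue=True)
logger.info("Training started with {} samples", 10_000)
```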
You can learn how I use loguru in:
pyclustering is an open-source Python, C++ data-mining library under BSD-3-Clause License. The library provides tools for cluster analysis, data visualization, and contains oscillatory network models. — Pyclustering Documentation.
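A minimal K-medoids sketch with pyclustering; the sample points and initial medoid indices are invented:

```python
# A hedged pyclustering K-medoids sketch; the data and the initial medoid
# indices (points 0 and 2) are invented for illustration.
from pyclustering.cluster.kmedoids import kmedoids

samples = [[1.0, 1.0], [1.2, 0.9], [8.0, 8.1], [7.9, 8.3]]
algorithm = kmedoids(samples, initial_index_medoids=[0, 2])
algorithm.process()
print(algorithm.get_clusters())  # e.g., [[0, 1], [2, 3]]
```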
You can look at a detailed study of pyclustering K-means and K-medoids in the following blog:
We use pysim for Python-based simulations modeled as coupled differential equations. Some of us are ex-physicists. What we appreciate is that you can connect one simulated system to other simulated systems: simple systems hook together to create a complex simulation, as the sketch below illustrates.
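We do not show pysim’s own API here; instead, a generic SciPy sketch of the coupled-equations idea, with invented equations and coupling constant:

```python
# Not pysim's API: a generic SciPy sketch of two coupled systems, where
# each system's rate depends on the other's state. Equations are invented.
from scipy.integrate import solve_ivp

def coupled(t, state, k=0.5):
    x, y = state                 # system 1 and system 2
    return [-k * x + 0.1 * y,    # system 1 driven by system 2
            -0.2 * y + k * x]    # system 2 driven by system 1

solution = solve_ivp(coupled, t_span=(0.0, 10.0), y0=[1.0, 0.0])
print(solution.y[:, -1])  # final state of both systems
```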
SMOTE is probably the most widely known package for augmenting underrepresented data class counts so that they equal the highest data class count. In other words, it balances unbalanced structured data classes to predict the lower-count classes better.
What is done less often is continuing to augment the data of all classes.
You can learn how we use SMOTE for augmentation of all structured data classes in:
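A minimal sketch of the standard balancing use with imbalanced-learn’s SMOTE; the dataset is synthetic, invented for illustration:

```python
# A minimal SMOTE sketch: oversample the minority class until the class
# counts match. The dataset is synthetic.
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))  # minority count now matches
```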
If you use Spark with Keras or TensorFlow, use Sparkflow to speed up your training by N partitions, where N should be your batch size or the number of GPUs, whichever is smaller. Use SparkTorch for PyTorch.
Lightning is to PyTorch what Keras is to TensorFlow. We use Lightning and Keras to lift us a couple of levels above the complexities of PyTorch and TensorFlow.
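A hedged sketch of what those levels of abstraction look like in Lightning; the one-layer model is invented, and the Trainer supplies the training loop:

```python
# A hedged PyTorch Lightning sketch: the LightningModule declares what to
# optimize; Trainer supplies the loop. The tiny model is invented.
import torch
import pytorch_lightning as pl

class LitRegressor(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# trainer = pl.Trainer(max_epochs=5)
# trainer.fit(LitRegressor(), train_dataloader)  # supply your own loader
```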
Streamlit is an open-source Python framework that we use to quickly develop and deploy web-based GUIs for Machine Learning applications.
In earlier blog posts, we compared Flask to Streamlit using two different examples. We found that Flask needed about a hundred lines of code while Streamlit needed ten to accomplish the same task.
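A sketch of that brevity; the toy app below is invented (run it with streamlit run app.py):

```python
# app.py -- an invented toy Streamlit app: a slider-driven chart in a
# handful of lines. Run with `streamlit run app.py`.
import numpy as np
import pandas as pd
import streamlit as st

st.title("Quick ML demo")
n = st.slider("Number of points", 10, 1000, 100)
data = pd.DataFrame(np.random.randn(n, 2), columns=["x", "y"])
st.line_chart(data)
```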
spaCy is the fastest package we know for Natural Language Processing (NLP) operations. spaCy is an NLP library implemented in both Python and Cython. Because of Cython, parts of spaCy are faster than if they were implemented in pure Python. spaCy is available for MS Windows, macOS, and Ubuntu and runs natively on Nvidia GPUs.
spaCy is a good choice if you want to go into production with your NLP application. If you use a selection from spaCy, Hugging Face, fast.ai, and GPT-3, you can perform SOTA (state-of-the-art) research on different NLP models (our opinion at the time of writing this blog).
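A minimal spaCy sketch, assuming the small English model is installed (python -m spacy download en_core_web_sm):

```python
# A minimal spaCy sketch: tokenization, part-of-speech tags, and named
# entities. Requires: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")
print([(token.text, token.pos_) for token in doc])
print([(ent.text, ent.label_) for ent in doc.ents])
```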
We use mFST to build and deploy Finite-State Machines for Natural Language Parsing. We do not get into Finite-State Machine (FSM) theory here.
mFST is the Python library for working with Finite-State Machines based on OpenFST. mFST is a thin wrapper for OpenFST and exposes all of OpenFST’s methods for manipulating FSTs. — mFST paper.
If you ever try something other than a Convolutional Neural Network (CNN), you might try FSMs, but only if you are well-grounded in FSM theory. We recommend you start with a simple CNN.