During the fall 2020 semester at Carnegie Mellon University, we had the opportunity to take 17-400/700 Data Science and Machine Learning at Scale, offered by Professor Heather Miller and Professor Michael Hilton. The class included four programming assignments in which we worked on high-volume ETL and distributed machine learning, primarily using PySpark and Databricks notebooks.
The most exciting part of the course, however, was the project, in which we analyzed over a billion NYC taxi rides in the cloud by building an end-to-end machine learning pipeline on AWS and deploying the model using Cortex as our prediction server. Here's our journey!
We decided to tackle the task of analyzing NYC taxi data and predicting the duration of a trip for both Green and Yellow cabs. One of the main reasons we chose this problem statement is the growing need for efficient, optimized scheduling in the mobility and logistics industry. In addition, the original dataset was around 150 GB in size, a scale appropriate for the project. The dataset was spread across multiple CSV files, each containing one month of data. We performed the analysis using data from 2013 to 2020, and used the following GitHub repository to understand how to access and process the data: https://github.com/toddwschneider/nyc-taxi-data
Before diving into our data engineering process, we performed Exploratory Data Analysis on a small sample (one month) to understand the data and identify potentially problematic columns that might contain NULL values.
We can observe that columns such as tip amount, payment type, and trip type already appear to contain many null values and are potentially problematic. This helped us prepare better for our Data Engineering and ETL process.
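A quick null-count pass is enough to surface these columns. Here is a minimal sketch of that check using pandas; in the real workflow the frame would come from one month's CSV, but the tiny inline frame below (with made-up values) illustrates the idea:

```python
import pandas as pd

# In the real workflow this would be one month of trip records, e.g.:
#   df = pd.read_csv("yellow_tripdata_2020-01.csv")   # hypothetical file name
# Here we use a tiny illustrative frame with the same kinds of gaps.
df = pd.DataFrame({
    "trip_distance": [1.2, 3.4, 0.8],
    "tip_amount":    [2.0, None, None],
    "payment_type":  [1,   None, 2],
    "trip_type":     [None, None, None],
})

# Null counts per column quickly surface problematic fields.
null_counts = df.isnull().sum().sort_values(ascending=False)
print(null_counts)
```

Sorting the counts puts the worst offenders (here, trip type and tip amount) at the top.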
We decided to use Amazon Web Services (AWS) as our cloud service provider. One of the main reasons for our decision was that the NYC cab dataset was already hosted in an S3 bucket on AWS, and the potential cost savings from avoiding data transfer across cloud providers was a huge plus. In addition, AWS EMR with Jupyter Notebooks provided a scalable platform for data engineering and for writing distributed machine learning code using PySpark. Finally, since we were also deploying our model, AWS EKS was always on our mind; this turned out to be helpful when we were setting up Cortex, our prediction server.
We first dumped the data from the public S3 bucket onto the Hadoop Distributed File System (HDFS). We then wrote a PySpark ETL job that reads the data from HDFS, transforms it, and loads it into our own S3 bucket.
The transformations primarily included removing all rows with null values, extracting time features (such as pickup hour and dropoff hour), and enforcing a consistent schema.
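The ETL steps above can be sketched roughly as follows. This is a simplified illustration, not our exact job: the column names follow the public NYC taxi schema, and the HDFS/S3 paths passed in would be placeholders:

```python
def run_etl(hdfs_path, s3_path):
    """Sketch of the PySpark ETL step: read raw trips from HDFS,
    clean them, and write the result to S3. Column names are
    assumptions based on the public NYC taxi schema."""
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("nyc-taxi-etl").getOrCreate()

    df = spark.read.csv(hdfs_path, header=True, inferSchema=True)

    cleaned = (
        df.dropna()  # remove all rows containing null values
          .withColumn("pickup_hour", F.hour("pickup_datetime"))
          .withColumn("dropoff_hour", F.hour("dropoff_datetime"))
          # enforce a consistent numeric type across monthly files
          .withColumn("trip_distance", F.col("trip_distance").cast("double"))
    )

    cleaned.write.mode("overwrite").parquet(s3_path)

# Example invocation (paths are hypothetical):
# run_etl("hdfs:///data/raw/*.csv", "s3://our-bucket/cleaned/")
```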
Upon completing ETL, we performed Feature Engineering to find the most relevant features. This was done manually, and the final engineered features included pickup hour, trip distance, a rush-hour flag, and a boolean flag indicating whether the day falls on a weekend.
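The rush-hour and weekend flags boil down to simple per-row logic. A minimal sketch is below; note that the exact rush-hour windows here are assumptions for illustration, not necessarily the cutoffs we used:

```python
from datetime import datetime

def is_rush_hour(hour):
    # Assumed rush-hour definition: morning and evening commute windows.
    return 7 <= hour <= 10 or 16 <= hour <= 19

def is_weekend(dt):
    # datetime.weekday(): Monday is 0 ... Sunday is 6
    return dt.weekday() >= 5

# In PySpark these helpers can be applied as UDFs, e.g.:
#   from pyspark.sql import functions as F, types as T
#   rush_udf = F.udf(is_rush_hour, T.BooleanType())
#   df = df.withColumn("rush_hour", rush_udf(F.col("pickup_hour")))

print(is_rush_hour(8), is_weekend(datetime(2020, 1, 4)))  # a Saturday morning
```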
We also tracked the platform I/O metrics throughout the process.
We observed that during data loading, HDFS utilization spiked from 0.5% to 1.5%. During feature engineering, total load scaled from 10 to 28 concurrent reads. In addition, during feature engineering and model training (more on this soon!), available memory kept fluctuating between roughly 122,000 MB and 10,000 MB, meaning close to 90% of cluster memory was allocated to jobs running in the cluster.
After investigating our data, we concluded that this is a classic regression problem, and we decided to use Linear Regression as our machine learning model. Our training and storage configuration consisted of 10 m5.xlarge instances (4 vCores, 16 GiB memory, and 64 GiB of EBS-only storage each). We performed 10 iterations with a learning rate of 0.1 on the engineered features.
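A sketch of what that training step looks like with Spark's ML library is shown below. The feature and label column names are illustrative assumptions; note also that `pyspark.ml`'s `LinearRegression` does not expose a learning-rate parameter (the 0.1 step size corresponds to the older RDD-based SGD API), so only the iteration count is shown here:

```python
def train_duration_model(train_df):
    """Sketch of the training step, assuming a cleaned Spark DataFrame
    with the engineered feature columns and a trip_duration label
    (column names are illustrative)."""
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    # Pack the engineered features into a single vector column.
    assembler = VectorAssembler(
        inputCols=["pickup_hour", "trip_distance", "rush_hour", "is_weekend"],
        outputCol="features",
    )
    assembled = assembler.transform(train_df)

    # 10 iterations, matching the setting described above; the 0.1
    # learning rate applies to the RDD-based SGD API, not pyspark.ml.
    lr = LinearRegression(
        featuresCol="features",
        labelCol="trip_duration",
        maxIter=10,
    )
    return lr.fit(assembled)
```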
The metric we decided to track was the accuracy of predictions within a time window, where the window represents the acceptable absolute difference between our prediction and the value in the test data. We experimented with windows of 2, 5, 10, and 15 minutes.
We can observe that the trend is as expected: within a 5-minute window the accuracy is as low as 23.9%, but it shoots up to around 80% for the 15-minute window.
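The windowed-accuracy metric described above can be computed with a few lines of plain Python (the durations below are made-up toy numbers, not our results):

```python
def window_accuracy(preds, actuals, window_minutes):
    """Fraction of predictions whose absolute error is within the window."""
    hits = sum(
        1 for p, a in zip(preds, actuals)
        if abs(p - a) <= window_minutes
    )
    return hits / len(preds)

# Toy example with durations in minutes (illustrative numbers only).
preds = [10.0, 22.0, 35.0, 8.0]
actuals = [12.0, 30.0, 36.0, 20.0]
print(window_accuracy(preds, actuals, 5))   # 2 of 4 within 5 minutes -> 0.5
```

Widening the window can only add hits, which is why accuracy grows monotonically from the 2-minute to the 15-minute window.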
A possible extension for future work is evaluating other modeling approaches, along with regularization techniques, to improve accuracy.
Through prediction serving, we wanted to answer a simple question: “How effectively can we deploy our model to serve predictions under simulated real-world production load?”
We used Cortex as our prediction server and performed the following steps:
- The trained model was saved in the .onnx format
- The saved ONNX model was ported to the prediction server
- The actual “serving” of the model is a simple Python class that instantiates an ONNX Runtime session and loads the pre-trained model
- The predict method of the class is invoked by Cortex
- Once Cortex was set up on our AWS account, we brought the cluster up with the following command: cortex cluster up --config cluster.yaml
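The serving class described in the steps above can be sketched roughly as follows. The `model_path` config key and the feature names are assumptions for illustration, and the imports are kept inside the methods so the sketch stands alone:

```python
def payload_to_features(payload):
    # The order must match the feature order used at training time
    # (names here are assumptions).
    return [[
        payload["pickup_hour"],
        payload["trip_distance"],
        payload["rush_hour"],
        payload["is_weekend"],
    ]]

class PythonPredictor:
    """Sketch of the serving class Cortex invokes; the config key
    and feature names are assumptions."""

    def __init__(self, config):
        # Instantiate an ONNX Runtime session and load the pre-trained model.
        import onnxruntime as ort
        self.session = ort.InferenceSession(config["model_path"])
        self.input_name = self.session.get_inputs()[0].name

    def predict(self, payload):
        # payload is the JSON body of the request, e.g.
        # {"pickup_hour": 8, "trip_distance": 2.5, "rush_hour": 1, "is_weekend": 0}
        import numpy as np
        features = np.array(payload_to_features(payload), dtype=np.float32)
        result = self.session.run(None, {self.input_name: features})
        return {"predicted_duration_minutes": float(result[0].reshape(-1)[0])}
```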
We performed unit testing of our model using Postman.
We decided to use Locust for load testing our service, given its ease of setup and use. We simulated a load of 5,000 users and monitored metrics such as the service's response times, requests per second, and user growth.
In addition, we monitored the cluster using the EKS CloudWatch dashboard.
Our prediction server handled 5,000 concurrent users across ~279k requests and performed steadily, with just a 1% failure rate.
This was a great learning experience for all of us. Working on this project gave us insight into building and deploying an end-to-end machine learning application at scale. We hope to dive deeper into this space and solve exciting challenges in the future!