“Opportunities from Cloud”
On 15th December 2020, Data Science Milan organized a web Meetup hosting Gianmario Spacagna and Luc Mioulet, who talked about Helixa's end-to-end Machine Learning platform.
“Serverless Machine Learning architectures and engineering practices at Helixa”, by Gianmario Spacagna, Chief Scientist at Helixa, and Luc Mioulet, ML Engineer at Helixa
The talk was split into three parts: 1) serverless services available on AWS; 2) the Helixa Machine Learning platform; 3) a serverless Map/Reduce architecture.
Gianmario started by introducing the concept of serverless computing and giving an overview of the serverless services available on AWS.
Traditionally, when we build and deploy a web application, we are responsible both for the code and for managing the resources of the server it runs on. With serverless computing, the code is sent to the cloud provider (AWS, Azure, or Google Cloud), which is responsible for executing it by dynamically allocating resources. Serverless is a cloud systems architecture that still involves servers in the running application, but their presence is abstracted away from the data scientist or machine learning engineer, because they are managed by the cloud provider.
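To make the idea concrete, a serverless function is typically just a handler the provider invokes per event. The sketch below mimics the AWS Lambda Python handler shape; the event payload and names are hypothetical, not code from the talk.

```python
import json

def handler(event, context):
    """A minimal AWS Lambda-style handler: the provider allocates resources
    and calls this function for each incoming event; there is no server to
    provision or manage. The event shape here is a hypothetical example."""
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }

# Locally, we can invoke the handler directly (context is unused here).
print(handler({"name": "Helixa"}, None))
```

Deployed to Lambda, the same function would be triggered by events (an API call, a new S3 object, a queue message) with no running server to look after in between invocations.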
There are many benefits to a serverless architecture. First of all, it is secure by design: everything is managed by the cloud provider, including the servers, which reduces operational complexity. Serverless services scale with usage, and they are cheap because you pay only for the resources you consume, reducing cloud costs. You never have to worry about availability and fault tolerance, which reduces engineering complexity and time spent.
An example of serverless computing is AWS Lambda, which lets you run code without having to provision or manage servers. Other services available on the AWS cloud include AWS Fargate, a serverless compute engine for containers; Amazon Athena, a SQL query service; AWS Step Functions, a serverless function orchestrator that makes it easy to sequence AWS Lambda functions and multiple AWS services into business applications; and many others.
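The essence of Step Functions is passing each function's output as the next function's input. The following local sketch simulates that sequencing in plain Python; the step names and payload shape are hypothetical stand-ins, not a real state machine definition.

```python
# Local simulation of what AWS Step Functions does for Lambda functions:
# feed the output of one handler into the next one in a defined sequence.
# The handlers and payload shapes below are hypothetical examples.

def extract(event, context=None):
    return {"records": event["raw"].split(",")}

def transform(event, context=None):
    return {"records": [r.strip().upper() for r in event["records"]]}

def load(event, context=None):
    return {"loaded": len(event["records"])}

def run_state_machine(steps, event):
    """Sequence handlers the way a Step Functions state machine would."""
    for step in steps:
        event = step(event)
    return event

result = run_state_machine([extract, transform, load], {"raw": "a, b, c"})
print(result)  # {'loaded': 3}
```

In the real service, the sequence would be declared as a state machine (in Amazon States Language) and each step could be a Lambda function or another AWS service, with retries and error handling managed for you.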
So, why has Helixa chosen AWS for its framework?
Helixa is an audience intelligence platform that uses Machine Learning to provide accurate and timely consumer insights for modern market research. The whole system has to manage multiple data sets and provide accurate insights in real time. Results must always be available, at reduced cost and with minimal infrastructure to maintain. The benefits of AWS serverless architectures overlap very well with Helixa's requirements, which is one of the reasons for embracing this system.
Gianmario then walked through the Helixa architecture, starting from the data ingestion that processes multiple data sets and populates the data lake. Machine Learning pipelines interact with the data lake and with Machine Learning cloud services. Helixa uses pre-trained models, external APIs for geolocation mapping, and ML libraries, mainly in Python. Models created in the pipelines are stored in a repository implemented with MLflow and deployed into either batch jobs or microservices. The results of these pipelines are databases that power analytics applications.
Luc spoke about how Machine Learning is integrated into the wider Helixa platform, starting from data storage and, more precisely, from the Data Lake(house), a new paradigm in Big Data processing whose goal is to unify data lakes and Business Intelligence so as to satisfy every type of user. At this stage, the job done by Hadoop, expensive and always running, is replaced by Amazon S3: cheaper, elastic, always available, and performant, with the only limit being the maximum file size. Amazon S3 is linked with AWS Glue, a serverless data preparation service that simplifies the extraction, cleansing, enrichment, normalization, and loading of data. AWS Glue Crawlers explore partitions in the data, and every new partition is added to the AWS Glue Data Catalog; in this way the features become queryable by Amazon Athena, an interactive query service that makes it easy to parse data with standard SQL expressions.
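The partitioning that Glue Crawlers pick up is usually encoded directly in the S3 key layout, with Hive-style `key=value` path segments, and Athena can then prune on those partition columns. A small sketch, with a hypothetical bucket, table, and partition column (not Helixa's actual layout):

```python
from datetime import date

def partition_prefix(bucket: str, table: str, dt: date) -> str:
    """Build a Hive-style partitioned S3 prefix; Glue Crawlers recognize
    key=value path segments like dt=... and register each one as a
    partition in the Glue Data Catalog. Names here are hypothetical."""
    return f"s3://{bucket}/{table}/dt={dt.isoformat()}/"

def athena_query(table: str, dt: date) -> str:
    """An Athena query filtering on the partition column, so only the
    matching S3 prefix is scanned (Athena charges per byte scanned)."""
    return f"SELECT * FROM {table} WHERE dt = DATE '{dt.isoformat()}'"

print(partition_prefix("helixa-data-lake", "events", date(2020, 12, 15)))
# s3://helixa-data-lake/events/dt=2020-12-15/
print(athena_query("events", date(2020, 12, 15)))
```

Because the catalog knows which prefixes belong to which partition values, a query with a `WHERE` clause on `dt` never touches the rest of the bucket, which is what keeps the pay-per-scan model cheap.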
As for Research & Development, everything starts locally: code is developed with professional IDEs such as PyCharm, using Gitflow as the branching model, which is very well suited for collaborative development. The code is then pushed to GitHub, and via SSH a connection is made to an Amazon EC2 instance in the cloud, which pulls the code from GitHub so that development can continue in a Jupyter Notebook. The main idea is to code locally, avoiding the time spent copying code into Jupyter notebooks and then discovering, for instance, that some shared variables were shadowed.
All these jobs are moved into production with makefiles and Docker containers. After setting up the first few components of the machine learning pipeline, the next step is to automate everything using continuous integration; this way you don't have to worry about breaking things, because the code is versioned on GitHub and every change goes through the pipeline.
Luc explained several technologies available for deployment, for instance blue/green deployment using Terraform. The idea is to keep the production APIs always available: the new release is deployed to the "green" environment while the "blue" environment keeps serving the version already running, and going live only requires changing a pointer. There are architectures for different types of prediction, from batch model serving to model serving via microservices on either Amazon ECS or Kubernetes. Another interesting option is AWS Lambda for real-time serverless model serving. The important point about all of these microservices is that you want to orchestrate them, and for that Amazon offers AWS Step Functions, an orchestrator that supports most AWS workflow technologies.
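The blue/green idea can be illustrated with a toy simulation: two environments exist at once, and "deploying" is just flipping a pointer, so traffic is never interrupted and rollback is the same flip in reverse. Everything below is hypothetical; in practice the pointer would be a load-balancer target or DNS record managed by Terraform, not a Python variable.

```python
# Toy blue/green deployment: both environments are up simultaneously,
# and release/rollback is a single pointer switch. Names are hypothetical.

environments = {
    "blue": "model-v1",   # currently serving production traffic
    "green": "model-v2",  # new release, deployed but not yet live
}
live = "blue"

def serve(request: str) -> str:
    """Route a request to whichever environment is currently live."""
    return f"{environments[live]} handled {request}"

def switch_over() -> None:
    """Flip the production pointer -- the only change a release needs."""
    global live
    live = "green" if live == "blue" else "blue"

print(serve("req-1"))   # model-v1 handled req-1
switch_over()
print(serve("req-2"))   # model-v2 handled req-2
```

The payoff is that a bad release is undone by calling the same switch again, with no redeploy and no downtime while both environments stay warm.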
At the end, an overview was given of several MapReduce architectures used for Big Data at Helixa; MapReduce is used in real-time applications, replacing the use of Spark.
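To show the pattern behind a serverless Map/Reduce, here is a local simulation in which each "mapper" plays the role of one Lambda invocation working on a chunk of data in parallel, and a final reduce merges the partial results. The word-count task is a hypothetical stand-in, not one of Helixa's actual workloads.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def mapper(chunk: str) -> Counter:
    """Process one chunk of data -- the work a single serverless
    function invocation would do on its slice of the input."""
    return Counter(chunk.split())

def run_map_reduce(chunks):
    """Fan out mappers in parallel (simulating concurrent Lambda
    invocations), then merge the partial results in a reduce step."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(mapper, chunks))
    return reduce(lambda a, b: a + b, partials, Counter())

chunks = ["big data big", "data lake", "big lake"]
print(run_map_reduce(chunks))
```

In a serverless version of this pattern, each chunk would trigger its own short-lived function and the pieces would be combined by a reducer step, so there is no always-on Spark cluster to pay for between runs.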
Written by Claudio G. Giancaterino