A journey into the evolution of Big Data Compute Platforms like Hadoop and Spark. Sharing my perspective on where we were, where we are and where we are headed.
Over the past few years I have been part of a large number of Hadoop projects. Back in 2012–2016 the majority of our work was done using on-premises Hadoop infrastructure.
The age of on-premises clusters…..
On a typical project we would take care of every aspect of the Big Data pipeline including Hadoop node procurement, deployment, pipeline development and administration. Back in those days Hadoop was not as mature as it is now, so in some cases we had to jump through hoops to get things done. Lack of proper documentation and expertise made things even more difficult.
Overall managing and administering a multi-node cluster environment is very challenging and confusing at times. There are several variables that need to be accounted for:
- Operating System Patches — Considering there are multiple machines (nodes), the challenge is to perform the upgrade while the system is up and running. This is a huge ask considering some security patches require a system reboot.
- Hadoop Version Upgrades — Similar to OS patches, Hadoop needs to be upgraded regularly. Thanks to Hadoop advancements like Namenode High Availability and rolling upgrades there was some relief.
- Scalability — You may argue why this is a problem. Hadoop works on the principle of horizontal scalability, so this should not be an issue…..just keep adding nodes. Well, that claim is limited and hugely dependent on the availability of hardware. Adding new nodes is easy only if there is extra/unused hardware lying around, so there is a big if here.
- Support for new frameworks and modern use cases like ML and AI — Distributed frameworks like MapReduce are not as memory hungry as Spark. As newer frameworks like Spark started to evolve, the need for CPU and memory became increasingly stronger. The famous claim that Hadoop runs well on commodity-grade hardware no longer held true. With the growing demand for ML/AI use cases we simply needed stronger hardware, and a lot of it.
And then came the cloud…..
With the advent of the cloud the above challenges were automatically resolved…. or at least the majority of them. Minimal worry about upgrades, patches and scalability. The nature of the cloud made it easy to add new nodes on demand…literally in a matter of minutes.
There were several ways in which we started to adopt the cloud:
- Create a Hadoop/Spark cluster using the cloud provider's virtual machine offerings like EC2. Once the virtual machines had been procured, we installed Hadoop and Spark using distributions like Cloudera or Hortonworks, or simply the open source version.
- Use the cloud provider's built-in services like Amazon EMR or Azure HDInsight. These services are marginally more expensive compared to self-procured virtual machines, but they offer several benefits. Faster deployments, minimal need for administration skills, inbuilt scalability and monitoring are some of the benefits that are worth the extra price. On the downside, some customers do not like the idea of being tightly coupled to a specific cloud provider.
- In an extreme case, one of our customers chose to take a hybrid cloud approach. We created a 200-node rack-aware Hadoop cluster using a combination of virtual machines from AWS and Azure. I must admit their reasoning seemed pretty far-fetched to me at the time: they wanted to keep all options open in case one provider offered a better price than the other. This trend is surely catching on now.
Since the advent of the cloud the entire job orchestration landscape has been evolving. Due to the flexible nature of cloud resources we are now able to restructure our data pipelines in such a fashion that the need for permanent Hadoop/Spark clusters is quickly diminishing.
In the traditional cloud model a Data Lake is comprised of a permanent storage layer and a compute layer. Moving compute platforms to the cloud definitely resolves a bunch of issues with resource provisioning, scalability and upgrades. However, once the cluster has been provisioned, all computational jobs are fired up using the same cluster. Since the computational jobs may get fired up at different times during the day, the cluster needs to be available 24×7. Keeping a permanent cluster up all the time is a very expensive proposition. And it's not about paying for 1 or 2 nodes but a bunch of them, whether or not you are using them 24×7.
In the traditional cloud model for a Data Lake, all computational jobs go after the same cluster. Unless the jobs are well spaced out throughout the day, this may lead to resource contention, performance degradation and unpredictable completion times. We started to ask ourselves: is there a better approach?
The age of serverless data pipelines and ephemeral clusters is upon us…..
In recent times customers have been preferring serverless data pipelines using cloud-native services like AWS Glue, or ephemeral Hadoop/Spark clusters. This means each computational job can run within a predefined compute space, or in a cluster specifically spun up for the purpose of running only one job.
What is the real advantage of doing this? There are two main reasons:
- Cost Reduction — Having the flexibility to use cloud resources on demand and promptly releasing them when idle is a huge cost saver. Only pay for what you use.
- Predictable Performance — Having a job run with predefined resources assures timely completion of the job.
The image above is an example of how computational jobs can use the power of ephemeral clusters. In one of my previous articles (link shared below) I shared the entire process of deploying a transient EMR cluster. Notice that a brand new cluster is created for each and every computational job. The cluster is promptly destroyed after the job has been completed.
Overall the transient cluster approach is a good choice if you would like to achieve consistent performance while saving costs.
It is important to state that employing transient clusters does require automation. It can be achieved quite simply, even with a basic understanding of DevOps.
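As a rough sketch of what that automation looks like, the request below builds a single-job, self-terminating EMR cluster (the same pattern as a transient cluster): `KeepJobFlowAliveWhenNoSteps=False` makes the cluster shut down once its only step finishes. All names, instance types, IAM roles and S3 paths here are illustrative placeholders, not values from the article; with AWS credentials configured, the dict would be passed to boto3's `run_job_flow`.

```python
def transient_emr_request(job_name: str, script_s3_path: str) -> dict:
    """Build a run_job_flow request for a transient (ephemeral) EMR cluster.

    The cluster runs exactly one Spark step and terminates itself when the
    step completes, so you only pay for the job's actual runtime.
    """
    return {
        "Name": job_name,
        "ReleaseLabel": "emr-6.3.0",            # placeholder EMR release
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE",   "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            # Key setting: tear the cluster down after the last step.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": job_name,
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster", script_s3_path],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",   # placeholder IAM roles
        "ServiceRole": "EMR_DefaultRole",
    }

# With credentials configured, the actual launch would be:
#   import boto3
#   emr = boto3.client("emr", region_name="us-east-1")
#   response = emr.run_job_flow(**transient_emr_request(
#       "nightly-etl", "s3://my-bucket/jobs/etl.py"))
```

An orchestrator (a scheduled Lambda, Step Functions, or similar) would call this once per job, which is what makes the "one cluster per job" pattern practical.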
The image above depicts how to run your computational jobs using cloud vendor provided serverless compute services like AWS Glue. Using such services you can invoke jobs with predefined computational power. There is no need to spin up a new cluster, and you only pay for the resources that your job uses. However, it is important to realize that on the back end you are using a preexisting cluster controlled by the cloud vendor, so in some cases you may experience job invocation delays and variable performance.
Overall, serverless compute is a good choice for customers who have a limited number of computational jobs, do not want the hassle of managing servers, and can tolerate some delays.
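To make the "predefined computational power" idea concrete, here is a hedged sketch of the parameters for starting a Glue job run: capacity is fixed per run via `WorkerType` and `NumberOfWorkers`, and no cluster is provisioned by you. The job name and S3 path are hypothetical; with credentials configured, the dict would be passed to boto3's `glue.start_job_run`.

```python
def glue_job_run_request(job_name: str, workers: int = 10) -> dict:
    """Build start_job_run parameters for an AWS Glue job.

    The job runs with a predefined amount of compute (workers), giving
    predictable capacity per run without any cluster to manage.
    All names and paths are illustrative placeholders.
    """
    return {
        "JobName": job_name,
        "WorkerType": "G.1X",        # 1 DPU per worker (4 vCPU, 16 GB memory)
        "NumberOfWorkers": workers,  # fixed capacity for this run
        "Arguments": {
            "--input_path": "s3://my-bucket/raw/",   # hypothetical job argument
        },
    }

# With credentials configured, the actual invocation would be:
#   import boto3
#   glue = boto3.client("glue")
#   run_id = glue.start_job_run(**glue_job_run_request("pdf-etl"))["JobRunId"]
```

Because the vendor manages the underlying fleet, this is also where the invocation delays mentioned above can show up: the run may queue briefly before capacity is assigned.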
Serverless data pipelines using the microservices model are coming…..
A couple of months ago we did a POC on the deployment of a serverless OCR-NLP pipeline using Kubernetes. The project involved a data pipeline that could withstand the load of performing OCR and NLP on hundreds of PDF documents. The customer was looking for a serverless approach and wanted a high degree of scalability because of variable loads throughout the day. Since Spark recently added support for Kubernetes, we thought of giving it a shot.
With some work we were able to create a serverless compute pipeline using Spark on Kubernetes deployed over Amazon EKS and AWS Fargate.
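For a rough idea of what submitting a job in that setup looks like: Spark's Kubernetes support lets you point `spark-submit` at the cluster's API server instead of a YARN resource manager. The helper below assembles such a command line; the EKS endpoint, ECR image, namespace and executor count are all illustrative placeholders, not values from our POC.

```python
def spark_on_k8s_submit(app_s3_path: str, executors: int = 4) -> list:
    """Assemble a spark-submit command for running against Kubernetes.

    With cluster autoscaling (e.g. Fargate behind EKS), executor pods are
    scheduled onto capacity that is provisioned on demand. All endpoints,
    image names and namespaces below are hypothetical.
    """
    return [
        "spark-submit",
        # k8s:// prefix tells Spark to talk to a Kubernetes API server.
        "--master", "k8s://https://example-eks-endpoint.amazonaws.com:443",
        "--deploy-mode", "cluster",
        "--conf", f"spark.executor.instances={executors}",
        "--conf", "spark.kubernetes.namespace=spark-jobs",
        "--conf", "spark.kubernetes.container.image="
                  "123456789012.dkr.ecr.us-east-1.amazonaws.com/spark:3.1.1",
        app_s3_path,
    ]

# Example: submit a hypothetical OCR job with 8 executors.
# command = spark_on_k8s_submit("s3://my-bucket/jobs/ocr.py", executors=8)
# subprocess.run(command) would launch it from a machine with Spark installed.
```

The interesting design point is that the executor count here is just a starting value; Kubernetes-level autoscaling is what provides the elasticity described below.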
You may find that the above approach is very similar to the serverless approach using cloud-native services. But this one scales a lot better.
The data pipeline can sense the variation in incoming requests and scale its computational power accordingly…pretty cutting edge. We were able to successfully run the pipeline for a high number of incoming requests. Even though the approach passed all tests, in the end we were a little skeptical, got cold feet, and decided to take an alternative approach.
Overall, the microservices model would be extremely suitable for customers who not only want to enjoy the flexibility of a serverless compute platform but want to achieve a high level of scalability as well.
I promise to employ the approach very soon on an upcoming project. I will keep you all posted once that happens.
I hope this article was helpful. AWS Data Lake & DataOps is covered as part of the AWS Big Data Analytics course offered by Datafence Cloud Academy. The course is taught online by myself on weekends.