The open source stack is powering real-time analytics pipelines at Uber.
I recently started an AI-focused educational newsletter that already has over 70,000 subscribers. TheSequence is a no-BS (meaning no hype, no news, etc.) ML-oriented newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and concepts. Please give it a try by subscribing below:
Uber has to rank among the greatest contributors to open source data science infrastructure and frameworks. From machine learning frameworks like Horovod and Pyro to time-series infrastructure such as M3, the Uber engineering team has been incredibly active open-sourcing different stacks that are key building blocks of Uber’s data science pipeline. Earlier this week, Uber unveiled yet another super cool technology to enable modern analytics solutions. AresDB is a database and runtime for massively scalable, real-time analytics workloads.
Does the market really need another real-time analytics database? That was my first reaction when I read about AresDB. The challenge of real-time analytics, particularly on time-series data, is well known, and there are some very advanced stacks such as Apache Pinot, OmniSci and, even to some extent, Elasticsearch that power many of these use cases. While the capabilities of those solutions are incredibly solid, it is important to realize that Uber operates at a level of scale that is unknown to most software platforms in the market. Uber’s applications collect millions of data points per second that need to be used to make real-time decisions across completely diverse areas such as fraud detection or trip pricing. The mainstream adoption of graphics processing units (GPUs) opened a new frontier for scaling analytic workloads. However, how many real-time analytics engines do you know that leverage GPUs effectively? That’s the challenge AresDB set out to address.
The idea of leveraging GPUs for scaling real-time analytics workloads seems like a perfect fit for Uber. The ride-hailing company has thousands of analytic processes that perform real-time computations over time-series data consisting of billions of data points collected within a span of a few days. Compared to traditional CPUs, GPUs offer very tangible advantages for powering real-time analytics workloads:
- Process data in parallel very quickly.
- Deliver greater computation throughput (GFLOPS/s), making them a good fit for heavy computation tasks (per unit data) that can be parallelized.
- Offer greater compute-to-storage (ALU to GPU global memory) data access throughput (not latency) making them ideal for processing I/O (memory)-bound parallel tasks that require a massive amount of data.
A real-time analytics engine optimized for GPU-based architectures has the opportunity to scale arbitrary queries and computations against large subsets of a dataset without the need for any upfront data optimization.
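To make that intuition concrete, here is a minimal sketch, assuming NumPy and CuPy as stand-ins, of the same filter-plus-group-by aggregation run on CPU and GPU. Neither library is part of AresDB (which is built on Go and CUDA); the point is only to show how a scan-and-aggregate workload maps naturally onto data-parallel GPU primitives.

import time

import numpy as np
import cupy as cp


def group_count(xp, status, city_id, num_cities):
    """Count completed trips per city using array-parallel primitives."""
    completed = status == 1                      # boolean filter, evaluated in parallel
    return xp.bincount(city_id[completed], minlength=num_cities)


num_rows, num_cities = 20_000_000, 1_000
status_cpu = np.random.randint(0, 2, num_rows, dtype=np.int8)
city_cpu = np.random.randint(0, num_cities, num_rows, dtype=np.int32)

t0 = time.time()
cpu_counts = group_count(np, status_cpu, city_cpu, num_cities)
print("CPU:", time.time() - t0, "s")

status_gpu, city_gpu = cp.asarray(status_cpu), cp.asarray(city_cpu)
t0 = time.time()
gpu_counts = group_count(cp, status_gpu, city_gpu, num_cities)
cp.cuda.Stream.null.synchronize()                # wait for the GPU kernels to finish
print("GPU:", time.time() - t0, "s")

On a workload like this, the GPU path benefits from both parallel filtering and much higher memory bandwidth, which is exactly the property AresDB exploits.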
The design of AresDB combines both disk and in-memory storage as well as CPU and GPU computations. The stack follows a columnar data storage model popular in technologies like MemSQL or AWS Redshift. Architecturally, AresDB follows three key principles to achieve massive performance in real-time analytic computations:
- Column-based Storage with Compression: This model is used for storage efficiency (less memory usage in terms of bytes to store data) and query efficiency (less data transfer from CPU memory to GPU memory during querying)
- Real-time Upsert with Primary Key Deduplication: Relevant for high data accuracy and near real-time data freshness within seconds (a toy sketch of the first two principles follows this list).
- GPU Powered Query Processing: Used for highly parallelized data processing powered by GPU, rendering low query latency (sub-seconds to seconds).
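Purely as an illustration of the first two principles, and not of AresDB’s actual Go implementation, here is a toy Python sketch of a columnar live store with primary-key deduplication on upsert: each column is kept as its own array, and a primary-key index maps a key to the row position that should be overwritten.

# Toy sketch (not AresDB code): columnar storage plus primary-key deduplication on upsert.
class ColumnarTable:
    def __init__(self, columns, primary_key):
        self.columns = {name: [] for name in columns}  # one array per column
        self.primary_key = primary_key
        self.pk_index = {}                             # primary key -> row position

    def upsert(self, record):
        key = record[self.primary_key]
        if key in self.pk_index:                       # duplicate key: overwrite in place
            row = self.pk_index[key]
            for name, values in self.columns.items():
                values[row] = record[name]
        else:                                          # new key: append a new row
            self.pk_index[key] = len(next(iter(self.columns.values())))
            for name, values in self.columns.items():
                values.append(record[name])


trips = ColumnarTable(["trip_id", "city_id", "status"], primary_key="trip_id")
trips.upsert({"trip_id": 1, "city_id": 12, "status": "dispatched"})
trips.upsert({"trip_id": 1, "city_id": 12, "status": "completed"})  # deduplicated update
print(trips.columns)  # {'trip_id': [1], 'city_id': [12], 'status': ['completed']}

Keeping each column contiguous is what makes compression effective and keeps the CPU-to-GPU transfers small during querying.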
Part of AresDB looks like any other real-time database engine. Data in AresDB is stored on disk but moved to memory for real-time computations. Data ingestion runs on commodity CPUs, and the platform maintains a separate metadata store with the schema and relevant metadata of the core data structures. However, when processing a real-time query, AresDB switches computation from host memory to GPU memory to achieve faster performance. That real-time switch between computation engines is one of the key capabilities of AresDB. At a high level, the architecture of AresDB is illustrated in the following figure:
The storage model in AresDB is based on two fundamental sets of tables:
- Fact Tables: AresDB’s fact tables store events/facts that are happening in real time, and each event is associated with an event time, with the table often queried by the event time.
- Dimension Tables: AresDB’s dimension tables store current properties for entities (including cities, clients, and drivers). Compared to fact tables, which grow infinitely over time, dimension tables are always bounded in size. A hypothetical schema sketch for both table types follows this list.
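The sketch below contrasts the two table types. The field names (isFactTable, columns, primaryKeyColumns) and column types are illustrative assumptions, not AresDB’s actual schema DDL; the real schema format is defined through AresDB’s API.

# Hypothetical schema sketches (field names are illustrative, not AresDB's actual DDL):
# a fact table keyed and queried by event time, and a bounded dimension table.
trips_fact_table = {
    "name": "trips",
    "isFactTable": True,                            # grows indefinitely, queried by event time
    "columns": [
        {"name": "request_at", "type": "Uint32"},   # event time, used for the archiving cut-off
        {"name": "trip_id", "type": "UUID"},
        {"name": "city_id", "type": "Uint16"},
        {"name": "status", "type": "SmallEnum"},
        {"name": "fare", "type": "Float32"},
    ],
    "primaryKeyColumns": ["trip_id"],
}

cities_dimension_table = {
    "name": "cities",
    "isFactTable": False,                           # bounded in size, stores current entity properties
    "columns": [
        {"name": "city_id", "type": "Uint16"},
        {"name": "name", "type": "BigEnum"},
        {"name": "timezone", "type": "BigEnum"},
    ],
    "primaryKeyColumns": ["city_id"],
}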
In AresDB, data ingestion is abstracted by an HTTP API that can receive a batch of upsert records. After receiving a request, the AresDB API writes the upsert batch to redo logs for recovery. Following that, AresDB identifies and skips late records on fact tables for ingestion into the live store. A record is considered “late” if its event time is older than the archived cut-off event time. For records not considered “late,” AresDB uses the primary key index to locate the position within the live store to which they should be applied.
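As a minimal sketch of what pushing an upsert batch over that HTTP API could look like: the host, port, endpoint path, and payload shape below are assumptions for illustration only, not AresDB’s documented ingestion contract.

# Sketch of posting an upsert batch to AresDB's ingestion HTTP API.
# Host, port, endpoint path, and payload shape are hypothetical.
import requests

ARESDB_URL = "http://localhost:9374"          # hypothetical host/port
upsert_batch = {
    "table": "trips",
    "rows": [                                 # hypothetical batch payload shape
        {"trip_id": "a1b2", "city_id": 12, "status": "completed", "request_at": 1512000000},
        {"trip_id": "c3d4", "city_id": 7, "status": "completed", "request_at": 1512000060},
    ],
}

resp = requests.post(f"{ARESDB_URL}/data/trips", json=upsert_batch, timeout=5)
resp.raise_for_status()                       # a non-2xx status means the batch was rejected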
In addition to the HTTP API, AresDB provides integrations with several data sources such as Apache Spark, Apache Flink, or Apache Kafka to scale large batch insertions. AresDB’s extensible programming model allows the addition of new connectors for integrating with mainstream data and line-of-business systems.
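To show the shape of such a connector, here is a small sketch, not one of AresDB’s built-in integrations, that consumes trip events from Kafka (using the kafka-python client) and forwards them to the hypothetical ingestion endpoint in batches. Topic name, batch size, and endpoint are assumptions.

# Illustrative connector sketch: Kafka topic -> AresDB ingestion API, flushed in batches.
import json

import requests
from kafka import KafkaConsumer   # pip install kafka-python

consumer = KafkaConsumer(
    "trip-events",                               # hypothetical topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

batch = []
for message in consumer:
    batch.append(message.value)
    if len(batch) >= 1000:                       # flush every 1,000 records
        requests.post("http://localhost:9374/data/trips",
                      json={"table": "trips", "rows": batch},
                      timeout=5).raise_for_status()
        batch = []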
For data querying, AresDB relies on the Ares Query Language (AQL), which uses a JSON-like format for representing query commands. AQL syntax includes constructs such as dimensions or measures that are native representations of columnar storage models. Representing a query such as SELECT count(*) FROM trips WHERE status = ‘completed’ AND request_at >= 1512000000 GROUP BY city_id in AQL looks like the following:
{
  "table": "trips",
  "dimensions": [
    {"sqlExpression": "city_id"}
  ],
  "measures": [
    {"sqlExpression": "count(*)"}
  ],
  "rowFilters": [
    "status = 'completed'"
  ],
  "timeFilter": {
    "column": "request_at",
    "from": "2 days ago"
  }
}
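A client would typically submit such an AQL document over HTTP. The sketch below shows one way this could look from Python; the query endpoint path and request envelope are assumptions for illustration, and only the AQL body mirrors the example above.

# Sketch of submitting the AQL query above over HTTP (endpoint path is hypothetical).
import requests

aql_query = {
    "table": "trips",
    "dimensions": [{"sqlExpression": "city_id"}],
    "measures": [{"sqlExpression": "count(*)"}],
    "rowFilters": ["status = 'completed'"],
    "timeFilter": {"column": "request_at", "from": "2 days ago"},
}

resp = requests.post("http://localhost:9374/query/aql",   # hypothetical endpoint
                     json={"queries": [aql_query]}, timeout=10)
resp.raise_for_status()
print(resp.json())                                         # per-city completed-trip counts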
While AQL is the primary query language in AresDB, the Uber engineering team is exploring plans to also support SQL in order to streamline adoption of the platform. Like any other new query language, AQL requires a new execution engine that can parse and execute data queries. The AQL query engine leverages NVIDIA’s Thrust framework to parallelize query execution in a GPU environment.
AresDB is the newest addition to Uber’s growing data science and analytics stack. At least functionally, AresDB seems like a natural complement to other Uber technologies such as M3 to enable highly scalable time-series analysis. While not many companies operate at Uber’s data scale, almost everyone will agree that enabling real-time analytics over large datasets is far from an easy endeavor. The simplicity of its architecture and its integration with widely adopted data infrastructures position AresDB to power real-time analytics workloads in many environments.