Hacking Analytics’ Compendium of Data News — December 2020

This month was quite active with news and releases in the data space, with two big conferences going on (Amazon’s re:Invent and NeurIPS), as well as the official release of Airflow 2.0 and an introduction to the principles and architecture of the Data Mesh…

SQLite received a new release (3.34.0) providing enhanced support for recursive queries and an improved query planner.
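For readers who have not used them, recursive queries in SQLite are written as recursive common table expressions. The snippet below is only a minimal sketch using Python's built-in `sqlite3` module; the function name `count_up_to` and the CTE name `counter` are made up for illustration, and the query uses a single recursive term, so it does not depend on anything specific to 3.34.0.

```python
import sqlite3


def count_up_to(limit):
    """Return the integers 1..limit, generated by a recursive CTE."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    try:
        rows = conn.execute(
            """
            WITH RECURSIVE counter(n) AS (
                SELECT 1                                  -- initial row
                UNION ALL
                SELECT n + 1 FROM counter WHERE n < ?     -- recursive step
            )
            SELECT n FROM counter ORDER BY n
            """,
            (limit,),
        ).fetchall()
    finally:
        conn.close()
    # Each row is a 1-tuple; unpack it into a flat list of integers.
    return [n for (n,) in rows]


if __name__ == "__main__":
    print(count_up_to(5))
```

The same pattern (an initial select, `UNION ALL`, then a select that refers back to the CTE) is what is typically used to walk hierarchies such as org charts or dependency graphs.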

Amazon open-sourced Babelfish to provide a SQL Server/T-SQL compatibility layer for Postgres. Postgres also received a Docker image in which IVM (incremental view maintenance) is implemented.

CockroachDB explained why they are compatible with Postgres, as well as how they built their spatial indexing.

On the orchestrators front, we saw the release of Airflow 2.0, as well as the introduction by Uber of their “no-code” workflow orchestrator uWorc built on top of Airflow.

Martin Fowler’s blog published an introduction to the principles and logical architecture of the Data Mesh, and Ali Ghodsi’s paper on the Lakehouse architecture was published on CIDR’s website.

On the more real-time side, Uber introduced us to how they deal with Kafka disaster recovery and to their platform for push message notifications. Salesforce provided some insights into how they handle real-time prediction, and Confluera shared how they are leveraging Kafka, S3 and Pinot to generate real-time security insights.

Continuing the data governance effort, LinkedIn published a blog post detailing different metadata architectures, and the British government published their framework for data quality.

Data Ops

Amazon held its re:Invent conference this month and a few things were announced: strong consistency support in S3, container image support for AWS Lambda (with images of up to 10 GB), an EKS distribution, data sharing functionality within Redshift, as well as managed services for Grafana and Prometheus.

Microsoft Azure introduced new services to support data governance efforts with Azure Purview.

Octave received a new version (6.1), while R received support for pipes and lambda functions.

Use and concerns

Stories were published about how AI is blurring the line between what is real and what is fake, such as a story about an AI girlfriend seducing China’s lonely men, or a demonstration of deep fake movements generated from a single image.

Concerns about the power and impact of AI were also raised, such as in this story about Google AI going homophobic, a story by Bloomberg which asks what if data scientists had licenses like lawyers, or this story from The Guardian which stresses how machine readability has become an important component of corporate filings, and how people are trying to adjust the sentiment and tone of their reports to induce AI readers to draw favorable conclusions.

In the meantime, Channel 4 got in trouble for a deep fake video of the Queen delivering a fake Christmas message.

Systems

Salesforce detailed the architecture of their machine learning data platform.

Research

This month was the month of NeurIPS 2020, with far too many papers presented to read them all. The three papers that received best paper awards were “Language Models are Few-Shot Learners” (GPT-3), “No-Regret Learning Dynamics for Extensive-Form Correlated Equilibrium” and “Improved Guarantees and a Multiple-Descent Curve for Column Subset Selection and the Nyström Method”.

Studying the pandemic, Google researchers revealed problems of underspecification in using machine learning algorithms. Meanwhile, concerns about the privacy implications of training large language models (such as GPT-2) have grown following research from Google and Berkeley.

Education

On the machine learning education front, we saw NYU making Yann LeCun’s deep learning course available online for free.
