The world is increasingly data-driven, and data-intensive applications are the new norm. The early adopters of data-intensive applications were the high-tech industry and the scientific community, but over the last decade we’ve seen the rise of data-driven applications across a whole host of fields.
Data is at the center of many challenges in system design today. We have an overwhelming variety of tools, including relational databases, NoSQL datastores, stream and batch processors, and message brokers. With this book, software engineers and architects will learn how to apply those ideas in practice, and how to make full use of data in modern applications. The book was written by Martin Kleppmann.
In Silicon Valley, “ability to code” is now the uber-metric to track. The top echelon of technology gravitates toward things it can see, feel, and measure. What often gets neglected in this “code be all” culture is a deep understanding of fundamental concepts. Most newer “innovations” are in fact built on a handful of time-honored principles, says John Defterios.
“Technology is a powerful force in our society. Data, software, and communication can be used for bad: to entrench unfair power structures, to undermine human rights, and to protect vested interests. But they can also be used for good: to make underrepresented people’s voices heard, to create opportunities for everyone, and to avert disasters. This book is dedicated to everyone working toward the good.” (Martin Kleppmann)
Nowhere, perhaps, is this more prominent than in the data space, where libraries and frameworks are the conversation starter. It is impossible to model Cassandra “tables” well without understanding quorums, compaction, and log-structured merge (LSM) data structures. Given how present-day solutions are built, if they are not matched well to the particular domain, failure is just a release away.
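The quorum concept mentioned above reduces to a simple arithmetic condition in Dynamo-style stores such as Cassandra: with n replicas, a write acknowledged by w nodes and a read that queries r nodes are guaranteed to overlap on at least one up-to-date replica whenever w + r > n. A minimal sketch (the function name is mine, for illustration only):

```python
# Hedged sketch of the quorum overlap rule used by Dynamo-style stores
# such as Cassandra. The function name is illustrative, not a real API.

def quorum_overlap(n: int, w: int, r: int) -> bool:
    """With n replicas, writes acknowledged by w nodes, and reads querying
    r nodes, every read intersects an up-to-date replica iff w + r > n."""
    return w + r > n

# A typical configuration: n=3, w=2, r=2 guarantees overlap.
assert quorum_overlap(3, 2, 2)
# With w=1 and r=1, a read may hit only a stale replica.
assert not quorum_overlap(3, 1, 1)
```

This is exactly the kind of fundamental the review is pointing at: the setting looks like a tuning knob, but it encodes a correctness guarantee.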
Martin Kleppmann does a great job of articulating the “systems” aspects of data engineering. He starts from a functional database built in four lines of code and works up to how one can interpret and implement concurrency, serializability, isolation, and linearizability (the latter for distributed systems). The book also has over 800 pointers to state-of-the-art research as well as some of computer science’s classic papers. It slows its pace in the chapter on distributed systems and in the final one.
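That four-line database is a pair of tiny shell functions that append to a file and scan it back; a rough Python rendering of the same idea, my own sketch rather than the book’s actual code, might look like this:

```python
# A deliberately naive key-value store in the spirit of the book's opening
# example (the original is two short bash functions). Writes append to a
# log file; reads scan the whole file and keep the last match.

def db_set(path: str, key: str, value: str) -> None:
    with open(path, "a") as f:
        f.write(f"{key},{value}\n")   # append-only: old values are never touched

def db_get(path: str, key: str):
    result = None
    with open(path) as f:
        for line in f:                # O(n) scan; the last write wins
            k, _, v = line.rstrip("\n").partition(",")
            if k == key:
                result = v
    return result
```

Writes are O(1) while reads degrade linearly with the size of the log, which is exactly the tension the book then resolves with indexes.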
An excellent overview of database internals, and a great book for anyone who wants to build large, highly available applications.
That said, if you have ever worked on data systems, especially across paradigms (IMS, RDBMS, NoSQL, MapReduce, Spark, streaming, polyglot persistence), this book is pretty much the only resource out there that ties up the loose ends and paints a coherent narrative.
Data-Intensive Applications is an amazing piece of work. It’s easy to read. It drives you from simple to more complex topics with grace. It’s full of references to other people’s work, and it’s constantly linking to previous and future parts of the book where relevant content is further explained, making the book beautifully cohesive. It’s even funny (sometimes).
I think it took the author more than four years to finish it. He could have taken any number of years more, and the book would be just as valuable as it is today.
Martin Kleppmann starts out by solidly giving the reader a conceptual framework in Part I, Foundations of Data Systems: what does reliability mean? How is it defined? What is the difference between a “fault” and a “failure”? How do you describe load on a data-intensive system? How do you talk about performance and scalability in a meaningful way? What does it mean to have a “maintainable” system?
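On describing performance meaningfully, the book argues for response-time percentiles (p50, p95, p99) over averages, because a few slow outliers dominate user experience. As a quick illustration, here is a simple nearest-rank percentile, my own sketch and not code from the book:

```python
# Illustrative nearest-rank percentile over response times (my own sketch).
# Percentiles such as p99 expose tail latency that an average would hide.

def percentile(times_ms, p):
    """Nearest-rank percentile; times_ms need not be pre-sorted."""
    ordered = sorted(times_ms)
    rank = max(1, round(p / 100 * len(ordered)))  # 1-based nearest rank
    return ordered[rank - 1]

times = [30, 32, 35, 40, 42, 45, 50, 120, 500, 1500]  # milliseconds
assert percentile(times, 50) == 42     # the median looks healthy
assert percentile(times, 99) == 1500   # the tail tells a different story
```

The mean of that sample is about 239 ms, which describes almost nobody’s actual experience; the percentiles do.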
The chapter on data models and query languages (Chapter 2) gives a brief overview of different data models and shows their suitability for different use cases, drawing on modern challenges that companies such as Twitter have faced. This chapter is a solid foundation for understanding the differences between the relational, document, and graph data models, as well as the languages used for processing data stored in them.
The chapter on storage and retrieval (Chapter 3) goes into a lot of detail regarding the building blocks of different types of database systems: it describes the data structures and algorithms underlying the systems introduced earlier; you get to know hash indexes, SSTables (Sorted String Tables), Log-Structured Merge-trees (LSM-trees), B-trees, and other data structures. It then introduces column-oriented databases and the underlying principles and structures behind them. Each chapter also provides dozens, sometimes more than a hundred, reference links for the reader to explore. One of the most interesting chapters is Chapter 12, The Future of Data Systems.
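The hash-index idea from that chapter can be sketched as an in-memory dictionary that maps each key to the byte offset of its latest record in an append-only log, as in the Bitcask design the book discusses. This toy version (class and method names are mine) deliberately skips compaction and crash recovery:

```python
# Toy hash-indexed append-only log in the spirit of the Bitcask design
# described in the storage chapter. Class and method names are mine;
# compaction and crash recovery are deliberately omitted.

class HashIndexedLog:
    def __init__(self, path: str):
        self.path = path
        self.index: dict[str, int] = {}   # key -> byte offset of latest record
        open(path, "ab").close()          # make sure the log file exists

    def set(self, key: str, value: str) -> None:
        with open(self.path, "ab") as f:
            self.index[key] = f.tell()    # record starts at current end of log
            f.write(f"{key},{value}\n".encode())

    def get(self, key: str):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)                # jump straight to the latest record
            return f.readline().decode().rstrip("\n").partition(",")[2]
```

A read is now a single seek instead of a full scan, at the cost of keeping every key in memory, which is precisely the trade-off the chapter examines before moving on to SSTables and LSM-trees.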
Data is the pollution problem of the information age, and protecting privacy is the environmental challenge. Almost all computers produce information. It stays around, festering. How we deal with it — how we contain it and how we dispose of it — is central to the health of our information economy. Just as we look back today at the early decades of the industrial age and wonder how our ancestors could have ignored pollution in their rush to build an industrial world, our grandchildren will look back at us during these early decades of the information age and judge us on how we addressed the challenge of data collection and misuse.
We should try to make them proud. (Bruce Schneier)
The assertion that personal data is a valuable asset is supported by the existence of data brokers, a shady industry operating in secrecy, purchasing, aggregating, analyzing, inferring, and reselling intrusive personal data about people. Startups are valued by their user numbers, by “eyeballs” — i.e., by their surveillance capabilities. When a company goes bankrupt, the personal data it has collected is one of the assets that get sold. The data is difficult to secure, so breaches happen disconcertingly often.
Martin Kleppmann is a researcher in distributed systems at the University of Cambridge. Previously he was a software engineer and entrepreneur at Internet companies including LinkedIn and Rapportive, where he worked on large-scale data infrastructure. In the process he learned a few things the hard way, and he hopes this book will save you from repeating the same mistakes.
These days, it feels like most systems are distributed systems in one way or another. Designing Data-Intensive Applications should be almost mandatory reading for all software developers; so many of the concepts it explains are genuinely useful to know.
A lot of the problems described and solved in the book come down to concurrency issues. There are great pictures and diagrams by Shabbir Diwan, Edie Freedman, and Ron Bilodeau illustrating the points. At the beginning of each chapter there is a fantasy-style map which lists the key concepts in the coming chapter. I quite liked those.
Designing Data-Intensive Applications is thick: 616 pages, which will take about 17.3 hours to read for the average reader. This made me hesitate to start it; it almost felt too imposing. I just wish I had read this book earlier. I am really happy I started.
This book gets deep under the skin of theory and infrastructure details. Those repeated dives into what a typical programmer would consider out of scope make you a better developer.
If you liked this summary, you should definitely read the whole book. It will change your perspective and how you think about data, databases, and system design.