I’ve been working at Riskified for almost seven years, having started as partly a research analyst and partly a software engineer before becoming a full-time developer. Now I’m one of the software architects. Thanks to my unique career path within the company, I’m familiar with both sides of the coin when it comes to machine learning requirements — the data scientists’ point of view, and the developers’.
Machine learning requires input data to make decisions. In supervised machine learning, one of the most important elements of that data is its labels.
In Riskified’s case, the labels are mutable, meaning they can change over time. Additionally, the labels are derived from multiple fields. Why is this an issue? As a company grows, so does the number of labels and fields in use. As a result, no single person knows them all, and this may cause knowledge loss.
If that’s the case in your company, and your labels are extracted from a state-driven architecture, you probably know what I’m referring to. These mutable labels, calculated from multiple fields, might produce noise that can have a big impact on model precision. So how can the system’s architecture help?
Today, many high-tech companies consider machine learning one of their most important intellectual properties, and as a company grows, so does the amount of data. Software engineers and architects, therefore, frequently face the challenge of mutable labels and can choose between several possible solutions.
In this blog post, I will describe the solution we used and share how event-driven design helps our data scientists, who require two points of view for the labels:
- The most up-to-date version should be accessible from the real-time flow
- All points-in-time versions should be accessible for offline usages
Read on to see how a purely architectural solution can help meet the requirements of your customers, in this case the data scientists.
There are many fields within supervised machine learning where the labels can change. I’ll use fraud detection in eCommerce as an example, as this is Riskified’s area of expertise.
Put simply, when we classify an order we only have two labels:
‘Fraud’ (positive) and ‘Not Fraud’ (negative).
Let’s assume, for example, that we’ve classified an order as ‘Not Fraud’:
This may have been easy, but things are about to become more complicated. There are two types of issues that can occur, which I like to call the M&M issues. So what does each M stand for?
First, mutable fields: after our initial classification of ‘Not Fraud’, the customer updates the order with a different shipping address. This triggers additional analysis of the order. Let’s say the new classification is ‘Fraud’:
In this case, there is a point in time when the order was labeled ‘Not Fraud’, but after the update it is labeled ‘Fraud’.
Second, multiple fields: after our initial classification of ‘Not Fraud’, an additional review of the order reveals that our model gave a false-negative classification.
Since we want our labels to be as accurate as possible, we’ll tag the order with the “false-negative” tag. But this tag is saved in another field, which means we now have two different fields that describe our label:
In this case, we don’t want to count the order as ‘Not Fraud’, so our label will actually be derived from multiple (in this case two) different fields:
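def positive(classification, tag):
    return classification == "Fraud" or (
        classification == "Not Fraud" and tag == "false-negative")

def negative(classification, tag):
    # An order tagged "false-negative" must not count as negative.
    return classification == "Not Fraud" and tag != "false-negative"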
Now you know the M&M problem:
- Mutable fields — the labels are mutable and change over time, but only the most recent state is saved (example 1).
- Multiple fields — labeling might be spread across multiple domains and development groups, and therefore multiple fields. Example 2 is a simple example of one additional field, but in reality, it could include many more fields, possibly across multiple databases that determine the labels. When combined with multiple employees who might not be familiar with all the fields, knowledge loss is a very likely outcome.
What will happen if we do nothing? We’ll end up with chaos.
So how can we solve this M&M issue? Since it’s a direct result of the state-driven design, we might consider changing the architecture.
First of all, let’s agree on some basic definitions. As I’ve explained before, a state-driven design is when we save a specific state in some kind of database and overwrite the data whenever something requires it. In example 1, we overwrote the classification field because the address was updated.
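To make the overwrite concrete, here is a minimal sketch of a state-driven update, assuming a hypothetical in-memory `orders` table and made-up addresses. It is only an illustration of the idea, not our production code: once the update runs, nothing in the row tells us the order was ever labeled ‘Not Fraud’.

```python
# Hypothetical in-memory "orders table": one row per order, overwritten in place.
orders = {
    "order-1": {"classification": "Not Fraud", "shipping_address": "1 Main St"},
}


def update_order(order_id, new_address, new_classification):
    """State-driven update: the new values replace the old ones."""
    row = orders[order_id]
    row["shipping_address"] = new_address
    row["classification"] = new_classification


# The customer changes the shipping address and the order is re-analyzed.
update_order("order-1", "9 Elm Rd", "Fraud")

# Only the latest state survives; the earlier 'Not Fraud' label is gone.
print(orders["order-1"]["classification"])  # Fraud
```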
So what is an event and how is it different? Let’s say your age is a state (saved in a field called “age” in a database), and every year your age is overwritten by an event: your birthday. For a state like age, it’s easy to know what the state was before it changed (if I’m 35 years old today, I was 30 years old 5 years ago), but this isn’t always the case. In example 1, you can’t know what the state was before it changed.
How will the event help us in this case? If we save a log of events, we can know what the state was at any given time.
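As a rough sketch of that idea, the snippet below keeps an append-only list of events and replays it up to a chosen time. The `Event` shape, the integer timestamps, and the address value are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: int  # hypothetical logical time: 1, 2, 3, ...
    order_id: str
    field: str
    value: str


# Append-only log: nothing is ever overwritten.
log = [
    Event(1, "order-1", "classification", "Not Fraud"),
    Event(2, "order-1", "shipping_address", "9 Elm Rd"),
    Event(3, "order-1", "classification", "Fraud"),
]


def state_at(events, order_id, as_of):
    """Rebuild an order's state by replaying its events up to `as_of`."""
    state = {}
    for event in sorted(events, key=lambda e: e.timestamp):
        if event.order_id == order_id and event.timestamp <= as_of:
            state[event.field] = event.value
    return state


print(state_at(log, "order-1", as_of=1)["classification"])  # Not Fraud
print(state_at(log, "order-1", as_of=3)["classification"])  # Fraud
```

Replaying up to the latest timestamp gives the current state, so the same log serves both the most recent view and any point-in-time view.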
Let’s go back to examples 1 and 2, and look at the timeline (same timeline for both examples):
At t1 (the initial classification) our label is negative, while at t3 (after the change) it is positive. Note that the time difference between t1 and t3 can vary: it can be seconds or days. So, which is the right label to choose?
Now is a good time to remind you of our requirements:
- The most up-to-date version of the label should be accessible from the real-time flow
- All points-in-time versions of the label should be accessible for offline usages
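In a state-driven design, point 1 is easy, but what about point 2? Because each update overwrites the previous state, earlier versions of the label are gone. Event-driven design is a possible solution because the event log lets us reconstruct the state at any given time.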
Now that it’s clear that event-driven design is a solution that can actually solve the problem of these noisy labels, let’s break it down:
First of all, we needed to define all the data that was relevant for our labels. We even added data points that were still being researched but that might have been relevant to our labels in the future.
This part is a little exploratory, as we have a lot of data in multiple databases and no single person who can know it all. As an architect, I had to research with multiple teams, and ended up with more than 40 (🤯) different events that originated in different parts of the system and were important for our model training.
What could we do with all these events? After choosing our event-messaging technology (Kafka), we made sure that all events would flow into the same place (topic) from each microservice:
Each of these topics is partitioned by the order’s ID. This way, when a new event is published to a topic, it is actually appended to one of the topic’s partitions. Events with the same order ID are written to the same partition, and Kafka guarantees that any consumer of a given topic-partition will always read that partition’s events in exactly the same order as they were written. This feature is crucial to the success of the project. Think about our two examples: if we confuse t1 and t3, we’ll end up with a completely different outcome.
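The ordering guarantee can be illustrated with a small simulation of key-based partitioning. This is a sketch under stated assumptions: the partition count and payload strings are made up, and `zlib.crc32` is only a stand-in hash (Kafka’s own partitioner uses a different one). The point is that the same key always maps to the same partition, so per-order ordering is preserved.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # hypothetical partition count


def partition_for(order_id):
    # Stand-in hash for illustration; not the hash Kafka itself uses.
    return zlib.crc32(order_id.encode("utf-8")) % NUM_PARTITIONS


partitions = defaultdict(list)  # partition number -> append-only list


def publish(order_id, payload):
    """Append the event to the partition chosen by the order ID."""
    partitions[partition_for(order_id)].append((order_id, payload))


publish("order-1", "classified: Not Fraud")
publish("order-2", "classified: Not Fraud")
publish("order-1", "address updated")
publish("order-1", "classified: Fraud")


def events_for(order_id):
    """Read one order's events back in the order they were written."""
    return [p for oid, p in partitions[partition_for(order_id)] if oid == order_id]


print(events_for("order-1"))
```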
How does the label-maker consumer work? It is a consumer built on Kafka Streams, a library in Kafka’s ecosystem for developing stateful stream-processing applications. Using its DSL, we can filter, map, group, aggregate, and join the different topics to calculate our labels (state).
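The real label maker is written with Kafka Streams, but the core aggregation can be sketched in plain Python. The snippet below is not the Kafka Streams API; it only mimics a group-by-key followed by an aggregate, using the simplified rule from example 2. The event tuples and the function names are assumptions for illustration.

```python
def derive_label(state):
    """Simplified label rule from example 2."""
    classification = state.get("classification")
    tag = state.get("tag")
    if classification == "Fraud" or (
        classification == "Not Fraud" and tag == "false-negative"
    ):
        return "positive"
    if classification == "Not Fraud":
        return "negative"
    return None  # not enough information yet


def label_maker(events):
    """Fold events per order ID and emit a label whenever it changes."""
    states = {}  # order_id -> aggregated fields
    labels = {}  # order_id -> last emitted label
    for order_id, field, value in events:
        state = states.setdefault(order_id, {})
        state[field] = value
        label = derive_label(state)
        if label is not None and labels.get(order_id) != label:
            labels[order_id] = label
            yield order_id, label


events = [
    ("order-2", "classification", "Not Fraud"),
    ("order-2", "tag", "false-negative"),
]
print(list(label_maker(events)))
# [('order-2', 'negative'), ('order-2', 'positive')]
```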
After we published our end result (labels), we ended up with a consumer for the application itself (real-time flow) and another one for offline purposes.
Note that this is a specific solution that uses Kafka, but it is possible to use other streaming solutions.
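To illustrate how the real-time and offline consumers can read the same labels independently, here is a toy model of a topic that tracks one read position per consumer. The `Topic` class and the consumer names are hypothetical; in Kafka itself, read positions (offsets) are tracked per consumer group.

```python
class Topic:
    """Toy append-only topic with an independent offset per consumer."""

    def __init__(self):
        self._log = []
        self._offsets = {}  # consumer name -> next index to read

    def publish(self, event):
        self._log.append(event)

    def poll(self, consumer):
        """Return the events this consumer has not seen yet."""
        start = self._offsets.get(consumer, 0)
        batch = self._log[start:]
        self._offsets[consumer] = len(self._log)
        return batch


labels = Topic()
labels.publish(("order-1", "negative"))
first_real_time = labels.poll("real-time")   # sees the first label right away

labels.publish(("order-1", "positive"))
offline_batch = labels.poll("offline")       # reads the full history later
second_real_time = labels.poll("real-time")  # sees only what is new to it

print(offline_batch)     # [('order-1', 'negative'), ('order-1', 'positive')]
print(second_real_time)  # [('order-1', 'positive')]
```

Because each consumer only advances its own position, adding the offline reader does not change what the real-time reader sees.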
Let’s go back to the M&Ms and make sure our event-driven design actually solved the issue:
- Mutable fields — now our data scientists will be able to access the data at any time, without worrying about the changes made.
- Multiple fields — we now have a single source of truth that solves the issue of multiple fields in multiple DBs for multiple employees. It doesn’t matter to the consumers where the data is saved.
We should also make sure we handled all our requirements:
- The most up-to-date version of the label should be accessible from the real-time flow: ✅ it’s produced by the label maker.
- All point-in-time versions of the label should be accessible for offline usages: ✅ the offline consumer saves all of the data over time.
Since we are an analytics-driven company, we add insights to our data all the time (in the form of tags, like the “false-negative” tag from example 2). We don’t always know from the beginning how a tag will affect the “positive” and “negative” labels and which ones will be affected. Therefore, with event-driven architecture, we are able to apply a positive/negative label to tags retroactively. This way, our data scientists can “play around” with the data and create any materialized view they want — creating the labels they need and maybe even new labels that are constructed in a completely different way.
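A small sketch of what applying labels retroactively might look like: the same saved events are replayed under two different labeling rules. The `needs-review` tag, the `history` structure, and the simplified rule (any tag in `positive_tags` makes an order positive) are assumptions for illustration, not our actual tags or logic.

```python
# Hypothetical saved event history per order.
history = {
    "order-1": [("classification", "Not Fraud")],
    "order-2": [("classification", "Not Fraud"), ("tag", "false-negative")],
    "order-3": [("classification", "Not Fraud"), ("tag", "needs-review")],
}


def build_view(history, positive_tags):
    """Replay every order's events and label it under the given rule."""
    view = {}
    for order_id, events in history.items():
        classification, tags = None, set()
        for field, value in events:
            if field == "classification":
                classification = value
            elif field == "tag":
                tags.add(value)
        is_positive = classification == "Fraud" or bool(tags & positive_tags)
        view[order_id] = "positive" if is_positive else "negative"
    return view


# The rule as first defined, then a revised rule applied to the same history.
original = build_view(history, {"false-negative"})
revised = build_view(history, {"false-negative", "needs-review"})

print(original["order-3"], revised["order-3"])  # negative positive
```

No stored data changes between the two views; only the rule used to read the events does.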
In this blog post I’ve described the process we went through, from our data scientists’ request for access to labels at different points in time to our event-driven solution. As I mentioned before, this is not the only way to solve this problem, but since we were already moving to an event-driven architecture for other reasons, the approach we chose served us well.
And if you need another reason why event-driven design is so effective, here is a fun fact for you: once we obtained that centralized topic we mentioned, we received a request from another team to use it in other system domains, and we did it easily. What’s great about event-driven design is that we can add more and more consumers — and they won’t affect each other! Another win for event-driven design!