

A machine learning approach to predict churn from user logs using Spark.
Over the last 15 years, streaming services such as Netflix and Spotify have changed the way we watch movies and listen to music. Since then, many new service providers have emerged and are trying to poach customers from the established players.
As a company offering a streaming service, it is important to have an effective customer retention strategy to stop users from switching to another provider. How can a company predict ahead of time when a user will stop using its service?
In this article I will walk you through a project that predicts which users are about to churn from a streaming service, using only the recorded user activity. The project is structured following CRISP-DM, the Cross-Industry Standard Process for Data Mining.
Before working with the data itself, it is relevant to understand the business context and therefore, the goal of the project. Why would a company spend resources in the first place to predict user churn ahead of time?
The dataset contains user interactions of a (fictional) music streaming service called Sparkify with a business case similar to Spotify. The service can be used on two levels: free tier or premium tier.
Both levels of service generate revenue for Sparkify. The free tier is financed by advertisements between songs; the premium tier charges a monthly subscription fee for an advertisement-free experience. At any moment a user can upgrade from free to premium, downgrade back, or cancel the service completely. The goal of this project is to predict the users who will cancel the service completely.
Defining Churn
Customer churn is generally defined as a customer unsubscribing from a service, ceasing to purchase a product or stopping engagement with a service [1]. For the music streaming service in this project, churn is defined as a user cancelling the service completely by deleting his/her user account. This can happen for both paid and free tier users.
Usually, it is more expensive for a business to acquire new customers than to retain existing ones [2]. To prevent churn candidates from leaving, businesses offer special discounts or other costly measures, which typically lower the revenue per customer. Therefore, the goal is to identify users who are about to churn ahead of time with high precision, and to target only them with marketing campaigns.
Churn prediction is an important classification use case for streaming services such as Netflix, Spotify or Apple Music. Companies that can predict ahead of time which customers are about to churn can implement a more effective customer retention strategy.
After understanding the business context of the data and explaining the importance of predicting churn ahead of time, the next step is to explore the data available. The aim is to provide an overview of the available data and its quality.
The basis of the project is a 12 GB user log containing all information about user interactions with the online streaming service. The data is stored in an AWS Simple Storage Service (S3) bucket in JSON format. Datasets of this scale are challenging to process on a single computer and can therefore be referred to as big data.
Spark for Big Data
Apache Spark is a tool for large-scale data processing and will be used to work with the dataset. It allows data and computations to be spread efficiently across a network of distributed computers, called a cluster. Each cluster has nodes (computers) which perform the computations in parallel.
To reduce the necessary computation, the exploration of the data will be done on a small subset of the full dataset. The full dataset will be processed afterwards on Amazon Web Services (AWS) with an Elastic MapReduce (EMR) cluster of 3 m5.xlarge machines.
MapReduce is a technique developed by Google to process data in parallel across distributed machines. It works by first dividing up a large dataset and distributing the data across a cluster. In the map step, each chunk of data is analyzed and converted into key-value pairs. These key-value pairs are then shuffled across the cluster so that all identical keys end up on the same machine. In the final reduce step, the values with the same key are combined.
Exploring a subset of the data
The Sparkify data is a user log formatted as a table with 18 columns. The small subset contains only 286,500 rows. Each row represents an API event such as a login or playing the next song. There are numerical and categorical columns. The screenshot below shows the table structure.
Wrangling the data gives a deeper understanding. For example, filtering the “userId” and “gender” columns shows that there are 225 unique user IDs; of these users, 121 are male and 104 female.
The values in the “itemInSession” column count the interactions that happened for one user within the same session ID. Which type of user interaction (API call) happened is described by the values in the “page” column.
Possible user interactions with the service, and how often each appears in the dataset, are shown in the graph below. The most frequent page event is “NextSong”, which is the main function of the streaming service. It seems that the “NextSong” page is loaded automatically once a song ends.
The Home page is the page a user enters when starting a streaming session, and it is the second most common page event. There are exactly as many “Cancel” events as “Cancellation Confirmation” events. The “Cancel” event therefore seems to be part of the churning process itself, and it won’t be used for predicting churn in this project.
The value in the “length” column represents a song’s duration and is therefore null for all page events other than “NextSong”. Using the time passed between a “NextSong” event and the following event, it would be possible to calculate the time a user spent listening to a song.
In the dataset, each user always accesses the streaming service from the same location. This could mean that users access the service only from home. This hypothesis is supported by the fact that the “userAgent” column contains no entries for mobile devices.
The user log seems to cover a period of about two months: the “ts” column containing timestamps has a minimum value of October 1st and a maximum value of December 3rd.
After getting an understanding of the data, the next step of the CRISP-DM is to prepare the dataset for the model to train.
The first step of data preparation is cleaning the data of invalid or missing values; in this project, that means records without user IDs or session IDs. After that, an exploratory data analysis will be conducted to find possible features for customer churn prediction.
Data Cleaning
There are userId values that are empty strings. These empty user IDs appear, for instance, when someone uses the streaming service without logging in. All records containing an empty user ID (8,346) will be dropped, resulting in 278,154 rows in the cleaned dataset. These were the records with authentication status “Logged Out” (8,249) and “Guest” (97).
The goal of the following exploratory data analysis is to observe differences in the behaviour of customers who stayed versus customers who churned. One way is to explore aggregates over these two groups of users, observing how often a specific action occurred per time unit or per number of songs played.
Labeling data
First, a column “churned” will be created to differentiate between customers who churned and those who stayed with the service. This column will later be used as the label for training the supervised machine learning model. The “Cancellation Confirmation” events, which appear for both paid and free users, define the exact moment of churn.
The table below shows the last eight interactions of the user Adriel before churning. After listening to five songs, Adriel visits the downgrade page, then enters the cancel page and deletes his account.
+---------+-------------------------+--------------------+
|firstName| page | artist|
+---------+-------------------------+--------------------+
| Adriel|NextSong | Tonic|
| Adriel|NextSong | Arch Enemy|
| Adriel|NextSong |Les Ogres De Barback|
| Adriel|NextSong |The Notorious B.I.G.|
| Adriel|NextSong | Nickelback|
| Adriel|Downgrade | None|
| Adriel|Cancel | None|
| Adriel|Cancellation Confirmation| None|
+---------+-------------------------+--------------------+
Check for imbalance in Data
A new column “cancellation_event” is created to mark the exact event of cancellation confirmation. With this new column it is possible to check whether the dataset is balanced regarding the number of users who eventually churn and those who stay. This matters because if the number of users who churn is substantially lower than the number who stay, there will be fewer examples from which the model can learn how these users behave.
There is an imbalance in the dataset regarding the number of users who churned versus those who stayed. Of 225 users in total, only 52 eventually churned, which equals a churn rate of 23.11%.
How does this imbalance at the user level translate to the level of interactions? Only 16.13% of user interactions come from churned customers. The amount of interaction data available to analyse the difference in behaviour between users who stayed and users who churned is clearly imbalanced.
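The user-level rate follows directly from the counts:

```python
# User-level churn rate in the subset: 52 of 225 users churned.
churned_users, total_users = 52, 225
rate = churned_users / total_users * 100
print(f"churn rate: {rate:.2f}%")   # churn rate: 23.11%
```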
Imbalance in the training data can lead to naive behaviour in the predictions of a supervised machine learning model: with 76.89% of users not churning, a prediction accuracy of 76.89% can be achieved by simply always predicting “not churned” [3].
There are different ways to handle imbalanced data before feeding it into machine learning algorithms. One way is to manipulate the input data by undersampling the data of loyal users, oversampling the data of churned users, or generating synthetic data. In this project, the imbalance will be addressed by creating additional features.
Feature Creation
The next step is feature creation for the machine learning model. Features are created from the available data with the goal of allowing the model to distinguish between users who churn and those who do not. A useful feature exposes differences in the behaviour of loyal users and users who are likely to churn.
One example of a possible feature is the time since registration. The mean duration from registration to last interaction with the streaming service is 57.8 days for users who eventually churn and 87.1 days for users who stay with the service. It seems intuitive that loyal users stay with the service longer on average.
+-------+----------------------------+
|churned|avg(days_since_registration)|
+-------+----------------------------+
| true| 57.80769230769231|
| false| 87.14450867052022|
+-------+----------------------------+
The violin plot below shows that this difference in mean is also visible in the distribution of the values for users who churned versus users who did not churn.
Therefore, the time passed since a user registered with the streaming service is a feature that could be useful to predict users who are prone to churn.
After exploring possible features for the prediction model, the next step is to select the features which should be used for the machine learning model to decide if a user will churn or not.
The resulting features consist of 18 numerical features and two binary features. The binary features are gender and service level of the user. Among the numerical ones are:
- percentage_active_day: the percentage of days a user actually accessed the service during their registration period
- streaming_per_active_day: the accumulated length of songs the user listened to during a day
- songs_per_homevisit: the number of songs a user listened to between visits to the home page
- days_registrated: the number of days passed since the user's registration
- event_per_songs_played: for each possible event, such as visiting the home page or giving a Thumbs Down, there is a feature representing the number of occurrences of this event for a user relative to the number of songs played by the same user
Check for Multicollinearity in Features
If the model is based on algorithms like Logistic Regression or Linear Regression, the features have to be checked for multicollinearity. Multicollinearity may be present when features are highly correlated and one feature can be predicted from the others. This can lead to misleading results in the prediction of the label.
Decision tree and boosted tree algorithms are immune to multicollinearity: when deciding on a split, the tree will choose only one of the perfectly correlated features [3].
The following graph shows the pairwise Pearson correlation among the created features. Pearson’s correlation coefficient is the covariance of two features divided by the product of their standard deviations. No pair of features has a Pearson coefficient higher than +0.62 or lower than -0.63.