Customer Churn Prediction with Spark

Predicting music streaming service user churn with exploratory analysis and machine learning

Edited by Peter Le

Customer churn prevention is a pressing and challenging problem for almost every product and service company. If a company can use customer-usage data to find trends that indicate which customers are likely to churn, it can offer those customers incentives to stay, building the loyal customer base that is key to the company’s growth.

Sparkify is a digital music service similar to Spotify and Pandora. In Sparkify, users can either listen to music for free or buy a subscription. To identify users who are likely to churn, we first perform an exploratory analysis to glean insights from the data set and identify key variables of interest. We then experiment with different model algorithms and select the best model based on key evaluation metrics such as F1 score and accuracy, using the Spark ML library.

Data

The data we have from Sparkify is composed of user events. Every interaction a user has with the application is recorded. In other words, every time a user visits the home page, listens to a song, skips to the next song, gives a song a thumbs up, etc., a corresponding event is added to the data.

The “churn” label is generated from the dataset by identifying users who confirm their subscription cancellation. Once churned users are identified, we can examine how churn relates to the other features in the dataset. We will explore the data for trends and features that may influence the churn rate.
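The labeling step described above can be sketched in plain Python on a toy event log. This is only an illustration under stated assumptions: the field names `userId` and `page`, and the page value "Cancellation Confirmation", are hypothetical stand-ins for whatever the real event schema uses.

```python
# Toy event log; the field names and page values are illustrative assumptions.
events = [
    {"userId": "u1", "page": "Next Song"},
    {"userId": "u1", "page": "Thumbs Up"},
    {"userId": "u2", "page": "Next Song"},
    {"userId": "u2", "page": "Cancellation Confirmation"},
    {"userId": "u3", "page": "Home"},
]

# Assumed name of the event logged when a user confirms cancellation.
CANCEL_PAGE = "Cancellation Confirmation"


def label_churn(events, cancel_page=CANCEL_PAGE):
    """Return {userId: 1 if the user ever confirmed cancellation, else 0}."""
    labels = {}
    for event in events:
        user = event["userId"]
        churned = int(event["page"] == cancel_page)
        # Once a user is flagged as churned, the flag stays set.
        labels[user] = max(labels.get(user, 0), churned)
    return labels


churn_labels = label_churn(events)
# Churn rate = churned users / all users.
churn_rate = sum(churn_labels.values()) / len(churn_labels)
print(churn_labels, round(churn_rate, 3))
```

The same idea carries over to a Spark DataFrame: flag the cancellation events, then take the maximum flag per user.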

What is the churn rate of Sparkify?

Figure 1. Distribution of users by churn type

The figure above shows that out of 225 total users, 52 were identified as churned; this is approximately 23% of the user base.

Does gender affect the churn rate?

Figure 2. Distribution of Churn per gender type

The figure above illustrates churn per gender. We have more male users in our dataset (~54% male, ~46% female), so it’s no surprise that more of the churned users are male. However, the churn rate is also noticeably higher for males than for females (26% vs 19%).

What is the page distribution for user activity?

Figure 3. Page distribution for user activity

Many users visit the ‘Next Song’ page, which is a good sign for a music streaming business. ‘Thumbs Up’ is another important signal, suggesting that users like the songs played and enjoy the app. Visits to ‘Home’ may indicate regular user activity in the app.

What is the page distribution for churn?

Figure 4. Distribution of page and churn

Next, we’ll explore the distribution of page visits for churned and non-churned users. Pages such as ‘Next Song’, ‘Thumbs Up’, ‘Add Friend’, and ‘Add to Playlist’ have a higher proportion of non-churned users. The number of times a user visits these pages may therefore help indicate whether that user is likely to churn.

Does the type of user device influence the churn rate?

Figure 5. Churn per device

We can see that most users access the service from Windows or Mac, which also account for the most customer churn. The churn rate for Windows users is 18.5%, slightly higher than the 18.1% for Mac users. Devices such as X11 and iPhone have a much smaller user base and therefore fewer churned users.

Does user location affect the churn rate?

Figure 6. Churn vs total user in location

The locations with the most total users and the most churned users are ‘Los Angeles-Long Beach-Anaheim, CA’, ‘New York-Newark-Jersey City, NY-NJ-PA’, and ‘Phoenix-Mesa-Scottsdale, AZ’. Users are scattered widely, and almost every location has only a few users.

After the exploratory data analysis, we engineered 10 features that we hypothesized would influence user churn. The feature importances obtained from model training will then help determine which model and features to adopt for modelling on the full dataset.

As a result, a Spark dataframe is created with 10 features:

1. Gender: Gender of the user. (Binary)
2. Level: Latest level of a user. (Binary)
3. Length: Total length of songs the user listened to (Float)
4. Average Session Duration: User average session duration (Float)
5. Location: Location of the user (Binary)
6. Page: Number of visits per page, for pages such as Add Friend, Add to Playlist, Downgrade, Home, etc. (Integer)
7. Time Since Registration: Time since user registration (Integer)
8. Sessions: Total number of sessions (Integer)
9. Songs: Total number of songs played (Integer)
10. User Agent: Device/Agent used by the user (Integer)
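To make the per-user aggregation behind several of these features concrete, here is a minimal plain-Python sketch on toy data. The field names (`userId`, `sessionId`, `page`, `length`) are hypothetical, and counting ‘Next Song’ events as songs played is an assumption rather than the confirmed definition used for the feature list above.

```python
from collections import Counter, defaultdict

# Toy events; every field name here is an illustrative assumption.
events = [
    {"userId": "u1", "sessionId": 1, "page": "Next Song", "length": 200.0},
    {"userId": "u1", "sessionId": 1, "page": "Thumbs Up", "length": None},
    {"userId": "u1", "sessionId": 2, "page": "Next Song", "length": 180.5},
    {"userId": "u2", "sessionId": 7, "page": "Home", "length": None},
    {"userId": "u2", "sessionId": 7, "page": "Next Song", "length": 240.0},
]


def build_user_features(events):
    """Collapse an event log into one feature record per user."""
    sessions = defaultdict(set)         # distinct session ids per user
    page_counts = defaultdict(Counter)  # visits per page per user
    total_length = defaultdict(float)   # summed song length per user

    for event in events:
        user = event["userId"]
        sessions[user].add(event["sessionId"])
        page_counts[user][event["page"]] += 1
        if event["length"] is not None:  # non-song events carry no length
            total_length[user] += event["length"]

    return {
        user: {
            "sessions": len(sessions[user]),
            # Assumption: one 'Next Song' event equals one song played.
            "songs": page_counts[user]["Next Song"],
            "length": total_length[user],
            "page_counts": dict(page_counts[user]),
        }
        for user in sessions
    }


features = build_user_features(events)
print(features["u1"])
```

In Spark, the equivalent would be a group-by on the user identifier with count, distinct-count, and sum aggregations.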

The dataframe is split into 70% for training, 30% for testing.

As this is a classification problem (churn/not churn), we will fit logistic regression, random forest, gradient boosting, and decision tree classifiers with default parameters. We’ll measure both F1 score and accuracy. Because only about 23% of users churned, the classes are imbalanced, so F1 score is a more reliable metric than accuracy in this case.
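A small worked example shows why accuracy can mislead on imbalanced data. The sketch below uses the class counts reported earlier (52 churned out of 225 users) and a hypothetical classifier that always predicts "not churn". The F1 computed here is for the churn class only; a library evaluator may define F1 differently (for example, averaged over classes), so treat this as an illustration rather than the exact metric behind the reported scores.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Compute accuracy and the F1 score of the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return accuracy, f1


# 52 churned users (label 1) and 173 retained users (label 0).
y_true = [1] * 52 + [0] * 173
# Hypothetical baseline that never predicts churn.
y_pred = [0] * 225

acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.3f}, churn-class F1={f1:.3f}")
```

The baseline is right about 77% of the time yet never identifies a single churned user, which is exactly the failure that accuracy hides and F1 exposes.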

Results

Figure 7. Scores for the classifiers and the tuned logistic regression

With default parameters, gradient boosting has the highest F1 score of all the models, and the decision tree comes second.

Model Tuning

The logistic regression model was selected for parameter tuning with grid search to see whether it could be improved further. Two parameters were searched: regParam ([0.1, 0.01]) and fitIntercept ([True, False]). Unfortunately, the F1 score and accuracy of the resulting model decreased slightly.
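Grid search tries every combination of the listed values, so the search above covers 2 x 2 = 4 candidate configurations. The plain-Python sketch below only enumerates those combinations; the helper name `expand_grid` is hypothetical, and in Spark ML the grid would typically be built with `ParamGridBuilder` instead.

```python
from itertools import product

# Parameter values taken from the tuning step described above.
param_grid = {
    "regParam": [0.1, 0.01],
    "fitIntercept": [True, False],
}


def expand_grid(grid):
    """Return one dict per combination of parameter values."""
    names = list(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


candidates = expand_grid(param_grid)
for candidate in candidates:
    print(candidate)
```

Each candidate is trained and scored in turn, and the configuration with the best evaluation metric is kept.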

Feature Importance
