
- Around 75% of pageview counts lie between 1 and 4. The log of “transactionRevenue” follows a roughly normal distribution against “pageviews”, but there is no direct correlation between them.
- Around 75% of users spend less than about 4 minutes (244 seconds) on a session. The log of “transactionRevenue” also follows a roughly normal distribution against “timeOnSite”.
5.4 Channel Grouping analysis
- Most transactions come from Referral and Organic Search; these channels also generate high revenue.
- The number of transactions from Direct sources is low, but their revenue generation is high, on par with Referral and Organic Search.
5.5 Web browser analysis
- The number of transactions and the revenue generated are highest for Chrome.
- Firefox and Safari users have far fewer transactions than Chrome users, but their revenue generation is close to that of Chrome.
- Marketing teams can focus on Chrome users to maximise revenue generation.
5.6 City Analysis
- A lot of city data is missing in the dataset (58%).
- New York, Mountain View and San Francisco are the three highest revenue-generating cities, with the most transactions.
Multivariate analysis
5.7 Grouping OS and browsers to see their impact on transactionRevenue
- Both Windows and Mac users have more transactions and higher total revenue when using the Chrome browser.
- Across all operating systems, Chrome users have a higher number of transactions.
- This supports the earlier conclusion that Chrome users generate more revenue than users of other browsers.
6. Feature engineering
6.1 Data imputation
We will impute “null” values with 0 for all numerical variables. The following code snippet shows an example of imputation for the target variable “transactionRevenue”:-
%%time
print('Count of nan values:-')
print(f"Before Imputation: {train_df['totals.transactionRevenue'].isnull().sum()}")
# we will impute 'nan' with 0
train_df['totals.transactionRevenue'].fillna(0, inplace=True)
test_df['totals.transactionRevenue'].fillna(0, inplace=True)
print(f"After Imputation: {train_df['totals.transactionRevenue'].isnull().sum()}")
6.2 Delete non-useful features
We will delete features that have more than 85% missing data and that may not contain any useful information for predicting the target variable.
# list of columns to drop due to over 85% missing data
cols_to_drop = ['trafficSource.adContent', 'trafficSource.adwordsClickInfo.adNetworkType',
                'trafficSource.adwordsClickInfo.slot', 'trafficSource.adwordsClickInfo.page',
                'trafficSource.adwordsClickInfo.gclId', 'hits', 'totals.totalTransactionRevenue']
train_df.drop(cols_to_drop, axis=1, inplace=True)
# deleting all columns in the test dataframe that are not present in train
# list of columns in test_df that are not in train_df
tl = [col for col in test_df.columns if col not in train_df.columns]
# dropping those columns
test_df.drop(tl, axis=1, inplace=True)
6.3 Standardising numeric features
We will standardise the numeric features using MinMaxScaler() from scikit-learn.
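As a minimal sketch (the column name and toy values here are placeholders for the real train_df/test_df), the scaler is fit on the training data only and the same min/max is then applied to the test data:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# toy frames standing in for train_df / test_df; column name is illustrative
train_df = pd.DataFrame({'totals.pageviews': [1.0, 2.0, 4.0, 10.0]})
test_df = pd.DataFrame({'totals.pageviews': [3.0, 5.0]})

num_cols = ['totals.pageviews']
scaler = MinMaxScaler()
# fit on train only, then reuse the same min/max for test
train_df[num_cols] = scaler.fit_transform(train_df[num_cols])
test_df[num_cols] = scaler.transform(test_df[num_cols])
```

Fitting on train only avoids leaking information from the test period into the scaling.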
6.4 Label encoding categorical features
We will encode categorical features using LabelEncoder() from scikit-learn.
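A minimal sketch of the encoding step, with toy data standing in for the real dataframes. Fitting the encoder on train and test together is an assumption here, made so that categories appearing only in the test data can still be encoded:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# toy frames standing in for train_df / test_df; column name is illustrative
train_df = pd.DataFrame({'device.browser': ['Chrome', 'Firefox', 'Safari', 'Chrome']})
test_df = pd.DataFrame({'device.browser': ['Safari', 'Chrome']})

cat_cols = ['device.browser']
for col in cat_cols:
    le = LabelEncoder()
    # fit on the combined values so test-only categories are still encodable
    le.fit(pd.concat([train_df[col], test_df[col]]).astype(str))
    train_df[col] = le.transform(train_df[col].astype(str))
    test_df[col] = le.transform(test_df[col].astype(str))
```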
6.5 Time window features
Main Idea:-
https://www.kaggle.com/c/ga-customer-revenue-prediction/discussion/81542
- Inspired by the above discussion thread. The author explains that the problem is essentially a time-window-to-time-window prediction.
- He created overlapping windows of 15 days. Instead, we can try creating windows of size 168 days, since the test data given by Kaggle covers 168 days of sessions.
- We need to make sure that the target period for each window is separated from the feature window by a gap period of 45 days.
- The target period should be 2 months, the same as Kaggle's private leaderboard period.
- Another key idea is not to do any hyperparameter tuning.
- Kaggle data:-
TEST DATA : transactions from May 1st 2018 to October 15th 2018 (168 DAYS)
KAGGLE PRIVATE DATA : Dec 1st 2018 to Jan 31st 2019 (2 MONTHS)
GAP PERIOD : Time interval between Test data and Private data (45 DAYS)
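The window layout above can be sketched as follows. The training-data date range and the 40-day step between window starts are assumptions, chosen so that ten windows (8 train + 2 validation, see Section 7) fit into the training period; 2 months is taken as 62 days (Dec + Jan):

```python
import pandas as pd

# window sizes from the scheme described above
FEATURE_DAYS, GAP_DAYS, TARGET_DAYS = 168, 45, 62
STEP_DAYS = 40  # assumed step between window starts

# assumed span of the GA training data
data_start = pd.Timestamp('2016-08-01')
data_end = pd.Timestamp('2018-04-30')

windows = []
start = data_start
while True:
    feat_end = start + pd.Timedelta(days=FEATURE_DAYS)      # 168-day feature window
    tgt_start = feat_end + pd.Timedelta(days=GAP_DAYS)      # 45-day gap
    tgt_end = tgt_start + pd.Timedelta(days=TARGET_DAYS)    # 2-month target period
    if tgt_end > data_end:
        break
    windows.append((start, feat_end, tgt_start, tgt_end))
    start += pd.Timedelta(days=STEP_DAYS)
```

Each tuple gives the feature window and its target period; features are built from sessions inside [start, feat_end) and the label is the revenue inside [tgt_start, tgt_end).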
7. Data splitting: Train, validation and test sets
- We will use the first 8 windows for training and the last 2 windows for validation (we will not be doing parameter tuning, but this validation set will be used to compare models, so it can also be thought of as test data).
- We will use the test data provided by Kaggle for the submission and for getting the private leaderboard score.
8. Machine learning models
As discussed in the feature engineering section and suggested here, we will not be doing any hyperparameter tuning for our models.
8.1 LightGBM (Light Gradient Boosting Machine)
- This model gave the best Kaggle private score of 0.884 among all the models I tried.
- The following code snippet shows how to create the submission file.
- Feature importance according to the LightGBM model.
8.2 Random Forest model
- Random Forest gave a score of 0.9373 on private leaderboard.
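Under the same no-tuning constraint, a comparable Random Forest baseline can be sketched as follows (toy data as placeholders for the window-based features):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# toy stand-ins for the window-based features
rng = np.random.default_rng(42)
X_train = pd.DataFrame(rng.random((200, 5)), columns=[f'f{i}' for i in range(5)])
y_train = rng.random(200)
X_val = pd.DataFrame(rng.random((50, 5)), columns=X_train.columns)

# default parameters, consistent with the no-tuning approach
rf = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
val_preds = np.clip(rf.predict(X_val), 0, None)  # log revenue cannot be negative
```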
9. Results
The following table summarises the results (private leaderboard scores):

| Model | Private leaderboard score |
| --- | --- |
| LightGBM | 0.884 |
| Random Forest | 0.9373 |
10. Future work
In future work we can try the following things:-
- Try ensembling different models.
- Try various sizes of time windows; in our case we did not use overlapping windows, which could also be tried.