
- Around 75% of pageview counts lie between 1 and 4. The log of “transactionRevenue” follows a roughly normal distribution against “pageviews”, but there is no direct correlation between them.
- Around 75% of users spend less than about 4 minutes (244 seconds) on a session. The log of “transactionRevenue” also follows a roughly normal distribution against “timeOnSite”.
5.4 Channel Grouping analysis
- Most transactions come from Referral and Organic Search; these channels also generate high revenue.
- The number of transactions from Direct sources is low, but their revenue generation is high, on par with Referral and Organic Search.
5.5 Web browser analysis
- The number of transactions and the revenue generated are highest for Chrome.
- Firefox and Safari users have far fewer transactions than Chrome users, but their revenue generation is close to that of Chrome.
- Marketing teams can focus on Chrome users to maximise revenue generation.
5.6 City Analysis
- A lot of city data is missing in the dataset (58%).
- New York, Mountain View and San Francisco are the three highest revenue-generating cities, with the most transactions.
Multivariate analysis
5.7 Grouping OS and browsers to see their impact on transactionRevenue
- Both Windows and Mac users have more transactions and higher total revenue when using the Chrome browser.
- Across all operating systems, Chrome users have a higher number of transactions.
- This supports the earlier conclusion that Chrome users generate more revenue than users of other browsers.
6. Feature engineering
6.1 Data imputation
We will impute “null” values with 0 for all numerical variables. The following code snippet shows an example of imputation for the target variable “transactionRevenue”:-
%%time
print('Count of nan values:-')
print(f"Before Imputation: {train_df['totals.transactionRevenue'].isnull().sum()}")
# we will impute 'nan' with 0
train_df['totals.transactionRevenue'].fillna(0, inplace=True)
test_df['totals.transactionRevenue'].fillna(0, inplace=True)
print(f"After Imputation: {train_df['totals.transactionRevenue'].isnull().sum()}")
6.2 Delete non-useful features
We will delete features that have more than 85% missing data and that may not contain any useful information for predicting the target variable.
# list of columns to drop due to over 85% missing data
cols_to_drop = ['trafficSource.adContent', 'trafficSource.adwordsClickInfo.adNetworkType',
                'trafficSource.adwordsClickInfo.slot', 'trafficSource.adwordsClickInfo.page',
                'trafficSource.adwordsClickInfo.gclId', 'hits', 'totals.totalTransactionRevenue']
train_df.drop(cols_to_drop, axis=1, inplace=True)
# deleting all columns in the test dataframe that are not present in train
# list of columns in test_df that are not in train_df
tl = [col for col in test_df.columns if col not in train_df.columns]
# dropping those columns
test_df.drop(tl, axis=1, inplace=True)
6.3 Standardising numeric features
We will standardise the numeric features using MinMaxScaler() from scikit-learn.
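As a minimal sketch (the column name and toy values here are placeholders for the real train_df/test_df), the scaler is fit on the training data only and the same min/max is then applied to the test data:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# toy frames standing in for train_df / test_df; column name is illustrative
train_df = pd.DataFrame({'totals.pageviews': [1.0, 2.0, 4.0, 10.0]})
test_df = pd.DataFrame({'totals.pageviews': [3.0, 5.0]})

num_cols = ['totals.pageviews']
scaler = MinMaxScaler()
# fit on train only, then reuse the same min/max for test
train_df[num_cols] = scaler.fit_transform(train_df[num_cols])
test_df[num_cols] = scaler.transform(test_df[num_cols])
```

Fitting on train only avoids leaking information from the test period into the scaling.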
6.4 Label encoding categorical features
We will encode categorical features using LabelEncoder() from scikit-learn.
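A minimal sketch of the encoding step, with toy data standing in for the real dataframes. Fitting the encoder on train and test together is an assumption here, made so that categories appearing only in the test data can still be encoded:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# toy frames standing in for train_df / test_df; column name is illustrative
train_df = pd.DataFrame({'device.browser': ['Chrome', 'Firefox', 'Safari', 'Chrome']})
test_df = pd.DataFrame({'device.browser': ['Safari', 'Chrome']})

cat_cols = ['device.browser']
for col in cat_cols:
    le = LabelEncoder()
    # fit on the combined values so test-only categories are still encodable
    le.fit(pd.concat([train_df[col], test_df[col]]).astype(str))
    train_df[col] = le.transform(train_df[col].astype(str))
    test_df[col] = le.transform(test_df[col].astype(str))
```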
6.5 Time window features
Main Idea:-
https://www.kaggle.com/c/ga-customer-revenue-prediction/discussion/81542
- Inspired by the above discussion thread. The author explains that the problem is essentially a time-window-to-time-window prediction.
- He created overlapping windows of 15 days. Instead, we can try creating windows of size 168 days, since the test data given by Kaggle covers 168 days of sessions.
- We need to make sure that the target period for each window is separated from the feature window by a gap period of 45 days.
- The target period should be 2 months, the same as Kaggle's private leaderboard period.
- Another key idea is not to do any hyperparameter tuning.
- Kaggle data:-
TEST DATA : transactions from May 1st 2018 to October 15th 2018 (168 DAYS)
KAGGLE PRIVATE DATA : Dec 1st 2018 to Jan 31st 2019 (2 MONTHS)
GAP PERIOD : Time interval between Test data and Private data (45 DAYS)
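The window layout above can be sketched as follows. The training-data date range and the 40-day step between window starts are assumptions, chosen so that ten windows (8 train + 2 validation, see Section 7) fit into the training period; 2 months is taken as 62 days (Dec + Jan):

```python
import pandas as pd

# window sizes from the scheme described above
FEATURE_DAYS, GAP_DAYS, TARGET_DAYS = 168, 45, 62
STEP_DAYS = 40  # assumed step between window starts

# assumed span of the GA training data
data_start = pd.Timestamp('2016-08-01')
data_end = pd.Timestamp('2018-04-30')

windows = []
start = data_start
while True:
    feat_end = start + pd.Timedelta(days=FEATURE_DAYS)      # 168-day feature window
    tgt_start = feat_end + pd.Timedelta(days=GAP_DAYS)      # 45-day gap
    tgt_end = tgt_start + pd.Timedelta(days=TARGET_DAYS)    # 2-month target period
    if tgt_end > data_end:
        break
    windows.append((start, feat_end, tgt_start, tgt_end))
    start += pd.Timedelta(days=STEP_DAYS)
```

Each tuple gives the feature window and its target period; features are built from sessions inside [start, feat_end) and the label is the revenue inside [tgt_start, tgt_end).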
7. Data splitting: Train, validation and test sets
- We will use the first 8 windows for training and the last 2 windows for validation (we will not be doing parameter tuning, but this validation set will be used to compare models, so it can also be thought of as test data).
- We will use the test data provided by Kaggle for the submission and for getting the private leaderboard score.
8. Machine learning models
As discussed in the feature engineering section and suggested here, we will not be doing any hyperparameter tuning for our models.
8.1 LightGBM (Light Gradient Boosting Machine)
- This model gave the best Kaggle private score of 0.884 among all the models I tried.
- The following code snippet shows how to create the submission file.
- Feature importance according to the LightGBM model.
8.2 Random Forest model
- Random Forest gave a score of 0.9373 on private leaderboard.
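Under the same no-tuning constraint, a comparable Random Forest baseline can be sketched as follows (toy data as placeholders for the window-based features):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# toy stand-ins for the window-based features
rng = np.random.default_rng(42)
X_train = pd.DataFrame(rng.random((200, 5)), columns=[f'f{i}' for i in range(5)])
y_train = rng.random(200)
X_val = pd.DataFrame(rng.random((50, 5)), columns=X_train.columns)

# default parameters, consistent with the no-tuning approach
rf = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
val_preds = np.clip(rf.predict(X_val), 0, None)  # log revenue cannot be negative
```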
9. Results
The following table summarises the results (private leaderboard scores):

| Model | Private leaderboard score |
| --- | --- |
| LightGBM | 0.884 |
| Random Forest | 0.9373 |
10. Future work
In future work we can try the following things:-
- Try ensembling different models.
- Try various sizes of time windows; in our case we did not use overlapping windows, which could also be tried.