
Online display advertising is a multi-billion dollar business, with annual revenue of 31.7 billion US dollars in fiscal year 2016, up 29% from fiscal year 2015. One of the core problems display advertising strives to solve is delivering the right ads to the right people, in the right context, at the right time. Accurately predicting the click-through rate (CTR) is crucial to solving this problem, and it has attracted much research attention in the past few years.
The data involved in CTR prediction are typically multi-field categorical data, which are also ubiquitous in many applications besides display advertising, such as recommender systems. Such data possess the following properties. First, all the features are categorical and very sparse, since many of them are identifiers; therefore, the total number of features can easily reach millions. Second, every feature belongs to one and only one field, and there can be tens to hundreds of fields.
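To make these two properties concrete, here is a minimal sketch of one-hot encoding for multi-field categorical rows. The field names and values are made up for illustration and are not taken from the real dataset; the point is only that the feature count grows with every new identifier, while each row activates exactly one feature per field.

```python
# Toy multi-field categorical rows; field names and values are hypothetical.
rows = [
    {"user_id": "u1", "device": "mobile", "browser": "chrome"},
    {"user_id": "u2", "device": "desktop", "browser": "firefox"},
    {"user_id": "u3", "device": "mobile", "browser": "chrome"},
]


def build_vocabulary(rows):
    """Assign one column index to every distinct (field, value) pair."""
    vocab = {}
    for row in rows:
        for field, value in row.items():
            vocab.setdefault((field, value), len(vocab))
    return vocab


def one_hot_indices(row, vocab):
    """Return the sorted column indices that are set to 1 for this row."""
    return sorted(vocab[(field, value)] for field, value in row.items())


vocab = build_vocabulary(rows)
encoded = [one_hot_indices(row, vocab) for row in rows]

n_fields = len(rows[0])
print(f"{len(vocab)} one-hot features in total, {n_fields} active per row")
for indices in encoded:
    print(indices)
```

Storing only the active indices, as done here, is the usual way to keep such sparse rows small: the identifier field alone adds a new column for every new user, while each row still holds just one entry per field.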
The characteristics of multi-field categorical data pose some unique challenges:
1- Feature interactions are prevalent and need to be specifically modeled.
2- Features from one field often interact differently with features from different other fields.
3- Potentially high model complexity needs to be taken care of.
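The sketch below illustrates the first and third challenges under simple assumptions. It builds explicit crossed features for every pair of fields in a row and counts how many field pairs exist. The helper names (`cross_features`, `n_field_pairs`) and the sample values are hypothetical, and explicit crossing is only one possible way to represent interactions, not necessarily the approach used later in this article.

```python
from itertools import combinations


def cross_features(row):
    """Return one crossed (field pair, value pair) feature per pair of fields."""
    crossed = []
    for (f1, v1), (f2, v2) in combinations(sorted(row.items()), 2):
        crossed.append((f"{f1}x{f2}", f"{v1}&{v2}"))
    return crossed


def n_field_pairs(n_fields):
    """Number of unordered field pairs: n * (n - 1) / 2."""
    return n_fields * (n_fields - 1) // 2


# Hypothetical row with three fields.
row = {"advertiser_id": "a7", "publisher": "p3", "device": "mobile"}
for name, value in cross_features(row):
    print(name, value)

# The number of field pairs grows quadratically with the number of fields.
print(n_field_pairs(3), n_field_pairs(20))
```

Each crossed feature can take as many values as the product of the two fields' cardinalities, which is why modeling interactions naively can make the model very large.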
Because of these challenges, standard classification models often do not perform well enough on this kind of data. In this article, we will first describe the characteristics of our dataset, and then implement some well-known classification algorithms such as logistic regression, SVM, and random forest.
The dataset we will work with was provided by Yektanet (an Iranian digital advertising platform). The labeled ‘training’ portion of the dataset consists of over 4 million records.
This dataset consists of 14 features plus 1 target, ‘Clicked’. Every time a user with a unique ‘User Id’ visits a web page with a ‘Doc Id’, a page view is recorded, and a unique ‘Display Id’ is generated for it. On every page view, several advertisements are shown to the user simultaneously. ‘Timestamp’ indicates when the event happened, and, as their names suggest, ‘Hour Of Day’ and ‘Day Of Week’ give the hour and the weekday of the event. The content of each advertisement is identified by a ‘Creative Id’; each creative belongs to an advertising campaign identified by a ‘Campaign Id’, and every campaign belongs to an advertiser identified by an ‘Advertiser Id’. Each user uses a specific ‘Device’ to visit a web page. The ‘OS’ column indicates the operating system used by the user, and the ‘Browser’ column indicates which browser the user used to visit the web page. Each advertisement is displayed in a specific part of a web page called a widget; ‘Widget Id’ indicates which part of the web page is allocated to the advertisement. Each web page has a ‘Publisher’ and a ‘Source’.
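The relationships described above can be sketched on a few hypothetical rows. The sketch assumes one row per advertisement shown in a page view, uses invented snake_case column names and values, and is not the real data. It counts how many ads share a ‘Display Id’ and computes the click-through rate per campaign.

```python
from collections import defaultdict

# Hypothetical rows: one row per (page view, advertisement shown).
records = [
    {"display_id": 1, "creative_id": "c1", "campaign_id": "k1", "clicked": 0},
    {"display_id": 1, "creative_id": "c2", "campaign_id": "k2", "clicked": 1},
    {"display_id": 2, "creative_id": "c1", "campaign_id": "k1", "clicked": 0},
    {"display_id": 2, "creative_id": "c3", "campaign_id": "k1", "clicked": 1},
]


def ads_per_display(records):
    """Count how many advertisements were shown in each page view."""
    counts = defaultdict(int)
    for record in records:
        counts[record["display_id"]] += 1
    return dict(counts)


def ctr_by(records, key):
    """Click-through rate (clicks / impressions) grouped by the given column."""
    shown = defaultdict(int)
    clicks = defaultdict(int)
    for record in records:
        shown[record[key]] += 1
        clicks[record[key]] += record["clicked"]
    return {group: clicks[group] / shown[group] for group in shown}


print(ads_per_display(records))
print(ctr_by(records, "campaign_id"))
```

The same `ctr_by` helper can be called with any other categorical column, for example `"creative_id"`, to compare click-through rates at a different level of the creative, campaign, advertiser hierarchy.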
You can also apply everything we do here to other freely available, well-known CTR prediction datasets. Due to copyright, we cannot publish the content of our data.
First, let's look at 10 samples of the data to see what we are dealing with: