The prediction problem at hand is a linear classification problem. The Perceptron algorithm and Stochastic Gradient Descent are widely used for linear classification. Here, we will look at and compare the performance of the Perceptron and Pegasos (Primal Estimated sub-GrAdient SOlver for SVM) algorithms.
Feature Vector Construction
A simple way to express the classification problem is finding the best theta and theta_0 so that the sign of theta*x + theta_0 agrees with the label y for as many points x as possible. In this case, x is our feature vector and y is our label. We already labeled our data in the pre-processing stage, but we haven't constructed our feature vectors yet.
Our assumption is that the future popularity of a submission can be predicted from its title, author, flair, content and top comments. One simple and natural way to construct our feature vector is the "bag of words" approach: we iterate through all the words in the entire dataset and represent our combined text as a matrix of shape (n, m), where n is the number of submissions and m is the total number of unique words. For each row (i.e. each submission), the value in a column is 1 if the corresponding word appears in the submission, and 0 otherwise.
def bag_of_words(texts, remove_stopword=True):
    """
    Inputs a list of string submissions.
    Returns a dictionary of unique unigrams occurring over the input,
    mapping each word to a unique index.
    """
    # Optionally load a stopword list so common filler words are excluded
    stopword = set()
    if remove_stopword:
        with open("data/stopwords.txt") as fp:
            for line in fp:
                stopword.add(line.strip())

    dictionary = {}  # maps word to unique index
    for text in texts:
        word_list = extract_words(text)
        for word in word_list:
            if word not in dictionary and word not in stopword:
                dictionary[word] = len(dictionary)
    return dictionary
def extract_bow_feature_vectors(submissions, dictionary, binarize=True):
    """
    Inputs a list of string submissions and the dictionary of words as given
    by bag_of_words.
    Returns the bag-of-words feature matrix representation of the data.
    The returned matrix is of shape (n, m), where n is the number of submissions
    and m the total number of entries in the dictionary.
    """
    num_submissions = len(submissions)
    feature_matrix = np.zeros([num_submissions, len(dictionary)], dtype=np.float64)
    for i, text in enumerate(submissions):
        word_list = extract_words(text)
        for word in word_list:
            if word in dictionary:
                if binarize:
                    feature_matrix[i, dictionary[word]] = 1
                else:
                    feature_matrix[i, dictionary[word]] += 1
    return feature_matrix
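As a quick illustration of how these two functions fit together, the feature matrix could be built as in the minimal sketch below. The example texts and the variable names train_texts and train_bow_features are made up for illustration, and the sketch assumes numpy and the extract_words helper from the pre-processing stage are available; remove_stopword is set to False here only so the example does not depend on data/stopwords.txt.

import numpy as np

# Hypothetical example: build the vocabulary from the training texts,
# then convert the same texts into a binary bag-of-words matrix.
train_texts = [
    "rocket launch scheduled for friday",
    "friday megathread ask your questions here",
]
dictionary = bag_of_words(train_texts, remove_stopword=False)
train_bow_features = extract_bow_feature_vectors(train_texts, dictionary)
print(train_bow_features.shape)  # (2, number of unique words)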
With both our feature vectors and labels ready, we can proceed with the Perceptron algorithm as follows:
The intuition is quite simple: if a point is classified correctly, the predicted value theta*x + theta_0 and the label y share the same sign; if the point is classified incorrectly, the two values have opposite signs and their product is at most 0. In that case we update the parameters theta and theta_0 accordingly.
def perceptron_single_step_update(
    feature_vector, label, current_theta, current_theta_0
):
    # Update only on a mistake (or a point exactly on the decision boundary)
    if label * (np.dot(current_theta, feature_vector) + current_theta_0) <= 0:
        current_theta += label * feature_vector
        current_theta_0 += label
    return (current_theta, current_theta_0)
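For completeness, here is a minimal sketch of how the single-step update could be wrapped into a full training loop. The function name perceptron and the use of T plain passes over the data in order are illustrative assumptions, not part of the snippet above.

import numpy as np

def perceptron(feature_matrix, labels, T):
    # A minimal sketch of the full Perceptron loop: run T passes over the data,
    # applying the single-step update to each (feature vector, label) pair.
    n, m = feature_matrix.shape
    theta = np.zeros(m)
    theta_0 = 0.0
    for _ in range(T):
        for i in range(n):
            theta, theta_0 = perceptron_single_step_update(
                feature_matrix[i], labels[i], theta, theta_0
            )
    return theta, theta_0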
Perceptron will do a pretty good job of fitting the training data because it uses training error to update and correct its parameters. However, it may overfit, and often the accuracy on the validation set will reflect that. The Pegasos algorithm can somewhat mitigate this issue.
Pegasos looks at classification as an optimization problem where we are trying to minimize some objective function J, or more formally:

J(theta, theta_0) = (1/n) * sum over i of max{0, 1 - y_i * (theta * x_i + theta_0)} + (lambda/2) * ||theta||^2

The first term is called the hinge loss, and the second is the regularization term, with lambda being the regularization parameter. The idea is that the larger the regularization parameter, the more "regularized" the parameters are and the less they overfit, but at the same time the loss from training errors can be higher. The balance between training error and overfitting can be achieved by tuning lambda.
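To make the objective concrete, here is a small sketch that computes J for a given theta and theta_0. The function name svm_objective is an assumption for illustration; L stands for lambda, matching the parameter name used in the Pegasos code further below.

import numpy as np

def svm_objective(feature_matrix, labels, theta, theta_0, L):
    # Average hinge loss over the data set ...
    margins = labels * (feature_matrix @ theta + theta_0)
    hinge_loss = np.maximum(0.0, 1.0 - margins).mean()
    # ... plus the L2 regularization term scaled by lambda / 2
    regularization = (L / 2.0) * np.dot(theta, theta)
    return hinge_loss + regularization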
Stochastic Gradient Descent
Since we’ve turned the problem into an optimization problem, all we need to do is to find the theta and theta_0 that minimize J. We use Stochastic Gradient Descent (SGD) as follows:
The idea of Gradient Descent is that we update our parameters by moving against the direction of the gradient, by an amount proportional to the gradient at that point, until we reach a minimum. Stochastic Gradient Descent is simply a more computationally efficient variant: instead of applying Gradient Descent to the whole objective function, it samples training points at random and nudges the parameters in the corresponding direction one point at a time. The parameter eta is called the learning rate and is there to mitigate the randomness we introduced in SGD. A simple form of eta is 1/(1+t), where t is the number of updates performed so far, so the step size decays over time.
One thing to note is that, unlike Perceptron, Pegasos updates the parameters on every step: the regularization shrinkage is applied regardless of whether the point is classified correctly, and the data-dependent term is added whenever the margin condition y * (theta*x + theta_0) <= 1 holds.
def pegasos_single_step_update(
    feature_vector, label, L, eta, current_theta, current_theta_0
):
    """
    Properly updates the classification parameters, theta and theta_0, on a
    single step of the Pegasos algorithm.

    Args:
        feature_vector - A numpy array describing a single data point.
        label - The correct classification of the feature vector.
        L - The lambda value being used to update the parameters.
        eta - Learning rate to update parameters.
        current_theta - The current theta being used by the Pegasos
            algorithm before this update.
        current_theta_0 - The current theta_0 being used by the
            Pegasos algorithm before this update.

    Returns: A tuple where the first element is a numpy array with the value of
    theta after the current update has completed and the second element is a
    real valued number with the value of theta_0 after the current update has
    completed.
    """
    f = 1 - (eta * L)
    if label * (np.dot(feature_vector, current_theta) + current_theta_0) <= 1:
        # Margin violated (or point within the margin): shrink theta and add the data term
        return (
            (f * current_theta) + (eta * label * feature_vector),
            (current_theta_0) + (eta * label),
        )
    # Margin satisfied: only the regularization shrinkage applies to theta
    return (f * current_theta, current_theta_0)
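As with Perceptron, the single-step update can be wrapped into a full SGD loop. The sketch below is one way that loop might look; the function name pegasos, the use of numpy's random permutation for sampling, and the 1/(1+t) learning-rate schedule are illustrative choices consistent with the description above, not part of the snippet itself.

import numpy as np

def pegasos(feature_matrix, labels, T, L):
    # A minimal sketch of the full Pegasos loop: visit points in a random order,
    # decay the learning rate, and apply the single-step update each time.
    n, m = feature_matrix.shape
    theta = np.zeros(m)
    theta_0 = 0.0
    t = 0
    rng = np.random.default_rng(0)
    for _ in range(T):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (1.0 + t)  # decaying learning rate, per the text above
            theta, theta_0 = pegasos_single_step_update(
                feature_matrix[i], labels[i], L, eta, theta, theta_0
            )
    return theta, theta_0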