The prediction problem at hand is a linear classification problem. The Perceptron algorithm and Stochastic Gradient Descent are widely used for linear classification. Here we will compare the performance of the Perceptron and Pegasos (Primal Estimated sub-GrAdient SOlver for SVM) algorithms.

**Feature Vector Construction**

A simple expression of the classification problem is finding the best theta and theta_0 so that sign(theta·x + theta_0) = y for as many x as possible, where x is a feature vector and y is its label. We labeled our data in the pre-processing stage but haven't yet constructed our feature vectors.
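As a minimal illustration of this decision rule (the numbers below are made up; in practice theta and theta_0 are learned from the data):

```python
import numpy as np

# Illustrative values only; theta and theta_0 would be learned.
theta = np.array([0.5, -1.0, 2.0])
theta_0 = 0.1
x = np.array([1.0, 0.0, 1.0])  # one feature vector

# The predicted label is the sign of the decision value theta.x + theta_0.
y_hat = np.sign(theta @ x + theta_0)  # 0.5 + 2.0 + 0.1 = 2.6 -> +1
print(y_hat)  # 1.0
```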

Our assumption is that the future popularity of a submission can be predicted from its title, author, flair, content, and top comments. One simple and natural way to construct the feature vector is the “bag of words” approach: iterate through all the words in the entire dataset and represent the combined text as a matrix of shape (n, m), where n is the number of submissions and m is the total number of unique words. For each row (each submission), the entry is 1 if the corresponding word appears in the submission and 0 otherwise.

```python
def bag_of_words(texts, remove_stopword=True):
    """
    Inputs a list of string submissions
    Returns a dictionary of unique unigrams occurring over the input
    """
    stopword = set()
    if remove_stopword:
        with open("data/stopwords.txt") as fp:
            for line in fp:
                word = line.strip()
                stopword.add(word)

    dictionary = {}  # maps word to unique index
    for text in texts:
        word_list = extract_words(text)
        for word in word_list:
            if word not in dictionary:
                if word in stopword:
                    continue
                dictionary[word] = len(dictionary)
    return dictionary
```

```python
import numpy as np

def extract_bow_feature_vectors(submissions, dictionary, binarize=True):
    """
    Inputs a list of string submissions and the dictionary of words
    as given by bag_of_words.
    Returns the bag-of-words feature matrix representation of the data.
    The returned matrix is of shape (n, m), where n is the number of
    submissions and m the total number of entries in the dictionary.
    """
    num_submissions = len(submissions)
    feature_matrix = np.zeros([num_submissions, len(dictionary)], dtype=np.float64)
    for i, text in enumerate(submissions):
        word_list = extract_words(text)
        for word in word_list:
            if word in dictionary:
                if binarize:
                    feature_matrix[i, dictionary[word]] = 1
                else:
                    feature_matrix[i, dictionary[word]] += 1
    return feature_matrix
```
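As a quick sanity check of this representation, here is a self-contained rerun of the same logic on a toy corpus, with a hypothetical whitespace tokenizer standing in for `extract_words` and no stop-word filtering:

```python
import numpy as np

def tokenize(text):
    # Hypothetical stand-in for extract_words: lowercase + whitespace split.
    return text.lower().split()

texts = ["Cats purr loudly", "Dogs bark loudly"]

# Build the word -> column-index dictionary (as bag_of_words does).
vocab = {}
for text in texts:
    for word in tokenize(text):
        if word not in vocab:
            vocab[word] = len(vocab)

# Binary (n, m) feature matrix: 1 if the word occurs in the submission.
X = np.zeros((len(texts), len(vocab)))
for i, text in enumerate(texts):
    for word in tokenize(text):
        X[i, vocab[word]] = 1

print(vocab)  # {'cats': 0, 'purr': 1, 'loudly': 2, 'dogs': 3, 'bark': 4}
print(X)
# [[1. 1. 1. 0. 0.]
#  [0. 0. 1. 1. 1.]]
```

Note the shared word "loudly" activates the same column in both rows, which is exactly the overlap the linear classifier can exploit.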

With both our feature matrix and labels ready, we can proceed with the Perceptron algorithm as follows: