The problem is to classify Quora question pairs into two categories, duplicate or not duplicate, on the basis of their content. The main task is to identify whether two questions are duplicates using natural language processing and machine learning techniques such as TF-IDF, logistic regression, k-nearest neighbors, CNNs, support vector machines, and more.
With the advancement of technology and the growing skill set of young people, many questions arise in their minds, and they turn to platforms like Quora to find answers. Quora is one of the most useful and comprehensive platforms for this kind of question answering. People ask questions and connect with many different people around the globe to get their queries answered. The platform is also used by many researchers, students, and experts to deepen their knowledge in their respective domains and to introduce new and useful ideas.
Nearly 100 million people visit such platforms every month to get their queries answered. With this enormous number of users, there is a real possibility that questions are repeated or duplicated, i.e. the questions asked are similar in syntax or semantics. As a result, answer seekers have to spend much of their time finding the answer to their question and then selecting the best answer among them. To tackle this problem, we use several natural language processing and machine learning techniques, such as TF-IDF based cosine similarity, logistic regression, SVMs, and deep learning approaches, to detect similarity between the questions asked.
The dataset is taken from the Kaggle competition "Quora Question Pairs". The goal is to predict which of the provided question pairs contain two questions with the same meaning, i.e. to check whether the two questions are duplicates of each other. The dataset comprises 6 columns: id, qid1, qid2, question1, question2, and is_duplicate.
- id: id of each question pair
- qid1: id of first question of the pair
- qid2: id of second question of the pair
- question1: content of first question
- question2: content of second question
- is_duplicate: label 0 for not similar, 1 for similar
The dataset comprises 404,290 training question pairs along with 2,345,795 test question pairs. Some question pairs contain NaN values. Among the 404,290 training pairs, nearly 63% are non-duplicate whereas about 36% are duplicate. The class distribution of the provided dataset can be seen in the figure below.
The percentage of questions that are similar to each other is:
The number of unique questions in the corpus are:
The histogram depicting the occurrences of questions along with the count of questions is shown below.
Word Cloud shows the most common words that are used in this data set. The words ‘difference’, ‘best’, ‘way’, ‘use’, ‘one’, ‘will’ etc. are highly used in this data set.
The normalized character count in all the questions of the data set is shown below.
This histogram shows that most questions have 35–40 characters. The maximum character count in this dataset is around 300. The histogram caps the upper limit at 200 characters, since questions with more than 200 characters are very rare.
Various preprocessing steps have been applied to make the dataset easier to work with:
- Both question columns are tokenised using nltk.word_tokenize.
- Text normalisation is done by converting all text to lower case.
- Then stop words and punctuation marks are removed from the text.
- Words are stemmed using the Porter stemmer. Stemming returns the root form of a word after removing inflections from it.
- Lemmatization is then performed using the WordNet lemmatizer. It also reduces a word to its root or base form, but takes the context of the word into account.
- We have removed data points with NaN values from the dataset.
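The steps above can be sketched as follows. This is a simplified, dependency-free sketch: the actual pipeline uses nltk's word_tokenize, PorterStemmer, and WordNetLemmatizer, whereas the tiny stop-word list and the `crude_stem` helper here are illustrative stand-ins.

```python
import re
import string

# Minimal stop-word list for illustration; the real pipeline uses nltk's list.
STOP_WORDS = {"the", "a", "an", "is", "are", "what", "how", "of", "to", "in", "and"}

def crude_stem(token: str) -> str:
    """Very rough suffix stripping, standing in for nltk's PorterStemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text: str) -> list[str]:
    # 1. Normalise case.
    text = text.lower()
    # 2. Tokenise (nltk.word_tokenize in the actual pipeline).
    tokens = re.findall(r"[a-z0-9']+", text)
    # 3. Drop stop words and punctuation, then 4. stem each surviving token.
    return [crude_stem(t) for t in tokens
            if t not in STOP_WORDS and t not in string.punctuation]

print(preprocess("What is the best way of learning Python?"))
```

Running this on a sample question yields a short list of normalised root words, which is the form the feature-extraction stage consumes.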
Feature extraction is the process of deriving useful features or attributes from the given dataset, which facilitates training accurately and predicting the test labels efficiently. The provided dataset contains only the question pairs themselves, so we need to extract new, useful features in order to apply multi-dimensional models. Some of the features extracted are as follows:
- Cosine similarity score :- a measure that compares the two texts as vectors in the inner product space.
- Jaccard similarity between two texts :- compares the two texts by counting how many common words they share relative to the total distinct words in the question pair.
- Euclidean distance between two texts :- the straight-line distance between the two texts' vector representations; it is 0 when both texts are identical and grows as the texts differ.
- Edit distance between two texts :- the minimum total cost of transforming one text into the other, counting a cost of 1 for each insertion, deletion, and replacement.
- Length of Question 1 :- the total length of question 1, including punctuation marks and white spaces.
- Length of Question 2 :- the total length of question 2, including punctuation marks and white spaces.
- Difference in Length of two Questions :- Difference between the length of both the questions.
- Number of characters in Question 1 :- Distinct or unique number of characters in question 1 excluding white spaces.
- Number of characters in Question 2 :- Distinct or unique number of characters in question 2 excluding white spaces.
- Number of words in Question 1 :- Total number of words in question 1 including the repeated words.
- Number of words in Question 2 :- Total number of words in question 2 including the repeated words.
- Number of common words in both the Questions:- Total number of Common words between both the Questions.
- QRatio :- the quick ratio comparison of the two question strings, with values ranging from 0 to 100. A higher score denotes more similarity between the questions.
- WRatio :- the weighted ratio, which applies several different matching algorithms and returns the best resulting score. Its value also ranges from 0 to 100.
- Partial Ratio :- computes the best partial-match score against all substrings of the longer string and returns the best score, again ranging from 0 to 100.
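A few of these features can be computed with the standard library alone. The sketch below assumes whitespace-tokenised questions; the 0–100 ratio uses difflib's SequenceMatcher, which is the same matcher the fuzzy string-matching ratios above are typically built on, rather than the exact QRatio/WRatio implementations.

```python
from difflib import SequenceMatcher

def jaccard(q1: str, q2: str) -> float:
    """Common words over total distinct words in the pair."""
    w1, w2 = set(q1.lower().split()), set(q2.lower().split())
    return len(w1 & w2) / len(w1 | w2) if w1 | w2 else 0.0

def simple_ratio(q1: str, q2: str) -> float:
    """0-100 similarity ratio based on difflib's SequenceMatcher."""
    return 100.0 * SequenceMatcher(None, q1.lower(), q2.lower()).ratio()

def extract_features(q1: str, q2: str) -> dict:
    return {
        "len_q1": len(q1),
        "len_q2": len(q2),
        "len_diff": abs(len(q1) - len(q2)),
        "words_q1": len(q1.split()),
        "words_q2": len(q2.split()),
        "common_words": len(set(q1.lower().split()) & set(q2.lower().split())),
        "jaccard": jaccard(q1, q2),
        "ratio": simple_ratio(q1, q2),
    }

feats = extract_features("How do I learn Python?", "How can I learn Python?")
```

Each question pair then becomes one feature vector (one dict like `feats`) that the classifiers in the later sections consume.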
Semantic similarity is a knowledge-driven approach that finds the degree to which two sentences/questions are similar according to a given semantic network. WordNet, accessed through NLTK (the Natural Language Toolkit), is the most popular semantic network for measuring such similarity between sentences. The similarity measure varies from 0 to 1, and sentences/Quora questions that are similar in meaning are assigned a score close to 1.
Resnik Similarity
It is a knowledge/meaning-based similarity measure that calculates the similarity between two words on the basis of their information content. The Resnik similarity score of two words is the information content of their least common subsumer, where information content is derived from the count (frequency) of a word or concept in a text corpus. Resnik similarity operates on the WordNet noun hierarchy.
Normalized Compression Distance(NCD)
Normalized Compression Distance (NCD) uses a compression algorithm (gzip here) to map strings to the lengths of their compressed byte representations, and measures the distance between two texts from those compressed lengths.
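A minimal gzip-based sketch of the standard NCD formula, NCD(x, y) = (C(xy) − min(C(x), C(y))) / max(C(x), C(y)), where C(s) is the compressed length of s:

```python
import gzip

def c(data: bytes) -> int:
    """Compressed length in bytes."""
    return len(gzip.compress(data))

def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance.
    Values near 0 mean very similar strings; larger values, more dissimilar."""
    cx, cy = c(x.encode()), c(y.encode())
    cxy = c((x + y).encode())
    return (cxy - min(cx, cy)) / max(cx, cy)

a = "How do I learn machine learning?"
b = "How can I learn machine learning?"
print(ncd(a, a), ncd(a, b))
```

Because gzip encodes the repeated second half of `a + a` as back-references, `ncd(a, a)` stays close to 0, while unrelated texts compress poorly together and score much higher.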
Latent Semantic Analysis(LSA)
It is a vectorial semantics technique used to find a similarity score between two texts. The underlying assumption is that words similar in meaning tend to co-occur in the same texts. In the LSA technique, a matrix is created with questions as rows and the vocabulary drawn from both questions as columns; each cell holds the count of that token in the particular question. The resulting matrix is very sparse, so its dimensionality is reduced using SciPy's linalg.svd and linalg.diagsvd functions. Cosine similarity is then applied to the reduced representations to obtain the similarity score between the two texts.
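A compact sketch of this pipeline on a single question pair, assuming NumPy is available; np.linalg.svd stands in for the SciPy linalg.svd/diagsvd calls described above, and k is a hypothetical choice of latent dimensions.

```python
import numpy as np

questions = [
    "how do i learn python programming",
    "what is the best way to learn python",
]

# Term-document count matrix: rows = questions, columns = vocabulary.
vocab = sorted(set(" ".join(questions).split()))
counts = np.array([[q.split().count(w) for w in vocab] for q in questions],
                  dtype=float)

# SVD factorises the sparse count matrix; keeping the top-k singular
# values gives the reduced latent-space representation of each question.
u, s, vt = np.linalg.svd(counts, full_matrices=False)
k = 2
reduced = u[:, :k] * s[:k]  # question vectors in latent space

# Cosine similarity between the two reduced question vectors.
a, b = reduced
cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(cos)
```

With only two documents the truncation is a no-op, but the same code applies unchanged when the matrix holds the full question corpus and k is much smaller than the vocabulary.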
To tackle the problem of classifying a given question pair as duplicate or non-duplicate, we first created features such as the cosine similarity score, Jaccard score, Euclidean distance, and edit distance, along with a basic feature set (difference in length of the two questions, difference in character counts, etc.) and various other syntactic and semantic similarity measures such as Resnik, NCD, and LSA. Using these features, we then trained various machine learning and deep learning algorithms to predict whether a pair of Quora questions is duplicate.
Tf-Idf based cosine similarity
The TF-IDF weighting describes the importance of a word to a document within the corpus. Cosine similarity is used here to compare the two questions in terms of meaning: if the cosine similarity is greater than 0.5, the question pair is labelled similar; otherwise they are considered distinct. TF is the ratio of the count of a term in a given question to the total number of terms in that question. IDF is the log of the ratio of the total number of questions to the number of questions in which the term appears.
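A standard-library sketch of these definitions on a tiny hypothetical corpus (on such a small corpus the absolute scores run low, so the 0.5 threshold from the model is not meaningful here; the point is the relative ordering):

```python
import math
from collections import Counter

corpus = [
    "how do i learn python",
    "how can i learn python fast",
    "what is the capital of france",
]

def tfidf_vector(doc: str, corpus: list[str]) -> dict[str, float]:
    words = doc.split()
    # TF: term count over total terms in this question.
    tf = {w: c / len(words) for w, c in Counter(words).items()}
    n = len(corpus)
    # IDF: log of total questions over questions containing the term.
    return {w: tf[w] * math.log(n / sum(1 for d in corpus if w in d.split()))
            for w in tf}

def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

v1 = tfidf_vector(corpus[0], corpus)
v2 = tfidf_vector(corpus[1], corpus)
v3 = tfidf_vector(corpus[2], corpus)
sim_dup = cosine(v1, v2)   # near-duplicate pair
sim_diff = cosine(v1, v3)  # unrelated pair
print(sim_dup, sim_diff)
```

The near-duplicate pair scores well above the unrelated pair, which shares no terms and so scores exactly zero.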
Accuracy: 67.25%
Jaccard similarity based prediction model
Jaccard similarity compares two texts by the number of common words relative to the total distinct words in the question pair. If the Jaccard similarity score is greater than 0.5, the pair is predicted to be similar; otherwise it is predicted to be distinct.
Accuracy: 64.60%
Logistic Regression
Binary logistic regression is used to train on the given data, with a learning rate of 0.001 and a decision-boundary threshold of 60.5. Stochastic gradient descent is used to update the weights, cross-entropy is used as the loss, and the data is split into 3 folds for training.
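The core update rule can be sketched in plain Python. This is a toy, single-feature illustration of SGD on the cross-entropy loss, not the report's actual setup: the hyperparameters, 3-fold split, and feature set above are replaced here by an invented one-column dataset and a larger learning rate chosen so the toy example converges quickly.

```python
import math
import random

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, lr=0.1, epochs=10000, seed=0):
    """Binary logistic regression fit with stochastic gradient descent,
    minimising the cross-entropy loss."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        i = rng.randrange(len(X))  # pick one random sample (SGD)
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, X[i])) + b)
        err = p - y[i]             # gradient of cross-entropy w.r.t. the logit
        w = [wj - lr * err * xj for wj, xj in zip(w, X[i])]
        b -= lr * err
    return w, b

def predict(w, b, x, threshold=0.5):
    return int(sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= threshold)

# Toy data: a single similarity-score feature, label = duplicate or not.
X = [[0.9], [0.8], [0.85], [0.1], [0.2], [0.15]]
y = [1, 1, 1, 0, 0, 0]
w, b = train_logreg(X, y)
```

After training, high-similarity inputs map to class 1 and low-similarity inputs to class 0, which is the decision rule the threshold implements.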
Mean testing accuracy: 65.03%
Naïve Bayes
Taking all the features created in the sections above, such as the Resnik similarity score, cosine similarity score, and NCD score, together with structural features (word counts, difference in length of the two questions, QRatio, WRatio, partial ratio, etc.), and normalizing them with StandardScaler, we trained scikit-learn's Gaussian Naive Bayes model and predicted the labels.
Decision Tree
Using the same feature set (Resnik, cosine, and NCD similarity scores plus the structural features), normalized with StandardScaler, we trained scikit-learn's Decision Tree model with random_state=0 and predicted the test labels.
Support Vector Machine(SVM)
Using the same feature set, normalized with StandardScaler, we trained scikit-learn's SVC (the SVM classifier) and predicted the test labels.
K-Nearest Neighbor(KNN)
Using the same feature set, normalized with StandardScaler, we trained scikit-learn's KNeighborsClassifier with the hyperparameter k set to 10 and predicted the test labels.
Extra Tree Classifier
Using the same feature set, normalized with StandardScaler, we trained scikit-learn's ExtraTreeClassifier model and predicted the test labels.
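The shared scale-then-classify pattern across these models can be sketched with a single scikit-learn pipeline. The feature matrix here is a small synthetic stand-in for the engineered features (its columns and the label rule are invented for illustration), and the train/test split replaces the report's actual evaluation setup.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the engineered feature matrix: each row is one
# question pair, columns could be e.g. [cosine_sim, jaccard, qratio].
rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)  # synthetic "is_duplicate" label

models = {
    "naive_bayes": GaussianNB(),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "svm": SVC(),
    "knn": KNeighborsClassifier(n_neighbors=10),
    "extra_tree": ExtraTreeClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    # StandardScaler normalizes each feature before the classifier sees it.
    pipe = make_pipeline(StandardScaler(), model)
    pipe.fit(X[:150], y[:150])
    scores[name] = pipe.score(X[150:], y[150:])
print(scores)
```

Wrapping the scaler and classifier in one pipeline ensures the scaling statistics are fit only on the training fold, avoiding leakage into the test predictions.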