Introduction

Over 100 million people visit Quora every month, so it’s no surprise that many people ask similarly worded questions. Multiple questions with the same intent can cause seekers to spend more time finding the best answer to their question, and make writers feel they need to answer multiple versions of the same question. Quora values canonical questions because they provide a better experience to active seekers and writers, and offer more value to both of these groups in the long term.

Credits: Kaggle

Project page

Give it a try here. It might take some time initially to wake up, as it is running on a free tier.

General Information

The task is to identify which questions asked on Quora are duplicates of questions that have already been asked, which could be used to instantly provide answers to previously answered questions. We are tasked with predicting whether a pair of questions is a duplicate or not.

Constraints

  • The cost of a misclassification can be very high. If a non-duplicate question pair is predicted as a duplicate, it will hurt the user experience.
  • No strict latency concerns.
  • Interpretability is partially important.

Metrics

  • Log loss
  • Confusion Matrix
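
Both metrics are available in scikit-learn. A minimal sketch with made-up labels and predicted probabilities (the arrays below are illustrative, not project data):

```python
import numpy as np
from sklearn.metrics import log_loss, confusion_matrix

# Hypothetical ground truth and predicted P(is_duplicate = 1) for five pairs
y_true = np.array([0, 1, 1, 0, 1])
y_prob = np.array([0.10, 0.80, 0.65, 0.30, 0.90])

print(log_loss(y_true, y_prob))          # penalizes confident wrong probabilities
preds = (y_prob >= 0.5).astype(int)      # hard labels at a 0.5 threshold
print(confusion_matrix(y_true, preds))   # [[TN, FP], [FN, TP]]
```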

Data Overview

Data has 6 columns and a total of 404,287 entries.

  • id - Id
  • qid1 - Id corresponding to question1
  • qid2 - Id corresponding to question2
  • question1 - Question 1
  • question2 - Question 2
  • is_duplicate - Indicates whether the question pair is a duplicate or not

Data Analysis

  • Target label distribution

    We can observe that the data is imbalanced: around 64% of the question pairs are non-duplicates and 36% are duplicates.

  • Number of unique and repeated questions

    As expected, the number of repeated questions is considerably smaller than the number of unique questions.

  • Number of occurrences of each question

    One question occurs 157 times; as expected, most questions occur only a handful of times.

  • Wordcloud for duplicate questions

  • Wordcloud for non duplicate questions
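
The class balance and question-frequency statistics above can be reproduced with a few lines of pandas, assuming the Kaggle train.csv with the columns listed under Data Overview:

```python
import pandas as pd

df = pd.read_csv("train.csv")

# Target label distribution (~64% non-duplicate, ~36% duplicate)
print(df["is_duplicate"].value_counts(normalize=True))

# Unique vs. repeated questions across both columns
qids = pd.concat([df["qid1"], df["qid2"]])
counts = qids.value_counts()
print("unique questions:", counts.size)
print("repeated questions:", (counts > 1).sum())
print("max occurrences of a single question:", counts.max())
```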

Preprocessing

  • Convert to lower case
  • Strip HTML tags
  • Remove URLs
  • Remove empty lines and extra spaces
  • Remove accented characters like ë, õ
  • Expand contractions like he’d, she’d
  • Remove characters other than alphabets and digits
  • Stopword removal (got better results without removing)
  • Stemming (got better results without it); lemmatization was not tried as it is slow
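
A minimal sketch of the cleaning pipeline described above, assuming the bs4 and contractions packages are installed; stopword removal and stemming are left out since they hurt results here:

```python
import re
import unicodedata

import contractions                      # expands "he'd" -> "he would"
from bs4 import BeautifulSoup            # strips HTML tags

def preprocess(text: str) -> str:
    text = str(text).lower()                                    # lower case
    text = BeautifulSoup(text, "html.parser").get_text()        # strip HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)          # remove URLs
    # remove accented characters like ë, õ
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    text = contractions.fix(text)                               # expand contractions
    text = re.sub(r"[^a-z0-9\s]", " ", text)                    # keep letters and digits only
    return re.sub(r"\s+", " ", text).strip()                    # collapse extra whitespace

print(preprocess("He'd read <b>FAQs</b> at https://quora.com in a café"))
# -> "he would read faqs at in a cafe"
```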

Feature Extraction

  • Basic Features

    • q1_len = length of q1
    • q2_len = length of q2
    • diff_len = absolute difference of length of q1 and q2
    • avg_len = (length of q1 + length of q2)/2
    • freq_qid1 = Frequency of qid1’s
    • freq_qid2 = Frequency of qid2’s
    • freq_q1+freq_q2 = sum of the frequencies of qid1 and qid2
    • freq_q1-freq_q2 = absolute difference of the frequencies of qid1 and qid2
    • q1_n_words = number of words in q1
    • q2_n_words = number of words in q2
    • diff_word = absolute difference of number of words in q1 and q2
    • avg_words = (number of words in q1 + number of words in q2)/2
    • first_same = whether the first word of both questions is the same
    • last_same = whether the last word of both questions is the same
    • word_common = number of common unique words in q1 and q2
    • word_total = total number of words in q1 + total number of words in q2
    • word_share = word_common/word_total
  • Advanced Features

    • cnsc_min = (common non stop words count)/min(length of q1 non stopwords, length q2 non stopwords)
    • cnsc_max = (common non stop words count)/max(length of q1 non stopwords, length q2 non stopwords)
    • csc_min = (common stop words count) / min(length q1 stopwords, length q2 stopwords)
    • csc_max = (common stop words count) / max(length q1 stopwords, length q2 stopwords)
    • ctc_min = (common tokens count) / min(length q1 tokens, length q2 tokens)
    • ctc_max = (common tokens count) / max(length q1 tokens, length q2 tokens)
    • fuzz_qratio - refer blog
    • fuzz_partial_ratio - refer blog
    • token_set_ratio - refer blog
    • token_sort_ratio - refer blog
    • longest_substr_ratio = len(longest common substring) / min(length of q1, length of q2)
  • Distance Features

    Obtained word2vec embeddings from the spacy library and computed the following features (a combined sketch of the basic, advanced, and distance features appears after this list):

    • cosine_distance
    • cityblock_distance
    • jaccard_distance
    • canberra_distance
    • euclidean_distance
    • minkowski_distance
    • braycurtis_distance
    • skew_q1vec
    • skew_q2vec
    • kur_q1vec
    • kur_q2vec
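
A condensed sketch of how a few features from each group above could be computed. It assumes the NLTK stopword list has been downloaded and a spaCy model with word vectors (e.g. en_core_web_lg) is installed; EPS is a small constant added here to avoid division by zero:

```python
import spacy
from fuzzywuzzy import fuzz
from nltk.corpus import stopwords                 # requires nltk.download("stopwords")
from scipy.spatial.distance import cityblock, cosine
from scipy.stats import kurtosis, skew

nlp = spacy.load("en_core_web_lg")                # vectors back the distance features
STOPS = set(stopwords.words("english"))
EPS = 1e-6                                        # guards against division by zero

def pair_features(q1: str, q2: str) -> dict:
    w1, w2 = set(q1.split()), set(q2.split())     # unique tokens
    s1, s2 = w1 & STOPS, w2 & STOPS               # stopwords per question
    n1, n2 = w1 - STOPS, w2 - STOPS               # non-stopwords per question
    v1, v2 = nlp(q1).vector, nlp(q2).vector       # averaged word embeddings

    return {
        "q1_len": len(q1),
        "diff_len": abs(len(q1) - len(q2)),
        "word_share": len(w1 & w2) / (len(w1) + len(w2) + EPS),
        "cnsc_min": len(n1 & n2) / (min(len(n1), len(n2)) + EPS),
        "csc_max": len(s1 & s2) / (max(len(s1), len(s2)) + EPS),
        "ctc_min": len(w1 & w2) / (min(len(w1), len(w2)) + EPS),
        "fuzz_qratio": fuzz.QRatio(q1, q2),
        "token_sort_ratio": fuzz.token_sort_ratio(q1, q2),
        "cosine_distance": cosine(v1, v2),
        "cityblock_distance": cityblock(v1, v2),
        "skew_q1vec": skew(v1),
        "kur_q1vec": kurtosis(v1),
    }

print(pair_features("how do i learn python", "what is the best way to learn python"))
```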

Visualizations of some new features

  • word_share

    The two class distributions overlap somewhat, but word_share still provides a usable signal for separating dissimilar questions.

  • word_common

    The two class distributions almost completely overlap, so word_common on its own is not very discriminative.

  • token_sort_ratio

    Distribution of token_sort_ratio for duplicate and non-duplicate pairs (plot).

Modelling

  • Random model gives a log loss of 0.887699
  • As log loss depends on the predicted probability values rather than just their ordering, even a constant prediction sets a useful baseline: predicting 0.5 for every pair gives a log loss of ~0.69, and predicting the class prior of 0.36 gives ~0.65
  • XGBoost gave the best results, with a log loss of 0.34
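
A sketch of the constant baseline and the XGBoost model; X and y stand for the feature matrix and is_duplicate labels built in the previous section (names assumed), and the hyperparameters are illustrative:

```python
import numpy as np
import xgboost as xgb
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Constant baseline: predict the class prior (~0.36) for every pair
prior = np.full(len(y_test), y_train.mean())
print("baseline log loss:", log_loss(y_test, prior))   # ~0.65

# Gradient-boosted trees on the hand-crafted features
model = xgb.XGBClassifier(n_estimators=400, max_depth=6, eval_metric="logloss")
model.fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]
print("xgboost log loss:", log_loss(y_test, proba))
```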

Outputs

  • Screenshots of sample predictions: output_1, output_2, output_3

Competition

  • To score better in the Quora competition, we can use graph-based features, i.e. treat every question as a node, find all of its neighbors, and store them in an adjacency list (see the sketch after this list). Combining both train and test sets for this purpose yields much better results, but in practice we should not touch the test set, since the purpose of a model is to learn from the train data and generalize to unseen data.
  • Check out my notebook, which uses graph features and glove.840B.300d embeddings with an LSTM to reach a log loss of 0.189
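
A sketch of what such graph features could look like; df is assumed to be the dataframe of question pairs (train only, or train plus test for the competition variant), and the feature names are illustrative rather than taken from the notebook:

```python
from collections import defaultdict

# Each question is a node; every observed pair adds an edge in both directions
neighbors = defaultdict(set)
for q1, q2 in zip(df["question1"], df["question2"]):
    neighbors[q1].add(q2)
    neighbors[q2].add(q1)

# Degree of each question and the overlap of their neighborhoods; a large
# common-neighbor count suggests the two questions concern the same topic
df["q1_degree"] = df["question1"].map(lambda q: len(neighbors[q]))
df["q2_degree"] = df["question2"].map(lambda q: len(neighbors[q]))
df["common_neighbors"] = [
    len(neighbors[q1] & neighbors[q2])
    for q1, q2 in zip(df["question1"], df["question2"])
]
```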

Source code

References