Introduction

Over 100 million people visit Quora every month, so it’s no surprise that many people ask similarly worded questions. Multiple questions with the same intent can cause seekers to spend more time finding the best answer to their question, and make writers feel they need to answer multiple versions of the same question. Quora values canonical questions because they provide a better experience to active seekers and writers, and offer more value to both of these groups in the long term.

Credits: Kaggle

Project page

Give it a try here. It might take some time initially to wake up, as it is running on a free tier.

General Information

The task is to identify which questions asked on Quora are duplicates of questions that have already been asked, which could be used to instantly provide answers to previously answered questions. We are tasked with predicting whether a pair of questions is a duplicate or not.

Constraints

  • The cost of a misclassification can be very high. If a non-duplicate question pair is predicted as a duplicate, it will hurt the user experience.
  • No strict latency concerns.
  • Interpretability is partially important.

Metrics

  • Log loss
  • Confusion Matrix
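
Both metrics are available in scikit-learn. A minimal sketch with made-up labels and predicted probabilities (the arrays below are illustrative, not project data):

```python
import numpy as np
from sklearn.metrics import log_loss, confusion_matrix

# Hypothetical ground truth and predicted P(is_duplicate = 1) for five pairs
y_true = np.array([0, 1, 1, 0, 1])
y_prob = np.array([0.10, 0.80, 0.65, 0.30, 0.90])

print(log_loss(y_true, y_prob))          # penalizes confident wrong probabilities
preds = (y_prob >= 0.5).astype(int)      # hard labels at a 0.5 threshold
print(confusion_matrix(y_true, preds))   # [[TN, FP], [FN, TP]]
```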

Data Overview

Data has 6 columns and a total of 404,287 entries.

  • id - Id
  • qid1 - Id corresponding to question1
  • qid2 - Id corresponding to question2
  • question1 - Question 1
  • question2 - Question 2
  • is_duplicate - Indicates whether the question pair is a duplicate or not

Data Analysis

  • Target label distribution

    We can observe that the data is imbalanced: around 64% of the question pairs are non-duplicates and 36% are duplicates.

  • Number of unique and repeated questions

    As expected, the number of repeated questions is considerably smaller than the number of unique questions.

  • Number of occurrences of each question

    One question occurs 157 times; as expected, most questions occur only a handful of times.

  • Wordcloud for duplicate questions

  • Wordcloud for non duplicate questions
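
The class balance and question-frequency statistics above can be reproduced with a few lines of pandas, assuming the Kaggle train.csv with the columns listed under Data Overview:

```python
import pandas as pd

df = pd.read_csv("train.csv")

# Target label distribution (~64% non-duplicate, ~36% duplicate)
print(df["is_duplicate"].value_counts(normalize=True))

# Unique vs. repeated questions across both columns
qids = pd.concat([df["qid1"], df["qid2"]])
counts = qids.value_counts()
print("unique questions:", counts.size)
print("repeated questions:", (counts > 1).sum())
print("max occurrences of a single question:", counts.max())
```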

Preprocessing

  • Convert to lower case
  • Strip HTML tags
  • Remove URLs
  • Remove empty lines and extra spaces
  • Remove accented characters like ë, õ
  • Expand contractions like he’d, she’d
  • Remove characters other than alphabets and digits
  • Stopword removal (got better results without removing)
  • Stemming (got better results without it); lemmatization was not tried as it is slow
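
A minimal sketch of the cleaning pipeline described above, assuming the bs4 and contractions packages are installed; stopword removal and stemming are left out since they hurt results here:

```python
import re
import unicodedata

import contractions                      # expands "he'd" -> "he would"
from bs4 import BeautifulSoup            # strips HTML tags

def preprocess(text: str) -> str:
    text = str(text).lower()                                    # lower case
    text = BeautifulSoup(text, "html.parser").get_text()        # strip HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)          # remove URLs
    # remove accented characters like ë, õ
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    text = contractions.fix(text)                               # expand contractions
    text = re.sub(r"[^a-z0-9\s]", " ", text)                    # keep letters and digits only
    return re.sub(r"\s+", " ", text).strip()                    # collapse extra whitespace

print(preprocess("He'd read <b>FAQs</b> at https://quora.com in a café"))
# -> "he would read faqs at in a cafe"
```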

Feature Extraction

  • Basic Features

    • q1_len = length of q1
    • q2_len = length of q2
    • diff_len = absolute difference of length of q1 and q2
    • avg_len = (length of q1 + length of q2)/2
    • freq_qid1 = Frequency of qid1’s
    • freq_qid2 = Frequency of qid2’s
    • freq_q1+freq_q2 = sum of the frequencies of qid1 and qid2
    • freq_q1-freq_q2 = absolute difference of the frequencies of qid1 and qid2
    • q1_n_words = number of words in q1
    • q2_n_words = number of words in q2
    • diff_word = absolute difference of number of words in q1 and q2
    • avg_words = (number of words in q1 + number of words in q2)/2
    • first_same = whether the first word of both questions is the same
    • last_same = whether the last word of both questions is the same
    • word_common = number of common unique words in q1 and q2
    • word_total = total number of words in q1 + total number of words in q2
    • word_share = word_common/word_total
  • Advanced Features

    • cnsc_min = (common non stop words count)/min(length of q1 non stopwords, length q2 non stopwords)
    • cnsc_max = (common non stop words count)/max(length of q1 non stopwords, length q2 non stopwords)
    • csc_min = (common stop words count) / min(length q1 stopwords, length q2 stopwords)
    • csc_max = (common stop words count) / max(length q1 stopwords, length q2 stopwords)
    • ctc_min = (common tokens count) / min(length q1 tokens, length q2 tokens)
    • ctc_max = (common tokens count) / max(length q1 tokens, length q2 tokens)
    • fuzz_qratio - refer blog
    • fuzz_partial_ratio - refer blog
    • token_set_ratio - refer blog
    • token_sort_ratio - refer blog
    • longest_substr_ratio = len(longest common substring) / min(length of q1, length of q2)
  • Distance Features

    Obtained word2vec embeddings from the spacy library and computed the following features (a combined sketch of the basic, advanced, and distance features appears after this list):

    • cosine_distance
    • cityblock_distance
    • jaccard_distance
    • canberra_distance
    • euclidean_distance
    • minkowski_distance
    • braycurtis_distance
    • skew_q1vec
    • skew_q2vec
    • kur_q1vec
    • kur_q2vec
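
A condensed sketch of how a few features from each group above could be computed. It assumes the NLTK stopword list has been downloaded and a spaCy model with word vectors (e.g. en_core_web_lg) is installed; EPS is a small constant added here to avoid division by zero:

```python
import spacy
from fuzzywuzzy import fuzz
from nltk.corpus import stopwords                 # requires nltk.download("stopwords")
from scipy.spatial.distance import cityblock, cosine
from scipy.stats import kurtosis, skew

nlp = spacy.load("en_core_web_lg")                # vectors back the distance features
STOPS = set(stopwords.words("english"))
EPS = 1e-6                                        # guards against division by zero

def pair_features(q1: str, q2: str) -> dict:
    w1, w2 = set(q1.split()), set(q2.split())     # unique tokens
    s1, s2 = w1 & STOPS, w2 & STOPS               # stopwords per question
    n1, n2 = w1 - STOPS, w2 - STOPS               # non-stopwords per question
    v1, v2 = nlp(q1).vector, nlp(q2).vector       # averaged word embeddings

    return {
        "q1_len": len(q1),
        "diff_len": abs(len(q1) - len(q2)),
        "word_share": len(w1 & w2) / (len(w1) + len(w2) + EPS),
        "cnsc_min": len(n1 & n2) / (min(len(n1), len(n2)) + EPS),
        "csc_max": len(s1 & s2) / (max(len(s1), len(s2)) + EPS),
        "ctc_min": len(w1 & w2) / (min(len(w1), len(w2)) + EPS),
        "fuzz_qratio": fuzz.QRatio(q1, q2),
        "token_sort_ratio": fuzz.token_sort_ratio(q1, q2),
        "cosine_distance": cosine(v1, v2),
        "cityblock_distance": cityblock(v1, v2),
        "skew_q1vec": skew(v1),
        "kur_q1vec": kurtosis(v1),
    }

print(pair_features("how do i learn python", "what is the best way to learn python"))
```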

Visualizations of some new features

  • word_share

    The two class distributions overlap somewhat, but word_share still provides a usable signal for separating dissimilar questions.

  • word_common

    The two class distributions almost completely overlap, so word_common on its own is not very discriminative.

  • token_sort_ratio

    Distribution of token_sort_ratio for duplicate and non-duplicate pairs (plot).

Modelling

  • Random model gives a log loss of 0.887699
  • As log loss depends on the predicted probability values rather than just their ordering, even a constant prediction sets a useful baseline: predicting 0.5 for every pair gives a log loss of ~0.69, and predicting the class prior of 0.36 gives ~0.65
  • XGBoost gave the best results, with a log loss of 0.34
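
A sketch of the constant baseline and the XGBoost model; X and y stand for the feature matrix and is_duplicate labels built in the previous section (names assumed), and the hyperparameters are illustrative:

```python
import numpy as np
import xgboost as xgb
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Constant baseline: predict the class prior (~0.36) for every pair
prior = np.full(len(y_test), y_train.mean())
print("baseline log loss:", log_loss(y_test, prior))   # ~0.65

# Gradient-boosted trees on the hand-crafted features
model = xgb.XGBClassifier(n_estimators=400, max_depth=6, eval_metric="logloss")
model.fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]
print("xgboost log loss:", log_loss(y_test, proba))
```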

Outputs

  • Screenshots of sample predictions: output_1, output_2, output_3

Competition

  • To score better in the Quora competition, we can use graph-based features, i.e. treat every question as a node, find all of its neighbors, and store them in an adjacency list (see the sketch after this list). Combining both train and test sets for this purpose yields much better results, but in practice we should not touch the test set, since the purpose of a model is to learn from the train data and generalize to unseen data.
  • Check out my notebook, which uses graph features and glove.840B.300d embeddings with an LSTM to reach a log loss of 0.189
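
A sketch of what such graph features could look like; df is assumed to be the dataframe of question pairs (train only, or train plus test for the competition variant), and the feature names are illustrative rather than taken from the notebook:

```python
from collections import defaultdict

# Each question is a node; every observed pair adds an edge in both directions
neighbors = defaultdict(set)
for q1, q2 in zip(df["question1"], df["question2"]):
    neighbors[q1].add(q2)
    neighbors[q2].add(q1)

# Degree of each question and the overlap of their neighborhoods; a large
# common-neighbor count suggests the two questions concern the same topic
df["q1_degree"] = df["question1"].map(lambda q: len(neighbors[q]))
df["q2_degree"] = df["question2"].map(lambda q: len(neighbors[q]))
df["common_neighbors"] = [
    len(neighbors[q1] & neighbors[q2])
    for q1, q2 in zip(df["question1"], df["question2"])
]
```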

Source code

References