Question Pair Similarity

Over 100 million people visit Quora every month, so it’s no surprise that many people ask similarly worded questions. Multiple questions with the same intent can cause seekers to spend more time finding the best answer to their question, and make writers feel they need to answer multiple versions of the same question. Quora values canonical questions because they provide a better experience to active seekers and writers, and offer more value to both of these groups in the long term.

Credits: Kaggle

To learn more about this project, check out my blog

Give it a try here

General Information

Identify which questions asked on Quora are duplicates of the questions that have already been asked. This could be useful to instantly provide answers. We are tasked with predicting whether a pair of questions are duplicates or not.

Constraints

The cost of a mis-classification can be very high.
No strict latency concerns.
Interpretability is partially important.

Metrics

Log loss
Confusion Matrix

Data Overview

Data has 6 columns and a total of 4,04,287 entries.

id - Id
qid1 - Id corresponding to question1
qid2 - Id corresponding to question2
question1 - Question 1
question2 - Question 2
is_duplicate - Determines whether a pair is duplicate or not

Libraries

Application Framework - flask, wsgiref
Data processing and ML - numpy, pandas, matplotlib, sklearn, xgboost, seaborn, spacy, nltk, contractions, fuzzywuzzy, distance, optuna
General operations - os, pickle, re

Outputs

screenshot1 screenshot2 screenshot3

Setup

Clone this repo using

git clone https://github.com/Anil-45/Question_Pair_Similarity.git

Install the required modules using

pip install -r requirements.txt

Usage

Run the following command to start the application

python app.py

Access the application

For training yourself, download the data and place it in data folder

Find the optimal parameters by running tune.py

python tune.py

Train the model from optimal parameters

python train.py

For predictions, place the test.csv in data folder

python predict.py

Contact

Created by @Anil_Reddy

General Information#

Constraints#

Metrics#

Data Overview#

Libraries#

Outputs#

Setup#

Usage#

Contact#

References#