2.1 Understanding the train data:
1. id: unique identifier of the row.
2. query: the search query text.
3. product_title: the title of the product.
4. product_description: the text of the product description.
5. median_relevance: median relevance score given by three raters.
6. relevance_variance: variance of the relevance scores given by those three raters.
Let us visualize the distribution of the target values:
Observation:
- The data is very imbalanced: most of the datapoints belong to class 4, and very few belong to class 1.
2.2 Check for NaN values.
import missingno
ax = missingno.matrix(df)
print('Percentage of points that have missing values', round(df.product_description.isna().sum()/df.shape[0]*100, 2), '%')
Percentage of points that have missing values 24.06 %
# I cannot just drop 24.06% of the points randomly.
# I will fill the missing values with an empty string.
df['product_description'] = df['product_description'].fillna('')
2.3 Data Analysis of the feature ‘Query’
Let us check whether any text repeats in the query column:
df['query'].value_counts()
Visualize the lengths of the query sentences:
Analysis of Query text:
- Total of 261 unique queries → we can apply categorical encoding to this column.
- Most queries are repeated in the range of 30 to 50 times.
- "Wireless mouse" is the most common query, occurring 113 times, while "Dollhouse bathtub" occurs the least, only 8 times.
2.4 Data Analysis of the feature ‘Product_title’
Let us see some stats about the feature product_title
Visualizing the length of the sentences:
import matplotlib.pyplot as plt

df['product_title_len'] = df.product_title.str.len()  # stored for the plot below
plt.plot(sorted(df['product_title_len']))
Going one step deeper: does the length of the sentence have any relation to the target variable?
sns.histplot(data = df, x = 'product_title_len', y = 'median_relevance')
Analysis:
- The more common words the query and the product title share, the higher the chance of a strong rating.
- We cannot draw any conclusion from the product title length alone, as all the classes overlap.
- Simple text data → we can apply TFIDF, W2V, and TFIDF-W2V to this column.
2.5 EDA Product Description:
Let us visualize the lengths of the sentences
Going one step deeper,
import numpy as np

df['product_description_len'] = df.product_description.str.len()
start = 99.9
for i in range(10):
    print('{:.2f}th percentile value is {:.1f}'.format(start, np.percentile(df.product_description_len, start)))
    start += 0.01
out = sum(df.product_description_len > 10**3)
print('{} product descriptions have length > 1000, i.e. greater than the 99.96th percentile'.format(out))
When we look at the count of common words between product_description and product_title, the median_relevance is almost evenly distributed even when there are more than 40 common words.
Note: The number of datapoints for class 4 is very high, so looking at the above graph and concluding that more common words between product_title and product_description means better relevance would not be right, as the counts do not differ much between the classes.
2.6 Checking for HTML values:
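check_html is a helper whose definition is not shown here; a minimal sketch of what it might look like, using a simple regex heuristic for detecting tags (the original also timed each call):
import re
from datetime import datetime

TAG_RE = re.compile(r'<[^>]+>')

def check_html(column):
    # Flag rows whose text contains anything that looks like an HTML tag.
    start = datetime.now()
    n_rows = column.astype(str).str.contains(TAG_RE).sum()
    print('Time taken to run is', datetime.now() - start)
    print('Number of rows with html code', n_rows)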
check_html(df['query'])
# Output
Time taken to run is 0:00:00.861348
Number of rows with html code 0

check_html(df['product_title'])
# Output
Time taken to run is 0:00:01.067391
Number of rows with html code 0

check_html(df['product_description'])
# Output
Time taken to run is 0:00:00.730124
Number of rows with html code 94
It is important to check for the presence of HTML tags in text data, and more importantly to find out which columns contain them, so that we remove HTML tags only from those particular columns. If you are working on a large dataset, this will definitely save some time.
2.7 Understanding the test data:
The test data given by Kaggle is large: it has 22,513 datapoints, and it might contain many artificial rows that are simply ignored during evaluation. There is no relevance_variance column in the test data.
3. Performance metric:
Here the target variable is ordinal in nature. We cannot use simple classification metrics like accuracy or F1-score, as shown below:
Accuracy rewards only exact class labels, so a prediction that is close to the actual label scores no better than one that is far off.
But the kappa score gives better results when the predicted classes are close to the actual labels. For more details on how to implement the quadratic weighted kappa score, I recommend checking this nice Kaggle kernel.
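To see why, here is a toy comparison (my own numbers, not the cases from the figure): two sets of predictions with identical accuracy get very different quadratic kappa scores depending on how close they are to the truth.
from sklearn.metrics import accuracy_score, cohen_kappa_score

y_true = [1, 2, 3, 4, 2, 3]
y_near = [2, 3, 4, 3, 1, 2]  # every prediction off by exactly one class
y_far = [4, 4, 1, 1, 4, 1]   # predictions far away from the true classes

for name, y_pred in [('near', y_near), ('far', y_far)]:
    print(name,
          'accuracy:', accuracy_score(y_true, y_pred),
          'quadratic kappa:', round(cohen_kappa_score(y_true, y_pred, weights='quadratic'), 3))
# Both have accuracy 0.0, but quadratic kappa is ~0.455 for 'near' vs ~-0.789 for 'far'.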
You might be thinking: why not MSE? FYI, MSE in case 1 will return 7.5 and in case 2 will return 1. For improving the Kappa score you can do two things:
1. Build a regression model that reduces MSE; it has been observed that a lower MSE gives a better Kappa score. After training, we can round off to the nearest integer to get the class label.
Choosing thresholds:
As the regression output can take any real value, you can say everything below 0.5 is class 0, between 0.5 and 1.5 is class 1, between 1.5 and 2.5 is class 2, between 2.5 and 3.5 is class 3, and everything above 3.5 is class 4.
Thus the coefficients are: [0.5, 1.5, 2.5, 3.5]
These coefficients may or may not be optimal. Remember, we are trying to find the coefficients that bring the largest improvement to the evaluation metric. You can get the best coefficients using this kernel; a rounding sketch follows after this list.
2. Use the Kappa score directly as the loss function (this might be hard to do).
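A minimal rounding sketch using np.digitize with the coefficients above (the predictions here are made up for illustration):
import numpy as np

coefficients = [0.5, 1.5, 2.5, 3.5]
preds = np.array([0.3, 1.7, 2.4, 3.9, 2.6])  # hypothetical regression outputs
# np.digitize assigns each prediction to the bin bounded by the coefficients.
labels = np.digitize(preds, coefficients)
print(labels)  # -> [0 2 2 4 3]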
4. First cut approach:
Now that we know more about the data, the simplest approach is to convert the query into a vector using one-hot encoding, convert the text data (product_title and product_description) into numerical form, and fit a baseline model like logistic regression.
You can also try Linear regression and round off the results to the nearest integer.
5. Text Preprocessing:
5.1 Clean the HTML
Beautiful Soup has made life easy: with just two lines of code you can remove HTML tags from a string.
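A minimal sketch:
from bs4 import BeautifulSoup

def strip_html(text):
    # Parse the markup and keep only the visible text.
    return BeautifulSoup(text, 'html.parser').get_text(separator=' ')

strip_html('<p>Wireless <b>mouse</b> with USB receiver</p>')
# -> 'Wireless mouse with USB receiver' (up to extra whitespace)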
5.2 Word replacement:
Words with similar meaning are replaced with a common word, for example:
This list is taken from the winning solution of this competition; thanks to the author, who took the time to find these nuanced differences in the data.
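The idea looks roughly like this (the mappings below are placeholders; the actual list comes from the winner's solution):
# Hypothetical examples of the kind of replacements used.
replacements = {
    'tv': 'television',
    'hdtv': 'television',
    'refrigirator': 'refrigerator',
}

def replace_words(text):
    return ' '.join(replacements.get(word, word) for word in text.split())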
5.3 Stemming, decontraction and stop word removal:
These are common techniques you will use in almost any NLP problem; you can find the code below:
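A compact sketch of these steps, using NLTK's stemmer and stop word list (the decontraction map covers only a few common cases):
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def decontract(text):
    # Expand a handful of common English contractions.
    text = re.sub(r"won't", 'will not', text)
    text = re.sub(r"can't", 'can not', text)
    text = re.sub(r"n't", ' not', text)
    text = re.sub(r"'re", ' are', text)
    text = re.sub(r"'ll", ' will', text)
    text = re.sub(r"'ve", ' have', text)
    return text

def clean_text(text):
    text = decontract(text.lower())
    text = re.sub(r'[^a-z0-9\s]', ' ', text)  # remove punctuation marks
    words = [stemmer.stem(w) for w in text.split() if w not in stop_words]
    return ' '.join(words)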
5.4 Summarize the processing steps:
HTML tag removal → word replacement → stemming → decontraction → removing punctuation marks → stop word removal
Is preprocessing necessary, and what does it bring us?
Here is a glimpse of the percentage of datapoints where there is a 100% match between query and title, before and after preprocessing.
As you can notice, without preprocessing 41% of the datapoints have all the query words present in the title; after preprocessing this increases to 49% (among datapoints with a median relevance of 4).
6. Text vectorization:
Before starting this step make sure you have split the training data properly.
6.1 Handling Query data:
Make sure you replace all the spaces with '_' so that each query becomes a single category.
# Replacing spaces with underscores to make each query a single token.
df['cleaned_query'] = df['query'].str.replace(' ', '_', regex=False)
Code to do one-hot encoding on the cleaned_query column:
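A sketch using CountVectorizer in binary mode (X_train and X_test are assumed to come from the split mentioned earlier):
from sklearn.feature_extraction.text import CountVectorizer

# binary=True yields 0/1 features, one per unique (underscored) query.
vectorizer = CountVectorizer(binary=True)
query_ohe_train = vectorizer.fit_transform(X_train['cleaned_query'])
query_ohe_test = vectorizer.transform(X_test['cleaned_query'])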
If you do not replace spaces with underscores, then the count vectorizer:
1. Will treat each word as a separate feature. In the datapoint 'wireless mouse', we do not want 'wireless' and 'mouse' to be separate features; we want the combination to be treated as one unique value.
2. Increases the dimensionality of the data unnecessarily.
6.2 Handling product_title and product_description using TFIDF vectorizer
For both of these columns we will do simple TFIDF vectorization. Here is the code for product_title; you can follow the same approach for product_description.
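A sketch of the TFIDF step (the min_df value is my assumption, not necessarily the article's setting):
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(min_df=5)
title_tfidf_train = tfidf.fit_transform(X_train['product_title'])
title_tfidf_test = tfidf.transform(X_test['product_title'])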
6.3 Handling product_title and product_description using average W2V:
In this section we will create word-to-vector features for our text data; the pre-trained word embeddings can be downloaded from this link.
In the code below we first load the GloVe vectors; then, for each word in each sentence, if the word has an embedding in the GloVe file we fetch its vector (model.get(<word>)), and we average over the number of embeddings found.
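A minimal sketch of the averaging step, assuming the embeddings have already been loaded into a dict called model mapping word → 300-d vector:
import numpy as np

def avg_w2v(sentence, model, dim=300):
    # Average the embeddings of the words present in the GloVe file.
    vec, count = np.zeros(dim), 0
    for word in sentence.split():
        if word in model:
            vec += model[word]
            count += 1
    return vec / count if count else vec

title_avg_w2v = np.vstack([avg_w2v(s, model) for s in X_train['product_title']])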
Note: Make sure you do not apply stemming when using word vectors, because stemming can produce a different, non-English form of a word. For example, stem('batteries') gives 'batteri', and this token will not be present in the word-embedding file.
Any other solution?
Yes, you can use lemmatization to solve this problem, as it returns an actual English word.
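For example, with NLTK's WordNet lemmatizer (requires the wordnet data to be downloaded):
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('batteries'))  # -> 'battery', a valid English word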
6.4 Handling text data using TFIDF-W2V:
Instead of taking a simple average of the word embeddings, we multiply each embedding by the word's corresponding TFIDF value and then take the weighted average.
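A sketch of the weighted version; tfidf_weights is assumed to be a dict of word → IDF weight taken from a fitted TfidfVectorizer:
import numpy as np

def tfidf_w2v(sentence, model, tfidf_weights, dim=300):
    # Weight each word's embedding by its TFIDF value, then normalize.
    vec, weight_sum = np.zeros(dim), 0.0
    for word in sentence.split():
        if word in model and word in tfidf_weights:
            w = tfidf_weights[word]
            vec += w * model[word]
            weight_sum += w
    return vec / weight_sum if weight_sum else vec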
7. Advanced feature engineering.
Let C be all pairwise combinations of query, title, and description: (query, title), (title, description), (description, query).
- Total word counts in each of the text features.
- Common word counts in C.
- Unique common word counts in C.
- Normalized common word counts: divide the common word count by the larger of the two columns' word counts.
- Common bigram counts in C, unique bigram counts, and all possible ratios in the features of C.
- Common trigram counts in C, unique trigram counts, and all possible ratios in the features of C.
- Counts of digits in query, title, and description.
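A sketch of how a few of these features could be computed for one pair, say (query, title); the helper below is hypothetical:
def pair_features(s1, s2):
    words1, words2 = s1.split(), s2.split()
    set1, set2 = set(words1), set(words2)
    common = set1 & set2
    return {
        'common_words': sum(min(words1.count(w), words2.count(w)) for w in common),
        'unique_common_words': len(common),
        # Normalized by the larger of the two word counts.
        'norm_common_words': len(common) / max(len(set1), len(set2), 1),
        'digit_count': sum(ch.isdigit() for ch in s1),
    }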
8. Modelling:
8.1 Baseline model:
Let us try with the first cut approach:
There are many combinations to try; to save some time I am using the kernel directly on the data I have generated.
I got a Kappa score of 0.612 on the test data, but this score can be improved significantly by stacking more complex models.
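A minimal baseline sketch, reusing the hypothetical feature names from section 6 (y_train and y_test are assumed to come from the earlier split):
from scipy.sparse import hstack
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score

# Stack the query one-hot features with the title TFIDF features.
X_tr = hstack([query_ohe_train, title_tfidf_train])
X_te = hstack([query_ohe_test, title_tfidf_test])

clf = LogisticRegression(max_iter=1000)
clf.fit(X_tr, y_train)
print('Quadratic kappa:', cohen_kappa_score(y_test, clf.predict(X_te), weights='quadratic'))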
References: