
All the code is available at this Git Repository:
https://github.com/Myau5x/anti-recommender.git
Minimum goal: build a model that performs better than a naive baseline that labels every restaurant rated below 3.5 (or 4) as bad and every restaurant rated above 3.5 (or 4) as good.
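The naive baseline can be sketched in a few lines. This is only an illustration: the function names and the toy data are made up, and treating a rating exactly equal to the threshold as "good" is an assumption, since the text does not say which side it falls on.

```python
def baseline_is_bad(avg_rating: float, threshold: float = 3.5) -> bool:
    """Naive baseline: a restaurant is 'bad' if its average rating is below the threshold."""
    # Assumption: a rating exactly at the threshold counts as good.
    return avg_rating < threshold


def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    matches = sum(p == y for p, y in zip(predictions, labels))
    return matches / len(labels)


# Toy data (made up): (restaurant average rating, did the user rate it 1 or 2?)
pairs = [(4.5, False), (3.0, True), (4.0, True), (2.5, True), (3.5, False)]

preds = [baseline_is_bad(avg) for avg, _ in pairs]
labels = [bad for _, bad in pairs]
baseline_accuracy = accuracy(preds, labels)
print(baseline_accuracy)
```

Any model in this project has to beat the score this kind of rule gets on the same user/restaurant pairs to be worth using.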
For training models, I used Yelp Academic Dataset available here: https://www.yelp.com/dataset
For validation, I used data scraped from Yelp. The folder Scraping
contains the code for web scraping and for working with the Yelp API.
First, I downloaded data through the Yelp API for King County zip codes.
Then, with BeautifulSoup, I scraped the reviews of about 100 users. The folder data
contains examples of the scraped data.
An example of how to do this is in the Jupyter notebook king_county_food.ipynb.
Filter the reviews that are for restaurants.
Filter the bad reviews.
Split the data into train and test sets.
I used PySpark for this.
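The three preparation steps above can be sketched in plain Python (a stand-in for the logic, not the PySpark code from the repository). It assumes the field names of the Yelp Academic Dataset (`business_id`, `categories`, `stars`), reads "filter bad reviews" as keeping the 1- and 2-star reviews, and uses hypothetical function names and a made-up split ratio.

```python
import random


def is_restaurant(business: dict) -> bool:
    """True if the comma-separated categories string contains 'Restaurants'."""
    categories = business.get("categories") or ""  # the field can be null
    return "Restaurants" in [c.strip() for c in categories.split(",")]


def select_bad_restaurant_reviews(reviews, businesses, max_bad_stars=2):
    """Keep reviews of restaurants with a rating of max_bad_stars or lower."""
    restaurant_ids = {b["business_id"] for b in businesses if is_restaurant(b)}
    return [
        r for r in reviews
        if r["business_id"] in restaurant_ids and r["stars"] <= max_bad_stars
    ]


def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle with a fixed seed and split into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy records (made up) in the shape of the dataset files.
businesses = [
    {"business_id": "b1", "categories": "Restaurants, Pizza"},
    {"business_id": "b2", "categories": "Hair Salons"},
    {"business_id": "b3", "categories": None},
]
reviews = [
    {"user_id": "u1", "business_id": "b1", "stars": 1, "text": "cold food"},
    {"user_id": "u2", "business_id": "b1", "stars": 5, "text": "great pizza"},
    {"user_id": "u3", "business_id": "b2", "stars": 1, "text": "bad haircut"},
    {"user_id": "u4", "business_id": "b1", "stars": 2, "text": "slow service"},
]

bad_reviews = select_bad_restaurant_reviews(reviews, businesses)
train, test = split_train_test(bad_reviews, test_fraction=0.5)
```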
I tried the ALS model for predicting ratings, but it predicted worse than the mean rating. (The Jupyter notebooks and other sources are in the ALS folder.)
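One way to make "worse than the mean rating" concrete is to compare the error of the model against a predictor that always outputs the mean training rating. The sketch below uses RMSE, which is an assumption (the text does not name the metric), and the ratings and model predictions are made-up numbers, not results from the project.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up ratings for illustration only.
train_ratings = [1, 2, 3, 4, 5]
test_ratings = [1, 2, 4, 5, 3]
model_predictions = [3.5, 4.0, 2.0, 3.0, 4.5]  # hypothetical model output

# Baseline: predict the mean training rating for every test pair.
mean_rating = sum(train_ratings) / len(train_ratings)
mean_rmse = rmse([mean_rating] * len(test_ratings), test_ratings)
model_rmse = rmse(model_predictions, test_ratings)

# A model is only useful if its error is lower than the baseline error.
model_beats_mean = model_rmse < mean_rmse
```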
Vectorize the reviews with CountVectorizer + IDF.
Cluster the reviews with KMeans.
Then, using the clusters of the reviews, I assign clusters to restaurants and to users on the train set (every user or restaurant can have several reviews).
If the user doesn't like a particular feature and the restaurant has it, I predict that the restaurant is bad for that user (the user would rate it 1 or 2).
Check that, for user/restaurant pairs unseen in the train set, predicting a bad rating this way works better.
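The matching rule described above can be illustrated as a set intersection: a user collects the clusters of their bad reviews, a restaurant collects the clusters of the bad reviews it received, and a shared cluster triggers a "bad" prediction. This is a minimal sketch under that reading; the function names, the `cluster` field, and the toy rows are hypothetical.

```python
from collections import defaultdict


def clusters_by_key(reviews, key):
    """Map each user (or restaurant) id to the set of clusters of its reviews."""
    result = defaultdict(set)
    for review in reviews:
        result[review[key]].add(review["cluster"])
    return dict(result)


def predict_bad(user_id, business_id, user_clusters, biz_clusters):
    """Predict 'bad' if the user dislikes at least one cluster the restaurant has."""
    disliked = user_clusters.get(user_id, set())
    present = biz_clusters.get(business_id, set())
    return bool(disliked & present)


# Toy bad reviews from the train set, already labeled with a cluster (made up).
train_reviews = [
    {"user_id": "u1", "business_id": "b1", "cluster": 3},
    {"user_id": "u1", "business_id": "b2", "cluster": 7},
    {"user_id": "u2", "business_id": "b3", "cluster": 3},
    {"user_id": "u3", "business_id": "b3", "cluster": 5},
]

user_clusters = clusters_by_key(train_reviews, "user_id")
biz_clusters = clusters_by_key(train_reviews, "business_id")

# u1 never reviewed b3, but both share cluster 3, so the rule predicts "bad".
unseen_pair_is_bad = predict_bad("u1", "b3", user_clusters, biz_clusters)
```

Users or restaurants that never appear in the train set get an empty cluster set here, so the rule falls back to "not bad" for them.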
Save the KMeans cluster centroids, the IDF vector, and the CountVectorizer vocabulary.
Code for this: nlp_model.py
and NLP_tuning.ipynb
Save basic restaurant info and the predicted clusters to the CSV file biz_cluster.csv.
The code for creating this file is in save_biz.py.
Split biz_cluster.csv
into train and test sets.
Drop the features that Yelp doesn't provide through its API. Create a new feature: rating / (number of reviews).
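A small sketch of this feature step is shown below. The column names are assumptions: `rating`, `review_count`, and `price` are fields the Yelp API returns for a business, but the actual columns of biz_cluster.csv are not listed here, and `attributes_wifi` and `rating_per_review` are hypothetical names.

```python
# Assumed subset of fields that are available from the Yelp API.
API_FIELDS = ("rating", "review_count", "price")


def prepare_features(row: dict) -> dict:
    """Keep only API-available fields and add rating / (number of reviews)."""
    features = {key: row[key] for key in API_FIELDS if key in row}
    count = features.get("review_count", 0)
    rating = features.get("rating", 0.0)
    # Guard against division by zero for restaurants without reviews.
    features["rating_per_review"] = rating / count if count else 0.0
    return features


# Toy row (made up): the wifi attribute stands for a dataset-only feature.
row = {"rating": 4.0, "review_count": 200, "price": "$$", "attributes_wifi": "free"}
features = prepare_features(row)
```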
Train 16 Random Forest and GradientBoosting regressors for every cluster to predict whether a particular restaurant can be assigned to that cluster.
Test them on the test set.
Create an sklearn model that works the same way as the PySpark model, using the saved cluster centroids, IDF vector, and CountVectorizer vocabulary (https://github.com/Myau5x/anti-recommender/tree/master/model_parts).
Using this model, assign clusters to the user based on their reviews.
Assign clusters to restaurants using the Random Forest (GradientBoostClassifier).
Predict whether the user would rate the restaurant as bad. The code for this is in the notebook testing_on_scrap.
At the moment, the website works locally:
- The user provides the URL of their profile.
- The tool scrapes it.
- It clusters the user according to their bad reviews.
- Then the user provides a location.
- The tool calls the Yelp API and takes the first 100 restaurants for this location.
- It predicts whether those restaurants are bad for the user or not.
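The "first 100 restaurants" step can be sketched as two paged requests to the Yelp business search endpoint, which (to my knowledge) caps `limit` at 50 results per call. The code only builds the requests and does not send them. The function name, the `restaurants` category filter, and the placeholder API key are assumptions, not taken from antirec.py.

```python
from urllib.parse import urlencode
from urllib.request import Request

YELP_SEARCH_URL = "https://api.yelp.com/v3/businesses/search"


def build_search_requests(location, api_key, total=100, page_size=50):
    """Build one request per page until `total` results are covered."""
    requests = []
    for offset in range(0, total, page_size):
        params = {
            "location": location,
            "categories": "restaurants",  # assumed filter
            "limit": min(page_size, total - offset),
            "offset": offset,
        }
        request = Request(
            YELP_SEARCH_URL + "?" + urlencode(params),
            headers={"Authorization": f"Bearer {api_key}"},
        )
        requests.append(request)
    return requests


# "YOUR_API_KEY" is a placeholder; nothing is sent over the network here.
search_requests = build_search_requests("Seattle, WA", "YOUR_API_KEY")
```

Each request could then be passed to `urllib.request.urlopen`, and the `businesses` arrays of the two JSON responses concatenated.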
For easy use with Flask, I created an sklearn model that works the same way as the trained PySpark model. See the code in rewrite_model_as_sklearn.ipynb.
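The idea of reproducing the model outside PySpark from its saved parts can be illustrated without any library: count the vocabulary terms in a review, weight the counts by the saved IDF vector, and pick the nearest saved centroid. This is a dependency-free sketch, not the notebook's code. The lowercase whitespace tokenizer, the Euclidean distance, and the toy vocabulary, IDF values, and centroids are all assumptions.

```python
import math
from collections import Counter


def tfidf_vector(text, vocabulary, idf):
    """Term counts over the saved vocabulary, weighted by the saved IDF vector."""
    counts = Counter(text.lower().split())  # assumed tokenizer
    return [counts.get(term, 0) * weight for term, weight in zip(vocabulary, idf)]


def nearest_centroid(vector, centroids):
    """Index of the centroid with the smallest Euclidean distance."""
    distances = [math.dist(vector, centroid) for centroid in centroids]
    return distances.index(min(distances))


def assign_cluster(text, vocabulary, idf, centroids):
    """Assign a review to a cluster using only the saved model parts."""
    return nearest_centroid(tfidf_vector(text, vocabulary, idf), centroids)


# Toy model parts (made up); the real ones would be loaded from model_parts.
vocabulary = ["slow", "service", "cold", "food"]
idf = [1.5, 1.0, 2.0, 0.5]
centroids = [[1.5, 1.0, 0.0, 0.0], [0.0, 0.0, 2.0, 0.5]]

service_cluster = assign_cluster("Slow service and slow waiter", vocabulary, idf, centroids)
food_cluster = assign_cluster("cold food", vocabulary, idf, centroids)
```

Because only three small arrays are needed at prediction time, the Flask app does not have to start a Spark session.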
The web app runs on Flask and Brython. The source code for it is in antirec.py
and templates/index_2.html.
The static folder is also needed for Brython.
My advice to you is to be open-minded and think outside of the box while you are looking for a career in data science. It will give you a competitive edge in your career in data science.
Bio: Shaik Sameeruddin I help businesses drive growth using Analytics & Data Science | Public speaker | Uplifting students in the field of tech and personal growth | Pursuing b-tech 3rd year in Computer Science and Engineering(Specialisation in Data Analytics) from “VELLORE INSTITUTE OF TECHNOLOGY(V.I.T)”