Machine Learning and Data Science Applications in Industry

All the code is available at this Git Repository:

https://github.com/Myau5x/anti-recommender.git

Minimum result: create a model that performs better than a baseline that predicts every restaurant rated below 3.5 (or 4) is bad and every restaurant rated above 3.5 (or 4) is good.
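The baseline described above can be sketched in a few lines. This is only an illustration, not code from the repository: the function names and the example pairs are made up, and "bad" is assumed to mean that the user rated the restaurant 1 or 2.

```python
def baseline_predict_bad(avg_rating, threshold=3.5):
    """Baseline: call a restaurant bad for everyone if its average rating is below the threshold."""
    return avg_rating < threshold


def accuracy(predicted, actual):
    """Share of predictions that match the actual labels."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


# Hypothetical (restaurant average rating, did this user rate it 1 or 2?) pairs.
examples = [(4.5, False), (3.0, True), (4.0, True), (2.5, True)]
predicted = [baseline_predict_bad(avg) for avg, _ in examples]
actual = [was_bad for _, was_bad in examples]
baseline_accuracy = accuracy(predicted, actual)
```

Any model in this project should beat this number on the same user/restaurant pairs to be worth using.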


For training the models, I used the Yelp Academic Dataset, available here: https://www.yelp.com/dataset

For validation, I used data scraped from Yelp. The Scraping folder contains the code for web scraping and for working with the Yelp API.

First, I downloaded data through the Yelp API for the King County zip codes.

Then, with BeautifulSoup, I scraped the reviews of about 100 users. The data folder contains examples of the scraped data.

An example of how to do this is in the Jupyter notebook king_county_food.ipynb.

Filter the reviews down to those that are for restaurants.

Select the bad reviews.

Split the data into train and test sets.
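The three preparation steps above could look roughly like this in plain Python (the project itself uses PySpark). This is a sketch under assumptions: the field names `business_id`, `categories` and `stars` follow the Yelp Academic Dataset as I understand it, the cut-off of 2 stars for a bad review and the 80/20 split are my own choices, and the helper names are hypothetical.

```python
import random


def is_restaurant(business):
    """True if 'Restaurants' is one of the comma-separated categories."""
    categories = business.get("categories") or ""
    return "Restaurants" in [c.strip() for c in categories.split(",")]


def select_bad_restaurant_reviews(reviews, businesses, max_bad_stars=2):
    """Keep reviews of restaurants whose star rating is at most max_bad_stars (assumed cut-off)."""
    restaurant_ids = {b["business_id"] for b in businesses if is_restaurant(b)}
    return [
        r for r in reviews
        if r["business_id"] in restaurant_ids and r["stars"] <= max_bad_stars
    ]


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle with a fixed seed and return (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

A fixed seed keeps the split reproducible, so later comparisons against the baseline use the same test pairs.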

I used PySpark for this.

I tried an ALS model for predicting ratings, but it predicted worse than the mean rating. (The Jupyter notebooks and other sources are in the ALS folder.)

Vectorize the reviews with CountVectorizer followed by IDF.

Cluster the reviews with KMeans.

Then, using the clusters of the reviews, I assign clusters to restaurants and to users in the train set (every user or restaurant can have several reviews).

If the user doesn't like a particular feature and the restaurant has it, I predict that the restaurant is bad for that user (the user would rate it 1 or 2).
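The rule above can be sketched with plain sets. This is a minimal illustration, assuming each review already carries a cluster label from KMeans; the key names (`user_id`, `business_id`, `cluster`) and both helper functions are hypothetical and not taken from the repository.

```python
from collections import defaultdict


def collect_clusters(labelled_reviews, key):
    """Group review cluster labels by user or by restaurant (key is 'user_id' or 'business_id')."""
    clusters = defaultdict(set)
    for review in labelled_reviews:
        clusters[review[key]].add(review["cluster"])
    return clusters


def predict_bad(user_clusters, restaurant_clusters):
    """Predict 'bad' when the user's complaint clusters overlap with the restaurant's clusters."""
    return bool(set(user_clusters) & set(restaurant_clusters))


# Hypothetical bad reviews, each already assigned to a cluster.
labelled = [
    {"user_id": "u1", "business_id": "b1", "cluster": 3},
    {"user_id": "u1", "business_id": "b2", "cluster": 7},
    {"user_id": "u2", "business_id": "b1", "cluster": 5},
]
by_user = collect_clusters(labelled, "user_id")
by_restaurant = collect_clusters(labelled, "business_id")
```

With this data, `predict_bad(by_user["u2"], by_restaurant["b2"])` is false because the two sets share no cluster, while user `u1` and restaurant `b1` share cluster 3.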

Check that, for user/restaurant pairs unseen in the train set, predicting bad ratings this way works better than the baseline.

Save the KMeans cluster centroids, the IDF vector, and the CountVectorizer vocabulary.

The code for this is in nlp_model.py and NLP_tuning.ipynb.

Save basic restaurant info and the predicted clusters to a CSV file, biz_cluster.csv. The code for creating this file is in save_biz.py.


Split biz_cluster.csv into train and test sets.

Drop the features that Yelp doesn't provide through its API. Create a new feature: rating / (number of reviews).
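As a rough illustration of this step, the snippet below keeps only an allowed set of columns and adds the ratio feature. The column names (`stars`, `review_count`, `price`) and the list `API_FEATURES` are assumptions for the example; the real columns of biz_cluster.csv may differ.

```python
# Hypothetical list of columns assumed to be available through the Yelp API.
API_FEATURES = ["stars", "review_count", "price"]


def prepare_features(row, api_features=API_FEATURES):
    """Keep only API-available columns and add rating / (number of reviews)."""
    features = {name: row[name] for name in api_features}
    reviews = row["review_count"]
    # Guard against division by zero for restaurants without reviews.
    features["rating_per_review"] = row["stars"] / reviews if reviews else 0.0
    return features
```

Returning 0.0 for restaurants with no reviews is an arbitrary choice made here to keep the example safe to run.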

Train 16 Random Forests and GradientBoosting Regressors, one per cluster, to predict whether a particular restaurant can be assigned to that cluster.

Test them on the test set.

Create an sklearn model that works the same way as the PySpark model, using the saved cluster centroids, IDF vector, and CountVectorizer vocabulary (https://github.com/Myau5x/anti-recommender/tree/master/model_parts).
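To show the idea behind rebuilding the model from its saved parts, here is a dependency-free sketch: count the vocabulary terms in a review, weight the counts by the saved IDF vector, and pick the nearest centroid. It is not the code from the repository. The whitespace tokenizer, the Euclidean distance, and the tiny vocabulary, IDF vector and centroids are all assumptions made for the example.

```python
def tfidf_vector(text, vocabulary, idf):
    """Term counts over a fixed vocabulary, weighted by a saved IDF vector."""
    index = {term: i for i, term in enumerate(vocabulary)}
    counts = [0.0] * len(vocabulary)
    for token in text.lower().split():  # assumed tokenizer: lowercase + whitespace split
        i = index.get(token)
        if i is not None:
            counts[i] += 1.0
    return [count * weight for count, weight in zip(counts, idf)]


def nearest_cluster(vector, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    def squared_distance(centroid):
        return sum((v - c) ** 2 for v, c in zip(vector, centroid))
    return min(range(len(centroids)), key=lambda k: squared_distance(centroids[k]))


# Hypothetical saved model parts.
vocabulary = ["slow", "cold", "rude"]
idf = [1.0, 2.0, 1.5]
centroids = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.5]]

vector = tfidf_vector("Cold food and cold soup", vocabulary, idf)
cluster = nearest_cluster(vector, centroids)
```

Because only three arrays are needed at prediction time, the web app does not have to start a Spark session to score a review.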

Using this model, assign clusters to each user based on their reviews.

Assign clusters to restaurants using Random Forest (GradientBoostClassifier).

Predict whether the user would rate a restaurant as bad. The code for this is in the notebook testing_on_scrap.

At the moment, the website works locally.

The user provides the URL of their profile.
My tool scrapes it.
It clusters the user according to their bad reviews.
Then the user provides a location.
The tool calls the Yelp API and takes the first 100 restaurants for this location.
It predicts whether those restaurants are bad for the user or not.

To make it easy to use with Flask, I created an sklearn model that works the same way as the trained PySpark model. Look at the code in rewrite_model_as_sklearn.ipynb. The web app uses Flask and Brython; the source code for this is in antirec.py and templates/index_2.html. The static folder is also needed for Brython.

My advice to you is to be open-minded and think outside of the box while you are looking for a career in data science. It will give you a competitive edge in your career in data science.

Bio: Shaik Sameeruddin I help businesses drive growth using Analytics & Data Science | Public speaker | Uplifting students in the field of tech and personal growth | Pursuing b-tech 3rd year in Computer Science and Engineering(Specialisation in Data Analytics) from “VELLORE INSTITUTE OF TECHNOLOGY(V.I.T)”
