Now that we have seen what a dataset looks like, the next thing you need to know is whether your problem is a Regression problem, a Classification problem or a Clustering problem.
To find out, look at the dependent variable, which we have already covered.
[Note: If you don’t have a dependent variable, then it is a Clustering problem.]
Now, if your dataset contains a dependent variable, you have to check whether it has a continuous outcome or a categorical outcome.
If it is a continuous outcome, then your problem is a Regression problem.
And if it’s a categorical outcome, then your problem is a Classification problem.
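The decision rule above can be sketched as a small helper. This is only a rough heuristic, not a definitive test: the function name guess_problem_type and the max_classes cutoff are made up for illustration, and a numeric column with only a few distinct values is assumed to be categorical.

```python
from numbers import Number


def guess_problem_type(target, max_classes=10):
    """Guess the problem type from the dependent variable.

    target: list of dependent-variable values, or None if the dataset
            has no dependent variable at all.
    max_classes: assumed cutoff; a numeric target with at most this many
                 distinct values is treated as categorical.
    """
    if target is None:
        # No dependent variable means there is nothing to predict.
        return "clustering"

    distinct = set(target)
    all_numeric = all(
        isinstance(v, Number) and not isinstance(v, bool) for v in distinct
    )

    if all_numeric and len(distinct) > max_classes:
        # Many distinct numeric values: treat as a continuous outcome.
        return "regression"

    # Few distinct values, or non-numeric labels: categorical outcome.
    return "classification"


print(guess_problem_type(None))
print(guess_problem_type([0, 1, 1, 0]))
print(guess_problem_type([i * 1000.5 for i in range(50)]))
```

Always check the guess against what the column actually means. A rating from 1 to 5, for example, could reasonably be treated either way.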
Let’s see what a dataset with a DV (dependent variable) looks like.
This is a House Prices dataset, and it has lots of rows and columns. You have to predict SalePrice, which is the dependent variable; all the other columns are independent variables. Since SalePrice is a continuous value, you can easily see that it is a Regression problem, and we have to use some regression model on it, like RandomForest, SVR etc.
Now, look at this dataset, in which you are given User ID, Gender, Age and Estimated Salary, which are all independent variables, and you have to predict whether a new person is going to buy a new SUV or not. [Note: One can easily see it is a Classification problem because the dependent variable, Purchased, has a binary output of 0 or 1 only, where 1 means the person will buy the SUV and 0 means the person will not.]
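To make this concrete, here is a minimal sketch of a classifier for a table shaped like the one described. The rows below are made up for illustration and are not taken from the real dataset. Only Age and Estimated Salary are used, on the assumption that User ID is just an identifier. Logistic regression is one possible choice of model here, not the only one.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up rows: [Age, EstimatedSalary]. Not real data.
X = [
    [19, 19000], [35, 20000], [26, 43000], [27, 57000],
    [45, 120000], [47, 98000], [52, 110000], [48, 135000],
]
# Purchased: 1 = bought the SUV, 0 = did not buy it.
y = [0, 0, 0, 0, 1, 1, 1, 1]

# Scale the features first because Age and EstimatedSalary
# are on very different ranges.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)

# Predict for a new, unseen person.
new_person = [[40, 90000]]
prediction = model.predict(new_person)[0]
print("Predicted Purchased value:", prediction)
```

The prediction is always one of the two classes, 0 or 1, which is exactly what makes this a classification problem.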
So, by now we have a good enough idea of how, just by looking at the dataset, we can classify our problem as Regression, Classification or Clustering.
Now, how would I know which model is the best one? For example, say you are working on Home Price Prediction and you have to predict the price of a house based on several parameters. Which model should you use, and which parameter values should you pass to it? What you can do is use Grid Search, which tells you which of the parameter values you tried work best for your model.
What does Grid Search do?
It tries every combination of the candidate parameter values you give it, scores each one with cross-validation, and reports the best combination. All you need to do is import the class from the sklearn library.
from sklearn.model_selection import GridSearchCV
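Here is a minimal sketch of how GridSearchCV might be used on a regression problem like house prices. The data is synthetic (generated with make_regression), and the candidate values in param_grid are arbitrary picks for illustration, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for a house-price table. Not real data.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Candidate values to try. Grid Search only compares the values listed here.
param_grid = {
    "n_estimators": [20, 50],
    "max_depth": [3, None],
}

search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_grid=param_grid,
    cv=3,          # 3-fold cross-validation for each combination
    scoring="r2",  # score each combination by R^2
)
search.fit(X, y)

best_params = search.best_params_  # combination with the best mean CV score
best_score = search.best_score_    # mean cross-validated R^2 of that combination
print(best_params, best_score)
```

Keep in mind that Grid Search only finds the best combination among the values you list, so the result is only as good as the grid you give it.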
Nobody in this world can tell you which model will give you the best performance or accuracy just by looking at the dataset. All you can do is classify your problem by looking at the dataset: whether the data is linear or non-linear, and whether the problem is a classification, regression or clustering problem.
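Since you cannot tell the best model by eye, one common approach is to try a few candidates and compare their cross-validated scores. The sketch below does this for one linear and one non-linear model on synthetic, deliberately non-linear data (make_moons). The two candidates are arbitrary choices for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic two-class data with a curved (non-linear) boundary. Not real data.
X, y = make_moons(n_samples=300, noise=0.25, random_state=0)

# One linear and one non-linear candidate model.
candidates = {
    "logistic_regression": LogisticRegression(),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean accuracy over 5 cross-validation folds for each candidate.
scores = {}
for name, model in candidates.items():
    scores[name] = cross_val_score(model, X, y, cv=5).mean()

best_name = max(scores, key=scores.get)
print(scores)
print("Best candidate on this data:", best_name)
```

Whichever model wins here wins only on this particular data, so repeat the comparison on your own dataset.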
Don’t be sad, because you will have the cheat sheet, which helps you pick the model.
If you find any difficulty in reading the cheat sheet, go to this link: Cheat Sheet.
I hope you liked this article!! If you have any problem or query on any topic related to Data Science, do let me know in the comment section!! I’ll share more concepts soon in the LinkedIn.com Articles column as well as on Medium.
Give some love too!