Being a data scientist by profession is different from being an enthusiast. While the latter can experiment and learn by trial and error, professionals need to produce results, fast. Having been on both ends of the spectrum, here is what I use most frequently at work. I’ve included visualization libraries, feature engineering and statistical testing libraries, a low-code AutoML library, an ML debugging library and finally, a predictive one and a deep learning one. These work over and above the standard stack (Pandas, NumPy, scikit-learn and Matplotlib), and I hope readers find them useful.
If you’re coming from Tableau and miss those cool maps, Folium is your new friend. Not only does it make use of vector and raster layers, it also produces interactive maps that you can zoom into after they have been rendered.
The official documentation is here.
This Kaggle Notebook shows it in action for maps, heatmaps and time analysis.
While it’s mostly used for exploratory analysis and modelling results, Seaborn also offers high-dimensional plots for the pro visual thinkers. Its declarative API lets you focus on what the different elements of your plots mean, rather than on the details of how to draw them.
The official example gallery is a treat to explore.
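As a quick taste of that declarative style, here is a minimal sketch: a scatter plot coloured by group, built from a hand-made DataFrame (the column names and numbers are invented for illustration).

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# A tiny made-up dataset.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "score": [52, 58, 61, 70, 74, 81],
    "group": ["A", "A", "B", "B", "A", "B"],
})

# Say what the plot means (x, y, hue); Seaborn handles the drawing.
ax = sns.scatterplot(data=df, x="hours_studied", y="score", hue="group")
plt.savefig("scores.png")
```

Note that you never touch colours, legends or axes directly; mapping `hue` to a column is enough.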
Built specifically for in-browser visualization, use it if the graphics produced need to be embedded in presentations and decks. Choose this over anything else if you need to embed plots in web applications, or export them across apps and servers.
Get started here.
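The section does not name the library, but the in-browser embedding it describes matches Bokeh, so here is a minimal sketch under that assumption: a line chart exported as the `<script>`/`<div>` HTML fragments you would paste into a web page (the numbers are made up).

```python
from bokeh.embed import components
from bokeh.plotting import figure

# Build a simple line chart.
p = figure(title="Quarterly revenue (made-up numbers)", width=400, height=300)
p.line([1, 2, 3, 4], [10, 14, 9, 17], line_width=2)

# components() returns two HTML fragments ready to embed elsewhere.
script, div = components(p)
```

Embedding those two fragments in any HTML template renders the interactive plot in the browser, with no image files involved.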
The amount of time a data science professional spends on feature selection is huge; if you don’t already spend time there, it’s recommended that you do. This library generates features for analysis and can come in handy while performing feature engineering. However, it’s best used in conjunction with the business question at hand.
Here is an amazing resource to start with.
This Kaggle notebook shows you how it’s done.
If you miss the simplicity of R results in Python, this library is for you.
Best known for its ability to conduct statistical tests and data exploration, it also houses regression and time series models.
Here is the official documentation.
A low-code AutoML library allowing its users to build multiple models with a few lines of code, this gem is perfect for the quick and dirty prototyping that we all do.
Official Documentation is here.
I wrote a Kaggle Notebook to submit to Titanic survival prediction using PyCaret here.
‘I don’t care what your model does, explain it to me like I am five.’
This library is explainability at its simplest, as its Reddit-inspired name suggests. A very handy way to debug ML models, it is often used to explain deep learning predictions.
A working example is mentioned in this Kaggle Notebook.
Deep learning is not as widely used in professional settings, but it does come in handy for open-ended objectives, particularly with NLP and image data. This page explains the capabilities, environments and features PyTorch supports.
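At PyTorch’s core is autograd, the machinery underneath every training loop. A minimal sketch: compute a gradient of a scalar function automatically.

```python
import torch

# A tensor that tracks gradients.
x = torch.tensor(3.0, requires_grad=True)

y = x ** 2 + 2 * x   # y = x^2 + 2x
y.backward()         # autograd computes dy/dx = 2x + 2

grad = x.grad.item() # 8.0 at x = 3
```

The same mechanism scales from this one-liner up to the backward pass of a full neural network.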
Often called a successor to XGBoost, CatBoost makes use of ordered boosting and can handle categorical data on its own to give even better results. It is best suited for settings involving highly heterogeneous data.
The article ‘What is so special about CatBoost?’ can answer any questions that you have.
Not enough or no test data? No problem. Need anonymized data? No problem. Faker generates fake data for you to play around with.
Official Documentation is here, and a quick example to get started is this one.