Datasets for Data Science and AI

Kaggle is the OG of Data Science and Machine Learning platforms. The challenges it hosts have created real competition within the Machine Learning community to become a master ‘kaggler’, and that title does indeed come with respect.

Even better though, the guys at Kaggle decided to open up a plethora of datasets (users can even share their own favourites) to help people apply machine learning models to real-life scenarios. From data on stock prices to ECG data, there’s something for everyone here. It really is a huge repository.

The UCI Repository is pretty simple. The University of California, Irvine has always made an effort to make its data open and transparent. It’s a broad library of data for anything from guitar finger positions to Taiwanese Bankruptcy Prediction. As with Kaggle, you’ll find just about anything you need here. The data comes well structured and easy to use. Generally speaking, it’s also pretty clean!
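"Pretty clean" is still worth a quick check before you start modelling. Here is a minimal sketch of one way to count empty cells per column using only the Python standard library. The sample data, its column names, and the `count_missing` helper are all made up for illustration; they do not come from any real UCI dataset, and in practice you would read the text from a file you downloaded yourself.

```python
import csv
import io

# Hypothetical sample standing in for a downloaded CSV file; the column
# names and values are invented for illustration only.
SAMPLE_CSV = """age,income,defaulted
34,52000,0
45,,1
29,41000,0
,38000,1
"""


def count_missing(csv_text):
    """Return a dict mapping each column name to its number of empty cells."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name in reader.fieldnames:
            value = row[name]
            # DictReader fills short rows with None, so treat that as missing too.
            if value is None or value.strip() == "":
                missing[name] += 1
    return missing


if __name__ == "__main__":
    print(count_missing(SAMPLE_CSV))
```

If every count comes back as zero, the file has no blank cells, although it may still hide placeholder values such as "?" or "NA" that need a separate check.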

Who doesn’t love reddit?

The ‘front page of the internet’ has it all — even amongst all the cat memes and r/TIFU posts, we find a hub of data that’s very diverse.

The great thing about reddit is that you can join discussions where people talk at length about specific custom datasets and their problems. Having a certain level of discourse surrounding a dataset makes it easier to get started and to build on the knowledge of others.

Moreover, if you can’t find the data that you’re looking for, you can always request it and hope that someone comes through for you!

Let’s be honest, there are a lot of people who write articles on Medium. Towards Data Science have well over 500,000 subscribers and that membership is expected to grow over the next few years. Given that, thousands of articles are written every month on any number of Data Science topics and good writers will often link or share their data.

Take the link provided: they really have shared just about every dataset possible!
