Open Source Datasets for Machine Learning

Machine learning, a form of artificial intelligence, teaches computer systems to learn and improve based on past experience. Because it extracts patterns and rules from gathered data, tasks can be automated in a way similar to how humans would complete them. Companies can use machine learning to transform their business by automating customer-facing activities such as offerings and advertising, along with other daily tasks that were previously done manually. The use cases extend across many industries and levels of detail. Current efforts focus on complex and critical tasks such as autonomous driving, disease detection, and disaster prediction.

Machine learning algorithms are only as good as their training sets: datasets are integral to prediction quality. Finding specific datasets for experimenting with and solving machine learning problems can be quite hard. The budget may not allow for the best-suited set, or the required datasets may not be available at all. Fortunately, various datasets are openly available, so you can focus on building the prediction model rather than on gathering and labeling the underlying data.

Open source data is important because the world relies on data and the information it contains. Whether the data serves business insights, machine control, or sales initiatives, we are living in a digital era of data-driven business models. Open source data is data that is publicly available and open for reuse and sharing. Various government and organizational initiatives provide data as a foundation for moving beyond the current status quo. All in all, open data will help the world transform the processes and systems that our generation has built.

Machine learning uses algorithms to improve over time. Nevertheless, the quality of a model depends on the data used to create and adapt it. The following four data types are mainly used for machine learning:

Numerical data: Any form of quantitative, measurable data such as speed or height. Numerical data includes discrete and continuous numbers on which mathematical operations can be performed.
Categorical data: This type is defined by categories or labels such as gender and industry. Categorical data is non-numerical, so mathematical operations cannot be applied to it directly.
Time series data: Time series data is indexed at specific points in time according to defined time intervals, so values can be compared using time-based metrics. It differs from plain numerical data in that every value carries a time reference, with defined start and end points.
Text data: Text data consists of words and sentences, which can be grouped and analyzed using approaches like word counts or sentiment analysis.
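
The four data types can be illustrated with a minimal Python sketch. All values and variable names below are made up for illustration and do not come from any of the datasets discussed in this article. The sketch assumes only the Python standard library and shows one typical operation per type: an average for numerical data, a one-hot encoding for categorical data, interval differences for time series data, and a word count for text data.

```python
from collections import Counter
from datetime import date
from statistics import mean

# All values below are invented for illustration only.

# Numerical data: arithmetic applies directly.
speeds_kmh = [50.0, 62.5, 48.0]
average_speed = mean(speeds_kmh)

# Categorical data: labels must be encoded as numbers before doing math on them.
industries = ["retail", "energy", "retail"]
categories = sorted(set(industries))
one_hot = [[int(value == c) for c in categories] for value in industries]

# Time series data: values are indexed by time and compared across intervals.
daily_sales = [
    (date(2021, 1, 1), 100),
    (date(2021, 1, 2), 120),
    (date(2021, 1, 3), 90),
]
day_over_day = [
    later[1] - earlier[1]
    for earlier, later in zip(daily_sales, daily_sales[1:])
]

# Text data: a simple word count.
review = "great movie great cast"
word_counts = Counter(review.split())
```

One-hot encoding is only one of several ways to make categorical data usable in calculations; it is chosen here because it needs no extra libraries.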

Machine learning algorithms typically require several datasets before a model meets the requirements for operational use. Training and testing datasets are used to build the model and to verify its accuracy. Further datasets are used in operation to validate and adjust the machine learning algorithm. Judging by their download counts, a number of public datasets are already widely used for machine learning. The following list groups well-known example datasets by machine learning use case:

Computer vision: Google Open Images Dataset (Link)
Natural language processing: Rotten Tomatoes Review Dataset (Link)
Sentiment analysis: IMDB Review Dataset (Link)
Autonomous driving: Waymo Open Dataset (Link)
Recommendation systems: MovieLens Review Dataset (Link)

If the well-known open source datasets do not meet the requirements of your project, you may find suitable datasets through the dataset search platforms described later in this article.
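
Whichever dataset you pick, it is usually divided into the training and testing sets mentioned earlier. The following is a minimal sketch of such a split using only the Python standard library. The function name train_test_split, the 20 percent test share, and the fixed seed are assumptions chosen for illustration, not a prescribed method.

```python
import random


def train_test_split(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of the records and split it into training and test lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical dataset of 100 records, represented here by their indices.
records = list(range(100))
train, test = train_test_split(records)
```

Shuffling before splitting avoids a test set that only reflects the order in which the data was collected. For time series data a chronological split is usually more appropriate than a random one.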

The publicly available Waymo Open Dataset (Link) is a collection of sensor data gathered by Waymo’s self-driving cars. The collection is one of the biggest and most diverse datasets for training machine learning models for autonomous driving. It contains data from urban and suburban landscapes in the US under varying light and weather conditions. The current dataset comprises 1,950 segments, each a 20-second capture of sensor data, which allows researchers to predict the behavior of cars and other traffic participants. The data is gathered from five lidars and five front- and side-facing cameras that are permanently installed.
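
To put the segment count into perspective, the total recording time follows directly from the two figures above. This is a back-of-the-envelope calculation that assumes every segment is exactly 20 seconds long; the variable names are illustrative.

```python
segments = 1_950
seconds_per_segment = 20  # assumption: every segment lasts exactly 20 seconds

total_seconds = segments * seconds_per_segment
total_hours = total_seconds / 3600

print(f"{total_seconds} s = {total_hours:.1f} h")  # prints "39000 s = 10.8 h"
```

In other words, the dataset covers slightly less than eleven hours of driving.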

The set comprises 2 TB of compressed data, split into files of at most 25 GB, and contains labeled training, labeled validation, and unlabeled test data. The training dataset includes 12.6 million 3D boxes for the lidar frames with labels for vehicles, pedestrians, cyclists, and traffic signs. In addition, 11.8 million 2D boxes with labels for vehicles, pedestrians, and cyclists are provided for the camera frames. Lidar and camera labels were created independently and are not projections of each other. The following image shows an example of a lidar frame with one 3D box around an identified vehicle (Link). The corresponding camera frame is shown at the bottom left of the image.

Camera frame (left) and lidar frame (right) of a vehicle

With its open approach, Waymo actively contributes to public research on machine learning for autonomous driving.

It can be hard to find specific datasets and, in some cases, even the well-known datasets do not fit the application domain. The following list is a collection of platforms that provide search functionality for finding suitable datasets for specific purposes:

Google Dataset Search Engine (Link): The search engine indexes machine learning datasets that are available on the web.
Amazon Open Datasets (Link): Amazon’s data registry offers 200 open datasets, which are maintained by third parties and stored on AWS.
Microsoft Research Open Data (Link): This data repository contains several datasets, mainly in the domain of the natural sciences.
Kaggle Datasets (Link): The dataset store is an openly available platform with more than 66,000 uploaded datasets for various domains.
DATA.GOV Datasets (Link): DATA.GOV is a data repository with more than 200,000 datasets provided by the US government.

The dataset platforms differ in size, covered use cases, complexity, and quality. It is worth consulting several of them when looking for suitable datasets.

Kaggle, a Google subsidiary, is an online platform for publicly sharing datasets for machine learning and data analysis. Its target groups are data scientists, enterprises, and organizations from different industries that are interested in publishing datasets and building machine learning models in cooperation with other researchers. Data can be uploaded in different formats, such as markup or database file types. At its core, the platform offers a cloud-based environment for processing data and exchanging machine learning notebooks. With its broad community of data researchers, Kaggle enables knowledge exchange through open discussions and data challenges.

Machine learning relies heavily on data for training, testing, and ongoing validation. The availability of suitable data affects both the quality of prediction models and the effort needed to create the initial model. Several well-known datasets can be used to create such prediction models without the need to gather and label data. If the well-known datasets are not suitable, broad dataset repositories such as DATA.GOV, dataset search engines like Google’s Dataset Search, or dataset platforms like Kaggle might help.

The future of machine learning is an open community of enterprises and institutions that share not only data but also knowledge, and that work jointly on machine learning challenges. Kaggle, for example, already provides a platform for sharing and processing data and for exchanging knowledge with other experts.

It is worthwhile to consider the following readings for a better understanding of data-driven business models.

Business Models of a Digital Era (Link):
This article describes how digital transformation and digital natives are changing business. As companies adopt emerging technologies and respond to new customer behaviors, they show a variety of new business model patterns suited to the characteristics of a digital era.
