In the previous post of this series (Part 2) we looked at a taxonomy for building Question Answering systems, focusing on information retrieval based systems built on supervised Question Answer models. Developing such models requires data, and thankfully there are openly available data sets that can be used to develop and benchmark them. In this post we will look at four such data sets.
Before we look at the data sets, let us do a quick recap of Question Answer models. They are supervised machine learning models that take as input a passage of text and a question, and extract the span of text in the passage that answers the question, returning the start and end positions of that span in the input passage.
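To make the span-extraction idea concrete, here is a minimal sketch. It assumes a model has already produced per-token start and end scores (the tokens and scores below are made-up illustrative values, not real model outputs) and simply picks the highest-scoring valid span:

```python
def extract_span(tokens, start_scores, end_scores):
    """Pick the (start, end) token pair with the highest combined score,
    subject to start <= end, and return the answer text plus positions."""
    best = (0, 0)
    best_score = float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, len(end_scores)):
            score = s + end_scores[j]
            if score > best_score:
                best_score = score
                best = (i, j)
    start, end = best
    return " ".join(tokens[start:end + 1]), best

# Hypothetical passage tokens and model scores for the question
# "Where is the Eiffel Tower?"
tokens = ["The", "Eiffel", "Tower", "is", "in", "Paris"]
start_scores = [0.1, 0.2, 0.1, 0.0, 0.1, 2.5]
end_scores   = [0.0, 0.1, 0.3, 0.1, 0.2, 3.0]

answer, span = extract_span(tokens, start_scores, end_scores)
# answer == "Paris", span == (5, 5)
```

Real models score spans over subword tokens and restrict the search window, but the selection logic is essentially this.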
Stanford Question Answering Dataset (SQUAD)
SQUAD has become the de facto standard data set for developing and benchmarking Question Answer models. It is primarily the result of the efforts of Pranav Rajpurkar, currently a PhD candidate in the Computer Science department at Stanford University. The data set was developed via crowdsourcing using Wikipedia data. To generate the data set:
- First, crowdsourced workers are asked to generate questions for a set of given passages chosen from Wikipedia.
- Then, crowdsourced workers are given the passages and questions from the step above and are asked to select three answers for each question.
There are two versions of SQUAD: SQUAD v1.1 and SQUAD v2.0. SQUAD v1.1 was released in 2016 with approximately 100K Question Answer pairs based on 536 passages from Wikipedia. SQUAD v2.0, released in 2018, added 50K unanswerable questions on top of the existing questions in SQUAD v1.1. Models must learn to abstain from answering these questions, since the passage does not contain an answer.
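The sketch below shows how one might iterate over a SQUAD-style record. It mirrors the nested layout of the published SQUAD v2.0 JSON (articles containing paragraphs containing question-answer groups), but the record contents here are invented for illustration:

```python
# A tiny record mirroring the SQuAD v2.0 JSON layout; the real file is a
# top-level {"data": [...]} object. All values below are made up.
record = {
    "title": "Normandy",
    "paragraphs": [{
        "context": "The Normans were a people of Normandy.",
        "qas": [
            {"id": "q1", "question": "Who lived in Normandy?",
             "is_impossible": False,
             "answers": [{"text": "The Normans", "answer_start": 0}]},
            {"id": "q2", "question": "What is the capital of Mars?",
             "is_impossible": True,
             "answers": []},
        ],
    }],
}

def iter_examples(article):
    """Flatten one article into (question, context, answer) triples;
    unanswerable v2.0 questions yield answer=None."""
    for para in article["paragraphs"]:
        for qa in para["qas"]:
            answer = (qa["answers"][0]["text"]
                      if not qa.get("is_impossible") else None)
            yield qa["question"], para["context"], answer

examples = list(iter_examples(record))
# examples[0] is answerable; examples[1] has answer None (abstain)
```

The `is_impossible` flag is what distinguishes v2.0 from v1.1: a model trained on v2.0 must predict "no answer" for those questions rather than guess a span.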
Conversational Question Answering Challenge (COQA)
COQA is another interesting data set for developing Question Answer models. Unlike SQUAD, whose questions are standalone, COQA is for developing models that can answer a set of interconnected conversational questions on a passage. Like SQUAD, COQA question-answer pairs are generated using crowdsourcing. COQA has 127K questions and answers from 8,000 conversations drawn from a variety of sources, and the answers are free-form text rather than exact spans.
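Because each COQA question can refer back to earlier turns ("he", "there", "when?"), a common modeling choice, not part of the dataset itself, is to prepend the previous turns to the current question before feeding it to the model. A minimal sketch with an invented dialogue:

```python
def build_input(history, question, max_turns=2):
    """Concatenate the last `max_turns` question-answer pairs with the
    current question to give the model conversational context."""
    turns = [f"Q: {q} A: {a}" for q, a in history[-max_turns:]]
    return " ".join(turns + [f"Q: {question}"])

# Made-up conversation: the final question only makes sense with history.
history = [
    ("Who wrote Hamlet?", "Shakespeare"),
    ("When was he born?", "1564"),
]
model_input = build_input(history, "Where?")
# model_input ends with "Q: Where?" and carries the earlier turns as context
```

Truncating to the last few turns (`max_turns`) keeps the input within a model's length limit while retaining the context needed to resolve pronouns and ellipsis.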
MS MARCO (Microsoft Machine Reading Comprehension)
MS MARCO is a collection of data sets from Microsoft that includes a data set for Question Answering. The questions are real user queries from the Bing search engine, and the answers are human generated. The data set has over one million queries. Note, however, that the Question Answering data set is no longer being maintained.
Google’s Natural Questions (NQ) corpus
Google’s Natural Questions (NQ) corpus is another data set for Question Answer model development. It contains real user questions whose answers can be found in Wikipedia pages. For each question, the answers include both the passage containing the answer (the long answer) and the exact answer text (the short answer) from the Wikipedia page.
In the next part of this series we will look at Transformers and BERT, which are behind most of the high-performing Question Answer models.