Setup
If you are coming from the previous tutorial, you will need to make some small changes to app.py, FinBertQARanker/__init__.py, and FinBertQARanker/tests/test_finbertqaranker.py. Joan Fontanals Martinez and I have added some helper functions and batching in the Ranker to help speed up the process.
Instead of pointing out the changes, I have made a new template to simplify the workflow and also show those of you who are already familiar with Jina how to implement the evaluation mode.
Clone the project template:
git clone https://github.com/yuanbit/jina-financial-qa-evaluator-template.git
Make sure the requirements are installed and you have downloaded the data and model.
You can find the final code of this tutorial here.
Let us walk through the Evaluation Flow step-by-step.
Step 1. Define our test set data
Our working directory will be jina-financial-qa-evaluator-template/. In the dataset/ folder you should have the following files:
For this tutorial we will need:
- sample_test_set.pickle: a sample test set with 50 questions and ground truth answers
- qid_to_text.pickle: a dictionary to map the question ids to question text
If you want to use the complete test set from FinBERT-QA, test_set.pickle, which has 333 questions and ground truth answers, you can simply change the path.
The test set that we will be working with in this tutorial is a pickle file, sample_test_set.pickle. It is a list of lists in the form [[question id, [ground truth answer ids]]], where each element contains the question id and a list of ground truth answer ids. Here is a slice from the test set:
[[14, [398960]],
[458, [263485, 218858]],
[502, [498631, 549435, 181678]],
[712, [212810, 580479, 527433, 28356, 97582, 129965, 273307]],...]
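If you want to peek at this data yourself, a quick sketch like the following works. The load_pickle function shown here is just a stand-in for the helper of the same name that already ships with the template, and the paths match the dataset/ folder above:

import pickle

def load_pickle(path):
    """Load a pickled Python object from disk."""
    with open(path, 'rb') as f:
        return pickle.load(f)

# Sample test set: a list of [question id, [ground truth answer ids]] pairs
test_set = load_pickle('dataset/sample_test_set.pickle')
print(len(test_set))   # 50 questions in the sample test set
print(test_set[0])     # e.g. [14, [398960]]

# To evaluate on the complete FinBERT-QA test set instead, just change the path:
# test_set = load_pickle('dataset/test_set.pickle')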
Next, similar to how we defined our Document for indexing the answer passages, we will create two Documents: one containing the questions and one containing the ground truth answers.
Recall that in our Index Flow, when we defined our data in the index_generator function, we included the answer passage ids (docids) in the Documents. After indexing, these answer ids are stored in the index, and they matter because they are returned as part of the search results at query time. Thus, we only need to define the Ground truth Document with the ground truth answer ids for each query and compare these ids with the answer ids of the matches.
Let’s add a Python generator under the load_pickle function to define our test set for evaluation. For each Document, we will map the corresponding question id from the test set to the actual question text.
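Here is a minimal sketch of what such a generator could look like. I am assuming the Jina Document API here, reusing the template's load_pickle helper, and storing each ground truth answer id on a match of the Ground truth Document; the template's actual field names and layout may differ slightly:

from jina import Document

def evaluate_generator():
    # Test set format: [[question id, [ground truth answer ids]], ...]
    test_set = load_pickle('dataset/sample_test_set.pickle')
    # Maps question ids to the actual question text
    qid_to_text = load_pickle('dataset/qid_to_text.pickle')

    for qid, answer_ids in test_set:
        # Query Document: carries the question text to be encoded and searched
        query = Document()
        query.text = qid_to_text[qid]

        # Ground truth Document: one match per relevant answer id, so the
        # Evaluator can compare them against the ids of the retrieved matches
        groundtruth = Document()
        for answer_id in answer_ids:
            match = Document()
            match.tags['id'] = answer_id
            groundtruth.matches.append(match)

        yield query, groundtruth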
Similar to the Query Flow, we will pass our two Documents into the Encoder Pod from pods/encode.yml. The Driver will pass the question text to the Encoder, which transforms it into an embedding, and the same Driver will then add the embedding to the Query Document. The only difference this time is that we are passing two Documents into the Encoder Pod; the Ground truth Document is immutable and passes through the Flow unchanged.
In flows/, let’s create a file called evaluate.yml to configure our Evaluation Flow and add the Encoder Pod as follows:
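A sketch of flows/evaluate.yml at this point might look like the snippet below. I am assuming the dict-style Flow YAML used elsewhere in this series; adjust the syntax to whatever your Jina version expects:

!Flow
pods:
  # Encoder Pod: turns the question text into an embedding
  encoder:
    uses: pods/encode.yml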
The output of the Encoder will contain the Query Document with the embeddings of the questions, while the Ground truth Document stays unchanged, as shown in Figure 4.
Next, the Indexer Pod from pods/doc.yml will search for the answers with the most similar embeddings, and the Driver of the Indexer will add a list of top-k answer matches to the Query Document. The Ground truth Document remains unchanged.
Let’s add the doc_indexer to flows/evaluate.yml as follows:
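With the Indexer added, the sketch of flows/evaluate.yml grows to something like this (again, the exact syntax and parameters depend on your Jina version; the pod names mirror the files in pods/):

!Flow
pods:
  # Encoder Pod: turns the question text into an embedding
  encoder:
    uses: pods/encode.yml
  # Indexer Pod: retrieves the top-k answers with the most similar embeddings
  doc_indexer:
    uses: pods/doc.yml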
The output of the Indexer will contain the Query Document with the answer matches and their corresponding information, along with the unchanged Ground truth Document.
Since I mentioned at the beginning that we will evaluate the search results both before and after reranking, you might think that we will now add the following sequence:
- Evaluator for the match results
- Ranker
- Evaluator for the reranked results
However, since evaluation serves to improve the results of our search system, it is not an actual component of the final application. You can think of it as a tool that provides information about which parts of the system need improvement.
To allow us to inspect any part of the pipeline and evaluate at arbitrary places in the Flow, we can use the Jina Flow API's inspect feature to attach the Evaluator Pods to the main pipeline, so that the evaluations do not block messages to the other components of the pipeline.
For example, without the inspect mode, we would have the sequential design mentioned above. With the inspect mode, after retrieving the answer matches from the Indexer, the Documents will be sent to an Evaluator and the Ranker in parallel. Consequently, the Ranker won't have to wait for the initial answer matches to be evaluated before it can output the reranked answer matches!
The benefit of this design in our QA system is that the Evaluator can perform evaluations without blocking the progress of the Flow, because it is independent of the other components of the pipeline. You can think of the Evaluator as a side task running in parallel with the Flow. As a result, we can evaluate with minimal impact on the performance of the Flow.
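To make the topology concrete, here is a rough sketch expressed with the Python Flow API, assuming the Jina version in use exposes Flow.inspect() for attaching inspection pods; the Ranker and Evaluator pod files named here are illustrative placeholders rather than the template's exact filenames:

from jina import Flow

# Illustrative only: the Evaluators hang off the main pipeline as inspect
# pods, so the Ranker receives the matches without waiting for evaluation.
flow = (
    Flow()
    .add(name='encoder', uses='pods/encode.yml')
    .add(name='doc_indexer', uses='pods/doc.yml')
    .inspect(name='evaluate_matching', uses='pods/evaluate_matching.yml')
    .add(name='ranker', uses='pods/rank.yml')
    .inspect(name='evaluate_ranking', uses='pods/evaluate_ranking.yml')
)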
You can refer to this article to learn more about the design of the evaluation mode and the inspect feature.
Let’s take a closer look at the evaluation part of the Flow: