In this tutorial, we're going to implement a rudimentary Semantic Search engine using Haystack. We'll use Elasticsearch and FAISS (Facebook AI Similarity Search) as DocumentStores.
Below are the segments I'll talk about:
- Intro to Semantic Search & Terminologies
- Implementation nitty-gritty
- Environment Setup
- Dataset preparation
- Indexing & Searching
Intro to Semantic Search & Terminologies
In recent times, with advances in NLP (natural language processing) and the availability of vast computing power (GPUs, TPUs, etc.), Semantic Search is making its place in the search industry. Contrary to lexical or syntactic search, semantic/neural search focuses on the intent and semantics of the query. Representing your query/documents as an n-dimensional vector (embedding) using a neural network (trained on your custom data, or pretrained) is the crux of Semantic Search.
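To make that concrete, here's a minimal sketch of embedding two pieces of text and comparing them with cosine similarity, using the sentence-transformers library we install later in this tutorial (the example strings are mine):
from sentence_transformers import SentenceTransformer, util

# Pretrained sentence-embedding model (the same one used later in this tutorial)
model = SentenceTransformer('distilroberta-base-msmarco-v2')

# Embed a query and a document into n-dimensional vectors
query_emb = model.encode('pineapple banana cake')
doc_emb = model.encode('banana bread with pineapple chunks')

# Cosine similarity: the closer to 1.0, the more semantically similar
print(util.pytorch_cos_sim(query_emb, doc_emb))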
- Haystack: Haystack is an open-source framework for building end-to-end question-answering systems for large document collections. You can read more about it here.
- FAISS: FAISS is a library that allows developers to quickly search for embeddings of multimedia documents that are similar to each other. If you want to read in-depth about it, I suggest you read this amazing blog.
Implementation nitty-gritty
Now we'll go into the technical implementation details, so if you're more interested in the coding part, you can skip this section and jump directly into the Colab notebook.
- Environment Setup: I've used a Google Colab notebook (GPU runtime) because creating embeddings is computationally expensive. First, install the required libs:
!pip install git+https://github.com/deepset-ai/haystack.git
# OR
!pip install farm-haystack

!pip install sentence-transformers
The first command installs the Haystack Python library, and the second installs sentence-transformers, which we'll use to create embeddings. Sentence Transformers is very handy, providing various pretrained transformer-based models to embed a sentence or document. To check out these models (use-case wise), click here.
- Dataset preparation: For this setup, I've downloaded the FooDB dataset. FooDB is the world's largest and most comprehensive resource on food constituents, chemistry, and biology, and it is offered to the public as a freely available resource. I've used only Content.json; after a little processing I ended up with a dataframe of food products.
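The preprocessing itself isn't shown here; as a rough sketch (assuming Content.json parses directly with pandas; adjust for the actual format of the dump you download):
import pandas as pd

# Load FooDB's Content.json; use lines=True if it's newline-delimited JSON
df = pd.read_json('Content.json')

# Keep only the columns we'll need for indexing
df = df[['code', 'url', 'product_name']].dropna()
print(df.head())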
We'll use only three columns, i.e. code, url, and product_name, in indexing. Haystack provides a handy method to index a List[Dict], so I've converted the above dataframe to the format below (as mentioned in the Haystack docs):
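A minimal sketch of that conversion, following Haystack's List[Dict] convention ('text' is the field that gets embedded; everything else lives under 'meta'); the variable name food_data_to_upload is reused in the indexing code further down:
# Convert each dataframe row into Haystack's expected dict format
food_data_to_upload = [
    {
        'text': str(row['product_name']),
        'meta': {'code': row['code'], 'url': row['url']},
    }
    for _, row in df.iterrows()
]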
- Indexing & Searching: Haystack provides three building blocks for indexing and searching:
a. DocumentStore: The database in which you want to store your data. Haystack supports different kinds of databases like Elasticsearch, FAISS, InMemory, SQL, etc. For now, we'll load the data into both Elasticsearch and FAISS, and compare the two later.
# FAISS Indexing
from haystack.document_store.faiss import FAISSDocumentStore

document_store_faiss = FAISSDocumentStore(faiss_index_factory_str="Flat",
                                          return_embedding=True)

# ES Indexing
from haystack.document_store.elasticsearch import ElasticsearchDocumentStore

document_store_es = ElasticsearchDocumentStore(host="localhost",
                                               index="food_haystack_embedding",
                                               similarity='cosine')
Also, for FAISS indexing, only the 'dot_product' similarity metric is available right now, whereas ES supports 'cosine' similarity. The FAISS index takes various args for optimization that you can play around with; in my case, I'm sticking with 'Flat' indexing because my dataset isn't that large.
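For example (a sketch, not something this tutorial uses), an approximate HNSW index can be requested through the same factory-string argument, which is passed straight to FAISS:
# 'HNSW64' builds an approximate HNSW index: faster search on large
# collections, at some cost in exactness compared to 'Flat'
document_store_hnsw = FAISSDocumentStore(faiss_index_factory_str="HNSW64",
                                         return_embedding=True)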
b. Retriever: A filter for extracting potential candidates for the query. Currently, Haystack supports BM25, TF-IDF, Embedding, and DPR (Dense Passage Retrieval) retrievers. We're using the Embedding Retriever. For creating embeddings, we're using the distilroberta-base-msmarco-v2 model (pretrained on the Microsoft MS MARCO dataset).
from haystack.retriever.dense import EmbeddingRetriever

# FAISS Retriever initialization
retriever_faiss = EmbeddingRetriever(document_store_faiss,
                                     embedding_model='distilroberta-base-msmarco-v2',
                                     model_format='sentence_transformers')

# ES Retriever initialization
retriever_es = EmbeddingRetriever(document_store_es,
                                  embedding_model='distilroberta-base-msmarco-v2',
                                  model_format='sentence_transformers')

# Running the indexing process
# Delete existing documents in the document store
document_store_faiss.delete_all_documents()

# Write documents to the document store
document_store_faiss.write_documents(food_data_to_upload)

# Add document embeddings to the index
document_store_faiss.update_embeddings(retriever=retriever_faiss)
So first, write_documents() indexes all the data. Then update_embeddings() creates an embedding for each doc (doc['text']) and stores it in the corresponding index (in place); to create the embeddings, it uses the model you specified at retriever initialization, i.e. distilroberta-base-msmarco-v2 here. Haystack also facilitates batch processing for bulk indexing, as sketched below.
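A rough sketch of batched writes (the batch size is arbitrary), mirroring the FAISS snippet above for the ES store:
# Write documents in chunks rather than in one giant call
batch_size = 10000
for i in range(0, len(food_data_to_upload), batch_size):
    document_store_es.write_documents(food_data_to_upload[i:i + batch_size])

# Then compute and store embeddings in the ES index
document_store_es.update_embeddings(retriever=retriever_es)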
Time taken in indexing (#docs: 338,487):
FAISS:         386.73 sec
Elasticsearch: 2329.32 sec
Note: As you can see, FAISS indexing is approximately 6x faster than ES.
c. Reader: Though we're not using this component in our task, it is a core component of the QA systems Haystack provides. It takes the output of the Retriever (potential candidates) and tries to give you the best match for your query, harnessing the power of transformer-based language models to pick the best candidate.
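For completeness, this is roughly how a Reader would plug in on top of our retriever (a sketch only; class paths and signatures vary across Haystack versions, and the reader model here is just an example):
from haystack.reader.farm import FARMReader
from haystack.pipeline import ExtractiveQAPipeline

# A transformer-based reader that extracts answer spans from retrieved docs
reader = FARMReader(model_name_or_path='deepset/roberta-base-squad2')

pipeline = ExtractiveQAPipeline(reader=reader, retriever=retriever_faiss)
prediction = pipeline.run(query='Which products contain pineapple?',
                          top_k_retriever=10, top_k_reader=3)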
d. Results: Let’s dive in to see how our neural search is performing 😋
q = 'pineapple banana cake'

print('-' * 100)
print('FAISS')
print(get_search_result(q, retriever_faiss))
print('-' * 100)
print('ElasticSearch Dense')
print(get_search_result(q, retriever_es))
print('-' * 100)
# get_search_result returns (text, url) tuples
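get_search_result() isn't defined above; a minimal sketch of such a helper, assuming the retriever's retrieve() method and the url field we stored in each document's meta:
def get_search_result(query, retriever, top_k=5):
    # Embed the query and fetch the nearest documents from the store
    candidates = retriever.retrieve(query=query, top_k=top_k)
    # Return (text, url) tuples for display
    return [(doc.text, doc.meta.get('url')) for doc in candidates]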