Introducing pyvespa simplified API. Build Vespa application from python with few lines of code.
This post will introduce you to the simplified pyvespa
API that allows us to build a basic text search application from scratch with just a few code lines from python. Follow-up posts will add layers of complexity by incrementally building on top of the basic app described here.
pyvespa
exposes a subset of Vespa API in python. The library’s primary goal is to allow for faster prototyping and facilitate Machine Learning experiments for Vespa applications. I have written about how we can use it to connect and interact with running Vespa applications and evaluate Vespa ranking functions from python. This time, we focus on building and deploying applications from scratch.
Install
The pyvespa simplified API introduced here was released on version 0.2.0
pip3 install pyvespa>=0.2.0
Define the application
As an example, we will build an application to search through CORD19 sample data.
Create an application package
The first step is to create a Vespa ApplicationPackage:
from vespa.package import ApplicationPackageapp_package = ApplicationPackage(name="cord19")
Add fields to the Schema
We can then add fields to the application’s Schema created by default in app_package
.
from vespa.package import Fieldapp_package.schema.add_fields(
Field(
name = "cord_uid",
type = "string",
indexing = ["attribute", "summary"]
),
Field(
name = "title",
type = "string",
indexing = ["index", "summary"],
index = "enable-bm25"
),
Field(
name = "abstract",
type = "string",
indexing = ["index", "summary"],
index = "enable-bm25"
)
)
cord_uid
will store the cord19 document ids, whiletitle
andabstract
are self-explanatory.- All the fields, in this case, are of type
string
. - Including
"index"
in theindexing
list means that Vespa will create a searchable index fortitle
andabstract
. You can read more about which options is available forindexing
in the Vespa documentation. - Setting
index = "enable-bm25"
makes Vespa pre-compute quantities to make it fast to compute the bm25 score. We will use BM25 to rank the documents retrieved.
Search multiple fields when querying
A Fieldset groups fields together for searching. For example, the default
fieldset defined below groups title
and abstract
together.
from vespa.package import FieldSetapp_package.schema.add_field_set(
FieldSet(name = "default", fields = ["title", "abstract"])
)
Define how to rank the documents matched
We can specify how to rank the matched documents by defining a RankProfile. In this case, we defined the bm25
rank profile that combines that BM25 scores computed over the title
and abstract
fields.
from vespa.package import RankProfileapp_package.schema.add_rank_profile(
RankProfile(
name = "bm25",
first_phase = "bm25(title) + bm25(abstract)"
)
)
Deploy your application
We have now defined a basic text search app containing relevant fields, a fieldset to group fields together, and a rank profile to rank matched documents. It is time to deploy our application. We can locally deploy our app_package
using Docker without leaving the notebook, by creating an instance of VespaDocker, as shown below:
from vespa.package import VespaDockervespa_docker = VespaDocker(port=8080)
app = vespa_docker.deploy(
Waiting for configuration server.
application_package = app_package,
disk_folder="/Users/username/cord19_app"
)
Waiting for configuration server.
Waiting for configuration server.
Waiting for configuration server.
Waiting for configuration server.
Waiting for configuration server.
Waiting for application status.
Finished deployment.
app
now holds a Vespa instance, which we are going to use to interact with our application. Congratulations, you now have a Vespa application up and running.
It is important to know that pyvespa
provides a convenient API to define Vespa application packages from python. vespa_docker.deploy
export Vespa configuration files to the disk_folder
defined above. Going through those files is an excellent way to start learning about Vespa syntax.
Feed some data
Our first action after deploying a Vespa application is usually to feed some data to it. To make it easier to follow, we have prepared a DataFrame
containing 100 rows and the cord_uid
, title
, and abstract
columns required by our schema definition.
from pandas import read_csvparsed_feed = read_csv(
parsed_feed
"https://thigm85.github.io/data/cord19/parsed_feed_100.csv"
)
We can then iterate through the DataFrame
above and feed each row by using the app.feed_data_point method:
- The schema name is by default set to be equal to the application name, which is
cord19
in this case. - When feeding data to Vespa, we must have a unique id for each data point. We will use
cord_uid
here.
for idx, row in parsed_feed.iterrows():
fields = {
"cord_uid": str(row["cord_uid"]),
"title": str(row["title"]),
"abstract": str(row["abstract"])
}
response = app.feed_data_point(
schema = "cord19",
data_id = str(row["cord_uid"]),
fields = fields,
)
You can also inspect the response to each request if desired.
response.json(){'pathId': '/document/v1/cord19/cord19/docid/qbldmef1',
'id': 'id:cord19:cord19::qbldmef1'}
Query your application
With data fed, we can start to query our text search app. We can use the Vespa Query language directly by sending the required parameters to the body argument of the app.query method.
query = {
'yql': 'select * from sources * where userQuery();',
'query': 'What is the role of endothelin-1',
'ranking': 'bm25',
'type': 'any',
'presentation.timing': True,
'hits': 3
}res = app.query(body=query)
res.hits[0]{'id': 'id:cord19:cord19::2b73a28n',
'relevance': 20.79338929607865,
'source': 'cord19_content',
'fields': {'sddocname': 'cord19',
'documentid': 'id:cord19:cord19::2b73a28n',
'cord_uid': '2b73a28n',
'title': 'Role of endothelin-1 in lung disease',
'abstract': 'Endothelin-1 (ET-1) is a 21 amino acid peptide with diverse biological activity that has been implicated in numerous diseases. ET-1 is a potent mitogen regulator of smooth muscle tone, and inflammatory mediator that may play a key role in diseases of the airways, pulmonary circulation, and inflammatory lung diseases, both acute and chronic. This review will focus on the biology of ET-1 and its role in lung disease.'}}
We can also define the same query by using the QueryModel abstraction that allows us to specify how we want to match and rank our documents. In this case, we defined that we want to:
- match our documents using the
OR
operator, which matches all the documents that share at least one term with the query. - rank the matched documents using the
bm25
rank profile defined in our application package.
from vespa.query import QueryModel, RankProfile as Ranking, ORres = app.query(
query="What is the role of endothelin-1",
query_model=QueryModel(
match_phase = OR(),
rank_profile = Ranking(name="bm25")
))
{'id': 'id:cord19:cord19::2b73a28n',
res.hits[0]
'relevance': 20.79338929607865,
'source': 'cord19_content',
'fields': {'sddocname': 'cord19',
'documentid': 'id:cord19:cord19::2b73a28n',
'cord_uid': '2b73a28n',
'title': 'Role of endothelin-1 in lung disease',
'abstract': 'Endothelin-1 (ET-1) is a 21 amino acid peptide with diverse biological activity that has been implicated in numerous diseases. ET-1 is a potent mitogen regulator of smooth muscle tone, and inflammatory mediator that may play a key role in diseases of the airways, pulmonary circulation, and inflammatory lung diseases, both acute and chronic. This review will focus on the biology of ET-1 and its role in lung disease.'}}
Using the Vespa Query Language as our first example gives you the full power and flexibility that Vespa can offer. In contrast, the QueryModel abstraction focuses on specific use cases and can be more useful for ML experiments, but this is a future post topic.