Improving Pgvector Keyword Search
Improving PostgreSQL Keyword Search to Avoid Empty Results
Notebook by Mayank Laddha
As noted in the Haystack documentation for PgvectorKeywordRetriever, this component, unlike others such as ElasticsearchBM25Retriever, doesn’t apply fuzzy search by default. As a result, queries need to be crafted carefully to avoid returning empty results.
In this notebook, you’ll extend PgvectorDocumentStore to make it more forgiving and flexible. You’ll learn how to subclass it to use PostgreSQL’s websearch_to_tsquery and how to leverage NLTK to extract keywords and transform user queries.
Haystack’s modular design makes it easy to tweak or enhance components when results don’t meet expectations, and this notebook shows exactly how to do that.
Setting up the Development Environment
Install required dependencies and set up PostgreSQL
Set an environment variable PG_CONN_STR with the connection string to your PostgreSQL database. This is needed for Haystack.
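A minimal sketch of setting the variable from Python before the document store is created. The user, password, host, port, and database name below are placeholder assumptions for a local instance; replace them with your own values.

```python
import os

# Placeholder credentials for a local PostgreSQL instance (assumptions, not
# values taken from this notebook); replace them with your own.
os.environ["PG_CONN_STR"] = "postgresql://postgres:postgres@localhost:5432/postgres"
```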
Subclassing PgvectorDocumentStore to Enable Websearch-Style Queries
Why not plainto_tsquery? Why websearch_to_tsquery?
plainto_tsquery transforms the unformatted text querytext to a tsquery value. The text is parsed and normalized much as for to_tsvector, then the & (AND) tsquery operator is inserted between surviving words. As a result, every keyword in the query must be present in a document for it to match.
websearch_to_tsquery creates a tsquery value from querytext using an alternative syntax in which simple unformatted text is a valid query. Unlike plainto_tsquery and phraseto_tsquery, it also recognizes certain operators. Moreover, this function will never raise syntax errors, which makes it possible to use raw user-supplied input for search. The following syntax is supported:
unquoted text: text not inside quote marks will be converted to terms separated by & operators, as if processed by plainto_tsquery.
"quoted text": text inside quote marks will be converted to terms separated by <-> operators, as if processed by phraseto_tsquery.
OR: the word “or” will be converted to the | operator.
-: a dash will be converted to the ! operator.
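To make the difference concrete, here is a rough sketch of how either function could slot into a keyword search statement. The helper name build_keyword_sql, the table name haystack_documents, the content column, and the 'english' configuration are assumptions for illustration; this is not necessarily the SQL that PgvectorDocumentStore runs internally. Only the PostgreSQL functions themselves (to_tsvector, ts_rank_cd, plainto_tsquery, websearch_to_tsquery) are taken as given.

```python
ALLOWED_TSQUERY_FUNCTIONS = {"plainto_tsquery", "websearch_to_tsquery"}


def build_keyword_sql(
    tsquery_function: str = "websearch_to_tsquery",
    table: str = "haystack_documents",  # assumed table name
    language: str = "english",          # assumed text search configuration
) -> str:
    """Return a parameterized keyword search statement (hypothetical helper)."""
    if tsquery_function not in ALLOWED_TSQUERY_FUNCTIONS:
        raise ValueError(f"unsupported tsquery function: {tsquery_function}")
    # table and language are trusted constants here, never user input.
    # The two %s placeholders are filled by the database driver with the
    # user's query string and the maximum number of rows to return.
    return (
        f"SELECT id, content, "
        f"ts_rank_cd(to_tsvector('{language}', content), q) AS score "
        f"FROM {table}, {tsquery_function}('{language}', %s) AS q "
        f"WHERE to_tsvector('{language}', content) @@ q "
        f"ORDER BY score DESC LIMIT %s"
    )


print(build_keyword_sql("plainto_tsquery"))
print(build_keyword_sql("websearch_to_tsquery"))
```

The two statements differ only in the function that parses the query string. With websearch_to_tsquery, a string such as Jean OR Paris is read as an OR search, whereas plainto_tsquery requires all surviving terms to match.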
Detect Keywords with NLTK
Detecting keywords ensures that only the relevant words are used in the query. So even if you decide to keep the default implementation with plainto_tsquery, which uses the AND operator, you still have a better chance of avoiding empty results.
Download required packages
Simple keyword detector
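The idea can be sketched without any downloads. The version below is a stand-in that uses only the standard library: a small hand-written stop-word set replaces NLTK's resources, and the function name to_or_query is hypothetical. It keeps the tokens that are not stop words and joins them with OR, the form that websearch_to_tsquery understands.

```python
import re

# Tiny hand-written stop-word set (an assumption for this sketch); NLTK's
# stop-word corpus or part-of-speech tagging would be more thorough.
STOP_WORDS = {
    "a", "an", "and", "are", "do", "does", "in", "is", "of", "or",
    "the", "to", "what", "where", "who", "with",
}


def to_or_query(text: str) -> str:
    """Keep non-stop-word tokens, drop duplicates, and join them with OR."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    keywords = []
    seen = set()
    for token in tokens:
        key = token.lower()
        if key in STOP_WORDS or key in seen:
            continue
        seen.add(key)
        keywords.append(token)  # keep the original casing for readability
    return " OR ".join(keywords)


print(to_or_query("Who is Jean and where is Paris?"))  # Jean OR Paris
```

Because the terms are joined with OR, a document that mentions only some of the keywords can still be returned.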
Test the Improved Implementation
transformed query Jean OR Mayank OR Alex OR Paris
result {'documents': [Document(id=1, content: 'My name is Jean and I live in Paris.', score: 0.2)]}
Compare with the Default Keyword Search
result {'documents': []}
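As you can see, the results are empty: the default implementation joins the terms with AND, so a document must contain all of them to match.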