NER Powered Semantic Search With LanceDB
This notebook shows how to use Named Entity Recognition (NER) for vector search with LanceDB. We will:
- Extract named entities from text.
- Store them in LanceDB as metadata, alongside the corresponding text vectors.
- Extract named entities from incoming queries and use them to filter the search to records containing those entities.
This is particularly helpful when you want to restrict the search scope to records that mention the named entities found in the query.
Let's get started.
Installing Dependencies
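The notebook installs its dependencies with pip. The exact package list below is an assumption, reconstructed from the libraries used later in the notebook (LanceDB, Hugging Face `datasets` and `transformers`, and `sentence-transformers`):

```shell
pip install -qU lancedb datasets transformers sentence-transformers
```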
Load and Prepare Datasets
We use a dataset containing ~190K articles scraped from Medium, and select 50K of them since indexing the full set can take a while. The dataset can be loaded from the Hugging Face dataset hub as follows:
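A minimal sketch of this step. The dataset id and file name in the commented-out download are assumptions inferred from the column names that appear later (`title`, `text`, `url`, `authors`, `timestamp`, `tags`); a tiny stand-in frame keeps the sketch self-contained:

```python
import pandas as pd

# The notebook pulls the data from the Hugging Face Hub, roughly:
#   from datasets import load_dataset
#   df = load_dataset("fabiochiu/medium-articles",
#                     data_files="medium_articles.csv",
#                     split="train").to_pandas()
# (dataset id and file name are assumptions, not confirmed by this notebook)

# Stand-in frame with the same shape as the real data:
df = pd.DataFrame({
    "title": ["How the Data Stole Christmas"],
    "text": ["by Anonymous. The door sprung open and our three little ones..."],
    "url": ["https://medium.com/data-ops/how-the-data-stole-christmas-78454531d0a8"],
})

# Drop rows with missing fields and cap at 50K articles to keep indexing fast.
df = df.dropna().head(50_000)
```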
Preprocessing on dataset
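The retriever later embeds the article title plus the first 1000 characters of the body, stored in a `title_text` column (referenced further down). A sketch of that preprocessing, assuming a DataFrame with `title` and `text` columns:

```python
import pandas as pd

def add_title_text(df: pd.DataFrame, max_chars: int = 1000) -> pd.DataFrame:
    # Concatenate the title with the first `max_chars` characters of the body;
    # this combined field is what the retriever will embed.
    df = df.copy()
    df["title_text"] = df["title"] + ". " + df["text"].str.slice(0, max_chars)
    return df

df = add_title_text(pd.DataFrame({
    "title": ["When Will the Smoke Clear From Bay Area Skies?"],
    "text": ["Bay Area cities are contending with some of the worst air quality..."],
}))
```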
Initialize NER model
To extract named entities, we will use a NER model fine-tuned from a BERT-base checkpoint. The model can be loaded from the Hugging Face model hub as follows:
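A sketch of the loading step. The checkpoint `dslim/bert-base-NER` is taken from the warning in the original notebook output; the actual model construction is left commented out here to avoid the ~433 MB download:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

MODEL_ID = "dslim/bert-base-NER"  # BERT-base fine-tuned for NER

def load_ner_pipeline(device: int = -1):
    # aggregation_strategy="simple" merges word pieces back into whole
    # entities, producing dicts with 'entity_group', 'score', 'word',
    # 'start', and 'end' keys.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForTokenClassification.from_pretrained(MODEL_ID)
    return pipeline("ner", model=model, tokenizer=tokenizer,
                    aggregation_strategy="simple", device=device)

# nlp = load_ner_pipeline()
# nlp("<some text mentioning London>")
```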
[{'entity_group': 'LOC',
  'score': 0.99969244,
  'word': 'London',
  'start': 37,
  'end': 43}]
Our NER pipeline is working as expected and accurately extracting entities from the text.
Initialize Retriever
A retriever model is used to embed passages (article title + first 1000 characters) and queries. It creates embeddings such that queries and passages with similar meanings are close in the vector space. We will use a sentence-transformer model as our retriever. The model can be loaded as follows:
SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: MPNetModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)
Initialize LanceDB
Generate Embeddings and Insert
We generate embeddings for the title_text column we created earlier. Alongside the embeddings, we also include the named entities in the index as metadata. Later we will apply a filter based on these named entities when executing queries.
Let's first write a helper function to extract named entities from a batch of text.
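A sketch of such a helper, assuming the NER pipeline from earlier. It truncates each document (the model has a limited input length), runs the pipeline on the whole batch, and deduplicates the entity strings; the window size and the `##` word-piece filter are implementation choices, not taken from the notebook:

```python
def extract_named_entities(text_batch, ner_pipeline, window: int = 512):
    # Run NER on a truncated window of each document, batched in one call.
    extracted = ner_pipeline([text[:window] for text in text_batch])
    # Deduplicate entity strings and drop stray word-piece fragments.
    return [
        sorted({ent["word"] for ent in doc if not ent["word"].startswith("##")})
        for doc in extracted
    ]
```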
pyarrow.Table
vector: fixed_size_list<item: float>[768]
metadata: struct<authors: string, named_entities: list<item: string>, tags: string, text: string, timestamp: string, title: string, url: string>
named_entities: list<item: string>
(The printed sample rows of vectors and metadata are truncated here for brevity.)
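The embed-and-insert loop can be sketched as follows. The function and parameter names (`index_articles`, the injected `embed` / `extract_entities` callables, the batch size) are hypothetical; the record layout mirrors the schema printed above, with the entity list stored both at the top level for filtering and inside the metadata struct:

```python
import pandas as pd

def index_articles(df, db, embed, extract_entities,
                   batch_size=64, table_name="articles"):
    """Embed `title_text` in batches and append records to a LanceDB table."""
    table = None
    for start in range(0, len(df), batch_size):
        batch = df.iloc[start:start + batch_size]
        texts = batch["title_text"].tolist()
        vectors = embed(texts)               # one 768-dim vector per passage
        entities = extract_entities(texts)   # one list of entities per passage
        records = [
            {
                "vector": list(vec),
                "named_entities": ents,
                "metadata": {**dict(row), "named_entities": ents},
            }
            for vec, ents, (_, row) in zip(vectors, entities, batch.iterrows())
        ]
        if table is None:
            # First batch creates the table and fixes the schema.
            table = db.create_table(table_name, data=records, mode="overwrite")
        else:
            table.add(records)
    return table
```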
Querying
Now let's try querying.
{'Extracted Named Entities': ['Data'],
'Result': ['Data Science is all about making the right choices']}
{'Extracted Named Entities': ['SpaceX', 'Mars'],
'Result': ['Mars Habitat: NASA 3D Printed Habitat Challenge',
'Reusable rockets and the robots at sea: The SpaceX story',
'Reusable rockets and the robots at sea: The SpaceX story',
'Colonising Planets Beyond Mars',
'Colonising Planets Beyond Mars',
'Musk Explained: The Musk approach to marketing',
'How We’ll Access the Water on Mars',
'Chasing Immortality',
'Mission Possible: How Space Exploration Can Deliver Sustainable '
'Development']}
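The query flow above can be sketched as follows. The helper name and the `array_has` filter predicate are assumptions (the exact SQL expression the notebook uses on the `named_entities` list column may differ); the injected `embed` / `extract_entities` callables stand in for the retriever and NER pipeline:

```python
def search(query, table, embed, extract_entities, limit=10):
    # Embed the query and extract its named entities, then restrict the
    # vector search to records whose entities overlap the query's.
    xq = embed([query])[0]
    entities = extract_entities([query])[0]
    q = table.search(xq).limit(limit)
    if entities:
        # SQL-style filter on the named_entities list column (assumed dialect).
        predicate = " AND ".join(
            f"array_has(named_entities, '{e}')" for e in entities
        )
        q = q.where(predicate)
    return {
        "Extracted Named Entities": entities,
        "Result": [r["metadata"]["title"] for r in q.to_list()],
    }
```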
These all look like great results, making the most of LanceDB's advanced vector search capabilities.