Late Chunking Berlin

Late Chunking with Weaviate

Notebook author: Danny Williams @ weaviate (Developer Growth)

This notebook implements late chunking with Weaviate. Late chunking modifies the classical chunking workflow: the full document is passed through the embedding model first, and chunking is applied afterwards to the resulting token embeddings. Because every token embedding is computed with the whole document in view, each chunk retains contextual information from the chunks around it.

Setup

First, we install all required packages.

[1]

Then we load the packages and connect to the Weaviate client. Important: you need the following values saved in a .env file:

  • your Weaviate REST endpoint saved as WEAVIATE_URL
  • your Weaviate API key saved as WEAVIATE_KEY
  • if you want to run the final comparison in this notebook, an OpenAI API key saved as OPENAI_API_KEY; otherwise, delete the headers argument in the weaviate.connect_to_weaviate_cloud call.
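As a rough sketch of how these three values might be read and passed on, the snippet below assumes the variables are already present in the process environment (for example, after a .env loader has run). The helper names load_settings and connect are hypothetical, and the X-OpenAI-Api-Key header name and the Auth.api_key call are assumptions about the v4 Python client rather than something shown in this notebook.

```python
import os


def load_settings(env=os.environ):
    """Collect connection settings from environment-style variables.

    WEAVIATE_URL and WEAVIATE_KEY are required; OPENAI_API_KEY is optional
    and only needed for the final comparison.
    """
    missing = [key for key in ("WEAVIATE_URL", "WEAVIATE_KEY") if not env.get(key)]
    if missing:
        raise KeyError("Missing required settings: " + ", ".join(missing))

    settings = {
        "cluster_url": env["WEAVIATE_URL"],
        "api_key": env["WEAVIATE_KEY"],
        "headers": None,  # no headers argument when no OpenAI key is given
    }
    if env.get("OPENAI_API_KEY"):
        # Assumed header name for passing an OpenAI key through Weaviate.
        settings["headers"] = {"X-OpenAI-Api-Key": env["OPENAI_API_KEY"]}
    return settings


def connect(settings):
    """Open a client connection (hypothetical helper; needs weaviate-client v4)."""
    import weaviate  # imported lazily so load_settings works without the package
    from weaviate.classes.init import Auth

    return weaviate.connect_to_weaviate_cloud(
        cluster_url=settings["cluster_url"],
        auth_credentials=Auth.api_key(settings["api_key"]),
        headers=settings["headers"],
    )
```

Calling connect(load_settings()) would then return a client; remember to close it when the notebook is finished.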
[2]

Finally, for reproducibility, these are the package versions used:

[3]
Weaviate version 4.7.1
Pytorch version 2.4.1
Numpy version 1.26.4
Spacy version 3.7.6
Transformers version 4.44.2

Functions

Below are some general functions for chunking text into sentences, as well as the bulk of the operations behind late chunking.

Late chunking uses the same chunk boundaries as naive chunking. The difference is in how each chunk is embedded: its vector is obtained by pooling the document-level token embeddings that fall inside the chunk, rather than by embedding the chunk text independently.
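To make the pooling step concrete, here is a minimal sketch that mean-pools token embeddings over half-open token spans. It uses plain Python lists and toy 2-dimensional vectors purely for illustration; the function name late_chunk_pool is hypothetical, and the notebook's own functions operate on model outputs rather than hand-written lists.

```python
def late_chunk_pool(token_embeddings, token_spans):
    """Mean-pool token embeddings over [start, end) token spans.

    token_embeddings: one vector per token, computed from the FULL document.
    token_spans: one (start, end) pair of token indices per chunk.
    Returns one pooled vector per chunk.
    """
    pooled = []
    for start, end in token_spans:
        window = token_embeddings[start:end]
        if not window:
            raise ValueError(f"Span ({start}, {end}) contains no tokens")
        dim = len(window[0])
        # Average each dimension across the tokens in this chunk.
        pooled.append([sum(tok[d] for tok in window) / len(window) for d in range(dim)])
    return pooled


# Toy example: four token embeddings split into two chunks of two tokens each.
token_embeddings = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0]]
token_spans = [(0, 2), (2, 4)]
chunk_vectors = late_chunk_pool(token_embeddings, token_spans)
print(chunk_vectors)  # [[2.0, 1.0], [1.0, 3.0]]
```

In naive chunking, the two chunks would instead be embedded separately, so neither vector would be influenced by tokens outside its own span.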

[4]

Import into Weaviate

We aim to perform late chunking, obtain the contextually-aware embeddings, and then import these into a Weaviate collection.

First, create a Weaviate collection called test_late_chunking.

[5]

Now let's use a test document: the Wikipedia page for Berlin (saved in a separate text file). We will later query this text using both late chunking and naive chunking.

[6]
First 50 characters of the document:
Berlin[a] is the capital and largest city of Germany, both by area and by population.[11] Its more than 3.85 million inhabitants[12] make it the Europ...

Now, load the jina-embeddings-v2-base-en model from Hugging Face. Other embedding models can be used, but Jina's model accepts documents of up to 8192 tokens, which is important for late chunking because we want to encode large documents in one pass and separate them into chunks afterwards.

[7]

We now call the functions we defined earlier. First, chunk the text as normal to obtain the start and end points of each chunk. Then embed the full document. Finally, perform the late chunking step: for each chunk, take the average of all token embeddings that fall between its start and end points. These averages serve as our chunk embeddings.
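The link between chunk boundaries and token embeddings can be sketched as follows. This is an illustration only: it uses a toy regex sentence splitter and whitespace "tokens", whereas the notebook relies on its own chunking functions and the model's tokenizer. All function names here are hypothetical. The (0, 0) offsets stand in for special tokens, on the assumption that the tokenizer reports zero-width offsets for them.

```python
import re


def sentence_char_spans(text):
    """Toy sentence splitter returning (start, end) character spans."""
    return [m.span() for m in re.finditer(r"[^.!?]+[.!?]*", text)]


def whitespace_token_offsets(text):
    """Toy tokenizer: one token per whitespace-separated word, with char offsets."""
    return [m.span() for m in re.finditer(r"\S+", text)]


def char_spans_to_token_spans(chunk_spans, token_offsets):
    """Map character spans of chunks to half-open [start, end) token index spans."""
    token_spans = []
    for c_start, c_end in chunk_spans:
        inside = [
            i
            for i, (t_start, t_end) in enumerate(token_offsets)
            # Skip zero-width offsets (assumed to mark special tokens).
            if t_end > t_start and t_start >= c_start and t_end <= c_end
        ]
        if not inside:
            raise ValueError(f"No tokens found in chunk span ({c_start}, {c_end})")
        token_spans.append((inside[0], inside[-1] + 1))
    return token_spans


text = "Berlin is big. It is old."
chunk_spans = sentence_char_spans(text)
# Simulate a leading and trailing special token around the word tokens.
token_offsets = [(0, 0)] + whitespace_token_offsets(text) + [(0, 0)]
token_spans = char_spans_to_token_spans(chunk_spans, token_offsets)
print(chunk_spans)  # [(0, 14), (14, 25)]
print(token_spans)  # [(1, 4), (4, 7)]
```

The resulting token spans are what the averaging step consumes: each pair selects the rows of the document-level token embedding matrix that belong to one chunk.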

[8]

Finally, we can add this to our Weaviate collection by supplying our own vector embedding for each chunk.
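A minimal sketch of such an import is shown below, assuming a v4 Python client collection object that exposes batch.dynamic() and add_object(properties=..., vector=...). The helper name import_chunks and the property name "content" are hypothetical and may differ from what the notebook's cell uses.

```python
def import_chunks(collection, chunks, vectors):
    """Add each chunk to the collection together with its precomputed vector.

    collection: assumed to be a Weaviate v4 collection object.
    chunks: list of chunk texts.
    vectors: list of embeddings (lists of floats), aligned with chunks.
    Returns the number of objects submitted.
    """
    if len(chunks) != len(vectors):
        raise ValueError("chunks and vectors must have the same length")

    with collection.batch.dynamic() as batch:
        for text, vector in zip(chunks, vectors):
            # "content" is an assumed property name for the chunk text.
            batch.add_object(properties={"content": text}, vector=vector)
    return len(chunks)
```

Because the vectors are supplied explicitly, Weaviate stores them as given and does not need to call a vectorizer for these objects.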

[9]

Example Query

First, define two functions to process queries: one that searches our Weaviate collection, and a second, slower one that computes cosine similarity locally, which we will use for comparison.
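The local comparison search can be sketched as an exhaustive cosine-similarity ranking. The version below is written in plain Python with toy vectors; the names cosine_similarity and local_search are hypothetical, and the notebook's own function may be implemented differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def local_search(query_vector, chunks, vectors, k=5):
    """Return the k chunks whose vectors are most similar to the query vector."""
    scored = sorted(
        zip(chunks, vectors),
        key=lambda pair: cosine_similarity(query_vector, pair[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in scored[:k]]


# Toy example with three labelled chunks and 2-dimensional vectors.
chunks = ["religion", "transport", "economy"]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(local_search([1.0, 0.1], chunks, vectors, k=2))  # ['religion', 'economy']
```

This compares the query against every chunk, so it is exact but scales linearly with the number of chunks. On the Weaviate side, the equivalent query would go through the collection's vector search (in the v4 client, presumably a near-vector query with the same query embedding and limit).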

[10]

Test both search functions.

[11]
['The Independent Evangelical Lutheran Church has eight parishes of different sizes in Berlin.[131] There are 36 Baptist congregations (within Union of Evangelical Free Church Congregations in Germany), 29 New Apostolic Churches, 15 United Methodist churches, eight Free Evangelical Congregations, four Churches of Christ, Scientist (1st, 2nd, 3rd, and 11th), six congregations of the Church of Jesus Christ of Latter-day Saints, an Old Catholic church, and an Anglican church in Berlin.',
, 'Each borough has several subdistricts or neighborhoods (Ortsteile), which have roots in much older municipalities that predate the formation of Greater Berlin on 1 October 1920.',
, 'The Senate consists of the Governing Mayor of Berlin (Regierender Bürgermeister), and up to ten senators holding ministerial positions, two of them holding the title of "Mayor" (Bürgermeister) as deputy to the Governing Mayor.[134]\n\n\nCharlottenburg Town Hall\n\nRathaus Spandau\nThe total annual budget of Berlin in 2015 exceeded €24.5 ($30.0) billion including a budget surplus of €205 ($240) million.[135] The German Federal city state of Berlin owns extensive assets, including administrative and government buildings, real estate companies, as well as stakes in the Olympic Stadium, swimming pools, housing companies, and numerous public enterprises and subsidiary companies.[136][137] The federal state of Berlin runs a real estate portal to advertise commercial spaces or land suitable for redevelopment.[138]\n\nThe Social Democratic Party (SPD) and The Left (Die Linke) took control of the city government after the 2001 state election and won another term in the 2006 state election.[139] From the 2016 state election until the 2023 state election, there was a coalition between the Social Democratic Party, the Greens and the Left Party.',
, 'Since April 2023, the government has been formed by a coalition between the Christian Democrats and the Social Democrats.[140]\n\nThe Governing Mayor is simultaneously Lord Mayor of the City of Berlin (Oberbürgermeister der Stadt) and Minister President of the State of Berlin (Ministerpräsident des Bundeslandes).',
, '24] About 2.7% of the population identify with other Christian denominations (mostly Eastern Orthodox, but also various Protestants).[125] According to the Berlin residents register, in 2018 14.9 percent were members of the Evangelical Church, and 8.5 percent were members of the Catholic Church.[103] The government keeps a register of members of these churches for tax purposes, because it collects church tax on behalf of the churches.']
[12]
['The Independent Evangelical Lutheran Church has eight parishes of different sizes in Berlin.[131] There are 36 Baptist congregations (within Union of Evangelical Free Church Congregations in Germany), 29 New Apostolic Churches, 15 United Methodist churches, eight Free Evangelical Congregations, four Churches of Christ, Scientist (1st, 2nd, 3rd, and 11th), six congregations of the Church of Jesus Christ of Latter-day Saints, an Old Catholic church, and an Anglican church in Berlin.',
, 'Each borough has several subdistricts or neighborhoods (Ortsteile), which have roots in much older municipalities that predate the formation of Greater Berlin on 1 October 1920.',
, 'The Senate consists of the Governing Mayor of Berlin (Regierender Bürgermeister), and up to ten senators holding ministerial positions, two of them holding the title of "Mayor" (Bürgermeister) as deputy to the Governing Mayor.[134]\n\n\nCharlottenburg Town Hall\n\nRathaus Spandau\nThe total annual budget of Berlin in 2015 exceeded €24.5 ($30.0) billion including a budget surplus of €205 ($240) million.[135] The German Federal city state of Berlin owns extensive assets, including administrative and government buildings, real estate companies, as well as stakes in the Olympic Stadium, swimming pools, housing companies, and numerous public enterprises and subsidiary companies.[136][137] The federal state of Berlin runs a real estate portal to advertise commercial spaces or land suitable for redevelopment.[138]\n\nThe Social Democratic Party (SPD) and The Left (Die Linke) took control of the city government after the 2001 state election and won another term in the 2006 state election.[139] From the 2016 state election until the 2023 state election, there was a coalition between the Social Democratic Party, the Greens and the Left Party.',
, 'Since April 2023, the government has been formed by a coalition between the Christian Democrats and the Social Democrats.[140]\n\nThe Governing Mayor is simultaneously Lord Mayor of the City of Berlin (Oberbürgermeister der Stadt) and Minister President of the State of Berlin (Ministerpräsident des Bundeslandes).',
, '24] About 2.7% of the population identify with other Christian denominations (mostly Eastern Orthodox, but also various Protestants).[125] According to the Berlin residents register, in 2018 14.9 percent were members of the Evangelical Church, and 8.5 percent were members of the Catholic Church.[103] The government keeps a register of members of these churches for tax purposes, because it collects church tax on behalf of the churches.']

Both searches return the same results, so we can be confident that our vector search for late chunking works! Small differences would not be surprising, since Weaviate uses an approximate HNSW index for fast search whereas the local function computes cosine similarity against every chunk, but in this case the results are identical.

For comparison, let's look at what a naive chunking method implemented with Weaviate's search would give us.

[13]
[14]
[15]
["\n\nDemographics\nMain article: Demographics of Berlin\n\nBerlin population pyramid in 2022\n\nBerlin's population, 1880–2012\nAt the end of 2018, the city-state of Berlin had 3.75 million registered inhabitants[103] in an area of 891.1 km2 (344.1 sq mi).[3] The city's population density was 4,206 inhabitants per km2.",
, 'Foreign residents of Berlin originate from about 190 countries.[112] 48 percent of the residents under the age of 15 have a migration background in 2017.[113] Berlin in 2009 was estimated to have 100,000 to 250,000 unregistered inhabitants.[114] Boroughs of Berlin with a significant number of migrants or foreign born population are Mitte, Neukölln and Friedrichshain-Kreuzberg.[115] The number of Arabic speakers in Berlin could be higher than 150,000.',
, 'Around 130,000 jobs were added in this period.[150]\n\nImportant economic sectors in Berlin include life sciences, transportation, information and communication technologies, media and music, advertising and design, biotechnology, environmental services, construction, e-commerce, retail, hotel business, and medical engineering.[151]\n\nResearch and development have economic significance for the city.[152] Several major corporations like Volkswagen, Pfizer, and SAP operate innovation laboratories in the city.[153] The Science and Business Park in Adlershof is the largest technology park in Germany measured by revenue.[154] Within the Eurozone, Berlin has become a center for business relocation and international investments.[155][156]\n\nYear[157]\t2010\t2011\t2012\t2013\t2014\t2015\t2016\t2017\t20',
, "Of the estimated population of 30,000–45,000 Jewish residents,[130] approximately 12,000 are registered members of religious organizations.[125]\n\nBerlin is the seat of the Roman Catholic archbishop of Berlin and EKBO's elected chairperson is titled the bishop of EKBO.",
, "Polish, English, Russian, and Vietnamese have more native speakers in East Berlin.[121]\n\nReligion\nMain article: Religion in Berlin\nReligion in Berlin (2022)[122]\n\n  Not religious/other (72%)\n  EKD Protestants (15%)\n  Catholics (9%)\n  Islam (4%)\n  Jewish (1%)\n  Other (0.5%)\n\n\n\n\nClockwise from top left: Berlin Cathedral, New Synagogue, Şehitlik Mosque, and St. Hedwig's Cathedral\nOn the report of the 2011 census, approximately 37 percent of the population reported being members of a legally-recognized church or religious organization."]

We can see that the naive chunking query still gives us good results: its chunks match the wording of the question more directly. The late chunking example, by contrast, goes straight to the chunks that are relevant in context, because their embeddings carry contextual information from the rest of the document!