Late Chunking (Chunked Pooling)
Helpers
Now we define the text we want to encode and split it into chunks. The chunk_by_sentences function also returns span annotations: the start and end token positions of each chunk within the full text, which chunked pooling needs in order to know which token embeddings belong to which chunk.
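A minimal sketch of what such a function does, using a naive regex sentence split and a plain whitespace tokenizer as a stand-in for the model's real tokenizer (the actual implementation would use the tokenizer's offset mapping):

```python
import re

def chunk_by_sentences(text, tokenize=str.split):
    """Split text into sentence chunks and return (chunks, span_annotations).

    Each span annotation is a (start, end) pair of token indices into the
    tokenization of the full text; late chunking uses these spans to pool
    the right token embeddings for each chunk.
    """
    # Naive split on '.', '!' or '?' followed by whitespace; a real
    # implementation would derive spans from the model tokenizer's offsets.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, spans, start = [], [], 0
    for sent in sentences:
        n_tokens = len(tokenize(sent))  # whitespace tokens as a stand-in
        chunks.append(sent)
        spans.append((start, start + n_tokens))
        start += n_tokens
    return chunks, spans

text = "Berlin is the capital of Germany. It has 3.85 million inhabitants."
chunks, spans = chunk_by_sentences(text)
# Two chunks, with contiguous token spans covering the whole text.
```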
Chunks:
- "Germany is known for its automotive industry, javelin throwers, football teams and a lot more things from its history."
- "Its capital is Berlin, which is pronounced 'ber-liin' in German."
- "The capital is the largest city of Germany, both by area and by population."
- "Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits."
- "The city is also one of the states of Germany, and is the third smallest state in the country in terms of area."
As you can see above, Berlin is mentioned explicitly only in the second chunk; the later chunks refer to it indirectly, through phrases like "the capital", "it", and "the city".
Now we encode the chunks with both the traditional method and the context-sensitive chunked pooling (late chunking) method:
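The core difference between the two methods is where the pooling happens. A sketch with NumPy, where random arrays stand in for the model's token embeddings (the names `pool_naive` and `pool_late` are illustrative, not from any library):

```python
import numpy as np

rng = np.random.default_rng(0)

def pool_naive(per_chunk_token_embeddings):
    # Traditional chunking: each chunk is encoded in isolation,
    # then its own token embeddings are mean-pooled.
    return [emb.mean(axis=0) for emb in per_chunk_token_embeddings]

def pool_late(token_embeddings, span_annotations):
    # Late chunking: the WHOLE document is encoded once, so every token
    # embedding carries full-document context; only then do we mean-pool
    # the tokens that fall inside each chunk's span.
    return [token_embeddings[start:end].mean(axis=0)
            for start, end in span_annotations]

# Stand-in for model output: 11 tokens x 4-dim embeddings (hypothetical).
token_embeddings = rng.standard_normal((11, 4))
spans = [(0, 6), (6, 11)]
chunk_vectors = pool_late(token_embeddings, spans)
```

Because pooling is deferred until after the full-document forward pass, pronouns like "it" in a later chunk end up with token embeddings already conditioned on "Berlin" appearing earlier in the document.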
Query -> What are some of the attributes about the capital of a country whose Oktoberfest is famous?
Query -> What are some of the attributes about capital of Germany?
Query -> What are some of the attributes about Berlin?
Results
With vanilla chunking, even for the third query, where Berlin is mentioned explicitly, the naive method's top-3 results include chunks that merely contain the name Berlin without saying anything substantive about the city.
With late chunking, by contrast, the top-3 results are the chunks that actually describe Berlin, even when the word "Berlin" itself does not appear in them.
The cosine distances make the difference even clearer. With naive chunking, the chunk
"Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits."
ends up in last place with a distance of 0.28, even though it is highly relevant to the query. Surprisingly, it is farther from the query than the chunk
"Germany is known for its automotive industry, javelin throwers, football teams and a lot more things from its history."
which has a distance of only 0.23.
With late chunking, the ranking aligns perfectly with relevance. In that sense, late chunking also acts as a contextual reranker within the retrieval pipeline.