LangChain and LlamaIndex Chunking
LlamaIndex Text Chunking Strategies
The aim is to get the data into a format where it can be used for the anticipated tasks and retrieved for value later. Rather than asking "How should I chunk my data?", the better question is "What is the optimal way to pass my language model the data it needs for its task?"
This example walks through different types of chunking suited to different types of data, so that the resulting chunks actually make sense, rather than chunking for its own sake.
Download the data files used to apply the different chunking methods: the LanceDB vectordb-recipes README.md and state_of_the_union.txt.
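The two files can be fetched directly from Python; this is a minimal standard-library sketch using the same URLs the notebook downloads with wget:

```python
import urllib.request

# Source files used throughout this walkthrough
urls = {
    "README.md": "https://raw.githubusercontent.com/lancedb/vectordb-recipes/main/README.md",
    "state_of_the_union.txt": "https://frontiernerds.com/files/state_of_the_union.txt",
}

for filename, url in urls.items():
    urllib.request.urlretrieve(url, filename)
```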
File-Based Node Parsers
There are different file-based parsers that create nodes depending on the type of content they read (JSON, Markdown, etc.).
The simplest flow is to combine the flat file reader (FlatReader) with the SimpleFileNodeParser, which automatically picks the best node parser for each type of content. You can then chain it with a text-based parser to account for the actual length of the text.
Node Parser - Simple File
Covers all file types intelligently by delegating to the appropriate content parser.
'VectorDB-recipes\n<br />\nDive into building GenAI applications!\nThis repository contains examples, applications, starter code, & tutorials to help you kickstart your GenAI projects.\n\n- These are built using LanceDB, a free, open-source, serverless vectorDB that **requires no setup**. \n- It **integrates into python data ecosystem** so you can simply start using these in your existing data pipelines in pandas, arrow, pydantic etc.\n- LanceDB has **native Typescript SDK** using which you can **run vector search** in serverless functions!\n\n<img src="https://github.com/lancedb/vectordb-recipes/assets/5846846/d284accb-24b9-4404-8605-56483160e579" height="85%" width="85%" />\n\n<br />\nJoin our community for support - <a href="https://discord.gg/zMM32dvNtd">Discord</a> •\n<a href="https://twitter.com/lancedb">Twitter</a>\n\n---\n\nThis repository is divided into 3 sections:\n- [Examples](#examples) - Get right into the code with minimal introduction, aimed at getting you from an idea to PoC within minutes!\n- [Applications](#projects--applications) - Ready to use Python and web apps using applied LLMs, VectorDB and GenAI tools\n- [Tutorials](#tutorials) - A curated list of tutorials, blogs, Colabs and courses to get you started with GenAI in greater depth.'
Node Parser - HTML
This node parser uses Beautiful Soup (beautifulsoup4) to parse raw HTML.
By default, it will parse a select subset of HTML tags, but you can override this.
The default tags are: ["p", "h1", "h2", "h3", "h4", "h5", "h6", "li", "b", "i", "u", "section"]
<Response [200]>
[TextNode(id_='bf308ea9-b937-4746-8645-c8023e2087d7', embedding=None, metadata={'tag': 'h1'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='https://www.utoronto.ca/', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='247fb639a05bc6898fd1750072eceb47511d3b8dae80999f9438e50a1faeb4b2'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='7c280bdf-7373-4be8-8e70-6360848581e9', node_type=<ObjectType.TEXT: '1'>, metadata={'tag': 'p'}, hash='3e989bb32b04814d486ed9edeefb1b0ce580ba7fc8c375f64473ddd95ca3e824')}, text='Welcome to University of Toronto', start_char_idx=2784, end_char_idx=2816, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'), TextNode(id_='7c280bdf-7373-4be8-8e70-6360848581e9', embedding=None, metadata={'tag': 'p'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='https://www.utoronto.ca/', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='247fb639a05bc6898fd1750072eceb47511d3b8dae80999f9438e50a1faeb4b2'), <NodeRelationship.PREVIOUS: '2'>: RelatedNodeInfo(node_id='bf308ea9-b937-4746-8645-c8023e2087d7', node_type=<ObjectType.TEXT: '1'>, metadata={'tag': 'h1'}, hash='e1e6af749b6a40a4055c80ca6b821ed841f1d20972e878ca1881e508e4446c26')}, text='In photos: Under cloudy skies, U of T community gathers to experience near-total solar eclipse\nYour guide to the U of T community\nThe University of Toronto is home to some of the world’s top faculty, students, alumni and staff. 
U of T Celebrates recognizes their award-winning accomplishments.\nDavid Dyzenhaus recognized with Gold Medal from Social Sciences and Humanities Research Council\nOur latest issue is all about feeling good: the only diet you really need to know about, the science behind cold plunges, a uniquely modern way to quit smoking, the “sex, drugs and rock ‘n’ roll” of university classes, how to become a better workplace leader, and more.\nFaculty and Staff\nHis course about the body is a workout for the mind\nProfessor Doug Richards teaches his students the secret to living a longer – and healthier – life\n\nStatement of Land Acknowledgement\nWe wish to acknowledge this land on which the University of Toronto operates. For thousands of years it has been the traditional land of the Huron-Wendat, the Seneca, and the Mississaugas of the Credit. Today, this meeting place is still the home to many Indigenous people from across Turtle Island and we are grateful to have the opportunity to work on this land.\nRead about U of T’s Statement of Land Acknowledgement.\nUNIVERSITY OF TORONTO - SINCE 1827', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')]
Node Parser - JSON
The JSONNodeParser parses raw JSON.
Node ID: 05325093-16a2-41ac-b952-3882c817ac4d Text: status True data house_list id_listing owJKR7PNnP9YXeLP data house_list house_type_in_map D data house_list price_abbr 0.75M data house_list price 749,000 data house_list price_sold 690,000 data house_list tags Sold data house_list list_status public 1 data house_list list_status live 0 data house_list list_status s_r Sale data house_list list_s...
Node Parser - Markdown
The MarkdownNodeParser parses raw markdown text.
'VectorDB-recipes\n<br />\nDive into building GenAI applications!\nThis repository contains examples, applications, starter code, & tutorials to help you kickstart your GenAI projects.\n\n- These are built using LanceDB, a free, open-source, serverless vectorDB that **requires no setup**. \n- It **integrates into python data ecosystem** so you can simply start using these in your existing data pipelines in pandas, arrow, pydantic etc.\n- LanceDB has **native Typescript SDK** using which you can **run vector search** in serverless functions!\n\n<img src="https://github.com/lancedb/vectordb-recipes/assets/5846846/d284accb-24b9-4404-8605-56483160e579" height="85%" width="85%" />\n\n<br />\nJoin our community for support - <a href="https://discord.gg/zMM32dvNtd">Discord</a> •\n<a href="https://twitter.com/lancedb">Twitter</a>\n\n---\n\nThis repository is divided into 3 sections:\n- [Examples](#examples) - Get right into the code with minimal introduction, aimed at getting you from an idea to PoC within minutes!\n- [Applications](#projects--applications) - Ready to use Python and web apps using applied LLMs, VectorDB and GenAI tools\n- [Tutorials](#tutorials) - A curated list of tutorials, blogs, Colabs and courses to get you started with GenAI in greater depth.'
Text-Splitters
Download a .py file to demonstrate chunking methods for code.
Code Splitting
Splits raw code text based on the language it is written in.
Check the full list of supported languages here.
'from youtube_podcast_download import podcast_audio_retreival\nfrom transcribe_podcast import transcribe\nfrom chat_retreival import retrieverSetup, chat\nfrom langroid_utils import configure, agent\n\nimport os\nimport glob\nimport json\nimport streamlit as st\n\nOPENAI_KEY = os.environ["OPENAI_API_KEY"]\n\n\n@st.cache_resource\ndef video_data_retreival(framework):\n f = open("output.json")\n data = json.load(f)\n\n # setting up reteriver\n if framework == "Langchain":\n qa = retrieverSetup(data["text"], OPENAI_KEY)\n return qa\n elif framework == "Langroid":\n langroid_file = open("langroid_doc.txt", "w") # write mode\n langroid_file.write(data["text"])\n cfg = configure("langroid_doc.txt")\n return cfg\n\n\nst.header("Talk with Youtube Podcasts", divider="rainbow")\n\nurl = st.text_input("Youtube Link")\nframework = st.radio(\n "**Select Framework 👇**",\n ["Langchain", "Langroid"],\n key="Langchain",\n)\n\nif url:\n st.video(url)\n # Podcast Audio Retreival from Youtube\n podcast_audio_retreival(url)\n\n # Trascribing podcast audio\n filename = glob.glob("*.mp3")[0]\n transcribe(filename)\n\n st.markdown(f"##### `{framework}` Framework Selected for talking with Podcast")\n # Chat Agent getting ready\n qa = video_data_retreival(framework)\n\n\nprompt = st.chat_input("Talk with Podcast")\n\ni'

Sentence Splitting
The SentenceSplitter attempts to split text while respecting the boundaries of sentences.
"Madame Speaker, Vice President Biden, members of Congress, distinguished guests, and fellow Americans:\n\nOur Constitution declares that from time to time, the president shall give to Congress information about the state of our union. For 220 years, our leaders have fulfilled this duty. They have done so during periods of prosperity and tranquility. And they have done so in the midst of war and depression; at moments of great strife and great struggle.\n\nIt's tempting to look back on these moments and assume that our progress was inevitable, that America was always destined to succeed. But when the Union was turned back at Bull Run and the Allies first landed at Omaha Beach, victory was very much in doubt. When the market crashed on Black Tuesday and civil rights marchers were beaten on Bloody Sunday, the future was anything but certain. These were times that tested the courage of our convictions and the strength of our union. And despite all our divisions and disagreements, our hesitations and our fears, America prevailed because we chose to move forward as one nation and one people.\n\nAgain, we are tested. And again, we must answer history's call."
Node Parser - Sentence Window
The SentenceWindowNodeParser is similar to other node parsers, except that it splits all documents into individual sentences. The resulting nodes also contain the surrounding "window" of sentences around each node in the metadata. Note that this metadata will not be visible to the LLM or embedding model.
This is most useful for generating embeddings with a very specific scope. Then, combined with a MetadataReplacementNodePostProcessor, you can replace each sentence with its surrounding context before sending the node to the LLM.
An example of setting up the parser with default settings is below. In practice, you would usually only want to adjust the sentence window size.
'Madame Speaker, Vice President Biden, members of Congress, distinguished guests, and fellow Americans:\n\nOur Constitution declares that from time to time, the president shall give to Congress information about the state of our union. '
Node Parser - Semantic Splitting
"Semantic chunking" is a new concept proposed Greg Kamradt in his video tutorial on 5 levels of embedding chunking: https://youtu.be/8OJC21T2SL4?t=1933.
Instead of chunking text with a fixed chunk size, the semantic splitter adaptively picks the breakpoint in-between sentences using embedding similarity. This ensures that a "chunk" contains sentences that are semantically related to each other.
'Madame Speaker, Vice President Biden, members of Congress, distinguished guests, and fellow Americans:\n\nOur Constitution declares that from time to time, the president shall give to Congress information about the state of our union. For 220 years, our leaders have fulfilled this duty. '
Token Text Splitting
The TokenTextSplitter attempts to split text into chunks of a consistent size according to raw token counts.
"Madame Speaker, Vice President Biden, members of Congress, distinguished guests, and fellow Americans:\n\nOur Constitution declares that from time to time, the president shall give to Congress information about the state of our union. For 220 years, our leaders have fulfilled this duty. They have done so during periods of prosperity and tranquility. And they have done so in the midst of war and depression; at moments of great strife and great struggle.\n\nIt's tempting to look back on these moments and assume that our progress was inevitable, that America was always destined to succeed. But when the Union was turned back at Bull Run and the Allies first landed at Omaha Beach, victory was very much in doubt. When the market crashed on Black Tuesday and civil rights marchers were beaten on Bloody Sunday, the future was anything but certain. These were times that tested the courage of our convictions and the strength of our union. And despite all our divisions and disagreements, our hesitations and our fears, America prevailed because we chose to move forward as one nation and one people.\n\nAgain, we are tested. And again, we must answer history's call.\n\nOne year ago, I took office amid two wars, an economy"
Relation-Based Node Parsers
Node Parser - Hierarchical
This node parser chunks documents into hierarchies of nodes: a single input is chunked at several chunk sizes, with each node holding a reference to its parent node.
When combined with the AutoMergingRetriever, this enables us to automatically replace retrieved nodes with their parents when a majority of children are retrieved. This process provides the LLM with more complete context for response synthesis.
"Madame Speaker, Vice President Biden, members of Congress, distinguished guests, and fellow Americans:\n\nOur Constitution declares that from time to time, the president shall give to Congress information about the state of our union. For 220 years, our leaders have fulfilled this duty. They have done so during periods of prosperity and tranquility. And they have done so in the midst of war and depression; at moments of great strife and great struggle.\n\nIt's tempting to look back on these moments and assume that our progress was inevitable, that America was always destined to succeed. But when the Union was turned back at Bull Run and the Allies first landed at Omaha Beach, victory was very much in doubt. When the market crashed on Black Tuesday and civil rights marchers were beaten on Bloody Sunday, the future was anything but certain. These were times that tested the courage of our convictions and the strength of our union. And despite all our divisions and disagreements, our hesitations and our fears, America prevailed because we chose to move forward as one nation and one people.\n\nAgain, we are tested. And again, we must answer history's call.\n\nOne year ago, I took office amid two wars, an economy rocked by severe recession, a financial system on the verge of collapse and a government deeply in debt. Experts from across the political spectrum warned that if we did not act, we might face a second depression. So we acted immediately and aggressively. And one year later, the worst of the storm has passed.\n\nBut the devastation remains. One in 10 Americans still cannot find work. Many businesses have shuttered. Home values have declined. Small towns and rural communities have been hit especially hard. 
For those who had already known poverty, life has become that much harder.\n\nThis recession has also compounded the burdens that America's families have been dealing with for decades -- the burden of working harder and longer for less, of being unable to save enough to retire or help kids with college.\n\nSo I know the anxieties that are out there right now. They're not new. These struggles are the reason I ran for president. These struggles are what I've witnessed for years in places like Elkhart, Ind., and Galesburg, Ill. I hear about them in the letters that I read each night."
LangChain Text Chunking Strategies
Text Splitting - Character
Splits text by looking at characters, using a single separator by default.
WARNING:langchain_text_splitters.base:Created a chunk of size 1163, which is longer than the specified 1000
WARNING:langchain_text_splitters.base:Created a chunk of size 1015, which is longer than the specified 1000
Madame Speaker, Vice President Biden, members of Congress, distinguished guests, and fellow Americans: Our Constitution declares that from time to time, the president shall give to Congress information about the state of our union. For 220 years, our leaders have fulfilled this duty. They have done so during periods of prosperity and tranquility. And they have done so in the midst of war and depression; at moments of great strife and great struggle.
Text Splitting - Recursive Character
Splits text by recursively looking at characters.
It recursively tries different separators until it finds one that produces chunks of the right size.
Chunk 2: It's tempting to look back on these moments and assume that our progress was inevitable, that America was always destined to succeed. But when the Union was turned back at Bull Run and the Allies first landed at Omaha Beach, victory was very much in doubt. When the market crashed on Black Tuesday and civil rights marchers were beaten on Bloody Sunday, the future was anything but certain. These were times that tested the courage of our convictions and the strength of our union. And despite all our divisions and disagreements, our hesitations and our fears, America prevailed because we chose to move forward as one nation and one people. Again, we are tested. And again, we must answer history's call.

Chunk 3: Again, we are tested. And again, we must answer history's call. One year ago, I took office amid two wars, an economy rocked by severe recession, a financial system on the verge of collapse and a government deeply in debt. Experts from across the political spectrum warned that if we did not act, we might face a second depression. So we acted immediately and aggressively. And one year later, the worst of the storm has passed. But the devastation remains. One in 10 Americans still cannot find work. Many businesses have shuttered. Home values have declined. Small towns and rural communities have been hit especially hard. For those who had already known poverty, life has become that much harder. This recession has also compounded the burdens that America's families have been dealing with for decades -- the burden of working harder and longer for less, of being unable to save enough to retire or help kids with college.
Text Splitting - HTML Header
Splitting HTML files based on specified headers.
Requires the lxml package.
'Welcome to University of Toronto \nMain menu tools'
Text Splitting - Code
'from youtube_podcast_download import podcast_audio_retreival'
Text Splitting - Recursive JSON
Splits JSON data into smaller, structured chunks while preserving hierarchy.
This method splits JSON data into smaller dictionaries or JSON-formatted strings based on configurable maximum and minimum chunk sizes. It supports nested JSON structures, optionally converts lists into dictionaries for better chunking, and allows the creation of document objects for further use.
{'openapi': '3.1.0',
 'info': {'title': 'LangSmith', 'version': '0.1.0'},
 'servers': [{'url': 'https://api.smith.langchain.com',
              'description': 'LangSmith API endpoint.'}]}

Semantic Splitting
Splits the text based on semantic similarity.
Madame Speaker, Vice President Biden, members of Congress, distinguished guests, and fellow Americans: Our Constitution declares that from time to time, the president shall give to Congress information about the state of our union. For 220 years, our leaders have fulfilled this duty.
Splitting by Tokens
A text splitter that uses the tiktoken encoder to count length.
WARNING:langchain_text_splitters.base:Created a chunk of size 123, which is longer than the specified 100
(the same warning repeats for a few dozen more chunks, with sizes ranging from 104 to 231)
Madame Speaker, Vice President Biden, members of Congress, distinguished guests, and fellow Americans: Our Constitution declares that from time to time, the president shall give to Congress information about the state of our union. For 220 years, our leaders have fulfilled this duty. They have done so during periods of prosperity and tranquility. And they have done so in the midst of war and depression; at moments of great strife and great struggle.