Commit 1b6d21e

Created using Colaboratory
1 parent 17fc877 commit 1b6d21e

1 file changed

+354
-0
lines changed

LangChain_TextSplitter.ipynb

@@ -0,0 +1,354 @@
1+
{
2+
"nbformat": 4,
3+
"nbformat_minor": 0,
4+
"metadata": {
5+
"colab": {
6+
"provenance": [],
7+
"gpuType": "T4",
8+
"authorship_tag": "ABX9TyPMVWfftf8OvHN6BWqQwq5N",
9+
"include_colab_link": true
10+
},
11+
"kernelspec": {
12+
"name": "python3",
13+
"display_name": "Python 3"
14+
},
15+
"language_info": {
16+
"name": "python"
17+
},
18+
"accelerator": "GPU"
19+
},
20+
"cells": [
21+
{
22+
"cell_type": "markdown",
23+
"metadata": {
24+
"id": "view-in-github",
25+
"colab_type": "text"
26+
},
27+
"source": [
28+
"<a href=\"https://colab.research.google.com/github/sugarforever/LangChain-Tutorials/blob/main/LangChain_TextSplitter.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
29+
]
30+
},
31+
{
32+
"cell_type": "markdown",
33+
"source": [
34+
"In this notebook, I will show you the main text splitters that the LangChain framework supports."
35+
],
36+
"metadata": {
37+
"id": "amP-lCFgKUb-"
38+
}
39+
},
40+
{
41+
"cell_type": "code",
42+
"execution_count": 92,
43+
"metadata": {
44+
"id": "TO7WJgpwKLA-"
45+
},
46+
"outputs": [],
47+
"source": [
48+
"!pip install -qU langchain"
49+
]
50+
},
51+
{
52+
"cell_type": "code",
53+
"source": [
54+
"long_text = '''\n",
55+
"WASHINGTON (Reuters) -Former U.S. President Donald Trump faces 37 criminal counts including charges of unauthorized retention of classified documents and conspiracy to obstruct justice after leaving the White House in 2021, according to federal court documents made public on Friday.\n",
56+
"\n",
57+
"The Justice Department made the charging documents public on a tumultuous day in which two of Trump's lawyers quit the case and a former aide face charges as well.\n",
58+
"\n",
59+
"The charges stem from Trump's treatment of sensitive government materials he took with him when he left the White House in January 2021.\n",
60+
"\n",
61+
"He is due to make a first court appearance in the case in a Miami court on Tuesday, a day before his 77th birthday.\n",
62+
"\n",
63+
"The indictment of a former U.S. president on federal charges is unprecedented in American history and emerges at a time when Trump is the front-runner for the Republican presidential nomination next year.\n",
64+
"\n",
65+
"Investigators seized roughly 13,000 documents from Trump's Mar-a-Lago estate in Palm Beach, Florida, nearly a year ago. One hundred were marked as classified, even though one of Trump's lawyers had previously said all records with classified markings had been returned to the government.\n",
66+
"'''"
67+
],
68+
"metadata": {
69+
"id": "vwX4O06HUSye"
70+
},
71+
"execution_count": 93,
72+
"outputs": []
73+
},
74+
{
75+
"cell_type": "markdown",
76+
"source": [
77+
"# CharacterTextSplitter"
78+
],
79+
"metadata": {
80+
"id": "CAO_cdXlJwf-"
81+
}
82+
},
83+
{
84+
"cell_type": "code",
85+
"source": [
86+
"from langchain.text_splitter import CharacterTextSplitter"
87+
],
88+
"metadata": {
89+
"id": "ISg0Zv8yKfVi"
90+
},
91+
"execution_count": 94,
92+
"outputs": []
93+
},
94+
{
95+
"cell_type": "code",
96+
"source": [
97+
"text_splitter = CharacterTextSplitter(\n",
98+
"    separator=\"\\n\\n\",\n",
99+
"    chunk_size=50,\n",
100+
"    chunk_overlap=10,\n",
101+
"    length_function=len,\n",
102+
")\n",
103+
"\n",
104+
"documents = text_splitter.create_documents([long_text])\n",
105+
"print(documents[0].page_content)\n",
106+
"print(documents[1].page_content)"
107+
],
108+
"metadata": {
109+
"colab": {
110+
"base_uri": "https://localhost:8080/"
111+
},
112+
"id": "VciFcr6sUr94",
113+
"outputId": "3482f157-47db-4f9c-df46-bbe0de490218"
114+
},
115+
"execution_count": 95,
116+
"outputs": [
117+
{
118+
"output_type": "stream",
119+
"name": "stderr",
120+
"text": [
121+
"WARNING:langchain.text_splitter:Created a chunk of size 284, which is longer than the specified 50\n",
122+
"WARNING:langchain.text_splitter:Created a chunk of size 163, which is longer than the specified 50\n",
123+
"WARNING:langchain.text_splitter:Created a chunk of size 136, which is longer than the specified 50\n",
124+
"WARNING:langchain.text_splitter:Created a chunk of size 115, which is longer than the specified 50\n",
125+
"WARNING:langchain.text_splitter:Created a chunk of size 204, which is longer than the specified 50\n"
126+
]
127+
},
128+
{
129+
"output_type": "stream",
130+
"name": "stdout",
131+
"text": [
132+
"WASHINGTON (Reuters) -Former U.S. President Donald Trump faces 37 criminal counts including charges of unauthorized retention of classified documents and conspiracy to obstruct justice after leaving the White House in 2021, according to federal court documents made public on Friday.\n",
133+
"The Justice Department made the charging documents public on a tumultuous day in which two of Trump's lawyers quit the case and a former aide face charges as well.\n"
134+
]
135+
}
136+
]
137+
},
138+
{
139+
"cell_type": "markdown",
140+
"source": [
141+
"# RecursiveCharacterTextSplitter"
142+
],
143+
"metadata": {
144+
"id": "zw2Wh3u_J0EV"
145+
}
146+
},
147+
{
148+
"cell_type": "code",
149+
"source": [
150+
"from langchain.text_splitter import RecursiveCharacterTextSplitter"
151+
],
152+
"metadata": {
153+
"id": "i6dJm78aC6C_"
154+
},
155+
"execution_count": 96,
156+
"outputs": []
157+
},
158+
{
159+
"cell_type": "code",
160+
"source": [
161+
"text_splitter = RecursiveCharacterTextSplitter(\n",
162+
"    chunk_size=50,\n",
163+
"    chunk_overlap=10,\n",
164+
"    length_function=len,\n",
165+
"    add_start_index=True,\n",
166+
")\n",
167+
"\n",
168+
"documents = text_splitter.create_documents([long_text])\n",
169+
"print(documents[0])\n",
170+
"print(documents[1])\n",
171+
"print(len(documents[1].page_content))"
172+
],
173+
"metadata": {
174+
"colab": {
175+
"base_uri": "https://localhost:8080/"
176+
},
177+
"id": "Ep6YoDaXC9rM",
178+
"outputId": "427eaf1f-a5a4-4b98-b157-27a9099a5fa3"
179+
},
180+
"execution_count": 97,
181+
"outputs": [
182+
{
183+
"output_type": "stream",
184+
"name": "stdout",
185+
"text": [
186+
"page_content='WASHINGTON (Reuters) -Former U.S. President' metadata={'start_index': 1}\n",
187+
"page_content='President Donald Trump faces 37 criminal counts' metadata={'start_index': 35}\n",
188+
"47\n"
189+
]
190+
}
191+
]
192+
},
193+
{
194+
"cell_type": "markdown",
195+
"source": [
196+
"# TokenTextSplitter"
197+
],
198+
"metadata": {
199+
"id": "o5k-2CspJ4kP"
200+
}
201+
},
202+
{
203+
"cell_type": "code",
204+
"source": [
205+
"!pip install tiktoken"
206+
],
207+
"metadata": {
208+
"colab": {
209+
"base_uri": "https://localhost:8080/"
210+
},
211+
"id": "rGeK2Vv6FRyr",
212+
"outputId": "3b7e9887-1b5a-416a-f8e4-a4748072cb41"
213+
},
214+
"execution_count": 98,
215+
"outputs": [
216+
{
217+
"output_type": "stream",
218+
"name": "stdout",
219+
"text": [
220+
"Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n",
221+
"Requirement already satisfied: tiktoken in /usr/local/lib/python3.10/dist-packages (0.4.0)\n",
222+
"Requirement already satisfied: regex>=2022.1.18 in /usr/local/lib/python3.10/dist-packages (from tiktoken) (2022.10.31)\n",
223+
"Requirement already satisfied: requests>=2.26.0 in /usr/local/lib/python3.10/dist-packages (from tiktoken) (2.27.1)\n",
224+
"Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests>=2.26.0->tiktoken) (1.26.15)\n",
225+
"Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests>=2.26.0->tiktoken) (2022.12.7)\n",
226+
"Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.10/dist-packages (from requests>=2.26.0->tiktoken) (2.0.12)\n",
227+
"Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests>=2.26.0->tiktoken) (3.4)\n"
228+
]
229+
}
230+
]
231+
},
232+
{
233+
"cell_type": "code",
234+
"source": [
235+
"from langchain.text_splitter import TokenTextSplitter"
236+
],
237+
"metadata": {
238+
"id": "q9lZ0lfLFUZG"
239+
},
240+
"execution_count": 99,
241+
"outputs": []
242+
},
243+
{
244+
"cell_type": "code",
245+
"source": [
246+
"text_splitter = TokenTextSplitter(chunk_size=50, chunk_overlap=0)"
247+
],
248+
"metadata": {
249+
"id": "ARvJa2NZFNhf"
250+
},
251+
"execution_count": 100,
252+
"outputs": []
253+
},
254+
{
255+
"cell_type": "code",
256+
"source": [
257+
"documents = text_splitter.create_documents([long_text])\n",
258+
"print(documents[0])"
259+
],
260+
"metadata": {
261+
"colab": {
262+
"base_uri": "https://localhost:8080/"
263+
},
264+
"id": "0iucjqxDFYSC",
265+
"outputId": "83845a19-8811-4865-8340-a9d7d512afb2"
266+
},
267+
"execution_count": 101,
268+
"outputs": [
269+
{
270+
"output_type": "stream",
271+
"name": "stdout",
272+
"text": [
273+
"page_content='\\nWASHINGTON (Reuters) -Former U.S. President Donald Trump faces 37 criminal counts including charges of unauthorized retention of classified documents and conspiracy to obstruct justice after leaving the White House in 2021, according to federal court documents made public on Friday.\\n' metadata={}\n"
274+
]
275+
}
276+
]
277+
},
278+
{
279+
"cell_type": "code",
280+
"source": [
281+
"print(documents[1])"
282+
],
283+
"metadata": {
284+
"colab": {
285+
"base_uri": "https://localhost:8080/"
286+
},
287+
"id": "KvIK0pgdYO00",
288+
"outputId": "a73fcf64-c70d-44e3-b670-d6e5a61baf5a"
289+
},
290+
"execution_count": 102,
291+
"outputs": [
292+
{
293+
"output_type": "stream",
294+
"name": "stdout",
295+
"text": [
296+
"page_content=\"\\nThe Justice Department made the charging documents public on a tumultuous day in which two of Trump's lawyers quit the case and a former aide face charges as well.\\n\\nThe charges stem from Trump's treatment of sensitive government materials he took with him when\" metadata={}\n"
297+
]
298+
}
299+
]
300+
},
301+
{
302+
"cell_type": "code",
303+
"source": [
304+
"import tiktoken\n",
305+
"enc = tiktoken.get_encoding(\"gpt2\")\n",
306+
"print(len(enc.encode(documents[0].page_content)))\n",
307+
"print(len(enc.encode(documents[1].page_content)))\n",
308+
"print(len(enc.encode(documents[2].page_content)))"
309+
],
310+
"metadata": {
311+
"colab": {
312+
"base_uri": "https://localhost:8080/"
313+
},
314+
"id": "413dzpLfFymc",
315+
"outputId": "bff7c8bc-e3db-4a7b-8a1e-ce7058fd6e1e"
316+
},
317+
"execution_count": 103,
318+
"outputs": [
319+
{
320+
"output_type": "stream",
321+
"name": "stdout",
322+
"text": [
323+
"50\n",
324+
"50\n",
325+
"50\n"
326+
]
327+
}
328+
]
329+
},
330+
{
331+
"cell_type": "code",
332+
"source": [
333+
"print(enc.encode(documents[0].page_content))"
334+
],
335+
"metadata": {
336+
"colab": {
337+
"base_uri": "https://localhost:8080/"
338+
},
339+
"id": "ZRGx3pdWZJuh",
340+
"outputId": "db790f9c-430b-472a-a95c-6c14943787ae"
341+
},
342+
"execution_count": 104,
343+
"outputs": [
344+
{
345+
"output_type": "stream",
346+
"name": "stdout",
347+
"text": [
348+
"[198, 21793, 357, 12637, 8, 532, 14282, 471, 13, 50, 13, 1992, 3759, 1301, 6698, 5214, 4301, 9853, 1390, 4530, 286, 22959, 21545, 286, 10090, 4963, 290, 10086, 284, 26520, 5316, 706, 4305, 262, 2635, 2097, 287, 33448, 11, 1864, 284, 2717, 2184, 4963, 925, 1171, 319, 3217, 13, 198]\n"
349+
]
350+
}
351+
]
352+
}
353+
]
354+
}
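
The notebook's outputs show two different behaviors: `CharacterTextSplitter` with `separator="\n\n"` warns about chunks longer than the requested 50 characters, while `RecursiveCharacterTextSplitter` returns a 47-character chunk. The sketch below is a simplified pure-Python model of that difference, not LangChain's actual implementation. The names `split_on_separator`, `split_recursively`, and `sample` are hypothetical, and chunk overlap is deliberately left out.

```python
def split_on_separator(text, separator, chunk_size):
    """Split only on `separator`, merging neighbours while they still fit."""
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # With no finer separator to fall back on, an oversized
            # piece is kept whole and may exceed chunk_size.
            current = piece
    if current:
        chunks.append(current)
    return chunks


def split_recursively(text, separators, chunk_size):
    """Try separators from coarse to fine until every chunk fits."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # Last resort: hard cut at fixed character offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    head, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(head):
        candidate = piece if not current else current + head + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece is still too long: retry with the next, finer separator.
            chunks.extend(split_recursively(piece, rest, chunk_size))
    if current:
        chunks.append(current)
    return [c.strip() for c in chunks if c.strip()]


sample = (
    "The first paragraph is deliberately longer than fifty characters.\n\n"
    "The second paragraph is also longer than the fifty-character limit."
)

coarse = split_on_separator(sample, "\n\n", 50)
fine = split_recursively(sample, ["\n\n", "\n", " "], 50)

print([len(c) for c in coarse])  # paragraph-sized chunks, each over the limit
print([len(c) for c in fine])    # word-boundary chunks, each within the limit
```

In this model the separator-only splitter can never produce anything smaller than a paragraph, which is consistent with the "longer than the specified 50" warnings above. The recursive variant falls back to spaces and keeps every chunk within the limit.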
