Split by character
Updated August 26, 2024
This is the simplest method. It splits the text on a single character separator (by default "\n\n") and measures chunk length by number of characters.
- How the text is split: by a single character separator.
- How the chunk size is measured: by number of characters.
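The two points above can be illustrated with a minimal sketch in plain Python. This is a simplified illustration and not LangChain's actual implementation: the function name `split_by_character` is hypothetical, overlap between chunks is ignored, and a piece longer than `chunk_size` is simply kept whole.

```python
def split_by_character(text, separator="\n\n", chunk_size=1000):
    """Hypothetical helper: split on `separator`, then merge pieces
    into chunks whose length in characters stays within `chunk_size`."""
    chunks = []
    current = ""
    for piece in text.split(separator):
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        # Length is measured with len(), i.e. by number of characters.
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece  # an oversized piece is kept whole in this sketch
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa\n\nbbbb\n\ncccc"
chunks = split_by_character(sample, chunk_size=10)
print(chunks)
```

With `chunk_size=10`, the first two pieces fit together in one chunk (4 + 2 + 4 characters) and the third starts a new chunk.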
# This is a long document we can split up.
with open("../../state_of_the_union.txt") as f:
    state_of_the_union = f.read()
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False,
)
texts = text_splitter.create_documents([state_of_the_union])
print(texts[0])
Here's an example of passing metadata along with the documents. Note that the metadata is split along with the documents: each chunk carries the metadata of the document it came from.
metadatas = [{"document": 1}, {"document": 2}]
documents = text_splitter.create_documents(
    [state_of_the_union, state_of_the_union], metadatas=metadatas
)
print(documents[0])
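To make the metadata behaviour concrete, here is a rough sketch that does not depend on LangChain. The `Chunk` dataclass and the `create_chunks` function are hypothetical stand-ins introduced only for illustration, and the sketch splits on the separator without merging pieces or applying overlap.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    # Hypothetical stand-in for a document object: text plus metadata.
    page_content: str
    metadata: dict = field(default_factory=dict)


def create_chunks(texts, metadatas, separator="\n\n"):
    """Hypothetical helper: split each text and attach a copy of that
    text's metadata to every chunk produced from it."""
    chunks = []
    for text, metadata in zip(texts, metadatas):
        for piece in text.split(separator):
            if piece:
                # Copy the dict so chunks do not share one mutable object.
                chunks.append(Chunk(piece, dict(metadata)))
    return chunks


docs = create_chunks(["a\n\nb", "c"], [{"document": 1}, {"document": 2}])
print([(d.page_content, d.metadata) for d in docs])
```

Both chunks of the first text carry `{"document": 1}`, while the chunk of the second text carries `{"document": 2}`.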
# split_text returns plain strings rather than Document objects.
text_splitter.split_text(state_of_the_union)[0]