Split by character

Updated August 26, 2024

This is the simplest method. It splits the text on a character separator (by default "\n\n") and measures chunk length by number of characters.

  1. How the text is split: by a single character separator.
  2. How the chunk size is measured: by number of characters.
# This is a long document we can split up.
with open("../../state_of_the_union.txt") as f:
    state_of_the_union = f.read()

from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False,
)
texts = text_splitter.create_documents([state_of_the_union])
print(texts[0])
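
To make the two points above concrete, here is a minimal pure-Python sketch of the idea: split on a literal separator, then greedily merge neighbouring pieces while the merged text stays within chunk_size as measured by length_function. The function name split_by_separator is hypothetical, and this is not LangChain's implementation. It ignores chunk_overlap and regex separators, and it keeps a single piece whole even when that piece is longer than chunk_size.

```python
from typing import Callable, List


def split_by_separator(
    text: str,
    separator: str = "\n\n",
    chunk_size: int = 1000,
    length_function: Callable[[str], int] = len,
) -> List[str]:
    """Hypothetical helper: split on a literal separator, then merge greedily."""
    # Step 1: split on the separator and drop empty pieces.
    pieces = [p for p in text.split(separator) if p]

    chunks: List[str] = []
    current: List[str] = []
    for piece in pieces:
        # Step 2: measure the chunk we would get by adding this piece.
        candidate = separator.join(current + [piece])
        if current and length_function(candidate) > chunk_size:
            # Too long: close the current chunk and start a new one.
            chunks.append(separator.join(current))
            current = [piece]
        else:
            current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


sample = "aaaa\n\nbbbb\n\ncccc\n\ndddd"
chunks = split_by_separator(sample, chunk_size=10)
print(chunks)  # ['aaaa\n\nbbbb', 'cccc\n\ndddd']
```

Passing a different length_function changes the unit of measurement without touching the splitting logic; for example, `lambda s: len(s.split())` would measure chunks in whitespace-separated words instead of characters.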

Here's an example of passing metadata along with the documents. Note that the metadata is split along with the documents: every chunk carries the metadata of the document it came from.

metadatas = [{"document": 1}, {"document": 2}]
documents = text_splitter.create_documents(
    [state_of_the_union, state_of_the_union], metadatas=metadatas
)
print(documents[0])
# split_text returns plain strings rather than Document objects.
print(text_splitter.split_text(state_of_the_union)[0])
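
The metadata behaviour described above can be illustrated with a small standalone sketch. Chunk and create_chunks are hypothetical stand-ins, not LangChain classes or functions; the sketch assumes one metadata dict per input text and simply copies it onto every chunk produced from that text.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Chunk:
    """Hypothetical stand-in for a document object: text plus metadata."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def create_chunks(
    texts: List[str],
    metadatas: Optional[List[Dict[str, Any]]] = None,
    separator: str = "\n\n",
) -> List[Chunk]:
    """Split each text on the separator and attach its metadata to every chunk."""
    if metadatas is None:
        metadatas = [{} for _ in texts]
    result: List[Chunk] = []
    for text, meta in zip(texts, metadatas):
        for piece in text.split(separator):
            if piece:
                # Copy the dict so chunks do not share one mutable object.
                result.append(Chunk(piece, dict(meta)))
    return result


docs = create_chunks(
    ["a\n\nb", "c\n\nd"],
    metadatas=[{"document": 1}, {"document": 2}],
)
for d in docs:
    print(d.metadata, repr(d.page_content))
```

Both chunks from the first text carry {"document": 1} and both chunks from the second carry {"document": 2}, which is the sense in which metadata is "split along with the documents".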