Semantic Chunking

Обновлено 26 августа 2024

Splits the text based on semantic similarity.

All credit to him.

At a high level, this splits into sentences, then groups into groups of 3 sentences, and then merges one that are similar in the embedding space.

Install Dependencies

!pip install --quiet langchain_experimental langchain_openai

# This is a long document we can split up.
with open("../../state_of_the_union.txt") as f:
    state_of_the_union = f.read()

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

text_splitter = SemanticChunker(OpenAIEmbeddings())

docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)