
Extraction

Updated May 24, 2024


LLMs can be used to generate text that is structured according to a specific schema. This can be useful in a number of scenarios, including:

  • Extracting a structured row to insert into a database
  • Extracting API parameters
  • Extracting different parts of a user query (e.g., for semantic vs keyword search)

Overview

There are two broad approaches for this:

  • Tools and JSON mode: Some LLMs specifically support structured output generation in certain contexts. Examples include OpenAI's function and tool calling or JSON mode.

  • Parsing: LLMs can often be instructed to output their response in a desired format. Output parsers will parse text generations into a structured form.

Parsers extract precisely what is enumerated in a provided schema (e.g., specific attributes of a person).

Functions and tools can infer things beyond a provided schema (e.g., attributes about a person that you did not ask for).

Option 1: Leveraging tools and JSON mode

Quickstart

create_structured_output_runnable will create Runnables to support structured data extraction via OpenAI tool use and JSON mode.

The desired output schema can be expressed either via a Pydantic model or a Python dict representing valid JsonSchema.

This function supports three modes for structured data extraction:

  • "openai-functions" will define OpenAI functions and bind them to the given LLM;
  • "openai-tools" will define OpenAI tools and bind them to the given LLM;
  • "openai-json" will bind response_format={"type": "json_object"} to the given LLM.
%pip install gigachain langchain-openai
from typing import Optional

from langchain.chains import create_structured_output_runnable
from langchain_core.pydantic_v1 import BaseModel
from langchain_openai import ChatOpenAI


class Person(BaseModel):
    person_name: str
    person_height: int
    person_hair_color: str
    dog_breed: Optional[str]
    dog_name: Optional[str]


llm = ChatOpenAI(model="gpt-4-0125-preview", temperature=0)
runnable = create_structured_output_runnable(Person, llm)
inp = "Alex is 5 feet tall and has blond hair."
runnable.invoke(inp)
    Giga generation stopped with reason: function_call
    Person(person_name='Alex', person_height=60, person_hair_color='blond', dog_breed=None, dog_name=None)

Specifying schemas

A convenient way to express desired output schemas is via Pydantic. The above example specified the desired output schema via Person, a Pydantic model. Such schemas can be easily combined together to generate richer output formats:

from typing import Sequence


class People(BaseModel):
    """Identifying information about all people in a text."""

    people: Sequence[Person]


runnable = create_structured_output_runnable(People, llm)
inp = """Alex is 5 feet tall and has blond hair.
Claudia is 1 feet taller Alex and jumps higher than him.
Claudia is a brunette and has a beagle named Harry."""

runnable.invoke(inp)
    People(people=[Person(person_name='Alex', person_height=5, person_hair_color='blond', dog_breed=None, dog_name=None), Person(person_name='Claudia', person_height=6, person_hair_color='brunette', dog_breed='beagle', dog_name='Harry')])

Note that dog_breed and dog_name are optional attributes, such that here they are extracted for Claudia and not for Alex.

One can also specify the desired output format with a Python dict representing valid JsonSchema:

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "height": {"type": "integer"},
        "hair_color": {"type": "string"},
    },
    "required": ["name", "height"],
}

runnable = create_structured_output_runnable(schema, llm)
inp = "Alex is 5 feet tall. I don't know his hair color."
runnable.invoke(inp)
    {'name': 'Alex', 'height': 60}
inp = "Alex is 5 feet tall. He is blond."
runnable.invoke(inp)
    {'name': 'Alex', 'height': 60, 'hair_color': 'blond'}

Extra information

Runnables constructed via create_structured_output_runnable generally are capable of semantic extraction, such that they can populate information that is not explicitly enumerated in the schema.

Suppose we want unspecified additional information about dogs.

We can add a placeholder for unstructured extraction, dog_extra_info.

inp = """Alex is 5 feet tall and has blond hair.
Claudia is 1 feet taller Alex and jumps higher than him.
Claudia is a brunette and has a beagle named Harry.
Harry likes to play with other dogs and can always be found
playing with Milo, a border collie that lives close by."""
class Person(BaseModel):
    person_name: str
    person_height: int
    person_hair_color: str
    dog_breed: Optional[str]
    dog_name: Optional[str]
    dog_extra_info: Optional[str]


class People(BaseModel):
    """Identifying information about all people in a text."""

    people: Sequence[Person]


runnable = create_structured_output_runnable(People, llm)
runnable.invoke(inp)
    People(people=[Person(person_name='Alex', person_height=60, person_hair_color='blond', dog_breed=None, dog_name=None, dog_extra_info=None), Person(person_name='Claudia', person_height=72, person_hair_color='brunette', dog_breed='beagle', dog_name='Harry', dog_extra_info='likes to play with other dogs and can always be found playing with Milo, a border collie that lives close by.')])

This gives us additional information about the dogs.

Specifying extraction mode

create_structured_output_runnable supports varying implementations of the underlying extraction under the hood, which are configured via the mode parameter. This parameter can be one of "openai-functions", "openai-tools", or "openai-json".

OpenAI Functions and Tools

Some LLMs are fine-tuned to support the invocation of functions or tools. If they are given an input schema for a tool and recognize an occasion to use it, they may emit JSON output conforming to that schema. We can leverage this to drive structured data extraction from natural language.

OpenAI originally released this via a functions parameter in its chat completions API. This has since been deprecated in favor of a tools parameter, which can include (multiple) functions. Using OpenAI Functions:

runnable = create_structured_output_runnable(Person, llm, mode="openai-functions")

inp = "Alex is 5 feet tall and has blond hair."
runnable.invoke(inp)
    Person(person_name='Alex', person_height=60, person_hair_color='blond', dog_breed=None, dog_name=None, dog_extra_info=None)

Using OpenAI Tools:

runnable = create_structured_output_runnable(Person, llm, mode="openai-tools")

runnable.invoke(inp)
    Person(person_name='Alex', person_height=152, person_hair_color='blond', dog_breed=None, dog_name=None)

The corresponding LangSmith trace illustrates the tool call that generated our structured output.

Extraction trace tool
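Under the hood, the "openai-functions" and "openai-tools" modes convert the schema into a function or tool definition and bind it to the model. As a minimal sketch (not part of the original notebook, and assuming a recent langchain_core / langchain-openai), you can inspect or reproduce that binding yourself:

from langchain_core.utils.function_calling import convert_to_openai_tool

# The resulting dict has the shape {"type": "function", "function": {"name": "Person", ...}}
# and is what an "openai-tools" runnable attaches to the request.
tool_spec = convert_to_openai_tool(Person)

# Roughly equivalent manual binding; tool_choice forces the model to call the Person tool.
llm_with_tools = llm.bind_tools([Person], tool_choice="Person")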

JSON Mode

Some LLMs support generating JSON more generally. OpenAI implements this via a response_format parameter in its chat completions API.

Note that this method may require explicit prompting (e.g., OpenAI requires that input messages contain the word "json" in some form when using this parameter).

from langchain_core.prompts import ChatPromptTemplate

system_prompt = """You extract information in structured JSON formats.

Extract a valid JSON blob from the user input that matches the following JSON Schema:

{output_schema}"""
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)
runnable = create_structured_output_runnable(
    Person,
    llm,
    mode="openai-json",
    prompt=prompt,
    enforce_function_usage=False,
)

runnable.invoke({"input": inp})
    Person(person_name='Alex', person_height=5, person_hair_color='blond', dog_breed=None, dog_name=None, dog_extra_info=None)

Few-shot examples

Suppose we want to tune the behavior of our extractor. There are a few options available. For example, if we want to redact names but retain other information, we could adjust the system prompt:

system_prompt = """You extract information in structured JSON formats.

Extract a valid JSON blob from the user input that matches the following JSON Schema:

{output_schema}

Redact all names.
"""
prompt = ChatPromptTemplate.from_messages(
    [("system", system_prompt), ("human", "{input}")]
)
runnable = create_structured_output_runnable(
    Person,
    llm,
    mode="openai-json",
    prompt=prompt,
    enforce_function_usage=False,
)

runnable.invoke({"input": inp})
    Person(person_name='REDACTED', person_height=5, person_hair_color='blond', dog_breed=None, dog_name=None)

Few-shot examples are another, effective way to illustrate intended behavior. For instance, if we want to redact names with a specific character string, a one-shot example will convey this. We can use a FewShotChatMessagePromptTemplate to easily accommodate both a fixed set of examples as well as the dynamic selection of examples based on the input.

from langchain_core.prompts import FewShotChatMessagePromptTemplate

examples = [
    {
        "input": "Samus is 6 ft tall and blonde.",
        "output": Person(
            person_name="######",
            person_height=5,
            person_hair_color="blonde",
        ).dict(),
    }
]

example_prompt = ChatPromptTemplate.from_messages(
    [("human", "{input}"), ("ai", "{output}")]
)
few_shot_prompt = FewShotChatMessagePromptTemplate(
    examples=examples,
    example_prompt=example_prompt,
)
prompt = ChatPromptTemplate.from_messages(
    [("system", system_prompt), few_shot_prompt, ("human", "{input}")]
)
runnable = create_structured_output_runnable(
    Person,
    llm,
    mode="openai-json",
    prompt=prompt,
    enforce_function_usage=False,
)

runnable.invoke({"input": inp})
    Person(person_name='#####', person_height=5, person_hair_color='blond', dog_breed=None, dog_name=None)

Here, the LangSmith trace for the chat model call shows how the one-shot example is formatted into the prompt.

Extraction trace few shot
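As mentioned above, FewShotChatMessagePromptTemplate also supports dynamic selection of examples based on the input. The following is a rough sketch (not from the original notebook) that assumes OpenAI embeddings, a local FAISS index (the faiss-cpu package), and a small illustrative example set with JSON-string outputs:

import json

from langchain_community.vectorstores import FAISS
from langchain_core.example_selectors import SemanticSimilarityExampleSelector
from langchain_openai import OpenAIEmbeddings

# Illustrative examples; the outputs are JSON strings so they can be embedded as text.
redaction_examples = [
    {
        "input": "Samus is 6 ft tall and blonde.",
        "output": json.dumps(
            {"person_name": "######", "person_height": 72, "person_hair_color": "blonde"}
        ),
    },
    {
        "input": "Link is 5 ft tall and has brown hair.",
        "output": json.dumps(
            {"person_name": "######", "person_height": 60, "person_hair_color": "brown"}
        ),
    },
]

# Embed the examples and pick the single most similar one for each incoming input.
example_selector = SemanticSimilarityExampleSelector.from_examples(
    redaction_examples,
    OpenAIEmbeddings(),
    FAISS,
    k=1,
)

few_shot_prompt = FewShotChatMessagePromptTemplate(
    input_variables=["input"],
    example_selector=example_selector,
    example_prompt=example_prompt,  # the ("human", ...) / ("ai", ...) template defined above
)

The resulting few_shot_prompt can be dropped into the same ChatPromptTemplate as before.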

Option 2: Parsing

Output parsers are classes that help structure language model responses.

As shown above, they are used to parse the output of the runnable created by create_structured_output_runnable.

They can also be used more generally, if an LLM is instructed to emit its output in a certain format. Parsers include convenience methods for generating formatting instructions for use in prompts.

Below we implement an example.

from typing import Optional, Sequence

from langchain.output_parsers import PydanticOutputParser
from langchain.prompts import PromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field, validator
from langchain_openai import ChatOpenAI


class Person(BaseModel):
    person_name: str
    person_height: int
    person_hair_color: str
    dog_breed: Optional[str]
    dog_name: Optional[str]


class People(BaseModel):
    """Identifying information about all people in a text."""

    people: Sequence[Person]


# Run
query = """Alex is 5 feet tall. Claudia is 1 feet taller Alex and jumps higher than him. Claudia is a brunette and Alex is blond."""

# Set up a parser + inject instructions into the prompt template.
parser = PydanticOutputParser(pydantic_object=People)

# Prompt
prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

# Run
_input = prompt.format_prompt(query=query)
model = ChatOpenAI()
output = model.invoke(_input.to_string())

parser.parse(output.content)
    People(people=[Person(person_name='Alex', person_height=5, person_hair_color='blond', dog_breed=None, dog_name=None), Person(person_name='Claudia', person_height=6, person_hair_color='brunette', dog_breed=None, dog_name=None)])

We can see from the LangSmith trace that we get the same output as above.

Extraction trace parsing

We can see that we provide a two-shot prompt in order to instruct the LLM to output in our desired format.

# Define your desired data structure.
class Joke(BaseModel):
    setup: str = Field(description="question to set up a joke")
    punchline: str = Field(description="answer to resolve the joke")

    # You can add custom validation logic easily with Pydantic.
    @validator("setup")
    def question_ends_with_question_mark(cls, field):
        if field[-1] != "?":
            raise ValueError("Badly formed question!")
        return field


# And a query intended to prompt a language model to populate the data structure.
joke_query = "Tell me a joke."

# Set up a parser + inject instructions into the prompt template.
parser = PydanticOutputParser(pydantic_object=Joke)

# Prompt
prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

# Run
_input = prompt.format_prompt(query=joke_query)
model = ChatOpenAI(temperature=0)
output = model.invoke(_input.to_string())
parser.parse(output.content)
    Joke(setup="Why couldn't the bicycle find its way home?", punchline='Because it lost its bearings!')

As we can see, we get an output of the Joke class, which respects our originally desired schema: 'setup' and 'punchline'.

We can look at the LangSmith trace to see exactly what is going on under the hood.
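If a generation fails parsing or the custom validator (for example, a setup that does not end with "?"), one option, sketched here and not shown in the example above, is to wrap the parser in an OutputFixingParser, which feeds the error back to an LLM and asks it to repair the output:

from langchain.output_parsers import OutputFixingParser

# Wrap the Pydantic parser; the extra LLM round-trip only happens when parsing fails.
fixing_parser = OutputFixingParser.from_llm(parser=parser, llm=ChatOpenAI(temperature=0))
fixing_parser.parse(output.content)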

Going deeper

  • The output parser documentation includes various parser examples for specific types (e.g., lists, datetime, enum, etc.).
  • The experimental Anthropic function calling support provides similar functionality for Anthropic chat models.
  • LlamaCPP natively supports constrained decoding using custom grammars, making it easy to output structured content using local LLMs (see the sketch after this list).
  • Kor is another library for extraction where schema and examples can be provided to the LLM.
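For the llama.cpp route mentioned in the list above, the sketch below is only illustrative: the model and grammar paths are placeholders, and it assumes your langchain_community version exposes the grammar_path parameter (json.gbnf is the JSON grammar shipped with llama.cpp):

from langchain_community.llms import LlamaCpp

llm = LlamaCpp(
    model_path="/path/to/model.gguf",  # placeholder: any local GGUF model
    grammar_path="/path/to/json.gbnf",  # constrains decoding to valid JSON
    temperature=0,
)
llm.invoke("Describe Alex (name, height in inches, hair color) as a JSON object.")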