XML parser

Updated May 24, 2024

This output parser lets you obtain results from an LLM in the popular XML format.

Keep in mind that large language models are leaky abstractions! You'll have to use an LLM with sufficient capacity to generate well-formed XML.
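To make "well-formed" concrete, here is a minimal sketch that checks a string with Python's standard `xml.etree.ElementTree` module. The helper name `is_well_formed` and the sample strings are assumptions made for illustration; this is not part of LangChain.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as a single well-formed XML document."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Hypothetical model outputs: the second one never closes <movie>.
good = "<movies><movie>Big</movie></movies>"
bad = "<movies><movie>Big</movies>"

print(is_well_formed(good))
print(is_well_formed(bad))
```

A check like this can be a cheap guard before parsing, for example to decide whether to retry the request.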

In the following example we use the Claude model (https://docs.anthropic.com/claude/docs), which works really well with XML tags.

from langchain.output_parsers import XMLOutputParser
from langchain.prompts import PromptTemplate
from langchain_community.chat_models import ChatAnthropic

model = ChatAnthropic(model="claude-2", max_tokens_to_sample=512, temperature=0.1)

Let's start with a simple request to the model.

actor_query = "Generate the shortened filmography for Tom Hanks."
output = model.invoke(
    f"""{actor_query}
Please enclose the movies in <movie></movie> tags"""
)
print(output.content)

     Here is the shortened filmography for Tom Hanks, enclosed in XML tags:
    
    <movie>Splash</movie>
    <movie>Big</movie>
    <movie>A League of Their Own</movie>
    <movie>Sleepless in Seattle</movie>
    <movie>Forrest Gump</movie>
    <movie>Toy Story</movie>
    <movie>Apollo 13</movie>
    <movie>Saving Private Ryan</movie>
    <movie>Cast Away</movie>
    <movie>The Da Vinci Code</movie>
    <movie>Captain Phillips</movie>
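The reply above is still free text with tags embedded in it. As a rough illustration of the manual work an output parser saves, the sketch below pulls the tagged values out with a regular expression. The helper `extract_tag` and the shortened `raw_output` string are assumptions for this example, not LangChain APIs.

```python
import re

# A shortened copy of the kind of reply shown above (illustrative only).
raw_output = """Here is the shortened filmography for Tom Hanks, enclosed in XML tags:

<movie>Splash</movie>
<movie>Big</movie>
<movie>Forrest Gump</movie>"""


def extract_tag(text, tag):
    """Return the stripped text content of every <tag>...</tag> pair in text."""
    escaped = re.escape(tag)
    pattern = re.compile(rf"<{escaped}>(.*?)</{escaped}>", re.DOTALL)
    return [match.strip() for match in pattern.findall(text)]


titles = extract_tag(raw_output, "movie")
print(titles)
```

This approach only handles flat, non-nested tags without attributes, which is one reason to prefer a real XML parser for structured output.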

Now we will use the XMLOutputParser to get structured output.

parser = XMLOutputParser()

prompt = PromptTemplate(
    template="""{query}\n{format_instructions}""",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

chain = prompt | model | parser

output = chain.invoke({"query": actor_query})
print(output)

    {'filmography': [{'movie': [{'title': 'Big'}, {'year': '1988'}]}, {'movie': [{'title': 'Forrest Gump'}, {'year': '1994'}]}, {'movie': [{'title': 'Toy Story'}, {'year': '1995'}]}, {'movie': [{'title': 'Saving Private Ryan'}, {'year': '1998'}]}, {'movie': [{'title': 'Cast Away'}, {'year': '2000'}]}]}
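The shape of this result (each element becomes a one-key dict, and its children become a list) can be reproduced with the standard library. The following is a minimal sketch under that assumption; `element_to_dict` is a hypothetical helper written for illustration and is not presented as the parser's actual implementation.

```python
import xml.etree.ElementTree as ET


def element_to_dict(element):
    """Convert an element into {tag: text} for leaves or {tag: [children]} otherwise."""
    children = list(element)
    if not children:
        return {element.tag: element.text}
    return {element.tag: [element_to_dict(child) for child in children]}


# Illustrative XML with the same structure as the output above.
sample = (
    "<filmography>"
    "<movie><title>Big</title><year>1988</year></movie>"
    "<movie><title>Cast Away</title><year>2000</year></movie>"
    "</filmography>"
)

result = element_to_dict(ET.fromstring(sample))
print(result)
```

Note that values such as the year stay strings, as in the parser output above, so any numeric conversion is left to the caller.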

Finally, let's pass a list of tags to the parser to tailor the output to our needs.

parser = XMLOutputParser(tags=["movies", "actor", "film", "name", "genre"])
prompt = PromptTemplate(
    template="""{query}\n{format_instructions}""",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)


chain = prompt | model | parser

output = chain.invoke({"query": actor_query})

print(output)

    {'movies': [{'actor': [{'name': 'Tom Hanks'}, {'film': [{'name': 'Forrest Gump'}, {'genre': 'Drama'}]}, {'film': [{'name': 'Cast Away'}, {'genre': 'Adventure'}]}, {'film': [{'name': 'Saving Private Ryan'}, {'genre': 'War'}]}]}]}
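Because every element is a list of one-key dicts, reading fields out of the result takes a little unpacking. The sketch below copies the output shown above into a literal and flattens each film into a single dict. The helper `merge_fields` is an assumption for illustration, not a LangChain function.

```python
# The parsed result shown above, copied here so the example is self-contained.
output = {
    "movies": [
        {
            "actor": [
                {"name": "Tom Hanks"},
                {"film": [{"name": "Forrest Gump"}, {"genre": "Drama"}]},
                {"film": [{"name": "Cast Away"}, {"genre": "Adventure"}]},
                {"film": [{"name": "Saving Private Ryan"}, {"genre": "War"}]},
            ]
        }
    ]
}


def merge_fields(items):
    """Merge a list of one-key dicts into one dict (a repeated key keeps its last value)."""
    merged = {}
    for item in items:
        merged.update(item)
    return merged


actor_entries = output["movies"][0]["actor"]
films = [merge_fields(entry["film"]) for entry in actor_entries if "film" in entry]

for film in films:
    print(film["name"], "-", film["genre"])
```

Merging is only safe when a tag appears at most once per parent; repeated siblings such as the `film` entries themselves need to stay in a list.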

# Stream the parsed output chunk by chunk instead of waiting for the full response
for s in chain.stream({"query": actor_query}):
    print(s)
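To give an intuition for how XML can be consumed piece by piece, here is a minimal sketch using the standard library's `xml.etree.ElementTree.XMLPullParser`. The `chunks` list is invented to mimic text arriving in fragments, and the sketch is not meant to describe how the chain above is implemented.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments of a streamed reply; note that tags and text are
# split at arbitrary positions, as they might be when tokens arrive one by one.
chunks = [
    "<movies><film><name>Forrest",
    " Gump</name></film><film><na",
    "me>Cast Away</name></film></movies>",
]

pull_parser = ET.XMLPullParser(events=("end",))
names = []

for chunk in chunks:
    pull_parser.feed(chunk)
    # read_events() yields only the elements completed so far.
    for _event, element in pull_parser.read_events():
        if element.tag == "name":
            names.append(element.text)
            print("completed:", element.text)

pull_parser.close()
```

Each name becomes available as soon as its closing tag arrives, which is what makes incremental output possible before the whole document is complete.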