
Using local models

Updated May 24, 2024

The popularity of projects like PrivateGPT, llama.cpp, GPT4All, and llamafile underscores the importance of running LLMs locally.

LangChain has integrations with many open-source LLMs that can be run locally.

See here for setup instructions for these LLMs.

For example, here we show how to run GPT4All or LLaMA2 locally (e.g., on your laptop) using local embeddings and a local LLM.

Document Loading

First, install packages needed for local embeddings and vector storage.

%pip install --upgrade --quiet  gigachain gigachain-community langchainhub gpt4all chromadb

Load and split an example document.

We'll use a blog post on agents as an example.

from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
data = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
all_splits = text_splitter.split_documents(data)
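
As a quick sanity check (not part of the original notebook), you can confirm how many chunks the splitter produced and preview one of them:

# Optional: inspect the result of splitting.
print(f"Number of chunks: {len(all_splits)}")
print(all_splits[0].page_content[:200])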

Next, the steps below will download the GPT4All embeddings locally (if you don't already have them).

from langchain_community.embeddings import GPT4AllEmbeddings
from langchain_community.vectorstores import Chroma

vectorstore = Chroma.from_documents(documents=all_splits, embedding=GPT4AllEmbeddings())

Test that similarity search is working with our local embeddings.

question = "What are the approaches to Task Decomposition?"
docs = vectorstore.similarity_search(question)
len(docs)
    4
docs[0]
    Document(page_content='Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.\\n1.", "What are the subgoals for achieving XYZ?", (2) by using task-specific instructions; e.g. "Write a story outline." for writing a novel, or (3) with human inputs.', metadata={'description': 'Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.\nAgent System Overview In a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:', 'language': 'en', 'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'title': "LLM Powered Autonomous Agents | Lil'Log"})
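
If you also want to see how close each match is, the Chroma wrapper exposes similarity_search_with_score, which returns (document, distance) pairs; a minimal sketch:

# Retrieve documents together with their distance scores.
docs_with_scores = vectorstore.similarity_search_with_score(question)
for doc, score in docs_with_scores:
    print(round(score, 3), doc.page_content[:80])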

Model

Llama-v2

You can download a GGUF converted model (e.g., here).
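
If you prefer to script the download, one option (an assumption, not part of the original notebook) is huggingface_hub; the repo and filename below are placeholders, so substitute the model you actually chose:

from huggingface_hub import hf_hub_download

# Placeholder repo/filename -- replace with the GGUF model you want to use.
model_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-13B-chat-GGUF",
    filename="llama-2-13b-chat.Q4_0.gguf",
)
print(model_path)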

Finally, as noted in detail here, install llama-cpp-python:

%pip install --upgrade --quiet  llama-cpp-python

To enable use of GPU on Apple Silicon, follow the steps here to use the Python binding with Metal support.

In particular, ensure that conda is using the correct virtual environment that you created (miniforge3).

E.g., for me:

conda activate /Users/rlm/miniforge3/envs/llama

With this confirmed:

! CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install -U llama-cpp-python --no-cache-dir
from langchain_community.llms import LlamaCpp

Set the model parameters as noted in the llama.cpp docs.

n_gpu_layers = 1  # Metal set to 1 is enough.
n_batch = 512 # Should be between 1 and n_ctx, consider the amount of RAM of your Apple Silicon Chip.

# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path="/Users/rlm/Desktop/Code/llama.cpp/llama-2-13b-chat.ggmlv3.q4_0.bin",
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    n_ctx=2048,
    f16_kv=True,  # MUST set to True, otherwise you will run into problems after a couple of calls
    verbose=True,
)
    llama.cpp: loading model from /Users/rlm/Desktop/Code/llama.cpp/llama-2-13b-chat.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.09 MB
llama_model_load_internal: mem required = 8819.71 MB (+ 1608.00 MB per state)
llama_new_context_with_model: kv self size = 1600.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/rlm/miniforge3/envs/llama/lib/python3.9/site-packages/llama_cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x76add7460
ggml_metal_init: loaded kernel_mul 0x76add5090
ggml_metal_init: loaded kernel_mul_row 0x76addae00
ggml_metal_init: loaded kernel_scale 0x76adb2940
ggml_metal_init: loaded kernel_silu 0x76adb8610
ggml_metal_init: loaded kernel_relu 0x76addb700
ggml_metal_init: loaded kernel_gelu 0x76addc100
ggml_metal_init: loaded kernel_soft_max 0x76addcb80
ggml_metal_init: loaded kernel_diag_mask_inf 0x76addd600
ggml_metal_init: loaded kernel_get_rows_f16 0x295f16380
ggml_metal_init: loaded kernel_get_rows_q4_0 0x295f165e0
ggml_metal_init: loaded kernel_get_rows_q4_1 0x295f16840
ggml_metal_init: loaded kernel_get_rows_q2_K 0x295f16aa0
ggml_metal_init: loaded kernel_get_rows_q3_K 0x295f16d00
ggml_metal_init: loaded kernel_get_rows_q4_K 0x295f16f60
ggml_metal_init: loaded kernel_get_rows_q5_K 0x295f171c0
ggml_metal_init: loaded kernel_get_rows_q6_K 0x295f17420
ggml_metal_init: loaded kernel_rms_norm 0x295f17680
ggml_metal_init: loaded kernel_norm 0x295f178e0
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x295f17b40
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x295f17da0
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x295f18000
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32 0x7962b9900
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32 0x7962bf5f0
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32 0x7962bc630
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32 0x142045960
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32 0x7962ba2b0
ggml_metal_init: loaded kernel_rope 0x7962c35f0
ggml_metal_init: loaded kernel_alibi_f32 0x7962c30b0
ggml_metal_init: loaded kernel_cpy_f32_f16 0x7962c15b0
ggml_metal_init: loaded kernel_cpy_f32_f32 0x7962beb10
ggml_metal_init: loaded kernel_cpy_f16_f16 0x7962bf060
ggml_metal_init: recommendedMaxWorkingSetSize = 21845.34 MB
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: maxTransferRate = built-in GPU
ggml_metal_add_buffer: allocated 'data ' buffer, size = 6984.06 MB, (35852.94 / 21845.34), warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'eval ' buffer, size = 1026.00 MB, (36878.94 / 21845.34), warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 1602.00 MB, (38480.94 / 21845.34), warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'scr0 ' buffer, size = 298.00 MB, (38778.94 / 21845.34), warning: current allocated size is greater than the recommended max working set size
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
ggml_metal_add_buffer: allocated 'scr1 ' buffer, size = 512.00 MB, (39290.94 / 21845.34), warning: current allocated size is greater than the recommended max working set size

Note that these indicate that Metal was enabled properly:

ggml_metal_init: allocating
ggml_metal_init: using MPS
llm.invoke("Simulate a rap battle between Stephen Colbert and John Oliver")
    Llama.generate: prefix-match hit
    
Setting: The Late Show with Stephen Colbert. The studio audience is filled with fans of both comedians, and the energy is electric. The two comedians are seated at a table, ready to begin their epic rap battle.

Stephen Colbert: (smirking) Oh, you think you can take me down, John? You're just a Brit with a funny accent, and I'm the king of comedy!
John Oliver: (grinning) Oh, you think you're tough, Stephen? You're just a has-been from South Carolina, and I'm the future of comedy!
The battle begins, with each comedian delivering clever rhymes and witty insults. Here are a few lines that might be included:
Stephen Colbert: (rapping) You may have a big brain, John, but you can't touch my charm / I've got the audience in stitches, while you're just a blemish on the screen / Your accent is so thick, it's like trying to hear a speech through a mouthful of marshmallows / You may have
    
llama_print_timings: load time = 2201.54 ms
llama_print_timings: sample time = 182.54 ms / 256 runs ( 0.71 ms per token, 1402.41 tokens per second)
llama_print_timings: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_print_timings: eval time = 8484.62 ms / 256 runs ( 33.14 ms per token, 30.17 tokens per second)
llama_print_timings: total time = 9000.62 ms
    "\nSetting: The Late Show with Stephen Colbert. The studio audience is filled with fans of both comedians, and the energy is electric. The two comedians are seated at a table, ready to begin their epic rap battle.\n\nStephen Colbert: (smirking) Oh, you think you can take me down, John? You're just a Brit with a funny accent, and I'm the king of comedy!\nJohn Oliver: (grinning) Oh, you think you're tough, Stephen? You're just a has-been from South Carolina, and I'm the future of comedy!\nThe battle begins, with each comedian delivering clever rhymes and witty insults. Here are a few lines that might be included:\nStephen Colbert: (rapping) You may have a big brain, John, but you can't touch my charm / I've got the audience in stitches, while you're just a blemish on the screen / Your accent is so thick, it's like trying to hear a speech through a mouthful of marshmallows / You may have"

GPT4All

Similarly, we can use GPT4All.

The Model Explorer on the GPT4All site is a great way to choose and download a model.

Then, specify the path to the file you downloaded.

E.g., for me, the model lives here:

/Users/rlm/Desktop/Code/gpt4all/models/nous-hermes-13b.ggmlv3.q4_0.bin

from langchain_community.llms import GPT4All

gpt4all = GPT4All(
    model="/Users/rlm/Desktop/Code/gpt4all/models/nous-hermes-13b.ggmlv3.q4_0.bin",
    max_tokens=2048,
)
    Found model file at  /Users/rlm/Desktop/Code/gpt4all/models/nous-hermes-13b.ggmlv3.q4_0.bin
    objc[47842]: Class GGMLMetalClass is implemented in both /Users/rlm/anaconda3/envs/lcn2/lib/python3.9/site-packages/gpt4all/llmodel_DO_NOT_MODIFY/build/libreplit-mainline-metal.dylib (0x29f48c208) and /Users/rlm/anaconda3/envs/lcn2/lib/python3.9/site-packages/gpt4all/llmodel_DO_NOT_MODIFY/build/libllamamodel-mainline-metal.dylib (0x29f970208). One of the two will be used. Which one is undefined.
llama.cpp: using Metal
llama.cpp: loading model from /Users/rlm/Desktop/Code/gpt4all/models/nous-hermes-13b.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32001
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.09 MB
llama_model_load_internal: mem required = 9031.71 MB (+ 1608.00 MB per state)
llama_new_context_with_model: kv self size = 1600.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/rlm/anaconda3/envs/lcn2/lib/python3.9/site-packages/gpt4all/llmodel_DO_NOT_MODIFY/build/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x115fcbfb0
ggml_metal_init: loaded kernel_mul 0x115fcd4a0
ggml_metal_init: loaded kernel_mul_row 0x115fce850
ggml_metal_init: loaded kernel_scale 0x115fcd700
ggml_metal_init: loaded kernel_silu 0x115fcd960
ggml_metal_init: loaded kernel_relu 0x115fcfd50
ggml_metal_init: loaded kernel_gelu 0x115fd03c0
ggml_metal_init: loaded kernel_soft_max 0x115fcf640
ggml_metal_init: loaded kernel_diag_mask_inf 0x115fd07f0
ggml_metal_init: loaded kernel_get_rows_f16 0x1147b2450
ggml_metal_init: loaded kernel_get_rows_q4_0 0x11479d1d0
ggml_metal_init: loaded kernel_get_rows_q4_1 0x1147ad1f0
ggml_metal_init: loaded kernel_get_rows_q2_k 0x1147aef50
ggml_metal_init: loaded kernel_get_rows_q3_k 0x1147af1b0
ggml_metal_init: loaded kernel_get_rows_q4_k 0x1147af410
ggml_metal_init: loaded kernel_get_rows_q5_k 0x1147affa0
ggml_metal_init: loaded kernel_get_rows_q6_k 0x1147b0200
ggml_metal_init: loaded kernel_rms_norm 0x1147b0460
ggml_metal_init: loaded kernel_norm 0x1147bfc90
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x1147c0230
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x1147c0490
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x1147c06f0
ggml_metal_init: loaded kernel_mul_mat_q2_k_f32 0x1147c0950
ggml_metal_init: loaded kernel_mul_mat_q3_k_f32 0x1147c0bb0
ggml_metal_init: loaded kernel_mul_mat_q4_k_f32 0x1147c0e10
ggml_metal_init: loaded kernel_mul_mat_q5_k_f32 0x1147c1070
ggml_metal_init: loaded kernel_mul_mat_q6_k_f32 0x1147c13d0
ggml_metal_init: loaded kernel_rope 0x1147c1a00
ggml_metal_init: loaded kernel_alibi_f32 0x1147c2120
ggml_metal_init: loaded kernel_cpy_f32_f16 0x115fd1690
ggml_metal_init: loaded kernel_cpy_f32_f32 0x115fd1c60
ggml_metal_init: loaded kernel_cpy_f16_f16 0x115fd2d40
ggml_metal_init: recommendedMaxWorkingSetSize = 21845.34 MB
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: maxTransferRate = built-in GPU
ggml_metal_add_buffer: allocated 'data ' buffer, size = 6984.06 MB, ( 6984.45 / 21845.34)
ggml_metal_add_buffer: allocated 'eval ' buffer, size = 1024.00 MB, ( 8008.45 / 21845.34)
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 1602.00 MB, ( 9610.45 / 21845.34)
ggml_metal_add_buffer: allocated 'scr0 ' buffer, size = 512.00 MB, (10122.45 / 21845.34)
ggml_metal_add_buffer: allocated 'scr1 ' buffer, size = 512.00 MB, (10634.45 / 21845.34)
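
As with LlamaCpp, the GPT4All wrapper can be invoked directly once the model has loaded (a minimal usage sketch, not shown in the original run):

# Invoke the local GPT4All model directly.
gpt4all.invoke("Simulate a rap battle between Stephen Colbert and John Oliver")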

llamafile

One of the simplest ways to run an LLM locally is using a llamafile. All you need to do is:

1) Download a llamafile from HuggingFace
2) Make the file executable
3) Run the file

llamafiles bundle model weights and a specially-compiled version of llama.cpp into a single file that can run on most computers without any additional dependencies. They also come with an embedded inference server that provides an API for interacting with your model.

Here's a simple bash script that shows all 3 setup steps:

# Download a llamafile from HuggingFace
wget https://huggingface.co/jartine/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile

# Make the file executable. On Windows, instead just rename the file to end in ".exe".
chmod +x TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile

# Start the model server. Listens at http://localhost:8080 by default.
./TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile --server --nobrowser

After you run the above setup steps, you can interact with the model via LangChain:

from langchain_community.llms.llamafile import Llamafile

llamafile = Llamafile()

llamafile.invoke("Here is my grandmother's beloved recipe for spaghetti and meatballs:")
    '\n-1 1/2 (8 oz. Pounds) ground beef, browned and cooked until no longer pink\n-3 cups whole wheat spaghetti\n-4 (10 oz) cans diced tomatoes with garlic and basil\n-2 eggs, beaten\n-1 cup grated parmesan cheese\n-1/2 teaspoon salt\n-1/4 teaspoon black pepper\n-1 cup breadcrumbs (16 oz)\n-2 tablespoons olive oil\n\nInstructions:\n1. Cook spaghetti according to package directions. Drain and set aside.\n2. In a large skillet, brown ground beef over medium heat until no longer pink. Drain any excess grease.\n3. Stir in diced tomatoes with garlic and basil, and season with salt and pepper. Cook for 5 to 7 minutes or until sauce is heated through. Set aside.\n4. In a large bowl, beat eggs with a fork or whisk until fluffy. Add cheese, salt, and black pepper. Set aside.\n5. In another bowl, combine breadcrumbs and olive oil. Dip each spaghetti into the egg mixture and then coat in the breadcrumb mixture. Place on baking sheet lined with parchment paper to prevent sticking. Repeat until all spaghetti are coated.\n6. Heat oven to 375 degrees. Bake for 18 to 20 minutes, or until lightly golden brown.\n7. Serve hot with meatballs and sauce on the side. Enjoy!'
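
Under the hood, the Llamafile wrapper talks to the embedded llama.cpp server over HTTP. If you want to call the server without LangChain, a minimal sketch using requests is shown below; it assumes the default port 8080 and the server's /completion endpoint.

import requests

# Call the llamafile's embedded server directly (assumes default http://localhost:8080).
response = requests.post(
    "http://localhost:8080/completion",
    json={"prompt": "Name three uses for a local LLM.", "n_predict": 64},
)
print(response.json()["content"])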

Using in a chain

We can create a summarization chain with either model by passing in the retrieved docs and a simple prompt.

It formats the prompt template using the input key values provided and passes the formatted string to GPT4All, Llama-2, or another specified LLM.

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate

# Prompt
prompt = PromptTemplate.from_template(
    "Summarize the main themes in these retrieved docs: {docs}"
)


# Chain
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


chain = {"docs": format_docs} | prompt | llm | StrOutputParser()

# Run
question = "What are the approaches to Task Decomposition?"
docs = vectorstore.similarity_search(question)
chain.invoke(docs)
    Llama.generate: prefix-match hit
    
Based on the retrieved documents, the main themes are:
1. Task decomposition: The ability to break down complex tasks into smaller subtasks, which can be handled by an LLM or other components of the agent system.
2. LLM as the core controller: The use of a large language model (LLM) as the primary controller of an autonomous agent system, complemented by other key components such as a knowledge graph and a planner.
3. Potentiality of LLM: The idea that LLMs have the potential to be used as powerful general problem solvers, not just for generating well-written copies but also for solving complex tasks and achieving human-like intelligence.
4. Challenges in long-term planning: The challenges in planning over a lengthy history and effectively exploring the solution space, which are important limitations of current LLM-based autonomous agent systems.
    
llama_print_timings: load time = 1191.88 ms
llama_print_timings: sample time = 134.47 ms / 193 runs ( 0.70 ms per token, 1435.25 tokens per second)
llama_print_timings: prompt eval time = 39470.18 ms / 1055 tokens ( 37.41 ms per token, 26.73 tokens per second)
llama_print_timings: eval time = 8090.85 ms / 192 runs ( 42.14 ms per token, 23.73 tokens per second)
llama_print_timings: total time = 47943.12 ms
    '\nBased on the retrieved documents, the main themes are:\n1. Task decomposition: The ability to break down complex tasks into smaller subtasks, which can be handled by an LLM or other components of the agent system.\n2. LLM as the core controller: The use of a large language model (LLM) as the primary controller of an autonomous agent system, complemented by other key components such as a knowledge graph and a planner.\n3. Potentiality of LLM: The idea that LLMs have the potential to be used as powerful general problem solvers, not just for generating well-written copies but also for solving complex tasks and achieving human-like intelligence.\n4. Challenges in long-term planning: The challenges in planning over a lengthy history and effectively exploring the solution space, which are important limitations of current LLM-based autonomous agent systems.'
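
Since the chain only depends on the Runnable interface, you can swap in the GPT4All model defined earlier without changing anything else; a minimal sketch:

# The same summarization chain, backed by the local GPT4All model instead of LlamaCpp.
gpt4all_chain = {"docs": format_docs} | prompt | gpt4all | StrOutputParser()
gpt4all_chain.invoke(docs)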

Q&A

We can also use the LangChain Prompt Hub to store and fetch prompts that are model-specific.

Let's try with a default RAG prompt, here.

from langchain import hub

rag_prompt = hub.pull("rlm/rag-prompt")
rag_prompt.messages
    [HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: {question} \nContext: {context} \nAnswer:"))]
from langchain_core.runnables import RunnablePassthrough, RunnablePick

# Chain
chain = (
    RunnablePassthrough.assign(context=RunnablePick("context") | format_docs)
    | rag_prompt
    | llm
    | StrOutputParser()
)

# Run
chain.invoke({"context": docs, "question": question})
    Llama.generate: prefix-match hit
     Hi there! There are three main approaches to task decomposition. One is using LLM with simple prompting like "Steps for XYZ." or "What are the subgoals for achieving XYZ?" Another approach is by using task-specific instructions, such as "Write a story outline" for writing a novel. Finally, task decomposition can also be done with human inputs. Thanks for asking!
    
llama_print_timings: load time = 1191.88 ms
llama_print_timings: sample time = 61.21 ms / 85 runs ( 0.72 ms per token, 1388.64 tokens per second)
llama_print_timings: prompt eval time = 8014.11 ms / 267 tokens ( 30.02 ms per token, 33.32 tokens per second)
llama_print_timings: eval time = 2908.17 ms / 84 runs ( 34.62 ms per token, 28.88 tokens per second)
llama_print_timings: total time = 11096.23 ms
    {'output_text': ' Hi there! There are three main approaches to task decomposition. One is using LLM with simple prompting like "Steps for XYZ." or "What are the subgoals for achieving XYZ?" Another approach is by using task-specific instructions, such as "Write a story outline" for writing a novel. Finally, task decomposition can also be done with human inputs. Thanks for asking!'}

Now, let's try with a prompt specifically for LLaMA, which includes special tokens.

# Prompt
rag_prompt_llama = hub.pull("rlm/rag-prompt-llama")
rag_prompt_llama.messages
    ChatPromptTemplate(input_variables=['question', 'context'], output_parser=None, partial_variables={}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['question', 'context'], output_parser=None, partial_variables={}, template="[INST]<<SYS>> You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.<</SYS>> \nQuestion: {question} \nContext: {context} \nAnswer: [/INST]", template_format='f-string', validate_template=True), additional_kwargs={})])
# Chain
chain = (
    RunnablePassthrough.assign(context=RunnablePick("context") | format_docs)
    | rag_prompt_llama
    | llm
    | StrOutputParser()
)

# Run
chain.invoke({"context": docs, "question": question})
    Llama.generate: prefix-match hit
      Sure, I'd be happy to help! Based on the context, here are some to task:

1. LLM with simple prompting: This using a large model (LLM) with simple prompts like "Steps for XYZ" or "What are the subgoals for achieving XYZ?" to decompose tasks into smaller steps.
2. Task-specific: Another is to use task-specific, such as "Write a story outline" for writing a novel, to guide the of tasks.
3. Human inputs:, human inputs can be used to supplement the process, in cases where the task a high degree of creativity or expertise.

As fores in long-term and task, one major is that LLMs to adjust plans when faced with errors, making them less robust to humans who learn from trial and error.
    
llama_print_timings: load time = 11326.20 ms
llama_print_timings: sample time = 144.81 ms / 207 runs ( 0.70 ms per token, 1429.47 tokens per second)
llama_print_timings: prompt eval time = 1506.13 ms / 258 tokens ( 5.84 ms per token, 171.30 tokens per second)
llama_print_timings: eval time = 6231.92 ms / 206 runs ( 30.25 ms per token, 33.06 tokens per second)
llama_print_timings: total time = 8158.41 ms
    {'output_text': '  Sure, I\'d be happy to help! Based on the context, here are some to task:\n\n1. LLM with simple prompting: This using a large model (LLM) with simple prompts like "Steps for XYZ" or "What are the subgoals for achieving XYZ?" to decompose tasks into smaller steps.\n2. Task-specific: Another is to use task-specific, such as "Write a story outline" for writing a novel, to guide the of tasks.\n3. Human inputs:, human inputs can be used to supplement the process, in cases where the task a high degree of creativity or expertise.\n\nAs fores in long-term and task, one major is that LLMs to adjust plans when faced with errors, making them less robust to humans who learn from trial and error.'}

Q&A with retrieval

Instead of manually passing in docs, we can automatically retrieve them from our vector store based on the user question.

This will use a default QA prompt (shown here) and retrieve from the vector DB.

retriever = vectorstore.as_retriever()
qa_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | llm
    | StrOutputParser()
)
qa_chain.invoke(question)
    Llama.generate: prefix-match hit
     
The three approaches to Task decomposition are LLMs with simple prompting, task-specific instructions, or human inputs. Thanks for asking!
    
llama_print_timings: load time = 1191.88 ms
llama_print_timings: sample time = 22.78 ms / 31 runs ( 0.73 ms per token, 1360.66 tokens per second)
llama_print_timings: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_print_timings: eval time = 1320.23 ms / 31 runs ( 42.59 ms per token, 23.48 tokens per second)
llama_print_timings: total time = 1387.70 ms
    {'query': 'What are the approaches to Task Decomposition?',
'result': ' \nThe three approaches to Task decomposition are LLMs with simple prompting, task-specific instructions, or human inputs. Thanks for asking!'}
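
You can also control how many chunks the retriever passes to the model, for example by setting k when creating the retriever; a minimal sketch:

# Retrieve more (or fewer) chunks per question by adjusting k.
retriever = vectorstore.as_retriever(search_kwargs={"k": 6})
qa_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | llm
    | StrOutputParser()
)
qa_chain.invoke(question)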