How to handle long text
Prerequisites
This guide assumes familiarity with the following concepts.
When working with files like PDFs, you may run into text that exceeds your language model's context window. To process such text, consider the following strategies:
- Change the LLM: choose a different LLM that supports a larger context window.
- Brute force: chunk the document, and extract content from each chunk.
- RAG: chunk the document, index the chunks, and only extract content from a subset of chunks that look "relevant".
Keep in mind that these strategies have different trade-offs, and the best strategy likely depends on the application you're designing!
Setup
First, let's install some required dependencies:
Tip
See this section for general instructions on installing integration packages.
npm i @langchain/openai @langchain/core zod cheerio
yarn add @langchain/openai @langchain/core zod cheerio
pnpm add @langchain/openai @langchain/core zod cheerio
Next, we need some example data! Let's download an article about cars from Wikipedia and load it as a LangChain Document.
import { CheerioWebBaseLoader } from "@langchain/community/document_loaders/web/cheerio";
// Only required in a Deno notebook environment to load the peer dep.
import "cheerio";
const loader = new CheerioWebBaseLoader("https://en.wikipedia.org/wiki/Car");
const docs = await loader.load();
docs[0].pageContent.length;
97336
Define the schema
Here, we'll define a schema to extract key developments from the text.
import { z } from "zod";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { ChatOpenAI } from "@langchain/openai";
const keyDevelopmentSchema = z
.object({
year: z
.number()
.describe("The year when there was an important historic development."),
description: z
.string()
.describe("What happened in this year? What was the development?"),
evidence: z
.string()
.describe(
"Repeat verbatim the sentence(s) from which the year and description information were extracted"
),
})
.describe("Information about a development in the history of cars.");
const extractionDataSchema = z
.object({
key_developments: z.array(keyDevelopmentSchema),
})
.describe(
"Extracted information about key developments in the history of cars"
);
const SYSTEM_PROMPT_TEMPLATE = [
"You are an expert at identifying key historic development in text.",
"Only extract important historic developments. Extract nothing if no important information can be found in the text.",
].join("\n");
// Define a custom prompt to provide instructions and any additional context.
// 1) You can add examples into the prompt template to improve extraction quality
// 2) Introduce additional parameters to take context into account (e.g., include metadata
// about the document from which the text was extracted.)
const prompt = ChatPromptTemplate.fromMessages([
["system", SYSTEM_PROMPT_TEMPLATE],
// Keep on reading through this use case to see how to use examples to improve performance
// MessagesPlaceholder('examples'),
["human", "{text}"],
]);
// We will be using tool calling mode, which
// requires a tool calling capable model.
const llm = new ChatOpenAI({
model: "gpt-4-0125-preview",
temperature: 0,
});
const extractionChain = prompt.pipe(
llm.withStructuredOutput(extractionDataSchema)
);
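Before running the chain over long documents, it can help to sanity-check it on a short snippet. A minimal sketch (the sample sentence below is just for illustration, and this assumes your OPENAI_API_KEY environment variable is set):
// Sanity-check the chain on a short sample snippet
// before scaling up to real chunks.
const sampleText =
  "In 1913, Ford introduced the moving assembly line, greatly reducing the time needed to build a car.";
const sanityCheck = await extractionChain.invoke({ text: sampleText });
console.log(sanityCheck.key_developments);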
Brute force approach
Split the document into chunks so that each chunk fits within the LLM's context window.
import { TokenTextSplitter } from "langchain/text_splitter";
const textSplitter = new TokenTextSplitter({
chunkSize: 2000,
chunkOverlap: 20,
});
// Note that this method takes an array of docs
const splitDocs = await textSplitter.splitDocuments(docs);
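If you want to verify how the splitter behaved before running extraction, a quick look at the chunk count and sizes can help you tune chunkSize and chunkOverlap:
// Inspect the number of chunks and the size of the first one.
console.log(splitDocs.length);
console.log(splitDocs[0].pageContent.length);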
Use the .batch method, which exists on all runnables, to run the extraction in **parallel** across the chunks!
Tip
You can often use .batch() to parallelize the extractions! If your model is exposed via an API, this will likely speed up your extraction flow.
// Limit just to the first 3 chunks
// so the code can be re-run quickly
const firstFewTexts = splitDocs.slice(0, 3).map((doc) => doc.pageContent);
const extractionChainParams = firstFewTexts.map((text) => {
return { text };
});
const results = await extractionChain.batch(extractionChainParams, {
maxConcurrency: 5,
});
Merge results
After extracting data from across the chunks, we'll want to merge the extractions together.
const keyDevelopments = results.flatMap((result) => result.key_developments);
keyDevelopments.slice(0, 20);
[
{ year: 0, description: "", evidence: "" },
{
year: 1769,
description: "French inventor Nicolas-Joseph Cugnot built the first steam-powered road vehicle.",
evidence: "French inventor Nicolas-Joseph Cugnot built the first steam-powered road vehicle in 1769."
},
{
year: 1808,
description: "French-born Swiss inventor François Isaac de Rivaz designed and constructed the first internal combu"... 25 more characters,
evidence: "French-born Swiss inventor François Isaac de Rivaz designed and constructed the first internal combu"... 33 more characters
},
{
year: 1886,
description: "German inventor Carl Benz patented his Benz Patent-Motorwagen, inventing the modern car—a practical,"... 40 more characters,
evidence: "The modern car—a practical, marketable automobile for everyday use—was invented in 1886, when German"... 56 more characters
},
{
year: 1908,
description: "The 1908 Model T, an American car manufactured by the Ford Motor Company, became one of the first ca"... 28 more characters,
evidence: "One of the first cars affordable by the masses was the 1908 Model T, an American car manufactured by"... 24 more characters
}
]
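Notice the empty first record (year: 0): when a chunk contains nothing extractable, the model may emit placeholder entries. A hedged filtering pass can drop them (the exact emptiness check is an assumption; adjust it to your schema):
// Drop placeholder records where the model found nothing to extract.
const nonEmptyDevelopments = keyDevelopments.filter(
  (dev) => dev.year !== 0 && dev.description.length > 0
);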
RAG-based approach
Another simple idea is to chunk the text, but instead of extracting information from every chunk, to focus only on the most relevant chunks.
Caution
It can be difficult to identify which chunks are relevant.
For example, in the car article we're using here, most of the article contains key development information, so by using RAG we'll likely be throwing out a lot of relevant information.
We suggest experimenting with your use case and determining whether this approach works or not.
Here is a simple example that relies on the in-memory MemoryVectorStore vector store for demonstration purposes.
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { OpenAIEmbeddings } from "@langchain/openai";
// Only load the first 10 docs for speed in this demo use-case
const vectorstore = await MemoryVectorStore.fromDocuments(
splitDocs.slice(0, 10),
new OpenAIEmbeddings()
);
// Only extract from top document
const retriever = vectorstore.asRetriever({ k: 1 });
In this case, the RAG extractor only looks at the top document.
import { RunnableSequence } from "@langchain/core/runnables";
const ragExtractor = RunnableSequence.from([
{
text: retriever.pipe((docs) => docs[0].pageContent),
},
extractionChain,
]);
const ragExtractorResults = await ragExtractor.invoke(
"Key developments associated with cars"
);
ragExtractorResults.key_developments;
[
{
year: 2020,
description: "The lifetime of a car built in the 2020s is expected to be about 16 years, or about 2 million km (1."... 33 more characters,
evidence: "The lifetime of a car built in the 2020s is expected to be about 16 years, or about 2 millionkm (1.2"... 31 more characters
},
{
year: 2030,
description: "All fossil fuel vehicles will be banned in Amsterdam from 2030.",
evidence: "all fossil fuel vehicles will be banned in Amsterdam from 2030."
},
{
year: 2020,
description: "In 2020, there were 56 million cars manufactured worldwide, down from 67 million the previous year.",
evidence: "In 2020, there were 56 million cars manufactured worldwide, down from 67 million the previous year."
}
]
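If a single chunk is too narrow a view, one possible variant joins the top few retrieved chunks before extraction instead of reading only the first. A sketch (the choice of k = 3 is arbitrary):
// A variant that concatenates the top 3 retrieved chunks before extraction.
const multiDocRetriever = vectorstore.asRetriever({ k: 3 });
const multiDocExtractor = RunnableSequence.from([
  {
    text: multiDocRetriever.pipe((docs) =>
      docs.map((doc) => doc.pageContent).join("\n\n")
    ),
  },
  extractionChain,
]);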
Common issues
Different approaches have their own pros and cons related to cost, speed, and accuracy.
Watch out for these issues:
- Chunking the content means that the LLM may fail to extract information that is spread across multiple chunks.
- Large chunk overlap may cause the same information to be extracted twice, so be prepared to de-duplicate (see the sketch after this list)!
- LLMs can make up data. If you're looking for a single fact in a large body of text and using a brute-force approach, you may end up getting more made-up data.
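For example, a simple de-duplication pass can key on the year plus the description (the key choice is an assumption; pick whatever uniquely identifies a record in your schema):
// De-duplicate extracted developments by a (year, description) key.
// The key choice is an assumption; adjust it to your schema.
const seen = new Set<string>();
const dedupedDevelopments = keyDevelopments.filter((dev) => {
  const key = `${dev.year}:${dev.description}`;
  if (seen.has(key)) return false;
  seen.add(key);
  return true;
});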
Next steps
You've now learned how to handle long text when extracting information.
As a next step, check out the other guides in this section, such as some tips on how to improve extraction quality with examples.