How to handle long text

Prerequisites

This guide assumes familiarity with the following:

When working with files, like PDFs, you're liable to encounter text that exceeds your language model's context window. To process this text, consider these strategies:

  1. Change LLMs: Choose a different LLM that supports a larger context window (see the sketch after this list).
  2. Brute force: Chunk the document, and extract content from each chunk.
  3. RAG: Chunk the document, index the chunks, and only extract content from a subset of chunks that look "relevant".

Keep in mind that these strategies have different trade-offs, and the best strategy likely depends on the application that you're designing!
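As a quick illustration of strategy 1, consider this minimal sketch: with a chat model integration such as @langchain/openai, selecting a model with a larger context window is typically just a constructor parameter. The model name below is illustrative only; check your provider's documentation for current model names and context window sizes.

import { ChatOpenAI } from "@langchain/openai";

// Swapping in a model with a larger context window is just a
// constructor parameter. "gpt-4-turbo" is illustrative only.
const largeContextLlm = new ChatOpenAI({
  model: "gpt-4-turbo",
  temperature: 0,
});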

Setup

First, let's install some required dependencies:

yarn add langchain @langchain/community @langchain/core @langchain/openai zod cheerio

Next, we need some example data! Let's download an article about cars from Wikipedia and load it as a LangChain Document:

import { CheerioWebBaseLoader } from "@langchain/community/document_loaders/web/cheerio";
// Only required in a Deno notebook environment to load the peer dep.
import "cheerio";

const loader = new CheerioWebBaseLoader("https://en.wikipedia.org/wiki/Car");

const docs = await loader.load();

docs[0].pageContent.length;
97336

Define the schema

Here, we'll define the schema to extract key developments from the text.

import { z } from "zod";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { ChatOpenAI } from "@langchain/openai";

const keyDevelopmentSchema = z
  .object({
    year: z
      .number()
      .describe("The year when there was an important historic development."),
    description: z
      .string()
      .describe("What happened in this year? What was the development?"),
    evidence: z
      .string()
      .describe(
        "Repeat verbatim the sentence(s) from which the year and description information were extracted"
      ),
  })
  .describe("Information about a development in the history of cars.");

const extractionDataSchema = z
  .object({
    key_developments: z.array(keyDevelopmentSchema),
  })
  .describe(
    "Extracted information about key developments in the history of cars"
  );

const SYSTEM_PROMPT_TEMPLATE = [
  "You are an expert at identifying key historic development in text.",
  "Only extract important historic developments. Extract nothing if no important information can be found in the text.",
].join("\n");

// Define a custom prompt to provide instructions and any additional context.
// 1) You can add examples into the prompt template to improve extraction quality
// 2) Introduce additional parameters to take context into account (e.g., include metadata
//    about the document from which the text was extracted.)
const prompt = ChatPromptTemplate.fromMessages([
  ["system", SYSTEM_PROMPT_TEMPLATE],
  // Keep on reading through this use case to see how to use examples to improve performance
  // MessagesPlaceholder('examples'),
  ["human", "{text}"],
]);

// We will be using tool calling mode, which
// requires a tool calling capable model.
const llm = new ChatOpenAI({
  model: "gpt-4-0125-preview",
  temperature: 0,
});

const extractionChain = prompt.pipe(
  llm.withStructuredOutput(extractionDataSchema)
);

Brute force approach

Split the document into chunks such that each chunk fits into the context window of the LLM.

import { TokenTextSplitter } from "langchain/text_splitter";

const textSplitter = new TokenTextSplitter({
  chunkSize: 2000,
  chunkOverlap: 20,
});

// Note that this method takes an array of docs
const splitDocs = await textSplitter.splitDocuments(docs);

Use the .batch method, present on all Runnables, to run the extraction in parallel across the chunks!

Tip

You can often use .batch() to parallelize the extractions!

If your model is exposed via an API, this will likely speed up your extraction flow.

// Limit just to the first 3 chunks
// so the code can be re-run quickly
const firstFewTexts = splitDocs.slice(0, 3).map((doc) => doc.pageContent);

const extractionChainParams = firstFewTexts.map((text) => {
  return { text };
});

const results = await extractionChain.batch(extractionChainParams, {
  maxConcurrency: 5,
});

Merge results

After extracting data from across the chunks, we'll want to merge the extractions together.

const keyDevelopments = results.flatMap((result) => result.key_developments);

keyDevelopments.slice(0, 20);
[
  { year: 0, description: "", evidence: "" },
  {
    year: 1769,
    description: "French inventor Nicolas-Joseph Cugnot built the first steam-powered road vehicle.",
    evidence: "French inventor Nicolas-Joseph Cugnot built the first steam-powered road vehicle in 1769."
  },
  {
    year: 1808,
    description: "French-born Swiss inventor François Isaac de Rivaz designed and constructed the first internal combu"... 25 more characters,
    evidence: "French-born Swiss inventor François Isaac de Rivaz designed and constructed the first internal combu"... 33 more characters
  },
  {
    year: 1886,
    description: "German inventor Carl Benz patented his Benz Patent-Motorwagen, inventing the modern car—a practical,"... 40 more characters,
    evidence: "The modern car—a practical, marketable automobile for everyday use—was invented in 1886, when German"... 56 more characters
  },
  {
    year: 1908,
    description: "The 1908 Model T, an American car manufactured by the Ford Motor Company, became one of the first ca"... 28 more characters,
    evidence: "One of the first cars affordable by the masses was the 1908 Model T, an American car manufactured by"... 24 more characters
  }
]

RAG based approach

Another simple idea is to chunk up the text, but instead of extracting information from every chunk, just focus on the most relevant chunks.

Caution

It can be difficult to identify which chunks are relevant.

For example, in the car article we're using here, most of the article contains key development information. So by using RAG, we'll likely be throwing out a lot of relevant information.

We suggest experimenting with your own use case and determining whether this approach works or not.

Here's a simple example that relies on an in-memory demo vector store, MemoryVectorStore.

import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { OpenAIEmbeddings } from "@langchain/openai";

// Only load the first 10 docs for speed in this demo use-case
const vectorstore = await MemoryVectorStore.fromDocuments(
  splitDocs.slice(0, 10),
  new OpenAIEmbeddings()
);

// Only extract from top document
const retriever = vectorstore.asRetriever({ k: 1 });

In this case the RAG extractor is only looking at the top document.

import { RunnableSequence } from "@langchain/core/runnables";

const ragExtractor = RunnableSequence.from([
  {
    text: retriever.pipe((docs) => docs[0].pageContent),
  },
  extractionChain,
]);

const results = await ragExtractor.invoke(
  "Key developments associated with cars"
);
results.key_developments;
[
  {
    year: 2020,
    description: "The lifetime of a car built in the 2020s is expected to be about 16 years, or about 2 million km (1."... 33 more characters,
    evidence: "The lifetime of a car built in the 2020s is expected to be about 16 years, or about 2 millionkm (1.2"... 31 more characters
  },
  {
    year: 2030,
    description: "All fossil fuel vehicles will be banned in Amsterdam from 2030.",
    evidence: "all fossil fuel vehicles will be banned in Amsterdam from 2030."
  },
  {
    year: 2020,
    description: "In 2020, there were 56 million cars manufactured worldwide, down from 67 million the previous year.",
    evidence: "In 2020, there were 56 million cars manufactured worldwide, down from 67 million the previous year."
  }
]

Common issues

Different methods have their own pros and cons related to cost, speed, and accuracy.

Watch out for these issues:

  • Chunking the content means that the LLM can fail to extract information if the information is spread across multiple chunks.
  • Large chunk overlap may cause the same information to be extracted twice, so be prepared to de-duplicate! See the sketch after this list.
  • LLMs can make up data. If looking for a single fact in a large text and using a brute force approach, you may end up getting more made-up data.
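As a minimal sketch of de-duplication, assuming the merged keyDevelopments array from the brute force section above: treat two extractions as duplicates when they share a year and a normalized description. Note that this simple key won't catch near-duplicates that are worded differently; those would require fuzzy matching or another LLM pass.

// Hypothetical de-duplication pass over the merged extractions.
// Two entries count as duplicates when they share the same year
// and the same normalized description text.
const seen = new Set<string>();
const dedupedDevelopments = keyDevelopments.filter((dev) => {
  const key = `${dev.year}::${dev.description.trim().toLowerCase()}`;
  if (seen.has(key)) return false;
  seen.add(key);
  return true;
});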

Next steps

You've now learned strategies for handling long text when doing extraction.

Next, check out some of the other guides in this section, such as some tips on how to improve extraction quality with examples.

