How to handle long text
Prerequisites
This guide assumes familiarity with the following concepts.
When working with files like PDFs, you may run into text that exceeds your language model's context window. To process such text, consider the following strategies:
- Change the LLM: choose a different LLM that supports a larger context window.
- Brute force: chunk the document, and extract content from each chunk.
- RAG: chunk the document, index the chunks, and only extract content from a subset of chunks that look "relevant".
Keep in mind that these strategies have different trade-offs, and the best strategy likely depends on the application you're designing!
Setup
First, let's install some required dependencies:
Tip
See this section for general instructions on installing integration packages.
npm i @langchain/openai @langchain/core zod cheerio
yarn add @langchain/openai @langchain/core zod cheerio
pnpm add @langchain/openai @langchain/core zod cheerio
Next, we need some example data! Let's download an article about cars from Wikipedia and load it as a LangChain Document.
import { CheerioWebBaseLoader } from "@langchain/community/document_loaders/web/cheerio";
// Only required in a Deno notebook environment to load the peer dep.
import "cheerio";
const loader = new CheerioWebBaseLoader("https://en.wikipedia.org/wiki/Car");
const docs = await loader.load();
docs[0].pageContent.length;
97336
Define the schema
Here, we'll define a schema to extract key developments from the text.
import { z } from "zod";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { ChatOpenAI } from "@langchain/openai";
const keyDevelopmentSchema = z
.object({
year: z
.number()
.describe("The year when there was an important historic development."),
description: z
.string()
.describe("What happened in this year? What was the development?"),
evidence: z
.string()
.describe(
"Repeat verbatim the sentence(s) from which the year and description information were extracted"
),
})
.describe("Information about a development in the history of cars.");
const extractionDataSchema = z
.object({
key_developments: z.array(keyDevelopmentSchema),
})
.describe(
"Extracted information about key developments in the history of cars"
);
const SYSTEM_PROMPT_TEMPLATE = [
"You are an expert at identifying key historic development in text.",
"Only extract important historic developments. Extract nothing if no important information can be found in the text.",
].join("\n");
// Define a custom prompt to provide instructions and any additional context.
// 1) You can add examples into the prompt template to improve extraction quality
// 2) Introduce additional parameters to take context into account (e.g., include metadata
// about the document from which the text was extracted.)
const prompt = ChatPromptTemplate.fromMessages([
["system", SYSTEM_PROMPT_TEMPLATE],
// Keep on reading through this use case to see how to use examples to improve performance
// MessagesPlaceholder('examples'),
["human", "{text}"],
]);
// We will be using tool calling mode, which
// requires a tool calling capable model.
const llm = new ChatOpenAI({
model: "gpt-4-0125-preview",
temperature: 0,
});
const extractionChain = prompt.pipe(
llm.withStructuredOutput(extractionDataSchema)
);
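Before running the chain over long documents, it can help to sanity-check it on a short snippet. A minimal sketch (the sample sentence below is just for illustration, and this assumes your OPENAI_API_KEY environment variable is set):
// Sanity-check the chain on a short sample snippet
// before scaling up to real chunks.
const sampleText =
  "In 1913, Ford introduced the moving assembly line, greatly reducing the time needed to build a car.";
const sanityCheck = await extractionChain.invoke({ text: sampleText });
console.log(sanityCheck.key_developments);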
Brute force approach
Split the document into chunks so that each chunk fits within the LLM's context window.
import { TokenTextSplitter } from "langchain/text_splitter";
const textSplitter = new TokenTextSplitter({
chunkSize: 2000,
chunkOverlap: 20,
});
// Note that this method takes an array of docs
const splitDocs = await textSplitter.splitDocuments(docs);
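If you want to verify how the splitter behaved before running extraction, a quick look at the chunk count and sizes can help you tune chunkSize and chunkOverlap:
// Inspect the number of chunks and the size of the first one.
console.log(splitDocs.length);
console.log(splitDocs[0].pageContent.length);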
Use the .batch method, which exists on all runnables, to run the extraction in **parallel** across the chunks!
Tip
You can often use .batch() to parallelize the extractions! If your model is exposed via an API, this will likely speed up your extraction flow.
// Limit just to the first 3 chunks
// so the code can be re-run quickly
const firstFewTexts = splitDocs.slice(0, 3).map((doc) => doc.pageContent);
const extractionChainParams = firstFewTexts.map((text) => {
return { text };
});
const results = await extractionChain.batch(extractionChainParams, {
maxConcurrency: 5,
});
Merge results
After extracting data from across the chunks, we'll want to merge the extractions together.
const keyDevelopments = results.flatMap((result) => result.key_developments);
keyDevelopments.slice(0, 20);
[
{ year: 0, description: "", evidence: "" },
{
year: 1769,
description: "French inventor Nicolas-Joseph Cugnot built the first steam-powered road vehicle.",
evidence: "French inventor Nicolas-Joseph Cugnot built the first steam-powered road vehicle in 1769."
},
{
year: 1808,
description: "French-born Swiss inventor François Isaac de Rivaz designed and constructed the first internal combu"... 25 more characters,
evidence: "French-born Swiss inventor François Isaac de Rivaz designed and constructed the first internal combu"... 33 more characters
},
{
year: 1886,
description: "German inventor Carl Benz patented his Benz Patent-Motorwagen, inventing the modern car—a practical,"... 40 more characters,
evidence: "The modern car—a practical, marketable automobile for everyday use—was invented in 1886, when German"... 56 more characters
},
{
year: 1908,
description: "The 1908 Model T, an American car manufactured by the Ford Motor Company, became one of the first ca"... 28 more characters,
evidence: "One of the first cars affordable by the masses was the 1908 Model T, an American car manufactured by"... 24 more characters
}
]
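Notice the empty first record (year: 0): when a chunk contains nothing extractable, the model may emit placeholder entries. A hedged filtering pass can drop them (the exact emptiness check is an assumption; adjust it to your schema):
// Drop placeholder records where the model found nothing to extract.
const nonEmptyDevelopments = keyDevelopments.filter(
  (dev) => dev.year !== 0 && dev.description.length > 0
);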
RAG-based approach
Another simple idea is to chunk the text, but instead of extracting information from every chunk, to focus only on the most relevant chunks.
Caution
It can be difficult to identify which chunks are relevant.
For example, in the car article we're using here, most of the article contains key development information, so by using RAG we'll likely be throwing out a lot of relevant information.
We suggest experimenting with your use case and determining whether this approach works or not.
Here is a simple example that relies on the in-memory MemoryVectorStore vector store for demonstration purposes.
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { OpenAIEmbeddings } from "@langchain/openai";
// Only load the first 10 docs for speed in this demo use-case
const vectorstore = await MemoryVectorStore.fromDocuments(
splitDocs.slice(0, 10),
new OpenAIEmbeddings()
);
// Only extract from top document
const retriever = vectorstore.asRetriever({ k: 1 });
In this case, the RAG extractor only looks at the top document.
import { RunnableSequence } from "@langchain/core/runnables";
const ragExtractor = RunnableSequence.from([
{
text: retriever.pipe((docs) => docs[0].pageContent),
},
extractionChain,
]);
const ragExtractorResults = await ragExtractor.invoke(
"Key developments associated with cars"
);
ragExtractorResults.key_developments;
[
{
year: 2020,
description: "The lifetime of a car built in the 2020s is expected to be about 16 years, or about 2 million km (1."... 33 more characters,
evidence: "The lifetime of a car built in the 2020s is expected to be about 16 years, or about 2 millionkm (1.2"... 31 more characters
},
{
year: 2030,
description: "All fossil fuel vehicles will be banned in Amsterdam from 2030.",
evidence: "all fossil fuel vehicles will be banned in Amsterdam from 2030."
},
{
year: 2020,
description: "In 2020, there were 56 million cars manufactured worldwide, down from 67 million the previous year.",
evidence: "In 2020, there were 56 million cars manufactured worldwide, down from 67 million the previous year."
}
]
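If a single chunk is too narrow a view, one possible variant joins the top few retrieved chunks before extraction instead of reading only the first. A sketch (the choice of k = 3 is arbitrary):
// A variant that concatenates the top 3 retrieved chunks before extraction.
const multiDocRetriever = vectorstore.asRetriever({ k: 3 });
const multiDocExtractor = RunnableSequence.from([
  {
    text: multiDocRetriever.pipe((docs) =>
      docs.map((doc) => doc.pageContent).join("\n\n")
    ),
  },
  extractionChain,
]);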
Common issues
Different approaches have their own pros and cons related to cost, speed, and accuracy.
Watch out for these issues:
- Chunking the content means that the LLM may fail to extract information that is spread across multiple chunks.
- Large chunk overlap may cause the same information to be extracted twice, so be prepared to de-duplicate (see the sketch after this list)!
- LLMs can make up data. If you're looking for a single fact in a large body of text and using a brute-force approach, you may end up getting more made-up data.
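For example, a simple de-duplication pass can key on the year plus the description (the key choice is an assumption; pick whatever uniquely identifies a record in your schema):
// De-duplicate extracted developments by a (year, description) key.
// The key choice is an assumption; adjust it to your schema.
const seen = new Set<string>();
const dedupedDevelopments = keyDevelopments.filter((dev) => {
  const key = `${dev.year}:${dev.description}`;
  if (seen.has(key)) return false;
  seen.add(key);
  return true;
});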
Next steps
You've now learned how to handle long text when extracting information.
As a next step, check out the other guides in this section, such as some tips on how to improve extraction quality with examples.