
Build a PDF ingestion and question-answering system


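PDF files often hold crucial unstructured data that is unavailable from other sources. They can be quite long and, unlike plain text files, cannot generally be fed directly into the prompt of a language model.

In this tutorial, you'll create a system that can answer questions about PDF files. More specifically, you'll use a document loader to load text in a format an LLM can work with more easily, then build a retrieval-augmented generation (RAG) pipeline to answer questions, including citations from the source material.

This tutorial glosses over some concepts covered in more depth in our RAG tutorial, so you may want to read that first if you haven't already.

Let's dive in!

Loading documents

First, you'll need to choose a PDF to load. We'll use a document from Nike's annual public SEC report. It's over 100 pages long and contains some crucial data mixed with longer explanatory text. However, feel free to use a PDF of your choosing.

Once you've chosen your PDF, the next step is to load it into a format that an LLM can handle more easily, since LLMs generally require text inputs. LangChain has a few different built-in document loaders for this purpose which you can experiment with. Below, we'll use one powered by the pdf-parse package that reads from a file path: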

import "pdf-parse"; // Peer dep
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";

const loader = new PDFLoader("../../data/nke-10k-2023.pdf");

const docs = await loader.load();

console.log(docs.length);
107
console.log(docs[0].pageContent.slice(0, 100));
console.log(docs[0].metadata);
Table of Contents
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
FORM 10-K

{
source: '../../data/nke-10k-2023.pdf',
pdf: {
version: '1.10.100',
info: {
PDFFormatVersion: '1.4',
IsAcroFormPresent: false,
IsXFAPresent: false,
Title: '0000320187-23-000039',
Author: 'EDGAR Online, a division of Donnelley Financial Solutions',
Subject: 'Form 10-K filed on 2023-07-20 for the period ending 2023-05-31',
Keywords: '0000320187-23-000039; ; 10-K',
Creator: 'EDGAR Filing HTML Converter',
Producer: 'EDGRpdf Service w/ EO.Pdf 22.0.40.0',
CreationDate: "D:20230720162200-04'00'",
ModDate: "D:20230720162208-04'00'"
},
metadata: null,
totalPages: 107
},
loc: { pageNumber: 1 }
}

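So what just happened?

  • The loader reads the PDF at the specified path into memory.
  • It then extracts text data using the pdf-parse package.
  • Finally, it creates a LangChain Document for each page of the PDF, containing the page's content and some metadata about where in the document the text came from.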
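
To make the last step concrete, here is a minimal sketch of how per-page text could be wrapped into documents with page metadata. The `PageDocument` interface and the `pagesToDocuments` helper are hypothetical names for illustration, modeled on the metadata printed above; this is not the loader's actual implementation.

```typescript
// Hypothetical document shape, modeled on the printed output above.
// It is not LangChain's actual Document class.
interface PageDocument {
  pageContent: string;
  metadata: { source: string; loc: { pageNumber: number } };
}

// Wrap each page's extracted text in its own document, numbering pages from 1.
function pagesToDocuments(source: string, pageTexts: string[]): PageDocument[] {
  return pageTexts.map((text, index) => ({
    pageContent: text,
    metadata: { source, loc: { pageNumber: index + 1 } },
  }));
}

// Placeholder page texts, assumed to have been extracted already.
const exampleDocs = pagesToDocuments("example.pdf", [
  "Text of the first page",
  "Text of the second page",
]);

console.log(exampleDocs.length);
console.log(exampleDocs.map((doc) => doc.metadata.loc.pageNumber));
```

Keeping the page number next to the text is what later makes it possible to point back to the page an answer came from.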

LangChain has many other document loaders for other data sources, or you can create a custom document loader.

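Question answering with RAG

Next, you'll prepare the loaded documents for later retrieval. Using a text splitter, you'll split your loaded documents into smaller documents that fit more easily into an LLM's context window, then load them into a vector store. You can then create a retriever from the vector store for use in our RAG chain.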
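
As a rough illustration of what splitting with overlap means, the sketch below cuts a string into fixed-size character windows that share some characters with their neighbors. It is a deliberate simplification: `splitWithOverlap` is a hypothetical helper, and a real splitter such as RecursiveCharacterTextSplitter also tries to break on natural separators like paragraphs and sentences rather than at arbitrary character offsets. The parameter names mirror the `chunkSize` and `chunkOverlap` options used in this tutorial.

```typescript
// Simplified stand-in for a text splitter: fixed-size character windows,
// where consecutive windows share `chunkOverlap` characters.
function splitWithOverlap(
  text: string,
  chunkSize: number,
  chunkOverlap: number
): string[] {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error("Require chunkSize > 0 and 0 <= chunkOverlap < chunkSize");
  }
  const step = chunkSize - chunkOverlap; // how far each window advances
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // this window reached the end
  }
  return chunks;
}

const sampleChunks = splitWithOverlap("abcdefghij", 4, 2);
console.log(sampleChunks); // [ 'abcd', 'cdef', 'efgh', 'ghij' ]
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.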

Pick your chat model

Install dependencies

yarn add @langchain/openai 

Add environment variables

OPENAI_API_KEY=your-api-key

Instantiate the model

import { ChatOpenAI } from "@langchain/openai";

const model = new ChatOpenAI({ model: "gpt-4o" });

import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { OpenAIEmbeddings } from "@langchain/openai";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

const textSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
});

const splits = await textSplitter.splitDocuments(docs);

const vectorstore = await MemoryVectorStore.fromDocuments(
  splits,
  new OpenAIEmbeddings()
);

const retriever = vectorstore.asRetriever();
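
Conceptually, a retriever over an in-memory vector store embeds the query and returns the stored chunks whose vectors are most similar to it, typically by cosine similarity. The sketch below shows that idea with hand-made two-dimensional vectors. `EmbeddedChunk`, `cosineSimilarity`, and `topK` are hypothetical names, and the toy vectors are invented for illustration; real embeddings come from an embedding model and have many more dimensions.

```typescript
// A chunk of text paired with its (toy) embedding vector.
interface EmbeddedChunk {
  text: string;
  vector: number[];
}

// Cosine similarity: dot product divided by the product of vector lengths.
function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, x, i) => sum + x * (b[i] ?? 0), 0);
  const normA = Math.sqrt(a.reduce((sum, x) => sum + x * x, 0));
  const normB = Math.sqrt(b.reduce((sum, x) => sum + x * x, 0));
  const denominator = normA * normB;
  return denominator === 0 ? 0 : dot / denominator;
}

// Return the k chunks most similar to the query vector, best match first.
function topK(query: number[], chunks: EmbeddedChunk[], k: number): EmbeddedChunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((scored) => scored.chunk);
}

// Made-up vectors, for illustration only.
const toyChunks: EmbeddedChunk[] = [
  { text: "revenue figures", vector: [1, 0] },
  { text: "store locations", vector: [0, 1] },
  { text: "revenue growth", vector: [0.9, 0.1] },
];

const nearest = topK([1, 0.05], toyChunks, 2).map((chunk) => chunk.text);
console.log(nearest); // [ 'revenue figures', 'revenue growth' ]
```

A linear scan like this is fine for a single PDF held in memory; larger collections usually call for a dedicated vector database with an index.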

Finally, you'll use some built-in helpers to construct the final ragChain:

import { createRetrievalChain } from "langchain/chains/retrieval";
import { createStuffDocumentsChain } from "langchain/chains/combine_documents";
import { ChatPromptTemplate } from "@langchain/core/prompts";

// System prompt: instructions followed by the retrieved context.
const systemTemplate = [
  `You are an assistant for question-answering tasks. `,
  `Use the following pieces of retrieved context to answer `,
  `the question. If you don't know the answer, say that you `,
  `don't know. Use three sentences maximum and keep the `,
  `answer concise.`,
  `\n\n`,
  `{context}`,
].join("");

const prompt = ChatPromptTemplate.fromMessages([
  ["system", systemTemplate],
  ["human", "{input}"],
]);

// `model` is the chat model instantiated earlier.
const questionAnswerChain = await createStuffDocumentsChain({ llm: model, prompt });
const ragChain = await createRetrievalChain({
  retriever,
  combineDocsChain: questionAnswerChain,
});
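
To show what "stuffing" documents amounts to, the sketch below joins the retrieved chunks into one string and substitutes it for the `{context}` placeholder in a system template. `RetrievedDoc` and `buildStuffedPrompt` are hypothetical names, and joining with a blank line is an assumption about a reasonable separator; the built-in helper handles this formatting for you and may differ in detail.

```typescript
// Minimal shape of a retrieved document for this sketch.
interface RetrievedDoc {
  pageContent: string;
}

// Join all retrieved chunks and place them where the template says {context}.
function buildStuffedPrompt(
  template: string,
  docs: RetrievedDoc[],
  input: string
): { system: string; human: string } {
  const context = docs.map((doc) => doc.pageContent).join("\n\n");
  // split/join avoids the special meaning of "$" in String.prototype.replace,
  // which matters for text that contains dollar amounts.
  const system = template.split("{context}").join(context);
  return { system, human: input };
}

const stuffed = buildStuffedPrompt(
  "Answer using this context:\n\n{context}",
  [
    { pageContent: "Revenues were $51.2 billion." },
    { pageContent: "Gross margin was 43.5%." },
  ],
  "What was the revenue?"
);

console.log(stuffed.system);
console.log(stuffed.human);
```

Because every retrieved chunk is placed in a single prompt, this approach only works while the combined chunks fit in the model's context window.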

const results = await ragChain.invoke({
  input: "What was Nike's revenue in 2023?",
});

console.log(results);
{
input: "What was Nike's revenue in 2023?",
chat_history: [],
context: [
Document {
pageContent: 'Enterprise Resource Planning Platform, data and analytics, demand sensing, insight gathering, and other areas to create an end-to-end technology foundation, which we\n' +
'believe will further accelerate our digital transformation. We believe this unified approach will accelerate growth and unlock more efficiency for our business, while driving\n' +
'speed and responsiveness as we serve consumers globally.\n' +
'FINANCIAL HIGHLIGHTS\n' +
'•In fiscal 2023, NIKE, Inc. achieved record Revenues of $51.2 billion, which increased 10% and 16% on a reported and currency-neutral basis, respectively\n' +
'•NIKE Direct revenues grew 14% from $18.7 billion in fiscal 2022 to $21.3 billion in fiscal 2023, and represented approximately 44% of total NIKE Brand revenues for\n' +
'fiscal 2023\n' +
'•Gross margin for the fiscal year decreased 250 basis points to 43.5% primarily driven by higher product costs, higher markdowns and unfavorable changes in foreign\n' +
'currency exchange rates, partially offset by strategic pricing actions',
metadata: [Object]
},
Document {
pageContent: 'Table of Contents\n' +
'FISCAL 2023 NIKE BRAND REVENUE HIGHLIGHTS\n' +
'The following tables present NIKE Brand revenues disaggregated by reportable operating segment, distribution channel and major product line:\n' +
'FISCAL 2023 COMPARED TO FISCAL 2022\n' +
'•NIKE, Inc. Revenues were $51.2 billion in fiscal 2023, which increased 10% and 16% compared to fiscal 2022 on a reported and currency-neutral basis, respectively.\n' +
'The increase was due to higher revenues in North America, Europe, Middle East & Africa ("EMEA"), APLA and Greater China, which contributed approximately 7, 6,\n' +
'2 and 1 percentage points to NIKE, Inc. Revenues, respectively.\n' +
'•NIKE Brand revenues, which represented over 90% of NIKE, Inc. Revenues, increased 10% and 16% on a reported and currency-neutral basis, respectively. This\n' +
"increase was primarily due to higher revenues in Men's, the Jordan Brand, Women's and Kids' which grew 17%, 35%,11% and 10%, respectively, on a wholesale\n" +
'equivalent basis.',
metadata: [Object]
},
Document {
pageContent: 'Table of Contents\n' +
'EUROPE, MIDDLE EAST & AFRICA\n' +
'(Dollars in millions)\n' +
'FISCAL 2023FISCAL 2022% CHANGE\n' +
'% CHANGE\n' +
'EXCLUDING\n' +
'CURRENCY\n' +
'CHANGESFISCAL 2021% CHANGE\n' +
'% CHANGE\n' +
'EXCLUDING\n' +
'CURRENCY\n' +
'CHANGES\n' +
'Revenues by:\n' +
'Footwear$8,260 $7,388 12 %25 %$6,970 6 %9 %\n' +
'Apparel4,566 4,527 1 %14 %3,996 13 %16 %\n' +
'Equipment592 564 5 %18 %490 15 %17 %\n' +
'TOTAL REVENUES$13,418 $12,479 8 %21 %$11,456 9 %12 %\n' +
'Revenues by: \n' +
'Sales to Wholesale Customers$8,522 $8,377 2 %15 %$7,812 7 %10 %\n' +
'Sales through NIKE Direct4,896 4,102 19 %33 %3,644 13 %15 %\n' +
'TOTAL REVENUES$13,418 $12,479 8 %21 %$11,456 9 %12 %\n' +
'EARNINGS BEFORE INTEREST AND TAXES$3,531 $3,293 7 %$2,435 35 % \n' +
'FISCAL 2023 COMPARED TO FISCAL 2022\n' +
"•EMEA revenues increased 21% on a currency-neutral basis, due to higher revenues in Men's, the Jordan Brand, Women's and Kids'. NIKE Direct revenues\n" +
'increased 33%, driven primarily by strong digital sales growth of 43% and comparable store sales growth of 22%.',
metadata: [Object]
},
Document {
pageContent: 'Table of Contents\n' +
'NORTH AMERICA\n' +
'(Dollars in millions)\n' +
'FISCAL 2023FISCAL 2022% CHANGE\n' +
'% CHANGE\n' +
'EXCLUDING\n' +
'CURRENCY\n' +
'CHANGESFISCAL 2021% CHANGE\n' +
'% CHANGE\n' +
'EXCLUDING\n' +
'CURRENCY\n' +
'CHANGES\n' +
'Revenues by:\n' +
'Footwear$14,897 $12,228 22 %22 %$11,644 5 %5 %\n' +
'Apparel5,947 5,492 8 %9 %5,028 9 %9 %\n' +
'Equipment764 633 21 %21 %507 25 %25 %\n' +
'TOTAL REVENUES$21,608 $18,353 18 %18 %$17,179 7 %7 %\n' +
'Revenues by: \n' +
'Sales to Wholesale Customers$11,273 $9,621 17 %18 %$10,186 -6 %-6 %\n' +
'Sales through NIKE Direct10,335 8,732 18 %18 %6,993 25 %25 %\n' +
'TOTAL REVENUES$21,608 $18,353 18 %18 %$17,179 7 %7 %\n' +
'EARNINGS BEFORE INTEREST AND TAXES$5,454 $5,114 7 %$5,089 0 %\n' +
'FISCAL 2023 COMPARED TO FISCAL 2022\n' +
"•North America revenues increased 18% on a currency-neutral basis, primarily due to higher revenues in Men's and the Jordan Brand. NIKE Direct revenues\n" +
'increased 18%, driven by strong digital sales growth of 23%, comparable store sales growth of 9% and the addition of new stores.',
metadata: [Object]
}
],
answer: 'According to the financial highlights, Nike, Inc. achieved record revenues of $51.2 billion in fiscal 2023, which increased 10% on a reported basis and 16% on a currency-neutral basis compared to fiscal 2022.'
}

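You can see that you get both a final answer in the answer key of the results object, and the context the LLM used to generate that answer.

Examining the values under context further, you can see that they are documents that each contain a chunk of the ingested page content. Usefully, these documents also preserve the original metadata from when you first loaded them: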

console.log(results.context[0].pageContent);
Enterprise Resource Planning Platform, data and analytics, demand sensing, insight gathering, and other areas to create an end-to-end technology foundation, which we
believe will further accelerate our digital transformation. We believe this unified approach will accelerate growth and unlock more efficiency for our business, while driving
speed and responsiveness as we serve consumers globally.
FINANCIAL HIGHLIGHTS
•In fiscal 2023, NIKE, Inc. achieved record Revenues of $51.2 billion, which increased 10% and 16% on a reported and currency-neutral basis, respectively
•NIKE Direct revenues grew 14% from $18.7 billion in fiscal 2022 to $21.3 billion in fiscal 2023, and represented approximately 44% of total NIKE Brand revenues for
fiscal 2023
•Gross margin for the fiscal year decreased 250 basis points to 43.5% primarily driven by higher product costs, higher markdowns and unfavorable changes in foreign
currency exchange rates, partially offset by strategic pricing actions
console.log(results.context[0].metadata);
{
source: '../../data/nke-10k-2023.pdf',
pdf: {
version: '1.10.100',
info: {
PDFFormatVersion: '1.4',
IsAcroFormPresent: false,
IsXFAPresent: false,
Title: '0000320187-23-000039',
Author: 'EDGAR Online, a division of Donnelley Financial Solutions',
Subject: 'Form 10-K filed on 2023-07-20 for the period ending 2023-05-31',
Keywords: '0000320187-23-000039; ; 10-K',
Creator: 'EDGAR Filing HTML Converter',
Producer: 'EDGRpdf Service w/ EO.Pdf 22.0.40.0',
CreationDate: "D:20230720162200-04'00'",
ModDate: "D:20230720162208-04'00'"
},
metadata: null,
totalPages: 107
},
loc: { pageNumber: 31, lines: { from: 14, to: 22 } }
}

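This particular chunk came from page 31 of the original PDF. You can use this data to show which page of the PDF the answer came from, allowing users to quickly verify that answers are based on the source material.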
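
One way to surface this to users is to collect the page numbers from the context documents and append them to the answer. The sketch below assumes each document carries `metadata.loc.pageNumber` as in the output above. `SourceDoc` and `citePages` are hypothetical names, and the sample page numbers other than 31 are made up for illustration.

```typescript
// Minimal shape of a context document, following the metadata shown above.
interface SourceDoc {
  pageContent: string;
  metadata: { source: string; loc: { pageNumber: number } };
}

// Build a short citation line listing each distinct page once, in page order.
function citePages(context: SourceDoc[]): string {
  const pages = Array.from(
    new Set(context.map((doc) => doc.metadata.loc.pageNumber))
  ).sort((a, b) => a - b);
  if (pages.length === 0) return "No sources retrieved.";
  return `Sources: ${pages.map((page) => `p. ${page}`).join(", ")}`;
}

// Illustrative input; only the page 31 chunk corresponds to the output above.
const citation = citePages([
  { pageContent: "...", metadata: { source: "example.pdf", loc: { pageNumber: 31 } } },
  { pageContent: "...", metadata: { source: "example.pdf", loc: { pageNumber: 4 } } },
  { pageContent: "...", metadata: { source: "example.pdf", loc: { pageNumber: 31 } } },
]);

console.log(citation); // Sources: p. 4, p. 31
```

In an application, you would pass `results.context` to a helper like this and display the result alongside `results.answer`.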

For a deeper dive into RAG, see this more focused tutorial or our how-to guides.

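Next steps

You've now seen how to load documents from a PDF file with a document loader, along with some techniques you can use to prepare that loaded data for RAG.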

For more on document loaders and on RAG, see the corresponding sections of the LangChain documentation.

