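Build a PDF ingestion and question-answering system
PDF files often contain crucial unstructured data that is unavailable from other sources. They can be quite long and, unlike plain-text files, usually cannot be fed directly into a language model's prompt.
In this tutorial, you'll create a system that can answer questions about PDF files. More specifically, you'll use a document loader to load text in a format an LLM can more easily handle, then build a retrieval-augmented generation (RAG) pipeline to answer questions, including citations from the source material.
This tutorial glosses over some concepts that are covered in more depth in our RAG tutorials, so you may want to go through those first if you haven't already.
Let's dive in!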
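Loading documents
First, you'll need to choose a PDF to load. We'll use a document from Nike's annual public SEC report. It's over 100 pages long and contains some crucial data mixed with longer explanatory text. However, feel free to use a PDF of your choosing.
Once you've chosen your PDF, the next step is to load it into a format that an LLM can more easily handle, since LLMs generally require text input. LangChain has a few different built-in document loaders that you can experiment with. Below, we'll use one powered by the pdf-parse package, which reads from a file path: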
import "pdf-parse"; // Peer dep
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";
const loader = new PDFLoader("../../data/nke-10k-2023.pdf");
const docs = await loader.load();
console.log(docs.length);
107
console.log(docs[0].pageContent.slice(0, 100));
console.log(docs[0].metadata);
Table of Contents
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
FORM 10-K
{
  source: '../../data/nke-10k-2023.pdf',
  pdf: {
    version: '1.10.100',
    info: {
      PDFFormatVersion: '1.4',
      IsAcroFormPresent: false,
      IsXFAPresent: false,
      Title: '0000320187-23-000039',
      Author: 'EDGAR Online, a division of Donnelley Financial Solutions',
      Subject: 'Form 10-K filed on 2023-07-20 for the period ending 2023-05-31',
      Keywords: '0000320187-23-000039; ; 10-K',
      Creator: 'EDGAR Filing HTML Converter',
      Producer: 'EDGRpdf Service w/ EO.Pdf 22.0.40.0',
      CreationDate: "D:20230720162200-04'00'",
      ModDate: "D:20230720162208-04'00'"
    },
    metadata: null,
    totalPages: 107
  },
  loc: { pageNumber: 1 }
}
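So what just happened?
- The loader reads the PDF at the specified path into memory.
- It then extracts text data using the pdf-parse package.
- Finally, it creates a LangChain Document for each page of the PDF, containing the page's content and some metadata about where in the document the text came from.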
LangChain has many other document loaders for other data sources, or you can create a custom document loader.
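Because each page becomes its own document with a page number in its metadata, you can already do simple page-level lookups before building any retrieval pipeline. The sketch below assumes a minimal `PageDocument` shape that mirrors the fields printed above; it is not LangChain's own `Document` type, and `pagesMentioning` and the sample data are hypothetical names and values used only for illustration.

```typescript
// Minimal stand-in for the per-page documents shown above. The field names
// mirror the printed output; this is not LangChain's own Document type.
interface PageDocument {
  pageContent: string;
  metadata: { source: string; loc: { pageNumber: number } };
}

// Hypothetical helper: list the page numbers whose text mentions a keyword
// (case-insensitive substring match).
function pagesMentioning(docs: PageDocument[], keyword: string): number[] {
  const needle = keyword.toLowerCase();
  return docs
    .filter((doc) => doc.pageContent.toLowerCase().includes(needle))
    .map((doc) => doc.metadata.loc.pageNumber);
}

// Hand-written sample standing in for the result of `await loader.load()`.
const sampleDocs: PageDocument[] = [
  {
    pageContent: "Table of Contents\nUNITED STATES",
    metadata: { source: "example.pdf", loc: { pageNumber: 1 } },
  },
  {
    pageContent: "FINANCIAL HIGHLIGHTS\nrecord Revenues",
    metadata: { source: "example.pdf", loc: { pageNumber: 2 } },
  },
];

const matchingPages = pagesMentioning(sampleDocs, "revenues");
console.log(matchingPages);
```

With real loader output you would pass `docs` in place of `sampleDocs`, assuming the metadata has the shape shown in the printed example.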
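Question answering with RAG
Next, you'll prepare the loaded documents for later retrieval. Using a text splitter, you'll split your loaded documents into smaller documents that fit more easily into an LLM's context window, then load them into a vector store. You can then create a retriever from the vector store for use in our RAG chain.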
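To build intuition for what splitting with a chunk size and an overlap means, here is a deliberately naive fixed-size splitter. It is only a sketch: `splitWithOverlap` is a hypothetical helper, and the splitter used later in this tutorial is more sophisticated, since it also tries to break on natural separators rather than at arbitrary character offsets.

```typescript
// Naive fixed-size splitter with overlap. This only illustrates the idea of
// chunk size and overlap; it is not how LangChain's splitters are implemented.
function splitWithOverlap(
  text: string,
  chunkSize: number,
  chunkOverlap: number
): string[] {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error("Require chunkSize > 0 and 0 <= chunkOverlap < chunkSize");
  }
  // Each new chunk starts this many characters after the previous one.
  const step = chunkSize - chunkOverlap;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once a chunk has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

const demoChunks = splitWithOverlap("abcdefghij", 4, 2);
console.log(demoChunks); // [ 'abcd', 'cdef', 'efgh', 'ghij' ]
```

Each chunk repeats the last two characters of the previous one, so a sentence that straddles a chunk boundary still appears intact in at least one chunk, provided it is shorter than the overlap.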
Pick your chat model
- OpenAI
- Anthropic
- FireworksAI
- MistralAI
- Groq
- VertexAI
Install dependencies (OpenAI)
See this section for general instructions on installing integration packages.
Install with npm, yarn, or pnpm:
npm i @langchain/openai
yarn add @langchain/openai
pnpm add @langchain/openai
Add environment variables
OPENAI_API_KEY=your-api-key
Instantiate the model
import { ChatOpenAI } from "@langchain/openai";
const model = new ChatOpenAI({ model: "gpt-4o" });
Install dependencies (Anthropic)
See this section for general instructions on installing integration packages.
Install with npm, yarn, or pnpm:
npm i @langchain/anthropic
yarn add @langchain/anthropic
pnpm add @langchain/anthropic
Add environment variables
ANTHROPIC_API_KEY=your-api-key
Instantiate the model
import { ChatAnthropic } from "@langchain/anthropic";
const model = new ChatAnthropic({
  model: "claude-3-5-sonnet-20240620",
  temperature: 0
});
Install dependencies (FireworksAI)
See this section for general instructions on installing integration packages.
Install with npm, yarn, or pnpm:
npm i @langchain/community
yarn add @langchain/community
pnpm add @langchain/community
Add environment variables
FIREWORKS_API_KEY=your-api-key
Instantiate the model
import { ChatFireworks } from "@langchain/community/chat_models/fireworks";
const model = new ChatFireworks({
  model: "accounts/fireworks/models/llama-v3p1-70b-instruct",
  temperature: 0
});
Install dependencies (MistralAI)
See this section for general instructions on installing integration packages.
Install with npm, yarn, or pnpm:
npm i @langchain/mistralai
yarn add @langchain/mistralai
pnpm add @langchain/mistralai
Add environment variables
MISTRAL_API_KEY=your-api-key
Instantiate the model
import { ChatMistralAI } from "@langchain/mistralai";
const model = new ChatMistralAI({
  model: "mistral-large-latest",
  temperature: 0
});
Install dependencies (Groq)
See this section for general instructions on installing integration packages.
Install with npm, yarn, or pnpm:
npm i @langchain/groq
yarn add @langchain/groq
pnpm add @langchain/groq
Add environment variables
GROQ_API_KEY=your-api-key
Instantiate the model
import { ChatGroq } from "@langchain/groq";
const model = new ChatGroq({
  model: "mixtral-8x7b-32768",
  temperature: 0
});
Install dependencies (VertexAI)
See this section for general instructions on installing integration packages.
Install with npm, yarn, or pnpm:
npm i @langchain/google-vertexai
yarn add @langchain/google-vertexai
pnpm add @langchain/google-vertexai
Add environment variables
GOOGLE_APPLICATION_CREDENTIALS=credentials.json
Instantiate the model
import { ChatVertexAI } from "@langchain/google-vertexai";
const model = new ChatVertexAI({
  model: "gemini-1.5-flash",
  temperature: 0
});
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { OpenAIEmbeddings } from "@langchain/openai";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
const textSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
});
const splits = await textSplitter.splitDocuments(docs);
const vectorstore = await MemoryVectorStore.fromDocuments(
  splits,
  new OpenAIEmbeddings()
);
const retriever = vectorstore.asRetriever();
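Conceptually, a retriever backed by an in-memory vector store embeds the query and returns the stored chunks whose embeddings are most similar to it. The sketch below illustrates that idea with cosine similarity and hand-written three-dimensional vectors. Everything in it (`StoredChunk`, `cosineSimilarity`, `topK`, and the sample vectors) is hypothetical and stands in for real embeddings; it is not how `MemoryVectorStore` is implemented.

```typescript
// Toy in-memory vector store. The embeddings are hand-written 3-d vectors,
// not output from a real embedding model.
interface StoredChunk {
  text: string;
  embedding: number[];
}

// Cosine similarity of two equal-length vectors; returns 0 for a zero vector.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k chunks most similar to the query vector, best match first.
function topK(store: StoredChunk[], query: number[], k: number): StoredChunk[] {
  return [...store]
    .sort(
      (x, y) =>
        cosineSimilarity(y.embedding, query) -
        cosineSimilarity(x.embedding, query)
    )
    .slice(0, k);
}

const toyStore: StoredChunk[] = [
  { text: "revenue highlights", embedding: [1, 0, 0] },
  { text: "store locations", embedding: [0, 1, 0] },
  { text: "revenue by region", embedding: [0.9, 0.1, 0] },
];

const hits = topK(toyStore, [1, 0, 0], 2).map((chunk) => chunk.text);
console.log(hits); // [ 'revenue highlights', 'revenue by region' ]
```

A real embedding model produces vectors with hundreds or thousands of dimensions, but the ranking step works the same way in principle.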
Finally, you'll use some built-in helpers to construct the final ragChain:
import { createRetrievalChain } from "langchain/chains/retrieval";
import { createStuffDocumentsChain } from "langchain/chains/combine_documents";
import { ChatPromptTemplate } from "@langchain/core/prompts";
const systemTemplate = [
  `You are an assistant for question-answering tasks. `,
  `Use the following pieces of retrieved context to answer `,
  `the question. If you don't know the answer, say that you `,
  `don't know. Use three sentences maximum and keep the `,
  `answer concise.`,
  `\n\n`,
  `{context}`,
].join("");
const prompt = ChatPromptTemplate.fromMessages([
  ["system", systemTemplate],
  ["human", "{input}"],
]);
// Pass the chat model instantiated above (named `model`) as the chain's llm.
const questionAnswerChain = await createStuffDocumentsChain({ llm: model, prompt });
const ragChain = await createRetrievalChain({
  retriever,
  combineDocsChain: questionAnswerChain,
});
const results = await ragChain.invoke({
  input: "What was Nike's revenue in 2023?",
});
console.log(results);
{
input: "What was Nike's revenue in 2023?",
chat_history: [],
context: [
Document {
pageContent: 'Enterprise Resource Planning Platform, data and analytics, demand sensing, insight gathering, and other areas to create an end-to-end technology foundation, which we\n' +
'believe will further accelerate our digital transformation. We believe this unified approach will accelerate growth and unlock more efficiency for our business, while driving\n' +
'speed and responsiveness as we serve consumers globally.\n' +
'FINANCIAL HIGHLIGHTS\n' +
'•In fiscal 2023, NIKE, Inc. achieved record Revenues of $51.2 billion, which increased 10% and 16% on a reported and currency-neutral basis, respectively\n' +
'•NIKE Direct revenues grew 14% from $18.7 billion in fiscal 2022 to $21.3 billion in fiscal 2023, and represented approximately 44% of total NIKE Brand revenues for\n' +
'fiscal 2023\n' +
'•Gross margin for the fiscal year decreased 250 basis points to 43.5% primarily driven by higher product costs, higher markdowns and unfavorable changes in foreign\n' +
'currency exchange rates, partially offset by strategic pricing actions',
metadata: [Object]
},
Document {
pageContent: 'Table of Contents\n' +
'FISCAL 2023 NIKE BRAND REVENUE HIGHLIGHTS\n' +
'The following tables present NIKE Brand revenues disaggregated by reportable operating segment, distribution channel and major product line:\n' +
'FISCAL 2023 COMPARED TO FISCAL 2022\n' +
'•NIKE, Inc. Revenues were $51.2 billion in fiscal 2023, which increased 10% and 16% compared to fiscal 2022 on a reported and currency-neutral basis, respectively.\n' +
'The increase was due to higher revenues in North America, Europe, Middle East & Africa ("EMEA"), APLA and Greater China, which contributed approximately 7, 6,\n' +
'2 and 1 percentage points to NIKE, Inc. Revenues, respectively.\n' +
'•NIKE Brand revenues, which represented over 90% of NIKE, Inc. Revenues, increased 10% and 16% on a reported and currency-neutral basis, respectively. This\n' +
"increase was primarily due to higher revenues in Men's, the Jordan Brand, Women's and Kids' which grew 17%, 35%,11% and 10%, respectively, on a wholesale\n" +
'equivalent basis.',
metadata: [Object]
},
Document {
pageContent: 'Table of Contents\n' +
'EUROPE, MIDDLE EAST & AFRICA\n' +
'(Dollars in millions)\n' +
'FISCAL 2023FISCAL 2022% CHANGE\n' +
'% CHANGE\n' +
'EXCLUDING\n' +
'CURRENCY\n' +
'CHANGESFISCAL 2021% CHANGE\n' +
'% CHANGE\n' +
'EXCLUDING\n' +
'CURRENCY\n' +
'CHANGES\n' +
'Revenues by:\n' +
'Footwear$8,260 $7,388 12 %25 %$6,970 6 %9 %\n' +
'Apparel4,566 4,527 1 %14 %3,996 13 %16 %\n' +
'Equipment592 564 5 %18 %490 15 %17 %\n' +
'TOTAL REVENUES$13,418 $12,479 8 %21 %$11,456 9 %12 %\n' +
'Revenues by: \n' +
'Sales to Wholesale Customers$8,522 $8,377 2 %15 %$7,812 7 %10 %\n' +
'Sales through NIKE Direct4,896 4,102 19 %33 %3,644 13 %15 %\n' +
'TOTAL REVENUES$13,418 $12,479 8 %21 %$11,456 9 %12 %\n' +
'EARNINGS BEFORE INTEREST AND TAXES$3,531 $3,293 7 %$2,435 35 % \n' +
'FISCAL 2023 COMPARED TO FISCAL 2022\n' +
"•EMEA revenues increased 21% on a currency-neutral basis, due to higher revenues in Men's, the Jordan Brand, Women's and Kids'. NIKE Direct revenues\n" +
'increased 33%, driven primarily by strong digital sales growth of 43% and comparable store sales growth of 22%.',
metadata: [Object]
},
Document {
pageContent: 'Table of Contents\n' +
'NORTH AMERICA\n' +
'(Dollars in millions)\n' +
'FISCAL 2023FISCAL 2022% CHANGE\n' +
'% CHANGE\n' +
'EXCLUDING\n' +
'CURRENCY\n' +
'CHANGESFISCAL 2021% CHANGE\n' +
'% CHANGE\n' +
'EXCLUDING\n' +
'CURRENCY\n' +
'CHANGES\n' +
'Revenues by:\n' +
'Footwear$14,897 $12,228 22 %22 %$11,644 5 %5 %\n' +
'Apparel5,947 5,492 8 %9 %5,028 9 %9 %\n' +
'Equipment764 633 21 %21 %507 25 %25 %\n' +
'TOTAL REVENUES$21,608 $18,353 18 %18 %$17,179 7 %7 %\n' +
'Revenues by: \n' +
'Sales to Wholesale Customers$11,273 $9,621 17 %18 %$10,186 -6 %-6 %\n' +
'Sales through NIKE Direct10,335 8,732 18 %18 %6,993 25 %25 %\n' +
'TOTAL REVENUES$21,608 $18,353 18 %18 %$17,179 7 %7 %\n' +
'EARNINGS BEFORE INTEREST AND TAXES$5,454 $5,114 7 %$5,089 0 %\n' +
'FISCAL 2023 COMPARED TO FISCAL 2022\n' +
"•North America revenues increased 18% on a currency-neutral basis, primarily due to higher revenues in Men's and the Jordan Brand. NIKE Direct revenues\n" +
'increased 18%, driven by strong digital sales growth of 23%, comparable store sales growth of 9% and the addition of new stores.',
metadata: [Object]
}
],
answer: 'According to the financial highlights, Nike, Inc. achieved record revenues of $51.2 billion in fiscal 2023, which increased 10% on a reported basis and 16% on a currency-neutral basis compared to fiscal 2022.'
}
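As you can see, the result object contains both the final answer under the answer key and the context the LLM used to generate that answer under the context key.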
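The link between the retrieved context and the answer is the `{context}` placeholder in the system prompt: a "stuff" style chain places the text of all retrieved documents into that slot before calling the model. The following is a rough sketch of that substitution, not the library's implementation. `stuffContext` is a hypothetical helper, and the blank-line separator between documents is an assumption made for illustration.

```typescript
// Sketch of the "stuff" strategy: concatenate all retrieved chunks and
// substitute them for the {context} placeholder in the system prompt.
// The "\n\n" separator is an assumption for illustration.
interface RetrievedDoc {
  pageContent: string;
}

function stuffContext(
  systemTemplate: string,
  docs: RetrievedDoc[],
  separator = "\n\n"
): string {
  const context = docs.map((doc) => doc.pageContent).join(separator);
  // split/join swaps in the context verbatim, so "$" characters in financial
  // text are never treated as replacement patterns.
  return systemTemplate.split("{context}").join(context);
}

const filledPrompt = stuffContext("Answer using this context:\n\n{context}", [
  { pageContent: "Revenues were $51.2 billion." },
  { pageContent: "Gross margin was 43.5%." },
]);
console.log(filledPrompt);
```

Because every retrieved chunk is placed in a single prompt, this approach only works while the combined chunks fit within the model's context window, which is one reason the documents are split into smaller pieces first.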
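Examining the values under context further, you can see that they are documents, each containing a chunk of the ingested page content. Usefully, these documents also preserve the original metadata from when you first loaded them.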
console.log(results.context[0].pageContent);
Enterprise Resource Planning Platform, data and analytics, demand sensing, insight gathering, and other areas to create an end-to-end technology foundation, which we
believe will further accelerate our digital transformation. We believe this unified approach will accelerate growth and unlock more efficiency for our business, while driving
speed and responsiveness as we serve consumers globally.
FINANCIAL HIGHLIGHTS
•In fiscal 2023, NIKE, Inc. achieved record Revenues of $51.2 billion, which increased 10% and 16% on a reported and currency-neutral basis, respectively
•NIKE Direct revenues grew 14% from $18.7 billion in fiscal 2022 to $21.3 billion in fiscal 2023, and represented approximately 44% of total NIKE Brand revenues for
fiscal 2023
•Gross margin for the fiscal year decreased 250 basis points to 43.5% primarily driven by higher product costs, higher markdowns and unfavorable changes in foreign
currency exchange rates, partially offset by strategic pricing actions
console.log(results.context[0].metadata);
{
  source: '../../data/nke-10k-2023.pdf',
  pdf: {
    version: '1.10.100',
    info: {
      PDFFormatVersion: '1.4',
      IsAcroFormPresent: false,
      IsXFAPresent: false,
      Title: '0000320187-23-000039',
      Author: 'EDGAR Online, a division of Donnelley Financial Solutions',
      Subject: 'Form 10-K filed on 2023-07-20 for the period ending 2023-05-31',
      Keywords: '0000320187-23-000039; ; 10-K',
      Creator: 'EDGAR Filing HTML Converter',
      Producer: 'EDGRpdf Service w/ EO.Pdf 22.0.40.0',
      CreationDate: "D:20230720162200-04'00'",
      ModDate: "D:20230720162208-04'00'"
    },
    metadata: null,
    totalPages: 107
  },
  loc: { pageNumber: 31, lines: { from: 14, to: 22 } }
}
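This particular chunk came from page 31 of the original PDF. You can use this data to show which page of the PDF an answer came from, allowing users to quickly verify that answers are grounded in the source material.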
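One way to surface this to users is to collect the distinct page numbers from the retrieved documents and append them to the answer. The sketch below assumes each context document carries `metadata.loc.pageNumber`, as in the printed metadata. `CitedDoc`, `citedPages`, and `formatCitation` are hypothetical names, and apart from page 31 the sample page numbers are made up for illustration.

```typescript
// Minimal shape of a retrieved document, limited to the metadata fields used
// here. It mirrors the printed metadata; it is not LangChain's Document type.
interface CitedDoc {
  metadata: { source: string; loc: { pageNumber: number } };
}

// Hypothetical helper: distinct page numbers behind an answer, ascending.
function citedPages(context: CitedDoc[]): number[] {
  const pages = new Set(context.map((doc) => doc.metadata.loc.pageNumber));
  return Array.from(pages).sort((a, b) => a - b);
}

// Hypothetical helper: a short, human-readable citation line.
function formatCitation(context: CitedDoc[]): string {
  const pages = citedPages(context);
  return pages.length === 0 ? "No sources" : `Sources: p. ${pages.join(", ")}`;
}

// Sample context; only page 31 comes from the output above, the rest is made up.
const sampleContext: CitedDoc[] = [
  { metadata: { source: "example.pdf", loc: { pageNumber: 36 } } },
  { metadata: { source: "example.pdf", loc: { pageNumber: 31 } } },
  { metadata: { source: "example.pdf", loc: { pageNumber: 31 } } },
];

const citation = formatCitation(sampleContext);
console.log(citation); // Sources: p. 31, 36
```

In the tutorial's chain you would pass `results.context` instead of `sampleContext`, assuming the retrieved documents keep the metadata shape shown above.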
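Next steps
You've now seen how to load documents from a PDF file with a document loader, along with some techniques for preparing that loaded data for RAG.
To learn more about document loaders, see the LangChain documentation on document loaders.
To learn more about RAG, see the RAG tutorials mentioned at the start of this tutorial.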