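Build a semantic search engine

This tutorial will familiarize you with LangChain's document loader, embedding, and vector store abstractions. These abstractions are designed to support retrieval of data from (vector) databases and other sources for integration with LLM workflows. They are important for applications that fetch data to be reasoned over as part of model inference, as in retrieval-augmented generation, or RAG (see our RAG tutorial).

Here we will build a search engine over a PDF document. This will allow us to retrieve passages in the PDF that are similar to an input query.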

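Concepts

This guide focuses on retrieval of text data. We will cover the following concepts:

  • Documents and document loaders;
  • Text splitters;
  • Embeddings;
  • Vector stores and retrievers.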

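Setup

Jupyter Notebook

This and other tutorials are best run in a Jupyter notebook. See the installation instructions for how to set one up.

Installation

This guide requires @langchain/community and pdf-parse: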

yarn add @langchain/community pdf-parse

For more details, see our installation guide.

LangSmith

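Many of the applications you build with LangChain will contain multiple steps with multiple LLM calls. As these applications get more complex, it becomes crucial to be able to inspect what exactly is going on inside your chain or agent. The best way to do this is with LangSmith.

After you sign up for LangSmith, make sure to set your environment variables to start logging traces: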

export LANGSMITH_TRACING="true"
export LANGSMITH_API_KEY="..."

# Reduce tracing latency if you are not in a serverless environment
# export LANGCHAIN_CALLBACKS_BACKGROUND=true

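Documents and document loaders

LangChain implements a Document abstraction, which is intended to represent a unit of text and associated metadata. It has three attributes:

  • pageContent: a string representing the content;
  • metadata: a record of arbitrary metadata;
  • id: (optional) a string identifier for the document.

The metadata attribute can capture information about the source of the document, its relationship to other documents, and other information. Note that an individual Document object often represents a chunk of a larger document.

We can generate sample documents when desired: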

import { Document } from "@langchain/core/documents";

const documents = [
  new Document({
    pageContent:
      "Dogs are great companions, known for their loyalty and friendliness.",
    metadata: { source: "mammal-pets-doc" },
  }),
  new Document({
    pageContent: "Cats are independent pets that often enjoy their own space.",
    metadata: { source: "mammal-pets-doc" },
  }),
];

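However, the LangChain ecosystem implements document loaders that integrate with hundreds of common sources. This makes it easy to incorporate data from these sources into your AI application.

Loading documents

Let's load a PDF into a sequence of Document objects. The LangChain repository contains a sample PDF: a 10-K filing for Nike from 2023. LangChain implements a PDFLoader that we can use to parse the PDF: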

import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";

const loader = new PDFLoader("../../data/nke-10k-2023.pdf");

const docs = await loader.load();
console.log(docs.length);
107
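Tip

See this guide for more detail on PDF document loaders.

PDFLoader loads one Document object per PDF page. For each, we can easily access:

  • The string content of the page;
  • Metadata containing the file name and page number.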
docs[0].pageContent.slice(0, 200);
Table of Contents
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
FORM 10-K
(Mark One)
☑ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934
FO
docs[0].metadata;
{
source: '../../data/nke-10k-2023.pdf',
pdf: {
version: '1.10.100',
info: {
PDFFormatVersion: '1.4',
IsAcroFormPresent: false,
IsXFAPresent: false,
Title: '0000320187-23-000039',
Author: 'EDGAR Online, a division of Donnelley Financial Solutions',
Subject: 'Form 10-K filed on 2023-07-20 for the period ending 2023-05-31',
Keywords: '0000320187-23-000039; ; 10-K',
Creator: 'EDGAR Filing HTML Converter',
Producer: 'EDGRpdf Service w/ EO.Pdf 22.0.40.0',
CreationDate: "D:20230720162200-04'00'",
ModDate: "D:20230720162208-04'00'"
},
metadata: null,
totalPages: 107
},
loc: { pageNumber: 1 }
}

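Splitting

For both information retrieval and downstream question-answering purposes, a page may be too coarse a representation. Our goal in the end is to retrieve Document objects that answer an input query, and further splitting our PDF will help ensure that the meaning of relevant portions of the document is not "washed out" by surrounding text.

We can use text splitters for this purpose. Here we will use a simple text splitter that partitions based on characters. We will split our documents into chunks of 1000 characters with 200 characters of overlap between chunks. The overlap helps mitigate the possibility of separating a statement from important context related to it. We use the RecursiveCharacterTextSplitter, which recursively splits the document using common separators such as new lines until each chunk is the appropriate size. This is the recommended text splitter for generic text use cases.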

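Note that the splitter configured below does not set an add_start_index option (that snake_case spelling follows Python conventions). Instead, as the search results later in this tutorial show, each split Document records where it came from in its metadata: the loc entry keeps the page number and gains a lines field.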

See this guide for more detail about working with PDFs.

import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

const textSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
});

const allSplits = await textSplitter.splitDocuments(docs);

allSplits.length;
513
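
To make the roles of chunk size and overlap concrete, here is a minimal sketch of a naive fixed-size splitter in plain TypeScript. It is not the algorithm used by RecursiveCharacterTextSplitter (which prefers to break on separators such as new lines); the function name splitWithOverlap and the Chunk shape are hypothetical and exist only for illustration.

```typescript
// A naive sliding-window splitter, for illustration only. It cuts at fixed
// character offsets and does NOT look for separators the way
// RecursiveCharacterTextSplitter does.
interface Chunk {
  text: string;
  start: number; // character offset of the chunk within the original text
}

function splitWithOverlap(
  text: string,
  chunkSize: number,
  chunkOverlap: number
): Chunk[] {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error("require chunkSize > 0 and 0 <= chunkOverlap < chunkSize");
  }
  // Each new chunk starts (chunkSize - chunkOverlap) characters after the last.
  const step = chunkSize - chunkOverlap;
  const result: Chunk[] = [];
  for (let start = 0; start < text.length; start += step) {
    result.push({ text: text.slice(start, start + chunkSize), start });
    if (start + chunkSize >= text.length) break; // this chunk reached the end
  }
  return result;
}

const sample = "abcdefghijklmnopqrstuvwxyz"; // 26 characters
const chunks = splitWithOverlap(sample, 10, 4);
console.log(chunks.map((c) => `${c.start}:${c.text}`));
```

With a chunk size of 10 and an overlap of 4, consecutive chunks share their last and first four characters, which is the same idea as the 1000/200 configuration above at a smaller scale.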

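Embeddings

Vector search is a common way to store and search over unstructured data (such as unstructured text). The idea is to store numeric vectors that are associated with the text. Given a query, we can embed it as a vector of the same dimension and use vector similarity metrics (such as cosine similarity) to identify related text.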
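
As a rough illustration of the similarity metric mentioned above, the following sketch computes cosine similarity between small hand-made vectors. The vectors are made up for the example (real embedding models produce much longer ones), and returning 0 for a zero-length vector is a convention chosen here, not something the tutorial prescribes.

```typescript
// Cosine similarity: the dot product of two vectors divided by the product
// of their Euclidean norms. Values near 1 mean "pointing the same way".
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same dimension");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // convention for this sketch only
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hand-made 3-dimensional "embeddings", not output from any model.
const queryVec = [1, 0, 1];
const closeVec = [2, 0, 2]; // same direction as the query, different length
const farVec = [0, 1, 0]; // orthogonal to the query

console.log(cosineSimilarity(queryVec, closeVec)); // approximately 1
console.log(cosineSimilarity(queryVec, farVec)); // 0
```

Because the metric depends only on direction, closeVec scores as (almost exactly) identical to the query even though its magnitude differs.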

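LangChain supports embeddings from dozens of providers. These models specify how text should be converted into a numeric vector. Let's select a model:

Pick your embedding model

Install dependencies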

yarn add @langchain/openai
OPENAI_API_KEY=your-api-key
import { OpenAIEmbeddings } from "@langchain/openai";

const embeddings = new OpenAIEmbeddings({
  model: "text-embedding-3-large",
});
const vector1 = await embeddings.embedQuery(allSplits[0].pageContent);
const vector2 = await embeddings.embedQuery(allSplits[1].pageContent);

console.assert(vector1.length === vector2.length);
console.log(`Generated vectors of length ${vector1.length}\n`);
console.log(vector1.slice(0, 10));
Generated vectors of length 3072

[
0.014310152,
-0.01681044,
-0.0011537228,
0.010546423,
0.022808468,
-0.028327717,
-0.00058849837,
0.0419197,
-0.0012900416,
0.0661778
]

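With a model for generating text embeddings, we can next store them in a special data structure that supports efficient similarity search.

Vector stores

LangChain VectorStore objects contain methods for adding text and Document objects to the store, and for querying them using various similarity metrics. They are often initialized with embedding models, which determine how text data is translated to numeric vectors.

LangChain includes a suite of integrations with different vector store technologies. Some vector stores are hosted by a provider (e.g., various cloud providers) and require specific credentials to use; some (such as Postgres) run in separate infrastructure that can be run locally or via a third party; others can run in memory for lightweight workloads.

Pick your vector store

Install dependencies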

yarn add langchain
import { MemoryVectorStore } from "langchain/vectorstores/memory";

const vectorStore = new MemoryVectorStore(embeddings);

Having instantiated our vector store, we can now index the documents.

await vectorStore.addDocuments(allSplits);

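Note that most vector store implementations will allow you to connect to an existing vector store, e.g., by providing a client, index name, or other information. See the documentation for a specific integration for more detail.

Once we've instantiated a VectorStore that contains documents, we can query it. VectorStore includes methods for querying:

  • Synchronously and asynchronously;
  • By string query and by vector;
  • With and without returning similarity scores;
  • By similarity and by maximum marginal relevance (to balance similarity to the query with diversity in retrieved results).

The methods will generally include a list of Document objects in their outputs.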
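
To give a feel for what an in-memory vector store does when it is queried by vector, here is a toy sketch in plain TypeScript. ToyVectorStore, ToyDocument, addVectors, and searchWithScore are hypothetical names, not LangChain's API, and the sketch assumes every vector is already unit length so that a dot product equals cosine similarity. It also scans every entry, which only suits small collections.

```typescript
// Hypothetical shapes for illustration; these are not LangChain's own types.
interface ToyDocument {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Assumes unit-length vectors, so the dot product equals cosine similarity.
function dotProduct(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}

class ToyVectorStore {
  private entries: { vector: number[]; doc: ToyDocument }[] = [];

  addVectors(vectors: number[][], docs: ToyDocument[]): void {
    vectors.forEach((vector, i) => {
      this.entries.push({ vector, doc: docs[i] });
    });
  }

  // Returns up to k [document, score] pairs, highest score first.
  searchWithScore(query: number[], k: number): [ToyDocument, number][] {
    const scored = this.entries.map(
      (e): [ToyDocument, number] => [e.doc, dotProduct(query, e.vector)]
    );
    scored.sort((x, y) => y[1] - x[1]);
    return scored.slice(0, k);
  }
}

const toyStore = new ToyVectorStore();
toyStore.addVectors(
  [
    [1, 0],
    [0, 1],
  ], // hand-made 2-dimensional unit vectors, not model output
  [
    { pageContent: "Dogs are great companions.", metadata: { id: "dogs" } },
    { pageContent: "Cats enjoy their own space.", metadata: { id: "cats" } },
  ]
);

// The query vector [0.6, 0.8] is also unit length and leans toward "cats".
const hits = toyStore.searchWithScore([0.6, 0.8], 2);
console.log(hits.map(([doc, score]) => `${doc.metadata.id}: ${score}`));
```

The [document, score] pair shape mirrors the output of the scored searches shown in the usage section; production vector stores typically replace the linear scan with an index.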

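Usage

Embeddings typically represent text as a "dense" vector such that texts with similar meanings are geometrically close. This lets us retrieve relevant information just by passing in a question, without knowledge of any specific keywords used in the document.

Return documents based on similarity to a string query: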

const results1 = await vectorStore.similaritySearch(
  "When was Nike incorporated?"
);

results1[0];
Document {
pageContent: 'Table of Contents\n' +
'PART I\n' +
'ITEM 1. BUSINESS\n' +
'GENERAL\n' +
'NIKE, Inc. was incorporated in 1967 under the laws of the State of Oregon. As used in this Annual Report on Form 10-K (this "Annual Report"), the terms "we," "us," "our,"\n' +
'"NIKE" and the "Company" refer to NIKE, Inc. and its predecessors, subsidiaries and affiliates, collectively, unless the context indicates otherwise.\n' +
'Our principal business activity is the design, development and worldwide marketing and selling of athletic footwear, apparel, equipment, accessories and services. NIKE is\n' +
'the largest seller of athletic footwear and apparel in the world. We sell our products through NIKE Direct operations, which are comprised of both NIKE-owned retail stores\n' +
'and sales through our digital platforms (also referred to as "NIKE Brand Digital"), to retail accounts and to a mix of independent distributors, licensees and sales',
metadata: {
source: '../../data/nke-10k-2023.pdf',
pdf: {
version: '1.10.100',
info: [Object],
metadata: null,
totalPages: 107
},
loc: { pageNumber: 4, lines: [Object] }
},
id: undefined
}

Return scores:

const results2 = await vectorStore.similaritySearchWithScore(
  "What was Nike's revenue in 2023?"
);

results2[0];
[
Document {
pageContent: 'Table of Contents\n' +
'FISCAL 2023 NIKE BRAND REVENUE HIGHLIGHTS\n' +
'The following tables present NIKE Brand revenues disaggregated by reportable operating segment, distribution channel and major product line:\n' +
'FISCAL 2023 COMPARED TO FISCAL 2022\n' +
'•NIKE, Inc. Revenues were $51.2 billion in fiscal 2023, which increased 10% and 16% compared to fiscal 2022 on a reported and currency-neutral basis, respectively.\n' +
'The increase was due to higher revenues in North America, Europe, Middle East & Africa ("EMEA"), APLA and Greater China, which contributed approximately 7, 6,\n' +
'2 and 1 percentage points to NIKE, Inc. Revenues, respectively.\n' +
'•NIKE Brand revenues, which represented over 90% of NIKE, Inc. Revenues, increased 10% and 16% on a reported and currency-neutral basis, respectively. This\n' +
"increase was primarily due to higher revenues in Men's, the Jordan Brand, Women's and Kids' which grew 17%, 35%,11% and 10%, respectively, on a wholesale\n" +
'equivalent basis.',
metadata: {
source: '../../data/nke-10k-2023.pdf',
pdf: [Object],
loc: [Object]
},
id: undefined
},
0.6992287611800424
]

Return documents based on similarity to an embedded query:

const embedding = await embeddings.embedQuery(
  "How were Nike's margins impacted in 2023?"
);

const results3 = await vectorStore.similaritySearchVectorWithScore(
  embedding,
  1
);

results3[0];
[
Document {
pageContent: 'Table of Contents\n' +
'GROSS MARGIN\n' +
'FISCAL 2023 COMPARED TO FISCAL 2022\n' +
'For fiscal 2023, our consolidated gross profit increased 4% to $22,292 million compared to $21,479 million for fiscal 2022. Gross margin decreased 250 basis points to\n' +
'43.5% for fiscal 2023 compared to 46.0% for fiscal 2022 due to the following:\n' +
'*Wholesale equivalent\n' +
'The decrease in gross margin for fiscal 2023 was primarily due to:\n' +
'•Higher NIKE Brand product costs, on a wholesale equivalent basis, primarily due to higher input costs and elevated inbound freight and logistics costs as well as\n' +
'product mix;\n' +
'•Lower margin in our NIKE Direct business, driven by higher promotional activity to liquidate inventory in the current period compared to lower promotional activity in\n' +
'the prior period resulting from lower available inventory supply;\n' +
'•Unfavorable changes in net foreign currency exchange rates, including hedges; and\n' +
'•Lower off-price margin, on a wholesale equivalent basis.\n' +
'This was partially offset by:',
metadata: {
source: '../../data/nke-10k-2023.pdf',
pdf: [Object],
loc: [Object]
},
id: undefined
},
0.7368815472158006
]

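Retrievers

LangChain VectorStore objects do not subclass Runnable. LangChain Retrievers are Runnables, so they implement a standard set of methods (e.g., synchronous and asynchronous invoke and batch operations). Although we can construct retrievers from vector stores, retrievers can interface with non-vector store sources of data as well (such as external APIs).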
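
The idea of a retriever as "anything that maps a query to documents through invoke and batch" can be sketched without a vector store at all. The MiniRetriever interface and KeywordRetriever class below are hypothetical and are not LangChain types; they are also kept synchronous for brevity, whereas the LangChain calls in this tutorial are awaited.

```typescript
// Hypothetical shapes for illustration; these are not LangChain's own types.
interface Passage {
  pageContent: string;
  metadata: Record<string, unknown>;
}

interface MiniRetriever {
  invoke(query: string): Passage[];
  batch(queries: string[]): Passage[][];
}

// Matches passages that contain any query word longer than three characters.
// It stands in for a non-vector data source such as an external API.
class KeywordRetriever implements MiniRetriever {
  private passages: Passage[];

  constructor(passages: Passage[]) {
    this.passages = passages;
  }

  invoke(query: string): Passage[] {
    const words = query
      .toLowerCase()
      .split(/\W+/)
      .filter((w) => w.length > 3);
    return this.passages.filter((p) => {
      const content = p.pageContent.toLowerCase();
      return words.some((w) => content.indexOf(w) !== -1);
    });
  }

  // batch simply applies invoke to each query in order.
  batch(queries: string[]): Passage[][] {
    return queries.map((q) => this.invoke(q));
  }
}

const petRetriever = new KeywordRetriever([
  {
    pageContent:
      "Dogs are great companions, known for their loyalty and friendliness.",
    metadata: { source: "mammal-pets-doc" },
  },
  {
    pageContent: "Cats are independent pets that often enjoy their own space.",
    metadata: { source: "mammal-pets-doc" },
  },
]);

const batches = petRetriever.batch([
  "Which animals show loyalty?",
  "Who enjoys space?",
]);
console.log(batches.map((b) => b.map((p) => p.pageContent)));
```

Unlike the embedding-based search above, this keyword matcher only succeeds when the query shares literal words with a passage, which is exactly the limitation that semantic search is meant to remove.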

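Vector stores implement an asRetriever method that will generate a Retriever, specifically a VectorStoreRetriever. These retrievers include specific searchType and searchKwargs attributes that identify which methods of the underlying vector store to call, and how to parameterize them.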

const retriever = vectorStore.asRetriever({
  searchType: "mmr",
  searchKwargs: {
    fetchK: 1,
  },
});

await retriever.batch([
  "When was Nike incorporated?",
  "What was Nike's revenue in 2023?",
]);
[
[
Document {
pageContent: 'Table of Contents\n' +
'PART I\n' +
'ITEM 1. BUSINESS\n' +
'GENERAL\n' +
'NIKE, Inc. was incorporated in 1967 under the laws of the State of Oregon. As used in this Annual Report on Form 10-K (this "Annual Report"), the terms "we," "us," "our,"\n' +
'"NIKE" and the "Company" refer to NIKE, Inc. and its predecessors, subsidiaries and affiliates, collectively, unless the context indicates otherwise.\n' +
'Our principal business activity is the design, development and worldwide marketing and selling of athletic footwear, apparel, equipment, accessories and services. NIKE is\n' +
'the largest seller of athletic footwear and apparel in the world. We sell our products through NIKE Direct operations, which are comprised of both NIKE-owned retail stores\n' +
'and sales through our digital platforms (also referred to as "NIKE Brand Digital"), to retail accounts and to a mix of independent distributors, licensees and sales',
metadata: [Object],
id: undefined
}
],
[
Document {
pageContent: 'Table of Contents\n' +
'FISCAL 2023 NIKE BRAND REVENUE HIGHLIGHTS\n' +
'The following tables present NIKE Brand revenues disaggregated by reportable operating segment, distribution channel and major product line:\n' +
'FISCAL 2023 COMPARED TO FISCAL 2022\n' +
'•NIKE, Inc. Revenues were $51.2 billion in fiscal 2023, which increased 10% and 16% compared to fiscal 2022 on a reported and currency-neutral basis, respectively.\n' +
'The increase was due to higher revenues in North America, Europe, Middle East & Africa ("EMEA"), APLA and Greater China, which contributed approximately 7, 6,\n' +
'2 and 1 percentage points to NIKE, Inc. Revenues, respectively.\n' +
'•NIKE Brand revenues, which represented over 90% of NIKE, Inc. Revenues, increased 10% and 16% on a reported and currency-neutral basis, respectively. This\n' +
"increase was primarily due to higher revenues in Men's, the Jordan Brand, Women's and Kids' which grew 17%, 35%,11% and 10%, respectively, on a wholesale\n" +
'equivalent basis.',
metadata: [Object],
id: undefined
}
]
]

VectorStoreRetriever supports search types of "similarity" (default) and "mmr" (maximum marginal relevance, described above).
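
For intuition about how maximum marginal relevance trades similarity against diversity, here is a minimal sketch of the greedy selection rule in plain TypeScript. It is a textbook-style illustration and may differ from LangChain's own implementation; the function names and the lambda weighting parameter are assumptions made for this example, and the vectors are hand-made.

```typescript
function cosineSim(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / Math.sqrt(normA * normB);
}

// Greedy MMR: repeatedly pick the candidate that maximizes
//   lambda * sim(query, cand) - (1 - lambda) * max sim(cand, already selected).
// Returns indices into `candidates`, in selection order.
function maximalMarginalRelevance(
  query: number[],
  candidates: number[][],
  k: number,
  lambda: number
): number[] {
  const selected: number[] = [];
  const remaining: number[] = candidates.map((_, i) => i);
  while (selected.length < k && remaining.length > 0) {
    let bestPos = 0;
    let bestScore = -Infinity;
    for (let pos = 0; pos < remaining.length; pos++) {
      const cand = candidates[remaining[pos]];
      const relevance = cosineSim(query, cand);
      // Redundancy is the similarity to the closest already-selected item.
      let redundancy = selected.length === 0 ? 0 : -Infinity;
      for (const s of selected) {
        redundancy = Math.max(redundancy, cosineSim(cand, candidates[s]));
      }
      const score = lambda * relevance - (1 - lambda) * redundancy;
      if (score > bestScore) {
        bestScore = score;
        bestPos = pos;
      }
    }
    selected.push(remaining.splice(bestPos, 1)[0]);
  }
  return selected;
}

const mmrQuery = [1, 0];
const mmrCandidates = [
  [1, 0.1], // index 0: closest to the query
  [1, 0.12], // index 1: near-duplicate of index 0
  [0.7, -0.7], // index 2: less similar, but different from index 0
];

console.log(maximalMarginalRelevance(mmrQuery, mmrCandidates, 2, 0.5)); // [ 0, 2 ]
console.log(maximalMarginalRelevance(mmrQuery, mmrCandidates, 2, 1)); // [ 0, 1 ]
```

With lambda set to 1 the rule reduces to plain similarity ranking and returns the two near-duplicates; with lambda at 0.5 the second pick is the more distinct candidate.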

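Retrievers can easily be incorporated into more complex applications, such as retrieval-augmented generation (RAG) applications that combine a given question with retrieved context into a prompt for an LLM. To learn more about building such an application, check out the RAG tutorial.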

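Learn more:

Retrieval strategies can be rich and complex. The retrievers section of the how-to guides covers built-in retrieval strategies.

It is also straightforward to extend the BaseRetriever class in order to implement custom retrievers. See our how-to guide.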

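Next steps

You've now seen how to build a semantic search engine over a PDF document.

For more on document loaders, embeddings, vector stores, and RAG, see the corresponding LangChain guides.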

