How to retrieve the whole document for a chunk
This guide assumes familiarity with the following concepts.
When splitting documents for retrieval, there are often conflicting desires:
- You may want to have small documents, so that their embeddings can most accurately reflect their meaning. If a document is too long, the embeddings can lose meaning.
- You want to have documents long enough that the context of each chunk is retained.
The ParentDocumentRetriever strikes that balance by splitting and storing small chunks of data. During retrieval, it first fetches the small chunks but then looks up the parent IDs for those chunks and returns those larger documents.
Note that "parent document" refers to the document that a small chunk originated from. This can either be the whole raw document or a larger chunk.
This is a more specific form of generating multiple embeddings per document.
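Conceptually, the flow can be sketched with plain data structures. This is an illustrative sketch only, not LangChain's internals: child chunks carry a parent ID, search runs over the small chunks, and results resolve back to the full parent documents (substring matching stands in for vector search here).

```typescript
type ChildChunk = { text: string; parentId: string };

// Parent documents, keyed by ID.
const parentStore = new Map<string, string>([
  ["doc1", "Full text of the first parent document..."],
  ["doc2", "Full text of the second parent document..."],
]);

// Small chunks, each pointing back to its parent.
const childChunks: ChildChunk[] = [
  { text: "first parent", parentId: "doc1" },
  { text: "second parent", parentId: "doc2" },
];

// Stand-in for vector similarity search over the small chunks.
function searchChildren(query: string): ChildChunk[] {
  return childChunks.filter((c) => c.text.includes(query));
}

// Retrieval returns the *parent* documents of the matching child chunks.
function retrieveParents(query: string): string[] {
  const ids = new Set(searchChildren(query).map((c) => c.parentId));
  return Array.from(ids).map((id) => parentStore.get(id)!);
}
```

The real retriever does the same resolution step after a vector search, which is why small, precisely-embedded chunks can still yield full documents.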
Usage
- npm: npm install @langchain/openai @langchain/core
- Yarn: yarn add @langchain/openai @langchain/core
- pnpm: pnpm add @langchain/openai @langchain/core
import { OpenAIEmbeddings } from "@langchain/openai";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { ParentDocumentRetriever } from "langchain/retrievers/parent_document";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { TextLoader } from "langchain/document_loaders/fs/text";
import { InMemoryStore } from "@langchain/core/stores";
const vectorstore = new MemoryVectorStore(new OpenAIEmbeddings());
const byteStore = new InMemoryStore<Uint8Array>();
const retriever = new ParentDocumentRetriever({
vectorstore,
byteStore,
// Optional, not required if you're already passing in split documents
parentSplitter: new RecursiveCharacterTextSplitter({
chunkOverlap: 0,
chunkSize: 500,
}),
childSplitter: new RecursiveCharacterTextSplitter({
chunkOverlap: 0,
chunkSize: 50,
}),
// Optional `k` parameter to search for more child documents in VectorStore.
// Note that this does not exactly correspond to the number of final (parent) documents
// retrieved, as multiple child documents can point to the same parent.
childK: 20,
// Optional `k` parameter to limit number of final, parent documents returned from this
// retriever and sent to LLM. This is an upper-bound, and the final count may be lower than this.
parentK: 5,
});
const textLoader = new TextLoader("../examples/state_of_the_union.txt");
const parentDocuments = await textLoader.load();
// We must add the parent documents via the retriever's addDocuments method
await retriever.addDocuments(parentDocuments);
const retrievedDocs = await retriever.invoke("justice breyer");
// Retrieved chunks are the larger parent chunks
console.log(retrievedDocs);
/*
[
Document {
pageContent: 'Tonight, I call on the Senate to pass — pass the Freedom to Vote Act. Pass the John Lewis Act — Voting Rights Act. And while you’re at it, pass the DISCLOSE Act so Americans know who is funding our elections.\n' +
'\n' +
'Look, tonight, I’d — I’d like to honor someone who has dedicated his life to serve this country: Justice Breyer — an Army veteran, Constitutional scholar, retiring Justice of the United States Supreme Court.',
metadata: { source: '../examples/state_of_the_union.txt', loc: [Object] }
},
Document {
pageContent: 'As I did four days ago, I’ve nominated a Circuit Court of Appeals — Ketanji Brown Jackson. One of our nation’s top legal minds who will continue in just Brey- — Justice Breyer’s legacy of excellence. A former top litigator in private practice, a former federal public defender from a family of public-school educators and police officers — she’s a consensus builder.',
metadata: { source: '../examples/state_of_the_union.txt', loc: [Object] }
},
Document {
pageContent: 'Justice Breyer, thank you for your service. Thank you, thank you, thank you. I mean it. Get up. Stand — let me see you. Thank you.\n' +
'\n' +
'And we all know — no matter what your ideology, we all know one of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.',
metadata: { source: '../examples/state_of_the_union.txt', loc: [Object] }
}
]
*/
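The relationship between childK and parentK noted in the comments above can be sketched in isolation: several child matches may share one parent, so deduplicating by parent ID makes parentK an upper bound rather than an exact count (this helper is hypothetical, for illustration only).

```typescript
// Deduplicate child matches by parent ID, capped at parentK results.
function dedupeParents(
  childMatches: { parentId: string }[],
  parentK: number
): string[] {
  const seen: string[] = [];
  for (const m of childMatches) {
    if (!seen.includes(m.parentId)) seen.push(m.parentId);
    if (seen.length === parentK) break;
  }
  return seen;
}
```

Even with childK: 20, a query whose top child chunks all come from the same document yields a single parent.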
API Reference:
- OpenAIEmbeddings from @langchain/openai
- MemoryVectorStore from langchain/vectorstores/memory
- ParentDocumentRetriever from langchain/retrievers/parent_document
- RecursiveCharacterTextSplitter from @langchain/textsplitters
- TextLoader from langchain/document_loaders/fs/text
- InMemoryStore from @langchain/core/stores
With score threshold
By setting the options in scoreThresholdOptions we can force the ParentDocumentRetriever to use a ScoreThresholdRetriever under the hood. This sets the vector store inside the ScoreThresholdRetriever to the one we passed when initializing the ParentDocumentRetriever, while also allowing us to set a score threshold for the retriever.
This can be helpful when you're not sure how many documents you want (or if you are sure, just set the maxK option), but you want to make sure that the documents you do get are within a certain relevancy threshold.
Note: if a retriever is passed, ParentDocumentRetriever will default to using it for retrieving small chunks, as well as for adding documents via the addDocuments method.
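The gist of score thresholding can be shown standalone: keep only results whose similarity clears a minimum, capped at maxK. This is an illustrative sketch with assumed names, not the ScoreThresholdRetriever implementation itself.

```typescript
type Scored = { text: string; score: number };

// Keep results at or above minSimilarityScore, best-first, at most maxK.
function thresholdRetrieve(
  results: Scored[],
  minSimilarityScore: number,
  maxK: number
): Scored[] {
  return results
    .filter((r) => r.score >= minSimilarityScore)
    .sort((a, b) => b.score - a.score)
    .slice(0, maxK);
}
```

A low threshold with a small maxK (as in the example below) effectively means "the single best match, whatever its score".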
import { OpenAIEmbeddings } from "@langchain/openai";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { InMemoryStore } from "@langchain/core/stores";
import { ParentDocumentRetriever } from "langchain/retrievers/parent_document";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { TextLoader } from "langchain/document_loaders/fs/text";
import { ScoreThresholdRetriever } from "langchain/retrievers/score_threshold";
const vectorstore = new MemoryVectorStore(new OpenAIEmbeddings());
const byteStore = new InMemoryStore<Uint8Array>();
const childDocumentRetriever = ScoreThresholdRetriever.fromVectorStore(
vectorstore,
{
minSimilarityScore: 0.01, // Essentially no threshold
maxK: 1, // Only return the top result
}
);
const retriever = new ParentDocumentRetriever({
vectorstore,
byteStore,
childDocumentRetriever,
// Optional, not required if you're already passing in split documents
parentSplitter: new RecursiveCharacterTextSplitter({
chunkOverlap: 0,
chunkSize: 500,
}),
childSplitter: new RecursiveCharacterTextSplitter({
chunkOverlap: 0,
chunkSize: 50,
}),
});
const textLoader = new TextLoader("../examples/state_of_the_union.txt");
const parentDocuments = await textLoader.load();
// We must add the parent documents via the retriever's addDocuments method
await retriever.addDocuments(parentDocuments);
const retrievedDocs = await retriever.invoke("justice breyer");
// Retrieved chunk is the larger parent chunk
console.log(retrievedDocs);
/*
[
Document {
pageContent: 'Tonight, I call on the Senate to pass — pass the Freedom to Vote Act. Pass the John Lewis Act — Voting Rights Act. And while you’re at it, pass the DISCLOSE Act so Americans know who is funding our elections.\n' +
'\n' +
'Look, tonight, I’d — I’d like to honor someone who has dedicated his life to serve this country: Justice Breyer — an Army veteran, Constitutional scholar, retiring Justice of the United States Supreme Court.',
metadata: { source: '../examples/state_of_the_union.txt', loc: [Object] }
},
]
*/
API Reference:
- OpenAIEmbeddings from @langchain/openai
- MemoryVectorStore from langchain/vectorstores/memory
- InMemoryStore from @langchain/core/stores
- ParentDocumentRetriever from langchain/retrievers/parent_document
- RecursiveCharacterTextSplitter from @langchain/textsplitters
- TextLoader from langchain/document_loaders/fs/text
- ScoreThresholdRetriever from langchain/retrievers/score_threshold
With contextual chunk headers
Consider a scenario where you want to store a collection of documents in a vector store and perform Q&A tasks on them. Simply splitting documents into chunks with overlapping text may not provide sufficient context for an LLM to determine whether multiple chunks are referencing the same information, or how to resolve information from contradictory sources.
Tagging each document with metadata is one solution if you know what to filter against, but you may not know ahead of time exactly what kinds of queries your vector store will need to handle. Including additional contextual information directly in each chunk in the form of headers can help deal with arbitrary queries.
This is particularly important if you have several fine-grained child chunks that need to be correctly retrieved from the vector store.
import { OpenAIEmbeddings } from "@langchain/openai";
import { HNSWLib } from "@langchain/community/vectorstores/hnswlib";
import { InMemoryStore } from "@langchain/core/stores";
import { ParentDocumentRetriever } from "langchain/retrievers/parent_document";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
const splitter = new RecursiveCharacterTextSplitter({
chunkSize: 1500,
chunkOverlap: 0,
});
const jimDocs = await splitter.createDocuments([`My favorite color is blue.`]);
const jimChunkHeaderOptions = {
chunkHeader: "DOC NAME: Jim Interview\n---\n",
appendChunkOverlapHeader: true,
};
const pamDocs = await splitter.createDocuments([`My favorite color is red.`]);
const pamChunkHeaderOptions = {
chunkHeader: "DOC NAME: Pam Interview\n---\n",
appendChunkOverlapHeader: true,
};
const vectorstore = await HNSWLib.fromDocuments([], new OpenAIEmbeddings());
const byteStore = new InMemoryStore<Uint8Array>();
const retriever = new ParentDocumentRetriever({
vectorstore,
byteStore,
// Very small chunks for demo purposes.
// Use a bigger chunk size for serious use-cases.
childSplitter: new RecursiveCharacterTextSplitter({
chunkSize: 10,
chunkOverlap: 0,
}),
childK: 50,
parentK: 5,
});
// We pass additional option `childDocChunkHeaderOptions`
// that will add the chunk header to child documents
await retriever.addDocuments(jimDocs, {
childDocChunkHeaderOptions: jimChunkHeaderOptions,
});
await retriever.addDocuments(pamDocs, {
childDocChunkHeaderOptions: pamChunkHeaderOptions,
});
// This will search child documents in vector store with the help of chunk header,
// returning the unmodified parent documents
const retrievedDocs = await retriever.invoke("What is Pam's favorite color?");
// Pam's favorite color is returned first!
console.log(JSON.stringify(retrievedDocs, null, 2));
/*
[
{
"pageContent": "My favorite color is red.",
"metadata": {
"loc": {
"lines": {
"from": 1,
"to": 1
}
}
}
},
{
"pageContent": "My favorite color is blue.",
"metadata": {
"loc": {
"lines": {
"from": 1,
"to": 1
}
}
}
}
]
*/
const rawDocs = await vectorstore.similaritySearch(
"What is Pam's favorite color?"
);
// Raw docs in vectorstore are short but have chunk headers
console.log(JSON.stringify(rawDocs, null, 2));
/*
[
{
"pageContent": "DOC NAME: Pam Interview\n---\n(cont'd) color is",
"metadata": {
"loc": {
"lines": {
"from": 1,
"to": 1
}
},
"doc_id": "affdcbeb-6bfb-42e9-afe5-80f4f2e9f6aa"
}
},
{
"pageContent": "DOC NAME: Pam Interview\n---\n(cont'd) favorite",
"metadata": {
"loc": {
"lines": {
"from": 1,
"to": 1
}
},
"doc_id": "affdcbeb-6bfb-42e9-afe5-80f4f2e9f6aa"
}
},
{
"pageContent": "DOC NAME: Pam Interview\n---\n(cont'd) red.",
"metadata": {
"loc": {
"lines": {
"from": 1,
"to": 1
}
},
"doc_id": "affdcbeb-6bfb-42e9-afe5-80f4f2e9f6aa"
}
},
{
"pageContent": "DOC NAME: Pam Interview\n---\nMy",
"metadata": {
"loc": {
"lines": {
"from": 1,
"to": 1
}
},
"doc_id": "affdcbeb-6bfb-42e9-afe5-80f4f2e9f6aa"
}
}
]
*/
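As the raw documents above show, each child chunk is prefixed with the document header, and chunks after the first get a "(cont'd)" marker. Conceptually the transformation looks like the following sketch (an illustration of the observable behavior, not LangChain's actual implementation):

```typescript
// Prefix each chunk with a document-level header; when
// appendChunkOverlapHeader is set, mark continuation chunks.
function withHeaders(
  chunks: string[],
  chunkHeader: string,
  appendChunkOverlapHeader: boolean
): string[] {
  return chunks.map((chunk, i) =>
    i > 0 && appendChunkOverlapHeader
      ? `${chunkHeader}(cont'd) ${chunk}`
      : `${chunkHeader}${chunk}`
  );
}
```

Because the header text is embedded along with each tiny chunk, a query mentioning "Pam" can match child chunks that themselves contain no name at all.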
API Reference:
- OpenAIEmbeddings from @langchain/openai
- HNSWLib from @langchain/community/vectorstores/hnswlib
- InMemoryStore from @langchain/core/stores
- ParentDocumentRetriever from langchain/retrievers/parent_document
- RecursiveCharacterTextSplitter from @langchain/textsplitters
With reranking
When many documents from the vector store are passed to the LLM, the final answer sometimes includes information from irrelevant chunks, making it less precise and sometimes incorrect. Passing multiple irrelevant documents also increases cost. So there are two reasons to use reranking: precision and cost.
import { OpenAIEmbeddings } from "@langchain/openai";
import { CohereRerank } from "@langchain/cohere";
import { HNSWLib } from "@langchain/community/vectorstores/hnswlib";
import { InMemoryStore } from "@langchain/core/stores";
import {
ParentDocumentRetriever,
type SubDocs,
} from "langchain/retrievers/parent_document";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
// init Cohere Rerank. Remember to add COHERE_API_KEY to your .env
const reranker = new CohereRerank({
topN: 50,
model: "rerank-multilingual-v2.0",
});
export function documentCompressorFiltering({
relevanceScore,
}: { relevanceScore?: number } = {}) {
return (docs: SubDocs) => {
let outputDocs = docs;
if (relevanceScore) {
const docsRelevanceScoreValues = docs.map(
(doc) => doc?.metadata?.relevanceScore
);
outputDocs = docs.filter(
(_doc, index) =>
(docsRelevanceScoreValues?.[index] || 1) >= relevanceScore
);
}
return outputDocs;
};
}
const splitter = new RecursiveCharacterTextSplitter({
chunkSize: 500,
chunkOverlap: 0,
});
const jimDocs = await splitter.createDocuments([`Jim favorite color is blue.`]);
const pamDocs = await splitter.createDocuments([`Pam favorite color is red.`]);
const vectorstore = await HNSWLib.fromDocuments([], new OpenAIEmbeddings());
const byteStore = new InMemoryStore<Uint8Array>();
const retriever = new ParentDocumentRetriever({
vectorstore,
byteStore,
// Very small chunks for demo purposes.
// Use a bigger chunk size for serious use-cases.
childSplitter: new RecursiveCharacterTextSplitter({
chunkSize: 10,
chunkOverlap: 0,
}),
childK: 50,
parentK: 5,
// We add Reranker
documentCompressor: reranker,
documentCompressorFilteringFn: documentCompressorFiltering({
relevanceScore: 0.3,
}),
});
const docs = jimDocs.concat(pamDocs);
await retriever.addDocuments(docs);
// This will search the vector store and return documents for the LLM that are
// already reranked, sorted, and filtered by the minimum relevance score
const retrievedDocs = await retriever.invoke("What is Pam's favorite color?");
// Pam's favorite color is returned first!
console.log(JSON.stringify(retrievedDocs, null, 2));
/*
[
{
"pageContent": "My favorite color is red.",
"metadata": {
"relevanceScore": 0.9
"loc": {
"lines": {
"from": 1,
"to": 1
}
}
}
}
]
*/
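The filtering step from documentCompressorFiltering above can be exercised in isolation on plain objects (a SubDocs value is just an array of Document-like objects; the helper below is a hypothetical standalone equivalent):

```typescript
type DocLike = { pageContent: string; metadata: { relevanceScore?: number } };

// Keep docs whose reranker score meets the threshold; docs without a score
// pass through, mirroring the `|| 1` fallback in the helper above.
function filterByRelevance(docs: DocLike[], relevanceScore: number): DocLike[] {
  return docs.filter(
    (doc) => (doc.metadata.relevanceScore ?? 1) >= relevanceScore
  );
}
```

With relevanceScore: 0.3 as in the example, any chunk the reranker scores below 0.3 is dropped before the parent documents are returned.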
API Reference:
- OpenAIEmbeddings from @langchain/openai
- CohereRerank from @langchain/cohere
- HNSWLib from @langchain/community/vectorstores/hnswlib
- InMemoryStore from @langchain/core/stores
- ParentDocumentRetriever from langchain/retrievers/parent_document
- SubDocs from langchain/retrievers/parent_document
- RecursiveCharacterTextSplitter from @langchain/textsplitters
Next steps
You've now learned how to use the ParentDocumentRetriever.
Next, check out the more general form of generating multiple embeddings per document, the broader tutorial on RAG, or this section to learn how to create your own custom retriever over any data source.