How to do retrieval with contextual compression
This guide assumes familiarity with the following concepts: retrievers, vector stores, and retrieval-augmented generation (RAG).
One challenge with retrieval is that usually you don't know the specific queries your document storage system will face when you ingest data into it. This means that the information most relevant to a query may be buried in a document with a lot of irrelevant text. Passing that full document through your application can lead to more expensive LLM calls and poorer responses.
Contextual compression is meant to fix this. The idea is simple: instead of immediately returning retrieved documents as-is, you can compress them using the context of the given query, so that only the relevant information is returned. "Compressing" here refers both to compressing the contents of an individual document and to filtering out documents wholesale.
To use the contextual compression retriever, you'll need:
- a base retriever
- a document compressor
The contextual compression retriever passes queries to the base retriever, takes the initial documents, and passes them through the document compressor. The document compressor takes a list of documents and shortens it by reducing the contents of documents or dropping documents altogether.
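Conceptually, the flow on each query is simple. The following is a rough TypeScript sketch for illustration only, with simplified types; the real ContextualCompressionRetriever also wires in configuration and callbacks:
import { Document } from "@langchain/core/documents";

// Rough sketch of contextual compression retrieval (not the library implementation):
// 1. ask the base retriever for candidate documents,
// 2. let the compressor shrink or drop them based on the query.
async function compressedRetrieve(
  baseRetriever: { invoke: (query: string) => Promise<Document[]> },
  baseCompressor: {
    compressDocuments: (docs: Document[], query: string) => Promise<Document[]>;
  },
  query: string
): Promise<Document[]> {
  const initialDocs = await baseRetriever.invoke(query);
  return baseCompressor.compressDocuments(initialDocs, query);
}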
Using a vanilla vector store retriever
Let's start by initializing a simple vector store retriever and storing the 2023 State of the Union speech (in chunks). Given an example question, our retriever returns one or two relevant docs and a few irrelevant docs, and even the relevant docs contain a lot of irrelevant information. To extract only the content relevant to our query, we'll wrap our base retriever with an LLMChainExtractor, which will iterate over the initially returned documents and extract from each only the content that is relevant to the query.
npm:
npm install @langchain/openai @langchain/community @langchain/core
Yarn:
yarn add @langchain/openai @langchain/community @langchain/core
pnpm:
pnpm add @langchain/openai @langchain/community @langchain/core
import * as fs from "fs";
import { OpenAI, OpenAIEmbeddings } from "@langchain/openai";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { HNSWLib } from "@langchain/community/vectorstores/hnswlib";
import { ContextualCompressionRetriever } from "langchain/retrievers/contextual_compression";
import { LLMChainExtractor } from "langchain/retrievers/document_compressors/chain_extract";
const model = new OpenAI({
  model: "gpt-3.5-turbo-instruct",
});
const baseCompressor = LLMChainExtractor.fromLLM(model);
const text = fs.readFileSync("state_of_the_union.txt", "utf8");
const textSplitter = new RecursiveCharacterTextSplitter({ chunkSize: 1000 });
const docs = await textSplitter.createDocuments([text]);
// Create a vector store from the documents.
const vectorStore = await HNSWLib.fromDocuments(docs, new OpenAIEmbeddings());
const retriever = new ContextualCompressionRetriever({
  baseCompressor,
  baseRetriever: vectorStore.asRetriever(),
});
const retrievedDocs = await retriever.invoke(
  "What did the speaker say about Justice Breyer?"
);
console.log({ retrievedDocs });
/*
{
retrievedDocs: [
Document {
pageContent: 'One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.',
metadata: [Object]
},
Document {
pageContent: '"Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service."',
metadata: [Object]
},
Document {
pageContent: 'The onslaught of state laws targeting transgender Americans and their families is wrong.',
metadata: [Object]
}
]
}
*/
API Reference:
- OpenAI from @langchain/openai
- OpenAIEmbeddings from @langchain/openai
- RecursiveCharacterTextSplitter from @langchain/textsplitters
- HNSWLib from @langchain/community/vectorstores/hnswlib
- ContextualCompressionRetriever from langchain/retrievers/contextual_compression
- LLMChainExtractor from langchain/retrievers/document_compressors/chain_extract
EmbeddingsFilter
Making an extra LLM call over each retrieved document is expensive and slow. The EmbeddingsFilter provides a cheaper and faster option by embedding the documents and the query and only returning those documents which have sufficiently similar embeddings to the query. This is most useful for non-vector store retrievers, where we may not have control over the returned chunk size, or as part of a pipeline, as outlined below.
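To make the mechanism concrete, here is a hypothetical helper capturing the underlying idea (cosine similarity against a threshold). The names below are illustrative and not part of the library; the real EmbeddingsFilter additionally supports options such as a k limit on the number of documents returned:
import { OpenAIEmbeddings } from "@langchain/openai";
import { Document } from "@langchain/core/documents";

// Illustrative only: keep documents whose embedding is similar enough to the query embedding.
const cosineSimilarity = (a: number[], b: number[]) => {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i += 1) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
};

async function filterBySimilarity(
  docs: Document[],
  query: string,
  threshold = 0.8
): Promise<Document[]> {
  const embeddings = new OpenAIEmbeddings();
  const queryEmbedding = await embeddings.embedQuery(query);
  const docEmbeddings = await embeddings.embedDocuments(
    docs.map((doc) => doc.pageContent)
  );
  return docs.filter(
    (_doc, i) => cosineSimilarity(docEmbeddings[i], queryEmbedding) >= threshold
  );
}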
Here's an example using the built-in EmbeddingsFilter:
import * as fs from "fs";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { HNSWLib } from "@langchain/community/vectorstores/hnswlib";
import { OpenAIEmbeddings } from "@langchain/openai";
import { ContextualCompressionRetriever } from "langchain/retrievers/contextual_compression";
import { EmbeddingsFilter } from "langchain/retrievers/document_compressors/embeddings_filter";
const baseCompressor = new EmbeddingsFilter({
  embeddings: new OpenAIEmbeddings(),
  similarityThreshold: 0.8,
});
const text = fs.readFileSync("state_of_the_union.txt", "utf8");
const textSplitter = new RecursiveCharacterTextSplitter({ chunkSize: 1000 });
const docs = await textSplitter.createDocuments([text]);
// Create a vector store from the documents.
const vectorStore = await HNSWLib.fromDocuments(docs, new OpenAIEmbeddings());
const retriever = new ContextualCompressionRetriever({
  baseCompressor,
  baseRetriever: vectorStore.asRetriever(),
});
const retrievedDocs = await retriever.invoke(
  "What did the speaker say about Justice Breyer?"
);
console.log({ retrievedDocs });
/*
{
retrievedDocs: [
Document {
pageContent: 'And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. \n' +
'\n' +
'A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. \n' +
'\n' +
'And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system. \n' +
'\n' +
'We can do both. At our border, we’ve installed new technology like cutting-edge scanners to better detect drug smuggling. \n' +
'\n' +
'We’ve set up joint patrols with Mexico and Guatemala to catch more human traffickers. \n' +
'\n' +
'We’re putting in place dedicated immigration judges so families fleeing persecution and violence can have their cases heard faster.',
metadata: [Object]
},
Document {
pageContent: 'In state after state, new laws have been passed, not only to suppress the vote, but to subvert entire elections. \n' +
'\n' +
'We cannot let this happen. \n' +
'\n' +
'Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n' +
'\n' +
'Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n' +
'\n' +
'One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n' +
'\n' +
'And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.',
metadata: [Object]
}
]
}
*/
API Reference:
- RecursiveCharacterTextSplitter from @langchain/textsplitters
- HNSWLib from @langchain/community/vectorstores/hnswlib
- OpenAIEmbeddings from @langchain/openai
- ContextualCompressionRetriever from langchain/retrievers/contextual_compression
- EmbeddingsFilter from langchain/retrievers/document_compressors/embeddings_filter
Stringing compressors and document transformers together
Using the DocumentCompressorPipeline, we can also easily combine multiple compressors in sequence. Along with compressors, we can add BaseDocumentTransformers to our pipeline, which don't perform any contextual compression but simply apply some transformation to a set of documents. For example, TextSplitters can be used as document transformers to split documents into smaller pieces, and the EmbeddingsFilter can be used to filter out documents based on the similarity of the individual chunks to the input query.
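Conceptually, a pipeline just feeds each stage's output into the next stage. A rough sketch of that loop, assuming stages expose either a transformer interface (transformDocuments) or a compressor interface (compressDocuments), might look like this (illustration only, not the library implementation):
import { Document } from "@langchain/core/documents";

// Each pipeline stage consumes the previous stage's output.
type Stage =
  | { transformDocuments: (docs: Document[]) => Promise<Document[]> }
  | { compressDocuments: (docs: Document[], query: string) => Promise<Document[]> };

async function runPipeline(
  stages: Stage[],
  docs: Document[],
  query: string
): Promise<Document[]> {
  let current = docs;
  for (const stage of stages) {
    current =
      "compressDocuments" in stage
        ? await stage.compressDocuments(current, query)
        : await stage.transformDocuments(current);
  }
  return current;
}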
Below we create a compressor pipeline that first splits raw web page documents retrieved from the Tavily web search API retriever into smaller chunks, then filters them based on relevance to the query. The result is smaller chunks that are semantically similar to the input query. This skips the need to add documents to a vector store to perform similarity search, which can be useful for one-off use cases:
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { OpenAIEmbeddings } from "@langchain/openai";
import { ContextualCompressionRetriever } from "langchain/retrievers/contextual_compression";
import { EmbeddingsFilter } from "langchain/retrievers/document_compressors/embeddings_filter";
import { TavilySearchAPIRetriever } from "@langchain/community/retrievers/tavily_search_api";
import { DocumentCompressorPipeline } from "langchain/retrievers/document_compressors";
const embeddingsFilter = new EmbeddingsFilter({
  embeddings: new OpenAIEmbeddings(),
  similarityThreshold: 0.8,
  k: 5,
});
const textSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 200,
  chunkOverlap: 0,
});
const compressorPipeline = new DocumentCompressorPipeline({
  transformers: [textSplitter, embeddingsFilter],
});
const baseRetriever = new TavilySearchAPIRetriever({
  includeRawContent: true,
});
const retriever = new ContextualCompressionRetriever({
  baseCompressor: compressorPipeline,
  baseRetriever,
});
const retrievedDocs = await retriever.invoke(
  "What did the speaker say about Justice Breyer in the 2022 State of the Union?"
);
console.log({ retrievedDocs });
/*
{
retrievedDocs: [
Document {
pageContent: 'Justice Stephen Breyer talks to President Joe Biden ahead of the State of the Union address on Tuesday. (jabin botsford/Agence France-Presse/Getty Images)',
metadata: [Object]
},
Document {
pageContent: 'President Biden recognized outgoing US Supreme Court Justice Stephen Breyer during his State of the Union on Tuesday.',
metadata: [Object]
},
Document {
pageContent: 'What we covered here\n' +
'Biden recognized outgoing Supreme Court Justice Breyer during his speech',
metadata: [Object]
},
Document {
pageContent: 'States Supreme Court. Justice Breyer, thank you for your service,” the president said.',
metadata: [Object]
},
Document {
pageContent: 'Court," Biden said. "Justice Breyer, thank you for your service."',
metadata: [Object]
}
]
}
*/
API Reference:
- RecursiveCharacterTextSplitter from @langchain/textsplitters
- OpenAIEmbeddings from @langchain/openai
- ContextualCompressionRetriever from langchain/retrievers/contextual_compression
- EmbeddingsFilter from langchain/retrievers/document_compressors/embeddings_filter
- TavilySearchAPIRetriever from @langchain/community/retrievers/tavily_search_api
- DocumentCompressorPipeline from langchain/retrievers/document_compressors
Next steps
You've now learned a few ways to use contextual compression to remove bad data from your results.
See the individual sections for deeper dives on specific retrievers, the broader tutorial on RAG, or this section to learn how to create your own custom retriever over any data source.