How to do retrieval with contextual compression
This guide assumes familiarity with core retrieval concepts such as retrievers and retrieval-augmented generation (RAG).
One challenge with retrieval is that you usually don't know the specific queries your document storage system will face when you ingest data into it. This means that the information most relevant to a query may be buried in a document with a lot of irrelevant text. Passing that full document through your application can lead to more expensive LLM calls and poorer responses.
Contextual compression is meant to fix this. The idea is simple: instead of immediately returning retrieved documents as-is, you can compress them using the context of the given query so that only the relevant information is returned. "Compressing" here refers both to compressing the contents of an individual document and to filtering out documents wholesale.
To use the Contextual Compression Retriever, you'll need:
- a base retriever
- a document compressor
The Contextual Compression Retriever passes queries to the base retriever, takes the initial documents, and passes them through the document compressor. The document compressor takes a list of documents and shortens it by reducing the contents of documents or dropping documents altogether.
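Conceptually, each call to the retriever boils down to those two steps: retrieve, then compress. The sketch below illustrates the idea; the helper function is purely illustrative, and the exact import path for BaseDocumentCompressor may vary between versions (the real ContextualCompressionRetriever also handles callbacks and run configuration):
import type { BaseRetrieverInterface } from "@langchain/core/retrievers";
import type { DocumentInterface } from "@langchain/core/documents";
import type { BaseDocumentCompressor } from "langchain/retrievers/document_compressors";

// Illustrative only: roughly what the contextual compression retriever does per query.
async function retrieveAndCompress(
  baseRetriever: BaseRetrieverInterface,
  baseCompressor: BaseDocumentCompressor,
  query: string
): Promise<DocumentInterface[]> {
  // 1. Fetch candidate documents from the base retriever.
  const docs = await baseRetriever.invoke(query);
  // 2. Shorten or drop documents based on the query.
  return baseCompressor.compressDocuments(docs, query);
}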
Using a plain vector store retriever
Let's start by initializing a simple vector store retriever and storing the 2022 State of the Union speech (in chunks). Given an example question, our retriever returns one or two relevant docs and a few irrelevant docs, and even the relevant docs contain a lot of irrelevant information. To extract all the context we can, we use an LLMChainExtractor, which will iterate over the initially returned documents and extract from each only the content that is relevant to the query.
See this section for general instructions on installing integration packages.
- npm: npm install @langchain/openai @langchain/community @langchain/core
- Yarn: yarn add @langchain/openai @langchain/community @langchain/core
- pnpm: pnpm add @langchain/openai @langchain/community @langchain/core
import * as fs from "fs";
import { OpenAI, OpenAIEmbeddings } from "@langchain/openai";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { HNSWLib } from "@langchain/community/vectorstores/hnswlib";
import { ContextualCompressionRetriever } from "langchain/retrievers/contextual_compression";
import { LLMChainExtractor } from "langchain/retrievers/document_compressors/chain_extract";
const model = new OpenAI({
model: "gpt-3.5-turbo-instruct",
});
// Use the LLM to extract only the query-relevant parts of each retrieved document.
const baseCompressor = LLMChainExtractor.fromLLM(model);
const text = fs.readFileSync("state_of_the_union.txt", "utf8");
const textSplitter = new RecursiveCharacterTextSplitter({ chunkSize: 1000 });
const docs = await textSplitter.createDocuments([text]);
// Create a vector store from the documents.
const vectorStore = await HNSWLib.fromDocuments(docs, new OpenAIEmbeddings());
// Wrap the base retriever so its results are passed through the compressor.
const retriever = new ContextualCompressionRetriever({
baseCompressor,
baseRetriever: vectorStore.asRetriever(),
});
const retrievedDocs = await retriever.invoke(
"What did the speaker say about Justice Breyer?"
);
console.log({ retrievedDocs });
/*
{
retrievedDocs: [
Document {
pageContent: 'One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.',
metadata: [Object]
},
Document {
pageContent: '"Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service."',
metadata: [Object]
},
Document {
pageContent: 'The onslaught of state laws targeting transgender Americans and their families is wrong.',
metadata: [Object]
}
]
}
*/
API Reference:
- OpenAI from @langchain/openai
- OpenAIEmbeddings from @langchain/openai
- RecursiveCharacterTextSplitter from @langchain/textsplitters
- HNSWLib from @langchain/community/vectorstores/hnswlib
- ContextualCompressionRetriever from langchain/retrievers/contextual_compression
- LLMChainExtractor from langchain/retrievers/document_compressors/chain_extract
EmbeddingsFilter
Making an extra LLM call over each retrieved document is expensive and slow. The EmbeddingsFilter provides a cheaper and faster option by embedding the documents and the query and only returning those documents whose embeddings are sufficiently similar to the query.
This is most useful for non-vector-store retrievers where we may not have control over the size of the returned chunks, or as part of a pipeline as described below.
Here's an example:
import * as fs from "fs";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { HNSWLib } from "@langchain/community/vectorstores/hnswlib";
import { OpenAIEmbeddings } from "@langchain/openai";
import { ContextualCompressionRetriever } from "langchain/retrievers/contextual_compression";
import { EmbeddingsFilter } from "langchain/retrievers/document_compressors/embeddings_filter";
// Filter out chunks whose embedding similarity to the query falls below 0.8.
const baseCompressor = new EmbeddingsFilter({
embeddings: new OpenAIEmbeddings(),
similarityThreshold: 0.8,
});
const text = fs.readFileSync("state_of_the_union.txt", "utf8");
const textSplitter = new RecursiveCharacterTextSplitter({ chunkSize: 1000 });
const docs = await textSplitter.createDocuments([text]);
// Create a vector store from the documents.
const vectorStore = await HNSWLib.fromDocuments(docs, new OpenAIEmbeddings());
const retriever = new ContextualCompressionRetriever({
baseCompressor,
baseRetriever: vectorStore.asRetriever(),
});
const retrievedDocs = await retriever.invoke(
"What did the speaker say about Justice Breyer?"
);
console.log({ retrievedDocs });
/*
{
retrievedDocs: [
Document {
pageContent: 'And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. \n' +
'\n' +
'A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. \n' +
'\n' +
'And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system. \n' +
'\n' +
'We can do both. At our border, we’ve installed new technology like cutting-edge scanners to better detect drug smuggling. \n' +
'\n' +
'We’ve set up joint patrols with Mexico and Guatemala to catch more human traffickers. \n' +
'\n' +
'We’re putting in place dedicated immigration judges so families fleeing persecution and violence can have their cases heard faster.',
metadata: [Object]
},
Document {
pageContent: 'In state after state, new laws have been passed, not only to suppress the vote, but to subvert entire elections. \n' +
'\n' +
'We cannot let this happen. \n' +
'\n' +
'Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n' +
'\n' +
'Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n' +
'\n' +
'One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n' +
'\n' +
'And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.',
metadata: [Object]
}
]
}
*/
API Reference:
- RecursiveCharacterTextSplitter from @langchain/textsplitters
- HNSWLib from @langchain/community/vectorstores/hnswlib
- OpenAIEmbeddings from @langchain/openai
- ContextualCompressionRetriever from langchain/retrievers/contextual_compression
- EmbeddingsFilter from langchain/retrievers/document_compressors/embeddings_filter
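If you would rather keep a fixed number of the most similar chunks instead of applying an absolute score cutoff, EmbeddingsFilter also accepts a k parameter (used alongside similarityThreshold in the pipeline example below). A minimal variant of the configuration above might look like this; the variable name is just illustrative:
// Keep only the 3 chunks most similar to the query, with no absolute cutoff.
const topKFilter = new EmbeddingsFilter({
  embeddings: new OpenAIEmbeddings(),
  k: 3,
});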
Stringing compressors and document transformers together
Using the DocumentCompressorPipeline we can also easily combine multiple compressors in sequence. Along with compressors we can add BaseDocumentTransformers to our pipeline, which don't perform any contextual compression but simply perform some transformation on a set of documents. For example, TextSplitters can be used as document transformers to split documents into smaller pieces, and the EmbeddingsFilter can be used to filter out documents based on the similarity of individual chunks to the input query.
Below we create a compressor pipeline by first splitting raw web pages retrieved from the Tavily web search API retriever into smaller chunks, then filtering them based on relevance to the query. The result is smaller chunks that are semantically similar to the input query. This skips the need to add documents to a vector store to perform similarity search, which can be useful for one-off use cases.
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { OpenAIEmbeddings } from "@langchain/openai";
import { ContextualCompressionRetriever } from "langchain/retrievers/contextual_compression";
import { EmbeddingsFilter } from "langchain/retrievers/document_compressors/embeddings_filter";
import { TavilySearchAPIRetriever } from "@langchain/community/retrievers/tavily_search_api";
import { DocumentCompressorPipeline } from "langchain/retrievers/document_compressors";
// Keep at most 5 chunks whose similarity to the query is at least 0.8.
const embeddingsFilter = new EmbeddingsFilter({
embeddings: new OpenAIEmbeddings(),
similarityThreshold: 0.8,
k: 5,
});
const textSplitter = new RecursiveCharacterTextSplitter({
chunkSize: 200,
chunkOverlap: 0,
});
// Transformers run in order: split the documents first, then filter by similarity.
const compressorPipeline = new DocumentCompressorPipeline({
transformers: [textSplitter, embeddingsFilter],
});
const baseRetriever = new TavilySearchAPIRetriever({
includeRawContent: true,
});
const retriever = new ContextualCompressionRetriever({
baseCompressor: compressorPipeline,
baseRetriever,
});
const retrievedDocs = await retriever.invoke(
"What did the speaker say about Justice Breyer in the 2022 State of the Union?"
);
console.log({ retrievedDocs });
/*
{
retrievedDocs: [
Document {
pageContent: 'Justice Stephen Breyer talks to President Joe Biden ahead of the State of the Union address on Tuesday. (jabin botsford/Agence France-Presse/Getty Images)',
metadata: [Object]
},
Document {
pageContent: 'President Biden recognized outgoing US Supreme Court Justice Stephen Breyer during his State of the Union on Tuesday.',
metadata: [Object]
},
Document {
pageContent: 'What we covered here\n' +
'Biden recognized outgoing Supreme Court Justice Breyer during his speech',
metadata: [Object]
},
Document {
pageContent: 'States Supreme Court. Justice Breyer, thank you for your service,” the president said.',
metadata: [Object]
},
Document {
pageContent: 'Court," Biden said. "Justice Breyer, thank you for your service."',
metadata: [Object]
}
]
}
*/
API Reference:
- RecursiveCharacterTextSplitter from @langchain/textsplitters
- OpenAIEmbeddings from @langchain/openai
- ContextualCompressionRetriever from langchain/retrievers/contextual_compression
- EmbeddingsFilter from langchain/retrievers/document_compressors/embeddings_filter
- TavilySearchAPIRetriever from @langchain/community/retrievers/tavily_search_api
- DocumentCompressorPipeline from langchain/retrievers/document_compressors
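A compression retriever can be dropped into a retrieval-augmented generation chain like any other retriever. Below is a hedged sketch using the standard createStuffDocumentsChain and createRetrievalChain helpers; the prompt wording and model name are placeholders, and retriever is assumed to be one of the ContextualCompressionRetriever instances built above:
import { ChatOpenAI } from "@langchain/openai";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { createStuffDocumentsChain } from "langchain/chains/combine_documents";
import { createRetrievalChain } from "langchain/chains/retrieval";

// Placeholder prompt; {context} receives the compressed documents.
const prompt = ChatPromptTemplate.fromTemplate(
  `Answer the question using only the context below.

Context:
{context}

Question: {input}`
);

const llm = new ChatOpenAI({ model: "gpt-4o-mini" }); // placeholder model name

const combineDocsChain = await createStuffDocumentsChain({ llm, prompt });
const ragChain = await createRetrievalChain({
  retriever, // e.g. the ContextualCompressionRetriever from the examples above
  combineDocsChain,
});

const result = await ragChain.invoke({
  input: "What did the speaker say about Justice Breyer?",
});
console.log(result.answer);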
Next steps
You've now learned a few ways to use contextual compression to remove bad data from your results.
See the individual sections for deeper dives on specific retrievers, the broader tutorial on RAG, or this section to learn how to create your own custom retriever over any data source.