How to do retrieval with contextual compression
This guide assumes familiarity with retrievers and retrieval-augmented generation (RAG).
One challenge with retrieval is that you usually don't know the specific queries your document storage system will face when you ingest data into it. This means that the information most relevant to a query may be buried in a document with a lot of irrelevant text. Passing that full document through your application can lead to more expensive LLM calls and poorer responses.
Contextual compression is meant to fix this. The idea is simple: instead of immediately returning retrieved documents as-is, you can compress them using the context of the given query, so that only the relevant information is returned. "Compressing" here refers to both compressing the contents of an individual document and filtering out documents wholesale.
To use the Contextual Compression Retriever, you'll need:
- a base retriever
- a document compressor
The Contextual Compression Retriever passes queries to the base retriever, takes the initial documents, and passes them through the document compressor. The document compressor takes a list of documents and shortens it by reducing the contents of documents or dropping documents altogether.
Using a vanilla vector store retriever
Let's start by initializing a simple vector store retriever and storing the 2022 State of the Union speech (in chunks). Given an example question, our retriever returns one or two relevant docs along with a few irrelevant ones, and even the relevant docs contain a lot of irrelevant information. To extract all the context we can, we use an LLMChainExtractor, which will iterate over the initially returned documents and extract from each only the content that is relevant to the query.
- npm: npm install @langchain/openai @langchain/community @langchain/textsplitters langchain
- Yarn: yarn add @langchain/openai @langchain/community @langchain/textsplitters langchain
- pnpm: pnpm add @langchain/openai @langchain/community @langchain/textsplitters langchain
import * as fs from "fs";
import { OpenAI, OpenAIEmbeddings } from "@langchain/openai";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { HNSWLib } from "@langchain/community/vectorstores/hnswlib";
import { ContextualCompressionRetriever } from "langchain/retrievers/contextual_compression";
import { LLMChainExtractor } from "langchain/retrievers/document_compressors/chain_extract";
const model = new OpenAI({
model: "gpt-3.5-turbo-instruct",
});
const baseCompressor = LLMChainExtractor.fromLLM(model);
const text = fs.readFileSync("state_of_the_union.txt", "utf8");
const textSplitter = new RecursiveCharacterTextSplitter({ chunkSize: 1000 });
const docs = await textSplitter.createDocuments([text]);
// Create a vector store from the documents.
const vectorStore = await HNSWLib.fromDocuments(docs, new OpenAIEmbeddings());
const retriever = new ContextualCompressionRetriever({
baseCompressor,
baseRetriever: vectorStore.asRetriever(),
});
const retrievedDocs = await retriever.invoke(
"What did the speaker say about Justice Breyer?"
);
console.log({ retrievedDocs });
/*
{
retrievedDocs: [
Document {
pageContent: 'One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.',
metadata: [Object]
},
Document {
pageContent: '"Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service."',
metadata: [Object]
},
Document {
pageContent: 'The onslaught of state laws targeting transgender Americans and their families is wrong.',
metadata: [Object]
}
]
}
*/
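As with any retriever, the compressed results can then be fed into a prompt and model. Here's a minimal sketch (not part of the original example; the prompt wording is just an illustration) that stuffs the compressed documents into a prompt for the same model defined above:
import { PromptTemplate } from "@langchain/core/prompts";
// Minimal sketch: stuff the compressed documents into a prompt and have the
// `model` defined above answer from them. The prompt text is illustrative only.
const prompt = PromptTemplate.fromTemplate(
  "Answer the question using only the following context:\n\n{context}\n\nQuestion: {question}"
);
const question = "What did the speaker say about Justice Breyer?";
const compressedDocs = await retriever.invoke(question);
const context = compressedDocs.map((doc) => doc.pageContent).join("\n\n");
const answer = await model.invoke(await prompt.format({ context, question }));
console.log(answer);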
API Reference
- OpenAI from @langchain/openai
- OpenAIEmbeddings from @langchain/openai
- RecursiveCharacterTextSplitter from @langchain/textsplitters
- HNSWLib from @langchain/community/vectorstores/hnswlib
- ContextualCompressionRetriever from langchain/retrievers/contextual_compression
- LLMChainExtractor from langchain/retrievers/document_compressors/chain_extract
EmbeddingsFilter
Making an extra LLM call over each retrieved document is expensive and slow. The EmbeddingsFilter provides a cheaper and faster option by embedding the documents and query and only returning those documents which have sufficiently similar embeddings to the query.
This is most useful for non-vector store retrievers where we may not have control over the size of the returned chunks, or as part of a pipeline, as outlined below.
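Under the hood, the filtering comes down to comparing embedding similarity against the threshold. Here's a rough standalone sketch of that idea using a manual cosine similarity check (an illustration of the concept only, not the library's actual implementation; the example texts are made up):
import { OpenAIEmbeddings } from "@langchain/openai";
// Illustration only: keep a text when its embedding is "similar enough" to the
// query embedding, which is the idea behind EmbeddingsFilter's similarityThreshold.
const embeddings = new OpenAIEmbeddings();
const cosineSimilarity = (a: number[], b: number[]) => {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return dot / (norm(a) * norm(b));
};
const query = "What did the speaker say about Justice Breyer?";
const texts = [
  "Justice Breyer, thank you for your service.",
  "We've set up joint patrols with Mexico and Guatemala.",
];
const queryVector = await embeddings.embedQuery(query);
const textVectors = await embeddings.embedDocuments(texts);
const keptTexts = texts.filter(
  (_, i) => cosineSimilarity(queryVector, textVectors[i]) >= 0.8
);
console.log(keptTexts);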
Here's a full example with the ContextualCompressionRetriever:
import * as fs from "fs";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { HNSWLib } from "@langchain/community/vectorstores/hnswlib";
import { OpenAIEmbeddings } from "@langchain/openai";
import { ContextualCompressionRetriever } from "langchain/retrievers/contextual_compression";
import { EmbeddingsFilter } from "langchain/retrievers/document_compressors/embeddings_filter";
const baseCompressor = new EmbeddingsFilter({
embeddings: new OpenAIEmbeddings(),
similarityThreshold: 0.8,
});
const text = fs.readFileSync("state_of_the_union.txt", "utf8");
const textSplitter = new RecursiveCharacterTextSplitter({ chunkSize: 1000 });
const docs = await textSplitter.createDocuments([text]);
// Create a vector store from the documents.
const vectorStore = await HNSWLib.fromDocuments(docs, new OpenAIEmbeddings());
const retriever = new ContextualCompressionRetriever({
baseCompressor,
baseRetriever: vectorStore.asRetriever(),
});
const retrievedDocs = await retriever.invoke(
"What did the speaker say about Justice Breyer?"
);
console.log({ retrievedDocs });
/*
{
retrievedDocs: [
Document {
pageContent: 'And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. \n' +
'\n' +
'A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. \n' +
'\n' +
'And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system. \n' +
'\n' +
'We can do both. At our border, we’ve installed new technology like cutting-edge scanners to better detect drug smuggling. \n' +
'\n' +
'We’ve set up joint patrols with Mexico and Guatemala to catch more human traffickers. \n' +
'\n' +
'We’re putting in place dedicated immigration judges so families fleeing persecution and violence can have their cases heard faster.',
metadata: [Object]
},
Document {
pageContent: 'In state after state, new laws have been passed, not only to suppress the vote, but to subvert entire elections. \n' +
'\n' +
'We cannot let this happen. \n' +
'\n' +
'Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n' +
'\n' +
'Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n' +
'\n' +
'One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n' +
'\n' +
'And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.',
metadata: [Object]
}
]
}
*/
API Reference
- RecursiveCharacterTextSplitter from @langchain/textsplitters
- HNSWLib from @langchain/community/vectorstores/hnswlib
- OpenAIEmbeddings from @langchain/openai
- ContextualCompressionRetriever from langchain/retrievers/contextual_compression
- EmbeddingsFilter from langchain/retrievers/document_compressors/embeddings_filter
Stringing compressors and document transformers together
Using the DocumentCompressorPipeline, we can also easily combine multiple compressors in sequence. Along with compressors we can add BaseDocumentTransformers to our pipeline, which don't perform any contextual compression but simply apply some transformation to a set of documents. For example, TextSplitters can be used as document transformers to split documents into smaller pieces, and the EmbeddingsFilter can be used to filter out documents based on the similarity of the individual chunks to the input query.
Below, we create a compressor pipeline by first splitting raw web pages retrieved from the Tavily web search API retriever into smaller chunks, then filtering them based on relevance to the query. The result is smaller chunks that are semantically similar to the input query. This skips the need to add documents to a vector store to perform similarity search, which can be useful for one-off use cases.
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { OpenAIEmbeddings } from "@langchain/openai";
import { ContextualCompressionRetriever } from "langchain/retrievers/contextual_compression";
import { EmbeddingsFilter } from "langchain/retrievers/document_compressors/embeddings_filter";
import { TavilySearchAPIRetriever } from "@langchain/community/retrievers/tavily_search_api";
import { DocumentCompressorPipeline } from "langchain/retrievers/document_compressors";
const embeddingsFilter = new EmbeddingsFilter({
embeddings: new OpenAIEmbeddings(),
similarityThreshold: 0.8,
k: 5,
});
const textSplitter = new RecursiveCharacterTextSplitter({
chunkSize: 200,
chunkOverlap: 0,
});
const compressorPipeline = new DocumentCompressorPipeline({
transformers: [textSplitter, embeddingsFilter],
});
const baseRetriever = new TavilySearchAPIRetriever({
includeRawContent: true,
});
const retriever = new ContextualCompressionRetriever({
baseCompressor: compressorPipeline,
baseRetriever,
});
const retrievedDocs = await retriever.invoke(
"What did the speaker say about Justice Breyer in the 2022 State of the Union?"
);
console.log({ retrievedDocs });
/*
{
retrievedDocs: [
Document {
pageContent: 'Justice Stephen Breyer talks to President Joe Biden ahead of the State of the Union address on Tuesday. (jabin botsford/Agence France-Presse/Getty Images)',
metadata: [Object]
},
Document {
pageContent: 'President Biden recognized outgoing US Supreme Court Justice Stephen Breyer during his State of the Union on Tuesday.',
metadata: [Object]
},
Document {
pageContent: 'What we covered here\n' +
'Biden recognized outgoing Supreme Court Justice Breyer during his speech',
metadata: [Object]
},
Document {
pageContent: 'States Supreme Court. Justice Breyer, thank you for your service,” the president said.',
metadata: [Object]
},
Document {
pageContent: 'Court," Biden said. "Justice Breyer, thank you for your service."',
metadata: [Object]
}
]
}
*/
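A couple of notes on the pipeline configuration above (our reading of the code, not something spelled out elsewhere in this guide): the transformers run in order, so the text splitter first breaks each raw page into roughly 200-character chunks and the EmbeddingsFilter then scores those chunks rather than whole pages; and because the filter sets both k: 5 and similarityThreshold: 0.8, at most the five most similar chunks that also clear the threshold are returned.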
API Reference
- RecursiveCharacterTextSplitter from @langchain/textsplitters
- OpenAIEmbeddings from @langchain/openai
- ContextualCompressionRetriever from langchain/retrievers/contextual_compression
- EmbeddingsFilter from langchain/retrievers/document_compressors/embeddings_filter
- TavilySearchAPIRetriever from @langchain/community/retrievers/tavily_search_api
- DocumentCompressorPipeline from langchain/retrievers/document_compressors
Next steps
You've now learned a few ways to use contextual compression to remove bad data from your results.
See the individual sections for deeper dives on specific retrievers, the broader tutorial on RAG, or this section to learn how to create your own custom retriever over any data source.