如何生成多个查询来检索数据
前提条件
本指南假设您熟悉以下概念
基于距离的向量数据库检索将查询嵌入(表示)在高维空间中,并根据“距离”查找相似的嵌入文档。但是,查询措辞的细微变化,或者如果嵌入不能很好地捕捉数据的语义,则检索可能会产生不同的结果。有时会进行提示工程/调整来手动解决这些问题,但这可能很繁琐。
MultiQueryRetriever
通过使用 LLM 从给定用户输入查询的不同角度生成多个查询,从而自动化提示调整过程。对于每个查询,它检索一组相关文档,并在所有查询中取唯一并集,以获得更大的潜在相关文档集。通过生成关于同一问题的多个视角,MultiQueryRetriever
可以帮助克服基于距离的检索的一些限制,并获得更丰富的结果集。
开始使用
提示
- npm
- yarn
- pnpm
npm i @langchain/anthropic @langchain/cohere
yarn add @langchain/anthropic @langchain/cohere
pnpm add @langchain/anthropic @langchain/cohere
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { CohereEmbeddings } from "@langchain/cohere";
import { MultiQueryRetriever } from "langchain/retrievers/multi_query";
import { ChatAnthropic } from "@langchain/anthropic";
const embeddings = new CohereEmbeddings();
const vectorstore = await MemoryVectorStore.fromTexts(
[
"Buildings are made out of brick",
"Buildings are made out of wood",
"Buildings are made out of stone",
"Cars are made out of metal",
"Cars are made out of plastic",
"mitochondria is the powerhouse of the cell",
"mitochondria is made of lipids",
],
[{ id: 1 }, { id: 2 }, { id: 3 }, { id: 4 }, { id: 5 }],
embeddings
);
const model = new ChatAnthropic({
model: "claude-3-sonnet-20240229",
});
const retriever = MultiQueryRetriever.fromLLM({
llm: model,
retriever: vectorstore.asRetriever(),
});
const query = "What are mitochondria made of?";
const retrievedDocs = await retriever.invoke(query);
/*
Generated queries: What are the components of mitochondria?,What substances comprise the mitochondria organelle? ,What is the molecular composition of mitochondria?
*/
console.log(retrievedDocs);
[
Document {
pageContent: "mitochondria is made of lipids",
metadata: {}
},
Document {
pageContent: "mitochondria is the powerhouse of the cell",
metadata: {}
},
Document {
pageContent: "Buildings are made out of brick",
metadata: { id: 1 }
},
Document {
pageContent: "Buildings are made out of wood",
metadata: { id: 2 }
}
]
自定义
您还可以提供自定义提示来调整生成的问题类型。您还可以传递自定义输出解析器,以解析 LLM 调用的结果并将其拆分为查询列表。
import { LLMChain } from "langchain/chains";
import { pull } from "langchain/hub";
import { BaseOutputParser } from "@langchain/core/output_parsers";
import { PromptTemplate } from "@langchain/core/prompts";
type LineList = {
lines: string[];
};
class LineListOutputParser extends BaseOutputParser<LineList> {
static lc_name() {
return "LineListOutputParser";
}
lc_namespace = ["langchain", "retrievers", "multiquery"];
async parse(text: string): Promise<LineList> {
const startKeyIndex = text.indexOf("<questions>");
const endKeyIndex = text.indexOf("</questions>");
const questionsStartIndex =
startKeyIndex === -1 ? 0 : startKeyIndex + "<questions>".length;
const questionsEndIndex = endKeyIndex === -1 ? text.length : endKeyIndex;
const lines = text
.slice(questionsStartIndex, questionsEndIndex)
.trim()
.split("\n")
.filter((line) => line.trim() !== "");
return { lines };
}
getFormatInstructions(): string {
throw new Error("Not implemented.");
}
}
// Default prompt is available at: https://smith.langchain.com/hub/jacob/multi-vector-retriever-german
const prompt: PromptTemplate = await pull(
"jacob/multi-vector-retriever-german"
);
const vectorstore = await MemoryVectorStore.fromTexts(
[
"Gebäude werden aus Ziegelsteinen hergestellt",
"Gebäude werden aus Holz hergestellt",
"Gebäude werden aus Stein hergestellt",
"Autos werden aus Metall hergestellt",
"Autos werden aus Kunststoff hergestellt",
"Mitochondrien sind die Energiekraftwerke der Zelle",
"Mitochondrien bestehen aus Lipiden",
],
[{ id: 1 }, { id: 2 }, { id: 3 }, { id: 4 }, { id: 5 }],
embeddings
);
const model = new ChatAnthropic({});
const llmChain = new LLMChain({
llm: model,
prompt,
outputParser: new LineListOutputParser(),
});
const retriever = new MultiQueryRetriever({
retriever: vectorstore.asRetriever(),
llmChain,
});
const query = "What are mitochondria made of?";
const retrievedDocs = await retriever.invoke(query);
/*
Generated queries: Was besteht ein Mitochondrium?,Aus welchen Komponenten setzt sich ein Mitochondrium zusammen? ,Welche Moleküle finden sich in einem Mitochondrium?
*/
console.log(retrievedDocs);
[
Document {
pageContent: "Mitochondrien bestehen aus Lipiden",
metadata: {}
},
Document {
pageContent: "Mitochondrien sind die Energiekraftwerke der Zelle",
metadata: {}
},
Document {
pageContent: "Gebäude werden aus Stein hergestellt",
metadata: { id: 3 }
},
Document {
pageContent: "Autos werden aus Metall hergestellt",
metadata: { id: 4 }
},
Document {
pageContent: "Gebäude werden aus Holz hergestellt",
metadata: { id: 2 }
},
Document {
pageContent: "Gebäude werden aus Ziegelsteinen hergestellt",
metadata: { id: 1 }
}
]
下一步
您现在已经学会了如何使用 MultiQueryRetriever
通过自动生成的查询来查询向量存储。
请参阅各个部分,以深入了解特定的检索器、关于 RAG 的 更广泛的教程,或本节,以了解如何 在任何数据源上创建您自己的自定义检索器。