跳至主要内容

如何生成多个查询以检索数据

先决条件

本指南假设你熟悉以下概念

基于距离的向量数据库检索将查询嵌入(表示)到高维空间,并根据“距离”找到相似的嵌入文档。但检索可能会在查询措辞略有变化时或嵌入没有很好地捕获数据的语义时产生不同的结果。提示工程/调整有时用于手动解决这些问题,但这可能很乏味。

The MultiQueryRetriever 通过使用 LLM 从不同角度为给定的用户输入查询生成多个查询来自动化提示调整过程。对于每个查询,它检索一组相关文档,并获取所有查询的唯一联合以获得一组更大的潜在相关文档。通过对同一问题生成多个视角,MultiQueryRetriever 可以帮助克服基于距离的检索的一些限制,并获得更丰富的结果集。

入门

yarn add @langchain/anthropic @langchain/cohere
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { CohereEmbeddings } from "@langchain/cohere";
import { MultiQueryRetriever } from "langchain/retrievers/multi_query";
import { ChatAnthropic } from "@langchain/anthropic";

const embeddings = new CohereEmbeddings();

const vectorstore = await MemoryVectorStore.fromTexts(
[
"Buildings are made out of brick",
"Buildings are made out of wood",
"Buildings are made out of stone",
"Cars are made out of metal",
"Cars are made out of plastic",
"mitochondria is the powerhouse of the cell",
"mitochondria is made of lipids",
],
[{ id: 1 }, { id: 2 }, { id: 3 }, { id: 4 }, { id: 5 }],
embeddings
);

const model = new ChatAnthropic({
model: "claude-3-sonnet-20240229",
});

const retriever = MultiQueryRetriever.fromLLM({
llm: model,
retriever: vectorstore.asRetriever(),
});

const query = "What are mitochondria made of?";
const retrievedDocs = await retriever.invoke(query);

/*
Generated queries: What are the components of mitochondria?,What substances comprise the mitochondria organelle? ,What is the molecular composition of mitochondria?
*/

console.log(retrievedDocs);
[
Document {
pageContent: "mitochondria is made of lipids",
metadata: {}
},
Document {
pageContent: "mitochondria is the powerhouse of the cell",
metadata: {}
},
Document {
pageContent: "Buildings are made out of brick",
metadata: { id: 1 }
},
Document {
pageContent: "Buildings are made out of wood",
metadata: { id: 2 }
}
]

自定义

你也可以提供自定义提示来调整生成哪些类型的问题。你也可以传递自定义输出解析器来解析和拆分 LLM 调用的结果以形成查询列表。

import { LLMChain } from "langchain/chains";
import { pull } from "langchain/hub";
import { BaseOutputParser } from "@langchain/core/output_parsers";
import { PromptTemplate } from "@langchain/core/prompts";

type LineList = {
lines: string[];
};

class LineListOutputParser extends BaseOutputParser<LineList> {
static lc_name() {
return "LineListOutputParser";
}

lc_namespace = ["langchain", "retrievers", "multiquery"];

async parse(text: string): Promise<LineList> {
const startKeyIndex = text.indexOf("<questions>");
const endKeyIndex = text.indexOf("</questions>");
const questionsStartIndex =
startKeyIndex === -1 ? 0 : startKeyIndex + "<questions>".length;
const questionsEndIndex = endKeyIndex === -1 ? text.length : endKeyIndex;
const lines = text
.slice(questionsStartIndex, questionsEndIndex)
.trim()
.split("\n")
.filter((line) => line.trim() !== "");
return { lines };
}

getFormatInstructions(): string {
throw new Error("Not implemented.");
}
}

// Default prompt is available at: https://smith.langchain.com/hub/jacob/multi-vector-retriever-german
const prompt: PromptTemplate = await pull(
"jacob/multi-vector-retriever-german"
);

const vectorstore = await MemoryVectorStore.fromTexts(
[
"Gebäude werden aus Ziegelsteinen hergestellt",
"Gebäude werden aus Holz hergestellt",
"Gebäude werden aus Stein hergestellt",
"Autos werden aus Metall hergestellt",
"Autos werden aus Kunststoff hergestellt",
"Mitochondrien sind die Energiekraftwerke der Zelle",
"Mitochondrien bestehen aus Lipiden",
],
[{ id: 1 }, { id: 2 }, { id: 3 }, { id: 4 }, { id: 5 }],
embeddings
);
const model = new ChatAnthropic({});
const llmChain = new LLMChain({
llm: model,
prompt,
outputParser: new LineListOutputParser(),
});
const retriever = new MultiQueryRetriever({
retriever: vectorstore.asRetriever(),
llmChain,
});

const query = "What are mitochondria made of?";
const retrievedDocs = await retriever.invoke(query);

/*
Generated queries: Was besteht ein Mitochondrium?,Aus welchen Komponenten setzt sich ein Mitochondrium zusammen? ,Welche Moleküle finden sich in einem Mitochondrium?
*/

console.log(retrievedDocs);
[
Document {
pageContent: "Mitochondrien bestehen aus Lipiden",
metadata: {}
},
Document {
pageContent: "Mitochondrien sind die Energiekraftwerke der Zelle",
metadata: {}
},
Document {
pageContent: "Gebäude werden aus Stein hergestellt",
metadata: { id: 3 }
},
Document {
pageContent: "Autos werden aus Metall hergestellt",
metadata: { id: 4 }
},
Document {
pageContent: "Gebäude werden aus Holz hergestellt",
metadata: { id: 2 }
},
Document {
pageContent: "Gebäude werden aus Ziegelsteinen hergestellt",
metadata: { id: 1 }
}
]

下一步

你现在已经学习了如何使用 MultiQueryRetriever 通过自动生成的查询来查询向量存储。

查看有关特定检索器的各个部分,有关 RAG 的更广泛教程,或此部分以了解如何在任何数据源上创建你自己的自定义检索器


本页是否有帮助?


你也可以在 GitHub 上留下详细的反馈 GitHub.