How to generate multiple embeddings per document
Prerequisites
This guide assumes familiarity with the following concepts: vector stores, embeddings, retrievers, and text splitters.
Embedding different representations of an original document, then returning the original document whenever any of those representations results in a search hit, can allow you to tune and improve your retrieval performance. LangChain has a base MultiVectorRetriever designed to do just this!
A lot of the complexity lies in how to create the multiple vectors per document. This guide covers some of the common ways to create those vectors and use the MultiVectorRetriever.
Some methods to create multiple vectors per document include:
- Smaller chunks: split a document into smaller chunks, and embed those (e.g., the ParentDocumentRetriever)
- Summary: create a summary for each document, and embed that along with (or instead of) the document
- Hypothetical questions: create hypothetical questions that each document would be appropriate to answer, and embed those along with (or instead of) the document
Note that this also enables another method of adding embeddings: manually. This is nice because you can explicitly add questions or queries that should lead to a document being retrieved, giving you more control; a sketch of this manual approach follows.
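The sketch below assumes a retriever, idKey, and docIds set up as in the full examples later in this guide; the query text itself is hypothetical:
import { Document } from "@langchain/core/documents";

// Hypothetical handwritten query that should route to the first original document.
const manualQueryDoc = new Document({
  pageContent: "Who did the president nominate to the Supreme Court?",
  metadata: { [idKey]: docIds[0] }, // tag it with the parent document's id
});

// Only the handwritten query gets embedded; a hit on it returns the full parent document.
await retriever.vectorstore.addDocuments([manualQueryDoc]);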
Smaller chunks
It can often be useful to retrieve larger chunks of information, but embed smaller chunks. This allows the embeddings to capture the semantic meaning as closely as possible, while still passing as much context as possible downstream. Note that this is what the ParentDocumentRetriever does; here we show what is going on under the hood, and a sketch of the packaged equivalent appears after the example.
Tip: install the required integration packages with npm, Yarn, or pnpm:
npm install @langchain/openai @langchain/community
yarn add @langchain/openai @langchain/community
pnpm add @langchain/openai @langchain/community
import * as uuid from "uuid";
import { MultiVectorRetriever } from "langchain/retrievers/multi_vector";
import { FaissStore } from "@langchain/community/vectorstores/faiss";
import { OpenAIEmbeddings } from "@langchain/openai";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { InMemoryStore } from "@langchain/core/stores";
import { TextLoader } from "langchain/document_loaders/fs/text";
import { Document } from "@langchain/core/documents";
const textLoader = new TextLoader("../examples/state_of_the_union.txt");
const parentDocuments = await textLoader.load();
const splitter = new RecursiveCharacterTextSplitter({
chunkSize: 10000,
chunkOverlap: 20,
});
const docs = await splitter.splitDocuments(parentDocuments);
const idKey = "doc_id";
const docIds = docs.map((_) => uuid.v4());
const childSplitter = new RecursiveCharacterTextSplitter({
chunkSize: 400,
chunkOverlap: 0,
});
const subDocs = [];
for (let i = 0; i < docs.length; i += 1) {
const childDocs = await childSplitter.splitDocuments([docs[i]]);
const taggedChildDocs = childDocs.map((childDoc) => {
// eslint-disable-next-line no-param-reassign
childDoc.metadata[idKey] = docIds[i];
return childDoc;
});
subDocs.push(...taggedChildDocs);
}
// The byteStore to use to store the original chunks
const byteStore = new InMemoryStore<Uint8Array>();
// The vectorstore to use to index the child chunks
const vectorstore = await FaissStore.fromDocuments(
subDocs,
new OpenAIEmbeddings()
);
const retriever = new MultiVectorRetriever({
vectorstore,
byteStore,
idKey,
// Optional `k` parameter to search for more child documents in VectorStore.
// Note that this does not exactly correspond to the number of final (parent) documents
// retrieved, as multiple child documents can point to the same parent.
childK: 20,
// Optional `k` parameter to limit number of final, parent documents returned from this
// retriever and sent to LLM. This is an upper-bound, and the final count may be lower than this.
parentK: 5,
});
const keyValuePairs: [string, Document][] = docs.map((originalDoc, i) => [
docIds[i],
originalDoc,
]);
// Use the retriever to add the original chunks to the document store
await retriever.docstore.mset(keyValuePairs);
// Vectorstore alone retrieves the small chunks
const vectorstoreResult = await retriever.vectorstore.similaritySearch(
"justice breyer"
);
console.log(vectorstoreResult[0].pageContent.length);
/*
390
*/
// Retriever returns larger result
const retrieverResult = await retriever.invoke("justice breyer");
console.log(retrieverResult[0].pageContent.length);
/*
9770
*/
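As noted above, the ParentDocumentRetriever packages this same split-small, retrieve-big pattern. Here is a minimal sketch of the equivalent setup, assuming its constructor accepts the same vectorstore/byteStore pair and childK/parentK options as MultiVectorRetriever; double-check the exact option names against the current API docs:
import { ParentDocumentRetriever } from "langchain/retrievers/parent_document";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { InMemoryStore } from "@langchain/core/stores";
import { OpenAIEmbeddings } from "@langchain/openai";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { TextLoader } from "langchain/document_loaders/fs/text";

const retriever = new ParentDocumentRetriever({
  // Child chunks are embedded into this vector store...
  vectorstore: new MemoryVectorStore(new OpenAIEmbeddings()),
  // ...while the larger parent chunks live in this byte store.
  byteStore: new InMemoryStore<Uint8Array>(),
  parentSplitter: new RecursiveCharacterTextSplitter({
    chunkSize: 10000,
    chunkOverlap: 20,
  }),
  childSplitter: new RecursiveCharacterTextSplitter({
    chunkSize: 400,
    chunkOverlap: 0,
  }),
  childK: 20,
  parentK: 5,
});

// Splitting, id tagging, and storage are all handled internally.
const parentDocuments = await new TextLoader(
  "../examples/state_of_the_union.txt"
).load();
await retriever.addDocuments(parentDocuments);
const retrievedDocs = await retriever.invoke("justice breyer");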
API reference
- MultiVectorRetriever from langchain/retrievers/multi_vector
- FaissStore from @langchain/community/vectorstores/faiss
- OpenAIEmbeddings from @langchain/openai
- RecursiveCharacterTextSplitter from @langchain/textsplitters
- InMemoryStore from @langchain/core/stores
- TextLoader from langchain/document_loaders/fs/text
- Document from @langchain/core/documents
Summary
Oftentimes, a summary may be able to distill more accurately what a chunk is about, leading to better retrieval. Here we show how to create summaries, and then embed those.
import * as uuid from "uuid";
import { ChatOpenAI, OpenAIEmbeddings } from "@langchain/openai";
import { MultiVectorRetriever } from "langchain/retrievers/multi_vector";
import { FaissStore } from "@langchain/community/vectorstores/faiss";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { InMemoryStore } from "@langchain/core/stores";
import { TextLoader } from "langchain/document_loaders/fs/text";
import { PromptTemplate } from "@langchain/core/prompts";
import { StringOutputParser } from "@langchain/core/output_parsers";
import { RunnableSequence } from "@langchain/core/runnables";
import { Document } from "@langchain/core/documents";
const textLoader = new TextLoader("../examples/state_of_the_union.txt");
const parentDocuments = await textLoader.load();
const splitter = new RecursiveCharacterTextSplitter({
chunkSize: 10000,
chunkOverlap: 20,
});
const docs = await splitter.splitDocuments(parentDocuments);
const chain = RunnableSequence.from([
{ content: (doc: Document) => doc.pageContent },
PromptTemplate.fromTemplate(`Summarize the following document:\n\n{content}`),
new ChatOpenAI({
maxRetries: 0,
}),
new StringOutputParser(),
]);
const summaries = await chain.batch(docs, {
maxConcurrency: 5,
});
const idKey = "doc_id";
const docIds = docs.map((_) => uuid.v4());
const summaryDocs = summaries.map((summary, i) => {
const summaryDoc = new Document({
pageContent: summary,
metadata: {
[idKey]: docIds[i],
},
});
return summaryDoc;
});
// The byteStore to use to store the original chunks
const byteStore = new InMemoryStore<Uint8Array>();
// The vectorstore to use to index the child chunks
const vectorstore = await FaissStore.fromDocuments(
summaryDocs,
new OpenAIEmbeddings()
);
const retriever = new MultiVectorRetriever({
vectorstore,
byteStore,
idKey,
});
const keyValuePairs: [string, Document][] = docs.map((originalDoc, i) => [
docIds[i],
originalDoc,
]);
// Use the retriever to add the original chunks to the document store
await retriever.docstore.mset(keyValuePairs);
// We could also add the original chunks to the vectorstore if we wish
// const taggedOriginalDocs = docs.map((doc, i) => {
// doc.metadata[idKey] = docIds[i];
// return doc;
// });
// retriever.vectorstore.addDocuments(taggedOriginalDocs);
// Vectorstore alone retrieves the small chunks
const vectorstoreResult = await retriever.vectorstore.similaritySearch(
"justice breyer"
);
console.log(vectorstoreResult[0].pageContent.length);
/*
1118
*/
// Retriever returns larger result
const retrieverResult = await retriever.invoke("justice breyer");
console.log(retrieverResult[0].pageContent.length);
/*
9770
*/
API reference
- ChatOpenAI from @langchain/openai
- OpenAIEmbeddings from @langchain/openai
- MultiVectorRetriever from langchain/retrievers/multi_vector
- FaissStore from @langchain/community/vectorstores/faiss
- RecursiveCharacterTextSplitter from @langchain/textsplitters
- InMemoryStore from @langchain/core/stores
- TextLoader from langchain/document_loaders/fs/text
- PromptTemplate from @langchain/core/prompts
- StringOutputParser from @langchain/core/output_parsers
- RunnableSequence from @langchain/core/runnables
- Document from @langchain/core/documents
Hypothetical queries
An LLM can also be used to generate a list of hypothetical questions that could be asked of a particular document. These questions can then be embedded and used to retrieve the original document.
import * as uuid from "uuid";
import { ChatOpenAI, OpenAIEmbeddings } from "@langchain/openai";
import { MultiVectorRetriever } from "langchain/retrievers/multi_vector";
import { FaissStore } from "@langchain/community/vectorstores/faiss";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { InMemoryStore } from "@langchain/core/stores";
import { TextLoader } from "langchain/document_loaders/fs/text";
import { PromptTemplate } from "@langchain/core/prompts";
import { RunnableSequence } from "@langchain/core/runnables";
import { Document } from "@langchain/core/documents";
import { JsonKeyOutputFunctionsParser } from "@langchain/core/output_parsers/openai_functions";
const textLoader = new TextLoader("../examples/state_of_the_union.txt");
const parentDocuments = await textLoader.load();
const splitter = new RecursiveCharacterTextSplitter({
chunkSize: 10000,
chunkOverlap: 20,
});
const docs = await splitter.splitDocuments(parentDocuments);
const functionsSchema = [
{
name: "hypothetical_questions",
description: "Generate hypothetical questions",
parameters: {
type: "object",
properties: {
questions: {
type: "array",
items: {
type: "string",
},
},
},
required: ["questions"],
},
},
];
const functionCallingModel = new ChatOpenAI({
maxRetries: 0,
model: "gpt-4",
}).bind({
functions: functionsSchema,
function_call: { name: "hypothetical_questions" },
});
const chain = RunnableSequence.from([
{ content: (doc: Document) => doc.pageContent },
PromptTemplate.fromTemplate(
`Generate a list of 3 hypothetical questions that the below document could be used to answer:\n\n{content}`
),
functionCallingModel,
new JsonKeyOutputFunctionsParser<string[]>({ attrName: "questions" }),
]);
const hypotheticalQuestions = await chain.batch(docs, {
maxConcurrency: 5,
});
const idKey = "doc_id";
const docIds = docs.map((_) => uuid.v4());
const hypotheticalQuestionDocs = hypotheticalQuestions
.map((questionArray, i) => {
const questionDocuments = questionArray.map((question) => {
const questionDocument = new Document({
pageContent: question,
metadata: {
[idKey]: docIds[i],
},
});
return questionDocument;
});
return questionDocuments;
})
.flat();
// The byteStore to use to store the original chunks
const byteStore = new InMemoryStore<Uint8Array>();
// The vectorstore to use to index the child chunks
const vectorstore = await FaissStore.fromDocuments(
hypotheticalQuestionDocs,
new OpenAIEmbeddings()
);
const retriever = new MultiVectorRetriever({
vectorstore,
byteStore,
idKey,
});
const keyValuePairs: [string, Document][] = docs.map((originalDoc, i) => [
docIds[i],
originalDoc,
]);
// Use the retriever to add the original chunks to the document store
await retriever.docstore.mset(keyValuePairs);
// We could also add the original chunks to the vectorstore if we wish
// const taggedOriginalDocs = docs.map((doc, i) => {
// doc.metadata[idKey] = docIds[i];
// return doc;
// });
// retriever.vectorstore.addDocuments(taggedOriginalDocs);
// Vectorstore alone retrieves the small chunks
const vectorstoreResult = await retriever.vectorstore.similaritySearch(
"justice breyer"
);
console.log(vectorstoreResult[0].pageContent);
/*
"What measures will be taken to crack down on corporations overcharging American businesses and consumers?"
*/
// Retriever returns larger result
const retrieverResult = await retriever.invoke("justice breyer");
console.log(retrieverResult[0].pageContent.length);
/*
9770
*/
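Note that binding functions / function_call as above uses OpenAI's legacy function-calling API. Below is a minimal sketch of the same question-generation step using the newer withStructuredOutput method instead; the zod schema and the model name are assumptions for illustration, and the other imports come from the example above:
import { z } from "zod";

// Sketch: ask the model for structured output rather than binding legacy functions.
const structuredModel = new ChatOpenAI({
  model: "gpt-4o-mini",
  maxRetries: 0,
}).withStructuredOutput(
  z.object({
    questions: z.array(z.string()),
  }),
  { name: "hypothetical_questions" }
);

const altChain = RunnableSequence.from([
  { content: (doc: Document) => doc.pageContent },
  PromptTemplate.fromTemplate(
    `Generate a list of 3 hypothetical questions that the below document could be used to answer:\n\n{content}`
  ),
  structuredModel,
  // Pull the array of questions out of the parsed object.
  (output: { questions: string[] }) => output.questions,
]);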
API reference
- ChatOpenAI from @langchain/openai
- OpenAIEmbeddings from @langchain/openai
- MultiVectorRetriever from langchain/retrievers/multi_vector
- FaissStore from @langchain/community/vectorstores/faiss
- RecursiveCharacterTextSplitter from @langchain/textsplitters
- InMemoryStore from @langchain/core/stores
- TextLoader from langchain/document_loaders/fs/text
- PromptTemplate from @langchain/core/prompts
- RunnableSequence from @langchain/core/runnables
- Document from @langchain/core/documents
- JsonKeyOutputFunctionsParser from @langchain/core/output_parsers/openai_functions
Next steps
You've now learned a few ways to generate multiple embeddings per document.
Next, check out the individual sections for deeper dives on particular retriever techniques, the broader tutorial on RAG, or this section to learn how to create your own custom retriever over any data source.