How to generate multiple embeddings per document

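Prerequisites

This guide assumes familiarity with the concepts used below: retrievers, text splitters, vector stores, and embeddings.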

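Embedding several different representations of a document, and returning the original document whenever any of those representations produces a search hit, can help you tune and improve retrieval quality. LangChain provides a base MultiVectorRetriever designed for exactly this.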

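Much of the complexity lies in how to create multiple vectors per document. This guide covers some common ways to create those vectors and use them with the MultiVectorRetriever.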
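
To make the idea concrete, here is a minimal, self-contained sketch of the lookup flow: several vectors point at one parent ID, and a hit on any of them returns that parent once. The types and the `retrieveParents` helper are hypothetical illustrations, not LangChain's API, and the two-dimensional vectors stand in for real embeddings.

```typescript
// Hypothetical minimal types for illustration; these are not LangChain classes.
type Vector = number[];

interface ChildEntry {
  vector: Vector; // embedding of one representation (chunk, summary, question, ...)
  parentId: string; // ID of the original document this representation belongs to
}

// Cosine similarity between two non-zero vectors of equal length.
function cosineSimilarity(a: Vector, b: Vector): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i += 1) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank child vectors against the query, then map the top k hits to unique parents.
function retrieveParents(
  query: Vector,
  index: ChildEntry[],
  docstore: Record<string, string>,
  k: number
): string[] {
  const ranked = index
    .slice()
    .sort(
      (x, y) =>
        cosineSimilarity(query, y.vector) - cosineSimilarity(query, x.vector)
    )
    .slice(0, k);

  const seenIds: string[] = [];
  const parents: string[] = [];
  for (const hit of ranked) {
    if (seenIds.indexOf(hit.parentId) !== -1) continue;
    seenIds.push(hit.parentId);
    const parent = docstore[hit.parentId];
    if (parent !== undefined) parents.push(parent);
  }
  return parents;
}

// Document A has two representations; document B has one.
const childIndex: ChildEntry[] = [
  { vector: [1, 0], parentId: "a" },
  { vector: [0.9, 0.1], parentId: "a" },
  { vector: [0, 1], parentId: "b" },
];
const parentStore: Record<string, string> = {
  a: "Full text of document A",
  b: "Full text of document B",
};

// Both top hits belong to document A, so it is returned only once.
console.log(retrieveParents([1, 0], childIndex, parentStore, 2));
```

The vector index only ever stores the small representations. The full documents live in a separate key-value store and are looked up by ID after the search.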

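Some ways to create multiple vectors per document include:

  • Smaller chunks: split a document into smaller chunks and embed those (this is what ParentDocumentRetriever does).
  • Summary: create a summary for each document and embed it alongside, or instead of, the document.
  • Hypothetical questions: create hypothetical questions that each document would be well suited to answer, and embed those alongside, or instead of, the document.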

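Note that this also enables another way of adding embeddings: manually. This is useful because you can explicitly add questions or queries that should lead to a document being retrieved, which gives you more control.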
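
As a rough sketch of the manual approach, the helper below tags hand-written queries with the ID of the document they should lead to. `TaggedText` and `tagQueries` are hypothetical names used only for illustration, the sample queries are made up, and `doc_id` is assumed to be the metadata key that links a representation to its parent.

```typescript
// Hypothetical shape mirroring a document: text plus metadata. Not a LangChain class.
interface TaggedText {
  pageContent: string;
  metadata: Record<string, string>;
}

// Tag each hand-written query with the ID of the document it should retrieve.
// Blank entries are dropped so that empty strings are never embedded.
function tagQueries(
  queries: string[],
  parentId: string,
  idKey: string = "doc_id"
): TaggedText[] {
  return queries
    .map((query) => query.trim())
    .filter((query) => query.length > 0)
    .map((query) => ({ pageContent: query, metadata: { [idKey]: parentId } }));
}

const manualQueries = tagQueries(
  ["Who was nominated to the Supreme Court?", "  justice breyer  ", ""],
  "parent-1"
);
console.log(manualQueries.length); // 2
```

Objects shaped like these would then be embedded into the vector store, with the parent document stored under the same ID in the document store.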

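Smaller chunks

It is often useful to retrieve larger chunks of information while embedding smaller ones. Small chunks let the embeddings capture semantic meaning as precisely as possible, while the larger parent chunk passes as much context as possible downstream. Note: this is what ParentDocumentRetriever does. Here we show what happens under the hood.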

npm install @langchain/openai @langchain/community @langchain/core @langchain/textsplitters langchain uuid faiss-node

import * as uuid from "uuid";

import { MultiVectorRetriever } from "langchain/retrievers/multi_vector";
import { FaissStore } from "@langchain/community/vectorstores/faiss";
import { OpenAIEmbeddings } from "@langchain/openai";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { InMemoryStore } from "@langchain/core/stores";
import { TextLoader } from "langchain/document_loaders/fs/text";
import { Document } from "@langchain/core/documents";

const textLoader = new TextLoader("../examples/state_of_the_union.txt");
const parentDocuments = await textLoader.load();

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 10000,
  chunkOverlap: 20,
});

const docs = await splitter.splitDocuments(parentDocuments);

const idKey = "doc_id";
const docIds = docs.map((_) => uuid.v4());

const childSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 400,
  chunkOverlap: 0,
});

const subDocs = [];
for (let i = 0; i < docs.length; i += 1) {
  const childDocs = await childSplitter.splitDocuments([docs[i]]);
  const taggedChildDocs = childDocs.map((childDoc) => {
    // eslint-disable-next-line no-param-reassign
    childDoc.metadata[idKey] = docIds[i];
    return childDoc;
  });
  subDocs.push(...taggedChildDocs);
}

// The byteStore to use to store the original chunks
const byteStore = new InMemoryStore<Uint8Array>();

// The vectorstore to use to index the child chunks
const vectorstore = await FaissStore.fromDocuments(
  subDocs,
  new OpenAIEmbeddings()
);

const retriever = new MultiVectorRetriever({
  vectorstore,
  byteStore,
  idKey,
  // Optional `k` parameter to search for more child documents in VectorStore.
  // Note that this does not exactly correspond to the number of final (parent) documents
  // retrieved, as multiple child documents can point to the same parent.
  childK: 20,
  // Optional `k` parameter to limit number of final, parent documents returned from this
  // retriever and sent to LLM. This is an upper-bound, and the final count may be lower than this.
  parentK: 5,
});

const keyValuePairs: [string, Document][] = docs.map((originalDoc, i) => [
  docIds[i],
  originalDoc,
]);

// Use the retriever to add the original chunks to the document store
await retriever.docstore.mset(keyValuePairs);

// Vectorstore alone retrieves the small chunks
const vectorstoreResult = await retriever.vectorstore.similaritySearch(
  "justice breyer"
);
console.log(vectorstoreResult[0].pageContent.length);
/*
390
*/

// Retriever returns larger result
const retrieverResult = await retriever.invoke("justice breyer");
console.log(retrieverResult[0].pageContent.length);
/*
9770
*/
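
The comments on `childK` and `parentK` in the example above describe two separate limits: one on how many child hits are fetched, and one on how many parents are returned after duplicates collapse. The sketch below illustrates that arithmetic on plain ID strings. `selectParentIds` is a hypothetical helper written for this explanation and should not be read as the library's implementation.

```typescript
// Hypothetical helper, not part of LangChain: applies the two limits in order.
// `rankedChildParentIds` holds the parent ID of each child hit, best match first.
function selectParentIds(
  rankedChildParentIds: string[],
  childK: number,
  parentK: number
): string[] {
  const uniqueParentIds: string[] = [];
  // Step 1: keep only the top `childK` child hits.
  for (const parentId of rankedChildParentIds.slice(0, childK)) {
    // Step 2: several children can share a parent, so keep each parent once.
    if (uniqueParentIds.indexOf(parentId) === -1) {
      uniqueParentIds.push(parentId);
    }
  }
  // Step 3: cap the number of parents that are returned.
  return uniqueParentIds.slice(0, parentK);
}

const rankedHits = ["p1", "p1", "p2", "p1", "p3", "p2"];

// Four child hits collapse to two parents, fewer than the cap of five.
console.log(selectParentIds(rankedHits, 4, 5)); // [ 'p1', 'p2' ]

// Six child hits cover three parents, but the cap keeps only the first two.
console.log(selectParentIds(rankedHits, 6, 2)); // [ 'p1', 'p2' ]
```

This is why a larger `childK` tends to be paired with a smaller `parentK`: fetching many small chunks widens the search, while the cap keeps the amount of parent text sent downstream bounded.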


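Summary

A summary can often distill what a chunk is about more accurately than the raw text, which leads to better retrieval. Here we show how to create summaries and then embed them.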

import * as uuid from "uuid";

import { ChatOpenAI, OpenAIEmbeddings } from "@langchain/openai";
import { MultiVectorRetriever } from "langchain/retrievers/multi_vector";
import { FaissStore } from "@langchain/community/vectorstores/faiss";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { InMemoryStore } from "@langchain/core/stores";
import { TextLoader } from "langchain/document_loaders/fs/text";
import { PromptTemplate } from "@langchain/core/prompts";
import { StringOutputParser } from "@langchain/core/output_parsers";
import { RunnableSequence } from "@langchain/core/runnables";
import { Document } from "@langchain/core/documents";

const textLoader = new TextLoader("../examples/state_of_the_union.txt");
const parentDocuments = await textLoader.load();

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 10000,
  chunkOverlap: 20,
});

const docs = await splitter.splitDocuments(parentDocuments);

const chain = RunnableSequence.from([
  { content: (doc: Document) => doc.pageContent },
  PromptTemplate.fromTemplate(`Summarize the following document:\n\n{content}`),
  new ChatOpenAI({
    maxRetries: 0,
  }),
  new StringOutputParser(),
]);

const summaries = await chain.batch(docs, {
  maxConcurrency: 5,
});

const idKey = "doc_id";
const docIds = docs.map((_) => uuid.v4());
const summaryDocs = summaries.map((summary, i) => {
  const summaryDoc = new Document({
    pageContent: summary,
    metadata: {
      [idKey]: docIds[i],
    },
  });
  return summaryDoc;
});

// The byteStore to use to store the original chunks
const byteStore = new InMemoryStore<Uint8Array>();

// The vectorstore to use to index the summaries
const vectorstore = await FaissStore.fromDocuments(
  summaryDocs,
  new OpenAIEmbeddings()
);

const retriever = new MultiVectorRetriever({
  vectorstore,
  byteStore,
  idKey,
});

const keyValuePairs: [string, Document][] = docs.map((originalDoc, i) => [
  docIds[i],
  originalDoc,
]);

// Use the retriever to add the original chunks to the document store
await retriever.docstore.mset(keyValuePairs);

// We could also add the original chunks to the vectorstore if we wish
// const taggedOriginalDocs = docs.map((doc, i) => {
//   doc.metadata[idKey] = docIds[i];
//   return doc;
// });
// await retriever.vectorstore.addDocuments(taggedOriginalDocs);

// Vectorstore alone retrieves the summaries
const vectorstoreResult = await retriever.vectorstore.similaritySearch(
  "justice breyer"
);
console.log(vectorstoreResult[0].pageContent.length);
/*
1118
*/

// Retriever returns larger result
const retrieverResult = await retriever.invoke("justice breyer");
console.log(retrieverResult[0].pageContent.length);
/*
9770
*/
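
The choice between embedding a summary instead of the original chunk, or alongside it, comes down to which entries are written to the vector index under the same ID. The following sketch shows both options on plain strings. `IndexEntry` and `buildSummaryEntries` are hypothetical names, and `doc_id` is assumed as the linking metadata key, matching the `idKey` used in the example above.

```typescript
// Hypothetical shape for something that will be embedded. Not a LangChain class.
interface IndexEntry {
  pageContent: string;
  metadata: Record<string, string>;
}

// Build the entries to embed: always the summary, optionally the original too.
// All three arrays are expected to be aligned by position.
function buildSummaryEntries(
  originals: string[],
  summaries: string[],
  ids: string[],
  includeOriginals: boolean
): IndexEntry[] {
  if (originals.length !== summaries.length || originals.length !== ids.length) {
    throw new Error("originals, summaries, and ids must have the same length");
  }
  const entries: IndexEntry[] = [];
  for (let i = 0; i < ids.length; i += 1) {
    entries.push({ pageContent: summaries[i], metadata: { doc_id: ids[i] } });
    if (includeOriginals) {
      entries.push({ pageContent: originals[i], metadata: { doc_id: ids[i] } });
    }
  }
  return entries;
}

const summaryOnly = buildSummaryEntries(["long text"], ["short"], ["id-1"], false);
const summaryAndOriginal = buildSummaryEntries(["long text"], ["short"], ["id-1"], true);
console.log(summaryOnly.length, summaryAndOriginal.length); // 1 2
```

Because both entries carry the same ID, a hit on either one resolves to the same parent document.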


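Hypothetical queries

An LLM can also be used to generate a list of hypothetical questions that could be asked of a particular document. These questions can then be embedded and used to retrieve the original document.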

import * as uuid from "uuid";

import { ChatOpenAI, OpenAIEmbeddings } from "@langchain/openai";
import { MultiVectorRetriever } from "langchain/retrievers/multi_vector";
import { FaissStore } from "@langchain/community/vectorstores/faiss";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { InMemoryStore } from "@langchain/core/stores";
import { TextLoader } from "langchain/document_loaders/fs/text";
import { PromptTemplate } from "@langchain/core/prompts";
import { RunnableSequence } from "@langchain/core/runnables";
import { Document } from "@langchain/core/documents";
import { JsonKeyOutputFunctionsParser } from "@langchain/core/output_parsers/openai_functions";

const textLoader = new TextLoader("../examples/state_of_the_union.txt");
const parentDocuments = await textLoader.load();

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 10000,
  chunkOverlap: 20,
});

const docs = await splitter.splitDocuments(parentDocuments);

const functionsSchema = [
  {
    name: "hypothetical_questions",
    description: "Generate hypothetical questions",
    parameters: {
      type: "object",
      properties: {
        questions: {
          type: "array",
          items: {
            type: "string",
          },
        },
      },
      required: ["questions"],
    },
  },
];

const functionCallingModel = new ChatOpenAI({
  maxRetries: 0,
  model: "gpt-4",
}).bind({
  functions: functionsSchema,
  function_call: { name: "hypothetical_questions" },
});

const chain = RunnableSequence.from([
  { content: (doc: Document) => doc.pageContent },
  PromptTemplate.fromTemplate(
    `Generate a list of 3 hypothetical questions that the below document could be used to answer:\n\n{content}`
  ),
  functionCallingModel,
  new JsonKeyOutputFunctionsParser<string[]>({ attrName: "questions" }),
]);

const hypotheticalQuestions = await chain.batch(docs, {
  maxConcurrency: 5,
});

const idKey = "doc_id";
const docIds = docs.map((_) => uuid.v4());
const hypotheticalQuestionDocs = hypotheticalQuestions
  .map((questionArray, i) => {
    const questionDocuments = questionArray.map((question) => {
      const questionDocument = new Document({
        pageContent: question,
        metadata: {
          [idKey]: docIds[i],
        },
      });
      return questionDocument;
    });
    return questionDocuments;
  })
  .flat();

// The byteStore to use to store the original chunks
const byteStore = new InMemoryStore<Uint8Array>();

// The vectorstore to use to index the hypothetical questions
const vectorstore = await FaissStore.fromDocuments(
  hypotheticalQuestionDocs,
  new OpenAIEmbeddings()
);

const retriever = new MultiVectorRetriever({
  vectorstore,
  byteStore,
  idKey,
});

const keyValuePairs: [string, Document][] = docs.map((originalDoc, i) => [
  docIds[i],
  originalDoc,
]);

// Use the retriever to add the original chunks to the document store
await retriever.docstore.mset(keyValuePairs);

// We could also add the original chunks to the vectorstore if we wish
// const taggedOriginalDocs = docs.map((doc, i) => {
//   doc.metadata[idKey] = docIds[i];
//   return doc;
// });
// await retriever.vectorstore.addDocuments(taggedOriginalDocs);

// Vectorstore alone retrieves the hypothetical questions
const vectorstoreResult = await retriever.vectorstore.similaritySearch(
  "justice breyer"
);
console.log(vectorstoreResult[0].pageContent);
/*
"What measures will be taken to crack down on corporations overcharging American businesses and consumers?"
*/

// Retriever returns larger result
const retrieverResult = await retriever.invoke("justice breyer");
console.log(retrieverResult[0].pageContent.length);
/*
9770
*/
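
The chain above forces the model to call the `hypothetical_questions` function and then pulls the `questions` key out of that call. With OpenAI function calling, the call's arguments arrive as a JSON string, so the extraction step amounts to parsing that string and reading one key. The sketch below shows roughly what that involves, with validation added. `extractStringArray` is a hypothetical helper and the sample payload is invented; it is not the implementation of `JsonKeyOutputFunctionsParser`.

```typescript
// Hypothetical helper: parse a function call's JSON arguments and return one
// key as a list of strings, failing loudly if the shape is not what we expect.
function extractStringArray(argumentsJson: string, key: string): string[] {
  const parsed: unknown = JSON.parse(argumentsJson);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("Function arguments are not a JSON object");
  }
  const value = (parsed as Record<string, unknown>)[key];
  if (!Array.isArray(value)) {
    throw new Error(`Expected "${key}" to be an array`);
  }
  const strings: string[] = [];
  for (const item of value) {
    if (typeof item !== "string") {
      throw new Error(`Expected every item in "${key}" to be a string`);
    }
    strings.push(item);
  }
  return strings;
}

// Invented payload in the shape requested by the schema above.
const rawArguments =
  '{"questions": ["Question one?", "Question two?", "Question three?"]}';
console.log(extractStringArray(rawArguments, "questions").length); // 3
```

Validating the shape before building question documents is a defensive choice: a model can return fewer questions than requested or a malformed payload, and failing at this step is easier to debug than an empty or mistyped entry in the vector index.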


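Next steps

You've now learned a few ways to generate multiple embeddings per document.

Next, check out the individual sections for deeper dives on specific retrievers, the broader tutorial on RAG, or the section on creating your own custom retriever over any data source.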

