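How to generate multiple embeddings per document

Prerequisites

This guide assumes familiarity with the following concepts

Embedding different representations of an original document, and then returning the original document whenever any of those representations produces a search hit, lets you tune and improve retrieval performance. LangChain has a base MultiVectorRetriever designed to do exactly this.

Most of the complexity lies in how to create the multiple vectors for each document. This guide covers some common ways to create those vectors and use the MultiVectorRetriever.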
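The retrieval flow described above can be sketched without any LangChain dependency. This is a minimal illustration under stated assumptions, not the MultiVectorRetriever implementation: the names `vectorIndex`, `parentStore`, and `retrieveParents`, the two-dimensional vectors, and the document texts are all hypothetical.

```typescript
// Hypothetical, dependency-free sketch; none of these names come from LangChain.
type IndexedVector = { vector: number[]; parentId: string };

// Cosine similarity between two vectors of equal length.
function cosine(a: number[], b: number[]): number {
  const dot = a.reduce((sum, x, i) => sum + x * (b[i] ?? 0), 0);
  const norm = (v: number[]): number =>
    Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return dot / (norm(a) * norm(b));
}

// Several vectors (chunks, a summary, a question) can point at one parent.
const vectorIndex: IndexedVector[] = [
  { vector: [1, 0], parentId: "doc-a" },
  { vector: [0.9, 0.1], parentId: "doc-a" },
  { vector: [0, 1], parentId: "doc-b" },
];

// The document store maps parent ids to the full original text.
const parentStore = new Map<string, string>([
  ["doc-a", "Full text of document A"],
  ["doc-b", "Full text of document B"],
]);

function retrieveParents(query: number[], k: number): string[] {
  // Rank every indexed vector by similarity and keep the top k hits.
  const hits = vectorIndex
    .slice()
    .sort((x, y) => cosine(query, y.vector) - cosine(query, x.vector))
    .slice(0, k);
  // Collapse hits to unique parent ids, preserving ranking order.
  const parentIds = Array.from(new Set(hits.map((hit) => hit.parentId)));
  const results: string[] = [];
  for (const id of parentIds) {
    const text = parentStore.get(id);
    if (text !== undefined) results.push(text);
  }
  return results;
}

// Both of the top two hits point at doc-a, so a single parent comes back.
const parents = retrieveParents([1, 0.05], 2);
console.log(parents);
```

The key design point is that similarity search runs over the small representations, while the value returned to the caller is always the parent looked up by id.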

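Some ways to create multiple vectors per document include:

Smaller chunks: split a document into smaller chunks and embed those (this is what ParentDocumentRetriever does).
Summary: create a summary for each document and embed it along with the document (or instead of it).
Hypothetical questions: create hypothetical questions that each document would be appropriate to answer, and embed those along with the document (or instead of it).

Note that this also enables another way of adding embeddings: manually. This is useful because you can explicitly add questions or queries that should lead to a document being retrieved, which gives you more control.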
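As a rough sketch of the manual approach, the helper below tags hand-written queries with the id of the parent they should retrieve. The helper name `tagManualQueries`, the id `parent-123`, and the sample queries are hypothetical; the plain objects only mimic the `pageContent` and `metadata` shape of a LangChain Document.

```typescript
// Plain-object stand-in for a LangChain Document (pageContent + metadata).
interface TaggedText {
  pageContent: string;
  metadata: Record<string, string>;
}

const idKey = "doc_id";

// Hypothetical helper: tag hand-written queries with the id of the parent
// document they should lead to. Blank entries are dropped.
function tagManualQueries(parentId: string, queries: string[]): TaggedText[] {
  return queries
    .map((query) => query.trim())
    .filter((query) => query.length > 0)
    .map((query) => ({
      pageContent: query,
      metadata: { [idKey]: parentId },
    }));
}

// Illustrative input only; the id and queries are made up.
const manualDocs = tagManualQueries("parent-123", [
  "Who was nominated to the Supreme Court?",
  "   ",
  "justice breyer retirement",
]);
console.log(manualDocs.length); // 2
```

In a real pipeline, entries like these would presumably be wrapped in `Document` objects and added to the retriever's vector store, so that a hit on any of them resolves to the tagged parent.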

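Smaller chunks

It is often useful to retrieve larger chunks of information while embedding smaller ones. The embeddings can then capture semantic meaning as precisely as possible, while as much context as possible is passed downstream. Note: this is what ParentDocumentRetriever does. Here we show what happens under the hood.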
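The core idea can be sketched in a few lines: every small chunk carries the id of the larger chunk it came from. The splitter below is a naive fixed-width stand-in, not RecursiveCharacterTextSplitter, and the names `splitIntoChildren` and `parent-1` are hypothetical.

```typescript
interface ChildChunk {
  pageContent: string;
  metadata: { doc_id: string };
}

// Naive fixed-width splitter (an assumption for illustration only).
// Each child chunk is tagged with the id of its parent.
function splitIntoChildren(
  parentId: string,
  text: string,
  chunkSize: number
): ChildChunk[] {
  if (chunkSize <= 0) throw new Error("chunkSize must be positive");
  const children: ChildChunk[] = [];
  for (let start = 0; start < text.length; start += chunkSize) {
    children.push({
      pageContent: text.slice(start, start + chunkSize),
      metadata: { doc_id: parentId },
    });
  }
  return children;
}

// A 1000-character parent split at 400 characters yields 400, 400, and 200.
const parentText = "a".repeat(1000);
const children = splitIntoChildren("parent-1", parentText, 400);
console.log(children.map((child) => child.pageContent.length));
```

Only the children are embedded; the parent text is stored separately under the same id and returned at query time.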

Tip

See this section for general instructions on installing integration packages.

npm:
npm install @langchain/openai @langchain/community @langchain/core

Yarn:
yarn add @langchain/openai @langchain/community @langchain/core

pnpm:
pnpm add @langchain/openai @langchain/community @langchain/core

import * as uuid from "uuid";

import { MultiVectorRetriever } from "langchain/retrievers/multi_vector";
import { FaissStore } from "@langchain/community/vectorstores/faiss";
import { OpenAIEmbeddings } from "@langchain/openai";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { InMemoryStore } from "@langchain/core/stores";
import { TextLoader } from "langchain/document_loaders/fs/text";
import { Document } from "@langchain/core/documents";

const textLoader = new TextLoader("../examples/state_of_the_union.txt");
const parentDocuments = await textLoader.load();

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 10000,
  chunkOverlap: 20,
});

const docs = await splitter.splitDocuments(parentDocuments);

const idKey = "doc_id";
const docIds = docs.map((_) => uuid.v4());

const childSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 400,
  chunkOverlap: 0,
});

const subDocs = [];
for (let i = 0; i < docs.length; i += 1) {
  const childDocs = await childSplitter.splitDocuments([docs[i]]);
  const taggedChildDocs = childDocs.map((childDoc) => {
    // eslint-disable-next-line no-param-reassign
    childDoc.metadata[idKey] = docIds[i];
    return childDoc;
  });
  subDocs.push(...taggedChildDocs);
}

// The byteStore to use to store the original chunks
const byteStore = new InMemoryStore<Uint8Array>();

// The vectorstore to use to index the child chunks
const vectorstore = await FaissStore.fromDocuments(
  subDocs,
  new OpenAIEmbeddings()
);

const retriever = new MultiVectorRetriever({
  vectorstore,
  byteStore,
  idKey,
  // Optional `k` parameter to search for more child documents in VectorStore.
  // Note that this does not exactly correspond to the number of final (parent) documents
  // retrieved, as multiple child documents can point to the same parent.
  childK: 20,
  // Optional `k` parameter to limit number of final, parent documents returned from this
  // retriever and sent to LLM. This is an upper-bound, and the final count may be lower than this.
  parentK: 5,
});

const keyValuePairs: [string, Document][] = docs.map((originalDoc, i) => [
  docIds[i],
  originalDoc,
]);

// Use the retriever to add the original chunks to the document store
await retriever.docstore.mset(keyValuePairs);

// Vectorstore alone retrieves the small chunks
const vectorstoreResult = await retriever.vectorstore.similaritySearch(
  "justice breyer"
);
console.log(vectorstoreResult[0].pageContent.length);
/*
  390
*/

// Retriever returns larger result
const retrieverResult = await retriever.invoke("justice breyer");
console.log(retrieverResult[0].pageContent.length);
/*
  9770
*/
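The comments on `childK` and `parentK` above describe how many child hits are searched and how many parents are finally returned. The following sketch illustrates that relationship on a ranked list of parent ids; `collapseToParents` is a hypothetical helper and makes no claim about how MultiVectorRetriever is implemented internally.

```typescript
// Hypothetical helper mirroring the childK / parentK idea.
// Input: the parent id of each child hit, in ranked order.
function collapseToParents(
  rankedChildParentIds: string[],
  childK: number,
  parentK: number
): string[] {
  const seen = new Set<string>();
  const parents: string[] = [];
  // Only the top childK child hits are considered at all.
  for (const id of rankedChildParentIds.slice(0, childK)) {
    // parentK is an upper bound on how many parents are returned.
    if (parents.length >= parentK) break;
    if (!seen.has(id)) {
      seen.add(id);
      parents.push(id);
    }
  }
  return parents;
}

const ranked = ["p1", "p1", "p2", "p1", "p3", "p2", "p4"];
// Four child hits cover only two distinct parents.
console.log(collapseToParents(ranked, 4, 5)); // ["p1", "p2"]
// Seven child hits cover four parents, but parentK caps the result at three.
console.log(collapseToParents(ranked, 7, 3)); // ["p1", "p2", "p3"]
```

This is why the number of returned parents can be lower than `parentK`: several of the top child hits may share one parent.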

API Reference

MultiVectorRetriever from langchain/retrievers/multi_vector
FaissStore from @langchain/community/vectorstores/faiss
OpenAIEmbeddings from @langchain/openai
RecursiveCharacterTextSplitter from @langchain/textsplitters
InMemoryStore from @langchain/core/stores
TextLoader from langchain/document_loaders/fs/text
Document from @langchain/core/documents

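Summary

A summary can often distill what a chunk is about more accurately than the raw text, which leads to better retrieval. Here we show how to create summaries and then embed them.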
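Before looking at the full example, here is a dependency-free sketch of the bookkeeping involved: each summary is paired with the id of the chunk it summarizes, and the original text can optionally be indexed as well. The helper `buildSummaryEntries`, the ids, and the placeholder texts are hypothetical; in the real flow the summaries come from an LLM.

```typescript
interface IndexEntry {
  pageContent: string;
  metadata: { doc_id: string };
}

interface ParentDoc {
  id: string;
  text: string;
}

// Hypothetical helper: build the texts to embed for each parent document.
// With includeOriginal set, the parent text is indexed next to its summary.
function buildSummaryEntries(
  parentDocs: ParentDoc[],
  summaries: string[],
  includeOriginal: boolean
): IndexEntry[] {
  if (parentDocs.length !== summaries.length) {
    throw new Error("Expected exactly one summary per parent document");
  }
  const entries: IndexEntry[] = [];
  parentDocs.forEach((parentDoc, i) => {
    const summary = summaries[i];
    if (summary === undefined) return;
    entries.push({ pageContent: summary, metadata: { doc_id: parentDoc.id } });
    if (includeOriginal) {
      entries.push({
        pageContent: parentDoc.text,
        metadata: { doc_id: parentDoc.id },
      });
    }
  });
  return entries;
}

// Placeholder data; real summaries would be generated by a model.
const sampleParents: ParentDoc[] = [
  { id: "parent-1", text: "Long original text of the first chunk" },
  { id: "parent-2", text: "Long original text of the second chunk" },
];
const sampleSummaries = ["Placeholder summary one", "Placeholder summary two"];

const summariesOnly = buildSummaryEntries(sampleParents, sampleSummaries, false);
const summariesAndOriginals = buildSummaryEntries(sampleParents, sampleSummaries, true);
console.log(summariesOnly.length, summariesAndOriginals.length); // 2 4
```

The length check matters because the pairing is purely positional: summary `i` is assumed to describe parent `i`.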

import * as uuid from "uuid";

import { ChatOpenAI, OpenAIEmbeddings } from "@langchain/openai";
import { MultiVectorRetriever } from "langchain/retrievers/multi_vector";
import { FaissStore } from "@langchain/community/vectorstores/faiss";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { InMemoryStore } from "@langchain/core/stores";
import { TextLoader } from "langchain/document_loaders/fs/text";
import { PromptTemplate } from "@langchain/core/prompts";
import { StringOutputParser } from "@langchain/core/output_parsers";
import { RunnableSequence } from "@langchain/core/runnables";
import { Document } from "@langchain/core/documents";

const textLoader = new TextLoader("../examples/state_of_the_union.txt");
const parentDocuments = await textLoader.load();

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 10000,
  chunkOverlap: 20,
});

const docs = await splitter.splitDocuments(parentDocuments);

const chain = RunnableSequence.from([
  { content: (doc: Document) => doc.pageContent },
  PromptTemplate.fromTemplate(`Summarize the following document:\n\n{content}`),
  new ChatOpenAI({
    maxRetries: 0,
  }),
  new StringOutputParser(),
]);

const summaries = await chain.batch(docs, {
  maxConcurrency: 5,
});

const idKey = "doc_id";
const docIds = docs.map((_) => uuid.v4());
const summaryDocs = summaries.map((summary, i) => {
  const summaryDoc = new Document({
    pageContent: summary,
    metadata: {
      [idKey]: docIds[i],
    },
  });
  return summaryDoc;
});

// The byteStore to use to store the original chunks
const byteStore = new InMemoryStore<Uint8Array>();

// The vectorstore to use to index the summaries
const vectorstore = await FaissStore.fromDocuments(
  summaryDocs,
  new OpenAIEmbeddings()
);

const retriever = new MultiVectorRetriever({
  vectorstore,
  byteStore,
  idKey,
});

const keyValuePairs: [string, Document][] = docs.map((originalDoc, i) => [
  docIds[i],
  originalDoc,
]);

// Use the retriever to add the original chunks to the document store
await retriever.docstore.mset(keyValuePairs);

// We could also add the original chunks to the vectorstore if we wish
// const taggedOriginalDocs = docs.map((doc, i) => {
//   doc.metadata[idKey] = docIds[i];
//   return doc;
// });
// await retriever.vectorstore.addDocuments(taggedOriginalDocs);

// Vectorstore alone retrieves the summaries
const vectorstoreResult = await retriever.vectorstore.similaritySearch(
  "justice breyer"
);
console.log(vectorstoreResult[0].pageContent.length);
/*
  1118
*/

// Retriever returns larger result
const retrieverResult = await retriever.invoke("justice breyer");
console.log(retrieverResult[0].pageContent.length);
/*
  9770
*/

API Reference

ChatOpenAI from @langchain/openai
OpenAIEmbeddings from @langchain/openai
MultiVectorRetriever from langchain/retrievers/multi_vector
FaissStore from @langchain/community/vectorstores/faiss
RecursiveCharacterTextSplitter from @langchain/textsplitters
InMemoryStore from @langchain/core/stores
TextLoader from langchain/document_loaders/fs/text
PromptTemplate from @langchain/core/prompts
StringOutputParser from @langchain/core/output_parsers
RunnableSequence from @langchain/core/runnables
Document from @langchain/core/documents

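Hypothetical queries

An LLM can also be used to generate a list of hypothetical questions that could be asked of a particular document. These questions can then be embedded and used to retrieve the original document.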
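Because each document yields a list of questions, the per-document lists have to be flattened into one list of entries, each tagged with its parent id. The sketch below shows that step in isolation; `flattenQuestions`, the ids, and the sample questions are hypothetical placeholders for what the model would return.

```typescript
interface QuestionEntry {
  pageContent: string;
  metadata: { doc_id: string };
}

// Hypothetical helper: questionLists[i] holds the questions generated for
// the document whose id is docIds[i].
function flattenQuestions(
  docIds: string[],
  questionLists: string[][]
): QuestionEntry[] {
  const entries: QuestionEntry[] = [];
  questionLists.forEach((questions, i) => {
    const parentId = docIds[i];
    if (parentId === undefined) {
      throw new Error(`No document id for question list ${i}`);
    }
    for (const question of questions) {
      entries.push({ pageContent: question, metadata: { doc_id: parentId } });
    }
  });
  return entries;
}

// Placeholder questions; a model would generate these in practice.
const questionEntries = flattenQuestions(
  ["parent-a", "parent-b"],
  [
    ["Question one?", "Question two?", "Question three?"],
    ["Question four?"],
  ]
);
console.log(questionEntries.length); // 4
```

Every question becomes its own vector, so one parent can be reached through several differently worded queries.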

import * as uuid from "uuid";

import { ChatOpenAI, OpenAIEmbeddings } from "@langchain/openai";
import { MultiVectorRetriever } from "langchain/retrievers/multi_vector";
import { FaissStore } from "@langchain/community/vectorstores/faiss";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { InMemoryStore } from "@langchain/core/stores";
import { TextLoader } from "langchain/document_loaders/fs/text";
import { PromptTemplate } from "@langchain/core/prompts";
import { RunnableSequence } from "@langchain/core/runnables";
import { Document } from "@langchain/core/documents";
import { JsonKeyOutputFunctionsParser } from "@langchain/core/output_parsers/openai_functions";

const textLoader = new TextLoader("../examples/state_of_the_union.txt");
const parentDocuments = await textLoader.load();

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 10000,
  chunkOverlap: 20,
});

const docs = await splitter.splitDocuments(parentDocuments);

const functionsSchema = [
  {
    name: "hypothetical_questions",
    description: "Generate hypothetical questions",
    parameters: {
      type: "object",
      properties: {
        questions: {
          type: "array",
          items: {
            type: "string",
          },
        },
      },
      required: ["questions"],
    },
  },
];

const functionCallingModel = new ChatOpenAI({
  maxRetries: 0,
  model: "gpt-4",
}).bind({
  functions: functionsSchema,
  function_call: { name: "hypothetical_questions" },
});

const chain = RunnableSequence.from([
  { content: (doc: Document) => doc.pageContent },
  PromptTemplate.fromTemplate(
    `Generate a list of 3 hypothetical questions that the below document could be used to answer:\n\n{content}`
  ),
  functionCallingModel,
  new JsonKeyOutputFunctionsParser<string[]>({ attrName: "questions" }),
]);

const hypotheticalQuestions = await chain.batch(docs, {
  maxConcurrency: 5,
});

const idKey = "doc_id";
const docIds = docs.map((_) => uuid.v4());
const hypotheticalQuestionDocs = hypotheticalQuestions
  .map((questionArray, i) => {
    const questionDocuments = questionArray.map((question) => {
      const questionDocument = new Document({
        pageContent: question,
        metadata: {
          [idKey]: docIds[i],
        },
      });
      return questionDocument;
    });
    return questionDocuments;
  })
  .flat();

// The byteStore to use to store the original chunks
const byteStore = new InMemoryStore<Uint8Array>();

// The vectorstore to use to index the hypothetical questions
const vectorstore = await FaissStore.fromDocuments(
  hypotheticalQuestionDocs,
  new OpenAIEmbeddings()
);

const retriever = new MultiVectorRetriever({
  vectorstore,
  byteStore,
  idKey,
});

const keyValuePairs: [string, Document][] = docs.map((originalDoc, i) => [
  docIds[i],
  originalDoc,
]);

// Use the retriever to add the original chunks to the document store
await retriever.docstore.mset(keyValuePairs);

// We could also add the original chunks to the vectorstore if we wish
// const taggedOriginalDocs = docs.map((doc, i) => {
//   doc.metadata[idKey] = docIds[i];
//   return doc;
// });
// await retriever.vectorstore.addDocuments(taggedOriginalDocs);

// Vectorstore alone retrieves the hypothetical questions
const vectorstoreResult = await retriever.vectorstore.similaritySearch(
  "justice breyer"
);
console.log(vectorstoreResult[0].pageContent);
/*
  "What measures will be taken to crack down on corporations overcharging American businesses and consumers?"
*/

// Retriever returns larger result
const retrieverResult = await retriever.invoke("justice breyer");
console.log(retrieverResult[0].pageContent.length);
/*
  9770
*/

API Reference

ChatOpenAI from @langchain/openai
OpenAIEmbeddings from @langchain/openai
MultiVectorRetriever from langchain/retrievers/multi_vector
FaissStore from @langchain/community/vectorstores/faiss
RecursiveCharacterTextSplitter from @langchain/textsplitters
InMemoryStore from @langchain/core/stores
TextLoader from langchain/document_loaders/fs/text
PromptTemplate from @langchain/core/prompts
RunnableSequence from @langchain/core/runnables
Document from @langchain/core/documents
JsonKeyOutputFunctionsParser from @langchain/core/output_parsers/openai_functions

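Next steps

You've now learned several ways to generate multiple embeddings per document.

Next, check out the individual sections for deeper dives on specific retrievers, the broader tutorial on RAG, or this section to learn how to create your own custom retriever over any data source.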
