How to generate multiple embeddings per document
Prerequisites
This guide assumes familiarity with the following concepts: vector stores, embeddings, retrievers, and text splitters.
Embedding different representations of an original document, then returning the original document whenever any of those representations results in a search hit, can allow you to tune and improve your retrieval performance. LangChain has a base MultiVectorRetriever designed to do just this!
A lot of the complexity lies in how to create the multiple vectors per document. This guide covers some of the common ways to create those vectors and use the MultiVectorRetriever.
Some methods to create multiple vectors per document include:
- Smaller chunks: split a document into smaller chunks, and embed those (e.g., the ParentDocumentRetriever)
- Summary: create a summary for each document, and embed that along with (or instead of) the document
- Hypothetical questions: create hypothetical questions that each document would be appropriate to answer, and embed those along with (or instead of) the document
Note that this also enables another method of adding embeddings: manually. This is nice because you can explicitly add questions or queries that should lead to a document being retrieved, giving you more control; a sketch of this manual approach follows.
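The sketch below assumes a retriever, idKey, and docIds set up as in the full examples later in this guide; the query text itself is hypothetical:
import { Document } from "@langchain/core/documents";

// Hypothetical handwritten query that should route to the first original document.
const manualQueryDoc = new Document({
  pageContent: "Who did the president nominate to the Supreme Court?",
  metadata: { [idKey]: docIds[0] }, // tag it with the parent document's id
});

// Only the handwritten query gets embedded; a hit on it returns the full parent document.
await retriever.vectorstore.addDocuments([manualQueryDoc]);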
Smaller chunks
It can often be useful to retrieve larger chunks of information, but embed smaller chunks. This allows the embeddings to capture the semantic meaning as closely as possible, while still passing as much context as possible downstream. Note that this is what the ParentDocumentRetriever does; here we show what is going on under the hood, and a sketch of the packaged equivalent appears after the example.
Tip: install the required integration packages with npm, Yarn, or pnpm:
npm install @langchain/openai @langchain/community
yarn add @langchain/openai @langchain/community
pnpm add @langchain/openai @langchain/community
import * as uuid from "uuid";
import { MultiVectorRetriever } from "langchain/retrievers/multi_vector";
import { FaissStore } from "@langchain/community/vectorstores/faiss";
import { OpenAIEmbeddings } from "@langchain/openai";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { InMemoryStore } from "@langchain/core/stores";
import { TextLoader } from "langchain/document_loaders/fs/text";
import { Document } from "@langchain/core/documents";
const textLoader = new TextLoader("../examples/state_of_the_union.txt");
const parentDocuments = await textLoader.load();
const splitter = new RecursiveCharacterTextSplitter({
chunkSize: 10000,
chunkOverlap: 20,
});
const docs = await splitter.splitDocuments(parentDocuments);
const idKey = "doc_id";
const docIds = docs.map((_) => uuid.v4());
const childSplitter = new RecursiveCharacterTextSplitter({
chunkSize: 400,
chunkOverlap: 0,
});
const subDocs = [];
for (let i = 0; i < docs.length; i += 1) {
const childDocs = await childSplitter.splitDocuments([docs[i]]);
const taggedChildDocs = childDocs.map((childDoc) => {
// eslint-disable-next-line no-param-reassign
childDoc.metadata[idKey] = docIds[i];
return childDoc;
});
subDocs.push(...taggedChildDocs);
}
// The byteStore to use to store the original chunks
const byteStore = new InMemoryStore<Uint8Array>();
// The vectorstore to use to index the child chunks
const vectorstore = await FaissStore.fromDocuments(
subDocs,
new OpenAIEmbeddings()
);
const retriever = new MultiVectorRetriever({
vectorstore,
byteStore,
idKey,
// Optional `k` parameter to search for more child documents in VectorStore.
// Note that this does not exactly correspond to the number of final (parent) documents
// retrieved, as multiple child documents can point to the same parent.
childK: 20,
// Optional `k` parameter to limit number of final, parent documents returned from this
// retriever and sent to LLM. This is an upper-bound, and the final count may be lower than this.
parentK: 5,
});
const keyValuePairs: [string, Document][] = docs.map((originalDoc, i) => [
docIds[i],
originalDoc,
]);
// Use the retriever to add the original chunks to the document store
await retriever.docstore.mset(keyValuePairs);
// Vectorstore alone retrieves the small chunks
const vectorstoreResult = await retriever.vectorstore.similaritySearch(
"justice breyer"
);
console.log(vectorstoreResult[0].pageContent.length);
/*
390
*/
// Retriever returns larger result
const retrieverResult = await retriever.invoke("justice breyer");
console.log(retrieverResult[0].pageContent.length);
/*
9770
*/
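As noted above, the ParentDocumentRetriever packages this same split-small, retrieve-big pattern. Here is a minimal sketch of the equivalent setup, assuming its constructor accepts the same vectorstore/byteStore pair and childK/parentK options as MultiVectorRetriever; double-check the exact option names against the current API docs:
import { ParentDocumentRetriever } from "langchain/retrievers/parent_document";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { InMemoryStore } from "@langchain/core/stores";
import { OpenAIEmbeddings } from "@langchain/openai";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { TextLoader } from "langchain/document_loaders/fs/text";

const retriever = new ParentDocumentRetriever({
  // Child chunks are embedded into this vector store...
  vectorstore: new MemoryVectorStore(new OpenAIEmbeddings()),
  // ...while the larger parent chunks live in this byte store.
  byteStore: new InMemoryStore<Uint8Array>(),
  parentSplitter: new RecursiveCharacterTextSplitter({
    chunkSize: 10000,
    chunkOverlap: 20,
  }),
  childSplitter: new RecursiveCharacterTextSplitter({
    chunkSize: 400,
    chunkOverlap: 0,
  }),
  childK: 20,
  parentK: 5,
});

// Splitting, id tagging, and storage are all handled internally.
const parentDocuments = await new TextLoader(
  "../examples/state_of_the_union.txt"
).load();
await retriever.addDocuments(parentDocuments);
const retrievedDocs = await retriever.invoke("justice breyer");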
API reference
- MultiVectorRetriever from langchain/retrievers/multi_vector
- FaissStore from @langchain/community/vectorstores/faiss
- OpenAIEmbeddings from @langchain/openai
- RecursiveCharacterTextSplitter from @langchain/textsplitters
- InMemoryStore from @langchain/core/stores
- TextLoader from langchain/document_loaders/fs/text
- Document from @langchain/core/documents
Summary
Oftentimes, a summary may be able to distill more accurately what a chunk is about, leading to better retrieval. Here we show how to create summaries, and then embed those.
import * as uuid from "uuid";
import { ChatOpenAI, OpenAIEmbeddings } from "@langchain/openai";
import { MultiVectorRetriever } from "langchain/retrievers/multi_vector";
import { FaissStore } from "@langchain/community/vectorstores/faiss";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { InMemoryStore } from "@langchain/core/stores";
import { TextLoader } from "langchain/document_loaders/fs/text";
import { PromptTemplate } from "@langchain/core/prompts";
import { StringOutputParser } from "@langchain/core/output_parsers";
import { RunnableSequence } from "@langchain/core/runnables";
import { Document } from "@langchain/core/documents";
const textLoader = new TextLoader("../examples/state_of_the_union.txt");
const parentDocuments = await textLoader.load();
const splitter = new RecursiveCharacterTextSplitter({
chunkSize: 10000,
chunkOverlap: 20,
});
const docs = await splitter.splitDocuments(parentDocuments);
const chain = RunnableSequence.from([
{ content: (doc: Document) => doc.pageContent },
PromptTemplate.fromTemplate(`Summarize the following document:\n\n{content}`),
new ChatOpenAI({
maxRetries: 0,
}),
new StringOutputParser(),
]);
const summaries = await chain.batch(docs, {
maxConcurrency: 5,
});
const idKey = "doc_id";
const docIds = docs.map((_) => uuid.v4());
const summaryDocs = summaries.map((summary, i) => {
const summaryDoc = new Document({
pageContent: summary,
metadata: {
[idKey]: docIds[i],
},
});
return summaryDoc;
});
// The byteStore to use to store the original chunks
const byteStore = new InMemoryStore<Uint8Array>();
// The vectorstore to use to index the child chunks
const vectorstore = await FaissStore.fromDocuments(
summaryDocs,
new OpenAIEmbeddings()
);
const retriever = new MultiVectorRetriever({
vectorstore,
byteStore,
idKey,
});
const keyValuePairs: [string, Document][] = docs.map((originalDoc, i) => [
docIds[i],
originalDoc,
]);
// Use the retriever to add the original chunks to the document store
await retriever.docstore.mset(keyValuePairs);
// We could also add the original chunks to the vectorstore if we wish
// const taggedOriginalDocs = docs.map((doc, i) => {
// doc.metadata[idKey] = docIds[i];
// return doc;
// });
// retriever.vectorstore.addDocuments(taggedOriginalDocs);
// Vectorstore alone retrieves the small chunks
const vectorstoreResult = await retriever.vectorstore.similaritySearch(
"justice breyer"
);
console.log(vectorstoreResult[0].pageContent.length);
/*
1118
*/
// Retriever returns larger result
const retrieverResult = await retriever.invoke("justice breyer");
console.log(retrieverResult[0].pageContent.length);
/*
9770
*/
API reference
- ChatOpenAI from @langchain/openai
- OpenAIEmbeddings from @langchain/openai
- MultiVectorRetriever from langchain/retrievers/multi_vector
- FaissStore from @langchain/community/vectorstores/faiss
- RecursiveCharacterTextSplitter from @langchain/textsplitters
- InMemoryStore from @langchain/core/stores
- TextLoader from langchain/document_loaders/fs/text
- PromptTemplate from @langchain/core/prompts
- StringOutputParser from @langchain/core/output_parsers
- RunnableSequence from @langchain/core/runnables
- Document from @langchain/core/documents
Hypothetical queries
An LLM can also be used to generate a list of hypothetical questions that could be asked of a particular document. These questions can then be embedded and used to retrieve the original document.
import * as uuid from "uuid";
import { ChatOpenAI, OpenAIEmbeddings } from "@langchain/openai";
import { MultiVectorRetriever } from "langchain/retrievers/multi_vector";
import { FaissStore } from "@langchain/community/vectorstores/faiss";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { InMemoryStore } from "@langchain/core/stores";
import { TextLoader } from "langchain/document_loaders/fs/text";
import { PromptTemplate } from "@langchain/core/prompts";
import { RunnableSequence } from "@langchain/core/runnables";
import { Document } from "@langchain/core/documents";
import { JsonKeyOutputFunctionsParser } from "@langchain/core/output_parsers/openai_functions";
const textLoader = new TextLoader("../examples/state_of_the_union.txt");
const parentDocuments = await textLoader.load();
const splitter = new RecursiveCharacterTextSplitter({
chunkSize: 10000,
chunkOverlap: 20,
});
const docs = await splitter.splitDocuments(parentDocuments);
const functionsSchema = [
{
name: "hypothetical_questions",
description: "Generate hypothetical questions",
parameters: {
type: "object",
properties: {
questions: {
type: "array",
items: {
type: "string",
},
},
},
required: ["questions"],
},
},
];
const functionCallingModel = new ChatOpenAI({
maxRetries: 0,
model: "gpt-4",
}).bind({
functions: functionsSchema,
function_call: { name: "hypothetical_questions" },
});
const chain = RunnableSequence.from([
{ content: (doc: Document) => doc.pageContent },
PromptTemplate.fromTemplate(
`Generate a list of 3 hypothetical questions that the below document could be used to answer:\n\n{content}`
),
functionCallingModel,
new JsonKeyOutputFunctionsParser<string[]>({ attrName: "questions" }),
]);
const hypotheticalQuestions = await chain.batch(docs, {
maxConcurrency: 5,
});
const idKey = "doc_id";
const docIds = docs.map((_) => uuid.v4());
const hypotheticalQuestionDocs = hypotheticalQuestions
.map((questionArray, i) => {
const questionDocuments = questionArray.map((question) => {
const questionDocument = new Document({
pageContent: question,
metadata: {
[idKey]: docIds[i],
},
});
return questionDocument;
});
return questionDocuments;
})
.flat();
// The byteStore to use to store the original chunks
const byteStore = new InMemoryStore<Uint8Array>();
// The vectorstore to use to index the child chunks
const vectorstore = await FaissStore.fromDocuments(
hypotheticalQuestionDocs,
new OpenAIEmbeddings()
);
const retriever = new MultiVectorRetriever({
vectorstore,
byteStore,
idKey,
});
const keyValuePairs: [string, Document][] = docs.map((originalDoc, i) => [
docIds[i],
originalDoc,
]);
// Use the retriever to add the original chunks to the document store
await retriever.docstore.mset(keyValuePairs);
// We could also add the original chunks to the vectorstore if we wish
// const taggedOriginalDocs = docs.map((doc, i) => {
// doc.metadata[idKey] = docIds[i];
// return doc;
// });
// retriever.vectorstore.addDocuments(taggedOriginalDocs);
// Vectorstore alone retrieves the small chunks
const vectorstoreResult = await retriever.vectorstore.similaritySearch(
"justice breyer"
);
console.log(vectorstoreResult[0].pageContent);
/*
"What measures will be taken to crack down on corporations overcharging American businesses and consumers?"
*/
// Retriever returns larger result
const retrieverResult = await retriever.invoke("justice breyer");
console.log(retrieverResult[0].pageContent.length);
/*
9770
*/
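Note that binding functions / function_call as above uses OpenAI's legacy function-calling API. Below is a minimal sketch of the same question-generation step using the newer withStructuredOutput method instead; the zod schema and the model name are assumptions for illustration, and the other imports come from the example above:
import { z } from "zod";

// Sketch: ask the model for structured output rather than binding legacy functions.
const structuredModel = new ChatOpenAI({
  model: "gpt-4o-mini",
  maxRetries: 0,
}).withStructuredOutput(
  z.object({
    questions: z.array(z.string()),
  }),
  { name: "hypothetical_questions" }
);

const altChain = RunnableSequence.from([
  { content: (doc: Document) => doc.pageContent },
  PromptTemplate.fromTemplate(
    `Generate a list of 3 hypothetical questions that the below document could be used to answer:\n\n{content}`
  ),
  structuredModel,
  // Pull the array of questions out of the parsed object.
  (output: { questions: string[] }) => output.questions,
]);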
API reference
- ChatOpenAI from @langchain/openai
- OpenAIEmbeddings from @langchain/openai
- MultiVectorRetriever from langchain/retrievers/multi_vector
- FaissStore from @langchain/community/vectorstores/faiss
- RecursiveCharacterTextSplitter from @langchain/textsplitters
- InMemoryStore from @langchain/core/stores
- TextLoader from langchain/document_loaders/fs/text
- PromptTemplate from @langchain/core/prompts
- RunnableSequence from @langchain/core/runnables
- Document from @langchain/core/documents
- JsonKeyOutputFunctionsParser from @langchain/core/output_parsers/openai_functions
Next steps
You've now learned a few ways to generate multiple embeddings per document.
Next, check out the individual sections for deeper dives on particular retriever techniques, the broader tutorial on RAG, or this section to learn how to create your own custom retriever over any data source.