Build a Query Analysis System
This page will show how to use query analysis in a basic end-to-end example. This will cover creating a simple search engine, showing a failure mode that occurs when passing a raw user question into that search, and then an example of how query analysis can help address that issue. There are many different query analysis techniques, and this end-to-end example will not show all of them.
For the purposes of this example, we will do retrieval over the LangChain YouTube videos.
Setup
Install dependencies
- npm
- yarn
- pnpm
npm i langchain @langchain/community @langchain/openai @langchain/core youtubei.js chromadb youtube-transcript
yarn add langchain @langchain/community @langchain/openai @langchain/core youtubei.js chromadb youtube-transcript
pnpm add langchain @langchain/community @langchain/openai @langchain/core youtubei.js chromadb youtube-transcript
Set environment variables
We'll use OpenAI in this example:
OPENAI_API_KEY=your-api-key
# Optional, use LangSmith for best-in-class observability
LANGSMITH_API_KEY=your-api-key
LANGCHAIN_TRACING_V2=true
# Reduce tracing latency if you are not in a serverless environment
# LANGCHAIN_CALLBACKS_BACKGROUND=true
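If you're running the example locally, one optional way to supply these variables is the dotenv package. This is a minimal sketch of that setup; nothing in the rest of the example depends on it:
// Loads variables from a local .env file into process.env.
// Assumes dotenv has been installed (npm i dotenv).
import "dotenv/config";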
Load documents
We can use the YoutubeLoader to load transcripts of a few LangChain videos:
import { DocumentInterface } from "@langchain/core/documents";
import { YoutubeLoader } from "@langchain/community/document_loaders/web/youtube";
import { getYear } from "date-fns";
const urls = [
"https://www.youtube.com/watch?v=HAn9vnJy6S4",
"https://www.youtube.com/watch?v=dA1cHGACXCo",
"https://www.youtube.com/watch?v=ZcEMLz27sL4",
"https://www.youtube.com/watch?v=hvAPnpSfSGo",
"https://www.youtube.com/watch?v=EhlPDL4QrWY",
"https://www.youtube.com/watch?v=mmBo8nlu2j0",
"https://www.youtube.com/watch?v=rQdibOsL1ps",
"https://www.youtube.com/watch?v=28lC4fqukoc",
"https://www.youtube.com/watch?v=es-9MgxB-uc",
"https://www.youtube.com/watch?v=wLRHwKuKvOE",
"https://www.youtube.com/watch?v=ObIltMaRJvY",
"https://www.youtube.com/watch?v=DjuXACWYkkU",
"https://www.youtube.com/watch?v=o7C9ld6Ln-M",
];
let docs: Array<DocumentInterface> = [];
for (const url of urls) {
const doc = await YoutubeLoader.createFromUrl(url, {
language: "en",
addVideoInfo: true,
}).load();
docs = docs.concat(doc);
}
console.log(docs.length);
/*
13
*/
// Add some additional metadata: what year the video was published
// The JS API does not provide publish date, so we can use a
// hardcoded array with the dates instead.
const dates = [
new Date("Jan 31, 2024"),
new Date("Jan 26, 2024"),
new Date("Jan 24, 2024"),
new Date("Jan 23, 2024"),
new Date("Jan 16, 2024"),
new Date("Jan 5, 2024"),
new Date("Jan 2, 2024"),
new Date("Dec 20, 2023"),
new Date("Dec 19, 2023"),
new Date("Nov 27, 2023"),
new Date("Nov 22, 2023"),
new Date("Nov 16, 2023"),
new Date("Nov 2, 2023"),
];
docs.forEach((doc, idx) => {
// eslint-disable-next-line no-param-reassign
doc.metadata.publish_year = getYear(dates[idx]);
// eslint-disable-next-line no-param-reassign
doc.metadata.publish_date = dates[idx];
});
// Here are the titles of the videos we've loaded:
console.log(docs.map((doc) => doc.metadata.title));
/*
[
'OpenGPTs',
'Building a web RAG chatbot: using LangChain, Exa (prev. Metaphor), LangSmith, and Hosted Langserve',
'Streaming Events: Introducing a new `stream_events` method',
'LangGraph: Multi-Agent Workflows',
'Build and Deploy a RAG app with Pinecone Serverless',
'Auto-Prompt Builder (with Hosted LangServe)',
'Build a Full Stack RAG App With TypeScript',
'Getting Started with Multi-Modal LLMs',
'SQL Research Assistant',
'Skeleton-of-Thought: Building a New Template from Scratch',
'Benchmarking RAG over LangChain Docs',
'Building a Research Assistant from Scratch',
'LangServe and LangChain Templates Webinar'
]
*/
API Reference
- DocumentInterface from @langchain/core/documents
- YoutubeLoader from @langchain/community/document_loaders/web/youtube
Here's the metadata associated with each video. We can see that each document also has a title, view count, publication date, and length:
import { getDocs } from "./docs.js";
const docs = await getDocs();
console.log(docs[0].metadata);
/**
{
source: 'HAn9vnJy6S4',
description: 'OpenGPTs is an open-source platform aimed at recreating an experience like the GPT Store - but with any model, any tools, and that you can self-host.\n' +
'\n' +
'This video covers both how to use it as well as how to build it.\n' +
'\n' +
'GitHub: https://github.com/langchain-ai/opengpts',
title: 'OpenGPTs',
view_count: 7262,
author: 'LangChain'
}
*/
// And here's a sample from a document's contents:
console.log(docs[0].pageContent.slice(0, 500));
/*
hello today I want to talk about open gpts open gpts is a project that we built here at linkchain uh that replicates the GPT store in a few ways so it creates uh end user-facing friendly interface to create different Bots and these Bots can have access to different tools and they can uh be given files to retrieve things over and basically it's a way to create a variety of bots and expose the configuration of these Bots to end users it's all open source um it can be used with open AI it can be us
*/
Index documents
Whenever we do retrieval we need to create an index of documents that we can query. We'll use a vector store to index our documents, and we'll chunk them first to make our retrievals more concise and precise:
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { OpenAIEmbeddings } from "@langchain/openai";
import { Chroma } from "@langchain/community/vectorstores/chroma";
import { getDocs } from "./docs.js";
const docs = await getDocs();
const textSplitter = new RecursiveCharacterTextSplitter({ chunkSize: 2000 });
const chunkedDocs = await textSplitter.splitDocuments(docs);
const embeddings = new OpenAIEmbeddings({
model: "text-embedding-3-small",
});
const vectorStore = await Chroma.fromDocuments(chunkedDocs, embeddings, {
collectionName: "yt-videos",
});
API Reference
- RecursiveCharacterTextSplitter from @langchain/textsplitters
- OpenAIEmbeddings from @langchain/openai
- Chroma from @langchain/community/vectorstores/chroma
You can then load the index later without having to re-ingest and re-embed the documents:
import "chromadb";
import { OpenAIEmbeddings } from "@langchain/openai";
import { Chroma } from "@langchain/community/vectorstores/chroma";
const embeddings = new OpenAIEmbeddings({
model: "text-embedding-3-small",
});
const vectorStore = await Chroma.fromExistingCollection(embeddings, {
collectionName: "yt-videos",
});
/*
[Module: null prototype] {
  AdminClient: [class AdminClient],
  ChromaClient: [class ChromaClient],
  CloudClient: [class CloudClient extends ChromaClient],
  CohereEmbeddingFunction: [class CohereEmbeddingFunction],
  Collection: [class Collection],
  DefaultEmbeddingFunction: [class _DefaultEmbeddingFunction],
  GoogleGenerativeAiEmbeddingFunction: [class _GoogleGenerativeAiEmbeddingFunction],
  HuggingFaceEmbeddingServerFunction: [class HuggingFaceEmbeddingServerFunction],
  IncludeEnum: {
    Documents: "documents",
    Embeddings: "embeddings",
    Metadatas: "metadatas",
    Distances: "distances"
  },
  JinaEmbeddingFunction: [class JinaEmbeddingFunction],
  OpenAIEmbeddingFunction: [class _OpenAIEmbeddingFunction],
  TransformersEmbeddingFunction: [class _TransformersEmbeddingFunction]
}
*/
Retrieval without query analysis
We can perform similarity search on a user question directly to find chunks relevant to the question:
const searchResults = await vectorStore.similaritySearch(
"how do I build a RAG agent"
);
console.log(searchResults[0].metadata.title);
console.log(searchResults[0].pageContent.slice(0, 500));
OpenGPTs
hardcoded that it will always do a retrieval step here the assistant decides whether to do a retrieval step or not sometimes this is good sometimes this is bad sometimes it you don't need to do a retrieval step when I said hi it didn't need to call it tool um but other times you know the the llm might mess up and not realize that it needs to do a retrieval step and so the rag bot will always do a retrieval step so it's more focused there because this is also a simpler architecture so it's always
This works pretty well! Our first result is somewhat relevant to the question.
What if we wanted to search for results from a specific time period?
const specificSearchResults = await vectorStore.similaritySearch(
"videos on RAG published in 2023"
);
console.log(specificSearchResults[0].metadata.title);
console.log(specificSearchResults[0].metadata.publish_year);
console.log(specificSearchResults[0].pageContent.slice(0, 500));
OpenGPTs
2024
hardcoded that it will always do a retrieval step here the assistant decides whether to do a retrieval step or not sometimes this is good sometimes this is bad sometimes it you don't need to do a retrieval step when I said hi it didn't need to call it tool um but other times you know the the llm might mess up and not realize that it needs to do a retrieval step and so the rag bot will always do a retrieval step so it's more focused there because this is also a simpler architecture so it's always
Our first result is from 2024, and not very relevant to the input. Since we're just searching against document contents, there's no way for the results to be filtered on any document attributes.
This is just one failure mode that can arise. Let's now take a look at how a basic form of query analysis can fix it!
Query analysis
To handle these failure modes we'll do some query structuring. This will involve defining a **query schema** that contains some date filters, and using a function-calling model to convert a user question into a structured query.
Query schema
In this case we'll include an explicit publication-year attribute in the schema so that results can be filtered on it.
import { z } from "zod";
const searchSchema = z
.object({
query: z
.string()
.describe("Similarity search query applied to video transcripts."),
publish_year: z.number().optional().describe("Year of video publication."),
})
.describe(
"Search over a database of tutorial videos about a software library."
);
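As a quick illustration of what this schema accepts (plain zod usage, nothing LangChain-specific; the sample values here are made up):
// Valid: both fields match the schema.
searchSchema.parse({ query: "RAG", publish_year: 2023 });
// Also valid: publish_year is optional.
searchSchema.parse({ query: "multi-agent workflows" });
// Throws a ZodError: publish_year must be a number, not a string.
// searchSchema.parse({ query: "RAG", publish_year: "2023" });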
Query generation
To convert user questions to structured queries we'll make use of OpenAI's function-calling API. Specifically, we'll use the ChatModel.withStructuredOutput() method to handle passing the schema to the model and parsing the output.
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { ChatOpenAI } from "@langchain/openai";
import {
RunnablePassthrough,
RunnableSequence,
} from "@langchain/core/runnables";
const system = `You are an expert at converting user questions into database queries.
You have access to a database of tutorial videos about a software library for building LLM-powered applications.
Given a question, return a list of database queries optimized to retrieve the most relevant results.
If there are acronyms or words you are not familiar with, do not try to rephrase them.`;
const prompt = ChatPromptTemplate.fromMessages([
["system", system],
["human", "{question}"],
]);
const llm = new ChatOpenAI({
model: "gpt-3.5-turbo-0125",
temperature: 0,
});
const structuredLLM = llm.withStructuredOutput(searchSchema, {
name: "search",
});
const queryAnalyzer = RunnableSequence.from([
{
question: new RunnablePassthrough(),
},
prompt,
structuredLLM,
]);
Let's see what queries our analyzer generates for the questions we searched earlier:
console.log(await queryAnalyzer.invoke("How do I build a rag agent"));
{ query: "build a rag agent" }
console.log(await queryAnalyzer.invoke("videos on RAG published in 2023"));
{ query: "RAG", publish_year: 2023 }
Retrieval with query analysis
Our query analysis looks pretty good; now let's try actually performing retrieval with our generated queries.
**Note:** In our example, we forced the model to call the single "search" function by passing name: "search" to withStructuredOutput. This means we will always have exactly one optimized query to look up. Note that this is not always the case - see other guides for how to deal with situations where no (or multiple) optimized queries are returned.
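If you do need to support zero or multiple queries, one possible approach is to wrap the schema in an array. This is a rough sketch assuming the same llm, prompt, and searchSchema from above; the multiSearchSchema and multi_search names are illustrative:
import { z } from "zod";
// Hypothetical wrapper schema: the model may return any number of searches.
const multiSearchSchema = z
  .object({
    searches: z
      .array(searchSchema)
      .describe("One search per distinct sub-question, or none."),
  })
  .describe("Search requests derived from the user question.");
const multiQueryAnalyzer = prompt.pipe(
  llm.withStructuredOutput(multiSearchSchema, { name: "multi_search" })
);
// Usage: await multiQueryAnalyzer.invoke({ question: "..." });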
import { DocumentInterface } from "@langchain/core/documents";
const retrieval = async (input: {
query: string;
publish_year?: number;
}): Promise<DocumentInterface[]> => {
let _filter: Record<string, any> = {};
if (input.publish_year) {
// This syntax is specific to Chroma
// the vector database we are using.
_filter = {
publish_year: {
$eq: input.publish_year,
},
};
}
return vectorStore.similaritySearch(input.query, undefined, _filter);
};
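Before wiring this function into a chain, you can sanity-check it directly (a hypothetical invocation, assuming the populated vector store from above):
// Returns only chunks from videos published in 2023.
const docs2023 = await retrieval({ query: "RAG tutorial", publish_year: 2023 });
console.log(docs2023.map((doc) => doc.metadata.publish_year));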
import { RunnableLambda } from "@langchain/core/runnables";
const retrievalChain = queryAnalyzer.pipe(
new RunnableLambda({
func: async (input) =>
retrieval(input as unknown as { query: string; publish_year?: number }),
})
);
We can now run this chain on the problematic input from before, and see that it yields only results from that year!
const results = await retrievalChain.invoke("RAG tutorial published in 2023");
console.log(
results.map((doc) => ({
title: doc.metadata.title,
year: doc.metadata.publish_date,
}))
);
[
{
title: "Getting Started with Multi-Modal LLMs",
year: "2023-12-20T08:00:00.000Z"
},
{
title: "LangServe and LangChain Templates Webinar",
year: "2023-11-02T07:00:00.000Z"
},
{
title: "Getting Started with Multi-Modal LLMs",
year: "2023-12-20T08:00:00.000Z"
},
{
title: "Building a Research Assistant from Scratch",
year: "2023-11-16T08:00:00.000Z"
}
]