Build a Query Analysis System
This page shows how to use query analysis in a basic end-to-end example. It covers creating a simple search engine, demonstrating a failure mode that arises when raw user questions are passed straight to that search engine, and then showing how query analysis can help address the issue. There are many different query analysis techniques, and this end-to-end example will not show all of them.
In this example, we will perform retrieval over LangChain YouTube videos.
Setup
Install dependencies
- npm: npm i langchain @langchain/community @langchain/openai youtubei.js chromadb youtube-transcript
- yarn: yarn add langchain @langchain/community @langchain/openai youtubei.js chromadb youtube-transcript
- pnpm: pnpm add langchain @langchain/community @langchain/openai youtubei.js chromadb youtube-transcript
Set environment variables
We'll use OpenAI in this example:
OPENAI_API_KEY=your-api-key
# Optional, use LangSmith for best-in-class observability
LANGSMITH_API_KEY=your-api-key
LANGCHAIN_TRACING_V2=true
# Reduce tracing latency if you are not in a serverless environment
# LANGCHAIN_CALLBACKS_BACKGROUND=true
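If you keep these variables in a local .env file, one simple way to load them before running the snippets below (an assumption on our part; LangChain only requires that process.env is populated) is the dotenv package:

// Hypothetical setup: load environment variables from a local .env file.
// Requires installing dotenv (npm i dotenv); any other mechanism that
// populates process.env works just as well.
import "dotenv/config";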
Load documents
We can use the YoutubeLoader to load transcripts of a few LangChain videos:
import { DocumentInterface } from "@langchain/core/documents";
import { YoutubeLoader } from "@langchain/community/document_loaders/web/youtube";
import { getYear } from "date-fns";
const urls = [
"https://www.youtube.com/watch?v=HAn9vnJy6S4",
"https://www.youtube.com/watch?v=dA1cHGACXCo",
"https://www.youtube.com/watch?v=ZcEMLz27sL4",
"https://www.youtube.com/watch?v=hvAPnpSfSGo",
"https://www.youtube.com/watch?v=EhlPDL4QrWY",
"https://www.youtube.com/watch?v=mmBo8nlu2j0",
"https://www.youtube.com/watch?v=rQdibOsL1ps",
"https://www.youtube.com/watch?v=28lC4fqukoc",
"https://www.youtube.com/watch?v=es-9MgxB-uc",
"https://www.youtube.com/watch?v=wLRHwKuKvOE",
"https://www.youtube.com/watch?v=ObIltMaRJvY",
"https://www.youtube.com/watch?v=DjuXACWYkkU",
"https://www.youtube.com/watch?v=o7C9ld6Ln-M",
];
let docs: Array<DocumentInterface> = [];
for (const url of urls) {
const doc = await YoutubeLoader.createFromUrl(url, {
language: "en",
addVideoInfo: true,
}).load();
docs = docs.concat(doc);
}
console.log(docs.length);
/*
13
*/
// Add some additional metadata: what year the video was published
// The JS API does not provide publish date, so we can use a
// hardcoded array with the dates instead.
const dates = [
new Date("Jan 31, 2024"),
new Date("Jan 26, 2024"),
new Date("Jan 24, 2024"),
new Date("Jan 23, 2024"),
new Date("Jan 16, 2024"),
new Date("Jan 5, 2024"),
new Date("Jan 2, 2024"),
new Date("Dec 20, 2023"),
new Date("Dec 19, 2023"),
new Date("Nov 27, 2023"),
new Date("Nov 22, 2023"),
new Date("Nov 16, 2023"),
new Date("Nov 2, 2023"),
];
docs.forEach((doc, idx) => {
// eslint-disable-next-line no-param-reassign
doc.metadata.publish_year = getYear(dates[idx]);
// eslint-disable-next-line no-param-reassign
doc.metadata.publish_date = dates[idx];
});
// Here are the titles of the videos we've loaded:
console.log(docs.map((doc) => doc.metadata.title));
/*
[
'OpenGPTs',
'Building a web RAG chatbot: using LangChain, Exa (prev. Metaphor), LangSmith, and Hosted Langserve',
'Streaming Events: Introducing a new `stream_events` method',
'LangGraph: Multi-Agent Workflows',
'Build and Deploy a RAG app with Pinecone Serverless',
'Auto-Prompt Builder (with Hosted LangServe)',
'Build a Full Stack RAG App With TypeScript',
'Getting Started with Multi-Modal LLMs',
'SQL Research Assistant',
'Skeleton-of-Thought: Building a New Template from Scratch',
'Benchmarking RAG over LangChain Docs',
'Building a Research Assistant from Scratch',
'LangServe and LangChain Templates Webinar'
]
*/
API Reference:
- DocumentInterface from @langchain/core/documents
- YoutubeLoader from @langchain/community/document_loaders/web/youtube
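Later snippets on this page import a getDocs() helper from a local ./docs.js module. That file isn't shown here; a minimal sketch, assuming it simply wraps the loading and metadata code above:

// docs.ts — hypothetical helper module wrapping the loading code above.
// `urls` and `dates` are the arrays defined earlier on this page.
import { DocumentInterface } from "@langchain/core/documents";
import { YoutubeLoader } from "@langchain/community/document_loaders/web/youtube";
import { getYear } from "date-fns";

export async function getDocs(): Promise<DocumentInterface[]> {
  let docs: Array<DocumentInterface> = [];
  for (const url of urls) {
    const doc = await YoutubeLoader.createFromUrl(url, {
      language: "en",
      addVideoInfo: true,
    }).load();
    docs = docs.concat(doc);
  }
  docs.forEach((doc, idx) => {
    doc.metadata.publish_year = getYear(dates[idx]);
    doc.metadata.publish_date = dates[idx];
  });
  return docs;
}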
Here's the metadata associated with each video. We can see that each document also has a title, view count, and author:
import { getDocs } from "./docs.js";
const docs = await getDocs();
console.log(docs[0].metadata);
/**
{
source: 'HAn9vnJy6S4',
description: 'OpenGPTs is an open-source platform aimed at recreating an experience like the GPT Store - but with any model, any tools, and that you can self-host.\n' +
'\n' +
'This video covers both how to use it as well as how to build it.\n' +
'\n' +
'GitHub: https://github.com/langchain-ai/opengpts',
title: 'OpenGPTs',
view_count: 7262,
author: 'LangChain'
}
*/
// And here's a sample from a document's contents:
console.log(docs[0].pageContent.slice(0, 500));
/*
hello today I want to talk about open gpts open gpts is a project that we built here at linkchain uh that replicates the GPT store in a few ways so it creates uh end user-facing friendly interface to create different Bots and these Bots can have access to different tools and they can uh be given files to retrieve things over and basically it's a way to create a variety of bots and expose the configuration of these Bots to end users it's all open source um it can be used with open AI it can be us
*/
Indexing documents
Whenever we perform retrieval, we need to create an index of documents that we can query. We'll use a vector store to index our documents, and we'll chunk them first to make our retrievals more concise and precise:
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { OpenAIEmbeddings } from "@langchain/openai";
import { Chroma } from "@langchain/community/vectorstores/chroma";
import { getDocs } from "./docs.js";
const docs = await getDocs();
const textSplitter = new RecursiveCharacterTextSplitter({ chunkSize: 2000 });
const chunkedDocs = await textSplitter.splitDocuments(docs);
const embeddings = new OpenAIEmbeddings({
model: "text-embedding-3-small",
});
const vectorStore = await Chroma.fromDocuments(chunkedDocs, embeddings, {
collectionName: "yt-videos",
});
API Reference:
- RecursiveCharacterTextSplitter from @langchain/textsplitters
- OpenAIEmbeddings from @langchain/openai
- Chroma from @langchain/community/vectorstores/chroma
Then later, you can fetch the index without having to re-load and re-embed the documents:
import "chromadb";
import { OpenAIEmbeddings } from "@langchain/openai";
import { Chroma } from "@langchain/community/vectorstores/chroma";
const embeddings = new OpenAIEmbeddings({
model: "text-embedding-3-small",
});
const vectorStore = await Chroma.fromExistingCollection(embeddings, {
collectionName: "yt-videos",
});
Retrieval without query analysis
We can perform a similarity search on a user question directly to find chunks relevant to the question:
const searchResults = await vectorStore.similaritySearch(
"how do I build a RAG agent"
);
console.log(searchResults[0].metadata.title);
console.log(searchResults[0].pageContent.slice(0, 500));
OpenGPTs
hardcoded that it will always do a retrieval step here the assistant decides whether to do a retrieval step or not sometimes this is good sometimes this is bad sometimes it you don't need to do a retrieval step when I said hi it didn't need to call it tool um but other times you know the the llm might mess up and not realize that it needs to do a retrieval step and so the rag bot will always do a retrieval step so it's more focused there because this is also a simpler architecture so it's always
This works pretty well! Our first result is somewhat relevant to the question.
What if we wanted to search for results from a specific time period?
const searchResults = await vectorStore.similaritySearch(
"videos on RAG published in 2023"
);
console.log(searchResults[0].metadata.title);
console.log(searchResults[0].metadata.publish_year);
console.log(searchResults[0].pageContent.slice(0, 500));
OpenGPTs
2024
hardcoded that it will always do a retrieval step here the assistant decides whether to do a retrieval step or not sometimes this is good sometimes this is bad sometimes it you don't need to do a retrieval step when I said hi it didn't need to call it tool um but other times you know the the llm might mess up and not realize that it needs to do a retrieval step and so the rag bot will always do a retrieval step so it's more focused there because this is also a simpler architecture so it's always
Our first result is from 2024 and not very relevant to the input. Since we're just searching against document contents, there's no way to filter results on any document attributes.
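For reference, the vector store itself can filter on metadata when a filter is passed in explicitly; what's missing is anything that translates the user's question into such a filter. A minimal sketch using Chroma's filter syntax, with the year hardcoded by hand (this is exactly the step query analysis will automate):

// Hand-written metadata filter: works, but only because we hardcoded the year.
const filteredResults = await vectorStore.similaritySearch(
  "videos on RAG",
  undefined,
  { publish_year: { $eq: 2023 } } // Chroma-specific filter syntax
);
// Every result is now guaranteed to be from 2023 by the filter.
console.log(filteredResults[0].metadata.publish_year);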
This is just one failure mode that can arise. Let's now look at how a basic form of query analysis can fix it!
Query analysis
To handle these failure modes we'll do some query structuring. This involves defining a **query schema** that contains a date filter, and using a function-calling model to convert a user question into a structured query.
Query schema
In this case, we'll add an explicit attribute for the publication year so that results can be filtered on it.
import { z } from "zod";
const searchSchema = z
.object({
query: z
.string()
.describe("Similarity search query applied to video transcripts."),
publish_year: z.number().optional().describe("Year of video publication."),
})
.describe(
"Search over a database of tutorial videos about a software library."
);
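As a quick sanity check (not part of the original walkthrough), you can exercise the schema directly with zod before wiring it to a model:

// The year is optional, so a bare query parses fine...
console.log(searchSchema.parse({ query: "RAG" }));
// { query: "RAG" }

// ...while a wrongly-typed year is rejected.
console.log(searchSchema.safeParse({ query: "RAG", publish_year: "2023" }).success);
// false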
Query generation
To convert user questions to structured queries, we'll make use of OpenAI's function-calling API. Specifically, we'll use the ChatModel.withStructuredOutput() method to handle passing the schema to the model and parsing the output.
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { ChatOpenAI } from "@langchain/openai";
import {
RunnablePassthrough,
RunnableSequence,
} from "@langchain/core/runnables";
const system = `You are an expert at converting user questions into database queries.
You have access to a database of tutorial videos about a software library for building LLM-powered applications.
Given a question, return a list of database queries optimized to retrieve the most relevant results.
If there are acronyms or words you are not familiar with, do not try to rephrase them.`;
const prompt = ChatPromptTemplate.fromMessages([
["system", system],
["human", "{question}"],
]);
const llm = new ChatOpenAI({
model: "gpt-3.5-turbo-0125",
temperature: 0,
});
const structuredLLM = llm.withStructuredOutput(searchSchema, {
name: "search",
});
const queryAnalyzer = RunnableSequence.from([
{
question: new RunnablePassthrough(),
},
prompt,
structuredLLM,
]);
Let's see what queries our analyzer generates for the questions we searched earlier:
console.log(await queryAnalyzer.invoke("How do I build a rag agent"));
{ query: "build a rag agent" }
console.log(await queryAnalyzer.invoke("videos on RAG published in 2023"));
{ query: "RAG", publish_year: 2023 }
Retrieval with query analysis
Our query analysis looks pretty good; now let's try using our generated queries to actually perform retrieval.
**Note:** In our example, we passed name: "search" to withStructuredOutput(). This forces the LLM to call one, and only one, function, meaning that we will always have exactly one optimized query to look up. Note that this is not always the case; see other guides for how to handle situations where no, or multiple, optimized queries are returned.
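As a hedged sketch of one such variant (not what this page builds): let the model return a list of queries and fan the retrieval out over them:

// Hypothetical variant: a schema that allows several queries per question.
const multiSearchSchema = z
  .object({
    queries: z
      .array(searchSchema)
      .describe("One or more search queries to run."),
  })
  .describe(
    "Search over a database of tutorial videos about a software library."
  );

const multiQueryAnalyzer = llm.withStructuredOutput(multiSearchSchema, {
  name: "multi_search",
});
// Downstream you would call `retrieval` once per entry in `queries`
// and merge (e.g. deduplicate) the combined results.

Back to the single-query setup: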
import { DocumentInterface } from "@langchain/core/documents";
const retrieval = async (input: {
query: string;
publish_year?: number;
}): Promise<DocumentInterface[]> => {
let _filter: Record<string, any> = {};
if (input.publish_year) {
// This syntax is specific to Chroma
// the vector database we are using.
_filter = {
publish_year: {
$eq: input.publish_year,
},
};
}
return vectorStore.similaritySearch(input.query, undefined, _filter);
};
import { RunnableLambda } from "@langchain/core/runnables";
const retrievalChain = queryAnalyzer.pipe(
new RunnableLambda({
func: async (input) =>
retrieval(input as unknown as { query: string; publish_year?: number }),
})
);
We can now run this chain on the problematic input from before and see that it returns only results from that year!
const results = await retrievalChain.invoke("RAG tutorial published in 2023");
console.log(
results.map((doc) => ({
title: doc.metadata.title,
year: doc.metadata.publish_date,
}))
);
[
{
title: "Getting Started with Multi-Modal LLMs",
year: "2023-12-20T08:00:00.000Z"
},
{
title: "LangServe and LangChain Templates Webinar",
year: "2023-11-02T07:00:00.000Z"
},
{
title: "Getting Started with Multi-Modal LLMs",
year: "2023-12-20T08:00:00.000Z"
},
{
title: "Building a Research Assistant from Scratch",
year: "2023-11-16T08:00:00.000Z"
}
]
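From here, a natural next step (a sketch under our own assumptions, not covered by this page) is to pipe the filtered results into an answer-generation prompt, completing the RAG loop:

import { ChatPromptTemplate } from "@langchain/core/prompts";
import { StringOutputParser } from "@langchain/core/output_parsers";

// Hypothetical final step: stuff the retrieved chunks into an answering prompt.
const answerPrompt = ChatPromptTemplate.fromMessages([
  [
    "system",
    "Answer the question using only the provided context:\n\n{context}",
  ],
  ["human", "{question}"],
]);

const question = "RAG tutorial published in 2023";
const retrievedDocs = await retrievalChain.invoke(question);
const answer = await answerPrompt
  .pipe(llm)
  .pipe(new StringOutputParser())
  .invoke({
    context: retrievedDocs.map((doc) => doc.pageContent).join("\n\n"),
    question,
  });
console.log(answer);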