
How to deal with high-cardinality categoricals

Prerequisites

This guide assumes familiarity with the following:

High-cardinality data refers to columns that contain a large number of unique values. This guide demonstrates some techniques for dealing with these inputs.

For example, you may want to do query analysis to create a filter on a categorical column. One of the difficulties here is that you usually need to specify the EXACT categorical value, which means you need to make sure the LLM generates that value exactly. This can be done fairly easily with prompting when there are only a few valid values. When there are many valid values it becomes more difficult: the values may not fit in the LLM's context window, or (if they do) there may be too many of them for the LLM to properly attend to.

In this notebook we take a look at how to approach this.
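To make the failure mode concrete, here is a small self-contained sketch (the author list and model output below are hypothetical, not produced by the pipeline in this guide) of how an exact-match filter silently fails on a misspelled categorical value:

```typescript
// Hypothetical author list and model output, for illustration only.
const authors = ["Jess Knight", "Jesse Knight", "Homer Harber"];

// Suppose the LLM produced a filter with a slightly misspelled author:
const modelOutput = { query: "aliens", author: "jess knight" };

// An exact-match filter finds nothing, even though "Jess Knight" exists:
const matches = authors.filter((a) => a === modelOutput.author);
console.log(matches.length); // 0
```

The techniques below address this by steering the model toward emitting one of the exact valid values.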

Setup

Install dependencies

yarn add @langchain/community zod @faker-js/faker

Set environment variables

# Optional, use LangSmith for best-in-class observability
LANGSMITH_API_KEY=your-api-key
LANGCHAIN_TRACING_V2=true

# Reduce tracing latency if you are not in a serverless environment
# LANGCHAIN_CALLBACKS_BACKGROUND=true

Set up data

We will generate a bunch of fake names.

import { faker } from "@faker-js/faker";

const names = Array.from({ length: 10000 }, () => faker.person.fullName());

Let's look at some of the names:

names[0];
"Rolando Wilkinson"
names[567];
"Homer Harber"

Query analysis

We can now set up a baseline query analysis:

import { z } from "zod";

const searchSchema = z.object({
  query: z.string(),
  author: z.string(),
});

Pick your chat model:

Install dependencies

yarn add @langchain/openai

Add environment variables

OPENAI_API_KEY=your-api-key

Instantiate the model

import { ChatOpenAI } from "@langchain/openai";

const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  temperature: 0,
});
import { ChatPromptTemplate } from "@langchain/core/prompts";
import {
  RunnablePassthrough,
  RunnableSequence,
} from "@langchain/core/runnables";

const system = `Generate a relevant search query for a library system`;
const prompt = ChatPromptTemplate.fromMessages([
  ["system", system],
  ["human", "{question}"],
]);
const llmWithTools = llm.withStructuredOutput(searchSchema, {
  name: "Search",
});
const queryAnalyzer = RunnableSequence.from([
  {
    question: new RunnablePassthrough(),
  },
  prompt,
  llmWithTools,
]);

We can see that if we spell the name exactly correctly, the analyzer knows how to handle it:

await queryAnalyzer.invoke("what are books about aliens by Jesse Knight");
{ query: "aliens", author: "Jesse Knight" }

The issue is that the values you want to filter on may NOT be spelled exactly correctly:

await queryAnalyzer.invoke("what are books about aliens by jess knight");
{ query: "books about aliens", author: "jess knight" }

Add in all values

One way around this is to add ALL the possible values to the prompt. This will generally guide the query in the right direction:

const system = `Generate a relevant search query for a library system using the 'search' tool.

The 'author' you return to the user MUST be one of the following authors:

{authors}

Do NOT hallucinate author name!`;
const basePrompt = ChatPromptTemplate.fromMessages([
  ["system", system],
  ["human", "{question}"],
]);
const prompt = await basePrompt.partial({ authors: names.join(", ") });

const queryAnalyzerAll = RunnableSequence.from([
  {
    question: new RunnablePassthrough(),
  },
  prompt,
  llmWithTools,
]);

However... if the list of categorical values is long enough, it may error!

try {
  const res = await queryAnalyzerAll.invoke(
    "what are books about aliens by jess knight"
  );
} catch (e) {
  console.error(e);
}
Error: 400 This model's maximum context length is 16385 tokens. However, your messages resulted in 50197 tokens (50167 in the messages, 30 in the functions). Please reduce the length of the messages or functions.
    at Function.generate (file:///Users/jacoblee/Library/Caches/deno/npm/registry.npmjs.org/openai/4.47.1/error.mjs:41:20)
    at OpenAI.makeStatusError (file:///Users/jacoblee/Library/Caches/deno/npm/registry.npmjs.org/openai/4.47.1/core.mjs:256:25)
    at OpenAI.makeRequest (file:///Users/jacoblee/Library/Caches/deno/npm/registry.npmjs.org/openai/4.47.1/core.mjs:299:30)
    at eventLoopTick (ext:core/01_core.js:63:7)
    at async file:///Users/jacoblee/Library/Caches/deno/npm/registry.npmjs.org/@langchain/openai/0.0.31/dist/chat_models.js:756:29
    at async RetryOperation._fn (file:///Users/jacoblee/Library/Caches/deno/npm/registry.npmjs.org/p-retry/4.6.2/index.js:50:12) {
  status: 400,
  headers: {
    "alt-svc": 'h3=":443"; ma=86400',
    "cf-cache-status": "DYNAMIC",
    "cf-ray": "885f794b3df4fa52-SJC",
    "content-length": "340",
    "content-type": "application/json",
    date: "Sat, 18 May 2024 23:02:16 GMT",
    "openai-organization": "langchain",
    "openai-processing-ms": "230",
    "openai-version": "2020-10-01",
    server: "cloudflare",
    "set-cookie": "_cfuvid=F_c9lnRuQDUhKiUE2eR2PlsxHPldf1OAVMonLlHTjzM-1716073336256-0.0.1.1-604800000; path=/; domain="... 48 more characters,
    "strict-transport-security": "max-age=15724800; includeSubDomains",
    "x-ratelimit-limit-requests": "10000",
    "x-ratelimit-limit-tokens": "2000000",
    "x-ratelimit-remaining-requests": "9999",
    "x-ratelimit-remaining-tokens": "1958402",
    "x-ratelimit-reset-requests": "6ms",
    "x-ratelimit-reset-tokens": "1.247s",
    "x-request-id": "req_7b88677d6883fac1520e44543f68c839"
  },
  request_id: "req_7b88677d6883fac1520e44543f68c839",
  error: {
    message: "This model's maximum context length is 16385 tokens. However, your messages resulted in 50197 tokens"... 101 more characters,
    type: "invalid_request_error",
    param: "messages",
    code: "context_length_exceeded"
  },
  code: "context_length_exceeded",
  param: "messages",
  type: "invalid_request_error",
  attemptNumber: 1,
  retriesLeft: 6
}

We can try to use a longer context window... but with that much information in the prompt, the model is not guaranteed to pick out the right value reliably:

import { ChatOpenAI } from "@langchain/openai";

const llmLong = new ChatOpenAI({ model: "gpt-4-turbo-preview" });
const structuredLlmLong = llmLong.withStructuredOutput(searchSchema, {
  name: "Search",
});
const queryAnalyzerAll = RunnableSequence.from([
  {
    question: new RunnablePassthrough(),
  },
  prompt,
  structuredLlmLong,
]);
await queryAnalyzerAll.invoke("what are books about aliens by jess knight");
{ query: "aliens", author: "jess knight" }

Find all relevant values

Instead, what we can do is create a vector store index over the relevant values and then query that index for the N most relevant values:

import { OpenAIEmbeddings } from "@langchain/openai";
import { MemoryVectorStore } from "langchain/vectorstores/memory";

const embeddings = new OpenAIEmbeddings({
  model: "text-embedding-3-small",
});
const vectorstore = await MemoryVectorStore.fromTexts(names, {}, embeddings);

const selectNames = async (question: string) => {
  const _docs = await vectorstore.similaritySearch(question, 10);
  const _names = _docs.map((d) => d.pageContent);
  return _names.join(", ");
};

const createPrompt = RunnableSequence.from([
  {
    question: new RunnablePassthrough(),
    authors: selectNames,
  },
  basePrompt,
]);

await createPrompt.invoke("what are books by jess knight");
ChatPromptValue {
  lc_serializable: true,
  lc_kwargs: {
    messages: [
      SystemMessage {
        lc_serializable: true,
        lc_kwargs: {
          content: "Generate a relevant search query for a library system using the 'search' tool.\n" +
            "\n" +
            "The 'author' you ret"... 243 more characters,
          additional_kwargs: {},
          response_metadata: {}
        },
        lc_namespace: [ "langchain_core", "messages" ],
        content: "Generate a relevant search query for a library system using the 'search' tool.\n" +
          "\n" +
          "The 'author' you ret"... 243 more characters,
        name: undefined,
        additional_kwargs: {},
        response_metadata: {}
      },
      HumanMessage {
        lc_serializable: true,
        lc_kwargs: {
          content: "what are books by jess knight",
          additional_kwargs: {},
          response_metadata: {}
        },
        lc_namespace: [ "langchain_core", "messages" ],
        content: "what are books by jess knight",
        name: undefined,
        additional_kwargs: {},
        response_metadata: {}
      }
    ]
  },
  lc_namespace: [ "langchain_core", "prompt_values" ],
  messages: [
    SystemMessage {
      lc_serializable: true,
      lc_kwargs: {
        content: "Generate a relevant search query for a library system using the 'search' tool.\n" +
          "\n" +
          "The 'author' you ret"... 243 more characters,
        additional_kwargs: {},
        response_metadata: {}
      },
      lc_namespace: [ "langchain_core", "messages" ],
      content: "Generate a relevant search query for a library system using the 'search' tool.\n" +
        "\n" +
        "The 'author' you ret"... 243 more characters,
      name: undefined,
      additional_kwargs: {},
      response_metadata: {}
    },
    HumanMessage {
      lc_serializable: true,
      lc_kwargs: {
        content: "what are books by jess knight",
        additional_kwargs: {},
        response_metadata: {}
      },
      lc_namespace: [ "langchain_core", "messages" ],
      content: "what are books by jess knight",
      name: undefined,
      additional_kwargs: {},
      response_metadata: {}
    }
  ]
}
const queryAnalyzerSelect = createPrompt.pipe(llmWithTools);

await queryAnalyzerSelect.invoke("what are books about aliens by jess knight");
{ query: "aliens", author: "Jess Knight" }
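The selection step itself does not depend on any particular retriever: the core idea is simply "keep only the top-N values most similar to the query." As a self-contained illustration of that idea (using character-trigram overlap as a stand-in for the embedding similarity the guide uses above; the candidate names are hypothetical):

```typescript
// Toy stand-in for the vector store: rank candidate values by character
// trigram overlap with the query, and keep only the top N.
const trigrams = (s: string): Set<string> => {
  const t = new Set<string>();
  const padded = ` ${s.toLowerCase()} `;
  for (let i = 0; i + 3 <= padded.length; i++) t.add(padded.slice(i, i + 3));
  return t;
};

// Jaccard-style overlap between two trigram sets, in [0, 1].
const overlap = (a: Set<string>, b: Set<string>): number => {
  let shared = 0;
  for (const x of a) if (b.has(x)) shared++;
  return shared / Math.max(a.size, b.size);
};

const topN = (values: string[], query: string, n: number): string[] => {
  const q = trigrams(query);
  return values
    .map((v) => [v, overlap(trigrams(v), q)] as const)
    .sort((x, y) => y[1] - x[1])
    .slice(0, n)
    .map(([v]) => v);
};

const candidates = ["Jess Knight", "Homer Harber", "Rolando Wilkinson"];
console.log(topN(candidates, "jess knight", 1)); // [ "Jess Knight" ]
```

Only the handful of surviving candidates are placed in the prompt, which is why the approach scales to columns with thousands of unique values.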

Next steps

You've now learned how to deal with high-cardinality data when constructing queries.

Next, check out some of the other query analysis guides in this section, like how to use few-shot examples to improve performance.

