How to deal with high-cardinality categoricals

Prerequisites

This guide assumes familiarity with the following:

- Query analysis

High-cardinality data refers to columns in a dataset that contain a large number of unique values. This guide demonstrates some techniques for dealing with these inputs.

For example, you may want to do query analysis to create a filter on a categorical column. One of the difficulties here is that you usually need to specify the EXACT categorical value. The issue is you need to make sure the LLM generates that categorical value exactly. This can be done relatively easily with prompting when there are only a few valid values. When there are a high number of valid values it becomes more difficult, as those values may not fit in the LLM context, or (if they do) there may be too many for the LLM to properly attend to.

In this notebook we take a look at how to approach this.

Setup

Install dependencies

yarn add @langchain/community @langchain/core zod @faker-js/faker

Set environment variables

# Optional, use LangSmith for best-in-class observability
LANGSMITH_API_KEY=your-api-key
LANGCHAIN_TRACING_V2=true

# Reduce tracing latency if you are not in a serverless environment
# LANGCHAIN_CALLBACKS_BACKGROUND=true

Set up data

We will generate a bunch of fake names:

import { faker } from "@faker-js/faker";

// Generate 10,000 random full names.
const names = Array.from({ length: 10000 }, () => faker.person.fullName());

Let's look at some of the names:

names[0];
"Rolando Wilkinson"
names[567];
"Homer Harber"

Query analysis

We can now set up a baseline query analysis:

import { z } from "zod";

const searchSchema = z.object({
  query: z.string(),
  author: z.string(),
});

Pick your chat model

Install dependencies

yarn add @langchain/openai 

Add environment variables

OPENAI_API_KEY=your-api-key

Instantiate the model

import { ChatOpenAI } from "@langchain/openai";

const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  temperature: 0,
});
import { ChatPromptTemplate } from "@langchain/core/prompts";
import {
  RunnablePassthrough,
  RunnableSequence,
} from "@langchain/core/runnables";

const system = `Generate a relevant search query for a library system`;
const prompt = ChatPromptTemplate.fromMessages([
  ["system", system],
  ["human", "{question}"],
]);
const llmWithTools = llm.withStructuredOutput(searchSchema, {
  name: "Search",
});
const queryAnalyzer = RunnableSequence.from([
  {
    question: new RunnablePassthrough(),
  },
  prompt,
  llmWithTools,
]);

We can see that if we spell the name exactly correctly, it knows how to handle it:

await queryAnalyzer.invoke("what are books about aliens by Jesse Knight");
{ query: "aliens", author: "Jesse Knight" }

The issue is that the values you want to filter on may NOT be spelled exactly correctly:

await queryAnalyzer.invoke("what are books about aliens by jess knight");
{ query: "books about aliens", author: "jess knight" }

Add in all values

One way to deal with this is to add all the possible values to the prompt. That will generally guide the query in the right direction:

const systemTemplate = `Generate a relevant search query for a library system using the 'search' tool.

The 'author' you return to the user MUST be one of the following authors:

{authors}

Do NOT hallucinate author name!`;
const basePrompt = ChatPromptTemplate.fromMessages([
  ["system", systemTemplate],
  ["human", "{question}"],
]);
const promptWithAuthors = await basePrompt.partial({
  authors: names.join(", "),
});

const queryAnalyzerAll = RunnableSequence.from([
  {
    question: new RunnablePassthrough(),
  },
  promptWithAuthors,
  llmWithTools,
]);

However... if the list of categorical values is long enough, it may error!

try {
  const res = await queryAnalyzerAll.invoke(
    "what are books about aliens by jess knight"
  );
} catch (e) {
  console.error(e);
}
Error: 400 This model's maximum context length is 16385 tokens. However, your messages resulted in 50197 tokens (50167 in the messages, 30 in the functions). Please reduce the length of the messages or functions.
    at Function.generate (file:///Users/jacoblee/Library/Caches/deno/npm/registry.npmjs.org/openai/4.47.1/error.mjs:41:20)
    at OpenAI.makeStatusError (file:///Users/jacoblee/Library/Caches/deno/npm/registry.npmjs.org/openai/4.47.1/core.mjs:256:25)
    at OpenAI.makeRequest (file:///Users/jacoblee/Library/Caches/deno/npm/registry.npmjs.org/openai/4.47.1/core.mjs:299:30)
    at eventLoopTick (ext:core/01_core.js:63:7)
    at async file:///Users/jacoblee/Library/Caches/deno/npm/registry.npmjs.org/@langchain/openai/0.0.31/dist/chat_models.js:756:29
    at async RetryOperation._fn (file:///Users/jacoblee/Library/Caches/deno/npm/registry.npmjs.org/p-retry/4.6.2/index.js:50:12) {
  status: 400,
  headers: {
    "alt-svc": 'h3=":443"; ma=86400',
    "cf-cache-status": "DYNAMIC",
    "cf-ray": "885f794b3df4fa52-SJC",
    "content-length": "340",
    "content-type": "application/json",
    date: "Sat, 18 May 2024 23:02:16 GMT",
    "openai-organization": "langchain",
    "openai-processing-ms": "230",
    "openai-version": "2020-10-01",
    server: "cloudflare",
    "set-cookie": "_cfuvid=F_c9lnRuQDUhKiUE2eR2PlsxHPldf1OAVMonLlHTjzM-1716073336256-0.0.1.1-604800000; path=/; domain="... 48 more characters,
    "strict-transport-security": "max-age=15724800; includeSubDomains",
    "x-ratelimit-limit-requests": "10000",
    "x-ratelimit-limit-tokens": "2000000",
    "x-ratelimit-remaining-requests": "9999",
    "x-ratelimit-remaining-tokens": "1958402",
    "x-ratelimit-reset-requests": "6ms",
    "x-ratelimit-reset-tokens": "1.247s",
    "x-request-id": "req_7b88677d6883fac1520e44543f68c839"
  },
  request_id: "req_7b88677d6883fac1520e44543f68c839",
  error: {
    message: "This model's maximum context length is 16385 tokens. However, your messages resulted in 50197 tokens"... 101 more characters,
    type: "invalid_request_error",
    param: "messages",
    code: "context_length_exceeded"
  },
  code: "context_length_exceeded",
  param: "messages",
  type: "invalid_request_error",
  attemptNumber: 1,
  retriesLeft: 6
}
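
As a rough sanity check on why this overflows, we can estimate how many tokens the author list alone contributes. A sketch using the common ~4 characters per token heuristic (not an exact tokenizer):

// Rough token estimate for the comma-separated author list.
// English text averages roughly 4 characters per token.
const authorList = names.join(", ");
console.log(Math.round(authorList.length / 4)); // tens of thousands — well past a 16k limit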

We can try to use a longer context window... but with so much information in there, there is no guarantee that it will pick the value up reliably:

import { ChatOpenAI } from "@langchain/openai";

const llmLong = new ChatOpenAI({ model: "gpt-4-turbo-preview" });
const structuredLlmLong = llmLong.withStructuredOutput(searchSchema, {
  name: "Search",
});
// Use the prompt that includes all author names, this time with a
// model whose context window is large enough to hold them.
const queryAnalyzerAllLong = RunnableSequence.from([
  {
    question: new RunnablePassthrough(),
  },
  promptWithAuthors,
  structuredLlmLong,
]);
await queryAnalyzerAllLong.invoke("what are books about aliens by jess knight");
{ query: "aliens", author: "jess knight" }

Find all relevant values

Instead, what we can do is create a vector store index over the relevant values and then query that for the N most relevant values:

import { OpenAIEmbeddings } from "@langchain/openai";
import { MemoryVectorStore } from "langchain/vectorstores/memory";

const embeddings = new OpenAIEmbeddings({
  model: "text-embedding-3-small",
});
const vectorstore = await MemoryVectorStore.fromTexts(names, {}, embeddings);

// Retrieve the 10 names most similar to the question and join them
// into a single string for the prompt.
const selectNames = async (question: string) => {
  const _docs = await vectorstore.similaritySearch(question, 10);
  const _names = _docs.map((d) => d.pageContent);
  return _names.join(", ");
};

const createPrompt = RunnableSequence.from([
  {
    question: new RunnablePassthrough(),
    authors: selectNames,
  },
  basePrompt,
]);
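
Before composing the full prompt, it can be useful to sanity-check the retrieval step on its own. A quick sketch (the names returned will differ per run, since the dataset is random):

// Inspect the 10 author names most similar to the query.
console.log(await selectNames("jess knight"));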

await createPrompt.invoke("what are books by jess knight");
ChatPromptValue {
  lc_serializable: true,
  lc_kwargs: {
    messages: [
      SystemMessage {
        lc_serializable: true,
        lc_kwargs: {
          content: "Generate a relevant search query for a library system using the 'search' tool.\n" +
            "\n" +
            "The 'author' you ret"... 243 more characters,
          additional_kwargs: {},
          response_metadata: {}
        },
        lc_namespace: [ "langchain_core", "messages" ],
        content: "Generate a relevant search query for a library system using the 'search' tool.\n" +
          "\n" +
          "The 'author' you ret"... 243 more characters,
        name: undefined,
        additional_kwargs: {},
        response_metadata: {}
      },
      HumanMessage {
        lc_serializable: true,
        lc_kwargs: {
          content: "what are books by jess knight",
          additional_kwargs: {},
          response_metadata: {}
        },
        lc_namespace: [ "langchain_core", "messages" ],
        content: "what are books by jess knight",
        name: undefined,
        additional_kwargs: {},
        response_metadata: {}
      }
    ]
  },
  lc_namespace: [ "langchain_core", "prompt_values" ],
  messages: [
    SystemMessage {
      lc_serializable: true,
      lc_kwargs: {
        content: "Generate a relevant search query for a library system using the 'search' tool.\n" +
          "\n" +
          "The 'author' you ret"... 243 more characters,
        additional_kwargs: {},
        response_metadata: {}
      },
      lc_namespace: [ "langchain_core", "messages" ],
      content: "Generate a relevant search query for a library system using the 'search' tool.\n" +
        "\n" +
        "The 'author' you ret"... 243 more characters,
      name: undefined,
      additional_kwargs: {},
      response_metadata: {}
    },
    HumanMessage {
      lc_serializable: true,
      lc_kwargs: {
        content: "what are books by jess knight",
        additional_kwargs: {},
        response_metadata: {}
      },
      lc_namespace: [ "langchain_core", "messages" ],
      content: "what are books by jess knight",
      name: undefined,
      additional_kwargs: {},
      response_metadata: {}
    }
  ]
}

const queryAnalyzerSelect = createPrompt.pipe(llmWithTools);

await queryAnalyzerSelect.invoke("what are books about aliens by jess knight");
{ query: "aliens", author: "Jess Knight" }

Next steps

You've now learned how to deal with high-cardinality data when constructing queries.

Next, check out some of the other query analysis guides in this section, like how to use few-shot examples to improve performance.

