如何处理高基数分类变量
先决条件
本指南假设您熟悉以下内容
高基数数据是指数据集中包含大量唯一值的列。本指南演示了一些处理这些输入的技术。
例如,您可能希望进行查询分析以在分类列上创建过滤器。这里的一个难点是,您通常需要指定确切的分类值。问题是您需要确保 LLM 能够准确地生成该分类值。当只有几个有效值时,这可以通过提示相对容易地完成。当存在大量有效值时,就会变得更加困难,因为这些值可能不适合 LLM 上下文,或者(如果它们适合)可能太多,以至于 LLM 无法正确地关注它们。
在本笔记本中,我们将探讨如何解决这个问题。
设置
安装依赖项
提示
请参见 此部分了解有关安装集成包的常规说明。
- npm
- yarn
- pnpm
npm i @langchain/community @langchain/core zod @faker-js/faker
yarn add @langchain/community @langchain/core zod @faker-js/faker
pnpm add @langchain/community @langchain/core zod @faker-js/faker
设置环境变量
# Optional, use LangSmith for best-in-class observability
LANGSMITH_API_KEY=your-api-key
LANGCHAIN_TRACING_V2=true
# Reduce tracing latency if you are not in a serverless environment
# LANGCHAIN_CALLBACKS_BACKGROUND=true
设置数据
我们将生成一堆虚假姓名
import { faker } from "@faker-js/faker";
const names = Array.from({ length: 10000 }, () =>
(faker as any).person.fullName()
);
让我们看看一些姓名
names[0];
"Rolando Wilkinson"
names[567];
"Homer Harber"
查询分析
我们现在可以设置一个基线查询分析
import { z } from "zod";
const searchSchema = z.object({
query: z.string(),
author: z.string(),
});
选择您的聊天模型
- OpenAI
- Anthropic
- FireworksAI
- MistralAI
- Groq
- VertexAI
安装依赖项
提示
- npm
- yarn
- pnpm
npm i @langchain/openai
yarn add @langchain/openai
pnpm add @langchain/openai
添加环境变量
OPENAI_API_KEY=your-api-key
实例化模型
import { ChatOpenAI } from "@langchain/openai";
const llm = new ChatOpenAI({
model: "gpt-4o-mini",
temperature: 0
});
安装依赖项
提示
- npm
- yarn
- pnpm
npm i @langchain/anthropic
yarn add @langchain/anthropic
pnpm add @langchain/anthropic
添加环境变量
ANTHROPIC_API_KEY=your-api-key
实例化模型
import { ChatAnthropic } from "@langchain/anthropic";
const llm = new ChatAnthropic({
model: "claude-3-5-sonnet-20240620",
temperature: 0
});
安装依赖项
提示
- npm
- yarn
- pnpm
npm i @langchain/community
yarn add @langchain/community
pnpm add @langchain/community
添加环境变量
FIREWORKS_API_KEY=your-api-key
实例化模型
import { ChatFireworks } from "@langchain/community/chat_models/fireworks";
const llm = new ChatFireworks({
model: "accounts/fireworks/models/llama-v3p1-70b-instruct",
temperature: 0
});
安装依赖项
提示
- npm
- yarn
- pnpm
npm i @langchain/mistralai
yarn add @langchain/mistralai
pnpm add @langchain/mistralai
添加环境变量
MISTRAL_API_KEY=your-api-key
实例化模型
import { ChatMistralAI } from "@langchain/mistralai";
const llm = new ChatMistralAI({
model: "mistral-large-latest",
temperature: 0
});
安装依赖项
提示
- npm
- yarn
- pnpm
npm i @langchain/groq
yarn add @langchain/groq
pnpm add @langchain/groq
添加环境变量
GROQ_API_KEY=your-api-key
实例化模型
import { ChatGroq } from "@langchain/groq";
const llm = new ChatGroq({
model: "mixtral-8x7b-32768",
temperature: 0
});
安装依赖项
提示
- npm
- yarn
- pnpm
npm i @langchain/google-vertexai
yarn add @langchain/google-vertexai
pnpm add @langchain/google-vertexai
添加环境变量
GOOGLE_APPLICATION_CREDENTIALS=credentials.json
实例化模型
import { ChatVertexAI } from "@langchain/google-vertexai";
const llm = new ChatVertexAI({
model: "gemini-1.5-flash",
temperature: 0
});
import { ChatPromptTemplate } from "@langchain/core/prompts";
import {
RunnablePassthrough,
RunnableSequence,
} from "@langchain/core/runnables";
const system = `Generate a relevant search query for a library system`;
const prompt = ChatPromptTemplate.fromMessages([
["system", system],
["human", "{question}"],
]);
const llmWithTools = llm.withStructuredOutput(searchSchema, {
name: "Search",
});
const queryAnalyzer = RunnableSequence.from([
{
question: new RunnablePassthrough(),
},
prompt,
llmWithTools,
]);
我们可以看到,如果我们完全正确地拼写姓名,它就知道如何处理它
await queryAnalyzer.invoke("what are books about aliens by Jesse Knight");
{ query: "aliens", author: "Jesse Knight" }
问题是您要过滤的值可能不会完全正确地拼写
await queryAnalyzer.invoke("what are books about aliens by jess knight");
{ query: "books about aliens", author: "jess knight" }
添加所有值
解决此问题的一种方法是将所有可能的値添加到提示中。这通常会将查询引导到正确的方向
const systemTemplate = `Generate a relevant search query for a library system using the 'search' tool.
The 'author' you return to the user MUST be one of the following authors:
{authors}
Do NOT hallucinate author name!`;
const basePrompt = ChatPromptTemplate.fromMessages([
["system", systemTemplate],
["human", "{question}"],
]);
const promptWithAuthors = await basePrompt.partial({
authors: names.join(", "),
});
const queryAnalyzerAll = RunnableSequence.from([
{
question: new RunnablePassthrough(),
},
promptWithAuthors,
llmWithTools,
]);
但是……如果分类列表足够长,它可能会出错!
try {
const res = await queryAnalyzerAll.invoke(
"what are books about aliens by jess knight"
);
} catch (e) {
console.error(e);
}
Error: 400 This model's maximum context length is 16385 tokens. However, your messages resulted in 50197 tokens (50167 in the messages, 30 in the functions). Please reduce the length of the messages or functions.
at Function.generate (file:///Users/jacoblee/Library/Caches/deno/npm/registry.npmjs.org/openai/4.47.1/error.mjs:41:20)
at OpenAI.makeStatusError (file:///Users/jacoblee/Library/Caches/deno/npm/registry.npmjs.org/openai/4.47.1/core.mjs:256:25)
at OpenAI.makeRequest (file:///Users/jacoblee/Library/Caches/deno/npm/registry.npmjs.org/openai/4.47.1/core.mjs:299:30)
at eventLoopTick (ext:core/01_core.js:63:7)
at async file:///Users/jacoblee/Library/Caches/deno/npm/registry.npmjs.org/@langchain/openai/0.0.31/dist/chat_models.js:756:29
at async RetryOperation._fn (file:///Users/jacoblee/Library/Caches/deno/npm/registry.npmjs.org/p-retry/4.6.2/index.js:50:12) {
status: 400,
headers: {
"alt-svc": 'h3=":443"; ma=86400',
"cf-cache-status": "DYNAMIC",
"cf-ray": "885f794b3df4fa52-SJC",
"content-length": "340",
"content-type": "application/json",
date: "Sat, 18 May 2024 23:02:16 GMT",
"openai-organization": "langchain",
"openai-processing-ms": "230",
"openai-version": "2020-10-01",
server: "cloudflare",
"set-cookie": "_cfuvid=F_c9lnRuQDUhKiUE2eR2PlsxHPldf1OAVMonLlHTjzM-1716073336256-0.0.1.1-604800000; path=/; domain="... 48 more characters,
"strict-transport-security": "max-age=15724800; includeSubDomains",
"x-ratelimit-limit-requests": "10000",
"x-ratelimit-limit-tokens": "2000000",
"x-ratelimit-remaining-requests": "9999",
"x-ratelimit-remaining-tokens": "1958402",
"x-ratelimit-reset-requests": "6ms",
"x-ratelimit-reset-tokens": "1.247s",
"x-request-id": "req_7b88677d6883fac1520e44543f68c839"
},
request_id: "req_7b88677d6883fac1520e44543f68c839",
error: {
message: "This model's maximum context length is 16385 tokens. However, your messages resulted in 50197 tokens"... 101 more characters,
type: "invalid_request_error",
param: "messages",
code: "context_length_exceeded"
},
code: "context_length_exceeded",
param: "messages",
type: "invalid_request_error",
attemptNumber: 1,
retriesLeft: 6
}
我们可以尝试使用更长的上下文窗口……但是由于其中包含太多信息,因此不能保证它能够可靠地提取它
选择您的聊天模型
- OpenAI
- Anthropic
- FireworksAI
- MistralAI
- Groq
- VertexAI
安装依赖项
提示
- npm
- yarn
- pnpm
npm i @langchain/openai
yarn add @langchain/openai
pnpm add @langchain/openai
添加环境变量
OPENAI_API_KEY=your-api-key
实例化模型
import { ChatOpenAI } from "@langchain/openai";
const llmLong = new ChatOpenAI({ model: "gpt-4-turbo-preview" });
安装依赖项
提示
- npm
- yarn
- pnpm
npm i @langchain/anthropic
yarn add @langchain/anthropic
pnpm add @langchain/anthropic
添加环境变量
ANTHROPIC_API_KEY=your-api-key
实例化模型
import { ChatAnthropic } from "@langchain/anthropic";
const llmLong = new ChatAnthropic({
model: "claude-3-5-sonnet-20240620",
temperature: 0
});
安装依赖项
提示
- npm
- yarn
- pnpm
npm i @langchain/community
yarn add @langchain/community
pnpm add @langchain/community
添加环境变量
FIREWORKS_API_KEY=your-api-key
实例化模型
import { ChatFireworks } from "@langchain/community/chat_models/fireworks";
const llmLong = new ChatFireworks({
model: "accounts/fireworks/models/llama-v3p1-70b-instruct",
temperature: 0
});
安装依赖项
提示
- npm
- yarn
- pnpm
npm i @langchain/mistralai
yarn add @langchain/mistralai
pnpm add @langchain/mistralai
添加环境变量
MISTRAL_API_KEY=your-api-key
实例化模型
import { ChatMistralAI } from "@langchain/mistralai";
const llmLong = new ChatMistralAI({
model: "mistral-large-latest",
temperature: 0
});
安装依赖项
提示
- npm
- yarn
- pnpm
npm i @langchain/groq
yarn add @langchain/groq
pnpm add @langchain/groq
添加环境变量
GROQ_API_KEY=your-api-key
实例化模型
import { ChatGroq } from "@langchain/groq";
const llmLong = new ChatGroq({
model: "mixtral-8x7b-32768",
temperature: 0
});
安装依赖项
提示
- npm
- yarn
- pnpm
npm i @langchain/google-vertexai
yarn add @langchain/google-vertexai
pnpm add @langchain/google-vertexai
添加环境变量
GOOGLE_APPLICATION_CREDENTIALS=credentials.json
实例化模型
import { ChatVertexAI } from "@langchain/google-vertexai";
const llmLong = new ChatVertexAI({
model: "gemini-1.5-flash",
temperature: 0
});
const structuredLlmLong = llmLong.withStructuredOutput(searchSchema, {
name: "Search",
});
const queryAnalyzerAllLong = RunnableSequence.from([
{
question: new RunnablePassthrough(),
},
prompt,
structuredLlmLong,
]);
await queryAnalyzerAllLong.invoke("what are books about aliens by jess knight");
{ query: "aliens", author: "jess knight" }
查找所有相关值
相反,我们可以做的是在相关值上创建一个 向量存储索引,然后查询它以获取 N 个最相关的值,
import { OpenAIEmbeddings } from "@langchain/openai";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
const embeddings = new OpenAIEmbeddings({
model: "text-embedding-3-small",
});
const vectorstore = await MemoryVectorStore.fromTexts(names, {}, embeddings);
const selectNames = async (question: string) => {
const _docs = await vectorstore.similaritySearch(question, 10);
const _names = _docs.map((d) => d.pageContent);
return _names.join(", ");
};
const createPrompt = RunnableSequence.from([
{
question: new RunnablePassthrough(),
authors: selectNames,
},
basePrompt,
]);
await createPrompt.invoke("what are books by jess knight");
ChatPromptValue {
lc_serializable: true,
lc_kwargs: {
messages: [
SystemMessage {
lc_serializable: true,
lc_kwargs: {
content: "Generate a relevant search query for a library system using the 'search' tool.\n" +
"\n" +
"The 'author' you ret"... 243 more characters,
additional_kwargs: {},
response_metadata: {}
},
lc_namespace: [ "langchain_core", "messages" ],
content: "Generate a relevant search query for a library system using the 'search' tool.\n" +
"\n" +
"The 'author' you ret"... 243 more characters,
name: undefined,
additional_kwargs: {},
response_metadata: {}
},
HumanMessage {
lc_serializable: true,
lc_kwargs: {
content: "what are books by jess knight",
additional_kwargs: {},
response_metadata: {}
},
lc_namespace: [ "langchain_core", "messages" ],
content: "what are books by jess knight",
name: undefined,
additional_kwargs: {},
response_metadata: {}
}
]
},
lc_namespace: [ "langchain_core", "prompt_values" ],
messages: [
SystemMessage {
lc_serializable: true,
lc_kwargs: {
content: "Generate a relevant search query for a library system using the 'search' tool.\n" +
"\n" +
"The 'author' you ret"... 243 more characters,
additional_kwargs: {},
response_metadata: {}
},
lc_namespace: [ "langchain_core", "messages" ],
content: "Generate a relevant search query for a library system using the 'search' tool.\n" +
"\n" +
"The 'author' you ret"... 243 more characters,
name: undefined,
additional_kwargs: {},
response_metadata: {}
},
HumanMessage {
lc_serializable: true,
lc_kwargs: {
content: "what are books by jess knight",
additional_kwargs: {},
response_metadata: {}
},
lc_namespace: [ "langchain_core", "messages" ],
content: "what are books by jess knight",
name: undefined,
additional_kwargs: {},
response_metadata: {}
}
]
}
const queryAnalyzerSelect = createPrompt.pipe(llmWithTools);
await queryAnalyzerSelect.invoke("what are books about aliens by jess knight");
{ query: "aliens", author: "Jess Knight" }
下一步
您现在已经学习了在构建查询时如何处理高基数数据。
接下来,查看本节中的其他查询分析指南,例如 如何使用少样本学习来提高性能。