
How to deal with high-cardinality categoricals

Prerequisites

This guide assumes familiarity with the following:

High-cardinality data refers to columns that contain a large number of unique values. This guide demonstrates some techniques for dealing with these inputs.

For example, you may want to do query analysis to create a filter on a categorical column. One of the difficulties here is that you usually need to specify the EXACT categorical value, which means you need to make sure the LLM generates that value exactly. This can be done fairly easily with prompting when there are only a few valid values. When there are many valid values it becomes more difficult: the values may not fit in the LLM's context window, or (if they do) there may be too many of them for the LLM to properly attend to.

In this notebook we take a look at how to approach this.
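To make the failure mode concrete, here is a small self-contained sketch (the author list and model output below are hypothetical, not produced by the pipeline in this guide) of how an exact-match filter silently fails on a misspelled categorical value:

```typescript
// Hypothetical author list and model output, for illustration only.
const authors = ["Jess Knight", "Jesse Knight", "Homer Harber"];

// Suppose the LLM produced a filter with a slightly misspelled author:
const modelOutput = { query: "aliens", author: "jess knight" };

// An exact-match filter finds nothing, even though "Jess Knight" exists:
const matches = authors.filter((a) => a === modelOutput.author);
console.log(matches.length); // 0
```

The techniques below address this by steering the model toward emitting one of the exact valid values.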

Setup

Install dependencies

yarn add @langchain/community zod @faker-js/faker

Set environment variables

# Optional, use LangSmith for best-in-class observability
LANGSMITH_API_KEY=your-api-key
LANGCHAIN_TRACING_V2=true

# Reduce tracing latency if you are not in a serverless environment
# LANGCHAIN_CALLBACKS_BACKGROUND=true

Set up data

We will generate a bunch of fake names.

import { faker } from "@faker-js/faker";

const names = Array.from({ length: 10000 }, () => faker.person.fullName());

Let's look at some of the names:

names[0];
"Rolando Wilkinson"
names[567];
"Homer Harber"

Query analysis

We can now set up a baseline query analysis:

import { z } from "zod";

const searchSchema = z.object({
  query: z.string(),
  author: z.string(),
});

Pick your chat model:

Install dependencies

yarn add @langchain/openai

Add environment variables

OPENAI_API_KEY=your-api-key

Instantiate the model

import { ChatOpenAI } from "@langchain/openai";

const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  temperature: 0,
});
import { ChatPromptTemplate } from "@langchain/core/prompts";
import {
  RunnablePassthrough,
  RunnableSequence,
} from "@langchain/core/runnables";

const system = `Generate a relevant search query for a library system`;
const prompt = ChatPromptTemplate.fromMessages([
  ["system", system],
  ["human", "{question}"],
]);
const llmWithTools = llm.withStructuredOutput(searchSchema, {
  name: "Search",
});
const queryAnalyzer = RunnableSequence.from([
  {
    question: new RunnablePassthrough(),
  },
  prompt,
  llmWithTools,
]);

We can see that if we spell the name exactly correctly, the analyzer knows how to handle it:

await queryAnalyzer.invoke("what are books about aliens by Jesse Knight");
{ query: "aliens", author: "Jesse Knight" }

The issue is that the values you want to filter on may NOT be spelled exactly correctly:

await queryAnalyzer.invoke("what are books about aliens by jess knight");
{ query: "books about aliens", author: "jess knight" }

Add in all values

One way around this is to add ALL the possible values to the prompt. This will generally guide the query in the right direction:

const system = `Generate a relevant search query for a library system using the 'search' tool.

The 'author' you return to the user MUST be one of the following authors:

{authors}

Do NOT hallucinate author name!`;
const basePrompt = ChatPromptTemplate.fromMessages([
  ["system", system],
  ["human", "{question}"],
]);
const prompt = await basePrompt.partial({ authors: names.join(", ") });

const queryAnalyzerAll = RunnableSequence.from([
  {
    question: new RunnablePassthrough(),
  },
  prompt,
  llmWithTools,
]);

However... if the list of categorical values is long enough, it may error!

try {
  const res = await queryAnalyzerAll.invoke(
    "what are books about aliens by jess knight"
  );
} catch (e) {
  console.error(e);
}
Error: 400 This model's maximum context length is 16385 tokens. However, your messages resulted in 50197 tokens (50167 in the messages, 30 in the functions). Please reduce the length of the messages or functions.
    at Function.generate (file:///Users/jacoblee/Library/Caches/deno/npm/registry.npmjs.org/openai/4.47.1/error.mjs:41:20)
    at OpenAI.makeStatusError (file:///Users/jacoblee/Library/Caches/deno/npm/registry.npmjs.org/openai/4.47.1/core.mjs:256:25)
    at OpenAI.makeRequest (file:///Users/jacoblee/Library/Caches/deno/npm/registry.npmjs.org/openai/4.47.1/core.mjs:299:30)
    at eventLoopTick (ext:core/01_core.js:63:7)
    at async file:///Users/jacoblee/Library/Caches/deno/npm/registry.npmjs.org/@langchain/openai/0.0.31/dist/chat_models.js:756:29
    at async RetryOperation._fn (file:///Users/jacoblee/Library/Caches/deno/npm/registry.npmjs.org/p-retry/4.6.2/index.js:50:12) {
  status: 400,
  headers: {
    "alt-svc": 'h3=":443"; ma=86400',
    "cf-cache-status": "DYNAMIC",
    "cf-ray": "885f794b3df4fa52-SJC",
    "content-length": "340",
    "content-type": "application/json",
    date: "Sat, 18 May 2024 23:02:16 GMT",
    "openai-organization": "langchain",
    "openai-processing-ms": "230",
    "openai-version": "2020-10-01",
    server: "cloudflare",
    "set-cookie": "_cfuvid=F_c9lnRuQDUhKiUE2eR2PlsxHPldf1OAVMonLlHTjzM-1716073336256-0.0.1.1-604800000; path=/; domain="... 48 more characters,
    "strict-transport-security": "max-age=15724800; includeSubDomains",
    "x-ratelimit-limit-requests": "10000",
    "x-ratelimit-limit-tokens": "2000000",
    "x-ratelimit-remaining-requests": "9999",
    "x-ratelimit-remaining-tokens": "1958402",
    "x-ratelimit-reset-requests": "6ms",
    "x-ratelimit-reset-tokens": "1.247s",
    "x-request-id": "req_7b88677d6883fac1520e44543f68c839"
  },
  request_id: "req_7b88677d6883fac1520e44543f68c839",
  error: {
    message: "This model's maximum context length is 16385 tokens. However, your messages resulted in 50197 tokens"... 101 more characters,
    type: "invalid_request_error",
    param: "messages",
    code: "context_length_exceeded"
  },
  code: "context_length_exceeded",
  param: "messages",
  type: "invalid_request_error",
  attemptNumber: 1,
  retriesLeft: 6
}

We can try to use a longer context window... but with that much information in the prompt, the model is not guaranteed to pick out the right value reliably:

import { ChatOpenAI } from "@langchain/openai";

const llmLong = new ChatOpenAI({ model: "gpt-4-turbo-preview" });
const structuredLlmLong = llmLong.withStructuredOutput(searchSchema, {
  name: "Search",
});
const queryAnalyzerAll = RunnableSequence.from([
  {
    question: new RunnablePassthrough(),
  },
  prompt,
  structuredLlmLong,
]);
await queryAnalyzerAll.invoke("what are books about aliens by jess knight");
{ query: "aliens", author: "jess knight" }

Find all relevant values

Instead, what we can do is create a vector store index over the relevant values and then query that index for the N most relevant values:

import { OpenAIEmbeddings } from "@langchain/openai";
import { MemoryVectorStore } from "langchain/vectorstores/memory";

const embeddings = new OpenAIEmbeddings({
  model: "text-embedding-3-small",
});
const vectorstore = await MemoryVectorStore.fromTexts(names, {}, embeddings);

const selectNames = async (question: string) => {
  const _docs = await vectorstore.similaritySearch(question, 10);
  const _names = _docs.map((d) => d.pageContent);
  return _names.join(", ");
};

const createPrompt = RunnableSequence.from([
  {
    question: new RunnablePassthrough(),
    authors: selectNames,
  },
  basePrompt,
]);

await createPrompt.invoke("what are books by jess knight");
ChatPromptValue {
  lc_serializable: true,
  lc_kwargs: {
    messages: [
      SystemMessage {
        lc_serializable: true,
        lc_kwargs: {
          content: "Generate a relevant search query for a library system using the 'search' tool.\n" +
            "\n" +
            "The 'author' you ret"... 243 more characters,
          additional_kwargs: {},
          response_metadata: {}
        },
        lc_namespace: [ "langchain_core", "messages" ],
        content: "Generate a relevant search query for a library system using the 'search' tool.\n" +
          "\n" +
          "The 'author' you ret"... 243 more characters,
        name: undefined,
        additional_kwargs: {},
        response_metadata: {}
      },
      HumanMessage {
        lc_serializable: true,
        lc_kwargs: {
          content: "what are books by jess knight",
          additional_kwargs: {},
          response_metadata: {}
        },
        lc_namespace: [ "langchain_core", "messages" ],
        content: "what are books by jess knight",
        name: undefined,
        additional_kwargs: {},
        response_metadata: {}
      }
    ]
  },
  lc_namespace: [ "langchain_core", "prompt_values" ],
  messages: [
    SystemMessage {
      lc_serializable: true,
      lc_kwargs: {
        content: "Generate a relevant search query for a library system using the 'search' tool.\n" +
          "\n" +
          "The 'author' you ret"... 243 more characters,
        additional_kwargs: {},
        response_metadata: {}
      },
      lc_namespace: [ "langchain_core", "messages" ],
      content: "Generate a relevant search query for a library system using the 'search' tool.\n" +
        "\n" +
        "The 'author' you ret"... 243 more characters,
      name: undefined,
      additional_kwargs: {},
      response_metadata: {}
    },
    HumanMessage {
      lc_serializable: true,
      lc_kwargs: {
        content: "what are books by jess knight",
        additional_kwargs: {},
        response_metadata: {}
      },
      lc_namespace: [ "langchain_core", "messages" ],
      content: "what are books by jess knight",
      name: undefined,
      additional_kwargs: {},
      response_metadata: {}
    }
  ]
}
const queryAnalyzerSelect = createPrompt.pipe(llmWithTools);

await queryAnalyzerSelect.invoke("what are books about aliens by jess knight");
{ query: "aliens", author: "Jess Knight" }
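The selection step itself does not depend on any particular retriever: the core idea is simply "keep only the top-N values most similar to the query." As a self-contained illustration of that idea (using character-trigram overlap as a stand-in for the embedding similarity the guide uses above; the candidate names are hypothetical):

```typescript
// Toy stand-in for the vector store: rank candidate values by character
// trigram overlap with the query, and keep only the top N.
const trigrams = (s: string): Set<string> => {
  const t = new Set<string>();
  const padded = ` ${s.toLowerCase()} `;
  for (let i = 0; i + 3 <= padded.length; i++) t.add(padded.slice(i, i + 3));
  return t;
};

// Jaccard-style overlap between two trigram sets, in [0, 1].
const overlap = (a: Set<string>, b: Set<string>): number => {
  let shared = 0;
  for (const x of a) if (b.has(x)) shared++;
  return shared / Math.max(a.size, b.size);
};

const topN = (values: string[], query: string, n: number): string[] => {
  const q = trigrams(query);
  return values
    .map((v) => [v, overlap(trigrams(v), q)] as const)
    .sort((x, y) => y[1] - x[1])
    .slice(0, n)
    .map(([v]) => v);
};

const candidates = ["Jess Knight", "Homer Harber", "Rolando Wilkinson"];
console.log(topN(candidates, "jess knight", 1)); // [ "Jess Knight" ]
```

Only the handful of surviving candidates are placed in the prompt, which is why the approach scales to columns with thousands of unique values.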

Next steps

You've now learned how to deal with high-cardinality data when constructing queries.

Next, check out some of the other query analysis guides in this section, like how to use few-shot examples to improve performance.

