How to call tools with multimodal data

Prerequisites

This guide assumes familiarity with chat models and LangChain tools.

Here we will demonstrate how to call tools with multimodal data, such as images.

Some multimodal models, such as those that can reason over images or audio, support tool calling as well.

To call tools with such models, simply bind the tools to them in the usual way and invoke the model using content blocks of the desired type (e.g., blocks containing image data).

Below, we demonstrate examples using OpenAI, Anthropic, and Google Generative AI. In all cases we will use the same image and tool. Let's first select an image and build a placeholder tool that expects as input the string "sunny", "cloudy", or "rainy". We will ask the model to describe the weather in the image.

The tool function is available in @langchain/core version 0.2.7 and above.

If you are on an older version of core, you should instantiate and use DynamicStructuredTool instead.
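
For reference, here is a minimal sketch of the same placeholder tool built with DynamicStructuredTool (the variable name legacyWeatherTool is ours, for illustration):

import { DynamicStructuredTool } from "@langchain/core/tools";
import { z } from "zod";

// Equivalent placeholder tool built with DynamicStructuredTool instead of tool().
const legacyWeatherTool = new DynamicStructuredTool({
  name: "describe_weather",
  description: "Describe the weather",
  schema: z.object({
    weather: z.enum(["sunny", "cloudy", "rainy"]),
  }),
  func: async ({ weather }) => {
    console.log(weather);
    return weather;
  },
});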

import { tool } from "@langchain/core/tools";
import { z } from "zod";

const imageUrl =
  "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg";

// Placeholder tool: logs and returns the weather condition the model reports.
const weatherTool = tool(
  async ({ weather }) => {
    console.log(weather);
    return weather;
  },
  {
    name: "describe_weather",
    description: "Describe the weather",
    schema: z.object({
      weather: z.enum(["sunny", "cloudy", "rainy"]),
    }),
  }
);
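
As a quick sanity check, you can invoke the tool directly with structured input before wiring it up to a model:

// Direct invocation with structured input; logs and returns "sunny".
await weatherTool.invoke({ weather: "sunny" });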

OpenAI

For OpenAI, we can feed the image URL directly in a content block of type "image_url":

import { HumanMessage } from "@langchain/core/messages";
import { ChatOpenAI } from "@langchain/openai";

const model = new ChatOpenAI({
  model: "gpt-4o",
}).bindTools([weatherTool]);

const message = new HumanMessage({
  content: [
    {
      type: "text",
      text: "describe the weather in this image",
    },
    {
      type: "image_url",
      image_url: {
        url: imageUrl,
      },
    },
  ],
});

const response = await model.invoke([message]);

console.log(response.tool_calls);
[
  {
    name: "describe_weather",
    args: { weather: "sunny" },
    id: "call_ZaBYUggmrTSuDjcuZpMVKpMR"
  }
]

Note that we recover tool calls with parsed arguments in the model response, in LangChain's standard format.
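
If you want to actually execute the tool, recent versions of @langchain/core let you invoke a tool with one of these tool call objects directly, returning a ToolMessage. A minimal sketch:

// Pass the parsed tool call back into the tool to run it and get a ToolMessage.
const toolMessage = await weatherTool.invoke(response.tool_calls![0]);
console.log(toolMessage.content); // "sunny"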

Anthropic

For Anthropic, we can pass a base64-encoded image in a content block of type "image_url" as a data URL, as follows:

import * as fs from "node:fs/promises";

import { ChatAnthropic } from "@langchain/anthropic";
import { HumanMessage } from "@langchain/core/messages";

const imageData = await fs.readFile("../../data/sunny_day.jpeg");

const model = new ChatAnthropic({
  model: "claude-3-sonnet-20240229",
}).bindTools([weatherTool]);

const message = new HumanMessage({
  content: [
    {
      type: "text",
      text: "describe the weather in this image",
    },
    {
      type: "image_url",
      image_url: {
        url: `data:image/jpeg;base64,${imageData.toString("base64")}`,
      },
    },
  ],
});

const response = await model.invoke([message]);

console.log(response.tool_calls);
[
  {
    name: "describe_weather",
    args: { weather: "sunny" },
    id: "toolu_01HLY1KmXZkKMn7Ar4ZtFuAM"
  }
]
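
If the image isn't available on disk, one alternative is to fetch and encode the remote image; a sketch assuming Node.js 18+ (for the global fetch API):

// Fetch the remote image and base64-encode the bytes with Buffer,
// instead of reading a local file. Assumes Node.js 18+ (global fetch).
const res = await fetch(imageUrl);
const remoteImageData = Buffer.from(await res.arrayBuffer()).toString("base64");
// Then pass `data:image/jpeg;base64,${remoteImageData}` as the image_url.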

Google Generative AI

For Google GenAI, we can format a base64-encoded image as a content block of type "media", as shown below:

import { ChatGoogleGenerativeAI } from "@langchain/google-genai";
import axios from "axios";
import {
  ChatPromptTemplate,
  MessagesPlaceholder,
} from "@langchain/core/prompts";
import { HumanMessage } from "@langchain/core/messages";

// Download the image bytes and base64-encode them.
const axiosRes = await axios.get(imageUrl, { responseType: "arraybuffer" });
const base64 = btoa(
  new Uint8Array(axiosRes.data).reduce(
    (data, byte) => data + String.fromCharCode(byte),
    ""
  )
);

const model = new ChatGoogleGenerativeAI({
  model: "gemini-1.5-pro-latest",
}).bindTools([weatherTool]);

const prompt = ChatPromptTemplate.fromMessages([
  ["system", "describe the weather in this image"],
  new MessagesPlaceholder("message"),
]);

const response = await prompt.pipe(model).invoke({
  message: new HumanMessage({
    content: [
      {
        type: "media",
        mimeType: "image/jpeg",
        data: base64,
      },
    ],
  }),
});
console.log(response.tool_calls);
[ { name: 'describe_weather', args: { weather: 'sunny' } } ]

Audio input

Google's Gemini also supports audio inputs. In the next example, we'll see how to pass an audio file to the model and get back a summary in structured format.

import { SystemMessage } from "@langchain/core/messages";
import { tool } from "@langchain/core/tools";

const summaryTool = tool(
  (input) => {
    return input.summary;
  },
  {
    name: "summary_tool",
    description: "Log the summary of the content",
    schema: z.object({
      summary: z.string().describe("The summary of the content to log"),
    }),
  }
);

const audioUrl =
"https://www.pacdv.com/sounds/people_sound_effects/applause-1.wav";

const axiosRes = await axios.get(audioUrl, { responseType: "arraybuffer" });
const base64 = btoa(
  new Uint8Array(axiosRes.data).reduce(
    (data, byte) => data + String.fromCharCode(byte),
    ""
  )
);

const model = new ChatGoogleGenerativeAI({
  model: "gemini-1.5-pro-latest",
}).bindTools([summaryTool]);

const response = await model.invoke([
  new SystemMessage(
    "Summarize this content. always use the summary_tool in your response"
  ),
  new HumanMessage({
    content: [
      {
        type: "media",
        mimeType: "audio/wav",
        data: base64,
      },
    ],
  }),
]);

console.log(response.tool_calls);
[
  {
    name: 'summary_tool',
    args: { summary: 'The video shows a person clapping their hands.' }
  }
]
