How to call tools with multimodal data

Prerequisites

This guide assumes familiarity with chat models and LangChain tools.

Here we will demonstrate how to call tools with multimodal data, such as images.

Some multimodal models, such as those that can reason over images or audio, support tool calling as well.

To call tools with such models, simply bind the tools to them in the usual way and invoke the model using content blocks of the desired type (e.g., blocks containing image data).

Below, we demonstrate examples using OpenAI, Anthropic, and Google Generative AI. In all cases we will use the same image and tool. Let's first select an image and build a placeholder tool that expects as input the string "sunny", "cloudy", or "rainy". We will ask the model to describe the weather in the image.

The tool function is available in @langchain/core version 0.2.7 and above.

If you are on an older version of core, you should instantiate and use DynamicStructuredTool instead.
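
For reference, here is a minimal sketch of the same placeholder tool built with DynamicStructuredTool (the variable name legacyWeatherTool is ours, for illustration):

import { DynamicStructuredTool } from "@langchain/core/tools";
import { z } from "zod";

// Equivalent placeholder tool built with DynamicStructuredTool instead of tool().
const legacyWeatherTool = new DynamicStructuredTool({
  name: "describe_weather",
  description: "Describe the weather",
  schema: z.object({
    weather: z.enum(["sunny", "cloudy", "rainy"]),
  }),
  func: async ({ weather }) => {
    console.log(weather);
    return weather;
  },
});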

import { tool } from "@langchain/core/tools";
import { z } from "zod";

const imageUrl =
  "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg";

// Placeholder tool: logs and returns the weather condition the model reports.
const weatherTool = tool(
  async ({ weather }) => {
    console.log(weather);
    return weather;
  },
  {
    name: "describe_weather",
    description: "Describe the weather",
    schema: z.object({
      weather: z.enum(["sunny", "cloudy", "rainy"]),
    }),
  }
);
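
As a quick sanity check, you can invoke the tool directly with structured input before wiring it up to a model:

// Direct invocation with structured input; logs and returns "sunny".
await weatherTool.invoke({ weather: "sunny" });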

OpenAI

For OpenAI, we can feed the image URL directly in a content block of type "image_url":

import { HumanMessage } from "@langchain/core/messages";
import { ChatOpenAI } from "@langchain/openai";

const model = new ChatOpenAI({
  model: "gpt-4o",
}).bindTools([weatherTool]);

const message = new HumanMessage({
  content: [
    {
      type: "text",
      text: "describe the weather in this image",
    },
    {
      type: "image_url",
      image_url: {
        url: imageUrl,
      },
    },
  ],
});

const response = await model.invoke([message]);

console.log(response.tool_calls);
[
  {
    name: "describe_weather",
    args: { weather: "sunny" },
    id: "call_ZaBYUggmrTSuDjcuZpMVKpMR"
  }
]

Note that we recover tool calls with parsed arguments in the model response, in LangChain's standard format.
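
If you want to actually execute the tool, recent versions of @langchain/core let you invoke a tool with one of these tool call objects directly, returning a ToolMessage. A minimal sketch:

// Pass the parsed tool call back into the tool to run it and get a ToolMessage.
const toolMessage = await weatherTool.invoke(response.tool_calls![0]);
console.log(toolMessage.content); // "sunny"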

Anthropic

For Anthropic, we can pass a base64-encoded image in a content block of type "image_url" as a data URL, as follows:

import * as fs from "node:fs/promises";

import { ChatAnthropic } from "@langchain/anthropic";
import { HumanMessage } from "@langchain/core/messages";

const imageData = await fs.readFile("../../data/sunny_day.jpeg");

const model = new ChatAnthropic({
  model: "claude-3-sonnet-20240229",
}).bindTools([weatherTool]);

const message = new HumanMessage({
  content: [
    {
      type: "text",
      text: "describe the weather in this image",
    },
    {
      type: "image_url",
      image_url: {
        url: `data:image/jpeg;base64,${imageData.toString("base64")}`,
      },
    },
  ],
});

const response = await model.invoke([message]);

console.log(response.tool_calls);
[
  {
    name: "describe_weather",
    args: { weather: "sunny" },
    id: "toolu_01HLY1KmXZkKMn7Ar4ZtFuAM"
  }
]
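
If the image isn't available on disk, one alternative is to fetch and encode the remote image; a sketch assuming Node.js 18+ (for the global fetch API):

// Fetch the remote image and base64-encode the bytes with Buffer,
// instead of reading a local file. Assumes Node.js 18+ (global fetch).
const res = await fetch(imageUrl);
const remoteImageData = Buffer.from(await res.arrayBuffer()).toString("base64");
// Then pass `data:image/jpeg;base64,${remoteImageData}` as the image_url.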

Google Generative AI

For Google GenAI, we can format a base64-encoded image as a content block of type "media", as shown below:

import { ChatGoogleGenerativeAI } from "@langchain/google-genai";
import axios from "axios";
import {
  ChatPromptTemplate,
  MessagesPlaceholder,
} from "@langchain/core/prompts";
import { HumanMessage } from "@langchain/core/messages";

// Download the image bytes and base64-encode them.
const axiosRes = await axios.get(imageUrl, { responseType: "arraybuffer" });
const base64 = btoa(
  new Uint8Array(axiosRes.data).reduce(
    (data, byte) => data + String.fromCharCode(byte),
    ""
  )
);

const model = new ChatGoogleGenerativeAI({
  model: "gemini-1.5-pro-latest",
}).bindTools([weatherTool]);

const prompt = ChatPromptTemplate.fromMessages([
  ["system", "describe the weather in this image"],
  new MessagesPlaceholder("message"),
]);

const response = await prompt.pipe(model).invoke({
  message: new HumanMessage({
    content: [
      {
        type: "media",
        mimeType: "image/jpeg",
        data: base64,
      },
    ],
  }),
});
console.log(response.tool_calls);
[ { name: 'describe_weather', args: { weather: 'sunny' } } ]

Audio input

Google's Gemini also supports audio inputs. In the next example, we'll see how to pass an audio file to the model and get back a summary in structured format.

import { SystemMessage } from "@langchain/core/messages";
import { tool } from "@langchain/core/tools";

const summaryTool = tool(
  (input) => {
    return input.summary;
  },
  {
    name: "summary_tool",
    description: "Log the summary of the content",
    schema: z.object({
      summary: z.string().describe("The summary of the content to log"),
    }),
  }
);

const audioUrl =
"https://www.pacdv.com/sounds/people_sound_effects/applause-1.wav";

const axiosRes = await axios.get(audioUrl, { responseType: "arraybuffer" });
const base64 = btoa(
  new Uint8Array(axiosRes.data).reduce(
    (data, byte) => data + String.fromCharCode(byte),
    ""
  )
);

const model = new ChatGoogleGenerativeAI({
  model: "gemini-1.5-pro-latest",
}).bindTools([summaryTool]);

const response = await model.invoke([
  new SystemMessage(
    "Summarize this content. always use the summary_tool in your response"
  ),
  new HumanMessage({
    content: [
      {
        type: "media",
        mimeType: "audio/wav",
        data: base64,
      },
    ],
  }),
]);

console.log(response.tool_calls);
[
  {
    name: 'summary_tool',
    args: { summary: 'The video shows a person clapping their hands.' }
  }
]
