How to call tools with multimodal data
Prerequisites
This guide assumes familiarity with the following concepts:
- Chat models
- LangChain Tools
Here we demonstrate how to call tools with multimodal data, such as images.
Some multimodal models, such as those that can reason over images or audio, support tool calling features as well.
To call tools using such models, simply bind the tools to them in the usual way and invoke the model with content blocks of the desired type (e.g., blocks containing image data).
Below, we demonstrate examples using OpenAI, Anthropic, and Google Generative AI. In each case we will use the same tool and ask the model to describe the weather in an image. First, let's choose an image and build a placeholder tool that expects as input one of the strings "sunny", "cloudy", or "rainy".
The tool function is available in @langchain/core version 0.2.7 and above. If you are using an older version of core, you should instead instantiate and use DynamicStructuredTool.
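For reference, here is a rough equivalent of the weather tool below built with DynamicStructuredTool; treat it as a minimal sketch rather than a drop-in, and check the docs for your core version:

import { DynamicStructuredTool } from "@langchain/core/tools";
import { z } from "zod";

// Sketch: the same weather tool for @langchain/core < 0.2.7,
// built with DynamicStructuredTool instead of the tool() helper.
const legacyWeatherTool = new DynamicStructuredTool({
  name: "weather",
  description: "Describe the weather",
  schema: z.object({
    weather: z.enum(["sunny", "cloudy", "rainy"]),
  }),
  func: async ({ weather }) => {
    console.log(weather);
    return weather;
  },
});

With that noted, let's choose our image and define the tool: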
import { tool } from "@langchain/core/tools";
import { z } from "zod";
const imageUrl =
"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg";
const weatherTool = tool(
async ({ weather }) => {
console.log(weather);
return weather;
},
{
name: "multiply",
description: "Describe the weather",
schema: z.object({
weather: z.enum(["sunny", "cloudy", "rainy"]),
}),
}
);
OpenAI
For OpenAI, we can feed the image URL directly in a content block of type "image_url":
import { HumanMessage } from "@langchain/core/messages";
import { ChatOpenAI } from "@langchain/openai";
const model = new ChatOpenAI({
model: "gpt-4o",
}).bindTools([weatherTool]);
const message = new HumanMessage({
content: [
{
type: "text",
text: "describe the weather in this image",
},
{
type: "image_url",
image_url: {
url: imageUrl,
},
},
],
});
const response = await model.invoke([message]);
console.log(response.tool_calls);
[
{
name: "multiply",
args: { weather: "sunny" },
id: "call_ZaBYUggmrTSuDjcuZpMVKpMR"
}
]
Note that we recover tool calls with parsed arguments from the model's response in LangChain's standard format.
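If you then want to run the tool with the arguments the model chose, you can pass them back into the tool yourself. A minimal sketch, assuming at least one tool call was produced:

// Execute the tool with the model's parsed arguments.
// (Sketch: assumes response.tool_calls is non-empty.)
const toolCall = response.tool_calls?.[0];
if (toolCall !== undefined) {
  const result = await weatherTool.invoke(toolCall.args);
  console.log(result); // e.g. "sunny"
}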
Anthropic
For Anthropic, we can format a base64-encoded image into a content block of type "image_url" holding a data URL, as below:
import * as fs from "node:fs/promises";
import { ChatAnthropic } from "@langchain/anthropic";
import { HumanMessage } from "@langchain/core/messages";
const imageData = await fs.readFile("../../data/sunny_day.jpeg");
const model = new ChatAnthropic({
model: "claude-3-sonnet-20240229",
}).bindTools([weatherTool]);
const message = new HumanMessage({
content: [
{
type: "text",
text: "describe the weather in this image",
},
{
type: "image_url",
image_url: {
url: `data:image/jpeg;base64,${imageData.toString("base64")}`,
},
},
],
});
const response = await model.invoke([message]);
console.log(response.tool_calls);
[
{
name: "multiply",
args: { weather: "sunny" },
id: "toolu_01HLY1KmXZkKMn7Ar4ZtFuAM"
}
]
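As an aside, recent versions of @langchain/anthropic should also accept Anthropic's native "image" content blocks, if you prefer the provider's own format. A hedged sketch, reusing imageData from above:

// Sketch: Anthropic's native base64 image block format.
// Assumes your @langchain/anthropic version passes these through.
const nativeMessage = new HumanMessage({
  content: [
    { type: "text", text: "describe the weather in this image" },
    {
      type: "image",
      source: {
        type: "base64",
        media_type: "image/jpeg",
        data: imageData.toString("base64"),
      },
    },
  ],
});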
Google Generative AI
For Google GenAI, we can format a base64-encoded image as a content block of type "media", as below:
import { ChatGoogleGenerativeAI } from "@langchain/google-genai";
import axios from "axios";
import {
ChatPromptTemplate,
MessagesPlaceholder,
} from "@langchain/core/prompts";
import { HumanMessage } from "@langchain/core/messages";
const axiosRes = await axios.get(imageUrl, { responseType: "arraybuffer" });
const base64 = btoa(
new Uint8Array(axiosRes.data).reduce(
(data, byte) => data + String.fromCharCode(byte),
""
)
);
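The btoa conversion above works in browsers and recent Node versions; in Node you can get the same base64 string more directly with Buffer:

// Node-only equivalent of the btoa() conversion above.
const base64FromBuffer = Buffer.from(axiosRes.data).toString("base64");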
const model = new ChatGoogleGenerativeAI({
model: "gemini-1.5-pro-latest",
}).bindTools([weatherTool]);
const prompt = ChatPromptTemplate.fromMessages([
["system", "describe the weather in this image"],
new MessagesPlaceholder("message"),
]);
const response = await prompt.pipe(model).invoke({
message: new HumanMessage({
content: [
{
type: "media",
mimeType: "image/jpeg",
data: base64,
},
],
}),
});
console.log(response.tool_calls);
[ { name: 'weather', args: { weather: 'sunny' } } ]
Audio input
Google's Gemini also supports audio input. In the example below, we pass an audio file to the model and get back a summary in structured format.
import { SystemMessage } from "@langchain/core/messages";
import { tool } from "@langchain/core/tools";
const summaryTool = tool(
(input) => {
return input.summary;
},
{
name: "summary_tool",
description: "Log the summary of the content",
schema: z.object({
summary: z.string().describe("The summary of the content to log"),
}),
}
);
const audioUrl =
"https://www.pacdv.com/sounds/people_sound_effects/applause-1.wav";
const axiosRes = await axios.get(audioUrl, { responseType: "arraybuffer" });
const base64 = btoa(
new Uint8Array(axiosRes.data).reduce(
(data, byte) => data + String.fromCharCode(byte),
""
)
);
const model = new ChatGoogleGenerativeAI({
model: "gemini-1.5-pro-latest",
}).bindTools([summaryTool]);
const response = await model.invoke([
new SystemMessage(
"Summarize this content. always use the summary_tool in your response"
),
new HumanMessage({
content: [
{
type: "media",
mimeType: "audio/wav",
data: base64,
},
],
}),
]);
console.log(response.tool_calls);
[
{
name: 'summary_tool',
args: { summary: 'The video shows a person clapping their hands.' }
}
]