
How to call tools with multimodal data

Prerequisites

This guide assumes familiarity with the following concepts: chat models and LangChain tools.

Here we demonstrate how to call tools with multimodal data, such as images.

Some multimodal models, such as those that can reason over images or audio, support tool calling features as well.

To call tools with such models, simply bind the tools to them in the usual way and invoke the model with content blocks of the desired type (e.g., blocks containing image data).

Below, we demonstrate examples using OpenAI, Anthropic, and Google Generative AI. In all cases we will use the same image and tool. Let's first select an image and build a placeholder tool that expects as input one of the strings "sunny", "cloudy", or "rainy". We will ask the models to describe the weather in the image.

The tool function is available in @langchain/core version 0.2.7 and above.

If you are on an older version of core, you should instantiate and use DynamicStructuredTool instead (see the sketch after the example below).

import { tool } from "@langchain/core/tools";
import { z } from "zod";

const imageUrl =
  "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg";

// Placeholder tool that logs the weather it receives and echoes it back
const weatherTool = tool(
  async ({ weather }) => {
    console.log(weather);
    return weather;
  },
  {
    name: "describe_weather",
    description: "Describe the weather",
    schema: z.object({
      weather: z.enum(["sunny", "cloudy", "rainy"]),
    }),
  }
);
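For reference, on older versions of core the same tool can be defined with DynamicStructuredTool. A minimal sketch, assuming the same schema and handler as above:

import { DynamicStructuredTool } from "@langchain/core/tools";

// Sketch: equivalent tool definition for @langchain/core versions before 0.2.7
const legacyWeatherTool = new DynamicStructuredTool({
  name: "describe_weather",
  description: "Describe the weather",
  schema: z.object({
    weather: z.enum(["sunny", "cloudy", "rainy"]),
  }),
  func: async ({ weather }) => {
    console.log(weather);
    return weather;
  },
});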

OpenAI

For OpenAI, we can feed the image URL directly in a content block of type "image_url":

import { HumanMessage } from "@langchain/core/messages";
import { ChatOpenAI } from "@langchain/openai";

const model = new ChatOpenAI({
  model: "gpt-4o",
}).bindTools([weatherTool]);

const message = new HumanMessage({
  content: [
    {
      type: "text",
      text: "describe the weather in this image",
    },
    {
      type: "image_url",
      image_url: {
        url: imageUrl,
      },
    },
  ],
});

const response = await model.invoke([message]);

console.log(response.tool_calls);
[
  {
    name: "describe_weather",
    args: { weather: "sunny" },
    id: "call_ZaBYUggmrTSuDjcuZpMVKpMR"
  }
]

Note that we recover tool calls with parsed arguments in LangChain's standard format in the model response.
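If you want to actually execute the call, recent versions of @langchain/core let you pass the parsed tool call directly to the tool. A minimal sketch, assuming a core version where tools accept ToolCall objects and return a ToolMessage:

// Sketch: execute the parsed tool call and inspect the resulting ToolMessage
if (response.tool_calls?.length) {
  const toolMessage = await weatherTool.invoke(response.tool_calls[0]);
  console.log(toolMessage.content); // e.g. "sunny"
}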

Anthropic

For Anthropic, we can format a base64-encoded image into a content block of type "image_url", as below:

import * as fs from "node:fs/promises";

import { ChatAnthropic } from "@langchain/anthropic";
import { HumanMessage } from "@langchain/core/messages";

// Read a local image file and base64-encode it below
const imageData = await fs.readFile("../../data/sunny_day.jpeg");

const model = new ChatAnthropic({
  model: "claude-3-sonnet-20240229",
}).bindTools([weatherTool]);

const message = new HumanMessage({
  content: [
    {
      type: "text",
      text: "describe the weather in this image",
    },
    {
      type: "image_url",
      image_url: {
        url: `data:image/jpeg;base64,${imageData.toString("base64")}`,
      },
    },
  ],
});

const response = await model.invoke([message]);

console.log(response.tool_calls);
[
  {
    name: "describe_weather",
    args: { weather: "sunny" },
    id: "toolu_01HLY1KmXZkKMn7Ar4ZtFuAM"
  }
]

Google Generative AI

For Google GenAI, we can format a base64-encoded image as a content block of type "media", as below:

import { ChatGoogleGenerativeAI } from "@langchain/google-genai";
import axios from "axios";
import {
  ChatPromptTemplate,
  MessagesPlaceholder,
} from "@langchain/core/prompts";
import { HumanMessage } from "@langchain/core/messages";

// Fetch the image and convert the raw bytes to a base64 string
const axiosRes = await axios.get(imageUrl, { responseType: "arraybuffer" });
const base64 = btoa(
  new Uint8Array(axiosRes.data).reduce(
    (data, byte) => data + String.fromCharCode(byte),
    ""
  )
);

const model = new ChatGoogleGenerativeAI({
  model: "gemini-1.5-pro-latest",
}).bindTools([weatherTool]);

const prompt = ChatPromptTemplate.fromMessages([
  ["system", "describe the weather in this image"],
  new MessagesPlaceholder("message"),
]);

const response = await prompt.pipe(model).invoke({
  message: new HumanMessage({
    content: [
      {
        type: "media",
        mimeType: "image/jpeg",
        data: base64,
      },
    ],
  }),
});
console.log(response.tool_calls);
[ { name: 'describe_weather', args: { weather: 'sunny' } } ]
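As an aside, the btoa/reduce conversion above is a browser-compatible way to base64-encode the bytes. In a Node.js environment, a simpler equivalent (assuming Node's Buffer is available) is the following; the same applies to the audio example in the next section:

// Node.js alternative to the btoa conversion above
const base64Alt = Buffer.from(axiosRes.data).toString("base64");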

Audio input

Google's Gemini also supports audio inputs. In this next example, we'll see how to pass an audio file to the model and get back a summary in a structured format.

import { SystemMessage } from "@langchain/core/messages";
import { tool } from "@langchain/core/tools";

const summaryTool = tool(
  (input) => {
    return input.summary;
  },
  {
    name: "summary_tool",
    description: "Log the summary of the content",
    schema: z.object({
      summary: z.string().describe("The summary of the content to log"),
    }),
  }
);

const audioUrl =
  "https://www.pacdv.com/sounds/people_sound_effects/applause-1.wav";

// Fetch the audio file and base64-encode it, as with the image above
const axiosRes = await axios.get(audioUrl, { responseType: "arraybuffer" });
const base64 = btoa(
  new Uint8Array(axiosRes.data).reduce(
    (data, byte) => data + String.fromCharCode(byte),
    ""
  )
);

const model = new ChatGoogleGenerativeAI({
  model: "gemini-1.5-pro-latest",
}).bindTools([summaryTool]);

const response = await model.invoke([
  new SystemMessage(
    "Summarize this content. always use the summary_tool in your response"
  ),
  new HumanMessage({
    content: [
      {
        type: "media",
        mimeType: "audio/wav",
        data: base64,
      },
    ],
  }),
]);

console.log(response.tool_calls);
[
  {
    name: 'summary_tool',
    args: { summary: 'The video shows a person clapping their hands.' }
  }
]
