
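Apify Dataset

This guide shows how to use Apify with LangChain to load documents from an Apify Dataset.

Overview

Apify is a cloud platform for web scraping and data extraction. It provides an ecosystem of more than a thousand ready-made apps, called Actors, for various web scraping, crawling, and data extraction use cases.

This guide shows how to load documents from an Apify Dataset, a scalable append-only storage built for structured web scraping results such as product listings or Google SERPs. The stored results can then be exported to formats such as JSON, CSV, or Excel.

Datasets are typically used to save the results of Actors. For example, the Website Content Crawler Actor deeply crawls websites such as documentation, knowledge bases, help centers, or blogs, and stores the text content of the web pages in a dataset. From there you can feed the documents into a vector index and answer questions from it.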
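To make the item-to-document step concrete, here is a minimal sketch of how dataset items could be mapped to document-shaped objects before indexing. The `DatasetItem` and `DocLike` interfaces and the `mapItem` and `toDocuments` helpers are hypothetical stand-ins for illustration; they are not part of Apify or LangChain, and the real loader uses LangChain's `Document` class. Skipping items with empty text is an added assumption, not something the loader does for you.

```typescript
// Hypothetical shape of one dataset item (assumed: "url" and "text" fields).
interface DatasetItem {
  url?: string;
  text?: string;
}

// Hypothetical stand-in for LangChain's Document class.
interface DocLike {
  pageContent: string;
  metadata: { source?: string };
}

// Map one dataset item to a document-like object, defaulting missing text to "".
function mapItem(item: DatasetItem): DocLike {
  return {
    pageContent: item.text ?? "",
    metadata: { source: item.url },
  };
}

// Convert a batch of items, dropping those whose text is empty or whitespace only.
function toDocuments(items: DatasetItem[]): DocLike[] {
  return items.map(mapItem).filter((doc) => doc.pageContent.trim().length > 0);
}

const sampleItems: DatasetItem[] = [
  { url: "https://apify.com", text: "Apify is the best web scraping and automation platform." },
  { url: "https://apify.com/empty", text: "" },
];

const sampleDocs = toDocuments(sampleItems);
console.log(sampleDocs.length); // 1
```

Filtering out empty pages before embedding is one possible design choice; whether it suits your data depends on what the Actor writes to the dataset.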

Setup

You first need to install the official Apify client:

npm install apify-client
npm install @langchain/openai @langchain/community

You also need to sign up and retrieve your Apify API token.
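Rather than hard-coding the token, you may prefer to read it from the environment. The following is a small sketch under that assumption: `resolveApifyToken` is a hypothetical helper, not an Apify or LangChain API, and `APIFY_API_TOKEN` is the variable name mentioned in this guide's code comments. It takes the environment as a plain object so it can be exercised without a real environment.

```typescript
// Hypothetical helper: resolve the Apify API token from an explicit value
// or from an environment-like map, and fail early if neither is set.
function resolveApifyToken(
  env: Record<string, string | undefined>,
  explicit?: string
): string {
  const token = (explicit ?? env["APIFY_API_TOKEN"] ?? "").trim();
  if (token.length === 0) {
    throw new Error(
      "Missing Apify API token: set APIFY_API_TOKEN or pass one explicitly."
    );
  }
  return token;
}

// In a Node.js program this would typically be called as resolveApifyToken(process.env).
const demoToken = resolveApifyToken({ APIFY_API_TOKEN: "demo-token" });
console.log(demoToken); // demo-token
```

The resolved value could then be passed as `clientOptions.token` when constructing the loader.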

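Usage

From a new dataset

If you don't already have a dataset on the Apify platform, you need to initialize the document loader by calling an Actor and waiting for the results.

Note: Calling an Actor may take a significant amount of time, on the order of hours or even days for large sites!

Here's an example: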

import { ApifyDatasetLoader } from "@langchain/community/document_loaders/web/apify_dataset";
import { HNSWLib } from "@langchain/community/vectorstores/hnswlib";
import { OpenAIEmbeddings, ChatOpenAI } from "@langchain/openai";
import { Document } from "@langchain/core/documents";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { createStuffDocumentsChain } from "langchain/chains/combine_documents";
import { createRetrievalChain } from "langchain/chains/retrieval";

/*
 * datasetMappingFunction is a function that maps your Apify dataset format to LangChain documents.
 * In the below example, the Apify dataset format looks like this:
 * {
 *   "url": "https://apify.com",
 *   "text": "Apify is the best web scraping and automation platform."
 * }
 */
const loader = await ApifyDatasetLoader.fromActorCall(
  "apify/website-content-crawler",
  {
    startUrls: [{ url: "https://js.langchain.ac.cn/docs/" }],
  },
  {
    datasetMappingFunction: (item) =>
      new Document({
        pageContent: (item.text || "") as string,
        metadata: { source: item.url },
      }),
    clientOptions: {
      token: "your-apify-token", // Or set as process.env.APIFY_API_TOKEN
    },
  }
);

const docs = await loader.load();

const vectorStore = await HNSWLib.fromDocuments(docs, new OpenAIEmbeddings());

const model = new ChatOpenAI({
  temperature: 0,
});

const questionAnsweringPrompt = ChatPromptTemplate.fromMessages([
  [
    "system",
    "Answer the user's questions based on the below context:\n\n{context}",
  ],
  ["human", "{input}"],
]);

const combineDocsChain = await createStuffDocumentsChain({
  llm: model,
  prompt: questionAnsweringPrompt,
});

const chain = await createRetrievalChain({
  retriever: vectorStore.asRetriever(),
  combineDocsChain,
});

const res = await chain.invoke({ input: "What is LangChain?" });

console.log(res.answer);
console.log(res.context.map((doc) => doc.metadata.source));

/*
  LangChain is a framework for developing applications powered by language models.
  [
    'https://js.langchain.ac.cn/docs/',
    'https://js.langchain.ac.cn/docs/modules/chains/',
    'https://js.langchain.ac.cn/docs/modules/chains/llmchain/',
    'https://js.langchain.ac.cn/docs/category/functions-4'
  ]
*/


From an existing dataset

If you already have a dataset on the Apify platform, you can initialize the document loader directly with the constructor:

import { ApifyDatasetLoader } from "@langchain/community/document_loaders/web/apify_dataset";
import { HNSWLib } from "@langchain/community/vectorstores/hnswlib";
import { OpenAIEmbeddings, ChatOpenAI } from "@langchain/openai";
import { Document } from "@langchain/core/documents";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { createRetrievalChain } from "langchain/chains/retrieval";
import { createStuffDocumentsChain } from "langchain/chains/combine_documents";

/*
 * datasetMappingFunction is a function that maps your Apify dataset format to LangChain documents.
 * In the below example, the Apify dataset format looks like this:
 * {
 *   "url": "https://apify.com",
 *   "text": "Apify is the best web scraping and automation platform."
 * }
 */
const loader = new ApifyDatasetLoader("your-dataset-id", {
  datasetMappingFunction: (item) =>
    new Document({
      pageContent: (item.text || "") as string,
      metadata: { source: item.url },
    }),
  clientOptions: {
    token: "your-apify-token", // Or set as process.env.APIFY_API_TOKEN
  },
});

const docs = await loader.load();

const vectorStore = await HNSWLib.fromDocuments(docs, new OpenAIEmbeddings());

const model = new ChatOpenAI({
  temperature: 0,
});

const questionAnsweringPrompt = ChatPromptTemplate.fromMessages([
  [
    "system",
    "Answer the user's questions based on the below context:\n\n{context}",
  ],
  ["human", "{input}"],
]);

const combineDocsChain = await createStuffDocumentsChain({
  llm: model,
  prompt: questionAnsweringPrompt,
});

const chain = await createRetrievalChain({
  retriever: vectorStore.asRetriever(),
  combineDocsChain,
});

const res = await chain.invoke({ input: "What is LangChain?" });

console.log(res.answer);
console.log(res.context.map((doc) => doc.metadata.source));

/*
  LangChain is a framework for developing applications powered by language models.
  [
    'https://js.langchain.ac.cn/docs/',
    'https://js.langchain.ac.cn/docs/modules/chains/',
    'https://js.langchain.ac.cn/docs/modules/chains/llmchain/',
    'https://js.langchain.ac.cn/docs/category/functions-4'
  ]
*/
