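Apify Dataset
This guide shows how to use Apify with LangChain to load documents from an Apify Dataset.
Overview
Apify is a cloud platform for web scraping and data extraction. It provides an ecosystem of more than a thousand ready-made apps, called Actors, for various web scraping, crawling, and data extraction use cases.
This guide shows how to load documents from an Apify Dataset, a scalable append-only storage built for structured web scraping results such as a list of products or Google SERPs. Stored results can then be exported to formats like JSON, CSV, or Excel.
Datasets are typically used to save the results of Actors. For example, the Website Content Crawler Actor deeply crawls websites such as documentation, knowledge bases, help centers, or blogs, and then stores the text content of the webpages in a dataset. From there you can feed the documents into a vector index and answer questions from it.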
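To make the export formats mentioned above concrete, here is a minimal sketch that builds a download URL for a dataset's items. It assumes the public Apify API v2 endpoint layout (`/v2/datasets/{datasetId}/items` with a `format` query parameter); `datasetItemsUrl` and `ExportFormat` are hypothetical names introduced only for illustration and are not part of `apify-client`.

```typescript
// Formats named in this guide; the Apify API may support others.
type ExportFormat = "json" | "csv" | "xlsx";

// Hypothetical helper (not part of apify-client): builds the URL of the
// dataset-items endpoint for a given dataset ID and export format.
// Assumes the Apify API v2 URL layout described in the lead-in.
function datasetItemsUrl(datasetId: string, format: ExportFormat): string {
  const base = "https://api.apify.com/v2/datasets";
  return `${base}/${encodeURIComponent(datasetId)}/items?format=${format}`;
}

console.log(datasetItemsUrl("your-dataset-id", "csv"));
// https://api.apify.com/v2/datasets/your-dataset-id/items?format=csv
```

Private datasets also require authentication with your API token, which this sketch leaves out.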
Setup
You'll first need to install the official Apify client:
- npm: npm install apify-client
- Yarn: yarn add apify-client
- pnpm: pnpm add apify-client
Tip
See this section for general instructions on installing integration packages.
- npm: npm install @langchain/openai @langchain/community
- Yarn: yarn add @langchain/openai @langchain/community
- pnpm: pnpm add @langchain/openai @langchain/community
You'll also need to sign up and retrieve your Apify API token.
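Rather than hard-coding the token, you may prefer to read it from the environment and fail early when it is missing. The following is a minimal sketch under that assumption; `resolveApifyToken` is a hypothetical helper, and the only name taken from this guide is the `APIFY_API_TOKEN` environment variable.

```typescript
// Hypothetical helper: returns the Apify token from an environment-like
// object, or throws a descriptive error when it is missing or blank.
function resolveApifyToken(env: Record<string, string | undefined>): string {
  const raw = env["APIFY_API_TOKEN"];
  const token = raw === undefined ? "" : raw.trim();
  if (token === "") {
    throw new Error("APIFY_API_TOKEN is not set");
  }
  return token;
}

// In Node.js you would typically pass process.env here; a literal object
// is used so the sketch runs without any environment setup.
const apifyToken = resolveApifyToken({ APIFY_API_TOKEN: "your-apify-token" });
console.log(apifyToken.length > 0); // true
```

The returned value can then be passed as `clientOptions.token` when constructing the loader.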
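Usage
From a New Dataset
If you don't already have a dataset on the Apify platform, you'll need to initialize the document loader by calling an Actor and waiting for the results.
Note: Calling an Actor may take a significant amount of time, on the order of hours or even days for large sites!
Here's an example: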
import { ApifyDatasetLoader } from "@langchain/community/document_loaders/web/apify_dataset";
import { HNSWLib } from "@langchain/community/vectorstores/hnswlib";
import { OpenAIEmbeddings, ChatOpenAI } from "@langchain/openai";
import { Document } from "@langchain/core/documents";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { createStuffDocumentsChain } from "langchain/chains/combine_documents";
import { createRetrievalChain } from "langchain/chains/retrieval";

/*
 * datasetMappingFunction is a function that maps your Apify dataset format to LangChain documents.
 * In the below example, the Apify dataset format looks like this:
 * {
 *   "url": "https://apify.com",
 *   "text": "Apify is the best web scraping and automation platform."
 * }
 */
const loader = await ApifyDatasetLoader.fromActorCall(
  "apify/website-content-crawler",
  {
    startUrls: [{ url: "https://js.langchain.ac.cn/docs/" }],
  },
  {
    datasetMappingFunction: (item) =>
      new Document({
        pageContent: (item.text || "") as string,
        metadata: { source: item.url },
      }),
    clientOptions: {
      token: "your-apify-token", // Or set as process.env.APIFY_API_TOKEN
    },
  }
);

const docs = await loader.load();

const vectorStore = await HNSWLib.fromDocuments(docs, new OpenAIEmbeddings());

const model = new ChatOpenAI({
  temperature: 0,
});

const questionAnsweringPrompt = ChatPromptTemplate.fromMessages([
  [
    "system",
    "Answer the user's questions based on the below context:\n\n{context}",
  ],
  ["human", "{input}"],
]);

const combineDocsChain = await createStuffDocumentsChain({
  llm: model,
  prompt: questionAnsweringPrompt,
});

const chain = await createRetrievalChain({
  retriever: vectorStore.asRetriever(),
  combineDocsChain,
});

const res = await chain.invoke({ input: "What is LangChain?" });

console.log(res.answer);
console.log(res.context.map((doc) => doc.metadata.source));

/*
  LangChain is a framework for developing applications powered by language models.
  [
    'https://js.langchain.ac.cn/docs/',
    'https://js.langchain.ac.cn/docs/modules/chains/',
    'https://js.langchain.ac.cn/docs/modules/chains/llmchain/',
    'https://js.langchain.ac.cn/docs/category/functions-4'
  ]
*/
API Reference:
- ApifyDatasetLoader from @langchain/community/document_loaders/web/apify_dataset
- HNSWLib from @langchain/community/vectorstores/hnswlib
- OpenAIEmbeddings from @langchain/openai
- ChatOpenAI from @langchain/openai
- Document from @langchain/core/documents
- ChatPromptTemplate from @langchain/core/prompts
- createStuffDocumentsChain from langchain/chains/combine_documents
- createRetrievalChain from langchain/chains/retrieval
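The datasetMappingFunction in the example above falls back to an empty string when an item has no text, which can put empty documents into the index. A slightly more defensive mapping could be sketched as follows. This is a standalone illustration, not the loader's API: `DocLike` is a local stand-in for LangChain's Document, `mapItem` is a hypothetical helper, the `example.com` items are made up, and the item shape (`url`, `text`) is the one assumed in the example's comment.

```typescript
// Local stand-in for LangChain's Document so the sketch has no dependencies.
interface DocLike {
  pageContent: string;
  metadata: { source: string };
}

// Dataset items arrive as loosely typed records.
type DatasetItem = Record<string, unknown>;

// Hypothetical helper: maps one item to a document, or returns null when the
// item has no usable text.
function mapItem(item: DatasetItem): DocLike | null {
  const rawText = item["text"];
  const rawUrl = item["url"];
  const text = typeof rawText === "string" ? rawText.trim() : "";
  if (text === "") {
    return null;
  }
  return {
    pageContent: text,
    metadata: { source: typeof rawUrl === "string" ? rawUrl : "" },
  };
}

// Illustrative items: only the first one carries usable text.
const sampleItems: DatasetItem[] = [
  { url: "https://apify.com", text: "Apify is the best web scraping and automation platform." },
  { url: "https://example.com/empty", text: "   " },
  { url: "https://example.com/no-text" },
];

// Keep only the items that produced a document.
const mappedDocs: DocLike[] = [];
for (const item of sampleItems) {
  const doc = mapItem(item);
  if (doc !== null) {
    mappedDocs.push(doc);
  }
}

console.log(mappedDocs.length); // 1
```

Filtering here happens in plain code after mapping. Whether the loader itself accepts a mapping function that skips items is not covered in this guide, so an alternative is to filter the array returned by `loader.load()` on `pageContent` before building the vector store.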
From an Existing Dataset
If you already have a dataset on the Apify platform, you can initialize the document loader directly with the constructor:
import { ApifyDatasetLoader } from "@langchain/community/document_loaders/web/apify_dataset";
import { HNSWLib } from "@langchain/community/vectorstores/hnswlib";
import { OpenAIEmbeddings, ChatOpenAI } from "@langchain/openai";
import { Document } from "@langchain/core/documents";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { createRetrievalChain } from "langchain/chains/retrieval";
import { createStuffDocumentsChain } from "langchain/chains/combine_documents";

/*
 * datasetMappingFunction is a function that maps your Apify dataset format to LangChain documents.
 * In the below example, the Apify dataset format looks like this:
 * {
 *   "url": "https://apify.com",
 *   "text": "Apify is the best web scraping and automation platform."
 * }
 */
const loader = new ApifyDatasetLoader("your-dataset-id", {
  datasetMappingFunction: (item) =>
    new Document({
      pageContent: (item.text || "") as string,
      metadata: { source: item.url },
    }),
  clientOptions: {
    token: "your-apify-token", // Or set as process.env.APIFY_API_TOKEN
  },
});

const docs = await loader.load();

const vectorStore = await HNSWLib.fromDocuments(docs, new OpenAIEmbeddings());

const model = new ChatOpenAI({
  temperature: 0,
});

const questionAnsweringPrompt = ChatPromptTemplate.fromMessages([
  [
    "system",
    "Answer the user's questions based on the below context:\n\n{context}",
  ],
  ["human", "{input}"],
]);

const combineDocsChain = await createStuffDocumentsChain({
  llm: model,
  prompt: questionAnsweringPrompt,
});

const chain = await createRetrievalChain({
  retriever: vectorStore.asRetriever(),
  combineDocsChain,
});

const res = await chain.invoke({ input: "What is LangChain?" });

console.log(res.answer);
console.log(res.context.map((doc) => doc.metadata.source));

/*
  LangChain is a framework for developing applications powered by language models.
  [
    'https://js.langchain.ac.cn/docs/',
    'https://js.langchain.ac.cn/docs/modules/chains/',
    'https://js.langchain.ac.cn/docs/modules/chains/llmchain/',
    'https://js.langchain.ac.cn/docs/category/functions-4'
  ]
*/
API Reference:
- ApifyDatasetLoader from @langchain/community/document_loaders/web/apify_dataset
- HNSWLib from @langchain/community/vectorstores/hnswlib
- OpenAIEmbeddings from @langchain/openai
- ChatOpenAI from @langchain/openai
- Document from @langchain/core/documents
- ChatPromptTemplate from @langchain/core/prompts
- createRetrievalChain from langchain/chains/retrieval
- createStuffDocumentsChain from langchain/chains/combine_documents