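Apify Dataset
This guide shows how to use Apify with LangChain to load documents from an Apify Dataset.
Overview
Apify is a cloud platform for web scraping and data extraction. It provides an ecosystem of more than two thousand ready-made apps, called Actors, for a wide range of web scraping, crawling, and data extraction use cases.
An Apify Dataset is a scalable, append-only storage built for structured web scraping results, such as a list of products or Google SERPs, which can then be exported to formats such as JSON, CSV, or Excel.
Datasets are typically used to save the results of different Actors. For example, the Website Content Crawler Actor deeply crawls websites such as documentation, knowledge bases, help centers, or blogs, and stores the text content of the web pages in a dataset. From there you can feed the documents into a vector database and use it for information retrieval. Another example is the RAG Web Browser Actor, which queries Google Search, scrapes the top N pages from the results, and returns the cleaned content in Markdown format for further processing by a large language model.
Setup
You first need to install the official Apify client: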
- npm: npm install apify-client
- Yarn: yarn add apify-client
- pnpm: pnpm add apify-client
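See this section for general instructions on installing integration packages. The example in this guide also needs the following packages: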
- npm: npm install hnswlib-node @langchain/openai @langchain/community @langchain/core
- Yarn: yarn add hnswlib-node @langchain/openai @langchain/community @langchain/core
- pnpm: pnpm add hnswlib-node @langchain/openai @langchain/community @langchain/core
You also need to sign up and retrieve your Apify API token.
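The code samples in this guide use placeholder strings for the tokens and note that they can be set as environment variables instead. A minimal sketch of reading them that way is shown here. `requireEnv` is a hypothetical helper written for illustration, not part of the Apify client or LangChain, and the environment object is passed in explicitly so the function makes no assumption about the runtime.

```typescript
// Hypothetical helper (not part of Apify or LangChain): look up a required
// variable in an environment-like object and fail early if it is missing.
function requireEnv(
  name: string,
  env: Record<string, string | undefined>
): string {
  const value = env[name];
  if (value === undefined || value.trim() === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Usage sketch for Node.js (commented out so the snippet runs without
// real credentials):
// const APIFY_API_TOKEN = requireEnv("APIFY_API_TOKEN", process.env);
// const OPENAI_API_KEY = requireEnv("OPENAI_API_KEY", process.env);
```

Failing at startup when a token is missing tends to be easier to diagnose than an authentication error raised later, in the middle of an Actor call.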
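Usage
From a new dataset (crawl a website and store the data in an Apify Dataset)
If you don't already have a dataset on the Apify platform, you need to initialize the document loader by calling an Actor and waiting for the results. In the example below, we use the Website Content Crawler Actor to crawl the LangChain documentation, store the results in an Apify Dataset, and then load the dataset with ApifyDatasetLoader. For this demonstration, we use the fast Cheerio crawler type and limit the number of crawled pages to 10.
Note: Running the Website Content Crawler can take some time, depending on the size of the website. For large sites, it can take several hours or even days!
Here's an example: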
import { ApifyDatasetLoader } from "@langchain/community/document_loaders/web/apify_dataset";
import { HNSWLib } from "@langchain/community/vectorstores/hnswlib";
import { OpenAIEmbeddings, ChatOpenAI } from "@langchain/openai";
import { Document } from "@langchain/core/documents";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { createStuffDocumentsChain } from "langchain/chains/combine_documents";
import { createRetrievalChain } from "langchain/chains/retrieval";

const APIFY_API_TOKEN = "YOUR-APIFY-API-TOKEN"; // or set as process.env.APIFY_API_TOKEN
const OPENAI_API_KEY = "YOUR-OPENAI-API-KEY"; // or set as process.env.OPENAI_API_KEY

/*
 * datasetMappingFunction is a function that maps your Apify dataset format to LangChain documents.
 * In the below example, the Apify dataset format looks like this:
 * {
 *   "url": "https://apify.com",
 *   "text": "Apify is the best web scraping and automation platform."
 * }
 */
const loader = await ApifyDatasetLoader.fromActorCall(
  "apify/website-content-crawler",
  {
    maxCrawlPages: 10,
    crawlerType: "cheerio",
    startUrls: [{ url: "https://js.langchain.ac.cn/docs/" }],
  },
  {
    datasetMappingFunction: (item) =>
      new Document({
        pageContent: (item.text || "") as string,
        metadata: { source: item.url },
      }),
    clientOptions: {
      token: APIFY_API_TOKEN,
    },
  }
);

const docs = await loader.load();

const vectorStore = await HNSWLib.fromDocuments(
  docs,
  new OpenAIEmbeddings({ apiKey: OPENAI_API_KEY })
);

const model = new ChatOpenAI({
  temperature: 0,
  apiKey: OPENAI_API_KEY,
});

const questionAnsweringPrompt = ChatPromptTemplate.fromMessages([
  [
    "system",
    "Answer the user's questions based on the below context:\n\n{context}",
  ],
  ["human", "{input}"],
]);

const combineDocsChain = await createStuffDocumentsChain({
  llm: model,
  prompt: questionAnsweringPrompt,
});

const chain = await createRetrievalChain({
  retriever: vectorStore.asRetriever(),
  combineDocsChain,
});

const res = await chain.invoke({ input: "What is LangChain?" });

console.log(res.answer);
console.log(res.context.map((doc) => doc.metadata.source));

/*
  LangChain is a framework for developing applications powered by language models.
  [
    'https://js.langchain.ac.cn/docs/',
    'https://js.langchain.ac.cn/docs/modules/chains/',
    'https://js.langchain.ac.cn/docs/modules/chains/llmchain/',
    'https://js.langchain.ac.cn/docs/category/functions-4'
  ]
*/
API Reference
- ApifyDatasetLoader from @langchain/community/document_loaders/web/apify_dataset
- HNSWLib from @langchain/community/vectorstores/hnswlib
- OpenAIEmbeddings from @langchain/openai
- ChatOpenAI from @langchain/openai
- Document from @langchain/core/documents
- ChatPromptTemplate from @langchain/core/prompts
- createStuffDocumentsChain from langchain/chains/combine_documents
- createRetrievalChain from langchain/chains/retrieval
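The datasetMappingFunction in the example above casts item.text to a string and passes item.url through unchecked. A slightly more defensive variant could look like the sketch below. It assumes dataset items shaped like { url, text }, as in the example's comment, and `mapDatasetItem` and `DocLike` are hypothetical names introduced here. `DocLike` is a plain-object stand-in for LangChain's Document so the snippet has no dependencies.

```typescript
// Plain-object stand-in for LangChain's Document, used only to keep this
// sketch dependency-free. In real code, pass the same two fields to
// `new Document({ pageContent, metadata })`.
interface DocLike {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Hypothetical mapping helper: assumes items shaped like { url, text }.
// Non-string or missing fields are replaced by safe defaults instead of
// being cast.
function mapDatasetItem(item: Record<string, unknown>): DocLike {
  const rawText = item["text"];
  const rawUrl = item["url"];
  const pageContent = typeof rawText === "string" ? rawText : "";
  const source = typeof rawUrl === "string" ? rawUrl : undefined;
  return { pageContent, metadata: { source } };
}
```

Checking types at runtime avoids handing a non-string value to the embedding step if an Actor's output format differs from what the mapping expects.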
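From an existing dataset
If you have already run an Actor and have an existing dataset on the Apify platform, you can initialize the document loader directly with the constructor: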
import { ApifyDatasetLoader } from "@langchain/community/document_loaders/web/apify_dataset";
import { HNSWLib } from "@langchain/community/vectorstores/hnswlib";
import { OpenAIEmbeddings, ChatOpenAI } from "@langchain/openai";
import { Document } from "@langchain/core/documents";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { createRetrievalChain } from "langchain/chains/retrieval";
import { createStuffDocumentsChain } from "langchain/chains/combine_documents";

const APIFY_API_TOKEN = "YOUR-APIFY-API-TOKEN"; // or set as process.env.APIFY_API_TOKEN
const OPENAI_API_KEY = "YOUR-OPENAI-API-KEY"; // or set as process.env.OPENAI_API_KEY

/*
 * datasetMappingFunction is a function that maps your Apify dataset format to LangChain documents.
 * In the below example, the Apify dataset format looks like this:
 * {
 *   "url": "https://apify.com",
 *   "text": "Apify is the best web scraping and automation platform."
 * }
 */
const loader = new ApifyDatasetLoader("your-dataset-id", {
  datasetMappingFunction: (item) =>
    new Document({
      pageContent: (item.text || "") as string,
      metadata: { source: item.url },
    }),
  clientOptions: {
    token: APIFY_API_TOKEN,
  },
});

const docs = await loader.load();

const vectorStore = await HNSWLib.fromDocuments(
  docs,
  new OpenAIEmbeddings({ apiKey: OPENAI_API_KEY })
);

const model = new ChatOpenAI({
  temperature: 0,
  apiKey: OPENAI_API_KEY,
});

const questionAnsweringPrompt = ChatPromptTemplate.fromMessages([
  [
    "system",
    "Answer the user's questions based on the below context:\n\n{context}",
  ],
  ["human", "{input}"],
]);

const combineDocsChain = await createStuffDocumentsChain({
  llm: model,
  prompt: questionAnsweringPrompt,
});

const chain = await createRetrievalChain({
  retriever: vectorStore.asRetriever(),
  combineDocsChain,
});

const res = await chain.invoke({ input: "What is LangChain?" });

console.log(res.answer);
console.log(res.context.map((doc) => doc.metadata.source));

/*
  LangChain is a framework for developing applications powered by language models.
  [
    'https://js.langchain.ac.cn/docs/',
    'https://js.langchain.ac.cn/docs/modules/chains/',
    'https://js.langchain.ac.cn/docs/modules/chains/llmchain/',
    'https://js.langchain.ac.cn/docs/category/functions-4'
  ]
*/
API Reference
- ApifyDatasetLoader from @langchain/community/document_loaders/web/apify_dataset
- HNSWLib from @langchain/community/vectorstores/hnswlib
- OpenAIEmbeddings from @langchain/openai
- ChatOpenAI from @langchain/openai
- Document from @langchain/core/documents
- ChatPromptTemplate from @langchain/core/prompts
- createRetrievalChain from langchain/chains/retrieval
- createStuffDocumentsChain from langchain/chains/combine_documents
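Because the mapping function in both examples falls back to an empty string when item.text is missing, the loaded documents may include entries with no content. One possible way to handle this, sketched below, is to drop such documents before building the vector store. `dropEmptyDocs` is a hypothetical helper, not a LangChain API. It accepts any object with a string pageContent field, so it should also work with Document instances.

```typescript
// Hypothetical helper (not a LangChain API): keep only documents whose
// pageContent contains something other than whitespace.
function dropEmptyDocs<T extends { pageContent: string }>(docs: T[]): T[] {
  return docs.filter((doc) => doc.pageContent.trim().length > 0);
}

// Usage sketch, assuming `loader` is set up as in the examples above
// (commented out so the snippet runs on its own):
// const docs = dropEmptyDocs(await loader.load());
```

Filtering before embedding avoids spending embedding calls on empty strings and keeps content-free entries out of the retrieval results.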