RecursiveUrlLoader
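Only available on Node.js.
This notebook provides a quick overview for getting started with RecursiveUrlLoader. For detailed documentation of all RecursiveUrlLoader features and configurations, head to the API reference.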
Overview
Integration details
Class | Package | Local | Serializable | PY support |
---|---|---|---|---|
RecursiveUrlLoader | @langchain/community | ✅ | beta | ❌ |
Loader features
Source | Web Loader | Node Envs Only |
---|---|---|
RecursiveUrlLoader | ✅ | ✅ |
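When loading content from a website, we may want to process all the URLs linked from a page.
For example, consider the LangChain.js introduction docs.
They have many interesting child pages that we may want to load, split, and later retrieve in bulk.
The challenge is traversing the tree of child pages and assembling a list!
We do this using the RecursiveUrlLoader.
It also gives us the flexibility to exclude some children, customize the extractor, and more.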
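To make the traversal idea concrete, here is a minimal sketch of a depth-limited, breadth-first crawl over an in-memory link graph. It is not the loader's actual implementation: the `LinkGraph` type, the `crawl` function, and the `CrawlOptions` shape are hypothetical names introduced for illustration, the graph stands in for real HTTP requests, and the depth convention (root at depth 0) and prefix-based exclusion are assumptions that may differ from how RecursiveUrlLoader counts depth or matches directories.

```typescript
// Hypothetical in-memory "site": each URL maps to the links found on that page.
// This replaces real HTTP fetches and is NOT part of RecursiveUrlLoader.
type LinkGraph = Record<string, string[]>;

// Hypothetical option names, loosely mirroring the loader's options.
interface CrawlOptions {
  maxDepth: number; // assumption: number of link hops to follow from the root
  excludeDirs: string[]; // assumption: full URL prefixes to skip
  preventOutside: boolean; // skip links that do not start with the root URL
}

function crawl(root: string, graph: LinkGraph, options: CrawlOptions): string[] {
  const visited = new Set<string>();
  const result: string[] = [];
  // URLs to visit at the current depth; the root sits at depth 0.
  let frontier: string[] = [root];

  for (let depth = 0; depth <= options.maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const url of frontier) {
      if (visited.has(url)) continue; // never visit a page twice
      if (options.excludeDirs.some((dir) => url.startsWith(dir))) continue;
      if (options.preventOutside && !url.startsWith(root)) continue;
      visited.add(url);
      result.push(url);
      // Queue this page's links for the next depth level.
      next.push(...(graph[url] ?? []));
    }
    frontier = next;
  }
  return result;
}

// Example site used only for this sketch.
const site: LinkGraph = {
  "https://example.com/": [
    "https://example.com/docs/",
    "https://example.com/docs/api/",
    "https://other.com/",
  ],
  "https://example.com/docs/": ["https://example.com/docs/intro/"],
};

const pages = crawl("https://example.com/", site, {
  maxDepth: 1,
  excludeDirs: ["https://example.com/docs/api/"],
  preventOutside: true,
});
console.log(pages);
```

The visited set is what keeps the crawl from looping when pages link back to each other, and the per-level frontier is what makes the depth limit easy to enforce.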
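Setup
To access the RecursiveUrlLoader document loader, you'll need to install the @langchain/community integration package and the jsdom package.
Credentials
If you want automated tracing of your model calls, you can also set your LangSmith API key by uncommenting the lines below: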
# export LANGCHAIN_TRACING_V2="true"
# export LANGCHAIN_API_KEY="your-api-key"
Installation
The LangChain RecursiveUrlLoader integration lives in the @langchain/community package.
See this section for general instructions on installing integration packages.
- npm: npm i @langchain/community jsdom
- yarn: yarn add @langchain/community jsdom
- pnpm: pnpm add @langchain/community jsdom
We also suggest adding a package like html-to-text or @mozilla/readability for extracting the raw text from the page.
- npm: npm i html-to-text
- yarn: yarn add html-to-text
- pnpm: pnpm add html-to-text
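For orientation, an extractor is simply a function that takes the raw page text and returns a string. The sketch below shows that shape with a deliberately naive, dependency-free implementation. The name `naiveExtractor` is hypothetical, and regex-based tag stripping is an assumption made to keep the example self-contained: it does not decode HTML entities and will mishandle unusual markup, so a dedicated package such as html-to-text remains the better choice in practice.

```typescript
// A naive extractor with the (text: string) => string shape.
// Hypothetical helper for illustration only; not part of any library.
function naiveExtractor(html: string): string {
  return html
    // Drop <script> and <style> blocks together with their contents.
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, " ")
    // Replace every remaining tag with a space.
    .replace(/<[^>]+>/g, " ")
    // Collapse runs of whitespace and trim the ends.
    .replace(/\s+/g, " ")
    .trim();
}

const sampleHtml =
  "<html><head><style>p{color:red}</style></head>" +
  "<body><h1>Title</h1><p>Hello <b>world</b></p><script>alert(1)</script></body></html>";

console.log(naiveExtractor(sampleHtml));
```

A function like this could be supplied wherever an extractor of this shape is expected.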
Instantiation
Now we can instantiate our loader object and load documents:
import { RecursiveUrlLoader } from "@langchain/community/document_loaders/web/recursive_url";
import { compile } from "html-to-text";
const compiledConvert = compile({ wordwrap: 130 }); // returns (text: string) => string;
const loader = new RecursiveUrlLoader("https://www.langchain.ac.cn/", {
  extractor: compiledConvert,
  maxDepth: 1,
  excludeDirs: ["/docs/api/"],
});
Load
const docs = await loader.load();
docs[0];
{
pageContent: '\n' +
'/\n' +
'Products\n' +
'\n' +
'LangChain [/langchain]LangSmith [/langsmith]LangGraph [/langgraph]\n' +
'Methods\n' +
'\n' +
'Retrieval [/retrieval]Agents [/agents]Evaluation [/evaluation]\n' +
'Resources\n' +
'\n' +
'Blog [https://blog.langchain.ac.cn/]Case Studies [/case-studies]Use Case Inspiration [/use-cases]Experts [/experts]Changelog\n' +
'[https://changelog.langchain.ac.cn/]\n' +
'Docs\n' +
'\n' +
'LangChain Docs [https://python.langchain.ac.cn/v0.2/docs/introduction/]LangSmith Docs [https://docs.smith.langchain.com/]\n' +
'Company\n' +
'\n' +
'About [/about]Careers [/careers]\n' +
'Pricing [/pricing]\n' +
'Get a demo [/contact-sales]\n' +
'Sign up [https://smith.langchain.com/]\n' +
'\n' +
'\n' +
'\n' +
'\n' +
'LangChain’s suite of products supports developers along each step of the LLM application lifecycle.\n' +
'\n' +
'\n' +
'APPLICATIONS THAT CAN REASON. POWERED BY LANGCHAIN.\n' +
'\n' +
'Get a demo [/contact-sales]Sign up for free [https://smith.langchain.com/]\n' +
'\n' +
'\n' +
'\n' +
'FROM STARTUPS TO GLOBAL ENTERPRISES,\n' +
'AMBITIOUS BUILDERS CHOOSE\n' +
'LANGCHAIN PRODUCTS.\n' +
'\n' +
'[https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65ca3b7c22746faa78338532_logo_Ally.svg][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65ca3b7c08e67bb7eefba4c2_logo_Rakuten.svg][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65ca3b7c576fdde32d03c1a0_logo_Elastic.svg][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65ca3b7c6d5592036dae24e5_logo_BCG.svg][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/667f19528c3557c2c19c3086_the-home-depot-2%201.png][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65ca3b7cbcf6473519b06d84_logo_IDEO.svg][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65ca3b7cb5f96dcc100ee3b7_logo_Zapier.svg][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/6606183e52d49bc369acc76c_mdy_logo_rgb_moodysblue.png][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65ca3b7c8ad7db6ed6ec611e_logo_Adyen.svg][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65ca3b7c737d50036a62768b_logo_Infor.svg][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/667f59d98444a5f98aabe21c_acxiom-vector-logo-2022%201.png][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65ca3b7c09a158ffeaab0bd2_logo_Replit.svg][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65ca3b7c9d2b23d292a0cab0_logo_Retool.svg][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65ca3b7c44e67a3d0a996bf3_logo_Databricks.svg][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/667f5a1299d6ba453c78a849_image%20(19).png][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65ca3b7c63af578816bafcc3_logo_Instacart.svg][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/665dc1dabc940168384d9596_podium%20logo.svg]\n' +
'\n' +
'Build\n' +
'\n' +
'LangChain is a framework to build with LLMs by chaining interoperable components. LangGraph is the framework for building\n' +
'controllable agentic workflows.\n' +
'\n' +
'\n' +
'\n' +
'Run\n' +
'\n' +
'Deploy your LLM applications at scale with LangGraph Cloud, our infrastructure purpose-built for agents.\n' +
'\n' +
'\n' +
'\n' +
'Manage\n' +
'\n' +
"Debug, collaborate, test, and monitor your LLM app in LangSmith - whether it's built with a LangChain framework or not. \n" +
'\n' +
'\n' +
'\n' +
'\n' +
'BUILD YOUR APP WITH LANGCHAIN\n' +
'\n' +
'Build context-aware, reasoning applications with LangChain’s flexible framework that leverages your company’s data and APIs.\n' +
'Future-proof your application by making vendor optionality part of your LLM infrastructure design.\n' +
'\n' +
'Learn more about LangChain\n' +
'\n' +
'[/langchain]\n' +
'\n' +
'\n' +
'RUN AT SCALE WITH LANGGRAPH CLOUD\n' +
'\n' +
'Deploy your LangGraph app with LangGraph Cloud for fault-tolerant scalability - including support for async background jobs,\n' +
'built-in persistence, and distributed task queues.\n' +
'\n' +
'Learn more about LangGraph\n' +
'\n' +
'[/langgraph]\n' +
'[https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/667c6d7284e58f4743a430e6_Langgraph%20UI-home-2.webp]\n' +
'\n' +
'\n' +
'MANAGE LLM PERFORMANCE WITH LANGSMITH\n' +
'\n' +
'Ship faster with LangSmith’s debug, test, deploy, and monitoring workflows. Don’t rely on “vibes” – add engineering rigor to your\n' +
'LLM-development workflow, whether you’re building with LangChain or not.\n' +
'\n' +
'Learn more about LangSmith\n' +
'\n' +
'[/langsmith]\n' +
'\n' +
'\n' +
'HEAR FROM OUR HAPPY CUSTOMERS\n' +
'\n' +
'LangChain, LangGraph, and LangSmith help teams of all sizes, across all industries - from ambitious startups to established\n' +
'enterprises.\n' +
'\n' +
'[https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65c5308aee06d9826765c897_Retool_logo%201.png]\n' +
'\n' +
'“LangSmith helped us improve the accuracy and performance of Retool’s fine-tuned models. Not only did we deliver a better product\n' +
'by iterating with LangSmith, but we’re shipping new AI features to our users in a fraction of the time it would have taken without\n' +
'it.”\n' +
'\n' +
'[https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65c5308abdd2dbbdde5a94a1_Jamie%20Cuffe.png]\n' +
'Jamie Cuffe\n' +
'Head of Self-Serve and New Products\n' +
'[https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65c5308a04d37cf7d3eb1341_Rakuten_Global_Brand_Logo.png]\n' +
'\n' +
'“By combining the benefits of LangSmith and standing on the shoulders of a gigantic open-source community, we’re able to identify\n' +
'the right approaches of using LLMs in an enterprise-setting faster.”\n' +
'\n' +
'[https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65c5308a8b6137d44c621cb4_Yusuke%20Kaji.png]\n' +
'Yusuke Kaji\n' +
'General Manager of AI\n' +
'[https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65c5308aea1371b447cc4af9_elastic-ar21.png]\n' +
'\n' +
'“Working with LangChain and LangSmith on the Elastic AI Assistant had a significant positive impact on the overall pace and\n' +
'quality of the development and shipping experience. We couldn’t have achieved the product experience delivered to our customers\n' +
'without LangChain, and we couldn’t have done it at the same pace without LangSmith.”\n' +
'\n' +
'[https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65c5308a4095d5a871de7479_James%20Spiteri.png]\n' +
'James Spiteri\n' +
'Director of Security Products\n' +
'[https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65c530539f4824b828357352_Logo_de_Fintual%201.png]\n' +
'\n' +
'“As soon as we heard about LangSmith, we moved our entire development stack onto it. We could have built evaluation, testing and\n' +
'monitoring tools in house, but with LangSmith it took us 10x less time to get a 1000x better tool.”\n' +
'\n' +
'[https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65c53058acbff86f4c2dcee2_jose%20pena.png]\n' +
'Jose Peña\n' +
'Senior Manager\n' +
'\n' +
'\n' +
'\n' +
'\n' +
'THE REFERENCE ARCHITECTURE ENTERPRISES ADOPT FOR SUCCESS.\n' +
'\n' +
'LangChain’s suite of products can be used independently or stacked together for multiplicative impact – guiding you through\n' +
'building, running, and managing your LLM apps.\n' +
'\n' +
'[https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/6695b116b0b60c78fd4ef462_15.07.24%20-Updated%20stack%20diagram%20-%20lightfor%20website-3.webp][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/667d392696fc0bc3e17a6d04_New%20LC%20stack%20-%20light-2.webp]\n' +
'15M+\n' +
'Monthly Downloads\n' +
'100K+\n' +
'Apps Powered\n' +
'75K+\n' +
'GitHub Stars\n' +
'3K+\n' +
'Contributors\n' +
'\n' +
'\n' +
'THE BIGGEST DEVELOPER COMMUNITY IN GENAI\n' +
'\n' +
'Learn alongside the 1M+ developers who are pushing the industry forward.\n' +
'\n' +
'Explore LangChain\n' +
'\n' +
'[/langchain]\n' +
'\n' +
'\n' +
'GET STARTED WITH THE LANGSMITH PLATFORM TODAY\n' +
'\n' +
'Get a demo [/contact-sales]Sign up for free [https://smith.langchain.com/]\n' +
'[https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65ccf12801bc39bf912a58f3_Home%20C.webp]\n' +
'\n' +
'Teams building with LangChain are driving operational efficiency, increasing discovery & personalization, and delivering premium\n' +
'products that generate revenue.\n' +
'\n' +
'Discover Use Cases\n' +
'\n' +
'[/use-cases]\n' +
'\n' +
'\n' +
'GET INSPIRED BY COMPANIES WHO HAVE DONE IT.\n' +
'\n' +
'[https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65bcd7ee85507bdf350399c3_Ally_Financial%201.svg]\n' +
'Financial Services\n' +
'\n' +
'[https://blog.langchain.ac.cn/ally-financial-collaborates-with-langchain-to-deliver-critical-coding-module-to-mask-personal-identifying-information-in-a-compliant-and-safe-manner/]\n' +
'[https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65bcd8b3ae4dc901daa3037a_Adyen_Corporate_Logo%201.svg]\n' +
'FinTech\n' +
'\n' +
'[https://blog.langchain.ac.cn/llms-accelerate-adyens-support-team-through-smart-ticket-routing-and-support-agent-copilot/]\n' +
'[https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65c534b3fa387379c0f4ebff_elastic-ar21%20(1).png]\n' +
'Technology\n' +
'\n' +
'[https://blog.langchain.ac.cn/langchain-partners-with-elastic-to-launch-the-elastic-ai-assistant/]\n' +
'\n' +
'\n' +
'LANGSMITH IS THE ENTERPRISE DEVOPS PLATFORM BUILT FOR LLMS.\n' +
'\n' +
'Explore LangSmith\n' +
'\n' +
'[/langsmith]\n' +
'Gain visibility to make trade offs between cost, latency, and quality.\n' +
'Increase developer productivity.\n' +
'Eliminate manual, error-prone testing.\n' +
'Reduce hallucinations and improve reliability.\n' +
'Enterprise deployment options to keep data secure.\n' +
'\n' +
'\n' +
'READY TO START SHIPPING RELIABLE GENAI APPS FASTER?\n' +
'\n' +
'Get started with LangChain, LangGraph, and LangSmith to enhance your LLM app development, from prototype to production.\n' +
'\n' +
'Get a demo [/contact-sales]Sign up for free [https://smith.langchain.com/]\n' +
'Products\n' +
'LangChain [/langchain]LangSmith [/langsmith]LangGraph [/langgraph]Agents [/agents]Evaluation [/evaluation]Retrieval [/retrieval]\n' +
'Resources\n' +
'Python Docs [https://python.langchain.ac.cn/]JS/TS Docs [https://js.langchain.ac.cn/docs/get_started/introduction/]GitHub\n' +
'[https://github.com/langchain-ai]Integrations [https://python.langchain.ac.cn/v0.2/docs/integrations/platforms/]Templates\n' +
'[https://templates.langchain.com/]Changelog [https://changelog.langchain.ac.cn/]LangSmith Trust Portal\n' +
'[https://trust.langchain.com/]\n' +
'Company\n' +
'About [/about]Blog [https://blog.langchain.ac.cn/]Twitter [https://twitter.com/LangChainAI]LinkedIn\n' +
'[https://www.linkedin.com/company/langchain/]YouTube [https://www.youtube.com/@LangChain]Community [/join-community]Marketing\n' +
'Assets [https://drive.google.com/drive/folders/17xybjzmVBdsQA-VxouuGLxF6bDsHDe80?usp=sharing]\n' +
'Sign up for our newsletter to stay up to date\n' +
'Thank you! Your submission has been received!\n' +
'Oops! Something went wrong while submitting the form.\n' +
'[https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65c6a38f9c53ec71f5fc73de_langchain-word.svg]\n' +
'All systems operational\n' +
'[https://status.smith.langchain.com/]Privacy Policy [/'... 111 more characters,
metadata: {
source: 'https://www.langchain.ac.cn/',
title: 'LangChain',
description: 'LangChain’s suite of products supports developers along each step of their development journey.',
language: 'en'
}
}
console.log(docs[0].metadata);
{
source: 'https://www.langchain.ac.cn/',
title: 'LangChain',
description: 'LangChain’s suite of products supports developers along each step of their development journey.',
language: 'en'
}
Options
interface Options {
  excludeDirs?: string[]; // webpage directories to exclude.
  extractor?: (text: string) => string; // a function that extracts the document text from the raw page. By default, the page is returned as is; a tool like html-to-text is recommended.
  maxDepth?: number; // the maximum depth to crawl. Defaults to 2. To crawl the whole website, set it to a sufficiently large number.
  timeout?: number; // the timeout for each request, in milliseconds. Defaults to 10000 (10 seconds).
  preventOutside?: boolean; // whether to prevent crawling outside the root URL. Defaults to true.
  callerOptions?: AsyncCallerConstructorParams; // options passed to the AsyncCaller, for example the maximum concurrency (default is 64).
}
Because perfect filtering is hard, you may still see some irrelevant documents in the results even with these options set. If needed, you can filter the returned documents yourself. Most of the time, the returned results are good enough.
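As one way to do that post-filtering, the sketch below keeps only documents whose source URL starts with a given prefix and whose extracted text is not nearly empty. The `LoadedDoc` interface is a minimal stand-in for the document shape shown in the output earlier (`pageContent` plus `metadata.source`), and `filterDocs`, `allowedPrefix`, and `minLength` are hypothetical names chosen for this example, not part of the loader's API.

```typescript
// Minimal stand-in for a loaded document: page text plus metadata with a source URL.
interface LoadedDoc {
  pageContent: string;
  metadata: { source: string; [key: string]: unknown };
}

// Hypothetical helper: keep documents under a URL prefix that have enough text.
function filterDocs(
  docs: LoadedDoc[],
  allowedPrefix: string,
  minLength: number
): LoadedDoc[] {
  return docs.filter(
    (doc) =>
      doc.metadata.source.startsWith(allowedPrefix) &&
      doc.pageContent.trim().length >= minLength
  );
}

// Sample data made up for this sketch.
const sampleDocs: LoadedDoc[] = [
  { pageContent: "A long enough introduction page.", metadata: { source: "https://example.com/docs/intro/" } },
  { pageContent: "   ", metadata: { source: "https://example.com/docs/empty/" } },
  { pageContent: "Careers and job listings page.", metadata: { source: "https://example.com/careers/" } },
];

const kept = filterDocs(sampleDocs, "https://example.com/docs/", 10);
console.log(kept.map((doc) => doc.metadata.source));
```

The same pattern extends to other metadata fields, such as `title` or `language`, when they are present.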
API reference
For detailed documentation of all RecursiveUrlLoader features and configurations, head to the API reference: https://api.js.langchain.com/classes/langchain_community_document_loaders_web_recursive_url.RecursiveUrlLoader.html