
PuppeteerWebBaseLoader

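Compatibility

Only available on Node.js.

This notebook provides a quick overview for getting started with PuppeteerWebBaseLoader. For detailed documentation of all PuppeteerWebBaseLoader features and configurations, head to the API reference.

Puppeteer is a Node.js library that provides a high-level API for controlling headless Chrome or Chromium. You can use Puppeteer to automate web page interactions, including extracting data from dynamic web pages that require JavaScript to render.

If you want a lighter-weight solution, and the web pages you want to load do not require JavaScript to render, you can use CheerioWebBaseLoader instead.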

Overview

Integration details

Class | Package | Local | Serializable | PY support
PuppeteerWebBaseLoader | @langchain/community | | beta |

Loader features

Source | Web Loader | Node Envs Only
PuppeteerWebBaseLoader | |

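Setup

To access the PuppeteerWebBaseLoader document loader, you need to install the @langchain/community integration package along with the puppeteer peer dependency.

Credentials

If you want automated tracing of your model calls, you can also set your LangSmith API key by uncommenting the lines below: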

# export LANGCHAIN_TRACING_V2="true"
# export LANGCHAIN_API_KEY="your-api-key"

Installation

The LangChain PuppeteerWebBaseLoader integration lives in the @langchain/community package:

yarn add @langchain/community puppeteer

Instantiation

Now we can instantiate our loader object and load documents:

import { PuppeteerWebBaseLoader } from "@langchain/community/document_loaders/web/puppeteer";

const loader = new PuppeteerWebBaseLoader("https://www.langchain.ac.cn", {
  // required params = ...
  // optional params = ...
});

Load

const docs = await loader.load();
docs[0];
Document {
pageContent: '<div class="page-wrapper"><div class="global-styles w-embed"><style>\n' +
'\n' +
'* {\n' +
' -webkit-font-smoothing: antialiased;\n' +
'}\n' +
'\n' +
'.page-wrapper {\n' +
'overflow: clip;\n' +
' }\n' +
'\n' +
'\n' +
'\n' +
'/* Set fluid size change for smaller breakpoints */\n' +
' html { font-size: 1rem; }\n' +
' @media screen and (max-width:1920px) and (min-width:1281px) { html { font-size: calc(0.2499999999999999rem + 0.6250000000000001vw); } }\n' +
' @media screen and (max-width:1280px) and (min-width:992px) { html { font-size: calc(0.41223612197028925rem + 0.4222048475371384vw); } }\n' +
'/* video sizing */\n' +
'\n' +
'video {\n' +
' object-fit: fill;\n' +
'\t\twidth: 100%;\n' +
'}\n' +
'\n' +
'\n' +
'\n' +
'#retrieval-video {\n' +
' object-fit: cover;\n' +
' width: 100%;\n' +
'}\n' +
'\n' +
'\n' +
'\n' +
'/* Set color style to inherit */\n' +
'.inherit-color * {\n' +
' color: inherit;\n' +
'}\n' +
'\n' +
'/* Focus state style for keyboard navigation for the focusable elements */\n' +
'*[tabindex]:focus-visible,\n' +
' input[type="file"]:focus-visible {\n' +
' outline: 0.125rem solid #4d65ff;\n' +
' outline-offset: 0.125rem;\n' +
'}\n' +
'\n' +
'/* Get rid of top margin on first element in any rich text element */\n' +
'.w-richtext > :not(div):first-child, .w-richtext > div:first-child > :first-child {\n' +
' margin-top: 0 !important;\n' +
'}\n' +
'\n' +
'/* Get rid of bottom margin on last element in any rich text element */\n' +
'.w-richtext>:last-child, .w-richtext ol li:last-child, .w-richtext ul li:last-child {\n' +
'\tmargin-bottom: 0 !important;\n' +
'}\n' +
'\n' +
'/* Prevent all click and hover interaction with an element */\n' +
'.pointer-events-off {\n' +
'\tpointer-events: none;\n' +
'}\n' +
'\n' +
'/* Enables all click and hover interaction with an element */\n' +
'.pointer-events-on {\n' +
' pointer-events: auto;\n' +
'}\n' +
'\n' +
'/* Create a class of .div-square which maintains a 1:1 dimension of a div */\n' +
'.div-square::after {\n' +
'\tcontent: "";\n' +
'\tdisplay: block;\n' +
'\tpadding-bottom: 100%;\n' +
'}\n' +
'\n' +
'/* Make sure containers never lose their center alignment */\n' +
'.container-medium,.container-small, .container-large {\n' +
'\tmargin-right: auto !important;\n' +
' margin-left: auto !important;\n' +
'}\n' +
'\n' +
'/* \n' +
'Make the following elements inherit typography styles from the parent and not have hardcoded values. \n' +
'Important: You will not be able to style for example "All Links" in Designer with this CSS applied.\n' +
'Uncomment this CSS to use it in the project. Leave this message for future hand-off.\n' +
'*/\n' +
'/*\n' +
'a,\n' +
'.w-input,\n' +
'.w-select,\n' +
'.w-tab-link,\n' +
'.w-nav-link,\n' +
'.w-dropdown-btn,\n' +
'.w-dropdown-toggle,\n' +
'.w-dropdown-link {\n' +
' color: inherit;\n' +
' text-decoration: inherit;\n' +
' font-size: inherit;\n' +
'}\n' +
'*/\n' +
'\n' +
'/* Apply "..." after 3 lines of text */\n' +
'.text-style-3lines {\n' +
'\tdisplay: -webkit-box;\n' +
'\toverflow: hidden;\n' +
'\t-webkit-line-clamp: 3;\n' +
'\t-webkit-box-orient: vertical;\n' +
'}\n' +
'\n' +
'/* Apply "..." after 2 lines of text */\n' +
'.text-style-2lines {\n' +
'\tdisplay: -webkit-box;\n' +
'\toverflow: hidden;\n' +
'\t-webkit-line-clamp: 2;\n' +
'\t-webkit-box-orient: vertical;\n' +
'}\n' +
'\n' +
'/* Adds inline flex display */\n' +
'.display-inlineflex {\n' +
' display: inline-flex;\n' +
'}\n' +
'\n' +
'/* These classes are never overwritten */\n' +
'.hide {\n' +
' display: none !important;\n' +
'}\n' +
'\n' +
'@media screen and (max-width: 991px) {\n' +
' .hide, .hide-tablet {\n' +
' display: none !important;\n' +
' }\n' +
'}\n' +
' @media screen and (max-width: 767px) {\n' +
' .hide-mobile-landscape{\n' +
' display: none !important;\n' +
' }\n' +
'}\n' +
' @media screen and (max-width: 479px) {\n' +
' .hide-mobile{\n' +
' display: none !important;\n' +
' }\n' +
'}\n' +
' \n' +
'.margin-0 {\n' +
' margin: 0rem !important;\n' +
'}\n' +
' \n' +
'.padding-0 {\n' +
' padding: 0rem !important;\n' +
'}\n' +
'\n' +
'.spacing-clean {\n' +
'padding: 0rem !important;\n' +
'margin: 0rem !important;\n' +
'}\n' +
'\n' +
'.margin-top {\n' +
' margin-right: 0rem !important;\n' +
' margin-bottom: 0rem !important;\n' +
' margin-left: 0rem !important;\n' +
'}\n' +
'\n' +
'.padding-top {\n' +
' padding-right: 0rem !important;\n' +
' padding-bottom: 0rem !important;\n' +
' padding-left: 0rem !important;\n' +
'}\n' +
' \n' +
'.margin-right {\n' +
' margin-top: 0rem !important;\n' +
' margin-bottom: 0rem !important;\n' +
' margin-left: 0rem !important;\n' +
'}\n' +
'\n' +
'.padding-right {\n' +
' padding-top: 0rem !important;\n' +
' padding-bottom: 0rem !important;\n' +
' padding-left: 0rem !important;\n' +
'}\n' +
'\n' +
'.margin-bottom {\n' +
' margin-top: 0rem !important;\n' +
' margin-right: 0rem !important;\n' +
' margin-left: 0rem !important;\n' +
'}\n' +
'\n' +
'.padding-bottom {\n' +
' padding-top: 0rem !important;\n' +
' padding-right: 0rem !important;\n' +
' padding-left: 0rem !important;\n' +
'}\n' +
'\n' +
'.margin-left {\n' +
' margin-top: 0rem !important;\n' +
' margin-right: 0rem !important;\n' +
' margin-bottom: 0rem !important;\n' +
'}\n' +
' \n' +
'.padding-left {\n' +
' padding-top: 0rem !important;\n' +
' padding-right: 0rem !important;\n' +
' padding-bottom: 0rem !important;\n' +
'}\n' +
' \n' +
'.margin-horizontal {\n' +
' margin-top: 0rem !important;\n' +
' margin-bottom: 0rem !important;\n' +
'}\n' +
'\n' +
'.padding-horizontal {\n' +
' padding-top: 0rem !important;\n' +
' padding-bottom: 0rem !important;\n' +
'}\n' +
'\n' +
'.margin-vertical {\n' +
' margin-right: 0rem !important;\n' +
' margin-left: 0rem !important;\n' +
'}\n' +
' \n' +
'.padding-vertical {\n' +
' padding-right: 0rem !important;\n' +
' padding-left: 0rem !important;\n' +
'}\n' +
'\n' +
'/* Apply "..." at 100% width */\n' +
'.truncate-width { \n' +
'\t\twidth: 100%; \n' +
' white-space: nowrap; \n' +
' overflow: hidden; \n' +
' text-overflow: ellipsis; \n' +
'}\n' +
'/* Removes native scrollbar */\n' +
'.no-scrollbar {\n' +
' -ms-overflow-style: none;\n' +
' overflow: -moz-scrollbars-none; \n' +
'}\n' +
'\n' +
'.no-scrollbar::-webkit-scrollbar {\n' +
' display: none;\n' +
'}\n' +
'\n' +
'input:checked + span {\n' +
'color: white /* styles for the div immediately following the checked input */\n' +
'}\n' +
'\n' +
'/* styles for word-wrapping\n' +
'h1, h2, h3 {\n' +
'word-wrap: break-word;\n' +
'hyphens: auto;\n' +
'}*/\n' +
'\n' +
'[nav-theme="light"] .navbar_logo-svg {\n' +
'\t--nav--logo: var(--light--logo);\n' +
'}\n' +
'\n' +
'[nav-theme="light"] .button.is-nav {\n' +
'\t--nav--button-bg: var(--light--button-bg);\n' +
'\t--nav--button-text: var(--light--button-text);\n' +
'}\n' +
'\n' +
'[nav-theme="light"] .button.is-nav:hover {\n' +
'\t--nav--button-bg: var(--dark--button-bg);\n' +
'\t--nav--button-text:var(--dark--button-text);\n' +
'}\n' +
'\n' +
'[nav-theme="dark"] .navbar_logo-svg {\n' +
'\t--nav--logo: var(--dark--logo);\n' +
'}\n' +
'\n' +
'[nav-theme="dark"] .button.is-nav {\n' +
'\t--nav--button-bg: var(--dark--button-bg);\n' +
'\t--nav--button-text: var(--dark--button-text);\n' +
'}\n' +
'\n' +
'[nav-theme="dark"] .button.is-nav:hover {\n' +
'\t--nav--button-bg: var(--light--button-bg);\n' +
'\t--nav--button-text: var(--light--button-text);\n' +
'}\n' +
'\n' +
'[nav-theme="red"] .navbar_logo-svg {\n' +
'\t--nav--logo: var(--red--logo);\n' +
'}\n' +
'\n' +
'\n' +
'[nav-theme="red"] .button.is-nav {\n' +
'\t--nav--button-bg: var(--red--button-bg);\n' +
'\t--nav--button-text: var(--red--button-text);\n' +
'}\n' +
'\n' +
'.navbar_logo-svg.is-light, .navbar_logo-svg.is-red.is-light{\n' +
'color: #F8F7FF!important;\n' +
'}\n' +
'\n' +
'.news_button[disabled] {\n' +
'background: none;\n' +
'}\n' +
'\n' +
'.product_bg-video video {\n' +
'object-fit: fill;\n' +
'}\n' +
'\n' +
'</style></div><div data-animation="default" class="navbar_component w-nav" data-easing2="ease" fs-scrolldisable-element="smart-nav" data-easing="ease" data-collapse="medium" data-w-id="78839fc1-6b85-b108-b164-82fcae730868" role="banner" data-duration="400" style="will-change: width, height; height: 10rem;"><div class="navbar_container"><a href="/" aria-current="page" class="navbar_logo-link w-nav-brand w--current" aria-label="home"><div class="navbar_logo-svg w-embed" style="color: rgb(255, 255, 255);"><svg width="100%" height="100%" viewBox="0 0 240 41" fill="none" xmlns="http://www.w3.org/2000/svg">\n' +
'<path d="M61.5139 11.1569C60.4527 11.1569 59.4549 11.568 58.708 12.3148L55.6899 15.3248C54.8757 16.1368 54.4574 17.2643 54.5431 18.4202C54.5492 18.4833 54.5553 18.5464 54.5615 18.6115C54.6696 19.4988 55.0594 20.2986 55.6899 20.9254C56.1246 21.3589 56.6041 21.6337 57.1857 21.825C57.2163 22 57.2326 22.177 57.2326 22.3541C57.2326 23.1519 56.9225 23.9008 56.3592 24.4625L56.1735 24.6477C55.1655 24.3037 54.3247 23.8011 53.5656 23.044C52.5576 22.0386 51.8903 20.7687 51.6393 19.3747L51.6046 19.1813L51.4515 19.3055C51.3475 19.3889 51.2495 19.4785 51.1577 19.57L48.1396 22.58C46.5928 24.1226 46.5928 26.636 48.1396 28.1786C48.913 28.9499 49.9292 29.3366 50.9475 29.3366C51.9658 29.3366 52.98 28.9499 53.7534 28.1786L56.7715 25.1687C58.3183 23.626 58.3183 21.1147 56.7715 19.57C56.3592 19.159 55.8675 18.8496 55.3104 18.6502C55.2798 18.469 55.2634 18.2879 55.2634 18.1109C55.2634 17.2439 55.6063 16.4217 56.2348 15.7949C57.2449 16.1388 58.1407 16.6965 58.8978 17.4515C59.9038 18.4548 60.5691 19.7227 60.8241 21.1208L60.8588 21.3141L61.0119 21.19C61.116 21.1066 61.2139 21.017 61.3078 20.9234L64.3259 17.9135C65.8727 16.3708 65.8747 13.8575 64.3259 12.3148C63.577 11.568 62.5811 11.1569 61.518 11.1569H61.5139Z" fill="CurrentColor"></path>\n' +
'<path d="M59.8966 0.148865H20.4063C9.15426 0.148865 0 9.27841 0 20.5001C0 31.7217 9.15426 40.8513 20.4063 40.8513H59.8966C71.1486 40.8513 80.3029 31.7217 80.3029 20.5001C80.3029 9.27841 71.1486 0.148865 59.8966 0.148865ZM40.4188 32.0555C39.7678 32.1898 39.0352 32.2142 38.5373 31.6953C38.3536 32.1165 37.9251 31.8947 37.5945 31.8398C37.5639 31.9252 37.5374 32.0005 37.5088 32.086C36.4089 32.1593 35.5845 31.04 35.0601 30.1954C34.0193 29.6337 32.8378 29.2918 31.7746 28.7036C31.7134 29.6724 31.9257 30.8731 31.0012 31.4979C30.9543 33.36 33.8255 31.7177 34.0887 33.1056C33.8847 33.128 33.6582 33.073 33.4949 33.2297C32.746 33.9563 31.8869 32.6803 31.0237 33.2074C29.8646 33.7894 29.7483 34.2656 28.3137 34.3857C28.2342 34.2656 28.2668 34.1862 28.3341 34.113C28.7382 33.6449 28.7668 33.0934 29.4565 32.8939C28.7464 32.782 28.1525 33.1728 27.5546 33.4821C26.7771 33.7996 26.7833 32.7657 25.5875 33.537C25.4548 33.4292 25.5181 33.3315 25.5936 33.2481C25.8976 32.8777 26.2976 32.8227 26.7486 32.8431C24.5304 31.6098 23.4856 34.3511 22.4612 32.9876C22.1531 33.069 22.0368 33.3457 21.8429 33.5411C21.6756 33.358 21.8021 33.1361 21.8103 32.9204C21.6103 32.8268 21.3572 32.782 21.4164 32.4625C21.0246 32.3302 20.7512 32.5622 20.4594 32.782C20.1961 32.5785 20.6369 32.2814 20.7185 32.0697C20.9532 31.6627 21.4878 31.9863 21.7592 31.6932C22.5306 31.2557 23.606 31.9659 24.4876 31.8459C25.1671 31.9313 26.0078 31.2353 25.667 30.5413C24.9406 29.6154 25.0691 28.4045 25.0528 27.2974C24.963 26.6522 23.4101 25.83 22.9612 25.134C22.4061 24.5072 21.9735 23.7807 21.5409 23.0664C19.9798 20.0523 20.4716 16.1795 18.5044 13.3812C17.6147 13.8717 16.4556 13.6397 15.6884 12.9823C15.2741 13.3588 15.2557 13.8513 15.2231 14.3744C14.2293 13.3833 14.3538 11.5109 15.1476 10.4079C15.4721 9.97239 15.8598 9.61421 16.2924 9.29876C16.3903 9.22754 16.423 9.15834 16.4209 9.04844C17.2066 5.52362 22.5653 6.20335 24.259 8.70044C25.4875 10.237 25.8589 12.27 27.2526 13.6967C29.1279 15.744 31.2645 17.5471 32'... 73262 more characters,
metadata: { source: 'https://www.langchain.ac.cn' },
id: undefined
}
console.log(docs[0].metadata);
{ source: 'https://www.langchain.ac.cn' }
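
As the output above shows, pageContent holds the page's raw HTML, including inline styles. If only the visible text is needed, a post-processing step can help. The following is a minimal sketch, not part of LangChain: htmlToText is a hypothetical helper, and its regex-based stripping is a rough approximation rather than a full HTML parser (for instance, it does not decode HTML entities).

```typescript
// Hypothetical helper (not part of LangChain): reduce raw HTML to visible text.
// Regex-based stripping is an approximation, not a full HTML parser.
function htmlToText(html: string): string {
  return html
    // Drop <style> and <script> elements together with their contents.
    .replace(/<(style|script)\b[^>]*>[\s\S]*?<\/\1>/gi, " ")
    // Replace every remaining tag with a space so adjacent words stay separated.
    .replace(/<[^>]+>/g, " ")
    // Collapse runs of whitespace into single spaces.
    .replace(/\s+/g, " ")
    .trim();
}

// Small stand-in for docs[0].pageContent.
const sample =
  '<div class="page-wrapper"><style>.hide { display: none; }</style>' +
  "<h1>Hello</h1><p>LangChain docs</p></div>";

console.log(htmlToText(sample));
```

In practice the helper would be applied to docs[0].pageContent after loader.load() resolves.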

Options

Here is an explanation of the parameters you can pass to the PuppeteerWebBaseLoader constructor using the PuppeteerWebBaseLoaderOptions interface:

type PuppeteerWebBaseLoaderOptions = {
  launchOptions?: PuppeteerLaunchOptions;
  gotoOptions?: PuppeteerGotoOptions;
  evaluate?: (page: Page, browser: Browser) => Promise<string>;
};
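
  1. launchOptions: an optional object that specifies additional options to pass to the puppeteer.launch() method. This can include options such as the headless flag to launch the browser in headless mode, or the slowMo option to slow down Puppeteer's operations so they are easier to follow.

  2. gotoOptions: an optional object that specifies additional options to pass to the page.goto() method. This can include options such as the timeout option to specify the maximum navigation time in milliseconds, or the waitUntil option to specify when the navigation is considered successful.

  3. evaluate: an optional function that can be used to evaluate JavaScript code on the page using the page.evaluate() method. This is useful for extracting data from the page or interacting with page elements. The function should return a Promise that resolves to a string containing the result of the evaluation.

By passing these options to the PuppeteerWebBaseLoader constructor, you can customize the loader's behavior and use Puppeteer's capabilities to scrape and interact with web pages.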
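
The three options can be combined in a single object. The sketch below is illustrative only: PageLike is a hypothetical structural stand-in for the part of Puppeteer's Page type it uses (so the snippet runs without puppeteer installed), extractBodyText is a hypothetical evaluate callback, and the slowMo and timeout values are arbitrary assumptions rather than recommended settings.

```typescript
// Minimal structural stand-in for the part of Puppeteer's Page used below.
// This is an assumption so the sketch type-checks without puppeteer installed.
interface PageLike {
  evaluate(expression: string): Promise<unknown>;
}

// Hypothetical evaluate callback: returns the page's visible text instead of raw HTML.
async function extractBodyText(page: PageLike): Promise<string> {
  // The string expression is evaluated inside the browser page.
  const text = await page.evaluate("document.body.innerText");
  return typeof text === "string" ? text.trim() : "";
}

// Illustrative values only; tune them for the site being loaded.
const loaderOptions = {
  // headless runs the browser without a window; slowMo delays each operation (ms).
  launchOptions: { headless: true, slowMo: 50 },
  // Treat navigation as done at DOMContentLoaded; give up after 30 seconds.
  gotoOptions: { waitUntil: "domcontentloaded" as const, timeout: 30000 },
  evaluate: extractBodyText,
};

// Intended use (requires @langchain/community and puppeteer):
// const loader = new PuppeteerWebBaseLoader("https://www.langchain.ac.cn", loaderOptions);
console.log(Object.keys(loaderOptions));
```

With an evaluate callback like this one, the loaded document's pageContent should contain whatever string the callback resolves to, rather than the page's HTML.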

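Screenshot

To take a screenshot of a site, initialize the loader the same way as above and call the .screenshot() method. This returns a Document instance whose page content is a base64-encoded image and whose metadata contains a source field with the URL of the page.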

import { PuppeteerWebBaseLoader } from "@langchain/community/document_loaders/web/puppeteer";

const loaderForScreenshot = new PuppeteerWebBaseLoader(
  "https://www.langchain.ac.cn",
  {
    launchOptions: {
      headless: true,
    },
    gotoOptions: {
      waitUntil: "domcontentloaded",
    },
  }
);
const screenshot = await loaderForScreenshot.screenshot();

console.log(screenshot.pageContent.slice(0, 100));
console.log(screenshot.metadata);
iVBORw0KGgoAAAANSUhEUgAACWAAAAdoCAIAAAA/Q2IJAAAAAXNSR0IArs4c6QAAIABJREFUeJzsvUuzHUeSJuaPiMjMk3nOuU88
{ source: 'https://www.langchain.ac.cn' }
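
The base64 string can be handed to anything that accepts image data. As a rough sketch, assuming the screenshot is a PNG (the iVBORw0KGg prefix in the output above is what the PNG file signature looks like in base64), the hypothetical helper toPngDataUrl below wraps the content in a data URL after a basic sanity check. The sample string is a truncated placeholder, not a complete image.

```typescript
// The 8-byte PNG file signature always base64-encodes to a string starting with this prefix.
const PNG_BASE64_PREFIX = "iVBORw0KGg";

// Hypothetical helper (not part of LangChain): wrap base64 PNG data in a data URL.
function toPngDataUrl(base64: string): string {
  if (!base64.startsWith(PNG_BASE64_PREFIX)) {
    throw new Error("Content does not look like base64-encoded PNG data");
  }
  return `data:image/png;base64,${base64}`;
}

// Truncated stand-in for screenshot.pageContent; not a complete image.
const sampleContent = "iVBORw0KGgoAAAANSUhEUgAACWAAAAdoCAIAAAA";

console.log(toPngDataUrl(sampleContent).slice(0, 40));
```

To save the image to disk instead, the string can be decoded in Node.js with Buffer.from(screenshot.pageContent, "base64") and written to a file.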

API reference

For detailed documentation of all PuppeteerWebBaseLoader features and configurations, head to the API reference: https://api.js.langchain.com/classes/langchain_community_document_loaders_web_puppeteer.PuppeteerWebBaseLoader.html

