Cheerio

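This notebook provides a quick overview for getting started with CheerioWebBaseLoader. For detailed documentation of all CheerioWebBaseLoader features and configurations, head to the API reference.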

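Overview

Integration details

This example shows how to load data from webpages using Cheerio. One document is created for each webpage.

Cheerio is a fast, lightweight library that lets you parse and traverse HTML documents using a jQuery-like syntax. You can use Cheerio to extract data from web pages without rendering them in a browser.

However, Cheerio does not simulate a web browser, so it cannot execute JavaScript code on the page. This means it cannot extract data from dynamic web pages that require JavaScript to render. For those pages, you can use PlaywrightWebBaseLoader or PuppeteerWebBaseLoader instead.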

Class | Package | Local | Serializable | PY support
CheerioWebBaseLoader | @langchain/community | | |

Loader features

Source | Web support | Node support
CheerioWebBaseLoader | |

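Setup

To access the CheerioWebBaseLoader document loader, you need to install the @langchain/community integration package along with the cheerio peer dependency.

Credentials

If you want automated tracing of your model calls, you can also set your LangSmith API key by uncommenting the lines below: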

# export LANGSMITH_TRACING="true"
# export LANGSMITH_API_KEY="your-api-key"

Installation

The LangChain CheerioWebBaseLoader integration lives in the @langchain/community package:

Tip

See this section for general instructions on installing integration packages.

yarn add @langchain/community @langchain/core cheerio

Instantiation

Now we can instantiate our loader object and load documents:

import { CheerioWebBaseLoader } from "@langchain/community/document_loaders/web/cheerio";

const loader = new CheerioWebBaseLoader(
  "https://news.ycombinator.com/item?id=34817881",
  {
    // optional params: ...
  }
);
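
Because each web page yields one document, loading several pages means creating one loader per URL. The following is a minimal sketch of that pattern, not part of the LangChain API: PageDoc, PageLoader, loadAll, and the makeLoader factory are hypothetical names, and the interfaces mirror only the load() method and the pageContent and metadata fields shown on this page.

```typescript
// Hypothetical minimal shapes, mirroring only what this page shows:
// a loader exposes load(), and a document has pageContent and metadata.
interface PageDoc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

interface PageLoader {
  load(): Promise<PageDoc[]>;
}

// Build one loader per URL, load them concurrently, and flatten the results.
// `makeLoader` is an assumed factory supplied by the caller, for example:
//   (url) => new CheerioWebBaseLoader(url)
async function loadAll(
  urls: string[],
  makeLoader: (url: string) => PageLoader
): Promise<PageDoc[]> {
  const perPage = await Promise.all(urls.map((url) => makeLoader(url).load()));
  // Promise.all preserves input order, so documents follow the order of `urls`.
  return ([] as PageDoc[]).concat(...perPage);
}
```

Passing the factory in, rather than constructing the loader inside loadAll, keeps the helper independent of any particular loader class and makes it easy to exercise with a stub.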

Load

const docs = await loader.load();
docs[0];
Document {
pageContent: '\n' +
' \n' +
' Hacker News\n' +
' new | past | comments | ask | show | jobs | submit \n' +
' login\n' +
' \n' +
' \n' +
'\n' +
' \n' +
' What Lights the Universe’s Standard Candles? (quantamagazine.org)\n' +
' 75 points by Amorymeltzer on Feb 17, 2023 | hide | past | favorite | 6 comments \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' delta_p_delta_x on Feb 17, 2023 \n' +
' | next [–] \n' +
' \n' +
" Astrophysical and cosmological simulations are often insightful. They're also very cross-disciplinary; besides the obvious astrophysics, there's networking and sysadmin, parallel computing and algorithm theory (so that the simulation programs are actually fast but still accurate), systems design, and even a bit of graphic design for the visualisations.Some of my favourite simulation projects:- IllustrisTNG: https://www.tng-project.org/- SWIFT: https://swift.dur.ac.uk/- CO5BOLD: https://www.astro.uu.se/~bf/co5bold_main.html (which produced these animations of a red-giant star: https://www.astro.uu.se/~bf/movie/AGBmovie.html)- AbacusSummit: https://abacussummit.readthedocs.io/en/latest/And I can add the simulations in the article, too.\n" +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' froeb on Feb 18, 2023 \n' +
' | parent | next [–] \n' +
' \n' +
" Supernova simulations are especially interesting too. I have heard them described as the only time in physics when all 4 of the fundamental forces are important. The explosion can be quite finicky too. If I remember right, you can't get supernova to explode properly in 1D simulations, only in higher dimensions. This was a mystery until the realization that turbulence is necessary for supernova to trigger--there is no turbulent flow in 1D.\n" +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' andrewflnr on Feb 17, 2023 \n' +
' | prev | next [–] \n' +
' \n' +
" Whoa. I didn't know the accretion theory of Ia supernovae was dead, much less that it had been since 2011.\n" +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' andreareina on Feb 17, 2023 \n' +
' | prev | next [–] \n' +
' \n' +
' This seems to be the paper https://academic.oup.com/mnras/article/517/4/5260/6779709\n' +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' andreareina on Feb 17, 2023 \n' +
' | prev [–] \n' +
' \n' +
" Wouldn't double detonation show up as variance in the brightness?\n" +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' yencabulator on Feb 18, 2023 \n' +
' | parent [–] \n' +
' \n' +
' Or widening of the peak. If one type Ia supernova goes 1,2,3,2,1, the sum of two could go 1+0=1\n' +
' 2+1=3\n' +
' 3+2=5\n' +
' 2+3=5\n' +
' 1+2=3\n' +
' 0+1=1\n' +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
'\n' +
'\n' +
'Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact\n' +
'Search: \n' +
' \n' +
' \n',
metadata: { source: 'https://news.ycombinator.com/item?id=34817881' },
id: undefined
}
console.log(docs[0].metadata);
{ source: 'https://news.ycombinator.com/item?id=34817881' }
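
The pageContent above contains many whitespace-only lines left over from the page's HTML layout. If that is unwanted, the text can be tidied after loading. The sketch below assumes plain string cleanup is enough for your use case; normalizeWhitespace is a hypothetical helper, not a LangChain function, and the sample string is a shortened stand-in for the output shown above.

```typescript
// Hypothetical helper (not part of LangChain): tidy raw text returned by a loader.
function normalizeWhitespace(text: string): string {
  return text
    .split("\n")
    // Collapse runs of whitespace inside each line and trim both ends.
    .map((line) => line.replace(/\s+/g, " ").trim())
    // Drop lines that are empty after trimming.
    .filter((line) => line.length > 0)
    .join("\n");
}

// Shortened stand-in for the kind of pageContent shown above.
const raw = "\n   \n  Hacker News\n   new | past | comments \n   \n   login\n";
console.log(normalizeWhitespace(raw));
// Hacker News
// new | past | comments
// login
```

The same function could be applied to docs[0].pageContent before passing the document on to later steps.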

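Additional configurations

CheerioWebBaseLoader supports additional configuration when instantiating the loader. Here is an example that passes the selector field so that only content from elements matching the given selector (here, p elements) is loaded: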

import { CheerioWebBaseLoader } from "@langchain/community/document_loaders/web/cheerio";

const loaderWithSelector = new CheerioWebBaseLoader(
  "https://news.ycombinator.com/item?id=34817881",
  {
    selector: "p",
  }
);

const docsWithSelector = await loaderWithSelector.load();
docsWithSelector[0].pageContent;
Some of my favourite simulation projects:- IllustrisTNG: https://www.tng-project.org/- SWIFT: https://swift.dur.ac.uk/- CO5BOLD: https://www.astro.uu.se/~bf/co5bold_main.html (which produced these animations of a red-giant star: https://www.astro.uu.se/~bf/movie/AGBmovie.html)- AbacusSummit: https://abacussummit.readthedocs.io/en/latest/And I can add the simulations in the article, too.

API reference

For detailed documentation of all CheerioWebBaseLoader features and configurations, head to the API reference: https://api.js.langchain.com/classes/langchain_community_document_loaders_web_cheerio.CheerioWebBaseLoader.html
