Cheerio
此笔记本提供有关使用 CheerioWebBaseLoader 入门的快速概述。有关 CheerioWebBaseLoader 所有功能和配置的详细文档,请转到 API 参考。
概述
集成详情
本示例介绍了如何使用 Cheerio 从网页加载数据。将为每个网页创建一个文档。
Cheerio 是一个快速且轻量级的库,允许您使用类似 jQuery 的语法解析和遍历 HTML 文档。您可以使用 Cheerio 从网页中提取数据,而无需在浏览器中渲染它们。
但是,Cheerio 不会模拟 Web 浏览器,因此它无法在页面上执行 JavaScript 代码。这意味着它无法从需要 JavaScript 才能呈现的动态网页中提取数据。要做到这一点,您可以使用 PlaywrightWebBaseLoader
或 PuppeteerWebBaseLoader
代替。
类 | 包 | 本地 | 可序列化 | PY 支持 |
---|---|---|---|---|
CheerioWebBaseLoader | @langchain/community | ✅ | ✅ | ❌ |
加载器功能
来源 | 网络支持 | 节点支持 |
---|---|---|
CheerioWebBaseLoader | ✅ | ✅ |
设置
要访问 CheerioWebBaseLoader
文档加载器,您需要安装 @langchain/community
集成包,以及 cheerio
对等依赖项。
凭据
如果您想获得模型调用的自动跟踪,您也可以通过取消下面注释来设置您的 LangSmith API 密钥
# export LANGCHAIN_TRACING_V2="true"
# export LANGCHAIN_API_KEY="your-api-key"
安装
LangChain CheerioWebBaseLoader 集成位于 @langchain/community
包中
有关安装集成包的一般说明,请参阅 此部分。
- npm
- yarn
- pnpm
npm i @langchain/community @langchain/core cheerio
yarn add @langchain/community @langchain/core cheerio
pnpm add @langchain/community @langchain/core cheerio
实例化
现在我们可以实例化我们的模型对象并加载文档
import { CheerioWebBaseLoader } from "@langchain/community/document_loaders/web/cheerio";
const loader = new CheerioWebBaseLoader(
"https://news.ycombinator.com/item?id=34817881",
{
// optional params: ...
}
);
加载
const docs = await loader.load();
docs[0];
Document {
pageContent: '\n' +
' \n' +
' Hacker News\n' +
' new | past | comments | ask | show | jobs | submit \n' +
' login\n' +
' \n' +
' \n' +
'\n' +
' \n' +
' What Lights the Universe’s Standard Candles? (quantamagazine.org)\n' +
' 75 points by Amorymeltzer on Feb 17, 2023 | hide | past | favorite | 6 comments \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' delta_p_delta_x on Feb 17, 2023 \n' +
' | next [–] \n' +
' \n' +
" Astrophysical and cosmological simulations are often insightful. They're also very cross-disciplinary; besides the obvious astrophysics, there's networking and sysadmin, parallel computing and algorithm theory (so that the simulation programs are actually fast but still accurate), systems design, and even a bit of graphic design for the visualisations.Some of my favourite simulation projects:- IllustrisTNG: https://www.tng-project.org/- SWIFT: https://swift.dur.ac.uk/- CO5BOLD: https://www.astro.uu.se/~bf/co5bold_main.html (which produced these animations of a red-giant star: https://www.astro.uu.se/~bf/movie/AGBmovie.html)- AbacusSummit: https://abacussummit.readthedocs.io/en/latest/And I can add the simulations in the article, too.\n" +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' froeb on Feb 18, 2023 \n' +
' | parent | next [–] \n' +
' \n' +
" Supernova simulations are especially interesting too. I have heard them described as the only time in physics when all 4 of the fundamental forces are important. The explosion can be quite finicky too. If I remember right, you can't get supernova to explode properly in 1D simulations, only in higher dimensions. This was a mystery until the realization that turbulence is necessary for supernova to trigger--there is no turbulent flow in 1D.\n" +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' andrewflnr on Feb 17, 2023 \n' +
' | prev | next [–] \n' +
' \n' +
" Whoa. I didn't know the accretion theory of Ia supernovae was dead, much less that it had been since 2011.\n" +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' andreareina on Feb 17, 2023 \n' +
' | prev | next [–] \n' +
' \n' +
' This seems to be the paper https://academic.oup.com/mnras/article/517/4/5260/6779709\n' +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' andreareina on Feb 17, 2023 \n' +
' | prev [–] \n' +
' \n' +
" Wouldn't double detonation show up as variance in the brightness?\n" +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' yencabulator on Feb 18, 2023 \n' +
' | parent [–] \n' +
' \n' +
' Or widening of the peak. If one type Ia supernova goes 1,2,3,2,1, the sum of two could go 1+0=1\n' +
' 2+1=3\n' +
' 3+2=5\n' +
' 2+3=5\n' +
' 1+2=3\n' +
' 0+1=1\n' +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
'\n' +
'\n' +
'Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact\n' +
'Search: \n' +
' \n' +
' \n',
metadata: { source: 'https://news.ycombinator.com/item?id=34817881' },
id: undefined
}
console.log(docs[0].metadata);
{ source: 'https://news.ycombinator.com/item?id=34817881' }
其他配置
CheerioWebBaseLoader
在实例化加载器时支持其他配置。这是一个关于如何使用它与传递的 selector
字段的示例,使其仅加载来自提供的 HTML 类名的内容
import { CheerioWebBaseLoader } from "@langchain/community/document_loaders/web/cheerio";
const loaderWithSelector = new CheerioWebBaseLoader(
"https://news.ycombinator.com/item?id=34817881",
{
selector: "p",
}
);
const docsWithSelector = await loaderWithSelector.load();
docsWithSelector[0].pageContent;
Some of my favourite simulation projects:- IllustrisTNG: https://www.tng-project.org/- SWIFT: https://swift.dur.ac.uk/- CO5BOLD: https://www.astro.uu.se/~bf/co5bold_main.html (which produced these animations of a red-giant star: https://www.astro.uu.se/~bf/movie/AGBmovie.html)- AbacusSummit: https://abacussummit.readthedocs.io/en/latest/And I can add the simulations in the article, too.
API 参考
有关 CheerioWebBaseLoader 所有功能和配置的详细文档,请转到 API 参考: https://api.js.langchain.com/classes/langchain_community_document_loaders_web_cheerio.CheerioWebBaseLoader.html