Cheerio
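This notebook provides a quick overview for getting started with CheerioWebBaseLoader. For detailed documentation of all CheerioWebBaseLoader features and configurations, head to the API reference.
Overview
Integration details
This example goes over how to load data from webpages using Cheerio. One document is created for each webpage.
Cheerio is a fast, lightweight library that lets you parse and traverse HTML documents with a jQuery-like syntax. You can use Cheerio to extract data from web pages without rendering them in a browser.
However, Cheerio does not simulate a web browser, so it cannot execute JavaScript code on the page. This means it cannot extract data from dynamic web pages that require JavaScript to render. For those pages, use PlaywrightWebBaseLoader or PuppeteerWebBaseLoader instead.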
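The limitation described above can be seen by using Cheerio directly on a small inline page. This is a minimal sketch, assuming the `cheerio` package from the Setup section is installed; the HTML string and variable names are made up for illustration. The heading is present in the static markup, while the `#app` element would only receive text if a browser ran the inline script.

```typescript
import * as cheerio from "cheerio";

// Illustrative static HTML. The <script> would fill #app in a real browser,
// but Cheerio only parses markup and never runs it.
const demoHtml = `
<html>
  <head><title>Demo page</title></head>
  <body>
    <h1 class="headline">Static heading</h1>
    <div id="app"></div>
    <script>
      document.getElementById("app").textContent = "Rendered by JavaScript";
    </script>
  </body>
</html>`;

const $demo = cheerio.load(demoHtml);

// jQuery-like selectors work on the parsed document.
const headline: string = $demo("h1.headline").text();

// The script was not executed, so #app is still empty.
const appText: string = $demo("#app").text();

console.log(headline);
console.log(JSON.stringify(appText));
```

If the content you need looks like `#app` here, that is, empty until scripts run, a browser-based loader is the better fit.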
Class | Package | Local | Serializable | PY support |
---|---|---|---|---|
CheerioWebBaseLoader | @langchain/community | ✅ | ✅ | ❌ |
Loader features
Source | Web support | Node support |
---|---|---|
CheerioWebBaseLoader | ✅ | ✅ |
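Setup
To access the CheerioWebBaseLoader document loader, you'll need to install the @langchain/community integration package, along with the cheerio peer dependency.
Credentials
If you want automated tracing of your model calls, you can also set your LangSmith API key by uncommenting the lines below: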
# export LANGSMITH_TRACING="true"
# export LANGSMITH_API_KEY="your-api-key"
Installation
The LangChain CheerioWebBaseLoader integration lives in the @langchain/community package.
See this section for general instructions on installing integration packages.
- npm: npm i @langchain/community @langchain/core cheerio
- yarn: yarn add @langchain/community @langchain/core cheerio
- pnpm: pnpm add @langchain/community @langchain/core cheerio
Instantiation
Now we can instantiate our loader object and load documents:
import { CheerioWebBaseLoader } from "@langchain/community/document_loaders/web/cheerio";
const loader = new CheerioWebBaseLoader(
"https://news.ycombinator.com/item?id=34817881",
{
// optional params: ...
}
);
Load
const docs = await loader.load();
docs[0];
Document {
pageContent: '\n' +
' \n' +
' Hacker News\n' +
' new | past | comments | ask | show | jobs | submit \n' +
' login\n' +
' \n' +
' \n' +
'\n' +
' \n' +
' What Lights the Universe’s Standard Candles? (quantamagazine.org)\n' +
' 75 points by Amorymeltzer on Feb 17, 2023 | hide | past | favorite | 6 comments \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' delta_p_delta_x on Feb 17, 2023 \n' +
' | next [–] \n' +
' \n' +
" Astrophysical and cosmological simulations are often insightful. They're also very cross-disciplinary; besides the obvious astrophysics, there's networking and sysadmin, parallel computing and algorithm theory (so that the simulation programs are actually fast but still accurate), systems design, and even a bit of graphic design for the visualisations.Some of my favourite simulation projects:- IllustrisTNG: https://www.tng-project.org/- SWIFT: https://swift.dur.ac.uk/- CO5BOLD: https://www.astro.uu.se/~bf/co5bold_main.html (which produced these animations of a red-giant star: https://www.astro.uu.se/~bf/movie/AGBmovie.html)- AbacusSummit: https://abacussummit.readthedocs.io/en/latest/And I can add the simulations in the article, too.\n" +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' froeb on Feb 18, 2023 \n' +
' | parent | next [–] \n' +
' \n' +
" Supernova simulations are especially interesting too. I have heard them described as the only time in physics when all 4 of the fundamental forces are important. The explosion can be quite finicky too. If I remember right, you can't get supernova to explode properly in 1D simulations, only in higher dimensions. This was a mystery until the realization that turbulence is necessary for supernova to trigger--there is no turbulent flow in 1D.\n" +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' andrewflnr on Feb 17, 2023 \n' +
' | prev | next [–] \n' +
' \n' +
" Whoa. I didn't know the accretion theory of Ia supernovae was dead, much less that it had been since 2011.\n" +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' andreareina on Feb 17, 2023 \n' +
' | prev | next [–] \n' +
' \n' +
' This seems to be the paper https://academic.oup.com/mnras/article/517/4/5260/6779709\n' +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' andreareina on Feb 17, 2023 \n' +
' | prev [–] \n' +
' \n' +
" Wouldn't double detonation show up as variance in the brightness?\n" +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' yencabulator on Feb 18, 2023 \n' +
' | parent [–] \n' +
' \n' +
' Or widening of the peak. If one type Ia supernova goes 1,2,3,2,1, the sum of two could go 1+0=1\n' +
' 2+1=3\n' +
' 3+2=5\n' +
' 2+3=5\n' +
' 1+2=3\n' +
' 0+1=1\n' +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
'\n' +
'\n' +
'Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact\n' +
'Search: \n' +
' \n' +
' \n',
metadata: { source: 'https://news.ycombinator.com/item?id=34817881' },
id: undefined
}
console.log(docs[0].metadata);
{ source: 'https://news.ycombinator.com/item?id=34817881' }
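The pageContent shown above carries a lot of layout whitespace from the original HTML. If that gets in the way downstream, one option is to compact it after loading. The helper below is a hypothetical sketch, not part of LangChain or Cheerio: it collapses runs of whitespace inside each line and drops lines that end up empty.

```typescript
// Hypothetical helper (not a LangChain API): trims each line, collapses
// internal whitespace runs to a single space, and removes blank lines.
function compactWhitespace(text: string): string {
  return text
    .split("\n")
    .map((line) => line.replace(/\s+/g, " ").trim())
    .filter((line) => line.length > 0)
    .join("\n");
}

// A short excerpt shaped like the start of the output above.
const rawExcerpt = "\n    \n  Hacker News\n   new | past | comments \n  login\n\n";
const compactedExcerpt = compactWhitespace(rawExcerpt);

console.log(compactedExcerpt);
```

With a loaded document, the same function could be applied as `compactWhitespace(docs[0].pageContent)`. Whether dropping blank lines is acceptable depends on how much of the page layout you need to preserve.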
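Additional configurations
CheerioWebBaseLoader supports additional configuration when instantiating the loader. Here is an example that passes the selector field so that only content from elements matching the given CSS selector (here, p elements) is loaded: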
import { CheerioWebBaseLoader } from "@langchain/community/document_loaders/web/cheerio";
const loaderWithSelector = new CheerioWebBaseLoader(
"https://news.ycombinator.com/item?id=34817881",
{
selector: "p",
}
);
const docsWithSelector = await loaderWithSelector.load();
docsWithSelector[0].pageContent;
Some of my favourite simulation projects:- IllustrisTNG: https://www.tng-project.org/- SWIFT: https://swift.dur.ac.uk/- CO5BOLD: https://www.astro.uu.se/~bf/co5bold_main.html (which produced these animations of a red-giant star: https://www.astro.uu.se/~bf/movie/AGBmovie.html)- AbacusSummit: https://abacussummit.readthedocs.io/en/latest/And I can add the simulations in the article, too.
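In the output above, paragraphs run together with no separator (for example "latest/And I can add"). That is consistent with how Cheerio's `.text()` behaves on a multi-element selection. The sketch below uses Cheerio directly on made-up HTML to show the effect and one possible way to keep paragraph boundaries; it illustrates Cheerio's behavior and is not a description of the loader's internals.

```typescript
import * as cheerio from "cheerio";

// Illustrative markup: one non-paragraph element and two paragraphs.
const sampleHtml = `
<body>
  <div class="nav">Hacker News | login</div>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</body>`;

const $page = cheerio.load(sampleHtml);

// .text() on a selection concatenates the text of every matched element
// with no separator between them.
const joinedText: string = $page("p").text();

// Alternative: read each matched element separately and join with newlines.
const separatedText: string = $page("p")
  .toArray()
  .map((el) => $page(el).text())
  .join("\n");

console.log(joinedText);
console.log(separatedText);
```

Only the `p` elements contribute to either string; the `div.nav` text is excluded, which is the narrowing effect a selector is meant to provide.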
API reference
For detailed documentation of all CheerioWebBaseLoader features and configurations, head to the API reference: https://api.js.langchain.com/classes/langchain_community_document_loaders_web_cheerio.CheerioWebBaseLoader.html