
Cheerio

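This notebook provides a quick overview for getting started with CheerioWebBaseLoader. For detailed documentation of all CheerioWebBaseLoader features and configurations, head to the API reference.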

Overview

Integration details

This example shows how to use Cheerio to load data from web pages. One document is created for each web page.

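Cheerio is a fast, lightweight library that lets you parse and traverse HTML documents with a jQuery-like syntax. You can use Cheerio to extract data from web pages without rendering them in a browser.

However, Cheerio does not emulate a web browser, so it cannot execute JavaScript code on the page. This means it cannot extract data from dynamic pages that require JavaScript to render. For those pages, use PlaywrightWebBaseLoader or PuppeteerWebBaseLoader instead.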
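To make the two points above concrete, here is a minimal standalone sketch that uses the cheerio package directly rather than the loader. The HTML snippet, the element ids, and the variable names are made up for illustration. It shows jQuery-style selection on static markup, and that content a script would insert at runtime is simply absent.

```typescript
import * as cheerio from "cheerio";

// Illustrative static HTML; the inline script would only run in a real browser.
const html = `
<ul id="comments">
  <li class="comment">First comment</li>
  <li class="comment">Second comment</li>
</ul>
<div id="dynamic"></div>
<script>document.getElementById("dynamic").textContent = "Rendered by JS";</script>
`;

// Parse the markup; no browser is involved and no script is executed.
const $ = cheerio.load(html);

// jQuery-like selection: collect the text of every <li class="comment">.
const comments: string[] = $("li.comment")
  .toArray()
  .map((el) => $(el).text());

// The div stays empty because the script above never ran.
const dynamicText: string = $("#dynamic").text();

console.log(comments);
console.log(JSON.stringify(dynamicText));
```

If the text you need only appears after scripts run, this is the case where a browser-backed loader is the better fit.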

Class | Package | Local | Serializable | PY support
CheerioWebBaseLoader | @langchain/community | | |

Loader features

Source | Web support | Node support
CheerioWebBaseLoader | |

Setup

To access the CheerioWebBaseLoader document loader, you need to install the @langchain/community integration package along with the cheerio peer dependency.

Credentials

If you want automated tracing of your model calls, you can also set your LangSmith API key by uncommenting the lines below:

# export LANGCHAIN_TRACING_V2="true"
# export LANGCHAIN_API_KEY="your-api-key"

Installation

The LangChain CheerioWebBaseLoader integration lives in the @langchain/community package:

Tip

For general instructions on installing integration packages, see this section.

yarn add @langchain/community @langchain/core cheerio

Instantiation

Now we can instantiate our loader object and load documents:

import { CheerioWebBaseLoader } from "@langchain/community/document_loaders/web/cheerio";

const loader = new CheerioWebBaseLoader(
  "https://news.ycombinator.com/item?id=34817881",
  {
    // optional params: ...
  }
);

Load

const docs = await loader.load();
docs[0];
Document {
pageContent: '\n' +
' \n' +
' Hacker News\n' +
' new | past | comments | ask | show | jobs | submit \n' +
' login\n' +
' \n' +
' \n' +
'\n' +
' \n' +
' What Lights the Universe’s Standard Candles? (quantamagazine.org)\n' +
' 75 points by Amorymeltzer on Feb 17, 2023 | hide | past | favorite | 6 comments \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' delta_p_delta_x on Feb 17, 2023 \n' +
' | next [–] \n' +
' \n' +
" Astrophysical and cosmological simulations are often insightful. They're also very cross-disciplinary; besides the obvious astrophysics, there's networking and sysadmin, parallel computing and algorithm theory (so that the simulation programs are actually fast but still accurate), systems design, and even a bit of graphic design for the visualisations.Some of my favourite simulation projects:- IllustrisTNG: https://www.tng-project.org/- SWIFT: https://swift.dur.ac.uk/- CO5BOLD: https://www.astro.uu.se/~bf/co5bold_main.html (which produced these animations of a red-giant star: https://www.astro.uu.se/~bf/movie/AGBmovie.html)- AbacusSummit: https://abacussummit.readthedocs.io/en/latest/And I can add the simulations in the article, too.\n" +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' froeb on Feb 18, 2023 \n' +
' | parent | next [–] \n' +
' \n' +
" Supernova simulations are especially interesting too. I have heard them described as the only time in physics when all 4 of the fundamental forces are important. The explosion can be quite finicky too. If I remember right, you can't get supernova to explode properly in 1D simulations, only in higher dimensions. This was a mystery until the realization that turbulence is necessary for supernova to trigger--there is no turbulent flow in 1D.\n" +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' andrewflnr on Feb 17, 2023 \n' +
' | prev | next [–] \n' +
' \n' +
" Whoa. I didn't know the accretion theory of Ia supernovae was dead, much less that it had been since 2011.\n" +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' andreareina on Feb 17, 2023 \n' +
' | prev | next [–] \n' +
' \n' +
' This seems to be the paper https://academic.oup.com/mnras/article/517/4/5260/6779709\n' +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' andreareina on Feb 17, 2023 \n' +
' | prev [–] \n' +
' \n' +
" Wouldn't double detonation show up as variance in the brightness?\n" +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' yencabulator on Feb 18, 2023 \n' +
' | parent [–] \n' +
' \n' +
' Or widening of the peak. If one type Ia supernova goes 1,2,3,2,1, the sum of two could go 1+0=1\n' +
' 2+1=3\n' +
' 3+2=5\n' +
' 2+3=5\n' +
' 1+2=3\n' +
' 0+1=1\n' +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
' \n' +
'\n' +
'\n' +
'Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact\n' +
'Search: \n' +
' \n' +
' \n',
metadata: { source: 'https://news.ycombinator.com/item?id=34817881' },
id: undefined
}
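The raw pageContent above is dominated by whitespace-only lines left over from the page's HTML layout. If that matters downstream, a small post-processing step can collapse it. The sketch below is one possible approach and is not part of the loader: `LoadedDoc` and `tidyPageContent` are hypothetical names, and `LoadedDoc` is only a minimal stand-in for the document shape printed above.

```typescript
// Hypothetical minimal stand-in for the shape of a loaded document.
interface LoadedDoc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Returns a copy of the document with each line trimmed and
// whitespace-only lines dropped; metadata is carried over unchanged.
function tidyPageContent(doc: LoadedDoc): LoadedDoc {
  const lines = doc.pageContent
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  return { ...doc, pageContent: lines.join("\n") };
}

// Example input that mimics the start of the output shown above.
const sample: LoadedDoc = {
  pageContent: "\n  \n  Hacker News\n   login\n \n",
  metadata: { source: "https://news.ycombinator.com/item?id=34817881" },
};

console.log(JSON.stringify(tidyPageContent(sample).pageContent));
```

Trimming every line also removes indentation inside the text itself, so this only suits content where leading whitespace carries no meaning.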
console.log(docs[0].metadata);
{ source: 'https://news.ycombinator.com/item?id=34817881' }
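Because each web page yields its own document, loading several pages comes down to running one loader per URL and collecting the results. The following is a minimal sketch of that pattern under stated assumptions: `PageDoc`, `PageLoader`, `loadPages`, and `fakeLoader` are hypothetical names introduced here, not LangChain APIs, and the fake loader stands in for a real one so the sketch runs without network access.

```typescript
// Hypothetical minimal shapes for this sketch; not LangChain types.
interface PageDoc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

interface PageLoader {
  load(): Promise<PageDoc[]>;
}

// Runs one loader per URL in parallel and flattens the results,
// preserving the order of the input URLs.
async function loadPages(
  urls: string[],
  makeLoader: (url: string) => PageLoader
): Promise<PageDoc[]> {
  const perPage = await Promise.all(urls.map((url) => makeLoader(url).load()));
  return ([] as PageDoc[]).concat(...perPage);
}

// Stand-in loader: returns one synthetic document per URL.
const fakeLoader = (url: string): PageLoader => ({
  load: async () => [
    { pageContent: `content of ${url}`, metadata: { source: url } },
  ],
});

loadPages(["https://a.example", "https://b.example"], fakeLoader).then((docs) =>
  console.log(docs.map((d) => d.metadata.source))
);
```

With the real loader, the factory would presumably be something like `(url) => new CheerioWebBaseLoader(url)`, assuming its `load()` result is structurally compatible with `PageDoc`.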

Additional configurations

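CheerioWebBaseLoader supports additional configuration when the loader is instantiated. The following example passes the selector field so that only content matching the given CSS selector is loaded (here, the page's p elements):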

import { CheerioWebBaseLoader } from "@langchain/community/document_loaders/web/cheerio";

const loaderWithSelector = new CheerioWebBaseLoader(
  "https://news.ycombinator.com/item?id=34817881",
  {
    selector: "p",
  }
);

const docsWithSelector = await loaderWithSelector.load();
docsWithSelector[0].pageContent;
Some of my favourite simulation projects:- IllustrisTNG: https://www.tng-project.org/- SWIFT: https://swift.dur.ac.uk/- CO5BOLD: https://www.astro.uu.se/~bf/co5bold_main.html (which produced these animations of a red-giant star: https://www.astro.uu.se/~bf/movie/AGBmovie.html)- AbacusSummit: https://abacussummit.readthedocs.io/en/latest/And I can add the simulations in the article, too.
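Judging from the output above, the text of the matching elements is run together without separators. The standalone Cheerio sketch below illustrates how a tag selector and a narrower class-based selector differ in what they match. It is not the loader's actual implementation; the HTML, the class names, and the variable names are made up for illustration.

```typescript
import * as cheerio from "cheerio";

// Illustrative markup with paragraphs inside two differently classed containers.
const html = `
<div class="comment"><p>First paragraph.</p><p>Second paragraph.</p></div>
<div class="footer"><p>Footer text.</p></div>
`;

const $ = cheerio.load(html);

// Tag selector: matches every <p>; .text() concatenates the text of all matches.
const allParagraphs: string = $("p").text();

// Descendant selector: only <p> elements inside an element with class "comment".
const commentOnly: string = $(".comment p").text();

console.log(allParagraphs);
console.log(commentOnly);
```

A selector string of this kind is what you would pass in the selector field; a narrower selector helps keep navigation and footer text out of the result.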

API reference

For detailed documentation of all CheerioWebBaseLoader features and configurations, head to the API reference: https://api.js.langchain.com/classes/langchain_community_document_loaders_web_cheerio.CheerioWebBaseLoader.html

