跳至主要内容

如何加载 Markdown

Markdown 是一种轻量级标记语言,用于使用纯文本编辑器创建格式化文本。

这里我们将介绍如何将 Markdown 文档加载到 LangChain Document 对象中,以便我们可以在下游使用它们。

我们将介绍

  • 基本用法;
  • 将 Markdown 解析为元素,如标题、列表项和文本。

LangChain 实现了一个 UnstructuredLoader 类。

先决条件

本指南假设你熟悉以下概念

安装

yarn add @langchain/community

设置

虽然 Unstructured 提供开源版本,但你仍然需要提供 API 密钥才能访问服务。要使一切都正常运行,请执行以下两个步骤

  1. 下载并启动 Docker 容器
docker run -p 8000:8000 -d --rm --name unstructured-api downloads.unstructured.io/unstructured-io/unstructured-api:latest --port 8000 --host 0.0.0.0
  1. 这里 获取免费 API 密钥和 API URL,并在你的环境中设置它(根据 Unstructured 网站,分配你的 API 密钥和 URL 可能需要长达一个小时。)
export UNSTRUCTURED_API_KEY="..."
# Replace with your `Full URL` from the email
export UNSTRUCTURED_API_URL="https://<ORG_NAME>-<SECRET>.api.unstructuredapp.io/general/v0/general"

基本用法会将 Markdown 文件导入单个文档。这里我们演示了 LangChain 的自述文件

import { UnstructuredLoader } from "@langchain/community/document_loaders/fs/unstructured";

const markdownPath = "../../../../README.md";

const loader = new UnstructuredLoader(markdownPath, {
apiKey: process.env.UNSTRUCTURED_API_KEY,
apiUrl: process.env.UNSTRUCTURED_API_URL,
});

const data = await loader.load();
console.log(data.slice(0, 5));
[
Document {
pageContent: '🦜️🔗 LangChain.js',
metadata: {
languages: [Array],
filename: 'README.md',
filetype: 'text/markdown',
category: 'Title'
}
},
Document {
pageContent: '⚡ Building applications with LLMs through composability ⚡',
metadata: {
languages: [Array],
filename: 'README.md',
filetype: 'text/markdown',
category: 'Title'
}
},
Document {
pageContent: 'Looking for the Python version? Check out LangChain.',
metadata: {
languages: [Array],
parent_id: '7ea17bcb17b10f303cbb93b4cb95de93',
filename: 'README.md',
filetype: 'text/markdown',
category: 'NarrativeText'
}
},
Document {
pageContent: 'To help you ship LangChain apps to production faster, check out LangSmith.\n' +
'LangSmith is a unified developer platform for building, testing, and monitoring LLM applications.\n' +
'Fill out this form to get on the waitlist or speak with our sales team.',
metadata: {
languages: [Array],
parent_id: '7ea17bcb17b10f303cbb93b4cb95de93',
filename: 'README.md',
filetype: 'text/markdown',
category: 'NarrativeText'
}
},
Document {
pageContent: '⚡️ Quick Install',
metadata: {
languages: [Array],
filename: 'README.md',
filetype: 'text/markdown',
category: 'Title'
}
}
]

保留元素

在后台,Unstructured 为不同的文本块创建不同的“元素”。默认情况下,我们会将它们组合在一起,但你可以通过指定 chunkingStrategy: "by_title" 来轻松地保持这种分离。

const loader = new UnstructuredLoader(markdownPath, {
chunkingStrategy: "by_title",
});

const data = await loader.load();

console.log(`Number of documents: ${data.length}\n`);

for (const doc of data.slice(0, 2)) {
console.log(doc);
console.log("\n");
}
Number of documents: 13

Document {
pageContent: '🦜️🔗 LangChain.js\n' +
'\n' +
'⚡ Building applications with LLMs through composability ⚡\n' +
'\n' +
'Looking for the Python version? Check out LangChain.\n' +
'\n' +
'To help you ship LangChain apps to production faster, check out LangSmith.\n' +
'LangSmith is a unified developer platform for building, testing, and monitoring LLM applications.\n' +
'Fill out this form to get on the waitlist or speak with our sales team.',
metadata: {
filename: 'README.md',
filetype: 'text/markdown',
languages: [ 'eng' ],
orig_elements: 'eJzNUtuO0zAQ/ZVRnquSS3PjBcGyPHURgr5tV2hijxNTJ45ip0u14t8Zp1y6CCF4ACFLlufuc+bcPkRkqKfBv9cyegpREWNZosxS0RRVzmeTCiFlnmRUFZmQ0QqinjxK9Mj5D5HShgbsKRS/vX7+8uZ63S9ZIeBP4xLw9NE/6XxvQsDg0M7YkuPIbURDG919Wp1zQu5+llVGfMta7GdFsVo8MniSErZcfdWhHtYfXOj2dcROe0MRN/oRUUmYlI1o+EpilcWZaJo6azaiqXNJdfYvEKUFJvBi1kbqoQUcR6MFem0HB/fad7Dd3jjw3WTntgNh+9E6bLTR/gTn4t9CmhHFTc1w80oKSUlTpFWaFKWsVR5nFf0dpOwdcfoDvi+p2Vp7CJQoOzF+gjcn39kBjjQ5ZucZXHUkDmBnf7H3Sy5e4zQxkUfahYY/4UQqVcZJpSpspKqSMslVllWJzDdMC6XVf8jJzkJHZoSTncF1evwOPSiHdWJhnKycRRAQKHSephWIR0y961lW6/3w7Q3aAcI8aKVJgqQjGTvSBKNBz+T3ywaaLwpdgSfnlwcOEno7aG+nsCcW6iP58ohX2phlru94xtKLf9iSB/5d2Ok9smC1Y3sCNxIezpq3M5toiAER9r/a6t1n6BJ/zg==',
category: 'CompositeElement'
}
}


Document {
pageContent: '⚡️ Quick Install\n' +
'\n' +
'You can use npm, yarn, or pnpm to install LangChain.js\n' +
'\n' +
'npm install -S langchain or yarn add langchain or pnpm add langchain\n' +
'\n' +
'typescript\n' +
'import { ChatOpenAI } from "langchain/chat_models/openai";\n' +
'\n' +
'🌐 Supported Environments\n' +
'\n' +
'LangChain is written in TypeScript and can be used in:\n' +
'\n' +
'Node.js (ESM and CommonJS) - 18.x, 19.x, 20.x\n' +
'\n' +
'Cloudflare Workers\n' +
'\n' +
'Vercel / Next.js (Browser, Serverless and Edge functions)\n' +
'\n' +
'Supabase Edge Functions\n' +
'\n' +
'Browser\n' +
'\n' +
'Deno',
metadata: {
filename: 'README.md',
filetype: 'text/markdown',
languages: [ 'eng' ],
orig_elements: 'eJzNlm1v2zYQx7/KQa9WwE1Iik/qXnWpB2RoM2wOOgx1URzJY6pVogyJTlME/e6j3KZIhgBzULjIG0Li3VH+/e/BfHNdUUc9pfyuDdUzqGzUjUUda1ZbL7R1UQetnNdMK9swVy2g6iljwIzF/7qKbUcJe5qD/1w+f/FqedSH2Ws25E+bnSHTVT5+n/tuNnSYLrZ4QVOxvKkoXVRvPy+++My+663QyNfbSCzCH9vWf4DTNGXsdsE3J563uaOqxP0XIDSxCdobSZIYd9w7JpQlLU3TaKf4YQDK7gbHB8h4m/jvYQseE2wngrTpF/AJx7SAYYRNeYU8QPtFAHhZvnzyHtt09M90W40zHEfM7SWdz0fep0otuUISLBqMjfNFjMYzI6SWFFWQj1CVGf2G++kK5uP9jD7rMgsEGMLd3Z1ad3YfpJHWsubSchGQeNRItUGPElF7wck2hy/9OWbyY7vJ69T2m2HMcA0l3/n3DaXnp/AZ4jj0sK6+AR6XNb/rh0DddDwUL2zX1c97NUpjVAEOxkh0tbOaN1qU1vG8VtYGe6CSuNvpwda+rJEzWG03MzAFWKbLdhzS/FOnvUhcdChlNC6iKBWuJVrCGMhxIaKMP6i4/1fP2+jfGhnaCT6Obc5UHhOcl4+vdhUAmMJuKjiaB0Mo1mcPKmdBvlFWK6ZMaXfNI2ojIvNORMsUHWiSf5cqZ6WOy2SDn5arVzv+k6Hvh/Tb6gk8BW6PrhbAm3kV7Ojqthgv2ymfZurvrQ4hvRLCSaUEj8YG77TzQTNriYv6B/0hPEiHk24oTdGVePhrGD/QOO0LyxRHKZivAxldS41akzXcxELPm/oxJv01jZ46OIazsrHL/i/j8HGicQErGi9p7GiadtWwDBcEcZt8boc0PdlXE9KlAoSkZh4PtUBZ5oRjTAbiSgd3oLn+XZqUYYgOy3Vgh/zrDfK+xA0rqY6GaQrGo5JM1azcgawzjeOa2CMk/przvXMayvXQEA8meEmCsxiDrkO54/iAVvtHSPiC0nA/3tt/AY+igwk=',
category: 'CompositeElement'
}
}

请注意,在这种情况下,我们只恢复了一种不同的元素类型

const categories = new Set(data.map((document) => document.metadata.category));
console.log(categories);
Set(1) { 'CompositeElement' }

本页面有帮助吗?


你也可以留下详细的反馈 在 GitHub 上.