How to split by character
Prerequisites
This guide assumes familiarity with text splitters.
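This is the simplest method for splitting text. It splits on a given character sequence, which defaults to "\n\n". Chunk length is measured in number of characters.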
- How the text is split: by a single character separator.
- How the chunk size is measured: by number of characters.
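The two points above can be illustrated with a minimal, dependency-free sketch. It is not the library's implementation: `splitByCharacter` is a hypothetical helper name, and overlap handling is deliberately left out. It only shows the idea of splitting on a separator and then merging pieces while the character count stays within the chunk size.

```typescript
// Simplified sketch, NOT the CharacterTextSplitter implementation.
// `splitByCharacter` is a hypothetical name used for illustration only.
function splitByCharacter(
  text: string,
  separator = "\n\n",
  chunkSize = 1000
): string[] {
  // Step 1: split on the separator and drop empty pieces.
  const pieces = text.split(separator).filter((p) => p.length > 0);

  // Step 2: greedily merge pieces while the merged text stays
  // within chunkSize, measured in characters.
  const chunks: string[] = [];
  let current = "";
  for (const piece of pieces) {
    const candidate = current === "" ? piece : current + separator + piece;
    if (current === "" || candidate.length <= chunkSize) {
      // A single piece longer than chunkSize is kept whole in this sketch.
      current = candidate;
    } else {
      chunks.push(current);
      current = piece;
    }
  }
  if (current !== "") chunks.push(current);
  return chunks;
}

// Three 4-character pieces with a 10-character limit:
// the first two fit together (4 + 2 + 4 = 10), the third starts a new chunk.
const sample = "aaaa\n\nbbbb\n\ncccc";
const sampleChunks = splitByCharacter(sample, "\n\n", 10);
console.log(sampleChunks);
```

Note that a piece longer than the chunk size is not broken up further here, because the only allowed split point is the separator.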
To obtain the string content directly, use .splitText().
To create LangChain Document objects (e.g., for use in downstream tasks), use .createDocuments().
import { CharacterTextSplitter } from "@langchain/textsplitters";
import * as fs from "node:fs";
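// Load an example document.
// readFileSync is synchronous and returns a Buffer, so no `await` is needed.
const rawData = fs.readFileSync(
  "../../../../examples/state_of_the_union.txt"
);
// Decode the Buffer into a string.
const stateOfTheUnion = rawData.toString();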
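// Split on blank lines, with chunk size and overlap measured in characters.
const textSplitter = new CharacterTextSplitter({
  separator: "\n\n",
  chunkSize: 1000,
  chunkOverlap: 200,
});
// createDocuments takes an array of texts and returns Document objects.
const texts = await textSplitter.createDocuments([stateOfTheUnion]);
console.log(texts[0]);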
Document {
pageContent: "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and th"... 839 more characters,
metadata: { loc: { lines: { from: 1, to: 17 } } }
}
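The configuration above also sets `chunkOverlap`. As a rough illustration of what overlapping chunks look like, here is a simplified sketch in which trailing pieces of one chunk are carried into the next, up to the overlap limit. The helper name `mergeWithOverlap` is hypothetical, and the library's actual merging logic may differ in its details.

```typescript
// Simplified sketch of overlapping chunks; NOT the library's implementation.
// `mergeWithOverlap` is a hypothetical helper used for illustration only.
function mergeWithOverlap(
  pieces: string[],
  separator: string,
  chunkSize: number,
  chunkOverlap: number
): string[] {
  const joinedLength = (parts: string[]): number =>
    parts.join(separator).length;

  const chunks: string[] = [];
  let current: string[] = [];

  for (const piece of pieces) {
    if (current.length > 0 && joinedLength([...current, piece]) > chunkSize) {
      // The next piece does not fit, so emit the current chunk.
      chunks.push(current.join(separator));
      // Drop pieces from the front until the carried-over tail is no longer
      // than chunkOverlap and leaves room for the next piece.
      while (
        current.length > 0 &&
        (joinedLength(current) > chunkOverlap ||
          joinedLength([...current, piece]) > chunkSize)
      ) {
        current.shift();
      }
    }
    current.push(piece);
  }
  if (current.length > 0) chunks.push(current.join(separator));
  return chunks;
}

const overlapped = mergeWithOverlap(["aa", "bb", "cc", "dd"], " ", 5, 2);
console.log(overlapped); // each chunk repeats the last piece of the previous one

const noOverlap = mergeWithOverlap(["aa", "bb", "cc", "dd"], " ", 5, 0);
console.log(noOverlap); // with an overlap of 0, no piece is repeated
```

Overlap is expressed in characters but applied to whole pieces in this sketch, so the repeated text can be shorter than the configured limit.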
You can also propagate the metadata associated with each document to the output chunks:
// One metadata object per input text, in the same order.
const metadatas = [{ document: 1 }, { document: 2 }];
const documents = await textSplitter.createDocuments(
  [stateOfTheUnion, stateOfTheUnion],
  metadatas
);
console.log(documents[0]);
Document {
pageContent: "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and th"... 839 more characters,
metadata: { document: 1, loc: { lines: { from: 1, to: 17 } } }
}
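To make the shape of the output above more concrete, the following sketch shows one way per-document metadata and a `loc.lines` range could be attached to each chunk. It is an illustration under assumptions, not the library's code: `attachMetadata` and `SketchDocument` are hypothetical names, and each chunk is assumed to appear verbatim in the source text.

```typescript
// Hypothetical types and helper for illustration; NOT the library's code.
interface SketchDocument {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Count newline characters in a string.
function countNewlines(s: string): number {
  return s.split("\n").length - 1;
}

function attachMetadata(
  text: string,
  chunks: string[],
  metadata: Record<string, unknown>
): SketchDocument[] {
  const docs: SketchDocument[] = [];
  let searchFrom = 0;
  for (const chunk of chunks) {
    // Assumption: every chunk is a verbatim substring of the source text.
    const start = text.indexOf(chunk, searchFrom);
    if (start === -1) {
      throw new Error("chunk not found in source text");
    }
    // Line numbers are 1-based: count the newlines before the chunk.
    const from = countNewlines(text.slice(0, start)) + 1;
    const to = from + countNewlines(chunk);
    docs.push({
      pageContent: chunk,
      // Copy the document-level metadata and add the line range.
      metadata: { ...metadata, loc: { lines: { from, to } } },
    });
    // Advance by one character so overlapping chunks can still be found.
    searchFrom = start + 1;
  }
  return docs;
}

const sourceText = "line1\nline2\n\nline3\nline4";
const docs = attachMetadata(
  sourceText,
  ["line1\nline2", "line3\nline4"],
  { document: 1 }
);
console.log(JSON.stringify(docs[1].metadata));
```

Every chunk produced from the same input text receives a copy of that text's metadata, which is why the `document: 1` field appears alongside `loc` in the output shown earlier.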
To obtain the string content directly, use .splitText():
const chunks = await textSplitter.splitText(stateOfTheUnion);
chunks[0];
"Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and th"... 839 more characters
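Next steps
You have now learned one method for splitting text by character.
Next, check out a more advanced way of splitting by character, or the full tutorial on retrieval-augmented generation.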