How to split by character
Prerequisites
This guide assumes familiarity with text splitters.
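This is the simplest way to split text. It splits on a given character sequence, which defaults to "\n\n". Chunk length is measured in number of characters.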
- How the text is split: on a single character separator.
- How the chunk size is measured: by number of characters.
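The two points above can be sketched in plain TypeScript. This is a simplified illustration, not the code CharacterTextSplitter runs: the function name splitByCharacter is hypothetical, and the sketch leaves out chunk overlap entirely. It splits on the separator, then joins neighbouring pieces back together for as long as the result stays within the character limit.

```typescript
// Simplified illustration only; NOT the implementation used by CharacterTextSplitter.
// splitByCharacter is a hypothetical helper, and chunk overlap is not modelled.
function splitByCharacter(
  text: string,
  separator: string = "\n\n",
  chunkSize: number = 1000
): string[] {
  // Step 1: cut the text at every occurrence of the separator.
  const pieces = text.split(separator).filter((p) => p.length > 0);

  // Step 2: merge neighbouring pieces while the merged length fits in chunkSize.
  const chunks: string[] = [];
  let current = "";
  for (const piece of pieces) {
    const candidate = current === "" ? piece : current + separator + piece;
    if (candidate.length <= chunkSize) {
      current = candidate;
    } else {
      if (current !== "") chunks.push(current);
      // A single piece longer than chunkSize is kept whole in this sketch.
      current = piece;
    }
  }
  if (current !== "") chunks.push(current);
  return chunks;
}

const sample = "alpha\n\nbeta\n\ngamma delta";
const demoChunks = splitByCharacter(sample, "\n\n", 12);
console.log(demoChunks);
```

With a limit of 12 characters, the first two paragraphs fit into one chunk (11 characters including the separator) and the third starts a new chunk.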
To obtain the string content directly, use .splitText().
To create LangChain Document objects (e.g., for use in downstream tasks), use .createDocuments().
import { CharacterTextSplitter } from "@langchain/textsplitters";
import * as fs from "node:fs";
// Load an example document
const rawData = fs.readFileSync(
  "../../../../examples/state_of_the_union.txt"
);
const stateOfTheUnion = rawData.toString();
const textSplitter = new CharacterTextSplitter({
  separator: "\n\n",
  chunkSize: 1000,
  chunkOverlap: 200,
});
const texts = await textSplitter.createDocuments([stateOfTheUnion]);
console.log(texts[0]);
Document {
  pageContent: "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and th"... 839 more characters,
  metadata: { loc: { lines: { from: 1, to: 17 } } }
}
You can also propagate the metadata associated with each document to the output chunks:
const metadatas = [{ document: 1 }, { document: 2 }];
const documents = await textSplitter.createDocuments(
  [stateOfTheUnion, stateOfTheUnion],
  metadatas
);
console.log(documents[0]);
Document {
  pageContent: "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and th"... 839 more characters,
  metadata: { document: 1, loc: { lines: { from: 1, to: 17 } } }
}
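The pairing between input texts and metadata objects is by position: every chunk produced from the first text carries the first metadata object, and so on. The sketch below illustrates that idea under stated assumptions. SimpleDoc and attachMetadata are hypothetical names, not LangChain APIs, and the loc line information shown in the output above is not reproduced.

```typescript
// Hypothetical minimal stand-in for a Document; not the LangChain class.
interface SimpleDoc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Illustrative helper (an assumption, not a library function): pairs each
// text with the metadata at the same index and copies it onto every chunk.
function attachMetadata(
  texts: string[],
  metadatas: Record<string, unknown>[],
  split: (text: string) => string[]
): SimpleDoc[] {
  const docs: SimpleDoc[] = [];
  texts.forEach((text, i) => {
    for (const chunk of split(text)) {
      // Shallow-copy so chunks do not share one mutable metadata object.
      docs.push({ pageContent: chunk, metadata: { ...(metadatas[i] ?? {}) } });
    }
  });
  return docs;
}

const splitOnBlankLines = (text: string): string[] => text.split("\n\n");
const demoDocs = attachMetadata(
  ["a\n\nb", "c"],
  [{ document: 1 }, { document: 2 }],
  splitOnBlankLines
);
console.log(demoDocs.map((d) => d.metadata));
```

Here the first text yields two chunks that both carry { document: 1 }, and the second text yields one chunk carrying { document: 2 }.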
To obtain the string content directly, use .splitText():
const chunks = await textSplitter.splitText(stateOfTheUnion);
chunks[0];
"Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and th"... 839 more characters
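Because the text is only cut at the separator, a passage that contains no separator can end up as a chunk longer than chunkSize. A small check like the one below can flag such chunks in an array of strings such as the one splitText() returns. oversizedChunks is a hypothetical helper, and the sample data is made up for illustration; length is counted with JavaScript's string length, that is, in UTF-16 code units.

```typescript
// Hypothetical helper: returns the indices of chunks longer than chunkSize.
// Length is measured with String.prototype.length (UTF-16 code units).
function oversizedChunks(chunks: string[], chunkSize: number): number[] {
  const indices: number[] = [];
  chunks.forEach((chunk, i) => {
    if (chunk.length > chunkSize) indices.push(i);
  });
  return indices;
}

// Made-up sample data standing in for the output of a splitter.
const sampleChunks = ["short chunk", "x".repeat(1200), "another short chunk"];
const tooLong = oversizedChunks(sampleChunks, 1000);
console.log(tooLong);
```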
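Next steps
You have now learned a method for splitting text by character.
Next, check out a more advanced way of splitting by character, or the full tutorial on retrieval-augmented generation.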