
How to split by character

Prerequisites

This guide assumes familiarity with text splitters.

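This is the simplest way to split text. The splitter breaks the text on a given character sequence, which defaults to "\n\n", and measures chunk length by number of characters.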

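  1. How the text is split: on a single separator string (which may be more than one character long, such as the default "\n\n").
  2. How the chunk size is measured: by number of characters.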
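The two points above can be illustrated with a minimal, dependency-free sketch. This is a simplified illustration and not the library's implementation: the helper name `splitByCharacter` is hypothetical, chunk overlap is left out for brevity, and a single piece longer than `chunkSize` is assumed to be kept whole.

```typescript
// Simplified illustration of character-based splitting (not the LangChain implementation).
// `splitByCharacter` is a hypothetical helper; overlap handling is omitted.
function splitByCharacter(
  text: string,
  separator: string = "\n\n",
  chunkSize: number = 1000
): string[] {
  // 1. Split the text on the separator string and drop empty pieces.
  const pieces = text.split(separator).filter((piece) => piece.length > 0);

  // 2. Greedily merge pieces back together while the merged length,
  //    measured in characters, stays within chunkSize.
  const chunks: string[] = [];
  let current = "";
  for (const piece of pieces) {
    const candidate = current === "" ? piece : current + separator + piece;
    if (candidate.length <= chunkSize || current === "") {
      // Either the merged text still fits, or this piece starts a new chunk
      // (an oversized piece is kept whole in this sketch).
      current = candidate;
    } else {
      chunks.push(current);
      current = piece;
    }
  }
  if (current !== "") chunks.push(current);
  return chunks;
}

const sample = "first paragraph\n\nsecond paragraph\n\nthird";
const sampleChunks = splitByCharacter(sample, "\n\n", 35);
console.log(sampleChunks);
```

With a limit of 35 characters, the first two paragraphs fit into one chunk (33 characters including the separator), and the third paragraph starts a new chunk.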

To obtain the string content directly, use .splitText().

To create LangChain Document objects (for example, for use in downstream tasks), use .createDocuments().

import { CharacterTextSplitter } from "@langchain/textsplitters";
import * as fs from "node:fs";

// Load an example document. readFileSync is synchronous, so no await is needed;
// passing "utf8" returns a string instead of a Buffer.
const stateOfTheUnion = fs.readFileSync(
  "../../../../examples/state_of_the_union.txt",
  "utf8"
);

// Split on blank lines, targeting chunks of about 1000 characters
// with up to 200 characters of overlap between neighbouring chunks.
const textSplitter = new CharacterTextSplitter({
  separator: "\n\n",
  chunkSize: 1000,
  chunkOverlap: 200,
});
const texts = await textSplitter.createDocuments([stateOfTheUnion]);
console.log(texts[0]);

Document {
  pageContent: "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and th"... 839 more characters,
  metadata: { loc: { lines: { from: 1, to: 17 } } }
}

You can also propagate the metadata associated with each document to its output chunks:

const metadatas = [{ document: 1 }, { document: 2 }];

const documents = await textSplitter.createDocuments(
  [stateOfTheUnion, stateOfTheUnion],
  metadatas
);

console.log(documents[0]);

Document {
  pageContent: "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and th"... 839 more characters,
  metadata: { document: 1, loc: { lines: { from: 1, to: 17 } } }
}
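To make the propagation concrete, here is a minimal, dependency-free sketch of the idea: every chunk produced from the i-th input text receives a copy of the i-th metadata object. The names `SketchDocument` and `attachMetadata` are hypothetical, and this is an illustration rather than the library's code; in particular, it leaves out the `loc` line information that appears in the output above.

```typescript
// Simplified illustration of metadata propagation (not the LangChain implementation).
// `SketchDocument` and `attachMetadata` are hypothetical names.
interface SketchDocument {
  pageContent: string;
  metadata: Record<string, unknown>;
}

function attachMetadata(
  chunksPerDocument: string[][],
  metadatas: Record<string, unknown>[]
): SketchDocument[] {
  const output: SketchDocument[] = [];
  chunksPerDocument.forEach((chunks, i) => {
    for (const chunk of chunks) {
      // Copy the metadata so that chunks do not share one mutable object.
      output.push({ pageContent: chunk, metadata: { ...(metadatas[i] ?? {}) } });
    }
  });
  return output;
}

// Two source documents: the first yields two chunks, the second yields one.
const sketchDocs = attachMetadata(
  [["a1", "a2"], ["b1"]],
  [{ document: 1 }, { document: 2 }]
);
console.log(sketchDocs);
```

Both chunks of the first text carry `{ document: 1 }`, and the single chunk of the second text carries `{ document: 2 }`.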

To obtain the string content directly, use .splitText():

const chunks = await textSplitter.splitText(stateOfTheUnion);

chunks[0];
"Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and th"... 839 more characters

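Because chunk size is measured in characters, a quick sanity check is to look at the length of each returned string. The sketch below is a hypothetical helper, not part of the library. It assumes that length is measured with JavaScript's string `.length` (which counts UTF-16 code units), and it uses stand-in strings in place of the real `chunks` array.

```typescript
// Hypothetical helper: summarize chunk sizes measured with String.prototype.length.
function summarizeChunkLengths(chunks: string[]): {
  count: number;
  min: number;
  max: number;
} {
  const lengths = chunks.map((chunk) => chunk.length);
  return {
    count: lengths.length,
    min: lengths.length > 0 ? Math.min(...lengths) : 0,
    max: lengths.length > 0 ? Math.max(...lengths) : 0,
  };
}

// Stand-in data; with the example above you would pass the `chunks` array instead.
const exampleChunks = ["Madam Speaker", "Members of Congress", "Thank you"];
const summary = summarizeChunkLengths(exampleChunks);
console.log(summary);
```

If the reported maximum is far above the configured chunkSize, the text probably contains long stretches without the separator.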
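Next steps

You've now learned one method for splitting text by character.

Next, check out a more advanced way of splitting by character, or the full tutorial on retrieval-augmented generation.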

