How to split by character

Prerequisites

This guide assumes you are familiar with the concept of text splitters.

This is the simplest way to split text. It splits on a given character sequence, which defaults to "\n\n". Chunk length is measured in characters.

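  1. How the text is split: on a single separator, which is a character sequence such as "\n\n".
  2. How the chunk size is measured: by number of characters.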
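The two points above can be sketched in plain TypeScript. This is a simplified illustration, not the CharacterTextSplitter implementation: the helper name `splitByCharacter` is hypothetical, chunk overlap is left out, and "characters" is assumed to mean the JavaScript string length (UTF-16 code units).

```typescript
// Simplified illustration only: NOT the CharacterTextSplitter implementation.
// Chunk overlap is deliberately left out to keep the sketch short.
function splitByCharacter(
  text: string,
  separator = "\n\n",
  chunkSize = 1000
): string[] {
  // Step 1: split on the separator and drop empty pieces.
  const pieces = text.split(separator).filter((piece) => piece.length > 0);

  // Step 2: greedily merge pieces while the merged chunk stays within
  // chunkSize, measured as string length (assumption: UTF-16 code units).
  const chunks: string[] = [];
  let current = "";
  for (const piece of pieces) {
    const candidate = current === "" ? piece : current + separator + piece;
    if (candidate.length <= chunkSize || current === "") {
      current = candidate;
    } else {
      chunks.push(current);
      current = piece;
    }
  }
  if (current !== "") chunks.push(current);
  return chunks;
}

const sample = "aaaa\n\nbbbb\n\ncccc";
const sampleChunks = splitByCharacter(sample, "\n\n", 10);
console.log(sampleChunks); // [ 'aaaa\n\nbbbb', 'cccc' ]
```

In this sketch, a single piece longer than `chunkSize` is emitted whole, because the only split points are occurrences of the separator.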

To obtain the string content directly, use .splitText().

To create LangChain Document objects (e.g., for use in downstream tasks), use .createDocuments():

import { CharacterTextSplitter } from "@langchain/textsplitters";
import * as fs from "node:fs";

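// Load an example document.
// readFileSync is synchronous, so no await is needed.
const rawData = fs.readFileSync(
  "../../../../examples/state_of_the_union.txt"
);
const stateOfTheUnion = rawData.toString();

// Split on blank lines into chunks of at most 1000 characters,
// with up to 200 characters of overlap between consecutive chunks.
const textSplitter = new CharacterTextSplitter({
  separator: "\n\n",
  chunkSize: 1000,
  chunkOverlap: 200,
});
const texts = await textSplitter.createDocuments([stateOfTheUnion]);
console.log(texts[0]);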
Document {
  pageContent: "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and th"... 839 more characters,
  metadata: { loc: { lines: { from: 1, to: 17 } } }
}

You can also propagate the metadata associated with each document to the output chunks:

const metadatas = [{ document: 1 }, { document: 2 }];

const documents = await textSplitter.createDocuments(
  [stateOfTheUnion, stateOfTheUnion],
  metadatas
);

console.log(documents[0]);
Document {
  pageContent: "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and th"... 839 more characters,
  metadata: { document: 1, loc: { lines: { from: 1, to: 17 } } }
}
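To make the pairing of metadata and chunks concrete, here is a minimal sketch in plain TypeScript. The `SimpleDocument` type and the `createSimpleDocuments` helper are hypothetical stand-ins, not LangChain APIs; the sketch assumes the i-th metadata object belongs to the i-th input text and omits chunk merging and the `loc` line information shown in the output above.

```typescript
// Hypothetical, simplified types and helper; not the LangChain implementation.
interface SimpleDocument {
  pageContent: string;
  metadata: Record<string, unknown>;
}

function createSimpleDocuments(
  texts: string[],
  metadatas: Record<string, unknown>[] = [],
  separator = "\n\n"
): SimpleDocument[] {
  const docs: SimpleDocument[] = [];
  texts.forEach((text, i) => {
    // metadatas[i] belongs to texts[i]; fall back to an empty object.
    const sourceMetadata = metadatas[i] ?? {};
    for (const piece of text.split(separator)) {
      if (piece.length === 0) continue;
      // Copy the metadata so chunks do not share one mutable object.
      docs.push({ pageContent: piece, metadata: { ...sourceMetadata } });
    }
  });
  return docs;
}

const simpleDocs = createSimpleDocuments(
  ["first\n\nsecond", "third"],
  [{ document: 1 }, { document: 2 }]
);
console.log(simpleDocs.map((d) => d.metadata));
// [ { document: 1 }, { document: 1 }, { document: 2 } ]
```

Every chunk produced from the first text carries `{ document: 1 }`, and the chunk from the second text carries `{ document: 2 }`, which mirrors the behavior shown in the example output.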

To obtain the string content directly, use .splitText():

const chunks = await textSplitter.splitText(stateOfTheUnion);

chunks[0];
"Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and th"... 839 more characters

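Next steps

You have now learned a method for splitting text by character.

Next, check out a more advanced way of splitting by character, or the full tutorial on retrieval-augmented generation.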

