
How to recursively split text by characters

Prerequisites

This guide assumes you are familiar with the following concepts:

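This text splitter is the recommended one for generic text. It is parameterized by a list of characters, and it tries to split on them in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""]. The effect is to keep paragraphs (then sentences, then words) together for as long as possible, since these are usually the most closely related pieces of text semantically.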

  1. How the text is split: by a list of characters.
  2. How the chunk size is measured: by number of characters.
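To make the strategy described above concrete, here is a minimal sketch of the idea in plain TypeScript. It is not the library's implementation: the function name recursiveSplit is hypothetical, and the sketch assumes no chunk overlap, drops the separators it splits on, and trims whitespace from each chunk.

```typescript
// Simplified sketch of recursive character splitting (NOT the library's code).
// Assumptions: no overlap between chunks, chunks are whitespace-trimmed,
// and chunkSize is at least 1.
function recursiveSplit(
  text: string,
  chunkSize: number,
  separators: string[] = ["\n\n", "\n", " ", ""],
): string[] {
  if (text.length <= chunkSize) {
    return text.trim() === "" ? [] : [text.trim()];
  }
  // Use the first separator that occurs in the text; "" always matches.
  const index = separators.findIndex((s) => s === "" || text.includes(s));
  if (index === -1) {
    return [text]; // nothing left to split on: emit the oversized piece as is
  }
  const separator = separators[index];
  const finer = separators.slice(index + 1);
  const pieces = separator === "" ? Array.from(text) : text.split(separator);

  const chunks: string[] = [];
  let current = "";
  for (const piece of pieces) {
    const candidate = current === "" ? piece : current + separator + piece;
    if (candidate.length <= chunkSize) {
      current = candidate; // still fits: keep neighbouring pieces together
      continue;
    }
    if (current.trim() !== "") chunks.push(current.trim());
    current = "";
    if (piece.length <= chunkSize) {
      current = piece;
    } else {
      // Piece is too large on its own: retry with the finer separators.
      chunks.push(...recursiveSplit(piece, chunkSize, finer));
    }
  }
  if (current.trim() !== "") chunks.push(current.trim());
  return chunks;
}

const demo = recursiveSplit("Hi.\n\nI'm Harrison.\n\nHow? Are? You?", 10);
console.log(demo);
// → ["Hi.", "I'm", "Harrison.", "How? Are?", "You?"]
```

Paragraph breaks are tried first, and only the paragraphs that exceed the limit are split again on spaces, which is why "How? Are?" stays together while "I'm Harrison." is broken in two.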

Example usage is shown below.

To obtain the string content directly, use .splitText.

To create LangChain Document objects (e.g., for use in downstream tasks), use .createDocuments.
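As a minimal sketch of the .splitText variant, the snippet below returns plain strings rather than Document objects. The helper name splitToStrings and the sample text are illustrative assumptions, not part of the library, and the snippet assumes the @langchain/textsplitters package is installed.

```typescript
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

// Hypothetical helper (not part of the library): split text into plain strings.
async function splitToStrings(text: string): Promise<string[]> {
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 10,
    chunkOverlap: 1,
  });
  // splitText resolves to an array of chunk strings with no metadata attached.
  return splitter.splitText(text);
}

splitToStrings("Hi.\n\nI'm Harrison.\n\nHow? Are? You?").then((chunks) => {
  console.log(chunks.slice(0, 3));
});
```

This form is convenient when only the text of each chunk is needed and the line-location metadata carried by Document objects is not.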

import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

const text = `Hi.\n\nI'm Harrison.\n\nHow? Are? You?\nOkay then f f f f.
This is a weird text to write, but gotta test the splittingggg some how.\n\n
Bye!\n\n-H.`;
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 10,
  chunkOverlap: 1,
});

const output = await splitter.createDocuments([text]);

console.log(output.slice(0, 3));
[
  Document {
    pageContent: "Hi.",
    metadata: { loc: { lines: { from: 1, to: 1 } } }
  },
  Document {
    pageContent: "I'm",
    metadata: { loc: { lines: { from: 3, to: 3 } } }
  },
  Document {
    pageContent: "Harrison.",
    metadata: { loc: { lines: { from: 3, to: 3 } } }
  }
]

Notice that in the example above we split a raw text string and get back a list of documents. We can also split documents directly.

import { Document } from "@langchain/core/documents";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

const text = `Hi.\n\nI'm Harrison.\n\nHow? Are? You?\nOkay then f f f f.
This is a weird text to write, but gotta test the splittingggg some how.\n\n
Bye!\n\n-H.`;
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 10,
  chunkOverlap: 1,
});

// Wrap the raw text in a Document and split it directly.
const docOutput = await splitter.splitDocuments([
  new Document({ pageContent: text }),
]);

console.log(docOutput.slice(0, 3));
[
  Document {
    pageContent: "Hi.",
    metadata: { loc: { lines: { from: 1, to: 1 } } }
  },
  Document {
    pageContent: "I'm",
    metadata: { loc: { lines: { from: 3, to: 3 } } }
  },
  Document {
    pageContent: "Harrison.",
    metadata: { loc: { lines: { from: 3, to: 3 } } }
  }
]

You can customize the RecursiveCharacterTextSplitter with arbitrary separators by passing a separators parameter, as shown here:

import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { Document } from "@langchain/core/documents";

const text = `Some other considerations include:

- Do you deploy your backend and frontend together, or separately?
- Do you deploy your backend co-located with your database, or separately?

**Production Support:** As you move your LangChains into production, we'd love to offer more hands-on support.
Fill out [this form](https://airtable.com/appwQzlErAS2qiP0L/shrGtGaVBVAz7NcV2) to share more about what you're building, and our team will get in touch.

## Deployment Options

See below for a list of deployment options for your LangChain app. If you don't see your preferred option, please get in touch and we can add it to this list.`;

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 50,
  chunkOverlap: 1,
  // Custom separators, tried in order, replace the default list.
  separators: ["|", "##", ">", "-"],
});

const docOutput = await splitter.splitDocuments([
  new Document({ pageContent: text }),
]);

console.log(docOutput.slice(0, 3));
[
  Document {
    pageContent: "Some other considerations include:",
    metadata: { loc: { lines: { from: 1, to: 1 } } }
  },
  Document {
    pageContent: "- Do you deploy your backend and frontend together",
    metadata: { loc: { lines: { from: 3, to: 3 } } }
  },
  Document {
    pageContent: "r, or separately?",
    metadata: { loc: { lines: { from: 3, to: 3 } } }
  }
]

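Next steps

You've now learned a method for splitting text by character.

Next, check out specific techniques for splitting code, or the full tutorial on retrieval-augmented generation.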

