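9.3 Text Data Preprocessing

A Complete Guide to Text Preprocessing in TensorFlow: From Cleaning to Vectorization and Padding

TensorFlow Chinese Handbook

This guide explains how to preprocess text data with TensorFlow: text cleaning (denoising, tokenization, stopword removal), vocabulary building (with tf.keras.preprocessing.text.Tokenizer), text vectorization (integer encoding, one-hot encoding, preparation for word embeddings), and sequence padding and truncation (with tf.keras.preprocessing.sequence.pad_sequences). It is aimed at newcomers who want a quick start in natural language processing.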

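Text Data Preprocessing in TensorFlow

Text preprocessing is a foundational step in natural language processing (NLP): it converts raw text into a format that a machine learning model can take as input. This chapter walks you step by step through text cleaning, vocabulary building, vectorization, and sequence padding using TensorFlow tools, at a pace suited to beginners.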

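1. Text Cleaning: Denoising, Tokenization, Stopword Removal

Text cleaning removes noise and keeps the useful information. It consists mainly of denoising, tokenization, and stopword removal.

Denoising

Denoising removes irrelevant characters such as punctuation, digits, or HTML tags. You can use the Python standard library, in particular regular expressions. For example: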

import re
text = "Hello, TensorFlow! 这是示例文本123。"
cleaned_text = re.sub(r'[^\w\s]', '', text)  # remove punctuation (letters, digits, and CJK characters are kept)
print(cleaned_text)  # Output: Hello TensorFlow 这是示例文本123
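The same idea extends to the other kinds of noise mentioned above. The following is a minimal sketch, assuming the raw text may contain HTML tags and entities; the `denoise` helper and its regular expressions are illustrative choices, not part of TensorFlow.

```python
import html
import re

def denoise(text: str) -> str:
    """Strip HTML tags, digits, and punctuation, then collapse whitespace."""
    text = html.unescape(text)            # turn entities such as &amp; into characters
    text = re.sub(r"<[^>]+>", " ", text)  # drop HTML tags
    text = re.sub(r"\d+", " ", text)      # drop digit runs
    text = re.sub(r"[^\w\s]", " ", text)  # drop punctuation (CJK characters are kept)
    return re.sub(r"\s+", " ", text).strip()

raw = "<p>Hello, TensorFlow! 这是示例文本123。</p>"
print(denoise(raw))  # Hello TensorFlow 这是示例文本
```

Replacing noise with a space rather than an empty string keeps neighboring words from being glued together; the final step then collapses the extra spaces.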

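Tokenization

Tokenization splits text into words or subwords. In TensorFlow you can use tf.strings.split, or an external library such as NLTK. Note that tf.strings.split separates on whitespace by default. Chinese text has no spaces between words, so a run of Chinese characters stays in one piece unless you first segment it with a dedicated word segmenter (jieba is a common choice). A simple example:

import tensorflow as tf
text = "我爱学习 TensorFlow"
tokens = tf.strings.split(text)  # splits on whitespace only
print([t.decode("utf-8") for t in tokens.numpy()])  # Output: ['我爱学习', 'TensorFlow']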
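When no word segmenter is available, one common fallback for Chinese is character-level tokenization. The sketch below is one way to do that in plain Python: it keeps Latin words whole and emits each Chinese character as its own token. The `simple_tokenize` name and the `\u4e00-\u9fff` range (the basic CJK Unified Ideographs block) are assumptions made for illustration; a proper segmenter will usually produce better word boundaries.

```python
import re
from typing import List

# Latin/digit runs stay whole; each CJK ideograph becomes its own token.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+|[\u4e00-\u9fff]")

def simple_tokenize(text: str) -> List[str]:
    """Return word tokens for Latin text and character tokens for Chinese."""
    return TOKEN_PATTERN.findall(text)

print(simple_tokenize("我爱学习 TensorFlow"))  # ['我', '爱', '学', '习', 'TensorFlow']
```

Joining the resulting tokens with spaces gives text that whitespace-based tools can then process.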

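Stopword Removal

Stopwords are very common words that carry little meaning on their own, such as “的” and “和”. You can remove them with a predefined list. For example: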

stopwords = set(["的", "和", "在"])
tokens_list = ["我", "爱", "的", "学习"]
filtered_tokens = [word for word in tokens_list if word not in stopwords]
print(filtered_tokens)  # Output: ['我', '爱', '学习']
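For mixed Chinese and English text, the comparison usually needs to ignore case so that "The" and "the" are treated alike. A small sketch, assuming a hand-written stopword set (the English entries here are illustrative, not a standard list):

```python
STOPWORDS = {"的", "和", "在", "the", "and", "of"}

def remove_stopwords(tokens):
    """Drop stopwords, comparing case-insensitively so 'The' matches 'the'."""
    return [tok for tok in tokens if tok.lower() not in STOPWORDS]

tokens = ["The", "力量", "的", "TensorFlow", "and", "Keras"]
print(remove_stopwords(tokens))  # ['力量', 'TensorFlow', 'Keras']
```

Using a set keeps each membership check constant-time, which matters once the stopword list grows to hundreds of entries.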

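2. Building the Vocabulary with tf.keras.preprocessing.text.Tokenizer

Building a vocabulary means mapping each word in the text to an integer index. TensorFlow provides the Tokenizer class to simplify this step. Note that recent TensorFlow releases mark Tokenizer as deprecated in favor of tf.keras.layers.TextVectorization; it is still a convenient way to learn the concepts.

Initializing and Fitting

First, instantiate a Tokenizer and fit it on the text data to build the vocabulary. By default Tokenizer lowercases the text and splits on whitespace, so Chinese text must already be segmented into space-separated words. Indices start at 1 and are assigned by descending word frequency.

from tensorflow.keras.preprocessing.text import Tokenizer
texts = ["我 爱 TensorFlow", "TensorFlow 强大 易用"]  # pre-segmented: words separated by spaces
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)  # build the vocabulary
print(tokenizer.word_index)  # Output: {'tensorflow': 1, '我': 2, '爱': 3, '强大': 4, '易用': 5}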
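To see where these indices come from, here is a plain-Python sketch of the core idea: count the words, then number them from 1 in order of descending frequency. It is a simplification under stated assumptions (input already segmented with spaces, no punctuation filtering), not the library's actual implementation; `build_word_index` is a hypothetical helper.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to a 1-based index, most frequent word first."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # most_common() orders by count; words with equal counts keep first-seen order.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

texts = ["我 爱 TensorFlow", "TensorFlow 强大 易用"]
print(build_word_index(texts))
# {'tensorflow': 1, '我': 2, '爱': 3, '强大': 4, '易用': 5}
```

Index 0 is deliberately left unused so that it can later serve as the padding value.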

Converting to Sequences

Convert the texts to sequences of integers:

sequences = tokenizer.texts_to_sequences(texts)
print(sequences)  # Output: [[2, 3, 1], [1, 4, 5]]
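New text often contains words that were never seen while fitting. By default Tokenizer simply drops such words; if you pass an oov_token when creating it, they are mapped to a reserved index instead. The sketch below imitates both behaviors in plain Python. The `texts_to_sequences` function, the `<OOV>` marker, and the vocabulary are hypothetical and only illustrate the logic.

```python
def texts_to_sequences(texts, word_index, oov_index=None):
    """Encode space-separated texts; unknown words are dropped or mapped to oov_index."""
    sequences = []
    for text in texts:
        seq = []
        for word in text.lower().split():
            idx = word_index.get(word, oov_index)
            if idx is not None:
                seq.append(idx)
        sequences.append(seq)
    return sequences

# Hypothetical vocabulary with index 1 reserved for unknown words.
word_index = {"<OOV>": 1, "tensorflow": 2, "我": 3, "爱": 4}
new_texts = ["我 爱 Keras"]
print(texts_to_sequences(new_texts, word_index))               # [[3, 4]]
print(texts_to_sequences(new_texts, word_index, oov_index=1))  # [[3, 4, 1]]
```

Keeping an explicit unknown-word index preserves sequence length and word positions, which is usually preferable to silently dropping words.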

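3. Text Vectorization: Integer Encoding, One-Hot Encoding, and Word-Embedding Preparation

Vectorization represents text as numeric vectors. Common approaches are integer encoding, one-hot encoding, and preparing input for word embeddings.

Integer Encoding

This is the simplest approach: each word is mapped directly to its integer index, as shown in the previous section.

One-Hot Encoding

One-hot encoding represents each word as a high-dimensional binary vector with a single 1 at the word's index. You can use tf.keras.utils.to_categorical for this. The Tokenizer method texts_to_matrix is related but different: it returns one vector per text (a bag-of-words representation), not one vector per word.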

from tensorflow.keras.utils import to_categorical
sequence = [2, 3, 1]
vocab_size = len(tokenizer.word_index) + 1  # vocabulary size; +1 because index 0 is reserved for padding
one_hot_encoded = to_categorical(sequence, num_classes=vocab_size)
print(one_hot_encoded)  # With a 5-word vocabulary (vocab_size = 6), output: [[0. 0. 1. 0. 0. 0.], [0. 0. 0. 1. 0. 0.], [0. 1. 0. 0. 0. 0.]]
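The difference between a per-word one-hot encoding and a per-text bag-of-words vector is easy to miss, so here is a plain-Python sketch of both. The `one_hot` and `multi_hot` helpers are hypothetical names used only for this illustration.

```python
def one_hot(sequence, vocab_size):
    """One vector per token: a single 1.0 at the token's index."""
    return [[1.0 if i == idx else 0.0 for i in range(vocab_size)] for idx in sequence]

def multi_hot(sequence, vocab_size):
    """One vector per text: 1.0 at every index that occurs in it."""
    present = set(sequence)
    return [1.0 if i in present else 0.0 for i in range(vocab_size)]

sequence = [2, 3, 1]
print(one_hot(sequence, vocab_size=6))    # 3 rows, each of length 6
print(multi_hot(sequence, vocab_size=6))  # [0.0, 1.0, 1.0, 1.0, 0.0, 0.0]
```

The one-hot form keeps word order but needs one row of vocab_size numbers per word, so it becomes expensive for large vocabularies; the multi-hot form is compact but discards order.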

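Word-Embedding Preparation

Word embeddings map each word to a low-dimensional continuous vector, and the vectors are usually learned while the model trains. You can use the tf.keras.layers.Embedding layer; at the preprocessing stage you only need to supply integer-encoded sequences as its input.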

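# In the model definition. input_length is omitted: recent Keras versions deprecate it,
# and the layer accepts integer sequences of any length.
embedding_layer = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=10)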
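Conceptually, an embedding layer is a lookup table: each integer index selects one row of a matrix. The following plain-Python sketch shows that lookup with randomly initialized values. The sizes and the `embed` helper are assumptions for illustration; a real layer stores the matrix as trainable weights and updates it during training.

```python
import random

random.seed(0)
vocab_size, embedding_dim = 6, 4

# One row per vocabulary index; a real layer learns these values during training.
embedding_matrix = [[random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
                    for _ in range(vocab_size)]

def embed(sequence):
    """Replace each integer index with its row of the embedding matrix."""
    return [embedding_matrix[idx] for idx in sequence]

vectors = embed([2, 3, 1])
print(len(vectors), len(vectors[0]))  # 3 4
```

This is why integer encoding is all the preprocessing that embeddings need: the indices are only used to select rows.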

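4. Sequence Padding and Truncation with tf.keras.preprocessing.sequence.pad_sequences

Before feeding sequences to a model, they must all have the same length. pad_sequences pads short sequences and truncates long ones.

Basic Usage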

from tensorflow.keras.preprocessing.sequence import pad_sequences
sequences = [[2, 3, 1], [1, 4, 5, 6]]  # example sequences
padded_sequences = pad_sequences(sequences, maxlen=5, padding='post', truncating='post')
print(padded_sequences)  # Output: [[2 3 1 0 0] [1 4 5 6 0]]

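maxlen: the target sequence length, here 5. If omitted, the length of the longest sequence is used.
padding: 'post' pads at the end of the sequence, 'pre' pads at the beginning. The default is 'pre'.
truncating: 'post' cuts the excess from the end of sequences longer than maxlen, 'pre' cuts it from the beginning. The default is 'pre'.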
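The interaction of these three parameters is easiest to see in code. Here is a plain-Python sketch of the same logic, using lists instead of NumPy arrays; `pad` is a hypothetical helper written for illustration, with defaults chosen to mirror those described above.

```python
def pad(sequences, maxlen, padding="pre", truncating="pre", value=0):
    """Pad or truncate each sequence to exactly maxlen items (maxlen >= 1)."""
    result = []
    for seq in sequences:
        if len(seq) > maxlen:
            # 'pre' drops items from the front, 'post' drops them from the end.
            seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        result.append(fill + list(seq) if padding == "pre" else list(seq) + fill)
    return result

seqs = [[2, 3, 1], [1, 4, 5, 6]]
print(pad(seqs, maxlen=5, padding="post"))     # [[2, 3, 1, 0, 0], [1, 4, 5, 6, 0]]
print(pad(seqs, maxlen=3))                     # [[2, 3, 1], [4, 5, 6]]
print(pad(seqs, maxlen=3, truncating="post"))  # [[2, 3, 1], [1, 4, 5]]
```

With the default 'pre' truncation the beginning of a long text is lost, so pass truncating='post' explicitly when the opening words matter most.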

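Putting It Together in a Preprocessing Pipeline

Assuming you have already cleaned and segmented the text, the whole pipeline might look like this: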

# Assume cleaned_texts is a list of cleaned texts whose words are separated by spaces
cleaned_texts = ["我 爱 TensorFlow", "TensorFlow 强大 易用"]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(cleaned_texts)
sequences = tokenizer.texts_to_sequences(cleaned_texts)
max_length = 10  # choose a length that suits your data
padded_sequences = pad_sequences(sequences, maxlen=max_length, padding='post')
print(padded_sequences)  # ready to be fed into a model
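How do you choose a suitable length? One common heuristic, offered here as a suggestion rather than a rule, is to pick a length that covers most of the sequences (say 90 to 95 percent), so that a few very long texts do not force heavy padding on everything else. The `choose_max_length` helper below is a hypothetical sketch of that idea and assumes a non-empty list of sequences.

```python
import math

def choose_max_length(sequences, coverage=0.95):
    """Return the smallest length that fully covers `coverage` of the sequences."""
    lengths = sorted(len(seq) for seq in sequences)
    # Nearest-rank percentile: position of the sequence at the requested coverage.
    rank = max(1, math.ceil(coverage * len(lengths)))
    return lengths[rank - 1]

sample = [[1] * n for n in (3, 4, 4, 5, 6, 7, 8, 9, 10, 50)]
print(choose_max_length(sample))       # 50
print(choose_max_length(sample, 0.9))  # 10
```

In this sample, lowering the coverage from 95 to 90 percent shrinks the length from 50 to 10, because only a single outlier is longer than 10 tokens.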

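Summary

This chapter covered the key steps of text preprocessing in TensorFlow: cleaning the raw text through denoising, tokenization, and stopword removal; building a vocabulary and integer encoding with Tokenizer; vectorizing with integer encoding, one-hot encoding, or input prepared for word embeddings; and finally making all sequences the same length with pad_sequences. These steps are the foundation for building NLP models, and working through the code examples is a good way to reinforce them.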

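The padded integer sequences produced here are what you feed into a TensorFlow model for training. The next chapter, 9.4, turns to image data preprocessing.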
