How to Build a Text Classifier Based on TF-IDF
For any text classification task, the first thing to do is convert the text corpus into input a model can actually use, i.e. turn each document into a vector of numbers (str -> float). TF-IDF is one solution to this problem.
TF-IDF stands for Term Frequency – Inverse Document Frequency.
TF
The number of times term i appears in document j, divided by the total number of words in document j:
$$
TF(i, j)=\frac{\text { Term i frequency in document } \mathrm{j}}{\text { Total words in document } \mathrm{j}}
$$
IDF
The logarithm of the total number of documents divided by the number of documents that contain term i:
$$
IDF(i)=\log _{2}\left(\frac{\text { Total documents }}{\text { documents with term i }}\right)
$$
TF-IDF
The product of the two:
$$TF\text{-}IDF(i, j)=TF(i, j) \times IDF(i)$$
Let's work through a quick example first.
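As a hypothetical illustration (the numbers below are made up purely for the arithmetic): suppose the term "cat" appears 5 times in a 100-word document, and 4 of the 16 documents in the corpus contain "cat". Then:
$$
TF=\frac{5}{100}=0.05, \quad IDF=\log _{2}\left(\frac{16}{4}\right)=2, \quad TF\text{-}IDF=0.05 \times 2=0.1
$$
A high score therefore means a term is frequent in this particular document but rare across the corpus.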



Now let's see how this works using sklearn's sklearn.feature_extraction.text module.
Import the modules
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
Use CountVectorizer() to get the term counts
corpus = ['this is the first document',
          'this document is the second document',
          'and this is the third one',
          'is this the first document']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
X.toarray()
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
[0, 2, 0, 1, 0, 1, 1, 0, 1],
[1, 0, 0, 1, 1, 0, 1, 1, 1],
[0, 1, 1, 1, 0, 0, 1, 0, 1]], dtype=int64)
Use TfidfTransformer() to get the IDF weights
tfidf = TfidfTransformer()
tfidf.fit_transform(X).toarray()
tfidf.idf_
array([1.91629073, 1.22314355, 1.51082562, 1. , 1.91629073,
1.91629073, 1. , 1.91629073, 1. ])
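Note that these idf_ values do not come from the log2 textbook formula above: by default scikit-learn uses a smoothed natural-log variant, idf(t) = ln((1 + n) / (1 + df(t))) + 1. A minimal sketch that reproduces tfidf.idf_ (the df array is read off the count matrix above):
import numpy as np

n_docs = 4
# document frequency of each feature, read off the count matrix above
df = np.array([1, 3, 2, 4, 1, 1, 4, 1, 4])
# scikit-learn's default smooth_idf=True adds 1 to numerator and denominator
idf = np.log((1 + n_docs) / (1 + df)) + 1
print(idf)  # matches tfidf.idf_ above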
Use TfidfVectorizer() to do it all in one step
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
print(X.toarray() / tfidf.idf_)  # dividing out the IDF leaves just the (L2-normalized) TF part
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
[[0. 0.38408524 0.38408524 0.38408524 0. 0.
0.38408524 0. 0.38408524]
[0. 0.56217735 0. 0.28108867 0. 0.28108867
0.28108867 0. 0.28108867]
[0.26710379 0. 0. 0.26710379 0.26710379 0.
0.26710379 0.26710379 0.26710379]
[0. 0.38408524 0.38408524 0.38408524 0. 0.
0.38408524 0. 0.38408524]]
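As a sanity check, TfidfVectorizer is equivalent to running CountVectorizer followed by TfidfTransformer; a minimal sketch to confirm:
import numpy as np
from sklearn.pipeline import make_pipeline

# CountVectorizer + TfidfTransformer reproduces TfidfVectorizer's output
pipe = make_pipeline(CountVectorizer(), TfidfTransformer())
X_pipe = pipe.fit_transform(corpus)
print(np.allclose(X_pipe.toarray(), X.toarray()))  # True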
A real example I ran into
Project link
kaggle:https://www.kaggle.com/c/umd-inst414-20f-imdb/data
Project description
A text classification task: sentiment classification. Each document (a row in the data file) is a movie review extracted from IMDB. The goal is to classify the sentiment of each review as "positive" or "negative".
The training data contains 10,000 reviews, already labeled 1 (positive) or 0 (negative). The test data contains 5,000 unlabeled reviews, for which we output predicted labels.
Import the modules
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
Read the data
train = pd.read_csv('input/inst/train.csv')
test = pd.read_csv('input/inst/test.csv')
# Split the labeled data into training and validation sets
y = train.label.values
xtrain, xvalid, ytrain, yvalid = train_test_split(train.text.values, y,
                                                  stratify=y,
                                                  random_state=42,
                                                  test_size=0.1,
                                                  shuffle=True)
# Build the TF-IDF features with sklearn
tfv = TfidfVectorizer(min_df=3,                # ignore terms seen in fewer than 3 documents
                      max_features=None,
                      strip_accents='unicode',
                      analyzer='word',
                      token_pattern=r'\w{1,}',
                      ngram_range=(1, 3),      # unigrams, bigrams and trigrams
                      use_idf=True,
                      smooth_idf=True,
                      sublinear_tf=True,       # use 1 + log(tf) instead of raw tf
                      stop_words='english')
# Fitting TF-IDF to both training and validation sets (semi-supervised learning)
tfv.fit(list(xtrain) + list(xvalid))
xtrain_tfv = tfv.transform(xtrain)
xvalid_tfv = tfv.transform(xvalid)
# Fit a logistic regression and check training accuracy
clf = LogisticRegression(C=1.0)
clf.fit(xtrain_tfv, ytrain)
clf.score(xtrain_tfv, ytrain)
0.96
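Note that 0.96 is accuracy on the training set itself, which is optimistic. Scoring the held-out validation split gives a more honest estimate:
# accuracy on the held-out validation split (a fairer generalization check)
print(clf.score(xvalid_tfv, yvalid))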
# Predict on the unseen test data
xtest_tfv = tfv.transform(test.text.values)
labels = clf.predict(xtest_tfv)
print(labels[:10])
[1 1 1 0 1 1 0 1 1 1]
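Finally, to produce the labeled output the task asks for, the predictions can be written to a CSV. A minimal sketch, assuming (hypothetically) that the submission wants an id column plus the predicted label; adjust to the competition's actual format:
# hypothetical submission format; check the competition page for the real column names
submission = pd.DataFrame({'id': test.index, 'label': labels})
submission.to_csv('submission.csv', index=False)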
Summary
The key here is understanding how TF-IDF works; the classification step that follows is fairly routine, and LogisticRegression performs quite well.
Other text classification methods will have to wait for a later post.