How to Build a TF-IDF-Based Text Classifier

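For any text classification task, the first step is to turn the corpus into input a model can consume, that is, to convert text into numeric vectors (str -> float). TF-IDF is one solution to this problem.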

TF-IDF stands for Term Frequency – Inverse Document Frequency.

TF

The number of times a term appears in a document, divided by the total number of words in that document.

$$
TF(i, j)=\frac{\text { Term i frequency in document } \mathrm{j}}{\text { Total words in document } \mathrm{j}}
$$
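As a minimal sketch of the TF formula (the helper name `term_frequency` and the whitespace tokenization are illustrative assumptions, not part of any library):

```python
from collections import Counter

def term_frequency(term, document):
    """TF(i, j): occurrences of `term` in `document` divided by the document's word count."""
    words = document.lower().split()  # assumption: simple whitespace tokenization
    if not words:
        return 0.0
    return Counter(words)[term] / len(words)

doc = "this document is the second document"
print(term_frequency("document", doc))  # 2 occurrences out of 6 words
```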

IDF

The logarithm of the total number of documents divided by the number of documents that contain the term.

$$
IDF(i)=\log _{2}\left(\frac{\text { Total documents }}{\text { documents with term i }}\right)
$$
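The IDF formula can be sketched in a few lines of plain Python. The helper name `inverse_document_frequency` and the whitespace tokenization are assumptions made for illustration. The corpus is the same four-sentence toy corpus used in the scikit-learn example in this post.

```python
import math

def inverse_document_frequency(term, corpus):
    """IDF(i): log2 of (number of documents / number of documents containing `term`)."""
    containing = sum(1 for doc in corpus if term in doc.lower().split())
    if containing == 0:
        raise ValueError(f"{term!r} does not occur in the corpus")
    return math.log2(len(corpus) / containing)

corpus = [
    "this is the first document",
    "this document is the second document",
    "and this is the third one",
    "is this the first document",
]
print(inverse_document_frequency("first", corpus))  # log2(4 / 2) = 1.0
print(inverse_document_frequency("this", corpus))   # log2(4 / 4) = 0.0
```

A term that appears in every document gets an IDF of 0 under this definition, so it carries no weight.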

TF-IDF

The product of the two:

$$\text{TF-IDF}(i, j)=TF(i, j) \times IDF(i)$$
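Putting the two pieces together, here is a rough sketch of the full score using the log2 definition of IDF. The function name `tf_idf` and the whitespace tokenization are illustrative assumptions, and the term is assumed to occur somewhere in the corpus.

```python
import math
from collections import Counter

def tf_idf(term, document, corpus):
    """TF-IDF(i, j) = TF(i, j) * IDF(i), with IDF computed as a base-2 logarithm."""
    words = document.lower().split()
    tf = Counter(words)[term] / len(words)
    # Assumption: `term` occurs in at least one document, so `containing` is not zero
    containing = sum(1 for doc in corpus if term in doc.lower().split())
    idf = math.log2(len(corpus) / containing)
    return tf * idf

corpus = [
    "this is the first document",
    "this document is the second document",
    "and this is the third one",
    "is this the first document",
]
doc = corpus[1]
for term in ("second", "document", "this"):
    print(term, round(tf_idf(term, doc, corpus), 4))
```

In this document "second" scores highest because it occurs in only one of the four documents. "this" scores 0 because it occurs in all of them.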

A worked example

The example below follows the one from scikit-learn's sklearn.feature_extraction.text module.

Import the modules

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

CountVectorizer() produces the term counts

corpus = ['this is the first document',
         'this document is the second document',
         'and this is the third one',
         'is this the first document']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out().tolist())
X.toarray()
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
      [0, 2, 0, 1, 0, 1, 1, 0, 1],
      [1, 0, 0, 1, 1, 0, 1, 1, 1],
      [0, 1, 1, 1, 0, 0, 1, 0, 1]], dtype=int64)
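To see what CountVectorizer is doing, the same matrix can be rebuilt with the standard library alone. This is only a sketch: it assumes the vectorizer's default settings (lowercasing, and tokens of at least two word characters), and the variable names are illustrative.

```python
import re
from collections import Counter

corpus = [
    "this is the first document",
    "this document is the second document",
    "and this is the third one",
    "is this the first document",
]

# Assumption: mimic CountVectorizer's defaults (lowercase, tokens of 2+ word characters)
TOKEN = re.compile(r"(?u)\b\w\w+\b")

tokenized = [TOKEN.findall(doc.lower()) for doc in corpus]
# The vocabulary is the sorted set of all tokens; each column of the matrix is one term
vocabulary = sorted({token for tokens in tokenized for token in tokens})
# Each row holds one document's count for every vocabulary term
counts = [[Counter(tokens)[term] for term in vocabulary] for tokens in tokenized]

print(vocabulary)
for row in counts:
    print(row)
```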

TfidfTransformer() produces the IDF values

tfidf = TfidfTransformer()
tfidf.fit_transform(X).toarray()
tfidf.idf_
array([1.91629073, 1.22314355, 1.51082562, 1.        , 1.91629073,
      1.91629073, 1.       , 1.91629073, 1.       ])
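These values do not match the log2 formula given earlier. With its default `smooth_idf=True`, TfidfTransformer uses a natural logarithm with add-one smoothing: idf = ln((1 + n) / (1 + df)) + 1. The sketch below reproduces the printed values. The helper name `smoothed_idf` is an assumption, and the document frequencies are read off the count matrix by hand.

```python
import math

n_documents = 4
# Number of documents containing each term, read off the count matrix
document_frequency = {
    "and": 1, "document": 3, "first": 2, "is": 4, "one": 1,
    "second": 1, "the": 4, "third": 1, "this": 4,
}

def smoothed_idf(df, n):
    """idf = ln((1 + n) / (1 + df)) + 1, the formula used when smooth_idf=True."""
    return math.log((1 + n) / (1 + df)) + 1

for term, df in document_frequency.items():
    print(f"{term:>8}: {smoothed_idf(df, n_documents):.8f}")
```

Because of the trailing `+ 1`, a term that appears in every document ends up with an IDF of 1 rather than 0, so it is down-weighted but not discarded.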

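TfidfVectorizer() does everything in one step

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out().tolist())
# Dividing by the IDF leaves each row's L2-normalized term counts
print(X.toarray()/tfidf.idf_)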
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
[[0.         0.38408524 0.38408524 0.38408524 0.         0.
0.38408524 0.         0.38408524]
[0.         0.56217735 0.         0.28108867 0.         0.28108867
0.28108867 0.         0.28108867]
[0.26710379 0.         0.         0.26710379 0.26710379 0.
0.26710379 0.26710379 0.26710379]
[0.         0.38408524 0.38408524 0.38408524 0.         0.
0.38408524 0.         0.38408524]]
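Every nonzero entry in a row is the same multiple of its count because TfidfVectorizer, by default (`norm='l2'`), scales each row to unit Euclidean length after multiplying counts by IDF. Dividing by the IDF again therefore leaves count / norm. A sketch for the first document, using the rounded `idf_` values printed earlier (variable names are illustrative):

```python
import math

counts = [0, 1, 1, 1, 0, 0, 1, 0, 1]  # term counts of the first document
idf = [1.91629073, 1.22314355, 1.51082562, 1.0, 1.91629073,
       1.91629073, 1.0, 1.91629073, 1.0]  # idf_ values as printed (rounded)

raw = [c * w for c, w in zip(counts, idf)]        # unnormalized count * idf
norm = math.sqrt(sum(v * v for v in raw))         # Euclidean (L2) length of the row
tfidf_row = [v / norm for v in raw]               # the row TfidfVectorizer would return
scaled_back = [v / w for v, w in zip(tfidf_row, idf)]  # same division as in the print call

print([round(v, 6) for v in tfidf_row])
print([round(v, 6) for v in scaled_back])
```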

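A real-world example

Project link

Kaggle: https://www.kaggle.com/c/umd-inst414-20f-imdb/data

Project description

This is a text classification task, specifically sentiment classification. Each document (one line in the data file) is a movie review taken from IMDB. The goal is to classify the sentiment of each review as "positive" or "negative".

The training data contains 10,000 reviews, each labeled 1 (positive sentiment) or 0 (negative sentiment). The test data contains 5,000 unlabeled sentences, for which labels must be predicted.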

Import the modules

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression

Read the data

train = pd.read_csv('input/inst/train.csv')
test = pd.read_csv('input/spooky/test.csv')
# Split the labeled data into a training set and a validation set
y=train.label.values
xtrain, xvalid, ytrain, yvalid = train_test_split(train.text.values, y, 
                                                  stratify=y, 
                                                  random_state=42, 
                                                  test_size=0.1, shuffle=True)
# Build the TF-IDF features with scikit-learn
tfv = TfidfVectorizer(min_df=3, max_features=None,
                      strip_accents='unicode', analyzer='word', token_pattern=r'\w{1,}',
                      ngram_range=(1, 3), use_idf=True, smooth_idf=True, sublinear_tf=True,
                      stop_words='english')

# Fit TF-IDF on the training and validation texts together (no labels are used,
# but the vocabulary and IDF weights do see the validation texts)
tfv.fit(list(xtrain) + list(xvalid))
xtrain_tfv =  tfv.transform(xtrain) 
xvalid_tfv = tfv.transform(xvalid)
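# Measure accuracy (note: this scores the training set, not the held-out xvalid_tfv)
clf = LogisticRegression(C=1.0)
clf.fit(xtrain_tfv, ytrain)
clf.score(xtrain_tfv, ytrain)
0.96
# Predict labels for the unlabeled test data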
xtest_tfv = tfv.transform(test.text.values)
labels=clf.predict(xtest_tfv)
print(labels[:10])
[1 1 1 0 1 1 0 1 1 1]
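The 0.96 reported above is measured on the training data, which tends to overstate how well the model generalizes. One possible way to get a held-out estimate is to wrap the vectorizer and classifier in a scikit-learn pipeline, so that TF-IDF is fit on the training split only. The sketch below uses a handful of made-up reviews as a stand-in for the Kaggle data, so its scores say nothing about the real task.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy reviews standing in for the Kaggle data (1 = positive, 0 = negative)
texts = [
    "a wonderful and moving film",
    "great acting and a great story",
    "i loved every minute of it",
    "one of the best movies this year",
    "a dull and boring film",
    "terrible acting and a weak story",
    "i hated every minute of it",
    "one of the worst movies this year",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

x_train, x_valid, y_train, y_valid = train_test_split(
    texts, labels, stratify=labels, test_size=0.25, random_state=42
)

# The pipeline fits TF-IDF on the training texts only, then fits the classifier on those features
model = make_pipeline(TfidfVectorizer(), LogisticRegression(C=1.0))
model.fit(x_train, y_train)

print("training accuracy:  ", model.score(x_train, y_train))
print("validation accuracy:", model.score(x_valid, y_valid))
```

With the real data, the same pattern would apply to `xtrain`, `xvalid`, `ytrain` and `yvalid` from the split above.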

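Summary

Once the principle behind TF-IDF is clear, the classification step that follows is fairly routine. LogisticRegression also classifies quite well here.

Other text classification methods will be covered in a later post.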
