Text representation methods:
- One-hot
- Bag of Words
- N-gram
- TF-IDF
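These text representation methods share two drawbacks: the resulting vectors are very high-dimensional, so training takes a long time; and they only count occurrences, ignoring the relationships between words.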
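To make both drawbacks concrete, here is a minimal sketch in plain Python. The toy sentences and the helper names `one_hot` and `bag_of_words` are invented for illustration. Every vector is as long as the vocabulary, and the bag-of-words vector is unchanged when the words are shuffled.

```python
# Toy corpus, invented for illustration; tokens are split on whitespace.
docs = ["the cat sat", "the dog sat", "the dog chased the cat"]

# Vocabulary: one index per distinct token across the corpus.
vocab = sorted({tok for doc in docs for tok in doc.split()})
index = {tok: i for i, tok in enumerate(vocab)}

def one_hot(token):
    """Vector with a single 1 at the token's vocabulary index."""
    vec = [0] * len(vocab)
    vec[index[token]] = 1
    return vec

def bag_of_words(doc):
    """Vector of per-token counts; word order is discarded."""
    vec = [0] * len(vocab)
    for tok in doc.split():
        vec[index[tok]] += 1
    return vec

print(vocab)                  # ['cat', 'chased', 'dog', 'sat', 'the']
print(one_hot("dog"))         # [0, 0, 1, 0, 0]
print(bag_of_words(docs[2]))  # [1, 1, 1, 0, 2]
```

With a real corpus the vocabulary easily reaches tens of thousands of entries, which is where the high dimensionality comes from.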
Count Vectors (Bag of Words model)
Word vectors: the Bag of Words (BOW) model explained
sklearn: CountVectorizer in detail
# CountVectorizer + RidgeClassifier
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Load the first 15,000 rows of the tab-separated training set.
df = pd.read_csv('新建文件夹/天池—新闻文本分类/train_set.csv', sep='\t', nrows=15000)

# Represent each text as a bag of words: a vector of n-gram counts (the feature set).
# max_features=3000 keeps the 3,000 most frequent n-grams in the corpus.
# ngram_range=(1, 3) counts unigrams, bigrams, and trigrams.
vectorizer = CountVectorizer(max_features=3000, ngram_range=(1, 3))
train_text = vectorizer.fit_transform(df['text'])

# Hold out 30% of the rows for validation.
X_train, X_val, y_train, y_val = train_test_split(train_text, df.label, test_size=0.3)

# Fit a ridge classifier on the training split (text features and labels).
clf = RidgeClassifier()
clf.fit(X_train, y_train)

# Score the predictions on the validation split with macro-averaged F1.
val_pred = clf.predict(X_val)
print(f1_score(y_val, val_pred, average='macro'))
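TF-IDF model
A TF-IDF score has two parts: term frequency (TF) and inverse document frequency (IDF). The inverse document frequency is the logarithm of the total number of documents in the corpus divided by the number of documents that contain the term.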
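- TF(t) = number of times term t appears in the current document / total number of terms in the current document
- IDF(t) = log_e(total number of documents / number of documents containing term t)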
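Multiplying TF by IDF gives a term's TF-IDF value. In general, the larger a term's TF-IDF in a document, the more important that term is to the document. Computing the TF-IDF of every term in a document and sorting in descending order therefore puts the document's keywords at the top.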
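The two formulas and the keyword ranking can be sketched directly in plain Python. The toy corpus and the helper names `tf`, `idf`, and `top_keywords` are invented for illustration; ties are broken alphabetically, which is an arbitrary choice.

```python
import math

# Toy corpus, invented for illustration; each document is a list of tokens.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]

def tf(term, doc):
    """Occurrences of term in doc divided by the number of tokens in doc."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Natural log of (total documents / documents containing term).

    Assumes the term occurs in at least one document.
    """
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def top_keywords(doc, docs, k=2):
    """Terms of doc ranked by TF-IDF, highest first."""
    scores = {term: tf(term, doc) * idf(term, docs) for term in set(doc)}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

top = top_keywords(docs[0], docs)
print(top)  # 'mat' ranks first: it appears in only one document
```

Here "the" occurs in every document, so its IDF is log(3/3) = 0 and it never ranks as a keyword, however often it appears.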
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2.
word = vectorizer.get_feature_names_out()
# Feature names: and, document, first, is, one, second, the, third, this

# toarray() returns a dense matrix: one row per document, one TF-IDF value per vocabulary term.
weight = X.toarray()

# Print the term with the highest TF-IDF value in each document.
for i in range(len(weight)):
    w_sort = np.argsort(-weight[i])
    print('doc: {0}, top tf-idf is : {1},{2}'.format(corpus[i], word[w_sort[0]], weight[i][w_sort[0]]))
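One caveat: with its default settings, `TfidfVectorizer` does not compute exactly the textbook formula given earlier. It uses raw counts as TF, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and then scales each document vector to unit L2 length. The sketch below recomputes those steps by hand on the same corpus and compares the result with the library output; the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Raw term counts: the TF that TfidfVectorizer uses by default.
counts = CountVectorizer().fit_transform(corpus).toarray()

# Smoothed IDF (smooth_idf=True): ln((1 + n) / (1 + df)) + 1.
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Multiply, then scale each row to unit L2 norm (norm='l2').
manual = counts * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(corpus).toarray()
print(np.allclose(manual, reference))
```

Because of the smoothing and normalization, the values printed by the library will differ from a hand calculation with the plain formula, even though the ranking of terms is usually similar.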
Example
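# TF-IDF + RidgeClassifier
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Load the first 15,000 rows of the tab-separated training set.
df = pd.read_csv('新建文件夹/天池—新闻文本分类/train_set.csv', sep='\t', nrows=15000)

# TF-IDF features over unigrams, bigrams, and trigrams, capped at 3,000 n-grams.
train_text = TfidfVectorizer(ngram_range=(1, 3), max_features=3000).fit_transform(df.text)

# Hold out 30% of the rows for validation.
X_train, X_val, y_train, y_val = train_test_split(train_text, df.label, test_size=0.3)

clf = RidgeClassifier()
clf.fit(X_train, y_train)

# Score the predictions on the validation split with macro-averaged F1.
val_pred = clf.predict(X_val)
print(f1_score(y_val, val_pred, average='macro'))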
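- Both vectorizers are generally paired with a machine learning model: the vectorizer extracts features from the text, and the model handles prediction and classification.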
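One way to express this division of labor is a scikit-learn pipeline, which chains the vectorizer and the classifier into a single estimator. The sketch below is only an illustration: the texts and labels are invented, and six sentences are far too few to say anything about accuracy.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import make_pipeline

# Toy labelled texts, invented for illustration (1 = sports, 0 = finance).
texts = [
    "the team won the match",
    "the striker scored a late goal",
    "the coach praised the team",
    "stocks fell as the market closed",
    "the bank raised interest rates",
    "investors sold stocks and bonds",
]
labels = [1, 1, 1, 0, 0, 0]

# The vectorizer turns raw text into features; the classifier predicts labels.
model = make_pipeline(TfidfVectorizer(), RidgeClassifier())
model.fit(texts, labels)

# New text passes through the same fitted vectorizer before classification.
pred = model.predict(["the team scored a goal", "the market and stocks"])
print(pred)
```

Wrapping both steps in one object means the vocabulary is learned only from the data passed to `fit`, which avoids fitting the vectorizer on validation text by accident.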
CountVectorizer and TfidfVectorizer: handling Chinese text