概述

从今天开始我们将开启一段自然语言处理 (NLP) 的旅程. 自然语言处理可以让来处理, 理解, 以及运用人类的语言, 实现机器语言和人类语言之间的沟通桥梁.

在这里插入图片描述

TF-IDF 关键词提取

TF-IDF (Term Frequency-Inverse Document Frequency), 即词频-逆文件频率是一种用于信息检索与数据挖掘的常用加权技术. TF-IDF 可以帮助我们挖掘文章中的关键词. 通过数值统计, 反映一个词对于语料库中某篇文章的重要性.

TF

TF (Term Frequency), 即词频. 表示词在文本中出现的频率.

公式:
在这里插入图片描述

IDF

IDF (Inverse Document Frequency), 即逆文档频率. 表示语料库中包含词的文档的数目的倒数.

公式:
在这里插入图片描述

TF-IDF

公式:
在这里插入图片描述

TF-IDF = (词的频率 / 句子总字数) × (总文档数 / 包含该词的文档数)

如果一个词非常常见, 那么 IDF 就会很低, 反之就会很高. TF-IDF 可以帮助我们过滤常见词语, 提取关键词.

TfidfVectorizer

TfidfVectorizer 可以帮助我们把原始文本转化为 tf-idf 的特征矩阵, 从而进行相似度计算. sklearn 的TfidfVectorizer 默认输入文本矩阵每行表示一篇文本. 不同文本中相同词项的 tf 值不同, 因此 tf 值与词项所在文本有关.

在这里插入图片描述
格式:

tfidfVectorizer(input='content', encoding='utf-8',
                 decode_error='strict', strip_accents=None, lowercase=True,
                 preprocessor=None, tokenizer=None, analyzer='word',
                 stop_words=None, token_pattern=r"(?u)\b\w\w+\b",
                 ngram_range=(1, 1), max_df=1.0, min_df=1,
                 max_features=None, vocabulary=None, binary=False,
                 dtype=np.float64, norm='l2', use_idf=True, smooth_idf=True,
                 sublinear_tf=False)

参数:

input: 输入
encoding: 编码, 默认为 utf-8
analyzer: “word” 或 “char”, 默认按词 (word) 分析
stopwords: 停用词
ngram_range: ngrame 上下限
lowercase: 转换为小写
max_features: 关键词个数

数据介绍

数据由 12 个不同网站的新闻数据组成. 如下:

  Class  ...                                            Content
0  news  ...  中广网唐山６月１２日消息（记者汤一亮　庄胜春）据中国之声《新闻晚高峰》报道，今天（１２日）上...
1  news  ...  天津卫视求职节目《非你莫属》“晕倒门”事件余波未了，主持人张绍刚前日通过《非你莫属》节目组发...
2  news  ...  临沂（山东），２０１２年６月４日　夫妻“麦客”忙麦收　６月４日，在山东省临沂市郯城县郯城街道...
3  news  ...  中广网北京６月１３日消息（记者王宇）据中国之声《新闻晚高峰》报道，明天凌晨两场欧洲杯的精彩比...
4  news  ...  环球网记者李亮报道，正在意大利度蜜月的“脸谱”创始人扎克伯格与他华裔妻子的一举一动都处于媒体...

在这里插入图片描述

流程:

读取数据
计算数据 tf-idf 值
贝叶斯分类

代码实现

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
import jieba


def load_data():
    """读取数据/停用词"""

    # 读取数据
    data = pd.read_csv("test.txt", sep="\t", names=["Class", "Title", "Content"])
    print(data.head())

    # 读取停用词
    stop_words = pd.read_csv("stopwords.txt", names=["stop_words"], encoding="utf-8")
    stop_words = stop_words["stop_words"].values.tolist()
    print(stop_words)

    return data, stop_words


def main():
    """主函数"""

    # 读取数据
    data, stop_words = load_data()

    # 分词
    segs = data["Content"].apply(lambda x: ' '.join(jieba.cut(x)))

    # Tf-Idf
    tf_idf = TfidfVectorizer(stop_words=stop_words, max_features=1000, lowercase=False)

    # 拟合
    tf_idf.fit(segs)

    # 转换
    X = tf_idf.transform(segs)

    # 分割数据
    X_train, X_test, y_train, y_test = train_test_split(X, data["Class"], random_state=0)

    # 调试输出
    print(X_train[:2])
    print(y_train[:2])

    # 实例化朴素贝叶斯
    classifier = MultinomialNB()

    # 拟合
    classifier.fit(X_train, y_train)

    # 计算分数
    acc = classifier.score(X_test, y_test)
    print("准确率:", acc)

    # 报告
    report = classification_report(y_test, classifier.predict(X_test))
    print(report)

if __name__ == '__main__':
    main()

输出结果:

  Class  ...                                            Content
0  news  ...  中广网唐山６月１２日消息（记者汤一亮　庄胜春）据中国之声《新闻晚高峰》报道，今天（１２日）上...
1  news  ...  天津卫视求职节目《非你莫属》“晕倒门”事件余波未了，主持人张绍刚前日通过《非你莫属》节目组发...
2  news  ...  临沂（山东），２０１２年６月４日　夫妻“麦客”忙麦收　６月４日，在山东省临沂市郯城县郯城街道...
3  news  ...  中广网北京６月１３日消息（记者王宇）据中国之声《新闻晚高峰》报道，明天凌晨两场欧洲杯的精彩比...
4  news  ...  环球网记者李亮报道，正在意大利度蜜月的“脸谱”创始人扎克伯格与他华裔妻子的一举一动都处于媒体...

[5 rows x 3 columns]
['?', '、', '。', '“', '”', '《', '》', '！', '，', '：', '；', '？', '啊', '阿', '哎', '哎呀', '哎哟', '唉', '俺', '俺们', '按', '按照', '吧', '吧哒', '把', '罢了', '被', '本', '本着', '比', '比方', '比如', '鄙人', '彼', '彼此', '边', '别', '别的', '别说', '并', '并且', '不比', '不成', '不单', '不但', '不独', '不管', '不光', '不过', '不仅', '不拘', '不论', '不怕', '不然', '不如', '不特', '不惟', '不问', '不只', '朝', '朝着', '趁', '趁着', '乘', '冲', '除', '除此之外', '除非', '除了', '此', '此间', '此外', '从', '从而', '打', '待', '但', '但是', '当', '当着', '到', '得', '的', '的话', '等', '等等', '地', '第', '叮咚', '对', '对于', '多', '多少', '而', '而况', '而且', '而是', '而外', '而言', '而已', '尔后', '反过来', '反过来说', '反之', '非但', '非徒', '否则', '嘎', '嘎登', '该', '赶', '个', '各', '各个', '各位', '各种', '各自', '给', '根据', '跟', '故', '故此', '固然', '关于', '管', '归', '果然', '果真', '过', '哈', '哈哈', '呵', '和', '何', '何处', '何况', '何时', '嘿', '哼', '哼唷', '呼哧', '乎', '哗', '还是', '还有', '换句话说', '换言之', '或', '或是', '或者', '极了', '及', '及其', '及至', '即', '即便', '即或', '即令', '即若', '即使', '几', '几时', '己', '既', '既然', '既是', '继而', '加之', '假如', '假若', '假使', '鉴于', '将', '较', '较之', '叫', '接着', '结果', '借', '紧接着', '进而', '尽', '尽管', '经', '经过', '就', '就是', '就是说', '据', '具体地说', '具体说来', '开始', '开外', '靠', '咳', '可', '可见', '可是', '可以', '况且', '啦', '来', '来着', '离', '例如', '哩', '连', '连同', '两者', '了', '临', '另', '另外', '另一方面', '论', '嘛', '吗', '慢说', '漫说', '冒', '么', '每', '每当', '们', '莫若', '某', '某个', '某些', '拿', '哪', '哪边', '哪儿', '哪个', '哪里', '哪年', '哪怕', '哪天', '哪些', '哪样', '那', '那边', '那儿', '那个', '那会儿', '那里', '那么', '那么些', '那么样', '那时', '那些', '那样', '乃', '乃至', '呢', '能', '你', '你们', '您', '宁', '宁可', '宁肯', '宁愿', '哦', '呕', '啪达', '旁人', '呸', '凭', '凭借', '其', '其次', '其二', '其他', '其它', '其一', '其余', '其中', '起', '起见', '起见', '岂但', '恰恰相反', '前后', '前者', '且', '然而', '然后', '然则', '让', '人家', '任', '任何', '任凭', '如', '如此', '如果', '如何', '如其', '如若', '如上所述', '若', '若非', '若是', '啥', '上下', '尚且', '设若', '设使', '甚而', '甚么', '甚至', '省得', '时候', '什么', '什么样', '使得', '是', '是的', '首先', '谁', '谁知', '顺', '顺着', '似的', '虽', '虽然', '虽说', '虽则', '随', '随着', '所', '所以', '他', '他们', '他人', '它', '它们', '她', '她们', '倘', '倘或', '倘然', '倘若', '倘使', '腾', '替', '通过', '同', '同时', '哇', '万一', '往', '望', '为', '为何', '为了', '为什么', '为着', '喂', '嗡嗡', '我', '我们', '呜', '呜呼', '乌乎', '无论', '无宁', '毋宁', '嘻', '吓', '相对而言', '像', '向', '向着', '嘘', '呀', '焉', '沿', '沿着', '要', '要不', '要不然', '要不是', '要么', '要是', '也', '也罢', '也好', '一', '一般', '一旦', '一方面', '一来', '一切', '一样', '一则', '依', '依照', '矣', '以', '以便', '以及', '以免', '以至', '以至于', '以致', '抑或', '因', '因此', '因而', '因为', '哟', '用', '由', '由此可见', '由于', '有', '有的', '有关', '有些', '又', '于', '于是', '于是乎', '与', '与此同时', '与否', '与其', '越是', '云云', '哉', '再说', '再者', '在', '在下', '咱', '咱们', '则', '怎', '怎么', '怎么办', '怎么样', '怎样', '咋', '照', '照着', '者', '这', '这边', '这儿', '这个', '这会儿', '这就是说', '这里', '这么', '这么点儿', '这么些', '这么样', '这时', '这些', '这样', '正如', '吱', '之', '之类', '之所以', '之一', '只是', '只限', '只要', '只有', '至', '至于', '诸位', '着', '着呢', '自', '自从', '自个儿', '自各儿', '自己', '自家', '自身', '综上所述', '总的来看', '总的来说', '总的说来', '总而言之', '总之', '纵', '纵令', '纵然', '纵使', '遵照', '作为', '兮', '呃', '呗', '咚', '咦', '喏', '啐', '喔唷', '嗬', '嗯', '嗳', 'a', 'able', 'about', 'above', 'abroad', 'according', 'accordingly', 'across', 'actually', 'adj', 'after', 'afterwards', 'again', 'against', 'ago', 'ahead', "ain't", 'all', 'allow', 'allows', 'almost', 'alone', 'along', 'alongside', 'already', 'also', 'although', 'always', 'am', 'amid', 'amidst', 'among', 'amongst', 'an', 'and', 'another', 'any', 'anybody', 'anyhow', 'anyone', 'anything', 'anyway', 'anyways', 'anywhere', 'apart', 'appear', 'appreciate', 'appropriate', 'are', "aren't", 'around', 'as', "a's", 'aside', 'ask', 'asking', 'associated', 'at', 'available', 'away', 'awfully', 'b', 'back', 'backward', 'backwards', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'begin', 'behind', 'being', 'believe', 'below', 'beside', 'besides', 'best', 'better', 'between', 'beyond', 'both', 'brief', 'but', 'by', 'c', 'came', 'can', 'cannot', 'cant', "can't", 'caption', 'cause', 'causes', 'certain', 'certainly', 'changes', 'clearly', "c'mon", 'co', 'co.', 'com', 'come', 'comes', 'concerning', 'consequently', 'consider', 'considering', 'contain', 'containing', 'contains', 'corresponding', 'could', "couldn't", 'course', "c's", 'currently', 'd', 'dare', "daren't", 'definitely', 'described', 'despite', 'did', "didn't", 'different', 'directly', 'do', 'does', "doesn't", 'doing', 'done', "don't", 'down', 'downwards', 'during', 'e', 'each', 'edu', 'eg', 'eight', 'eighty', 'either', 'else', 'elsewhere', 'end', 'ending', 'enough', 'entirely', 'especially', 'et', 'etc', 'even', 'ever', 'evermore', 'every', 'everybody', 'everyone', 'everything', 'everywhere', 'ex', 'exactly', 'example', 'except', 'f', 'fairly', 'far', 'farther', 'few', 'fewer', 'fifth', 'first', 'five', 'followed', 'following', 'follows', 'for', 'forever', 'former', 'formerly', 'forth', 'forward', 'found', 'four', 'from', 'further', 'furthermore', 'g', 'get', 'gets', 'getting', 'given', 'gives', 'go', 'goes', 'going', 'gone', 'got', 'gotten', 'greetings', 'h', 'had', "hadn't", 'half', 'happens', 'hardly', 'has', "hasn't", 'have', "haven't", 'having', 'he', "he'd", "he'll", 'hello', 'help', 'hence', 'her', 'here', 'hereafter', 'hereby', 'herein', "here's", 'hereupon', 'hers', 'herself', "he's", 'hi', 'him', 'himself', 'his', 'hither', 'hopefully', 'how', 'howbeit', 'however', 'hundred', 'i', "i'd", 'ie', 'if', 'ignored', "i'll", "i'm", 'immediate', 'in', 'inasmuch', 'inc', 'inc.', 'indeed', 'indicate', 'indicated', 'indicates', 'inner', 'inside', 'insofar', 'instead', 'into', 'inward', 'is', "isn't", 'it', "it'd", "it'll", 'its', "it's", 'itself', "i've", 'j', 'just', 'k', 'keep', 'keeps', 'kept', 'know', 'known', 'knows', 'l', 'last', 'lately', 'later', 'latter', 'latterly', 'least', 'less', 'lest', 'let', "let's", 'like', 'liked', 'likely', 'likewise', 'little', 'look', 'looking', 'looks', 'low', 'lower', 'ltd', 'm', 'made', 'mainly', 'make', 'makes', 'many', 'may', 'maybe', "mayn't", 'me', 'mean', 'meantime', 'meanwhile', 'merely', 'might', "mightn't", 'mine', 'minus', 'miss', 'more', 'moreover', 'most', 'mostly', 'mr', 'mrs', 'much', 'must', "mustn't", 'my', 'myself', 'n', 'name', 'namely', 'nd', 'near', 'nearly', 'necessary', 'need', "needn't", 'needs', 'neither', 'never', 'neverf', 'neverless', 'nevertheless', 'new', 'next', 'nine', 'ninety', 'no', 'nobody', 'non', 'none', 'nonetheless', 'noone', 'no-one', 'nor', 'normally', 'not', 'nothing', 'notwithstanding', 'novel', 'now', 'nowhere', 'o', 'obviously', 'of', 'off', 'often', 'oh', 'ok', 'okay', 'old', 'on', 'once', 'one', 'ones', "one's", 'only', 'onto', 'opposite', 'or', 'other', 'others', 'otherwise', 'ought', "oughtn't", 'our', 'ours', 'ourselves', 'out', 'outside', 'over', 'overall', 'own', 'p', 'particular', 'particularly', 'past', 'per', 'perhaps', 'placed', 'please', 'plus', 'possible', 'presumably', 'probably', 'provided', 'provides', 'q', 'que', 'quite', 'qv', 'r', 'rather', 'rd', 're', 'really', 'reasonably', 'recent', 'recently', 'regarding', 'regardless', 'regards', 'relatively', 'respectively', 'right', 'round', 's', 'said', 'same', 'saw', 'say', 'saying', 'says', 'second', 'secondly', 'see', 'seeing', 'seem', 'seemed', 'seeming', 'seems', 'seen', 'self', 'selves', 'sensible', 'sent', 'serious', 'seriously', 'seven', 'several', 'shall', "shan't", 'she', "she'd", "she'll", "she's", 'should', "shouldn't", 'since', 'six', 'so', 'some', 'somebody', 'someday', 'somehow', 'someone', 'something', 'sometime', 'sometimes', 'somewhat', 'somewhere', 'soon', 'sorry', 'specified', 'specify', 'specifying', 'still', 'sub', 'such', 'sup', 'sure', 't', 'take', 'taken', 'taking', 'tell', 'tends', 'th', 'than', 'thank', 'thanks', 'thanx', 'that', "that'll", 'thats', "that's", "that've", 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby', "there'd", 'therefore', 'therein', "there'll", "there're", 'theres', "there's", 'thereupon', "there've", 'these', 'they', "they'd", "they'll", "they're", "they've", 'thing', 'things', 'think', 'third', 'thirty', 'this', 'thorough', 'thoroughly', 'those', 'though', 'three', 'through', 'throughout', 'thru', 'thus', 'till', 'to', 'together', 'too', 'took', 'toward', 'towards', 'tried', 'tries', 'truly', 'try', 'trying', "t's", 'twice', 'two', 'u', 'un', 'under', 'underneath', 'undoing', 'unfortunately', 'unless', 'unlike', 'unlikely', 'until', 'unto', 'up', 'upon', 'upwards', 'us', 'use', 'used', 'useful', 'uses', 'using', 'usually', 'v', 'value', 'various', 'versus', 'very', 'via', 'viz', 'vs', 'w', 'want', 'wants', 'was', "wasn't", 'way', 'we', "we'd", 'welcome', 'well', "we'll", 'went', 'were', "we're", "weren't", "we've", 'what', 'whatever', "what'll", "what's", "what've", 'when', 'whence', 'whenever', 'where', 'whereafter', 'whereas', 'whereby', 'wherein', "where's", 'whereupon', 'wherever', 'whether', 'which', 'whichever', 'while', 'whilst', 'whither', 'who', "who'd", 'whoever', 'whole', "who'll", 'whom', 'whomever', "who's", 'whose', 'why', 'will', 'willing', 'wish', 'with', 'within', 'without', 'wonder', "won't", 'would', "wouldn't", 'x', 'y', 'yes', 'yet', 'you', "you'd", "you'll", 'your', "you're", 'yours', 'yourself', 'yourselves', "you've", 'z', 'zero']
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\Windows\AppData\Local\Temp\jieba.cache
Loading model cost 0.797 seconds.
Prefix dict has been built successfully.
C:\Users\Windows\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py:300: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['ain', 'aren', 'couldn', 'daren', 'didn', 'doesn', 'don', 'hadn', 'hasn', 'haven', 'isn', 'll', 'mayn', 'mightn', 'mon', 'mustn', 'needn', 'oughtn', 'shan', 'shouldn', 've', 'wasn', 'weren', 'won', 'wouldn'] not in stop_words.
  'stop_words.' % sorted(inconsistent))
C:\Users\Windows\Anaconda3\lib\site-packages\sklearn\metrics\classification.py:1437: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
  (0, 1)	0.172494787172401
  (0, 13)	0.03617578927683419
  (0, 19)	0.044685283861169885
  (0, 24)	0.04378669110667244
  (0, 32)	0.04770060616202845
  (0, 37)	0.08714906173699981
  (0, 42)	0.03262617791847282
  (0, 79)	0.03598272479613044
  (0, 80)	0.03384787551537572
  (0, 83)	0.105263111952599
  (0, 88)	0.1963277784717525
  (0, 89)	0.04008970433219022
  (0, 90)	0.1412328663779052
  (0, 98)	0.03994454517190565
  (0, 134)	0.04594067399577792
  (0, 142)	0.04422553495980068
  (0, 147)	0.05830540319790606
  (0, 149)	0.02851184938909845
  (0, 163)	0.03588762154160368
  (0, 166)	0.11695593718680143
  (0, 168)	0.06847746933720501
  (0, 170)	0.08047415949662211
  (0, 176)	0.03776179057073174
  (0, 185)	0.03924979201525634
  (0, 200)	0.04184963649844074
  :	:
  (0, 855)	0.04946215984804544
  (0, 865)	0.04649241857748127
  (0, 869)	0.138252930106818
  (0, 870)	0.21384828173878706
  (0, 873)	0.1667028225043511
  (0, 876)	0.09298483715496254
  (0, 885)	0.19784215331311594
  (0, 888)	0.04008970433219022
  (0, 896)	0.04985458905113395
  (0, 897)	0.49416003160549193
  (0, 902)	0.03938479984191792
  (0, 909)	0.03885574129240138
  (0, 910)	0.03911664642497605
  (0, 917)	0.05050266101265748
  (0, 922)	0.03966061581314387
  (0, 933)	0.02711876416903459
  (0, 947)	0.0329697338326223
  (0, 955)	0.15044205095521537
  (0, 972)	0.03255891003473477
  (0, 998)	0.1578946679288985
  (1, 196)	0.37511948551579166
  (1, 224)	0.21966463546937165
  (1, 420)	0.24197656045436386
  (1, 460)	0.49133294291978213
  (1, 787)	0.7148930709574247
1394    news.cn
1365    news.cn
Name: Class, dtype: object
准确率: 0.7176165803108808
               precision    recall  f1-score   support

1688.autos.cn       0.00      0.00      0.00         3
     autos.cn       1.00      0.22      0.36         9
       biz.cn       0.63      0.73      0.68        45
          cpc       0.00      0.00      0.00         8
      dangshi       0.48      0.77      0.59        13
       ent.cn       0.86      0.57      0.68        44
        henan       0.98      0.77      0.86        69
         news       0.00      0.00      0.00        16
      news.cn       0.64      0.93      0.76       131
      society       0.00      0.00      0.00         5
    sports.cn       0.86      0.86      0.86        37
       theory       0.00      0.00      0.00         6

     accuracy                           0.72       386
    macro avg       0.45      0.40      0.40       386
 weighted avg       0.69      0.72      0.68       386

人工智能最新文章

2022吴恩达机器学习课程——第二课（神经网

第十五章规则学习

FixMatch: Simplifying Semi-Supervised Le

数据挖掘Java——Kmeans算法的实现

大脑皮层的分割方法

【翻译】GPT-3是如何工作的

论文笔记:TEACHTEXT: CrossModal Generaliz

python从零学（六）

详解Python 3.x 导入(import)

【答读者问27】backtrader不支持最新版本的