[Big Data] NLTK - Basic Usage


NLTK is best suited to processing English text.

Tokenization

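from nltk.tokenize import word_tokenize  # tokenizer (requires NLTK's Punkt tokenizer data)
tokens = word_tokenize(input_str)  # input_str is the raw text to tokenize

# Convert to lowercase
tokens = [word.lower() for word in tokens]
tokens[:5]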

Stripping HTML tags and similar markup

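import nltk
from bs4 import BeautifulSoup  # nltk.clean_html() is no longer implemented in NLTK 3 and raises NotImplementedError

clean = BeautifulSoup(html, 'html.parser').get_text()  # html is the raw HTML string
tokens = clean.split()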

A regular expression would also work, but this is more convenient.
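Tag stripping can also be sketched with the standard library alone. The example below is a minimal illustration, not part of the original article: `TextExtractor` and `strip_html` are hypothetical helper names, and the sample HTML string is made up. It keeps every text node, so the contents of `<script>` or `<style>` elements would be kept as well.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes and drop the tags (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs defaults to True, so entities such as &amp; are decoded
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)


html = '<p>Sample <b>code</b> &amp; articles</p>'  # made-up input
clean = strip_html(html)
tokens = clean.split()
print(tokens)  # ['Sample', 'code', '&', 'articles']
```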


Frequency distribution

freq_dist = nltk.FreqDist(tokens)
# FreqDist({',': 3, 'have': 2, 'Today': 1, "'s": 1, 'weather': 1, 'is': 1, 'good': 1, 'very': 1, 'windy': 1, 'and': 1, ...})

for k, v in freq_dist.items():
    print(k, ': ', v)
  
Today :  1
's :  1
weather :  1
is :  1
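`FreqDist` is a subclass of `collections.Counter`, so the usual `Counter` methods such as `most_common()` are available on it too. The sketch below uses a plain `Counter` so that it runs without NLTK; the token list is made up for illustration.

```python
from collections import Counter

# Made-up tokens, standing in for the output of a tokenizer
tokens = ['the', 'code', 'and', 'the', 'sample', 'code', 'the']

counts = Counter(tokens)

# The two most frequent tokens with their counts
top_two = counts.most_common(2)
print(top_two)  # [('the', 3), ('code', 2)]

# Relative frequency of a single token
total = sum(counts.values())
relative_the = counts['the'] / total
print(round(relative_the, 3))  # 0.429
```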

The Text object

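from nltk.text import Text
help(nltk.text)  # show the module help

# Create a Text object
t = Text(tokens)

t.count('code')  # number of occurrences of a word

t.index('code')  # index of the word's first occurrence

# Jupyter magic: render plots inline in the notebook
%matplotlib inline
t.plot(8)  # plot the frequencies of the 8 most common words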

Stopword filtering

Loading the stopwords

from nltk.corpus import stopwords  # load the stopwords corpus
stopwords.readme().replace('\n', ' ')  # corpus README; newlines are replaced with spaces to make it easier to read
'Stopwords Corpus  This corpus contains lists of stop words for several languages.  These are high-frequency grammatical words which are usually ignored in text retrieval applications.  They were obtained from: http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/snowball/stopwords/  The stop words for the Romanian language were obtained from: http://arlc.ro/resources/  The English list has been augmented https://github.com/nltk/nltk_data/issues/22  The German list has been corrected https://github.com/nltk/nltk_data/pull/49  A Kazakh list has been added https://github.com/nltk/nltk_data/pull/52  A Nepali list has been added https://github.com/nltk/nltk_data/pull/83  An Azerbaijani list has been added https://github.com/nltk/nltk_data/pull/100  A Greek list has been added https://github.com/nltk/nltk_data/pull/103  An Indonesian list has been added https://github.com/nltk/nltk_data/pull/112 '

stopwords.fileids()  # stopword lists by language; this version has no Chinese list
['arabic',
 'azerbaijani',
 'danish',
 'dutch',
 'english',
 'finnish',
 'french',
 'german',
 'greek',
 'hungarian',
 'indonesian',
 'italian',
 'kazakh',
 'nepali',
 'norwegian',
 'portuguese',
 'romanian',
 'russian',
 'slovene',
 'spanish',
 'swedish',
 'tajik',
 'turkish']

Viewing the English stopword list

stopwords.raw('english').replace('\n', ' ')
"i me my myself we our ours ourselves you you're you've you'll you'd your yours yourself yourselves he him his himself she she's her hers herself it it's its itself they them their theirs themselves what which who whom this that that'll these those am is are was were be been being have has had having do does did doing a an the and but if or because as until while of at by for with about against between into through during before after above below to from up down in out on off over under again further then once here there when where why how all any both each few more most other some such no nor not only own same so than too very s t can will just don don't should should've now d ll m o re ve y ain aren aren't couldn couldn't didn didn't doesn doesn't hadn hadn't hasn hasn't haven haven't isn isn't ma mightn mightn't mustn mustn't needn needn't shan shan't shouldn shouldn't wasn wasn't weren weren't won won't wouldn wouldn't "
test_words = [word.lower() for word in tokens]
test_words_set = set(test_words)  # convert to a set so it is easy to intersect with the stopword list
test_words
['browse',
 'the',
 'latest',
 'developer',
 'documentation',
 ',',
 'including',
 'tutorials',
 ',',
 'sample',
 'code',
 ',',
 'articles',
 ',',
 'and',
 'api',
 'reference',
 '.']
test_words_set
{',',
 '.',
 'and',
 'api',
 'articles',
 'browse',
 'code',
 'developer',
 'documentation',
 'including',
 'latest',
 'reference',
 'sample',
 'the',
 'tutorials'}

Intersection with the stopword list

stopwords_english = set(stopwords.words('english'))
test_words_set.intersection(stopwords_english)
{'and', 'the'}

Filtering out the stopwords

filtered = [w for w in test_words_set if w not in stopwords_english]
filtered
['documentation',
 'api',
 'tutorials',
 'articles',
 '.',
 'including',
 'latest',
 'code',
 'sample',
 'developer',
 ',',
 'reference',
 'browse']
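Because the filter above iterates over a set, the result loses the original word order and any repeated tokens. A variant that filters the token list instead keeps both. The sketch below is self-contained: the two-word stopword set is a stand-in for `stopwords.words('english')`, and dropping punctuation tokens is an extra step that the original does not perform.

```python
import string

test_words = ['browse', 'the', 'latest', 'developer', 'documentation', ',',
              'including', 'tutorials', ',', 'sample', 'code', ',',
              'articles', ',', 'and', 'api', 'reference', '.']

# Stand-in for set(stopwords.words('english')); only the two words that matter here
stopwords_english = {'and', 'the'}
punctuation = set(string.punctuation)

# Filter the list (not the set) so order and duplicates are preserved
filtered_in_order = [w for w in test_words
                     if w not in stopwords_english and w not in punctuation]
print(filtered_in_order)
```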

Part-of-speech tagging with pos_tag

pos_tag uses the averaged perceptron tagger from the NLTK data packages.

from nltk import pos_tag
tags = pos_tag(tokens)
tags  # display the tagged tokens
[('browse', 'VB'),
 ('the', 'DT'),
 ('latest', 'JJS'),
 ('developer', 'NN'),
 ('documentation', 'NN'),
 (',', ','),
 ('including', 'VBG'),
 ('tutorials', 'NNS'),
 (',', ','),
 ('sample', 'NN'),
 ('code', 'NN'),
 (',', ','),
 ('articles', 'NNS'),
 (',', ','),
 ('and', 'CC'),
 ('api', 'JJ'),
 ('reference', 'NN'),
 ('.', '.')]
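The tagged output is a list of `(word, tag)` tuples, so it can be filtered with ordinary list operations. As a small sketch, the Penn Treebank noun tags (NN, NNS, NNP, NNPS) all start with `NN`, which makes it easy to pull out the nouns. The list below is copied from the output above so that the example runs without NLTK.

```python
from collections import Counter

tags = [('browse', 'VB'), ('the', 'DT'), ('latest', 'JJS'),
        ('developer', 'NN'), ('documentation', 'NN'), (',', ','),
        ('including', 'VBG'), ('tutorials', 'NNS'), (',', ','),
        ('sample', 'NN'), ('code', 'NN'), (',', ','),
        ('articles', 'NNS'), (',', ','), ('and', 'CC'),
        ('api', 'JJ'), ('reference', 'NN'), ('.', '.')]

# Keep the words whose tag is any noun tag
nouns = [word for word, tag in tags if tag.startswith('NN')]
print(nouns)

# Count how often each tag occurs
tag_counts = Counter(tag for _, tag in tags)
print(tag_counts['NN'], tag_counts['NNS'])  # 5 2
```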

Chunking

Chunking extracts the spans of a sentence that match a given constituent pattern.

from nltk.chunk import RegexpParser
sentence = [('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), 
            ('dog', 'NN'), ('died', 'VBD')]
grammar = 'MY_NP:{<DT>?<JJ>*<NN>}'
cp = RegexpParser(grammar)
result = cp.parse(sentence)
# Evaluating a bare `result` in the notebook raised an error here, so print it instead
print(result)
(S (MY_NP the/DT little/JJ yellow/JJ dog/NN) died/VBD)
result.draw()
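To see what the tag pattern `<DT>?<JJ>*<NN>` means, it helps to treat the tag sequence as a string and match it with an ordinary regular expression: an optional determiner, any number of adjectives, then a noun. The sketch below is a rough stdlib illustration of that idea and is not NLTK's implementation; `chunk_noun_phrases` is a hypothetical helper name.

```python
import re


def chunk_noun_phrases(tagged):
    """Group tokens whose tags match <DT>?<JJ>*<NN> (hypothetical helper)."""
    # Encode the tag sequence as one string, e.g. "<DT><JJ><JJ><NN><VBD>"
    tag_string = ''.join(f'<{tag}>' for _, tag in tagged)

    # Record the character offset at which each token's tag begins
    starts, pos = [], 0
    for _, tag in tagged:
        starts.append(pos)
        pos += len(tag) + 2  # +2 for the angle brackets

    chunks = []
    for match in re.finditer(r'(?:<DT>)?(?:<JJ>)*<NN>', tag_string):
        # Map the matched character span back to the tokens it covers
        chunk = [word for (word, _), start in zip(tagged, starts)
                 if match.start() <= start < match.end()]
        chunks.append(chunk)
    return chunks


sentence = [('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'),
            ('dog', 'NN'), ('died', 'VBD')]
chunks = chunk_noun_phrases(sentence)
print(chunks)  # [['the', 'little', 'yellow', 'dog']]
```

As in the grammar above, `<NN>` matches only the exact tag NN, so a plural noun tagged NNS is not part of a chunk.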

Named entity recognition

Required data packages: maxent_ne_chunker, words

The input is first tokenized and then tagged with parts of speech.

from nltk import ne_chunk
# Named `text` rather than `str` to avoid shadowing the built-in str type
text = 'The Apple Developer Program provides everything you need to build and distribute your apps on the Mac App Store. '

tokens = word_tokenize(text)  # tokenize
tags = pos_tag(tokens)  # tag parts of speech
print(ne_chunk(tags))
(S
  The/DT
  (ORGANIZATION Apple/NNP Developer/NNP Program/NNP)
  provides/VBZ
  everything/NN
  you/PRP
  need/VBP
  to/TO
  build/VB
  and/CC
  distribute/VB
  your/PRP$
  apps/NN
  on/IN
  the/DT
  (ORGANIZATION Mac/NNP App/NNP Store/NNP)
  ./.)
tags
[('The', 'DT'),
 ('Apple', 'NNP'),
 ('Developer', 'NNP'),
 ('Program', 'NNP'),
 ('provides', 'VBZ'),
 ('everything', 'NN'),
 ('you', 'PRP'),
 ('need', 'VBP'),
 ('to', 'TO'),
 ('build', 'VB'),
 ('and', 'CC'),
 ('distribute', 'VB'),
 ('your', 'PRP$'),
 ('apps', 'NN'),
 ('on', 'IN'),
 ('the', 'DT'),
 ('Mac', 'NNP'),
 ('App', 'NNP'),
 ('Store', 'NNP'),
 ('.', '.')]

Data cleaning

  • Remove extra whitespace
  • Remove unwanted special characters
  • Remove useless items such as URLs

This uses regular expressions and the stopwords corpus.

import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# Input data
s = '    RT @Amila #Test\nTom\'s newly listed Co  &amp; Mary\'s unlisted     Group to supply tech for nlTK.\nh $TSLA $AAPL https:// t.co/x34afsfQsh'

# Cache the English stopword list
cache_english_stopwords = stopwords.words('english')

def text_clean(text):
    print('Original text:', text, '\n')

    # Remove HTML entities (e.g. &amp;), #hashtags and @mentions
    text_no_special_entities = re.sub(r'\&\w*;|#\w*|@\w*', '', text)
    print('After removing special entities:', text_no_special_entities, '\n')

    # Remove ticker symbols (e.g. $TSLA)
    text_no_tickers = re.sub(r'\$\w*', '', text_no_special_entities)
    print('After removing ticker symbols:', text_no_tickers, '\n')

    # Remove hyperlinks
    text_no_hyperlinks = re.sub(r'https?:\/\/.*\/\w*', '', text_no_tickers)
    print('After removing hyperlinks:', text_no_hyperlinks, '\n')

    # Remove very short words (1-2 characters), which are often abbreviations
    text_no_small_words = re.sub(r'\b\w{1,2}\b', '', text_no_hyperlinks)
    print('After removing short words:', text_no_small_words, '\n')

    # Collapse runs of whitespace and strip leading spaces
    text_no_whitespace = re.sub(r'\s\s+', ' ', text_no_small_words)
    text_no_whitespace = text_no_whitespace.lstrip(' ')
    print('After removing extra whitespace:', text_no_whitespace, '\n')

    # Tokenize
    tokens = word_tokenize(text_no_whitespace)
    print('Tokens:', tokens, '\n')

    # Remove stopwords
    list_no_stopwords = [i for i in tokens if i not in cache_english_stopwords]
    print('After removing stopwords:', list_no_stopwords, '\n')
    # Join the remaining tokens back into a string
    text_filtered = ' '.join(list_no_stopwords)  # ''.join() would join without spaces between words.
    print('Filtered text:', text_filtered)
    return text_filtered

text_clean(s)
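One detail of the hyperlink pattern in `text_clean` is worth noting: `.*` is greedy, so the match runs from `https://` to the last `/` on the same line and can swallow ordinary text after the link. The sketch below compares it with a narrower alternative, `https?://\S+`, on a made-up sentence. The alternative is an assumption of this example rather than the author's pattern, and it would not match the sample input above, where a space follows `https://`.

```python
import re

# Made-up text with a URL followed by unrelated text containing a slash
text = 'see https://a.com/x and also b/c end'

# Pattern from text_clean: greedy, matches up to the last slash on the line
greedy = re.sub(r'https?:\/\/.*\/\w*', '', text)
print(repr(greedy))  # 'see  end'

# Narrower alternative (assumption): stop at the first whitespace character
narrow = re.sub(r'https?://\S+', '', text)
print(repr(narrow))  # 'see  and also b/c end'
```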






Added: 2021-10-26 12:16:31  Updated: 2021-10-26 12:19:02
 