Python Unicode in Practice

A quick note first: the code in this article targets Python 3 (compatibility with Python 2 is not considered for now).

1. A brief history of character encodings

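ASCII, the earliest of these encodings, is stored one character per 8-bit byte. A byte can hold 2^8 = 256 values, but ASCII itself is a 7-bit code and only defines 0~127, which cover English letters, digits, control characters and other symbols; see https://ttssh2.osdn.jp/manual/4/en/macro/appendixes/ascii.html for the full table. That settled how English text is represented and stored in a computer, and it left the values 128~255 unused for the time being.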

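As computers spread, other countries needed to represent their own languages as well. Some of them used 128~255 for their extra letters and symbols. For Chinese, however, the remaining 128 positions are nowhere near enough to hold the Chinese characters. Since one byte cannot represent Chinese, two bytes are used instead. To stay compatible with the existing characters, a single byte below 128 keeps its original meaning, while two consecutive bytes that are both above 127 (more precisely, a high byte in [0xA1, 0xF7] followed by a low byte in [0xA1, 0xFE]) together represent one Chinese character. This encoding is GB2312; see https://www.wikiwand.com/zh-hans/GB_2312.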

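GB2312 still cannot represent every Chinese character, so the ranges of both bytes were extended: the high byte now lies in [0x81, 0xFE], and the low byte in [0x40, 0x7E] or [0x80, 0xFE]. This encoding is GBK; see https://www.wikiwand.com/zh-hans/GBK.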
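The byte ranges above can be checked against Python's built-in gb2312 and gbk codecs. The snippet below is only a sketch: the helper names in_gb2312_range and in_gbk_range are made up for illustration, and 镕 is picked as a character that is commonly cited as missing from GB2312 (the code tests that assumption instead of relying on it).

```python
def in_gb2312_range(data: bytes) -> bool:
    """True if `data` is a two-byte sequence inside the GB2312 ranges."""
    return (len(data) == 2
            and 0xA1 <= data[0] <= 0xF7
            and 0xA1 <= data[1] <= 0xFE)


def in_gbk_range(data: bytes) -> bool:
    """True if `data` is a two-byte sequence inside the GBK ranges."""
    return (len(data) == 2
            and 0x81 <= data[0] <= 0xFE
            and (0x40 <= data[1] <= 0x7E or 0x80 <= data[1] <= 0xFE))


# ASCII characters keep their single-byte form.
ascii_bytes = "A".encode("gb2312")

# A common Chinese character takes two bytes, both above 127.
ri_gb2312 = "日".encode("gb2312")
ri_gbk = "日".encode("gbk")

# Try a rarer character in both codecs; gb2312 may reject it.
try:
    rong_gb2312 = "镕".encode("gb2312")
except UnicodeEncodeError:
    rong_gb2312 = None
rong_gbk = "镕".encode("gbk")

print(ascii_bytes, ri_gb2312.hex(), ri_gbk.hex(), rong_gb2312, rong_gbk.hex())
```

Because GBK extends GB2312 without moving its characters, 日 should come out as the same two bytes in both codecs.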

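Meanwhile, other countries designed encodings for their own languages, with the result that, apart from English, the encodings of different languages were incompatible with one another. The International Organization for Standardization (ISO), working alongside the Unicode Consortium, recognized how serious the problem was, and the outcome was an encoding that covers the characters of all languages: Unicode. To balance performance against resource use, two bytes per character were chosen at first; since 2^16 = 65536, that is enough to cover the characters of the great majority of languages. Compared with the earlier encodings that mix one-byte and two-byte characters, how does a two-byte form represent the characters that used to take one byte? Simply: an all-zero high byte is added and the original byte becomes the low byte. But then every English character takes two bytes, which wastes space. For example, the two-byte Unicode encoding of It's 日报 is: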

I 00000000 01001001
t 00000000 01110100
' 00000000 00100111
s 00000000 01110011
  00000000 00100000
日 01100101 11100101
报 01100010 10100101

Can English characters still be stored in a single byte to save space? UTF-8 was created for exactly this purpose. Its design is as follows:

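For a single-byte character, the first bit of the byte is 0. For English text, UTF-8 therefore uses one byte per character and is identical to ASCII;
For a character of n bytes (n > 1), the first n bits of the first byte are 1 and bit n+1 is 0, and each following byte starts with 10. The remaining bits of these n bytes are filled with the character's Unicode code point, padded with zeros on the high side. This gives the UTF-8 marker bits listed below. (The original design went up to six bytes; RFC 3629 has since limited UTF-8 to four bytes, so the five-byte and six-byte patterns are no longer valid.)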

0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

So the encoding of It's 日报 becomes:

I 01001001
t 01110100
' 00100111
s 01110011
  00100000
日 11100110 10010111 10100101
报 11100110 10001010 10100101

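Compared with the scheme above, the English characters are shorter while each Chinese character takes one more byte. Still, the whole string needs only 11 bytes, fewer than the 14 bytes above. That is why UTF-8 is the more common choice for encoding strings.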
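The two bit tables and the byte counts can be reproduced in Python. The sketch below encodes code points by hand from the UTF-8 bit patterns (one to four bytes only) and compares the result with the built-in codec. The function names utf8_encode and bits are made up for illustration, and the hand-written encoder makes no attempt to reject surrogates or out-of-range values.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point by hand, following the UTF-8 bit patterns."""
    if cp < 0x80:            # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:           # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0x3F)])
    if cp < 0x10000:         # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0x3F),
                      0b10000000 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (cp >> 18),
                  0b10000000 | ((cp >> 12) & 0x3F),
                  0b10000000 | ((cp >> 6) & 0x3F),
                  0b10000000 | (cp & 0x3F)])


def bits(data: bytes) -> str:
    """Render bytes as space-separated groups of 8 bits."""
    return " ".join(f"{b:08b}" for b in data)


text = "It's 日报"
manual = b"".join(utf8_encode(ord(ch)) for ch in text)
utf16_len = len(text.encode("utf-16-be"))  # fixed two bytes per character here
utf8_len = len(text.encode("utf-8"))

for ch in text:
    print(ch, bits(ch.encode("utf-16-be")), "|", bits(ch.encode("utf-8")))
print(utf16_len, utf8_len)
```

The codec name utf-16-be is used for the two-byte form because it writes the high byte first and adds no byte order mark, which matches the first table.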

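2. Python Unicode in practice

2.1 Working with a single character

2.1.1 Determining the category of a character

The most commonly used function for handling a single character is unicodedata.category(). Some of the frequently seen return values are listed below: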

[Cc] Other, Control
[Cf] Other, Format
[Pc] Punctuation, Connector
[Pd] Punctuation, Dash
[Pe] Punctuation, Close
[Pf] Punctuation, Final quote (may behave like Ps or Pe depending on usage)
[Pi] Punctuation, Initial quote (may behave like Ps or Pe depending on usage)
[Po] Punctuation, Other
[Ps] Punctuation, Open
[Mn] Mark, Nonspacing
[Zs] Separator, Space

See https://www.wikiwand.com/en/Unicode_character_property for the other categories.
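As a quick illustration, the snippet below looks up one sample character for most of the categories listed above. The sample characters are my own picks, not taken from the article.

```python
import unicodedata

# One sample per category: Cc, Cf, Pc, Pd, Pe, Pf, Pi, Po, Ps, Mn, Zs, Zs
samples = ["\n", "\u200b", "_", "-", ")", "\u201d", "\u201c", ",", "(",
           "\u0301", " ", "\u3000"]

categories = {ch: unicodedata.category(ch) for ch in samples}

for ch, cat in categories.items():
    # Control characters have no name, so a default is supplied.
    name = unicodedata.name(ch, "<unnamed>")
    print(f"U+{ord(ch):04X} {name}: {cat}")
```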

2.1.2 Checking whether a character is Chinese

def is_chinese_char(cp):
    """Checks whether CP is the codepoint of a CJK character."""
    # This defines a "chinese character" as anything in the CJK Unicode block:
    #   https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
    #
    # Note that the CJK Unicode block is NOT all Japanese and Korean characters,
    # despite its name. The modern Korean Hangul alphabet is a different block,
    # as is Japanese Hiragana and Katakana. Those alphabets are used to write
    # space-separated words, so they are not treated specially and are
    # handled like all of the other languages.
    if ((0x4E00 <= cp <= 0x9FFF) or      # CJK Unified Ideographs
            (0x3400 <= cp <= 0x4DBF) or      # Extension A
            (0x20000 <= cp <= 0x2A6DF) or    # Extension B
            (0x2A700 <= cp <= 0x2B73F) or    # Extension C
            (0x2B740 <= cp <= 0x2B81F) or    # Extension D
            (0x2B820 <= cp <= 0x2CEAF) or    # Extension E
            (0xF900 <= cp <= 0xFAFF) or      # CJK Compatibility Ideographs
            (0x2F800 <= cp <= 0x2FA1F)):     # Compatibility Ideographs Supplement
        return True

    return False
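Note that the function takes a code point, not a character, so callers need ord(). The sketch below is an alternative, table-driven way to write the same check, with a small hypothetical helper contains_chinese added for illustration; the ranges are the same eight blocks as above.

```python
# Same eight blocks as the function above, kept in a table.
CJK_RANGES = [
    (0x4E00, 0x9FFF), (0x3400, 0x4DBF),
    (0x20000, 0x2A6DF), (0x2A700, 0x2B73F),
    (0x2B740, 0x2B81F), (0x2B820, 0x2CEAF),
    (0xF900, 0xFAFF), (0x2F800, 0x2FA1F),
]


def is_chinese_char(cp: int) -> bool:
    """Checks whether code point `cp` falls in one of the CJK ranges."""
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)


def contains_chinese(text: str) -> bool:
    """Hypothetical helper: True if any character of `text` is CJK."""
    return any(is_chinese_char(ord(ch)) for ch in text)


for ch in "日aあ한":
    print(ch, is_chinese_char(ord(ch)))
```

Hiragana あ and Hangul 한 come out as False, which matches the comment in the function about Japanese and Korean alphabets living in other blocks.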

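2.1.3 Checking for whitespace

If Unicode is left out of consideration, the whitespace characters are simply \r, \t, \n and the space. With Unicode, any character in category Zs counts as well.

import unicodedata

def is_whitespace(char):
    """Checks whether `char` is a whitespace character."""
    # \t, \n, and \r are technically control characters but we treat them
    # as whitespace since they are generally considered as such.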
    if char == " " or char == "\t" or char == "\n" or char == "\r":
        return True
    cat = unicodedata.category(char)
    if cat == "Zs":
        return True
    return False
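It may help to see how this definition differs from the built-in str.isspace(). The sketch below repeats the function so that it runs on its own and compares the two on a few characters of my own choosing.

```python
import unicodedata


def is_whitespace(char):
    """Same logic as above: four ASCII characters plus category Zs."""
    if char in (" ", "\t", "\n", "\r"):
        return True
    return unicodedata.category(char) == "Zs"


# (character, description) pairs chosen for illustration
samples = [
    ("\u00a0", "no-break space"),
    ("\u3000", "ideographic space"),
    ("\x0b", "vertical tab"),
    ("\u2028", "line separator"),
    ("a", "letter"),
]

for ch, label in samples:
    print(f"{label}: is_whitespace={is_whitespace(ch)} isspace={ch.isspace()}")
```

The vertical tab (category Cc) and the line separator (category Zl) are whitespace for str.isspace() but not for is_whitespace, so the two checks are not interchangeable.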

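2.1.4 Checking for control characters

In Unicode, the whitespace characters above other than the space (\r, \t and \n) are all classified as control characters, as shown below: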

import unicodedata
print(unicodedata.category('\r'))
print(unicodedata.category('\t'))
print(unicodedata.category('\n'))

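All three calls print Cc.
So the correct logic is to check first whether the character is one of these whitespace characters (the space excluded), and only then use unicodedata to decide whether it is a control character.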

def is_control(char):
    """Checks whether `char` is a control character."""
    # These are technically control characters but we count them as whitespace
    # characters.
    if char == "\t" or char == "\n" or char == "\r":
        return False
    cat = unicodedata.category(char)
    if cat in ("Cc", "Cf"):
        return True
    return False
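A short check of the behaviour, as a sketch: the function is repeated under the name is_control (the name that clean_text uses later) so that the snippet runs on its own, and the sample characters are my own picks.

```python
import unicodedata


def is_control(char):
    """Same logic as above: Cc or Cf, except for \\t, \\n and \\r."""
    if char in ("\t", "\n", "\r"):
        return False
    return unicodedata.category(char) in ("Cc", "Cf")


# NUL and ESC are Cc; zero width space and soft hyphen are Cf.
for ch in ["\x00", "\x1b", "\u200b", "\u00ad", "\t", "a"]:
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} -> {is_control(ch)}")

# Dropping control characters from a string:
visible = "".join(ch for ch in "a\x00b\u200bc\td" if not is_control(ch))
```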

2.1.5 Checking for punctuation

If Unicode is left out of consideration, only the first branch needs to be kept:

def is_punctuation(char):
    """Checks whether `char` is a punctuation character."""
    cp = ord(char)
    # We treat all non-letter/number ASCII as punctuation.
    # Characters such as "^", "$", and "`" are not in the Unicode
    # Punctuation class but we treat them as punctuation anyways, for
    # consistency.
    if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or
            (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)):
        return True
    cat = unicodedata.category(char)
    if cat.startswith("P"):
        return True
    return False
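The two branches cover different characters, which the following sketch tries to show. The function is repeated so that the snippet is self-contained; the sample characters are my own.

```python
import unicodedata


def is_punctuation(char):
    """Same logic as above: ASCII symbol ranges first, then category P*."""
    cp = ord(char)
    if (33 <= cp <= 47) or (58 <= cp <= 64) or \
            (91 <= cp <= 96) or (123 <= cp <= 126):
        return True
    return unicodedata.category(char).startswith("P")


# "$" and "^" are caught by the ASCII branch only (categories Sc and Sk);
# the full-width comma and ideographic full stop by the Unicode branch only.
for ch in ["$", "^", "\uff0c", "\u3002", "a", "1", " ", "中"]:
    print(repr(ch), unicodedata.category(ch), is_punctuation(ch))
```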

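2.2 String processing

2.2.1 Converting text to Unicode

In Python 3, a str is already a sequence of Unicode code points, so only bytes objects need to be decoded. The function below assumes the bytes are UTF-8, which is also the default encoding of str.encode() and bytes.decode().

def convert_to_unicode(text):
    """Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
    if isinstance(text, str):
        return text
    elif isinstance(text, bytes):
        # "ignore" silently drops bytes that are not valid UTF-8.
        return text.decode("utf-8", "ignore")
    else:
        raise ValueError("Unsupported string type: %s" % (type(text)))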
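The "ignore" error handler deserves a second look, because it discards data without any warning. The sketch below repeats the function and contrasts "ignore" with "replace" on a byte string that contains one invalid byte; the sample input is made up.

```python
def convert_to_unicode(text):
    """Same logic as above: pass str through, decode bytes as UTF-8."""
    if isinstance(text, str):
        return text
    if isinstance(text, bytes):
        return text.decode("utf-8", "ignore")
    raise ValueError("Unsupported string type: %s" % (type(text)))


raw = b"ab\xffc"  # 0xFF can never appear in valid UTF-8

ignored = convert_to_unicode(raw)        # invalid byte dropped
replaced = raw.decode("utf-8", "replace")  # invalid byte becomes U+FFFD
round_trip = convert_to_unicode("日报".encode("utf-8"))

print(repr(ignored), repr(replaced), round_trip)
```

With "replace", the damaged spot stays visible as U+FFFD, the replacement character that clean_text in the next section removes.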

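2.2.2 Removing invalid characters from text

Besides removing invalid characters (NUL, the replacement character U+FFFD and control characters), this also converts every whitespace character to a space.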

def clean_text(text):
    """Performs invalid character removal and whitespace cleanup on text."""
    output = []
    for char in text:
        cp = ord(char)
        if cp == 0 or cp == 0xfffd or is_control(char):
            continue
        if is_whitespace(char):
            output.append(" ")
        else:
            output.append(char)
    return "".join(output)
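Here is a self-contained run of the cleanup, as a sketch. The helper functions are repeated in compact form so that the snippet runs on its own, and the sample strings are made up.

```python
import unicodedata


def is_whitespace(char):
    if char in (" ", "\t", "\n", "\r"):
        return True
    return unicodedata.category(char) == "Zs"


def is_control(char):
    if char in ("\t", "\n", "\r"):
        return False
    return unicodedata.category(char) in ("Cc", "Cf")


def clean_text(text):
    """Same logic as above: drop invalid characters, normalize whitespace."""
    output = []
    for char in text:
        cp = ord(char)
        if cp == 0 or cp == 0xFFFD or is_control(char):
            continue
        output.append(" " if is_whitespace(char) else char)
    return "".join(output)


# NUL, U+FFFD and the zero width space vanish; the tab becomes a space.
cleaned = clean_text("a\x00b\tc\ufffd\u200bd")
# Every whitespace character maps to one space, so "\r\n" gives two.
spaced = clean_text("x\u3000y\r\nz")
print(repr(cleaned), repr(spaced))
```

Runs of whitespace are not collapsed at this stage; each whitespace character is replaced one for one.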

2.2.3 Stripping accents from text

def strip_accents(text):
    """Strips accents from a piece of text."""
    text = unicodedata.normalize("NFD", text)
    output = []
    for char in text:
        cat = unicodedata.category(char)
        if cat == "Mn":
            continue
        output.append(char)
    return "".join(output)
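The function works because NFD normalization splits an accented letter into a base letter followed by combining marks of category Mn, which the loop then skips. A small sketch with made-up sample words:

```python
import unicodedata


def strip_accents(text):
    """Same logic as above: decompose with NFD, then drop Mn characters."""
    text = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


# "é" (U+00E9) decomposes into "e" + U+0301 COMBINING ACUTE ACCENT.
decomposed = unicodedata.normalize("NFD", "é")
print([f"U+{ord(ch):04X}" for ch in decomposed])

for word in ["café", "naïve", "Ångström", "日报"]:
    print(word, "->", strip_accents(word))
```

Letters that have no canonical decomposition, such as ø, are left untouched by this approach.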

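2.2.4 Splitting text on punctuation

Sometimes punctuation needs to be split out of a string on its own, so that one input string yields a list of strings. The following function does this: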

def split_on_punc(text):
    """Splits punctuation on a piece of text."""
    chars = list(text)
    i = 0
    start_new_word = True
    output = []
    while i < len(chars):
        char = chars[i]
        if is_punctuation(char):
            output.append([char])
            start_new_word = True
        else:
            if start_new_word:
                output.append([])
            start_new_word = False
            output[-1].append(char)
        i += 1

    return ["".join(x) for x in output]
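A self-contained run, as a sketch: is_punctuation is repeated in compact form, the loop is written with a for statement instead of the index-based while loop, and the sample strings are my own.

```python
import unicodedata


def is_punctuation(char):
    cp = ord(char)
    if (33 <= cp <= 47) or (58 <= cp <= 64) or \
            (91 <= cp <= 96) or (123 <= cp <= 126):
        return True
    return unicodedata.category(char).startswith("P")


def split_on_punc(text):
    """Same behaviour as above: every punctuation mark becomes its own item."""
    output = []
    start_new_word = True
    for char in text:
        if is_punctuation(char):
            output.append([char])
            start_new_word = True
        else:
            if start_new_word:
                output.append([])
            start_new_word = False
            output[-1].append(char)
    return ["".join(x) for x in output]


print(split_on_punc("hello,world!"))
print(split_on_punc("it's"))
print(split_on_punc("你好，世界"))
```

Whitespace is not treated specially here, which is why the tokenizer in the next section splits on whitespace before calling this function.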

2.2.5 Tokenizing text

def tokenize(text, do_lower_case=True):
    """Tokenizes a piece of text."""
    # whitespace_tokenize and tokenize_chinese_chars are not defined in this
    # article and must be supplied separately. The original code read
    # self.do_lower_case; it is a parameter here (default value assumed).
    text = convert_to_unicode(text)
    text = clean_text(text)
    text = tokenize_chinese_chars(text)

    orig_tokens = whitespace_tokenize(text)  # str -> list of str
    split_tokens = []
    for token in orig_tokens:  # each token is a str
        if do_lower_case:
            token = token.lower()
            token = strip_accents(token)
        split_tokens.extend(split_on_punc(token))  # list of str

    output_tokens = whitespace_tokenize(" ".join(split_tokens))  # list of str
    return output_tokens
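The article does not show whitespace_tokenize or tokenize_chinese_chars. The sketch below is one plausible way to write them, guessed from their names and from how tokenize uses them; treat both bodies as assumptions, not as the author's code. The CJK check uses the same ranges as section 2.1.2.

```python
# Same eight CJK blocks as in section 2.1.2.
CJK_RANGES = [
    (0x4E00, 0x9FFF), (0x3400, 0x4DBF),
    (0x20000, 0x2A6DF), (0x2A700, 0x2B73F),
    (0x2B740, 0x2B81F), (0x2B820, 0x2CEAF),
    (0xF900, 0xFAFF), (0x2F800, 0x2FA1F),
]


def is_chinese_char(cp):
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)


def tokenize_chinese_chars(text):
    """Assumed behaviour: put a space on both sides of every CJK character."""
    output = []
    for char in text:
        if is_chinese_char(ord(char)):
            output.extend([" ", char, " "])
        else:
            output.append(char)
    return "".join(output)


def whitespace_tokenize(text):
    """Assumed behaviour: strip the text and split on runs of whitespace."""
    text = text.strip()
    if not text:
        return []
    return text.split()


spaced = tokenize_chinese_chars("It's 日报")
tokens = whitespace_tokenize(spaced)
print(repr(spaced), tokens)
```

Under these assumptions each Chinese character becomes a token of its own, while space-separated words stay whole until split_on_punc runs.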

One small question to think about: why is output_tokens = whitespace_tokenize(" ".join(split_tokens)) run after orig_tokens = whitespace_tokenize(text)? In other words, what is the point of calling whitespace_tokenize twice?
