
[Artificial Intelligence] Knowledge Graph DKN Source Code Explained (3): data_preprogress.py


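import importlib

try:
    # Import the config module by absolute import and fetch its {model_name}Config class.
    # model_name is assumed to be defined earlier in the script.
    config = getattr(importlib.import_module('config'), f"{model_name}Config")
except AttributeError:
    print(f"{model_name} not included!")
    exit()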

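config is now the configuration class defined in the config module; its attributes hold the settings.

Next comes the parsing of behavior.tsv. The file looks like this:
[Figure: sample rows of behavior.tsv]

This is the clicked news of user-1:

[Figure: clicked_news column for user-1]

The file has five columns: 'impression_id', 'user', 'time', 'clicked_news', and 'impressions'. It has no header row, though, so we have to supply the column names ourselves.

This is the parsed file, behavior_parsed.tsv:
[Figure: sample rows of behavior_parsed.tsv]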

import random

import pandas as pd
from tqdm import tqdm


def parse_behaviors(source, target, user2int_path):
    """
    Parse behaviors file in training set.
    Args:
        source: source behaviors file
        target: target behaviors file
        user2int_path: path for saving user2int file
    """
    print(f"Parse {source}")

    behaviors = pd.read_table(
        source,
        header=None,
        names=['impression_id', 'user', 'time', 'clicked_news', 'impressions'])
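    # Fill missing values with a single space. This changes the DataFrame in memory, not the file on disk.
    # The result is assigned back because newer pandas versions may not apply a chained in-place call to behaviors.
    behaviors['clicked_news'] = behaviors['clicked_news'].fillna(' ')
    # Split each impressions string on whitespace into a list; no split limit is given, so every token is split off.
    behaviors.impressions = behaviors.impressions.str.split()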

    user2int = {}  # empty dict that maps each user ID to an integer index
    for row in behaviors.itertuples(index=False):   # iterate over the rows as namedtuples
        if row.user not in user2int:                # this user has not been seen yet
            user2int[row.user] = len(user2int) + 1  # e.g. user2int["U87243"] = 0 + 1; indices start at 1, not 0

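    # The most basic way to build a DataFrame: the data is user2int.items(), an iterable of
    # (user, int) tuples; the index is generated automatically; the columns are 'user' and 'int'.
    # The DataFrame is then written to a file with "\t" (one tab) as the separator, without the index.
    pd.DataFrame(user2int.items(), columns=['user', 'int']).to_csv(
        user2int_path, sep='\t', index=False)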
                                        
    print(  # all users are indexed now; report how many distinct users were found
        f'Please modify `num_users` in `src/config.py` into 1 + {len(user2int)}'
    )
    # .at addresses the cell at row row.Index, column 'user'; each user ID is replaced by its integer index.
    for row in behaviors.itertuples():  
        behaviors.at[row.Index, 'user'] = user2int[row.user]

    # Show a progress bar labelled "Balancing data" while iterating over every row.
    for row in tqdm(behaviors.itertuples(), desc="Balancing data"):
        positive = iter([x for x in row.impressions if x.endswith('1')])
        negative = [x for x in row.impressions if x.endswith('0')]
        random.shuffle(negative)
        negative = iter(negative)
        pairs = []
        try:
            while True:
                pair = [next(positive)]
                for _ in range(config.negative_sampling_ratio):
                    pair.append(next(negative))
                pairs.append(pair)
        except StopIteration:
            pass
        behaviors.at[row.Index, 'impressions'] = pairs

    behaviors = behaviors.explode('impressions').dropna(
        subset=["impressions"]).reset_index(drop=True)
    behaviors[['candidate_news', 'clicked']] = pd.DataFrame(
        behaviors.impressions.map(
            lambda x: (' '.join([e.split('-')[0] for e in x]), ' '.join(
                [e.split('-')[1] for e in x]))).tolist())
    behaviors.to_csv(
        target,
        sep='\t',
        index=False,
        columns=['user', 'clicked_news', 'candidate_news', 'clicked'])
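To make the balancing loop above easier to follow, here is a minimal sketch of the same pairing logic on a toy impressions list, without pandas. The function name balance_impressions, the seed argument, and the sample news IDs are made up for illustration; only the loop itself mirrors the source.

```python
import random


def balance_impressions(impressions, negative_sampling_ratio, seed=None):
    """Hypothetical helper: pair each clicked item with N unclicked items."""
    rng = random.Random(seed)
    # Items ending in '1' were clicked, items ending in '0' were not.
    positive = iter([x for x in impressions if x.endswith('1')])
    negative = [x for x in impressions if x.endswith('0')]
    rng.shuffle(negative)
    negative = iter(negative)
    pairs = []
    try:
        while True:
            pair = [next(positive)]
            for _ in range(negative_sampling_ratio):
                pair.append(next(negative))
            # Only complete groups are kept; a group that runs out of
            # negatives is dropped when StopIteration is raised.
            pairs.append(pair)
    except StopIteration:
        pass
    return pairs


impressions = "N1-1 N2-0 N3-0 N4-1 N5-0 N6-0 N7-0".split()
pairs = balance_impressions(impressions, negative_sampling_ratio=2, seed=0)
for pair in pairs:
    # Same split as in the source: news IDs in one string, labels in another.
    candidate_news = ' '.join(e.split('-')[0] for e in pair)
    clicked = ' '.join(e.split('-')[1] for e in pair)
    print(candidate_news, '|', clicked)
```

Each printed line corresponds to one row of the parsed file after explode(): one clicked news item followed by negative_sampling_ratio unclicked ones, with the clicked column reading "1 0 0".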

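Supplementary notes:

1. The getattr() function

Description
getattr() returns the value of an attribute of an object.
Syntax
getattr(object, name[, default])
Parameters
object – the object.
name – a string naming the attribute.
default – the value returned when the attribute does not exist; if it is omitted and the attribute is missing, AttributeError is raised.
Return value
The value of the attribute.
Example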

>>> class A(object):
...     bar = 1
...
>>> a = A()
>>> getattr(a, 'bar')        # read the attribute bar
1
>>> getattr(a, 'bar2')       # bar2 does not exist, so an exception is raised
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'A' object has no attribute 'bar2'
>>> getattr(a, 'bar2', 3)    # bar2 does not exist, but a default is supplied
3
>>>

Example 2:

>>> class A(object):
...     def set(self, a, b):
...         x = a
...         a = b
...         b = x
...         print(a, b)
...
>>> a = A()
>>> c = getattr(a, 'set')
>>> c(a='1', b='2')
2 1
>>>

The actual usage in the source:

try:
    config = getattr(importlib.import_module('config'), f"{model_name}Config")
except AttributeError:
    print(f"{model_name} not included!")
    exit()

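2. importlib.import_module()

The point of this function is to import a module dynamically, by name.

import importlib

params = importlib.import_module('b.c.c')  # absolute import; if the module sits in the same folder, the 'b.' prefix is not needed
params_ = importlib.import_module('.c.c', package='b')  # relative import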
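Since the package b above is not available here, the following sketch shows the same two import styles on the standard-library json package, combined with getattr() as in the config lookup. The helper load_class is hypothetical and only serves the illustration.

```python
import importlib

# Absolute import: the full dotted name is given.
decoder_abs = importlib.import_module('json.decoder')
# Relative import: the leading dot is resolved against `package`.
decoder_rel = importlib.import_module('.decoder', package='json')
# Both calls return the same module object.
print(decoder_abs is decoder_rel)


def load_class(module_name, class_name, default=None):
    """Hypothetical helper: look up a class by name, or return `default`."""
    module = importlib.import_module(module_name)
    return getattr(module, class_name, default)


print(load_class('json.decoder', 'JSONDecoder'))
print(load_class('json.decoder', 'NoSuchConfig'))
```

Passing a default to getattr() is an alternative to the try/except AttributeError used in the source; which one reads better is a matter of taste.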

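3. f-string formatting

Putting an "f" in front of a string literal is equivalent to "XX{}".format(model_name).
Example:

name = 'Baoyuan'
age = 18
sex = 'male'
msg1 = F'Name: {name}, sex: {sex}, age: {age}'  # an uppercase F works too
msg = f'Name: {name}, sex: {sex}, age: {age}'   # lowercase is recommended
print(msg)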
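As a small check of the equivalence with format(), the sketch below builds the class name used in the config lookup both ways. The value "DKN" for model_name is an assumption for illustration.

```python
model_name = "DKN"  # assumed value, for illustration only

with_fstring = f"{model_name}Config"
with_format = "{}Config".format(model_name)

print(with_fstring)
print(with_fstring == with_format)
```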

4. pd.read_table()

The original code uses header and names together: with header=None the file is read as having no header row, so names must supply the column names.
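A minimal sketch of that combination, reading two made-up rows from memory instead of the real behavior.tsv (the row contents are invented; the column names are the ones from the source):

```python
import io

import pandas as pd

# Two invented tab-separated rows; the second has an empty clicked_news field.
raw = (
    "1\tU1\t11/11/2019 9:05:58 AM\tN1 N2\tN3-1 N4-0\n"
    "2\tU2\t11/12/2019 6:11:30 PM\t\tN5-0 N6-1\n"
)

behaviors = pd.read_table(
    io.StringIO(raw),
    header=None,  # the data starts on the first line
    names=['impression_id', 'user', 'time', 'clicked_news', 'impressions'])

print(behaviors.shape)
print(list(behaviors.columns))
# The empty field is read as NaN, which is why the source calls fillna().
print(behaviors.clicked_news.isna().tolist())
```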

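5. The fillna() function

See the referenced blog post for details.

I only excerpt what is relevant to the original code.
The function fills missing values, i.e. NaN.

inplace: the method can either modify the original object or leave it untouched. Without inplace=True a new object is returned and the original DataFrame stays unchanged; with inplace=True the original DataFrame itself is modified.
value: the value used to fill the gaps.

# 2. Specifying the inplace parameter

print(df1.fillna(0, inplace=True))  # prints the return value of the in-place call (None in pandas 1.x and 2.x)
print("-------------------------")
print(df1)

This fills with 0; the source code fills with a single space " " instead.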
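The difference between the two modes can be seen in a short sketch. The DataFrame df1 and its values are invented for illustration; the fill value is the single space used in the source.

```python
import pandas as pd

df1 = pd.DataFrame({"clicked_news": ["N1 N2", None, "N3"]})

# Without inplace a new DataFrame is returned and df1 keeps its gap.
filled = df1.fillna(" ")
missing_after_copy = int(df1["clicked_news"].isna().sum())
print(missing_after_copy)

# With inplace=True df1 itself is modified.
df1.fillna(" ", inplace=True)
print(df1["clicked_news"].tolist())

# Assigning the result back to a single column has the same effect
# and avoids calling an in-place method on a column selection.
df1["clicked_news"] = df1["clicked_news"].fillna(" ")
```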

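6. The split() method

Description:
str.split() slices a string at the given separator; if maxsplit is given, at most maxsplit splits are made, producing at most maxsplit + 1 substrings.
Syntax:

The syntax of split() is:

str.split(sep=None, maxsplit=-1)

Parameters:
sep – the separator. By default any run of whitespace, including spaces, newlines (\n) and tabs (\t).
maxsplit – the maximum number of splits. The default, -1, splits at every separator (counting from the first separator on the left).
Return value
A list of the resulting substrings.
Example

>>> s = "Line1-abcdef \nLine2-abc \nLine4-abcd"
>>> print(s.split())        # no separator given: split on any run of whitespace, including \n
['Line1-abcdef', 'Line2-abc', 'Line4-abcd']
>>> print(s.split(' ', 1))  # split on a single space, once, giving two parts
['Line1-abcdef', '\nLine2-abc \nLine4-abcd']

Example 2
The following example uses # as the separator and passes 1 as the second argument, so a list of two elements is returned.

txt = "Google#Runoob#Taobao#Facebook"

# the second argument is 1, so the result has two elements
x = txt.split("#", 1)

print(x)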
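Applied to the data in this post, split() is used twice: once on whitespace to separate the impressions, and once on '-' to separate a news ID from its click label. A minimal sketch with an invented impressions string:

```python
impressions = "N3-1 N4-0 N5-0"  # invented sample in the behavior.tsv format

# Split on whitespace, as behaviors.impressions.str.split() does per row.
tokens = impressions.split()
print(tokens)

# Split one token into the news ID and the click label.
news_id, label = tokens[0].split('-')
print(news_id, label)

# With a split limit of 1, only the first separator is used.
limited = "a-b-c".split('-', 1)
print(limited)
```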

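7. iteritems(), iterrows(), itertuples()

itertuples()

Original blog post
itertuples() walks the DataFrame row by row and yields each row as a namedtuple (not as a small DataFrame).
Fields are read as attributes, e.g. row.user, or with getattr() when the column name is held in a variable.

Example:
Suppose the original test DataFrame looks like this:

When we run itertuples():

See the link for the rest; I do not need it for now.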
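Because the screenshots of the example are missing, here is a minimal sketch of what itertuples() yields. The DataFrame test and its contents are invented for illustration.

```python
import pandas as pd

test = pd.DataFrame({"user": ["U1", "U2"], "clicked_news": ["N1 N2", "N3"]})

# By default each namedtuple starts with the row label in the field Index.
for row in test.itertuples():
    print(row.Index, row.user, getattr(row, "clicked_news"))

# With index=False the Index field is left out.
rows = list(test.itertuples(index=False))
print(rows[0])
```

This is why the source can write row.user in the first loop (index=False) and row.Index in the later loops (default index=True).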

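8. pd.DataFrame()

This is how a DataFrame is created.
A DataFrame is a two-dimensional labelled data structure whose columns may have different types.[1]
It usually carries an index (row labels) and columns (column labels), which are independent of each other.
Intuitively, a DataFrame is a two-dimensional array with labels.
So the question is how to create one:

The most basic way to create a DataFrame

Original blog post

The index argument of DataFrame() sets the row labels; if it is not given, rows are numbered from 0. The columns argument sets the column labels; if it is not given, columns are likewise numbered from 0.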

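import pandas as pd

data = {
        'gender': ['male', 'female', 'female', 'male', 'male'],
        'name': ['Xiao Ming', 'Xiao Hong', 'Xiao Fang', 'Da Hei', 'Zhang San'],
        'age': [20, 21, 25, 24, 29]}
df = pd.DataFrame(data, index=['one', 'two', 'three', 'four', 'five'],
               columns=['name', 'gender', 'age', 'occupation'])
df

df.values returns an ndarray

ndarray is NumPy's N-dimensional array type. Converting a DataFrame to an ndarray is often convenient: slicing a DataFrame takes the form df.iloc[:, 1:3], whereas an array can be sliced directly as X[:, 1:3].

X = df.values
print(type(X))  # show the type
X

Output:

<class 'numpy.ndarray'>
[['Xiao Ming' 'male' 20 nan]
 ['Xiao Hong' 'female' 21 nan]
 ['Xiao Fang' 'female' 25 nan]
 ['Da Hei' 'male' 24 nan]
 ['Zhang San' 'male' 29 nan]]

df.iloc[row position, column position] returns an element by position

df.iloc[1, 1]

'female'

df.index returns the row labels

df.index

Index(['one', 'two', 'three', 'four', 'five'], dtype='object')

df.at[index, columns]

Purpose: get the value at one position by label, e.g. the row labelled 0 and the column labelled 'a', i.e. index=0, columns='a':
data = df.at[0, 'a']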
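df.at can also be used on the left-hand side of an assignment, which is how the source replaces user IDs by their integer index. A minimal sketch with invented data; the column is created with dtype=object here (an assumption of this sketch) so that strings and integers can share it:

```python
import pandas as pd

df = pd.DataFrame(
    {"user": pd.Series(["U87243", "U598644"], dtype=object)})
user2int = {"U87243": 1, "U598644": 2}

# Read one cell by row label and column label.
print(df.at[0, "user"])

# Write one cell at a time, as the source does inside its itertuples() loop.
for row in df.itertuples():
    df.at[row.Index, "user"] = user2int[row.user]

print(df["user"].tolist())
```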

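9. items()

Description

The items() method of a Python dictionary returns an iterable view of the dictionary's (key, value) pairs.

Syntax

dict.items()

Parameters

None

Return value

An iterable view object (dict_items) of (key, value) tuples. In Python 3 this is not a list; wrap it in list() if a list is needed.

Example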

sites = {'Google': 'www.google.com', 'Runoob': 'www.runoob.com', 'taobao': 'www.taobao.com'}

print("Dictionary items : %s" % list(sites.items()))

# iterate over the (key, value) pairs
for key, value in sites.items():
    print(key, value)

Dictionary items : [('Google', 'www.google.com'), ('Runoob', 'www.runoob.com'), ('taobao', 'www.taobao.com')]
Google www.google.com
Runoob www.runoob.com
taobao www.taobao.com
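The same method is what feeds pd.DataFrame() in parse_behaviors. The sketch below rebuilds the user2int dictionary from a few invented user IDs and looks at what items() returns:

```python
# Invented user IDs; the third one repeats the first.
users = ["U87243", "U598644", "U87243"]

user2int = {}
for user in users:
    if user not in user2int:
        user2int[user] = len(user2int) + 1  # indices start at 1

items = user2int.items()
print(type(items).__name__)
print(list(items))

# The view is live: it reflects later changes to the dictionary.
user2int["U1"] = 3
print(len(items))
```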

10. .to_csv()

Original blog post

1. First, check the current working directory:

import os
os.getcwd()  # get the current working directory

2. to_csv() is a method of the DataFrame class, whereas read_csv() is a pandas function

dt.to_csv()  # dt is assumed to be a DataFrame instance; the parameters are explained below

Path, path_or_buf: A string path to the file to write or a StringIO

dt.to_csv('Result.csv')  # relative path: saved under the directory returned by getcwd()
dt.to_csv('C:/Users/think/Desktop/Result.csv')  # absolute path

Separator, sep: Field delimiter for the output file (default ",")

dt.to_csv('C:/Users/think/Desktop/Result.csv', sep='?')  # separate the fields with ?; the default is ,

Missing-value representation, na_rep: A string representation of a missing value (default '')

dt.to_csv('C:/Users/think/Desktop/Result1.csv', na_rep='NA')  # write missing values as NA; the default is an empty string

Keep the row index, index: whether to write row (index) names (default True)

dt.to_csv('C:/Users/think/Desktop/Result1.csv', index=False)  # do not write the row index

Keep the column names, header: Whether to write out the column names (default True)

dt.to_csv('C:/Users/think/Desktop/Result.csv', header=False)  # do not write the column names

Select columns, columns: Columns to write (default None)

dt.to_csv('C:/Users/think/Desktop/Result.csv', columns=['name'])  # write the index column and the name column
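The combination used in the source (sep='\t', index=False, columns=[...]) can be tried without touching the disk by writing into a StringIO buffer. The DataFrame dt below is invented; the extra column only exists to show that columns filters the output.

```python
import io

import pandas as pd

dt = pd.DataFrame({
    "user": ["U87243", "U598644"],
    "int": [1, 2],
    "extra": ["x", "y"],  # not listed in `columns`, so it is not written
})

buf = io.StringIO()
dt.to_csv(buf, sep="\t", index=False, columns=["user", "int"])

# One header line followed by one line per row, fields separated by tabs.
for line in buf.getvalue().splitlines():
    print(line.split("\t"))
```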

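11. tqdm(iterator)

tqdm is a fast, extensible progress bar for Python. It adds a progress indicator to long-running loops; all you have to do is wrap any iterator: tqdm(iterator).
It can be installed with pip.

Parameters:

iterable=None,      the object to iterate over; the bar advances once per iteration
desc=None,          a str used as the title (description) of the progress bar
total=None,         the expected number of iterations
leave=True,
file=None,
ncols=None,         the total width of the progress bar
mininterval=0.1,    the minimum update interval
maxinterval=10.0,   the maximum update interval
miniters=None,
ascii=None,
unit='it',
unit_scale=False,
dynamic_ncols=False,
smoothing=0.3,
bar_format=None,
initial=0,
position=None,
postfix             extra details passed as a dict, e.g. speed = 10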

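Example 1:

from tqdm import tqdm

for i in tqdm(range(10000)):
    # some work
    pass

[Figure: progress bar output]

Example 2:

from tqdm import tqdm

info = {"a": 123, "b": 456}  # renamed from dict to avoid shadowing the built-in
for i in tqdm(range(10), total=10, desc="WSX", ncols=100, postfix=info, mininterval=0.3):
    pass

[Figure: progress bar output with description and postfix]

Example 3: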

from tqdm import trange
from random import random, randint
from time import sleep
with trange(100) as t:
    for i in t:
        # Description will be displayed on the left
        t.set_description('Download speed %i' % i)
        # Postfix will be displayed on the right,
        # formatted automatically based on argument's datatype
        t.set_postfix(loss=random(), gen=randint(1,999), str='details',
                     lst=[1, 2])
        sleep(0.1)