1. Analyzing COCO annotation data
Use pycocotools to read the annotation file. The COCO class is defined in cocoapi/PythonAPI/pycocotools/coco.py.
from pycocotools.coco import COCO
import pandas as pd
ann_file = './xxx.json'
annotation = COCO(annotation_file=ann_file)
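annotation.dataset  # the raw JSON content as a dict
annotation.anns     # annotation id -> annotation dict
annotation.cats     # category id -> category dict
annotation.imgs     # image id -> image dict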
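To make the layout of these attributes concrete, here is a minimal sketch that rebuilds the same kind of id-keyed lookups with only the standard library. The JSON content is made-up toy data, and the variable names are chosen for illustration; it mirrors how the attributes are organized rather than reproducing the pycocotools code.

```python
import json

# Hypothetical minimal COCO-style annotation content (toy data).
raw = json.loads("""
{
  "images": [{"id": 1, "file_name": "a.jpg", "width": 640, "height": 480}],
  "categories": [{"id": 0, "name": "cat"}, {"id": 1, "name": "dog"}],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 0, "bbox": [10, 20, 100, 50], "area": 5000, "iscrowd": 0},
    {"id": 11, "image_id": 1, "category_id": 1, "bbox": [30, 40, 20, 20], "area": 400, "iscrowd": 0}
  ]
}
""")

# Lookups keyed by the "id" field, the same shape as anns / cats / imgs.
anns = {ann["id"]: ann for ann in raw["annotations"]}
cats = {cat["id"]: cat for cat in raw["categories"]}
imgs = {img["id"]: img for img in raw["images"]}

print(sorted(anns))                                 # annotation ids
print(cats[anns[11]["category_id"]]["name"])        # category of annotation 11
print(imgs[anns[11]["image_id"]]["file_name"])      # image that annotation 11 belongs to
```

Because every annotation carries image_id and category_id, the three dicts can be joined by plain key lookups.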
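2. Analyzing the data as a DataFrame
Convert the annotations to a DataFrame and use head() to preview sample rows. It shows 5 rows by default, and you can pass the number of rows to display.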
ann_label = pd.DataFrame(annotation.anns.values())
ann_label.head()
ann_label.head(1)
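The conversion can be tried without a real annotation file. The sketch below assumes a toy dict standing in for annotation.anns (the field values are invented) and shows that each annotation dict becomes one row.

```python
import pandas as pd

# Toy stand-in for annotation.anns: annotation id -> annotation dict (invented values).
anns = {
    i: {"id": i, "image_id": i // 2, "category_id": i % 3, "area": 100.0 * (i + 1)}
    for i in range(6)
}

# Each annotation dict becomes one row; the dict keys become the columns.
df = pd.DataFrame(list(anns.values()))

print(df.columns.tolist())
print(len(df.head()))    # default: first 5 rows
print(len(df.head(1)))   # only the first row
```

Wrapping the values in list() is optional here; it only makes the input an explicit list of records.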
isin() checks whether each element is in a given list, which makes it useful for keeping only the rows of specific categories.
ann_label = ann_label[ann_label['category_id'].isin([0,1,2,3])]
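As a small illustration on invented data, isin() returns a boolean mask, and indexing with that mask keeps only the matching rows:

```python
import pandas as pd

# Toy data: five annotations with made-up category ids.
df = pd.DataFrame({"id": [0, 1, 2, 3, 4], "category_id": [0, 5, 2, 7, 3]})

# True where category_id is one of the wanted categories.
mask = df["category_id"].isin([0, 1, 2, 3])
filtered = df[mask]

print(mask.tolist())
print(filtered["category_id"].tolist())
print(filtered.index.tolist())  # the original row labels are kept
```

Note that the filtered frame keeps the original row labels, so its index is no longer consecutive.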
hist() plots a histogram that shows how the values of a column are distributed.
ann_label['area'].hist()
describe() reports summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum).
ann_label['area'].describe()
count 2717.000000
mean 756703.840887
std 4253.211573
min 10450.000000
25% 75439.000000
50% 77824.000000
75% 88767.000000
max 145141.000000
Name: area, dtype: float64
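The result of describe() is itself a Series, so individual statistics can be read by label. A minimal sketch on invented numbers, including a quick consistency check that the mean lies between the minimum and the maximum:

```python
import pandas as pd

# Toy area values (invented).
area = pd.Series([10.0, 20.0, 30.0, 40.0], name="area")
stats = area.describe()

print(stats.index.tolist())          # labels of the reported statistics
print(stats["mean"], stats["50%"])   # mean and median

# Sanity check that holds for any non-empty numeric column.
assert stats["min"] <= stats["mean"] <= stats["max"]
```

Running such a check on real data is a cheap way to catch copy or parsing mistakes in reported statistics.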
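value_counts() counts how often each distinct value occurs, sorted from most to least frequent. Replace 'xxx' with a column name; the sample output below is for the height column.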
ann_label['xxx'].value_counts()
81 108
83 93
79 91
82 88
84 87
...
170 1
168 1
145 1
138 1
131 1
Name: height, Length: 107, dtype: int64
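A small sketch on invented heights shows how the result is organized: the index holds the distinct values and the data holds their counts, most frequent first.

```python
import pandas as pd

# Toy heights (invented); 81 appears three times, 83 twice, 79 once.
height = pd.Series([81, 83, 81, 79, 81, 83], name="height")
counts = height.value_counts()

print(counts.index.tolist())  # distinct values, most frequent first
print(counts.tolist())        # how often each value occurs
```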
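reset_index() resets the index to 0..n-1; see the pandas.DataFrame.reset_index documentation for details. This matters after filtering, because the split further down returns positional indices that are used with loc and therefore must match the index labels.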
ann_label.reset_index(drop=True, inplace=True)
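The effect is easiest to see on a filtered frame. In this sketch (toy data), the filter leaves gaps in the index, and reset_index(drop=True) renumbers the rows without keeping the old labels as a column:

```python
import pandas as pd

# Toy data with made-up category ids.
df = pd.DataFrame({"category_id": [0, 5, 2, 7, 3]})
df = df[df["category_id"].isin([0, 1, 2, 3])]
print(df.index.tolist())  # [0, 2, 4]

# drop=True discards the old labels instead of adding them as an "index" column.
df.reset_index(drop=True, inplace=True)
print(df.index.tolist())  # [0, 1, 2]
```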
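StratifiedKFold splits the data by stratified sampling, so the class proportions in each validation fold stay close to those of the full dataset. For this reason, the split must be given the label column to stratify on.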
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(
n_splits=5, shuffle=True, random_state=2022
)
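for fold, (train_idx, val_idx) in enumerate(skf.split(ann_label["image_id"], ann_label["grading"])):
    # Record the fold in which each row is used for validation
    ann_label.loc[val_idx, 'fold'] = fold
# Use fold 0 for validation and the remaining folds for training
tra_df = ann_label[ann_label['fold'] != 0]
val_df = ann_label[ann_label['fold'] == 0]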
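To check that the folds really preserve the class ratio, the same loop can be run on toy data. This is a sketch under stated assumptions: the frame below is invented, with 15 rows of class 0 and 5 rows of class 1 in a column named grading as in the code above.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Toy data: 20 rows with a 3:1 class ratio in "grading" (invented values).
df = pd.DataFrame({
    "image_id": range(20),
    "grading": [0] * 15 + [1] * 5,
})

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=2022)
for fold, (train_idx, val_idx) in enumerate(skf.split(df["image_id"], df["grading"])):
    # val_idx holds positional indices; they match the labels because the index is 0..n-1.
    df.loc[val_idx, "fold"] = fold

# Rows: folds, columns: classes. Each fold should hold 3 rows of class 0 and 1 of class 1.
table = pd.crosstab(df["fold"], df["grading"])
print(table)
```

With class sizes that divide evenly by n_splits, every fold gets exactly the same class counts; with uneven sizes the counts differ by at most one per class.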
3. Saving and loading pickle (.pkl) files
Once processing is finished, use to_pickle to save the DataFrame to a file for later reuse.
ann_label.to_pickle('./ann.pkl')
Use read_pickle to load it back directly.
ann_label = pd.read_pickle('./ann.pkl')
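One practical reason to prefer pickle for this kind of table is that it keeps dtypes and Python objects such as bbox lists intact. The round trip below is a sketch on invented data and writes to a temporary directory, so it does not touch ./ann.pkl.

```python
import os
import tempfile

import pandas as pd

# Toy frame with a list-valued column, similar to a COCO bbox field (invented values).
df = pd.DataFrame({
    "image_id": [1, 2],
    "bbox": [[0, 0, 10, 10], [5, 5, 20, 20]],
    "fold": [0.0, 1.0],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ann.pkl")
    df.to_pickle(path)
    restored = pd.read_pickle(path)

print(restored.dtypes.tolist() == df.dtypes.tolist())  # column dtypes survive the round trip
print(type(restored.loc[0, "bbox"]))                   # still a Python list, not a string
```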