探索性数据分析
import numpy as np
import pandas as pd
了解你的数据吗?
利用Python进行数据分析
任务一:利用Pandas对示例数据进行排序,要求升序
arr = np.arange(8).reshape((2, 4))
arr
df = pd.DataFrame([[8, 'JackYang'],
[2, 'RocZhang'],
[8, 'Lily'],
[10, 'Tom']],
columns=['b', 'a'],
index=['three', 'one', 'two', 'one'])
df
| b | a |
---|
three | 8 | JackYang |
---|
one | 2 | RocZhang |
---|
two | 8 | Lily |
---|
one | 10 | Tom |
---|
【问题】:大多数时候我们都是想根据列的值来排序,所以将你构建的DataFrame中的数据根据某一列,升序排列
当对DataFrame排序时,你可以使用一列或者多列作为排序键。为了实现这个功能,传递一个或多个给sort_values的可选参数 by。
df.sort_values(by='b')
| b | a |
---|
one | 2 | RocZhang |
---|
three | 5 | JackYang |
---|
two | 8 | Lily |
---|
one | 10 | Tom |
---|
最基础的排序是你可以根据行(index)或者列(columns),使用 sort_index()进行排序。
df.sort_index()
| b | a |
---|
one | 2 | RocZhang |
---|
one | 10 | Tom |
---|
three | 5 | JackYang |
---|
two | 8 | Lily |
---|
df.sort_index(axis=1)
| a | b |
---|
three | JackYang | 5 |
---|
one | RocZhang | 2 |
---|
two | Lily | 8 |
---|
one | Tom | 10 |
---|
df.sort_index(axis=1, ascending=False)
| b | a |
---|
three | 5 | JackYang |
---|
one | 2 | RocZhang |
---|
two | 8 | Lily |
---|
one | 10 | Tom |
---|
df.sort_values(by=['b', 'a'], ascending=False)
| b | a |
---|
one | 10 | Tom |
---|
two | 8 | Lily |
---|
three | 8 | JackYang |
---|
one | 2 | RocZhang |
---|
任务二:对泰坦尼克号数据(trian.csv)按票价和年龄两列进行综合排序(降序排列),从这个数据中你可以分析出什么?
path = r'data\train.csv'
data = pd.read_csv(path)
data.head()
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
---|
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
---|
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
---|
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
---|
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
---|
data.sort_values(by=['Fare', 'Age'], ascending=False).head(20)
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|
679 | 680 | 1 | 1 | Cardeza, Mr. Thomas Drake Martinez | male | 36.0 | 0 | 1 | PC 17755 | 512.3292 | B51 B53 B55 | C |
---|
258 | 259 | 1 | 1 | Ward, Miss. Anna | female | 35.0 | 0 | 0 | PC 17755 | 512.3292 | NaN | C |
---|
737 | 738 | 1 | 1 | Lesurer, Mr. Gustave J | male | 35.0 | 0 | 0 | PC 17755 | 512.3292 | B101 | C |
---|
438 | 439 | 0 | 1 | Fortune, Mr. Mark | male | 64.0 | 1 | 4 | 19950 | 263.0000 | C23 C25 C27 | S |
---|
341 | 342 | 1 | 1 | Fortune, Miss. Alice Elizabeth | female | 24.0 | 3 | 2 | 19950 | 263.0000 | C23 C25 C27 | S |
---|
88 | 89 | 1 | 1 | Fortune, Miss. Mabel Helen | female | 23.0 | 3 | 2 | 19950 | 263.0000 | C23 C25 C27 | S |
---|
27 | 28 | 0 | 1 | Fortune, Mr. Charles Alexander | male | 19.0 | 3 | 2 | 19950 | 263.0000 | C23 C25 C27 | S |
---|
742 | 743 | 1 | 1 | Ryerson, Miss. Susan Parker "Suzette" | female | 21.0 | 2 | 2 | PC 17608 | 262.3750 | B57 B59 B63 B66 | C |
---|
311 | 312 | 1 | 1 | Ryerson, Miss. Emily Borie | female | 18.0 | 2 | 2 | PC 17608 | 262.3750 | B57 B59 B63 B66 | C |
---|
299 | 300 | 1 | 1 | Baxter, Mrs. James (Helene DeLaudeniere Chaput) | female | 50.0 | 0 | 1 | PC 17558 | 247.5208 | B58 B60 | C |
---|
118 | 119 | 0 | 1 | Baxter, Mr. Quigg Edmond | male | 24.0 | 0 | 1 | PC 17558 | 247.5208 | B58 B60 | C |
---|
380 | 381 | 1 | 1 | Bidois, Miss. Rosalie | female | 42.0 | 0 | 0 | PC 17757 | 227.5250 | NaN | C |
---|
716 | 717 | 1 | 1 | Endres, Miss. Caroline Louise | female | 38.0 | 0 | 0 | PC 17757 | 227.5250 | C45 | C |
---|
700 | 701 | 1 | 1 | Astor, Mrs. John Jacob (Madeleine Talmadge Force) | female | 18.0 | 1 | 0 | PC 17757 | 227.5250 | C62 C64 | C |
---|
557 | 558 | 0 | 1 | Robbins, Mr. Victor | male | NaN | 0 | 0 | PC 17757 | 227.5250 | NaN | C |
---|
527 | 528 | 0 | 1 | Farthing, Mr. John | male | NaN | 0 | 0 | PC 17483 | 221.7792 | C95 | S |
---|
377 | 378 | 0 | 1 | Widener, Mr. Harry Elkins | male | 27.0 | 0 | 2 | 113503 | 211.5000 | C82 | C |
---|
779 | 780 | 1 | 1 | Robert, Mrs. Edward Scott (Elisabeth Walton Mc... | female | 43.0 | 0 | 1 | 24160 | 211.3375 | B3 | S |
---|
730 | 731 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0 | 0 | 0 | 24160 | 211.3375 | B5 | S |
---|
689 | 690 | 1 | 1 | Madill, Miss. Georgette Alexandra | female | 15.0 | 0 | 1 | 24160 | 211.3375 | B5 | S |
---|
【思考】排序后,如果我们仅仅关注年龄和票价两列。根据常识我知道发现票价越高的应该客舱越好,所以我们会明显看出,票价前20的乘客中存活的有14人,这是相当高的一个比例,那么我们后面是不是可以进一步分析一下票价和存活之间的关系,年龄和存活之间的关系呢?当你开始发现数据之间的关系了,数据分析就开始了。
当然,这只是我的想法,你还可以有更多想法,欢迎写在你的学习笔记中。
任务三:利用Pandas进行算术计算,计算两个DataFrame数据相加结果
frame1_a = pd.DataFrame(np.arange(9.).reshape(3, 3),
columns=['a', 'b', 'c'],
index=['one', 'two', 'three'])
frame1_b = pd.DataFrame(np.arange(12.).reshape(4, 3),
columns=['a', 'e', 'c'],
index=['first', 'one', 'two', 'second'])
frame1_a
| a | b | c |
---|
one | 0.0 | 1.0 | 2.0 |
---|
two | 3.0 | 4.0 | 5.0 |
---|
three | 6.0 | 7.0 | 8.0 |
---|
frame1_b
| a | e | c |
---|
first | 0.0 | 1.0 | 2.0 |
---|
one | 3.0 | 4.0 | 5.0 |
---|
two | 6.0 | 7.0 | 8.0 |
---|
second | 9.0 | 10.0 | 11.0 |
---|
frame1_a + frame1_b
| a | b | c | e |
---|
first | NaN | NaN | NaN | NaN |
---|
one | 3.0 | NaN | 7.0 | NaN |
---|
second | NaN | NaN | NaN | NaN |
---|
three | NaN | NaN | NaN | NaN |
---|
two | 9.0 | NaN | 13.0 | NaN |
---|
注:两个DataFrame相加后,会返回一个新的DataFrame,对应的行和列的值会相加,没有对应的会变成空值NaN。
pd.concat([frame1_a, frame1_b], axis=1)
| a | b | c | a | e | c |
---|
one | 0.0 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
---|
two | 3.0 | 4.0 | 5.0 | 6.0 | 7.0 | 8.0 |
---|
three | 6.0 | 7.0 | 8.0 | NaN | NaN | NaN |
---|
first | NaN | NaN | NaN | 0.0 | 1.0 | 2.0 |
---|
second | NaN | NaN | NaN | 9.0 | 10.0 | 11.0 |
---|
pd.concat([frame1_a, frame1_b], axis=0)
| a | b | c | e |
---|
one | 0.0 | 1.0 | 2.0 | NaN |
---|
two | 3.0 | 4.0 | 5.0 | NaN |
---|
three | 6.0 | 7.0 | 8.0 | NaN |
---|
first | 0.0 | NaN | 2.0 | 1.0 |
---|
one | 3.0 | NaN | 5.0 | 4.0 |
---|
two | 6.0 | NaN | 8.0 | 7.0 |
---|
second | 9.0 | NaN | 11.0 | 10.0 |
---|
任务四:通过泰坦尼克号数据如何计算出在船上最大的家族有多少人?
【提醒】我们只需找出”兄弟姐妹个数“和”父母子女个数“之和最大的数,当然你还可以想出很多方法和思考角度,欢迎你来说出你的看法。
data = pd.read_csv(r'data\train.csv')
data
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
---|
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
---|
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
---|
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
---|
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
---|
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
---|
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
---|
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
---|
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
---|
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
---|
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
---|
891 rows × 12 columns
max(data['SibSp'] + data['Parch'])
10
任务五:学会使用Pandas describe()函数查看数据基本统计信息
data.describe()
| PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
---|
count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
---|
mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
---|
std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
---|
min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
---|
25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
---|
50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
---|
75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
---|
max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
---|
举例:
frame2 = pd.DataFrame([[1.4, np.nan],
[7.1, -4.5],
[np.nan, np.nan],
[0.75, -1.3]
], index=['a', 'b', 'c', 'd'], columns=['one', 'two'])
frame2
| one | two |
---|
a | 1.40 | NaN |
---|
b | 7.10 | -4.5 |
---|
c | NaN | NaN |
---|
d | 0.75 | -1.3 |
---|
调用 describe 函数,观察frame2的数据基本信息
frame2.describe()
| one | two |
---|
count | 3.000000 | 2.000000 |
---|
mean | 3.083333 | -2.900000 |
---|
std | 3.493685 | 2.262742 |
---|
min | 0.750000 | -4.500000 |
---|
25% | 1.075000 | -3.700000 |
---|
50% | 1.400000 | -2.900000 |
---|
75% | 4.250000 | -2.100000 |
---|
max | 7.100000 | -1.300000 |
---|
任务六:分别看看泰坦尼克号数据集中 票价、父母子女 这列数据的基本统计数据,你能发现什么?
data['Fare'].describe()
count 891.000000
mean 32.204208
std 49.693429
min 0.000000
25% 7.910400
50% 14.454200
75% 31.000000
max 512.329200
Name: Fare, dtype: float64
data['Parch'].describe()
count 891.000000
mean 0.381594
std 0.806057
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 6.000000
Name: Parch, dtype: float64
data['SibSp'].describe()
count 891.000000
mean 0.523008
std 1.102743
min 0.000000
25% 0.000000
50% 0.000000
75% 1.000000
max 8.000000
Name: SibSp, dtype: float64
总结:该教程很好,有很多知识点我都不太会,需要查找资料才可以解决,也学到了很多,以后的几天我会继续坚持,把数据分析知识点过一遍,非常感谢DataWhale。
[完]
参考
DataWhale开源社区
|