数据分析与可视化 上机实践3(Pandas 数据预处理)


一、实践目的

1.掌握数据预处理的思想和方法。
2.进一步熟悉Pandas库的使用方法。


二、数据集介绍

Lending Club是一家美国P2P公司,投资者在平台中为融资方提供资金,并以此获得贷款利率收益。Lending Club为投资者与融资方提供平台。本实验使用Lending Club2015年公开数据集中截取的一小部分进行分析。数据分为三个部分:
(1)用户信息数据

img-202304301433917

(2)用户历史数据

img-202304301433211

(3)贷款交易数据

img-202304301433303

Lending Club2015年公开数据集下载


三、实践内容要求

(1)导入三个数据表到DataFrame中;

(2)查看各数据表的前5行;

(3)查看数据基本情况,并对基本情况进行分析;

(4)利用“用户编号”这个公共变量,将用户信息数据和用户历史数据合并在一起,对比inner、outer、left和right连接的区别;

(5)将用户历史数据按账户平均存款排序;

(6)基于贷款交易数据,进行左连接,将三个数据合并在一起;

(7)删去用于合并的列名user_id,使用info查看数据缺失情况,删除缺失数据;

(8)将年收入进行离散化,分为高中低三档;

(9)贷款状态为”Charged Off”,”In Grace Period”, “Late (31-120 days)”的贷款有违约风险,视为不良贷款,将其值标记为1,其他贷款标记为0;

(10)将term、grade、emp_length、annual_inc、home_ownership、verification_status这些分类变量进行哑变量处理。


四、完成情况

1
2
3
4
5
6
7
8
9
10
11
12
#(1)导入三个数据表到DataFrame中;

import pandas as pd

pd.set_option('display.unicode.east_asian_width', True)
user = pd.read_csv('user.csv',sep = ',') # 读取文件
history = pd.read_csv('history.csv',sep = ',')
loan = pd.read_csv('loan.csv',sep = ',')

#(2)查看各数据表的前5行;

user.head() # 默认输出前5行数据
img-202304301438980
1
history.head()
img-202304301438592
1
loan.head()
img-202304301438890
1
2
3
4
5
#(3)查看数据基本情况,并对基本情况进行分析;

user.info()
history.info()
loan.info()
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1707 entries, 0 to 1706
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 user_id 1707 non-null int64
1 emp_length 1633 non-null object
2 home_ownership 1707 non-null object
3 annual_inc 1707 non-null float64
4 verification_status 1707 non-null object
dtypes: float64(1), int64(1), object(3)
memory usage: 66.8+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1648 entries, 0 to 1647
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 user 1648 non-null int64
1 acc_open_past_24mths 1648 non-null int64
2 avg_cur_bal 1646 non-null float64
dtypes: float64(1), int64(2)
memory usage: 38.8 KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1269 entries, 0 to 1268
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 user 1269 non-null int64
1 term 1269 non-null object
2 int_rate 1269 non-null object
3 grade 1269 non-null object
4 loan_status 1269 non-null object
dtypes: int64(1), object(4)
memory usage: 49.7+ KB

1
2
3
4
#(4)利用“用户编号”这个公共变量,将用户信息数据和用户历史数据合并在一起,对比inner、outer、left和right连接的区别;

a1 = pd.merge(user, history, left_on='user_id', right_on='user')
print(a1)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
      user_id emp_length home_ownership  annual_inc verification_status  user  \
0 1 10+ years RENT 78000.0 Source Verified 1
1 2 8 years MORTGAGE 58000.0 Not Verified 2
2 3 10+ years MORTGAGE 125000.0 Verified 3
3 4 10+ years RENT 69000.0 Source Verified 4
4 5 < 1 year RENT 50000.0 Source Verified 5
... ... ... ... ... ... ...
1643 1644 10+ years OWN 72000.0 Verified 1644
1644 1645 7 years MORTGAGE 62000.0 Source Verified 1645
1645 1646 4 years RENT 50000.0 Source Verified 1646
1646 1647 1 year RENT 45000.0 Not Verified 1647
1647 1648 2 years OWN 35000.0 Not Verified 1648

acc_open_past_24mths avg_cur_bal
0 5 29828.0
1 7 9536.0
2 2 32727.0
3 8 3214.0
4 6 5857.0
... ... ...
1643 3 1676.0
1644 4 21360.0
1645 1 3180.0
1646 8 3652.0
1647 3 1586.0

[1648 rows x 8 columns]

1
2
3
4
#(5)将用户历史数据按账户平均存款排序;

a2 = history.sort_values(by = 'avg_cur_bal')
a2.head()
img-202304301439203
1
2
3
4
#(6)基于贷款交易数据,进行左连接,将三个数据合并在一起;

a3 = pd.merge(a1, loan, how='left') # 将三表进行左连接
print(a3)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
      user_id emp_length home_ownership  annual_inc verification_status  user  \
0 1 10+ years RENT 78000.0 Source Verified 1
1 2 8 years MORTGAGE 58000.0 Not Verified 2
2 3 10+ years MORTGAGE 125000.0 Verified 3
3 4 10+ years RENT 69000.0 Source Verified 4
4 5 < 1 year RENT 50000.0 Source Verified 5
... ... ... ... ... ... ...
1643 1644 10+ years OWN 72000.0 Verified 1644
1644 1645 7 years MORTGAGE 62000.0 Source Verified 1645
1645 1646 4 years RENT 50000.0 Source Verified 1646
1646 1647 1 year RENT 45000.0 Not Verified 1647
1647 1648 2 years OWN 35000.0 Not Verified 1648

acc_open_past_24mths avg_cur_bal term int_rate grade \
0 5 29828.0 60 months 12.39% C
1 7 9536.0 36 months 6.99% A
2 2 32727.0 60 months 17.14% D
3 8 3214.0 36 months 13.66% C
4 6 5857.0 36 months 13.66% C
... ... ... ... ... ...
1643 3 1676.0 NaN NaN NaN
1644 4 21360.0 NaN NaN NaN
1645 1 3180.0 NaN NaN NaN
1646 8 3652.0 NaN NaN NaN
1647 3 1586.0 NaN NaN NaN

loan_status
0 Fully Paid
1 Charged Off
2 In Grace Period
3 Fully Paid
4 Charged Off
... ...
1643 NaN
1644 NaN
1645 NaN
1646 NaN
1647 NaN

[1648 rows x 12 columns]

1
2
3
4
#(7)删去用于合并的列名user_id,使用info查看数据缺失情况,删除缺失数据;

a3.drop('user_id',axis = 1,inplace = True) # axis=1,删除user_id这一列
print(a3)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
     emp_length home_ownership  annual_inc verification_status  user  \
0 10+ years RENT 78000.0 Source Verified 1
1 8 years MORTGAGE 58000.0 Not Verified 2
2 10+ years MORTGAGE 125000.0 Verified 3
3 10+ years RENT 69000.0 Source Verified 4
4 < 1 year RENT 50000.0 Source Verified 5
... ... ... ... ... ...
1643 10+ years OWN 72000.0 Verified 1644
1644 7 years MORTGAGE 62000.0 Source Verified 1645
1645 4 years RENT 50000.0 Source Verified 1646
1646 1 year RENT 45000.0 Not Verified 1647
1647 2 years OWN 35000.0 Not Verified 1648

acc_open_past_24mths avg_cur_bal term int_rate grade \
0 5 29828.0 60 months 12.39% C
1 7 9536.0 36 months 6.99% A
2 2 32727.0 60 months 17.14% D
3 8 3214.0 36 months 13.66% C
4 6 5857.0 36 months 13.66% C
... ... ... ... ... ...
1643 3 1676.0 NaN NaN NaN
1644 4 21360.0 NaN NaN NaN
1645 1 3180.0 NaN NaN NaN
1646 8 3652.0 NaN NaN NaN
1647 3 1586.0 NaN NaN NaN

loan_status
0 Fully Paid
1 Charged Off
2 In Grace Period
3 Fully Paid
4 Charged Off
... ...
1643 NaN
1644 NaN
1645 NaN
1646 NaN
1647 NaN

[1648 rows x 11 columns]
1
a3.isnull().sum()
1
2
3
4
5
6
7
8
9
10
11
12
emp_length               71
home_ownership 0
annual_inc 0
verification_status 0
user 0
acc_open_past_24mths 0
avg_cur_bal 2
term 379
int_rate 379
grade 379
loan_status 379
dtype: int64
1
2
a3.dropna(axis = 0, inplace =True) # 沿行删除所有数据
a3.isnull().sum()
1
2
3
4
5
6
7
8
9
10
11
12
emp_length              0
home_ownership 0
annual_inc 0
verification_status 0
user 0
acc_open_past_24mths 0
avg_cur_bal 0
term 0
int_rate 0
grade 0
loan_status 0
dtype: int64

1
2
3
4
# (8)将年收入进行离散化,分为高中低三档;

a3['年收入'] = pd.cut(a3['annual_inc'], bins=[0, 50000, 100000, a3['annual_inc'].max()], labels=['低','中', '高'])
a3.head()

img-202304301440474


1
2
3
4
5
6
7
8
9
10
11
#(9)贷款状态为"Charged Off","In Grace Period", "Late (31-120 days)"的贷款有违约风险,视为不良贷款,将其值标记为1,其他贷款标记为0;
# 使用字典方式进行替换

a3.replace(to_replace={'loan_status':
{'Fully Paid':0, 'Current':0, 'Charged Off':1, 'In Grace Period':1, 'Late (31-120 days)':1}}, inplace=True)

a3.head()

# 另一种方法
# df['loan_status'].replace(to_replace=['Fully Paid','Current','Charged Off','In Grace Period','Late (31-120 days)'],
# value=[0,0,1,1,1],inplace=True)

009


1
2
3
4
5
6
#(10)将term、grade、emp_length、annual_inc、home_ownership、verification_status这些分类变量进行哑变量处理。

ss = ['term', 'grade', 'emp_length', 'annual_inc', 'home_ownership', 'verification_status']
for s in ss:
dum_list = pd.get_dummies(a3[s], prefix = s)
print(dum_list.head())
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
   term_ 36 months  term_ 60 months
0 0 1
1 1 0
2 0 1
3 1 0
4 1 0
grade_A grade_B grade_C grade_D grade_E grade_F grade_G
0 0 0 1 0 0 0 0
1 1 0 0 0 0 0 0
2 0 0 0 1 0 0 0
3 0 0 1 0 0 0 0
4 0 0 1 0 0 0 0
emp_length_1 year emp_length_10+ years emp_length_2 years \
0 0 1 0
1 0 0 0
2 0 1 0
3 0 1 0
4 0 0 0

emp_length_3 years emp_length_4 years emp_length_5 years \
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0

emp_length_6 years emp_length_7 years emp_length_8 years \
0 0 0 0
1 0 0 1
2 0 0 0
3 0 0 0
4 0 0 0

emp_length_9 years emp_length_< 1 year
0 0 0
1 0 0
2 0 0
3 0 0
4 0 1
annual_inc_16000.0 annual_inc_18000.0 annual_inc_18500.0 \
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0

annual_inc_19000.0 annual_inc_19600.0 annual_inc_20000.0 \
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0

annual_inc_21000.0 annual_inc_21600.0 annual_inc_22000.0 \
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0

annual_inc_22500.0 ... annual_inc_280000.0 annual_inc_300000.0 \
0 0 ... 0 0
1 0 ... 0 0
2 0 ... 0 0
3 0 ... 0 0
4 0 ... 0 0

annual_inc_300300.0 annual_inc_310000.0 annual_inc_335000.0 \
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0

annual_inc_350000.0 annual_inc_385000.0 annual_inc_450000.0 \
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0

annual_inc_550000.0 annual_inc_580000.0
0 0 0
1 0 0
2 0 0
3 0 0
4 0 0

[5 rows x 320 columns]
home_ownership_MORTGAGE home_ownership_OWN home_ownership_RENT
0 0 0 1
1 1 0 0
2 1 0 0
3 0 0 1
4 0 0 1
verification_status_Not Verified verification_status_Source Verified \
0 0 1
1 1 0
2 0 0
3 0 1
4 0 1

verification_status_Verified
0 0
1 0
2 1
3 0
4 0

五、参考资料

Lending Club2015年公开数据集下载

Lending Club贷款数据转换与融合