
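Google Play Store App Analysis

Analyzing App Data from the Google Play Store

The Google Play Store is the store used outside China to download Android apps. This case study analyzes the app information listed in the store; the results can inform app marketing as well as development and design decisions. The key skill throughout is data cleaning. Jupyter Notebook is recommended for following along.

Reply to the official account to get the dataset 'googleplaystore.csv'.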


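1. Read the data and understand what it means

Start by looking at the data. The first row holds the column names: App (app name), Category, Rating, Reviews (number of reviews), Size (app size), Installs (number of installs), and so on. There are roughly 10,000 rows in total.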

import numpy as np
import pandas as pd

df = pd.read_csv('./googleplaystore.csv', usecols=(0, 1, 2, 3, 4, 5, 6))  # usecols: load only the first 7 columns for the analysis

df.head()  # look at the first few rows to understand the fields
   App                                                Category        Rating  Reviews  Size  Installs     Type
0  Photo Editor & Candy Camera & Grid & ScrapBook     ART_AND_DESIGN  4.1     159      19M   10,000+      Free
1  Coloring book moana                                ART_AND_DESIGN  3.9     967      14M   500,000+     Free
2  U Launcher Lite – FREE Live Cool Themes, Hide ...  ART_AND_DESIGN  4.7     87510    8.7M  5,000,000+   Free
3  Sketch - Draw & Paint                              ART_AND_DESIGN  4.5     215644   25M   50,000,000+  Free
4  Pixel Draw - Number Art Coloring Book              ART_AND_DESIGN  4.3     967      2.8M  100,000+     Free

df.describe()  # only Rating gets summary statistics; the other columns are stored as strings. Note the max of 19, which is outside the 1-5 rating scale
Rating
count 9367.000000
mean 4.193338
std 0.537431
min 1.000000
25% 4.000000
50% 4.300000
75% 4.500000
max 19.000000

df.count()  # non-null count per column: Rating is missing more than a thousand values (1,474) and Type is missing one; the cleaning below deals with both
App         10841
Category 10841
Rating 9367
Reviews 10841
Size 10841
Installs 10841
Type 10840
dtype: int64
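
The two checks above, describe() for numeric columns and count() for non-null rows, can be folded into one small audit table. The following is a minimal sketch on a made-up three-row frame rather than the real dataset; the name `toy` and all of its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset, used only for illustration
toy = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Rating": [4.1, np.nan, 3.9],
    "Reviews": ["159", "967", "87510"],   # numbers stored as text
    "Type": ["Free", None, "Paid"],
})

# One row per column: its dtype and how many values are missing
audit = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "missing": toy.isna().sum(),
})
print(audit)
```

A column whose dtype is not numeric is left out of describe(), and a non-zero `missing` count corresponds to a shortfall in count().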

2. Data cleaning

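# Go through the columns one at a time, starting with App.
# App has 10,841 rows; check whether any app names repeat.
pd.unique(df['App']).size
# Only 9,660 names are unique, so some rows repeat. Do not deduplicate yet: different apps may share a name, so the other columns are cleaned first.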
9660
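
A repeated name and a fully repeated row are different things, which is why the step above holds off on deduplicating. The sketch below, on a hypothetical four-row frame (the app names and categories are invented), separates the two cases with `nunique()` and `duplicated()`.

```python
import pandas as pd

# Hypothetical rows: "Notes" appears three times, once in a different category
toy = pd.DataFrame({
    "App": ["Notes", "Notes", "Maps", "Notes"],
    "Category": ["TOOLS", "TOOLS", "TRAVEL_AND_LOCAL", "PRODUCTIVITY"],
})

n_unique_names = toy["App"].nunique()            # distinct app names
n_repeat_names = toy["App"].duplicated().sum()   # rows whose name already appeared earlier
n_exact_copies = toy.duplicated().sum()          # rows identical to an earlier row in every column

print(n_unique_names, n_repeat_names, n_exact_copies)
```

Rows counted by `n_repeat_names` but not by `n_exact_copies` share a name while differing elsewhere, and those are the ones worth inspecting before dropping anything.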

# Category column

df['Category'].value_counts(dropna=False)  # frequency of each category, counting missing values as well
FAMILY                 1972
GAME 1144
TOOLS 843
MEDICAL 463
BUSINESS 460
PRODUCTIVITY 424
PERSONALIZATION 392
COMMUNICATION 387
SPORTS 384
LIFESTYLE 382
FINANCE 366
HEALTH_AND_FITNESS 341
PHOTOGRAPHY 335
SOCIAL 295
NEWS_AND_MAGAZINES 283
SHOPPING 260
TRAVEL_AND_LOCAL 258
DATING 234
BOOKS_AND_REFERENCE 231
VIDEO_PLAYERS 175
EDUCATION 156
ENTERTAINMENT 149
MAPS_AND_NAVIGATION 137
FOOD_AND_DRINK 127
HOUSE_AND_HOME 88
LIBRARIES_AND_DEMO 85
AUTO_AND_VEHICLES 85
WEATHER 82
ART_AND_DESIGN 65
EVENTS 64
COMICS 60
PARENTING 60
BEAUTY 53
1.9 1
Name: Category, dtype: int64

# No category is missing (there is no NaN entry),
# but the last entry, 1.9, is not a category name. Pull that row out and inspect it.
df[df['Category'] == '1.9']
       App                                      Category  Rating  Reviews  Size    Installs  Type
10472  Life Made WI-Fi Touchscreen Photo Frame  1.9       19.0    3.0M     1,000+  Free      0
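
# In this row Category is 1.9, Rating is 19 (ratings normally run from 1 to 5), Size is 1,000+ (it should look like xxM), and Free, which belongs in Type, sits under Installs.
# So the category value is missing and everything from 1.9 onward has slipped one column to the left. The proper fix is to shift those values one column to the right and look up the category on the store page.
# That is fiddly, and with more than 10,000 rows, dropping this single row has very little effect.
df.drop(index=10472, inplace=True)  # inplace=True modifies df directly instead of returning a new object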
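
For readers who would rather repair the row than drop it, the shift described in the comments can be sketched as follows. This is an illustration on a hypothetical two-row frame with every value stored as text; the shortened app names are invented, and the trailing "0" is assumed to belong to a later column that `usecols` did not load.

```python
import numpy as np
import pandas as pd

cols = ["App", "Category", "Rating", "Reviews", "Size", "Installs", "Type"]
# Hypothetical data: the second row has lost its Category, so its values sit one column too far left
toy = pd.DataFrame(
    [
        ["Sketch", "ART_AND_DESIGN", "4.5", "215644", "25M", "50,000,000+", "Free"],
        ["Photo Frame", "1.9", "19", "3.0M", "1,000+", "Free", "0"],
    ],
    columns=cols,
    dtype=object,
)

bad = toy["Category"] == "1.9"                 # mask selecting the damaged row
values = toy.loc[bad, cols[1:]].to_numpy()     # the six values after App, shape (1, 6)
toy.loc[bad, cols[2:]] = values[:, :-1]        # move each value one column to the right
toy.loc[bad, "Category"] = np.nan              # the real category is unknown until looked up

print(toy)
```

The category still has to be filled in by hand, which is the part the original walkthrough chose to skip.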

# Rating column
df['Rating'].value_counts(dropna=False)
NaN    1474
4.4    1109
4.3    1076
4.5    1038
4.2     952
4.6     823
4.1     708
4.0     568
4.7     499
3.9     386
3.8     303
5.0     274
3.7     239
4.8     234
3.6     174
3.5     163
3.4     128
3.3     102
4.9      87
3.0      83
3.1      69
3.2      64
2.9      45
2.8      42
2.7      25
2.6      25
2.5      21
2.3      20
2.4      19
1.0      16
2.2      14
1.9      13
2.0      12
1.7       8
1.8       8
2.1       8
1.6       4
1.5       3
1.4       3
1.2       1
Name: Rating, dtype: int64
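
# All ratings now fall within [1, 5], so the values themselves are fine. But 1,474 rows are NaN, too many to ignore; here they are filled with the mean rating.
df['Rating'] = df['Rating'].fillna(df['Rating'].mean())  # fillna() replaces NaN; assigning the result back is more reliable than inplace=True on a selected column
df['Rating'].value_counts(dropna=False)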
4.191757    1474
4.400000    1109
4.300000    1076
4.500000    1038
4.200000     952
4.600000     823
4.100000     708
4.000000     568
4.700000     499
3.900000     386
3.800000     303
5.000000     274
3.700000     239
4.800000     234
3.600000     174
3.500000     163
3.400000     128
3.300000     102
4.900000      87
3.000000      83
3.100000      69
3.200000      64
2.900000      45
2.800000      42
2.700000      25
2.600000      25
2.500000      21
2.300000      20
2.400000      19
1.000000      16
2.200000      14
1.900000      13
2.000000      12
1.800000       8
1.700000       8
2.100000       8
1.600000       4
1.500000       3
1.400000       3
1.200000       1
Name: Rating, dtype: int64
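
Filling every gap with one global mean gives 1,474 apps the identical rating, which narrows the spread of the column. One possible alternative, shown here only as a sketch on invented numbers and not as the method used in this post, is to fill each gap with the mean of the app's own category via `groupby(...).transform`.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings with one gap in each category
toy = pd.DataFrame({
    "Category": ["GAME", "GAME", "GAME", "TOOLS", "TOOLS"],
    "Rating": [4.0, 5.0, np.nan, 3.0, np.nan],
})

# Global mean, as in the cell above: every gap receives the same value
global_fill = toy["Rating"].fillna(toy["Rating"].mean())

# Alternative: each gap receives the mean rating of its own category
by_category = toy.groupby("Category")["Rating"].transform("mean")
group_fill = toy["Rating"].fillna(by_category)

print(pd.DataFrame({"global": global_fill, "per_category": group_fill}))
```

Either way the filled values are estimates, so any later statistic on Rating partly reflects the filling rule.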

# Reviews column
df['Reviews'].value_counts(dropna=False)
0         596
1         272
2         214
3         175
4         137
5         108
6          97
7          90
8          74
9          65
10         64
12         60
11         52
13         49
17         48
19         41
14         41
16         35
21         35
20         35
15         31
30         30
24         30
25         30
38         29
18         27
22         26
23         25
27         25
33         24
         ... 
127229      1
2159        1
157264      1
6826        1
21262       1
37607       1
71269       1
67071       1
24215       1
63624       1
10753       1
159455      1
72596       1
8191        1
258556      1
10672       1
454412      1
56065       1
42329       1
84114       1
71432       1
815893      1
654419      1
9562        1
580         1
2976        1
18478       1
73821       1
1740        1
354         1
Name: Reviews, Length: 6001, dtype: int64
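
# Review counts are spread very widely; 0 is the most common value, with 596 apps.
# describe() showed no statistics for Reviews earlier, so the column is stored as strings and may contain non-numeric text. Count the values made up only of digits:
df['Reviews'].str.isnumeric().sum()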
10840
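
# All 10,840 values are numeric (one row was dropped during the Category step), so only numeric strings remain in Reviews.
# To double-check, select any rows that are not numeric; the result is empty.
df[~df['Reviews'].str.isnumeric()]  # ~ is the idiomatic way to negate a boolean mask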

# Every value is numeric, yet describe() still leaves Reviews out, because the column's dtype is still string
df.describe()
Rating
count 10840.000000
mean 4.191757
std 0.478907
min 1.000000
25% 4.100000
50% 4.200000
75% 4.500000
max 5.000000
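
# Convert the column to a numeric type
df['Reviews'] = df['Reviews'].astype('i8')  # 'i8' is an 8-byte integer, i.e. int64
df.describe()
# The largest review count is 7.815831e+07, about 78 million; the smallest is 0 and nothing is negative, so the data look reasonable.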
Rating Reviews
count 10840.000000 1.084000e+04
mean 4.191757 4.441529e+05
std 0.478907 2.927761e+06
min 1.000000 0.000000e+00
25% 4.100000 3.800000e+01
50% 4.200000 2.094000e+03
75% 4.500000 5.477550e+04
max 5.000000 7.815831e+07
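
The isnumeric() check followed by astype() can also be done in one pass with `pd.to_numeric(..., errors="coerce")`, which turns anything unparseable into NaN instead of raising. A minimal sketch on an invented Series; the stray value "3.0M" is borrowed from the damaged row seen earlier.

```python
import pandas as pd

# Hypothetical review counts stored as text, with one bad entry
reviews = pd.Series(["159", "967", "3.0M", "87510"])

as_number = pd.to_numeric(reviews, errors="coerce")   # unparseable text becomes NaN
offenders = reviews[as_number.isna()]                  # the original strings that failed to parse

print(offenders)
```

If `offenders` is empty, the column can be converted safely; otherwise it lists exactly which values still need cleaning.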

# Size column
df['Size'].value_counts(dropna=False)
Varies with device    1695
11M                    198
12M                    196
14M                    194
13M                    191
15M                    184
17M                    160
19M                    154
26M                    149
16M                    149
25M                    143
20M                    139
21M                    138
24M                    136
10M                    136
18M                    133
23M                    117
22M                    114
29M                    103
27M                     97
28M                     95
30M                     84
33M                     79
3.3M                    77
37M                     76
35M                     72
31M                     70
2.9M                    69
2.3M                    68
2.5M                    68
                      ... 
245k                     1
860k                     1
67k                      1
942k                     1
629k                     1
940k                     1
208k                     1
787k                     1
785k                     1
14k                      1
921k                     1
116k                     1
234k                     1
378k                     1
865k                     1
226k                     1
122k                     1
222k                     1
400k                     1
191k                     1
549k                     1
642k                     1
209k                     1
778k                     1
540k                     1
240k                     1
663k                     1
220k                     1
11k                      1
485k                     1
Name: Size, Length: 461, dtype: int64
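
# 'Varies with device' appears 1,695 times. The size is unknown, so it cannot be used in calculations; it will be replaced with the mean.
# The other values carry an M or k unit suffix, which also blocks arithmetic, so rewrite each suffix as an exponent.
df['Size'] = df['Size'].str.replace('M', 'e+6', regex=False)
df['Size'] = df['Size'].str.replace('k', 'e+3', regex=False)
# Converting the type at this point still fails, because some non-numeric strings remain
# df['Size'].astype('f8')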

# Define a function that checks whether a value can be converted to float
def is_convertable(v):
    try:
        float(v)
        return True
    except ValueError:
        return False

# Apply it to every value: True means convertible, False means not
df['Size'].apply(is_convertable)

# Show the distribution of the strings that cannot be converted, i.e. where the mask is False
temp = df['Size'].apply(is_convertable)
df['Size'][~temp].value_counts()

# Replace the remaining string with a placeholder value
df['Size'] = df['Size'].str.replace('Varies with device', '0', regex=False)
# Check again whether any strings still cannot be converted
temp = df['Size'].apply(is_convertable)
df['Size'][~temp].value_counts()
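
# Convert the type
# astype cannot turn strings such as 1.9e+6 straight into int; convert to float (f8) first, then to int (i8)
df['Size'] = df['Size'].astype('f8').astype('i8')
# Fill the placeholder 0 values with the mean (this mean still counts the zeros, so it understates the typical size)
df['Size'] = df['Size'].replace(0, df['Size'].mean())
df.describe()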
       Rating        Reviews       Size
count  10840.000000  1.084000e+04  1.084000e+04
mean   4.191757      4.441529e+05  2.099045e+07
std    0.478907      2.927761e+06  2.078345e+07
min    1.000000      0.000000e+00  8.500000e+03
25%    4.100000      3.800000e+01  5.900000e+06
50%    4.200000      2.094000e+03  1.800000e+07
75%    4.500000      5.477550e+04  2.600000e+07
max    5.000000      7.815831e+07  1.000000e+08
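
The Size steps above use 0 as a stand-in for 'Varies with device', and that 0 is included when the mean is computed. A possible variation, sketched here on four invented values, parses the unit suffix directly and marks unknown sizes as NaN, so that `mean()` skips them. The helper name `parse_size` is hypothetical.

```python
import numpy as np
import pandas as pd

def parse_size(text):
    """Convert strings like '19M' or '500k' to a float; anything else becomes NaN."""
    factors = {"M": 1e6, "k": 1e3}
    unit = text[-1]
    if unit in factors:
        return float(text[:-1]) * factors[unit]
    return np.nan

# Hypothetical sample of the Size column
sizes = pd.Series(["19M", "2M", "500k", "Varies with device"])
parsed = sizes.map(parse_size)

mean_known = parsed.mean()                  # NaN is skipped, so only known sizes count
mean_with_zeros = parsed.fillna(0).mean()   # what the 0-placeholder approach would use
filled = parsed.fillna(mean_known)

print(mean_known, mean_with_zeros)
```

On this toy sample the zero-inclusive mean is noticeably smaller than the mean of the known sizes, which is the bias the NaN-based variant avoids.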

# Installs column
# Check the distribution first: the values contain commas and plus signs
df['Installs'].value_counts()
1,000,000+        1579
10,000,000+ 1252
100,000+ 1169
10,000+ 1054
1,000+ 907
5,000,000+ 752
100+ 719
500,000+ 539
50,000+ 479
5,000+ 477
100,000,000+ 409
10+ 386
500+ 330
50,000,000+ 289
50+ 205
5+ 82
500,000,000+ 72
1+ 67
1,000,000,000+ 58
0+ 14
0 1
Name: Installs, dtype: int64

# There are only a few distinct values, so simply strip the + and , characters
df['Installs'] = df['Installs'].str.replace('+', '', regex=False)
df['Installs'] = df['Installs'].str.replace(',', '', regex=False)
# Convert the type
df['Installs'] = df['Installs'].astype('i8')
df.describe()
Rating Reviews Size Installs
count 10840.000000 1.084000e+04 1.084000e+04 1.084000e+04
mean 4.191757 4.441529e+05 2.099045e+07 1.546434e+07
std 0.478907 2.927761e+06 2.078345e+07 8.502936e+07
min 1.000000 0.000000e+00 8.500000e+03 0.000000e+00
25% 4.100000 3.800000e+01 5.900000e+06 1.000000e+03
50% 4.200000 2.094000e+03 1.800000e+07 1.000000e+05
75% 4.500000 5.477550e+04 2.600000e+07 5.000000e+06
max 5.000000 7.815831e+07 1.000000e+08 1.000000e+09
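
The two replace calls can be collapsed into a single regular expression that drops every non-digit character. A minimal sketch on an invented sample of the Installs values:

```python
import pandas as pd

# Hypothetical sample of the Installs column
installs = pd.Series(["10,000+", "5,000,000+", "0+", "0"])

# Remove every character that is not a digit, then convert to 64-bit integers
cleaned = installs.str.replace(r"[^0-9]", "", regex=True).astype("int64")

print(cleaned.tolist())
```

Keep in mind that a value such as 10,000+ is a bucket ("at least 10,000"), so the cleaned numbers are lower bounds rather than exact install counts.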

# Type column
df['Type'].value_counts(dropna=False)
Free    10039
Paid      800
NaN         1
Name: Type, dtype: int64

# One row is NaN. The simplest fix is to find its index and drop it
df[df['Type'].isnull()]
      App                        Category  Rating    Reviews  Size      Installs  Type
9148  Command & Conquer: Rivals  FAMILY    4.191757  0        18152090  0         NaN

# Drop this row
df.drop(index=9148, inplace=True)

# With every other column cleaned, deduplicate on App (the first occurrence of each name is kept)
df.drop_duplicates('App', inplace=True)
df.count()
App         9658
Category    9658
Rating      9658
Reviews     9658
Size        9658
Installs    9658
Type        9658
dtype: int64
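
By default `drop_duplicates` keeps whichever copy of an app appears first in the file. If one copy is preferable, the frame can be sorted before deduplicating. The sketch below keeps the copy with the most reviews; treating that copy as the best one is an assumption made for illustration, not something the dataset states, and the rows are invented.

```python
import pandas as pd

# Hypothetical rows: "Notes" was captured twice with different review counts
toy = pd.DataFrame({
    "App": ["Notes", "Maps", "Notes"],
    "Reviews": [120, 40, 150],
})

deduped = (
    toy.sort_values("Reviews", ascending=False)   # most-reviewed copy of each app first
       .drop_duplicates("App", keep="first")      # keep that copy, drop the others
       .sort_index()                              # restore the original row order
)

print(deduped)
```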

3. Data analysis: dimensional analysis and correlation analysis

# The cleaning is done; now analyze the data
# Overall picture
df.describe()
Rating Reviews Size Installs
count 9658.000000 9.658000e+03 9.658000e+03 9.658000e+03
mean 4.176046 2.166150e+05 2.011053e+07 7.778312e+06
std 0.494383 1.831413e+06 2.040865e+07 5.376100e+07
min 1.000000 0.000000e+00 8.500000e+03 0.000000e+00
25% 4.000000 2.500000e+01 5.300000e+06 1.000000e+03
50% 4.200000 9.670000e+02 1.600000e+07 1.000000e+05
75% 4.500000 2.940800e+04 2.500000e+07 1.000000e+06
max 5.000000 7.815831e+07 1.000000e+08 1.000000e+09

# Pick a dimension to analyze along

# Use Category as the dimension: which categories of apps are the most popular?
# Number of categories
df.Category.unique().size

33

# Number of apps in each category, sorted; this shows which categories developers favor most
df.groupby('Category').count().sort_values('App', ascending=False)
App Rating Reviews Size Installs Type
Category
FAMILY 1831 1831 1831 1831 1831 1831
GAME 959 959 959 959 959 959
TOOLS 827 827 827 827 827 827
BUSINESS 420 420 420 420 420 420
MEDICAL 395 395 395 395 395 395
PERSONALIZATION 376 376 376 376 376 376
PRODUCTIVITY 374 374 374 374 374 374
LIFESTYLE 369 369 369 369 369 369
FINANCE 345 345 345 345 345 345
SPORTS 325 325 325 325 325 325
COMMUNICATION 315 315 315 315 315 315
HEALTH_AND_FITNESS 288 288 288 288 288 288
PHOTOGRAPHY 281 281 281 281 281 281
NEWS_AND_MAGAZINES 254 254 254 254 254 254
SOCIAL 239 239 239 239 239 239
BOOKS_AND_REFERENCE 222 222 222 222 222 222
TRAVEL_AND_LOCAL 219 219 219 219 219 219
SHOPPING 202 202 202 202 202 202
DATING 171 171 171 171 171 171
VIDEO_PLAYERS 163 163 163 163 163 163
MAPS_AND_NAVIGATION 131 131 131 131 131 131
EDUCATION 119 119 119 119 119 119
FOOD_AND_DRINK 112 112 112 112 112 112
ENTERTAINMENT 102 102 102 102 102 102
AUTO_AND_VEHICLES 85 85 85 85 85 85
LIBRARIES_AND_DEMO 84 84 84 84 84 84
WEATHER 79 79 79 79 79 79
HOUSE_AND_HOME 74 74 74 74 74 74
EVENTS 64 64 64 64 64 64
ART_AND_DESIGN 64 64 64 64 64 64
PARENTING 60 60 60 60 60 60
COMICS 56 56 56 56 56 56
BEAUTY 53 53 53 53 53 53
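
Because no column has missing values any more, `groupby('Category').count()` repeats the same number in every column. `value_counts` returns that number once, and with `normalize=True` it also gives each category's share of all apps. A minimal sketch on six invented labels:

```python
import pandas as pd

# Hypothetical category labels
categories = pd.Series(["FAMILY", "GAME", "FAMILY", "TOOLS", "FAMILY", "GAME"])

counts = categories.value_counts()                 # apps per category, largest first
shares = categories.value_counts(normalize=True)   # same ranking, as a fraction of all apps

print(pd.DataFrame({"count": counts, "share": shares}))
```

Applied to the table above, FAMILY's 1,831 of 9,658 apps works out to roughly 19% of the store.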
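
# Mean installs per category, sorted: communication, video, social, and entertainment apps are what users want most
df.groupby('Category').mean(numeric_only=True).sort_values('Installs', ascending=False)  # numeric_only=True skips the text columns (required in pandas 2.x)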
Rating Reviews Size Installs
Category
COMMUNICATION 4.134647 907337.676190 1.289365e+07 3.504215e+07
VIDEO_PLAYERS 4.058137 414015.754601 1.631384e+07 2.409143e+07
SOCIAL 4.238926 953672.807531 1.643765e+07 2.296179e+07
ENTERTAINMENT 4.135294 340810.294118 2.122137e+07 2.072216e+07
PHOTOGRAPHY 4.159614 374915.551601 1.618811e+07 1.654501e+07
PRODUCTIVITY 4.185022 148638.098930 1.363180e+07 1.548955e+07
GAME 4.244643 648903.763295 3.973997e+07 1.447229e+07
TRAVEL_AND_LOCAL 4.087380 122464.570776 2.293315e+07 1.321866e+07
TOOLS 4.059615 277335.644498 9.870441e+06 9.675661e+06
NEWS_AND_MAGAZINES 4.135385 91063.889764 1.365578e+07 9.327629e+06
BOOKS_AND_REFERENCE 4.308393 75321.234234 1.376752e+07 7.504367e+06
SHOPPING 4.225835 220553.118812 1.593927e+07 6.932420e+06
WEATHER 4.238510 155634.987342 1.427317e+07 4.570893e+06
PERSONALIZATION 4.303077 142401.808511 1.168523e+07 4.075784e+06
HEALTH_AND_FITNESS 4.235199 74171.371528 2.018017e+07 3.972300e+06
MAPS_AND_NAVIGATION 4.051854 135337.007634 1.669496e+07 3.841846e+06
SPORTS 4.211275 108765.578462 2.333144e+07 3.373768e+06
EDUCATION 4.362956 112303.764706 1.882895e+07 2.965983e+06
FAMILY 4.181137 78550.239214 2.666982e+07 2.418319e+06
FOOD_AND_DRINK 4.175461 56473.464286 1.999241e+07 1.891060e+06
ART_AND_DESIGN 4.349614 22175.046875 1.255163e+07 1.786533e+06
BUSINESS 4.133347 23548.202381 1.431609e+07 1.659916e+06
LIFESTYLE 4.111489 32066.859079 1.515860e+07 1.365375e+06
FINANCE 4.125060 36701.756522 1.747266e+07 1.319851e+06
HOUSE_AND_HOME 4.156771 26079.013514 1.632407e+07 1.313682e+06
DATING 4.018100 21190.315789 1.583592e+07 8.241293e+05
COMICS 4.181848 41822.696429 1.433960e+07 8.032348e+05
LIBRARIES_AND_DEMO 4.181371 10795.607143 1.087250e+07 6.309037e+05
AUTO_AND_VEHICLES 4.190601 13690.188235 1.981538e+07 6.250613e+05
PARENTING 4.281960 15972.183333 2.207688e+07 5.253518e+05
BEAUTY 4.260553 7476.226415 1.428892e+07 5.131519e+05
EVENTS 4.363178 2515.906250 1.442185e+07 2.495806e+05
MEDICAL 4.173252 2994.863291 1.911849e+07 9.669159e+04
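
Install counts are heavily skewed: in the overall describe() output the mean is about 7.8 million while the median is only 100,000. A mean ranking can therefore be driven by a handful of huge apps. The sketch below, on invented numbers, puts the mean next to the median so the two rankings can be compared.

```python
import pandas as pd

# Hypothetical installs: SOCIAL has one blockbuster, TOOLS is evenly spread
toy = pd.DataFrame({
    "Category": ["SOCIAL", "SOCIAL", "SOCIAL", "TOOLS", "TOOLS", "TOOLS"],
    "Installs": [1_000, 5_000, 1_000_000_000, 100_000, 500_000, 1_000_000],
})

# Mean and median installs per category, side by side
summary = toy.groupby("Category")["Installs"].agg(["mean", "median"])

print(summary)
```

In this toy data SOCIAL leads by mean but TOOLS leads by median, which is the kind of disagreement worth checking for before drawing conclusions from the mean table alone.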

# Mean reviews per category: social, communication, game, and video apps draw the most reviews
df.groupby('Category').mean(numeric_only=True).sort_values('Reviews', ascending=False)
Rating Reviews Size Installs
Category
SOCIAL 4.238926 953672.807531 1.643765e+07 2.296179e+07
COMMUNICATION 4.134647 907337.676190 1.289365e+07 3.504215e+07
GAME 4.244643 648903.763295 3.973997e+07 1.447229e+07
VIDEO_PLAYERS 4.058137 414015.754601 1.631384e+07 2.409143e+07
PHOTOGRAPHY 4.159614 374915.551601 1.618811e+07 1.654501e+07
ENTERTAINMENT 4.135294 340810.294118 2.122137e+07 2.072216e+07
TOOLS 4.059615 277335.644498 9.870441e+06 9.675661e+06
SHOPPING 4.225835 220553.118812 1.593927e+07 6.932420e+06
WEATHER 4.238510 155634.987342 1.427317e+07 4.570893e+06
PRODUCTIVITY 4.185022 148638.098930 1.363180e+07 1.548955e+07
PERSONALIZATION 4.303077 142401.808511 1.168523e+07 4.075784e+06
MAPS_AND_NAVIGATION 4.051854 135337.007634 1.669496e+07 3.841846e+06
TRAVEL_AND_LOCAL 4.087380 122464.570776 2.293315e+07 1.321866e+07
EDUCATION 4.362956 112303.764706 1.882895e+07 2.965983e+06
SPORTS 4.211275 108765.578462 2.333144e+07 3.373768e+06
NEWS_AND_MAGAZINES 4.135385 91063.889764 1.365578e+07 9.327629e+06
FAMILY 4.181137 78550.239214 2.666982e+07 2.418319e+06
BOOKS_AND_REFERENCE 4.308393 75321.234234 1.376752e+07 7.504367e+06
HEALTH_AND_FITNESS 4.235199 74171.371528 2.018017e+07 3.972300e+06
FOOD_AND_DRINK 4.175461 56473.464286 1.999241e+07 1.891060e+06
COMICS 4.181848 41822.696429 1.433960e+07 8.032348e+05
FINANCE 4.125060 36701.756522 1.747266e+07 1.319851e+06
LIFESTYLE 4.111489 32066.859079 1.515860e+07 1.365375e+06
HOUSE_AND_HOME 4.156771 26079.013514 1.632407e+07 1.313682e+06
BUSINESS 4.133347 23548.202381 1.431609e+07 1.659916e+06
ART_AND_DESIGN 4.349614 22175.046875 1.255163e+07 1.786533e+06
DATING 4.018100 21190.315789 1.583592e+07 8.241293e+05
PARENTING 4.281960 15972.183333 2.207688e+07 5.253518e+05
AUTO_AND_VEHICLES 4.190601 13690.188235 1.981538e+07 6.250613e+05
LIBRARIES_AND_DEMO 4.181371 10795.607143 1.087250e+07 6.309037e+05
BEAUTY 4.260553 7476.226415 1.428892e+07 5.131519e+05
MEDICAL 4.173252 2994.863291 1.911849e+07 9.669159e+04
EVENTS 4.363178 2515.906250 1.442185e+07 2.495806e+05
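
# Mean rating per category: this ranking does not line up with the install and review rankings, so it needs a closer look
df.groupby('Category').mean(numeric_only=True).sort_values('Rating', ascending=False)
# Break the data down by Type (Free/Paid) as well as Category; the table below is sorted by mean reviews
df.groupby(['Type', 'Category']).mean(numeric_only=True).sort_values('Reviews', ascending=False)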
Rating Reviews Size Installs
Type Category
Free COMMUNICATION 4.139080 992108.173611 1.350167e+07 3.832263e+07
SOCIAL 4.243693 965794.741525 1.656355e+07 2.325365e+07
GAME 4.233936 707783.190422 4.036479e+07 1.580151e+07
VIDEO_PLAYERS 4.057084 424347.176101 1.636918e+07 2.469705e+07
PHOTOGRAPHY 4.167498 401664.270992 1.667036e+07 1.773767e+07
ENTERTAINMENT 4.126000 347526.410000 2.093427e+07 2.113460e+07
TOOLS 4.047697 305987.504673 1.033869e+07 1.068097e+07
SHOPPING 4.223093 222756.230000 1.606466e+07 7.001693e+06
PERSONALIZATION 4.277251 180508.227119 1.024622e+07 5.183851e+06
WEATHER 4.226064 171249.619718 1.429121e+07 5.074486e+06
PRODUCTIVITY 4.183759 160170.312139 1.411873e+07 1.673896e+07
MAPS_AND_NAVIGATION 4.059467 140650.476190 1.652609e+07 3.993340e+06
TRAVEL_AND_LOCAL 4.084875 129476.657005 2.206258e+07 1.398408e+07
SPORTS 4.208242 116937.468439 2.361516e+07 3.638640e+06
EDUCATION 4.349494 115908.721739 1.813604e+07 3.063913e+06
NEWS_AND_MAGAZINES 4.130111 91785.821429 1.364591e+07 9.401636e+06
BOOKS_AND_REFERENCE 4.321794 86183.082474 1.393813e+07 8.587352e+06
FAMILY 4.171360 85068.516990 2.667952e+07 2.674327e+06
HEALTH_AND_FITNESS 4.229562 78078.981685 2.011587e+07 4.188822e+06
FOOD_AND_DRINK 4.172288 57469.372727 2.016998e+07 1.924898e+06
COMICS 4.181848 41822.696429 1.433960e+07 8.032348e+05
FINANCE 4.135910 38533.256098 1.791964e+07 1.387692e+06
LIFESTYLE 4.104136 33672.140000 1.521014e+07 1.436127e+06
HOUSE_AND_HOME 4.156771 26079.013514 1.632407e+07 1.313682e+06
BUSINESS 4.134144 24179.198529 1.438947e+07 1.708216e+06
ART_AND_DESIGN 4.330742 23230.114754 1.291318e+07 1.874133e+06
DATING 4.025574 21951.127273 1.603256e+07 8.540288e+05
Paid FAMILY 4.269186 19850.120219 2.658246e+07 1.128405e+05
GAME 4.359153 19181.109756 3.305742e+07 2.560971e+05
WEATHER 4.348970 17055.125000 1.411302e+07 1.015000e+05
... ... ... ... ...
EDUCATION 4.750000 8661.250000 3.875000e+07 1.505000e+05
Free BEAUTY 4.260553 7476.226415 1.428892e+07 5.131519e+05
Paid SPORTS 4.249313 6276.458333 1.977301e+07 5.182562e+04
PRODUCTIVITY 4.200628 6132.892857 7.614763e+06 5.043054e+04
PHOTOGRAPHY 4.050896 6064.789474 9.538167e+06 9.888105e+04
ENTERTAINMENT 4.600000 5004.500000 3.557604e+07 1.000000e+05
PARENTING 3.350000 4183.000000 1.322604e+07 2.505000e+04
Free MEDICAL 4.159613 3727.451923 1.949135e+07 1.206165e+05
Paid PERSONALIZATION 4.397137 3619.172840 1.692605e+07 4.023202e+04
VIDEO_PLAYERS 4.100000 3341.750000 1.411407e+07 1.775000e+04
COMMUNICATION 4.087362 3119.037037 6.408087e+06 5.037222e+04
HEALTH_AND_FITNESS 4.337802 3052.866667 2.135042e+07 3.160733e+04
Free EVENTS 4.365899 2555.841270 1.454442e+07 2.535422e+05
Paid LIFESTYLE 4.246935 2495.894737 1.420927e+07 6.205842e+04
TOOLS 4.174056 2204.320513 5.374062e+06 2.214668e+04
BUSINESS 4.106273 2094.333333 1.182101e+07 1.773125e+04
FOOD_AND_DRINK 4.350000 1698.500000 1.022604e+07 3.000000e+04
TRAVEL_AND_LOCAL 4.130586 1506.083333 3.795035e+07 1.525500e+04
MAPS_AND_NAVIGATION 3.860000 1437.600000 2.095042e+07 2.422000e+04
AUTO_AND_VEHICLES 4.327838 1387.666667 1.705070e+07 1.671667e+04
FINANCE 3.915708 1364.588235 8.848529e+06 1.091776e+04
ART_AND_DESIGN 4.733333 722.000000 5.200000e+06 5.333333e+03
DATING 3.812545 268.000000 1.042835e+07 1.891667e+03
SHOPPING 4.500000 242.000000 3.400000e+06 5.050000e+03
MEDICAL 4.224520 241.036145 1.771691e+07 6.757024e+03
NEWS_AND_MAGAZINES 4.800000 100.500000 1.490000e+07 2.750000e+03
SOCIAL 3.863919 80.666667 6.533333e+06 2.000000e+03
BOOKS_AND_REFERENCE 4.215541 64.142857 1.258550e+07 8.327143e+02
LIBRARIES_AND_DEMO 4.191757 4.000000 4.700000e+06 1.000000e+02
EVENTS 4.191757 0.000000 6.700000e+06 1.000000e+00

63 rows × 4 columns


# Review-to-install ratio
# Paid apps have a higher ratio of reviews to installs
g = df.groupby(['Type', 'Category']).mean(numeric_only=True)
(g['Reviews'] / g['Installs']).sort_values(ascending=False)
Type  Category           
Paid  VIDEO_PLAYERS          0.188268
      FAMILY                 0.175913
      WEATHER                0.168031
      PARENTING              0.166986
      DATING                 0.141674
      ART_AND_DESIGN         0.135375
      FINANCE                0.124988
      PRODUCTIVITY           0.121611
      SPORTS                 0.121107
      BUSINESS               0.118115
      TOOLS                  0.099533
      TRAVEL_AND_LOCAL       0.098727
      HEALTH_AND_FITNESS     0.096587
      PERSONALIZATION        0.089958
      AUTO_AND_VEHICLES      0.083011
      BOOKS_AND_REFERENCE    0.077029
      GAME                   0.074898
      COMMUNICATION          0.061920
      PHOTOGRAPHY            0.061334
      MAPS_AND_NAVIGATION    0.059356
      EDUCATION              0.057550
      FOOD_AND_DRINK         0.056617
Free  COMICS                 0.052068
Paid  ENTERTAINMENT          0.050045
      SHOPPING               0.047921
Free  GAME                   0.044792
      SOCIAL                 0.041533
Paid  SOCIAL                 0.040333
      LIFESTYLE              0.040218
      LIBRARIES_AND_DEMO     0.040000
                               ...   
Free  MAPS_AND_NAVIGATION    0.035221
      PERSONALIZATION        0.034821
      WEATHER                0.033747
      SPORTS                 0.032138
      SHOPPING               0.031815
      FAMILY                 0.031809
      MEDICAL                0.030903
      PARENTING              0.030185
      FOOD_AND_DRINK         0.029856
      TOOLS                  0.028648
      FINANCE                0.027768
      COMMUNICATION          0.025888
      DATING                 0.025703
      LIFESTYLE              0.023446
      PHOTOGRAPHY            0.022645
      AUTO_AND_VEHICLES      0.021844
      HOUSE_AND_HOME         0.019852
      HEALTH_AND_FITNESS     0.018640
      VIDEO_PLAYERS          0.017182
      LIBRARIES_AND_DEMO     0.017111
      ENTERTAINMENT          0.016443
      BEAUTY                 0.014569
      BUSINESS               0.014155
      ART_AND_DESIGN         0.012395
      EVENTS                 0.010081
      BOOKS_AND_REFERENCE    0.010036
      NEWS_AND_MAGAZINES     0.009763
      PRODUCTIVITY           0.009569
      TRAVEL_AND_LOCAL       0.009259
Paid  EVENTS                 0.000000
Length: 63, dtype: float64
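
Dividing mean reviews by mean installs within a group gives the same number as dividing total reviews by total installs, because the group size cancels out. That makes it easy to collapse the 63-row table into a single Free versus Paid comparison. A minimal sketch on invented numbers:

```python
import pandas as pd

# Hypothetical apps: paid apps have few installs but proportionally more reviews
toy = pd.DataFrame({
    "Type": ["Free", "Free", "Paid", "Paid"],
    "Reviews": [100, 300, 30, 50],
    "Installs": [10_000, 10_000, 500, 500],
})

totals = toy.groupby("Type")[["Reviews", "Installs"]].sum()
ratio = totals["Reviews"] / totals["Installs"]   # reviews per install, by Type

print(ratio)
```

Groups with very few apps, such as Paid EVENTS in the table above, produce unstable ratios, so the per-Type figure is usually the more robust summary.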
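
# Correlation:
# Reviews and Installs are fairly strongly correlated (0.63); every other pair is below 0.1 and can be treated as uncorrelated (as a rule of thumb, above 0.5 counts as correlated and above 0.3 as weakly correlated)
df.corr(numeric_only=True)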
Rating Reviews Size Installs
Rating 1.000000 0.054337 0.052751 0.039245
Reviews 0.054337 1.000000 0.080578 0.625164
Size 0.052751 0.080578 1.000000 0.050675
Installs 0.039245 0.625164 0.050675 1.000000
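
`corr()` computes the Pearson coefficient by default, which measures linear association and is sensitive to a few very large values. With data as skewed as Reviews and Installs, a rank-based coefficient such as Spearman can be a useful cross-check. The sketch below uses invented values in which installs always rise with reviews, but not linearly.

```python
import pandas as pd

# Hypothetical data: installs grow tenfold each time reviews grow by one
toy = pd.DataFrame({
    "Reviews":  [1, 2, 3, 4, 5],
    "Installs": [1, 10, 100, 1_000, 10_000],
})

pearson = toy.corr(method="pearson").loc["Reviews", "Installs"]    # linear association
spearman = toy.corr(method="spearman").loc["Reviews", "Installs"]  # association between ranks

print(round(pearson, 3), round(spearman, 3))
```

Here the ranks agree perfectly while the Pearson value is clearly lower, so a modest Pearson coefficient on skewed columns does not by itself rule out a strong monotonic relationship.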

Author: JackFeng
Link: https://minesql.github.io/posts/340b436c.html
Copyright: Unless otherwise stated, all posts on this blog are licensed under CC BY-NC-SA 4.0. Please credit SQL社区 when reposting.