上一次的內(nèi)容中奕删,我們已經(jīng)把確定目標(biāo)和數(shù)據(jù)獲取的部分完成了谍珊,接下來我們來講一下如何進(jìn)行數(shù)據(jù)處理和數(shù)據(jù)展示吧。
第三部分:數(shù)據(jù)處理
數(shù)據(jù)處理是數(shù)據(jù)分析的核心部分矾克,我也給它打了50%的權(quán)重阐滩。之前的數(shù)據(jù)獲取部分羡微,我們拿到了1615條原始的課程數(shù)據(jù)割粮。但是简烘,這個(gè)數(shù)據(jù)中可能有無法使用和重復(fù)等情況缰泡,而且我們還要根據(jù)之后展示數(shù)據(jù)的要求進(jìn)行數(shù)據(jù)提取跌帐∏矗可見打掘,我們的數(shù)據(jù)處理又可以分成兩步:數(shù)據(jù)清洗和數(shù)據(jù)提取华畏。
一、數(shù)據(jù)清洗
我們還是使用python來對(duì)數(shù)據(jù)進(jìn)行清洗尊蚁。在之前的數(shù)據(jù)獲取中我們拿到了三張表:總課程表亡笑、課程詳情表和教師表。
我們先來看一下總課程表:
可以看出横朋,我們之前為了爬取課程詳情在課程表中增加了lessonurl列仑乌,在清洗時(shí)可以去掉。同時(shí)根據(jù)年級(jí)(grade)我們還可以增加一列學(xué)段(小學(xué)琴锭、初中晰甚、高中),可以為之后的展示提供更多的角度决帖。
代碼如下
#導(dǎo)入模塊
import numpy as np
import pandas as pd
#打印出前5行厕九,以確保數(shù)據(jù)運(yùn)行正常
lessonDf.head()
#選擇子集
lessonDf = lessonDf[['grade','channelid','lessonid','lessonname']]
#通過條件判斷篩選
#構(gòu)建查詢條件
querySer = lessonDf['lessonid'] != 'groups/'
#根據(jù)查詢條件篩選
lessonDf = lessonDf.loc[querySer, : ]
lessonDf.shape
#通過條件判斷篩選
#構(gòu)建查詢條件
querySer = lessonDf['lessonid'] != 'groups/'
#根據(jù)查詢條件篩選
lessonDf = lessonDf.loc[querySer, : ]
lessonDf.shape
#給channelid賦值
for i in range(0, 1389):
channel = lessonDf.iloc[i,1]
if channel == 1:
lessonDf.iloc[i,1]='語文'
elif channel == 2:
lessonDf.iloc[i,1]='數(shù)學(xué)'
elif channel == 3:
lessonDf.iloc[i,1]='英語'
elif channel == 201:
lessonDf.iloc[i,1]='編程'
elif channel == 4:
lessonDf.iloc[i,1]='物理'
elif channel == 5:
lessonDf.iloc[i,1]='化學(xué)'
elif channel == 6:
lessonDf.iloc[i,1]='生物'
elif channel == 7:
lessonDf.iloc[i,1]='歷史'
elif channel == 8:
lessonDf.iloc[i,1]='地理'
elif channel == 9:
lessonDf.iloc[i,1]='政治'
elif channel == 14:
lessonDf.iloc[i,1]='道德與法制'
else :
pass
#創(chuàng)建studyphase列
lessonDf['studyphase'] = lessonDf['grade']
#給studyphase賦值
for i in range(0, 1389):
phase = lessonDf.iloc[i,4]
if 0< phase<=6:
lessonDf.iloc[i,4]='小學(xué)'
elif 6< phase <=9:
lessonDf.iloc[i,4]='初中'
elif phase > 9:
lessonDf.iloc[i,4]='高中'
else :
pass
#去重
lessonDf = lessonDf.drop_duplicates()
#導(dǎo)出數(shù)據(jù)
lessonDf.to_excel('lesson.xls')
清洗結(jié)果如下:
同樣的,我們也對(duì)課程詳細(xì)表和教師表進(jìn)行去除臟數(shù)據(jù)地回、去重等操作扁远,導(dǎo)出teacher和lessondetail兩張表:
二、數(shù)據(jù)提取
處理好數(shù)據(jù)之后刻像,下一步就是根據(jù)實(shí)際的數(shù)據(jù)展示需求來提取數(shù)據(jù)了穿香。參考下面的腦圖我們可以更好的理解應(yīng)當(dāng)提取哪些數(shù)據(jù):
這幅腦圖和第一部分爬取數(shù)據(jù)時(shí)的腦圖相比更加清晰,每個(gè)子主題都可以對(duì)應(yīng)一個(gè)提數(shù)需求绎速。這里我將上面數(shù)據(jù)清洗得到的三張表導(dǎo)入Sequel Pro,使用SQL語句來提取所需的數(shù)據(jù)焙蚓。
1纹冤、學(xué)科、學(xué)段购公、年級(jí)下課程數(shù)排名
(1)不同學(xué)科的課程數(shù)量
select channelid '學(xué)科類型', count(distinct lessonid) '課程數(shù)量' from lesson
group by channelid;
(2)不同學(xué)段的課程數(shù)量
select studyphase '學(xué)段', count(distinct lessonid) '課程數(shù)量' from lesson
group by studyphase;
(3)不同年級(jí)的課程數(shù)量
select grade '年級(jí)', count(distinct lessonid) '課程數(shù)量' from lesson
group by grade;
2萌京、學(xué)科、學(xué)段宏浩、年級(jí)下課程報(bào)名人數(shù)排名
(1)不同年級(jí)知残、學(xué)科下報(bào)名人數(shù)
select grade '年級(jí)', channelid '學(xué)科', sum(signup_number) '報(bào)名人數(shù)' from lessondetail d
left join
(select distinct grade, channelid, lessonid
from lesson) t
on d.lesson_id = t.lessonid
group by grade, channelid
(2)不同學(xué)段下報(bào)名人數(shù)
select studyphase '學(xué)段', sum(signup_number) '報(bào)名人數(shù)' from lessondetail d
left join
(select distinct studyphase, lessonid
from lesson) t
on d.lesson_id = t.lessonid
group by studyphase
(3)不同年級(jí)下課程報(bào)名人數(shù)
select grade '年級(jí)', sum(signup_number) '報(bào)名人數(shù)' from lessondetail d
left join
(select distinct grade, lessonid from lesson) t
on d.lesson_id = t.lessonid
group by grade
3、學(xué)科比庄、學(xué)段求妹、年級(jí)下老師數(shù)量排名
老師總數(shù)量
select count(distinct teacher_id) from teacher;
(1)不同學(xué)科的老師數(shù)量
select channelid '學(xué)科', count(teacher_id) '教師數(shù)'
from (select distinct channelid, teacher_id
from lesson l
join teacher t
on l.lessonid = t.lesson_id) t
group by channelid
(2)不同學(xué)段的老師數(shù)量
select studyphase '學(xué)段', count(teacher_id) '教師數(shù)'
from (select distinct studyphase, teacher_id
from lesson l
join teacher t
on l.lessonid = t.lesson_id) t
group by studyphase
(3)不同年級(jí)的老師數(shù)量
select grade '年級(jí)' , count(teacher_id) '教師數(shù)'
from (select distinct grade, teacher_id
from lesson l
join teacher t
on l.lessonid = t.lesson_id) t
group by grade
(4)課程配師的分布情況
select teacher_num '課程配備教師個(gè)數(shù)', count(t.lesson_id) '課程數(shù)量' from(
select lesson_id, count(1) as teacher_num from teacher
group by lesson_id) t
group by teacher_num
(5)老師上的課程數(shù)分布
select teacher_id, ct '上課數(shù)' from(
select teacher_id , count(1) ct from teacher
group by teacher_id) t
order by t.ct desc
4乏盐、學(xué)科、學(xué)段制恍、年級(jí)下營收情況排名
(1)不同學(xué)科的營收
select channelid '學(xué)科' , sum(profit) '營收' from
(#對(duì)學(xué)科下的課程id去重
select channelid, lesson_id, count(lesson_id), price*signup_number profit
from lessondetail d
join lesson l
on d.lesson_id = l.lessonid
group by channelid, lesson_id) t
group by channelid
(2)不同學(xué)段的營收
select studyphase '學(xué)段' , sum(profit) '營收' from
(#對(duì)學(xué)段下的課程id去重
select studyphase, lesson_id, count(lesson_id), price*signup_number profit
from lessondetail d
join lesson l
on d.lesson_id = l.lessonid
group by studyphase, lesson_id) t
group by studyphase
(3)不同年級(jí)+學(xué)科的營收
select grade '年級(jí)' ,channelid '學(xué)科', sum(profit) '營收' from
(#對(duì)年級(jí)下的課程id去重
select grade, lesson_id, channelid, count(lesson_id), price*signup_number profit
from lessondetail d
join lesson l
on d.lesson_id = l.lessonid
group by grade, lesson_id, channelid) t
group by grade, channelid
(4)老師對(duì)營收的貢獻(xiàn)排名
select teacher_id '教師id' , studyphase '學(xué)段' , channelid '學(xué)科' , sum(signup_num) '學(xué)生數(shù)',
count(1) '代課數(shù)' , sum(profit) '營收' from
(select lesson_id, signup_num, profit from
(#對(duì)多老師的課程營收分配成相應(yīng)價(jià)格
select l.lesson_id, count(l.lesson_id), sum(signup_number)/count(l.lesson_id) signup_num,
price*signup_number/count(l.lesson_id) profit
from lessondetail l
join teacher t
on l.lesson_id = t.lesson_id
group by l.lesson_id) t1) t2
join teacher t
join lesson l
on t.lesson_id = t2.lesson_id
and t2.lesson_id = l.lessonid
group by teacher_id, studyphase, channelid
(5)課程對(duì)營收的貢獻(xiàn)排名
select lesson_id, sum(price*signup_number) profit
from lessondetail
group by lesson_id
以上SQL語句基本上滿足了大部分的提數(shù)需求父能。此外,還有一張表可以用來計(jì)算相關(guān)性矩陣:
合并表
select d.lesson_id, price, signup_number,
studyphase, grade, channelid,
price*signup_number profit
from lessondetail d
join lesson l
on d.lesson_id = l.lessonid
第四部分净神、數(shù)據(jù)展示
經(jīng)過以上步驟的數(shù)據(jù)提取何吝,我們可以根據(jù)需求來展示數(shù)據(jù)了。一個(gè)比較經(jīng)典的圖是這樣的:
考慮到我們需要展示的數(shù)據(jù)主要目的為比較和構(gòu)成鹃唯,因此我選擇條形圖爱榕、柱狀體、餅圖等基本圖形來進(jìn)行展示坡慌。
由于篇幅有限黔酥,這里只展示一下相關(guān)性矩陣:
可見,課程的營收情況主要和報(bào)名人數(shù)以及年級(jí)成正相關(guān)八匠。其他具體數(shù)據(jù)展示我會(huì)通過一個(gè)具體的報(bào)告進(jìn)行呈現(xiàn)絮爷,敬請(qǐng)期待。
小結(jié)
本篇文章主要討論了對(duì)猿輔導(dǎo)網(wǎng)站獲取的數(shù)據(jù)進(jìn)行清洗和提數(shù)的過程梨树。在清洗數(shù)據(jù)時(shí)利用python簡(jiǎn)化流程坑夯,在提數(shù)過程中先使用Xmind確定提數(shù)需求,再使用SQL語句完成提數(shù)抡四。