我先評(píng)價(jià)一下這門課,總體來說箫锤,這門課難度的確不高,會(huì)一些簡(jiǎn)單的字符串處理和哈希表知識(shí)就可以上手了雨女,實(shí)現(xiàn)的主要語(yǔ)言是python谚攒,也是比較流行的大數(shù)據(jù)處理語(yǔ)言。閑話不多說氛堕,開始入門馏臭。
進(jìn)行數(shù)據(jù)處理編程的時(shí)候,我最大的感受就是一定要做好原始數(shù)據(jù)的預(yù)處理讼稚。一是可以減少之后面對(duì)具體問題的代碼量括儒,同時(shí)可以將數(shù)據(jù)有效化,篩去壞數(shù)據(jù)锐想。因此我們要先做幾件事情帮寻。
首先讀取csv數(shù)據(jù),需要引入unicodecsv的庫(kù)赠摇,然后我們打開csv文件固逗,rb表示文件可以被讀取,with open的好處是可以自動(dòng)關(guān)閉文件蝉稳。unicodecsv.DictReader()函數(shù)使每一行都是一個(gè)字典抒蚜,鍵是列名,值是相應(yīng)列的值耘戚;并且函數(shù)的返回值為一個(gè)迭代器嗡髓。迭代器轉(zhuǎn)化為list進(jìn)行儲(chǔ)存從而方便處理,具體代碼如下:
import unicodecsv
def read_csv(filename):
with open(filename, 'rb') as f:
reader = unicodecsv.DictReader(f)
return list(reader)
enrollments = read_csv("enrollments.csv")
daily_engagement = read_csv("daily_engagement.csv")
project_submissions = read_csv("project_submissions.csv")
print enrollments[0]
print daily_engagement[0]
print project_submissions[0]
此時(shí)我們已經(jīng)將數(shù)據(jù)存入列表并輸出每一個(gè)列表第一個(gè)值收津,如下:
{u'status': u'canceled', u'is_udacity': u'True', u'is_canceled': u'True', u'join_date': u'2014-11-10', u'account_key': u'448', u'cancel_date': u'2015-01-14', u'days_to_cancel': u'65'}
{u'lessons_completed': u'0.0', u'num_courses_visited': u'1.0', u'total_minutes_visited': u'11.6793745', u'projects_completed': u'0.0', u'acct': u'0', u'utc_date': u'2015-01-09'}
{u'lesson_key': u'3176718735', u'processing_state': u'EVALUATED', u'account_key': u'256', u'assigned_rating': u'UNGRADED', u'completion_date': u'2015-01-16', u'creation_date': u'2015-01-14'}
不難發(fā)現(xiàn)饿这,無論鍵還是值都是字符串浊伙。因此我們要對(duì)數(shù)據(jù)進(jìn)行預(yù)處理。
對(duì)日期對(duì)象進(jìn)行預(yù)處理函數(shù)如下:
from datetime import datetime as dt
def parse_date(date):
if date =='':
return None
else:
return dt.strptime(data,"%Y-%m-%d")
對(duì)整型對(duì)象進(jìn)行預(yù)處理函數(shù)如下(簡(jiǎn)化版):
def parse_maybe_int(i):
return None if string == '' else int(i)
我們發(fā)現(xiàn)在不同的csv文件中长捧,account_key和acct都表示賬號(hào)ID嚣鄙,為了數(shù)據(jù)的統(tǒng)一從而方便處理,我們將用acct表示賬號(hào)ID的值轉(zhuǎn)換為accout_key表示串结。代碼如下:
def acct_to_account_key():
for engagement_record in daily_engagement:
engagement_record['account_key'] = engagement_record['acct']
del[engagement_record['acct']]
預(yù)處理結(jié)束后哑子,我們就開始真正的數(shù)據(jù)處理了,我們先做第一件事情肌割,就是統(tǒng)計(jì)一共注冊(cè)過多少人(也就是有多少'account_key')卧蜓,并驗(yàn)證注冊(cè)和參與課程活動(dòng)的人數(shù)是否相符合。
def get_unique_students(data):
unique_students = set()
for data_point in data:
unique_students.add(data_point['account_key'])
return unique_students
acct_to_account_key()
unique_students_enrollments = get_unique_students(enrollments)
unique_students_engagement = get_unique_students(daily_engagement)
print len(unique_students_enrollments)
print len(unique_students_engagement)
我們得到的輸出如下:
1302
1237
那么問題來了把敞,為什么兩個(gè)數(shù)值不相等呢弥奸?代表我們的數(shù)據(jù)是存在缺陷的,所以我們要審核這兩份數(shù)據(jù)奋早。
我們審核的方法是某一個(gè)學(xué)員注冊(cè)了賬號(hào)但在參與數(shù)據(jù)中并沒有體現(xiàn)盛霎,具體代碼如下:
def find_problem():
for enrollment in enrollments:
if enrollment["account_key"] not in unique_students_engagement:
print enrollment
break
find_problem()
我們得到如下數(shù)據(jù):
{u'status': u'canceled', u'is_udacity': u'False', u'is_canceled': u'True', u'join_date': u'2014-11-12', u'account_key': u'1219', u'cancel_date': u'2014-11-12', u'days_to_cancel': u'0'}
我們會(huì)發(fā)現(xiàn)這位學(xué)員注冊(cè)日期與取消日期相等。經(jīng)核實(shí)耽装,當(dāng)學(xué)員當(dāng)日注銷愤炸,則在參與數(shù)據(jù)中不體現(xiàn)。那么除了這部分?jǐn)?shù)據(jù)無法匹配掉奄,是否還有其他數(shù)據(jù)有問題呢摇幻?我們繼續(xù)審查。
def find_problem2():
num_problem_students = 0
for enrollment in enrollments:
if (enrollment['account_key'] not in unique_students_engagement \
and enrollment['join_date'] != enrollment['cancel_date']):
print enrollment
num_problem_students += 1
find_problem2()
我們得到如下的輸出:
{u'status': u'canceled', u'is_udacity': u'True', u'is_canceled': u'True', u'join_date': u'2015-01-10', u'account_key': u'1304', u'cancel_date': u'2015-03-10', u'days_to_cancel': u'59'}
{u'status': u'canceled', u'is_udacity': u'True', u'is_canceled': u'True', u'join_date': u'2015-03-10', u'account_key': u'1304', u'cancel_date': u'2015-06-17', u'days_to_cancel': u'99'}
{u'status': u'current', u'is_udacity': u'True', u'is_canceled': u'False', u'join_date': u'2015-02-25', u'account_key': u'1101', u'cancel_date': u'', u'days_to_cancel': u''}
這三個(gè)數(shù)據(jù)的問題在于它們同屬于測(cè)試賬號(hào)挥萌,也就是u'is_udacity': u'True',這些測(cè)試賬號(hào)并不一定會(huì)在參與數(shù)據(jù)當(dāng)中枉侧。此時(shí)引瀑,我們就已經(jīng)找到了注冊(cè)數(shù)據(jù)與參與數(shù)據(jù)不匹配的兩個(gè)原因。
由于測(cè)試賬號(hào)對(duì)于數(shù)據(jù)分析毫無意義榨馁,下面我們需要將這些測(cè)試賬號(hào)從注冊(cè)數(shù)據(jù)的列表中刪除憨栽。
首先,我們創(chuàng)建測(cè)試賬號(hào)的集合翼虫,函數(shù)如下:
def test_set():
data_test_set = set()
for enrollment in enrollments:
if enrollment["is_udacity"] == "True" :
data_test_set.add(enrollment['account_key'])
return data_test_set
enrollments_test_set = test_set()
然后屑柔,我們將賬號(hào)數(shù)據(jù)列表和參與數(shù)據(jù)列表中的測(cè)試賬號(hào)相關(guān)信息刪去,函數(shù)如下:
def remove_udacity_accounts(data):
non_udacity_accounts = []
for data_point in data:
if data_point["account_key"] not in enrollments_test_set:
non_udacity_accounts.append(data_point)
return non_udacity_accounts
enrollments_remove_udacity_accounts = remove_udacity_accounts(enrollments)
engagement_remove_udacity_accounts = remove_udacity_accounts(daily_engagement)
print len(enrollments_remove_udacity_accounts)
print len(engagement_remove_udacity_accounts)
輸出結(jié)果如下:
1622
135656
此時(shí)珍剑,我們已經(jīng)將測(cè)試賬號(hào)從我們需要處理的數(shù)據(jù)列表中刪除掸宛。
現(xiàn)在我們開始做第二件事情:統(tǒng)計(jì)付費(fèi)學(xué)員在其注冊(cè)開始后的第一周時(shí)間內(nèi)的參與數(shù)據(jù)。
首先招拙,我們要?jiǎng)?chuàng)建一個(gè)函數(shù)唧瘾,滿足如下要求:
(1)返回一個(gè)哈希表措译,哈希表內(nèi)的學(xué)生賬號(hào)ID來自于enrollments_remove_udacity_accounts(也就是將測(cè)試賬號(hào)刪除的數(shù)據(jù)列表)
(2)學(xué)生要滿足還沒有注銷賬號(hào),或者超過7天才注銷賬號(hào)饰序。
(3)哈希表的鍵為學(xué)生ID领虹,值為學(xué)生最近的一次注冊(cè)日期。
函數(shù)如下:
def paid_student_dict():
paid_student = {}
for student in enrollments_remove_udacity_accounts:
if student["is_canceled"] != "True" or parse_maybe_int(student["days_to_cancel"]) > 7:
account_key = student["account_key"]
join_date = student["join_date"]
if account_key not in paid_student or join_date > paid_student[account_key]:
paid_student[account_key] = join_date
return paid_student
paid_student_in_enrollments = paid_student_dict()
print len(paid_student_in_enrollments)
輸出結(jié)果如下:
995
我們現(xiàn)在所得到的哈希表也就是所有付費(fèi)學(xué)生的ID以及其相應(yīng)的注冊(cè)日期求豫。所以根據(jù)這個(gè)哈希表塌衰,我們處理以下三個(gè)數(shù)據(jù),刪去那些不符合條件的學(xué)生記錄:
enrollments_remove_udacity_accounts
engagement_remove_udacity_accounts
project_remove_udacity_accounts
函數(shù)如下:
def remove_free_trial_cancels(data):
new_data = []
for data_point in data:
if data_point['account_key'] in paid_students:
new_data.append(data_point)
return new_data
paid_enrollments = remove_free_trial_cancels(enrollments_remove_udacity_accounts)
paid_engagement = remove_free_trial_cancels(engagement_remove_udacity_accounts)
paid_project = remove_free_trial_cancels(project_remove_udacity_accounts)
print len(paid_enrollments)
print len(paid_engagement)
print len(paid_project)
輸出結(jié)果如下:
1293
134549
3618
這樣蝠嘉,我們最新的三個(gè)數(shù)據(jù)內(nèi)的學(xué)生均為付費(fèi)學(xué)生最疆。之后我們要做的就是獲取這些學(xué)生第一周的參與數(shù)據(jù)。
def within_one_week(join_date, engagement_date):
time_delta = parse_date(engagement_date) - parse_date(join_date)
return time_delta.days < 7
def first_paid_week():
paid_engagement_in_first_week = []
for engagement_record in paid_engagement:
account_key = engagement_record['account_key']
join_date = paid_students[account_key]
engagement_record_date = engagement_record['utc_date']
if within_one_week(join_date, engagement_record_date):
paid_engagement_in_first_week.append(engagement_record)
return paid_engagement_in_first_week
paid_engagement_in_first_week = first_paid_week()
print len(paid_engagement_in_first_week)
輸出結(jié)果如下:
21508
第三件事情就是得到付費(fèi)學(xué)員第一周上課的平均時(shí)間是晨。所以我們首先將付費(fèi)學(xué)員與付費(fèi)學(xué)員的參與課程記錄建立相對(duì)應(yīng)的哈希表肚菠。
def division_by_account():
engagement_by_account = defaultdict(list)
for engagement_record in paid_engagement_in_first_week:
account_key = engagement_record['account_key']
engagement_by_account[account_key].append(engagement_record)
return engagement_by_account
engagement_by_account = division_by_account()
然后統(tǒng)計(jì)每一名學(xué)員和與之對(duì)應(yīng)的參與課堂時(shí)間,從而得到學(xué)員們參與課程的平均時(shí)間罩缴,這里用到了numpy函數(shù)庫(kù)蚊逢。
def minutes_by_account():
total_minutes_by_account = {}
for account_key, engagement_for_students in engagement_by_account.items():
total_minutes = 0
for engagement_record in engagement_for_students:
total_minutes += float(engagement_record['total_minutes_visited'])
total_minutes_by_account[account_key] = total_minutes
return total_minutes_by_account
total_minutes_by_account = minutes_by_account()
print 'Mean:', np.mean(total_minutes_by_account.values())
print 'Std:', np.std(total_minutes_by_account.values())
print 'Min:', np.min(total_minutes_by_account.values())
print 'Max:', np.max(total_minutes_by_account.values())
輸出結(jié)果:
Mean: 647.590173826
Std: 1129.27121042
Min: 0.0
Max: 10568.1008673
發(fā)現(xiàn)Max值問題很大,因?yàn)闀r(shí)間的單位是min箫章。所以我們要找到max對(duì)應(yīng)的這位學(xué)員的參與課堂記錄:
def find_max_minutes_student():
student_with_max_minutes = None
max_minutes = 0
for student, total_minutes in total_minutes_by_account.items():
if total_minutes > max_minutes:
max_minutes = total_minutes
student_with_max_minutes = student
return student_with_max_minutes
def max_minutes_student_enagement():
for engagement_record in paid_engagement_in_first_week:
if engagement_record["account_key"] == student_with_max_minutes:
print engagement_record
student_with_max_minutes = find_max_minutes_student()
#print total_minutes_by_account[student_with_max_minutes]
max_minutes_student_enagement()
輸出結(jié)果(由于數(shù)據(jù)量大我只取首位兩個(gè)):
{u'lessons_completed': u'0.0', u'num_courses_visited': u'1.0', u'total_minutes_visited': u'50.9938951667', u'projects_completed': u'0.0', 'account_key': u'108', u'utc_date': u'2015-01-07'}
{u'lessons_completed': u'0.0', u'num_courses_visited': u'0.0', u'total_minutes_visited': u'0.0', u'projects_completed': u'0.0', 'account_key': u'108', u'utc_date': u'2015-04-26'}
我們會(huì)發(fā)現(xiàn)時(shí)間跨度遠(yuǎn)超1周烙荷,我首先想到是不是注冊(cè)時(shí)間有問題。因此我調(diào)用如下函數(shù):
print paid_students[student_with_max_minutes]
得到的結(jié)果是:
2015-07-09
所以問題出在了我們within_one_week()函數(shù)不夠完整檬寂,我們只記錄了學(xué)員最后一次注冊(cè)時(shí)間终抽,那么原有函數(shù)將7天及所有7天以前的記錄(學(xué)員之前注冊(cè)過的信息)都考慮進(jìn)去了,這是明顯不對(duì)的桶至,修正后的函數(shù)為:
def within_one_week(join_date, engagement_date):
time_delta = parse_date(engagement_date) - parse_date(join_date)
return time_delta.days < 7 and time_delta.days >=0
print 'Mean:', np.mean(total_minutes_by_account.values())
print 'Max:', np.max(total_minutes_by_account.values())
得到如下結(jié)果:
Mean: 306.708326753
Max: 3564.7332645
為了能夠處理類似數(shù)據(jù)昼伴,我們將上面函數(shù)進(jìn)行改寫,新的函數(shù)如下:
def group_data(data,key_name):
grouped_data = defaultdict(list)
for data_point in data:
key = data_point[key_name]
grouped_data[key].append(data_point)
return grouped_data
def sum_grouped_items(grouped_data,field_name):
summed_data = {}
for key, data_points in grouped_data.items():
total = 0
for data_point in data_points:
total += data_point[field_name]
summed_data[key] = total
return summed_data
def describe_data(data):
print 'Mean:', np.mean(data)
print 'Std:', np.std(data)
print 'Min:', np.min(data)
print 'Max:', np.max(data)
這樣我們用同一函數(shù)既能夠獲得關(guān)于學(xué)員上課總分鐘數(shù)的統(tǒng)計(jì)也能獲得學(xué)員上課數(shù)量的統(tǒng)計(jì)信息镣屹。
engagement_by_account = group_data(paid_engagement_in_first_week,'account_key')
for engagement_record in paid_engagement_in_first_week:
engagement_record['total_minutes_visited'] = float(engagement_record['total_minutes_visited'])
total_minutes_by_account = sum_grouped_items(engagement_by_account,'total_minutes_visited')
#describe_data(total_minutes_by_account.values())
for engagement_record in paid_engagement_in_first_week:
engagement_record['lessons_completed'] = float(engagement_record['lessons_completed'])
lessons_completed_by_account = sum_grouped_items(engagement_by_account,'lessons_completed')
describe_data(lessons_completed_by_account.values())
輸出結(jié)果結(jié)果為:
Mean: 1.63618090452
Std: 3.00256129983
Min: 0.0
Max: 36.0
得到付費(fèi)學(xué)員第一周上課的平均時(shí)間和課程數(shù)后圃郊,我們可以利用類似的方法求得每周付費(fèi)學(xué)員的上課天數(shù)。但是我們要在原數(shù)據(jù)中加入了一個(gè)參數(shù)'has_visited'女蜈,也就是統(tǒng)計(jì)某天真正的參與課堂而不是只在網(wǎng)站閑逛持舆。
def has_visited_update():
for engagement_record in paid_engagement:
if float(engagement_record['num_courses_visited']) > 0:
engagement_record['has_visited'] = 1
else:
engagement_record['has_visited'] = 0
has_visited_update()
days_visited_by_account = sum_grouped_items(engagement_by_account,'has_visited')
describe_data(days_visited_by_account.values())
下面開始劃分及格學(xué)員,所謂的及格學(xué)員就是某課程pass的伪窖,反之是不及格學(xué)員逸寓,我們統(tǒng)計(jì)的課號(hào)為:['746169184', '3176718735']。我們會(huì)將及格學(xué)員的學(xué)號(hào)放在一個(gè)集合中覆山。
def find_pass_subway():
subway_project_lesson_keys = ['746169184', '3176718735']
pass_subway_project = set()
for submission in paid_project:
project = submission['lesson_key']
rating = submission['assigned_rating']
if ((project in subway_project_lesson_keys) and(rating == 'PASSED' or rating == 'DISTINCTION')):
pass_subway_project.add(submission['account_key'])
return pass_subway_project
print len(find_pass_subway())
輸出結(jié)果為:
647
我們將及格與不及格學(xué)員的第一周參與課堂記錄分別存在兩個(gè)列表中竹伸。
def devision_engagement():
passing_engagement = []
non_passing_engagement = []
for engagement_record in paid_engagement_in_first_week:
if engagement_record['account_key'] in pass_subway_project:
passing_engagement.append(engagement_record)
else:
non_passing_engagement.append(engagement_record)
return passing_engagement,non_passing_engagement
print len(passing_engagement)
print len(non_passing_engagement)
輸出結(jié)果為:
4527
2392
然后我們比較兩組學(xué)員的情況。比較的內(nèi)容分別為(第一周內(nèi)的):訪問分鐘數(shù)汹买,完成課程數(shù)和參與課堂天數(shù)佩伤。
passing_engagement_by_account = group_data(passing_engagement,'account_key')
non_passing_engagement_by_account = group_data(non_passing_engagement,'account_key')
print 'non-passing students:'
non_passing_minutes = sum_grouped_items(non_passing_engagement_by_account,'total_minutes_visited')
describe_data(non_passing_minutes.values())
print 'passing students:'
passing_minutes = sum_grouped_items(passing_engagement_by_account,'total_minutes_visited')
describe_data(passing_minutes.values())
print 'non-passing students:'
non_passing_lessons = sum_grouped_items(non_passing_engagement_by_account,'lessons_completed')
describe_data(non_passing_lessons.values())
print 'passing students:'
passing_lessons = sum_grouped_items(passing_engagement_by_account,'lessons_completed')
describe_data(passing_lessons.values())
print 'non-passing students:'
non_passing_visits = sum_grouped_items(non_passing_engagement_by_account,'has_visited')
describe_data(non_passing_visits.values())
print 'passing students:'
passing_visits = sum_grouped_items(passing_engagement_by_account,'has_visited')
describe_data(passing_visits.values())
得到如下結(jié)果:
non-passing students:
Mean: 143.326474267
Std: 269.538619011
Min: 0.0
Max: 1768.52274933
passing students:
Mean: 394.586046484
Std: 448.499519327
Min: 0.0
Max: 3564.7332645
non-passing students:
Mean: 0.862068965517
Std: 2.54915994183
Min: 0.0
Max: 27.0
passing students:
Mean: 2.05255023184
Std: 3.14222705558
Min: 0.0
Max: 36.0
non-passing students:
Mean: 1.90517241379
Std: 1.90573144136
Min: 0
Max: 7
passing students:
Mean: 3.38485316847
Std: 2.25882147092
Min: 0
Max: 7
完