Introduction to Data Science in Python: Study Notes

These are the author's study notes from the Coursera course Introduction to Data Science in Python. They are provided for reference only.

1. 50 Years of Data Science

    (1) Data Exploration and Preparation

    (2) Data Representation and Transformation

    (3) Computing with Data

    (4) Data Modeling

    (5) Data Visualization and Presentation

    (6) Science about Data Science

2. Functions

def add_numbers(x, y, z=None, flag=False):
    # Optionally report that the flag was set
    if flag:
        print('Flag is true!')
    # z is optional: add it only when the caller supplies it
    if z is None:
        return x + y
    else:
        return x + y + z

print(add_numbers(1, 2, flag=True))

Assign the function add_numbers to a variable a, then call it through a:

a = add_numbers

a(1, 2, flag=True)

3. Checking data types

type('This is a string')

-> str

type(None)

-> NoneType

4. Tuples

Tuples are an immutable data structure (cannot be altered).


x = (1, 'a', 2, 'b')



5. Lists

Lists are a mutable data structure.


x = [1, 'a', 2, 'b']
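
The difference between the two structures shows up as soon as an element is reassigned. The following is a minimal sketch using only built-in types; the variable names point, items and tuple_error are illustrative and not part of the course material.

```python
point = (1, 'a', 2, 'b')   # tuple: immutable
items = [1, 'a', 2, 'b']   # list: mutable

items[0] = 99              # in-place assignment works on a list

try:
    point[0] = 99          # item assignment is not supported on tuples
except TypeError as err:
    tuple_error = type(err).__name__

print(items)        # [99, 'a', 2, 'b']
print(tuple_error)  # TypeError
```

The tuple is left unchanged after the failed assignment, which is why tuples are a reasonable choice for values that should stay fixed.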



6. Append

Use append to add an object to the end of a list.

x.append(3.3)

print(x)

-> [1, 'a', 2, 'b', 3.3]

7. Loop through each item in the list

for item in x:
    print(item)

-> 1
   a
   2
   b
   3.3

8. Using the indexing operator to loop through each item in the list

i = 0
while i != len(x):
    print(x[i])
    i = i + 1






9. Basic list operations

(1) Use + to concatenate lists

[1, 2] + [3, 4]

-> [1, 2, 3, 4]

(2) Use * to repeat lists

[1] * 3

-> [1, 1, 1]

(3) Use the in operator to check if something is inside a list

1 in [1, 2, 3]

-> True


10. Basic string operations

(1) Use bracket notation to slice a string.

x = 'This is a string'







print(x[-1])  # the last character

-> g

print(x[-4:-2])  # start at the 4th character from the end and stop before the 2nd character from the end

-> ri

x[:3]  # a slice from the beginning of the string, stopping before index 3 (the first three characters)

-> 'Thi'

x[3:]  # a slice starting at index 3 (the 4th character) and running to the end of the string

-> 's is a string'

(2) Concatenating strings and testing for substrings

firstname = 'Christopher'

lastname = 'Brooks'

print(firstname + ' ' + lastname)




print('Chris' in firstname)


(3) Split returns a list of all the words in a string, or a list split on a specific character.

firstname = 'Christopher Arthur Hansen Brooks'.split(' ')[0]  # [0] selects the first word

lastname = 'Christopher Arthur Hansen Brooks'.split(' ')[-1]  # [-1] selects the last word





(4) Make sure you convert objects to strings before concatenating.

'Chris' + 2  # raises TypeError: a str and an int cannot be concatenated

'Chris' + str(2)

-> 'Chris2'


11. Dictionaries

(1) Dictionaries associate keys with values

x = {'Christopher Brooks': '', 'Bill Gates': ''}

x['Christopher Brooks']


x['Kevyn Collins-Thompson'] = None

x['Kevyn Collins-Thompson']


(2) Iterate over all of the keys:

for name in x:
    print(x[name])  # look up the value stored under each key

-> (the values for the first two keys, then)
   None

(3) Iterate over all of the values:

for email in x.values():
    print(email)

-> (the values for the first two keys, then)
   None

(4) Iterate over all of the items in the dictionary:

for name, email in x.items():
    print(name)

-> Christopher Brooks
   Bill Gates
   Kevyn Collins-Thompson
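
The three iteration styles can be compared side by side. This is a small sketch with hypothetical placeholder addresses (the example.org values are made up and stand in for whatever values the dictionary really holds); the names contacts, keys, values and pairs are illustrative.

```python
# Hypothetical placeholder values, not the addresses from the course
contacts = {
    'Christopher Brooks': 'brooks@example.org',
    'Bill Gates': 'gates@example.org',
}
contacts['Kevyn Collins-Thompson'] = None   # add a new key

keys = [name for name in contacts]          # iterating a dict yields its keys
values = list(contacts.values())            # values only
pairs = list(contacts.items())              # (key, value) tuples

print(keys)
print(values[-1])   # None
print(pairs[0])
```

Since Python 3.7, dictionaries keep insertion order, so the key added last is also visited last.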


(5) Unpack a sequence into different variables:

x = ('Christopher', 'Brooks', '')

fname, lname, email = x





(6) Make sure the number of values you are unpacking matches the number of variables being assigned.

x = ('Christopher', 'Brooks', '', 'Ann Arbor')

fname, lname, email = x  # raises ValueError: too many values to unpack (expected 3)
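
To see the mismatch without stopping the program, the error can be caught. The sketch below also shows starred unpacking, which is one way (not covered in the notes above) to absorb surplus values; the tuple contents and the names record, message and rest are hypothetical.

```python
record = ('Christopher', 'Brooks', 'placeholder-email', 'Ann Arbor')  # hypothetical values

try:
    fname, lname, email = record      # 4 values but only 3 names
except ValueError as err:
    message = str(err)

# Starred unpacking collects any surplus values into a list
fname, lname, *rest = record

print(message)  # too many values to unpack (expected 3)
print(rest)     # ['placeholder-email', 'Ann Arbor']
```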


12. More on Strings

(1) Simple examples

print('Chris' + 2)  # raises TypeError

print('Chris' + str(2))

-> Chris2

(2) Python has a built-in method for convenient string formatting.

sales_record = {'price': 3.24, 'num_items': 4, 'person': 'Chris' }

sales_statement = '{} bought {} item(s) at a price of {} each for a total of {}'

print(sales_statement.format(sales_record['person'], sales_record['num_items'], sales_record['price'], sales_record['num_items']*sales_record['price']))

->Chris bought 4 item(s) at a price of 3.24 each for a total of 12.96
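
The same statement can be written with named fields, which avoids repeating the dictionary lookups. This is a sketch of one alternative, not the course's version; the names template, total and statement are illustrative, and the :.2f format spec is added here to fix the number of decimals.

```python
sales_record = {'price': 3.24, 'num_items': 4, 'person': 'Chris'}

# Named replacement fields are filled from keyword arguments
template = ('{person} bought {num_items} item(s) at a price of '
            '{price:.2f} each for a total of {total:.2f}')

total = sales_record['num_items'] * sales_record['price']
statement = template.format(total=total, **sales_record)

print(statement)
# Chris bought 4 item(s) at a price of 3.24 each for a total of 12.96
```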

13. Reading and Writing CSV files


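import csv

# IPython/Jupyter magic (not plain Python): display floats with two decimal places
%precision 2

with open('mpg.csv') as csvfile:
    mpg = list(csv.DictReader(csvfile))  # convert the CSV file into a list of dictionaries, one per row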



[OrderedDict([('', '1'),
              ('manufacturer', 'audi'),
              ('model', 'a4'),
              ('displ', '1.8'),
              ('year', '1999'),
              ('cyl', '4'),
              ('trans', 'auto(l5)'),
              ('drv', 'f'),
              ('cty', '18'),
              ('hwy', '29'),
              ('fl', 'p'),
              ('class', 'compact')]),
 OrderedDict([('', '2'),
              ('manufacturer', 'audi'),
              ('model', 'a4'),
              ('displ', '1.8'),
              ('year', '1999'),
              ('cyl', '4'),
              ('trans', 'manual(m5)'),
              ('drv', 'f'),
              ('cty', '21'),
              ('hwy', '29'),
              ('fl', 'p'),
              ('class', 'compact')]),
 OrderedDict([('', '3'),
              ('manufacturer', 'audi'),
              ('model', 'a4'),
              ('displ', '2'),
              ('year', '2008'),
              ('cyl', '4'),
              ('trans', 'manual(m6)'),
              ('drv', 'f'),
              ('cty', '20'),
              ('hwy', '31'),
              ('fl', 'p'),
              ('class', 'compact')])]




(3) keys gives us the column names of our CSV

mpg[0].keys()

-> odict_keys(['', 'manufacturer', 'model', 'displ', 'year', 'cyl', 'trans', 'drv', 'cty', 'hwy', 'fl', 'class'])

(4) Find the average cty fuel economy across all cars. All values in the dictionaries are strings, so we need to convert them to float.

sum(float(d['cty']) for d in mpg) / len(mpg)


(5) Use set to return the unique values for the number of cylinders the cars in our dataset have.

cylinders = set(d['cyl'] for d in mpg)

cylinders

-> {'4', '5', '6', '8'}

(6) Group the cars by number of cylinders and find the average cty mpg for each group.

CtyMpgByCyl = []

for c in cylinders:  # iterate over all the cylinder levels
    summpg = 0
    cyltypecount = 0
    for d in mpg:  # iterate over all rows
        if d['cyl'] == c:  # the row belongs to this cylinder group
            summpg += float(d['cty'])  # add its cty mpg to the running sum
            cyltypecount += 1  # count the row
    CtyMpgByCyl.append((c, summpg / cyltypecount))  # store (cylinders, average cty mpg)

CtyMpgByCyl.sort(key=lambda x: x[0])  # sort by number of cylinders

CtyMpgByCyl

-> [('4', 21.01), ('5', 20.50), ('6', 16.22), ('8', 12.57)]

(7) Use set to return the unique values for the class types in our dataset

vehicleclass = set(d['class'] for d in mpg)


->{'2seater', 'compact', 'midsize', 'minivan', 'pickup', 'subcompact', 'suv'}

(8) Find the average hwy mpg for each class of vehicle in our dataset.

HwyMpgByClass = []

for t in vehicleclass:  # iterate over all the vehicle classes
    summpg = 0
    vclasscount = 0
    for d in mpg:  # iterate over all rows
        if d['class'] == t:  # the row belongs to this vehicle class
            summpg += float(d['hwy'])  # add its hwy mpg to the running sum
            vclasscount += 1  # count the row
    HwyMpgByClass.append((t, summpg / vclasscount))  # store (class, average hwy mpg)

HwyMpgByClass.sort(key=lambda x: x[1])  # sort by average mpg

HwyMpgByClass

[('pickup', 16.88),
 ('suv', 18.13),
 ('minivan', 22.36),
 ('2seater', 24.80),
 ('midsize', 27.29),
 ('subcompact', 28.14),
 ('compact', 28.30)]
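
The nested loops above rescan the whole dataset once per group. A single pass with running totals gives the same kind of result. The sketch below is an alternative under stated assumptions, not the course's method: it uses a few made-up rows in place of mpg.csv (only the class and hwy column names match the real file), and the names sample_csv, totals, counts and hwy_by_class are illustrative.

```python
import csv
import io
from collections import defaultdict

# Hypothetical stand-in for mpg.csv: made-up rows, same column names
sample_csv = io.StringIO(
    "class,hwy\n"
    "compact,29\n"
    "compact,31\n"
    "suv,18\n"
    "suv,20\n"
    "pickup,17\n"
)
rows = list(csv.DictReader(sample_csv))

totals = defaultdict(float)   # class -> running sum of hwy
counts = defaultdict(int)     # class -> number of rows seen

for row in rows:              # one pass over the data
    totals[row['class']] += float(row['hwy'])
    counts[row['class']] += 1

# Build (class, average) pairs and sort by the average
hwy_by_class = sorted(
    ((name, totals[name] / counts[name]) for name in totals),
    key=lambda pair: pair[1],
)
print(hwy_by_class)
# [('pickup', 17.0), ('suv', 19.0), ('compact', 30.0)]
```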

14. Dates and Times

(1) Import the datetime and time modules

import datetime as dt

import time as tm

(2) time returns the current time in seconds since the Epoch

tm.time()



(3) Convert the timestamp to datetime

dtnow = dt.datetime.fromtimestamp(tm.time())

dtnow

-> datetime.datetime(2020, 3, 11, 13, 18, 56, 990293)

(4) Handy datetime attributes: get year, month, day, etc. from a datetime

dtnow.year, dtnow.month, dtnow.day, dtnow.hour, dtnow.minute, dtnow.second

->(2020, 3, 11, 13, 18, 56)

(5) Timedelta is a duration expressing the difference between two dates.

delta = dt.timedelta(days = 100)



(6) date.today returns the current local date

today = dt.date.today()

today

-> datetime.date(2020, 3, 11)

(7) The date 100 days ago

today - delta

-> datetime.date(2019, 12, 2)

(8) compare dates

today > today - delta

-> True
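
Because date.today() changes every day, a fixed date makes the arithmetic easier to check. A minimal sketch, assuming the date shown in the output above as the starting point; the names start and earlier are illustrative.

```python
import datetime as dt

start = dt.date(2020, 3, 11)      # fixed date so the result is reproducible
delta = dt.timedelta(days=100)

earlier = start - delta           # date minus timedelta gives a date
gap = start - earlier             # date minus date gives a timedelta

print(earlier)          # 2019-12-02
print(gap.days)         # 100
print(earlier < start)  # True
```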

15. Objects and map()

(1) an example of a class in python:

class Person:
    department = 'School of Information'  # a class variable shared by all instances

    def set_name(self, new_name):  # a method
        self.name = new_name

    def set_location(self, new_location):
        self.location = new_location

person = Person()
person.set_name('Christopher Brooks')
person.set_location('Ann Arbor, MI, USA')
print('{} lives in {} and works in the department {}'.format(person.name, person.location, person.department))

(2) mapping the min function between two lists

store1 = [10.00, 11.00, 12.34, 2.34]

store2 = [9.00, 11.10, 12.34, 2.01]

cheapest = map(min, store1, store2)


-><map at 0x7f74034a8860>

(3) iterate through the map object to see the values

for item in cheapest:
    print(item)

-> 9.0
   11.0
   12.34
   2.01
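
In Python 3, map returns a lazy iterator, which is why printing it shows only a map object. A short sketch of the consequence: the values are produced on demand and can be consumed only once. The names first_pass and second_pass are illustrative.

```python
store1 = [10.00, 11.00, 12.34, 2.34]
store2 = [9.00, 11.10, 12.34, 2.01]

cheapest = map(min, store1, store2)   # nothing is computed yet

first_pass = list(cheapest)           # consumes the iterator
second_pass = list(cheapest)          # the iterator is already exhausted

print(first_pass)   # [9.0, 11.0, 12.34, 2.01]
print(second_pass)  # []
```

Wrapping the map in list() right away is a simple way to keep the results around for reuse.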






16. Lambda and List Comprehensions

(1) an example of lambda that takes in three parameters and adds the first two

my_function = lambda a, b, c: a+b

my_function(1, 2, 3)


(2) iterate from 0 to 999 and return the even numbers.

my_list = []
for number in range(0, 1000):
    if number % 2 == 0:
        my_list.append(number)

-> [0, 2, 4, ...]

(3) Now the same thing but with list comprehension

my_list = [number for number in range(0, 1000) if number % 2 == 0]


->[0, 2, 4,...]
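
A quick check that the loop and the comprehension build the same list. The third variant, using a step argument in range, is an addition not in the notes; the names evens_loop, evens_comp and evens_step are illustrative.

```python
evens_loop = []
for number in range(0, 1000):
    if number % 2 == 0:
        evens_loop.append(number)

evens_comp = [number for number in range(0, 1000) if number % 2 == 0]
evens_step = list(range(0, 1000, 2))   # a step of 2 avoids the test entirely

print(evens_loop == evens_comp == evens_step)           # True
print(len(evens_comp), evens_comp[:3], evens_comp[-1])  # 500 [0, 2, 4] 998
```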

17. Numpy

(1) import package

import numpy as np

18. Creating arrays (from tuples or lists)

(1) create a list and convert it to a numpy array

mylist = [1, 2, 3]

x = np.array(mylist)


->array([1, 2, 3])

(2) just pass in a list directly

y = np.array([4, 5, 6])


->array([4, 5, 6])

(3) pass in a list of lists to create a multidimensional array

m = np.array([[7, 8, 9], [10, 11, 12]])

m

array([[ 7,  8,  9],
       [10, 11, 12]])

(4) use the shape attribute to find the dimensions of the array (rows, columns)

m.shape

-> (2, 3)

(5) arange returns evenly spaced values within a given interval

n = np.arange(0, 30, 2)


->array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28])

(6) reshape returns an array with the same data with a new shape

n = n.reshape(3, 5)

n

array([[ 0,  2,  4,  6,  8],
       [10, 12, 14, 16, 18],
       [20, 22, 24, 26, 28]])

(7) linspace returns evenly spaced numbers over a specified interval

o = np.linspace(0, 4, 9)


->array([ 0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. ])
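
arange and linspace are easy to confuse. arange takes a step and excludes the stop value, while linspace takes a number of points and by default includes the stop value. A small sketch, with a and b as illustrative names:

```python
import numpy as np

a = np.arange(0, 4, 0.5)   # step of 0.5; the stop value 4 is excluded
b = np.linspace(0, 4, 9)   # 9 points; the stop value 4 is included

print(len(a), a[-1])  # 8 3.5
print(len(b), b[-1])  # 9 4.0
```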

(8) resize changes the shape and size of the array in place

o.resize(3, 3)

o

array([[ 0. ,  0.5,  1. ],
       [ 1.5,  2. ,  2.5],
       [ 3. ,  3.5,  4. ]])

(9) ones returns a new array of given shape and type, filled with ones

np.ones((3, 2))

array([[ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.]])

(10) zeros returns a new array of given shape and type, filled with zeros

np.zeros((2, 3))

array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])

(11) eye returns a 2D array with ones on the diagonal and zeros elsewhere

np.eye(3)

array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])

(12) diag extracts a diagonal or constructs a diagonal array

np.diag(y)

array([[4, 0, 0],
       [0, 5, 0],
       [0, 0, 6]])

(13)creating an array using repeating list

np.array([1, 2, 3]*3)

->array([1, 2, 3, 1, 2, 3, 1, 2, 3])

(14) repeat elements of an array using repeat

np.repeat([1, 2, 3], 3)

->array([1, 1, 1, 2, 2, 2, 3, 3, 3])

(15) combine arrays

p = np.ones([2, 3], int)

p

array([[1, 1, 1],
       [1, 1, 1]])

(16) use vstack to stack arrays in sequence vertically (row wise).

np.vstack([p, 2*p])

array([[1, 1, 1],
       [1, 1, 1],
       [2, 2, 2],
       [2, 2, 2]])

(17) use hstack to stack arrays in sequence horizontally (column wise).

np.hstack([p, 2*p])

array([[1, 1, 1, 2, 2, 2],
       [1, 1, 1, 2, 2, 2]])
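
Checking the resulting shapes makes the two stacking directions concrete. The sketch also compares them with np.concatenate and an explicit axis, which is an addition not in the notes; the names v, h, same_v and same_h are illustrative.

```python
import numpy as np

p = np.ones([2, 3], int)

v = np.vstack([p, 2 * p])   # rows are appended: column counts must match
h = np.hstack([p, 2 * p])   # columns are appended: row counts must match

print(v.shape)  # (4, 3)
print(h.shape)  # (2, 6)

# np.concatenate expresses the same operations with an explicit axis
same_v = np.array_equal(v, np.concatenate([p, 2 * p], axis=0))
same_h = np.array_equal(h, np.concatenate([p, 2 * p], axis=1))
print(same_v, same_h)  # True True
```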

19. Operations

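(1) Element-wise + - * / and ** (x and y are the arrays created above)

print(x + y)  # element-wise addition
print(x - y)  # element-wise subtraction

[5 7 9]
[-3 -3 -3]

print(x * y)  # element-wise multiplication
print(x / y)  # element-wise division

[ 4 10 18]
[ 0.25  0.4   0.5 ]

print(x**2)  # element-wise power

-> [1 4 9]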

(2) Dot product: x1*y1 + x2*y2 + x3*y3

x.dot(y)

-> 32

z = np.array([y, y**2])

print(z)

[[ 4  5  6]
 [16 25 36]]

print(len(z))  # number of rows of the array

-> 2


(4) Transpose an array

print(z)

[[ 4  5  6]
 [16 25 36]]

z.T

array([[ 4, 16],
       [ 5, 25],
       [ 6, 36]])

(5) use .dtype to see the data type of the elements in the array

z.dtype

(6) use .astype to cast to a specific type

z = z.astype('f')  # 'f' is the type code for 32-bit float

z.dtype

-> dtype('float32')



(7) math functions

a = np.array([-4, -2, 1, 3, 5])















(8) indexing / slicing

s = np.arange(13)**2


->array([ 0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144])

(9)use bracket notation to get the value at a specific index

s[0], s[4], s[-1]

->(0, 16, 144)

(10) use : to indicate a range: array[start:stop]

s[1:5]

-> array([ 1, 4, 9, 16])

(11) use negatives to count from the back

s[-4:]

-> array([ 81, 100, 121, 144])

(12) A second : can be used to indicate the step size: array[start:stop:stepsize]

Here we start at the 5th element from the end and count backwards by 2 until the beginning of the array is reached.

s[-5::-2]

-> array([64, 36, 16, 4, 0])

(13) look at the multidimensional array

r = np.arange(36)

r.resize((6, 6))

r

array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35]])

(14) use bracket notation to slice

r[2, 2]

-> 14


(15) use : to select a range of rows or columns

r[3, 3:6]

->array([21, 22, 23])

(16) select all the rows up to (but not including) row 2, and all the columns up to (but not including) the last column.

r[:2, :-1]

array([[ 0,  1,  2,  3,  4],
       [ 6,  7,  8,  9, 10]])

(17) a slice of last row, only every other element

r[-1, ::2]

->array([30, 32, 34])

(18) perform conditional indexing.

r[r > 30]

->array([31, 32, 33, 34, 35])

(19) assigning all values in the array that are greater than 30 to the value of 30

r[r > 30] = 30

r

array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29],
       [30, 30, 30, 30, 30, 30]])

(20) copying and modifying arrays: a slice is a view onto the original data, not a copy

r2 = r[:3, :3]

r2

array([[ 0,  1,  2],
       [ 6,  7,  8],
       [12, 13, 14]])

(21) set this slice's values to zero ([:] selects the entire array)

r2[:] = 0

r2

array([[0, 0, 0],
       [0, 0, 0],
       [0, 0, 0]])

(22) r has also been changed

r

array([[ 0,  0,  0,  3,  4,  5],
       [ 0,  0,  0,  9, 10, 11],
       [ 0,  0,  0, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29],
       [30, 30, 30, 30, 30, 30]])

(23) to avoid this, use .copy()

r_copy = r.copy()

r_copy

array([[ 0,  0,  0,  3,  4,  5],
       [ 0,  0,  0,  9, 10, 11],
       [ 0,  0,  0, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29],
       [30, 30, 30, 30, 30, 30]])

(24) now when r_copy is modified, r will not be changed

r_copy[:] = 10

print(r_copy, '\n')

print(r)

[[10 10 10 10 10 10]
 [10 10 10 10 10 10]
 [10 10 10 10 10 10]
 [10 10 10 10 10 10]
 [10 10 10 10 10 10]
 [10 10 10 10 10 10]]

[[ 0  0  0  3  4  5]
 [ 0  0  0  9 10 11]
 [ 0  0  0 15 16 17]
 [18 19 20 21 22 23]
 [24 25 26 27 28 29]
 [30 30 30 30 30 30]]
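
Whether two arrays share data can be checked directly with np.shares_memory, which is an addition not in the notes. A minimal sketch on a fresh 6x6 array; the names view and copy are illustrative.

```python
import numpy as np

r = np.arange(36).reshape(6, 6)

view = r[:3, :3]          # basic slicing returns a view onto r's data
copy = r[:3, :3].copy()   # .copy() allocates independent data

print(np.shares_memory(r, view))  # True
print(np.shares_memory(r, copy))  # False

copy[:] = -1              # leaves r untouched
view[:] = 0               # writes through to r
print(r[0, :4])           # [0 0 0 3]
```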

(25) create a new 4x3 array of random integers from 0 to 9

test = np.random.randint(0, 10, (4, 3))

test

array([[1, 8, 2],
       [6, 1, 5],
       [7, 8, 0],
       [7, 6, 2]])

(26) iterate by row

for row in test:
    print(row)

[1 8 2]

[6 1 5]

[7 8 0]

[7 6 2]

(27) iterate by index

for i in range(len(test)):
    print(test[i])


[1 8 2]

[6 1 5]

[7 8 0]

[7 6 2]

(28) iterate by row and index

for i, row in enumerate(test):
    print('row', i, 'is', row)


row 0 is [1 8 2]

row 1 is [6 1 5]

row 2 is [7 8 0]

row 3 is [7 6 2]

(29) use zip to iterate over multiple iterables

test2 = test**2

test2

array([[ 1, 64,  4],
       [36,  1, 25],
       [49, 64,  0],
       [49, 36,  4]])

for i, j in zip(test, test2):
    print(i, '+', j, '=', i+j)

[1 8 2] + [ 1 64  4] = [ 2 72  6]
[6 1 5] + [36  1 25] = [42  2 30]
[7 8 0] + [49 64  0] = [56 72  0]
[7 6 2] + [49 36  4] = [56 42  6]

