These are the author's study notes for the Coursera course Introduction to Data Science in Python, provided for reference only.
1. 50 Years of Data Science
    (1) Data Exploration and Preparation
    (2) Data Representation and Transformation
    (3) Computing with Data
    (4) Data Modeling
    (5) Data Visualization and Presentation
    (6) Science about Data Science
2. Functions
def add_numbers(x, y, z=None, flag=False):
    # z and flag are optional arguments with default values
    if flag:
        print('Flag is true!')
    if z is None:
        return x + y
    else:
        return x + y + z
print(add_numbers(1, 2, flag=True))
Assign the function add_numbers to a variable a, then call it through a:
a = add_numbers
a(1, 2, flag=True)
3. Checking data types
type('This is a string')
-> str
type(None)
-> NoneType
4. Tuple
Tuples are an immutable data structure (cannot be altered).
x = (1, 'a', 2, 'b')
type(x)
->tuple
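To see what "immutable" means in practice, here is a small sketch (the names point, as_list and error_message are illustrative, not from the course). Assigning to an element of a list works, while the same assignment on a tuple raises a TypeError.

```python
point = (1, 'a', 2, 'b')
as_list = list(point)      # a mutable copy of the same items

as_list[0] = 99            # lists can be changed in place

try:
    point[0] = 99          # tuples reject item assignment
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(as_list)
print(error_message)
```

The tuple itself is left exactly as it was created.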
5. List
Lists are a mutable data structure.
x = [1, 'a', 2, 'b']
type(x)
->list
6. Append
Use append to add an object to the end of a list.
x.append(3.3)
print(x)
->[1, 'a', 2, 'b', 3.3]
7. Loop through each item in the list
for item in x:
    print(item)
->1
  a
  2
  b
  3.3
8. Using the indexing operator to loop through each item in the list
i = 0
while i != len(x):
    print(x[i])
    i = i + 1
->1
  a
  2
  b
  3.3
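When both the position and the item are needed, the manual counter can usually be replaced by the built-in enumerate. A minimal sketch, assuming the same list x as above (the pairs list is only there to make the result easy to inspect):

```python
x = [1, 'a', 2, 'b', 3.3]

pairs = []
for i, item in enumerate(x):   # enumerate yields (index, item) tuples
    pairs.append((i, item))
    print(i, item)
```

This avoids forgetting to increment i, which would turn the while loop into an infinite loop.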
9. Basic list operations
(1) Use + to concatenate lists
[1, 2] + [3, 4]
-> [1, 2, 3, 4]
(2)Use * to repeat lists
[1]*3
->[1, 1, 1]
(3) Use the in operator to check if something is inside a list
1 in [1, 2, 3]
->True
10. Basic string operations
(1) Use bracket notation to slice a string.
x = 'This is a string'
print(x[0])
->T
print(x[0:1])
->T
print(x[0:2])
->Th
print(x[-1])  # the last element
->g
print(x[-4:-2])  # start at the 4th element from the end and stop before the 2nd element from the end
->ri
x[:3]  # a slice from the beginning of the string, stopping before index 3 (the 4th character)
->Thi
x[3:]  # a slice starting at index 3 (the 4th character) and running to the end of the string
->s is a string
(2) The +, * and in operators shown for lists also work on strings
firstname = 'Christopher'
lastname = 'Brooks'
print(firstname + ' ' + lastname)
->Christopher Brooks
print(firstname*3)
->ChristopherChristopherChristopher
print('Chris' in firstname)
->True
(3) Split returns a list of all the words in a string, or a list split on a specific character.
firstname = 'Christopher Arthur Hansen Brooks'.split(' ')[0]
lastname = 'Christopher Arthur Hansen Brooks'.split(' ')[-1]
print(firstname)
->Christopher
print(lastname)
->Brooks
(4) Make sure you convert objects to strings before concatenating.
'Chris' + 2
->TypeError (a str cannot be concatenated with an int)
'Chris' + str(2)
->Chris2
11. Dictionary
(1) Dictionaries associate keys with values
x = {'Christopher Brooks': 'brooksch@umich.edu', 'Bill Gates': 'billg@microsoft.com'}
x['Christopher Brooks']
->brooksch@umich.edu
x['Kevyn Collins-Thompson'] = None
x['Kevyn Collins-Thompson']
->(no output, because the stored value is None)
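A key whose value is None is not the same as a key that is missing: indexing a missing key raises a KeyError. The sketch below, using a smaller hypothetical dictionary named contacts, shows how the in operator and dict.get tell the two cases apart.

```python
contacts = {
    'Christopher Brooks': 'brooksch@umich.edu',
    'Kevyn Collins-Thompson': None,
}

has_key = 'Kevyn Collins-Thompson' in contacts      # key exists, value is None
missing_key = 'Bill Gates' in contacts              # key does not exist

# get returns the stored value if the key exists, otherwise the default
stored_none = contacts.get('Kevyn Collins-Thompson', 'no entry')
fallback = contacts.get('Bill Gates', 'no entry')

print(has_key, missing_key, stored_none, fallback)
```

Note that get only falls back to the default when the key is absent; a stored None is returned as is.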
(2) Iterate over all of the keys:
for name in x:
    print(x[name])
->brooksch@umich.edu
  billg@microsoft.com
  None
(3) Iterate over all of the values:
for email in x.values():
    print(email)
->brooksch@umich.edu
  billg@microsoft.com
  None
(4) Iterate over all of the items in the dictionary:
for name, email in x.items():
    print(name)
    print(email)
->Christopher Brooks
  brooksch@umich.edu
  Bill Gates
  billg@microsoft.com
  Kevyn Collins-Thompson
  None
(5) Unpack a sequence into different variables:
x = ('Christopher', 'Brooks', 'brooksch@umich.edu')
fname, lname, email = x
fname
->Christopher
lname
->Brooks
(6) Make sure the number of values you are unpacking matches the number of variables being assigned.
x = ('Christopher', 'Brooks', 'brooksch@umich.edu', 'Ann Arbor')
fname, lname, email = x
->ValueError: too many values to unpack (expected 3)
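If the number of values is not known in advance, one option (not covered in the notes above) is a starred target, which collects whatever is left over into a list. A minimal sketch, with record and rest as illustrative names:

```python
record = ('Christopher', 'Brooks', 'brooksch@umich.edu', 'Ann Arbor')

# plain unpacking fails because there are 4 values and only 3 targets
try:
    fname, lname, email = record
    unpack_failed = False
except ValueError:
    unpack_failed = True

# a starred target absorbs the remaining values as a list
fname, lname, *rest = record

print(unpack_failed)
print(fname, lname, rest)
```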
12. More on Strings
(1) Simple Samples
print('Chris' + 2)
->TypeError
print('Chris' + str(2))
->Chris2
(2) Python strings have a built-in format method for convenient string formatting.
sales_record = {'price': 3.24, 'num_items': 4, 'person': 'Chris' }
sales_statement = '{} bought {} item(s) at a price of {} each for a total of {}'
print(sales_statement.format(sales_record['person'], sales_record['num_items'], sales_record['price'], sales_record['num_items']*sales_record['price']))
->Chris bought 4 item(s) at a price of 3.24 each for a total of 12.96
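The positional {} placeholders can also be given names, which keeps long templates readable. The sketch below reuses the same sales_record; the variable names total and statement and the :.2f format spec (two decimal places) are additions for illustration.

```python
sales_record = {'price': 3.24, 'num_items': 4, 'person': 'Chris'}
total = sales_record['num_items'] * sales_record['price']

# named placeholders are filled from keyword arguments;
# **sales_record passes each dictionary key as a keyword
template = ('{person} bought {num_items} item(s) at a price of {price} '
            'each for a total of {total:.2f}')
statement = template.format(total=total, **sales_record)

print(statement)
```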
13. Reading and Writing CSV files
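(1) Import csv and read the file
import csv
# IPython magic: display floats with two decimal places
%precision 2
with open('mpg.csv') as csvfile:
    mpg = list(csv.DictReader(csvfile))  # convert csvfile into a list whose elements are dictionaries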
mpg[:3]
->
[OrderedDict([('', '1'),
              ('manufacturer', 'audi'),
              ('model', 'a4'),
              ('displ', '1.8'),
              ('year', '1999'),
              ('cyl', '4'),
              ('trans', 'auto(l5)'),
              ('drv', 'f'),
              ('cty', '18'),
              ('hwy', '29'),
              ('fl', 'p'),
              ('class', 'compact')]),
 OrderedDict([('', '2'),
              ('manufacturer', 'audi'),
              ('model', 'a4'),
              ('displ', '1.8'),
              ('year', '1999'),
              ('cyl', '4'),
              ('trans', 'manual(m5)'),
              ('drv', 'f'),
              ('cty', '21'),
              ('hwy', '29'),
              ('fl', 'p'),
              ('class', 'compact')]),
 OrderedDict([('', '3'),
              ('manufacturer', 'audi'),
              ('model', 'a4'),
              ('displ', '2'),
              ('year', '2008'),
              ('cyl', '4'),
              ('trans', 'manual(m6)'),
              ('drv', 'f'),
              ('cty', '20'),
              ('hwy', '31'),
              ('fl', 'p'),
              ('class', 'compact')])]
(2) Check the length of the list
len(mpg)
->234
(3)keys gives us the column names of our csv
mpg[0].keys()
->odict_keys(['', 'manufacturer', 'model', 'displ', 'year', 'cyl', 'trans', 'drv', 'cty', 'hwy', 'fl', 'class'])
(4) Find the average hwy fuel economy across all cars. All values in the dictionaries are strings, so we need to convert them to float.
sum(float(d['hwy']) for d in mpg) / len(mpg)
->23.44
(5) Use set to return the unique values for the number of cylinders the cars in our dataset have.
cylinders = set(d['cyl'] for d in mpg)
cylinders
->{'4', '5', '6', '8'}
(6) Group the cars by number of cylinders and find the average cty mpg for each group.
CtyMpgByCyl = []
for c in cylinders:  # iterate over all the cylinder levels
    summpg = 0
    cyltypecount = 0
    for d in mpg:  # iterate over all the dictionaries
        if d['cyl'] == c:  # the cylinder level matches
            summpg += float(d['cty'])  # add the cty mpg
            cyltypecount += 1  # increment the count
    CtyMpgByCyl.append((c, summpg / cyltypecount))  # store ('cylinder', 'avg mpg')
CtyMpgByCyl.sort(key = lambda x: x[0])
CtyMpgByCyl
->[('4', 21.01), ('5', 20.50), ('6', 16.22), ('8', 12.57)]
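The nested loops in (6) scan the whole list once per cylinder count. A single-pass alternative can be sketched with collections.defaultdict. The helper name average_by and the four toy rows below are illustrative assumptions; they are not the contents of mpg.csv.

```python
from collections import defaultdict

def average_by(rows, group_key, value_key):
    """Average float(row[value_key]) for each distinct row[group_key]."""
    totals = defaultdict(float)   # running sum per group
    counts = defaultdict(int)     # number of rows per group
    for row in rows:
        totals[row[group_key]] += float(row[value_key])
        counts[row[group_key]] += 1
    # sorted list of (group, average) tuples, like CtyMpgByCyl
    return sorted((group, totals[group] / counts[group]) for group in totals)

# toy rows in the same shape as the DictReader output (values are strings)
toy_rows = [
    {'cyl': '4', 'cty': '18'},
    {'cyl': '4', 'cty': '21'},
    {'cyl': '6', 'cty': '16'},
    {'cyl': '6', 'cty': '18'},
]

averages = average_by(toy_rows, 'cyl', 'cty')
print(averages)
```

The same helper would cover (8) as well by passing 'class' and 'hwy', with a different sort key if the result should be ordered by average rather than by group.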
(7) Use set to return the unique values for the class types in our dataset
vehicleclass = set(d['class'] for d in mpg)
vehicleclass
->{'2seater', 'compact', 'midsize', 'minivan', 'pickup', 'subcompact', 'suv'}
(8) Find the average hwy mpg for each class of vehicle in our dataset.
HwyMpgByClass = []
for t in vehicleclass:  # iterate over all the vehicle classes
    summpg = 0
    vclasscount = 0
    for d in mpg:  # iterate over all the dictionaries
        if d['class'] == t:  # the vehicle class matches
            summpg += float(d['hwy'])  # add the hwy mpg
            vclasscount += 1  # increment the count
    HwyMpgByClass.append((t, summpg / vclasscount))  # store ('class', 'avg mpg')
HwyMpgByClass.sort(key = lambda x: x[1])
HwyMpgByClass
->
[('pickup', 16.88),
('suv', 18.13),
('minivan', 22.36),
('2seater', 24.80),
('midsize', 27.29),
('subcompact', 28.14),
('compact', 28.30)]
14. Dates and Times
(1) Import the datetime and time modules
import datetime as dt
import time as tm
(2) Time returns the current time in seconds since the Epoch
tm.time()
->1583932727.90
(3) Convert the timestamp to datetime
dtnow = dt.datetime.fromtimestamp(tm.time())
dtnow
->
datetime.datetime(2020, 3, 11, 13, 18, 56, 990293)
(4) Handy datetime attributes: get year, month, day, etc. from a datetime
dtnow.year, dtnow.month, dtnow.day, dtnow.hour, dtnow.minute, dtnow.second
->(2020, 3, 11, 13, 18, 56)
(5) Timedelta is a duration expressing the difference between two dates.
delta = dt.timedelta(days = 100)
delta
->datetime.timedelta(100)
(6) date.today returns the current local date
today = dt.date.today()
today
->datetime.date(2020, 3, 11)
(7) the date 100 days ago
today - delta
->datetime.date(2019, 12, 2)
(8) compare dates
today > today - delta
-> True
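Because dt.date.today() changes every day, the arithmetic is easier to check with a fixed date. A small sketch using the date from the outputs above (the names start, earlier and gap are illustrative): subtracting a timedelta from a date gives a date, and subtracting two dates gives a timedelta.

```python
import datetime as dt

start = dt.date(2020, 3, 11)
delta = dt.timedelta(days=100)

earlier = start - delta     # date minus timedelta is a date
gap = start - earlier       # date minus date is a timedelta

print(earlier)
print(gap.days)
print(start > earlier)
```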
15. Objects and map()
(1) an example of a class in python:
class Person:
    department = 'School of Information'  # a class variable shared by all instances
    def set_name(self, new_name):  # a method
        self.name = new_name
    def set_location(self, new_location):
        self.location = new_location
person = Person()
person.set_name('Christopher Brooks')
person.set_location('Ann Arbor, MI, USA')
print('{} live in {} and work in the department {}'.format(person.name, person.location, person.department))
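With the setter style above, person.name does not exist until set_name has been called. A common alternative, sketched here as an assumption rather than as the course's version, is to set the instance attributes in __init__ so that every object starts out complete. The describe method is a hypothetical helper added for illustration.

```python
class Person:
    department = 'School of Information'   # class attribute shared by all instances

    def __init__(self, name, location):
        # instance attributes are set once, when the object is created
        self.name = name
        self.location = location

    def describe(self):
        return '{} lives in {} and works in the department {}'.format(
            self.name, self.location, self.department)

person = Person('Christopher Brooks', 'Ann Arbor, MI, USA')
summary = person.describe()
print(summary)
```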
(2) mapping the min function between two lists
store1 = [10.00, 11.00, 12.34, 2.34]
store2 = [9.00, 11.10, 12.34, 2.01]
cheapest = map(min, store1, store2)
cheapest
-><map at 0x7f74034a8860>
(3) iterate through the map object to see the values
for item in cheapest:
    print(item)
->
9.0
11.0
12.34
2.01
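One thing worth knowing about the map object: it is a lazy iterator, so it can only be consumed once. The sketch below repeats the example and converts the result to a list twice (first_pass and second_pass are illustrative names); the second conversion comes back empty.

```python
store1 = [10.00, 11.00, 12.34, 2.34]
store2 = [9.00, 11.10, 12.34, 2.01]

cheapest = map(min, store1, store2)   # nothing is computed yet

first_pass = list(cheapest)           # consumes the iterator
second_pass = list(cheapest)          # the iterator is already exhausted

print(first_pass)
print(second_pass)
```

If the values are needed more than once, storing list(map(...)) up front is the simplest fix.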
16. Lambda and List Comprehensions
(1) an example of lambda that takes in three parameters and adds the first two
my_function = lambda a, b, c: a+b
my_function(1, 2, 3)
->3
(2) iterate from 0 to 999 and return the even numbers.
my_list = []
for number in range(0, 1000):
    if number % 2 == 0:
        my_list.append(number)
my_list
->[0, 2, 4,...]
(3) Now the same thing but with list comprehension
my_list = [number for number in range(0, 1000) if number % 2 == 0]
my_list
->[0, 2, 4,...]
17. Numpy
(1) import package
import numpy as np
18. Creating arrays
(1) create a list and convert it to a numpy array
mylist = [1, 2, 3]
x = np.array(mylist)
x
->array([1, 2, 3])
(2) just pass in a list directly
y = np.array([4, 5, 6])
y
->array([4, 5, 6])
(3) pass in a list of lists to create a multidimensional array
m = np.array([[7, 8, 9], [10, 11, 12]])
m
->
array([[ 7,  8,  9],
       [10, 11, 12]])
(4) use the shape attribute to find the dimensions of the array
m.shape
->(2, 3)
(5) arange returns evenly spaced values within a given interval
n = np.arange(0, 30, 2)
n
->array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28])
(6) reshape returns an array with the same data with a new shape
n = n.reshape(3, 5)
n
->
array([[ 0,  2,  4,  6,  8],
       [10, 12, 14, 16, 18],
       [20, 22, 24, 26, 28]])
(7) linspace returns evenly spaced numbers over a specified interval
o = np.linspace(0, 4, 9)
o
->array([ 0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. ])
(8) resize changes the shape and size of an array in-place
o.resize(3, 3)
o
->
array([[ 0. ,  0.5,  1. ],
       [ 1.5,  2. ,  2.5],
       [ 3. ,  3.5,  4. ]])
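The difference between reshape in (6) and resize in (8) is easy to miss: reshape returns a new array object and leaves the original's shape alone, while resize modifies the array itself and returns None. A minimal sketch, using two separate arrays a and c (illustrative names) so the two calls do not interfere:

```python
import numpy as np

a = np.linspace(0, 4, 9)
b = a.reshape(3, 3)          # new array object; a itself keeps its shape
shape_of_a = a.shape         # still (9,)

c = np.linspace(0, 4, 9)
result = c.resize(3, 3)      # changes c itself and returns None
shape_of_c = c.shape         # now (3, 3)

print(shape_of_a, b.shape, shape_of_c, result)
```

This is why (6) writes n = n.reshape(3, 5) with an assignment, while (8) calls o.resize(3, 3) on its own.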
(9) ones returns a new array of given shape and type, filled with ones
np.ones((3, 2))
->
array([[ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.]])
(10) zeros returns a new array of given shape and type, filled with zeros
np.zeros((2, 3))
->
array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])
(11) eye returns a 2D array with ones on the diagonal and zeros elsewhere
np.eye(3)
->
array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])
(12) diag extracts a diagonal or constructs a diagonal array
np.diag(y)
->
array([[4, 0, 0],
       [0, 5, 0],
       [0, 0, 6]])
(13) create an array from a repeated list
np.array([1, 2, 3]*3)
->array([1, 2, 3, 1, 2, 3, 1, 2, 3])
(14) repeat elements of an array using repeat
np.repeat([1, 2, 3], 3)
->array([1, 1, 1, 2, 2, 2, 3, 3, 3])
(15) combine arrays
p = np.ones([2, 3], int)
p
->
array([[1, 1, 1],
       [1, 1, 1]])
(16) use vstack to stack arrays in sequence vertically (row wise).
np.vstack([p, 2*p])
->
array([[1, 1, 1],
       [1, 1, 1],
       [2, 2, 2],
       [2, 2, 2]])
(17) use hstack to stack arrays in sequence horizontally (column wise).
np.hstack([p, 2*p])
->
array([[1, 1, 1, 2, 2, 2],
       [1, 1, 1, 2, 2, 2]])
19. Operations
(1) element wise + - * /
print(x+y)
print(x-y)
->
[5 7 9]
[-3 -3 -3]
print(x*y)
print(x/y)
->
[ 4 10 18]
[ 0.25  0.4   0.5 ]
print(x**2)
->[1 4 9]
(2) Dot Product
x.dot(y) # x1y1+x2y2+x3y3
->32
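To make the comment x1y1+x2y2+x3y3 concrete, the sketch below computes the same sum by hand and compares it with dot and with the @ operator. The names manual, via_dot and via_operator are illustrative.

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

# multiply matching elements and add the products: 1*4 + 2*5 + 3*6
manual = sum(int(a) * int(b) for a, b in zip(x, y))

via_dot = x.dot(y)        # method form used in the notes
via_operator = x @ y      # operator form of the same product

print(manual, via_dot, via_operator)
```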
(3) stack y and y**2 into a 2D array; len returns the number of rows
z = np.array([y, y**2])
print(z)
print(len(z))  # number of rows of the array
->
[[ 4  5  6]
 [16 25 36]]
2
(4) transpose array
z
->
[[ 4  5  6]
 [16 25 36]]
z.T
->
array([[ 4, 16],
       [ 5, 25],
       [ 6, 36]])
(5) use .dtype to see the data type of the elements in the array
z.dtype
->dtype('int64')
(6) use .astype to cast to a specific type
z = z.astype('f')
z.dtype
->dtype('float32')
(7) math functions
a = np.array([-4, -2, 1, 3, 5])
a.sum()
->3
a.max()
->5
a.min()
->-4
a.mean()
->0.59999999998
a.std()
->3.2619012860600183
a.argmax()
->4
a.argmin()
->0
(8) indexing / slicing
s = np.arange(13)**2
s
->array([ 0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144])
(9)use bracket notation to get the value at a specific index
s[0], s[4], s[-1]
->(0, 16, 144)
(10) use : to indicate a range: array[start:stop]
s[1:5]
->array([ 1, 4, 9, 16])
(11) use negatives to count from the back
s[-4:]
->array([ 81, 100, 121, 144])
(12) A second : can be used to indicate the step size: array[start:stop:stepsize]
Here we start at the 5th element from the end and count backwards by 2 until the beginning of the array is reached.
s[-5::-2]
->array([64, 36, 16, 4, 0])
(13) look at the multidimensional array
r = np.arange(36)
r.resize((6,6))
r
->
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35]])
(14) use bracket notation to slice
r[2, 2]
->14
(15) use : to select a range of rows or columns
r[3, 3:6]
->array([21, 22, 23])
(16) select all the rows up to (but not including) row 2, and all the columns up to (but not including) the last column.
r[:2, :-1]
->
array([[ 0,  1,  2,  3,  4],
       [ 6,  7,  8,  9, 10]])
(17) a slice of last row, only every other element
r[-1, ::2]
->array([30, 32, 34])
(18) perform conditional indexing.
r[r > 30]
->array([31, 32, 33, 34, 35])
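Conditional indexing works in two steps: the comparison r > 30 first builds a boolean array of the same shape, and that array is then used to pick elements. The sketch below makes the intermediate mask visible. The names mask, selected and clipped are illustrative, and np.where is shown as one way to get a capped copy without modifying r.

```python
import numpy as np

r = np.arange(36).reshape(6, 6)

mask = r > 30                      # boolean array with the same shape as r
selected = r[mask]                 # 1-D array of the values where mask is True
clipped = np.where(mask, 30, r)    # new array; r itself is left unchanged

print(mask.sum())                  # number of True entries
print(selected)
print(clipped[-1])
```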
(19) assigning all values in the array that are greater than 30 to the value of 30
r[r > 30] = 30
r
->
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29],
       [30, 30, 30, 30, 30, 30]])
(20) copy and modify arrays: a slice is a view of the original array, not a copy
r2 = r[:3, :3]
r2
->
array([[ 0,  1,  2],
       [ 6,  7,  8],
       [12, 13, 14]])
(21) set this slice's values to zero ([:] selects the entire array)
r2[:] = 0
r2
->
array([[0, 0, 0],
       [0, 0, 0],
       [0, 0, 0]])
(22) r has also been changed, because r2 is a view of r
r
->
array([[ 0,  0,  0,  3,  4,  5],
       [ 0,  0,  0,  9, 10, 11],
       [ 0,  0,  0, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29],
       [30, 30, 30, 30, 30, 30]])
(23) to avoid this, use .copy()
r_copy = r.copy()
r_copy
->
array([[ 0,  0,  0,  3,  4,  5],
       [ 0,  0,  0,  9, 10, 11],
       [ 0,  0,  0, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29],
       [30, 30, 30, 30, 30, 30]])
(24) now when r_copy is modified, r will not be changed
r_copy[:] = 10
print(r_copy, '\n')
print(r)
->
[[10 10 10 10 10 10]
 [10 10 10 10 10 10]
 [10 10 10 10 10 10]
 [10 10 10 10 10 10]
 [10 10 10 10 10 10]
 [10 10 10 10 10 10]]

[[ 0  0  0  3  4  5]
 [ 0  0  0  9 10 11]
 [ 0  0  0 15 16 17]
 [18 19 20 21 22 23]
 [24 25 26 27 28 29]
 [30 30 30 30 30 30]]
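Whether two arrays are backed by the same data can also be checked directly instead of inferred from side effects. A minimal sketch using np.shares_memory on a fresh 6x6 array (view, independent and the two flag names are illustrative):

```python
import numpy as np

r = np.arange(36).reshape(6, 6)

view = r[:3, :3]                  # basic slicing gives a view onto r's data
independent = r[:3, :3].copy()    # copy() allocates separate data

shares_view = np.shares_memory(r, view)
shares_copy = np.shares_memory(r, independent)

view[:] = 0                       # writes through to r
independent[:] = 10               # does not touch r

print(shares_view, shares_copy)
print(r[:3, :3])
```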
(25) create a new 4x3 array of random integers from 0 to 9
test = np.random.randint(0, 10, (4, 3))
test
->
array([[1, 8, 2],
       [6, 1, 5],
       [7, 8, 0],
       [7, 6, 2]])
(26) iterate by row
for row in test:
    print(row)
->
[1 8 2]
[6 1 5]
[7 8 0]
[7 6 2]
(27) iterate by index
for i in range(len(test)):
    print(test[i])
->
[1 8 2]
[6 1 5]
[7 8 0]
[7 6 2]
(28) iterate by row and index
for i, row in enumerate(test):
    print('row', i, 'is', row)
->
row 0 is [1 8 2]
row 1 is [6 1 5]
row 2 is [7 8 0]
row 3 is [7 6 2]
(29) use zip to iterate over multiple iterables
test2 = test**2
test2
->
array([[ 1, 64,  4],
       [36,  1, 25],
       [49, 64,  0],
       [49, 36,  4]])
for i, j in zip(test, test2):
    print(i, '+', j, '=', i+j)
->
[1 8 2] + [ 1 64  4] = [ 2 72  6]
[6 1 5] + [36  1 25] = [42  2 30]
[7 8 0] + [49 64  0] = [56 72  0]
[7 6 2] + [49 36  4] = [56 42  6]