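Previously in this series:
- Overview
- Local environment tutorial
- Julia 1.0.0 installation guide (including Juno IDE)
- Currently compatible machine learning packages
- Online environment tutorial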
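A while ago I went through the environment setup and the state of the various versions, so the basic infrastructure for learning should now be in place.
My own interest is machine learning, so I will focus on material in that area.
Machine learning starts with data; without it there is nothing to learn from. Once you have data, the first step in many languages is to process it. Julia is no different, and it has dedicated packages for data processing.
Today's topic is the DataFrames package. Anyone who has used other languages for machine learning will recognize it: a DataFrame is a data frame, and the Chinese term is a literal translation.
I don't like learning a programming language by reading books (programming books are for looking things up). You master a language best and fastest by writing, reading, and using code at the same time, especially code written by experts.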
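You may remember that an earlier tutorial explained how to import a repository from GitHub.
First, note this address: https://github.com/scidom/StatsLearningByExample.jl.git
Then click the Git button on Juliabox.
A dialog box appears.
Paste the address into the Git Clone URL box, then click the + sign after it (if you want to change the name of the folder that is synced to Juliabox, change it before clicking +). Then click [OK].
Juliabox will quickly sync the contents of https://github.com/scidom/StatsLearningByExample.jl.git.
Your Juliabox will then contain a new folder (StatsLearningByExample).
Next, go into the 02-DataFrames directory: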
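Open the first notebook, 02-01-DataFramesBasics.ipynb, then click Run All under the Cell menu.
This command runs all of the code in the notebook.
That notebook is today's lesson. Since the code already exists and its author has explained it in English, I will summarize the key points in reverse order (reverse relative to the order of the code).
(My own run results are at the end of this post, for readers who cannot run the notebook right now.)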
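- A DataFrame is made up of DataArrays.
  - Basic information about a DataFrame can be queried.
  - Simple descriptive statistics can be computed on the data in a DataFrame.
  - Individual rows or columns of a DataFrame can be retrieved.
- DataArrays support matrix operations.
- DataArrays have a special member, NA (missing value).
  - Any operation between a number and NA yields NA.
  - NA (missing value) and NaN (Not a Number) are two different things, and their types differ as well.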
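The last two points can be sketched without any package. This is a minimal illustration of my own, not the notebook's code, and it assumes a Julia 1.x session, where Base provides `missing` in the role that NA plays in DataArrays:

```julia
# Assumption: Julia 1.x, where `missing` is part of Base (it stands in for NA here).

# Arithmetic with a missing value propagates the missing value.
a = 1 + missing
is_missing_result = ismissing(a)

# NaN is an ordinary Float64 value, while missing has its own type, Missing.
nan_is_float = isa(NaN, Float64)
missing_is_float = isa(missing, Float64)
missing_type = typeof(missing)

println((is_missing_result, nan_is_float, missing_is_float, missing_type))
```

The variable names are only for illustration; the point is that a missing value is a separate kind of object, not a special floating-point number.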
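A few things to watch out for:
One cell was apparently meant to be a documentation (text) cell. The author seems to have overlooked this, so it runs as code and raises an error.
Change it to a text cell, or simply ignore it.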
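Another one:
Cell 8 says it is not possible to build a DataArray directly with DataArray([0.1, NA, -2.4]), so it was meant as a counterexample (it should raise an error).
The author wants readers to use the syntax in cell 9 instead.
In practice, the author wrote this two years ago, and in my run both forms produce the correct result. So at present both ways of writing it work.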
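As a side illustration (my own sketch, not the notebook's code, and not an explanation of why the DataArray constructor behaves differently now): assuming Julia 1.x, a plain vector literal can hold a missing value directly, with no constructor or macro.

```julia
# Assumption: Julia 1.x, where `missing` is available without any package.
v = [0.1, missing, -2.4]       # a literal vector may contain missing directly

element_type = eltype(v)       # a Union of Missing and Float64
n_missing = count(ismissing, v)

println(element_type)
println(n_missing)
```

The element type becomes a union of `Missing` and `Float64`, which is how the array records that some entries may be absent.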
I won't explain the rest; anyone working through the notebook should be able to follow it.
Below are the results of my own run, for reference:
Introduction to DataFrames
In [1]:
using DataArrays
using DataFrames
Missing values?
- A missing value is represented by NA in Julia.
- NA is not part of Base, it is provided by the DataArrays package.
- NA poisons other values.
In [2]:
# NA poisons other values
1+NA
Out[2]:
missing
In [3]:
# Check if the evaluation of an expression results in NA
isna(1+NA)
Out[3]:
true
In [4]:
# Note the difference between NaN and NA
(isa(NaN, Float64), isa(NA, Float64))
Out[4]:
(true, false)
DataArrays
- DataArray's are used for representing arrays that contain missing values.
- DataArray{T} allows storing T or NA.
- In other words, DataArray{T} adds NA's to Array{T}.
- PooledDataArray{T} is used for storing data efficiently.
- PooledDataArray{T} compresses DataArray{T}.
Constructing DataArrays
In [5]:
# Call the DataArray() constructor by passing a Vector to it
DataArray([0.1, 0.5, -2.4])
Out[5]:
3-element DataArrays.DataArray{Float64,1}:
0.1
0.5
-2.4
In [6]:
# Construct a DataArray by calling the @data() macro with a Vector input argument
@data([0.1, 0.5, -2.4])
Out[6]:
3-element DataArrays.DataArray{Float64,1}:
0.1
0.5
-2.4
In [7]:
# Convert Vector to DataArray
convert(DataArray, [0.1, 0.5, -2.4])
Out[7]:
3-element DataArrays.DataArray{Float64,1}:
0.1
0.5
-2.4
In [8]:
# It is not possible to call DataArray() with NA in its input argument
DataArray([0.1, NA, -2.4])
Out[8]:
3-element DataArrays.DataArray{Float64,1}:
0.1
missing
-2.4
In [9]:
# However, it is possible to pass NA to the @data() macro
@data([0.1, NA, -2.4])
Out[9]:
3-element DataArrays.DataArray{Float64,1}:
0.1
missing
-2.4
In [10]:
# The DataArray() constructor can be called with a Matrix input argument
DataArray([0.4 1.2; 3.5 7.2])
Out[10]:
2×2 DataArrays.DataArray{Float64,2}:
0.4 1.2
3.5 7.2
In [11]:
# The @data() macro can also be called with a Matrix input argument
@data([0.4 1.2; 3.5 7.2])
Out[11]:
2×2 DataArrays.DataArray{Float64,2}:
0.4 1.2
3.5 7.2
In [12]:
# Convert a Matrix to DataArray
convert(DataArray, [0.4 1.2; 3.5 7.2])
Out[12]:
2×2 DataArrays.DataArray{Float64,2}:
0.4 1.2
3.5 7.2
Numerical computing with DataArrays
In [13]:
# Numerical computing can be done with data vectors
x = @data([0.1, NA, -2.4])
y = @data([-9.9, 0.5, 6.7])
x+y
Out[13]:
3-element DataArrays.DataArray{Float64,1}:
-9.8
missing
4.3
In [14]:
# To remove missing values (NA), call dropna()
x = @data([0.1, NA, -2.4])
dropna(x)
Out[14]:
2-element Array{Float64,1}:
0.1
-2.4
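For comparison, here is a rough Base-only counterpart of the dropna() cell. It is a sketch assuming Julia 1.x, where `skipmissing` plays a similar role for arrays that contain `missing`; it is not part of the original notebook.

```julia
# Assumption: Julia 1.x; `missing` stands in for NA.
x = [0.1, missing, -2.4]

# skipmissing gives a lazy view without the missing entries;
# collect turns it into an ordinary Vector{Float64}.
clean = collect(skipmissing(x))

# Reductions can also consume the lazy view directly.
total = sum(skipmissing(x))

println(clean)
println(total)
```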
In [15]:
# Numerical computing can be done with data matrices and data vectors
A = @data([0.4 1.2 4.4; NA 7.2 3.9; 5.1 1.8 4.5])
y = @data([-9.9, 0.5, 6.7])
A*y
Out[15]:
3-element DataArrays.DataArray{Float64,1}:
26.12
missing
-19.44
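Why is only the second entry missing? Each entry of A*y is the dot product of one row of A with y, and only the second row of A contains NA. A small sketch of that reasoning, assuming Julia 1.x with `missing` in place of NA and writing the row products out by hand rather than relying on any particular matrix routine:

```julia
# Assumption: Julia 1.x; `missing` stands in for NA.
A = [0.4 1.2 4.4; missing 7.2 3.9; 5.1 1.8 4.5]
y = [-9.9, 0.5, 6.7]

# Entry i of the product is the dot product of row i of A with y.
# A single missing term makes the whole sum for that row missing.
result = [sum(A[i, j] * y[j] for j in 1:3) for i in 1:3]

println(result)
```

Rows 1 and 3 contain no missing value, so their sums are ordinary numbers.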
DataFrames
- DataFrame's are used for representing data tables.
- A DataFrame is a list of DataArray's.
- So every DataArray of a DataFrame represents a column of the corresponding data table.
- DataFrame's accommodate heterogeneous data that might contain missing values.
- Every column (DataArray) of a DataFrame has its own type.
Example 02-01-01: NBA champions
Constructing DataFrames
In [16]:
# Call the DataFrame() constructor with keyword arguments (columns) of type Vector
DataFrame(
player = ["Larry Bird", "Magic Johnson", "Michael Jordan", "Scottie Pippen"],
champions = [3, 5, 6, 6]
)
Out[16]:
| | player | champions |
|---|---|---|
| 1 | Larry Bird | 3 |
| 2 | Magic Johnson | 5 |
| 3 | Michael Jordan | 6 |
| 4 | Scottie Pippen | 6 |
In [17]:
# Start with an empty DataFrame and populate it
ChampionsFrame = DataFrame()
ChampionsFrame[:player] = ["Larry Bird", "Magic Johnson", "Michael Jordan", "Scottie Pippen"]
ChampionsFrame[:champions] = [3, 5, 6, 6]
ChampionsFrame
Out[17]:
| | player | champions |
|---|---|---|
| 1 | Larry Bird | 3 |
| 2 | Magic Johnson | 5 |
| 3 | Michael Jordan | 6 |
| 4 | Scottie Pippen | 6 |
Provide CSV-like tabular data to construct a new DataFrame
In [19]:
# Call the DataFrame() constructor with keyword arguments (columns) of type DataArray
player = @data(["Larry Bird", "Magic Johnson", "Michael Jordan", "Scottie Pippen"])
champions = @data([3, 5, 6, 6])
ChampionsFrame = DataFrame(player=player, champions=champions)
Out[19]:
| | player | champions |
|---|---|---|
| 1 | Larry Bird | 3 |
| 2 | Magic Johnson | 5 |
| 3 | Michael Jordan | 6 |
| 4 | Scottie Pippen | 6 |
In [20]:
# Construct a DataFrame by joining two existing DataFrames
height = [2.06, 2.06, 1.98, 2.03]
HeightsFrame = DataFrame(player=player, height=height)
join(ChampionsFrame, HeightsFrame, on = :player)
Out[20]:
| | player | champions | height |
|---|---|---|---|
| 1 | Larry Bird | 3 | 2.06 |
| 2 | Magic Johnson | 5 | 2.06 |
| 3 | Michael Jordan | 6 | 1.98 |
| 4 | Scottie Pippen | 6 | 2.03 |
Querying basic information about DataFrames
In [21]:
# Get number of rows of a DataFrame
size(ChampionsFrame, 1)
Out[21]:
4
In [22]:
# Get number of columns of a DataFrame
size(ChampionsFrame, 2)
Out[22]:
2
In [23]:
# Get a numeric summary of a DataFrame
describe(ChampionsFrame)
Out[23]:
| | variable | mean | min | median | max | nunique | nmissing | eltype |
|---|---|---|---|---|---|---|---|---|
| 1 | player | | Larry Bird | | Scottie Pippen | 4 | 0 | String |
| 2 | champions | 5.0 | 3 | 5.5 | 6 | | 0 | Int64 |
Indexing DataFrames
In [24]:
# Index DataFrame by column name to get a specific column
ChampionsFrame[:player]
Out[24]:
4-element DataArrays.DataArray{String,1}:
"Larry Bird"
"Magic Johnson"
"Michael Jordan"
"Scottie Pippen"
In [25]:
# Index DataFrame by row numbers to get specific rows
ChampionsFrame[2:3, :]
Out[25]:
| | player | champions |
|---|---|---|
| 1 | Magic Johnson | 5 |
| 2 | Michael Jordan | 6 |
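For readers on a newer setup, the DataFrame cells above can be sketched roughly as follows. This is my own adaptation, not the notebook's code. It assumes a recent DataFrames.jl release (1.x) is installed, where columns are plain vectors, `innerjoin` takes the place of `join`, and a column is read as `df.player` rather than `df[:player]`. The variable names `champs`, `heights`, `joined`, `players`, and `middle` are only illustrative.

```julia
using DataFrames  # assumption: a recent DataFrames.jl (1.x) is installed

# Build two small tables from plain vectors.
champs = DataFrame(
    player = ["Larry Bird", "Magic Johnson", "Michael Jordan", "Scottie Pippen"],
    champions = [3, 5, 6, 6],
)
heights = DataFrame(player = champs.player, height = [2.06, 2.06, 1.98, 2.03])

# Join the two tables on the shared :player column.
joined = innerjoin(champs, heights, on = :player)

# Column access by property name, and row slicing.
players = champs.player
middle = champs[2:3, :]

println(size(joined))
println(size(champs, 1), " rows, ", size(champs, 2), " columns")
```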
KevinZhang
Aug 30, 2018