翻譯原因
買了官方第2版徐敬一翻譯的紙質(zhì)書,翻譯水平讓我想罵人,尤其是“第11章 時(shí)間序列”簡(jiǎn)直讓人摸不著頭腦。另外,原版英文書中也有些許錯(cuò)誤,不方便理解,特自行修訂翻譯第11章,如后面有余力再修訂翻譯其它章節(jié)。本文借鑒了官方第1版唐雪韜的版本和簡(jiǎn)書用戶SeanCheney的版本,對(duì)部分內(nèi)容進(jìn)行了修訂。
翻譯原則
1 術(shù)語(yǔ)準(zhǔn)確及詞義寬窄適度
1.1 會(huì)盡量查看相關(guān)標(biāo)準(zhǔn)及專業(yè)網(wǎng)站、百科、詞典等綜合而定。重點(diǎn)參考網(wǎng)站是:
- 1.1.1 相關(guān)標(biāo)準(zhǔn)及專業(yè)網(wǎng)站:《GBT 3358.1-2009 統(tǒng)計(jì)學(xué)詞匯及符號(hào) 第1部分:一般統(tǒng)計(jì)術(shù)語(yǔ)與用于概率的術(shù)語(yǔ)》、https://dict.cnki.net/index、http://shuyu.cnki.net/;
- 1.1.2 百科:維基百科、百度百科、360百科;
- 1.1.3 詞典:墨墨背單詞、http://dict.youdao.com/?keyfrom=dict2.top、http://www.iciba.com/(簡(jiǎn)明詞典和牛津詞典)、https://cn.bing.com/dict/?FORM=HDRSC6。
1.2 詞義寬窄適度:中英文詞匯的詞義寬窄可能有所差異,因此不僅僅考慮英翻中,還要考慮能否中翻英回到原文的英文詞匯,使中英文的詞義寬窄盡量匹配。
2 詞匯前后一致
2.1 某個(gè)英語(yǔ)詞匯對(duì)應(yīng)的中文意思可能有多個(gè),為方便理解本文盡量只采用一個(gè),且盡量做到前后一致。
2.2 作者可能用多個(gè)英文詞匯表達(dá)同一個(gè)中文意思,為方便理解會(huì)盡量整合到一個(gè)中文詞匯。例如option、parameter等統(tǒng)一翻譯為“參數(shù)”。
3 閱讀流暢
3.1 原文長(zhǎng)句較多,為符合中文閱讀習(xí)慣,會(huì)適當(dāng)拆開(kāi);
3.2 原文有些口水話,會(huì)適當(dāng)省略或意譯。原文省略的部分詞匯,會(huì)適當(dāng)補(bǔ)足。例如作者會(huì)省略“...方法”的“方法”,只寫“...”。
3.3 英文原書未進(jìn)行逐級(jí)標(biāo)題編號(hào),為方便理解會(huì)進(jìn)行逐級(jí)標(biāo)題編號(hào)。
思維導(dǎo)圖如下(圖略):
時(shí)間序列(time series)數(shù)據(jù)是結(jié)構(gòu)化數(shù)據(jù)的一種重要形式,廣泛應(yīng)用于金融學(xué)、經(jīng)濟(jì)學(xué)、生態(tài)學(xué)、神經(jīng)科學(xué)和物理學(xué)等多個(gè)領(lǐng)域。在多個(gè)時(shí)間點(diǎn)觀察或測(cè)量到的任何數(shù)據(jù)都可以形成一個(gè)時(shí)間序列。很多時(shí)間序列是固定頻率的(fixed frequency),也就是說(shuō),數(shù)據(jù)點(diǎn)按照某種規(guī)則定期出現(xiàn),例如每15秒、每5分鐘或每月一次。時(shí)間序列也可以是不規(guī)則的(irregular),沒(méi)有固定的時(shí)間單位或單位之間的偏移量。如何標(biāo)記和引用時(shí)間序列數(shù)據(jù)取決于應(yīng)用場(chǎng)景,主要有以下幾種:
- 時(shí)間戳(timestamp),特定的時(shí)刻。
- 固定時(shí)期(period),例如2007年1月或2010年全年。
- 時(shí)間間隔(interval),由起始時(shí)間戳和結(jié)束時(shí)間戳表示。時(shí)期(period)可以看作是間隔(interval)的特例。
- 實(shí)驗(yàn)時(shí)間(experiment time)或經(jīng)過(guò)時(shí)間(elapsed time),每個(gè)時(shí)間戳都是相對(duì)于特定起始時(shí)間的一個(gè)時(shí)間度量。例如,從放入烤箱時(shí)起,每秒鐘餅干的直徑。
Time series data is an important form of structured data in many different fields, such as finance, economics, ecology, neuroscience, and physics. Anything that is observed or measured at many points in time forms a time series. Many time series are fixed frequency, which is to say that data points occur at regular intervals according to some rule, such as every 15 seconds, every 5 minutes, or once per month. Time series can also be irregular without a fixed unit of time or offset between units. How you mark and refer to time series data depends on the application, and you may have one of the following:
- Timestamps, specific instants in time
- Fixed periods, such as the month January 2007 or the full year 2010
- Intervals of time, indicated by a start and end timestamp. Periods can be thought of as special cases of intervals
- Experiment or elapsed time; each timestamp is a measure of time relative to a particular start time (e.g., the diameter of a cookie baking each second since being placed in the oven)
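gg注:以下為譯者補(bǔ)充的示例(非原書內(nèi)容)。pandas對(duì)前三種標(biāo)記方式分別提供了Timestamp、Period和Interval對(duì)象,可以粗略示意如下:

```python
import pandas as pd

# 時(shí)間戳(timestamp):特定的時(shí)刻
stamp = pd.Timestamp("2007-01-15 09:30")

# 時(shí)期(period):2007年1月這一整個(gè)月
period = pd.Period("2007-01", freq="M")

# 時(shí)間間隔(interval):由起始時(shí)間戳和結(jié)束時(shí)間戳表示
# closed="left"表示包含起點(diǎn)、不包含終點(diǎn)
interval = pd.Interval(pd.Timestamp("2007-01-01"),
                       pd.Timestamp("2007-02-01"), closed="left")

# 時(shí)期可以看作間隔的特例:它同樣有起點(diǎn)和終點(diǎn)
print(period.start_time, period.end_time)
```

可以驗(yàn)證2007-01-15這個(gè)時(shí)間戳既落在上述時(shí)期內(nèi),也落在上述間隔內(nèi)。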
雖然很多技術(shù)都可用于處理實(shí)驗(yàn)型的時(shí)間序列,其索引可能是一個(gè)整數(shù)或浮點(diǎn)數(shù)(表示從實(shí)驗(yàn)開(kāi)始所經(jīng)過(guò)的時(shí)間),但本章主要講解前三種時(shí)間序列。最簡(jiǎn)單也最廣泛使用的時(shí)間序列是被時(shí)間戳索引的時(shí)間序列。
In this chapter, I am mainly concerned with time series in the first three categories, though many of the techniques can be applied to experimental time series where the index may be an integer or floating-point number indicating elapsed time from the start of the experiment. The simplest and most widely used kind of time series are those indexed by timestamp.
pandas也支持基于timedelta對(duì)象的索引,timedelta對(duì)象可能是表示實(shí)驗(yàn)時(shí)間或經(jīng)過(guò)時(shí)間的有用方式。在本書中我們不講解timedelta索引,但你可以在pandas官方文檔(http://pandas.pydata.org)中了解更多。
pandas also supports indexes based on timedeltas, which can be a useful way of representing experiment or elapsed time. We do not explore timedelta indexes in this book, but you can learn more in the pandas documentation.
pandas提供了很多內(nèi)置的時(shí)間序列工具和數(shù)據(jù)算法。你可以高效地處理非常大的時(shí)間序列,并且對(duì)不規(guī)則的時(shí)間序列和固定頻率的時(shí)間序列輕松地進(jìn)行切片、切塊、聚合和重采樣。其中一些工具對(duì)于金融和經(jīng)濟(jì)應(yīng)用場(chǎng)景特別有用,你當(dāng)然也可以用它們來(lái)分析服務(wù)器日志數(shù)據(jù)。
pandas provides many built-in time series tools and data algorithms. You can efficiently work with very large time series and easily slice and dice, aggregate, and resample irregular- and fixed-frequency time series. Some of these tools are especially useful for financial and economics applications, but you could certainly use them to analyze server log data, too.
11.1 日期和時(shí)間數(shù)據(jù)類型及工具
11.1 Date and Time Data Types and Tools
Python標(biāo)準(zhǔn)庫(kù)包含日期和時(shí)間數(shù)據(jù)(date and time data)的數(shù)據(jù)類型以及與日歷相關(guān)的功能。我們主要會(huì)用到datetime、time和calendar模塊。datetime.datetime(簡(jiǎn)寫為datetime)類型是廣泛使用的數(shù)據(jù)類型:
The Python standard library includes data types for date and time data, as well as calendar-related functionality. The datetime, time, and calendar modules are the main places to start. The datetime.datetime type, or simply datetime, is widely used:
import numpy as np
import pandas as pd
np.random.seed(12345)
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(10, 6))
PREVIOUS_MAX_ROWS = pd.options.display.max_rows
pd.options.display.max_rows = 20
np.set_printoptions(precision=4, suppress=True)
In [10]: from datetime import datetime
In [11]: now_datetime = datetime.now() # gg注:為避免歧義,變量名從原文的now改為now_datetime
In [12]: now_datetime
Out[12]: datetime.datetime(2017, 9, 25, 14, 5, 52, 72973)
In [13]: now_datetime.year, now_datetime.month, now_datetime.day
Out[13]: (2017, 9, 25)
datetime對(duì)象存儲(chǔ)日期以及精確到微秒的時(shí)間。timedelta對(duì)象表示兩個(gè)datetime對(duì)象之間的時(shí)間差:
datetime stores both the date and time down to the microsecond. timedelta represents the temporal difference between two datetime objects:
In [14]: delta = datetime(2011, 1, 7) - datetime(2008, 6, 24, 8, 15)
In [15]: delta
Out[15]: datetime.timedelta(926, 56700)
In [16]: delta.days
Out[16]: 926
In [17]: delta.seconds
Out[17]: 56700
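gg注:補(bǔ)充說(shuō)明(非原書內(nèi)容):timedelta對(duì)象只把時(shí)間差拆成days、seconds和microseconds三個(gè)字段存儲(chǔ),如果想要總秒數(shù),可以用total_seconds方法:

```python
from datetime import datetime

delta = datetime(2011, 1, 7) - datetime(2008, 6, 24, 8, 15)
print(delta.days, delta.seconds)  # 926 56700
print(delta.total_seconds())      # 926 * 86400 + 56700 = 80063100.0
```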
可以給datetime對(duì)象加上(或減去)一個(gè)timedelta對(duì)象或其倍數(shù),這樣會(huì)產(chǎn)生一個(gè)新對(duì)象:
You can add (or subtract) a timedelta or multiple thereof to a datetime object to yield a new shifted object:
In [18]: from datetime import timedelta
In [19]: start = datetime(2011, 1, 7)
In [20]: start + timedelta(12)
Out[20]: datetime.datetime(2011, 1, 19, 0, 0)
In [21]: start - 2 * timedelta(12)
Out[21]: datetime.datetime(2010, 12, 14, 0, 0)
表11-1總結(jié)了datetime模塊中的數(shù)據(jù)類型。雖然本章主要講解pandas中的數(shù)據(jù)類型和更高級(jí)別的時(shí)間序列操作,但你可能會(huì)在Python的很多其它地方遇到基于datetime的類型。
Table 11-1 summarizes the data types in the datetime module. While this chapter is mainly concerned with the data types in pandas and higher-level time series manipulation, you may encounter the datetime-based types in many other places in Python in the wild.
表11-1:datetime模塊中的數(shù)據(jù)類型
Table 11-1. Types in datetime module
11.1.1 字符串和datetime之間的轉(zhuǎn)換
Converting Between String and Datetime
使用str函數(shù)或strftime方法(傳入一個(gè)格式規(guī)范),可以將datetime對(duì)象和pandas的Timestamp對(duì)象(稍后就會(huì)介紹)格式化為字符串:
You can format datetime objects and pandas Timestamp objects, which I’ll introduce later, as strings using str or the strftime method, passing a format specification:
In [22]: stamp = datetime(2011, 1, 3)
In [23]: str(stamp)
Out[23]: '2011-01-03 00:00:00'
In [24]: stamp.strftime('%Y-%m-%d')
Out[24]: '2011-01-03'
格式代碼的完整清單見(jiàn)表11-2(轉(zhuǎn)載自第2章)。
See Table 11-2 for a complete list of the format codes (reproduced from Chapter 2).
表11-2:datetime格式規(guī)范(兼容ISO C89)
Table 11-2. Datetime format specification (ISO C89 compatible)
使用datetime.strptime函數(shù)和這些格式代碼可以將字符串轉(zhuǎn)換為日期:
You can use these same format codes to convert strings to dates using datetime.strptime:
In [25]: value = '2011-01-03'
In [26]: datetime.strptime(value, '%Y-%m-%d')
Out[26]: datetime.datetime(2011, 1, 3, 0, 0)
In [27]: datestrs = ['7/6/2011', '8/6/2011']
In [28]: [datetime.strptime(x, '%m/%d/%Y') for x in datestrs]
Out[28]:
[datetime.datetime(2011, 7, 6, 0, 0),
datetime.datetime(2011, 8, 6, 0, 0)]
datetime.strptime函數(shù)是解析已知格式日期的好方式。但是,每次都要編寫格式規(guī)范可能有點(diǎn)煩人,尤其是對(duì)于常見(jiàn)的日期格式。在這種情況下,可以使用第三方dateutil包中的parser.parse方法(安裝pandas時(shí)已自動(dòng)安裝好了):
datetime.strptime is a good way to parse a date with a known format. However, it can be a bit annoying to have to write a format spec each time, especially for common date formats. In this case, you can use the parser.parse method in the third-party dateutil package (this is installed automatically when you install pandas):
In [29]: from dateutil.parser import parse
In [30]: parse('2011-01-03')
Out[30]: datetime.datetime(2011, 1, 3, 0, 0)
dateutil包能夠解析大部分人類可理解的日期表示形式:
dateutil is capable of parsing most human-intelligible date representations:
In [31]: parse('Jan 31, 1997 10:45 PM')
Out[31]: datetime.datetime(1997, 1, 31, 22, 45)
在國(guó)際語(yǔ)言環(huán)境中,日出現(xiàn)在月的前面很常見(jiàn),可以傳入dayfirst=True來(lái)表示這一點(diǎn):
In international locales, day appearing before month is very common, so you can pass dayfirst=True to indicate this:
In [32]: parse('6/12/2011', dayfirst=True)
Out[32]: datetime.datetime(2011, 12, 6, 0, 0)
pandas通常面向處理日期數(shù)組(array of dates),無(wú)論這些日期是作為DataFrame的軸索引還是列。pandas.to_datetime函數(shù)可以解析多種不同的日期表示形式。像ISO 8601這樣的標(biāo)準(zhǔn)日期格式可以非常快速地解析:
pandas is generally oriented toward working with arrays of dates, whether used as an axis index or a column in a DataFrame. The to_datetime method parses many different kinds of date representations. Standard date formats like ISO 8601 can be parsed very quickly:
In [33]: datestrs = ['2011-07-06 12:00:00', '2011-08-06 00:00:00']
In [34]: pd.to_datetime(datestrs)
Out[34]: DatetimeIndex(['2011-07-06 12:00:00', '2011-08-06 00:00:00'], dtype='datetime64[ns]', freq=None)
pandas.to_datetime函數(shù)還可以處理應(yīng)被視為缺失的值(None、空字符串等):
It also handles values that should be considered missing (None, empty string, etc.):
In [35]: idx = pd.to_datetime(datestrs + [None])
In [36]: idx
Out[36]: DatetimeIndex(['2011-07-06 12:00:00', '2011-08-06 00:00:00', 'NaT'], dtype='datetime64[ns]', freq=None)
In [37]: idx[2]
Out[37]: NaT
In [38]: pd.isnull(idx)
Out[38]: array([False, False, True], dtype=bool)
NaT(Not a Time)是pandas中時(shí)間戳數(shù)據(jù)的null值欠痴。
NaT (Not a Time) is pandas’s null value for timestamp data.
dateutil.parser是一個(gè)有用但不完美的工具。值得注意的是,它會(huì)將一些原本不是日期的字符串識(shí)別為日期,例如,“42”會(huì)被解析為2042年,月和日則取自今天的日歷日期。
dateutil.parser is a useful but imperfect tool. Notably, it will recognize some strings as dates that you might prefer that it didn’t—for example, '42' will be parsed as the year 2042 with today’s calendar date.
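gg注:可以驗(yàn)證這一行為(譯者補(bǔ)充示例,非原書內(nèi)容):

```python
from datetime import date

from dateutil.parser import parse

d = parse('42')
print(d.year)          # 2042:兩位數(shù)被當(dāng)作年份解析
print(d.month, d.day)  # 月和日取自今天的日歷日期
```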
對(duì)于其它國(guó)家或語(yǔ)言的系統(tǒng),datetime對(duì)象還有許多特定于語(yǔ)言環(huán)境的(locale-specific)格式化選項(xiàng)。例如,德語(yǔ)或法語(yǔ)系統(tǒng)月份的簡(jiǎn)稱與英語(yǔ)系統(tǒng)相比將有所不同。清單見(jiàn)表11-3。
datetime objects also have a number of locale-specific formatting options for systems in other countries or languages. For example, the abbreviated month names will be different on German or French systems compared with English systems. See Table 11-3 for a listing.
表11-3:特定于語(yǔ)言環(huán)境的格式化選項(xiàng)
Table 11-3. Locale-specific date formatting
11.2 時(shí)間序列基礎(chǔ)
11.2 Time Series Basics
pandas中時(shí)間序列對(duì)象的一個(gè)基本種類是被時(shí)間戳索引的Series對(duì)象,這些時(shí)間戳通常在pandas外部表示為Python字符串或datetime對(duì)象。
A basic kind of time series object in pandas is a Series indexed by timestamps, which is often represented external to pandas as Python strings or datetime objects:
In [39]: from datetime import datetime
In [40]: dates = [datetime(2011, 1, 2), datetime(2011, 1, 5),
....: datetime(2011, 1, 7), datetime(2011, 1, 8),
....: datetime(2011, 1, 10), datetime(2011, 1, 12)]
In [41]: ts = pd.Series(np.random.randn(6), index=dates)
In [42]: ts
Out[42]:
2011-01-02 -0.204708
2011-01-05 0.478943
2011-01-07 -0.519439
2011-01-08 -0.555730
2011-01-10 1.965781
2011-01-12 1.393406
dtype: float64
在底層,這些datetime對(duì)象被放在一個(gè)DatetimeIndex中:
Under the hood, these datetime objects have been put in a DatetimeIndex:
In [43]: ts.index
Out[43]:
DatetimeIndex(['2011-01-02', '2011-01-05', '2011-01-07', '2011-01-08',
'2011-01-10', '2011-01-12'],
dtype='datetime64[ns]', freq=None)
與其它Series對(duì)象一樣,不同索引的時(shí)間序列之間的算術(shù)運(yùn)算會(huì)在日期上自動(dòng)對(duì)齊:
Like other Series, arithmetic operations between differently indexed time series automatically align on the dates:
In [44]: ts + ts[::2]
Out[44]:
2011-01-02 -0.409415
2011-01-05 NaN
2011-01-07 -1.038877
2011-01-08 NaN
2011-01-10 3.931561
2011-01-12 NaN
dtype: float64
ts[::2]將ts中的元素每?jī)蓚€(gè)選取出一個(gè)。
Recall that ts[::2] selects every second element in ts.
pandas使用NumPy的datetime64數(shù)據(jù)類型以納秒的分辨率存儲(chǔ)時(shí)間戳:
pandas stores timestamps using NumPy’s datetime64 data type at the nanosecond resolution:
In [45]: ts.index.dtype
Out[45]: dtype('<M8[ns]')
DatetimeIndex中的各個(gè)標(biāo)量值是pandas的Timestamp對(duì)象:
Scalar values from a DatetimeIndex are pandas Timestamp objects:
In [46]: stamp = ts.index[0]
In [47]: stamp
Out[47]: Timestamp('2011-01-02 00:00:00')
在任何需要使用datetime對(duì)象的地方,都可以用Timestamp對(duì)象替代。此外,Timestamp對(duì)象還可以存儲(chǔ)頻率信息(如果有的話),且懂得如何執(zhí)行時(shí)區(qū)轉(zhuǎn)換以及其它種類的操作。稍后將對(duì)此進(jìn)行詳細(xì)講解。
A Timestamp can be substituted anywhere you would use a datetime object. Additionally, it can store frequency information (if any) and understands how to do time zone conversions and other kinds of manipulations. More on both of these things later.
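gg注:Timestamp對(duì)象之所以能在任何需要datetime對(duì)象的地方使用,是因?yàn)樗旧砭褪莇atetime的子類(譯者補(bǔ)充示例,非原書內(nèi)容):

```python
from datetime import datetime

import pandas as pd

stamp = pd.Timestamp('2011-01-02')
print(isinstance(stamp, datetime))  # True:可直接傳給任何接受datetime的代碼
print(stamp.to_pydatetime())        # 也可顯式轉(zhuǎn)換回純Python的datetime對(duì)象
```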
11.2.1 索引身弊、選取、子集構(gòu)造
Indexing, Selection, Subsetting
當(dāng)你基于標(biāo)簽索引和選取數(shù)據(jù)時(shí),時(shí)間序列的行為和任何其它的pandas.Series很像:
Time series behaves like any other pandas.Series when you are indexing and selecting data based on label:
In [48]: stamp = ts.index[2]
In [49]: ts[stamp]
Out[49]: -0.51943871505673811
為了方便起見(jiàn),你還可以傳入一個(gè)可解釋為日期的字符串:
As a convenience, you can also pass a string that is interpretable as a date:
In [50]: ts['1/10/2011']
Out[50]: 1.9657805725027142
In [51]: ts['20110110']
Out[51]: 1.9657805725027142
對(duì)于較長(zhǎng)的時(shí)間序列,可以傳入“年”或“年月”來(lái)輕松地選取數(shù)據(jù)的切片(slices of data):
For longer time series, a year or only a year and month can be passed to easily select slices of data:
In [52]: longer_ts = pd.Series(np.random.randn(1000),
....: index=pd.date_range('1/1/2000', periods=1000))
In [53]: longer_ts
Out[53]:
2000-01-01 0.092908
2000-01-02 0.281746
2000-01-03 0.769023
2000-01-04 1.246435
2000-01-05 1.007189
2000-01-06 -1.296221
2000-01-07 0.274992
2000-01-08 0.228913
2000-01-09 1.352917
2000-01-10 0.886429
...
2002-09-17 -0.139298
2002-09-18 -1.159926
2002-09-19 0.618965
2002-09-20 1.373890
2002-09-21 -0.983505
2002-09-22 0.930944
2002-09-23 -0.811676
2002-09-24 -1.830156
2002-09-25 -0.138730
2002-09-26 0.334088
Freq: D, Length: 1000, dtype: float64
In [54]: longer_ts['2001']
Out[54]:
2001-01-01 1.599534
2001-01-02 0.474071
2001-01-03 0.151326
2001-01-04 -0.542173
2001-01-05 -0.475496
2001-01-06 0.106403
2001-01-07 -1.308228
2001-01-08 2.173185
2001-01-09 0.564561
2001-01-10 -0.190481
...
2001-12-22 0.000369
2001-12-23 0.900885
2001-12-24 -0.454869
2001-12-25 -0.864547
2001-12-26 1.129120
2001-12-27 0.057874
2001-12-28 -0.433739
2001-12-29 0.092698
2001-12-30 -1.397820
2001-12-31 1.457823
Freq: D, Length: 365, dtype: float64
在這里,字符串“2001”被解釋為一個(gè)年份,并選取該時(shí)間段。如果指定月份,這也是有效的:
Here, the string '2001' is interpreted as a year and selects that time period. This also works if you specify the month:
In [55]: longer_ts['2001-05']
Out[55]:
2001-05-01 -0.622547
2001-05-02 0.936289
2001-05-03 0.750018
2001-05-04 -0.056715
2001-05-05 2.300675
2001-05-06 0.569497
2001-05-07 1.489410
2001-05-08 1.264250
2001-05-09 -0.761837
2001-05-10 -0.331617
...
2001-05-22 0.503699
2001-05-23 -1.387874
2001-05-24 0.204851
2001-05-25 0.603705
2001-05-26 0.545680
2001-05-27 0.235477
2001-05-28 0.111835
2001-05-29 -1.251504
2001-05-30 -2.949343
2001-05-31 0.634634
Freq: D, Length: 31, dtype: float64
使用datetime對(duì)象進(jìn)行切片同樣有效:
Slicing with datetime objects works as well:
In [56]: ts[datetime(2011, 1, 7):] # gg注:ts['2011-01-07':]也可
Out[56]:
2011-01-07 -0.519439
2011-01-08 -0.555730
2011-01-10 1.965781
2011-01-12 1.393406
dtype: float64
因?yàn)榇蟛糠謺r(shí)間序列數(shù)據(jù)都是按時(shí)間順序排列的,所以可以使用不存在于時(shí)間序列中的時(shí)間戳進(jìn)行切片,以執(zhí)行范圍查詢:
Because most time series data is ordered chronologically, you can slice with timestamps not contained in a time series to perform a range query:
In [57]: ts
Out[57]:
2011-01-02 -0.204708
2011-01-05 0.478943
2011-01-07 -0.519439
2011-01-08 -0.555730
2011-01-10 1.965781
2011-01-12 1.393406
dtype: float64
In [58]: ts['1/6/2011':'1/11/2011']
Out[58]:
2011-01-07 -0.519439
2011-01-08 -0.555730
2011-01-10 1.965781
dtype: float64
和以前一樣,你可以傳入字符串日期、datetime對(duì)象或時(shí)間戳。請(qǐng)記住,以這種方式進(jìn)行切片會(huì)在源時(shí)間序列上產(chǎn)生視圖,就像對(duì)NumPy數(shù)組進(jìn)行切片一樣。這意味著沒(méi)有數(shù)據(jù)被復(fù)制,切片上的修改會(huì)反映在原始數(shù)據(jù)中。
As before, you can pass either a string date, datetime, or timestamp. Remember that slicing in this manner produces views on the source time series like slicing NumPy arrays. This means that no data is copied and modifications on the slice will be reflected in the original data.
有一個(gè)等價(jià)的實(shí)例方法truncate,它在兩個(gè)日期之間對(duì)Series進(jìn)行切片:
There is an equivalent instance method, truncate, that slices a Series between two dates:
In [59]: ts.truncate(after='1/9/2011')
Out[59]:
2011-01-02 -0.204708
2011-01-05 0.478943
2011-01-07 -0.519439
2011-01-08 -0.555730
dtype: float64
所有這些也適用于DataFrame,例如,對(duì)DataFrame的行進(jìn)行索引:
All of this holds true for DataFrame as well, indexing on its rows:
In [60]: dates = pd.date_range('1/1/2000', periods=100, freq='W-WED')
In [61]: long_df = pd.DataFrame(np.random.randn(100, 4),
....: index=dates,
....: columns=['Colorado', 'Texas',
....: 'New York', 'Ohio'])
In [62]: long_df.loc['5-2001']
Out[62]:
Colorado Texas New York Ohio
2001-05-02 -0.006045 0.490094 -0.277186 -0.707213
2001-05-09 -0.560107 2.735527 0.927335 1.513906
2001-05-16 0.538600 1.273768 0.667876 -0.969206
2001-05-23 1.676091 -0.817649 0.050188 1.951312
2001-05-30 3.260383 0.963301 1.201206 -1.852001
11.2.2 帶有重復(fù)索引的時(shí)間序列
Time Series with Duplicate Indices
在某些應(yīng)用場(chǎng)景中,可能會(huì)有多個(gè)數(shù)據(jù)觀察結(jié)果落在同一個(gè)特定的時(shí)間戳上。下面是一個(gè)例子:
In some applications, there may be multiple data observations falling on a particular timestamp. Here is an example:
In [63]: dates = pd.DatetimeIndex(['1/1/2000', '1/2/2000', '1/2/2000',
....: '1/2/2000', '1/3/2000'])
In [64]: dup_ts = pd.Series(np.arange(5), index=dates)
In [65]: dup_ts
Out[65]:
2000-01-01 0
2000-01-02 1
2000-01-02 2
2000-01-02 3
2000-01-03 4
dtype: int64
通過(guò)檢查索引的is_unique屬性,我們可以看出索引不是唯一的:
We can tell that the index is not unique by checking its is_unique property:
In [66]: dup_ts.index.is_unique
Out[66]: False
對(duì)該時(shí)間序列進(jìn)行索引,要么產(chǎn)生標(biāo)量值,要么產(chǎn)生切片,具體取決于時(shí)間戳是否重復(fù):
Indexing into this time series will now either produce scalar values or slices depending on whether a timestamp is duplicated:
In [67]: dup_ts['1/3/2000'] # 不重復(fù)not duplicated
Out[67]: 4
In [68]: dup_ts['1/2/2000'] # 重復(fù)duplicated
Out[68]:
2000-01-02 1
2000-01-02 2
2000-01-02 3
dtype: int64
假設(shè)你想聚合具有非唯一時(shí)間戳的數(shù)據(jù)。一種方式是使用groupby方法并傳入level=0:
Suppose you wanted to aggregate the data having non-unique timestamps. One way to do this is to use groupby and pass level=0:
In [69]: grouped = dup_ts.groupby(level=0)
In [70]: grouped.mean()
Out[70]:
2000-01-01 0
2000-01-02 2
2000-01-03 4
dtype: int64
In [71]: grouped.count()
Out[71]:
2000-01-01 1
2000-01-02 3
2000-01-03 1
dtype: int64
11.3 日期范圍、頻率和移動(dòng)
11.3 Date Ranges, Frequencies, and Shifting
pandas中一般的時(shí)間序列被假定為不規(guī)則的,也就是說(shuō),它們沒(méi)有固定的頻率。對(duì)于很多應(yīng)用場(chǎng)景而言,這已經(jīng)足夠了。但是,經(jīng)常有需要處理固定頻率(例如每日、每月、每15分鐘)的應(yīng)用場(chǎng)景,即使這意味著在時(shí)間序列中引入缺失值。幸運(yùn)的是,pandas有一整套標(biāo)準(zhǔn)時(shí)間序列頻率和工具,用于重采樣、推斷頻率和生成固定頻率的日期范圍。例如,你可以通過(guò)調(diào)用resample方法將樣本時(shí)間序列轉(zhuǎn)換為固定頻率(每日)的時(shí)間序列:
Generic time series in pandas are assumed to be irregular; that is, they have no fixed frequency. For many applications this is sufficient. However, it’s often desirable to work relative to a fixed frequency, such as daily, monthly, or every 15 minutes, even if that means introducing missing values into a time series. Fortunately pandas has a full suite of standard time series frequencies and tools for resampling, inferring frequencies, and generating fixed-frequency date ranges. For example, you can convert the sample time series to be fixed daily frequency by calling resample:
In [72]: ts
Out[72]:
2011-01-02 -0.204708
2011-01-05 0.478943
2011-01-07 -0.519439
2011-01-08 -0.555730
2011-01-10 1.965781
2011-01-12 1.393406
dtype: float64
In [73]: resampler = ts.resample('D')
字符串“D”被解釋為“每日”的頻率。
The string 'D' is interpreted as daily frequency.
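gg注:resample返回的是一個(gè)中間對(duì)象,還需要再調(diào)用聚合方法或asfreq等方法才能得到結(jié)果。例如用asfreq可以直觀看到轉(zhuǎn)為固定每日頻率后引入的缺失值(譯者補(bǔ)充示例,非原書內(nèi)容):

```python
from datetime import datetime

import numpy as np
import pandas as pd

dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7),
         datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]
ts = pd.Series(np.random.randn(6), index=dates)

daily = ts.resample('D').asfreq()  # 轉(zhuǎn)為每日頻率,原本沒(méi)有數(shù)據(jù)的日期為NaN
print(len(daily))                  # 11:2011-01-02到2011-01-12共11天
print(int(daily.isna().sum()))     # 5:其中5天沒(méi)有觀測(cè)值
```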
頻率之間的轉(zhuǎn)換(或重采樣)是一個(gè)足夠大的主題,稍后有一節(jié)來(lái)講解(11.6節(jié))。這里,我將向你展示如何使用基本頻率(base frequency)及其倍數(shù)。
Conversion between frequencies or resampling is a big enough topic to have its own section later (Section 11.6, “Resampling and Frequency Conversion,” on page 348). Here I’ll show you how to use the base frequencies and multiples thereof.
11.3.1 生成日期范圍
Generating Date Ranges
雖然我之前未加解釋就使用過(guò)它,但pandas.date_range函數(shù)負(fù)責(zé)根據(jù)特定頻率生成指定長(zhǎng)度的DatetimeIndex:
While I used it previously without explanation, pandas.date_range is responsible for generating a DatetimeIndex with an indicated length according to a particular frequency:
In [74]: idx = pd.date_range('2012-04-01', '2012-06-01') # gg注:為避免歧義,變量名從原文的index改為idx
In [75]: idx
Out[75]:
DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04',
'2012-04-05', '2012-04-06', '2012-04-07', '2012-04-08',
'2012-04-09', '2012-04-10', '2012-04-11', '2012-04-12',
'2012-04-13', '2012-04-14', '2012-04-15', '2012-04-16',
'2012-04-17', '2012-04-18', '2012-04-19', '2012-04-20',
'2012-04-21', '2012-04-22', '2012-04-23', '2012-04-24',
'2012-04-25', '2012-04-26', '2012-04-27', '2012-04-28',
'2012-04-29', '2012-04-30', '2012-05-01', '2012-05-02',
'2012-05-03', '2012-05-04', '2012-05-05', '2012-05-06',
'2012-05-07', '2012-05-08', '2012-05-09', '2012-05-10',
'2012-05-11', '2012-05-12', '2012-05-13', '2012-05-14',
'2012-05-15', '2012-05-16', '2012-05-17', '2012-05-18',
'2012-05-19', '2012-05-20', '2012-05-21', '2012-05-22',
'2012-05-23', '2012-05-24', '2012-05-25', '2012-05-26',
'2012-05-27', '2012-05-28', '2012-05-29', '2012-05-30',
'2012-05-31', '2012-06-01'],
dtype='datetime64[ns]', freq='D')
默認(rèn)情況下,pandas.date_range函數(shù)生成“每日”的時(shí)間戳。如果只傳入起始日期或結(jié)束日期,則必須傳入要生成的時(shí)期數(shù)(number of periods):
By default, date_range generates daily timestamps. If you pass only a start or end date, you must pass a number of periods to generate:
In [76]: pd.date_range(start='2012-04-01', periods=20)
Out[76]:
DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04',
'2012-04-05', '2012-04-06', '2012-04-07', '2012-04-08',
'2012-04-09', '2012-04-10', '2012-04-11', '2012-04-12',
'2012-04-13', '2012-04-14', '2012-04-15', '2012-04-16',
'2012-04-17', '2012-04-18', '2012-04-19', '2012-04-20'],
dtype='datetime64[ns]', freq='D')
In [77]: pd.date_range(end='2012-06-01', periods=20)
Out[77]:
DatetimeIndex(['2012-05-13', '2012-05-14', '2012-05-15', '2012-05-16',
'2012-05-17', '2012-05-18', '2012-05-19', '2012-05-20',
'2012-05-21', '2012-05-22', '2012-05-23', '2012-05-24',
'2012-05-25', '2012-05-26', '2012-05-27', '2012-05-28',
'2012-05-29', '2012-05-30', '2012-05-31', '2012-06-01'],
dtype='datetime64[ns]', freq='D')
起始日期和結(jié)束日期為生成的日期索引定義了嚴(yán)格的邊界。例如,如果你想要一個(gè)包含每月最后一個(gè)工作日的日期索引,只需要傳入“BM”頻率(每月最后一個(gè)工作日;更完整的頻率清單見(jiàn)表11-4),這樣只會(huì)包括落在日期間隔上或日期間隔內(nèi)的日期:
The start and end dates define strict boundaries for the generated date index. For example, if you wanted a date index containing the last business day of each month, you would pass the 'BM' frequency (business end of month; see more complete listing of frequencies in Table 11-4) and only dates falling on or inside the date interval will be included:
In [78]: pd.date_range('2000-01-01', '2000-12-01', freq='BM')
Out[78]:
DatetimeIndex(['2000-01-31', '2000-02-29', '2000-03-31', '2000-04-28',
'2000-05-31', '2000-06-30', '2000-07-31', '2000-08-31',
'2000-09-29', '2000-10-31', '2000-11-30'],
dtype='datetime64[ns]', freq='BM')
表11-4:時(shí)間序列的基本頻率(不全面)
Table 11-4. Base time series frequencies (not comprehensive)
pandas.date_range函數(shù)默認(rèn)保留起始時(shí)間戳或結(jié)束時(shí)間戳的時(shí)間(如果有的話):
date_range by default preserves the time (if any) of the start or end timestamp:
In [79]: pd.date_range('2012-05-02 12:56:31', periods=5)
Out[79]:
DatetimeIndex(['2012-05-02 12:56:31', '2012-05-03 12:56:31',
'2012-05-04 12:56:31', '2012-05-05 12:56:31',
'2012-05-06 12:56:31'],
dtype='datetime64[ns]', freq='D')
有時(shí),雖然起始日期或結(jié)束日期帶有時(shí)間信息,但希望生成一組標(biāo)準(zhǔn)化到午夜的時(shí)間戳。為此,有一個(gè)normalize參數(shù)(gg注:option的直譯是“選項(xiàng)”,為“詞匯前后一致”本文采用意譯“參數(shù)”):
Sometimes you will have start or end dates with time information but want to generate a set of timestamps normalized to midnight as a convention. To do this, there is a normalize option:
In [80]: pd.date_range('2012-05-02 12:56:31', periods=5, normalize=True)
Out[80]:
DatetimeIndex(['2012-05-02', '2012-05-03', '2012-05-04', '2012-05-05',
'2012-05-06'],
dtype='datetime64[ns]', freq='D')
11.3.2 頻率和日期偏移量
Frequencies and Date Offsets
pandas中的頻率由基本頻率(base frequency)和乘數(shù)(multiplier)組成。基本頻率通常由一個(gè)字符串別名引用,例如“M”表示每月、“H”表示每小時(shí)。對(duì)于每個(gè)基本頻率,都有一個(gè)被定義為日期偏移量(date offset)的對(duì)象。例如,“每小時(shí)”的頻率可以用Hour類表示:
Frequencies in pandas are composed of a base frequency and a multiplier. Base frequencies are typically referred to by a string alias, like 'M' for monthly or 'H' for hourly. For each base frequency, there is an object defined generally referred to as a date offset. For example, hourly frequency can be represented with the Hour class:
In [81]: from pandas.tseries.offsets import Hour, Minute
In [82]: one_hour = Hour() # gg注:為避免歧義,變量名從原文的hour改為one_hour
In [83]: one_hour
Out[83]: <Hour>
你可以傳入一個(gè)整數(shù)來(lái)定義偏移量的倍數(shù):
You can define a multiple of an offset by passing an integer:
In [84]: four_hours = Hour(4)
In [85]: four_hours
Out[85]: <4 * Hours>
在大部分應(yīng)用場(chǎng)景中,不需要顯式地創(chuàng)建這些對(duì)象,而是使用例如“H”或“4H”的字符串別名。在基本頻率前放一個(gè)整數(shù)即可創(chuàng)建偏移量的倍數(shù):
In most applications, you would never need to explicitly create one of these objects, instead using a string alias like 'H' or '4H'. Putting an integer before the base frequency creates a multiple:
In [86]: pd.date_range('2000-01-01', '2000-01-03 23:59', freq='4h')
Out[86]:
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 04:00:00',
'2000-01-01 08:00:00', '2000-01-01 12:00:00',
'2000-01-01 16:00:00', '2000-01-01 20:00:00',
'2000-01-02 00:00:00', '2000-01-02 04:00:00',
'2000-01-02 08:00:00', '2000-01-02 12:00:00',
'2000-01-02 16:00:00', '2000-01-02 20:00:00',
'2000-01-03 00:00:00', '2000-01-03 04:00:00',
'2000-01-03 08:00:00', '2000-01-03 12:00:00',
'2000-01-03 16:00:00', '2000-01-03 20:00:00'],
dtype='datetime64[ns]', freq='4H')
多個(gè)偏移量可以通過(guò)加法組合在一起:
Many offsets can be combined together by addition:
In [87]: Hour() + Minute(30) # gg注:結(jié)合上下文,作者想計(jì)算的是1h30min
Out[87]: <90 * Minutes>
類似地,你可以傳入頻率字符串,例如“1h30min”,該字符串將被有效地解析為相同的表達(dá)式:
Similarly, you can pass frequency strings, like '1h30min', that will effectively be parsed to the same expression:
In [88]: pd.date_range('2000-01-01', periods=10, freq='1h30min')
Out[88]:
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 01:30:00',
'2000-01-01 03:00:00', '2000-01-01 04:30:00',
'2000-01-01 06:00:00', '2000-01-01 07:30:00',
'2000-01-01 09:00:00', '2000-01-01 10:30:00',
'2000-01-01 12:00:00', '2000-01-01 13:30:00'],
dtype='datetime64[ns]', freq='90T')
有些頻率描述的時(shí)間點(diǎn)不是均勻間隔的。例如,“M”(每月最后一個(gè)日歷日)和“BM”(每月最后一個(gè)工作日)取決于一個(gè)月的天數(shù),在后一種情況下,還要考慮這個(gè)月是否在周末結(jié)束。我們將這些稱為錨定偏移量(anchored offset)。
Some frequencies describe points in time that are not evenly spaced. For example, 'M' (calendar month end) and 'BM' (last business/weekday of month) depend on the number of days in a month and, in the latter case, whether the month ends on a weekend or not. We refer to these as anchored offsets.
pandas中可用的頻率代碼和日期偏移量類型的清單,請(qǐng)參閱表11-4。
Refer back to Table 11-4 for a listing of frequency codes and date offset classes available in pandas.
用戶可以自定義頻率類(frequency class)來(lái)提供pandas中沒(méi)有的日期邏輯,但完整細(xì)節(jié)不在本書的范圍之內(nèi)。
Users can define their own custom frequency classes to provide date logic not available in pandas, though the full details of that are outside the scope of this book.
11.3.2.1 WOM日期
Week of month dates
WOM(Week Of Month)是一個(gè)有用的頻率類。它使你能夠獲得例如“每月第三個(gè)星期五”的日期:
One useful frequency class is “week of month,” starting with WOM. This enables you to get dates like the third Friday of each month:
In [89]: rng = pd.date_range('2012-01-01', '2012-09-01', freq='WOM-3FRI')
In [90]: list(rng)
Out[90]:
[Timestamp('2012-01-20 00:00:00', freq='WOM-3FRI'),
Timestamp('2012-02-17 00:00:00', freq='WOM-3FRI'),
Timestamp('2012-03-16 00:00:00', freq='WOM-3FRI'),
Timestamp('2012-04-20 00:00:00', freq='WOM-3FRI'),
Timestamp('2012-05-18 00:00:00', freq='WOM-3FRI'),
Timestamp('2012-06-15 00:00:00', freq='WOM-3FRI'),
Timestamp('2012-07-20 00:00:00', freq='WOM-3FRI'),
Timestamp('2012-08-17 00:00:00', freq='WOM-3FRI')]
11.3.3 對(duì)數(shù)據(jù)進(jìn)行移動(dòng)(超前和滯后)
Shifting (Leading and Lagging) Data
移動(dòng)(shifting)是指通過(guò)時(shí)間向后和向前移動(dòng)數(shù)據(jù)。Series和DataFrame都有一個(gè)shift方法用于進(jìn)行樸素的(naive)向前或向后移動(dòng),而且保持索引不變:
“Shifting” refers to moving data backward and forward through time. Both Series and DataFrame have a shift method for doing naive shifts forward or backward, leaving the index unmodified:
In [91]: ts = pd.Series(np.random.randn(4),
....: index=pd.date_range('1/1/2000', periods=4, freq='M'))
In [92]: ts
Out[92]:
2000-01-31 -0.066748
2000-02-29 0.838639
2000-03-31 -0.117388
2000-04-30 -0.517795
Freq: M, dtype: float64
In [93]: ts.shift(2)
Out[93]:
2000-01-31 NaN
2000-02-29 NaN
2000-03-31 -0.066748
2000-04-30 0.838639
Freq: M, dtype: float64
In [94]: ts.shift(-2)
Out[94]:
2000-01-31 -0.117388
2000-02-29 -0.517795
2000-03-31 NaN
2000-04-30 NaN
Freq: M, dtype: float64
當(dāng)我們這樣進(jìn)行移動(dòng)時(shí),會(huì)在時(shí)間序列的起始處或結(jié)束處引入缺失數(shù)據(jù)。
When we shift like this, missing data is introduced either at the start or the end of the time series.
shift方法的一個(gè)常見(jiàn)用途是計(jì)算一個(gè)時(shí)間序列或多個(gè)時(shí)間序列(如DataFrame的列)中的百分比變化(percent change)。這表示為:
A common use of shift is computing percent changes in a time series or multiple time series as DataFrame columns. This is expressed as:
ts / ts.shift(1) - 1
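gg注:這個(gè)表達(dá)式與pandas內(nèi)置的pct_change方法效果相當(dāng)(譯者補(bǔ)充示例,非原書內(nèi)容):

```python
import pandas as pd

ts_demo = pd.Series([100.0, 110.0, 99.0],
                    index=pd.date_range('2000-01-01', periods=3))
pct = ts_demo / ts_demo.shift(1) - 1
print(pct)  # 第一個(gè)值為NaN,其余依次約為0.1和-0.1
```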
由于樸素的移動(dòng)保持索引不變,因此一些數(shù)據(jù)被丟棄。如果頻率已知,則可以將其傳入shift方法以移動(dòng)時(shí)間戳而不僅僅是數(shù)據(jù):
Because naive shifts leave the index unmodified, some data is discarded. Thus if the frequency is known, it can be passed to shift to advance the timestamps instead of simply the data:
In [95]: ts.shift(2, freq='M')
Out[95]:
2000-03-31 -0.066748
2000-04-30 0.838639
2000-05-31 -0.117388
2000-06-30 -0.517795
Freq: M, dtype: float64
也可以傳入其它頻率,這樣你就能靈活地對(duì)數(shù)據(jù)進(jìn)行超前和滯后處理了:
Other frequencies can be passed, too, giving you some flexibility in how to lead and lag the data:
In [96]: ts.shift(3, freq='D') # gg注:效果等同于ts.shift(1, freq='3D')
Out[96]:
2000-02-03 -0.066748
2000-03-03 0.838639
2000-04-03 -0.117388
2000-05-03 -0.517795
dtype: float64
In [97]: ts.shift(1, freq='90T') # gg注:效果等同于ts.shift(90, freq='T')
Out[97]:
2000-01-31 01:30:00 -0.066748
2000-02-29 01:30:00 0.838639
2000-03-31 01:30:00 -0.117388
2000-04-30 01:30:00 -0.517795
Freq: M, dtype: float64
這里的“T”代表分鐘。
The T here stands for minutes.
11.3.3.1 通過(guò)偏移量對(duì)日期進(jìn)行移動(dòng)
Shifting dates with offsets
pandas的日期偏移量還可以與datetime對(duì)象或Timestamp對(duì)象一起使用:
The pandas date offsets can also be used with datetime or Timestamp objects:
In [98]: from pandas.tseries.offsets import Day, MonthEnd
In [99]: now = datetime(2011, 11, 17)
In [100]: now + 3 * Day()
Out[100]: Timestamp('2011-11-20 00:00:00')
如果加的是錨定偏移量(例如MonthEnd),則第一個(gè)增量會(huì)將原日期“向前滾動(dòng)”到符合頻率規(guī)則的下一個(gè)日期:
If you add an anchored offset like MonthEnd, the first increment will “roll forward” a date to the next date according to the frequency rule:
In [101]: now + MonthEnd()
Out[101]: Timestamp('2011-11-30 00:00:00')
In [102]: now + MonthEnd(2)
Out[102]: Timestamp('2011-12-31 00:00:00')
通過(guò)錨定偏移量的rollforward方法和rollback方法,可顯式地將日期向前或向后“滾動(dòng)”:
Anchored offsets can explicitly “roll” dates forward or backward by simply using their rollforward and rollback methods, respectively:
In [103]: offset = MonthEnd()
In [104]: offset.rollforward(now)
Out[104]: Timestamp('2011-11-30 00:00:00')
In [105]: offset.rollback(now)
Out[105]: Timestamp('2011-10-31 00:00:00')
日期偏移量的一個(gè)創(chuàng)造性用法是與groupby方法一起使用rollforward方法或rollback方法:
A creative use of date offsets is to use these methods with groupby:
In [106]: ts = pd.Series(np.random.randn(20),
.....: index=pd.date_range('1/15/2000', periods=20, freq='4d'))
In [107]: ts
Out[107]:
2000-01-15 -0.116696
2000-01-19 2.389645
2000-01-23 -0.932454
2000-01-27 -0.229331
2000-01-31 -1.140330
2000-02-04 0.439920
2000-02-08 -0.823758
2000-02-12 -0.520930
2000-02-16 0.350282
2000-02-20 0.204395
2000-02-24 0.133445
2000-02-28 0.327905
2000-03-03 0.072153
2000-03-07 0.131678
2000-03-11 -1.297459
2000-03-15 0.997747
2000-03-19 0.870955
2000-03-23 -0.991253
2000-03-27 0.151699
2000-03-31 1.266151
Freq: 4D, dtype: float64
In [108]: ts.groupby(offset.rollforward).mean()
Out[108]:
2000-01-31 -0.005833
2000-02-29 0.015894
2000-03-31 0.150209
dtype: float64
當(dāng)然,更簡(jiǎn)單更快捷的方式是使用resample方法(11.6節(jié)將對(duì)此進(jìn)行詳細(xì)講解):
Of course, an easier and faster way to do this is using resample (we’ll discuss this in much more depth in Section 11.6, “Resampling and Frequency Conversion,” on page 348):
In [109]: ts.resample('M').mean()
Out[109]:
2000-01-31 -0.005833
2000-02-29 0.015894
2000-03-31 0.150209
Freq: M, dtype: float64
11.4 時(shí)區(qū)處理
11.4 Time Zone Handling
處理時(shí)區(qū)(time zone)通常被認(rèn)為是時(shí)間序列操作中最令人不快的部分之一。因此,很多人選擇協(xié)調(diào)世界時(shí)(coordinated universal time, UTC)來(lái)處理時(shí)間序列。協(xié)調(diào)世界時(shí)是格林尼治標(biāo)準(zhǔn)時(shí)間(Greenwich Mean Time, GMT)的繼任者,也是目前的國(guó)際標(biāo)準(zhǔn)。時(shí)區(qū)是以與UTC的偏移量形式表示的。例如,在夏令時(shí)(daylight saving time, DST)期間紐約比UTC晚4個(gè)小時(shí),而在全年其它時(shí)間則比UTC晚5個(gè)小時(shí)。
Working with time zones is generally considered one of the most unpleasant parts of time series manipulation. As a result, many time series users choose to work with time series in coordinated universal time or UTC, which is the successor to Greenwich Mean Time and is the current international standard. Time zones are expressed as offsets from UTC; for example, New York is four hours behind UTC during daylight saving time and five hours behind the rest of the year.
在Python中,時(shí)區(qū)信息來(lái)自第三方pytz庫(kù)(可通過(guò)pip或conda安裝),它公開(kāi)了Olson數(shù)據(jù)庫(kù)(世界時(shí)區(qū)信息的匯編)。這對(duì)歷史數(shù)據(jù)特別重要,因?yàn)?strong>夏令時(shí)轉(zhuǎn)變?nèi)掌?/strong>(甚至UTC偏移量)已經(jīng)根據(jù)地方政府的突發(fā)奇想改變了很多次。在美國(guó),**夏令時(shí)轉(zhuǎn)變?nèi)掌?*自1900年以來(lái)已經(jīng)改變了很多次!
In Python, time zone information comes from the third-party pytz library (installable with pip or conda), which exposes the Olson database, a compilation of world time zone information. This is especially important for historical data because the daylight saving time (DST) transition dates (and even UTC offsets) have been changed numerous times depending on the whims of local governments. In the United States, the DST transition times have been changed many times since 1900!
有關(guān)pytz庫(kù)的詳細(xì)信息,請(qǐng)查閱該庫(kù)的官方文檔。就本書而言,pandas封裝了pytz庫(kù)的功能,這樣你就可以忽略它在時(shí)區(qū)名稱之外的API。時(shí)區(qū)名稱可以交互式地找到,也可以在官方文檔中找到:
For detailed information about the pytz library, you’ll need to look at that library’s documentation. As far as this book is concerned, pandas wraps pytz’s functionality so you can ignore its API outside of the time zone names. Time zone names can be found interactively and in the docs:
In [110]: import pytz
In [111]: pytz.common_timezones[-5:]
Out[111]: ['US/Eastern', 'US/Hawaii', 'US/Mountain', 'US/Pacific', 'UTC']
要從pytz庫(kù)獲取時(shí)區(qū)對(duì)象,請(qǐng)使用pytz.timezone函數(shù):
To get a time zone object from pytz, use pytz.timezone:
In [112]: tz = pytz.timezone('America/New_York')
In [113]: tz
Out[113]: <DstTzInfo 'America/New_York' LMT-1 day, 19:04:00 STD>
pandas中的方法既可以接受時(shí)區(qū)名稱也可以接受時(shí)區(qū)對(duì)象。
Methods in pandas will accept either time zone names or these objects.
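gg注:以下為譯者補(bǔ)充的示意代碼(非原書內(nèi)容),演示向tz_localize方法傳入時(shí)區(qū)名稱字符串與傳入pytz時(shí)區(qū)對(duì)象的效果相同:

```python
import pandas as pd
import pytz

rng = pd.date_range('2012-03-09 09:30', periods=3, freq='D')

# 傳入時(shí)區(qū)名稱字符串
by_name = rng.tz_localize('America/New_York')

# 傳入pytz時(shí)區(qū)對(duì)象,結(jié)果相同
by_obj = rng.tz_localize(pytz.timezone('America/New_York'))

assert by_name.equals(by_obj)
```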
11.4.1 時(shí)區(qū)本地化和轉(zhuǎn)換
Time Zone Localization and Conversion
默認(rèn)情況下,pandas中的時(shí)間序列是時(shí)區(qū)樸素的(time zone naive)。例如,考慮以下時(shí)間序列:
By default, time series in pandas are time zone naive. For example, consider the following time series:
In [114]: rng = pd.date_range('3/9/2012 9:30', periods=6, freq='D')
In [115]: ts = pd.Series(np.random.randn(len(rng)), index=rng)
In [116]: ts
Out[116]:
2012-03-09 09:30:00 -0.202469
2012-03-10 09:30:00 0.050718
2012-03-11 09:30:00 0.639869
2012-03-12 09:30:00 0.597594
2012-03-13 09:30:00 -0.797246
2012-03-14 09:30:00 0.472879
Freq: D, dtype: float64
其索引的tz屬性是None:
The index’s tz field is None:
In [117]: print(ts.index.tz)
None
生成日期范圍時(shí)可以設(shè)置時(shí)區(qū)(time zone):
Date ranges can be generated with a time zone set:
In [118]: pd.date_range('3/9/2012 9:30', periods=10, freq='D', tz='UTC')
Out[118]:
DatetimeIndex(['2012-03-09 09:30:00+00:00', '2012-03-10 09:30:00+00:00',
'2012-03-11 09:30:00+00:00', '2012-03-12 09:30:00+00:00',
'2012-03-13 09:30:00+00:00', '2012-03-14 09:30:00+00:00',
'2012-03-15 09:30:00+00:00', '2012-03-16 09:30:00+00:00',
'2012-03-17 09:30:00+00:00', '2012-03-18 09:30:00+00:00'],
dtype='datetime64[ns, UTC]', freq='D')
從樸素到本地化的轉(zhuǎn)換是通過(guò)tz_localize方法處理的:
Conversion from naive to localized is handled by the tz_localize method:
In [119]: ts
Out[119]:
2012-03-09 09:30:00 -0.202469
2012-03-10 09:30:00 0.050718
2012-03-11 09:30:00 0.639869
2012-03-12 09:30:00 0.597594
2012-03-13 09:30:00 -0.797246
2012-03-14 09:30:00 0.472879
Freq: D, dtype: float64
In [120]: ts_utc = ts.tz_localize('UTC')
In [121]: ts_utc
Out[121]:
2012-03-09 09:30:00+00:00 -0.202469
2012-03-10 09:30:00+00:00 0.050718
2012-03-11 09:30:00+00:00 0.639869
2012-03-12 09:30:00+00:00 0.597594
2012-03-13 09:30:00+00:00 -0.797246
2012-03-14 09:30:00+00:00 0.472879
Freq: D, dtype: float64
In [122]: ts_utc.index
Out[122]:
DatetimeIndex(['2012-03-09 09:30:00+00:00', '2012-03-10 09:30:00+00:00',
'2012-03-11 09:30:00+00:00', '2012-03-12 09:30:00+00:00',
'2012-03-13 09:30:00+00:00', '2012-03-14 09:30:00+00:00'],
dtype='datetime64[ns, UTC]', freq='D')
一旦時(shí)間序列被本地化到某個(gè)特定的時(shí)區(qū),就可以通過(guò)tz_convert方法將其轉(zhuǎn)換到另一個(gè)時(shí)區(qū):
Once a time series has been localized to a particular time zone, it can be converted to another time zone with tz_convert:
In [123]: ts_utc.tz_convert('America/New_York')
Out[123]:
2012-03-09 04:30:00-05:00 -0.202469
2012-03-10 04:30:00-05:00 0.050718
2012-03-11 05:30:00-04:00 0.639869
2012-03-12 05:30:00-04:00 0.597594
2012-03-13 05:30:00-04:00 -0.797246
2012-03-14 05:30:00-04:00 0.472879
Freq: D, dtype: float64
在前面的時(shí)間序列中(它跨越了America/New_York時(shí)區(qū)的夏令時(shí)轉(zhuǎn)變),我們可以將其本地化到美國(guó)東部標(biāo)準(zhǔn)時(shí)間(Eastern Standard Time, EST),然后轉(zhuǎn)換到UTC或柏林時(shí)間:
In the case of the preceding time series, which straddles a DST transition in the America/New_York time zone, we could localize to EST and convert to, say, UTC or Berlin time:
In [124]: ts_eastern = ts.tz_localize('America/New_York')
In [125]: ts_eastern.tz_convert('UTC')
Out[125]:
2012-03-09 14:30:00+00:00 -0.202469
2012-03-10 14:30:00+00:00 0.050718
2012-03-11 13:30:00+00:00 0.639869
2012-03-12 13:30:00+00:00 0.597594
2012-03-13 13:30:00+00:00 -0.797246
2012-03-14 13:30:00+00:00 0.472879
Freq: D, dtype: float64
In [126]: ts_eastern.tz_convert('Europe/Berlin')
Out[126]:
2012-03-09 15:30:00+01:00 -0.202469
2012-03-10 15:30:00+01:00 0.050718
2012-03-11 14:30:00+01:00 0.639869
2012-03-12 14:30:00+01:00 0.597594
2012-03-13 14:30:00+01:00 -0.797246
2012-03-14 14:30:00+01:00 0.472879
Freq: D, dtype: float64
tz_localize和tz_convert也是DatetimeIndex的實(shí)例方法:
tz_localize and tz_convert are also instance methods on DatetimeIndex:
In [127]: ts.index.tz_localize('Asia/Shanghai')
Out[127]:
DatetimeIndex(['2012-03-09 09:30:00+08:00', '2012-03-10 09:30:00+08:00',
'2012-03-11 09:30:00+08:00', '2012-03-12 09:30:00+08:00',
'2012-03-13 09:30:00+08:00', '2012-03-14 09:30:00+08:00'],
dtype='datetime64[ns, Asia/Shanghai]', freq='D')
對(duì)樸素時(shí)間戳的本地化操作還會(huì)檢查夏令時(shí)轉(zhuǎn)變附近含混不清的或不存在的時(shí)間。
Localizing naive timestamps also checks for ambiguous or nonexistent times around daylight saving time transitions.
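gg注:以下為譯者補(bǔ)充的示意代碼(非原書內(nèi)容),演示不存在的時(shí)間和含混不清的時(shí)間在本地化時(shí)的表現(xiàn);其中nonexistent參數(shù)和ambiguous參數(shù)是原書未介紹的,假定使用pandas 0.24及以上版本:

```python
import pandas as pd

# 2012-03-11 02:30在US/Eastern時(shí)區(qū)不存在:夏令時(shí)開(kāi)始時(shí),時(shí)鐘從02:00直接跳到03:00,
# 直接本地化會(huì)引發(fā)NonExistentTimeError
stamp = pd.Timestamp('2012-03-11 02:30')
try:
    stamp.tz_localize('US/Eastern')
except Exception as exc:
    assert type(exc).__name__ == 'NonExistentTimeError'

# nonexistent='shift_forward'參數(shù)可將其前移到下一個(gè)有效時(shí)刻(03:00)
shifted = stamp.tz_localize('US/Eastern', nonexistent='shift_forward')
assert shifted == pd.Timestamp('2012-03-11 03:00', tz='US/Eastern')

# 2012-11-04 01:30出現(xiàn)了兩次(夏令時(shí)結(jié)束時(shí)時(shí)鐘回?fù)?,屬于含混不清的時(shí)間:
# ambiguous=True取夏令時(shí)一側(cè)(UTC-4),ambiguous=False取標(biāo)準(zhǔn)時(shí)間一側(cè)(UTC-5)
amb = pd.Timestamp('2012-11-04 01:30')
dst_side = amb.tz_localize('US/Eastern', ambiguous=True)
std_side = amb.tz_localize('US/Eastern', ambiguous=False)
assert dst_side.utcoffset().total_seconds() == -4 * 3600
assert std_side.utcoffset().total_seconds() == -5 * 3600
```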
11.4.2 時(shí)區(qū)意識(shí)型Timestamp對(duì)象的運(yùn)算
Operations with Time Zone–Aware Timestamp Objects
與時(shí)間序列和日期范圍類似,單獨(dú)的Timestamp對(duì)象也能被從樸素的本地化為時(shí)區(qū)意識(shí)型的(time zone-aware),并從一個(gè)時(shí)區(qū)轉(zhuǎn)換到另一個(gè)時(shí)區(qū):
Similar to time series and date ranges, individual Timestamp objects similarly can be localized from naive to time zone–aware and converted from one time zone to another:
In [128]: stamp = pd.Timestamp('2011-03-12 04:00')
In [129]: stamp_utc = stamp.tz_localize('utc')
In [130]: stamp_utc.tz_convert('America/New_York')
Out[130]: Timestamp('2011-03-11 23:00:00-0500', tz='America/New_York')
在創(chuàng)建Timestamp對(duì)象時(shí),也可以傳入一個(gè)時(shí)區(qū)參數(shù):
You can also pass a time zone when creating the Timestamp:
In [131]: stamp_moscow = pd.Timestamp('2011-03-12 04:00', tz='Europe/Moscow')
In [132]: stamp_moscow
Out[132]: Timestamp('2011-03-12 04:00:00+0300', tz='Europe/Moscow')
時(shí)區(qū)意識(shí)型Timestamp對(duì)象在內(nèi)部存儲(chǔ)了一個(gè)UTC時(shí)間戳數(shù)值——自UNIX紀(jì)元(1970年1月1日)算起的納秒數(shù)。這個(gè)UTC時(shí)間戳數(shù)值在時(shí)區(qū)轉(zhuǎn)換過(guò)程中是不變的:
Time zone–aware Timestamp objects internally store a UTC timestamp value as nanoseconds since the Unix epoch (January 1, 1970); this UTC value is invariant between time zone conversions:
In [133]: stamp_utc.value
Out[133]: 1299902400000000000
In [134]: stamp_utc.tz_convert('America/New_York').value
Out[134]: 1299902400000000000
當(dāng)使用pandas的DateOffset對(duì)象執(zhí)行時(shí)間算術(shù)運(yùn)算時(shí),pandas會(huì)盡可能遵從夏令時(shí)轉(zhuǎn)變。這里我們創(chuàng)建恰好發(fā)生在夏令時(shí)轉(zhuǎn)變前的時(shí)間戳。首先是轉(zhuǎn)變到夏令時(shí)前的30分鐘:
When performing time arithmetic using pandas's DateOffset objects, pandas respects daylight saving time transitions where possible. Here we construct timestamps that occur right before DST transitions (forward and backward). First, 30 minutes before transitioning to DST:
In [135]: from pandas.tseries.offsets import Hour
In [136]: stamp = pd.Timestamp('2012-03-11 01:30', tz='US/Eastern') # gg注:原英文書中有誤零抬,作者的意圖是2012-03-11,而不是2012-03-12
In [137]: stamp
Out[137]: Timestamp('2012-03-11 01:30:00-0500', tz='US/Eastern')
In [138]: stamp + Hour()
Out[138]: Timestamp('2012-03-11 03:30:00-0400', tz='US/Eastern')
接著是從夏令時(shí)轉(zhuǎn)出前的90分鐘:
Then, 90 minutes before transitioning out of DST:
In [139]: stamp = pd.Timestamp('2012-11-04 00:30', tz='US/Eastern')
In [140]: stamp
Out[140]: Timestamp('2012-11-04 00:30:00-0400', tz='US/Eastern')
In [141]: stamp + 2 * Hour()
Out[141]: Timestamp('2012-11-04 01:30:00-0500', tz='US/Eastern')
11.4.3 不同時(shí)區(qū)之間的運(yùn)算
Operations Between Different Time Zones
如果組合兩個(gè)帶有不同時(shí)區(qū)的時(shí)間序列,結(jié)果會(huì)是UTC。由于在底層時(shí)間戳是以UTC存儲(chǔ)的,所以這是個(gè)簡(jiǎn)單運(yùn)算,不需要轉(zhuǎn)換。
If two time series with different time zones are combined, the result will be UTC. Since the timestamps are stored under the hood in UTC, this is a straightforward operation and requires no conversion to happen:
In [142]: rng = pd.date_range('3/7/2012 9:30', periods=10, freq='B')
In [143]: ts = pd.Series(np.random.randn(len(rng)), index=rng)
In [144]: ts
Out[144]:
2012-03-07 09:30:00 0.522356
2012-03-08 09:30:00 -0.546348
2012-03-09 09:30:00 -0.733537
2012-03-12 09:30:00 1.302736
2012-03-13 09:30:00 0.022199
2012-03-14 09:30:00 0.364287
2012-03-15 09:30:00 -0.922839
2012-03-16 09:30:00 0.312656
2012-03-19 09:30:00 -1.128497
2012-03-20 09:30:00 -0.333488
Freq: B, dtype: float64
In [145]: ts1 = ts[:7].tz_localize('Europe/London')
In [146]: ts2 = ts1[2:].tz_convert('Europe/Moscow')
In [147]: result = ts1 + ts2
In [148]: result.index
Out[148]:
DatetimeIndex(['2012-03-07 09:30:00+00:00', '2012-03-08 09:30:00+00:00',
'2012-03-09 09:30:00+00:00', '2012-03-12 09:30:00+00:00',
'2012-03-13 09:30:00+00:00', '2012-03-14 09:30:00+00:00',
'2012-03-15 09:30:00+00:00'],
dtype='datetime64[ns, UTC]', freq='B')
11.5 時(shí)期及其算術(shù)運(yùn)算
Periods and Period Arithmetic
時(shí)期(period)表示的是時(shí)間跨度(timespan),例如數(shù)日、數(shù)月、數(shù)季或數(shù)年。Period類(The Period class)表示的就是這種數(shù)據(jù)類型,其構(gòu)造函數(shù)(pandas.Period)需要一個(gè)“字符串或整數(shù)”以及一個(gè)表11-4中的頻率。
Periods represent timespans, like days, months, quarters, or years. The Period class represents this data type, requiring a string or integer and a frequency from Table 11-4:
In [149]: p = pd.Period(2007, freq='A-DEC') # gg注:p = pd.Period('2007', freq='A-DEC')效果一樣
In [150]: p
Out[150]: Period('2007', 'A-DEC')
在這個(gè)例子中,Period對(duì)象表示的是從2007年1月1日到2007年12月31日(包含在內(nèi))的整個(gè)時(shí)間跨度。在Period對(duì)象上加上或減去一個(gè)整數(shù),即可方便地達(dá)到根據(jù)其頻率進(jìn)行移動(dòng)的效果。
In this case, the Period object represents the full timespan from January 1, 2007, to December 31, 2007, inclusive. Conveniently, adding and subtracting integers from periods has the effect of shifting by their frequency:
In [151]: p + 5
Out[151]: Period('2012', 'A-DEC')
In [152]: p - 2
Out[152]: Period('2005', 'A-DEC')
如果兩個(gè)Period對(duì)象擁有相同的頻率,則它們的差就是它們之間的單位數(shù)量:
If two periods have the same frequency, their difference is the number of units between them:
In [153]: pd.Period('2014', freq='A-DEC') - p
Out[153]: <7 * YearEnds: month=12> # gg注:原英文書中的輸出“7”可能是老版本的
pandas.period_range函數(shù)可用于創(chuàng)建規(guī)則的時(shí)期范圍(range of periods):
Regular ranges of periods can be constructed with the period_range function:
In [154]: rng = pd.period_range('2000-01-01', '2000-06-30', freq='M')
In [155]: rng
Out[155]: PeriodIndex(['2000-01', '2000-02', '2000-03', '2000-04', '2000-05', '2000-06'], dtype='period[M]', freq='M')
PeriodIndex類存儲(chǔ)的是Period對(duì)象的序列,它可以在任何pandas數(shù)據(jù)結(jié)構(gòu)中作為軸索引:
The PeriodIndex class stores a sequence of periods and can serve as an axis index in any pandas data structure:
In [156]: pd.Series(np.random.randn(6), index=rng)
Out[156]:
2000-01 -0.514551
2000-02 -0.559782
2000-03 -0.783408
2000-04 -1.797685
2000-05 -0.172670
2000-06 0.680215
Freq: M, dtype: float64
PeriodIndex類的構(gòu)造函數(shù)(pandas.PeriodIndex)也可以使用字符串?dāng)?shù)組(array of strings):
If you have an array of strings, you can also use the PeriodIndex class:
In [157]: vals = ['2001Q3', '2002Q2', '2003Q1'] # gg注:為避免歧義,變量名從原文的values改為vals
In [158]: idx = pd.PeriodIndex(vals, freq='Q-DEC') # gg注:為避免歧義,變量名從原文的index改為idx
In [159]: idx
Out[159]: PeriodIndex(['2001Q3', '2002Q2', '2003Q1'], dtype='period[Q-DEC]', freq='Q-DEC')
11.5.1 時(shí)期的頻率轉(zhuǎn)換
Period Frequency Conversion
Period對(duì)象和PeriodIndex對(duì)象都可以通過(guò)其asfreq方法被轉(zhuǎn)換到別的頻率。例如,假設(shè)我們有一個(gè)年度時(shí)期(annual period),希望將其轉(zhuǎn)換為當(dāng)年年初或年末的一個(gè)月度時(shí)期(monthly period)。這非常簡(jiǎn)單:
Periods and PeriodIndex objects can be converted to another frequency with their asfreq method. As an example, suppose we had an annual period and wanted to convert it into a monthly period either at the start or end of the year. This is fairly straightforward:
In [160]: p = pd.Period('2007', freq='A-DEC')
In [161]: p
Out[161]: Period('2007', 'A-DEC')
In [162]: p.asfreq('M', how='start') # gg注:p.asfreq(freq='M', how='start')
Out[162]: Period('2007-01', 'M')
In [163]: p.asfreq('M', how='end') # gg注:p.asfreq(freq='M', how='end')
Out[163]: Period('2007-12', 'M')
你可以將Period('2007', 'A-DEC')看作一種游標(biāo),該游標(biāo)指向一個(gè)被劃分為多個(gè)月度時(shí)期的時(shí)間跨度,如圖11-1所示。對(duì)于一個(gè)不以十二月結(jié)束的財(cái)政年度(fiscal year),月度子時(shí)期(monthly subperiods)的歸屬情況就不一樣了:
You can think of Period('2007', 'A-DEC') as being a sort of cursor pointing to a span of time, subdivided by monthly periods. See Figure 11-1 for an illustration of this. For a fiscal year ending on a month other than December, the corresponding monthly subperiods are different:
In [164]: p = pd.Period('2007', freq='A-JUN')
In [165]: p
Out[165]: Period('2007', 'A-JUN')
In [166]: p.asfreq('M', 'start')
Out[166]: Period('2006-07', 'M')
In [167]: p.asfreq('M', 'end')
Out[167]: Period('2007-06', 'M')
在將高頻率轉(zhuǎn)換到低頻率時(shí),超時(shí)期(superperiod)是由子時(shí)期(subperiod)所屬的位置決定的。例如,在A-JUN頻率中,月份“2007年8月”實(shí)際上是“2008時(shí)期”的一部分:
When you are converting from high to low frequency, pandas determines the superperiod depending on where the subperiod “belongs.” For example, in A-JUN frequency, the month Aug-2007 is actually part of the 2008 period:
In [168]: p = pd.Period('Aug-2007', 'M')
In [169]: p.asfreq('A-JUN')
Out[169]: Period('2008', 'A-JUN')
完整的PeriodIndex對(duì)象或時(shí)間序列可以通過(guò)相同的語(yǔ)義進(jìn)行類似地轉(zhuǎn)換:
Whole PeriodIndex objects or time series can be similarly converted with the same semantics:
In [170]: rng = pd.period_range('2006', '2009', freq='A-DEC')
In [171]: ts = pd.Series(np.random.randn(len(rng)), index=rng)
In [172]: ts
Out[172]:
2006 1.607578
2007 0.200381
2008 -0.834068
2009 -0.302988
Freq: A-DEC, dtype: float64
In [173]: ts.asfreq('M', how='start')
Out[173]:
2006-01 1.607578
2007-01 0.200381
2008-01 -0.834068
2009-01 -0.302988
Freq: M, dtype: float64
在這里,年度時(shí)期(annual period)被替換為月度時(shí)期(monthly period),該月度時(shí)期對(duì)應(yīng)于每個(gè)年度時(shí)期內(nèi)的第一個(gè)月。如果我們想要每年的最后一個(gè)工作日,我們可以使用“B”頻率并指定想要該時(shí)期的末尾:
Here, the annual periods are replaced with monthly periods corresponding to the first month falling within each annual period. If we instead wanted the last business day of each year, we can use the 'B' frequency and indicate that we want the end of the period:
In [174]: ts.asfreq('B', how='end')
Out[174]:
2006-12-29 1.607578
2007-12-31 0.200381
2008-12-31 -0.834068
2009-12-31 -0.302988
Freq: B, dtype: float64
11.5.2 季度時(shí)期頻率
Quarterly Period Frequencies
季度數(shù)據(jù)(quarterly data)在會(huì)計(jì)、金融等領(lǐng)域中很常見(jiàn)。許多季度數(shù)據(jù)都會(huì)涉及財(cái)政年度結(jié)束日(fiscal year end)的概念,通常是一年12個(gè)月中某月的最后一個(gè)日歷日或工作日。因此,“2012Q4時(shí)期”根據(jù)財(cái)政年度結(jié)束日的不同會(huì)有不同的含義。pandas支持全部12個(gè)可能的季度頻率,即Q-JAN到Q-DEC:
Quarterly data is standard in accounting, finance, and other fields. Much quarterly data is reported relative to a fiscal year end, typically the last calendar or business day of one of the 12 months of the year. Thus, the period 2012Q4 has a different meaning depending on fiscal year end. pandas supports all 12 possible quarterly frequencies as Q-JAN through Q-DEC:
In [175]: p = pd.Period('2012Q4', freq='Q-JAN')
In [176]: p
Out[176]: Period('2012Q4', 'Q-JAN')
在以1月結(jié)束的財(cái)政年度中,“2012Q4時(shí)期”是從11月到1月,你可以通過(guò)將其轉(zhuǎn)換到日度頻率(daily frequency)來(lái)查看。如圖11-2所示。
In the case of fiscal year ending in January, 2012Q4 runs from November through January, which you can check by converting to daily frequency. See Figure 11-2 for an illustration.
In [177]: p.asfreq('D', 'start')
Out[177]: Period('2011-11-01', 'D')
In [178]: p.asfreq('D', 'end')
Out[178]: Period('2012-01-31', 'D')
因此,可以進(jìn)行簡(jiǎn)單的時(shí)期算術(shù)運(yùn)算(period arithmetic)。例如,要獲得該季度倒數(shù)第二個(gè)工作日下午4點(diǎn)的時(shí)間戳,你可以這樣做:
Thus, it’s possible to do easy period arithmetic; for example, to get the timestamp at 4PM on the second-to-last business day of the quarter, you could do:
In [179]: p4pm = (p.asfreq('B', 'e') - 1).asfreq('T', 's') + 16 * 60
In [180]: p4pm
Out[180]: Period('2012-01-30 16:00', 'T')
In [181]: p4pm.to_timestamp()
Out[181]: Timestamp('2012-01-30 16:00:00')
可以使用pandas.period_range函數(shù)生成季度范圍(quarterly range)。季度范圍的算術(shù)運(yùn)算也是一樣的:
You can generate quarterly ranges using period_range. Arithmetic is identical, too:
In [182]: rng = pd.period_range('2011Q3', '2012Q4', freq='Q-JAN')
In [183]: ts = pd.Series(np.arange(len(rng)), index=rng)
In [184]: ts
Out[184]:
2011Q3 0
2011Q4 1
2012Q1 2
2012Q2 3
2012Q3 4
2012Q4 5
Freq: Q-JAN, dtype: int64
In [185]: new_rng = (rng.asfreq('B', 'e') - 1).asfreq('T', 's') + 16 * 60
In [186]: ts.index = new_rng.to_timestamp()
In [187]: ts
Out[187]:
2010-10-28 16:00:00 0
2011-01-28 16:00:00 1
2011-04-28 16:00:00 2
2011-07-28 16:00:00 3
2011-10-28 16:00:00 4
2012-01-30 16:00:00 5
dtype: int64
11.5.3 將時(shí)間戳轉(zhuǎn)換為時(shí)期(及其反向過(guò)程)
Converting Timestamps to Periods (and Back)
通過(guò)to_period方法,可以將被時(shí)間戳索引的Series對(duì)象和DataFrame對(duì)象轉(zhuǎn)換到被時(shí)期索引:
Series and DataFrame objects indexed by timestamps can be converted to periods with the to_period method:
In [188]: rng = pd.date_range('2000-01-01', periods=3, freq='M')
In [189]: ts = pd.Series(np.random.randn(3), index=rng)
In [190]: ts
Out[190]:
2000-01-31 1.663261
2000-02-29 -0.996206
2000-03-31 1.521760
Freq: M, dtype: float64
In [191]: pts = ts.to_period()
In [192]: pts
Out[192]:
2000-01 1.663261
2000-02 -0.996206
2000-03 1.521760
Freq: M, dtype: float64
由于時(shí)期指的是非重疊的時(shí)間跨度,因此對(duì)于給定的頻率,一個(gè)時(shí)間戳只能屬于一個(gè)時(shí)期。雖然默認(rèn)新PeriodIndex的頻率是從時(shí)間戳推斷而來(lái)的,但你可以指定任何頻率。結(jié)果中允許存在重復(fù)時(shí)期:
Since periods refer to non-overlapping timespans, a timestamp can only belong to a single period for a given frequency. While the frequency of the new PeriodIndex is inferred from the timestamps by default, you can specify any frequency you want. There is also no problem with having duplicate periods in the result:
In [193]: rng = pd.date_range('1/29/2000', periods=6, freq='D')
In [194]: ts2 = pd.Series(np.random.randn(6), index=rng)
In [195]: ts2
Out[195]:
2000-01-29 0.244175
2000-01-30 0.423331
2000-01-31 -0.654040
2000-02-01 2.089154
2000-02-02 -0.060220
2000-02-03 -0.167933
Freq: D, dtype: float64
In [196]: ts2.to_period('M')
Out[196]:
2000-01 0.244175
2000-01 0.423331
2000-01 -0.654040
2000-02 2.089154
2000-02 -0.060220
2000-02 -0.167933
Freq: M, dtype: float64
要轉(zhuǎn)換回時(shí)間戳,使用to_timestamp方法即可:
To convert back to timestamps, use to_timestamp:
In [197]: pts = ts2.to_period()
In [198]: pts
Out[198]:
2000-01-29 0.244175
2000-01-30 0.423331
2000-01-31 -0.654040
2000-02-01 2.089154
2000-02-02 -0.060220
2000-02-03 -0.167933
Freq: D, dtype: float64
In [199]: pts.to_timestamp(how='end') # gg注:原英文書中的輸出有誤,只有日期無(wú)時(shí)間信息
Out[199]:
2000-01-29 23:59:59.999999999 0.244175
2000-01-30 23:59:59.999999999 0.423331
2000-01-31 23:59:59.999999999 -0.654040
2000-02-01 23:59:59.999999999 2.089154
2000-02-02 23:59:59.999999999 -0.060220
2000-02-03 23:59:59.999999999 -0.167933
Freq: D, dtype: float64
11.5.4 從數(shù)組創(chuàng)建PeriodIndex
Creating a PeriodIndex from Arrays
固定頻率的數(shù)據(jù)集有時(shí)會(huì)將時(shí)間跨度信息分開(kāi)存儲(chǔ)在多個(gè)列中。例如,在下面這個(gè)宏觀經(jīng)濟(jì)數(shù)據(jù)集中,年份和季度就在不同的列中:
Fixed frequency datasets are sometimes stored with timespan information spread across multiple columns. For example, in this macroeconomic dataset, the year and quarter are in different columns:
In [200]: data = pd.read_csv('examples/macrodata.csv')
In [201]: data.head(5)
Out[201]:
year quarter realgdp realcons realinv realgovt realdpi cpi \
0 1959.0 1.0 2710.349 1707.4 286.898 470.045 1886.9 28.98
1 1959.0 2.0 2778.801 1733.7 310.859 481.301 1919.7 29.15
2 1959.0 3.0 2775.488 1751.8 289.226 491.260 1916.4 29.35
3 1959.0 4.0 2785.204 1753.7 299.356 484.052 1931.3 29.37
4 1960.0 1.0 2847.699 1770.5 331.722 462.199 1955.5 29.54
m1 tbilrate unemp pop infl realint
0 139.7 2.82 5.8 177.146 0.00 0.00
1 141.7 3.08 5.1 177.830 2.34 0.74
2 140.5 3.82 5.3 178.657 2.74 1.09
3 140.0 4.33 5.6 179.386 0.27 4.06
4 139.6 3.50 5.2 180.007 2.31 1.19
In [202]: data.year
Out[202]:
0 1959.0
1 1959.0
2 1959.0
3 1959.0
4 1960.0
5 1960.0
6 1960.0
7 1960.0
8 1961.0
9 1961.0
...
193 2007.0
194 2007.0
195 2007.0
196 2008.0
197 2008.0
198 2008.0
199 2008.0
200 2009.0
201 2009.0
202 2009.0
Name: year, Length: 203, dtype: float64
In [203]: data.quarter
Out[203]:
0 1.0
1 2.0
2 3.0
3 4.0
4 1.0
5 2.0
6 3.0
7 4.0
8 1.0
9 2.0
...
193 2.0
194 3.0
195 4.0
196 1.0
197 2.0
198 3.0
199 4.0
200 1.0
201 2.0
202 3.0
Name: quarter, Length: 203, dtype: float64
通過(guò)將這些數(shù)組以及一個(gè)頻率傳入pandas.PeriodIndex函數(shù),就可以將它們組合成DataFrame的索引:
By passing these arrays to PeriodIndex with a frequency, you can combine them to form an index for the DataFrame:
In [204]: idx = pd.PeriodIndex(year=data.year, quarter=data.quarter,
.....: freq='Q-DEC') # gg注:為避免歧義,變量名從原文的index改為idx
In [205]: idx
Out[205]:
PeriodIndex(['1959Q1', '1959Q2', '1959Q3', '1959Q4', '1960Q1', '1960Q2',
'1960Q3', '1960Q4', '1961Q1', '1961Q2',
...
'2007Q2', '2007Q3', '2007Q4', '2008Q1', '2008Q2', '2008Q3',
'2008Q4', '2009Q1', '2009Q2', '2009Q3'],
dtype='period[Q-DEC]', length=203, freq='Q-DEC')
In [206]: data.index = idx
In [207]: data.infl
Out[207]:
1959Q1 0.00
1959Q2 2.34
1959Q3 2.74
1959Q4 0.27
1960Q1 2.31
1960Q2 0.14
1960Q3 2.70
1960Q4 1.21
1961Q1 -0.40
1961Q2 1.47
...
2007Q2 2.75
2007Q3 3.45
2007Q4 6.38
2008Q1 2.82
2008Q2 8.53
2008Q3 -3.16
2008Q4 -8.79
2009Q1 0.94
2009Q2 3.37
2009Q3 3.56
Freq: Q-DEC, Name: infl, Length: 203, dtype: float64
11.6 重采樣及頻率轉(zhuǎn)換
Resampling and Frequency Conversion
重采樣(resampling)指的是將時(shí)間序列從一個(gè)頻率轉(zhuǎn)換到另一個(gè)頻率的過(guò)程。將高頻率數(shù)據(jù)聚合到低頻率稱為降采樣(downsampling),而將低頻率數(shù)據(jù)轉(zhuǎn)換到高頻率則稱為升采樣(upsampling)。并不是所有的重采樣都能被劃分到這兩個(gè)大類中。例如,將W-WED(每周三)轉(zhuǎn)換為W-FRI既不是降采樣也不是升采樣。
Resampling refers to the process of converting a time series from one frequency to another. Aggregating higher frequency data to lower frequency is called downsampling, while converting lower frequency to higher frequency is called upsampling. Not all resampling falls into either of these categories; for example, converting W-WED (weekly on Wednesday) to W-FRI is neither upsampling nor downsampling.
pandas對(duì)象都帶有一個(gè)resample方法,它是各種頻率轉(zhuǎn)換工作的主力函數(shù)。resample方法有一個(gè)類似于groupby方法的API,先調(diào)用resample方法分組數(shù)據(jù),然后再調(diào)用一個(gè)聚合函數(shù):
pandas objects are equipped with a resample method, which is the workhorse function for all frequency conversion. resample has a similar API to groupby; you call resample to group the data, then call an aggregation function:
In [208]: rng = pd.date_range('2000-01-01', periods=100, freq='D')
In [209]: ts = pd.Series(np.random.randn(len(rng)), index=rng)
In [210]: ts
Out[210]:
2000-01-01 0.631634
2000-01-02 -1.594313
2000-01-03 -1.519937
2000-01-04 1.108752
2000-01-05 1.255853
2000-01-06 -0.024330
2000-01-07 -2.047939
2000-01-08 -0.272657
2000-01-09 -1.692615
2000-01-10 1.423830
...
2000-03-31 -0.007852
2000-04-01 -1.638806
2000-04-02 1.401227
2000-04-03 1.758539
2000-04-04 0.628932
2000-04-05 -0.423776
2000-04-06 0.789740
2000-04-07 0.937568
2000-04-08 -2.253294
2000-04-09 -1.772919
Freq: D, Length: 100, dtype: float64
In [211]: ts.resample('M').mean()
Out[211]:
2000-01-31 -0.165893
2000-02-29 0.078606
2000-03-31 0.223811
2000-04-30 -0.063643
Freq: M, dtype: float64
In [212]: ts.resample('M', kind='period').mean()
Out[212]:
2000-01 -0.165893
2000-02 0.078606
2000-03 0.223811
2000-04 -0.063643
Freq: M, dtype: float64
resample方法是一個(gè)靈活高效的方法,可用于處理非常大的時(shí)間序列。我將通過(guò)一系列的示例說(shuō)明其用法。表11-5總結(jié)了它的一些參數(shù)。
resample is a flexible and high-performance method that can be used to process very large time series. The examples in the following sections illustrate its semantics and use. Table 11-5 summarizes some of its options.
表11-5. resample方法的參數(shù)
Table 11-5. Resample method arguments
11.6.1 降采樣
Downsampling
將數(shù)據(jù)聚合到規(guī)律的低頻率是一件非常普通的時(shí)間序列處理任務(wù)。待聚合的數(shù)據(jù)不必?fù)碛泄潭ǖ念l率,期望的頻率會(huì)自動(dòng)定義聚合的箱邊緣(bin edge),這些箱邊緣用于將時(shí)間序列拆分為多個(gè)片段。例如,要轉(zhuǎn)換到月度頻率('M'或'BM'),數(shù)據(jù)需要被劃分到多個(gè)單月間隔(one-month interval)中。各間隔都是半開(kāi)半閉的(half-open)。一個(gè)數(shù)據(jù)點(diǎn)只能屬于一個(gè)間隔,所有間隔的并集必須能組成整個(gè)時(shí)間范圍(time frame,gg注:為方便理解采用“時(shí)間范圍”,最精確的翻譯是“時(shí)間框架”)。在用resample方法對(duì)數(shù)據(jù)進(jìn)行降采樣時(shí),需要考慮兩件事:
- 各間隔哪端是閉合的。
- 如何標(biāo)記各聚合后的箱,采用間隔的起始還是末尾(gg注:即采用箱的左邊緣還是右邊緣)。
Aggregating data to a regular, lower frequency is a pretty normal time series task. The data you’re aggregating doesn’t need to be fixed frequency; the desired frequency defines bin edges that are used to slice the time series into pieces to aggregate. For example, to convert to monthly, 'M' or 'BM', you need to chop up the data into one-month intervals. Each interval is said to be half-open; a data point can only belong to one interval, and the union of the intervals must make up the whole time frame. There are a couple of things to think about when using resample to downsample data:
- Which side of each interval is closed
- How to label each aggregated bin, either with the start of the interval or the end
為了說(shuō)明,我們來(lái)看一些“1分鐘”的數(shù)據(jù):
To illustrate, let’s look at some one-minute data:
In [213]: rng = pd.date_range('2000-01-01', periods=12, freq='T')
In [214]: ts = pd.Series(np.arange(12), index=rng)
In [215]: ts
Out[215]:
2000-01-01 00:00:00 0
2000-01-01 00:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-01 00:04:00 4
2000-01-01 00:05:00 5
2000-01-01 00:06:00 6
2000-01-01 00:07:00 7
2000-01-01 00:08:00 8
2000-01-01 00:09:00 9
2000-01-01 00:10:00 10
2000-01-01 00:11:00 11
Freq: T, dtype: int64
假設(shè)你想通過(guò)求和的方式將這些數(shù)據(jù)聚合到“5分鐘”的塊中:
Suppose you wanted to aggregate this data into five-minute chunks or bars by taking the sum of each group:
In [216]: ts.resample('5min').sum() # gg注:原英文書中有誤,作者的意圖是采用closed參數(shù)的默認(rèn)值
Out[216]:
2000-01-01 00:00:00 10
2000-01-01 00:05:00 35
2000-01-01 00:10:00 21
Freq: 5T, dtype: int32
傳入的頻率將會(huì)以“5分鐘”的增量定義箱邊緣。默認(rèn)情況下,箱的左邊緣是包含的(gg注:即左閉右開(kāi)),因此00:00到00:05間隔是包含00:00的[1]。傳入closed='right'會(huì)讓間隔變成左開(kāi)右閉的:
The frequency you pass defines bin edges in five-minute increments. By default, the left bin edge is inclusive, so the 00:00 value is included in the 00:00 to 00:05 interval[1]. Passing closed='right' changes the interval to be closed on the right:
In [217]: ts.resample('5min', closed='right').sum()
Out[217]:
1999-12-31 23:55:00 0
2000-01-01 00:00:00 15
2000-01-01 00:05:00 40
2000-01-01 00:10:00 11
Freq: 5T, dtype: int64
結(jié)果的時(shí)間序列默認(rèn)是以各箱左邊緣的時(shí)間戳進(jìn)行標(biāo)記的。傳入label='right'即可用箱的右邊緣對(duì)其進(jìn)行標(biāo)記:
The resulting time series is labeled by the timestamps from the left side of each bin. By passing label='right' you can label them with the right bin edge:
In [218]: ts.resample('5min', closed='right', label='right').sum()
Out[218]:
2000-01-01 00:00:00 0
2000-01-01 00:05:00 15
2000-01-01 00:10:00 40
2000-01-01 00:15:00 11
Freq: 5T, dtype: int64
圖11-3是“1分鐘”頻率的數(shù)據(jù)被重采樣到“5分鐘”頻率的示意圖。
See Figure 11-3 for an illustration of minute frequency data being resampled to five-minute frequency.
最后,你可能想對(duì)結(jié)果的索引進(jìn)行一些移動(dòng),例如從右邊緣減去一秒以便更容易明白該時(shí)間戳到底表示的是哪個(gè)間隔。只需要給loffset參數(shù)傳入一個(gè)字符串或日期偏移量即可實(shí)現(xiàn)這個(gè)目的:
Lastly, you might want to shift the result index by some amount, say subtracting one second from the right edge to make it more clear which interval the timestamp refers to. To do this, pass a string or date offset to loffset:
In [219]: ts.resample('5min', closed='right',
.....: label='right', loffset='-1s').sum()
Out[219]:
1999-12-31 23:59:59 0
2000-01-01 00:04:59 15
2000-01-01 00:09:59 40
2000-01-01 00:14:59 11
Freq: 5T, dtype: int32
也可以通過(guò)調(diào)用結(jié)果對(duì)象的shift方法來(lái)實(shí)現(xiàn)該效果,這樣就不需要設(shè)置loffset參數(shù)了。
You also could have accomplished the effect of loffset by calling the shift method on the result without the loffset.
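gg注:以下為譯者補(bǔ)充的示意代碼(非原書內(nèi)容),演示對(duì)重采樣結(jié)果調(diào)用shift方法實(shí)現(xiàn)與loffset='-1s'相同的效果(loffset參數(shù)在較新版本的pandas中已被棄用):

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-01', periods=12, freq='min')
ts = pd.Series(np.arange(12), index=rng)

# 先按右閉、右標(biāo)記的方式降采樣求和
resampled = ts.resample('5min', closed='right', label='right').sum()

# 再對(duì)結(jié)果調(diào)用shift方法,將索引整體前移1秒,效果等同于loffset='-1s'
shifted = resampled.shift(-1, freq=pd.Timedelta('1s'))

assert str(shifted.index[0]) == '1999-12-31 23:59:59'
assert list(shifted.values) == [0, 15, 40, 11]
```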
11.6.1.1 OHLC重采樣
Open-High-Low-Close (OHLC) resampling
金融領(lǐng)域中有一種聚合時(shí)間序列的常見(jiàn)方式,即計(jì)算各時(shí)間段的四個(gè)值:開(kāi)盤價(jià)(open)、最高價(jià)(high)、最低價(jià)(low)和收盤價(jià)(close)。使用ohlc聚合函數(shù)即可得到一個(gè)含有這四個(gè)值的DataFrame對(duì)象,只需要對(duì)數(shù)據(jù)進(jìn)行一次掃描就可以有效地計(jì)算出結(jié)果:
(gg注:為方便理解對(duì)原英文書中的“the first (open), last (close), maximum (high), and minimal (low) values”的順序進(jìn)行了調(diào)整)
In finance, a popular way to aggregate a time series is to compute four values for each bucket: the first (open), maximum (high), minimal (low) and last (close) values. By using the ohlc aggregate function you will obtain a DataFrame having columns containing these four aggregates, which are efficiently computed in a single sweep of the data:
In [220]: ts.resample('5min').ohlc()
Out[220]:
open high low close
2000-01-01 00:00:00 0 4 0 4
2000-01-01 00:05:00 5 9 5 9
2000-01-01 00:10:00 10 11 10 11
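ohlc的四列分別對(duì)應(yīng)每個(gè)箱內(nèi)的第一個(gè)、最大、最小和最后一個(gè)值,可用下面的小例驗(yàn)證(gg注:筆者補(bǔ)充的示例,數(shù)據(jù)為自行構(gòu)造):

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-01', periods=12, freq='min')
ts = pd.Series(np.arange(12), index=rng)

# 一次掃描同時(shí)得到開(kāi)盤、最高、最低、收盤四個(gè)聚合值
ohlc = ts.resample('5min').ohlc()

print(ohlc)
```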
11.6.2 升采樣和插值
Upsampling and Interpolation
在將數(shù)據(jù)從低頻率轉(zhuǎn)換到高頻率時(shí),就不需要聚合了。我們來(lái)看一個(gè)帶有一些周度數(shù)據(jù)(weekly data)的DataFrame對(duì)象:
When converting from a low frequency to a higher frequency, no aggregation is needed. Let’s consider a DataFrame with some weekly data:
In [221]: frame = pd.DataFrame(np.random.randn(2, 4),
.....: index=pd.date_range('1/1/2000', periods=2,
.....: freq='W-WED'),
.....: columns=['Colorado', 'Texas', 'New York', 'Ohio'])
In [222]: frame
Out[222]:
Colorado Texas New York Ohio
2000-01-05 -0.896431 0.677263 0.036503 0.087102
2000-01-12 -0.046662 0.927238 0.482284 -0.867130
當(dāng)你對(duì)這個(gè)數(shù)據(jù)使用聚合函數(shù)時(shí),每組只有一個(gè)值,而缺失值會(huì)形成間隔。我們使用asfreq方法轉(zhuǎn)換到高頻率,而不經(jīng)過(guò)任何聚合:
When you are using an aggregation function with this data, there is only one value per group, and missing values result in the gaps. We use the asfreq method to convert to the higher frequency without any aggregation:
In [223]: df_daily = frame.resample('D').asfreq()
In [224]: df_daily
Out[224]:
Colorado Texas New York Ohio
2000-01-05 -0.896431 0.677263 0.036503 0.087102
2000-01-06 NaN NaN NaN NaN
2000-01-07 NaN NaN NaN NaN
2000-01-08 NaN NaN NaN NaN
2000-01-09 NaN NaN NaN NaN
2000-01-10 NaN NaN NaN NaN
2000-01-11 NaN NaN NaN NaN
2000-01-12 -0.046662 0.927238 0.482284 -0.867130
假設(shè)你想在“非星期三”向前填充各周度值(weekly value)。resample方法的填充和插值方式跟fillna方法和reindex方法的一樣:
Suppose you wanted to fill forward each weekly value on the non-Wednesdays. The same filling or interpolation methods available in the fillna and reindex methods are available for resampling:
In [225]: frame.resample('D').ffill()
Out[225]:
Colorado Texas New York Ohio
2000-01-05 -0.896431 0.677263 0.036503 0.087102
2000-01-06 -0.896431 0.677263 0.036503 0.087102
2000-01-07 -0.896431 0.677263 0.036503 0.087102
2000-01-08 -0.896431 0.677263 0.036503 0.087102
2000-01-09 -0.896431 0.677263 0.036503 0.087102
2000-01-10 -0.896431 0.677263 0.036503 0.087102
2000-01-11 -0.896431 0.677263 0.036503 0.087102
2000-01-12 -0.046662 0.927238 0.482284 -0.867130
同樣,你可以選擇只向前填充一定數(shù)量的時(shí)期,以限制某個(gè)觀測(cè)值被繼續(xù)使用的距離:
You can similarly choose to only fill a certain number of periods forward to limit how far to continue using an observed value:
In [226]: frame.resample('D').ffill(limit=2)
Out[226]:
Colorado Texas New York Ohio
2000-01-05 -0.896431 0.677263 0.036503 0.087102
2000-01-06 -0.896431 0.677263 0.036503 0.087102
2000-01-07 -0.896431 0.677263 0.036503 0.087102
2000-01-08 NaN NaN NaN NaN
2000-01-09 NaN NaN NaN NaN
2000-01-10 NaN NaN NaN NaN
2000-01-11 NaN NaN NaN NaN
2000-01-12 -0.046662 0.927238 0.482284 -0.867130
注意,新的日期索引完全沒(méi)必要跟舊的重疊:
Notably, the new date index need not overlap with the old one at all:
In [227]: frame.resample('W-THU').ffill()
Out[227]:
Colorado Texas New York Ohio
2000-01-06 -0.896431 0.677263 0.036503 0.087102
2000-01-13 -0.046662 0.927238 0.482284 -0.867130
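除了ffill,重采樣對(duì)象還支持interpolate方法做線性插值(gg注:筆者補(bǔ)充的示例,數(shù)據(jù)為自行構(gòu)造):

```python
import pandas as pd

# 兩個(gè)相隔一周(7天)的周度觀測(cè)值
s = pd.Series([0.0, 7.0],
              index=pd.date_range('2000-01-05', periods=2, freq='W-WED'))

# 升采樣到日頻率并做線性插值:每天遞增1.0
daily = s.resample('D').interpolate()

print(daily.loc['2000-01-08'])  # 3.0
```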
11.6.3 通過(guò)時(shí)期進(jìn)行重采樣
Resampling with Periods
對(duì)以時(shí)期為索引的數(shù)據(jù)進(jìn)行重采樣,與對(duì)以時(shí)間戳為索引的數(shù)據(jù)類似:
Resampling data indexed by periods is similar to timestamps:
In [228]: frame = pd.DataFrame(np.random.randn(24, 4),
.....: index=pd.period_range('1-2000', '12-2001',
.....: freq='M'),
.....: columns=['Colorado', 'Texas', 'New York', 'Ohio'])
In [229]: frame[:5]
Out[229]:
Colorado Texas New York Ohio
2000-01 0.493841 -0.155434 1.397286 1.507055
2000-02 -1.179442 0.443171 1.395676 -0.529658
2000-03 0.787358 0.248845 0.743239 1.267746
2000-04 1.302395 -0.272154 -0.051532 -0.467740
2000-05 -1.040816 0.426419 0.312945 -1.115689
In [230]: annual_frame = frame.resample('A-DEC').mean()
In [231]: annual_frame
Out[231]:
Colorado Texas New York Ohio
2000 0.556703 0.016631 0.111873 -0.027445
2001 0.046303 0.163344 0.251503 -0.157276
升采樣要稍微麻煩一些,因?yàn)槟惚仨毾駻sfreq方法那樣,決定把原始值放在新頻率下時(shí)間跨度的哪一端。convention參數(shù)默認(rèn)為'start',也可設(shè)置為'end':
Upsampling is more nuanced, as you must make a decision about which end of the timespan in the new frequency to place the values before resampling, just like the asfreq method. The convention argument defaults to 'start' but can also be 'end':
In [232]: annual_frame.resample('Q-DEC').ffill()
Out[232]:
Colorado Texas New York Ohio
2000Q1 0.556703 0.016631 0.111873 -0.027445
2000Q2 0.556703 0.016631 0.111873 -0.027445
2000Q3 0.556703 0.016631 0.111873 -0.027445
2000Q4 0.556703 0.016631 0.111873 -0.027445
2001Q1 0.046303 0.163344 0.251503 -0.157276
2001Q2 0.046303 0.163344 0.251503 -0.157276
2001Q3 0.046303 0.163344 0.251503 -0.157276
2001Q4 0.046303 0.163344 0.251503 -0.157276
In [233]: annual_frame.resample('Q-DEC', convention='end').ffill()
Out[233]:
Colorado Texas New York Ohio
2000Q4 0.556703 0.016631 0.111873 -0.027445
2001Q1 0.556703 0.016631 0.111873 -0.027445
2001Q2 0.556703 0.016631 0.111873 -0.027445
2001Q3 0.556703 0.016631 0.111873 -0.027445
2001Q4 0.046303 0.163344 0.251503 -0.157276
由于時(shí)期指的是時(shí)間跨度,所以升采樣和降采樣的規(guī)則就比較嚴(yán)格:
- 在降采樣中,目標(biāo)頻率必須是源頻率的子時(shí)期(subperiod)。
- 在升采樣中,目標(biāo)頻率必須是源頻率的超時(shí)期(superperiod)。
Since periods refer to timespans, the rules about upsampling and downsampling are more rigid:
- In downsampling, the target frequency must be a subperiod of the source frequency.
- In upsampling, the target frequency must be a superperiod of the source frequency.
如果不滿足這些規(guī)則,就會(huì)引發(fā)異常。這主要影響季度頻率、年度頻率和周度頻率。例如,由Q-MAR定義的時(shí)間跨度只能與A-MAR、A-JUN、A-SEP和A-DEC對(duì)齊:
If these rules are not satisfied, an exception will be raised. This mainly affects the quarterly, annual, and weekly frequencies; for example, the timespans defined by Q-MAR only line up with A-MAR, A-JUN, A-SEP, and A-DEC:
In [234]: annual_frame.resample('Q-MAR').ffill()
Out[234]:
Colorado Texas New York Ohio
2000Q4 0.556703 0.016631 0.111873 -0.027445
2001Q1 0.556703 0.016631 0.111873 -0.027445
2001Q2 0.556703 0.016631 0.111873 -0.027445
2001Q3 0.556703 0.016631 0.111873 -0.027445
2001Q4 0.046303 0.163344 0.251503 -0.157276
2002Q1 0.046303 0.163344 0.251503 -0.157276
2002Q2 0.046303 0.163344 0.251503 -0.157276
2002Q3 0.046303 0.163344 0.251503 -0.157276
11.7 移動(dòng)窗口函數(shù)
Moving Window Functions
用于時(shí)間序列運(yùn)算的數(shù)組轉(zhuǎn)換中,有一個(gè)重要類別是:在滑動(dòng)窗口(sliding window,可以帶有指數(shù)衰減權(quán)重)上計(jì)算的各種統(tǒng)計(jì)函數(shù)。這類函數(shù)可用于平滑噪聲數(shù)據(jù)(noisy data)或缺口數(shù)據(jù)(gappy data)。我將它們稱為移動(dòng)窗口函數(shù)(moving window function),盡管其中也包括窗口長(zhǎng)度不固定的函數(shù),例如指數(shù)加權(quán)移動(dòng)平均。跟其它統(tǒng)計(jì)函數(shù)一樣,移動(dòng)窗口函數(shù)也會(huì)自動(dòng)排除缺失數(shù)據(jù)。
An important class of array transformations used for time series operations are statistics and other functions evaluated over a sliding window or with exponentially decaying weights. This can be useful for smoothing noisy or gappy data. I call these moving window functions, even though it includes functions without a fixed-length window like exponentially weighted moving average. Like other statistical functions, these also automatically exclude missing data.
開(kāi)始之前,我們加載一些時(shí)間序列數(shù)據(jù),將其重采樣為工作日頻率:
Before digging in, we can load up some time series data and resample it to business day frequency:
In [235]: close_px_all = pd.read_csv('examples/stock_px_2.csv',
.....: parse_dates=True, index_col=0)
In [236]: close_px = close_px_all[['AAPL', 'MSFT', 'XOM']]
In [237]: close_px = close_px.resample('B').ffill()
現(xiàn)在引入rolling函數(shù),它與resample方法和groupby方法很像。可以在Series對(duì)象或DataFrame對(duì)象上沿著一個(gè)窗口(window,表示為時(shí)期數(shù),見(jiàn)圖11-4)調(diào)用它:
I now introduce the rolling operator, which behaves similarly to resample and groupby. It can be called on a Series or DataFrame along with a window (expressed as a number of periods; see Figure 11-4 for the plot created):
In [238]: close_px.AAPL.plot()
Out[238]: <matplotlib.axes._subplots.AxesSubplot at 0x7f2f2570cf98>
In [239]: close_px.AAPL.rolling(250).mean().plot() # gg注:等價(jià)于close_px.AAPL.rolling(window=250).mean().plot()
表達(dá)式rolling(250)與groupby方法很像,但不是對(duì)其直接分組,而是創(chuàng)建一個(gè)對(duì)象,該對(duì)象允許在250日的滑動(dòng)窗口上分組。然后,我們就得到了蘋果公司股價(jià)250日的移動(dòng)平均線。
The expression rolling(250) is similar in behavior to groupby, but instead of grouping it creates an object that enables grouping over a 250-day sliding window. So here we have the 250-day moving window average of Apple’s stock price.
默認(rèn)情況下,rolling函數(shù)要求窗口中的所有值都是非NA值。可以修改該行為以處理缺失數(shù)據(jù),尤其是在時(shí)間序列開(kāi)始處,數(shù)據(jù)量會(huì)少于窗口時(shí)期數(shù)這一事實(shí)(見(jiàn)圖11-5):
By default rolling functions require all of the values in the window to be non-NA. This behavior can be changed to account for missing data and, in particular, the fact that you will have fewer than window periods of data at the beginning of the time series (see Figure 11-5):
In [241]: appl_std250 = close_px.AAPL.rolling(250, min_periods=10).std()
In [242]: appl_std250[5:12]
Out[242]:
2003-01-09 NaN
2003-01-10 NaN
2003-01-13 NaN
2003-01-14 NaN
2003-01-15 0.077496
2003-01-16 0.074760
2003-01-17 0.112368
Freq: B, Name: AAPL, dtype: float64
In [243]: appl_std250.plot()
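min_periods參數(shù)的效果可以用一個(gè)小的Series直觀演示(gg注:筆者補(bǔ)充的示例):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# 默認(rèn):窗口湊不滿3個(gè)值時(shí)結(jié)果為NaN
full = s.rolling(3).mean()

# min_periods=1:只要窗口內(nèi)至少有1個(gè)非NA值就計(jì)算
partial = s.rolling(3, min_periods=1).mean()

print(full.tolist())     # [nan, nan, 2.0, 3.0, 4.0]
print(partial.tolist())  # [1.0, 1.5, 2.0, 3.0, 4.0]
```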
為了計(jì)算擴(kuò)展窗口均值(expanding window mean),使用expanding函數(shù)代替rolling函數(shù)。擴(kuò)展均值的時(shí)間窗口從時(shí)間序列的起始處開(kāi)始,并不斷增加窗口的大小,直到它包含整個(gè)時(shí)間序列。appl_std250時(shí)間序列的擴(kuò)展窗口均值如下:
In order to compute an expanding window mean, use the expanding operator instead of rolling. The expanding mean starts the time window from the beginning of the time series and increases the size of the window until it encompasses the whole series. An expanding window mean on the apple_std250 time series looks like this:
In [244]: expanding_mean = appl_std250.expanding().mean()
在DataFrame對(duì)象上調(diào)用移動(dòng)窗口函數(shù),會(huì)將轉(zhuǎn)換應(yīng)用到每一列(見(jiàn)圖11-6):
Calling a moving window function on a DataFrame applies the transformation to each column (see Figure 11-6):
In [246]: close_px.rolling(60).mean().plot(logy=True)
rolling函數(shù)也可以接受一個(gè)字符串,該字符串表示固定大小的時(shí)間偏移量而不是固定數(shù)量的時(shí)期(gg注:即window參數(shù)可以等于字符串,例如“20D”)。使用這種表示法對(duì)不規(guī)則的時(shí)間序列很有用。這些字符串也可以傳遞給resample方法。例如,我們可以計(jì)算20日的滾動(dòng)均值,如下所示:
The rolling function also accepts a string indicating a fixed-size time offset rather than a set number of periods. Using this notation can be useful for irregular time series. These are the same strings that you can pass to resample. For example, we could compute a 20-day rolling mean like so:
In [247]: close_px.rolling('20D').mean() # gg注:等價(jià)于close_px.rolling(window='20D').mean()
Out[247]:
AAPL MSFT XOM
2003-01-02 7.400000 21.110000 29.220000
2003-01-03 7.425000 21.125000 29.230000
2003-01-06 7.433333 21.256667 29.473333
2003-01-07 7.432500 21.425000 29.342500
2003-01-08 7.402000 21.402000 29.240000
2003-01-09 7.391667 21.490000 29.273333
2003-01-10 7.387143 21.558571 29.238571
2003-01-13 7.378750 21.633750 29.197500
2003-01-14 7.370000 21.717778 29.194444
2003-01-15 7.355000 21.757000 29.152000
... ... ... ...
2011-10-03 398.002143 25.890714 72.413571
2011-10-04 396.802143 25.807857 72.427143
2011-10-05 395.751429 25.729286 72.422857
2011-10-06 394.099286 25.673571 72.375714
2011-10-07 392.479333 25.712000 72.454667
2011-10-10 389.351429 25.602143 72.527857
2011-10-11 388.505000 25.674286 72.835000
2011-10-12 388.531429 25.810000 73.400714
2011-10-13 388.826429 25.961429 73.905000
2011-10-14 391.038000 26.048667 74.185333
[2292 rows x 3 columns]
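對(duì)于不規(guī)則時(shí)間序列,基于時(shí)間偏移量的窗口按實(shí)際時(shí)間跨度取數(shù),而不是按固定的觀測(cè)個(gè)數(shù)(gg注:筆者補(bǔ)充的示例,數(shù)據(jù)為自行構(gòu)造):

```python
import pandas as pd

# 不規(guī)則的時(shí)間索引:第三個(gè)觀測(cè)值與前兩個(gè)相隔8天
idx = pd.to_datetime(['2000-01-01', '2000-01-02', '2000-01-10'])
s = pd.Series([1.0, 2.0, 4.0], index=idx)

# “3D”窗口:只累加當(dāng)前時(shí)刻往前3天以內(nèi)的觀測(cè)值
result = s.rolling('3D').sum()

print(result.tolist())  # [1.0, 3.0, 4.0]
```

注意第三個(gè)窗口只包含它自己,因?yàn)榍皟蓚€(gè)觀測(cè)值已超出3天的跨度。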
11.7.1 指數(shù)加權(quán)函數(shù)
Exponentially Weighted Functions
一種使用固定大小窗口的方式是賦予觀察結(jié)果同等權(quán)重,另一種方式是指定一個(gè)衰減因子(decay factor)常量,以便賦予近期的觀察結(jié)果更多的權(quán)重。指定衰減因子的方式有很多,常見(jiàn)的方式是使用跨度(span),它使結(jié)果類似于一個(gè)窗口大小等于跨度的簡(jiǎn)單移動(dòng)窗口函數(shù)。
An alternative to using a static window size with equally weighted observations is to specify a constant decay factor to give more weight to more recent observations. There are a couple of ways to specify the decay factor. A popular one is using a span, which makes the result comparable to a simple moving window function with window size equal to the span.
由于指數(shù)加權(quán)統(tǒng)計(jì)會(huì)賦予近期的觀察結(jié)果更多的權(quán)重,因此它比等權(quán)統(tǒng)計(jì)更快“適應(yīng)”變化。
Since an exponentially weighted statistic places more weight on more recent observations, it “adapts” faster to changes compared with the equal-weighted version.
除了rolling函數(shù)和expanding函數(shù),pandas還有ewm函數(shù)。下面這個(gè)例子比較了蘋果公司股價(jià)60日的簡(jiǎn)單移動(dòng)平均線和span=60的指數(shù)加權(quán)移動(dòng)平均線(見(jiàn)圖11-7):
pandas has the ewm operator to go along with rolling and expanding. Here’s an example comparing a 60-day moving average of Apple’s stock price with an EW moving average with span=60 (see Figure 11-7):
In [249]: aapl_px = close_px.AAPL['2006':'2007']
In [250]: ma60 = aapl_px.rolling(60, min_periods=20).mean() # gg注:原英文書中有誤,作者的意圖是60而不是30
In [251]: ewma60 = aapl_px.ewm(span=60).mean() # gg注:原英文書中有誤,作者的意圖是60而不是30
In [252]: ma60.plot(style='k--', label='Simple MA')
Out[252]: <matplotlib.axes._subplots.AxesSubplot at 0x7f2f252161d0>
In [253]: ewma60.plot(style='k-', label='EW MA')
Out[253]: <matplotlib.axes._subplots.AxesSubplot at 0x7f2f252161d0>
In [254]: plt.legend()
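span與平滑系數(shù)alpha的關(guān)系為alpha = 2 / (span + 1)。下面用一個(gè)兩點(diǎn)小例驗(yàn)證該遞推公式(gg注:筆者補(bǔ)充的示例,使用adjust=False得到簡(jiǎn)單遞推形式):

```python
import pandas as pd

s = pd.Series([0.0, 1.0])

# span=3 對(duì)應(yīng) alpha = 2/(3+1) = 0.5
ewma = s.ewm(span=3, adjust=False).mean()

# 遞推:y_t = (1 - alpha) * y_{t-1} + alpha * x_t
# y_1 = 0.5 * 0.0 + 0.5 * 1.0 = 0.5
print(ewma.tolist())  # [0.0, 0.5]
```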
11.7.2 二元移動(dòng)窗口函數(shù)
Binary Moving Window Functions
一些統(tǒng)計(jì)運(yùn)算符(例如相關(guān)系數(shù)和協(xié)方差)需要在兩個(gè)時(shí)間序列上運(yùn)算。例如,金融分析師通常對(duì)某只股票與標(biāo)普500等基準(zhǔn)指數(shù)的相關(guān)系數(shù)感興趣。為了考察這一點(diǎn),我們先計(jì)算所有感興趣的時(shí)間序列的百分比變化:
Some statistical operators, like correlation and covariance, need to operate on two time series. As an example, financial analysts are often interested in a stock’s correlation to a benchmark index like the S&P 500. To have a look at this, we first compute the percent change for all of our time series of interest:
(gg注:結(jié)合上下文,作者想計(jì)算的是相關(guān)系數(shù)correlation coefficient,但他寫作時(shí)省略了“coefficient”,翻譯時(shí)進(jìn)行補(bǔ)足)
In [256]: spx_px = close_px_all['SPX']
In [257]: spx_rets = spx_px.pct_change()
In [258]: returns = close_px.pct_change()
在調(diào)用rolling函數(shù)后,corr聚合函數(shù)計(jì)算與spx_rets的滾動(dòng)相關(guān)系數(shù)(結(jié)果見(jiàn)圖11-8):
The corr aggregation function after we call rolling can then compute the rolling correlation with spx_rets (see Figure 11-8 for the resulting plot):
In [259]: corr_coefficient = returns.AAPL.rolling(125, min_periods=100).corr(spx_rets) # gg注:為避免歧義,變量名從原文的corr改為corr_coefficient
In [260]: corr_coefficient.plot()
假設(shè)你想一次性計(jì)算多只股票與標(biāo)普500的相關(guān)系數(shù)。雖然編寫一個(gè)循環(huán)并新建一個(gè)DataFrame對(duì)象不是什么難事,但比較啰嗦。其實(shí),只需傳入一個(gè)Series對(duì)象和一個(gè)DataFrame對(duì)象,rolling(...).corr將自動(dòng)計(jì)算該Series對(duì)象(本例中就是spx_rets)與DataFrame對(duì)象中每列的相關(guān)系數(shù)(結(jié)果見(jiàn)圖11-9):
Suppose you wanted to compute the correlation of the S&P 500 index with many stocks at once. Writing a loop and creating a new DataFrame would be easy but might get repetitive, so if you pass a Series and a DataFrame, rolling(...).corr (gg注:原英文書中的rolling_corr是舊版函數(shù),新版已移除) will compute the correlation of the Series (spx_rets, in this case) with each column in the DataFrame (see Figure 11-9 for the plot of the result):
In [262]: corr_coefficient2 = returns.rolling(125, min_periods=100).corr(spx_rets) # gg注:為避免歧義,變量名從原文的corr改為corr_coefficient2
In [263]: corr_coefficient2.plot()
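rolling(...).corr的含義可用一個(gè)完全線性相關(guān)的小例驗(yàn)證:當(dāng)一個(gè)序列是另一個(gè)序列的線性函數(shù)時(shí),每個(gè)窗口的相關(guān)系數(shù)都應(yīng)為1(gg注:筆者補(bǔ)充的示例):

```python
import numpy as np
import pandas as pd

x = pd.Series(np.arange(10, dtype=float))
y = 2 * x + 1  # 與x完全正相關(guān)

# 窗口大小為5的滾動(dòng)相關(guān)系數(shù)
corr = x.rolling(5).corr(y)

# 前4個(gè)值因窗口不足而為NaN,其余各窗口的相關(guān)系數(shù)均為1
print(corr.dropna().round(6).tolist())
```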
11.7.3 用戶定義的移動(dòng)窗口函數(shù)
User-Defined Moving Window Functions
在rolling及其相關(guān)函數(shù)上的apply方法,讓你能夠在移動(dòng)窗口上應(yīng)用自己設(shè)計(jì)的數(shù)組函數(shù)。唯一的要求是:該函數(shù)能從數(shù)組的每一部分產(chǎn)生一個(gè)單值(即歸約)。例如,雖然我們可以使用rolling(...).quantile(q)計(jì)算樣本分位數(shù),但我們可能對(duì)某個(gè)特定值在樣本中的百分等級(jí)感興趣。scipy.stats.percentileofscore函數(shù)就能達(dá)到這個(gè)目的(結(jié)果見(jiàn)圖11-10):
The apply method on rolling and related methods provides a means to apply an array function of your own devising over a moving window. The only requirement is that the function produce a single value (a reduction) from each piece of the array. For example, while we can compute sample quantiles using rolling(...).quantile(q), we might be interested in the percentile rank of a particular value over the sample. The scipy.stats.percentileofscore function does just this (see Figure 11-10 for the resulting plot):
In [265]: from scipy.stats import percentileofscore
In [266]: score_at_2percent = lambda x: percentileofscore(x, 0.02)
In [267]: result = returns.AAPL.rolling(250).apply(score_at_2percent)
In [268]: result.plot()
如果你沒(méi)安裝SciPy,可以使用conda或pip安裝。
If you don’t have SciPy installed already, you can install it with conda or pip.
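如果不想依賴SciPy,也可以用自定義函數(shù)在apply中直接計(jì)算百分等級(jí)(gg注:筆者補(bǔ)充的示例,score_at_last為筆者假設(shè)的函數(shù)名,含義是窗口內(nèi)小于等于最后一個(gè)值的觀測(cè)所占百分比):

```python
import pandas as pd

s = pd.Series([1.0, 3.0, 2.0, 5.0, 4.0])

# 計(jì)算窗口內(nèi)小于等于最后一個(gè)值的觀測(cè)所占的百分比
def score_at_last(a):
    return (a <= a[-1]).mean() * 100

# raw=True使apply接收NumPy數(shù)組而非Series,速度更快
result = s.rolling(3).apply(score_at_last, raw=True)

print(result.tolist())
```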
11.8 本章小結(jié)
Conclusion
與前面章節(jié)講解的其它類型的數(shù)據(jù)相比,時(shí)間序列數(shù)據(jù)需要不同類型的分析和數(shù)據(jù)轉(zhuǎn)換工具。
Time series data calls for different types of analysis and data transformation tools than the other types of data we have explored in previous chapters.
在接下來(lái)的章節(jié)中,我們將繼續(xù)介紹一些高級(jí)的pandas方法,并展示如何開(kāi)始使用statsmodels和scikit-learn等建模庫(kù)。
In the following chapters, we will move on to some advanced pandas methods and show how to start using modeling libraries like statsmodels and scikit-learn.
[1] closed參數(shù)和label參數(shù)的默認(rèn)值可能會(huì)讓部分用戶感到奇怪。實(shí)際上,這種選擇多少有些隨意:對(duì)于某些目標(biāo)頻率,closed='left'更合適;而對(duì)于其它頻率,closed='right'更合理。真正重要的是,你要清楚自己是如何對(duì)數(shù)據(jù)分段的。
The choice of the default values for closed and label might seem a bit odd to some users. In practice the choice is somewhat arbitrary; for some target frequencies, closed='left' is preferable, while for others closed='right' makes more sense. The important thing is that you keep in mind exactly how you are segmenting the data.