Chapter 10, Section 1: Date and Time Data Types and Tools

All of the data used here can be downloaded from the author's GitHub.

%pylab inline
import pandas as pd
from pandas import Series, DataFrame

Populating the interactive namespace from numpy and matplotlib

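Python provides modules for dates, times, and calendars. The main types are:

date : stores a calendar date (year, month, day)
time : stores a time of day (hours, minutes, seconds, microseconds)
datetime: stores both the date and the time
timedelta: stores the difference between two datetime values (days, seconds, microseconds)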
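To show how these four types fit together, here is a minimal sketch that uses only the standard library. The variable names and values are made up for illustration.

```python
from datetime import date, time, datetime, timedelta

d = date(2011, 1, 7)                 # calendar date only
t = time(8, 15, 30, 500)             # 08:15:30 plus 500 microseconds
dt = datetime.combine(d, t)          # date and time together

# timedelta keeps days, seconds and microseconds
gap = timedelta(days=1, seconds=30, microseconds=250)
later = dt + gap

print(later)
print(gap.days, gap.seconds, gap.microseconds)
```

The finest unit in all of these types is the microsecond, which is why timedelta exposes days, seconds, and microseconds rather than milliseconds.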

# datetime stores the date and time down to the microsecond
from datetime import datetime
now = datetime.now()
print(now)
now.year, now.month, now.day

2017-07-20 10:07:53.800127

(2017, 7, 20)

# datetime.timedelta represents the difference between two datetime objects
delta = datetime(2011, 1, 7) - datetime(2008, 6, 24, 8, 15)
print(delta)
delta.days, delta.seconds

926 days, 15:45:00

(926, 56700)

# Adding or subtracting a timedelta to a datetime yields a new datetime
from datetime import timedelta
start = datetime(2011, 1, 7)
start + timedelta(12)

datetime.datetime(2011, 1, 19, 0, 0)

Converting between strings and datetime

stamp = datetime(2011, 1, 3)
str(stamp)

'2011-01-03 00:00:00'

stamp.strftime('%Y-%m-%d')

'2011-01-03'

value = '2011-01-03'
datetime.strptime(value, '%Y-%m-%d')

datetime.datetime(2011, 1, 3, 0, 0)

datestrs = ['7/6/2011', '8/6/2011']
[datetime.strptime(x, '%m/%d/%Y') for x in datestrs]

[datetime.datetime(2011, 7, 6, 0, 0), datetime.datetime(2011, 8, 6, 0, 0)]
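strptime raises ValueError when a string does not match the format, so a small wrapper can be handy when the input is messy. The sketch below assumes a hypothetical helper named parse_or_none; it is not part of the datetime module.

```python
from datetime import datetime

def parse_or_none(text, fmt='%m/%d/%Y'):
    """Return a datetime for text, or None if it does not match fmt."""
    try:
        return datetime.strptime(text, fmt)
    except ValueError:
        return None

# The second string uses a different layout, so it does not parse
parsed = [parse_or_none(s) for s in ['7/6/2011', '2011-08-06']]

# strftime always zero-pads, so the round trip is not character-identical
round_trip = parsed[0].strftime('%m/%d/%Y')
print(parsed, round_trip)
```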

# The dateutil package offers more convenient parsing
from dateutil.parser import parse
parse('2011-01-03')

datetime.datetime(2011, 1, 3, 0, 0)

parse('Jan 31, 1997 10:45 PM')

datetime.datetime(1997, 1, 31, 22, 45)

# dayfirst=True: the day comes before the month
parse('6/12/2011', dayfirst=True)

datetime.datetime(2011, 12, 6, 0, 0)

# pandas provides methods for handling arrays of dates
datestrs

['7/6/2011', '8/6/2011']

pd.to_datetime(datestrs)

DatetimeIndex(['2011-07-06', '2011-08-06'], dtype='datetime64[ns]', freq=None)

# Missing values are handled automatically
idx = pd.to_datetime(datestrs + [None])
idx

DatetimeIndex(['2011-07-06', '2011-08-06', 'NaT'], dtype='datetime64[ns]', freq=None)

# NaT means Not a Time
idx[2]

NaT

pd.isnull(idx)

array([False, False,  True], dtype=bool)
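NaT also appears when a value cannot be parsed, provided to_datetime is told to coerce errors instead of raising. The sketch below assumes ISO-style input strings and an explicit format; the sample values are invented.

```python
import pandas as pd

raw = ['2011-07-06', 'not a date', None]

# errors='coerce' turns unparseable entries into NaT instead of raising
idx = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')

mask = pd.isnull(idx)      # True where the entry is NaT
clean = idx[~mask]         # keep only the valid timestamps
print(idx)
print(clean)
```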

Reading notes on Python for Data Analysis (《利用Python进行数据分析》).

Chapter 10, Section 2: Time Series Basics

%pylab inline
import pandas as pd
from pandas import Series, DataFrame

Populating the interactive namespace from numpy and matplotlib

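The most basic kind of time series in pandas is a Series indexed by timestamps, which outside pandas are usually represented as strings or datetime objects.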

from datetime import datetime
dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7),
         datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]
ts = Series(np.random.randn(6), index=dates)
ts

2011-01-02    2.196938
2011-01-05    0.904351
2011-01-07   -0.471502
2011-01-08   -0.006652
2011-01-10    0.566689
2011-01-12    2.491312
dtype: float64

# ts is a Series (older pandas versions called this a TimeSeries); its index is a DatetimeIndex
print(type(ts))
print(ts.index)

<class 'pandas.core.series.Series'>
DatetimeIndex(['2011-01-02', '2011-01-05', '2011-01-07', '2011-01-08',
               '2011-01-10', '2011-01-12'],
              dtype='datetime64[ns]', freq=None)

# Arithmetic between time series with different indexes aligns automatically on the dates
ts + ts[::2]

2011-01-02    4.393876
2011-01-05         NaN
2011-01-07   -0.943004
2011-01-08         NaN
2011-01-10    1.133379
2011-01-12         NaN
dtype: float64

# DatetimeIndex uses the datetime64 dtype, storing timestamps as nanosecond values
# Its scalar values are pandas Timestamp objects
print(ts.index.dtype)
ts.index[0]

datetime64[ns]

Timestamp('2011-01-02 00:00:00')

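Indexing, selection, and subsetting

A time series is an ordinary Series (in older pandas versions TimeSeries was a subclass of Series), so indexing and selection work the same way as for any Series.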

stamp = ts.index[2]
ts[stamp]

-0.47150210579550061

# More conveniently, pass a string that can be interpreted as a date
print(ts['1/10/2011'])
print(ts['20110110'])

0.566689483331
0.566689483331

# For longer time series, passing just a year, or a year and month, selects a slice of the data
longer_ts = Series(np.random.randn(1000),
                   index=pd.date_range('1/1/2000', periods=1000))
longer_ts.head()

2000-01-01    0.752180
2000-01-02   -0.667890
2000-01-03    0.438020
2000-01-04    0.085829
2000-01-05    0.355862
Freq: D, dtype: float64

longer_ts['2001'].tail()

2001-12-27    1.398178
2001-12-28    0.420257
2001-12-29    0.217273
2001-12-30   -1.462059
2001-12-31   -1.135778
Freq: D, dtype: float64

longer_ts['2001-05'].tail()

2001-05-27   -0.747961
2001-05-28    0.435790
2001-05-29    1.298764
2001-05-30    0.378459
2001-05-31    0.577249
Freq: D, dtype: float64

# You can slice with timestamps that are not in the time series (a range query)
# Date strings, datetime objects, and Timestamps are all accepted
longer_ts['1/6/1999':'1/11/2000']

2000-01-01    0.752180
2000-01-02   -0.667890
2000-01-03    0.438020
2000-01-04    0.085829
2000-01-05    0.355862
2000-01-06   -1.193612
2000-01-07    1.730057
2000-01-08   -0.436587
2000-01-09    1.766524
2000-01-10    0.717529
2000-01-11   -0.558795
Freq: D, dtype: float64
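The selections above can be reproduced with .loc, which makes it explicit that the strings are labels. This is a minimal sketch on a made-up series of 1000 daily values; the lengths in the comments follow from the calendar.

```python
import numpy as np
import pandas as pd

ts = pd.Series(np.arange(1000),
               index=pd.date_range('2000-01-01', periods=1000))

year_2001 = ts.loc['2001']        # every day of 2001 (365 rows)
may_2001 = ts.loc['2001-05']      # every day of May 2001 (31 rows)

# Slice endpoints need not be present in the index; both ends are inclusive
early = ts.loc['1999-06-01':'2000-01-11']
print(len(year_2001), len(may_2001), len(early))
```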

dates = pd.date_range('1/1/2000', periods=100, freq='W-WED')
long_df = DataFrame(np.random.randn(100, 4),
                    index=dates,
                    columns=['Colorado', 'Texas', 'New York', 'Ohio'])
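long_df.loc['5-2001']  # .ix has been removed from pandas; .loc does the same label-based selection

Time series with duplicate indexes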

dates = pd.DatetimeIndex(['1/1/2000', '1/2/2000', '1/2/2000', '1/2/2000',
                          '1/3/2000'])
dup_ts = Series(np.arange(5), index=dates)
dup_ts

2000-01-01    0
2000-01-02    1
2000-01-02    2
2000-01-02    3
2000-01-03    4
dtype: int32

dup_ts.index.is_unique

False

# Indexing may return a scalar or a slice, depending on whether the timestamp is duplicated
print(dup_ts['1/2/2000'])
print('----------------------------')
print(dup_ts['1/3/2000'])

2000-01-02    1
2000-01-02    2
2000-01-02    3
dtype: int32
----------------------------
4

# One way to aggregate data with non-unique timestamps is groupby with level=0
grouped = dup_ts.groupby(level = 0)
grouped.mean()

2000-01-01    0
2000-01-02    2
2000-01-03    4
dtype: int32
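The same grouping can report how many observations share each timestamp, which is a quick way to find the duplicates before aggregating. A minimal sketch with the same five dates as above:

```python
import numpy as np
import pandas as pd

dates = pd.DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-02',
                          '2000-01-02', '2000-01-03'])
dup_ts = pd.Series(np.arange(5), index=dates)

grouped = dup_ts.groupby(level=0)   # level=0 means "group by the index"
counts = grouped.count()            # observations per timestamp
means = grouped.mean()              # one value per unique timestamp
print(counts)
print(means)
```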

Chapter 10, Section 3: Date Ranges, Frequencies, and Shifting

%pylab inline
import pandas as pd
from datetime import datetime
from pandas import Series, DataFrame

Populating the interactive namespace from numpy and matplotlib

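Time series in pandas are generally assumed to be irregular, with no fixed frequency.

Sometimes, though, you need to analyze the data at a relatively fixed frequency, such as daily or monthly.

pandas provides a full set of standard time series frequencies, plus tools for resampling, inferring frequencies, and generating fixed-frequency date ranges.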

dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7),
         datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]
ts = Series(np.random.randn(6), index=dates)
ts

2011-01-02    0.010556
2011-01-05    0.409548
2011-01-07    0.589688
2011-01-08    1.670158
2011-01-10    0.071417
2011-01-12   -0.809712
dtype: float64

# Resample to daily frequency with resample
r=ts.resample('D')
r

DatetimeIndexResampler [freq=<Day>, axis=0, closed=left, label=left, convention=start, base=0]

r.mean()

2011-01-02    0.010556
2011-01-03         NaN
2011-01-04         NaN
2011-01-05    0.409548
2011-01-06         NaN
2011-01-07    0.589688
2011-01-08    1.670158
2011-01-09         NaN
2011-01-10    0.071417
2011-01-11         NaN
2011-01-12   -0.809712
Freq: D, dtype: float64

ts.resample('2D').sum()

2011-01-02    0.010556
2011-01-04    0.409548
2011-01-06    0.589688
2011-01-08    1.670158
2011-01-10    0.071417
2011-01-12   -0.809712
Freq: 2D, dtype: float64

Generating date ranges

# Generate a DatetimeIndex for a given range or length
index = pd.date_range('4/1/2012', '6/1/2012')
index

DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04',
               '2012-04-05', '2012-04-06', '2012-04-07', '2012-04-08',
               '2012-04-09', '2012-04-10', '2012-04-11', '2012-04-12',
               '2012-04-13', '2012-04-14', '2012-04-15', '2012-04-16',
               '2012-04-17', '2012-04-18', '2012-04-19', '2012-04-20',
               '2012-04-21', '2012-04-22', '2012-04-23', '2012-04-24',
               '2012-04-25', '2012-04-26', '2012-04-27', '2012-04-28',
               '2012-04-29', '2012-04-30', '2012-05-01', '2012-05-02',
               '2012-05-03', '2012-05-04', '2012-05-05', '2012-05-06',
               '2012-05-07', '2012-05-08', '2012-05-09', '2012-05-10',
               '2012-05-11', '2012-05-12', '2012-05-13', '2012-05-14',
               '2012-05-15', '2012-05-16', '2012-05-17', '2012-05-18',
               '2012-05-19', '2012-05-20', '2012-05-21', '2012-05-22',
               '2012-05-23', '2012-05-24', '2012-05-25', '2012-05-26',
               '2012-05-27', '2012-05-28', '2012-05-29', '2012-05-30',
               '2012-05-31', '2012-06-01'],
              dtype='datetime64[ns]', freq='D')

pd.date_range(start='4/1/2012', periods=20)

DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04',
               '2012-04-05', '2012-04-06', '2012-04-07', '2012-04-08',
               '2012-04-09', '2012-04-10', '2012-04-11', '2012-04-12',
               '2012-04-13', '2012-04-14', '2012-04-15', '2012-04-16',
               '2012-04-17', '2012-04-18', '2012-04-19', '2012-04-20'],
              dtype='datetime64[ns]', freq='D')

pd.date_range(end='6/1/2012', periods=20)

DatetimeIndex(['2012-05-13', '2012-05-14', '2012-05-15', '2012-05-16',
               '2012-05-17', '2012-05-18', '2012-05-19', '2012-05-20',
               '2012-05-21', '2012-05-22', '2012-05-23', '2012-05-24',
               '2012-05-25', '2012-05-26', '2012-05-27', '2012-05-28',
               '2012-05-29', '2012-05-30', '2012-05-31', '2012-06-01'],
              dtype='datetime64[ns]', freq='D')

# BM (business end of month): the last business day of each month
pd.date_range('1/1/2000', '12/1/2000', freq='BM')

DatetimeIndex(['2000-01-31', '2000-02-29', '2000-03-31', '2000-04-28',
               '2000-05-31', '2000-06-30', '2000-07-31', '2000-08-31',
               '2000-09-29', '2000-10-31', '2000-11-30'],
              dtype='datetime64[ns]', freq='BM')

pd.date_range('5/2/2012 12:56:31', periods=5)

DatetimeIndex(['2012-05-02 12:56:31', '2012-05-03 12:56:31',
               '2012-05-04 12:56:31', '2012-05-05 12:56:31',
               '2012-05-06 12:56:31'],
              dtype='datetime64[ns]', freq='D')

pd.date_range('5/2/2012 12:56:31', periods=5, normalize=True)

DatetimeIndex(['2012-05-02', '2012-05-03', '2012-05-04', '2012-05-05',
               '2012-05-06'],
              dtype='datetime64[ns]', freq='D')

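Frequencies and date offsets

A frequency in pandas consists of a base frequency and a multiplier, for example H and 5H. Offsets work the same way.

Base frequencies (table image not reproduced in these notes)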
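The base-times-multiplier idea can be seen directly with offset objects. The sketch below uses minutes rather than hours, on the assumption that the 'min' alias is accepted across pandas versions while the spelling of the hourly alias has changed between releases.

```python
import pandas as pd
from pandas.tseries.offsets import Minute

ninety = 3 * Minute(30)          # multiplier times a base offset
rng = pd.date_range('2000-01-01', periods=3, freq=ninety)

# The same frequency written as a string: multiplier 90, base 'min'
same = pd.date_range('2000-01-01', periods=3, freq='90min')
print(rng)
```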

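Shifting (leading and lagging) data

Shifting means moving data backward or forward along the time axis.

Both Series and DataFrame have a shift method for a plain forward or backward shift that leaves the index unchanged.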

ts = Series(np.random.randn(4),
            index=pd.date_range('1/1/2000', periods=4, freq='M'))
ts

2000-01-31    0.001004
2000-02-29   -1.498394
2000-03-31    0.499473
2000-04-30    0.402967
Freq: M, dtype: float64

ts.shift(2)

2000-01-31         NaN
2000-02-29         NaN
2000-03-31    0.001004
2000-04-30   -1.498394
Freq: M, dtype: float64

ts.shift(-2)

2000-01-31    0.499473
2000-02-29    0.402967
2000-03-31         NaN
2000-04-30         NaN
Freq: M, dtype: float64

# shift is commonly used to compute percent changes in one or more time series (for example, DataFrame columns)
ts / ts.shift(1) - 1

2000-01-31            NaN
2000-02-29   -1493.219055
2000-03-31      -1.333339
2000-04-30      -0.193216
Freq: M, dtype: float64

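# A plain shift does not modify the index, so some data is discarded
# If the frequency is known, pass it to shift to move the timestamps instead of the data
ts.shift(2, freq='M')  # timestamps move; the data stays attached to them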

2000-03-31    0.001004
2000-04-30   -1.498394
2000-05-31    0.499473
2000-06-30    0.402967
Freq: M, dtype: float64

ts.shift(3,freq = 'D')

2000-02-03    0.001004
2000-03-03   -1.498394
2000-04-03    0.499473
2000-05-03    0.402967
dtype: float64

ts.shift(1,freq = '3D')

2000-02-03    0.001004
2000-03-03   -1.498394
2000-04-03    0.499473
2000-05-03    0.402967
dtype: float64

ts.shift(1,freq = '90T')

2000-01-31 01:30:00    0.001004
2000-02-29 01:30:00   -1.498394
2000-03-31 01:30:00    0.499473
2000-04-30 01:30:00    0.402967
Freq: M, dtype: float64
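The two kinds of shift are easiest to compare side by side. This is a small sketch on an invented daily series; it also checks that the hand-written percent change matches pct_change.

```python
import numpy as np
import pandas as pd

ts = pd.Series([1.0, 2.0, 4.0, 8.0],
               index=pd.date_range('2000-01-01', periods=4, freq='D'))

lagged = ts.shift(1)             # data moves, index stays, first value is NaN
moved = ts.shift(1, freq='D')    # index moves, every value is kept

# Percent change written by hand and with the built-in helper
by_hand = ts / ts.shift(1) - 1
built_in = ts.pct_change()
print(lagged)
print(moved)
```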

Shifting dates with offsets

pandas date offsets can also be used with datetime or Timestamp objects.

from pandas.tseries.offsets import Day, MonthEnd
now = datetime(2011, 11, 17)
now + 3 * Day()

Timestamp('2011-11-20 00:00:00')

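# Adding an anchored offset first rolls the date forward to the next date that matches the frequency rule
# If the date is already on an anchor, the result is the following anchor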
now + MonthEnd()

Timestamp('2011-11-30 00:00:00')

now + MonthEnd(2)

Timestamp('2011-12-31 00:00:00')

# The rollforward and rollback methods of an anchored offset roll a date forward or backward explicitly
offset = MonthEnd()
offset.rollforward(now)

Timestamp('2011-11-30 00:00:00')

offset.rollback(now)

Timestamp('2011-10-31 00:00:00')
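The difference between adding an anchored offset and rolling with it shows up when the date is already on an anchor. A minimal sketch with MonthEnd, using two hand-picked dates:

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

offset = MonthEnd()
mid = pd.Timestamp('2011-11-17')       # not a month end
anchor = pd.Timestamp('2011-11-30')    # already a month end

added_mid = mid + offset               # rolls forward to 2011-11-30
added_anchor = anchor + offset         # on an anchor, so moves to 2011-12-31
kept = offset.rollforward(anchor)      # rollforward leaves an anchor unchanged
back = offset.rollback(mid)            # previous month end, 2011-10-31
print(added_mid, added_anchor, kept, back)
```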

# A clever use of date offsets is to combine these rolling methods with groupby
ts = Series(np.random.randn(20),
            index=pd.date_range('1/15/2000', periods=20, freq='4d'))
ts.groupby(offset.rollforward).mean()

2000-01-31   -0.040670
2000-02-29   -0.270960
2000-03-31   -0.279871
dtype: float64

# Of course, resample is a simpler and faster way to do this
ts.resample('M').mean()

2000-01-31   -0.040670
2000-02-29   -0.270960
2000-03-31   -0.279871
Freq: M, dtype: float64

Chapter 10, Section 4: Time Zone Handling

%pylab inline
import pandas as pd
from datetime import datetime
from pandas import Series, DataFrame

Populating the interactive namespace from numpy and matplotlib

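Working with time zones is one of the most unpleasant parts of time series manipulation. Many people therefore work in Coordinated Universal Time (UTC), the successor to Greenwich Mean Time and the current international standard.

Time zones are expressed as offsets from UTC.

In Python, time zone information comes from the third-party pytz library, which exposes the Olson database.

pandas wraps pytz's functionality, so you do not need to memorize its API; you only need the time zone names, which can be found in the documentation.

# Look up time zone names interactively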
import pytz
pytz.common_timezones[-5:]

['US/Eastern', 'US/Hawaii', 'US/Mountain', 'US/Pacific', 'UTC']

tz = pytz.timezone('Asia/Shanghai')
tz

<DstTzInfo 'Asia/Shanghai' LMT+8:06:00 STD>

Localization and conversion

By default, time series in pandas are time zone naive; the tz attribute of the index is None.

rng = pd.date_range('3/9/2012 9:30', periods=6, freq='D')
ts = Series(np.random.randn(len(rng)), index=rng)

print(ts.index.tz)

None

# A time zone can be set when generating a date range
print(pd.date_range('3/9/2012',periods = 10,freq = 'D',tz = 'UTC'))

DatetimeIndex(['2012-03-09', '2012-03-10', '2012-03-11', '2012-03-12',
               '2012-03-13', '2012-03-14', '2012-03-15', '2012-03-16',
               '2012-03-17', '2012-03-18'],
              dtype='datetime64[ns, UTC]', freq='D')

# Converting from naive to localized is handled by the tz_localize method
ts_utc = ts.tz_localize('UTC')
ts_utc

2012-03-09 09:30:00+00:00    1.570592
2012-03-10 09:30:00+00:00    1.514111
2012-03-11 09:30:00+00:00   -0.818680
2012-03-12 09:30:00+00:00   -0.097930
2012-03-13 09:30:00+00:00   -0.072169
2012-03-14 09:30:00+00:00    0.141429
Freq: D, dtype: float64

ts_utc.index

DatetimeIndex(['2012-03-09 09:30:00+00:00', '2012-03-10 09:30:00+00:00',
               '2012-03-11 09:30:00+00:00', '2012-03-12 09:30:00+00:00',
               '2012-03-13 09:30:00+00:00', '2012-03-14 09:30:00+00:00'],
              dtype='datetime64[ns, UTC]', freq='D')

# Once a time series has been localized to a particular time zone, tz_convert converts it to another time zone
ts_utc.tz_convert('US/Eastern')

2012-03-09 04:30:00-05:00    1.570592
2012-03-10 04:30:00-05:00    1.514111
2012-03-11 05:30:00-04:00   -0.818680
2012-03-12 05:30:00-04:00   -0.097930
2012-03-13 05:30:00-04:00   -0.072169
2012-03-14 05:30:00-04:00    0.141429
Freq: D, dtype: float64
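The order of the two calls matters: tz_localize attaches a zone to naive data, and tz_convert only works once a zone is attached. The sketch below uses Asia/Shanghai as the target zone (an arbitrary choice with a fixed +8 offset) and assumes the time zone database is installed.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2012-03-09 09:30', periods=3, freq='D')
ts = pd.Series(np.arange(3), index=rng)

try:
    ts.tz_convert('Asia/Shanghai')      # naive index: conversion is refused
    naive_error = None
except TypeError as exc:
    naive_error = exc

ts_utc = ts.tz_localize('UTC')                  # attach a zone, wall time unchanged
ts_sh = ts_utc.tz_convert('Asia/Shanghai')      # same instants, new wall time
print(ts_sh)
```

Only the displayed wall-clock time changes in the converted series; the underlying instants are identical.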

Operations with time zone-aware Timestamp objects

Like time series and date ranges, individual Timestamp objects can be localized from naive to time zone-aware and converted from one time zone to another.

stamp = pd.Timestamp('2011-03-12 04:00')
stamp_utc = stamp.tz_localize('utc')
stamp_utc.tz_convert('US/Eastern')

Timestamp('2011-03-11 23:00:00-0500', tz='US/Eastern')

# A time zone can also be passed when creating the Timestamp
stamp_moscow = pd.Timestamp('2011-03-12 04:00', tz='Europe/Moscow')
stamp_moscow

Timestamp('2011-03-12 04:00:00+0300', tz='Europe/Moscow')

# The internal UTC timestamp value (nanoseconds) of a Timestamp does not change during time zone conversion
stamp_utc.value

1299902400000000000

stamp_utc.tz_convert('US/Eastern').value

1299902400000000000

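# Meant to show a time 30 minutes before the DST transition; note that in 2012 US/Eastern
# switched to daylight time on March 11, so this March 12 timestamp is already at -0400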
from pandas.tseries.offsets import Hour
stamp = pd.Timestamp('2012-03-12 01:30', tz='US/Eastern')
stamp

Timestamp('2012-03-12 01:30:00-0400', tz='US/Eastern')

stamp + Hour()

Timestamp('2012-03-12 02:30:00-0400', tz='US/Eastern')

# 90 minutes before DST transition
stamp = pd.Timestamp('2012-11-04 00:30', tz='US/Eastern')
stamp

Timestamp('2012-11-04 00:30:00-0400', tz='US/Eastern')

stamp + 2 * Hour()

Timestamp('2012-11-04 01:30:00-0500', tz='US/Eastern')

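Operations between different time zones

If two time series with different time zones are combined, the result is in UTC. Because timestamps are stored in UTC under the hood, no conversion is needed.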

rng = pd.date_range('3/7/2012 9:30', periods=10, freq='B')
ts = Series(np.random.randn(len(rng)), index=rng)
ts

2012-03-07 09:30:00   -0.498195
2012-03-08 09:30:00    0.085745
2012-03-09 09:30:00    1.460264
2012-03-12 09:30:00    1.586622
2012-03-13 09:30:00   -0.083646
2012-03-14 09:30:00   -0.034900
2012-03-15 09:30:00   -0.262193
2012-03-16 09:30:00   -0.885324
2012-03-19 09:30:00    1.066322
2012-03-20 09:30:00    1.247435
Freq: B, dtype: float64

# Note: a naive series cannot be converted to a time zone directly; localize it with tz_localize first, then convert
ts1 = ts[:7].tz_localize('Europe/London')
ts2 = ts1[2:].tz_convert('Europe/Moscow')
result = ts1 + ts2

# The result is automatically in UTC
result.index

DatetimeIndex(['2012-03-07 09:30:00+00:00', '2012-03-08 09:30:00+00:00',
               '2012-03-09 09:30:00+00:00', '2012-03-12 09:30:00+00:00',
               '2012-03-13 09:30:00+00:00', '2012-03-14 09:30:00+00:00',
               '2012-03-15 09:30:00+00:00'],
              dtype='datetime64[ns, UTC]', freq='B')
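A compact version of the same experiment makes the alignment visible: values are matched on the underlying instants, not on the wall-clock labels. The sketch uses a made-up daily series and assumes the time zone database is available.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2012-03-07 09:30', periods=7, freq='D')
ts = pd.Series(np.arange(7, dtype=float), index=rng)

ts1 = ts.tz_localize('Europe/London')
ts2 = ts1.iloc[2:].tz_convert('Europe/Moscow')   # same instants, other zone

result = ts1 + ts2        # aligned on the shared instants; index comes back in UTC
print(result)
```

The first two rows are NaN because ts2 has no matching instants for them; the remaining five are the doubled values.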

Chapter 10, Section 5: Periods and Period Arithmetic

%pylab inline
import pandas as pd
from datetime import datetime
from pandas import Series, DataFrame

Populating the interactive namespace from numpy and matplotlib

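A period represents a span of time, such as a number of days, months, or years.

The Period class represents this data type; its constructor takes a string or integer plus a frequency.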

p = pd.Period(2007, freq='A-DEC')
p

Period('2007', 'A-DEC')

# Shifting by an integer
p + 5

Period('2012', 'A-DEC')

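# Two Periods with the same frequency can be subtracted, giving the number of units between them; Periods with different frequencies cannot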
pd.Period('2014', freq='A-DEC') - p

7

# period_range creates a regular range of periods

rng = pd.period_range('1/1/2000', '6/30/2000', freq='M')
rng

PeriodIndex(['2000-01', '2000-02', '2000-03', '2000-04', '2000-05', '2000-06'], dtype='period[M]', freq='M')

# Use a PeriodIndex as the index
Series(np.random.randn(6), index=rng)

2000-01   -1.057696
2000-02    0.409239
2000-03    0.500589
2000-04    0.842015
2000-05   -1.538833
2000-06    0.735769
Freq: M, dtype: float64

# Build a PeriodIndex directly from an array of strings
values = ['2001Q3', '2002Q2', '2003Q1']
index = pd.PeriodIndex(values, freq='Q-DEC')
index

PeriodIndex(['2001Q3', '2002Q2', '2003Q1'], dtype='period[Q-DEC]', freq='Q-DEC')

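Period frequency conversion

Period and PeriodIndex objects can be converted to another frequency with their asfreq method.

# Convert an annual period to a monthly period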
p = pd.Period('2007', freq='A-DEC')
p.asfreq('M', how='start')

Period('2007-01', 'M')

p.asfreq('M', how='end')

Period('2007-12', 'M')

p = pd.Period('2007', freq='A-JUN')
p.asfreq('M', 'start')

Period('2006-07', 'M')

# Converting from a higher frequency to a lower frequency
p = pd.Period('Aug-2007', 'M')

# Note: with A-JUN, the month 2007-08 falls in the period 2008
p.asfreq('A-JUN')

Period('2008', 'A-JUN')

# Frequency conversion for a PeriodIndex or a time series
rng = pd.period_range('2006', '2009', freq='A-DEC')
ts = Series(np.random.randn(len(rng)), index=rng)
ts

2006   -1.195180
2007    0.424639
2008    0.157479
2009    0.774099
Freq: A-DEC, dtype: float64

ts.asfreq('M', how='start')

2006-01   -1.195180
2007-01    0.424639
2008-01    0.157479
2009-01    0.774099
Freq: M, dtype: float64

ts.asfreq('B', how='end')

2006-12-29   -1.195180
2007-12-31    0.424639
2008-12-31    0.157479
2009-12-31    0.774099
Freq: B, dtype: float64

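Quarterly period frequencies

Quarterly data is common in accounting, finance, and other fields. Much quarterly data is reported relative to a fiscal year end, typically the last calendar or business day of one of the 12 months of the year. As a result, the period 2012Q4 has a different meaning depending on the fiscal year end. pandas supports all 12 possible quarterly frequencies, Q-JAN through Q-DEC.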

p = pd.Period('2012Q4', freq='Q-JAN')
p

Period('2012Q4', 'Q-JAN')

p.asfreq('D', 'start')

Period('2011-11-01', 'D')

p.asfreq('D', 'end')

Period('2012-01-31', 'D')
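To see how the fiscal year end changes what a quarter label means, the same label can be built with two different quarterly frequencies. This is a small sketch; the comparison with Q-DEC is added for illustration.

```python
import pandas as pd

# Fiscal year ending in January: 2012Q4 covers Nov 2011 through Jan 2012
p = pd.Period('2012Q4', freq='Q-JAN')
first_day = p.asfreq('D', how='start')
last_day = p.asfreq('D', how='end')

# Same label with a calendar fiscal year: 2012Q4 starts in Oct 2012
q = pd.Period('2012Q4', freq='Q-DEC')
cal_first = q.asfreq('D', how='start')
print(first_day, last_day, cal_first)
```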

p4pm = (p.asfreq('B', 'e') - 1).asfreq('T', 's') + 16 * 60
p4pm

Period('2012-01-30 16:00', 'T')

p4pm.to_timestamp()

Timestamp('2012-01-30 16:00:00')

# period_range can also generate quarterly ranges; arithmetic on them works the same way as above
rng = pd.period_range('2011Q3', '2012Q4', freq='Q-JAN')
ts = Series(np.arange(len(rng)), index=rng)
ts

2011Q3    0
2011Q4    1
2012Q1    2
2012Q2    3
2012Q3    4
2012Q4    5
Freq: Q-JAN, dtype: int32

new_rng = (rng.asfreq('B', 'e') - 1).asfreq('T', 's') + 16 * 60
ts.index = new_rng.to_timestamp()
ts

2010-10-28 16:00:00    0
2011-01-28 16:00:00    1
2011-04-28 16:00:00    2
2011-07-28 16:00:00    3
2011-10-28 16:00:00    4
2012-01-30 16:00:00    5
dtype: int32

Converting between Timestamp and Period

The to_period method converts Series and DataFrame objects indexed by timestamps to a period index.

rng = pd.date_range('1/1/2000', periods=3, freq='M')
ts = Series(randn(3), index=rng)
pts = ts.to_period()
ts

2000-01-31   -0.890889
2000-02-29    0.863907
2000-03-31   -0.734717
Freq: M, dtype: float64

pts

2000-01   -0.890889
2000-02    0.863907
2000-03   -0.734717
Freq: M, dtype: float64

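# Periods refer to non-overlapping time spans, so for a given frequency a timestamp can belong to only one period.
# The frequency of the new PeriodIndex is inferred from the timestamps by default, but you can specify one yourself; duplicate periods are allowed in the result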

rng = pd.date_range('1/29/2000', periods=6, freq='D')
ts2 = Series(randn(6), index=rng)
ts2.to_period('M')

2000-01    0.974506
2000-01   -0.175840
2000-01    0.051886
2000-02   -0.543674
2000-02    1.710834
2000-02    0.343342
Freq: M, dtype: float64

pts = ts.to_period()
pts

2000-01   -0.890889
2000-02    0.863907
2000-03   -0.734717
Freq: M, dtype: float64

# Convert back to timestamps
pts.to_timestamp(how='end')

2000-01-31   -0.890889
2000-02-29    0.863907
2000-03-31   -0.734717
Freq: M, dtype: float64
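When to_period produces duplicate periods, one possible next step is to aggregate them before converting back. The sketch below is one way to do that on invented data; summing is an arbitrary choice of aggregation.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-29', periods=6, freq='D')
ts = pd.Series(np.arange(6), index=rng)

monthly = ts.to_period('M')                  # three days land in each month
labels = monthly.index.astype(str).tolist()

# Collapse the duplicate periods, then go back to timestamps
totals = monthly.groupby(level=0).sum()
stamps = totals.to_timestamp(how='start')    # first instant of each month
print(stamps)
```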

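Creating a PeriodIndex from arrays

Fixed-frequency data sets are often stored with the time span information spread across multiple columns. In the macroeconomic data set below, for example, the year and quarter are in separate columns.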

data = pd.read_csv('data/ch08/macrodata.csv')
data.year.head()

0    1959.0
1    1959.0
2    1959.0
3    1959.0
4    1960.0
Name: year, dtype: float64

data.quarter.head()

0    1.0
1    2.0
2    3.0
3    4.0
4    1.0
Name: quarter, dtype: float64

# Build a PeriodIndex from the year and quarter arrays plus the frequency Q-DEC
index = pd.PeriodIndex(year=data.year, quarter=data.quarter, freq='Q-DEC')
index

PeriodIndex(['1959Q1', '1959Q2', '1959Q3', '1959Q4', '1960Q1', '1960Q2',
             '1960Q3', '1960Q4', '1961Q1', '1961Q2',
             ...
             '2007Q2', '2007Q3', '2007Q4', '2008Q1', '2008Q2', '2008Q3',
             '2008Q4', '2009Q1', '2009Q2', '2009Q3'],
            dtype='period[Q-DEC]', length=203, freq='Q-DEC')

# Use the PeriodIndex as the index of data
data.index = index
data.infl.head()

1959Q1    0.00
1959Q2    2.34
1959Q3    2.74
1959Q4    0.27
1960Q1    2.31
Freq: Q-DEC, Name: infl, dtype: float64

Chapter 10, Section 6: Resampling and Frequency Conversion

%pylab inline
import pandas as pd
from datetime import datetime
from pandas import Series, DataFrame

Populating the interactive namespace from numpy and matplotlib

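pandas objects provide a resample method for resampling.

For time series, resampling is the process of converting a time series from one frequency to another.

Two cases have their own names: aggregating higher-frequency data to a lower frequency is called downsampling, and converting lower-frequency data to a higher frequency is called upsampling.

Not all resampling falls into these two categories; for example, converting W-WED to W-FRI is neither downsampling nor upsampling.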

rng = pd.date_range('1/1/2000', periods=100, freq='D')
ts = Series(randn(len(rng)), index=rng)
ts.resample('M').mean()

2000-01-31   -0.102857
2000-02-29    0.042360
2000-03-31   -0.065909
2000-04-30   -0.058290
Freq: M, dtype: float64

ts.resample('M', kind='period').mean()

2000-01   -0.102857
2000-02    0.042360
2000-03   -0.065909
2000-04   -0.058290
Freq: M, dtype: float64

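The main parameters of the resample method are:

Resample method parameters (table image not reproduced in these notes)

Downsampling

Lowering the frequency of data is called downsampling, which means aggregating the data. Each data point can belong to only one aggregation interval, and the union of the intervals makes up the whole time frame. When downsampling, consider the following:

Which side of each interval is closed
How to label each aggregated bin, with the start of the interval or the end

# One-minute data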
rng = pd.date_range('1/1/2000', periods=12, freq='T')
ts = Series(np.arange(12), index=rng)
ts

2000-01-01 00:00:00     0
2000-01-01 00:01:00     1
2000-01-01 00:02:00     2
2000-01-01 00:03:00     3
2000-01-01 00:04:00     4
2000-01-01 00:05:00     5
2000-01-01 00:06:00     6
2000-01-01 00:07:00     7
2000-01-01 00:08:00     8
2000-01-01 00:09:00     9
2000-01-01 00:10:00    10
2000-01-01 00:11:00    11
Freq: T, dtype: int32

# Aggregate into five-minute bins
# Note: by default the intervals are closed on the left and open on the right
ts.resample('5min').last()

2000-01-01 00:00:00     4
2000-01-01 00:05:00     9
2000-01-01 00:10:00    11
Freq: 5T, dtype: int32

# closed='right' makes the intervals open on the left and closed on the right
ts.resample('5min', closed='right').last()

1999-12-31 23:55:00     0
2000-01-01 00:00:00     5
2000-01-01 00:05:00    10
2000-01-01 00:10:00    11
Freq: 5T, dtype: int32

# label='right' labels each bin with its right edge
ts.resample('5min', closed='right', label='right').last()

2000-01-01 00:00:00     0
2000-01-01 00:05:00     5
2000-01-01 00:10:00    10
2000-01-01 00:15:00    11
Freq: 5T, dtype: int32

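# Shift the result index by some amount
# (newer pandas releases deprecate loffset; calling shift on the result, or adjusting its index, achieves the same thing)
ts.resample('5min', loffset='-1s').last()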

1999-12-31 23:59:59     4
2000-01-01 00:04:59     9
2000-01-01 00:09:59    11
Freq: 5T, dtype: int32
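The effect of closed and label is easier to check with sums than with last values. The sketch below repeats the twelve-minute series and, as an assumption about how to replace loffset, shifts the labels by editing the result index directly.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-01', periods=12, freq='min')
ts = pd.Series(np.arange(12), index=rng)

# Default: bins are [00:00, 00:05), [00:05, 00:10), ... labelled by the left edge
left = ts.resample('5min').sum()

# Right-closed bins labelled by the right edge: (23:55, 00:00], (00:00, 00:05], ...
right = ts.resample('5min', closed='right', label='right').sum()

# Shift the labels one second earlier without the loffset argument
shifted = left.copy()
shifted.index = shifted.index - pd.Timedelta(seconds=1)
print(left, right, shifted, sep='\n')
```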

OHLC resampling

pandas handles OHLC (open, high, low, close) aggregation specially.

ts.resample('5min').ohlc()

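Resampling with groupby

Another way to downsample is to use pandas's groupby functionality. For example, to group by month or weekday, pass a function that accesses those fields on the time series index: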

rng = pd.date_range('1/1/2000', periods=100, freq='D')
ts = Series(np.arange(100), index=rng)
ts.groupby(lambda x: x.month).mean()

1    15
2    45
3    75
4    95
dtype: int32

ts.groupby(lambda x: x.weekday).mean()

0    47.5
1    48.5
2    49.5
3    50.5
4    51.5
5    49.0
6    50.0
dtype: float64

Upsampling and interpolation

No aggregation is needed when converting from a lower frequency to a higher frequency.

frame = DataFrame(np.random.randn(2, 4),
                  index=pd.date_range('1/1/2000', periods=2, freq='W-WED'),
                  columns=['Colorado', 'Texas', 'New York', 'Ohio'])
frame

# Resampling to daily frequency introduces missing values by default
df_daily = frame.resample('D')
df_daily.last()

# Fill forward, just as with fillna and reindex
frame.resample('D').ffill()

# Fill only a certain number of periods forward, limiting how far an observed value keeps being used
frame.resample('D').ffill(limit=2)

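# Note that the new date index need not overlap with the old one at all; this example also shows that the dates can extend past the original range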
frame.resample('W-THU').ffill()
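The three upsampling variants can be compared by counting the missing rows each one leaves. This is a minimal sketch on a two-row weekly frame with made-up values and only two columns.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(4.0).reshape(2, 2),
                     index=pd.date_range('2000-01-05', periods=2, freq='W-WED'),
                     columns=['Colorado', 'Texas'])

daily = frame.resample('D').asfreq()            # new rows are NaN
filled = frame.resample('D').ffill()            # carry values forward
limited = frame.resample('D').ffill(limit=2)    # at most two days forward
print(daily, filled, limited, sep='\n')
```

Between the two Wednesdays there are six new days: asfreq leaves all six empty, ffill fills them all, and limit=2 fills only the first two.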

Resampling with periods

Resampling data indexed by periods is fairly straightforward.

frame = DataFrame(np.random.randn(24, 4),
                  index=pd.period_range('1-2000', '12-2001', freq='M'),
                  columns=['Colorado', 'Texas', 'New York', 'Ohio'])
frame[:5]

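# Upsampling is a little more involved, because you must decide which end of each span in the new frequency receives the original value
# As with asfreq, this is set by the convention argument ('start' or 'end'); older pandas defaulted to 'end', newer versions to 'start'
# Q-DEC: quarterly, with the year ending in December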
annual_frame = frame.resample('Q-DEC').mean()
annual_frame

annual_frame.resample('Q-DEC').ffill()

# Q-DEC: Quarterly, year ending in December

# note: output changed, default value changed from convention='end' to convention='start' + 'start' changed to span-like
# also the following cells
annual_frame.resample('Q-DEC', convention='start').ffill()

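Because periods refer to time spans, the rules for upsampling and downsampling are stricter:

In downsampling, the target frequency must be a subperiod of the source frequency
In upsampling, the target frequency must be a superperiod of the source frequency

If these rules are not satisfied, an exception is raised. This mainly affects quarterly, annual, and weekly frequencies.

For example, the time spans defined by Q-MAR only line up with A-MAR, A-JUN, and so on.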

annual_frame.resample('Q-MAR').ffill()

Chapter 10, Section 7: Time Series Plotting

%pylab inline
import pandas as pd
from datetime import datetime
from pandas import Series, DataFrame

Populating the interactive namespace from numpy and matplotlib

pandas time series plots have better date formatting than plain matplotlib.

import matplotlib.pyplot as plt
plt.rc('figure', figsize=(12, 4))

close_px_all = pd.read_csv('data/ch09/stock_px.csv', parse_dates=True, index_col=0)
close_px = close_px_all[['AAPL', 'MSFT', 'XOM']]
close_px = close_px.resample('B').ffill()
close_px.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2292 entries, 2003-01-02 to 2011-10-14
Freq: B
Data columns (total 3 columns):
AAPL    2292 non-null float64
MSFT    2292 non-null float64
XOM     2292 non-null float64
dtypes: float64(3)
memory usage: 71.6 KB

close_px['AAPL'].plot()

<matplotlib.axes._subplots.AxesSubplot at 0x99aab70>

close_px.loc['2009'].plot()

<matplotlib.axes._subplots.AxesSubplot at 0x4247400>

close_px['AAPL'].loc['01-2011':'03-2011'].plot()

<matplotlib.axes._subplots.AxesSubplot at 0x9a7bc50>

appl_q = close_px['AAPL'].resample('Q-DEC').ffill()
appl_q.loc['2009':].plot()

<matplotlib.axes._subplots.AxesSubplot at 0x9e33518>

Chapter 10, Section 8: Moving Window Functions

%pylab inline
import pandas as pd
from datetime import datetime
from pandas import Series, DataFrame

Populating the interactive namespace from numpy and matplotlib

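A common class of array transformations for time series is statistics and other functions evaluated over a sliding window, optionally with exponentially decaying weights.

These are called moving window functions, and they include functions without a fixed-length window, such as the exponentially weighted moving average.

Like other statistical functions, moving window functions automatically exclude missing data. They usually require a specified minimum number of non-NA observations.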

close_px_all = pd.read_csv('data/ch09/stock_px.csv', parse_dates=True, index_col=0)
close_px = close_px_all[['AAPL', 'MSFT', 'XOM']]
close_px = close_px.resample('B').ffill()
close_px.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2292 entries, 2003-01-02 to 2011-10-14
Freq: B
Data columns (total 3 columns):
AAPL    2292 non-null float64
MSFT    2292 non-null float64
XOM     2292 non-null float64
dtypes: float64(3)
memory usage: 71.6 KB

# The rolling mean is the simplest of these. The old pd.rolling_mean took a TimeSeries or DataFrame and a window (a number of periods); it is now the rolling(...).mean() method
close_px = close_px.asfreq('B').ffill()

<matplotlib.axes._subplots.AxesSubplot at 0x9e56e80>

close_px.AAPL.plot()
# pd.rolling_mean is deprecated
#pd.rolling_mean(close_px.AAPL, 250).plot()
# 250-day moving average
close_px.AAPL.rolling(window=250,center=False).mean().plot()

<matplotlib.axes._subplots.AxesSubplot at 0x9e36ba8>

# Start a second figure
plt.figure()

<matplotlib.figure.Figure at 0x9b75b38>


#close_px.AAPL.plot()
# 250-period rolling standard deviation
close_px.AAPL.rolling(window=250,center=False).std().plot()

<matplotlib.axes._subplots.AxesSubplot at 0xbe92518>

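# With min_periods=10, a value is produced as soon as 10 non-NA observations are available:
# the statistic uses however many points the window holds (10, 11, ... up to 250),
# which is the "specified number of non-NA observations" mentioned above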
close_px.AAPL.rolling(window=250,min_periods = 10).std().plot()

<matplotlib.axes._subplots.AxesSubplot at 0xba6cb70>

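To compute an expanding window mean, treat the expanding window as a special case of a window whose length equals the length of the time series but which needs only one or more periods to produce a value.

# Define the expanding mean in terms of a rolling mean
expanding_mean = lambda x: x.rolling(window=len(x), min_periods=1).mean()
close_px.rolling(window=60,center=False).mean().plot(logy=True)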

<matplotlib.axes._subplots.AxesSubplot at 0xd123e48>
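The equivalence between a full-length rolling window and an expanding window can be checked on a tiny series. This is a minimal sketch with made-up numbers; expanding is the pandas method that names this case directly.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# A window as long as the series, usable from the first observation on
via_rolling = s.rolling(window=len(s), min_periods=1).mean()
via_expanding = s.expanding(min_periods=1).mean()

# With min_periods=3 the first two results are NaN
strict = s.rolling(window=len(s), min_periods=3).mean()
print(via_expanding)
print(strict)
```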

plt.close('all')

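Moving window and exponentially weighted functions in pandas:

Moving window and exponentially weighted functions (table image not reproduced in these notes)

Exponentially weighted functions

An alternative to a fixed window size with equally weighted observations is to specify a constant decay factor that gives more weight to recent observations.

There are several ways to specify the decay factor; a popular one is the span, which makes the result comparable to a simple moving window function whose window size equals the span.

Because an exponentially weighted statistic places more weight on recent observations, it adapts faster to changes.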

fig, axes = plt.subplots(nrows=2, ncols=1, sharex=True, sharey=True,
                         figsize=(12, 7))

aapl_px = close_px.AAPL['2005':'2009']

ma60 = aapl_px.rolling(window=60,min_periods=50,center=False).mean()
ewma60 = aapl_px.ewm(span=60,min_periods=0,adjust=True,ignore_na=False).mean()

aapl_px.plot(style='k-', ax=axes[0])
ma60.plot(style='k--', ax=axes[0])
aapl_px.plot(style='k-', ax=axes[1])
ewma60.plot(style='k--', ax=axes[1])
axes[0].set_title('Simple MA')
axes[1].set_title('Exponentially-weighted MA')

<matplotlib.text.Text at 0xd8a95f8>
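To make the link between span and decay concrete, the sketch below computes an exponentially weighted mean by hand and compares it with ewm. It assumes adjust=False so that the simple recursive form applies; the cell above uses adjust=True, which weights the early observations differently.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
span = 3
alpha = 2.0 / (span + 1)          # smoothing factor implied by the span

# Recursive form: y[t] = (1 - alpha) * y[t-1] + alpha * x[t]
manual = [s.iloc[0]]
for x in s.iloc[1:]:
    manual.append((1 - alpha) * manual[-1] + alpha * x)

ewma = s.ewm(span=span, adjust=False).mean()
print(manual)
print(ewma.tolist())
```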

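Binary moving window functions

Some statistical operations, such as correlation and covariance, need to operate on two time series.

For example, financial analysts are often interested in the correlation of a stock with a benchmark index such as the S&P 500.

We can get this by computing percent changes and then a rolling correlation (rolling_corr in older pandas, rolling(...).corr() in the code below).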

spx_px = close_px_all['SPX']
spx_rets = spx_px / spx_px.shift(1) - 1
returns = close_px.pct_change()
corr = returns.AAPL.rolling(window=125,min_periods=100).corr(other=spx_rets)
corr.plot()

<matplotlib.axes._subplots.AxesSubplot at 0xbfe9da0>

# Process several series at once
corr = returns.rolling(window=125,min_periods=100).corr(other=spx_rets)
corr.plot()

<matplotlib.axes._subplots.AxesSubplot at 0xee806d8>

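User-defined moving window functions

rolling_apply (now the apply method of a rolling object) lets you apply an array function of your own design over a moving window.

The only requirement is that the function produce a single value from each piece of the array.

For example, while rolling_quantile computes sample quantiles, you might instead be interested in the percentile rank of a particular value over the sample.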

from scipy.stats import percentileofscore
score_at_2percent = lambda x: percentileofscore(x, 0.02)
result = returns.AAPL.rolling(window=250,center=False).apply(func=score_at_2percent)
result.plot()

<matplotlib.axes._subplots.AxesSubplot at 0x100450f0>
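Any function that reduces an array to one number can be used the same way. The sketch below uses a hypothetical value_range helper (not a pandas function) that needs only NumPy, and passes raw=True so the function receives plain arrays.

```python
import numpy as np
import pandas as pd

def value_range(window):
    """Spread between the largest and smallest value in the window."""
    return window.max() - window.min()

s = pd.Series([1.0, 4.0, 2.0, 8.0, 5.0])

# raw=True hands each window to the function as a NumPy array
spread = s.rolling(window=3).apply(value_range, raw=True)
print(spread)
```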

Chapter 10, Section 9: Performance and Memory Usage Notes

%pylab inline
import pandas as pd
from datetime import datetime
from pandas import Series, DataFrame

Populating the interactive namespace from numpy and matplotlib

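Timestamps and periods are represented as 64-bit integers (using NumPy's datetime64 dtype).

This means that each data point has 8 bytes of memory associated with its timestamp.

So a time series with one million float64 data points has a memory footprint of roughly 16 MB. Because pandas makes every effort to share indexes among time series, creating views on existing time series does not use more memory.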
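The 16 MB figure can be checked by adding up the bytes held by the values and by the index. This is a rough sketch that ignores object overhead; the one-minute frequency is an arbitrary choice that keeps a million timestamps within range.

```python
import numpy as np
import pandas as pd

n = 1_000_000
ts = pd.Series(np.zeros(n),
               index=pd.date_range('2000-01-01', periods=n, freq='min'))

value_bytes = ts.values.nbytes     # 8 bytes per float64 value
index_bytes = ts.index.nbytes      # 8 bytes per timestamp
total_mb = (value_bytes + index_bytes) / 1e6
print(value_bytes, index_bytes, total_mb)
```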

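In addition, indexes for lower frequencies (daily and up) are stored in a central cache, so any fixed-frequency index is a view on that date cache. If you have a large collection of low-frequency time series, the memory footprint of the indexes will therefore not be significant.

In terms of performance, pandas has highly optimized data alignment (the behind-the-scenes work of ts1 + ts2 with differently indexed series) and resampling.

# The example below aggregates 10 million data points to OHLC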
rng = pd.date_range('1/1/2000', periods=10000000, freq='10ms')
ts = Series(np.random.randn(len(rng)), index=rng)
ts.head()

2000-01-01 00:00:00.000    0.219509
2000-01-01 00:00:00.010    1.336106
2000-01-01 00:00:00.020   -0.197760
2000-01-01 00:00:00.030   -1.554178
2000-01-01 00:00:00.040    1.049711
Freq: 10L, dtype: float64

%timeit ts.resample('15min').ohlc()

1 loop, best of 3: 146 ms per loop

rng = pd.date_range('1/1/2000', periods=10000000, freq='1s')
ts = Series(np.random.randn(len(rng)), index=rng)
%timeit ts.resample('15s').ohlc()

1 loop, best of 3: 210 ms per loop

Output of the May 2001 selection from long_df:
	Colorado	Texas	New York	Ohio
2001-05-02	0.489843	0.667164	-0.021513	0.638708
2001-05-09	-0.672459	-0.951448	0.200572	0.331490
2001-05-16	0.526990	1.011224	-1.010511	1.244311
2001-05-23	0.199731	-0.477121	1.051838	0.859933
2001-05-30	-0.463182	-1.097522	-0.912460	0.340360

The weekly frame used in the upsampling examples:
	Colorado	Texas	New York	Ohio
2000-01-05	-0.847635	0.66079	2.916199	-0.503541
2000-01-12	0.052009	-0.76434	-1.662339	0.125280

The frame resampled to daily frequency with forward fill:
	Colorado	Texas	New York	Ohio
2000-01-05	-0.847635	0.66079	2.916199	-0.503541
2000-01-06	-0.847635	0.66079	2.916199	-0.503541
2000-01-07	-0.847635	0.66079	2.916199	-0.503541
2000-01-08	-0.847635	0.66079	2.916199	-0.503541
2000-01-09	-0.847635	0.66079	2.916199	-0.503541
2000-01-10	-0.847635	0.66079	2.916199	-0.503541
2000-01-11	-0.847635	0.66079	2.916199	-0.503541
2000-01-12	0.052009	-0.76434	-1.662339	0.125280

The frame resampled to W-THU with forward fill:
	Colorado	Texas	New York	Ohio
2000-01-06	-0.847635	0.66079	2.916199	-0.503541
2000-01-13	0.052009	-0.76434	-1.662339	0.125280

First five rows of the monthly period-indexed frame:
	Colorado	Texas	New York	Ohio
2000-01	2.001633	0.637625	0.422806	-1.233967
2000-02	0.214921	-0.561227	-0.155320	-2.211660
2000-03	-0.584018	-0.205559	1.276460	-2.255439
2000-04	0.346297	0.188510	-1.720630	-0.742461
2000-05	-0.908527	0.315601	-0.507128	-0.449549


Output of ts.resample('5min').ohlc():
	open	high	low	close
2000-01-01 00:00:00	0	4	0	4
2000-01-01 00:05:00	5	9	5	9
2000-01-01 00:10:00	10	11	10	11

annual_frame, the Q-DEC means of the monthly frame:
	Colorado	Texas	New York	Ohio
2000Q1	0.544178	-0.043054	0.514649	-1.900355
2000Q2	-0.000791	0.468495	-1.247954	-0.708676
2000Q3	-0.247497	0.067101	0.512545	0.103532
2000Q4	-1.126654	-0.190198	0.416364	-0.046905
2001Q1	-1.081078	-0.259754	-0.861197	-0.235051
2001Q2	0.010989	0.004505	-1.142694	0.922503
2001Q3	0.843994	-0.283601	-0.942034	0.787534
2001Q4	-0.211269	-0.200959	0.103619	-0.239605