0%

利用Python进行数据分析(5):Pandas 入门

NumPy虽然提供了方便的数组处理功能,但是它还是缺少许多数据处理、分析所需的一些快速工具。
Pandas基于NumPy构建,提供众多更高级的数据处理功能,使得以 NumPy 为中心的数据处理工作更便捷。

pandas的作者在设计 Pandas 时,主要考虑以下几个方面:

  • 按轴自动或显式数据对齐功能的数据结构

    来自不同数据源的数据,由于各数据源的索引方式不同,经常会出现数据未对齐,而这会导致很多常见错误。

  • 集成时间序列功能

  • 既能处理时间序列数据也能处理非时间序列数据的数据结构

  • 数学运算和简约(比如对某个轴求和)可以根据不同的元数据(轴编号)执行

  • 灵活处理缺失数据

  • 合并及其他出现在常见数据库(例如基于SQL的)中的关系型运算

引入 Pandas 的约定为:

1
2
from pandas import Series, DataFrame
import pandas as pd

本章内容较多,每节的内容拆分成一个 ipython notebook:

数据结构

Pandas中两个主要的数据结构是 Series 和 DataFrame,两个数据结构中都使用了索引(Index)对象。

%pylab inline
from pandas import Series, DataFrame
import pandas as pd
Populating the interactive namespace from numpy and matplotlib

利用Python进行数据分析》 作者: Wes McKinney 译者: 唐学韬

Pandas中两个主要的数据结构是 SeriesDataFrame,两个数据结构中都使用了索引(Index)对象。

Series

Series是一组带索引的数组。包含一组数据(各种NumPy数据类型)和与之相关的一组数据标签(即索引)。

可以用index和values分别规定索引和值。如果不规定索引,会自动创建 0 到 N-1 索引。

最简单的 Series

#  只包含一组数据的 Series
obj = Series([4, 7, -5, 3])
obj
Out[40]:
0    4
1    7
2   -5
3    3
dtype: int64
obj.index
Out[41]:
RangeIndex(start=0, stop=4, step=1)
obj.values
Out[42]:
array([ 4,  7, -5,  3])

指定索引

obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2
Out[43]:
d    4
b    7
a   -5
c    3
dtype: int64

通过索引获取数据

#  取单行数据
obj2['d']
Out[44]:
4
# 取多行数据
obj2[['c', 'a', 'd']]
Out[45]:
c    3
a   -5
d    4
dtype: int64

按条件获取数据

obj2[obj2 > 0]
Out[46]:
d    4
b    7
c    3
dtype: int64

数组运算

对数组进行计算,只改变值,不会改变索引

obj2 * 2
Out[47]:
d     8
b    14
a   -10
c     6
dtype: int64
np.exp(obj2)
Out[48]:
d      54.598150
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

改变索引

obj2
Out[49]:
d    4
b    7
a   -5
c    3
dtype: int64
obj2.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
obj2
Out[50]:
Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

作为Dictionary

可以将 Series 看作一个定长有序字典。

'b' in obj2
Out[51]:
False
#  通过 Dictionary 直接创建 Series
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = Series(sdata)
obj3
Out[52]:
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64
# 重建索引,并按索引筛选数据
# 注意,索引 `California`  对应的值为 NaN
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = Series(sdata, index=states)
obj4
Out[53]:
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
 

数据检查

pd.isnull(obj4) # 等价于 obj4.isnull()
Out[54]:
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
pd.notnull(obj4) #  等价于 obj4.notnull()
Out[55]:
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

自动对齐索引

Series很重要的功能就是按照索引自动对齐。

obj3
Out[56]:
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64
obj4
Out[57]:
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
obj3 + obj4
Out[58]:
California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

命名

Series 自身和索引都可以命名。这个设计对于 Pandas 其他模块很重要。

obj4.name = 'population'
obj4.index.name = 'state'
obj4
Out[59]:
state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64
obj4.index
Out[60]:
Index(['California', 'Ohio', 'Oregon', 'Texas'], dtype='object', name='state')

DataFrame

DataFrame 是一种表格型结构,含有一组有序的列,每一列可以是不同的数据类型。

DataFrame 可以看做是共用索引的、由Series组成的字典。其中,共用的索引是整个 DataFrame 的行索引,而各个 Series 的名字作为列索引。

DataFrame面向行和列的操作基本是平衡的。其实,DataFrame中的数据是以一个或者多个二维块存放的(不是列表、字典或者其他)。

创建 DataFrame

#  使用字典创建
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
frame
Out[92]:
pop state year
0 1.5 Ohio 2000
1 1.7 Ohio 2001
2 3.6 Ohio 2002
3 2.4 Nevada 2001
4 2.9 Nevada 2002
# 更改列顺序
DataFrame(data, columns=['year', 'state', 'pop'])
Out[66]:
year state pop
0 2000 Ohio 1.5
1 2001 Ohio 1.7
2 2002 Ohio 3.6
3 2001 Nevada 2.4
4 2002 Nevada 2.9
# 与行索引的机制类似,找不到的列,会用 NaN 填充
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                   index=['one', 'two', 'three', 'four', 'five'])
frame2
Out[68]:
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 NaN
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 NaN
five 2002 Nevada 2.9 NaN

得到列

# 列是关于 Series 的字典
frame2.columns
Out[70]:
Index(['year', 'state', 'pop', 'debt'], dtype='object')
# 获取某一列
frame2['state']
Out[72]:
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object
# 也可以用属性的方式获取 DataFrame 中的某一列
frame2.year
Out[74]:
one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64
#  获取多个列,还是 DataFrame, 此时为“切片”操作
frame2[['state','year']]
Out[78]:
state year
one Ohio 2000
two Ohio 2001
three Ohio 2002
four Nevada 2001
five Nevada 2002

得到行

通过索引可以得到 DataFrame 的一行,行也是一个 Series.

frame2.ix['three']
Out[79]:
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

按列赋值

# 整列赋相同值
frame2['debt'] = 16.5
frame2
Out[81]:
year state pop debt
one 2000 Ohio 1.5 16.5
two 2001 Ohio 1.7 16.5
three 2002 Ohio 3.6 16.5
four 2001 Nevada 2.4 16.5
five 2002 Nevada 2.9 16.5
# 用一个 array 或 narray 对列赋值
frame2['debt'] = np.arange(5.)
frame2
Out[84]:
year state pop debt
one 2000 Ohio 1.5 0.0
two 2001 Ohio 1.7 1.0
three 2002 Ohio 3.6 2.0
four 2001 Nevada 2.4 3.0
five 2002 Nevada 2.9 4.0
# 使用 Series,可以根据索引对列精确赋值
# 注意其中的空值
val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
frame2
Out[86]:
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 -1.2
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 -1.5
five 2002 Nevada 2.9 -1.7

增加和删除列

# 对不存在的列赋值会增加列
frame2['eastern'] = frame2.state == 'Ohio'
frame2
Out[90]:
year state pop debt eastern
one 2000 Ohio 1.5 NaN True
two 2001 Ohio 1.7 -1.2 True
three 2002 Ohio 3.6 NaN True
four 2001 Nevada 2.4 -1.5 False
five 2002 Nevada 2.9 -1.7 False
# 使用 `del` 删除列
del frame2['eastern']
frame2
Out[91]:
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 -1.2
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 -1.5
five 2002 Nevada 2.9 -1.7

转置

frame2.T
Out[93]:
one two three four five
year 2000 2001 2002 2001 2002
state Ohio Ohio Ohio Nevada Nevada
pop 1.5 1.7 3.6 2.4 2.9
debt NaN -1.2 NaN -1.5 -1.7

更多创建方法

#  使用二维字典创建
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = DataFrame(pop)
frame3
Out[95]:
Nevada Ohio
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6
DataFrame(pop, index=[2001, 2002, 2003])
Out[96]:
Nevada Ohio
2001 2.4 1.7
2002 2.9 3.6
2003 NaN NaN
pdata = {'Ohio': frame3['Ohio'][:-1],
         'Nevada': frame3['Nevada'][:2]}
DataFrame(pdata)
Out[97]:
Nevada Ohio
2000 NaN 1.5
2001 2.4 1.7

为索引和列命名

frame3.index.name = 'year'; frame3.columns.name = 'state'
frame3
Out[98]:
state Nevada Ohio
year
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6

DataFrame 的值

# 返回 ndarray
frame3.values
Out[101]:
array([[ nan,  1.5],
       [ 2.4,  1.7],
       [ 2.9,  3.6]])
# 类型不同时,自动转换为最抽象的类型
frame2.values
Out[102]:
array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7]], dtype=object)

索引对象

obj = Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index
Out[103]:
Index(['a', 'b', 'c'], dtype='object')
index[1:]
Out[104]:
Index(['b', 'c'], dtype='object')
# Index 里面的数据不能更改
# index[1] = 'd'
index = pd.Index(np.arange(3))
obj2 = Series([1.5, -2.5, 0], index=index)
obj2.index is index
Out[113]:
True
frame3
Out[114]:
state Nevada Ohio
year
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6
'Ohio' in frame3.columns
Out[115]:
True
2003 in frame3.index
Out[116]:
False

Index 不是数组,也不是 Series(看起来像)。Index 支持的一些方法如下:

  • append 连接另一个 Index
  • diff 计算差集
  • intersection 计算交集
  • union 计算并集
  • isin 是否全包含
  • delete 删除 i 初的元素,返回一个新的 Index
  • drop 删除传入的值,返回一个新的 Index
  • insert 插入到 i 处
  • is_monotonic 是否各元素都大于等于前一个元素(顺序排列)
  • is_unique 是否没有重复值
  • unique 返回唯一值的数组
 

基本功能

%pylab inline
from pandas import Series, DataFrame
import pandas as pd
Populating the interactive namespace from numpy and matplotlib

利用Python进行数据分析》 作者: Wes McKinney 译者: 唐学韬

重建索引

Pandas 对象的重建索引(reindex()),会根据索引创建一个新的对象。

obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj
Out[2]:
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
# 如果传入的索引值在数据里不存在,则不会报错,而是添加缺失值的新行
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2
Out[5]:
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64
#如果不想用缺失值,可以用 fill_value 参数指定填充值
obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)
Out[6]:
a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64
obj3 = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
# method 指定填充新值采用的差值方法
# ffill 或 pad 实现用前值填充(forward fill)
obj3.reindex(range(6), method='ffill')
Out[13]:
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
# bfill 或 backfill 实现用后值填充
obj3.reindex(range(6), method='bfill')
Out[16]:
0      blue
1    purple
2    purple
3    yellow
4    yellow
5       NaN
dtype: object
# #DataFrame 的reindex可以修改行、列或者两个都改

frame = DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'],
                  columns=['Ohio', 'Texas', 'California'])
frame
Out[17]:
Ohio Texas California
a 0 1 2
c 3 4 5
d 6 7 8
# 默认更改行索引
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2
Out[19]:
Ohio Texas California
a 0.0 1.0 2.0
b NaN NaN NaN
c 3.0 4.0 5.0
d 6.0 7.0 8.0
# 使用 columns 更改列索引
states = ['Texas', 'Utah', 'California']
frame.reindex(columns=states)
Out[20]:
Texas Utah California
a 1 NaN 2
c 4 NaN 5
d 7 NaN 8
# 同时更改行和列索引
# 注意插值只能用在行上
frame.reindex(index=['a', 'b', 'c', 'd'], method='ffill',
              columns=states)
Out[21]:
Texas Utah California
a 1 NaN 2
b 1 NaN 2
c 4 NaN 5
d 7 NaN 8
# 使用ix的标签索引功能,重新索引变得比较简洁
# 等价于 frame.reindex(index=['a', 'b', 'c', 'd'], columns=states)
frame.ix[['a', 'b', 'c', 'd'], states]
Out[22]:
Texas Utah California
a 1.0 NaN 2.0
b NaN NaN NaN
c 4.0 NaN 5.0
d 7.0 NaN 8.0

丢弃指定轴上的项

#  drop() 方法:通过索引数组或列表丢弃多个项,返回一个新的对象
obj = Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
new_obj = obj.drop('c')
new_obj
Out[23]:
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
# 丢弃多项
obj.drop(['d', 'c'])
Out[24]:
a    0.0
b    1.0
e    4.0
dtype: float64
# 对于 DataFrame
data = DataFrame(np.arange(16).reshape((4, 4)),
                 index=['Ohio', 'Colorado', 'Utah', 'New York'],
                 columns=['one', 'two', 'three', 'four'])
data
Out[25]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
#  丢弃行
data.drop(['Colorado', 'Ohio'])
Out[27]:
one two three four
Utah 8 9 10 11
New York 12 13 14 15
# 丢弃一列
data.drop('two', axis=1)
Out[28]:
one three four
Ohio 0 2 3
Colorado 4 6 7
Utah 8 10 11
New York 12 14 15
# 丢弃多列
data.drop(['two', 'four'], axis=1)
Out[29]:
one three
Ohio 0 2
Colorado 4 6
Utah 8 10
New York 12 14

索引、选取和过滤

obj = Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj
Out[30]:
a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64
#  通过索引选取, 以下两种方式等价
print(obj['b'])
print(obj[1])
1.0
1.0
#  连续切片
obj[2:4]
Out[39]:
c    2.0
d    3.0
dtype: float64
# 通过索引选取
obj[['b', 'a', 'd']]
Out[40]:
b    1.0
a    0.0
d    3.0
dtype: float64
# 通过序号选取
obj[[1, 3]]
Out[41]:
b    1.0
d    3.0
dtype: float64
# 条件筛选
obj[obj < 2]
Out[42]:
a    0.0
b    1.0
dtype: float64
# 针对索引的条件筛选
obj['b':'d']
Out[46]:
b    1.0
c    2.0
d    3.0
dtype: float64
# 筛选并赋值
obj['b':'d'] = 5
obj
Out[49]:
a    0.0
b    5.0
c    5.0
d    5.0
dtype: float64
# 对于 DataFrame
data = DataFrame(np.arange(16).reshape((4, 4)),
                 index=['Ohio', 'Colorado', 'Utah', 'New York'],
                 columns=['one', 'two', 'three', 'four'])
data
Out[51]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
# 选取列
data['two']
Out[52]:
Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64
# 选取多列
data[['three', 'one']]
Out[53]:
three one
Ohio 2 0
Colorado 6 4
Utah 10 8
New York 14 12
# 
data[:2]
Out[55]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
data[data['three'] > 5]
Out[56]:
one two three four
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
data < 5
Out[57]:
one two three four
Ohio True True True True
Colorado True False False False
Utah False False False False
New York False False False False
data[data < 5] = 0
data
Out[59]:
one two three four
Ohio 0 0 0 0
Colorado 0 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
# 选择一行中的一部分
data.ix['Colorado', ['two', 'three']]
Out[63]:
two      5
three    6
Name: Colorado, dtype: int64
# 选择多行中的一部分
data.ix[['Colorado', 'Utah'], [3, 0, 1]]
Out[64]:
four one two
Colorado 7 0 5
Utah 11 8 9
# 一整行
data.ix[2]
Out[66]:
one       8
two       9
three    10
four     11
Name: Utah, dtype: int64
data.ix[:'Utah', 'two']
Out[67]:
Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int64
data.ix[data.three > 5, :3]
Out[68]:
one two three
Colorado 0 5 6
Utah 8 9 10
New York 12 13 14

算术运算和数据对齐

Pandas 在对 Series 和 DataFrame 进行算术运算时,能够根据索引自动对齐,其中索引不重合的部分值为NaN。

+,-,*,/ 对应的函数分别为 add(), sub(), mul(), div()

s1 = Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
s1
Out[70]:
a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64
s2
Out[71]:
a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64
s1 + s2
Out[72]:
a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64
df1 = DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                index=['Ohio', 'Texas', 'Colorado'])
df2 = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df1
Out[73]:
b c d
Ohio 0.0 1.0 2.0
Texas 3.0 4.0 5.0
Colorado 6.0 7.0 8.0
df2
Out[74]:
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
df1 + df2
Out[75]:
b c d e
Colorado NaN NaN NaN NaN
Ohio 3.0 NaN 6.0 NaN
Oregon NaN NaN NaN NaN
Texas 9.0 NaN 12.0 NaN
Utah NaN NaN NaN NaN

在算术方法中填充值

df1 = DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))
df1
Out[76]:
a b c d
0 0.0 1.0 2.0 3.0
1 4.0 5.0 6.0 7.0
2 8.0 9.0 10.0 11.0
df2
Out[77]:
a b c d e
0 0.0 1.0 2.0 3.0 4.0
1 5.0 6.0 7.0 8.0 9.0
2 10.0 11.0 12.0 13.0 14.0
3 15.0 16.0 17.0 18.0 19.0
df1 + df2
Out[78]:
a b c d e
0 0.0 2.0 4.0 6.0 NaN
1 9.0 11.0 13.0 15.0 NaN
2 18.0 20.0 22.0 24.0 NaN
3 NaN NaN NaN NaN NaN
df1.add(df2, fill_value=0)
Out[79]:
a b c d e
0 0.0 2.0 4.0 6.0 4.0
1 9.0 11.0 13.0 15.0 9.0
2 18.0 20.0 22.0 24.0 14.0
3 15.0 16.0 17.0 18.0 19.0
df1.reindex(columns=df2.columns, fill_value=0)
Out[80]:
a b c d e
0 0.0 1.0 2.0 3.0 0
1 4.0 5.0 6.0 7.0 0
2 8.0 9.0 10.0 11.0 0

DataFrame 和 Series 之间的运算

arr = np.arange(12.).reshape((3, 4))
arr
Out[81]:
array([[  0.,   1.,   2.,   3.],
       [  4.,   5.,   6.,   7.],
       [  8.,   9.,  10.,  11.]])
arr[0]
Out[82]:
array([ 0.,  1.,  2.,  3.])
arr - arr[0]
Out[83]:
array([[ 0.,  0.,  0.,  0.],
       [ 4.,  4.,  4.,  4.],
       [ 8.,  8.,  8.,  8.]])
frame = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                  index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.ix[0]
frame
Out[84]:
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
series
Out[85]:
b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64
# 将Series的索引匹配到DataFrame的列,然后进行计算,再沿着行一直向下广播
frame - series
Out[91]:
b d e
Utah 0.0 0.0 0.0
Ohio 3.0 3.0 3.0
Texas 6.0 6.0 6.0
Oregon 9.0 9.0 9.0
series2 = Series(range(3), index=['b', 'e', 'f'])
series2
Out[92]:
b    0
e    1
f    2
dtype: int64
#  非共有的列,自动填充
frame + series2
Out[95]:
b d e f
Utah 0.0 NaN 3.0 NaN
Ohio 3.0 NaN 6.0 NaN
Texas 6.0 NaN 9.0 NaN
Oregon 9.0 NaN 12.0 NaN
series3 = frame['d']
frame
Out[88]:
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
series3
Out[89]:
Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64
# 使用算术运算方法,可以指定轴,从而实现匹配行且在列上广播
frame.sub(series3, axis=0)
Out[96]:
b d e
Utah -1.0 0.0 1.0
Ohio -1.0 0.0 1.0
Texas -1.0 0.0 1.0
Oregon -1.0 0.0 1.0

函数应用和映射

frame = DataFrame(np.random.randn(4, 3), columns=list('bde'),
                  index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame
Out[98]:
b d e
Utah 0.155111 -0.301627 0.034083
Ohio -0.003312 0.807804 -1.046114
Texas 1.983635 -0.667859 1.657091
Oregon 1.633512 -2.041568 0.820017
# 求绝对值函数
np.abs(frame)
Out[100]:
b d e
Utah 0.155111 0.301627 0.034083
Ohio 0.003312 0.807804 1.046114
Texas 1.983635 0.667859 1.657091
Oregon 1.633512 2.041568 0.820017
#另一种常见的做法是:将一个函数应用到行或者列上,用apply方法
# 默认应用到行,广播到列
f = lambda x: x.max() - x.min()
frame.apply(f)
Out[103]:
b    1.986947
d    2.849372
e    2.703205
dtype: float64
# 通过指定 axis,可以应用到列,广播到行
frame.apply(f, axis=1)
Out[104]:
Utah      0.456738
Ohio      1.853918
Texas     2.651494
Oregon    3.675080
dtype: float64
# 传递的函数还可以返回 Series
def f(x):
    return Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)
Out[105]:
b d e
min -0.003312 -2.041568 -1.046114
max 1.983635 0.807804 1.657091
# applymap:  应用到元素(而不是行和列)
format = lambda x: '%.2f' % x
frame.applymap(format)
Out[109]:
b d e
Utah 0.16 -0.30 0.03
Ohio -0.00 0.81 -1.05
Texas 1.98 -0.67 1.66
Oregon 1.63 -2.04 0.82
# 因为 frame 调用了 Series 的 map()函数
frame['e'].map(format)
Out[111]:
Utah       0.03
Ohio      -1.05
Texas      1.66
Oregon     0.82
Name: e, dtype: object

排序和排名

主要的函数包括:

  • sort_index(): 按索引排序
  • sort_values(): 按值排序
  • rank(): 排名
#用sort_index函数对行、列的索引进行排序
obj = Series(range(4), index=['d', 'a', 'b', 'c'])
obj.sort_index()
Out[113]:
a    1
b    2
c    3
d    0
dtype: int64
frame = DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'],
                  columns=['d', 'a', 'b', 'c'])
frame.sort_index()
Out[114]:
d a b c
one 4 5 6 7
three 0 1 2 3
frame.sort_index(axis=1)
Out[115]:
a b c d
three 1 2 3 0
one 5 6 7 4
frame.sort_index(axis=1, ascending=False)
Out[116]:
d c b a
three 0 3 2 1
one 4 7 6 5
#  对值排序,使用 sort_values()
obj = Series([4, 7, -3, 2])
obj.sort_values()
Out[118]:
2   -3
3    2
0    4
1    7
dtype: int64
obj = Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()
Out[119]:
4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64
frame = DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
frame
Out[120]:
a b
0 0 4
1 1 7
2 0 -3
3 1 2
frame.sort_values(by='b')
Out[122]:
a b
2 0 -3
3 1 2
0 0 4
1 1 7
frame.sort_values(by=['a', 'b'])
Out[123]:
a b
2 0 -3
0 0 4
3 1 2
1 1 7
#rank函数返回从小到大排序的下标,对于平级的数,rank是通过“为各组分配一个平均排名”的方式破坏平级关系
#下标从1开始
obj = Series([7, -5, 7, 4, 2, 0, 4])
obj.rank()
Out[128]:
0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64
obj.rank(method='first')
Out[125]:
0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64
obj.rank(ascending=False, method='max')
Out[129]:
0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64
frame = DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
                   'c': [-2, 5, 8, -2.5]})
frame
Out[130]:
a b c
0 0 4.3 -2.0
1 1 7.0 5.0
2 0 -3.0 8.0
3 1 2.0 -2.5
frame.rank(axis=1)
Out[131]:
a b c
0 2.0 3.0 1.0
1 1.0 3.0 2.0
2 2.0 1.0 3.0
3 2.0 3.0 1.0

带有重复值的轴索引

obj = Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj
Out[132]:
a    0
a    1
b    2
b    3
c    4
dtype: int64
#用is_unique看是否唯一
obj.index.is_unique
Out[133]:
False
# #对于重复值的索引,选取的话返回一个Series
obj['a']
Out[135]:
a    0
a    1
dtype: int64
# 唯一的索引返回一个标量
obj['c']
Out[137]:
4
df = DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])
df
Out[138]:
0 1 2
a -0.146504 -0.442410 -1.044600
a -1.243650 0.281303 0.485324
b 1.040675 0.479036 -0.753287
b -1.571259 0.194030 0.517437
df.ix['b']
Out[139]:
0 1 2
b 1.040675 0.479036 -0.753287
b -1.571259 0.194030 0.517437
 

汇总和描述性统计

%pylab inline
from pandas import Series, DataFrame
import pandas as pd
Populating the interactive namespace from numpy and matplotlib

利用Python进行数据分析》 作者: Wes McKinney 译者: 唐学韬

汇总和描述性统计

Pandas 实现了一些简单简单的汇总和描述性统计功能,更复杂的统计计算可以使用 Scipystats模块。

基本统计

包括:

  • count: 计数
  • describe: 各列的汇总统计
  • min, max: 最小值和最大值
  • argmin, argmax: 最小值和最大值的索引位置
  • idxmin, idxmax: 最小值和最大值 的索引值
  • quantile: 样本的分位数(0--1)
  • sum: 求和
  • mean: 平均数
  • median: 中位数(0.5分位数)
  • mad: 平均绝对离差
  • var: 方差
  • std: 标准差
  • skew: 偏度(三阶矩)
  • kurt: 峰度(四阶矩)
  • cumsum: 累计和
  • cummin, cummax: 累计最小、最大值
  • cumprod: 累计积
  • diff: 一阶差分(用于时间序列)
  • pct_change: 百分数变化
df = DataFrame([[1.4, np.nan], [7.1, -4.5],
                [np.nan, np.nan], [0.75, -1.3]],
               index=['a', 'b', 'c', 'd'],
               columns=['one', 'two'])
df
Out[2]:
one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3
# 求和
df.sum()
Out[3]:
one    9.25
two   -5.80
dtype: float64
#  按行求和
df.sum(axis=1)
Out[5]:
a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64
# 不忽略 NA
df.mean(axis=1, skipna=False)
Out[6]:
a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64
# 返回索引
df.idxmax()
Out[9]:
one    b
two    d
dtype: object
# 累加型
df.cumsum()
Out[10]:
one two
a 1.40 NaN
b 8.50 -4.5
c NaN NaN
d 9.25 -5.8
# 描述
df.describe()
Out[11]:
one two
count 3.000000 2.000000
mean 3.083333 -2.900000
std 3.493685 2.262742
min 0.750000 -4.500000
25% 1.075000 -3.700000
50% 1.400000 -2.900000
75% 4.250000 -2.100000
max 7.100000 -1.300000
# 非数值型的描述
obj = Series(['a', 'a', 'b', 'c'] * 4)
obj.describe()
Out[12]:
count     16
unique     3
top        a
freq       8
dtype: object

相关系数(Correlation)与协方差(covariance)

from pandas_datareader import data, wb

all_data = {}
for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']:
    all_data[ticker] = data.get_data_yahoo(ticker)

price = DataFrame({tic: data['Adj Close']
                   for tic, data in all_data.items()})
volume = DataFrame({tic: data['Volume']
                    for tic, data in all_data.items()})
price
Out[27]:
AAPL GOOG IBM MSFT
Date
2010-01-04 27.727039 313.062468 111.405000 25.555485
2010-01-05 27.774976 311.683844 110.059232 25.563741
2010-01-06 27.333178 303.826685 109.344283 25.406859
2010-01-07 27.282650 296.753749 108.965786 25.142634
2010-01-08 27.464034 300.709808 110.059232 25.316031
2010-01-11 27.221758 300.255255 108.906903 24.994007
2010-01-12 26.912110 294.945572 109.773245 24.828866
2010-01-13 27.291720 293.252243 109.537735 25.060064
2010-01-14 27.133657 294.630868 111.287245 25.563741
2010-01-15 26.680198 289.710772 110.841458 25.481172
2010-01-19 27.860485 293.516976 112.826478 25.679340
2010-01-20 27.431644 289.915587 109.554561 25.258232
2010-01-21 26.957455 291.199286 108.503173 24.779325
2010-01-22 25.620401 274.730736 105.559289 23.912336
2010-01-25 26.309658 269.730740 106.080779 24.209590
2010-01-26 26.681494 270.939526 105.769566 24.358216
2010-01-27 26.932840 270.779695 106.257412 24.498586
2010-01-28 25.819922 266.878565 104.087347 24.077477
2010-01-29 24.883208 264.705742 102.943437 23.268290
2010-02-01 25.229131 266.244198 104.861166 23.458201
2010-02-02 25.375533 265.295156 105.584521 23.499485
2010-02-03 25.812149 270.140309 105.693870 23.639855
2010-02-04 24.881912 263.127321 103.456514 22.987551
2010-02-05 25.323710 265.380075 103.893889 23.136177
2010-02-08 25.150100 266.468996 102.972982 22.888466
2010-02-09 25.418289 267.952522 104.096663 23.127920
2010-02-10 25.279660 266.958496 103.758713 23.111406
2010-02-11 25.739595 267.932539 104.536000 23.218748
2010-02-12 25.961142 266.294170 104.764113 23.061864
2010-02-16 26.352412 270.380071 105.803308 23.518124
... ... ... ... ...
2017-01-12 118.735214 806.359985 166.632452 62.232715
2017-01-13 118.526121 807.880005 166.027237 62.322172
2017-01-17 119.481976 804.609985 166.572925 62.153195
2017-01-18 119.472017 806.070007 165.491479 62.123377
2017-01-19 119.262925 802.174988 165.501396 61.924581
2017-01-20 119.481976 805.020020 169.212061 62.361932
2017-01-23 119.561633 819.309998 169.688291 62.580604
2017-01-24 119.452107 823.869995 174.520082 63.137231
2017-01-25 121.353858 835.669983 176.891332 63.296267
2017-01-26 121.413604 832.150024 177.258440 63.882708
2017-01-27 121.423555 823.309998 175.909108 65.383610
2017-01-30 121.104937 802.320007 174.420876 64.737526
2017-01-31 120.826147 796.789978 173.150918 64.260423
2017-02-01 128.194203 795.695007 172.922712 63.196871
2017-02-02 127.975152 798.530029 173.210445 62.789338
2017-02-03 128.522781 801.489990 174.440723 63.296267
2017-02-06 129.727549 801.340027 174.480403 63.256507
2017-02-07 130.962201 806.969971 177.060012 63.047773
2017-02-08 131.469994 808.380005 176.169998 62.958315
2017-02-09 132.419998 809.559998 177.210007 63.673974
2017-02-10 132.119995 813.669983 178.679993 63.614338
2017-02-13 133.289993 819.239990 179.360001 64.330000
2017-02-14 135.020004 820.450012 180.130005 64.570000
2017-02-15 135.509995 818.979980 181.679993 64.529999
2017-02-16 135.350006 824.159973 181.429993 64.519997
2017-02-17 135.720001 828.070007 180.669998 64.620003
2017-02-21 136.699997 831.659973 180.259995 64.489998
2017-02-22 137.110001 830.760010 181.149994 64.360001
2017-02-23 136.529999 831.330017 181.649994 64.620003
2017-02-24 136.660004 828.640015 181.350006 64.620003

1799 rows × 4 columns

returns = price.pct_change()
returns.tail()
Out[28]:
AAPL GOOG IBM MSFT
Date
2017-02-17 0.002734 0.004744 -0.004189 0.001550
2017-02-21 0.007221 0.004335 -0.002269 -0.002012
2017-02-22 0.002999 -0.001082 0.004937 -0.002016
2017-02-23 -0.004230 0.000686 0.002760 0.004040
2017-02-24 0.000952 -0.003236 -0.001651 0.000000
# 相关系数
print(returns.MSFT.corr(returns.IBM))
# 协方差
print(returns.MSFT.cov(returns.IBM))
0.495166900181
8.58872503835e-05
#  相关系数矩阵
returns.corr()
Out[40]:
AAPL GOOG IBM MSFT
AAPL 1.000000 0.409523 0.381495 0.388913
GOOG 0.409523 1.000000 0.402880 0.470810
IBM 0.381495 0.402880 1.000000 0.495167
MSFT 0.388913 0.470810 0.495167 1.000000
# 协方差矩阵
returns.cov()
Out[41]:
AAPL GOOG IBM MSFT
AAPL 0.000269 0.000105 0.000075 0.000092
GOOG 0.000105 0.000244 0.000075 0.000106
IBM 0.000075 0.000075 0.000144 0.000086
MSFT 0.000092 0.000106 0.000086 0.000209
# 一组相关系数
returns.corrwith(returns.IBM)
Out[44]:
AAPL    0.381495
GOOG    0.402880
IBM     1.000000
MSFT    0.495167
dtype: float64
# 一组相关系数( 回报百分比与成交量之间)
returns.corrwith(volume)
Out[45]:
AAPL   -0.074047
GOOG   -0.009542
IBM    -0.194405
MSFT   -0.091078
dtype: float64

唯一值,值计数和成员资格

obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
uniques = obj.unique()
uniques
Out[47]:
array(['c', 'a', 'd', 'b'], dtype=object)
obj.value_counts()
Out[48]:
a    3
c    3
b    2
d    1
dtype: int64
pd.value_counts(obj.values, sort=False)
Out[49]:
b    2
d    1
c    3
a    3
dtype: int64
mask = obj.isin(['b', 'c'])
mask
Out[50]:
0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool
# 通过 `isin()`的结果进行筛选
obj[mask]
Out[54]:
0    c
5    b
6    b
7    c
8    c
dtype: object
data = DataFrame({'Qu1': [1, 3, 4, 3, 4],
                  'Qu2': [2, 3, 1, 2, 3],
                  'Qu3': [1, 5, 2, 4, 4]})
data
Out[52]:
Qu1 Qu2 Qu3
0 1 2 1
1 3 3 5
2 4 1 2
3 3 2 4
4 4 3 4
result = data.apply(pd.value_counts).fillna(0)
result
Out[55]:
Qu1 Qu2 Qu3
1 1.0 1.0 1.0
2 0.0 2.0 1.0
3 2.0 2.0 0.0
4 2.0 0.0 2.0
5 0.0 0.0 1.0
 

处理缺失数据

%pylab inline
from pandas import Series, DataFrame
import pandas as pd
Populating the interactive namespace from numpy and matplotlib

利用Python进行数据分析》 作者: Wes McKinney 译者: 唐学韬

处理缺失数据

数据不完整在数据分析的过程中很常见。

pandas使用浮点值NaN表示浮点和非浮点数组里的缺失数据。

pandas使用isnull()notnull()函数来判断缺失情况。

对于缺失数据一般处理方法为:

  • 滤掉, 使用 dropna()
  • 填充,使用 fillna,比如 ffill(指定值)或 bfill(插值
string_data = Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data
Out[2]:
0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object
string_data.isnull()
Out[3]:
0    False
1    False
2     True
3    False
dtype: bool
# None 也被认为是缺失
string_data[0] = None
string_data.isnull()
Out[5]:
0     True
1    False
2     True
3    False
dtype: bool

滤除缺失数据

from numpy import nan as NA
data = Series([1, NA, 3.5, NA, 7])
data
Out[7]:
0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64
# 使用 dropna()方法滤除
data.dropna()
Out[8]:
0    1.0
2    3.5
4    7.0
dtype: float64
# 等价方法
data[data.notnull()]
Out[10]:
0    1.0
2    3.5
4    7.0
dtype: float64
# 对于 DataFrame, 会 drop 掉所有有 NA 的行
data = DataFrame([[1., 6.5, 3.], [1., NA, NA],
                  [NA, NA, NA], [NA, 6.5, 3.]])
cleaned = data.dropna()
data
Out[14]:
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
cleaned
Out[13]:
0 1 2
0 1.0 6.5 3.0
#  只丢弃全为空的行
data.dropna(how='all')
Out[15]:
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
3 NaN 6.5 3.0
data[4] = NA
data
Out[16]:
0 1 2 4
0 1.0 6.5 3.0 NaN
1 1.0 NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN 6.5 3.0 NaN
# 丢弃列
data.dropna(axis=1, how='all')
Out[17]:
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
df = DataFrame(np.random.randn(7, 3))
df.ix[:4, 1] = NA; df.ix[:2, 2] = NA
df
Out[18]:
0 1 2
0 -1.890923 NaN NaN
1 1.020294 NaN NaN
2 -3.132870 NaN NaN
3 0.541084 NaN -0.203774
4 2.200062 NaN 1.775465
5 -0.458244 0.208688 -0.027016
6 -1.291057 0.391487 -1.341282
# thresh :选取至少包含多少个非空值的行
df.dropna(thresh=2)
Out[21]:
0 1 2
3 0.541084 NaN -0.203774
4 2.200062 NaN 1.775465
5 -0.458244 0.208688 -0.027016
6 -1.291057 0.391487 -1.341282

填充缺失数据

# 用指定值填充
df.fillna(0)
Out[24]:
0 1 2
0 -1.890923 0.000000 0.000000
1 1.020294 0.000000 0.000000
2 -3.132870 0.000000 0.000000
3 0.541084 0.000000 -0.203774
4 2.200062 0.000000 1.775465
5 -0.458244 0.208688 -0.027016
6 -1.291057 0.391487 -1.341282
# 每列用不同的值填充
df.fillna({1: 0.5, 3: -1})
Out[26]:
0 1 2
0 -1.890923 0.500000 NaN
1 1.020294 0.500000 NaN
2 -3.132870 0.500000 NaN
3 0.541084 0.500000 -0.203774
4 2.200062 0.500000 1.775465
5 -0.458244 0.208688 -0.027016
6 -1.291057 0.391487 -1.341282
# inplace: 不返回新对象,就地修改
_ = df.fillna(0, inplace=True)
df
Out[27]:
0 1 2
0 -1.890923 0.000000 0.000000
1 1.020294 0.000000 0.000000
2 -3.132870 0.000000 0.000000
3 0.541084 0.000000 -0.203774
4 2.200062 0.000000 1.775465
5 -0.458244 0.208688 -0.027016
6 -1.291057 0.391487 -1.341282
df = DataFrame(np.random.randn(6, 3))
df.ix[2:, 1] = NA; df.ix[4:, 2] = NA
df
Out[28]:
0 1 2
0 -0.201570 -0.301825 -0.292899
1 0.407622 0.154548 0.337993
2 -1.390216 NaN 1.952913
3 -0.155158 NaN 0.732646
4 -1.391272 NaN NaN
5 -0.155741 NaN NaN
# 用前值填充
df.fillna(method='ffill')
Out[30]:
0 1 2
0 -0.201570 -0.301825 -0.292899
1 0.407622 0.154548 0.337993
2 -1.390216 0.154548 1.952913
3 -0.155158 0.154548 0.732646
4 -1.391272 0.154548 0.732646
5 -0.155741 0.154548 0.732646
#  用前值,最多填充2次
df.fillna(method='ffill', limit=2)
Out[32]:
0 1 2
0 -0.201570 -0.301825 -0.292899
1 0.407622 0.154548 0.337993
2 -1.390216 0.154548 1.952913
3 -0.155158 0.154548 0.732646
4 -1.391272 NaN 0.732646
5 -0.155741 NaN 0.732646
# 用平均值填充
data = Series([1., NA, 3.5, NA, 7])
data.fillna(data.mean())
Out[34]:
0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64

fillna()参数汇总:

  • value
  • method, 插值方法,默认为 ffill
  • axis,
  • inplace
  • limit
 

层次化索引

%pylab inline
from pandas import Series, DataFrame
import pandas as pd
Populating the interactive namespace from numpy and matplotlib

利用Python进行数据分析》 作者: Wes McKinney 译者: 唐学韬

层次化索引

层次化索引(hierarchical index)是pandas的重要功能,

通过在一个轴上拥有两个以上的索引级别, 使得 Pandas 可以用低维度形式处理高维度。

层次化索引在数据重塑和基于分组操作(如透视表生成)中扮演者重要的角色。

根据索引选择数据子集

data = Series(np.random.randn(10),
              index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                     [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])
data
Out[2]:
a  1   -0.291810
   2   -2.489879
   3   -0.993844
b  1    0.791313
   2   -1.233354
   3    1.096154
c  1    1.627453
   2    0.637391
d  2   -0.726163
   3   -2.117156
dtype: float64
data.index
Out[3]:
MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])
# 以外层索引的方式选择
data['b']
Out[10]:
1    0.791313
2   -1.233354
3    1.096154
dtype: float64
data['b':'c']
Out[5]:
b  1    0.791313
   2   -1.233354
   3    1.096154
c  1    1.627453
   2    0.637391
dtype: float64
data.ix[['b', 'd']]
Out[6]:
b  1    0.791313
   2   -1.233354
   3    1.096154
d  2   -0.726163
   3   -2.117156
dtype: float64
# 以内层索引的方式选择
data[:, 2]
Out[11]:
a   -2.489879
b   -1.233354
c    0.637391
d   -0.726163
dtype: float64

stack()unstack()

# 层次化索引数据可以转换为一个DataFrame
data.unstack()
Out[13]:
1 2 3
a -0.291810 -2.489879 -0.993844
b 0.791313 -1.233354 1.096154
c 1.627453 0.637391 NaN
d NaN -0.726163 -2.117156
#  反向
data.unstack().stack()
Out[14]:
a  1   -0.291810
   2   -2.489879
   3   -0.993844
b  1    0.791313
   2   -1.233354
   3    1.096154
c  1    1.627453
   2    0.637391
d  2   -0.726163
   3   -2.117156
dtype: float64

DataFrame 的层次化索引

# 对于DataFrame,横轴和竖轴都可以有层次化索引

frame = DataFrame(np.arange(12).reshape((4, 3)),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=[['Ohio', 'Ohio', 'Colorado'],
                           ['Green', 'Red', 'Green']])
frame
Out[19]:
Ohio Colorado
Green Red Green
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
#  每层索引都可以命名
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']
frame
Out[20]:
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
frame['Ohio']
Out[18]:
color Green Red
key1 key2
a 1 0 1
2 3 4
b 1 6 7
2 9 10
# 可以单独创建MultiIndex然后复用
columns = pd.MultiIndex.from_arrays([['Ohio','Ohio','Colorado'],['Green','Red','Green']],names = ['state','color'])
frame1 = DataFrame(np.arange(12).reshape((4,3)),columns = columns)
frame1
Out[22]:
state Ohio Colorado
color Green Red Green
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11

重排分级顺序

frame
Out[26]:
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
frame.swaplevel('key1', 'key2')
Out[23]:
state Ohio Colorado
color Green Red Green
key2 key1
1 a 0 1 2
2 a 3 4 5
1 b 6 7 8
2 b 9 10 11
frame.sortlevel(1)
Out[24]:
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0 1 2
b 1 6 7 8
a 2 3 4 5
b 2 9 10 11
frame.swaplevel(0, 1).sortlevel(0)
Out[25]:
state Ohio Colorado
color Green Red Green
key2 key1
1 a 0 1 2
b 6 7 8
2 a 3 4 5
b 9 10 11

根据级别汇总统计

frame.sum(level='key2')
Out[27]:
state Ohio Colorado
color Green Red Green
key2
1 6 8 10
2 12 14 16
frame.sum(level='color', axis=1)
Out[28]:
color Green Red
key1 key2
a 1 2 1
2 8 4
b 1 14 7
2 20 10

使用 DataFrame 列

frame = DataFrame({'a': range(7), 'b': range(7, 0, -1),
                   'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                   'd': [0, 1, 2, 0, 1, 2, 3]})
frame
Out[29]:
a b c d
0 0 7 one 0
1 1 6 one 1
2 2 5 one 2
3 3 4 two 0
4 4 3 two 1
5 5 2 two 2
6 6 1 two 3
# set_index(): 将列转化为索引
frame2 = frame.set_index(['c', 'd'])
frame2
Out[31]:
a b
c d
one 0 0 7
1 1 6
2 2 5
two 0 3 4
1 4 3
2 5 2
3 6 1
# 转化为索引,同时保留列
frame.set_index(['c', 'd'], drop=False)
Out[33]:
a b c d
c d
one 0 0 7 one 0
1 1 6 one 1
2 2 5 one 2
two 0 3 4 two 0
1 4 3 two 1
2 5 2 two 2
3 6 1 two 3
# reset_index(): 将索引转化为列
frame2.reset_index()
Out[35]:
c d a b
0 one 0 0 7
1 one 1 1 6
2 one 2 2 5
3 two 0 3 4
4 two 1 4 3
5 two 2 5 2
6 two 3 6 1
 

其他

%pylab inline
from pandas import Series, DataFrame
import pandas as pd
Populating the interactive namespace from numpy and matplotlib

利用Python进行数据分析》 作者: Wes McKinney 译者: 唐学韬

其他关于 Pandas的话题

整数索引

ser = Series(np.arange(3.))
ser
Out[5]:
0    0.0
1    1.0
2    2.0
dtype: float64
ser.iloc[-1]
Out[6]:
2.0
ser2 = Series(np.arange(3.), index=['a', 'b', 'c'])
ser2[-1]
Out[7]:
2.0
#ix函数总是面向标签的
ser.ix[:1]
Out[8]:
0    0.0
1    1.0
dtype: float64
ser3 = Series(range(3), index=[-5, 1, 3])
ser3.iloc[2]
#如果需要可靠的、不考虑索引类型的、基于位置的索引,可以使用Series的iget_value方法,Dataframe的irow 和 icol方法
frame = DataFrame(np.arange(6).reshape((3, 2)), index=[2, 0, 1])
frame.iloc[0]
Out[9]:
0    0
1    1
Name: 2, dtype: int64

面板数据

可以看做是三维的 DataFrame, 但是通常用多层索引的 DataFrame,而不是使用 Panel。

#可以利用DataFrame对象组成的字典或者一个三维ndarray来创建Panel对象
from pandas_datareader import data, wb

pdata = pd.Panel(dict((stk, data.get_data_yahoo(stk))
                       for stk in ['AAPL', 'GOOG', 'MSFT', 'DELL']))
pdata
Out[12]:
<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 1821 (major_axis) x 6 (minor_axis)
Items axis: AAPL to MSFT
Major_axis axis: 2010-01-04 00:00:00 to 2017-02-28 00:00:00
Minor_axis axis: Open to Adj Close
pdata = pdata.swapaxes('items', 'minor')
pdata['Adj Close']
Out[13]:
AAPL DELL GOOG MSFT
Date
2010-01-04 27.727039 14.06528 313.062468 25.555485
2010-01-05 27.774976 14.38450 311.683844 25.563741
2010-01-06 27.333178 14.10397 303.826685 25.406859
2010-01-07 27.282650 14.23940 296.753749 25.142634
2010-01-08 27.464034 14.36516 300.709808 25.316031
2010-01-11 27.221758 14.37483 300.255255 24.994007
2010-01-12 26.912110 14.56830 294.945572 24.828866
2010-01-13 27.291720 14.57797 293.252243 25.060064
2010-01-14 27.133657 14.22005 294.630868 25.563741
2010-01-15 26.680198 13.92985 289.710772 25.481172
2010-01-19 27.860485 14.32646 293.516976 25.679340
2010-01-20 27.431644 14.03626 289.915587 25.258232
2010-01-21 26.957455 13.92017 291.199286 24.779325
2010-01-22 25.620401 13.18982 274.730736 23.912336
2010-01-25 26.309658 13.43650 269.730740 24.209590
2010-01-26 26.681494 13.13662 270.939526 24.358216
2010-01-27 26.932840 13.08825 270.779695 24.498586
2010-01-28 25.819922 12.84641 266.878565 24.077477
2010-01-29 24.883208 12.47882 264.705742 23.268290
2010-02-01 25.229131 12.78837 266.244198 23.458201
2010-02-02 25.375533 12.86576 265.295156 23.499485
2010-02-03 25.812149 12.92380 270.140309 23.639855
2010-02-04 24.881912 12.58523 263.127321 22.987551
2010-02-05 25.323710 12.80772 265.380075 23.136177
2010-02-08 25.150100 12.95282 266.468996 22.888466
2010-02-09 25.418289 13.10760 267.952522 23.127920
2010-02-10 25.279660 13.30107 266.958496 23.111406
2010-02-11 25.739595 13.49454 267.932539 23.218748
2010-02-12 25.961142 13.38813 266.294170 23.061864
2010-02-16 26.352412 13.67834 270.380071 23.518124
... ... ... ... ...
2017-01-17 119.481976 NaN 804.609985 62.153195
2017-01-18 119.472017 NaN 806.070007 62.123377
2017-01-19 119.262925 NaN 802.174988 61.924581
2017-01-20 119.481976 NaN 805.020020 62.361932
2017-01-23 119.561633 NaN 819.309998 62.580604
2017-01-24 119.452107 NaN 823.869995 63.137231
2017-01-25 121.353858 NaN 835.669983 63.296267
2017-01-26 121.413604 NaN 832.150024 63.882708
2017-01-27 121.423555 NaN 823.309998 65.383610
2017-01-30 121.104937 NaN 802.320007 64.737526
2017-01-31 120.826147 NaN 796.789978 64.260423
2017-02-01 128.194203 NaN 795.695007 63.196871
2017-02-02 127.975152 NaN 798.530029 62.789338
2017-02-03 128.522781 NaN 801.489990 63.296267
2017-02-06 129.727549 NaN 801.340027 63.256507
2017-02-07 130.962201 NaN 806.969971 63.047773
2017-02-08 131.469994 NaN 808.380005 62.958315
2017-02-09 132.419998 NaN 809.559998 63.673974
2017-02-10 132.119995 NaN 813.669983 63.614338
2017-02-13 133.289993 NaN 819.239990 64.330000
2017-02-14 135.020004 NaN 820.450012 64.570000
2017-02-15 135.509995 NaN 818.979980 64.529999
2017-02-16 135.350006 NaN 824.159973 64.519997
2017-02-17 135.720001 NaN 828.070007 64.620003
2017-02-21 136.699997 NaN 831.659973 64.489998
2017-02-22 137.110001 NaN 830.760010 64.360001
2017-02-23 136.529999 NaN 831.330017 64.620003
2017-02-24 136.660004 NaN 828.640015 64.620003
2017-02-27 136.929993 NaN 829.280029 64.230003
2017-02-28 136.990005 NaN 823.210022 63.980000

1821 rows × 4 columns

pdata.ix[:, '6/1/2012', :]
Out[14]:
Open High Low Close Volume Adj Close
AAPL 569.159996 572.650009 560.520012 560.989983 130246900.0 72.681610
DELL 12.150000 12.300000 12.045000 12.070000 19397600.0 11.675920
GOOG 571.790972 572.650996 568.350996 570.981000 6138700.0 285.205295
MSFT 28.760000 28.959999 28.440001 28.450001 56634300.0 24.942239
pdata.ix['Adj Close', '5/22/2012':, :]
Out[15]:
AAPL DELL GOOG MSFT
Date
2012-05-22 72.160786 14.58765 300.100412 26.090721
2012-05-23 73.921494 12.08221 304.426106 25.520864
2012-05-24 73.242607 12.04351 301.528978 25.485795
2012-05-25 72.850038 12.05319 295.470050 25.477028
2012-05-28 NaN 12.05319 NaN NaN
2012-05-29 74.143041 12.24666 296.873645 25.915380
2012-05-30 75.037005 12.14992 293.821674 25.722505
2012-05-31 74.850442 11.92743 290.140354 25.591000
2012-06-01 72.681610 11.67592 285.205295 24.942239
2012-06-04 73.109156 11.60821 289.006480 25.029908
2012-06-05 72.920005 11.76298 284.920579 24.994841
2012-06-06 74.038104 11.81619 289.995487 25.731273
2012-06-07 74.071787 11.73396 288.826666 25.626067
2012-06-08 75.185997 11.72429 289.935570 25.994283
2012-06-11 74.000526 11.47278 283.966519 25.336755
2012-06-12 74.647030 11.57919 282.268201 25.678671
2012-06-13 74.128794 11.87423 280.265216 25.538397
2012-06-14 74.047168 11.93711 279.246220 25.722505
2012-06-15 74.384024 11.89841 281.973510 26.318665
2012-06-18 75.893391 12.01449 285.140358 26.160857
2012-06-19 76.104579 11.78233 290.475012 26.914824
2012-06-20 75.888208 11.89841 288.467038 27.116465
2012-06-21 74.842665 11.60821 282.323161 26.423868
2012-06-22 75.416614 11.80168 285.455033 26.914824
2012-06-25 73.948707 11.55500 280.070407 26.187159
2012-06-26 74.111953 11.53565 282.058429 26.318665
2012-06-27 74.431960 11.92743 284.366112 26.450170
2012-06-28 73.725860 11.55984 281.873596 26.222227
2012-06-29 75.662780 12.10155 289.745749 26.818386
2012-07-02 76.766625 11.98064 289.945546 26.792084
... ... ... ... ...
2017-01-17 119.481976 NaN 804.609985 62.153195
2017-01-18 119.472017 NaN 806.070007 62.123377
2017-01-19 119.262925 NaN 802.174988 61.924581
2017-01-20 119.481976 NaN 805.020020 62.361932
2017-01-23 119.561633 NaN 819.309998 62.580604
2017-01-24 119.452107 NaN 823.869995 63.137231
2017-01-25 121.353858 NaN 835.669983 63.296267
2017-01-26 121.413604 NaN 832.150024 63.882708
2017-01-27 121.423555 NaN 823.309998 65.383610
2017-01-30 121.104937 NaN 802.320007 64.737526
2017-01-31 120.826147 NaN 796.789978 64.260423
2017-02-01 128.194203 NaN 795.695007 63.196871
2017-02-02 127.975152 NaN 798.530029 62.789338
2017-02-03 128.522781 NaN 801.489990 63.296267
2017-02-06 129.727549 NaN 801.340027 63.256507
2017-02-07 130.962201 NaN 806.969971 63.047773
2017-02-08 131.469994 NaN 808.380005 62.958315
2017-02-09 132.419998 NaN 809.559998 63.673974
2017-02-10 132.119995 NaN 813.669983 63.614338
2017-02-13 133.289993 NaN 819.239990 64.330000
2017-02-14 135.020004 NaN 820.450012 64.570000
2017-02-15 135.509995 NaN 818.979980 64.529999
2017-02-16 135.350006 NaN 824.159973 64.519997
2017-02-17 135.720001 NaN 828.070007 64.620003
2017-02-21 136.699997 NaN 831.659973 64.489998
2017-02-22 137.110001 NaN 830.760010 64.360001
2017-02-23 136.529999 NaN 831.330017 64.620003
2017-02-24 136.660004 NaN 828.640015 64.620003
2017-02-27 136.929993 NaN 829.280029 64.230003
2017-02-28 136.990005 NaN 823.210022 63.980000

1214 rows × 4 columns

stacked = pdata.ix[:, '5/30/2012':, :].to_frame()
stacked
Out[16]:
Open High Low Close Volume Adj Close
Date minor
2012-05-30 AAPL 569.199997 579.989990 566.559990 579.169998 132357400.0 75.037005
DELL 12.590000 12.700000 12.460000 12.560000 19787800.0 12.149920
GOOG 588.161028 591.901014 583.530999 588.230992 3827600.0 293.821674
MSFT 29.350000 29.480000 29.120001 29.340000 41585500.0 25.722505
2012-05-31 AAPL 580.740021 581.499985 571.460022 577.730019 122918600.0 74.850442
DELL 12.530000 12.540000 12.330000 12.330000 19955600.0 11.927430
GOOG 588.720982 590.001032 579.001013 580.860990 5958800.0 290.140354
MSFT 29.299999 29.420000 28.940001 29.190001 39134000.0 25.591000
2012-06-01 AAPL 569.159996 572.650009 560.520012 560.989983 130246900.0 72.681610
DELL 12.150000 12.300000 12.045000 12.070000 19397600.0 11.675920
GOOG 571.790972 572.650996 568.350996 570.981000 6138700.0 285.205295
MSFT 28.760000 28.959999 28.440001 28.450001 56634300.0 24.942239
2012-06-04 AAPL 561.500008 567.499985 548.499977 564.289978 139248900.0 73.109156
DELL 12.110000 12.112500 11.800000 12.000000 17015700.0 11.608210
GOOG 570.220958 580.491016 570.011006 578.590973 4883500.0 289.006480
MSFT 28.620001 28.780001 28.320000 28.549999 47926300.0 25.029908
2012-06-05 AAPL 561.269989 566.470001 558.330002 562.830025 97053600.0 72.920005
DELL 11.950000 12.240000 11.950000 12.160000 15620900.0 11.762980
GOOG 575.451008 578.131003 566.470986 570.410999 4697200.0 284.920579
MSFT 28.510000 28.750000 28.389999 28.510000 45715400.0 24.994841
2012-06-06 AAPL 567.770004 573.849983 565.499992 571.460022 100363900.0 74.038104
DELL 12.210000 12.280000 12.090000 12.215000 20779900.0 11.816190
GOOG 576.480979 581.970971 573.611004 580.570966 4207200.0 289.995487
MSFT 28.879999 29.370001 28.809999 29.350000 46860500.0 25.731273
2012-06-07 AAPL 577.290009 577.320023 570.500000 571.720001 94941700.0 74.071787
DELL 12.320000 12.410000 12.120000 12.130000 20074000.0 11.733960
GOOG 587.601014 587.891038 577.251006 578.230986 3530100.0 288.826666
MSFT 29.639999 29.700001 29.170000 29.230000 37792800.0 25.626067
2012-06-08 AAPL 571.599998 580.580017 568.999992 580.319984 86879100.0 75.185997
DELL 12.130000 12.225000 12.020000 12.120000 18155600.0 11.724290
... ... ... ... ... ... ... ...
2017-02-14 AAPL 133.470001 135.089996 133.250000 135.020004 32815500.0 135.020004
GOOG 819.000000 823.000000 816.000000 820.450012 1053600.0 820.450012
MSFT 64.410004 64.720001 64.019997 64.570000 23065900.0 64.570000
2017-02-15 AAPL 135.520004 136.270004 134.619995 135.509995 35501600.0 135.509995
GOOG 819.359985 823.000000 818.469971 818.979980 1304000.0 818.979980
MSFT 64.500000 64.570000 64.160004 64.529999 16917000.0 64.529999
2017-02-16 AAPL 135.669998 135.899994 134.839996 135.350006 22118000.0 135.350006
GOOG 819.929993 824.400024 818.979980 824.159973 1281700.0 824.159973
MSFT 64.739998 65.239998 64.440002 64.519997 20524700.0 64.519997
2017-02-17 AAPL 135.100006 135.830002 135.100006 135.720001 22084500.0 135.720001
GOOG 823.020020 828.070007 821.655029 828.070007 1597800.0 828.070007
MSFT 64.470001 64.690002 64.300003 64.620003 21234600.0 64.620003
2017-02-21 AAPL 136.229996 136.750000 135.979996 136.699997 24265100.0 136.699997
GOOG 828.659973 833.450012 828.349976 831.659973 1247700.0 831.659973
MSFT 64.610001 64.949997 64.449997 64.489998 19384900.0 64.489998
2017-02-22 AAPL 136.429993 137.119995 136.110001 137.110001 20745300.0 137.110001
GOOG 828.659973 833.250000 828.640015 830.760010 982900.0 830.760010
MSFT 64.330002 64.389999 64.050003 64.360001 19259700.0 64.360001
2017-02-23 AAPL 137.380005 137.479996 136.300003 136.529999 20704100.0 136.529999
GOOG 830.119995 832.460022 822.880005 831.330017 1470100.0 831.330017
MSFT 64.419998 64.730003 64.190002 64.620003 20235200.0 64.620003
2017-02-24 AAPL 135.910004 136.660004 135.279999 136.660004 21690900.0 136.660004
GOOG 827.729980 829.000000 824.200012 828.640015 1386600.0 828.640015
MSFT 64.529999 64.800003 64.139999 64.620003 21705200.0 64.620003
2017-02-27 AAPL 137.139999 137.440002 136.279999 136.929993 20196400.0 136.929993
GOOG 824.549988 830.500000 824.000000 829.280029 1099500.0 829.280029
MSFT 64.540001 64.540001 64.050003 64.230003 15850400.0 64.230003
2017-02-28 AAPL 137.080002 137.440002 136.699997 136.990005 23403500.0 136.990005
GOOG 825.609985 828.539978 820.200012 823.210022 2252300.0 823.210022
MSFT 64.080002 64.199997 63.759998 63.980000 23043900.0 63.980000

3955 rows × 6 columns

stacked.to_panel()
Out[17]:
<class 'pandas.core.panel.Panel'>
Dimensions: 6 (items) x 1208 (major_axis) x 4 (minor_axis)
Items axis: Open to Adj Close
Major_axis axis: 2012-05-30 00:00:00 to 2017-02-28 00:00:00
Minor_axis axis: AAPL to MSFT