%pylab inline
from pandas import Series, DataFrame
import pandas as pd

Populating the interactive namespace from numpy and matplotlib

《利用Python进行数据分析》作者: Wes McKinney 译者: 唐学韬

Pandas中两个主要的数据结构是 Series 和 DataFrame，两个数据结构中都使用了索引(Index)对象。

Series¶

Series是一组带索引的数组。包含一组数据（各种NumPy数据类型）和与之相关的一组数据标签（即索引）。

可以用index和values分别规定索引和值。如果不规定索引，会自动创建 0 到 N-1 索引。

最简单的 Series¶

#  只包含一组数据的 Series
obj = Series([4, 7, -5, 3])
obj

0    4
1    7
2   -5
3    3
dtype: int64

obj.index

RangeIndex(start=0, stop=4, step=1)

obj.values

array([ 4,  7, -5,  3])

指定索引¶

obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2

d    4
b    7
a   -5
c    3
dtype: int64

通过索引获取数据¶

#  取单行数据
obj2['d']

4

# 取多行数据
obj2[['c', 'a', 'd']]

c    3
a   -5
d    4
dtype: int64

按条件获取数据¶

obj2[obj2 > 0]

d    4
b    7
c    3
dtype: int64

数组运算¶

对数组进行计算，只改变值，不会改变索引

obj2 * 2

d     8
b    14
a   -10
c     6
dtype: int64

np.exp(obj2)

d      54.598150
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

改变索引¶

obj2

d    4
b    7
a   -5
c    3
dtype: int64

obj2.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
obj2

Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

作为Dictionary¶

可以将 Series 看作一个定长有序字典。

'b' in obj2

False

#  通过 Dictionary 直接创建 Series
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = Series(sdata)
obj3

Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

# 重建索引，并按索引筛选数据
# 注意，索引 `California`  对应的值为 NaN
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = Series(sdata, index=states)
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

数据检查¶

pd.isnull(obj4) # 等价于 obj4.isnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

pd.notnull(obj4) #  等价于 obj4.notnull()

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

自动对齐索引¶

Series很重要的功能就是按照索引自动对齐。

obj3

Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

obj3 + obj4

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

命名¶

Series 自身和索引都可以命名。这个设计对于 Pandas 其他模块很重要。

obj4.name = 'population'
obj4.index.name = 'state'
obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

obj4.index

Index(['California', 'Ohio', 'Oregon', 'Texas'], dtype='object', name='state')

DataFrame¶

DataFrame 是一种表格型结构，含有一组有序的列，每一列可以是不同的数据类型。

DataFrame 可以看做是共用索引的、由Series组成的字典。其中，共用的索引是整个 DataFrame 的行索引，而各个 Series 的名字作为列索引。

DataFrame面向行和列的操作基本是平衡的。其实，DataFrame中的数据是以一个或者多个二维块存放的（不是列表、字典或者其他）。

创建 DataFrame¶

#  使用字典创建
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
frame

# 更改列顺序
DataFrame(data, columns=['year', 'state', 'pop'])

# 与行索引的机制类似，找不到的列，会用 NaN 填充
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                   index=['one', 'two', 'three', 'four', 'five'])
frame2

得到列¶

# 列是关于 Series 的字典
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

# 获取某一列
frame2['state']

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

# 也可以用属性的方式获取 DataFrame 中的某一列
frame2.year

one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64

#  获取多个列，还是 DataFrame, 此时为“切片”操作
frame2[['state','year']]

得到行¶

通过索引可以得到 DataFrame 的一行，行也是一个 Series.

frame2.ix['three']

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

按列赋值¶

# 整列赋相同值
frame2['debt'] = 16.5
frame2

# 用一个 array 或 narray 对列赋值
frame2['debt'] = np.arange(5.)
frame2

# 使用 Series，可以根据索引对列精确赋值
# 注意其中的空值
val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
frame2

增加和删除列¶

# 对不存在的列赋值会增加列
frame2['eastern'] = frame2.state == 'Ohio'
frame2

# 使用 `del` 删除列
del frame2['eastern']
frame2

转置¶

frame2.T

为索引和列命名¶

frame3.index.name = 'year'; frame3.columns.name = 'state'
frame3

DataFrame 的值¶

# 返回 ndarray
frame3.values

array([[ nan,  1.5],
       [ 2.4,  1.7],
       [ 2.9,  3.6]])

# 类型不同时，自动转换为最抽象的类型
frame2.values

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7]], dtype=object)

索引对象¶

obj = Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index

Index(['a', 'b', 'c'], dtype='object')

index[1:]

Index(['b', 'c'], dtype='object')

# Index 里面的数据不能更改
# index[1] = 'd'

index = pd.Index(np.arange(3))
obj2 = Series([1.5, -2.5, 0], index=index)
obj2.index is index

True

frame3

'Ohio' in frame3.columns

True

2003 in frame3.index

False

Index 不是数组，也不是 Series（看起来像）。Index 支持的一些方法如下：

append 连接另一个 Index
diff 计算差集
intersection 计算交集
union 计算并集
isin 是否全包含
delete 删除 i 初的元素，返回一个新的 Index
drop 删除传入的值，返回一个新的 Index
insert 插入到 i 处
is_monotonic 是否各元素都大于等于前一个元素（顺序排列）
is_unique 是否没有重复值
unique 返回唯一值的数组

%pylab inline
from pandas import Series, DataFrame
import pandas as pd

Populating the interactive namespace from numpy and matplotlib

《利用Python进行数据分析》作者: Wes McKinney 译者: 唐学韬

重建索引¶

Pandas 对象的重建索引（reindex())，会根据索引创建一个新的对象。

obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

# 如果传入的索引值在数据里不存在，则不会报错，而是添加缺失值的新行
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

#如果不想用缺失值，可以用 fill_value 参数指定填充值
obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)

a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64

obj3 = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
# method 指定填充新值采用的差值方法
# ffill 或 pad 实现用前值填充(forward fill)
obj3.reindex(range(6), method='ffill')

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

# bfill 或 backfill 实现用后值填充
obj3.reindex(range(6), method='bfill')

0      blue
1    purple
2    purple
3    yellow
4    yellow
5       NaN
dtype: object

# #DataFrame 的reindex可以修改行、列或者两个都改

frame = DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'],
                  columns=['Ohio', 'Texas', 'California'])
frame

# 默认更改行索引
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2

# 使用 columns 更改列索引
states = ['Texas', 'Utah', 'California']
frame.reindex(columns=states)

# 同时更改行和列索引
# 注意插值只能用在行上
frame.reindex(index=['a', 'b', 'c', 'd'], method='ffill',
              columns=states)

# 使用ix的标签索引功能，重新索引变得比较简洁
# 等价于 frame.reindex(index=['a', 'b', 'c', 'd'], columns=states)
frame.ix[['a', 'b', 'c', 'd'], states]

丢弃指定轴上的项¶

#  drop() 方法：通过索引数组或列表丢弃多个项，返回一个新的对象
obj = Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
new_obj = obj.drop('c')
new_obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

# 丢弃多项
obj.drop(['d', 'c'])

a    0.0
b    1.0
e    4.0
dtype: float64

# 对于 DataFrame
data = DataFrame(np.arange(16).reshape((4, 4)),
                 index=['Ohio', 'Colorado', 'Utah', 'New York'],
                 columns=['one', 'two', 'three', 'four'])
data

#  丢弃行
data.drop(['Colorado', 'Ohio'])

# 丢弃一列
data.drop('two', axis=1)

# 丢弃多列
data.drop(['two', 'four'], axis=1)

索引、选取和过滤¶

obj = Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

#  通过索引选取, 以下两种方式等价
print(obj['b'])
print(obj[1])

1.0
1.0

#  连续切片
obj[2:4]

c    2.0
d    3.0
dtype: float64

# 通过索引选取
obj[['b', 'a', 'd']]

b    1.0
a    0.0
d    3.0
dtype: float64

# 通过序号选取
obj[[1, 3]]

b    1.0
d    3.0
dtype: float64

# 条件筛选
obj[obj < 2]

a    0.0
b    1.0
dtype: float64

# 针对索引的条件筛选
obj['b':'d']

b    1.0
c    2.0
d    3.0
dtype: float64

# 筛选并赋值
obj['b':'d'] = 5
obj

a    0.0
b    5.0
c    5.0
d    5.0
dtype: float64

# 对于 DataFrame
data = DataFrame(np.arange(16).reshape((4, 4)),
                 index=['Ohio', 'Colorado', 'Utah', 'New York'],
                 columns=['one', 'two', 'three', 'four'])
data

# 选取列
data['two']

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64

# 选取多列
data[['three', 'one']]

# 
data[:2]

data[data['three'] > 5]

data < 5

data[data < 5] = 0
data

# 选择一行中的一部分
data.ix['Colorado', ['two', 'three']]

two      5
three    6
Name: Colorado, dtype: int64

# 选择多行中的一部分
data.ix[['Colorado', 'Utah'], [3, 0, 1]]

# 一整行
data.ix[2]

one       8
two       9
three    10
four     11
Name: Utah, dtype: int64

data.ix[:'Utah', 'two']

Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int64

data.ix[data.three > 5, :3]

算术运算和数据对齐¶

Pandas 在对 Series 和 DataFrame 进行算术运算时，能够根据索引自动对齐,其中索引不重合的部分值为NaN。

+，-，*，/ 对应的函数分别为 add(), sub(), mul(), div()。

s1 = Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

s1

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

s2

a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

s1 + s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

df1 = DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                index=['Ohio', 'Texas', 'Colorado'])
df2 = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df1

df2

df1 + df2

在算术方法中填充值¶

df1 = DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))
df1

df2

df1 + df2

df1.add(df2, fill_value=0)

df1.reindex(columns=df2.columns, fill_value=0)

DataFrame 和 Series 之间的运算¶

arr = np.arange(12.).reshape((3, 4))
arr

array([[  0.,   1.,   2.,   3.],
       [  4.,   5.,   6.,   7.],
       [  8.,   9.,  10.,  11.]])

arr[0]

array([ 0.,  1.,  2.,  3.])

arr - arr[0]

array([[ 0.,  0.,  0.,  0.],
       [ 4.,  4.,  4.,  4.],
       [ 8.,  8.,  8.,  8.]])

frame = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                  index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.ix[0]
frame

series

b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

# 将Series的索引匹配到DataFrame的列，然后进行计算，再沿着行一直向下广播
frame - series

series2 = Series(range(3), index=['b', 'e', 'f'])
series2

b    0
e    1
f    2
dtype: int64

#  非共有的列，自动填充
frame + series2

series3 = frame['d']
frame

series3

Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64

# 使用算术运算方法，可以指定轴，从而实现匹配行且在列上广播
frame.sub(series3, axis=0)

函数应用和映射¶

frame = DataFrame(np.random.randn(4, 3), columns=list('bde'),
                  index=['Utah', 'Ohio', 'Texas', 'Oregon'])

frame

# 求绝对值函数
np.abs(frame)

#另一种常见的做法是：将一个函数应用到行或者列上,用apply方法
# 默认应用到行，广播到列
f = lambda x: x.max() - x.min()
frame.apply(f)

b    1.986947
d    2.849372
e    2.703205
dtype: float64

# 通过指定 axis，可以应用到列，广播到行
frame.apply(f, axis=1)

Utah      0.456738
Ohio      1.853918
Texas     2.651494
Oregon    3.675080
dtype: float64

# 传递的函数还可以返回 Series
def f(x):
    return Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)

# applymap:  应用到元素（而不是行和列）
format = lambda x: '%.2f' % x
frame.applymap(format)

# 因为 frame 调用了 Series 的 map()函数
frame['e'].map(format)

Utah       0.03
Ohio      -1.05
Texas      1.66
Oregon     0.82
Name: e, dtype: object

排序和排名¶

主要的函数包括：

sort_index(): 按索引排序
sort_values(): 按值排序
rank(): 排名

#用sort_index函数对行、列的索引进行排序
obj = Series(range(4), index=['d', 'a', 'b', 'c'])
obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int64

frame = DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'],
                  columns=['d', 'a', 'b', 'c'])
frame.sort_index()

frame.sort_index(axis=1)

frame.sort_index(axis=1, ascending=False)

#  对值排序，使用 sort_values()
obj = Series([4, 7, -3, 2])
obj.sort_values()

2   -3
3    2
0    4
1    7
dtype: int64

obj = Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()

4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64

frame = DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
frame

frame.sort_values(by='b')

frame.sort_values(by=['a', 'b'])

#rank函数返回从小到大排序的下标，对于平级的数，rank是通过“为各组分配一个平均排名”的方式破坏平级关系
#下标从1开始
obj = Series([7, -5, 7, 4, 2, 0, 4])
obj.rank()

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

obj.rank(method='first')

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

obj.rank(ascending=False, method='max')

0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64

frame = DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
                   'c': [-2, 5, 8, -2.5]})
frame

frame.rank(axis=1)

带有重复值的轴索引¶

obj = Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj

a    0
a    1
b    2
b    3
c    4
dtype: int64

#用is_unique看是否唯一
obj.index.is_unique

False

# #对于重复值的索引，选取的话返回一个Series
obj['a']

a    0
a    1
dtype: int64

# 唯一的索引返回一个标量
obj['c']

4

df = DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])
df

df.ix['b']

%pylab inline
from pandas import Series, DataFrame
import pandas as pd

Populating the interactive namespace from numpy and matplotlib

《利用Python进行数据分析》作者: Wes McKinney 译者: 唐学韬

汇总和描述性统计

Pandas 实现了一些简单简单的汇总和描述性统计功能，更复杂的统计计算可以使用 Scipy 的 stats模块。

基本统计¶

包括：

count: 计数
describe: 各列的汇总统计
min, max: 最小值和最大值
argmin, argmax: 最小值和最大值的索引位置
idxmin, idxmax: 最小值和最大值的索引值
quantile: 样本的分位数(0--1)
sum: 求和
mean: 平均数
median: 中位数（0.5分位数）
mad: 平均绝对离差
var: 方差
std: 标准差
skew: 偏度（三阶矩）
kurt: 峰度（四阶矩）
cumsum: 累计和
cummin, cummax: 累计最小、最大值
cumprod: 累计积
diff: 一阶差分（用于时间序列）
pct_change: 百分数变化

df = DataFrame([[1.4, np.nan], [7.1, -4.5],
                [np.nan, np.nan], [0.75, -1.3]],
               index=['a', 'b', 'c', 'd'],
               columns=['one', 'two'])
df

# 求和
df.sum()

one    9.25
two   -5.80
dtype: float64

#  按行求和
df.sum(axis=1)

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

# 不忽略 NA
df.mean(axis=1, skipna=False)

a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64

# 返回索引
df.idxmax()

one    b
two    d
dtype: object

# 累加型
df.cumsum()

# 描述
df.describe()

# 非数值型的描述
obj = Series(['a', 'a', 'b', 'c'] * 4)
obj.describe()

count     16
unique     3
top        a
freq       8
dtype: object

唯一值，值计数和成员资格¶

obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

uniques = obj.unique()
uniques

array(['c', 'a', 'd', 'b'], dtype=object)

obj.value_counts()

a    3
c    3
b    2
d    1
dtype: int64

pd.value_counts(obj.values, sort=False)

b    2
d    1
c    3
a    3
dtype: int64

mask = obj.isin(['b', 'c'])
mask

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

# 通过 `isin()`的结果进行筛选
obj[mask]

0    c
5    b
6    b
7    c
8    c
dtype: object

data = DataFrame({'Qu1': [1, 3, 4, 3, 4],
                  'Qu2': [2, 3, 1, 2, 3],
                  'Qu3': [1, 5, 2, 4, 4]})
data

result = data.apply(pd.value_counts).fillna(0)
result

%pylab inline
from pandas import Series, DataFrame
import pandas as pd

Populating the interactive namespace from numpy and matplotlib

《利用Python进行数据分析》作者: Wes McKinney 译者: 唐学韬

处理缺失数据

数据不完整在数据分析的过程中很常见。

pandas使用浮点值NaN表示浮点和非浮点数组里的缺失数据。

pandas使用isnull()和notnull()函数来判断缺失情况。

对于缺失数据一般处理方法为:

滤掉, 使用 dropna()
填充，使用 fillna，比如 ffill(指定值）或 bfill（插值

string_data = Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

# None 也被认为是缺失
string_data[0] = None
string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool

滤除缺失数据¶

from numpy import nan as NA
data = Series([1, NA, 3.5, NA, 7])
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

# 使用 dropna()方法滤除
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

# 等价方法
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

# 对于 DataFrame, 会 drop 掉所有有 NA 的行
data = DataFrame([[1., 6.5, 3.], [1., NA, NA],
                  [NA, NA, NA], [NA, 6.5, 3.]])
cleaned = data.dropna()
data

cleaned

#  只丢弃全为空的行
data.dropna(how='all')

data[4] = NA
data

# 丢弃列
data.dropna(axis=1, how='all')

df = DataFrame(np.random.randn(7, 3))
df.ix[:4, 1] = NA; df.ix[:2, 2] = NA
df

# thresh ：选取至少包含多少个非空值的行
df.dropna(thresh=2)

填充缺失数据¶

# 用指定值填充
df.fillna(0)

# 每列用不同的值填充
df.fillna({1: 0.5, 3: -1})

# inplace: 不返回新对象，就地修改
_ = df.fillna(0, inplace=True)
df

df = DataFrame(np.random.randn(6, 3))
df.ix[2:, 1] = NA; df.ix[4:, 2] = NA
df

# 用前值填充
df.fillna(method='ffill')

#  用前值，最多填充2次
df.fillna(method='ffill', limit=2)

# 用平均值填充
data = Series([1., NA, 3.5, NA, 7])
data.fillna(data.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64

fillna()参数汇总：

value
method, 插值方法，默认为 ffill
axis,
inplace
limit

%pylab inline
from pandas import Series, DataFrame
import pandas as pd

Populating the interactive namespace from numpy and matplotlib

《利用Python进行数据分析》作者: Wes McKinney 译者: 唐学韬

层次化索引

层次化索引（hierarchical index）是pandas的重要功能，

通过在一个轴上拥有两个以上的索引级别，使得 Pandas 可以用低维度形式处理高维度。

层次化索引在数据重塑和基于分组操作（如透视表生成）中扮演者重要的角色。

根据索引选择数据子集¶

data = Series(np.random.randn(10),
              index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                     [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])
data

a  1   -0.291810
   2   -2.489879
   3   -0.993844
b  1    0.791313
   2   -1.233354
   3    1.096154
c  1    1.627453
   2    0.637391
d  2   -0.726163
   3   -2.117156
dtype: float64

data.index

MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])

# 以外层索引的方式选择
data['b']

1    0.791313
2   -1.233354
3    1.096154
dtype: float64

data['b':'c']

b  1    0.791313
   2   -1.233354
   3    1.096154
c  1    1.627453
   2    0.637391
dtype: float64

data.ix[['b', 'd']]

b  1    0.791313
   2   -1.233354
   3    1.096154
d  2   -0.726163
   3   -2.117156
dtype: float64

# 以内层索引的方式选择
data[:, 2]

a   -2.489879
b   -1.233354
c    0.637391
d   -0.726163
dtype: float64

`stack()` 和 `unstack()`¶

# 层次化索引数据可以转换为一个DataFrame
data.unstack()

#  反向
data.unstack().stack()

a  1   -0.291810
   2   -2.489879
   3   -0.993844
b  1    0.791313
   2   -1.233354
   3    1.096154
c  1    1.627453
   2    0.637391
d  2   -0.726163
   3   -2.117156
dtype: float64

DataFrame 的层次化索引¶

# 对于DataFrame，横轴和竖轴都可以有层次化索引

frame = DataFrame(np.arange(12).reshape((4, 3)),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=[['Ohio', 'Ohio', 'Colorado'],
                           ['Green', 'Red', 'Green']])
frame

#  每层索引都可以命名
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']
frame

frame['Ohio']

# 可以单独创建MultiIndex然后复用
columns = pd.MultiIndex.from_arrays([['Ohio','Ohio','Colorado'],['Green','Red','Green']],names = ['state','color'])
frame1 = DataFrame(np.arange(12).reshape((4,3)),columns = columns)
frame1

重排分级顺序¶

frame

frame.swaplevel('key1', 'key2')

frame.sortlevel(1)

frame.swaplevel(0, 1).sortlevel(0)

根据级别汇总统计¶

frame.sum(level='key2')

frame.sum(level='color', axis=1)

使用 DataFrame 列¶

frame = DataFrame({'a': range(7), 'b': range(7, 0, -1),
                   'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                   'd': [0, 1, 2, 0, 1, 2, 3]})
frame

# set_index(): 将列转化为索引
frame2 = frame.set_index(['c', 'd'])
frame2

# 转化为索引，同时保留列
frame.set_index(['c', 'd'], drop=False)

# reset_index(): 将索引转化为列
frame2.reset_index()

%pylab inline
from pandas import Series, DataFrame
import pandas as pd

Populating the interactive namespace from numpy and matplotlib

《利用Python进行数据分析》作者: Wes McKinney 译者: 唐学韬

其他关于 Pandas的话题

整数索引¶

ser = Series(np.arange(3.))
ser

0    0.0
1    1.0
2    2.0
dtype: float64

ser.iloc[-1]

2.0

ser2 = Series(np.arange(3.), index=['a', 'b', 'c'])
ser2[-1]

2.0

#ix函数总是面向标签的
ser.ix[:1]

0    0.0
1    1.0
dtype: float64

ser3 = Series(range(3), index=[-5, 1, 3])
ser3.iloc[2]

#如果需要可靠的、不考虑索引类型的、基于位置的索引，可以使用Series的iget_value方法，Dataframe的irow 和 icol方法
frame = DataFrame(np.arange(6).reshape((3, 2)), index=[2, 0, 1])
frame.iloc[0]

0    0
1    1
Name: 2, dtype: int64

面板数据¶

可以看做是三维的 DataFrame，但是通常用多层索引的 DataFrame，而不是使用 Panel。

#可以利用DataFrame对象组成的字典或者一个三维ndarray来创建Panel对象
from pandas_datareader import data, wb

pdata = pd.Panel(dict((stk, data.get_data_yahoo(stk))
                       for stk in ['AAPL', 'GOOG', 'MSFT', 'DELL']))

pdata

<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 1821 (major_axis) x 6 (minor_axis)
Items axis: AAPL to MSFT
Major_axis axis: 2010-01-04 00:00:00 to 2017-02-28 00:00:00
Minor_axis axis: Open to Adj Close

pdata = pdata.swapaxes('items', 'minor')
pdata['Adj Close']

pdata.ix[:, '6/1/2012', :]

pdata.ix['Adj Close', '5/22/2012':, :]

stacked = pdata.ix[:, '5/30/2012':, :].to_frame()
stacked

stacked.to_panel()

<class 'pandas.core.panel.Panel'>
Dimensions: 6 (items) x 1208 (major_axis) x 4 (minor_axis)
Items axis: Open to Adj Close
Major_axis axis: 2012-05-30 00:00:00 to 2017-02-28 00:00:00
Minor_axis axis: AAPL to MSFT

	a	b	c	d	e
0	0.0	1.0	2.0	3.0	4.0
1	5.0	6.0	7.0	8.0	9.0
2	10.0	11.0	12.0	13.0	14.0
3	15.0	16.0	17.0	18.0	19.0

	a	b	c	d	e
0	0.0	2.0	4.0	6.0	4.0
1	9.0	11.0	13.0	15.0	9.0
2	18.0	20.0	22.0	24.0	14.0
3	15.0	16.0	17.0	18.0	19.0

	b	d	e
Utah	0.155111	-0.301627	0.034083
Ohio	-0.003312	0.807804	-1.046114
Texas	1.983635	-0.667859	1.657091
Oregon	1.633512	-2.041568	0.820017

	b	d	e
Utah	0.155111	0.301627	0.034083
Ohio	0.003312	0.807804	1.046114
Texas	1.983635	0.667859	1.657091
Oregon	1.633512	2.041568	0.820017

	b	d	e
min	-0.003312	-2.041568	-1.046114
max	1.983635	0.807804	1.657091

	pop	state	year
0	1.5	Ohio	2000
1	1.7	Ohio	2001
2	3.6	Ohio	2002
3	2.4	Nevada	2001
4	2.9	Nevada	2002

	year	state	pop	debt
one	2000	Ohio	1.5	16.5
two	2001	Ohio	1.7	16.5
three	2002	Ohio	3.6	16.5
four	2001	Nevada	2.4	16.5
five	2002	Nevada	2.9	16.5

	year	state	pop	debt	eastern
one	2000	Ohio	1.5	NaN	True
two	2001	Ohio	1.7	-1.2	True
three	2002	Ohio	3.6	NaN	True
four	2001	Nevada	2.4	-1.5	False
five	2002	Nevada	2.9	-1.7	False

	one	two	three	four	five
year	2000	2001	2002	2001	2002
state	Ohio	Ohio	Ohio	Nevada	Nevada
pop	1.5	1.7	3.6	2.4	2.9
debt	NaN	-1.2	NaN	-1.5	-1.7

	b	c	d	e
Colorado	NaN	NaN	NaN	NaN
Ohio	3.0	NaN	6.0	NaN
Oregon	NaN	NaN	NaN	NaN
Texas	9.0	NaN	12.0	NaN
Utah	NaN	NaN	NaN	NaN

	b	d	e
Utah	0.16	-0.30	0.03
Ohio	-0.00	0.81	-1.05
Texas	1.98	-0.67	1.66
Oregon	1.63	-2.04	0.82

	0	1	2
a	-0.146504	-0.442410	-1.044600
a	-1.243650	0.281303	0.485324
b	1.040675	0.479036	-0.753287
b	-1.571259	0.194030	0.517437

	one	two
count	3.000000	2.000000
mean	3.083333	-2.900000
std	3.493685	2.262742
min	0.750000	-4.500000
25%	1.075000	-3.700000
50%	1.400000	-2.900000
75%	4.250000	-2.100000
max	7.100000	-1.300000

	AAPL	GOOG	IBM	MSFT
Date
2010-01-04	27.727039	313.062468	111.405000	25.555485
2010-01-05	27.774976	311.683844	110.059232	25.563741
2010-01-06	27.333178	303.826685	109.344283	25.406859
2010-01-07	27.282650	296.753749	108.965786	25.142634
2010-01-08	27.464034	300.709808	110.059232	25.316031
2010-01-11	27.221758	300.255255	108.906903	24.994007
2010-01-12	26.912110	294.945572	109.773245	24.828866
2010-01-13	27.291720	293.252243	109.537735	25.060064
2010-01-14	27.133657	294.630868	111.287245	25.563741
2010-01-15	26.680198	289.710772	110.841458	25.481172
2010-01-19	27.860485	293.516976	112.826478	25.679340
2010-01-20	27.431644	289.915587	109.554561	25.258232
2010-01-21	26.957455	291.199286	108.503173	24.779325
2010-01-22	25.620401	274.730736	105.559289	23.912336
2010-01-25	26.309658	269.730740	106.080779	24.209590
2010-01-26	26.681494	270.939526	105.769566	24.358216
2010-01-27	26.932840	270.779695	106.257412	24.498586
2010-01-28	25.819922	266.878565	104.087347	24.077477
2010-01-29	24.883208	264.705742	102.943437	23.268290
2010-02-01	25.229131	266.244198	104.861166	23.458201
2010-02-02	25.375533	265.295156	105.584521	23.499485
2010-02-03	25.812149	270.140309	105.693870	23.639855
2010-02-04	24.881912	263.127321	103.456514	22.987551
2010-02-05	25.323710	265.380075	103.893889	23.136177
2010-02-08	25.150100	266.468996	102.972982	22.888466
2010-02-09	25.418289	267.952522	104.096663	23.127920
2010-02-10	25.279660	266.958496	103.758713	23.111406
2010-02-11	25.739595	267.932539	104.536000	23.218748
2010-02-12	25.961142	266.294170	104.764113	23.061864
2010-02-16	26.352412	270.380071	105.803308	23.518124
...	...	...	...	...
2017-01-12	118.735214	806.359985	166.632452	62.232715
2017-01-13	118.526121	807.880005	166.027237	62.322172
2017-01-17	119.481976	804.609985	166.572925	62.153195
2017-01-18	119.472017	806.070007	165.491479	62.123377
2017-01-19	119.262925	802.174988	165.501396	61.924581
2017-01-20	119.481976	805.020020	169.212061	62.361932
2017-01-23	119.561633	819.309998	169.688291	62.580604
2017-01-24	119.452107	823.869995	174.520082	63.137231
2017-01-25	121.353858	835.669983	176.891332	63.296267
2017-01-26	121.413604	832.150024	177.258440	63.882708
2017-01-27	121.423555	823.309998	175.909108	65.383610
2017-01-30	121.104937	802.320007	174.420876	64.737526
2017-01-31	120.826147	796.789978	173.150918	64.260423
2017-02-01	128.194203	795.695007	172.922712	63.196871
2017-02-02	127.975152	798.530029	173.210445	62.789338
2017-02-03	128.522781	801.489990	174.440723	63.296267
2017-02-06	129.727549	801.340027	174.480403	63.256507
2017-02-07	130.962201	806.969971	177.060012	63.047773
2017-02-08	131.469994	808.380005	176.169998	62.958315
2017-02-09	132.419998	809.559998	177.210007	63.673974
2017-02-10	132.119995	813.669983	178.679993	63.614338
2017-02-13	133.289993	819.239990	179.360001	64.330000
2017-02-14	135.020004	820.450012	180.130005	64.570000
2017-02-15	135.509995	818.979980	181.679993	64.529999
2017-02-16	135.350006	824.159973	181.429993	64.519997
2017-02-17	135.720001	828.070007	180.669998	64.620003
2017-02-21	136.699997	831.659973	180.259995	64.489998
2017-02-22	137.110001	830.760010	181.149994	64.360001
2017-02-23	136.529999	831.330017	181.649994	64.620003
2017-02-24	136.660004	828.640015	181.350006	64.620003

	AAPL	GOOG	IBM	MSFT
Date
2017-02-17	0.002734	0.004744	-0.004189	0.001550
2017-02-21	0.007221	0.004335	-0.002269	-0.002012
2017-02-22	0.002999	-0.001082	0.004937	-0.002016
2017-02-23	-0.004230	0.000686	0.002760	0.004040
2017-02-24	0.000952	-0.003236	-0.001651	0.000000

	AAPL	GOOG	IBM	MSFT
AAPL	1.000000	0.409523	0.381495	0.388913
GOOG	0.409523	1.000000	0.402880	0.470810
IBM	0.381495	0.402880	1.000000	0.495167
MSFT	0.388913	0.470810	0.495167	1.000000

	AAPL	GOOG	IBM	MSFT
AAPL	0.000269	0.000105	0.000075	0.000092
GOOG	0.000105	0.000244	0.000075	0.000106
IBM	0.000075	0.000075	0.000144	0.000086
MSFT	0.000092	0.000106	0.000086	0.000209

	0	1	2
0	-1.890923	NaN	NaN
1	1.020294	NaN	NaN
2	-3.132870	NaN	NaN
3	0.541084	NaN	-0.203774
4	2.200062	NaN	1.775465
5	-0.458244	0.208688	-0.027016
6	-1.291057	0.391487	-1.341282

	0	1	2
0	-1.890923	0.500000	NaN
1	1.020294	0.500000	NaN
2	-3.132870	0.500000	NaN
3	0.541084	0.500000	-0.203774
4	2.200062	0.500000	1.775465
5	-0.458244	0.208688	-0.027016
6	-1.291057	0.391487	-1.341282

	0	1	2
0	-0.201570	-0.301825	-0.292899
1	0.407622	0.154548	0.337993
2	-1.390216	NaN	1.952913
3	-0.155158	NaN	0.732646
4	-1.391272	NaN	NaN
5	-0.155741	NaN	NaN

	1	2	3
a	-0.291810	-2.489879	-0.993844
b	0.791313	-1.233354	1.096154
c	1.627453	0.637391	NaN
d	NaN	-0.726163	-2.117156

	Open	High	Low	Close	Volume	Adj Close
AAPL	569.159996	572.650009	560.520012	560.989983	130246900.0	72.681610
DELL	12.150000	12.300000	12.045000	12.070000	19397600.0	11.675920
GOOG	571.790972	572.650996	568.350996	570.981000	6138700.0	285.205295
MSFT	28.760000	28.959999	28.440001	28.450001	56634300.0	24.942239

	AAPL	DELL	GOOG	MSFT
Date
2012-05-22	72.160786	14.58765	300.100412	26.090721
2012-05-23	73.921494	12.08221	304.426106	25.520864
2012-05-24	73.242607	12.04351	301.528978	25.485795
2012-05-25	72.850038	12.05319	295.470050	25.477028
2012-05-28	NaN	12.05319	NaN	NaN
2012-05-29	74.143041	12.24666	296.873645	25.915380
2012-05-30	75.037005	12.14992	293.821674	25.722505
2012-05-31	74.850442	11.92743	290.140354	25.591000
2012-06-01	72.681610	11.67592	285.205295	24.942239
2012-06-04	73.109156	11.60821	289.006480	25.029908
2012-06-05	72.920005	11.76298	284.920579	24.994841
2012-06-06	74.038104	11.81619	289.995487	25.731273
2012-06-07	74.071787	11.73396	288.826666	25.626067
2012-06-08	75.185997	11.72429	289.935570	25.994283
2012-06-11	74.000526	11.47278	283.966519	25.336755
2012-06-12	74.647030	11.57919	282.268201	25.678671
2012-06-13	74.128794	11.87423	280.265216	25.538397
2012-06-14	74.047168	11.93711	279.246220	25.722505
2012-06-15	74.384024	11.89841	281.973510	26.318665
2012-06-18	75.893391	12.01449	285.140358	26.160857
2012-06-19	76.104579	11.78233	290.475012	26.914824
2012-06-20	75.888208	11.89841	288.467038	27.116465
2012-06-21	74.842665	11.60821	282.323161	26.423868
2012-06-22	75.416614	11.80168	285.455033	26.914824
2012-06-25	73.948707	11.55500	280.070407	26.187159
2012-06-26	74.111953	11.53565	282.058429	26.318665
2012-06-27	74.431960	11.92743	284.366112	26.450170
2012-06-28	73.725860	11.55984	281.873596	26.222227
2012-06-29	75.662780	12.10155	289.745749	26.818386
2012-07-02	76.766625	11.98064	289.945546	26.792084
...	...	...	...	...
2017-01-17	119.481976	NaN	804.609985	62.153195
2017-01-18	119.472017	NaN	806.070007	62.123377
2017-01-19	119.262925	NaN	802.174988	61.924581
2017-01-20	119.481976	NaN	805.020020	62.361932
2017-01-23	119.561633	NaN	819.309998	62.580604
2017-01-24	119.452107	NaN	823.869995	63.137231
2017-01-25	121.353858	NaN	835.669983	63.296267
2017-01-26	121.413604	NaN	832.150024	63.882708
2017-01-27	121.423555	NaN	823.309998	65.383610
2017-01-30	121.104937	NaN	802.320007	64.737526
2017-01-31	120.826147	NaN	796.789978	64.260423
2017-02-01	128.194203	NaN	795.695007	63.196871
2017-02-02	127.975152	NaN	798.530029	62.789338
2017-02-03	128.522781	NaN	801.489990	63.296267
2017-02-06	129.727549	NaN	801.340027	63.256507
2017-02-07	130.962201	NaN	806.969971	63.047773
2017-02-08	131.469994	NaN	808.380005	62.958315
2017-02-09	132.419998	NaN	809.559998	63.673974
2017-02-10	132.119995	NaN	813.669983	63.614338
2017-02-13	133.289993	NaN	819.239990	64.330000
2017-02-14	135.020004	NaN	820.450012	64.570000
2017-02-15	135.509995	NaN	818.979980	64.529999
2017-02-16	135.350006	NaN	824.159973	64.519997
2017-02-17	135.720001	NaN	828.070007	64.620003
2017-02-21	136.699997	NaN	831.659973	64.489998
2017-02-22	137.110001	NaN	830.760010	64.360001
2017-02-23	136.529999	NaN	831.330017	64.620003
2017-02-24	136.660004	NaN	828.640015	64.620003
2017-02-27	136.929993	NaN	829.280029	64.230003
2017-02-28	136.990005	NaN	823.210022	63.980000

		Open	High	Low	Close	Volume	Adj Close
Date	minor
2012-05-30	AAPL	569.199997	579.989990	566.559990	579.169998	132357400.0	75.037005
	DELL	12.590000	12.700000	12.460000	12.560000	19787800.0	12.149920
	GOOG	588.161028	591.901014	583.530999	588.230992	3827600.0	293.821674
	MSFT	29.350000	29.480000	29.120001	29.340000	41585500.0	25.722505
2012-05-31	AAPL	580.740021	581.499985	571.460022	577.730019	122918600.0	74.850442
	DELL	12.530000	12.540000	12.330000	12.330000	19955600.0	11.927430
	GOOG	588.720982	590.001032	579.001013	580.860990	5958800.0	290.140354
	MSFT	29.299999	29.420000	28.940001	29.190001	39134000.0	25.591000
2012-06-01	AAPL	569.159996	572.650009	560.520012	560.989983	130246900.0	72.681610
	DELL	12.150000	12.300000	12.045000	12.070000	19397600.0	11.675920
	GOOG	571.790972	572.650996	568.350996	570.981000	6138700.0	285.205295
	MSFT	28.760000	28.959999	28.440001	28.450001	56634300.0	24.942239
2012-06-04	AAPL	561.500008	567.499985	548.499977	564.289978	139248900.0	73.109156
	DELL	12.110000	12.112500	11.800000	12.000000	17015700.0	11.608210
	GOOG	570.220958	580.491016	570.011006	578.590973	4883500.0	289.006480
	MSFT	28.620001	28.780001	28.320000	28.549999	47926300.0	25.029908
2012-06-05	AAPL	561.269989	566.470001	558.330002	562.830025	97053600.0	72.920005
	DELL	11.950000	12.240000	11.950000	12.160000	15620900.0	11.762980
	GOOG	575.451008	578.131003	566.470986	570.410999	4697200.0	284.920579
	MSFT	28.510000	28.750000	28.389999	28.510000	45715400.0	24.994841
2012-06-06	AAPL	567.770004	573.849983	565.499992	571.460022	100363900.0	74.038104
	DELL	12.210000	12.280000	12.090000	12.215000	20779900.0	11.816190
	GOOG	576.480979	581.970971	573.611004	580.570966	4207200.0	289.995487
	MSFT	28.879999	29.370001	28.809999	29.350000	46860500.0	25.731273
2012-06-07	AAPL	577.290009	577.320023	570.500000	571.720001	94941700.0	74.071787
	DELL	12.320000	12.410000	12.120000	12.130000	20074000.0	11.733960
	GOOG	587.601014	587.891038	577.251006	578.230986	3530100.0	288.826666
	MSFT	29.639999	29.700001	29.170000	29.230000	37792800.0	25.626067
2012-06-08	AAPL	571.599998	580.580017	568.999992	580.319984	86879100.0	75.185997
2012-06-08	DELL	12.130000	12.225000	12.020000	12.120000	18155600.0	11.724290
...	...	...	...	...	...	...	...
2017-02-14	AAPL	133.470001	135.089996	133.250000	135.020004	32815500.0	135.020004
	GOOG	819.000000	823.000000	816.000000	820.450012	1053600.0	820.450012
	MSFT	64.410004	64.720001	64.019997	64.570000	23065900.0	64.570000
2017-02-15	AAPL	135.520004	136.270004	134.619995	135.509995	35501600.0	135.509995
	GOOG	819.359985	823.000000	818.469971	818.979980	1304000.0	818.979980
	MSFT	64.500000	64.570000	64.160004	64.529999	16917000.0	64.529999
2017-02-16	AAPL	135.669998	135.899994	134.839996	135.350006	22118000.0	135.350006
	GOOG	819.929993	824.400024	818.979980	824.159973	1281700.0	824.159973
	MSFT	64.739998	65.239998	64.440002	64.519997	20524700.0	64.519997
2017-02-17	AAPL	135.100006	135.830002	135.100006	135.720001	22084500.0	135.720001
	GOOG	823.020020	828.070007	821.655029	828.070007	1597800.0	828.070007
	MSFT	64.470001	64.690002	64.300003	64.620003	21234600.0	64.620003
2017-02-21	AAPL	136.229996	136.750000	135.979996	136.699997	24265100.0	136.699997
	GOOG	828.659973	833.450012	828.349976	831.659973	1247700.0	831.659973
	MSFT	64.610001	64.949997	64.449997	64.489998	19384900.0	64.489998
2017-02-22	AAPL	136.429993	137.119995	136.110001	137.110001	20745300.0	137.110001
	GOOG	828.659973	833.250000	828.640015	830.760010	982900.0	830.760010
	MSFT	64.330002	64.389999	64.050003	64.360001	19259700.0	64.360001
2017-02-23	AAPL	137.380005	137.479996	136.300003	136.529999	20704100.0	136.529999
	GOOG	830.119995	832.460022	822.880005	831.330017	1470100.0	831.330017
	MSFT	64.419998	64.730003	64.190002	64.620003	20235200.0	64.620003
2017-02-24	AAPL	135.910004	136.660004	135.279999	136.660004	21690900.0	136.660004
	GOOG	827.729980	829.000000	824.200012	828.640015	1386600.0	828.640015
	MSFT	64.529999	64.800003	64.139999	64.620003	21705200.0	64.620003
2017-02-27	AAPL	137.139999	137.440002	136.279999	136.929993	20196400.0	136.929993
	GOOG	824.549988	830.500000	824.000000	829.280029	1099500.0	829.280029
	MSFT	64.540001	64.540001	64.050003	64.230003	15850400.0	64.230003
2017-02-28	AAPL	137.080002	137.440002	136.699997	136.990005	23403500.0	136.990005
	GOOG	825.609985	828.539978	820.200012	823.210022	2252300.0	823.210022
	MSFT	64.080002	64.199997	63.759998	63.980000	23043900.0	63.980000

数据结构

Series¶

最简单的 Series¶

指定索引¶

通过索引获取数据¶

按条件获取数据¶

数组运算¶

改变索引¶

作为Dictionary¶

数据检查¶

自动对齐索引¶

命名¶

DataFrame¶

创建 DataFrame¶

得到列¶

得到行¶

按列赋值¶

增加和删除列¶

转置¶

更多创建方法¶

为索引和列命名¶

DataFrame 的值¶

索引对象¶

基本功能

重建索引¶

丢弃指定轴上的项¶

索引、选取和过滤¶

算术运算和数据对齐¶

在算术方法中填充值¶

DataFrame 和 Series 之间的运算¶

函数应用和映射¶

排序和排名¶

带有重复值的轴索引¶

汇总和描述性统计

基本统计¶

相关系数(Correlation)与协方差(covariance)¶

唯一值，值计数和成员资格¶

处理缺失数据

滤除缺失数据¶

填充缺失数据¶

层次化索引

根据索引选择数据子集¶

stack() 和 unstack()¶

DataFrame 的层次化索引¶

重排分级顺序¶

根据级别汇总统计¶

使用 DataFrame 列¶

其他

整数索引¶

面板数据¶

`stack()` 和 `unstack()`¶