
Python for Data Analysis (9): Data Aggregation and Group Operations

Reading notes on "Python for Data Analysis".
Chapter 9: Data Aggregation and Group Operations.

Once a dataset is ready, a common task is to compute grouped statistics or build pivot tables.
pandas provides groupby and pivot_table, which make these operations straightforward.
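As a quick taste of both APIs on a tiny made-up frame (the toy data here is purely illustrative):

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [1, 2, 3]})

# groupby: aggregate val within each key
means = df.groupby('key')['val'].mean()

# pivot_table: the same aggregation expressed as a table specification
pivoted = df.pivot_table(values='val', index='key', aggfunc='mean')
```

Both produce the per-key means (1.5 for 'a', 3 for 'b'); the rest of the chapter explores the many forms these two calls can take.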

groupby

Reading notes on "Python for Data Analysis".

Chapter 9, Section 1: groupby

All datasets used here can be downloaded from the author's GitHub.

%pylab inline
import pandas as pd
from pandas import Series, DataFrame
Populating the interactive namespace from numpy and matplotlib

The typical pattern for a group operation is split-apply-combine, illustrated below:

(Figure: split-apply-combine illustration)

Grouping in pandas is flexible:

  • Any axis can be grouped. For example, a DataFrame can be grouped on its rows (axis=0) or its columns (axis=1).

  • Group keys can take many forms: column names, arrays or lists of names/values, dicts, Series, and even functions.
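The split-apply-combine steps in the figure can be sketched by hand on a toy frame (illustrative data; groupby performs all three steps internally):

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b'], 'data1': [1.0, 3.0, 5.0]})

# split: break the frame into per-key pieces
pieces = {key: group for key, group in df.groupby('key1')}

# apply: reduce each piece to a scalar
applied = {key: group['data1'].mean() for key, group in pieces.items()}

# combine: glue the per-group results back into a Series
combined = pd.Series(applied)
```

combined matches df.groupby('key1')['data1'].mean() value for value.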

A simple example:

df = DataFrame({
    'key1': ['a','a','b','b','a'],
    'key2': ['one','two','one','two','one'],
    'data1': np.random.randn(5),
    'data2': np.random.randn(5)
})

df
Out[4]:
data1 data2 key1 key2
0 -1.688851 -1.738961 a one
1 1.677840 0.697004 a two
2 -0.111286 -0.630340 b one
3 -0.344769 -1.289249 b two
4 -0.282896 -1.406068 a one
# Group by key1 and compute the mean of the data1 column
# The result is a Series
grouped = df['data1'].groupby(df['key1'])
grouped.mean()
Out[6]:
key1
a   -0.097969
b   -0.228027
Name: data1, dtype: float64
# Grouping on two keys (key1, key2)
# The result is a Series with a hierarchical index
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means
Out[7]:
key1  key2
a     one    -0.985874
      two     1.677840
b     one    -0.111286
      two    -0.344769
Name: data1, dtype: float64
# Convert to a DataFrame
means.unstack()
Out[13]:
key2 one two
key1
a -0.985874 1.677840
b -0.111286 -0.344769
# Group keys don't have to be Series
# For example, arrays of the right length work too
states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])
years = np.array([2005, 2005, 2006, 2005, 2006])
df['data1'].groupby([states, years]).mean()
Out[15]:
California  2005    1.677840
            2006   -0.111286
Ohio        2005   -1.016810
            2006   -0.282896
Name: data1, dtype: float64
# Column names (or lists of them) can also be used as group keys
df.groupby('key1').mean()
Out[16]:
data1 data2
key1
a -0.097969 -0.816009
b -0.228027 -0.959794
df.groupby(['key1', 'key2']).mean()
Out[17]:
data1 data2
key1 key2
a one -0.985874 -1.572515
two 1.677840 0.697004
b one -0.111286 -0.630340
two -0.344769 -1.289249
df.groupby(['key1', 'key2']).size()
Out[18]:
key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64

Iterating over groups

# The result of groupby is a GroupBy object,
# which supports iteration:
for name, group in df.groupby('key1'):
    print('=================================')
    print(name)
    print('----')
    print(group)
=================================
a
----
      data1     data2 key1 key2
0 -1.688851 -1.738961    a  one
1  1.677840  0.697004    a  two
4 -0.282896 -1.406068    a  one
=================================
b
----
      data1     data2 key1 key2
2 -0.111286 -0.630340    b  one
3 -0.344769 -1.289249    b  two
# With multiple keys, the first element of the tuple is the combination of key values
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print('=================================')
    print((k1, k2))
    print('----')
    print(group)
=================================
('a', 'one')
----
      data1     data2 key1 key2
0 -1.688851 -1.738961    a  one
4 -0.282896 -1.406068    a  one
=================================
('a', 'two')
----
     data1     data2 key1 key2
1  1.67784  0.697004    a  two
=================================
('b', 'one')
----
      data1    data2 key1 key2
2 -0.111286 -0.63034    b  one
=================================
('b', 'two')
----
      data1     data2 key1 key2
3 -0.344769 -1.289249    b  two
# These data pieces can be put to use, e.g., collected into a dict
pieces = dict(list(df.groupby('key1')))
pieces
Out[24]:
{'a':       data1     data2 key1 key2
 0 -1.688851 -1.738961    a  one
 1  1.677840  0.697004    a  two
 4 -0.282896 -1.406068    a  one, 'b':       data1     data2 key1 key2
 2 -0.111286 -0.630340    b  one
 3 -0.344769 -1.289249    b  two}
pieces['b']
Out[25]:
data1 data2 key1 key2
2 -0.111286 -0.630340 b one
3 -0.344769 -1.289249 b two
# groupby groups along axis=0 by default, but any axis can be used
# For example, group the columns by dtype

df.dtypes
Out[27]:
data1    float64
data2    float64
key1      object
key2      object
dtype: object
grouped = df.groupby(df.dtypes, axis=1)
dict(list(grouped))
Out[30]:
{dtype('float64'):       data1     data2
 0 -1.688851 -1.738961
 1  1.677840  0.697004
 2 -0.111286 -0.630340
 3 -0.344769 -1.289249
 4 -0.282896 -1.406068, dtype('O'):   key1 key2
 0    a  one
 1    a  two
 2    b  one
 3    b  two
 4    a  one}

Selecting a column or subset of columns

Indexing the GroupBy object produced by groupby selects a subset of columns for aggregation. The index can be a single column name or a list of names.

This is handy for large datasets where only a few columns need to be aggregated. For example, to compute the mean of just the data2 column:

# Equivalent to df['data2'].groupby(df['key1'])
df.groupby('key1')['data2'].mean()
Out[37]:
key1
a   -0.816009
b   -0.959794
Name: data2, dtype: float64
# Keep the result as a DataFrame
# Equivalent to df[['data2']].groupby(df['key1'])
df.groupby('key1')[['data2']].mean()
Out[38]:
data2
key1
a -0.816009
b -0.959794
# Multiple keys
df.groupby(['key1', 'key2'])[['data2']].mean()
Out[40]:
data2
key1 key2
a one -1.572515
two 0.697004
b one -0.630340
two -1.289249

Grouping with dicts or Series

people = DataFrame(np.random.randn(5, 5),
                   columns=['a', 'b', 'c', 'd', 'e'],
                   index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
people.iloc[2:3, [1, 2]] = np.nan  # add a few NA values (columns b and c)
people
Out[41]:
a b c d e
Joe -1.001767 -0.944141 -0.406020 0.914168 -0.653524
Steve -2.624573 -1.142006 1.275258 0.638460 -0.049654
Wes 1.053102 NaN NaN 1.136185 -0.705047
Jim 0.258541 1.262957 -1.130326 -1.853057 0.916846
Travis -0.582800 1.582861 0.187689 -0.495851 -1.422027
# Use a dict to group the columns and sum within each group
mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'orange', 'e': 'red', 'f' : 'blue'}
by_column = people.groupby(mapping, axis=1)
by_column.sum()
Out[43]:
blue orange red
Joe -0.406020 0.914168 -2.599433
Steve 1.275258 0.638460 -3.816232
Wes NaN 1.136185 0.348056
Jim -1.130326 -1.853057 2.438345
Travis 0.187689 -0.495851 -0.421966
# A Series can also be used as a group key; pandas checks that its index is aligned with the grouping axis
map_series = Series(mapping)
map_series
Out[44]:
a       red
b       red
c      blue
d    orange
e       red
f      blue
dtype: object
people.groupby(map_series, axis=1).count()
Out[45]:
blue orange red
Joe 1 1 3
Steve 1 1 3
Wes 0 1 2
Jim 1 1 3
Travis 1 1 3

Grouping with functions

When a function is passed as a group key, it is called once per index value, and its return values are used as the group names.

people.groupby(len).sum()
Out[46]:
a b c d e
3 0.309876 0.318816 -1.536346 0.197296 -0.441725
5 -2.624573 -1.142006 1.275258 0.638460 -0.049654
6 -0.582800 1.582861 0.187689 -0.495851 -1.422027
# Arrays, lists, dicts, Series, and functions can be mixed freely as group keys
key_list = ['one', 'one', 'one', 'two', 'two']
people.groupby([len, key_list]).min()
Out[48]:
a b c d e
3 one -1.001767 -0.944141 -0.406020 0.914168 -0.705047
two 0.258541 1.262957 -1.130326 -1.853057 0.916846
5 one -2.624573 -1.142006 1.275258 0.638460 -0.049654
6 two -0.582800 1.582861 0.187689 -0.495851 -1.422027

Grouping by index level

With a hierarchically indexed axis, you can aggregate by index level: pass the level number or name via the level keyword.

columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                    [1, 3, 5, 1, 3]], names=['cty', 'tenor'])
hier_df = DataFrame(np.random.randn(4, 5), columns=columns)
hier_df
Out[49]:
cty US JP
tenor 1 3 5 1 3
0 -0.761653 -1.095973 0.123118 1.363196 -0.087321
1 -1.586871 0.242787 0.311317 0.096900 0.384882
2 1.032321 0.016799 -1.619511 0.017748 0.334336
3 -0.371162 0.919163 -0.476240 1.155305 2.318014
hier_df.groupby(level='cty', axis=1).count()
Out[50]:
cty JP US
0 2 3
1 2 3
2 2 3
3 2 3
hier_df.groupby(level=1, axis=1).count()
Out[51]:
tenor 1 3 5
0 2 2 1
1 2 2 1
2 2 2 1
3 2 2 1
 

Data aggregation

Reading notes on "Python for Data Analysis".

Chapter 9, Section 2: Data aggregation

All datasets used here can be downloaded from the author's GitHub.

%pylab inline
import pandas as pd
from pandas import Series, DataFrame
Populating the interactive namespace from numpy and matplotlib

GroupBy objects come with optimized methods for fast statistical computation, for example:

  • count
  • sum
  • mean
  • median
  • std, var (standard deviation, variance)
  • min, max
  • prod (product)
  • first, last
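A minimal sketch of a few of these optimized methods on toy data:

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'], 'x': [1.0, 3.0, 5.0]})
g = df.groupby('key')['x']

stats = {
    'count': g.count(),  # non-NA count per group
    'sum': g.sum(),      # per-group sums
    'mean': g.mean(),    # per-group means
    'prod': g.prod(),    # per-group products
    'first': g.first(),  # first non-NA value in each group
}
```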

That is not all, though: you can define custom aggregations. For example, quantile computes sample quantiles of a Series or of a DataFrame's columns:

df = DataFrame({
    'key1': ['a','a','b','b','a'],
    'key2': ['one','two','one','two','one'],
    'data1': np.random.randn(5),
    'data2': np.random.randn(5)
})

df
Out[2]:
data1 data2 key1 key2
0 -0.075862 0.644310 a one
1 -0.122875 0.692719 a two
2 0.500193 -0.798486 b one
3 -1.898311 0.482947 b two
4 0.221361 -1.521094 a one
grouped = df.groupby('key1')
grouped['data1'].quantile(0.9)
Out[3]:
key1
a    0.161916
b    0.260343
Name: data1, dtype: float64

To use a custom aggregation function, pass it to the aggregate (or agg) method:

def peak_to_peak(arr):
    return arr.max() - arr.min()
grouped.agg(peak_to_peak)
Out[4]:
data1 data2
key1
a 0.344235 2.213812
b 2.398504 1.281433
# Load the data
tips = pd.read_csv('data/ch08/tips.csv')

# Add a tip percentage column (tip_pct)
tips['tip_pct'] = tips['tip']/tips['total_bill']
tips.head()
Out[8]:
total_bill tip sex smoker day time size tip_pct
0 16.99 1.01 Female No Sun Dinner 2 0.059447
1 10.34 1.66 Male No Sun Dinner 3 0.160542
2 21.01 3.50 Male No Sun Dinner 3 0.166587
3 23.68 3.31 Male No Sun Dinner 2 0.139780
4 24.59 3.61 Female No Sun Dinner 4 0.146808

Column-wise and multiple-function application

Aggregating a Series or all columns of a DataFrame simply means using aggregate or calling a method such as mean or std.

However, it is also easy to apply different functions to different columns, or several functions at once.

# Group the tips by sex and smoker
grouped = tips.groupby(['sex', 'smoker'])
grouped_pct = grouped['tip_pct']
# For the optimized descriptive statistics, the function name can be passed as a string
grouped_pct.agg('mean').unstack()
Out[11]:
smoker No Yes
sex
Female 0.156921 0.182150
Male 0.160669 0.152771
# Passing a list of functions or function names produces a DataFrame with auto-named columns
grouped_pct.agg(['mean', 'std', peak_to_peak])
Out[13]:
mean std peak_to_peak
sex smoker
Female No 0.156921 0.036421 0.195876
Yes 0.182150 0.071595 0.360233
Male No 0.160669 0.041849 0.220186
Yes 0.152771 0.090588 0.674707
# (name, function) tuples specify the output column names
grouped_pct.agg([('foo', 'mean'), ('bar', np.std)])
Out[14]:
foo bar
sex smoker
Female No 0.156921 0.036421
Yes 0.182150 0.071595
Male No 0.160669 0.041849
Yes 0.152771 0.090588
# Apply a list of functions to multiple columns
functions = ['count', 'mean', 'max']
result = grouped[['tip_pct', 'total_bill']].agg(functions)
result
Out[15]:
tip_pct total_bill
count mean max count mean max
sex smoker
Female No 54 0.156921 0.252672 54 18.105185 35.83
Yes 33 0.182150 0.416667 33 17.977879 44.30
Male No 97 0.160669 0.291990 97 19.791237 48.33
Yes 60 0.152771 0.710345 60 22.284500 50.81
# A list of (name, function) tuples works here too
ftuples = [('Durchschnitt', 'mean'), ('Abweichung', np.var)]
grouped[['tip_pct', 'total_bill']].agg(ftuples)
Out[16]:
tip_pct total_bill
Durchschnitt Abweichung Durchschnitt Abweichung
sex smoker
Female No 0.156921 0.001327 18.105185 53.092422
Yes 0.182150 0.005126 17.977879 84.451517
Male No 0.160669 0.001751 19.791237 76.152961
Yes 0.152771 0.008206 22.284500 98.244673
# Apply different functions to different columns
grouped.agg({'tip' : np.max, 'size' : 'sum'})
Out[17]:
tip size
sex smoker
Female No 5.2 140
Yes 6.5 74
Male No 9.0 263
Yes 10.0 150
# Or something more complex
grouped.agg({'tip_pct' : ['min', 'max', 'mean', 'std'],
             'size' : 'sum'})
Out[18]:
tip_pct size
min max mean std sum
sex smoker
Female No 0.056797 0.252672 0.156921 0.036421 140
Yes 0.056433 0.416667 0.182150 0.071595 74
Male No 0.071804 0.291990 0.160669 0.041849 263
Yes 0.035638 0.710345 0.152771 0.090588 150

Returning aggregated data in "unindexed" form

Use the as_index=False parameter:

tips.groupby(['sex', 'smoker'], as_index=False).mean()
Out[19]:
sex smoker total_bill tip size tip_pct
0 Female No 18.105185 2.773519 2.592593 0.156921
1 Female Yes 17.977879 2.931515 2.242424 0.182150
2 Male No 19.791237 3.113402 2.711340 0.160669
3 Male Yes 22.284500 3.051167 2.500000 0.152771
# Compare with the indexed form:
tips.groupby(['sex', 'smoker']).mean()
Out[20]:
total_bill tip size tip_pct
sex smoker
Female No 18.105185 2.773519 2.592593 0.156921
Yes 17.977879 2.931515 2.242424 0.182150
Male No 19.791237 3.113402 2.711340 0.160669
Yes 22.284500 3.051167 2.500000 0.152771
 

Group-wise operations and transformations

Reading notes on "Python for Data Analysis".

Chapter 9, Section 3: Group-wise operations and transformations

All datasets used here can be downloaded from the author's GitHub.

%pylab inline
import pandas as pd
from pandas import Series, DataFrame
Populating the interactive namespace from numpy and matplotlib

Aggregation is only one kind of group operation; it is one way of transforming data, reducing a one-dimensional array to a scalar.

More general group operations can be expressed through the transform and apply methods.

df = DataFrame({
    'key1': ['a','a','b','b','a'],
    'key2': ['one','two','one','two','one'],
    'data1': np.random.randn(5),
    'data2': np.random.randn(5)
})

df
Out[2]:
data1 data2 key1 key2
0 -0.545043 -1.770258 a one
1 -2.725590 1.378620 a two
2 -0.558872 0.443795 b one
3 -0.393619 0.014487 b two
4 -1.216987 -0.386167 a one
k1_means = df.groupby('key1').mean().add_prefix('mean_')
k1_means
Out[3]:
mean_data1 mean_data2
key1
a -1.495873 -0.259268
b -0.476246 0.229141
pd.merge(df, k1_means, left_on='key1', right_index=True)
Out[4]:
data1 data2 key1 key2 mean_data1 mean_data2
0 -0.545043 -1.770258 a one -1.495873 -0.259268
1 -2.725590 1.378620 a two -1.495873 -0.259268
4 -1.216987 -0.386167 a one -1.495873 -0.259268
2 -0.558872 0.443795 b one -0.476246 0.229141
3 -0.393619 0.014487 b two -0.476246 0.229141
# The same thing via transform
df.groupby('key1').transform(np.mean)
Out[5]:
data1 data2
0 -1.495873 -0.259268
1 -1.495873 -0.259268
2 -0.476246 0.229141
3 -0.476246 0.229141
4 -1.495873 -0.259268

transform applies a function to each group and then places the results in the appropriate locations.

If each group produces a scalar value, that value is broadcast across the group.

Below we subtract the group mean from each value.

# A demeaning function: subtract the mean from each value
def demean(arr):
    return arr - arr.mean()

demeaned = df.groupby('key1').transform(demean)
demeaned
Out[9]:
data1 data2
0 0.950831 -1.510990
1 -1.229717 1.637888
2 -0.082627 0.214654
3 0.082627 -0.214654
4 0.278886 -0.126898
# Check: the means should now be zero (up to floating-point error):

demeaned.mean()
Out[10]:
data1    6.661338e-17
data2   -1.110223e-17
dtype: float64

apply: general split-apply-combine

Like aggregate, transform imposes strict conditions on the passed function: it must produce either a scalar value that can be broadcast (such as np.mean) or an array of the same size as the group.

The most general-purpose GroupBy method is apply. apply splits the object into pieces, invokes the passed function on each piece, and then attempts to concatenate the pieces back together.
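The contrast can be seen on a toy frame (illustrative data): transform must yield a broadcastable scalar or a same-length array, while apply may return a result of any shape:

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'], 'x': [1.0, 3.0, 5.0]})
g = df.groupby('key')['x']

# transform: one scalar per group, broadcast back to the original shape
broadcast = g.transform('mean')          # length 3, aligned with df

# apply: free to return a differently shaped result, e.g. the top row per group
top1 = g.apply(lambda s: s.nlargest(1))  # one row per group
```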

# Load the data
tips = pd.read_csv('data/ch08/tips.csv')

# Add a tip percentage column (tip_pct)
tips['tip_pct'] = tips['tip']/tips['total_bill']
tips.head()
Out[11]:
total_bill tip sex smoker day time size tip_pct
0 16.99 1.01 Female No Sun Dinner 2 0.059447
1 10.34 1.66 Male No Sun Dinner 3 0.160542
2 21.01 3.50 Male No Sun Dinner 3 0.166587
3 23.68 3.31 Male No Sun Dinner 2 0.139780
4 24.59 3.61 Female No Sun Dinner 4 0.146808
# Suppose we want the five largest tip_pct values within each group

# First, write a function that selects the rows with the largest values of a given column

def top(df, n=5, column='tip_pct'):
    return df.sort_values(by=column)[-n:]

top(tips,n=6)
Out[13]:
total_bill tip sex smoker day time size tip_pct
109 14.31 4.00 Female Yes Sat Dinner 2 0.279525
183 23.17 6.50 Male Yes Sun Dinner 4 0.280535
232 11.61 3.39 Male No Sat Dinner 2 0.291990
67 3.07 1.00 Female Yes Sat Dinner 1 0.325733
178 9.60 4.00 Female Yes Sun Dinner 2 0.416667
172 7.25 5.15 Male Yes Sun Dinner 2 0.710345
# Now group by smoker and apply the function

tips.groupby('smoker').apply(top)
Out[14]:
total_bill tip sex smoker day time size tip_pct
smoker
No 88 24.71 5.85 Male No Thur Lunch 2 0.236746
185 20.69 5.00 Male No Sun Dinner 5 0.241663
51 10.29 2.60 Female No Sun Dinner 2 0.252672
149 7.51 2.00 Male No Thur Lunch 2 0.266312
232 11.61 3.39 Male No Sat Dinner 2 0.291990
Yes 109 14.31 4.00 Female Yes Sat Dinner 2 0.279525
183 23.17 6.50 Male Yes Sun Dinner 4 0.280535
67 3.07 1.00 Female Yes Sat Dinner 1 0.325733
178 9.60 4.00 Female Yes Sun Dinner 2 0.416667
172 7.25 5.15 Male Yes Sun Dinner 2 0.710345
# Extra arguments are passed through to the applied function
tips.groupby(['smoker','day']).apply(top, n=1, column='total_bill')
Out[15]:
total_bill tip sex smoker day time size tip_pct
smoker day
No Fri 94 22.75 3.25 Female No Fri Dinner 2 0.142857
Sat 212 48.33 9.00 Male No Sat Dinner 4 0.186220
Sun 156 48.17 5.00 Male No Sun Dinner 6 0.103799
Thur 142 41.19 5.00 Male No Thur Lunch 5 0.121389
Yes Fri 95 40.17 4.73 Male Yes Fri Dinner 4 0.117750
Sat 170 50.81 10.00 Male Yes Sat Dinner 3 0.196812
Sun 182 45.35 3.50 Male Yes Sun Dinner 3 0.077178
Thur 197 43.11 5.00 Female Yes Thur Lunch 4 0.115982
# Suppress the group keys
tips.groupby(['smoker','day'], group_keys=False).apply(top, n=1, column='total_bill')
Out[17]:
total_bill tip sex smoker day time size tip_pct
94 22.75 3.25 Female No Fri Dinner 2 0.142857
212 48.33 9.00 Male No Sat Dinner 4 0.186220
156 48.17 5.00 Male No Sun Dinner 6 0.103799
142 41.19 5.00 Male No Thur Lunch 5 0.121389
95 40.17 4.73 Male Yes Fri Dinner 4 0.117750
170 50.81 10.00 Male Yes Sat Dinner 3 0.196812
182 45.35 3.50 Male Yes Sun Dinner 3 0.077178
197 43.11 5.00 Female Yes Thur Lunch 4 0.115982

Quantile and bucket analysis

Combining the binning tools (such as cut and qcut) with groupby makes quantile or bucket analysis very easy.

frame = DataFrame({'data1': np.random.randn(1000),
                   'data2': np.random.randn(1000)})

# Use cut to bin the data into equal-length buckets
factor = pd.cut(frame.data1, 4)
factor[:10]
Out[19]:
0     (-1.465, 0.285]
1     (0.285, 2.0343]
2     (-1.465, 0.285]
3     (0.285, 2.0343]
4     (0.285, 2.0343]
5     (0.285, 2.0343]
6     (-1.465, 0.285]
7     (0.285, 2.0343]
8     (-1.465, 0.285]
9    (-3.221, -1.465]
Name: data1, dtype: category
Categories (4, object): [(-3.221, -1.465] < (-1.465, 0.285] < (0.285, 2.0343] < (2.0343, 3.784]]
# The categorical object returned by cut can be passed directly to groupby

def get_stats(group):
    return {'min': group.min(), 'max': group.max(), 'count': group.count(), 'mean': group.mean()}

grouped = frame.data2.groupby(factor)
grouped.apply(get_stats).unstack()
Out[21]:
count max mean min
data1
(-3.221, -1.465] 70.0 2.983339 0.002262 -3.194609
(-1.465, 0.285] 551.0 3.048865 -0.036949 -3.262832
(0.285, 2.0343] 354.0 3.179622 0.011766 -2.261756
(2.0343, 3.784] 25.0 1.548275 -0.035986 -1.678245
# Use qcut to form equal-size buckets based on sample quantiles
# Pass labels=False to get just the bucket numbers

grouping = pd.qcut(frame.data1, 10, labels=False)

grouped = frame.data2.groupby(grouping)
grouped.apply(get_stats).unstack()
Out[22]:
count max mean min
data1
0 100.0 2.983339 -0.054477 -3.194609
1 100.0 2.271839 0.108037 -2.296877
2 100.0 2.077979 -0.109394 -2.092836
3 100.0 2.526364 0.065489 -3.077176
4 100.0 2.173251 -0.024841 -3.262832
5 100.0 1.603202 -0.170120 -2.779698
6 100.0 3.048865 0.060291 -2.318005
7 100.0 3.179622 0.007835 -2.261756
8 100.0 2.759021 -0.102486 -2.100495
9 100.0 1.659892 0.050313 -1.712727

Example: filling missing values with group-specific values

s = Series(np.random.randn(6))
s[::2] = np.nan
s
Out[24]:
0         NaN
1   -0.917052
2         NaN
3   -0.773708
4         NaN
5    0.526083
dtype: float64
s.fillna(s.mean())
Out[25]:
0   -0.388225
1   -0.917052
2   -0.388225
3   -0.773708
4   -0.388225
5    0.526083
dtype: float64
states = ['Ohio', 'New York', 'Vermont', 'Florida',
          'Oregon', 'Nevada', 'California', 'Idaho']
group_key = ['East'] * 4 + ['West'] * 4
data = Series(np.random.randn(8), index=states)
data[['Vermont', 'Nevada', 'Idaho']] = np.nan
data
Out[26]:
Ohio          0.173629
New York     -0.054600
Vermont            NaN
Florida      -0.086387
Oregon        0.005616
Nevada             NaN
California    0.460536
Idaho              NaN
dtype: float64
data.groupby(group_key).mean()
Out[27]:
East    0.010881
West    0.233076
dtype: float64
# Fill NA values with the group means
fill_mean = lambda g: g.fillna(g.mean())
data.groupby(group_key).apply(fill_mean)
Out[29]:
Ohio          0.173629
New York     -0.054600
Vermont       0.010881
Florida      -0.086387
Oregon        0.005616
Nevada        0.233076
California    0.460536
Idaho         0.233076
dtype: float64
# Fill with predefined values per group
fill_values = {'East': 0.5, 'West': -1}
fill_func = lambda g: g.fillna(fill_values[g.name])

data.groupby(group_key).apply(fill_func)
Out[30]:
Ohio          0.173629
New York     -0.054600
Vermont       0.500000
Florida      -0.086387
Oregon        0.005616
Nevada       -1.000000
California    0.460536
Idaho        -1.000000
dtype: float64

Example: random sampling and permutation

One way to draw a random sample is to take the first K elements of np.random.permutation(N), where N is the size of the population and K the desired sample size. Take a deck of playing cards as an example.
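As a sketch of the idiom (using a stand-in Series, since the real deck is built just below); note that newer pandas versions also provide a built-in Series.sample / DataFrame.sample that does the same job:

```python
import numpy as np
import pandas as pd

deck = pd.Series(range(52))  # stand-in for the deck constructed below

# the permutation idiom from the text
def draw(s, n=5):
    return s.take(np.random.permutation(len(s))[:n])

hand = draw(deck)

# equivalent built-in sampling (pandas >= 0.16)
sampled = deck.sample(n=5)
```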

# Suits: H Hearts, S Spades, C Clubs, D Diamonds
suits = ['H', 'S', 'C', 'D']
# Card values: as scored in blackjack (1, 2, ..., 9, 10, 10, 10, 10)
card_val = (list(range(1, 11)) + [10] * 3) * 4
# Card faces
base_names = ['A'] + list(range(2, 11)) + ['J', 'Q', 'K']

cards = []
for suit in ['H', 'S', 'C', 'D']:
    cards.extend(str(num) + suit for num in base_names)

# A full deck of 52 cards
deck = Series(card_val, index=cards)
deck.head()
Out[46]:
AH    1
2H    2
3H    3
4H    4
5H    5
dtype: int64
# Draw five cards at random

def draw(deck,n=5):
    return deck.take(np.random.permutation(len(deck))[:n])

draw(deck)
Out[47]:
8S     8
JC    10
9D     9
JH    10
6C     6
dtype: int64
# Draw two cards from each suit

get_suit = lambda card: card[-1]
deck.groupby(get_suit).apply(draw,n=2)
Out[48]:
C  KC     10
   JC     10
D  3D      3
   2D      2
H  QH     10
   6H      6
S  10S    10
   4S      4
dtype: int64
# Drop the group keys
deck.groupby(get_suit, group_keys=False).apply(draw,n=2)
Out[50]:
3C     3
6C     6
KD    10
3D     3
7H     7
AH     1
4S     4
7S     7
dtype: int64

Example: group weighted average and correlation

df = DataFrame({'category': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
                'data': np.random.randn(8),
                'weights': np.random.rand(8)})
df
Out[51]:
category data weights
0 a 0.924912 0.168805
1 a 1.495683 0.731728
2 a 0.073354 0.455043
3 a 0.773848 0.463418
4 b 0.751353 0.702116
5 b -1.144943 0.688624
6 b -0.574356 0.300167
7 b -0.117242 0.256547
# Compute the weighted average of data within each category
grouped = df.groupby('category')
get_wavg = lambda g: np.average(g['data'], weights=g['weights'])
grouped.apply(get_wavg)
Out[52]:
category
a    0.903003
b   -0.237941
dtype: float64
# Closing prices of the S&P 500 index (SPX) and a few stocks

close_px = pd.read_csv('data/ch09/stock_px.csv', parse_dates=True, index_col=0)
close_px.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2214 entries, 2003-01-02 to 2011-10-14
Data columns (total 4 columns):
AAPL    2214 non-null float64
MSFT    2214 non-null float64
XOM     2214 non-null float64
SPX     2214 non-null float64
dtypes: float64(4)
memory usage: 86.5 KB
# Task: compute the annual correlation between daily returns and SPX
rets = close_px.pct_change().dropna()
spx_corr = lambda x: x.corrwith(x['SPX'])
by_year = rets.groupby(lambda x: x.year)
by_year.apply(spx_corr)
Out[54]:
AAPL MSFT XOM SPX
2003 0.541124 0.745174 0.661265 1.0
2004 0.374283 0.588531 0.557742 1.0
2005 0.467540 0.562374 0.631010 1.0
2006 0.428267 0.406126 0.518514 1.0
2007 0.508118 0.658770 0.786264 1.0
2008 0.681434 0.804626 0.828303 1.0
2009 0.707103 0.654902 0.797921 1.0
2010 0.710105 0.730118 0.839057 1.0
2011 0.691931 0.800996 0.859975 1.0
# Annual correlation between Apple and Microsoft
by_year.apply(lambda g: g['AAPL'].corr(g['MSFT']))
Out[55]:
2003    0.480868
2004    0.259024
2005    0.300093
2006    0.161735
2007    0.417738
2008    0.611901
2009    0.432738
2010    0.571946
2011    0.581987
dtype: float64

Example: group-wise linear regression

import statsmodels.api as sm
def regress(data, yvar, xvars):
    Y = data[yvar]
    X = data[xvars]
    X['intercept'] = 1.
    result = sm.OLS(Y, X).fit()
    return result.params
by_year.apply(regress, 'AAPL', ['SPX'])
Out[56]:
SPX intercept
2003 1.195406 0.000710
2004 1.363463 0.004201
2005 1.766415 0.003246
2006 1.645496 0.000080
2007 1.198761 0.003438
2008 0.968016 -0.001110
2009 0.879103 0.002954
2010 1.052608 0.001261
2011 0.806605 0.001514
 

Pivot tables and cross-tabulations

Reading notes on "Python for Data Analysis".

Chapter 9, Section 4: Pivot tables and cross-tabulations

All datasets used here can be downloaded from the author's GitHub.

%pylab inline
import pandas as pd
from pandas import Series, DataFrame
Populating the interactive namespace from numpy and matplotlib

A pivot table is a common data summarization tool: it aggregates data according to one or more keys and arranges it into rectangular regions based on the group keys assigned to the rows and columns.

In pandas, pivot tables can be built with the groupby facilities described above.

More conveniently, pandas provides a top-level pandas.pivot_table function and a DataFrame.pivot_table method.

The main parameters of pivot_table are:

  • values: column name(s) to aggregate; defaults to all numeric columns
  • index: column names or other group keys placed on the rows of the resulting table
  • columns: column names or other group keys placed on the columns of the resulting table
  • aggfunc: aggregation function or list of functions (default 'mean'); anything valid for groupby works
  • fill_value: value used to replace NA in the result
  • margins: whether to add row/column subtotals and a grand total (default False)
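The claim that groupby can do what pivot_table does is easy to check on a toy frame (illustrative data): pivot_table with aggfunc='mean' is a grouped mean followed by unstack.

```python
import pandas as pd

df = pd.DataFrame({'sex': ['F', 'F', 'M'],
                   'smoker': ['No', 'Yes', 'No'],
                   'tip': [1.0, 2.0, 3.0]})

via_pivot = df.pivot_table(values='tip', index='sex',
                           columns='smoker', aggfunc='mean')

# the same table, spelled out with groupby
via_groupby = df.groupby(['sex', 'smoker'])['tip'].mean().unstack()
```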
# Load the data
tips = pd.read_csv('data/ch08/tips.csv')

# Add a tip percentage column (tip_pct)
tips['tip_pct'] = tips['tip']/tips['total_bill']
tips.head()
Out[2]:
total_bill tip sex smoker day time size tip_pct
0 16.99 1.01 Female No Sun Dinner 2 0.059447
1 10.34 1.66 Male No Sun Dinner 3 0.160542
2 21.01 3.50 Male No Sun Dinner 3 0.166587
3 23.68 3.31 Male No Sun Dinner 2 0.139780
4 24.59 3.61 Female No Sun Dinner 4 0.146808
# Compute group means by sex and smoker
tips.pivot_table(index=['sex', 'smoker'])
Out[3]:
size tip tip_pct total_bill
sex smoker
Female No 2.592593 2.773519 0.156921 18.105185
Yes 2.242424 2.931515 0.182150 17.977879
Male No 2.711340 3.113402 0.160669 19.791237
Yes 2.500000 3.051167 0.152771 22.284500
# Aggregate tip_pct and size, additionally grouping by day
tips.pivot_table(['tip_pct', 'size'], index=['sex', 'day'],
                 columns='smoker')
Out[4]:
tip_pct size
smoker No Yes No Yes
sex day
Female Fri 0.165296 0.209129 2.500000 2.000000
Sat 0.147993 0.163817 2.307692 2.200000
Sun 0.165710 0.237075 3.071429 2.500000
Thur 0.155971 0.163073 2.480000 2.428571
Male Fri 0.138005 0.144730 2.000000 2.125000
Sat 0.162132 0.139067 2.656250 2.629630
Sun 0.158291 0.173964 2.883721 2.600000
Thur 0.165706 0.164417 2.500000 2.300000
# Add partial totals with margins=True
tips.pivot_table(['tip_pct', 'size'], index=['sex', 'day'],
                 columns='smoker', margins=True)
Out[5]:
tip_pct size
smoker No Yes All No Yes All
sex day
Female Fri 0.165296 0.209129 0.199388 2.500000 2.000000 2.111111
Sat 0.147993 0.163817 0.156470 2.307692 2.200000 2.250000
Sun 0.165710 0.237075 0.181569 3.071429 2.500000 2.944444
Thur 0.155971 0.163073 0.157525 2.480000 2.428571 2.468750
Male Fri 0.138005 0.144730 0.143385 2.000000 2.125000 2.100000
Sat 0.162132 0.139067 0.151577 2.656250 2.629630 2.644068
Sun 0.158291 0.173964 0.162344 2.883721 2.600000 2.810345
Thur 0.165706 0.164417 0.165276 2.500000 2.300000 2.433333
All 0.159328 0.163196 0.160803 2.668874 2.408602 2.569672
# Pass a different aggregation function via aggfunc
tips.pivot_table('tip_pct', index=['sex', 'smoker'], columns='day',
                 aggfunc=len, margins=True)
Out[6]:
day Fri Sat Sun Thur All
sex smoker
Female No 2.0 13.0 14.0 25.0 54.0
Yes 7.0 15.0 4.0 7.0 33.0
Male No 2.0 32.0 43.0 20.0 97.0
Yes 8.0 27.0 15.0 10.0 60.0
All 19.0 87.0 76.0 62.0 244.0
# Fill empty cells with fill_value
tips.pivot_table('size', index=['time', 'sex', 'smoker'],
                 columns='day', aggfunc='sum', fill_value=0)
Out[7]:
day Fri Sat Sun Thur
time sex smoker
Dinner Female No 2 30 43 2
Yes 8 33 10 0
Male No 4 85 124 0
Yes 12 71 39 0
Lunch Female No 3 0 0 60
Yes 6 0 0 17
Male No 0 0 0 50
Yes 5 0 0 23

Cross-tabulation: crosstab

A cross-tabulation is a special pivot table that computes group frequencies.
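A sketch of what crosstab computes, on made-up data: it is the same frequency table you would get from groupby + size + unstack.

```python
import pandas as pd

df = pd.DataFrame({'gender': ['F', 'F', 'M'],
                   'hand': ['R', 'L', 'R']})

ct = pd.crosstab(df.gender, df.hand)

# the same frequency table built by hand
by_hand = df.groupby(['gender', 'hand']).size().unstack(fill_value=0)
```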

import io
data = """\
Sample    Gender    Handedness
1    Female    Right-handed
2    Male    Left-handed
3    Female    Right-handed
4    Male    Right-handed
5    Male    Left-handed
6    Male    Right-handed
7    Female    Right-handed
8    Female    Left-handed
9    Male    Right-handed
10    Female    Right-handed"""
data = pd.read_table(io.StringIO(data), sep=r'\s+')

data
Out[13]:
Sample Gender Handedness
0 1 Female Right-handed
1 2 Male Left-handed
2 3 Female Right-handed
3 4 Male Right-handed
4 5 Male Left-handed
5 6 Male Right-handed
6 7 Female Right-handed
7 8 Female Left-handed
8 9 Male Right-handed
9 10 Female Right-handed
pd.crosstab(data.Gender,data.Handedness, margins=True)
Out[14]:
Handedness Left-handed Right-handed All
Gender
Female 1 4 5
Male 2 3 5
All 3 7 10
pd.crosstab([tips.time, tips.day], tips.smoker, margins=True)
Out[15]:
smoker No Yes All
time day
Dinner Fri 3 9 12
Sat 45 42 87
Sun 57 19 76
Thur 1 0 1
Lunch Fri 1 6 7
Thur 44 17 61
All 151 93 244
 

Example: the 2012 Federal Election Commission database

Reading notes on "Python for Data Analysis".

Chapter 9, Section 5: Example: the 2012 Federal Election Commission database

All datasets used here can be downloaded from the author's GitHub.

%pylab inline
import pandas as pd
from pandas import Series, DataFrame
Populating the interactive namespace from numpy and matplotlib
# Load the data
fec = pd.read_csv('data/ch09/P00000001-ALL.csv')
fec.info()
C:\ProgramData\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:2717: DtypeWarning: Columns (6) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1001731 entries, 0 to 1001730
Data columns (total 16 columns):
cmte_id              1001731 non-null object
cand_id              1001731 non-null object
cand_nm              1001731 non-null object
contbr_nm            1001731 non-null object
contbr_city          1001712 non-null object
contbr_st            1001727 non-null object
contbr_zip           1001620 non-null object
contbr_employer      988002 non-null object
contbr_occupation    993301 non-null object
contb_receipt_amt    1001731 non-null float64
contb_receipt_dt     1001731 non-null object
receipt_desc         14166 non-null object
memo_cd              92482 non-null object
memo_text            97770 non-null object
form_tp              1001731 non-null object
file_num             1001731 non-null int64
dtypes: float64(1), int64(1), object(14)
memory usage: 122.3+ MB
# A single record looks like this
fec.loc[123456]
Out[3]:
cmte_id                             C00431445
cand_id                             P80003338
cand_nm                         Obama, Barack
contbr_nm                         ELLMAN, IRA
contbr_city                             TEMPE
contbr_st                                  AZ
contbr_zip                          852816719
contbr_employer      ARIZONA STATE UNIVERSITY
contbr_occupation                   PROFESSOR
contb_receipt_amt                          50
contb_receipt_dt                    01-DEC-11
receipt_desc                              NaN
memo_cd                                   NaN
memo_text                                 NaN
form_tp                                 SA17A
file_num                               772372
Name: 123456, dtype: object
# Get the list of candidate names with unique
unique_cands = fec.cand_nm.unique()
unique_cands
Out[4]:
array(['Bachmann, Michelle', 'Romney, Mitt', 'Obama, Barack',
       "Roemer, Charles E. 'Buddy' III", 'Pawlenty, Timothy',
       'Johnson, Gary Earl', 'Paul, Ron', 'Santorum, Rick', 'Cain, Herman',
       'Gingrich, Newt', 'McCotter, Thaddeus G', 'Huntsman, Jon',
       'Perry, Rick'], dtype=object)
# Party affiliations
parties = {'Bachmann, Michelle': 'Republican',
           'Cain, Herman': 'Republican',
           'Gingrich, Newt': 'Republican',
           'Huntsman, Jon': 'Republican',
           'Johnson, Gary Earl': 'Republican',
           'McCotter, Thaddeus G': 'Republican',
           'Obama, Barack': 'Democrat',
           'Paul, Ron': 'Republican',
           'Pawlenty, Timothy': 'Republican',
           'Perry, Rick': 'Republican',
           "Roemer, Charles E. 'Buddy' III": 'Republican',
           'Romney, Mitt': 'Republican',
           'Santorum, Rick': 'Republican'}
fec.cand_nm[123456:123461]
Out[6]:
123456    Obama, Barack
123457    Obama, Barack
123458    Obama, Barack
123459    Obama, Barack
123460    Obama, Barack
Name: cand_nm, dtype: object
fec.cand_nm[123456:123461].map(parties)
Out[7]:
123456    Democrat
123457    Democrat
123458    Democrat
123459    Democrat
123460    Democrat
Name: cand_nm, dtype: object
# Add a party affiliation column
fec['party'] = fec.cand_nm.map(parties)
fec['party'].value_counts()
Out[9]:
Democrat      593746
Republican    407985
Name: party, dtype: int64
# Restrict to positive contributions (exclude refunds)
fec = fec[fec.contb_receipt_amt > 0]
# Keep only the two main candidates
fec_mrbo = fec[fec.cand_nm.isin(['Obama, Barack', 'Romney, Mitt'])]
fec_mrbo.head()
Out[16]:
cmte_id cand_id cand_nm contbr_nm contbr_city contbr_st contbr_zip contbr_employer contbr_occupation contb_receipt_amt contb_receipt_dt receipt_desc memo_cd memo_text form_tp file_num party
411 C00431171 P80003353 Romney, Mitt ELDERBAUM, WILLIAM DPO AA 3.4023e+08 US GOVERNMENT FOREIGN SERVICE OFFICER 25.0 01-FEB-12 NaN NaN NaN SA17A 780124 Republican
412 C00431171 P80003353 Romney, Mitt ELDERBAUM, WILLIAM DPO AA 3.4023e+08 US GOVERNMENT FOREIGN SERVICE OFFICER 110.0 01-FEB-12 NaN NaN NaN SA17A 780124 Republican
413 C00431171 P80003353 Romney, Mitt CARLSEN, RICHARD APO AE 9.128e+07 DEFENSE INTELLIGENCE AGENCY INTELLIGENCE ANALYST 250.0 13-APR-12 NaN NaN NaN SA17A 785689 Republican
414 C00431171 P80003353 Romney, Mitt DELUCA, PIERRE APO AE 9.128e+07 CISCO ENGINEER 30.0 21-AUG-11 NaN NaN NaN SA17A 760261 Republican
415 C00431171 P80003353 Romney, Mitt SARGENT, MICHAEL APO AE 9.01201e+07 RAYTHEON TECHNICAL SERVICES CORP COMPUTER SYSTEMS ENGINEER 100.0 07-MAR-12 NaN NaN NaN SA17A 780128 Republican

Donation statistics by occupation and employer

# The most common donor occupations
fec.contbr_occupation.value_counts()[:10]
Out[17]:
RETIRED                                   233990
INFORMATION REQUESTED                      35107
ATTORNEY                                   34286
HOMEMAKER                                  29931
PHYSICIAN                                  23432
INFORMATION REQUESTED PER BEST EFFORTS     21138
ENGINEER                                   14334
TEACHER                                    13990
CONSULTANT                                 13273
PROFESSOR                                  12555
Name: contbr_occupation, dtype: int64
# Clean up some near-duplicate occupations by mapping them to a canonical name
# dict.get lets occupations with no mapping pass through unchanged
occ_mapping = {
   'INFORMATION REQUESTED PER BEST EFFORTS' : 'NOT PROVIDED',
   'INFORMATION REQUESTED' : 'NOT PROVIDED',
   'INFORMATION REQUESTED (BEST EFFORTS)' : 'NOT PROVIDED',
   'C.E.O.': 'CEO'
}

# If no mapping is provided, return x unchanged
f = lambda x: occ_mapping.get(x, x)
fec.contbr_occupation = fec.contbr_occupation.map(f)
# Apply the same treatment to the employer info
emp_mapping = {
   'INFORMATION REQUESTED PER BEST EFFORTS' : 'NOT PROVIDED',
   'INFORMATION REQUESTED' : 'NOT PROVIDED',
   'SELF' : 'SELF-EMPLOYED',
   'SELF EMPLOYED' : 'SELF-EMPLOYED',
}

# If no mapping provided, return x
f = lambda x: emp_mapping.get(x, x)
fec.contbr_employer = fec.contbr_employer.map(f)
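The `dict.get(x, x)` trick used for both mappings is easy to verify on a small, made-up Series:

```python
import pandas as pd

occ = pd.Series(['INFORMATION REQUESTED', 'C.E.O.', 'TEACHER'])

occ_mapping = {
    'INFORMATION REQUESTED': 'NOT PROVIDED',
    'C.E.O.': 'CEO',
}

# Mapped values are replaced; anything without a mapping passes through unchanged
cleaned = occ.map(lambda x: occ_mapping.get(x, x))
print(cleaned.tolist())  # ['NOT PROVIDED', 'CEO', 'TEACHER']
```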
# Aggregate by party and occupation to find the occupations that donated the most to each party

by_occupation = fec.pivot_table('contb_receipt_amt',
                                index='contbr_occupation',
                                columns='party', aggfunc='sum')
# Keep only occupations whose total donations exceed $2 million
over_2mm = by_occupation[by_occupation.sum(1) > 2000000]
over_2mm
Out[23]:
party Democrat Republican
contbr_occupation
ATTORNEY 11141982.97 7.477194e+06
CEO 2074974.79 4.211041e+06
CONSULTANT 2459912.71 2.544725e+06
ENGINEER 951525.55 1.818374e+06
EXECUTIVE 1355161.05 4.138850e+06
HOMEMAKER 4248875.80 1.363428e+07
INVESTOR 884133.00 2.431769e+06
LAWYER 3160478.87 3.912243e+05
MANAGER 762883.22 1.444532e+06
NOT PROVIDED 4866973.96 2.056547e+07
OWNER 1001567.36 2.408287e+06
PHYSICIAN 3735124.94 3.594320e+06
PRESIDENT 1878509.95 4.720924e+06
PROFESSOR 2165071.08 2.967027e+05
REAL ESTATE 528902.09 1.625902e+06
RETIRED 25305116.38 2.356124e+07
SELF-EMPLOYED 672393.40 1.640253e+06
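The pivot-then-filter pattern above works the same way on a toy table (a sketch with made-up numbers):

```python
import pandas as pd

df = pd.DataFrame({
    'party': ['D', 'D', 'R', 'R', 'D'],
    'occupation': ['ATTORNEY', 'TEACHER', 'ATTORNEY', 'CEO', 'ATTORNEY'],
    'amt': [100.0, 50.0, 200.0, 300.0, 150.0],
})

# Total amount per occupation (rows) and party (columns)
by_occ = df.pivot_table('amt', index='occupation', columns='party', aggfunc='sum')

# Keep occupations whose row total exceeds a threshold
big = by_occ[by_occ.sum(1) > 200]
print(big)
```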
# Plot a horizontal bar chart
over_2mm.plot(kind='barh')
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x9d48d30>
# Occupations and employers with the highest donation totals for the two candidates
def get_top_amounts(group, key, n=5):
    totals = group.groupby(key)['contb_receipt_amt'].sum()

    # Sort totals in descending order; note that the tail slice [-n:]
    # actually picks the n smallest totals, which is what the output below shows
    return totals.sort_values(ascending=False)[-n:]

# Aggregate by occupation and employer
grouped = fec_mrbo.groupby('cand_nm')
grouped.apply(get_top_amounts, 'contbr_occupation', n=7)
Out[26]:
cand_nm        contbr_occupation                     
Obama, Barack  COMPUTER ASSISTANT                        3.0
               SPRINKLER FITTER FIRE PROTECTION SPECI    3.0
               ADMINISTRATION/INSTRUCTOR                 3.0
               LEAD UI/UX DEVELOPER                      3.0
               POLICY/ LAWYER                            3.0
               LAN/WAN ANALYST                           3.0
               SR MGR                                    3.0
Romney, Mitt   MD - UROLOGIST                            5.0
               DISTRICT REPRESENTATIVE                   5.0
               INDEPENDENT PROFESSIONAL                  3.0
               REMODELER & SEMI RETIRED                  3.0
               AFFORDABLE REAL ESTATE DEVELOPER          3.0
               IFC CONTRACTING SOLUTIONS                 3.0
               3RD GENERATION FAMILY BUSINESS OWNER      3.0
Name: contb_receipt_amt, dtype: float64
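The per-group top-n pattern of `get_top_amounts` can be checked on made-up data; `sort_values` replaces the deprecated `order`, and slicing `[:n]` vs `[-n:]` selects the head or the tail of the descending sort:

```python
import pandas as pd

df = pd.DataFrame({
    'cand': ['A', 'A', 'A', 'B', 'B'],
    'emp':  ['X', 'Y', 'Z', 'X', 'Y'],
    'amt':  [10.0, 30.0, 20.0, 5.0, 50.0],
})

def top_amounts(group, key, n=2):
    totals = group.groupby(key)['amt'].sum()
    # Head of a descending sort = the n largest totals
    return totals.sort_values(ascending=False)[:n]

# Extra positional/keyword arguments are forwarded to the applied function
result = df.groupby('cand').apply(top_amounts, 'emp', n=2)
print(result)
```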
grouped.apply(get_top_amounts, 'contbr_employer', n=10)
Out[27]:
cand_nm        contbr_employer                   
Obama, Barack  SOLIYA                                3.0
               CARR ENTERPRISES                      3.0
               PENN STATE DICKINSON SCHOOL OF LAW    3.0
               CADUCEUS OCCUPATIONAL MEDICINE        3.0
               N.A.                                  3.0
               REAL ENERGY CONSULTING SERVICES       3.0
               JPDSYSTEMS, LLC                       3.0
               CASS REGIONAL MED. CENTER             2.5
               ARCON CORP                            2.0
               THE VICTORIA GROUP, INC.              2.0
Romney, Mitt   EASTHAM CAPITAL                       5.0
               GREGORY GALLIVAN                      5.0
               DIRECT LENDERS LLC                    5.0
               LOUGH INVESTMENT ADVISORY LLC         4.0
               WATERWORKS INDUSRTIES                 3.0
               WILL MERRIFIELD                       3.0
               HONOLD COMMUNICTAIONS                 3.0
               INDEPENDENT PROFESSIONAL              3.0
               UPTOWN CHEAPSKATE                     3.0
               UN                                    3.0
Name: contb_receipt_amt, dtype: float64

Bucketing the donation amounts

bins = np.array([0, 1, 10, 100, 1000, 10000, 100000, 1000000, 10000000])
# Use cut to discretize the donation amounts into bins
labels = pd.cut(fec_mrbo.contb_receipt_amt, bins)
labels
Out[28]:
411           (10, 100]
412         (100, 1000]
413         (100, 1000]
414           (10, 100]
415           (10, 100]
416           (10, 100]
417         (100, 1000]
418           (10, 100]
419         (100, 1000]
420           (10, 100]
421           (10, 100]
422         (100, 1000]
423         (100, 1000]
424         (100, 1000]
425         (100, 1000]
426         (100, 1000]
427       (1000, 10000]
428         (100, 1000]
429         (100, 1000]
430           (10, 100]
431       (1000, 10000]
432         (100, 1000]
433         (100, 1000]
434         (100, 1000]
435         (100, 1000]
436         (100, 1000]
437           (10, 100]
438         (100, 1000]
439         (100, 1000]
440           (10, 100]
              ...      
701356        (10, 100]
701357          (1, 10]
701358        (10, 100]
701359        (10, 100]
701360        (10, 100]
701361        (10, 100]
701362      (100, 1000]
701363        (10, 100]
701364        (10, 100]
701365        (10, 100]
701366        (10, 100]
701367        (10, 100]
701368      (100, 1000]
701369        (10, 100]
701370        (10, 100]
701371        (10, 100]
701372        (10, 100]
701373        (10, 100]
701374        (10, 100]
701375        (10, 100]
701376    (1000, 10000]
701377        (10, 100]
701378        (10, 100]
701379      (100, 1000]
701380    (1000, 10000]
701381        (10, 100]
701382      (100, 1000]
701383          (1, 10]
701384        (10, 100]
701385      (100, 1000]
Name: contb_receipt_amt, dtype: category
Categories (8, object): [(0, 1] < (1, 10] < (10, 100] < (100, 1000] < (1000, 10000] < (10000, 100000] < (100000, 1000000] < (1000000, 10000000]]
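A smaller `pd.cut` example makes the binning easier to see (toy amounts):

```python
import numpy as np
import pandas as pd

amounts = pd.Series([0.5, 5.0, 75.0, 250.0, 4000.0])
bins = np.array([0, 1, 10, 100, 1000, 10000])

# Each value is labeled with the half-open interval (left, right] it falls into
labels = pd.cut(amounts, bins)
print(labels)
```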
# Group the data by candidate name and bin label
grouped = fec_mrbo.groupby(['cand_nm', labels])
grouped.size().unstack(0)
Out[29]:
cand_nm Obama, Barack Romney, Mitt
contb_receipt_amt
(0, 1] 493.0 77.0
(1, 10] 40070.0 3681.0
(10, 100] 372280.0 31853.0
(100, 1000] 153991.0 43357.0
(1000, 10000] 22284.0 26186.0
(10000, 100000] 2.0 1.0
(100000, 1000000] 3.0 NaN
(1000000, 10000000] 4.0 NaN
# Sum the donations in each bin and reshape for plotting
bucket_sums = grouped.contb_receipt_amt.sum().unstack(0)
bucket_sums
Out[30]:
cand_nm Obama, Barack Romney, Mitt
contb_receipt_amt
(0, 1] 318.24 77.00
(1, 10] 337267.62 29819.66
(10, 100] 20288981.41 1987783.76
(100, 1000] 54798531.46 22363381.69
(1000, 10000] 51753705.67 63942145.42
(10000, 100000] 59100.00 12700.00
(100000, 1000000] 1490683.08 NaN
(1000000, 10000000] 7148839.76 NaN
normed_sums = bucket_sums.div(bucket_sums.sum(axis=1), axis=0)
normed_sums
Out[31]:
cand_nm Obama, Barack Romney, Mitt
contb_receipt_amt
(0, 1] 0.805182 0.194818
(1, 10] 0.918767 0.081233
(10, 100] 0.910769 0.089231
(100, 1000] 0.710176 0.289824
(1000, 10000] 0.447326 0.552674
(10000, 100000] 0.823120 0.176880
(100000, 1000000] 1.000000 NaN
(1000000, 10000000] 1.000000 NaN
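The row normalization via `div(..., axis=0)` can be sketched on its own; each row of the result sums to 1:

```python
import pandas as pd

sums = pd.DataFrame({'A': [80.0, 30.0], 'B': [20.0, 70.0]},
                    index=['small', 'large'])

# Divide each row by its row total (axis=0 aligns the divisor on the row index)
normed = sums.div(sums.sum(axis=1), axis=0)
print(normed)
```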
normed_sums[:-2].plot(kind='barh', stacked=True)
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x9fd8470>

Donation statistics by state

# Aggregate by candidate and state
grouped = fec_mrbo.groupby(['cand_nm', 'contbr_st'])
totals = grouped.contb_receipt_amt.sum().unstack(0).fillna(0)
totals = totals[totals.sum(1) > 100000]
totals[:10]
Out[33]:
cand_nm Obama, Barack Romney, Mitt
contbr_st
AK 281840.15 86204.24
AL 543123.48 527303.51
AR 359247.28 105556.00
AZ 1506476.98 1888436.23
CA 23824984.24 11237636.60
CO 2132429.49 1506714.12
CT 2068291.26 3499475.45
DC 4373538.80 1025137.50
DE 336669.14 82712.00
FL 7318178.58 8338458.81
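The aggregate-unstack-fillna chain above follows this shape (a sketch with made-up rows):

```python
import pandas as pd

df = pd.DataFrame({
    'cand': ['A', 'A', 'B'],
    'state': ['CA', 'NY', 'CA'],
    'amt': [100.0, 50.0, 200.0],
})

# Sum per (cand, state), pivot candidates into columns, fill missing combos with 0
totals = df.groupby(['cand', 'state'])['amt'].sum().unstack(0).fillna(0)
print(totals)
```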
percent = totals.div(totals.sum(1), axis=0)
percent[:10]
Out[34]:
cand_nm Obama, Barack Romney, Mitt
contbr_st
AK 0.765778 0.234222
AL 0.507390 0.492610
AR 0.772902 0.227098
AZ 0.443745 0.556255
CA 0.679498 0.320502
CO 0.585970 0.414030
CT 0.371476 0.628524
DC 0.810113 0.189887
DE 0.802776 0.197224
FL 0.467417 0.532583
# Draw a map of the results
from mpl_toolkits.basemap import Basemap, cm

# (omitted)