
Python for Data Analysis (9): Data Aggregation and Group Operations

Reading notes on "Python for Data Analysis".
Chapter 9: Data Aggregation and Group Operations.

Once a dataset is ready, a common task is to compute grouped statistics or build pivot tables.
pandas provides groupby and pivot_table, which make these operations straightforward.
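As a quick taste of both APIs on a tiny made-up frame (the toy data here is purely illustrative):

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [1, 2, 3]})

# groupby: aggregate val within each key
means = df.groupby('key')['val'].mean()

# pivot_table: the same aggregation expressed as a table specification
pivoted = df.pivot_table(values='val', index='key', aggfunc='mean')
```

Both produce the per-key means (1.5 for 'a', 3 for 'b'); the rest of the chapter explores the many forms these two calls can take.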

groupby

Reading notes on "Python for Data Analysis".

Chapter 9, Section 1: groupby

All datasets used here can be downloaded from the author's GitHub.

%pylab inline
import pandas as pd
from pandas import Series, DataFrame
Populating the interactive namespace from numpy and matplotlib

The typical pattern for a group operation is split-apply-combine, illustrated below:

(Figure: split-apply-combine illustration)

Grouping in pandas is flexible:

  • Any axis can be grouped. For example, a DataFrame can be grouped on its rows (axis=0) or its columns (axis=1).

  • Group keys can take many forms: column names, arrays or lists of names/values, dicts, Series, and even functions.
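The split-apply-combine steps in the figure can be sketched by hand on a toy frame (illustrative data; groupby performs all three steps internally):

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b'], 'data1': [1.0, 3.0, 5.0]})

# split: break the frame into per-key pieces
pieces = {key: group for key, group in df.groupby('key1')}

# apply: reduce each piece to a scalar
applied = {key: group['data1'].mean() for key, group in pieces.items()}

# combine: glue the per-group results back into a Series
combined = pd.Series(applied)
```

combined matches df.groupby('key1')['data1'].mean() value for value.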

A simple example:

df = DataFrame({
    'key1': ['a','a','b','b','a'],
    'key2': ['one','two','one','two','one'],
    'data1': np.random.randn(5),
    'data2': np.random.randn(5)
})

df
Out[4]:
data1 data2 key1 key2
0 -1.688851 -1.738961 a one
1 1.677840 0.697004 a two
2 -0.111286 -0.630340 b one
3 -0.344769 -1.289249 b two
4 -0.282896 -1.406068 a one
# Group by key1 and compute the mean of the data1 column
# The result is a Series
grouped = df['data1'].groupby(df['key1'])
grouped.mean()
Out[6]:
key1
a   -0.097969
b   -0.228027
Name: data1, dtype: float64
# Grouping on two keys (key1, key2)
# The result is a Series with a hierarchical index
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means
Out[7]:
key1  key2
a     one    -0.985874
      two     1.677840
b     one    -0.111286
      two    -0.344769
Name: data1, dtype: float64
# Convert to a DataFrame
means.unstack()
Out[13]:
key2 one two
key1
a -0.985874 1.677840
b -0.111286 -0.344769
# Group keys don't have to be Series
# For example, arrays of the right length work too
states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])
years = np.array([2005, 2005, 2006, 2005, 2006])
df['data1'].groupby([states, years]).mean()
Out[15]:
California  2005    1.677840
            2006   -0.111286
Ohio        2005   -1.016810
            2006   -0.282896
Name: data1, dtype: float64
# Column names (or lists of them) can also be used as group keys
df.groupby('key1').mean()
Out[16]:
data1 data2
key1
a -0.097969 -0.816009
b -0.228027 -0.959794
df.groupby(['key1', 'key2']).mean()
Out[17]:
data1 data2
key1 key2
a one -0.985874 -1.572515
two 1.677840 0.697004
b one -0.111286 -0.630340
two -0.344769 -1.289249
df.groupby(['key1', 'key2']).size()
Out[18]:
key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64

Iterating over groups

# The result of groupby is a GroupBy object,
# which supports iteration:
for name, group in df.groupby('key1'):
    print('=================================')
    print(name)
    print('----')
    print(group)
=================================
a
----
      data1     data2 key1 key2
0 -1.688851 -1.738961    a  one
1  1.677840  0.697004    a  two
4 -0.282896 -1.406068    a  one
=================================
b
----
      data1     data2 key1 key2
2 -0.111286 -0.630340    b  one
3 -0.344769 -1.289249    b  two
# With multiple keys, the first element of the tuple is the combination of key values
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print('=================================')
    print((k1, k2))
    print('----')
    print(group)
=================================
('a', 'one')
----
      data1     data2 key1 key2
0 -1.688851 -1.738961    a  one
4 -0.282896 -1.406068    a  one
=================================
('a', 'two')
----
     data1     data2 key1 key2
1  1.67784  0.697004    a  two
=================================
('b', 'one')
----
      data1    data2 key1 key2
2 -0.111286 -0.63034    b  one
=================================
('b', 'two')
----
      data1     data2 key1 key2
3 -0.344769 -1.289249    b  two
# These data pieces can be put to use, e.g., collected into a dict
pieces = dict(list(df.groupby('key1')))
pieces
Out[24]:
{'a':       data1     data2 key1 key2
 0 -1.688851 -1.738961    a  one
 1  1.677840  0.697004    a  two
 4 -0.282896 -1.406068    a  one, 'b':       data1     data2 key1 key2
 2 -0.111286 -0.630340    b  one
 3 -0.344769 -1.289249    b  two}
pieces['b']
Out[25]:
data1 data2 key1 key2
2 -0.111286 -0.630340 b one
3 -0.344769 -1.289249 b two
# groupby groups along axis=0 by default, but any axis can be used
# For example, group the columns by dtype

df.dtypes
Out[27]:
data1    float64
data2    float64
key1      object
key2      object
dtype: object
grouped = df.groupby(df.dtypes, axis=1)
dict(list(grouped))
Out[30]:
{dtype('float64'):       data1     data2
 0 -1.688851 -1.738961
 1  1.677840  0.697004
 2 -0.111286 -0.630340
 3 -0.344769 -1.289249
 4 -0.282896 -1.406068, dtype('O'):   key1 key2
 0    a  one
 1    a  two
 2    b  one
 3    b  two
 4    a  one}

Selecting a column or subset of columns

Indexing the GroupBy object produced by groupby selects a subset of columns for aggregation. The index can be a single column name or a list of names.

This is handy for large datasets where only a few columns need to be aggregated. For example, to compute the mean of just the data2 column:

# Equivalent to df['data2'].groupby(df['key1'])
df.groupby('key1')['data2'].mean()
Out[37]:
key1
a   -0.816009
b   -0.959794
Name: data2, dtype: float64
# Keep the result as a DataFrame
# Equivalent to df[['data2']].groupby(df['key1'])
df.groupby('key1')[['data2']].mean()
Out[38]:
data2
key1
a -0.816009
b -0.959794
# Multiple keys
df.groupby(['key1', 'key2'])[['data2']].mean()
Out[40]:
data2
key1 key2
a one -1.572515
two 0.697004
b one -0.630340
two -1.289249

Grouping with dicts or Series

people = DataFrame(np.random.randn(5, 5),
                   columns=['a', 'b', 'c', 'd', 'e'],
                   index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
people.iloc[2:3, [1, 2]] = np.nan  # add a few NA values (columns b and c)
people
Out[41]:
a b c d e
Joe -1.001767 -0.944141 -0.406020 0.914168 -0.653524
Steve -2.624573 -1.142006 1.275258 0.638460 -0.049654
Wes 1.053102 NaN NaN 1.136185 -0.705047
Jim 0.258541 1.262957 -1.130326 -1.853057 0.916846
Travis -0.582800 1.582861 0.187689 -0.495851 -1.422027
# Use a dict to group the columns and sum within each group
mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'orange', 'e': 'red', 'f' : 'blue'}
by_column = people.groupby(mapping, axis=1)
by_column.sum()
Out[43]:
blue orange red
Joe -0.406020 0.914168 -2.599433
Steve 1.275258 0.638460 -3.816232
Wes NaN 1.136185 0.348056
Jim -1.130326 -1.853057 2.438345
Travis 0.187689 -0.495851 -0.421966
# A Series can also be used as a group key; pandas checks that its index is aligned with the grouping axis
map_series = Series(mapping)
map_series
Out[44]:
a       red
b       red
c      blue
d    orange
e       red
f      blue
dtype: object
people.groupby(map_series, axis=1).count()
Out[45]:
blue orange red
Joe 1 1 3
Steve 1 1 3
Wes 0 1 2
Jim 1 1 3
Travis 1 1 3

Grouping with functions

When a function is passed as a group key, it is called once per index value, and its return values are used as the group names.

people.groupby(len).sum()
Out[46]:
a b c d e
3 0.309876 0.318816 -1.536346 0.197296 -0.441725
5 -2.624573 -1.142006 1.275258 0.638460 -0.049654
6 -0.582800 1.582861 0.187689 -0.495851 -1.422027
# Arrays, lists, dicts, Series, and functions can be mixed freely as group keys
key_list = ['one', 'one', 'one', 'two', 'two']
people.groupby([len, key_list]).min()
Out[48]:
a b c d e
3 one -1.001767 -0.944141 -0.406020 0.914168 -0.705047
two 0.258541 1.262957 -1.130326 -1.853057 0.916846
5 one -2.624573 -1.142006 1.275258 0.638460 -0.049654
6 two -0.582800 1.582861 0.187689 -0.495851 -1.422027

Grouping by index level

With a hierarchically indexed axis, you can aggregate by index level: pass the level number or name via the level keyword.

columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                    [1, 3, 5, 1, 3]], names=['cty', 'tenor'])
hier_df = DataFrame(np.random.randn(4, 5), columns=columns)
hier_df
Out[49]:
cty US JP
tenor 1 3 5 1 3
0 -0.761653 -1.095973 0.123118 1.363196 -0.087321
1 -1.586871 0.242787 0.311317 0.096900 0.384882
2 1.032321 0.016799 -1.619511 0.017748 0.334336
3 -0.371162 0.919163 -0.476240 1.155305 2.318014
hier_df.groupby(level='cty', axis=1).count()
Out[50]:
cty JP US
0 2 3
1 2 3
2 2 3
3 2 3
hier_df.groupby(level=1, axis=1).count()
Out[51]:
tenor 1 3 5
0 2 2 1
1 2 2 1
2 2 2 1
3 2 2 1
 

Data aggregation

Reading notes on "Python for Data Analysis".

Chapter 9, Section 2: Data aggregation

All datasets used here can be downloaded from the author's GitHub.

%pylab inline
import pandas as pd
from pandas import Series, DataFrame
Populating the interactive namespace from numpy and matplotlib

GroupBy objects come with optimized methods for fast statistical computation, for example:

  • count
  • sum
  • mean
  • median
  • std, var (standard deviation, variance)
  • min, max
  • prod (product)
  • first, last
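A minimal sketch of a few of these optimized methods on toy data:

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'], 'x': [1.0, 3.0, 5.0]})
g = df.groupby('key')['x']

stats = {
    'count': g.count(),  # non-NA count per group
    'sum': g.sum(),      # per-group sums
    'mean': g.mean(),    # per-group means
    'prod': g.prod(),    # per-group products
    'first': g.first(),  # first non-NA value in each group
}
```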

That is not all, though: you can define custom aggregations. For example, quantile computes sample quantiles of a Series or of a DataFrame's columns:

df = DataFrame({
    'key1': ['a','a','b','b','a'],
    'key2': ['one','two','one','two','one'],
    'data1': np.random.randn(5),
    'data2': np.random.randn(5)
})

df
Out[2]:
data1 data2 key1 key2
0 -0.075862 0.644310 a one
1 -0.122875 0.692719 a two
2 0.500193 -0.798486 b one
3 -1.898311 0.482947 b two
4 0.221361 -1.521094 a one
grouped = df.groupby('key1')
grouped['data1'].quantile(0.9)
Out[3]:
key1
a    0.161916
b    0.260343
Name: data1, dtype: float64

To use a custom aggregation function, pass it to the aggregate (or agg) method:

def peak_to_peak(arr):
    return arr.max() - arr.min()
grouped.agg(peak_to_peak)
Out[4]:
data1 data2
key1
a 0.344235 2.213812
b 2.398504 1.281433
# Load the data
tips = pd.read_csv('data/ch08/tips.csv')

# Add a tip percentage column (tip_pct)
tips['tip_pct'] = tips['tip']/tips['total_bill']
tips.head()
Out[8]:
total_bill tip sex smoker day time size tip_pct
0 16.99 1.01 Female No Sun Dinner 2 0.059447
1 10.34 1.66 Male No Sun Dinner 3 0.160542
2 21.01 3.50 Male No Sun Dinner 3 0.166587
3 23.68 3.31 Male No Sun Dinner 2 0.139780
4 24.59 3.61 Female No Sun Dinner 4 0.146808

Column-wise and multiple-function application

Aggregating a Series or all columns of a DataFrame simply means using aggregate or calling a method such as mean or std.

However, it is also easy to apply different functions to different columns, or several functions at once.

# Group the tips by sex and smoker
grouped = tips.groupby(['sex', 'smoker'])
grouped_pct = grouped['tip_pct']
# For the optimized descriptive statistics, the function name can be passed as a string
grouped_pct.agg('mean').unstack()
Out[11]:
smoker No Yes
sex
Female 0.156921 0.182150
Male 0.160669 0.152771
# Passing a list of functions or function names produces a DataFrame with auto-named columns
grouped_pct.agg(['mean', 'std', peak_to_peak])
Out[13]:
mean std peak_to_peak
sex smoker
Female No 0.156921 0.036421 0.195876
Yes 0.182150 0.071595 0.360233
Male No 0.160669 0.041849 0.220186
Yes 0.152771 0.090588 0.674707
# (name, function) tuples specify the output column names
grouped_pct.agg([('foo', 'mean'), ('bar', np.std)])
Out[14]:
foo bar
sex smoker
Female No 0.156921 0.036421
Yes 0.182150 0.071595
Male No 0.160669 0.041849
Yes 0.152771 0.090588
# Apply a list of functions to multiple columns
functions = ['count', 'mean', 'max']
result = grouped[['tip_pct', 'total_bill']].agg(functions)
result
Out[15]:
tip_pct total_bill
count mean max count mean max
sex smoker
Female No 54 0.156921 0.252672 54 18.105185 35.83
Yes 33 0.182150 0.416667 33 17.977879 44.30
Male No 97 0.160669 0.291990 97 19.791237 48.33
Yes 60 0.152771 0.710345 60 22.284500 50.81
# A list of (name, function) tuples works here too
ftuples = [('Durchschnitt', 'mean'), ('Abweichung', np.var)]
grouped[['tip_pct', 'total_bill']].agg(ftuples)
Out[16]:
tip_pct total_bill
Durchschnitt Abweichung Durchschnitt Abweichung
sex smoker
Female No 0.156921 0.001327 18.105185 53.092422
Yes 0.182150 0.005126 17.977879 84.451517
Male No 0.160669 0.001751 19.791237 76.152961
Yes 0.152771 0.008206 22.284500 98.244673
# Apply different functions to different columns
grouped.agg({'tip' : np.max, 'size' : 'sum'})
Out[17]:
tip size
sex smoker
Female No 5.2 140
Yes 6.5 74
Male No 9.0 263
Yes 10.0 150
# Or something more complex
grouped.agg({'tip_pct' : ['min', 'max', 'mean', 'std'],
             'size' : 'sum'})
Out[18]:
tip_pct size
min max mean std sum
sex smoker
Female No 0.056797 0.252672 0.156921 0.036421 140
Yes 0.056433 0.416667 0.182150 0.071595 74
Male No 0.071804 0.291990 0.160669 0.041849 263
Yes 0.035638 0.710345 0.152771 0.090588 150

Returning aggregated data in "unindexed" form

Use the as_index=False parameter:

tips.groupby(['sex', 'smoker'], as_index=False).mean()
Out[19]:
sex smoker total_bill tip size tip_pct
0 Female No 18.105185 2.773519 2.592593 0.156921
1 Female Yes 17.977879 2.931515 2.242424 0.182150
2 Male No 19.791237 3.113402 2.711340 0.160669
3 Male Yes 22.284500 3.051167 2.500000 0.152771
# Compare with the indexed form:
tips.groupby(['sex', 'smoker']).mean()
Out[20]:
total_bill tip size tip_pct
sex smoker
Female No 18.105185 2.773519 2.592593 0.156921
Yes 17.977879 2.931515 2.242424 0.182150
Male No 19.791237 3.113402 2.711340 0.160669
Yes 22.284500 3.051167 2.500000 0.152771
 

Group-wise operations and transformations

Reading notes on "Python for Data Analysis".

Chapter 9, Section 3: Group-wise operations and transformations

All datasets used here can be downloaded from the author's GitHub.

%pylab inline
import pandas as pd
from pandas import Series, DataFrame
Populating the interactive namespace from numpy and matplotlib

Aggregation is only one kind of group operation; it is one way of transforming data, reducing a one-dimensional array to a scalar.

More general group operations can be expressed through the transform and apply methods.

df = DataFrame({
    'key1': ['a','a','b','b','a'],
    'key2': ['one','two','one','two','one'],
    'data1': np.random.randn(5),
    'data2': np.random.randn(5)
})

df
Out[2]:
data1 data2 key1 key2
0 -0.545043 -1.770258 a one
1 -2.725590 1.378620 a two
2 -0.558872 0.443795 b one
3 -0.393619 0.014487 b two
4 -1.216987 -0.386167 a one
k1_means = df.groupby('key1').mean().add_prefix('mean_')
k1_means
Out[3]:
mean_data1 mean_data2
key1
a -1.495873 -0.259268
b -0.476246 0.229141
pd.merge(df, k1_means, left_on='key1', right_index=True)
Out[4]:
data1 data2 key1 key2 mean_data1 mean_data2
0 -0.545043 -1.770258 a one -1.495873 -0.259268
1 -2.725590 1.378620 a two -1.495873 -0.259268
4 -1.216987 -0.386167 a one -1.495873 -0.259268
2 -0.558872 0.443795 b one -0.476246 0.229141
3 -0.393619 0.014487 b two -0.476246 0.229141
# The same thing via transform
df.groupby('key1').transform(np.mean)
Out[5]:
data1 data2
0 -1.495873 -0.259268
1 -1.495873 -0.259268
2 -0.476246 0.229141
3 -0.476246 0.229141
4 -1.495873 -0.259268

transform applies a function to each group and then places the results in the appropriate locations.

If each group produces a scalar value, that value is broadcast across the group.

Below we subtract the group mean from each value.

# A demeaning function: subtract the mean from each value
def demean(arr):
    return arr - arr.mean()

demeaned = df.groupby('key1').transform(demean)
demeaned
Out[9]:
data1 data2
0 0.950831 -1.510990
1 -1.229717 1.637888
2 -0.082627 0.214654
3 0.082627 -0.214654
4 0.278886 -0.126898
# Check: the means should now be zero (up to floating-point error):

demeaned.mean()
Out[10]:
data1    6.661338e-17
data2   -1.110223e-17
dtype: float64

apply: general split-apply-combine

Like aggregate, transform imposes strict conditions on the passed function: it must produce either a scalar value that can be broadcast (such as np.mean) or an array of the same size as the group.

The most general-purpose GroupBy method is apply. apply splits the object into pieces, invokes the passed function on each piece, and then attempts to concatenate the pieces back together.
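The contrast can be seen on a toy frame (illustrative data): transform must yield a broadcastable scalar or a same-length array, while apply may return a result of any shape:

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'], 'x': [1.0, 3.0, 5.0]})
g = df.groupby('key')['x']

# transform: one scalar per group, broadcast back to the original shape
broadcast = g.transform('mean')          # length 3, aligned with df

# apply: free to return a differently shaped result, e.g. the top row per group
top1 = g.apply(lambda s: s.nlargest(1))  # one row per group
```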

# Load the data
tips = pd.read_csv('data/ch08/tips.csv')

# Add a tip percentage column (tip_pct)
tips['tip_pct'] = tips['tip']/tips['total_bill']
tips.head()
Out[11]:
total_bill tip sex smoker day time size tip_pct
0 16.99 1.01 Female No Sun Dinner 2 0.059447
1 10.34 1.66 Male No Sun Dinner 3 0.160542
2 21.01 3.50 Male No Sun Dinner 3 0.166587
3 23.68 3.31 Male No Sun Dinner 2 0.139780
4 24.59 3.61 Female No Sun Dinner 4 0.146808
# Suppose we want the five largest tip_pct values within each group

# First, write a function that selects the rows with the largest values of a given column

def top(df, n=5, column='tip_pct'):
    return df.sort_values(by=column)[-n:]

top(tips,n=6)
Out[13]:
total_bill tip sex smoker day time size tip_pct
109 14.31 4.00 Female Yes Sat Dinner 2 0.279525
183 23.17 6.50 Male Yes Sun Dinner 4 0.280535
232 11.61 3.39 Male No Sat Dinner 2 0.291990
67 3.07 1.00 Female Yes Sat Dinner 1 0.325733
178 9.60 4.00 Female Yes Sun Dinner 2 0.416667
172 7.25 5.15 Male Yes Sun Dinner 2 0.710345
# Now group by smoker and apply the function

tips.groupby('smoker').apply(top)
Out[14]:
total_bill tip sex smoker day time size tip_pct
smoker
No 88 24.71 5.85 Male No Thur Lunch 2 0.236746
185 20.69 5.00 Male No Sun Dinner 5 0.241663
51 10.29 2.60 Female No Sun Dinner 2 0.252672
149 7.51 2.00 Male No Thur Lunch 2 0.266312
232 11.61 3.39 Male No Sat Dinner 2 0.291990
Yes 109 14.31 4.00 Female Yes Sat Dinner 2 0.279525
183 23.17 6.50 Male Yes Sun Dinner 4 0.280535
67 3.07 1.00 Female Yes Sat Dinner 1 0.325733
178 9.60 4.00 Female Yes Sun Dinner 2 0.416667
172 7.25 5.15 Male Yes Sun Dinner 2 0.710345
# Extra arguments are passed through to the applied function
tips.groupby(['smoker','day']).apply(top, n=1, column='total_bill')
Out[15]:
total_bill tip sex smoker day time size tip_pct
smoker day
No Fri 94 22.75 3.25 Female No Fri Dinner 2 0.142857
Sat 212 48.33 9.00 Male No Sat Dinner 4 0.186220
Sun 156 48.17 5.00 Male No Sun Dinner 6 0.103799
Thur 142 41.19 5.00 Male No Thur Lunch 5 0.121389
Yes Fri 95 40.17 4.73 Male Yes Fri Dinner 4 0.117750
Sat 170 50.81 10.00 Male Yes Sat Dinner 3 0.196812
Sun 182 45.35 3.50 Male Yes Sun Dinner 3 0.077178
Thur 197 43.11 5.00 Female Yes Thur Lunch 4 0.115982
# Suppress the group keys
tips.groupby(['smoker','day'], group_keys=False).apply(top, n=1, column='total_bill')
Out[17]:
total_bill tip sex smoker day time size tip_pct
94 22.75 3.25 Female No Fri Dinner 2 0.142857
212 48.33 9.00 Male No Sat Dinner 4 0.186220
156 48.17 5.00 Male No Sun Dinner 6 0.103799
142 41.19 5.00 Male No Thur Lunch 5 0.121389
95 40.17 4.73 Male Yes Fri Dinner 4 0.117750
170 50.81 10.00 Male Yes Sat Dinner 3 0.196812
182 45.35 3.50 Male Yes Sun Dinner 3 0.077178
197 43.11 5.00 Female Yes Thur Lunch 4 0.115982

Quantile and bucket analysis

Combining the binning tools (such as cut and qcut) with groupby makes quantile or bucket analysis very easy.

frame = DataFrame({'data1': np.random.randn(1000),
                   'data2': np.random.randn(1000)})

# Use cut to bin the data into equal-length buckets
factor = pd.cut(frame.data1, 4)
factor[:10]
Out[19]:
0     (-1.465, 0.285]
1     (0.285, 2.0343]
2     (-1.465, 0.285]
3     (0.285, 2.0343]
4     (0.285, 2.0343]
5     (0.285, 2.0343]
6     (-1.465, 0.285]
7     (0.285, 2.0343]
8     (-1.465, 0.285]
9    (-3.221, -1.465]
Name: data1, dtype: category
Categories (4, object): [(-3.221, -1.465] < (-1.465, 0.285] < (0.285, 2.0343] < (2.0343, 3.784]]
# The categorical object returned by cut can be passed directly to groupby

def get_stats(group):
    return {'min': group.min(), 'max': group.max(), 'count': group.count(), 'mean': group.mean()}

grouped = frame.data2.groupby(factor)
grouped.apply(get_stats).unstack()
Out[21]:
count max mean min
data1
(-3.221, -1.465] 70.0 2.983339 0.002262 -3.194609
(-1.465, 0.285] 551.0 3.048865 -0.036949 -3.262832
(0.285, 2.0343] 354.0 3.179622 0.011766 -2.261756
(2.0343, 3.784] 25.0 1.548275 -0.035986 -1.678245
# Use qcut to form equal-size buckets based on sample quantiles
# Pass labels=False to get just the bucket numbers

grouping = pd.qcut(frame.data1, 10, labels=False)

grouped = frame.data2.groupby(grouping)
grouped.apply(get_stats).unstack()
Out[22]:
count max mean min
data1
0 100.0 2.983339 -0.054477 -3.194609
1 100.0 2.271839 0.108037 -2.296877
2 100.0 2.077979 -0.109394 -2.092836
3 100.0 2.526364 0.065489 -3.077176
4 100.0 2.173251 -0.024841 -3.262832
5 100.0 1.603202 -0.170120 -2.779698
6 100.0 3.048865 0.060291 -2.318005
7 100.0 3.179622 0.007835 -2.261756
8 100.0 2.759021 -0.102486 -2.100495
9 100.0 1.659892 0.050313 -1.712727

Example: filling missing values with group-specific values

s = Series(np.random.randn(6))
s[::2] = np.nan
s
Out[24]:
0         NaN
1   -0.917052
2         NaN
3   -0.773708
4         NaN
5    0.526083
dtype: float64
s.fillna(s.mean())
Out[25]:
0   -0.388225
1   -0.917052
2   -0.388225
3   -0.773708
4   -0.388225
5    0.526083
dtype: float64
states = ['Ohio', 'New York', 'Vermont', 'Florida',
          'Oregon', 'Nevada', 'California', 'Idaho']
group_key = ['East'] * 4 + ['West'] * 4
data = Series(np.random.randn(8), index=states)
data[['Vermont', 'Nevada', 'Idaho']] = np.nan
data
Out[26]:
Ohio          0.173629
New York     -0.054600
Vermont            NaN
Florida      -0.086387
Oregon        0.005616
Nevada             NaN
California    0.460536
Idaho              NaN
dtype: float64
data.groupby(group_key).mean()
Out[27]:
East    0.010881
West    0.233076
dtype: float64
# Fill NA values with the group means
fill_mean = lambda g: g.fillna(g.mean())
data.groupby(group_key).apply(fill_mean)
Out[29]:
Ohio          0.173629
New York     -0.054600
Vermont       0.010881
Florida      -0.086387
Oregon        0.005616
Nevada        0.233076
California    0.460536
Idaho         0.233076
dtype: float64
# Fill with predefined values per group
fill_values = {'East': 0.5, 'West': -1}
fill_func = lambda g: g.fillna(fill_values[g.name])

data.groupby(group_key).apply(fill_func)
Out[30]:
Ohio          0.173629
New York     -0.054600
Vermont       0.500000
Florida      -0.086387
Oregon        0.005616
Nevada       -1.000000
California    0.460536
Idaho        -1.000000
dtype: float64

Example: random sampling and permutation

One way to draw a random sample is to take the first K elements of np.random.permutation(N), where N is the size of the population and K the desired sample size. Take a deck of playing cards as an example.
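As a sketch of the idiom (using a stand-in Series, since the real deck is built just below); note that newer pandas versions also provide a built-in Series.sample / DataFrame.sample that does the same job:

```python
import numpy as np
import pandas as pd

deck = pd.Series(range(52))  # stand-in for the deck constructed below

# the permutation idiom from the text
def draw(s, n=5):
    return s.take(np.random.permutation(len(s))[:n])

hand = draw(deck)

# equivalent built-in sampling (pandas >= 0.16)
sampled = deck.sample(n=5)
```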

# Suits: H Hearts, S Spades, C Clubs, D Diamonds
suits = ['H', 'S', 'C', 'D']
# Card values: as scored in blackjack (1, 2, ..., 9, 10, 10, 10, 10)
card_val = (list(range(1, 11)) + [10] * 3) * 4
# Card faces
base_names = ['A'] + list(range(2, 11)) + ['J', 'Q', 'K']

cards = []
for suit in ['H', 'S', 'C', 'D']:
    cards.extend(str(num) + suit for num in base_names)

# A full deck of 52 cards
deck = Series(card_val, index=cards)
deck.head()
Out[46]:
AH    1
2H    2
3H    3
4H    4
5H    5
dtype: int64
# Draw five cards at random

def draw(deck,n=5):
    return deck.take(np.random.permutation(len(deck))[:n])

draw(deck)
Out[47]:
8S     8
JC    10
9D     9
JH    10
6C     6
dtype: int64
# Draw two cards from each suit

get_suit = lambda card: card[-1]
deck.groupby(get_suit).apply(draw,n=2)
Out[48]:
C  KC     10
   JC     10
D  3D      3
   2D      2
H  QH     10
   6H      6
S  10S    10
   4S      4
dtype: int64
# Drop the group keys
deck.groupby(get_suit, group_keys=False).apply(draw,n=2)
Out[50]:
3C     3
6C     6
KD    10
3D     3
7H     7
AH     1
4S     4
7S     7
dtype: int64

Example: group weighted average and correlation

df = DataFrame({'category': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
                'data': np.random.randn(8),
                'weights': np.random.rand(8)})
df
Out[51]:
category data weights
0 a 0.924912 0.168805
1 a 1.495683 0.731728
2 a 0.073354 0.455043
3 a 0.773848 0.463418
4 b 0.751353 0.702116
5 b -1.144943 0.688624
6 b -0.574356 0.300167
7 b -0.117242 0.256547
# Compute the weighted average of data within each category
grouped = df.groupby('category')
get_wavg = lambda g: np.average(g['data'], weights=g['weights'])
grouped.apply(get_wavg)
Out[52]:
category
a    0.903003
b   -0.237941
dtype: float64
# Closing prices of the S&P 500 index (SPX) and a few stocks

close_px = pd.read_csv('data/ch09/stock_px.csv', parse_dates=True, index_col=0)
close_px.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2214 entries, 2003-01-02 to 2011-10-14
Data columns (total 4 columns):
AAPL    2214 non-null float64
MSFT    2214 non-null float64
XOM     2214 non-null float64
SPX     2214 non-null float64
dtypes: float64(4)
memory usage: 86.5 KB
# Task: compute the annual correlation between daily returns and SPX
rets = close_px.pct_change().dropna()
spx_corr = lambda x: x.corrwith(x['SPX'])
by_year = rets.groupby(lambda x: x.year)
by_year.apply(spx_corr)
Out[54]:
AAPL MSFT XOM SPX
2003 0.541124 0.745174 0.661265 1.0
2004 0.374283 0.588531 0.557742 1.0
2005 0.467540 0.562374 0.631010 1.0
2006 0.428267 0.406126 0.518514 1.0
2007 0.508118 0.658770 0.786264 1.0
2008 0.681434 0.804626 0.828303 1.0
2009 0.707103 0.654902 0.797921 1.0
2010 0.710105 0.730118 0.839057 1.0
2011 0.691931 0.800996 0.859975 1.0
# Annual correlation between Apple and Microsoft
by_year.apply(lambda g: g['AAPL'].corr(g['MSFT']))
Out[55]:
2003    0.480868
2004    0.259024
2005    0.300093
2006    0.161735
2007    0.417738
2008    0.611901
2009    0.432738
2010    0.571946
2011    0.581987
dtype: float64

Example: group-wise linear regression

import statsmodels.api as sm
def regress(data, yvar, xvars):
    Y = data[yvar]
    X = data[xvars]
    X['intercept'] = 1.
    result = sm.OLS(Y, X).fit()
    return result.params
by_year.apply(regress, 'AAPL', ['SPX'])
Out[56]:
SPX intercept
2003 1.195406 0.000710
2004 1.363463 0.004201
2005 1.766415 0.003246
2006 1.645496 0.000080
2007 1.198761 0.003438
2008 0.968016 -0.001110
2009 0.879103 0.002954
2010 1.052608 0.001261
2011 0.806605 0.001514
 

Pivot tables and cross-tabulations

Reading notes on "Python for Data Analysis".

Chapter 9, Section 4: Pivot tables and cross-tabulations

All datasets used here can be downloaded from the author's GitHub.

%pylab inline
import pandas as pd
from pandas import Series, DataFrame
Populating the interactive namespace from numpy and matplotlib

A pivot table is a common data summarization tool: it aggregates data according to one or more keys and arranges it into rectangular regions based on the group keys assigned to the rows and columns.

In pandas, pivot tables can be built with the groupby facilities described above.

More conveniently, pandas provides a top-level pandas.pivot_table function and a DataFrame.pivot_table method.

The main parameters of pivot_table are:

  • values: column name(s) to aggregate; defaults to all numeric columns
  • index: column names or other group keys placed on the rows of the resulting table
  • columns: column names or other group keys placed on the columns of the resulting table
  • aggfunc: aggregation function or list of functions (default 'mean'); anything valid for groupby works
  • fill_value: value used to replace NA in the result
  • margins: whether to add row/column subtotals and a grand total (default False)
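The claim that groupby can do what pivot_table does is easy to check on a toy frame (illustrative data): pivot_table with aggfunc='mean' is a grouped mean followed by unstack.

```python
import pandas as pd

df = pd.DataFrame({'sex': ['F', 'F', 'M'],
                   'smoker': ['No', 'Yes', 'No'],
                   'tip': [1.0, 2.0, 3.0]})

via_pivot = df.pivot_table(values='tip', index='sex',
                           columns='smoker', aggfunc='mean')

# the same table, spelled out with groupby
via_groupby = df.groupby(['sex', 'smoker'])['tip'].mean().unstack()
```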
# Load the data
tips = pd.read_csv('data/ch08/tips.csv')

# Add a tip percentage column (tip_pct)
tips['tip_pct'] = tips['tip']/tips['total_bill']
tips.head()
Out[2]:
total_bill tip sex smoker day time size tip_pct
0 16.99 1.01 Female No Sun Dinner 2 0.059447
1 10.34 1.66 Male No Sun Dinner 3 0.160542
2 21.01 3.50 Male No Sun Dinner 3 0.166587
3 23.68 3.31 Male No Sun Dinner 2 0.139780
4 24.59 3.61 Female No Sun Dinner 4 0.146808
# Compute group means by sex and smoker
tips.pivot_table(index=['sex', 'smoker'])
Out[3]:
size tip tip_pct total_bill
sex smoker
Female No 2.592593 2.773519 0.156921 18.105185
Yes 2.242424 2.931515 0.182150 17.977879
Male No 2.711340 3.113402 0.160669 19.791237
Yes 2.500000 3.051167 0.152771 22.284500
# Aggregate tip_pct and size, additionally grouping by day
tips.pivot_table(['tip_pct', 'size'], index=['sex', 'day'],
                 columns='smoker')
Out[4]:
tip_pct size
smoker No Yes No Yes
sex day
Female Fri 0.165296 0.209129 2.500000 2.000000
Sat 0.147993 0.163817 2.307692 2.200000
Sun 0.165710 0.237075 3.071429 2.500000
Thur 0.155971 0.163073 2.480000 2.428571
Male Fri 0.138005 0.144730 2.000000 2.125000
Sat 0.162132 0.139067 2.656250 2.629630
Sun 0.158291 0.173964 2.883721 2.600000
Thur 0.165706 0.164417 2.500000 2.300000
# Add partial totals with margins=True
tips.pivot_table(['tip_pct', 'size'], index=['sex', 'day'],
                 columns='smoker', margins=True)
Out[5]:
tip_pct size
smoker No Yes All No Yes All
sex day
Female Fri 0.165296 0.209129 0.199388 2.500000 2.000000 2.111111
Sat 0.147993 0.163817 0.156470 2.307692 2.200000 2.250000
Sun 0.165710 0.237075 0.181569 3.071429 2.500000 2.944444
Thur 0.155971 0.163073 0.157525 2.480000 2.428571 2.468750
Male Fri 0.138005 0.144730 0.143385 2.000000 2.125000 2.100000
Sat 0.162132 0.139067 0.151577 2.656250 2.629630 2.644068
Sun 0.158291 0.173964 0.162344 2.883721 2.600000 2.810345
Thur 0.165706 0.164417 0.165276 2.500000 2.300000 2.433333
All 0.159328 0.163196 0.160803 2.668874 2.408602 2.569672
# Pass a different aggregation function via aggfunc
tips.pivot_table('tip_pct', index=['sex', 'smoker'], columns='day',
                 aggfunc=len, margins=True)
Out[6]:
day Fri Sat Sun Thur All
sex smoker
Female No 2.0 13.0 14.0 25.0 54.0
Yes 7.0 15.0 4.0 7.0 33.0
Male No 2.0 32.0 43.0 20.0 97.0
Yes 8.0 27.0 15.0 10.0 60.0
All 19.0 87.0 76.0 62.0 244.0
# Fill empty cells with fill_value
tips.pivot_table('size', index=['time', 'sex', 'smoker'],
                 columns='day', aggfunc='sum', fill_value=0)
Out[7]:
day Fri Sat Sun Thur
time sex smoker
Dinner Female No 2 30 43 2
Yes 8 33 10 0
Male No 4 85 124 0
Yes 12 71 39 0
Lunch Female No 3 0 0 60
Yes 6 0 0 17
Male No 0 0 0 50
Yes 5 0 0 23

Cross-tabulation: crosstab

A cross-tabulation is a special pivot table that computes group frequencies.
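A sketch of what crosstab computes, on made-up data: it is the same frequency table you would get from groupby + size + unstack.

```python
import pandas as pd

df = pd.DataFrame({'gender': ['F', 'F', 'M'],
                   'hand': ['R', 'L', 'R']})

ct = pd.crosstab(df.gender, df.hand)

# the same frequency table built by hand
by_hand = df.groupby(['gender', 'hand']).size().unstack(fill_value=0)
```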

import io
data = """\
Sample    Gender    Handedness
1    Female    Right-handed
2    Male    Left-handed
3    Female    Right-handed
4    Male    Right-handed
5    Male    Left-handed
6    Male    Right-handed
7    Female    Right-handed
8    Female    Left-handed
9    Male    Right-handed
10    Female    Right-handed"""
data = pd.read_table(io.StringIO(data), sep=r'\s+')

data
Out[13]:
Sample Gender Handedness
0 1 Female Right-handed
1 2 Male Left-handed
2 3 Female Right-handed
3 4 Male Right-handed
4 5 Male Left-handed
5 6 Male Right-handed
6 7 Female Right-handed
7 8 Female Left-handed
8 9 Male Right-handed
9 10 Female Right-handed
pd.crosstab(data.Gender,data.Handedness, margins=True)
Out[14]:
Handedness Left-handed Right-handed All
Gender
Female 1 4 5
Male 2 3 5
All 3 7 10
pd.crosstab([tips.time, tips.day], tips.smoker, margins=True)
Out[15]:
smoker No Yes All
time day
Dinner Fri 3 9 12
Sat 45 42 87
Sun 57 19 76
Thur 1 0 1
Lunch Fri 1 6 7
Thur 44 17 61
All 151 93 244
 

Example: the 2012 Federal Election Commission database

Reading notes on "Python for Data Analysis".

Chapter 9, Section 5: Example: the 2012 Federal Election Commission database

All datasets used here can be downloaded from the author's GitHub.

%pylab inline
import pandas as pd
from pandas import Series, DataFrame
Populating the interactive namespace from numpy and matplotlib
# Load the data
fec = pd.read_csv('data/ch09/P00000001-ALL.csv')
fec.info()
C:\ProgramData\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:2717: DtypeWarning: Columns (6) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1001731 entries, 0 to 1001730
Data columns (total 16 columns):
cmte_id              1001731 non-null object
cand_id              1001731 non-null object
cand_nm              1001731 non-null object
contbr_nm            1001731 non-null object
contbr_city          1001712 non-null object
contbr_st            1001727 non-null object
contbr_zip           1001620 non-null object
contbr_employer      988002 non-null object
contbr_occupation    993301 non-null object
contb_receipt_amt    1001731 non-null float64
contb_receipt_dt     1001731 non-null object
receipt_desc         14166 non-null object
memo_cd              92482 non-null object
memo_text            97770 non-null object
form_tp              1001731 non-null object
file_num             1001731 non-null int64
dtypes: float64(1), int64(1), object(14)
memory usage: 122.3+ MB
# A single record looks like this
fec.loc[123456]
Out[3]:
cmte_id                             C00431445
cand_id                             P80003338
cand_nm                         Obama, Barack
contbr_nm                         ELLMAN, IRA
contbr_city                             TEMPE
contbr_st                                  AZ
contbr_zip                          852816719
contbr_employer      ARIZONA STATE UNIVERSITY
contbr_occupation                   PROFESSOR
contb_receipt_amt                          50
contb_receipt_dt                    01-DEC-11
receipt_desc                              NaN
memo_cd                                   NaN
memo_text                                 NaN
form_tp                                 SA17A
file_num                               772372
Name: 123456, dtype: object
# Get the list of candidate names with unique
unique_cands = fec.cand_nm.unique()
unique_cands
Out[4]:
array(['Bachmann, Michelle', 'Romney, Mitt', 'Obama, Barack',
       "Roemer, Charles E. 'Buddy' III", 'Pawlenty, Timothy',
       'Johnson, Gary Earl', 'Paul, Ron', 'Santorum, Rick', 'Cain, Herman',
       'Gingrich, Newt', 'McCotter, Thaddeus G', 'Huntsman, Jon',
       'Perry, Rick'], dtype=object)
# Party affiliations
parties = {'Bachmann, Michelle': 'Republican',
           'Cain, Herman': 'Republican',
           'Gingrich, Newt': 'Republican',
           'Huntsman, Jon': 'Republican',
           'Johnson, Gary Earl': 'Republican',
           'McCotter, Thaddeus G': 'Republican',
           'Obama, Barack': 'Democrat',
           'Paul, Ron': 'Republican',
           'Pawlenty, Timothy': 'Republican',
           'Perry, Rick': 'Republican',
           "Roemer, Charles E. 'Buddy' III": 'Republican',
           'Romney, Mitt': 'Republican',
           'Santorum, Rick': 'Republican'}
fec.cand_nm[123456:123461]
Out[6]:
123456    Obama, Barack
123457    Obama, Barack
123458    Obama, Barack
123459    Obama, Barack
123460    Obama, Barack
Name: cand_nm, dtype: object
fec.cand_nm[123456:123461].map(parties)
Out[7]:
123456    Democrat
123457    Democrat
123458    Democrat
123459    Democrat
123460    Democrat
Name: cand_nm, dtype: object
# Add a party affiliation column
fec['party'] = fec.cand_nm.map(parties)
fec['party'].value_counts()
Out[9]:
Democrat      593746
Republican    407985
Name: party, dtype: int64
# Restrict to positive contributions (exclude refunds)
fec = fec[fec.contb_receipt_amt > 0]
# Keep only the two main candidates
fec_mrbo = fec[fec.cand_nm.isin(['Obama, Barack', 'Romney, Mitt'])]
fec_mrbo.head()
Out[16]:
cmte_id cand_id cand_nm contbr_nm contbr_city contbr_st contbr_zip contbr_employer contbr_occupation contb_receipt_amt contb_receipt_dt receipt_desc memo_cd memo_text form_tp file_num party
411 C00431171 P80003353 Romney, Mitt ELDERBAUM, WILLIAM DPO AA 3.4023e+08 US GOVERNMENT FOREIGN SERVICE OFFICER 25.0 01-FEB-12 NaN NaN NaN SA17A 780124 Republican
412 C00431171 P80003353 Romney, Mitt ELDERBAUM, WILLIAM DPO AA 3.4023e+08 US GOVERNMENT FOREIGN SERVICE OFFICER 110.0 01-FEB-12 NaN NaN NaN SA17A 780124 Republican
413 C00431171 P80003353 Romney, Mitt CARLSEN, RICHARD APO AE 9.128e+07 DEFENSE INTELLIGENCE AGENCY INTELLIGENCE ANALYST 250.0 13-APR-12 NaN NaN NaN SA17A 785689 Republican
414 C00431171 P80003353 Romney, Mitt DELUCA, PIERRE APO AE 9.128e+07 CISCO ENGINEER 30.0 21-AUG-11 NaN NaN NaN SA17A 760261 Republican
415 C00431171 P80003353 Romney, Mitt SARGENT, MICHAEL APO AE 9.01201e+07 RAYTHEON TECHNICAL SERVICES CORP COMPUTER SYSTEMS ENGINEER 100.0 07-MAR-12 NaN NaN NaN SA17A 780128 Republican

Donation statistics by occupation and employer

# The most common donor occupations
fec.contbr_occupation.value_counts()[:10]
Out[17]:
RETIRED                                   233990
INFORMATION REQUESTED                      35107
ATTORNEY                                   34286
HOMEMAKER                                  29931
PHYSICIAN                                  23432
INFORMATION REQUESTED PER BEST EFFORTS     21138
ENGINEER                                   14334
TEACHER                                    13990
CONSULTANT                                 13273
PROFESSOR                                  12555
Name: contbr_occupation, dtype: int64
# Clean up some near-duplicate occupations by mapping them to a canonical name
# dict.get lets occupations with no mapping pass through unchanged
occ_mapping = {
   'INFORMATION REQUESTED PER BEST EFFORTS' : 'NOT PROVIDED',
   'INFORMATION REQUESTED' : 'NOT PROVIDED',
   'INFORMATION REQUESTED (BEST EFFORTS)' : 'NOT PROVIDED',
   'C.E.O.': 'CEO'
}

# If no mapping is provided, return x unchanged
f = lambda x: occ_mapping.get(x, x)
fec.contbr_occupation = fec.contbr_occupation.map(f)
# Apply the same treatment to the employer info
emp_mapping = {
   'INFORMATION REQUESTED PER BEST EFFORTS' : 'NOT PROVIDED',
   'INFORMATION REQUESTED' : 'NOT PROVIDED',
   'SELF' : 'SELF-EMPLOYED',
   'SELF EMPLOYED' : 'SELF-EMPLOYED',
}

# If no mapping provided, return x
f = lambda x: emp_mapping.get(x, x)
fec.contbr_employer = fec.contbr_employer.map(f)
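The `dict.get(x, x)` trick used for both mappings is easy to verify on a small, made-up Series:

```python
import pandas as pd

occ = pd.Series(['INFORMATION REQUESTED', 'C.E.O.', 'TEACHER'])

occ_mapping = {
    'INFORMATION REQUESTED': 'NOT PROVIDED',
    'C.E.O.': 'CEO',
}

# Mapped values are replaced; anything without a mapping passes through unchanged
cleaned = occ.map(lambda x: occ_mapping.get(x, x))
print(cleaned.tolist())  # ['NOT PROVIDED', 'CEO', 'TEACHER']
```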
# Aggregate by party and occupation to find the occupations that donated the most to each party

by_occupation = fec.pivot_table('contb_receipt_amt',
                                index='contbr_occupation',
                                columns='party', aggfunc='sum')
# Keep only occupations whose total donations exceed $2 million
over_2mm = by_occupation[by_occupation.sum(1) > 2000000]
over_2mm
Out[23]:
party Democrat Republican
contbr_occupation
ATTORNEY 11141982.97 7.477194e+06
CEO 2074974.79 4.211041e+06
CONSULTANT 2459912.71 2.544725e+06
ENGINEER 951525.55 1.818374e+06
EXECUTIVE 1355161.05 4.138850e+06
HOMEMAKER 4248875.80 1.363428e+07
INVESTOR 884133.00 2.431769e+06
LAWYER 3160478.87 3.912243e+05
MANAGER 762883.22 1.444532e+06
NOT PROVIDED 4866973.96 2.056547e+07
OWNER 1001567.36 2.408287e+06
PHYSICIAN 3735124.94 3.594320e+06
PRESIDENT 1878509.95 4.720924e+06
PROFESSOR 2165071.08 2.967027e+05
REAL ESTATE 528902.09 1.625902e+06
RETIRED 25305116.38 2.356124e+07
SELF-EMPLOYED 672393.40 1.640253e+06
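The pivot-then-filter pattern above works the same way on a toy table (a sketch with made-up numbers):

```python
import pandas as pd

df = pd.DataFrame({
    'party': ['D', 'D', 'R', 'R', 'D'],
    'occupation': ['ATTORNEY', 'TEACHER', 'ATTORNEY', 'CEO', 'ATTORNEY'],
    'amt': [100.0, 50.0, 200.0, 300.0, 150.0],
})

# Total amount per occupation (rows) and party (columns)
by_occ = df.pivot_table('amt', index='occupation', columns='party', aggfunc='sum')

# Keep occupations whose row total exceeds a threshold
big = by_occ[by_occ.sum(1) > 200]
print(big)
```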
# Plot a horizontal bar chart
over_2mm.plot(kind='barh')
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x9d48d30>
# Occupations and employers with the highest donation totals for the two candidates
def get_top_amounts(group, key, n=5):
    totals = group.groupby(key)['contb_receipt_amt'].sum()

    # Sort totals in descending order; note that the tail slice [-n:]
    # actually picks the n smallest totals, which is what the output below shows
    return totals.sort_values(ascending=False)[-n:]

# Aggregate by occupation and employer
grouped = fec_mrbo.groupby('cand_nm')
grouped.apply(get_top_amounts, 'contbr_occupation', n=7)
Out[26]:
cand_nm        contbr_occupation                     
Obama, Barack  COMPUTER ASSISTANT                        3.0
               SPRINKLER FITTER FIRE PROTECTION SPECI    3.0
               ADMINISTRATION/INSTRUCTOR                 3.0
               LEAD UI/UX DEVELOPER                      3.0
               POLICY/ LAWYER                            3.0
               LAN/WAN ANALYST                           3.0
               SR MGR                                    3.0
Romney, Mitt   MD - UROLOGIST                            5.0
               DISTRICT REPRESENTATIVE                   5.0
               INDEPENDENT PROFESSIONAL                  3.0
               REMODELER & SEMI RETIRED                  3.0
               AFFORDABLE REAL ESTATE DEVELOPER          3.0
               IFC CONTRACTING SOLUTIONS                 3.0
               3RD GENERATION FAMILY BUSINESS OWNER      3.0
Name: contb_receipt_amt, dtype: float64
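The per-group top-n pattern of `get_top_amounts` can be checked on made-up data; `sort_values` replaces the deprecated `order`, and slicing `[:n]` vs `[-n:]` selects the head or the tail of the descending sort:

```python
import pandas as pd

df = pd.DataFrame({
    'cand': ['A', 'A', 'A', 'B', 'B'],
    'emp':  ['X', 'Y', 'Z', 'X', 'Y'],
    'amt':  [10.0, 30.0, 20.0, 5.0, 50.0],
})

def top_amounts(group, key, n=2):
    totals = group.groupby(key)['amt'].sum()
    # Head of a descending sort = the n largest totals
    return totals.sort_values(ascending=False)[:n]

# Extra positional/keyword arguments are forwarded to the applied function
result = df.groupby('cand').apply(top_amounts, 'emp', n=2)
print(result)
```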
grouped.apply(get_top_amounts, 'contbr_employer', n=10)
Out[27]:
cand_nm        contbr_employer                   
Obama, Barack  SOLIYA                                3.0
               CARR ENTERPRISES                      3.0
               PENN STATE DICKINSON SCHOOL OF LAW    3.0
               CADUCEUS OCCUPATIONAL MEDICINE        3.0
               N.A.                                  3.0
               REAL ENERGY CONSULTING SERVICES       3.0
               JPDSYSTEMS, LLC                       3.0
               CASS REGIONAL MED. CENTER             2.5
               ARCON CORP                            2.0
               THE VICTORIA GROUP, INC.              2.0
Romney, Mitt   EASTHAM CAPITAL                       5.0
               GREGORY GALLIVAN                      5.0
               DIRECT LENDERS LLC                    5.0
               LOUGH INVESTMENT ADVISORY LLC         4.0
               WATERWORKS INDUSRTIES                 3.0
               WILL MERRIFIELD                       3.0
               HONOLD COMMUNICTAIONS                 3.0
               INDEPENDENT PROFESSIONAL              3.0
               UPTOWN CHEAPSKATE                     3.0
               UN                                    3.0
Name: contb_receipt_amt, dtype: float64

Bucketing the donation amounts

bins = np.array([0, 1, 10, 100, 1000, 10000, 100000, 1000000, 10000000])
# Use cut to discretize the donation amounts into bins
labels = pd.cut(fec_mrbo.contb_receipt_amt, bins)
labels
Out[28]:
411           (10, 100]
412         (100, 1000]
413         (100, 1000]
414           (10, 100]
415           (10, 100]
416           (10, 100]
417         (100, 1000]
418           (10, 100]
419         (100, 1000]
420           (10, 100]
421           (10, 100]
422         (100, 1000]
423         (100, 1000]
424         (100, 1000]
425         (100, 1000]
426         (100, 1000]
427       (1000, 10000]
428         (100, 1000]
429         (100, 1000]
430           (10, 100]
431       (1000, 10000]
432         (100, 1000]
433         (100, 1000]
434         (100, 1000]
435         (100, 1000]
436         (100, 1000]
437           (10, 100]
438         (100, 1000]
439         (100, 1000]
440           (10, 100]
              ...      
701356        (10, 100]
701357          (1, 10]
701358        (10, 100]
701359        (10, 100]
701360        (10, 100]
701361        (10, 100]
701362      (100, 1000]
701363        (10, 100]
701364        (10, 100]
701365        (10, 100]
701366        (10, 100]
701367        (10, 100]
701368      (100, 1000]
701369        (10, 100]
701370        (10, 100]
701371        (10, 100]
701372        (10, 100]
701373        (10, 100]
701374        (10, 100]
701375        (10, 100]
701376    (1000, 10000]
701377        (10, 100]
701378        (10, 100]
701379      (100, 1000]
701380    (1000, 10000]
701381        (10, 100]
701382      (100, 1000]
701383          (1, 10]
701384        (10, 100]
701385      (100, 1000]
Name: contb_receipt_amt, dtype: category
Categories (8, object): [(0, 1] < (1, 10] < (10, 100] < (100, 1000] < (1000, 10000] < (10000, 100000] < (100000, 1000000] < (1000000, 10000000]]
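A smaller `pd.cut` example makes the binning easier to see (toy amounts):

```python
import numpy as np
import pandas as pd

amounts = pd.Series([0.5, 5.0, 75.0, 250.0, 4000.0])
bins = np.array([0, 1, 10, 100, 1000, 10000])

# Each value is labeled with the half-open interval (left, right] it falls into
labels = pd.cut(amounts, bins)
print(labels)
```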
# Group the data by candidate name and bin label
grouped = fec_mrbo.groupby(['cand_nm', labels])
grouped.size().unstack(0)
Out[29]:
cand_nm Obama, Barack Romney, Mitt
contb_receipt_amt
(0, 1] 493.0 77.0
(1, 10] 40070.0 3681.0
(10, 100] 372280.0 31853.0
(100, 1000] 153991.0 43357.0
(1000, 10000] 22284.0 26186.0
(10000, 100000] 2.0 1.0
(100000, 1000000] 3.0 NaN
(1000000, 10000000] 4.0 NaN
# Sum the donations in each bin and reshape for plotting
bucket_sums = grouped.contb_receipt_amt.sum().unstack(0)
bucket_sums
Out[30]:
cand_nm Obama, Barack Romney, Mitt
contb_receipt_amt
(0, 1] 318.24 77.00
(1, 10] 337267.62 29819.66
(10, 100] 20288981.41 1987783.76
(100, 1000] 54798531.46 22363381.69
(1000, 10000] 51753705.67 63942145.42
(10000, 100000] 59100.00 12700.00
(100000, 1000000] 1490683.08 NaN
(1000000, 10000000] 7148839.76 NaN
normed_sums = bucket_sums.div(bucket_sums.sum(axis=1), axis=0)
normed_sums
Out[31]:
cand_nm Obama, Barack Romney, Mitt
contb_receipt_amt
(0, 1] 0.805182 0.194818
(1, 10] 0.918767 0.081233
(10, 100] 0.910769 0.089231
(100, 1000] 0.710176 0.289824
(1000, 10000] 0.447326 0.552674
(10000, 100000] 0.823120 0.176880
(100000, 1000000] 1.000000 NaN
(1000000, 10000000] 1.000000 NaN
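The row normalization via `div(..., axis=0)` can be sketched on its own; each row of the result sums to 1:

```python
import pandas as pd

sums = pd.DataFrame({'A': [80.0, 30.0], 'B': [20.0, 70.0]},
                    index=['small', 'large'])

# Divide each row by its row total (axis=0 aligns the divisor on the row index)
normed = sums.div(sums.sum(axis=1), axis=0)
print(normed)
```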
normed_sums[:-2].plot(kind='barh', stacked=True)
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x9fd8470>

Donation statistics by state

# Aggregate by candidate and state
grouped = fec_mrbo.groupby(['cand_nm', 'contbr_st'])
totals = grouped.contb_receipt_amt.sum().unstack(0).fillna(0)
totals = totals[totals.sum(1) > 100000]
totals[:10]
Out[33]:
cand_nm Obama, Barack Romney, Mitt
contbr_st
AK 281840.15 86204.24
AL 543123.48 527303.51
AR 359247.28 105556.00
AZ 1506476.98 1888436.23
CA 23824984.24 11237636.60
CO 2132429.49 1506714.12
CT 2068291.26 3499475.45
DC 4373538.80 1025137.50
DE 336669.14 82712.00
FL 7318178.58 8338458.81
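The aggregate-unstack-fillna chain above follows this shape (a sketch with made-up rows):

```python
import pandas as pd

df = pd.DataFrame({
    'cand': ['A', 'A', 'B'],
    'state': ['CA', 'NY', 'CA'],
    'amt': [100.0, 50.0, 200.0],
})

# Sum per (cand, state), pivot candidates into columns, fill missing combos with 0
totals = df.groupby(['cand', 'state'])['amt'].sum().unstack(0).fillna(0)
print(totals)
```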
percent = totals.div(totals.sum(1), axis=0)
percent[:10]
Out[34]:
cand_nm Obama, Barack Romney, Mitt
contbr_st
AK 0.765778 0.234222
AL 0.507390 0.492610
AR 0.772902 0.227098
AZ 0.443745 0.556255
CA 0.679498 0.320502
CO 0.585970 0.414030
CT 0.371476 0.628524
DC 0.810113 0.189887
DE 0.802776 0.197224
FL 0.467417 0.532583
# Draw a map of the results
from mpl_toolkits.basemap import Basemap, cm

# (omitted)