
Python for Data Analysis (7): Data Wrangling: Clean, Transform, Merge, Reshape

A large amount of the programming work in data analysis and modeling is spent on data preparation: loading, cleaning, transforming, and reshaping.

pandas and the Python standard library provide a set of high-level, flexible, and efficient core functions and algorithms
that make it easy to wrangle data into the right form.
Notably, much of pandas's functionality grew out of needs encountered in real-world applications.

Data wrangling mainly covers: combining datasets, reshaping and pivoting, data transformation, and string manipulation.

Combining Datasets

Reading notes on "Python for Data Analysis".

Chapter 7, Section 1: Combining Datasets

All of the data used here can be downloaded from the author's GitHub.

%pylab inline
import pandas as pd
from pandas import Series, DataFrame
import json
Populating the interactive namespace from numpy and matplotlib

pandas provides several built-in ways to combine data (merge is explored in depth below; a quick sketch of the other two follows this list):

  • pandas.merge(): connects rows in different DataFrames based on one or more keys. This is the same as a database join in SQL.
  • pandas.concat(): stacks multiple objects together along an axis.
  • DataFrame.combine_first(): splices overlapping data together, using values from one object to fill in missing values in another.
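
A quick, minimal sketch of concat and combine_first (using throwaway Series defined here, not data from the book):

import numpy as np
import pandas as pd
from pandas import Series

s1 = Series([0, 1], index=['a', 'b'])
s2 = Series([2, 3], index=['c', 'd'])
# concat stacks the two Series end to end along the row axis
pd.concat([s1, s2])        # a 0, b 1, c 2, d 3

s3 = Series([np.nan, 5.], index=['a', 'b'])
# combine_first takes s3's values, falling back to s1 where s3 is missing
s3.combine_first(s1)       # a 0.0, b 5.0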

Database-Style DataFrame Merges

Merge or join operations combine datasets by linking rows using one or more keys. These operations are central to relational databases.

merge() supports inner, left, right, and outer joins, specified via the how argument. The default is an inner join.

A many-to-many merge produces the Cartesian product of the matching rows: for each key value, every combination of the corresponding rows from the two objects. The join type only affects which key values appear in the result:

  • an inner join keeps only the key values present in both merge columns (the intersection)
  • a left join keeps all key values from the left object
  • a right join keeps all key values from the right object (demonstrated after the left-join example below)
  • an outer join keeps the union of the merge columns' key values, effectively combining the left and right joins
df1 = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                 'data1': range(7)})
df1
Out[12]:
data1 key
0 0 b
1 1 b
2 2 a
3 3 c
4 4 a
5 5 a
6 6 b
df2 = DataFrame({'key': ['a', 'b', 'd'],
                 'data2': range(3)})
df2
Out[13]:
data2 key
0 0 a
1 1 b
2 2 d
# By default, merge uses the overlapping column names as keys and keeps only the intersection of key values, discarding the rest (an inner join)
pd.merge(df1, df2)
Out[20]:
data1 key data2
0 0 b 1
1 1 b 1
2 6 b 1
3 2 a 0
4 4 a 0
5 5 a 0
# It is better to specify the merge column explicitly
pd.merge(df1, df2, on='key')
Out[15]:
data1 key data2
0 0 b 1
1 1 b 1
2 6 b 1
3 2 a 0
4 4 a 0
5 5 a 0
# When the key column names differ, they can be specified separately
df3 = DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                 'data1': range(7)})
df4 = DataFrame({'rkey': ['a', 'b', 'd'],
                 'data2': range(3)})
pd.merge(df3, df4, left_on='lkey', right_on='rkey')
Out[17]:
data1 lkey data2 rkey
0 0 b 1 b
1 1 b 1 b
2 6 b 1 b
3 2 a 0 a
4 4 a 0 a
5 5 a 0 a
# Outer join:
pd.merge(df1,df2,how = 'outer')
Out[19]:
data1 key data2
0 0.0 b 1.0
1 1.0 b 1.0
2 6.0 b 1.0
3 2.0 a 0.0
4 4.0 a 0.0
5 5.0 a 0.0
6 3.0 c NaN
7 NaN d 2.0
# New example frames for a many-to-many left join
df1 = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                 'data1': range(6)})
df1
Out[24]:
data1 key
0 0 b
1 1 b
2 2 a
3 3 c
4 4 a
5 5 b
df2 = DataFrame({'key': ['a', 'b', 'a', 'b', 'd'],
                 'data2': range(5)})
df2
Out[22]:
data2 key
0 0 a
1 1 b
2 2 a
3 3 b
4 4 d
# Left join
pd.merge(df1, df2, on='key', how='left')
Out[26]:
data1 key data2
0 0 b 1.0
1 0 b 3.0
2 1 b 1.0
3 1 b 3.0
4 2 a 0.0
5 2 a 2.0
6 3 c NaN
7 4 a 0.0
8 4 a 2.0
9 5 b 1.0
10 5 b 3.0
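For completeness (the bullet list above promised it), a right join of the same two frames keeps every key value from df2:

# Right join: all of df2's keys survive. 'd' has no match in df1, so its
# data1 comes out NaN, while 'c' (present only in df1) is dropped.
# 11 rows in total: 4 for 'a', 6 for 'b', 1 for 'd'.
pd.merge(df1, df2, on='key', how='right')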
# To merge on multiple keys, pass a list of column names
left = DataFrame({'key1': ['foo', 'foo', 'bar'],
                  'key2': ['one', 'two', 'one'],
                  'lval': [1, 2, 3]})
right = DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                   'key2': ['one', 'one', 'one', 'two'],
                   'rval': [4, 5, 6, 7]})
pd.merge(left, right, on=['key1', 'key2'], how='outer')
Out[27]:
key1 key2 lval rval
0 foo one 1.0 4.0
1 foo one 1.0 5.0
2 foo two 2.0 NaN
3 bar one 3.0 6.0
4 bar two NaN 7.0
# For overlapping non-key column names, pandas automatically appends suffixes
pd.merge(left,right,on = 'key1')
Out[30]:
key1 key2_x lval key2_y rval
0 foo one 1 one 4
1 foo one 1 one 5
2 foo two 2 one 4
3 foo two 2 one 5
4 bar one 3 one 6
5 bar one 3 two 7
# The suffixes option specifies the strings to append
pd.merge(left, right, on='key1', suffixes=('_left', '_right'))
Out[31]:
key1 key2_left lval key2_right rval
0 foo one 1 one 4
1 foo one 1 one 5
2 foo two 2 one 4
3 foo two 2 one 5
4 bar one 3 one 6
5 bar one 3 two 7

Merging on Index

When the merge key(s) are found in the index, pass left_index=True and/or right_index=True to use the index as the merge key.

left1 = DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
                  'value': range(6)})
left1
Out[32]:
key value
0 a 0
1 b 1
2 a 2
3 a 3
4 b 4
5 c 5
right1 = DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])
right1
Out[33]:
group_val
a 3.5
b 7.0
pd.merge(left1, right1, left_on='key', right_index=True)
Out[34]:
key value group_val
0 a 0 3.5
2 a 2 3.5
3 a 3 3.5
1 b 1 7.0
4 b 4 7.0
pd.merge(left1, right1, left_on='key', right_index=True, how='outer')
Out[35]:
key value group_val
0 a 0 3.5
2 a 2 3.5
3 a 3 3.5
1 b 1 7.0
4 b 4 7.0
5 c 5 NaN
# With hierarchically indexed data
lefth = DataFrame({'key1': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
                   'key2': [2000, 2001, 2002, 2001, 2002],
                   'data': np.arange(5.)})
lefth
Out[36]:
data key1 key2
0 0.0 Ohio 2000
1 1.0 Ohio 2001
2 2.0 Ohio 2002
3 3.0 Nevada 2001
4 4.0 Nevada 2002
righth = DataFrame(np.arange(12).reshape((6, 2)),
                   index=[['Nevada', 'Nevada', 'Ohio', 'Ohio', 'Ohio', 'Ohio'],
                          [2001, 2000, 2000, 2000, 2001, 2002]],
                   columns=['event1', 'event2'])
righth
Out[37]:
event1 event2
Nevada 2001 0 1
2000 2 3
Ohio 2000 4 5
2000 6 7
2001 8 9
2002 10 11
# In this case, you must indicate the columns to merge on as a list (note the handling of duplicate index values)
# Note that the result's index comes from the left object
pd.merge(lefth, righth, left_on=['key1', 'key2'], right_index=True)
Out[39]:
data key1 key2 event1 event2
0 0.0 Ohio 2000 4 5
0 0.0 Ohio 2000 6 7
1 1.0 Ohio 2001 8 9
2 2.0 Ohio 2002 10 11
3 3.0 Nevada 2001 0 1
# Using the indexes of both sides of the merge
left2 = DataFrame([[1., 2.], [3., 4.], [5., 6.]], index=['a', 'c', 'e'],
                 columns=['Ohio', 'Nevada'])
right2 = DataFrame([[7., 8.], [9., 10.], [11., 12.], [13, 14]],
                   index=['b', 'c', 'd', 'e'], columns=['Missouri', 'Alabama'])
pd.merge(left2, right2, how='outer', left_index=True, right_index=True)
Out[40]:
Ohio Nevada Missouri Alabama
a 1.0 2.0 NaN NaN
b NaN NaN 7.0 8.0
c 3.0 4.0 9.0 10.0
d NaN NaN 11.0 12.0
e 5.0 6.0 13.0 14.0
# DataFrame.join() is a more convenient way to merge by index
# It can also combine several DataFrames having the same or similar indexes, regardless of overlapping columns
# DataFrame's join method performs a left join on the join keys by default
left2.join(right2, how='outer')
Out[44]:
Ohio Nevada Missouri Alabama
a 1.0 2.0 NaN NaN
b NaN NaN 7.0 8.0
c 3.0 4.0 9.0 10.0
d NaN NaN 11.0 12.0
e 5.0 6.0 13.0 14.0
left1.join(right1, on='key')
Out[43]:
key value group_val
0 a 0 3.5
1 b 1 7.0
2 a 2 3.5
3 a 3 3.5
4 b 4 7.0
5 c 5 NaN
# join() also supports joining the passed DataFrame's index against a column of the calling DataFrame (much like merge's left_index-style parameters)
another = DataFrame([[7., 8.], [9., 10.], [11., 12.], [16., 17.]],
                    index=['a', 'c', 'e', 'f'], columns=['New York', 'Oregon'])
another
Out[45]:
New York Oregon
a 7.0 8.0
c 9.0 10.0
e 11.0 12.0
f 16.0 17.0
left2.join([right2, another])
Out[46]:
Ohio Nevada Missouri Alabama New York Oregon
a 1.0 2.0 NaN NaN 7.0 8.0
c 3.0 4.0 9.0 10.0 9.0 10.0
e 5.0 6.0 13.0 14.0 11.0 12.0
# 对于简单的索引合并,还可以向join传入多个DataFrame
left2.join([right2, another], how='outer')
Out[47]:
Ohio Nevada Missouri Alabama New York Oregon
a 1.0 2.0 NaN NaN 7.0 8.0
b NaN NaN 7.0 8.0 NaN NaN
c 3.0 4.0 9.0 10.0 9.0 10.0
d NaN NaN 11.0 12.0 NaN NaN
e 5.0 6.0 13.0 14.0 11.0 12.0
f NaN NaN NaN NaN 16.0 17.0

Combining Data with Missing Values

combine_first patches missing values in the calling object with values from the object passed in, aligning the data by index; element-wise, it amounts to an if-else that prefers the non-missing value.

a = Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
           index=['f', 'e', 'd', 'c', 'b', 'a'])
a
Out[3]:
f    NaN
e    2.5
d    NaN
c    3.5
b    4.5
a    NaN
dtype: float64
b = Series(np.arange(len(a), dtype=np.float64),
           index=['f', 'e', 'd', 'c', 'b', 'a'])
b[-1] = np.nan
b
Out[4]:
f    0.0
e    1.0
d    2.0
c    3.0
b    4.0
a    NaN
dtype: float64
np.where(pd.isnull(a), b, a)
Out[5]:
array([ 0. ,  2.5,  2. ,  3.5,  4.5,  nan])
b[:-2].combine_first(a[2:])
Out[6]:
a    NaN
b    4.5
c    3.0
d    2.0
e    1.0
f    0.0
dtype: float64
df1 = DataFrame({'a': [1., np.nan, 5., np.nan],
                 'b': [np.nan, 2., np.nan, 6.],
                 'c': range(2, 18, 4)})
df2 = DataFrame({'a': [5., 4., np.nan, 3., 7.],
                 'b': [np.nan, 3., 4., 6., 8.]})
df1.combine_first(df2)
Out[7]:
a b c
0 1.0 NaN 2.0
1 4.0 2.0 6.0
2 5.0 4.0 10.0
3 3.0 6.0 14.0
4 7.0 8.0 NaN
 

Reshaping and Pivoting

Reading notes on "Python for Data Analysis".

Chapter 7, Section 2: Reshaping and Pivoting

All of the data used here can be downloaded from the author's GitHub.

%pylab inline
import pandas as pd
from pandas import Series, DataFrame
import json
Populating the interactive namespace from numpy and matplotlib

pandas has a number of operations for rearranging tabular data, known as reshape and pivot operations.

Reshaping with Hierarchical Indexing

stack: "rotates" the columns of the data into the rows

unstack: "rotates" the rows of the data into the columns

data = DataFrame(np.arange(6).reshape((2, 3)),
                 index=pd.Index(['Ohio', 'Colorado'], name='state'),
                 columns=pd.Index(['one', 'two', 'three'], name='number'))
data
Out[10]:
number one two three
state
Ohio 0 1 2
Colorado 3 4 5
result = data.stack()
result
Out[11]:
state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int64
result.unstack()
Out[12]:
number one two three
state
Ohio 0 1 2
Colorado 3 4 5
# By default, unstack operates on the innermost level; to unstack a different level, pass a level number or name (the outermost level is number 0)

result.unstack(0)
Out[16]:
state Ohio Colorado
number
one 0 3
two 1 4
three 2 5
# The level can also be specified by name
result.unstack('state')
Out[17]:
state Ohio Colorado
number
one 0 3
two 1 4
three 2 5
# Now a case with missing data: unstack() introduces NaN for missing combinations
s1 = Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
s2 = Series([4, 5, 6], index=['c', 'd', 'e'])
data2 = pd.concat([s1, s2], keys=['one', 'two'])
data2.unstack()
Out[18]:
a b c d e
one 0.0 1.0 2.0 3.0 NaN
two NaN NaN 4.0 5.0 6.0
# stack filters out missing data by default
data2.unstack().stack()
Out[20]:
one  a    0.0
     b    1.0
     c    2.0
     d    3.0
two  c    4.0
     d    5.0
     e    6.0
dtype: float64
# Keep the missing values
data2.unstack().stack(dropna=False)
Out[21]:
one  a    0.0
     b    1.0
     c    2.0
     d    3.0
     e    NaN
two  a    NaN
     b    NaN
     c    4.0
     d    5.0
     e    6.0
dtype: float64
# When you unstack a DataFrame, the level unstacked becomes the lowest (innermost) level in the result
df = DataFrame({'left': result, 'right': result + 5},
               columns=pd.Index(['left', 'right'], name='side'))
df
Out[22]:
side left right
state number
Ohio one 0 5
two 1 6
three 2 7
Colorado one 3 8
two 4 9
three 5 10
df.unstack('state')
Out[23]:
side left right
state Ohio Colorado Ohio Colorado
number
one 0 3 5 8
two 1 4 6 9
three 2 5 7 10
df.unstack('state').stack('side')
Out[24]:
state Ohio Colorado
number side
one left 0 3
right 5 8
two left 1 4
right 6 9
three left 2 5
right 7 10

pivot: converting "long" format to "wide" format
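
The example below uses macrodata.csv from the book's repository. If that file isn't at hand, here is a minimal self-contained sketch of the same long-to-wide conversion (the frame and its values are made up):

long_df = DataFrame({'date': ['2000Q1', '2000Q1', '2000Q2', '2000Q2'],
                     'item': ['gdp', 'unemp', 'gdp', 'unemp'],
                     'value': [100.0, 5.0, 102.0, 4.9]})
# Each (date, item) pair becomes one cell of a date x item table
long_df.pivot(index='date', columns='item', values='value')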

data = pd.read_csv('data/ch07/macrodata.csv')
periods = pd.PeriodIndex(year=data.year, quarter=data.quarter, name='date')
data = DataFrame(data.to_records(),
                 columns=pd.Index(['realgdp', 'infl', 'unemp'], name='item'),
                 index=periods.to_timestamp('D', 'end'))

ldata = data.stack().reset_index().rename(columns={0: 'value'})
ldata[:10]
Out[32]:
date item value
0 1959-03-31 realgdp 2710.349
1 1959-03-31 infl 0.000
2 1959-03-31 unemp 5.800
3 1959-06-30 realgdp 2778.801
4 1959-06-30 infl 2.340
5 1959-06-30 unemp 5.100
6 1959-09-30 realgdp 2775.488
7 1959-09-30 infl 2.740
8 1959-09-30 unemp 5.300
9 1959-12-31 realgdp 2785.204
# Use date and item as the row and column labels, filling value into the two-dimensional table
# (newer pandas requires keyword arguments: ldata.pivot(index='date', columns='item', values='value'))
pivoted = ldata.pivot('date', 'item', 'value')
pivoted.head()
Out[33]:
item infl realgdp unemp
date
1959-03-31 0.00 2710.349 5.8
1959-06-30 2.34 2778.801 5.1
1959-09-30 2.74 2775.488 5.3
1959-12-31 0.27 2785.204 5.6
1960-03-31 2.31 2847.699 5.2
ldata['value2'] = np.random.randn(len(ldata))
ldata[:10]
Out[34]:
date item value value2
0 1959-03-31 realgdp 2710.349 -0.535021
1 1959-03-31 infl 0.000 1.783525
2 1959-03-31 unemp 5.800 2.835291
3 1959-06-30 realgdp 2778.801 -1.444524
4 1959-06-30 infl 2.340 1.728538
5 1959-06-30 unemp 5.100 1.100782
6 1959-09-30 realgdp 2775.488 -1.371209
7 1959-09-30 infl 2.740 1.069021
8 1959-09-30 unemp 5.300 -0.658462
9 1959-12-31 realgdp 2785.204 -2.165827
pivoted = ldata.pivot('date', 'item')
pivoted[:5]
Out[35]:
value value2
item infl realgdp unemp infl realgdp unemp
date
1959-03-31 0.00 2710.349 5.8 1.783525 -0.535021 2.835291
1959-06-30 2.34 2778.801 5.1 1.728538 -1.444524 1.100782
1959-09-30 2.74 2775.488 5.3 1.069021 -1.371209 -0.658462
1959-12-31 0.27 2785.204 5.6 -0.322622 -2.165827 -1.525572
1960-03-31 2.31 2847.699 5.2 -1.386987 -0.456043 -0.392422
pivoted['value'][:5]
Out[36]:
item infl realgdp unemp
date
1959-03-31 0.00 2710.349 5.8
1959-06-30 2.34 2778.801 5.1
1959-09-30 2.74 2775.488 5.3
1959-12-31 0.27 2785.204 5.6
1960-03-31 2.31 2847.699 5.2
# pivot is really just a shortcut: it builds a hierarchical index with set_index and then reshapes with unstack
unstacked = ldata.set_index(['date', 'item']).unstack('item')
unstacked[:7]
Out[37]:
value value2
item infl realgdp unemp infl realgdp unemp
date
1959-03-31 0.00 2710.349 5.8 1.783525 -0.535021 2.835291
1959-06-30 2.34 2778.801 5.1 1.728538 -1.444524 1.100782
1959-09-30 2.74 2775.488 5.3 1.069021 -1.371209 -0.658462
1959-12-31 0.27 2785.204 5.6 -0.322622 -2.165827 -1.525572
1960-03-31 2.31 2847.699 5.2 -1.386987 -0.456043 -0.392422
1960-06-30 0.14 2834.390 5.2 -2.086858 0.316907 -1.492590
1960-09-30 2.70 2839.022 5.6 1.509653 -0.776808 0.520116
 

Data Transformation

Reading notes on "Python for Data Analysis".

Chapter 7, Section 3: Data Transformation

All of the data used here can be downloaded from the author's GitHub.

%pylab inline
import pandas as pd
from pandas import Series, DataFrame
import json
Populating the interactive namespace from numpy and matplotlib

With rearranging covered, we turn to filtering, cleaning, and other data transformations.

Removing Duplicates

data = DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                  'k2': [1, 1, 2, 3, 3, 4, 4]})
data
Out[2]:
k1 k2
0 one 1
1 one 1
2 one 2
3 two 3
4 two 3
5 two 4
6 two 4
data.duplicated()
Out[3]:
0    False
1     True
2    False
3    False
4     True
5    False
6     True
dtype: bool
# drop_duplicates returns a DataFrame with the duplicate rows removed; this is very commonly used
data.drop_duplicates()
Out[9]:
k1 k2
0 one 1
2 one 2
3 two 3
5 two 4
# You can specify a subset of columns on which to detect duplicates
data['v1'] = range(7)
data.drop_duplicates(['k1'])
Out[10]:
k1 k2 v1
0 one 1 0
3 two 3 3
# By default the first observed row is kept; keep='last' keeps the last one instead
data.drop_duplicates(['k1', 'k2'], keep='last')
Out[11]:
k1 k2 v1
1 one 1 1
2 one 2 2
4 two 3 4
6 two 4 6
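
drop_duplicates also accepts keep=False, which drops every row whose key combination occurs more than once; with the data above, only the unique ('one', 2) row survives:

# keep=False removes all copies of any duplicated (k1, k2) pair
data.drop_duplicates(['k1', 'k2'], keep=False)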

Transforming Data Using a Function or Mapping

This lets you perform transformations based on the values in an array, Series, or DataFrame column.

data = DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'Pastrami',
                           'corned beef', 'Bacon', 'pastrami', 'honey ham',
                           'nova lox'],
                  'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data
Out[12]:
food ounces
0 bacon 4.0
1 pulled pork 3.0
2 bacon 12.0
3 Pastrami 6.0
4 corned beef 7.5
5 Bacon 8.0
6 pastrami 3.0
7 honey ham 5.0
8 nova lox 6.0
# A mapping from each meat type to the kind of animal it comes from
meat_to_animal = {
  'bacon': 'pig',
  'pulled pork': 'pig',
  'pastrami': 'cow',
  'corned beef': 'cow',
  'honey ham': 'pig',
  'nova lox': 'salmon'
}
# Series.map accepts a function or a dict-like object containing a mapping
# To make the mapping match, first normalize the case; this pattern is very common
data['animal'] = data['food'].map(str.lower).map(meat_to_animal)
data
Out[18]:
food ounces animal
0 bacon 4.0 pig
1 pulled pork 3.0 pig
2 bacon 12.0 pig
3 Pastrami 6.0 cow
4 corned beef 7.5 cow
5 Bacon 8.0 pig
6 pastrami 3.0 cow
7 honey ham 5.0 pig
8 nova lox 6.0 salmon
# The same thing can be done by passing a single function
data['food'].map(lambda x: meat_to_animal[x.lower()])
Out[19]:
0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object

Replacing Values

The replace() method offers a simpler way to substitute values.

data = Series([1., -999., 2., -999., -1000., 3.])
data
Out[20]:
0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64
data.replace(-999, np.nan)
Out[21]:
0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64
data.replace([-999, -1000], np.nan)
Out[22]:
0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64
data.replace([-999, -1000], [np.nan, 0])
Out[23]:
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64
data.replace({-999: np.nan, -1000: 0})
Out[24]:
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

Renaming Axis Indexes

Like values, axis labels can be transformed with a map() function.

data = DataFrame(np.arange(12).reshape((3, 4)),
                 index=['Ohio', 'Colorado', 'New York'],
                 columns=['one', 'two', 'three', 'four'])
data.index.map(str.upper)
Out[26]:
array(['OHIO', 'COLORADO', 'NEW YORK'], dtype=object)
data.index = data.index.map(str.upper)
data
Out[27]:
one two three four
OHIO 0 1 2 3
COLORADO 4 5 6 7
NEW YORK 8 9 10 11
# For axes, the analogue of replace() is rename()
data.rename(index=str.title, columns=str.upper)
Out[28]:
ONE TWO THREE FOUR
Ohio 0 1 2 3
Colorado 4 5 6 7
New York 8 9 10 11
data.rename(index={'OHIO': 'INDIANA'},
            columns={'three': 'peekaboo'})
Out[29]:
one two peekaboo four
INDIANA 0 1 2 3
COLORADO 4 5 6 7
NEW YORK 8 9 10 11
# inplace=True modifies the data in place instead of creating a new structure
_ = data.rename(index={'OHIO': 'INDIANA'}, inplace=True)
data
Out[31]:
one two three four
INDIANA 0 1 2 3
COLORADO 4 5 6 7
NEW YORK 8 9 10 11

Discretization and Binning

Continuous data is often discretized or otherwise split into bins (groups) to ease analysis.

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
# Use cut() to divide ages into bins: 18-25, 25-35, and so on
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats
Out[38]:
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, object): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
# The return value is a special Categorical object; you can treat it as an array of strings naming the bins.
# It contains a categories array of the distinct bin names and a codes attribute.
cats.codes
Out[39]:
array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
cats.categories
Out[40]:
Index(['(18, 25]', '(25, 35]', '(35, 60]', '(60, 100]'], dtype='object')
pd.value_counts(cats)
Out[41]:
(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64
# Intervals are open on the left and closed on the right by default; pass right=False to close them on the left and open them on the right
pd.cut(ages, [18, 26, 36, 61, 100], right=False)
Out[43]:
[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, object): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
# Specify your own group labels (bin names)
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
pd.cut(ages, bins, labels=group_names)
Out[46]:
[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]
# Passing an integer number of bins computes equal-length bins from the data's min and max (not equal-count bins); precision=2 limits the bin edges to 2 decimal places
data = np.random.rand(20)
pd.cut(data, 4, precision=2)
Out[48]:
[(0.74, 0.98], (0.49, 0.74], (0.25, 0.49], (0.74, 0.98], (0.25, 0.49], ..., (0.00071, 0.25], (0.00071, 0.25], (0.74, 0.98], (0.00071, 0.25], (0.74, 0.98]]
Length: 20
Categories (4, object): [(0.00071, 0.25] < (0.25, 0.49] < (0.49, 0.74] < (0.74, 0.98]]
# qcut bins the data based on sample quantiles; here, cut into quartiles
data = np.random.randn(1000) # Normally distributed
cats = pd.qcut(data, 4) # Cut into quartiles
cats
Out[49]:
[(0.0103, 0.682], [-3.378, -0.647], (0.0103, 0.682], (-0.647, 0.0103], (0.0103, 0.682], ..., (0.0103, 0.682], (-0.647, 0.0103], (-0.647, 0.0103], [-3.378, -0.647], (0.0103, 0.682]]
Length: 1000
Categories (4, object): [[-3.378, -0.647] < (-0.647, 0.0103] < (0.0103, 0.682] < (0.682, 3.119]]
pd.value_counts(cats)
Out[51]:
(0.682, 3.119]      250
(0.0103, 0.682]     250
(-0.647, 0.0103]    250
[-3.378, -0.647]    250
dtype: int64
# Bin according to your own quantiles
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])
Out[52]:
[(0.0103, 1.294], [-3.378, -1.276], (0.0103, 1.294], (-1.276, 0.0103], (0.0103, 1.294], ..., (0.0103, 1.294], (-1.276, 0.0103], (-1.276, 0.0103], (-1.276, 0.0103], (0.0103, 1.294]]
Length: 1000
Categories (4, object): [[-3.378, -1.276] < (-1.276, 0.0103] < (0.0103, 1.294] < (1.294, 3.119]]

Detecting and Filtering Outliers

Filtering or transforming outliers is largely a matter of applying array operations.

np.random.seed(12345)
data = DataFrame(np.random.randn(1000, 4))
data.describe()
Out[53]:
0 1 2 3
count 1000.000000 1000.000000 1000.000000 1000.000000
mean -0.067684 0.067924 0.025598 -0.002298
std 0.998035 0.992106 1.006835 0.996794
min -3.428254 -3.548824 -3.184377 -3.745356
25% -0.774890 -0.591841 -0.641675 -0.644144
50% -0.116401 0.101143 0.002073 -0.013611
75% 0.616366 0.780282 0.680391 0.654328
max 3.366626 2.653656 3.260383 3.927528
# Rows of the fourth column whose absolute value exceeds 3
col = data[3]
col[np.abs(col) > 3]
Out[55]:
97     3.927528
305   -3.399312
400   -3.745356
Name: 3, dtype: float64
# All rows containing a value exceeding 3 or -3
data[(np.abs(data) > 3).any(1)]
Out[56]:
0 1 2 3
5 -0.539741 0.476985 3.248944 -1.021228
97 -0.774363 0.552936 0.106061 3.927528
102 -0.655054 -0.565230 3.176873 0.959533
305 -2.315555 0.457246 -0.025907 -3.399312
324 0.050188 1.951312 3.260383 0.963301
400 0.146326 0.508391 -0.196713 -3.745356
499 -0.293333 -0.242459 -3.056990 1.918403
523 -3.428254 -0.296336 -0.439938 -0.867165
586 0.275144 1.179227 -3.184377 1.369891
808 -0.362528 -3.548824 1.553205 -2.186301
900 3.366626 -2.372214 0.851010 1.332846
# Cap values outside the interval [-3, 3]
data[np.abs(data) > 3] = np.sign(data) * 3
data.describe()
Out[58]:
0 1 2 3
count 1000.000000 1000.000000 1000.000000 1000.000000
mean -0.067623 0.068473 0.025153 -0.002081
std 0.995485 0.990253 1.003977 0.989736
min -3.000000 -3.000000 -3.000000 -3.000000
25% -0.774890 -0.591841 -0.641675 -0.644144
50% -0.116401 0.101143 0.002073 -0.013611
75% 0.616366 0.780282 0.680391 0.654328
max 3.000000 2.653656 3.000000 3.000000
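
If the goal is simply symmetric capping at +/-3, the same result can be obtained more directly with clip (a sketch, not from the book):

# clip bounds every value to the interval [-3, 3] in one call
data.clip(-3, 3).describe()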

Permutation and Random Sampling

df = DataFrame(np.arange(5 * 4).reshape((5, 4)))
df
Out[59]:
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
4 16 17 18 19
# Returns a random permutation of the integers 0 through 4
sampler = np.random.permutation(5)
sampler
Out[62]:
array([1, 3, 0, 2, 4])
# Use that array in ix-based indexing or with the take function (newer pandas drops ix; iloc works the same way here)
df.take(sampler)
Out[63]:
0 1 2 3
1 4 5 6 7
3 12 13 14 15
0 0 1 2 3
2 8 9 10 11
4 16 17 18 19
# Truncate to select a random subset without replacement
df.take(np.random.permutation(len(df))[:3])
Out[64]:
0 1 2 3
1 4 5 6 7
0 0 1 2 3
4 16 17 18 19
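# To sample with replacement instead, draw random integer positions into the array of choices: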
bag = np.array([5, 7, -1, 6, 4])
sampler = np.random.randint(0, len(bag), size=10)
sampler
Out[66]:
array([3, 0, 4, 1, 1, 2, 3, 0, 1, 2])
draws = bag.take(sampler)
draws
Out[67]:
array([ 6,  5,  4,  7,  7, -1,  6,  5,  7, -1])

Computing Indicator/Dummy Variables

Another transformation commonly used for statistical modeling or machine learning is converting a categorical variable into a "dummy matrix" or "indicator matrix".

If a column in a DataFrame has k distinct values, you can derive a matrix or DataFrame with k columns of 1s and 0s.

This approach comes up again in the map example of the next chapter (Chapter 8).

df = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                'data1': range(6)})
df
Out[70]:
data1 key
0 0 b
1 1 b
2 2 a
3 3 c
4 4 a
5 5 b
# Get the dummy-variable DataFrame
pd.get_dummies(df['key'])
Out[71]:
a b c
0 0 1 0
1 0 1 0
2 1 0 0
3 0 0 1
4 1 0 0
5 0 1 0
# Add a prefix to the indicator columns
dummies = pd.get_dummies(df['key'], prefix='key')
df_with_dummy = df[['data1']].join(dummies)
df_with_dummy
Out[72]:
data1 key_a key_b key_c
0 0 0 1 0
1 1 0 1 0
2 2 1 0 0
3 3 0 0 1
4 4 1 0 0
5 5 0 1 0
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('data/ch02/movielens/movies.dat', sep='::', header=None,
                        names=mnames,engine='python')
movies[:10]
Out[75]:
movie_id title genres
0 1 Toy Story (1995) Animation|Children's|Comedy
1 2 Jumanji (1995) Adventure|Children's|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama
4 5 Father of the Bride Part II (1995) Comedy
5 6 Heat (1995) Action|Crime|Thriller
6 7 Sabrina (1995) Comedy|Romance
7 8 Tom and Huck (1995) Adventure|Children's
8 9 Sudden Death (1995) Action
9 10 GoldenEye (1995) Action|Adventure|Thriller
# Split the `|`-delimited genres and collect the distinct genre names
genre_iter = (set(x.split('|')) for x in movies.genres)
genres = sorted(set.union(*genre_iter))
genres
Out[77]:
['Action',
 'Adventure',
 'Animation',
 "Children's",
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Fantasy',
 'Film-Noir',
 'Horror',
 'Musical',
 'Mystery',
 'Romance',
 'Sci-Fi',
 'Thriller',
 'War',
 'Western']
# Build an all-zeros DataFrame, one column per genre
dummies = DataFrame(np.zeros((len(movies), len(genres))), columns=genres)
dummies.head()
Out[83]:
Action Adventure Animation Children's Comedy Crime Documentary Drama Fantasy Film-Noir Horror Musical Mystery Romance Sci-Fi Thriller War Western
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
# Set the indicator flags row by row
# (ix is removed in newer pandas; use dummies.loc[i, gen.split('|')] = 1 there)
for i, gen in enumerate(movies.genres):
    dummies.ix[i, gen.split('|')] = 1
# Join the indicators back onto movies
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic
Out[86]:
movie_id title genres Genre_Action Genre_Adventure Genre_Animation Genre_Children's Genre_Comedy Genre_Crime Genre_Documentary ... Genre_Fantasy Genre_Film-Noir Genre_Horror Genre_Musical Genre_Mystery Genre_Romance Genre_Sci-Fi Genre_Thriller Genre_War Genre_Western
0 1 Toy Story (1995) Animation|Children's|Comedy 0.0 0.0 1.0 1.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 2 Jumanji (1995) Adventure|Children's|Fantasy 0.0 1.0 0.0 1.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 3 Grumpier Old Men (1995) Comedy|Romance 0.0 0.0 0.0 0.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
3 4 Waiting to Exhale (1995) Comedy|Drama 0.0 0.0 0.0 0.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 5 Father of the Bride Part II (1995) Comedy 0.0 0.0 0.0 0.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 6 Heat (1995) Action|Crime|Thriller 1.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
6 7 Sabrina (1995) Comedy|Romance 0.0 0.0 0.0 0.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
7 8 Tom and Huck (1995) Adventure|Children's 0.0 1.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
8 9 Sudden Death (1995) Action 1.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9 10 GoldenEye (1995) Action|Adventure|Thriller 1.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
10 11 American President, The (1995) Comedy|Drama|Romance 0.0 0.0 0.0 0.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
11 12 Dracula: Dead and Loving It (1995) Comedy|Horror 0.0 0.0 0.0 0.0 1.0 0.0 0.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
12 13 Balto (1995) Animation|Children's 0.0 0.0 1.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
13 14 Nixon (1995) Drama 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
14 15 Cutthroat Island (1995) Action|Adventure|Romance 1.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
15 16 Casino (1995) Drama|Thriller 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
16 17 Sense and Sensibility (1995) Drama|Romance 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
17 18 Four Rooms (1995) Thriller 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
18 19 Ace Ventura: When Nature Calls (1995) Comedy 0.0 0.0 0.0 0.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
19 20 Money Train (1995) Action 1.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
20 21 Get Shorty (1995) Action|Comedy|Drama 1.0 0.0 0.0 0.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
21 22 Copycat (1995) Crime|Drama|Thriller 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
22 23 Assassins (1995) Thriller 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
23 24 Powder (1995) Drama|Sci-Fi 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
24 25 Leaving Las Vegas (1995) Drama|Romance 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
25 26 Othello (1995) Drama 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
26 27 Now and Then (1995) Drama 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
27 28 Persuasion (1995) Romance 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
28 29 City of Lost Children, The (1995) Adventure|Sci-Fi 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
29 30 Shanghai Triad (Yao a yao yao dao waipo qiao) ... Drama 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3853 3923 Return of the Fly (1959) Horror|Sci-Fi 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
3854 3924 Pajama Party (1964) Comedy 0.0 0.0 0.0 0.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3855 3925 Stranger Than Paradise (1984) Comedy 0.0 0.0 0.0 0.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3856 3926 Voyage to the Bottom of the Sea (1961) Adventure|Sci-Fi 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
3857 3927 Fantastic Voyage (1966) Adventure|Sci-Fi 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
3858 3928 Abbott and Costello Meet Frankenstein (1948) Comedy|Horror 0.0 0.0 0.0 0.0 1.0 0.0 0.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3859 3929 Bank Dick, The (1940) Comedy 0.0 0.0 0.0 0.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3860 3930 Creature From the Black Lagoon, The (1954) Horror 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3861 3931 Giant Gila Monster, The (1959) Horror|Sci-Fi 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
3862 3932 Invisible Man, The (1933) Horror|Sci-Fi 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
3863 3933 Killer Shrews, The (1959) Horror|Sci-Fi 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
3864 3934 Kronos (1957) Sci-Fi 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
3865 3935 Kronos (1973) Horror 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3866 3936 Phantom of the Opera, The (1943) Drama|Thriller 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
3867 3937 Runaway (1984) Sci-Fi|Thriller 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0
3868 3938 Slumber Party Massacre, The (1982) Horror 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3869 3939 Slumber Party Massacre II, The (1987) Horror 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3870 3940 Slumber Party Massacre III, The (1990) Horror 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3871 3941 Sorority House Massacre (1986) Horror 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3872 3942 Sorority House Massacre II (1990) Horror 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3873 3943 Bamboozled (2000) Comedy 0.0 0.0 0.0 0.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3874 3944 Bootmen (2000) Comedy|Drama 0.0 0.0 0.0 0.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3875 3945 Digimon: The Movie (2000) Adventure|Animation|Children's 0.0 1.0 1.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3876 3946 Get Carter (2000) Action|Drama|Thriller 1.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
3877 3947 Get Carter (1971) Thriller 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
3878 3948 Meet the Parents (2000) Comedy 0.0 0.0 0.0 0.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3879 3949 Requiem for a Dream (2000) Drama 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3880 3950 Tigerland (2000) Drama 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3881 3951 Two Family House (2000) Drama 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3882 3952 Contender, The (2000) Drama|Thriller 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0

3883 rows × 21 columns

movies_windic.ix[0]
Out[88]:
movie_id                                       1
title                           Toy Story (1995)
genres               Animation|Children's|Comedy
Genre_Action                                   0
Genre_Adventure                                0
Genre_Animation                                1
Genre_Children's                               1
Genre_Comedy                                   1
Genre_Crime                                    0
Genre_Documentary                              0
Genre_Drama                                    0
Genre_Fantasy                                  0
Genre_Film-Noir                                0
Genre_Horror                                   0
Genre_Musical                                  0
Genre_Mystery                                  0
Genre_Romance                                  0
Genre_Sci-Fi                                   0
Genre_Thriller                                 0
Genre_War                                      0
Genre_Western                                  0
Name: 0, dtype: object
# For much bigger data, building the indicators this way is quite slow; a lower-level function that exploits the DataFrame internals would be needed
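These days pandas itself provides such a shortcut for the delimiter-separated strings above: Series.str.get_dummies (added well after the book was written) splits on a separator and builds the 0/1 indicator matrix in one vectorized call. A sketch equivalent in effect to the loop above:

# Split each genres string on '|' and build the indicator matrix directly
dummies = movies.genres.str.get_dummies('|')
movies_windic = movies.join(dummies.add_prefix('Genre_'))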
# A useful recipe for statistical applications: combine get_dummies with a discretization function such as cut
np.random.seed(12345)
values = np.random.rand(10)
values
Out[90]:
array([ 0.92961609,  0.31637555,  0.18391881,  0.20456028,  0.56772503,
        0.5955447 ,  0.96451452,  0.6531771 ,  0.74890664,  0.65356987])
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
pd.get_dummies(pd.cut(values, bins))
Out[91]:
(0, 0.2] (0.2, 0.4] (0.4, 0.6] (0.6, 0.8] (0.8, 1]
0 0 0 0 0 1
1 0 1 0 0 0
2 1 0 0 0 0
3 0 1 0 0 0
4 0 0 1 0 0
5 0 0 1 0 0
6 0 0 0 0 1
7 0 0 0 1 0
8 0 0 0 1 0
9 0 0 0 1 0
 

String Manipulation

Reading notes on "Python for Data Analysis".

Chapter 7, Section 4: String Manipulation

All of the data used here can be downloaded from the author's GitHub.

%pylab inline
import pandas as pd
from pandas import Series, DataFrame
Populating the interactive namespace from numpy and matplotlib

Python has simple, easy-to-use string and text processing facilities. Most text operations are built into string objects as methods, and regular expressions handle the more complex cases. pandas strengthens this further: string and regular-expression methods can be applied concisely to whole arrays of data, and the annoyance of missing data is handled gracefully.

String Object Methods

For most string munging, the built-in methods are sufficient.

Python's main string methods include (a few that the transcript below doesn't demonstrate are shown right after this list):

  • count
  • endswith, startswith
  • join
  • index
  • find
  • rfind
  • replace
  • strip, rstrip, lstrip
  • split
  • lower, upper
  • ljust, rjust
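
A few of the listed methods don't come up in the transcript below; for quick reference (return values shown in the comments):

val = 'a,b,  guido'
val.startswith('a')    # True
val.rfind(',')         # 3, the index of the last comma
val.upper()            # 'A,B,  GUIDO'
val.ljust(15, '*')     # 'a,b,  guido****'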
# split returns a list
val = 'a,b,  guido'
val.split(',')
Out[3]:
['a', 'b', '  guido']
# Strip the whitespace
pieces = [x.strip() for x in val.split(',')]
pieces
Out[4]:
['a', 'b', 'guido']
# Concatenate with +; note the tuple unpacking in the assignment
first, second, third = pieces
first + '::' + second + '::' + third
Out[5]:
'a::b::guido'
# That isn't very practical; a faster, more Pythonic style is str.join
'::'.join(pieces)
Out[6]:
'a::b::guido'
# Locating substrings: in, index, and find are the most common
'guido' in val
Out[7]:
True
val.index(',')
Out[8]:
1
val.find(':')
Out[9]:
-1
# index raises an exception when the substring is not found:
# val.index(':')
# count returns the number of occurrences
val.count(',')
Out[13]:
2
# Replace
val.replace(',', '::')
Out[14]:
'a::b::  guido'
# Delete, by replacing with the empty string
val.replace(',', '')
Out[15]:
'ab  guido'

Regular Expressions

Regular expressions (regex) provide a flexible way to search for and match string patterns in text. Python uses the re module.

The re module's functions fall into three categories: pattern matching, substitution, and splitting.

For Python's built-in regular expressions (the re module), see AstralWind's summary.

animalize also covers the more powerful third-party regex module.

The main re methods are:

  • findall, finditer
  • match
  • search
  • split
  • sub, subn
import re
text = "foo    bar\t baz  \tqux"

# The regex \s+ (one or more whitespace characters) is first compiled, and then split is called on the text
re.split('\s+', text)
Out[44]:
['foo', 'bar', 'baz', 'qux']
# Equivalent to the above
# If you will apply the same regex to many strings, compile it yourself first to save CPU time
regex = re.compile('\s+')
regex.split(text)
Out[45]:
['foo', 'bar', 'baz', 'qux']
# Find all substrings matching the regex (\s+)
regex.findall(text)
Out[46]:
['    ', '\t ', '  \t']
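findall returns the matched substrings themselves; finditer, listed above but not used in this transcript, instead yields match objects that carry position information. A quick sketch reusing the regex and text from the cells above:

# finditer yields one match object per run of whitespace
for m in regex.finditer(text):
    print(m.start(), m.end())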
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
# The r prefix marks a raw string, so backslashes are not treated as escape characters
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'

# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)
# findall returns all matches found in the string
regex.findall(text)
Out[41]:
['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']
# search returns only the first match
# The returned match object can only tell us the start and end position of the pattern in the original string
m = regex.search(text)
m
Out[51]:
<_sre.SRE_Match object; span=(5, 20), match='dave@google.com'>
text[m.start():m.end()]
Out[49]:
'dave@google.com'
# match is stricter: it only matches at the very beginning of the string (here it returns None)
regex.match(text)
# sub replaces matches of the pattern with the given string and returns the new string
regex.sub('REDACTED', text)
Out[56]:
'Dave REDACTED\nSteve REDACTED\nRob REDACTED\nRyan REDACTED\n'
# To segment the matched pattern, wrap each part to extract in parentheses
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)
m = regex.match('wesm@bright.net')
m.groups() # returns a tuple of the groups
Out[60]:
('wesm', 'bright', 'net')
regex.findall(text) # returns a list of tuples
Out[61]:
[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]
print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text)) # returns the substituted string; \1, \2, \3 reference the groups
Dave Username: dave, Domain: google, Suffix: com
Steve Username: steve, Domain: gmail, Suffix: com
Rob Username: rob, Domain: gmail, Suffix: com
Ryan Username: ryan, Domain: yahoo, Suffix: com

regex = re.compile(r"""
    (?P<username>[A-Z0-9._%+-]+)
    @
    (?P<domain>[A-Z0-9.-]+)
    \.
    (?P<suffix>[A-Z]{2,4})""", flags=re.IGNORECASE|re.VERBOSE)
m = regex.match('wesm@bright.net')
m.groupdict()   # returns a simple dict of the named groups
Out[65]:
{'domain': 'bright', 'suffix': 'net', 'username': 'wesm'}

Vectorized String Functions in pandas

These apply string methods or regular expressions to a whole series of data. Commonly used methods include:

  • cat
  • contains
  • count
  • endswith, startswith
  • findall
  • get
  • join
  • len
  • lower, upper
  • match
  • pad
  • center
  • repeat
  • replace
  • slice
  • split
  • strip, rstrip, lstrip

All string and regex methods can be applied to each value via data.map() (passing a lambda or other function), but this fails on NA values. Series, however, has string methods that skip NA values, accessible through its str attribute.

data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
        'Rob': 'rob@gmail.com', 'Wes': np.nan}
data = Series(data)
data
Out[69]:
Dave     dave@google.com
Rob        rob@gmail.com
Steve    steve@gmail.com
Wes                  NaN
dtype: object
data.isnull()
Out[70]:
Dave     False
Rob      False
Steve    False
Wes       True
dtype: bool
data.str.contains('gmail')
Out[71]:
Dave     False
Rob       True
Steve     True
Wes        NaN
dtype: object
pattern
Out[72]:
'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'
data.str.findall(pattern, flags=re.IGNORECASE)
Out[73]:
Dave     [(dave, google, com)]
Rob        [(rob, gmail, com)]
Steve    [(steve, gmail, com)]
Wes                        NaN
dtype: object
matches = data.str.match(pattern, flags=re.IGNORECASE)
matches
/Users/holbrook/anaconda3/lib/python3.6/site-packages/ipykernel/__main__.py:1: FutureWarning: In future versions of pandas, match will change to always return a bool indexer.
  if __name__ == '__main__':
Out[75]:
Dave     (dave, google, com)
Rob        (rob, gmail, com)
Steve    (steve, gmail, com)
Wes                      NaN
dtype: object
# There are two ways to do vectorized element retrieval: use str.get, or index into the str attribute
matches.str.get(1)
Out[77]:
Dave     google
Rob       gmail
Steve     gmail
Wes         NaN
dtype: object
matches.str[0]
Out[78]:
Dave      dave
Rob        rob
Steve    steve
Wes        NaN
dtype: object
# Slicing the strings
data.str[:5]
Out[79]:
Dave     dave@
Rob      rob@g
Steve    steve
Wes        NaN
dtype: object
 

Example: USDA Food Database

Reading notes on "Python for Data Analysis".

Chapter 7, Section 5: an example using the USDA food database

All of the data used here can be downloaded from the author's GitHub.

%pylab inline
import pandas as pd
from pandas import Series, DataFrame
import json
Populating the interactive namespace from numpy and matplotlib

Loading the Data

{ "id": 21441, "description": "KENTUCKY FRIED CHICKEN, Fried Chicken, EXTRA CRISPY, Wing, meat and skin with breading", "tags": ["KFC"], "manufacturer": "Kentucky Fried Chicken", "group": "Fast Foods", "portions": [ { "amount": 1, "unit": "wing, with skin", "grams": 68.0 }, ... ], "nutrients": [ { "value": 20.8, "units": "g", "description": "Protein", "group": "Composition" }, ... ] }
import json
db = json.load(open('data/ch07/foods-2011-10-03.json'))
# db is a list; each entry is a dict containing all the data for one food
len(db)
Out[5]:
6636
db[0].keys()
Out[3]:
dict_keys(['id', 'description', 'tags', 'manufacturer', 'group', 'portions', 'nutrients'])
# The value of 'nutrients' is a list of dicts about the food's nutrients, and it is long...
db[0]['nutrients'][0]
Out[6]:
{'description': 'Protein',
 'group': 'Composition',
 'units': 'g',
 'value': 25.18}

Preparing the Data

# Turn the nutrients of one food into a DataFrame
nutrients = DataFrame(db[0]['nutrients'])
nutrients[:7]
Out[10]:
description group units value
0 Protein Composition g 25.18
1 Total lipid (fat) Composition g 29.20
2 Carbohydrate, by difference Composition g 3.06
3 Ash Other g 3.28
4 Energy Energy kcal 376.00
5 Water Composition g 39.28
6 Energy Energy kJ 1573.00
info_keys = ['description', 'group', 'id', 'manufacturer']
info = DataFrame(db, columns=info_keys)
info[:5]
Out[12]:
description group id manufacturer
0 Cheese, caraway Dairy and Egg Products 1008
1 Cheese, cheddar Dairy and Egg Products 1009
2 Cheese, edam Dairy and Egg Products 1018
3 Cheese, feta Dairy and Egg Products 1019
4 Cheese, mozzarella, part skim milk Dairy and Egg Products 1028
info.head()
Out[15]:
description group id manufacturer
0 Cheese, caraway Dairy and Egg Products 1008
1 Cheese, cheddar Dairy and Egg Products 1009
2 Cheese, edam Dairy and Egg Products 1018
3 Cheese, feta Dairy and Egg Products 1019
4 Cheese, mozzarella, part skim milk Dairy and Egg Products 1028
# Look at the distribution of food groups
pd.value_counts(info.group)[:10]
Out[14]:
Vegetables and Vegetable Products    812
Beef Products                        618
Baked Products                       496
Breakfast Cereals                    403
Fast Foods                           365
Legumes and Legume Products          365
Lamb, Veal, and Game Products        345
Sweets                               341
Pork Products                        328
Fruits and Fruit Juices              328
Name: group, dtype: int64

Assembling All the Nutrients into One Big Table

# Concatenate the list of DataFrames, like R's rbind: the rows are aligned and stacked together

nutrients = []

for rec in db:
    fnuts = DataFrame(rec['nutrients'])
    fnuts['id'] = rec['id']
    nutrients.append(fnuts)

nutrients = pd.concat(nutrients, ignore_index=True)
nutrients.head()
Out[18]:
description group units value id
0 Protein Composition g 25.18 1008
1 Total lipid (fat) Composition g 29.20 1008
2 Carbohydrate, by difference Composition g 3.06 1008
3 Ash Other g 3.28 1008
4 Energy Energy kcal 376.00 1008

Dropping Duplicates

nutrients.duplicated().sum()
Out[19]:
14179
nutrients = nutrients.drop_duplicates()

Tidying Up

# nutrients and info share some column names, so rename the info columns
col_mapping = {'description' : 'food',
               'group'       : 'fgroup'}
info = info.rename(columns=col_mapping, copy=False)
info.head()
Out[23]:
food fgroup id manufacturer
0 Cheese, caraway Dairy and Egg Products 1008
1 Cheese, cheddar Dairy and Egg Products 1009
2 Cheese, edam Dairy and Egg Products 1018
3 Cheese, feta Dairy and Egg Products 1019
4 Cheese, mozzarella, part skim milk Dairy and Egg Products 1028
col_mapping = {'description' : 'nutrient',
               'group' : 'nutgroup'}
nutrients = nutrients.rename(columns=col_mapping, copy=False)
nutrients.head()
Out[25]:
nutrient nutgroup units value id
0 Protein Composition g 25.18 1008
1 Total lipid (fat) Composition g 29.20 1008
2 Carbohydrate, by difference Composition g 3.06 1008
3 Ash Other g 3.28 1008
4 Energy Energy kcal 376.00 1008

Data Transformation

ndata = pd.merge(nutrients, info, on='id', how='outer')
ndata.head()
Out[29]:
nutrient nutgroup units value id food fgroup manufacturer
0 Protein Composition g 25.18 1008 Cheese, caraway Dairy and Egg Products
1 Total lipid (fat) Composition g 29.20 1008 Cheese, caraway Dairy and Egg Products
2 Carbohydrate, by difference Composition g 3.06 1008 Cheese, caraway Dairy and Egg Products
3 Ash Other g 3.28 1008 Cheese, caraway Dairy and Egg Products
4 Energy Energy kcal 376.00 1008 Cheese, caraway Dairy and Egg Products
ndata.ix[30000]
Out[30]:
nutrient                                       Glycine
nutgroup                                   Amino Acids
units                                                g
value                                             0.04
id                                                6158
food            Soup, tomato bisque, canned, condensed
fgroup                      Soups, Sauces, and Gravies
manufacturer                                          
Name: 30000, dtype: object

Modeling and Computation

# Median value of each nutrient by food group; then plot the zinc medians as a bar chart
result = ndata.groupby(['nutrient', 'fgroup'])['value'].quantile(0.5)
result['Zinc, Zn'].sort_values().plot(kind='barh')
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x115eed908>
# Find which food is most dense in each nutrient
by_nutrient = ndata.groupby(['nutgroup', 'nutrient'])

get_maximum = lambda x: x.xs(x.value.idxmax())
get_minimum = lambda x: x.xs(x.value.idxmin())

max_foods = by_nutrient.apply(get_maximum)[['value', 'food']]

# make the food names a little shorter
max_foods.food = max_foods.food.str[:50]
max_foods.ix['Amino Acids']['food']
Out[34]:
nutrient
Alanine                           Gelatins, dry powder, unsweetened
Arginine                               Seeds, sesame flour, low-fat
Aspartic acid                                   Soy protein isolate
Cystine                Seeds, cottonseed flour, low fat (glandless)
Glutamic acid                                   Soy protein isolate
Glycine                           Gelatins, dry powder, unsweetened
Histidine                Whale, beluga, meat, dried (Alaska Native)
Hydroxyproline    KENTUCKY FRIED CHICKEN, Fried Chicken, ORIGINA...
Isoleucine        Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Leucine           Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Lysine            Seal, bearded (Oogruk), meat, dried (Alaska Na...
Methionine                    Fish, cod, Atlantic, dried and salted
Phenylalanine     Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Proline                           Gelatins, dry powder, unsweetened
Serine            Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Threonine         Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Tryptophan         Sea lion, Steller, meat with fat (Alaska Native)
Tyrosine          Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Valine            Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Name: food, dtype: object