Axis labels (the index) play an important role in pandas objects (Series and DataFrame): they identify where data lives and are used to select subsets of a dataset.
Object Type | Indexers |
---|---|
Series | s.loc[indexer] |
DataFrame | df.loc[row_indexer,column_indexer] |
import numpy as np
import pandas as pd

dates = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 4),
                  index=dates, columns=['A', 'B', 'C', 'D'])
df
A B C D
2000-01-01 -0.479079 0.172883 -0.242021 0.628444
2000-01-02 0.159806 -1.505613 -0.134579 0.498075
2000-01-03 1.533752 1.201623 0.530075 -0.323315
2000-01-04 1.449016 0.762993 0.601700 1.255557
2000-01-05 0.239401 1.765966 -0.818390 -0.156134
2000-01-06 2.869231 0.956241 -0.178162 -2.097230
2000-01-07 -0.380161 0.602484 0.469034 1.277701
2000-01-08 0.640450 0.565034 -0.010303 -1.344165
s = df['A']  # s is a Series
s[dates[5]]
2.869230887708206
df[['B', 'A']] = df[['A', 'B']]  # swaps the values of columns A and B
# When selecting, df[['A', 'B']] is equivalent to df.loc[:, ['A', 'B']]
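Here is a small sketch of the swap on a made-up three-row frame (the data and the names `df_demo`, `swapped` and `swapped_loc` are illustrative and not part of the session above). The pandas documentation notes that the same assignment written with `.loc` aligns on column labels first and therefore leaves the frame unchanged; passing raw values with `to_numpy()` is the documented way around that.

```python
import pandas as pd

# Illustrative data (an assumption, not the frame used above).
df_demo = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

# Plain [] assignment pairs columns by position, so the values swap.
swapped = df_demo.copy()
swapped[['B', 'A']] = swapped[['A', 'B']]

# With .loc, pass raw values so that label alignment does not undo the swap.
swapped_loc = df_demo.copy()
swapped_loc.loc[:, ['B', 'A']] = swapped_loc[['A', 'B']].to_numpy()

print(swapped)
print(swapped_loc)
```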
Columns can also be accessed as attributes using dot (.) notation:
>>> df.A
>>> df.D  # equivalent to df['D'] or df.loc[:, 'D']; selects a column
2000-01-01 0.628444
2000-01-02 0.498075
2000-01-03 -0.323315
2000-01-04 1.255557
2000-01-05 -0.156134
2000-01-06 -2.097230
2000-01-07 1.277701
2000-01-08 -1.344165
Freq: D, Name: D, dtype: float64
>>> df[::-1]  # reverses the row order; slices such as df[:3] and df[::-1] act on rows
A B C D
2000-01-08 0.640450 0.565034 -0.010303 -1.344165
2000-01-07 -0.380161 0.602484 0.469034 1.277701
2000-01-06 2.869231 0.956241 -0.178162 -2.097230
2000-01-05 0.239401 1.765966 -0.818390 -0.156134
2000-01-04 1.449016 0.762993 0.601700 1.255557
2000-01-03 1.533752 1.201623 0.530075 -0.323315
2000-01-02 0.159806 -1.505613 -0.134579 0.498075
2000-01-01 -0.479079 0.172883 -0.242021 0.628444
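The two access styles above can be compared side by side. This is a minimal sketch on a small made-up frame (`df_demo` and its values are assumptions for illustration). Keep in mind that attribute access only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method.

```python
import numpy as np
import pandas as pd

# Illustrative frame with a date index, similar in shape to the one above.
dates = pd.date_range('2000-01-01', periods=4)
df_demo = pd.DataFrame(np.arange(8).reshape(4, 2),
                       index=dates, columns=['A', 'D'])

# Attribute access and [] return the same column as a Series.
same_column = df_demo.D.equals(df_demo['D'])

# A slice inside [] selects rows by position.
first_two = df_demo[:2]
reversed_rows = df_demo[::-1]

print(same_column, len(first_two), reversed_rows.index[0])
```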
1. loc and iloc
>>> df1 = pd.DataFrame(np.random.randn(6, 4),
index=list('abcdef'),columns=list('ABCD'))
>>> df1
A B C D
a 0.899943 0.500422 -0.142480 0.714779
b -0.592714 -0.371228 -1.407495 0.748776
c -0.567269 0.225230 -0.215326 0.826066
d -0.882531 -1.744819 1.818175 -0.144823
e -1.174458 1.108387 1.127187 -0.110846
f 0.205241 0.035335 1.302494 0.813305
# Other label-based selections with loc:
# df1.loc[['a', 'b', 'd'], :]
# df1.loc['d':, 'A':'C']
# df1.loc[:, df1.loc['a'] > 0]
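Since only `loc` is shown above, here is a short sketch contrasting it with `iloc`. The frame `df_demo` is made up for illustration. The key difference is that label slices with `loc` include both endpoints, while position slices with `iloc` exclude the stop, as in ordinary Python slicing.

```python
import numpy as np
import pandas as pd

# Illustrative frame with the same labels as df1 but deterministic values.
df_demo = pd.DataFrame(np.arange(24).reshape(6, 4),
                       index=list('abcdef'), columns=list('ABCD'))

by_label = df_demo.loc['b':'d', 'A':'C']   # label slices include both endpoints
by_position = df_demo.iloc[1:4, 0:3]       # position slices exclude the stop

# A boolean Series built from row 'a' selects the columns where it is True.
positive_in_a = df_demo.loc[:, df_demo.loc['a'] > 0]

print(by_label.equals(by_position))
print(list(positive_in_a.columns))
```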
Selection by callable: index with the result returned by a function
>>>df1 = pd.DataFrame(np.random.randn(6, 4),
index=list('abcdef'),columns=list('ABCD'))
>>>df1
A B C D
a 1.254603 0.666147 0.960109 0.290801
b 1.024046 -1.046331 -0.904427 -0.205843
c -2.321422 -0.014234 -0.171935 -0.511684
d 0.978548 1.030372 -0.298060 1.856619
e 0.106820 0.101090 -0.152575 -0.395502
f -0.560688 0.692521 -0.920736 -0.948279
>>> df['A'] > 0
2000-01-01 False
2000-01-02 True
2000-01-03 True
2000-01-04 True
2000-01-05 True
2000-01-06 True
2000-01-07 False
2000-01-08 True
>>> df1.loc[lambda df: df['A'] > 0, :]  # the callable returns the boolean Series df['A'] > 0, evaluated on df1
A B C D
a 1.254603 0.666147 0.960109 0.290801
b 1.024046 -1.046331 -0.904427 -0.205843
d 0.978548 1.030372 -0.298060 1.856619
e 0.106820 0.101090 -0.152575 -0.395502
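A callable is most useful in a method chain, where the intermediate frame has no name to refer to. The following is a sketch under that assumption; the frame `df_demo` and the column name `total` are invented for the example.

```python
import pandas as pd

# Illustrative data (not the random frame shown above).
df_demo = pd.DataFrame({'A': [1.5, -0.5, 2.0, -1.0], 'B': [1, 2, 3, 4]},
                       index=list('abcd'))

# The callable receives the frame it is applied to, so it can refer to
# a column ('total') that only exists on the intermediate result.
result = (df_demo
          .assign(total=lambda d: d['A'] + d['B'])
          .loc[lambda d: d['total'] > 2, ['A', 'total']])

print(result)
```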
df.sample(n=1, axis=1) randomly draws one column; with axis=0 (the default) it draws one row.
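A quick sketch of both directions, using a small made-up frame. `random_state` is passed only so that repeated runs draw the same sample; which row or column comes out is not something this example relies on.

```python
import numpy as np
import pandas as pd

# Illustrative 3x4 frame.
df_demo = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list('ABCD'))

one_row = df_demo.sample(n=1, random_state=0)             # axis=0 is the default
one_column = df_demo.sample(n=1, axis=1, random_state=0)  # axis=1 samples columns

print(one_row.shape, one_column.shape)
```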
2. at and iat: used to access a single (scalar) value; usage mirrors loc and iloc.
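As a minimal sketch on an invented frame: `at` takes a row label and a column label, `iat` takes integer positions, and both can be used for assignment as well as lookup.

```python
import numpy as np
import pandas as pd

# Illustrative frame: rows a..c, columns A..D, values 0..11.
df_demo = pd.DataFrame(np.arange(12).reshape(3, 4),
                       index=list('abc'), columns=list('ABCD'))

by_label = df_demo.at['b', 'C']    # scalar lookup by labels
by_position = df_demo.iat[1, 2]    # the same cell, addressed by position

df_demo.at['c', 'D'] = 100         # scalar assignment works the same way

print(by_label, by_position, df_demo.loc['c', 'D'])
```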
3. Selecting from Series and DataFrame with boolean vectors
The operators are: | for or, & for and, and ~ for not.
>>> df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
...                     'b': ['x', 'y', 'y', 'x', 'y', 'x', 'x'],
...                     'c': np.random.randn(7)})
>>>df2
a b c
0 one x -1.201883
1 one y 0.323085
2 two y -1.228992
3 three x -0.691629
4 two y 0.342987
5 one x -1.405064
6 six x -0.023214
>>> criterion = df2['a'].map(lambda x: x.startswith('t'))
# map() builds a boolean Series here, because startswith() returns a bool for each element
>>>criterion
0 False
1 False
2 True
3 True
4 True
5 False
6 False
Name: a, dtype: bool
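The boolean Series can then be used to filter rows and combined with other conditions. The sketch below reuses the `a` and `b` columns from above but swaps the random `c` column for fixed values (an assumption made so the result is reproducible), and uses `str.startswith` as an alternative to `map` with a lambda. Each comparison needs its own parentheses, because `&`, `|` and `~` bind more tightly than comparison operators.

```python
import pandas as pd

# Same 'a' and 'b' as df2 above; 'c' uses fixed illustrative values.
df_demo = pd.DataFrame({'a': ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
                        'b': ['x', 'y', 'y', 'x', 'y', 'x', 'x'],
                        'c': [-1.2, 0.3, -1.2, -0.7, 0.3, -1.4, 0.0]})

criterion = df_demo['a'].str.startswith('t')

both = df_demo[criterion & (df_demo['b'] == 'x')]     # and
either = df_demo[criterion | (df_demo['c'] > 0)]      # or
neither = df_demo[~criterion & (df_demo['c'] < 0)]    # not, combined with and

print(both)
print(either)
print(neither)
```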
4. Using isin
Series.isin(list())
>>> s = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype='int64')
>>> s
4 0
3 1
2 2
1 3
0 4
dtype: int64
>>> s[s.isin([2, 4, 6])]
2 2
0 4
dtype: int64
DataFrame.isin()
>>> df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'f', 'n'],'ids2': ['a', 'n', 'c', 'n']})
>>> df
vals ids ids2
0 1 a a
1 2 b n
2 3 f c
3 4 n n
>>> values = ['a', 'b', 1, 3]
>>> df.isin(values)
vals ids ids2
0 True True True
1 False True False
2 True False False
3 False False False
DataFrame.isin() with a dict argument
>>> values = {'ids': ['a', 'b'], 'vals': [1, 3]}
>>> df.isin(values)
vals ids ids2
0 True True False
1 False True False
2 True False False
3 False False False
# any() and all()
>>> values = {'ids': ['a', 'b'], 'ids2': ['a', 'c'], 'vals': [1, 3]}
>>> row_mask = df.isin(values)
>>> df[row_mask]
vals ids ids2
0 1.0 a a
1 NaN b NaN
2 3.0 NaN c
3 NaN NaN NaN
>>> row_mask = df.isin(values).all(axis=1)
>>> df[row_mask]
vals ids ids2
0 1 a a
>>> row_mask = df.isin(values).any(axis=1)
>>> row_mask
0 True
1 True
2 True
3 False
dtype: bool
>>> df[row_mask]
vals ids ids2
0 1 a a
1 2 b n
2 3 f c
5. where() and mask()
See the linked article: pandas.DataFrame.where() and mask() methods
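For a quick orientation before following the link, here is a minimal sketch on a made-up frame. `where()` keeps values where the condition is True and replaces the rest (with NaN by default, or with the second argument), while `mask()` does the opposite and replaces values where the condition is True.

```python
import pandas as pd

# Illustrative data.
df_demo = pd.DataFrame({'x': [-2, -1, 0, 1, 2], 'y': [5, -5, 5, -5, 5]})

kept = df_demo.where(df_demo > 0)        # non-positive values become NaN
filled = df_demo.where(df_demo > 0, 0)   # non-positive values become 0
masked = df_demo.mask(df_demo > 0)       # positive values become NaN

print(kept)
print(filled)
print(masked)
```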
6. query()
See the linked article: pandas.DataFrame.query() method
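As a brief sketch of what `query()` does, on invented data: the expression string refers to columns by name, and a local Python variable can be referenced with an `@` prefix. The variable name `threshold` is an assumption for the example.

```python
import pandas as pd

# Illustrative data.
df_demo = pd.DataFrame({'a': [1, 5, 3, 8], 'b': [4, 2, 3, 9]})
threshold = 2

via_query = df_demo.query('a > b')
via_mask = df_demo[df_demo['a'] > df_demo['b']]   # equivalent boolean indexing

# '@threshold' refers to the local Python variable, not to a column.
with_local = df_demo.query('a > @threshold and b < 9')

print(via_query)
print(with_local)
```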
7. query() and isin()
For isin(), see the linked article: pandas.Series.isin() and pandas.DataFrame.isin()
>>> df = pd.DataFrame({'a': list('aabbccddeeff'), 'b': list('aaaabbbbcccc'),
... 'c': np.random.randint(5, size=12),
... 'd': np.random.randint(9, size=12)})
...
>>> df
a b c d
0 a a 3 5
1 a a 3 5
2 b a 0 5
3 b a 3 2
4 c b 3 6
5 c b 2 8
6 d b 2 3
7 d b 3 7
8 e c 3 4
9 e c 1 5
10 f c 2 2
11 f c 3 5
>>> df.query('a in b')
a b c d
0 a a 3 5
1 a a 3 5
2 b a 0 5
3 b a 3 2
4 c b 3 6
5 c b 2 8
>>> df[df['a'].isin(df['b'])]
a b c d
0 a a 3 5
1 a a 3 5
2 b a 0 5
3 b a 3 2
4 c b 3 6
5 c b 2 8
>>> df[~df['a'].isin(df['b'])]
a b c d
6 d b 2 3
7 d b 3 7
8 e c 3 4
9 e c 1 5
10 f c 2 2
11 f c 3 5
>>> df.query('b == ["a", "b", "c"]')
a b c d
0 a a 3 5
1 a a 3 5
2 b a 0 5
3 b a 3 2
4 c b 3 6
5 c b 2 8
6 d b 2 3
7 d b 3 7
8 e c 3 4
9 e c 1 5
10 f c 2 2
11 f c 3 5
>>> df[df['b'].isin(["a", "b", "c"])]
a b c d
0 a a 3 5
1 a a 3 5
2 b a 0 5
3 b a 3 2
4 c b 3 6
5 c b 2 8
6 d b 2 3
7 d b 3 7
8 e c 3 4
9 e c 1 5
10 f c 2 2
11 f c 3 5
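In terms of performance, query() can be faster on large frames: according to the pandas documentation, the benefit comes from the numexpr engine and only shows up once a frame has more than roughly 100,000 rows.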
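If you want to check this on your own machine, a rough timing sketch could look like the following. The frame size, the column names and the helper names `with_query` and `with_mask` are assumptions; the timings depend on hardware, the pandas version and whether numexpr is installed, so no particular outcome is implied.

```python
import timeit

import numpy as np
import pandas as pd

# Illustrative large frame with two float columns.
rng = np.random.default_rng(0)
big = pd.DataFrame(rng.random((200_000, 2)), columns=['a', 'b'])


def with_query():
    return big.query('a > 0.5 and b < 0.5')


def with_mask():
    return big[(big['a'] > 0.5) & (big['b'] < 0.5)]


# Both approaches select the same rows; only the evaluation differs.
same_rows = with_query().equals(with_mask())

t_query = timeit.timeit(with_query, number=5)
t_mask = timeit.timeit(with_mask, number=5)
print(f"same rows: {same_rows}, query: {t_query:.4f}s, mask: {t_mask:.4f}s")
```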

8. Duplicate data
For handling duplicates, see the linked article: pandas DataFrame duplicate data handling with duplicated() and drop_duplicates()
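As a short sketch on made-up data: `duplicated()` returns a boolean Series marking rows that repeat an earlier row, and `drop_duplicates()` removes them. The `subset` and `keep` parameters control which columns are compared and which occurrence survives.

```python
import pandas as pd

# Illustrative data: row 1 repeats row 0 exactly.
df_demo = pd.DataFrame({'a': ['one', 'one', 'two', 'two', 'three'],
                        'b': ['x', 'x', 'y', 'z', 'x']})

dup_mask = df_demo.duplicated()      # True for rows that repeat an earlier row
deduped = df_demo.drop_duplicates()  # drops the rows flagged above

# Compare only column 'a' and keep the last occurrence of each value.
last_per_a = df_demo.drop_duplicates(subset='a', keep='last')

print(dup_mask.tolist())
print(deduped)
print(last_per_a)
```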