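Reading Notes

Author: 方跃文

Email: fyuewen@gmail.com

Date: started on November 17, 2018; finished writing in 2018

Chapter 7. Data Wrangling: Cleaning, Transforming, Merging, Reshaping

pandas was created with practical use in mind, so it includes many of the data-cleaning methods needed in real work.

Combining and merging datasets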

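pandas objects can be combined with several built-in methods:

pandas.merge connects rows based on one or more keys.
pandas.concat stacks multiple objects together along an axis.
The instance method combine_first splices overlapping data together, filling missing values in one object with values from the other.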
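
The filling behaviour of combine_first is easiest to see on a small case. The following is a minimal sketch, not taken from the original notes; the Series names a, b, and patched and their values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Two Series sharing the same index; `a` has gaps that `b` can fill.
a = pd.Series([np.nan, 2.5, np.nan], index=list('xyz'))
b = pd.Series([0.0, 1.0, 2.0], index=list('xyz'))

# Keep the values of `a` where present; fall back to `b` where `a` is NaN.
patched = a.combine_first(b)
print(patched)
```

Only the missing entries of a are taken from b; the value at 'y' stays 2.5.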

Database-style DataFrame joins

merge in pandas lets us combine datasets on one or more keys, much like a join in an SQL database.



In [22]:

    
import pandas as pd

def special_sign(sign, times):
    # sign is a string, times is an integer; returns sign repeated `times` times
    return sign * times

df1 = pd.DataFrame({'key':list('bbacaab'),
                   'data1': range(7)})
df2 = pd.DataFrame({'key': list('abd'),
                   'data2': range(3)})
print(df1)
print(special_sign('#',15))
print(df2)

## many-to-one join; without specifying which column to join on
pd.merge(df1, df2)









    



  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   a      5
6   b      6
###############
  key  data2
0   a      0
1   b      1
2   d      2






    Out[22]:

  key  data1  data2
0   b      0      1
1   b      1      1
2   b      6      1
3   a      2      0
4   a      4      0
5   a      5      0



In [25]:

    
## many-to-one join; explicitly specifying which column to join on
pd.merge(df1, df2, on = 'key')
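
Both merge calls above perform an inner join, which is the pandas default; that is why the rows with keys c and d are missing from the result. A minimal sketch of the how parameter, rebuilding df1 and df2 as defined earlier (the names inner, outer, and left are illustrative only):

```python
import pandas as pd

df1 = pd.DataFrame({'key': list('bbacaab'), 'data1': range(7)})
df2 = pd.DataFrame({'key': list('abd'), 'data2': range(3)})

inner = pd.merge(df1, df2, on='key')               # default: how='inner', shared keys only
outer = pd.merge(df1, df2, on='key', how='outer')  # union of keys, NaN where one side is missing
left = pd.merge(df1, df2, on='key', how='left')    # every row of df1 is kept

print(len(inner), len(outer), len(left))
print(outer)
```

In the outer result, key c has no data2 value and key d has no data1 value, so those cells are NaN.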

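Concatenating along an axis

DataFrame offers a rich set of merge methods. Another kind of data-combining operation is known as concatenation, binding, or stacking. NumPy also provides a concatenate function.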



In [2]:

    
import numpy as np

arr1 = np.arange(12).reshape(3,4)
print(arr1)









    



[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]



In [3]:

    
np.concatenate([arr1, arr1], axis=1)









    Out[3]:





array([[ 0,  1,  2,  3,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  8,  9, 10, 11]])
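
For contrast, here is a small sketch of the same call along the other axis. It reuses arr1 from above; the names stacked_rows and stacked_cols are illustrative.

```python
import numpy as np

arr1 = np.arange(12).reshape(3, 4)

# axis=0 appends rows (the NumPy default); axis=1 appends columns.
stacked_rows = np.concatenate([arr1, arr1], axis=0)
stacked_cols = np.concatenate([arr1, arr1], axis=1)

print(stacked_rows.shape)
print(stacked_cols.shape)
```

The arrays must agree on every dimension except the one being concatenated.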

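For pandas objects, labeled axes let us generalize array concatenation further. The concat function in pandas provides several options for this kind of combining.

The example below has three Series whose indexes do not overlap; let's see how concat combines them.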



In [8]:

    
import pandas as pd
seri1 = pd.Series([-1,2], index=list('ab'))
seri2 = pd.Series([2,3,4], index=list('cde'))
seri3 = pd.Series([5,6], index=list('fg'))
print(seri1)
print(seri2)
print(seri3)









    



a   -1
b    2
dtype: int64
c    2
d    3
e    4
dtype: int64
f    5
g    6
dtype: int64



In [6]:

    
print(seri1)









    



a   -1
b    2
dtype: int64



In [9]:

    
pd.concat([seri1,seri2,seri3])









    Out[9]:





a   -1
b    2
c    2
d    3
e    4
f    5
g    6
dtype: int64

By default, concat works along axis=0 and produces a brand-new Series. If we pass axis=1, the result is a DataFrame instead (axis=1 is the columns).



In [12]:

    
pd.concat([seri1, seri2, seri3],axis=1, sort=False)



In [15]:

    
pd.concat([seri1, seri2, seri3],axis=1, sort=False,join='inner') # join='inner' keeps the intersection of the indexes, which is empty here



In [17]:

    
seri4 = pd.concat([seri1*5, seri3])
print(seri4)









    



a    -5
b    10
f     5
g     6
dtype: int64



In [18]:

    
seri4 = pd.concat([seri1*5, seri3],axis=1, join='inner')
print(seri4)









    



Empty DataFrame
Columns: [0, 1]
Index: []
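
The result above is empty because the indexes of seri1 and seri3 share no labels. A minimal sketch with overlapping labels shows what join='inner' keeps; the Series s1 and s2 and their values are made up for illustration.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=list('ab'))
s2 = pd.Series([10, 11, 12], index=list('abc'))

# Default join='outer': union of the labels, NaN where a Series has no value.
outer = pd.concat([s1, s2], axis=1)

# join='inner': only labels present in every input are kept.
inner = pd.concat([s1, s2], axis=1, join='inner')

print(outer)
print(inner)
```

Here the inner result keeps rows a and b only, while the outer result also keeps c with a NaN in the first column.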



In [ ]:

Appendix: helpers written while taking these notes

Define a function that returns a string made of a repeated character.



In [18]:

    
# Ref: https://stackoverflow.com/questions/38273353/how-to-repeat-individual-characters-in-strings-in-python
def special_sign(sign, times):
    # sign is a string, times is an integer; returns sign repeated `times` times
    return sign * times

print(special_sign('*',20))









    



********************



Output of In [12], pd.concat([seri1, seri2, seri3], axis=1, sort=False):

     0    1    2
a -1.0  NaN  NaN
b  2.0  NaN  NaN
c  NaN  2.0  NaN
d  NaN  3.0  NaN
e  NaN  4.0  NaN
f  NaN  NaN  5.0
g  NaN  NaN  6.0