Pandas Study Notes

Setting Up the Environment

Installing pandas

sudo apt-get install build-essential python-dev 
sudo apt-get install python-pandas python-tk
sudo apt-get install python-scipy python-matplotlib python-tables
sudo apt-get install python-numexpr python-xlrd python-statsmodels
sudo apt-get install python-openpyxl python-xlwt python-bs4

If you are using virtualenv, install libpng-dev, libjpeg8-dev, and libfreetype6-dev before installing matplotlib.
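
For example, on Debian/Ubuntu the corresponding development packages can be installed with:

sudo apt-get install libpng-dev libjpeg8-dev libfreetype6-dev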

Installing IPython Notebook

sudo pip install "ipython[notebook]"
sudo pip install pygments

Running IPython Notebook

ipython notebook
# To display matplotlib figures inline in the notebook, run:
ipython notebook --matplotlib inline

In [119]:
import numpy as np
import pandas as pd

In [120]:
%matplotlib inline

In [121]:
# Check the pandas version
pd.__version__


Out[121]:
'0.20.1'

Creating Objects

Series


In [160]:
s = pd.Series([1,3,5,np.nan,6,8]);s


Out[160]:
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

DataFrames


In [161]:
dates = pd.date_range('20180101', periods=6);dates


Out[161]:
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-06'],
              dtype='datetime64[ns]', freq='D')

In [162]:
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'));df


Out[162]:
A B C D
2018-01-01 -0.293383 1.188783 -0.250695 0.311327
2018-01-02 -2.576191 -0.883473 -0.555335 2.133046
2018-01-03 -0.000352 1.130436 0.608967 -0.058527
2018-01-04 0.141822 -1.184875 -1.049310 -0.973988
2018-01-05 1.643239 -1.490113 0.139039 -0.633444
2018-01-06 0.372552 -1.116897 -1.197287 1.721004

In [163]:
# A DataFrame can also be created from a dict.
# A scalar value is broadcast to every row.
# A list value must have the same length as the other columns, otherwise an error is raised.
# If only one value is a Series with an index, that index becomes the index of the whole DataFrame.
# If several values are Series, their indexes are aligned (unioned) and missing entries become NaN
# (a small alignment sketch follows the output below).
df2 = pd.DataFrame(
    { 'A' : 1.,
      'B' : pd.Timestamp('20130102'),
      'C' : pd.Series(1,index=list(range(2,6)),dtype='float32'), 
      'D' : np.array([3] * 4,dtype='int32'), 
      'E' : pd.Categorical(["test","train","test","train"]), 
      'F' : 'foo' }
)
df2


Out[163]:
A B C D E F
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
4 1.0 2013-01-02 1.0 3 test foo
5 1.0 2013-01-02 1.0 3 train foo
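
A minimal sketch of the alignment behaviour described above (df3 and its values are made up for illustration): when the dict values are Series with different indexes, the result uses the union of the indexes and fills the gaps with NaN.

df3 = pd.DataFrame({
    'x': pd.Series([1, 2], index=['a', 'b']),
    'y': pd.Series([3, 4], index=['b', 'c']),
})
# The resulting index is the union ['a', 'b', 'c'];
# 'x' is NaN at 'c' and 'y' is NaN at 'a'.
df3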

Viewing Data


In [164]:
# Column dtypes
df2.dtypes


Out[164]:
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

In [165]:
# View the first rows
df.head()


Out[165]:
A B C D
2018-01-01 -0.293383 1.188783 -0.250695 0.311327
2018-01-02 -2.576191 -0.883473 -0.555335 2.133046
2018-01-03 -0.000352 1.130436 0.608967 -0.058527
2018-01-04 0.141822 -1.184875 -1.049310 -0.973988
2018-01-05 1.643239 -1.490113 0.139039 -0.633444

In [166]:
# View the last rows
df.tail(2)
# head and tail take an integer argument; the default is 5.


Out[166]:
A B C D
2018-01-05 1.643239 -1.490113 0.139039 -0.633444
2018-01-06 0.372552 -1.116897 -1.197287 1.721004

In [167]:
df.index


Out[167]:
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-06'],
              dtype='datetime64[ns]', freq='D')

In [168]:
df.columns


Out[168]:
Index(['A', 'B', 'C', 'D'], dtype='object')

In [169]:
df.values


Out[169]:
array([[ -2.93382745e-01,   1.18878334e+00,  -2.50694543e-01,
          3.11327196e-01],
       [ -2.57619141e+00,  -8.83472631e-01,  -5.55334503e-01,
          2.13304648e+00],
       [ -3.51821479e-04,   1.13043604e+00,   6.08967225e-01,
         -5.85268484e-02],
       [  1.41821576e-01,  -1.18487523e+00,  -1.04930999e+00,
         -9.73987922e-01],
       [  1.64323906e+00,  -1.49011349e+00,   1.39038801e-01,
         -6.33443886e-01],
       [  3.72551663e-01,  -1.11689713e+00,  -1.19728703e+00,
          1.72100425e+00]])

In [170]:
df.describe()


Out[170]:
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean -0.118719 -0.392690 -0.384103 0.416570
std 1.378217 1.218048 0.694371 1.258635
min -2.576191 -1.490113 -1.197287 -0.973988
25% -0.220125 -1.167881 -0.925816 -0.489715
50% 0.070735 -1.000185 -0.403015 0.126400
75% 0.314869 0.626959 0.041605 1.368585
max 1.643239 1.188783 0.608967 2.133046

In [171]:
df.T


Out[171]:
2018-01-01 00:00:00 2018-01-02 00:00:00 2018-01-03 00:00:00 2018-01-04 00:00:00 2018-01-05 00:00:00 2018-01-06 00:00:00
A -0.293383 -2.576191 -0.000352 0.141822 1.643239 0.372552
B 1.188783 -0.883473 1.130436 -1.184875 -1.490113 -1.116897
C -0.250695 -0.555335 0.608967 -1.049310 0.139039 -1.197287
D 0.311327 2.133046 -0.058527 -0.973988 -0.633444 1.721004

Reading Data

Reading from a CSV File


In [122]:
# Read data in CSV format
df_movies = pd.read_csv('datas/movies.csv', sep='\t', encoding='utf-8')
df_movies.head()


Out[122]:
Row Date Day Day# Top 10 Gross #1 Movie Gross
0 1 Jan. 30 Fri 30 $26,168,351 American Sniper $9,905,616
1 2 Jan. 31 Sat 31 $41,633,588 American Sniper $16,510,536
2 3 Feb. 1 Sun 32 $12,515,579 American Sniper $4,244,376
3 4 Feb. 2 Mon 33 $6,475,068 American Sniper $2,645,109
4 5 Feb. 3 Tue 34 $7,825,091 American Sniper $2,923,141

In [123]:
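# Re-read with thousands=',' and escapechar='$' so the money columns are parsed as plain numbers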
df_movies = pd.read_csv('datas/movies.csv', sep='\t', encoding='utf-8',thousands=',',escapechar='$')
df_movies.head()


Out[123]:
Row Date Day Day# Top 10 Gross #1 Movie Gross
0 1 Jan. 30 Fri 30 26168351 American Sniper 9905616
1 2 Jan. 31 Sat 31 41633588 American Sniper 16510536
2 3 Feb. 1 Sun 32 12515579 American Sniper 4244376
3 4 Feb. 2 Mon 33 6475068 American Sniper 2645109
4 5 Feb. 3 Tue 34 7825091 American Sniper 2923141

Reading from an Excel File

Prototype

pandas.read_excel(io, sheet_name=0, header=0, skiprows=None, skip_footer=0,
                  index_col=None, names=None, usecols=None, parse_dates=False,
                  date_parser=None, na_values=None, thousands=None,
                  convert_float=True, converters=None, dtype=None,
                  true_values=None, false_values=None, engine=None,
                  squeeze=False, **kwds)

Read an Excel table into a pandas DataFrame

Parameters

  • io : string, path object (pathlib.Path or py._path.local.LocalPath), file-like object, pandas ExcelFile, or xlrd workbook. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. For instance, a local file could be file://localhost/path/to/workbook.xlsx

  • sheet_name : string, int, mixed list of strings/ints, or None, default 0. Strings are used for sheet names, integers are used in zero-indexed sheet positions. Lists of strings/integers are used to request multiple sheets. Specify None to get all sheets. str|int -> DataFrame is returned. list|None -> Dict of DataFrames is returned, with keys representing sheets. Available cases:
      • Defaults to 0 -> 1st sheet as a DataFrame
      • 1 -> 2nd sheet as a DataFrame
      • “Sheet1” -> 1st sheet as a DataFrame
      • [0,1,”Sheet5”] -> 1st, 2nd & 5th sheet as a dictionary of DataFrames
      • None -> All sheets as a dictionary of DataFrames

  • sheetname : string, int, mixed list of strings/ints, or None, default 0 Deprecated since version 0.21.0: Use sheet_name instead

  • header : int, list of ints, default 0 Row (0-indexed) to use for the column labels of the parsed DataFrame. If a list of integers is passed those row positions will be combined into a MultiIndex. Use None if there is no header.

  • skiprows : list-like Rows to skip at the beginning (0-indexed)

  • skip_footer : int, default 0 Rows at the end to skip (0-indexed)

  • index_col : int, list of ints, default None Column (0-indexed) to use as the row labels of the DataFrame. Pass None if there is no such column. If a list is passed, those columns will be combined into a MultiIndex. If a subset of data is selected with usecols, index_col is based on the subset.

  • names : array-like, default None List of column names to use. If file contains no header row, then you should explicitly pass header=None

  • converters : dict, default None Dict of functions for converting values in certain columns. Keys can either be integers or column labels, values are functions that take one input argument, the Excel cell content, and return the transformed content.

  • dtype : Type name or dict of column -> type, default None Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32} Use object to preserve data as stored in Excel and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion. New in version 0.20.0.

  • true_values : list, default None Values to consider as True New in version 0.19.0.

  • false_values : list, default None Values to consider as False New in version 0.19.0.

  • parse_cols : int or list, default None Deprecated since version 0.21.0: Pass in usecols instead.

  • usecols : int or list, default None
      • If None then parse all columns
      • If int then indicates last column to be parsed
      • If list of ints then indicates list of column numbers to be parsed
      • If string then indicates comma separated list of Excel column letters and column ranges (e.g. “A:E” or “A,C,E:F”). Ranges are inclusive of both sides.

  • squeeze : boolean, default False If the parsed data only contains one column then return a Series

  • na_values : scalar, str, list-like, or dict, default None Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’.

  • thousands : str, default None Thousands separator for parsing string columns to numeric. Note that this parameter is only necessary for columns stored as TEXT in Excel, any numeric columns will automatically be parsed, regardless of display format.

  • keep_default_na : bool, default True If na_values are specified and keep_default_na is False the default NaN values are over-ridden, otherwise they’re appended to.

  • verbose : boolean, default False Indicate number of NA values placed in non-numeric columns

  • engine: string, default None If io is not a buffer or path, this must be set to identify io. Acceptable values are None or xlrd

  • convert_float : boolean, default True convert integral floats to int (i.e., 1.0 –> 1). If False, all numeric data will be read in as floats: Excel stores all numbers as floats internally

Returns parsed : DataFrame or Dict of DataFrames

DataFrame from the passed in Excel file. See notes in sheet_name argument for more information on when a Dict of Dataframes is returned.
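
A minimal usage sketch of the prototype above; the file 'datas/movies.xlsx' and the sheet name are hypothetical.

df_xls = pd.read_excel('datas/movies.xlsx', sheet_name='Sheet1',
                       header=0, thousands=',')
df_xls.head()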

Copying Data


In [124]:
df = df_movies.copy()
df.head(3)


Out[124]:
Row Date Day Day# Top 10 Gross #1 Movie Gross
0 1 Jan. 30 Fri 30 26168351 American Sniper 9905616
1 2 Jan. 31 Sat 31 41633588 American Sniper 16510536
2 3 Feb. 1 Sun 32 12515579 American Sniper 4244376

Selecting Data


In [125]:
# Show the first rows (5 by default)
df.head()


Out[125]:
Row Date Day Day# Top 10 Gross #1 Movie Gross
0 1 Jan. 30 Fri 30 26168351 American Sniper 9905616
1 2 Jan. 31 Sat 31 41633588 American Sniper 16510536
2 3 Feb. 1 Sun 32 12515579 American Sniper 4244376
3 4 Feb. 2 Mon 33 6475068 American Sniper 2645109
4 5 Feb. 3 Tue 34 7825091 American Sniper 2923141

In [126]:
# Show the first 3 rows explicitly
df.head(3)


Out[126]:
Row Date Day Day# Top 10 Gross #1 Movie Gross
0 1 Jan. 30 Fri 30 26168351 American Sniper 9905616
1 2 Jan. 31 Sat 31 41633588 American Sniper 16510536
2 3 Feb. 1 Sun 32 12515579 American Sniper 4244376

In [127]:
# Show the last rows (5 by default)
df.tail()


Out[127]:
Row Date Day Day# Top 10 Gross #1 Movie Gross
24 25 Feb. 23 Mon 54 7385671 Fifty Shades of Grey 1846390
25 26 Feb. 24 Tue 55 9424126 Fifty Shades of Grey 2265910
26 27 Feb. 25 Wed 56 6862942 Fifty Shades of Grey 1772230
27 28 Feb. 26 Thu 57 7161773 Fifty Shades of Grey 1790520
28 29 Feb. 27 Fri 58 26457000 Focus (2015) 6465000

In [128]:
# Show the last 2 rows explicitly
df.tail(2)


Out[128]:
Row Date Day Day# Top 10 Gross #1 Movie Gross
27 28 Feb. 26 Thu 57 7161773 Fifty Shades of Grey 1790520
28 29 Feb. 27 Fri 58 26457000 Focus (2015) 6465000

In [129]:
# Show only selected rows and columns (by position)
df.iloc[[1,3,5],[0,1,2,3]]


Out[129]:
Row Date Day Day#
1 2 Jan. 31 Sat 31
3 4 Feb. 2 Mon 33
5 6 Feb. 4 Wed 35

In [130]:
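# Select by label: rows labelled 1, 3, 5 and the 'Date' and 'Gross' columns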
df.loc[[1,3,5],['Date', 'Gross']]


Out[130]:
Date Gross
1 Jan. 31 16510536
3 Feb. 2 2645109
5 Feb. 4 2273342

Working with Cells


In [131]:
df = df_movies.copy()
# Assigning values to cells
# Assign a single cell
df.iloc[0, 6] = u'土豆之歌'
df.loc[df.index[1], u'Gross'] = 999
df.head(3)


Out[131]:
Row Date Day Day# Top 10 Gross #1 Movie Gross
0 1 Jan. 30 Fri 30 26168351 American Sniper 土豆之歌
1 2 Jan. 31 Sat 31 41633588 American Sniper 999
2 3 Feb. 1 Sun 32 12515579 American Sniper 4244376

In [132]:
# Assign multiple cells at once
df.loc[df.index[0:2], u'Gross'] = [100, 200]
df.head(3)


Out[132]:
Row Date Day Day# Top 10 Gross #1 Movie Gross
0 1 Jan. 30 Fri 30 26168351 American Sniper 100
1 2 Jan. 31 Sat 31 41633588 American Sniper 200
2 3 Feb. 1 Sun 32 12515579 American Sniper 4244376

Working with Columns

Renaming Columns

Using the columns attribute


In [133]:
df = df_movies.copy()
# Assign a list explicitly; its length must match the number of columns
# Example 1
df.columns = [u'Row', u'Date', u'WeekDay', u'Day', u'Top10Gross', u'No1Moive', u'Gross']
df.head()


Out[133]:
Row Date WeekDay Day Top10Gross No1Moive Gross
0 1 Jan. 30 Fri 30 26168351 American Sniper 9905616
1 2 Jan. 31 Sat 31 41633588 American Sniper 16510536
2 3 Feb. 1 Sun 32 12515579 American Sniper 4244376
3 4 Feb. 2 Mon 33 6475068 American Sniper 2645109
4 5 Feb. 3 Tue 34 7825091 American Sniper 2923141

In [134]:
# Example 2: convert the names to lowercase
df.columns = [c.lower() for c in df.columns]
df.head()


Out[134]:
row date weekday day top10gross no1moive gross
0 1 Jan. 30 Fri 30 26168351 American Sniper 9905616
1 2 Jan. 31 Sat 31 41633588 American Sniper 16510536
2 3 Feb. 1 Sun 32 12515579 American Sniper 4244376
3 4 Feb. 2 Mon 33 6475068 American Sniper 2645109
4 5 Feb. 3 Tue 34 7825091 American Sniper 2923141

Using the rename method


In [135]:
# Example 1: convert the names to uppercase
df = df.rename(columns=lambda x: x.upper())
df.tail(3)


Out[135]:
ROW DATE WEEKDAY DAY TOP10GROSS NO1MOIVE GROSS
26 27 Feb. 25 Wed 56 6862942 Fifty Shades of Grey 1772230
27 28 Feb. 26 Thu 57 7161773 Fifty Shades of Grey 1790520
28 29 Feb. 27 Fri 58 26457000 Focus (2015) 6465000

In [136]:
# Example 2: rename specific columns
df = df.rename(columns={'DATE': u'日期', 'GROSS': u'票房'})
df.head()


Out[136]:
ROW 日期 WEEKDAY DAY TOP10GROSS NO1MOIVE 票房
0 1 Jan. 30 Fri 30 26168351 American Sniper 9905616
1 2 Jan. 31 Sat 31 41633588 American Sniper 16510536
2 3 Feb. 1 Sun 32 12515579 American Sniper 4244376
3 4 Feb. 2 Mon 33 6475068 American Sniper 2645109
4 5 Feb. 3 Tue 34 7825091 American Sniper 2923141

Printing Column Types


In [137]:
df.columns.to_series().groupby(df.dtypes).groups


Out[137]:
{dtype('int64'): Index(['ROW', 'DAY', 'TOP10GROSS', '票房'], dtype='object'),
 dtype('O'): Index(['日期', 'WEEKDAY', 'NO1MOIVE'], dtype='object')}

In [138]:
# Print column types (readable output even with Chinese column names)
types = df.columns.to_series().groupby(df.dtypes).groups
for key, value in types.items():
    print(key,':\t', ','.join(value))


int64 :	 ROW,DAY,TOP10GROSS,票房
object :	 日期,WEEKDAY,NO1MOIVE

Inserting Columns


In [139]:
df = df_movies.copy()
# Method 1: append a new column at the end
df['memo'] = pd.Series('', index=df.index)
df.head(3)


Out[139]:
Row Date Day Day# Top 10 Gross #1 Movie Gross memo
0 1 Jan. 30 Fri 30 26168351 American Sniper 9905616
1 2 Jan. 31 Sat 31 41633588 American Sniper 16510536
2 3 Feb. 1 Sun 32 12515579 American Sniper 4244376

In [140]:
# Method 2: insert a column at a given position
df = df_movies.copy()
df.insert(loc=1, column=u'year', value=u'2015')
df.head(3)


Out[140]:
Row year Date Day Day# Top 10 Gross #1 Movie Gross
0 1 2015 Jan. 30 Fri 30 26168351 American Sniper 9905616
1 2 2015 Jan. 31 Sat 31 41633588 American Sniper 16510536
2 3 2015 Feb. 1 Sun 32 12515579 American Sniper 4244376

In [141]:
# Create a new column from existing columns
df = df_movies.copy()
df.insert(loc = 5 , column=u'OtherGross', value=df[u'Top 10 Gross'] - df[u'Gross'])
df.head(3)


Out[141]:
Row Date Day Day# Top 10 Gross OtherGross #1 Movie Gross
0 1 Jan. 30 Fri 30 26168351 16262735 American Sniper 9905616
1 2 Jan. 31 Sat 31 41633588 25123052 American Sniper 16510536
2 3 Feb. 1 Sun 32 12515579 8271203 American Sniper 4244376

In [142]:
# Create several new columns from existing columns
df = df_movies.copy()
def process_date_col(text):
    # Split the date into month and day columns
    if pd.isnull(text):
        month = day = np.nan
    else:
        month, day = text.split('.')
    return pd.Series([month, day])

df[[u'month', u'day']] = df.Date.apply(process_date_col)
df.head()


Out[142]:
Row Date Day Day# Top 10 Gross #1 Movie Gross month day
0 1 Jan. 30 Fri 30 26168351 American Sniper 9905616 Jan 30
1 2 Jan. 31 Sat 31 41633588 American Sniper 16510536 Jan 31
2 3 Feb. 1 Sun 32 12515579 American Sniper 4244376 Feb 1
3 4 Feb. 2 Mon 33 6475068 American Sniper 2645109 Feb 2
4 5 Feb. 3 Tue 34 7825091 American Sniper 2923141 Feb 3

Changing Column Values


In [143]:
df = df_movies.copy()
# Transform a column based on its current values (here: reverse each title string)
df[u'#1 Movie'] = df[u'#1 Movie'].apply(lambda x: x[::-1])
df.head(3)


Out[143]:
Row Date Day Day# Top 10 Gross #1 Movie Gross
0 1 Jan. 30 Fri 30 26168351 repinS naciremA 9905616
1 2 Jan. 31 Sat 31 41633588 repinS naciremA 16510536
2 3 Feb. 1 Sun 32 12515579 repinS naciremA 4244376

In [144]:
# Change several columns at once
cols = [u'Gross', u'Top 10 Gross']
df[cols] = df[cols].applymap(lambda x: x/10000)
df.head(3)


Out[144]:
Row Date Day Day# Top 10 Gross #1 Movie Gross
0 1 Jan. 30 Fri 30 2616.8351 repinS naciremA 990.5616
1 2 Jan. 31 Sat 31 4163.3588 repinS naciremA 1651.0536
2 3 Feb. 1 Sun 32 1251.5579 repinS naciremA 424.4376

Working with Rows


In [145]:
df = df_movies.copy()
# Append an empty row
df = df.append(pd.Series(
                [np.nan]*len(df.columns), # Fill cells with NaNs
                index=df.columns),
                ignore_index=True)
df.tail(3)


Out[145]:
Row Date Day Day# Top 10 Gross #1 Movie Gross
27 28.0 Feb. 26 Thu 57.0 7161773.0 Fifty Shades of Grey 1790520.0
28 29.0 Feb. 27 Fri 58.0 26457000.0 Focus (2015) 6465000.0
29 NaN NaN NaN NaN NaN NaN NaN

Handling Missing Values (NaN)


In [146]:
# Count the rows that contain NaN
nans = df.shape[0] - df.dropna().shape[0]
print(u'%d row(s) contain NaN values' % nans)

# Fill NaN values with '无' (i.e. "none")
df.fillna(value=u'无', inplace=True)
df.tail()


1 row(s) contain NaN values
Out[146]:
Row Date Day Day# Top 10 Gross #1 Movie Gross
25 26 Feb. 24 Tue 55 9.42413e+06 Fifty Shades of Grey 2.26591e+06
26 27 Feb. 25 Wed 56 6.86294e+06 Fifty Shades of Grey 1.77223e+06
27 28 Feb. 26 Thu 57 7.16177e+06 Fifty Shades of Grey 1.79052e+06
28 29 Feb. 27 Fri 58 2.6457e+07 Focus (2015) 6.465e+06
29 无 无 无 无 无 无 无

Sorting


In [147]:
df = df_movies.copy()
# Append an empty row
df = df.append(pd.Series(
                [np.nan]*len(df.columns), # Fill cells with NaNs
                index=df.columns),
                ignore_index=True)
# Sort by one column (ascending)
df.sort_values(u'Gross', ascending=True, inplace=True)
df.head()


Out[147]:
Row Date Day Day# Top 10 Gross #1 Movie Gross
12 13.0 Feb. 11 Wed 42.0 6138013.0 American Sniper 1468160.0
13 14.0 Feb. 12 Thu 43.0 5969515.0 SpongeBob 1527552.0
26 27.0 Feb. 25 Wed 56.0 6862942.0 Fifty Shades of Grey 1772230.0
27 28.0 Feb. 26 Thu 57.0 7161773.0 Fifty Shades of Grey 1790520.0
24 25.0 Feb. 23 Mon 54.0 7385671.0 Fifty Shades of Grey 1846390.0

In [148]:
# Rebuild the index after sorting
df.index = range(1,len(df.index)+1)
df.head()


Out[148]:
Row Date Day Day# Top 10 Gross #1 Movie Gross
1 13.0 Feb. 11 Wed 42.0 6138013.0 American Sniper 1468160.0
2 14.0 Feb. 12 Thu 43.0 5969515.0 SpongeBob 1527552.0
3 27.0 Feb. 25 Wed 56.0 6862942.0 Fifty Shades of Grey 1772230.0
4 28.0 Feb. 26 Thu 57.0 7161773.0 Fifty Shades of Grey 1790520.0
5 25.0 Feb. 23 Mon 54.0 7385671.0 Fifty Shades of Grey 1846390.0
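
reset_index(drop=True) is an alternative way to rebuild the index after sorting; it produces a 0-based RangeIndex instead of one starting at 1 (a small sketch):

df = df.reset_index(drop=True)
df.head()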

Filtering


In [149]:
df = df_movies.copy()
# Filter by column dtype
# Select only the object (string) columns
df.loc[:, (df.dtypes == np.dtype('O')).values].head()


Out[149]:
Date Day #1 Movie
0 Jan. 30 Fri American Sniper
1 Jan. 31 Sat American Sniper
2 Feb. 1 Sun American Sniper
3 Feb. 2 Mon American Sniper
4 Feb. 3 Tue American Sniper

In [150]:
# Select rows where 'Gross' is NaN (first set two cells to NaN)
df.iloc[0, 6] = np.nan
df.iloc[3, 6] = np.nan
df[df[u'Gross'].isnull()].head()


Out[150]:
Row Date Day Day# Top 10 Gross #1 Movie Gross
0 1 Jan. 30 Fri 30 26168351 American Sniper NaN
3 4 Feb. 2 Mon 33 6475068 American Sniper NaN

In [151]:
# Select rows where 'Gross' is not NaN
df[df[u'Gross'].notnull()].head()


Out[151]:
Row Date Day Day# Top 10 Gross #1 Movie Gross
1 2 Jan. 31 Sat 31 41633588 American Sniper 16510536.0
2 3 Feb. 1 Sun 32 12515579 American Sniper 4244376.0
4 5 Feb. 3 Tue 34 7825091 American Sniper 2923141.0
5 6 Feb. 4 Wed 35 5819529 American Sniper 2273342.0
6 7 Feb. 5 Thu 36 6165344 American Sniper 2506106.0

In [152]:
# Filter on two conditions combined with OR
df[ (df[u'Day'] == u'Sat') | (df[u'Day#'] <= 32) ]


Out[152]:
Row Date Day Day# Top 10 Gross #1 Movie Gross
0 1 Jan. 30 Fri 30 26168351 American Sniper NaN
1 2 Jan. 31 Sat 31 41633588 American Sniper 16510536.0
2 3 Feb. 1 Sun 32 12515579 American Sniper 4244376.0
8 9 Feb. 7 Sat 38 59153298 SpongeBob 24086968.0
15 16 Feb. 14 Sat 45 87900659 Fifty Shades of Grey 36752460.0
22 23 Feb. 21 Sat 52 43708356 Fifty Shades of Grey 8991100.0

In [153]:
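# Filter on two conditions combined with AND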
df[ (df[u'Day'] == u'Sat') & (df[u'Day#'] <= 32) ]


Out[153]:
Row Date Day Day# Top 10 Gross #1 Movie Gross
1 2 Jan. 31 Sat 31 41633588 American Sniper 16510536.0

Slicing
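
A minimal sketch of the usual slicing idioms, using a fresh copy of df_movies:

df = df_movies.copy()
print(df[0:3])                           # first three rows, by position
print(df.iloc[2:5, 0:3])                 # rows 2-4 and the first three columns, by position
print(df.loc[0:2, [u'Date', u'Gross']])  # label-based slice; the end label is inclusive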

Concatenating


In [159]:
# Concatenate horizontally (columns side by side)
df_a = df.filter(regex='D', axis=1);
print(df_a.head())
print('============================')
df_b = df.filter(regex='ss', axis=1);
print(df_b.head())
print('============================')
df_c = pd.concat([df_a, df_b],axis=1)
print(df_c.head())


      Date  Day  Day#
0  Jan. 30  Fri    30
1  Jan. 31  Sat    31
2   Feb. 1  Sun    32
3   Feb. 2  Mon    33
4   Feb. 3  Tue    34
============================
   Top 10 Gross       Gross
0      26168351         NaN
1      41633588  16510536.0
2      12515579   4244376.0
3       6475068         NaN
4       7825091   2923141.0
============================
      Date  Day  Day#  Top 10 Gross       Gross
0  Jan. 30  Fri    30      26168351         NaN
1  Jan. 31  Sat    31      41633588  16510536.0
2   Feb. 1  Sun    32      12515579   4244376.0
3   Feb. 2  Mon    33       6475068         NaN
4   Feb. 3  Tue    34       7825091   2923141.0
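
Vertical concatenation (stacking rows) uses the same function with axis=0; a minimal sketch that stacks the first and last two rows of df:

df_top = df.head(2)
df_bottom = df.tail(2)
print(pd.concat([df_top, df_bottom], axis=0))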

Statistics: Count, Mean, Max, Min, Variance, Standard Deviation
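
A minimal sketch of these statistics on the Gross column of df_movies; describe() reports most of them in a single call:

df = df_movies.copy()
gross = df[u'Gross']
print(gross.count(), gross.mean(), gross.max(), gross.min(), gross.var(), gross.std())
gross.describe()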

Year-over-Year and Period-over-Period Change
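
A minimal sketch of period-over-period change using pct_change; with this daily data a one-row shift gives day-over-day growth. Year-over-year change would need a longer, fixed-frequency series, e.g. pct_change(periods=12) on monthly data (not available in this dataset).

df = df_movies.copy()
# Day-over-day growth rate of the #1 movie's gross
df[u'GrossDoD'] = df[u'Gross'].pct_change()
df.head()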

Plotting


In [96]:
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
ts.plot()


Out[96]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f99aeb8dc50>

In [98]:
df = df_movies.copy()
df.head()


Out[98]:
Row Date Day Day# Top 10 Gross #1 Movie Gross
0 1 Jan. 30 Fri 30 26168351 American Sniper 9905616
1 2 Jan. 31 Sat 31 41633588 American Sniper 16510536
2 3 Feb. 1 Sun 32 12515579 American Sniper 4244376
3 4 Feb. 2 Mon 33 6475068 American Sniper 2645109
4 5 Feb. 3 Tue 34 7825091 American Sniper 2923141

In [99]:
df.plot(x='Date', y=['Top 10 Gross', 'Gross'])


Out[99]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f99aeb10908>

Updating Data with Another DataFrame


In [100]:
df_1 = df_movies.copy()
df_2 = pd.DataFrame({u'#1 Movie':[u'American Sniper',
                            u'SpongeBob',
                            u'Fifty Shades of Grey'],
                            u'chs':[u'美国阻击手',
                                    u'海绵宝宝',
                                    u'五十度灰']})
df_1.head()


Out[100]:
Row Date Day Day# Top 10 Gross #1 Movie Gross
0 1 Jan. 30 Fri 30 26168351 American Sniper 9905616
1 2 Jan. 31 Sat 31 41633588 American Sniper 16510536
2 3 Feb. 1 Sun 32 12515579 American Sniper 4244376
3 4 Feb. 2 Mon 33 6475068 American Sniper 2645109
4 5 Feb. 3 Tue 34 7825091 American Sniper 2923141

In [101]:
df_2.head()


Out[101]:
#1 Movie chs
0 American Sniper 美国阻击手
1 SpongeBob 海绵宝宝
2 Fifty Shades of Grey 五十度灰

In [102]:
pd.merge(df_1, df_2, on=u'#1 Movie').head()


Out[102]:
Row Date Day Day# Top 10 Gross #1 Movie Gross chs
0 1 Jan. 30 Fri 30 26168351 American Sniper 9905616 美国阻击手
1 2 Jan. 31 Sat 31 41633588 American Sniper 16510536 美国阻击手
2 3 Feb. 1 Sun 32 12515579 American Sniper 4244376 美国阻击手
3 4 Feb. 2 Mon 33 6475068 American Sniper 2645109 美国阻击手
4 5 Feb. 3 Tue 34 7825091 American Sniper 2923141 美国阻击手
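
pd.merge defaults to an inner join, so rows whose '#1 Movie' has no match in df_2 (e.g. 'Focus (2015)') are dropped from the result; a left join keeps them, with NaN in the chs column:

pd.merge(df_1, df_2, on=u'#1 Movie', how='left').tail()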

Exporting Data


In [103]:
# Export the Saturday rows to CSV
# df[ (df['Day'] == 'Sat') ].to_csv('test_tmp.csv', mode='w', encoding='utf-8', index=False)

# Append the Sunday rows (mode='a') to an existing file
# df[ (df['Day'] == 'Sun') ].to_csv('test_output.csv', mode='a', header=False, encoding='utf-8', index=False)

In [104]:
# Export as a dict
# DataFrame.to_dict accepts 'dict', 'list', 'series', 'split', 'records', 'index'
df = pd.DataFrame({'AAA' : [4,5,6,7], 'ABB' : [10,20,30,40],'CCC' : [100,50,-30,-50]}); df


Out[104]:
AAA ABB CCC
0 4 10 100
1 5 20 50
2 6 30 -30
3 7 40 -50

In [105]:
df.to_dict('records')


Out[105]:
[{'AAA': 4, 'ABB': 10, 'CCC': 100},
 {'AAA': 5, 'ABB': 20, 'CCC': 50},
 {'AAA': 6, 'ABB': 30, 'CCC': -30},
 {'AAA': 7, 'ABB': 40, 'CCC': -50}]

In [106]:
# Keep only the columns whose names match a pattern
df_A = df.filter(regex='A', axis=1); df_A


Out[106]:
AAA ABB
0 4 10
1 5 20
2 6 30
3 7 40