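Reading Notes

Author: 方跃文 (Fang Yuewen)

Email: fyuewen@gmail.com

Date: started September 12, 2017; finished:

The Chapter 2 notes were started on September 12, 2017; the first phase ended on the evening of September 28, 2017 (two analysis case studies remaining).

Chapter 2: Introduction

Date: September 12, 2017

Although the goals and domains of data processing vary widely, working with data in Python almost always involves the following broad groups of tasks:

1) Interacting with the outside world: reading and writing data

2) Preparation: cleaning, munging, normalizing, reshaping, and slicing and dicing the data

3) Transformation: applying mathematical and statistical operations to data sets to derive new data sets, e.g., aggregating a large table by group variables

4) Modeling and computation: connecting the data to statistical models and machine learning algorithms

5) Presentation: creating interactive or static graphics or textual summaries.

1.usa.gov data from bit.ly

The file usagov_bitly_data2012-03-16-1331923249.txt in ch02 is an hourly snapshot of data collected by bit.ly. Each line of the file is in JavaScript Object Notation (JSON), a common web data format. For example, reading only the first line of the file gives the following: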



In [1]:

    
path = "./pydata-book/ch02/usagov_bitly_data2012-03-16-1331923249.txt"
open(path).readline()









    Out[1]:





'{ "a": "Mozilla\\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\\/535.11 (KHTML, like Gecko) Chrome\\/17.0.963.78 Safari\\/535.11", "c": "US", "nk": 1, "tz": "America\\/New_York", "gr": "MA", "g": "A6qOVH", "h": "wfLQtf", "l": "orofrog", "al": "en-US,en;q=0.8", "hh": "1.usa.gov", "r": "http:\\/\\/www.facebook.com\\/l\\/7AQEFzjSi\\/1.usa.gov\\/wfLQtf", "u": "http:\\/\\/www.ncbi.nlm.nih.gov\\/pubmed\\/22415991", "t": 1331923247, "hc": 1331822918, "cy": "Danvers", "ll": [ 42.576698, -70.954903 ] }\n'



In [9]:

    
print(path)
print(type(path))









    



./pydata-book/ch02/usagov_bitly_data2012-03-16-1331923249.txt
<class 'str'>



In [6]:

    
import json
datach02= [json.loads(line) for line in open(path)]

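Python has many built-in and third-party modules for converting a JSON string into a Python dictionary object. Here I use the json module and its loads function to load the downloaded data file line by line: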



In [3]:

    
import json
path = "./pydata-book/ch02/usagov_bitly_data2012-03-16-1331923249.txt"
records = [json.loads(line) for line in open(path)]

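The last line above is a list comprehension, a concise way to apply the same operation (such as json.loads) to a collection of strings or other objects. Iterating over an open file handle yields its lines one at a time. The resulting records object is a list of Python dictionaries.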
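To see the line-by-line parsing in isolation, here is a minimal sketch that does not need the data file. The two JSON lines in sample_text are made up for illustration, and io.StringIO stands in for the open file handle.

```python
import io
import json

# Two made-up JSON lines standing in for the real data file (an assumption).
sample_text = (
    '{"tz": "America/New_York", "nk": 1}\n'
    '{"tz": "", "nk": 0}\n'
)

# io.StringIO behaves like an open text file: iterating yields one line at a time.
with io.StringIO(sample_text) as handle:
    sample_records = [json.loads(line) for line in handle]

print(sample_records[0]["tz"])   # America/New_York
print(len(sample_records))       # 2
```

Each element of sample_records is an ordinary dict, so fields are read with the usual square-bracket lookup.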



In [4]:

    
records[0]









    Out[4]:





{'a': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.78 Safari/535.11',
 'al': 'en-US,en;q=0.8',
 'c': 'US',
 'cy': 'Danvers',
 'g': 'A6qOVH',
 'gr': 'MA',
 'h': 'wfLQtf',
 'hc': 1331822918,
 'hh': '1.usa.gov',
 'l': 'orofrog',
 'll': [42.576698, -70.954903],
 'nk': 1,
 'r': 'http://www.facebook.com/l/7AQEFzjSi/1.usa.gov/wfLQtf',
 't': 1331923247,
 'tz': 'America/New_York',
 'u': 'http://www.ncbi.nlm.nih.gov/pubmed/22415991'}



In [4]:

    
records[0]['tz']









    Out[4]:





'America/New_York'

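Counting time zones in pure Python

Date: September 26, 2017

Now suppose we want to count the time zones (the tz field). There are several ways to do this. First, we try extracting the list of time zones with a list comprehension: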



In [10]:

    
time_zones = [rec['tz'] for rec in records]









    



---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-10-f3fbbc37f129> in <module>()
----> 1 time_zones = [rec['tz'] for rec in records]

<ipython-input-10-f3fbbc37f129> in <listcomp>(.0)
----> 1 time_zones = [rec['tz'] for rec in records]

KeyError: 'tz'

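However, this raises a KeyError for 'tz', because not every record has a tz field. To handle this, we add an if condition to the list comprehension so that only records containing the key are used: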



In [11]:

    
time_zones = [i['tz'] for i in records if 'tz' in i]
time_zones[:2]









    Out[11]:





['America/New_York', 'America/Denver']
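As a side note, a missing key and an empty value are two different situations. The sketch below uses three made-up records (toy_records is a hypothetical name) to compare the `if 'tz' in rec` filter with dict.get, which keeps every record and substitutes a placeholder instead.

```python
toy_records = [
    {"tz": "America/New_York"},
    {"_heartbeat_": 1331923249},   # no 'tz' key at all
    {"tz": ""},                    # key present, value is an empty string
]

# Same filter as in the notes: skip records that lack the key.
present = [rec["tz"] for rec in toy_records if "tz" in rec]

# Alternative: keep every record and use a placeholder for missing keys.
with_default = [rec.get("tz", "Missing") for rec in toy_records]

print(present)        # ['America/New_York', '']
print(with_default)   # ['America/New_York', 'Missing', '']
```

Note that the empty string survives both versions, since the key exists in that record.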



In [12]:

    
time_zones = [rec['tz'] for rec in records if 'tz' in rec]
time_zones[:10]









    Out[12]:





['America/New_York',
 'America/Denver',
 'America/New_York',
 'America/Sao_Paulo',
 'America/New_York',
 'America/New_York',
 'Europe/Warsaw',
 '',
 '',
 '']

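As the output shows, some of the time zone fields are indeed empty strings. To count the time zones, two approaches are described here.

Approach 1: using only the Python standard library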



In [7]:

    
# This approach keeps the counts in a dict while iterating over the time zones:

def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts
# Looking back at this code today, I found it hard to follow, especially in the
# cell below, where reusing it gave a result that puzzled me at first.
# 11th Jan. 2018



In [21]:

    
def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts

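# Note: braces create a set, so the duplicate 23 is discarded before counting.
# That is why a[23] below is 1; with a list ([1, 23, 434, 53, 23, 24]) it would be 2.
sequence1 = {1, 23, 434, 53, 23, 24}
a = get_counts(sequence1)
a[23]

# 11th Jan. 2018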









    Out[21]:





1
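The result of 1 comes from the curly braces, which build a set rather than a list. A minimal sketch of the difference follows; count_items is a hypothetical helper with the same logic as get_counts.

```python
def count_items(sequence):
    """Count occurrences of each item (same logic as get_counts)."""
    counts = {}
    for x in sequence:
        counts[x] = counts.get(x, 0) + 1
    return counts

as_set = {1, 23, 434, 53, 23, 24}    # braces build a set: the second 23 is discarded
as_list = [1, 23, 434, 53, 23, 24]   # brackets build a list: both 23s are kept

print(len(as_set), len(as_list))     # 5 6
print(count_items(as_set)[23])       # 1
print(count_items(as_list)[23])      # 2
```

Since a set never holds duplicates, every count computed from a set is 1.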

If you know the Python standard library well, the code above can be written more concisely:



In [8]:

    
from collections import defaultdict

def get_counts2(sequence):
    counts = defaultdict(int)  # missing keys start at 0
    for x in sequence:
        counts[x] += 1
    return counts
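As a quick check, the sketch below runs the plain-dict version and the defaultdict version on a small made-up list (toy_zones) and confirms they agree. It also shows one behavior worth knowing: merely looking up a missing key in a defaultdict inserts that key.

```python
from collections import defaultdict

def counts_plain(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts

def counts_defaultdict(sequence):
    counts = defaultdict(int)  # a missing key starts at int() == 0
    for x in sequence:
        counts[x] += 1
    return counts

toy_zones = ["America/New_York", "", "America/Denver", "America/New_York", ""]
plain = counts_plain(toy_zones)
dd = counts_defaultdict(toy_zones)

same = plain == dict(dd)
print(same)                      # True

# Side effect: reading a missing key from a defaultdict creates it with value 0.
missing_value = dd["Europe/Paris"]
print(missing_value)             # 0
print("Europe/Paris" in dd)      # True
print("Europe/Paris" in plain)   # False
```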

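Both versions above wrap the logic in a function. This makes the code more reusable and more convenient for processing the time zones. Here we only need to pass in the time_zones list: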



In [9]:

    
def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts

counts = get_counts(time_zones)
counts['America/New_York']









    Out[9]:





1251



In [10]:

    
len(time_zones)









    Out[10]:





3440

To get the top 10 time zones and their counts, we need a little dictionary manipulation:



In [11]:

    
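def top_counts(count_dict, n=10):
    # Build (count, timezone) pairs so that sorting orders them by count.
    value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
    value_key_pairs.sort()
    # The list is in ascending order, so the last n entries are the largest.
    return value_key_pairs[-n:]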

top_counts(counts)









    Out[11]:





[(33, 'America/Sao_Paulo'),
 (35, 'Europe/Madrid'),
 (36, 'Pacific/Honolulu'),
 (37, 'Asia/Tokyo'),
 (74, 'Europe/London'),
 (191, 'America/Denver'),
 (382, 'America/Los_Angeles'),
 (400, 'America/Chicago'),
 (521, ''),
 (1251, 'America/New_York')]
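The same top-n selection can also be written without swapping keys and values. The sketch below is an alternative, not the approach used in the notes: it sorts with a key function, and also uses heapq.nlargest from the standard library. The toy_counts dict and the function name top_counts_desc are made up for illustration.

```python
import heapq

toy_counts = {"America/New_York": 5, "": 3, "America/Chicago": 2, "Asia/Tokyo": 1}

def top_counts_desc(count_dict, n=10):
    # Sort (timezone, count) pairs by count, largest first, and keep the first n.
    return sorted(count_dict.items(), key=lambda pair: pair[1], reverse=True)[:n]

top2 = top_counts_desc(toy_counts, n=2)

# heapq.nlargest avoids sorting the whole collection when n is small.
top2_heap = heapq.nlargest(2, toy_counts.items(), key=lambda pair: pair[1])

print(top2)        # [('America/New_York', 5), ('', 3)]
print(top2_heap)   # [('America/New_York', 5), ('', 3)]
```

Unlike top_counts, these versions return the largest entry first.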

The collections.Counter class in the Python standard library makes this task even simpler:



In [12]:

    
from collections import Counter
counts = Counter(time_zones)



In [13]:

    
counts.most_common(10)









    Out[13]:





[('America/New_York', 1251),
 ('', 521),
 ('America/Chicago', 400),
 ('America/Los_Angeles', 382),
 ('America/Denver', 191),
 ('Europe/London', 74),
 ('Asia/Tokyo', 37),
 ('Pacific/Honolulu', 36),
 ('Europe/Madrid', 35),
 ('America/Sao_Paulo', 33)]

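Approach 2: counting time zones with pandas

DataFrame is the most important data structure in pandas; it represents data as a table. Creating a DataFrame from a list of raw records is straightforward: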



In [14]:

    
from pandas import DataFrame, Series
import pandas as pd
import numpy as np
frame = DataFrame(records)
frame









    Out[14]:







  
    
      
(summary view, abridged to the first five and last five rows; columns wrapped)

      _heartbeat_  a
0     NaN          Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...
1     NaN          GoogleMaps/RochesterNY
2     NaN          Mozilla/4.0 (compatible; MSIE 8.0; Windows NT ...
3     NaN          Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)...
4     NaN          Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...
...   ...          ...
3555  NaN          Mozilla/4.0 (compatible; MSIE 9.0; Windows NT ...
3556  NaN          Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.1...
3557  NaN          GoogleMaps/RochesterNY
3558  NaN          GoogleProducer
3559  NaN          Mozilla/4.0 (compatible; MSIE 8.0; Windows NT ...

      al              c    cy             g       gr   h       hc            hh         kw   l
0     en-US,en;q=0.8  US   Danvers        A6qOVH  MA   wfLQtf  1.331823e+09  1.usa.gov  NaN  orofrog
1     NaN             US   Provo          mwszkS  UT   mwszkS  1.308262e+09  j.mp       NaN  bitly
2     en-US           US   Washington     xxr3Qb  DC   xxr3Qb  1.331920e+09  1.usa.gov  NaN  bitly
3     pt-br           BR   Braz           zCaLwp  27   zUtuOu  1.331923e+09  1.usa.gov  NaN  alelex88
4     en-US,en;q=0.8  US   Shrewsbury     9b6kNl  MA   9b6kNl  1.273672e+09  bit.ly     NaN  bitly
...   ...             ...  ...            ...     ...  ...     ...           ...        ...  ...
3555  en              US   Paramus        e5SvKE  NJ   fqPSr9  1.301298e+09  1.usa.gov  NaN  tweetdeckapi
3556  en-US,en;q=0.8  US   Oklahoma City  jQLtP4  OK   jQLtP4  1.307530e+09  1.usa.gov  NaN  bitly
3557  NaN             US   Provo          mwszkS  UT   mwszkS  1.308262e+09  j.mp       NaN  bitly
3558  NaN             US   Mountain View  zjtI4X  CA   zjtI4X  1.327529e+09  1.usa.gov  NaN  bitly
3559  en-US           US   Mc Lean        qxKrTK  VA   qxKrTK  1.312898e+09  1.usa.gov  NaN  bitly

      ll                        nk   r
0     [42.576698, -70.954903]   1.0  http://www.facebook.com/l/7AQEFzjSi/1.usa.gov/...
1     [40.218102, -111.613297]  0.0  http://www.AwareMap.com/
2     [38.9007, -77.043098]     1.0  http://t.co/03elZC4Q
3     [-23.549999, -46.616699]  0.0  direct
4     [42.286499, -71.714699]   0.0  http://www.shrewsbury-ma.gov/selco/
...   ...                       ...  ...
3555  [40.9445, -74.07]         1.0  direct
3556  [35.4715, -97.518997]     0.0  http://www.facebook.com/l.php?u=http%3A%2F%2F1...
3557  [40.218102, -111.613297]  0.0  http://www.AwareMap.com/
3558  [37.419201, -122.057404]  0.0  direct
3559  [38.935799, -77.162102]   0.0  http://t.co/OEEEvwjU

      t             tz                   u
0     1.331923e+09  America/New_York     http://www.ncbi.nlm.nih.gov/pubmed/22415991
1     1.331923e+09  America/Denver       http://www.monroecounty.gov/etc/911/rss.php
2     1.331923e+09  America/New_York     http://boxer.senate.gov/en/press/releases/0316...
3     1.331923e+09  America/Sao_Paulo    http://apod.nasa.gov/apod/ap120312.html
4     1.331923e+09  America/New_York     http://www.shrewsbury-ma.gov/egov/gallery/1341...
...   ...           ...                  ...
3555  1.331927e+09  America/New_York     http://www.fda.gov/AdvisoryCommittees/Committe...
3556  1.331927e+09  America/Chicago      http://www.okc.gov/PublicNotificationSystem/Fo...
3557  1.331927e+09  America/Denver       http://www.monroecounty.gov/etc/911/rss.php
3558  1.331927e+09  America/Los_Angeles  http://www.ahrq.gov/qual/qitoolkit/
3559  1.331927e+09  America/New_York     http://herndon-va.gov/Content/public_safety/Pu...

3560 rows × 18 columns



In [15]:

    
frame['tz'][:10]









    Out[15]:





0     America/New_York
1       America/Denver
2     America/New_York
3    America/Sao_Paulo
4     America/New_York
5     America/New_York
6        Europe/Warsaw
7                     
8                     
9                     
Name: tz, dtype: object

The output shown for frame is a summary view, which pandas uses for large DataFrame objects. The Series returned by frame['tz'] has a value_counts method that gives us the information we need:



In [16]:

    
tz_counts = frame['tz'].value_counts()
tz_counts[:10]









    Out[16]:





America/New_York       1251
                        521
America/Chicago         400
America/Los_Angeles     382
America/Denver          191
Europe/London            74
Asia/Tokyo               37
Pacific/Honolulu         36
Europe/Madrid            35
America/Sao_Paulo        33
Name: tz, dtype: int64
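The sketch below reproduces this behavior on a tiny made-up list of records (toy_records is a hypothetical name). It shows that keys missing from a record become NaN in the DataFrame, and that value_counts skips NaN by default while still counting empty strings.

```python
import pandas as pd

toy_records = [
    {"tz": "America/New_York", "a": "Mozilla/5.0"},
    {"tz": "", "a": "Opera/9.80"},
    {"_heartbeat_": 1331923249},          # no 'tz' or 'a' key
    {"tz": "America/New_York", "a": "Mozilla/4.0"},
]

toy_frame = pd.DataFrame(toy_records)

# Keys missing from a record become NaN in that row.
n_missing = int(toy_frame["tz"].isna().sum())
print(n_missing)   # 1

# value_counts sorts by frequency (descending) and skips NaN by default.
toy_tz_counts = toy_frame["tz"].value_counts()
print(toy_tz_counts)
```

This is why the empty string appears in the counts above, while records without a tz field do not.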

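Next, we want to plot this data with matplotlib. First, we fill in a substitute value for unknown or missing time zones in the records. The fillna method replaces missing values (NA), while unknown values (empty strings) can be replaced using boolean array indexing: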



In [17]:

    
clean_tz = frame['tz'].fillna('Missing')



In [18]:

    
clean_tz[clean_tz == ''] = 'Unknown'



In [19]:

    
tz_counts = clean_tz.value_counts()



In [20]:

    
tz_counts[:10]









    Out[20]:





America/New_York       1251
Unknown                 521
America/Chicago         400
America/Los_Angeles     382
America/Denver          191
Missing                 120
Europe/London            74
Asia/Tokyo               37
Pacific/Honolulu         36
Europe/Madrid            35
Name: tz, dtype: int64
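The two cleaning steps can be tried on a small made-up Series (toy_tz is a hypothetical name). The sketch also shows an equivalent one-liner that chains fillna with replace; this is an alternative, not what the notes use.

```python
import numpy as np
import pandas as pd

toy_tz = pd.Series(["America/New_York", "", np.nan, "America/Denver", ""])

# Step 1: NaN -> 'Missing'. Step 2: empty string -> 'Unknown' via a boolean mask.
cleaned = toy_tz.fillna("Missing")
cleaned[cleaned == ""] = "Unknown"

# Equivalent alternative: chain fillna with an exact-value replace.
cleaned_alt = toy_tz.fillna("Missing").replace("", "Unknown")

print(cleaned.tolist())
# ['America/New_York', 'Unknown', 'Missing', 'America/Denver', 'Unknown']
```

The order of the two steps does not matter here, because NaN never compares equal to the empty string.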

Calling the plot method of the tz_counts object gives a horizontal bar chart:



In [21]:

    
%matplotlib inline
tz_counts[:10].plot(kind='barh', rot=0)









    Out[21]:





<matplotlib.axes._subplots.AxesSubplot at 0x2100edc56d8>

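There is much more we can do with this data. For example, the a field contains information about the browser, device, or application used to perform the URL shortening: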



In [22]:

    
frame['a'][1]









    Out[22]:





'GoogleMaps/RochesterNY'



In [23]:

    
frame['a'][50]









    Out[23]:





'Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/20100101 Firefox/10.0.2'



In [24]:

    
frame['a'][51]









    Out[24]:





'Mozilla/5.0 (Linux; U; Android 2.2.2; en-us; LG-P925/V10e Build/FRG83G) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1'

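Parsing all the information in these "agent" strings (the browser's User-Agent) is tedious work. However, once we are comfortable with Python's built-in string functions and regular expressions, it becomes much easier.

For example, we can split off the first token of the string (which corresponds roughly to the browser) and produce another summary of user behavior: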



In [25]:

    
results = Series([x.split()[0] for x in frame.a.dropna()])



In [26]:

    
results[:5]









    Out[26]:





0               Mozilla/5.0
1    GoogleMaps/RochesterNY
2               Mozilla/4.0
3               Mozilla/5.0
4               Mozilla/5.0
dtype: object



In [27]:

    
results.value_counts()[:8]









    Out[27]:





Mozilla/5.0                 2594
Mozilla/4.0                  601
GoogleMaps/RochesterNY       121
Opera/9.80                    34
TEST_INTERNET_AGENT           24
GoogleProducer                21
Mozilla/6.0                    5
BlackBerry8520/5.0.0.681       4
dtype: int64
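The regular expressions mentioned earlier can go one step further than split and separate the product name from its version. The following is a minimal sketch: the pattern PRODUCT_RE and the helper parse_product are hypothetical, and the pattern only handles a leading "name/version" token whose version consists of digits and dots.

```python
import re

# Product name, a slash, then a version made of digits and dots at the start of the string.
PRODUCT_RE = re.compile(r"^(?P<name>[^/\s]+)/(?P<version>[\d.]+)")

def parse_product(agent):
    """Return (name, version) for the leading product token, or None if it does not match."""
    match = PRODUCT_RE.match(agent)
    if match is None:
        return None
    return match.group("name"), match.group("version")

print(parse_product("Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/20100101 Firefox/10.0.2"))
# ('Mozilla', '5.0')
print(parse_product("GoogleMaps/RochesterNY"))
# None  (the part after the slash is not a numeric version)
```

Agents that do not follow the "name/version" convention, such as TEST_INTERNET_AGENT, simply return None and would need separate handling.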

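Now suppose we want to break down the time zone counts into Windows and non-Windows users. For simplicity, we treat a user as a Windows user whenever the agent string contains "Windows". Since some agents are missing, we first remove those records from the data: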



In [28]:

    
cframe = frame[frame.a.notnull()]

Next, we compute from the a value whether each row is a Windows row:



In [29]:

    
operating_system = np.where(cframe['a'].str.contains('Windows'), 'Windows','Not Windows')



In [30]:

    
operating_system[:5]  # Note: the output of this line differs from the book's









    Out[30]:





array(['Windows', 'Not Windows', 'Windows', 'Not Windows', 'Windows'], 
      dtype='<U11')
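
To make the labeling step concrete, here is a small sketch on made-up data. The frame toy and its agent strings are hypothetical; only the calls (notnull, str.contains, np.where) mirror the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical agent strings; the None stands in for a missing 'a' value.
toy = pd.DataFrame({"a": [
    "Mozilla/5.0 (Windows NT 6.1; WOW64)",
    "GoogleMaps/RochesterNY",
    None,
    "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1)",
]})

# Drop rows with a missing agent first, as cframe does above.
toy_clean = toy[toy["a"].notnull()]

# str.contains yields one boolean per row; np.where maps True/False to labels.
is_windows = toy_clean["a"].str.contains("Windows")
toy_os = np.where(is_windows, "Windows", "Not Windows")
```

Removing missing agents before calling str.contains matters because str.contains returns a missing value, not False, for a missing string.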

Next, we can group the data by time zone and the newly obtained list of operating systems:



In [31]:

    
by_tz_os = cframe.groupby(['tz', operating_system])

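Then we count the grouped results with size (similar to the value_counts function above), reshape the counts with unstack, and fill the missing combinations with 0: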



In [32]:

    
agg_counts = by_tz_os.size().unstack().fillna(0)



In [33]:

    
agg_counts[:10]









    Out[33]:







  
    
      
                                  Not Windows  Windows
tz
                                        245.0    276.0
Africa/Cairo                            0.0      3.0
Africa/Casablanca                       0.0      1.0
Africa/Ceuta                            0.0      2.0
Africa/Johannesburg                     0.0      1.0
Africa/Lusaka                           0.0      1.0
America/Anchorage                       4.0      1.0
America/Argentina/Buenos_Aires          1.0      0.0
America/Argentina/Cordoba               0.0      1.0
America/Argentina/Mendoza               0.0      1.0
Finally, let's select the most frequent time zones. To do this, we build an indirect index array from the row totals of agg_counts:



In [34]:

    
# Used to sort in ascending order
indexer = agg_counts.sum(1).argsort()



In [35]:

    
indexer[:10]









    Out[35]:





tz
                                  24
Africa/Cairo                      20
Africa/Casablanca                 21
Africa/Ceuta                      92
Africa/Johannesburg               87
Africa/Lusaka                     53
America/Anchorage                 54
America/Argentina/Buenos_Aires    57
America/Argentina/Cordoba         26
America/Argentina/Mendoza         55
dtype: int64

Then we use take to select the rows in that order and slice off the last 10 rows:



In [36]:

    
count_subset = agg_counts.take(indexer)[-10:]



In [37]:

    
count_subset









    Out[37]:







  
    
      
                     Not Windows  Windows
tz
America/Sao_Paulo           13.0     20.0
Europe/Madrid               16.0     19.0
Pacific/Honolulu             0.0     36.0
Asia/Tokyo                   2.0     35.0
Europe/London               43.0     31.0
America/Denver             132.0     59.0
America/Los_Angeles        130.0    252.0
America/Chicago            115.0    285.0
                           245.0    276.0
America/New_York           339.0    912.0
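A bar chart can be made from this. We pass stacked=True to get a stacked bar chart; the rows are first normalized to sum to 1 so that the relative share of Windows users in each time zone is easier to compare: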



In [38]:

    
%matplotlib inline
normed_subset = count_subset.div(count_subset.sum(1), axis=0)



In [39]:

    
normed_subset.plot(kind='barh', stacked = True)









    Out[39]:





<matplotlib.axes._subplots.AxesSubplot at 0x2100f219f28>
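
The normalization step divides every row by its own total. A minimal sketch with made-up counts (toy_subset is hypothetical) shows the effect without plotting anything:

```python
import pandas as pd

toy_subset = pd.DataFrame(
    {"Not Windows": [1.0, 30.0], "Windows": [3.0, 10.0]},
    index=["Asia/Tokyo", "America/Denver"],
)

# sum(axis=1) gives one total per row; axis=0 in div aligns those totals
# with the row index, so each row is divided by its own total.
toy_normed = toy_subset.div(toy_subset.sum(axis=1), axis=0)
```

After this step each row holds proportions, for example 0.25 and 0.75 for Asia/Tokyo, so stacked bars all have the same length and only the split differs.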

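All of the methods used here are covered in detail in later chapters of the book. (I think the author should have said this earlier. I was hesitant to keep reading, and it turns out this was just one long illustrative example.)

MovieLens 1M Data Set

GroupLens Research collected movie rating data provided by MovieLens users from the late 1990s through the early 2000s. The data include movie ratings, movie metadata (genres and year), and demographic data about the users (gender, age, and so on). Recommender systems based on machine learning algorithms are typically interested in this kind of data. Although this book does not cover machine learning techniques in detail, it does show how to slice and dice such data to fit our needs.

The MovieLens 1M data set contains 1 million ratings of 4,000 movies from 6,000 users. It is spread across three tables: ratings, user information, and movie information. Each table can be loaded into a pandas DataFrame object with pandas.read_table: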



In [40]:

    
import pandas as pd

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']

users = pd.read_table('pydata-book/ch02/movielens/users.dat', sep='::', 
                      header=None, names = unames)

rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('pydata-book/ch02/movielens/ratings.dat', sep='::',
                       header=None, names = rnames)

mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('pydata-book/ch02/movielens/movies.dat', sep='::',
                      header=None, names = mnames)









    



D:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:6: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  
D:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:10: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  # Remove the CWD from sys.path while we load stuff.
D:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:14: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
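
The ParserWarning above says that the multi-character separator '::' forces pandas to fall back to the Python parsing engine, and that passing engine='python' avoids the warning. A minimal sketch of that, reading two sample rows from an in-memory string instead of users.dat (the variable raw is hypothetical, and pd.read_csv is used here in place of read_table with the same sep):

```python
import io

import pandas as pd

# Two sample rows in the same 'field::field' layout as users.dat.
raw = "1::F::1::10::48067\n2::M::56::16::70072\n"
unames = ["user_id", "gender", "age", "occupation", "zip"]

# engine='python' is stated explicitly because '::' is longer than one character.
toy_users = pd.read_csv(io.StringIO(raw), sep="::", header=None,
                        names=unames, engine="python")
```

The same engine='python' argument can be added to each of the three read_table calls above.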



In [41]:

    
users[:5]



In [42]:

    
ratings[:5]



In [43]:

    
movies[:5]









    Out[43]:







  
    
      
   movie_id                               title                        genres
0         1                    Toy Story (1995)   Animation|Children's|Comedy
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
2         3             Grumpier Old Men (1995)                Comedy|Romance
3         4            Waiting to Exhale (1995)                  Comedy|Drama
4         5  Father of the Bride Part II (1995)                        Comedy



In [88]:

    
ratings[:10]

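Note that age and occupation are given as codes. See the data set's README file for their meanings.

Analyzing data spread across three tables is not easy. Suppose we want to compute the mean rating of a movie by gender and age; this is much simpler once all the data are merged into a single table. We first use the pandas merge function to merge ratings with users, and then merge movies into the result. pandas infers which columns to use as the merge (or join) keys from the overlapping column names: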



In [45]:

    
data = pd.merge(pd.merge(ratings, users), movies)



In [89]:

    
data[:10]









    Out[89]:







  
    
      
   user_id  movie_id  rating  timestamp  gender  age  occupation    zip                                   title  genres
0        1      1193       5  978300760       F    1          10  48067  One Flew Over the Cuckoo's Nest (1975)   Drama
1        2      1193       5  978298413       M   56          16  70072  One Flew Over the Cuckoo's Nest (1975)   Drama
2       12      1193       4  978220179       M   25          12  32793  One Flew Over the Cuckoo's Nest (1975)   Drama
3       15      1193       4  978199279       M   25           7  22903  One Flew Over the Cuckoo's Nest (1975)   Drama
4       17      1193       5  978158471       M   50           1  95350  One Flew Over the Cuckoo's Nest (1975)   Drama
5       18      1193       4  978156168       F   18           3  95825  One Flew Over the Cuckoo's Nest (1975)   Drama
6       19      1193       5  982730936       M    1          10  48073  One Flew Over the Cuckoo's Nest (1975)   Drama
7       24      1193       5  978136709       F   25           7  10023  One Flew Over the Cuckoo's Nest (1975)   Drama
8       28      1193       3  978125194       F   25           1  14607  One Flew Over the Cuckoo's Nest (1975)   Drama
9       33      1193       5  978557765       M   45           3  55421  One Flew Over the Cuckoo's Nest (1975)   Drama
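Now we can aggregate the ratings grouped by any number of user or movie attributes. To compute the mean rating of each movie by gender, we can use the pivot_table method: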



In [64]:

    
# The code as printed in the book:
mean_ratings = data.pivot_table('rating',
                               rows='title', cols='gender',aggfunc='mean')









    



---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-64-df5506c0e75e> in <module>()
      1 mean_ratings = data.pivot_table('rating',
----> 2                                rows='title', cols='gender',aggfunc='mean')

TypeError: pivot_table() got an unexpected keyword argument 'rows'

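Clearly this does not run; judging from the error message, pivot_table has no 'rows' argument at all. I was about to give up on this code, but on a hunch I searched Google and found a discussion of this problem on Stack Overflow from last year.

The fix is to change

mean_ratings = data.pivot_table('rating', rows='title', cols='gender', aggfunc='mean')

to

mean_ratings = data.pivot_table('rating', index='title', columns='gender', aggfunc='mean')

The reason:

The code in the book uses old syntax that has since been removed.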



In [90]:

    
mean_ratings = data.pivot_table('rating', index='title', 
                                columns='gender', aggfunc='mean')



In [91]:

    
mean_ratings[:5]









    Out[91]:







  
    
gender                                F         M
title
$1,000,000 Duck (1971)         3.375000  2.761905
'Night Mother (1986)           3.388889  3.352941
'Til There Was You (1997)      2.675676  2.733333
'burbs, The (1989)             2.793478  2.962085
...And Justice for All (1979)  3.828571  3.689024
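The operation above produces another DataFrame containing the mean ratings, with movie titles as the row labels and gender as the column labels. Now we want to filter out movies that received fewer than 250 ratings. To do this, we first group the data by title and then use size() to get a Series containing the group size for each title: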



In [68]:

    
ratings_by_title = data.groupby('title').size()



In [69]:

    
ratings_by_title[0:10]









    Out[69]:





title
$1,000,000 Duck (1971)                37
'Night Mother (1986)                  70
'Til There Was You (1997)             52
'burbs, The (1989)                   303
...And Justice for All (1979)        199
1-900 (1994)                           2
10 Things I Hate About You (1999)    700
101 Dalmatians (1961)                565
101 Dalmatians (1996)                364
12 Angry Men (1957)                  616
dtype: int64



In [71]:

    
active_titles = ratings_by_title.index[ratings_by_title >= 250]



In [72]:

    
active_titles









    Out[72]:





Index([''burbs, The (1989)', '10 Things I Hate About You (1999)',
       '101 Dalmatians (1961)', '101 Dalmatians (1996)', '12 Angry Men (1957)',
       '13th Warrior, The (1999)', '2 Days in the Valley (1996)',
       '20,000 Leagues Under the Sea (1954)', '2001: A Space Odyssey (1968)',
       '2010 (1984)',
       ...
       'X-Men (2000)', 'Year of Living Dangerously (1982)',
       'Yellow Submarine (1968)', 'You've Got Mail (1998)',
       'Young Frankenstein (1974)', 'Young Guns (1988)',
       'Young Guns II (1990)', 'Young Sherlock Holmes (1985)',
       'Zero Effect (1998)', 'eXistenZ (1999)'],
      dtype='object', name='title', length=1216)
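
The counting and filtering steps can be sketched on six made-up rows. The threshold of 2 below is a stand-in for the 250 used on the real data, and toy_data is hypothetical.

```python
import pandas as pd

toy_data = pd.DataFrame({"title": ["A", "A", "A", "B", "C", "C"],
                         "rating": [5, 4, 3, 2, 4, 4]})

# size() counts the rows in each group, giving one count per title.
toy_sizes = toy_data.groupby("title").size()

# A boolean mask over the counts selects the matching index labels.
# Hypothetical threshold of 2 (the notes use 250 on the real data).
toy_active = toy_sizes.index[toy_sizes >= 2]
```

Title B has a single rating, so it is the only one left out of toy_active.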

The index obtained above contains the titles of movies with at least 250 ratings. We can now use it to select the required rows from mean_ratings:



In [105]:

    
mean_ratings = mean_ratings.ix[active_titles] 
# The book uses mean_ratings.ix here, but .ix has since been deprecated









    



D:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: DeprecationWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  """Entry point for launching an IPython kernel.



In [107]:

    
mean_ratings = mean_ratings.loc[active_titles]



In [108]:

    
mean_ratings









    Out[108]:







  
    
gender                                                                     F         M       diff
title
'burbs, The (1989)                                                  2.793478  2.962085   0.168607
10 Things I Hate About You (1999)                                   3.646552  3.311966  -0.334586
101 Dalmatians (1961)                                               3.791444  3.500000  -0.291444
101 Dalmatians (1996)                                               3.240000  2.911215  -0.328785
12 Angry Men (1957)                                                 4.184397  4.328421   0.144024
13th Warrior, The (1999)                                            3.112000  3.168000   0.056000
2 Days in the Valley (1996)                                         3.488889  3.244813  -0.244076
20,000 Leagues Under the Sea (1954)                                 3.670103  3.709205   0.039102
2001: A Space Odyssey (1968)                                        3.825581  4.129738   0.304156
2010 (1984)                                                         3.446809  3.413712  -0.033097
28 Days (2000)                                                      3.209424  2.977707  -0.231717
39 Steps, The (1935)                                                3.965517  4.107692   0.142175
54 (1998)                                                           2.701754  2.782178   0.080424
7th Voyage of Sinbad, The (1958)                                    3.409091  3.658879   0.249788
8MM (1999)                                                          2.906250  2.850962  -0.055288
About Last Night... (1986)                                          3.188679  3.140909  -0.047770
Absent Minded Professor, The (1961)                                 3.469388  3.446809  -0.022579
Absolute Power (1997)                                               3.469136  3.327759  -0.141377
Abyss, The (1989)                                                   3.659236  3.689507   0.030272
Ace Ventura: Pet Detective (1994)                                   3.000000  3.197917   0.197917
Ace Ventura: When Nature Calls (1995)                               2.269663  2.543333   0.273670
Addams Family Values (1993)                                         3.000000  2.878531  -0.121469
Addams Family, The (1991)                                           3.186170  3.163498  -0.022672
Adventures in Babysitting (1987)                                    3.455782  3.208122  -0.247660
Adventures of Buckaroo Bonzai Across the 8th Dimension, The (1984)  3.308511  3.402321   0.093810
Adventures of Priscilla, Queen of the Desert, The (1994)            3.989071  3.688811  -0.300260
Adventures of Robin Hood, The (1938)                                4.166667  3.918367  -0.248299
African Queen, The (1951)                                           4.324232  4.223822  -0.100410
Age of Innocence, The (1993)                                        3.827068  3.339506  -0.487561
Agnes of God (1985)                                                 3.534884  3.244898  -0.289986
...                                                                      ...       ...        ...
White Men Can't Jump (1992)                                         3.028777  3.231061   0.202284
Who Framed Roger Rabbit? (1988)                                     3.569378  3.713251   0.143873
Who's Afraid of Virginia Woolf? (1966)                              4.029703  4.096939   0.067236
Whole Nine Yards, The (2000)                                        3.296552  3.404814   0.108262
Wild Bunch, The (1969)                                              3.636364  4.128099   0.491736
Wild Things (1998)                                                  3.392000  3.459082   0.067082
Wild Wild West (1999)                                               2.275449  2.131973  -0.143476
William Shakespeare's Romeo and Juliet (1996)                       3.532609  3.318644  -0.213965
Willow (1988)                                                       3.658683  3.453543  -0.205139
Willy Wonka and the Chocolate Factory (1971)                        4.063953  3.789474  -0.274480
Witness (1985)                                                      4.115854  3.941504  -0.174349
Wizard of Oz, The (1939)                                            4.355030  4.203138  -0.151892
Wolf (1994)                                                         3.074074  2.899083  -0.174992
Women on the Verge of a Nervous Breakdown (1988)                    3.934307  3.865741  -0.068566
Wonder Boys (2000)                                                  4.043796  3.913649  -0.130147
Working Girl (1988)                                                 3.606742  3.312500  -0.294242
World Is Not Enough, The (1999)                                     3.337500  3.388889   0.051389
Wrong Trousers, The (1993)                                          4.588235  4.478261  -0.109974
Wyatt Earp (1994)                                                   3.147059  3.283898   0.136839
X-Files: Fight the Future, The (1998)                               3.489474  3.493797   0.004323
X-Men (2000)                                                        3.682310  3.851702   0.169391
Year of Living Dangerously (1982)                                   3.951220  3.869403  -0.081817
Yellow Submarine (1968)                                             3.714286  3.689286  -0.025000
You've Got Mail (1998)                                              3.542424  3.275591  -0.266834
Young Frankenstein (1974)                                           4.289963  4.239177  -0.050785
Young Guns (1988)                                                   3.371795  3.425620   0.053825
Young Guns II (1990)                                                2.934783  2.904025  -0.030758
Young Sherlock Holmes (1985)                                        3.514706  3.363344  -0.151362
Zero Effect (1998)                                                  3.864407  3.723140  -0.141266
eXistenZ (1999)                                                     3.098592  3.289086   0.190494

1216 rows × 3 columns
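
The deprecation warning for .ix points to .loc for labels and .iloc for positions. A small sketch of the difference, using a made-up frame toy_means and a made-up index toy_active:

```python
import pandas as pd

toy_means = pd.DataFrame({"F": [3.0, 4.5, 2.0], "M": [3.5, 4.0, 2.5]},
                         index=["Movie A", "Movie B", "Movie C"])
toy_active = pd.Index(["Movie A", "Movie C"])

# .loc selects rows by label, which is what a title index needs.
by_label = toy_means.loc[toy_active]

# .iloc selects rows by integer position instead.
by_position = toy_means.iloc[[0, 2]]
```

Both selections return the rows for Movie A and Movie C here; .loc is the natural replacement in the notebook because active_titles holds labels, not positions.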

To see the movies that female viewers liked most, we can sort by the F column in descending order:



In [109]:

    
top_female_ratings = mean_ratings.sort_index(by='F', ascending=False)









    



D:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: by argument to sort_index is deprecated, pls use .sort_values(by=...)
  """Entry point for launching an IPython kernel.



In [110]:

    
top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)



In [111]:

    
top_female_ratings[:10]









    Out[111]:







  
    
gender                                                         F         M       diff
title
Close Shave, A (1995)                                   4.644444  4.473795  -0.170650
Wrong Trousers, The (1993)                              4.588235  4.478261  -0.109974
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)           4.572650  4.464589  -0.108060
Wallace & Gromit: The Best of Aardman Animation (1996)  4.563107  4.385075  -0.178032
Schindler's List (1993)                                 4.562602  4.491415  -0.071187
Shawshank Redemption, The (1994)                        4.539075  4.560625   0.021550
Grand Day Out, A (1992)                                 4.537879  4.293255  -0.244624
To Kill a Mockingbird (1962)                            4.536667  4.372611  -0.164055
Creature Comforts (1990)                                4.513889  4.272277  -0.241612
Usual Suspects, The (1995)                              4.513317  4.518248   0.004931
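Measuring Rating Disagreement

Suppose we want to find the movies on which male and female viewers disagree the most. One way is to add a column to mean_ratings containing the difference in mean ratings, and then sort by it: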



In [112]:

    
mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']

Sorting by 'diff' gives the movies with the largest rating difference that female viewers preferred:



In [93]:

    
sorted_by_diff = mean_ratings.sort_index(by = 'diff')









    



D:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: by argument to sort_index is deprecated, pls use .sort_values(by=...)
  """Entry point for launching an IPython kernel.



In [113]:

    
sorted_by_diff = mean_ratings.sort_values(by='diff')



In [114]:

    
sorted_by_diff[:15]









    Out[114]:







  
    
gender                                        F         M       diff
title
Dirty Dancing (1987)                   3.790378  2.959596  -0.830782
Jumpin' Jack Flash (1986)              3.254717  2.578358  -0.676359
Grease (1978)                          3.975265  3.367041  -0.608224
Little Women (1994)                    3.870588  3.321739  -0.548849
Steel Magnolias (1989)                 3.901734  3.365957  -0.535777
Anastasia (1997)                       3.800000  3.281609  -0.518391
Rocky Horror Picture Show, The (1975)  3.673016  3.160131  -0.512885
Color Purple, The (1985)               4.158192  3.659341  -0.498851
Age of Innocence, The (1993)           3.827068  3.339506  -0.487561
Free Willy (1993)                      2.921348  2.438776  -0.482573
French Kiss (1995)                     3.535714  3.056962  -0.478752
Little Shop of Horrors, The (1960)     3.650000  3.179688  -0.470312
Guys and Dolls (1955)                  4.051724  3.583333  -0.468391
Mary Poppins (1964)                    4.197740  3.730594  -0.467147
Patch Adams (1998)                     3.473282  3.008746  -0.464536
Reversing the sorted result and taking the first 15 rows gives the movies that male viewers preferred:



In [115]:

    
sorted_by_diff[::-1][:15]









    Out[115]:







  
    
gender                                         F         M      diff
title
Good, The Bad and The Ugly, The (1966)  3.494949  4.221300  0.726351
Kentucky Fried Movie, The (1977)        2.878788  3.555147  0.676359
Dumb & Dumber (1994)                    2.697987  3.336595  0.638608
Longest Day, The (1962)                 3.411765  4.031447  0.619682
Cable Guy, The (1996)                   2.250000  2.863787  0.613787
Evil Dead II (Dead By Dawn) (1987)      3.297297  3.909283  0.611985
Hidden, The (1987)                      3.137931  3.745098  0.607167
Rocky III (1982)                        2.361702  2.943503  0.581801
Caddyshack (1980)                       3.396135  3.969737  0.573602
For a Few Dollars More (1965)           3.409091  3.953795  0.544704
Porky's (1981)                          2.296875  2.836364  0.539489
Animal House (1978)                     3.628906  4.167192  0.538286
Exorcist, The (1973)                    3.537634  4.067239  0.529605
Fright Night (1985)                     2.973684  3.500000  0.526316
Barb Wire (1996)                        1.585366  2.100386  0.515020
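If we just want the movies that drew the most disagreement among viewers, regardless of gender, we can compute the variance or standard deviation of the ratings: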



In [127]:

    
# Standard deviation of the ratings, grouped by title
rating_std_by_title = data.groupby('title')['rating'].std()



In [128]:

    
# Filter down to active_titles
rating_std_by_title = rating_std_by_title.loc[active_titles]



In [129]:

    
# Sort the Series by value in descending order
rating_std_by_title.order(ascending=False)[:10]









    



---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-129-5b832fb1fe6d> in <module>()
      1 # Sort the Series by value in descending order
----> 2 rating_std_by_title.order(ascending=False)[:10]

D:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   3079             if name in self._info_axis:
   3080                 return self[name]
-> 3081             return object.__getattribute__(self, name)
   3082 
   3083     def __setattr__(self, name, value):

AttributeError: 'Series' object has no attribute 'order'



In [130]:

    
# Series.order, used in the book's code, is no longer available; current pandas uses sort_values
rating_std_by_title.sort_values(ascending=False)[:10]









    Out[130]:





title
Dumb & Dumber (1994)                     1.321333
Blair Witch Project, The (1999)          1.316368
Natural Born Killers (1994)              1.307198
Tank Girl (1995)                         1.277695
Rocky Horror Picture Show, The (1975)    1.260177
Eyes Wide Shut (1999)                    1.259624
Evita (1996)                             1.253631
Billy Madison (1995)                     1.249970
Fear and Loathing in Las Vegas (1998)    1.246408
Bicentennial Man (1999)                  1.245533
Name: rating, dtype: float64
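
Why the standard deviation captures disagreement can be seen on two made-up titles: one with split ratings and one with identical ratings. The frame toy_data is hypothetical.

```python
import pandas as pd

toy_data = pd.DataFrame({"title": ["A", "A", "B", "B"],
                         "rating": [1, 5, 3, 3]})

# Sample standard deviation (pandas uses ddof=1 by default) per title.
toy_std = toy_data.groupby("title")["rating"].std()

# sort_values takes the place of the Series.order method used in the book.
toy_ranked = toy_std.sort_values(ascending=False)
```

Both titles have a mean rating of 3, yet title A has a standard deviation of about 2.83 and title B has 0, so only the spread tells them apart.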

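Author's note:

As you may have noticed, the movie genres are given as a pipe-separated ("|") string. To analyze the genres, we would first need to convert them into a more usable form. Later chapters of the book show how to do this, and this data set will come up again then.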
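
As a rough illustration of what "a more usable form" could mean, the sketch below splits a few genre strings with plain Python. This is only an assumption about one possible first step and is not the method the book presents later.

```python
from collections import Counter

# Genre strings copied from the movies table above.
genres = [
    "Animation|Children's|Comedy",
    "Adventure|Children's|Fantasy",
    "Comedy|Romance",
]

# Split each pipe-delimited string into a list of individual genres.
split_genres = [entry.split("|") for entry in genres]

# Count how often each genre appears across the sample.
genre_counts = Counter(name for parts in split_genres for name in parts)
```

Once each movie has a list of genres, counts or per-genre flags can be derived from it.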



In [ ]:

	user_id	movie_id	rating	timestamp
0	1	1193	5	978300760
1	1	661	3	978302109
2	1	914	3	978301968
3	1	3408	4	978300275
4	1	2355	5	978824291
5	1	1197	3	978302268
6	1	1287	5	978302039
7	1	2804	5	978300719
8	1	594	4	978302268
9	1	919	4	978301368

	_heartbeat_	a	al	c	cy	g	gr	h	hc	hh	kw	l	ll	nk	r	t	tz	u
0	NaN	Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...	en-US,en;q=0.8	US	Danvers	A6qOVH	MA	wfLQtf	1.331823e+09	1.usa.gov	NaN	orofrog	[42.576698, -70.954903]	1.0	http://www.facebook.com/l/7AQEFzjSi/1.usa.gov/...	1.331923e+09	America/New_York	http://www.ncbi.nlm.nih.gov/pubmed/22415991
1	NaN	GoogleMaps/RochesterNY	NaN	US	Provo	mwszkS	UT	mwszkS	1.308262e+09	j.mp	NaN	bitly	[40.218102, -111.613297]	0.0	http://www.AwareMap.com/	1.331923e+09	America/Denver	http://www.monroecounty.gov/etc/911/rss.php
2	NaN	Mozilla/4.0 (compatible; MSIE 8.0; Windows NT ...	en-US	US	Washington	xxr3Qb	DC	xxr3Qb	1.331920e+09	1.usa.gov	NaN	bitly	[38.9007, -77.043098]	1.0	http://t.co/03elZC4Q	1.331923e+09	America/New_York	http://boxer.senate.gov/en/press/releases/0316...
3	NaN	Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)...	pt-br	BR	Braz	zCaLwp	27	zUtuOu	1.331923e+09	1.usa.gov	NaN	alelex88	[-23.549999, -46.616699]	0.0	direct	1.331923e+09	America/Sao_Paulo	http://apod.nasa.gov/apod/ap120312.html
4	NaN	Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...	en-US,en;q=0.8	US	Shrewsbury	9b6kNl	MA	9b6kNl	1.273672e+09	bit.ly	NaN	bitly	[42.286499, -71.714699]	0.0	http://www.shrewsbury-ma.gov/selco/	1.331923e+09	America/New_York	http://www.shrewsbury-ma.gov/egov/gallery/1341...
5	NaN	Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...	en-US,en;q=0.8	US	Shrewsbury	axNK8c	MA	axNK8c	1.273673e+09	bit.ly	NaN	bitly	[42.286499, -71.714699]	0.0	http://www.shrewsbury-ma.gov/selco/	1.331923e+09	America/New_York	http://www.shrewsbury-ma.gov/egov/gallery/1341...
6	NaN	Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.1...	pl-PL,pl;q=0.8,en-US;q=0.6,en;q=0.4	PL	Luban	wcndER	77	zkpJBR	1.331923e+09	1.usa.gov	NaN	bnjacobs	[51.116699, 15.2833]	0.0	http://plus.url.google.com/url?sa=z&n=13319232...	1.331923e+09	Europe/Warsaw	http://www.nasa.gov/mission_pages/nustar/main/...
7	NaN	Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/2...	bg,en-us;q=0.7,en;q=0.3	None	NaN	wcndER	NaN	zkpJBR	1.331923e+09	1.usa.gov	NaN	bnjacobs	NaN	0.0	http://www.facebook.com/	1.331923e+09		http://www.nasa.gov/mission_pages/nustar/main/...
8	NaN	Opera/9.80 (X11; Linux zbov; U; en) Presto/2.1...	en-US, en	None	NaN	wcndER	NaN	zkpJBR	1.331923e+09	1.usa.gov	NaN	bnjacobs	NaN	0.0	http://www.facebook.com/l.php?u=http%3A%2F%2F1...	1.331923e+09		http://www.nasa.gov/mission_pages/nustar/main/...
9	NaN	Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...	pt-BR,pt;q=0.8,en-US;q=0.6,en;q=0.4	None	NaN	zCaLwp	NaN	zUtuOu	1.331923e+09	1.usa.gov	NaN	alelex88	NaN	0.0	http://t.co/o1Pd0WeV	1.331923e+09		http://apod.nasa.gov/apod/ap120312.html
10	NaN	Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.2)...	en-us,en;q=0.5	US	Seattle	vNJS4H	WA	u0uD9q	1.319564e+09	1.usa.gov	NaN	o_4us71ccioa	[47.5951, -122.332603]	1.0	direct	1.331923e+09	America/Los_Angeles	https://www.nysdot.gov/rexdesign/design/commun...
11	NaN	Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4...	en-us,en;q=0.5	US	Washington	wG7OIH	DC	A0nRz4	1.331816e+09	1.usa.gov	NaN	darrellissa	[38.937599, -77.092796]	0.0	http://t.co/ND7SoPyo	1.331923e+09	America/New_York	http://oversight.house.gov/wp-content/uploads/...
12	NaN	Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.2)...	en-us,en;q=0.5	US	Alexandria	vNJS4H	VA	u0uD9q	1.319564e+09	1.usa.gov	NaN	o_4us71ccioa	[38.790901, -77.094704]	1.0	direct	1.331923e+09	America/New_York	https://www.nysdot.gov/rexdesign/design/commun...
13	1.331923e+09	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
14	NaN	Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US...	en-us,en;q=0.5	US	Marietta	2rOUYc	GA	2rOUYc	1.255770e+09	1.usa.gov	NaN	bitly	[33.953201, -84.5177]	1.0	direct	1.331923e+09	America/New_York	http://toxtown.nlm.nih.gov/index.php
15	NaN	Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.1...	zh-TW,zh;q=0.8,en-US;q=0.6,en;q=0.4	HK	Central District	nQvgJp	00	rtrrth	1.317318e+09	j.mp	NaN	walkeryuen	[22.2833, 114.150002]	1.0	http://forum2.hkgolden.com/view.aspx?type=BW&m...	1.331923e+09	Asia/Hong_Kong	http://www.ssd.noaa.gov/PS/TROP/TCFP/data/curr...
16	NaN	Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.1...	zh-TW,zh;q=0.8,en-US;q=0.6,en;q=0.4	HK	Central District	XdUNr	00	qWkgbq	1.317318e+09	j.mp	NaN	walkeryuen	[22.2833, 114.150002]	1.0	http://forum2.hkgolden.com/view.aspx?type=BW&m...	1.331923e+09	Asia/Hong_Kong	http://www.usno.navy.mil/NOOC/nmfc-ph/RSS/jtwc...
17	NaN	Mozilla/5.0 (Macintosh; Intel Mac OS X 10.5; r...	en-us,en;q=0.5	US	Buckfield	zH1BFf	ME	x3jOIv	1.331840e+09	1.usa.gov	NaN	andyzieminski	[44.299702, -70.369797]	0.0	http://t.co/6Cx4ROLs	1.331923e+09	America/New_York	http://www.usda.gov/wps/portal/usda/usdahome?c...
18	NaN	GoogleMaps/RochesterNY	NaN	US	Provo	mwszkS	UT	mwszkS	1.308262e+09	1.usa.gov	NaN	bitly	[40.218102, -111.613297]	0.0	http://www.AwareMap.com/	1.331923e+09	America/Denver	http://www.monroecounty.gov/etc/911/rss.php
19	NaN	Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...	it-IT,it;q=0.8,en-US;q=0.6,en;q=0.4	IT	Venice	wcndER	20	zkpJBR	1.331923e+09	1.usa.gov	NaN	bnjacobs	[45.438599, 12.3267]	0.0	http://www.facebook.com/	1.331923e+09	Europe/Rome	http://www.nasa.gov/mission_pages/nustar/main/...
20	NaN	Mozilla/5.0 (compatible; MSIE 9.0; Windows NT ...	es-ES	ES	Alcal	zQ95Hi	51	ytZYWR	1.331671e+09	bitly.com	NaN	jplnews	[37.516701, -5.9833]	0.0	http://www.facebook.com/	1.331923e+09	Africa/Ceuta	http://voyager.jpl.nasa.gov/imagesvideo/uranus...
21	NaN	Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6...	en-us,en;q=0.5	US	Davidsonville	wcndER	MD	zkpJBR	1.331923e+09	1.usa.gov	NaN	bnjacobs	[38.939201, -76.635002]	0.0	http://www.facebook.com/	1.331923e+09	America/New_York	http://www.nasa.gov/mission_pages/nustar/main/...
22	NaN	Mozilla/4.0 (compatible; MSIE 8.0; Windows NT ...	en-us	US	Hockessin	y3ZImz	DE	y3ZImz	1.331064e+09	1.usa.gov	NaN	bitly	[39.785, -75.682297]	0.0	direct	1.331923e+09	America/New_York	http://portal.hud.gov/hudportal/documents/hudd...
23	NaN	Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3)...	en-us	US	Lititz	wWiOiD	PA	wWiOiD	1.330218e+09	1.usa.gov	NaN	bitly	[40.174999, -76.3078]	0.0	http://www.facebook.com/l.php?u=http%3A%2F%2F1...	1.331923e+09	America/New_York	http://www.tricare.mil/mybenefit/ProfileFilter...
24	NaN	Mozilla/5.0 (Windows; U; Windows NT 5.1; es-ES...	es-es,es;q=0.8,en-us;q=0.5,en;q=0.3	ES	Bilbao	wcndER	59	zkpJBR	1.331923e+09	1.usa.gov	NaN	bnjacobs	[43.25, -2.9667]	0.0	http://www.facebook.com/	1.331923e+09	Europe/Madrid	http://www.nasa.gov/mission_pages/nustar/main/...
25	NaN	Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.1...	en-GB,en;q=0.8,en-US;q=0.6,en-AU;q=0.4	MY	Kuala Lumpur	wcndER	14	zkpJBR	1.331923e+09	1.usa.gov	NaN	bnjacobs	[3.1667, 101.699997]	0.0	http://www.facebook.com/	1.331923e+09	Asia/Kuala_Lumpur	http://www.nasa.gov/mission_pages/nustar/main/...
26	NaN	Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.1...	ro-RO,ro;q=0.8,en-US;q=0.6,en;q=0.4	CY	Nicosia	wcndER	04	zkpJBR	1.331923e+09	1.usa.gov	NaN	bnjacobs	[35.166698, 33.366699]	0.0	http://www.facebook.com/?ref=tn_tnmn	1.331923e+09	Asia/Nicosia	http://www.nasa.gov/mission_pages/nustar/main/...
27	NaN	Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)...	en-US,en;q=0.8	BR	SPaulo	zCaLwp	27	zUtuOu	1.331923e+09	1.usa.gov	NaN	alelex88	[-23.5333, -46.616699]	0.0	direct	1.331923e+09	America/Sao_Paulo	http://apod.nasa.gov/apod/ap120312.html
28	NaN	Mozilla/5.0 (iPad; CPU OS 5_0_1 like Mac OS X)...	en-us	None	NaN	vNJS4H	NaN	u0uD9q	1.319564e+09	1.usa.gov	NaN	o_4us71ccioa	NaN	0.0	direct	1.331923e+09		https://www.nysdot.gov/rexdesign/design/commun...
29	NaN	Mozilla/5.0 (iPad; U; CPU OS 3_2 like Mac OS X...	en-us	None	NaN	FPX0IM	NaN	FPX0IL	1.331923e+09	1.usa.gov	NaN	twittershare	NaN	1.0	http://t.co/5xlp0B34	1.331923e+09		http://www.ed.gov/news/media-advisories/us-dep...
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
3530	NaN	Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.1...	en-US,en;q=0.8	US	San Francisco	xVZg4P	CA	wqUkTo	1.331908e+09	go.nasa.gov	NaN	nasatwitter	[37.7645, -122.429398]	0.0	http://www.facebook.com/l.php?u=http%3A%2F%2Fg...	1.331927e+09	America/Los_Angeles	http://www.nasa.gov/multimedia/imagegallery/im...
3531	NaN	Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6...	en-US	None	NaN	wcndER	NaN	zkpJBR	1.331923e+09	1.usa.gov	NaN	bnjacobs	NaN	0.0	direct	1.331927e+09		http://www.nasa.gov/mission_pages/nustar/main/...
3532	NaN	Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.2)...	en-us,en;q=0.5	US	Washington	Au3aUS	DC	A9ct6C	1.331926e+09	1.usa.gov	NaN	ncsha	[38.904202, -77.031998]	1.0	http://www.ncsha.org/	1.331927e+09	America/New_York	http://portal.hud.gov/hudportal/HUD?src=/press...
3533	NaN	Mozilla/5.0 (iPad; CPU OS 5_1 like Mac OS X) A...	en-us	US	Jacksonville	b2UtUJ	FL	ieCdgH	1.301393e+09	go.nasa.gov	NaN	nasatwitter	[30.279301, -81.585098]	1.0	direct	1.331927e+09	America/New_York	http://apod.nasa.gov/apod/
3534	NaN	Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)...	en-us	US	Frisco	vNJS4H	TX	u0uD9q	1.319564e+09	1.usa.gov	NaN	o_4us71ccioa	[33.149899, -96.855499]	1.0	direct	1.331927e+09	America/Chicago	https://www.nysdot.gov/rexdesign/design/commun...
3535	NaN	Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/...	en-us	US	Houston	zIgLx8	TX	yrPaLt	1.331903e+09	aash.to	NaN	aashto	[29.775499, -95.415199]	1.0	direct	1.331927e+09	America/Chicago	http://ntl.bts.gov/lib/44000/44300/44374/FHWA-...
3536	NaN	Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; e...	en-US,en;q=0.5	None	NaN	xIcyim	NaN	yG1TTf	1.331728e+09	go.nasa.gov	NaN	nasatwitter	NaN	0.0	http://t.co/g1VKE8zS	1.331927e+09		http://www.nasa.gov/mission_pages/hurricanes/a...
3537	NaN	Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.2)...	es-es,es;q=0.8,en-us;q=0.5,en;q=0.3	HN	Tegucigalpa	zCaLwp	08	w63FZW	1.331547e+09	1.usa.gov	NaN	bufferapp	[14.1, -87.216698]	0.0	http://t.co/A8TJyibE	1.331927e+09	America/Tegucigalpa	http://apod.nasa.gov/apod/ap120312.html
3538	NaN	Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Ma...	en-us	US	Los Angeles	qMac9k	CA	qds1Ge	1.310474e+09	1.usa.gov	NaN	healthypeople	[34.041599, -118.298798]	0.0	direct	1.331927e+09	America/Los_Angeles	http://healthypeople.gov/2020/connect/webinars...
3539	NaN	Mozilla/5.0 (compatible; Fedora Core 3) FC3 KDE	NaN	US	Bellevue	zu2M5o	WA	zDhdro	1.331586e+09	bit.ly	NaN	glimtwin	[47.615398, -122.210297]	0.0	direct	1.331927e+09	America/Los_Angeles	http://www.federalreserve.gov/newsevents/press...
3540	NaN	Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...	en-US,en;q=0.8	US	Payson	wcndER	UT	zkpJBR	1.331923e+09	1.usa.gov	NaN	bnjacobs	[40.014198, -111.738899]	0.0	http://www.facebook.com/l.php?u=http%3A%2F%2F1...	1.331927e+09	America/Denver	http://www.nasa.gov/mission_pages/nustar/main/...
3541	NaN	Mozilla/5.0 (X11; U; OpenVMS AlphaServer_ES40;...	NaN	US	Bellevue	zu2M5o	WA	zDhdro	1.331586e+09	1.usa.gov	NaN	glimtwin	[47.615398, -122.210297]	0.0	direct	1.331927e+09	America/Los_Angeles	http://www.federalreserve.gov/newsevents/press...
3542	NaN	Mozilla/5.0 (compatible; MSIE 9.0; Windows NT ...	en-us	US	Pittsburg	y3reI1	CA	y3reI1	1.331926e+09	1.usa.gov	NaN	bitly	[38.0051, -121.838699]	0.0	http://www.facebook.com/l.php?u=http%3A%2F%2F1...	1.331927e+09	America/Los_Angeles	http://www.sba.gov/community/blogs/community-b...
3543	1.331927e+09	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3544	NaN	Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0.1) ...	en-us,en;q=0.5	US	Wentzville	vNJS4H	MO	u0uD9q	1.319564e+09	1.usa.gov	NaN	o_4us71ccioa	[38.790001, -90.854897]	1.0	direct	1.331927e+09	America/Chicago	https://www.nysdot.gov/rexdesign/design/commun...
3545	NaN	Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.2)...	en-us,en;q=0.5	US	Saint Charles	vNJS4H	IL	u0uD9q	1.319564e+09	1.usa.gov	NaN	o_4us71ccioa	[41.9352, -88.290901]	1.0	direct	1.331927e+09	America/Chicago	https://www.nysdot.gov/rexdesign/design/commun...
3546	NaN	Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Ma...	en-us	US	Los Angeles	qMac9k	CA	qds1Ge	1.310474e+09	1.usa.gov	NaN	healthypeople	[34.041599, -118.298798]	1.0	direct	1.331927e+09	America/Los_Angeles	http://healthypeople.gov/2020/connect/webinars...
3547	NaN	Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)...	en-us	US	Silver Spring	y0jYkg	MD	y0jYkg	1.331852e+09	1.usa.gov	NaN	bitly	[39.052101, -77.014999]	1.0	direct	1.331927e+09	America/New_York	http://www.epa.gov/otaq/regs/fuels/additive/e1...
3548	NaN	Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Ma...	en-us	US	Mcgehee	y5rMac	AR	xANY6O	1.331916e+09	1.usa.gov	NaN	twitterfeed	[33.628399, -91.356903]	1.0	https://twitter.com/fdarecalls/status/18069759...	1.331927e+09	America/Chicago	http://www.fda.gov/Safety/Recalls/ucm296326.htm
3549	NaN	Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...	sv-SE,sv;q=0.8,en-US;q=0.6,en;q=0.4	SE	Sollefte	eH8wu	24	7dtjei	1.260316e+09	1.usa.gov	NaN	tweetdeckapi	[63.166698, 17.266701]	1.0	direct	1.331927e+09	Europe/Stockholm	http://www.nasa.gov/mission_pages/WISE/main/in...
3550	NaN	Mozilla/4.0 (compatible; MSIE 8.0; Windows NT ...	en-us	US	Conshohocken	A00b72	PA	yGSwzn	1.331918e+09	1.usa.gov	NaN	addthis	[40.0798, -75.2855]	0.0	http://www.linkedin.com/home?trk=hb_tab_home_top	1.331927e+09	America/New_York	http://www.nlm.nih.gov/medlineplus/news/fullst...
3551	NaN	Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...	en-US,en;q=0.8	None	NaN	wcndER	NaN	zkpJBR	1.331923e+09	1.usa.gov	NaN	bnjacobs	NaN	0.0	http://plus.url.google.com/url?sa=z&n=13319268...	1.331927e+09		http://www.nasa.gov/mission_pages/nustar/main/...
3552	NaN	Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US...	NaN	US	Decatur	rqgJuE	AL	xcz8vt	1.331227e+09	1.usa.gov	NaN	bootsnall	[34.572701, -86.940598]	0.0	direct	1.331927e+09	America/Chicago	http://travel.state.gov/passport/passport_5535...
3553	NaN	Mozilla/4.0 (compatible; MSIE 7.0; Windows NT ...	en-us	US	Shrewsbury	9b6kNl	MA	9b6kNl	1.273672e+09	bit.ly	NaN	bitly	[42.286499, -71.714699]	0.0	http://www.shrewsbury-ma.gov/selco/	1.331927e+09	America/New_York	http://www.shrewsbury-ma.gov/egov/gallery/1341...
3554	NaN	Mozilla/4.0 (compatible; MSIE 7.0; Windows NT ...	en-us	US	Shrewsbury	axNK8c	MA	axNK8c	1.273673e+09	bit.ly	NaN	bitly	[42.286499, -71.714699]	0.0	http://www.shrewsbury-ma.gov/selco/	1.331927e+09	America/New_York	http://www.shrewsbury-ma.gov/egov/gallery/1341...
3555	NaN	Mozilla/4.0 (compatible; MSIE 9.0; Windows NT ...	en	US	Paramus	e5SvKE	NJ	fqPSr9	1.301298e+09	1.usa.gov	NaN	tweetdeckapi	[40.9445, -74.07]	1.0	direct	1.331927e+09	America/New_York	http://www.fda.gov/AdvisoryCommittees/Committe...
3556	NaN	Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.1...	en-US,en;q=0.8	US	Oklahoma City	jQLtP4	OK	jQLtP4	1.307530e+09	1.usa.gov	NaN	bitly	[35.4715, -97.518997]	0.0	http://www.facebook.com/l.php?u=http%3A%2F%2F1...	1.331927e+09	America/Chicago	http://www.okc.gov/PublicNotificationSystem/Fo...
3557	NaN	GoogleMaps/RochesterNY	NaN	US	Provo	mwszkS	UT	mwszkS	1.308262e+09	j.mp	NaN	bitly	[40.218102, -111.613297]	0.0	http://www.AwareMap.com/	1.331927e+09	America/Denver	http://www.monroecounty.gov/etc/911/rss.php
3558	NaN	GoogleProducer	NaN	US	Mountain View	zjtI4X	CA	zjtI4X	1.327529e+09	1.usa.gov	NaN	bitly	[37.419201, -122.057404]	0.0	direct	1.331927e+09	America/Los_Angeles	http://www.ahrq.gov/qual/qitoolkit/
3559	NaN	Mozilla/4.0 (compatible; MSIE 8.0; Windows NT ...	en-US	US	Mc Lean	qxKrTK	VA	qxKrTK	1.312898e+09	1.usa.gov	NaN	bitly	[38.935799, -77.162102]	0.0	http://t.co/OEEEvwjU	1.331927e+09	America/New_York	http://herndon-va.gov/Content/public_safety/Pu...

	Not Windows	Windows
tz
	245.0	276.0
Africa/Cairo	0.0	3.0
Africa/Casablanca	0.0	1.0
Africa/Ceuta	0.0	2.0
Africa/Johannesburg	0.0	1.0
Africa/Lusaka	0.0	1.0
America/Anchorage	4.0	1.0
America/Argentina/Buenos_Aires	1.0	0.0
America/Argentina/Cordoba	0.0	1.0
America/Argentina/Mendoza	0.0	1.0
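
A count table shaped like the one above (time zones as rows, "Windows" / "Not Windows" as columns) can be built with a group-by followed by `unstack`. The following is only a minimal sketch: the five-row `frame` is made-up illustration data, and the column names `tz`, `a`, and `os` are assumptions about how the records were loaded.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the bit.ly records: 'tz' is the time zone and
# 'a' is the user-agent string. These rows are invented for illustration.
frame = pd.DataFrame({
    "tz": ["America/New_York", "America/New_York", "Africa/Cairo",
           "America/Anchorage", ""],
    "a": ["Mozilla/5.0 (Windows NT 6.1; WOW64)",
          "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)",
          "Mozilla/5.0 (Windows NT 5.1)",
          "Mozilla/5.0 (iPad; CPU OS 5_1 like Mac OS X)",
          "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)"],
})

# Label each record by whether its user agent mentions Windows.
frame["os"] = np.where(frame["a"].str.contains("Windows"),
                       "Windows", "Not Windows")

# Count rows per (tz, os) pair, move 'os' into the columns,
# and replace the missing combinations with 0.
agg_counts = frame.groupby(["tz", "os"]).size().unstack().fillna(0)
print(agg_counts)
```

Because `fillna(0)` is applied to a table that contained NaN, the counts come out as floats, which matches the `0.0` / `3.0` style of the table above.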

	Not Windows	Windows
tz
America/Sao_Paulo	13.0	20.0
Europe/Madrid	16.0	19.0
Pacific/Honolulu	0.0	36.0
Asia/Tokyo	2.0	35.0
Europe/London	43.0	31.0
America/Denver	132.0	59.0
America/Los_Angeles	130.0	252.0
America/Chicago	115.0	285.0
	245.0	276.0
America/New_York	339.0	912.0
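
The ten rows above are ordered by their row total (Not Windows + Windows) from smallest to largest. One way to get such a subset is to sort the row totals indirectly with `argsort` and keep the last positions. This is a sketch under assumptions: the five rows are copied from the tables above, and the names `agg_counts`, `indexer`, and `count_subset` are illustrative.

```python
import pandas as pd

# A few rows copied from the count tables above.
agg_counts = pd.DataFrame(
    {"Not Windows": [0.0, 4.0, 43.0, 132.0, 339.0],
     "Windows": [3.0, 1.0, 31.0, 59.0, 912.0]},
    index=pd.Index(["Africa/Cairo", "America/Anchorage", "Europe/London",
                    "America/Denver", "America/New_York"], name="tz"),
)

# Total number of records per time zone, across both OS columns.
totals = agg_counts.sum(axis=1)

# argsort returns the positions that would sort the totals ascending;
# the last three positions therefore belong to the three largest totals.
indexer = totals.to_numpy().argsort()
count_subset = agg_counts.take(indexer[-3:])
print(count_subset)
```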

	movie_id	title	genres
0	1	Toy Story (1995)	Animation|Children's|Comedy
1	2	Jumanji (1995)	Adventure|Children's|Fantasy
2	3	Grumpier Old Men (1995)	Comedy|Romance
3	4	Waiting to Exhale (1995)	Comedy\|Drama
4	5	Father of the Bride Part II (1995)	Comedy

	user_id	movie_id	rating	timestamp	gender	age	occupation	zip	title	genres
0	1	1193	5	978300760	F	1	10	48067	One Flew Over the Cuckoo's Nest (1975)	Drama
1	2	1193	5	978298413	M	56	16	70072	One Flew Over the Cuckoo's Nest (1975)	Drama
2	12	1193	4	978220179	M	25	12	32793	One Flew Over the Cuckoo's Nest (1975)	Drama
3	15	1193	4	978199279	M	25	7	22903	One Flew Over the Cuckoo's Nest (1975)	Drama
4	17	1193	5	978158471	M	50	1	95350	One Flew Over the Cuckoo's Nest (1975)	Drama
5	18	1193	4	978156168	F	18	3	95825	One Flew Over the Cuckoo's Nest (1975)	Drama
6	19	1193	5	982730936	M	1	10	48073	One Flew Over the Cuckoo's Nest (1975)	Drama
7	24	1193	5	978136709	F	25	7	10023	One Flew Over the Cuckoo's Nest (1975)	Drama
8	28	1193	3	978125194	F	25	1	14607	One Flew Over the Cuckoo's Nest (1975)	Drama
9	33	1193	5	978557765	M	45	3	55421	One Flew Over the Cuckoo's Nest (1975)	Drama
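
The wide table above has the column layout you would get by joining the ratings, users, and movies tables on their shared keys. A minimal sketch follows, assuming three separate frames named `ratings`, `users`, and `movies` (hypothetical names); the two rating rows and the movie row are copied from the output above.

```python
import pandas as pd

# Tiny stand-ins for the three MovieLens tables, using values shown above.
users = pd.DataFrame({
    "user_id": [1, 2], "gender": ["F", "M"], "age": [1, 56],
    "occupation": [10, 16], "zip": ["48067", "70072"],
})
ratings = pd.DataFrame({
    "user_id": [1, 2], "movie_id": [1193, 1193],
    "rating": [5, 5], "timestamp": [978300760, 978298413],
})
movies = pd.DataFrame({
    "movie_id": [1193],
    "title": ["One Flew Over the Cuckoo's Nest (1975)"],
    "genres": ["Drama"],
})

# With no 'on' argument, merge joins on the overlapping column names:
# 'user_id' for ratings/users, then 'movie_id' for the result/movies.
data = pd.merge(pd.merge(ratings, users), movies)
print(data)
```

Passing `on="user_id"` and `on="movie_id"` explicitly would be a safer choice in real code, since it does not depend on which column names happen to overlap.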

gender	F	M
title
$1,000,000 Duck (1971)	3.375000	2.761905
'Night Mother (1986)	3.388889	3.352941
'Til There Was You (1997)	2.675676	2.733333
'burbs, The (1989)	2.793478	2.962085
...And Justice for All (1979)	3.828571	3.689024
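
A title-by-gender table of mean ratings like the one above can be produced with `pivot_table`. The sketch below uses invented ratings (not the real MovieLens values) purely to show the shape of the call; `data` and `mean_ratings` are illustrative names.

```python
import pandas as pd

# Invented ratings for two titles; only the structure mirrors the real data.
data = pd.DataFrame({
    "title": ["Grease (1978)"] * 4 + ["Free Willy (1993)"] * 3,
    "gender": ["F", "F", "M", "M", "F", "M", "M"],
    "rating": [4, 5, 3, 4, 3, 2, 3],
})

# One row per title, one column per gender, each cell the mean rating.
mean_ratings = data.pivot_table(values="rating", index="title",
                                columns="gender", aggfunc="mean")
print(mean_ratings)
```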

gender	F	M	diff
title
Close Shave, A (1995)	4.644444	4.473795	-0.170650
Wrong Trousers, The (1993)	4.588235	4.478261	-0.109974
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)	4.572650	4.464589	-0.108060
Wallace & Gromit: The Best of Aardman Animation (1996)	4.563107	4.385075	-0.178032
Schindler's List (1993)	4.562602	4.491415	-0.071187
Shawshank Redemption, The (1994)	4.539075	4.560625	0.021550
Grand Day Out, A (1992)	4.537879	4.293255	-0.244624
To Kill a Mockingbird (1962)	4.536667	4.372611	-0.164055
Creature Comforts (1990)	4.513889	4.272277	-0.241612
Usual Suspects, The (1995)	4.513317	4.518248	0.004931
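
The table above is ordered by the `F` column from highest to lowest. Assuming a frame of mean ratings named `mean_ratings` (a hypothetical name; the three rows are copied from the output above), that ordering could be obtained like this:

```python
import pandas as pd

# Three rows copied from the table above, deliberately out of order.
mean_ratings = pd.DataFrame(
    {"F": [4.539075, 4.644444, 4.562602],
     "M": [4.560625, 4.473795, 4.491415]},
    index=pd.Index(["Shawshank Redemption, The (1994)",
                    "Close Shave, A (1995)",
                    "Schindler's List (1993)"], name="title"),
)

# Sort by the mean rating given by women, best-rated titles first.
top_female_ratings = mean_ratings.sort_values(by="F", ascending=False)
print(top_female_ratings)
```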

gender	F	M	diff
title
Dirty Dancing (1987)	3.790378	2.959596	-0.830782
Jumpin' Jack Flash (1986)	3.254717	2.578358	-0.676359
Grease (1978)	3.975265	3.367041	-0.608224
Little Women (1994)	3.870588	3.321739	-0.548849
Steel Magnolias (1989)	3.901734	3.365957	-0.535777
Anastasia (1997)	3.800000	3.281609	-0.518391
Rocky Horror Picture Show, The (1975)	3.673016	3.160131	-0.512885
Color Purple, The (1985)	4.158192	3.659341	-0.498851
Age of Innocence, The (1993)	3.827068	3.339506	-0.487561
Free Willy (1993)	2.921348	2.438776	-0.482573
French Kiss (1995)	3.535714	3.056962	-0.478752
Little Shop of Horrors, The (1960)	3.650000	3.179688	-0.470312
Guys and Dolls (1955)	4.051724	3.583333	-0.468391
Mary Poppins (1964)	4.197740	3.730594	-0.467147
Patch Adams (1998)	3.473282	3.008746	-0.464536
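
In the tables above, `diff` equals the `M` mean minus the `F` mean, so the most negative values mark the titles that women rated higher than men. A small sketch of that calculation, using three rows copied from the output above and the illustrative names `mean_ratings` and `sorted_by_diff`:

```python
import pandas as pd

# Mean ratings copied from the tables above.
mean_ratings = pd.DataFrame(
    {"F": [3.790378, 4.539075, 3.975265],
     "M": [2.959596, 4.560625, 3.367041]},
    index=pd.Index(["Dirty Dancing (1987)",
                    "Shawshank Redemption, The (1994)",
                    "Grease (1978)"], name="title"),
)

# Negative diff: preferred by women; positive diff: preferred by men.
mean_ratings["diff"] = mean_ratings["M"] - mean_ratings["F"]

# Ascending sort puts the titles most preferred by women at the top.
sorted_by_diff = mean_ratings.sort_values(by="diff")
print(sorted_by_diff)
```

Reversing the sorted frame (for example with `sorted_by_diff[::-1]`) would list the titles preferred by men first.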