Reading Notes

Author: Fang Yuewen (方跃文)

Email: fyuewen@gmail.com

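Date: started on September 12, 2017; writing finished on

Notes on Chapter 2 were started on September 12, 2017; the first stage ended on the evening of September 28, 2017 (two case studies remaining).

Chapter 2: Introduction

Date: September 12, 2017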

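Although the goals and domains of data processing vary widely, working with data in Python generally involves the following broad groups of tasks:

1) Interacting with the outside world: exchanging data with external sources

2) Preparation: cleaning, munging, normalizing, reshaping, slicing and dicing the data

3) Transformation: applying mathematical and statistical operations to data sets to derive new data sets, e.g. aggregating a large table by group variables

4) Modeling and computation: connecting the data to statistical models and machine learning algorithms

5) Presentation: creating interactive or static graphics, or text summaries.

The 1.usa.gov data from bit.ly

The file usagov_bitly_data2012-03-16-1331923249.txt in ch02 is an hourly snapshot of data collected by bit.ly. Each line is in JavaScript Object Notation (JSON), a common web data format. For example, reading only the first line of the file gives the following: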


In [1]:
path = "./pydata-book/ch02/usagov_bitly_data2012-03-16-1331923249.txt"
open(path).readline()


Out[1]:
'{ "a": "Mozilla\\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\\/535.11 (KHTML, like Gecko) Chrome\\/17.0.963.78 Safari\\/535.11", "c": "US", "nk": 1, "tz": "America\\/New_York", "gr": "MA", "g": "A6qOVH", "h": "wfLQtf", "l": "orofrog", "al": "en-US,en;q=0.8", "hh": "1.usa.gov", "r": "http:\\/\\/www.facebook.com\\/l\\/7AQEFzjSi\\/1.usa.gov\\/wfLQtf", "u": "http:\\/\\/www.ncbi.nlm.nih.gov\\/pubmed\\/22415991", "t": 1331923247, "hc": 1331822918, "cy": "Danvers", "ll": [ 42.576698, -70.954903 ] }\n'

In [9]:
print(path)
print(type(path))


./pydata-book/ch02/usagov_bitly_data2012-03-16-1331923249.txt
<class 'str'>

In [6]:
import json
datach02= [json.loads(line) for line in open(path)]

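Python has many built-in and third-party modules for converting a JSON string into a Python dictionary object. Here I use the json module and its loads function to load the downloaded data file line by line: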


In [3]:
import json
path = "./pydata-book/ch02/usagov_bitly_data2012-03-16-1331923249.txt"
records = [json.loads(line) for line in open(path)]

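The last expression above is a list comprehension, a concise way of applying the same operation (such as json.loads) to a collection of strings (or other objects). Iterating over an open file handle yields a sequence of lines. The records object is now a list of Python dicts.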


In [4]:
records[0]


Out[4]:
{'a': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.78 Safari/535.11',
 'al': 'en-US,en;q=0.8',
 'c': 'US',
 'cy': 'Danvers',
 'g': 'A6qOVH',
 'gr': 'MA',
 'h': 'wfLQtf',
 'hc': 1331822918,
 'hh': '1.usa.gov',
 'l': 'orofrog',
 'll': [42.576698, -70.954903],
 'nk': 1,
 'r': 'http://www.facebook.com/l/7AQEFzjSi/1.usa.gov/wfLQtf',
 't': 1331923247,
 'tz': 'America/New_York',
 'u': 'http://www.ncbi.nlm.nih.gov/pubmed/22415991'}

In [4]:
records[0]['tz']


Out[4]:
'America/New_York'
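
The line-by-line loading pattern can be tried without the data file. The sketch below is only an illustration, not the notebook's own code: the two JSON lines are made up, and io.StringIO stands in for the opened file.

```python
import io
import json

# Two made-up JSON lines standing in for the bit.ly file (hypothetical data).
fake_file = io.StringIO(
    '{"tz": "America/New_York", "nk": 1}\n'
    '{"tz": "Europe/Warsaw", "nk": 0}\n'
)

# Iterating over a file-like object yields one line at a time,
# so json.loads runs once per line.
sample_records = [json.loads(line) for line in fake_file]

print(type(sample_records[0]))   # each element is a plain dict
print(sample_records[1]["tz"])
```

With a real file, wrapping the comprehension in `with open(path) as f:` would also close the file once loading is done.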

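Counting time zones in pure Python

Date: September 26, 2017

Now suppose we want to count the time zones; there are several ways to do this. The first idea is to pull out the list of time zones with a list comprehension: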


In [10]:
time_zones = [rec['tz'] for rec in records]


---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-10-f3fbbc37f129> in <module>()
----> 1 time_zones = [rec['tz'] for rec in records]

<ipython-input-10-f3fbbc37f129> in <listcomp>(.0)
----> 1 time_zones = [rec['tz'] for rec in records]

KeyError: 'tz'

However, this raises a KeyError for 'tz', because not every record has a tz field. To handle this, we add an if condition to the comprehension:


In [11]:
time_zones = [i['tz'] for i in records if 'tz' in i]
time_zones[:2]


Out[11]:
['America/New_York', 'America/Denver']
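
The filter can be tried on a toy list. The records below are hypothetical (the heartbeat value is made up), and the dict.get variant is an alternative not used in these notes: it keeps every record and substitutes a default instead of skipping.

```python
# Hypothetical records: the second has no 'tz' key, the third an empty one.
sample = [
    {"tz": "America/New_York"},
    {"_heartbeat_": 1331923261},
    {"tz": ""},
]

# Filter: skip records without the key (same idea as the cell above).
with_filter = [rec["tz"] for rec in sample if "tz" in rec]

# Alternative: keep every record and substitute a default for missing keys.
with_default = [rec.get("tz", "Missing") for rec in sample]

print(with_filter)
print(with_default)
```

Note that neither version removes the empty string: a record whose tz is '' has the key, it just carries no value.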

In [12]:
time_zones = [rec['tz'] for rec in records if 'tz' in rec]
time_zones[:10]


Out[12]:
['America/New_York',
 'America/Denver',
 'America/New_York',
 'America/Sao_Paulo',
 'America/New_York',
 'America/New_York',
 'Europe/Warsaw',
 '',
 '',
 '']

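As the output above shows, some time zone fields are indeed empty. Two ways of counting the time zones are introduced here.

First, using only the Python standard library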


In [7]:
# This approach keeps the counts in a dict while iterating over the time zones:

def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts
# Looking back at this code today, I found it hard to follow; in the cell
# below I reused it and got a result that looked puzzling.
#11th Jan. 2018

In [21]:
def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts

sequence1={1,23,434,53,23,24}
a=get_counts(sequence1)
a[23]

#11th Jan. 2018


Out[21]:
1
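
The puzzling 1 comes from the braces rather than from get_counts: {1,23,434,53,23,24} is a set literal, so the duplicate 23 is dropped before any counting happens. A small sketch comparing a set with a list (the variable names as_set and as_list are only for illustration):

```python
def get_counts(sequence):
    """Count how often each item occurs (same logic as the cell above)."""
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts

as_set = {1, 23, 434, 53, 23, 24}    # braces: a set, duplicates are dropped
as_list = [1, 23, 434, 53, 23, 24]   # brackets: a list, duplicates are kept

print(len(as_set), get_counts(as_set)[23])     # 5 items, 23 counted once
print(len(as_list), get_counts(as_list)[23])   # 6 items, 23 counted twice
```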

With a good knowledge of the Python standard library, the code above can be written more compactly:


In [8]:
from collections import defaultdict

def get_counts2(sequence):
    counts = defaultdict(int)  # missing keys start at 0
    for x in sequence:
        counts[x] += 1
    return counts

Both versions put the logic in a function. This makes the code more reusable and convenient for processing the time zones. Here we only need to pass in time_zones:


In [9]:
def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts

counts = get_counts(time_zones)
counts['America/New_York']


Out[9]:
1251

In [10]:
len(time_zones)


Out[10]:
3440

To get the top 10 time zones and their counts, we need a few dictionary tricks:


In [11]:
def top_counts(count_dict, n =10):
    value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
    value_key_pairs.sort()
    return value_key_pairs[-n:]

top_counts(counts)


Out[11]:
[(33, 'America/Sao_Paulo'),
 (35, 'Europe/Madrid'),
 (36, 'Pacific/Honolulu'),
 (37, 'Asia/Tokyo'),
 (74, 'Europe/London'),
 (191, 'America/Denver'),
 (382, 'America/Los_Angeles'),
 (400, 'America/Chicago'),
 (521, ''),
 (1251, 'America/New_York')]
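
The same top-n selection can be written without swapping keys and values. The sketch below is an alternative to top_counts, not the book's code: it sorts by the count through a key function, or uses heapq.nlargest. The toy dict reuses four counts from the output above.

```python
import heapq

def top_counts_sorted(count_dict, n=10):
    """Return the n (key, count) pairs with the largest counts, largest first."""
    return sorted(count_dict.items(), key=lambda kv: kv[1], reverse=True)[:n]

def top_counts_heap(count_dict, n=10):
    """Same result via heapq.nlargest, which avoids sorting the whole dict."""
    return heapq.nlargest(n, count_dict.items(), key=lambda kv: kv[1])

# A subset of the counts shown above, for illustration.
toy_counts = {"America/New_York": 1251, "": 521, "Europe/London": 74,
              "America/Chicago": 400}

print(top_counts_sorted(toy_counts, n=2))
print(top_counts_heap(toy_counts, n=2))
```

Unlike top_counts, both helpers return (key, count) pairs in descending order.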

The Python standard library also provides the collections.Counter class, which makes this task much simpler:


In [12]:
from collections import Counter
counts = Counter(time_zones)

In [13]:
counts.most_common(10)


Out[13]:
[('America/New_York', 1251),
 ('', 521),
 ('America/Chicago', 400),
 ('America/Los_Angeles', 382),
 ('America/Denver', 191),
 ('Europe/London', 74),
 ('Asia/Tokyo', 37),
 ('Pacific/Honolulu', 36),
 ('Europe/Madrid', 35),
 ('America/Sao_Paulo', 33)]

Second, counting time zones with pandas

DataFrame is the most important data structure in pandas; it represents data as a table. Creating a DataFrame from a list of raw records is easy:


In [14]:
from pandas import DataFrame, Series
import pandas as pd; import numpy as np
frame = DataFrame(records)
frame


Out[14]:
_heartbeat_ a al c cy g gr h hc hh kw l ll nk r t tz u
0 NaN Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi... en-US,en;q=0.8 US Danvers A6qOVH MA wfLQtf 1.331823e+09 1.usa.gov NaN orofrog [42.576698, -70.954903] 1.0 http://www.facebook.com/l/7AQEFzjSi/1.usa.gov/... 1.331923e+09 America/New_York http://www.ncbi.nlm.nih.gov/pubmed/22415991
1 NaN GoogleMaps/RochesterNY NaN US Provo mwszkS UT mwszkS 1.308262e+09 j.mp NaN bitly [40.218102, -111.613297] 0.0 http://www.AwareMap.com/ 1.331923e+09 America/Denver http://www.monroecounty.gov/etc/911/rss.php
2 NaN Mozilla/4.0 (compatible; MSIE 8.0; Windows NT ... en-US US Washington xxr3Qb DC xxr3Qb 1.331920e+09 1.usa.gov NaN bitly [38.9007, -77.043098] 1.0 http://t.co/03elZC4Q 1.331923e+09 America/New_York http://boxer.senate.gov/en/press/releases/0316...
3 NaN Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)... pt-br BR Braz zCaLwp 27 zUtuOu 1.331923e+09 1.usa.gov NaN alelex88 [-23.549999, -46.616699] 0.0 direct 1.331923e+09 America/Sao_Paulo http://apod.nasa.gov/apod/ap120312.html
4 NaN Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi... en-US,en;q=0.8 US Shrewsbury 9b6kNl MA 9b6kNl 1.273672e+09 bit.ly NaN bitly [42.286499, -71.714699] 0.0 http://www.shrewsbury-ma.gov/selco/ 1.331923e+09 America/New_York http://www.shrewsbury-ma.gov/egov/gallery/1341...
5 NaN Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi... en-US,en;q=0.8 US Shrewsbury axNK8c MA axNK8c 1.273673e+09 bit.ly NaN bitly [42.286499, -71.714699] 0.0 http://www.shrewsbury-ma.gov/selco/ 1.331923e+09 America/New_York http://www.shrewsbury-ma.gov/egov/gallery/1341...
6 NaN Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.1... pl-PL,pl;q=0.8,en-US;q=0.6,en;q=0.4 PL Luban wcndER 77 zkpJBR 1.331923e+09 1.usa.gov NaN bnjacobs [51.116699, 15.2833] 0.0 http://plus.url.google.com/url?sa=z&n=13319232... 1.331923e+09 Europe/Warsaw http://www.nasa.gov/mission_pages/nustar/main/...
7 NaN Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/2... bg,en-us;q=0.7,en;q=0.3 None NaN wcndER NaN zkpJBR 1.331923e+09 1.usa.gov NaN bnjacobs NaN 0.0 http://www.facebook.com/ 1.331923e+09 http://www.nasa.gov/mission_pages/nustar/main/...
8 NaN Opera/9.80 (X11; Linux zbov; U; en) Presto/2.1... en-US, en None NaN wcndER NaN zkpJBR 1.331923e+09 1.usa.gov NaN bnjacobs NaN 0.0 http://www.facebook.com/l.php?u=http%3A%2F%2F1... 1.331923e+09 http://www.nasa.gov/mission_pages/nustar/main/...
9 NaN Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi... pt-BR,pt;q=0.8,en-US;q=0.6,en;q=0.4 None NaN zCaLwp NaN zUtuOu 1.331923e+09 1.usa.gov NaN alelex88 NaN 0.0 http://t.co/o1Pd0WeV 1.331923e+09 http://apod.nasa.gov/apod/ap120312.html
10 NaN Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.2)... en-us,en;q=0.5 US Seattle vNJS4H WA u0uD9q 1.319564e+09 1.usa.gov NaN o_4us71ccioa [47.5951, -122.332603] 1.0 direct 1.331923e+09 America/Los_Angeles https://www.nysdot.gov/rexdesign/design/commun...
11 NaN Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4... en-us,en;q=0.5 US Washington wG7OIH DC A0nRz4 1.331816e+09 1.usa.gov NaN darrellissa [38.937599, -77.092796] 0.0 http://t.co/ND7SoPyo 1.331923e+09 America/New_York http://oversight.house.gov/wp-content/uploads/...
12 NaN Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.2)... en-us,en;q=0.5 US Alexandria vNJS4H VA u0uD9q 1.319564e+09 1.usa.gov NaN o_4us71ccioa [38.790901, -77.094704] 1.0 direct 1.331923e+09 America/New_York https://www.nysdot.gov/rexdesign/design/commun...
13 1.331923e+09 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
14 NaN Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US... en-us,en;q=0.5 US Marietta 2rOUYc GA 2rOUYc 1.255770e+09 1.usa.gov NaN bitly [33.953201, -84.5177] 1.0 direct 1.331923e+09 America/New_York http://toxtown.nlm.nih.gov/index.php
15 NaN Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.1... zh-TW,zh;q=0.8,en-US;q=0.6,en;q=0.4 HK Central District nQvgJp 00 rtrrth 1.317318e+09 j.mp NaN walkeryuen [22.2833, 114.150002] 1.0 http://forum2.hkgolden.com/view.aspx?type=BW&m... 1.331923e+09 Asia/Hong_Kong http://www.ssd.noaa.gov/PS/TROP/TCFP/data/curr...
16 NaN Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.1... zh-TW,zh;q=0.8,en-US;q=0.6,en;q=0.4 HK Central District XdUNr 00 qWkgbq 1.317318e+09 j.mp NaN walkeryuen [22.2833, 114.150002] 1.0 http://forum2.hkgolden.com/view.aspx?type=BW&m... 1.331923e+09 Asia/Hong_Kong http://www.usno.navy.mil/NOOC/nmfc-ph/RSS/jtwc...
17 NaN Mozilla/5.0 (Macintosh; Intel Mac OS X 10.5; r... en-us,en;q=0.5 US Buckfield zH1BFf ME x3jOIv 1.331840e+09 1.usa.gov NaN andyzieminski [44.299702, -70.369797] 0.0 http://t.co/6Cx4ROLs 1.331923e+09 America/New_York http://www.usda.gov/wps/portal/usda/usdahome?c...
18 NaN GoogleMaps/RochesterNY NaN US Provo mwszkS UT mwszkS 1.308262e+09 1.usa.gov NaN bitly [40.218102, -111.613297] 0.0 http://www.AwareMap.com/ 1.331923e+09 America/Denver http://www.monroecounty.gov/etc/911/rss.php
19 NaN Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi... it-IT,it;q=0.8,en-US;q=0.6,en;q=0.4 IT Venice wcndER 20 zkpJBR 1.331923e+09 1.usa.gov NaN bnjacobs [45.438599, 12.3267] 0.0 http://www.facebook.com/ 1.331923e+09 Europe/Rome http://www.nasa.gov/mission_pages/nustar/main/...
20 NaN Mozilla/5.0 (compatible; MSIE 9.0; Windows NT ... es-ES ES Alcal zQ95Hi 51 ytZYWR 1.331671e+09 bitly.com NaN jplnews [37.516701, -5.9833] 0.0 http://www.facebook.com/ 1.331923e+09 Africa/Ceuta http://voyager.jpl.nasa.gov/imagesvideo/uranus...
21 NaN Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6... en-us,en;q=0.5 US Davidsonville wcndER MD zkpJBR 1.331923e+09 1.usa.gov NaN bnjacobs [38.939201, -76.635002] 0.0 http://www.facebook.com/ 1.331923e+09 America/New_York http://www.nasa.gov/mission_pages/nustar/main/...
22 NaN Mozilla/4.0 (compatible; MSIE 8.0; Windows NT ... en-us US Hockessin y3ZImz DE y3ZImz 1.331064e+09 1.usa.gov NaN bitly [39.785, -75.682297] 0.0 direct 1.331923e+09 America/New_York http://portal.hud.gov/hudportal/documents/hudd...
23 NaN Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3)... en-us US Lititz wWiOiD PA wWiOiD 1.330218e+09 1.usa.gov NaN bitly [40.174999, -76.3078] 0.0 http://www.facebook.com/l.php?u=http%3A%2F%2F1... 1.331923e+09 America/New_York http://www.tricare.mil/mybenefit/ProfileFilter...
24 NaN Mozilla/5.0 (Windows; U; Windows NT 5.1; es-ES... es-es,es;q=0.8,en-us;q=0.5,en;q=0.3 ES Bilbao wcndER 59 zkpJBR 1.331923e+09 1.usa.gov NaN bnjacobs [43.25, -2.9667] 0.0 http://www.facebook.com/ 1.331923e+09 Europe/Madrid http://www.nasa.gov/mission_pages/nustar/main/...
25 NaN Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.1... en-GB,en;q=0.8,en-US;q=0.6,en-AU;q=0.4 MY Kuala Lumpur wcndER 14 zkpJBR 1.331923e+09 1.usa.gov NaN bnjacobs [3.1667, 101.699997] 0.0 http://www.facebook.com/ 1.331923e+09 Asia/Kuala_Lumpur http://www.nasa.gov/mission_pages/nustar/main/...
26 NaN Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.1... ro-RO,ro;q=0.8,en-US;q=0.6,en;q=0.4 CY Nicosia wcndER 04 zkpJBR 1.331923e+09 1.usa.gov NaN bnjacobs [35.166698, 33.366699] 0.0 http://www.facebook.com/?ref=tn_tnmn 1.331923e+09 Asia/Nicosia http://www.nasa.gov/mission_pages/nustar/main/...
27 NaN Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)... en-US,en;q=0.8 BR SPaulo zCaLwp 27 zUtuOu 1.331923e+09 1.usa.gov NaN alelex88 [-23.5333, -46.616699] 0.0 direct 1.331923e+09 America/Sao_Paulo http://apod.nasa.gov/apod/ap120312.html
28 NaN Mozilla/5.0 (iPad; CPU OS 5_0_1 like Mac OS X)... en-us None NaN vNJS4H NaN u0uD9q 1.319564e+09 1.usa.gov NaN o_4us71ccioa NaN 0.0 direct 1.331923e+09 https://www.nysdot.gov/rexdesign/design/commun...
29 NaN Mozilla/5.0 (iPad; U; CPU OS 3_2 like Mac OS X... en-us None NaN FPX0IM NaN FPX0IL 1.331923e+09 1.usa.gov NaN twittershare NaN 1.0 http://t.co/5xlp0B34 1.331923e+09 http://www.ed.gov/news/media-advisories/us-dep...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3530 NaN Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.1... en-US,en;q=0.8 US San Francisco xVZg4P CA wqUkTo 1.331908e+09 go.nasa.gov NaN nasatwitter [37.7645, -122.429398] 0.0 http://www.facebook.com/l.php?u=http%3A%2F%2Fg... 1.331927e+09 America/Los_Angeles http://www.nasa.gov/multimedia/imagegallery/im...
3531 NaN Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6... en-US None NaN wcndER NaN zkpJBR 1.331923e+09 1.usa.gov NaN bnjacobs NaN 0.0 direct 1.331927e+09 http://www.nasa.gov/mission_pages/nustar/main/...
3532 NaN Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.2)... en-us,en;q=0.5 US Washington Au3aUS DC A9ct6C 1.331926e+09 1.usa.gov NaN ncsha [38.904202, -77.031998] 1.0 http://www.ncsha.org/ 1.331927e+09 America/New_York http://portal.hud.gov/hudportal/HUD?src=/press...
3533 NaN Mozilla/5.0 (iPad; CPU OS 5_1 like Mac OS X) A... en-us US Jacksonville b2UtUJ FL ieCdgH 1.301393e+09 go.nasa.gov NaN nasatwitter [30.279301, -81.585098] 1.0 direct 1.331927e+09 America/New_York http://apod.nasa.gov/apod/
3534 NaN Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)... en-us US Frisco vNJS4H TX u0uD9q 1.319564e+09 1.usa.gov NaN o_4us71ccioa [33.149899, -96.855499] 1.0 direct 1.331927e+09 America/Chicago https://www.nysdot.gov/rexdesign/design/commun...
3535 NaN Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/... en-us US Houston zIgLx8 TX yrPaLt 1.331903e+09 aash.to NaN aashto [29.775499, -95.415199] 1.0 direct 1.331927e+09 America/Chicago http://ntl.bts.gov/lib/44000/44300/44374/FHWA-...
3536 NaN Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; e... en-US,en;q=0.5 None NaN xIcyim NaN yG1TTf 1.331728e+09 go.nasa.gov NaN nasatwitter NaN 0.0 http://t.co/g1VKE8zS 1.331927e+09 http://www.nasa.gov/mission_pages/hurricanes/a...
3537 NaN Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.2)... es-es,es;q=0.8,en-us;q=0.5,en;q=0.3 HN Tegucigalpa zCaLwp 08 w63FZW 1.331547e+09 1.usa.gov NaN bufferapp [14.1, -87.216698] 0.0 http://t.co/A8TJyibE 1.331927e+09 America/Tegucigalpa http://apod.nasa.gov/apod/ap120312.html
3538 NaN Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Ma... en-us US Los Angeles qMac9k CA qds1Ge 1.310474e+09 1.usa.gov NaN healthypeople [34.041599, -118.298798] 0.0 direct 1.331927e+09 America/Los_Angeles http://healthypeople.gov/2020/connect/webinars...
3539 NaN Mozilla/5.0 (compatible; Fedora Core 3) FC3 KDE NaN US Bellevue zu2M5o WA zDhdro 1.331586e+09 bit.ly NaN glimtwin [47.615398, -122.210297] 0.0 direct 1.331927e+09 America/Los_Angeles http://www.federalreserve.gov/newsevents/press...
3540 NaN Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi... en-US,en;q=0.8 US Payson wcndER UT zkpJBR 1.331923e+09 1.usa.gov NaN bnjacobs [40.014198, -111.738899] 0.0 http://www.facebook.com/l.php?u=http%3A%2F%2F1... 1.331927e+09 America/Denver http://www.nasa.gov/mission_pages/nustar/main/...
3541 NaN Mozilla/5.0 (X11; U; OpenVMS AlphaServer_ES40;... NaN US Bellevue zu2M5o WA zDhdro 1.331586e+09 1.usa.gov NaN glimtwin [47.615398, -122.210297] 0.0 direct 1.331927e+09 America/Los_Angeles http://www.federalreserve.gov/newsevents/press...
3542 NaN Mozilla/5.0 (compatible; MSIE 9.0; Windows NT ... en-us US Pittsburg y3reI1 CA y3reI1 1.331926e+09 1.usa.gov NaN bitly [38.0051, -121.838699] 0.0 http://www.facebook.com/l.php?u=http%3A%2F%2F1... 1.331927e+09 America/Los_Angeles http://www.sba.gov/community/blogs/community-b...
3543 1.331927e+09 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3544 NaN Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0.1) ... en-us,en;q=0.5 US Wentzville vNJS4H MO u0uD9q 1.319564e+09 1.usa.gov NaN o_4us71ccioa [38.790001, -90.854897] 1.0 direct 1.331927e+09 America/Chicago https://www.nysdot.gov/rexdesign/design/commun...
3545 NaN Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.2)... en-us,en;q=0.5 US Saint Charles vNJS4H IL u0uD9q 1.319564e+09 1.usa.gov NaN o_4us71ccioa [41.9352, -88.290901] 1.0 direct 1.331927e+09 America/Chicago https://www.nysdot.gov/rexdesign/design/commun...
3546 NaN Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Ma... en-us US Los Angeles qMac9k CA qds1Ge 1.310474e+09 1.usa.gov NaN healthypeople [34.041599, -118.298798] 1.0 direct 1.331927e+09 America/Los_Angeles http://healthypeople.gov/2020/connect/webinars...
3547 NaN Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)... en-us US Silver Spring y0jYkg MD y0jYkg 1.331852e+09 1.usa.gov NaN bitly [39.052101, -77.014999] 1.0 direct 1.331927e+09 America/New_York http://www.epa.gov/otaq/regs/fuels/additive/e1...
3548 NaN Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Ma... en-us US Mcgehee y5rMac AR xANY6O 1.331916e+09 1.usa.gov NaN twitterfeed [33.628399, -91.356903] 1.0 https://twitter.com/fdarecalls/status/18069759... 1.331927e+09 America/Chicago http://www.fda.gov/Safety/Recalls/ucm296326.htm
3549 NaN Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi... sv-SE,sv;q=0.8,en-US;q=0.6,en;q=0.4 SE Sollefte eH8wu 24 7dtjei 1.260316e+09 1.usa.gov NaN tweetdeckapi [63.166698, 17.266701] 1.0 direct 1.331927e+09 Europe/Stockholm http://www.nasa.gov/mission_pages/WISE/main/in...
3550 NaN Mozilla/4.0 (compatible; MSIE 8.0; Windows NT ... en-us US Conshohocken A00b72 PA yGSwzn 1.331918e+09 1.usa.gov NaN addthis [40.0798, -75.2855] 0.0 http://www.linkedin.com/home?trk=hb_tab_home_top 1.331927e+09 America/New_York http://www.nlm.nih.gov/medlineplus/news/fullst...
3551 NaN Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi... en-US,en;q=0.8 None NaN wcndER NaN zkpJBR 1.331923e+09 1.usa.gov NaN bnjacobs NaN 0.0 http://plus.url.google.com/url?sa=z&n=13319268... 1.331927e+09 http://www.nasa.gov/mission_pages/nustar/main/...
3552 NaN Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US... NaN US Decatur rqgJuE AL xcz8vt 1.331227e+09 1.usa.gov NaN bootsnall [34.572701, -86.940598] 0.0 direct 1.331927e+09 America/Chicago http://travel.state.gov/passport/passport_5535...
3553 NaN Mozilla/4.0 (compatible; MSIE 7.0; Windows NT ... en-us US Shrewsbury 9b6kNl MA 9b6kNl 1.273672e+09 bit.ly NaN bitly [42.286499, -71.714699] 0.0 http://www.shrewsbury-ma.gov/selco/ 1.331927e+09 America/New_York http://www.shrewsbury-ma.gov/egov/gallery/1341...
3554 NaN Mozilla/4.0 (compatible; MSIE 7.0; Windows NT ... en-us US Shrewsbury axNK8c MA axNK8c 1.273673e+09 bit.ly NaN bitly [42.286499, -71.714699] 0.0 http://www.shrewsbury-ma.gov/selco/ 1.331927e+09 America/New_York http://www.shrewsbury-ma.gov/egov/gallery/1341...
3555 NaN Mozilla/4.0 (compatible; MSIE 9.0; Windows NT ... en US Paramus e5SvKE NJ fqPSr9 1.301298e+09 1.usa.gov NaN tweetdeckapi [40.9445, -74.07] 1.0 direct 1.331927e+09 America/New_York http://www.fda.gov/AdvisoryCommittees/Committe...
3556 NaN Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.1... en-US,en;q=0.8 US Oklahoma City jQLtP4 OK jQLtP4 1.307530e+09 1.usa.gov NaN bitly [35.4715, -97.518997] 0.0 http://www.facebook.com/l.php?u=http%3A%2F%2F1... 1.331927e+09 America/Chicago http://www.okc.gov/PublicNotificationSystem/Fo...
3557 NaN GoogleMaps/RochesterNY NaN US Provo mwszkS UT mwszkS 1.308262e+09 j.mp NaN bitly [40.218102, -111.613297] 0.0 http://www.AwareMap.com/ 1.331927e+09 America/Denver http://www.monroecounty.gov/etc/911/rss.php
3558 NaN GoogleProducer NaN US Mountain View zjtI4X CA zjtI4X 1.327529e+09 1.usa.gov NaN bitly [37.419201, -122.057404] 0.0 direct 1.331927e+09 America/Los_Angeles http://www.ahrq.gov/qual/qitoolkit/
3559 NaN Mozilla/4.0 (compatible; MSIE 8.0; Windows NT ... en-US US Mc Lean qxKrTK VA qxKrTK 1.312898e+09 1.usa.gov NaN bitly [38.935799, -77.162102] 0.0 http://t.co/OEEEvwjU 1.331927e+09 America/New_York http://herndon-va.gov/Content/public_safety/Pu...

3560 rows × 18 columns


In [15]:
frame['tz'][:10]


Out[15]:
0     America/New_York
1       America/Denver
2     America/New_York
3    America/Sao_Paulo
4     America/New_York
5     America/New_York
6        Europe/Warsaw
7                     
8                     
9                     
Name: tz, dtype: object

The output shown for frame is the summary view, which is used for large DataFrame objects. The Series returned by frame['tz'] has a value_counts method that gives us what we need:


In [16]:
tz_counts = frame['tz'].value_counts()
tz_counts[:10]


Out[16]:
America/New_York       1251
                        521
America/Chicago         400
America/Los_Angeles     382
America/Denver          191
Europe/London            74
Asia/Tokyo               37
Pacific/Honolulu         36
Europe/Madrid            35
America/Sao_Paulo        33
Name: tz, dtype: int64

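Next, we want to plot this data with matplotlib. First, we fill in a substitute value for unknown or missing time zones in the records. The fillna function replaces missing values (NA), and unknown values (empty strings) can be replaced with boolean array indexing: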


In [17]:
clean_tz = frame['tz'].fillna('Missing')

In [18]:
clean_tz[clean_tz == ''] = 'Unknown'

In [19]:
tz_counts = clean_tz.value_counts()

In [20]:
tz_counts[:10]


Out[20]:
America/New_York       1251
Unknown                 521
America/Chicago         400
America/Los_Angeles     382
America/Denver          191
Missing                 120
Europe/London            74
Asia/Tokyo               37
Pacific/Honolulu         36
Europe/Madrid            35
Name: tz, dtype: int64
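
The difference between a missing value and an empty string is easier to see on a tiny column. The Series below is hypothetical; the steps mirror the fillna and boolean-indexing cells above.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values): one real NaN and one empty string.
tz = pd.Series(["America/New_York", np.nan, "", "America/New_York"])

# value_counts skips NaN by default, so the missing row is not counted.
print(tz.value_counts())

clean = tz.fillna("Missing")        # NaN -> 'Missing'
clean[clean == ""] = "Unknown"      # empty string -> 'Unknown'

counts = clean.value_counts()
print(counts)
```

This is why 'Missing' only shows up in the counts after fillna: the 120 NaN rows were silently left out of the first value_counts call.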

Using the plot method of the tz_counts object, we get a horizontal bar chart:


In [21]:
%matplotlib inline
tz_counts[:10].plot(kind='barh', rot=0)


Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x2100edc56d8>

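We can do much more with this kind of data. For example, the a field contains information about the browser, device, or application used to perform the URL shortening: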


In [22]:
frame['a'][1]


Out[22]:
'GoogleMaps/RochesterNY'

In [23]:
frame['a'][50]


Out[23]:
'Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/20100101 Firefox/10.0.2'

In [24]:
frame['a'][51]


Out[24]:
'Mozilla/5.0 (Linux; U; Android 2.2.2; en-us; LG-P925/V10e Build/FRG83G) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1'

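Parsing all the information in these "agent" strings (the browser's User-Agent) may seem tedious. With Python's built-in string functions and regular expressions, though, it becomes much easier.

For example, we can split off the first token of the string (which corresponds roughly to the browser) and produce another summary of user behavior: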


In [25]:
results = Series([x.split()[0] for x in frame.a.dropna()])

In [26]:
results[:5]


Out[26]:
0               Mozilla/5.0
1    GoogleMaps/RochesterNY
2               Mozilla/4.0
3               Mozilla/5.0
4               Mozilla/5.0
dtype: object

In [27]:
results.value_counts()[:8]


Out[27]:
Mozilla/5.0                 2594
Mozilla/4.0                  601
GoogleMaps/RochesterNY       121
Opera/9.80                    34
TEST_INTERNET_AGENT           24
GoogleProducer                21
Mozilla/6.0                    5
BlackBerry8520/5.0.0.681       4
dtype: int64
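
The notes mention regular expressions but only use str.split. As a rough sketch of the regex route, the hypothetical helper first_token below splits the leading token into a product and a version; the pattern is an assumption about the "product/version" layout seen in the examples above, not a full User-Agent parser.

```python
import re

# Matches a leading "product/version" token such as "Mozilla/5.0".
TOKEN = re.compile(r"^(?P<product>[^/\s]+)(?:/(?P<version>\S+))?")

def first_token(agent):
    """Return (product, version) from the start of an agent string.

    version is None when the string has no '/version' part.
    """
    match = TOKEN.match(agent)
    if match is None:
        return None
    return match.group("product"), match.group("version")

print(first_token("Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/20100101 Firefox/10.0.2"))
print(first_token("GoogleMaps/RochesterNY"))
print(first_token("GoogleProducer"))
```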

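Now suppose we want to break the time zone counts down into Windows and non-Windows users. For simplicity, we treat a user as a Windows user whenever the agent string contains "Windows". Since some agents are missing, we first remove those rows from the data: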


In [28]:
cframe = frame[frame.a.notnull()]

Next, we compute whether each row is Windows based on the value of a:


In [29]:
operating_system = np.where(cframe['a'].str.contains('Windows'), 'Windows','Not Windows')

In [30]:
operating_system[:5] # note: the output of this line differs from the book


Out[30]:
array(['Windows', 'Not Windows', 'Windows', 'Not Windows', 'Windows'], 
      dtype='<U11')
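
Dropping the rows with a missing agent matters because str.contains returns NaN for them, which np.where cannot sensibly classify. The sketch below uses a hypothetical three-row column and also shows an alternative not used in these notes: passing na=False to str.contains so that missing agents count as "Not Windows".

```python
import numpy as np
import pandas as pd

# Toy agent column (hypothetical values) with one missing entry.
agents = pd.Series(["Mozilla/5.0 (Windows NT 6.1)", None, "GoogleMaps/RochesterNY"])

# Option 1: drop missing agents first, as the notebook does.
present = agents[agents.notnull()]
os_dropped = np.where(present.str.contains("Windows"), "Windows", "Not Windows")

# Option 2: keep all rows and treat a missing agent as "no match" via na=False.
os_all = np.where(agents.str.contains("Windows", na=False), "Windows", "Not Windows")

print(os_dropped)
print(os_all)
```

Which option is appropriate depends on whether rows without an agent should stay in the analysis at all; the notes follow the book and drop them.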

Then we can group the data by time zone and this newly obtained list of operating systems:


In [31]:
by_tz_os = cframe.groupby(['tz', operating_system])

The group counts are then computed with size (similar to the value_counts function above), and the result is reshaped with unstack:


In [32]:
agg_counts = by_tz_os.size().unstack().fillna(0)

In [33]:
agg_counts[:10]


Out[33]:
                                Not Windows  Windows
tz
                                      245.0    276.0
Africa/Cairo                            0.0      3.0
Africa/Casablanca                       0.0      1.0
Africa/Ceuta                            0.0      2.0
Africa/Johannesburg                     0.0      1.0
Africa/Lusaka                           0.0      1.0
America/Anchorage                       4.0      1.0
America/Argentina/Buenos_Aires          1.0      0.0
America/Argentina/Cordoba               0.0      1.0
America/Argentina/Mendoza               0.0      1.0
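
The groupby, size, unstack chain is easier to follow on a table small enough to check by hand. The data below are hypothetical; the calls are the same ones used in the cells above.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): time zone plus a separate operating-system array.
toy = pd.DataFrame({"tz": ["A", "A", "B", "A"]})
toy_os = np.array(["Windows", "Not Windows", "Windows", "Windows"])

# size() counts the rows in each (tz, os) group; unstack() pivots the os level
# into columns; fillna(0) fills combinations that never occurred.
toy_counts = toy.groupby(["tz", toy_os]).size().unstack().fillna(0)
print(toy_counts)
```

Time zone B has no "Not Windows" row, so unstack leaves a NaN there, which fillna(0) turns into 0. The same effect explains the float values (245.0, 276.0, ...) in the real output.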

Finally, let's select the most frequent time zones. To do so, we build an indirect index array from the row totals in agg_counts:


In [34]:
# used to sort in ascending order
indexer = agg_counts.sum(1).argsort()

In [35]:
indexer[:10]


Out[35]:
tz
                                  24
Africa/Cairo                      20
Africa/Casablanca                 21
Africa/Ceuta                      92
Africa/Johannesburg               87
Africa/Lusaka                     53
America/Anchorage                 54
America/Argentina/Buenos_Aires    57
America/Argentina/Cordoba         26
America/Argentina/Mendoza         55
dtype: int64

Then we use take to select the rows in that order and slice off the last 10 rows:


In [36]:
count_subset = agg_counts.take(indexer)[-10:]

In [37]:
count_subset


Out[37]:
                     Not Windows  Windows
tz
America/Sao_Paulo           13.0     20.0
Europe/Madrid               16.0     19.0
Pacific/Honolulu             0.0     36.0
Asia/Tokyo                   2.0     35.0
Europe/London               43.0     31.0
America/Denver             132.0     59.0
America/Los_Angeles        130.0    252.0
America/Chicago            115.0    285.0
                           245.0    276.0
America/New_York           339.0    912.0
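
The indirect sort can be checked on a three-row table. The numbers below are hypothetical. The first variant follows the argsort and take steps above; the second, using Series.nlargest, is a shorter alternative that is not in the book's code.

```python
import pandas as pd

# Toy counts table (hypothetical numbers), indexed by time zone.
toy = pd.DataFrame(
    {"Not Windows": [5, 1, 30], "Windows": [10, 0, 20]},
    index=["B", "C", "A"],
)

totals = toy.sum(axis=1)            # row totals: B=15, C=1, A=50

# Indirect sort: argsort gives the positions that would sort the totals
# ascending, take reorders the rows by position, and the slice keeps the tail.
by_argsort = toy.take(totals.argsort().to_numpy())[-2:]

# A shorter alternative: pick the labels of the largest totals directly.
by_nlargest = toy.loc[totals.nlargest(2).index]

print(by_argsort)
print(by_nlargest)
```

The two results contain the same rows in opposite order: the argsort version ends with the largest total, while nlargest starts with it.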

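A bar chart can be generated from this. We use stacked=True to produce a stacked bar chart; in the cells below each row is first normalized to sum to 1, so that the relative share of Windows users is easier to compare across time zones: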


In [38]:
%matplotlib inline
normed_subset = count_subset.div(count_subset.sum(1), axis=0)

In [39]:
normed_subset.plot(kind='barh', stacked = True)


Out[39]:
<matplotlib.axes._subplots.AxesSubplot at 0x2100f219f28>

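All of the methods used here are covered in detail in later chapters of the book. (I think the author should have said this earlier; I kept hesitating to read on, and it turns out this was just one long illustrative example.)

MovieLens 1M data set

GroupLens Research collected movie rating data provided by MovieLens users from the 1990s to the early 2000s. The data include movie ratings, movie metadata (genres and year), and demographic data about the users (gender, age, and so on). Recommender systems based on machine learning algorithms are typically interested in this kind of data. Although this book does not cover machine learning techniques in detail, it does show how to slice and dice such data to fit our needs.

The MovieLens 1M data set contains 1 million ratings from 6,000 users on 4,000 movies. It is spread across three tables: ratings, user information, and movie information. Each table can be loaded into a pandas DataFrame object with pandas.read_table: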


In [40]:
import pandas as pd

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']

users = pd.read_table('pydata-book/ch02/movielens/users.dat', sep='::', 
                      header=None, names = unames)

rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('pydata-book/ch02/movielens/ratings.dat', sep='::',
                       header=None, names = rnames)

mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('pydata-book/ch02/movielens/movies.dat', sep='::',
                      header=None, names = mnames)


D:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:6: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  
D:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:10: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  # Remove the CWD from sys.path while we load stuff.
D:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:14: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
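
The ParserWarning above can be avoided by naming the parser engine explicitly, as the message itself suggests. The sketch below is a minimal illustration under a few assumptions: it reads two lines in the users.dat layout from an in-memory string instead of the real file, and it uses pd.read_csv in place of pd.read_table.

```python
import io
import pandas as pd

# Two lines in the users.dat layout (the first two users of the data set).
raw = "1::F::1::10::48067\n2::M::56::16::70072\n"

unames = ["user_id", "gender", "age", "occupation", "zip"]

# A multi-character separator is treated as a regular expression, which only
# the Python parsing engine supports; naming it explicitly avoids the warning.
users_sample = pd.read_csv(io.StringIO(raw), sep="::", header=None,
                           names=unames, engine="python")
print(users_sample)
```

In this small sample the zip column is parsed as integers; passing dtype={"zip": str} would keep zip codes as text, which matters for values with a leading zero.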
  

In [41]:
users[:5]


Out[41]:
   user_id  gender  age  occupation    zip
0        1       F    1          10  48067
1        2       M   56          16  70072
2        3       M   25          15  55117
3        4       M   45           7  02460
4        5       M   25          20  55455

In [42]:
ratings[:5]


Out[42]:
   user_id  movie_id  rating  timestamp
0        1      1193       5  978300760
1        1       661       3  978302109
2        1       914       3  978301968
3        1      3408       4  978300275
4        1      2355       5  978824291

In [43]:
movies[:5]


Out[43]:
   movie_id  title                               genres
0         1  Toy Story (1995)                    Animation|Children's|Comedy
1         2  Jumanji (1995)                      Adventure|Children's|Fantasy
2         3  Grumpier Old Men (1995)             Comedy|Romance
3         4  Waiting to Exhale (1995)            Comedy|Drama
4         5  Father of the Bride Part II (1995)  Comedy

In [88]:
ratings[:10]


Out[88]:
   user_id  movie_id  rating  timestamp
0        1      1193       5  978300760
1        1       661       3  978302109
2        1       914       3  978301968
3        1      3408       4  978300275
4        1      2355       5  978824291
5        1      1197       3  978302268
6        1      1287       5  978302039
7        1      2804       5  978300719
8        1       594       4  978302268
9        1       919       4  978301368

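Note that ages and occupations are given as codes; see the data set's README file for what they mean.

Analyzing data spread across three tables is not easy. Suppose we want to compute the mean rating of a movie by gender and age; this is much simpler once all the data are merged into a single table. We first use the pandas merge function to merge ratings with users, and then merge movies into the result. pandas infers which columns to use as the merge (or join) keys from the overlapping column names: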


In [45]:
data = pd.merge(pd.merge(ratings, users), movies)

In [89]:
data[:10]


Out[89]:
   user_id  movie_id  rating  timestamp  gender  age  occupation    zip  title                                   genres
0        1      1193       5  978300760       F    1          10  48067  One Flew Over the Cuckoo's Nest (1975)  Drama
1        2      1193       5  978298413       M   56          16  70072  One Flew Over the Cuckoo's Nest (1975)  Drama
2       12      1193       4  978220179       M   25          12  32793  One Flew Over the Cuckoo's Nest (1975)  Drama
3       15      1193       4  978199279       M   25           7  22903  One Flew Over the Cuckoo's Nest (1975)  Drama
4       17      1193       5  978158471       M   50           1  95350  One Flew Over the Cuckoo's Nest (1975)  Drama
5       18      1193       4  978156168       F   18           3  95825  One Flew Over the Cuckoo's Nest (1975)  Drama
6       19      1193       5  982730936       M    1          10  48073  One Flew Over the Cuckoo's Nest (1975)  Drama
7       24      1193       5  978136709       F   25           7  10023  One Flew Over the Cuckoo's Nest (1975)  Drama
8       28      1193       3  978125194       F   25           1  14607  One Flew Over the Cuckoo's Nest (1975)  Drama
9       33      1193       5  978557765       M   45           3  55421  One Flew Over the Cuckoo's Nest (1975)  Drama
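
How merge picks its keys can be seen on three tiny tables. The rows below are hypothetical; the nested pd.merge call has the same shape as the one above, and the explicit on= version is shown only for comparison.

```python
import pandas as pd

# Toy tables (hypothetical rows) that share column names the way the
# MovieLens tables do.
ratings_toy = pd.DataFrame({"user_id": [1, 2], "movie_id": [10, 10], "rating": [5, 3]})
users_toy = pd.DataFrame({"user_id": [1, 2], "gender": ["F", "M"]})
movies_toy = pd.DataFrame({"movie_id": [10], "title": ["Toy Movie (2000)"]})

# With no `on=` argument, merge joins on the columns both frames have in common.
implicit = pd.merge(pd.merge(ratings_toy, users_toy), movies_toy)

# Spelling the keys out gives the same result and documents the intent.
explicit = ratings_toy.merge(users_toy, on="user_id").merge(movies_toy, on="movie_id")

print(implicit)
```

Relying on inferred keys is convenient here, but naming them is safer when tables might share a column that is not meant to be a key.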

Now we can aggregate the ratings by any user or movie attribute. To compute the mean rating of each movie by gender, we can use the pivot_table method:


In [64]:
# the code in the book is:
mean_ratings = data.pivot_table('rating',
                               rows='title', cols='gender',aggfunc='mean')


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-64-df5506c0e75e> in <module>()
      1 mean_ratings = data.pivot_table('rating',
----> 2                                rows='title', cols='gender',aggfunc='mean')

TypeError: pivot_table() got an unexpected keyword argument 'rows'

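Clearly this fails: according to the error message, pivot_table has no 'rows' argument. I was about to give up on this code, but I searched on Google and found that someone had discussed this problem last year (Stack Overflow link).

The solution is to change

mean_ratings = data.pivot_table('rating', rows='title', cols='gender', aggfunc='mean')

to

mean_ratings = data.pivot_table('rating', index='title', columns='gender', aggfunc='mean')

The reason:

The code in the book uses old syntax that has since been removed.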


In [90]:
mean_ratings = data.pivot_table('rating', index='title', 
                                columns='gender', aggfunc='mean')

In [91]:
mean_ratings[:5]


Out[91]:
gender                                F         M
title
$1,000,000 Duck (1971)         3.375000  2.761905
'Night Mother (1986)           3.388889  3.352941
'Til There Was You (1997)      2.675676  2.733333
'burbs, The (1989)             2.793478  2.962085
...And Justice for All (1979)  3.828571  3.689024
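
To see what pivot_table computes, here is a sketch on a four-row table that can be verified by hand. The data are hypothetical, and the groupby version is only a comparison, not the notebook's method.

```python
import pandas as pd

# Toy merged table (hypothetical rows).
toy = pd.DataFrame({
    "title": ["A", "A", "A", "B"],
    "gender": ["F", "F", "M", "M"],
    "rating": [4, 2, 5, 3],
})

# index= and columns= replace the old rows= and cols= keywords.
toy_means = toy.pivot_table(values="rating", index="title",
                            columns="gender", aggfunc="mean")
print(toy_means)

# The same table built from groupby, for comparison.
via_groupby = toy.groupby(["title", "gender"])["rating"].mean().unstack()
print(via_groupby)
```

Title B has no rating from gender F, so that cell is NaN in both tables.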

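This produces another DataFrame containing the mean ratings, with movie titles as the row labels and gender as the column labels. Next, we want to filter out movies that received fewer than 250 ratings. To do so, we first group the data by title and then use size() to get a Series holding the group size for each title: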


In [68]:
ratings_by_title = data.groupby('title').size()

In [69]:
ratings_by_title[0:10]


Out[69]:
title
$1,000,000 Duck (1971)                37
'Night Mother (1986)                  70
'Til There Was You (1997)             52
'burbs, The (1989)                   303
...And Justice for All (1979)        199
1-900 (1994)                           2
10 Things I Hate About You (1999)    700
101 Dalmatians (1961)                565
101 Dalmatians (1996)                364
12 Angry Men (1957)                  616
dtype: int64

In [71]:
active_titles = ratings_by_title.index[ratings_by_title >= 250]

In [72]:
active_titles


Out[72]:
Index([''burbs, The (1989)', '10 Things I Hate About You (1999)',
       '101 Dalmatians (1961)', '101 Dalmatians (1996)', '12 Angry Men (1957)',
       '13th Warrior, The (1999)', '2 Days in the Valley (1996)',
       '20,000 Leagues Under the Sea (1954)', '2001: A Space Odyssey (1968)',
       '2010 (1984)',
       ...
       'X-Men (2000)', 'Year of Living Dangerously (1982)',
       'Yellow Submarine (1968)', 'You've Got Mail (1998)',
       'Young Frankenstein (1974)', 'Young Guns (1988)',
       'Young Guns II (1990)', 'Young Sherlock Holmes (1985)',
       'Zero Effect (1998)', 'eXistenZ (1999)'],
      dtype='object', name='title', length=1216)

The index obtained above contains the titles of films with at least 250 ratings. We can use it to select the rows we need from mean_ratings:


In [105]:
mean_ratings = mean_ratings.ix[active_titles] 
# The book uses mean_ratings.ix here, but .ix is deprecated in favour of .loc/.iloc


D:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: DeprecationWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  """Entry point for launching an IPython kernel.

In [107]:
mean_ratings = mean_ratings.loc[active_titles]
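
The count, filter and select steps can be sketched end to end on a small made-up table. The frame `toy`, the threshold `min_ratings` and the other names are hypothetical; the notes use a threshold of 250 on the real data:

```python
import pandas as pd

# Hypothetical toy ratings; film "B" has only one rating.
toy = pd.DataFrame({
    "title": ["A", "A", "A", "B", "C", "C"],
    "rating": [4, 5, 3, 2, 1, 5],
})
min_ratings = 2  # assumed threshold for this sketch (250 in the notes)

# Number of ratings per title, as a Series indexed by title.
counts = toy.groupby("title").size()

# Boolean mask applied to the index keeps only well-rated titles.
popular = counts.index[counts >= min_ratings]

# Label-based selection with .loc (the replacement for .ix).
toy_means = toy.groupby("title")["rating"].mean()
filtered = toy_means.loc[popular]
print(filtered)
```

Because `popular` is an Index of labels, `.loc` is the appropriate indexer here; `.iloc` would expect integer positions.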

In [108]:
mean_ratings


Out[108]:
gender F M diff
title
'burbs, The (1989) 2.793478 2.962085 0.168607
10 Things I Hate About You (1999) 3.646552 3.311966 -0.334586
101 Dalmatians (1961) 3.791444 3.500000 -0.291444
101 Dalmatians (1996) 3.240000 2.911215 -0.328785
12 Angry Men (1957) 4.184397 4.328421 0.144024
13th Warrior, The (1999) 3.112000 3.168000 0.056000
2 Days in the Valley (1996) 3.488889 3.244813 -0.244076
20,000 Leagues Under the Sea (1954) 3.670103 3.709205 0.039102
2001: A Space Odyssey (1968) 3.825581 4.129738 0.304156
2010 (1984) 3.446809 3.413712 -0.033097
28 Days (2000) 3.209424 2.977707 -0.231717
39 Steps, The (1935) 3.965517 4.107692 0.142175
54 (1998) 2.701754 2.782178 0.080424
7th Voyage of Sinbad, The (1958) 3.409091 3.658879 0.249788
8MM (1999) 2.906250 2.850962 -0.055288
About Last Night... (1986) 3.188679 3.140909 -0.047770
Absent Minded Professor, The (1961) 3.469388 3.446809 -0.022579
Absolute Power (1997) 3.469136 3.327759 -0.141377
Abyss, The (1989) 3.659236 3.689507 0.030272
Ace Ventura: Pet Detective (1994) 3.000000 3.197917 0.197917
Ace Ventura: When Nature Calls (1995) 2.269663 2.543333 0.273670
Addams Family Values (1993) 3.000000 2.878531 -0.121469
Addams Family, The (1991) 3.186170 3.163498 -0.022672
Adventures in Babysitting (1987) 3.455782 3.208122 -0.247660
Adventures of Buckaroo Bonzai Across the 8th Dimension, The (1984) 3.308511 3.402321 0.093810
Adventures of Priscilla, Queen of the Desert, The (1994) 3.989071 3.688811 -0.300260
Adventures of Robin Hood, The (1938) 4.166667 3.918367 -0.248299
African Queen, The (1951) 4.324232 4.223822 -0.100410
Age of Innocence, The (1993) 3.827068 3.339506 -0.487561
Agnes of God (1985) 3.534884 3.244898 -0.289986
... ... ... ...
White Men Can't Jump (1992) 3.028777 3.231061 0.202284
Who Framed Roger Rabbit? (1988) 3.569378 3.713251 0.143873
Who's Afraid of Virginia Woolf? (1966) 4.029703 4.096939 0.067236
Whole Nine Yards, The (2000) 3.296552 3.404814 0.108262
Wild Bunch, The (1969) 3.636364 4.128099 0.491736
Wild Things (1998) 3.392000 3.459082 0.067082
Wild Wild West (1999) 2.275449 2.131973 -0.143476
William Shakespeare's Romeo and Juliet (1996) 3.532609 3.318644 -0.213965
Willow (1988) 3.658683 3.453543 -0.205139
Willy Wonka and the Chocolate Factory (1971) 4.063953 3.789474 -0.274480
Witness (1985) 4.115854 3.941504 -0.174349
Wizard of Oz, The (1939) 4.355030 4.203138 -0.151892
Wolf (1994) 3.074074 2.899083 -0.174992
Women on the Verge of a Nervous Breakdown (1988) 3.934307 3.865741 -0.068566
Wonder Boys (2000) 4.043796 3.913649 -0.130147
Working Girl (1988) 3.606742 3.312500 -0.294242
World Is Not Enough, The (1999) 3.337500 3.388889 0.051389
Wrong Trousers, The (1993) 4.588235 4.478261 -0.109974
Wyatt Earp (1994) 3.147059 3.283898 0.136839
X-Files: Fight the Future, The (1998) 3.489474 3.493797 0.004323
X-Men (2000) 3.682310 3.851702 0.169391
Year of Living Dangerously (1982) 3.951220 3.869403 -0.081817
Yellow Submarine (1968) 3.714286 3.689286 -0.025000
You've Got Mail (1998) 3.542424 3.275591 -0.266834
Young Frankenstein (1974) 4.289963 4.239177 -0.050785
Young Guns (1988) 3.371795 3.425620 0.053825
Young Guns II (1990) 2.934783 2.904025 -0.030758
Young Sherlock Holmes (1985) 3.514706 3.363344 -0.151362
Zero Effect (1998) 3.864407 3.723140 -0.141266
eXistenZ (1999) 3.098592 3.289086 0.190494

1216 rows × 3 columns

To see which films female viewers liked most, we can sort by the F column in descending order:


In [109]:
top_female_ratings = mean_ratings.sort_index(by='F', ascending=False)


D:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: by argument to sort_index is deprecated, pls use .sort_values(by=...)
  """Entry point for launching an IPython kernel.

In [110]:
top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)

In [111]:
top_female_ratings[:10]


Out[111]:
gender                                                         F         M       diff
title
Close Shave, A (1995)                                   4.644444  4.473795  -0.170650
Wrong Trousers, The (1993)                              4.588235  4.478261  -0.109974
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)           4.572650  4.464589  -0.108060
Wallace & Gromit: The Best of Aardman Animation (1996)  4.563107  4.385075  -0.178032
Schindler's List (1993)                                 4.562602  4.491415  -0.071187
Shawshank Redemption, The (1994)                        4.539075  4.560625   0.021550
Grand Day Out, A (1992)                                 4.537879  4.293255  -0.244624
To Kill a Mockingbird (1962)                            4.536667  4.372611  -0.164055
Creature Comforts (1990)                                4.513889  4.272277  -0.241612
Usual Suspects, The (1995)                              4.513317  4.518248   0.004931
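
The difference between sorting by a column's values and sorting by the row labels can be seen on a small made-up frame. The names and numbers below are hypothetical:

```python
import pandas as pd

# Hypothetical mean ratings for three films.
toy_mean = pd.DataFrame(
    {"F": [3.5, 4.6, 2.7], "M": [3.9, 4.4, 3.3]},
    index=pd.Index(["A", "B", "C"], name="title"),
)

# sort_values orders rows by the values in a column ...
by_female = toy_mean.sort_values(by="F", ascending=False)

# ... while sort_index orders rows by their labels (here, the titles).
by_title = toy_mean.sort_index(ascending=False)

print(by_female)
print(by_title)
```

This is why the warning points to `sort_values(by=...)`: sorting by column values is a separate operation from sorting by the index.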

Measuring rating disagreement

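Suppose we want to find the films on which male and female viewers disagree most. One approach is to add a column to mean_ratings holding the difference between the two mean ratings, and then sort by it: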


In [112]:
mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']

Sorting by 'diff' in ascending order puts the films with the largest disagreement that female viewers preferred at the top:


In [93]:
sorted_by_diff = mean_ratings.sort_index(by = 'diff')


D:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: by argument to sort_index is deprecated, pls use .sort_values(by=...)
  """Entry point for launching an IPython kernel.

In [113]:
sorted_by_diff = mean_ratings.sort_values(by='diff')

In [114]:
sorted_by_diff[:15]


Out[114]:
gender                                        F         M       diff
title
Dirty Dancing (1987)                   3.790378  2.959596  -0.830782
Jumpin' Jack Flash (1986)              3.254717  2.578358  -0.676359
Grease (1978)                          3.975265  3.367041  -0.608224
Little Women (1994)                    3.870588  3.321739  -0.548849
Steel Magnolias (1989)                 3.901734  3.365957  -0.535777
Anastasia (1997)                       3.800000  3.281609  -0.518391
Rocky Horror Picture Show, The (1975)  3.673016  3.160131  -0.512885
Color Purple, The (1985)               4.158192  3.659341  -0.498851
Age of Innocence, The (1993)           3.827068  3.339506  -0.487561
Free Willy (1993)                      2.921348  2.438776  -0.482573
French Kiss (1995)                     3.535714  3.056962  -0.478752
Little Shop of Horrors, The (1960)     3.650000  3.179688  -0.470312
Guys and Dolls (1955)                  4.051724  3.583333  -0.468391
Mary Poppins (1964)                    4.197740  3.730594  -0.467147
Patch Adams (1998)                     3.473282  3.008746  -0.464536

Reversing the sorted result and taking the first 15 rows gives the films that male viewers preferred:


In [115]:
sorted_by_diff[::-1][:15]


Out[115]:
gender                                         F         M      diff
title
Good, The Bad and The Ugly, The (1966)  3.494949  4.221300  0.726351
Kentucky Fried Movie, The (1977)        2.878788  3.555147  0.676359
Dumb & Dumber (1994)                    2.697987  3.336595  0.638608
Longest Day, The (1962)                 3.411765  4.031447  0.619682
Cable Guy, The (1996)                   2.250000  2.863787  0.613787
Evil Dead II (Dead By Dawn) (1987)      3.297297  3.909283  0.611985
Hidden, The (1987)                      3.137931  3.745098  0.607167
Rocky III (1982)                        2.361702  2.943503  0.581801
Caddyshack (1980)                       3.396135  3.969737  0.573602
For a Few Dollars More (1965)           3.409091  3.953795  0.544704
Porky's (1981)                          2.296875  2.836364  0.539489
Animal House (1978)                     3.628906  4.167192  0.538286
Exorcist, The (1973)                    3.537634  4.067239  0.529605
Fright Night (1985)                     2.973684  3.500000  0.526316
Barb Wire (1996)                        1.585366  2.100386  0.515020
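
The whole "difference column, sort, read off both ends" pattern can be sketched on a small made-up frame. All names and numbers here are hypothetical:

```python
import pandas as pd

# Hypothetical mean ratings by gender for five films.
toy_mean = pd.DataFrame(
    {"F": [3.8, 3.5, 4.5, 2.7, 4.0], "M": [3.0, 4.2, 4.4, 3.3, 3.5]},
    index=pd.Index(["A", "B", "C", "D", "E"], name="title"),
)

# Positive diff: men rated the film higher; negative: women did.
toy_mean["diff"] = toy_mean["M"] - toy_mean["F"]

# Ascending sort puts the most negative differences first.
sorted_toy = toy_mean.sort_values(by="diff")

female_pref = sorted_toy[:2]       # most negative diff
male_pref = sorted_toy[::-1][:2]   # reversed: most positive diff
print(female_pref)
print(male_pref)
```

Reversing with `[::-1]` avoids a second sort; calling `sort_values(by="diff", ascending=False)` would give the same top rows.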

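If we only want the films that divide viewers the most, regardless of gender, we can compute the variance or standard deviation of the ratings: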


In [127]:
# Standard deviation of the ratings, grouped by film title
rating_std_by_title = data.groupby('title')['rating'].std()

In [128]:
# Filter down to active_titles
rating_std_by_title = rating_std_by_title.loc[active_titles]

In [129]:
# Sort the Series by value in descending order
rating_std_by_title.order(ascending=False)[:10]


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-129-5b832fb1fe6d> in <module>()
      1 # Sort the Series by value in descending order
----> 2 rating_std_by_title.order(ascending=False)[:10]

D:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   3079             if name in self._info_axis:
   3080                 return self[name]
-> 3081             return object.__getattribute__(self, name)
   3082 
   3083     def __setattr__(self, name, value):

AttributeError: 'Series' object has no attribute 'order'

In [130]:
# Series.order, used in the book's code, has been removed; current pandas provides sort_values instead
rating_std_by_title.sort_values(ascending=False)[:10]


Out[130]:
title
Dumb & Dumber (1994)                     1.321333
Blair Witch Project, The (1999)          1.316368
Natural Born Killers (1994)              1.307198
Tank Girl (1995)                         1.277695
Rocky Horror Picture Show, The (1975)    1.260177
Eyes Wide Shut (1999)                    1.259624
Evita (1996)                             1.253631
Billy Madison (1995)                     1.249970
Fear and Loathing in Las Vegas (1998)    1.246408
Bicentennial Man (1999)                  1.245533
Name: rating, dtype: float64
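
To illustrate why the standard deviation works as a gender-independent measure of disagreement, here is a minimal sketch on made-up ratings (the frame `toy` is hypothetical). Film "A" splits viewers between 1 and 5, while film "B" gets nearly uniform scores:

```python
import pandas as pd

# Hypothetical ratings: "A" is polarising, "B" is not.
toy = pd.DataFrame({
    "title": ["A"] * 4 + ["B"] * 4,
    "rating": [1, 5, 1, 5, 3, 3, 4, 3],
})

# Per-title sample standard deviation (pandas uses ddof=1 by default).
std_by_title = toy.groupby("title")["rating"].std()

# Highest spread first; sort_values replaces the removed Series.order.
ranked = std_by_title.sort_values(ascending=False)
print(ranked)
```

Both films have a similar mean rating, yet "A" ranks first because its ratings are spread much further from that mean.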

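Author's note:

You may have noticed that the film genres are given as a single string separated by "|". To analyse the genres, we first need to convert them into a more useful form. Later chapters of the book show how to do this, and we will use this dataset again then.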
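
As a rough illustration of what a "more useful form" might look like, and not necessarily the approach the book takes later, the sketch below splits a "|"-separated genre column in two ways. The frame `toy_movies` and the column name `genres` are assumptions made for this example:

```python
import pandas as pd

# Hypothetical two-row movies table; the `genres` column name is assumed.
toy_movies = pd.DataFrame({
    "title": ["Toy Story (1995)", "Jumanji (1995)"],
    "genres": ["Animation|Children's|Comedy", "Adventure|Children's|Fantasy"],
})

# Option 1: a list of genre names per film.
genre_lists = toy_movies["genres"].str.split("|")

# Option 2: one 0/1 indicator column per genre.
genre_flags = toy_movies["genres"].str.get_dummies(sep="|")

print(genre_lists)
print(genre_flags)
```

The indicator form is convenient for counting or filtering by genre, since each genre becomes an ordinary column.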

