Pandas

Pandas is a library for data analysis. We illustrate its use with a simple analysis of data from the IMDB database.

I will also use the libraries requests (for fetching web pages) and BeautifulSoup (for parsing HTML).


In [4]:
import requests
from bs4 import BeautifulSoup as bs

We fetch data from IMDB on films whose country of origin is Croatia, made between 1945 and 2017.


In [117]:
import time
url = 'http://www.imdb.com/search/title'
params = dict(sort='num_votes,desc', start=1, title_type='feature', year='1945,2017', countries='hr', languages='hr')
r = []
# Request the result list in steps of 50; `start` is the index of the first result on a page.
for i in range(1, 240, 50):  # there are currently 201 such films
    params['start'] = i
    r.append(requests.get(url, params=params))
    time.sleep(10)  # pause between requests
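
To check which pages the loop requests without contacting IMDB, the query strings can be built offline. The following is a minimal sketch using only the standard library; it reuses the `url` and `params` values from the cell above and assumes that `requests` encodes the parameters in the usual URL-encoded form.

```python
from urllib.parse import urlencode

url = 'http://www.imdb.com/search/title'
params = dict(sort='num_votes,desc', start=1, title_type='feature',
              year='1945,2017', countries='hr', languages='hr')

# The same offsets the download loop iterates over.
starts = list(range(1, 240, 50))

# Build one query URL per offset without sending any request.
page_urls = []
for start in starts:
    page_params = dict(params, start=start)
    page_urls.append(url + '?' + urlencode(page_params))

print(starts)
print(page_urls[0])
```

Five offsets are enough to cover the 201 films mentioned in the loop comment.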

We parse the downloaded pages and save the data to the file filmovi2.txt.


In [167]:
import re
with open('filmovi2.txt','w') as f:
    for i in range(len(r)):
        soup = bs(r[i].text,'html.parser')
        for film in soup.find_all('div', class_="lister-item"):
            for a in film.find_all('a', href=True):
                if '/title/tt' in a['href'] and 'adv_li_tt' in a['href']:
                    title = a.contents[0]
            rt = film.find_all('span', class_="runtime")
            if rt:
                runtime =  rt[0].contents[0]
            else:
                runtime = '0 mins'
            y = film.find_all('span', class_="lister-item-year")
            if y: 
                year = y[0].contents[0]
                year = year.replace('(I)','').replace('(III)','').strip()
            else:
                year = '???'
            rat = film.find_all('span', class_='global-sprite rating-star imdb-rating')
            if rat:
                rating = str(list(rat[0].next_siblings)[1]).replace('<strong>','').replace('</strong>','')
            else:
                rating = '0'
            g = film.find_all('span',class_="genre")
            if g: genres =  ' '.join(g[0].contents).replace('\n','').strip()
            d=film.find_all(string =re.compile('Director'))[0]
            director = d.next_element.contents[0]
            f.write('\t'.join((title, year, runtime, rating,director, genres))+'\n')
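
The link filter in the loop checks an `href` for two substrings. In Python, `'x' and 'y' in s` is parsed as `'x' and ('y' in s)`, and a non-empty string literal is always truthy, so only the second substring is actually tested. A small sketch with hypothetical `href` values (made up for illustration, not taken from IMDB) shows the difference:

```python
# Hypothetical href values, for illustration only.
hrefs = [
    '/title/tt0000001/?ref_=adv_li_tt',   # contains both substrings
    '/name/nm0000001/?ref_=adv_li_tt',    # contains only the second one
]

# Parsed as: '/title/tt' and ('adv_li_tt' in h)
loose = [h for h in hrefs if '/title/tt' and 'adv_li_tt' in h]

# Each substring is tested explicitly.
strict = [h for h in hrefs if '/title/tt' in h and 'adv_li_tt' in h]

print(len(loose), len(strict))
```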

In [119]:
!head filmovi2.txt


The Hunting Party	(2007)	101 min	6.9	Richard Shepard	Adventure, Comedy, Drama
Karaula	(2006)	94 min	7.7	Rajko Grlic	Action, Comedy, Drama
Svecenikova djeca	(2013)	96 min	6.8	Vinko Bresan	Comedy, Drama
Metastaze	(2009)	82 min	7.8	Branko Schmidt	Crime, Drama
Zvizdan	(2015)	123 min	7.3	Dalibor Matanic	Drama, Romance, War
How the War Started on My Island	(1996)	97 min	7.9	Vinko Bresan	Comedy, War
Ljudozder vegetarijanac	(2012)	85 min	7.2	Branko Schmidt	Drama
Fine mrtve djevojke	(2002)	77 min	7.2	Dalibor Matanic	Drama, Thriller
Sonja i bik	(2012)	103 min	7.1	Vlatka Vorkapic	Comedy, Romance
Petelinji zajtrk	(2007)	124 min	7.6	Marko Nabersnik	Drama

Data analysis with the Pandas library


In [174]:
import pandas as pd
names = ['title', 'year','runtime', 'rating', 'director', 'genres']
data = pd.read_csv('filmovi2.txt', delimiter='\t', names=names)
print ("Number of rows: {:d}".format(data.shape[0]))
data.head()


Number of rows: 201
Out[174]:
   title              year    runtime  rating  director         genres
0  The Hunting Party  (2007)  101 min  6.9     Richard Shepard  Adventure, Comedy, Drama
1  Karaula            (2006)  94 min   7.7     Rajko Grlic      Action, Comedy, Drama
2  Svecenikova djeca  (2013)  96 min   6.8     Vinko Bresan     Comedy, Drama
3  Metastaze          (2009)  82 min   7.8     Branko Schmidt   Crime, Drama
4  Zvizdan            (2015)  123 min  7.3     Dalibor Matanic  Drama, Romance, War
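
The same `read_csv` call can be tried without the file by reading from an in-memory string. This sketch uses the first three lines of filmovi2.txt as listed earlier; `sep` is the standard spelling of the `delimiter` argument, and the name `sample_df` is only illustrative.

```python
import io
import pandas as pd

# First three lines of filmovi2.txt (tab-separated, no header row).
sample = (
    "The Hunting Party\t(2007)\t101 min\t6.9\tRichard Shepard\tAdventure, Comedy, Drama\n"
    "Karaula\t(2006)\t94 min\t7.7\tRajko Grlic\tAction, Comedy, Drama\n"
    "Svecenikova djeca\t(2013)\t96 min\t6.8\tVinko Bresan\tComedy, Drama\n"
)
names = ['title', 'year', 'runtime', 'rating', 'director', 'genres']
sample_df = pd.read_csv(io.StringIO(sample), sep='\t', names=names)

print(sample_df.shape)
print(sample_df.dtypes)
```

Only `rating` is parsed as a number at this point; `year` and `runtime` are still strings such as `(2007)` and `101 min`, which is why a cleaning step is needed.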

In [175]:
# data['runtime'].fillna('0 mins.', inplace=True);
clean_runtime = [int(v.split(' ')[0]) for v in data.runtime]
data['runtime'] = clean_runtime
data['year'] = [int(y[1:-1]) for y in data.year]
# data.rating[data.rating=='-'] = '0';
clean_rating = [float(v) for v in data.rating]
data['rating'] = clean_rating
#clean_genres = [g.replace(' ','|') for g in data.genres]
#data['genres'] = clean_genres
data.head()


Out[175]:
   title              year  runtime  rating  director         genres
0  The Hunting Party  2007  101      6.9     Richard Shepard  Adventure, Comedy, Drama
1  Karaula            2006  94       7.7     Rajko Grlic      Action, Comedy, Drama
2  Svecenikova djeca  2013  96       6.8     Vinko Bresan     Comedy, Drama
3  Metastaze          2009  82       7.8     Branko Schmidt   Crime, Drama
4  Zvizdan            2015  123      7.3     Dalibor Matanic  Drama, Romance, War
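
The list comprehensions in the cleaning cell can also be written with pandas' vectorised string methods. The following is a sketch of that alternative on a three-row frame (the name `raw` is illustrative), not a replacement for the cell above:

```python
import pandas as pd

raw = pd.DataFrame({
    'year': ['(2007)', '(2006)', '(2013)'],
    'runtime': ['101 min', '94 min', '96 min'],
})

# Strip the surrounding parentheses and convert to integers.
raw['year'] = raw['year'].str.strip('()').astype(int)

# Keep the leading digits of the runtime and convert to integers.
raw['runtime'] = raw['runtime'].str.extract(r'(\d+)', expand=False).astype(int)

print(raw)
```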

In [176]:
data.loc[118]


Out[176]:
title       Doktor ludosti
year                  2003
runtime                  0
rating                   7
director      Fadil Hadzic
genres              Comedy
Name: 118, dtype: object

In [177]:
data[['year','runtime', 'rating']].describe()


Out[177]:
year runtime rating
count 201.000000 201.000000 201.000000
mean 2007.935323 85.029851 7.131841
std 7.379078 29.859489 0.866534
min 1970.000000 0.000000 3.700000
25% 2004.000000 78.000000 6.700000
50% 2010.000000 90.000000 7.200000
75% 2013.000000 100.000000 7.600000
max 2017.000000 200.000000 9.600000

In [217]:
import numpy as np
data.replace(0,np.nan, inplace=True);

In [218]:
data[['runtime', 'rating']].describe()


Out[218]:
runtime rating
count 186.000000 201.000000
mean 91.887097 7.131841
std 18.176407 0.866534
min 46.000000 3.700000
25% 80.250000 6.700000
50% 91.500000 7.200000
75% 100.000000 7.600000
max 200.000000 9.600000
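
Between the two `describe()` outputs the runtime count drops from 201 to 186 and the minimum rises from 0 to 46, because pandas skips NaN values in `count()`, `mean()` and the other summary statistics. A minimal sketch with three runtimes, one of them the placeholder 0:

```python
import numpy as np
import pandas as pd

runtimes = pd.Series([101, 0, 94], name='runtime')

# With the placeholder 0 left in, it drags the mean down.
mean_with_zero = runtimes.mean()

# After replacing 0 with NaN the entry is skipped by count() and mean().
cleaned = runtimes.replace(0, np.nan)

print(cleaned.count(), cleaned.mean(), mean_with_zero)
```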

In [219]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.hist(data.year, bins=np.arange(1945, 2019))  # edges up to 2018 so that 2017 gets its own bin
plt.xlabel("Year of production");
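
The bin edges decide which years are counted. `np.arange` excludes its stop value, and values beyond the last edge are silently dropped. The sketch below uses `np.histogram`, which applies the same binning rules as `plt.hist`, on three illustrative years:

```python
import numpy as np

years = np.array([2015, 2016, 2017])

# Edges 2015, 2016: a single bin [2015, 2016]; 2017 falls outside it.
narrow_counts, narrow_edges = np.histogram(years, bins=np.arange(2015, 2017))

# Edges 2015..2018: one bin per year, the last one being [2017, 2018].
wide_counts, wide_edges = np.histogram(years, bins=np.arange(2015, 2019))

print(narrow_counts, wide_counts)
```

For one bin per year, the stop value of `np.arange` therefore has to be two greater than the last year in the data.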



In [220]:
plt.hist(data.rating.dropna(), bins=20)
plt.xlabel("IMDB rating");



In [222]:
plt.scatter(data.year, data.rating, lw=0, color='k')
plt.xlabel("Year")
plt.ylabel("IMDB rating");



In [223]:
data[data.rating == data.rating.min()][['title', 'year', 'rating','director', 'genres']]


Out[223]:
    title                          year  rating  director           genres
83  Larin izbor: Izgubljeni princ  2012  3.7     Tomislav Rukavina  Drama

In [224]:
data[data.rating == data.rating.max()][['title', 'year', 'rating','director',  'genres']]


Out[224]:
     title                                              year  rating  director       genres
189  The Second Death of Maximilian Paspa Aka Paspi...  2016  9.6     Zorko Sirotic  Mystery
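
When only a single row is wanted, `idxmin` and `idxmax` give the index label of the extreme value directly. A minimal sketch with hypothetical titles; note that, unlike the boolean mask used above, these methods return only the first row in case of a tie:

```python
import pandas as pd

films = pd.DataFrame({
    'title': ['A', 'B', 'C'],   # hypothetical titles
    'rating': [6.9, 3.7, 9.6],
})

# Look up the rows holding the lowest and the highest rating.
worst = films.loc[films['rating'].idxmin()]
best = films.loc[films['rating'].idxmax()]

print(worst['title'], best['title'])
```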

In [225]:
genres = set()
for m in data.genres:
    genres.update(g for g in m.split(','))
genres = sorted(genres)

for genre in genres:
    data[genre] = [genre in movie.split(',') for movie in data.genres]
         
data.head()


Out[225]:
title year runtime rating director genres Adventure Comedy Crime Drama ... Biography Comedy Crime Drama Family Fantasy Horror Mystery Thriller War
0 The Hunting Party 2007 101.0 6.9 Richard Shepard Adventure, Comedy, Drama False True False True ... False False False False False False False False False False
1 Karaula 2006 94.0 7.7 Rajko Grlic Action, Comedy, Drama False True False True ... False False False False False False False False False False
2 Svecenikova djeca 2013 96.0 6.8 Vinko Bresan Comedy, Drama False False False True ... False True False False False False False False False False
3 Metastaze 2009 82.0 7.8 Branko Schmidt Crime, Drama False False False True ... False False True False False False False False False False
4 Zvizdan 2015 123.0 7.3 Dalibor Matanic Drama, Romance, War False False False False ... False False False True False False False False False False

5 rows × 35 columns


In [226]:
genre_count = data[genres].sum()
pd.DataFrame({'Genre Count': genre_count})


Out[226]:
Genre Count
Adventure 2
Comedy 7
Crime 4
Drama 43
Family 9
Fantasy 2
History 5
Horror 2
Music 2
Musical 1
Mystery 5
Romance 16
Sci-Fi 2
Sport 1
Thriller 12
War 19
Action 15
Adventure 6
Animation 3
Biography 6
Comedy 49
Crime 10
Drama 97
Family 3
Fantasy 2
Horror 3
Mystery 4
Thriller 2
War 1
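
Several genres appear twice in the count above (for example Comedy with 7 and with 49). The likely cause is that `split(',')` leaves a leading space on every genre after the first, so `' Comedy'` and `'Comedy'` end up as separate columns. A minimal sketch on three rows that strips the whitespace before building the indicator columns (the names `sample`, `split_genres` and `genre_names` are illustrative):

```python
import pandas as pd

sample = pd.DataFrame({'genres': ['Adventure, Comedy, Drama',
                                  'Comedy, Drama',
                                  'Drama']})

# Split on commas and strip the surrounding whitespace from each genre.
split_genres = [[g.strip() for g in m.split(',')] for m in sample.genres]
genre_names = sorted(set(g for row in split_genres for g in row))

# One boolean indicator column per genre.
for name in genre_names:
    sample[name] = [name in row for row in split_genres]

genre_count = sample[genre_names].sum()
print(genre_count)
```

With the whitespace stripped, each genre gets exactly one column, and its count is the sum of the two partial counts that the unstripped version reports.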

In [227]:
# petoljetka = five-year period: round each year down to a multiple of 5
petoljetka = (data.year // 5) * 5

tyd = data[['title', 'year']].copy()
tyd['petoljetka'] = petoljetka

tyd.head()


Out[227]:
   title              year  petoljetka
0  The Hunting Party  2007  2005
1  Karaula            2006  2005
2  Svecenikova djeca  2013  2010
3  Metastaze          2009  2005
4  Zvizdan            2015  2015

In [228]:
pet_mean = data.groupby(petoljetka).rating.mean()
pet_mean.name = 'Petoljetka mean'
print (pet_mean)

plt.plot(pet_mean.index, pet_mean.values, 'o-',
        color='r', lw=3, label='Five-year mean')
plt.scatter(data.year, data.rating, alpha=.04, lw=0, color='k')
plt.xlabel("Year")
plt.ylabel("Rating")
plt.legend(frameon=False);


year
1970    8.600000
1980    7.700000
1990    7.114286
1995    6.754545
2000    6.956000
2005    6.973684
2010    7.193333
2015    7.512500
Name: Petoljetka mean, dtype: float64
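
The grouping key here is a computed Series rather than a column name: integer division by 5 followed by multiplication by 5 rounds each year down to the start of its five-year period. A small sketch using five (year, rating) pairs from the listing of filmovi2.txt; the names `films`, `period` and `period_mean` are illustrative:

```python
import pandas as pd

films = pd.DataFrame({
    'year':   [2006, 2007, 2009, 2012, 2013],
    'rating': [7.7, 6.9, 7.8, 7.2, 6.8],
})

# Round each year down to the start of its five-year period.
period = (films['year'] // 5) * 5

# Group by the computed Series; no extra column is needed.
period_mean = films.groupby(period)['rating'].mean()
print(period_mean)
```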

In [229]:
for year, subset in data.groupby('year'):
    print (year, subset[subset.rating == subset.rating.max()].title.values)


1970 ['Tko pjeva zlo ne misli']
1980 ['Izgubljeni zavicaj']
1991 ['Vrijeme ratnika']
1992 ['Kamenita vrata']
1993 ['Kontesa Dora']
1994 ['Vukovar se vraca kuci']
1995 ['Washed Out']
1996 ['How the War Started on My Island']
1997 ['Pont Neuf']
1998 ['Kad mrtvi zapjevaju']
1999 ['Marsal' 'Dubrovacki suton']
2000 ['Je li jasno prijatelju?']
2001 ['Holding']
2002 ['24 sata' 'Serafin, svjetionicarev sin']
2003 ['Konjanik' 'Tu' 'Bore Lee: U kandzama velegrada']
2004 ['Nije bed']
2005 ['RGB: RedGreenBlue']
2006 ['Crveno i crno']
2007 ['Pjevajte nesto ljubavno']
2008 ['Blazeni Augustin Kazotic']
2009 ['Da mogu...']
2010 ['Mjesto na kojem je umro posljednji covjek']
2011 ["Marija's Own"]
2012 ["Once Upon a Winter's Night"]
2013 ['Glazbena kutija']
2014 ['Vlog']
2015 ['Oporuka']
2016 ['The Second Death of Maximilian Paspa Aka Paspin Kut3']
2017 ['Uzbuna na Zelenom Vrhu']
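
The explicit loop over groups can also be expressed with `transform`, which broadcasts each year's maximum back onto the rows of that year. This is a sketch of that alternative with hypothetical titles and ratings; ties are kept, as in the loop above:

```python
import pandas as pd

films = pd.DataFrame({
    'title':  ['A', 'B', 'C', 'D'],      # hypothetical titles
    'year':   [2002, 2002, 2003, 2003],
    'rating': [7.5, 7.5, 6.0, 8.1],
})

# Highest rating within each year, aligned with the original rows.
year_max = films.groupby('year')['rating'].transform('max')

# Keep every film that reaches its year's maximum (ties included).
best_per_year = films[films['rating'] == year_max]
print(best_per_year)
```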

In [230]:
from verzije import *
from IPython.display import HTML
HTML(print_sysinfo()+info_packages('pandas, numpy,requests, beautifulsoup4'))


Out[230]:
Python version          3.5.3
compiler                GCC 4.8.2 20140120 (Red Hat 4.8.2-15)
system                  Linux
number of CPUs          8
interpreter             64bit
pandas version          0.19.2
numpy version           1.11.3
requests version        2.13.0
beautifulsoup4 version  4.5.3
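
The `verzije` module is a local helper whose source is not shown here. A comparable report can be sketched with the standard library alone; the function `package_versions` below is a hypothetical stand-in, not the helper's actual implementation, and it assumes Python 3.8 or newer for `importlib.metadata`.

```python
import platform
from importlib import metadata

def package_versions(names):
    """Return {package name: version string, or 'not installed'}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = 'not installed'
    return versions

# Interpreter and system details, followed by the package versions.
report = {'Python': platform.python_version(), 'system': platform.system()}
report.update(package_versions(['pandas', 'numpy', 'requests', 'beautifulsoup4']))

for key, value in report.items():
    print(key, value)
```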