Se trata de una librería para análisis de datos en Python inspirada por R. Con el tiempo ha ido ganando popularidad, y de mantenerse la tendencia superará incluso R.
Dispone de las siguientes funcionalidades.
Pandas fue desarrollado por Wes McKinney cuando trabajaba para AQR Capital Management, y cuenta con bastante arraigo dentro del sector del análisis financiero en EEUU y UK.
In [1]:
import pandas as pd
import numpy as np
In [2]:
trends = pd.read_csv('./data/20160819_OlympicSportsByCountries.csv',
header=1)
trends.head()
Out[2]:
In [3]:
trends[trends.Country == "Spain"].sort_values(by="Search Interest",
ascending=False)
Out[3]:
In [4]:
trends[trends.Sport == "Tennis"].sort_values(by="Search Interest",
ascending=False
).head()
Out[4]:
In [5]:
interesting_sports = pd.DataFrame(
[(sport, len(data)) for sport, data in trends.groupby("Sport")]
)
interesting_sports.columns=["Sport", "Countries interested"]
interesting_sports.sort_values(by="Countries interested", ascending=False)
Out[5]:
In [6]:
trends[trends.Sport == "Trampolining"]
Out[6]:
Rosie McLennan (Canadá) ganó el oro femenino en gimnasia de trampolín
In [7]:
medalists = pd.read_csv(
'./data/Summer Olympic medallists 1896 to 2008 - ALL MEDALISTS.csv',
header=4)
medalists.head()
Out[7]:
In [8]:
filtered = medalists[(medalists.NOC == 'ESP') & (medalists.Edition == 1992)]
filtered[["Discipline", "Event", "Gender", "Medal"]].groupby(
["Discipline", "Event", "Gender", "Medal"]
).count()
Out[8]:
In [9]:
import matplotlib.pyplot as plt
plt.style.use("bmh")
from datetime import date
%matplotlib notebook
medalists_list = list()
for edition, data in medalists[medalists.NOC == 'ESP'].groupby('Edition'):
medalists_list.append((date(year=edition, month=1, day=1), len(data)))
timevol = pd.Series([pair[1] for pair in medalists_list],
index=[pair[0] for pair in medalists_list])
timevol.plot(title='Número de medallistas')
Out[9]:
Pandas es quizás la herramienta con la que es más sencillo manipular y transformar datos desestructurados, que es sin duda una de las tareas más desagradecidas y laboriosas de la profesión de analista de datos.
In [10]:
import sqlite3
conn = sqlite3.connect('data/medalists.db')
c = conn.cursor()
In [11]:
c.execute("SELECT * FROM medalists WHERE NOC = 'ESP' LIMIT 5")
for row in c.fetchall():
print(row)
In [12]:
query = """SELECT Edition, Event, Athlete, Gender FROM medalists
WHERE NOC = 'ESP'
ORDER BY Edition ASC
LIMIT 5
"""
c.execute(query)
for row in c.fetchall():
print(row)
In [13]:
query = """SELECT Discipline, Event, Gender, Medal FROM medalists
WHERE NOC = 'ESP' AND Edition = 1992
GROUP BY Discipline, Event, Gender, Medal
"""
c.execute(query)
for row in c.fetchall():
print(row)
#filtered = medalists[(medalists.NOC == 'ESP') & (medalists.Edition == 1992)]
#filtered[["Discipline", "Event", "Gender", "Medal"]].groupby(
# ["Discipline", "Event", "Gender", "Medal"]
#).count()
In [14]:
table = pd.read_sql(query, conn)
table
Out[14]:
In [15]:
conn.close()