Usual stuff to import
In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
from IPython.display import display, HTML
If you find manipulating dataframes in R a bit too cumbersome, why not give Pandas a chance? On top of easy and efficient table management, the plotting functionality is pretty great.
One-dimensional labelled array which can hold any data type (even Python objects).
In [2]:
series_one = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
series_one
Out[2]:
In [3]:
series_two = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5})
series_two
Out[3]:
Series is ndarray-like and dict-like, and supports vectorized operations and label alignment
In [4]:
series_one[2:4]
Out[4]:
In [5]:
series_one['a']
Out[5]:
In [6]:
series_one + series_two
Out[6]:
In [7]:
series_one * 3
Out[7]:
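The label alignment mentioned above is worth a closer look. A minimal sketch (the series names and values here are made up for illustration):

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1.0, 2.0], index=['a', 'b'])
s2 = pd.Series([10.0, 20.0], index=['b', 'c'])

# Arithmetic aligns on labels; labels present in only one
# of the operands come out as NaN in the result.
total = s1 + s2
```

Only the shared label `'b'` gets a real sum; `'a'` and `'c'` become NaN.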
In [8]:
df_one = pd.DataFrame({'one': pd.Series(np.random.rand(5),
index=['a', 'b', 'c', 'd' , 'e']),
'two': pd.Series(np.random.rand(4),
index=['a', 'b', 'c', 'e'])})
df_one
Out[8]:
There are several other constructors for creating a DataFrame object
pd.DataFrame.from_records
pd.DataFrame.from_dict
pd.DataFrame.from_items
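A quick sketch of two of these constructors (the column names and values are invented for illustration):

```python
import pandas as pd

# from_records: a list of tuples plus explicit column names
df_r = pd.DataFrame.from_records([(1, 'a'), (2, 'b')],
                                 columns=['num', 'letter'])

# from_dict: dict keys become column names by default
df_d = pd.DataFrame.from_dict({'num': [1, 2], 'letter': ['a', 'b']})
```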
There are other Pandas data objects which we are not going to cover here.
The Pandas I/O API is a set of nice reader functions which generally return a pandas object
Some important parameters
sep
- Delimiter
index_col
- Specifies which column to select as the index
usecols
- Specifies which columns to read when reading a file
compression
- Can handle gzip, bz2 compressed text files
comment
- Comment character
names
- If header=None, you can specify the names of the columns
iterator
- Return an iterator (a TextFileReader object)
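A few of these in action, reading from an in-memory string so it runs without any file on disk (the data and separator are made up):

```python
import io
import pandas as pd

csv_text = "# a comment line\nx;y;z\n1;2;3\n4;5;6\n"
df = pd.read_csv(io.StringIO(csv_text), sep=';', comment='#',
                 usecols=['x', 'y'], index_col=0)
# index_col is applied after usecols, so the index here is 'x'
```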
In [9]:
iris = pd.read_csv("iris.csv", index_col=0)
iris.head()
Out[9]:
Let's see the power of pandas: we'll read the Gencode v24 annotation file to demonstrate.
In [10]:
url = "ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_24/gencode.v24.primary_assembly.annotation.gtf.gz"
gencode = pd.read_csv(url, compression="gzip", iterator=True, header=None,
sep="\t", comment="#", quoting=3,
usecols=[0, 1, 2, 3, 4, 6])
gencode.get_chunk(10)
Out[10]:
Dumps data to a CSV file. Many optional parameters let you save the file exactly the way you want.
iris.to_csv("iris_copy.csv")
iris.to_hdf("iris_copy.h5", "df")
Creates an HDF5 file (a binary, indexed format for faster loading and index filtering at load time). Requires pytables
as a dependency if you want to go full on with its functionality
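A quick round trip through to_csv and back, using an in-memory buffer so nothing touches disk (the frame's contents are invented):

```python
import io
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3.0, 4.5]}, index=['x', 'y'])

buf = io.StringIO()
df.to_csv(buf)            # the index is written as the first column
buf.seek(0)
df_back = pd.read_csv(buf, index_col=0)
```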
Almost everyone will be familiar with how much you need to reshape data before you can plot it properly. This functionality is also well covered in pandas.
pd.melt
In [11]:
planets = pd.read_csv("planets.csv", index_col=0)
planets.head()
Out[11]:
In [12]:
planets_melt = pd.melt(planets, id_vars="method")
planets_melt.head()
Out[12]:
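In case planets.csv isn't at hand, the same melt pattern on a made-up wide frame:

```python
import pandas as pd

wide = pd.DataFrame({'method': ['RV', 'Transit'],
                     'mass': [1.2, 0.8],
                     'distance': [10.0, 50.0]})

# One row per (method, measurement) pair; the measurement columns
# are stacked into 'variable' / 'value' pairs.
long_df = pd.melt(wide, id_vars='method')
```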
In [13]:
heatmap = pd.read_csv("Heatmap.tsv", sep="\t", index_col=0)
heatmap.head(10)
Out[13]:
In [14]:
heatmap.iloc[4:8]
Out[14]:
In [15]:
heatmap.loc[['prisons', 'jacks', 'irons']]
Out[15]:
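The iloc/loc distinction shown above, as a self-contained sketch (the index labels are invented):

```python
import pandas as pd

df = pd.DataFrame({'v': [10, 20, 30]}, index=['a', 'b', 'c'])

positional = df.iloc[0:2]        # rows 0 and 1, end-exclusive
by_label = df.loc[['a', 'c']]    # rows picked by index label
```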
Almost forgot: HTML conditional formatting just made it into the latest release `0.17.1`, and it's pretty awesome. Use a function to your liking, or do it with a background gradient.
In [16]:
def color_negative_red(val):
    """
    Takes a scalar and returns a string with
    the css property `'color: red'` for negative
    values, black otherwise.
    """
    color = 'red' if val < 0 else 'black'
    return 'color: %s' % color

# Apply the function like this
heatmap.head(10).style.applymap(color_negative_red)
Out[16]:
In [17]:
heatmap.head(10).style.background_gradient(cmap="RdBu_r")
Out[17]:
You can group data (on either axis) based on some criterion. The result is iterable, but you can also apply a function directly without iterating through the groups.
Remember, though: if you apply a function directly without iterating, the result gets a new index based on what you grouped by.
pd.DataFrame.groupby
In [18]:
# No need to iter through to apply mean based on species
iris_species_grouped = iris.groupby('species')
iris_species_grouped.mean()
Out[18]:
In [19]:
# The previous iterator has reached its end, so re-initialize
iris_species_grouped = iris.groupby('species')
for species, group in iris_species_grouped:
    display(HTML(species))
    display(pd.DataFrame(group.mean(axis=0)).T)
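If iris.csv isn't available, the groupby-then-aggregate pattern looks like this on a toy frame (values are made up):

```python
import pandas as pd

df = pd.DataFrame({'species': ['a', 'a', 'b'],
                   'length': [1.0, 3.0, 10.0]})

# The grouping key becomes the index of the aggregated result
means = df.groupby('species').mean()
```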
In [20]:
# Select the four numeric columns by position
pd.DataFrame(iris.iloc[:, 0:4].apply(np.std, axis=0)).T
Out[20]:
In [21]:
def add_length_width(x):
    """
    Adds up the length and width of the features and returns
    a pd.Series object so that apply produces a pd.DataFrame
    """
    sepal_sum = x['sepal_length'] + x['sepal_width']
    petal_sum = x['petal_length'] + x['petal_width']
    return pd.Series([sepal_sum, petal_sum, x['species']],
                     index=['sepal_sum', 'petal_sum', 'species'])

iris.apply(add_length_width, axis=1).head(5)
Out[21]:
There's always a need for filtering. The obvious float and int comparisons are there, but some exceptional string filtering options are baked in as well.
Inside pd.DataFrame.loc, you can combine conditions with and (&), or (|), and not (~) as logical operators. This stuff works and is tested ;)
>, <, >=, <=
str.contains, str.startswith, str.endswith
In [22]:
iris.loc[iris.sepal_width > 3.5]
Out[22]:
In [23]:
iris.loc[(iris.sepal_width > 3.5) & (iris.species == 'virginica')]
Out[23]:
In [24]:
heatmap.loc[heatmap.index.str.contains("due|ver|ap")]
Out[24]:
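The same filters, sketched on a tiny stand-in frame so it runs without the data files (the labels and values are invented):

```python
import pandas as pd

df = pd.DataFrame({'width': [3.0, 3.8, 4.0],
                   'species': ['setosa', 'virginica', 'virginica']},
                  index=['due_one', 'ver_two', 'other'])

# Combine boolean conditions with & (and), | (or), ~ (not)
wide_virginica = df.loc[(df.width > 3.5) & (df.species == 'virginica')]

# Regex-style substring matching on the index labels
matches = df.loc[df.index.str.contains('due|ver')]
```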
There is a ton of stuff that can be done in Pandas. The online docs are super detailed and amazing. Explore, search, Stack Overflow it, and you'll most probably find what you're looking for. The current version of the docs (as of this talk) is Pandas v0.17.1.
Things that I can't cover because of the time constraints: