01: Building a pandas Cheat Sheet, Part 1

Use the CSV I've attached to answer the following questions.

Import pandas with the right name


In [1]:
# !workon dataanalysis
import pandas as pd


/Users/Monica/.virtualenvs/dataanalysis/lib/python3.5/site-packages/matplotlib/__init__.py:1035: UserWarning: Duplicate key in file "/Users/Monica/.matplotlib/matplotlibrc", line #2
  (fname, cnt))

Having matplotlib play nice with virtual environments

The matplotlib library has some issues when you’re using a Python 3 virtual environment. The error looks like this:

RuntimeError: Python is not installed as a framework. The Mac OS X backend will not be able to function correctly if Python is not installed as a framework. See the Python documentation for more information on installing Python as a framework on Mac OS X. Please either reinstall Python as a framework, or try one of the other backends. If you are Working with Matplotlib in a virtual enviroment see ‘Working with Matplotlib in Virtual environments’ in the Matplotlib FAQ

Luckily it’s an easy fix. Run this line in the terminal:

mkdir -p ~/.matplotlib && echo 'backend: TkAgg' >> ~/.matplotlib/matplotlibrc

This appends a line to matplotlib's configuration file (matplotlibrc) that sets the backend to TkAgg, which renders plots with Agg and shows them in a Tk window. Run it only once: >> appends, so a second run adds a second backend line, which triggers a "Duplicate key" warning like the one shown above.
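
To check which backend is actually in effect, matplotlib can report it from Python. This is a small sketch rather than part of the original notebook: it picks the non-interactive Agg backend in code (an assumption, made only so the snippet runs without a display) instead of editing matplotlibrc.

```python
import matplotlib

# Select a backend in code, before pyplot draws anything.
# "Agg" is an assumption for this sketch so it runs without a display;
# the matplotlibrc line above selects "TkAgg" instead.
matplotlib.use("Agg")

import matplotlib.pyplot as plt

# Report the active backend and the config file matplotlib loaded.
backend = matplotlib.get_backend()
print(backend)
print(matplotlib.matplotlib_fname())
```

Inside a notebook that uses %matplotlib inline you would normally skip the matplotlib.use call, since the inline magic picks its own backend.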

Set all graphics from matplotlib to display inline


In [29]:
import matplotlib.pyplot as plt
# Display matplotlib plots inline in the notebook instead of in a pop-up window
%matplotlib inline

Read the csv in (it should be UTF-8 already so you don't have to worry about encoding), save it with the proper boring name


In [30]:
df = pd.read_csv('07-hw-animals.csv')

In [32]:
df


Out[32]:
   animal        name  length
0     cat        Anne      35
1     cat         Bob      45
2     dog  Egglesburg      65
3     dog       Devon      50
4     cat     Charlie      32
5     dog    Fontaine      35
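
The attached CSV isn't reproduced in these notes, so here is a hedged stand-in: the six rows from Out[32] typed into a string and read through io.StringIO, which read_csv accepts in place of a file path. The csv_text name is made up for this sketch.

```python
import io

import pandas as pd

# Stand-in for 07-hw-animals.csv, typed from the rows shown in Out[32].
csv_text = """animal,name,length
cat,Anne,35
cat,Bob,45
dog,Egglesburg,65
dog,Devon,50
cat,Charlie,32
dog,Fontaine,35
"""

# read_csv takes a path or any file-like object.
# With a real file you could also pass encoding="utf-8" explicitly.
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (rows, columns)
```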

In [25]:
# Display the names of the columns in the csv

In [31]:
df.columns


Out[31]:
Index(['animal', 'name', 'length'], dtype='object')
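
Besides the column names, it often helps to see what type each column holds. A minimal sketch, assuming the same six rows as above rebuilt inline:

```python
import pandas as pd

# Inline copy of the rows shown in Out[32] (the CSV itself is not included here).
df = pd.DataFrame({
    "animal": ["cat", "cat", "dog", "dog", "cat", "dog"],
    "name": ["Anne", "Bob", "Egglesburg", "Devon", "Charlie", "Fontaine"],
    "length": [35, 45, 65, 50, 32, 35],
})

column_names = list(df.columns)  # plain Python list of the column labels
print(column_names)
print(df.dtypes)                 # one dtype per column
```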

Display the first 3 animals.


In [6]:
df.head(3)


Out[6]:
   animal        name  length
0     cat        Anne      35
1     cat         Bob      45
2     dog  Egglesburg      65

In [26]:
# Sort the animals to see the 3 longest animals.

In [8]:
df.sort_values('length', ascending=False).head(3)


Out[8]:
   animal        name  length
2     dog  Egglesburg      65
3     dog       Devon      50
1     cat         Bob      45
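
For the "top N by a column" case, pandas also has nlargest, which does the sort and the head in one call. A short sketch on the same rows, rebuilt inline:

```python
import pandas as pd

# Inline copy of the rows shown in Out[32].
df = pd.DataFrame({
    "animal": ["cat", "cat", "dog", "dog", "cat", "dog"],
    "name": ["Anne", "Bob", "Egglesburg", "Devon", "Charlie", "Fontaine"],
    "length": [35, 45, 65, 50, 32, 35],
})

# Equivalent to sort_values('length', ascending=False).head(3) for this data.
longest = df.nlargest(3, "length")
print(longest)
```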

In [27]:
# What are the counts of the different values of the "animal" column? a.k.a. how many cats and how many dogs.
# Only select the dogs.

In [10]:
(df['animal'] == 'dog').value_counts()


Out[10]:
True     3
False    3
Name: animal, dtype: int64
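
The cell above counts True/False for "is it a dog". To get a count per animal, and then just the dog rows, something like this sketch should work (same six rows, rebuilt inline; dogs_only is a name made up here):

```python
import pandas as pd

# Inline copy of the rows shown in Out[32].
df = pd.DataFrame({
    "animal": ["cat", "cat", "dog", "dog", "cat", "dog"],
    "name": ["Anne", "Bob", "Egglesburg", "Devon", "Charlie", "Fontaine"],
    "length": [35, 45, 65, 50, 32, 35],
})

# One count per distinct value in the 'animal' column.
counts = df["animal"].value_counts()
print(counts)

# Boolean mask inside df[...] keeps only the rows where the mask is True.
dogs_only = df[df["animal"] == "dog"]
print(dogs_only)
```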

In [28]:
# Display all of the animals that are greater than 40 cm.

In [12]:
df[df['length'] > 40]


Out[12]:
   animal        name  length
1     cat         Bob      45
2     dog  Egglesburg      65
3     dog       Devon      50
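
The same filter can also be written as a query string, which some people find easier to read. A sketch on the same rows, rebuilt inline:

```python
import pandas as pd

# Inline copy of the rows shown in Out[32].
df = pd.DataFrame({
    "animal": ["cat", "cat", "dog", "dog", "cat", "dog"],
    "name": ["Anne", "Bob", "Egglesburg", "Devon", "Charlie", "Fontaine"],
    "length": [35, 45, 65, 50, 32, 35],
})

# query() evaluates the expression against the column names.
long_animals = df.query("length > 40")
print(long_animals)
```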

'length' is the animal's length in cm. Create a new column called inches that is the length in inches.


In [46]:
# 1 cm is roughly 0.3937 in.
length_in = df['length'] * 0.3937

df['length (in.)'] = length_in
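
0.3937 is a rounded factor; an inch is defined as exactly 2.54 cm, so dividing by 2.54 gives slightly more precise values. A hedged sketch, using a column literally named inches (the name the prompt asks for, not the 'length (in.)' name used in this notebook):

```python
import pandas as pd

# Inline copy of the rows shown in Out[32].
df = pd.DataFrame({
    "animal": ["cat", "cat", "dog", "dog", "cat", "dog"],
    "name": ["Anne", "Bob", "Egglesburg", "Devon", "Charlie", "Fontaine"],
    "length": [35, 45, 65, 50, 32, 35],
})

CM_PER_INCH = 2.54  # exact by definition

# assign() returns a new DataFrame with the extra column added.
df = df.assign(inches=df["length"] / CM_PER_INCH)
print(df)
```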

Save the cats to a separate variable called "cats." Save the dogs to a separate variable called "dogs."


In [14]:
dogs = df[df['animal'] == 'dog']
cats = df[df['animal'] == 'cat']
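
If you plan to add or change columns on cats or dogs later, taking an explicit copy keeps pandas from warning about writing to a slice of df. A minimal sketch, with the rows rebuilt inline:

```python
import pandas as pd

# Inline copy of the rows shown in Out[32].
df = pd.DataFrame({
    "animal": ["cat", "cat", "dog", "dog", "cat", "dog"],
    "name": ["Anne", "Bob", "Egglesburg", "Devon", "Charlie", "Fontaine"],
    "length": [35, 45, 65, 50, 32, 35],
})

# .copy() makes each subset an independent DataFrame.
cats = df[df["animal"] == "cat"].copy()
dogs = df[df["animal"] == "dog"].copy()

# This changes only the copy; df itself is left alone.
cats["inches"] = cats["length"] / 2.54
print(cats)
```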

Display all of the animals that are cats and above 12 inches long. First do it using the "cats" variable, then do it using your normal dataframe.


In [15]:
cats[cats['length (in.)'] > 12]


Out[15]:
   animal     name  length  length (in.)
0     cat     Anne      35       13.7795
1     cat      Bob      45       17.7165
4     cat  Charlie      32       12.5984

In [16]:
df[(df['length (in.)'] > 12) & (df['animal'] == 'cat')]


Out[16]:
   animal     name  length  length (in.)
0     cat     Anne      35       13.7795
1     cat      Bob      45       17.7165
4     cat  Charlie      32       12.5984
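
Both routes should land on the same rows. The sketch below names each condition first (is_cat and is_long are made-up names) and checks that the two results match. Note that each comparison needs its own parentheses or variable, because & binds tighter than > and ==.

```python
import pandas as pd

# Inline copy of the rows shown in Out[32], plus the inches column.
df = pd.DataFrame({
    "animal": ["cat", "cat", "dog", "dog", "cat", "dog"],
    "name": ["Anne", "Bob", "Egglesburg", "Devon", "Charlie", "Fontaine"],
    "length": [35, 45, 65, 50, 32, 35],
})
df["length (in.)"] = df["length"] * 0.3937

# Route 1: filter the cats subset.
cats = df[df["animal"] == "cat"]
via_cats = cats[cats["length (in.)"] > 12]

# Route 2: combine two boolean masks on the full DataFrame.
is_cat = df["animal"] == "cat"
is_long = df["length (in.)"] > 12
via_df = df[is_cat & is_long]

print(via_cats.equals(via_df))
```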

What's the mean length of a cat?


In [17]:
# cats.describe() displays all stats for length

In [36]:
cats['length'].mean()


Out[36]:
37.333333333333336

In [18]:
# mean of every numeric column; numeric_only=True skips the text columns
cats.mean(numeric_only=True)


Out[18]:
length          37.333333
length (in.)    14.698133
dtype: float64

What's the mean length of a dog?


In [37]:
dogs['length'].mean()


Out[37]:
50.0

In [39]:
dogs['length'].describe()


Out[39]:
count     3.0
mean     50.0
std      15.0
min      35.0
25%      42.5
50%      50.0
75%      57.5
max      65.0
Name: length, dtype: float64

In [19]:
dogs.mean(numeric_only=True)  # skip the text columns


Out[19]:
length          50.000
length (in.)    19.685
dtype: float64

Use groupby to accomplish both of the above tasks at once.


In [51]:
df.groupby('animal')['length (in.)'].mean()


Out[51]:
animal
cat    14.698133
dog    19.685000
Name: length (in.), dtype: float64
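
The two questions above asked about length in cm, so here is a sketch of the same groupby on the 'length' column, with a row count added through agg. The summary name is made up for this example.

```python
import pandas as pd

# Inline copy of the rows shown in Out[32].
df = pd.DataFrame({
    "animal": ["cat", "cat", "dog", "dog", "cat", "dog"],
    "name": ["Anne", "Bob", "Egglesburg", "Devon", "Charlie", "Fontaine"],
    "length": [35, 45, 65, 50, 32, 35],
})

# One row per animal, one column per aggregate.
summary = df.groupby("animal")["length"].agg(["mean", "count"])
print(summary)
```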

Make a histogram of the length of dogs. I apologize that it is so boring.


In [21]:
dogs.plot(kind='hist', y='length (in.)')  # every bar is the same height :/


Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x112f33dd8>
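
Why the histogram is boring: with only three dogs, each one falls into its own bin. A sketch using numpy's histogram function on the dog lengths in cm (typed from the table, not read from the CSV) shows the counts without drawing anything:

```python
import numpy as np
import pandas as pd

# Dog lengths in cm, typed from the rows shown in Out[32].
dog_lengths = pd.Series([65, 50, 35], name="length")

# Three equal-width bins across the range of the data.
counts, edges = np.histogram(dog_lengths, bins=3)
print(counts)
print(edges)

# Ten bins is the default that the hist plot uses when bins is not given.
counts_default, _ = np.histogram(dog_lengths, bins=10)
print(counts_default)
```

Passing bins=3 to dogs.plot(kind='hist', ...) would give three touching bars instead of three thin ones with gaps.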

Change your graphing style to be something else (anything else!)


In [63]:
df.plot(kind="bar", x="name", y="length", color="red", legend=False)


Out[63]:
<matplotlib.axes._subplots.AxesSubplot at 0x1144baa90>

In [64]:
df.plot(kind="barh", x="name", y="length", color="red", legend=False)


Out[64]:
<matplotlib.axes._subplots.AxesSubplot at 0x1143acd30>
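
The cells above change the chart type and color. If "graphing style" means a matplotlib style sheet, a hedged sketch is below: it uses the built-in "ggplot" style inside a context manager so the change doesn't leak into later plots. The Agg line is an assumption so the snippet runs outside a notebook; drop it when using %matplotlib inline.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: off-screen backend, not needed in a notebook

import matplotlib.pyplot as plt
import pandas as pd

# Inline copy of the rows shown in Out[32].
df = pd.DataFrame({
    "animal": ["cat", "cat", "dog", "dog", "cat", "dog"],
    "name": ["Anne", "Bob", "Egglesburg", "Devon", "Charlie", "Fontaine"],
    "length": [35, 45, 65, 50, 32, 35],
})

# Apply the style only to plots created inside this block.
with plt.style.context("ggplot"):
    ax = df.plot(kind="bar", x="name", y="length", legend=False)
    ax.set_ylabel("length (cm)")

# Names of the style sheets shipped with the installed matplotlib.
print(plt.style.available)
```

plt.style.use("ggplot") does the same thing for the rest of the session.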

In [22]:
dogs


Out[22]:
   animal        name  length  length (in.)
2     dog  Egglesburg      65       25.5905
3     dog       Devon      50       19.6850
5     dog    Fontaine      35       13.7795

In [23]:
dogs.plot(kind='bar')


Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x113854f98>

In [24]:
# dogs.plot(kind='scatter', x='name', y='length (in.)')

Make a horizontal bar graph of the length of the animals, with their name as the label


In [66]:
df.columns


Out[66]:
Index(['animal', 'name', 'length', 'length (in.)'], dtype='object')

In [99]:
dogs['name']


Out[99]:
2    Egglesburg
3         Devon
5      Fontaine
Name: name, dtype: object

In [65]:
df.plot(kind='barh', x='name', y='length', legend=False)


Out[65]:
<matplotlib.axes._subplots.AxesSubplot at 0x1146881d0>

Make a sorted horizontal bar graph of the cats, with the larger cats on top.


In [66]:
cats.sort_values('length').plot(kind='barh', x='name', y='length', legend=False)


Out[66]:
<matplotlib.axes._subplots.AxesSubplot at 0x11479d6d8>
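
The ascending sort is what puts the biggest cat on top: barh draws the first row at the bottom of the axis and works upward. The sketch below reads the bar widths back from the axes to show the bottom-to-top order. The Agg line is an assumption so it runs outside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: off-screen backend, not needed in a notebook

import pandas as pd

# Inline copy of the rows shown in Out[32].
df = pd.DataFrame({
    "animal": ["cat", "cat", "dog", "dog", "cat", "dog"],
    "name": ["Anne", "Bob", "Egglesburg", "Devon", "Charlie", "Fontaine"],
    "length": [35, 45, 65, 50, 32, 35],
})
cats = df[df["animal"] == "cat"]

# Smallest first, so the smallest bar is drawn at the bottom.
ax = cats.sort_values("length").plot(kind="barh", x="name", y="length", legend=False)

# Bars are stored in row order, which is bottom to top on the chart.
widths = [bar.get_width() for bar in ax.patches]
print(widths)
```

Sorting descending and then calling ax.invert_yaxis() should give the same picture.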
