Unit 2 - Subsetting DataFrames and Series



In [ ]:

    
import pandas as pd



In [ ]:

    
df = pd.read_csv('../data/airbnb_input.csv', index_col='room_id')

On df DataFrame you can find data about Airbnb rooms in Lisbon.

Go ahead and preview it!



In [ ]:

    
# code to preview df

Row selection

With function iloc you can select specific rows from a DataFrame.
In order to specify the rows you want to select you can use, for instance, the row position (integer starting from 0) or a list of row positions.

Here, we are selecting the first row and getting it as a pandas Series:



In [ ]:

    
df.iloc[0]

But you can also get a single row as a DataFrame:



In [ ]:

    
df.iloc[[0]]

Now try to get rows 0, 1 and 3, using iloc:



In [ ]:

    
# code to select rows 0, 1 and 3 from df

Array slices also work, so try to select all the rows from position number 2 to position number 7:



In [ ]:

    
# code to select rows [2, 7] from df

With function loc you can select specific rows from a DataFrame, like with iloc.
The difference here is that you specify the rows to select using the rows' indexes.

Here, we're selecting the row that corresponds to room 17031 (with index 17031):



In [ ]:

    
df.loc[17031]

Now, if you select a row for an index that doesn't exist you'll get a KeyError:



In [ ]:

    
df.loc[[1]]

Try to select the rows that correspond to rooms 17031 and 25659:



In [ ]:

    
# code to select rooms 17031 and 25659

Now check what happens when you pass a boolean array to loc!



In [ ]:

    
# code to call loc with a boolean array

You can select columns using dot notation, like this:



In [ ]:

    
df.room_type

Or using brackets, like this:



In [ ]:

    
df['room_type']

Now, try to select column room_type from df, but getting the result as a DataFrame instead of a Series:



In [ ]:

    
# code to get column room_type from df as a DataFrame

And now try to select columns room_type and neighborhood:



In [ ]:

    
# code to select columns room_type and neighborhood

If you're feeling confident, here's a little challenge: select column room_type using loc!



In [ ]:

    
# code to select column room_type using loc

Mask function can be used to "hide" the rows that verify a certain condition:



In [ ]:

    
df.mask(df.overall_satisfaction == 5.0)

And with where function, you can hide the rows that don't verify a certain condition:



In [ ]:

    
df.where(df.overall_satisfaction == 5.0)

You can also use conditions inside brackets to filter data from the DataFrame
When you do it you get a DataFrame that has a different shape from the initial one
While with mask/filter, the DataFame shape is the same, and you just get NaNs in the cells that verify/don't verify the conditions

Here we're selecting only the rows of rooms in the Alvalade neighborhood:



In [ ]:

    
df[df.neighborhood == 'Alvalade']

Check DataFrame's data types by using the class attribute dtypes:



In [ ]:

    
df.dtypes

In order to select columns based on their data type, use the function select_dtypes:



In [ ]:

    
df.select_dtypes(include=['float64'])

To output your processed data to a file, pandas has many function to help you with that:

Check more here !

Here, we're exporting df to a csv file called airbnb_output.csv on the data directory of this unit:



In [ ]:

    
df.to_csv('../data/airbnb_output.csv')