We're going to use a real Kaggle competition data set to explore Pandas dataframes. Grab the rent.csv.zip file and unzip it.
In [1]:
import pandas as pd
df = pd.read_csv("data/rent.csv", parse_dates=['created'])
df.head(2)
Out[1]:
In [2]:
df.head(2).T
Out[2]:
In [3]:
df.info()
In [4]:
df.describe()
Out[4]:
In [5]:
df.price.sort_values(ascending=False).head(10)
Out[5]:
In [6]:
df.price.head(5) # attribute access; works only on the right-hand side of an assignment or in an expression
Out[6]:
In [7]:
df['price'].head(5)
Out[7]:
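Attribute access like df.price is just a read shortcut; creating a new column requires bracket syntax. A minimal sketch with a toy frame (the column names here are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({'price': [1000, 2000]})

print(toy.price.mean())   # attribute access is fine for reading: 1500.0

# Creating a NEW column must use brackets; attribute assignment would
# just set a Python attribute on the object, not add a column.
toy['discounted'] = toy['price'] * 0.9
print(list(toy.columns))  # ['price', 'discounted']
```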
In [8]:
prices = df['price']
avg_rent = prices.mean()
print(f"Average rent is ${avg_rent:.0f}")
In [9]:
df.latitude.min(), df.latitude.max()
Out[9]:
In [10]:
bybaths = df.groupby(['bathrooms']).mean()
bybaths
Out[10]:
In [11]:
bybaths = bybaths.reset_index() # groupby made 'bathrooms' the index; move it back to a regular column
bybaths.head(3)
Out[11]:
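The "quirk" is that groupby moves the grouping column into the index, and reset_index() turns it back into an ordinary column. A small sketch with toy data:

```python
import pandas as pd

toy = pd.DataFrame({'bathrooms': [1, 1, 2], 'price': [1000, 2000, 3000]})

g = toy.groupby('bathrooms').mean()
print(g.index.name)      # 'bathrooms' -- the grouping key became the index

g = g.reset_index()
print(list(g.columns))   # ['bathrooms', 'price'] -- back to ordinary columns
```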
In [12]:
bybaths[['bathrooms','price']] # print just num baths, avg price
Out[12]:
In [13]:
# get column
df['bedrooms'].head(5)
Out[13]:
In [14]:
df[['bedrooms','bathrooms']].head(3)
Out[14]:
In [15]:
# get index 3 row
df.iloc[3] # same as df.iloc[3,:]
Out[15]:
In [16]:
df.iloc[0:2] # first two rows
Out[16]:
In [17]:
df.iloc[0:2][['created','features']] # first two rows, show 2 columns
Out[17]:
In [18]:
df = df.set_index('listing_id')
df.head(3)
Out[18]:
In [19]:
df.loc[7150865]
Out[19]:
In [20]:
df = df.reset_index()
df.head(3)
Out[20]:
Pandas allows non-unique indexes:
In [21]:
df_beds = df.set_index('bedrooms')
df_beds.loc[3].head(3)
Out[21]:
In [22]:
df.isnull().head(5)
Out[22]:
In [23]:
df.isnull().any()
Out[23]:
Find all rows in data frame where description is missing:
In [24]:
df.description.isnull().head(5)
Out[24]:
In [25]:
df[df.description.isnull()].head(5)
Out[25]:
Another query: get all apartment rows with price above $1M:
In [26]:
(df.price>1000000).head(5)
Out[26]:
In [27]:
df[df.price>1000000]
Out[27]:
In [28]:
df[(df.price>1000) & (df.price<10_000)].head(3) # parentheses are required in query!!!!
Out[28]:
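The parentheses are required because Python's & binds more tightly than the comparison operators; without them the expression parses as a chained comparison and fails. A toy sketch:

```python
import pandas as pd

s = pd.Series([500, 1500, 5000, 20000])

# Correct: compare first, then combine the boolean masks with &
mask = (s > 1000) & (s < 10_000)
print(mask.tolist())  # [False, True, True, False]

# Without parentheses this parses as s > (1000 & s) < 10_000, a chained
# comparison, which tries to call bool() on a Series and raises ValueError.
try:
    bad = s > 1000 & s < 10_000
    raised = False
except ValueError:
    raised = True
print("raised:", raised)
```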
In [29]:
df.bathrooms.value_counts()
Out[29]:
In [30]:
import matplotlib.pyplot as plt
plt.xlabel('Num Bathrooms')
plt.ylabel('Num Apts')
plt.hist(df.bathrooms, bins=20)
plt.show()
In [31]:
plt.xlabel('Num Bedrooms')
plt.ylabel('Num Apts')
plt.hist(df.bedrooms, bins=20)
plt.show()
In [32]:
plt.xlabel('Price')
plt.ylabel('Num Apts at that price')
plt.hist(df.price, bins=45) # not useful since loooong right tail
plt.show()
In [33]:
import numpy as np
df_log = df.copy()
df_log["price"] = np.log( df["price"] )
#OR:
#df_log["price"] = df["price"].apply(np.log)
df_log.price.head(3)
Out[33]:
In [34]:
plt.xlabel('Price')
plt.ylabel('Num Apts at that price')
plt.hist(df_log.price, bins=45) # log scale tames the long right tail
plt.show()
In [35]:
bybaths.plot.line('bathrooms','price', style='-o')
plt.show()
In [36]:
# OR, can do directly
plt.plot(bybaths.bathrooms, bybaths.price, marker='o') # note slightly different arguments
plt.show()
Get the column of bathrooms from df
Iterate through list of columns and print them out
Get row 6 from df
Set the index of df to bedrooms then get all apartments with 3 bedrooms using the index
Get all rows where price > 100_000 per month
Drop column building_id from df
Get all rows where price is between 1000 and 2000.
Apply the np.log() function to the price and store it in a new column called log_price
Get columns latitude and longitude into their own dataframe
In [68]:
df.bathrooms                                # get the bathrooms column
for colname in df:                          # iterate through columns
    print(colname)
df.loc[5]                                   # get the row at index 5
df = df.set_index('bedrooms')               # index by bedrooms...
df.loc[3]                                   # ...then get all 3-bedroom apartments
df[df.price > 100_000]                      # price > $100,000/month
# del df['building_id'] also works, but modifies df in place
df = df.drop('building_id', axis=1)
df[(df.price>1000) & (df.price<2000)]
df['log_price'] = np.log(df.price)
df[['longitude','latitude']]                # lat/long into their own dataframe
In [38]:
df_clean = df[(df.price>1_000) & (df.price<10_000)]
In [39]:
plt.xlabel('Price')
plt.ylabel('Num Apts at that price')
plt.hist(df_clean.price, bins=45)
plt.show()
In [40]:
plt.scatter(df_clean.bedrooms, df_clean.price, alpha=0.1)
plt.xlabel("Bedrooms", fontsize=12)
plt.ylabel("Rent price", fontsize=12)
plt.show()
In [41]:
df_missing = df_clean[(df_clean.longitude==0) |
(df_clean.latitude==0)]
len(df_missing)
Out[41]:
In [42]:
# only 11 rows; filter them out (both coordinates must be nonzero, so use &)
df_clean = df_clean[(df_clean.longitude!=0) &
(df_clean.latitude!=0)]
Use & to get "and" whereas | is "or".
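When negating a combined condition, De Morgan's laws apply: dropping rows where either coordinate is zero means keeping rows where both are nonzero. A toy sketch:

```python
import pandas as pd

df_xy = pd.DataFrame({'longitude': [0.0, -73.9, 0.0, -74.0],
                      'latitude':  [40.7,  0.0, 0.0,  40.8]})

# Rows with a missing (zero) coordinate: either one is zero, so use |
missing = df_xy[(df_xy.longitude == 0) | (df_xy.latitude == 0)]

# Keep the rest: both coordinates must be nonzero, so use &
kept = df_xy[(df_xy.longitude != 0) & (df_xy.latitude != 0)]
print(len(missing), len(kept))  # 3 1
```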
Using GPS checker gives a rough outline for New York City of latitude, longitude 40.55, -74.1 on the lower left and 40.94, -73.67 on the upper right. Let's filter others out.
In [43]:
df_clean = df_clean[(df_clean['latitude']>40.55) &
(df_clean['latitude']<40.94) &
(df_clean['longitude']>-74.1) &
(df_clean['longitude']<-73.67)]
print(len(df_clean), len(df))
In [44]:
from matplotlib import colors
fig = plt.figure(figsize=(8,8))
ax = fig.add_subplot(111)
plt.scatter(df_clean.longitude,
df_clean.latitude,
alpha=0.8,
s=1.5, # marker size
vmin=1000, vmax=4000, # limit price range so 4000 is red
c=df_clean['price'],
cmap='rainbow', # color map
marker='.')
plt.colorbar()
plt.show()
In [45]:
df_train = df_clean[['bathrooms', 'bedrooms', 'longitude', 'latitude', 'price']]
df_train.head(5)
Out[45]:
In [46]:
X_train = df_train[['bedrooms','bathrooms','latitude','longitude']]
y_train = df_train['price']
In [47]:
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, oob_score=True)
rf.fit(X_train, y_train)
print(f"OOB R^2 score is {rf.oob_score_:.3f} (range is -infinity to 1.0; 1.0 is perfect)")
In [48]:
# pip install rfpimp
from rfpimp import *
I = oob_importances(rf, X_train, y_train)
I.plot(kind='barh', legend=False)
plt.show()
In [49]:
df['one'] = 1
df.head(3)
Out[49]:
Add random column of appropriate length
In [50]:
df2 = df.copy()
df2['random'] = np.random.random(size=len(df))
df2.head(2).T
Out[50]:
In [51]:
df2['i'] = [i for i in range(len(df))]
df2.head(2).T
Out[51]:
The data set has a features attribute (of type string) with a list of features about the apartment.
In [52]:
df.features.head(5)
Out[52]:
Let's create three new boolean columns that indicate whether the apartment has a doorman, parking, or laundry. Start by making a copy of the data frame because we'll be modifying it (otherwise we'll get error "A value is trying to be set on a copy of a slice from a DataFrame"):
In [53]:
df_aug = df[['bedrooms','bathrooms','latitude','longitude',
'features','price']].copy()
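A minimal sketch of why the copy matters: an explicit .copy() makes the new frame independent of the original, so assignments into it cannot trigger the warning (or silently change the original). The toy frame here is made up for illustration:

```python
import pandas as pd

df_toy = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

# An explicit copy is its own frame, not a view of df_toy,
# so modifying it is unambiguous and leaves the original alone.
sub = df_toy[['a']].copy()
sub['a'] = sub['a'] * 10

print(df_toy['a'].tolist())  # [1, 2, 3] -- original untouched
print(sub['a'].tolist())     # [10, 20, 30]
```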
Then we normalize the features column so that missing features values become blanks and we lowercase all of the strings.
In [54]:
# rewrite features column
df_aug['features'] = df_aug['features'].fillna('') # fill missing w/blanks
df_aug['features'] = df_aug['features'].str.lower() # normalize to lower case
Create the three boolean columns by checking for the presence or absence of a string in the features column.
In [55]:
df_aug['doorman'] = df_aug['features'].str.contains("doorman")
df_aug['parking'] = df_aug['features'].str.contains("parking|garage")
df_aug['laundry'] = df_aug['features'].str.contains("laundry")
del df_aug['features'] # don't need this anymore
df_aug.head(3)
Out[55]:
The other way to drop a column, besides del, is the drop() function:
In [56]:
df2 = df.drop('description',axis=1) # drop doesn't affect df in place, returns new one
df2.head(2).T # kill this column, return new df without that column
Out[56]:
Let's do some numerical feature engineering.
In [57]:
df_aug["beds_to_baths"] = df_aug["bedrooms"]/(df_aug["bathrooms"]+1) # +1 avoids division by zero
df_aug.head(3)
Out[57]:
Beyond our scope here, but let's retrain model to see if it improves OOB score.
In [58]:
df_clean = df_aug[(df_aug.price>1_000) & (df_aug.price<10_000)]
df_train = df_clean[['bathrooms', 'bedrooms', 'longitude', 'latitude',
'doorman', 'parking', 'laundry', 'beds_to_baths', 'price']]
X_train = df_train[['bedrooms','bathrooms','latitude','longitude',
'doorman', 'parking', 'laundry', 'beds_to_baths']]
y_train = df_train['price']
rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, oob_score=True)
rf.fit(X_train, y_train)
print(f"OOB R^2 score is {rf.oob_score_:.3f} (range is -infinity to 1.0; 1.0 is perfect)")
In [114]:
I = pd.DataFrame(data={'Feature':X_train.columns, 'Importance':rf.feature_importances_})
In [117]:
I.sort_values('Importance',ascending=False)
Out[117]:
That score is slightly better but not by much.
This is not general, but it works for a small set of categories:
In [59]:
df['interest_level'] = df['interest_level'].map({'low':1,'medium':2,'high':3})
In [60]:
df[['interest_level']].head(5)
Out[60]:
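A more general alternative (a sketch, not what this notebook does) is Pandas' categorical dtype, which assigns integer codes automatically; by default string categories are ordered alphabetically:

```python
import pandas as pd

s = pd.Series(['low', 'medium', 'high', 'low'])
cat = s.astype('category')

print(list(cat.cat.categories))  # ['high', 'low', 'medium'] (alphabetical)
print(cat.cat.codes.tolist())    # [1, 2, 0, 1] -- one integer code per row
```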
In [118]:
df['some_boolean'] = True
In [120]:
df['some_boolean'] = df['some_boolean'].astype('int8')
df.some_boolean.head()
Out[120]:
In [64]:
df['dayofweek'] = df['created'].dt.dayofweek # add dow column
df['day'] = df['created'].dt.day
df['month'] = df['created'].dt.month
df.head(1).T
Out[64]:
In [65]:
df.to_feather("/tmp/rent.feather")
In [66]:
%time df = pd.read_feather("/tmp/rent.feather")
Compare to loading the CSV, which is roughly 5x slower:
In [67]:
%time df = pd.read_csv("data/rent.csv")
Filter out all rows with more than 5 bathrooms and put the result back into the same dataframe
Show a scatter plot of bathrooms vs price
Show a heatmap whose color is a function of number of bedrooms.
Insert a column called avg_price that is the average of all prices
Using a dictionary, convert all original interest_level values to 10, 20, 30 for the low, medium, and high categories
Convert the manager_id column to be categorical rather than string type
Create a new column called day_of_year from the created field, with the day of year (1-365) for each row