Answer your own selection out of the following questions, or any other questions you might be able to think of. Write the question down first in a markdown cell (use a # to make the question a nice header), THEN try to get an answer to it. A lot of these are remarkably similar, and some you'll need to do manual work for - the GDP ones, for example.
If you are trying to figure out some other question that we didn't cover in class and it does not have to do with joining to another data set, we're happy to help you figure it out during lab!
Take a peek at the billionaires notebook I uploaded into Slack. It should be helpful for the graphs (I added a few other styles and options, too). You'll probably also want to look at the "sum()" line I added.
In [2]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
In [3]:
df=pd.read_excel('Billionaires.xlsx')
In [4]:
df=df[df['year']==2014]
In [5]:
df.columns
Out[5]:
In [6]:
countries=df.groupby('countrycode')['countrycode'].count().sort_values(ascending=False).head(10)
countries_df=pd.DataFrame(countries, columns=['countrycode'])
countries_df.rename(columns = {'countrycode':'count'}, inplace = True)
In [7]:
countries_df
Out[7]:
In [8]:
# this part i'm just looking up and manually adding
# i could probably have learned census.gov's terrible api to automate this, but for 10 countries this was faster
# populations are in thousands, so dividing by 1,000,000 converts them to billions
countries_df['pop_thousands']=[323996, 1373541, 142355, 80723, 205824, 1266884, 64430, 7167, 66836, 62008]
countries_df['billionaires_per_billion']=countries_df['count']/(countries_df['pop_thousands']/1000000)
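The hand-typed population list only lines up if it is entered in exactly the same order as the rows of `countries_df`. Here is a minimal sketch of a less fragile way to attach the numbers, assuming the index holds country codes: keep the populations in a dict keyed by code and let pandas look each one up. The codes, counts and populations below are made-up toy values, not the real 2014 figures. The sketch also spells out the unit conversion from thousands to billions.

```python
import pandas as pd

# toy stand-in for countries_df: index = country code, one 'count' column
# (codes and counts are invented for illustration only)
toy = pd.DataFrame({'count': [10, 4, 2]}, index=['AAA', 'BBB', 'CCC'])

# populations in thousands, keyed by code so the typing order doesn't matter
pop_thousands = {'CCC': 5000, 'AAA': 200000, 'BBB': 1000000}
toy['pop_thousands'] = toy.index.map(pop_thousands)

# thousands -> persons is *1,000 and persons -> billions is /1,000,000,000,
# so thousands -> billions is /1,000,000
toy['pop_billions'] = toy['pop_thousands'] / 1_000_000
toy['billionaires_per_billion'] = toy['count'] / toy['pop_billions']
print(toy)
```

A code that is missing from the dict shows up as NaN instead of silently shifting every other row, which makes typos easier to spot.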
In [9]:
countries_df
Out[9]:
In [10]:
# this is my beautiful bargraph. hong kong clearly has it going on
countries_df.sort_values(by='billionaires_per_billion', ascending=True).plot(kind='barh', y='billionaires_per_billion', title='Billionaires Per Billion Persons', legend=False)
Out[10]:
In [11]:
df.columns
Out[11]:
In [12]:
df[['name', 'networthusbillion']].sort_values(by='networthusbillion', ascending=False).head(10)
Out[12]:
In [13]:
df['networthusbillion'].describe()
Out[13]:
In [14]:
df[['networthusbillion', 'gender']].groupby('gender').describe()
Out[14]:
In [15]:
df[['name', 'networthusbillion']].sort_values(by='networthusbillion', ascending=True).head(10)
Out[15]:
In [16]:
# actually there are a lot more than 10 tied for poorest billionaires. here is the whole list!
df[df['networthusbillion']==1]['name']
Out[16]:
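The filter above hard-codes 1 as the lowest net worth. A small sketch of the same idea that looks the minimum up first, so it still works if the cutoff in the data is something else. The names and values are invented toy data.

```python
import pandas as pd

# toy data, invented for illustration
toy = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'networthusbillion': [1.0, 3.5, 1.0, 1.1],
})

# find the lowest net worth instead of hard-coding it
lowest = toy['networthusbillion'].min()

# everyone tied at that value
poorest = toy.loc[toy['networthusbillion'] == lowest, 'name']
print(lowest, list(poorest))
```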
In [17]:
# 81 tied for poorest. 102 tied for second place.
df['networthusbillion'].value_counts().sort_index().head(10)
Out[17]:
In [18]:
df.columns
Out[18]:
In [19]:
# relationship to company looks like job title for the most part. looks like you want to get in
# on the ground floor to make those B's
df['relationshiptocompany'].value_counts().head(10)
Out[19]:
In [20]:
df['sourceofwealth'].value_counts().head(10)
Out[20]:
In [21]:
df[['gender', 'sourceofwealth']].groupby('gender').describe()
Out[21]:
In [22]:
# this time i decided not to cheat and actually found and converted a gdp to country chart into a csv. here it is!
gdp=pd.read_csv('GDP.csv')
gdp
Out[22]:
In [23]:
networth=df[['name', 'citizenship', 'networthusbillion']]
networth=networth.rename(columns = {'citizenship':'Country'})
In [24]:
networth_gdp=pd.merge(networth, gdp, on='Country', how='left')
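A left merge on country names quietly leaves NaN wherever the two files spell a country differently. A minimal sketch of how one might check for that, using the `indicator` option of `pd.merge`. The frames below are toy stand-ins for `networth` and `gdp`, and the mismatched spellings are assumptions made up for the example, not something observed in the real files.

```python
import pandas as pd

# toy stand-ins for networth and gdp (values invented for illustration)
people = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'Country': ['United States', 'Russia', 'Hong Kong'],
})
gdp_toy = pd.DataFrame({
    'Country': ['United States', 'Russian Federation'],
    'GDP (millions USD)': [100, 50],
})

# indicator=True adds a _merge column; for a left join each row is
# either 'both' (matched) or 'left_only' (no GDP row found)
merged = pd.merge(people, gdp_toy, on='Country', how='left', indicator=True)
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Country'].unique()
print(sorted(unmatched))
```

Anything that turns up in `unmatched` can then be renamed in one of the two frames before merging again.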
In [25]:
# multiply by 100 so the column really is a percent and not a fraction
networth_gdp['percent_GDP']=100*(networth_gdp['networthusbillion']*1000000000)/(networth_gdp['GDP (millions USD)']*1000000)
In [26]:
# this doesn't EXACTLY answer the question, but I think it's more interesting anyway. top 20 people in the world by
# personal wealth as a percent of their own country's gdp
networth_gdp.sort_values(by='percent_GDP', ascending=False).head(20)
Out[26]:
In [27]:
total_worth=pd.DataFrame(networth_gdp.groupby('Country')['networthusbillion'].sum())
In [28]:
total_worth['total_worth']=total_worth['networthusbillion']*1000000000
In [29]:
total_worth.index
Out[29]:
In [30]:
total_worth['Country']=total_worth.index
In [31]:
total_worth=pd.merge(total_worth, gdp, on='Country', how='left')
In [32]:
total_worth['GDP (millions USD)']=total_worth['GDP (millions USD)']*1000000
In [33]:
total_worth['GDP']=total_worth['GDP (millions USD)']
In [34]:
del total_worth['GDP (millions USD)']
In [35]:
# multiply by 100 so the column really is a percent and not a fraction
total_worth['percent_by_billionaires']=100*total_worth['total_worth']/total_worth['GDP']
In [36]:
# here are the top ten countries by total billionaire net worth relative to gdp
total_worth[['Country', 'percent_by_billionaires']].sort_values(by='percent_by_billionaires', ascending=False).head(10)
Out[36]:
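The cells from the groupby through to the final ratio can be condensed quite a bit. This is a rough sketch under the same column-name assumptions (`Country`, `networthusbillion`, `GDP (millions USD)`), run on invented toy numbers. It converts billions to millions instead of converting both sides to dollars, and it reports a plain ratio in a hypothetical `share_of_gdp` column.

```python
import pandas as pd

# toy inputs, invented for illustration
people = pd.DataFrame({
    'Country': ['X', 'X', 'Y'],
    'networthusbillion': [2.0, 3.0, 1.0],
})
gdp_toy = pd.DataFrame({
    'Country': ['X', 'Y'],
    'GDP (millions USD)': [100_000, 50_000],
})

# as_index=False keeps Country as a regular column, ready to merge on
totals = people.groupby('Country', as_index=False)['networthusbillion'].sum()
totals = totals.merge(gdp_toy, on='Country', how='left')

# billions -> millions is *1,000, so both sides are in millions of USD
totals['share_of_gdp'] = totals['networthusbillion'] * 1_000 / totals['GDP (millions USD)']
print(totals.sort_values(by='share_of_gdp', ascending=False))
```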
In [37]:
# and now us vs india
total_worth[['Country', 'percent_by_billionaires']].loc[total_worth['Country'].isin(['India', 'United States'])]
Out[37]:
In [38]:
df.groupby("industry")['networthusbillion'].count().sort_values(ascending=False)
Out[38]:
In [39]:
df.groupby("industry")['networthusbillion'].sum().sort_values(ascending=False)
Out[39]:
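The count and the sum per industry can also be pulled in one pass with `agg`, which puts them side by side in one table. A small sketch on invented toy data; the industry names are placeholders.

```python
import pandas as pd

# toy data, invented for illustration
toy = pd.DataFrame({
    'industry': ['Retail', 'Tech', 'Retail', 'Tech', 'Tech'],
    'networthusbillion': [1.0, 2.0, 3.0, 4.0, 5.0],
})

# one row per industry, one column per statistic
summary = (toy.groupby('industry')['networthusbillion']
              .agg(['count', 'sum', 'mean'])
              .sort_values(by='sum', ascending=False))
print(summary)
```

Having the mean next to the count helps tell apart an industry with many billionaires from one with a few very rich ones.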
In [40]:
df.groupby('selfmade')['name'].count()
Out[40]:
In [41]:
# this looks like a fairly normal curve
df.groupby('age')['selfmade'].count().plot(kind='bar')
Out[41]:
In [43]:
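# seems like you generally get to be a billionaire faster if you're self made
# count a single column, otherwise every column gets its own bar at each age
df.loc[df['selfmade']=='self-made'].groupby('age')[['name']].count().plot(kind='bar', legend=False, title='self-made')
df.loc[df['selfmade']=='inherited'].groupby('age')[['name']].count().plot(kind='bar', legend=False, title='inherited')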
Out[43]:
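Two bar charts are hard to compare by eye, so a numeric summary of age per group may help back up the claim. A minimal sketch on invented toy rows, assuming the `selfmade` column uses the labels 'self-made' and 'inherited' as in the filters above.

```python
import pandas as pd

# toy data, invented for illustration; None stands in for a missing age
toy = pd.DataFrame({
    'selfmade': ['self-made', 'self-made', 'inherited', 'inherited', 'self-made'],
    'age': [45, 60, 70, None, 52],
})

# count skips missing ages; median and min summarise each group
age_summary = toy.groupby('selfmade')['age'].agg(['count', 'median', 'min'])
print(age_summary)
```

A lower median age for the self-made group would point the same way as the charts, though age in 2014 is not the same thing as the age at which someone first became a billionaire.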
In [55]:
# comparing a number to itself to get rid of NaNs (NaN is never equal to NaN) is a trick i learned from stackoverflow
# also this isn't a very good graph. nearly all of the billionaires at any age have 1 or close to 1 billion
df[['age', 'networthusbillion']].loc[df['age'] == df['age']].plot(kind='scatter', x='age', y='networthusbillion')
Out[55]:
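The self-comparison trick works because NaN is never equal to itself. A tiny sketch showing that it gives the same mask as the more explicit `notna()`, on a made-up Series of ages.

```python
import pandas as pd

# toy ages with one missing value (invented for illustration)
ages = pd.Series([50.0, float('nan'), 72.0])

# NaN == NaN is False, so the mask is False exactly where age is missing
self_compare = ages == ages

# the explicit spelling of the same mask
explicit = ages.notna()

print(self_compare.tolist())
print(self_compare.equals(explicit))
```

Either mask can go inside `.loc[...]`; `notna()` just says what it does on the tin.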