In [1]:
%matplotlib inline
import pandas as pd # data package
import matplotlib.pyplot as plt # graphics module
import datetime as dt # date and time module
import numpy as np # foundation for Pandas
import xlwt # package to create a new excel
import seaborn as sns # package for better graph look
import statsmodels.api as sm # package for statistical analysis
import statsmodels.formula.api as smf # statistical analysis with formula specification
from patsy import dmatrices # package for transforming matrix
Humans have always been ambitious about reaching the sky, so breaking the record for “the tallest skyscraper” has become a heated competition worldwide. For now, the title belongs to the Burj Khalifa, a megatall skyscraper in Dubai. A whimsical concept called the “skyscraper index”, proposed by Andrew Lawrence in 1999, has caught the attention of economists. According to Lawrence, “Business cycles and skyscraper construction correlate in such a way that investment in skyscrapers peaks when cyclical growth is exhausted and the economy is ready for recession”. For this project, we are interested in exploring the correlation between skyscraper construction and recession. Taking Lawrence’s idea a step further, Barclays Capital actually calculates its own skyscraper index and uses it for real estate and macroeconomic analysis.
Given that the Barclays Skyscraper Index is not publicly released, we decided to build the skyscraper index database ourselves, with information obtained from www.skyscraper.com and the Penn World Table. By exploring the topics discussed in the following sections, we want to answer the big question: is the skyscraper index a good indicator of recession?
The skyscrapercenter.com website contains useful information about skyscrapers around the world, including building name, city and country code, height, completion year, etc. Since the dataset is of manageable size, we transferred the data from the website into Excel (scraping did not work out in our case). Meanwhile, countries’ macroeconomic statistics (GDP, inflation, etc.) are easy to obtain.
As introduced above, we would like to explore the correlations between the skyscraper data and macroeconomic data through graphing and data analysis. For clearer presentation, we focus mainly on Japan. It is known that Japan entered a prolonged recession after its asset-price bubble of the 1980s burst. Also, its record of skyscraper construction is complete and easy to obtain. We assume that such a pronounced recession will give more explanatory power to the “skyscraper theorem” if it holds. Of course, using our data, we can easily analyze other countries by changing the country code.
We expect the “skyscraper theorem” to hold during recession periods: countries that built more skyscrapers may have experienced more severe recessions.
By working through the detailed topics, we expect to find that the skyscraper index is correlated with recession. Whether this correlation implies strong predictive power for future recessions, however, requires further research.
The skyscraper data come from the Skyscraper Center. The website is rendered with JavaScript, so the scraping method we learned in class does not work; we tried anyway but failed. However, the website allows us to download the data in PDF format, so we manually copied and pasted the data into Excel. The data are well tabulated, so the process was actually not complicated.
In [2]:
data_clean = 'C:\\Users\\lenovo\\Documents\\Data_Bootcamp\\Project\\data_clean.xlsx'
In [3]:
df1 = pd.read_excel(data_clean)
df1.head()
Out[3]:
We want to know, for each country, how many skyscrapers were built in each year. We found a package called 'xlwt' which helps us create a new Excel file and clean the data more efficiently. By saving our output into a new Excel file, we do not need to re-run the code every time we open the Jupyter notebook.
In [4]:
sheet_range = list(range(0, 5)) # There are 5 sheets in the file; we convert the range into a list for future use.
year_range = list(range(1909, 2018)) # The earliest skyscraper in our data was completed in 1909 in New York.
In [5]:
# Create a new excel to record the counts, using the xlwt package.
file = xlwt.Workbook()
table = file.add_sheet('sheet_1', cell_overwrite_ok=True)
# "cell_overwrite_ok=True" allows Python to overwrite the cells. In our practice we found it useful to prevent errors.
# Set the column names for the new excel, using the function table.write(row, column, text).
table.write(0, 0, 'Country')
table.write(0, 1, 'Year')
table.write(0, 2, 'Count')
The skyscraper data file has 5 sheets, each containing the skyscraper data for a specific country (Japan, Singapore, China, the US and the UAE). The idea is to go through all 5 sheets and, within each sheet, count the number of skyscrapers completed in each year. The same result could be achieved with the "groupby" function and boolean selection, but those concepts had not yet been introduced when we wrote this code, so we used two loops instead. The method works well, so we decided to keep the code. As mentioned, we focus on the Japanese data in this project, but we clean the data for all 5 countries in case of future use.
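With the "groupby" concepts introduced later in class, the same per-year counts can be computed without the two loops. A minimal sketch on a toy DataFrame (the column names 'Country' and 'Completed' follow our sheets; the values are made up):

```python
import pandas as pd

# Toy frame standing in for one sheet of the skyscraper data.
df = pd.DataFrame({'Country': ['Japan'] * 5,
                   'Completed': [1993, 1993, 1995, 1993, 1995]})

# One line replaces the two loops: count the rows in each completion year.
counts = df.groupby('Completed').size().rename('Count')
print(counts.loc[1993])  # 3 towers completed in 1993
```

The result is a Series indexed by year, which could be written out with to_excel instead of xlwt.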
In [6]:
row = 1
for sheetnumber in sheet_range:
    df1 = pd.read_excel(data_clean, sheetname=sheetnumber)  # Read a sheet, from the first one to the fifth.
    country = df1.iat[1, 2]  # Get the country name by reading the specific value in cell (2, 3).
    for year in year_range:
        Years = [year]
        year_data = df1[df1['Completed'].isin(Years)]  # Group all the skyscrapers that were built in the same year.
        count = len(year_data.index)  # Count how many skyscrapers were built in that year, using the length of the index.
        print(country, year, count)  # Show the result so we can check that it is correct.
        table.write(row, 0, country)  # Record the country, year and count in the new excel we created.
        table.write(row, 1, year)
        table.write(row, 2, count)
        row += 1
file.save('year_count_data.xls')
# Save the file. Next time we can use the data from this excel directly without re-running all the code.
The GDP data come from the Penn World Table. Here we clean the data for Japan as an example; for other countries, the same process can be repeated by changing the country code.
In [7]:
url = "http://www.rug.nl/ggdc/docs/pwt90.xlsx" # The url link from PWT
pwt = pd.read_excel(url, sheetname='Data', # Use data from the sheet named 'Data'
                    parse_cols=[0,1,2,3,4]) # Extract columns 1-5 of the data
In [8]:
pwt.head()
Out[8]:
So now we have a dataset that contains the country code, country name, currency_unit, year and real GDP.
In [9]:
pwt1 = pwt.loc[pwt['countrycode'] == 'JPN'] # Extract Japan's data using a boolean condition
pwt2 = pwt1.loc[pwt1['year'].isin(range(1970, 2017))] # Restrict the year range to 1970-2016
pwt2.head()
Out[9]:
We further extract Japan's data with a boolean statement, setting the years to range from 1970 to 2016. The number of skyscrapers built in each year is an annual change, so it makes more sense to compare it with the annual change of GDP, i.e., the GDP growth rate. See sections 2.3 and 2.4 for the calculation of the GDP growth rate and the combination of the data.
Again, we use Japan as an example to combine the skyscraper data and the GDP data into one dataset.
In [10]:
year_count_data = 'C:\\Users\\lenovo\\Documents\\Data_Bootcamp\\Project\\year_count_data.xls'
df2 = pd.read_excel(year_count_data)
df2.head()
Out[10]:
In [11]:
df_JPN = df2[df2['Country']=='Japan']
combo_1 = df_JPN.merge(pwt2, how='left', left_on='Year', right_on='year')
combo_1 = combo_1.rename(columns={'rgdpe':'GDP'})
combo_1.tail()
Out[11]:
In [12]:
combo_JPN = combo_1[['Country', 'Year', 'Count', 'GDP', 'currency_unit']]
combo_JPN.tail()
Out[12]:
There is a warning saying that "A value is trying to be set on a copy of a slice from a DataFrame" and suggesting a better way to do it. However, the current method works well, and we now have the clean data we need to continue the project.
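That message is pandas' SettingWithCopyWarning: assigning into a DataFrame that is a slice of another one may or may not affect the original. Taking an explicit copy is the usual fix; a minimal sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({'Country': ['Japan', 'China'], 'Count': [3, 7]})

# Selecting columns and then assigning can trigger the warning;
# an explicit .copy() makes the new frame independent of the original.
sub = df[['Country', 'Count']].copy()
sub.loc[0, 'Count'] = 99  # safe: modifies the copy only

print(df.loc[0, 'Count'])  # the original frame is untouched
```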
Since we have the cleaned PWT 9.0 data, we can calculate GDP growth by dividing each year's GDP by the previous year's GDP and subtracting 1. The loop gives us a list, which we convert into a Series in order to concatenate it with our DataFrames.
In [13]:
GDP_growth = []
for i in range(0, 109):
    if i == 0:
        c = 0  # Set 0 as the first year's GDP growth since we have no previous year's information; we are unlikely to use it anyway.
        GDP_growth.append(c)
    else:
        a = combo_JPN.loc[i-1, 'GDP']
        b = combo_JPN.loc[i, 'GDP']
        c = b/a - 1
        GDP_growth.append(c)
GDP_growth = pd.Series(GDP_growth)
In [14]:
GDP_growth.tail()
Out[14]:
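The loop works; pandas also offers a vectorized equivalent, Series.pct_change, which computes the same b/a - 1 year-over-year growth. A sketch with made-up GDP values:

```python
import pandas as pd

gdp = pd.Series([100.0, 110.0, 99.0], index=[2000, 2001, 2002])

# pct_change divides each value by the previous one and subtracts 1;
# the first year has no predecessor, so we fill it with 0 like the loop does.
growth = gdp.pct_change().fillna(0)
print(round(growth.loc[2001], 4))
```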
Combine the GDP_growth series with the combo_JPN DataFrame.
In [15]:
combo_JPN = pd.concat([combo_JPN, GDP_growth], axis=1, join_axes=[combo_JPN.index])
In [16]:
combo_JPN = combo_JPN.rename(columns={0: 'GDP_growth'})
combo_JPN = combo_JPN.set_index('Year')
combo_JPN.tail()
Out[16]:
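Note that the join_axes argument was removed from pd.concat in later pandas versions; assigning the positionally aligned values as a new column achieves the same effect. A sketch with hypothetical numbers:

```python
import pandas as pd

combo = pd.DataFrame({'Count': [1, 2]}, index=[2013, 2014])
growth = pd.Series([0.01, 0.02])  # default 0..n-1 index, aligned by position

# .values drops the series' index so assignment is by position, not label,
# avoiding any index mismatch between the two objects.
combo['GDP_growth'] = growth.values
print(combo.loc[2014, 'GDP_growth'])
```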
In order to better understand recessions, we obtain data from the IMF, specifically its World Economic Outlook database. Using selected variables related to GDP and inflation rates, we extract information for the countries relevant to our analysis.
In [17]:
url = 'https://www.imf.org/external/pubs/ft/weo/2016/02/weodata/WEOOct2016all.xls' # WEO data link
In [18]:
weo = pd.read_csv(url, sep='\t', # read the tab-separated file from the url
                  usecols=[1, 2] + list(range(19, 46)), # use the 2nd, 3rd and 20th-47th columns
                  thousands=',', # treat commas as thousands separators
                  na_values=['n/a', '--']) # treat these strings as NA values
Now we have a DataFrame called weo. It contains information for different countries from 1990 to 2016. This year range is sufficient, considering our purpose of doing a simple recession analysis. To continue, we extract the specific variables and countries that we are looking for.
In [19]:
variables1 = ['NGDPRPC','NGDPD','PCPIPCH','PCPIE'] # Define a list of subject codes
df = weo[weo['WEO Subject Code'].isin(variables1)] # Keep only these subjects
variables2 = ['JPN','USA','CHN','ARE','SGP'] # Define a list of country codes
df = df[df['ISO'].isin(variables2)] # Keep only these countries (filter df, not weo, so the first filter is not discarded)
df = df[0:-5] # Drop the trailing rows of additional info
pv2 = pd.pivot_table(df, index=['WEO Subject Code','ISO'], dropna=False, fill_value=0.0)
# Pivot the table into a (subject, country) by year view
pv2.columns = np.arange(1990, 2017) # Set the year range
palette = sns.color_palette("Paired", 10) # Set the palette style (seaborn was imported above)
sns.set_palette(palette)
pv2.loc['NGDPRPC', 'JPN'] # Test the data by getting Japan's GDP per capita
Out[19]:
Now the data are ready for the inflation and GDP analysis; see the next section for graphs.
Before we get into the analysis of the skyscraper index and recession, we want to show that the time period and countries we selected are valid for illustration. For simplicity, we focus mainly on Japan and the time frame 1990-2016. We choose Japan as our main research target for two reasons. On the one hand, Japan went through a great recession after its government tightened monetary policy during 1985-1990 and the market collapsed, so the signs of recession should be clear. On the other hand, Japan has a relatively complete record of skyscraper construction. We choose other countries such as the United States, China, Singapore and the United Arab Emirates as controls for comparison and illustration (the extraction of countries is shown in part 2.5).
In [20]:
pv2.loc['NGDPD'].T.plot(alpha=0.75, rot=45, title="GDP (current US$, billions)")
Out[20]:
This graph shows GDP in current US dollars across the different countries. It is clear that Japan (labelled in green) goes through a decline in GDP starting around 1995, while other countries had increasing GDP over the same period. The volatility of Japan's GDP is also relatively large, implying that the Japanese economy was not stable over time.
In [21]:
pv2.loc['PCPIPCH'].T.plot(alpha=0.75, rot=45, title="Inflation (% change)")
Out[21]:
This graph shows inflation. It can be seen that starting in 1990, Japan had a decreasing rate of change in CPI. Compared with other countries, it had a relatively long period of deflation, with its (negative) inflation rate lower than the other countries'. Japan seems to have been hit hard by this recessionary shock and found it hard to recover.
In [22]:
plot_data = pv2.loc['NGDPD'].T.reset_index() # focus on the GDP data
plot_data.rename(columns={'index':'Years'}, inplace=True) # rename the index column
melt_data = pd.melt(plot_data, id_vars=['Years'], var_name='Country') # x value: Years
melt_data.rename(columns={'value':'GDP'}, inplace=True) # y value: GDP
sns.lmplot(x="Years", y="GDP", hue="Country", data=melt_data, palette="Set1")
Out[22]:
By fitting a linear regression, we see that the regression line of Japan's GDP over the years is relatively flat. After the fluctuations around 1995, the downward-trending GDP restrained Japan's economic growth, while other countries grew faster.
In [23]:
pv2T = pv2.T # Transpose
f, ax = plt.subplots(figsize=(16, 5)) # Set the figure size
sns.barplot(x='Years', y='JPN', data=plot_data, label="JPN", color=sns.xkcd_rgb["lime"])
ax.legend(ncol=2, loc="upper left", frameon=True) # Set the legend
ax.set(xlim=(-1, 28), ylabel="GDP", xlabel="") # Set the labels
plt.xticks(rotation=90) # Rotate the year ticks for a better look
sns.despine(left=True, bottom=True) # Remove the spines
plt.show()
Finally, we plot a bar chart showing Japan's GDP from 1990 to 2016. With this clearer picture, we will go into detail about whether the skyscraper index relates to recession in the following section.
The data we use for section 3.2 start from 1980, several years before the recession happened.
In [24]:
combo_JPN_subset = combo_JPN[combo_JPN.index.isin(range(1980, 2015))]
combo_JPN_subset.head()
Out[24]:
We multiply GDP growth by 1000 and divide GDP by 100,000 so that they can be shown on the same graph as the skyscraper counts. By observing the graph we can get a basic idea of the relationship between the skyscraper index and GDP.
In [25]:
fig, ax = plt.subplots()
x1 = combo_JPN_subset['GDP']/100000
x2 = combo_JPN_subset['GDP_growth']*1000
combo_JPN_subset['Count'].plot(ax=ax, label='Skyscrapers Built in Each Year')
x1.plot(ax=ax, label='GDP, 100 Thousand Yen')
x2.plot(ax=ax, label='GDP Growth, 1000%')
ax.set_title('GDP, Growth and Numbers of Skyscrapers Built', fontsize=16, loc='left')
ax.legend(loc='best')
Out[25]:
Since skyscraper count, GDP and GDP growth are measured in different units (1 for count, 100 thousand for GDP and 1000% for growth), comparing them value by value is not meaningful. Still, the graph suggests some correlation between the skyscraper index and GDP growth, because in some years they move up and down together; between 2000 and 2005, for example, the co-movement is very strong. This is only visual intuition, however, so we cannot make the argument without further quantitative analysis.
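One way to avoid rescaling by hand is a secondary y-axis (matplotlib's twinx), so counts and growth keep their natural units. A minimal sketch with made-up numbers:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

years = [2000, 2001, 2002]
count = pd.Series([3, 5, 2], index=years)
growth = pd.Series([0.02, -0.01, 0.01], index=years)

fig, ax1 = plt.subplots()
count.plot(ax=ax1, color='steelblue')   # counts on the left axis
ax2 = ax1.twinx()                       # second y-axis sharing the same x
growth.plot(ax=ax2, color='firebrick')  # growth on the right axis
ax1.set_ylabel('Count')
ax2.set_ylabel('GDP growth')
fig.savefig('count_vs_growth.png')
```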
Correlation analysis is the simplest model, but it may offer useful information. If the “skyscraper theorem” holds, there should be a negative correlation between skyscraper count and GDP growth in the long run. Using the data from 1980 to 2014, the correlation is -0.27. This negative correlation broadly meets our expectation; at the same time, its absolute value is not large, indicating that the correlation is not that strong.
In [26]:
combo_JPN_subset['Count'].corr(combo_JPN_subset['GDP_growth'])
Out[26]:
We also notice that the correlation varies when we split the data into subsets. Here we split the data into pre-crisis (1980-1985), crisis (1986-1991) and post-crisis (1992-2014) periods. We see a strong positive correlation during the pre-crisis period, which may result from Japan's rapid economic growth before the crisis. During the crisis and post-crisis periods, the correlations between skyscraper count and GDP growth are weak.
In [27]:
combo_JPN_subset_1 = combo_JPN[combo_JPN.index.isin(range(1980, 1986))] # range end is exclusive, so this covers 1980-1985
combo_JPN_subset_1['Count'].corr(combo_JPN_subset_1['GDP_growth'])
Out[27]:
In [28]:
combo_JPN_subset_2 = combo_JPN[combo_JPN.index.isin(range(1986, 1992))] # covers 1986-1991
combo_JPN_subset_2['Count'].corr(combo_JPN_subset_2['GDP_growth'])
Out[28]:
In [29]:
combo_JPN_subset_3 = combo_JPN[combo_JPN.index.isin(range(1992, 2015))] # covers 1992-2014
combo_JPN_subset_3['Count'].corr(combo_JPN_subset_3['GDP_growth'])
Out[29]:
These numbers make sense to us, but correlation cannot show the predictive power of the skyscraper index towards recession. Thus, further analysis is needed.
Since the correlation results do not seem to have much explanatory power, we take a step further and run a simple regression. We set up a linear regression model, with the independent variable (x) being the count and the dependent variable (y) being GDP growth.
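A simple first step toward testing predictive power is to lag the count series with Series.shift and correlate last year's construction with this year's growth. A sketch on toy data (the values are made up):

```python
import pandas as pd

count = pd.Series([1, 4, 2, 5, 3], index=range(2000, 2005))
growth = pd.Series([0.02, 0.01, -0.01, 0.02, -0.02], index=range(2000, 2005))

# shift(1) moves counts forward one year, pairing each year's growth
# with the previous year's construction count; NaN pairs are dropped.
lagged_corr = count.shift(1).corr(growth)
print(round(lagged_corr, 3))
```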
In [30]:
combo_JPN_subset1 = combo_JPN_subset[['Count','GDP_growth']] # Extract the two variables we need
combo_JPN_subset1.tail()
Out[30]:
In [31]:
# Now transform the variables into matrices for clearer presentation
y, X = dmatrices('GDP_growth ~ Count', data=combo_JPN_subset1, return_type='dataframe')
print(y.head()) # the y variable is GDP_growth
print(X.head()) # the X variable is Count (plus an intercept)
In [32]:
mod = sm.OLS(y, X) # set up the OLS estimation
res = mod.fit() # fit the model
print(res.summary()) # print the OLS results
This regression result gives us more insight into our hypothesis. First, the independent variable (Count) has a negative coefficient that is significant at the 95% confidence level. That is to say, the skyscraper index is indeed negatively related to GDP growth in Japan from 1980 to 2014. However, the R-squared and adjusted R-squared are extremely low. This makes sense, since we regress on only one explanatory variable while many other factors influence GDP growth. Another thing we notice is the Durbin-Watson statistic: the value 0.832 lies between 0 and 2, indicating positive autocorrelation. That is to say, there is probably a time lag in the count series. It can also be conjectured that when the economy is generally good, more buildings are started, while a recession might lead to a decrease in skyscraper construction in the following years.