The main dataset was the NYC Street Tree Census data from 2015, the result of a community survey carried out mainly by volunteers, cataloging all street trees in NYC. As secondary datasets we used the Street Tree data from 1995 and 2005. Moreover, we used air pollution data for New York City in order to understand the influence of trees on air quality, and we started analyzing the "311" dataset to explore complaints regarding trees. The Street Tree dataset was chosen because it could give new insights and perspectives on urban planning, reveal the status of the trees (how healthy they are, whether people are taking care of them, etc.), and show whether they influence the quality of life in the city. Moreover, we hoped to discover facts that most people would probably not be aware of beforehand.
The goal was to enlighten users about trees in NYC. Are certain types of trees more suitable for streets than others? Where are they located? Is it possible to know which kind of tree you might encounter based on the location, the health of the tree, the diameter, or even the number of problems it has? From this project it should be possible to learn something new about a topic you might never have considered exploring.
There were some outliers in the dataset which had to be removed to get useful results. One example was the latitude/longitude coordinates, which contained an extreme outlier far outside NYC.
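To make this cleaning step concrete, below is a minimal sketch of how such coordinate outliers can be filtered out with pandas; the bounding-box values are approximate NYC limits, not the exact thresholds we used.
In [ ]:
#A minimal sketch of the coordinate outlier filtering; the bounds are
#approximate NYC limits and not the exact values used in the project
import pandas as pd
tree_data = pd.read_csv('2015_tree_data_updated.csv')
in_nyc = (tree_data['Latitude'].between(40.4, 41.0) &
          tree_data['Longitude'].between(-74.3, -73.6))
print(len(tree_data) - in_nyc.sum(), "rows dropped as coordinate outliers")
tree_data = tree_data[in_nyc]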
Regarding the 311 dataset, we selected only the 2015 data and the complaints regarding trees, as these were the only records relevant for our domain.
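A hedged sketch of this selection is shown below; the file name and the column names ('Created Date', 'Complaint Type') follow the public NYC 311 export and are assumptions to be checked against the actual download.
In [ ]:
#Hypothetical sketch of the 311 filtering; file name and column names
#are assumptions based on the NYC 311 export format
import pandas as pd
complaints = pd.read_csv('311_service_requests.csv', parse_dates=['Created Date'])
is_2015 = complaints['Created Date'].dt.year == 2015
about_trees = complaints['Complaint Type'].str.contains('Tree', case=False, na=False)
tree_complaints = complaints[is_2015 & about_trees]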
For the air pollution there was data for all the community districts, but only for some of the neighbourhoods. The measurements were mean percentiles. We took the mean values for the community districts and assigned them to the corresponding borough. This was because the neighbourhood names in this dataset and in our own dataset were so different that it was very difficult to figure out which neighbourhoods were the same in the two sets.
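The aggregation itself is a simple groupby; the sketch below assumes a hypothetical dataframe with one mean-percentile measurement per community district.
In [ ]:
#Sketch of the borough aggregation; the file and column names are
#hypothetical stand-ins for the air pollution data we used
import pandas as pd
air = pd.read_csv('air_pollution.csv')
borough_means = air.groupby('Borough')['MeanPercentile'].mean()
print(borough_means)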
When taking a first glance at the dataset it was a bit overwhelming, as it is huge and contains a lot of rows that do not necessarily make sense at first glance, as well as a number of variables that are not particularly interesting or necessary for what we wanted to do. Each variable was carefully examined, and the variables deemed unnecessary were excluded. Among these was "Tree_Id", a unique ID for each tree; however, the IDs were only unique within each of the three datasets (1995, 2005, 2015), meaning it was not possible to join the datasets on this ID, rendering it irrelevant. Other excluded variables were address information, since multiple variables delivered address information on different levels, and it was not relevant to distinguish between all of these.
It was decided to focus only on the top 20 tree species: since there were a lot of different species without a significant number of observations, it would be difficult to describe them all properly, and it would also be very difficult to make good predictions on sparse observations. For some machine learning tools we focused only on the top 10 or top 5 species, because the data was too sparse above this limit.
There were a lot of trees without a species listed, and those were disregarded completely. The dead trees were also excluded from the dataset. A sketch of this filtering is shown below.
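A minimal sketch of this filtering, under the assumption that dead trees are the rows without a health status, could look as follows.
In [ ]:
#Minimal sketch of the species filtering described above; the assumption
#that dead trees are the rows without a health status should be verified
import pandas as pd
tree_data = pd.read_csv('2015_tree_data_updated.csv')
tree_data = tree_data.dropna(subset=['Spc_Common'])   #drop trees without a species listed
tree_data = tree_data.dropna(subset=['Health'])       #dead trees carry no health status (assumption)
top20 = tree_data['Spc_Common'].value_counts().head(20).index
tree_data = tree_data[tree_data['Spc_Common'].isin(top20)]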
It was considered to focus on only one of the five boroughs in NYC to get a more detailed view. This was not implemented, since it was deemed more interesting to show the differences between the boroughs as well.
The final dataset "Street Tree Data 2015" consists of 534,514 tree observations and 21 variables/features, totalling 74.5 MB. The selected features were:
Amount of trees in each borough:
In general, the top 20 species were the same for the 5 boroughs, but the order of the "top 20" list differed. Manhattan had more trees with general problems as well as more unhealthy trees.
We computed a lot of Pearson correlations among the different variables, only to find that few of them were correlated. In the end, we did find some correlation between the air quality, the amount of trees, the tree diameter, and the health states, as sketched below.
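As an illustration, the sketch below computes such a correlation matrix with pandas; the per-district summary file and its column names are hypothetical stand-ins for the aggregates we used.
In [ ]:
#Sketch of the pairwise Pearson correlations; the summary file and its
#column names are hypothetical stand-ins for the per-district aggregates
import pandas as pd
per_district = pd.read_csv('per_district_summary.csv')
cols = ['TreeCount', 'MeanDiameter', 'AirPollution', 'GoodHealthShare']
print(per_district[cols].corr(method='pearson'))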
But let's start looking at the main dataset, the 2015 Street Tree Census (https://data.cityofnewyork.us/Environment/2015-Street-Tree-Census-Tree-Data/uvpi-gqnh).
Multiple secondary datasets were inspected, e.g. the 311 dataset and the air pollution dataset, as well as the Street Tree datasets from 2005 and 1995. In the 311 set there were several complaints about trees in NYC, but no significant correlations were found. It was hoped that certain types of complaints would correlate with different problems or with the health of the trees, but unfortunately data does not always behave as hoped or suspected, and patterns cannot (and should not) be forced to appear.
One could also be inclined to wonder whether more "green" areas, meaning areas with a lot of trees, have higher house prices. Again, after investigation, this turned out to be challenging, since there is not a lot of information about house prices available - at least not on a neighbourhood level.
It was also considered whether there was a correlation between the trees/features of the trees and the air pollution. This dataset was used for simple linear regression.
For the different maps, a few other datasets have been included in the shape of geojson files. These include the data needed for drawing the d3 maps (polygons) as well as basic information about the parts of the city they represent, such as borough, community district, etc., which was used in combination with our own data from the Street Tree dataset to produce interactive maps. The geojsons used can be found and downloaded at https://github.com/cecli/cecli.github.io/tree/master/data/geojson.
When doing predictions it can be difficult to find the appropriate tools to use. Different tools have different qualities and it all depends on the data and the patterns in your data. In this project, different tools have been tried out, typically multiple tools for the same prediction to inspect the model performance of each tool.
KNN is a tool that is rather easy to grasp and implement. It was chosen for predicting the health of a tree based on GPS coordinates, as well as for predicting species based on GPS coordinates. An argument for KNN being an appropriate choice is that when planting trees, one would be inclined to plant the same species together. One could also expect unhealthy trees to cluster in the same area, presumably because of a disease in the area, a pollution problem, soil problems, or something completely different. A drawback of the KNN method is that when dealing with an unbalanced dataset it will favour the most frequently occurring class.
Decision trees can often be a good choice because they are easy to visualize. A drawback is that they tend to overfit the training data. In spite of this drawback, they were used for predicting health based on GPS coordinates and species based on GPS coordinates, as well as for predicting the tree species based on location and the diameter based on species and location. When predicting species, different features were added, e.g. the diameter, to see if they contributed to the predictions. The main reason was to compare with the other results: if the decision trees did not overfit and still performed well, they would be nice to visualize. To mitigate the overfitting issue, a random forest was also tried out.
Decision trees were also used to predict diameter based on species and problems, as well as predicting diameter based on the number of problems. Here, the diameter was binned into bins of varying sizes (1-10, 10-15, 15-20, ..., 45-50, 50-60, 60-70, ..., 90-100, 100-150, 150-200, ...).
As a third tool, Support Vector Machines were tried out. SVMs can do linear classification by creating a maximum-margin separating hyperplane between classes. They can also do non-linear classification using the so-called kernel trick, where inputs are mapped into a high-dimensional feature space. This was used to predict health based on GPS coordinates.
Apriori is an algorithm for frequent itemset mining. It was used to inspect which problems appear together in the same observation, and we found that some problems do indeed co-occur in the same trees. [We used the Apyori package: https://pypi.python.org/pypi/apyori/1.1.1]
Linear regression was used to inspect correlations between different features; it is not really a machine learning tool as much as a tool for investigating linear relationships. It was used to predict air pollution based on the amount of trees as well as their diameter. Apart from standard linear regression, Lasso and Elastic Net were used, as sketched below.
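A hedged sketch of that setup is shown below; the summary dataframe and its column names are hypothetical stand-ins for the per-district data, and the regularization strengths are illustrative.
In [ ]:
#Hedged sketch of the regression setup; the summary file, its column names
#and the alpha values are hypothetical, not the project's exact choices
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Lasso, ElasticNet
from sklearn.model_selection import cross_val_score
per_district = pd.read_csv('per_district_summary.csv')
X = per_district[['TreeCount', 'MeanDiameter']].values
y = per_district['AirPollution'].values
for model in (LinearRegression(), Lasso(alpha=0.1), ElasticNet(alpha=0.1)):
    scores = cross_val_score(model, X, y, cv=5, scoring='r2')
    print(type(model).__name__, np.mean(scores))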
When selecting appropriate models, the first step is to split the data into a training set and a test set. When predicting health (or species) based on GPS coordinates, a test set consisting of 15% of the total number of observations was used. Hereafter the training set was "split" into a training set and a validation set using 5-fold and 10-fold cross-validation. The best model was chosen based on accuracy scores, with computation time taken into account as well. For the KNN, different values of $k$ were tried, ranging over $k=2,\ldots,10$. The upper limit was set to 10 because we did not expect whole areas of unhealthy trees, and larger neighbourhoods might just confuse the predictions.
For predicting species and health based on GPS coordinates, KNN was selected as the best model. SVM simply took a significant amount of time to run, making it difficult to fine-tune and handle, and decision trees overfitted the training data and were not good at handling sparse data.
For predicting species, the KNN classifier predicted $51.7$% accurately for $k=4$ on the test data, whereas the average (over the 5 folds) performance on the validation set was $49.9$% (compared to $49$% for decision trees). This is considered rather good, taking into account that it is labelling $20$ different species, but it also suggests that the same species is not always planted next to each other - in fact, less often than we expected before investigating the data. In comparison, when only trying to predict the top 5 species instead of all 20, the average performance on the validation set was $70.4$% for $k=4$. This also confirms that the same species are not always planted next to each other, and shows, as expected, that the model performs better when addressing fewer species.
For predicting health, the KNN classifier predicted $80.7$% accurately for $k=5$ on the test data, whereas the average (over the 5 folds) performance on the validation set was $80.4$% (compared to $74.8$% for decision trees). An accuracy of $80.7$% is rather good considering the sparsity of the "fair" and "poor" tree observations. The KNN handled the sparse classes better than the decision tree classifier: when it labeled a tree as "fair" it was right around $43$% of the time (on the validation set for $k=5$), while for the "poor" class it only predicted around $1/3$ correctly. It was, as we expected, much better at predicting the good trees, which makes sense since there was a lot more training data available for them. In comparison, the decision tree classifier was right around $31$% of the time for the "fair" trees, and $17$% for the "poor" trees. The number of misclassifications of "poor" and "fair" trees suggests that the condition of a tree does not really reflect on its neighbours and is most likely caused by other, individual factors.
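The per-class percentages quoted above can be read off a confusion matrix; the sketch below assumes the health classifier `knn` and the test split fitted in the cells further down in this notebook.
In [ ]:
#Sketch of how the per-class percentages can be read off a confusion
#matrix; `knn`, `X_test` and `y_test` refer to the health classifier
#fitted in the cells further down
import numpy as np
from sklearn.metrics import confusion_matrix
y_pred = knn.predict(X_test)
cm = confusion_matrix(y_test, y_pred, labels=['Good', 'Fair', 'Poor'])
print(cm)  #rows: true class, columns: predicted class
#share of correct predictions per predicted label (diagonal over column sums)
print(np.diag(cm) / cm.sum(axis=0).astype(float))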
The decision trees performed almost like the random forest in our case. When predicting species, we obtained a maximum accuracy of 0.5 with the top 5 species; with the top 10 and top 20 the accuracy was even lower, which made us conclude it was not a good model. Moreover, in the graphviz visualization we saw that it was predicting just two species, locust and pin oak, so it was definitely not a good tool for our case.
The Apriori algorithm found some associations among the problems, e.g. stones around the roots. The scores were not that good, but they improved when taking into account only the trees with at least one problem or the trees with exactly one problem. This is understandable, because some of the trees might be new, which can also be seen from their diameter, so it is possible that they do not have many problems yet.
Elastic Net and Lasso are generally good tools, but we lacked variables and data points (at most 188), which may be the reason behind the non-optimal performance.
The initial histogram shows all the trees in New York City. It can clearly be seen that some species are present in very low numbers. That is why we focused on a subset of species, mostly the top 20 (for all the analysis), and sometimes the top 10 or top 5. The pie chart then shows the distribution of the top 20 species, with the possibility of hovering over a slice for a clearer explanation of it.
The d3 bar chart shows the distribution of the 5 most common species for each borough, as well as for NYC as a whole. This illustrates the borough-wise differences regarding the most common trees. E.g. the London planetree is the most common tree in NYC, and it is also the most common tree in Queens. Queens is the largest area in NYC both regarding trees and size, as shown in the pie charts. Therefore one might expect that most of NYC's London planetrees are located in Queens, but as can be seen from the chart, most of them actually come from Brooklyn. The chart also has the option of changing the year from 2015 to either 2005 or 1995, thereby showing the differences over time as well.
This chart is important for the project since it enables the user to view the distribution of the most common trees in NYC on borough level for the three different years available: 1995, 2005, 2015. This helps us show the changes in street trees in NYC over time.
We have also appended a plot made in Python (using GeoPlotLib) in order to give the user an overview of how the distribution/amount of trees has changed over the years.
This page enables the user to hover over pictures of leaves for the 10 most common species to see the species name, and then when clicking a leaf image, fun facts about these trees appear.
This function is important for the project since it sets each of the trees in a context: What is it called? What is its ranking? How many trees are there of this species? What is special/interesting about the species? Why is it a good street tree compared to others?
A basic histogram and a pie chart help the user understand the health distribution of the trees. It can clearly be seen that there are far more healthy trees than poor ones, while the histogram gives more details about the various species, possibly revealing some interesting ones.
The d3 map enables the user to view a prediction of the health of the NYC street trees, using KNN as a classifier. It is possible to hover over the individual boroughs for details, and to switch between visualizing the good, fair, and poor trees. When hovering over a borough, a tooltip is displayed, showing the borough name, the borough size, the number of trees in that borough broken down by species, and finally the percentages of 'good' and 'poor' trees in that borough. The boroughs are colored according to the number of trees present there.
The map is an important visualization for the project since it shows the location of the trees with regard to their health. Because of the large number of data points, visualizing all three health states at the same time would have created too confusing a picture, which is why it has been split up into 'good' and 'poor', with the 'fair' trees available to append at the click of a button.
The site includes three scatterplots. The first scatterplot visualizes the most common problems trees can have according to the dataset. The problems visualized are split into three categories: 'Trunk', 'Root' and 'Branch'. These problems are caused by humans, such as trees growing into phone lines, wires around the trunk, or stones over the roots. Further definitions can be found in the dataset manual in the link provided.
The second and third scatterplots show the correlation between the amount of trees, the diameter, and the air pollution in each of the neighbourhoods. We took the community district measurements and assigned the mean to each neighbourhood. The user can explore each neighbourhood by hovering over the map (which actually shows community districts) and the corresponding points in the scatterplot (and vice versa). The scatterplot colors are per borough, so the user can explore the amount of air pollution in each area at various levels of detail.
References for inspiration for the d3 visualization (scatterplot):
When working with the project, two things became clear to us: 1) real-life data is messy; 2) data does not care what you think of it.
It is possible to have good intentions and a lot of good ideas about what to do when analyzing a dataset, but the data itself has limitations that are not always possible to overcome.
During the project, a lot of things did not go as expected. First of all, the patterns we expected to find in the data were just not there. The intention was to find correlation between the problems of the trees and their health, and possibly also with the diameter. A lot of basic Pearson correlations were computed on the data, but it turned out that there were no significant correlations. Then a lot of other datasets were inspected to see if they correlated with some of the tree data features. The only interesting finding regarded the air pollution: there was a correlation between the amount of trees, their diameter, and the air quality. Moreover, we discovered that the diameter of a tree is influenced by its problems. We managed to create a lot of plots and visualizations of the data showing the fundamental counts. We also managed to apply different machine learning methods, though they only confirmed what the preliminary analysis showed. In general, there were no really large areas with problematic trees, and the problems could not really be related to the health. The health classes were sparse, which influenced the predictions.
In general, we feel that we have acquired familiarity with Python for data analysis. On the other hand, it took much more time to get the visualizations working, especially when the d3 visualizations were put online. Sadly, we spent a lot of time fixing minor compatibility problems with the website/d3, which took time away from our analysis/visualization work.
If more data had been available to join with the street tree data, we might have been able to find other nice patterns/correlations. The air pollution was one, and it would have been nice to focus more on this, also exploring other variables, like the amount of traffic in New York City.
We could also have focused on one prediction goal instead of trying to find a lot of different patterns that turned out not to be there. E.g. we could have focused on predicting species but with other tools, since KNN obviously was not the best choice. A suggestion would be to try some binary classifications, locating e.g. London planetrees using SVMs (so just one species instead of focusing on several).
In the end, what we actually missed were patterns in the data that would have allowed more advanced predictions. But patterns cannot be forced, so with that in mind, we could instead have used more visualizations for the predictions we did make. We could also have shown more interactive features regarding health and problems in the various years, analyzing the changes of the trees in more depth.
We could have assigned the air pollution to each neighbourhood available in the dataset instead of using the community district mean; this was too much work because of the differing neighbourhood names. Moreover, we could have explored the 311 dataset more than we did, to see if areas with more complaints performed worse in regards to air pollution.
We also started analyzing house prices, but the neighbourhood names were too different, so time prohibited us from continuing down this path.
The visualization of the map on the site is a bit slow, and we could not figure out how to optimize it. Moreover, the scatterplots were not working initially, so we spent a lot of time figuring out how to improve the site in this regard, and one axis is still missing in the second scatterplot (even though it worked locally before).
In [1]:
#Import the whole dataset
import pandas as pd
import csv
tree_data = pd.read_csv('2015_tree_data_updated.csv')
tree_data
Out[1]:
In [4]:
#Convert health categories to numbers. 1: Poor, 2: Fair, 3: Good. Higher = better
health_map = {'Good': 3, 'Fair': 2, 'Poor': 1}
#anything else (e.g. missing values) is mapped to 0
health = [health_map.get(h, 0) for h in tree_data['Health']]
In [5]:
#Finding the total number of trees and how many there are of the different species of trees
tree_amount = tree_data['Spc_Common'].value_counts()
print(tree_amount)
In [7]:
#Plotting the results to get an overview
#plt.style.use('ggplot')
%matplotlib inline
def barplot(series, title, figsize, ylabel, flag, rotation):
    ax = series.plot(kind='bar',
                     title=title,
                     figsize=figsize,
                     fontsize=13)
    # set ylabel
    ax.set_ylabel(ylabel)
    # show/hide the x axis (depending on the flag that comes as a function parameter)
    ax.get_xaxis().set_visible(flag)
    # set series index as xlabels and rotate them
    ax.set_xticklabels(series.index, rotation=rotation)

barplot(tree_amount, 'Tree types', figsize=(20,8), ylabel='tree count', flag=True, rotation=90)
In [9]:
#Putting the percentages on a pie chart
ax = tree_amount.plot(kind='pie', title='Top 20 tree species in NYC', autopct='%1.0f%%', pctdistance=0.9)
ax.set_ylabel('')
ax.set_aspect('equal')
In [10]:
#Count no. of trees in each borough:
boros = tree_data['Borough'].unique()
print(tree_data['Borough'].value_counts())
In [11]:
#Finding how many trees there are of the different health types
tree_health = tree_data['Health'].value_counts()
print(tree_health)
In [12]:
#Comparing the count of each tree species in the whole city with a borough. This was used in the initial analysis to try and
#determine if focus should be put on a single borough, and which borough this should be.
queens_tree_types = tree_data.loc[tree_data['Borough'] == 'Queens', 'Spc_Common'].value_counts()
brooklyn_tree_types = tree_data.loc[tree_data['Borough'] == 'Brooklyn', 'Spc_Common'].value_counts()
staten_tree_types = tree_data.loc[tree_data['Borough'] == 'Staten Island', 'Spc_Common'].value_counts()
bronx_tree_types = tree_data.loc[tree_data['Borough'] == 'Bronx', 'Spc_Common'].value_counts()
manhattan_tree_types = tree_data.loc[tree_data['Borough'] == 'Manhattan', 'Spc_Common'].value_counts()
df = pd.concat([tree_amount, queens_tree_types], axis=1)
print(df)
df.columns = ['NYC', 'Queens']
df = df.sort_values('NYC', ascending=False) # sort the df using NYC values
df.plot.bar(color=['red','blue'])
Out[12]:
In [17]:
#Comparing the number of trees in each of the five boroughs
import matplotlib.pyplot as plt
fig, axes = plt.subplots(nrows=5)
plt.subplots_adjust(wspace=1, hspace=0.5)
plot = queens_tree_types.plot(ax=axes[0], kind='bar', figsize=(8,30)); axes[0].set_title('Queens');
brooklyn_tree_types.plot(ax=axes[1], kind='bar'); axes[1].set_title('Brooklyn');
manhattan_tree_types.plot(ax=axes[2], kind='bar'); axes[2].set_title('Manhattan');
staten_tree_types.plot(ax=axes[3], kind='bar'); axes[3].set_title('Staten Island');
bronx_tree_types.plot(ax=axes[4], kind='bar'); axes[4].set_title('Bronx');
fig = plot.get_figure()
In [19]:
with open('2015_tree_data_updated.csv', 'r') as infile:
    # read the file as a dictionary for each row ({header : value})
    reader = csv.DictReader(infile)
    data = {}  #empty dict
    for row in reader:
        for header, value in row.items():
            try:
                data[header].append(value)
            except KeyError:
                data[header] = [value]
Diameter = data['Diameter']
Health = data['Health']
Spc_Latin = data['Spc_Latin']
Spc_Common = data['Spc_Common']
Sidewalk_Condition = data['Sidewalk_Condition']
problems = data['problems']
root_stone = data['root_stone']
root_grate = data['root_grate']
root_other = data['root_other']
trunk_wire = data['trunk_wire']
trnk_light = data['trnk_light']
trnk_other = data['trnk_other']
brch_light = data['brch_light']
brch_shoe = data['brch_shoe']
brch_other = data['brch_other']
Address = data['Address']
Zipcode = data['Zipcode']
CB = data['CB']
Borough = data['Borough']
Latitude = data['Latitude']
Longitude = data['Longitude']
#Heatmap of tree distribution
lon_new = []
lat_new = []
for i in range(len(CB)):
    lon_new.append(float(Longitude[i]))
    lat_new.append(float(Latitude[i]))
with open('Coordinates_trees.csv', 'w') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(('lon', 'lat'))
    for xd, yd in zip(lon_new, lat_new):
        writer.writerow((xd, yd))
import geoplotlib
from geoplotlib.utils import read_csv, BoundingBox, DataAccessObject
min_lon, max_lon = min(lon_new), max(lon_new)
min_lat, max_lat = min(lat_new), max(lat_new)
bbox = BoundingBox(north=float(max_lat), west=float(min_lon), south=float(min_lat), east=float(max_lon))
print("Trees:", bbox)
data_trees = read_csv('Coordinates_trees.csv')
geoplotlib.kde(data_trees, bw=0.5, cmap='jet', cut_below=1e-4)
geoplotlib.set_bbox(bbox)
geoplotlib.inline()
In the repository there is also data from 2005 and 1995, but it is not included in full in this explainer notebook, as it does not provide additional value for explaining our analysis.
In [20]:
#2015 data
import numpy as np
#NYC top 20 species
unique, counts = np.unique(tree_data['Spc_Common'], return_counts=True)
sorted(zip(counts, unique), reverse=True)
#Extract data for each borough
tree_data_bronx = tree_data.loc[tree_data['Borough'] == 'Bronx']
tree_data_brook = tree_data.loc[tree_data['Borough'] == 'Brooklyn']
tree_data_stat = tree_data.loc[tree_data['Borough'] == 'Staten Island']
tree_data_manh = tree_data.loc[tree_data['Borough'] == 'Manhattan']
tree_data_queens = tree_data.loc[tree_data['Borough'] == 'Queens']
#Bronx
unique, counts = np.unique(tree_data_bronx['Spc_Common'], return_counts=True)
sorted(zip(counts, unique), reverse=True)
#Brooklyn
unique, counts = np.unique(tree_data_brook['Spc_Common'], return_counts=True)
sorted(zip(counts, unique), reverse=True)
#Staten Island
unique, counts = np.unique(tree_data_stat['Spc_Common'], return_counts=True)
sorted(zip(counts, unique), reverse=True)
#Manhattan
unique, counts = np.unique(tree_data_manh['Spc_Common'], return_counts=True)
sorted(zip(counts, unique), reverse=True)
#Queens
unique, counts = np.unique(tree_data_queens['Spc_Common'], return_counts=True)
print("Queens data:")
print(sorted(zip(counts, unique), reverse=True))
In [23]:
#2005 data
import pandas as pd
import numpy as np
tree_data = pd.read_csv('2005_tree_data_updated.csv')
#NYC top 20 species
unique, counts = np.unique(tree_data['Spc_Common'], return_counts=True)
sorted(zip(counts, unique), reverse=True)
#Extract data for each borough
tree_data_bronx = tree_data.loc[tree_data['Borough'] == 'Bronx']
tree_data_brook = tree_data.loc[tree_data['Borough'] == 'Brooklyn']
tree_data_stat = tree_data.loc[tree_data['Borough'] == 'Staten Island']
tree_data_manh = tree_data.loc[tree_data['Borough'] == 'Manhattan']
tree_data_queens = tree_data.loc[tree_data['Borough'] == 'Queens']
#Bronx
unique, counts = np.unique(tree_data_bronx['Spc_Common'], return_counts=True)
sorted(zip(counts, unique), reverse=True)
#Brooklyn
unique, counts = np.unique(tree_data_brook['Spc_Common'], return_counts=True)
sorted(zip(counts, unique), reverse=True)
#Staten Island
unique, counts = np.unique(tree_data_stat['Spc_Common'], return_counts=True)
sorted(zip(counts, unique), reverse=True)
#Manhattan
unique, counts = np.unique(tree_data_manh['Spc_Common'], return_counts=True)
sorted(zip(counts, unique), reverse=True)
#Queens
unique, counts = np.unique(tree_data_queens['Spc_Common'], return_counts=True)
print("Queens data:")
print(sorted(zip(counts, unique), reverse=True))
In [ ]:
#1995 data
import pandas as pd
import numpy as np
tree_data = pd.read_csv('1995_tree_data_updated.csv')
#NYC top 20 species
unique, counts = np.unique(tree_data['Spc_Common'], return_counts=True)
sorted(zip(counts, unique), reverse=True)
#Extract data for each borough
tree_data_bronx = tree_data.loc[tree_data['Borough'] == 'Bronx']
tree_data_brook = tree_data.loc[tree_data['Borough'] == 'Brooklyn']
tree_data_stat = tree_data.loc[tree_data['Borough'] == 'Staten Island']
tree_data_manh = tree_data.loc[tree_data['Borough'] == 'Manhattan']
tree_data_queens = tree_data.loc[tree_data['Borough'] == 'Queens']
#Bronx
unique, counts = np.unique(tree_data_bronx['Spc_Common'], return_counts=True)
sorted(zip(counts, unique), reverse=True)
#Brooklyn
unique, counts = np.unique(tree_data_brook['Spc_Common'], return_counts=True)
sorted(zip(counts, unique), reverse=True)
#Staten Island
unique, counts = np.unique(tree_data_stat['Spc_Common'], return_counts=True)
sorted(zip(counts, unique), reverse=True)
#Manhattan
unique, counts = np.unique(tree_data_manh['Spc_Common'], return_counts=True)
sorted(zip(counts, unique), reverse=True)
#Queens
unique, counts = np.unique(tree_data_queens['Spc_Common'], return_counts=True)
print("Queens data:")
print(sorted(zip(counts, unique), reverse=True))
In [25]:
bor_list = list(set(list(Borough)))
In [26]:
dic_all_boro = {}
for b in bor_list:
    dic_all_boro[b] = [0, 0, 0, 0, 0]
#Count root, trunk, branch and sidewalk problems for each tree and add the
#counts to the tree's borough: [root, trunk, branch, sidewalk, total]
for i in range(0, len(CB)):
    temp_root = 0
    temp_trunk = 0
    temp_branch = 0
    sidewalk = 0
    if root_stone[i] == 'Yes':
        temp_root += 1
    if root_grate[i] == 'Yes':
        temp_root += 1
    if root_other[i] == 'Yes':
        temp_root += 1
    if trunk_wire[i] == 'Yes':
        temp_trunk += 1
    if trnk_light[i] == 'Yes':
        temp_trunk += 1
    if trnk_other[i] == 'Yes':
        temp_trunk += 1
    if brch_light[i] == 'Yes':
        temp_branch += 1
    if brch_shoe[i] == 'Yes':
        temp_branch += 1
    if brch_other[i] == 'Yes':
        temp_branch += 1
    if Sidewalk_Condition[i] == 'Damage':
        sidewalk += 1
    temp_tot = temp_root + temp_trunk + temp_branch + sidewalk
    temp_list = [temp_root, temp_trunk, temp_branch, sidewalk, temp_tot]
    #update the count list of this tree's borough
    for c, t in enumerate(temp_list):
        dic_all_boro[Borough[i]][c] += t
In [27]:
with open('problem_count.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(('Borough', 'Root_Prob', 'Trunk_Prob', 'Branch_Prob', 'Sidewalk', 'Tot_Prob'))
    for d in dic_all_boro.keys():
        writer.writerow((d, dic_all_boro[d][0], dic_all_boro[d][1], dic_all_boro[d][2], dic_all_boro[d][3], dic_all_boro[d][4]))
In [29]:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
#Problem counts per borough, in a fixed order so the labels match the points
boro_order = ['Bronx', 'Brooklyn', 'Manhattan', 'Queens', 'Staten Island']
root = [dic_all_boro[b][0] for b in boro_order]
trunk = [dic_all_boro[b][1] for b in boro_order]
branch = [dic_all_boro[b][2] for b in boro_order]
tot = [dic_all_boro[b][4] for b in boro_order]  #index 3 holds the sidewalk count
fig, ax = plt.subplots()
ax.set_xlabel('Root')
ax.set_ylabel('Trunk')
ax.scatter(np.asarray(root), np.asarray(trunk))
for label, x, y in zip(boro_order, root, trunk):
    ax.annotate(
        label,
        xy=(x, y), xytext=(-20, 20),
        textcoords='offset points', ha='right', va='bottom',
        bbox=dict(boxstyle='round,pad=0.5', fc='yellow', alpha=0.5),
        arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0'))
fig.savefig("problem1")
plt.show()
f, ax = plt.subplots()
ax.set_xlabel('Root')
ax.set_ylabel('Branch')
ax.scatter(np.asarray(root), np.asarray(branch))
for label, x, y in zip(boro_order, root, branch):
    ax.annotate(
        label,
        xy=(x, y), xytext=(-20, 20),
        textcoords='offset points', ha='right', va='bottom',
        bbox=dict(boxstyle='round,pad=0.5', fc='yellow', alpha=0.5),
        arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0'))
f.savefig("problem2")
plt.show()
f, ax = plt.subplots()
ax.set_xlabel('Trunk')
ax.set_ylabel('Branch')
ax.scatter(np.asarray(trunk), np.asarray(branch))
for label, x, y in zip(boro_order, trunk, branch):
    ax.annotate(
        label,
        xy=(x, y), xytext=(-20, 20),
        textcoords='offset points', ha='right', va='bottom',
        bbox=dict(boxstyle='round,pad=0.5', fc='yellow', alpha=0.5),
        arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0'))
f.savefig("problem3")
plt.show()
We have used the Apriori algorithm to explore whether some problems tend to appear together.
In [30]:
index = 0
root_stone_lon = []
root_stone_lat = []
root_grate_lon = []
root_grate_lat = []
trunk_wire_lon = []
trunk_wire_lat = []
trunk_light_lon = []
trunk_light_lat = []
branch_light_lon = []
branch_light_lat = []
branch_shoe_lon = []
branch_shoe_lat = []
count_br = 0
count2 = 0
for i in range(0, len(Latitude)):
    if root_stone[i] == 'Yes':
        root_stone_lat.append(float(Latitude[i]))
        root_stone_lon.append(float(Longitude[i]))
    if root_grate[i] == 'Yes':
        root_grate_lat.append(Latitude[i])
        root_grate_lon.append(Longitude[i])
    if trunk_wire[i] == 'Yes':
        trunk_wire_lat.append(Latitude[i])
        trunk_wire_lon.append(Longitude[i])
    if trnk_light[i] == 'Yes':
        trunk_light_lat.append(Latitude[i])
        trunk_light_lon.append(Longitude[i])
    if brch_light[i] == 'Yes':
        branch_light_lat.append(Latitude[i])
        branch_light_lon.append(Longitude[i])
    if brch_shoe[i] == 'Yes':
        branch_shoe_lat.append(Latitude[i])
        branch_shoe_lon.append(Longitude[i])
    if Borough[i] == 'Brooklyn' and brch_light[i] == 'Yes' and trunk_wire[i] == 'Yes':
        count_br += 1
    if Borough[i] == 'Brooklyn':
        count2 += 1
print('Count: ', count_br, count2)
print(set(Borough))
root_stone_zip = list(zip(root_stone_lon, root_stone_lat))
root_grate_zip = list(zip(root_grate_lon, root_grate_lat))
trunk_wire_zip = list(zip(trunk_wire_lon, trunk_wire_lat))
trunk_light_zip = list(zip(trunk_light_lon, trunk_light_lat))
branch_light_zip = list(zip(branch_light_lon, branch_light_lat))
branch_shoe_zip = list(zip(branch_shoe_lon, branch_shoe_lat))
In [31]:
sidewalk_cond_lon = []
sidewalk_cond_lat = []
for i in range(0, len(Latitude)):
    if Sidewalk_Condition[i] == 'Damage':
        sidewalk_cond_lat.append(float(Latitude[i]))
        sidewalk_cond_lon.append(float(Longitude[i]))
with open('sidewalk_dam.csv', 'w') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(('lon', 'lat'))
    for xd, yd in zip(sidewalk_cond_lon, sidewalk_cond_lat):
        writer.writerow((xd, yd))
In [32]:
#Write each problem type's coordinates to its own csv file
problem_files = [('root_stone.csv', root_stone_zip),
                 ('root_grate.csv', root_grate_zip),
                 ('trk_wire.csv', trunk_wire_zip),
                 ('trk_light.csv', trunk_light_zip),
                 ('brc_light.csv', branch_light_zip),
                 ('brc_shoe.csv', branch_shoe_zip)]
for fname, coords in problem_files:
    with open(fname, 'w') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(('lon', 'lat'))
        for xd, yd in coords:
            writer.writerow((xd, yd))
In [33]:
import geoplotlib
from geoplotlib.utils import read_csv, BoundingBox, DataAccessObject
min_lat = min(root_stone_lat)
max_lat = max(root_stone_lat)
min_lon = min(root_stone_lon)
max_lon = max(root_stone_lon)
bbox = BoundingBox(north=float(max_lat), west=float(min_lon), south=float(min_lat), east=float(max_lon))
print("Trees:", bbox)
geoplotlib.set_bbox(bbox)
#Plot each problem type as a dot map
for fname, title in [('root_stone.csv', 'Root stone:'),
                     ('root_grate.csv', 'Root grate:'),
                     ('trk_wire.csv', 'Trunk wire:'),
                     ('trk_light.csv', 'Trunk light:'),
                     ('brc_light.csv', 'Branch light:'),
                     ('brc_shoe.csv', 'Branch shoe:')]:
    data = read_csv(fname)
    print(title)
    geoplotlib.dot(data, 'r', point_size=0.4)
    geoplotlib.inline()
data1 = read_csv('brc_light.csv')
data2 = read_csv('trk_wire.csv')
print('Branch light and Trunk Wire:')
geoplotlib.dot(data1, 'r', point_size=0.4)
geoplotlib.dot(data2, 'g', point_size=0.4)
geoplotlib.inline()
In [34]:
data = read_csv('sidewalk_dam.csv')
print('Sidewalk damaged:')
geoplotlib.dot(data, 'r', point_size=0.4)
geoplotlib.inline()
In [35]:
!pip install apyori-1.1.1.tar.gz
## Trying association mining
from apyori import apriori
transactions = [
    ['cheese', 'nuggets'],
    ['burgers', 'balls'],
]
results = list(apriori(transactions))
In [36]:
## Trying association mining
from apyori import apriori
transactions = [
    ['beer', 'nuts'],
    ['beer', 'cheese'],
    ['nuts', 'cheese'],
]
transactions.append(['nuts', 'cheese'])
results = list(apriori(transactions))
print(results[0])
print('')
print(results[1])
print('')
print(results[4])
In [37]:
import numpy as np
#root_stone = data['root_stone']
#root_grate = data['root_grate']
#root_other = data['root_other']
#trunk_wire = data['trunk_wire']
#trnk_light = data['trnk_light']
#trnk_other = data['trnk_other']
#brch_light = data['brch_light']
#brch_shoe = data['brch_shoe']
#brch_other = data['brch_other']
transactions = []
np_count = 0    #trees with no problems
nuno_count = 0  #trees with exactly one problem
counter = 0     #trees with more than one problem
#the *_other categories are deliberately left out of the itemsets
for i in range(0, len(root_stone)):
    temp = []
    if root_stone[i] == 'Yes':
        temp.append("Root_Stone")
    if root_grate[i] == 'Yes':
        temp.append("Root_Grate")
    if trunk_wire[i] == 'Yes':
        temp.append("Trunk_Wire")
    if trnk_light[i] == 'Yes':
        temp.append("Trunk_Light")
    if brch_light[i] == 'Yes':
        temp.append("Branch_Light")
    if brch_shoe[i] == 'Yes':
        temp.append("Branch_Shoe")
    if Sidewalk_Condition[i] == 'Damage':
        temp.append("Sidewalk")
    if len(temp) > 1:
        transactions.append(temp)
        counter += 1
    elif len(temp) == 0:
        np_count += 1
    elif len(temp) == 1:
        nuno_count += 1
results = list(apriori(transactions))
print('Associated:', len(transactions))
print(len(results))
print('Empty:', np_count)
print('One Item:', nuno_count)
print('More: ', counter)
In [38]:
import numpy as np
transactions = []
np_count = 0
nuno_count = 0
counter = 0
#only root stone, trunk wire and branch light are considered here
for i in range(0, len(root_stone)):
    temp = []
    if root_stone[i] == 'Yes':
        temp.append("Root_Stone")
    if trunk_wire[i] == 'Yes':
        temp.append("Trunk_Wire")
    if brch_light[i] == 'Yes':
        temp.append("Branch_Light")
    if len(temp) > 1:
        transactions.append(temp)
        counter += 1
    elif len(temp) == 0:
        np_count += 1
    elif len(temp) == 1:
        nuno_count += 1
results = list(apriori(transactions))
#print the itemset, the support and the lift of each found relation
for i in range(0, len(results)):
    print('- ', i, ':', results[i][0], results[i][1], ', Lift:', results[i][-1][-1][-1])
The results show that branch light problems appear together with trunk wire problems, which can also be seen in the plot. This could be because trees placed such that their branches grow into street lights also have their trunks interfered with by the lighting structures.
We have also explored the 311 dataset in the context of trees, as there is some data in there that is specific to our domain.
Some of the most interesting 311 requests that we found were related to overgrown trees and new tree requests. The Python analysis is not included in detail in the notebook, but two images are included showing geoplots of the mentioned complaints (note that there are some hotspots).
In [39]:
from IPython.display import Image
Image("new_requests.png")
Out[39]:
In [40]:
Image("overgrown_trees.png")
Out[40]:
We have found that, sadly, the problems are not related to the health. In fact, the Pearson correlation between these two parameters was very low.
We explored the diameter and the amount of trees in the areas of the city and discovered that these two factors influence the air quality. We also discovered that the problems seem to have an influence on the diameter.
In [41]:
Image("problem_amount.png")
Out[41]:
The above image shows the results of the regression, which can also be seen here:
These are the results of the regressions for predicting the amount of air pollution (and O2) given the amount of trees and their diameter as parameters. The results were almost the same for the other types of particles. The whole regression notebook is included in the repo along with the necessary data files. (https://github.com/cecli/cecli.github.io/blob/master/regression_notebook.ipynb)
The images showing the correlation between the amount of trees and the pollution are in the other notebook, and this is also the data on which the regression visualization on the website is based.
In [ ]:
In [42]:
#KNN classifier
#Load relevant libraries
import numpy as np
import pylab as pl
from sklearn import neighbors, datasets, model_selection
#Split data set into a training and a test set
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    list(zip(tree_data['Latitude'], tree_data['Longitude'])),
    tree_data['Spc_Common'],
    test_size=0.15,
    random_state=42)
accuracy = []
#Classify KNN with K=2-10
for k in range(2, 11):
    knn = neighbors.KNeighborsClassifier(n_neighbors=k, weights="distance")
    #Fit the data and make predictions
    knn.fit(X_train, y_train).predict(X_test)
    #Calculate the cross-validated accuracy on the training set
    n_folds = 5
    score = np.mean(model_selection.cross_val_score(knn.fit(X_train, y_train), X_train, y_train, cv=n_folds))
    print("KNN score for k =", k, ":", score)
    #Save accuracy into a list
    accuracy.append(score)
#KNN score for k = 2 : 0.523009486208
#KNN score for k = 3 : 0.518710892965
#KNN score for k = 4 : 0.520454115483
#KNN score for k = 5 : 0.519925851647
#KNN score for k = 6 : 0.51948343831
#KNN score for k = 7 : 0.518847340502
#KNN score for k = 8 : 0.518138623743
#KNN score for k = 9 : 0.517289048133
#KNN score for k = 10 : 0.51639320774
In [43]:
#Plot accuracy as a function of the number of K (2-10)
import matplotlib.pyplot as plt
plt.figure(figsize=(20,5))
ks = range(2, 11)
plt.plot(ks, accuracy)
plt.xticks(ks)
plt.xlabel("k")
plt.ylabel("Accuracy")
plt.title("Prediction accuracy as a function of k")
plt.show()
In [44]:
#K=4 was chosen for simplicity compared to accuracy
#Test score
knn = neighbors.KNeighborsClassifier(n_neighbors=4, weights = "distance")
#Fit the data and make predictions
knn.fit(X_train, y_train).predict(X_test)
#Calculate accuracy
score = knn.fit(X_train, y_train).score(X_test, y_test)
print(score)  #0.543241288134
In [45]:
tree_data = pd.read_csv('2015_tree_data_updated.csv')
unique, counts = np.unique(tree_data['Spc_Common'], return_counts=True)
print(sorted(zip(counts, unique), reverse=True))
#Try doing KNN for only the top 5 species
top5_spec = ['London planetree', 'honeylocust', 'Callery pear', 'pin oak', 'Norway maple']
tree_spec5 = []
tree_lat5 = []
tree_lon5 = []
for i in range(len(tree_data)):
    if tree_data['Spc_Common'][i] in top5_spec:
        tree_spec5.append(tree_data['Spc_Common'][i])
        tree_lat5.append(tree_data['Latitude'][i])
        tree_lon5.append(tree_data['Longitude'][i])
print(len(tree_spec5))
#KNN classifier
#Load relevant libraries
import numpy as np
import pylab as pl
from sklearn import neighbors, datasets, model_selection
#Split data set into a training and a test set
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    list(zip(tree_lat5, tree_lon5)),
    tree_spec5,
    test_size=0.15,
    random_state=42)
accuracy = []
#Classify KNN with K=2-10
for k in range(2, 11):
    knn = neighbors.KNeighborsClassifier(n_neighbors=k, weights="distance")
    #Fit the data and make predictions
    knn.fit(X_train, y_train).predict(X_test)
    #Calculate the cross-validated accuracy
    n_folds = 5
    score = np.mean(model_selection.cross_val_score(knn.fit(X_train, y_train), X_train, y_train, cv=n_folds))
    print("KNN score for k =", k, ":", score)
    #Save accuracy into a list
    accuracy.append(score)
In [46]:
#Create Decision tree classifier
#Load relevant libraries
import numpy as np
from sklearn import tree
from sklearn import model_selection
#Split data set into a training and a test set
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    list(zip(tree_data['Latitude'], tree_data['Longitude'])),
    tree_data['Spc_Common'],
    test_size=0.15,
    random_state=42)
#Classify with a decision tree
dt = tree.DecisionTreeClassifier(random_state=42)
#Fit the data and make predictions
dt.fit(X_train, y_train).predict(X_test)
#Calculate the cross-validated accuracy
n_folds = 5
score = np.mean(model_selection.cross_val_score(dt.fit(X_train, y_train), X_train, y_train, cv=n_folds))
print("Decision tree accuracy:", score)
In [47]:
#Adjust KNN classifyer
#Load relevant libraries
import numpy as np
import pylab as pl
from sklearn import neighbors, datasets, model_selection
#Split data set into a training and a test set
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    list(zip(tree_data['Latitude'], tree_data['Longitude'])),
    tree_data['Health'],
    test_size=0.15,
    random_state=42)
accuracy = []
#Classify KNN with K=2-10
for k in range(2, 11):
    knn = neighbors.KNeighborsClassifier(n_neighbors=k, weights="distance")
    #Fit the data and make predictions
    knn_pred = knn.fit(X_train, y_train).predict(X_test)
    #Calculate the cross-validated accuracy
    n_folds = 5
    score = np.mean(model_selection.cross_val_score(knn.fit(X_train, y_train), X_train, y_train, cv=n_folds))
    print("KNN score for k =", k, ":", score)
    #Save accuracy into a list
    accuracy.append(score)
#KNN score for k = 2 : 0.768112149227
#KNN score for k = 3 : 0.789842752784
#KNN score for k = 4 : 0.798109764942
#KNN score for k = 5 : 0.804213182933
#KNN score for k = 6 : 0.80804514699
#KNN score for k = 7 : 0.810998909425
#KNN score for k = 8 : 0.813439834781
#KNN score for k = 9 : 0.815178637038
#KNN score for k = 10 : 0.816549869474
In [48]:
#Test accuracy
knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights="distance")
#Fit the data and make predictions
knn_pred = knn.fit(X_train, y_train).predict(X_test)
#Calculate accuracy
score = knn.fit(X_train, y_train).score(X_test, y_test)
print(score)  #0.806917109432
In [ ]:
#Create Decision tree classifier
#Load relevant libraries
import numpy as np
from sklearn import tree
from sklearn import model_selection
#Split data set into a training and a test set
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    list(zip(tree_data['Latitude'], tree_data['Longitude'])),
    tree_data['Health'],
    test_size=0.15,
    random_state=42)
#Classify with a decision tree
dt = tree.DecisionTreeClassifier(random_state=42)
#Fit the data and make predictions
dt_pred = dt.fit(X_train, y_train).predict(X_test)
#Calculate the cross-validated accuracy
n_folds = 5
score = np.mean(model_selection.cross_val_score(dt.fit(X_train, y_train), X_train, y_train, cv=n_folds))
print("Decision tree accuracy:", score)  #0.748111525321
In [ ]:
# Create SVM classifier
#Load relevant libraries
import numpy as np
from sklearn import svm
from sklearn import model_selection
#Split data set into a training and a test set
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    list(zip(tree_data['Latitude'], tree_data['Longitude'])),
    tree_data['Health'],
    test_size=0.15,
    random_state=42)
#Classify with an SVM (named clf so we do not shadow the svm module)
clf = svm.SVC(random_state=42)
#Fit the data and make predictions
clf.fit(X_train, y_train).predict(X_test)
#Calculate the cross-validated accuracy
n_folds = 5
score = np.mean(model_selection.cross_val_score(clf.fit(X_train, y_train), X_train, y_train, cv=n_folds))
print("SVM accuracy:", score)
We tried a decision tree and a random forest to predict species based on location and diameter, classifying the top 20, 10 and 5 species. We found that the decision tree was not an ideal solution, as the best result was 0.5 for the top 5 species; the random forest had the same result. We have included the images of the decision trees, and as can be seen, they just predict two species, locust and pin oak, so this is clearly not a good model. We also tried predicting neighbourhoods based on problems, but that did not work either, as the accuracy was even worse. The images are included in the repository.
In [ ]:
#Extract features for the top species; note that only the top 5 species
#are listed here, although the variables are named "top 10"
top10_spec = ['London planetree', 'pin oak', 'honeylocust', 'Norway maple', 'Callery pear']
tree_spec10 = []
tree_lat10 = []
tree_lon10 = []
tree_health10 = []
tree_nth10 = []
tree_diam10 = []
tree_cb10 = []
tree_boro10 = []
tree_root10 = []
tree_branch10 = []
tree_trunk10 = []
tree_total10 = []
print(len(health), len(tree_data))
for i in range(len(health)):
    if tree_data['Spc_Common'][i] in top10_spec and float(tree_data['Diameter'][i]) >= 10.00:
        tree_spec10.append(tree_data['Spc_Common'][i])
        tree_lat10.append(tree_data['Latitude'][i])
        tree_lon10.append(tree_data['Longitude'][i])
        tree_health10.append(health[i])
        tree_nth10.append(tree_data['Neighbourhoods'][i])
        tree_diam10.append(tree_data['Diameter'][i])
        tree_cb10.append(tree_data['CB'][i])
        tree_boro10.append(tree_data['Borough'][i])
        #per-tree problem counts (root_list etc. are computed in the
        #problem-count analysis, not shown in this notebook)
        tree_root10.append(root_list[i])
        tree_branch10.append(branch_list[i])
        tree_trunk10.append(trunk_list[i])
        tree_total10.append(total_prob_list[i])
print(len(tree_spec10))
print(tree_spec10[:10])
print(tree_lat10[:10])
print(tree_lon10[:10])
In [ ]:
#Bin edges for the diameter (10-15, ..., 45-50, 50-60, ..., 90-100,
#100-150, 150-200, 250, 300), listed from the largest edge downwards
list_bins = [300.00, 250.00, 200.00, 150.00, 100.00, 90.00, 80.00, 70.00, 60.00,
             50.00, 45.00, 40.00, 35.00, 30.00, 25.00, 20.00, 15.00, 10.00]
#Label each tree with the largest bin edge its diameter reaches
new_diam = []
for t in tree_diam10:
    ft = float(t)
    for l in list_bins:
        if ft >= l:
            new_diam.append(l)
            break
In [ ]:
#Decision tree for classifying tree species based on diameter and location
#Load relevant libraries
import numpy as np
from sklearn import tree
from sklearn import model_selection
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
#le.fit(tree_data['Neighbourhoods'])
#list(le.classes_)
#trans_nbh = le.transform(tree_data['Neighbourhoods'])
#Split data set into a training and a test set
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    list(zip(tree_data['Diameter'], tree_data['Longitude'], tree_data['Latitude'])),
    tree_data['Spc_Common'],
    test_size=0.10,
    random_state=42)
#Classify with a decision tree
dt = tree.DecisionTreeClassifier(random_state=42)
#Fit the data and make predictions
dt.fit(X_train, y_train).predict(X_test)
#Calculate the cross-validated accuracy
n_folds = 10
score = np.mean(model_selection.cross_val_score(dt.fit(X_train, y_train), X_train, y_train, cv=n_folds))
print("Decision tree accuracy:", score)
#trans_nbh10 = le.transform(tree_nth10)
#X_train, X_test, y_train, y_test = model_selection.train_test_split(zip(tree_diam10, tree_health10, trans_cb10)
#, tree_spec10
#, test_size=0.33
#, random_state=42)
#Classify Decision trees
#dt = tree.DecisionTreeClassifier(random_state = 42)
#Fit the data and make predictions
#dt.fit(X_train, y_train).predict(X_test)
#Calculate accuracy
#score = dtnn.fit(X_train, y_train).score(X_test, y_test)
#n_folds = 10
#score = np.mean(model_selection.cross_val_score(dt.fit(X_train, y_train),X_train, y_train,cv=n_folds))
#print "Decision tree accuracy for classifying top 10 species:", score
In [ ]:
import numpy as np
from sklearn import tree
from sklearn import model_selection
from sklearn import preprocessing
from sklearn import ensemble
le = preprocessing.LabelEncoder()
le.fit(tree_nth10)
list(le.classes_)
trans_nbh10 = le.transform(tree_nth10)
le = preprocessing.LabelEncoder()
le.fit(tree_boro10)
list(le.classes_)
trans_boro10 = le.transform(tree_boro10)
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    list(zip(tree_lon10, tree_lat10)),
    tree_spec10,
    test_size=0.10,
    random_state=42)
#Classify with a decision tree and a random forest
dt = tree.DecisionTreeClassifier(random_state=42)
dt2 = ensemble.RandomForestClassifier(random_state=42)
#Fit the data and make predictions
dt.fit(X_train, y_train).predict(X_test)
dt2.fit(X_train, y_train).predict(X_test)
#Calculate the cross-validated accuracies
n_folds = 10
score = np.mean(model_selection.cross_val_score(dt.fit(X_train, y_train), X_train, y_train, cv=n_folds))
print("Decision tree accuracy for classifying top 10 species:", score)
score = np.mean(model_selection.cross_val_score(dt2.fit(X_train, y_train), X_train, y_train, cv=n_folds))
print("Random forest accuracy for classifying top 10 species:", score)
In [ ]:
import numpy as np
from sklearn import tree
from sklearn import model_selection
from sklearn import preprocessing
from sklearn import ensemble
le = preprocessing.LabelEncoder()
le.fit(tree_nth10)
list(le.classes_)
trans_nbh10 = le.transform(tree_nth10)
le = preprocessing.LabelEncoder()
le.fit(tree_boro10)
list(le.classes_)
trans_boro10 = le.transform(tree_boro10)
le = preprocessing.LabelEncoder()
le.fit(tree_spec10)
list(le.classes_)
trans_species10 = le.transform(tree_spec10)
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    list(zip(trans_species10, tree_total10)),
    new_diam,
    test_size=0.25,
    random_state=42)
#Classify with a decision tree and a random forest
dt = tree.DecisionTreeClassifier(max_depth=20, max_leaf_nodes=40, random_state=42)
dt2 = ensemble.RandomForestClassifier(random_state=42)
#Fit the data and make predictions
dt.fit(X_train, y_train).predict(X_test)
dt2.fit(X_train, y_train).predict(X_test)
#Calculate the cross-validated accuracies
n_folds = 5
score = np.mean(model_selection.cross_val_score(dt.fit(X_train, y_train), X_train, y_train, cv=n_folds))
print("Decision tree accuracy for predicting binned diameter:", score)
score = np.mean(model_selection.cross_val_score(dt2.fit(X_train, y_train), X_train, y_train, cv=n_folds))
print("Random forest accuracy for predicting binned diameter:", score)