1. Motivation

1.1 Dataset

The main dataset was the NYC Street Tree Data from 2015, the result of a community survey carried out mainly by volunteers that catalogued all street trees in NYC. As secondary datasets we had the Street Tree Data from 1995 and 2005. Moreover, we used air pollution data for New York City in order to understand the influence of trees on air quality, and we started analyzing the "311" dataset to explore complaints regarding trees. The Street Tree dataset was chosen because it could give new insights and perspectives for urban planning, reveal the status of the trees (how healthy they are, whether people are taking care of them, etc.), and show whether they influence the quality of life in the city. Moreover, we could discover facts that most people would probably not be aware of beforehand.

1.2 Goal

The goal was to enlighten users about trees in NYC. Are certain types of trees more suitable for streets than others? Where are they located? Is it possible to predict which kind of tree you might encounter based on the location, the health of the tree, the diameter, or even the number of problems the tree has? From this project it should be possible to learn something new about a topic you might never have considered learning about.

2. Basic stats

2.1 Preprocessing the data

There were some outliers in the dataset which had to be removed to get useful results. One example was the latitude/longitude pair, which had an extreme outlier.

Regarding the 311 dataset, we selected only the 2015 data and the complaints regarding trees, as these were the only records relevant to our domain.

For the air pollution there was data for all the community districts, but only some of the neighbourhoods. The measurements were mean percentiles. We took the mean values for the community districts and assigned them to the corresponding borough. This was because the neighbourhood names in this dataset and in our own dataset were so different that it was very difficult to figure out which neighbourhoods were the same in the two sets.
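
As a minimal sketch of this aggregation step (the file name and the column names 'borough' and 'mean_percentile' are assumptions for illustration, not the real headers):

import pandas as pd

air = pd.read_csv('air_pollution.csv')  # hypothetical file of community-district measurements
# average the district-level mean percentiles up to borough level
borough_means = air.groupby('borough')['mean_percentile'].mean()
print(borough_means)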

2.1.1 Variable selection

At first glance the dataset was a bit overwhelming: it is huge, many rows do not necessarily make sense on first inspection, and a number of variables were not particularly interesting or necessary for what we wanted to do. Each variable was carefully examined and the variables deemed unnecessary were excluded. Among these was "Tree_Id", a unique ID for each tree; this ID was, however, only unique within each of the three datasets (1995, 2005, 2015), meaning it was not possible to join the datasets on it, rendering it irrelevant. Other excluded variables were address fields, since multiple variables delivered address information at different levels and it was not relevant to distinguish between all of these.
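
A minimal sketch of the pruning step; "Tree_Id" appears in the text above, while the file name and the extra address column name are placeholders for illustration:

import pandas as pd

tree_data = pd.read_csv('2015_tree_data.csv')  # hypothetical raw export
# Tree_Id cannot join the 1995/2005/2015 sets, and only one address level is needed
tree_data = tree_data.drop(['Tree_Id', 'Address_Extra'], axis=1)  # placeholder column names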

2.1.2 Observation selection

It was decided to focus only on the top 20 tree species: there were many species without a significant number of observations, and it would be difficult to describe them all properly. It would also be very difficult to make good predictions from sparse observations. For some machine learning tools we focused only on the top 10 species, or the top 5, because the data was too sparse above this limit.

There were a lot of trees without a species listed, and those were disregarded completely. Dead trees were also excluded from the dataset.
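
A sketch of this filtering with pandas, using the 'Spc_Common' and 'Status' columns that appear in the appendix:

import pandas as pd

tree_data = pd.read_csv('2015_tree_data_updated.csv')

tree_data = tree_data[tree_data['Spc_Common'].notnull()]  # drop trees without a species
tree_data = tree_data[tree_data['Status'] == 'Alive']     # drop dead trees

top20 = tree_data['Spc_Common'].value_counts().nlargest(20).index
tree_data = tree_data[tree_data['Spc_Common'].isin(top20)]  # keep only the top 20 species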

It was considered to focus on only one of the five boroughs in NYC to get a more detailed view. This was not done, since it was deemed more interesting to also see the differences between the boroughs.

2.2 Stats for the preprocessed data

The final dataset "Street Tree Data 2015" consists of 534,514 tree observations and 21 variables/features, totalling 74.5 MB. The selected features were:

  • Diameter (inches)
  • Health (three values: Good, Fair, Poor)
  • Spc_Latin, Spc_Common (Latin and common names of the species)
  • Sidewalk_Condition (two values: Damage, NoDamage)
  • Problems (a string concatenated from the following types of problems)
    • root_stone, root_grate, root_other, trunk_wire, trnk_light, trnk_other, brch_light, brch_shoe, brch_other (two values: yes/no)
  • Address
  • Zipcode
  • CB (community board)
  • Borough
  • Latitude, Longitude

Number of trees in each borough:

  • Bronx: 63,035
  • Brooklyn: 138,760
  • Staten Island: 82,619
  • Manhattan: 54,115
  • Queens: 195,985

In general, the top 20 species were the same across the 5 boroughs, but the ordering of this "top 20" list differed. Manhattan had both more trees with problems and more unhealthy trees.

We computed Pearson correlations among many pairs of variables, only to find that few of them were correlated. In the end we did find some correlation between air quality, tree count, tree diameter, and the health states.
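
As a hedged sketch (not the exact code we ran), such pairwise checks can be done with pandas' built-in correlation matrix, here with the numeric health encoding used in the appendix:

import pandas as pd

tree_data = pd.read_csv('2015_tree_data_updated.csv')
tree_data['health_num'] = tree_data['Health'].map({'Good': 3, 'Fair': 2, 'Poor': 1})

# Pearson correlation matrix over some numeric columns of interest
print(tree_data[['Diameter', 'health_num', 'Latitude', 'Longitude']].corr(method='pearson'))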

But let's start looking at the main dataset, the 2015 Street Tree Census (https://data.cityofnewyork.us/Environment/2015-Street-Tree-Census-Tree-Data/uvpi-gqnh).

2.3 Other datasets inspected

Multiple secondary datasets were inspected, e.g. the 311 dataset and the air pollution dataset, as well as the Street Tree datasets from 2005 and 1995. In the 311 set there were several complaints about trees in NYC, but no significant correlations were found. We had hoped that certain types of complaints would be correlated with particular problems or with the health of the trees, but unfortunately data does not always behave as hoped or suspected, and patterns cannot (and should not) be forced to appear.

One could also wonder whether more "green" areas, meaning areas with a lot of trees, have higher house prices. After investigation this turned out to be difficult to answer, since little information about house prices is available - at least not at neighbourhood level.

It was also considered if there was a correlation between the trees/features of the trees and the air pollution. This dataset was used for simple linear regression.

For the different maps, a few other datasets were included in the shape of geojson files. These contain the polygons needed for drawing the d3 maps, as well as basic information about the parts of the city they represent (borough, community district, etc.), and were combined with our own data from the Street Tree dataset to produce interactive maps. The geojson files can be found and downloaded at https://github.com/cecli/cecli.github.io/tree/master/data/geojson.

3. Theory

3.1 Machine Learning tools

When doing predictions it can be difficult to find the appropriate tools: different tools have different qualities, and it all depends on the data and the patterns in it. In this project, different tools were tried out, typically multiple tools for the same prediction, in order to compare the model performance of each.

3.1.1 KNN

KNN is a tool that is rather easy to grasp and implement. It was chosen for predicting the health of a tree based on GPS coordinates, as well as for predicting species based on GPS coordinates. An argument for KNN being an appropriate choice is that when planting trees, one would be inclined to plant the same species together. One could also expect unhealthy trees to cluster in the same area, presumably because of a disease in the area, a pollution problem, soil problems, or something else entirely. A drawback of the KNN method is that on an unbalanced dataset it will favour the most frequently occurring class.
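
A minimal sketch of this setup with scikit-learn (illustrative, not the exact project code):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

tree_data = pd.read_csv('2015_tree_data_updated.csv').dropna(subset=['Health'])
X = tree_data[['Latitude', 'Longitude']].values  # GPS coordinates as features
y = tree_data['Health'].values                   # Good / Fair / Poor labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(knn.score(X_test, y_test))  # accuracy on the held-out 15%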

3.1.2 Decision trees and Random Forest

Decision trees can often be a good choice because they are easy to visualize. A drawback is that they tend to overfit the training data. In spite of this drawback, they were used for predicting health based on GPS coordinates, predicting species based on location (adding extra features such as the diameter to see whether they contributed to the predictions), and predicting diameter based on species and location. The main reason was to compare with the other results: if the decision trees did not overfit and still performed well, the visualization would be a nice bonus. To mitigate the overfitting issue, a random forest was also tried out.

Decision trees were also used to predict diameter based on species and problems, as well as predicting diameter based on the number of problems. Here, the diameter was binned into bins of increasing width (1-10, 10-15, 15-20, ..., 45-50, 50-60, 60-70, ..., 90-100, 100-150, 150-200, ...); see the sketch below.
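
A sketch of the binning step with pandas (bin edges taken from the list above) and a decision tree fitted on the yes/no problem flags; illustrative rather than the exact project code:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

tree_data = pd.read_csv('2015_tree_data_updated.csv')
bins = [1, 10, 15, 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 150, 200]
tree_data = tree_data[tree_data['Diameter'].between(1, 200)].copy()
tree_data['diam_bin'] = pd.cut(tree_data['Diameter'], bins=bins, include_lowest=True)

problem_cols = ['root_stone', 'root_grate', 'root_other', 'trunk_wire', 'trnk_light',
                'trnk_other', 'brch_light', 'brch_shoe', 'brch_other']
X = (tree_data[problem_cols] == 'Yes').astype(int)  # yes/no flags -> 0/1 features
y = tree_data['diam_bin'].astype(str)

clf = DecisionTreeClassifier(max_depth=5).fit(X, y)  # shallow tree to curb overfitting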

3.1.3 SVM

As a third tool, Support Vector Machines were tried out. SVMs can do linear classification by constructing a maximum-margin separating hyperplane between the classes. They can also do non-linear classification using the so-called kernel trick, where inputs are mapped into a high-dimensional feature space. This was used to predict health based on GPS coordinates.
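
A minimal SVC sketch under the same setup; the subsampling is our addition here, since (as noted in section 3.3) SVMs are slow on half a million points:

import pandas as pd
from sklearn.svm import SVC

tree_data = pd.read_csv('2015_tree_data_updated.csv').dropna(subset=['Health'])
sample = tree_data.sample(n=20000, random_state=0)  # subsample for tractable training time

X = sample[['Latitude', 'Longitude']].values
y = sample['Health'].values

svm = SVC(kernel='rbf', C=1.0)  # RBF kernel = the kernel trick mentioned above
svm.fit(X, y)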

3.1.4 Apriori

Apriori is an algorithm for frequent itemset mining. It was used to inspect which problems appear together in the same observation, and we found that certain problems do co-occur in the same trees. [We used the Apyori package: https://pypi.python.org/pypi/apyori/1.1.1]
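
A short usage sketch of the Apyori package cited above, on toy transactions shaped like ours (one list of problem labels per tree; cf. the appendix):

from apyori import apriori

transactions = [
    ['Root_Stone', 'Sidewalk'],
    ['Root_Stone', 'Branch_Light'],
    ['Root_Stone', 'Sidewalk'],
    ['Trunk_Wire', 'Branch_Light'],
]
# min_support filters out rare itemsets before the association rules are built
for record in apriori(transactions, min_support=0.3):
    print(record.items, record.support)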

3.1.5 Linear and multiple regression

Linear regression was used to inspect correlations between different features; it is not really a machine learning tool as much as a tool for investigating linear relationships. It was used to predict air pollution based on the number of trees as well as their diameter. Apart from standard linear regression, the regularized techniques Lasso and Elastic Net were used.
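
A hedged sketch of the regression setup; the aggregated file and its column names ('tree_count', 'mean_diameter', 'air_pollution') are assumptions for illustration:

import pandas as pd
from sklearn.linear_model import LinearRegression, Lasso, ElasticNet

districts = pd.read_csv('district_air.csv')  # hypothetical aggregated file (~188 rows)
X = districts[['tree_count', 'mean_diameter']].values
y = districts['air_pollution'].values

for model in (LinearRegression(), Lasso(alpha=0.1), ElasticNet(alpha=0.1)):
    model.fit(X, y)
    print(type(model).__name__, model.score(X, y))  # R^2 on the training data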

3.2 Model selection

When selecting an appropriate model, the first step is to split the data into a training set and a test set. When predicting health (or species) based on GPS coordinates, a test set consisting of 15% of the total number of observations was used. Hereafter the training set was split into a training set and a validation set using 5-fold and 10-fold cross-validation. The best model was chosen based on the accuracy score, with computation time taken into account as well. For KNN, different values of $K$ were tried, ranging over $K=2,\ldots,10$. The limit was set to 10 because we did not expect whole areas of unhealthy trees, and larger neighbourhoods might just confuse the predictions.
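
A sketch of this selection loop (not the exact project code), using the 15% test split and 5-fold cross-validation; the 10-fold variant only changes cv:

import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

tree_data = pd.read_csv('2015_tree_data_updated.csv').dropna(subset=['Health'])
X = tree_data[['Latitude', 'Longitude']].values
y = tree_data['Health'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=0)
for k in range(2, 11):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5)
    print(k, scores.mean())  # pick the K with the best mean validation accuracy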

3.3 Model performance

For predicting species and health based on GPS coordinates, KNN was selected as the best model. SVM simply took a significant amount of time to run, making it difficult to fine-tune and handle. Decision trees overfitted the training data and were not good at handling sparse data.

For predicting species, the KNN classifier predicted $51.7$% accurately for $K=4$ on the test data, whereas the average performance (over the 5 folds) on the validation set was $49.9$% (compared to $49$% for decision trees). This is considered rather good, taking into account that it is labelling $20$ different species, but it also suggests that the same species is not always planted next to each other - in fact, less often than we expected before investigating the data. In comparison, when only predicting the top 5 species instead of all 20, the average performance on the validation set was $70.4$% for $K=4$. This confirms that the same species are not always planted next to each other and shows, as expected, that the model performs better when addressing fewer species.

For predicting health, the KNN classifier predicted $80.7$% accurately for $K=5$ on the test data, whereas the average performance (over the 5 folds) on the validation set was $80.4$% (compared to $74.8$% for decision trees). An accuracy of $80.7$% is rather good considering the sparsity of the "fair" and "poor" tree observations, and KNN did handle these sparse classes better than the decision tree classifier. When it labeled a tree as "fair" it was correct around $43$% of the time (on the validation set for $K=5$); for the "poor" class it did a bit worse, predicting only around $1/3$ correctly. As we expected, it was much better at predicting the good trees, which makes sense since a lot more training data was available for them. In comparison, the decision tree classifier was correct around $31$% of the time for the "fair" trees and $17$% for the "poor" trees. The number of misclassifications on the "poor" and "fair" trees suggests that the condition of a tree does not really reflect on its neighbours and is most likely caused by other, individual factors.
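
The per-class numbers above come from inspecting the predictions class by class; a sketch of that inspection with scikit-learn's metrics:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report

tree_data = pd.read_csv('2015_tree_data_updated.csv').dropna(subset=['Health'])
X = tree_data[['Latitude', 'Longitude']].values
y = tree_data['Health'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=0)
y_pred = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train).predict(X_test)

print(confusion_matrix(y_test, y_pred))       # rows = true class, columns = predicted class
print(classification_report(y_test, y_pred))  # per-class precision/recall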

The decision trees performed almost like the random forest in our case. When using species as an input, we obtained a maximum accuracy of 0.5 with the top 5 species. With the top 10 and top 20 the accuracy was even lower, which made us conclude it was not a good model. Moreover, the graphviz visualization showed that it was predicting just two species, honeylocust and pin oak, so it was definitely not a good tool for our case.

The Apriori algorithm found some association among the stone-related problems. The score was not that good, but it improved when taking into account only the trees with at least one problem, or only the trees with exactly one problem. This is understandable: some of the trees might be young, which can also be seen from their diameter, so it is plausible that they do not yet have many problems.

Elastic Net and Lasso are generally good tools, but we lacked variables and data points (at most 188), which may explain the non-optimal performance.

4. Visualizations

4.1 Top 20 Trees

The initial histogram shows all the trees in New York City. It can be clearly seen that some species are present in very low numbers. That is why we focused on a subset of species, mostly the top 20 (for all the analyses), and sometimes the top 10 or top 5. The pie chart then shows the distribution of the top 20 species, with the possibility of hovering over a slice for a clearer explanation of it.

4.2 Tree distribution of top 5 species

The d3 bar chart shows the distribution of the 5 most common species for each borough, as well as for NYC as a whole, illustrating the borough-wise differences regarding the most common trees. E.g. the London planetree is the most common tree in NYC, and it is also the most common tree in Queens. Queens is the largest area in NYC both regarding trees and size, as shown in the pie charts, so one might expect most of NYC's London planetrees to be located in Queens - but as can be seen from the chart, most of them actually come from Brooklyn. The chart also has the option of changing the year from 2015 to either 2005 or 1995, and thereby seeing the differences over time as well.

This chart is important for the project since it enables the user to view the distribution of the most common trees in NYC on borough level for the three different years available: 1995, 2005, 2015. This helps us show the changes in street trees in NYC over time.

We have also included a plot made in Python (using GeoPlotLib) in order to give the user an overview of how the distribution/amount of trees has changed over the years.

4.3 Fun facts about the top 10 species

This page enables the user to hover over pictures of leaves for the 10 most common species to see the species name, and then when clicking a leaf image, fun facts about these trees appear.

This function is important for the project since it sets each of the trees in a context: What is it called? What is its ranking? How many trees are there of this species? What is special/interesting about the species? Why is it a good street tree compared to others?

4.4 Health distribution of the trees

A basic histogram and a pie chart help the user understand the health distribution of the trees. It can be clearly seen that the healthy trees far outnumber the poor ones, while the histogram gives more detail for the various species, possibly revealing some interesting ones.

4.5 Health map of the NYC street trees

The d3 map enables the user to view a prediction of the health of the NYC street trees, using KNN as the classifier. It is possible to hover over the individual boroughs for details, and to switch between visualizing the good, fair, and poor trees. When hovering over a borough, a tooltip displays the borough name, the borough size, the number of trees in that borough broken down by species, and finally the percentages of 'good' and 'poor' trees in that borough. The boroughs are colored according to the number of trees present in them.

The map is an important visualization for the project since it shows the location of the trees with regard to their health. Because of the large number of data points, visualizing all three health states at the same time would have created too confusing a picture, which is why it has been split into 'good' and 'poor', with the 'fair' trees available at the click of a button.

4.6 Scatterplots

The site includes three scatterplots. The first visualizes the most common problems trees can have according to the dataset, split into three types: 'Trunk', 'Root' and 'Branch'. These problems are caused by humans - trees growing into phone lines, wires around the trunk, stones on the roots, etc. Further definitions can be found in the dataset manual in the link provided.

The second and third scatterplots show the correlation among the number of trees, the diameter and the air pollution in each of the neighbourhoods. We took the community district measurements and assigned the mean to each neighbourhood. The user can explore each neighbourhood by hovering over the map (which actually shows community districts) and the corresponding points in the scatterplot (and vice versa). The scatterplot colors are per borough, so the user can explore the air pollution in each area at various levels of detail.

References for inspiration for the d3 visualizations (scatterplots):

5. Discussion

When working with this project, two things became clear to us: 1) real-life data is messy, and 2) data does not care what you think of it.

It is possible to have good intentions and a lot of good ideas for analyzing a dataset, but the data itself simply has limitations that cannot always be overcome.

5.1 What went well? What went wrong?

During the project, a lot of things did not go as expected. First of all, the pattern we expected to find in the data simply was not there. The intention was to find correlation between the problems of the trees and their health, and possibly also with the diameter. A lot of basic Pearson correlations were computed on the data, but it turned out that there were no significant correlations. We then inspected a number of other datasets to see whether they correlated with some of the tree features. The only interesting finding regarded the air pollution: a correlation between the number of trees plus their diameter and the air quality. Moreover, we discovered that the diameter of a tree is influenced by its problems. We managed to create a lot of plots and visualizations of the data showing the fundamental counts, and we applied different machine learning methods, though they only confirmed what the preliminary analysis had shown. In general, there were no really large areas with problematic trees, and the problems could not really be related to the health. The health classes were sparse, which influenced the predictions.

In general, we feel that we have acquired familiarity with Python for data analysis. On the other hand, it took much more time to get the visualizations working, especially when the d3 visualizations were put online. Sadly, we spent a lot of time fixing minor compatibility problems with the website/d3, taking time away from our analysis/visualization work.

5.2 Possible improvements

If more data had been available to join with the street tree data, we might have been able to find other interesting patterns/correlations. The air pollution was one, and it would have been nice to focus more on this, also exploring other variables such as the amount of traffic in New York City.

We could also have focused on one prediction goal instead of trying to find a lot of different patterns that turned out not to be there. E.g. we could have focused on predicting species with other tools, since KNN obviously was not the best choice. A suggestion would be to try some binary classifications, e.g. locating London planetrees using SVMs (so just one species instead of several).

5.3 What is still missing?

In the end, what we really missed were patterns in the data that would have allowed more advanced predictions. But patterns cannot be forced, so with that in mind, we could instead have produced more visualizations of the predictions we did make. We could also have provided more interactive features regarding health and problems across the different years, analyzing the changes of the trees in more depth.

We could have assigned the air pollution to each neighbourhood available in the dataset instead of using the community district mean; this was too much work because of the differing neighbourhood names. Moreover, we could have explored the 311 dataset further to see whether areas with more complaints performed worse with regard to air pollution.

We also started analyzing house prices, but the neighbourhood names were too different, so time prohibited us from continuing down this path.

The visualization of the map on the site is a bit slow, and we could not figure out how to optimize it. Moreover, the scatterplots were not working initially, so we spent a lot of time figuring out how to improve the site in this regard, and one axis is even still missing in the second scatterplot (even though it worked locally before).

Appendix: Code

Loading data


In [1]:
#Import data the whole dataset 
import pandas as pd
import csv

tree_data = pd.read_csv('2015_tree_data_updated.csv')
tree_data


Out[1]:
Diameter Status Health Spc_Latin Spc_Common Sidewalk_Condition problems root_stone root_grate root_other ... brch_shoe brch_other Address Zipcode CB Borough nta Neighbourhoods Latitude Longitude
0 3 Alive Fair Acer rubrum red maple NoDamage None No No No ... No No 108-005 70 AVENUE 11375 406 Queens QN17 Forest Hills 40.723092 -73.844215
1 21 Alive Fair Quercus palustris pin oak Damage Stones Yes No No ... No No 147-074 7 AVENUE 11357 407 Queens QN49 Whitestone 40.794111 -73.818679
2 3 Alive Good Gleditsia triacanthos var. inermis honeylocust Damage None No No No ... No No 390 MORGAN AVENUE 11211 301 Brooklyn BK90 East Williamsburg 40.717581 -73.936608
... (preview truncated; the remaining rows add no further information) ...

534514 rows × 24 columns


In [4]:
#Convert health categories to numbers. 1: Poor, 2: Fair, 3: Good. Higher = Better
health = []
for i in range(len(tree_data['Health'])):
    if tree_data['Health'][i] == 'Good':
        health.append(3)
    elif tree_data['Health'][i] == 'Fair':
        health.append(2)
    elif tree_data['Health'][i] == 'Poor':
        health.append(1)
    else:
        health.append(0)
        #print "err", tree_data['Health'][i], i

Data analysis


In [5]:
#Finding the total number of trees and how many there are of the different species of trees
tree_amount = tree_data['Spc_Common'].value_counts()

print(tree_amount)


London planetree     87014
honeylocust          64264
Callery pear         58931
pin oak              53185
Norway maple         34189
littleleaf linden    29742
cherry               29279
Japanese zelkova     29258
ginkgo               21024
Sophora              19338
red maple            17246
green ash            16251
American linden      13530
silver maple         12277
sweetgum             10657
northern red oak      8400
silver linden         7995
American elm          7975
maple                 7080
purple-leaf plum      6879
Name: Spc_Common, dtype: int64

In [7]:
#Plotting the results to get an overview
#plt.style.use('ggplot')
%matplotlib inline

def barplot(series, title, figsize, ylabel, flag, rotation):
    ax = series.plot(kind='bar', 
                title = title,
                figsize = figsize,
                fontsize = 13)
    
    # set ylabel
    ax.set_ylabel(ylabel)
    # set xlabel (depending on the flag that comes as a function parameter)
    ax.get_xaxis().set_visible(flag)
    # set series index as xlabels and rotate them
    ax.set_xticklabels(series.index, rotation= rotation)
    
barplot(tree_amount,'Tree types', figsize=(20,8), ylabel = 'tree count',flag = True, rotation = 90)



In [9]:
#Putting the percentages on a pie chart
ax = tree_amount.plot(kind='pie', title='Top 20 tree species in NYC', autopct='%1.0f%%', pctdistance=0.9)
ax.set_ylabel('')
ax.set_aspect('equal')



In [10]:
#Count no. of trees in each borough:
boros = tree_data['Borough'].unique()
print tree_data['Borough'].value_counts()


Queens           195985
Brooklyn         138760
Staten Island     82619
Bronx             63035
Manhattan         54115
Name: Borough, dtype: int64

In [11]:
#Finding how many trees there are of the different health types
tree_health = tree_data['Health'].value_counts()
print(tree_health)


Good    435306
Fair     78460
Poor     20746
Name: Health, dtype: int64

In [12]:
#Comparing the count of each tree species in the whole city with a borough. This was used in the initial analysis to try and 
#determine if focus should be put on a single borough, and which borough this should be.
queens_tree_types = tree_data.loc[tree_data['Borough'] == 'Queens', 'Spc_Common'].value_counts()
brooklyn_tree_types = tree_data.loc[tree_data['Borough'] == 'Brooklyn', 'Spc_Common'].value_counts()
staten_tree_types = tree_data.loc[tree_data['Borough'] == 'Staten Island', 'Spc_Common'].value_counts()
bronx_tree_types = tree_data.loc[tree_data['Borough'] == 'Bronx', 'Spc_Common'].value_counts()
manhattan_tree_types = tree_data.loc[tree_data['Borough'] == 'Manhattan', 'Spc_Common'].value_counts()

df = pd.concat([tree_amount, queens_tree_types], axis=1)
print(df)
df.columns = ['NYC', 'Queens']

df = df.sort_values('NYC', ascending=False) # sort the df using NYC values

df.plot.bar(color=['red','blue'])


                   Spc_Common  Spc_Common
American elm             7975        1709
American linden         13530        4769
Callery pear            58931       16547
Japanese zelkova        29258        8987
London planetree        87014       31111
Norway maple            34189       19407
Sophora                 19338        5386
cherry                  29279       13497
ginkgo                  21024        5971
green ash               16251        7389
honeylocust             64264       20290
littleleaf linden       29742       11902
maple                    7080        2992
northern red oak         8400        2697
pin oak                 53185       22610
purple-leaf plum         6879        3035
red maple               17246        4935
silver linden            7995        4146
silver maple            12277        6116
sweetgum                10657        2489
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x117664a50>

In [17]:
#Comparing the number of trees in each of the five boroughs
import matplotlib.pyplot as plt

fig, axes = plt.subplots(nrows=5, figsize=(8,30))

plt.subplots_adjust(wspace=1, hspace=0.5)

queens_tree_types.plot(ax=axes[0], kind='bar'); axes[0].set_title('Queens');
brooklyn_tree_types.plot(ax=axes[1], kind='bar'); axes[1].set_title('Brooklyn');
manhattan_tree_types.plot(ax=axes[2], kind='bar'); axes[2].set_title('Manhattan');
staten_tree_types.plot(ax=axes[3], kind='bar'); axes[3].set_title('Staten Island');
bronx_tree_types.plot(ax=axes[4], kind='bar'); axes[4].set_title('Bronx');

In [19]:
with open('2015_tree_data_updated.csv', 'r') as infile:
    # read the file as a dictionary for each row ({header : value})
    reader = csv.DictReader(infile)
    data = {} # empty dict mapping each header to its list of column values
    for row in reader:
        for header, value in row.items():
            try:
                data[header].append(value)
            except KeyError:
                data[header] = [value]
Diameter = data['Diameter']
Health = data['Health']
Spc_Latin = data['Spc_Latin']
Spc_Common = data['Spc_Common']
Sidewalk_Condition = data['Sidewalk_Condition']
problems = data['problems']
root_stone = data['root_stone']
root_grate = data['root_grate']
root_other = data['root_other']
trunk_wire = data['trunk_wire']
trnk_light = data['trnk_light']
trnk_other = data['trnk_other']
brch_light = data['brch_light']
brch_shoe = data['brch_shoe']
brch_other = data['brch_other']
Address = data['Address']
Zipcode = data['Zipcode']
CB = data['CB']
Borough = data['Borough']
Latitude = data['Latitude']
Longitude = data['Longitude']
#Heatmap of tree distribution
X_new=[]
Y_new=[]

for i in range(len(CB)):
    X_new.append(Longitude[i])
    Y_new.append(Latitude[i])
    
with open('Coordinates_trees.csv', 'wb') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(('lon', 'lat'))
    for xd, yd in zip(X_new, Y_new):
        writer.writerow( (xd, yd ) )
    csvfile.close()

import geoplotlib
from geoplotlib.utils import read_csv, BoundingBox, DataAccessObject

# X_new holds longitudes and Y_new latitudes; convert the CSV strings to
# floats before taking min/max
min_lon = min(float(x) for x in X_new)
max_lon = max(float(x) for x in X_new)
min_lat = min(float(y) for y in Y_new)
max_lat = max(float(y) for y in Y_new)

bbox = BoundingBox(north=max_lat, west=min_lon, south=min_lat, east=max_lon)
print "Trees:", bbox

data_trees = read_csv('Coordinates_trees.csv')
geoplotlib.kde(data_trees, bw=0.5, cmap = 'jet', cut_below=1e-4)
geoplotlib.set_bbox(bbox)
geoplotlib.inline()


Trees: BoundingBox(north=40.912614, west=-74.254965, south=40.498466, east=-73.700488)
('smallest non-zero count', 1.392495637759815e-07)
('max count:', 24.80812040755379)

In the repository, there's also data from 2005 and 1995 but it is not included in this explainer notebook as it does not provide any value for explaining our analysis.

Data for histogram


In [20]:
#2015 data 

import numpy as np

#NYC top 20 species
unique, counts = np.unique(zip(tree_data['Spc_Common']), return_counts=True)
sorted(zip(counts, unique), reverse = True)

#Extract data for each borough
tree_data_bronx = tree_data.loc[tree_data['Borough'] == 'Bronx']
tree_data_brook = tree_data.loc[tree_data['Borough'] == 'Brooklyn']
tree_data_stat = tree_data.loc[tree_data['Borough'] == 'Staten Island']
tree_data_manh = tree_data.loc[tree_data['Borough'] == 'Manhattan']
tree_data_queens = tree_data.loc[tree_data['Borough'] == 'Queens']

#Bronx
unique, counts = np.unique(zip(tree_data_bronx['Spc_Common']), return_counts=True)
sorted(zip(counts, unique), reverse = True)

#Brooklyn
unique, counts = np.unique(zip(tree_data_brook['Spc_Common']), return_counts=True)
sorted(zip(counts, unique), reverse = True)

#Staten Island
unique, counts = np.unique(zip(tree_data_stat['Spc_Common']), return_counts=True)
sorted(zip(counts, unique), reverse = True)

#Manhattan
unique, counts = np.unique(zip(tree_data_manh['Spc_Common']), return_counts=True)
sorted(zip(counts, unique), reverse = True)

#Queens
unique, counts = np.unique(zip(tree_data_queens['Spc_Common']), return_counts=True)
print "Queens data:"
print sorted(zip(counts, unique), reverse = True)


Queens data:
[(31111, 'London planetree'), (22610, 'pin oak'), (20290, 'honeylocust'), (19407, 'Norway maple'), (16547, 'Callery pear'), (13497, 'cherry'), (11902, 'littleleaf linden'), (8987, 'Japanese zelkova'), (7389, 'green ash'), (6116, 'silver maple'), (5971, 'ginkgo'), (5386, 'Sophora'), (4935, 'red maple'), (4769, 'American linden'), (4146, 'silver linden'), (3035, 'purple-leaf plum'), (2992, 'maple'), (2697, 'northern red oak'), (2489, 'sweetgum'), (1709, 'American elm')]

In [23]:
#2005 data
import pandas as pd
import numpy as np

tree_data = pd.read_csv('2005_tree_data_updated.csv')

#NYC top 20 species
unique, counts = np.unique(zip(tree_data['Spc_Common']), return_counts=True)
sorted(zip(counts, unique), reverse = True)

#Extract data for each borough
tree_data_bronx = tree_data.loc[tree_data['Borough'] == 'Bronx']
tree_data_brook = tree_data.loc[tree_data['Borough'] == 'Brooklyn']
tree_data_stat = tree_data.loc[tree_data['Borough'] == 'Staten Island']
tree_data_manh = tree_data.loc[tree_data['Borough'] == 'Manhattan']
tree_data_queens = tree_data.loc[tree_data['Borough'] == 'Queens']

#Bronx
unique, counts = np.unique(zip(tree_data_bronx['Spc_Common']), return_counts=True)
sorted(zip(counts, unique), reverse = True)

#Brooklyn
unique, counts = np.unique(zip(tree_data_brook['Spc_Common']), return_counts=True)
sorted(zip(counts, unique), reverse = True)

#Staten Island
unique, counts = np.unique(zip(tree_data_stat['Spc_Common']), return_counts=True)
sorted(zip(counts, unique), reverse = True)

#Manhattan
unique, counts = np.unique(zip(tree_data_manh['Spc_Common']), return_counts=True)
sorted(zip(counts, unique), reverse = True)

#Queens
unique, counts = np.unique(zip(tree_data_queens['Spc_Common']), return_counts=True)
print "Queens data:"
print sorted(zip(counts, unique), reverse = True)



In [ ]:
#1995 data
import pandas as pd
import numpy as np

tree_data = pd.read_csv('1995_tree_data_updated.csv')

#NYC top 20 species
unique, counts = np.unique(zip(tree_data['Spc_Common']), return_counts=True)
sorted(zip(counts, unique), reverse = True)

#Extract data for each borough
tree_data_bronx = tree_data.loc[tree_data['Borough'] == 'Bronx']
tree_data_brook = tree_data.loc[tree_data['Borough'] == 'Brooklyn']
tree_data_stat = tree_data.loc[tree_data['Borough'] == 'Staten Island']
tree_data_manh = tree_data.loc[tree_data['Borough'] == 'Manhattan']
tree_data_queens = tree_data.loc[tree_data['Borough'] == 'Queens']

#Bronx
unique, counts = np.unique(zip(tree_data_bronx['Spc_Common']), return_counts=True)
sorted(zip(counts, unique), reverse = True)

#Brooklyn
unique, counts = np.unique(zip(tree_data_brook['Spc_Common']), return_counts=True)
sorted(zip(counts, unique), reverse = True)

#Staten Island
unique, counts = np.unique(zip(tree_data_stat['Spc_Common']), return_counts=True)
sorted(zip(counts, unique), reverse = True)

#Manhattan
unique, counts = np.unique(zip(tree_data_manh['Spc_Common']), return_counts=True)
sorted(zip(counts, unique), reverse = True)

#Queens
unique, counts = np.unique(zip(tree_data_queens['Spc_Common']), return_counts=True)
print "Queens data:"
print sorted(zip(counts, unique), reverse = True)

Problems exploration


In [25]:
bor_list = list(set(list(Borough)))

In [26]:
bronx_prob_list = [0, 0, 0, 0, 0]
brooklyn_prob_list = [0, 0, 0, 0, 0]
staten_prob_list = [0, 0, 0, 0, 0]
man_prob_list = [0, 0, 0, 0, 0]
queens_prob_list = [0, 0, 0, 0, 0]

dic_all_boro = {}
for b in bor_list:
    dic_all_boro[b] = [0, 0, 0, 0, 0]

temp_root = 0
temp_trunk = 0
temp_branch = 0
temp_tot = 0
sidewalk = 0
for i in range (0, len(CB)):
    if root_stone[i] == 'Yes':
        temp_root += 1
    if root_grate[i] == 'Yes':
        temp_root += 1
    if root_other[i] == 'Yes':
        temp_root += 1
    if trunk_wire[i] == 'Yes':
        temp_trunk += 1
    if trnk_light[i] == 'Yes':
        temp_trunk += 1
    if trnk_other[i] == 'Yes':
        temp_trunk += 1
    if brch_light[i] == 'Yes':
        temp_branch += 1
    if brch_shoe[i] == 'Yes':
        temp_branch += 1
    if brch_other[i] == 'Yes':
        temp_branch += 1
    if Sidewalk_Condition[i] == 'Damage':
        sidewalk += 1
    temp_tot = temp_root + temp_trunk + temp_branch + sidewalk
    temp_list = [temp_root, temp_trunk, temp_branch, sidewalk, temp_tot]
    
    #choose which list to update
    c = 0
    for t in temp_list:
        dic_all_boro[Borough[i]][c] += t
        c += 1
    temp_root = 0
    temp_trunk = 0
    temp_branch = 0
    temp_tot = 0
    sidewalk = 0

In [27]:
with open('problem_count.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerow(('Borough','Root_Prob', 'Trunk_Prob', 'Branch_Prob', 'Sidewalk', 'Tot_Prob'))
    for d in dic_all_boro.keys():
        writer.writerow((d, dic_all_boro[d][0], dic_all_boro[d][1], dic_all_boro[d][2], dic_all_boro[d][3], dic_all_boro[d][4] ))


In [29]:
import matplotlib.pyplot as plt
import numpy as np

# Fixed borough order for the value lists below; bor_list came from set() and
# therefore has arbitrary order, so it cannot be used to label these points.
boro_names = ['Bronx', 'Brooklyn', 'Manhattan', 'Queens', 'Staten Island']

root = [dic_all_boro[b][0] for b in boro_names]
trunk = [dic_all_boro[b][1] for b in boro_names]
branch = [dic_all_boro[b][2] for b in boro_names]
sidewalk = [dic_all_boro[b][3] for b in boro_names]  # index 3 holds the sidewalk-damage count

#f, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, sharex=True, sharey=True, figsize=(15,15))

fig_index = 0
#fig = plt.figure(fig_index)
fig, ax = plt.subplots()

ax.set_xlabel('Root')
ax.set_ylabel('Trunk')
ax.scatter(np.asarray(root), np.asarray(trunk))
for label, x, y in zip(boro_names, root, trunk):
    ax.annotate(
        label,
        xy=(x, y), xytext=(-20, 20),
        textcoords='offset points', ha='right', va='bottom',
        bbox=dict(boxstyle='round,pad=0.5', fc='yellow', alpha=0.5),
        arrowprops=dict(arrowstyle = '->', connectionstyle='arc3,rad=0'))
fig.savefig("problem1")
plt.show()

f, ax = plt.subplots()
ax.set_xlabel('Root')
ax.set_ylabel('Branch')
ax.scatter(np.asarray(root), np.asarray(branch))
for label, x, y in zip(boro_names, root, branch):
    ax.annotate(
        label,
        xy=(x, y), xytext=(-20, 20),
        textcoords='offset points', ha='right', va='bottom',
        bbox=dict(boxstyle='round,pad=0.5', fc='yellow', alpha=0.5),
        arrowprops=dict(arrowstyle = '->', connectionstyle='arc3,rad=0'))
f.savefig("problem2")
plt.show()

f, ax = plt.subplots()
ax.set_xlabel('Trunk')
ax.set_ylabel('Branch')
ax.scatter(np.asarray(trunk), np.asarray(branch))
for label, x, y in zip(boro_names, trunk, branch):
    ax.annotate(
        label,
        xy=(x, y), xytext=(-20, 20),
        textcoords='offset points', ha='right', va='bottom',
        bbox=dict(boxstyle='round,pad=0.5', fc='yellow', alpha=0.5),
        arrowprops=dict(arrowstyle = '->', connectionstyle='arc3,rad=0'))
f.savefig("problem3")
plt.show()


We used the Apriori algorithm to explore whether some problems tend to appear together.

Exploration of problems:


In [30]:
index = 0
root_stone_lon = []
root_stone_lat = []
root_grate_lon = []
root_grate_lat = []
trunk_wire_lon = []
trunk_wire_lat = []
trunk_light_lon = []
trunk_light_lat = []
branch_light_lon = []
branch_light_lat = []
branch_shoe_lon = []
branch_shoe_lat = []
count_br = 0
count2 = 0

for i in range(0, len(Latitude)):
    if root_stone[i] == 'Yes':
        root_stone_lat.append(float(Latitude[i]))
        root_stone_lon.append(float(Longitude[i]))
    if root_grate[i] == 'Yes':
        root_grate_lat.append(Latitude[i])
        root_grate_lon.append(Longitude[i])
    if trunk_wire[i] == 'Yes':
        trunk_wire_lat.append(Latitude[i])
        trunk_wire_lon.append(Longitude[i])
    if trnk_light[i] == 'Yes':
        trunk_light_lat.append(Latitude[i])
        trunk_light_lon.append(Longitude[i])    
    if brch_light[i] == 'Yes':
        branch_light_lat.append(Latitude[i])
        branch_light_lon.append(Longitude[i])    
    if brch_shoe[i] == 'Yes':
        branch_shoe_lat.append(Latitude[i])
        branch_shoe_lon.append(Longitude[i])
    if Borough[i] == 'Brooklyn' and brch_light[i] == 'Yes' and trunk_wire[i] == 'Yes':
        count_br += 1
    if Borough[i] == 'Brooklyn':
        count2 += 1
print 'Count: ', count_br, count2
print (set(Borough))

root_stone_zip = zip(root_stone_lon, root_stone_lat)
root_grate_zip = zip(root_grate_lon, root_grate_lat)
trunk_wire_zip = zip(trunk_wire_lon, trunk_wire_lat)
trunk_light_zip = zip(trunk_light_lon, trunk_light_lat)
branch_light_zip = zip(branch_light_lon,  branch_light_lat)
branch_shoe_zip = zip(branch_shoe_lon, branch_shoe_lat)


Count:  1913 138760
set(['Bronx', 'Brooklyn', 'Staten Island', 'Manhattan', 'Queens'])

In [31]:
sidewalk_cond_lon = []
sidewalk_cond_lat = []
for i in range(0, len(Latitude)):
    if Sidewalk_Condition[i] == 'Damage':
        sidewalk_cond_lat.append(float(Latitude[i]))
        sidewalk_cond_lon.append(float(Longitude[i]))
with open('sidewalk_dam.csv', 'wb') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(('lon', 'lat'))
    for xd, yd in zip(sidewalk_cond_lon, sidewalk_cond_lat):
        writer.writerow( (xd, yd ) )
    csvfile.close()

In [32]:
with open('root_stone.csv', 'wb') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(('lon', 'lat'))
    for xd, yd in zip(root_stone_lon, root_stone_lat):
        writer.writerow( (xd, yd ) )
    csvfile.close()
with open('root_grate.csv', 'wb') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(('lon', 'lat'))
    for xd, yd in zip(root_grate_lon, root_grate_lat):
        writer.writerow( (xd, yd ) )
    csvfile.close()
with open('trk_wire.csv', 'wb') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(('lon', 'lat'))
    for xd, yd in  zip(trunk_wire_lon, trunk_wire_lat):
        writer.writerow( (xd, yd ) )
    csvfile.close()
with open('trk_light.csv', 'wb') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(('lon', 'lat'))
    for xd, yd in  zip(trunk_light_lon, trunk_light_lat):
        writer.writerow( (xd, yd ) )
    csvfile.close()
with open('brc_light.csv', 'wb') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(('lon', 'lat'))
    for xd, yd in  zip(branch_light_lon, branch_light_lat):
        writer.writerow( (xd, yd ) )
    csvfile.close()
with open('brc_shoe.csv', 'wb') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(('lon', 'lat'))
    for xd, yd in  zip(branch_shoe_lon, branch_shoe_lat):
        writer.writerow( (xd, yd ) )
    csvfile.close()

In [33]:
import geoplotlib
from geoplotlib.utils import read_csv, BoundingBox, DataAccessObject

min_lat = min(root_stone_lat)
max_lat = max(root_stone_lat)
min_lon = min(root_stone_lon)
max_lon = max(root_stone_lon)

bbox = BoundingBox(north=float(max_lat), west=float(max_lon), south=float(min_lat), east=float(min_lon))
print "Trees:", bbox


geoplotlib.set_bbox(bbox)

data = read_csv('root_stone.csv')
print 'Root stone: '
geoplotlib.dot(data, 'r', point_size = 0.4)
geoplotlib.inline()

data = read_csv('root_grate.csv')
print 'Root grate: '
geoplotlib.dot(data, 'r', point_size = 0.4)
geoplotlib.inline()

data = read_csv('trk_wire.csv')
print 'Trunk wire: '
geoplotlib.dot(data, 'r', point_size = 0.4)
geoplotlib.inline()

data = read_csv('trk_light.csv')
print 'Trunk light: '
geoplotlib.dot(data, 'r', point_size = 0.4)
geoplotlib.inline()

data = read_csv('brc_light.csv')
print 'Branch light: '
geoplotlib.dot(data, 'r', point_size = 0.4)
geoplotlib.inline()

data = read_csv('brc_shoe.csv')
print 'Branch shoe: '
geoplotlib.dot(data, 'r', point_size = 0.4)
geoplotlib.inline()

data1 = read_csv('brc_light.csv')
data2 = read_csv('trk_wire.csv')
print 'Branch light and Trunk Wire: '
geoplotlib.dot(data1, 'r', point_size = 0.4)
geoplotlib.dot(data2, 'g', point_size = 0.4)
geoplotlib.inline()


Trees: BoundingBox(north=40.911798, west=-73.701253, south=40.498466, east=-74.254432)
Root stone: 
Root grate: 
Trunk wire: 
Trunk light: 
Branch light: 
Branch shoe: 
Branch light and Trunk Wire: 

In [34]:
data = read_csv('sidewalk_dam.csv')
print 'Sidewalk damaged: '
geoplotlib.dot(data, 'r', point_size = 0.4)
geoplotlib.inline()


Sidewalk damaged: 

In [35]:
!pip install apyori-1.1.1.tar.gz
## Trying association mining
from apyori import apriori

transactions = [
    ['cheese', 'nuggets'],
    ['burgers', 'balls'],
]
results = list(apriori(transactions))


Processing ./apyori-1.1.1.tar.gz
  Requirement already satisfied (use --upgrade to upgrade): apyori==1.1.1 from file:///Users/daniele/Desktop/NordSecMob/DTU/SocialData/Final%20Project/cecli.github.io/apyori-1.1.1.tar.gz in /anaconda/lib/python2.7/site-packages
Building wheels for collected packages: apyori
  Running setup.py bdist_wheel for apyori ... - \ done
  Stored in directory: /Users/daniele/Library/Caches/pip/wheels/78/7c/59/591d6048c22ef3269e40f050a804ad92cae7ea71bf04fcd19f
Successfully built apyori

In [36]:
## Trying association mining
from apyori import apriori

transactions = [
    ['beer', 'nuts'],
    ['beer', 'cheese'],
    ['nuts', 'cheese'],
]
transactions.append(['nuts', 'cheese'])
results = list(apriori(transactions))
print results[0]
print ''
print results[1]
print ''
print results[4]


RelationRecord(items=frozenset(['beer']), support=0.5, ordered_statistics=[OrderedStatistic(items_base=frozenset([]), items_add=frozenset(['beer']), confidence=0.5, lift=1.0)])

RelationRecord(items=frozenset(['cheese']), support=0.75, ordered_statistics=[OrderedStatistic(items_base=frozenset([]), items_add=frozenset(['cheese']), confidence=0.75, lift=1.0)])

RelationRecord(items=frozenset(['beer', 'nuts']), support=0.25, ordered_statistics=[OrderedStatistic(items_base=frozenset(['beer']), items_add=frozenset(['nuts']), confidence=0.5, lift=0.6666666666666666), OrderedStatistic(items_base=frozenset(['nuts']), items_add=frozenset(['beer']), confidence=0.3333333333333333, lift=0.6666666666666666)])

In [37]:
import numpy as np
#root_stone = data['root_stone']
#root_grate = data['root_grate']
#root_other = data['root_other']
#trunk_wire = data['trunk_wire']
#trnk_light = data['trnk_light']
#trnk_other = data['trnk_other']
#brch_light = data['brch_light']
#brch_shoe = data['brch_shoe']
#brch_other = data['brch_other']
transactions = []
temp = []
np_count = 0
nuno_count = 0
counter = 0
print len(temp)
for i in range(0,len(root_stone)):
    temp = []
    if root_stone[i] == 'Yes':
        temp.append("Root_Stone")
    if root_grate[i] == 'Yes':
        temp.append("Root_Grate")
    #if root_other[i] == 'Yes':
        #temp.append("Root_Other")
    if trunk_wire[i] == 'Yes':
        temp.append("Trunk_Wire")
    if trnk_light[i] == 'Yes':
        temp.append("Trunk_Light")
    #if trnk_other[i] == 'Yes':
        #temp.append("Trunk_Other")
    if brch_light[i] == 'Yes':
        temp.append("Branch_Light")
    if brch_shoe[i] == 'Yes':
        temp.append("Branch_Shoe")
    if Sidewalk_Condition[i] == 'Damage':
        temp.append("Sidewalk")
    #if brch_other[i] == 'Yes':
        #temp.append("Branch_Other")
    if len(temp) > 1:
        transactions.append(temp)
        counter += 1
    elif len(temp) == 0:
        np_count += 1
    else:
        nuno_count += 1
        
results = list(apriori(np.asarray(transactions)))
print 'Associated:', len(transactions)
print len(results)
print 'Empty:', np_count
print 'One Item:', nuno_count
print 'More: ', counter


0
Associated: 98665
7
Empty: 291659
One Item: 144190
More:  98665

In [38]:
import numpy as np
transactions = []
temp = []
np_count = 0
nuno_count = 0
counter = 0
print len(temp)
for i in range(0,len(root_stone)):
    temp = []
    if root_stone[i] == 'Yes':
        temp.append("Root_Stone")
    #if root_grate[i] == 'Yes':
    #    temp.append("Root_Grate")
    if trunk_wire[i] == 'Yes':
        temp.append("Trunk_Wire")
    #if trnk_light[i] == 'Yes':
    #    temp.append("Trunk_Light")
    if brch_light[i] == 'Yes':
        temp.append("Branch_Light")
    #if brch_shoe[i] == 'Yes':
    #    temp.append("Branch_Shoe")
    #if Sidewalk_Condition[i] == 'Damage':
    #    temp.append("Sidewalk")
    if len(temp) > 1:
        transactions.append(temp)
        counter += 1
    elif len(temp) == 0:
        np_count += 1
    else:
        nuno_count += 1
        
results = list(apriori(np.asarray(transactions)))
for i in range (0, len(results)):
    print '- ', i, ':', results[i][0], results[i][1], ', Lift:' ,results[i][-1][-1][-1]


0
-  0 : frozenset(['Branch_Light']) 0.935900331226 , Lift: 1.0
-  1 : frozenset(['Root_Stone']) 0.893819632641 , Lift: 1.0
-  2 : frozenset(['Trunk_Wire']) 0.263964167419 , Lift: 1.0
-  3 : frozenset(['Branch_Light', 'Root_Stone']) 0.829719963866 , Lift: 0.991863820558
-  4 : frozenset(['Trunk_Wire', 'Branch_Light']) 0.199864498645 , Lift: 0.809023396238
-  5 : frozenset(['Trunk_Wire', 'Root_Stone']) 0.15778380006 , Lift: 0.668755775081

The results show that branch-light problems appear in the same transactions as trunk-wire problems, which can also be seen in the plot. This could be because trees placed so that their branches grow into street lights often also have their trunks in contact with the lighting or wiring structures.
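
To interpret the lift values above: lift is the observed co-occurrence divided by the co-occurrence expected if the two item sets were independent. As a worked example using the printed supports for the Trunk_Wire / Branch_Light rule (index 4):

    lift(Trunk_Wire, Branch_Light) = support(Trunk_Wire and Branch_Light) / (support(Trunk_Wire) * support(Branch_Light))
                                   = 0.1999 / (0.2640 * 0.9359)
                                   ≈ 0.809

A lift close to 1 means the two problems co-occur roughly as often as independence would predict, within the multi-problem transactions used here.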

We have also explored the 311 dataset in the context of trees, as it contains some data specific to our domain.

Some of the most interesting 311 requests we found were related to overgrown trees and new-tree requests. The Python analysis is not included in detail in the notebook, but two images are included showing geoplots of the mentioned complaints (note the hotspots).
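
A minimal sketch of how such a geoplot can be produced; the 311 CSV filename and the column names used here ('Complaint Type', 'Latitude', 'Longitude') are assumptions for illustration, not the exact code we ran:

import pandas as pd
import geoplotlib
from geoplotlib.utils import DataAccessObject

c311 = pd.read_csv('311_2015.csv')  # assumed filename for the 2015 extract
mask = c311['Complaint Type'] == 'Overgrown Tree/Branches'  # assumed complaint label
overgrown = c311[mask].dropna(subset=['Latitude', 'Longitude'])
dao = DataAccessObject({'lon': overgrown['Longitude'].values,
                        'lat': overgrown['Latitude'].values})
geoplotlib.dot(dao, 'r', point_size=0.4)
geoplotlib.inline()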


In [39]:
from IPython.display import Image
Image("new_requests.png")


Out[39]:

In [40]:
Image("overgrown_trees.png")


Out[40]:

We have found, sadly, that the amount of problems is not related to tree health: the Pearson correlation between the two parameters was very low.

We explored the diameter and the number of trees in each area of the city and discovered that these two factors influence the air quality. We also found that the amount of problems seems to influence the diameter.
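
A minimal sketch of this kind of correlation check, assuming the per-tree problem counts in total_prob_list (used elsewhere in this notebook) and an ordinal encoding of the health categories; the mapping below is an illustrative assumption:

from scipy.stats import pearsonr

# Encode the health categories as ordinal values (illustrative mapping)
health_map = {'Poor': 0, 'Fair': 1, 'Good': 2}
health_num = [health_map[h] for h in tree_data['Health']]
r, p = pearsonr(total_prob_list, health_num)
print 'Pearson r:', r, '(p-value:', p, ')'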


In [41]:
Image("problem_amount.png")


Out[41]:

The above image shows the results of the regression, which can also be seen here:

  • 0.28485178

Regression

These are the results of the regression between air pollution and O2, predicting the amount of air pollution (and O2) given the number of trees and their diameter as parameters. The results were almost the same for the other types of particles found. The whole regression notebook, with the necessary data files, is included in the repo (https://github.com/cecli/cecli.github.io/blob/master/regression_notebook.ipynb).

  • r^2 elastic net on test data : 0.509594
  • Mean squared error: 5.42
  • Mean squared error: 5.36
  • r^2 lasso on test data : 0.515746
  • Mean squared error: 5.52
  • Variance score (ols): 0.50

The images showing the correlation between the amount of trees and the pollution are in the other notebook, and this is also the data on which the regression visualization on the website is based.
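
A minimal sketch of the modelling approach used in that notebook; district_stats is an assumed precomputed list of (tree count, mean diameter, pollution level) per district, and the model choices mirror the metrics listed above:

import numpy as np
from sklearn import linear_model, model_selection, metrics

# Illustrative feature matrix: one row per district -> [tree count, mean diameter]
X = np.array([[n, d] for n, d, p in district_stats])
y = np.array([p for n, d, p in district_stats])

X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.25, random_state=42)

for name, model in [('elastic net', linear_model.ElasticNet()),
                    ('lasso', linear_model.Lasso()),
                    ('ols', linear_model.LinearRegression())]:
    pred = model.fit(X_train, y_train).predict(X_test)
    print name, '- r^2:', model.score(X_test, y_test), \
        ', MSE:', metrics.mean_squared_error(y_test, pred)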



Predicting species based on location


In [42]:
#KNN classifier

#Load relevant libraries
import numpy as np
import pylab as pl
from sklearn import neighbors, datasets, model_selection

#Split data set into a training and a test set
X_train, X_test, y_train, y_test = model_selection.train_test_split(zip(tree_data['Latitude'], tree_data['Longitude'] )
                                                    , tree_data['Spc_Common']
                                                    , test_size=0.15
                                                    , random_state=42)


accuracy = []
#Classify KNN with K=2-10
for k in range(2,11):
    knn = neighbors.KNeighborsClassifier(n_neighbors=k, weights = "distance")

    #Fit the data and make predictions
    knn.fit(X_train, y_train).predict(X_test)

    #Calculate accuracy from validation set
    n_folds = 5
    score = np.mean(model_selection.cross_val_score(knn.fit(X_train, y_train),X_train, y_train,cv=n_folds))
    print "KNN score for k =", k, ":", score
    
    #Save accuracy into a list
    accuracy.append(score)
    
#KNN score for k = 2 : 0.523009486208
#KNN score for k = 3 : 0.518710892965
#KNN score for k = 4 : 0.520454115483
#KNN score for k = 5 : 0.519925851647
#KNN score for k = 6 : 0.51948343831
#KNN score for k = 7 : 0.518847340502
#KNN score for k = 8 : 0.518138623743
#KNN score for k = 9 : 0.517289048133
#KNN score for k = 10 : 0.51639320774


KNN score for k = 2 : 0.523009486208
KNN score for k = 3 : 0.518710892965
KNN score for k = 4 : 0.520454115483
KNN score for k = 5 : 0.519925851647
KNN score for k = 6 : 0.51948343831
KNN score for k = 7 : 0.518847340502
KNN score for k = 8 : 0.518138623743
KNN score for k = 9 : 0.517289048133
KNN score for k = 10 : 0.51639320774

In [43]:
#Plot accuracy as a function of the number of K (2-10)
import matplotlib.pyplot as plt

plt.figure(figsize=(20,5))
ks = range(2, 11)
plt.plot(ks, accuracy)
plt.xticks(ks)
plt.xlabel("k")
plt.ylabel("Accuracy")
plt.title("Prediction accuracy as a function of k")
plt.show()



In [44]:
#K=4 was chosen as a trade-off between model simplicity and accuracy

#Test score
knn = neighbors.KNeighborsClassifier(n_neighbors=4, weights = "distance")

#Fit the data and make predictions
knn.fit(X_train, y_train).predict(X_test)

#Calculate accuracy
score = knn.fit(X_train, y_train).score(X_test, y_test)
print score #0.543241288134


0.543241288134

In [45]:
tree_data = pd.read_csv('2015_tree_data_updated.csv')

unique, counts = np.unique(zip(tree_data['Spc_Common']), return_counts=True)
print sorted(zip(counts, unique), reverse = True)
#Try doing KNN for only the top 5 species
top5_spec = ['London planetree','honeylocust', 'Callery pear','pin oak', 'Norway maple']
tree_spec5 = []
tree_lat5 = []
tree_lon5 = []
for i in range(len(tree_data)):
    if tree_data['Spc_Common'][i] in top5_spec:
        tree_spec5.append(tree_data['Spc_Common'][i])
        tree_lat5.append(tree_data['Latitude'][i])
        tree_lon5.append(tree_data['Longitude'][i])
print len(tree_spec5)
#print tree_spec5[:10]
#print tree_lat5[:10]
#print tree_lon5[:10]

#KNN classifier
#Load relevant libraries
import numpy as np
import pylab as pl
from sklearn import neighbors, datasets, model_selection

#Split data set into a training and a test set
X_train, X_test, y_train, y_test = model_selection.train_test_split(zip(tree_lat5, tree_lon5)
                                                    , tree_spec5
                                                    , test_size=0.15
                                                    , random_state=42)

accuracy = []
#Classify KNN with K=2-10
for k in range(2,11):
    knn = neighbors.KNeighborsClassifier(n_neighbors=k, weights = "distance")
    
    #Fit the data and make predictions
    knn.fit(X_train, y_train).predict(X_test)

    #Calculate accuracy
    #score = knn.fit(X_train, y_train).score(X_test, y_test)
    n_folds = 5
    score = np.mean(model_selection.cross_val_score(knn.fit(X_train, y_train),X_train, y_train,cv=n_folds))
    print "KNN score for k =", k, ":", score
    
    #Save accuracy into a list
    accuracy.append(score)


[(87014, 'London planetree'), (64264, 'honeylocust'), (58931, 'Callery pear'), (53185, 'pin oak'), (34189, 'Norway maple'), (29742, 'littleleaf linden'), (29279, 'cherry'), (29258, 'Japanese zelkova'), (21024, 'ginkgo'), (19338, 'Sophora'), (17246, 'red maple'), (16251, 'green ash'), (13530, 'American linden'), (12277, 'silver maple'), (10657, 'sweetgum'), (8400, 'northern red oak'), (7995, 'silver linden'), (7975, 'American elm'), (7080, 'maple'), (6879, 'purple-leaf plum')]
297583
KNN score for k = 2 : 0.704081936424
KNN score for k = 3 : 0.702540110987
KNN score for k = 4 : 0.704034511536
KNN score for k = 5 : 0.702745680523
KNN score for k = 6 : 0.701962891889
KNN score for k = 7 : 0.701346174765
KNN score for k = 8 : 0.700005955431
KNN score for k = 9 : 0.698875265002
KNN score for k = 10 : 0.69768924392
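
For context, always predicting the most common of the five species (London planetree, 87014 of the 297583 trees above) gives a majority-class baseline of 87014 / 297583 ≈ 0.292, so the KNN scores of about 0.70 are well above guessing the dominant class:

print 'Majority baseline:', 87014.0 / len(tree_spec5)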

In [46]:
#Create Decision tree classifier

#Load relevant libraries
import numpy as np
from sklearn import tree
from sklearn import model_selection

#Split data set into a training and a test set
X_train, X_test, y_train, y_test = model_selection.train_test_split(zip(tree_data['Latitude'], tree_data['Longitude'])
                                                    , tree_data['Spc_Common']
                                                    , test_size=0.15
                                                    , random_state=42)

#Classify Decision trees
dt = tree.DecisionTreeClassifier(random_state = 42)

#Fit the data and make predictions
dt.fit(X_train, y_train).predict(X_test)

#Calculate accuracy
#score = dtnn.fit(X_train, y_train).score(X_test, y_test)
n_folds = 5
score = np.mean(model_selection.cross_val_score(dt.fit(X_train, y_train),X_train, y_train,cv=n_folds))
print "Decision tree accuracy:", score


Decision tree accuracy: 0.453208685409

Classify health based on location


In [47]:
#Adjust KNN classifyer

#Load relevant libraries
import numpy as np
import pylab as pl
from sklearn import neighbors, datasets, model_selection

#Split data set into a training and a test set
X_train, X_test, y_train, y_test = model_selection.train_test_split(zip(tree_data['Latitude'], tree_data['Longitude'])
                                                    , tree_data['Health']
                                                    , test_size=0.15
                                                    , random_state=42)

accuracy = []
#Classify KNN with K=2-10
for k in range(2,11):
    knn = neighbors.KNeighborsClassifier(n_neighbors=k, weights="distance")

    #Fit the data and make predictions
    knn_pred = knn.fit(X_train, y_train).predict(X_test)

    #Calculate accuracy
    #score = knn.fit(X_train, y_train).score(X_test, y_test)
    n_folds = 5
    score = np.mean(model_selection.cross_val_score(knn.fit(X_train, y_train),X_train, y_train,cv=n_folds))
    print "KNN score for k =", k, ":", score
    
    #Save accuracy into a list
    accuracy.append(score)
    
#KNN score for k = 2 : 0.768112149227
#KNN score for k = 3 : 0.789842752784
#KNN score for k = 4 : 0.798109764942
#KNN score for k = 5 : 0.804213182933
#KNN score for k = 6 : 0.80804514699
#KNN score for k = 7 : 0.810998909425
#KNN score for k = 8 : 0.813439834781
#KNN score for k = 9 : 0.815178637038
#KNN score for k = 10 : 0.816549869474


//anaconda/lib/python2.7/site-packages/numpy/lib/arraysetops.py:200: FutureWarning: numpy not_equal will not check object identity in the future. The comparison did not return the same result as suggested by the identity (`is`)) and will change.
  flag = np.concatenate(([True], aux[1:] != aux[:-1]))
//anaconda/lib/python2.7/site-packages/sklearn/model_selection/_split.py:581: Warning: The least populated class in y has only 2 members, which is too few. The minimum number of groups for any class cannot be less than n_splits=5.
  % (min_groups, self.n_splits)), Warning)
KNN score for k = 2 : 0.768112149227
KNN score for k = 3 : 0.789842752784
KNN score for k = 4 : 0.798109764942
KNN score for k = 5 : 0.804213182933
KNN score for k = 6 : 0.80804514699
KNN score for k = 7 : 0.810998909425
KNN score for k = 8 : 0.813439834781
KNN score for k = 9 : 0.815178637038
KNN score for k = 10 : 0.816549869474

In [48]:
#Test accuracy

knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights="distance")
#Fit the data and make predictions
knn_pred = knn.fit(X_train, y_train).predict(X_test)

#Calculate accuracy
score = knn.fit(X_train, y_train).score(X_test, y_test)
print score #0.806917109432


0.806917109432

In [ ]:
#Create Decision tree classifier

#Load relevant libraries
import numpy as np
from sklearn import tree
from sklearn import model_selection


#Split data set into a training and a test set
X_train, X_test, y_train, y_test = model_selection.train_test_split(zip(tree_data['Latitude'], tree_data['Longitude'])
                                                    , tree_data['Health']
                                                    , test_size=0.15
                                                    , random_state=42)

#Classify Decision trees
dt = tree.DecisionTreeClassifier(random_state = 42)

#Fit the data and make predictions
dt_pred = dt.fit(X_train, y_train).predict(X_test)

#Calculate accuracy
#score = dtnn.fit(X_train, y_train).score(X_test, y_test)
n_folds = 5
score = np.mean(model_selection.cross_val_score(dt.fit(X_train, y_train),X_train, y_train,cv=n_folds))
print "Decision tree accuracy:", score #0.748111525321


Decision tree accuracy: 0.748111525321

In [ ]:
# Create SVM classifier

#Load relevant libraries
import numpy as np
from sklearn import svm
from sklearn import model_selection

#Split data set into a training and a test set
X_train, X_test, y_train, y_test = model_selection.train_test_split(zip(tree_data['Latitude'], tree_data['Longitude'])
                                                    , tree_data['Health']
                                                    , test_size=0.15
                                                    , random_state=42)

#Classify Decision trees
svm = svm.SVC(random_state = 42)

#Fit the data and make predictions
svm.fit(X_train, y_train).predict(X_test)

#Calculate accuracy
#score = dtnn.fit(X_train, y_train).score(X_test, y_test)
n_folds = 5
score = np.mean(model_selection.cross_val_score(svm.fit(X_train, y_train),X_train, y_train,cv=n_folds))
print "SVM accuracy:", score

Decision tree and Random Forest

We tried decision trees and random forests to predict species based on location and diameter, classifying the top 20, top 10, and top 5 species. We found that the decision tree was not an ideal solution, as the best result was an accuracy of 0.5 for the top 5 species; random forest gave the same result. We have included images of the decision trees, and as can be seen, they essentially predict only two species, honeylocust and pin oak, so this is clearly not a good model. We also tried predicting neighbourhoods based on problems, but that did not work either, as the accuracy was even worse. The images are included in the repository.
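
One way such decision-tree images can be produced is with sklearn's export_graphviz (a minimal sketch, assuming Graphviz is installed; the file name, feature names, and depth limit are illustrative):

from sklearn import tree

# Write the fitted classifier to a .dot file, then render it with Graphviz
tree.export_graphviz(dt, out_file='species_tree.dot',
                     feature_names=['Diameter', 'Longitude', 'Latitude'],
                     max_depth=3)
!dot -Tpng species_tree.dot -o species_tree.png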


In [ ]:
#Try doing KNN for only the top 10 species
top10_spec = ['London planetree', 'honeylocust', 'Callery pear', 'pin oak', 'Norway maple',
              'littleleaf linden', 'cherry', 'Japanese zelkova', 'ginkgo', 'Sophora']
tree_spec10 = []
tree_lat10 = []
tree_lon10 = []
tree_health10 = []
tree_nth10 = []
tree_diam10 = []
tree_cb10 = []
tree_boro10 = []
tree_root10 = []
tree_branch10 = []
tree_trunk10 = []
tree_total10 = []
print len(health), len(tree_data)
for i in range(len(health)):
    if tree_data['Spc_Common'][i] in top10_spec and float(tree_data['Diameter'][i]) >= 10.00:
        tree_spec10.append(tree_data['Spc_Common'][i])
        tree_lat10.append(tree_data['Latitude'][i])
        tree_lon10.append(tree_data['Longitude'][i])
        tree_health10.append(health[i])
        tree_nth10.append(tree_data['Neighbourhoods'][i])
        tree_diam10.append(tree_data['Diameter'][i])
        tree_cb10.append(tree_data['CB'][i])
        tree_boro10.append(tree_data['Borough'][i])
        tree_root10.append(root_list[i])
        tree_branch10.append(branch_list[i])
        tree_trunk10.append(trunk_list[i])
        tree_total10.append(total_prob_list[i])
print len(tree_spec10)
print tree_spec10[:10]
print tree_lat10[:10]
print tree_lon10[:10]

In [ ]:
# Diameter bin edges, checked in descending order: each diameter is
# mapped to the largest edge it is greater than or equal to. Since the
# trees were filtered to Diameter >= 10, every value falls in a bin.
list_bins = [300.0, 250.0, 200.0, 150.0, 100.0, 90.0, 80.0, 70.0,
             60.0, 50.0, 45.0, 40.0, 35.0, 30.0, 25.0, 20.0, 15.0, 10.0]
new_diam = []
for t in tree_diam10:
    ft = float(t)
    for l in list_bins:
        if ft >= l:
            new_diam.append(l)
            break
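
The same binning can be written more compactly with numpy (a sketch, not the code we ran; it assumes all diameters are >= 10, as guaranteed by the filter above):

import numpy as np

edges = np.array(sorted(list_bins))
vals = np.array([float(t) for t in tree_diam10])
# digitize returns the index of the first edge strictly greater than the
# value, so subtracting 1 selects the largest edge <= value
new_diam = edges[np.digitize(vals, edges) - 1].tolist()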

In [ ]:
#Decision tree for classifying tree species based on health and diameter 
#Load relevant libraries
import numpy as np
from sklearn import tree
from sklearn import model_selection
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
#le.fit(tree_data['Neighbourhoods'])
#list(le.classes_)
#trans_nbh = le.transform(tree_data['Neighbourhoods']) 

#Split data set into a training and a test set
X_train, X_test, y_train, y_test = model_selection.train_test_split(zip(tree_data['Diameter'], tree_data['Longitude'], tree_data['Latitude'])
                                                    , tree_data['Spc_Common']
                                                    , test_size=0.10
                                                    , random_state=42)

#Classify Decision trees
dt = tree.DecisionTreeClassifier(random_state = 42)

#Fit the data and make predictions
dt.fit(X_train, y_train).predict(X_test)

#Calculate accuracy
#score = dtnn.fit(X_train, y_train).score(X_test, y_test)
n_folds = 10
score = np.mean(model_selection.cross_val_score(dt.fit(X_train, y_train),X_train, y_train,cv=n_folds))
print "Decision tree accuracy:", score

#trans_nbh10 = le.transform(tree_nth10) 

#X_train, X_test, y_train, y_test = model_selection.train_test_split(zip(tree_diam10, tree_health10, trans_cb10)
                                                    #, tree_spec10
                                                    #, test_size=0.33
                                                    #, random_state=42)
#Classify Decision trees
#dt = tree.DecisionTreeClassifier(random_state = 42)

#Fit the data and make predictions
#dt.fit(X_train, y_train).predict(X_test)

#Calculate accuracy
#score = dtnn.fit(X_train, y_train).score(X_test, y_test)
#n_folds = 10
#score = np.mean(model_selection.cross_val_score(dt.fit(X_train, y_train),X_train, y_train,cv=n_folds))
#print "Decision tree accuracy for classifying top 10 species:", score

In [ ]:
import numpy as np
from sklearn import tree
from sklearn import model_selection
from sklearn import preprocessing
from sklearn import ensemble
le = preprocessing.LabelEncoder()
le.fit(tree_nth10)
list(le.classes_)
trans_nbh10 = le.transform(tree_nth10) 
le = preprocessing.LabelEncoder()
le.fit(tree_boro10)
list(le.classes_)
trans_boro10 = le.transform(tree_boro10) 

X_train, X_test, y_train, y_test = model_selection.train_test_split(zip(tree_lon10, tree_lat10),
                                                     tree_spec10
                                                    , test_size=0.10
                                                    , random_state=42)
#Classify Decision trees
dt = tree.DecisionTreeClassifier(random_state = 42)
dt2 = ensemble.RandomForestClassifier(random_state = 42)

#Fit the data and make predictions
dt.fit(X_train, y_train).predict(X_test)
dt2.fit(X_train, y_train).predict(X_test)

#Calculate accuracy
#score = dt.fit(X_train, y_train).score(X_test, y_test)
n_folds = 10
score = np.mean(model_selection.cross_val_score(dt.fit(X_train, y_train),X_train, y_train,cv=n_folds))
print "Decision tree accuracy for classifying top 10 species:", score
score = np.mean(model_selection.cross_val_score(dt2.fit(X_train, y_train),X_train, y_train,cv=n_folds))
print "Random forest accuracy for classifying top 10 species:", score

In [ ]:
import numpy as np
from sklearn import tree
from sklearn import model_selection
from sklearn import preprocessing
from sklearn import ensemble
le = preprocessing.LabelEncoder()
le.fit(tree_nth10)
list(le.classes_)
trans_nbh10 = le.transform(tree_nth10) 
le = preprocessing.LabelEncoder()
le.fit(tree_boro10)
list(le.classes_)
trans_boro10 = le.transform(tree_boro10) 

le = preprocessing.LabelEncoder()
le.fit(tree_spec10)
list(le.classes_)
trans_species10 = le.transform(tree_spec10) 

X_train, X_test, y_train, y_test = model_selection.train_test_split(zip(trans_species10, tree_total10)
                                                    , new_diam
                                                    , test_size=0.25
                                                    , random_state=42)
#Classify Decision trees
dt = tree.DecisionTreeClassifier(max_depth=20, max_leaf_nodes=40, random_state = 42)
#dt2 = ensemble.RandomForestClassifier(random_state = 42)

#Fit the data and make predictions
dt.fit(X_train, y_train).predict(X_test)
#dt2.fit(X_train, y_train).predict(X_test)

#Calculate accuracy
#score = dtnn.fit(X_train, y_train).score(X_test, y_test)
n_folds = 5
score = np.mean(model_selection.cross_val_score(dt.fit(X_train, y_train),X_train, y_train,cv=n_folds))
print "Decision tree accuracy for classifying top 10 species:", score
score = np.mean(model_selection.cross_val_score(dt2.fit(X_train, y_train),X_train, y_train,cv=n_folds))
print "Random forest accuracy for classifying top 10 species:", score