Explainer Notebook

This notebook was produced in the course 02806 - Social data analysis and visualization at DTU in Denmark, Spring 2017 [2]. Its intention is to present details about the data, the reasoning behind our choices, and several discussions. It should give a more in-depth understanding of what we have created for our final project. The requirements for the project can be found here [3].

The story the group wants to present is about Arvind winning the lottery. He receives a huge amount of money and plans to move to New York City. He has questions such as how he can spend his money wisely by investing in a place to live. The data scientist David plays a central role, as he is in charge of analyzing the different data sets in the story. Arvind has another request for David: he wants to live in a neighborhood considered good. David tells Arvind that he will figure things out and then come back to him. He is ready to do some data analysis and visualisation!

Participants:

  • David Coleman
  • Arvind Iyer
  • Uyen Dan Nguyen

Motivation

Data sets

The group used several data sets to realize the final project. Firstly, we used an NYC GeoJSON data set for mapping the actual NYC data [10]. Secondly, we used the data set Median House Value Per Sq Ft from 2014-2016 from Zillow [11], the leading real estate and home-related information marketplace in the US. To look at the quality of life in different areas, the group used the 311 Service Requests data sets from 2010 to 2015. NYC311 is a 24/7 service covering more than 3,600 non-emergency government services, such as noise complaints and illegal parking [12]. The last data set we used contained individual income tax statistics with adjusted gross income (AGI) from 2014 for all states [19].

Reasoning

Our initial idea was to understand the real estate market in the city and to predict the quality of life in a certain area, as prices in New York City are known to be high. The initial plan was to predict future housing prices with several Zillow data sets. The problem with this idea was that our main data sets from Zillow had a lot of holes in them, which made them difficult to use. The future housing market also depends on many outside parameters, which made it hard for us to make this prediction with a high degree of certainty. With this in mind, we took some inspiration from the other groups and incorporated existing 311 data sets. We still wanted to predict the quality of an area in terms of non-criminal complaints in a neighborhood or a borough. This way, we could predict the kinds of non-criminal complaints that were most likely to happen in an area or a neighborhood, similar to what was done with crime in San Francisco. By looking at the frequency of certain types of complaints in an area, it was also possible for us to differentiate one area from another - and still keep the story we presented earlier. To do so, we used the parameters "Location", "Longitude", "Latitude" and "Complaint Type" from the 311 data sets. Also, by comparing the zip codes for 311 requests and individual income tax, we could find out what kinds of 311 requests were more frequent for people in an area with a certain income. For that we took the parameters "Zip" and "Complaint Type" from the 311 data, and "Zip" and "Agi(1-6)" from the tax data, where each row covers everybody with an income in a given AGI bracket. Lastly, the only Zillow data set we ended up using was Median House Value Per Sq Ft, which contained data that helped us estimate the mean price per square foot in each borough. A small sketch of how such columns can be pulled out of the raw files is shown below.
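The following is a minimal sketch of this column selection with pandas; the file name and some column labels are assumptions for illustration, not the exact files we used.

```python
import pandas as pd

# Hypothetical file name: the actual CSV from NYC Open Data is named differently.
requests = pd.read_csv("311_service_requests_2015.csv", usecols=[
    "Complaint Type", "Location", "Longitude", "Latitude", "Incident Zip"
])

# Keep only the parameters needed for the maps and the income comparison,
# then count complaints per zip code and complaint type.
by_zip = (requests
          .dropna(subset=["Incident Zip"])
          .groupby(["Incident Zip", "Complaint Type"])
          .size()
          .reset_index(name="count"))

print(by_zip.head())
```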

The end user's experience

Our story is about Arvind and David's quest to find out where Arvind can live, in terms of quality of life. The group thus made the visualizations to fit the story. We offer the user some interaction with the website, such as zooming in and out of the NYC maps. By moving the cursor over an area, one can see which neighborhood and which borough the cursor is currently on, with information depending on the kind of visualization.

What we had in mind prior to creating the website was that it should be easy and intuitive for the user. The visualizations should support the story, and the user should understand why the different elements are on the page. The user should be able to click buttons to change some of the visualizations, such as choosing a different focus for complaints. Without knowing anything about K-means, the user should still get the message of what it does. The SVM diagram is not that intuitive, but by looking at it and reading the description, we think it will be reasonably understandable.

The optimal solution, which we initially had in mind, was for the user to toggle parameters on or off and have all the visualizations update automatically. This would help an individual planning to move to NYC find a good area according to their own requirements. By toggling a parameter on, the user would see on a map which areas fit them the most. As time was a big factor for each member of the group, we decided to only follow the story we presented earlier to find Arvind a good area to live in, with a limited level of interactive visualizations. The website currently offers the standard list or one chosen complaint type parameter in focus at a time. More information about this can be read in the section about visualizations.

Basic stats

Choices in data cleaning and preprocessing

First we had to look through the relevant data sets to see their size, rows and different parameters. The data sets we used were large and contained parameters or data that were not relevant for our story. A concrete example is the income data set. It contained data about all the states in the United States, while we were only interested in New York City. The data set needed to be cleaned twice to fit our purpose. Another concrete example is the Zillow data set, which also contained data about other states that we were not interested in. We cleaned it to only show data from New York City. The third example is the cleaning we did with the 311 data sets. These data sets contain data from all departments in NYC, but the only department that served our purpose was the one about housing. The data sets were thus cleaned to only contain rows where the "Agency Name" was "Department of Housing Preservation and Development". The fourth example is a data set we created from a 311 file, containing zip codes and the sum for five different complaint categories (Sanitation, Food, Homeless, Neighbourhood, Noise) for each zip code in 2016. A sketch of this kind of filtering is shown below.
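As an illustration of the agency filtering described above, a minimal sketch (the file names are assumptions) could look like this:

```python
import pandas as pd

# Load one raw 311 file; low_memory=False avoids mixed-type warnings on large CSVs.
raw = pd.read_csv("311_service_requests_2014.csv", low_memory=False)

# Keep only the housing department, as described above.
housing = raw[raw["Agency Name"] == "Department of Housing Preservation and Development"]

# Write the reduced file so later steps work on a much smaller CSV.
housing.to_csv("311_housing_2014.csv", index=False)
print(len(raw), "rows before cleaning,", len(housing), "rows after")
```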

Discussion of the data set stats

The first data set with NYC GeoJSON consists of 635 kB of data and 527 rows. We did not need to clean this, as the information was only used for the map of New York City.

Before cleaning, the second data set from Zillow contained 5,820 rows with data from 1996 to 2004. The file was 5.1 MB before cleaning, as it also contained data from cities other than New York City.

If we had used the single 311 data set containing all the data from 2010-2015, we would have had to deal with 7 columns and over 15 million rows [13]. The data set was big and hard to process, so we downloaded separate data sets for each year instead. The data sets from 2011 to 2015 can be found in the references: [14], [15], [16], [17] and [18]. Each 311 data set is approximately 1.3 GB. In the following visualizations the 311 open data set from New York was utilised. This data was reduced by zip code, so that it only included the 240 zip codes contained within New York City itself, discarding the zip codes from the rest of New York State. Although there are 240 zip codes within the New York City limits, the 311 data did not contain information on all of them. On further research we found that some zip codes are entirely contained within other zip codes; these larger zip codes contain the data for both. For this reason the data set contains 214 entries. A sketch of the zip-code reduction is shown below.
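A minimal sketch of the zip-code reduction, assuming a plain-text file "nyc_zip_codes.txt" (a hypothetical name) with one NYC zip code per line:

```python
import pandas as pd

# Load the list of zip codes that lie within the New York City limits.
nyc_zips = set(open("nyc_zip_codes.txt").read().split())

requests = pd.read_csv("311_housing_2015.csv", dtype={"Incident Zip": str})

# Keep only complaints whose zip code lies inside the city, dropping the
# New York State zip codes.
in_city = requests[requests["Incident Zip"].str[:5].isin(nyc_zips)]
print(in_city["Incident Zip"].nunique(), "distinct NYC zip codes in the data")
```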

The data set with information about individual income tax statistics was first cleaned to only contain data from New York State and then cleaned again to only contain New York City data. After cleaning, the 6 MB data set contained 9,265 rows.

The fourth data set was 8 kB and had 203 rows. To realize the fourth type of visualization, we had to combine the first and the fourth data set. More information is available in the section about visualizations.

Theory

Machine learning tools

Radial Basis Function Kernel

The RBF kernel is in the form of a radial basis function [4]. We let X represent the input space and consider a function Φ: X → F that takes a point in the input space X and maps it to a point in a new space F. Say that we have mapped all the data points from X to F. If we try to solve the normal linear SVM in the new space F, we notice that all the earlier derivation looks the same, except that each point xi is now represented as Φ(xi). Instead of using the dot product, which is the natural inner product for Euclidean space, we replace it with ⟨Φ(x),Φ(y)⟩ - a representation of the natural inner product in the new space F. Say then that there exists a function k: X × X → R such that k(xi,xj) = ⟨Φ(xi),Φ(xj)⟩. We can then replace all the dot products above with k(xi,xj) - also known as a kernel function [5].
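For completeness, the radial basis function kernel itself, in its standard textbook form with a free width parameter γ, is

$$ k(x_i, x_j) = \exp\!\left(-\gamma \, \lVert x_i - x_j \rVert^2\right), \qquad \gamma > 0, $$

which corresponds to the `gamma` parameter in scikit-learn's SVM implementations.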

This kernel is commonly used in support vector machine classification; support vector machines are supervised learning models that analyze data for classification and regression analysis. The RBF kernel is well studied, and many SVM packages include it as a default method [7]. Kernel methods are also commonly used in pattern analysis to find and study general types of relations [6].

We used this kernel to implement a support vector based regressor as a prediction model that attempts to estimate future patterns of median house prices based on past fluctuations. Using the median value per square foot gave us a good understanding of the average cost of living in terms of real estate space. The general trends of the housing market as a whole are significantly easier to predict than the valuation of a specific plot of land or locality. The SVM thus reflects the quality of living to some extent by looking at the house prices in an area, which is relevant for someone interested in living in New York City. A small sketch of such a regressor is given below.
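The following is a minimal sketch of an RBF-kernel support vector regressor on a 36-month price series; the synthetic series and the hyperparameters (C, gamma, epsilon) are illustrative assumptions, not the values or data from our actual model.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic stand-in for 36 months of median value per sq ft for one borough.
months = np.arange(36).reshape(-1, 1)
prices = 500 + 2.5 * months.ravel() + np.random.normal(0, 5, 36)

# Train on the first 24 months, test on the last 12, as described later
# under "Model selection and performance".
model = SVR(kernel="rbf", C=100, gamma=0.01, epsilon=1.0)
model.fit(months[:24], prices[:24])

predicted = model.predict(months[24:])
print("Mean absolute error on the last 12 months:",
      np.abs(predicted - prices[24:]).mean().round(2))
```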

K-Nearest Neighbors

The KNN algorithm is used for classification and regression, and makes predictions using the training data set directly. Data Science from Scratch explains the algorithm this way: imagine that you are voting soon, but have no idea whom to vote for. By looking at your neighbors, you might get a better idea. You take relevant factors into account to find the most accurate prediction. This is the idea behind nearest neighbors classification [8]. KNN needs some notion of distance and the assumption that points that are close to each other are similar. It is one of the simplest algorithms for prediction, and it has downsides such as focusing too much on the surrounding points and assuming that they are similar. On the other hand, it can give us a general idea or prediction from the plotted parameters or factors.

In the project, we used KNN for predicting the areas or districts where some kind of complaint was most common. This gave us a general idea of where some complaints were reported more frequently than in other places. This was also relevant for the presented story, as Arvind would have the chance to pick between districts or neighborhoods based on his preferences regarding the most common complaints in an area.

K-Means Clustering

K-means clustering is a method used for cluster analysis in data mining. The method aims to partition n observations into k clusters, where each observation belongs to the cluster with the nearest mean based on the provided features. The results of the algorithm are the centroids of the k clusters, which can be used to label new data, and labels for the training data, as each data point is assigned to a single cluster. Each centroid is defined by a collection of features, and instead of defining groups before looking at the data, clustering allows us to find and analyze groups that have formed organically [9]. The algorithm uses the squared Euclidean distance to assign each point to its nearest centroid and recomputes the centroids by taking the mean of the data points assigned to each centroid's cluster. To find the right value of k, we had to run the K-means clustering algorithm for a range of k values and compare the results, as sketched below.
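A minimal sketch of running K-means for a range of k values and comparing the results via the inertia (the sum of squared distances to the nearest centroid); the random matrix below merely stands in for our per-zip-code features.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the per-zip-code feature matrix.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 5))

# Fit K-means for k = 2..10 and print the inertia for each k.
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
    print(k, round(km.inertia_, 1))
```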

In this project, the K-means clustering was done on the 311 data per zip code and on the income per zip code. The reasoning for this was to comment on the relationship between the type of 311 calls and the income of the people living in that zip code, and to see if the clustered groups were similar to each other. By combining the 311 and income data sets, we had 3.5 million rows on which to run the K-means clustering. After observing the k-means partitioning of the 311 data, we considered whether the distribution of the different 311 partitions and their complaint types had an economic component, wherein people living in similar economic areas experience the same types of complaints. A data set of the average gross income for each zip code was found. Unfortunately it was from 2014, the last year for which the data had been collected, and it only contains 177 zip codes. The data set had 6 different income brackets ranging from less than $25,000 up to $200,000+. Another k-means was then performed using the zip code and the income bracket, and the results were plotted as before.

Model selection and performance

For the housing data from Zillow, 24 of the 36 months were used as training data for the SVM. The model was then tested on the last 12 months.

The KNN was trained using all the available complaint locations, and we used the model to generate a grid of points with a 0.01° spacing between neighbouring points, each with a predicted complaint type. A sketch of this procedure is shown below.
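A minimal sketch of the KNN prediction grid; the synthetic latitude/longitude/complaint arrays and the choice of n_neighbors are assumptions standing in for the cleaned 311 data and the parameters we actually used.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for complaint locations and their types.
rng = np.random.default_rng(1)
lat = rng.uniform(40.5, 40.9, 5000)
lon = rng.uniform(-74.25, -73.7, 5000)
complaint = rng.choice(["Noise", "Sanitation", "Homeless"], 5000)

knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(np.column_stack([lat, lon]), complaint)

# Grid of points 0.01 degrees apart, each assigned a predicted complaint type.
grid_lat, grid_lon = np.meshgrid(np.arange(40.5, 40.9, 0.01),
                                 np.arange(-74.25, -73.7, 0.01))
grid = np.column_stack([grid_lat.ravel(), grid_lon.ravel()])
predicted = knn.predict(grid)
print(predicted[:10])
```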

For the K-means, we only performed the clustering itself. We intended to evaluate it with a distance measure, plot the curve, and find the elbow, which indicates the best k value.

We used scikit-learn's built-in cross validation support (sklearn.model_selection.cross_val_score) to get the accuracy, mean error and standard deviation of the models used above. We experimented with different scoring functions, but our understanding of the different scoring options was limited. A sketch of this kind of cross validation is shown below.
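A minimal sketch of cross validation with cross_val_score, using synthetic data and a KNN classifier purely as an example estimator.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic features and labels standing in for the real data.
rng = np.random.default_rng(2)
X = rng.uniform(size=(1000, 2))
y = rng.choice(["Noise", "Sanitation", "Homeless"], 1000)

# 5-fold cross validation with accuracy as the scoring function.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=15), X, y,
                         cv=5, scoring="accuracy")
print("Accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))
```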

To measure the results of the SVM and the regressions for the housing data, we compared the error values of the three models.

Visualizations

One of the visualizations we chose is a map of New York City divided into the different boroughs. Each borough has a different color, and each zip code or area within a borough differs in opacity. This opacity is calculated from five main categories in a cleaned 2016 311 CSV: Sanitation, Food, Homeless, Noise and Neighbourhood. The categories are divided into smaller categories, but we only take the main categories into account. Since the total sum of those five categories differs between zip codes in 2016, we used scaling, as scales provide a convenient way to map data values to new values useful for visualization purposes [1]. The scaled value was used to set the opacity in the bind function. A zip code with a high opacity means the area had few reported complaints in 2016 compared to the other zip codes; a zip code with a low opacity means the area had a lot of complaints in 2016. The visualization shows the user how the quality of life was in the different areas in 2016, with the five main categories in mind, and was thus relevant for our story. By clicking on one of the areas, a corresponding bar graph reveals itself and shows the frequency of the different main complaints. By hovering over each bar, one can also see the frequency of the specific complaint. The scaling idea is sketched below.
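The website itself does this scaling with d3 scales; the Python sketch below only illustrates the mapping from total complaint counts to an opacity, with a hypothetical opacity range and counts.

```python
# Map total 2016 complaint counts per zip code to an opacity between 0.2 and 1.0.
counts = {"10001": 410, "10002": 960, "10003": 150}   # hypothetical totals

lo, hi = min(counts.values()), max(counts.values())

def opacity(count, out_min=0.2, out_max=1.0):
    # Fewer complaints should give a higher opacity, so invert the normalised value.
    normalised = (count - lo) / (hi - lo)
    return out_max - normalised * (out_max - out_min)

for zip_code, count in counts.items():
    print(zip_code, round(opacity(count), 2))
```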

Another visualization type we went for was a heat map showing the total number of complaints per day across the months of 2015 for all complaint types in the whole city. It shows patterns of which days are more prone to non-criminal complaints, for example weekends. Each main complaint category additionally has a corresponding heat map, so the user can see which days had more registered complaints for that specific complaint type. The counting behind such a heat map is sketched below.
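A minimal sketch of the counting behind the heat map, assuming a cleaned 2015 311 CSV with a "Created Date" column (the usual name in the 311 data); the file name is an assumption.

```python
import pandas as pd

requests = pd.read_csv("311_housing_2015.csv", parse_dates=["Created Date"])

# Count complaints per (month, day of month) and pivot into a heat-map-shaped table.
counts = (requests
          .groupby([requests["Created Date"].dt.month.rename("month"),
                    requests["Created Date"].dt.day.rename("day")])
          .size()
          .unstack(fill_value=0))

print(counts.iloc[:3, :7])
```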

The third type of visualization shows the relationship between income and complaint types by zip code, as described in the K-means section. This is an interactive map visualization: the user is able to click buttons for the number of clusters, where "Click k2" means two clusters, "Click k3" means three, and so on. The user can see the areas where people earn approximately the same, and by putting the map for 311 complaints and the one for income next to each other, the user can compare the data visually. For the first of these maps, a k-means algorithm was applied to the 2016 311 data set with respect to the 'Complaint Type', which was assigned a numeric value, and the 'Incident Zip' of the complaint. K-means is an unsupervised learning algorithm that seeks to partition the observations into k clusters based on the nearest mean. The algorithm was run with k values of 2-10 and the output values stored. All duplicates were removed from the output, leaving only one instance of each zip code with its corresponding cluster number. Unfortunately, when it came to plotting the visualisation, a GeoJSON file for the New York zip codes could not be found. Other New York GeoJSON files were available, so it was decided to use a neighbourhood GeoJSON with the zip codes plotted as points at their appropriate latitude and longitude. This lets the user see which neighborhoods are related to each other in terms of types of complaints and income level.

The fourth type of visualization is similar to the third one: we have a map of New York City, and the user is able to click through different types of complaints. By doing so, the user can see which areas are more prone to noise, neighbourhood, food, homeless or sanitation complaints, as described in the first visualization. The circles are scaled; the bigger the circle, the more complaints were registered in the corresponding area in 2016. The user is thus able to see the areas with the most reported homeless complaints and the areas with a lot of noise. This is relevant for our story because Arvind might have preferences such as not liking a lot of noise. Manhattan had a lot of reported noise and homeless complaints in 2016 and should therefore not be at the top of the list of areas to live in with Arvind's preferences in mind. It was decided to further investigate some complaints, to form subsets out of similar complaints, and then to graph them across the zip codes, so as to make it easier to understand the types of complaints that could affect your quality of life. The subsets are:

  • Noise: comprised of the complaint types 'Noise', 'Noise - Vehicle', 'Noise - Residential', 'Noise - Street/Sidewalk' and 'Noise - Commercial'.
  • Neighbourhood: groups complaints about the upkeep of the area; it contains 'Street Condition', 'Street Light Condition', 'Sweeping/Inadequate', 'Graffiti' and 'Derelict Vehicle'.
  • Food: contains 'Food Poisoning' and 'Food Establishment'.
  • Homeless: contains 'Homeless Person Assistance' and 'Homeless Encampment'.
  • Sanitation: contains 'Rodent', 'Dirty Conditions', 'Sanitation Condition', 'Sewer', 'Overflowing Recycling Baskets', 'Missed Collection (All Materials)' and 'Unsanitary Conditions'.

This is not a complete list, and many other complaints could be included, but only the most frequently occurring complaints were processed. Select subsets using the buttons; the graph is zoomable for a closer inspection. Each plot has its own scale for that type, so you can visualise the amount of that complaint across the city, but the plots should not be compared against each other size-wise. A sketch of how the subsets can be built is shown below.
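A minimal sketch of grouping the raw complaint types into the five subsets above and counting them per zip code; the file name is an assumption.

```python
import pandas as pd

# Mapping from raw complaint types to the subsets defined above.
subsets = {
    "Noise": ["Noise", "Noise - Vehicle", "Noise - Residential",
              "Noise - Street/Sidewalk", "Noise - Commercial"],
    "Neighbourhood": ["Street Condition", "Street Light Condition",
                      "Sweeping/Inadequate", "Graffiti", "Derelict Vehicle"],
    "Food": ["Food Poisoning", "Food Establishment"],
    "Homeless": ["Homeless Person Assistance", "Homeless Encampment"],
    "Sanitation": ["Rodent", "Dirty Conditions", "Sanitation Condition", "Sewer",
                   "Overflowing Recycling Baskets",
                   "Missed Collection (All Materials)", "Unsanitary Conditions"],
}
complaint_to_subset = {c: s for s, types in subsets.items() for c in types}

requests = pd.read_csv("311_2016.csv", dtype={"Incident Zip": str})
requests["Subset"] = requests["Complaint Type"].map(complaint_to_subset)

# Count complaints per zip code and subset, dropping types outside the five subsets.
per_zip = (requests.dropna(subset=["Subset"])
           .groupby(["Incident Zip", "Subset"])
           .size()
           .unstack(fill_value=0))
print(per_zip.head())
```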

Discussion

Good

What went well was that the group managed to create something from the different data sets we found. We managed to clean the data sets and combine them in a way that matched the story we wanted to tell. In addition, we managed to use three machine learning tools and visualize the results. The exercises and tools from the class helped us realize the project. The design of the visualizations was clean and simple.

Bad

The project was very open, since we could choose any topic regarding a city and still fulfill the requirements from the project page [3]. There were too many choices to pick between, so we ended up wasting a lot of time deciding which topic to go for. We were unsure about our topic and therefore changed it several times. When we finally settled on something, we found out that the data sets were not sufficient for what we wanted to do. The reason for this could be that we had several topics in mind, but not the relevant data sets. We were also too indecisive and afraid of choosing something simple when the project requirements were so open. We learned that sometimes you should pick something and go 100% for it rather than changing the topic several times when there is a deadline.

What could be improved in our project is, for example, the user interaction and the simple design. With more ways for the user to interact with the page and a more interesting and intuitive design, the website would be more engaging. Another improvement could be making a better video, after seeing all the other videos during the presentation. The website could look more coherent, and the machine learning tool used on the Zillow data set could be more relevant for our project. We would also like to add more predictions, as we only had a few.

References

  • [8] J. Grus, Data Science from Scratch, 1st ed. Sebastopol, Calif.: O'Reilly, 2015.
