DAT210x - Programming with Python for DS

Module5 - Lab1

Start by importing whatever you need to import in order to make this lab work:


In [ ]:
# .. your code here ..
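
One plausible set of imports, assuming pandas for loading and cleaning, matplotlib for plotting, and scikit-learn for K-Means. The `plt` alias matters because `doKMeans()` calls `plt.figure()` and `plt.show()`; the rest is a suggestion, not a required list:

```python
import pandas as pd                  # loading and cleaning the CSV
import matplotlib.pyplot as plt      # doKMeans() relies on the `plt` alias
from sklearn.cluster import KMeans   # clustering model used inside doKMeans()
```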

How to Get The Dataset

  1. Open up the City of Chicago's Open Data | Crimes page.
  2. In the Primary Type column, click on the Menu button next to the info button, and select Filter This Column. It might take a second for the filter option to show up, since it has to load the entire list first.
  3. Scroll down to GAMBLING and select it.
  4. Click the light blue Export button next to the Filter button, and select Download As CSV.

Now that you have the dataset stored as a CSV, load it up, being careful to double-check the headers, as per usual:


In [ ]:
# .. your code here ..
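
A minimal sketch of the loading step. It reads a tiny in-memory stand-in so it runs without the download; the sample rows and the file name in the comment are assumptions, and the real export has many more columns than the four shown here:

```python
import io
import pandas as pd

# Tiny stand-in for the exported file (illustrative values only).
sample_csv = io.StringIO(
    "Date,Primary Type,Latitude,Longitude\n"
    "05/03/2016 11:40:00 PM,GAMBLING,41.8781,-87.6298\n"
    "01/15/2009 02:10:00 PM,GAMBLING,,\n"
)

# With the real download, pass the path instead, for example
# pd.read_csv('Datasets/Crimes.csv')  (hypothetical file name).
df = pd.read_csv(sample_csv)

# Confirm the first row was picked up as column names, not as data.
print(df.columns.tolist())
print(df.head())
```

If the column names print as integers or the first data row looks like a header, revisit the `header` argument of `read_csv`.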

Get rid of any rows that contain NaNs:


In [ ]:
# .. your code here ..
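
One way to do this is `dropna`, sketched here on a small made-up frame. Resetting the index afterwards is optional and only keeps the row labels contiguous:

```python
import numpy as np
import pandas as pd

# Made-up coordinates; two of the three rows are missing a value.
df = pd.DataFrame({
    "Latitude": [41.88, np.nan, 41.75],
    "Longitude": [-87.63, -87.70, np.nan],
})

# axis=0 drops every row that has at least one NaN;
# reset_index(drop=True) renumbers the surviving rows from 0.
df = df.dropna(axis=0).reset_index(drop=True)
print(df)
```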

Display the dtypes of your dataset:


In [ ]:
# .. your code here ..
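
A short sketch on a made-up one-row frame. Note that `dtypes` is an attribute rather than a method, and that a date stored as text is reported as a string-like type (`object` in most pandas versions) rather than a datetime:

```python
import pandas as pd

# Illustrative row; the date value is an assumed example of the raw text.
df = pd.DataFrame({
    "Date": ["05/03/2016 11:40:00 PM"],
    "Latitude": [41.88],
    "Longitude": [-87.63],
})

# One dtype per column; no parentheses needed.
print(df.dtypes)
```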

Coerce the Date feature (which is currently a string object) into a real datetime, and confirm the change by displaying the dtypes again. This step can be slow to execute...


In [ ]:
# .. your code here ..
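
A sketch of the coercion with `pd.to_datetime`. The format string is an assumption about how the export writes dates (month/day/year with a 12-hour clock); check a few raw values before relying on it. Supplying an explicit format generally avoids per-value format guessing, which is where much of the slowness comes from:

```python
import pandas as pd

# Made-up values, including one that cannot be parsed.
df = pd.DataFrame({
    "Date": ["05/03/2016 11:40:00 PM", "01/15/2009 02:10:00 PM", "not a date"],
})

# errors='coerce' turns unparseable values into NaT instead of raising.
df["Date"] = pd.to_datetime(
    df["Date"], format="%m/%d/%Y %I:%M:%S %p", errors="coerce"
)

# The Date column should now report a datetime64 dtype.
print(df.dtypes)
```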

In [ ]:
def doKMeans(df):
    # Plot the data with a '.' marker and an alpha of 0.3 at each sample's
    # Longitude and Latitude. Longitude = x, Latitude = y
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(df.Longitude, df.Latitude, marker='.', alpha=0.3)

    
    # TODO: Filter `df` using indexing so it only contains Longitude and Latitude,
    # since the remaining columns aren't really applicable for this lab:
    #
    # .. your code here ..


    # TODO: Use K-Means to try and find seven cluster centers in this df.
    # Be sure to name your kmeans model `model` so that the printing works.
    #
    # .. your code here ..


    # Now we can print and plot the centroids:
    centroids = model.cluster_centers_
    print(centroids)
    ax.scatter(centroids[:,0], centroids[:,1], marker='x', c='red', alpha=0.5, linewidths=3, s=169)
    plt.show()
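
The two TODO steps inside `doKMeans` could look roughly like the sketch below, shown outside the function on synthetic points so it runs on its own. The random coordinates, `n_init` and `random_state` are assumptions for illustration; only the seven clusters and the name `model` come from the lab. Selecting Longitude first keeps centroid column 0 on the x axis, matching the scatter call in the function:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic points in a made-up coordinate box, plus one unrelated column.
rng = np.random.RandomState(0)
df = pd.DataFrame({
    "Primary Type": ["GAMBLING"] * 200,
    "Latitude": rng.uniform(41.65, 42.02, 200),
    "Longitude": rng.uniform(-87.90, -87.55, 200),
})

# Keep only the two coordinate columns, Longitude first (x), Latitude second (y).
coords = df[["Longitude", "Latitude"]]

# Fit seven clusters; fixing random_state makes repeated runs comparable.
model = KMeans(n_clusters=7, n_init=10, random_state=0)
model.fit(coords)

# One row per centroid, columns in the same order as `coords`.
print(model.cluster_centers_)
```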

In [ ]:
# Print & Plot your data
doKMeans(df)

Filter the data, using indexing, so that it only contains samples with a Date > '2011-01-01'. Then, in a new figure, plot the crime incidents along with the centroids from a new K-Means run.


In [ ]:
# .. your code here ..
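
A sketch of the filter using a boolean mask, on made-up rows. It assumes Date has already been converted to a datetime; on a string column the comparison would be alphabetical and give misleading results:

```python
import pandas as pd

# Made-up rows; Date is already a datetime64 column here.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2009-01-15", "2011-01-01", "2016-05-03"]),
    "Latitude": [41.70, 41.80, 41.90],
    "Longitude": [-87.60, -87.65, -87.70],
})

# The mask is True only for rows strictly later than midnight on 2011-01-01.
df = df[df["Date"] > "2011-01-01"]
print(df)
```

Reassigning to `df` means a later `doKMeans(df)` call sees only the filtered rows.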

In [ ]:
# Print & Plot your data
doKMeans(df)
