Data management

Introduction

CARTOframes is built on top of Pandas and GeoPandas. Therefore, it's compatible with all the data formats supported in those projects like CSV, GeoJSON, Shapefile, etc. This guide will show how to load different data files into DataFrames and how to interact with the CARTO platform to upload DataFrames into tables and download tables or SQL queries into DataFrames.

There are two main concepts we should know before continuing with the guide:

  • A DataFrame is a two-dimensional data structure for generic data. It can be thought of as a table with rows and columns. It's composed of Series objects which are one-dimensional data structures.
  • A GeoDataFrame is a DataFrame with an extra geometry column. This geometry column is a GeoSeries object.

Every time we manage Geographic data, a GeoDataFrame should be used. In case a DataFrame with an encoded geometry column is used (WKB, WKT, etc) every method contains a geom_col param to provide the name of that column and decode the geometry internally.

For more information, you can see all the examples here.

Steps

To show how to manage your data with CARTOframes, we will follow the next steps:

  • Load San Francisco neighborhoods from a GeoJSON file
  • Load San Francisco incidents from a CSV file
  • Upload data to CARTO
  • Calculate their intersection using a SQL query
  • Download the result

Load San Francisco neighborhoods from a GeoJSON file

Let's start by loading and visualizing San Francisco neighborhoods reading an external GeoJSON file.


In [1]:
from geopandas import read_file

neighborhoods_gdf = read_file('https://data.sfgov.org/api/geospatial/pty2-tcw4?method=export&format=GeoJSON')
neighborhoods_gdf.head()


Out[1]:
link name geometry
0 http://en.wikipedia.org/wiki/Sea_Cliff,_San_Fr... Seacliff MULTIPOLYGON (((-122.49346 37.78352, -122.4937...
1 None Lake Street MULTIPOLYGON (((-122.48715 37.78379, -122.4872...
2 http://www.nps.gov/prsf/index.htm Presidio National Park MULTIPOLYGON (((-122.47758 37.81099, -122.4771...
3 None Presidio Terrace MULTIPOLYGON (((-122.47241 37.78735, -122.4710...
4 http://www.sfgate.com/neighborhoods/sf/innerri... Inner Richmond MULTIPOLYGON (((-122.47263 37.78631, -122.4668...

In [2]:
from cartoframes.viz import Layer

Layer(neighborhoods_gdf)


Out[2]:
:
StackTrace
    ">

    Load San Francisco incidents from a CSV file

    Let's do the same with San Francisco incidents. In this case, we will work with an external CSV file.

    
    
    In [3]:
    from pandas import read_csv
    from geopandas import GeoDataFrame, points_from_xy
    
    df = read_csv('http://data.sfgov.org/resource/wg3w-h783.csv')
    
    # Clean NaN values
    df = df[df['longitude'].notna()]
    
    incidents_gdf = GeoDataFrame(df, geometry=points_from_xy(df['longitude'], df['latitude']))
    incidents_gdf.head()
    
    
    
    
    Out[3]:
    incident_datetime incident_date incident_time incident_year incident_day_of_week report_datetime row_id incident_id incident_number cad_number ... :@computed_region_qgnn_b9vv :@computed_region_26cr_cadq :@computed_region_ajp5_b2md :@computed_region_nqbw_i6c3 :@computed_region_2dwj_jsy4 :@computed_region_h4ep_8xdi :@computed_region_y6ts_4iup :@computed_region_jg9y_a9du :@computed_region_6pnf_4xz7 geometry
    4 2020-02-03T14:45:00.000 2020-02-03T00:00:00.000 14:45 2020 Monday 2020-02-03T17:50:00.000 89881675000 898816 200085557 200342870.0 ... 10.0 8.0 16.0 NaN NaN NaN NaN NaN 2.0 POINT (-122.47604 37.72695)
    6 2020-02-03T03:45:00.000 2020-02-03T00:00:00.000 03:45 2020 Monday 2020-02-03T03:45:00.000 89860711012 898607 200083749 200340316.0 ... 3.0 2.0 20.0 3.0 NaN NaN NaN NaN 2.0 POINT (-122.41517 37.75244)
    7 2020-02-03T10:00:00.000 2020-02-03T00:00:00.000 10:00 2020 Monday 2020-02-03T10:06:00.000 89867264015 898672 200084060 200340808.0 ... 5.0 3.0 8.0 NaN 35.0 NaN NaN NaN 2.0 POINT (-122.40734 37.78456)
    9 2020-01-05T00:00:00.000 2020-01-05T00:00:00.000 00:00 2020 Sunday 2020-02-03T16:09:00.000 89877368020 898773 200085193 200342341.0 ... 4.0 6.0 30.0 NaN NaN NaN NaN NaN 1.0 POINT (-122.44025 37.78711)
    10 2020-02-03T08:36:00.000 2020-02-03T00:00:00.000 08:36 2020 Monday 2020-02-03T08:36:00.000 89876268020 898762 200083909 200340826.0 ... 6.0 3.0 8.0 NaN NaN NaN NaN NaN 1.0 POINT (-122.39951 37.79693)

    5 rows × 37 columns

    
    
    In [4]:
    from cartoframes.viz import Layer
    
    Layer(incidents_gdf)
    
    
    
    
    Out[4]:
    :
    StackTrace
      ">

      Upload data to CARTO

      Let's upload both GeoDataFrames to CARTO so we can see how to interact with the platform. In order to continue, you have to set your CARTO credentials. If you aren't sure about your API key, check the Authentication guide to learn how to get it.

      
      
      In [5]:
      from cartoframes.auth import set_default_credentials
      
      set_default_credentials('creds.json')
      
      
      
      In [6]:
      from cartoframes import to_carto
      
      
      neighborhoods_table = 'sf_neighborhoods'
      incidents_table = 'sf_incidents'
      
      to_carto(neighborhoods_gdf, neighborhoods_table, if_exists='replace')
      to_carto(incidents_gdf, incidents_table, if_exists='replace')
      
      
      
      
      Success! Data uploaded to table "sf_neighborhoods" correctly
      Success! Data uploaded to table "sf_incidents" correctly
      
      Out[6]:
      'sf_incidents'

      Now that we have uploaded the data, we can directly visualize the tables using:

      Layer(neighborhoods_table)
      Layer(incidents_table)
      

      Calculate their intersection using a SQL query

      Let's see how we can apply a SQL query to inserct both tables and download the result of the query tp visualize it.

      
      
      In [7]:
      from cartoframes import read_carto
      
      incidents_neighborhoods_gdf = read_carto('''
          SELECT n.cartodb_id, n.the_geom, n.the_geom_webmercator, n.name, COUNT(*) AS incidents
          FROM sf_incidents i INNER JOIN sf_neighborhoods n ON ST_Intersects(i.the_geom, n.the_geom)
          GROUP BY n.cartodb_id
      ''')
      incidents_neighborhoods_gdf.head()
      
      
      
      
      Out[7]:
      cartodb_id the_geom name incidents
      0 828 MULTIPOLYGON (((-122.51314 37.77133, -122.5110... Golden Gate Park 7
      1 934 MULTIPOLYGON (((-122.43734 37.76235, -122.4370... Eureka Valley 4
      2 908 MULTIPOLYGON (((-122.39899 37.71534, -122.3988... Little Hollywood 2
      3 850 MULTIPOLYGON (((-122.38452 37.78739, -122.3843... South Beach 7
      4 894 MULTIPOLYGON (((-122.41622 37.70833, -122.4159... Visitacion Valley 4
      
      
      In [8]:
      from cartoframes.viz import color_continuous_style
      
      Layer(incidents_neighborhoods_gdf, style=color_continuous_style('incidents'))
      
      
      
      
      Out[8]:
      :
      StackTrace
        ">

        Conclusion

        Congratulations! You have seen how to load data locally, upload it to CARTO, apply a SQL query and download the results. We recommend to upload your data to CARTO when it is too big (> 30MB) to be visualized from a GeoDataFrame or when you want to apply PostGIS queries.