In [2]:
%autosave 10


Autosaving every 10 seconds

Background of scientists

  • Bridging the gap between databases and pandas.
  • Scientists already use SQL, or are forced into tools like Microsoft SQL Server.
  • How to help them reach pandas?

Why should you care about databases at all? Example: StackOverflow

  • 18 million questions, 23GB of XML for posts
  • Multiple tables, relationships.
  • Popular tags? Trending tags over time?

Idea: store the data in a database, then load a post-processed subset into pandas. It'd be nice if we could do everything on a regular laptop.

!!AI isn't this why HDF5 was invented? Or is that only suitable for numeric data?

Goals

  • Importing data from PostgreSQL into pandas
  • Use SQLAlchemy without too much pain
  • Use pandasql to make data manipulation easier
  • Semi-structured data in PostgreSQL through HSTORE, SQLAlchemy
  • We are not talking about:
    • "Big data", this is quite small
    • Scalability.

What is a database

  • The Wikipedia definition is useless: by it, NumPy arrays, pandas DataFrames, and CSV files are all databases.
  • So what makes MySQL / PostgreSQL / Oracle databases?
    • querying languages?
    • data modelling?
    • storage?
  • Well, NumPy arrays are fast because the data is contiguous in one big block in memory.
    • You can use memory mapping to get file-backed NumPy arrays.
    • But then concurrent read/write access to the memory-mapped file gets very painful; it's easy to suffer corruption.
  • Concurrent read/write access to massive file-backed data that cannot fit in-memory is where databases excel.
    • !!AI And HDF5 doesn't offer concurrent read/write access, and in fact is poor at random writes, only fast for appends (I think?)
  • So storage and data modelling are where databases are useful.
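The file-backed arrays mentioned above can be sketched with NumPy's np.memmap; a minimal single-process sketch against a throwaway temporary file (path and sizes invented for illustration) — exactly the setup that becomes painful once a second writer shows up:

```python
import os
import tempfile

import numpy as np

# Create a file-backed array: the data lives on disk, and pages are
# mapped into memory on demand instead of being loaded up front.
path = os.path.join(tempfile.mkdtemp(), "arr.dat")
mm = np.memmap(path, dtype="float64", mode="w+", shape=(1000,))
mm[:] = np.arange(1000)
mm.flush()  # push dirty pages out to the file

# Reopen read-only, as a second reader would.
reader = np.memmap(path, dtype="float64", mode="r", shape=(1000,))
```

Nothing here coordinates concurrent writers — that coordination is precisely what a database provides.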

In [6]:
import pandas as pd
import pandasql

# Useful shim, saves typing
pysqldf = lambda q: pandasql.sqldf(q, globals())

# !!AI maybe use examples from Intro to Data Science course,
# it's identical to this.
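pandasql answers queries by loading the referenced DataFrames into an in-memory SQLite database and running the SQL there. The same round-trip can be sketched with only pandas and the stdlib sqlite3 module (the df table, columns, and values are invented for illustration):

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({"tag": ["numpy", "pandas", "numpy"],
                   "score": [3, 5, 7]})

# Round-trip through an in-memory SQLite database, much as pandasql does.
conn = sqlite3.connect(":memory:")
df.to_sql("df", conn, index=False)
result = pd.read_sql(
    "SELECT tag, SUM(score) AS total FROM df GROUP BY tag", conn)
```

The convenience pysqldf adds is just doing this copy automatically for every DataFrame the query mentions.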

Database data structures

  • They use B-trees, which make search and insert efficient; NumPy arrays can do neither efficiently.
    • With a NumPy array of a million entries, appending another million leaves no choice but to allocate a new two-million-entry array and copy everything over.
  • Databases know how to do disk-based I/O, and random I/O, very well. NumPy and pandas don't.
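That append cost is easy to demonstrate: np.append always allocates a fresh array and copies, it never grows the original in place.

```python
import numpy as np

a = np.arange(1000)
b = np.append(a, np.arange(1000))  # allocates 2000 slots, copies both inputs

assert b is not a                  # a brand-new array was created...
assert not np.shares_memory(a, b)  # ...backed by entirely new memory
assert len(a) == 1000              # the original is untouched
assert len(b) == 2000
```

A B-tree insert, by contrast, touches only the handful of nodes on the path to the insertion point.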

Using SQLAlchemy to deal with databases

  • The most interesting part isn't that it's an Object Relational Mapper (ORM).
  • It's built in layers, and the ORM layer is optional.

    • You can deal directly with tables and data types.
  • !!AI the speaker gives an SQLAlchemy tutorial via IPython Notebook.
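The table-level (Core) layer looks roughly like this; a sketch against an in-memory SQLite database rather than PostgreSQL, in SQLAlchemy 1.4+ style, with an invented posts table:

```python
from sqlalchemy import (Column, Integer, MetaData, String, Table,
                        create_engine, select)

# In-memory SQLite stands in for PostgreSQL here.
engine = create_engine("sqlite://")
metadata = MetaData()
posts = Table(
    "posts", metadata,
    Column("id", Integer, primary_key=True),
    Column("title", String),
)
metadata.create_all(engine)

# No ORM classes anywhere: just tables, columns, and expressions.
with engine.begin() as conn:
    conn.execute(posts.insert(), [{"id": 1, "title": "First post"}])

with engine.connect() as conn:
    rows = conn.execute(select(posts).where(posts.c.id == 1)).fetchall()
```

Swapping the engine URL for a PostgreSQL one is, at this layer, the only change needed.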

Moving onto StackOverflow data

  • Start with math.stackexchange.com Posts.xml.
  • Source (I think): https://archive.org/details/stackexchange (500MB 7-zip file!!)
  • Uses etree.iterparse because the XML file is massive; don't load it all into memory.
  • Commit every 1000 inserts; don't autocommit per row, and don't shove everything into one transaction.
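The iterparse-plus-batched-commits pattern might look like this; a sketch against stdlib sqlite3 with a tiny inline document standing in for the multi-gigabyte Posts.xml (the real importer targets PostgreSQL, and the attribute names here are only loosely based on the dump format):

```python
import io
import sqlite3
import xml.etree.ElementTree as etree

# Tiny stand-in for Posts.xml.
xml_bytes = b"""<posts>
  <row Id="1" Score="10"/>
  <row Id="2" Score="3"/>
</posts>"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, score INTEGER)")

BATCH = 1000
pending = 0
# iterparse yields elements as their closing tags are seen, so memory
# use stays flat no matter how big the file is -- provided we clear().
for event, elem in etree.iterparse(io.BytesIO(xml_bytes)):
    if elem.tag == "row":
        conn.execute("INSERT INTO posts VALUES (?, ?)",
                     (int(elem.get("Id")), int(elem.get("Score"))))
        pending += 1
        if pending % BATCH == 0:
            conn.commit()  # one commit per batch, not per row
    elem.clear()           # free the parsed element
conn.commit()              # flush the final partial batch

count = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
```

Per-row autocommit pays transaction overhead a million times over; one giant transaction bloats the write-ahead state. Batching is the middle ground.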

Crossing the boundary


In [8]:
# !!AI won't run, just the gist

import pandas.io.sql
import psycopg2

connection = psycopg2.connect()  # !!AI TODO fill in

math_by_date = pandas.io.sql.read_sql("""\
    SELECT ...
    FROM...
    WHERE ...
    AND .
    AND ...
    GROUP BY ...
""", connection)


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-8-cbea362ebd4b> in <module>()
     11     AND ...
     12     GROUP BY ...
---> 13 """, connection)

/usr/local/lib/python2.7/site-packages/pandas/io/sql.pyc in read_frame(sql, con, index_col, coerce_float, params)
    158         List of parameters to pass to execute method.
    159     """
--> 160     cur = execute(sql, con, params=params)
    161     rows = _safe_fetch(cur)
    162     columns = [col_desc[0] for col_desc in cur.description]

/usr/local/lib/python2.7/site-packages/pandas/io/sql.pyc in execute(sql, con, retry, cur, params)
     51     except Exception:
     52         try:
---> 53             con.rollback()
     54         except Exception:  # pragma: no cover
     55             pass

AttributeError: 'NoneType' object has no attribute 'rollback'
Error on sql     SELECT ...
    FROM...
    WHERE ...
    AND ..
    AND ...
    GROUP BY ...

ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line string', (1, 0))
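The traceback above boils down to the connection being None when pandas tried to roll back. With a live connection the boundary crossing works; here is a sketch with stdlib sqlite3 standing in for PostgreSQL and an invented posts table (columns and dates made up for illustration):

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER, created DATE, score INTEGER);
    INSERT INTO posts VALUES (1, '2013-01-01', 5);
    INSERT INTO posts VALUES (2, '2013-01-01', 7);
    INSERT INTO posts VALUES (3, '2013-02-01', 2);
""")

# The boundary crossing: SQL in, DataFrame out.
posts_by_date = pd.read_sql("""
    SELECT created, COUNT(*) AS n_posts, SUM(score) AS total_score
    FROM posts
    GROUP BY created
""", conn)
```

The database does the heavy aggregation; pandas only ever sees the small result set.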


In [9]:
# More work with pandas.io.sql

Semi-structured data analysis


In [ ]: