In [2]:
%autosave 10


Autosaving every 10 seconds

Background of scientists

  • Bridging the gap between databases and pandas.
  • Scientists already use SQL, or are forced into tools like Microsoft SQL Server.
  • How to help them reach pandas?

Why should you care about databases at all? Example: StackOverflow

  • 18 million questions, 23GB of XML for posts
  • Multiple tables, relationships.
  • Popular tags? Trending tags over time?

Idea: store the data in a database, then load a post-processed subset into pandas. It'd be nice if we could do everything on a regular laptop.

!!AI isn't this why HDF5 was invented? Or is that only suitable for numeric data?

Goals

  • Importing data from PostgreSQL into pandas
  • Use SQLAlchemy without too much pain
  • Use pandasql to make data manipulation easier
  • Semi-structured data in PostgreSQL through HSTORE, SQLAlchemy
  • We are not talking about:
    • "Big data", this is quite small
    • Scalability.

What is a database

  • The Wikipedia definition is useless: by it, NumPy arrays, pandas DataFrames, and CSV files are all databases.
  • So what makes MySQL / PostgreSQL / Oracle databases?
    • querying languages?
    • data modelling?
    • storage?
  • Well, NumPy arrays are fast because the data is contiguous in one big block in memory.
    • You can use memory mapping to get file-backed NumPy arrays.
    • But then concurrent read/write access to the memory-mapped file gets very painful; it's easy to suffer corruption.
  • Concurrent read/write access to massive file-backed data that cannot fit in-memory is where databases excel.
    • !!AI And HDF5 doesn't offer concurrent read/write access, and in fact is poor at random writes, only fast for appends (I think?)
  • So storage and data modelling are where databases are useful.
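The file-backed arrays mentioned above can be sketched with NumPy's np.memmap; a minimal single-process sketch against a throwaway temporary file (path and sizes invented for illustration) — exactly the setup that becomes painful once a second writer shows up:

```python
import os
import tempfile

import numpy as np

# Create a file-backed array: the data lives on disk, and pages are
# mapped into memory on demand instead of being loaded up front.
path = os.path.join(tempfile.mkdtemp(), "arr.dat")
mm = np.memmap(path, dtype="float64", mode="w+", shape=(1000,))
mm[:] = np.arange(1000)
mm.flush()  # push dirty pages out to the file

# Reopen read-only, as a second reader would.
reader = np.memmap(path, dtype="float64", mode="r", shape=(1000,))
```

Nothing here coordinates concurrent writers — that coordination is precisely what a database provides.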

In [6]:
import pandas as pd
import pandasql

# Useful shim, saves typing
pysqldf = lambda q: pandasql.sqldf(q, globals())

# !!AI maybe use examples from Intro to Data Science course,
# it's identical to this.
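pandasql answers queries by loading the referenced DataFrames into an in-memory SQLite database and running the SQL there. The same round-trip can be sketched with only pandas and the stdlib sqlite3 module (the df table, columns, and values are invented for illustration):

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({"tag": ["numpy", "pandas", "numpy"],
                   "score": [3, 5, 7]})

# Round-trip through an in-memory SQLite database, much as pandasql does.
conn = sqlite3.connect(":memory:")
df.to_sql("df", conn, index=False)
result = pd.read_sql(
    "SELECT tag, SUM(score) AS total FROM df GROUP BY tag", conn)
```

The convenience pysqldf adds is just doing this copy automatically for every DataFrame the query mentions.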

Database data structures

  • They use B-trees, which make search and insert efficient; NumPy arrays can do neither efficiently.
    • With a NumPy array of a million entries, appending another million leaves no choice but to allocate a new two-million-entry array and copy everything over.
  • Databases know how to do disk-based I/O, and random I/O, very well. NumPy and pandas don't.
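That append cost is easy to demonstrate: np.append always allocates a fresh array and copies, it never grows the original in place.

```python
import numpy as np

a = np.arange(1000)
b = np.append(a, np.arange(1000))  # allocates 2000 slots, copies both inputs

assert b is not a                  # a brand-new array was created...
assert not np.shares_memory(a, b)  # ...backed by entirely new memory
assert len(a) == 1000              # the original is untouched
assert len(b) == 2000
```

A B-tree insert, by contrast, touches only the handful of nodes on the path to the insertion point.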

Using SQLAlchemy to deal with databases

  • The most interesting part isn't that it's an Object Relational Mapper (ORM).
  • It's built in layers, and the ORM layer is optional.

    • You can deal directly with tables and data types.
  • !!AI the speaker gives an SQLAlchemy tutorial via IPython Notebook.
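The table-level (Core) layer looks roughly like this; a sketch against an in-memory SQLite database rather than PostgreSQL, in SQLAlchemy 1.4+ style, with an invented posts table:

```python
from sqlalchemy import (Column, Integer, MetaData, String, Table,
                        create_engine, select)

# In-memory SQLite stands in for PostgreSQL here.
engine = create_engine("sqlite://")
metadata = MetaData()
posts = Table(
    "posts", metadata,
    Column("id", Integer, primary_key=True),
    Column("title", String),
)
metadata.create_all(engine)

# No ORM classes anywhere: just tables, columns, and expressions.
with engine.begin() as conn:
    conn.execute(posts.insert(), [{"id": 1, "title": "First post"}])

with engine.connect() as conn:
    rows = conn.execute(select(posts).where(posts.c.id == 1)).fetchall()
```

Swapping the engine URL for a PostgreSQL one is, at this layer, the only change needed.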

Moving onto StackOverflow data

  • Start with math.stackexchange.com Posts.xml.
  • Source (I think): https://archive.org/details/stackexchange (500MB 7-zip file!!)
  • Uses etree.iterparse because the XML file is massive; don't load it all into memory.
  • Commit every 1000 inserts; don't autocommit per row, and don't shove everything into one transaction.
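The iterparse-plus-batched-commits pattern might look like this; a sketch against stdlib sqlite3 with a tiny inline document standing in for the multi-gigabyte Posts.xml (the real importer targets PostgreSQL, and the attribute names here are only loosely based on the dump format):

```python
import io
import sqlite3
import xml.etree.ElementTree as etree

# Tiny stand-in for Posts.xml.
xml_bytes = b"""<posts>
  <row Id="1" Score="10"/>
  <row Id="2" Score="3"/>
</posts>"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, score INTEGER)")

BATCH = 1000
pending = 0
# iterparse yields elements as their closing tags are seen, so memory
# use stays flat no matter how big the file is -- provided we clear().
for event, elem in etree.iterparse(io.BytesIO(xml_bytes)):
    if elem.tag == "row":
        conn.execute("INSERT INTO posts VALUES (?, ?)",
                     (int(elem.get("Id")), int(elem.get("Score"))))
        pending += 1
        if pending % BATCH == 0:
            conn.commit()  # one commit per batch, not per row
    elem.clear()           # free the parsed element
conn.commit()              # flush the final partial batch

count = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
```

Per-row autocommit pays transaction overhead a million times over; one giant transaction bloats the write-ahead state. Batching is the middle ground.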

Crossing the boundary


In [8]:
# !!AI won't run, just the gist

import pandas.io.sql
import psycopg2

connection = psycopg2.connect()  # !!AI TODO fill in

math_by_date = pandas.io.sql.read_sql("""\
    SELECT ...
    FROM...
    WHERE ...
    AND .
    AND ...
    GROUP BY ...
""", connection)


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-8-cbea362ebd4b> in <module>()
     11     AND ...
     12     GROUP BY ...
---> 13 """, connection)

/usr/local/lib/python2.7/site-packages/pandas/io/sql.pyc in read_frame(sql, con, index_col, coerce_float, params)
    158         List of parameters to pass to execute method.
    159     """
--> 160     cur = execute(sql, con, params=params)
    161     rows = _safe_fetch(cur)
    162     columns = [col_desc[0] for col_desc in cur.description]

/usr/local/lib/python2.7/site-packages/pandas/io/sql.pyc in execute(sql, con, retry, cur, params)
     51     except Exception:
     52         try:
---> 53             con.rollback()
     54         except Exception:  # pragma: no cover
     55             pass

AttributeError: 'NoneType' object has no attribute 'rollback'
Error on sql     SELECT ...
    FROM...
    WHERE ...
    AND ..
    AND ...
    GROUP BY ...

ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line string', (1, 0))
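The traceback above boils down to the connection being None when pandas tried to roll back. With a live connection the boundary crossing works; here is a sketch with stdlib sqlite3 standing in for PostgreSQL and an invented posts table (columns and dates made up for illustration):

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER, created DATE, score INTEGER);
    INSERT INTO posts VALUES (1, '2013-01-01', 5);
    INSERT INTO posts VALUES (2, '2013-01-01', 7);
    INSERT INTO posts VALUES (3, '2013-02-01', 2);
""")

# The boundary crossing: SQL in, DataFrame out.
posts_by_date = pd.read_sql("""
    SELECT created, COUNT(*) AS n_posts, SUM(score) AS total_score
    FROM posts
    GROUP BY created
""", conn)
```

The database does the heavy aggregation; pandas only ever sees the small result set.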


In [9]:
# More work with pandas.io.sql

Semi-structured data analysis


In [ ]: