SQL methods

Now let's play with databases. MySQL and SQLite databases are accessed in a very similar fashion in Python.

Let's install the different packages

pip install MySQL-python
pip install sqlalchemy
pip install ipython-sql
pip install dbtools

SQLite and MySQL have very similar features and query languages, but also some subtle differences. The main difference is that MySQL requires a running server to host the data, while SQLite stores everything in a single local file (a short sketch after the lists below makes this concrete). Some pros and cons:

SQLite:

  • easier to setup
  • great for temporary (testing) databases
  • great for rapid development
  • great for embedding in an application
  • doesn't have user management
  • doesn't have many performance features
  • doesn't scale well.

MySQL:

  • far more difficult/complex to set up
  • better options for performance tuning
  • can scale well if tuned properly
  • can manage users, permissions, etc.
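
To make the file-versus-server difference concrete, here is a minimal sketch; the MySQL lines reuse the course server credentials that appear later in this notebook, and are commented out since they need that server to be reachable:

# SQLite: the whole database is a single local file, no server needed
import sqlite3
db = sqlite3.connect('mysqlite.db')   # creates the file if it doesn't exist yet
db.close()

# MySQL: you connect to a running server instead
#import MySQLdb
#db = MySQLdb.connect(host="10.40.20.14", user="student",
#                     passwd="DTU63Course", db="coursedb")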

Start with SQLite; you can then test out MySQL by setting

sql_type = 'mysql'

In [1]:
sql_type = 'mysql'

In [2]:
user_name = "pire"

In [3]:
tbname = "py4we_" + user_name

Importing the logit magic for benchmarking the different methods


In [4]:
%load_ext logit

Loading Data

Let's load the data previously saved in JSON format


In [5]:
import json
with open('tmp.json','r') as f:
    json_dic = json.load(f)
json_dic.keys()


Out[5]:
[u'latitude', u'dates', u'orientation', u'longitude', u'weekdays']

In [6]:
# First, the dates could be converted to datetime objects (left commented out here)
from datetime import datetime
#fmt = "%Y%m%d %H:%M:%S"
#json_dic['dates'] = [datetime.strptime(r, fmt) for r in json_dic['dates']]

# Then build a numpy array containing all the information
from numpy import array
db_data = array([json_dic[c] for c in ['dates', 'weekdays', 'latitude', 'longitude', 'orientation']]).T
db_data[:10]


Out[6]:
array([[u'20131114 06:05:19', u'Thursday', u'42.6040077209',
        u'13.3692359924', u'327.700012207'],
       [u'20131114 06:05:48', u'Thursday', u'42.6041603088',
        u'13.3694620132', u'325.399993896'],
       [u'20131114 06:06:19', u'Thursday', u'42.6043243408',
        u'13.369717598', u'323.700012207'],
       [u'20131114 06:06:48', u'Thursday', u'42.6044769287',
        u'13.3699598312', u'325.600006104'],
       [u'20131114 06:07:19', u'Thursday', u'42.6046409607',
        u'13.37022686', u'327.5'],
       [u'20131114 06:07:49', u'Thursday', u'42.6047935486',
        u'13.3704900742', u'327.799987793'],
       [u'20131114 06:08:18', u'Thursday', u'42.6049423218',
        u'13.3707456589', u'330.399993896'],
       [u'20131114 06:08:49', u'Thursday', u'42.6050987244',
        u'13.3710184097', u'331.200012207'],
       [u'20131114 06:09:19', u'Thursday', u'42.6052474976',
        u'13.3712882996', u'331.100006104'],
       [u'20131114 06:09:49', u'Thursday', u'42.6053924561',
        u'13.3715610504', u'331.600006104']], 
      dtype='<U17')

IPython SQL

This is the easiest way to use MySQL (or SQLite), but it only works within IPython notebooks, and it has some limitations.

First you need to load the sql magic.


In [5]:
%load_ext sql

It works for both SQLite and MySQL (and PostgreSQL, if you are into that sort of thing). The connection string is a SQLAlchemy URL of the form driver://user:password@host/database (or sqlite:///filename for a local SQLite file).


In [8]:
if sql_type == 'sqlite':
    ## Connect
    %sql sqlite:///mysqlite.db
    # Print out what are the tables
    res = %sql SELECT * FROM sqlite_master WHERE type='table'
elif sql_type == 'mysql':
    ## Connect
    %sql mysql://student:DTU63Course@10.40.20.14/coursedb 
    ## Print out what are the tables
    res = %sql SHOW TABLES
res


6 rows affected.
Out[8]:
Tables_in_coursedb
'py4we_pire'
py4we_exercise
py4we_pire
py4we_tb
py4we_tb_pd
test_sql

Delete the table


In [9]:
%sql DROP TABLE test_sql


0 rows affected.
Out[9]:
[]

Create the Table


In [10]:
%%sql 
CREATE TABLE test_sql
(
    dates VARCHAR(20),
    weekdays VARCHAR(10),
    latitude DOUBLE,
    longitude DOUBLE,
    orientation DOUBLE
);


0 rows affected.
Out[10]:
[]

It's possible to insert the data into the database line by line, but it's going to be painfully slow for large datasets.

Notice the way Python variables are passed into the SQL query by prefixing them with ":". It's a bit tricky, because the variable name is replaced by its value surrounded by quotes. Unfortunately, there is a small bug when you try to use this mechanism for the table name: :table_name is replaced by a quoted string, which is not valid SQL in that position.
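
As a small hedged illustration (run it after the insertions below if you want rows back; the second query is kept as a comment precisely because it fails):

day = 'Thursday'
# Value binding works: :day is replaced by 'Thursday'
res = %sql SELECT * FROM test_sql WHERE weekdays = :day

# Binding the table name does not: :tbname becomes a quoted string, which is not a
# valid identifier, so a query like
#     %sql SELECT * FROM :tbname
# fails. The workaround used later in this notebook is simply to build the query
# string in Python, e.g. c.execute("SELECT * FROM " + tbname).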


In [11]:
N = 20
for date, weekd, latitude, longitude, orientation in db_data[:N]:
    res = %sql INSERT INTO test_sql VALUES (:date, :weekd, :latitude, :longitude, :orientation);
%sql SELECT * FROM test_sql


1 rows affected.
1 rows affected.
1 rows affected.
1 rows affected.
1 rows affected.
1 rows affected.
1 rows affected.
1 rows affected.
1 rows affected.
1 rows affected.
1 rows affected.
1 rows affected.
1 rows affected.
1 rows affected.
1 rows affected.
1 rows affected.
1 rows affected.
1 rows affected.
1 rows affected.
1 rows affected.
20 rows affected.
Out[11]:
dates weekdays latitude longitude orientation
20131114 06:05:19 Thursday 42.6040077209 13.3692359924 327.700012207
20131114 06:05:48 Thursday 42.6041603088 13.3694620132 325.399993896
20131114 06:06:19 Thursday 42.6043243408 13.369717598 323.700012207
20131114 06:06:48 Thursday 42.6044769287 13.3699598312 325.600006104
20131114 06:07:19 Thursday 42.6046409607 13.37022686 327.5
20131114 06:07:49 Thursday 42.6047935486 13.3704900742 327.799987793
20131114 06:08:18 Thursday 42.6049423218 13.3707456589 330.399993896
20131114 06:08:49 Thursday 42.6050987244 13.3710184097 331.200012207
20131114 06:09:19 Thursday 42.6052474976 13.3712882996 331.100006104
20131114 06:09:49 Thursday 42.6053924561 13.3715610504 331.600006104
20131114 06:10:19 Thursday 42.6055335999 13.3718280792 331.700012207
20131114 06:10:48 Thursday 42.6056747437 13.3720989227 328.700012207
20131114 06:11:18 Thursday 42.6058120728 13.3723545074 326.0
20131114 06:11:49 Thursday 42.6059684753 13.3726148605 320.600006104
20131114 06:12:18 Thursday 42.6061248779 13.3728408813 318.600006104
20131114 06:12:49 Thursday 42.6062736511 13.3730554581 321.5
20131114 06:13:18 Thursday 42.6064224243 13.3732671738 323.399993896
20131114 06:13:49 Thursday 42.6065788269 13.3734874725 325.399993896
20131114 06:14:18 Thursday 42.6067237854 13.3736925125 327.899993896
20131114 06:14:49 Thursday 42.6068840027 13.373919487 327.600006104

It's easy to obtain a pandas.DataFrame from the query result


In [12]:
results = %sql SELECT * FROM test_sql
pres = results.DataFrame()
plot(pres.latitude, pres.longitude, '.')


20 rows affected.
Out[12]:
[<matplotlib.lines.Line2D at 0x111de3ed0>]

Using SQLite3 and MySQLdb directly

Using the SQLite3 and MySQLdb modules


In [13]:
### You can select here which method to use: sqlite or mysql
if sql_type == 'sqlite':
    import sqlite3
    db = sqlite3.connect('mysqlite.db')
elif sql_type == 'mysql':
    import MySQLdb
    db = MySQLdb.connect(host="10.40.20.14", user="student", passwd="DTU63Course", db="coursedb")
    
c = db.cursor()

(Drop and) Create the table


In [14]:
try: 
    c.execute("DROP TABLE "+tbname)
except: 
    print tbname+' doesn\'t exist yet'

c.execute("""
CREATE TABLE %s (
    dates VARCHAR(20),
    weekdays VARCHAR(10),
    latitude DOUBLE,
    longitude DOUBLE,
    orientation DOUBLE
)"""%(tbname))


Out[14]:
0L

Insert all the data and read it


In [15]:
%%logit executemany, mysqlite.db, sql_type
# Insert all the data from db_data
if sql_type == 'sqlite':
    query_insert = 'INSERT INTO '+tbname+' VALUES (?,?,?,?,?)'
elif sql_type == 'mysql':
    query_insert = 'INSERT INTO '+tbname+' VALUES (%s,%s,%s,%s,%s)'

    

## Notice the executemany instead of execute. It's much faster when inserting data.
c.executemany(query_insert, db_data.tolist())

# Don't forget to commit if you want to save your modifications to the database
db.commit()


executemany_mysql :
14-01-30 00:16:47
File size: -1 MB
Creation time: 3.268174 sec
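
For comparison, the row-by-row equivalent is sketched below, using the same query_insert and cursor as above; it does exactly the same job with one round trip per row, which is why executemany is preferred.

# Row-by-row insertion: the slow alternative to executemany
# (note: running this after the cell above would insert every row a second time)
for row in db_data.tolist():
    c.execute(query_insert, row)
db.commit()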

In [16]:
# Read all the data back
c.execute('SELECT * FROM '+tbname)

#Print the first ten rows
for row in c.fetchall()[:10]:
    print row

# Close the cursor (the db connection itself stays open; it is reused by pandas below)
c.close()


('20131114 06:05:19', 'Thursday', 42.6040077209, 13.3692359924, 327.700012207)
('20131114 06:05:48', 'Thursday', 42.6041603088, 13.3694620132, 325.399993896)
('20131114 06:06:19', 'Thursday', 42.6043243408, 13.369717598, 323.700012207)
('20131114 06:06:48', 'Thursday', 42.6044769287, 13.3699598312, 325.600006104)
('20131114 06:07:19', 'Thursday', 42.6046409607, 13.37022686, 327.5)
('20131114 06:07:49', 'Thursday', 42.6047935486, 13.3704900742, 327.799987793)
('20131114 06:08:18', 'Thursday', 42.6049423218, 13.3707456589, 330.399993896)
('20131114 06:08:49', 'Thursday', 42.6050987244, 13.3710184097, 331.200012207)
('20131114 06:09:19', 'Thursday', 42.6052474976, 13.3712882996, 331.100006104)
('20131114 06:09:49', 'Thursday', 42.6053924561, 13.3715610504, 331.600006104)
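
Note that fetchall() loads the whole result into memory before it is sliced. For large tables you can iterate over the cursor instead; a minimal sketch, using a fresh cursor on the same db connection:

c2 = db.cursor()
c2.execute('SELECT * FROM ' + tbname)
for row in c2:
    # process one row at a time without holding the full result in memory
    pass
c2.close()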

Little bit of fun with pandas

Pandas has a SQL interface that can use the sqlite3 or MySQLdb connectors


In [17]:
import pandas as pd
import pandas.io.sql as psql
df1 = psql.frame_query('SELECT * FROM '+tbname, db)
# Indicate that the index is of date format
df1.index = pd.DatetimeIndex(df1.dates) 
df1.head()


Out[17]:
dates weekdays latitude longitude orientation
2013-11-14 06:05:19 20131114 06:05:19 Thursday 42.604008 13.369236 327.700012
2013-11-14 06:05:48 20131114 06:05:48 Thursday 42.604160 13.369462 325.399994
2013-11-14 06:06:19 20131114 06:06:19 Thursday 42.604324 13.369718 323.700012
2013-11-14 06:06:48 20131114 06:06:48 Thursday 42.604477 13.369960 325.600006
2013-11-14 06:07:19 20131114 06:07:19 Thursday 42.604641 13.370227 327.500000

In [18]:
df1.plot()


Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x1119225d0>

You can also use pandas directly to write to your database


In [19]:
%%logit pandas_write_frame, mysqlite.db, sql_type
if psql.table_exists(tbname, db, sql_type):
    psql.execute('DROP TABLE '+tbname, db)
psql.write_frame(df1, tbname, db, flavor=sql_type)


pandas_write_frame_mysql :
14-01-30 00:17:31
File size: -1 MB
Creation time: 2.08852 sec

You can also time it with SQLite by setting sql_type = 'sqlite' at the top of the notebook and rerunning the relevant cells.

And reading from it again:


In [20]:
psql.frame_query('SELECT * FROM '+tbname, db).head()


Out[20]:
dates weekdays latitude longitude orientation
0 20131114 06:05:19 Thursday 42.6040 13.3692 327.7
1 20131114 06:05:48 Thursday 42.6042 13.3695 325.4
2 20131114 06:06:19 Thursday 42.6043 13.3697 323.7
3 20131114 06:06:48 Thursday 42.6045 13.3700 325.6
4 20131114 06:07:19 Thursday 42.6046 13.3702 327.5

dbTools

dbTools is a tool to access SQLite data files. It is inspired by SQLAlchemy, but is much simpler to use (and also less powerful and less stable)


In [21]:
from dbtools import Table
try: Table("mysqlite_dtools.db", tbname).drop()
except: pass
tbl = Table.create("mysqlite_dtools.db", tbname,
    [('dates', str),
     ('weekdays', str),
     ('latitude', float),
     ('longitude', float),
     ('orientation', float)])
tbl.columns


Out[21]:
(u'dates', u'weekdays', u'latitude', u'longitude', u'orientation')

Once it's created, you can just use it this way. It's very slow, so we reduce it to the first 10,000 entries.


In [22]:
%%logit dbtools_insertlist, mysqlite_dtools.db, sqlite
tbl = Table("mysqlite_dtools.db", tbname)
tbl.insert(db_data[:10000])


dbtools_insertlist_sqlite :
14-01-30 00:17:43
File size: 6.517578125 MB
Creation time: 5.694544 sec

The select() function directly returns a pandas DataFrame


In [23]:
df2 = tbl.select()
df2.plot()


Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x111df5c50>

The select() function can also be used like a SELECT query


In [24]:
### This is equivalent to this query:
### SELECT orientation, latitude FROM py4we_pire WHERE weekdays="Thursday";
### or to df3 = df2[df2.weekdays=='Thursday'][['orientation', 'latitude']]
df3 = tbl.select(['orientation', 'latitude'], where='weekdays="Thursday"')
df3.plot()
df3.head()


Out[24]:
orientation latitude
0 327.700012 42.604008
1 325.399994 42.604160
2 323.700012 42.604324
3 325.600006 42.604477
4 327.500000 42.604641

It's possible to use a pandas.DataFrame as input to create a new table, but it's even slower:


In [25]:
%%logit dbtools_insertpandas, mysqlite_dtools2.db, sqlite
try: Table("mysqlite_dtools.db", tbname).drop()
except: print tbname, "doesn't exist"
Table.create("mysqlite_dtools.db", tbname, df2)


dbtools_insertpandas_sqlite :
14-01-30 00:17:52
File size: 0.642578125 MB
Creation time: 22.784908 sec

Unfortunately the module is still in its infancy, and can't (yet?) work with SQLite files that weren't created by it


In [26]:
df5 = Table("mysqlite.db", tbname).select()


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-26-ae0eeea4b26b> in <module>()
----> 1 df5 = Table("mysqlite.db", tbname).select()

/Users/pire/venv/openmdao/lib/python2.7/site-packages/dbtools/table.pyc in __init__(self, db, name, verbose)
    267         # parse the response -- it will look like 'CREATE TABLE
    268         # name(col1 TYPE, col2 TYPE, ...)'
--> 269         args = re.match("([^\(]*)\((.*)\)", info[0][0]).groups()[1]
    270 
    271         # compute repr string

AttributeError: 'NoneType' object has no attribute 'groups'

Plotting


In [27]:
from logit import plot_stats
df = plot_stats()


Exercise

You are now going to apply your knowledge to upload the results from the previous exercises to a MySQL database. You should upload the results logged in the JSON to this table: py4we_exercise.

The py4we_results table is organised as followed:

  • userid: Your own user id used in this class so far (e.g. a 4-letter code)
  • date: The date/time you ran the test
  • testid: The test id
  • filesize: The file size found in the test (when irrelevant, use -1)
  • time: The time it took to run the test (when irrelevant, use -1)
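
As a purely hypothetical sketch of what appending a single row with the %sql magic would look like (the values are made up, and the INSERT is kept commented out so they are not actually written to the shared table):

userid, date, testid, filesize, time = 'pire', '14-01-30 00:20:00', 'readlines', 6.33, 0.02
#res = %sql INSERT INTO py4we_exercise VALUES (:userid, :date, :testid, :filesize, :time)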

In [36]:
%sql mysql://student:DTU63Course@10.40.20.14/coursedb 
%sql DESCRIBE py4we_exercise


5 rows affected.
Out[36]:
Field Type Null Key Default Extra
userid varchar(10) YES None
date varchar(20) YES None
testid varchar(10) YES None
filesize double YES None
time double YES None

In [48]:
%sql SELECT * FROM py4we_exercise LIMIT 10


10 rows affected.
Out[48]:
userid date testid filesize time
pire 14-01-29 23:53:57 readlines 6.32970333099 0.022118
pire 14-01-29 23:53:58 read_csv 6.32970333099 0.224039
pire 14-01-29 23:54:10 read_csv_module 6.32970333099 0.232334
pire 14-01-29 23:55:14 save_csv 6.19353675842 0.101102
pire 14-01-29 23:55:53 save_csv 6.19353675842 0.089683
pire 14-01-29 23:56:35 save_csv_module 6.32970333099 0.240466
pire 14-01-29 23:56:41 save_csv_module 6.32970333099 0.232233
pire 14-01-29 23:57:27 savetxt 17.0224962234 1.086016
pire 14-01-29 23:58:04 savetxt 5.44677352905 0.037129
pire 14-01-29 23:58:09 savetxt 5.44666290283 0.010352

DO NOT ERASE THE RESULTS OF YOUR COLLEAGUES, WE WILL HUNT YOU DOWN!!! USE APPEND

Backing up


In [1]:
import sqlite3
import pandas as pd
import pandas.io.sql as psql
import MySQLdb
    
dbmysql = MySQLdb.connect(host="10.40.20.14", user="student", passwd="DTU63Course", db="coursedb")
df = pd.io.sql.frame_query('SELECT * FROM py4we_exercise', dbmysql)
dblite = sqlite3.connect('sqlite_backup.db')
pd.io.sql.write_frame(df, 'py4we_exercise', dblite, flavor='sqlite', if_exists='append')

Solution:


In [ ]:
import sqlite3
import pandas as pd
import MySQLdb

dblite = sqlite3.connect('sqlite_logger.db')
df = pd.io.sql.frame_query('SELECT * FROM logger', dblite)
dbmysql = MySQLdb.connect(host="10.40.20.14", user="student", passwd="DTU63Course", db="coursedb")
# Append, so the existing results are kept
pd.io.sql.write_frame(df, 'py4we_exercise', dbmysql, flavor='mysql', if_exists='append')

A little bit more fun with Pandas


In [2]:
df.head()


Out[2]:
userid date testid filesize time
0 pire 14-01-29 23:53:57 readlines 6.329703 0.022118
1 pire 14-01-29 23:53:58 read_csv 6.329703 0.224039
2 pire 14-01-29 23:54:10 read_csv_module 6.329703 0.232334
3 pire 14-01-29 23:55:14 save_csv 6.193537 0.101102
4 pire 14-01-29 23:55:53 save_csv 6.193537 0.089683

Plotting the benchmarks

This line performs the following manipulations (a step-by-step version is shown after the plot below):

  • group by userid and testid
  • take the time column
  • average it
  • reset the index
  • pivot the table so that the userids become column names instead of values, indexed by testid
  • plot with horizontal bars.

In [32]:
df.groupby(['userid', 'testid'])['time'].mean().reset_index() \
        .pivot(index='testid', columns='userid', values='time').plot(kind='barh')


Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x106e62fd0>
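
For readability, the same chained expression can be written out step by step (a sketch operating on the df loaded above):

# Same manipulation, decomposed
mean_times = df.groupby(['userid', 'testid'])['time'].mean()  # average time per user/test
mean_times = mean_times.reset_index()                         # back to a flat DataFrame
benchmarks = mean_times.pivot(index='testid', columns='userid', values='time')
benchmarks.plot(kind='barh')                                  # one horizontal bar group per test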

For maintenance purposes


In [46]:
#sql DROP TABLE py4we_exercise

In [ ]:
##sql 
CREATE TABLE py4we_exercise (
    userid VARCHAR(10),
    date VARCHAR(20),
    testid VARCHAR(50),
    filesize DOUBLE,
    time DOUBLE
);