Magic functions

You can enable magic functions by loading pandas_td.ipython:


In [1]:
%load_ext pandas_td.ipython

It can be loaded automatically by the following configuration in "~/.ipython/profile_default/ipython_config.py":


In [ ]:
c = get_config()

c.InteractiveShellApp.extensions = [
    'pandas_td.ipython',
]

After loading the extension, type "%td" and press TAB to list magic functions:

List functions

%td_databases returns the list of databases:


In [2]:
%td_databases


Out[2]:
name count permission created_at updated_at
0 sample_datasets 8812278 query_only 2014-10-04 01:13:11+00:00 2015-01-09 01:07:14+00:00
1 tutorial 44629 full_access 2015-05-27 13:12:43+00:00 2015-05-27 13:12:43+00:00

%td_tables returns the list of tables:


In [3]:
%td_tables sample


Out[3]:
db_name name count estimated_storage_size last_log_timestamp created_at
0 sample_datasets nasdaq 8807278 168205061 1970-01-01 00:00:00+00:00 2014-10-08 02:57:38+00:00
1 sample_datasets www_access 5000 0 2014-10-04 01:13:15+00:00 2014-10-04 01:13:12+00:00

%td_jobs returns the list of recently executed jobs:


In [4]:
%td_jobs


Out[4]:
status job_id type start_at query
0 success 28870939 presto 2015-07-10 19:37:05+00:00 -- read_td_table('nasdaq')\nSELECT time, symbo...
1 success 28870917 presto 2015-07-10 19:36:05+00:00 -- read_td_table('nasdaq')\nSELECT time, symbo...
2 success 28870902 presto 2015-07-10 19:35:40+00:00 -- read_td_table('www_access')\nSELECT *\nFROM...

Use database

%td_use is a special function that has side effects. First, it pushes table names into the current namespace:


In [5]:
%td_use sample_datasets


INFO: import nasdaq
INFO: import www_access

By printing a table name, you can describe column names:


In [6]:
nasdaq


Out[6]:
field type alias
0 symbol string
1 open double
2 volume long
3 high double
4 low double
5 close double

Tab completion is also supported:

As the second effect of %td_use, it implicitly changes "default database", which is used when you write queries without database names.

Query functions

%%td_hive, %%td_pig, and %%td_presto are cell magic functions that run queries:


In [7]:
%%td_presto
select count(1) cnt
from nasdaq


Out[7]:
cnt
0 8807278

The result of the query can be stored in a variable by -o:


In [8]:
%%td_presto -o df
select count(1) cnt
from nasdaq

In [9]:
df


Out[9]:
cnt
0 8807278

Or you can save the result into a file by -O:


In [10]:
%%td_presto -O './output.csv'
select count(1) cnt
from nasdaq


INFO: saved to './output.csv'

Python-style variable substition is supported:


In [11]:
start = '2010-01-01'
end = '2011-01-01'

In [12]:
%%td_presto
select count(1) cnt
from nasdaq
where td_time_range(time, '{start}', '{end}')


Out[12]:
cnt
0 503399

You can preview the actual query by --dry-run (or -n):


In [13]:
%%td_presto -n
select count(1) cnt
from nasdaq
where td_time_range(time, '{start}', '{end}')


# translated code
_q = '''
select count(1) cnt
from nasdaq
where td_time_range(time, '2010-01-01', '2011-01-01')
'''
_e = td.create_engine('presto:sample_datasets')
_d = td.read_td_query(_q, _e)

Time-series index

With magic functions, "time" column is converted into time-series index automatically. You can use td_date_trunc() or td_time_format() in combination with GROUP BY for aggregation:


In [14]:
%%td_presto
select
    -- Time-series index (yearly)
    td_date_trunc('year', time) time,

    -- Same as above
    -- td_time_format(time, 'yyyy-01-01') time,

    count(1) cnt
from
    nasdaq
group by
    1
limit
    3


Out[14]:
cnt
time
1979-01-01 26967
1981-01-01 32930
1998-01-01 230757

Plotting

--plot is a convenient option for plotting. The first column represents x-axis. Other columns represent y-axis:


In [15]:
%matplotlib inline

In [16]:
%%td_presto --plot
select
    -- x-axis
    td_date_trunc('year', time) time,

    -- y-axis
    min(low) low,
    max(high) high
from
    nasdaq
where
    symbol = 'AAPL'
group by
    1


Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x106a9f5c0>

In practice, however, it is more efficient to execute rough calculation on the server side and store the result into a variable for further analysis:


In [17]:
%%td_presto -o df
select
    -- daily summary
    td_date_trunc('day', time) time,

    min(low) low,
    max(high) high,
    sum(volume) volume
from
    nasdaq
where
    symbol = 'AAPL'
group by
    1

In [18]:
# Use resample for local calculation
df['high'].resample('1m', how='max').plot()


Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x108f92a90>

--plot provides a shortcut way of plotting "pivot charts", as a combination of pivot() and plot(). If the query result contains non-numeric columns, or column names ending with "_id", they are used as columns parameter:


In [19]:
%%td_presto --plot
select
    -- x-axis
    td_date_trunc('month', time) time,

    -- columns
    symbol,

    -- y-axis
    avg(close) close 
from
    nasdaq
where
    symbol in ('AAPL', 'MSFT')
group by
    1, 2


Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x1095b6320>

Pivot tables

--pivot creates a pivot table from the result of query. Like --plot, the first column represents index and other non-numeric columns represents new columns:


In [20]:
%%td_presto --pivot
select
    td_date_trunc('year', time) time,
    symbol,
    avg(close) close 
from
    nasdaq
where
    td_time_range(time, '2010', '2015')
    and symbol like 'AA%'
group by
    1, 2


Out[20]:
symbol AAIT AAL AAME AAOI AAON AAPL AAVL AAWW AAXJ
time
2010-01-01 NaN NaN 1.533376 NaN 6.969612 37.134102 NaN 50.654775 57.195677
2011-01-01 NaN NaN 2.011771 NaN 8.949413 52.000335 NaN 51.993968 57.992659
2012-01-01 27.083288 NaN 2.702549 NaN 8.721841 82.292882 NaN 46.915226 55.078226
2013-01-01 28.865318 25.754375 3.718914 12.677075 14.904996 67.518991 NaN 43.477718 58.670437
2014-01-01 33.147182 34.268635 3.845645 20.009516 19.857238 86.402960 29.485122 35.345806 61.125349

Verbose output

By passing -v (--verbose) option, you can print pseudo Python code that was executed by the magic function.


In [21]:
%%td_presto -v --plot
select
    td_date_trunc('year', time) time,
    sum(volume) volume
from
    nasdaq
group by
    1


# issued at 2015-11-22T05:28:14Z
URL: https://console.treasuredata.com/jobs/40273945
# started at 2015-11-22T05:28:15Z
** WARNING: time index filtering is not set on 
** This query could be very slow as a result.
** Please see http://docs.treasure-data.com/articles/presto-performance-tuning#leveraging-time-based-partitioning
2015-11-22 05:28:33:   rows  pending  running   done / total
            Stage-0:     37        0        0      1 /     1
            Stage-1:   1.4K        0        0     24 /    24
            Stage-2:   8.8M        0        0     38 /    38
Result size: 466 bytes
Download: 466 / 466 bytes (100.00%)
# downloaded at 2015-11-22T05:28:35Z
# translated code
_q = '''
select
    td_date_trunc('year', time) time,
    sum(volume) volume
from
    nasdaq
group by
    1
'''
_e = td.create_engine('presto:sample_datasets', show_progress=True, clear_progress=False)
_d = td.read_td_query(_q, _e)
_d['time'] = pd.to_datetime(_d['time'], unit='s')
_d.set_index('time', inplace=True)
_d.plot()
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x109689a90>

In [ ]: