In [1]:
%load_ext pandas_td.ipython
The extension can also be loaded automatically by adding the following configuration to "~/.ipython/profile_default/ipython_config.py":
In [ ]:
c = get_config()
c.InteractiveShellApp.extensions = [
'pandas_td.ipython',
]
After loading the extension, type "%td" and press TAB to list the magic functions. %td_databases returns the list of databases:
In [2]:
%td_databases
Out[2]:
%td_tables returns the list of tables:
In [3]:
%td_tables sample
Out[3]:
%td_jobs returns the list of recently executed jobs:
In [4]:
%td_jobs
Out[4]:
%td_use is a special function that has side effects. First, it pushes table names into the current namespace:
In [5]:
%td_use sample_datasets
Printing a table name describes its columns:
In [6]:
nasdaq
Out[6]:
Tab completion is also supported.
The second effect of %td_use is that it implicitly changes the "default database", which is used when you write queries without database names (see the example after the first query below).
%%td_hive, %%td_pig, and %%td_presto are cell magic functions that run queries:
In [7]:
%%td_presto
select count(1) cnt
from nasdaq
Out[7]:
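Because %td_use sample_datasets is still in effect, the unqualified nasdaq above resolves to the sample_datasets database; spelling the database name out explicitly should give the same result (a minimal sketch of the equivalence):
%%td_presto
-- equivalent to the previous query; the default database is written out explicitly
select count(1) cnt
from sample_datasets.nasdaq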
The result of the query can be stored in a variable with -o:
In [8]:
%%td_presto -o df
select count(1) cnt
from nasdaq
In [9]:
df
Out[9]:
Or you can save the result to a file with -O:
In [10]:
%%td_presto -O './output.csv'
select count(1) cnt
from nasdaq
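The saved file can be read back into a DataFrame with pandas (a sketch, assuming the result is written as CSV as the file name suggests):
import pandas as pd

# load the query result saved by -O
df = pd.read_csv('./output.csv')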
Python-style variable substitution is supported:
In [11]:
start = '2010-01-01'
end = '2011-01-01'
In [12]:
%%td_presto
select count(1) cnt
from nasdaq
where td_time_range(time, '{start}', '{end}')
Out[12]:
You can preview the actual query with --dry-run (or -n):
In [13]:
%%td_presto -n
select count(1) cnt
from nasdaq
where td_time_range(time, '{start}', '{end}')
In [14]:
%%td_presto
select
-- Time-series index (yearly)
td_date_trunc('year', time) time,
-- Same as above
-- td_time_format(time, 'yyyy-01-01') time,
count(1) cnt
from
nasdaq
group by
1
limit
3
Out[14]:
In [15]:
%matplotlib inline
In [16]:
%%td_presto --plot
select
-- x-axis
td_date_trunc('year', time) time,
-- y-axis
min(low) low,
max(high) high
from
nasdaq
where
symbol = 'AAPL'
group by
1
Out[16]:
In practice, however, it is more efficient to run a rough aggregation on the server side and store the result in a variable for further local analysis:
In [17]:
%%td_presto -o df
select
-- daily summary
td_date_trunc('day', time) time,
min(low) low,
max(high) high,
sum(volume) volume
from
nasdaq
where
symbol = 'AAPL'
group by
1
In [18]:
# Use resample for local calculation (monthly maximum of the daily highs)
df['high'].resample('1M').max().plot()
Out[18]:
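Further local analysis works the same way; for example, a monthly average of the daily trading range (a sketch assuming df kept the low and high columns from the query above, indexed by time):
# monthly mean of the daily high-low spread, computed locally
(df['high'] - df['low']).resample('1M').mean().plot()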
--plot provides a shortcut for plotting "pivot charts" as a combination of pivot() and plot(). If the query result contains non-numeric columns, or columns whose names end with "_id", they are used as the columns parameter:
In [19]:
%%td_presto --plot
select
-- x-axis
td_date_trunc('month', time) time,
-- columns
symbol,
-- y-axis
avg(close) close
from
nasdaq
where
symbol in ('AAPL', 'MSFT')
group by
1, 2
Out[19]:
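Since --plot is a combination of pivot() and plot(), roughly the same chart can be built locally from a result stored with -o (a sketch assuming the result above was saved to df with time as the index and the symbol and close columns):
# one column per symbol, indexed by time, then plot
df.pivot(columns='symbol', values='close').plot()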
--pivot creates a pivot table from the result of the query. Like --plot, the first column is used as the index and the other non-numeric columns become the new columns:
In [20]:
%%td_presto --pivot
select
td_date_trunc('year', time) time,
symbol,
avg(close) close
from
nasdaq
where
td_time_range(time, '2010', '2015')
and symbol like 'AA%'
group by
1, 2
Out[20]:
In [21]:
%%td_presto -v --plot
select
td_date_trunc('year', time) time,
sum(volume) volume
from
nasdaq
group by
1
Out[21]: