Getting Started

Pandas-TD allows you to read a query result into a DataFrame and to store a DataFrame to Treasure Data.

Before starting any session, you need to set your API key to the environment variable TD_API_KEY:

export TD_API_KEY=1234/abcd...

You can initialize your connection by connect:


In [1]:
%matplotlib inline

import os
import numpy as np
import pandas as pd
import pandas_td as td

# Initialize connection
con = td.connect(apikey=os.environ['TD_API_KEY'], endpoint='https://api.treasuredata.com/')

Using a connection, you can use some convenient methods like databases and tables:


In [2]:
# Get a list of databases
con.databases()


Out[2]:
name count permission created_at updated_at
0 sample_datasets 8812278 query_only 2014-10-04 01:13:11+00:00 2015-01-09 01:07:14+00:00
1 tutorial 19 full_access 2015-05-27 13:12:43+00:00 2015-05-27 13:12:43+00:00

In [3]:
# Get a list of tables
con.tables('sample_datasets')


Out[3]:
name count estimated_storage_size last_log_timestamp created_at
0 nasdaq 8807278 168205061 1970-01-01 00:00:00+00:00 2014-10-08 02:57:38+00:00
1 www_access 5000 0 2014-10-04 01:13:15+00:00 2014-10-04 01:13:12+00:00

You can choose a query engine and use it to run a queby by read_td:


In [4]:
# Select a query engine
presto = con.query_engine(database='sample_datasets', type='presto')

# Run a query
td.read_td('select * from nasdaq limit 3', presto)


Out[4]:
symbol open volume high low close time
0 ASNA 16.9600 1087798 17.06 16.89 17.01 1410364800
1 ADRD 24.8100 725 24.87 24.81 24.87 1410364800
2 ARIS 3.0799 39918 3.20 3.00 3.15 1410364800

Alternatively, you can read a table into a DataFrame by read_td_table, optionally with a specific time range, sample rate, and/or a limit:


In [5]:
# Read from a table with 5-percent sampling
df = td.read_td_table('nasdaq', presto, time_range=('2000-01-01', '2010-01-01'), sample=0.05, limit=10000)

Pandas-TD also supports importing a DataFrame into a table by to_td:


In [6]:
# Create a DataFrame with random values
df = pd.DataFrame(np.random.rand(3, 3), columns=['x', 'y', 'z'])

# Import it into 'tutorial.tut1'
td.to_td(df, 'tutorial.tut1', con, if_exists='replace', index=False)

In [7]:
# Check the result
td.read_td('select * from tutorial.tut1', presto)


Out[7]:
index z y x time
0 0 0.898739 0.975951 0.712556 1433052180
1 1 0.877093 0.343772 0.736913 1433052180
2 2 0.294983 0.197632 0.813423 1433052180

read_td_query


In [8]:
help(td.read_td_query)


Help on function read_td_query in module pandas_td.td:

read_td_query(query, engine, index_col=None, params=None, parse_dates=None)
    Read Treasure Data query into a DataFrame.
    
    Returns a DataFrame corresponding to the result set of the query string.
    Optionally provide an index_col parameter to use one of the columns as
    the index, otherwise default integer index will be used.
    
    Parameters
    ----------
    query : string
        Query string to be executed.
    engine : QueryEngine
        Handler returned by Connection.query_engine.
    index_col : string, optional
        Column name to use as index for the returned DataFrame object.
    params : dict, optional
        Parameters to pass to execute method.
    parse_dates : list or dict, optional
        - List of column names to parse as dates
        - Dict of {column_name: format string} where format string is strftime
          compatible in case of parsing string times or is one of (D, s, ns, ms, us)
          in case of parsing integer timestamps
    
    Returns
    -------
    DataFrame

Examples


In [9]:
# Convert "time" to a time series index
df = td.read_td_query("select time, close from nasdaq where symbol='AAPL'",
                      presto,
                      index_col='time',
                      parse_dates={'time': 's'})
df.plot()


Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x107539748>

read_td_table


In [10]:
help(td.read_td_table)


Help on function read_td_table in module pandas_td.td:

read_td_table(table_name, engine, index_col=None, parse_dates=None, columns=None, time_range=None, sample=None, limit=10000)
    Read Treasure Data table into a DataFrame.
    
    The number of returned rows is limited by "limit" (default 10,000).
    Setting limit=None means all rows. Be careful when you set limit=None
    because your table might be very large and the result does not fit into memory.
    
    Parameters
    ----------
    table_name : string
        Name of Treasure Data table in database.
    engine : QueryEngine
        Handler returned by Connection.query_engine.
    index_col : string, optional
        Column name to use as index for the returned DataFrame object.
    parse_dates : list or dict, optional
        - List of column names to parse as dates
        - Dict of {column_name: format string} where format string is strftime
          compatible in case of parsing string times or is one of (D, s, ns, ms, us)
          in case of parsing integer timestamps
    columns : list, optional
        List of column names to select from table.
    time_range : tuple (start, end), optional
        Limit time range to select. "start" and "end" are one of None, integers,
        strings or datetime objects. "end" is exclusive, not included in the result.
    sample : double, optional
        Enable sampling data (Presto only). 1.0 means all data (100 percent).
        See TABLESAMPLE BERNOULLI at https://prestodb.io/docs/current/sql/select.html
    limit : int, default 10,000
        Maximum number of rows to select.
    
    Returns
    -------
    DataFrame

Examples


In [11]:
# Read all records (up to 10,000 rows by default)
df = td.read_td_table("www_access", presto)
df.head(3)


Out[11]:
user host path referer code agent size method time
0 None 192.225.229.196 /category/software - 200 Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; r... 117 GET 1412382292
1 None 120.168.215.131 /category/software - 200 Mozilla/5.0 (compatible; Googlebot/2.1; +http:... 53 GET 1412382284
2 None 180.198.173.136 /category/electronics /category/computers 200 Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.1... 106 GET 1412382275

In [12]:
# From 1990 to 2010, 1-percent sample, no limit
df = td.read_td_table("nasdaq",
                      presto,
                      index_col='time',
                      parse_dates={'time': 's'},
                      columns=['time', 'symbol', 'close'],
                      time_range=('1990', '2010'),
                      sample=0.01,
                      limit=None)
len(df)


Out[12]:
54797

In [13]:
df[df.symbol=='AAPL'].plot()


Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x109d4d2b0>

to_td


In [14]:
help(td.to_td)


Help on function to_td in module pandas_td.td:

to_td(frame, name, con, if_exists='fail', time_col=None, time_index=None, index=True, index_label=None, chunksize=10000)
    Write a DataFrame to a Treasure Data table.
    
    This method converts the dataframe into a series of key-value pairs
    and send them using the Treasure Data streaming API. The data is divided
    into chunks of rows (default 10,000) and uploaded separately. If upload
    failed, the client retries the process for a certain amount of time
    (max_cumul_retry_delay; default 600 secs). This method may fail and
    raise an exception when retries did not success, in which case the data
    may be partially inserted. Use the bulk import utility if you cannot
    accept partial inserts.
    
    Parameters
    ----------
    frame : DataFrame
        DataFrame to be written.
    name : string
        Name of table to be written, in the form 'database.table'.
    con : Connection
        Connection to a Treasure Data account.
    if_exists: {'fail', 'replace', 'append'}, default 'fail'
        - fail: If table exists, do nothing.
        - replace: If table exists, drop it, recreate it, and insert data.
        - append: If table exists, insert data. Create if does not exist.
    time_col : string, optional
        Column name to use as "time" column for the table. Column type must be
        integer (unixtime) or datetime. If None is given (default), then the current
        time is used as time values.
    time_index : int, optional
        Level of index to use as "time" column for the table. Set 0 for a single index.
        This parameter implies index=False.
    index : boolean, default True
        Write DataFrame index as a column.
    index_label : string or sequence, default None
        Column label for index column(s). If None is given (default) and index is True,
        then the index names are used. A sequence should be given if the DataFrame uses
        MultiIndex.
    chunksize : int, default 10,000
        Number of rows to be inserted in each chunk from the dataframe.

Examples


In [15]:
# Create a DataFrame with random values
df = pd.DataFrame(np.random.rand(3, 3), columns=['x', 'y', 'z'])

to_td will fail if table already exists:


In [16]:
td.to_td(df, 'tutorial.tut1', con)


---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-16-3ee20f431658> in <module>()
----> 1 td.to_td(df, 'tutorial.tut1', con)

/Users/knishida/projects/pandas-td/pandas_td/td.py in to_td(frame, name, con, if_exists, time_col, time_index, index, index_label, chunksize)
    360             con.client.create_log_table(database, table)
    361         else:
--> 362             raise RuntimeError('table "%s" already exists' % name)
    363     elif if_exists == 'replace':
    364         try:

RuntimeError: table "tutorial.tut1" already exists

In [17]:
# Set "if_exists" to 'replace' or 'append'
td.to_td(df, 'tutorial.tut1', con, if_exists='replace')

Use index=False if you don't need to insert DataFrame index:


In [18]:
td.to_td(df, 'tutorial.tut1', con, if_exists='replace', index=False)

to_td inserts the current time as "time" column. You can pass "time" column explicitly by "time_col":


In [19]:
import datetime

df = pd.DataFrame(np.random.rand(3, 3), columns=['x', 'y', 'z'])

# Set "time" column explicitly
df['time'] = datetime.datetime.now()

# Use "time" as the time column in Treasure Data
td.to_td(df, 'tutorial.tut1', con, if_exists='replace', index=False, time_col='time')

If you are using a time series index, set "time_index" to 0 (or the right level of indexes):


In [20]:
df = pd.DataFrame(np.random.rand(3, 3), columns=['x', 'y', 'z'])

# Set time series index
df.index = pd.date_range('2001-01-01', periods=3)

# Use index as the time column in Treasure Data
td.to_td(df, 'tutorial.tut1', con, if_exists='replace', index=False, time_index=0)

In [ ]: