Getting Started

Pandas-TD allows you to read a query result into a DataFrame and to store a DataFrame to Treasure Data.

Before starting any session, you need to set your API key to the environment variable TD_API_KEY:

export TD_API_KEY=1234/abcd...

You can initialize your connection by connect:



In [1]:

    
%matplotlib inline

import os
import numpy as np
import pandas as pd
import pandas_td as td

# Initialize connection
con = td.connect(apikey=os.environ['TD_API_KEY'], endpoint='https://api.treasuredata.com/')

Using a connection, you can use some convenient methods like databases and tables:



In [2]:

    
# Get a list of databases
con.databases()









    Out[2]:






  
    
      
      name
      count
      permission
      created_at
      updated_at
    
  
  
    
      0
      sample_datasets
      8812278
      query_only
      2014-10-04 01:13:11+00:00
      2015-01-09 01:07:14+00:00
    
    
      1
      tutorial
      19
      full_access
      2015-05-27 13:12:43+00:00
      2015-05-27 13:12:43+00:00



In [3]:

    
# Get a list of tables
con.tables('sample_datasets')









    Out[3]:






  
    
      
      name
      count
      estimated_storage_size
      last_log_timestamp
      created_at
    
  
  
    
      0
      nasdaq
      8807278
      168205061
      1970-01-01 00:00:00+00:00
      2014-10-08 02:57:38+00:00
    
    
      1
      www_access
      5000
      0
      2014-10-04 01:13:15+00:00
      2014-10-04 01:13:12+00:00

You can choose a query engine and use it to run a queby by read_td:



In [4]:

    
# Select a query engine
presto = con.query_engine(database='sample_datasets', type='presto')

# Run a query
td.read_td('select * from nasdaq limit 3', presto)

Alternatively, you can read a table into a DataFrame by read_td_table, optionally with a specific time range, sample rate, and/or a limit:



In [5]:

    
# Read from a table with 5-percent sampling
df = td.read_td_table('nasdaq', presto, time_range=('2000-01-01', '2010-01-01'), sample=0.05, limit=10000)

Pandas-TD also supports importing a DataFrame into a table by to_td:



In [6]:

    
# Create a DataFrame with random values
df = pd.DataFrame(np.random.rand(3, 3), columns=['x', 'y', 'z'])

# Import it into 'tutorial.tut1'
td.to_td(df, 'tutorial.tut1', con, if_exists='replace', index=False)



In [7]:

    
# Check the result
td.read_td('select * from tutorial.tut1', presto)

read_td_query



In [8]:

    
help(td.read_td_query)









    



Help on function read_td_query in module pandas_td.td:

read_td_query(query, engine, index_col=None, params=None, parse_dates=None)
    Read Treasure Data query into a DataFrame.
    
    Returns a DataFrame corresponding to the result set of the query string.
    Optionally provide an index_col parameter to use one of the columns as
    the index, otherwise default integer index will be used.
    
    Parameters
    ----------
    query : string
        Query string to be executed.
    engine : QueryEngine
        Handler returned by Connection.query_engine.
    index_col : string, optional
        Column name to use as index for the returned DataFrame object.
    params : dict, optional
        Parameters to pass to execute method.
    parse_dates : list or dict, optional
        - List of column names to parse as dates
        - Dict of {column_name: format string} where format string is strftime
          compatible in case of parsing string times or is one of (D, s, ns, ms, us)
          in case of parsing integer timestamps
    
    Returns
    -------
    DataFrame

Examples



In [9]:

    
# Convert "time" to a time series index
df = td.read_td_query("select time, close from nasdaq where symbol='AAPL'",
                      presto,
                      index_col='time',
                      parse_dates={'time': 's'})
df.plot()









    Out[9]:





<matplotlib.axes._subplots.AxesSubplot at 0x107539748>

read_td_table



In [10]:

    
help(td.read_td_table)









    



Help on function read_td_table in module pandas_td.td:

read_td_table(table_name, engine, index_col=None, parse_dates=None, columns=None, time_range=None, sample=None, limit=10000)
    Read Treasure Data table into a DataFrame.
    
    The number of returned rows is limited by "limit" (default 10,000).
    Setting limit=None means all rows. Be careful when you set limit=None
    because your table might be very large and the result does not fit into memory.
    
    Parameters
    ----------
    table_name : string
        Name of Treasure Data table in database.
    engine : QueryEngine
        Handler returned by Connection.query_engine.
    index_col : string, optional
        Column name to use as index for the returned DataFrame object.
    parse_dates : list or dict, optional
        - List of column names to parse as dates
        - Dict of {column_name: format string} where format string is strftime
          compatible in case of parsing string times or is one of (D, s, ns, ms, us)
          in case of parsing integer timestamps
    columns : list, optional
        List of column names to select from table.
    time_range : tuple (start, end), optional
        Limit time range to select. "start" and "end" are one of None, integers,
        strings or datetime objects. "end" is exclusive, not included in the result.
    sample : double, optional
        Enable sampling data (Presto only). 1.0 means all data (100 percent).
        See TABLESAMPLE BERNOULLI at https://prestodb.io/docs/current/sql/select.html
    limit : int, default 10,000
        Maximum number of rows to select.
    
    Returns
    -------
    DataFrame

Examples



In [11]:

    
# Read all records (up to 10,000 rows by default)
df = td.read_td_table("www_access", presto)
df.head(3)









    Out[11]:






  
    
      
      user
      host
      path
      referer
      code
      agent
      size
      method
      time
    
  
  
    
      0
      None
      192.225.229.196
      /category/software
      -
      200
      Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; r...
      117
      GET
      1412382292
    
    
      1
      None
      120.168.215.131
      /category/software
      -
      200
      Mozilla/5.0 (compatible; Googlebot/2.1; +http:...
      53
      GET
      1412382284
    
    
      2
      None
      180.198.173.136
      /category/electronics
      /category/computers
      200
      Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.1...
      106
      GET
      1412382275



In [12]:

    
# From 1990 to 2010, 1-percent sample, no limit
df = td.read_td_table("nasdaq",
                      presto,
                      index_col='time',
                      parse_dates={'time': 's'},
                      columns=['time', 'symbol', 'close'],
                      time_range=('1990', '2010'),
                      sample=0.01,
                      limit=None)
len(df)









    Out[12]:





54797



In [13]:

    
df[df.symbol=='AAPL'].plot()









    Out[13]:





<matplotlib.axes._subplots.AxesSubplot at 0x109d4d2b0>

to_td



In [14]:

    
help(td.to_td)









    



Help on function to_td in module pandas_td.td:

to_td(frame, name, con, if_exists='fail', time_col=None, time_index=None, index=True, index_label=None, chunksize=10000)
    Write a DataFrame to a Treasure Data table.
    
    This method converts the dataframe into a series of key-value pairs
    and send them using the Treasure Data streaming API. The data is divided
    into chunks of rows (default 10,000) and uploaded separately. If upload
    failed, the client retries the process for a certain amount of time
    (max_cumul_retry_delay; default 600 secs). This method may fail and
    raise an exception when retries did not success, in which case the data
    may be partially inserted. Use the bulk import utility if you cannot
    accept partial inserts.
    
    Parameters
    ----------
    frame : DataFrame
        DataFrame to be written.
    name : string
        Name of table to be written, in the form 'database.table'.
    con : Connection
        Connection to a Treasure Data account.
    if_exists: {'fail', 'replace', 'append'}, default 'fail'
        - fail: If table exists, do nothing.
        - replace: If table exists, drop it, recreate it, and insert data.
        - append: If table exists, insert data. Create if does not exist.
    time_col : string, optional
        Column name to use as "time" column for the table. Column type must be
        integer (unixtime) or datetime. If None is given (default), then the current
        time is used as time values.
    time_index : int, optional
        Level of index to use as "time" column for the table. Set 0 for a single index.
        This parameter implies index=False.
    index : boolean, default True
        Write DataFrame index as a column.
    index_label : string or sequence, default None
        Column label for index column(s). If None is given (default) and index is True,
        then the index names are used. A sequence should be given if the DataFrame uses
        MultiIndex.
    chunksize : int, default 10,000
        Number of rows to be inserted in each chunk from the dataframe.

Examples



In [15]:

    
# Create a DataFrame with random values
df = pd.DataFrame(np.random.rand(3, 3), columns=['x', 'y', 'z'])

to_td will fail if table already exists:



In [16]:

    
td.to_td(df, 'tutorial.tut1', con)









    



---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-16-3ee20f431658> in <module>()
----> 1 td.to_td(df, 'tutorial.tut1', con)

/Users/knishida/projects/pandas-td/pandas_td/td.py in to_td(frame, name, con, if_exists, time_col, time_index, index, index_label, chunksize)
    360             con.client.create_log_table(database, table)
    361         else:
--> 362             raise RuntimeError('table "%s" already exists' % name)
    363     elif if_exists == 'replace':
    364         try:

RuntimeError: table "tutorial.tut1" already exists



In [17]:

    
# Set "if_exists" to 'replace' or 'append'
td.to_td(df, 'tutorial.tut1', con, if_exists='replace')

Use index=False if you don't need to insert DataFrame index:



In [18]:

    
td.to_td(df, 'tutorial.tut1', con, if_exists='replace', index=False)

to_td inserts the current time as "time" column. You can pass "time" column explicitly by "time_col":



In [19]:

    
import datetime

df = pd.DataFrame(np.random.rand(3, 3), columns=['x', 'y', 'z'])

# Set "time" column explicitly
df['time'] = datetime.datetime.now()

# Use "time" as the time column in Treasure Data
td.to_td(df, 'tutorial.tut1', con, if_exists='replace', index=False, time_col='time')

If you are using a time series index, set "time_index" to 0 (or the right level of indexes):



In [20]:

    
df = pd.DataFrame(np.random.rand(3, 3), columns=['x', 'y', 'z'])

# Set time series index
df.index = pd.date_range('2001-01-01', periods=3)

# Use index as the time column in Treasure Data
td.to_td(df, 'tutorial.tut1', con, if_exists='replace', index=False, time_index=0)



In [ ]:

	symbol	open	volume	high	low	close	time
0	ASNA	16.9600	1087798	17.06	16.89	17.01	1410364800
1	ADRD	24.8100	725	24.87	24.81	24.87	1410364800
2	ARIS	3.0799	39918	3.20	3.00	3.15	1410364800

	index	z	y	x	time
0	0	0.898739	0.975951	0.712556	1433052180
1	1	0.877093	0.343772	0.736913	1433052180
2	2	0.294983	0.197632	0.813423	1433052180

	name	count	permission	created_at	updated_at
0	sample_datasets	8812278	query_only	2014-10-04 01:13:11+00:00	2015-01-09 01:07:14+00:00
1	tutorial	19	full_access	2015-05-27 13:12:43+00:00	2015-05-27 13:12:43+00:00

	name	count	estimated_storage_size	last_log_timestamp	created_at
0	nasdaq	8807278	168205061	1970-01-01 00:00:00+00:00	2014-10-08 02:57:38+00:00
1	www_access	5000	0	2014-10-04 01:13:15+00:00	2014-10-04 01:13:12+00:00

	user	host	path	referer	code	agent	size	method	time
0	None	192.225.229.196	/category/software	-	200	Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; r...	117	GET	1412382292
1	None	120.168.215.131	/category/software	-	200	Mozilla/5.0 (compatible; Googlebot/2.1; +http:...	53	GET	1412382284
2	None	180.198.173.136	/category/electronics	/category/computers	200	Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.1...	106	GET	1412382275