This tutorial describes how to use Pandas-TD in Jupyter to explore data interactively.
Set your API key to the environment variable TD_API_KEY
and run "jupyter notebook":
$ export TD_API_SERVER="https://api.treasuredata.com/"
$ export TD_API_KEY="1234/abcd..."
$ jupyter notebook
You can connect to your database by create_engine
:
In [1]:
%matplotlib inline
import os
import pandas_td as td
# Set engine type and database, using the default connection
engine = td.create_engine('presto:sample_datasets')
# Alternatively, initialize a connection explicitly
con = td.connect(apikey=os.environ['TD_API_KEY'], endpoint=os.environ['TD_API_SERVER'])
engine = td.create_engine('presto:sample_datasets', con=con)
You can run a query by read_td
:
In [2]:
query = '''
select * from nasdaq limit 3
'''
td.read_td(query, engine)
Out[2]:
Or you can read an existing job result by read_td_job
:
In [3]:
td.read_td_job(35809747, engine)
Out[3]:
You can read a table into a DataFrame by read_td_table
, optionally with specific time range and limit:
In [4]:
# Read from a table with time range
df = td.read_td_table('nasdaq', engine,
time_range=('2000-01-01', '2010-01-01'),
limit=10000)
Importing a DataFrame into a table is also supported by to_td
:
In [5]:
# Create a DataFrame with random values
df = pd.DataFrame(np.random.rand(3, 3), columns=['x', 'y', 'z'])
# Import it into 'tutorial.import1'
con = td.connect()
td.to_td(df, 'tutorial.import1', con, if_exists='replace', index=False)
Note that to_td
currently uses Streaming API for imports. It takes more than a few seconds before your data become visible. Be patient.
In [6]:
# Check the result
td.read_td_table('tutorial.import1', engine)
Out[6]:
In [7]:
help(td.create_engine)
In [8]:
# presto
engine = td.create_engine('presto://APIKEY@api.treasuredata.com/sample_datasets')
# hive
engine = td.create_engine('hive://APIKEY@api.treasuredata.com/sample_datasets')
create_engine
uses "default" connection if apikey and host are omitted. In this case, the environment variables "TD_API_KEY" and "TD_API_SERVER" are used to initialize a connection:
In [9]:
# use default connection (TD_API_KEY is used)
engine = td.create_engine('presto:sample_datasets')
If you prefer initializing a connection with custom parameters, you can use connect
:
In [10]:
# create a connection with detailed parameters (via tdclient.Client)
# See https://github.com/treasure-data/td-client-python/blob/master/tdclient/api.py
con = td.connect(apikey=os.environ['TD_API_KEY'],
endpoint=os.environ['TD_API_SERVER'],
retry_post_requests=True)
engine = td.create_engine('presto:sample_datasets', con=con)
In [11]:
help(td.read_td_query)
In [12]:
query = '''
select time, close from nasdaq where symbol='AAPL'
'''
# Run a query, converting "time" to a time series index
df = td.read_td_query(query, engine, index_col='time', parse_dates={'time': 's'})
df.plot()
Out[12]:
In [13]:
help(td.read_td_job)
In [14]:
import tdclient
# Before using read_td_job, you need to issue a job separately
client = tdclient.Client()
job = client.query("sample_datasets",
"select time, close from nasdaq where symbol='AAPL'",
type="presto")
In [15]:
# Get result and convert it to dataframe
df = td.read_td_job(job.id, engine, index_col='time', parse_dates={'time': 's'})
df.plot()
Out[15]:
In [16]:
help(td.read_td_table)
In [17]:
# Read all records (up to 10,000 rows by default)
df = td.read_td_table("www_access", engine)
df.head(3)
Out[17]:
In [20]:
help(td.to_td)
In [21]:
# Create DataFrame with random values
df = pd.DataFrame(np.random.rand(3, 3), columns=['x', 'y', 'z'])
to_td
fails if table already exists:
In [22]:
td.to_td(df, 'tutorial.import1', con)
In [23]:
# Set "if_exists" to 'replace' or 'append'
td.to_td(df, 'tutorial.import1', con, if_exists='replace')
Use index=False
if you don't need to insert DataFrame index:
In [24]:
td.to_td(df, 'tutorial.import1', con, if_exists='replace', index=False)
to_td
inserts the current time as "time" column. You can pass "time" column explicitly by time_col
:
In [25]:
import datetime
df = pd.DataFrame(np.random.rand(3, 3), columns=['x', 'y', 'z'])
# Set "time" column explicitly
df['time'] = datetime.datetime.now()
# Use "time" as the time column in Treasure Data
td.to_td(df, 'tutorial.import1', con, if_exists='replace', index=False, time_col='time')
If you are using a time series index, set time_index=0
:
In [26]:
df = pd.DataFrame(np.random.rand(3, 3), columns=['x', 'y', 'z'])
# Set time series index
df.index = pd.date_range('2001-01-01', periods=3)
# Use index as the time column in Treasure Data
td.to_td(df, 'tutorial.import1', con, if_exists='replace', index=False, time_index=0)
In [ ]: