<img src="images/continuum_analytics_logo.png" alt="Continuum Logo" align="right" width="30%">

`into`

Into migrates data between formats and locations.

Before we can use a database we need to move data into it. The into project provides a single consistent interface to move data between formats and between locations.

We'll start with local data and eventually move out to remote data.

into docs

Examples

Into moves data into a target from a source

>>> into(target, source)

The target and source can be either a Python object or a string URI. The following are all valid calls to into

>>> into(pd.DataFrame, 'iris.csv')  # Load CSV file into new DataFrame
>>> into('iris.json', my_df)        # Write DataFrame into JSON file
>>> into('iris.json', 'iris.csv')   # Migrate data from CSV to JSON
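
To see what a one-line migration like the last call saves us, here is a rough sketch of a CSV-to-JSON conversion written by hand with only the standard library. The rows are made up for illustration, and the line-delimited JSON layout is an assumption of this sketch, not necessarily what into writes.

```python
import csv
import io
import json

# Stand-in for the contents of a small CSV file (made-up rows, not iris.csv).
csv_text = "name,amount\nAlice,100\nBob,200\n"

# Read each CSV row as a dict keyed by the header line.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Write one JSON object per line. Note that every value stays a string:
# this hand-written version does no type inference.
json_lines = "\n".join(json.dumps(row) for row in rows)
print(json_lines)
# {"name": "Alice", "amount": "100"}
# {"name": "Bob", "amount": "200"}
```

Every new pair of formats would need another hand-written routine like this one, which is the repetition a single consistent interface is meant to remove.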

Exercise

Use into to load the iris.csv file into a Python list, a np.ndarray, and a pd.DataFrame



In [ ]:

    
from into import into
import numpy as np
import pandas as pd

URI Strings

Into refers to foreign data either with a Python object like a sqlalchemy.Table object for a SQL table, or with a string URI, like postgresql://hostname::tablename.

URIs often take the following form

protocol://path-to-resource::path-within-resource

Here path-to-resource might point to a file, a database hostname, etc., while path-within-resource might refer to a datapath or table name. Note the two main separators

:// separates the protocol on the left (sqlite, mongodb, ssh, hdfs, hive, ...)
:: separates the path within the database on the right (e.g. tablename)

into docs on uri strings

Examples

Here are some example URIs

myfile.json
myfiles.*.csv
postgresql://hostname::tablename
mongodb://hostname/db::collection
ssh://user@host:/path/to/myfile.csv
hdfs://user@host:/path/to/*.csv
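
To make the two separators concrete, the sketch below takes a few of the example URIs apart with plain string operations. The helper `split_uri` is hypothetical and written only for this illustration; it is not part of into, which may parse URIs quite differently.

```python
def split_uri(uri):
    """Split a URI into (protocol, path_to_resource, path_within_resource).

    Hypothetical helper for illustration only; it is not part of into.
    Parts that are absent from the URI come back as None.
    """
    protocol = None
    if '://' in uri:
        # Everything left of the first '://' is the protocol.
        protocol, uri = uri.split('://', 1)
    inner = None
    if '::' in uri:
        # Everything right of the last '::' is the path within the resource.
        uri, inner = uri.rsplit('::', 1)
    return protocol, uri, inner


for uri in ['myfile.json',
            'postgresql://hostname::tablename',
            'mongodb://hostname/db::collection',
            'ssh://user@host:/path/to/myfile.csv']:
    print(split_uri(uri))
# (None, 'myfile.json', None)
# ('postgresql', 'hostname', 'tablename')
# ('mongodb', 'hostname/db', 'collection')
# ('ssh', 'user@host:/path/to/myfile.csv', None)
```

A bare filename has neither separator, so only the middle part is filled in.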

Exercise

Migrate your CSV file into a table named iris in a new SQLite database at sqlite:///my.db. Remember to use the :: separator to separate your database name from your table name.

into docs on SQL



In [ ]:

What kind of object did you receive as output? Call type on your result.



In [ ]:

Exercise

We have a MongoDB database waiting for you at the following address

mongodb://ec2-54-159-160-163.compute-1.amazonaws.com/db

Move your newly built SQLite table into a MongoDB collection. Remember to use :: to add a name to your collection.



In [ ]:

Verify that your data arrived safely by converting your mongo collection into a list.



In [ ]:

Finally, clean up and remove your collection from MongoDB by calling the drop function.



In [ ]:

    
from into import drop

How it works

Into is a network of fast pairwise conversions between formats. When we migrate between two formats we traverse a path of pairwise conversions.

We visualize that network below:

Each node represents a data format. Each directed edge represents a function to transform data between two formats. A single call to into may traverse multiple edges and multiple intermediate formats. Red nodes support larger-than-memory data.

For example, when we migrate a CSV file to a Mongo database we might take the following route:

Load into a DataFrame (pandas.read_csv)
Convert to np.recarray (DataFrame.to_records)
Then to a Python Iterator (np.ndarray.tolist)
Finally to Mongo (pymongo.Collection.insert)

Alternatively we could write a special function that uses MongoDB's native CSV loader and shortcut this entire process with a direct edge CSV -> Mongo.

These functions are chosen because they are fast, often far faster than converting through a central serialization format.
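
The idea of traversing a network of pairwise conversions can be sketched with a toy graph and a breadth-first search. This is an illustration under stated assumptions, not into's actual implementation: the `edges` dictionary and the `find_route` helper are hypothetical names, the graph contains only the route described above, and the search simply minimizes the number of hops, whereas a real implementation would more plausibly weigh each edge by how expensive the conversion is.

```python
from collections import deque

# Toy conversion network: each key converts directly to the formats listed.
# The edges mirror the CSV -> Mongo route described above.
edges = {
    'CSV': ['DataFrame'],
    'DataFrame': ['recarray'],
    'recarray': ['Iterator'],
    'Iterator': ['Mongo'],
}


def find_route(edges, source, target):
    """Breadth-first search for the route with the fewest conversions."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        route = queue.popleft()
        if route[-1] == target:
            return route
        for nxt in edges.get(route[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(route + [nxt])
    return None  # no chain of conversions connects the two formats


print(find_route(edges, 'CSV', 'Mongo'))
# ['CSV', 'DataFrame', 'recarray', 'Iterator', 'Mongo']

# Registering a direct CSV -> Mongo edge shortcuts the whole chain.
edges['CSV'].append('Mongo')
print(find_route(edges, 'CSV', 'Mongo'))
# ['CSV', 'Mongo']
```

Adding one specialized edge changes the chosen route without touching any of the existing conversions, which is what makes the pairwise design easy to extend.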

This picture is actually from an older version of into, when the graph was still small enough to visualize pleasantly. See into docs for a more up-to-date version.

Remote Data

We can interact with remote data in three locations

On Amazon's S3 (this will be quick)
On a remote machine via ssh
On the Hadoop File System (HDFS)

We'll save most of this until after we've seen Blaze, but we'll briefly use S3 here.

S3

For now, we quickly grab a file from Amazon's S3.

This example depends on boto to interact with S3.

conda install boto

into docs on aws



In [ ]:

    
into(pd.DataFrame, 's3://nyqpug/tips.csv')