<img src="images/continuum_analytics_logo.png" alt="Continuum Logo" align="right" width="30%">

into

Into migrates data between formats and locations.

Before we can use a database, we need to move data into it. The into project provides a single, consistent interface to move data between formats and between locations.

We'll start with local data and eventually move out to remote data.

into docs

Examples

Into moves data into a target from a source

>>> into(target, source)

The target and source can be either a Python object or a string URI. The following are all valid calls to into

>>> into(pd.DataFrame, 'iris.csv')  # Load CSV file into new DataFrame
>>> into('iris.json', my_df)        # Write DataFrame into JSON file
>>> into('iris.json', 'iris.csv')   # Migrate data from CSV to JSON

Exercise

Use into to load the iris.csv file into a Python list, a np.ndarray, and a pd.DataFrame


In [ ]:
from into import into
import numpy as np
import pandas as pd
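
One possible solution, assuming iris.csv sits in the current working directory:

>>> into(list, 'iris.csv')          # CSV -> Python list
>>> into(np.ndarray, 'iris.csv')    # CSV -> NumPy array
>>> into(pd.DataFrame, 'iris.csv')  # CSV -> pandas DataFrame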

URI Strings

Into refers to foreign data either with a Python object, like a sqlalchemy.Table for a SQL table, or with a string URI, like postgresql://hostname::tablename.

URIs often take the following form

protocol://path-to-resource::path-within-resource

Here path-to-resource might point to a file, a database hostname, etc., while path-within-resource might refer to a data path or table name. Note the two main separators:

  • :// separates the protocol on the left (sqlite, mongodb, ssh, hdfs, hive, ...)
  • :: separates the path within the database on the right (e.g. tablename)

into docs on uri strings

Examples

Here are some example URIs

myfile.json
myfiles.*.csv
postgresql://hostname::tablename
mongodb://hostname/db::collection
ssh://user@host:/path/to/myfile.csv
hdfs://user@host:/path/to/*.csv

Exercise

Migrate your CSV file into a table named iris in a new SQLite database at sqlite:///my.db. Remember to use the :: separator and to separate your database name from your table name.

into docs on SQL


In [ ]:
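One possible solution; my.db will be created if it does not already exist:

>>> into('sqlite:///my.db::iris', 'iris.csv')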

What kind of object did you receive as output? Call type on your result.


In [ ]:
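For example, binding the result to a (hypothetical) variable t:

>>> t = into('sqlite:///my.db::iris', 'iris.csv')
>>> type(t)   # for SQL targets this is typically a sqlalchemy.Table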

Exercise

We have a MongoDB database waiting for you at the following address

mongodb://ec2-54-159-160-163.compute-1.amazonaws.com/db

Move your newly built SQLite table into a MongoDB collection. Remember to use :: to add a name to your collection.


In [ ]:
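A possible solution, reusing the iris table from the previous exercise and choosing iris as a (hypothetical) collection name:

>>> into('mongodb://ec2-54-159-160-163.compute-1.amazonaws.com/db::iris',
...      'sqlite:///my.db::iris')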

Verify that your data arrived safely by converting your mongo collection into a list.


In [ ]:
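For example, assuming the collection name iris from the previous sketch:

>>> into(list, 'mongodb://ec2-54-159-160-163.compute-1.amazonaws.com/db::iris')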

Finally, clean up and remove your collection from MongoDB by calling the drop function.


In [ ]:
from into import drop
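
A possible call, again assuming the collection name iris and assuming drop accepts the same URI strings that into does:

>>> drop('mongodb://ec2-54-159-160-163.compute-1.amazonaws.com/db::iris')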

How it works

Into is a network of fast pairwise conversions between pairs of formats. When we migrate between two formats we traverse a path of these pairwise conversions.

We visualize that network below:

Each node represents a data format. Each directed edge represents a function to transform data between two formats. A single call to into may traverse multiple edges and multiple intermediate formats. Red nodes support larger-than-memory data.

A single call to into may traverse several intermediate formats, calling on several conversion functions. For example, when we migrate a CSV file to a Mongo database we might take the following route (a hand-rolled version is sketched after the list):

  • Load into a DataFrame (pandas.read_csv)
  • Convert to np.recarray (DataFrame.to_records)
  • Then to a Python Iterator (np.ndarray.tolist)
  • Finally to Mongo (pymongo.Collection.insert)
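
Below is a minimal hand-rolled sketch of that route, just to show what into automates; the file name, host name, and collection name are hypothetical:

import pandas as pd
import pymongo

df = pd.read_csv('iris.csv')                          # CSV -> pd.DataFrame
records = df.to_records(index=False)                  # DataFrame -> np.recarray
rows = records.tolist()                               # recarray -> list of tuples
docs = [dict(zip(df.columns, row)) for row in rows]   # tuples -> documents
pymongo.MongoClient('hostname').db.iris.insert(docs)  # documents -> Mongo (older pymongo API)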

Alternatively we could write a special function that uses MongoDB's native CSV loader and shortcut this entire process with a direct edge CSV -> Mongo.

These functions are chosen because they are fast, often far faster than converting through a central serialization format.

This picture is actually from an older version of into, when the graph was still small enough to visualize pleasantly. See the into docs for an up-to-date version.

Remote Data

We can interact with remote data in three locations

  1. On Amazon's S3 (this will be quick)
  2. On a remote machine via ssh
  3. On the Hadoop File System (HDFS)

For most of this we'll wait until we've seen Blaze, but we'll use S3 briefly.

S3

For now, we quickly grab a file from Amazon's S3.

This example depends on boto to interact with S3.

conda install boto

into docs on aws


In [ ]:
into(pd.DataFrame, 's3://nyqpug/tips.csv')