<img src="images/continuum_analytics_logo.png" alt="Continuum Logo" align="right" width="30%">


Into migrates data between formats and locations.

Before we can use a database we need to move data into it. The into project provides a single consistent interface to move data between formats and between locations.

We'll start with local data and eventually move out to remote data.

into docs


Into moves data into a target from a source

>>> into(target, source)

The target and source can each be either a Python object or a string URI. The following are all valid calls to into:

>>> into(pd.DataFrame, 'iris.csv')  # Load CSV file into new DataFrame
>>> into('iris.json', my_df)        # Write DataFrame into JSON file
>>> into('iris.json', 'iris.csv')   # Migrate data from CSV to JSON


Use into to load the iris.csv file into a Python list, a np.ndarray, and a pd.DataFrame

In [ ]:
from into import into
import numpy as np
import pandas as pd

URI Strings

Into refers to foreign data either with a Python object like a sqlalchemy.Table object for a SQL table, or with a string URI, like postgresql://hostname::tablename.

URIs often take the following form:

    protocol://path-to-resource::path-within-resource

Here path-to-resource might point to a file, a database hostname, etc., while path-within-resource might refer to a datapath or table name. Note the two main separators:

  • :// separates the protocol on the left (sqlite, mongodb, ssh, hdfs, hive, ...)
  • :: separates the path within the database on the right (e.g. tablename)
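To make the two separators concrete, here is a minimal sketch that splits a URI string on `://` and `::`. The `split_uri` helper is hypothetical and written only for illustration; it is not into's own parser and ignores edge cases that a real parser would need to handle.

```python
def split_uri(uri):
    """Split a URI into (protocol, path_to_resource, path_within_resource).

    Hypothetical helper for illustration only; not part of into.
    """
    protocol, sep, rest = uri.partition('://')
    if not sep:
        # No '://' found: treat the string as a plain path with no protocol
        protocol, rest = None, uri
    # Everything after '::' names a location inside the resource
    resource, _, inner = rest.partition('::')
    return protocol, resource, inner or None


for uri in ['iris.csv',
            'sqlite:///my.db::iris',
            'postgresql://hostname::tablename']:
    print(uri, '->', split_uri(uri))
```

A bare filename such as iris.csv has neither separator, so only the middle element is filled in.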

into docs on uri strings


Here are some example URIs



Migrate your CSV file into a table named iris in a new SQLite database at sqlite:///my.db. Remember to use the :: separator to separate your database name from your table name.

into docs on SQL

In [ ]:

What kind of object did you receive as output? Call type on your result.

In [ ]:


We have a MongoDB database waiting for you at the following address


Move your newly built SQLite table into a MongoDB collection. Remember to use :: to add a name to your collection.

In [ ]:

Verify that your data arrived safely by converting your mongo collection into a list.

In [ ]:

Finally, clean up and remove your collection from MongoDB by calling the drop function.

In [ ]:
from into import drop

How it works

Into is a network of fast pairwise conversions between formats. When we migrate between two formats we traverse a path of pairwise conversions.

We visualize that network below:

Each node represents a data format. Each directed edge represents a function to transform data between two formats. A single call to into may traverse multiple edges and multiple intermediate formats. Red nodes support larger-than-memory data.

For example, when we migrate a CSV file to a Mongo database we might take the following route:

  • Load in to a DataFrame (pandas.read_csv)
  • Convert to np.recarray (DataFrame.to_records)
  • Then to a Python Iterator (np.ndarray.tolist)
  • Finally to Mongo (pymongo.Collection.insert)

Alternatively we could write a special function that uses MongoDB's native CSV loader and shortcut this entire process with a direct edge CSV -> Mongo.

These functions are chosen because they are fast, often far faster than converting through a central serialization format.
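The idea of chaining pairwise conversions can be sketched with a toy network built only from the standard library. Everything here is an assumption made for illustration: the format names, the `EDGES` table, and the `find_path` and `convert` helpers are hypothetical, and an unweighted breadth-first search stands in for whatever path selection into actually performs.

```python
import csv
import io
import json
from collections import deque


def csv_to_rows(text):
    # Parse CSV text into a list of dict rows
    return list(csv.DictReader(io.StringIO(text)))


def rows_to_json(rows):
    return json.dumps(rows)


def rows_to_tuples(rows):
    return [tuple(row.values()) for row in rows]


# Each directed edge is one pairwise conversion function (toy stand-ins)
EDGES = {
    ('csv', 'rows'): csv_to_rows,
    ('rows', 'json'): rows_to_json,
    ('rows', 'tuples'): rows_to_tuples,
}


def find_path(source, target):
    """Breadth-first search for the shortest chain of formats."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for a, b in EDGES:
            if a == path[-1] and b not in seen:
                seen.add(b)
                queue.append(path + [b])
    raise ValueError('no conversion path from %s to %s' % (source, target))


def convert(data, source, target):
    """Apply each pairwise conversion along the path in turn."""
    path = find_path(source, target)
    for a, b in zip(path, path[1:]):
        data = EDGES[(a, b)](data)
    return data


text = 'name,size\nsetosa,5.1\nvirginica,6.3\n'
print(find_path('csv', 'json'))         # goes through the intermediate 'rows'
print(convert(text, 'csv', 'tuples'))

# Registering a direct edge shortcuts the intermediate format entirely
EDGES[('csv', 'json')] = lambda t: rows_to_json(csv_to_rows(t))
print(find_path('csv', 'json'))
```

Adding the direct `('csv', 'json')` edge plays the role of the specialized CSV -> Mongo loader mentioned above: the search now prefers the one-step route without any change to the calling code.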

This picture is actually from an older version of into, when the graph was still small enough to visualize pleasantly. See the into docs for a more up-to-date version.

Remote Data

We can interact with remote data in three locations

  1. On Amazon's S3 (this will be quick)
  2. On a remote machine via ssh
  3. On the Hadoop File System (HDFS)

For most of this we'll wait until we've seen Blaze, but we'll use S3 briefly here.


For now, we quickly grab a file from Amazon's S3.

This example depends on boto to interact with S3.

conda install boto

into docs on aws

In [ ]:
into(pd.DataFrame, 's3://nyqpug/tips.csv')