About

Extract specific parts from JSON documents and store them as columns in a CSV/TSV for further data analysis.

Prerequesites

  • A running mongodb instance which holds the JSON documents (see docker container).
  • Python libraries:
    • pymongo - python bindings for mongodb.
    • jsonpath_rw - path expressions for matching parts of a JSON document.
  • A configuration file (config.yaml) with setting for
    • mongodb instance and collection name
    • JSON field matching rules

Configuration example:

mongo:
    url        : "mongodb://mongo:27017"
    database   : "articles"
    collection : "comments"

json2csv:
    # extract all article comments
    item_selector : "$.article[*].comment[*]"
    item_fields   : [ "id", "author", "text" ]
    item_count    : False
    output_path   : "../data/comments.tsv"

In [ ]:
from pymongo import MongoClient
from jsonpath_rw import jsonpath, parse
import json
import yaml
import pandas as pd
import re

Prepare configuration


In [ ]:
config_file = 'config_json2csv.yaml'

In [ ]:
with open(config_file) as yaml_file:
    cfg = yaml.load(yaml_file)

Prepare database connection


In [ ]:
mongo_cfg = cfg['mongo']

client = MongoClient(mongo_cfg['url'])

db = client[mongo_cfg['database']][mongo_cfg['collection']]

print "%d entries in database." % db.find().count()

Process all documents

Define function to clean text data such that it can be saved as one line.


In [ ]:
def clean(data):
    """Replace new line and tab; remove leading and trailing spaces."""
    if isinstance(data, (str, unicode)):
        return re.sub('[\t\n\r]', ' ', data).strip().encode('utf-8')
    else:
        return data

def as_list(value):
    if isinstance(value, list):
        return value
    else:
        return [value]

def extract(parser, json):
    match = parser.find(json)
    if (len(match) == 0):
        return None
    else:
        return match[0].value

def extract_data(cursor):
    return [ tuple(clean(extract(parser, item)) for parser in field_parser)
    for result in cursor
    for match in item_selector.find(result)
    for item in as_list(match.value) ]

In [ ]:
item_selector = parse(cfg['json2csv']['item_selector'])
item_fields   = cfg['json2csv']['item_fields']
field_parser  = [ parse(field) for field in item_fields ]

In [ ]:
from collections import Counter

if cfg['json2csv']['item_count']:
    counter = Counter(extract_data(db.find()))
    df = pd.DataFrame.from_records([ list(x) + [count] for (x, count) in counter.iteritems()], columns=item_fields+["count"])
else:
    records = extract_data(db.find())
    df = pd.DataFrame.from_records(records, columns=item_fields)

In [ ]:
print "%d items." % len(df)

In [ ]:
output_file = cfg['json2csv']['output_file']

df.to_csv(output_file, sep="\t", index=False, quoting=3, encoding="utf-8")