Git Index Generator

This notebook generates an Elasticsearch (ES) index with git activity data (commits, files, lines added, lines removed, commit authors) for a given list of git repositories defined in a settings.yml file.
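
Based on the keys used later in this notebook (es_host and git), a minimal settings.yml could look like the following sketch (the host and repository URLs are placeholders):

es_host: 'http://localhost:9200'
git:
    - 'https://github.com/owner/repo1'
    - 'https://github.com/owner/repo2'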

Let's start by importing the utils Python script, setting up the connection to the ES server, and defining some variables.


In [ ]:
import utils
utils.logging.basicConfig(level=utils.logging.INFO)
""" You can comment previous line if you don't want logging information
"""

settings = utils.read_config_file('settings.yml')
es = utils.establish_connection(settings['es_host'])

Let's define an ES index mapping for the data that will be uploaded to the ES server. Note that the mapping declares a custom document type (item), so it targets Elasticsearch versions prior to 7.x, where mapping types were still supported.


In [ ]:
MAPPING_GIT = {
    "mappings": {
        "item": {
            "properties": {
                "date": {
                    "type": "date",
                    "format" : "E MMM d HH:mm:ss yyyy Z",
                    "locale" : "US"
                },
                "commit": {"type": "keyword"},
                "author": {"type": "keyword"},
                "domain": {"type": "keyword"},
                "file": {"type": "keyword"},
                "added": {"type": "integer"},
                "removed": {"type": "integer"},
                "repository": {"type": "keyword"}
            }
        }
    }
}
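
For reference, the date format above matches git's default author date strings, such as Tue Feb 26 09:27:51 2019 +0100.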

Let's give a name to the index to be created, and create it.

Note: utils.create_ES_index() removes any existing index with the given name before creating it.


In [ ]:
index_name = 'git'
utils.create_ES_index(es, index_name, MAPPING_GIT)
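
utils.create_ES_index() itself is not shown in this notebook; as a rough sketch, assuming the official elasticsearch-py client, it could look like this:

def create_ES_index(es, index_name, mapping):
    # Remove any index with the same name, ignoring the error if none exists
    es.indices.delete(index=index_name, ignore=[400, 404])
    # Create a fresh index with the given mapping
    es.indices.create(index=index_name, body=mapping)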

Let's import the git backend from Perceval.


In [ ]:
from perceval.backends.core.git import Git
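
If you want to inspect what a Perceval item looks like before processing everything, you can fetch a single commit from one repository (the URL and path below are placeholders); the raw commit data lives under the data key:


In [ ]:
repo = Git(uri='https://github.com/owner/repo', gitpath='/tmp/repo')
# fetch() returns a generator of items; take the first one and look at its keys
first_item = next(repo.fetch())
print(first_item['data'].keys())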

For each repository in the settings file, let's fetch its data, build a summary object with the desired information, and upload the data to the ES server using the ES bulk API.


In [ ]:
for repo_url in settings['git']:
    
    # Use the last segment of the repository URL as its name; clone/cache it under /tmp
    repo_name = repo_url.split('/')[-1]
    repo = Git(uri=repo_url, gitpath='/tmp/'+repo_name)
    
    utils.logging.info('Go for {}'.format(repo_name))
    
    items = []
    bulk_size = 10000
    
    for commit in repo.fetch():
        
        # 'Author' is formatted like 'John Doe <john@example.com>': take the name
        # (dropping the trailing space) and the email domain (dropping the trailing '>')
        author_name = commit['data']['Author'].split('<')[0][:-1]
        author_domain = commit['data']['Author'].split('@')[-1][:-1]
        
        for file in commit['data']['files']:
            # 'added'/'removed' can be missing or '-' (e.g. for binary files); normalize to 0
            if file.get('added', '-') == '-':
                file['added'] = 0
            if file.get('removed', '-') == '-':
                file['removed'] = 0

            summary = {
                'date': commit['data']['AuthorDate'],
                'commit': commit['data']['commit'],
                'author': author_name,
                'domain': author_domain,
                'file': file['file'],
                'added': file['added'],
                'removed': file['removed'],
                'repository': repo_name
            }
            
            items.append({'_index': index_name, '_type': 'item', '_source': summary})
            
            # Upload a full batch as soon as it exceeds bulk_size, then start a new one
            if len(items) > bulk_size:
                utils.helpers.bulk(es, items)
                items = []
                utils.logging.info('{} items uploaded'.format(bulk_size))
            
    # Upload whatever is left after the last full batch
    if items:
        utils.helpers.bulk(es, items)
        utils.logging.info('Remaining {} items uploaded'.format(len(items)))
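
As an optional sanity check (assuming the standard elasticsearch-py API), refresh the index and count the documents that were uploaded:


In [ ]:
es.indices.refresh(index=index_name)
print('{} documents in the index'.format(es.count(index=index_name)['count']))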