This notebook generates an Elasticsearch (ES) index with git information (commits, files, lines added, lines removed, commit authors) for a list of git repositories defined in a settings.yml file.
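The exact layout of settings.yml is not shown in this notebook; judging from the keys the code reads (`es_host` and `git`), it presumably looks something like this (the host and repository URLs below are illustrative):

```yaml
es_host: http://localhost:9200
git:
  - https://github.com/org/repo-one.git
  - https://github.com/org/repo-two.git
```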
Let's start by importing the utils Python script, setting up the connection to the ES server, and defining some variables.
In [ ]:
import utils

# Comment out the following line if you don't want logging information
utils.logging.basicConfig(level=utils.logging.INFO)
settings = utils.read_config_file('settings.yml')
es = utils.establish_connection(settings['es_host'])
Let's define an ES index mapping for the data that will be uploaded to the ES server.
In [ ]:
MAPPING_GIT = {
    "mappings": {
        "item": {
            "properties": {
                "date": {
                    "type": "date",
                    "format": "E MMM d HH:mm:ss yyyy Z",
                    "locale": "US"
                },
                "commit": {"type": "keyword"},
                "author": {"type": "keyword"},
                "domain": {"type": "keyword"},
                "file": {"type": "keyword"},
                "added": {"type": "integer"},
                "removed": {"type": "integer"},
                "repository": {"type": "keyword"}
            }
        }
    }
}
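The `format` string above uses Elasticsearch's Java-style date syntax and matches git's author-date output (e.g. `Tue Mar 1 12:53:20 2016 +0100`). As a rough Python equivalent (the sample date below is illustrative), the same pattern corresponds to this `strptime` format:

```python
from datetime import datetime

# A git-style author date, as produced in commit metadata
raw = 'Tue Mar 1 12:53:20 2016 +0100'

# 'E MMM d HH:mm:ss yyyy Z' in ES terms maps to this strptime pattern
parsed = datetime.strptime(raw, '%a %b %d %H:%M:%S %Y %z')
```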
Let's give a name to the index to be created, and create it. Note: utils.create_ES_index() removes any existing index with the given name before creating it.
In [ ]:
index_name = 'git'
utils.create_ES_index(es, index_name, MAPPING_GIT)
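utils.create_ES_index() lives in the companion utils script, which is not shown here. A minimal sketch of its recreate-then-create behaviour, assuming `es` is an elasticsearch-py client exposing the usual `indices.exists/delete/create` calls, might look like:

```python
def create_index(es, name, mapping):
    """Drop any existing index called `name`, then create it with `mapping`.

    Hypothetical sketch of utils.create_ES_index(); the real helper may
    differ. `es` is assumed to behave like an elasticsearch-py client.
    """
    if es.indices.exists(index=name):
        # Remove the stale index so repeated runs start from scratch
        es.indices.delete(index=name)
    es.indices.create(index=name, body=mapping)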
Let's import the git backend from Perceval.
In [ ]:
from perceval.backends.core.git import Git
For each repository in the settings file, let's fetch its data, build a summary object with the desired information, and upload it to the ES server using the ES bulk API.
In [ ]:
for repo_url in settings['git']:
    repo_name = repo_url.split('/')[-1]
    repo = Git(uri=repo_url, gitpath='/tmp/' + repo_name)
    utils.logging.info('Go for {}'.format(repo_name))

    items = []
    bulk_size = 10000
    for commit in repo.fetch():
        # Git stores authors as 'Name <user@domain>'
        author_name = commit['data']['Author'].split('<')[0][:-1]
        author_domain = commit['data']['Author'].split('@')[-1][:-1]
        for file in commit['data']['files']:
            # Binary files report '-' instead of line counts, and some
            # entries may lack the fields entirely
            if 'added' not in file.keys() or file['added'] == '-':
                file['added'] = 0
            if 'removed' not in file.keys() or file['removed'] == '-':
                file['removed'] = 0
            summary = {
                'date': commit['data']['AuthorDate'],
                'commit': commit['data']['commit'],
                'author': author_name,
                'domain': author_domain,
                'file': file['file'],
                'added': file['added'],
                'removed': file['removed'],
                'repository': repo_name
            }
            items.append({'_index': index_name, '_type': 'item', '_source': summary})
            # Flush to ES in chunks to keep memory usage bounded
            if len(items) > bulk_size:
                utils.helpers.bulk(es, items)
                items = []
                utils.logging.info('{} items uploaded'.format(bulk_size))

    # Upload whatever is left after the last full chunk
    if len(items) != 0:
        utils.helpers.bulk(es, items)
        utils.logging.info('Remaining {} items uploaded'.format(len(items)))
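The string slicing above relies on git's `Name <user@domain>` author format. Should that format vary, `email.utils.parseaddr` from the standard library is a more forgiving way to split the name from the address (the author string below is illustrative):

```python
from email.utils import parseaddr

author = 'Jane Doe <jane@example.com>'  # illustrative author string
name, address = parseaddr(author)
domain = address.split('@')[-1]
# name == 'Jane Doe', address == 'jane@example.com', domain == 'example.com'
```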