Visualizing MusicBrainz data with Python/JS, an introduction

This introductory notebook will explain how I get the MusicBrainz database and how I transform the data into Python objects for display in tables or plots.

A static HTML version of this notebook and the next ones should be available on github.io.

Prerequisites: having PostgreSQL to store the database (or being able to create virtual machines that will run PostgreSQL). I will use Python to manipulate the data, but you can probably do the same in other languages. I will not go into details on how I build the SQL queries to fetch the data; you will need to look into the MusicBrainz schema if you try something too different from my examples.

Getting the MusicBrainz data

The first step is to get a local copy of the MusicBrainz database in order to make direct queries to it without going through the website or the web service (which doesn't allow writing complex queries).

The raw data itself is available for download and the files are updated twice a week. As of early 2017, the compressed database files to download amount to roughly 2.5 GB.

Several possibilities exist to build the database locally, using the raw data above. I'm only explaining the basics here:

  • if you already have (or can install) PostgreSQL on your machine (MusicBrainz uses version 9.5 for the moment), you can use the mbslave project to recreate the database structure locally. You will also be able to synchronise your database and fetch the latest changes whenever you want.

  • another possibility is to use virtual machines to store the database and also create a local copy of the website (the website is not required for what I intend to show here). I'm using the musicbrainz-docker project, which uses Docker to create several containers for the different MusicBrainz components (database, website, search).

In both cases you should expect to download several GB of data, and you will need several GB of RAM to keep the PostgreSQL database running smoothly.

Customizing the database

Note: this step is, again, absolutely not required. It also significantly increases the space you need to run the database (the additional dump to download is about 4 GB).

In my case, I want to explore metadata about the data modifications, i.e. the edits performed by MusicBrainz contributors. In order to do so I also had to download the mbdump-edit.tar.bz2 and mbdump-editor.tar.bz2 dumps and add them to the local database build process (I did that by patching the createdb.sh script in musicbrainz-docker).

Python toolbox

For data analysis I will use the following Python 3 libraries:

  • pandas for manipulating data as tables
  • psycopg2 and sqlalchemy to access the SQL database
  • plotly for plots
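
If you want to make sure everything is installed, a minimal check (just a sketch; adapt the list to the libraries you actually install) is to import each library and print its version:

# print the version of each library used in these notebooks
import pandas
import psycopg2
import sqlalchemy
import plotly

for module in (pandas, psycopg2, sqlalchemy, plotly):
    print(module.__name__, module.__version__)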

In [1]:
%load_ext watermark
%watermark --python -r


CPython 3.7.0b5
IPython 6.4.0
Git repo: git@bitbucket.org:loujine/musicbrainz-dataviz.git

In [2]:
%watermark --date --updated


last updated: 2018-06-05

Accessing the database from Python

Once the local database is set up, I can access it, e.g. from Python with the psycopg2 library, to perform SQL queries. Let's try a simple query.

With musicbrainz-docker, my database is on a virtual machine. I can access it from my main machine by setting the following parameters:


In [3]:
import os
import psycopg2
# define global variables to store our DB credentials
PGHOST = os.environ.get('PGHOST', 'localhost')
PGDATABASE = os.environ.get('PGDATABASE', 'musicbrainz')
PGUSER = os.environ.get('PGUSER', 'musicbrainz')
PGPASSWORD = os.environ.get('PGPASSWORD', 'musicbrainz')

Of course your parameters (especially IP) might be different from mine.

In order to simplify this procedure I developed a new branch in the musicbrainz-docker project that creates a Jupyter VM. If you use this branch, you don't need to set the parameters above; they are set when you start your notebook.
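
To make sure the parameters are correct before going further, a minimal connectivity check (just a sketch reusing the variables defined above) is to open a connection and ask PostgreSQL for its version:

# quick connectivity check: open a connection and print the PostgreSQL version
with psycopg2.connect(host=PGHOST, database=PGDATABASE,
                      user=PGUSER, password=PGPASSWORD) as cnx:
    crs = cnx.cursor()
    crs.execute('SELECT version()')
    print(crs.fetchone()[0])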

We need to define a SQL query as a Python string that psycopg2 will send to our database.


In [4]:
sql_beethoven = """
SELECT gid, name, begin_date_year, end_date_year 
  FROM artist 
 WHERE name='Ludwig van Beethoven'
"""

Let's run our query:


In [5]:
with psycopg2.connect(host=PGHOST, database=PGDATABASE, 
                      user=PGUSER, password=PGPASSWORD) as cnx:
    crs = cnx.cursor()
    crs.execute(sql_beethoven)
    for result in crs:
        print(result)


('1f9df192-a621-4f54-8850-2c5373b7eac9', 'Ludwig van Beethoven', 1770, 1827)

We got one result! So that means the correct Ludwig van Beethoven (1770-1827) exists in the MusicBrainz database. I also extracted his MBID (unique identifier) so that you can check that Beethoven's page is available on the main MusicBrainz server.

If you only want to manipulate basic data as Python strings and numbers, that's all you need, and you can start writing other queries.
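
For instance, a very similar query fetching the same artist by his MBID instead of his name (a sketch; the sql_beethoven_mbid name is mine and the gid value is the one returned above) only changes the WHERE clause:

# same query, but filtering on the MBID (gid column) instead of the name
sql_beethoven_mbid = """
SELECT gid, name, begin_date_year, end_date_year
  FROM artist
 WHERE gid = '1f9df192-a621-4f54-8850-2c5373b7eac9'
"""

with psycopg2.connect(host=PGHOST, database=PGDATABASE,
                      user=PGUSER, password=PGPASSWORD) as cnx:
    crs = cnx.cursor()
    crs.execute(sql_beethoven_mbid)
    print(crs.fetchone())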

But in my case I want to do more complex operations on the data, so I want to use another Python library that will help me manipulate and plot it. I'm going to use pandas for that.

Using pandas to manipulate data

The pandas library allows manipulation of complex data in Python as Series or DataFrames. It also integrates some of the matplotlib plotting library capabilities directly on DataFrame objects. Let's run the same query as earlier using pandas:


In [6]:
# pandas SQL queries require an sqlalchemy engine object
# rather than the direct psycopg2 connection
import sqlalchemy
import pandas
engine = sqlalchemy.create_engine(
    'postgresql+psycopg2://{PGUSER}:{PGPASSWORD}@{PGHOST}/{PGDATABASE}'.format(**locals()),
    isolation_level='READ UNCOMMITTED'
)

In [7]:
pandas.read_sql(sql_beethoven, engine)


Out[7]:
                                    gid                  name  begin_date_year  end_date_year
0  1f9df192-a621-4f54-8850-2c5373b7eac9  Ludwig van Beethoven             1770           1827

So we got (of course) the same result, as a pandas table (DataFrame). This makes it easier to apply the same operation or filter along one column when we have several thousand rows in our results. Calling the pandas read_sql function also makes our calls to PostgreSQL slightly easier to write.
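
As a small illustration of that kind of column-wise operation on a slightly larger result (a sketch; the LIKE filter and the derived lifespan column are only assumptions for this example), we can fetch all artists whose name contains 'Beethoven' and compute a new column from two existing ones:

# fetch a slightly larger result set and work on whole columns at once
sql_beethovens = """
SELECT gid, name, begin_date_year, end_date_year
  FROM artist
 WHERE name LIKE '%%Beethoven%%'
"""
# '%%' keeps the pattern safe if the driver interpolates parameters;
# in the LIKE pattern it matches the same names as a single '%'
df = pandas.read_sql(sql_beethovens, engine)

# vectorized operations: derive a new column and filter rows on a condition
df['lifespan'] = df.end_date_year - df.begin_date_year
print(df[df.begin_date_year.notnull()].sort_values('name').head())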

Exploring the MusicBrainz database

The next notebooks will show examples of queries to extract and visualize data. If you want to write your own queries, you should look at the MusicBrainz database schema by reading the documentation.