FMA: A Dataset For Music Analysis

Michaël Defferrard, Kirell Benzi, Pierre Vandergheynst, Xavier Bresson, EPFL LTS2.


From raw_*.csv, this notebook generates:

  • tracks.csv: per-track / album / artist metadata.
  • genres.csv: genre hierarchy.
  • echonest.csv: cleaned Echonest features.

A companion script (imported below as the creation module) performs the following steps:

  1. Query the API and store metadata in raw_tracks.csv, raw_albums.csv, raw_artists.csv and raw_genres.csv.
  2. Download the audio for each track.
  3. Trim the audio to 30s clips.
  4. Normalize the permissions and modification / access times.
  5. Create the .zip archives.

In [ ]:
import os
import ast
import pickle

import IPython.display as ipd
import numpy as np
import pandas as pd

import utils
import creation

In [ ]:
AUDIO_DIR = os.environ.get('AUDIO_DIR')
BASE_DIR = os.path.abspath(os.path.dirname(AUDIO_DIR))
FMA_FULL = os.path.join(BASE_DIR, 'fma_full')
FMA_LARGE = os.path.join(BASE_DIR, 'fma_large')

1 Retrieve metadata and audio from FMA

  1. Crawl the tracks, albums and artists metadata through their API.
  2. Download original .mp3 by HTTPS for each track id (only if we don't have it already).


Todo:

  • Scrape curators.
  • Download images (track_image_file, album_image_file, artist_image_file). Beware of the quality.
  • Verify checksum for some random tracks.

Dataset update:

  • To add new tracks: iterate from largest known track id to the most recent only.
  • To update user data: we need to get all tracks again.
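
A minimal sketch of the incremental update described above, assuming track ids only grow over time. The helper ids_to_crawl and the latest_id argument are hypothetical names used for illustration, not part of the companion script.

```python
def ids_to_crawl(known_ids, latest_id):
    """Return the track ids newer than the largest id already collected."""
    start = max(known_ids, default=0) + 1
    return range(start, latest_id + 1)


# Ids 2, 5 and 10 are already stored; the most recent id on the site is 13.
new_ids = ids_to_crawl([2, 5, 10], latest_id=13)
print(list(new_ids))
```

Updating user data (listens, favorites, comments) cannot use this shortcut, since those values change for old tracks too.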

In [ ]:
# ./ metadata
# ./ data /path/to/fma/fma_full
# ./ clips /path/to/fma


In [ ]:
# converters={'genres': ast.literal_eval}
tracks = pd.read_csv('raw_tracks.csv', index_col=0)
albums = pd.read_csv('raw_albums.csv', index_col=0)
artists = pd.read_csv('raw_artists.csv', index_col=0)
genres = pd.read_csv('raw_genres.csv', index_col=0)

not_found = pickle.load(open('not_found.pickle', 'rb'))

In [ ]:
def get_fs_tids(audio_dir):
    tids = []
    for _, dirnames, files in os.walk(audio_dir):
        if dirnames == []:
            tids.extend(int(file[:-4]) for file in files)
    return tids

audio_tids = get_fs_tids(FMA_FULL)
clips_tids = get_fs_tids(FMA_LARGE)
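
To illustrate what get_fs_tids collects, the sketch below runs it on a temporary directory. The layout (three-digit sub-folders holding zero-padded .mp3 file names) is an assumption made for the example.

```python
import os
import tempfile


def get_fs_tids(audio_dir):
    tids = []
    for _, dirnames, files in os.walk(audio_dir):
        # Only leaf directories (no sub-directories) contain audio files.
        if dirnames == []:
            # Strip the 4-character extension ('.mp3') and parse the id.
            tids.extend(int(file[:-4]) for file in files)
    return tids


with tempfile.TemporaryDirectory() as audio_dir:
    for name in ('000/000002.mp3', '000/000005.mp3', '001/001001.mp3'):
        path = os.path.join(audio_dir, *name.split('/'))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, 'w').close()
    found = sorted(get_fs_tids(audio_dir))

print(found)
```

Note that the function assumes every file in a leaf directory has a numeric name and a three-letter extension; any other file would raise a ValueError.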

In [ ]:
print('tracks: {} collected ({} not found, {} max id)'.format(
    len(tracks), len(not_found['tracks']), tracks.index.max()))
print('albums: {} collected ({} not found, {} in tracks)'.format(
    len(albums), len(not_found['albums']), len(tracks['album_id'].unique())))
print('artists: {} collected ({} not found, {} in tracks)'.format(
    len(artists), len(not_found['artists']), len(tracks['artist_id'].unique())))
print('genres: {} collected'.format(len(genres)))
print('audio: {} collected ({} not found, {} not in tracks)'.format(
    len(audio_tids), len(not_found['audio']), len(set(audio_tids).difference(tracks.index))))
print('clips: {} collected ({} not found, {} not in tracks)'.format(
    len(clips_tids), len(not_found['clips']), len(set(clips_tids).difference(tracks.index))))
assert sum(tracks.index.isin(audio_tids)) + len(not_found['audio']) == len(tracks)
assert sum(tracks.index.isin(clips_tids)) + len(not_found['clips']) == sum(tracks.index.isin(audio_tids))
assert len(clips_tids) + len(not_found['clips']) + len(not_found['audio']) == len(tracks)

In [ ]:
N = 5

2 Format metadata


Todo:

  • Sanitize values, e.g. list of words for tags, valid links in artist_wikipedia_page, remove html markup in free-form text.
    • Clean tags. E.g. some tags are just artist names.
  • Fill metadata about encoding: length, number of samples, sample rate, bit rate, channels (mono/stereo), 16bits?.
  • Update duration from audio
    • 2624 is marked as 05:05:50 (18350s) although it is reported as 00:21:15.15 by ffmpeg.
    • 112067: 3714s --> 01:59:55.06, 112808: 3718s --> 01:59:59.56
    • ffmpeg: Estimating duration from bitrate, this may be inaccurate
    • Solution, decode the complete mp3: ffmpeg -i input.mp3 -f null -
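
The gap between stored and decoded durations can be checked by converting the ffmpeg timestamps quoted above to seconds. This is a small sketch; ffmpeg_time_to_seconds is a hypothetical helper, not a function of the companion script.

```python
def ffmpeg_time_to_seconds(timestamp):
    """Convert an 'HH:MM:SS.ss' timestamp to seconds."""
    hours, minutes, seconds = timestamp.split(':')
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


# (duration stored in the metadata, duration reported by ffmpeg)
examples = [(18350, '00:21:15.15'), (3714, '01:59:55.06'), (3718, '01:59:59.56')]
for stored, reported in examples:
    decoded = ffmpeg_time_to_seconds(reported)
    print('{}s stored, {:.2f}s decoded'.format(stored, decoded))
```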

In [ ]:
df, column = tracks, 'tags'
null = sum(df[column].isnull())
print('{} null, {} non-null'.format(null, df.shape[0] - null))

2.1 Tracks

In [ ]:
drop = [
    'license_image_file', 'license_image_file_large', 'license_parent_id', 'license_url',  # keep title only
    'track_file', 'track_image_file',  # used to download only
    'track_url', 'album_url', 'artist_url',  # only relevant on website
    'track_copyright_c', 'track_copyright_p',  # present for ~1000 tracks only
    # 'track_composer', 'track_lyricist', 'track_publisher',  # present for ~4000, <1000 and <2000 tracks
    'track_disc_number',  # different from 1 for <1000 tracks
    'track_explicit', 'track_explicit_notes',  # present for <4000 tracks
    'track_instrumental',  # ~6000 tracks have a 1, there is an instrumental genre
]
tracks.drop(drop, axis=1, inplace=True)
tracks.rename(columns={'license_title': 'track_license', 'tags': 'track_tags'}, inplace=True)

In [ ]:
tracks['track_duration'] = tracks['track_duration'].map(creation.convert_duration)

In [ ]:
def convert_datetime(df, column, format=None):
    df[column] = pd.to_datetime(df[column], infer_datetime_format=True, format=format)
convert_datetime(tracks, 'track_date_created')
convert_datetime(tracks, 'track_date_recorded')

In [ ]:
tracks['album_id'].fillna(-1, inplace=True)
tracks['track_bit_rate'].fillna(-1, inplace=True)
tracks = tracks.astype({'album_id': int, 'track_bit_rate': int})

In [ ]:
def convert_genres(genres):
    genres = ast.literal_eval(genres)
    return [int(genre['genre_id']) for genre in genres]

tracks['track_genres'].fillna('[]', inplace=True)
tracks['track_genres'] = tracks['track_genres'].map(convert_genres)
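
As an illustration of convert_genres, the sketch below parses a raw value. The exact shape of the raw string (a list of dictionaries with a genre_title key next to genre_id) is an assumption for the example; only genre_id is used by the function.

```python
import ast


def convert_genres(genres):
    # The raw CSV stores the Python repr of a list of dicts.
    genres = ast.literal_eval(genres)
    return [int(genre['genre_id']) for genre in genres]


raw = "[{'genre_id': '21', 'genre_title': 'Hip-Hop'}, {'genre_id': '10', 'genre_title': 'Pop'}]"
print(convert_genres(raw))
# Tracks without genres were filled with '[]' and map to an empty list.
print(convert_genres('[]'))
```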

2.2 Albums

In [ ]:
drop = [
    'artist_name', 'album_url', 'artist_url',  # in tracks already (though it can be different)
    'album_image_file', 'album_images',  # todo: shall be downloaded
    # 'album_producer', 'album_engineer',  # present for ~2400 albums only
]
albums.drop(drop, axis=1, inplace=True)
albums.rename(columns={'tags': 'album_tags'}, inplace=True)

In [ ]:
convert_datetime(albums, 'album_date_created')
convert_datetime(albums, 'album_date_released')

2.3 Artists

In [ ]:
drop = [
    'artist_website', 'artist_url',  # in tracks already (though it can be different)
    'artist_image_file', 'artist_images',  # todo: shall be downloaded
    'artist_donation_url', 'artist_paypal_name', 'artist_flattr_name',  # ~1600 & ~400 & ~70, not relevant
    'artist_contact',  # ~1500, not very useful data
    # 'artist_active_year_begin', 'artist_active_year_end',  # ~1400, ~500 only
    # 'artist_associated_labels',  # ~1000
    # 'artist_related_projects',  # only ~800, but can be combined with bio
]
artists.drop(drop, axis=1, inplace=True)
artists.rename(columns={'tags': 'artist_tags'}, inplace=True)

In [ ]:
convert_datetime(artists, 'artist_date_created')
for column in ['artist_active_year_begin', 'artist_active_year_end']:
    artists[column].replace(0.0, np.nan, inplace=True)
    convert_datetime(artists, column, format='%Y.0')

2.4 Merge DataFrames

In [ ]:
not_found['albums'] = [int(i) for i in not_found['albums']]
not_found['artists'] = [int(i) for i in not_found['artists']]

In [ ]:
tracks = tracks.merge(albums, left_on='album_id', right_index=True, sort=False, how='left', suffixes=('', '_dup'))

n = sum(tracks['album_title_dup'].isnull())
print('{} tracks without extended album information ({} tracks without album_id)'.format(
    n, sum(tracks['album_id'] == -1)))
assert sum(tracks['album_id'].isin(not_found['albums'])) == n
assert sum(tracks['album_title'] != tracks['album_title_dup']) == n

tracks.drop('album_title_dup', axis=1, inplace=True)
assert not any('dup' in col for col in tracks.columns)
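
The merge pattern used here can be seen on toy data (all values below are made up). Columns present on both sides receive the '_dup' suffix on the right-hand copy, and a null in that copy flags a track whose album was not found.

```python
import pandas as pd

tracks = pd.DataFrame(
    {'album_id': [1, 2, -1], 'album_title': ['A', 'B', None]},
    index=pd.Index([10, 11, 12], name='track_id'))
albums = pd.DataFrame(
    {'album_title': ['A'], 'album_listens': [100]},
    index=pd.Index([1], name='album_id'))

# Left join: every track is kept, album columns are NaN when no album matches.
merged = tracks.merge(albums, left_on='album_id', right_index=True,
                      how='left', suffixes=('', '_dup'))
missing = merged['album_title_dup'].isnull().sum()
print(merged)
print('{} tracks without extended album information'.format(missing))
```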

In [ ]:
# Album artist can be different than track artist. Keep track artist.
#tracks[tracks['artist_name'] != tracks['artist_name_dup']].select(lambda x: 'artist_name' in x, axis=1)

In [ ]:
tracks = tracks.merge(artists, left_on='artist_id', right_index=True, sort=False, how='left', suffixes=('', '_dup'))

n = sum(tracks['artist_name_dup'].isnull())
print('{} tracks without extended artist information'.format(n))
assert sum(tracks['artist_id'].isin(not_found['artists'])) == n
assert sum(tracks['artist_name'] != tracks[('artist_name_dup')]) == n

tracks.drop('artist_name_dup', axis=1, inplace=True)
assert not any('dup' in col for col in tracks.columns)

In [ ]:
columns = []
for name in tracks.columns:
    names = name.split('_')
    columns.append((names[0], '_'.join(names[1:])))
tracks.columns = pd.MultiIndex.from_tuples(columns)
assert all(label in ['track', 'album', 'artist'] for label in tracks.columns.get_level_values(0))
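
A small sketch of the column renaming on made-up column names: the prefix before the first underscore becomes the first level, the remainder the second level. It uses split('_', 1), which gives the same result as the loop above for names containing at least one underscore.

```python
import pandas as pd

df = pd.DataFrame(
    [['Song', '2017-01-01', 'Record', 'Band']],
    columns=['track_title', 'track_date_created', 'album_title', 'artist_name'])

# 'track_date_created' --> ('track', 'date_created')
df.columns = pd.MultiIndex.from_tuples(
    [tuple(name.split('_', 1)) for name in df.columns])

print(df.columns.tolist())
print(df['track'].columns.tolist())
```

Selecting df['track'] then returns all per-track columns at once.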

In [ ]:
# Todo: fill other columns ?
tracks['album', 'tags'].fillna('[]', inplace=True)
tracks['artist', 'tags'].fillna('[]', inplace=True)

columns = [('album', 'favorites'), ('album', 'comments'), ('album', 'listens'), ('album', 'tracks'),
           ('artist', 'favorites'), ('artist', 'comments')]
for column in columns:
    tracks[column].fillna(-1, inplace=True)
columns = {column: int for column in columns}
tracks = tracks.astype(columns)

3 Data cleaning

Todo: duplicates (metadata and audio)

In [ ]:
def keep(index, df):
    old = len(df)
    df = df.loc[index]
    new = len(df)
    print('{} lost, {} left'.format(old - new, new))
    return df

tracks = keep(tracks.index, tracks)
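
Because keep relies on .loc, it accepts either a boolean mask or a set of index labels. The toy frame below is made up to show both uses.

```python
import pandas as pd


def keep(index, df):
    old = len(df)
    df = df.loc[index]
    new = len(df)
    print('{} lost, {} left'.format(old - new, new))
    return df


df = pd.DataFrame({'listens': [5, 50, 500]}, index=[1, 2, 3])

# Boolean mask: keep rows satisfying a condition.
by_mask = keep(df['listens'] >= 50, df)
# Index labels: keep every row except the listed ids.
by_label = keep(df.index.difference([2]), df)
```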

In [ ]:
# Audio not found or could not be trimmed.
tracks = keep(tracks.index.difference(not_found['audio']), tracks)
tracks = keep(tracks.index.difference(not_found['clips']), tracks)

Errors from the script.

  • IndexError('index 0 is out of bounds for axis 0 with size 0',)
    • ffmpeg: Header missing
    • ffmpeg: Could not find codec parameters for stream 0 (Audio: mp3, 0 channels, s16p): unspecified frame size. Consider increasing the value for the 'analyzeduration' and 'probesize' options
    • tids: 117759
  • NoBackendError()
    • ffmpeg: Format mp3 detected only with low score of 1, misdetection possible!
    • tids: 80015, 115235
  • UserWarning('Trying to estimate tuning from empty frequency set.',)
    • librosa error
    • tids: 1440, 26436, 38903, 57603, 62095, 62954, 62956, 62957, 62959, 62971, 86079, 96426, 104623, 106719, 109714, 114501, 114528, 118003, 118004, 127827, 130298, 130296, 131076, 135804, 154923
  • ParameterError('Filter pass-band lies beyond Nyquist',)
    • librosa error
    • tids: 152204, 28106, 29166, 29167, 29169, 29168, 29170, 29171, 29172, 29173, 29179, 43903, 56757, 59361, 75461, 92346, 92345, 92347, 92349, 92350, 92351, 92353, 92348, 92352, 92354, 92355, 92356, 92358, 92359, 92361, 92360, 114448, 136486, 144769, 144770, 144771, 144773, 144774, 144775, 144778, 144776, 144777

In [ ]:
# Feature extraction failed.
FAILED = [1440, 26436, 28106, 29166, 29167, 29168, 29169, 29170, 29171, 29172,
          29173, 29179, 38903, 43903, 56757, 57603, 59361, 62095, 62954, 62956,
          62957, 62959, 62971, 75461, 80015, 86079, 92345, 92346, 92347, 92348,
          92349, 92350, 92351, 92352, 92353, 92354, 92355, 92356, 92357, 92358,
          92359, 92360, 92361, 96426, 104623, 106719, 109714, 114448, 114501, 114528,
          115235, 117759, 118003, 118004, 127827, 130296, 130298, 131076, 135804, 136486,
          144769, 144770, 144771, 144773, 144774, 144775, 144776, 144777, 144778, 152204,
]
tracks = keep(tracks.index.difference(FAILED), tracks)

In [ ]:
# License forbids redistribution.
tracks = keep(tracks['track', 'license'] != 'FMA-Limited: Download Only', tracks)
print('{} licenses'.format(len(tracks[('track', 'license')].unique())))

In [ ]:
#sum(tracks['track', 'title'].duplicated())

4 Genres

In [ ]:
genres.drop(['genre_handle', 'genre_color'], axis=1, inplace=True)
genres.rename(columns={'genre_parent_id': 'parent', 'genre_title': 'title'}, inplace=True)

In [ ]:
genres['parent'].fillna(0, inplace=True)
genres = genres.astype({'parent': int})

In [ ]:
# 13 (Easy Listening) has parent 126 which is missing
# --> a root genre on the website, although not in the genre menu
genres.at[13, 'parent'] = 0

# 580 (Abstract Hip-Hop) has parent 1172 which is missing
# --> listed as child of Hip-Hop on the website
genres.at[580, 'parent'] = 21

# 810 (Nu-Jazz) has parent 51 which is missing
# --> listed as child of Easy Listening on the website
genres.at[810, 'parent'] = 13

# 763 (Holiday) has parent 763 which is itself
# --> listed as child of Sound Effects on the website
genres.at[763, 'parent'] = 16

# Todo: should novelty be under Experimental? It is alone on website.

In [ ]:
# Genre 806 (hiphop) should not exist. Replace it by 21 (Hip-Hop).
print('{} tracks have genre 806'.format(
    sum(tracks['track', 'genres'].map(lambda genres: 806 in genres))))
def change_genre(genres):
    return [genre if genre != 806 else 21 for genre in genres]
tracks['track', 'genres'] = tracks['track', 'genres'].map(change_genre)
genres.drop(806, inplace=True)

In [ ]:
def get_parent(genre, track_all_genres=None):
    parent = genres.at[genre, 'parent']
    if track_all_genres is not None:
        # Record every genre visited on the way to the root.
        track_all_genres.append(genre)
    return genre if parent == 0 else get_parent(parent, track_all_genres)

# Get all genres, i.e. all genres encountered when walking from leaves to roots.
def get_all_genres(track_genres):
    track_all_genres = list()
    for genre in track_genres:
        get_parent(genre, track_all_genres)
    return list(set(track_all_genres))

tracks['track', 'genres_all'] = tracks['track', 'genres'].map(get_all_genres)
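
The walk from a genre up to its root can be tried on a toy hierarchy. The table below is illustrative only (ids and titles are chosen for the example); a parent of 0 marks a root genre, as in the cells above.

```python
import pandas as pd

# Toy hierarchy: Rock (12) <- Punk (25) <- Hardcore (111), and Jazz (4) alone.
genres = pd.DataFrame(
    {'title': ['Rock', 'Punk', 'Hardcore', 'Jazz'], 'parent': [0, 12, 25, 0]},
    index=pd.Index([12, 25, 111, 4], name='genre_id'))


def get_parent(genre, visited=None):
    """Return the root of `genre`, optionally recording every genre on the path."""
    parent = genres.at[genre, 'parent']
    if visited is not None:
        visited.append(genre)
    return genre if parent == 0 else get_parent(parent, visited)


def get_all_genres(track_genres):
    visited = []
    for genre in track_genres:
        get_parent(genre, visited)
    return sorted(set(visited))


print(get_parent(111))
print(get_all_genres([111, 4]))
```

The recursion terminates only if every chain of parents ends at 0, which is why the missing and self-referencing parents are fixed before this step.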

In [ ]:
# Number of tracks per genre.
def count_genres(subset=tracks.index):
    count = pd.Series(0, index=genres.index)
    for _, track_all_genres in tracks.loc[subset, ('track', 'genres_all')].items():
        for genre in track_all_genres:
            count[genre] += 1
    return count

genres['#tracks'] = count_genres()
genres[genres['#tracks'] == 0]

In [ ]:
def get_top_genre(track_genres):
    top_genres = set(genres.at[genres.at[genre, 'top_level'], 'title'] for genre in track_genres)
    return top_genres.pop() if len(top_genres) == 1 else np.nan

# Top-level genre.
genres['top_level'] = genres.index.map(get_parent)
tracks['track', 'genre_top'] = tracks['track', 'genres'].map(get_top_genre)
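
A sketch of the top-level genre rule on a made-up genre table: a track gets a top-level genre only when all of its genres share a single root, otherwise the value is NaN.

```python
import numpy as np
import pandas as pd

# Toy table: Punk (25) sits under Rock (12); Jazz (4) is its own root.
genres = pd.DataFrame(
    {'title': ['Rock', 'Punk', 'Jazz'], 'top_level': [12, 12, 4]},
    index=pd.Index([12, 25, 4], name='genre_id'))


def get_top_genre(track_genres):
    top_genres = set(genres.at[genres.at[genre, 'top_level'], 'title']
                     for genre in track_genres)
    return top_genres.pop() if len(top_genres) == 1 else np.nan


print(get_top_genre([12, 25]))  # both genres share the root Rock
print(get_top_genre([25, 4]))   # two different roots: ambiguous
print(get_top_genre([]))        # no genre at all
```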

5 Subsets: large, medium, small

5.1 Large

Main characteristic: the full set with clips trimmed to a manageable size.

5.2 Medium

Main characteristic: clean metadata (includes 1 top-level genre) and quality audio.

In [ ]:
fma_medium = pd.DataFrame(tracks)

In [ ]:
# Missing meta-information.

# Missing extended album and artist information.
fma_medium = keep(~fma_medium['album', 'id'].isin(not_found['albums']), fma_medium)
fma_medium = keep(~fma_medium['artist', 'id'].isin(not_found['artists']), fma_medium)

# Untitled track or album.
fma_medium = keep(~fma_medium['track', 'title'].isnull(), fma_medium)
fma_medium = keep(fma_medium['track', 'title'].map(lambda x: 'untitled' in x.lower()) == False, fma_medium)
fma_medium = keep(fma_medium['album', 'title'].map(lambda x: 'untitled' in x.lower()) == False, fma_medium)

# One tag is often just the artist name. Tags too scarce for tracks and albums.
#keep(fma_medium['artist', 'tags'].map(len) >= 2, fma_medium)

# Too scarce.
#fma_medium = keep(~fma_medium['album', 'information'].isnull(), fma_medium)
#fma_medium = keep(~fma_medium['artist', 'bio'].isnull(), fma_medium)
#fma_medium = keep(~fma_medium['artist', 'website'].isnull(), fma_medium)
#fma_medium = keep(~fma_medium['artist', 'wikipedia_page'].isnull(), fma_medium)

# Too scarce.
#fma_medium = keep(~fma_medium['artist', 'location'].isnull(), fma_medium)
#fma_medium = keep(~fma_medium['artist', 'latitude'].isnull(), fma_medium)
#fma_medium = keep(~fma_medium['artist', 'longitude'].isnull(), fma_medium)

In [ ]:
# Technical quality.
# Todo: sample rate
fma_medium = keep(fma_medium['track', 'bit_rate'] > 100000, fma_medium)

# Choosing standard bit rates discards all VBR.
#fma_medium = keep(fma_medium['track', 'bit_rate'].isin([320000, 256000, 192000, 160000, 128000]), fma_medium)

In [ ]:
fma_medium = keep(fma_medium['track', 'duration'] >= 60, fma_medium)
fma_medium = keep(fma_medium['track', 'duration'] <= 600, fma_medium)

fma_medium = keep(fma_medium['album', 'tracks'] >= 1, fma_medium)
fma_medium = keep(fma_medium['album', 'tracks'] <= 50, fma_medium)

In [ ]:
# Lower popularity bound.
fma_medium = keep(fma_medium['track', 'listens'] >= 100, fma_medium)
fma_medium = keep(fma_medium['track', 'interest'] >= 200, fma_medium)
fma_medium = keep(fma_medium['album', 'listens'] >= 1000, fma_medium);

# Favorites and comments are very scarce.
#fma_medium = keep(fma_medium['artist', 'favorites'] >= 1, fma_medium)

In [ ]:
# Targeted genre classification.
fma_medium = keep(~fma_medium['track', 'genre_top'].isnull(), fma_medium);
#keep(fma_medium['track', 'genres'].map(len) == 1, fma_medium);

In [ ]:
# Adjust size with popularity measure. Should be of better quality.
N_TRACKS = 25000

# Observations
# * More albums are eliminated than artists --> be sure not to reduce diversity
# * Favors and disadvantages genres differently --> do it per genre?
# Normalization
# * mean, median, std, max
# * tracks per album or artist
# Test
# * 4/5 of same tracks were selected with various set of measures
# * <5% diff with max and mean

popularity_measures = [('track', 'listens'), ('track', 'interest')]  # ('album', 'listens')
# ('track', 'favorites'), ('track', 'comments'),
# ('album', 'favorites'), ('album', 'comments'),
# ('artist', 'favorites'), ('artist', 'comments'),

normalization = {measure: fma_medium[measure].max() for measure in popularity_measures}
def popularity_measure(track):
    return sum(track[measure] / normalization[measure] for measure in popularity_measures)
fma_medium['popularity_measure'] = fma_medium.apply(popularity_measure, axis=1)
fma_medium = keep(fma_medium.sort_values('popularity_measure', ascending=False).index[:N_TRACKS], fma_medium)
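
The popularity score is a sum of measures, each divided by its maximum so that no single measure dominates. The sketch below reproduces that idea on made-up numbers, using vectorized column arithmetic instead of a row-wise apply.

```python
import pandas as pd

df = pd.DataFrame(
    {'listens': [100, 1000, 500], 'interest': [400, 200, 800]},
    index=[1, 2, 3])
measures = ['listens', 'interest']

# Normalize each measure by its maximum, then sum them per track.
normalization = {m: df[m].max() for m in measures}
df['popularity'] = sum(df[m] / normalization[m] for m in measures)

# Keep the N most popular tracks.
N = 2
top = df.sort_values('popularity', ascending=False).index[:N]
print(df)
print(list(top))
```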

In [ ]:
tmp = genres[genres['parent'] == 0].reset_index().set_index('title')
tmp['#tracks_medium'] = fma_medium['track', 'genre_top'].value_counts()
tmp.sort_values('#tracks_medium', ascending=False)

5.3 Small

Main characteristic: genre balanced (and echonest features).


Options considered:

  • 8 genres with 1000 tracks --> 8,000 tracks
  • 10 genres with 500 tracks --> 5,000 tracks


Todo:

  • Download more echonest features so that all tracks can have them. Otherwise intersection of tracks with echonest features and one top-level genre is too small.

In [ ]:
# 8 genres with 1000 tracks each (8,000 tracks).
N_GENRES = 8
N_TRACKS = 1000

top_genres = tmp.sort_values('#tracks_medium', ascending=False)[:N_GENRES].index
fma_small = pd.DataFrame(fma_medium)
fma_small = keep(fma_small['track', 'genre_top'].isin(top_genres), fma_small)

In [ ]:
to_keep = []
for genre in top_genres:
    subset = fma_small[fma_small['track', 'genre_top'] == genre]
    drop = subset.sort_values('popularity_measure').index[:-N_TRACKS]
    fma_small.drop(drop, inplace=True)
assert len(fma_small) == N_GENRES * N_TRACKS
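
The genre-balanced selection can be sketched on toy data. This variant uses groupby().head() rather than the drop loop of the notebook; it keeps the same tracks (the N_TRACKS most popular per genre) as long as each selected genre has at least N_TRACKS tracks. All values are made up.

```python
import pandas as pd

df = pd.DataFrame({
    'genre_top': ['Rock', 'Rock', 'Rock', 'Jazz', 'Jazz', 'Pop'],
    'popularity': [0.9, 0.1, 0.5, 0.3, 0.7, 0.8],
}, index=[1, 2, 3, 4, 5, 6])
N_GENRES, N_TRACKS = 2, 2

# Genres with the most tracks.
top_genres = df['genre_top'].value_counts().index[:N_GENRES]
small = df[df['genre_top'].isin(top_genres)]

# Most popular N_TRACKS tracks within each selected genre.
small = (small.sort_values('popularity', ascending=False)
              .groupby('genre_top').head(N_TRACKS))
print(small.sort_index())
```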

5.4 Subset indication

In [ ]:
SUBSETS = ('small', 'medium', 'large')
tracks['set', 'subset'] = pd.Series(dtype=pd.CategoricalDtype(categories=SUBSETS, ordered=True))
tracks.loc[tracks.index, ('set', 'subset')] = 'large'
tracks.loc[fma_medium.index, ('set', 'subset')] = 'medium'
tracks.loc[fma_small.index, ('set', 'subset')] = 'small'
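
Storing the subset as an ordered categorical makes the nesting small ⊂ medium ⊂ large queryable with a comparison. A minimal sketch on made-up track ids:

```python
import pandas as pd

SUBSETS = ('small', 'medium', 'large')
dtype = pd.CategoricalDtype(categories=SUBSETS, ordered=True)
subset = pd.Series(['large', 'medium', 'small', 'large'],
                   index=[1, 2, 3, 4], dtype=dtype)

# Tracks of the medium subset include those of the small subset.
in_medium = subset <= 'medium'
print(subset[in_medium].index.tolist())
```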

5.5 Echonest

In [ ]:
echonest = pd.read_csv('raw_echonest.csv', index_col=0, header=[0, 1, 2])
echonest = keep(~echonest['echonest', 'temporal_features'].isnull().any(axis=1), echonest)
echonest = keep(~echonest['echonest', 'audio_features'].isnull().any(axis=1), echonest)
echonest = keep(~echonest['echonest', 'social_features'].isnull().any(axis=1), echonest)

echonest = keep(echonest.index.isin(tracks.index), echonest);
keep(echonest.index.isin(fma_medium.index), echonest);
keep(echonest.index.isin(fma_small.index), echonest);

6 Splits: training, validation, test

Take into account:

  • Artists may only appear on one side.
  • Stratification: ideally, all characteristics (#tracks per artist, duration, sampling rate, information, bio) and targets (genres, tags) should be equally distributed.

In [ ]:
for genre in genres.index:
    tracks['genre', genres.at[genre, 'title']] = tracks['track', 'genres_all'].map(lambda genres: genre in genres)

SPLITS = ('training', 'test', 'validation')
PERCENTAGES = (0.8, 0.1, 0.1)
tracks['set', 'split'] = pd.Series(dtype=pd.CategoricalDtype(categories=SPLITS))

for subset in SUBSETS:

    tracks_subset = tracks['set', 'subset'] <= subset

    # Consider only top-level genres for small and medium.
    genre_list = list(tracks.loc[tracks_subset, ('track', 'genre_top')].unique())
    if subset == 'large':
        genre_list = list(genres['title']) 

    while True:
        if len(genre_list) == 0:
            break

        # Choose most constrained genre, i.e. genre with the least unassigned artists.
        tracks_unsplit = tracks['set', 'split'].isnull()
        count = tracks[tracks_subset & tracks_unsplit].set_index(('artist', 'id'), append=True)['genre']
        count = count.groupby(level=1).sum().astype(bool).sum()
        genre = count[genre_list].idxmin()
        # Each genre is processed once; without this the loop never ends.
        genre_list.remove(genre)

        # Given genre, select artists.
        tracks_genre = tracks['genre', genre] == 1
        artists = tracks.loc[tracks_genre & tracks_subset & tracks_unsplit, ('artist', 'id')].value_counts()
        #print('-->', genre, len(artists))

        current = {split: np.sum(tracks_genre & tracks_subset & (tracks['set', 'split'] == split)) for split in SPLITS}

        # Assign artists with most tracks first.
        for artist, count in artists.items():
            choice = np.argmin([current[split] / percentage for split, percentage in zip(SPLITS, PERCENTAGES)])
            current[SPLITS[choice]] += count
            #assert tracks.loc[tracks['artist', 'id'] == artist, ('set', 'split')].isnull().all()
            tracks.loc[tracks['artist', 'id'] == artist, ('set', 'split')] = SPLITS[choice]

# Tracks without genre can only serve as unlabeled data for training, e.g. for semi-supervised algorithms.
no_genres = tracks['track', 'genres_all'].map(lambda genres: len(genres) == 0)
no_split = tracks['set', 'split'].isnull()
assert not (no_split & ~no_genres).any()
tracks.loc[no_split, ('set', 'split')] = 'training'

# Not needed any more.
tracks.drop('genre', axis=1, level=0, inplace=True)
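
The core of the split is a greedy assignment: artists with the most tracks are placed first, each into the split that is currently furthest below its target share. The sketch below is a simplified version under stated assumptions: it ignores genres and subsets, and split_by_artist is a hypothetical helper, not a function of this notebook.

```python
import numpy as np
import pandas as pd

SPLITS = ('training', 'test', 'validation')
PERCENTAGES = (0.8, 0.1, 0.1)


def split_by_artist(artist_ids):
    """Assign each track a split such that no artist spans two splits."""
    artist_ids = pd.Series(artist_ids)
    current = {split: 0 for split in SPLITS}
    assignment = {}
    # value_counts() sorts artists by decreasing number of tracks.
    for artist, count in artist_ids.value_counts().items():
        # Pick the split with the lowest fill ratio relative to its target.
        choice = np.argmin([current[s] / p for s, p in zip(SPLITS, PERCENTAGES)])
        current[SPLITS[choice]] += count
        assignment[artist] = SPLITS[choice]
    return artist_ids.map(assignment)


artist_ids = [1] * 8 + [2] * 4 + [3] * 2 + [4] * 2 + [5, 6, 7, 8]
splits = split_by_artist(artist_ids)
print(splits.value_counts())
```

With few artists the achieved proportions can deviate noticeably from the targets, since a whole artist is always assigned at once.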

7 Store

In [ ]:
for dataset in 'tracks', 'genres', 'echonest':
    eval(dataset).sort_index(axis=0, inplace=True)
    eval(dataset).sort_index(axis=1, inplace=True)
    params = dict(float_format='%.10f') if dataset == 'echonest' else dict()
    eval(dataset).to_csv(dataset + '.csv', **params)
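
The float_format parameter passed for echonest fixes the number of decimals written to the CSV. A short sketch on a made-up value, writing to an in-memory buffer instead of a file:

```python
import io

import pandas as pd

df = pd.DataFrame({'x': [1 / 3]})
buffer = io.StringIO()
# Floats are written with exactly 10 decimals; the integer index is untouched.
df.to_csv(buffer, float_format='%.10f')
lines = buffer.getvalue().splitlines()
print(lines)
```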

In [ ]:
# ./ normalize /path/to/fma
# ./ zips /path/to/fma

8 Description

In [ ]:
tracks = utils.load('tracks.csv')

In [ ]:
N = 5