Getting started with the tech side of moving WLM data to Wikidata

Introduction

The purpose of this Notebook is to give anyone wanting to help out with the technical part of moving data from the Wiki Loves Monuments database to Wikidata as flying start.

My name is Mattias Östmar and work for Wikimedia Sverige as developer as a part of the project Connected Open Heritage.

I'm new to the Wikimedia community and platforms and am really a coding communicator with little experience of professional coding. Hopefully more people will decide, just like me, to learn to code late in life - at least enough to be able to help out with relatively simple coding tasks such as this.

Disclaimer This is actual experimental code and probably full of bad coding practice and might even produce faulty results. Please report any errors you find to me on mattias.ostmar(a) wikimedia.se so it can be corrected!

If you find any other ways to improve this document - please let me know any way you find suitable.

Thank you!

(optional) Install MariaDB on Ubuntu

Here's a good tutorial if you have an Ubuntu computer: http://www.liquidweb.com/kb/how-to-install-mariadb-5-5-on-ubuntu-14-04-lts/

Start MariaDB by typing

mysql -u root -p # enter the password you set at installation

Now when you're signed in to MariaDB create a new empty database with no tables or nothing.

CREATE DATABASE new_database;

And now you an sign out from MariaDB'

exit

If everything has gone well you can now load the database file into MariaDB simply by typing from the terminal:

mysql -h localhost -u <your_username> -p <your_password> databasename < filename

Data preparation

This is an example from a Swedish list of cultural heritage buildings used to produced this guide:

https://www.wikidata.org/wiki/User:Mattias_%C3%96stmar_%28WMSE%29/how_to_map_WLM_data#Step_1:_What_is_implicity_known_about_the_data.3F.

First, download a full dump of the WLM Database ('monuments_db') from here: http://toolserver.org/~erfgoed/monuments_db.sql.gz


In [24]:
# Boring prerequisites
import pandas as pd # Beautiful tool for data wrangling! e.g. '!pip install pandas' from a Notebook
import datetime
import numpy as np
pd.set_option("display.max_rows",25) # Useful when having large Pandas DataFrames like we do here
import mysql.connector as mariadb # See https://mariadb.com/blog/how-connect-python-programs-mariadb e.g. '!pip install mysql' from Notebook
import csv

Setup a connection with local MariaDB


In [16]:
# Replace 'mos' with your username
conn = mariadb.connect(user='mos', password='', database='monuments_db',buffered=True)
cursor = conn.cursor()

Read SQL tables directly into Pandas DataFrame


In [5]:
cursor.execute("SELECT * FROM monuments_all WHERE country='se-bbr'")
all_bbr = pd.io.sql.read_sql('select * from monuments_all WHERE country="se-bbr"', conn)
all_bbr.shape


Out[5]:
(5686, 22)

In [6]:
all_bbr.keys()


Out[6]:
Index(['country', 'lang', 'project', 'id', 'adm0', 'adm1', 'adm2', 'adm3',
       'adm4', 'name', 'address', 'municipality', 'lat', 'lon', 'lat_int',
       'lon_int', 'image', 'commonscat', 'source', 'changed',
       'monument_article', 'registrant_url'],
      dtype='object')

In [7]:
all_bbr.dtypes


Out[7]:
country                     object
lang                        object
project                     object
id                          object
adm0                        object
adm1                        object
adm2                        object
adm3                        object
adm4                        object
name                        object
address                     object
municipality                object
lat                        float64
lon                        float64
lat_int                    float64
lon_int                    float64
image                       object
commonscat                  object
source                      object
changed             datetime64[ns]
monument_article            object
registrant_url              object
dtype: object

In [9]:
all_bbr.to_csv("./files/all_bbr.csv")

In [10]:
!head -n3 ./files/all_bbr.csv


,country,lang,project,id,adm0,adm1,adm2,adm3,adm4,name,address,municipality,lat,lon,lat_int,lon_int,image,commonscat,source,changed,monument_article,registrant_url
0,bytearray(b'se-bbr'),bytearray(b'sv'),wikipedia,21000001001755,bytearray(b'se'),bytearray(b'se-o'),Skövde,,,[[Vaholms bro]] (Skövde Vaholm 3:1),[[Vaholm]],Skövde,58.58737,14.00023,1172.0,280.0,Tidan vid vaholms brohus.JPG,,//sv.wikipedia.org/w/index.php?title=Lista_%C3%B6ver_byggnadsminnen_i_V%C3%A4stra_G%C3%B6talands_l%C3%A4n&oldid=34704927,2016-06-08 06:51:53,Vaholms_bro,http://www.bebyggelseregistret.raa.se/bbr2/anlaggning/visaHistorik.raa?page=historik&visaHistorik=true&anlaggningId=21000001001755
1,bytearray(b'se-bbr'),bytearray(b'sv'),wikipedia,21000001107600,bytearray(b'se'),bytearray(b'se-f'),Värnamo,,,[[Bruno Mathssons villa]] (Rolstorp 5:4),"Södrakull, Tånnö",Värnamo,57.06516,14.0398,1141.0,281.0,,,//sv.wikipedia.org/w/index.php?title=Lista_%C3%B6ver_byggnadsminnen_i_J%C3%B6nk%C3%B6pings_l%C3%A4n&oldid=33805579,2016-06-08 06:51:51,Bruno_Mathssons_villa,http://www.bebyggelseregistret.raa.se/bbr2/anlaggning/visaHistorik.raa?page=historik&visaHistorik=true&anlaggningId=21000001107600

Ooops! We get the original bytearray objects (e.g. bytearray(b'se-bbr')) from the WLM database storage formate out. We need to convert them.


In [322]:
# First fill na values

all_bbr["country"] = all_bbr.country.str.decode("utf-8").astype("str")
all_bbr["lang"] = all_bbr.lang.str.decode("utf-8").astype("str")
all_bbr["adm0"] = all_bbr.adm0.str.decode("utf-8")
all_bbr["adm1"] = all_bbr.adm1.str.decode("utf-8")

# We also want lat_int and lon_int to be integers, just to be correct
all_bbr["lat_int"] = all_bbr.lat_int.fillna(0).astype("int64")
all_bbr["lon_int"] = all_bbr.lon_int.fillna(0).astype("int64")
all_bbr.dtypes


Out[322]:
country                     object
lang                        object
project                     object
id                          object
adm0                        object
adm1                        object
adm2                        object
adm3                        object
adm4                        object
name                        object
address                     object
municipality                object
lat                        float64
lon                        float64
lat_int                      int64
lon_int                      int64
image                       object
commonscat                  object
source                      object
changed             datetime64[ns]
monument_article            object
registrant_url              object
dtype: object

In [11]:
all_bbr.to_csv("./files/all_bbr.csv")

In [13]:
!head -n 3 ./files/all_bbr.csv


,country,lang,project,id,adm0,adm1,adm2,adm3,adm4,name,address,municipality,lat,lon,lat_int,lon_int,image,commonscat,source,changed,monument_article,registrant_url
0,bytearray(b'se-bbr'),bytearray(b'sv'),wikipedia,21000001001755,bytearray(b'se'),bytearray(b'se-o'),Skövde,,,[[Vaholms bro]] (Skövde Vaholm 3:1),[[Vaholm]],Skövde,58.58737,14.00023,1172.0,280.0,Tidan vid vaholms brohus.JPG,,//sv.wikipedia.org/w/index.php?title=Lista_%C3%B6ver_byggnadsminnen_i_V%C3%A4stra_G%C3%B6talands_l%C3%A4n&oldid=34704927,2016-06-08 06:51:53,Vaholms_bro,http://www.bebyggelseregistret.raa.se/bbr2/anlaggning/visaHistorik.raa?page=historik&visaHistorik=true&anlaggningId=21000001001755
1,bytearray(b'se-bbr'),bytearray(b'sv'),wikipedia,21000001107600,bytearray(b'se'),bytearray(b'se-f'),Värnamo,,,[[Bruno Mathssons villa]] (Rolstorp 5:4),"Södrakull, Tånnö",Värnamo,57.06516,14.0398,1141.0,281.0,,,//sv.wikipedia.org/w/index.php?title=Lista_%C3%B6ver_byggnadsminnen_i_J%C3%B6nk%C3%B6pings_l%C3%A4n&oldid=33805579,2016-06-08 06:51:51,Bruno_Mathssons_villa,http://www.bebyggelseregistret.raa.se/bbr2/anlaggning/visaHistorik.raa?page=historik&visaHistorik=true&anlaggningId=21000001107600

OK. Now we've got those strings in a proper readable format.

Create DataFrame for country specific table 'monuments_allse-bbr(sv)' a.k.a 'se_bbr'


In [32]:
table_name = "se_bbr" # I've renamed monuments_se-bbr_(se) to 'se_bbr' in local database, change to correct name
se_bbr = pd.io.sql.read_sql('select * from se_bbr', conn)
se_bbr.shape


Out[32]:
(5686, 16)

In [33]:
se_bbr.keys()


Out[33]:
Index(['bbr', 'namn', 'region-iso', 'funktion', 'byggar', 'arkitekt', 'plats',
       'kommun', 'lat', 'lon', 'bild', 'commonscat', 'source', 'changed',
       'monument_article', 'registrant_url'],
      dtype='object')

In [34]:
se_bbr.dtypes


Out[34]:
bbr                         object
namn                        object
region-iso                  object
funktion                    object
byggar                      object
arkitekt                    object
plats                       object
kommun                      object
lat                        float64
lon                        float64
bild                        object
commonscat                  object
source                      object
changed             datetime64[ns]
monument_article            object
registrant_url              object
dtype: object

In [35]:
se_bbr.to_csv("./files/se_bbr.csv")

In [36]:
!head -n3 ./files/se_bbr.csv


,bbr,namn,region-iso,funktion,byggar,arkitekt,plats,kommun,lat,lon,bild,commonscat,source,changed,monument_article,registrant_url
0,21300000013342,[[Ekolsunds f.d. värdshus]] (Ekolsund 3:7),SE-C,Gästgivargård (1 byggnad),1770-talet,,,Enköping,59.65269,17.36559,Ekolsunds_vardshus.jpg,,//sv.wikipedia.org/w/index.php?title=Lista_%C3%B6ver_byggnadsminnen_i_Uppsala_l%C3%A4n&oldid=30716291,2016-06-08 06:51:46,Ekolsunds_f.d._värdshus,http://www.bebyggelseregistret.raa.se/bbr2/anlaggning/visaHistorik.raa?page=historik&visaHistorik=true&anlaggningId=21300000013342
1,21300000013350,[[Ekolsunds slott]] (Ekolsund 1:227; f.d. 1:102),SE-C,"Herrgård,Livsmedelsindustri,Bageri,Säteri (3 byggnader)",1630-talet,,,Enköping,59.65075,17.36639,Ekolsund_2.jpg,,//sv.wikipedia.org/w/index.php?title=Lista_%C3%B6ver_byggnadsminnen_i_Uppsala_l%C3%A4n&oldid=30716291,2016-06-08 06:51:46,Ekolsunds_slott,http://www.bebyggelseregistret.raa.se/bbr2/anlaggning/visaHistorik.raa?page=historik&visaHistorik=true&anlaggningId=21300000013350

In [ ]: