The purpose of this notebook is to give anyone who wants to help with the technical side of moving data from the Wiki Loves Monuments database to Wikidata a flying start.
My name is Mattias Östmar and I work for Wikimedia Sverige as a developer in the Connected Open Heritage project.
I'm new to the Wikimedia community and platforms, and I'm really a communicator with little professional coding experience. Hopefully more people will decide, just like me, to learn to code later in life - at least enough to help out with relatively simple coding tasks such as this one.
Disclaimer: This is experimental code, probably full of bad coding practices, and it might even produce faulty results. Please report any errors you find to me at mattias.ostmar(a)wikimedia.se so they can be corrected!
If you find any other way to improve this document, please let me know in whatever way you find suitable.
Thank you!
Here's a good tutorial if you have an Ubuntu computer: http://www.liquidweb.com/kb/how-to-install-mariadb-5-5-on-ubuntu-14-04-lts/
Sign in to MariaDB by typing:
mysql -u root -p # enter the password you set at installation
Now that you're signed in to MariaDB, create a new, empty database with no tables in it:
CREATE DATABASE new_database;
And now you can sign out of MariaDB:
exit
If everything has gone well, you can now load the database file into MariaDB by typing the following from the terminal:
mysql -h localhost -u <your_username> -p databasename < filename # you will be prompted for your password
The examples below use a Swedish list of cultural heritage buildings ('se-bbr'), which was used to produce this guide.
First, download a full dump of the WLM Database ('monuments_db') from here: http://toolserver.org/~erfgoed/monuments_db.sql.gz
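The dump is gzip-compressed. If you don't have gunzip at hand, a small Python sketch can decompress it before loading it into MariaDB (the filenames below are assumed from the URL above):

```python
import gzip
import shutil

def decompress(gz_path, out_path):
    # Stream-decompress a .gz SQL dump so it can be loaded with mysql
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

# Example (assumes the dump has been downloaded to the current directory):
# decompress("monuments_db.sql.gz", "monuments_db.sql")
```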
In [24]:
# Boring prerequisites
import pandas as pd # Beautiful tool for data wrangling! e.g. '!pip install pandas' from a Notebook
import datetime
import numpy as np
pd.set_option("display.max_rows",25) # Useful when having large Pandas DataFrames like we do here
import mysql.connector as mariadb # See https://mariadb.com/blog/how-connect-python-programs-mariadb e.g. '!pip install mysql-connector-python' from a Notebook
import csv
In [16]:
# Replace 'mos' with your username
conn = mariadb.connect(user='mos', password='', database='monuments_db',buffered=True)
cursor = conn.cursor()
In [5]:
# read_sql runs the query and returns the result as a DataFrame directly, so no separate cursor.execute() is needed
all_bbr = pd.io.sql.read_sql('SELECT * FROM monuments_all WHERE country="se-bbr"', conn)
all_bbr.shape
Out[5]:
In [6]:
all_bbr.keys()
Out[6]:
In [7]:
all_bbr.dtypes
Out[7]:
In [9]:
all_bbr.to_csv("./files/all_bbr.csv")
In [10]:
!head -n3 ./files/all_bbr.csv
Ooops! We get the original bytearray objects (e.g. bytearray(b'se-bbr')) straight out of the WLM database's storage format. We need to convert them to strings.
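For illustration, decoding a single bytearray to a UTF-8 string works like this (a minimal sketch with a made-up value):

```python
raw = bytearray(b'se-bbr')  # what the database driver hands back
text = raw.decode("utf-8")  # convert to an ordinary Python string
print(text)  # se-bbr
```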
In [322]:
# Decode the bytearray columns to UTF-8 strings
all_bbr["country"] = all_bbr.country.str.decode("utf-8")
all_bbr["lang"] = all_bbr.lang.str.decode("utf-8")
all_bbr["adm0"] = all_bbr.adm0.str.decode("utf-8")
all_bbr["adm1"] = all_bbr.adm1.str.decode("utf-8")
# lat_int and lon_int should be integers; fill missing values with 0 first, since integer columns can't hold NaN
all_bbr["lat_int"] = all_bbr.lat_int.fillna(0).astype("int64")
all_bbr["lon_int"] = all_bbr.lon_int.fillna(0).astype("int64")
all_bbr.dtypes
Out[322]:
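The fillna(0) step above matters because a pandas integer column cannot hold NaN; a minimal illustration with made-up values:

```python
import pandas as pd
import numpy as np

s = pd.Series([59.0, np.nan, 18.0])  # float column with a missing value
ints = s.fillna(0).astype("int64")   # replace NaN before casting to int
print(ints.tolist())  # [59, 0, 18]
```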
In [11]:
all_bbr.to_csv("./files/all_bbr.csv")
In [13]:
!head -n 3 ./files/all_bbr.csv
OK. Now we've got those strings in a proper, readable format.
In [32]:
table_name = "se_bbr" # I've renamed monuments_se-bbr_(se) to 'se_bbr' in my local database; change this to the correct name
se_bbr = pd.io.sql.read_sql("select * from {}".format(table_name), conn)
se_bbr.shape
Out[32]:
In [33]:
se_bbr.keys()
Out[33]:
In [34]:
se_bbr.dtypes
Out[34]:
In [35]:
se_bbr.to_csv("./files/se_bbr.csv")
In [36]:
!head -n3 ./files/se_bbr.csv