The Iris dataset is a popular dataset, especially in the Machine Learning community. It is a set of measured features of 150 Iris flowers (50 from each of 3 species) together with their classification into those species, and it is often used to introduce classification Machine Learning algorithms.
First let's download the dataset in SQLite format from Kaggle:
https://www.kaggle.com/uciml/iris/
Download database.sqlite and save it in the iris-species folder.
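Alternatively, if you have the Kaggle command-line tool installed and configured with an API token, a possible way to download the file from a terminal is kaggle datasets download -d uciml/iris -p iris-species --unzip (the -p flag sets the output folder).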
[Image: Iris flower. From Wikimedia, by Diliff, Own work, CC BY-SA 3.0]
First let's check that the sqlite database file is available, and display an error message if it is not (assert checks that the expression is True, otherwise it raises an AssertionError with the error message string provided):
In [7]:
import os
data_iris_folder_content = os.listdir("./iris-species")
In [10]:
error_message = "Error: sqlite file not available, check instructions above to download it"
assert "database.sqlite" in data_iris_folder_content, error_message
We can use the sqlite3 package from the Python standard library to connect to the sqlite database:
In [11]:
import sqlite3
In [13]:
conn = sqlite3.connect('./iris-species/database.sqlite')
In [14]:
cursor = conn.cursor()
In [15]:
type(cursor)
Out[15]:
sqlite3.Cursor
A sqlite3.Cursor object is our interface to the database, mostly through the execute method that allows us to run any SQL query on our database.
First of all we can get a list of all the tables saved in the database. This is done by reading the name column from the sqlite_master metadata table with:
SELECT name FROM sqlite_master
The output of the execute method is an iterator that can be used in a for loop to print the value of each row.
In [16]:
for row in cursor.execute("SELECT name FROM sqlite_master"):
    print(row)
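execute also supports parameterized queries, where ? placeholders are bound to Python values; this is the safe way to build a query from a variable instead of string formatting. A minimal sketch, filtering sqlite_master by its type column:
for row in cursor.execute("SELECT name FROM sqlite_master WHERE type = ?", ("table",)):
    print(row)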
A shortcut to directly execute the query and gather the results is the fetchall method:
In [17]:
cursor.execute("SELECT name FROM sqlite_master").fetchall()
Out[17]:
Notice: this way of finding the available tables in a database is specific to sqlite; other databases like MySQL or PostgreSQL have different syntax.
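For example, with a cursor from the corresponding driver, the rough equivalents would be (a sketch; the exact schema filter for PostgreSQL depends on your server setup):
# MySQL
cursor.execute("SHOW TABLES")
# PostgreSQL
cursor.execute("SELECT table_name FROM information_schema.tables WHERE table_schema = 'public'")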
Then we can execute standard SQL queries on the database. SQL is a language designed to interact with data stored in a relational database; it has a standard specification, therefore the commands below work on any database.
If you need to connect to another database, you would use another package instead of sqlite3, for example psycopg2 for PostgreSQL or PyMySQL for MySQL. Then you would connect to the database using its specific host, port and authentication credentials, but you could execute the same exact SQL statements.
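For instance, connecting to PostgreSQL with psycopg2 would look roughly like this (a minimal sketch; the host, credentials and database name below are placeholders to replace with your own):
import psycopg2

conn = psycopg2.connect(
    host="myserver.example.com",  # placeholder host
    port=5432,
    user="myuser",                # placeholder credentials
    password="mypassword",
    dbname="mydatabase",
)
cursor = conn.cursor()
# the same standard SQL statement works unchanged
cursor.execute("SELECT * FROM Iris LIMIT 20")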
Let's take a look, for example, at the first 20 rows of the Iris table:
In [18]:
sample_data = cursor.execute("SELECT * FROM Iris LIMIT 20").fetchall()
In [19]:
print(type(sample_data))
sample_data
Out[19]:
In [20]:
# cursor.description contains metadata about the columns of the last query;
# the first field of each entry is the column name
[row[0] for row in cursor.description]
Out[20]:
It is evident that the interface provided by sqlite3 is low-level; for data exploration purposes we would like to directly import data into a more user-friendly library like pandas.
In [21]:
import pandas as pd
In [22]:
iris_data = pd.read_sql_query("SELECT * FROM Iris", conn)
In [23]:
iris_data.head()
Out[23]:
In [24]:
iris_data.dtypes
Out[24]:
pandas.read_sql_query takes a SQL query and a connection object and imports the data into a DataFrame, also keeping the same data types of the database columns. pandas provides a lot of the same functionality as SQL with a more user-friendly interface.
However, sqlite3 is extremely useful for downselecting data before importing it into pandas.
For example you might have 1 TB of data in a table stored in a database on a server machine. You are interested in working on a subset of the data based on some criterion; unfortunately it would be impossible to first load all the data into pandas and then filter it, therefore we should tell the database to perform the filtering and load into pandas only the downsized dataset.
In [25]:
iris_setosa_data = pd.read_sql_query("SELECT * FROM Iris WHERE Species = 'Iris-setosa'", conn)
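By the way, instead of pasting the value directly into the query string, pandas.read_sql_query also accepts a params argument that binds Python values to ? placeholders; a sketch equivalent to the query above:
iris_setosa_data = pd.read_sql_query(
    "SELECT * FROM Iris WHERE Species = ?",
    conn,
    params=("Iris-setosa",),
)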
In [26]:
print(iris_setosa_data.shape)
print(iris_data.shape)
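If even the filtered result were too large to fit in memory, pandas.read_sql_query also accepts a chunksize argument and then returns an iterator of DataFrame objects instead of a single DataFrame. A sketch:
for chunk in pd.read_sql_query("SELECT * FROM Iris", conn, chunksize=50):
    # each chunk is a DataFrame of at most 50 rows; process it and move on
    print(chunk.shape)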