Getting Started with Jupyter and SQL

This notebook shows how to access SQL databases and how to connect their output to Jupyter. The SQL parts can also be run directly in MySQL by executing the mysql program inside the container.

As in the introductory lab, the containers can be launched automatically with docker-compose from the bdge/sql directory once the Git repository for the labs has been cloned:

$ git clone https://github.com/dsevilla/bdge.git
$ cd bdge/sql
$ docker-compose up

Inside the notebook, the database is available on the host named mysql.
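
As a rough sketch of how that host name ends up in a connection string, the snippet below assembles a SQLAlchemy-style URL of the kind the %%sql magic accepts. The helper name build_mysql_url is hypothetical (it is not part of ipython-sql or pymysql); the root/root credentials and the charset and local_infile options are the ones this notebook uses when it connects.

```python
from urllib.parse import urlencode

def build_mysql_url(user, password, host, database="", **options):
    """Assemble a mysql+pymysql URL (hypothetical helper, for illustration only)."""
    query = urlencode(options)  # e.g. charset=utf8mb4&local_infile=1
    url = f"mysql+pymysql://{user}:{password}@{host}/{database}"
    return f"{url}?{query}" if query else url

# "mysql" is the container host name mentioned above.
url = build_mysql_url("root", "root", "mysql", charset="utf8mb4", local_infile=1)
print(url)
```

Credentials that contain special characters would need to be percent-encoded before being placed in the URL; that step is left out here to keep the sketch short.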

Install the required Python packages:


In [ ]:
!pip install pymysql ipython-sql

In [ ]:
%load_ext sql
#%config SqlMagic.feedback = False # Avoid showing the number of rows

In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib

%matplotlib inline
matplotlib.style.use('ggplot')

In [ ]:
%%sql 
mysql+pymysql://root:root@mysql/?charset=utf8mb4&local_infile=1

Format of the Stack Overflow data dump:

  • Format: 7zipped
  • Files:
    • badges.xml
      • UserId, e.g.: "420"
      • Name, e.g.: "Teacher"
      • Date, e.g.: "2008-09-15T08:55:03.923"
    • comments.xml
      • Id
      • PostId
      • Score
      • Text, e.g.: "@Stu Thompson: Seems possible to me - why not try it?"
      • CreationDate, e.g.:"2008-09-06T08:07:10.730"
      • UserId
    • posts.xml
      • Id
      • PostTypeId
        • 1: Question
        • 2: Answer
      • ParentID (only present if PostTypeId is 2)
      • AcceptedAnswerId (only present if PostTypeId is 1)
      • CreationDate
      • Score
      • ViewCount
      • Body
      • OwnerUserId
      • LastEditorUserId
      • LastEditorDisplayName, e.g.: "Jeff Atwood"
      • LastEditDate, e.g.: "2009-03-05T22:28:34.823"
      • LastActivityDate, e.g.: "2009-03-11T12:51:01.480"
      • CommunityOwnedDate, e.g.: "2009-03-11T12:51:01.480"
      • ClosedDate, e.g.: "2009-03-11T12:51:01.480"
      • Title
      • Tags
      • AnswerCount
      • CommentCount
      • FavoriteCount
    • posthistory.xml
      • Id
      • PostHistoryTypeId
        • 1: Initial Title - The first title a question is asked with.
        • 2: Initial Body - The first raw body text a post is submitted with.
        • 3: Initial Tags - The first tags a question is asked with.
        • 4: Edit Title - A question's title has been changed.
        • 5: Edit Body - A post's body has been changed, the raw text is stored here as markdown.
        • 6: Edit Tags - A question's tags have been changed.
        • 7: Rollback Title - A question's title has reverted to a previous version.
        • 8: Rollback Body - A post's body has reverted to a previous version - the raw text is stored here.
        • 9: Rollback Tags - A question's tags have reverted to a previous version.
        • 10: Post Closed - A post was voted to be closed.
        • 11: Post Reopened - A post was voted to be reopened.
        • 12: Post Deleted - A post was voted to be removed.
        • 13: Post Undeleted - A post was voted to be restored.
        • 14: Post Locked - A post was locked by a moderator.
        • 15: Post Unlocked - A post was unlocked by a moderator.
        • 16: Community Owned - A post has become community owned.
        • 17: Post Migrated - A post was migrated.
        • 18: Question Merged - A question has had another, deleted question merged into itself.
        • 19: Question Protected - A question was protected by a moderator
        • 20: Question Unprotected - A question was unprotected by a moderator
        • 21: Post Disassociated - An admin removes the OwnerUserId from a post.
        • 22: Question Unmerged - A previously merged question has had its answers and votes restored.
      • PostId
      • RevisionGUID: At times more than one type of history record can be recorded by a single action. All of these will be grouped using the same RevisionGUID
      • CreationDate, e.g.: "2009-03-05T22:28:34.823"
      • UserId
      • UserDisplayName: populated if a user has been removed and no longer referenced by user Id
      • Comment: This field will contain the comment made by the user who edited a post
      • Text: A raw version of the new value for a given revision
        • If PostHistoryTypeId = 10, 11, 12, 13, 14, or 15 this column will contain a JSON encoded string with all users who have voted for the PostHistoryTypeId
        • If PostHistoryTypeId = 17 this column will contain migration details of either "from " or "to "
      • CloseReasonId
        • 1: Exact Duplicate - This question covers exactly the same ground as earlier questions on this topic; its answers may be merged with another identical question.
        • 2: off-topic
        • 3: subjective
        • 4: not a real question
        • 7: too localized
    • postlinks.xml
      • Id
      • CreationDate
      • PostId
      • RelatedPostId
      • PostLinkTypeId
        • 1: Linked
        • 3: Duplicate
    • users.xml
      • Id
      • Reputation
      • CreationDate
      • DisplayName
      • EmailHash
      • LastAccessDate
      • WebsiteUrl
      • Location
      • Age
      • AboutMe
      • Views
      • UpVotes
      • DownVotes
    • votes.xml
      • Id
      • PostId
      • VoteTypeId
        • 1: AcceptedByOriginator
        • 2: UpMod
        • 3: DownMod
        • 4: Offensive
        • 5: Favorite - if VoteTypeId = 5 UserId will be populated
        • 6: Close
        • 7: Reopen
        • 8: BountyStart
        • 9: BountyClose
        • 10: Deletion
        • 11: Undeletion
        • 12: Spam
        • 13: InformModerator
      • CreationDate
      • UserId (only for VoteTypeId 5)
      • BountyAmount (only for VoteTypeId 9)
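
To make the layout above more concrete, here is a minimal sketch of reading one of these files with the standard library. It assumes each record is stored as a `<row .../>` element whose attributes are the fields listed above; that element layout, the wrapping `<badges>` root, and the helper name parse_badges are assumptions for illustration. The attribute values are the examples given for badges.xml in the list.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Made-up one-row sample in the assumed <row .../> layout; the attribute
# values are the examples listed above for badges.xml.
SAMPLE = (
    '<badges>'
    '<row UserId="420" Name="Teacher" Date="2008-09-15T08:55:03.923" />'
    '</badges>'
)

def parse_badges(xml_text):
    """Return one dict per <row>, converting UserId to int and Date to datetime."""
    rows = []
    for row in ET.fromstring(xml_text).iter("row"):
        record = dict(row.attrib)
        record["UserId"] = int(record["UserId"])
        record["Date"] = datetime.strptime(record["Date"], "%Y-%m-%dT%H:%M:%S.%f")
        rows.append(record)
    return rows

badges = parse_badges(SAMPLE)
print(badges[0])
```

For a real dump file, which can be large, an incremental parser such as ET.iterparse would likely be a better fit than loading the whole document at once.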

Download all the CSV files:


In [ ]:
import os
import os.path as path
from urllib.request import urlretrieve

def download_csv_upper_dir(baseurl, filename):
    # Store the file in the parent of the current directory;
    # skip the download if the file is already there
    file = path.abspath(path.join(os.getcwd(), os.pardir, filename))
    if not os.path.isfile(file):
        # rstrip avoids a double slash when baseurl already ends with '/'
        urlretrieve(baseurl.rstrip('/') + '/' + filename, file)

baseurl = 'http://neuromancer.inf.um.es:8080/es.stackoverflow/'
download_csv_upper_dir(baseurl, 'Posts.csv')
download_csv_upper_dir(baseurl, 'Users.csv')
download_csv_upper_dir(baseurl, 'Tags.csv')
download_csv_upper_dir(baseurl, 'Comments.csv')
download_csv_upper_dir(baseurl, 'Votes.csv')

In [ ]:
%%sql
DROP SCHEMA IF EXISTS stackoverflow;
CREATE SCHEMA stackoverflow CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

In [ ]:
%%sql
USE stackoverflow;

This must be enabled so that importing CSV files is allowed.


In [ ]:
%%sql
SET GLOBAL local_infile = true;

In [ ]:
%%sql
DROP TABLE IF EXISTS Posts;
CREATE TABLE Posts (
    Id INT,
    AcceptedAnswerId INT NULL DEFAULT NULL,
    AnswerCount INT DEFAULT 0,
    Body TEXT,
    ClosedDate DATETIME(6) NULL DEFAULT NULL,
    CommentCount INT DEFAULT 0,
    CommunityOwnedDate DATETIME(6) NULL DEFAULT NULL,
    CreationDate DATETIME(6) NULL DEFAULT NULL,
    FavoriteCount INT DEFAULT 0,
    LastActivityDate DATETIME(6) NULL DEFAULT NULL,
    LastEditDate DATETIME(6) NULL DEFAULT NULL,
    LastEditorDisplayName TEXT,
    LastEditorUserId INT NULL DEFAULT NULL,
    OwnerDisplayName TEXT,
    OwnerUserId INT NULL DEFAULT NULL,
    ParentId INT NULL DEFAULT NULL,
    PostTypeId INT, -- 1 = Question, 2 = Answer
    Score INT DEFAULT 0,
    Tags TEXT,
    Title TEXT,
    ViewCount INT DEFAULT 0,
    PRIMARY KEY(Id)
)
CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

In [ ]:
%%sql
LOAD DATA LOCAL INFILE "../Posts.csv" INTO TABLE Posts
CHARACTER SET utf8mb4
COLUMNS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
ESCAPED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES
(Id,                                                                             
@AcceptedAnswerId,                                                               
@AnswerCount,                                                                    
Body,                                                                           
@ClosedDate,                                                                     
@CommentCount,                                                                   
@CommunityOwnedDate,                                                             
CreationDate,                                                                   
@FavoriteCount,                                                                  
@LastActivityDate,                                                               
@LastEditDate,                                                                   
LastEditorDisplayName,                                                          
@LastEditorUserId,                                                               
OwnerDisplayName,                                                               
@OwnerUserId,                                                                    
@ParentId,                                                                       
PostTypeId,                                                                     
Score,                                                                          
Tags,                                                                           
Title,                                                                          
@ViewCount)
SET ParentId = nullif (@ParentId, ''),
ClosedDate = nullif(@ClosedDate, ''),
LastActivityDate = nullif(@LastActivityDate, ''),
LastEditDate = nullif(@LastEditDate, ''),
AcceptedAnswerId = nullif (@AcceptedAnswerId, ''),
OwnerUserId = nullif(@OwnerUserId, ''),
LastEditorUserId = nullif(@LastEditorUserId, ''),
CommunityOwnedDate = nullif(@CommunityOwnedDate, ''),
FavoriteCount = if(@FavoriteCount = '',0,@FavoriteCount),
CommentCount = if(@CommentCount = '',0,@CommentCount),
ViewCount = if(@ViewCount = '',0,@ViewCount),
AnswerCount = if(@AnswerCount = '',0,@AnswerCount)
;
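
The SET clause above turns empty CSV fields into NULL (via NULLIF) or into 0 (via IF). As an illustration of the same idea outside the database, the sketch below applies those two rules with Python's csv module. The sample text, the reduced column set, and the helper names nullif and load_rows are made up for this example; only the conventions (quoted fields, doubled quotes, "\r\n" line endings, empty string for missing values) mirror the LOAD DATA options.

```python
import csv
import io

# Tiny made-up CSV following the same conventions as the LOAD DATA options.
SAMPLE = 'Id,ParentId,ViewCount,Title\r\n1,,10,"A ""quoted"" title"\r\n2,1,,\r\n'

def nullif(value, null_marker=""):
    """Mimic SQL NULLIF(value, ''): return None when the field is empty."""
    return None if value == null_marker else value

def load_rows(text):
    """Parse the CSV text and apply the NULLIF / IF(... = '', 0, ...) rules."""
    reader = csv.DictReader(io.StringIO(text, newline=""))
    rows = []
    for raw in reader:
        parent = nullif(raw["ParentId"])
        rows.append({
            "Id": int(raw["Id"]),
            # ParentId = nullif(@ParentId, '')
            "ParentId": int(parent) if parent is not None else None,
            # ViewCount = if(@ViewCount = '', 0, @ViewCount)
            "ViewCount": int(raw["ViewCount"] or 0),
            # Title is loaded as-is, so an empty field stays an empty string
            "Title": raw["Title"],
        })
    return rows

rows = load_rows(SAMPLE)
for row in rows:
    print(row)
```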

In [ ]:
%%sql
select count(*) from Posts;

In [ ]:
%%sql
select Id,Title,CreationDate from Posts LIMIT 2;

In [ ]:
%%sql
DROP TABLE IF EXISTS Users;
CREATE TABLE Users (
    Id INT,
    AboutMe TEXT,
    AccountId INT,
    Age INT NULL DEFAULT NULL,
    CreationDate DATETIME(6) NULL DEFAULT NULL,
    DisplayName TEXT,
    DownVotes INT DEFAULT 0,
    LastAccessDate DATETIME(6) NULL DEFAULT NULL,
    Location TEXT,
    ProfileImageUrl TEXT,
    Reputation INT DEFAULT 0,
    UpVotes INT DEFAULT 0,
    Views INT DEFAULT 0,
    WebsiteUrl TEXT,
    PRIMARY KEY(Id)
)
CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

In [ ]:
%%sql
LOAD DATA LOCAL INFILE "../Users.csv" INTO TABLE Users
CHARACTER SET utf8mb4
COLUMNS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
ESCAPED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES
(Id,AboutMe,@AccountId,@Age,@CreationDate,DisplayName,DownVotes,@LastAccessDate,Location,ProfileImageUrl,
 Reputation,UpVotes,Views,WebsiteUrl)
SET LastAccessDate = nullif(@LastAccessDate,''),
Age = nullif(@Age, ''),
CreationDate = nullif(@CreationDate,''),
AccountId = nullif(@AccountId, '')
;

In [ ]:
%%sql
select count(*) from Users;

In [ ]:
%%sql
DROP TABLE IF EXISTS Tags;
CREATE TABLE Tags (
    Id INT,
    Count INT DEFAULT 0,
    ExcerptPostId INT NULL DEFAULT NULL,
    TagName TEXT,
    WikiPostId INT NULL DEFAULT NULL,
    PRIMARY KEY(Id)
)
CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

In [ ]:
%%sql
LOAD DATA LOCAL INFILE "../Tags.csv" INTO TABLE Tags
CHARACTER SET utf8mb4
COLUMNS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
ESCAPED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES
(Id,Count,@ExcerptPostId,TagName,@WikiPostId)
SET WikiPostId = nullif(@WikiPostId, ''),
ExcerptPostId = nullif(@ExcerptPostId, '')
;

In [ ]:
%%sql
DROP TABLE IF EXISTS Comments;
CREATE TABLE Comments (
    Id INT,
    CreationDate DATETIME(6) NULL DEFAULT NULL,
    PostId INT NULL DEFAULT NULL,
    Score INT DEFAULT 0,
    Text TEXT,
    UserDisplayName TEXT,
    UserId INT NULL DEFAULT NULL,
    PRIMARY KEY(Id)
)
CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

In [ ]:
%%sql
LOAD DATA LOCAL INFILE "../Comments.csv" INTO TABLE Comments
CHARACTER SET utf8mb4
COLUMNS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
ESCAPED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES
(Id,@CreationDate,@PostId,Score,Text,@UserDisplayName,@UserId)
SET UserId = nullif(@UserId, ''),
PostId = nullif(@PostId, ''),
CreationDate = nullif(@CreationDate,''),
UserDisplayName = nullif(@UserDisplayName,'')
;

In [ ]:
%%sql
SELECT Count(*) FROM Comments;

In [ ]:
%%sql
DROP TABLE IF EXISTS Votes;
CREATE TABLE Votes (
    Id INT,
    BountyAmount INT DEFAULT 0,
    CreationDate DATETIME(6) NULL DEFAULT NULL,
    PostId INT NULL DEFAULT NULL,
    UserId INT NULL DEFAULT NULL,
    VoteTypeId INT,
    PRIMARY KEY(Id)
)
CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

In [ ]:
%%sql
LOAD DATA LOCAL INFILE "../Votes.csv" INTO TABLE Votes
CHARACTER SET utf8mb4
COLUMNS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
ESCAPED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES
(Id,@BountyAmount,@CreationDate,@PostId,@UserId,VoteTypeId)
SET UserId = nullif(@UserId, ''),
PostId = nullif(@PostId, ''),
BountyAmount = if(@BountyAmount = '',0,@BountyAmount),
CreationDate = nullif(@CreationDate, '')
;

We add the foreign keys so that all tables reference each other correctly.

We will use ALTER TABLE commands.


In [ ]:
%%sql

ALTER TABLE Posts ADD FOREIGN KEY (ParentId) REFERENCES Posts(Id);
ALTER TABLE Posts ADD FOREIGN KEY (OwnerUserId) REFERENCES Users(Id);
ALTER TABLE Posts ADD FOREIGN KEY (LastEditorUserId) REFERENCES Users(Id);
ALTER TABLE Posts ADD FOREIGN KEY (AcceptedAnswerId) REFERENCES Posts(Id);

In [ ]:
%%sql

ALTER TABLE Tags ADD FOREIGN KEY (WikiPostId) REFERENCES Posts(Id);
ALTER TABLE Tags ADD FOREIGN KEY (ExcerptPostId) REFERENCES Posts(Id);

In [ ]:
%%sql

ALTER TABLE Comments ADD FOREIGN KEY (PostId) REFERENCES Posts(Id);
ALTER TABLE Comments ADD FOREIGN KEY (UserId) REFERENCES Users(Id);

In [ ]:
%%sql

ALTER TABLE Votes ADD FOREIGN KEY (PostId) REFERENCES Posts(Id);
ALTER TABLE Votes ADD FOREIGN KEY (UserId) REFERENCES Users(Id);

In [ ]:
%%sql
EXPLAIN
SELECT Y.PostId,Y.Present
FROM (SELECT v.PostId AS PostId, COALESCE(p.Id,CONCAT('No: ', v.PostId)) AS Present
      FROM Votes v LEFT JOIN Posts p ON v.PostId = p.Id) AS Y 
WHERE Y.Present LIKE 'No%';

In [ ]:
%%sql
EXPLAIN SELECT PostId from Votes WHERE PostId NOT IN (select Id from Posts);

In [ ]:
%%sql
select * from Votes LIMIT 20;

In [ ]:
%%sql
SELECT Y.Id, Y.PostId, Y.Present
FROM (SELECT v.PostId AS PostId, v.Id AS Id, p.Id AS Pid, COALESCE(p.Id, CONCAT('No: ', v.PostId)) AS Present
      FROM Votes v LEFT JOIN Posts p ON v.PostId = p.Id) AS Y
WHERE Y.Pid IS NULL
LIMIT 1000
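
The two ways of finding orphaned votes used above (LEFT JOIN ... IS NULL and NOT IN) can be tried on toy data. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL, purely so that it runs without a server; the table contents are made up, and query plans or performance in SQLite say nothing about what EXPLAIN reports in MySQL.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Posts (Id INTEGER PRIMARY KEY);
    CREATE TABLE Votes (Id INTEGER PRIMARY KEY, PostId INTEGER);
    INSERT INTO Posts (Id) VALUES (1), (2);
    INSERT INTO Votes (Id, PostId) VALUES (10, 1), (11, 2), (12, 99), (13, 42);
""")

# Anti-join: votes whose PostId has no matching row in Posts.
left_join = conn.execute("""
    SELECT v.Id, v.PostId
    FROM Votes v LEFT JOIN Posts p ON v.PostId = p.Id
    WHERE p.Id IS NULL
    ORDER BY v.Id
""").fetchall()

# Same question expressed with NOT IN.
not_in = conn.execute("""
    SELECT Id, PostId FROM Votes
    WHERE PostId NOT IN (SELECT Id FROM Posts)
    ORDER BY Id
""").fetchall()

print(left_join)
print(left_join == not_in)
```

The two forms agree here because no vote has a NULL PostId. With NULLs present they differ: the LEFT JOIN form reports those rows, while NOT IN silently skips them.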

EXERCISE: Delete from Votes the rows that refer to nonexistent Posts


In [ ]:
%%sql
-- DELETE FROM Votes WHERE ...;

In [ ]:
%%sql
-- And now
ALTER TABLE Votes ADD FOREIGN KEY (PostId) REFERENCES Posts(Id);
ALTER TABLE Votes ADD FOREIGN KEY (UserId) REFERENCES Users(Id);

In [ ]:
%sql use stackoverflow

In [ ]:
%%sql
SHOW TABLES;

In [ ]:
%%sql
DESCRIBE Posts;

In [ ]:
top_tags = %sql SELECT Id, TagName, Count FROM Tags ORDER BY Count DESC LIMIT 40;

The results of %sql can be converted to a DataFrame!


In [ ]:
top_tags_df = top_tags.DataFrame()

In [ ]:
# invert_yaxis() makes the most used tag appear first. By default the order is reversed.
top_tags_df.plot(kind='barh',x='TagName', y='Count', figsize=(14,14*2/3)).invert_yaxis()

In [ ]:
top_tags

In [ ]:
%%sql
select Id,TagName,Count from Tags WHERE Count > 5 ORDER BY Count ASC LIMIT 40;

For comparison with HBase

I am going to run some queries to compare efficiency with HBase. I will compute the average length of the comment text for one particular post (I chose post 7251, which has the most comments, 32). I do the calculation locally because, although SQL provides the AVG function, the database might not offer the function we actually need, in which case we have to fetch all the data and compute it locally. This will also give us an idea of the database's retrieval efficiency.


In [ ]:
%%sql
SELECT p.Id, MAX(p.CommentCount) AS c FROM Posts p GROUP BY p.Id ORDER BY c DESC LIMIT 1;

In [ ]:
%sql SELECT AVG(CHAR_LENGTH(Text)) from Comments WHERE PostId = 7251;

In [ ]:
from functools import reduce

def doit():
    q = %sql select Text from Comments WHERE PostId = 7251;
    (s,n) = reduce(lambda res, e: (res[0]+len(e[0]), res[1]+1), q, (0,0))
    return (s/n)

%timeit doit()
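
The fold inside doit() can be checked without a database. The sketch below runs the same reduce over made-up rows shaped like the query result (one 1-tuple per comment) and adds a guard for the empty case, which the original function does not handle. Python's len counts characters, which should correspond to CHAR_LENGTH rather than to a byte count.

```python
from functools import reduce

# Made-up rows shaped like the result of "select Text from Comments ...".
rows = [("short",), ("a longer comment",), ("abc",)]

# Same fold as doit(): accumulate (total length, row count) in a single pass.
total, count = reduce(
    lambda acc, row: (acc[0] + len(row[0]), acc[1] + 1),
    rows,
    (0, 0),
)

# Guard against a post with no comments to avoid dividing by zero.
average = total / count if count else 0.0
print(average)
```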

EXERCISE: Find the questions with the most answers

In the cell below:


In [ ]:
%%sql
-- Questions with the most answers (top 20)

In [ ]:
%%sql
select Title from Posts where Id = 5;

Code to count the posts for each Tag


In [ ]:
# Count the posts for each Tag efficiently
import re

# Fetch the raw tag strings from Posts
results = %sql SELECT Id, Tags FROM Posts where Tags IS NOT NULL;

tagcount = {}

for result in results:
    # Extract the tag names from the Tags string and count each one
    tags = re.findall(r'<(.*?)>', result[1])
    for tag in tags:
        tagcount[tag] = tagcount.get(tag, 0) + 1

# Check that the counts match those stored in Tags
for k in tagcount:
    res = %sql select TagName,SUM(Count) from Tags WHERE TagName = :k GROUP BY TagName;
    if tagcount[k] != res[0][1]:
        print("Tag %s does NOT match (%d)!!" % (k, res[0][1]))
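
The counting loop above can also be written with collections.Counter. The sketch below does so on made-up rows shaped like (Id, Tags); it assumes, as the regular expression in the cell above implies, that tags are stored as a concatenation of `<name>` items.

```python
import re
from collections import Counter

TAG_RE = re.compile(r"<(.*?)>")

# Made-up rows shaped like (Id, Tags), using the assumed "<a><b>" encoding.
rows = [(1, "<python><sql>"), (2, "<sql>"), (3, "<mysql><sql><python>")]

# One pass over all rows: extract every tag name and count it.
tagcount = Counter(tag for _id, tags in rows for tag in TAG_RE.findall(tags))
print(tagcount.most_common(2))
```

A Counter behaves like the dict used above, so it can be passed to the DataFrame code in the following cells unchanged.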

In [ ]:
tagcount

In [ ]:
df = pd.DataFrame({'count' : pd.Series(list(tagcount.values()),
                                       index=list(tagcount.keys()))})
df

In [ ]:
sort_df = df.sort_values(by='count',ascending=False)

In [ ]:
sort_df

In [ ]:
sort_df[:100].plot(kind='bar',figsize=(20,20*2/3))

In [ ]:
sort_df[-100:].plot(kind='bar',figsize=(20,20*2/3))

EXERCISE: Create a "PostTags" table that relates each Tag to its Post

It will have four columns: Id, PostId (a reference to Posts.Id), TagId (a reference to Tags.Id), and TagName (the tag name copied from Tags).

EXERCISE: What is the fastest way to find all the Posts for a particular Tag (given its TagName)?

Several alternatives can be proposed and their efficiency compared; performance tests can be run.