SQL (ii) (sesión 2)

Esta hoja muestra cómo acceder a bases de datos SQL y también a conectar la salida con Jupyter. Las partes en SQL también se pueden realizar directamente en MySQL ejecutando el programa mysql del contenedor.

Igual que en la práctica de introducción, los contenedores se pueden lanzar automáticamente usando docker-compose desde el directorio bdge/sql una vez bajado el repositorio Git de las prácticas:

$ git clone https://github.com/dsevilla/bdge.git
$ cd bdge/sql
$ docker-compose up

Dentro del Notebook, la base de datos está disponible en el host con nombre mysql.


In [ ]:
!pip install pymysql ipython-sql

In [ ]:
%load_ext sql
#%config SqlMagic.feedback = False # Evitar que muestre el número de filas

In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib

%matplotlib inline
matplotlib.style.use('ggplot')

In [ ]:
%%sql 
mysql+pymysql://root:root@mysql/?charset=utf8mb4&local_infile=1
  • Format: 7zipped
  • Files:
    • badges.xml
      • UserId, e.g.: "420"
      • Name, e.g.: "Teacher"
      • Date, e.g.: "2008-09-15T08:55:03.923"
    • comments.xml
      • Id
      • PostId
      • Score
      • Text, e.g.: "@Stu Thompson: Seems possible to me - why not try it?"
      • CreationDate, e.g.:"2008-09-06T08:07:10.730"
      • UserId
    • posts.xml
      • Id
      • PostTypeId
        • 1: Question
        • 2: Answer
      • ParentID (only present if PostTypeId is 2)
      • AcceptedAnswerId (only present if PostTypeId is 1)
      • CreationDate
      • Score
      • ViewCount
      • Body
      • OwnerUserId
      • LastEditorUserId
      • LastEditorDisplayName="Jeff Atwood"
      • LastEditDate="2009-03-05T22:28:34.823"
      • LastActivityDate="2009-03-11T12:51:01.480"
      • CommunityOwnedDate="2009-03-11T12:51:01.480"
      • ClosedDate="2009-03-11T12:51:01.480"
      • Title=
      • Tags=
      • AnswerCount
      • CommentCount
      • FavoriteCount
    • posthistory.xml
      • Id
      • PostHistoryTypeId
        • 1: Initial Title - The first title a question is asked with.
        • 2: Initial Body - The first raw body text a post is submitted with.
        • 3: Initial Tags - The first tags a question is asked with.
        • 4: Edit Title - A question's title has been changed.
        • 5: Edit Body - A post's body has been changed, the raw text is stored here as markdown.
        • 6: Edit Tags - A question's tags have been changed.
        • 7: Rollback Title - A question's title has reverted to a previous version.
        • 8: Rollback Body - A post's body has reverted to a previous version - the raw text is stored here.
        • 9: Rollback Tags - A question's tags have reverted to a previous version.
        • 10: Post Closed - A post was voted to be closed.
        • 11: Post Reopened - A post was voted to be reopened.
        • 12: Post Deleted - A post was voted to be removed.
        • 13: Post Undeleted - A post was voted to be restored.
        • 14: Post Locked - A post was locked by a moderator.
        • 15: Post Unlocked - A post was unlocked by a moderator.
        • 16: Community Owned - A post has become community owned.
        • 17: Post Migrated - A post was migrated.
        • 18: Question Merged - A question has had another, deleted question merged into itself.
        • 19: Question Protected - A question was protected by a moderator
        • 20: Question Unprotected - A question was unprotected by a moderator
        • 21: Post Disassociated - An admin removes the OwnerUserId from a post.
        • 22: Question Unmerged - A previously merged question has had its answers and votes restored.
          • PostId
          • RevisionGUID: At times more than one type of history record can be recorded by a single action. All of these will be grouped using the same RevisionGUID
          • CreationDate: "2009-03-05T22:28:34.823"
          • UserId
          • UserDisplayName: populated if a user has been removed and no longer referenced by user Id
          • Comment: This field will contain the comment made by the user who edited a post
          • Text: A raw version of the new value for a given revision
        • If PostHistoryTypeId = 10, 11, 12, 13, 14, or 15 this column will contain a JSON encoded string with all users who have voted for the PostHistoryTypeId
        • If PostHistoryTypeId = 17 this column will contain migration details of either "from " or "to "
          • CloseReasonId
        • 1: Exact Duplicate - This question covers exactly the same ground as earlier questions on this topic; its answers may be merged with another identical question.
        • 2: off-topic
        • 3: subjective
        • 4: not a real question
        • 7: too localized
    • postlinks.xml
      • Id
      • CreationDate
      • PostId
      • RelatedPostId
      • PostLinkTypeId
        • 1: Linked
        • 3: Duplicate
    • users.xml
      • Id
      • Reputation
      • CreationDate
      • DisplayName
      • EmailHash
      • LastAccessDate
      • WebsiteUrl
      • Location
      • Age
      • AboutMe
      • Views
      • UpVotes
      • DownVotes
    • votes.xml
      • Id
      • PostId
      • VoteTypeId
        • 1: AcceptedByOriginator
        • 2: UpMod
        • 3: DownMod
        • 4: Offensive
        • 5: Favorite - if VoteTypeId = 5 UserId will be populated
        • 6: Close
        • 7: Reopen
        • 8: BountyStart
        • 9: BountyClose
        • 10: Deletion
        • 11: Undeletion
        • 12: Spam
        • 13: InformModerator
      • CreationDate
      • UserId (only for VoteTypeId 5)
      • BountyAmount (only for VoteTypeId 9)

In [ ]:
import os
import os.path as path
from urllib.request import urlretrieve

def download_csv_upper_dir(baseurl, filename):
    file = path.abspath(path.join(os.getcwd(),os.pardir,filename))
    if not os.path.isfile(file):
        urlretrieve(baseurl + '/' + filename, file)
        
baseurl = 'http://neuromancer.inf.um.es:8080/es.stackoverflow/'
download_csv_upper_dir(baseurl, 'Posts.csv')
download_csv_upper_dir(baseurl, 'Users.csv')
download_csv_upper_dir(baseurl, 'Tags.csv')
download_csv_upper_dir(baseurl, 'Comments.csv')
download_csv_upper_dir(baseurl, 'Votes.csv')

In [ ]:
%%sql
DROP SCHEMA IF EXISTS stackoverflow;
CREATE SCHEMA stackoverflow CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

In [ ]:
%%sql
USE stackoverflow;

In [ ]:
%%sql
SET GLOBAL local_infile = true;

In [ ]:
%%sql
DROP TABLE IF EXISTS PostsPart;
CREATE TABLE PostsPart (
    Id INT,
    AcceptedAnswerId INT NULL DEFAULT NULL,
    AnswerCount INT DEFAULT 0,
    Body TEXT,
    ClosedDate DATETIME(6) NULL DEFAULT NULL,
    CommentCount INT DEFAULT 0,
    CommunityOwnedDate DATETIME(6) NULL DEFAULT NULL,
    CreationDate DATETIME(6) NULL DEFAULT NULL,
    FavoriteCount INT DEFAULT 0,
    LastActivityDate DATETIME(6) NULL DEFAULT NULL,
    LastEditDate DATETIME(6) NULL DEFAULT NULL,
    LastEditorDisplayName TEXT,
    LastEditorUserId INT NULL DEFAULT NULL,
    OwnerDisplayName TEXT,
    OwnerUserId INT NULL DEFAULT NULL,
    ParentId INT NULL DEFAULT NULL,
    PostTypeId INT, -- 1 = Question, 2 = Answer
    Score INT DEFAULT 0,
    Tags TEXT,
    Title TEXT,
    ViewCount INT DEFAULT 0,
    PRIMARY KEY(Id)
)
CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci
PARTITION BY KEY(Id)
PARTITIONS 6;

In [ ]:
%%sql
LOAD DATA LOCAL INFILE "../Posts.csv" INTO TABLE PostsPart
CHARACTER SET utf8mb4
COLUMNS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
ESCAPED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES
(Id,                                                                             
@AcceptedAnswerId,                                                               
@AnswerCount,                                                                    
Body,                                                                           
@ClosedDate,                                                                     
@CommentCount,                                                                   
@CommunityOwnedDate,                                                             
CreationDate,                                                                   
@FavoriteCount,                                                                  
@LastActivityDate,                                                               
@LastEditDate,                                                                   
LastEditorDisplayName,                                                          
@LastEditorUserId,                                                               
OwnerDisplayName,                                                               
@OwnerUserId,                                                                    
@ParentId,                                                                       
PostTypeId,                                                                     
Score,                                                                          
Tags,                                                                           
Title,                                                                          
@ViewCount)
SET ParentId = nullif (@ParentId, ''),
ClosedDate = nullif(@ClosedDate, ''),
LastEditorUserId = nullif(@OLastEditorUserId, ''),
LastActivityDate = nullif(@LastActivityDate, ''),
LastEditDate = nullif(@LastEditDate, ''),
AcceptedAnswerId = nullif (@AcceptedAnswerId, ''),
OwnerUserId = nullif(@OwnerUserId, ''),
LastEditorUserId = nullif(@LastEditorUserId, ''),
CommunityOwnedDate = nullif(@CommunityOwnedDate, ''),
FavoriteCount = if(@FavoriteCount = '',0,@FavoriteCount),
CommentCount = if(@CommentCount = '',0,@CommentCount),
ViewCount = if(@ViewCount = '',0,@ViewCount),
AnswerCount = if(@AnswerCount = '',0,@AnswerCount)
;

In [ ]:
%%sql
DROP TABLE IF EXISTS Users;
CREATE TABLE Users (
    Id INT,
    AboutMe TEXT,
    AccountId INT,
    Age INT NULL DEFAULT NULL,
    CreationDate DATETIME(6) NULL DEFAULT NULL,
    DisplayName TEXT,
    DownVotes INT DEFAULT 0,
    LastAccessDate DATETIME(6) NULL DEFAULT NULL,
    Location TEXT,
    ProfileImageUrl TEXT,
    Reputation INT DEFAULT 0,
    UpVotes INT DEFAULT 0,
    Views INT DEFAULT 0,
    WebsiteUrl TEXT,
    PRIMARY KEY(Id)
)
CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

In [ ]:
%%sql
LOAD DATA LOCAL INFILE "../Users.csv" INTO TABLE Users
CHARACTER SET utf8mb4
COLUMNS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
ESCAPED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES
(Id,AboutMe,@AccountId,@Age,@CreationDate,DisplayName,DownVotes,LastAccessDate,Location,ProfileImageUrl,
 Reputation,UpVotes,Views,WebsiteUrl)
SET LastAccessDate = nullif(@LastAccessDate,''),
Age = nullif(@Age, ''),
CreationDate = nullif(@CreationDate,''),
AccountId = nullif(@AccountId, '')
;

In [ ]:
%%sql
DROP TABLE IF EXISTS Tags;
CREATE TABLE Tags (
    Id INT,
    Count INT DEFAULT 0,
    ExcerptPostId INT NULL DEFAULT NULL,
    TagName TEXT,
    WikiPostId INT NULL DEFAULT NULL,
    PRIMARY KEY(Id)
)
CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

In [ ]:
%%sql
LOAD DATA LOCAL INFILE "../Tags.csv" INTO TABLE Tags
CHARACTER SET utf8mb4
COLUMNS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
ESCAPED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES
(Id,Count,@ExcerptPostId,TagName,@WikiPostId)
SET WikiPostId = nullif(@WikiPostId, ''),
ExcerptPostId = nullif(@ExcerptPostId, '')
;

In [ ]:
%%sql
DROP TABLE IF EXISTS Comments;
CREATE TABLE Comments (
    Id INT,
    CreationDate DATETIME(6) NULL DEFAULT NULL,
    PostId INT NULL DEFAULT NULL,
    Score INT DEFAULT 0,
    Text TEXT,
    UserDisplayName TEXT,
    UserId INT NULL DEFAULT NULL,
    PRIMARY KEY(Id)
)
CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

In [ ]:
%%sql
LOAD DATA LOCAL INFILE "../Comments.csv" INTO TABLE Comments
CHARACTER SET utf8mb4
COLUMNS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
ESCAPED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES
(Id,@CreationDate,@PostId,Score,Text,@UserDisplayName,@UserId)
SET UserId = nullif(@UserId, ''),
PostId = nullif(@PostId, ''),
CreationDate = nullif(@CreationDate,''),
UserDisplayName = nullif(@UserDisplayName,'')
;

In [ ]:
%%sql
DROP TABLE IF EXISTS Votes;
CREATE TABLE Votes (
    Id INT,
    BountyAmount INT DEFAULT 0,
    CreationDate DATETIME(6) NULL DEFAULT NULL,
    PostId INT NULL DEFAULT NULL,
    UserId INT NULL DEFAULT NULL,
    VoteTypeId INT,
    PRIMARY KEY(Id)
)
CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

In [ ]:
%%sql
LOAD DATA LOCAL INFILE "../Votes.csv" INTO TABLE Votes
CHARACTER SET utf8mb4
COLUMNS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
ESCAPED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES
(Id,@BountyAmount,@CreationDate,@PostId,@UserId,VoteTypeId)
SET UserId = nullif(@UserId, ''),
PostId = nullif(@PostId, ''),
BountyAmount = if(@BountyAmount = '',0,@BountyAmount),
CreationDate = nullif(@CreationDate, '')
;

In [ ]:
%sql use stackoverflow

In [ ]:
%%sql
SHOW TABLES;

In [ ]:
%%sql
EXPLAIN SELECT * FROM PostsPart;

In [ ]:
%%sql
EXPLAIN SELECT * FROM PostsPart WHERE Id = 5;

In [ ]:
%%sql
SELECT * FROM PostsPart PARTITION(p0) LIMIT 10;

In [ ]:
%%sql
DROP TABLE IF EXISTS Posts;
CREATE TABLE Posts AS SELECT * FROM PostsPart;

In [ ]:
%%sql
DESCRIBE Posts;

In [ ]:
%%sql
ALTER TABLE Posts ADD PRIMARY KEY (Id);
ALTER TABLE Posts ADD FOREIGN KEY(OwnerUserId) REFERENCES Users(Id);
ALTER TABLE Posts ADD FOREIGN KEY(ParentId) REFERENCES Posts(Id);

In [ ]:
%%sql
DESCRIBE Posts;

Creamos un índice full-text para Tags de Posts.


In [ ]:
%%sql
CREATE FULLTEXT INDEX Post_Tag_idx ON Posts(Tags);

In [ ]:
%timeit %sql SELECT * FROM Posts WHERE Tags LIKE '%java%';

In [ ]:
%timeit %sql SELECT * FROM Posts WHERE MATCH(Tags) AGAINST ('java');

In [ ]:
%%sql
EXPLAIN SELECT * FROM Posts WHERE MATCH(Tags) AGAINST ('java');

In [ ]:
%%sql
EXPLAIN SELECT COUNT(*) FROM Posts WHERE MATCH(Tags) AGAINST ('java');

In [ ]:
%%sql
EXPLAIN SELECT * FROM Posts WHERE Tags LIKE '%java%';

Decline de Stackoverflow

En esta dirección se habla del decline de Stackoverflow. Por ejemplo, se habla de que el 77% de los usuarios sólo hacen una pregunta, que el 65% sólo responde a una pregunta y que sólo el 8% de los usuarios responden a más de una pregunta.

Los siguientes ejercicios están orientados a ver si esto también se cumple en stackoverflow en español.

En el artículo, se habla de una referencia, que se muestra a continuación. En una página web se listan un conjunto de trabajos de investigación que utilizan el conjunto de datos de stackoveflow para hacer estudios. La idea es reproducir esos resultados y ver cómo se comparan con los de español. Se muestran dos artículos.

Estudio empírico de Stackoverflow

Artículo que se puede encontrar aquí. La fuente original que lista varios artículos con fuente en los datos de Stackoverflow se puede ver aquí.


In [ ]:
# http://stackoverflow.com/questions/19470099/view-pdf-image-in-an-ipython-notebook
class PDF(object):
  def __init__(self, pdf, size=(200,200)):
    self.pdf = pdf
    self.size = size

  def _repr_html_(self):
    return '<iframe src={0} width={1[0]} height={1[1]}></iframe>'.format(self.pdf, self.size)

  def _repr_latex_(self):
    return r'\includegraphics[width=\textwidth]{{{0}}}'.format(self.pdf)

In [ ]:
PDF('http://ink.library.smu.edu.sg/cgi/viewcontent.cgi?article=2810&context=sis_research',size=(800,600))

Embebido en el propio Notebook:

EJERCICIO: Mostrar cómo conseguir RQ1, RQ2, RQ3 y RQ4 (tablas y gráficas) del artículo anterior, y ver si la tendencia también se repite en Stackoverflow en español


In [ ]:
%%sql
SELECT C.c, COUNT(C.c) FROM 
  (SELECT COUNT(*) as C FROM Posts GROUP BY OwnerUserId) AS C
GROUP BY C.c
ORDER BY C.c;

EJERCICIO (E1): Idear las tablas e índices necesarios para que la consulta de, dados un usuario y un tag, se pueda saber de la forma más rápida qué posts tiene ese usuario de ese tag, y también, dado un tag, saber qué usuarios han preguntado sobre ese tag. Rellenar las tablas. Mostrar con un EXPLAIN que realmente no se usa una búsqueda secuencial sobre la tabla


In [ ]:

EJERCICIO (E2): Crear una tabla de hechos StackOverflowFacts que incluya y ordene, por el campo CreationDate, todos los eventos de los datos obtenidos, Posts (diferenciad entre la creación de una pregunta y una respuesta), Users, Votes, Comments. Téngase en cuenta como se vió que puede haber información (p. ej. en la tabla Votes) que no mantenga la integridad referencial. Defina las claves ajenas y los índices que considere necesarios.


In [ ]:

EJERCICIO (E3): Con la información de la tabla StackOverflowFacts, escriba una consulta SQL que calcule el tiempo medio en segundos, para todos los usuarios, entre creación de un usuario y su primera pregunta


In [ ]:

EJERCICIO (E4): Muestre un histograma de tiempo hasta la primera pregunta que clasifique a los usuarios en grupos de número de días desde que se crea el usuario hasta la primera pregunta, considerando el último periodo como "una semana o más"