Esta hoja muestra cómo acceder a bases de datos SQL y también a conectar la salida con Jupyter. Las partes en SQL también se pueden realizar directamente en MySQL ejecutando el programa mysql
del contenedor.
Igual que en la práctica de introducción, los contenedores se pueden lanzar automáticamente usando docker-compose
desde el directorio bdge/sql
una vez bajado el repositorio Git de las prácticas:
$ git clone https://github.com/dsevilla/bdge.git
$ cd bdge/sql
$ docker-compose up
Dentro del Notebook, la base de datos está disponible en el host con nombre mysql
.
In [ ]:
!pip install pymysql ipython-sql
In [ ]:
%load_ext sql
#%config SqlMagic.feedback = False # Evitar que muestre el número de filas
In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
matplotlib.style.use('ggplot')
In [ ]:
%%sql
mysql+pymysql://root:root@mysql/?charset=utf8mb4&local_infile=1
1
: AcceptedByOriginator2
: UpMod3
: DownMod4
: Offensive5
: Favorite - if VoteTypeId = 5 UserId will be populated6
: Close7
: Reopen8
: BountyStart9
: BountyClose10
: Deletion11
: Undeletion12
: Spam13
: InformModerator
In [ ]:
import os
import os.path as path
from urllib.request import urlretrieve
def download_csv_upper_dir(baseurl, filename):
file = path.abspath(path.join(os.getcwd(),os.pardir,filename))
if not os.path.isfile(file):
urlretrieve(baseurl + '/' + filename, file)
baseurl = 'http://neuromancer.inf.um.es:8080/es.stackoverflow/'
download_csv_upper_dir(baseurl, 'Posts.csv')
download_csv_upper_dir(baseurl, 'Users.csv')
download_csv_upper_dir(baseurl, 'Tags.csv')
download_csv_upper_dir(baseurl, 'Comments.csv')
download_csv_upper_dir(baseurl, 'Votes.csv')
In [ ]:
%%sql
DROP SCHEMA IF EXISTS stackoverflow;
CREATE SCHEMA stackoverflow CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
In [ ]:
%%sql
USE stackoverflow;
In [ ]:
%%sql
SET GLOBAL local_infile = true;
In [ ]:
%%sql
DROP TABLE IF EXISTS PostsPart;
CREATE TABLE PostsPart (
Id INT,
AcceptedAnswerId INT NULL DEFAULT NULL,
AnswerCount INT DEFAULT 0,
Body TEXT,
ClosedDate DATETIME(6) NULL DEFAULT NULL,
CommentCount INT DEFAULT 0,
CommunityOwnedDate DATETIME(6) NULL DEFAULT NULL,
CreationDate DATETIME(6) NULL DEFAULT NULL,
FavoriteCount INT DEFAULT 0,
LastActivityDate DATETIME(6) NULL DEFAULT NULL,
LastEditDate DATETIME(6) NULL DEFAULT NULL,
LastEditorDisplayName TEXT,
LastEditorUserId INT NULL DEFAULT NULL,
OwnerDisplayName TEXT,
OwnerUserId INT NULL DEFAULT NULL,
ParentId INT NULL DEFAULT NULL,
PostTypeId INT, -- 1 = Question, 2 = Answer
Score INT DEFAULT 0,
Tags TEXT,
Title TEXT,
ViewCount INT DEFAULT 0,
PRIMARY KEY(Id)
)
CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci
PARTITION BY KEY(Id)
PARTITIONS 6;
In [ ]:
%%sql
LOAD DATA LOCAL INFILE "../Posts.csv" INTO TABLE PostsPart
CHARACTER SET utf8mb4
COLUMNS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
ESCAPED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES
(Id,
@AcceptedAnswerId,
@AnswerCount,
Body,
@ClosedDate,
@CommentCount,
@CommunityOwnedDate,
CreationDate,
@FavoriteCount,
@LastActivityDate,
@LastEditDate,
LastEditorDisplayName,
@LastEditorUserId,
OwnerDisplayName,
@OwnerUserId,
@ParentId,
PostTypeId,
Score,
Tags,
Title,
@ViewCount)
SET ParentId = nullif (@ParentId, ''),
ClosedDate = nullif(@ClosedDate, ''),
LastEditorUserId = nullif(@OLastEditorUserId, ''),
LastActivityDate = nullif(@LastActivityDate, ''),
LastEditDate = nullif(@LastEditDate, ''),
AcceptedAnswerId = nullif (@AcceptedAnswerId, ''),
OwnerUserId = nullif(@OwnerUserId, ''),
LastEditorUserId = nullif(@LastEditorUserId, ''),
CommunityOwnedDate = nullif(@CommunityOwnedDate, ''),
FavoriteCount = if(@FavoriteCount = '',0,@FavoriteCount),
CommentCount = if(@CommentCount = '',0,@CommentCount),
ViewCount = if(@ViewCount = '',0,@ViewCount),
AnswerCount = if(@AnswerCount = '',0,@AnswerCount)
;
In [ ]:
%%sql
DROP TABLE IF EXISTS Users;
CREATE TABLE Users (
Id INT,
AboutMe TEXT,
AccountId INT,
Age INT NULL DEFAULT NULL,
CreationDate DATETIME(6) NULL DEFAULT NULL,
DisplayName TEXT,
DownVotes INT DEFAULT 0,
LastAccessDate DATETIME(6) NULL DEFAULT NULL,
Location TEXT,
ProfileImageUrl TEXT,
Reputation INT DEFAULT 0,
UpVotes INT DEFAULT 0,
Views INT DEFAULT 0,
WebsiteUrl TEXT,
PRIMARY KEY(Id)
)
CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
In [ ]:
%%sql
LOAD DATA LOCAL INFILE "../Users.csv" INTO TABLE Users
CHARACTER SET utf8mb4
COLUMNS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
ESCAPED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES
(Id,AboutMe,@AccountId,@Age,@CreationDate,DisplayName,DownVotes,LastAccessDate,Location,ProfileImageUrl,
Reputation,UpVotes,Views,WebsiteUrl)
SET LastAccessDate = nullif(@LastAccessDate,''),
Age = nullif(@Age, ''),
CreationDate = nullif(@CreationDate,''),
AccountId = nullif(@AccountId, '')
;
In [ ]:
%%sql
DROP TABLE IF EXISTS Tags;
CREATE TABLE Tags (
Id INT,
Count INT DEFAULT 0,
ExcerptPostId INT NULL DEFAULT NULL,
TagName TEXT,
WikiPostId INT NULL DEFAULT NULL,
PRIMARY KEY(Id)
)
CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
In [ ]:
%%sql
LOAD DATA LOCAL INFILE "../Tags.csv" INTO TABLE Tags
CHARACTER SET utf8mb4
COLUMNS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
ESCAPED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES
(Id,Count,@ExcerptPostId,TagName,@WikiPostId)
SET WikiPostId = nullif(@WikiPostId, ''),
ExcerptPostId = nullif(@ExcerptPostId, '')
;
In [ ]:
%%sql
DROP TABLE IF EXISTS Comments;
CREATE TABLE Comments (
Id INT,
CreationDate DATETIME(6) NULL DEFAULT NULL,
PostId INT NULL DEFAULT NULL,
Score INT DEFAULT 0,
Text TEXT,
UserDisplayName TEXT,
UserId INT NULL DEFAULT NULL,
PRIMARY KEY(Id)
)
CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
In [ ]:
%%sql
LOAD DATA LOCAL INFILE "../Comments.csv" INTO TABLE Comments
CHARACTER SET utf8mb4
COLUMNS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
ESCAPED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES
(Id,@CreationDate,@PostId,Score,Text,@UserDisplayName,@UserId)
SET UserId = nullif(@UserId, ''),
PostId = nullif(@PostId, ''),
CreationDate = nullif(@CreationDate,''),
UserDisplayName = nullif(@UserDisplayName,'')
;
In [ ]:
%%sql
DROP TABLE IF EXISTS Votes;
CREATE TABLE Votes (
Id INT,
BountyAmount INT DEFAULT 0,
CreationDate DATETIME(6) NULL DEFAULT NULL,
PostId INT NULL DEFAULT NULL,
UserId INT NULL DEFAULT NULL,
VoteTypeId INT,
PRIMARY KEY(Id)
)
CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
In [ ]:
%%sql
LOAD DATA LOCAL INFILE "../Votes.csv" INTO TABLE Votes
CHARACTER SET utf8mb4
COLUMNS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
ESCAPED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES
(Id,@BountyAmount,@CreationDate,@PostId,@UserId,VoteTypeId)
SET UserId = nullif(@UserId, ''),
PostId = nullif(@PostId, ''),
BountyAmount = if(@BountyAmount = '',0,@BountyAmount),
CreationDate = nullif(@CreationDate, '')
;
In [ ]:
%sql use stackoverflow
In [ ]:
%%sql
SHOW TABLES;
In [ ]:
%%sql
EXPLAIN SELECT * FROM PostsPart;
In [ ]:
%%sql
EXPLAIN SELECT * FROM PostsPart WHERE Id = 5;
In [ ]:
%%sql
SELECT * FROM PostsPart PARTITION(p0) LIMIT 10;
In [ ]:
%%sql
DROP TABLE IF EXISTS Posts;
CREATE TABLE Posts AS SELECT * FROM PostsPart;
In [ ]:
%%sql
DESCRIBE Posts;
In [ ]:
%%sql
ALTER TABLE Posts ADD PRIMARY KEY (Id);
ALTER TABLE Posts ADD FOREIGN KEY(OwnerUserId) REFERENCES Users(Id);
ALTER TABLE Posts ADD FOREIGN KEY(ParentId) REFERENCES Posts(Id);
In [ ]:
%%sql
DESCRIBE Posts;
Creamos un índice full-text para Tags de Posts.
In [ ]:
%%sql
CREATE FULLTEXT INDEX Post_Tag_idx ON Posts(Tags);
In [ ]:
%timeit %sql SELECT * FROM Posts WHERE Tags LIKE '%java%';
In [ ]:
%timeit %sql SELECT * FROM Posts WHERE MATCH(Tags) AGAINST ('java');
In [ ]:
%%sql
EXPLAIN SELECT * FROM Posts WHERE MATCH(Tags) AGAINST ('java');
In [ ]:
%%sql
EXPLAIN SELECT COUNT(*) FROM Posts WHERE MATCH(Tags) AGAINST ('java');
In [ ]:
%%sql
EXPLAIN SELECT * FROM Posts WHERE Tags LIKE '%java%';
En esta dirección se habla del decline de Stackoverflow. Por ejemplo, se habla de que el 77% de los usuarios sólo hacen una pregunta, que el 65% sólo responde a una pregunta y que sólo el 8% de los usuarios responden a más de una pregunta.
Los siguientes ejercicios están orientados a ver si esto también se cumple en stackoverflow en español.
En el artículo, se habla de una referencia, que se muestra a continuación. En una página web se listan un conjunto de trabajos de investigación que utilizan el conjunto de datos de stackoveflow para hacer estudios. La idea es reproducir esos resultados y ver cómo se comparan con los de español. Se muestran dos artículos.
In [ ]:
# http://stackoverflow.com/questions/19470099/view-pdf-image-in-an-ipython-notebook
class PDF(object):
def __init__(self, pdf, size=(200,200)):
self.pdf = pdf
self.size = size
def _repr_html_(self):
return '<iframe src={0} width={1[0]} height={1[1]}></iframe>'.format(self.pdf, self.size)
def _repr_latex_(self):
return r'\includegraphics[width=\textwidth]{{{0}}}'.format(self.pdf)
In [ ]:
PDF('http://ink.library.smu.edu.sg/cgi/viewcontent.cgi?article=2810&context=sis_research',size=(800,600))
Embebido en el propio Notebook:
In [ ]:
%%sql
SELECT C.c, COUNT(C.c) FROM
(SELECT COUNT(*) as C FROM Posts GROUP BY OwnerUserId) AS C
GROUP BY C.c
ORDER BY C.c;
In [ ]:
StackOverflowFacts
que incluya y ordene, por el campo CreationDate
, todos los eventos de los datos obtenidos, Posts (diferenciad entre la creación de una pregunta y una respuesta), Users, Votes, Comments. Téngase en cuenta como se vió que puede haber información (p. ej. en la tabla Votes) que no mantenga la integridad referencial. Defina las claves ajenas y los índices que considere necesarios.
In [ ]:
In [ ]: