Our loggins application, available as a public UI at http://computers.library.gwu.edu, has been running continuously for over two years. It tracks user logins at the public computing workstations we maintain in GW Libraries and shows where available machines can be found. We conceived of and implemented the project in 2013 as we prepared to welcome the GWU community back into Gelman Library after a major renovation that reconfigured the building: a new entrance, new computing space, and a very different set of flows through the building. We wanted users to know where they could find available machines -- even in this era of ubiquitous mobile devices, our public machines are still heavily used, for a variety of reasons.
This set of changes gave us the opportunity to collect data as well, and with a few important exceptions, we've continuously tracked logins and sessions (start and stop times, platform, and machine identifiers, but otherwise anonymously). Here's a snapshot of the public interface:
In [7]:
from IPython.display import Image
Image(url="screenshot.png", height=600, width=400)
Out[7]:
For nearly two years, we displayed this interface on a large kiosk just inside our new entrance, collecting data to keep it up to date around the clock (our library is open 24x7 during fall and spring semesters, aside from holidays). In those two years, we collected a lot of data. Let's have a look.
We start by exporting data from our live system using PostgreSQL's COPY command via psql, e.g.:
% echo "COPY ui_building TO STDOUT DELIMITER ',' CSV HEADER" | psql -h localhost -U django -d loggins > /tmp/building.csv
% echo "COPY ui_location TO STDOUT DELIMITER ',' CSV HEADER" | psql -h localhost -U django -d loggins > /tmp/location.csv
% echo "COPY ui_zone TO STDOUT DELIMITER ',' CSV HEADER" | psql -h localhost -U django -d loggins > /tmp/zone.csv
% echo "COPY ui_session TO STDOUT DELIMITER ',' CSV HEADER" | psql -h localhost -U django -d loggins > /tmp/session.csv
% echo "COPY ui_floor TO STDOUT DELIMITER ',' CSV HEADER" | psql -h localhost -U django -d loggins > /tmp/floor.csv
I've completed this task on a copy of our full dataset and pulled the resulting CSV files into a local directory.
In [13]:
!wc *.csv
The data is normalized for production capture and display. Information about our buildings, floors, and zones within floors allows us to show the distinct counts above, and is essentially reference information in our normalized schema.
In [14]:
!csvlook building.csv
In [15]:
!csvlook floor.csv
In [16]:
!csvlook zone.csv
Locations are individual workstations, or rather, the spots where one or another public workstation might be installed. Over time, these workstations may be taken down for maintenance and later returned, or replaced entirely by new workstations with new identifiers.
In [126]:
!head -10 location.csv | csvcut -c1,2,3,4,5,10 | csvlook
In [22]:
!head -20 location.csv | csvcut -c1,6,7,8,9 | csvlook
The state values may be i for "logged in", n for "no response", or a for "available". This table captures a particular set of states at only the current time, making that public view snapshot above an easy query to run. Here are some more details about the overall set of states and machines by OS.
In [29]:
!csvcut -c6 location.csv | csvsort -c1 | sort | uniq -c
"state" being the column header, we can see that at the time of CSV export, overall, 86 machines were in use, 63 were available, and 54 were not responding. Due to machine turnover, many of these 54 machines not responding might very well be retired or otherwise no longer physically available.
In [34]:
!csvcut -c5 location.csv | sort | uniq -c
"os" being the column header, we see that most of our machines are Win7 virtual machines, 43 are Macs, and a handful do not have their OS recorded.
Finally, we have over 570,000 sessions, dating back to the first production app deployment in July 2013:
In [30]:
!head session.csv | csvlook
In [33]:
!csvcut -c5 session.csv | sort | uniq -c | head -2
In [1]:
%load_ext sql
In [2]:
!createdb -U dchud logginsdw
In [6]:
!csvsql --db postgresql:///logginsdw --insert building.csv
In [7]:
!csvsql --db postgresql:///logginsdw --insert floor.csv
In [8]:
!csvsql --db postgresql:///logginsdw --insert zone.csv
In [9]:
!csvsql --db postgresql:///logginsdw --insert location.csv
In [11]:
!csvsql --db postgresql:///logginsdw --insert session.csv
In [2]:
%sql postgresql://dchud@localhost/logginsdw
Out[2]:
In [3]:
%%sql
SELECT * FROM zone
Out[3]:
In [4]:
%%sql
SELECT * FROM location LIMIT 5
Out[4]:
In [5]:
%%sql
SELECT * FROM session LIMIT 5
Out[5]:
Looks like all the data loaded just fine; we're good to go.
This data is in a schema optimized for production functions: capture of user sessions and display of current state for users looking for a machine. To make the data more readily usable for analysis, we'll transform it into a set of traditional star-style data warehouse tables. The basic idea of this kind of schema is to model "facts", or processes we wish to measure and analyze, and "dimensions", the contexts we bring to those processes when asking different kinds of questions. For example, if we want to study computing session length by day and by building floor, we'd use days and locations as dimensions and the session length as the fact to measure. This calls for three new tables: one for each dimension (time and location) and one for the fact (session), though we'll give them clear names to differentiate the new tables from the old ones.
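To make that concrete, once the warehouse tables below are in place, the session-length-by-floor-and-day question could be answered with a query roughly like this (a sketch using the d_day, d_location, and f_login tables we build in the rest of this notebook):
SELECT d_day.day_of_week_name, d_location.floor, AVG(f_login.duration_minutes)
FROM f_login, d_day, d_location
WHERE f_login.day_key_start = d_day.day_key
  AND f_login.location_key = d_location.location_key
GROUP BY d_day.day_of_week_name, d_location.floor;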
One more thing about DW-style schema designs - we denormalize them heavily to simplify queries for analysis. If this is new to you but you know normalization, it will feel weird. But it works; just roll with it.
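For a small taste of why, compare how you'd find the floor for a given machine in the normalized source schema versus the d_location dimension we're about to build (location_id 123 is a made-up example):
-- normalized source schema: walk the foreign keys
SELECT floor.floor
FROM location, zone, floor
WHERE location.id = 123
  AND location.zone_id = zone.id
  AND zone.floor_id = floor.id;

-- denormalized dimension: one lookup, no joins
SELECT floor FROM d_location WHERE location_id = 123;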
Fortunately, if you want to get up to speed quickly on what this set of techniques entails, there are two very good books available (among many others). The first is a technically oriented title:
In [10]:
Image(url="star-schema.jpeg", height=600, width=400)
Out[10]:
I taught with this book this past semester, and it was widely liked for being accessible and focused on the key technical principles. It's a quick way to pick up the main ideas in data warehousing.
It is available online at GWU: http://findit.library.gwu.edu/item/8821759
The other is about the project management side of things:
In [11]:
Image(url="agile-design.jpg", height=600, width=400)
Out[11]:
I've only skimmed this book, but it looks like it aligns the work of building one of these systems with the agile techniques we've all been learning about. A student in my class with agile experience did a review of it and recommended it highly.
Let's get into it, starting with the exact model described above: a table for each dimension (time and location), and one table for the facts (session).
First we'll build the location dimension by rolling all the building, floor, and zone information into one redundant dimension.
We'll also use a simple naming convention: for DW tables, dimension names will start with d_, and fact names will start with f_.
In [108]:
%%sql
DROP TABLE IF EXISTS d_location;
CREATE TABLE d_location (
location_key SERIAL NOT NULL PRIMARY KEY,
building_name CHAR(6) NOT NULL,
building_id SMALLINT NOT NULL DEFAULT 0,
floor SMALLINT NOT NULL,
floor_id SMALLINT NOT NULL DEFAULT 0,
zone_name CHAR(25) NOT NULL DEFAULT '',
zone_display_order SMALLINT NOT NULL,
zone_id SMALLINT NOT NULL DEFAULT 0,
location_id SMALLINT NOT NULL
)
Out[108]:
Now we populate the table with the location information, pulling it out of the existing normalized tables into one denormalized pile.
In [111]:
%%sql
DELETE FROM d_location;
INSERT INTO d_location (building_name, building_id,
floor, floor_id,
zone_name, zone_display_order, zone_id,
location_id)
SELECT building.name AS building_name, building.id AS building_id,
floor.floor AS floor, floor.id AS floor_id,
zone.name AS zone_name, zone.display_order AS zone_display_order, zone.id AS zone_id,
location.id
FROM location, zone, floor, building
WHERE location.zone_id = zone.id
AND zone.floor_id = floor.id
AND building.id = floor.building_id;
Out[111]:
We had some records up above that didn't reference back to a zone, floor, or building. To handle those, let's also insert a signifier row to use for those missing values.
In [113]:
%%sql
INSERT INTO d_location (building_name, building_id, floor, floor_id,
zone_name, zone_display_order, zone_id,
location_id)
VALUES ('None', 0, 0, 0, 'None', 0, 0, 0)
Out[113]:
In [115]:
%%sql
SELECT * FROM d_location;
Out[115]:
We're likely to be doing some joining on that location_id column, which was an important primary/foreign key in the transaction data. In dimensional schemas like this, a transactional key column that carries over into this design is referred to as a "natural key". Its value is self-evident here, and we can index it to make sure those joins will go quickly:
In [116]:
%%sql
CREATE INDEX idx_location_id ON d_location (location_id)
Out[116]:
One more note on terminology: that location_key primary key we just created on the d_location dimension is a new addition - it didn't come from the source transaction data. In dimensional design, that's referred to as a "surrogate key". We make this distinction because we might have information about a location that changes over time (i.e. "slowly changing dimensions"). There is a lot to read about this in the Star Schema book. In this case, we can have more than one row referring to the same location_id if the location data changes; each would get its own unique new location_key here, so we can refer to each version of the source location reference.
There are two canonical examples of modeling this kind of change over time: a birthdate correction (you got it wrong, and need to fix it, so there's one truth) and a change-of-address update (both addresses are correct, just at different times). In the first case, you want one correct birthdate value; in the second, you want both addresses to be present for analysis.
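As a sketch of how those two cases might play out against our d_location table (the specific values below are made up for illustration):
-- the "one truth" case: overwrite in place, keeping the same surrogate key
UPDATE d_location SET zone_name = 'Quiet study' WHERE location_key = 42;

-- the "change over time" case: add a new row with a new surrogate key,
-- reusing the same natural key (location_id) so both versions remain queryable
INSERT INTO d_location (building_name, building_id, floor, floor_id,
                        zone_name, zone_display_order, zone_id, location_id)
VALUES ('Gelman', 1, 2, 2, 'New zone name', 1, 9, 123);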
Getting back to our model, we've taken care of our physical locations. Now let's pull out a time dimension, by day. Here we spell out every variation of the date attributes that might be helpful for analysis: the goal is to make analysis easy, so we break these pieces apart now and won't have to worry about getting them right later.
In [69]:
%%sql
DROP TABLE IF EXISTS d_day;
CREATE TABLE d_day (
day_key SERIAL NOT NULL PRIMARY KEY,
full_date DATE NOT NULL,
day SMALLINT NOT NULL,
day_of_week_number SMALLINT NOT NULL,
day_of_week_name CHAR(9) NOT NULL,
day_of_week_abbr CHAR(3) NOT NULL,
day_of_year SMALLINT NOT NULL,
week_of_month SMALLINT NOT NULL,
week_of_year SMALLINT NOT NULL,
federal_holiday_flag BOOLEAN DEFAULT FALSE,
gwu_holiday_flag BOOLEAN DEFAULT FALSE,
gwu_in_session BOOLEAN DEFAULT FALSE,
weekday_flag BOOLEAN,
weekend_flag BOOLEAN,
month SMALLINT NOT NULL,
month_name CHAR(9) NOT NULL,
month_abbr CHAR(3) NOT NULL,
quarter SMALLINT NOT NULL,
year SMALLINT NOT NULL
)
Out[69]:
Double checking our source data...
In [12]:
%%sql
SELECT * FROM session LIMIT 5;
Out[12]:
Just for convenience, let's pull this data in sorted by date; we'll add an index to make this speedy.
In [54]:
%%sql
CREATE INDEX idx_timestamp_start ON session (timestamp_start);
Out[54]:
In [70]:
%%sql
DELETE FROM d_day;
INSERT INTO d_day
(full_date,
day, day_of_week_number, day_of_week_name, day_of_week_abbr, day_of_year,
week_of_month, week_of_year,
federal_holiday_flag, gwu_holiday_flag, gwu_in_session,
weekday_flag, weekend_flag,
month, month_name, month_abbr,
quarter, year)
SELECT DISTINCT DATE(timestamp_start) AS full_date,
TO_NUMBER(TO_CHAR(timestamp_start, 'DD'), '99') AS day_of_month,
TO_NUMBER(TO_CHAR(timestamp_start, 'D'), '9') AS day_of_week_number,
TO_CHAR(timestamp_start, 'Day') AS day_of_week_name,
TO_CHAR(timestamp_start, 'Dy') AS day_of_week_abbr,
TO_NUMBER(TO_CHAR(timestamp_start, 'DDD'), '999') AS day_of_year,
TO_NUMBER(TO_CHAR(timestamp_start, 'W'), '9') AS week_of_month,
TO_NUMBER(TO_CHAR(timestamp_start, 'WW'), '99') AS week_of_year,
FALSE AS federal_holiday_flag,
FALSE AS gwu_holiday_flag,
FALSE AS gwu_in_session,
TO_NUMBER(TO_CHAR(timestamp_start, 'ID'), '9') <= 5 AS weekday_flag,
TO_NUMBER(TO_CHAR(timestamp_start, 'ID'), '9') > 5 AS weekend_flag,
TO_NUMBER(TO_CHAR(timestamp_start, 'MM'), '99') AS month,
TO_CHAR(timestamp_start, 'Month') AS month_name,
TO_CHAR(timestamp_start, 'Mon') AS month_abbr,
TO_NUMBER(TO_CHAR(timestamp_start, 'Q'), '9') AS quarter,
TO_NUMBER(TO_CHAR(timestamp_start, 'YYYY'), '9999') AS year
FROM session
ORDER BY full_date
Out[70]:
In [73]:
%%sql
SELECT * FROM d_day LIMIT 20;
Out[73]:
In [74]:
%%sql
CREATE INDEX idx_full_date ON d_day (full_date)
Out[74]:
Okay, then, that was fun. Now that we have two dimensions set up, let's add our fact table. We can create default values at first for the dimensions' new primary keys (the "surrogate keys") on the fact table, then go back and update them with lookup queries. This will save us some query complexity.
The main thing about the fact table is that it needs to feature data we want to measure (i.e. count, average, compare, or otherwise study) and everything else should be a reference to one of the dimensions.
In [91]:
%%sql
DROP TABLE IF EXISTS f_login;
CREATE TABLE f_login (
login_id SERIAL PRIMARY KEY,
location_key SMALLINT NOT NULL DEFAULT 15,
timestamp_start TIMESTAMP NOT NULL,
day_key_start SMALLINT NOT NULL,
time_start CHAR(8) NOT NULL,
timestamp_end TIMESTAMP NOT NULL,
day_key_end SMALLINT NOT NULL,
time_end CHAR(8) NOT NULL,
hour_start SMALLINT NOT NULL,
hour_end SMALLINT NOT NULL,
duration_minutes REAL NOT NULL,
duration_hours REAL NOT NULL,
session_type CHAR(13) NOT NULL DEFAULT 'Available',
os CHAR(4) NOT NULL DEFAULT '',
session_id INT NOT NULL,
location_id INT NOT NULL
)
Out[91]:
In [95]:
%%sql
DELETE FROM f_login;
INSERT INTO f_login (location_key,
timestamp_start, day_key_start, time_start,
timestamp_end, day_key_end, time_end,
hour_start, hour_end,
duration_minutes, duration_hours,
session_type, os,
session_id, location_id
)
SELECT 15 AS location_key,
timestamp_start,
1 AS day_key_start,
TO_CHAR(timestamp_start, 'HH24:MI:SS') AS time_start,
timestamp_end,
2 AS day_key_end,
TO_CHAR(timestamp_end, 'HH24:MI:SS') AS time_end,
TO_NUMBER(TO_CHAR(timestamp_start, 'HH24'), '99') AS hour_start,
TO_NUMBER(TO_CHAR(timestamp_end, 'HH24'), '99') AS hour_end,
EXTRACT(EPOCH FROM (timestamp_end - timestamp_start)) / 60 AS duration_minutes,
EXTRACT(EPOCH FROM (timestamp_end - timestamp_start)) / 3600 AS duration_hours,
CASE WHEN session_type='i' THEN 'In use'
WHEN session_type='n' THEN 'Unavailable' END AS session_type,
CASE WHEN os IS NULL THEN ''
ELSE os END,
session.id AS session_id,
session.location_id AS location_id
FROM session, location
WHERE location.id = session.location_id
ORDER BY session_id
Out[95]:
In [96]:
%%sql
SELECT * FROM f_login LIMIT 10
Out[96]:
Now go back and set the key references correctly.
In [105]:
%%sql
UPDATE f_login
SET day_key_start = d_day.day_key
FROM d_day
WHERE DATE(f_login.timestamp_start) = d_day.full_date
Out[105]:
In [106]:
%%sql
UPDATE f_login
SET day_key_end = d_day.day_key
FROM d_day
WHERE DATE(f_login.timestamp_end) = d_day.full_date
Out[106]:
In [117]:
%%sql
UPDATE f_login
SET location_key = d_location.location_key
FROM d_location
WHERE f_login.location_id = d_location.location_id
Out[117]:
In [118]:
%%sql
SELECT * FROM f_login LIMIT 10
Out[118]:
In [15]:
%matplotlib inline
In [16]:
%%sql
SELECT AVG(duration_minutes), os, building_name
FROM f_login, d_location
WHERE d_location.location_key = f_login.location_key
AND duration_hours < 12
GROUP BY os, building_name
HAVING os IN ('mac', 'win7')
Out[16]:
In [17]:
_.bar()
Out[17]:
In [14]:
%%sql
SELECT hour_start, AVG(duration_minutes)
FROM f_login, d_day
WHERE d_day.day_key = f_login.day_key_start
AND duration_hours < 12
GROUP BY hour_start
ORDER BY hour_start
Out[14]:
In [15]:
_.bar()
Out[15]:
In [9]:
%%sql
SELECT COUNT(*), session_type, os
FROM f_login
WHERE duration_hours > 48
GROUP BY os, session_type
LIMIT 50
Out[9]:
In [10]:
_.bar()
Out[10]:
In [49]:
%%sql
SELECT os, session_type, ROUND(duration_minutes) AS minutes, COUNT(*)
FROM f_login
WHERE duration_minutes <= 20
AND os = 'win7'
GROUP BY os, session_type, minutes
ORDER BY os, session_type, minutes ASC
Out[49]:
In [50]:
_.bar()
Out[50]:
Fascinating - clearly the graph is dominated by short "Unavailable" sessions. We can go in for a closer look at each of the two session types:
In [17]:
%%sql
SELECT os, session_type, ROUND(duration_minutes) AS minutes, COUNT(*)
FROM f_login
WHERE duration_minutes <= 20
AND os = 'win7'
AND session_type = 'Unavailable'
GROUP BY os, session_type, minutes
ORDER BY os, session_type, minutes ASC
Out[17]:
In [18]:
_.bar()
Out[18]:
In [43]:
%%sql
SELECT os, session_type, ROUND(duration_minutes) AS minutes, COUNT(*)
FROM f_login
WHERE duration_minutes <= 30
AND os = 'win7'
AND session_type = 'In use'
GROUP BY os, session_type, minutes
ORDER BY os, session_type, minutes ASC
Out[43]:
In [44]:
_.bar()
Out[44]:
Let's see if we have the same profile for OS X sessions:
In [45]:
%%sql
SELECT os, session_type, ROUND(duration_minutes) AS minutes, COUNT(*)
FROM f_login
WHERE duration_minutes <= 30
AND os = 'mac'
AND session_type = 'In use'
GROUP BY os, session_type, minutes
ORDER BY os, session_type, minutes ASC
Out[45]:
In [46]:
_.bar()
Out[46]:
Looks pretty clear that there is a heavy profile of very short sessions on both platforms. Is that true at all locations, or just the Gelman entrance floor?
In [35]:
%%sql
SELECT os, session_type, ROUND(duration_minutes) AS minutes, COUNT(*)
FROM f_login, d_location
WHERE d_location.location_key = f_login.location_key
AND duration_minutes <= 30
AND os = 'win7'
AND session_type = 'In use'
AND building_name = 'Gelman'
AND floor = 2
GROUP BY os, session_type, minutes
ORDER BY os, session_type, minutes ASC
Out[35]:
In [36]:
_.bar()
Out[36]:
In [39]:
%%sql
SELECT os, session_type, ROUND(duration_minutes) AS minutes, COUNT(*)
FROM f_login, d_location
WHERE d_location.location_key = f_login.location_key
AND duration_minutes <= 30
AND os = 'win7'
AND session_type = 'In use'
AND building_name = 'Gelman'
AND floor != 2
GROUP BY os, session_type, minutes
ORDER BY os, session_type, minutes ASC
Out[39]:
In [40]:
_.bar()
Out[40]:
And how about at Eckles?
In [41]:
%%sql
SELECT os, session_type, ROUND(duration_minutes) AS minutes, COUNT(*)
FROM f_login, d_location
WHERE d_location.location_key = f_login.location_key
AND duration_minutes <= 30
AND os = 'win7'
AND session_type = 'In use'
AND building_name = 'Eckles'
GROUP BY os, session_type, minutes
ORDER BY os, session_type, minutes ASC
Out[41]:
In [42]:
_.bar()
Out[42]:
You can imagine a wide variety of other questions you'd want to answer here, like "how does this short-session peak change over different hours of the day, or on weekends?"
In [51]:
%%sql
SELECT session_type, ROUND(duration_minutes) AS minutes, COUNT(*)
FROM f_login, d_day
WHERE d_day.day_key = f_login.day_key_start
AND duration_minutes <= 30
AND session_type = 'In use'
AND weekday_flag = True
GROUP BY session_type, minutes
ORDER BY session_type, minutes ASC
Out[51]:
In [52]:
_.bar()
Out[52]:
In [53]:
%%sql
SELECT session_type, ROUND(duration_minutes) AS minutes, COUNT(*)
FROM f_login, d_day
WHERE d_day.day_key = f_login.day_key_start
AND duration_minutes <= 30
AND session_type = 'In use'
AND weekend_flag = True
GROUP BY session_type, minutes
ORDER BY session_type, minutes ASC
Out[53]:
In [54]:
_.bar()
Out[54]:
This is all well and good when reviewing just one dataset. It gets much more interesting (and you really need those books) when you add more kinds of transactions into the mix.
We also have, for example, GWorld tap logs from the entrance. They look like this:
In [56]:
!head -20 20150802-entrance.csv | csvlook
This date, August 2, 2015, was a Sunday, and we were operating on summer hours (12-6).
"Entrances" are a new kind of fact here - we want to be able to measure them. One important and obvious dimension is "day", which we already have in the system. We could define another one for "affiliation", but let's skip that for now, as that data is very straightforward (just two columns).
Like we did before, we can load the raw data in one step, then create a new fact table in another step.
In [78]:
!csvsql --db postgresql:///logginsdw --insert entrance.csv
In [79]:
%%sql
SELECT * FROM entrance LIMIT 10
Out[79]:
And now a fact table:
In [58]:
%%sql
DROP TABLE IF EXISTS f_entrance;
CREATE TABLE f_entrance (
entrance_id SERIAL PRIMARY KEY,
day_key SMALLINT NOT NULL,
hour SMALLINT NOT NULL,
affiliation VARCHAR(255),
subaffiliation VARCHAR(255),
count INT NOT NULL
)
Out[58]:
...and now we can populate the fact table from the raw data. We only have one day's worth of data here; otherwise we'd do a lookup for each row like we did before (a sketch of that multi-day version follows the insert below). Right now we can just grab that one day's key:
In [63]:
%%sql
SELECT day_key
FROM d_day
WHERE full_date = '2015-08-02'
Out[63]:
In [84]:
%%sql
DELETE FROM f_entrance;
INSERT INTO f_entrance
(day_key, hour, affiliation, subaffiliation, count)
SELECT 713, hour, affiliation, subaffiliation, count
FROM entrance
ORDER BY hour, affiliation, subaffiliation
Out[84]:
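For reference, had the entrance extract spanned multiple days, the day_key lookup could be folded right into the insert, roughly like this (assuming the raw entrance table gained a full_date column in such a multi-day extract):
INSERT INTO f_entrance (day_key, hour, affiliation, subaffiliation, count)
SELECT d_day.day_key, e.hour, e.affiliation, e.subaffiliation, e.count
FROM entrance e, d_day
WHERE d_day.full_date = e.full_date;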
In [85]:
%%sql
SELECT * FROM f_entrance LIMIT 10
Out[85]:
In [88]:
%%sql
SELECT hour, SUM(COUNT)
FROM f_entrance
GROUP BY hour
ORDER BY hour
Out[88]:
In [89]:
_.bar()
Out[89]:
In [93]:
%%sql
SELECT hour_start, COUNT(*)
FROM f_login, d_location
WHERE d_location.location_key = f_login.location_key
AND building_name = 'Gelman'
AND day_key_start = 713
GROUP BY hour_start
ORDER BY hour_start ASC
Out[93]:
Okay, this is weird. Why would there be such a steady number of sessions during hours when the building should have been closed?
Let's go back to looking at their lengths.
In [98]:
%%sql
SELECT hour_start, os, session_type, COUNT(*), ROUND(AVG(duration_minutes))
FROM f_login, d_location
WHERE d_location.location_key = f_login.location_key
AND building_name = 'Gelman'
AND day_key_start = 713
GROUP BY hour_start, os, session_type
ORDER BY hour_start ASC
Out[98]:
Now we're getting somewhere - all of the off-hour sessions are win7 and Unavailable. Clearly this is some kind of background noise involving the VDI machines. Perhaps Matt, Ian, and Dominique will have insight on this. :)
In the meantime, let's focus on the "In use" sessions, which are clearly where we want to be.
In [99]:
%%sql
SELECT hour_start, os, session_type, COUNT(*), ROUND(AVG(duration_minutes))
FROM f_login, d_location
WHERE d_location.location_key = f_login.location_key
AND building_name = 'Gelman'
AND day_key_start = 713
AND session_type = 'In use'
GROUP BY hour_start, os, session_type
ORDER BY hour_start ASC
Out[99]:
In [100]:
_.bar()
Out[100]:
Ah, one more issue - long sessions. Let's filter those out a bit:
In [101]:
%%sql
SELECT hour_start, os, session_type, COUNT(*), ROUND(AVG(duration_minutes))
FROM f_login, d_location
WHERE d_location.location_key = f_login.location_key
AND building_name = 'Gelman'
AND day_key_start = 713
AND session_type = 'In use'
AND duration_minutes < 600
GROUP BY hour_start, os, session_type
ORDER BY hour_start ASC
Out[101]:
In [102]:
_.bar()
Out[102]:
Now we can start to put some things together. Let's look at each incoming user group by affiliation and compare its entrances against the number of computer sessions started, i.e. the maximum share of those sessions each group could account for. This will be a "drill-across" query: we pull data from each of the two fact tables, roll each up to a conformed dimension granularity -- day and hour -- and then join the results together.
First, the session data:
In [108]:
%%sql
SELECT hour_start AS hour, COUNT(*) AS total_sessions
FROM f_login, d_location
WHERE d_location.location_key = f_login.location_key
AND building_name = 'Gelman'
AND day_key_start = 713
AND session_type = 'In use'
GROUP BY hour_start
ORDER BY hour_start ASC
Out[108]:
Next, the group data, which we've already seen:
In [111]:
%%sql
SELECT hour, affiliation, subaffiliation, count AS group_entrances
FROM f_entrance
ORDER BY hour ASC
Out[111]:
And here we assemble it together - both result sets grouped by hour, then joined.
In [125]:
%%sql
SELECT f.hour, f.affiliation, f.subaffiliation, f.count AS group_entrances, computer_sessions_started
FROM f_entrance f
LEFT JOIN
(
SELECT hour_start AS hour, COUNT(*) AS computer_sessions_started
FROM f_login, d_location
WHERE d_location.location_key = f_login.location_key
AND building_name = 'Gelman'
AND day_key_start = 713
AND session_type = 'In use'
GROUP BY hour_start
) s
ON s.hour = f.hour
ORDER BY f.hour, affiliation, subaffiliation ASC
Out[125]: