Exploring loggins and entrance data

Our loggins application, available as a public UI at http://computers.library.gwu.edu, has been running continuously for over two years. It tracks user logins at the public computing workstations we maintain here in GW Libraries, and shows where available machines might be found. We conceived of and implemented the project in 2013 as we prepared to welcome the GWU community back into Gelman Library after a major renovation that brought several architectural reconfigurations: a new entrance, new computing space, and a very different set of flows through the building. We wanted users to know where they could find available machines -- even in this era of ubiquitous mobile devices, our public machines remain heavily used, for a variety of reasons.

This set of changes gave us the opportunity to collect data as well, and aside from a few important gaps, we've continuously tracked logins and sessions (start and stop times, platform, and machine identifiers, but otherwise anonymous). Here's a snapshot of the public interface:


In [7]:
from IPython.display import Image
Image(url="screenshot.png", height=600, width=400)


Out[7]:

For nearly two years, we displayed this interface on a large kiosk just inside our new entrance, collecting data around the clock to keep it current (our library is open 24x7 during fall and spring semesters, aside from holidays). That adds up to a lot of data. Let's have a look.

We start by exporting data from our live system using PostgreSQL's COPY command via psql, e.g.:

% echo "COPY ui_building TO STDOUT DELIMITER ',' CSV HEADER" | psql -h localhost -U django -d loggins > /tmp/building.csv
% echo "COPY ui_location TO STDOUT DELIMITER ',' CSV HEADER" | psql -h localhost -U django -d loggins > /tmp/location.csv
% echo "COPY ui_zone TO STDOUT DELIMITER ',' CSV HEADER" | psql -h localhost -U django -d loggins > /tmp/zone.csv
% echo "COPY ui_session TO STDOUT DELIMITER ',' CSV HEADER" | psql -h localhost -U django -d loggins > /tmp/session.csv
% echo "COPY ui_floor TO STDOUT DELIMITER ',' CSV HEADER" | psql -h localhost -U django -d loggins > /tmp/floor.csv

I've completed this task on a copy of our full dataset and pulled the resulting CSV files into a local directory.


In [13]:
!wc *.csv


       4       4      33 building.csv
      12      12      89 floor.csv
     204     813   28410 location.csv
  573555 1720663 40020979 session.csv
      15      30     295 zone.csv
  573790 1721522 40049806 total

The data is normalized for production capture and display. Information about our buildings, floors, and zones within floors allows us to show the distinct counts above; it's essentially reference information in our normalized schema.


In [14]:
!csvlook building.csv


|-----+---------|
|  id | name    |
|-----+---------|
|  1  | Gelman  |
|  2  | Eckles  |
|  3  | VSTC    |
|-----+---------|

In [15]:
!csvlook floor.csv


|-----+-------+--------------|
|  id | floor | building_id  |
|-----+-------+--------------|
|  1  | 0     | 1            |
|  2  | 1     | 1            |
|  3  | 2     | 1            |
|  4  | 3     | 1            |
|  5  | 4     | 1            |
|  6  | 5     | 1            |
|  7  | 6     | 1            |
|  8  | 7     | 1            |
|  9  | 1     | 2            |
|  10 | 2     | 2            |
|  11 | 1     | 3            |
|-----+-------+--------------|

In [16]:
!csvlook zone.csv


|-----+-------------------------+---------------+-----------|
|  id | name                    | display_order | floor_id  |
|-----+-------------------------+---------------+-----------|
|  1  | Lower Level             | 0             | 1         |
|  2  | 1st Floor               | 1             | 2         |
|  3  | Entrance Floor          | 2             | 3         |
|  4  | Lab @ Entrance Floor    | 3             | 3         |
|  5  | 3rd Floor               | 4             | 4         |
|  6  | 4th Floor               | 5             | 5         |
|  7  | 5th Floor               | 6             | 6         |
|  8  | 6th Floor               | 7             | 7         |
|  9  | 7th Floor               | 8             | 8         |
|  10 | Global Resources Center | 9             | 8         |
|  11 | Special Collections     | 10            | 8         |
|  12 | Main                    | 0             | 9         |
|  13 | 2nd Floor               | 1             | 10        |
|  14 | Main                    | 0             | 11        |
|-----+-------------------------+---------------+-----------|

Locations are individual workstations -- or rather, the locations where one or another public workstation might be installed. Over time, these workstations may be taken down for maintenance and later returned, or replaced entirely by new workstations with new identifiers.


In [126]:
!head -10 location.csv | csvcut -c1,2,3,4,5,10 | csvlook


|-------+--------------+---------------------------+-----------------+------+----------|
|  id   | station_name | hostname                  | ip_address      | os   | zone_id  |
|-------+--------------+---------------------------+-----------------+------+----------|
|  1384 |              | GELM0MM012                | 128.164.61.161  |      |          |
|  1366 | E-L13        | E-L13                     |                 | mac  | 4        |
|  198  | E-L06        | E-L06                     | 128.164.61.162  | mac  | 4        |
|  1388 |              | GELM0MM009                | 128.164.61.158  |      |          |
|  1383 |              | GELM0MM004                | 128.164.61.153  |      |          |
|  59   | E-045        | E-045                     | 128.164.61.126  | mac  | 3        |
|  1387 |              | GELM0MM003                |                 |      |          |
|  26   | E-011        | pcoip-portal-E85B5B76E4AE | 128.164.214.239 | win7 | 3        |
|  131  | 6-006        | pcoip-portal-E85B5B76E2D1 | 128.164.213.209 | win7 | 8        |
|-------+--------------+---------------------------+-----------------+------+----------|

In [22]:
!head -20 location.csv | csvcut -c1,6,7,8,9 | csvlook


|-------+-------+-------------------------------+-------------------------------+--------------------------------|
|  id   | state | observation_time              | last_login_start_time         | last_offline_start_time        |
|-------+-------+-------------------------------+-------------------------------+--------------------------------|
|  1384 | i     | 2015-11-18 09:53:59-05        | 2015-01-27 15:58:16.281559-05 | 2015-01-27 15:58:16.281631-05  |
|  1366 | i     | 2014-12-09 14:42:43-05        | 2014-12-09 14:42:43-05        | 2014-11-17 14:32:28.675724-05  |
|  198  | i     | 2014-12-09 09:58:29-05        | 2014-12-09 09:58:29-05        | 2014-12-08 14:10:17-05         |
|  1388 | i     | 2015-11-18 13:03:27-05        | 2015-01-28 12:08:50.716618-05 | 2015-01-28 12:08:50.716657-05  |
|  1383 | i     | 2015-09-27 12:59:53-04        | 2015-01-27 15:40:57.926605-05 | 2015-01-27 15:40:57.926645-05  |
|  59   | i     | 2015-11-18 15:14:44-05        | 2015-11-18 15:14:44-05        | 2014-12-09 13:34:38-05         |
|  1387 | i     | 2015-11-18 18:07:12-05        | 2015-01-28 11:03:12.025079-05 | 2015-01-28 11:03:12.025117-05  |
|  26   | n     | 2015-11-18 18:26:37.121724-05 | 2015-11-18 11:11:44.76753-05  | 2015-11-18 11:29:50.132119-05  |
|  131  | a     | 2015-11-18 18:26:37.138253-05 | 2015-11-18 07:19:08.41537-05  | 2015-11-18 17:14:44.539967-05  |
|  42   | n     | 2015-11-18 18:26:43.156294-05 | 2015-11-18 05:29:08.666245-05 | 2015-11-18 06:12:43.893971-05  |
|  79   | i     | 2015-11-18 18:26:43.17159-05  | 2015-11-18 14:33:50.368556-05 | 2015-11-18 09:31:50.897823-05  |
|  1385 | i     | 2015-11-18 17:36:29-05        | 2015-01-27 16:00:03.352809-05 | 2015-01-27 16:00:03.352848-05  |
|  102  | i     | 2015-11-18 18:26:43.185027-05 | 2015-11-18 16:58:50.544295-05 | 2015-11-18 15:56:50.536521-05  |
|  94   | i     | 2015-11-18 18:26:43.199407-05 | 2015-11-18 17:29:19.858173-05 | 2015-11-18 15:43:13.687966-05  |
|  100  | a     | 2015-11-18 18:26:43.212684-05 | 2015-11-18 16:08:07.880534-05 | 2015-11-18 15:53:50.529648-05  |
|  1381 | i     | 2015-01-29 18:07:52-05        | 2015-01-27 15:34:16.980957-05 | 2015-01-27 15:34:16.980997-05  |
|  1389 | a     | 2015-08-12 09:40:55-04        | 2015-01-28 13:27:07.954563-05 | 2015-01-28 13:27:07.9546-05    |
|  1386 | a     | 2015-06-12 15:59:29-04        | 2015-01-28 10:41:25.448717-05 | 2015-01-28 10:41:25.448755-05  |
|  203  | i     | 2015-11-18 12:25:58-05        | 2015-11-17 09:14:10-05        | 2014-11-25 07:20:04-05         |
|-------+-------+-------------------------------+-------------------------------+--------------------------------|

The state values may be i for "logged in", n for "no response", or a for "available". This table captures only each machine's current state, which makes the public view snapshot above an easy query to run. Here are some more details about the overall set of states and machines by OS.


In [29]:
!csvcut -c6 location.csv | csvsort -c1 | sort | uniq -c


  63 a
  86 i
  54 n
   1 state

"state" being the column header, we can see that at the time of CSV export, overall, 86 machines were in use, 63 were available, and 54 were not responding. Due to machine turnover, many of these 54 machines not responding might very well be retired or otherwise no longer physically available.


In [34]:
!csvcut -c5 location.csv | sort | uniq -c


  11 ""
  43 mac
   1 os
 149 win7

"os" being the column header, we see that most of our machines are Win7 virtual machines, 43 are Macs, and a handful do not have their OS recorded.

Finally, we have over 570,000 sessions, dating back to the first production app deployment in July 2013:


In [30]:
!head session.csv | csvlook


|-----+-------------+-------------------------------+-------------------------------+---------------|
|  id | location_id | timestamp_start               | timestamp_end                 | session_type  |
|-----+-------------+-------------------------------+-------------------------------+---------------|
|  1  | 15          | 2013-07-26 15:41:46.632112-04 | 2013-07-26 16:15:02.913629-04 | i             |
|  2  | 15          | 2013-07-26 16:18:02.941385-04 | 2013-07-26 16:29:02.936998-04 | i             |
|  3  | 11          | 2013-07-26 16:24:01.860098-04 | 2013-07-26 16:31:02.809215-04 | i             |
|  4  | 8           | 2013-07-26 15:41:46.480906-04 | 2013-07-26 16:33:02.781183-04 | i             |
|  5  | 8           | 2013-07-26 16:34:02.762411-04 | 2013-07-26 16:40:02.769593-04 | i             |
|  6  | 15          | 2013-07-26 16:53:04.60264-04  | 2013-07-26 17:03:02.853404-04 | i             |
|  7  | 8           | 2013-07-26 17:02:02.733525-04 | 2013-07-26 17:04:02.737419-04 | i             |
|  8  | 11          | 2013-07-26 16:36:02.828008-04 | 2013-07-26 17:08:02.757389-04 | i             |
|  9  | 13          | 2013-07-26 15:41:46.587518-04 | 2013-07-26 18:03:02.731687-04 | i             |
|-----+-------------+-------------------------------+-------------------------------+---------------|

In [33]:
!csvcut -c5 session.csv | sort | uniq -c | head -2


426504 i
147050 n

Loading the current data

First let's load the current data into a clean new database; we can explore what to do next from there.


In [1]:
%load_ext sql


/Users/dchud/anaconda/lib/python2.7/site-packages/IPython/config.py:13: ShimWarning: The `IPython.config` package has been deprecated. You should import from traitlets.config instead.
  "You should import from traitlets.config instead.", ShimWarning)
/Users/dchud/anaconda/lib/python2.7/site-packages/IPython/utils/traitlets.py:5: UserWarning: IPython.utils.traitlets has moved to a top-level traitlets package.
  warn("IPython.utils.traitlets has moved to a top-level traitlets package.")

In [2]:
!createdb -U dchud logginsdw

In [6]:
!csvsql --db postgresql:///logginsdw --insert building.csv

In [7]:
!csvsql --db postgresql:///logginsdw --insert floor.csv

In [8]:
!csvsql --db postgresql:///logginsdw --insert zone.csv

In [9]:
!csvsql --db postgresql:///logginsdw --insert location.csv

In [11]:
!csvsql --db postgresql:///logginsdw --insert session.csv

In [2]:
%sql postgresql://dchud@localhost/logginsdw


Out[2]:
u'Connected: dchud@logginsdw'

In [3]:
%%sql 
SELECT * FROM zone


14 rows affected.
Out[3]:
id name display_order floor_id
1 Lower Level 0 1
2 1st Floor 1 2
3 Entrance Floor 2 3
4 Lab @ Entrance Floor 3 3
5 3rd Floor 4 4
6 4th Floor 5 5
7 5th Floor 6 6
8 6th Floor 7 7
9 7th Floor 8 8
10 Global Resources Center 9 8
11 Special Collections 10 8
12 Main 0 9
13 2nd Floor 1 10
14 Main 0 11

In [4]:
%%sql
SELECT * FROM location LIMIT 5


5 rows affected.
Out[4]:
id station_name hostname ip_address os state observation_time last_login_start_time last_offline_start_time zone_id
1384 None GELM0MM012 128.164.61.161 None i 2015-11-18 09:53:59 2015-01-27 15:58:16.281559 2015-01-27 15:58:16.281631 None
1366 E-L13 E-L13 None mac i 2014-12-09 14:42:43 2014-12-09 14:42:43 2014-11-17 14:32:28.675724 4
198 E-L06 E-L06 128.164.61.162 mac i 2014-12-09 09:58:29 2014-12-09 09:58:29 2014-12-08 14:10:17 4
1388 None GELM0MM009 128.164.61.158 None i 2015-11-18 13:03:27 2015-01-28 12:08:50.716618 2015-01-28 12:08:50.716657 None
1383 None GELM0MM004 128.164.61.153 None i 2015-09-27 12:59:53 2015-01-27 15:40:57.926605 2015-01-27 15:40:57.926645 None

In [5]:
%%sql
SELECT * FROM session LIMIT 5


5 rows affected.
Out[5]:
id location_id timestamp_start timestamp_end session_type
1 15 2013-07-26 15:41:46.632112 2013-07-26 16:15:02.913629 i
2 15 2013-07-26 16:18:02.941385 2013-07-26 16:29:02.936998 i
3 11 2013-07-26 16:24:01.860098 2013-07-26 16:31:02.809215 i
4 8 2013-07-26 15:41:46.480906 2013-07-26 16:33:02.781183 i
5 8 2013-07-26 16:34:02.762411 2013-07-26 16:40:02.769593 i

Looks like all the data loaded just fine; we're good to go.

Transforming into a dimensional model

This data is in a schema optimized for production functions: capture of user sessions and display of current state for users looking for a machine. To make the data more readily usable for analysis, we'll transform it into a set of traditional star-style data warehouse tables. The basic idea of this kind of schema is to model "facts" -- processes we wish to measure and analyze -- and "dimensions" -- contexts we bring to those processes when asking different kinds of questions. For example, if we want to study computing session length by day and by building floor, we'd use days and locations as dimensions, and session length as the fact to measure. This calls for three new tables: one for each dimension (time and location), and one for the fact (session).

One more thing about DW-style schema designs - we denormalize them heavily to simplify queries for analysis. If this is new to you but you know normalization, it will feel weird. But it works; just roll with it.
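To make that concrete, here's the shape of query this design is meant to enable, written against the d_day, d_location, and f_login tables we'll build below (a sketch -- nothing to run just yet):

-- Average session length by day of week and floor: one fact table,
-- two dimension tables, no intermediate joins through reference tables.
SELECT d.day_of_week_name, l.floor, AVG(f.duration_minutes) AS avg_minutes
FROM f_login f, d_day d, d_location l
WHERE f.day_key_start = d.day_key
  AND f.location_key = l.location_key
GROUP BY d.day_of_week_name, l.floor;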

Fortunately, if you want to get up to speed quickly on what this set of techniques entails, there are two very good books available (among many others). The first is a technically-oriented title:


In [10]:
Image(url="star-schema.jpeg", height=600, width=400)


Out[10]:

I taught with this book this past semester, and it was widely liked for being accessible and focused on the key technical principles. It will get you up to speed quickly on the main ideas in data warehousing.

It is available online at GWU: http://findit.library.gwu.edu/item/8821759

The other is about the project management side of things:


In [11]:
Image(url="agile-design.jpg", height=600, width=400)


Out[11]:

I've only skimmed this book, but it appears to align the work of building one of these systems with the agile techniques we've all been learning about. A student in my class with agile experience reviewed it and recommended it highly.

Let's get into it, starting with the exact model described above: a table for each dimension (time and location), and one table for the facts (session). We'll give them clear names to differentiate the new tables from the old ones.

First we'll build the location dimension by rolling all the building, floor, and zone information into one redundant dimension.

We'll also use a simple naming convention: for DW tables, dimensions will start with d_, and facts will start with f_.


In [108]:
%%sql
DROP TABLE IF EXISTS d_location;
CREATE TABLE d_location (
  location_key SERIAL NOT NULL PRIMARY KEY,
  building_name CHAR(6) NOT NULL,
  building_id SMALLINT NOT NULL DEFAULT 0,
  floor SMALLINT NOT NULL,  
  floor_id SMALLINT NOT NULL DEFAULT 0,
  zone_name CHAR(25) NOT NULL DEFAULT '',
  zone_display_order SMALLINT NOT NULL,
  zone_id SMALLINT NOT NULL DEFAULT 0,
  location_id SMALLINT NOT NULL
)


Done.
Done.
Out[108]:
[]

Now we populate the table with the location information, pulling it out of the existing normalized tables into one denormalized pile.


In [111]:
%%sql
DELETE FROM d_location;
INSERT INTO d_location (building_name, building_id,
                       floor, floor_id,
                       zone_name, zone_display_order, zone_id,
                       location_id)
SELECT building.name AS building_name, building.id AS building_id,
       floor.floor AS floor, floor.id AS floor_id,
       zone.name AS zone_name, zone.display_order AS zone_display_order, zone.id AS zone_id,
       location.id
FROM location, zone, floor, building
WHERE location.zone_id = zone.id
  AND zone.floor_id = floor.id
  AND building.id = floor.building_id;


0 rows affected.
192 rows affected.
Out[111]:
[]

We had some records up above that didn't reference back to a zone, floor, or building. To handle those, let's also insert a placeholder row to stand in for the missing values.


In [113]:
%%sql
INSERT INTO d_location (building_name, building_id, floor, floor_id, 
                        zone_name, zone_display_order, zone_id,
                        location_id)
VALUES ('None', 0, 0, 0, 'None', 0, 0, 0)


1 rows affected.
Out[113]:
[]

In [115]:
%%sql
SELECT * FROM d_location;


193 rows affected.
Out[115]:
location_key building_name building_id floor floor_id zone_name zone_display_order zone_id location_id
1 Gelman 1 1 2 1st Floor 1 2 104
2 Gelman 1 1 2 1st Floor 1 2 82
3 Gelman 1 1 2 1st Floor 1 2 76
4 Gelman 1 1 2 1st Floor 1 2 98
5 Gelman 1 1 2 1st Floor 1 2 99
6 Gelman 1 1 2 1st Floor 1 2 119
7 Gelman 1 1 2 1st Floor 1 2 109
8 Gelman 1 1 2 1st Floor 1 2 80
9 Gelman 1 1 2 1st Floor 1 2 85
10 Gelman 1 1 2 1st Floor 1 2 155
11 Gelman 1 1 2 1st Floor 1 2 111
12 Gelman 1 1 2 1st Floor 1 2 1367
13 Gelman 1 1 2 1st Floor 1 2 95
14 Gelman 1 1 2 1st Floor 1 2 88
15 Gelman 1 1 2 1st Floor 1 2 105
16 Gelman 1 1 2 1st Floor 1 2 89
17 Gelman 1 1 2 1st Floor 1 2 7
18 Gelman 1 1 2 1st Floor 1 2 124
19 Gelman 1 1 2 1st Floor 1 2 77
20 Gelman 1 1 2 1st Floor 1 2 83
21 Gelman 1 1 2 1st Floor 1 2 78
22 Gelman 1 1 2 1st Floor 1 2 116
23 Gelman 1 1 2 1st Floor 1 2 93
24 Gelman 1 1 2 1st Floor 1 2 110
25 Gelman 1 1 2 1st Floor 1 2 8
26 Gelman 1 1 2 1st Floor 1 2 79
27 Gelman 1 1 2 1st Floor 1 2 127
28 Gelman 1 1 2 1st Floor 1 2 101
29 Gelman 1 1 2 1st Floor 1 2 102
30 Gelman 1 1 2 1st Floor 1 2 126
31 Gelman 1 1 2 1st Floor 1 2 15
32 Gelman 1 1 2 1st Floor 1 2 106
33 Gelman 1 1 2 1st Floor 1 2 113
34 Gelman 1 1 2 1st Floor 1 2 156
35 Gelman 1 1 2 1st Floor 1 2 94
36 Gelman 1 1 2 1st Floor 1 2 100
37 Gelman 1 1 2 1st Floor 1 2 97
38 Gelman 1 1 2 1st Floor 1 2 84
39 Gelman 1 1 2 1st Floor 1 2 117
40 Gelman 1 1 2 1st Floor 1 2 114
41 Gelman 1 1 2 1st Floor 1 2 115
42 Gelman 1 1 2 1st Floor 1 2 107
43 Gelman 1 1 2 1st Floor 1 2 81
44 Gelman 1 1 2 1st Floor 1 2 12
45 Gelman 1 1 2 1st Floor 1 2 92
46 Gelman 1 1 2 1st Floor 1 2 120
47 Gelman 1 1 2 1st Floor 1 2 118
48 Gelman 1 1 2 1st Floor 1 2 90
49 Gelman 1 1 2 1st Floor 1 2 157
50 Gelman 1 1 2 1st Floor 1 2 123
51 Gelman 1 1 2 1st Floor 1 2 11
52 Gelman 1 1 2 1st Floor 1 2 122
53 Gelman 1 1 2 1st Floor 1 2 108
54 Gelman 1 1 2 1st Floor 1 2 1368
55 Gelman 1 1 2 1st Floor 1 2 2
56 Gelman 1 1 2 1st Floor 1 2 1369
57 Gelman 1 1 2 1st Floor 1 2 103
58 Gelman 1 1 2 1st Floor 1 2 91
59 Gelman 1 2 3 Entrance Floor 2 3 63
60 Gelman 1 2 3 Lab @ Entrance Floor 3 4 198
61 Gelman 1 2 3 Entrance Floor 2 3 59
62 Gelman 1 2 3 Entrance Floor 2 3 26
63 Gelman 1 2 3 Entrance Floor 2 3 42
64 Gelman 1 2 3 Lab @ Entrance Floor 3 4 203
65 Gelman 1 2 3 Entrance Floor 2 3 34
66 Gelman 1 2 3 Entrance Floor 2 3 65
67 Gelman 1 2 3 Entrance Floor 2 3 45
68 Gelman 1 2 3 Entrance Floor 2 3 30
69 Gelman 1 2 3 Entrance Floor 2 3 31
70 Gelman 1 2 3 Entrance Floor 2 3 16
71 Gelman 1 2 3 Entrance Floor 2 3 9
72 Gelman 1 2 3 Entrance Floor 2 3 44
73 Gelman 1 2 3 Entrance Floor 2 3 1
74 Gelman 1 2 3 Lab @ Entrance Floor 3 4 195
75 Gelman 1 2 3 Lab @ Entrance Floor 3 4 200
76 Gelman 1 2 3 Entrance Floor 2 3 52
77 Gelman 1 2 3 Entrance Floor 2 3 35
78 Gelman 1 2 3 Entrance Floor 2 3 47
79 Gelman 1 2 3 Entrance Floor 2 3 43
80 Gelman 1 2 3 Entrance Floor 2 3 28
81 Gelman 1 2 3 Entrance Floor 2 3 24
82 Gelman 1 2 3 Entrance Floor 2 3 32
83 Gelman 1 2 3 Entrance Floor 2 3 50
84 Gelman 1 2 3 Entrance Floor 2 3 46
85 Gelman 1 2 3 Entrance Floor 2 3 10
86 Gelman 1 2 3 Entrance Floor 2 3 19
87 Gelman 1 2 3 Entrance Floor 2 3 25
88 Gelman 1 2 3 Entrance Floor 2 3 14
89 Gelman 1 2 3 Entrance Floor 2 3 27
90 Gelman 1 2 3 Entrance Floor 2 3 57
91 Gelman 1 2 3 Entrance Floor 2 3 36
92 Gelman 1 2 3 Entrance Floor 2 3 54
93 Gelman 1 2 3 Entrance Floor 2 3 75
94 Gelman 1 2 3 Entrance Floor 2 3 72
95 Gelman 1 2 3 Entrance Floor 2 3 66
96 Gelman 1 2 3 Entrance Floor 2 3 48
97 Gelman 1 2 3 Entrance Floor 2 3 190
98 Gelman 1 2 3 Entrance Floor 2 3 56
99 Gelman 1 2 3 Entrance Floor 2 3 69
100 Gelman 1 2 3 Lab @ Entrance Floor 3 4 194
101 Gelman 1 2 3 Lab @ Entrance Floor 3 4 202
102 Gelman 1 2 3 Lab @ Entrance Floor 3 4 201
103 Gelman 1 2 3 Entrance Floor 2 3 58
104 Gelman 1 2 3 Entrance Floor 2 3 55
105 Gelman 1 2 3 Lab @ Entrance Floor 3 4 199
106 Gelman 1 2 3 Entrance Floor 2 3 73
107 Gelman 1 2 3 Entrance Floor 2 3 51
108 Gelman 1 2 3 Entrance Floor 2 3 205
109 Gelman 1 2 3 Entrance Floor 2 3 60
110 Gelman 1 2 3 Entrance Floor 2 3 67
111 Gelman 1 2 3 Lab @ Entrance Floor 3 4 193
112 Gelman 1 2 3 Lab @ Entrance Floor 3 4 197
113 Gelman 1 2 3 Lab @ Entrance Floor 3 4 204
114 Gelman 1 2 3 Entrance Floor 2 3 17
115 Gelman 1 2 3 Entrance Floor 2 3 208
116 Gelman 1 2 3 Entrance Floor 2 3 3
117 Gelman 1 2 3 Entrance Floor 2 3 71
118 Gelman 1 2 3 Entrance Floor 2 3 70
119 Gelman 1 2 3 Entrance Floor 2 3 49
120 Gelman 1 2 3 Entrance Floor 2 3 191
121 Gelman 1 2 3 Entrance Floor 2 3 38
122 Gelman 1 2 3 Entrance Floor 2 3 21
123 Gelman 1 2 3 Entrance Floor 2 3 68
124 Gelman 1 2 3 Entrance Floor 2 3 207
125 Gelman 1 2 3 Entrance Floor 2 3 22
126 Gelman 1 2 3 Lab @ Entrance Floor 3 4 196
127 Gelman 1 2 3 Entrance Floor 2 3 53
128 Gelman 1 2 3 Entrance Floor 2 3 41
129 Gelman 1 2 3 Entrance Floor 2 3 18
130 Gelman 1 2 3 Entrance Floor 2 3 20
131 Gelman 1 2 3 Entrance Floor 2 3 62
132 Gelman 1 2 3 Entrance Floor 2 3 64
133 Gelman 1 2 3 Entrance Floor 2 3 37
134 Gelman 1 2 3 Lab @ Entrance Floor 3 4 1366
135 Gelman 1 3 4 3rd Floor 4 5 139
136 Gelman 1 3 4 3rd Floor 4 5 135
137 Gelman 1 3 4 3rd Floor 4 5 137
138 Gelman 1 3 4 3rd Floor 4 5 134
139 Gelman 1 3 4 3rd Floor 4 5 140
140 Gelman 1 4 5 4th Floor 5 6 13
141 Gelman 1 5 6 5th Floor 6 7 6
142 Gelman 1 6 7 6th Floor 7 8 128
143 Gelman 1 6 7 6th Floor 7 8 192
144 Gelman 1 6 7 6th Floor 7 8 132
145 Gelman 1 6 7 6th Floor 7 8 133
146 Gelman 1 6 7 6th Floor 7 8 129
147 Gelman 1 6 7 6th Floor 7 8 4
148 Gelman 1 6 7 6th Floor 7 8 130
149 Gelman 1 6 7 6th Floor 7 8 5
150 Gelman 1 6 7 6th Floor 7 8 131
151 Gelman 1 7 8 Global Resources Center 9 10 151
152 Gelman 1 7 8 Global Resources Center 9 10 147
153 Gelman 1 7 8 Special Collections 10 11 152
154 Gelman 1 7 8 Global Resources Center 9 10 161
155 Gelman 1 7 8 Global Resources Center 9 10 148
156 Gelman 1 7 8 Global Resources Center 9 10 150
157 Gelman 1 7 8 Global Resources Center 9 10 162
158 Gelman 1 7 8 Global Resources Center 9 10 149
159 Gelman 1 7 8 Global Resources Center 9 10 160
160 Gelman 1 7 8 Global Resources Center 9 10 159
161 Gelman 1 7 8 Special Collections 10 11 153
162 Eckles 2 1 9 Main 0 12 142
163 Eckles 2 1 9 Main 0 12 143
164 Eckles 2 1 9 Main 0 12 154
165 Eckles 2 1 9 Main 0 12 145
166 Eckles 2 1 9 Main 0 12 144
167 Eckles 2 1 9 Main 0 12 158
168 Eckles 2 2 10 2nd Floor 1 13 146
169 VSTC 3 1 11 Main 0 14 173
170 VSTC 3 1 11 Main 0 14 183
171 VSTC 3 1 11 Main 0 14 182
172 VSTC 3 1 11 Main 0 14 181
173 VSTC 3 1 11 Main 0 14 163
174 VSTC 3 1 11 Main 0 14 170
175 VSTC 3 1 11 Main 0 14 180
176 VSTC 3 1 11 Main 0 14 167
177 VSTC 3 1 11 Main 0 14 169
178 VSTC 3 1 11 Main 0 14 175
179 VSTC 3 1 11 Main 0 14 171
180 VSTC 3 1 11 Main 0 14 178
181 VSTC 3 1 11 Main 0 14 165
182 VSTC 3 1 11 Main 0 14 177
183 VSTC 3 1 11 Main 0 14 179
184 VSTC 3 1 11 Main 0 14 168
185 VSTC 3 1 11 Main 0 14 174
186 VSTC 3 1 11 Main 0 14 186
187 VSTC 3 1 11 Main 0 14 185
188 VSTC 3 1 11 Main 0 14 184
189 VSTC 3 1 11 Main 0 14 172
190 VSTC 3 1 11 Main 0 14 164
191 VSTC 3 1 11 Main 0 14 176
192 VSTC 3 1 11 Main 0 14 166
193 None 0 0 0 None 0 0 0

We're likely to do some joining on that location_id column, which was an important primary/foreign key in the transaction data. In dimensional schemas like this, a transactional key column that carries over is referred to as a "natural key". Its value is self-evident here, and we can index it to make sure those joins go quickly:


In [116]:
%%sql
CREATE INDEX idx_location_id ON d_location (location_id)


Done.
Out[116]:
[]

One more note on terminology: that location_key primary key we just created on the d_location dimension is a new addition - it didn't come from the source transaction data. In dimensional design, it's referred to as a "surrogate key". We make this distinction because we might have information about a location that changes over time (the "slowly changing dimensions" problem). There is a lot to read about this in the Star Schema book. In this case, if a location's data changes we can have more than one row referring to the same location_id; each would get its own unique location_key, so we can refer to each version of the source location record.

There are two canonical examples of modeling this kind of change over time: a birthdate correction (you got it wrong, and need to fix it, so there's one truth) and a change-of-address update (both addresses are correct, just at different times). In the first case, you want one correct birthdate value; in the second, you want both addresses to be present for analysis.
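In our schema, a type-2 style change would mean adding a second row for the same natural key. A hypothetical sketch -- say location 73 moved up to the 1st floor (a fuller design would also add effective-date columns, which our d_location lacks):

-- Keep the old row and add a new one; both share location_id = 73,
-- but each version gets its own surrogate location_key.
INSERT INTO d_location (building_name, building_id, floor, floor_id,
                        zone_name, zone_display_order, zone_id, location_id)
VALUES ('Gelman', 1, 1, 2, '1st Floor', 1, 2, 73);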

Getting back to our model: we've taken care of our physical locations, so now let's pull out a time dimension, by day. Here we spell out every useful variant of a date's attributes up front. Our goal is to make analysis easy, so we break these pieces apart now and never have to worry about getting them right later.


In [69]:
%%sql
DROP TABLE IF EXISTS d_day;
CREATE TABLE d_day (
  day_key SERIAL NOT NULL PRIMARY KEY,
  full_date DATE NOT NULL,
  day SMALLINT NOT NULL,
  day_of_week_number SMALLINT NOT NULL,
  day_of_week_name CHAR(9) NOT NULL,
  day_of_week_abbr CHAR(3) NOT NULL,
  day_of_year SMALLINT NOT NULL,
  week_of_month SMALLINT NOT NULL,
  week_of_year SMALLINT NOT NULL,
  federal_holiday_flag BOOLEAN DEFAULT FALSE,
  gwu_holiday_flag BOOLEAN DEFAULT FALSE,
  gwu_in_session BOOLEAN DEFAULT FALSE,
  weekday_flag BOOLEAN,
  weekend_flag BOOLEAN,
  month SMALLINT NOT NULL,
  month_name CHAR(9) NOT NULL,
  month_abbr CHAR(3) NOT NULL,
  quarter SMALLINT NOT NULL,
  year SMALLINT NOT NULL
)


Done.
Done.
Out[69]:
[]

Double-checking our source data...


In [12]:
%%sql
SELECT * FROM session LIMIT 5;


5 rows affected.
Out[12]:
id location_id timestamp_start timestamp_end session_type
1 15 2013-07-26 15:41:46.632112 2013-07-26 16:15:02.913629 i
2 15 2013-07-26 16:18:02.941385 2013-07-26 16:29:02.936998 i
3 11 2013-07-26 16:24:01.860098 2013-07-26 16:31:02.809215 i
4 8 2013-07-26 15:41:46.480906 2013-07-26 16:33:02.781183 i
5 8 2013-07-26 16:34:02.762411 2013-07-26 16:40:02.769593 i

Just for convenience, let's pull this data in sorted by date; we'll add an index to make this speedy.


In [54]:
%%sql
CREATE INDEX idx_timestamp_start ON session (timestamp_start);


Done.
Out[54]:
[]

In [70]:
%%sql
DELETE FROM d_day;
INSERT INTO d_day
  (full_date, 
   day, day_of_week_number, day_of_week_name, day_of_week_abbr, day_of_year,
   week_of_month, week_of_year,
   federal_holiday_flag, gwu_holiday_flag, gwu_in_session,
   weekday_flag, weekend_flag,
   month, month_name, month_abbr,
   quarter, year)
SELECT DISTINCT DATE(timestamp_start) AS full_date,
   TO_NUMBER(TO_CHAR(timestamp_start, 'DD'), '99') AS day_of_month,
     TO_NUMBER(TO_CHAR(timestamp_start, 'D'), '9') AS day_of_week_number,
     TO_CHAR(timestamp_start, 'Day') AS day_of_week_name,
     TO_CHAR(timestamp_start, 'Dy') AS day_of_week_abbr,
     TO_NUMBER(TO_CHAR(timestamp_start, 'DDD'), '999') AS day_of_year,
   TO_NUMBER(TO_CHAR(timestamp_start, 'W'), '9') AS week_of_month,
     TO_NUMBER(TO_CHAR(timestamp_start, 'WW'), '99') AS week_of_year,
   FALSE AS federal_holiday_flag, 
     FALSE AS gwu_holiday_flag,
     FALSE AS gwu_in_session,
   TO_NUMBER(TO_CHAR(timestamp_start, 'ID'), '9') <= 5 AS weekday_flag,
     TO_NUMBER(TO_CHAR(timestamp_start, 'ID'), '9') > 5 AS weekend_flag,
   TO_NUMBER(TO_CHAR(timestamp_start, 'MM'), '99') AS month,
     TO_CHAR(timestamp_start, 'Month') AS month_name,
     TO_CHAR(timestamp_start, 'Mon') AS month_abbr,
   TO_NUMBER(TO_CHAR(timestamp_start, 'Q'), '9') AS quarter,
     TO_NUMBER(TO_CHAR(timestamp_start, 'YYYY'), '9999') AS year
FROM session
ORDER BY full_date


0 rows affected.
821 rows affected.
Out[70]:
[]

In [73]:
%%sql
SELECT * FROM d_day LIMIT 20;


20 rows affected.
Out[73]:
day_key full_date day day_of_week_number day_of_week_name day_of_week_abbr day_of_year week_of_month week_of_year federal_holiday_flag gwu_holiday_flag gwu_in_session weekday_flag weekend_flag month month_name month_abbr quarter year
1 2013-07-26 26 6 Friday Fri 207 4 30 False False False True False 7 July Jul 3 2013
2 2013-07-27 27 7 Saturday Sat 208 4 30 False False False False True 7 July Jul 3 2013
3 2013-07-28 28 1 Sunday Sun 209 4 30 False False False False True 7 July Jul 3 2013
4 2013-07-29 29 2 Monday Mon 210 5 30 False False False True False 7 July Jul 3 2013
5 2013-07-30 30 3 Tuesday Tue 211 5 31 False False False True False 7 July Jul 3 2013
6 2013-07-31 31 4 Wednesday Wed 212 5 31 False False False True False 7 July Jul 3 2013
7 2013-08-01 1 5 Thursday Thu 213 1 31 False False False True False 8 August Aug 3 2013
8 2013-08-02 2 6 Friday Fri 214 1 31 False False False True False 8 August Aug 3 2013
9 2013-08-03 3 7 Saturday Sat 215 1 31 False False False False True 8 August Aug 3 2013
10 2013-08-04 4 1 Sunday Sun 216 1 31 False False False False True 8 August Aug 3 2013
11 2013-08-05 5 2 Monday Mon 217 1 31 False False False True False 8 August Aug 3 2013
12 2013-08-06 6 3 Tuesday Tue 218 1 32 False False False True False 8 August Aug 3 2013
13 2013-08-07 7 4 Wednesday Wed 219 1 32 False False False True False 8 August Aug 3 2013
14 2013-08-08 8 5 Thursday Thu 220 2 32 False False False True False 8 August Aug 3 2013
15 2013-08-09 9 6 Friday Fri 221 2 32 False False False True False 8 August Aug 3 2013
16 2013-08-14 14 4 Wednesday Wed 226 2 33 False False False True False 8 August Aug 3 2013
17 2013-08-15 15 5 Thursday Thu 227 3 33 False False False True False 8 August Aug 3 2013
18 2013-08-16 16 6 Friday Fri 228 3 33 False False False True False 8 August Aug 3 2013
19 2013-08-20 20 3 Tuesday Tue 232 3 34 False False False True False 8 August Aug 3 2013
20 2013-08-21 21 4 Wednesday Wed 233 3 34 False False False True False 8 August Aug 3 2013

In [74]:
%%sql
CREATE INDEX idx_full_date ON d_day (full_date)


Done.
Out[74]:
[]

Okay, then, that was fun. Now that we have two dimensions set up, let's add our fact table. We can fill the dimensions' new primary keys (the "surrogate keys") on the fact table with placeholder defaults at first, then go back and update them with lookup queries. This will save us some query complexity.

The main thing about the fact table is that it should feature the data we want to measure (i.e. count, average, compare, or otherwise study); everything else should be a reference to one of the dimensions.


In [91]:
%%sql
DROP TABLE IF EXISTS f_login;
CREATE TABLE f_login (
  login_id SERIAL PRIMARY KEY,
  location_key SMALLINT NOT NULL DEFAULT 15,
  timestamp_start TIMESTAMP NOT NULL,
  day_key_start SMALLINT NOT NULL,
  time_start CHAR(8) NOT NULL,
  timestamp_end TIMESTAMP NOT NULL,
  day_key_end SMALLINT NOT NULL,
  time_end CHAR(8) NOT NULL,
  hour_start SMALLINT NOT NULL,
  hour_end SMALLINT NOT NULL,
  duration_minutes REAL NOT NULL,
  duration_hours REAL NOT NULL,
  session_type CHAR(13) NOT NULL DEFAULT 'Available',
  os CHAR(4) NOT NULL DEFAULT '',
  session_id INT NOT NULL,
  location_id INT NOT NULL
)


Done.
Done.
Out[91]:
[]

In [95]:
%%sql
DELETE FROM f_login;
INSERT INTO f_login (location_key,
                     timestamp_start, day_key_start, time_start,
                     timestamp_end, day_key_end, time_end,
                     hour_start, hour_end,
                     duration_minutes, duration_hours,
                     session_type, os,
                     session_id, location_id
                     )
SELECT 15 AS location_key,
  timestamp_start, 
    1 AS day_key_start,
    TO_CHAR(timestamp_start, 'HH24:MI:SS') AS time_start,
  timestamp_end,
    2 AS day_key_end,
    TO_CHAR(timestamp_end, 'HH24:MI:SS') AS time_end,
  TO_NUMBER(TO_CHAR(timestamp_start, 'HH24'), '99') AS hour_start,
    TO_NUMBER(TO_CHAR(timestamp_end, 'HH24'), '99') AS hour_end,
  EXTRACT(EPOCH FROM (timestamp_end - timestamp_start)) / 60 AS duration_minutes,
    EXTRACT(EPOCH FROM (timestamp_end - timestamp_start)) / 3600 AS duration_hours,
  CASE WHEN session_type='i' THEN 'In use'
       WHEN session_type='n' THEN 'Unavailable' END AS session_type,
    CASE WHEN os IS NULL THEN ''
         ELSE os END, 
  session.id AS session_id,
    session.location_id AS location_id
FROM session, location
WHERE location.id = session.location_id
ORDER BY session_id


573554 rows affected.
573554 rows affected.
Out[95]:
[]

In [96]:
%%sql
SELECT * FROM f_login LIMIT 10


10 rows affected.
Out[96]:
login_id location_key timestamp_start day_key_start time_start timestamp_end day_key_end time_end hour_start hour_end duration_minutes duration_hours session_type os session_id location_id
930965 15 2013-07-26 15:41:46.632112 1 15:41:46 2013-07-26 16:15:02.913629 2 16:15:02 15 16 33.2714 0.554523 In use win7 1 15
930966 15 2013-07-26 16:18:02.941385 1 16:18:02 2013-07-26 16:29:02.936998 2 16:29:02 16 16 10.9999 0.183332 In use win7 2 15
930967 15 2013-07-26 16:24:01.860098 1 16:24:01 2013-07-26 16:31:02.809215 2 16:31:02 16 16 7.01582 0.11693 In use win7 3 11
930968 15 2013-07-26 15:41:46.480906 1 15:41:46 2013-07-26 16:33:02.781183 2 16:33:02 15 16 51.2717 0.854528 In use win7 4 8
930969 15 2013-07-26 16:34:02.762411 1 16:34:02 2013-07-26 16:40:02.769593 2 16:40:02 16 16 6.00012 0.100002 In use win7 5 8
930970 15 2013-07-26 16:53:04.602640 1 16:53:04 2013-07-26 17:03:02.853404 2 17:03:02 16 17 9.97085 0.166181 In use win7 6 15
930971 15 2013-07-26 17:02:02.733525 1 17:02:02 2013-07-26 17:04:02.737419 2 17:04:02 17 17 2.00006 0.0333344 In use win7 7 8
930972 15 2013-07-26 16:36:02.828008 1 16:36:02 2013-07-26 17:08:02.757389 2 17:08:02 16 17 31.9988 0.533314 In use win7 8 11
930973 15 2013-07-26 15:41:46.587518 1 15:41:46 2013-07-26 18:03:02.731687 2 18:03:02 15 18 141.269 2.35448 In use win7 9 13
930974 15 2013-07-26 15:41:46.351422 1 15:41:46 2013-07-26 18:54:02.459601 2 18:54:02 15 18 192.268 3.20447 In use win7 10 2

Now go back and set the key references correctly.


In [105]:
%%sql
UPDATE f_login
SET day_key_start = d_day.day_key
FROM d_day
WHERE DATE(f_login.timestamp_start) = d_day.full_date


573554 rows affected.
Out[105]:
[]

In [106]:
%%sql
UPDATE f_login
SET day_key_end = d_day.day_key
FROM d_day
WHERE DATE(f_login.timestamp_end) = d_day.full_date


573534 rows affected.
Out[106]:
[]

In [117]:
%%sql
UPDATE f_login
SET location_key = d_location.location_key
FROM d_location
WHERE f_login.location_id = d_location.location_id


572575 rows affected.
Out[117]:
[]

In [118]:
%%sql
SELECT * FROM f_login LIMIT 10


10 rows affected.
Out[118]:
login_id location_key timestamp_start day_key_start time_start timestamp_end day_key_end time_end hour_start hour_end duration_minutes duration_hours session_type os session_id location_id
1504011 106 2015-11-18 07:53:07 821 07:53:07 2015-11-18 09:31:30 821 09:31:30 7 9 98.3833 1.63972 In use mac 573846 73
1504012 95 2015-11-16 19:53:28 819 19:53:28 2015-11-18 09:32:48 821 09:32:48 19 9 2259.33 37.6556 In use mac 573847 66
1504013 26 2015-11-18 09:31:50.897823 821 09:31:50 2015-11-18 09:32:14.782618 821 09:32:14 9 9 0.39808 0.00663467 Unavailable win7 573848 79
1504014 133 2015-11-18 09:31:22.378543 821 09:31:22 2015-11-18 09:33:17.535447 821 09:33:17 9 9 1.91928 0.031988 In use win7 573849 37
1504015 85 2015-11-18 08:52:22.274224 821 08:52:22 2015-11-18 09:33:22.277084 821 09:33:22 8 9 41.0 0.683334 In use win7 573850 10
1504016 81 2015-11-18 09:16:51.198746 821 09:16:51 2015-11-18 09:33:45.598483 821 09:33:45 9 9 16.9067 0.281778 In use win7 573851 24
1504017 73 2015-11-18 09:27:56.712761 821 09:27:56 2015-11-18 09:34:01.674449 821 09:34:01 9 9 6.0827 0.101378 In use win7 573852 1
1504018 81 2015-11-18 09:33:45.598483 821 09:33:45 2015-11-18 09:34:26.263571 821 09:34:26 9 9 0.677751 0.0112959 Unavailable win7 573853 24
1504019 62 2015-11-18 09:25:38.174414 821 09:25:38 2015-11-18 09:34:37.175312 821 09:34:37 9 9 8.98335 0.149722 In use win7 573854 26
1504020 89 2015-11-18 09:25:52.306736 821 09:25:52 2015-11-18 09:35:14.819213 821 09:35:14 9 9 9.37521 0.156253 In use win7 573855 27

Brief exploration

Okay then, now that we've got that settled, let's take a look.


In [15]:
%matplotlib inline

In [16]:
%%sql
SELECT AVG(duration_minutes), os, building_name
FROM f_login, d_location
WHERE d_location.location_key = f_login.location_key
AND duration_hours < 12
GROUP BY os, building_name
HAVING os IN ('mac', 'win7')


4 rows affected.
Out[16]:
avg os building_name
42.495822703 win7 Eckles
68.2718437427 mac Gelman
55.3730708337 win7 VSTC
61.1427185188 win7 Gelman

In [17]:
_.bar()


Out[17]:
<Container object of 4 artists>

In [14]:
%%sql
SELECT hour_start, AVG(duration_minutes)
FROM f_login, d_day
WHERE d_day.day_key = f_login.day_key_start
AND duration_hours < 12
GROUP BY hour_start
ORDER BY hour_start


24 rows affected.
Out[14]:
hour_start avg
0 61.5716181477
1 53.8161665148
2 45.9910844912
3 41.5682954684
4 38.8229400904
5 43.5218039582
6 49.1274802832
7 52.3189000911
8 58.5990348551
9 60.1363271633
10 59.8462094435
11 63.065894131
12 58.5644837844
13 64.3247302932
14 63.258076047
15 58.4706792104
16 56.8463585612
17 59.5690319863
18 64.3895210889
19 75.1379761796
20 76.7059574569
21 73.0379374572
22 70.1253375386
23 67.7803680071

In [15]:
_.bar()


Out[15]:
<Container object of 24 artists>

In [9]:
%%sql
SELECT COUNT(*), session_type, os
FROM f_login
WHERE duration_hours > 48
GROUP BY os, session_type
LIMIT 50


5 rows affected.
Out[9]:
count session_type os
928 In use     
79 In use win7
222 Unavailable mac
919 Unavailable win7
1163 In use mac

In [10]:
_.bar()


Out[10]:
<Container object of 5 artists>

In [49]:
%%sql
SELECT os, session_type, ROUND(duration_minutes) AS minutes, COUNT(*)
FROM f_login
WHERE duration_minutes <= 20
AND os = 'win7'
GROUP BY os, session_type, minutes
ORDER BY os, session_type, minutes ASC


44 rows affected.
Out[49]:
os session_type minutes count
win7 In use -30.0 1
win7 In use 0.0 580
win7 In use 1.0 6473
win7 In use 2.0 10799
win7 In use 3.0 17470
win7 In use 4.0 16890
win7 In use 5.0 14663
win7 In use 6.0 11959
win7 In use 7.0 10147
win7 In use 8.0 8513
win7 In use 9.0 7151
win7 In use 10.0 6274
win7 In use 11.0 5631
win7 In use 12.0 5053
win7 In use 13.0 4557
win7 In use 14.0 4293
win7 In use 15.0 3894
win7 In use 16.0 3674
win7 In use 17.0 3548
win7 In use 18.0 3340
win7 In use 19.0 3224
win7 In use 20.0 1542
win7 Unavailable -43.0 1
win7 Unavailable 0.0 70560
win7 Unavailable 1.0 22927
win7 Unavailable 2.0 830
win7 Unavailable 3.0 425
win7 Unavailable 4.0 233
win7 Unavailable 5.0 203
win7 Unavailable 6.0 179
win7 Unavailable 7.0 126
win7 Unavailable 8.0 115
win7 Unavailable 9.0 134
win7 Unavailable 10.0 145
win7 Unavailable 11.0 136
win7 Unavailable 12.0 78
win7 Unavailable 13.0 64
win7 Unavailable 14.0 73
win7 Unavailable 15.0 149
win7 Unavailable 16.0 55
win7 Unavailable 17.0 59
win7 Unavailable 18.0 68
win7 Unavailable 19.0 51
win7 Unavailable 20.0 21

In [50]:
_.bar()


Out[50]:
<Container object of 44 artists>

Fascinating - clearly the graph is dominated by short "Unavailable" sessions. We can go in for a closer look at each of the two session types:


In [17]:
%%sql
SELECT os, session_type, ROUND(duration_minutes) AS minutes, COUNT(*)
FROM f_login
WHERE duration_minutes <= 20
AND os = 'win7'
AND session_type = 'Unavailable'
GROUP BY os, session_type, minutes
ORDER BY os, session_type, minutes ASC


22 rows affected.
Out[17]:
os session_type minutes count
win7 Unavailable -43.0 1
win7 Unavailable 0.0 70560
win7 Unavailable 1.0 22927
win7 Unavailable 2.0 830
win7 Unavailable 3.0 425
win7 Unavailable 4.0 233
win7 Unavailable 5.0 203
win7 Unavailable 6.0 179
win7 Unavailable 7.0 126
win7 Unavailable 8.0 115
win7 Unavailable 9.0 134
win7 Unavailable 10.0 145
win7 Unavailable 11.0 136
win7 Unavailable 12.0 78
win7 Unavailable 13.0 64
win7 Unavailable 14.0 73
win7 Unavailable 15.0 149
win7 Unavailable 16.0 55
win7 Unavailable 17.0 59
win7 Unavailable 18.0 68
win7 Unavailable 19.0 51
win7 Unavailable 20.0 21

In [18]:
_.bar()


Out[18]:
<Container object of 22 artists>

In [43]:
%%sql
SELECT os, session_type, ROUND(duration_minutes) AS minutes, COUNT(*)
FROM f_login
WHERE duration_minutes <= 30
AND os = 'win7'
AND session_type = 'In use'
GROUP BY os, session_type, minutes
ORDER BY os, session_type, minutes ASC


32 rows affected.
Out[43]:
os session_type minutes count
win7 In use -30.0 1
win7 In use 0.0 580
win7 In use 1.0 6473
win7 In use 2.0 10799
win7 In use 3.0 17470
win7 In use 4.0 16890
win7 In use 5.0 14663
win7 In use 6.0 11959
win7 In use 7.0 10147
win7 In use 8.0 8513
win7 In use 9.0 7151
win7 In use 10.0 6274
win7 In use 11.0 5631
win7 In use 12.0 5053
win7 In use 13.0 4557
win7 In use 14.0 4293
win7 In use 15.0 3894
win7 In use 16.0 3674
win7 In use 17.0 3548
win7 In use 18.0 3340
win7 In use 19.0 3224
win7 In use 20.0 3035
win7 In use 21.0 2956
win7 In use 22.0 2805
win7 In use 23.0 2723
win7 In use 24.0 2789
win7 In use 25.0 2546
win7 In use 26.0 2540
win7 In use 27.0 2471
win7 In use 28.0 2311
win7 In use 29.0 2320
win7 In use 30.0 1157

In [44]:
_.bar()


Out[44]:
<Container object of 32 artists>

Let's see if we have the same profile for OS X sessions:


In [45]:
%%sql
SELECT os, session_type, ROUND(duration_minutes) AS minutes, COUNT(*)
FROM f_login
WHERE duration_minutes <= 30
AND os = 'mac'
AND session_type = 'In use'
GROUP BY os, session_type, minutes
ORDER BY os, session_type, minutes ASC


32 rows affected.
Out[45]:
os session_type minutes count
mac In use -2.0 1
mac In use 0.0 1112
mac In use 1.0 1072
mac In use 2.0 1893
mac In use 3.0 2490
mac In use 4.0 2448
mac In use 5.0 2053
mac In use 6.0 1746
mac In use 7.0 1440
mac In use 8.0 1261
mac In use 9.0 1069
mac In use 10.0 974
mac In use 11.0 798
mac In use 12.0 774
mac In use 13.0 613
mac In use 14.0 649
mac In use 15.0 577
mac In use 16.0 787
mac In use 17.0 640
mac In use 18.0 590
mac In use 19.0 524
mac In use 20.0 518
mac In use 21.0 494
mac In use 22.0 490
mac In use 23.0 474
mac In use 24.0 437
mac In use 25.0 445
mac In use 26.0 448
mac In use 27.0 412
mac In use 28.0 414
mac In use 29.0 363
mac In use 30.0 182

In [46]:
_.bar()


Out[46]:
<Container object of 32 artists>

Looks pretty clear that there is a heavy profile of very short sessions on both platforms. Is that true at all locations, or just the Gelman entrance floor?


In [35]:
%%sql
SELECT os, session_type, ROUND(duration_minutes) AS minutes, COUNT(*)
FROM f_login, d_location
WHERE d_location.location_key = f_login.location_key
  AND duration_minutes <= 30
  AND os = 'win7'
  AND session_type = 'In use'
  AND building_name = 'Gelman'
  AND floor = 2
GROUP BY os, session_type, minutes
ORDER BY os, session_type, minutes ASC


31 rows affected.
Out[35]:
os session_type minutes count
win7 In use 0.0 299
win7 In use 1.0 3483
win7 In use 2.0 6635
win7 In use 3.0 11657
win7 In use 4.0 11206
win7 In use 5.0 9283
win7 In use 6.0 7382
win7 In use 7.0 6118
win7 In use 8.0 5071
win7 In use 9.0 4223
win7 In use 10.0 3690
win7 In use 11.0 3215
win7 In use 12.0 2845
win7 In use 13.0 2584
win7 In use 14.0 2387
win7 In use 15.0 2091
win7 In use 16.0 2005
win7 In use 17.0 1934
win7 In use 18.0 1797
win7 In use 19.0 1730
win7 In use 20.0 1598
win7 In use 21.0 1554
win7 In use 22.0 1464
win7 In use 23.0 1391
win7 In use 24.0 1479
win7 In use 25.0 1251
win7 In use 26.0 1277
win7 In use 27.0 1257
win7 In use 28.0 1162
win7 In use 29.0 1196
win7 In use 30.0 572

In [36]:
_.bar()


Out[36]:
<Container object of 31 artists>

In [39]:
%%sql
SELECT os, session_type, ROUND(duration_minutes) AS minutes, COUNT(*)
FROM f_login, d_location
WHERE d_location.location_key = f_login.location_key
  AND duration_minutes <= 30
  AND os = 'win7'
  AND session_type = 'In use'
  AND building_name = 'Gelman'
  AND floor != 2
GROUP BY os, session_type, minutes
ORDER BY os, session_type, minutes ASC


32 rows affected.
Out[39]:
os session_type minutes count
win7 In use -30.0 1
win7 In use 0.0 257
win7 In use 1.0 2751
win7 In use 2.0 3881
win7 In use 3.0 5431
win7 In use 4.0 5349
win7 In use 5.0 5061
win7 In use 6.0 4310
win7 In use 7.0 3782
win7 In use 8.0 3246
win7 In use 9.0 2792
win7 In use 10.0 2424
win7 In use 11.0 2281
win7 In use 12.0 2087
win7 In use 13.0 1883
win7 In use 14.0 1824
win7 In use 15.0 1715
win7 In use 16.0 1582
win7 In use 17.0 1522
win7 In use 18.0 1458
win7 In use 19.0 1437
win7 In use 20.0 1371
win7 In use 21.0 1325
win7 In use 22.0 1267
win7 In use 23.0 1279
win7 In use 24.0 1230
win7 In use 25.0 1237
win7 In use 26.0 1204
win7 In use 27.0 1160
win7 In use 28.0 1082
win7 In use 29.0 1071
win7 In use 30.0 567

In [40]:
_.bar()


Out[40]:
<Container object of 32 artists>

And how about at Eckles?


In [41]:
%%sql
SELECT os, session_type, ROUND(duration_minutes) AS minutes, COUNT(*)
FROM f_login, d_location
WHERE d_location.location_key = f_login.location_key
  AND duration_minutes <= 30
  AND os = 'win7'
  AND session_type = 'In use'
  AND building_name = 'Eckles'
GROUP BY os, session_type, minutes
ORDER BY os, session_type, minutes ASC


31 rows affected.
Out[41]:
os session_type minutes count
win7 In use 0.0 15
win7 In use 1.0 144
win7 In use 2.0 210
win7 In use 3.0 311
win7 In use 4.0 264
win7 In use 5.0 241
win7 In use 6.0 208
win7 In use 7.0 200
win7 In use 8.0 158
win7 In use 9.0 106
win7 In use 10.0 118
win7 In use 11.0 100
win7 In use 12.0 89
win7 In use 13.0 68
win7 In use 14.0 60
win7 In use 15.0 59
win7 In use 16.0 64
win7 In use 17.0 67
win7 In use 18.0 64
win7 In use 19.0 46
win7 In use 20.0 44
win7 In use 21.0 58
win7 In use 22.0 57
win7 In use 23.0 39
win7 In use 24.0 57
win7 In use 25.0 40
win7 In use 26.0 47
win7 In use 27.0 42
win7 In use 28.0 52
win7 In use 29.0 37
win7 In use 30.0 13

In [42]:
_.bar()


Out[42]:
<Container object of 31 artists>

You can imagine a wide variety of other questions you'd want answered here, like "how does this short-session peak change over different hours of the day, or on weekends?"
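For the hour-of-day angle, a sketch might count just the very short sessions by starting hour:

-- How many short "In use" sessions start in each hour of the day?
SELECT hour_start, COUNT(*) AS short_sessions
FROM f_login
WHERE session_type = 'In use'
  AND duration_minutes <= 5
GROUP BY hour_start
ORDER BY hour_start;

The weekday and weekend splits follow: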


In [51]:
%%sql
SELECT session_type, ROUND(duration_minutes) AS minutes, COUNT(*)
FROM f_login, d_day
WHERE d_day.day_key = f_login.day_key_start
  AND duration_minutes <= 30
  AND session_type = 'In use'
  AND weekday_flag = True
GROUP BY session_type, minutes
ORDER BY session_type, minutes ASC


32 rows affected.
Out[51]:
session_type minutes count
In use -2.0 1
In use 0.0 1447
In use 1.0 6326
In use 2.0 11027
In use 3.0 17594
In use 4.0 17107
In use 5.0 14732
In use 6.0 12087
In use 7.0 10173
In use 8.0 8599
In use 9.0 7200
In use 10.0 6331
In use 11.0 5655
In use 12.0 5100
In use 13.0 4551
In use 14.0 4313
In use 15.0 3903
In use 16.0 3956
In use 17.0 3673
In use 18.0 3476
In use 19.0 3334
In use 20.0 3133
In use 21.0 3077
In use 22.0 2893
In use 23.0 2824
In use 24.0 2855
In use 25.0 2632
In use 26.0 2624
In use 27.0 2528
In use 28.0 2447
In use 29.0 2378
In use 30.0 1182

In [52]:
_.bar()


Out[52]:
<Container object of 32 artists>

In [53]:
%%sql
SELECT session_type, ROUND(duration_minutes) AS minutes, COUNT(*)
FROM f_login, d_day
WHERE d_day.day_key = f_login.day_key_start
  AND duration_minutes <= 30
  AND session_type = 'In use'
  AND weekend_flag = True
GROUP BY session_type, minutes
ORDER BY session_type, minutes ASC


32 rows affected.
Out[53]:
session_type minutes count
In use -30.0 1
In use 0.0 245
In use 1.0 1219
In use 2.0 1665
In use 3.0 2366
In use 4.0 2231
In use 5.0 1984
In use 6.0 1618
In use 7.0 1414
In use 8.0 1175
In use 9.0 1020
In use 10.0 917
In use 11.0 774
In use 12.0 727
In use 13.0 619
In use 14.0 629
In use 15.0 568
In use 16.0 505
In use 17.0 515
In use 18.0 454
In use 19.0 414
In use 20.0 420
In use 21.0 373
In use 22.0 402
In use 23.0 373
In use 24.0 371
In use 25.0 359
In use 26.0 364
In use 27.0 355
In use 28.0 278
In use 29.0 305
In use 30.0 157

In [54]:
_.bar()


Out[54]:
<Container object of 32 artists>

Adding more facts

This is all well and good when reviewing just one dataset. It gets much more interesting (and you really need those books) when you add more kinds of transactions into the mix.

We also have, for example, GWorld tap logs from the entrance. They look like this:


In [56]:
!head -20 20150802-entrance.csv | csvlook


|-------+-------------+----------------+--------|
|  Hour | Affiliation | Subaffiliation | Count  |
|-------+-------------+----------------+--------|
|  11   | STAFF       |                | 4      |
|  11   | STUDENT     | LAW SCHOOL     | 1      |
|  11   | STUDENT     |                | 1      |
|  12   | ALUMNI      | CCAS           | 3      |
|  12   | ALUMNI      | GWSB           | 1      |
|  12   | ALUMNI      | LAW SCHOOL     | 1      |
|  12   | ALUMNI      | SEAS           | 1      |
|  12   | FACULTY     |                | 2      |
|  12   | MFA         |                | 1      |
|  12   | STAFF       |                | 5      |
|  12   | STUDENT     | LAW SCHOOL     | 1      |
|  12   | STUDENT     | MED CENTER     | 2      |
|  12   | STUDENT     |                | 54     |
|  12   | VISITOR     | CAMPUS TENANT  | 1      |
|  12   | VISITOR     | FRIENDS        | 1      |
|  13   | ALUMNI      | ESIA           | 1      |
|  13   | ALUMNI      | GWSB           | 3      |
|  13   | ALUMNI      | LAW SCHOOL     | 1      |
|  13   | ALUMNI      | SEAS           | 1      |
|-------+-------------+----------------+--------|

This date, August 2, 2015, was a Sunday, and we were operating on summer hours (12-6).

"Entrances" are a new kind of fact here - we want to be able to measure them. One important and obvious dimension is "day", which we already have in the system. We could define another one for "affiliation", but let's skip that for now, as that data is very straightforward (just two columns).

As we did before, we can load the raw data in one step, then create a new fact table in another.


In [78]:
!csvsql --db postgresql:///logginsdw --insert entrance.csv

In [79]:
%%sql
SELECT * FROM entrance LIMIT 10


10 rows affected.
Out[79]:
hour affiliation subaffiliation count
11 STAFF None 4
11 STUDENT LAW SCHOOL 1
11 STUDENT None 1
12 ALUMNI CCAS 3
12 ALUMNI GWSB 1
12 ALUMNI LAW SCHOOL 1
12 ALUMNI SEAS 1
12 FACULTY None 2
12 MFA None 1
12 STAFF None 5

And now a fact table:


In [58]:
%%sql
DROP TABLE IF EXISTS f_entrance;
CREATE TABLE f_entrance (
  entrance_id SERIAL PRIMARY KEY,
  day_key SMALLINT NOT NULL,
  hour SMALLINT NOT NULL,
  affiliation VARCHAR(255),
  subaffiliation VARCHAR(255),
  count INT NOT NULL
)


Done.
Done.
Out[58]:
[]

...and now we can populate the fact table from the raw data. We only have one day's worth of data here; otherwise we'd do a lookup for each row, like we did before. For now, we can just grab that one day's key:


In [63]:
%%sql
SELECT day_key 
FROM d_day 
WHERE full_date = '2015-08-02'


1 rows affected.
Out[63]:
day_key
713

In [84]:
%%sql
DELETE FROM f_entrance;
INSERT INTO f_entrance
  (day_key, hour, affiliation, subaffiliation, count)
SELECT 713, hour, affiliation, subaffiliation, count
FROM entrance
ORDER BY hour, affiliation, subaffiliation


0 rows affected.
47 rows affected.
Out[84]:
[]
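With more than one day of entrance data, we'd instead look up each row's day_key by joining on the date -- a sketch, assuming (hypothetically) that the raw table carried a full_date column alongside hour:

INSERT INTO f_entrance (day_key, hour, affiliation, subaffiliation, count)
SELECT d.day_key, e.hour, e.affiliation, e.subaffiliation, e.count
FROM entrance e, d_day d
WHERE d.full_date = e.full_date;  -- full_date on entrance is hypothetical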

In [85]:
%%sql
SELECT * FROM f_entrance LIMIT 10


10 rows affected.
Out[85]:
entrance_id day_key hour affiliation subaffiliation count
48 713 11 STAFF None 4
49 713 11 STUDENT LAW SCHOOL 1
50 713 11 STUDENT None 1
51 713 12 ALUMNI CCAS 3
52 713 12 ALUMNI GWSB 1
53 713 12 ALUMNI LAW SCHOOL 1
54 713 12 ALUMNI SEAS 1
55 713 12 FACULTY None 2
56 713 12 MFA None 1
57 713 12 STAFF None 5

In [88]:
%%sql
SELECT hour, SUM(COUNT)
FROM f_entrance
GROUP BY hour
ORDER BY hour


8 rows affected.
Out[88]:
hour sum
11 6
12 73
13 43
14 52
15 42
16 29
17 21
22 3

In [89]:
_.bar()


Out[89]:
<Container object of 8 artists>

In [93]:
%%sql
SELECT hour_start, COUNT(*)
FROM f_login, d_location
WHERE d_location.location_key = f_login.location_key
  AND building_name = 'Gelman'
  AND day_key_start = 713
GROUP BY hour_start
ORDER BY hour_start ASC


24 rows affected.
Out[93]:
hour_start count
0 21
1 10
2 13
3 13
4 15
5 23
6 11
7 13
8 12
9 15
10 25
11 13
12 26
13 28
14 27
15 23
16 29
17 21
18 11
19 13
20 11
21 15
22 11
23 17

Okay, this is weird. Why would there be such a steady number of sessions during hours when the building should have been closed?

Let's go back to looking at their lengths.


In [98]:
%%sql
SELECT hour_start, os, session_type, COUNT(*), ROUND(AVG(duration_minutes))
FROM f_login, d_location
WHERE d_location.location_key = f_login.location_key
  AND building_name = 'Gelman'
  AND day_key_start = 713
GROUP BY hour_start, os, session_type
ORDER BY hour_start ASC


33 rows affected.
Out[98]:
hour_start os session_type count round
0 win7 Unavailable 21 45.0
1 win7 Unavailable 10 38.0
2 win7 Unavailable 13 48.0
3 win7 Unavailable 13 37.0
4 win7 Unavailable 15 15.0
5 win7 Unavailable 23 137.0
6 win7 Unavailable 11 8.0
7 win7 Unavailable 13 23.0
8 win7 Unavailable 12 44.0
9 win7 Unavailable 15 45.0
10 win7 Unavailable 25 29.0
11 win7 Unavailable 13 108.0
12 win7 In use 11 50.0
12 win7 Unavailable 15 26.0
13 mac In use 1 1297.0
13 win7 Unavailable 14 56.0
13 win7 In use 13 78.0
14 win7 Unavailable 19 385.0
14 win7 In use 8 58.0
15 mac In use 2 10165.0
15 win7 Unavailable 8 56.0
15 win7 In use 13 53.0
16 win7 Unavailable 14 7.0
16 win7 In use 15 57.0
17 mac In use 1 1194.0
17 win7 Unavailable 13 43.0
17 win7 In use 7 25.0
18 win7 Unavailable 11 9.0
19 win7 Unavailable 13 11.0
20 win7 Unavailable 11 23.0
21 win7 Unavailable 15 66.0
22 win7 Unavailable 11 43.0
23 win7 Unavailable 17 102.0

Now we're getting somewhere - all of the off-hour sessions are win7 and Unavailable. Clearly this is some kind of background noise involving the VDI machines. Perhaps Matt, Ian, and Dominique will have insight on this. :)

In the meantime, let's focus on the "In use" sessions, which are clearly the ones we care about.


In [99]:
%%sql
SELECT hour_start, os, session_type, COUNT(*), ROUND(AVG(duration_minutes))
FROM f_login, d_location
WHERE d_location.location_key = f_login.location_key
  AND building_name = 'Gelman'
  AND day_key_start = 713
  AND session_type = 'In use'
GROUP BY hour_start, os, session_type
ORDER BY hour_start ASC


9 rows affected.
Out[99]:
hour_start os session_type count round
12 win7 In use 11 50.0
13 win7 In use 13 78.0
13 mac In use 1 1297.0
14 win7 In use 8 58.0
15 mac In use 2 10165.0
15 win7 In use 13 53.0
16 win7 In use 15 57.0
17 win7 In use 7 25.0
17 mac In use 1 1194.0

In [100]:
_.bar()


Out[100]:
<Container object of 9 artists>

Ah, one more issue - a few very long sessions. Let's filter those out:


In [101]:
%%sql
SELECT hour_start, os, session_type, COUNT(*), ROUND(AVG(duration_minutes))
FROM f_login, d_location
WHERE d_location.location_key = f_login.location_key
  AND building_name = 'Gelman'
  AND day_key_start = 713
  AND session_type = 'In use'
  AND duration_minutes < 600
GROUP BY hour_start, os, session_type
ORDER BY hour_start ASC


6 rows affected.
Out[101]:
hour_start os session_type count round
12 win7 In use 11 50.0
13 win7 In use 13 78.0
14 win7 In use 8 58.0
15 win7 In use 13 53.0
16 win7 In use 15 57.0
17 win7 In use 7 25.0

In [102]:
_.bar()


Out[102]:
<Container object of 6 artists>

Now we can start to put some things together. Let's look at each incoming user group by affiliation, and the maximum percentage of the sessions held that each group could account for. This will be a "drill-across" query: we pull data from each of two fact tables, roll each of them up to a dimension granularity that conforms -- day and hour -- and then join the results together.

First, the session data:


In [108]:
%%sql
SELECT hour_start AS hour, COUNT(*) AS total_sessions
FROM f_login, d_location
WHERE d_location.location_key = f_login.location_key
  AND building_name = 'Gelman'
  AND day_key_start = 713
  AND session_type = 'In use'
GROUP BY hour_start
ORDER BY hour_start ASC


6 rows affected.
Out[108]:
hour total_sessions
12 11
13 14
14 8
15 15
16 15
17 8

Next, the group data, which we've already seen:


In [111]:
%%sql
SELECT hour, affiliation, subaffiliation, count AS group_entrances
FROM f_entrance
ORDER BY hour ASC


47 rows affected.
Out[111]:
hour affiliation subaffiliation group_entrances
11 STAFF None 4
11 STUDENT LAW SCHOOL 1
11 STUDENT None 1
12 ALUMNI CCAS 3
12 ALUMNI GWSB 1
12 ALUMNI LAW SCHOOL 1
12 ALUMNI SEAS 1
12 FACULTY None 2
12 MFA None 1
12 STAFF None 5
12 STUDENT LAW SCHOOL 1
12 STUDENT MED CENTER 2
12 STUDENT None 54
12 VISITOR CAMPUS TENANT 1
12 VISITOR FRIENDS 1
13 ALUMNI ESIA 1
13 ALUMNI GWSB 3
13 ALUMNI LAW SCHOOL 1
13 ALUMNI SEAS 1
13 FACULTY None 2
13 STAFF None 3
13 STUDENT MED CENTER 2
13 STUDENT None 29
13 VISITOR FRIENDS 1
14 ALUMNI CPS 1
14 ALUMNI GWSB 2
14 ALUMNI SEAS 2
14 FACULTY None 1
14 MFA None 1
14 STAFF None 3
14 STUDENT None 41
14 VISITOR FRIENDS 1
15 ALUMNI LAW SCHOOL 1
15 ALUMNI SEAS 1
15 STAFF None 3
15 STUDENT LAW SCHOOL 2
15 STUDENT None 35
16 ALUMNI GWSB 1
16 ALUMNI SEAS 1
16 STAFF None 3
16 STUDENT MED CENTER 1
16 STUDENT None 23
17 ALUMNI ESIA 1
17 ALUMNI GWSB 1
17 STAFF None 1
17 STUDENT None 18
22 STAFF None 3

And here we assemble it - both result sets grouped by hour, then joined.


In [125]:
%%sql
SELECT f.hour, f.affiliation, f.subaffiliation, f.count AS group_entrances, computer_sessions_started
FROM f_entrance f
LEFT JOIN
(
    SELECT hour_start AS hour, COUNT(*) AS computer_sessions_started
    FROM f_login, d_location
    WHERE d_location.location_key = f_login.location_key
      AND building_name = 'Gelman'
      AND day_key_start = 713
      AND session_type = 'In use'
    GROUP BY hour_start
) s
    ON s.hour = f.hour
ORDER BY f.hour, affiliation, subaffiliation ASC


47 rows affected.
Out[125]:
hour affiliation subaffiliation group_entrances computer_sessions_started
11 STAFF None 4 None
11 STUDENT LAW SCHOOL 1 None
11 STUDENT None 1 None
12 ALUMNI CCAS 3 11
12 ALUMNI GWSB 1 11
12 ALUMNI LAW SCHOOL 1 11
12 ALUMNI SEAS 1 11
12 FACULTY None 2 11
12 MFA None 1 11
12 STAFF None 5 11
12 STUDENT LAW SCHOOL 1 11
12 STUDENT MED CENTER 2 11
12 STUDENT None 54 11
12 VISITOR CAMPUS TENANT 1 11
12 VISITOR FRIENDS 1 11
13 ALUMNI ESIA 1 14
13 ALUMNI GWSB 3 14
13 ALUMNI LAW SCHOOL 1 14
13 ALUMNI SEAS 1 14
13 FACULTY None 2 14
13 STAFF None 3 14
13 STUDENT MED CENTER 2 14
13 STUDENT None 29 14
13 VISITOR FRIENDS 1 14
14 ALUMNI CPS 1 8
14 ALUMNI GWSB 2 8
14 ALUMNI SEAS 2 8
14 FACULTY None 1 8
14 MFA None 1 8
14 STAFF None 3 8
14 STUDENT None 41 8
14 VISITOR FRIENDS 1 8
15 ALUMNI LAW SCHOOL 1 15
15 ALUMNI SEAS 1 15
15 STAFF None 3 15
15 STUDENT LAW SCHOOL 2 15
15 STUDENT None 35 15
16 ALUMNI GWSB 1 15
16 ALUMNI SEAS 1 15
16 STAFF None 3 15
16 STUDENT MED CENTER 1 15
16 STUDENT None 23 15
17 ALUMNI ESIA 1 8
17 ALUMNI GWSB 1 8
17 STAFF None 1 8
17 STUDENT None 18 8
22 STAFF None 3 None
