End-To-End Example: Data Analysis of iSchool Classes

In this end-to-end example we will perform a data analysis with Python Pandas. We will attempt to answer the following questions:

  • What percentage of the schedule is undergrad (course number below 500)?
  • Which undergrad classes are on Friday, or at 8AM?

Things we will demonstrate:

  • read_html() for basic web scraping
  • Dealing with 5 pages of data
  • Combining multiple DataFrames into one
  • Feature engineering (adding a column to the DataFrame)

The iSchool schedule of classes can be found here: https://ischool.syr.edu/classes


In [22]:
import pandas as pd

# this turns off warning messages
import warnings
warnings.filterwarnings('ignore')

In [23]:
# just figure out how to get the data
website = 'https://ischool.syr.edu/classes/?page=1'
data = pd.read_html(website)
data[0]


Out[23]:
0 1 2 3 4 5 6 7 8
0 GET300 M001 41064 3.0 Enterprise Data Analysis Michael A Leonardo 5:15pm - 8:05pm M Hinds Hall 010
1 GET302 M001 37049 3.0 Global Financial Sys Arch Frank Jr Marullo 5:15pm - 8:00pm M Hinds Hall 021
2 GET365 M001 36973 1.5 Business Value of IT Timothy D. Stedman 5:00pm - 7:50pm Tu Hinds Hall 018
3 GET400 M002 37081 3.0 Global Consulting Challenges Jason Dedrick 10:35am - 1:10pm F Hinds Hall 018
4 GET433 M001 37075 3.0 Multi-tier App. Development P Douglas Taber 8:00am - 9:20am TuTh Hinds Hall 013
5 GET471 M001 41974 1.0 GET Internship Susan Monica Bonzi 12:00am - 12:00am NaN NaN
6 GET487 M800 37046 3.0 Global Tech Michael Fudge 12:00am - 12:00am NaN Online
7 GET487 M801 41859 3.0 Global Tech Paul Brian Gandel 12:00am - 12:00am NaN Online
8 GET602 M001 37050 3.0 Global Financial Sys Arch Frank Jr Marullo 5:15pm - 8:00pm M Hinds Hall 021
9 GET687 M800 37047 3.0 Global Tech Michael Fudge 12:00am - 12:00am NaN Online
10 GET687 M801 41920 3.0 Global Tech Paul Brian Gandel 12:00am - 12:00am NaN Online
11 IDS402 M001 37004 3.0 Idea2Startup Michael A D'Eredita 9:30am - 12:15pm F Hinds Hall 011
12 IDS403 M001 37002 1.0 Startup Sandbox John DuRoss Liddy 10:00am - 12:50pm F Syracuse Technology Garden
13 IDS403 M002 41254 1.0 Startup Sandbox John DuRoss Liddy 1:00pm - 3:50pm F Syracuse Technology Garden
14 IST101 M001 40687 1.0 Freshman Forum Julie Walas Huynh 11:40am - 12:35pm M Hinds Hall 120
15 IST195 M001 36935 3.0 Information Technologies Jeff Rubin 9:30am - 10:25am MW Huntington Beard Crouse Giff
16 IST195 M003 37021 3.0 LAB: Information Technologies Jeff Rubin 9:30am - 10:25am F Hinds Hall 010
17 IST195 M004 37022 3.0 LAB: Information Technologies Jeff Rubin 10:35am - 11:30am F Hinds Hall 010
18 IST195 M005 37023 3.0 LAB: Information Technologies Jeff Rubin 11:40am - 12:35pm F Hinds Hall 010
19 IST195 M006 37024 3.0 LAB: Information Technologies Jeff Rubin 12:45pm - 1:40pm F Hinds Hall 010
20 IST195 M007 37025 3.0 LAB: Information Technologies Jeff Rubin 1:50pm - 2:45pm F Hinds Hall 010
21 IST195 M008 37031 3.0 LAB: Information Technologies Jeff Rubin 2:55pm - 3:50pm F Hinds Hall 010
22 IST195 M002 37107 3.0 LAB: Information Technologies Jeff Rubin 8:25am - 9:20am F Hinds Hall 010
23 IST233 M001 36966 3.0 Intro to Computer Networking David J Molta 12:45pm - 2:05pm MW Hall of Languages 207
24 IST233 M008 37013 3.0 Intro to Computer Networking S Bruce Boardman 5:00pm - 6:20pm TuTh Hinds Hall 027 Hall of Languages 211
25 IST233 M003 37027 3.0 LAB: Intro to Computer Networking David J Molta 10:35am - 11:30am F Hinds Hall 027
26 IST233 M004 37028 3.0 LAB: Intro to Computer Networking David J Molta 11:40am - 12:35pm F Hinds Hall 027
27 IST233 M005 37029 3.0 LAB: Intro to Computer Networking David J Molta 12:45pm - 1:40pm F Hinds Hall 027
28 IST233 M006 37030 3.0 LAB: Intro to Computer Networking David J Molta 1:50pm - 2:45pm F Hinds Hall 027
29 IST233 M007 40344 3.0 LAB: Intro to Computer Networking David J Molta 2:55pm - 3:50pm F Hinds Hall 027
30 IST256 M003 36977 3.0 Appl.Prog.For Information Syst Nick Lyga 5:15pm - 8:05pm M Hinds Hall 117
31 IST256 M001 40238 3.0 Appl.Prog.For Information Syst Michael Fudge Avinash Kadaji Deborah L Nosky... 8:00am - 9:20am WF School of Management 007
32 IST263 M001 36972 3.0 Web Design and Mgmt Christian A Kirkegaard 11:00am - 12:20pm TuTh Hinds Hall 111
33 IST263 M003 37015 3.0 Web Design and Mgmt Joseph E Flateau 8:00am - 9:20am MW Hinds Hall 021
34 IST263 M002 37017 3.0 Web Design and Mgmt Christian A Kirkegaard 8:00am - 9:20am TuTh Hinds Hall 018
35 IST263 M005 37100 3.0 Web Design and Mgmt Andrei V Vieru 11:00am - 12:20pm TuTh Hinds Hall 117
36 IST300 M001 36947 3.0 IT Client Support Practicum Christopher Sean Perrello 12:00am - 12:00am NaN NaN
37 IST300 M002 40688 3.0 Digital Campaign Strat/Analyti Michael Clarke 3:30pm - 4:50pm TuTh Hinds Hall 010
38 IST323 M001 36983 3.0 Intro to Information Security Christopher Croad 2:15pm - 5:05pm W Hinds Hall 027 Hinds Hall 117
39 IST323 M002 36999 3.0 Intro to Information Security Joon S. Park 3:45pm - 5:05pm MW Hinds Hall 021 Hinds Hall 117
40 IST323 M003 37091 3.0 Intro to Information Security David J Molta 2:15pm - 3:35pm MW Hinds Hall 111
41 IST335 M002 36937 3.0 Intro/Info Based Organizations Michael A D'Eredita 12:45pm - 2:05pm WF Hinds Hall 011
42 IST335 M001 36948 3.0 Intro/Info Based Organizations JoAnne Wallingford 5:00pm - 7:50pm Th Hinds Hall 011
43 IST335 M003 36978 3.0 Intro/Info Based Organizations Michael A D'Eredita 5:15pm - 8:05pm W Hall of Languages 115
44 IST335 M004 37011 3.0 Intro/Info Based Organizations Marcene S. Sonneborn 3:30pm - 4:50pm TuTh Hinds Hall 011
45 IST335 M005 37060 3.0 Intro/Info Based Organizations Jeffrey Fouts 9:30am - 10:50am TuTh Hinds Hall 011
46 IST341 M001 40731 3.0 User-Based Design Michael S Nilan 11:00am - 12:20pm TuTh Hinds Hall 013
47 IST346 M001 36928 3.0 Info Tech Mgmt&Administration Ryan Elstad 12:30pm - 1:50pm TuTh Hinds Hall 018 Hinds Hall 027
48 IST346 M002 36952 3.0 Info Tech Mgmt&Administration Timothy A Jorgensen 12:30pm - 1:50pm TuTh Hinds Hall 013
49 IST346 M007 37000 3.0 Info Tech Mgmt&Administration James Powell 2:00pm - 3:20pm TuTh Hinds Hall 010 Hinds Hall 021

In [24]:
# let's generate links to the other pages
website = 'https://ischool.syr.edu/classes/?page='
for i in range(1,6):
    link = website + str(i)
    print(link)


https://ischool.syr.edu/classes/?page=1
https://ischool.syr.edu/classes/?page=2
https://ischool.syr.edu/classes/?page=3
https://ischool.syr.edu/classes/?page=4
https://ischool.syr.edu/classes/?page=5
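
The same links can also be built in one step with a list comprehension. This is only a minimal sketch; the helper name page_urls and its parameters are hypothetical and not part of the original notebook:

```python
def page_urls(base, pages):
    """Return one URL per page number, 1 through pages (hypothetical helper)."""
    return [f"{base}{i}" for i in range(1, pages + 1)]

links = page_urls('https://ischool.syr.edu/classes/?page=', 5)
print(links[0])    # https://ischool.syr.edu/classes/?page=1
print(len(links))  # 5
```

Keeping the page count as a parameter means the loop does not need editing if the schedule grows or shrinks.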

In [25]:
# let's read them all and combine them into a single data frame

website = 'https://ischool.syr.edu/classes/?page='
frames = []  # one DataFrame per page

for i in range(1,6):
    link = website + str(i)
    data = pd.read_html(link)
    frames.append(data[0])

# pd.concat() replaces DataFrame.append(), which was removed in pandas 2.0
classes = pd.concat(frames, ignore_index=True)
classes.sample(5)


Out[25]:
0 1 2 3 4 5 6 7 8
162 IST659 M400 41928 3.0 Data Admin Concepts & Db Mgmt Chad Aaron Harper 12:00am - 8:30pm M Online Online
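
Collecting the pages in a list and calling pd.concat() once is the usual way to stack DataFrames on current pandas, where DataFrame.append() no longer exists (it was removed in pandas 2.0). Here is the pattern in isolation, with two made-up frames standing in for the scraped pages:

```python
import pandas as pd

# two tiny stand-ins for the tables returned by read_html (made-up rows)
page1 = pd.DataFrame([['IST101', 'M001'], ['IST195', 'M001']])
page2 = pd.DataFrame([['IST233', 'M001'], ['IST256', 'M003']])

# collect the frames in a list, then concatenate once at the end
frames = [page1, page2]
combined = pd.concat(frames, ignore_index=True)

print(combined.shape)        # (4, 2)
print(list(combined.index))  # [0, 1, 2, 3]
```

ignore_index=True renumbers the rows 0 through n-1; without it each page would keep its own 0-based row labels and the labels would repeat.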

In [26]:
## let's set the columns

website = 'https://ischool.syr.edu/classes/?page='
frames = []  # one DataFrame per page

for i in range(1,6):
    link = website + str(i)
    data = pd.read_html(link)
    frames.append(data[0])

# pd.concat() replaces DataFrame.append(), which was removed in pandas 2.0
classes = pd.concat(frames, ignore_index=True)
classes.columns = ['Course','Section','ClassNo','Credits','Title','Instructor','Time','Days','Room']

classes.sample(5)


Out[26]:
Course Section ClassNo Credits Title Instructor Time Days Room
205 IST840 M801 41093 1.0 Practicum in Teaching Jennifer Stromer-Galley 12:00am - 12:00am NaN Online
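
The scraped columns come back numbered 0 through 8, presumably because the page's table has no header row. Assigning a list to .columns renames them all at once, and the list must supply exactly one name per column. A sketch on a made-up two-column frame:

```python
import pandas as pd

df = pd.DataFrame([['IST101', 'M001'], ['IST195', 'M001']])
print(list(df.columns))   # [0, 1]

# one name per existing column, in order
df.columns = ['Course', 'Section']
print(list(df.columns))   # ['Course', 'Section']

# a list of the wrong length raises ValueError
try:
    df.columns = ['Course']
except ValueError as err:
    print('ValueError:', err)
```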

In [31]:
## this is good stuff. Let's make a function out of it for simplicity

def get_ischool_classes():
    website = 'https://ischool.syr.edu/classes/?page='
    frames = []  # one DataFrame per page

    for i in range(1,6):
        link = website + str(i)
        data = pd.read_html(link)
        frames.append(data[0])

    # pd.concat() replaces DataFrame.append(), which was removed in pandas 2.0
    classes = pd.concat(frames, ignore_index=True)
    classes.columns = ['Course','Section','ClassNo','Credits','Title','Instructor','Time','Days','Room']

    return classes

# main program 
classes = get_ischool_classes()

In [32]:
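# undergrad classes are numbered 0-499 and grad classes 500 and up, but we
# don't have a course number column, so we must engineer one.

# only the last expression in a cell is displayed, so the output shows the numbers
classes['Course'].str[0:3].sample(5)  # subject, e.g. IST
classes['Course'].str[3:].sample(5)   # number, e.g. 335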


Out[32]:
42     335
151    646
202    810
41     335
177    690
Name: Course, dtype: object

In [33]:
# make the subject and number columns
classes['Subject'] = classes['Course'].str[0:3]
classes['Number'] = classes['Course'].str[3:]
classes.sample(5)


Out[33]:
Course Section ClassNo Credits Title Instructor Time Days Room Subject Number
33 IST263 M003 37015 3.0 Web Design and Mgmt Joseph E Flateau 8:00am - 9:20am MW Hinds Hall 021 IST 263
164 IST661 M800 36997 3.0 Managing a School Library Susan Kowalski 12:00am - 12:00am NaN Online IST 661
171 IST687 M006 37118 3.0 LAB: Applied Data Science Erik Scott Anderson 3:30pm - 4:50pm Th Hinds Hall 021 IST 687
64 IST400 M002 37088 3.0 Enterprise IT Consultation Michelle L. Kaarst-Brown 2:00pm - 4:45pm Th Hinds Hall 111 IST 400
11 IDS402 M001 37004 3.0 Idea2Startup Michael A D'Eredita 9:30am - 12:15pm F Hinds Hall 011 IDS 402
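
The .str accessor applies the slice to every value in the column, and the result is still text. The sketch below uses a few course codes from the output above and also converts the number to an integer with pd.to_numeric(), which is a step this notebook does not take:

```python
import pandas as pd

courses = pd.Series(['IST263', 'IST661', 'IDS402'])

subject = courses.str[0:3]          # first three characters
number = courses.str[3:]            # everything after that, still text
number_int = pd.to_numeric(number)  # now an integer column

print(list(subject))      # ['IST', 'IST', 'IDS']
print(list(number))       # ['263', '661', '402']
print(list(number_int))   # [263, 661, 402]
```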

In [36]:
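# and finally we can create the column we need!
# Number is still text; comparing against '500' works as long as every number has three digits
classes['Type'] = ''
classes.loc[classes['Number'] < '500', 'Type'] = 'UGrad'
classes.loc[classes['Number'] >= '500', 'Type'] = 'Grad'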

classes.sample(5)


Out[36]:
Course Section ClassNo Credits Title Instructor Time Days Room Subject Number UG Type
113 IST600 M800 37113 1.0 IT Auditing Thomas J Wood 12:00am - 12:00am NaN Online IST 600 N Grad
1 GET302 M001 37049 3.0 Global Financial Sys Arch Frank Jr Marullo 5:15pm - 8:00pm M Hinds Hall 021 GET 302 Y UGrad
182 IST718 M801 37082 3.0 Advanced Information Analytics Gary E Krudys 12:00am - 12:00am NaN Online IST 718 N Grad
193 IST754 M001 37065 3.0 Telecom Final Project Lee H Badman 9:30am - 12:15pm W Hinds Hall 117 IST 754 N Grad
69 IST425 M001 36950 3.0 Enterprise Risk Management Michael Larche 5:15pm - 6:35pm MW Slocum 104 IST 425 Y UGrad
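
One caveat about the comparison: Number holds text, and text is compared character by character rather than by value. That is fine while every course number has three digits, but it would mislabel a shorter number. The sketch below shows the difference and labels rows using integers instead; the course code 'IST95' is made up purely for illustration:

```python
import pandas as pd

# 'IST95' is a made-up two-digit course, included only to show the pitfall
df = pd.DataFrame({'Course': ['IST263', 'IST661', 'IST95']})
df['Number'] = df['Course'].str[3:]

# text comparison goes character by character, so '95' sorts after '500'
print('95' < '500')   # False

# converting to integers compares by value instead
as_int = df['Number'].astype(int)
df['Type'] = 'Grad'
df.loc[as_int < 500, 'Type'] = 'UGrad'

print(list(df['Type']))   # ['UGrad', 'Grad', 'UGrad']
```

Assigning through .loc with a boolean mask writes to the DataFrame itself, which avoids the ambiguity of chaining two [] lookups.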

In [44]:
# the entire program to retrieve the data and setup the columns looks like this:

# main program 
classes = get_ischool_classes()
classes['Subject'] = classes['Course'].str[0:3]
classes['Number'] = classes['Course'].str[3:]
classes['Type'] = ''
classes.loc[classes['Number'] < '500', 'Type'] = 'UGrad'
classes.loc[classes['Number'] >= '500', 'Type'] = 'Grad'

In [45]:
# let's find the number of grad / undergrad courses
classes['Type'].value_counts()

# more grad classes than undergrad


Out[45]:
Grad     116
UGrad     95
Name: Type, dtype: int64
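
To express these counts as the percentage asked for at the start, value_counts() accepts normalize=True, which returns fractions instead of counts. The sketch below rebuilds a Type column from the counts shown above rather than scraping the site again:

```python
import pandas as pd

# rebuild the Type column from the counts shown above (116 Grad, 95 UGrad)
types = pd.Series(['Grad'] * 116 + ['UGrad'] * 95, name='Type')

# normalize=True gives each value's share of the total; scale to percent
pct = types.value_counts(normalize=True) * 100

print(round(float(pct['UGrad']), 1))   # 45.0
print(round(float(pct['Grad']), 1))    # 55.0
```

So roughly 45% of the 211 scheduled sections are undergrad.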

In [46]:
# which undergrad classes meet on a Friday?
friday = classes[ (classes['Type'] == 'UGrad') & (classes['Days'].str.find('F')>=0 ) ]
friday


Out[46]:
Course Section ClassNo Credits Title Instructor Time Days Room Subject Number Type
3 GET400 M002 37081 3.0 Global Consulting Challenges Jason Dedrick 10:35am - 1:10pm F Hinds Hall 018 GET 400 UGrad
11 IDS402 M001 37004 3.0 Idea2Startup Michael A D'Eredita 9:30am - 12:15pm F Hinds Hall 011 IDS 402 UGrad
12 IDS403 M001 37002 1.0 Startup Sandbox John DuRoss Liddy 10:00am - 12:50pm F Syracuse Technology Garden IDS 403 UGrad
13 IDS403 M002 41254 1.0 Startup Sandbox John DuRoss Liddy 1:00pm - 3:50pm F Syracuse Technology Garden IDS 403 UGrad
16 IST195 M003 37021 3.0 LAB: Information Technologies Jeff Rubin 9:30am - 10:25am F Hinds Hall 010 IST 195 UGrad
17 IST195 M004 37022 3.0 LAB: Information Technologies Jeff Rubin 10:35am - 11:30am F Hinds Hall 010 IST 195 UGrad
18 IST195 M005 37023 3.0 LAB: Information Technologies Jeff Rubin 11:40am - 12:35pm F Hinds Hall 010 IST 195 UGrad
19 IST195 M006 37024 3.0 LAB: Information Technologies Jeff Rubin 12:45pm - 1:40pm F Hinds Hall 010 IST 195 UGrad
20 IST195 M007 37025 3.0 LAB: Information Technologies Jeff Rubin 1:50pm - 2:45pm F Hinds Hall 010 IST 195 UGrad
21 IST195 M008 37031 3.0 LAB: Information Technologies Jeff Rubin 2:55pm - 3:50pm F Hinds Hall 010 IST 195 UGrad
22 IST195 M002 37107 3.0 LAB: Information Technologies Jeff Rubin 8:25am - 9:20am F Hinds Hall 010 IST 195 UGrad
25 IST233 M003 37027 3.0 LAB: Intro to Computer Networking David J Molta 10:35am - 11:30am F Hinds Hall 027 IST 233 UGrad
26 IST233 M004 37028 3.0 LAB: Intro to Computer Networking David J Molta 11:40am - 12:35pm F Hinds Hall 027 IST 233 UGrad
27 IST233 M005 37029 3.0 LAB: Intro to Computer Networking David J Molta 12:45pm - 1:40pm F Hinds Hall 027 IST 233 UGrad
28 IST233 M006 37030 3.0 LAB: Intro to Computer Networking David J Molta 1:50pm - 2:45pm F Hinds Hall 027 IST 233 UGrad
29 IST233 M007 40344 3.0 LAB: Intro to Computer Networking David J Molta 2:55pm - 3:50pm F Hinds Hall 027 IST 233 UGrad
31 IST256 M001 40238 3.0 Appl.Prog.For Information Syst Michael Fudge Avinash Kadaji Deborah L Nosky... 8:00am - 9:20am WF School of Management 007 IST 256 UGrad
41 IST335 M002 36937 3.0 Intro/Info Based Organizations Michael A D'Eredita 12:45pm - 2:05pm WF Hinds Hall 011 IST 335 UGrad

In [47]:
# let's get rid of those pesky LAB sections!!!
# which undergrad Friday classes are not labs?
friday_no_lab = friday[ ~friday['Title'].str.startswith('LAB:')]
friday_no_lab


Out[47]:
Course Section ClassNo Credits Title Instructor Time Days Room Subject Number Type
3 GET400 M002 37081 3.0 Global Consulting Challenges Jason Dedrick 10:35am - 1:10pm F Hinds Hall 018 GET 400 UGrad
11 IDS402 M001 37004 3.0 Idea2Startup Michael A D'Eredita 9:30am - 12:15pm F Hinds Hall 011 IDS 402 UGrad
12 IDS403 M001 37002 1.0 Startup Sandbox John DuRoss Liddy 10:00am - 12:50pm F Syracuse Technology Garden IDS 403 UGrad
13 IDS403 M002 41254 1.0 Startup Sandbox John DuRoss Liddy 1:00pm - 3:50pm F Syracuse Technology Garden IDS 403 UGrad
31 IST256 M001 40238 3.0 Appl.Prog.For Information Syst Michael Fudge Avinash Kadaji Deborah L Nosky... 8:00am - 9:20am WF School of Management 007 IST 256 UGrad
41 IST335 M002 36937 3.0 Intro/Info Based Organizations Michael A D'Eredita 12:45pm - 2:05pm WF Hinds Hall 011 IST 335 UGrad

In [48]:
# Looking for more classes to avoid? How about 8AM classes?
eight_am = classes[ classes['Time'].str.startswith('8:00am')]
eight_am


Out[48]:
Course Section ClassNo Credits Title Instructor Time Days Room Subject Number Type
4 GET433 M001 37075 3.0 Multi-tier App. Development P Douglas Taber 8:00am - 9:20am TuTh Hinds Hall 013 GET 433 UGrad
31 IST256 M001 40238 3.0 Appl.Prog.For Information Syst Michael Fudge Avinash Kadaji Deborah L Nosky... 8:00am - 9:20am WF School of Management 007 IST 256 UGrad
33 IST263 M003 37015 3.0 Web Design and Mgmt Joseph E Flateau 8:00am - 9:20am MW Hinds Hall 021 IST 263 UGrad
34 IST263 M002 37017 3.0 Web Design and Mgmt Christian A Kirkegaard 8:00am - 9:20am TuTh Hinds Hall 018 IST 263 UGrad
54 IST352 M005 37068 3.0 Info Analysis of Org. Systems Alexander Corsello 8:00am - 9:20am MW Hinds Hall 111 IST 352 UGrad
55 IST352 M006 37069 3.0 Info Analysis of Org. Systems Alexander Corsello 8:00am - 9:20am TuTh Hinds Hall 117 IST 352 UGrad
58 IST359 M003 36986 3.0 Intro to Data Base Mgmt Systs Deborah L Nosky 8:00am - 9:20am TuTh Hinds Hall 010 Hinds Hall 111 IST 359 UGrad
