8 (a) We placed the College.csv file in the Datasets directory. Let us access this file.



In [6]:

    
college = read.csv("Datasets/College.csv", stringsAsFactors = TRUE) #stringsAsFactors = TRUE keeps Private as a factor in R >= 4.0
head(college) #Use fix(college) in RStudio to display in internal editor









    





X                             Private  Apps  Accept  Enroll  Top10perc  Top25perc  F.Undergrad  P.Undergrad  Outstate  Room.Board  Books  Personal  PhD  Terminal  S.F.Ratio  perc.alumni  Expend  Grad.Rate
Abilene Christian University  Yes      1660    1232     721         23         52         2885          537      7440        3300    450      2200   70        78       18.1           12    7041         60
Adelphi University            Yes      2186    1924     512         16         29         2683         1227     12280        6450    750      1500   29        30       12.2           16   10527         56
Adrian College                Yes      1428    1097     336         22         50         1036           99     11250        3750    400      1165   53        66       12.9           30    8735         54
Agnes Scott College           Yes       417     349     137         60         89          510           63     12960        5450    450       875   92        97        7.7           37   19016         59
Alaska Pacific University     Yes       193     146      55         16         44          249          869      7560        4120    800      1500   76        72       11.9            2   10922         15
Albertson College             Yes       587     479     158         38         62          678           41     13500        3335    500       675   67        73        9.4           11    9727         55

I used the head() function to display only the first few rows of the dataset. In RStudio, fix(college) would instead open the data in the internal editor.

NOTE: Not all columns may be visible in print, so let us list the fields in the dataset for reference.



In [9]:

    
names(college)









    





	'X'
	'Private'
	'Apps'
	'Accept'
	'Enroll'
	'Top10perc'
	'Top25perc'
	'F.Undergrad'
	'P.Undergrad'
	'Outstate'
	'Room.Board'
	'Books'
	'Personal'
	'PhD'
	'Terminal'
	'S.F.Ratio'
	'perc.alumni'
	'Expend'
	'Grad.Rate'

8 (b) In the table, we do not want the college name to appear as part of the data. However, this information may be useful later on, so we store the names as row names instead. Let us first check the current row names of this table.



In [7]:

    
rownames(college)[1:10] #Display the first 10 row names









    





	'1'
	'2'
	'3'
	'4'
	'5'
	'6'
	'7'
	'8'
	'9'
	'10'

rownames() gives us the implicit row names of the table. Notice that these numbers are not displayed in the table shown in 8 (a). Let us replace them with the entries of the first column.



In [8]:

    
rownames(college) = college[ , 1]
head(college)









    





                              X                             Private  Apps  Accept  Enroll  Top10perc  Top25perc  F.Undergrad  P.Undergrad  Outstate  Room.Board  Books  Personal  PhD  Terminal  S.F.Ratio  perc.alumni  Expend  Grad.Rate
Abilene Christian University  Abilene Christian University  Yes      1660    1232     721         23         52         2885          537      7440        3300    450      2200   70        78       18.1           12    7041         60
Adelphi University            Adelphi University            Yes      2186    1924     512         16         29         2683         1227     12280        6450    750      1500   29        30       12.2           16   10527         56
Adrian College                Adrian College                Yes      1428    1097     336         22         50         1036           99     11250        3750    400      1165   53        66       12.9           30    8735         54
Agnes Scott College           Agnes Scott College           Yes       417     349     137         60         89          510           63     12960        5450    450       875   92        97        7.7           37   19016         59
Alaska Pacific University     Alaska Pacific University     Yes       193     146      55         16         44          249          869      7560        4120    800      1500   76        72       11.9            2   10922         15
Albertson College             Albertson College             Yes       587     479     158         38         62          678           41     13500        3335    500       675   67        73        9.4           11    9727         55

Perfect! Each college name now appears twice: once as the row name and once in column X. If the above table were viewed in RStudio, we would see the column name row.names above the bold college names. We now remove the redundant column X from the table.



In [10]:

    
college = college[ , -1] #Exclude first column
head(college)









    





                              Private  Apps  Accept  Enroll  Top10perc  Top25perc  F.Undergrad  P.Undergrad  Outstate  Room.Board  Books  Personal  PhD  Terminal  S.F.Ratio  perc.alumni  Expend  Grad.Rate
Abilene Christian University  Yes      1660    1232     721         23         52         2885          537      7440        3300    450      2200   70        78       18.1           12    7041         60
Adelphi University            Yes      2186    1924     512         16         29         2683         1227     12280        6450    750      1500   29        30       12.2           16   10527         56
Adrian College                Yes      1428    1097     336         22         50         1036           99     11250        3750    400      1165   53        66       12.9           30    8735         54
Agnes Scott College           Yes       417     349     137         60         89          510           63     12960        5450    450       875   92        97        7.7           37   19016         59
Alaska Pacific University     Yes       193     146      55         16         44          249          869      7560        4120    800      1500   76        72       11.9            2   10922         15
Albertson College             Yes       587     479     158         38         62          678           41     13500        3335    500       675   67        73        9.4           11    9727         55

Note that the first column, shown in bold and without a header, is called row.names. It is not a data column but rather the name that R gives to each row.
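As a possible shortcut for 8 (b), read.csv() accepts a row.names argument, so the first column can be turned into row names at load time. The sketch below is only an illustration: it writes a tiny temporary CSV (a hypothetical stand-in for Datasets/College.csv, holding two rows and two fields from the output above) and reads it back.

```r
# Hypothetical two-row stand-in for Datasets/College.csv (not the real file)
tmp = tempfile(fileext = ".csv")
writeLines(c('"","Private","Apps"',
             '"Abilene Christian University","Yes",1660',
             '"Adelphi University","Yes",2186'), tmp)

# row.names = 1 uses the first column as row names and drops it from the data
demo = read.csv(tmp, row.names = 1, stringsAsFactors = TRUE)
print(rownames(demo))
print(names(demo))
```

With this approach there is no X column to remove afterwards, so re-running the cell cannot accidentally drop a data column.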

8 (c) i. Let us now produce a numerical summary of the dataset.



In [11]:

    
summary(college)









    





 Private        Apps           Accept          Enroll       Top10perc    
 No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
 Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
           Median : 1558   Median : 1110   Median : 434   Median :23.00  
           Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
           3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
           Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
   Top25perc      F.Undergrad     P.Undergrad         Outstate    
 Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
 1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
 Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
 Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
 3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
 Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
   Room.Board       Books           Personal         PhD        
 Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
 1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
 Median :4200   Median : 500.0   Median :1200   Median : 75.00  
 Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
 3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
 Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
    Terminal       S.F.Ratio      perc.alumni        Expend     
 Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
 1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
 Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
 Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
 3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
 Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
   Grad.Rate     
 Min.   : 10.00  
 1st Qu.: 53.00  
 Median : 65.00  
 Mean   : 65.46  
 3rd Qu.: 78.00  
 Max.   :118.00

In the summary above, it is easy to tell the qualitative (categorical) variables from the quantitative (numeric) ones. The first column, Private, takes only 2 values, Yes or No, and is hence categorical, so summary() reports a count for each level. Every other field is quantitative and is summarized by its minimum, 1st quartile, median, mean, 3rd quartile and maximum.
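The same split can be checked programmatically rather than by eye. The following is a minimal sketch on a toy data frame (the name demo is hypothetical, and the three rows are copied from the head() output above); on the real data one would pass college instead.

```r
# Toy data frame with the same kinds of columns as college
demo = data.frame(Private = factor(c("Yes", "Yes", "Yes")),
                  Apps = c(1660, 2186, 1428),
                  S.F.Ratio = c(18.1, 12.2, 12.9))

# is.numeric() is TRUE for quantitative columns and FALSE for factors
is_num = sapply(demo, is.numeric)
print(names(demo)[!is_num])  # qualitative columns
print(names(demo)[is_num])   # quantitative columns
```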

8 (c) ii. We now display the relationship between the first 10 columns using a scatterplot matrix.



In [12]:

    
pairs(college[,1:10])

8 (c) iii. Let us produce side-by-side boxplots of Outstate versus Private.



In [13]:

    
plot(college$Private, college$Outstate, xlab="Public/Private Indicator", ylab="Out of State Tuition($)", main="Boxplot of Outstate Vs. Private")
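plot() draws boxplots here only because its first argument is a factor. An equivalent and more explicit form is boxplot() with a formula. The sketch below uses a hypothetical toy data frame (the tuition values are invented for illustration, not taken from College.csv) and draws to a null device so that it also runs without a display.

```r
pdf(NULL)  # null graphics device: nothing is written to disk

# Hypothetical toy values; with the real data use data = college
toy = data.frame(Private = factor(c("No", "No", "No", "Yes", "Yes", "Yes")),
                 Outstate = c(6000, 7000, 8000, 11000, 12000, 13000))

# The formula reads "Outstate grouped by Private"
b = boxplot(Outstate ~ Private, data = toy,
            xlab = "Public/Private Indicator",
            ylab = "Out of State Tuition ($)",
            main = "Boxplot of Outstate Vs. Private")
invisible(dev.off())

print(b$names)       # group labels, one per box
print(b$stats[3, ])  # median of each group
```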

8 (c) iv. Let us add a new categorical field called Elite, which takes one of 2 values depending on Top10perc:

Yes: when the proportion of students of a given college coming from the top 10% exceeds 50%.
No: when the proportion of students of a given college coming from the top 10% is 50% or less.

This is implemented by initializing this field for every college as No, then applying the condition.



In [14]:

    
Elite = rep("No", nrow(college)) #Initialize all entries of Elite to 'No'
Elite[college$Top10perc > 50] = "Yes" #If Top10perc > 50, assign field as 'Yes'
Elite = as.factor(Elite)
college = data.frame(college, Elite)



In [15]:

    
head(college)









    





                              Private  Apps  Accept  Enroll  Top10perc  Top25perc  F.Undergrad  P.Undergrad  Outstate  Room.Board  Books  Personal  PhD  Terminal  S.F.Ratio  perc.alumni  Expend  Grad.Rate  Elite
Abilene Christian University  Yes      1660    1232     721         23         52         2885          537      7440        3300    450      2200   70        78       18.1           12    7041         60  No
Adelphi University            Yes      2186    1924     512         16         29         2683         1227     12280        6450    750      1500   29        30       12.2           16   10527         56  No
Adrian College                Yes      1428    1097     336         22         50         1036           99     11250        3750    400      1165   53        66       12.9           30    8735         54  No
Agnes Scott College           Yes       417     349     137         60         89          510           63     12960        5450    450       875   92        97        7.7           37   19016         59  Yes
Alaska Pacific University     Yes       193     146      55         16         44          249          869      7560        4120    800      1500   76        72       11.9            2   10922         15  No
Albertson College             Yes       587     479     158         38         62          678           41     13500        3335    500       675   67        73        9.4           11    9727         55  No

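NOTE 2: If the code above is executed several times, data.frame() appends another copy of the field each time (Elite, Elite.1, Elite.2, ...). I once ended up with 4 of them. The extra copies can be deleted by dropping the trailing columns, e.g. the last three with: college = college[ , -((ncol(college)-2):ncol(college))]. The parentheses around ncol(college)-2 matter because : binds more tightly than the binary minus; the unparenthesized form ncol(college)-4:ncol(college) selects a different set of columns.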
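One way to avoid the duplicate columns altogether is to assign the field by name, which overwrites any existing Elite column instead of appending a new one. This is a sketch of an alternative, not the approach used in the cell above; demo is a hypothetical stand-in for college, with Top10perc values copied from the head() output.

```r
# Toy stand-in for college
demo = data.frame(Top10perc = c(23, 16, 22, 60, 16, 38))

# Simulate running the cell three times: the column is replaced, not duplicated
for (run in 1:3) {
    demo$Elite = factor(ifelse(demo$Top10perc > 50, "Yes", "No"),
                        levels = c("No", "Yes"))
}

print(names(demo))        # still a single Elite column
print(table(demo$Elite))  # counts per level
```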

Now that the Elite field has been appended as the last column, let us see how many such Elite universities exist.



In [16]:

    
summary(college$Elite)

So, of our 777 colleges, 78 are Elite. The boxplot below shows out-of-state tuition for elite and non-elite universities.



In [17]:

    
plot(college$Elite, college$Outstate, ylab="Out of State Tuition ($)", xlab="Is the University Elite?",main="Elite Vs. Outstate")

8 (c) v. We will create histograms of quantitative variables, varying the number of bins requested through the breaks argument from 1 to 20. To plot several of these histograms on one page, we use par(). Let us start with Out of State Tuition.



In [18]:

    
par(mfrow=c(2,2))
for (numBins in 1:20){
    hist(college$Outstate, breaks=numBins, xlab="OutState", ylab="Freq", main=paste("Bins = ",numBins))
}

Interesting Observation: Even though breaks is iterated from 1 to 20 for the field Outstate, many of the resulting histograms are identical. For numBins between 7 and 13, the data are broken down into the same 10 bins. Similarly, the histograms generated for numBins from 14 to 20 are all the same 20 bin histogram, and values from 3 to 6 all generate a 5 bin histogram. This happens because a single number passed to breaks is only a suggestion: hist() rounds the break points to "pretty" values.
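The grouping of bin counts can be reproduced without drawing anything. To my understanding, hist() turns a single-number breaks into break points with pretty(range(x), n = breaks, min.n = 1); the sketch below makes that assumption and applies it to the minimum and maximum of Outstate reported by summary(college).

```r
# Min. and Max. of Outstate, taken from the summary(college) output
outstate_range = c(2340, 21700)

# Number of bins that pretty() would produce for each requested value
bins = sapply(1:20, function(n) {
    length(pretty(outstate_range, n = n, min.n = 1)) - 1
})
print(bins)
```

Runs of equal values in the printed vector correspond to runs of identical histograms. To force an exact number of bins, one can pass an explicit vector of break points to breaks instead of a single number.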

Let us check the same for another quantitative field: Book Cost.



In [19]:

    
par(mfrow=c(2,2))
for (numBins in 1:20){
    hist(college$Books, breaks=numBins, xlab="Book Cost", ylab="Freq", main=paste("Bins = ",numBins))
}

We observe similar histograms when numBins is set from:

4 through 8
9 through 16
17 through 20

Let us plot the graphs in different ways, varying the par arguments.



In [22]:

    
par(mfrow=c(5,2))
for (numBins in 1:20){
    hist(college$Books, breaks=numBins, xlab="Book Cost", ylab="Freq", main=paste("Bins = ",numBins))
}

Observation: par(mfrow=c(5,2)) plots 5 rows and 2 columns of histograms in the same space that previously held 2 rows and 2 columns, so each panel is correspondingly smaller.
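Because par() settings persist for the rest of the session, it can help to save the old layout and restore it once the grid of plots is done. A minimal sketch, assuming a null graphics device so it runs without a display; x holds the six Books values from the head() output, and the names x, old_par and layout_after are illustrative only.

```r
pdf(NULL)  # null graphics device: nothing is written to disk

x = c(450, 750, 400, 450, 800, 500)  # Books values from head(college)

# par() returns the previous settings, so they can be restored later
old_par = par(mfrow = c(2, 2))
for (numBins in c(5, 10, 15, 20)) {
    hist(x, breaks = numBins, xlab = "Book Cost", ylab = "Freq",
         main = paste("Bins = ", numBins))
}
par(old_par)  # back to one plot per page

layout_after = par("mfrow")
invisible(dev.off())
print(layout_after)
```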

8 (c) vi. See the observations and notes in the answers above.

Additionally, from the scatterplot matrix in 8 (c) ii., we observe nearly linear relationships between pairs of variables such as F.Undergrad vs. Enroll and Top10perc vs. Top25perc. When analyzing our data, it is often sufficient to retain only one covariate (feature) from each group of highly correlated ones.
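A correlation coefficient gives a number to go with the visual impression. The sketch below uses only the six rows shown by head(college), so the value it prints is illustrative and not the correlation over all 777 colleges; on the full data one would call cor(college$Enroll, college$F.Undergrad).

```r
# Enroll and F.Undergrad for the six colleges shown by head(college)
enroll      = c(721, 512, 336, 137, 55, 158)
f_undergrad = c(2885, 2683, 1036, 510, 249, 678)

# Pearson correlation; values near 1 indicate a nearly linear relationship
r = cor(enroll, f_undergrad)
print(r)
```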
