8 (a) We placed the College.csv file in the Datasets directory. Let us access this file.
In [6]:
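#R >= 4.0 reads text columns as character by default; stringsAsFactors = TRUE keeps Private a factor for summary() and plot() below
college = read.csv("Datasets/College.csv", stringsAsFactors = TRUE)
head(college) #Use fix(college) in R-Studio to display in internal editor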
I used the head() function to display only the first few rows of the dataset. In RStudio, fix(college) would instead open the data in the internal editor.
NOTE: Not all columns may be visible in print, so let's list the fields in the dataset (for reference).
In [9]:
names(college)
8 (b) In the table, we do not want the college name to appear as a part of the data. However, this information may be useful later on, so we can store the names as row names. Let us first check the current row names of this table.
In [7]:
rownames(college)[1:10] #Display the first 10 row names
rownames() gives us the implicit row names of the table, which are simply the row numbers. Notice that these numbers are not displayed in the table shown in 8 (a). Let us change them to the entries of the first column.
In [8]:
rownames(college) = college[ , 1]
head(college)
Perfect! If the above table were viewed in RStudio, we would see the column name row.names above the bold college names. The names are now stored twice, so we remove the column X (the first column) from the table.
In [10]:
college = college[ , -1] #Exclude first column
head(college)
Note the first column in bold, although not explicitly mentioned, is called row.names. This is not a data column but rather the name that R is giving to each row.
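The two steps above (copy the first column into the row names, then drop it) can also be done while reading the file. Here is a minimal sketch, assuming a tiny made-up CSV passed through the text argument instead of the real College.csv; the college names and values are purely illustrative.

```r
# Tiny illustrative CSV (hypothetical values, not the real College.csv)
csv_text <- "X,Private,Apps
College A,Yes,1660
College B,No,2186
College C,Yes,1428"

# row.names = 1 tells read.csv to use the first column as the row names
toy <- read.csv(text = csv_text, row.names = 1)

rownames(toy)  # the college names
names(toy)     # only the data columns remain
```

With the real file, the same argument could be added to the read.csv() call in 8 (a), which would make the two manual steps unnecessary.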
8 (c) i. Let us now produce a numerical summary of the dataset.
In [11]:
summary(college)
In the summary above, it is easy to tell the qualitative (categorical) variables from the quantitative (numeric) ones. The first column, Private, takes only 2 values, Yes or No, and is hence categorical. Every other field is quantitative and is summarised by its minimum, 1st quartile, median, mean, 3rd quartile and maximum.
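The same split can be checked programmatically rather than by reading the summary. This is a small sketch on a made-up three-row data frame (the values are illustrative assumptions, not taken from College.csv):

```r
# Hypothetical stand-in for a few columns of the college data
toy <- data.frame(
  Private  = factor(c("Yes", "No", "Yes")),
  Apps     = c(1660, 2186, 1428),
  Outstate = c(7440, 12280, 11250)
)

# TRUE for numeric (quantitative) columns, FALSE otherwise
is_quant <- sapply(toy, is.numeric)

names(toy)[is_quant]   # quantitative fields
names(toy)[!is_quant]  # qualitative fields
```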
8 (c) ii. We now display the relationship between the first 10 columns using a scatterplot matrix.
In [12]:
pairs(college[,1:10])
8 (c) iii. Let us produce side-by-side boxplots of Outstate versus Private.
In [13]:
plot(college$Private, college$Outstate, xlab="Public/Private Indicator", ylab="Out of State Tuition($)", main="Boxplot of Outstate Vs. Private")
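8 (c) iv. Let us add a new categorical field called Elite, which takes one of 2 values depending on Top10perc: Yes if more than 50% of a college's new students come from the top 10% of their high school class, and No otherwise.
This is implemented by initializing this field for every college as No, then applying the condition.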
In [14]:
Elite = rep("No", nrow(college)) #Initialize all entries of Elite to 'No'
Elite[college$Top10perc > 50] = "Yes" #If Top10perc > 50, assign field as 'Yes'
Elite = as.factor(Elite)
college = data.frame(college, Elite)
In [15]:
head(college)
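NOTE 2: If the code above is executed more than once, we end up with multiple instances of the field Elite (R names the duplicates Elite.1, Elite.2, and so on). I once ended up with 4. The extra copies sit at the end of the table and can be deleted by dropping the trailing columns. For example, to drop the last 3 columns and keep a single Elite (note the parentheses: the : operator binds more tightly than -, so ncol(college)-2:ncol(college) would not select the last columns):
college = college[ , -((ncol(college)-2):ncol(college))]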
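One way to avoid duplicate columns in the first place is to assign the field by name, so that re-running the cell overwrites Elite instead of appending another copy. Below is a minimal sketch on a made-up data frame; add_elite is a hypothetical helper introduced only for illustration.

```r
# Hypothetical stand-in with only the column the rule depends on
toy <- data.frame(Top10perc = c(23, 60, 51, 50))

# Assigning df$Elite by name replaces any existing Elite column
add_elite <- function(df) {
  df$Elite <- factor(ifelse(df$Top10perc > 50, "Yes", "No"),
                     levels = c("No", "Yes"))
  df
}

toy <- add_elite(toy)
toy <- add_elite(toy)  # running it again does not add a second column

ncol(toy)         # still 2 columns: Top10perc and Elite
table(toy$Elite)  # counts of No and Yes
```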
Now that the Elite field has been appended as the last column, let us see how many such Elite universities exist.
In [16]:
summary(college$Elite)
So, of our 777 colleges, 78 are Elite. The boxplot below shows out-of-state tuition for elite and non-elite universities.
In [17]:
plot(college$Elite, college$Outstate, ylab="Out of State Tuition ($)", xlab="Is the University Elite?",main="Elite Vs. Outstate")
8 (c) v. We will create histograms of quantitative variables with differing numbers of bins, such as 5, 10, 15 & 20, specified through the breaks argument; the loop below tries every value from 1 to 20. To plot several of these histograms on one page, we use par(). Let us start with Out of State Tuition.
In [18]:
par(mfrow=c(2,2))
for (numBins in 1:20) {
  hist(college$Outstate, breaks=numBins, xlab="OutState", ylab="Freq", main=paste("Bins = ",numBins))
}
Interesting Observation: Even though the loop varies breaks from 1 to 20 for the field Outstate, the histograms for numBins between 7 and 13 are identical: each is broken down into 10 bins. Similarly, the histograms for numBins from 14 to 20 are all the same 20-bin histogram, and values from 3 to 6 all produce a 5-bin histogram. This happens because hist() treats a single number passed to breaks only as a suggestion and picks "pretty" break points close to it.
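The grouping can be checked without looking at the plots by asking hist() how many bins it actually used. The sketch below runs on simulated data (an assumption standing in for college$Outstate), so the exact counts will differ from those observed above.

```r
set.seed(1)
# Simulated stand-in for a tuition-like variable (not the real data)
x <- rnorm(500, mean = 10000, sd = 4000)

# For each requested value of breaks, count the bins hist() really uses
actual_bins <- sapply(1:20, function(k) {
  length(hist(x, breaks = k, plot = FALSE)$counts)
})
data.frame(requested = 1:20, actual = actual_bins)

# To force an exact number of bins, pass the break points themselves:
# 6 equally spaced break points give exactly 5 bins
forced <- hist(x, breaks = seq(min(x), max(x), length.out = 6), plot = FALSE)
length(forced$counts)
```

Replacing x with college$Outstate would give the table of requested versus actual bins for the real data.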
Let us check the same for another quantitative field: Book Cost.
In [19]:
par(mfrow=c(2,2))
for (numBins in 1:20) {
  hist(college$Books, breaks=numBins, xlab="Book Cost", ylab="Freq", main=paste("Bins = ",numBins))
}
We observe similar histograms when the number of bins was set from:
Let us plot the graphs in different ways, varying the par arguments.
In [22]:
par(mfrow=c(5,2))
for (numBins in 1:20) {
  hist(college$Books, breaks=numBins, xlab="Book Cost", ylab="Freq", main=paste("Bins = ",numBins))
}
Observation: With mfrow=c(5,2), each page holds 5 rows and 2 columns of histograms in the same space that previously held 2 rows and 2 columns, so every individual histogram is drawn much smaller.
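Because par() changes the layout for every later plot, it can help to save the old settings and restore them afterwards. This is a rough sketch on simulated data; the pdf(NULL) and dev.off() lines are an assumption that lets it run as a plain script without opening a plot window, and would be dropped inside a notebook.

```r
pdf(NULL)  # null graphics device: nothing is written to disk

# par() returns the previous settings, so they can be restored later
old_par <- par(mfrow = c(2, 2))

set.seed(1)
x <- rnorm(300)  # simulated stand-in for a quantitative column

# Four histograms fill the 2 x 2 grid on a single page
for (k in c(5, 10, 15, 20)) {
  hist(x, breaks = k, xlab = "x", main = paste("breaks =", k))
}

par(old_par)          # back to the original layout
invisible(dev.off())  # close the null device
```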
8 (c) vi. See the observations and notes in the answers above.
Additionally, from the scatterplot matrix in 8 (c) ii., we observe nearly linear relationships between pairs of variables such as F.Undergrad vs. Enroll and Top10perc vs. Top25perc. When modelling this data, it is often enough to retain only one covariate (feature) from each group of highly correlated ones.
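The visual impression of linearity can be backed up with a correlation matrix. Here is a minimal sketch on simulated data, where F.Undergrad is deliberately built to track Enroll and Books is unrelated; the 0.9 cut-off is an arbitrary choice for illustration.

```r
set.seed(42)
n <- 200
# Simulated columns (assumptions, not the real college data)
Enroll      <- runif(n, 100, 5000)
F.Undergrad <- 4 * Enroll + rnorm(n, sd = 500)  # nearly linear in Enroll
Books       <- rnorm(n, mean = 550, sd = 100)   # independent of the others
toy <- data.frame(Enroll, F.Undergrad, Books)

cors <- cor(toy)
round(cors, 2)

# List each pair once (upper triangle) whose |correlation| exceeds 0.9
high <- which(abs(cors) > 0.9 & upper.tri(cors), arr.ind = TRUE)
data.frame(var1 = rownames(cors)[high[, "row"]],
           var2 = colnames(cors)[high[, "col"]])
```

On the real data, cor() would need to be applied to the numeric columns only, since Private and Elite are factors.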