This notebook contains the final report for the project of the course A Network Tour Of Data Science.
Project : Course Suggestor
Assistant : Michaël Defferrard
Students : Cherqui Alexandre, Gusset Frédérick, El Hamamsy Laila, Triquet Thomas
In [1]:
import pandas as pd
import networkx as nx
import os
import pickle
Once the master level is reached, students have the opportunity to choose from a vast pool of courses, including courses which are not part of their original study plan. However, aside from going through all the coursebooks or asking other people's opinions, there is nothing to help students select courses which may be relevant to their domains of interest. That is why the goal of this project was to build a course suggester: the student inputs the name of a course of interest, or a list of courses taken, and obtains a list of related courses.
To do so, a graph was created in which the nodes are courses and the edges are weighted by features recovered from the different EPFL databases. The idea was to use both the lists of courses that students took and the course descriptions to weight the edges. From the latter, multiple attributes were recovered: the professor, keywords, prerequisites, teaching methods, assessments, the faculty to which the course is attached, and so on. All of this information was public, either via IS-Academia or other public access.
The courses each student took between 2007 and 2016, their affiliation (MT, EL, IN...), the current semester (BA1, BA2,...) as well as an XML file with the descriptions of the courses given in 2016 were provided by Francisco Pinto, who had already distributed this dataset during the DataJam days; it can be found on the Dropbox link here. Therefore no web-scraping was necessary, but a large portion of data cleaning and formatting was required before the different graphs could be constructed. It is important to note that the data is anonymous, so there is no way to trace any information back to a particular student.
The recommendation system was based on two main methods seen in the course : spectral clustering and graph diffusion.
The data cleaning was done in the Data Cleaning notebook; for any additional details regarding this section, please refer to the original notebook. In it, both the enrolments and the course descriptions were cleaned in order to put the information into a format more appropriate for constructing the graphs. Non-essential fields were removed, and redundant ones were merged.
This data is comprised of multiple fields : 'CourseEnrolmentID', 'PersonID', 'SubjectID', 'Year', 'Semester', 'Session', 'SubjectName', 'LevelName', 'TeachingLanguage', 'StudyDomain', 'SectionName', 'SectionCode', 'StudyPlanName', 'StudyPlanCode', 'CourseCodes'
The only fields of interest were the following:
All other fields were dropped.
The section codes were then unified to keep only the main section attribute (EL, MT, GM, etc.), removing any extension such as _HES, _ECH, etc.
Study plan codes also needed to be unified, as values such as "semestre automne" (autumn semester) and "semestre de printemps" (spring semester) were often entered instead of BA1, BA2, and so on. To do this, the course code, which indicates the year, was used. Therefore a course given in 4th year during the autumn semester would be MA1, and a course given in 3rd year during the spring semester would be BA6.
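The two unification steps above can be sketched as follows; the suffix handling and the year/semester mapping are illustrative reconstructions, not the project's actual code.

```python
def unify_section(code):
    """Keep only the main section attribute, e.g. 'EL_HES' -> 'EL'."""
    return code.split('_')[0]

def plan_code(year, semester):
    """Map a study year and semester to a plan code, e.g. (4, 'automne') -> 'MA1'."""
    index = 2 * (year - 1) + (1 if semester == 'automne' else 2)
    return 'BA%d' % index if index <= 6 else 'MA%d' % (index - 6)

print(unify_section('EL_HES'))    # EL
print(plan_code(4, 'automne'))    # MA1
print(plan_code(3, 'printemps'))  # BA6
```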
To avoid inflating weights due to courses which were taken multiple times by the same student, only the record of their first enrolment was kept.
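A minimal sketch of this deduplication step, assuming the `PersonID`, `SubjectID` and `Year` column names from the fields listed earlier:

```python
import pandas as pd

# Toy enrolment data: person 1 took course 'A' twice.
enrol = pd.DataFrame({
    'PersonID': [1, 1, 2],
    'SubjectID': ['A', 'A', 'A'],
    'Year': [2011, 2010, 2012],
})
# Keep only the earliest enrolment per (student, course) pair.
first = (enrol.sort_values('Year')
              .drop_duplicates(subset=['PersonID', 'SubjectID'], keep='first'))
print(first)
```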
All students following internships, in the CMS, or in first year were removed, as these were irrelevant to the application. Courses that no longer exist were also removed, as there was no point in recommending them. The social science courses were also dropped, as they are not linked to other courses in the study plans.
Given all these modifications the following dataframe was obtained.
In [2]:
enrol = pd.read_pickle("../data/cleanedAR_enrol_initial.pickle")
enrol.head()
Out[2]:
This data was more complex than the enrolments, as it contained more fields, which were not standardized. The course descriptions are filled out by professors and, although they all share a similar structure, a large portion of data cleaning and merging had to be done.
The fields provided in this database are the following : 'AcademicYear', 'CourseCode', 'CourseTitleFR', 'CourseTitleEN', 'ProfessorSCIPERs', 'AssistantSCIPERs', 'StudyPlansFR', 'StudyPlansEN', 'Summary_FR', 'Summary_EN', 'KeyWords_FR', 'KeyWords_EN', 'Content_FR', 'Content_EN', 'ImportantConcepts_FR', 'ImportantConcepts_EN', 'MandatoryPrerequirements_FR', 'MandatoryPrerequirements_EN', 'IndicativePrerequirements_FR', 'IndicativePrerequirements_EN', 'PreparesFor_FR', 'PreparesFor_EN', 'Handouts_FR', 'Handouts_EN', 'Bibliography_FR', 'Bibliography_EN', 'TeachingMethod_FR', 'TeachingMethod_EN', 'EvaluationMethod_FR', 'EvaluationMethod_EN', 'ExpectedWork_FR', 'ExpectedWork_EN', 'Other_FR', 'Other_EN', 'ProjectManagement', 'WorkInSociety', 'PersonalEffectiveness', 'TrainingAchievements', 'WorkInGroupsAndOrgs', 'CommunicateInformation'
Many fields are provided twice, once in English and once in French, which requires merging. There are also fields which are not relevant to the application, such as the teaching method, evaluation method, expected work, project management, work in society, etc.
That is why only the fields deemed to be of interest for constructing the final graph were kept:
The social science courses were removed as they are elective courses which are often unrelated to the rest of the studies. Other courses were removed later on, following the analysis conducted on the students' graph (also called the enrolments graph) in section 3.1 and the study plans graph in section 3.2.
First and second year courses were also removed, as most sections do not give the possibility of choosing elective courses at this level. Furthermore, most of these courses are basic polytechnic courses which are not highly specialized. SHS courses were also removed, as they were considered irrelevant for the course suggester. Note that the focus of the recommendation system was finally placed on the master courses, as that is where students have the most choice.
In the same way as for the course enrollements, the study plans were simplified to keep only the main categories : 'AR', 'Area and Cultural Studies minor', 'Auditeurs en ligne', 'CGC', 'Computer engineering minor', 'EL', 'EL MNIS', 'GC', 'GM', 'HD', 'Hors plans', 'IF', 'IN', 'Information security minor', 'MA', 'MES', 'MT', 'MTEE', 'MX', 'Mineur : Biocomputing', 'Mineur : Biotechnologie', 'Mineur : Design intégré, architecture et durabilité', 'Mineur : Développement territorial et urbanisme', 'Mineur : Informatique', 'Mineur : Management de la technologie et entrepreneuriat', 'Mineur : Neuroprosthétiques', 'Mineur : Neurosciences computationnelles', 'Mineur : Systems Engineering', 'Mineur : Systèmes de communication', 'Mineur : Technologies biomédicales', 'Mineur : Technologies spatiales', 'Mineur : Énergie', 'Mineur STAS', 'PH', 'SC', 'SHS', 'SIE', 'SV', 'UNIL'
The requirements were not necessarily provided in the form of course codes. Therefore the course codes needed to be identified from the string and correctly formatted if present; otherwise they needed to be inferred from the course name provided, knowing that it was not necessarily exact. For the first part, regex was used to identify course codes in strings. For the second part, string edit distance was used to best match the provided course name to the official course name and then map it to the corresponding course code.
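The two-step matching can be sketched as below. The course-code pattern, the `official` mapping, and the use of `difflib` (as a stand-in for an explicit edit-distance computation) are all assumptions for illustration, not the project's actual code.

```python
import re
import difflib

# Hypothetical EPFL-style course-code pattern, e.g. 'EE-320'.
CODE_RE = re.compile(r'\b[A-Z]{2,5}-\d{3}\b')

def match_requirement(text, official):
    """Return course codes found in `text`, else fuzzy-match the course name."""
    codes = CODE_RE.findall(text)
    if codes:
        return codes
    # No explicit code: map the free-text name to the closest official title.
    best = difflib.get_close_matches(text, list(official), n=1, cutoff=0.6)
    return [official[best[0]]] if best else []

official = {'Analog circuits design I': 'EE-320', 'Nanoelectronics': 'EE-521'}
print(match_requirement('Prerequisite: EE-320', official))     # ['EE-320']
print(match_requirement('Analog circuit design 1', official))  # ['EE-320']
```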
To compute a distance between courses based on keywords, the first step was to obtain a set of terms per course. These terms were given by the keywords, contents, summaries and concepts, which needed to be unified and lemmatized before processing.
The resulting dataframe can be seen below.
In [3]:
courses = pd.read_pickle("../data/cleanedAR_courses_initial.pickle")
courses.head()
Out[3]:
Given the information collected from the cleaned dataset, different graphs were constructed and analyzed to find the relations between the different courses, and thus to assess the influence of each graph. The goal was to link all courses in different ways so as to represent all student interests. In the end, a final graph was constructed as a weighted sum of the most relevant graphs, taking into account the relative importance of the information provided by each.
The main source of information was the students themselves and the courses they took, as this information covers several years (since 2007). Being the most important piece of information, this graph became the baseline for the analysis and the recommendation system.
Other important pieces of information were considered, taking into account the fact that they had to link courses in a way that brought another type of connection, useful for students who want to discover something other than the mandatory courses from their section.
This information was provided by the professors, the assistants (if given), the topics (keywords of a course) and the pre-requirements (and, in the other direction, the "prepares for" courses). The corresponding graphs were constructed in the Graph Creation notebook (enrolments, study plans, professors and assistants), the Graph Requirements notebook and the Topics Graph notebook.
Note that different graphs were created using a generic function from the graph module and the analysis was conducted in the Graph Analysis notebook.
This graph was constructed with the idea that two courses should be connected if a student has taken both over their curriculum. To avoid giving high weights to courses which were given over a large number of years, the weights were normalized by the number of years the courses co-existed. This was done in the Graph Creation notebook, and the figures were generated in the Graph Analysis notebook.
The following figures correspond to the weight matrix and the weighted degree distribution of the network.
Looking at the weight distribution, there are a few courses which are highly connected, with large degrees.
The graph of the Bachelor and Master courses for all sections combined was then plotted on the eigenvectors of the normalized Laplacian and analyzed using pyGSP.
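The embedding on the eigenvectors of the normalized Laplacian can be sketched as below using scipy (in place of pyGSP); the toy weight matrix is invented for illustration.

```python
import numpy as np
from scipy.sparse.csgraph import laplacian
from scipy.linalg import eigh

# Toy symmetric weight matrix of a small 4-node graph.
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = laplacian(W, normed=True)
eigvals, eigvecs = eigh(L)          # eigenvalues in ascending order
coords = eigvecs[:, 1:3]            # plot nodes on the first two non-trivial eigenvectors
print(coords.shape)                 # (4, 2)
```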
Visually, it would seem that there are certain groups which can be distinguished by looking at the branches formed. However only one big component remains as can be seen below (the size of the giant component is equal to the number of nodes).
In [4]:
# Load the weight matrix.
with open(os.path.join(os.getcwd(), 'Graphs', 'students_graph_with_AR.pkl'), 'rb') as pkl_file:
    students_mat = pickle.load(pkl_file)
# Create the graph and find the giant component.
G1 = nx.from_numpy_matrix(students_mat)
Gcc = max(nx.connected_components(G1), key=len)  # node set of the largest component
# Compare the giant component to the number of nodes.
print("The size of the giant component is " + str(len(Gcc)))
print("The number of nodes in the graph is " + str(len(students_mat)))
Using the normalized Laplacian and observing the graph on the first two eigenvectors, the graph constructed from the enrolments forms several clusters that are more or less distant. The results, where the different sections and faculties are highlighted, can be seen in the following figures. Note that only a few outputs are shown here; all other graphs can be seen in the original notebook.
The first thing to notice is that one big cluster is distinguishable from the others. It corresponds exactly to the courses contained in the AR study plan (or, more specifically, courses with a link to the AR section), as can be seen after highlighting the courses based on the study plans.
There is another distinct cluster, although less evident due to the scale imposed by the presence of AR. This one corresponds to GC and SIE courses, which each occupy one portion of the same branch. Note that half of the SIE courses are not in this cluster because they are also part of other study plans, which have a bigger influence.
Thus, it is possible to say that the ENAC faculty (apart from some SIE courses) has few, or weak, connections with the rest of the faculties. Students from one side are therefore less likely to take courses from the other. It was then decided that the ENAC faculty could be dropped from the other graphs too.
Certain sections are well grouped into one distinct branch, such as Systèmes de Communications, whilst others are more dispersed, such as MT, due to its interdisciplinary nature and overlap with other STI sections.
Finally, the graphs show that all distinct clusters correspond to a faculty, a section, or even orientations within the different master curriculums. Note however that certain faculties are more dispersed than others. The IC faculty occupies its own branch, whilst the ENAC faculty is quite dispersed but relatively separable from the rest. The STI faculty, on the other hand, forms one big cluster at the center.
It is important to remark that minors are spread out over all sections as they are accessible to all students.
Generally, this demonstrates that students tend to take courses provided by their own sections and are less inclined to take others. This is most likely because certain sections do not allow students to take many courses outside their study plans, combined with a lack of information about other offerings. This is where the current project can be useful, which is why information linking courses across the different sections should be incorporated.
This graph was constructed in the Graph Creation notebook and the figures generated in the Graph Analysis notebook.
In this graph, two courses are connected if they are part of the same study plan. This was compared to the enrolments graph as it bore the largest similarity.
There seems to be a high correlation between all courses, except for the architecture courses. This confirms what was observed previously. Note that the different faculties and sections are less distinguishable on the eigenmaps using the study plans, but maintain a similar configuration to the enrolments graph, as can be seen below.
The projection on the first two eigenvectors of the normalized Laplacian shows two main clusters which are orthogonal in their correlations.
The farthest cluster is the architecture cluster as mentioned previously.
The SIE and GC clusters are once again located on one side and relatively grouped together.
However, using a spring layout, the different study plans are more visible, with the different components corresponding relatively well to the different sections.
The graph is connected (no isolated subgraphs) and has only one big component, as can be seen below. The information is therefore redundant with the baseline (enrolments graph), which is why it was not kept.
In [5]:
# Load the weight matrix.
with open(os.path.join(os.getcwd(), 'Graphs', 'section_graph_with_AR.pkl'), 'rb') as pkl_file:
    students_mat = pickle.load(pkl_file)
# Create the graph and find the giant component.
G1 = nx.from_numpy_matrix(students_mat)
Gcc = max(nx.connected_components(G1), key=len)  # node set of the largest component
# Compare the giant component to the number of nodes.
print("The size of the giant component is " + str(len(Gcc)))
print("The number of nodes in the graph is " + str(len(students_mat)))
⚠️ ⚠️ ⚠️ In order to be able to judge the pertinence of the recommender system, and given the nature of the baseline graph (i.e. enrolments), which clustered the faculties relatively well, the decision was taken to make recommendations only for the STI faculty. Only master courses were considered, as this is where students have the most flexibility and the least guidance. Therefore the following graphs are shown only for courses taken by students in the STI faculty during the master curriculum. ⚠️ ⚠️ ⚠️
In this graph, two courses are connected if they were taught by the same professor. This graph was considered because professors tend to teach in similar domains, giving potential continuity in the curriculum and the domain of interest. The graph shown corresponds to the master courses in STI. Refer to the Graph Creation notebook for the creation of the graph and the Graph Analysis notebook for the generation of the different figures.
Unfortunately this graph is quite sparse, as can be seen by observing the weight matrix, leading to many disconnected subgraphs.
The graph is not made up of one or several large components but rather a multitude of small components or isolated nodes.
In the following figures, the degree distribution and the distribution of component sizes can be seen respectively.
Most nodes have a low degree, with 3 or fewer connections. Very few have high degrees, and those that do result from multiple professors hosting the same course while also giving other courses with multiple professors. The node with the highest degree is the Lab in EDA design, which is linked to:
Most components are made up of fewer than 5 nodes, with very few exceeding that value. The largest component is of size 16, grouping professors from STI and SV, and corresponds to the following courses:
These courses have very little in common: the more hops there are between two courses, the less they are related. Given that, as well as the very sparse nature of this graph, it was not considered for the final recommender system.
Doing the same for the assistants and plotting for the STI faculty, the results obtained are even less convincing, given the even sparser nature of the weight matrix. This is because professors do not necessarily input the assistant SCIPERs into the course descriptions. Most likely, the doctoral assistants, who usually participate in just one or two courses, are input, but not the student assistants, who participate in a large panel of courses. Refer to the Graph Creation notebook for the corresponding code.
This graph was constructed by taking various textual elements in the course descriptions, in particular the keywords, summaries, contents and concepts, as these relate to the topics of the course. The idea was to extract keywords from the various course descriptions and link them using topic detection methods.
The first step was to format the text and lemmatize it to reduce the words to simpler forms; that way, for example, plural terms are reduced to their singular form and identified as the same word. This was done using the NLTK package. Stopwords were also removed, as they are often irrelevant to the topics themselves.
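A simplified stand-in for this pipeline: lowercase the text, strip stopwords, and crudely singularize plurals. The real notebook used NLTK's lemmatizer and stopword list; this toy version (with a hand-picked stopword set and naive plural handling) only illustrates the idea.

```python
import re

STOPWORDS = {'the', 'of', 'and', 'to', 'a', 'in', 'is', 'are'}

def preprocess(text):
    """Tokenize, remove stopwords, and naively singularize plural terms."""
    tokens = re.findall(r'[a-z]+', text.lower())
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Naive lemmatization: map simple plurals to their singular form.
    return [t[:-1] if t.endswith('s') and len(t) > 3 else t for t in tokens]

print(preprocess('The circuits and devices of analog design'))
# ['circuit', 'device', 'analog', 'design']
```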
Once the corpus of words was created for all the different courses, the TF-IDF was computed. TF-IDF, or "term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus" (Wikipedia). Therefore the most relevant words in the corpus will be those that are the most unique, and as such those that best describe the topics. This was done using the TfidfVectorizer from the scikit-learn package.
Then, given a Courses × Words matrix, matrix factorization was used to determine the various topics and obtain a Courses × Topics matrix, from which a weight matrix between the different courses was computed. This was done using Latent Dirichlet Allocation (LDA) from the scikit-learn library.
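The pipeline can be sketched as follows; the tiny corpus and hyper-parameters are invented for illustration and are not those used in the project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    'analog circuit design amplifier',
    'circuit vlsi design layout',
    'fluid mechanics turbulence flow',
    'turbulence flow simulation',
]
tfidf = TfidfVectorizer(stop_words='english')
X = tfidf.fit_transform(corpus)                      # Courses x Words
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topics = lda.fit_transform(X)                        # Courses x Topics
print(topics.shape)                                  # (4, 2)
```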
Unfortunately the topic detection was not functional, for multiple reasons, the main one being that there were very few samples compared to the number of keywords to be analyzed. The results obtained in the Topics Graph - Keywords, Summaries, Concepts notebook were the following:
Extracting tf features for LDA...
Fitting LDA models with tf features, n_samples=196 and n_features=1919...
Topics in LDA model:
Topic #0: direct diode dimensional digital different differential difference diagnostic device development determine develop detection detail described design description depth describe depend
Topic #1: direct diode dimensional digital different differential difference diagnostic device development determine develop detection detail described design description depth describe depend
Topic #2: direct diode dimensional digital different differential difference diagnostic device development determine develop detection detail described design description depth describe depend
Topic #3: 6ghz aberration ability able abstract acoustic acquaint acquire acquires acquisition active act activity actuator add addition address adaptive advance afm
Topic #4: direct diode dimensional digital different differential difference diagnostic device development determine develop detection detail described design description depth describe depend
Topic #5: direct diode dimensional digital different differential difference diagnostic device development determine develop detection detail described design description depth describe depend
Topic #6: direct diode dimensional digital different differential difference diagnostic device development determine develop detection detail described design description depth describe depend
Topic #7: direct diode dimensional digital different differential difference diagnostic device development determine develop detection detail described design description depth describe depend
Topic #8: direct diode dimensional digital different differential difference diagnostic device development determine develop detection detail described design description depth describe depend
Topic #9: direct diode dimensional digital different differential difference diagnostic device development determine develop detection detail described design description depth describe depend
As can be seen, the topics are nearly all identical, which is surprising given the low number of topics to be extracted. That is why this graph was not used, although it could have offered links between courses related by topic without relying on a professor explicitly defining the link.
In order to take certain courses, students may need specific background skills acquired by following other courses; these are called pre-requirement courses. It may be interesting to link courses based on this aspect. Once the requirements were linked to the course codes, it was possible to construct 3 graphs based on this feature (this was done in the Graph Requirements notebook):
Overall, these graphs are quite sparse which is most probably due to two factors :
As mentioned previously, the decision was taken to focus solely on courses taken by students in the STI faculty over their curriculum. This was done on the baseline graph (enrolments), with the idea of enriching it using the different requirements graphs, following the analysis in the previous section. Note that the following images were generated in the Graph Analysis notebook.
Graph visualized along the first two eigenvectors of the normalized Laplacian
Weight matrix : it is interesting to see that the weight matrix is nearly block diagonal with 5 blocks corresponding to the 5 sections in the STI master curriculums.
Unweighted degree distribution : the degree distribution is nearly symmetrical. Note that the course with the highest degree is the Space Mission Design and Operations course which is a popular course amongst students as it is given by Claude Nicollier who talks about his adventures in space. Other courses with high degrees often correspond to other appreciated courses such as this one or mandatory courses. The latter is problematic in the sense that it may bias the recommender system towards courses that students must take rather than courses that students appreciate.
Highlighting the different sections as was done in the analysis, it can be shown that the MX, GM and EE sections have their own branches whilst the MT, BIOENG and a portion of EE are dispersed over all the sections.
Using the resulting graph, the objective was to recommend the most relevant courses using the notions of spectral clustering and graph diffusion seen in the course. One could imagine simply taking the neighbours of a course of interest, but this would not have led to a coherent set of courses, such as what is obtained using clustering, nor to an optimal list of courses given multiple inputs.
This portion of the project was done in the Spectral Clustering notebook.
The objective here was to see whether it would be possible to isolate a given number of coherent clusters which could be used as a basis for recommendation. The number of clusters k first needed to be selected. The method presented in the course to select the optimal k was based on the spectral gap. Unfortunately, this relies on there being clusters which can more or less easily be separated. In the present case, there is one giant, highly connected component for the STI faculty, which is why the output is k = 1.
Another method presented in the Applied Data Analysis course is to base the selection of k on the silhouette. The silhouette is defined by :
$$ S = \frac{1}{N}\sum_{i=1}^{N} s(i) = \frac{1}{N}\sum_{i=1}^{N} \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} $$where :
The silhouette is computed for different values of k, and the value of k which maximizes S is selected.
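This selection procedure can be sketched with scikit-learn's `silhouette_score`; the synthetic blobs below stand in for the real spectral embedding, so the selected k here reflects the toy data, not the project's result.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)  # k with the maximal silhouette S
print(best_k)
```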
The optimal value is therefore k = 8 and the results of the clustering can be seen in the following figure.
The clusters obtained are the following and are coherent with personal knowledge of the faculties :
These clusters are coherent as they tend towards the study plans of certain sections or their respective specializations. Therefore, when a course is input into the recommender system, it will output the courses of the cluster, sorted by importance with respect to the original course. The relative importance is given by the distance between the course of interest and the recommendations.
K-Means clustering is a particular case of Gaussian Mixture Models with isotropic covariance matrices. There is no guarantee that the distribution of nodes in the course space is isotropic. That is why Gaussian Mixture Models were tested, to see whether the possibility of having full covariance matrices, and eventually soft clustering, would lead to better outputs.
Using the GMM implemented in scikit-learn it was possible to do unsupervised clustering on the courses and obtain the output class labels corresponding to the most likely class for each course.
Selecting the optimal k was done using the AIC and BIC metrics, which evaluate the quality of the model for a given k. These metrics are computed for different values of k, and the value which minimizes the AIC / BIC is selected. The optimal k given by the BIC is 16, as opposed to 24 with the AIC.
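A sketch of this model selection with scikit-learn's `GaussianMixture`; the synthetic data and the candidate range of k are illustrative only.

```python
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=1)
bics = {}
for k in range(1, 9):
    gmm = GaussianMixture(n_components=k, covariance_type='full',
                          random_state=0).fit(X)
    bics[k] = gmm.bic(X)          # lower BIC indicates a better model
best_k = min(bics, key=bics.get)  # k which minimizes the BIC
print(best_k)
```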
Using k=16 led to better outputs than k=24, which is why it was retained.
The clusters obtained are the following :
Courses are not necessarily part of one curriculum or one set of relevant courses. That is why soft clustering was considered for the recommendation. This gives more flexibility to the recommender system and could potentially improve the quality and relevance of the outputs.
Using the baseline graph (enrolments), the results obtained amounted to hard clustering; this is because the Gaussians did not overlap sufficiently. However, as can be seen below, the obtained clusters were different:
Given the clusters output by the methods detailed previously, the recommendations are done as follows :
Given a single course
Given multiple courses
The recommendation function is given in the following cell.
In [6]:
from collections import Counter

import numpy as np
import scipy.spatial


def suggest_wrt1(course):
    """Recommend courses related to `course` (a course name or a list of names).

    Relies on variables defined earlier in the notebook: the cluster labels
    (labels_1, labels_1c, multi_labels, labels_matrix), the embeddings
    (F1_K, F1_gmm_s, F1_gmm_h), the course names (Courses, courses2) and the
    name-to-index mapping (courses_index_dico).
    """
    # Deal with the case where course is a string and not a list of strings:
    if isinstance(course, str):
        course = [course]

    print('Results with K-means:\n')
    L = []
    indices1 = []
    for q in range(len(course)):
        L.append(labels_1[courses_index_dico[course[q]]])  # labels of the chosen courses
        indices1.append(courses_index_dico[course[q]])     # indices of the chosen courses
    L = Counter(L).most_common(1)[0][0]  # keep the label L of the most common cluster
    ind = np.where(labels_1 == L)[0]     # indices of the courses of that cluster
    indices1 = np.array(indices1)
    # Indices of the chosen courses that are in the most common cluster:
    indices = list(indices1[list(np.isin(indices1, ind))])
    # Compute the distances between the courses of the chosen cluster and the chosen courses:
    dist = scipy.spatial.distance.cdist(F1_K[ind, :], F1_K[indices, :], 'euclidean')
    dist = np.mean(dist, axis=1)
    # Output the courses of the cluster, ordered by distance to the chosen courses:
    ind = ind[np.argsort(dist)]
    print(Courses[ind])
    print([courses2[courses2.index.str.endswith(Courses[ind][r])].CourseTitleFR.tolist()[0]
           for r in range(len(ind))])
    print('\n')

    print('\nResults with GMM soft-clustering:\n')
    if len(course) == 1:
        L = multi_labels[courses_index_dico[course[0]]]  # labels of the clusters
    else:
        L = [item for q in range(len(course))
             for item in multi_labels[courses_index_dico[course[q]]]]
        L = [Counter(L).most_common(1)[0][0]]  # keep the label of the most common cluster
    for p in range(len(L)):
        ind = np.array(labels_matrix[L[p]])  # indices of the courses of the cluster
        indices = list(indices1[list(np.isin(indices1, ind))])
        dist = scipy.spatial.distance.cdist(F1_gmm_s[ind, :], F1_gmm_s[indices, :], 'euclidean')
        dist = np.mean(dist, axis=1)
        ind = ind[np.argsort(dist)]
        print(Courses[ind])
        print([courses2[courses2.index.str.endswith(Courses[ind][r])].CourseTitleFR.tolist()[0]
               for r in range(len(ind))])
        print('\n')

    print('\nResults with GMM hard-clustering:\n')
    L = []
    for q in range(len(course)):
        L.append(labels_1c[courses_index_dico[course[q]]])
    L = Counter(L).most_common(1)[0][0]  # keep the label L of the most common cluster
    ind = np.where(labels_1c == L)[0]    # indices of the courses of that cluster
    indices = list(indices1[list(np.isin(indices1, ind))])
    dist = scipy.spatial.distance.cdist(F1_gmm_h[ind, :], F1_gmm_h[indices, :], 'euclidean')
    dist = np.mean(dist, axis=1)
    ind = ind[np.argsort(dist)]
    print(Courses[ind])
    print([courses2[courses2.index.str.endswith(Courses[ind][r])].CourseTitleFR.tolist()[0]
           for r in range(len(ind))])
For the set of courses ['Nanoelectronics', 'Analog circuits design I'], the recommendations gave the following outputs:
Results with K-means: ['Nanoelectronics', 'Analog circuits design I', 'Fundamentals of VLSI design', 'Lab in EDA based design', 'Test of VLSI systems', 'HF and VHF circuits and techniques I', 'Modeling of emerging electron devices', 'Physical models for micro and nanosystems', 'Hardware systems modeling I', 'Hardware systems modeling', 'Advanced analog and RF integrated circuits design I', 'Introduction to VLSI Design', 'Integrated circuits technology', 'Analog circuits design II', 'Bio-nano-chip design', 'Analog circuits for biochip', 'Advanced VLSI design', 'Advanced analog and RF integrated circuits design II', 'Lab in microelectronics', 'HF and VHF circuits and techniques II', 'Advanced wireless communications: algorithms and architectures', 'Electrical filters', 'Advanced lab in electrical engineering']
Results with GMM soft-clustering: ['Nanoelectronics', 'Analog circuits design I', 'Test of VLSI systems', 'Lab in EDA based design', 'HF and VHF circuits and techniques I', 'Fundamentals of VLSI design', 'Modeling of emerging electron devices', 'Physical models for micro and nanosystems', 'Integrated circuits technology', 'Hardware systems modeling', 'Lab in microelectronics', 'Introduction to VLSI Design', 'Optical detectors']
Results with GMM hard-clustering: ['Nanoelectronics', 'Analog circuits design I', 'Test of VLSI systems', 'Lab in EDA based design', 'HF and VHF circuits and techniques I', 'Fundamentals of VLSI design', 'Modeling of emerging electron devices', 'Physical models for micro and nanosystems', 'Integrated circuits technology', 'Hardware systems modeling', 'Lab in microelectronics', 'Introduction to VLSI Design', 'Optical detectors']
Assessment
As can be seen, the outputs differ: the results of GMM hard and soft clustering are identical, but both differ from the k-means results. Nevertheless, the recommendations seem quite good for this set of courses.
NB: Some randomness was observed in the GMM-based clustering, leading to different clusters between runs. The reason could not be determined, since the algorithm is initialized with k-means, which should always yield the same output.
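One plausible cause is that scikit-learn's GaussianMixture draws its k-means initialization from a random state unless a seed is fixed explicitly. A minimal sketch of pinning the seed so repeated fits agree (the random `features` matrix is a hypothetical stand-in for the spectral embedding):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical random features standing in for the spectral embedding.
rng = np.random.RandomState(0)
features = rng.rand(100, 5)

# Fixing random_state seeds the k-means initialization used inside the GMM,
# so two fits on the same data produce identical cluster labels.
gmm_a = GaussianMixture(n_components=4, init_params='kmeans', random_state=42)
gmm_b = GaussianMixture(n_components=4, init_params='kmeans', random_state=42)
labels_a = gmm_a.fit_predict(features)
labels_b = gmm_b.fit_predict(features)
assert np.array_equal(labels_a, labels_b)
```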
In order to take the requirements graph into consideration, all three methods were repeated on a graph obtained by combining the graph based on the students' choices (enrolments) with the requirements graph. Two example outputs of the recommendation based on this new graph can be seen in the table of part 4.3, where it is compared to the diffusion method. This combined graph was obtained in the same manner as in the Graph Diffusion part (normalization and weighted sum).
To test other courses in the recommendation functions, it is sufficient to re-run the Spectral Clustering notebook and try different courses in parts 8.1 and 8.2. Note that the course description and enrolment dataframes must first be generated by running the Data Cleaning notebook after downloading the dataset, as it was not possible to upload the dataset onto Moodle given its size.
Using graph diffusion, the objective was to diffuse a Dirac delta placed on the nodes corresponding to the courses of interest. Sorting the courses by the resulting signal, from highest to lowest, then gives the most relevant courses. This was done in the Weighting Metrics and Graph Diffusion notebook using the following functions:
In [7]:
"""
Function used to diffuse a signal from a given number of nodes using a heat
diffusion filter with the desired tau. Note that tau should not be too small
in order to diffuse and not too high to avoid having a uniform signal
over the entire graph.
"""
def heat_diffusion(G, courses, tau):
# Create the heat diffusion filter
filt = filters.Heat(G, tau)
# Create the signal for the given graph
signal = np.zeros(G.N)
for course in courses:
NODE = np.where(np.asarray(full_courses_list) == course)[0]
signal[NODE] = 1
# Apply the filter to the signal
filtered_s = filt.filter(signal)
return filtered_s
"""
This function calls the graph diffusion for valid course codes
and keeps the most relevant courses for the recommendation.
The output is the list of courses in descending order.
"""
def diffusion(weight_mat, list_loved_courses, n_result_courses, tau_filter):
# Define the index of the loved courses to hgighlight them later.
NODE = []
for i in range(0,len(list_loved_courses)):
if (len(np.where(np.asarray(full_courses_list) == list_loved_courses[i])[0])==0):
print("ERROR! Course loved is not in the list of the courses.")
return
NODE.append(np.where(np.asarray(full_courses_list) == list_loved_courses[i])[0][0])
# Create the graph and do the diffusion on it.
G_diffusion = create_graph(weight_mat)
filtered_signals = heat_diffusion(G_diffusion,list_loved_courses,tau_filter)
# Plot the diffusion
G_diffusion.set_coordinates("spring")#G_diffusion.U[:,1:3])
G_diffusion.plot_signal(filtered_signals, vertex_size=50, highlight = NODE, )
# Create the list of courses ordered with their values found by the diffusion.
filtered_signals_int = list(filtered_signals)
courses_list = []
if(n_result_courses > len(filtered_signals_int)):
n_result_courses = len(filtered_signals_int)
for i in range(0,n_result_courses):
course_code = full_courses_list[filtered_signals_int.index(max(filtered_signals_int))]
courses_list.append(courses[courses.index.str.endswith(course_code)].CourseTitleFR.tolist()[0])
filtered_signals_int[filtered_signals_int.index(max(filtered_signals_int))] = -1
return courses_list
"""
# Example of usage : note this is extracted from the original notebook
weight_different_graph = [0.2,0,0,0,0,0,1] # weights used for the different graphs
# Combine the different graphs
diffusion_graph = weight_different_graph[0]*weight_matrices[0]
for i in range(1, len(weight_matrices)):
diffusion_graph = diffusion_graph + weight_different_graph[i]*weight_matrices[i]
# Call the recommendation
recommended_courses = diffusion(diffusion_graph,["MICRO-421", "MICRO-520"],7,4)
"""
print()
Originally, the idea was to use supervised learning to run a grid search in order to automatically select the best combination of graph weights.
Unfortunately, it was not possible to conduct a grid search, as there was no ground truth against which to assess the results and automatically select the best combination. One idea was to use the probabilities of two courses being taken together, computed in the Computing Course Probabilities notebook, but this would have overfit the enrolments graph since it is based on the same data. Another alternative was to build a new database of courses taken by students currently finishing their masters, augmented with the courses they would have liked to take (see the Performance Evaluation notebook); this was unfortunately not sufficient to construct a ground truth. In fact, this application would have benefitted from knowing the students' assessment of each course, but this data was not available. Therefore, without a metric to assess the results, personal knowledge was used to judge the outputs and optimize the coefficients.
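For reference, the co-enrolment probability mentioned above can be estimated from a binary student-by-course matrix. This is a toy sketch with made-up data; the actual computation lives in the Computing Course Probabilities notebook:

```python
import numpy as np

# Toy binary student-by-course enrolment matrix (hypothetical data:
# 4 students, 4 courses; a 1 means the student took the course).
X = np.array([[1, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 1, 1, 0]])

counts = X.T @ X                       # co-enrolment counts between course pairs
taken = np.diag(counts).astype(float)  # number of times each course was taken
# cond_prob[i, j] estimates P(course j taken | course i taken).
cond_prob = counts / taken[:, None]

assert np.allclose(np.diag(cond_prob), 1.0)  # a course always co-occurs with itself
```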
Using the baseline graph, and given a list of courses of interest, a Dirac delta is diffused from the corresponding nodes using a heat filter. An example output follows.
Outputting the top 15 recommendations given these two courses, the following list is obtained: 'Applied machine learning', 'Automatic speech processing', 'Advanced lab in electrical engineering', 'Composants semiconducteurs', 'Introduction to planetary sciences', 'Space mission design and operations', 'Wave propagation along transmission lines', 'Mécanique vibratoire', 'Lasers: theory and modern applications', 'Lab in microelectronics', 'Scaling laws in micro- and nanosystems', 'Advanced lab in electrical energy systems', 'Fracture mechanics', 'Social media', 'Image and video processing'
Applied Machine Learning is a course given mostly to micro-engineering students, whereas Automatic Speech Processing is an electrical engineering course. That is why the list above contains courses given in these two sections, which is logical since it was obtained using the student graph. However, the topics of the courses are not really the same: for example, Composants semiconducteurs is a micro-engineering course about transistors, which is not closely related to the machine learning and speech processing fields. Using this graph for recommendation is interesting in the sense that it proposes courses from the same section, but other graphs need to be considered to obtain courses that are more related in terms of the topics covered in class.
As mentioned previously, it is important to consider different graphs during the recommendation process. The graphs were summed using weights to adjust the importance of each of them.
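The normalization and weighted sum can be sketched as follows; the toy symmetric matrices here are hypothetical stand-ins for the real adjacency matrices:

```python
import numpy as np

def combine_graphs(weight_matrices, graph_weights):
    """Normalize each adjacency matrix by its maximum entry,
    then take the weighted sum of the normalized matrices."""
    combined = np.zeros_like(weight_matrices[0], dtype=float)
    for W, w in zip(weight_matrices, graph_weights):
        if W.max() > 0:
            combined += w * (W / W.max())
    return combined

# Toy symmetric weight matrices standing in for two of the real graphs.
rng = np.random.default_rng(0)
A = rng.random((5, 5))
A = (A + A.T) / 2
np.fill_diagonal(A, 0)
B = rng.random((5, 5))
B = (B + B.T) / 2
np.fill_diagonal(B, 0)

combined = combine_graphs([A, B], [0.2, 1.0])
assert combined.shape == (5, 5)
assert np.allclose(combined, combined.T)  # the combination stays symmetric
```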
Looking at the outputs when combining the different graphs while diffusing from Applied machine learning (with a tau of 4), the results are the following:
Enrolments Weight | 1 | 0 | 1e-11 | 1e-11 | 1e-11 | 1e-11 |
---|---|---|---|---|---|---|
Assistants Weight | 0 | 1 | 0 | 0 | 0 | 0 |
Professors Weight | 0 | 0 | 1 | 0 | 0 | 0 |
Courses With Same Requirements Weight | 0 | 0 | 0 | 1 | 0 | 0 |
Course To Its Requirements Weight | 0 | 0 | 0 | 0 | 1 | 0 |
Requirements of same course Weight | 0 | 0 | 0 | 0 | 0 | 1 |
Outputs | 'Applied machine learning', 'Advanced lab in electrical engineering', 'Composants semi-conducteurs', 'Introduction to planetary sciences', 'Lasers: theory and modern applications', 'Wave propagation along transmission lines', 'Mécanique vibratoire' | 'Applied machine learning', 'Analysis and modelling of locomotion', 'Introduction to cellular and molecular biotechnology', 'Biotechnology lab (for CGC)', 'Pharmaceutical biotechnology', 'Biomaterials', 'Advanced bioengineering methods laboratory' | 'Applied machine learning', 'Turbomachines thermiques', 'Hydrodynamics', 'Scaling laws in micro- and nanosystems', 'Composants semiconducteurs', 'Biomechanics of the musculoskeletal system', 'Lifecycle performance of product systems' | 'Applied machine learning', 'Lasers: theory and modern applications', 'Advanced lab in electrical engineering', 'Wave propagation along transmission lines', 'Introduction to planetary sciences', 'Space mission design and operations', 'Lab in microelectronics' | 'Applied machine learning', 'Advanced machine learning', 'Lasers: theory and modern applications', 'Advanced lab in electrical engineering', 'Wave propagation along transmission lines', 'Lab in microelectronics', 'Introduction to planetary sciences' | 'Applied machine learning', 'Advanced machine learning', 'Automatic speech processing', 'Production management', 'Image processing II', 'Biomicroscopy I', 'Biomicroscopy II' |
Looking at the results obtained from the graph diffusion from Applied machine learning:
Given all of this, only the graph linking the requirements of the same course is kept for the final application; it will be referred to as the requirements graph in the following sections. A general remark concerning the requirements graph is that only the immediate neighbours of a node are pertinent: looking farther from the node often leads to irrelevant courses, as mentioned previously.
The next step was then to find the optimal weights to combine the remaining graphs. The issue is that the relevance of the results depends highly on the chosen courses. The outputs were tested for different combinations.
Courses | Enrolment | Requirement | Result |
---|---|---|---|
Applied Machine Learning | 0.15 | 1 | ['Applied machine learning','Advanced machine learning','Automatic speech processing','Biomicroscopy II','Biomicroscopy I','Image optics','Lasers: theory and modern applications'] |
Applied Machine Learning | 0.2 | 1 | ['Applied machine learning','Advanced machine learning','Lasers: theory and modern applications','Advanced lab in electrical engineering','Wave propagation along transmission lines','Introduction to planetary sciences','Space mission design and operations'] |
Applied Machine Learning | 0.3 | 1 | ['Applied machine learning','Advanced lab in electrical engineering','Lasers: theory and modern applications','Wave propagation along transmission lines','Introduction to planetary sciences','Space mission design and operations','Mécanique vibratoire'] |
Lab on app development for tablets and smartphones | 0.2 | 1 | ['Lab on app development for tablets and smartphones','Semester project for specialisation C','Advanced analog and RF integrated circuits design I','Image communication','Lab in signal and image processing','HF and VHF circuits and techniques II','Mathematics of data: from theory to computation'] |
Automatic speech processing | 0.2 | 1 | ['Automatic speech processing','Advanced machine learning','Applied machine learning','Biomicroscopy I','Biomicroscopy II','Image optics','Optical communications'] |
In the first three lines of the table, Applied Machine Learning was used as the course of interest. To tune the weights, both graph weights were first set to 1, then the weight of the enrolments graph was varied. Doing so changes the results only if the two graph weights differ.
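The manual tuning described above amounts to a small sweep over the enrolments weight. A sketch with hypothetical stand-ins for the graphs and for the diffusion-based recommender:

```python
import numpy as np

# Hypothetical stand-in for the diffusion-based recommender: here it simply
# ranks courses by total edge weight and returns the top_k node indices.
def recommend(graph, top_k):
    scores = graph.sum(axis=1)
    return list(np.argsort(scores)[::-1][:top_k])

rng = np.random.default_rng(1)
enrolments = rng.random((8, 8))    # toy enrolments graph
requirements = rng.random((8, 8))  # toy requirements graph

# Sweep the enrolments weight while keeping the requirements weight at 1,
# mirroring the manual tuning reported in the table above.
results = {}
for w in (0.15, 0.2, 0.3):
    combined = w * enrolments + 1.0 * requirements
    results[w] = recommend(combined, top_k=7)

assert set(results) == {0.15, 0.2, 0.3}
```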
For Applied Machine Learning:
Therefore, the best weight for the Applied Machine Learning course would seem to be 0.2. Testing this combination with multiple input courses and keeping only the top 7 outputs (the first two corresponding to the input courses):
Courses | Results |
---|---|
Applied Machine Learning + Space Mission Design and Operations | ['Space mission design and operations', 'Applied machine learning', 'Introduction to planetary sciences', 'Advanced lab in electrical engineering', 'Wave propagation along transmission lines', 'Lasers: theory and modern applications', 'Mécanique vibratoire'] |
Nanoelectronics + Analog Circuit Design I | ['Analog circuits design I', 'Nanoelectronics', 'HF and VHF circuits and techniques I', 'Lab in EDA based design', 'Lab in acoustics', 'Analog circuits design II', 'Industrial electronics I'] |
Image Optics + Laser Microprocessing | ['Image optics','Biomicroscopy II','Biomicroscopy I','Laser microprocessing','Capteurs','Image processing II','Composites technology'] |
Generally speaking, the recommendation system seems to perform well. There are of course certain adjustments to be made, but the courses are for the most part relevant and could indeed be taken by a student with these topics of interest.
It is interesting to compare the results obtained with spectral clustering to those obtained by diffusion. To do so, the following table summarizes the results found with every method on the student graph and the combined graph for the course Automatic Speech Processing:
Method used for the course Automatic Speech Processing | Results |
---|---|
K-means on the student graph | ['Automatic speech processing', 'Biomedical signal processing', 'Microwaves', 'Systems and architectures for signal processing', 'Speech processing', 'Image analysis and pattern recognition', 'Lab in microwaves', 'Propagation of acoustic waves', 'Photonic systems and technology', 'Lab in signal and image processing', 'Wireless receivers: algorithms and architectures', 'Image and video processing', 'Image communication', 'Optical communications', 'Rayonnement et antennes', 'Lab in acoustics', 'Brain computer interaction', 'Lab on app development for tablets and smartphones', 'Social media', 'Audio', 'Mathematics of data: from theory to computation', 'Media security'] |
GMM soft-clustering on the student graph | ['Automatic speech processing', 'Image processing I', 'Image processing II', 'Model predictive control', 'Transducteurs et entraînements intégrés', 'Mobile robots', "Commande d'actionneurs à l'aide d'un microprocesseur + TP", 'Signal processing for functional brain imaging', 'Robotics practicals', 'Haptic human robot interfaces', 'Advanced machine learning', 'In Silico neuroscience', 'Scientific literature analysis in Neuroscience'] |
GMM hard-clustering on the student graph | ['Automatic speech processing', 'Image processing I', 'Image processing II', 'Sensors in medical instrumentation', 'Applied machine learning', 'Biomechanics of the cardiovascular system', 'Biomaterials (pour MX)', 'Biomicroscopy II', 'Fundamentals of biosensors and electronic biochips', 'Flexible bioelectronics', 'Biomechanics of the musculoskeletal system', 'Biomicroscopy I', 'Seminar in physiology and instrumentation', 'BioMEMS', 'Biomaterials', 'Advanced nanomaterials', 'Space mission design and operations', 'Biomedical optics', 'Composites technology', 'Numerical methods in biomechanics', 'Life cycle engineering of polymers', 'Tribology', 'Electrochemistry for materials technology', 'Thin film fabrication processes', 'Powder technology', 'Electron microscopy: advanced methods', 'Cementitious materials (advanced)'] |
K-means on the combined graph | ['Automatic speech processing', 'Biomedical signal processing', 'Microwaves', 'Systems and architectures for signal processing', 'Wireless receivers: algorithms and architectures', 'Image analysis and pattern recognition', 'Lab in microwaves', 'Speech processing', 'Photonic systems and technology', 'Mathematics of data: from theory to computation', 'Lab in signal and image processing', 'Image and video processing', 'Optical communications', 'Image communication', 'Social media', 'Rayonnement et antennes', 'Lab in acoustics', 'Media security', 'Brain computer interaction', 'Audio'] |
GMM soft-clustering on the combined graph | ['Automatic speech processing', 'Biomedical signal processing', 'Image analysis and pattern recognition', 'Systems and architectures for signal processing', 'Microwaves', 'Speech processing', 'Photonic systems and technology', 'Lab in signal and image processing', 'Image and video processing', 'Lab in microwaves', 'Image communication', 'Wireless receivers: algorithms and architectures', 'Rayonnement et antennes', 'Optical communications', 'Mathematics of data: from theory to computation', 'Media security', 'Lab in acoustics', 'Social media'] |
GMM hard-clustering on the combined graph | ['Automatic speech processing', 'Biomedical signal processing', 'Image analysis and pattern recognition', 'Systems and architectures for signal processing', 'Microwaves', 'Speech processing', 'Photonic systems and technology', 'Lab in signal and image processing', 'Image and video processing', 'Lab in microwaves', 'Image communication', 'Wireless receivers: algorithms and architectures', 'Rayonnement et antennes', 'Optical communications', 'Mathematics of data: from theory to computation', 'Brain computer interaction', 'Media security', 'Lab in acoustics', 'Social media', 'Audio'] |
Diffusion on the student graph | ['Automatic speech processing','Image and video processing','Sensors in medical instrumentation','Lessons learned from the space exploration','Bio-nano-chip design','Introduction to VLSI Design','Lab in microwaves'] |
Diffusion on the combined graph | ['Automatic speech processing','Advanced machine learning','Applied machine learning','Biomicroscopy I','Biomicroscopy II','Image optics','Optical communications'] |
Comparing the spectral clustering methods with diffusion, the proposed courses are not the same. However, they remain in the same field, in the sense that most of them, like Automatic Speech Processing itself, are given by the electrical engineering section. It is not really possible to conclude that one method is better than another, as there is no information on how students liked the different courses; there is no way to determine which courses should appear at the top of the recommendation list for Automatic Speech Processing. It is also interesting to analyze the recommendation results for multiple input courses. The following table presents them for the courses Nanoelectronics and Analog circuits design I:
Method used for the course Nanoelectronics + Analog circuits design I | Results |
---|---|
K-means on the student graph | ['Nanoelectronics', 'Analog circuits design I', 'Fundamentals of VLSI design', 'Lab in EDA based design', 'Test of VLSI systems', 'HF and VHF circuits and techniques I', 'Modeling of emerging electron devices', 'Physical models for micro and nanosystems', 'Hardware systems modeling I', 'Hardware systems modeling', 'Advanced analog and RF integrated circuits design I', 'Introduction to VLSI Design', 'Integrated circuits technology', 'Analog circuits design II', 'Bio-nano-chip design', 'Analog circuits for biochip', 'Advanced VLSI design', 'Advanced analog and RF integrated circuits design II', 'Lab in microelectronics', 'HF and VHF circuits and techniques II', 'Advanced wireless communications: algorithms and architectures', 'Electrical filters', 'Advanced lab in electrical engineering'] |
GMM soft-clustering on the student graph | ['Nanoelectronics', 'Analog circuits design I', 'Test of VLSI systems', 'Lab in EDA based design', 'HF and VHF circuits and techniques I', 'Fundamentals of VLSI design', 'Modeling of emerging electron devices', 'Physical models for micro and nanosystems', 'Integrated circuits technology', 'Hardware systems modeling', 'Lab in microelectronics', 'Introduction to VLSI Design', 'Optical detectors'] |
GMM hard-clustering on the student graph | ['Nanoelectronics', 'Analog circuits design I', 'Test of VLSI systems', 'Lab in EDA based design', 'HF and VHF circuits and techniques I', 'Fundamentals of VLSI design', 'Modeling of emerging electron devices', 'Physical models for micro and nanosystems', 'Integrated circuits technology', 'Hardware systems modeling', 'Lab in microelectronics', 'Introduction to VLSI Design', 'Optical detectors'] |
K-means on the combined graph | ['Nanoelectronics', 'Analog circuits design I', 'Fundamentals of VLSI design', 'HF and VHF circuits and techniques I', 'Test of VLSI systems', 'Modeling of emerging electron devices', 'Advanced analog and RF integrated circuits design I', 'Hardware systems modeling I', 'Introduction to VLSI Design', 'Hardware systems modeling', 'Lab in EDA based design', 'Integrated circuits technology', 'Bio-nano-chip design', 'Lab in microelectronics', 'Analog circuits design II', 'Advanced analog and RF integrated circuits design II', 'Analog circuits for biochip', 'HF and VHF circuits and techniques II', 'Lasers: theory and modern applications', 'Advanced wireless communications: algorithms and architectures', 'Electrical filters'] |
GMM soft-clustering on the combined graph | ['Nanoelectronics', 'Analog circuits design I', 'Fundamentals of VLSI design', 'HF and VHF circuits and techniques I', 'Test of VLSI systems', 'Modeling of emerging electron devices', 'Advanced analog and RF integrated circuits design I', 'Hardware systems modeling I', 'Hardware systems modeling', 'Introduction to VLSI Design', 'Integrated circuits technology', 'Bio-nano-chip design', 'Analog circuits design II', 'Lab in microelectronics', 'Lab in EDA based design', 'Advanced analog and RF integrated circuits design II', 'Analog circuits for biochip', 'HF and VHF circuits and techniques II', 'Advanced wireless communications: algorithms and architectures', 'Electrical filters', 'Physical models for micro and nanosystems', 'Propagation of acoustic waves', 'Advanced lab in electrical engineering', 'Advanced VLSI design'] |
GMM hard-clustering on the combined graph | ['Nanoelectronics', 'Analog circuits design I', 'Fundamentals of VLSI design', 'HF and VHF circuits and techniques I', 'Test of VLSI systems', 'Modeling of emerging electron devices', 'Advanced analog and RF integrated circuits design I', 'Hardware systems modeling I', 'Hardware systems modeling', 'Introduction to VLSI Design', 'Integrated circuits technology', 'Bio-nano-chip design', 'Analog circuits design II', 'Lab in microelectronics', 'Lab in EDA based design', 'Advanced analog and RF integrated circuits design II', 'Analog circuits for biochip', 'HF and VHF circuits and techniques II', 'Advanced wireless communications: algorithms and architectures', 'Lasers: theory and modern applications', 'Electrical filters'] |
Diffusion on the student graph | ['Analog circuits design I','Nanoelectronics','Lab in EDA based design','Lab in acoustics','Image analysis and pattern recognition','Biomedical optics','Pharmaceutical biotechnology'] |
Diffusion on the combined graph | ['Analog circuits design I','Nanoelectronics','HF and VHF circuits and techniques I','Lab in EDA based design','Lab in acoustics','Analog circuits design II','Industrial electronics I'] |
The analysis of this table is similar to the previous one: the recommended courses are electrical engineering courses, which is a good sign, but it is not possible to determine precisely which method is best. One important point is that there is a risk of overfitting the recommendations to the sections and their study plans, which would lead to recommending courses that are not very "original".
Given that there was no way of doing supervised learning on the graph, the results could not be objectively assessed against a ground truth or an objective metric. Personal knowledge of the courses within the STI faculty, more precisely within the Micro-engineering and Electrical engineering curricula, was used to judge the quality of the results, which were generally satisfactory. There are nonetheless improvements that could be made to the graphs that were constructed:
Generally, the recommendation system would have benefitted greatly from knowledge of the students' preferences. Ideally, one would have had course rankings or course evaluations, in order to know which courses students liked, recommend them to others, and avoid the bias caused by the mandatory courses in each curriculum. This was unfortunately not possible: there was no way to access the course evaluations, the evaluations are anonymous and therefore cannot be linked to a given student, and students do not always participate or take them very seriously. That is why it would be interesting to take the project further and incorporate a form of reinforcement learning, so that the recommender system may take in feedback from users and use it to improve its outputs.
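One very simple form such feedback could take is an online update of the graph weights: if a user marks a recommendation as good or bad, nudge the weight of each graph according to how strongly that graph connected the input course to the recommended one. A toy sketch (all names and values here are hypothetical):

```python
def update_weights(graph_weights, edge_strengths, liked, lr=0.1):
    """Nudge each graph's weight up if the user liked the recommendation,
    down otherwise, proportionally to that graph's edge strength between
    the input course and the recommended course. Weights stay non-negative."""
    sign = 1.0 if liked else -1.0
    return [max(0.0, w + sign * lr * s)
            for w, s in zip(graph_weights, edge_strengths)]

# Example: the enrolments graph linked the course pair strongly (0.8),
# the requirements graph only weakly (0.1), and the user liked the result.
weights = [0.2, 1.0]
weights = update_weights(weights, [0.8, 0.1], liked=True)
assert weights[0] > 0.2 and weights[1] > 1.0  # both weights nudged upward
```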