Recommenders have been around since at least 1992. Today we see many different flavours of recommenders deployed across different verticals.
What exactly do they do?
In a typical recommender system people provide recommendations as inputs, which the system then aggregates and directs to appropriate recipients. -- Resnick and Varian, 1997
Collaborative filtering simply means that people collaborate to help one another perform filtering by recording their reactions to documents they read. -- Goldberg et al., 1992
In its most common formulation, the recommendation problem is reduced to the problem of estimating ratings for the items that have not been seen by a user. Intuitively, this estimation is usually based on the ratings given by this user to other items and on some other information [...] Once we can estimate ratings for the yet unrated items, we can recommend to the user the item(s) with the highest estimated rating(s). -- Adomavicius and Tuzhilin, 2005
Driven by computer algorithms, recommenders help consumers by selecting products they will probably like and might buy based on their browsing, searches, purchases, and preferences. -- Konstan and Riedl, 2012
The recommendation problem in its most basic form is quite simple to define:
| user \ movie | m_1 | m_2 | m_3 | m_4 | m_5 |
|--------------|-----|-----|-----|-----|-----|
| u_1          |  ?  |  ?  |  4  |  ?  |  1  |
| u_2          |  3  |  ?  |  ?  |  2  |  2  |
| u_3          |  3  |  ?  |  ?  |  ?  |  ?  |
| u_4          |  ?  |  1  |  2  |  1  |  1  |
| u_5          |  ?  |  ?  |  ?  |  ?  |  ?  |
| u_6          |  2  |  ?  |  2  |  ?  |  ?  |
| u_7          |  ?  |  ?  |  ?  |  ?  |  ?  |
| u_8          |  3  |  1  |  5  |  ?  |  ?  |
| u_9          |  ?  |  ?  |  ?  |  ?  |  2  |
Given a partially filled ratings matrix of size $|U| \times |I|$, estimate the missing values.
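To make this concrete, here is a minimal sketch (not part of the original notebook; it uses a made-up toy matrix mirroring the table above) of the matrix-completion view: missing entries are NaN, a naive baseline fills each one with that user's mean rating, and we recommend the unseen item with the highest estimate.
In [ ]:
import numpy as np
import pandas as pd

# Toy user x movie ratings; NaN marks an unobserved rating (hypothetical data)
toy = pd.DataFrame(
    [[np.nan, np.nan, 4, np.nan, 1],
     [3, np.nan, np.nan, 2, 2],
     [3, np.nan, np.nan, np.nan, np.nan]],
    index=['u_1', 'u_2', 'u_3'],
    columns=['m_1', 'm_2', 'm_3', 'm_4', 'm_5'])

# Naive baseline: estimate every missing rating with that user's mean rating
estimates = toy.apply(lambda row: row.fillna(row.mean()), axis=1)

# Recommend, per user, the unseen item with the highest estimated rating
best_unseen = estimates.where(toy.isnull()).idxmax(axis=1)
best_unseen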
Content-based techniques are limited by the amount of metadata available to describe an item. There are domains in which feature extraction is expensive or time-consuming, e.g., processing multimedia data such as graphics or audio/video streams. In the context of grocery items, for example, item information is often partial or missing entirely.
This is the new user problem: a user has to have rated a sufficient number of items before a recommender system can form a good idea of their preferences. In a content-based system, the aggregation function needs ratings to aggregate.
Collaborative filters rely on an item being rated by many users to compute aggregates of those ratings, so a new item cannot be recommended until enough users have rated it. Think of this new item problem as the exact counterpart of the new user problem for content-based systems.
When looking at the more general versions of content-based and collaborative systems, the success of the recommender system depends on the availability of a critical mass of user/item interactions. We get a first glance at the data sparsity problem by quantifying the ratio of observed ratings to $|U| \times |I|$. A highly sparse matrix of interactions makes it difficult to compute similarities between users and items. As an example, for a user whose tastes are unusual compared to the rest of the population, there will not be any other users who are particularly similar, leading to poor recommendations.
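As a quick sketch (not from the original notebook; it assumes a long-format frame of observed ratings with user and item id columns, like the ratings frame loaded below), sparsity can be quantified like this:
In [ ]:
def sparsity(ratings, user_col='user_id', item_col='course_id'):
    """Fraction of the |U| x |I| matrix with no observed rating."""
    n_users = ratings[user_col].nunique()
    n_items = ratings[item_col].nunique()
    return 1.0 - len(ratings) / float(n_users * n_items)

# e.g. sparsity(ratings), once the ratings frame is loaded below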
In [4]:
from IPython.display import Image
# display a diagram of a typical recommender system architecture
Image(filename='/Users/chengjun/GitHub/cjc2016/figure/recsys_arch.png')
Out[4]:
In [5]:
import pandas as pd
unames = ['user_id', 'username']
users = pd.read_table('/Users/chengjun/GitHub/cjc2016/data/users_set.dat',
                      sep='|', header=None, names=unames)

rnames = ['user_id', 'course_id', 'rating']
ratings = pd.read_table('/Users/chengjun/GitHub/cjc2016/data/ratings.dat',
                        sep='|', header=None, names=rnames)

mnames = ['course_id', 'title', 'avg_rating', 'workload', 'university', 'difficulty', 'provider']
courses = pd.read_table('/Users/chengjun/GitHub/cjc2016/data/cursos.dat',
                        sep='|', header=None, names=mnames)

# show how the ratings frame looks
ratings.head(10)
Out[5]:
In [6]:
# show the first five users
users[:5]
Out[6]:
In [7]:
courses[:5]
Out[7]:
Using pd.merge, we get it all into one big DataFrame.
In [8]:
coursetalk = pd.merge(pd.merge(ratings, courses), users)
coursetalk
Out[8]:
In [9]:
coursetalk.iloc[0]
Out[9]:
To get mean course ratings grouped by the provider, we can use the pivot_table method:
In [14]:
from pandas import pivot_table
# inspect the attributes of the pivot_table function
dir(pivot_table)
Out[14]:
In [15]:
mean_ratings = pivot_table(coursetalk, values='rating', index='provider', aggfunc='mean')
mean_ratings['rating'].sort_values(ascending=False)
Out[15]:
Now let's filter down to courses that received at least 20 ratings (a completely arbitrary cutoff). To do this, I group the data by title and use size() to get a Series of group sizes:
In [16]:
ratings_by_title = coursetalk.groupby('title').size()
ratings_by_title[:10]
Out[16]:
In [17]:
active_titles = ratings_by_title.index[ratings_by_title >= 20]
active_titles[:10]
Out[17]:
The index of titles receiving at least 20 ratings can then be used to select rows from a per-title mean_ratings table:
In [18]:
mean_ratings = coursetalk.pivot_table(values='rating', index='title', aggfunc='mean')['rating']
mean_ratings
Out[18]:
Having computed the mean rating for each course, we can order them with the highest-rated listed first.
In [19]:
mean_ratings.loc[active_titles].sort_values(ascending=False)
Out[19]:
To see the top courses among Coursera students, we can sort by the 'coursera' column in descending order:
In [20]:
mean_ratings = coursetalk.pivot_table(values='rating', index='title', columns='provider', aggfunc='mean')
mean_ratings[:10]
Out[20]:
In [21]:
mean_ratings['coursera'].loc[active_titles].sort_values(ascending=False)[:10]
Out[21]:
Now, let's go further! How about ranking the courses by the percentage of their ratings that are 4 or higher?
Let's start with a simple pivoting example that does not involve any aggregation. We can extract a ratings matrix as follows:
In [23]:
# transform the ratings frame into a ratings matrix
ratings_mtx_df = coursetalk.pivot_table(values='rating',
                                        index='user_id',
                                        columns='title')
ratings_mtx_df.iloc[:15, :15]
Out[23]:
Let's extract only the ratings that are 4 or higher.
In [24]:
ratings_gte_4 = ratings_mtx_df[ratings_mtx_df >= 4.0]
# ratings below 4 become NaN, so count() below tallies only the 4+ ratings
ratings_gte_4.iloc[:15, :15]
Out[24]:
Now, taking the total number of ratings for each course and the count of ratings of 4 or higher, we can combine them into one DataFrame.
In [25]:
ratings_gte_4_pd = pd.DataFrame({'total': ratings_mtx_df.count(), 'gte_4': ratings_gte_4.count()})
ratings_gte_4_pd.head(10)
Out[25]:
In [26]:
ratings_gte_4_pd['gte_4_ratio'] = ratings_gte_4_pd['gte_4'] / ratings_gte_4_pd['total']
ratings_gte_4_pd.head(10)
Out[26]:
In [27]:
ranking = [(title, total, gte_4, score)
           for title, total, gte_4, score in ratings_gte_4_pd.itertuples()]
for title, total, gte_4, score in sorted(ranking, key=lambda x: (x[3], x[2], x[1]), reverse=True)[:10]:
    print(title, total, gte_4, score)
Now for something simpler: let's count the number of ratings for each course, and order by the most rated.
In [28]:
ratings_by_title = coursetalk.groupby('title').size()
ratings_by_title.sort_values(ascending=False)[:10]
Out[28]:
With this information, we can sort the most-rated courses by their percentage of 4+ ratings.
In [29]:
for title, total, gte_4, score in sorted(ranking, key=lambda x: (x[2], x[3], x[1]), reverse=True)[:10]:
    print(title, total, gte_4, score)
Finally, let's find the courses that most often co-occur with the popular MOOC An Introduction to Interactive Programming in Python. For each course, we calculate the percentage of the Python course's raters who also rated that course. Order with the highest percentage first, and voilà: we have the top MOOCs.
In [31]:
course_users = coursetalk.pivot_table(values='rating', index='title', columns='user_id')
course_users.iloc[:15, :15]
Out[31]:
First, let's get only the users that rated the course An Introduction to Interactive Programming in Python.
In [32]:
ratings_by_course = coursetalk[coursetalk.title == 'An Introduction to Interactive Programming in Python']
ratings_by_course.set_index('user_id', inplace=True)
Now, for all other courses, let's keep only the ratings from users that rated the Python course.
In [33]:
their_ids = ratings_by_course.index
their_ratings = course_users[their_ids]
their_ratings.iloc[:15, :15]
Out[33]:
Dividing the number of users who rated both the Python course and a given course by the total number of users who rated the Python course gives us our percentage.
In [34]:
course_count = their_ratings.loc['An Introduction to Interactive Programming in Python'].count()
# for each course, the fraction of Python-course raters who also rated it
sims = their_ratings.apply(lambda profile: profile.count() / float(course_count), axis=1)
Ordering by the score, highest first, and skipping the first entry, which is the course itself:
In [35]:
sims.sort_values(ascending=False)[1:][:10]
Out[35]: