In [ ]:
from pandas import Series, DataFrame
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
%pylab inline

Question 1: Read in data

Read in the data from "gold.txt" and "labels.txt".
Since the files have no header rows, the names parameter must be set explicitly.

Duplicate records in both dataframes are kept, because repeated tests on the same url provide more precise information about the turks' discernibility.


In [ ]:
gold = pd.read_table("gold.txt", names=["url", "category"]).dropna()
labels = pd.read_table("labels.txt", names=["turk", "url", "category"]).dropna()
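Since the real files are not shown here, a minimal sketch with an in-memory file illustrates the parsing behavior (the toy rows and the gold_toy name are illustrative, not the real data):

```python
import io
import pandas as pd

# read_table splits on tabs by default; with explicit names, every
# row is treated as data rather than as a header.
toy = io.StringIO("http://a.com\tG\nhttp://b.com\tP\n")
gold_toy = pd.read_table(toy, names=["url", "category"])
```

Without names, the first data row would silently be consumed as column headers.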

Question 2: Split into two DataFrames

To determine whether a url in labels is also in gold, make a list of the unique urls in gold, and map a lambda expression over the url series in labels.


In [ ]:
url_list = gold["url"].unique()
labels_on_gold = labels[labels["url"].map(lambda s: s in url_list)]
labels_unknown = labels[labels["url"].map(lambda s: s not in url_list)]
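An equivalent and typically faster approach uses Series.isin instead of mapping a lambda; a sketch with toy stand-ins for the real frames:

```python
import pandas as pd

# Toy stand-ins for the real "gold" and "labels" frames.
gold = pd.DataFrame({"url": ["a", "b"], "category": ["G", "P"]})
labels = pd.DataFrame({"turk": ["t1", "t2", "t3"],
                       "url": ["a", "c", "b"],
                       "category": ["G", "R", "P"]})

# isin builds the same boolean mask as the map/lambda version,
# but vectorized over the whole column.
on_gold_mask = labels["url"].isin(gold["url"])
labels_on_gold = labels[on_gold_mask]
labels_unknown = labels[~on_gold_mask]
```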

Question 3: Compute accuracies of turks

  1. Since the computation uses only urls in the "gold" set, the "labels_on_gold" dataframe is used instead of "labels".
  2. Merge "labels_on_gold" with "gold" on url.
  3. Create a new column correct in the merged dataframe, set to True where the turk's rating matches the true rating.
  4. Optional: drop the rating columns to reduce the size of the dataframe.
  5. groupby on turk and sum the True records in correct for each turk; the returned value is a series.
  6. value_counts on turk returns a series of total rating counts.
  7. Divide the two series to get each turk's rating accuracy.
  8. Create a new dataframe "rater_goodness" from the total rating count series and the accuracy series; the index defaults to turk.

In [ ]:
rater_merged = pd.merge(
                    labels_on_gold,
                    gold,
                    on="url",
                    suffixes=["_1", "_2"]
                )

rater_merged["correct"] = rater_merged["category_1"] == rater_merged["category_2"]
rater_merged = rater_merged[["turk", "correct"]]
correct_counts = rater_merged.groupby("turk")["correct"].sum()
total_counts = rater_merged["turk"].value_counts()
avg_correctness = correct_counts/total_counts
rater_goodness = pd.DataFrame({"number_of_ratings": total_counts, "average_correctness": avg_correctness})
rater_goodness[:10]
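Steps 5 through 8 can also be collapsed into a single named aggregation, since the mean of a boolean column is exactly the fraction of correct ratings; a sketch with a toy merged frame:

```python
import pandas as pd

# Toy merged frame: one row per (turk, rating) with a correctness flag.
rater_merged = pd.DataFrame({
    "turk": ["t1", "t1", "t2", "t2", "t2"],
    "correct": [True, False, True, True, False],
})

# agg computes both statistics in one groupby pass.
rater_goodness = (rater_merged.groupby("turk")["correct"]
                  .agg(number_of_ratings="count",
                       average_correctness="mean"))
```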

Question 4: Odds ratios

  1. Use the map function on average_correctness to compute $\frac{average\ correctness}{1 - average\ correctness}$.
  2. By definition the odds are float("inf") when average_correctness = 1; the code uses 1.001 - x in the denominator so perfect raters receive a large finite value instead of infinity.

In [ ]:
rater_goodness["odds"] = rater_goodness["average_correctness"].map(lambda x: x/(1.001-x))
rater_goodness[:20]
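If the exact definition is preferred over the 1.001 smoothing, np.where can keep the odds as p/(1-p) and assign infinity only at p = 1; a sketch on a toy series:

```python
import numpy as np
import pandas as pd

avg = pd.Series([0.5, 0.9, 1.0])

# Exact odds p/(1-p), with inf assigned explicitly at p == 1.
# errstate silences the harmless divide-by-zero warning from the
# branch that np.where evaluates at p == 1.
with np.errstate(divide="ignore"):
    odds = pd.Series(np.where(avg == 1.0, np.inf, avg / (1.0 - avg)),
                     index=avg.index)
```

The trade-off: infinite odds propagate through any later prod(), whereas the 1.001 offset keeps everything finite.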

Question 5: Most accurate turks

  1. Use rater_goodness["number_of_ratings"] >= 20 to select turks who rated at least 20 times.
  2. Sort by average_correctness in descending order.
  3. .index.values could be appended to return only the turks, but for readability it is not applied.

In [ ]:
rater_goodness[rater_goodness["number_of_ratings"]>=20].sort_values(by="average_correctness", ascending=False)[:10]
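The filter-then-sort pattern can also be expressed with nlargest, which sorts and truncates in one step; a sketch with a toy rater_goodness frame:

```python
import pandas as pd

rater_goodness = pd.DataFrame({
    "number_of_ratings": [25, 5, 40, 30],
    "average_correctness": [0.8, 1.0, 0.95, 0.6],
}, index=["t1", "t2", "t3", "t4"])

# nlargest sorts descending and keeps the top rows; t2 is excluded
# by the 20-rating minimum despite its perfect score.
top = (rater_goodness[rater_goodness["number_of_ratings"] >= 20]
       .nlargest(2, "average_correctness"))
```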

Question 6: Rating counts versus accuracy

Plotting average_correctness against number of ratings gives a general idea of the relationship between the two variables. However, the plot shows no clear pattern.


In [ ]:
plot(rater_goodness['number_of_ratings'],
     rater_goodness['average_correctness'],
     marker='o',
     color='blue',
     linestyle='None')
xlabel('number of ratings')
ylabel('average correctness')

To quantify the linear relationship between number of ratings and average correctness, a linear regression is fitted. From the model summary it is still difficult to establish a reliable linear relationship, since the coefficient on number of ratings is not significantly different from zero.

The statsmodels and patsy modules are imported for the regression.


In [ ]:
import statsmodels.api as sm
from patsy import dmatrices

y, X = dmatrices('average_correctness ~ number_of_ratings', data=rater_goodness, return_type='dataframe')
model = sm.OLS(y, X)
result = model.fit()
print(result.summary())
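The slope and intercept reported in the OLS summary can be cross-checked with a plain least-squares fit; a sketch using numpy only, on toy data with a known linear relationship:

```python
import numpy as np

# Toy data generated from y = 0.5 + 0.01 * x exactly.
x = np.array([10.0, 20.0, 30.0, 40.0])
y = 0.5 + 0.01 * x

# lstsq on the design matrix [1, x] recovers the same intercept
# and slope that sm.OLS(y, X).fit() reports in its summary.
X = np.column_stack([np.ones_like(x), x])
intercept, slope = np.linalg.lstsq(X, y, rcond=None)[0]
```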

Question 7: Overall predicted odds

  1. Define the cutpoint for the top 25% of turks in terms of number of ratings using quantile(q=.75).
  2. Make a series of rating counts indexed by turk.
  3. Build a mask with the map function to select records rated by top-25% turks.
  4. Apply the mask to "labels_unknown" to keep only the records rated by top-25% turks.
  5. Select the top-25% turks' rows from the "rater_goodness" dataframe.
    • First create a new turk column in "rater_goodness" from the index
    • Then apply the same top-25% mask on the turk column
  6. Merge the two dataframes on turk and drop missing values.
  7. groupby the resulting dataframe on url and category.
  8. Apply prod() on odds to compute the overall odds by url and category.

    Here odds is the "overall odds" as defined in the assignment description


In [ ]:
top_25_cutpoint = labels_on_gold["turk"].value_counts().quantile(q=.75)
turk_list = labels_on_gold["turk"].value_counts()

mask_1 = labels_unknown["turk"].map(lambda s: turk_list[s]>=top_25_cutpoint if s in turk_list else False)
labels_bytop25 = labels_unknown[mask_1]

rater_goodness["turk"] = rater_goodness.index

odds_top25 = rater_goodness[rater_goodness["turk"].map(lambda s: turk_list[s]>=top_25_cutpoint if s in turk_list else False)]

overall_odds = pd.merge(labels_bytop25,
                       odds_top25,
                       on="turk",
                       how="left").dropna()

overall_odds.groupby(["url", "category"])[["odds"]].prod()[:10]
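When many odds are multiplied, the product can overflow; summing log-odds and exponentiating gives the same result more stably. A sketch with a toy overall_odds frame:

```python
import numpy as np
import pandas as pd

overall_odds = pd.DataFrame({
    "url": ["u1", "u1", "u2"],
    "category": ["G", "G", "P"],
    "odds": [10.0, 100.0, 2.0],
})

# exp(sum(log(odds))) == prod(odds), but the intermediate values
# stay small even when thousands of large odds are combined.
log_prod = (overall_odds.assign(log_odds=np.log(overall_odds["odds"]))
            .groupby(["url", "category"])["log_odds"].sum())
overall = np.exp(log_prod)
```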

Question 8: Predicted categories

  1. Create a dataframe from the groupby object in the last question, containing url, category, and overall odds.
  2. Apply unstack to move category from the index into the columns.
  3. Transpose the dataframe and call idxmax() on each column, i.e. each url; the returned value is a series with url as index and tuples ("odds", category) as values.
  4. Create a dataframe from the returned series, and extract the category into a string column "top category" by selecting the second element of each tuple.
  5. Create a new "top odds" column for the dataframe by calling max() on the transposed dataframe from step 3.

In [ ]:
overall_odds_df = overall_odds.groupby(["url", "category"])[["odds"]].prod().unstack("category").T.fillna(0)
url_rating = pd.DataFrame(overall_odds_df.idxmax())
url_rating["top category"] = url_rating[0].map(lambda s: s[1])
url_rating = url_rating.set_index(url_rating.index.values)
url_rating["top odds"] = overall_odds_df.max()
url_rating = url_rating[["top category", "top odds"]]
url_rating[:10]
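The unstack/transpose dance can be avoided by taking idxmax within each url group on the long-format products directly; a sketch with a toy frame of per-(url, category) odds:

```python
import pandas as pd

prod_odds = pd.DataFrame({
    "url": ["u1", "u1", "u2", "u2"],
    "category": ["G", "P", "G", "R"],
    "odds": [5.0, 1.0, 2.0, 8.0],
})

# idxmax within each url group returns the row label of the largest
# odds, so .loc pulls out top category and top odds in one step.
best_rows = prod_odds.loc[prod_odds.groupby("url")["odds"].idxmax()]
url_rating = best_rows.set_index("url").rename(
    columns={"category": "top category", "odds": "top odds"})
```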

Question 9: Predicted categories using more turks

  1. Repeat Questions 7 and 8 to create a dataframe where urls are rated by the top 75% of turks.

    Here only the "top category" column is kept, renamed result_75

  2. Take the top category column from the Question 8 dataframe, rename it result_25, and make it a dataframe.
  3. Merge the two dataframes on index.
  4. Create a crosstab with the two columns as index and columns respectively.
  5. The crosstab shows that the most frequent disagreements are where the top 25% turks rated "G" but the top 75% turks rated "P" (836 occurrences), "G" versus "R" (285 occurrences), and "P" versus "G" (229 occurrences).

In [ ]:
top_75_cutpoint = labels_on_gold["turk"].value_counts().quantile(q=.25)

mask_2 = labels_unknown["turk"].map(lambda s: turk_list[s]>=top_75_cutpoint if s in turk_list else False)
labels_bytop75 = labels_unknown[mask_2]

odds_top75 = rater_goodness[rater_goodness["turk"].map(lambda s: turk_list[s]>=top_75_cutpoint if s in turk_list else False)]

overall_odds_75 = pd.merge(labels_bytop75,
                       odds_top75,
                       on="turk",
                       how="left").dropna()

overall_odds_df_75 = overall_odds_75.groupby(["url", "category"])[["odds"]].prod().unstack("category").T.fillna(0)

url_rating_75 = pd.DataFrame(overall_odds_df_75.idxmax())
url_rating_75["result_75"] = url_rating_75[0].map(lambda s: s[1])
url_rating_75 = pd.DataFrame(url_rating_75["result_75"])
url_rating_75 = url_rating_75.set_index(url_rating_75.index.values)

url_rating_25 = pd.DataFrame({"result_25": url_rating["top category"]})

url_rating_merged = pd.merge(url_rating_25,
                            url_rating_75,
                            left_index=True,
                            right_index=True,
                            ).dropna()

url_rating_crosstab = pd.crosstab(index=url_rating_merged["result_25"],
                                 columns=url_rating_merged["result_75"]
                                 )

url_rating_crosstab
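As a sanity check on the crosstab, the overall agreement rate between the two predictions can be computed directly, since the crosstab's diagonal holds the agreements; a sketch with toy data standing in for url_rating_merged:

```python
import pandas as pd

url_rating_merged = pd.DataFrame({
    "result_25": ["G", "G", "P", "R"],
    "result_75": ["G", "P", "P", "G"],
})

# Comparing the two columns elementwise and taking the mean gives
# the fraction of urls where both turk pools agree.
agreement_rate = (url_rating_merged["result_25"]
                  == url_rating_merged["result_75"]).mean()
```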