In [1]:
    
from pandas import Series, DataFrame
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
%pylab inline
    
    
Read in the data from "gold.txt" and "labels.txt".
Since there are no headers in the files, names parameter should be set explicitly.
Duplicate records in both dataframes are kept, since repeated tests on the same url provide more precise information about the turks' discernibility.
In [2]:
    
gold = pd.read_table("gold.txt", names=["url", "category"]).dropna()
labels = pd.read_table("labels.txt", names=["turk", "url", "category"]).dropna()
    
To determine whether a url in labels is also in gold, make a list of the unique urls in gold, and map a lambda expression over the url series in labels.
In [3]:
    
url_list = gold["url"].unique()
labels_on_gold = labels[labels["url"].map(lambda s: s in url_list)]
labels_unknown = labels[labels["url"].map(lambda s: s not in url_list)]
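The same boolean masks can also be built with pandas' vectorized `Series.isin`, which avoids the Python-level lambda. A minimal sketch on hypothetical toy data (the column names mirror those above):

```python
import pandas as pd

# Toy stand-ins for the gold and labels frames (hypothetical data)
gold_toy = pd.DataFrame({"url": ["a", "b"],
                         "category": ["G", "P"]})
labels_toy = pd.DataFrame({"turk": ["t1", "t2", "t3"],
                           "url": ["a", "c", "b"],
                           "category": ["G", "P", "G"]})

known = gold_toy["url"].unique()
# Vectorized membership test, equivalent to map(lambda s: s in known)
mask = labels_toy["url"].isin(known)
print(labels_toy[mask])    # rows for urls "a" and "b"
print(labels_toy[~mask])   # row for url "c"
```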
    
url, "labels_on_gold" dataframe is used instead of "labels"url.correct in the new dataframe, and assign True where the "turk" rating is the same with the true rating.groupby on turk, and sum up the True records on correct for each turk, the returned value is a seriesvalue_counts on turk, a series of total rating numbers is returned.turkturk
In [4]:
    
rater_merged = pd.merge(
                    labels_on_gold,
                    gold,
                    left_on="url",
                    right_on="url",
                    suffixes=["_1", "_2"]
                )
rater_merged["correct"] = rater_merged["category_1"] == rater_merged["category_2"]
rater_merged = rater_merged[["turk", "correct"]]
correct_counts = rater_merged.groupby("turk")["correct"].sum()
total_counts = rater_merged["turk"].value_counts()
avg_correctness = correct_counts/total_counts
rater_goodness = pd.DataFrame({"number_of_ratings": total_counts, "average_correctness": avg_correctness})
rater_goodness[:10]
    
    Out[4]:
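The two-step computation above (sum of True values divided by value_counts) is equivalent to taking the group mean of the boolean column. A quick check on hypothetical toy data:

```python
import pandas as pd

# Hypothetical records: t1 is right 2 of 3 times, t2 is right 1 of 1
toy = pd.DataFrame({"turk": ["t1", "t1", "t1", "t2"],
                    "correct": [True, True, False, True]})

correct_counts = toy.groupby("turk")["correct"].sum()
total_counts = toy["turk"].value_counts()
avg = correct_counts / total_counts

# groupby mean computes the same ratio in a single step
avg_direct = toy.groupby("turk")["correct"].mean()
print(avg.sort_index())
print(avg_direct.sort_index())
```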
Map a lambda expression on average_correctness to get the odds $\frac{average\ correctness}{1 - average\ correctness}$. Where average_correctness = 1, the exact ratio should be assigned float("inf"); using 1.001 in the denominator instead keeps the odds finite for perfect raters while preserving the ordering.
In [5]:
    
rater_goodness["odds"] = rater_goodness["average_correctness"].map(lambda x: x/(1.001-x))
rater_goodness[:20]
    
    Out[5]:
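The effect of the 1.001 denominator can be seen by evaluating the mapping at a few toy values:

```python
# The odds mapping used above, with 1.001 in the denominator
odds = lambda x: x / (1.001 - x)

for x in [0.0, 0.5, 0.9, 1.0]:
    print(x, odds(x))
# A perfect rater (x = 1.0) gets odds of about 1000 rather than
# a division-by-zero error, so the later product of odds stays finite
```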
rater_goodness["number of ratings"]>=20 to select turks who rated at least 20 times.average_correctness in descending order..index.values is optional to return only turks, but for aesthetic reasons it is not applied.
In [6]:
    
rater_goodness[rater_goodness["number_of_ratings"]>=20].sort_values(by="average_correctness", ascending=False)[:10]
    
    Out[6]:
Plotting average_correctness against number of ratings makes it easier to get a general idea of the relationship between the two variables. However, from the plot, it is difficult to identify a clear pattern.
In [7]:
    
plot(rater_goodness['number_of_ratings'],
     rater_goodness['average_correctness'],
     marker='o',
     color='blue',
     linestyle='None')
xlabel('number of ratings')
ylabel('average correctness')
    
    Out[7]:
    
To quantitatively measure the linear correlation between number of ratings and average correctness, a linear regression is fitted.
From the model summary, it is still difficult to establish a reliable linear correlation between the two variables, since the coefficient on number of ratings is not significantly different from zero.
The statsmodels and patsy modules are imported for the linear regression.
In [8]:
    
import statsmodels.api as sm
from patsy import dmatrices
y, X = dmatrices('average_correctness ~ number_of_ratings', data=rater_goodness, return_type='dataframe')
model = sm.OLS(y, X)
result = model.fit()
print(result.summary())
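The "slope near zero" situation the summary reports can be illustrated by hand with plain numpy least squares. This is a minimal sketch on hypothetical data, not the notebook's dataframe: y is constructed to have no real dependence on x, so the fitted slope should hover near zero.

```python
import numpy as np

# Hypothetical data mirroring the situation above: y does not
# actually depend on x, so the fitted slope should be near zero
rng = np.random.RandomState(0)
x = rng.uniform(1, 100, size=50)
y = 0.7 + rng.normal(scale=0.1, size=50)

# OLS fit of y ~ x via least squares (intercept column plus x)
X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
intercept, slope = beta
print(intercept, slope)
```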
    
    
Use quantile(q=.75) on the turks' rating counts to find the cutpoint for the top 25% most prolific turks, and use a map function to keep only their ratings of unknown urls. Copy the turk index of "rater_goodness" into a turk column so the odds can be merged onto those ratings. groupby the resulting dataframe on url and category, and apply .prod() on odds to calculate the overall odds by url and category. Here odds is the "overall odds" as defined in the assignment description.
In [9]:
    
top_25_cutpoint = labels_on_gold["turk"].value_counts().quantile(q=.75)
turk_list = labels_on_gold["turk"].value_counts()
mask_1 = labels_unknown["turk"].map(lambda s: turk_list[s]>=top_25_cutpoint if s in turk_list else False)
labels_bytop25 = labels_unknown[mask_1]
rater_goodness["turk"] = rater_goodness.index
odds_top25 = rater_goodness[rater_goodness["turk"].map(lambda s: turk_list[s]>=top_25_cutpoint if s in turk_list else False)]
overall_odds = pd.merge(labels_bytop25,
                       odds_top25,
                       left_on="turk",
                       right_on="turk",
                       how="left").dropna()
overall_odds.groupby(["url", "category"])[["odds"]].prod()[:10]
    
    Out[9]:
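The product aggregation combines each rater's odds into a single score per (url, category) pair. A toy illustration of `.prod()` under groupby, with hypothetical odds values:

```python
import pandas as pd

# Two turks rated url "x" as "G" with odds 4 and 2;
# one turk rated it "P" with odds 3 (hypothetical values)
toy = pd.DataFrame({"url": ["x", "x", "x"],
                    "category": ["G", "G", "P"],
                    "odds": [4.0, 2.0, 3.0]})

overall = toy.groupby(["url", "category"])["odds"].prod()
print(overall)
# ("x", "G") -> 8.0 and ("x", "P") -> 3.0, so "G" wins for this url
```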
Start from the groupby result in the last question, containing url, category and overall odds. Use unstack to break category out of the index into columns, and transpose so that each url becomes a column. Apply idxmax() on all columns, i.e. urls; the returned value is a series with url as index and ("odds", category) tuples as values. Apply max() on the transposed dataframe to get the corresponding top odds.
In [10]:
    
overall_odds_df = overall_odds.groupby(["url", "category"])[["odds"]].prod().unstack("category").T.fillna(0)
url_rating = pd.DataFrame(overall_odds_df.idxmax())
url_rating["top category"] = url_rating[0].map(lambda s: s[1])
url_rating = url_rating.set_index(url_rating.index.values)
url_rating["top odds"] = overall_odds_df.max()
url_rating = url_rating[["top category", "top odds"]]
url_rating[:10]
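The unstack/transpose/idxmax chain can be seen on a small example, simplified here to a plain Series (hypothetical odds) rather than the [["odds"]] column frame above: after unstacking and transposing, each url is a column, and idxmax picks the category with the largest odds.

```python
import pandas as pd

# Hypothetical overall odds per (url, category)
overall = pd.Series(
    [8.0, 3.0, 5.0],
    index=pd.MultiIndex.from_tuples(
        [("x", "G"), ("x", "P"), ("y", "P")],
        names=["url", "category"]),
    name="odds")

wide = overall.unstack("category").T.fillna(0)  # rows: category, cols: url
top_category = wide.idxmax()   # best category per url
top_odds = wide.max()          # its odds
print(top_category["x"], top_odds["x"])  # G 8.0
```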
    
    
Repeat the steps of the previous question, but this time urls are rated by the top 75% of turks. Here only the "top category" column is kept and named result_75.
Take the top category column from the dataframe from Question 8, rename it result_25, and make it a dataframe. Merge the two results on url, then build a crosstab with the two columns as index and columns respectively.
In [ ]:
    
top_75_cutpoint = labels_on_gold["turk"].value_counts().quantile(q=.25)
mask_2 = labels_unknown["turk"].map(lambda s: turk_list[s]>=top_75_cutpoint if s in turk_list else False)
labels_bytop75 = labels_unknown[mask_2]
odds_top75 = rater_goodness[rater_goodness["turk"].map(lambda s: turk_list[s]>=top_75_cutpoint if s in turk_list else False)]
overall_odds_75 = pd.merge(labels_bytop75,
                       odds_top75,
                       left_on="turk",
                       right_on="turk",
                       how="left").dropna()
overall_odds_df_75 = overall_odds_75.groupby(["url", "category"])[["odds"]].prod().unstack("category").T.fillna(0)
url_rating_75 = pd.DataFrame(overall_odds_df_75.idxmax())
url_rating_75["result_75"] = url_rating_75[0].map(lambda s: s[1])
url_rating_75 = pd.DataFrame(url_rating_75["result_75"])
url_rating_75 = url_rating_75.set_index(url_rating_75.index.values)
url_rating_25 = pd.DataFrame({"result_25": url_rating["top category"]})
url_rating_merged = pd.merge(url_rating_25,
                            url_rating_75,
                            left_index=True,
                            right_index=True,
                            ).dropna()
url_rating_crosstab = pd.crosstab(index=url_rating_merged["result_25"],
                                 columns=url_rating_merged["result_75"]
                                 )
url_rating_crosstab
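pd.crosstab tabulates how often the two sets of raters reach the same verdict per url; the diagonal of the table counts agreements. A toy sketch with hypothetical verdicts:

```python
import pandas as pd

# Hypothetical verdicts for four urls from the two rater pools
merged = pd.DataFrame({"result_25": ["G", "G", "P", "G"],
                       "result_75": ["G", "P", "P", "G"]})

ct = pd.crosstab(index=merged["result_25"], columns=merged["result_75"])
print(ct)
# The diagonal counts urls where the top-25% and top-75%
# raters agree: here 2 agreements on "G" and 1 on "P"
```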