In [ ]:
from pandas import Series, DataFrame
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
%pylab inline
Read in the data from "gold.txt" and "labels.txt".
Since the files have no header row, the names parameter must be set explicitly.
Duplicate records in both dataframes are kept, because repeated ratings of the same url give more precise information about the turks' discernibility. Rows with missing values are dropped.
In [ ]:
gold = pd.read_table("gold.txt", names=["url", "category"]).dropna()
labels = pd.read_table("labels.txt", names=["turk", "url", "category"]).dropna()
To determine whether a url in labels is also in gold, make a list of the unique url values in gold, and map a lambda expression that tests membership over the url series in labels.
In [ ]:
url_list = gold["url"].unique()
labels_on_gold = labels[labels["url"].map(lambda s: s in url_list)]
labels_unknown = labels[labels["url"].map(lambda s: s not in url_list)]
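The same membership test can also be written with Series.isin, which avoids scanning the array once per row. A minimal sketch on made-up data; the toy frames and their values are assumptions for illustration, not taken from the files:

```python
import pandas as pd

# Toy stand-ins for the real data; all values are made up.
gold_toy = pd.DataFrame({"url": ["a.com", "b.com"], "category": ["G", "X"]})
labels_toy = pd.DataFrame({
    "turk": ["t1", "t1", "t2"],
    "url": ["a.com", "c.com", "b.com"],
    "category": ["G", "G", "X"],
})

# Boolean mask: True where the rated url also appears in gold.
in_gold = labels_toy["url"].isin(gold_toy["url"].unique())

on_gold_toy = labels_toy[in_gold]    # ratings on urls with a known category
unknown_toy = labels_toy[~in_gold]   # ratings on urls without one
```

Negating the mask with ~ yields the complement, so the two subsets always partition the ratings.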
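1. Because true categories are known only for the url in gold, the "labels_on_gold" dataframe is used instead of "labels".
2. Merge "labels_on_gold" with "gold" on url.
3. Create a new column correct in the new dataframe, and assign True where the turk's rating is the same as the true rating.
4. Apply groupby on turk and sum up the True records in correct for each turk; the returned value is a series.
5. Apply value_counts on turk; a series of total rating numbers is returned.
6. Divide the first series by the second to get the average correctness of each turk.
7. Combine both series into the "rater_goodness" dataframe, indexed by turk.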
In [ ]:
rater_merged = pd.merge(
labels_on_gold,
gold,
left_on="url",
right_on="url",
suffixes=["_1", "_2"]
)
rater_merged["correct"] = rater_merged["category_1"] == rater_merged["category_2"]
rater_merged = rater_merged[["turk", "correct"]]
correct_counts = rater_merged.groupby("turk")["correct"].sum()
total_counts = rater_merged["turk"].value_counts()
avg_correctness = correct_counts/total_counts
rater_goodness = pd.DataFrame({"number_of_ratings": total_counts, "average_correctness": avg_correctness})
rater_goodness[:10]
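Since the mean of a boolean column is the share of True values, the count and the average correctness can also be obtained in one aggregation. A minimal sketch on made-up data; merged_toy and goodness_toy are hypothetical names, not part of the notebook:

```python
import pandas as pd

# Toy stand-in for the merged ratings; values are made up.
merged_toy = pd.DataFrame({
    "turk": ["t1", "t1", "t2", "t2", "t2"],
    "correct": [True, False, True, True, True],
})

# "count" gives the number of ratings, "mean" the fraction of correct ones.
goodness_toy = merged_toy.groupby("turk")["correct"].agg(["count", "mean"])
goodness_toy.columns = ["number_of_ratings", "average_correctness"]
```

This should give the same two columns as dividing the sum by the value counts, because both are aligned on turk.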
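Apply map on average_correctness to get the odds $\frac{average\ correctness}{1 - average\ correctness}$.
When average_correctness = 1, this ratio would be float("inf"), so the code below uses 1.001 - average_correctness as the denominator to keep the odds finite.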
In [ ]:
rater_goodness["odds"] = rater_goodness["average_correctness"].map(lambda x: x/(1.001-x))
rater_goodness[:20]
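To see what the 1.001 in the denominator does, the mapping can be pulled out into a small function. This is only a sketch; the function name odds and the parameter cap are hypothetical and not part of the notebook:

```python
def odds(p, cap=1.001):
    """Return p / (cap - p).

    A cap slightly above 1 keeps the result finite when p == 1.
    """
    return p / (cap - p)


print(odds(0.5))   # close to 1: correct and incorrect are about equally likely
print(odds(1.0))   # large but finite instead of a division by zero
```

With the default cap, a perfect rater gets odds of roughly 1000 rather than infinity, which keeps later products of odds well defined.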
1. Use rater_goodness["number_of_ratings"]>=20 to select turks who rated at least 20 times.
2. Sort by average_correctness in descending order.
3. Appending .index.values is optional to return only the turks, but for aesthetic reasons it is not applied.
In [ ]:
rater_goodness[rater_goodness["number_of_ratings"]>=20].sort_values(by="average_correctness", ascending=False)[:10]
Plotting average_correctness against number of ratings gives a general idea of the relationship between the two variables. However, it is difficult to identify a clear pattern from the plot.
In [ ]:
plot(rater_goodness['number_of_ratings'],
rater_goodness['average_correctness'],
marker='o',
color='blue',
linestyle='None')
xlabel('number of ratings')
ylabel('average correctness')
To measure the linear relationship between number of ratings and average correctness quantitatively, a linear regression is fitted.
From the model summary, it is still difficult to establish a reliable linear relationship between the two variables, since the coefficient of number of ratings is not significantly different from zero.
The statsmodels and patsy modules are imported for the linear regression.
In [ ]:
import statsmodels.api as sm
from patsy import dmatrices
y, X = dmatrices('average_correctness ~ number_of_ratings', data=rater_goodness, return_type='dataframe')
model = sm.OLS(y, X)
result = model.fit()
print(result.summary())
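1. Apply quantile(q=.75) to the per-turk rating counts in "labels_on_gold" to find the cut point for the top 25% of turks.
2. Select the ratings by those turks in "labels_unknown" with the .map function.
3. The result holds the ratings of those turks on url and category, duplicates dropped.
4. Create a turk column in the "rater_goodness" dataframe from the index.
5. Merge the two dataframes on turk.
6. Apply groupby to the resulting dataframe on url and category.
7. Apply .prod() on odds to calculate the overall odds by url and category.

Here odds is the "overall odds" as defined in the assignment description.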
In [ ]:
top_25_cutpoint = labels_on_gold["turk"].value_counts().quantile(q=.75)
turk_list = labels_on_gold["turk"].value_counts()
mask_1 = labels_unknown["turk"].map(lambda s: turk_list[s]>=top_25_cutpoint if s in turk_list else False)
labels_bytop25 = labels_unknown[mask_1]
rater_goodness["turk"] = rater_goodness.index
odds_top25 = rater_goodness[rater_goodness["turk"].map(lambda s: turk_list[s]>=top_25_cutpoint if s in turk_list else False)]
overall_odds = pd.merge(labels_bytop25,
odds_top25,
left_on="turk",
right_on="turk",
how="left").dropna()
overall_odds.groupby(["url", "category"])[["odds"]].prod()[:10]
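The group-wise product can be checked by hand on a small example. A minimal sketch on made-up data; ratings_toy and its values are assumptions for illustration:

```python
import pandas as pd

# One row per rating, carrying the odds of the turk who gave it (made-up values).
ratings_toy = pd.DataFrame({
    "url": ["u1", "u1", "u1", "u2"],
    "category": ["G", "G", "X", "G"],
    "odds": [2.0, 3.0, 4.0, 5.0],
})

# Multiply the odds of all turks who chose the same category for the same url.
overall_toy = ratings_toy.groupby(["url", "category"])["odds"].prod()
```

For u1, the two turks who chose G contribute 2.0 * 3.0, while the single turk who chose X contributes 4.0 on its own.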
1. Take the groupby object in the last question, containing url, category and overall odds.
2. Apply unstack to break down category from index to columns, then transpose so that each url is a column.
3. Apply idxmax() on all columns, i.e. url; the returned value is a series with url as index and the pair ("odds", category) as values.
4. Apply max() on the transposed dataframe in step 2.
In [ ]:
overall_odds_df = overall_odds.groupby(["url", "category"])[["odds"]].prod().unstack("category").T.fillna(0)
url_rating = pd.DataFrame(overall_odds_df.idxmax())
url_rating["top category"] = url_rating[0].map(lambda s: s[1])
url_rating = url_rating.set_index(url_rating.index.values)
url_rating["top odds"] = overall_odds_df.max()
url_rating = url_rating[["top category", "top odds"]]
url_rating[:10]
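A variant of the same idea, shown here as a sketch rather than as the notebook's method: unstacking a series instead of a one-column dataframe keeps the columns as plain category labels, so idxmax(axis=1) returns the category directly and no transpose is needed. The data and the names overall_toy, wide and top_toy are made up:

```python
import pandas as pd

# Overall odds indexed by (url, category); values are made up.
overall_toy = pd.Series(
    [6.0, 4.0, 5.0],
    index=pd.MultiIndex.from_tuples(
        [("u1", "G"), ("u1", "X"), ("u2", "G")], names=["url", "category"]
    ),
    name="odds",
)

# One row per url, one column per category; missing combinations become 0.
wide = overall_toy.unstack("category").fillna(0)

top_toy = pd.DataFrame({
    "top category": wide.idxmax(axis=1),  # column label of the row maximum
    "top odds": wide.max(axis=1),         # the row maximum itself
})
```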
1. Repeat the steps above for the url rated by the top 75% of turks. Here only the "top category" column is kept and named result_75.
2. Take the top category column of the dataframe from Question 8, rename it result_25, and make it a dataframe.
3. Merge the two dataframes on their index and apply crosstab with the two columns as index and columns respectively.
In [ ]:
top_75_cutpoint = labels_on_gold["turk"].value_counts().quantile(q=.25)
mask_2 = labels_unknown["turk"].map(lambda s: turk_list[s]>=top_75_cutpoint if s in turk_list else False)
labels_bytop75 = labels_unknown[mask_2]
odds_top75 = rater_goodness[rater_goodness["turk"].map(lambda s: turk_list[s]>=top_75_cutpoint if s in turk_list else False)]
overall_odds_75 = pd.merge(labels_bytop75,
odds_top75,
left_on="turk",
right_on="turk",
how="left").dropna()
overall_odds_df_75 = overall_odds_75.groupby(["url", "category"])[["odds"]].prod().unstack("category").T.fillna(0)
url_rating_75 = pd.DataFrame(overall_odds_df_75.idxmax())
url_rating_75["result_75"] = url_rating_75[0].map(lambda s: s[1])
url_rating_75 = pd.DataFrame(url_rating_75["result_75"])
url_rating_75 = url_rating_75.set_index(url_rating_75.index.values)
url_rating_25 = pd.DataFrame({"result_25": url_rating["top category"]})
url_rating_merged = pd.merge(url_rating_25,
url_rating_75,
left_index=True,
right_index=True,
).dropna()
url_rating_crosstab = pd.crosstab(index=url_rating_merged["result_25"],
columns=url_rating_merged["result_75"]
)
url_rating_crosstab
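How to read such a table is easiest to see on a tiny example. A minimal sketch on made-up data; merged_toy, table and agreement are hypothetical names, and the agreement share is an extra summary that the notebook itself does not compute:

```python
import pandas as pd

# Predicted category per url from the two groups of turks (made-up values).
merged_toy = pd.DataFrame(
    {
        "result_25": ["G", "G", "X", "X"],
        "result_75": ["G", "X", "X", "X"],
    },
    index=["u1", "u2", "u3", "u4"],
)

# Rows: category from the top 25% of turks; columns: category from the top 75%.
table = pd.crosstab(index=merged_toy["result_25"], columns=merged_toy["result_75"])

# Share of urls on which both groups agree (the diagonal of the table).
agreement = (merged_toy["result_25"] == merged_toy["result_75"]).mean()
```

Diagonal cells count the url on which both groups agree; off-diagonal cells show which categories are confused with each other.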