This started out as a curiosity. I was interested in what I'd need to do to take a bunch of "Top X" lists, combine them, and then ask questions of the data like, "What thing was number one the most?" or "If the votes are weighted, what does the actual top X look like?" I then remembered that Shreddit just did a vote. ;)
This isn't a scientifically accurate analysis rooted in best practices. But I'm also just getting started with data analysis. So there's that.
In [4]:
# set up all the data for the rest of the notebook
import json
from collections import Counter
from itertools import chain

from IPython.display import HTML


def vote_table(votes):
    """Render a crappy HTML table for easy display. I'd use Pandas, but that seems like
    complete overkill for this simple task.
    """
    base_table = """
    <table>
        <tr><td>Position</td><td>Album</td><td>Votes</td></tr>
        {}
    </table>
    """
    base_row = "<tr><td>{0}</td><td>{1}</td><td>{2}</td></tr>"
    vote_rows = [base_row.format(idx, name, vote) for idx, (name, vote) in enumerate(votes, 1)]
    return HTML(base_table.format('\n'.join(vote_rows)))


with open('shreddit_q2_votes.json', 'r') as fh:
    ballots = json.load(fh)

with open('tallied_votes.json', 'r') as fh:
    tallied = Counter(json.load(fh))

equal_placement_ballots = Counter(chain.from_iterable(ballots))
The equal placement ballot assumes that any position on the ballot is equal to any other. And given that this is how the voting was designed, it makes the most sense to look at it first. There are some differences from the thread's results, but given that /u/kaptain_carbon was tallying by hand, and I manually copy-pasted ballots (regex is hard) and then had to manually massage some data (fixing names and the like), differences are to be expected. Another note: all the data in my set is lowercased in an effort to normalize it and make the tally more accurate. My analysis also includes submissions from after voting was closed, mostly because I was too lazy to check dates.
I'm also playing fast and loose with items that end up with the same total, rather than doing the "right thing" and marking them at the same position. So, there's that.
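On the normalization point above: the automatic part of the cleanup looked roughly like this. This is just a sketch; the real fixes (typos, alternate album titles) were done by hand, and the normalize helper here is illustrative rather than something used later in the notebook.

def normalize(ballot):
    """Lowercase each entry and collapse stray whitespace. Only the
    automatic part of the cleanup; the rest was manual."""
    return [" ".join(entry.lower().split()) for entry in ballot]

# e.g. normalize(["  Some Album  Title "]) -> ["some album title"]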
Here's the top ten from the table in the post.
In [5]:
vote_table(tallied.most_common(10))
Out[5]:
And here's the top ten from my computed tally:
In [6]:
vote_table(equal_placement_ballots.most_common(10))
Out[6]:
But that's boring. What if we pretended for a second that everyone submitted a ballot where the albums were actually ranked one through five? What would the top ten look like then? There are a few ways to figure this one out. My initial thought was to assign each vote a number from 1 to 5 based on its position and then find the lowest sum. The problem is that an item that only appears once ends up with a tiny sum and gets treated as the most preferred. That won't work. But going backwards, from five down to one for each item on a ballot, and then finding the largest total probably would:
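To make that failure mode concrete, here's a toy illustration with made-up ballots: any album that shows up on only one ballot gets a tiny position sum and "beats" an album everyone agrees on.

# made-up ballots, purely to illustrate why the lowest-sum idea breaks
toy_ballots = [
    ['one hit wonder', 'crowd favorite'],
    ['something else', 'crowd favorite'],
    ['another thing', 'crowd favorite'],
]

position_sums = Counter()
for ballot in toy_ballots:
    for position, album in enumerate(ballot, 1):
        position_sums[album] += position

# every album that appears only once ends up with a sum of 1, while
# 'crowd favorite', which is on every single ballot, "loses" with a sum of 6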
In [7]:
# weight each slot on a ballot from 5 (first choice) down to 1 (fifth)
weighted_ballot = Counter()
for ballot in ballots:
    for item, weight in zip(ballot, range(5, 0, -1)):
        weighted_ballot[item] += weight
This handles the situation where a ballot isn't full (five votes), since zip stops at the shorter of the two sequences. Incomplete ballots make up a surprisingly nontrivial fraction of the submissions:
In [8]:
# fraction of ballots with fewer than five entries
sum(1 for ballot in ballots if len(ballot) < 5) / len(ballots)
Out[8]:
Anyways, what does a top ten for weighted votes end up looking like?
In [9]:
vote_table(weighted_ballot.most_common(10))
Out[9]:
Hm, it's not actually all that different. Some bands move around a little bit, and Deathhammer moves into the top ten with this method. But overall, the general spread is pretty much the same.
It's also interesting to look at the difference in position between the weighted tally and the way it's done in the thread. There are major differences between the two, due both to the change in voting method and to my including submissions from after voting expired. There's also a missing band. :?
In [10]:
# map each album to its position in the thread's tally
regular_tally_spots = {name.lower(): pos for pos, (name, _) in enumerate(tallied.most_common(), 1)}

base_table = """
<table>
    <tr><td>Album</td><td>Regular Spot</td><td>Weighted Spot</td></tr>
    {}
</table>
"""
base_row = "<tr><td>{0}</td><td>{1}</td><td>{2}</td></tr>"

rows = [base_row.format(name, regular_tally_spots[name], pos)
        for pos, (name, _) in enumerate(weighted_ballot.most_common(), 1)
        # some albums didn't make it, like Arcturian D:
        if name in regular_tally_spots]

HTML(base_table.format('\n'.join(rows)))
Out[10]:
In [11]:
# how often each album was the first item on a ballot
number_one = Counter([b[0] for b in ballots])
vote_table(number_one.most_common(10))
Out[11]:
This paints a slightly different picture of the top ten. While the names are largely the same, Scar Sighted was thought of as the top album most often, despite sitting at two or three under the other methods. And Misþyrming is at four (okay, "2", again fast and loose with the numbering) despite being the solid top choice in every other method.
There are lots of different ways to look at the ballots and different ways to tally them. Weighted voting is certainly an interesting avenue to explore.
Originally, I had wondered if something along the lines of Instant Runoff Voting or data processing packages like Pandas, NumPy, or SciPy would be needed. But for basic prodding and poking, it turns out the stdlib is just fine.
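For the curious, here's roughly what a stdlib-only instant runoff pass over these ballots could look like. This is a sketch under naive assumptions (arbitrary tie-breaking, exhausted ballots simply ignored), not something I ran as part of the analysis, and instant_runoff is a hypothetical helper rather than anything used above.

# Counter and chain are already imported at the top of the notebook
def instant_runoff(ballots):
    """Repeatedly eliminate the album with the fewest first-choice votes
    until one album holds a majority of the ballots still in play.
    """
    remaining = set(chain.from_iterable(ballots))
    while len(remaining) > 1:
        firsts = Counter({album: 0 for album in remaining})
        for ballot in ballots:
            # each ballot counts for its highest-ranked album still in the running
            choice = next((album for album in ballot if album in remaining), None)
            if choice is not None:
                firsts[choice] += 1
        leader, votes = firsts.most_common(1)[0]
        if votes * 2 > sum(firsts.values()):
            return leader
        # drop the current last-place album and redistribute its ballots
        remaining.discard(min(firsts, key=firsts.get))
    return remaining.pop()

# instant_runoff(ballots) would return a single winner rather than a ranked list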
Also: a lot of awesome music I haven't listened to at all this year (been tied up with Peace is the Mission the last few weeks, too, sorry guys).
In [12]:
#regular tallying
vote_table(equal_placement_ballots.most_common())
Out[12]:
In [13]:
#weighted ballot
vote_table(weighted_ballot.most_common())
Out[13]:
In [14]:
#number one count
vote_table(number_one.most_common())
Out[14]: