Search Project for CST 495

CMU Movie Summary Corpus http://www.cs.cmu.edu/~ark/personas/

Dustin D'Avignon

Chris Ngo

Let's go

We begin by normalising the text: removing unwanted characters and converting everything to lowercase.


In [6]:
import csv
import re

# Build {movie_id: normalized_summary} from the plot summaries file.
with open("data/MovieSummaries/plot_summaries.tsv") as f:
    r = csv.reader(f, delimiter='\t', quotechar='"')
    tag = re.compile(r'\b[0-9]+\b')     # movie ID: runs of digits
    rgx = re.compile(r'\b[a-zA-Z]+\b')  # words: runs of ASCII letters only
    docs = {}
    for i, x in enumerate(r):
        # NOTE: i > 1 skips the first two rows of the file.
        if i > 1:
            doc_id = ' '.join(tag.findall(x[0]))
            docs[doc_id] = ' '.join(rgx.findall(x[1])).lower()
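
To see what the word pattern keeps and drops, the normalisation step can be sketched on its own. `normalize` is a hypothetical helper name introduced here for illustration; it is not part of the notebook, and it only mirrors the `rgx` pattern used in the cell above.

```python
import re

WORD = re.compile(r'\b[a-zA-Z]+\b')  # same pattern as rgx in the cell above

def normalize(text):
    """Keep only standalone runs of ASCII letters, lowercased, space-joined."""
    return ' '.join(WORD.findall(text)).lower()

print(normalize("He doesn't know: it's 1977!"))  # he doesn t know it s
print(normalize("/m/07s9rl0"))                   # m
```

Apostrophes split words (hence the "doesn t" and "s" tokens visible in the output further down), and any token that mixes letters and digits is dropped entirely because no word boundary falls inside it.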

> Next, normalize the movie metadata so that each movie ID from above maps to its title. This covers only the basics needed for the lookup: an attempt to extract the genre properly caused parsing errors that broke the rest of the code, so for now the genre column is only roughly tokenized with the same word pattern.

In [7]:
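import csv
import re

# Build {movie_id: (normalized_title, normalized_genres)} from the metadata file.
with open("data/MovieSummaries/movie.metadata.tsv") as f:
    r = csv.reader(f, delimiter='\t', quotechar='"')
    tag = re.compile(r'\b[0-9]+\b')
    rgx = re.compile(r'\b[a-zA-Z]+\b')
    docs2 = {}
    for i, x in enumerate(r):
        if i > 1:  # skips the first two rows, as in the previous cell
            doc_id = ' '.join(tag.findall(x[0]))
            title = ' '.join(rgx.findall(x[2])).lower()  # column 2: title
            genre = ' '.join(rgx.findall(x[8])).lower()  # column 8: genres
            docs2[doc_id] = (title, genre)

#print(docs2)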
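
The `'m drama'` value in the sample output further down suggests that the genre column holds something like a JSON object keyed by ID strings, with the stray `m` being a fragment of a key. Under that assumption, a minimal sketch of a cleaner extraction follows. `parse_genres` and the sample row are made up for illustration and are not taken from the corpus. Reading with `quoting=csv.QUOTE_NONE` is also only a suggestion: it may avoid the parsing errors mentioned above if they come from fields that begin with a double quote.

```python
import csv
import json

def parse_genres(raw):
    """Return sorted lowercase genre names from a JSON object string."""
    try:
        mapping = json.loads(raw)
    except ValueError:  # malformed or empty field
        return []
    if not isinstance(mapping, dict):
        return []
    return sorted(name.lower() for name in mapping.values())

# Made-up row in the assumed nine-column layout (not real corpus data).
sample = ['123\t/m/xxxx\tExample Movie\t\t\t\t{}\t{}\t'
          '{"/m/aaa": "Drama", "/m/bbb": "World cinema"}']

# QUOTE_NONE treats double quotes as ordinary characters.
for row in csv.reader(sample, delimiter='\t', quoting=csv.QUOTE_NONE):
    genres = parse_genres(row[8])
    print(genres)
```

Keeping the genres as a list rather than one flattened string also keeps multi-word genre names intact.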

Now join the two dictionaries on the movie ID, keeping only movies that appear in both.


In [8]:
# Inner join on movie ID: keep only summaries that have a metadata entry.
doc = [(docs2[x], y) for x, y in docs.items() if x in docs2]

# for testing
# import random
# print doc[random.randint(0, len(doc)-1)]
print doc[0][0], doc[0][1]

items_t = [d[0] for d in doc]     # item titles, as (title, genre) tuples
items_d = [d[1] for d in doc]     # item descriptions
items_i = range(0, len(items_t))  # item ids


('periya idathu penn', 'm drama') murugappa is a small time farm labourer who lives with his widowed sister gangamma in a village pillaival is the zamindar of the village and sabapathy and punitha are his children punitha is studying in college in a nearby town while sabapathy is not educated both the father and the children are both arrogant about their wealth and try to rule the villagers murugappa tries to question their authority and this leads to frequent clashes with the zamindar s family pichandi is a wealthy college mate of punitha who is crazy about her sabapathy falls in love with thillaiammal who has been informally enagaged to murugappa for a long time both pillaival and gangamma propose for her on the same day to avoid a direct clash with the zamindar her father says that he took a vow that his daughter would marry the winner of a silambam competition punitha promises to marry pichandi if he dopes a drink which murugappa drinks during the fight sabapathy wins the fight and marries thilakam punitha goes back on her word and an angry pichandi confesses his duplicity to murugappa who confronts punitha the two get into an argument during which murugappa wows to mary punitha the two families get into another clash regarding the villagers right to worship at the temple at the same time as pillaival in retaliation he sends his henchmen to beat up murugappa and burn down their house when gangamma confronts him he rapes her and she disappears after writing a suicide note to her brother urging him to leave the village and make a life elsewhere pillaival is haunted by the fear that gangamma would return from the grave to take revenge he goes to the city where he meets pichandi again pichandi does a pygmalion on him and mgr emerges from the tutelage as azhagappa competent in english and even knows to play the piano he meets punitha at a club and the two begin dating she fails to connect the suave azhagappa with the village bumpkin murugappa and falls for him 
pillaival gets an anonymous letter informing him that his daughter is in love with someone in the city and he sends for her immediately and confronts her she admits to being in love with azgagappa and her family decides to get them married azhagappa and pichandi as his secretary meet pillaival and the marriage takes place he meets thillaiammal s father and reveals his identity to him he discovers that pillaival is in responsible for his sister s death and the whole family comes to know sabapathy tries to support his father and punitha opposes him azhagappa reveals his identity and walks out of the marriage he finds it impossible to live with punitha after knowing what her father did to his sister punitha discovers that she is pregnant and with the help of her sister in law thillaiammal meets murugappan and tells him the truth murugappan is caught between his desires to live with his wife and avenge his sister s death sabapathy discovers her there and gets into a fight with murugappan confronted by a deadlock situation punitha initially contemplates suicide but decides to live and have the baby pichandi meanwhile falls in love with thillaiammal s sister valli and learns silambam from her father in order to wed her murugappan and punitha have a baby boy and are still unable to be together murugappan longs to see his baby and goes to her house secretly in the nights he finds that pillai val has been stabbed punitha and sabapathy think that their father was killed by murugappan who tries to follow the killer and finds out that it is his sister she says that she was in hiding waiting for a chance to avenge herself and advices him to return to his wife and son meanwhile he is confronted by sabapathy and the police pursue gangamma who jumps off a bridge and kills herself all they find is a note from her confessing to pillaival s murder and urging murugappan to return to his wife pichandi marries valli and the family unites
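
The join above can be illustrated on toy data. This is a sketch with made-up dictionaries, and `join_on_id` is a hypothetical name. It differs from the notebook cell in one way: it sorts the keys so that the resulting item order, and therefore the item ids, are reproducible from run to run.

```python
summaries = {'1': 'plot one', '2': 'plot two', '3': 'plot three'}
metadata = {'1': ('title one', 'drama'), '3': ('title three', 'comedy')}

def join_on_id(summaries, metadata):
    """Inner join: pair each summary with its metadata, dropping unmatched IDs."""
    return [(metadata[k], text)
            for k, text in sorted(summaries.items())
            if k in metadata]

joined = join_on_id(summaries, metadata)
dropped = sorted(set(summaries) - set(metadata))  # IDs with no metadata

print(joined)
print(dropped)
```

Checking `dropped` on the real data is a quick way to see how many summaries are lost in the join.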

Term frequency
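
As a point of reference, one common definition of term frequency is the raw count of each term within a single document. The sketch below assumes that definition and whitespace tokenization, which fits the already normalised summaries. `term_freq` and `toy_corpus` are illustrative names; the notebook's own computation is not shown in this section.

```python
from collections import Counter

def term_freq(text):
    """Raw term counts for one whitespace-tokenized document."""
    return Counter(text.split())

toy_corpus = ['the cat sat on the mat', 'the dog sat']
tfs = [term_freq(d) for d in toy_corpus]

print(tfs[0]['the'])  # 2
print(tfs[1]['cat'])  # 0, a Counter returns 0 for unseen terms
```

The same function could be applied to each entry of the `corpus` slice defined in the next cell.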


In [10]:
corpus = items_d[0:25]
print corpus


['murugappa is a small time farm labourer who lives with his widowed sister gangamma in a village pillaival is the zamindar of the village and sabapathy and punitha are his children punitha is studying in college in a nearby town while sabapathy is not educated both the father and the children are both arrogant about their wealth and try to rule the villagers murugappa tries to question their authority and this leads to frequent clashes with the zamindar s family pichandi is a wealthy college mate of punitha who is crazy about her sabapathy falls in love with thillaiammal who has been informally enagaged to murugappa for a long time both pillaival and gangamma propose for her on the same day to avoid a direct clash with the zamindar her father says that he took a vow that his daughter would marry the winner of a silambam competition punitha promises to marry pichandi if he dopes a drink which murugappa drinks during the fight sabapathy wins the fight and marries thilakam punitha goes back on her word and an angry pichandi confesses his duplicity to murugappa who confronts punitha the two get into an argument during which murugappa wows to mary punitha the two families get into another clash regarding the villagers right to worship at the temple at the same time as pillaival in retaliation he sends his henchmen to beat up murugappa and burn down their house when gangamma confronts him he rapes her and she disappears after writing a suicide note to her brother urging him to leave the village and make a life elsewhere pillaival is haunted by the fear that gangamma would return from the grave to take revenge he goes to the city where he meets pichandi again pichandi does a pygmalion on him and mgr emerges from the tutelage as azhagappa competent in english and even knows to play the piano he meets punitha at a club and the two begin dating she fails to connect the suave azhagappa with the village bumpkin murugappa and falls for him pillaival gets an anonymous letter 
informing him that his daughter is in love with someone in the city and he sends for her immediately and confronts her she admits to being in love with azgagappa and her family decides to get them married azhagappa and pichandi as his secretary meet pillaival and the marriage takes place he meets thillaiammal s father and reveals his identity to him he discovers that pillaival is in responsible for his sister s death and the whole family comes to know sabapathy tries to support his father and punitha opposes him azhagappa reveals his identity and walks out of the marriage he finds it impossible to live with punitha after knowing what her father did to his sister punitha discovers that she is pregnant and with the help of her sister in law thillaiammal meets murugappan and tells him the truth murugappan is caught between his desires to live with his wife and avenge his sister s death sabapathy discovers her there and gets into a fight with murugappan confronted by a deadlock situation punitha initially contemplates suicide but decides to live and have the baby pichandi meanwhile falls in love with thillaiammal s sister valli and learns silambam from her father in order to wed her murugappan and punitha have a baby boy and are still unable to be together murugappan longs to see his baby and goes to her house secretly in the nights he finds that pillai val has been stabbed punitha and sabapathy think that their father was killed by murugappan who tries to follow the killer and finds out that it is his sister she says that she was in hiding waiting for a chance to avenge herself and advices him to return to his wife and son meanwhile he is confronted by sabapathy and the police pursue gangamma who jumps off a bridge and kills herself all they find is a note from her confessing to pillaival s murder and urging murugappan to return to his wife pichandi marries valli and the family unites', 'a hyper vigilant agent of the department of public safety erroll babbage checks 
on registered sex offenders burnt out after a long career he has become frustrated with the system of sex offender monitoring with little faith in humanity left he takes on one last job to find a missing girl he is three weeks away from taking early retirement and his final job is to train his young female replacement allison lowry after being left a newspaper with his characteristic headline circling he is convinced the case of kidnapping is connected to a paroled sex offender he s monitoring and he takes it upon himself to find the victim at all costs errol is eventually forced to leave the department early due to his relentless interrogation of sexual offenders and occasional vigilante actions against them his efforts center on viola a woman who has a history of being abused but is known to have a connection to another culprit that errol suspects to have taken the girl together with his partner they figure out that viola has become an abuser herself and is the ringleader in a kidnapping and torture syndicate they track her down to a deserted scrap heap where they find the latest kidnapped girl as well as corpses of previous victims the movie ends with viola being brought to book after errol considers killing her errol and allison realize that in fighting the monsters involved in sexual offenses they must not become monsters themselves', 'four friends gangu abdul nihal and gary get together to start their business but their roots are built on friendship and trust they succeed in their criminal goals although gangu is arrested and sentenced to jail for five years before going to jail he asks them to promise to go straight to which they all agree when gangu is released he is pleased to find that abdul is now driving a taxi his mother is well looked after and that nihal and gary have also started doing business it is when gangu meets his sweetheart sanam and proposes marriage that he learns that all is not well in their world', 'a married man is having an affair 
with another man after some time apart the two men spend a night together in a family vacation home in taal batangas together in such close quarters the two are left with nothing to do but to confront the realities of their relationship the movie opens with william a doctor driving up to tagaytay city to meet his secret lover jp a handsome young fellow in his mid twenties while driving william had been engaged in a cell phone conversation with his wife who was asking when he will return home william made up an alibi saying that he has an unexpected appointment in tagaytay and will probably be back in manila the next day on the other hand jp was waiting in a public viewing park of the taal volcano to join william in his tagaytay escapade jp is a local boatman and tour guide in taal batangas and william s secret lover for a year now although it was not specifically reiterated in the movie listening to their dialogue one will learn that the two men obviously met during one of william and his wife s visits to taal volcano if william has a wife jp has a girlfriend the two did not see each other for two months partly because william is very busy being a makati physician and partly because he is a family man jp on the other hand had spent two months secretly waiting for william while keeping himself busy with his boating job and his girlfriend deep inside the two lovers missed each other when they reach the rest house william cooks pasta and they eat and drink wine together it is to be however the last night that the two lovers will be together as william is scheduled to leave for australia william wants to call it quits but doesn t know how to break the news to jp who is already emotionally attached to him although they really didn t have an agreement that they are indeed a couple the tension starts when william tells jp that he s leaving the country several scenes and dialogue lead up to the climax as jp is reluctant to accept the fact that it was the end of their 
relationship the two men had spent the night talking about their past including happy and unforgettable memories while browsing their pictures the night has been a beautiful night their intimacy is renewed and they make love as pleasurable as ever but as the daybreak comes william remains firm to his decision leaving for australia and ending his relationship with jp the movie ends as william is driving back to manila again and jp alone in the nocturnal tagaytay rest house is numbed to the truth that their relationship is doomed in just one night', 'the movie concerns the life of tomasina tommy boyd who works as a mechanic and her hopes to become a stock car driver', 'azhagiri and his sister shanmughapriya are siblings who live only for each other azhagiri is a student and his sister is a lecturer santhanam and co forms his more than one sidekicks and m s baskar is their professor who invariable falls quarry to santhanam s unwitting jokes and intentional pranks azhagiri goes to thambikkotai on an nss field trip there he falls in love with kanaga who is the daughter of the amirthalingam a rich don of the village amirthalingam tries to kill government officers who want to repair a damaged bridge that connects the village with the outside world the village has a connection to azhagiri his sister and their dead father shanmugam', 'in a mad max style future jake mcqueen is the ultimate smuggler smuggling in mexicans for money to survive only for his smuggling to come to a halt when he is busted by his brother while getting his truck repaired however what he doesn t know is that he is under observation by jared the crippled head of chrysalis corporation who sends one of his most valued employees hannah tyree to bring him in to work for them as part of their video games division jake initially is skeptical about the idea of working with hannah and is scared away when she admits that she accidentally downloaded herself onto prism a crystalline solid state memory unit for 
her computer once due to an unexpected side effect jake is then hunted down after jared has his data and eventually finds his way back home only to find his father near death acquiring a junked mustang and a special engine his father had kept in trust he goes to find a way to stop chrysalis while pursuing a lead he ends up shot and is witness to hannah s apparent death only to find she was trapped in her prism going into battle against jared with hannah as his car s new ai he eventually destroys him when he discovers the one side effect of jared s life support that it is slowly killing the person it protects now jake and hannah travel the world of the future fighting for justice in a lawless desert that is forgotten by the world', 'colleen is the manager of a dress shop named the ames company owned by donald ames they try to keep uncle cedric from working because he ll ruin the company troubles start when he hires schemer joe as his personal assistant he later also hires minnie a woman who has a great passion for fashion when he buys the dress shop for minnie where colleen works as a bookkeeper a scandal is soon followed donald decides to shut the shop but is stopped because of his infatuation towards colleen it is colleen who eventually makes a profit out of the things that happened meanwhile a man named cedric tries to adopt minnie minnie refuses and thereby causes a scandel this angers alicia but the press can t get enough of it donald loses colleen s affection and thus is sued by joe', 'a young man in his early twenties juggles his dreams to be a filmmaker with his family life his best friend s troubles the girl he s interested in and living in pakistan during political turmoil', 'a man who suspects his wife is having an affair with his daughter s fiance places the two in dangerous situations in order to satisfy his voyeuristic curiosity', 'poirot joins his assistant hastings in acapulco mexico where hastings is staying they go to a party at which the other 
guests include the writer janet crisp the american actor charles cartwright a clergyman called babbington daisy eastman and her daughter egg dr strange and ricardo montoya babbington dies of poisoning and then strange is poisoned too poirot hunts the murderer', 'daniel eugene rudy ruettiger grows up in joliet illinois dreaming of playing college football at the university of notre dame though he is achieving some success with his local high school team joliet catholic academy he lacks the grades and money necessary to attend notre dame as well as talent and physical stature ruettiger takes a job at a local steel mill like his father daniel sr who is also a notre dame fan he prepares to settle down but when his best friend pete is killed in an explosion at the mill rudy decides to follow his dream of attending notre dame and playing for the fighting irish he perseveres to do everything he can to get into the football powerhouse he leaves for the campus but fails to get admitted to notre dame with the help and sponsorship of a local priest rudy starts at a small junior college nearby named holy cross hoping to get good enough grades to qualify for a transfer he also manages to get a part time job on notre dame s groundskeeping staff and befriends d bob a graduate student at notre dame and a teaching assistant at his junior college the socially awkward d bob offers to tutor rudy if he helps him meet girls suspecting an underlying cause to ruettiger s previous academic problems d bob has rudy tested and rudy learns that he has dyslexia rudy learns how to overcome his disability and becomes a better student at christmas vacation rudy returns home to his family s appreciation of his report card but is still mocked for his attempts at playing football and also dumped by his fiance who starts seeing one of his brothers after numerous rejections rudy is finally admitted to notre dame during his final semester of transfer eligibility he rushes home to tell his family and his 
father announces the news to his steel mill workers over the loudspeaker after walking on as a non scholarship player for the football team ruettiger convinces coach ara parseghian to give him a spot on the practice squad an assistant coach warns the players that scholarship players won t make the dress roster of players who take the field during the games but also notices that ruettiger exhibits more drive than many of his scholarship teammates coach parseghian agrees to rudy s request to suit up for one home game in his senior year so his family and friends can see him as a member of the team however parseghian steps down as coach following the season dan devine succeeds him in and honors parseghian s promise only after a player protest led by senior team captain and all american roland steele the other seniors rise to his defense and lay their jerseys on devine s desk each requesting that rudy be allowed to dress in their place in response devine lets ruettiger appear for the final home game against georgia tech at the final home game steele invites ruettiger to lead the team out of the tunnel onto the playing field as the game comes to an end and notre dame is ahead devine sends all the seniors to the field but refuses to let rudy play despite the pleas from steele and the assistant coaches as a rudy chant spreads from the notre dame bench into the stadium and the offensive team led by tailback jamie o hare overrules devine s call for victory formation and they score another touchdown instead devine finally lets rudy enter the field with the defensive team on the final kickoff he stays in for the final play of the game sacks the georgia tech quarterback and to cheers from the stadium is carried off the field on the shoulders of his teammates', 'as a descendant of an impoverished polish noble family young wokulski is forced to work as a waiter at hopfer s a warsaw restaurant while dreaming of a life in science after taking part in the failed uprising against 
tsarist russia he is sentenced to exile in siberia on eventual return to warsaw he becomes a salesman at mincel s haberdashery marrying the late owner s widow he comes into money and uses it to set up a partnership with a russian merchant he had met while in exile the two merchants go to bulgaria during the russo turkish war of and wokulski makes a fortune supplying the russian army the enterprising wokulski now proves a romantic at heart falling in love with izabela daughter of the vacuous bankrupt aristocrat tomasz cki in his quest to win izabela wokulski begins frequenting theatres and aristocratic salons and to help her financially distressed father founds a company and sets the aristocrats up as shareholders in his business http www imdb com searchplotwriters polish cinema database http info fuw edu pl filmy the indolence of these aristocrats who secure with their pensions are too lazy to undertake new business risks frustrates wokulski his ability to make money is respected but his lack of family and social rank is condescended to because of his help to izabela s impecunious but influential father the girl becomes aware of his affection in the end she consents to accept him but without true devotion or love', 'in the year a gang of bandits calling themselves the united regime invade the town of new hope they are led by general quantrill a descendent of the famous confederate cavalry officer william quantrill a mysterious motorcycle riding gunslinger called yuma arrives in town and joins the gang but plays the thugs against each other causing the drunken riders to shoot each other the next morning yuma s love interest sarah kills two others from the gang but the regime believes yuma is responsible yuma hunts down his accusers and shoots six of them he is chased by the rest of the gang to the tire refinery but kills them all including quantrill s son the sole survivor ackett escapes to warn quantrill most of the townsfolk flee but sarah stays to help yuma 
quantrill descends on the town with his entire army only to find the road blocked by coffins filled with the bodies of his son and troops enraged he shoots ackett with a shotgun and enters the town only to find it empty the buildings rigged with explosives are detonated and most of quantrill s army is destroyed the remaining thugs converge on the town center and are attacked by armed townsfolk quantrill spots yuma in a tower and sends his men after him when they enter the tower yuma abseils down and detonates a bomb hidden inside meanwhile sarah s son hides in an armored school bus which is later hijacked by quantrill who is chased by yuma and sarah yuma climbs into the bus and crashes it into a wrecked car quantrill and yuma lie on the injured road both within reach of their guns yuma draws first and kills quantrill yuma loads quantrill s corpse onto his bike and reveals he was a bounty hunter tasked with capturing quantrill before riding off into the sunset steel frontier', 'the story centers on characters who are immortal the lead character colin macleod became an immortal after his first death in ad in roman britain when his village was attacked by the conquering romans another immortal marcus octavius was leading the roman empire s military forces in hopes of creating his dream of a utopian society octavius killed colin s wife but was not able to kill colin whose unconscious body was dragged by a horse to within stonehenge a holy ground in the story where immortals are forbidden to fight waking days later within stonehenge colin is left confused about who and what he is and why he is still alive it is at this moment that the spirit of a former druid of this holy site amergan begins communicating with macleod and explains to him what he is colin learns of the game from amergan and the druid becomes his lifelong teacher and conscience the movie interlaces flashback scenes of this and colin s following plight for vengeance throughout time as marcus attempts to re 
enact his utopian society through force and fear throughout the next two millennia marcus serves as a high ranking member of various powerful empires ranging from the british colonial empire to nazi germany marcus wants to encourage the development of a utopian world empire but in contrast the side he chooses tend to be ruthless and authoritarian for years colin clashes with marcus multiple times throughout history always fighting on the opposing side trying to bring down which ever authoritarian regime marcus is supporting while colin is often badly beaten neither one ever succeeds in killing the other though colin is ostensibly in the role of the barbarian and marcus is the bringer of civilization marcus cares more about building an empire and less about people s well being with an ends justify the means mentality in the year colin finds octavius in a post apocalyptic future of new york city octavius has stopped supporting other regimes but is setting up a new empire with himself as its tyrannical leader when colin arrives marcus is busy making plans to release a deadly virus which will further his goals of conquest colin falls in with the disaffected rebels in the city after a final duel colin defeats octavius and his quickening destroys the virus in question', 'northern leg travels across china to find the man responsible for the death of his parents the culprit is none other than the silver fox a feared martial arts expert and bandit silver fox has also caught the attention of southern fist a government agent while southern fist and northern leg are both after the same man they discover that alone they are no match for silver fox the two heroes must combine their skills knowing that it is the only way to gain success against their awesome adversary in the course of finding and defeating the silver fox both northern leg and southern fist fall for the same woman the daughter of the owner of the inn they stay at for the duration of the movie throughout the movie 
they both vie for her attention asking questions of the butler at the inn and a child who follows southern fist throughout the film', 'after years from the fateful day of christmas at welcome to home gori the scenery changes during a robbery in a villa in tuscany by danilo and his friends addicts back home he expect a sad circumstance the death of his mother adele in fear of theft danilo hiding the loot in the coffin of his mother exposed in the red room in the house for the wake', 'the flying scissors is a mockumentary about the world of competitive rock paper scissors the film delves into the lives and daily routines of a wide array of quirky characters who vie to be the best at this unorthodox sport each competitor must balance the nuances of their everyday life in hopes of becoming a champion the film uses rock paper scissors to satire the current state of professional sports and the modern success of poker', 'raju and guddi are childhood friends and neighbors who are virtually inseparable raju s father is arrested after a dramatic police chase for break enter and theft resulting in their separation guddi grows up to be a professional stage singer and dancer while raju grows up to be a card sharp and a thief years later both raju and guddi meet and fall in love with each other unaware that they were childhood friends while seema is on her way to her birthplace for religious reasons chander too is headed that way to get himself arrested so that he can be jailed for a motive that gets him a hefty sum of money from a gangster', 'an indian girl wants the freedom to choose her own destiny and the love of her life but her mother wants to marry her off in an arranged marriage the film portrays an intriguing mix of matchmakers bhangra dancers psychic healers and religious fanatics and addresses one of the most important issues in contemporary muslim culture women s rights veils and burkas', 'l a detective sam dietz struggling to emotionally survive his previous big 
case is unwillingly paired with a shady fbi agent kyle valsone during a case tracking another serial killer who kills seemingly at random but every time dietz gets a lead valsone gets in the way and somewhat throws off the investigation suspecting more than meets the eye dietz goes around the law to learn the identity of the killer and find out what valsone is hiding and his connection to the killer meanwhile dietz s wife carol now estranged from him due to his long hours tries to deal with her current situation and their uncertain future matt patay http www imdb com title plotsummary', 'the film opens with drake s return from his voyage of circumnavigation he is nervous about how he will be received at home and rightly so for he has executed thomas doughty an influential courtier investor in the voyage and formerly his closest friend the story is mostly told in flashback as drake recounts the circumstances of the voyage to queen elizabeth i although it is clear how drake interprets the events that led to doughty s execution the depicted scenes paint a more ambiguous tale conflict between drake and doughty grows due to an escalating pattern of drake s increasing autocracy and paranoia and doughty s underhanded means to regain the authority he sees as rightly his due before the fleet leaves plymouth drake learns that someone has betrayed the news of the voyage to william cecil lord burghley drake is upset as the destination of the venture to raid spanish ships in peru and return home via a route theorized by john dee called the straits of anian is top secret known only to drake doughty the queen and a few select insiders drake has told the crew they will be voyaging to alexandria on a trade mission tension is high when the truth is revealed a few of drake s crewmen are discontent at being tricked into a long dangerous voyage doughty has second thoughts and tries to convince drake to redirect the fleet to less uncertain plunder on the spanish main drake is resolute 
but alienates his former friend through his high handedness doughty believes that his investment his advocacy of the venture at court and his command of the soldiers accompanying the fleet entitle him to equal command the fleet encounters the santa maria a portuguese vessel upon its capture drake induces the cooperation of its captain and navigator nuno da silva who has an extensive knowledge of the coastline of brazil hostilities escalate when drake reprimands one of doughty s officers for stealing from a portuguese prisoner doughty is given command of the prize ship drake s ship the pelican immediately falls victim to a number of misfortunes including a lack of wind and an outbreak of scurvy doughty s ship is nowhere to be found drake becomes increasingly paranoid attributing the misfortunes to doughty s betrayal and his interest in the occult upon the reunion of the ships doughty continues to agitate for what he perceives is his due co equal status in directing the fleet he has a final confrontation with drake who strikes him and has him bound to the mast of the pelican the film climaxes at the scene of doughty s execution at san julian drake brings doughty to trial accusing him of mutiny and witchcraft encouraged by his nephew drake induces the ship s carpenter ned bright to perjure himself in order to assure the conviction drake manipulates the men into sentencing doughty to death the chaplain francis fletcher tries to persuade doughty to confess his sins but the gentleman protests his innocence until the end he takes communion with drake and goes resolutely to his death drake then makes a speech promising the men wealth beyond their wildest dreams but the gentlemen adventurers and the mariners must settle their differences he changes the name of the pelican to the golden hind in hopes of placating sir christopher hatton doughty s former employer drake s successful exploits of plunder and subsequent return to england are covered almost incidentally one 
disturbing scene involves the abandonment of the navigator nuno da silva on the shores of mexico drake knows that da silva will certainly fall into the hands of the spanish inquisition yet is unwilling for a portuguese national to see the straits of anian the treatment of da silva is extremely upsetting to the crew including drake s nephew who formerly idolized his uncle drake cannot find anian and so returns home the long way by going completely around the globe after hearing drake s story elizabeth clearly interested in the riches which drake has won grants him her full protection from doughty s friends and from the spanish king who has demanded his execution she cynically informs him that if he had not succeeded he would be as good as dead for the execution of doughty the film ends with drake s knighthood a triumph that seems oddly hollow due to the strained reaction shots of some of the surviving characters including drake s nephew and the preacher francis fletcher', 'while trying to prove he is a hero mwansa does the unforgivable and accidentally breaks his big sister shula s special mud doll he goes on a quest not only to fix it but to finally prove he is mwansa the great', 'a police officer dsp shamsher singh captures the notorious bandit mangal singh just as mangal s wife is about to give birth she dies at childbirth but not before extracting from the dsp his promise to take care of her son subsequently dsp singh raises the boy alongside his own though ironically dsp singh s natural son kishen has a wicked streak while mangal s son amit is endowed with an honest nature released after years in jail mangal finds out that his son is with his old enemy he mistakes kishen to be his son and instigates him to fight against shamsher s family they join the underworld gang and spread havoc amit becomes an honest police officer and is assigned the task of nabbing the gang after a misunderstanding kishen comes to believe he is actually mangal s son and falls under the 
bandit s influence though he continues to live in the inspector s home parvarish synopsis', 'sgt ryker is charged with korean war treason court martialed prosecuted by capt david young convicted and sentenced to die his wife ann insists that ryker received an inadequate defense she believes her husband s story that he had been on a secret mission assigned to it by a superior officer who has since died and can no longer vouch for him capt young is not only persuaded to get general bailey s approval for a new trial he volunteers to defend ryker this time a grateful ryker ends up furious when he discovers a romantic attachment is developing between his wife and the captain the new prosecutor maj whitaker unearths new evidence damning to the defendant s case at the last minute though young produces a sergeant named winkler who verifies ryker s story setting him free']

Start by computing the term frequency over the entire corpus.


In [11]:
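# Count how often each word occurs across the whole corpus.
tf = {}
for text in corpus:  # `text` avoids overwriting the `doc` list built earlier
    for word in text.split():
        if word in tf:
            tf[word] += 1
        else:
            tf[word] = 1
print(tf)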


{'baskar': 1, 'advices': 1, 'demanded': 1, 'protest': 1, 'captain': 3, 'offenses': 1, 'disability': 1, 'pensions': 1, 'bike': 1, 'under': 2, 'teaching': 1, 'merchant': 1, 'lack': 2, 'rise': 1, 'connects': 1, 'every': 1, 'confederate': 1, 'stabbed': 1, 'four': 1, 'school': 2, 'prize': 1, 'skills': 1, 'triumph': 1, 'force': 1, 'warns': 1, 'direct': 1, 'preacher': 1, 'second': 1, 'persuade': 1, 'even': 1, 'ruthless': 1, 'ned': 1, 'beaten': 1, 'corporation': 1, 'new': 8, 'increasing': 1, 'ever': 3, 'told': 2, 'hero': 1, 'whose': 1, 'men': 6, 'met': 2, 'protection': 1, 'china': 1, 'daughter': 7, 'employees': 1, 'pillaival': 8, 'browsing': 1, 'military': 1, 'changes': 2, 'golden': 1, 'secure': 1, 'amirthalingam': 2, 'brought': 1, 'guests': 1, 'tutelage': 1, 'unit': 1, 'sarah': 4, 'would': 3, 'army': 3, 'handedness': 1, 'chooses': 1, 'call': 2, 'survive': 2, 'tell': 1, 'coffins': 1, 'holy': 3, 'successful': 1, 'brings': 1, 'aware': 1, 'warn': 1, 'phone': 1, 'lord': 1, 'must': 4, 'shoot': 1, 'join': 2, 'room': 1, 'rights': 1, 'pursue': 1, 'work': 2, 'advocacy': 1, 'mechanic': 1, 'misfortunes': 2, 'give': 2, 'climax': 1, 'want': 1, 'times': 1, 'unforgettable': 1, 'end': 4, 'travel': 1, 'how': 4, 'badly': 1, 'poker': 1, 'elizabeth': 2, 'after': 21, 'misunderstanding': 1, 'shores': 1, 'lay': 1, 'curiosity': 1, 'burghley': 1, 'law': 2, 'siberia': 1, 'wins': 1, 'descends': 1, 'childhood': 2, 'allison': 2, 'ultimate': 1, 'enter': 3, 'amit': 2, 'order': 3, 'wind': 1, 'wine': 1, 'executed': 1, 'over': 1, 'tricked': 1, 'kickoff': 1, 'sidekicks': 1, 'before': 4, 'personal': 1, 'fix': 1, 'exhibits': 1, 'writing': 1, 'destroyed': 1, 'weeks': 1, 'overcome': 1, 'pleasurable': 1, 'eventually': 4, 'them': 6, 'break': 2, 'they': 24, 'lifelong': 1, 'silver': 4, 'routines': 1, 'arrested': 3, 'l': 1, 'victory': 1, 'each': 8, 'volcano': 2, 'side': 4, 'schemer': 1, 'stealing': 1, 'merchants': 1, 'driving': 4, 'psychic': 1, 'dsp': 4, 're': 1, 'encourage': 1, 'daniel': 2, 'parvarish': 1, 
'lawless': 1, 'revenge': 1, 'free': 1, 'admits': 2, 'formation': 1, 'delves': 1, 'starts': 3, 'days': 1, 'onto': 3, 'caught': 2, 'enraged': 1, 'already': 1, 'rank': 1, 'hearing': 1, 'another': 6, 'scissors': 3, 'navigator': 2, 'enagaged': 1, 'top': 1, 'girls': 1, 'boating': 1, 'too': 3, 'john': 1, 'ranging': 1, 'murder': 1, 'took': 1, 'somewhat': 1, 'tuscany': 1, 'eastman': 1, 'begins': 2, 'scenes': 3, 'extracting': 1, 'bridge': 2, 'chrysalis': 2, 'sees': 1, 'longs': 1, 'modern': 1, 'upset': 1, 'talking': 1, 'tells': 2, 'forced': 2, 'alibi': 1, 'responsible': 3, 'causing': 1, 'confronted': 2, 'forces': 1, 'quarterback': 1, 'though': 5, 'germany': 1, 'letter': 1, 'unorthodox': 1, 'grave': 1, 'maria': 1, 'singer': 1, 'don': 1, 'observation': 1, 'professor': 1, 'm': 1, 'makati': 1, 'tech': 2, 'sum': 1, 'saying': 1, 'bomb': 1, 'random': 1, 'ending': 1, 'attempts': 2, 'nephew': 3, 'dopes': 1, 'busy': 3, 'just': 2, 'headline': 1, 'situations': 1, 'rich': 1, 'pursuing': 1, 'jailed': 1, 'do': 2, 'stop': 1, 'da': 4, 'haunted': 1, 'despite': 1, 'report': 1, 'dr': 1, 'angers': 1, 'guns': 1, 'shots': 1, 'release': 1, 'unwitting': 1, 'shula': 1, 'secretary': 1, 'decides': 4, 'crewmen': 1, 'checks': 1, 'disturbing': 1, 'nocturnal': 1, 'jerseys': 1, 'best': 3, 'pete': 1, 'voyage': 5, 'draws': 1, 'scrap': 1, 'gentleman': 1, 'mud': 1, 'unable': 1, 'discovers': 5, 'cooperation': 1, 'encounters': 1, 'lazy': 1, 'nature': 1, 'however': 3, 'portrays': 1, 'news': 3, 'received': 2, 'country': 1, 'against': 7, 'players': 3, 'games': 2, 'com': 2, 'frequenting': 1, 'connection': 3, 'smuggler': 1, 'trust': 2, 'san': 1, 'asks': 1, 'three': 1, 'been': 5, 'communion': 1, 'interest': 2, 'life': 7, 'families': 1, 'attacked': 2, 'filmmaker': 1, 'child': 1, 'physician': 1, 'conviction': 1, 'tomasina': 1, 'contemplates': 1, 'near': 1, 'balance': 1, 'leaves': 2, 'mexico': 2, 'is': 120, 'dumped': 1, 'it': 20, 'player': 2, 'straits': 2, 'in': 101, 'exile': 2, 'punitha': 14, 'if': 4, 'damaged': 1, 
'things': 1, 'make': 4, 'clearly': 1, 'roland': 1, 'several': 1, 'grows': 4, 'meets': 6, 'raid': 1, 'hand': 2, 'consents': 1, 'characters': 3, 'thoughts': 1, 'kept': 1, 'kyle': 1, 'academic': 1, 'mother': 4, 'the': 332, 'veils': 1, 'left': 4, 'haberdashery': 1, 'assigned': 2, 'proposes': 1, 'grades': 2, 'financially': 1, 'victim': 2, 'plight': 1, 'yet': 1, 'previous': 3, 'bhangra': 1, 'enters': 1, 'burnt': 1, 'character': 1, 'spread': 1, 'has': 27, 'interprets': 1, 'humanity': 1, 'valsone': 3, 'mocked': 1, 'capt': 2, 'breaks': 1, 'addicts': 1, 'birth': 1, 'vigilante': 1, 'korean': 1, 'dreams': 2, 'apart': 1, 'tagaytay': 4, 'steps': 1, 'officer': 4, 'night': 6, 'notorious': 1, 'portuguese': 3, 'www': 2, 'right': 1, 'old': 1, 'deal': 1, 'people': 1, 'sends': 5, 'ironically': 1, 'dead': 2, 'donald': 3, 'for': 47, 'fox': 4, 'everything': 1, 'asking': 2, 'disappears': 1, 'conquering': 1, 'whitaker': 1, 'christmas': 2, 'burn': 1, 'silva': 4, 'deadly': 1, 'idolized': 1, 'confrontation': 1, 'defensive': 1, 'post': 1, 'descendent': 1, 'months': 2, 'hostilities': 1, 'o': 1, 'tyree': 1, 'efforts': 1, 'mast': 1, 'formerly': 2, 'bound': 1, 'son': 10, 'down': 8, 'raises': 1, 'shoots': 2, 'support': 2, 'flying': 1, 'fight': 5, 'way': 7, 'rapes': 1, 'quirky': 1, 'was': 12, 'war': 2, 'happy': 1, 'head': 1, 'becoming': 1, 'differences': 1, 'mysterious': 1, 'uncle': 2, 'heap': 1, 'syndicate': 1, 'true': 1, 'duel': 1, 'overrules': 1, 'inside': 2, 'attached': 1, 'until': 1, 'plymouth': 1, 'promises': 1, 'adopt': 1, 'request': 1, 'evidence': 1, 'proves': 1, 'octavius': 5, 'ship': 4, 'trip': 1, 'physical': 1, 'eventual': 1, 'no': 2, 'when': 18, 'actor': 1, 'setting': 2, 'role': 1, 'unwilling': 1, 'quickening': 1, 'realize': 1, 'amergan': 2, 'brothers': 1, 'dies': 2, 'welcome': 1, 'russo': 1, 'died': 1, 'longer': 1, 'sacks': 1, 'daily': 1, 'rejections': 1, 'time': 8, 'intimacy': 1, 'neighbors': 1, 'detective': 1, 'coach': 4, 'interlaces': 1, 'leads': 1, 'manager': 1, 'manages': 1, 
'bandits': 1, 'battle': 1, 'colleen': 5, 'mill': 3, 'hijacked': 1, 'certainly': 1, 'suicide': 2, 'father': 15, 'finally': 3, 'circumstances': 1, 'division': 1, 'hannah': 5, 'none': 1, 'seemingly': 1, 'nihal': 2, 'anonymous': 1, 'stays': 2, 'word': 1, 'minute': 1, 'cooks': 1, 'did': 2, 'die': 1, 'jamie': 1, 'mwansa': 2, 'brother': 2, 'minnie': 4, 'temple': 1, 'leave': 3, 'settle': 2, 'perceives': 1, 'team': 7, 'loads': 1, 'unaware': 1, 'unexpected': 2, 'says': 2, 'detonated': 1, 'discover': 1, 'agitate': 1, 'detonates': 1, 'drinks': 1, 'appear': 1, 'havoc': 1, 'murugappa': 8, 'current': 2, 'goes': 8, 'falling': 1, 'cynically': 1, 'filled': 1, 'satisfy': 1, 'supporting': 2, 'explosion': 1, 'climbs': 1, 'parseghian': 4, 'muslim': 1, 'downloaded': 1, 'crippled': 1, 'alone': 2, 'teacher': 1, 'supplying': 1, 'boy': 2, 'trial': 2, 'women': 1, 'bob': 3, 'love': 11, 'touchdown': 1, 'fbi': 1, 'opposes': 1, 'troops': 1, 'working': 2, 'prove': 2, 'angry': 1, 'sports': 1, 'live': 5, 'peru': 1, 'memory': 1, 'prosecutor': 1, 'wicked': 1, 'tsarist': 1, 'club': 1, 'apparent': 1, 'given': 1, 'riders': 1, 'capturing': 1, 'valued': 1, 'valli': 2, 'defeats': 1, 'car': 3, 'believes': 3, 'can': 5, 'stature': 1, 'following': 2, 'making': 1, 'indolence': 1, 'streak': 1, 'heart': 1, 'crazy': 1, 'reprimands': 1, 'figure': 1, 'awesome': 1, 'lowry': 1, 'confused': 1, 'agent': 3, 'death': 9, 'acapulco': 1, 'allowed': 1, 'thilakam': 1, 'monitoring': 2, 'bankrupt': 1, 'spreads': 1, 'staying': 1, 'crashes': 1, 'escalate': 1, 'max': 1, 'spot': 1, 'informs': 1, 'mad': 1, 'such': 1, 'heroes': 1, 'maj': 1, 'data': 1, 'man': 8, 'natural': 1, 'succeeded': 1, 'sr': 1, 'tale': 1, 'quantrill': 12, 'so': 4, 'drunken': 1, 'repair': 1, 'indeed': 1, 'years': 5, 'course': 1, 'still': 3, 'worship': 1, 'blocked': 1, 'police': 4, 'troubles': 2, 'forms': 1, 'offers': 1, 'main': 1, 'into': 14, 'happened': 1, 'non': 1, 'halt': 1, 'thereby': 1, 'killer': 4, 'doll': 1, 'realities': 1, 'wealthy': 1, 'not': 10, 'now': 5, 
'killed': 3, 'occult': 1, 'name': 1, 'didn': 1, 'rock': 2, 'repaired': 1, 'chander': 1, 'directing': 1, 'year': 4, 'girl': 6, 'morning': 1, 'emerges': 1, 'profit': 1, 'investigation': 1, 'uprising': 1, 'montoya': 1, 'hefty': 1, 'romans': 1, 'quits': 1, 'card': 2, 'care': 1, 'waiter': 1, 'british': 1, 'honest': 2, 'place': 2, 'think': 1, 'teammates': 2, 'first': 2, 'pelican': 3, 'surviving': 1, 'one': 13, 'sentencing': 1, 'long': 5, 'spanish': 4, 'impossible': 1, 'array': 1, 'city': 5, 'little': 1, 'zamindar': 3, 'immortals': 1, 'returns': 2, 'twenties': 2, 'bounty': 1, 'friend': 4, 'mostly': 1, 'that': 50, 'season': 1, 'viewing': 1, 'released': 2, 'than': 4, 'wide': 1, 'future': 4, 'venture': 2, 'were': 1, 'poisoning': 1, 'russia': 1, 'and': 184, 'mangal': 5, 'armored': 1, 'sam': 1, 'ann': 1, 'argument': 1, 'sad': 1, 'adele': 1, 'confesses': 1, 'warsaw': 2, 'note': 2, 'nuances': 1, 'squad': 1, 'take': 3, 'bailey': 1, 'pranks': 1, 'begin': 1, 'multiple': 1, 'roster': 1, 'track': 1, 'betrayal': 1, 'fails': 2, 'falls': 8, 'dream': 2, 'later': 4, 'drive': 1, 'defeating': 1, 'sunset': 1, 'professional': 2, 'senior': 2, 'shop': 3, 'walking': 1, 'shot': 1, 'contemporary': 1, 'hopes': 4, 'bright': 1, 'fletcher': 2, 'raju': 4, 'ground': 1, 'frequent': 1, 'title': 1, 'mexicans': 1, 'only': 11, 'going': 3, 'bringer': 1, 'ranking': 1, 'get': 11, 'assistant': 5, 'mission': 2, 'cannot': 1, 'regarding': 1, 'requesting': 1, 'jokes': 1, 'priest': 1, 'where': 5, 'husband': 1, 'bulgaria': 1, 'reveals': 3, 'gangster': 1, 'college': 5, 'sport': 1, 'janet': 1, 'loudspeaker': 1, 'suspects': 2, 'ostensibly': 1, 'subsequent': 1, 'sponsorship': 1, 'outside': 1, 'arrogant': 1, 'between': 3, 'across': 1, 'yuma': 12, 'jp': 10, 'inseparable': 1, 'killing': 3, 'by': 26, 'cedric': 2, 'childbirth': 1, 'come': 1, 'ringleader': 1, 'reaction': 1, 'acquiring': 1, 'many': 1, 's': 72, 'comes': 5, 'nearby': 2, 'pakistan': 1, 'grants': 1, 'crew': 2, 'considers': 1, 'better': 1, 'arts': 1, 'underworld': 1, 
'dyslexia': 1, 'cares': 1, 'abdul': 2, 'vacation': 2, 'mary': 1, 'sabapathy': 8, 'engine': 1, 'educated': 1, 'dramatic': 1, 'wake': 1, 'former': 3, 'case': 4, 'developing': 1, 'these': 1, 'promising': 1, 'newspaper': 1, 'situation': 2, 'hunts': 2, 'then': 3, 'endowed': 1, 'engaged': 1, 'refinery': 1, 'good': 2, 'everyday': 1, 'doctor': 1, 'theorized': 1, 'tour': 1, 'same': 4, 'gentlemen': 1, 'attributing': 1, 'verifies': 1, 'speech': 1, 'manila': 2, 'buys': 1, 'struggling': 1, 'regime': 3, 'events': 1, 'status': 1, 'noble': 1, 'closest': 1, 'socially': 1, 'driver': 1, 'unearths': 1, 'cheers': 1, 'tracking': 1, 'without': 1, 'bodies': 1, 'scholarship': 3, 'taxi': 1, 'founds': 1, 'gangu': 4, 'being': 7, 'money': 5, 'dancers': 1, 'rest': 3, 'fateful': 1, 'polish': 2, 'renewed': 1, 'seems': 1, 'lets': 2, 'interested': 2, 'treatment': 1, 'hind': 1, 'extensive': 1, 'seema': 1, 'scheduled': 1, 'jake': 4, 'around': 2, 'kills': 5, 'early': 3, 'listening': 1, 'world': 6, 'execution': 4, 'fortune': 1, 'serves': 1, 'oddly': 1, 't': 5, 'robbery': 1, 'offenders': 2, 'eugene': 1, 'tower': 2, 'exposed': 1, 'competition': 1, 'duration': 1, 'colonial': 1, 'nazi': 1, 'apocalyptic': 1, 'destroys': 2, 'football': 4, 'business': 4, 'strained': 1, 'innocence': 1, 'francis': 2, 'on': 28, 'of': 129, 'informing': 1, 'escalating': 1, 'or': 1, 'road': 2, 'widowed': 1, 'protests': 1, 'follows': 1, 'murugappan': 7, 'dreaming': 2, 'actually': 1, 'her': 34, 'tutor': 1, 'there': 2, 'start': 2, 'resolute': 1, 'poirot': 2, 'enough': 2, 'smuggling': 2, 'pleased': 1, 'strikes': 1, 'gori': 1, 'trying': 2, 'with': 55, 'handsome': 1, 'abused': 1, 'arranged': 1, 'romantic': 2, 'abuser': 1, 'deadlock': 1, 'agree': 1, 'affection': 2, 'regimes': 1, 'wants': 4, 'fist': 4, 'cinema': 1, 'ai': 1, 'deep': 1, 'general': 2, 'britain': 1, 'at': 30, 'walks': 1, 'girlfriend': 2, 'film': 7, 'again': 2, 'bumpkin': 1, 'graduate': 1, 'field': 6, 'prism': 2, 'authoritarian': 2, 'anian': 3, 'carpenter': 1, 'tailback': 1, 
'building': 1, 'wife': 12, 'having': 2, 'groundskeeping': 1, 'treason': 1, 'all': 7, 'forbidden': 1, 'invariable': 1, 'scandel': 1, 'devine': 6, 'follow': 2, 'religious': 2, 'children': 2, 'causes': 1, 'reluctant': 1, 'spirit': 1, 'to': 209, 'siblings': 1, 'woman': 3, 'appointment': 1, 'very': 1, 'fan': 1, 'fall': 3, 'knighthood': 1, 'accompanying': 1, 'small': 2, 'gang': 6, 'grateful': 1, 'azgagappa': 1, 'past': 1, 'aristocrat': 1, 'further': 1, 'befriends': 1, 'investment': 1, 'conscience': 1, 'what': 6, 'wrecked': 1, 'circumstance': 1, 'guddi': 3, 'shamsher': 2, 'public': 2, 'contrast': 1, 'turmoil': 1, 'full': 1, 'eligibility': 1, 'aristocrats': 2, 'hours': 1, 'underhanded': 1, 'ahead': 1, 'social': 1, 'via': 1, 'followed': 1, 'family': 13, 'vie': 2, 'courtier': 1, 'semester': 1, 'pichandi': 8, 'armed': 1, 'select': 1, 'shareholders': 1, 'eye': 1, 'takes': 5, 'destination': 1, 'two': 17, 'almost': 1, 'taken': 1, 'achieving': 1, 'more': 5, 'inquisition': 1, 'knows': 2, 'company': 3, 'tested': 1, 'silambam': 2, 'known': 2, 'chaplain': 1, 'town': 6, 'keeping': 1, 'offender': 2, 'science': 1, 'learn': 2, 'dee': 1, 'history': 2, 'beautiful': 1, 'junked': 1, 'abandonment': 1, 'accept': 2, 'mcqueen': 1, 'protects': 1, 'sharp': 1, 'lacks': 1, 'dress': 4, 'charged': 1, 'court': 2, 'awkward': 1, 'explains': 1, 'plans': 1, 'communicating': 1, 'notre': 10, 'shanmugam': 1, 'azhagiri': 4, 'tries': 8, 'rudy': 14, 'response': 1, 'a': 161, 'shady': 1, 'shanmughapriya': 1, 'deserted': 1, 'hatton': 1, 'climaxes': 1, 'egg': 1, 'playing': 4, 'help': 5, 'ruettiger': 7, 'soon': 1, 'trade': 1, 'paper': 2, 'through': 2, 'its': 3, 'roots': 1, 'style': 1, 'ricardo': 1, 'clashes': 2, 'someone': 2, 'return': 8, 'propose': 1, 'hunter': 1, 'thillaiammal': 4, 'outbreak': 1, 'hunted': 1, 'marcus': 9, 'fanatics': 1, 'autocracy': 1, 'pregnant': 1, 'always': 1, 'drake': 30, 'ryker': 5, 'stopped': 2, 'found': 1, 'friendship': 1, 'england': 1, 'house': 5, 'hare': 1, 'mockumentary': 1, 'idea': 1, 
'accidentally': 2, 'connect': 1, 'undertake': 1, 'suspecting': 2, 'beyond': 1, 'really': 1, 'travels': 1, 'lovers': 2, 'missed': 1, 'since': 1, 'induces': 2, 'safety': 1, 'avenge': 2, 'danilo': 2, 'horse': 1, 'capture': 1, 'qualify': 1, 'coastline': 1, 'american': 2, 'daisy': 1, 'azhagappa': 4, 'throws': 1, 'number': 1, 'barbarian': 1, 'numbed': 1, 'villa': 1, 'story': 6, 'leading': 1, 'erroll': 1, 'paint': 1, 'relationship': 4, 'trapped': 1, 'convinces': 1, 'park': 1, 'appreciation': 1, 'part': 3, 'believe': 1, 'convinced': 1, 'thambikkotai': 1, 'king': 1, 'colin': 13, 'marriage': 4, 'risks': 1, 'sentenced': 3, 'placating': 1, 'alongside': 1, 'lie': 1, 'nights': 1, 'built': 1, 'officers': 2, 'also': 7, 'costs': 1, 'finding': 1, 'play': 3, 'towards': 1, 'hires': 2, 'mentality': 1, 'english': 1, 'reach': 2, 'most': 4, 'virus': 2, 'nothing': 1, 'extremely': 1, 'clear': 1, 'justify': 1, 'dragged': 1, 'thomas': 1, 'hyper': 1, 'rebels': 1, 'converge': 1, 'jared': 4, 'find': 13, 'northern': 3, 'justice': 1, 'nervous': 1, 'ruin': 1, 'writer': 1, 'unwillingly': 1, 'failed': 1, 'captures': 1, 'his': 128, 'meanwhile': 5, 'adventurers': 1, 'alicia': 1, 'famous': 1, 'actions': 1, 'theft': 2, 'during': 9, 'him': 30, 'enemy': 1, 'retaliation': 1, 'sanam': 1, 'mistakes': 1, 'kill': 2, 'poisoned': 1, 'relentless': 1, 'body': 1, 'set': 1, 'sex': 3, 'ara': 1, 'see': 4, 'defense': 2, 'are': 19, 'close': 1, 'feared': 1, 'learns': 6, 'expert': 1, 'pictures': 1, 'powerhouse': 1, 'won': 2, 'various': 1, 'probably': 1, 'numerous': 1, 'monsters': 2, 'knowing': 2, 'creating': 1, 'unconscious': 1, 'missing': 1, 'initially': 2, 'attention': 2, 'succeed': 1, 'competent': 1, 'both': 8, 'last': 3, 'restaurant': 1, 'influential': 2, 'bandit': 3, 'roman': 2, 'became': 1, 'forgotten': 1, 'whole': 1, 'finds': 6, 'duplicity': 1, 'headed': 1, 'reasons': 1, 'steele': 3, 'cartwright': 1, 'pygmalion': 1, 'fashion': 1, 'village': 8, 'vessel': 1, 'throughout': 5, 'political': 1, 'due': 7, 'convicted': 1, 
'secret': 4, 'mutiny': 1, 'empty': 1, 'dame': 10, 'firm': 1, 'dialogue': 2, 'partly': 2, 'champion': 1, 'lives': 2, 'insiders': 1, 'fuw': 1, 'intriguing': 1, 'georgia': 2, 'uncertain': 2, 'solid': 1, 'errol': 4, 'straight': 1, 'while': 14, 'enact': 1, 'macleod': 2, 'involved': 1, 'fleet': 5, 'loot': 1, 'guide': 1, 'hoping': 1, 'kishen': 3, 'vow': 1, 'bookkeeper': 1, 'chase': 1, 'salons': 1, 'voyaging': 1, 'matchmakers': 1, 'virtually': 1, 'viola': 3, 'conflict': 1, 'development': 1, 'arrives': 2, 'henchmen': 1, 'moment': 1, 'uses': 2, 'abseils': 1, 'task': 1, 'database': 1, 'makes': 3, 'spent': 2, 'obviously': 1, 'person': 1, 'synopsis': 1, 'kanaga': 1, 'theatres': 1, 'spend': 1, 'hastings': 2, 'doughty': 19, 'exploits': 1, 'competitive': 1, 'quarters': 1, 'circling': 1, 'questions': 1, 'immediately': 2, 'vigilant': 1, 'retirement': 1, 'marrying': 1, 'workers': 1, 'parents': 1, 'impoverished': 1, 'remaining': 1, 'townsfolk': 2, 'australia': 2, 'victims': 1, 'big': 2, 'couple': 1, 'game': 6, 'quest': 2, 'crisp': 1, 'mariners': 1, 'skeptical': 1, 'd': 3, 'confront': 1, 'pillai': 1, 'mincel': 1, 'often': 1, 'some': 3, 'back': 5, 'martial': 1, 'mustang': 1, 'culprit': 2, 'decision': 1, 'civilization': 1, 'prosecuted': 1, 'dating': 1, 'invade': 1, 'resolutely': 1, 'be': 16, 'jumps': 1, 'agreement': 1, 'nowhere': 1, 'santa': 1, 'refuses': 2, 'carol': 1, 'gangamma': 5, 'wealth': 2, 'faith': 1, 'hidden': 1, 'plunder': 2, 'truck': 1, 'together': 7, 'ackett': 2, 'seeing': 1, 'conquest': 1, 'within': 3, 'martialed': 1, 'nss': 1, 'aristocratic': 1, 'witchcraft': 1, 'question': 2, 'filmy': 1, 'specifically': 1, 'explosives': 1, 'suit': 1, 'alexandria': 1, 'opens': 2, 'confessing': 1, 'himself': 5, 'an': 30, 'elsewhere': 1, 'boyd': 1, 'registered': 1, 'reiterated': 1, 'authority': 2, 'russian': 2, 'junior': 2, 'info': 1, 'characteristic': 1, 'up': 13, 'paired': 1, 'called': 3, 'influence': 1, 'doesn': 2, 'thugs': 2, 'occasional': 1, 'department': 2, 'problems': 1, 'druid': 2, 
'visits': 1, 'william': 17, 'prepares': 1, 'desert': 1, 'healers': 1, 'opposing': 1, 'memories': 1, 'labourer': 1, 'intentional': 1, 'avoid': 1, 'once': 1, 'discontent': 1, 'nuno': 2, 'infatuation': 1, 'edu': 1, 'go': 3, 'flashback': 2, 'centers': 1, 'issues': 1, 'birthplace': 1, 'turkish': 1, 'vacuous': 1, 'suave': 1, 'young': 7, 'helps': 1, 'late': 1, 'include': 1, 'stonehenge': 2, 'queen': 2, 'manipulates': 1, 'torture': 1, 'continues': 2, 'entire': 1, 'izabela': 3, 'marry': 3, 'damning': 1, 'try': 2, 'tunnel': 1, 'pleas': 1, 'waking': 1, 'video': 1, 'reunion': 1, 'incidentally': 1, 'honors': 1, 'plays': 1, 'cell': 1, 'waiting': 3, 'indian': 1, 'scenery': 1, 'led': 4, 'desires': 1, 'leg': 3, 'upsetting': 1, 'sins': 1, 'let': 1, 'separation': 1, 'others': 1, 'enterprising': 1, 'kidnapped': 1, 'great': 2, 'talent': 1, 'survivor': 1, 'distressed': 1, 'chant': 1, 'leaving': 2, 'paranoid': 1, 'resulting': 1, 'santhanam': 2, 'involves': 1, 'winkler': 1, 'named': 4, 'daybreak': 1, 'addresses': 1, 'win': 1, 'sgt': 1, 'paranoia': 1, 'expect': 1, 'scandal': 1, 'from': 20, 'remains': 1, 'next': 3, 'few': 2, 'depicted': 1, 'butler': 1, 'themselves': 2, 'stage': 1, 'defendant': 1, 'started': 1, 'becomes': 6, 'about': 10, 'train': 1, 'paroled': 1, 'informally': 1, 'baby': 3, 'had': 7, 'this': 7, 'insists': 1, 'wildest': 1, 'meet': 4, 'coffin': 1, 'hides': 1, 'escapes': 1, 'high': 4, 'serial': 1, 'sir': 1, 'united': 1, 'six': 1, 'unites': 1, 'attachment': 1, 'instead': 1, 'stock': 1, 'tension': 2, 'buildings': 1, 'attend': 1, 'farm': 1, 'unforgivable': 1, 'perjure': 1, 'impecunious': 1, 'ambiguous': 1, 'scurvy': 1, 'furious': 1, 'christopher': 1, 'subsequently': 1, 'volunteers': 1, 'produces': 1, 'encouraged': 1, 'including': 5, 'motorcycle': 1, 'superior': 1, 'gunslinger': 1, 'll': 1, 'ships': 2, 'choose': 1, 'covered': 1, 'criminal': 1, 'dan': 1, 'practice': 1, 'wows': 1, 'hands': 1, 'flee': 1, 'day': 3, 'estranged': 1, 'university': 1, 'truth': 3, 'accusers': 1, 'seniors': 
2, 'doing': 1, 'globe': 1, 'gary': 2, 'society': 2, 'salesman': 1, 'special': 2, 'out': 8, 'empires': 1, 'secretly': 2, 'matt': 1, 'frontier': 1, 'announces': 1, 'confess': 1, 'mate': 1, 'cause': 1, 'red': 1, 'shut': 1, 'southern': 4, 'attending': 1, 'completely': 1, 'york': 1, 'route': 1, 'keep': 1, 'conversation': 1, 'succeeds': 2, 'powerful': 1, 'scene': 2, 'owned': 1, 'owner': 2, 'babbington': 2, 'revealed': 1, 'chased': 2, 'system': 1, 'their': 29, 'sergeant': 1, 'final': 8, 'quarry': 1, 'academy': 1, 'doomed': 1, 'vouch': 1, 'herself': 4, 'steel': 3, 'culture': 1, 'riches': 1, 'tommy': 1, 'lover': 2, 'devotion': 1, 'accusing': 1, 'julian': 1, 'partnership': 1, 'imdb': 2, 'have': 6, 'cecil': 1, 'catholic': 1, 'studying': 1, 'able': 1, 'mid': 1, 'tomasz': 1, 'mix': 1, 'concerns': 1, 'which': 8, 'jail': 3, 'singh': 4, 'tyrannical': 1, 'lecturer': 1, 'brazil': 1, 'nabbing': 1, 'clash': 2, 'who': 35, 'vengeance': 1, 'connected': 1, 'prisoner': 1, 'why': 1, 'cki': 1, 'looked': 1, 'movie': 8, 'away': 2, 'fact': 1, 'gain': 1, 'affair': 2, 'charles': 1, 'bring': 2, 'soldiers': 1, 'fear': 3, 'agrees': 1, 'staff': 1, 'redirect': 1, 'knowledge': 1, 'tire': 1, 'winner': 1, 'millennia': 1, 'employer': 1, 'score': 1, 'riding': 2, 'piano': 1, 'local': 4, 'hope': 1, 'sued': 1, 'means': 2, 'beat': 1, 'mgr': 1, 'joins': 2, 'married': 2, 'calling': 1, 'she': 13, 'marries': 2, 'widow': 1, 'national': 1, 'computer': 1, 'destiny': 1, 'pattern': 1, 'frustrates': 1, 'tend': 1, 'state': 2, 'escapade': 1, 'neither': 1, 'frustrated': 1, 'ends': 6, 'sole': 1, 'ability': 1, 'job': 5, 'joe': 2, 'approval': 1, 'ames': 2, 'david': 1, 'career': 1, 'taking': 2, 'equal': 2, 'assure': 1, 'busted': 1, 'invites': 1, 'co': 2, 'betrayed': 1, 'confronts': 3, 'crystalline': 1, 'ad': 1, 'boatman': 1, 'slowly': 1, 'entitle': 1, 'juggles': 1, 'shoulders': 1, 'fellow': 1, 'hopfer': 1, 'perseveres': 1, 'instigates': 1, 'as': 35, 'will': 8, 'fiance': 2, 'voyeuristic': 1, 'rigged': 1, 'burkas': 1, 'thus': 1, 
'site': 1, 'respected': 1, 'searchplotwriters': 1, 'partner': 1, 'inspector': 1, 'dietz': 4, 'cross': 1, 'member': 2, 'strange': 2, 'party': 1, 'gets': 5, 'injured': 1, 'kidnapping': 2, 'http': 3, 'emotionally': 2, 'drink': 2, 'upon': 3, 'effect': 2, 'coaches': 1, 'student': 3, 'identity': 3, 'off': 5, 'center': 2, 'i': 1, 'shotgun': 1, 'well': 5, 'fighting': 4, 'command': 3, 'sets': 1, 'latest': 1, 'less': 2, 'increasingly': 1, 'underlying': 1, 'condescended': 1, 'clergyman': 1, 'wed': 1, 'investor': 1, 'taal': 4, 'bench': 1, 'book': 1, 'combine': 1, 'villagers': 2, 'match': 1, 'government': 2, 'interrogation': 1, 'five': 1, 'know': 3, 'desk': 1, 'press': 1, 'joliet': 2, 'hollow': 1, 'descendant': 1, 'necessary': 1, 'like': 1, 'success': 3, 'admitted': 2, 'loses': 1, 'dancer': 1, 'become': 4, 'works': 2, 'replacement': 1, 'because': 5, 'scared': 1, 'thief': 1, 'alive': 1, 'immortal': 3, 'motive': 1, 'babbage': 1, 'home': 14, 'empire': 5, 'competitor': 1, 'lead': 5, 'corpse': 1, 'does': 2, 'passion': 1, 'leader': 1, 'regain': 1, 'murderer': 1, 'batangas': 2, 'although': 4, 'pasta': 1, 'adversary': 1, 'hiding': 3, 'sister': 11, 'carried': 1, 'getting': 1, 'freedom': 1, 'important': 1, 'plotsummary': 1, 'persuaded': 1, 'circumnavigation': 1, 'patay': 1, 'own': 2, 'satire': 1, 'disaffected': 1, 'promise': 3, 'female': 1, 'val': 1, 'transfer': 2, 'spots': 1, 'rushes': 1, 'utopian': 3, 'notices': 1, 'stadium': 2, 'bus': 2, 'wokulski': 5, 'but': 30, 'tasked': 1, 'goals': 2, 'eat': 1, 'he': 82, 'made': 1, 'places': 1, 'dangerous': 2, 'corpses': 1, 'convince': 1, 'pl': 1, 'urging': 2, 'irish': 1, 'rightly': 2, 'inadequate': 1, 'recounts': 1, 'sweetheart': 1, 'inn': 2, 'campus': 1, 'cavalry': 1, 'sexual': 2, 'offensive': 1, 'illinois': 1, 'other': 13, 'witness': 1, 'living': 1, 'stay': 1, 'chance': 1, 'friends': 6, 'rule': 1, 'alienates': 1, 'defend': 1}

Now that we have normalised the data, we can compute the term frequency, this time using a Counter.


In [12]:
from collections import Counter

def get_tf(corpus):
    """Return a Counter mapping each word to its count across the corpus."""
    tf = Counter()
    for doc in corpus:
        for word in doc.split():
            tf[word] += 1
    return tf

tf = get_tf(corpus)
print(tf)


Counter({'the': 332, 'to': 209, 'and': 184, 'a': 161, 'of': 129, 'his': 128, 'is': 120, 'in': 101, 'he': 82, 's': 72, 'with': 55, 'that': 50, 'for': 47, 'who': 35, 'as': 35, 'her': 34, 'at': 30, 'drake': 30, 'him': 30, 'an': 30, 'but': 30, 'their': 29, 'on': 28, 'has': 27, 'by': 26, 'they': 24, 'after': 21, 'it': 20, 'from': 20, 'are': 19, 'doughty': 19, 'when': 18, 'two': 17, 'william': 17, 'be': 16, 'father': 15, 'punitha': 14, 'into': 14, 'rudy': 14, 'while': 14, 'home': 14, 'one': 13, 'family': 13, 'colin': 13, 'find': 13, 'up': 13, 'she': 13, 'other': 13, 'was': 12, 'quantrill': 12, 'yuma': 12, 'wife': 12, 'love': 11, 'only': 11, 'get': 11, 'sister': 11, 'son': 10, 'not': 10, 'jp': 10, 'notre': 10, 'dame': 10, 'about': 10, 'death': 9, 'marcus': 9, 'during': 9, 'new': 8, 'pillaival': 8, 'each': 8, 'down': 8, 'time': 8, 'murugappa': 8, 'goes': 8, 'man': 8, 'falls': 8, 'sabapathy': 8, 'pichandi': 8, 'tries': 8, 'return': 8, 'both': 8, 'village': 8, 'out': 8, 'final': 8, 'which': 8, 'movie': 8, 'will': 8, 'daughter': 7, 'against': 7, 'life': 7, 'way': 7, 'team': 7, 'being': 7, 'murugappan': 7, 'film': 7, 'all': 7, 'ruettiger': 7, 'also': 7, 'due': 7, 'together': 7, 'young': 7, 'had': 7, 'this': 7, 'men': 6, 'them': 6, 'another': 6, 'meets': 6, 'night': 6, 'girl': 6, 'world': 6, 'field': 6, 'devine': 6, 'gang': 6, 'what': 6, 'town': 6, 'story': 6, 'learns': 6, 'finds': 6, 'game': 6, 'becomes': 6, 'have': 6, 'ends': 6, 'friends': 6, 'though': 5, 'voyage': 5, 'discovers': 5, 'been': 5, 'sends': 5, 'fight': 5, 'octavius': 5, 'colleen': 5, 'hannah': 5, 'live': 5, 'can': 5, 'years': 5, 'now': 5, 'long': 5, 'city': 5, 'mangal': 5, 'assistant': 5, 'where': 5, 'college': 5, 'comes': 5, 'money': 5, 'kills': 5, 't': 5, 'takes': 5, 'more': 5, 'help': 5, 'ryker': 5, 'house': 5, 'meanwhile': 5, 'throughout': 5, 'fleet': 5, 'back': 5, 'gangamma': 5, 'himself': 5, 'including': 5, 'job': 5, 'gets': 5, 'off': 5, 'well': 5, 'because': 5, 'empire': 5, 'lead': 5, 'wokulski': 5, 
'sarah': 4, 'must': 4, 'end': 4, 'how': 4, 'before': 4, 'eventually': 4, 'silver': 4, 'side': 4, 'driving': 4, 'dsp': 4, 'da': 4, 'decides': 4, 'if': 4, 'make': 4, 'grows': 4, 'mother': 4, 'left': 4, 'tagaytay': 4, 'officer': 4, 'fox': 4, 'silva': 4, 'ship': 4, 'coach': 4, 'minnie': 4, 'parseghian': 4, 'so': 4, 'police': 4, 'killer': 4, 'year': 4, 'spanish': 4, 'friend': 4, 'than': 4, 'future': 4, 'later': 4, 'hopes': 4, 'raju': 4, 'case': 4, 'same': 4, 'gangu': 4, 'jake': 4, 'execution': 4, 'football': 4, 'business': 4, 'wants': 4, 'fist': 4, 'dress': 4, 'azhagiri': 4, 'playing': 4, 'thillaiammal': 4, 'azhagappa': 4, 'relationship': 4, 'marriage': 4, 'most': 4, 'jared': 4, 'see': 4, 'secret': 4, 'errol': 4, 'led': 4, 'named': 4, 'meet': 4, 'high': 4, 'southern': 4, 'herself': 4, 'singh': 4, 'local': 4, 'dietz': 4, 'fighting': 4, 'taal': 4, 'become': 4, 'although': 4, 'captain': 3, 'ever': 3, 'would': 3, 'army': 3, 'holy': 3, 'enter': 3, 'order': 3, 'arrested': 3, 'starts': 3, 'onto': 3, 'scissors': 3, 'too': 3, 'scenes': 3, 'responsible': 3, 'nephew': 3, 'busy': 3, 'best': 3, 'however': 3, 'news': 3, 'players': 3, 'connection': 3, 'characters': 3, 'previous': 3, 'valsone': 3, 'portuguese': 3, 'donald': 3, 'mill': 3, 'finally': 3, 'leave': 3, 'bob': 3, 'car': 3, 'believes': 3, 'agent': 3, 'still': 3, 'killed': 3, 'pelican': 3, 'zamindar': 3, 'take': 3, 'shop': 3, 'going': 3, 'reveals': 3, 'between': 3, 'killing': 3, 'former': 3, 'then': 3, 'regime': 3, 'scholarship': 3, 'rest': 3, 'early': 3, 'anian': 3, 'woman': 3, 'fall': 3, 'guddi': 3, 'company': 3, 'its': 3, 'part': 3, 'sentenced': 3, 'play': 3, 'northern': 3, 'sex': 3, 'last': 3, 'bandit': 3, 'steele': 3, 'kishen': 3, 'viola': 3, 'makes': 3, 'd': 3, 'some': 3, 'within': 3, 'called': 3, 'go': 3, 'izabela': 3, 'marry': 3, 'waiting': 3, 'leg': 3, 'next': 3, 'baby': 3, 'day': 3, 'truth': 3, 'steel': 3, 'jail': 3, 'fear': 3, 'confronts': 3, 'http': 3, 'upon': 3, 'student': 3, 'identity': 3, 'command': 3, 'know': 3, 
'success': 3, 'immortal': 3, 'hiding': 3, 'promise': 3, 'utopian': 3, 'under': 2, 'lack': 2, 'school': 2, 'told': 2, 'met': 2, 'changes': 2, 'amirthalingam': 2, 'call': 2, 'survive': 2, 'join': 2, 'work': 2, 'misfortunes': 2, 'give': 2, 'elizabeth': 2, 'law': 2, 'childhood': 2, 'allison': 2, 'amit': 2, 'break': 2, 'volcano': 2, 'daniel': 2, 'admits': 2, 'caught': 2, 'navigator': 2, 'begins': 2, 'bridge': 2, 'chrysalis': 2, 'tells': 2, 'forced': 2, 'confronted': 2, 'tech': 2, 'attempts': 2, 'just': 2, 'do': 2, 'received': 2, 'games': 2, 'com': 2, 'trust': 2, 'interest': 2, 'attacked': 2, 'leaves': 2, 'mexico': 2, 'player': 2, 'straits': 2, 'exile': 2, 'hand': 2, 'assigned': 2, 'grades': 2, 'victim': 2, 'capt': 2, 'dreams': 2, 'www': 2, 'dead': 2, 'asking': 2, 'christmas': 2, 'months': 2, 'formerly': 2, 'shoots': 2, 'support': 2, 'war': 2, 'uncle': 2, 'inside': 2, 'no': 2, 'setting': 2, 'amergan': 2, 'dies': 2, 'suicide': 2, 'nihal': 2, 'stays': 2, 'did': 2, 'mwansa': 2, 'brother': 2, 'settle': 2, 'unexpected': 2, 'says': 2, 'current': 2, 'supporting': 2, 'alone': 2, 'boy': 2, 'trial': 2, 'working': 2, 'prove': 2, 'valli': 2, 'following': 2, 'monitoring': 2, 'troubles': 2, 'rock': 2, 'card': 2, 'honest': 2, 'place': 2, 'teammates': 2, 'first': 2, 'returns': 2, 'twenties': 2, 'released': 2, 'venture': 2, 'warsaw': 2, 'note': 2, 'fails': 2, 'dream': 2, 'professional': 2, 'senior': 2, 'fletcher': 2, 'mission': 2, 'suspects': 2, 'cedric': 2, 'nearby': 2, 'crew': 2, 'abdul': 2, 'vacation': 2, 'situation': 2, 'hunts': 2, 'good': 2, 'manila': 2, 'polish': 2, 'lets': 2, 'interested': 2, 'around': 2, 'offenders': 2, 'tower': 2, 'destroys': 2, 'francis': 2, 'road': 2, 'dreaming': 2, 'there': 2, 'start': 2, 'poirot': 2, 'enough': 2, 'smuggling': 2, 'trying': 2, 'romantic': 2, 'affection': 2, 'general': 2, 'girlfriend': 2, 'again': 2, 'prism': 2, 'authoritarian': 2, 'having': 2, 'follow': 2, 'religious': 2, 'children': 2, 'small': 2, 'shamsher': 2, 'public': 2, 'aristocrats': 2, 
'vie': 2, 'knows': 2, 'silambam': 2, 'known': 2, 'offender': 2, 'learn': 2, 'history': 2, 'accept': 2, 'court': 2, 'paper': 2, 'through': 2, 'clashes': 2, 'someone': 2, 'stopped': 2, 'accidentally': 2, 'suspecting': 2, 'lovers': 2, 'induces': 2, 'avenge': 2, 'danilo': 2, 'american': 2, 'officers': 2, 'hires': 2, 'reach': 2, 'virus': 2, 'theft': 2, 'kill': 2, 'defense': 2, 'won': 2, 'monsters': 2, 'knowing': 2, 'initially': 2, 'attention': 2, 'influential': 2, 'roman': 2, 'dialogue': 2, 'partly': 2, 'lives': 2, 'georgia': 2, 'uncertain': 2, 'macleod': 2, 'arrives': 2, 'uses': 2, 'spent': 2, 'hastings': 2, 'immediately': 2, 'townsfolk': 2, 'australia': 2, 'big': 2, 'quest': 2, 'culprit': 2, 'refuses': 2, 'wealth': 2, 'plunder': 2, 'ackett': 2, 'question': 2, 'opens': 2, 'authority': 2, 'russian': 2, 'junior': 2, 'doesn': 2, 'thugs': 2, 'department': 2, 'druid': 2, 'nuno': 2, 'flashback': 2, 'stonehenge': 2, 'queen': 2, 'continues': 2, 'try': 2, 'great': 2, 'leaving': 2, 'santhanam': 2, 'few': 2, 'themselves': 2, 'tension': 2, 'ships': 2, 'seniors': 2, 'gary': 2, 'society': 2, 'special': 2, 'secretly': 2, 'succeeds': 2, 'scene': 2, 'owner': 2, 'babbington': 2, 'chased': 2, 'lover': 2, 'imdb': 2, 'clash': 2, 'away': 2, 'affair': 2, 'bring': 2, 'riding': 2, 'means': 2, 'joins': 2, 'married': 2, 'marries': 2, 'state': 2, 'joe': 2, 'ames': 2, 'taking': 2, 'equal': 2, 'co': 2, 'fiance': 2, 'member': 2, 'strange': 2, 'kidnapping': 2, 'emotionally': 2, 'drink': 2, 'effect': 2, 'center': 2, 'less': 2, 'villagers': 2, 'government': 2, 'joliet': 2, 'admitted': 2, 'works': 2, 'does': 2, 'batangas': 2, 'own': 2, 'transfer': 2, 'stadium': 2, 'bus': 2, 'goals': 2, 'dangerous': 2, 'urging': 2, 'rightly': 2, 'inn': 2, 'sexual': 2, 'baskar': 1, 'advices': 1, 'demanded': 1, 'protest': 1, 'offenses': 1, 'disability': 1, 'pensions': 1, 'bike': 1, 'teaching': 1, 'merchant': 1, 'rise': 1, 'connects': 1, 'every': 1, 'confederate': 1, 'stabbed': 1, 'four': 1, 'prize': 1, 'skills': 1, 
'triumph': 1, 'force': 1, 'warns': 1, 'direct': 1, 'preacher': 1, 'second': 1, 'persuade': 1, 'even': 1, 'ruthless': 1, 'ned': 1, 'beaten': 1, 'corporation': 1, 'increasing': 1, 'hero': 1, 'whose': 1, 'protection': 1, 'china': 1, 'employees': 1, 'browsing': 1, 'military': 1, 'golden': 1, 'secure': 1, 'brought': 1, 'guests': 1, 'tutelage': 1, 'unit': 1, 'handedness': 1, 'chooses': 1, 'tell': 1, 'coffins': 1, 'successful': 1, 'brings': 1, 'aware': 1, 'warn': 1, 'phone': 1, 'lord': 1, 'shoot': 1, 'room': 1, 'rights': 1, 'pursue': 1, 'advocacy': 1, 'mechanic': 1, 'climax': 1, 'want': 1, 'times': 1, 'unforgettable': 1, 'travel': 1, 'badly': 1, 'poker': 1, 'misunderstanding': 1, 'shores': 1, 'lay': 1, 'curiosity': 1, 'burghley': 1, 'siberia': 1, 'wins': 1, 'descends': 1, 'ultimate': 1, 'wind': 1, 'wine': 1, 'executed': 1, 'over': 1, 'tricked': 1, 'kickoff': 1, 'sidekicks': 1, 'personal': 1, 'fix': 1, 'exhibits': 1, 'writing': 1, 'destroyed': 1, 'weeks': 1, 'overcome': 1, 'pleasurable': 1, 'lifelong': 1, 'routines': 1, 'l': 1, 'victory': 1, 'schemer': 1, 'stealing': 1, 'merchants': 1, 'psychic': 1, 're': 1, 'encourage': 1, 'parvarish': 1, 'lawless': 1, 'revenge': 1, 'free': 1, 'formation': 1, 'delves': 1, 'days': 1, 'enraged': 1, 'already': 1, 'rank': 1, 'hearing': 1, 'enagaged': 1, 'top': 1, 'girls': 1, 'boating': 1, 'john': 1, 'ranging': 1, 'murder': 1, 'took': 1, 'somewhat': 1, 'tuscany': 1, 'eastman': 1, 'extracting': 1, 'sees': 1, 'longs': 1, 'modern': 1, 'upset': 1, 'talking': 1, 'alibi': 1, 'causing': 1, 'forces': 1, 'quarterback': 1, 'germany': 1, 'letter': 1, 'unorthodox': 1, 'grave': 1, 'maria': 1, 'singer': 1, 'don': 1, 'observation': 1, 'professor': 1, 'm': 1, 'makati': 1, 'sum': 1, 'saying': 1, 'bomb': 1, 'random': 1, 'ending': 1, 'dopes': 1, 'headline': 1, 'situations': 1, 'rich': 1, 'pursuing': 1, 'jailed': 1, 'stop': 1, 'haunted': 1, 'despite': 1, 'report': 1, 'dr': 1, 'angers': 1, 'guns': 1, 'shots': 1, 'release': 1, 'unwitting': 1, 'shula': 1, 
'secretary': 1, 'crewmen': 1, 'checks': 1, 'disturbing': 1, 'nocturnal': 1, 'jerseys': 1, 'pete': 1, 'draws': 1, 'scrap': 1, 'gentleman': 1, 'mud': 1, 'unable': 1, 'cooperation': 1, 'encounters': 1, 'lazy': 1, 'nature': 1, 'portrays': 1, 'country': 1, 'frequenting': 1, 'smuggler': 1, 'san': 1, 'asks': 1, 'three': 1, 'communion': 1, 'families': 1, 'filmmaker': 1, 'child': 1, 'physician': 1, 'conviction': 1, 'tomasina': 1, 'contemplates': 1, 'near': 1, 'balance': 1, 'dumped': 1, 'damaged': 1, 'things': 1, 'clearly': 1, 'roland': 1, 'several': 1, 'raid': 1, 'consents': 1, 'thoughts': 1, 'kept': 1, 'kyle': 1, 'academic': 1, 'veils': 1, 'haberdashery': 1, 'proposes': 1, 'financially': 1, 'plight': 1, 'yet': 1, 'bhangra': 1, 'enters': 1, 'burnt': 1, 'character': 1, 'spread': 1, 'interprets': 1, 'humanity': 1, 'mocked': 1, 'breaks': 1, 'addicts': 1, 'birth': 1, 'vigilante': 1, 'korean': 1, 'apart': 1, 'steps': 1, 'notorious': 1, 'right': 1, 'old': 1, 'deal': 1, 'people': 1, 'ironically': 1, 'everything': 1, 'disappears': 1, 'conquering': 1, 'whitaker': 1, 'burn': 1, 'deadly': 1, 'idolized': 1, 'confrontation': 1, 'defensive': 1, 'post': 1, 'descendent': 1, 'hostilities': 1, 'o': 1, 'tyree': 1, 'efforts': 1, 'mast': 1, 'bound': 1, 'raises': 1, 'flying': 1, 'rapes': 1, 'quirky': 1, 'happy': 1, 'head': 1, 'becoming': 1, 'differences': 1, 'mysterious': 1, 'heap': 1, 'syndicate': 1, 'true': 1, 'duel': 1, 'overrules': 1, 'attached': 1, 'until': 1, 'plymouth': 1, 'promises': 1, 'adopt': 1, 'request': 1, 'evidence': 1, 'proves': 1, 'trip': 1, 'physical': 1, 'eventual': 1, 'actor': 1, 'role': 1, 'unwilling': 1, 'quickening': 1, 'realize': 1, 'brothers': 1, 'welcome': 1, 'russo': 1, 'died': 1, 'longer': 1, 'sacks': 1, 'daily': 1, 'rejections': 1, 'intimacy': 1, 'neighbors': 1, 'detective': 1, 'interlaces': 1, 'leads': 1, 'manager': 1, 'manages': 1, 'bandits': 1, 'battle': 1, 'hijacked': 1, 'certainly': 1, 'circumstances': 1, 'division': 1, 'none': 1, 'seemingly': 1, 'anonymous': 1, 
'word': 1, 'minute': 1, 'cooks': 1, 'die': 1, 'jamie': 1, 'temple': 1, 'perceives': 1, 'loads': 1, 'unaware': 1, 'detonated': 1, 'discover': 1, 'agitate': 1, 'detonates': 1, 'drinks': 1, 'appear': 1, 'havoc': 1, 'falling': 1, 'cynically': 1, 'filled': 1, 'satisfy': 1, 'explosion': 1, 'climbs': 1, 'muslim': 1, 'downloaded': 1, 'crippled': 1, 'teacher': 1, 'supplying': 1, 'women': 1, 'touchdown': 1, 'fbi': 1, 'opposes': 1, 'troops': 1, 'angry': 1, 'sports': 1, 'peru': 1, 'memory': 1, 'prosecutor': 1, 'wicked': 1, 'tsarist': 1, 'club': 1, 'apparent': 1, 'given': 1, 'riders': 1, 'capturing': 1, 'valued': 1, 'defeats': 1, 'stature': 1, 'making': 1, 'indolence': 1, 'streak': 1, 'heart': 1, 'crazy': 1, 'reprimands': 1, 'figure': 1, 'awesome': 1, 'lowry': 1, 'confused': 1, 'acapulco': 1, 'allowed': 1, 'thilakam': 1, 'bankrupt': 1, 'spreads': 1, 'staying': 1, 'crashes': 1, 'escalate': 1, 'max': 1, 'spot': 1, 'informs': 1, 'mad': 1, 'such': 1, 'heroes': 1, 'maj': 1, 'data': 1, 'natural': 1, 'succeeded': 1, 'sr': 1, 'tale': 1, 'drunken': 1, 'repair': 1, 'indeed': 1, 'course': 1, 'worship': 1, 'blocked': 1, 'forms': 1, 'offers': 1, 'main': 1, 'happened': 1, 'non': 1, 'halt': 1, 'thereby': 1, 'doll': 1, 'realities': 1, 'wealthy': 1, 'occult': 1, 'name': 1, 'didn': 1, 'repaired': 1, 'chander': 1, 'directing': 1, 'morning': 1, 'emerges': 1, 'profit': 1, 'investigation': 1, 'uprising': 1, 'montoya': 1, 'hefty': 1, 'romans': 1, 'quits': 1, 'care': 1, 'waiter': 1, 'british': 1, 'think': 1, 'surviving': 1, 'sentencing': 1, 'impossible': 1, 'array': 1, 'little': 1, 'immortals': 1, 'bounty': 1, 'mostly': 1, 'season': 1, 'viewing': 1, 'wide': 1, 'were': 1, 'poisoning': 1, 'russia': 1, 'armored': 1, 'sam': 1, 'ann': 1, 'argument': 1, 'sad': 1, 'adele': 1, 'confesses': 1, 'nuances': 1, 'squad': 1, 'bailey': 1, 'pranks': 1, 'begin': 1, 'multiple': 1, 'roster': 1, 'track': 1, 'betrayal': 1, 'drive': 1, 'defeating': 1, 'sunset': 1, 'walking': 1, 'shot': 1, 'contemporary': 1, 'bright': 1, 
'ground': 1, 'frequent': 1, 'title': 1, 'mexicans': 1, 'bringer': 1, 'ranking': 1, 'cannot': 1, 'regarding': 1, 'requesting': 1, 'jokes': 1, 'priest': 1, 'husband': 1, 'bulgaria': 1, 'gangster': 1, 'sport': 1, 'janet': 1, 'loudspeaker': 1, 'ostensibly': 1, 'subsequent': 1, 'sponsorship': 1, 'outside': 1, 'arrogant': 1, 'across': 1, 'inseparable': 1, 'childbirth': 1, 'come': 1, 'ringleader': 1, 'reaction': 1, 'acquiring': 1, 'many': 1, 'pakistan': 1, 'grants': 1, 'considers': 1, 'better': 1, 'arts': 1, 'underworld': 1, 'dyslexia': 1, 'cares': 1, 'mary': 1, 'engine': 1, 'educated': 1, 'dramatic': 1, 'wake': 1, 'developing': 1, 'these': 1, 'promising': 1, 'newspaper': 1, 'endowed': 1, 'engaged': 1, 'refinery': 1, 'everyday': 1, 'doctor': 1, 'theorized': 1, 'tour': 1, 'gentlemen': 1, 'attributing': 1, 'verifies': 1, 'speech': 1, 'buys': 1, 'struggling': 1, 'events': 1, 'status': 1, 'noble': 1, 'closest': 1, 'socially': 1, 'driver': 1, 'unearths': 1, 'cheers': 1, 'tracking': 1, 'without': 1, 'bodies': 1, 'taxi': 1, 'founds': 1, 'dancers': 1, 'fateful': 1, 'renewed': 1, 'seems': 1, 'treatment': 1, 'hind': 1, 'extensive': 1, 'seema': 1, 'scheduled': 1, 'listening': 1, 'fortune': 1, 'serves': 1, 'oddly': 1, 'robbery': 1, 'eugene': 1, 'exposed': 1, 'competition': 1, 'duration': 1, 'colonial': 1, 'nazi': 1, 'apocalyptic': 1, 'strained': 1, 'innocence': 1, 'informing': 1, 'escalating': 1, 'or': 1, 'widowed': 1, 'protests': 1, 'follows': 1, 'actually': 1, 'tutor': 1, 'resolute': 1, 'pleased': 1, 'strikes': 1, 'gori': 1, 'handsome': 1, 'abused': 1, 'arranged': 1, 'abuser': 1, 'deadlock': 1, 'agree': 1, 'regimes': 1, 'cinema': 1, 'ai': 1, 'deep': 1, 'britain': 1, 'walks': 1, 'bumpkin': 1, 'graduate': 1, 'carpenter': 1, 'tailback': 1, 'building': 1, 'groundskeeping': 1, 'treason': 1, 'forbidden': 1, 'invariable': 1, 'scandel': 1, 'causes': 1, 'reluctant': 1, 'spirit': 1, 'siblings': 1, 'appointment': 1, 'very': 1, 'fan': 1, 'knighthood': 1, 'accompanying': 1, 'grateful': 1, 
'azgagappa': 1, 'past': 1, 'aristocrat': 1, 'further': 1, 'befriends': 1, 'investment': 1, 'conscience': 1, 'wrecked': 1, 'circumstance': 1, 'contrast': 1, 'turmoil': 1, 'full': 1, 'eligibility': 1, 'hours': 1, 'underhanded': 1, 'ahead': 1, 'social': 1, 'via': 1, 'followed': 1, 'courtier': 1, 'semester': 1, 'armed': 1, 'select': 1, 'shareholders': 1, 'eye': 1, 'destination': 1, 'almost': 1, 'taken': 1, 'achieving': 1, 'inquisition': 1, 'tested': 1, 'chaplain': 1, 'keeping': 1, 'science': 1, 'dee': 1, 'beautiful': 1, 'junked': 1, 'abandonment': 1, 'mcqueen': 1, 'protects': 1, 'sharp': 1, 'lacks': 1, 'charged': 1, 'awkward': 1, 'explains': 1, 'plans': 1, 'communicating': 1, 'shanmugam': 1, 'response': 1, 'shady': 1, 'shanmughapriya': 1, 'deserted': 1, 'hatton': 1, 'climaxes': 1, 'egg': 1, 'soon': 1, 'trade': 1, 'roots': 1, 'style': 1, 'ricardo': 1, 'propose': 1, 'hunter': 1, 'outbreak': 1, 'hunted': 1, 'fanatics': 1, 'autocracy': 1, 'pregnant': 1, 'always': 1, 'found': 1, 'friendship': 1, 'england': 1, 'hare': 1, 'mockumentary': 1, 'idea': 1, 'connect': 1, 'undertake': 1, 'beyond': 1, 'really': 1, 'travels': 1, 'missed': 1, 'since': 1, 'safety': 1, 'horse': 1, 'capture': 1, 'qualify': 1, 'coastline': 1, 'daisy': 1, 'throws': 1, 'number': 1, 'barbarian': 1, 'numbed': 1, 'villa': 1, 'leading': 1, 'erroll': 1, 'paint': 1, 'trapped': 1, 'convinces': 1, 'park': 1, 'appreciation': 1, 'believe': 1, 'convinced': 1, 'thambikkotai': 1, 'king': 1, 'risks': 1, 'placating': 1, 'alongside': 1, 'lie': 1, 'nights': 1, 'built': 1, 'costs': 1, 'finding': 1, 'towards': 1, 'mentality': 1, 'english': 1, 'nothing': 1, 'extremely': 1, 'clear': 1, 'justify': 1, 'dragged': 1, 'thomas': 1, 'hyper': 1, 'rebels': 1, 'converge': 1, 'justice': 1, 'nervous': 1, 'ruin': 1, 'writer': 1, 'unwillingly': 1, 'failed': 1, 'captures': 1, 'adventurers': 1, 'alicia': 1, 'famous': 1, 'actions': 1, 'enemy': 1, 'retaliation': 1, 'sanam': 1, 'mistakes': 1, 'poisoned': 1, 'relentless': 1, 'body': 1, 'set': 1, 
'ara': 1, 'close': 1, 'feared': 1, 'expert': 1, 'pictures': 1, 'powerhouse': 1, 'various': 1, 'probably': 1, 'numerous': 1, 'creating': 1, 'unconscious': 1, 'missing': 1, 'succeed': 1, 'competent': 1, 'restaurant': 1, 'became': 1, 'forgotten': 1, 'whole': 1, 'duplicity': 1, 'headed': 1, 'reasons': 1, 'cartwright': 1, 'pygmalion': 1, 'fashion': 1, 'vessel': 1, 'political': 1, 'convicted': 1, 'mutiny': 1, 'empty': 1, 'firm': 1, 'champion': 1, 'insiders': 1, 'fuw': 1, 'intriguing': 1, 'solid': 1, 'straight': 1, 'enact': 1, 'involved': 1, 'loot': 1, 'guide': 1, 'hoping': 1, 'vow': 1, 'bookkeeper': 1, 'chase': 1, 'salons': 1, 'voyaging': 1, 'matchmakers': 1, 'virtually': 1, 'conflict': 1, 'development': 1, 'henchmen': 1, 'moment': 1, 'abseils': 1, 'task': 1, 'database': 1, 'obviously': 1, 'person': 1, 'synopsis': 1, 'kanaga': 1, 'theatres': 1, 'spend': 1, 'exploits': 1, 'competitive': 1, 'quarters': 1, 'circling': 1, 'questions': 1, 'vigilant': 1, 'retirement': 1, 'marrying': 1, 'workers': 1, 'parents': 1, 'impoverished': 1, 'remaining': 1, 'victims': 1, 'couple': 1, 'crisp': 1, 'mariners': 1, 'skeptical': 1, 'confront': 1, 'pillai': 1, 'mincel': 1, 'often': 1, 'martial': 1, 'mustang': 1, 'decision': 1, 'civilization': 1, 'prosecuted': 1, 'dating': 1, 'invade': 1, 'resolutely': 1, 'jumps': 1, 'agreement': 1, 'nowhere': 1, 'santa': 1, 'carol': 1, 'faith': 1, 'hidden': 1, 'truck': 1, 'seeing': 1, 'conquest': 1, 'martialed': 1, 'nss': 1, 'aristocratic': 1, 'witchcraft': 1, 'filmy': 1, 'specifically': 1, 'explosives': 1, 'suit': 1, 'alexandria': 1, 'confessing': 1, 'elsewhere': 1, 'boyd': 1, 'registered': 1, 'reiterated': 1, 'info': 1, 'characteristic': 1, 'paired': 1, 'influence': 1, 'occasional': 1, 'problems': 1, 'visits': 1, 'prepares': 1, 'desert': 1, 'healers': 1, 'opposing': 1, 'memories': 1, 'labourer': 1, 'intentional': 1, 'avoid': 1, 'once': 1, 'discontent': 1, 'infatuation': 1, 'edu': 1, 'centers': 1, 'issues': 1, 'birthplace': 1, 'turkish': 1, 'vacuous': 1, 
'suave': 1, 'helps': 1, 'late': 1, 'include': 1, 'manipulates': 1, 'torture': 1, 'entire': 1, 'damning': 1, 'tunnel': 1, 'pleas': 1, 'waking': 1, 'video': 1, 'reunion': 1, 'incidentally': 1, 'honors': 1, 'plays': 1, 'cell': 1, 'indian': 1, 'scenery': 1, 'desires': 1, 'upsetting': 1, 'sins': 1, 'let': 1, 'separation': 1, 'others': 1, 'enterprising': 1, 'kidnapped': 1, 'talent': 1, 'survivor': 1, 'distressed': 1, 'chant': 1, 'paranoid': 1, 'resulting': 1, 'involves': 1, 'winkler': 1, 'daybreak': 1, 'addresses': 1, 'win': 1, 'sgt': 1, 'paranoia': 1, 'expect': 1, 'scandal': 1, 'remains': 1, 'depicted': 1, 'butler': 1, 'stage': 1, 'defendant': 1, 'started': 1, 'train': 1, 'paroled': 1, 'informally': 1, 'insists': 1, 'wildest': 1, 'coffin': 1, 'hides': 1, 'escapes': 1, 'serial': 1, 'sir': 1, 'united': 1, 'six': 1, 'unites': 1, 'attachment': 1, 'instead': 1, 'stock': 1, 'buildings': 1, 'attend': 1, 'farm': 1, 'unforgivable': 1, 'perjure': 1, 'impecunious': 1, 'ambiguous': 1, 'scurvy': 1, 'furious': 1, 'christopher': 1, 'subsequently': 1, 'volunteers': 1, 'produces': 1, 'encouraged': 1, 'motorcycle': 1, 'superior': 1, 'gunslinger': 1, 'll': 1, 'choose': 1, 'covered': 1, 'criminal': 1, 'dan': 1, 'practice': 1, 'wows': 1, 'hands': 1, 'flee': 1, 'estranged': 1, 'university': 1, 'accusers': 1, 'doing': 1, 'globe': 1, 'salesman': 1, 'empires': 1, 'matt': 1, 'frontier': 1, 'announces': 1, 'confess': 1, 'mate': 1, 'cause': 1, 'red': 1, 'shut': 1, 'attending': 1, 'completely': 1, 'york': 1, 'route': 1, 'keep': 1, 'conversation': 1, 'powerful': 1, 'owned': 1, 'revealed': 1, 'system': 1, 'sergeant': 1, 'quarry': 1, 'academy': 1, 'doomed': 1, 'vouch': 1, 'culture': 1, 'riches': 1, 'tommy': 1, 'devotion': 1, 'accusing': 1, 'julian': 1, 'partnership': 1, 'cecil': 1, 'catholic': 1, 'studying': 1, 'able': 1, 'mid': 1, 'tomasz': 1, 'mix': 1, 'concerns': 1, 'tyrannical': 1, 'lecturer': 1, 'brazil': 1, 'nabbing': 1, 'vengeance': 1, 'connected': 1, 'prisoner': 1, 'why': 1, 'cki': 1, 
'looked': 1, 'fact': 1, 'gain': 1, 'charles': 1, 'soldiers': 1, 'agrees': 1, 'staff': 1, 'redirect': 1, 'knowledge': 1, 'tire': 1, 'winner': 1, 'millennia': 1, 'employer': 1, 'score': 1, 'piano': 1, 'hope': 1, 'sued': 1, 'beat': 1, 'mgr': 1, 'calling': 1, 'widow': 1, 'national': 1, 'computer': 1, 'destiny': 1, 'pattern': 1, 'frustrates': 1, 'tend': 1, 'escapade': 1, 'neither': 1, 'frustrated': 1, 'sole': 1, 'ability': 1, 'approval': 1, 'david': 1, 'career': 1, 'assure': 1, 'busted': 1, 'invites': 1, 'betrayed': 1, 'crystalline': 1, 'ad': 1, 'boatman': 1, 'slowly': 1, 'entitle': 1, 'juggles': 1, 'shoulders': 1, 'fellow': 1, 'hopfer': 1, 'perseveres': 1, 'instigates': 1, 'voyeuristic': 1, 'rigged': 1, 'burkas': 1, 'thus': 1, 'site': 1, 'respected': 1, 'searchplotwriters': 1, 'partner': 1, 'inspector': 1, 'cross': 1, 'party': 1, 'injured': 1, 'coaches': 1, 'i': 1, 'shotgun': 1, 'sets': 1, 'latest': 1, 'increasingly': 1, 'underlying': 1, 'condescended': 1, 'clergyman': 1, 'wed': 1, 'investor': 1, 'bench': 1, 'book': 1, 'combine': 1, 'match': 1, 'interrogation': 1, 'five': 1, 'desk': 1, 'press': 1, 'hollow': 1, 'descendant': 1, 'necessary': 1, 'like': 1, 'loses': 1, 'dancer': 1, 'replacement': 1, 'scared': 1, 'thief': 1, 'alive': 1, 'motive': 1, 'babbage': 1, 'competitor': 1, 'corpse': 1, 'passion': 1, 'leader': 1, 'regain': 1, 'murderer': 1, 'pasta': 1, 'adversary': 1, 'carried': 1, 'getting': 1, 'freedom': 1, 'important': 1, 'plotsummary': 1, 'persuaded': 1, 'circumnavigation': 1, 'patay': 1, 'satire': 1, 'disaffected': 1, 'female': 1, 'val': 1, 'spots': 1, 'rushes': 1, 'notices': 1, 'tasked': 1, 'eat': 1, 'made': 1, 'places': 1, 'corpses': 1, 'convince': 1, 'pl': 1, 'irish': 1, 'inadequate': 1, 'recounts': 1, 'sweetheart': 1, 'campus': 1, 'cavalry': 1, 'offensive': 1, 'illinois': 1, 'witness': 1, 'living': 1, 'stay': 1, 'chance': 1, 'rule': 1, 'alienates': 1, 'defend': 1})

term freq per document


In [16]:
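from collections import Counter

def get_tf(document):
    # term frequency: how often each whitespace-separated term occurs
    tf = Counter()
    for word in document.split():
        tf[word] += 1
    return tf

def get_dtf(corpus):
    # one term frequency Counter per document, keyed by document index
    dtf = {}
    for i,doc in enumerate(corpus):
        dtf[i]= get_tf(doc)
    return dtf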

dtf = get_dtf(items_d)
dtf[342]


Out[16]:
Counter({'a': 7,
         'about': 1,
         'again': 1,
         'and': 26,
         'angry': 1,
         'are': 1,
         'around': 1,
         'as': 1,
         'at': 1,
         'attempt': 1,
         'away': 2,
         'back': 3,
         'barking': 1,
         'be': 3,
         'been': 1,
         'begins': 1,
         'bone': 1,
         'but': 7,
         'by': 1,
         'can': 2,
         'catcher': 4,
         'catches': 1,
         'caught': 2,
         'chases': 1,
         'chasing': 1,
         'city': 2,
         'cover': 1,
         'crawls': 1,
         'cries': 1,
         'day': 1,
         'digs': 1,
         'disguises': 1,
         'doesn': 1,
         'dog': 16,
         'dogs': 1,
         'drama': 1,
         'driver': 1,
         'drives': 1,
         'driving': 1,
         'enter': 1,
         'escapes': 3,
         'fools': 1,
         'for': 2,
         'from': 4,
         'frowned': 1,
         'gate': 2,
         'get': 2,
         'gets': 2,
         'gives': 1,
         'goes': 2,
         'going': 1,
         'grabs': 3,
         'happily': 1,
         'happy': 1,
         'he': 22,
         'head': 1,
         'hides': 2,
         'him': 6,
         'himself': 2,
         'his': 8,
         'hits': 1,
         'hole': 2,
         'horrified': 1,
         'house': 1,
         'humming': 1,
         'hungry': 1,
         'in': 4,
         'inside': 1,
         'is': 9,
         'it': 6,
         'jerry': 1,
         'know': 1,
         'lamp': 1,
         'last': 1,
         'lets': 1,
         'license': 7,
         'licenses': 1,
         'locked': 1,
         'looks': 2,
         'main': 1,
         'manhole': 1,
         'napkin': 1,
         'news': 1,
         'newspaper': 3,
         'no': 1,
         'now': 1,
         'of': 1,
         'off': 1,
         'order': 1,
         'orders': 1,
         'own': 1,
         'panicking': 1,
         'past': 1,
         'pound': 3,
         'protagonist': 1,
         'pursues': 1,
         'quiet': 1,
         'reads': 2,
         'ready': 1,
         'realizes': 1,
         'remain': 1,
         'remove': 1,
         'roll': 1,
         'runs': 1,
         's': 4,
         'sacrifice': 1,
         'says': 2,
         'sees': 7,
         'shows': 1,
         'sits': 1,
         'sleep': 2,
         'sleeping': 1,
         'some': 1,
         'son': 1,
         'song': 1,
         'speeds': 1,
         'spike': 16,
         'stick': 1,
         'stops': 1,
         'street': 1,
         't': 1,
         'tags': 1,
         'taken': 1,
         'taking': 1,
         'tells': 1,
         'that': 2,
         'the': 34,
         'theme': 1,
         'then': 3,
         'they': 1,
         'through': 1,
         'throws': 1,
         'tip': 1,
         'to': 16,
         'toes': 1,
         'tom': 1,
         'took': 1,
         'trash': 2,
         'tricked': 1,
         'tries': 1,
         'truck': 4,
         'turns': 1,
         'tyke': 7,
         'under': 1,
         'up': 2,
         'use': 1,
         'uses': 1,
         'wakes': 1,
         'walk': 1,
         'was': 1,
         'wear': 1,
         'wears': 1,
         'when': 1,
         'where': 1,
         'while': 1,
         'who': 3,
         'will': 1,
         'with': 2,
         'without': 1,
         'yawns': 1,
         'yet': 1})
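
As a quick check of what these per-document counts look like, here is a minimal sketch on a made-up two-document corpus. The names toy_corpus and toy_dtf are hypothetical and only serve this illustration; the sketch builds each Counter directly from the split document, which should give the same counts as the loop-based approach.

```python
from collections import Counter

# Hypothetical two-document corpus, used only for illustration.
toy_corpus = ['the dog chases the cat', 'the cat sleeps']

# One Counter per document, keyed by the document's position in the corpus.
toy_dtf = {i: Counter(doc.split()) for i, doc in enumerate(toy_corpus)}

# A Counter returns 0 for terms it has never seen instead of raising KeyError.
print(toy_dtf[0]['the'])   # 2
print(toy_dtf[1]['dog'])   # 0
```

The zero default is convenient later on, since most terms of the corpus do not occur in any given document.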

compute dtf for item descriptions


In [17]:
dtf = get_dtf(items_d)
dtf[12]


Out[17]:
Counter({'a': 10,
         'ability': 1,
         'accept': 1,
         'affection': 1,
         'after': 1,
         'against': 1,
         'an': 1,
         'and': 6,
         'are': 1,
         'aristocrat': 1,
         'aristocratic': 1,
         'aristocrats': 2,
         'army': 1,
         'as': 3,
         'at': 3,
         'aware': 1,
         'bankrupt': 1,
         'because': 1,
         'becomes': 2,
         'begins': 1,
         'bulgaria': 1,
         'business': 2,
         'but': 3,
         'cinema': 1,
         'cki': 1,
         'com': 1,
         'comes': 1,
         'company': 1,
         'condescended': 1,
         'consents': 1,
         'database': 1,
         'daughter': 1,
         'descendant': 1,
         'devotion': 1,
         'distressed': 1,
         'dreaming': 1,
         'during': 1,
         'edu': 1,
         'end': 1,
         'enterprising': 1,
         'eventual': 1,
         'exile': 2,
         'failed': 1,
         'falling': 1,
         'family': 2,
         'father': 2,
         'filmy': 1,
         'financially': 1,
         'forced': 1,
         'fortune': 1,
         'founds': 1,
         'frequenting': 1,
         'frustrates': 1,
         'fuw': 1,
         'girl': 1,
         'go': 1,
         'haberdashery': 1,
         'had': 1,
         'he': 4,
         'heart': 1,
         'help': 2,
         'her': 1,
         'him': 1,
         'his': 6,
         'hopfer': 1,
         'http': 2,
         'imdb': 1,
         'impecunious': 1,
         'impoverished': 1,
         'in': 8,
         'indolence': 1,
         'influential': 1,
         'info': 1,
         'into': 1,
         'is': 4,
         'it': 1,
         'izabela': 3,
         'lack': 1,
         'late': 1,
         'lazy': 1,
         'life': 1,
         'love': 2,
         'make': 1,
         'makes': 1,
         'marrying': 1,
         'merchant': 1,
         'merchants': 1,
         'met': 1,
         'mincel': 1,
         'money': 2,
         'new': 1,
         'noble': 1,
         'now': 1,
         'of': 8,
         'on': 1,
         'or': 1,
         'owner': 1,
         'part': 1,
         'partnership': 1,
         'pensions': 1,
         'pl': 1,
         'polish': 2,
         'proves': 1,
         'quest': 1,
         'rank': 1,
         'respected': 1,
         'restaurant': 1,
         'return': 1,
         'risks': 1,
         'romantic': 1,
         'russia': 1,
         'russian': 2,
         'russo': 1,
         's': 4,
         'salesman': 1,
         'salons': 1,
         'science': 1,
         'searchplotwriters': 1,
         'secure': 1,
         'sentenced': 1,
         'set': 1,
         'sets': 1,
         'shareholders': 1,
         'she': 1,
         'siberia': 1,
         'social': 1,
         'supplying': 1,
         'taking': 1,
         'the': 11,
         'theatres': 1,
         'their': 1,
         'these': 1,
         'to': 12,
         'tomasz': 1,
         'too': 1,
         'true': 1,
         'tsarist': 1,
         'turkish': 1,
         'two': 1,
         'undertake': 1,
         'up': 2,
         'uprising': 1,
         'uses': 1,
         'vacuous': 1,
         'waiter': 1,
         'war': 1,
         'warsaw': 2,
         'while': 2,
         'who': 1,
         'widow': 1,
         'win': 1,
         'with': 3,
         'without': 1,
         'wokulski': 5,
         'work': 1,
         'www': 1,
         'young': 1})

term freq matrix

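With the lexicon (the set of distinct terms in the corpus) we can compute the term freq matrix: one row per document and one column per lexicon term, where each entry counts how often that term occurs in that document.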


In [18]:
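def get_tfm(corpus):
    
    def get_lexicon(corpus):
        # collect every distinct term in the corpus
        lexicon = set()
        for doc in corpus:
            lexicon.update(doc.split())
        return list(lexicon)
    
    lexicon = get_lexicon(corpus)
    # map each term to its column so lookups do not rescan the lexicon
    index = dict((term, j) for j, term in enumerate(lexicon))
    
    tfm =[]
    for doc in corpus:
        # one term frequency vector (row) per document
        tfv = [0]*len(lexicon)
        for term in doc.split():
            tfv[index[term]] += 1
    
        tfm.append(tfv)
    
    return tfm, lexicon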

#test_corpus = ['mountain bike', 'road bike carbon', 'bike helmet']
#tfm, lexicon = get_tfm(test_corpus)
#print lexicon
#print tfm
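
To make the shape of the matrix concrete, here is a small self-contained sketch that runs on the three-item test corpus from the commented-out lines. The helper name build_tfm is hypothetical; it sorts the lexicon so that the column order is reproducible, which is a choice made for this illustration only.

```python
def build_tfm(corpus):
    # Sorted lexicon so the column order is the same on every run
    # (an assumption of this sketch, not a requirement of the method).
    lexicon = sorted(set(word for doc in corpus for word in doc.split()))
    # Look up each term's column through a dict.
    column = {term: j for j, term in enumerate(lexicon)}
    tfm = []
    for doc in corpus:
        row = [0] * len(lexicon)
        for term in doc.split():
            row[column[term]] += 1
        tfm.append(row)
    return tfm, lexicon

test_corpus = ['mountain bike', 'road bike carbon', 'bike helmet']
toy_tfm, toy_lexicon = build_tfm(test_corpus)
print(toy_lexicon)   # ['bike', 'carbon', 'helmet', 'mountain', 'road']
for row in toy_tfm:
    print(row)
```

Each row has one entry per lexicon term, so even in this tiny example more than half of the entries are zero.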

sparsity of term frequency matrix

We use Bokeh to plot the sparsity of the term frequency matrix as a function of the number of documents.


In [64]:
#!pip install bokeh

In [19]:
import pandas as pd
from bokeh.plotting import figure, output_notebook, show, vplot

# sparsity as a function of document count
n = []
s = []
for i in range(100,1000,100):
    corpus = items_d[0:i]
    tfm, lexicon = get_tfm(corpus)
    # sparsity = fraction of zero entries; every nonzero entry counts
    # as filled, not only the entries equal to 1
    n_zero = sum([ x.count(0) for x in tfm ])
    n_total = len(tfm) * len(lexicon)
    s.append(float(n_zero) / n_total)
    n.append(i)
    
output_notebook(hide_banner=True)
p = figure(x_axis_label='Documents', y_axis_label='Sparsity', plot_width=400, plot_height=400)
p.line(n, s, line_width=2)
p.circle(n, s, fill_color="white", size=8)
show(p)


Out[19]:

<Bokeh Notebook handle for In[19]>
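
For reference, sparsity is taken here to mean the fraction of matrix entries that are exactly zero. The following is a minimal sketch of that definition on a hand-written matrix; the function name sparsity and the matrix toy_matrix are hypothetical and exist only for this example.

```python
def sparsity(matrix):
    # Fraction of entries that are exactly zero.
    total = sum(len(row) for row in matrix)
    zeros = sum(row.count(0) for row in matrix)
    return float(zeros) / total

# Hypothetical 3 x 5 term frequency matrix.
toy_matrix = [
    [1, 0, 0, 1, 0],
    [1, 1, 0, 0, 1],
    [1, 0, 1, 0, 0],
]
print(sparsity(toy_matrix))   # 8 zeros out of 15 entries
```

As documents are added the lexicon usually grows faster than the number of distinct terms per document, so this fraction tends to move toward 1.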

boolean search

After building the term frequency matrix, we move on to our first ranking function. We use a boolean search to find documents that contain the words in a user-specified query. The algorithm works as follows:

  • Compute the lexicon for the corpus
  • Compute the term frequency matrix for the corpus
  • Convert query to query vector using the same lexicon
  • Compare each document's term frequency vector to the query vector; specifically, for each document in the corpus:
    • Compute a ranking score for each document by taking the dot product of the document's term frequency vector and the query vector
  • Sort the documents by ranking score
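
The steps above can be sketched end to end on a made-up three-document corpus. All names prefixed with toy_ are hypothetical, and the sorted lexicon is an assumption made so that the example is reproducible.

```python
toy_corpus = ['red mountain bike', 'road bike carbon', 'red red helmet']

# Lexicon and term frequency matrix for the corpus.
toy_lexicon = sorted(set(w for doc in toy_corpus for w in doc.split()))
toy_column = {term: j for j, term in enumerate(toy_lexicon)}
toy_tfm = []
for doc in toy_corpus:
    row = [0] * len(toy_lexicon)
    for w in doc.split():
        row[toy_column[w]] += 1
    toy_tfm.append(row)

# Binary query vector over the same lexicon; unknown terms are skipped.
toy_query = 'red bike'
toy_qrv = [0] * len(toy_lexicon)
for term in toy_query.split():
    if term in toy_column:
        toy_qrv[toy_column[term]] = 1

# Ranking score: dot product of the query vector with each document row.
toy_scores = [sum(q * t for q, t in zip(toy_qrv, row)) for row in toy_tfm]

# Sort (document index, score) pairs by score, highest first.
toy_ranking = sorted(enumerate(toy_scores), key=lambda p: p[1], reverse=True)
print(toy_ranking)   # [(0, 2), (2, 2), (1, 1)]
```

Note that the third document ties with the first even though it contains only "red": because the document vectors hold counts, a repeated query word adds to the score just like a second matching word does.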

In [20]:
# compute term frequency matrix and lexicon
# (corpus is still items_d[0:900], left over from the sparsity loop above)
tfm, lexicon = get_tfm(corpus)


# define our query
qry = 'red bike'

# convert query to query vector using lexicon
qrv = [0]*len(lexicon)
for term in qry.split():
    if term in lexicon:
        qrv[lexicon.index(term)] = 1

#print qrv

# compare query vector to each term frequency vector
# this is dot product between qrv and each row of tfm
for i,tfv in enumerate(tfm):
    print i, sum([ xy[0] * xy[1] for xy in zip(qrv, tfv) ])


0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
10 0
11 0
12 0
13 1
14 0
15 0
16 1
17 0
18 0
19 0
20 0
21 0
22 0
23 0
24 0
25 0
26 0
27 0
28 0
29 0
30 1
31 0
32 0
33 0
34 0
35 0
36 0
37 0
38 0
39 0
40 0
41 0
42 0
43 0
44 0
45 0
46 0
47 0
48 0
49 0
50 0
51 0
52 0
53 0
54 0
55 0
56 0
57 0
58 1
59 0
60 0
61 0
62 0
63 1
64 0
65 0
66 0
67 0
68 0
69 0
70 0
71 0
72 0
73 0
74 0
75 0
76 0
77 0
78 0
79 0
80 0
81 0
82 0
83 0
84 0
85 0
86 0
87 0
88 0
89 0
90 0
91 0
92 0
93 0
94 0
95 0
96 0
97 0
98 0
99 0
100 0
101 0
102 0
103 0
104 0
105 0
106 0
107 0
108 2
109 0
110 0
111 0
112 0
113 0
114 0
115 0
116 0
117 0
118 0
119 0
120 0
121 0
122 0
123 0
124 0
125 0
126 0
127 0
128 0
129 0
130 0
131 0
132 0
133 1
134 0
135 0
136 0
137 2
138 0
139 0
140 0
141 0
142 0
143 0
144 0
145 0
146 0
147 0
148 0
149 0
150 0
151 0
152 0
153 0
154 0
155 0
156 0
157 0
158 0
159 0
160 0
161 0
162 0
163 0
164 0
165 0
166 0
167 0
168 0
169 0
170 0
171 0
172 0
173 0
174 0
175 0
176 0
177 0
178 0
179 0
180 0
181 0
182 0
183 0
184 0
185 0
186 0
187 0
188 0
189 0
190 0
191 0
192 0
193 0
194 1
195 0
196 0
197 0
198 0
199 0
200 0
201 0
202 0
203 0
204 0
205 0
206 0
207 0
208 0
209 0
210 0
211 0
212 0
213 0
214 0
215 0
216 0
217 0
218 0
219 0
220 3
221 0
222 0
223 0
224 0
225 0
226 0
227 0
228 0
229 0
230 0
231 0
232 0
233 0
234 0
235 0
236 0
237 0
238 0
239 0
240 0
241 0
242 0
243 0
244 0
245 0
246 0
247 1
248 0
249 0
250 0
251 0
252 0
253 0
254 0
255 0
256 0
257 0
258 0
259 0
260 0
261 0
262 0
263 0
264 0
265 0
266 0
267 0
268 0
269 0
270 0
271 0
272 0
273 0
274 0
275 0
276 0
277 0
278 0
279 1
280 0
281 0
282 0
283 0
284 0
285 0
286 0
287 0
288 0
289 0
290 0
291 0
292 0
293 0
294 1
295 0
296 0
297 0
298 0
299 0
300 0
301 0
302 0
303 0
304 0
305 0
306 0
307 0
308 0
309 0
310 0
311 1
312 0
313 1
314 0
315 0
316 0
317 1
318 0
319 0
320 0
321 0
322 0
323 0
324 0
325 0
326 0
327 0
328 0
329 0
330 0
331 0
332 0
333 0
334 0
335 0
336 0
337 0
338 0
339 0
340 0
341 0
342 0
343 0
344 0
345 0
346 0
347 0
348 2
349 0
350 0
351 0
352 0
353 0
354 0
355 0
356 0
357 0
358 0
359 0
360 0
361 0
362 0
363 0
364 0
365 0
366 0
367 0
368 0
369 0
370 0
371 0
372 0
373 0
374 0
375 0
376 0
377 0
378 3
379 0
380 0
381 0
382 0
383 0
384 1
385 0
386 0
387 0
388 0
389 0
390 0
391 0
392 0
393 0
394 0
395 0
396 0
397 0
398 0
399 0
400 1
401 0
402 0
403 0
404 0
405 1
406 0
407 0
408 0
409 0
410 0
411 1
412 0
413 0
414 0
415 0
416 0
417 0
418 0
419 0
420 0
421 0
422 0
423 0
424 0
425 0
426 2
427 0
428 0
429 0
430 0
431 0
432 0
433 0
434 0
435 0
436 0
437 0
438 0
439 0
440 0
441 0
442 0
443 0
444 0
445 0
446 0
447 0
448 0
449 0
450 0
451 0
452 0
453 0
454 0
455 0
456 0
457 0
458 0
459 0
460 0
461 0
462 1
463 0
464 0
465 0
466 0
467 0
468 0
469 0
470 0
471 0
472 0
473 0
474 0
475 0
476 0
477 0
478 0
479 0
480 0
481 0
482 1
483 0
484 0
485 0
486 0
487 0
488 0
489 0
490 0
491 0
492 0
493 0
494 0
495 0
496 0
497 0
498 14
499 0
500 0
501 0
502 0
503 0
504 0
505 0
506 0
507 0
508 0
509 0
510 0
511 0
512 0
513 0
514 0
515 0
516 0
517 0
518 0
519 0
520 0
521 0
522 0
523 0
524 0
525 0
526 0
527 0
528 0
529 0
530 0
531 0
532 0
533 0
534 0
535 0
536 0
537 0
538 0
539 0
540 0
541 0
542 0
543 0
544 1
545 0
546 0
547 0
548 0
549 0
550 0
551 0
552 0
553 2
554 0
555 0
556 0
557 0
558 0
559 0
560 0
561 0
562 0
563 0
564 0
565 0
566 0
567 0
568 0
569 0
570 0
571 0
572 0
573 0
574 0
575 0
576 0
577 0
578 0
579 0
580 0
581 0
582 0
583 0
584 0
585 0
586 0
587 0
588 0
589 0
590 0
591 0
592 0
593 0
594 0
595 0
596 0
597 0
598 0
599 0
600 0
601 0
602 0
603 0
604 0
605 0
606 0
607 0
608 0
609 0
610 0
611 0
612 0
613 0
614 0
615 0
616 0
617 0
618 0
619 0
620 0
621 0
622 1
623 0
624 0
625 0
626 0
627 0
628 0
629 0
630 0
631 0
632 0
633 0
634 0
635 5
636 0
637 0
638 0
639 0
640 0
641 0
642 0
643 0
644 0
645 0
646 0
647 0
648 0
649 0
650 0
651 0
652 0
653 1
654 0
655 0
656 0
657 0
658 0
659 1
660 0
661 0
662 0
663 0
664 0
665 0
666 0
667 0
668 0
669 0
670 0
671 0
672 0
673 0
674 0
675 0
676 1
677 0
678 0
679 1
680 0
681 0
682 4
683 0
684 0
685 0
686 0
687 0
688 0
689 0
690 0
691 0
692 0
693 0
694 0
695 0
696 0
697 0
698 0
699 0
700 0
701 0
702 0
703 0
704 0
705 0
706 0
707 0
708 0
709 0
710 0
711 0
712 0
713 0
714 0
715 3
716 0
717 0
718 0
719 0
720 0
721 0
722 0
723 0
724 0
725 0
726 0
727 0
728 0
729 0
730 0
731 0
732 0
733 0
734 0
735 0
736 0
737 0
738 0
739 0
740 0
741 0
742 0
743 0
744 0
745 0
746 0
747 0
748 0
749 0
750 0
751 0
752 0
753 0
754 0
755 0
756 0
757 0
758 0
759 0
760 0
761 0
762 0
763 0
764 0
765 0
766 0
767 0
768 0
769 0
770 0
771 0
772 0
773 0
774 0
775 0
776 0
777 0
778 0
779 0
780 0
781 0
782 0
783 0
784 0
785 0
786 0
787 0
788 0
789 0
790 0
791 0
792 1
793 0
794 0
795 0
796 0
797 0
798 0
799 0
800 0
801 0
802 0
803 0
804 0
805 0
806 0
807 0
808 0
809 0
810 0
811 0
812 0
813 0
814 0
815 0
816 0
817 0
818 0
819 0
820 0
821 0
822 0
823 0
824 0
825 0
826 0
827 0
828 0
829 0
830 0
831 0
832 0
833 0
834 0
835 0
836 0
837 1
838 0
839 0
840 0
841 0
842 0
843 0
844 0
845 0
846 0
847 0
848 0
849 0
850 0
851 0
852 0
853 0
854 0
855 0
856 0
857 0
858 0
859 0
860 0
861 0
862 0
863 0
864 0
865 0
866 0
867 0
868 0
869 0
870 0
871 0
872 0
873 0
874 0
875 0
876 0
877 0
878 0
879 0
880 0
881 0
882 0
883 0
884 0
885 0
886 0
887 0
888 0
889 1
890 0
891 0
892 0
893 0
894 0
895 0
896 0
897 0
898 0
899 0

To compute the document ranking score, we used the function get_results_tf(), which takes the dot product of a binary query vector with each row of the term frequency matrix.


In [21]:
def get_results_tf(qry, tfm, lexicon):
    # Binary query vector: 1 where a query term appears in the lexicon
    qrv = [0] * len(lexicon)
    for term in qry.split():
        if term in lexicon:
            qrv[lexicon.index(term)] = 1

    # Score each document as the dot product of the query vector and its term frequency vector
    results = []
    for i, tfv in enumerate(tfm):
        score = sum(q * t for q, t in zip(qrv, tfv))
        results.append([score, i])

    # Highest score first
    sorted_results = sorted(results, key=lambda t: t[0] * -1)
    return sorted_results


def print_results(results,n, head=True):
    ''' Helper function to print results
    '''
    if head:    
        print('\nTop %d from recall set of %d items:' % (n,len(results)))
        for r in results[:n]:
            print('\t%0.2f - %s'%(r[0],items_t[r[1]]))
    else:
        print('\nBottom %d from recall set of %d items:' % (n,len(results)))
        for r in results[-n:]:
            print('\t%0.2f - %s'%(r[0],items_t[r[1]]))
    

tfm, lexicon = get_tfm(items_d[:1000])
results = get_results_tf('fun times', tfm , lexicon)
print_results(results,10)


Top 10 from recall set of 1000 items:
	4.00 - ('the challenge', 'm family film m children s m adventure m teen m comedy')
	3.00 - ('color me kubrick', 'm lgbt m drama m comedy m indie')
	3.00 - ('halloween years later', 'm cult m drama m horror m slasher m teen')
	3.00 - ('b b s kids', 'm family film m domestic comedy m comedy m animation')
	2.00 - ('the last day of summer', 'm family film m fantasy m comedy')
	2.00 - ('eti', 'm romance film')
	2.00 - ('halloweentown', 'm children s fantasy m children s family')
	2.00 - ('des pissenlits par la racine', 'm comedy')
	2.00 - ('santouri', 'm drama m world cinema')
	2.00 - ('banjo the woodpile cat', 'm short film m family film m children s family m animation')
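
As a rough illustration of the scoring above, the sketch below runs the same dot product on a three-document toy corpus. The helper `build_tfm` is a hypothetical stand-in for get_tfm (whose definition is not repeated here), and `score_tf` and the toy corpus are assumptions made only for this example.

```python
def build_tfm(corpus):
    """Hypothetical stand-in for get_tfm: returns (term frequency matrix, lexicon)."""
    lexicon = sorted(set(word for doc in corpus for word in doc.split()))
    position = {term: j for j, term in enumerate(lexicon)}
    tfm = []
    for doc in corpus:
        row = [0] * len(lexicon)
        for word in doc.split():
            row[position[word]] += 1
        tfm.append(row)
    return tfm, lexicon


def score_tf(qry, tfm, lexicon):
    """Dot product of a binary query vector with each document's term frequency vector."""
    query_terms = set(qry.split())
    qrv = [1 if term in query_terms else 0 for term in lexicon]
    return [sum(q * t for q, t in zip(qrv, tfv)) for tfv in tfm]


toy_corpus = ['fun fun times', 'sad times', 'fun day']
toy_tfm, toy_lexicon = build_tfm(toy_corpus)
toy_scores = score_tf('fun times', toy_tfm, toy_lexicon)
print(toy_lexicon)  # ['day', 'fun', 'sad', 'times']
print(toy_scores)   # [3, 1, 1]
```

The first document scores 3 because 'fun' appears twice and 'times' once; the score is simply the number of query-term occurrences, which is why long summaries tend to rank high.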

Inverted Index

The inverted index maps each term to the documents in which it can be found.


In [22]:
def create_inverted_index(corpus):
    idx={}
    for i, document in enumerate(corpus):
        for word in document.split():
            if word in idx:
                idx[word].append(i)
            else:
                idx[word] = [i]
    return idx

test_corpus = ['mountain bike red','road bike carbon','bike helmet']
idx = create_inverted_index(test_corpus)
print(idx)


{'mountain': [0], 'helmet': [2], 'bike': [0, 1, 2], 'red': [0], 'carbon': [1], 'road': [1]}
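
One way to use such an index is to find the documents that contain every term of a query by intersecting the posting lists. The sketch below is a minimal version of that idea; `build_index` and `docs_with_all_terms` are hypothetical helper names used only for this example.

```python
def build_index(corpus):
    """Same idea as create_inverted_index: term -> list of document ids."""
    idx = {}
    for i, document in enumerate(corpus):
        for word in document.split():
            idx.setdefault(word, []).append(i)
    return idx


def docs_with_all_terms(qry, idx):
    """Return the sorted ids of the documents that contain every query term."""
    postings = [set(idx.get(term, [])) for term in qry.split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


toy_idx = build_index(['mountain bike red', 'road bike carbon', 'bike helmet'])
print(docs_with_all_terms('bike', toy_idx))          # [0, 1, 2]
print(docs_with_all_terms('road bike', toy_idx))     # [1]
print(docs_with_all_terms('bike scooter', toy_idx))  # []
```

A term that is missing from the index yields an empty posting set, so the intersection is empty instead of raising a KeyError.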

Inverted index for the document descriptions


In [23]:
idx = create_inverted_index(items_d)
print(set(idx['good']).intersection(set(idx['times'])))
print(items_d[2061])


set([32488, 13314, 25605, 7688, 29707, 27661, 40338, 16911, 33808, 529, 12306, 12307, 16302, 534, 37798, 14224, 40111, 35356, 13094, 542, 31, 30240, 23587, 29221, 10278, 18983, 8234, 10283, 44, 39282, 560, 16435, 25141, 5696, 28218, 3131, 35388, 34367, 37440, 26689, 2114, 3652, 9286, 8801, 21070, 3853, 5715, 24046, 14934, 29881, 32141, 31426, 18523, 13404, 12429, 8798, 27232, 25697, 4283, 7270, 23313, 6516, 3691, 108, 29373, 20036, 39955, 11892, 629, 15479, 12408, 23161, 22652, 11836, 42112, 33921, 2690, 22055, 27270, 24711, 14444, 6423, 17037, 39277, 4752, 15505, 3943, 5781, 2710, 2711, 20632, 4036, 34971, 668, 5278, 28204, 32928, 12450, 1190, 2728, 22556, 9386, 38571, 30893, 7854, 10927, 33456, 39880, 13679, 12318, 16054, 22200, 30580, 17595, 24765, 22206, 36981, 4801, 24770, 17611, 16068, 8176, 7879, 5068, 41163, 30924, 27938, 3279, 42194, 19667, 18132, 20693, 27350, 37310, 31960, 4825, 5339, 16092, 1760, 40161, 6750, 35077, 4359, 34534, 1255, 3304, 9449, 41194, 35563, 31980, 36697, 22255, 17136, 22770, 39669, 34039, 21716, 15983, 13052, 5839, 33535, 7936, 11523, 33029, 14086, 2689, 40200, 26377, 19213, 25358, 783, 26895, 26385, 2322, 39700, 17685, 28208, 10008, 5403, 12080, 17697, 25890, 24356, 12166, 34598, 40569, 37672, 37057, 9351, 35629, 21806, 40825, 9008, 41779, 8500, 5941, 6281, 312, 1337, 11580, 32309, 26944, 7154, 41283, 29077, 11590, 31125, 5449, 37864, 19791, 7504, 6481, 19282, 29011, 32142, 11568, 13665, 5978, 17245, 18270, 13797, 34565, 4961, 21860, 8166, 18279, 28009, 22891, 29586, 1902, 11631, 4840, 14194, 30867, 10100, 13372, 22506, 24958, 3477, 19329, 30083, 33669, 13190, 39106, 26505, 39307, 15698, 22082, 13710, 31495, 4725, 23192, 1938, 7583, 41966, 35222, 39319, 9113, 31215, 20892, 10617, 41613, 12193, 7586, 23339, 21574, 36262, 32167, 34728, 27564, 25005, 942, 25519, 4019, 36790, 9144, 28603, 28606, 34751, 16832, 4768, 38852, 35937, 20779, 33738, 15819, 20940, 39586, 40398, 7585, 7635, 18901, 15609, 35289, 32219, 6108, 23518, 29663, 6112, 
9698, 28645, 38018, 2536, 19945, 36842, 34285, 9710, 26607, 20026, 35314, 40303, 25598, 40952, 24175, 20990, 23125])
soldiers with the u n forces that entered korea during the korean war rape a village girl named eon rae the villagers ostracize eon rae and her son unable to make a living eon rae joins the brothel district that has been set up near the u n base on the other side of the river from the village the war and the introduction of u s culture break down the social order of the village after several village children have died the villagers put the blame on the prostitutes eventually the villagers unable to maintain the village leave their homes one by one eon rae and her son also leave synopsis from cite web

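Improve the ranking function by looking up each query term in the inverted index instead of scanning the whole term frequency matrix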


In [24]:
from collections import Counter

def get_results_tf(qry, idx):
    # Each posting is one occurrence of the term, so counting postings gives the term frequency
    score = Counter()
    for term in qry.split():
        for doc in idx.get(term, []):  # skip terms that are not in the index
            score[doc] += 1

    # Output [0] score, [1] doc_id
    results = [[s, doc] for doc, s in score.items() if s > 0]

    sorted_results = sorted(results, key=lambda t: t[0] * -1)
    return sorted_results


idx = create_inverted_index(items_d)
results = get_results_tf('zombies', idx)
print_results(results,20)


Top 20 from recall set of 190 items:
	30.00 - ('burial ground the nights of terror', 'm thriller m zombie film m horror m world cinema')
	19.00 - ('dance of the dead', 'm zombie film m horror m indie m teen m comedy')
	19.00 - ('video dead', 'm zombie film m horror m b movie m indie')
	16.00 - ('zombies zombies zombies', 'm zombie film m b movie m horror m comedy')
	14.00 - ('big tits zombie', 'm zombie film m japanese movies m horror')
	14.00 - ('flesheater', 'm horror m indie m creature film m zombie film m b movie m teen')
	13.00 - ('shaun of the dead', 'm parody m romantic comedy m horror m doomsday film m cult m comedy m zombie film m black comedy m horror comedy')
	12.00 - ('dawn of the dead', 'm horror m indie m doomsday film m cult m splatter film m zombie film')
	12.00 - ('dead and deader', 'm science fiction m horror m television movie m sci fi horror m zombie film m action')
	11.00 - ('route', 'm zombie film m horror m creature film')
	11.00 - ('undead or alive', 'm action adventure m zombie film m western m horror')
	11.00 - ('hide and creep', 'm science fiction m b movie m comedy m zombie film m horror m horror comedy')
	10.00 - ('the stink of flesh', 'm cult m black comedy m horror m comedy m zombie film')
	10.00 - ('abraham lincoln vs zombies', 'm action m horror')
	9.00 - ('planet terror', 'm thriller m action adventure m science fiction m horror m indie m creature film m cult m zombie film m disaster m action thrillers m action')
	9.00 - ('night of the living dead', 'm mystery m horror')
	9.00 - ('when good ghouls go bad', 'm black comedy m fantasy m comedy m children s fantasy')
	8.00 - ('zombi', 'm zombie film m horror m creature film m world cinema')
	8.00 - ('day of the dead contagium', 'm zombie film m horror')
	8.00 - ('land of the dead', 'm thriller m science fiction m horror m indie m doomsday film m creature film m cult m splatter film m zombie film m action m dystopia')

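Enter different queries. In the query below, the common word 'and' matches almost every summary, so the recall set grows to 39747 items and the scores are dominated by occurrences of 'and' rather than by the rarer query terms.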


In [25]:
results = get_results_tf('ghouls and ghosts', idx)
print_results(results, 10)


Top 10 from recall set of 39747 items:
	181.00 - ('in the line of duty witness', 'm action thrillers m world cinema m action adventure m martial arts film m action m chinese movies')
	165.00 - ('dragon head', 'm science fiction m horror m world cinema m anime m disaster m japanese movies m action')
	165.00 - ('band of the hand', 'm crime fiction m thriller m action thrillers m action adventure m drama m crime thriller m action')
	162.00 - ('underworld rise of the lycans', 'm thriller m horror m gothic film m action adventure m period piece m fantasy m action m costume horror')
	145.00 - ('franklin and the green knight', 'm family film m children s m animation')
	144.00 - ('devil s diary', 'm horror m teen m television movie')
	140.00 - ('wishology', 'm fantasy')
	139.00 - ('the runaways', 'm punk rock m biography m indie m musical m drama m music m biographical film')
	134.00 - ('the guard post', 'm mystery m horror')
	129.00 - ('the mists of avalon', 'm costume drama m fantasy adventure m fantasy m feminist film')

In [26]:
import pandas as pd
from bokeh.plotting import output_notebook, show
from bokeh.charts import Bar
from bokeh.charts.attributes import CatAttr
#from bokeh.models import ColumnDataSource

df = pd.DataFrame({'term':[x for x in idx.keys()],'freq':[len(x) for x in idx.values()]})

output_notebook(hide_banner=True)
p = Bar(df.sort_values('freq', ascending=False)[:30], label=CatAttr(columns=['term'], sort=False), values='freq',
        plot_width=800, plot_height=400)
show(p)


Out[26]:

<Bokeh Notebook handle for In[26]>

TF-IDF

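To implement TF-IDF we used the following inverse document frequency, where $N$ is the number of documents and $n_t$ is the number of index entries for term $t$: $$ IDF(t) = \log \left( \frac{N}{1 + n_t} \right) $$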


In [27]:
import math

def idf(term, idx, n):
    return math.log( float(n) / (1 + len(idx[term])))    


print(idf('zombie',idx,len(items_d)))
print(idf('survival',idx,len(items_d)))
print(idf('invasions',idx,len(items_d)))


4.35124994957
4.91040628425
8.45297461909
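
To get a feel for these numbers, the sketch below evaluates the same expression that idf() uses, log(n / (1 + len(idx[term]))), on made-up counts. The function name `idf_from_counts` and the corpus size of 1000 are assumptions chosen only to keep the arithmetic easy to follow.

```python
import math


def idf_from_counts(n_docs, n_postings):
    """Same expression as idf() above, taking the two counts directly."""
    return math.log(float(n_docs) / (1 + n_postings))


# Hypothetical corpus of 1000 documents
common = idf_from_counts(1000, 999)  # term with about as many postings as documents
medium = idf_from_counts(1000, 99)
rare = idf_from_counts(1000, 9)

print(round(common, 3))  # 0.0
print(round(medium, 3))  # 2.303
print(round(rare, 3))    # 4.605
```

Each tenfold drop in the posting count adds log(10), about 2.3, to the weight. With this form of the expression, a term with more than n - 1 postings gets a negative weight.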

TF-IDF Intuition


In [28]:
from bokeh.charts import vplot

idx = create_inverted_index(items_d)

df = pd.DataFrame({'term':[x for x in idx.keys()],'freq':[len(x) for x in idx.values()],
                  'idf':[idf(x, idx, len(items_t)) for x in idx.keys()]})

output_notebook(hide_banner=True)
p1 = Bar(df.sort_values('freq', ascending=False)[:30], label=CatAttr(columns=['term'], sort=False), values='freq',
        plot_width=800, plot_height=400)
p2 = Bar(df.sort_values('freq', ascending=False)[:30], label=CatAttr(columns=['term'], sort=False), values='idf',
        plot_width=800, plot_height=400)
p = vplot(p1, p2)
show(p)


/Users/dustin/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py:13: BokehDeprecationWarning: bokeh.io.vplot was deprecated in Bokeh 0.12.0; please use bokeh.models.layouts.Column instead
Out[28]:

<Bokeh Notebook handle for In[28]>

TF-IDF Ranking

We then rebuilt the inverted index for the TF-IDF ranking so that each term maps to a dictionary of document ids and their term frequencies.


In [29]:
def create_inverted_index(corpus):
    idx={}
    for i, doc in enumerate(corpus):
        for word in doc.split():
            if word in idx:
                if i in idx[word]:
                    # Update document's frequency
                    idx[word][i] += 1
                else:
                    # Add document
                    idx[word][i] = 1
            else:
                # Add term
                idx[word] = {i:1}
    return idx

def get_results_tfidf(qry, idx, n):
    score = Counter()
    for term in qry.split():
        if term in idx:
            i = idf(term, idx, n)
            for doc in idx[term]:
                score[doc] += idx[term][doc] * i
        
    results=[]
    for x in [[r[0],r[1]] for r in zip(score.keys(), score.values())]:
        if x[1] > 0:
            results.append([x[1],x[0]])
    
    sorted_results = sorted(results, key=lambda t: t[0] * -1 )
    return sorted_results

idx = create_inverted_index(items_d)
results = get_results_tfidf('lookout action bike zombie', idx, len(items_d))
print_results(results,10)


Top 10 from recall set of 1874 items:
	115.77 - ('i bought a vampire motorcycle', 'm parody m horror m slasher m horror comedy')
	104.68 - ('burial ground the nights of terror', 'm thriller m zombie film m horror m world cinema')
	90.60 - ('polladhavan', 'm romance film m action m drama')
	78.51 - ('hatchet ii', 'm thriller m horror m cult m comedy m black comedy m action m slasher')
	70.47 - ('the dirt bike kid', 'm family film m children s family m fantasy m adventure m comedy')
	60.40 - ('tuff turf', 'm romantic drama m romance film m action m drama m teen')
	57.58 - ('hide and creep', 'm science fiction m b movie m comedy m zombie film m horror m horror comedy')
	57.58 - ('day of the dead', 'm cult m zombie film m horror m indie')
	57.37 - ('amityville dollhouse', 'm horror')
	52.34 - ('fido', 'm parody m horror m period piece m drama m comedy m zombie film m romance film m horror comedy')
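
The nested index and the scoring loop can be traced by hand on a toy corpus. The sketch below is a compact restatement under assumed names (`build_tf_index`, `tfidf_scores`, and the five toy documents are hypothetical), using the same idf expression, log(n / (1 + number of documents containing the term)).

```python
import math
from collections import Counter


def build_tf_index(corpus):
    """term -> {doc_id: term frequency}, the same shape as the index built above."""
    idx = {}
    for i, doc in enumerate(corpus):
        for word in doc.split():
            postings = idx.setdefault(word, {})
            postings[i] = postings.get(i, 0) + 1
    return idx


def tfidf_scores(qry, idx, n):
    """Sum tf * idf over the query terms for every matching document."""
    score = Counter()
    for term in qry.split():
        if term in idx:
            weight = math.log(float(n) / (1 + len(idx[term])))
            for doc, tf in idx[term].items():
                score[doc] += tf * weight
    return score


toy_corpus = ['zombie zombie bike', 'bike road', 'road trip', 'city trip', 'quiet city']
toy_idx = build_tf_index(toy_corpus)
toy_scores = tfidf_scores('zombie bike', toy_idx, len(toy_corpus))

print(toy_idx['zombie'])         # {0: 2}
print(round(toy_scores[0], 3))   # 2.343
print(round(toy_scores[1], 3))   # 0.511
```

Document 0 gets 2 * log(5/2) for 'zombie' plus log(5/3) for 'bike', while document 1 only gets the 'bike' contribution, so the rarer term carries most of the score.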

Ideally, we do not want many documents to share the same score. A high TF-IDF score in a shorter document should indicate more relevance, so we try boosting the score of documents that are shorter than average by dividing the term frequency by the document's length relative to the average length.


In [30]:
def get_results_tfidf_boost(qry, corpus):
    idx = create_inverted_index(corpus)
    n = len(corpus)
    d = [len(x.split()) for x in corpus]
    d_avg = float(sum(d)) / len(d)
    score = Counter()
    for term in qry.split():
        if term in idx:
            i = idf(term, idx, n)
            for doc in idx[term]:
                f = float(idx[term][doc])
                score[doc] += i *  ( f / (float(d[doc]) / d_avg) )
        
    results=[]
    for x in [[r[0],r[1]] for r in zip(score.keys(), score.values())]:
        if x[1] > 0:
            # output [0] score, [1] doc_id
            results.append([x[1],x[0]])

    sorted_results = sorted(results, key=lambda t: t[0] * -1 )
    return sorted_results
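
To see what the length adjustment does in isolation, the sketch below applies the same factor, f / (d[doc] / d_avg), to made-up numbers. The function name `boosted_tf` and the average length of 100 words are assumptions for this example only.

```python
def boosted_tf(f, doc_len, avg_len):
    """Term frequency divided by relative document length, as in get_results_tfidf_boost."""
    return float(f) / (float(doc_len) / avg_len)


avg_len = 100.0  # hypothetical average summary length in words

short_doc = boosted_tf(2, 50, avg_len)     # 2 matches in a 50-word summary
average_doc = boosted_tf(2, 100, avg_len)  # 2 matches in a 100-word summary
long_doc = boosted_tf(2, 400, avg_len)     # 2 matches in a 400-word summary

print(short_doc)    # 4.0
print(average_doc)  # 2.0
print(long_doc)     # 0.5
```

The same two matches count eight times as much in the 50-word summary as in the 400-word one. Because the adjustment is linear in length, very short documents can dominate the ranking.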

In [31]:
from bokeh.charts import Scatter

results = get_results_tfidf_boost('zombie invasion', items_d)
print_results(results, 10)

# Plot score vs item length
df = pd.DataFrame({'score':[float(x[0]) for x in results],
                   'length':[len(items_d[x[1]].split()) for x in results]})

output_notebook()
p = Scatter(df, x='score', y='length')
show(p)


Top 10 from recall set of 566 items:
	104.51 - ('reel zombies', 'm horror m horror comedy')
	91.58 - ('zombie girl the movie', 'm documentary')
	86.76 - ('caustic zombies', 'm horror')
	80.42 - ('zombie bloodbath', 'm zombie film m horror m comedy')
	75.51 - ('mathrubhoomi', '')
	75.51 - ('gladiatress', 'm parody m sword and sandal m action m comedy')
	71.67 - ('first platoon', 'm comedy film m horror')
	68.65 - ('feeders', 'm drama m science fiction m horror')
	61.06 - ('dead roses', 'm zombie film m horror m indie')
	59.23 - ('time runner', 'm thriller m science fiction m action')
Out[31]:

<Bokeh Notebook handle for In[31]>

Implementing BM25

To implement BM25, we used the function get_results_bm25, which takes the query and the corpus as arguments along with the tuning parameters k1 and b. We then printed the top results and plotted score against document length using a Bokeh chart.


In [32]:
def get_results_bm25(qry, corpus, k1=1.5, b=0.75):
    idx = create_inverted_index(corpus)
    # 1.Assign (integer) n to be the number of documents in the corpus
    n = len(corpus)
    # 2.Assign (list) d with elements corresponding to the number of terms in each document in the corpus
    d = [len(x.split()) for x in corpus]
    # 3.Assign (float) d_avg as the average document length of the documents in the corpus
    d_avg = float(sum(d)) / len(d)                
    score = Counter()
    for term in qry.split():
        if term in idx:
            i = idf(term, idx, n)
            for doc in idx[term]:
                # 4.Assign (float) f equal to the number of times the term appears in doc
                f = float(idx[term][doc])
                # 5.Assign (float) s the BM25 score for this (term, document) pair
                s = i * (( f * (k1 + 1) ) / (f + k1 * (1 - b + (b * (float(d[doc]) / d_avg)))))
                score[doc] += s
                
    results=[]
    for x in [[r[0],r[1]] for r in zip(score.keys(), score.values())]:
        if x[1] > 0:
            results.append([x[1],x[0]])

    sorted_results = sorted(results, key=lambda t: t[0] * -1 )
    return sorted_results
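
The per-term expression in step 5 has two effects that are easier to see on small numbers: the score saturates as the term frequency grows, and longer documents are penalised. The sketch below isolates that expression; `bm25_term`, the idf value of 1.0, and the average length of 100 are assumptions made for illustration.

```python
def bm25_term(f, doc_len, avg_len, idf_value=1.0, k1=1.5, b=0.75):
    """BM25 contribution of one (term, document) pair, same expression as in get_results_bm25."""
    norm = 1 - b + b * (float(doc_len) / avg_len)
    return idf_value * (f * (k1 + 1)) / (f + k1 * norm)


avg_len = 100.0  # hypothetical average document length

# Saturation: for an average-length document the score approaches idf * (k1 + 1) = 2.5
print(round(bm25_term(1, 100, avg_len), 3))   # 1.0
print(round(bm25_term(2, 100, avg_len), 3))   # 1.429
print(round(bm25_term(5, 100, avg_len), 3))   # 1.923
print(round(bm25_term(20, 100, avg_len), 3))  # 2.326

# Length normalisation: same term frequency, different document lengths
print(round(bm25_term(2, 50, avg_len), 3))    # 1.702
print(round(bm25_term(2, 400, avg_len), 3))   # 0.727
```

Compared with the linear boost used earlier, twenty occurrences are worth only a little more than five, and a short document gains a bounded advantage. This may explain why the BM25 scores above are packed into a much narrower range.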

In [33]:
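# 'apacolypse' is misspelled, so it probably matches no summaries and the ranking is driven by 'zombie'
results = get_results_bm25('zombie apacolypse', items_d)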
print_results(results, 10)


Top 10 from recall set of 224 items:
	11.21 - ('zombie bloodbath', 'm zombie film m horror m comedy')
	11.19 - ('day of the dead', 'm cult m zombie film m horror m indie')
	10.68 - ('fido', 'm parody m horror m period piece m drama m comedy m zombie film m romance film m horror comedy')
	10.67 - ('zombie vs mardi gras', 'm horror m comedy m indie')
	10.64 - ('hatchet ii', 'm thriller m horror m cult m comedy m black comedy m action m slasher')
	10.64 - ('super', 'm thriller m science fiction m action adventure m mystery m drama m action')
	10.62 - ('colin', 'm b movie m creature film m psychological thriller m drama m zombie film m horror m action')
	10.48 - ('burial ground the nights of terror', 'm thriller m zombie film m horror m world cinema')
	10.31 - ('first platoon', 'm comedy film m horror')
	10.31 - ('reel zombies', 'm horror m horror comedy')

In [34]:
!pip install bokeh
from bokeh.charts import Scatter

results = get_results_bm25('zombie apacolypse', items_d, k1=1.5, b=0.75)

# Plot score vs item length
df = pd.DataFrame({'score':[float(x[0]) for x in results],
                   'length':[len(items_d[x[1]].split()) for x in results]})
output_notebook()
p = Scatter(df, x='score', y='length')
show(p)


Out[34]:

<Bokeh Notebook handle for In[34]>

Implementing Random Forest Machine Learning

We used the example from class to implement a random forest ranking algorithm.


In [35]:
import findspark
import os
findspark.init(os.getenv('HOME') + '/spark-1.6.0-bin-hadoop2.6')
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-csv_2.10:1.3.0 pyspark-shell'

In [36]:
import pyspark
try: 
    print(sc)
except NameError:
    sc = pyspark.SparkContext()
    print(sc)


<pyspark.context.SparkContext object at 0x186955290>

In [37]:
from pyspark.sql import SQLContext
import os

sqlContext = SQLContext(sc)
# format() expects the data source name (the spark-csv package loaded above), not the file path
df = sqlContext.read.format('com.databricks.spark.csv') \
        .options(header='true', inferSchema='true', delimiter='\t') \
        .load(os.path.join(os.getcwd(), 'data/MovieSummaries/plot_summaries.tsv'))

df.schema
df.dropna()

In [ ]:
sqlContext.registerDataFrameAsTable(df,'dataset')
sqlContext.tableNames()

data_full = sqlContext.sql("select label_relevanceBinary, feature_1, feature_2, feature_3, feature_4, \
                       feature_5, feature_6, feature_7, feature_8, feature_9, feature_10 \
               from dataset").rdd

In [ ]:
from pyspark.mllib.classification import SVMWithSGD, SVMModel
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.feature import StandardScaler

label = data_full.map(lambda row: row[0])
features = data_full.map(lambda row: row[1:])

model = StandardScaler().fit(features)
features_transform = model.transform(features)

# Now combine and convert back to labelled points:
transformedData = label.zip(features_transform)
transformedData = transformedData.map(lambda row: LabeledPoint(row[0],[row[1]]))

transformedData.take(5)

In [ ]:
data_train, data_test = transformedData.randomSplit([.75,.25],seed=1973)

print('Training data records = ' + str(data_train.count()))
print('Test data records = ' + str(data_test.count()))

In [ ]:
from pyspark.mllib.tree import RandomForest

model = RandomForest.trainClassifier(data_train, numClasses=2, categoricalFeaturesInfo={},
                                     numTrees=400, featureSubsetStrategy="auto",
                                     impurity='gini', maxDepth=10, maxBins=32)