About

This script uses the Stanford Named Entity Recognizer (NER) to find the proper names in my women article corpus and exports a new CSV column containing the article text with those names removed. At the end, it also tries to do a little analysis with the named entities.

NER


In [1]:
import ner
import os
import re
import csv
from urllib import urlopen

In [2]:
import sys

csv.field_size_limit(sys.maxsize)


Out[2]:
131072
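
Passing `sys.maxsize` works on the platform shown here, but on builds where `sys.maxsize` does not fit in a C long (commonly reported on Windows) the call raises `OverflowError`. A minimal sketch of a fallback, assuming a hypothetical helper named `raise_field_size_limit` that is not part of the original notebook:

```python
import csv
import sys

def raise_field_size_limit():
    """Set the csv field size limit as high as the platform allows (hypothetical helper)."""
    limit = sys.maxsize
    while True:
        try:
            csv.field_size_limit(limit)
            return limit
        except OverflowError:
            # the value does not fit in a C long on this platform; try a smaller one
            limit = limit // 2

applied = raise_field_size_limit()
```

Calling `csv.field_size_limit()` with no argument afterwards returns the limit that was actually applied.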

In [3]:
tagger = ner.SocketNER(host='localhost', port=8080)

In [4]:
# test
entities = tagger.get_entities("The dictatorship of President Nicolae Ceausescu caused extreme hardships \
                                for all but a few hundred thousand of Rumania's 23 million citizens. \
                                But in the case of mothers and babies, his rule apparently had the most \
                                tragic consequences. Mr. Ceausescu, who was ousted in a popular uprising a \
                                month ago, decreed in 1967, two years after he came to power, that Rumania's \
                                population, then about 22 million, should increase to 30 million. The reason he \
                                gave was simply that he wanted a bigger Rumania - an assertion widely interpreted \
                                now as an early indication of his megalomania. And to achieve his goal he banned \
                                abortions, made contraception illegal and ordered that Rumanian women of \
                                child-bearing age have five children each. No Precise Accounting Harsh fines were \
                                ordered for women caught having abortions, and doctors or medical technicians who \
                                a consultant for Murray Stopes International, a British charity that assi")
entities


Out[4]:
{u'LOCATION': [u'Rumania', u'Rumania', u'Rumania'],
 u'ORGANIZATION': [u'Murray Stopes International'],
 u'PERSON': [u'Nicolae Ceausescu', u'Ceausescu']}

In [15]:
for key in entities:
    entities[key] = set(entities[key])
entities


Out[15]:
{u'LOCATION': {u'Rumania'},
 u'ORGANIZATION': {u'Murray Stopes International'},
 u'PERSON': {u'Ceausescu', u'Nicolae Ceausescu'}}
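
Converting the lists to sets discards how often each name was tagged, which may matter for any later analysis. A small sketch that keeps the counts and still yields the unique names, using the sample output from the test cell; `count_mentions` is a hypothetical helper, not part of the `ner` package:

```python
from collections import Counter

# same shape as the tagger output shown in Out[4] (values copied from there)
sample = {
    u'LOCATION': [u'Rumania', u'Rumania', u'Rumania'],
    u'ORGANIZATION': [u'Murray Stopes International'],
    u'PERSON': [u'Nicolae Ceausescu', u'Ceausescu'],
}

def count_mentions(entities):
    """Map each entity type to a Counter of how often each name was tagged."""
    return {key: Counter(names) for key, names in entities.items()}

counts = count_mentions(sample)
# the unique names are simply the keys of each Counter
unique = {key: set(counter) for key, counter in counts.items()}
```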

In [16]:
# read the csv into a list of dictionaries
women = []
with open('Data/Corpora/women-foreign.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        women.append(row)
women[1]


Out[16]:
{'BYLINE': 'By DAVID BINDER, Special to The New York Times',
 'COUNTRY': 'ROMANIA\xc2\xa0(96%);',
 'COUNTRY_CODE': 'ROU',
 'COUNTRY_FINAL': 'Romania',
 'COUNTRY_MAJOR': 'ROMANIA',
 'COUNTRY_NR': 'ROMANIA\xc2\xa0(96%)',
 'COUNTRY_TOP_PERCENT': 'ROMANIA\xc2\xa0(96%)',
 'DATE': 'January 24, 1990',
 'LENGTH': '936 words',
 'PUBLICATION': 'NYT',
 'REGION': 'EECA',
 'SUBJECT': "ABORTION; BIRTH CONTROL AND FAMILY PLANNING; POPULATION; CHILDREN AND YOUTH; WOMEN; POLITICS AND GOVERNMENT \xc2\xa0PREGNANCY & CHILDBIRTH\xc2\xa0(90%);\xc2\xa0INFANTS & TODDLERS\xc2\xa0(90%);\xc2\xa0HEADS OF STATE & GOVERNMENT\xc2\xa0(90%);\xc2\xa0ABORTION\xc2\xa0(90%);\xc2\xa0WOMEN\xc2\xa0(90%); CHILDREN\xc2\xa0(89%);\xc2\xa0ORPHANS\xc2\xa0(89%);\xc2\xa0JAIL SENTENCING\xc2\xa0(88%);\xc2\xa0PARENTING\xc2\xa0(78%);\xc2\xa0WOMEN'S HEALTH\xc2\xa0(78%);\xc2\xa0POPULATION SIZE\xc2\xa0(78%);\xc2\xa0CONTRACEPTION\xc2\xa0(74%);\xc2\xa0LAW ENFORCEMENT\xc2\xa0(74%);",
 'TEXT': "The dictatorship of President Nicolae Ceausescu caused extreme hardships for all but a few hundred thousand of Rumania's 23 million citizens. But in the case of mothers and babies, his rule apparently had the most tragic consequences. Mr. Ceausescu, who was ousted in a popular uprising a month ago, decreed in 1967, two years after he came to power, that Rumania's population, then about 22 million, should increase to 30 million. The reason he gave was simply that he wanted a bigger Rumania - an assertion widely interpreted now as an early indication of his megalomania. And to achieve his goal he banned abortions, made contraception illegal and ordered that Rumanian women of child-bearing age have five children each. No Precise Accounting Harsh fines were ordered for women caught having abortions, and doctors or medical technicians who assisted in abortions were sentenced to up to four years in prison and prohibited from practicing for 10 years. In the latter years of the regime, women working in factories were subjected to pregnancy checks as often as once a week. There is as yet no precise accounting of how many Rumanians were adversely affected by such strictures. Officials of the new provisional Government and outside experts have only begun to gather data about what happened over the years. But the fragmentary figures and educated guesses that they have been able to provide depict a society of families torn by death and fear as a result of the decrees paradoxically meant to make them propagate. ''The policy was a total failure,'' said Dr. Timothy Rutter, a consultant for Murray Stopes International, a British charity that assists planned parenthood projects. Dr. Rutter spent the last week in Bucharest. In talks with officials at the new Government's Ministry of Health, Dr. Rutter was told that while the Rumanian population rate grew by 2.5 percent in 1966, in 1989 there was actually a negative growth rate. 
Rumanian officials said there were 300,000 births last year and 1.2 million abortions. Jail for Performing Abortions In Iasi, a city of 400,000 people in northern Rumania, physicians told a visitor that three Iasi University medical professors were jailed for one year each under the dictatorship for performing abortions. ''We have many maternal deaths and very many abandoned children'' as a result of the abortion policy, a physician said, adding that in Iasi, medical instruments that might be used in abortions were kept locked up and could be taken out only under the supervision of a state security police officer. Every Rumanian seems to know cases of mothers dying during botched abortions, children orphaned as a result of such deaths and babies harmed by unsuccessful attempts at abortion. Officials said there are 718 orphans up to the age of 3 in Bucharest, many in pitiable condition. The officials said that the children were orphaned as a result of the Ceausescu policies and that there would be more had the regime not sold Rumanian orphans to France for hard currency. Presumably the orphans were from families broken after the mothers had died during abortions or during unhealthy pregnancies. Death of a Wife In an interview, Ion Tudor, a 46-year-old museum worker, told of his family with tears in his eyes. In February 1975, he came home from a work assignment in another city to find his 26-year-old wife, Florica, in a state of collapse. ''She had gotten an abortion from a medical technician,'' he said, adding that he had no knowledge of her plans. ''I called an ambulance. It took 12 hours for the ambulance to arrive. We went to the Giulesti Maternity Hospital. They called the police, who said she could not receive treatment until she confessed who had performed the abortion. She received no care for two days. Then she had a kidney collapse.  ''The doctors sneaked her over to the Emergency Hospital, where the doctors treated her. 
The doctor there told me if she was strong she had a chance to live. She died 18 days later.'' She left Mr. Tudor with three sons, ages 2, 4 and 5. Mihai Orovanu, a Bucharest photographer, told of visiting an archeological site at Pacuiul Lui Soare, near the Danube, last November and finding a Bucharest physician living in a cave. He had been jailed and then barred from practice for 10 years because he had performed abortions. ''He was living off vegetables he stole from the fields and fish he caught in the river,'' Mr. Orovanu said. 'There Was No Milk' The babies who were born faced severe hardships. Christian Modolciu, a 39-year-old foreign trade specialist who has two children, said he and his wife decided not to have any more. ''We wanted a third child, but there was no milk to be had,'' he said. During the Ceausescu regime, interuterine devices and condoms were traded on the black market, with American-made I.U.D.'s selling for more than $100 apiece. But such devices were scarce. In Bucharest there are 20,000 women in hospitals being treated for abortion complications, Dr. Rutter said, quoting the Rumanian health officials. He said an additional 10,000 Bucharest women are waiting for places in hospitals for treatment of blocked fallopian tubes caused by mishandled abortions. Abortions are being performed at a rate of 60 a day at one Bucharest hospital alone. Condoms are now becoming available in the capital's pharmacies. The first legal decree of the Council of National Salvation, which took power after the fall of Mr. Ceausescus, dissolved the state security police. The second decree was to end the Ceausescu policy banning abortions.",
 'TITLE': 'Upheaval in the East; Where Fear and Death Went Forth and Multiplied',
 'TYPE': '',
 'UID': '15',
 'YEAR': '1990'}

In [17]:
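# remove named entities from each article's text
for article in women:
    entities = tagger.get_entities(article['TEXT'])
    for key in entities:
        entities[key] = set(entities[key])
    article['entities'] = entities
    # flatten to one list, longest names first, so that 'Nicolae Ceausescu'
    # is removed before 'Ceausescu' and no stray 'Nicolae' is left behind
    names = sorted((item for sublist in entities.values() for item in sublist),
                   key=len, reverse=True)
    text = article['TEXT']
    for noun in names:
        text = text.replace(noun.encode('utf-8'), '')
    article['TEXT-NO-NOUN'] = text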
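
Plain `str.replace` also removes a name when it appears inside a longer word (removing 'Rumania' turns 'Rumanian' into 'n') and leaves double spaces behind. A sketch of an alternative using a single regular expression with word boundaries; `strip_entities` is a hypothetical helper and the sample sentence is made up for illustration:

```python
import re

def strip_entities(text, names):
    """Remove every name in `names` from `text` (hypothetical helper)."""
    if not names:
        return text
    # longest first, so 'Nicolae Ceausescu' is tried before 'Ceausescu'
    ordered = sorted(set(names), key=len, reverse=True)
    # \b keeps a name from matching inside a longer word such as 'Rumanian'
    pattern = re.compile(r'\b(?:' + '|'.join(re.escape(n) for n in ordered) + r')\b')
    stripped = pattern.sub('', text)
    # collapse the runs of spaces left where names were removed
    return re.sub(r' {2,}', ' ', stripped).strip()

sample = "President Nicolae Ceausescu ruled Rumania. Mr. Ceausescu was ousted."
result = strip_entities(sample, ['Ceausescu', 'Nicolae Ceausescu', 'Rumania'])
```

Possessives such as "Rumania's" still lose the name and keep the trailing "'s", so some cleanup of leftover punctuation may be needed depending on the downstream analysis.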

Write File


In [18]:
keys = women[1].keys()
keys


Out[18]:
['BYLINE',
 'TEXT-NO-NOUN',
 'PUBLICATION',
 'TITLE',
 'COUNTRY',
 'COUNTRY_FINAL',
 'YEAR',
 'UID',
 'COUNTRY_NR',
 'entities',
 'LENGTH',
 'COUNTRY_TOP_PERCENT',
 'COUNTRY_CODE',
 'TEXT',
 'DATE',
 'COUNTRY_MAJOR',
 'TYPE',
 'REGION',
 'SUBJECT']

In [19]:
with open('Data/Corpora/women-processed.csv', 'wb') as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(women)
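
Because `article['entities']` is a dictionary of sets, `DictWriter` stores its Python repr in the `entities` column, which is awkward to parse when the file is read back. One option is to serialize that column as JSON before writing. The sketch below assumes Python 3, uses an in-memory buffer in place of the output file, and introduces a hypothetical helper `entities_to_json`; the row is a cut-down example, not real corpus data:

```python
import csv
import io
import json

def entities_to_json(entities):
    """Serialize a {type: set(names)} mapping as a JSON string (hypothetical helper)."""
    return json.dumps({key: sorted(names) for key, names in entities.items()},
                      sort_keys=True)

row = {'UID': '15',
       'entities': {'PERSON': {'Ceausescu', 'Nicolae Ceausescu'},
                    'LOCATION': {'Rumania'}}}
row_out = dict(row, entities=entities_to_json(row['entities']))

buffer = io.StringIO()  # stands in for the output file
writer = csv.DictWriter(buffer, ['UID', 'entities'])
writer.writeheader()
writer.writerow(row_out)

# reading the column back recovers the mapping (as lists rather than sets)
buffer.seek(0)
restored = json.loads(next(csv.DictReader(buffer))['entities'])
```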
