firebasePopulate

Description: Crawls and analyses articles from the listed URLs (Mothership is handled in its own cell because it is troublesome to crawl), extracts parameters via analyseArticle, and pushes them to Firebase.

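The parameters are: {"title", "url", "authors", "date", "summary", "polarity", "subjectivity", "keywords", "images", "videos", "text"}. The cells below also read a "language" field, which is used only to skip non-English articles.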
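As a rough illustration of the record shape, the sketch below assumes analyseArticle returns a dict carrying the keys listed above. The helper name `missing_keys` and the placeholder record are hypothetical and only show how a record could be checked before it is pushed.

```python
# Keys named in the description above.
EXPECTED_KEYS = ("title", "url", "authors", "date", "summary", "polarity",
                 "subjectivity", "keywords", "images", "videos")


def missing_keys(parameters):
    """Return the expected keys that are absent from a parameters dict (hypothetical helper)."""
    return [key for key in EXPECTED_KEYS if key not in parameters]


# Placeholder record with empty values, not real crawl output.
sample = {key: "" for key in EXPECTED_KEYS}
sample.pop("videos")

print(missing_keys(sample))
```

A check like this could run just before firebasePush so that an incomplete record is reported by name rather than surfacing as a KeyError.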

Initialise


In [7]:
print("\nINITIALISING MODULES\n.")

%run 'analyseArticle.ipynb'
%run 'firebasePush.ipynb'

import traceback
import newspaper
import requests
import time
from bs4 import BeautifulSoup
from timeit import default_timer as timer

start = timer()

print("OPENING LOGS\n.")
log = open("CRAWL_LOG.txt", "w")

print("LOADING URL LISTS\n.\n")

COMPLETED = []

QUEUE = []

newsURLs = ["www.straitstimes.com","www.allsingaporestuff.com"]

mothershipURLs = ["mothership.sg/category/news","mothership.sg/category/perspectives",
                  "mothership.sg/category/community","mothership.sg/category/almost-famous",
                  "mothership.sg/category/mps-in-the-house","mothership.sg/category/humour"]

print("\nINITIALISED FIREBASEPOPULATE")


INITIALISING MODULES
.
OPENING LOGS
.
LOADING URL LISTS
.


INITIALISED FIREBASEPOPULATE
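The log file opened above receives a traceback dump whenever an article fails unexpectedly in the cells below. A minimal sketch of that dump as one reusable function follows; the name `dump_error` is hypothetical, and an in-memory `io.StringIO` stands in for CRAWL_LOG.txt so the example runs on its own.

```python
import io
import traceback


def dump_error(log, module, fetch_number, data):
    """Write the active exception's traceback and the offending data to an open log."""
    log.write("\n\n ------------------------ ")
    log.write("\n\n" + module + " UNKNOWN ERROR DUMP | Fetch #" + str(fetch_number) + ": \n\n")
    log.write("ERROR:" + traceback.format_exc() + "\n\n")
    log.write("DATA:\n" + str(data))
    log.flush()  # note the parentheses: a bare `log.flush` does nothing


log = io.StringIO()  # stand-in for open("CRAWL_LOG.txt", "w")
try:
    {}["title"]  # simulate a failure while reading a parameter
except Exception:
    dump_error(log, "URL MODULE", 1, None)

print(log.getvalue())
```

The function must be called from inside the `except` block, because `traceback.format_exc()` only reports the exception currently being handled.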

Crawl and analyse this month's latest Mothership articles, output their parameters, and push them to Firebase


In [ ]:
mcount = 0
mnoteng = 0
mfailed = 0
mtooshort = 0
mfetcherror = 0

print("RUN MOTHERSHIP MODULE\n")

for URL in mothershipURLs:
    print("Retrieving URL...\n")
    try:
        sourceCode = requests.get("http://" + str(URL))
        soup = BeautifulSoup(sourceCode.content, "lxml")
        print("Target URL: " + str(URL))

        for div in soup.find_all("div", class_="ind-article"):
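            for a in div.find_all("a"):
                href = a.get("href")
                if href and "mothership.sg" in href:  # skip anchors that have no href
                    parameters = None  # reset so the error dump never logs stale data
                    try:
                        print(str(mcount + mnoteng + mfailed + mtooshort + mfetcherror + 1)+": ", end="")
                        parameters = analyseArticle(href)  # analyse the linked article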
                        
                        if parameters == "ZERO_SENTIMENT_ERROR": #Check for zero sentiment, means article is too short or redirected
                            mtooshort += 1
                            print("SKIPPING: ZERO_SENTIMENT_ERROR, NO SENTIMENT DETECTED!", end=" #")
                            print(str(mtooshort))
                            continue
                            
                        if parameters == "FETCH_ERROR": #Check for fetch failure, means the article could not be downloaded
                            mfetcherror += 1
                            print("SKIPPING: FETCH_ERROR, COULD NOT DOWNLOAD ARTICLE!", end=" #")
                            print(str(mfetcherror))
                            continue
            
                        if str(parameters["language"]) != "en": #Check if article is in English, if it isn't skip
                            mnoteng += 1
                            print("SKIPPING: LANG_ERROR, ARTICLE NOT IN ENGLISH!", end=" #")
                            print(str(mnoteng) + " (" + str(parameters["language"]) + ")")
                            continue
                        
                        title = str(parameters["title"])
                        url = str(parameters["url"])
                        authors = parameters["authors"]
                        date = str(parameters["date"])
                        summary = str(parameters["summary"])
                        polarity = str(parameters["polarity"])
                        subjectivity = str(parameters["subjectivity"])
                        keywords = parameters["keywords"]
                        images = str(parameters["images"])
                        videos = str(parameters["videos"])
                        text = str(parameters["text"])

                        firebasePush(title, url, authors, date, summary, polarity, subjectivity, keywords, images, videos, text)
                        mcount += 1
                        print("Processed article #", end="")
                        print(mcount)
                        
                    except Exception as ex:
                        mfailed += 1
                        print("FAILED article: #", end="")
                        print(mfailed, end=" | ")
                        print(ex, end=" | Moving on...\n")
            
                        log.write("\n\n ------------------------ ")
                        log.write("\n\nMOTHERSHIP MODULE UNKNOWN ERROR DUMP | Fetch #")
                        log.write(str(mcount + mnoteng + mfailed + mtooshort + mfetcherror))
                        log.write(": \n\n")
                        log.write("ERROR:")
                        log.write(str(traceback.format_exc()))  #FOR DEBUGGING
                        log.write("\n\n")
                        log.write("Data:")
                        log.write(str(parameters))              #FOR DEBUGGING
                        
    except Exception as ex:
        print("Failed URL", end=" | ")
        print(ex)
        
    print("\n ------------------------ ")
    string = "FINISHED: " + str(URL)
    print(string.center(63))
    log.write("PROCESSED: ")
    log.write(str(URL))
    log.write("\n")
    log.flush()
    
    print(" ------------------------ \n")

methylHalf()

print("\n  ------------------------ ")
print("                FINISHED PROCESSING MOTHERSHIP")
log.write("FINISHED PROCESSING: ")
log.write("MOTHERSHIP")
log.write("\n\n")
print(" ------------------------ \n")

print("SUMMARY:")
print("Elapsed time: ",end="")
checkpoint = timer()
print(checkpoint - start,end="")
print(" seconds\n")
log.write("Elapsed Time: " + str(checkpoint - start))
log.write("\n\n")
log.flush()

print(str(mcount + mnoteng + mfailed + mtooshort + mfetcherror) + " Total Articles Accessed")
print(str(mcount) + " Processed Articles\n")

print(str(mnoteng) + " LANG_ERRORs (Article not in English)")
print(str(mtooshort) + " ZERO_SENTIMENT_ERRORs (No sentiment detected)")
print(str(mfetcherror) + " FETCH_ERRORs (Failed to fetch article)")
print(str(mfailed) + " Failed Articles\n")

firebaseRefresh()
time.sleep(1)

print(" ------------------------ ")


RUN MOTHERSHIP MODULE

Retrieving URL...

Target URL: mothership.sg/category/news
1: Processed article #1
2: Processed article #2
3: Processed article #3
4: Processed article #4
5: Processed article #5
6: Processed article #6
7: Processed article #7
8: Processed article #8
9: Processed article #9
10: Processed article #10

 ------------------------ 
             FINISHED: mothership.sg/category/news             
 ------------------------ 

Retrieving URL...

Target URL: mothership.sg/category/perspectives
11: Processed article #11
12: Processed article #12
13: Processed article #13
14: Processed article #14
15: Processed article #15
16: Processed article #16
17: Processed article #17
18: Processed article #18
19: Processed article #19
20: Processed article #20

 ------------------------ 
         FINISHED: mothership.sg/category/perspectives         
 ------------------------ 

Retrieving URL...

Target URL: mothership.sg/category/community
21: Processed article #21
22: Processed article #22
23: Processed article #23
24: Processed article #24
25: Processed article #25
26: Processed article #26
27: Processed article #27
28: Processed article #28
29: Processed article #29
30: Processed article #30

 ------------------------ 
           FINISHED: mothership.sg/category/community          
 ------------------------ 

Retrieving URL...

Target URL: mothership.sg/category/almost-famous
31: Processed article #31
32: Processed article #32
33: Processed article #33
34: Processed article #34
35: Processed article #35
36: Processed article #36
37: Processed article #37
38: Processed article #38
39: Processed article #39
40: Processed article #40

 ------------------------ 
         FINISHED: mothership.sg/category/almost-famous        
 ------------------------ 

Retrieving URL...

Target URL: mothership.sg/category/mps-in-the-house
41: Processed article #41
42: Processed article #42
43: Processed article #43
44: Processed article #44
45: Processed article #45
46: Processed article #46
47: Processed article #47
48: Processed article #48
49: Processed article #49
50: Processed article #50

 ------------------------ 
       FINISHED: mothership.sg/category/mps-in-the-house       
 ------------------------ 

Retrieving URL...

Target URL: mothership.sg/category/humour
51: Processed article #51
52: Processed article #52
53: Processed article #53
54: Processed article #54
55: Processed article #55
56: Processed article #56
57: Processed article #57
58: Processed article #58
59: Processed article #59
60: Processed article #60

 ------------------------ 
            FINISHED: mothership.sg/category/humour            
 ------------------------ 

                            .     .
                         .  |\-^-/|  .    
                        /| } O.=.O { |\  

  ------------------------ 
                FINISHED PROCESSING MOTHERSHIP
 ------------------------ 

SUMMARY:
Elapsed time: 81.56103160401108 seconds

60 Total Articles Accessed
60 Processed Articles

0 LANG_ERRORs (Article not in English)
0 ZERO_SENTIMENT_ERRORs (No sentiment detected)
0 FETCH_ERRORs (Failed to fetch article)
0 Failed Articles

 ------------------------ 
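The skip checks and the five separate counters in the cell above could be folded into one classification step plus a `collections.Counter`. The sketch below assumes the same conventions as the notebook: analyseArticle returns either a sentinel string or a dict with a "language" key. The helper name `classify` and the stand-in results are hypothetical.

```python
from collections import Counter


def classify(parameters):
    """Map an analyseArticle-style result to an outcome label (hypothetical helper)."""
    if parameters == "ZERO_SENTIMENT_ERROR":
        return "ZERO_SENTIMENT_ERROR"
    if parameters == "FETCH_ERROR":
        return "FETCH_ERROR"
    if str(parameters["language"]) != "en":
        return "LANG_ERROR"
    return "PROCESSED"


# Stand-in results; a real run would obtain these from analyseArticle(url).
results = ["FETCH_ERROR", {"language": "en"}, {"language": "zh"}, {"language": "en"}]

tally = Counter(classify(result) for result in results)
total = sum(tally.values())

print(total, "Total Articles Accessed")
print(tally["PROCESSED"], "Processed Articles")
print(tally["LANG_ERROR"], "LANG_ERRORs (Article not in English)")
print(tally["ZERO_SENTIMENT_ERROR"], "ZERO_SENTIMENT_ERRORs (No sentiment detected)")
print(tally["FETCH_ERROR"], "FETCH_ERRORs (Failed to fetch article)")
```

A Counter returns 0 for labels it has never seen, so the summary lines work even when a category never occurs, and the running total no longer depends on adding five variables by hand.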

Crawl and analyse the other URLs, output their parameters, and push them to Firebase


In [ ]:
count = 0
noteng = 0
failed = 0
tooshort = 0
fetcherror = 0

print("RUN URL MODULE\n")

for URL in newsURLs:
    print("Building domain...\n")
    
    try:
        paper = newspaper.build("http://" + str(URL), memoize_articles=False)
        print("Domain building complete for: " + str(URL))
    except Exception as ex:
        print("Failed DOMAIN", end=" | ")
        print(ex, end =" | moving on...\n")
        continue  # no usable paper for this URL, so skip to the next one

    for article in paper.articles:
        parameters = None  # reset so the error dump never logs stale data
        try:
            print(str(count + noteng + failed + tooshort + fetcherror + 1)+": ",end="")
            parameters = analyseArticle(article.url)

            if parameters == "ZERO_SENTIMENT_ERROR": #Check for zero sentiment, means article is too short or redirected
                tooshort += 1
                print("SKIPPING: ZERO_SENTIMENT_ERROR, NO SENTIMENT DETECTED!", end=" #")
                print(str(tooshort))
                print(article.url)
                continue
                
            if parameters == "FETCH_ERROR":
                fetcherror +=1
                print("SKIPPING: FETCH_ERROR, COULD NOT DOWNLOAD ARTICLE!", end=" #")
                print(str(fetcherror))
                continue
                
            if str(parameters["language"]) != "en": #Check if article is in English, if it isn't skip
                noteng += 1
                print("SKIPPING: LANG_ERROR, ARTICLE NOT IN ENGLISH!", end=" #")
                print(str(noteng) + " (" + str(parameters["language"]) + ")")
                print(article.url)
                continue

            title = parameters["title"]
            url = str(article.url)
            authors = parameters["authors"]
            date = str(parameters["date"])
            summary = str(parameters["summary"])
            polarity = str(parameters["polarity"])
            subjectivity = str(parameters["subjectivity"])
            keywords = parameters["keywords"]
            images = str(parameters["images"])
            videos = str(parameters["videos"])
            text = str(parameters["text"])

            firebasePush(title, url, authors, date, summary, polarity, subjectivity, keywords, images, videos, text)
            count += 1
            print("Processed article #", end="")
            print(count)
  
        except Exception as ex:
            failed += 1
            print("FAILED article: #",end="")
            print(failed, end=" | ")
            print(ex,end=" | Moving on...\n")

            log.write("\n\n ------------------------ ")
            log.write("\n\nURL MODULE UNKNOWN ERROR DUMP | Fetch #")
            log.write(str(count + noteng + failed + tooshort + fetcherror))
            log.write(": \n\n")
            log.write("ERROR:")
            log.write(str(traceback.format_exc()))  #FOR DEBUGGING
            log.write("\n\n")
            log.write("DATA:\n")
            log.write(str(parameters))              #FOR DEBUGGING

            
    print("\n  ------------------------ ")
    string = "FINISHED: " + str(URL)
    print(string.center(63))
    log.write("PROCESSED: ")
    log.write(str(URL))
    log.write("\n")
    log.flush()
    print("  ------------------------ ")

    print("RUNNING SUMMARY:")
    print("Elapsed time: ",end="")
    checkpoint = timer()
    print(checkpoint - start,end="")
    print(" seconds\n")
    log.write("Elapsed Time: " + str(checkpoint - start))
    log.write("\n\n")
    log.flush()
    
    print(str(count + noteng + failed + tooshort + fetcherror) + " Total Articles Fetched")
    print(str(count) + " Processed Articles\n")
    
    
    print(str(noteng) + " LANG_ERRORs (Article not in English)")
    print(str(tooshort) + " ZERO_SENTIMENT_ERRORs (No sentiment detected)")
    print(str(fetcherror) + " FETCH_ERRORs (Failed to fetch article)")
    print(str(failed) + " Failed Articles\n")
    
    firebaseRefresh()
    time.sleep(1)
    
    print(" ------------------------ ")

methylHalf()    
print("\n ------------------------ ")
print("                   FINISHED PROCESSING URLS!")
log.write("FINISHED PROCESSING: ")
log.write("URLS")
log.write("\n\n")
print("  ------------------------ \n")

print("SUMMARY:")
print(str(count + noteng + failed + tooshort + fetcherror) + " Total Articles Accessed")
print(str(count) + " Processed Articles\n")

print(str(noteng) + " LANG_ERRORs (Article not in English)")
print(str(tooshort) + " ZERO_SENTIMENT_ERRORs (No sentiment detected)")
print(str(fetcherror) + " FETCH_ERRORs (Failed to fetch article)")
print(str(failed) + " Failed Articles\n")

print(" ------------------------ \n")

print("Elapsed time: ",end="")
checkpoint = timer()
print(checkpoint - start,end="")
print(" seconds\n")
print("SHUTTING DOWN")
log.write("Elapsed Time: " + str(checkpoint - start))
log.write("\n\n")
log.write("SHUTTING DOWN")
log.flush()

log.close()


RUN URL MODULE

Building domain...

unable to cache TLDs in file /usr/local/lib/python3.5/dist-packages/tldextract/.tld_set: [Errno 13] Permission denied: '/usr/local/lib/python3.5/dist-packages/tldextract/.tld_set'
Domain building complete for: www.straitstimes.com
1: Processed article #1
2: SKIPPING: ZERO_SENTIMENT_ERROR, NO SENTIMENT DETECTED! #1
http://www.straitstimes.com/files/donald-trump-scraps-key-obamacare-subsidies-urges-democrats-to-fix-broken-mess
3: Processed article #2
4: Processed article #3
5: Processed article #4
6: Processed article #5
7: Processed article #6
8: Processed article #7
9: Processed article #8
10: Processed article #9
11: Processed article #10
12: Processed article #11
13: Processed article #12
14: SKIPPING: ZERO_SENTIMENT_ERROR, NO SENTIMENT DETECTED! #2
http://www.straitstimes.com/multimedia/photos/in-pictures-the-stars-are-out-for-the-san-sebastian-international-film-festival-in
15: Processed article #13
16: Processed article #14
17: Processed article #15
18: Processed article #16
19: SKIPPING: ZERO_SENTIMENT_ERROR, NO SENTIMENT DETECTED! #3
http://www.straitstimes.com/files/the-lives-they-live-born-chinese-but-raised-by-indian-parents
20: Processed article #17
21: Processed article #18
22: Processed article #19
23: Processed article #20
24: Processed article #21
25: Processed article #22
26: SKIPPING: ZERO_SENTIMENT_ERROR, NO SENTIMENT DETECTED! #4
http://www.straitstimes.com/files/revamped-nokia-3310-dumb-phone-classic-to-return-to-singapore-in-october-2017
27: SKIPPING: ZERO_SENTIMENT_ERROR, NO SENTIMENT DETECTED! #5
http://www.straitstimes.com/files/ramen-nagi-chef-founder-satoshi-ikuta-preparing-the-signature-butao-king-ramen
28: Processed article #23
29: Processed article #24
30: Processed article #25
31: Processed article #26
32: Processed article #27
33: Processed article #28
34: Processed article #29
35: Processed article #30
36: Processed article #31
37: Processed article #32
38: Processed article #33
39: Processed article #34
40: Processed article #35
41: Processed article #36
42: Processed article #37
43: Processed article #38
44: Processed article #39
45: Processed article #40
46: Processed article #41
47: Processed article #42
48: Processed article #43
49: Processed article #44
50: Processed article #45
51: Processed article #46
52: Processed article #47
53: Processed article #48
54: Processed article #49
55: Processed article #50
56: Processed article #51
57: Processed article #52
58: Processed article #53
59: Processed article #54
60: Processed article #55
61: Processed article #56
62: Processed article #57
63: Processed article #58
64: Processed article #59
65: Processed article #60
66: Processed article #61
67: Processed article #62
68: Processed article #63
69: Processed article #64
70: Processed article #65
71: Processed article #66
72: Processed article #67
73: Processed article #68
74: Processed article #69
75: Processed article #70
76: Processed article #71
77: Processed article #72
78: Processed article #73
79: Processed article #74
80: Processed article #75
81: Processed article #76
82: Processed article #77
83: Processed article #78
84: Processed article #79
85: Processed article #80
86: Processed article #81
87: Processed article #82
88: Processed article #83
89: Processed article #84
90: Processed article #85
91: Processed article #86
92: Processed article #87
93: Processed article #88
94: Processed article #89
95: Processed article #90
96: Processed article #91
97: Processed article #92
98: Processed article #93
99: Processed article #94
100: Processed article #95
101: Processed article #96
102: Processed article #97
103: Processed article #98
104: Processed article #99
105: SKIPPING: ZERO_SENTIMENT_ERROR, NO SENTIMENT DETECTED! #6
http://www.straitstimes.com/files/donald-trump-strikes-blow-against-iran-nuclear-deal
106: SKIPPING: ZERO_SENTIMENT_ERROR, NO SENTIMENT DETECTED! #7
http://www.straitstimes.com/files/fa-brunei-20171003-00jpg
107: Processed article #100
108: Processed article #101
109: Processed article #102
110: Processed article #103
111: Processed article #104
112: Processed article #105
113: Processed article #106
114: Processed article #107
115: Processed article #108
116: Processed article #109
117: Processed article #110
118: Processed article #111
119: Processed article #112
120: Processed article #113
121: Processed article #114
122: Processed article #115
123: Processed article #116
124: Processed article #117
125: Processed article #118
126: Processed article #119
127: Processed article #120
128: Processed article #121
129: Processed article #122
130: Processed article #123
131: Processed article #124
132: Processed article #125
133: Processed article #126
134: Processed article #127
135: Processed article #128
136: Processed article #129
137: Processed article #130
138: Processed article #131
139: Processed article #132
140: Processed article #133
141: Processed article #134
142: Processed article #135
143: Processed article #136
144: Processed article #137
145: Processed article #138
146: Processed article #139
147: Processed article #140
148: Processed article #141
149: Processed article #142
150: Processed article #143
151: SKIPPING: ZERO_SENTIMENT_ERROR, NO SENTIMENT DETECTED! #8
http://www.straitstimes.com/files/right-bike-fit-for-a-learner-cyclist
152: Processed article #144
153: Processed article #145
154: Processed article #146
155: Processed article #147
156: Processed article #148
157: Processed article #149
158: Processed article #150
159: Processed article #151
160: Processed article #152
161: Processed article #153
162: Processed article #154
163: Processed article #155
164: Processed article #156
165: Processed article #157
166: Processed article #158
167: Processed article #159
168: Processed article #160
169: Processed article #161
170: Processed article #162
171: Processed article #163
172: Processed article #164
173: Processed article #165
174: Processed article #166
175: Processed article #167
176: Processed article #168
177: Processed article #169
178: Processed article #170
179: Processed article #171
180: Processed article #172
181: SKIPPING: ZERO_SENTIMENT_ERROR, NO SENTIMENT DETECTED! #9
http://www.straitstimes.com/files/2017-09-28t161316z1170224327rc1579463e20rtrmadp3britain-boejpg
182: SKIPPING: ZERO_SENTIMENT_ERROR, NO SENTIMENT DETECTED! #10
http://www.straitstimes.com/multimedia/photos/in-pictures-swiss-air-force-pilots-show-their-precision
183: Processed article #173
184: Processed article #174
185: Processed article #175
186: Processed article #176
187: Processed article #177
188: Processed article #178
189: Processed article #179
190: Processed article #180
191: Processed article #181
192: SKIPPING: ZERO_SENTIMENT_ERROR, NO SENTIMENT DETECTED! #11
http://www.straitstimes.com/files/versace-springsummer-2018-milan-fashion-week
193: Processed article #182
194: Processed article #183
195: Processed article #184
196: Processed article #185
197: Processed article #186
198: Processed article #187
199: Processed article #188
200: Processed article #189
201: Processed article #190
202: Processed article #191
203: Processed article #192
204: Processed article #193
205: Processed article #194
206: Processed article #195
207: Processed article #196
208: Processed article #197
209: Processed article #198
210: SKIPPING: ZERO_SENTIMENT_ERROR, NO SENTIMENT DETECTED! #12
http://www.straitstimes.com/files/2017-09-25t125359z547688413rc1fe86a48a0rtrmadp3soccer-champions-mci-shk-previewjpg
211: Processed article #199
212: Processed article #200
213: Processed article #201
214: Processed article #202
215: SKIPPING: ZERO_SENTIMENT_ERROR, NO SENTIMENT DETECTED! #13
http://www.straitstimes.com/asia/east-asia/7-reasons-the-world-is-watching-chinas-19th-party-congress
216: Processed article #203
217: Processed article #204
218: Processed article #205
219: Processed article #206
220: Processed article #207
221: Processed article #208
222: 
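In the output above, 12 of the 13 ZERO_SENTIMENT_ERROR skips are URLs under /files/ or /multimedia/photos/, which appear to be media pages rather than articles. One possible refinement, sketched here as an assumption rather than something the notebook does, is to filter such paths out before calling analyseArticle. The prefix list and the helper name `looks_like_article` are hypothetical and based only on this one run.

```python
from urllib.parse import urlparse

# Path prefixes seen on skipped URLs in the run above; not an exhaustive list.
NON_ARTICLE_PREFIXES = ("/files/", "/multimedia/photos/")


def looks_like_article(url):
    """Return False for URLs whose path starts with a known non-article prefix."""
    return not urlparse(url).path.startswith(NON_ARTICLE_PREFIXES)


# Two URLs taken from the output above.
urls = [
    "http://www.straitstimes.com/files/right-bike-fit-for-a-learner-cyclist",
    "http://www.straitstimes.com/asia/east-asia/7-reasons-the-world-is-watching-chinas-19th-party-congress",
]

for url in urls:
    print(looks_like_article(url), url)
```

The second URL passes the filter even though it was also skipped in the run above, so the ZERO_SENTIMENT_ERROR check would still be needed after filtering.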

In [ ]: