This content was originally created as part of a Python lecture series, whose authors wrote and demoed it in Python 3.6. It ran cleanly on Python 3.5 and higher, but when I attempted to run it under Python 2.7, it behaved strangely: Unicode encoding and decoding errors halted the code, though not consistently. Even after adding the line "from \_\_future\_\_ import unicode_literals" to the top, the problem persisted.
This problem presented an interesting opportunity to test some Python concepts, including recursion, before an actual solution for Python 2.7 was devised. Part of the challenge was to alter the code to work in Python 2.7 while still running in Python 3.6 without having to "change it back".
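The class of failure described above can be reproduced in isolation. The sketch below is only an illustration: the helper name `ascii_encode_error` is hypothetical, and the explicit `.encode("ascii")` stands in for the implicit ASCII conversion that Python 2.7 performs when `str()` is applied to unicode text.

```python
# Sample title taken from the test cell later in this notebook.
title = u"Python Meeting D\u00fcsseldorf"


def ascii_encode_error(text):
    """Return the UnicodeEncodeError raised by an ASCII encode, or None."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError as err:
        return err
    return None


err = ascii_encode_error(title)
print(type(err).__name__)
print(err)
```

The error object reports the offending character and its position, which is the same information the later cells extract from the error message text.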
In this Notebook:
* Note: Originally, the behavior at the command line differed from what was encountered in Jupyter. Though the cause was eventually identified, to be safe all testing shows both the Jupyter code cells and the Python script test.
In [1]:
# another copy of the event object (original unaltered code)
import re
import datetime
class event(object):
    def __init__(self, title, time, location):
        self.title = title
        self.time = time
        self.location = location
    def day(self):
        try:
            day = re.findall('\w+', self.time)[:3]
            day = ' '.join(day)
            try:
                return datetime.datetime.strptime(day, "%d %b %Y")
            except ValueError:
                return datetime.datetime.strptime(day, "%d %B %Y")
        except ValueError:
            return self.time
    def status(self):
        if isinstance(self.day(), datetime.datetime):
            now = datetime.datetime.now()
            if now < self.day():
                return 'Upcoming'
            elif now - self.day() < datetime.timedelta(days=1):
                return 'Today'
            else:
                return 'Missed'
        else:
            return 'Unknown'
    def __str__(self):
        return self.status() + ' Event: %s' %self.title
In [4]:
# this version is unchanged from lecture content. It throws an error as shown below in the output
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
import datetime
import re
text = requests.get('https://www.python.org/events/python-user-group/').text
timePattern = '<time datetime="[\w:+-]+">(.+)<span class="say-no-more">([\d ]+)</span>(.*)</time>'
locationPattern = '<span class="event-location">(.*)</span>'
titlePattern = '<h3 class="event-title"><a href=".+">(.*)</a></h3>'
time = re.findall(timePattern, text)
time = [''.join(i) for i in time]
location = re.findall(locationPattern, text)
title = re.findall(titlePattern, text)
events = [event(title[i], time[i], location[i]) for i in range(len(title))]
for i in events:
    print (30*'-')
    print(i)
    print (' Time: %s' %i.time)
    print (' Location: %s' %i.location)
In [3]:
!python script/event_original.py
This solution may not be the most elegant, but it solves the problem. The code is shown here tested under Python 2.7 in Jupyter, as a Python 2.7 script, and as the same script under Python 3.6. The research section illustrates the quirks and gotchas encountered along the way to this answer.
In [5]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# This code produces a working solution. There may be more efficient ways to do this, but this works.
# Created for Python 2.7, then modified for cross-compatibility with Python 3.6
import requests
import datetime
import re
import unicodedata # used in str_Intl() for the closest-ASCII fallback
class event(object):
    def __init__(self, title, time, location):
        self.title = title
        self.time = time
        self.location = location
    def day(self):
        try:
            day = re.findall('\w+', self.time)[:3]
            day = ' '.join(day)
            try:
                return datetime.datetime.strptime(day, "%d %b %Y")
            except ValueError:
                return datetime.datetime.strptime(day, "%d %B %Y")
        except ValueError:
            return self.time
    def status(self):
        if isinstance(self.day(), datetime.datetime):
            now = datetime.datetime.now()
            if now < self.day():
                return 'Upcoming'
            elif now - self.day() < datetime.timedelta(days=1):
                return 'Today'
            else:
                return 'Missed'
        else:
            return 'Unknown'
    def __str__(self):
        try:
            rtnVal = str(self.status() + ' Event: %s' %self.title)
        except Exception as ee:
            rtnVal = str_Intl(self.status() + ' Event: %s' %self.title.encode('utf-8'))
        return rtnVal
# this function was created instead of modifying __str__ because in testing, this error cropped up
# both in the use of a print() statement all by itself, and in an event.__str__ call
def str_Intl(strng):
    try:
        strng2 = strng.encode('utf-8')
        rtnVal = str(strng2)
    except UnicodeEncodeError as uee:
        print("Warning!")
        print("%s: %s" %(type(uee), uee))
        chrStartIndx = len("'ascii' codec can't encode character ")
        chrEndIndx = str(uee).find(" in position ")
        replStr = str(uee)[chrStartIndx:chrEndIndx]
        startIndx = (chrEndIndx+1) + len("in position ")
        endIndx = str(uee).find(": ordinal")
        oIndx = int(str(uee)[startIndx:endIndx])
        print("Character %d cannot be processed by print() or str() and will be replaced." %(oIndx))
        print("---------------------")
        rtnVal = (strng[0:oIndx] + ("\"%s\"" %replStr) + strng[(oIndx+1):])
        rtnVal = str_Intl(rtnVal) # recursive function call
    except UnicodeDecodeError as ude:
        # early testing with this line from stack overflow did not work for us:
        # strng.encode('utf-8').strip()
        # this solution also strips off the problem characters without outputting what they were
        print("Warning!")
        print("%s: %s" %(type(ude), ude))
        print("Where possible, characters are replaced with their closest ASCII equivalent.")
        # earlier use of .encode() fixed one issue and bypassed the UnicodeEncodeError handling
        # it then triggered this error for one of the other cases, so now we are trying other solutions:
        strng_u = unicode(strng, "utf-8")
        rtnVal = unicodedata.normalize('NFKD', strng_u).encode('ascii', 'ignore')
        # this threw an error that 2nd argument must be unicode, not string
        # added strng_u line as a fix for that
        rtnVal = str_Intl(rtnVal)
    except Exception as ee:
        # when calling this code in a loop, you lose one value and get this error message output instead
        # but the loop can continue over the rest of your data
        rtnVal = "String data could not be processed. Error: %s : %s" %(type(ee), ee)
    return rtnVal
text = requests.get('https://www.python.org/events/python-user-group/').text
timePattern = '<time datetime="[\w:+-]+">(.+)<span class="say-no-more">([\d ]+)</span>(.*)</time>'
locationPattern = '<span class="event-location">(.*)</span>'
titlePattern = '<h3 class="event-title"><a href=".+">(.*)</a></h3>'
time = re.findall(timePattern, text)
time = [''.join(i) for i in time]
location = re.findall(locationPattern, text)
title = re.findall(titlePattern, text)
events = [event(title[i], time[i], location[i]) for i in range(len(title))]
for i in events:
    print (30*'-')
    print(i) # bug fix: i is in events, so this calls __str__ in the object
    print (' Time : %s' %i.time)
    try:
        print (' Location: %s' %i.location)
    except Exception as ee:
        print (str_Intl(' Location: %s' %i.location)) # bug fix: error thrown here too
        # str_Intl() will parse out type of error in its try block
In [7]:
# for completeness .. the code is also re-tested as a script under Python 2.7
!python script/event.py
Note: most errors are avoided rather than handled, but the one that still gets through deliberately prints a warning to alert us to it. It would be easy to edit the code to hide this warning if it were undesirable. With more time, the exact path to this error could probably be identified and avoided as well. Here we test on Python 3.6 to show that this code works just as well there as the original (which was designed for Python 3.6).
In [6]:
# Test script version using Python 3.6
!C:/ProgramFilesCoders/Anaconda2/envs/PY36/python script/event.py
This section contains the research and experiments that led up to the solution. It presents the information like a story: background, the problem, and finally iterations of the code leading up to the final solution. To tell the story in this way, some of the content from earlier sections is repeated.
These links on Stack Overflow were particularly helpful in devising the solution. It is interesting to note that others in the community have also reported inconsistent behavior in when encoding errors are thrown (under Python 2.x), as well as solutions that appear to work in one context and fail in others.
Stack Overflow posts on this topic:
In [8]:
# read the saved page; the with-block closes the file when done
with open('data/python-event.html') as f:
    event = f.read()
In [9]:
import re
Note how, in the cells that follow, letters from non-English character sets are printed without error. Stranger still, note the line containing "dos Campos, Brazil": it works here, but later code in this notebook throws an error for a special character in the full city name.
In [10]:
locationPattern = '<span class="event-location">(.*)</span>'
location = re.findall(locationPattern, event)
for i in location:
    print (i)
In [11]:
titlePattern = '<h3 class="event-title"><a href=".+">(.*)</a></h3>'
title = re.findall(titlePattern, event)
for i in title:
    print (i)
In [12]:
# Here is the first version of the event() object
# it gets used in the code that follows (where the problem occurs)
import re
import datetime
class event(object):
    def __init__(self, title, time, location):
        self.title = title
        self.time = time
        self.location = location
    def day(self):
        try:
            day = re.findall('\w+', self.time)[:3]
            day = ' '.join(day)
            try:
                return datetime.datetime.strptime(day, "%d %b %Y")
            except ValueError:
                return datetime.datetime.strptime(day, "%d %B %Y")
        except ValueError:
            return self.time
    def status(self):
        if isinstance(self.day(), datetime.datetime):
            now = datetime.datetime.now()
            if now < self.day():
                return 'Upcoming'
            elif now - self.day() < datetime.timedelta(days=1):
                return 'Today'
            else:
                return 'Missed'
        else:
            return 'Unknown'
    def __str__(self):
        return self.status() + ' Event: %s' %self.title
In [13]:
# in these test cells foreign characters continue to output without error even when called into str()
# these cells were in the original content and also illustrate differences in using print() versus str()
# on the same line of content. However, it is generally thought that str() is called by print() to ensure
# that content fed into it is a string before outputting it, so over-riding __str__ as is done in a later
# version of the event object has an impact on print() as well
event1 = event('Python Meeting Düsseldorf', '20 Jan. 2015 5pm UTC – 7pm UTC', \
'Bürgerhaus im Stadtteilzentrum Bilk, Raum 1, 2. OG, Bachstr. 145, 40217 Düsseldorf, Germany')
print (event1.day())
print (event1)
In [14]:
str(event1)
Out[14]:
In [15]:
# the full source can be viewed using this code:
# It is commented out here:
'''
import requests
text = requests.get('https://www.python.org/events/python-user-group/').text
text
'''
Out[15]:
In [16]:
# another copy of the event object (original unaltered code) so you don't have to scroll up to view it
import re
import datetime
class event(object):
    def __init__(self, title, time, location):
        self.title = title
        self.time = time
        self.location = location
    def day(self):
        try:
            day = re.findall('\w+', self.time)[:3]
            day = ' '.join(day)
            try:
                return datetime.datetime.strptime(day, "%d %b %Y")
            except ValueError:
                return datetime.datetime.strptime(day, "%d %B %Y")
        except ValueError:
            return self.time
    def status(self):
        if isinstance(self.day(), datetime.datetime):
            now = datetime.datetime.now()
            if now < self.day():
                return 'Upcoming'
            elif now - self.day() < datetime.timedelta(days=1):
                return 'Today'
            else:
                return 'Missed'
        else:
            return 'Unknown'
    def __str__(self):
        return self.status() + ' Event: %s' %self.title
In [17]:
# this version is unchanged from lecture content. It throws an error as shown below in the output
import requests
import datetime
import re
text = requests.get('https://www.python.org/events/python-user-group/').text
timePattern = '<time datetime="[\w:+-]+">(.+)<span class="say-no-more">([\d ]+)</span>(.*)</time>'
locationPattern = '<span class="event-location">(.*)</span>'
titlePattern = '<h3 class="event-title"><a href=".+">(.*)</a></h3>'
time = re.findall(timePattern, text)
time = [''.join(i) for i in time]
location = re.findall(locationPattern, text)
title = re.findall(titlePattern, text)
events = [event(title[i], time[i], location[i]) for i in range(len(title))]
for i in events:
    print (30*'-')
    print(i)
    print (' Time: %s' %i.time)
    print (' Location: %s' %i.location)
In [18]:
!python script/event_original.py
The right approach to this problem is to look up the error and see if there is a fix (which is explored later in this notebook).
The theory behind the code cells that immediately follow, however, is the answer to a simple question: what if we encounter characters that our current installation can't handle? What should the code do?
In this case, the desirable behavior is to print warnings about what went wrong, so a better solution can be explored, while handling the misbehaving characters so that the rest of the code keeps running instead of halting on the error. Additionally, it is desirable to output as much of the original content around the error as possible.
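The behavior described above (warn, substitute, keep going) can also be sketched without parsing error messages. The following is a minimal illustration and not the approach taken in this notebook: it relies on the standard `backslashreplace` error handler, and the helper name `to_printable_ascii` is hypothetical.

```python
import warnings


def to_printable_ascii(text):
    """Return text unchanged if it is pure ASCII; otherwise warn and
    return a copy with each non-ASCII character shown as an escape."""
    try:
        text.encode("ascii")
        return text
    except UnicodeEncodeError as err:
        # err.start is the index of the first character that failed.
        warnings.warn(
            "non-ASCII character at position %d will be escaped" % err.start
        )
        escaped = text.encode("ascii", "backslashreplace")
        return escaped.decode("ascii")


print(to_printable_ascii(u"S\u00e3o Jos\u00e9 dos Campos, Brazil"))
```

Everything around the problem characters is preserved, and the escapes record which characters were replaced.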
In [19]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# quick and dirty solution that assumes only one error per line of content processed by the loop
import requests
import datetime
import re
text = requests.get('https://www.python.org/events/python-user-group/').text
timePattern = '<time datetime="[\w:+-]+">(.+)<span class="say-no-more">([\d ]+)</span>(.*)</time>'
locationPattern = '<span class="event-location">(.*)</span>'
titlePattern = '<h3 class="event-title"><a href=".+">(.*)</a></h3>'
time = re.findall(timePattern, text)
time = [''.join(i) for i in time]
location = re.findall(locationPattern, text)
title = re.findall(titlePattern, text)
events = [event(title[i], time[i], location[i]) for i in range(len(title))]
for i in events:
    print (30*'-')
    try:
        print(i)
    except UnicodeEncodeError as uee:
        print(type(uee))
        print(uee)
        startIndx = str(uee).find("in position ")+len("in position ")
        endIndx = str(uee).find(": ordinal")
        oIndx = int(str(uee)[startIndx:endIndx])
        print("Character %d of the Event Title Needs to be removed before print() or str() can process it." %(oIndx))
    print (' Time : %s' %i.time)
    print (' Location: %s' %i.location)
The above solution appears to work, but what if more than one error is triggered in a line of the content? It should also be noted that this solution failed when run as a command-line script containing what, at first glance, seemed to be exactly the same code. Later in the testing process, it was realized that the problem might be the header, and this was added to the top of the script:
In [20]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
In [22]:
!python script/event_v1.py
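On the question of several problem characters in one line: `UnicodeEncodeError` carries `start` and `end` attributes, so the positions can be read directly instead of being parsed out of the message text. The sketch below is an alternative illustration, not the notebook's method; the function name `find_unencodable` is hypothetical, and it uses a loop where the later cells use recursion.

```python
def find_unencodable(text, encoding="ascii"):
    """Return a list of (position, character) pairs that cannot be encoded."""
    problems = []
    offset = 0
    while offset < len(text):
        try:
            text[offset:].encode(encoding)
            break  # the remainder encodes cleanly
        except UnicodeEncodeError as err:
            # err.start and err.end are relative to the slice just encoded.
            for k in range(err.start, err.end):
                problems.append((offset + k, text[offset + k]))
            offset += err.end
    return problems


print(find_unencodable(u"S\u00e3o Jos\u00e9"))
```

Each pass resumes just after the last reported failure, so every unencodable character in the line is found.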
It turns out that if we add the header lines to the Jupyter notebook cell, the two copies of the code fail in more similar ways. Though the header appears to be meant for scripts, it influences code in Jupyter cells as well. The strange thing, though, is that adding the UTF-8 line results in more errors in the script version than leaving it out, even though the research that follows shows that handling of UTF-8 is at the core of fixing this problem.
In [23]:
# This version of the code assumes we want to see warnings and error content along with as much of the output
# as can be processed around the characters causing the problem
''' Here we see code designed to find the misbehaving characters, output as much as we know about them
(text originally captured in the error messages when the code halted), output as much as possible of
the non-misbehaving content, and continue to run.
The output when the error is encountered is ugly, but this is deliberate. It shows all error triggers and
highlights the recursive nature of this solution for future study.
'''
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
import datetime
import re
class event(object):
    def __init__(self, title, time, location):
        self.title = title
        self.time = time
        self.location = location
    def day(self):
        try:
            day = re.findall('\w+', self.time)[:3]
            day = ' '.join(day)
            try:
                return datetime.datetime.strptime(day, "%d %b %Y")
            except ValueError:
                return datetime.datetime.strptime(day, "%d %B %Y")
        except ValueError:
            return self.time
    def status(self):
        if isinstance(self.day(), datetime.datetime):
            now = datetime.datetime.now()
            if now < self.day():
                return 'Upcoming'
            elif now - self.day() < datetime.timedelta(days=1):
                return 'Today'
            else:
                return 'Missed'
        else:
            return 'Unknown'
    def __str__(self):
        return str_Intl(self.status() + ' Event: %s' %self.title) # call to str_Intl is part of bug fix
# this function was created instead of modifying __str__ because in testing, this error cropped up
# both in the use of a print() statement all by itself, and in an event.__str__ call
def str_Intl(strng):
    try:
        rtnVal = str(strng)
    except UnicodeEncodeError as uee:
        print("Warning!")
        print("%s: %s" %(type(uee), uee))
        chrStartIndx = len("'ascii' codec can't encode character ")
        chrEndIndx = str(uee).find(" in position ")
        replStr = str(uee)[chrStartIndx:chrEndIndx]
        startIndx = (chrEndIndx+1) + len("in position ")
        endIndx = str(uee).find(": ordinal")
        oIndx = int(str(uee)[startIndx:endIndx])
        print("Character %d cannot be processed by print() or str() and will be replaced." %(oIndx))
        print("---------------------")
        rtnVal = (strng[0:oIndx] + ("\"%s\"" %replStr) + strng[(oIndx+1):])
        rtnVal = str_Intl(rtnVal) # recursive function call
    except Exception as ee:
        rtnVal = "String data could not be processed. Error: %s : %s" %(type(ee), ee)
    return rtnVal
text = requests.get('https://www.python.org/events/python-user-group/').text
timePattern = '<time datetime="[\w:+-]+">(.+)<span class="say-no-more">([\d ]+)</span>(.*)</time>'
locationPattern = '<span class="event-location">(.*)</span>'
titlePattern = '<h3 class="event-title"><a href=".+">(.*)</a></h3>'
time = re.findall(timePattern, text)
time = [''.join(i) for i in time]
location = re.findall(locationPattern, text)
title = re.findall(titlePattern, text)
events = [event(title[i], time[i], location[i]) for i in range(len(title))]
for i in events:
    print (30*'-')
    print(i) # bug fix: i is in events, so this calls __str__ in the object
    print (' Time : %s' %i.time)
    print (str_Intl(' Location: %s' %i.location)) # when bug happened here, had to add str_Intl as bug fix
In [24]:
!python script/event_v2.py
Note that a number of runs of the above code did not throw the error on the Location line for "dos Campos, Brazil", so when the error first occurred while running the code from a script, it came as a surprise. Later runs (before the header implications were realized) were inconsistent, sometimes throwing the error and sometimes not. This part of the testing experience does not appear to be reproducible: the error now occurs consistently. Stack Overflow posts on this topic, however, indicate that others have had the same experience when dealing with international character sets. From this point forward, the Jupyter cells and the script versions appear to run identically.
The following web topics were part of the research into a more comprehensive solution to the problem. While the code in the solutions below (even the final one) may seem overly complicated, every time it looked like all instances of the encoding error were handled, another one would mysteriously crop up in testing. By the end of this section, the code handles the error in such a way that the branch that strips out characters it can't handle never fires, and we get down to a single warning about ASCII replacement rather than loss of characters (which might make some content harder to read). The exception case that strips out unhandled characters and warns us is retained in case some character we did not test needs it in the future. Should that ever happen, the code can then be revisited.
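One factor that may contribute to this kind of inconsistency, offered as a possibility rather than a confirmed diagnosis of the runs above: under Python 2.7, `print` encodes unicode text using the encoding of `sys.stdout`, and that encoding is typically unset when output is piped or captured rather than written to a terminal, in which case ASCII is used. The sketch below simply reports the relevant settings so that two environments can be compared; the helper name `describe_output_encoding` is hypothetical.

```python
import sys


def describe_output_encoding(stream=None):
    """Collect the settings that decide how printed text gets encoded."""
    if stream is None:
        stream = sys.stdout
    isatty = getattr(stream, "isatty", None)
    return {
        "python": "%d.%d" % sys.version_info[:2],
        # May be None (notably under Python 2 when output is piped).
        "stream_encoding": getattr(stream, "encoding", None),
        "default_encoding": sys.getdefaultencoding(),
        "is_terminal": bool(isatty()) if callable(isatty) else False,
    }


for key, value in sorted(describe_output_encoding().items()):
    print("%s: %s" % (key, value))
```

Running this once in a notebook cell and once through `!python` would show whether the two contexts actually share the same output encoding.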
Stack Overflow posts on this topic:
In [25]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Draft Version ... presented to illustrate how the problem shifts and how later solutions address this
# * In this version, several solutions from Stack Overflow are added into the code.
# * the result is that one encode error is solved but another one shifts to a decode error
# * one suggested solution for decode is used and though it worked for others, it fails here
import requests
import datetime
import re
class event(object):
    def __init__(self, title, time, location):
        self.title = title
        self.time = time
        self.location = location
    def day(self):
        try:
            day = re.findall('\w+', self.time)[:3]
            day = ' '.join(day)
            try:
                return datetime.datetime.strptime(day, "%d %b %Y")
            except ValueError:
                return datetime.datetime.strptime(day, "%d %B %Y")
        except ValueError:
            return self.time
    def status(self):
        if isinstance(self.day(), datetime.datetime):
            now = datetime.datetime.now()
            if now < self.day():
                return 'Upcoming'
            elif now - self.day() < datetime.timedelta(days=1):
                return 'Today'
            else:
                return 'Missed'
        else:
            return 'Unknown'
    def __str__(self):
        return str_Intl(self.status() + ' Event: %s' %self.title.encode('utf-8')) # call to str_Intl is part of bug fix
        # a.encode('utf-8')
# this function was created instead of modifying __str__ because in testing, this error cropped up
# both in the use of a print() statement all by itself, and in an event.__str__ call
def str_Intl(strng):
    try:
        strng2 = strng.encode('utf-8')
        rtnVal = str(strng2) # if str() throws an error, we have a problem (which is why it is here)
        # note that in our code though, we really already know that inputs are string
        # so if it weren't for the try test on str() using it might be redundant
    except UnicodeEncodeError as uee:
        print("Warning!")
        print("%s: %s" %(type(uee), uee))
        chrStartIndx = len("'ascii' codec can't encode character ")
        chrEndIndx = str(uee).find(" in position ")
        replStr = str(uee)[chrStartIndx:chrEndIndx]
        startIndx = (chrEndIndx+1) + len("in position ")
        endIndx = str(uee).find(": ordinal")
        oIndx = int(str(uee)[startIndx:endIndx])
        print("Character %d cannot be processed by print() or str() and will be replaced." %(oIndx))
        print("---------------------")
        rtnVal = (strng[0:oIndx] + ("\"%s\"" %replStr) + strng[(oIndx+1):])
        rtnVal = str_Intl(rtnVal) # recursive function call
    except UnicodeDecodeError as ude:
        # early testing with this line from stack overflow did not work for us:
        # strng.encode('utf-8').strip()
        # this solution also strips off the problem characters without outputting what they were
        print("Warning!")
        print("%s: %s" %(type(ude), ude))
        # earlier use of .encode() fixed one issue and bypassed the UnicodeEncodeError handling
        # it then triggered this error for one of the other cases, so now we are trying other solutions:
        rtnVal = strng.encode('utf-8').strip()
        rtnVal = str_Intl(rtnVal)
    except Exception as ee:
        # when calling this code in a loop, you lose one string and get this error message output instead
        # but the loop can continue over the rest of the data
        rtnVal = "String data could not be processed. Error: %s : %s" %(type(ee), ee)
    return rtnVal
text = requests.get('https://www.python.org/events/python-user-group/').text
timePattern = '<time datetime="[\w:+-]+">(.+)<span class="say-no-more">([\d ]+)</span>(.*)</time>'
locationPattern = '<span class="event-location">(.*)</span>'
titlePattern = '<h3 class="event-title"><a href=".+">(.*)</a></h3>'
time = re.findall(timePattern, text)
time = [''.join(i) for i in time]
location = re.findall(locationPattern, text)
title = re.findall(titlePattern, text)
events = [event(title[i], time[i], location[i]) for i in range(len(title))]
for i in events:
    print (30*'-')
    print(i) # bug fix: i is in events, so this calls __str__ in the object
    print (' Time : %s' %i.time)
    print (str_Intl(' Location: %s' %i.location)) # when bug happened here, had to add str_Intl as bug fix
In [26]:
!python script/event_v3.py
No work is ever complete, just abandoned. This works and covers different scenarios that might be encountered in future content. Until such time as it fails to meet expectations, it is good enough for now. Or at least it was, until cross-testing on Python 3.6 found a problem.
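The closest-ASCII replacement used in the next cell rests on Unicode normalization: NFKD splits an accented letter into a base letter plus a combining mark, and an ASCII encode with `errors='ignore'` then drops the mark. A minimal standalone sketch follows, written for Python 3 text strings (the helper name `closest_ascii` is hypothetical):

```python
import unicodedata


def closest_ascii(text):
    """Approximate text in ASCII by dropping accents.

    Characters with no ASCII base form are removed entirely.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(closest_ascii(u"S\u00e3o Jos\u00e9 dos Campos"))  # -> Sao Jose dos Campos
print(closest_ascii(u"D\u00fcsseldorf"))                # -> Dusseldorf
```

The trade-off is silent loss: a character such as the en dash in the event times has no decomposition and simply disappears, which is why the notebook's version prints a warning when this path is taken.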
In [27]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# This code produces a working solution. There may be more efficient ways to do this, but this works.
import requests
import datetime
import re
import unicodedata # used in str_Intl() for the closest-ASCII fallback
class event(object):
    def __init__(self, title, time, location):
        self.title = title
        self.time = time
        self.location = location
    def day(self):
        try:
            day = re.findall('\w+', self.time)[:3]
            day = ' '.join(day)
            try:
                return datetime.datetime.strptime(day, "%d %b %Y")
            except ValueError:
                return datetime.datetime.strptime(day, "%d %B %Y")
        except ValueError:
            return self.time
    def status(self):
        if isinstance(self.day(), datetime.datetime):
            now = datetime.datetime.now()
            if now < self.day():
                return 'Upcoming'
            elif now - self.day() < datetime.timedelta(days=1):
                return 'Today'
            else:
                return 'Missed'
        else:
            return 'Unknown'
    def __str__(self):
        return str_Intl(self.status() + ' Event: %s' %self.title.encode('utf-8')) # call to str_Intl is part of bug fix
# this function was created instead of modifying __str__ because in testing, this error cropped up
# both in the use of a print() statement all by itself, and in an event.__str__ call
def str_Intl(strng):
    try:
        strng2 = strng.encode('utf-8')
        rtnVal = str(strng2)
    except UnicodeEncodeError as uee:
        print("Warning!")
        print("%s: %s" %(type(uee), uee))
        chrStartIndx = len("'ascii' codec can't encode character ")
        chrEndIndx = str(uee).find(" in position ")
        replStr = str(uee)[chrStartIndx:chrEndIndx]
        startIndx = (chrEndIndx+1) + len("in position ")
        endIndx = str(uee).find(": ordinal")
        oIndx = int(str(uee)[startIndx:endIndx])
        print("Character %d cannot be processed by print() or str() and will be replaced." %(oIndx))
        print("---------------------")
        rtnVal = (strng[0:oIndx] + ("\"%s\"" %replStr) + strng[(oIndx+1):])
        rtnVal = str_Intl(rtnVal) # recursive function call
    except UnicodeDecodeError as ude:
        # early testing with this line from stack overflow did not work for us:
        # strng.encode('utf-8').strip()
        # this solution also strips off the problem characters without outputting what they were
        print("Warning!")
        print("%s: %s" %(type(ude), ude))
        print("Where possible, characters are replaced with their closest ASCII equivalent.")
        # earlier use of .encode() fixed one issue and bypassed the UnicodeEncodeError handling
        # it then triggered this error for one of the other cases, so now we are trying other solutions:
        strng_u = unicode(strng, "utf-8")
        rtnVal = unicodedata.normalize('NFKD', strng_u).encode('ascii', 'ignore')
        # this threw an error that 2nd argument must be unicode, not string
        # added strng_u line as a fix for that
        rtnVal = str_Intl(rtnVal)
    except Exception as ee:
        # when calling this code in a loop, you lose one value and get this error message output instead
        # but the loop can continue over the rest of your data
        rtnVal = "String data could not be processed. Error: %s : %s" %(type(ee), ee)
    return rtnVal
text = requests.get('https://www.python.org/events/python-user-group/').text
timePattern = '<time datetime="[\w:+-]+">(.+)<span class="say-no-more">([\d ]+)</span>(.*)</time>'
locationPattern = '<span class="event-location">(.*)</span>'
titlePattern = '<h3 class="event-title"><a href=".+">(.*)</a></h3>'
time = re.findall(timePattern, text)
time = [''.join(i) for i in time]
location = re.findall(locationPattern, text)
title = re.findall(titlePattern, text)
events = [event(title[i], time[i], location[i]) for i in range(len(title))]
for i in events:
    print (30*'-')
    print(i) # bug fix: i is in events, so this calls __str__ in the object
    print (' Time : %s' %i.time)
    print (str_Intl(' Location: %s' %i.location)) # when bug happened here, had to add str_Intl as bug fix
In [28]:
# testing above as a script (still in Python 2.7)
!python script/event_v4.py
In [29]:
# testing above as a script in Python 3.6
!C:/ProgramFilesCoders/Anaconda2/envs/PY36/python script/event_v4.py
In [ ]:
# sample output:
''' ------------------------------
b"Unknown Event: b'PyDelhiConf 2017'"
Time : 18 March – 20 March 2017
b' Location: NCR, Noida, India'
'''
print("-------------------------------")
To migrate the code to Python 3.6, two minor tweaks could be made to disable the Python 2.7 error handling and work-arounds. These changes are easy enough to do, but just for the experience, a solution is explored here that will run on both without requiring any coding changes.
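The `b'...'` wrappers in the Python 3.6 sample output above are consistent with how Python 3 treats bytes: calling `str()` on an encoded value returns its printable representation, prefix included, rather than the text. The sketch below shows that effect and one possible version-agnostic conversion. It is an illustration under assumptions, not the approach taken in the next cell, and the names `PY2` and `to_native_str` are hypothetical.

```python
import sys

PY2 = sys.version_info[0] == 2


def to_native_str(text):
    """Return text as the interpreter's native str type.

    Python 2: unicode is encoded to a UTF-8 byte str.
    Python 3: bytes are decoded from UTF-8 to str.
    """
    if PY2 and not isinstance(text, str):
        return text.encode("utf-8")
    if not PY2 and isinstance(text, bytes):
        return text.decode("utf-8")
    return text


encoded = u"PyDelhiConf 2017".encode("utf-8")
# Under Python 3 this prints the bytes repr, with the b'' wrapper.
print(str(encoded))
# The helper returns plain text on either interpreter.
print(to_native_str(encoded))
```

Branching on the interpreter version in one place keeps the rest of the code free of Python 2 specific handling.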
In [30]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# This code produces a working solution. There may be more efficient ways to do this, but this works.
# Created for Python 2.7, then modified for cross-compatibility with Python 3.6
import requests
import datetime
import re
import unicodedata # used in str_Intl() for the closest-ASCII fallback
class event(object):
    def __init__(self, title, time, location):
        self.title = title
        self.time = time
        self.location = location
    def day(self):
        try:
            day = re.findall('\w+', self.time)[:3]
            day = ' '.join(day)
            try:
                return datetime.datetime.strptime(day, "%d %b %Y")
            except ValueError:
                return datetime.datetime.strptime(day, "%d %B %Y")
        except ValueError:
            return self.time
    def status(self):
        if isinstance(self.day(), datetime.datetime):
            now = datetime.datetime.now()
            if now < self.day():
                return 'Upcoming'
            elif now - self.day() < datetime.timedelta(days=1):
                return 'Today'
            else:
                return 'Missed'
        else:
            return 'Unknown'
    def __str__(self):
        try:
            rtnVal = str(self.status() + ' Event: %s' %self.title)
        except Exception as ee:
            rtnVal = str_Intl(self.status() + ' Event: %s' %self.title.encode('utf-8'))
        return rtnVal
# this function was created instead of modifying __str__ because in testing, this error cropped up
# both in the use of a print() statement all by itself, and in an event.__str__ call
def str_Intl(strng):
    try:
        strng2 = strng.encode('utf-8')
        rtnVal = str(strng2)
    except UnicodeEncodeError as uee:
        print("Warning!")
        print("%s: %s" %(type(uee), uee))
        chrStartIndx = len("'ascii' codec can't encode character ")
        chrEndIndx = str(uee).find(" in position ")
        replStr = str(uee)[chrStartIndx:chrEndIndx]
        startIndx = (chrEndIndx+1) + len("in position ")
        endIndx = str(uee).find(": ordinal")
        oIndx = int(str(uee)[startIndx:endIndx])
        print("Character %d cannot be processed by print() or str() and will be replaced." %(oIndx))
        print("---------------------")
        rtnVal = (strng[0:oIndx] + ("\"%s\"" %replStr) + strng[(oIndx+1):])
        rtnVal = str_Intl(rtnVal) # recursive function call
    except UnicodeDecodeError as ude:
        # early testing with this line from stack overflow did not work for us:
        # strng.encode('utf-8').strip()
        # this solution also strips off the problem characters without outputting what they were
        print("Warning!")
        print("%s: %s" %(type(ude), ude))
        print("Where possible, characters are replaced with their closest ASCII equivalent.")
        # earlier use of .encode() fixed one issue and bypassed the UnicodeEncodeError handling
        # it then triggered this error for one of the other cases, so now we are trying other solutions:
        strng_u = unicode(strng, "utf-8")
        rtnVal = unicodedata.normalize('NFKD', strng_u).encode('ascii', 'ignore')
        # this threw an error that 2nd argument must be unicode, not string
        # added strng_u line as a fix for that
        rtnVal = str_Intl(rtnVal)
    except Exception as ee:
        # when calling this code in a loop, you lose one value and get this error message output instead
        # but the loop can continue over the rest of your data
        rtnVal = "String data could not be processed. Error: %s : %s" %(type(ee), ee)
    return rtnVal
text = requests.get('https://www.python.org/events/python-user-group/').text
timePattern = '<time datetime="[\w:+-]+">(.+)<span class="say-no-more">([\d ]+)</span>(.*)</time>'
locationPattern = '<span class="event-location">(.*)</span>'
titlePattern = '<h3 class="event-title"><a href=".+">(.*)</a></h3>'
time = re.findall(timePattern, text)
time = [''.join(i) for i in time]
location = re.findall(locationPattern, text)
title = re.findall(titlePattern, text)
events = [event(title[i], time[i], location[i]) for i in range(len(title))]
for i in events:
    print (30*'-')
    print(i) # bug fix: i is in events, so this calls __str__ in the object
    print (' Time : %s' %i.time)
    try:
        print (' Location: %s' %i.location)
    except Exception as ee:
        print (str_Intl(' Location: %s' %i.location)) # bug fix: error thrown here too
        # str_Intl() will parse out type of error in its try block
In [31]:
# Test in Python 3.6
!C:/ProgramFilesCoders/Anaconda2/envs/PY36/python script/event.py
In [32]:
# for completeness .. the code is also re-tested as a script under Python 2.7
!python script/event.py
In [33]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# code reverted to earliest version to test this simple approach.
# if it worked, a lot of processing could be avoided, but as shown in the output, this failed too
# above solution seems to work the best for now
import requests
import datetime
import re
import sys
# proposed bug-fix from stack overflow
reload(sys) # note: on Python 3.6, reload threw an error that reload() was not defined
# if this solution ever proves useful this may need to be investigated
# the command worked in Python 2.7
sys.setdefaultencoding('utf8')
class event(object):
    def __init__(self, title, time, location):
        self.title = title
        self.time = time
        self.location = location
    def day(self):
        try:
            day = re.findall('\w+', self.time)[:3]
            day = ' '.join(day)
            try:
                return datetime.datetime.strptime(day, "%d %b %Y")
            except ValueError:
                return datetime.datetime.strptime(day, "%d %B %Y")
        except ValueError:
            return self.time
    def status(self):
        if isinstance(self.day(), datetime.datetime):
            now = datetime.datetime.now()
            if now < self.day():
                return 'Upcoming'
            elif now - self.day() < datetime.timedelta(days=1):
                return 'Today'
            else:
                return 'Missed'
        else:
            return 'Unknown'
    def __str__(self):
        return self.status() + ' Event: %s' %self.title
text = requests.get('https://www.python.org/events/python-user-group/').text
timePattern = '<time datetime="[\w:+-]+">(.+)<span class="say-no-more">([\d ]+)</span>(.*)</time>'
locationPattern = '<span class="event-location">(.*)</span>'
titlePattern = '<h3 class="event-title"><a href=".+">(.*)</a></h3>'
time = re.findall(timePattern, text)
time = [''.join(i) for i in time]
location = re.findall(locationPattern, text)
title = re.findall(titlePattern, text)
events = [event(title[i], time[i], location[i]) for i in range(len(title))]
for i in events:
    print (30*'-')
    print(i)
    print (' Time: %s' %i.time)
    print (' Location: %s' %i.location)
In [ ]: