Python [conda default] v2.7

The Curious Case of Inconsistent Python UnicodeEncodeError Errors

This content was originally created as part of a Python lecture series and demoed in Python 3.6. It worked cleanly under Python 3.5 and higher, but when I ran it under Python 2.7 it behaved strangely: Unicode encoding and decoding errors halted the code, but not consistently. Even after adding the line "from __future__ import unicode_literals" to the top, the problem persisted.

This problem presented an interesting opportunity to test out some Python concepts, including recursion, before an actual solution (for Python 2.7) was devised. Part of the challenge was to alter the code so that it works in Python 2.7 and still runs in Python 3.6 without having to "change it back".

In this Notebook:

* Note: Originally, the behavior at the command line differed from what was encountered in Jupyter. Though the cause was eventually identified, just to be safe, all testing shows both the Jupyter code cells and the Python script test.

Original Code And The Problem in Python 2.7

The code that triggered this investigation and the problem output are provided in this section.


In [1]:
# another copy of the event object (original unaltered code)

import re
import datetime
class event(object):
    def __init__(self, title, time, location):
        self.title = title
        self.time = time
        self.location = location
    
    def day(self):
        try:
            day = re.findall('\w+', self.time)[:3]
            day = ' '.join(day)
            try: 
                return datetime.datetime.strptime(day, "%d %b %Y")
            except ValueError:
                return datetime.datetime.strptime(day, "%d %B %Y")
        except ValueError:
            return self.time
    
    def status(self):
        if isinstance(self.day(), datetime.datetime):
            now = datetime.datetime.now()
            if now < self.day():
                return 'Upcoming'
            elif now - self.day() < datetime.timedelta(days=1):
                return 'Today'
            else:
                return 'Missed'
        else:
            return 'Unknown'

    def __str__(self):
        return self.status() + ' Event: %s' %self.title

In [4]:
# this version is unchanged from lecture content.  It throws an error as shown below in the output

#!/usr/bin/env python
# -*- coding: utf-8 -*-  

import requests
import datetime
import re

text = requests.get('https://www.python.org/events/python-user-group/').text
timePattern = '<time datetime="[\w:+-]+">(.+)<span class="say-no-more">([\d ]+)</span>(.*)</time>'
locationPattern = '<span class="event-location">(.*)</span>'
titlePattern = '<h3 class="event-title"><a href=".+">(.*)</a></h3>'

time = re.findall(timePattern, text)
time = [''.join(i) for i in time]
location = re.findall(locationPattern, text)
title = re.findall(titlePattern, text)

events = [event(title[i], time[i], location[i]) for i in range(len(title))]

for i in events:
    print (30*'-')
    print(i)
    print ('    Time:  %s' %i.time)
    print ('    Location: %s' %i.location)


------------------------------
Unknown Event: Django Girls Bucaramanga, Colombia
    Time:  08 April &ndash; 09 April  2017
    Location: Bucaramanga, Colombia
------------------------------
Upcoming Event: Python Meetup Innsbruck: imp.reload(innsbruck)
    Time:  25 April 2017
    Location:  Innsbruck, Austria, Europe
------------------------------
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-4-4f0883880c1a> in <module>()
     22 for i in events:
     23     print (30*'-')
---> 24     print(i)
     25     print ('    Time:  %s' %i.time)
     26     print ('    Location: %s' %i.location)

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe3' in position 29: ordinal not in range(128)

In [3]:
!python script/event_original.py


------------------------------
Unknown Event: Django Girls Bucaramanga, Colombia
    Time:  08 April &ndash; 09 April  2017
    Location: Bucaramanga, Colombia
------------------------------
Upcoming Event: Python Meetup Innsbruck: imp.reload(innsbruck)
    Time:  25 April 2017
    Location:  Innsbruck, Austria, Europe
------------------------------
Traceback (most recent call last):
  File "script/event_original.py", line 59, in <module>
    print(i)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe3' in position 29: ordinal not in range(128)
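
The failure in the tracebacks above can be reproduced without the network. Under Python 2, print(i) goes through str(), and a unicode result from __str__ is then encoded with the default ASCII codec. The sketch below performs that ASCII encode explicitly so it behaves the same on Python 2.7 and 3.x. The two strings are copied from output shown elsewhere in this notebook, and find_first_non_ascii is a hypothetical helper name, not part of the lecture code.

```python
def find_first_non_ascii(text):
    """Return (index, character) for the first character ASCII cannot encode, else None."""
    try:
        text.encode('ascii')
    except UnicodeEncodeError as err:
        # err.start is the index the codec reports as "in position N"
        return err.start, err.object[err.start]
    return None

# Escapes are used so this source stays pure ASCII on both Python versions
title = u'Unknown Event: Django Girls S\xe3o Jos\xe9 dos Campos'
location = u'    Location: S\xe3o Jos\xe9 dos Campos, Brazil'

title_hit = find_first_non_ascii(title)
location_hit = find_first_non_ascii(location)
print(title_hit)
print(location_hit)
```

The reported indexes (29 for the title line, 15 for the location line) line up with the positions quoted in the error messages in this notebook.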

Final Solution

This solution may not be the most elegant, but it solves the problem. The code is shown here tested under Python 2.7 in Jupyter, as a Python 2.7 script, and as the same script running under Python 3.6. The research section illustrates the quirks and gotchas encountered along the way to finding this answer.


In [5]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-  

# This code produces a working solution.  There may be more efficient ways to do this, but this works.
# Created for Python 2.7, then modified for cross-compatibility with Python 3.6

import requests
import datetime
import re
import unicodedata  # used in the UnicodeDecodeError fallback to approximate characters in ASCII

class event(object):
    def __init__(self, title, time, location):
        self.title = title
        self.time = time
        self.location = location
    
    def day(self):
        try:
            day = re.findall('\w+', self.time)[:3]
            day = ' '.join(day)
            try: 
                return datetime.datetime.strptime(day, "%d %b %Y")
            except ValueError:
                return datetime.datetime.strptime(day, "%d %B %Y")
        except ValueError:
            return self.time
    
    def status(self):
        if isinstance(self.day(), datetime.datetime):
            now = datetime.datetime.now()
            if now < self.day():
                return 'Upcoming'
            elif now - self.day() < datetime.timedelta(days=1):
                return 'Today'
            else:
                return 'Missed'
        else:
            return 'Unknown'
        
    def __str__(self): 
        try:
            rtnVal = str(self.status()  + ' Event: %s' %self.title)
        except Exception as ee:
            rtnVal = str_Intl(self.status() + ' Event: %s' %self.title.encode('utf-8'))
        return rtnVal
    
# this function was created instead of modifying __str__ because in testing, this error cropped up
# both in the use of a print() statement all by itself, and in an event.__str__ call

def str_Intl(strng):   
    try:
        strng2 = strng.encode('utf-8')
        rtnVal = str(strng2)
        
    except UnicodeEncodeError as uee:
        print("Warning!")
        print("%s: %s" %(type(uee), uee))
        chrStartIndx = len("'ascii' codec can't encode character ")
        chrEndIndx = str(uee).find(" in position ")
        replStr = str(uee)[chrStartIndx:chrEndIndx] 
        startIndx = (chrEndIndx+1) + len("in position ")
        endIndx = str(uee).find(": ordinal")
        oIndx = int(str(uee)[startIndx:endIndx])
        print("Character %d cannot be processed by print() or str() and will be replaced." %(oIndx))
        print("---------------------")
        rtnVal = (strng[0:oIndx] + ("\"%s\"" %replStr) + strng[(oIndx+1):])
        rtnVal = str_Intl(rtnVal)    # recursive function call
        
    except UnicodeDecodeError as ude:
        # early testing with this line from stack overflow did not work for us:
        # strng.encode('utf-8').strip()
        # this solution also strips off the problem characters without outputting what they were
        
        print("Warning!")
        print("%s: %s" %(type(ude), ude))
        print("Where possible, characters are replaced with their closest ascii equivelence.")
        # earlier use of .encode() fixed one issue and bypassed the UnicodeEncodeError handling
        # it then triggered this error for one of the other cases, so now we are trying other solutions:
        
        strng_u = unicode(strng, "utf-8")
        rtnVal = unicodedata.normalize('NFKD', strng_u).encode('ascii', 'ignore')
        # without strng_u, normalize() threw an error that its 2nd argument must be unicode, not str,
        # so the unicode() conversion above was added as a fix for that
                
        rtnVal = str_Intl(rtnVal)
        
    except Exception as ee:
        # when calling this code in a loop, you lose one value and get this error message output instead
        # but the loop can continue over the rest of your data
        rtnVal = "String data could not be processed. Error: %s : %s" %(type(ee), ee)
    return rtnVal    
    
text = requests.get('https://www.python.org/events/python-user-group/').text
timePattern = '<time datetime="[\w:+-]+">(.+)<span class="say-no-more">([\d ]+)</span>(.*)</time>'
locationPattern = '<span class="event-location">(.*)</span>'
titlePattern = '<h3 class="event-title"><a href=".+">(.*)</a></h3>'

time = re.findall(timePattern, text)
time = [''.join(i) for i in time]
location = re.findall(locationPattern, text)
title = re.findall(titlePattern, text)

events = [event(title[i], time[i], location[i]) for i in range(len(title))]

for i in events:
    print (30*'-')
    print(i)                                          # bug fix: i is in events, so this calls __str__ in the object
    print ('    Time    :  %s' %i.time)
    try:
        print ('    Location: %s' %i.location)
    except Exception as ee:
        print (str_Intl('    Location: %s' %i.location))  # bug fix:  error thrown here too
                                                          # str_Intl() will parse out type of error in its try block


------------------------------
Unknown Event: Django Girls Bucaramanga, Colombia
    Time    :  08 April &ndash; 09 April  2017
    Location: Bucaramanga, Colombia
------------------------------
Upcoming Event: Python Meetup Innsbruck: imp.reload(innsbruck)
    Time    :  25 April 2017
    Location:  Innsbruck, Austria, Europe
------------------------------
Warning!
<type 'exceptions.UnicodeDecodeError'>: 'ascii' codec can't decode byte 0xc3 in position 29: ordinal not in range(128)
Where possible, characters are replaced with their closest ascii equivelence.
Unknown Event: Django Girls Sao Jose dos Campos
    Time    :  20 May &ndash; 21 May  2017
    Location: São José dos Campos, Brazil
------------------------------
Unknown Event: Django Girls Accra
    Time    :  16 June &ndash; 18 June  2017
    Location: Accra, Ghana
------------------------------
Missed Event: Python Porto Meetup
    Time    :  24 March 2017
    Location: Porto, Portugal
------------------------------
Unknown Event: PyDelhiConf 2017
    Time    :  18 March &ndash; 20 March  2017
    Location: NCR, Noida, India

In [7]:
# for completeness .. the code is also re-tested as a script under Python 2.7
!python script/event.py


------------------------------
Unknown Event: Django Girls Bucaramanga, Colombia
    Time    :  08 April &ndash; 09 April  2017
    Location: Bucaramanga, Colombia
------------------------------
Upcoming Event: Python Meetup Innsbruck: imp.reload(innsbruck)
    Time    :  25 April 2017
    Location:  Innsbruck, Austria, Europe
------------------------------
Warning!
<type 'exceptions.UnicodeDecodeError'>: 'ascii' codec can't decode byte 0xc3 in position 29: ordinal not in range(128)
Where possible, characters are replaced with their closest ascii equivelence.
Unknown Event: Django Girls Sao Jose dos Campos
    Time    :  20 May &ndash; 21 May  2017
    Location: São José dos Campos, Brazil
------------------------------
Unknown Event: Django Girls Accra
    Time    :  16 June &ndash; 18 June  2017
    Location: Accra, Ghana
------------------------------
Missed Event: Python Porto Meetup
    Time    :  24 March 2017
    Location: Porto, Portugal
------------------------------
Unknown Event: PyDelhiConf 2017
    Time    :  18 March &ndash; 20 March  2017
    Location: NCR, Noida, India
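
The title line in the output above loses its accents ("Sao Jose") because the UnicodeDecodeError branch of str_Intl normalizes the text to NFKD and then drops everything that is not ASCII. A minimal sketch of just that step, assuming the input is already a unicode string; to_closest_ascii is a hypothetical name used only for illustration.

```python
import unicodedata

def to_closest_ascii(text):
    """Decompose accented characters (NFKD), then drop whatever is still not ASCII."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

approx = to_closest_ascii(u'Django Girls S\xe3o Jos\xe9 dos Campos')
print(approx)

# Characters with no decomposition are dropped silently, e.g. u'\xf8' (o with stroke)
dropped = to_closest_ascii(u'\xf8')
```

The trade-off is that the replacement is lossy: characters that have no ASCII base letter simply vanish, with no record of what they were.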

Note: most errors are avoided rather than handled, but the one that still gets through deliberately throws a warning to alert us to it. It would be easy to edit the code to hide this warning if it were undesirable. With more time, the exact path to this error could probably be identified and avoided as well. Here we test on Python 3.6 to show this code works just as well there as the original (which was designed for Python 3.6).


In [6]:
# Test script version using Python 3.6
!C:/ProgramFilesCoders/Anaconda2/envs/PY36/python script/event.py


------------------------------
Unknown Event: Django Girls Bucaramanga, Colombia
    Time    :  08 April &ndash; 09 April  2017
    Location: Bucaramanga, Colombia
------------------------------
Upcoming Event: Python Meetup Innsbruck: imp.reload(innsbruck)
    Time    :  25 April 2017
    Location:  Innsbruck, Austria, Europe
------------------------------
Unknown Event: Django Girls São José dos Campos
    Time    :  20 May &ndash; 21 May  2017
    Location: São José dos Campos, Brazil
------------------------------
Unknown Event: Django Girls Accra
    Time    :  16 June &ndash; 18 June  2017
    Location: Accra, Ghana
------------------------------
Missed Event: Python Porto Meetup
    Time    :  24 March 2017
    Location: Porto, Portugal
------------------------------
Unknown Event: PyDelhiConf 2017
    Time    :  18 March &ndash; 20 March  2017
    Location: NCR, Noida, India
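
The Note above leaves open how the remaining warning arises under Python 2.7. One plausible reading of the final code: __str__ catches the first failure and hands str_Intl a string built from self.title.encode('utf-8'), which is a byte string. str_Intl then calls .encode('utf-8') on it again, and in Python 2 encoding a byte string first decodes it with the default ASCII codec. That implicit decode would fail on the first UTF-8 byte of the accented character, which matches the "can't decode byte 0xc3 in position 29" message. The sketch below performs the ASCII decode explicitly so that it also runs on Python 3, where bytes objects have no .encode() method.

```python
# The bytes that __str__ would pass along after its fallback .encode('utf-8')
title_bytes = u'Unknown Event: Django Girls S\xe3o Jos\xe9 dos Campos'.encode('utf-8')

try:
    # Python 2 does this decode implicitly when .encode() is called on a byte string
    title_bytes.decode('ascii')
    failure = None
except UnicodeDecodeError as err:
    failure = (err.start, title_bytes[err.start:err.start + 1])

print(repr(failure))
```

If this reading is right, passing the unicode title to str_Intl without encoding it first would avoid the warning, though that has not been tested here.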

Research and Experiments

This section contains the research and experiments that led up to the solution. It presents the information as a story: background, the problem, and finally the iterations of the code leading up to the final solution. To tell the story this way, some of the content from earlier sections is repeated.

References

These links on Stack Overflow were particularly helpful in devising the solution. It is interesting to note that others in the community have also reported inconsistent behavior in when encoding errors are thrown (under Python 2.x), and solutions that appear to work in one context but fail in others.

Stack Overflow posts on this topic:


In [8]:
# the with block closes the file once it has been read
with open('data/python-event.html') as f:
    event = f.read()

In [9]:
import re

Note how, in the cells that follow, non-ASCII characters are printed without error. Stranger still, note the line with "dos Campos, Brazil" in it: it prints correctly here, yet later code in this notebook throws an error on a special character in the same full city name.


In [10]:
locationPattern = '<span class="event-location">(.*)</span>'
location = re.findall(locationPattern, event)
for i in location:
    print (i)


Bucaramanga, Colombia
 Innsbruck, Austria, Europe
São José dos Campos, Brazil
Accra, Ghana
Porto, Portugal
NCR, Noida, India

In [11]:
titlePattern = '<h3 class="event-title"><a href=".+">(.*)</a></h3>'
title = re.findall(titlePattern, event)
for i in title:
    print (i)


Django Girls Bucaramanga, Colombia
Python Meetup Innsbruck: imp.reload(innsbruck)
Django Girls São José dos Campos
Django Girls Accra
Python Porto Meetup
PyDelhiConf 2017
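
One likely reason these cells print cleanly under Python 2.7 is that open(...).read() returns a byte string there, and print() writes bytes through unchanged, whereas requests.get(...).text (used later) is a unicode object. The sketch below reads a file both ways with io.open, which exists on Python 2.7 and 3.x. The temporary file is a stand-in for data/python-event.html, and its one-line content is a made-up sample.

```python
import io
import os
import re
import tempfile

sample_html = u'<span class="event-location">S\xe3o Jos\xe9 dos Campos, Brazil</span>'

# Write the sample as UTF-8, then read it back as raw bytes and as decoded text
fd, path = tempfile.mkstemp(suffix='.html')
os.close(fd)
try:
    with io.open(path, 'w', encoding='utf-8') as handle:
        handle.write(sample_html)
    with io.open(path, 'rb') as handle:
        raw = handle.read()
    with io.open(path, 'r', encoding='utf-8') as handle:
        text = handle.read()
finally:
    os.remove(path)

locations = re.findall(u'<span class="event-location">(.*)</span>', text)
print('%s of length %d' % (type(raw).__name__, len(raw)))
print('%s of length %d' % (type(text).__name__, len(text)))
```

The byte version is two units longer than the text version because each of the two accented characters occupies two bytes in UTF-8.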

In [12]:
# Here is the first version of the event() object
# it gets used in the code that follows (where the problem occurs)

import re
import datetime
class event(object):
    def __init__(self, title, time, location):
        self.title = title
        self.time = time
        self.location = location
    
    def day(self):
        try:
            day = re.findall('\w+', self.time)[:3]
            day = ' '.join(day)
            try: 
                return datetime.datetime.strptime(day, "%d %b %Y")
            except ValueError:
                return datetime.datetime.strptime(day, "%d %B %Y")
        except ValueError:
            return self.time
    
    def status(self):
        if isinstance(self.day(), datetime.datetime):
            now = datetime.datetime.now()
            if now < self.day():
                return 'Upcoming'
            elif now - self.day() < datetime.timedelta(days=1):
                return 'Today'
            else:
                return 'Missed'
        else:
            return 'Unknown'

    def __str__(self):
        return self.status() + ' Event: %s' %self.title

In [13]:
# in these test cells non-ASCII characters continue to output without error even when passed to str()
# these cells were in the original content and also illustrate differences in using print() versus str()
# on the same line of content.  However, it is generally understood that print() calls str() to ensure
# that content fed into it is a string before outputting it, so overriding __str__, as is done in a later
# version of the event object, has an impact on print() as well

event1 = event('Python Meeting Düsseldorf', '20 Jan. 2015 5pm UTC – 7pm UTC', \
          'Bürgerhaus im Stadtteilzentrum Bilk, Raum 1, 2. OG, Bachstr. 145, 40217 Düsseldorf, Germany')
print (event1.day())
print (event1)


2015-01-20 00:00:00
Missed Event: Python Meeting Düsseldorf

In [14]:
str(event1)


Out[14]:
'Missed Event: Python Meeting D\xc3\xbcsseldorf'
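
The escaped characters in Out[14] are the UTF-8 byte sequence for "ü". In a Python 2 cell a plain string literal is a byte string, so str(event1) returns bytes and the notebook displays their repr, while print() sends the same bytes to the output where they are rendered as one character. A small sketch of that relationship, written with escapes so it runs unchanged on Python 2.7 and 3.x:

```python
name = u'D\xfcsseldorf'          # one unicode character for u-umlaut
encoded = name.encode('utf-8')   # the bytes a Python 2 str literal would hold
decoded = encoded.decode('utf-8')

print(repr(encoded))
print(len(name))
print(len(encoded))
```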

In [15]:
# the full source can be viewed using this code:
# It is commented out here:
'''
import requests
text = requests.get('https://www.python.org/events/python-user-group/').text
text
'''


Out[15]:
"\nimport requests\ntext = requests.get('https://www.python.org/events/python-user-group/').text\ntext\n"

Problem Code

The code that triggered this investigation and the problem output are provided in this section.


In [16]:
# another copy of the event object (original unaltered code) so you don't have to scroll up to view it

import re
import datetime
class event(object):
    def __init__(self, title, time, location):
        self.title = title
        self.time = time
        self.location = location
    
    def day(self):
        try:
            day = re.findall('\w+', self.time)[:3]
            day = ' '.join(day)
            try: 
                return datetime.datetime.strptime(day, "%d %b %Y")
            except ValueError:
                return datetime.datetime.strptime(day, "%d %B %Y")
        except ValueError:
            return self.time
    
    def status(self):
        if isinstance(self.day(), datetime.datetime):
            now = datetime.datetime.now()
            if now < self.day():
                return 'Upcoming'
            elif now - self.day() < datetime.timedelta(days=1):
                return 'Today'
            else:
                return 'Missed'
        else:
            return 'Unknown'

    def __str__(self):
        return self.status() + ' Event: %s' %self.title

In [17]:
# this version is unchanged from lecture content.  It throws an error as shown below in the output

import requests
import datetime
import re

text = requests.get('https://www.python.org/events/python-user-group/').text
timePattern = '<time datetime="[\w:+-]+">(.+)<span class="say-no-more">([\d ]+)</span>(.*)</time>'
locationPattern = '<span class="event-location">(.*)</span>'
titlePattern = '<h3 class="event-title"><a href=".+">(.*)</a></h3>'

time = re.findall(timePattern, text)
time = [''.join(i) for i in time]
location = re.findall(locationPattern, text)
title = re.findall(titlePattern, text)

events = [event(title[i], time[i], location[i]) for i in range(len(title))]

for i in events:
    print (30*'-')
    print(i)
    print ('    Time:  %s' %i.time)
    print ('    Location: %s' %i.location)


------------------------------
Unknown Event: Django Girls Bucaramanga, Colombia
    Time:  08 April &ndash; 09 April  2017
    Location: Bucaramanga, Colombia
------------------------------
Upcoming Event: Python Meetup Innsbruck: imp.reload(innsbruck)
    Time:  25 April 2017
    Location:  Innsbruck, Austria, Europe
------------------------------
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-17-bb60359e4434> in <module>()
     19 for i in events:
     20     print (30*'-')
---> 21     print(i)
     22     print ('    Time:  %s' %i.time)
     23     print ('    Location: %s' %i.location)

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe3' in position 29: ordinal not in range(128)

In [18]:
!python script/event_original.py


------------------------------
Unknown Event: Django Girls Bucaramanga, Colombia
    Time:  08 April &ndash; 09 April  2017
    Location: Bucaramanga, Colombia
------------------------------
Upcoming Event: Python Meetup Innsbruck: imp.reload(innsbruck)
    Time:  25 April 2017
    Location:  Innsbruck, Austria, Europe
------------------------------
Traceback (most recent call last):
  File "script/event_original.py", line 59, in <module>
    print(i)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe3' in position 29: ordinal not in range(128)

Preventing Code Failure

The right approach to this problem is to look up the error and see if there is a fix (which is explored later in this notebook).

The theory behind the code cells that immediately follow, however, is the answer to a simple question: what if we encounter characters that our current installation can't handle? What should the code do?

In this case, the desired behavior is to print warnings about what went wrong so a better solution can be explored, and to handle the misbehaving characters so the rest of the code can keep running instead of halting on the error. Additionally, it is desirable to output as much of the original content around the error as possible.
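
The behavior described above (warn, substitute, keep going) can be sketched using the attributes that UnicodeEncodeError carries (start and end) rather than parsing the message text. This is an alternative illustration under stated assumptions, not the code used in the experiments: ascii_with_warnings, the warn callback, and the <U+XXXX> placeholder format are all hypothetical choices.

```python
def _print_warning(message):
    print(message)

def ascii_with_warnings(text, warn=_print_warning):
    """Return an ASCII-only copy of text, reporting every character that was replaced."""
    pieces = []
    remaining = text
    offset = 0  # index of remaining[0] within the original text
    while remaining:
        try:
            remaining.encode('ascii')
        except UnicodeEncodeError as err:
            bad = remaining[err.start:err.end]
            placeholder = u''.join(u'<U+%04X>' % ord(ch) for ch in bad)
            warn(u'Warning: character at position %d replaced with %s'
                 % (offset + err.start, placeholder))
            pieces.append(remaining[:err.start] + placeholder)
            offset += err.end
            remaining = remaining[err.end:]
        else:
            # nothing left to replace: keep the rest and stop
            pieces.append(remaining)
            remaining = u''
    return u''.join(pieces)

messages = []
cleaned = ascii_with_warnings(u'S\xe3o Jos\xe9 dos Campos, Brazil', warn=messages.append)
print(cleaned)
for message in messages:
    print(message)
```

Collecting the warnings through a callback keeps the reporting separate from the replacement, so the same function can log quietly or print loudly.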

Experiment One


In [19]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-  

# quick and dirty solution that assumes only one error per line of content processed by the loop

import requests
import datetime
import re

text = requests.get('https://www.python.org/events/python-user-group/').text
timePattern = '<time datetime="[\w:+-]+">(.+)<span class="say-no-more">([\d ]+)</span>(.*)</time>'
locationPattern = '<span class="event-location">(.*)</span>'
titlePattern = '<h3 class="event-title"><a href=".+">(.*)</a></h3>'

time = re.findall(timePattern, text)
time = [''.join(i) for i in time]
location = re.findall(locationPattern, text)
title = re.findall(titlePattern, text)

events = [event(title[i], time[i], location[i]) for i in range(len(title))]

for i in events:
    print (30*'-')
    try:
        print(i)
    except UnicodeEncodeError as uee:
        print(type(uee))
        print(uee)
        startIndx = str(uee).find("in position ")+len("in position ")
        endIndx = str(uee).find(": ordinal")
        oIndx = int(str(uee)[startIndx:endIndx])
        print("Character %d of the Event Title Needs to be removed before print() or str() can process it." %(oIndx))
    print ('    Time    :  %s' %i.time)
    print ('    Location: %s' %i.location)


------------------------------
Unknown Event: Django Girls Bucaramanga, Colombia
    Time    :  08 April &ndash; 09 April  2017
    Location: Bucaramanga, Colombia
------------------------------
Upcoming Event: Python Meetup Innsbruck: imp.reload(innsbruck)
    Time    :  25 April 2017
    Location:  Innsbruck, Austria, Europe
------------------------------
<type 'exceptions.UnicodeEncodeError'>
'ascii' codec can't encode character u'\xe3' in position 29: ordinal not in range(128)
Character 29 of the Event Title Needs to be removed before print() or str() can process it.
    Time    :  20 May &ndash; 21 May  2017
    Location: São José dos Campos, Brazil
------------------------------
Unknown Event: Django Girls Accra
    Time    :  16 June &ndash; 18 June  2017
    Location: Accra, Ghana
------------------------------
Missed Event: Python Porto Meetup
    Time    :  24 March 2017
    Location: Porto, Portugal
------------------------------
Unknown Event: PyDelhiConf 2017
    Time    :  18 March &ndash; 20 March  2017
    Location: NCR, Noida, India

The above solution appears to work ... but what if more than one error is triggered in a line of the content? It should also be noted that this solution failed when run as a command-line script containing what, at first glance, seemed to be exactly the same code. Later in the testing process, it was realized that the problem might be the header, and this was added to the top of the script:


In [20]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

In [22]:
!python script/event_v1.py


------------------------------
Unknown Event: Django Girls Bucaramanga, Colombia
    Time    :  08 April &ndash; 09 April  2017
    Location: Bucaramanga, Colombia
------------------------------
Upcoming Event: Python Meetup Innsbruck: imp.reload(innsbruck)
    Time    :  25 April 2017
    Location:  Innsbruck, Austria, Europe
------------------------------
<type 'exceptions.UnicodeEncodeError'>
'ascii' codec can't encode character u'\xe3' in position 29: ordinal not in range(128)
Character 29 of the Event Title Needs to be removed before print() or str() can process it.
    Time    :  20 May &ndash; 21 May  2017
Traceback (most recent call last):
  File "script/event_v1.py", line 71, in <module>
    print ('    Location: %s' %i.location)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe3' in position 15: ordinal not in range(128)

It turns out that if we add the header lines to the Jupyter notebook cell, the two copies of the code fail in more similar ways. Though the header appears to be intended for scripts, it seemed to influence the code in Jupyter cells as well. The strange thing here, though, is that adding the UTF-8 line results in more errors in the script version than leaving it out, even though the research that follows shows that handling of UTF-8 is at the core of fixing this problem.
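
A note on what the header controls: the "# -*- coding: utf-8 -*-" declaration tells Python how to decode the source file itself (PEP 263). It does not set the encoding that print() uses for output. A possible explanation for the Jupyter versus script differences is therefore that each environment provides a different sys.stdout. In Python 2, when output goes to a pipe (as it does with "!python ..."), sys.stdout.encoding is typically None and unicode text falls back to the ASCII default. This has not been confirmed for the runs in this notebook. The sketch below only reports the relevant settings so the two environments can be compared; describe_text_environment is a hypothetical helper name.

```python
import sys

def describe_text_environment():
    """Collect the settings that decide how text is encoded on output."""
    return {
        'python': '%d.%d' % sys.version_info[:2],
        'default_encoding': sys.getdefaultencoding(),
        # some replacement stdout objects have no encoding attribute at all
        'stdout_encoding': getattr(sys.stdout, 'encoding', None),
    }

environment = describe_text_environment()
for key in sorted(environment):
    print('%s: %s' % (key, environment[key]))
```

Running this once in a Jupyter cell and once through "!python" should show whether the two environments really differ in their output encoding.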

Experiment 2


In [23]:
# This version of the code assumes we want to see warnings and error content along with as much of the output
# as can be processed around the characters causing the problem

''' Here we see code designed to find the misbehaving characters, output as much as we know about them
    (text originally captured in the error messages when the code halted), output as much as possible of
    the non-misbehaving content, and continue to run.  
    
    The output when the error is encountered is ugly, but this is deliberate.  It shows all errors triggered and
    highlights the recursive nature of this solution for future study.
'''

#!/usr/bin/env python
# -*- coding: utf-8 -*-  

import requests
import datetime
import re

class event(object):
    def __init__(self, title, time, location):
        self.title = title
        self.time = time
        self.location = location
    
    def day(self):
        try:
            day = re.findall('\w+', self.time)[:3]
            day = ' '.join(day)
            try: 
                return datetime.datetime.strptime(day, "%d %b %Y")
            except ValueError:
                return datetime.datetime.strptime(day, "%d %B %Y")
        except ValueError:
            return self.time
    
    def status(self):
        if isinstance(self.day(), datetime.datetime):
            now = datetime.datetime.now()
            if now < self.day():
                return 'Upcoming'
            elif now - self.day() < datetime.timedelta(days=1):
                return 'Today'
            else:
                return 'Missed'
        else:
            return 'Unknown'
        
    def __str__(self):     
        return str_Intl(self.status() + ' Event: %s' %self.title)    # call to str_Intl is part of bug fix

# this function was created instead of modifying __str__ because in testing, this error cropped up
# both in the use of a print() statement all by itself, and in an event.__str__ call

def str_Intl(strng):   
    try:
        rtnVal = str(strng)
    except UnicodeEncodeError as uee:
        print("Warning!")
        print("%s: %s" %(type(uee), uee))
        chrStartIndx = len("'ascii' codec can't encode character ")
        chrEndIndx = str(uee).find(" in position ")
        replStr = str(uee)[chrStartIndx:chrEndIndx] 
        startIndx = (chrEndIndx+1) + len("in position ")
        endIndx = str(uee).find(": ordinal")
        oIndx = int(str(uee)[startIndx:endIndx])
        print("Character %d cannot be processed by print() or str() and will be replaced." %(oIndx))
        print("---------------------")
        rtnVal = (strng[0:oIndx] + ("\"%s\"" %replStr) + strng[(oIndx+1):])
        rtnVal = str_Intl(rtnVal)    # recursive function call
    except Exception as ee:
        rtnVal = "String data could not be processed. Error: %s : %s" %(type(ee), ee)
    return rtnVal    
    
text = requests.get('https://www.python.org/events/python-user-group/').text
timePattern = '<time datetime="[\w:+-]+">(.+)<span class="say-no-more">([\d ]+)</span>(.*)</time>'
locationPattern = '<span class="event-location">(.*)</span>'
titlePattern = '<h3 class="event-title"><a href=".+">(.*)</a></h3>'

time = re.findall(timePattern, text)
time = [''.join(i) for i in time]
location = re.findall(locationPattern, text)
title = re.findall(titlePattern, text)

events = [event(title[i], time[i], location[i]) for i in range(len(title))]

for i in events:
    print (30*'-')
    print(i)                                         # bug fix: i is in events, so this calls __str__ in the object
    print ('    Time    :  %s' %i.time)
    print (str_Intl('    Location: %s' %i.location)) # when bug happened here, had to add str_Intl as bug fix


------------------------------
Unknown Event: Django Girls Bucaramanga, Colombia
    Time    :  08 April &ndash; 09 April  2017
    Location: Bucaramanga, Colombia
------------------------------
Upcoming Event: Python Meetup Innsbruck: imp.reload(innsbruck)
    Time    :  25 April 2017
    Location:  Innsbruck, Austria, Europe
------------------------------
Warning!
<type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't encode character u'\xe3' in position 29: ordinal not in range(128)
Character 29 cannot be processed by print() or str() and will be replaced.
---------------------
Warning!
<type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't encode character u'\xe9' in position 43: ordinal not in range(128)
Character 43 cannot be processed by print() or str() and will be replaced.
---------------------
Unknown Event: Django Girls S"u'\xe3'"o Jos"u'\xe9'" dos Campos
    Time    :  20 May &ndash; 21 May  2017
Warning!
<type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't encode character u'\xe3' in position 15: ordinal not in range(128)
Character 15 cannot be processed by print() or str() and will be replaced.
---------------------
Warning!
<type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't encode character u'\xe9' in position 29: ordinal not in range(128)
Character 29 cannot be processed by print() or str() and will be replaced.
---------------------
    Location: S"u'\xe3'"o Jos"u'\xe9'" dos Campos, Brazil
------------------------------
Unknown Event: Django Girls Accra
    Time    :  16 June &ndash; 18 June  2017
    Location: Accra, Ghana
------------------------------
Missed Event: Python Porto Meetup
    Time    :  24 March 2017
    Location: Porto, Portugal
------------------------------
Unknown Event: PyDelhiConf 2017
    Time    :  18 March &ndash; 20 March  2017
    Location: NCR, Noida, India
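
For comparison with the recursive replacement above, the standard codec error handlers give similar results in a single call, without the warnings. This is a sketch of an alternative rather than a drop-in replacement for str_Intl: the sample string is taken from the output above, and each result is decoded back to text only so that it prints the same way on Python 2.7 and 3.x.

```python
text = u'S\xe3o Jos\xe9 dos Campos'
handlers = ['ignore', 'replace', 'backslashreplace', 'xmlcharrefreplace']

results = {}
for name in handlers:
    # the second argument to encode() selects what happens to unencodable characters
    results[name] = text.encode('ascii', name).decode('ascii')
    print('%-18s %s' % (name, results[name]))
```

Of the four, backslashreplace is closest in spirit to Experiment 2, since it keeps a visible record of which character was replaced.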

In [24]:
!python script/event_v2.py


------------------------------
Unknown Event: Django Girls Bucaramanga, Colombia
    Time    :  08 April &ndash; 09 April  2017
    Location: Bucaramanga, Colombia
------------------------------
Upcoming Event: Python Meetup Innsbruck: imp.reload(innsbruck)
    Time    :  25 April 2017
    Location:  Innsbruck, Austria, Europe
------------------------------
Warning!
<type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't encode character u'\xe3' in position 29: ordinal not in range(128)
Character 29 cannot be processed by print() or str() and will be replaced.
---------------------
Warning!
<type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't encode character u'\xe9' in position 43: ordinal not in range(128)
Character 43 cannot be processed by print() or str() and will be replaced.
---------------------
Unknown Event: Django Girls S"u'\xe3'"o Jos"u'\xe9'" dos Campos
    Time    :  20 May &ndash; 21 May  2017
Warning!
<type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't encode character u'\xe3' in position 15: ordinal not in range(128)
Character 15 cannot be processed by print() or str() and will be replaced.
---------------------
Warning!
<type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't encode character u'\xe9' in position 29: ordinal not in range(128)
Character 29 cannot be processed by print() or str() and will be replaced.
---------------------
    Location: S"u'\xe3'"o Jos"u'\xe9'" dos Campos, Brazil
------------------------------
Unknown Event: Django Girls Accra
    Time    :  16 June &ndash; 18 June  2017
    Location: Accra, Ghana
------------------------------
Missed Event: Python Porto Meetup
    Time    :  24 March 2017
    Location: Porto, Portugal
------------------------------
Unknown Event: PyDelhiConf 2017
    Time    :  18 March &ndash; 20 March  2017
    Location: NCR, Noida, India

Note that several earlier runs of the code above did not throw the error on the Location line for "dos Campos, Brazil", so it came as a surprise when the error first appeared while running the code as a script. Later runs (before realizing the header implications) were inconsistent, sometimes throwing the error and sometimes not. This part of the testing experience does not appear to be reproducible; the error now occurs consistently. Stack Overflow posts on this topic, however, indicate that others have had the same experience when dealing with international character sets. From this point forward, the Jupyter cells and the script versions appear to behave identically.
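
One factor that commonly differs between a notebook kernel and a console session is the encoding attached to standard output. The notebook does not confirm that this caused the inconsistency described above, so the following is only a minimal sketch for inspecting the relevant settings in each environment. The helper name `describe_encodings` is hypothetical and not part of the original code.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import locale
import sys


def describe_encodings():
    """Collect the encodings that influence print() and implicit conversions."""
    return {
        # used by Python 2.7 for implicit str <-> unicode conversion
        "default": sys.getdefaultencoding(),
        # used when print() writes text; may be None when output is piped
        "stdout": getattr(sys.stdout, "encoding", None),
        "filesystem": sys.getfilesystemencoding(),
        "locale": locale.getpreferredencoding(False),
    }


for name, value in sorted(describe_encodings().items()):
    print("%-10s : %s" % (name, value))
```

Running the same snippet in a Jupyter cell and as a script makes any difference between the two environments visible before debugging further.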

Real Solution to the Problem

The following web topics were part of the research into a more comprehensive solution to the problem. While the code in the solutions below (even the final one) may seem overly complicated, every time it looked as though all instances of the encoding error had been handled, another one would mysteriously crop up in testing. By the end of this section, the code handles the error in such a way that the branch that strips out characters it cannot handle never fires. The output is reduced to a single warning about ASCII replacement, with no loss of characters (which might make some content harder to read). The branch that strips out unhandled characters and warns us is retained in case some character we did not test needs it in the future. Should that ever happen, the code can then be revisited.
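
The cells in this notebook locate the offending character by parsing the text of the error message. As an alternative sketch (not the approach used in this notebook), `UnicodeEncodeError` also exposes the position and the original string through its `start`, `end` and `object` attributes, and `encode()` accepts built-in error handlers that replace or drop characters. The helper name `describe_encode_failure` is hypothetical.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def describe_encode_failure(text, encoding="ascii"):
    """Return (position, character) for the first unencodable character, or None."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError as err:
        # err.object is the original string; start/end delimit the failing slice
        return err.start, err.object[err.start:err.end]
    return None


title = u"Django Girls S\xe3o Jos\xe9 dos Campos"

print(describe_encode_failure(title))

# Built-in error handlers, shown as encoded byte strings:
print(title.encode("ascii", "replace"))           # substitutes '?'
print(title.encode("ascii", "ignore"))            # drops the character
print(title.encode("ascii", "backslashreplace"))  # keeps an escape sequence
```

The `backslashreplace` handler is closest in spirit to the replacement done in `str_Intl`, since the reader can still see which character was affected.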

Stack Overflow posts on this topic:

Experiment 3


In [25]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-  

# Draft Version ... presented to illustrate how the problem shifts and how later solutions address this
#   * In this version, several solutions from Stack Overflow are added into the code.
#   * the result is that one encode error is solved but another one shifts to a decode error
#   * one suggested solution for decode is used and though it worked for others, it fails here

import requests
import datetime
import re

class event(object):
    def __init__(self, title, time, location):
        self.title = title
        self.time = time
        self.location = location
    
    def day(self):
        try:
            day = re.findall('\w+', self.time)[:3]
            day = ' '.join(day)
            try: 
                return datetime.datetime.strptime(day, "%d %b %Y")
            except ValueError:
                return datetime.datetime.strptime(day, "%d %B %Y")
        except ValueError:
            return self.time
    
    def status(self):
        if isinstance(self.day(), datetime.datetime):
            now = datetime.datetime.now()
            if now < self.day():
                return 'Upcoming'
            elif now - self.day() < datetime.timedelta(days=1):
                return 'Today'
            else:
                return 'Missed'
        else:
            return 'Unknown'
        
    def __str__(self):     
        return str_Intl(self.status() + ' Event: %s' %self.title.encode('utf-8'))    # call to str_Intl is part of bug fix
        # a.encode('utf-8')

# this function was created instead of modifying __str__ because, in testing, this error cropped up
# both when using a print() statement by itself and in an event.__str__ call

def str_Intl(strng):   
    try:
        strng2 = strng.encode('utf-8')
        rtnVal = str(strng2)           # if str() throws an error, we have a problem (which is why it is here)
                                       # note that in our code though, we really already know that inputs are string
                                       # so if it weren't for the try test on str() using it might be redundant
    except UnicodeEncodeError as uee:
        print("Warning!")
        print("%s: %s" %(type(uee), uee))
        chrStartIndx = len("'ascii' codec can't encode character ")
        chrEndIndx = str(uee).find(" in position ")
        replStr = str(uee)[chrStartIndx:chrEndIndx] 
        startIndx = (chrEndIndx+1) + len("in position ")
        endIndx = str(uee).find(": ordinal")
        oIndx = int(str(uee)[startIndx:endIndx])
        print("Character %d cannot be processed by print() or str() and will be replaced." %(oIndx))
        print("---------------------")
        rtnVal = (strng[0:oIndx] + ("\"%s\"" %replStr) + strng[(oIndx+1):])
        rtnVal = str_Intl(rtnVal)    # recursive function call
        
    except UnicodeDecodeError as ude:
        # early testing with this line from stack overflow did not work for us:
        # strng.encode('utf-8').strip()
        # this solution also strips off the problem characters without outputting what they were
        
        print("Warning!")
        print("%s: %s" %(type(ude), ude))
        # earlier use of .encode() fixed one issue and bypassed the UnicodeEncodeError handling
        # it then triggered this error for one of the other cases, so now we trying other solutions:
        
        rtnVal = strng.encode('utf-8').strip()
        rtnVal = str_Intl(rtnVal)
        
    except Exception as ee:
        # when calling this code in a loop, you lose one string and get this error message output instead
        # but the loop can continue over the rest of the data
        rtnVal = "String data could not be processed. Error: %s : %s" %(type(ee), ee)
    return rtnVal    
    
text = requests.get('https://www.python.org/events/python-user-group/').text
timePattern = '<time datetime="[\w:+-]+">(.+)<span class="say-no-more">([\d ]+)</span>(.*)</time>'
locationPattern = '<span class="event-location">(.*)</span>'
titlePattern = '<h3 class="event-title"><a href=".+">(.*)</a></h3>'

time = re.findall(timePattern, text)
time = [''.join(i) for i in time]
location = re.findall(locationPattern, text)
title = re.findall(titlePattern, text)

events = [event(title[i], time[i], location[i]) for i in range(len(title))]

for i in events:
    print (30*'-')
    print(i)                                          # bug fix: i is in events, so this calls __str__ in the object
    print ('    Time    :  %s' %i.time)
    print (str_Intl('    Location: %s' %i.location))  # when bug happened here, had to add str_Intl as bug fix


------------------------------
Unknown Event: Django Girls Bucaramanga, Colombia
    Time    :  08 April &ndash; 09 April  2017
    Location: Bucaramanga, Colombia
------------------------------
Upcoming Event: Python Meetup Innsbruck: imp.reload(innsbruck)
    Time    :  25 April 2017
    Location:  Innsbruck, Austria, Europe
------------------------------
Warning!
<type 'exceptions.UnicodeDecodeError'>: 'ascii' codec can't decode byte 0xc3 in position 29: ordinal not in range(128)
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-25-2927ac3e526d> in <module>()
    100 for i in events:
    101     print (30*'-')
--> 102     print(i)                                          # bug fix: i is in events, so this calls __str__ in the object
    103     print ('    Time    :  %s' %i.time)
    104     print (str_Intl('    Location: %s' %i.location))  # when bug happened here, had to add str_Intl as bug fix

<ipython-input-25-2927ac3e526d> in __str__(self)
     41 
     42     def __str__(self):
---> 43         return str_Intl(self.status() + ' Event: %s' %self.title.encode('utf-8'))    # call to str_Intl is part of bug fix
     44         # a.encode('utf-8')
     45 

<ipython-input-25-2927ac3e526d> in str_Intl(strng)
     77         # it then triggered this error for one of the other cases, so now we trying other solutions:
     78 
---> 79         rtnVal = strng.encode('utf-8').strip()
     80         rtnVal = str_Intl(rtnVal)
     81 

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 29: ordinal not in range(128)

In [26]:
!python script/event_v3.py


------------------------------
Unknown Event: Django Girls Bucaramanga, Colombia
    Time    :  08 April &ndash; 09 April  2017
    Location: Bucaramanga, Colombia
------------------------------
Upcoming Event: Python Meetup Innsbruck: imp.reload(innsbruck)
    Time    :  25 April 2017
    Location:  Innsbruck, Austria, Europe
------------------------------
Warning!
<type 'exceptions.UnicodeDecodeError'>: 'ascii' codec can't decode byte 0xc3 in position 29: ordinal not in range(128)
Traceback (most recent call last):
  File "script/event_v3.py", line 104, in <module>
    print(i)                                          # bug fix: i is in events, so this calls __str__ in the object
  File "script/event_v3.py", line 45, in __str__
    return str_Intl(self.status() + ' Event: %s' %self.title.encode('utf-8'))    # call to str_Intl is part of bug fix
  File "script/event_v3.py", line 81, in str_Intl
    rtnVal = strng.encode('utf-8').strip()
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 29: ordinal not in range(128)

Final Solution for Python 2.7 (But Not The Final Solution)

No work is ever complete, just abandoned. This version works and covers different scenarios that might be encountered in future content, so until it fails to meet expectations, it is good enough. Or at least it was, until cross-testing on Python 3.6 found a problem.
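
A likely explanation for the decode error seen in Experiment 3 is that the title is already a UTF-8 byte string by the time `str_Intl` calls `.encode('utf-8')` on it. In Python 2.7, calling `.encode()` on a byte string first decodes it implicitly with the ASCII codec, and that step fails on byte 0xc3. The sketch below reproduces that decode explicitly so it behaves the same on either Python version, then shows the NFKD normalization that the cell below uses as its fallback. The helper name `ascii_fold` is hypothetical.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import unicodedata


def ascii_fold(text):
    """Decompose accented characters (NFKD), then drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


# bytes, comparable to what title.encode('utf-8') produces
utf8_bytes = u"S\xe3o Jos\xe9".encode("utf-8")

# Python 2.7 performs this ASCII decode implicitly when .encode() is called
# on a byte string; doing it explicitly reproduces the same failure.
try:
    utf8_bytes.decode("ascii")
except UnicodeDecodeError as err:
    bad_byte = bytearray(utf8_bytes)[err.start]
    print("byte 0x%02x at position %d" % (bad_byte, err.start))

# Decoding with the correct codec first makes normalization possible.
print(ascii_fold(utf8_bytes.decode("utf-8")))
```

Note that this folding is lossy: characters with no ASCII base letter are removed entirely, which is why the code in this notebook prints a warning when it is applied.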


In [27]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-  

# This code produces a working solution.  There may be more efficient ways to do this, but this works.

import requests
import datetime
import re
import unicodedata  # used in str_Intl to replace accented characters with ASCII equivalents

class event(object):
    def __init__(self, title, time, location):
        self.title = title
        self.time = time
        self.location = location
    
    def day(self):
        try:
            day = re.findall('\w+', self.time)[:3]
            day = ' '.join(day)
            try: 
                return datetime.datetime.strptime(day, "%d %b %Y")
            except ValueError:
                return datetime.datetime.strptime(day, "%d %B %Y")
        except ValueError:
            return self.time
    
    def status(self):
        if isinstance(self.day(), datetime.datetime):
            now = datetime.datetime.now()
            if now < self.day():
                return 'Upcoming'
            elif now - self.day() < datetime.timedelta(days=1):
                return 'Today'
            else:
                return 'Missed'
        else:
            return 'Unknown'
        
    def __str__(self):     
        return str_Intl(self.status() + ' Event: %s' %self.title.encode('utf-8'))    # call to str_Intl is part of bug fix

# this function was created instead of modifying __str__ because, in testing, this error cropped up
# both when using a print() statement by itself and in an event.__str__ call

def str_Intl(strng):   
    try:
        strng2 = strng.encode('utf-8')
        rtnVal = str(strng2)
        
    except UnicodeEncodeError as uee:
        print("Warning!")
        print("%s: %s" %(type(uee), uee))
        chrStartIndx = len("'ascii' codec can't encode character ")
        chrEndIndx = str(uee).find(" in position ")
        replStr = str(uee)[chrStartIndx:chrEndIndx] 
        startIndx = (chrEndIndx+1) + len("in position ")
        endIndx = str(uee).find(": ordinal")
        oIndx = int(str(uee)[startIndx:endIndx])
        print("Character %d cannot be processed by print() or str() and will be replaced." %(oIndx))
        print("---------------------")
        rtnVal = (strng[0:oIndx] + ("\"%s\"" %replStr) + strng[(oIndx+1):])
        rtnVal = str_Intl(rtnVal)    # recursive function call
        
    except UnicodeDecodeError as ude:
        # early testing with this line from stack overflow did not work for us:
        # strng.encode('utf-8').strip()
        # this solution also strips off the problem characters without outputting what they were
        
        print("Warning!")
        print("%s: %s" %(type(ude), ude))
        print("Where possible, characters are replaced with their closest ascii equivelence.")
        # earlier use of .encode() fixed one issue and bypassed the UnicodeEncodeError handling
        # it then triggered this error for one of the other cases, so now we are trying other solutions:
        
        strng_u = unicode(strng, "utf-8")
        rtnVal = unicodedata.normalize('NFKD', strng_u).encode('ascii', 'ignore')
                 # without strng_u, normalize() threw an error that its 2nd argument must be unicode, not str
                 # the strng_u line was added as a fix for that
                
        rtnVal = str_Intl(rtnVal)
        
    except Exception as ee:
        # when calling this code in a loop, you lose one value and get this error message output instead
        # but the loop can continue over the rest of your data
        rtnVal = "String data could not be processed. Error: %s : %s" %(type(ee), ee)
    return rtnVal    
    
text = requests.get('https://www.python.org/events/python-user-group/').text
timePattern = '<time datetime="[\w:+-]+">(.+)<span class="say-no-more">([\d ]+)</span>(.*)</time>'
locationPattern = '<span class="event-location">(.*)</span>'
titlePattern = '<h3 class="event-title"><a href=".+">(.*)</a></h3>'

time = re.findall(timePattern, text)
time = [''.join(i) for i in time]
location = re.findall(locationPattern, text)
title = re.findall(titlePattern, text)

events = [event(title[i], time[i], location[i]) for i in range(len(title))]

for i in events:
    print (30*'-')
    print(i)                                          # bug fix: i is in events, so this calls __str__ in the object
    print ('    Time    :  %s' %i.time)
    print (str_Intl('    Location: %s' %i.location))  # when bug happened here, had to add str_Intl as bug fix


------------------------------
Unknown Event: Django Girls Bucaramanga, Colombia
    Time    :  08 April &ndash; 09 April  2017
    Location: Bucaramanga, Colombia
------------------------------
Upcoming Event: Python Meetup Innsbruck: imp.reload(innsbruck)
    Time    :  25 April 2017
    Location:  Innsbruck, Austria, Europe
------------------------------
Warning!
<type 'exceptions.UnicodeDecodeError'>: 'ascii' codec can't decode byte 0xc3 in position 29: ordinal not in range(128)
Where possible, characters are replaced with their closest ascii equivelence.
Unknown Event: Django Girls Sao Jose dos Campos
    Time    :  20 May &ndash; 21 May  2017
    Location: São José dos Campos, Brazil
------------------------------
Unknown Event: Django Girls Accra
    Time    :  16 June &ndash; 18 June  2017
    Location: Accra, Ghana
------------------------------
Missed Event: Python Porto Meetup
    Time    :  24 March 2017
    Location: Porto, Portugal
------------------------------
Unknown Event: PyDelhiConf 2017
    Time    :  18 March &ndash; 20 March  2017
    Location: NCR, Noida, India

In [28]:
# testing above as a script (still in Python 2.7)
!python script/event_v4.py


------------------------------
Unknown Event: Django Girls Bucaramanga, Colombia
    Time    :  08 April &ndash; 09 April  2017
    Location: Bucaramanga, Colombia
------------------------------
Upcoming Event: Python Meetup Innsbruck: imp.reload(innsbruck)
    Time    :  25 April 2017
    Location:  Innsbruck, Austria, Europe
------------------------------
Warning!
<type 'exceptions.UnicodeDecodeError'>: 'ascii' codec can't decode byte 0xc3 in position 29: ordinal not in range(128)
Where possible, characters are replaced with their closest ascii equivelence.
Unknown Event: Django Girls Sao Jose dos Campos
    Time    :  20 May &ndash; 21 May  2017
    Location: São José dos Campos, Brazil
------------------------------
Unknown Event: Django Girls Accra
    Time    :  16 June &ndash; 18 June  2017
    Location: Accra, Ghana
------------------------------
Missed Event: Python Porto Meetup
    Time    :  24 March 2017
    Location: Porto, Portugal
------------------------------
Unknown Event: PyDelhiConf 2017
    Time    :  18 March &ndash; 20 March  2017
    Location: NCR, Noida, India

In [29]:
# testing above as a script in Python 3.6
!C:/ProgramFilesCoders/Anaconda2/envs/PY36/python script/event_v4.py


------------------------------
b"Unknown Event: b'Django Girls Bucaramanga, Colombia'"
    Time    :  08 April &ndash; 09 April  2017
b'    Location: Bucaramanga, Colombia'
------------------------------
b"Upcoming Event: b'Python Meetup Innsbruck: imp.reload(innsbruck)'"
    Time    :  25 April 2017
b'    Location:  Innsbruck, Austria, Europe'
------------------------------
b"Unknown Event: b'Django Girls S\\xc3\\xa3o Jos\\xc3\\xa9 dos Campos'"
    Time    :  20 May &ndash; 21 May  2017
b'    Location: S\xc3\xa3o Jos\xc3\xa9 dos Campos, Brazil'
------------------------------
b"Unknown Event: b'Django Girls Accra'"
    Time    :  16 June &ndash; 18 June  2017
b'    Location: Accra, Ghana'
------------------------------
b"Missed Event: b'Python Porto Meetup'"
    Time    :  24 March 2017
b'    Location: Porto, Portugal'
------------------------------
b"Unknown Event: b'PyDelhiConf 2017'"
    Time    :  18 March &ndash; 20 March  2017
b'    Location: NCR, Noida, India'

Final Solution for Python 2.7 And Python 3.6

When run under Python 3.6, the above code produces output that looks like this:


In [ ]:
# sample output:
''' ------------------------------
b"Unknown Event: b'PyDelhiConf 2017'"
    Time    :  18 March &ndash; 20 March  2017
b'    Location: NCR, Noida, India'
'''
print("-------------------------------")

To migrate the code to Python 3.6, two minor tweaks could be made to disable the Python 2.7 error handling and workarounds. These changes are easy enough to make, but just for the experience, a solution is explored here that runs on both versions without requiring any code changes.
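
The `b'...'` wrappers in the Python 3.6 output come from calling `str()` on a bytes object, which in Python 3 returns its printable representation instead of decoding it. One common way to write code that behaves the same on both versions is to convert everything to text at the boundary and never pass bytes to `str()`. This is a minimal sketch of that idea, not the approach taken in the cell below; the helper name `to_text` is hypothetical.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def to_text(value, encoding="utf-8"):
    """Return value as text, decoding byte strings and leaving text unchanged."""
    # On Python 2.7, bytes is an alias for str, so plain str values are decoded too.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


encoded = u"PyDelhiConf 2017".encode("utf-8")

# On Python 3, str() of a bytes object includes the b prefix and quotes.
print(repr(str(encoded)))

# Decoding explicitly gives the same text on both versions.
print(to_text(encoded) == u"PyDelhiConf 2017")
```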


In [30]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-  

# This code produces a working solution.  There may be more efficient ways to do this, but this works.
# Created for Python 2.7, then modified for cross-compatibility with Python 3.6

import requests
import datetime
import re
import unicodedata  # used in str_Intl to replace accented characters with ASCII equivalents

class event(object):
    def __init__(self, title, time, location):
        self.title = title
        self.time = time
        self.location = location
    
    def day(self):
        try:
            day = re.findall('\w+', self.time)[:3]
            day = ' '.join(day)
            try: 
                return datetime.datetime.strptime(day, "%d %b %Y")
            except ValueError:
                return datetime.datetime.strptime(day, "%d %B %Y")
        except ValueError:
            return self.time
    
    def status(self):
        if isinstance(self.day(), datetime.datetime):
            now = datetime.datetime.now()
            if now < self.day():
                return 'Upcoming'
            elif now - self.day() < datetime.timedelta(days=1):
                return 'Today'
            else:
                return 'Missed'
        else:
            return 'Unknown'
        
    def __str__(self): 
        try:
            rtnVal = str(self.status()  + ' Event: %s' %self.title)
        except Exception as ee:
            rtnVal = str_Intl(self.status() + ' Event: %s' %self.title.encode('utf-8'))
        return rtnVal
    
# this function was created instead of modifying __str__ because, in testing, this error cropped up
# both when using a print() statement by itself and in an event.__str__ call

def str_Intl(strng):   
    try:
        strng2 = strng.encode('utf-8')
        rtnVal = str(strng2)
        
    except UnicodeEncodeError as uee:
        print("Warning!")
        print("%s: %s" %(type(uee), uee))
        chrStartIndx = len("'ascii' codec can't encode character ")
        chrEndIndx = str(uee).find(" in position ")
        replStr = str(uee)[chrStartIndx:chrEndIndx] 
        startIndx = (chrEndIndx+1) + len("in position ")
        endIndx = str(uee).find(": ordinal")
        oIndx = int(str(uee)[startIndx:endIndx])
        print("Character %d cannot be processed by print() or str() and will be replaced." %(oIndx))
        print("---------------------")
        rtnVal = (strng[0:oIndx] + ("\"%s\"" %replStr) + strng[(oIndx+1):])
        rtnVal = str_Intl(rtnVal)    # recursive function call
        
    except UnicodeDecodeError as ude:
        # early testing with this line from stack overflow did not work for us:
        # strng.encode('utf-8').strip()
        # this solution also strips off the problem characters without outputting what they were
        
        print("Warning!")
        print("%s: %s" %(type(ude), ude))
        print("Where possible, characters are replaced with their closest ascii equivelence.")
        # earlier use of .encode() fixed one issue and bypassed the UnicodeEncodeError handling
        # it then triggered this error for one of the other cases, so now we are trying other solutions:
        
        strng_u = unicode(strng, "utf-8")
        rtnVal = unicodedata.normalize('NFKD', strng_u).encode('ascii', 'ignore')
                 # without strng_u, normalize() threw an error that its 2nd argument must be unicode, not str
                 # the strng_u line was added as a fix for that
                
        rtnVal = str_Intl(rtnVal)
        
    except Exception as ee:
        # when calling this code in a loop, you lose one value and get this error message output instead
        # but the loop can continue over the rest of your data
        rtnVal = "String data could not be processed. Error: %s : %s" %(type(ee), ee)
    return rtnVal    
    
text = requests.get('https://www.python.org/events/python-user-group/').text
timePattern = '<time datetime="[\w:+-]+">(.+)<span class="say-no-more">([\d ]+)</span>(.*)</time>'
locationPattern = '<span class="event-location">(.*)</span>'
titlePattern = '<h3 class="event-title"><a href=".+">(.*)</a></h3>'

time = re.findall(timePattern, text)
time = [''.join(i) for i in time]
location = re.findall(locationPattern, text)
title = re.findall(titlePattern, text)

events = [event(title[i], time[i], location[i]) for i in range(len(title))]

for i in events:
    print (30*'-')
    print(i)                                          # bug fix: i is in events, so this calls __str__ in the object
    print ('    Time    :  %s' %i.time)
    try:
        print ('    Location: %s' %i.location)
    except Exception as ee:
        print (str_Intl('    Location: %s' %i.location))  # bug fix:  error thrown here too
                                                          # str_Intl() will parse out type of error in its try block


------------------------------
Unknown Event: Django Girls Bucaramanga, Colombia
    Time    :  08 April &ndash; 09 April  2017
    Location: Bucaramanga, Colombia
------------------------------
Upcoming Event: Python Meetup Innsbruck: imp.reload(innsbruck)
    Time    :  25 April 2017
    Location:  Innsbruck, Austria, Europe
------------------------------
Warning!
<type 'exceptions.UnicodeDecodeError'>: 'ascii' codec can't decode byte 0xc3 in position 29: ordinal not in range(128)
Where possible, characters are replaced with their closest ascii equivelence.
Unknown Event: Django Girls Sao Jose dos Campos
    Time    :  20 May &ndash; 21 May  2017
    Location: São José dos Campos, Brazil
------------------------------
Unknown Event: Django Girls Accra
    Time    :  16 June &ndash; 18 June  2017
    Location: Accra, Ghana
------------------------------
Missed Event: Python Porto Meetup
    Time    :  24 March 2017
    Location: Porto, Portugal
------------------------------
Unknown Event: PyDelhiConf 2017
    Time    :  18 March &ndash; 20 March  2017
    Location: NCR, Noida, India

In [31]:
# Test in Python 3.6
!C:/ProgramFilesCoders/Anaconda2/envs/PY36/python script/event.py


------------------------------
Unknown Event: Django Girls Bucaramanga, Colombia
    Time    :  08 April &ndash; 09 April  2017
    Location: Bucaramanga, Colombia
------------------------------
Upcoming Event: Python Meetup Innsbruck: imp.reload(innsbruck)
    Time    :  25 April 2017
    Location:  Innsbruck, Austria, Europe
------------------------------
Unknown Event: Django Girls São José dos Campos
    Time    :  20 May &ndash; 21 May  2017
    Location: São José dos Campos, Brazil
------------------------------
Unknown Event: Django Girls Accra
    Time    :  16 June &ndash; 18 June  2017
    Location: Accra, Ghana
------------------------------
Missed Event: Python Porto Meetup
    Time    :  24 March 2017
    Location: Porto, Portugal
------------------------------
Unknown Event: PyDelhiConf 2017
    Time    :  18 March &ndash; 20 March  2017
    Location: NCR, Noida, India

In [32]:
# for completeness .. the code is also re-tested as a script under Python 2.7
!python script/event.py


------------------------------
Unknown Event: Django Girls Bucaramanga, Colombia
    Time    :  08 April &ndash; 09 April  2017
    Location: Bucaramanga, Colombia
------------------------------
Upcoming Event: Python Meetup Innsbruck: imp.reload(innsbruck)
    Time    :  25 April 2017
    Location:  Innsbruck, Austria, Europe
------------------------------
Warning!
<type 'exceptions.UnicodeDecodeError'>: 'ascii' codec can't decode byte 0xc3 in position 29: ordinal not in range(128)
Where possible, characters are replaced with their closest ascii equivelence.
Unknown Event: Django Girls Sao Jose dos Campos
    Time    :  20 May &ndash; 21 May  2017
    Location: São José dos Campos, Brazil
------------------------------
Unknown Event: Django Girls Accra
    Time    :  16 June &ndash; 18 June  2017
    Location: Accra, Ghana
------------------------------
Missed Event: Python Porto Meetup
    Time    :  24 March 2017
    Location: Porto, Portugal
------------------------------
Unknown Event: PyDelhiConf 2017
    Time    :  18 March &ndash; 20 March  2017
    Location: NCR, Noida, India

Another Experiment

After devising most of the above code, I stumbled upon a post suggesting that a few lines of code might solve the problem, so I had to try it. As it turns out, this solution did not work for this code and content. The experiment is preserved here for research and learning purposes.
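
The traceback recorded at the end of this experiment originates in the cp437 codec, which suggests the failure happens when `print()` encodes text for the console, not during an implicit conversion. Changing the default encoding would not be expected to affect that step. The sketch below checks which of the two accented characters from the event data can be represented in cp437 at all. The helper name `encodable` is hypothetical.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def encodable(char, encoding):
    """Report whether a character can be represented in the given encoding."""
    try:
        char.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


for ch in (u"\xe3", u"\xe9"):
    print(repr(ch),
          "cp437:", encodable(ch, "cp437"),
          "utf-8:", encodable(ch, "utf-8"))
```

The character u'\xe3' has no slot in cp437 while u'\xe9' does, which matches the traceback naming only u'\xe3' as the character that could not be encoded.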


In [33]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-  

# code reverted to earliest version to test this simple approach.  
# if it worked, a lot of processing could be avoided, but as shown in the output, this failed too
# above solution seems to work the best for now

import requests
import datetime
import re
import sys

# proposed bug-fix from stack overflow
reload(sys)                          # note:  reload threw an error that "reload() was not defined" on Python 3.6
                                     #        if this solution ever proves useful this may need to be investigated
                                     #        the command worked in Python 2.7
sys.setdefaultencoding('utf8')

class event(object):
    def __init__(self, title, time, location):
        self.title = title
        self.time = time
        self.location = location
    
    def day(self):
        try:
            day = re.findall('\w+', self.time)[:3]
            day = ' '.join(day)
            try: 
                return datetime.datetime.strptime(day, "%d %b %Y")
            except ValueError:
                return datetime.datetime.strptime(day, "%d %B %Y")
        except ValueError:
            return self.time
    
    def status(self):
        if isinstance(self.day(), datetime.datetime):
            now = datetime.datetime.now()
            if now < self.day():
                return 'Upcoming'
            elif now - self.day() < datetime.timedelta(days=1):
                return 'Today'
            else:
                return 'Missed'
        else:
            return 'Unknown'

    def __str__(self):
        return self.status() + ' Event: %s' %self.title

text = requests.get('https://www.python.org/events/python-user-group/').text
timePattern = '<time datetime="[\w:+-]+">(.+)<span class="say-no-more">([\d ]+)</span>(.*)</time>'
locationPattern = '<span class="event-location">(.*)</span>'
titlePattern = '<h3 class="event-title"><a href=".+">(.*)</a></h3>'

time = re.findall(timePattern, text)
time = [''.join(i) for i in time]
location = re.findall(locationPattern, text)
title = re.findall(titlePattern, text)

events = [event(title[i], time[i], location[i]) for i in range(len(title))]

for i in events:
    print (30*'-')
    print(i)
    print ('    Time:  %s' %i.time)
    print ('    Location: %s' %i.location)


---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-33-b18cccb3bf1a> in <module>()
     65     print(i)
     66     print ('    Time:  %s' %i.time)
---> 67     print ('    Location: %s' %i.location)

C:\ProgramFilesCoders\Anaconda2\lib\encodings\cp437.pyc in encode(self, input, errors)
     10 
     11     def encode(self,input,errors='strict'):
---> 12         return codecs.charmap_encode(input,errors,encoding_map)
     13 
     14     def decode(self,input,errors='strict'):

UnicodeEncodeError: 'charmap' codec can't encode character u'\xe3' in position 15: character maps to <undefined>

In [ ]: