We need to sort data that is partially numeric and partially alphabetic, in this case the line numbers 1, 4008, 4008a, 4009, and 9. We can’t sort them numerically because the 'a' isn’t numeric. And we can’t sort them alphabetically because the numbers that begin with '4' (4008, 4008a, 4009) would all sort before '9'. We resolve the problem by writing a custom sort function that separates the values into leading numeric and optional trailing alphabetic parts. We then sort numerically by the numeric part, and break ties by subsorting alphabetically on the alphabetic part.
We’ll use a regular expression to parse our line number into two parts, so we load the regex library:
In [30]:
import re
We initialize a lines list of strings and demonstrate how the default alphabetic sort gives the wrong results:
In [31]:
lines = ['4008','4008a','4009','1','9']
sorted(lines)
Out[31]:
['1', '4008', '4008a', '4009', '9']
In Python 3, the key parameter specifies a function that is applied to each list item to produce the value used for comparison; the items themselves are returned unchanged. If we use int to convert each of the string values to an integer so that we can perform a numerical sort, Python raises a ValueError because '4008a' can’t be converted to an integer:
In [34]:
sorted(lines,key=int) # this raises an error
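To inspect the failure without interrupting the notebook, the call can be wrapped in a try/except block. This is a minimal sketch that assumes the same lines list as above; the error_message variable is introduced here only for illustration.

```python
lines = ['4008', '4008a', '4009', '1', '9']

try:
    sorted(lines, key=int)
    error_message = None
except ValueError as err:
    # int() rejects '4008a' because of the trailing 'a'
    error_message = str(err)

print(error_message)
```

The message names the offending value, which confirms that the alphabetic suffix is what blocks a purely numeric sort.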
We create our own sort function, for which we define linenoRegex, which includes two capture groups, both of which return strings. The first captures all digits from the beginning of the line number value. The second captures anything after the digits, and is empty if nothing follows. The regex splits the input into a tuple that contains the two values as strings, and we convert the first value to an integer before we return it. For example, the input value '4008a' will return (4008,'a'), where 4008 is an integer and 'a' is a string.
In [35]:
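linenoRegex = re.compile(r'(\d+)(.*)')  # raw string keeps \d intact

def splitId(id):
    """Splits @id value like 4008a into parts, for sorting"""
    # group 1: leading digits; group 2: whatever follows (possibly empty)
    results = linenoRegex.match(id).groups()
    return (int(results[0]), results[1])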
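As a quick check, we can apply the function to each of our sample values and look at the resulting keys. This sketch repeats the definitions so that it runs on its own; the keys variable is an illustrative name, not part of the original notebook.

```python
import re

linenoRegex = re.compile(r'(\d+)(.*)')

def splitId(id):
    """Splits @id value like 4008a into parts, for sorting"""
    results = linenoRegex.match(id).groups()
    return (int(results[0]), results[1])

# Build the sort key for every sample line number
keys = [splitId(value) for value in ['4008', '4008a', '4009', '1', '9']]
print(keys)
# [(4008, ''), (4008, 'a'), (4009, ''), (1, ''), (9, '')]
```

Note that this version assumes every value begins with at least one digit. For an input with no leading digits, match() would return None and the call to groups() would fail.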
If we now specify our splitId function as the value of the key parameter in the sorted() function, each value will be split into two parts before it is compared. Tuples are compared element by element from left to right, so we don’t have to tell the function explicitly how to sort once we’ve defined the two parts of our tuple:
In [33]:
sorted(lines,key=splitId)
Out[33]:
['1', '9', '4008', '4008a', '4009']
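The tuple comparison that makes this work can be seen in isolation. The following sketch uses hand-written tuples of the kind splitId produces; the variable names are illustrative only.

```python
# Tuples compare element by element, left to right.
# 9 < 4008, so the second element is never consulted.
numeric_part_decides = (9, '') < (4008, '')

# The first elements tie, so the strings decide: '' sorts before 'a'.
suffix_breaks_tie = (4008, '') < (4008, 'a')

keys = [(4008, ''), (4008, 'a'), (4009, ''), (1, ''), (9, '')]
sorted_keys = sorted(keys)

print(numeric_part_decides, suffix_breaks_tie)
# True True
print(sorted_keys)
# [(1, ''), (9, ''), (4008, ''), (4008, 'a'), (4009, '')]
```

Because the integer comes first in each tuple, the numeric part always dominates, and the alphabetic suffix matters only when two numbers are equal.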