T.N.Olsthoorn, Feb2017
Sorting (and) filtering are important operations when working with data and databases. This tutorial demonstrates some of the concepts.
It's necessary to discriminate in-place sorting from sorting whereby a new python object is gnerated. Inplace sorting happens when using an instance's own .sort() method if it exists. Not all python sequences and arrays have a .sort() method, only those that are muttable and for which ordering makes sence. This comes down to only lists and arrays; strings and tuples have no .sort() method because they are immutable and neither have dicts and sets, because for them sorting in place makes no sence, because the concept of ordering meakes no sense for dicts and sets.
For all sequences one may use numpy's sort function, np.sort(). It takes lists, tuples and arrays and always generates a new numpy array.
Then there is the sorted() builtin function. Sorted() always generates a list, no matter what input it is given. It takes, lists, tuples, strings and arrays and always geneates a list in which the items are odered. The sorted function also takes dicts and sets and will generate a sorted list of their keys or elements in case of a set.
Further down we'll see how the key keyword is used to sort sequences that have elements that cannot be ordered without telling how.
We'll show how argsort() works, which generates the indices that will sort a sequence.
Finally we'll experiment with hierarchical sorting also using sort.
In [1]:
import numpy as np
from numpy import random
from pprint import pprint
Let's first explore how one sees whether a printed sequence is a tuple, a list or an array.
tuples are enclosed in ( ), lists in [ ] and arays also in [ ], however, the elements in tuuples and arrays are separted by commas (showing that they are distice), while the elements in an array have no separating commpas: they form one block, the array.
In [2]:
myArray = np.random.randint(10, size=15)
myTuple = tuple(myArray)
myList = list(myArray)
mySet = set(myArray)
myDict = {c:n for c, n in zip('aebghcdf', range(10))}
print(type(myTuple))
print(myTuple, " <----- tuple, parentheses, with commas\n")
print(type(myList))
print(myList, " <----- list, brackets, with commas\n")
print(type(myArray))
print(myArray, " <----- array, brackets, without commas!\n")
print(type(mySet))
print(mySet, " <----- set, curly braces, with commas\n")
print(myDict, " <----- dict, curly braces, with commas and colons")
print("\nNotice that the order in which the dict itmes are printed")
print("in different from that used to specify them ")
print("Dicts and sets have no order.")
In [3]:
myArr = np.random.randint(7, size=15)
myList = list(myArr)
print(myList)
print(myArr)
a = myList.sort() # in-place sorting of a list, no output
b = myArr.sort() # same for array
print("a={}, b={}". format(a, b))
print(myList)
print(myArr)
It's clear the .sort() method produces not output (both a and b obtained None). So assignment makes not sense for in-place sorting.
Of course, the .sort() method does not change the type of the sequence that is sorted.
The other sequences have not .sort method:
In [4]:
sequences = {'myArr': myArr, 'myList': myList, 'myTuple': myTuple, 'myDict': myDict,
'mySet': mySet, 'myString': 'myString'}
for k in sequences:
if 'sort' in dir(sequences[k]):
print("sequence {:10} has method .sort()".format(k))
else:
print("equences {:10} does not have method .sort()".format(k))
In [159]:
myArray = np.random.randint(10, size=15)
myTuple = tuple(myArray)
myList = list(myArray)
mySet = set(myArray)
myDict = {c:n for c, n in zip('aebghcdf', range(10))}
In [169]:
print(myArray, " <--- a numpy array was given")
print(np.sort(myArray), " <--- a sorted numpy array was returned")
print(myTuple, " <--- a tuple was given")
print(np.sort(myTuple), " <--- a sorted numpy array was returned")
print(myList, " <--- an list was given")
print(np.sort(myList), " <--- a sorted numpy array was returned")
#print(myDict)
#print(np.sort(myDict)) # can't be sorted with np.sort()
#print(mySet)
#print(np.sort(mySet)) # can't be sorted with np.sort()
When multiple arrays are passed in, they are sorted along their last axis unless the keyword axis
is used.
In [171]:
myArr= np.random.randint(13, size=35).reshape((7, 5))
print(myArr)
In [173]:
np.sort(myArr) # each row is sorted individually along the last axis (axix=1)
Out[173]:
In [174]:
np.sort(myArr, axis=0) # each column is sorted invididually along the firs axis (axis=0)
Out[174]:
In [178]:
myList = [list(L) for L in myList]
print("a list of lists:")
pprint(myList)
Sorting with np.sort a list of lists generates an array in which each row is sorted (because default sorting is along the last axis.
In [180]:
np.sort(myList)
Out[180]:
In [ ]:
The other way around (sorting along axis=0) also works on a list of lists
In [181]:
np.sort(myList, axis=0)
Out[181]:
Sorting tuples along the first or second axis also works. But notice that when using numpy.sort() one always obtains a numpy array as result.
In [198]:
myTuple = [tuple(T) for T in myList]
print("a tuple of tuples:")
pprint(myTuple)
print()
print(np.sort(myTuple))
print()
print(np.sort(myTuple, axis=1))
The list of strings can only be sorted along axis=0, so that the one that is alphabetically first comes floating on top.
Also here one obtains a numpy array of which the values are of the numpystr class. The letters themselves are of
In [210]:
myStr = ["the quick brown fox",
"jumps over the lazy",
"dog in the garden !"]
print()
print(myStr)
print()
strsort = np.sort(myStr)
print(strsort)
print()
print(type(strsort), " <--- type of strsort itself")
print()
print(strsort.dtype, " <--- type of the items in the array strsort, unicode characters")
print("In fact, a little endian (<) bystring of 19 unicode characters (U19).")
print()
print(type(strsort[1]), " <--- type of second line of strsort")
print()
print(type(strsort[1][3]), " <--- type of a character in a string")
In [ ]:
Python's built-in function sorted is a geneal workhorst, which takes any sequence and even dicts and sets for sorting. It always generates a list, whatever its input is, and never sorts in place. It is more general than the other two, np.sort() and the .sort() method of lists and arrays.
In [ ]:
myArray = np.random.randint(10, size=15)
myTuple = tuple(myArray)
myList = list(myArray)
mySet = set(myArray)
myDict = {c:n for c, n in zip('aebghcdf', range(10))}
In [213]:
print(sorted(myArray), " <--- a List, not an array.")
print(sorted(myTuple))
print(sorted(myList))
print(sorted(mySet), " <--- a list of the sorted set elements")
print(sorted(myDict), " <--- a list of the sorted keys of the dict")
Let's see how sorted works on a list of lists. First generate one:
In [226]:
myArr = np.random.randint(3, size=60).reshape((20, 3))
myList = list([tuple(L) for L in myArr])
print("a list of lists:")
pprint(myList)
Then sort:
In [224]:
sorted(myList)
Out[224]:
We got the list of lists with the lists sorted, first column first second column second and so on. This is a hierarchial sort.
You can't always sort items in a list. For instance, by what criterion would you sort animals? Their, age, their size, their weight, the affection, their price ?? You name it.
sorted(..) has the keyword key that you can use to make them compariable. The keyword accepts a function that computes a value for the item (age, weight, size, whatever), based on which they are mutually comparable.
To show this, create a dict with arbitrary keys of different types, e.g. consisting of a mix of tuples, strings and numbers). Such keys are ok for a dict because they are all immutable.
In [235]:
myList = [(3, 2, 2, 'beast'), (0, 'a', 'abcd'), ('dog', (4, 'abc')), ['dog', 5], 'bike']
print()
print(myList)
Next try to sort this dict, like with did before with the previous dict that had simple characters as keys:
In [234]:
sorted(myList) # doesn't work, because, python doen't know how to sort such diverse keys
This these keys, sorting doesn't work because python know of no rule to sort them.
So let's inspect the signature of the sorted(..) function.
Type this in a new cell and press
sorted?
This is what you get in a separate new window at the bottom of your screen:
Signature:
sorted(iterable, key=None, reverse=False)
Docstring: Return a new list containing all items from the iterable in ascending order.A custom key function can be supplied to customise the sort order, and the reverse flag can be set to request the result in descending order. Type: builtin_function_or_method
Obviously, revere=True
sorts in descending manner
But key
is special and our next subject.
We could still sort myList above, for instance according to the number of items that each item consists of. The function for key would then simple return that number for each item in the list, so that sorted can use that number to compare them.
In [236]:
def longest(k):
return len(k)
sorted(myList, key=longest)
Out[236]:
This works. And this key method is a general way to compare items in a list.
In practice, having to define a separate function each time one needs it for a key is considered overkill. Instead lambda functions, also called anonymous functions are most often used. They are very practical.
A lambda function consists of the word lambd followed by input parameters a colon and a single expression that yields a single answer.
lambda x, y, z, .... : expression(x, y, z, ...)
For instance to get the length of an item, the equivalent of the function longest()
above would be
lambda x: len(x)
In pratice the sort then becomes:
In [238]:
sorted(myList, key=lambda x: len(x))
Out[238]:
Another example. Here's a list of tuples. They can be sorted, but without the use of a key, sorting would be according to firs item, then second item etc. This is generally not what you want.
In [240]:
myList = [(4, 3, 5), (6, -1, 7), (4, 1, 2), (-2, 3, -4), (3, 5, 0)]
sorted(myList)
Out[240]:
As is seen, these tuples are sorted based on their first value [-2, 3, 4, 6] and on their second value for itmes for which the first are the same and so on.
But what if we want to sort them base on their third value instead?
The function that would allow us to, should just pick out the third item of every tuple in the list to sort upon. This looks like so:
In [243]:
sorted(myList, key=lambda x: x[2])
Out[243]:
The tuples are now sorted according to their third element.
Now let's take a dict of animals each with a dict of their properties
In [82]:
myCircus = {'duke': {'what': 'horse', 'color': 'brown', 'age':7, 'weight': 250},
'spooky': {'what': 'cat', 'color': 'orange', 'age':9, 'weight': 2.3},
'mooh': {'what': 'cow', 'color': 'black-white}', 'age':4, 'weight': 290},
'john': {'what': 'elephant', 'color': 'grey', 'age': 22, 'weight': 1840}}
In [244]:
myCircus['john']
Out[244]:
In [245]:
[myCircus['john']['what'], myCircus['john']['age']]
Out[245]:
Lets sort these animals, but how? We can sort them based on their age or on their weight. Let's do one after the other
In [255]:
def compAge(animal):
return animal[1]['age']
print("\nUnsorted circus, because dict has no order:\n")
pprint([(k, myCircus[k]) for k in myCircus])
print("\nSorted lexicographically according to animal's name:\n")
pprint(sorted([(k, myCircus[k]) for k in myCircus]))
print("\nSorted accoring to animal's age:\n")
pprint(sorted([(k, myCircus[k]) for k in myCircus], key=compAge))
print("\nSorted accoring to animal's weight:\n")
pprint(sorted([(k, myCircus[k]) for k in myCircus], key=lambda animal: animal[1]['weight']))
This is somewhat complex. The comprehension produces a list with one item per key k. Each item of the list is itself a list of two items, namely the key and the dict [k, myCircus[k]] with the properties of the animal under key k. The compAge function takes each item, i.e. the list [k, myCircus[k]] (called animal inside the function), it returns animal[1]]['age']. Note that animal[1] is the second item of the incoming list, which is the dict. Therefore, animal[1]['age'] returns the value under key 'age' of the dict belonging to that animal. This value, the age of the animal, is then used to compare the different animals in the list for sorting. That it works is seen on the hand of the increasing ages of the animals in this list.
It's left up to you to sort the animals based on their weight. And yes, sort them in reversed (decreasing order).
argsort()
is a function in numpy
and a method of numpy arrays
. It does not exist as a Python builtin function like sorted() and it does not exist as a method of list
like list.sort()
. It creates a numpy array with the indices that would sort the original numpy array. It's also called an indirect sort.
In [264]:
I = np.argsort(myArray)
print("myArray : ", myArray, " <-- a numpy array")
print("indices : ",I)
print("myArray[I] : ", myArray[I])
print("sorted(myArray) : ", sorted(myArray), " <-- a List")
In [268]:
myArray = np.random.randint(13, size=40).reshape((5, 8))
print(myArray)
np.argsort() or as instance method myList.argsort() yields the indices along the last axis (axis=1), that would sort each row of the array.
In [276]:
Iarray = myArray.argsort()
print(Iarray)
We may now sort the array according to any of its rows
In [286]:
myArray[:,Iarray[2]] # sorting according to row 2
Out[286]:
In [287]:
np.take(myArray, indices=Iarray[2], axis=1) # same thing using np.take
Out[287]:
Sorting all rows simultaneously:
Use a comprehension to generate a list of arrays that are each sorted using the indices Iarray generated by myArray.argsort() above. Then apply np.array(...)
or np.asarray(...)
over it, to convert into a numpy array
In [277]:
np.asarray([L[I] for L, I in zip(myArray, Iarray)])
Out[277]:
The other way around, applying argsort along the columns of myArray (setting axis=0)
In [283]:
Jarray = myArray.argsort(axis=0)
print(Jarray)
Sorting myArray verticall according to column 3:
In [288]:
myArray[Jarray[:,2], :]
Out[288]:
In [290]:
np.take(myArray, indices=Jarray[:, 2], axis=0) # same thing
Out[290]:
Using a comprehension to sort all columns using the index array and converting the total list to a numpy array using np.array(...)
or np.asarray(...)
Notice the tranposes to make the selection easy and the transpose to have the final array orientation the same as the original:
In [291]:
np.asarray([k[J] for k, J in zip(myArray.T, Jarray.T)]).T
Out[291]:
Sorting according to a set of given fields, i.e.hierarchical sorting:
Let's say we wish to sort myArray first for column 5, then 3 then 7 and then 1.
How to do that?
if we have the indices like generated above usign argsort, we can make tuples from them in the desired order and use those as sortkeys. This works because tuples are immutable and tuples with integers are comparable.
The easiest way is to use the key keyword like it was done before.
In [299]:
myArray = np.random.randint(3, size=60).reshape((20, 3))
print(myArray)
In [300]:
fields = [2, 0, 1]
np.array(sorted(myArray, key=lambda x: tuple(x[fields])))
Out[300]:
In [302]:
myList = list(myArray)
pprint(myList) # a list of arrays
In [303]:
sorted(myArray, key=lambda x: tuple(x[fields])) # also hierchically sorted as before
Out[303]: