I was graphing out letter frequency in some large academic documents. As part of this process, is was sorting letters from large clippings of those documents into alphabetical order. I was using Python's
built in sorted function, and I started to wonder if I could make it faster. I then wrote the following function:
def count_sort(l):items = {'a':0,'b':0,'c':0,'d':0,'e':0,'f':0,'g':0,'h':0,'i':0,'j':0,'k':0,'l':0,'m':0,'n':0,'o':0,'p':0,'q':0,'r':0,'s':0,'t':0,'u':0,'v':0,'w':0,'x':0,'y':0,'z':0}for item in l:items[item] += 1sort_l = []for key in items:sort_l += key*items[key]return sort_l
When testing this code versus sorted
on a 10000
letter long string of text, it was almost 20X
faster.
With such a performance boost, why isn't this sorting algorithm in the standard libs
?
Best Answer
You have rediscovered the counting sort algorithm.
To quote Wikipedia:
For problem instances in which the maximum key value is significantlysmaller than the number of items, counting sort can be highlyspace-efficient, as the only storage it uses other than its input andoutput arrays is the Count array which uses space O(k).
The counting sort algorithm becomes more and more efficient(relatively) the greater the difference is between the total number of items that being sorted, and the number of unique items being sorted.
You can see why this must be looking at your own code, or the Wikipedia example code:
# calculate the histogram of key frequencies:for x in input:count[key(x)] += 1# calculate the starting index for each key:total = 0for i in range(k): # i = 0, 1, ... k-1oldCount = count[i]count[i] = totaltotal += oldCount# copy to output array, preserving order of inputs with equal keys:for x in input:output[count[key(x)]] = xcount[key(x)] += 1return output
You have 2 for loops in your function: the first to iterate over the letters you are sorting, and the second to iterate over the items dictionary. As I posited previously, this makes sense of the items dictionary is considerably smaller than the list you are sorting, but it quickly becomes very inefficient if the number of unique elements increases relative to the number of items being sorted.
Like @BrenBarn answered, this only when you know exactly what characters to expect and you're willing to ignore any other characters. While it seems like counting sort is highly efficient in the example you gave, the problem of sorting letters is hardly the most common sorting problem.
Below I have fixed your function to print the letters by iterating through a list rather than iterating through the keys in a dictionary(as Python's dictionaries are not ordered)
def count_sort(l):letters = [chr(i) for i in range(97, 122)]items = dict()for letter in letters:items[letter] = 0for item in l:items[item] += 1sort_l = list()for letter in letters:sort_l.extend(letter*items[letter])return sort_l
As mentioned in the comments and answers above you can have rediscovered the counting sort but you haven't discovered the python collections library:
from collections import Counterdef count_sort(l):items = Counter()for item in l:items[item] += 1sort_l = []for key in items.keys().sorted():sort_l += key*items[key]return sort_l
The major difference is that you will not get any entries for missing entries, you may also wish to change:
sort_l += key*items[key]
to:
sort_l.append((key, items[key]))
to return a sorted list of keys and counts. Another good trick would be to return a collections.OrderedDict
object.
Two reasons.
One, your algorithm relies on knowing ahead of time what elements will exist in the list to be sorted. If the list contained an uppercase character or a digit or anything else, your code would fail with an error.
The other reason is that your code relies on the order of the dictionary remaining the same. Dictionaries are not ordered, so there is no guarantee that for key in items
will iterate over the keys in alphabetical order. Your "sorted" list might not wind up sorted at all.
As stated in other answer, it is known as Counting Sort...
Other than the reasons / limitations they have mentioned,
Counting sort somehow use the discrete items as array index and count them
So if the item's domain is something like double / floating point, thenthe implementation may be hard, at least maybe you have to use Map() to map the number into valid array index, and Map() itself is O(lg N), turns out the complexity is still O(N lg N)...
If you're interested in the counting sort and how it works with other sorting algorithms, you should check this analysis of the Counting sort and Radix sort and the instances in which each of them are useful:
Many times counting sort can be useful as a sub-routine of a larger sorting (most notably Radix sort)algorithm in instances where it would be impractical by itself.