Hash Sort

     The hash sort is a linear time complexity sorting algorithm that uses a hash to sort integer data in linear time.  This web page summarizes the algorithm and provides access to reference material and performance data. 

Concept

     The concept of hashing is to map data to locations in an array. The hash does not preserve ordering among the data elements, it simply associates a key with a value. 

However, by the nature of an integer, it is possible to construct a hash function for a specific range of data,  and map the data into a square matrix with the minimum at position (0,0) and the maximum at some upper limit (q,q) where the matrix is q * q.

Essentially the hash maps a subset of the range of all integers into a matrix, using the form of an integer.

I originally thought of the hash sort as an undergrad (circa 1995) and finally presented the monograph at a conference as a grad student.

Theory

     The hash sort works by taking a range of integral data from a minimum value to a maximum.

1. Normalize the data range from min...max to 0...max-min.

2. Compute a constant q for a hash function which maps an integral value i to a pair ((i div q), (i mod q)) which is the position in the square matrix size q. The constant q is the square-root of max-min rounded up to the next integer.

For example min = 5, max = 37. Normalize the range which is from 0...32. Compute the constant q for the hash function:

ROUND(SQRT(32)) = ROUND(5.67) = 6

or

SQRT(32) < SQRT(36) = 6.

3. Process the data by mapping each value into the matrix, counting repeated integers.

The data is sorted, starting from (0,0)...(q-1, q-1) in the matrix.

For the given data set { 5, ... 37 }

5 maps to (0,0)

6 maps to (0,1)

...

36 maps to (6,0)

37 maps to (6,1)

The matrix M[6,6] only has 32-elements for a space of 36-elements.

Practice

     There are three possible cases for the hash sort, involving the data range m and the number of data elements n. The three cases are:

  • m < n - more data elements in the range than the size of the range; duplicate integer values are counted.
  • m = n - each data element hashes to a position within the data range.
  • m > n - the data range exceeds the number of data elements (as demonstrated in the previous example). This is the problem of sparsity, the range far exceeds the data. The worst case is n = 2, and m is a large range.

Monograph

     The original monograph contains an error in an important criterion to consider when selecting the hash sort. The criterion is that for efficient use of the hash sort the data range and the number of data elements should at least be equal for a one-to-one mapping with no residual space in the matrix.

It seems that the third case of the data range exceeding the size of the data set is the one critics focus on (such as only 2-integer elements). My point is that hash sort is not ideal for all data sets. 

If a software professional uses the hash sort blindly or unknowingly without knowing the particulars of the data, application, or library then that professional is very intellectually lazy and sloven in software development.

Source Code 

     The original algorithm was written in Pascal (and compiled using several Pascal compilers -- Sun Pascal, Turbo Pascal, GNU Pascal, and TMT Pascal.

Later I rewrote and ported the code to C++ to try performance on different platforms with different C++ compilers. The source code is available for download for whatever reason under the GNU General Public License.

I have successfully compiled the hash sort C++ source code using Visual C++, Borland C++, g++, Watcom C++, aCC (the HP C++ compiler).