Hash Tables – using hash command and available implementations

By | 16/07/2013

Introduction to Hash Tables

Hash table is a data structure to store key value pairs. As in, a table where each entry has a key and a corresponding value. In other words, it is like an array with indices having the flexibility to be of any type i.e. integer, float, char, string, etc. Henceforth, these indices are termed as keys. So, it is a way of laying down the structure such that, a key maps to the data and that is how we retrieve and access all data efficiently. The efficiency comes from the fact, the key is directly used as an index to search/access any data without caring about traversing the entire data-structure.

A generic hash table can be imagined as,

[key i] ----> Data i
[key ii] ----> Data ii
[key iii] ----> Data iii
[key iv] ----> Data iv

The most beautiful thing about hash tables is that searching can be done in constant time. We shall understand it more once we learn how hash tables are created and how do we map.

The next question comes, how is this mapping determined? What determines, what key to be mapped to what data? The paramount computing factor behind the key-data mapping is the hash function.

Hash Function

A hashing function is a way to compute values corresponding to each key in a hash table data-structure. This hash function can implement any logic depending upon the requirement, type of data to be stored, and many other such scenario specific parameters. Basically, it leverages in accessing and searching data from the hash table. It could be any algorithm customised and optimised as per a particular use case. However, for all cases, it takes our key as the input, and the output is the value corresponding to it in the hash table. However, with the same input key to the logic, output should be same every time.

However, hash tables are not always direct mapping, which means more than one key may map to the same data. In such hash tables, data values are stored in buckets corresponding to each key. Typically, whenever two or more keys corresponds to the same data bucket, it is called as a collision. A good hashing function always minimizes collisions, however, since it cannot be avoided, there should be means to handle collisions.

Collisions

A hash table would never guarantee that for a particular key, there would be just one data value corresponding to it. The reason being, a hash table consists of a fixed number of keys. Moreover, if the hash table always consists of the same number of entries as the actual data, then hash table won’t be as efficient as it ought to be. On the other side, any new entry after the size of hash table has to be accommodated without altering the size of the hash table. Hence, of course this would cause collisions. Therefore,an efficient hash table would always try to minimize the collisions and prepared with effectual ways to handle collisions, and at the same time optimize the size of the table. This introduces the trade off.

This surface outs another prominent attribute of a hash tables, that is, uniform distribution. For any hashing, if we spread out the elements in a random and uniform way, it will lead to minimum collisions and hence faster search and retrieval.

Hashes in Linux

Hashes being a very useful data-structure, how can linux miss using it. In fact, Linux offers us a hash command, along with a hash library(mhash), which has implemented certain standard hash algorithms for us. Along with, Linux has also come up with a hash table management functions to aid the developers.

The hash command

The bash shell maintains a hash table for each command which has been run. The reason, why it does so is, making the commands run faster. Whenever, a user runs a command on the shell, it first has to search the command executable as to where is it located. The shell searches for the command binary in the directories listed by the environment variable ‘PATH’.

As an example, when we run command ‘ls’, we just type ‘ls’ without mentioning the path where the ‘ls’ executable is located.

$ ls

Now, before shell even triggers this command, it has to search for the ‘ls’ executable and know its location. Conventionally, the directories listed in the ‘PATH’ variable are looked into.

Here is what ‘PATH’ variable stores in my case,

$ echo $PATH
/home/rupali/.rvm/gems/ruby-1.9.2-p320/bin:/home/rupali/.rvm/gems/ruby-1.9.2-p320@global/bin:/home/rupali/.rvm/rubies/ruby-1.9.2-p320/bin:/home/rupali/.rvm/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/home/rupali/.rvm/bin

The shell checks each and every directory in the above list until it finds the ‘ls’ executable.
In my system, the ‘ls’ executable is found at,

/bin/

This looking up has definitely added certain delays, and more the paths, the ‘PATH’ contains, more the delays it may add. To avoid this, whenever the first time bash shell, finds the location of a command executable, it adds it to a hash table. The next time, same command is run, the path is taken from the hash table rather than searched again making the commands run faster.

The hash command displays this hash table along with the number of times, each command has been run.

$ hash
hits	command
   7	/bin/grep
   1	/usr/bin/which
   1	/usr/bin/touch
   4	/bin/ls
   1	/usr/bin/find

However, it may cause trouble if we change the location of one of the command executable. With the hash entry still in the bash hash table, it will try to run the command from the same location and fail every time. Let us do a small experimental exercise to reproduce the above scenario.
First of all, we shall create our own command executable. On the simpler side, let us just create our ‘print’ command using following code,

#include <stdio.h>

int main()
{
    printf("Hello Linux User\n");
    return 0;
}

Please note, no logic involved as the purpose of creating this command is just to emulate a use case.

Now, compile the code and add the path to the built binary to ‘PATH’ variable and verify running the command.

$ gcc -Wall myprint.c -o myprint
$ export PATH=$PATH:/home/rupali/programs
$ myprint
Hello Linux User

Vital observation is that the executable ‘myprint’ is run as a command (without ./). Hence, it was searched in the list of directories consisting the environment variable ‘PATH’. One can verify running the same command from any other random location.

Further, since we have run our ‘myprint’ command once, it should be added to the bash hash table. Run the ‘hash’ command to see the same,

$ hash
hits	command
   1	/home/rupali/programs/myprint

Ouch! before we move ahead, did you notice? The entire previous list of commands in the hash table has been cleared. That implies, everytime we modify the ‘PATH’ variable, the bash hash table resets. An important point to note.

Coming back, we see now the bash hash table has the ‘myprint’ command location stored, and every time we run this command, it will take the location from this hash table. Now let us move the ‘myprint’ binary to some other path which is also in the list of directories in the ‘PATH’ variable. This new location should be from the list in ‘PATH’ variable because we want to alter the location, as well as make it available and searchable through the ‘PATH’ variable.

$ sudo mv myprint /bin/

Now run the ‘myprint’ command.

$ myprint
bash: /home/rupali/programs/myprint: No such file or directory

It was not able to find the ‘myprint’ command binary, even though it was in one of the directories listed in ‘PATH’ variable. We already know the reason. Isn’t it? Since, an entry for ‘myprint’ command was added to the bash hash table, it never made an effort to search for its location, but just picked the location directly from the hash table, which is no more a valid location for the binary and hence the error.

This might be a very common scenario and to overcome this, the linux users have two options –

1. Reset the hash table – In case, the bash hash table consists of very few entries, one can reset the table as it won’t matter to clear off a few entries which can be re-added conveniently whenever we run those commands again.
The way to explicitly reset the hash table is, using ‘-r’ option

$ hash -r

Now, when we run the hash command, we get

$ hash
hash: hash table empty

2. Delete the corresponding entry – In circumstances, we have have built a huge hash table over a course of time, clearing all the entries might not be a wise thing to do. In such cases, simply delete the specific entry using ‘-d’ option. This is how we do it,

$ hash -d myprint

Now, when we check the hash table, it is intact after deleting just one entry of ‘myprint’

$ hash
hits	command
  12	/bin/grep
   1	/usr/bin/tail
   1	/usr/bin/file
   1	/usr/bin/head
   1	/bin/netstat
   1	/bin/cat
   2	/bin/mkdir
   1	/usr/bin/make
   3	/bin/ls
   1	/usr/bin/find

More information about the linux hash command can be found through command,

$ help -m hash
NAME
    hash - Remember or display program locations.

SYNOPSIS
    hash [-lr] [-p pathname] [-dt] [name ...]

DESCRIPTION
    Remember or display program locations.

    Determine and remember the full pathname of each command NAME.  If
    no arguments are given, information about remembered commands is displayed.

    Options:
      -d		forget the remembered location of each NAME
      -l		display in a format that may be reused as input
      -p pathname	use PATHNAME is the full pathname of NAME
      -r		forget all remembered locations
      -t		print the remembered location of each NAME, preceding
    		each location with the corresponding NAME if multiple
    		NAMEs are given
    Arguments:
      NAME		Each NAME is searched for in $PATH and added to the list
    		of remembered commands.

    Exit Status:
    Returns success unless NAME is not found or an invalid option is given.

SEE ALSO
    bash(1)

IMPLEMENTATION
    GNU bash, version 4.2.8(1)-release (i686-pc-linux-gnu)
    Copyright (C) 2011 Free Software Foundation, Inc.
    License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>

The mhash library

The mhash library under GNU license, is an open source library which provides C interfaces to several hashing algorithms. These interfaces can be brought to various use like computing checksum, signatures, etc.
Here is a part of the huge list of supported hash algorithms:

- SHA1
– SHA160
– MD5
– GOST
– HAVAL
– CRC32
and many more.

mhash library support HMAC generation which is a technique for message authentication in cryptography. More details regarding the capability and set of interfaces can be determined from its man page.

The source code is available from here

Programming with hash tables

In the present fast developing world, developing from scratch may not be a smart thing to do, when there are already amazing pieces available to build from. Similar is the case with hash tables. Although standard C does not provide us the implementation of hash tables, but many of the great minds have done the work for us. It may not be wise to re-develop and put the effort and time in something which is easily available in a good and stable form, unless there is a peculiar specific need.

Hence, smart developers use the already available and put more effort into developing things on top of it which is even more worthy. Here are a few of hash implementation libraries available to us,

glib hash tables
SGLIB
GTK+
Judy
Uthash
ghthash

Well, there may be other great hash table implementation libraries available. Please share if you know them. This article in no way recommends the above mentioned libraries over any other. The purpose is to make the readers aware what all is available.

However, for basic general programming, POSIX standard C does provide certain methods for creating and searching the hash tables. However, using these methods, only one hash table can be used at a time.
The header file to be included is

#include <search.h>

There are three methods:

Function hcreate() 

The hcreate() creates a new hash table. The prototype looks like

int hcreate(size_t nel);

The argument it takes is the maximum number of elements hash table can contain. Once the hash table is created, it cannot be modified, this number cannot be modified. Hence one should be pretty cautious in choosing the max number while creating the hash table.
The return value is non-zero if hash table creation is a success, else it returns a zero.

Function hsearch()

This function searches the hash table with a given key and retrieves the element corresponding to it. The prototype is

ENTRY *hsearch(ENTRY item, ACTION action);

Here, the first argument ‘item’ is same as the ‘key’ for which we are searching the element. The ‘item’ is of data structure ‘ENTRY’ which actually constitutes of a key string and a generic data.

The second argument ‘action’ implies what to do in case, the search was unsuccessful. There are two possibilities, either to add the new item to the hash table or just return NULL. These two possibilities are specified by enumeration values ‘ENTER’ and ‘FIND’ respectively.

In case, the search was successful, it returns the pointer to the needed entry. Otherwise, it returns a NULL or pointer to the new entry depending upon the value of ‘Action’ parameter.

Function hdestroy()

This function destroys the hash table, and frees the memory allocated for the created hash table. Once this is done successfully, a new hash table can be created.
The prototype looks like,

void hdestroy(void);

It is time to dive into the actual programming pit and start/use our own hash table. Here is a problem case
Given a text, the task is to count the number of times each letter is repeated in a string. This can well be done using hash tables, where each character of the text is the key, and the count is stored as its data.

Therefore, our program will first create the hash table and make it store keys i.e. each character, and the correct data i.e. count of each character.
As an output, it displays the complete hash table created.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <search.h>
#define MAX 26
#define LEN 50

char chkCh[2] = "I";

int main()
{
    ENTRY entry;
    ENTRY *es = NULL;

    char text[LEN] = "LINUX IS GREAT OPERATING SYSTEM" ;
    int i = 0;
    int retVal = 0;

    retVal = hcreate(MAX);

    entry.key = (char*) malloc (2 * sizeof (char));
    for (; (text[i]); i++)
    {
        entry.key[0] = text[i];
        entry.key[1] = 0;

        /*Search the character in the hash table*/
        if ((es = hsearch(entry, FIND)))
        {
             /*The entry already exist */
             es->data += 1;
        }
        else
        {
            /*Add the new entry*/
            entry.data = (void*)1;
            hsearch(entry, ENTER);
        }
    }

    /*Display hash table*/
    printf("Hash Table :\n");
    for (i = 65; i < 65 + 26; i++)     {         entry.key[0] = i;         entry.key[1] = 0;         /*Check for character*/             if ((es = hsearch(entry, FIND)))         {             printf(" key %s ===> %d\n", es->key, (int)es->data);
        }

    }

    hdestroy();
    free(entry.key);
    entry.key = NULL;

    return 0;
}

Building and running the executable,

$ gcc htable.c -Wall -o htable
$ ./htable

The output comes out to be,

Hash Table :
 key A ===> 2
 key E ===> 3
 key G ===> 2
 key I ===> 3
 key L ===> 1
 key M ===> 1
 key N ===> 2
 key O ===> 1
 key P ===> 1
 key R ===> 2
 key S ===> 3
 key T ===> 3
 key U ===> 1
 key X ===> 1
 key Y ===> 1

Pretty much as expected. Right !

However, all these functions use global memory space to store the hash table, hence, making it thread-UNsafe.

Conclusion

Hash tables are very important and immensely useful data structures, as it makes very complicated algorithms simpler to implement. Therefore, one will find plethora of implementations out there, which can be used for development. Still, the existing libraries may have their own individual drawbacks and bugs. One should study and analyse those, before choosing onto using any of the implementation library. However, there is always an option to implement hash tables ourselves as per our specific and optimized way. Getting in more deeper, there also a concept of distributed hash tables in networking. True indeed, more one gets to know, more one feels, there is still more out there!

References

http://www.linuxjournal.com/article/6797
http://crashingdaily.wordpress.com/2008/04/21/hashing-the-executables-a-look-at-hash-and-type/
http://attractivechaos.wordpress.com/2008/08/28/comparison-of-hash-table-libraries/

3 thoughts on “Hash Tables – using hash command and available implementations

  1. Shawn H Corey

    Although calculation of the hash key is done in a predictable amount of time, accessing the hash isn’t. That’s because of collisions. They can slow the access time. Also, if the hash algorithm does not randomly salt is hash function, a DoS attack can be made by sending data in a manner to maximize collisions.

    Reply
    1. Rupali Post author

      Thanks Shawn for sharing this valuable piece of information.

      Reply
  2. Jothikrishnan

    The hash tables are very important in data structure ,
    this article shows clearly the hash table algorithms how to processed in programs. . . .

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *