Leptonica  1.77.0
Image processing and image analysis suite
dnahash.c File Reference
#include "allheaders.h"

Go to the source code of this file.

Functions

L_DNAHASHl_dnaHashCreate (l_int32 nbuckets, l_int32 initsize)
 
void l_dnaHashDestroy (L_DNAHASH **pdahash)
 
l_int32 l_dnaHashGetCount (L_DNAHASH *dahash)
 
l_int32 l_dnaHashGetTotalCount (L_DNAHASH *dahash)
 
L_DNAl_dnaHashGetDna (L_DNAHASH *dahash, l_uint64 key, l_int32 copyflag)
 
l_ok l_dnaHashAdd (L_DNAHASH *dahash, l_uint64 key, l_float64 value)
 
L_DNAHASHl_dnaHashCreateFromDna (L_DNA *da)
 
l_ok l_dnaRemoveDupsByHash (L_DNA *das, L_DNA **pdad, L_DNAHASH **pdahash)
 
l_ok l_dnaMakeHistoByHash (L_DNA *das, L_DNAHASH **pdahash, L_DNA **pdav, L_DNA **pdac)
 
L_DNAl_dnaIntersectionByHash (L_DNA *da1, L_DNA *da2)
 
l_ok l_dnaFindValByHash (L_DNA *da, L_DNAHASH *dahash, l_float64 val, l_int32 *pindex)
 

Detailed Description

 DnaHash creation, destruction
     L_DNAHASH   *l_dnaHashCreate()
     void         l_dnaHashDestroy()

 DnaHash: Accessors and modifiers                      *
     l_int32      l_dnaHashGetCount()
     l_int32      l_dnaHashGetTotalCount()
     L_DNA       *l_dnaHashGetDna()
     void         l_dnaHashAdd()

 DnaHash: Operations on Dna
     L_DNAHASH   *l_dnaHashCreateFromDna()
     l_int32      l_dnaRemoveDupsByHash()
     l_int32      l_dnaMakeHistoByHash()
     L_DNA       *l_dnaIntersectionByHash()
     l_int32      l_dnaFindValByHash()
   (1) The DnaHash is an array of Dna.  It is useful for fast
       storage and lookup for sets and maps.  If the set or map
       is on a Dna itself, the hash is a simple function that
       maps a double to a l_uint64; otherwise the function will
       map a string or a (x,y) point to a l_uint64.  The result of
       the map is the "key", which is then used with the mod
       function to select which Dna array is to be used.  The
       number of arrays in a DnaHash should be a prime number.
       If there are N items, we set up the DnaHash array to have
       approximately N/20 Dna, so the average size of these arrays
       will be about 20 when fully populated.  The number 20 was
       found empirically to be in a broad maximum of efficiency.
   (2) Note that the word "hash" is overloaded.  There are actually
       two hashing steps: the first hashes the object to a l_uint64,
       called the "key", and the second uses the mod function to
       "hash" the "key" to the index of a particular Dna in the
       DnaHash array.
   (3) Insertion and lookup time for DnaHash is O(1).  Hash collisions
       are easily handled (we expect an average of 20 for each key),
       so we can use simple (fast) hash functions: we deal with
       collisions by storing an array for each hash key.
       This can be contrasted with using rbtree for sets and
       maps, where insertion and lookup are O(logN) and hash functions
       are slower because they must be good enough (i.e, random
       enough with arbitrary input) to avoid collisions.
   (4) Hash functions that map points, strings and floats to l_uint64
       are given in utils.c.
   (5) The use of the DnaHash (and RBTree) with strings and
       (x,y) points can be found in string2.c and ptafunc2.c, rsp.
       This file has similar hash set functions, using DnaHash on
       two input Dna, for removing duplicates and finding the
       intersection.  It also uses DnaHash as a hash map to find
       a histogram of counts from an input Dna.
   (6) Comparisons in running time, between DnaHash and RBTree, for
       large sets of strings and points, are given in prog/hashtest.c.
   (7) This is a very simple implementation, that expects that you
       know approximately (i.e., within a factor of 2 or 3) how many
       items are to be stored when you initialize the DnaHash.
       (It would be nice to modify the l_dnaHashAdd() function
       to increase the number of bins when the average occupation
       exceeds 40 or so.)
   (8) Useful rule of thumb for hashing collisions:
       For a random hashing function (say, from strings to l_uint64),
       the probability of a collision increases as N^2 for N much
       less than 2^32.  The quadratic behavior switches over to
       approaching 1.0 around 2^32, which is the square root of 2^64.
       So, for example, if you have 10^7 strings, the probability
       of a single collision using an l_uint64 key is on the order of
           (10^7/10^9)^2 ~ 10^-4.
       For a million strings you don't need to worry about collisons
       (~10-6 probability), and for most applications can use the
       RBTree (sorting) implementation with confidence.

Definition in file dnahash.c.

Function Documentation

◆ l_dnaFindValByHash()

l_ok l_dnaFindValByHash ( L_DNA da,
L_DNAHASH dahash,
l_float64  val,
l_int32 *  pindex 
)

l_dnaFindValByHash()

Parameters
[in]da
[in]dahashcontaining indices into da
[in]valsearching for this number in da
[out]pindexindex into da if found; -1 otherwise
Returns
0 if OK; 1 on error
Notes:
     (1) Algo: hash val into a key; hash the key to get the dna
               in dahash (that holds indices into da); traverse
               the dna of indices looking for val in da.

Definition at line 553 of file dnahash.c.

References l_dnaGetCount(), l_dnaGetDValue(), l_dnaGetIValue(), l_dnaHashGetCount(), l_dnaHashGetDna(), l_hashFloat64ToUint64(), and L_NOCOPY.

Referenced by l_dnaIntersectionByHash(), l_dnaMakeHistoByHash(), and l_dnaRemoveDupsByHash().

◆ l_dnaHashAdd()

l_ok l_dnaHashAdd ( L_DNAHASH dahash,
l_uint64  key,
l_float64  value 
)

l_dnaHashAdd()

Parameters
[in]dahash
[in]keykey to be hashed into a bucket number
[in]valuefloat value to be appended to the specific dna
Returns
0 if OK; 1 on error

Definition at line 267 of file dnahash.c.

References L_DnaHash::dna, L_DnaHash::initsize, l_dnaAddNumber(), and l_dnaCreate().

Referenced by l_dnaHashCreateFromDna(), l_dnaHashCreateFromPta(), l_dnaHashCreateFromSarray(), l_dnaIntersectionByHash(), l_dnaMakeHistoByHash(), l_dnaRemoveDupsByHash(), ptaIntersectionByHash(), ptaRemoveDupsByHash(), sarrayIntersectionByHash(), and sarrayRemoveDupsByHash().

◆ l_dnaHashCreate()

L_DNAHASH* l_dnaHashCreate ( l_int32  nbuckets,
l_int32  initsize 
)

l_dnaHashCreate()

Parameters
[in]nbucketsthe number of buckets in the hash table, which should be prime.
[in]initsizeinitial size of each allocated dna; 0 for default
Returns
ptr to new dnahash, or NULL on error
Notes:
     (1) Actual dna are created only as required by l_dnaHashAdd()

Definition at line 122 of file dnahash.c.

References L_DnaHash::dna, and L_DnaHash::initsize.

Referenced by l_dnaHashCreateFromDna(), l_dnaHashCreateFromPta(), l_dnaHashCreateFromSarray(), l_dnaIntersectionByHash(), l_dnaMakeHistoByHash(), l_dnaRemoveDupsByHash(), ptaIntersectionByHash(), ptaRemoveDupsByHash(), sarrayIntersectionByHash(), and sarrayRemoveDupsByHash().

◆ l_dnaHashCreateFromDna()

L_DNAHASH* l_dnaHashCreateFromDna ( L_DNA da)

l_dnaHashCreateFromDna()

Parameters
[in]da
Returns
dahash if OK; 1 on error
Notes:
     (1) The values stored in the dahash are indices into da;
         dahash has no use without da.

Definition at line 306 of file dnahash.c.

References findNextLargerPrime(), l_dnaGetCount(), l_dnaGetDValue(), l_dnaHashAdd(), l_dnaHashCreate(), and l_hashFloat64ToUint64().

Referenced by l_dnaIntersectionByHash().

◆ l_dnaHashDestroy()

void l_dnaHashDestroy ( L_DNAHASH **  pdahash)

l_dnaHashDestroy()

Parameters
[in,out]pdahashto be nulled, if it exists
Returns
void

Definition at line 152 of file dnahash.c.

References L_DnaHash::dna, and l_dnaDestroy().

Referenced by l_dnaIntersectionByHash(), l_dnaRemoveDupsByHash(), ptaIntersectionByHash(), ptaRemoveDupsByHash(), sarrayIntersectionByHash(), and sarrayRemoveDupsByHash().

◆ l_dnaHashGetCount()

l_int32 l_dnaHashGetCount ( L_DNAHASH dahash)

l_dnaHashGetCount()

Parameters
[in]dahash
Returns
nbuckets allocated, or 0 on error

Definition at line 185 of file dnahash.c.

Referenced by l_dnaFindValByHash(), and l_dnaIntersectionByHash().

◆ l_dnaHashGetDna()

L_DNA* l_dnaHashGetDna ( L_DNAHASH dahash,
l_uint64  key,
l_int32  copyflag 
)

l_dnaHashGetDna()

Parameters
[in]dahash
[in]keykey to be hashed into a bucket number
[in]copyflagL_NOCOPY, L_COPY, L_CLONE
Returns
ptr to dna

Definition at line 232 of file dnahash.c.

References L_DnaHash::dna, L_COPY, l_dnaClone(), l_dnaCopy(), and L_NOCOPY.

Referenced by l_dnaFindValByHash(), l_dnaHashGetTotalCount(), ptaFindPtByHash(), and sarrayFindStringByHash().

◆ l_dnaHashGetTotalCount()

l_int32 l_dnaHashGetTotalCount ( L_DNAHASH dahash)

l_dnaHashGetTotalCount()

Parameters
[in]dahash
Returns
n number of numbers in all dna, or 0 on error

Definition at line 203 of file dnahash.c.

References l_dnaGetCount(), l_dnaHashGetDna(), and L_NOCOPY.

◆ l_dnaIntersectionByHash()

L_DNA* l_dnaIntersectionByHash ( L_DNA da1,
L_DNA da2 
)

l_dnaIntersectionByHash()

Parameters
[in]da1,da2
Returns
dad intersection of the number arrays, or NULL on error
Notes:
     (1) This uses the same method for building the intersection set
         as ptaIntersectionByHash() and sarrayIntersectionByHash().

Definition at line 485 of file dnahash.c.

References findNextLargerPrime(), l_dnaAddNumber(), l_dnaCreate(), l_dnaFindValByHash(), l_dnaGetCount(), l_dnaGetDValue(), l_dnaHashAdd(), l_dnaHashCreate(), l_dnaHashCreateFromDna(), l_dnaHashDestroy(), l_dnaHashGetCount(), and l_hashFloat64ToUint64().

◆ l_dnaMakeHistoByHash()

l_ok l_dnaMakeHistoByHash ( L_DNA das,
L_DNAHASH **  pdahash,
L_DNA **  pdav,
L_DNA **  pdac 
)

l_dnaMakeHistoByHash()

Parameters
[in]das
[out]pdahashhash map: val –> index
[out]pdavarray of values: index –> val
[out]pdachisto array of counts: index –> count
Returns
0 if OK; 1 on error
Notes:
     (1) Generates and returns a dna of occurrences (histogram),
         an aligned dna of values, and an associated hashmap.
         The hashmap takes dav and a value, and points into the
         histogram in dac.
     (2) The dna of values, dav, is aligned with the histogram dac,
         and is needed for fast lookup.  It is a hash set, because
         the values are unique.
     (3) Lookup is simple:
             l_dnaFindValByHash(dav, dahash, val, &index);
             if (index >= 0)
                 l_dnaGetIValue(dac, index, &icount);
             else
                 icount = 0;

Definition at line 421 of file dnahash.c.

References findNextLargerPrime(), l_dnaAddNumber(), l_dnaCreate(), l_dnaFindValByHash(), l_dnaGetCount(), l_dnaGetDValue(), l_dnaGetIValue(), l_dnaHashAdd(), l_dnaHashCreate(), l_dnaSetValue(), and l_hashFloat64ToUint64().

◆ l_dnaRemoveDupsByHash()

l_ok l_dnaRemoveDupsByHash ( L_DNA das,
L_DNA **  pdad,
L_DNAHASH **  pdahash 
)

l_dnaRemoveDupsByHash()

Parameters
[in]das
[out]pdadhash set
[out]pdahash[optional] dnahash used for lookup
Returns
0 if OK; 1 on error
Notes:
     (1) Generates a dna with unique values.
     (2) The dnahash is built up with dad to assure uniqueness.
         It can be used to find if an element is in the set:
             l_dnaFindValByHash(dad, dahash, val, &index)

Definition at line 350 of file dnahash.c.

References findNextLargerPrime(), l_dnaAddNumber(), l_dnaCreate(), l_dnaFindValByHash(), l_dnaGetCount(), l_dnaGetDValue(), l_dnaHashAdd(), l_dnaHashCreate(), l_dnaHashDestroy(), and l_hashFloat64ToUint64().