![]() |
Leptonica
1.77.0
Image processing and image analysis suite
|
#include "allheaders.h"Go to the source code of this file.
Functions | |
| L_DNAHASH * | l_dnaHashCreate (l_int32 nbuckets, l_int32 initsize) |
| void | l_dnaHashDestroy (L_DNAHASH **pdahash) |
| l_int32 | l_dnaHashGetCount (L_DNAHASH *dahash) |
| l_int32 | l_dnaHashGetTotalCount (L_DNAHASH *dahash) |
| L_DNA * | l_dnaHashGetDna (L_DNAHASH *dahash, l_uint64 key, l_int32 copyflag) |
| l_ok | l_dnaHashAdd (L_DNAHASH *dahash, l_uint64 key, l_float64 value) |
| L_DNAHASH * | l_dnaHashCreateFromDna (L_DNA *da) |
| l_ok | l_dnaRemoveDupsByHash (L_DNA *das, L_DNA **pdad, L_DNAHASH **pdahash) |
| l_ok | l_dnaMakeHistoByHash (L_DNA *das, L_DNAHASH **pdahash, L_DNA **pdav, L_DNA **pdac) |
| L_DNA * | l_dnaIntersectionByHash (L_DNA *da1, L_DNA *da2) |
| l_ok | l_dnaFindValByHash (L_DNA *da, L_DNAHASH *dahash, l_float64 val, l_int32 *pindex) |
DnaHash creation, destruction
L_DNAHASH *l_dnaHashCreate()
void l_dnaHashDestroy()
DnaHash: Accessors and modifiers *
l_int32 l_dnaHashGetCount()
l_int32 l_dnaHashGetTotalCount()
L_DNA *l_dnaHashGetDna()
void l_dnaHashAdd()
DnaHash: Operations on Dna
L_DNAHASH *l_dnaHashCreateFromDna()
l_int32 l_dnaRemoveDupsByHash()
l_int32 l_dnaMakeHistoByHash()
L_DNA *l_dnaIntersectionByHash()
l_int32 l_dnaFindValByHash()
(1) The DnaHash is an array of Dna. It is useful for fast
storage and lookup for sets and maps. If the set or map
is on a Dna itself, the hash is a simple function that
maps a double to a l_uint64; otherwise the function will
map a string or a (x,y) point to a l_uint64. The result of
the map is the "key", which is then used with the mod
function to select which Dna array is to be used. The
number of arrays in a DnaHash should be a prime number.
If there are N items, we set up the DnaHash array to have
approximately N/20 Dna, so the average size of these arrays
will be about 20 when fully populated. The number 20 was
found empirically to be in a broad maximum of efficiency.
(2) Note that the word "hash" is overloaded. There are actually
two hashing steps: the first hashes the object to a l_uint64,
called the "key", and the second uses the mod function to
"hash" the "key" to the index of a particular Dna in the
DnaHash array.
(3) Insertion and lookup time for DnaHash is O(1). Hash collisions
are easily handled (we expect an average of 20 for each key),
so we can use simple (fast) hash functions: we deal with
collisions by storing an array for each hash key.
This can be contrasted with using rbtree for sets and
maps, where insertion and lookup are O(logN) and hash functions
are slower because they must be good enough (i.e, random
enough with arbitrary input) to avoid collisions.
(4) Hash functions that map points, strings and floats to l_uint64
are given in utils.c.
(5) The use of the DnaHash (and RBTree) with strings and
(x,y) points can be found in string2.c and ptafunc2.c, rsp.
This file has similar hash set functions, using DnaHash on
two input Dna, for removing duplicates and finding the
intersection. It also uses DnaHash as a hash map to find
a histogram of counts from an input Dna.
(6) Comparisons in running time, between DnaHash and RBTree, for
large sets of strings and points, are given in prog/hashtest.c.
(7) This is a very simple implementation, that expects that you
know approximately (i.e., within a factor of 2 or 3) how many
items are to be stored when you initialize the DnaHash.
(It would be nice to modify the l_dnaHashAdd() function
to increase the number of bins when the average occupation
exceeds 40 or so.)
(8) Useful rule of thumb for hashing collisions:
For a random hashing function (say, from strings to l_uint64),
the probability of a collision increases as N^2 for N much
less than 2^32. The quadratic behavior switches over to
approaching 1.0 around 2^32, which is the square root of 2^64.
So, for example, if you have 10^7 strings, the probability
of a single collision using an l_uint64 key is on the order of
(10^7/10^9)^2 ~ 10^-4.
For a million strings you don't need to worry about collisons
(~10-6 probability), and for most applications can use the
RBTree (sorting) implementation with confidence.
Definition in file dnahash.c.
| [in] | da | |
| [in] | dahash | containing indices into da |
| [in] | val | searching for this number in da |
| [out] | pindex | index into da if found; -1 otherwise |
Notes:
(1) Algo: hash val into a key; hash the key to get the dna
in dahash (that holds indices into da); traverse
the dna of indices looking for val in da.
Definition at line 553 of file dnahash.c.
References l_dnaGetCount(), l_dnaGetDValue(), l_dnaGetIValue(), l_dnaHashGetCount(), l_dnaHashGetDna(), l_hashFloat64ToUint64(), and L_NOCOPY.
Referenced by l_dnaIntersectionByHash(), l_dnaMakeHistoByHash(), and l_dnaRemoveDupsByHash().
| l_ok l_dnaHashAdd | ( | L_DNAHASH * | dahash, |
| l_uint64 | key, | ||
| l_float64 | value | ||
| ) |
| [in] | dahash | |
| [in] | key | key to be hashed into a bucket number |
| [in] | value | float value to be appended to the specific dna |
Definition at line 267 of file dnahash.c.
References L_DnaHash::dna, L_DnaHash::initsize, l_dnaAddNumber(), and l_dnaCreate().
Referenced by l_dnaHashCreateFromDna(), l_dnaHashCreateFromPta(), l_dnaHashCreateFromSarray(), l_dnaIntersectionByHash(), l_dnaMakeHistoByHash(), l_dnaRemoveDupsByHash(), ptaIntersectionByHash(), ptaRemoveDupsByHash(), sarrayIntersectionByHash(), and sarrayRemoveDupsByHash().
| L_DNAHASH* l_dnaHashCreate | ( | l_int32 | nbuckets, |
| l_int32 | initsize | ||
| ) |
| [in] | nbuckets | the number of buckets in the hash table, which should be prime. |
| [in] | initsize | initial size of each allocated dna; 0 for default |
Notes:
(1) Actual dna are created only as required by l_dnaHashAdd()
Definition at line 122 of file dnahash.c.
References L_DnaHash::dna, and L_DnaHash::initsize.
Referenced by l_dnaHashCreateFromDna(), l_dnaHashCreateFromPta(), l_dnaHashCreateFromSarray(), l_dnaIntersectionByHash(), l_dnaMakeHistoByHash(), l_dnaRemoveDupsByHash(), ptaIntersectionByHash(), ptaRemoveDupsByHash(), sarrayIntersectionByHash(), and sarrayRemoveDupsByHash().
| [in] | da |
Notes:
(1) The values stored in the dahash are indices into da;
dahash has no use without da.
Definition at line 306 of file dnahash.c.
References findNextLargerPrime(), l_dnaGetCount(), l_dnaGetDValue(), l_dnaHashAdd(), l_dnaHashCreate(), and l_hashFloat64ToUint64().
Referenced by l_dnaIntersectionByHash().
| void l_dnaHashDestroy | ( | L_DNAHASH ** | pdahash | ) |
| [in,out] | pdahash | to be nulled, if it exists |
Definition at line 152 of file dnahash.c.
References L_DnaHash::dna, and l_dnaDestroy().
Referenced by l_dnaIntersectionByHash(), l_dnaRemoveDupsByHash(), ptaIntersectionByHash(), ptaRemoveDupsByHash(), sarrayIntersectionByHash(), and sarrayRemoveDupsByHash().
| l_int32 l_dnaHashGetCount | ( | L_DNAHASH * | dahash | ) |
| [in] | dahash |
Definition at line 185 of file dnahash.c.
Referenced by l_dnaFindValByHash(), and l_dnaIntersectionByHash().
| [in] | dahash | |
| [in] | key | key to be hashed into a bucket number |
| [in] | copyflag | L_NOCOPY, L_COPY, L_CLONE |
Definition at line 232 of file dnahash.c.
References L_DnaHash::dna, L_COPY, l_dnaClone(), l_dnaCopy(), and L_NOCOPY.
Referenced by l_dnaFindValByHash(), l_dnaHashGetTotalCount(), ptaFindPtByHash(), and sarrayFindStringByHash().
| l_int32 l_dnaHashGetTotalCount | ( | L_DNAHASH * | dahash | ) |
| [in] | dahash |
Definition at line 203 of file dnahash.c.
References l_dnaGetCount(), l_dnaHashGetDna(), and L_NOCOPY.
| [in] | da1,da2 |
Notes:
(1) This uses the same method for building the intersection set
as ptaIntersectionByHash() and sarrayIntersectionByHash().
Definition at line 485 of file dnahash.c.
References findNextLargerPrime(), l_dnaAddNumber(), l_dnaCreate(), l_dnaFindValByHash(), l_dnaGetCount(), l_dnaGetDValue(), l_dnaHashAdd(), l_dnaHashCreate(), l_dnaHashCreateFromDna(), l_dnaHashDestroy(), l_dnaHashGetCount(), and l_hashFloat64ToUint64().
| [in] | das | |
| [out] | pdahash | hash map: val –> index |
| [out] | pdav | array of values: index –> val |
| [out] | pdac | histo array of counts: index –> count |
Notes:
(1) Generates and returns a dna of occurrences (histogram),
an aligned dna of values, and an associated hashmap.
The hashmap takes dav and a value, and points into the
histogram in dac.
(2) The dna of values, dav, is aligned with the histogram dac,
and is needed for fast lookup. It is a hash set, because
the values are unique.
(3) Lookup is simple:
l_dnaFindValByHash(dav, dahash, val, &index);
if (index >= 0)
l_dnaGetIValue(dac, index, &icount);
else
icount = 0;
Definition at line 421 of file dnahash.c.
References findNextLargerPrime(), l_dnaAddNumber(), l_dnaCreate(), l_dnaFindValByHash(), l_dnaGetCount(), l_dnaGetDValue(), l_dnaGetIValue(), l_dnaHashAdd(), l_dnaHashCreate(), l_dnaSetValue(), and l_hashFloat64ToUint64().
| [in] | das | |
| [out] | pdad | hash set |
| [out] | pdahash | [optional] dnahash used for lookup |
Notes:
(1) Generates a dna with unique values.
(2) The dnahash is built up with dad to assure uniqueness.
It can be used to find if an element is in the set:
l_dnaFindValByHash(dad, dahash, val, &index)
Definition at line 350 of file dnahash.c.
References findNextLargerPrime(), l_dnaAddNumber(), l_dnaCreate(), l_dnaFindValByHash(), l_dnaGetCount(), l_dnaGetDValue(), l_dnaHashAdd(), l_dnaHashCreate(), l_dnaHashDestroy(), and l_hashFloat64ToUint64().