![]() |
Leptonica
1.77.0
Image processing and image analysis suite
|
#include <string.h>#include "allheaders.h"Go to the source code of this file.
Functions | |
| SARRAY * | sarraySort (SARRAY *saout, SARRAY *sain, l_int32 sortorder) |
| SARRAY * | sarraySortByIndex (SARRAY *sain, NUMA *naindex) |
| l_int32 | stringCompareLexical (const char *str1, const char *str2) |
| SARRAY * | sarrayUnionByAset (SARRAY *sa1, SARRAY *sa2) |
| SARRAY * | sarrayRemoveDupsByAset (SARRAY *sas) |
| SARRAY * | sarrayIntersectionByAset (SARRAY *sa1, SARRAY *sa2) |
| L_ASET * | l_asetCreateFromSarray (SARRAY *sa) |
| l_ok | sarrayRemoveDupsByHash (SARRAY *sas, SARRAY **psad, L_DNAHASH **pdahash) |
| SARRAY * | sarrayIntersectionByHash (SARRAY *sa1, SARRAY *sa2) |
| l_ok | sarrayFindStringByHash (SARRAY *sa, L_DNAHASH *dahash, const char *str, l_int32 *pindex) |
| L_DNAHASH * | l_dnaHashCreateFromSarray (SARRAY *sa) |
| SARRAY * | sarrayGenerateIntegers (l_int32 n) |
| l_ok | sarrayLookupCSKV (SARRAY *sa, const char *keystring, char **pvalstring) |
Sort
SARRAY *sarraySort()
SARRAY *sarraySortByIndex()
l_int32 stringCompareLexical()
Set operations using aset (rbtree)
SARRAY *sarrayUnionByAset()
SARRAY *sarrayRemoveDupsByAset()
SARRAY *sarrayIntersectionByAset()
L_ASET *l_asetCreateFromSarray()
Set operations using hashing (dnahash)
l_int32 sarrayRemoveDupsByHash()
SARRAY *sarrayIntersectionByHash()
l_int32 sarrayFindStringByHash()
L_DNAHASH *l_dnaHashCreateFromSarray()
Miscellaneous operations
SARRAY *sarrayGenerateIntegers()
l_int32 sarrayLookupCSKV()
We have two implementations of set operations on an array of strings:
(1) Using an underlying tree (rbtree)
This uses a good 64 bit hashing function for the key,
that is not expected to have hash collisions (and we do
not test for them). The tree is built up of the hash
values, and if the hash is found in the tree, it is
assumed that the string has already been found. (2) Using an underlying hashing of the keys (dnahash)
This uses a fast 64 bit hashing function for the key,
which is then hashed into a bucket (a dna in a dnaHash).
Because hash collisions can occur, the index into the
sarray for the string that gave rise to that key is stored,
and the dna (bucket) is traversed, using the stored indices
to determine if that string had already been seen.
Definition in file sarray2.c.
| [in] | sa |
Definition at line 379 of file sarray2.c.
Referenced by sarrayIntersectionByAset().
| [in] | sa |
Definition at line 609 of file sarray2.c.
References findNextLargerPrime(), l_dnaHashAdd(), l_dnaHashCreate(), l_hashStringToUint64(), L_NOCOPY, sarrayGetCount(), and sarrayGetString().
Referenced by sarrayIntersectionByHash().
| [in] | sa | |
| [in] | dahash | built from sa |
| [in] | str | arbitrary string |
| [out] | pindex | index into sa if str is in sa; -1 otherwise |
Notes:
(1) Fast lookup in dnaHash associated with a sarray, to see if a
random string str is already stored in the hash table.
(2) We use a strong hash function to minimize the chance that
two different strings hash to the same key value.
(3) We select the number of buckets to be about 5% of the size
of the input sarray, so that when fully populated, each
bucket (dna) will have about 20 entries, each being an index
into sa. In lookup, after hashing to the key, and then
again to the bucket, we traverse the bucket (dna), using the
index into sa to check if str has been found before.
Definition at line 563 of file sarray2.c.
References l_dnaGetCount(), l_dnaGetIValue(), l_dnaHashGetDna(), l_hashStringToUint64(), L_NOCOPY, and sarrayGetString().
Referenced by sarrayIntersectionByHash(), and sarrayRemoveDupsByHash().
| SARRAY* sarrayGenerateIntegers | ( | l_int32 | n | ) |
| [in] | n |
Definition at line 648 of file sarray2.c.
References L_COPY, sarrayAddString(), and sarrayCreate().
Referenced by pixaCompareInPdf().
| [in] | sa1,sa2 |
Notes:
(1) Algorithm: put the larger sarray into a set, using the string
hashes as the key values. Then run through the smaller sarray,
building an output sarray and a second set from the strings
in the larger array: if a string is in the first set but
not in the second, add the string to the output sarray and hash
it into the second set. The second set is required to make
sure only one instance of each string is put into the output sarray.
This is O(mlogn), {m,n} = sizes of {smaller,larger} input arrays.
Definition at line 328 of file sarray2.c.
References l_asetCreateFromSarray(), sarrayCreate(), and sarrayGetCount().
| [in] | sa1,sa2 |
Notes:
(1) This is faster than sarrayIntersectionByAset(), because the
bucket lookup is O(n).
Definition at line 488 of file sarray2.c.
References findNextLargerPrime(), L_COPY, l_dnaHashAdd(), l_dnaHashCreate(), l_dnaHashCreateFromSarray(), l_dnaHashDestroy(), l_hashStringToUint64(), L_NOCOPY, sarrayAddString(), sarrayCreate(), sarrayFindStringByHash(), sarrayGetCount(), and sarrayGetString().
| l_ok sarrayLookupCSKV | ( | SARRAY * | sa, |
| const char * | keystring, | ||
| char ** | pvalstring | ||
| ) |
| [in] | sa | (of strings, each being a comma-separated pair of strings, the first being a key and the second a value) |
| [in] | keystring | (an input string to match with each key in sa |
| [out] | pvalstring | (the returned value string corresponding to the input key string, if found; otherwise NULL) |
Notes:
(1) The input sa can have other strings that are not in
comma-separated key-value format. These will be ignored.
(2) This returns a copy of the first value string in sa whose
key string matches the input keystring.
(3) White space is not ignored; all white space before the ','
is used for the keystring in matching. This allows the
key and val strings to have white space (e.g., multiple words).
Definition at line 688 of file sarray2.c.
References L_NOCOPY, sarrayCreate(), sarrayGetCount(), and sarrayGetString().
| [in] | sas |
Notes:
(1) This is O(nlogn), considerably slower than
sarrayRemoveDupsByHash() for large string arrays.
(2) The key for each string is a 64-bit hash.
(3) Build a set, using hashed strings as keys. As the set is
built, first do a find; if not found, add the key to the
set and add the string to the output sarray.
Definition at line 277 of file sarray2.c.
Referenced by sarrayUnionByAset().
| [in] | sas | |
| [out] | psad | unique set of strings; duplicates removed |
| [out] | pdahash | [optional] dnahash used for lookup |
Notes:
(1) Generates a sarray with unique values.
(2) The dnahash is built up with sad to assure uniqueness.
It can be used to find if a string is in the set:
sarrayFindValByHash(sad, dahash, str, &index)
(3) The hash of the string location is simple and fast. It scales
up with the number of buckets to insure a fairly random
bucket selection input strings.
(4) This is faster than sarrayRemoveDupsByAset(), because the
bucket lookup is O(n), although there is a double-loop
lookup within the dna in each bucket.
Definition at line 431 of file sarray2.c.
References findNextLargerPrime(), L_COPY, l_dnaHashAdd(), l_dnaHashCreate(), l_dnaHashDestroy(), l_hashStringToUint64(), L_NOCOPY, sarrayAddString(), sarrayCreate(), sarrayFindStringByHash(), sarrayGetCount(), and sarrayGetString().
| [in] | saout | output sarray; can be NULL or equal to sain |
| [in] | sain | input sarray |
| [in] | sortorder | L_SORT_INCREASING or L_SORT_DECREASING |
Notes:
(1) Set saout = sain for in-place; otherwise, set naout = NULL.
(2) Shell sort, modified from K&R, 2nd edition, p.62.
Slow but simple O(n logn) sort.
Definition at line 95 of file sarray2.c.
References Sarray::array, L_SORT_DECREASING, L_SORT_INCREASING, sarrayCopy(), sarrayGetCount(), and stringCompareLexical().
Referenced by getSortedPathnamesInDirectory().
| [in] | sain | |
| [in] | naindex | na that maps from the new sarray to the input sarray |
Definition at line 145 of file sarray2.c.
References L_COPY, L_INSERT, numaGetIValue(), sarrayAddString(), sarrayCreate(), sarrayGetCount(), and sarrayGetString().
| [in] | sa1,sa2 |
Notes:
(1) Duplicates are removed from the concatenation of the two arrays.
(2) The key for each string is a 64-bit hash.
(2) Algorithm: Concatenate the two sarrays. Then build a set,
using hashed strings as keys. As the set is built, first do
a find; if not found, add the key to the set and add the string
to the output sarray. This is O(nlogn).
Definition at line 237 of file sarray2.c.
References sarrayCopy(), sarrayDestroy(), sarrayJoin(), and sarrayRemoveDupsByAset().
| l_int32 stringCompareLexical | ( | const char * | str1, |
| const char * | str2 | ||
| ) |
| [in] | str1 | |
| [in] | str2 |
Notes:
(1) If the lexical values are identical, return a 0, to
indicate that no swapping is required to sort the strings.
Definition at line 185 of file sarray2.c.
Referenced by sarraySort().