Leptonica  1.77.0
Image processing and image analysis suite
sarray2.c File Reference
#include <string.h>
#include "allheaders.h"

Go to the source code of this file.

Functions

SARRAYsarraySort (SARRAY *saout, SARRAY *sain, l_int32 sortorder)
 
SARRAYsarraySortByIndex (SARRAY *sain, NUMA *naindex)
 
l_int32 stringCompareLexical (const char *str1, const char *str2)
 
SARRAYsarrayUnionByAset (SARRAY *sa1, SARRAY *sa2)
 
SARRAYsarrayRemoveDupsByAset (SARRAY *sas)
 
SARRAYsarrayIntersectionByAset (SARRAY *sa1, SARRAY *sa2)
 
L_ASETl_asetCreateFromSarray (SARRAY *sa)
 
l_ok sarrayRemoveDupsByHash (SARRAY *sas, SARRAY **psad, L_DNAHASH **pdahash)
 
SARRAYsarrayIntersectionByHash (SARRAY *sa1, SARRAY *sa2)
 
l_ok sarrayFindStringByHash (SARRAY *sa, L_DNAHASH *dahash, const char *str, l_int32 *pindex)
 
L_DNAHASHl_dnaHashCreateFromSarray (SARRAY *sa)
 
SARRAYsarrayGenerateIntegers (l_int32 n)
 
l_ok sarrayLookupCSKV (SARRAY *sa, const char *keystring, char **pvalstring)
 

Detailed Description

 Sort
     SARRAY     *sarraySort()
     SARRAY     *sarraySortByIndex()
     l_int32     stringCompareLexical()

 Set operations using aset (rbtree)
     SARRAY     *sarrayUnionByAset()
     SARRAY     *sarrayRemoveDupsByAset()
     SARRAY     *sarrayIntersectionByAset()
     L_ASET     *l_asetCreateFromSarray()

 Set operations using hashing (dnahash)
     l_int32     sarrayRemoveDupsByHash()
     SARRAY     *sarrayIntersectionByHash()
     l_int32     sarrayFindStringByHash()
     L_DNAHASH  *l_dnaHashCreateFromSarray()

 Miscellaneous operations
     SARRAY     *sarrayGenerateIntegers()
     l_int32     sarrayLookupCSKV()
We have two implementations of set operations on an array of strings:
  (1) Using an underlying tree (rbtree)
      This uses a good 64 bit hashing function for the key,
      that is not expected to have hash collisions (and we do
      not test for them).  The tree is built up of the hash
      values, and if the hash is found in the tree, it is
      assumed that the string has already been found.
  (2) Using an underlying hashing of the keys (dnahash)
      This uses a fast 64 bit hashing function for the key,
      which is then hashed into a bucket (a dna in a dnaHash).
      Because hash collisions can occur, the index into the
      sarray for the string that gave rise to that key is stored,
      and the dna (bucket) is traversed, using the stored indices
      to determine if that string had already been seen.
 

Definition in file sarray2.c.

Function Documentation

◆ l_asetCreateFromSarray()

L_ASET* l_asetCreateFromSarray ( SARRAY sa)

l_asetCreateFromSarray()

Parameters
[in]sa
Returns
set using a string hash into a uint64 as the key

Definition at line 379 of file sarray2.c.

Referenced by sarrayIntersectionByAset().

◆ l_dnaHashCreateFromSarray()

L_DNAHASH* l_dnaHashCreateFromSarray ( SARRAY sa)

l_dnaHashCreateFromSarray()

Parameters
[in]sa
Returns
dahash, or NULL on error

Definition at line 609 of file sarray2.c.

References findNextLargerPrime(), l_dnaHashAdd(), l_dnaHashCreate(), l_hashStringToUint64(), L_NOCOPY, sarrayGetCount(), and sarrayGetString().

Referenced by sarrayIntersectionByHash().

◆ sarrayFindStringByHash()

l_ok sarrayFindStringByHash ( SARRAY sa,
L_DNAHASH dahash,
const char *  str,
l_int32 *  pindex 
)

sarrayFindStringByHash()

Parameters
[in]sa
[in]dahashbuilt from sa
[in]strarbitrary string
[out]pindexindex into sa if str is in sa; -1 otherwise
Returns
0 if OK, 1 on error
Notes:
     (1) Fast lookup in dnaHash associated with a sarray, to see if a
         random string str is already stored in the hash table.
     (2) We use a strong hash function to minimize the chance that
         two different strings hash to the same key value.
     (3) We select the number of buckets to be about 5% of the size
         of the input sarray, so that when fully populated, each
         bucket (dna) will have about 20 entries, each being an index
         into sa.  In lookup, after hashing to the key, and then
         again to the bucket, we traverse the bucket (dna), using the
         index into sa to check if str has been found before.

Definition at line 563 of file sarray2.c.

References l_dnaGetCount(), l_dnaGetIValue(), l_dnaHashGetDna(), l_hashStringToUint64(), L_NOCOPY, and sarrayGetString().

Referenced by sarrayIntersectionByHash(), and sarrayRemoveDupsByHash().

◆ sarrayGenerateIntegers()

SARRAY* sarrayGenerateIntegers ( l_int32  n)

sarrayGenerateIntegers()

Parameters
[in]n
Returns
sa (of printed numbers, 1 - n, or NULL on error

Definition at line 648 of file sarray2.c.

References L_COPY, sarrayAddString(), and sarrayCreate().

Referenced by pixaCompareInPdf().

◆ sarrayIntersectionByAset()

SARRAY* sarrayIntersectionByAset ( SARRAY sa1,
SARRAY sa2 
)

sarrayIntersectionByAset()

Parameters
[in]sa1,sa2
Returns
sad with the intersection of the string set, or NULL on error
Notes:
     (1) Algorithm: put the larger sarray into a set, using the string
         hashes as the key values.  Then run through the smaller sarray,
         building an output sarray and a second set from the strings
         in the larger array: if a string is in the first set but
         not in the second, add the string to the output sarray and hash
         it into the second set.  The second set is required to make
         sure only one instance of each string is put into the output sarray.
         This is O(mlogn), {m,n} = sizes of {smaller,larger} input arrays.

Definition at line 328 of file sarray2.c.

References l_asetCreateFromSarray(), sarrayCreate(), and sarrayGetCount().

◆ sarrayIntersectionByHash()

SARRAY* sarrayIntersectionByHash ( SARRAY sa1,
SARRAY sa2 
)

sarrayIntersectionByHash()

Parameters
[in]sa1,sa2
Returns
sad intersection of the strings, or NULL on error
Notes:
     (1) This is faster than sarrayIntersectionByAset(), because the
         bucket lookup is O(n).

Definition at line 488 of file sarray2.c.

References findNextLargerPrime(), L_COPY, l_dnaHashAdd(), l_dnaHashCreate(), l_dnaHashCreateFromSarray(), l_dnaHashDestroy(), l_hashStringToUint64(), L_NOCOPY, sarrayAddString(), sarrayCreate(), sarrayFindStringByHash(), sarrayGetCount(), and sarrayGetString().

◆ sarrayLookupCSKV()

l_ok sarrayLookupCSKV ( SARRAY sa,
const char *  keystring,
char **  pvalstring 
)

sarrayLookupCSKV()

Parameters
[in]sa(of strings, each being a comma-separated pair of strings, the first being a key and the second a value)
[in]keystring(an input string to match with each key in sa
[out]pvalstring(the returned value string corresponding to the input key string, if found; otherwise NULL)
Returns
0 if OK, 1 on error
Notes:
     (1) The input sa can have other strings that are not in
         comma-separated key-value format.  These will be ignored.
     (2) This returns a copy of the first value string in sa whose
         key string matches the input keystring.
     (3) White space is not ignored; all white space before the ','
         is used for the keystring in matching.  This allows the
         key and val strings to have white space (e.g., multiple words).

Definition at line 688 of file sarray2.c.

References L_NOCOPY, sarrayCreate(), sarrayGetCount(), and sarrayGetString().

◆ sarrayRemoveDupsByAset()

SARRAY* sarrayRemoveDupsByAset ( SARRAY sas)

sarrayRemoveDupsByAset()

Parameters
[in]sas
Returns
sad with duplicates removed, or NULL on error
Notes:
     (1) This is O(nlogn), considerably slower than
         sarrayRemoveDupsByHash() for large string arrays.
     (2) The key for each string is a 64-bit hash.
     (3) Build a set, using hashed strings as keys.  As the set is
         built, first do a find; if not found, add the key to the
         set and add the string to the output sarray.

Definition at line 277 of file sarray2.c.

Referenced by sarrayUnionByAset().

◆ sarrayRemoveDupsByHash()

l_ok sarrayRemoveDupsByHash ( SARRAY sas,
SARRAY **  psad,
L_DNAHASH **  pdahash 
)

sarrayRemoveDupsByHash()

Parameters
[in]sas
[out]psadunique set of strings; duplicates removed
[out]pdahash[optional] dnahash used for lookup
Returns
0 if OK, 1 on error
Notes:
     (1) Generates a sarray with unique values.
     (2) The dnahash is built up with sad to assure uniqueness.
         It can be used to find if a string is in the set:
             sarrayFindValByHash(sad, dahash, str, &index)
     (3) The hash of the string location is simple and fast.  It scales
         up with the number of buckets to insure a fairly random
         bucket selection input strings.
     (4) This is faster than sarrayRemoveDupsByAset(), because the
         bucket lookup is O(n), although there is a double-loop
         lookup within the dna in each bucket.

Definition at line 431 of file sarray2.c.

References findNextLargerPrime(), L_COPY, l_dnaHashAdd(), l_dnaHashCreate(), l_dnaHashDestroy(), l_hashStringToUint64(), L_NOCOPY, sarrayAddString(), sarrayCreate(), sarrayFindStringByHash(), sarrayGetCount(), and sarrayGetString().

◆ sarraySort()

SARRAY* sarraySort ( SARRAY saout,
SARRAY sain,
l_int32  sortorder 
)

sarraySort()

Parameters
[in]saoutoutput sarray; can be NULL or equal to sain
[in]saininput sarray
[in]sortorderL_SORT_INCREASING or L_SORT_DECREASING
Returns
saout output sarray, sorted by ascii value, or NULL on error
Notes:
     (1) Set saout = sain for in-place; otherwise, set naout = NULL.
     (2) Shell sort, modified from K&R, 2nd edition, p.62.
         Slow but simple O(n logn) sort.

Definition at line 95 of file sarray2.c.

References Sarray::array, L_SORT_DECREASING, L_SORT_INCREASING, sarrayCopy(), sarrayGetCount(), and stringCompareLexical().

Referenced by getSortedPathnamesInDirectory().

◆ sarraySortByIndex()

SARRAY* sarraySortByIndex ( SARRAY sain,
NUMA naindex 
)

sarraySortByIndex()

Parameters
[in]sain
[in]naindexna that maps from the new sarray to the input sarray
Returns
saout sorted, or NULL on error

Definition at line 145 of file sarray2.c.

References L_COPY, L_INSERT, numaGetIValue(), sarrayAddString(), sarrayCreate(), sarrayGetCount(), and sarrayGetString().

◆ sarrayUnionByAset()

SARRAY* sarrayUnionByAset ( SARRAY sa1,
SARRAY sa2 
)

sarrayUnionByAset()

Parameters
[in]sa1,sa2
Returns
sad with the union of the string set, or NULL on error
Notes:
     (1) Duplicates are removed from the concatenation of the two arrays.
     (2) The key for each string is a 64-bit hash.
     (2) Algorithm: Concatenate the two sarrays.  Then build a set,
         using hashed strings as keys.  As the set is built, first do
         a find; if not found, add the key to the set and add the string
         to the output sarray.  This is O(nlogn).

Definition at line 237 of file sarray2.c.

References sarrayCopy(), sarrayDestroy(), sarrayJoin(), and sarrayRemoveDupsByAset().

◆ stringCompareLexical()

l_int32 stringCompareLexical ( const char *  str1,
const char *  str2 
)

stringCompareLexical()

Parameters
[in]str1
[in]str2
Returns
1 if str1 > str2 lexically; 0 otherwise
Notes:
     (1) If the lexical values are identical, return a 0, to
         indicate that no swapping is required to sort the strings.

Definition at line 185 of file sarray2.c.

Referenced by sarraySort().