4.6. Dictionary re-organization
The number of words obtained from BAMA is too large to be inserted into the Link-Parser dictionaries. The number of words in use had to be reduced to improve performance and remove storage redundancy in the system.
It was found that if several words have the same set of PoS tags across their analyses (with duplicates removed), this PoS tag set can be used as a signature for that set of words. As mentioned in section 4.2, all words having the same PoS tag are placed in a separate file. An index over these shared PoS tag sets can therefore be used to minimize the number of words in the dictionary, by assigning an ID to every set of words sharing the same set of PoS tags.
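The signature idea can be illustrated with a minimal Python sketch. The words (`w1` to `w4`) and the tag names are hypothetical placeholders rather than actual BAMA output, and `group_by_signature` is an illustrative helper, not part of the described system.

```python
from collections import defaultdict

# Hypothetical toy data: word -> PoS tags collected from all of its analyses.
# Duplicates can occur because several analyses may yield the same tag.
analyses = {
    "w1": ["NOUN", "VERB", "NOUN"],
    "w2": ["VERB", "NOUN"],
    "w3": ["ADJ"],
    "w4": ["NOUN", "VERB"],
}

def group_by_signature(analyses):
    """Group words whose de-duplicated PoS tag sets are identical."""
    groups = defaultdict(list)
    for word, tags in analyses.items():
        signature = frozenset(tags)  # order and duplicates are irrelevant
        groups[signature].append(word)
    return dict(groups)

groups = group_by_signature(analyses)
for signature, words in groups.items():
    print(sorted(signature), "->", words)
```

Here `w1`, `w2` and `w4` share one signature, so a single ID could stand in for all three of them in the dictionary.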
Algorithm 4.2 assigns IDs to compatible words (those sharing the same PoS tag set).
Algorithm 4.2:
Input: a list of files containing words
Output: a list of files containing IDs, and an index file listing the files that contain each ID.
Process:
1. Scan all given files and generate a unique list of words, Luwords.
2. Initialize a dictionary (hash table) Dic<K, V>, where the key K is a bit-vector whose length is the number of input files, with each bit designating a specific input file, and the value V is the class ID. Initially the dictionary is empty.
3. LID ← 0
4. For every chunk of words (a million words) do:
   a. Create a vector of bit-vectors Lfiles whose size is the chunk size, where BVi is the bit-vector for word i in the chunk and its length = number of input files.
   b. Scan all input files against the words in the chunk and set bit n of BVi to 1 (Lfiles[i][n] = 1) where file n contains word i.
   c. For every entry C in Lfiles do:
      i. If C is in the keys of Dic:
         1. Assign the words associated with C the ID Dic[C].
      ii. Else:
         1. Insert into Dic the key C with value LID + 1.
         2. LID ← LID + 1
         3. Assign the words associated with C the ID Dic[C].
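The steps of Algorithm 4.2 can be sketched in Python as follows. This is a minimal in-memory illustration under stated assumptions, not the original implementation: each input file is represented as a list of words, bit-vectors are stored as Python integers, and the names `assign_class_ids`, `word_to_id` and `class_ids` are hypothetical. It returns mappings instead of writing the ID files and the index file.

```python
from itertools import islice

def assign_class_ids(files, chunk_size=1_000_000):
    """Assign one class ID per distinct file-membership bit-vector.

    files: list of word lists, one per input file (in-memory stand-in).
    Returns (word_to_id, class_ids), where class_ids maps bit-vector -> ID.
    """
    file_sets = [set(f) for f in files]
    unique_words = sorted(set().union(*file_sets))  # step 1: unique word list
    class_ids = {}                                  # step 2: empty dictionary
    last_id = 0                                     # step 3: LID <- 0
    word_to_id = {}
    words = iter(unique_words)
    while chunk := list(islice(words, chunk_size)):  # step 4: chunk by chunk
        bit_vectors = [0] * len(chunk)               # 4a: one vector per word
        for n, contents in enumerate(file_sets):     # 4b: set bit n if file n
            for i, word in enumerate(chunk):         #     contains word i
                if word in contents:
                    bit_vectors[i] |= 1 << n
        for word, key in zip(chunk, bit_vectors):    # 4c: look up or create ID
            if key not in class_ids:
                last_id += 1
                class_ids[key] = last_id
            word_to_id[word] = class_ids[key]
    return word_to_id, class_ids

# Hypothetical input: three PoS-tag files holding placeholder words.
files = [
    ["w1", "w2", "w4"],        # file 0
    ["w1", "w2", "w4", "w5"],  # file 1
    ["w3"],                    # file 2
]
word_to_id, class_ids = assign_class_ids(files, chunk_size=2)
print(word_to_id)
```

Using an integer as the bit-vector keeps the dictionary key hashable and compact. The small `chunk_size` in the example only demonstrates that IDs stay consistent across chunks, because `class_ids` persists between them.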
Because the number of items being processed is finite, every word could be checked against its PoS tags through BAMA re-analysis. This test confirmed that, for all processed words, the PoS tag set did not change from the one generated by BAMA.
With this algorithm, the number of words in the Link Parser dictionary files is reduced from ~23 million to around 15,000.
These words are spread over 3033 files. Each file represents a PoS tag constructed from the PoS tags of all segments of some words in the unique word list.
Attia [59][57] reported that the number of Arabic words is too large to be generated and used by a parsing system. With this compression technique, the number of words for this highly inflectional language becomes smaller than that of English, a language that is simple with regard to its number of words and its morphology system.