Some Developer: The Word Based Parsing Technique (part-6)

4.6. Dictionary re-organization

The number of words obtained from BAMA is too large to insert into the Link Parser dictionaries. The number of words in use had to be reduced, both to improve performance and to remove redundant storage from the system.

It was found that if several words share the same set of PoS tags across their analyses (with duplicates removed), that PoS tag set can serve as a signature for the whole set of words. As mentioned in section 4.2, all words having the same PoS tag are placed in a separate file. An index over these shared PoS tag sets can then be used to minimize the number of words in the dictionary, by assigning one ID to every set of words sharing the same set of PoS tags.
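The signature idea can be illustrated with a small sketch. The words, the tags, and the helper name `group_by_signature` are made up for illustration and are not taken from BAMA or the Link Parser; the only point is that words with identical PoS tag sets collapse into one class ID.

```python
def group_by_signature(word_tags):
    """Give every distinct PoS tag set an ID and map each word to that ID."""
    signature_ids = {}  # frozenset of tags -> class ID
    word_ids = {}       # word -> class ID
    for word, tags in word_tags.items():
        # A frozenset ignores tag order and duplicates, so it works as a signature.
        signature = frozenset(tags)
        if signature not in signature_ids:
            signature_ids[signature] = len(signature_ids) + 1
        word_ids[word] = signature_ids[signature]
    return signature_ids, word_ids


# Hypothetical toy data: w1 and w2 carry the same tags in a different order.
toy_word_tags = {
    "w1": ["NOUN", "VERB"],
    "w2": ["VERB", "NOUN"],
    "w3": ["ADJ"],
}
signature_ids, word_ids = group_by_signature(toy_word_tags)
```

In this toy case `w1` and `w2` end up in the same class, so the dictionary would need a single entry for both of them.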

Algorithm 4.2 assigns IDs to compatible words (those sharing the same PoS tag set).

Algorithm 4.2:

Input: list of files containing words

Output: list of files containing IDs, and an index file that lists, for each ID, the files that contain it.

Process:

1. Scan all given files and generate a unique list of words L_uwords

2. Initialize a dictionary (hash-table) Dic<K, V>, where the key K is a bit-vector whose length is the number of input files, with each bit designating one specific input file, and the value V is the class ID. Initially the dictionary is empty.

3. LID ↤ 0

4. For every chunk of words (one million words per chunk) do:

a. Create a vector of bit-vectors L_files whose size is the chunk size, where L_files[i] = BV_i is the bit-vector for word i in the chunk and its length = # of input files.

b. Scan all input files against the words in the chunk and set BV_i[n] = 1 whenever file n contains word i.

c. For every entry C in L_files do:

i. If C is in the keys of Dic:

1. Assign the words associated with C the ID Dic[C].

ii. Else:

1. Insert into Dic the key C with value LID + 1.

2. LID ↤ LID + 1

3. Assign the words associated with C the new ID Dic[C].
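The steps of Algorithm 4.2 can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the original implementation: the input files are modeled as in-memory lists of words (one list per file), a Python integer stands in for each bit-vector, and the names `assign_class_ids` and `files_per_class` are hypothetical.

```python
from itertools import islice


def assign_class_ids(files, chunk_size=1_000_000):
    """files: one list of words per input file (assumed in-memory layout)."""
    # Step 1: unique word list, sorted so the result is deterministic.
    unique_words = sorted({word for words in files for word in words})
    class_of_vector = {}  # Step 2: Dic, bit-vector (stored as int) -> class ID
    last_id = 0           # Step 3: LID
    word_class = {}       # word -> class ID

    remaining = iter(unique_words)
    while chunk := list(islice(remaining, chunk_size)):  # Step 4
        position = {word: i for i, word in enumerate(chunk)}
        vectors = [0] * len(chunk)                       # 4a: one bit-vector per word
        for n, words in enumerate(files):                # 4b: scan every file
            for word in words:
                i = position.get(word)
                if i is not None:
                    vectors[i] |= 1 << n                 # bit n set: file n contains word i
        for word, vector in zip(chunk, vectors):         # 4c
            if vector not in class_of_vector:
                last_id += 1
                class_of_vector[vector] = last_id
            word_class[word] = class_of_vector[vector]
    return word_class, class_of_vector


def files_per_class(class_of_vector, n_files):
    """Index: class ID -> indices of the input files that contain it."""
    return {
        class_id: [n for n in range(n_files) if (vector >> n) & 1]
        for vector, class_id in class_of_vector.items()
    }


# Hypothetical toy input: three "files", processed two words at a time.
toy_files = [["w1", "w2"], ["w2", "w1", "w3"], ["w4"]]
word_class, class_of_vector = assign_class_ids(toy_files, chunk_size=2)
index = files_per_class(class_of_vector, len(toy_files))
```

Because `Dic` persists across chunks, a bit-vector first seen in one chunk keeps the same ID when it reappears in a later one, which is what lets the word list be processed a chunk at a time.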

Because the number of items being processed is finite, every word could be checked against its PoS tags by re-analyzing it with BAMA. This test confirmed that, for all processed words, the PoS tag set did not change from the one generated by BAMA.
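A check of this kind might look like the sketch below. The `reanalyze` callable stands in for a BAMA re-analysis step and is purely hypothetical, since BAMA's actual interface is not shown here; the function name and the toy data are likewise assumptions made for illustration.

```python
def find_inconsistent_words(word_class, class_tags, reanalyze):
    """Return the words whose re-analyzed tag set differs from their class's tag set."""
    mismatches = []
    for word, class_id in word_class.items():
        # Compare as sets so tag order and duplicates do not matter.
        if frozenset(reanalyze(word)) != frozenset(class_tags[class_id]):
            mismatches.append(word)
    return mismatches


# Hypothetical toy data; a dictionary lookup plays the role of the analyzer.
toy_analysis = {"w1": {"NOUN", "VERB"}, "w2": {"NOUN", "VERB"}, "w3": {"ADJ"}}
toy_word_class = {"w1": 1, "w2": 1, "w3": 2}
toy_class_tags = {1: {"NOUN", "VERB"}, 2: {"ADJ"}}
mismatches = find_inconsistent_words(
    toy_word_class, toy_class_tags, toy_analysis.__getitem__
)
```

An empty result corresponds to the outcome reported above: no word's tag set changed after the IDs were assigned.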

Using this algorithm, the number of words in the Link Parser dictionary files is reduced from ~23 million to around 15,000.

These words are spread over 3033 files. Each file represents a PoS tag constructed from the PoS tags of all segments of some word in the unique word list.

Attia [59][57] reported that the number of Arabic words is too large to be generated and used by a parsing system. With this compression technique, the number of words for this highly inflectional language becomes smaller than that of a language like English, which is simpler with regard to both its number of words and its morphology system.

Monday, February 17, 2014
