Monday, February 17, 2014

The Word Based Parsing Technique (part-2)

4.2. Dictionary generation

The concept of tokenization for English (the language that has received the most NLP study) differs from that of Arabic. English tokenization groups sets of characters separated by punctuation, whereas Arabic tokenization (the focus of nearly all statistical attempts) separates the components of a word from each other. As illustrated in the previous chapter, tokenization adds a new source of errors when used on a large scale. Step 8 of Algorithm 3.1 cannot be claimed to have wide coverage; it may need further additions and exceptions to work at large scale.

Words in Arabic are either “Harf” (rigid words), nouns, or verbs. Arabic morphology is template-oriented for verbs and nouns (with special cases that are not template-based, such as the verb ليس). A word in Arabic is best analyzed in its context. A noun, for example, can be an adverb in one sentence and a subject or object in another, as with the word مقاتلا in (استجلب الجيش مقاتلا من القرية ، ذهب الرجل إلى ميدان المعركة مقاتلا). Apart from a small number of special cases, a noun or verb in Arabic derives from a root (3-5 letters, carrying the abstract meaning of the word) that fills a morphological pattern (which adds the applied meaning to the root) [58].

A morphological analyzer based on the template concept is presented by Attia [59], and was addressed previously by Sakhr [60]. A different approach to modeling Arabic morphology is the concatenation-based model, in which stems, prefixes, and suffixes are concatenated through rules (concatenation restrictions) to model the words of the language in compact form. Buckwalter’s implementation of that approach adds an English gloss alongside each possible PoS tag. Smrz [61] attempted to convert the concatenation-based morphological analyzer (BAMA) to a root/pattern approach.
An ultimate solution for Arabic parsing should incorporate morphological possibilities, syntactic rules, and morpho-syntactic restrictions into a single unified disambiguation and parsing system on all possible levels (morphological, syntactic, semantic, etc.) [59].

The word-based parsing attempt in this study does not incorporate semantic information, so there is no need to incorporate morphological patterns or roots. This study used Buckwalter’s Arabic Morphological Analyzer (BAMA) as its source of morphological information, as it is free and has been used extensively in attempts reported in the literature.

In this attempt, word classification is retrieved from the morphological analyzer. Algorithm 4.1 traverses Buckwalter’s tables to obtain a list of words with their possible PoS tags.

Algorithm 4.1: Generating words and PoS tags from BAMA tables
Input: BAMA tables
Output: a list of records, each with two fields: the word text in Buckwalter’s transliteration and the PoS tag of the word

Process:

  1. Get the unique list of all stem IDs into Lstem-ids
  2. For every stem-id entry Sid in Lstem-ids, do:
    • Get the list of all stems associated with Sid into Lstem<Itext, Ipos>
    • Get the list of all compatible prefix-ids and suffix-ids into T<Ipre-id, Isuff-id>, where T<a, b> is a list of tuples containing fields a and b.
    • For every entry S in Lstem, do:
      • For every entry T in T<Ipre-id, Isuff-id>, do:
        1. Get the list of prefixes for T.Ipre-id with their PoS tags into Lp<Iptext, IpPos>
        2. Get the list of suffixes for T.Isuff-id with their PoS tags into Ls<Istext, IsPos>
        3. For every entry P in Lp<Iptext, IpPos>, do:
          • For every entry F in Ls<Istext, IsPos>, do:
            1. Produce a word record T<text, pos>(P.Iptext | S.Itext | F.Istext, P.IpPos | S.Ipos | F.IsPos)
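
The nested loops of Algorithm 4.1 can be sketched in Python as follows. This is only a minimal illustration over toy in-memory tables, not the actual BAMA file layout: the dictionary names (`stems_by_id`, `compatible`, `prefixes`, `suffixes`), the IDs, the sample entries, and the use of "+" to join the PoS parts are all assumptions made for the example.

```python
from itertools import product


def generate_words(stems_by_id, compatible, prefixes, suffixes):
    """Yield (word_text, pos_tag) records, following the loop order of Algorithm 4.1.

    All four arguments are hypothetical in-memory stand-ins for the BAMA tables:
      stems_by_id: stem_id -> list of (stem_text, stem_pos)
      compatible:  stem_id -> list of (prefix_id, suffix_id) pairs
      prefixes:    prefix_id -> list of (prefix_text, prefix_pos)
      suffixes:    suffix_id -> list of (suffix_text, suffix_pos)
    """
    for stem_id in stems_by_id:                          # step 2
        stems = stems_by_id[stem_id]                     # Lstem
        pairs = compatible.get(stem_id, [])              # T<Ipre-id, Isuff-id>
        for stem_text, stem_pos in stems:                # every S
            for prefix_id, suffix_id in pairs:           # every T
                lp = prefixes.get(prefix_id, [])         # Lp
                ls = suffixes.get(suffix_id, [])         # Ls
                for (p_text, p_pos), (f_text, f_pos) in product(lp, ls):
                    text = p_text + stem_text + f_text
                    # Assumption: PoS parts are joined with "+", empty parts skipped.
                    pos = "+".join(t for t in (p_pos, stem_pos, f_pos) if t)
                    yield text, pos


# Toy data for illustration only; it is not taken from the BAMA tables.
stems_by_id = {"N1": [("ktAb", "NOUN")]}
compatible = {"N1": [("P0", "S0"), ("P1", "S0"), ("P1", "S1")]}
prefixes = {"P0": [("", "")], "P1": [("Al", "DET")]}
suffixes = {"S0": [("", "")], "S1": [("An", "NSUFF_MASC_DU_NOM")]}

records = list(generate_words(stems_by_id, compatible, prefixes, suffixes))
for text, pos in records:
    print(text, pos)
```

Writing the function as a generator keeps memory use flat, which matters when the number of produced records runs into the millions.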




The output of the algorithm is a list of tuples, each containing the unvocalized word text and the PoS tag of the word. Words having the same PoS tag are copied into a designated file for that PoS tag. An index built on the word text then allows any word to be looked up and analyzed.
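
As a rough illustration of the grouping and indexing just described, the sketch below keeps everything in memory instead of writing one file per PoS tag. The function names `group_by_pos` and `build_index` and the sample records are hypothetical, chosen only for this example.

```python
from collections import defaultdict


def group_by_pos(records):
    """Map each PoS tag to the sorted list of words carrying it."""
    groups = defaultdict(set)
    for text, pos in records:
        groups[pos].add(text)
    return {pos: sorted(words) for pos, words in groups.items()}


def build_index(records):
    """Map each word text to the sorted list of its possible PoS tags."""
    index = defaultdict(set)
    for text, pos in records:
        index[text].add(pos)
    return {text: sorted(tags) for text, tags in index.items()}


# Illustrative records only; a real run would use the output of Algorithm 4.1.
sample_records = [
    ("ktAb", "NOUN"),
    ("AlktAb", "DET+NOUN"),
    ("ktb", "VERB_PERFECT"),
    ("ktb", "NOUN"),
]

by_pos = group_by_pos(sample_records)
index = build_index(sample_records)

# Analyzing a word is then a dictionary lookup; unknown words give an empty list.
print(index.get("ktb", []))
print(index.get("xyz", []))
```

In a full-scale run, each entry of `by_pos` would correspond to one designated file, and the index would more likely live on disk than in a Python dictionary.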

The number of resulting unique words exceeded 23 million.

The generated dictionary was tested by analyzing every word in it with the BAMA implementation; the test confirmed that the PoS tags of all generated words match those produced by BAMA.
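
A check of this kind could look roughly like the sketch below. The `analyze` argument stands in for a call into a BAMA implementation and is purely hypothetical here, as are the toy data; treating "match" as set equality of the tags is also an assumption.

```python
def find_mismatches(index, analyze):
    """Return the words whose dictionary tags differ from the analyzer's tags.

    index:   word text -> iterable of PoS tags from the generated dictionary
    analyze: hypothetical callable returning the PoS tags an analyzer gives a word
    """
    mismatches = {}
    for word, tags in index.items():
        expected = set(analyze(word))
        if set(tags) != expected:
            mismatches[word] = (set(tags), expected)
    return mismatches


# Toy stand-in for the reference analyzer (illustration only).
reference = {"ktAb": ["NOUN"], "AlktAb": ["DET+NOUN"]}


def toy_analyze(word):
    return reference.get(word, [])


good_index = {"ktAb": ["NOUN"], "AlktAb": ["DET+NOUN"]}
bad_index = {"ktAb": ["NOUN"], "AlktAb": ["NOUN"]}

print(find_mismatches(good_index, toy_analyze))
print(find_mismatches(bad_index, toy_analyze))
```

An empty result corresponds to the outcome reported above, where every generated word agreed with the analyzer.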
