Monday, February 17, 2014

The Word Based Parsing Technique (part-7)

4.7. Preprocessing and Post-Processing

Proper analysis of a sentence should consider punctuation. Some punctuation marks affect the boundary of the sentence, while others mark semantic relationships between sentence components. At this stage it is not practical to develop a wide-scale syntax analyzer that depends on correctly punctuated text, because punctuation is not applied rigorously, in practice or in education, in most Arab-world publications [70].

As the system presented here, like most other systems that process Arabic sentences, starts from already-delimited sentences, sentence boundary detection is investigated first. Our target is to develop a system that can disambiguate the uses of punctuation that affect the sentence boundary.

The most common end-of-sentence mark is the period (.). The period conflicts with numeric expressions (with proper locale settings the decimal separator is displayed in Arabic as a comma, but its Unicode code point is still the period) and with abbreviations. Both conflicting token types must be detected before a sentence boundary is declared. Other marks also delimit sentences, such as the question mark “?” and the sentence-start mark (the colon “:”).


{0} = أيه ايه بي بى سي سى دي دى إي اى إى اي اف إف أف جي جى اتش أتش آي آى أي أى جيه  كيه ال أل إل ام أم إم إن ان أو او بيه بي كيو أر آر ار اس أس إس تي تى يو في  دبليو اكس أكس واي واى زد توداي

{1} = أ ا ب ت ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ل م ن ه و ي ى
abbrev_pattern = "(\b({0})(\s*\.\s*|\s+)(\b({0})(\s*\.\s*|\s+|(?!\w)))+)
               | (\b({1})(\s*\.\s*|\s+)(\b({1})(\s*\.\s*|\s+|(?!\w)))+)"
Listing 4.2: Regular expression for Abbreviation detection

A set of regular expressions is written to detect abbreviations, words, numeric expressions, and end-of-sentence marks. Listing 4.2 shows the regular expression for abbreviation detection. The detection priority is: abbreviations, words, numeric expressions, then end-of-sentence marks.

Numeric expressions are replaced by the special word “NUMB”, and abbreviations by the special word “ABBREV”. The DLG grammar has specific rules for these two special words. Any other punctuation left in the sentence is removed. Words are replaced by their class IDs, which are obtained from the word-minimization process discussed in the previous section.
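This pre-processing pass can be sketched in Python. This is a minimal illustration only: the patterns below are simplified stand-ins (the thesis's real abbreviation pattern is the one in Listing 4.2, and word matching uses the class-ID dictionary, not a regex).

```python
import re

# Illustrative stand-in patterns, NOT the thesis's full expressions.
NUM_RE = re.compile(r"\d+(?:[.,]\d+)*")          # numeric expressions
ABBREV_RE = re.compile(r"\b(?:[A-Za-z]\.){2,}")  # toy Latin-letter abbreviations
OTHER_PUNCT_RE = re.compile(r"[()\[\]{};,\"']")  # other punctuation to drop

def preprocess(text):
    """Replace numbers/abbreviations with placeholder tokens and record
    the original substrings so they can be restored after parsing."""
    saved = []
    def stash(match, token):
        saved.append(match.group(0))
        return token
    text = ABBREV_RE.sub(lambda m: stash(m, "ABBREV"), text)
    text = NUM_RE.sub(lambda m: stash(m, "NUMB"), text)
    text = OTHER_PUNCT_RE.sub(" ", text)  # keep end-of-sentence marks
    return text, saved

out, saved = preprocess("e.g. the price rose 3.5 percent (in 2013).")
print(out)    # placeholders stand in for 'e.g.' and the two numbers
print(saved)  # ['e.g.', '3.5', '2013']
```

The saved originals are what the post-processing step puts back after parsing.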

The text of the words and the characters of the abbreviations and numeric expressions are restored after parsing.

4.8. System Overview

Recalling the ladder of linguistic processing layers, each layer takes a designated input from the layer that precedes it and delivers its result to the layer that succeeds it. In the best case, each layer processes the given input, and its successor takes all possible analyses from it and keeps only the solutions that satisfy its own rules. A statistical system built on a similar architecture is presented by [57].

The interference between Arabic morphology and its syntax forces the Arabic language analyst to deal with both at the same time.

The system presented in this chapter merges the first two layers to form a morpho-syntactic layer as shown in figure 4.4.

Figure 4.4: A new applied ladder for Arabic linguistic processing

The system has a morpho-syntactic parser and PoS tagger in a single processing unit (layer). This processing unit searches for rules that satisfy the linking requirements assigned to PoS-tagged words. With this new approach the complexities of disambiguation are minimized, as morphological analysis and syntax analysis are done in a single step. While figure 4.3 does not mention the pre-processing step, it is noted as an important step for processing raw text by Attia [57] and is depicted in figure 4.5.

Figure 4.5: System overview

The input to the system is a stream of characters. Abbreviations, words, numeric expressions, and end-of-sentence marks are identified. Abbreviations are replaced by the special token “ABBREV”. Words are replaced by their class-IDs (a single number identifying all the possible PoS tags of a word). Numeric expressions are replaced by the special token “NUMB”. The stream of characters is split into sentences using the end-of-sentence marks. Any other punctuation (parentheses, brackets, curly braces, etc.) is removed from the sentence. This step serves as the tokenization and pre-processing step for the parser. After the parser returns its result, the words are placed back into the stream (replacing their class-IDs), and the special tokens (ABBREV and NUMB) are replaced by their corresponding characters from the original character stream. The graphs, with the input text restored by this post-processing, are returned from the system to applications or higher levels of NLP processing.
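The restoration half of this pipeline can be sketched as follows. This is a hypothetical helper, not the system's actual code: it assumes the pre-processing step recorded, per placeholder kind, a FIFO of the original substrings in their order of appearance.

```python
from collections import deque

def postprocess(parsed_tokens, saved):
    """Put the original surface forms back in place of the special
    tokens after parsing. 'saved' maps each placeholder kind to the
    list of original substrings, in order of appearance."""
    queues = {kind: deque(items) for kind, items in saved.items()}
    restored = []
    for tok in parsed_tokens:
        restored.append(queues[tok].popleft() if tok in queues else tok)
    return restored

tokens = ["ABBREV", "the", "price", "rose", "NUMB", "percent"]
saved = {"ABBREV": ["e.g."], "NUMB": ["3.5"]}
restored = postprocess(tokens, saved)
print(restored)  # ['e.g.', 'the', 'price', 'rose', '3.5', 'percent']
```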

The Word Based Parsing Technique (part-6)

4.6. Dictionary re-organization

The number of words obtained from BAMA is too large to be inserted into the Link-Parser dictionaries. The number of words in use had to be minimized to improve performance and remove space redundancy in the system.

It has been found that if a set of words shares the same set of PoS tags across their analyses (without redundancy), that PoS tag set can be used as a signature for the set. As mentioned in section 4.2, all words having the same PoS tag are placed in a separate file. An index over these shared PoS tag sets can then minimize the number of words in the dictionary, by assigning an ID to every set of words sharing the same set of PoS tags.

Algorithm 4.2 assigns IDs to compatible words (those sharing the same PoS tag set).

Algorithm 4.2:

Input: list of files containing words

Output: list of files containing IDs and an index file containing the files that contain each ID.

Process:
1. Scan all given files and generate a unique list of words Luwords.
2. Initialize a dictionary (hash table) Dic<K, V>, where the key K is a bit-vector whose length is the number of input files (every bit designates one specific input file), and the value V is a class ID. Initially the dictionary is empty.
3. LID ← 0
4. For every chunk of words (one million) do:
   a. Create a vector of bit-vectors Lfiles[BVi], whose size is the chunk size, where BVi is the bit-vector for word i in the chunk and its length is the number of input files.
   b. Scan all input files against the words in the chunk and set Lfiles[BVi[n]] = 1 where file n contains word i.
   c. For every entry C in Lfiles do:
      i. If C is in the keys of Dic:
         1. Assign the words associated with C the ID Dic[C].
      ii. Else:
         1. Insert into Dic the key C with value LID + 1, and assign this ID to the words associated with C.
         2. LID ← LID + 1
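The idea behind Algorithm 4.2 can be condensed into a few lines. This is a compact sketch only: a frozenset of tag names stands in for the bit-vector key, and the chunked file scanning is replaced by an in-memory mapping.

```python
def assign_class_ids(word_to_tags):
    """Words whose analyses share exactly the same set of PoS tags
    receive the same class ID (the signature idea of Algorithm 4.2)."""
    sig_to_id = {}   # plays the role of the dictionary Dic<K, V>
    word_to_id = {}
    next_id = 0      # plays the role of LID
    for word, tags in word_to_tags.items():
        sig = frozenset(tags)        # signature: the deduplicated tag set
        if sig not in sig_to_id:
            sig_to_id[sig] = next_id
            next_id += 1
        word_to_id[word] = sig_to_id[sig]
    return word_to_id, sig_to_id

words = {
    "w1": ["NOUN", "VERB"],
    "w2": ["VERB", "NOUN"],   # same tag set as w1 -> same class ID
    "w3": ["NOUN"],
}
ids, index = assign_class_ids(words)
print(ids)  # w1 and w2 share an ID; w3 gets its own
```

The dictionary then stores one entry per signature instead of one per word, which is what shrinks ~23 million words to thousands of class IDs.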
As the number of items being processed is finite, all words were checked against their PoS tags through BAMA re-analysis; for every word processed, the PoS tag set did not change from the one generated by BAMA.

By using this algorithm, the number of words is minimized from ~23 million to around 15,000 entries in the Link-Parser dictionary files.

These entries are spread over 3033 files, each representing a PoS tag constructed from the PoS tags of all segments of some words in the unique word list.


Attia [59][57] reported that the number of Arabic words is too large for them all to be generated and used by a parsing system. With this compression technique, the number of dictionary entries for this highly inflectional language becomes smaller than that of a morphologically simple language like English.

The Word Based Parsing Technique (part-5)

4.5. PoS tagging

PoS tagging of the Arabic sentence and word disambiguation were previously addressed indirectly, by developing language resources to feed statistical disambiguators [22][63], or directly, by rule-based or statistics-based systems [64][34][65][59][57][45][66][67][54][68][69].

For statistical disambiguation, it is important to have the minimum adequate number of PoS tags, so that a statistical algorithm can train on the limited manually disambiguated corpora. The number of unique PoS tags extracted in the word-generation phase, described earlier in this chapter, is above 4700. We selected only 3033 tags, to which the developed rules apply. Using this many PoS tags with a purely statistical model (such as Diab's [45]) would need a huge manually annotated corpus, and no such corpus is reported in the literature as of this writing.

Another problem facing statistical disambiguation tools is long-distance dependency. As reported by Diab [45], the window length used in PoS tagging in AMIRA I is 5 tokens, which fails on the following sentence:

Sentence #4.1:
* طلب المحامي من المستشارين بعد الاطلاع على المستندات مد مدة الدفاع

PoS Tagging and BP chunking output is:

* [VP طلب/VBD] [NP المحامي/NN] [PP من/IN المستشارين/NNS] [PP بعد/IN الاطلاع/NN] [PP على/IN المستندات/NNS] [PP مد/IN مدة/NN الدفاع/NN]

It can be seen that the object of the sentence (مد) is tagged (IN), which denotes a preposition, while BAMA gives no solution for this word as a preposition. The output of BAMA for this word is:

LOOK-UP WORD: مد
SOLUTION 1: (مَدَّ) [mad~-u_1] مَدّ/VERB_PERFECT+ـَ/PVSUFF_SUBJ:3MS
     (GLOSS):  + extend/stretch/spread out + he/it <verb>
 SOLUTION 2: (مَدّ) [mad~_1] مَدّ/NOUN
     (GLOSS):  + extension/lengthening/spreading +

AMIRA is trained on the Penn Arabic Treebank [22], which used BAMA as its morphological analyzer. This shows that systems such as AMIRA may produce results outside the domain of the word's analyses.

The proposed system produces a correct analysis, taking the long-distance dependency into consideration. Figure 4.2 shows one of the analyses produced by the proposed system.


Figure 4.2: Morpho-syntactic analysis and PoS tagging of Sentence #4.1 with DLG

It can be seen in Figure 4.2 that the correct object is linked to the verb with the link “On” (nominal object), and the object itself is tagged “n” (noun). All of the other words are assigned valid PoS tags as well. Table 4.5 summarizes the resulting PoS tags for this sentence:



طلب | pv-pvss-3ms | Perfect verb + perfect-verb suffix, 3rd person masc. singular
المحامي | d-n | Determiner + noun
من | prep-0 | Preposition
المستشارين | d-n-ns-md-ag | Determiner + noun + noun suffix, masc. dual, accusative/genitive form
بعد | prep-n | Preposition (can be a noun)
الاطلاع | d-n | Determiner + noun
على | prep-0 | Preposition
المستندات | d-adj-ns-fp | Determiner + adjective + noun suffix, feminine plural
مد | n | Noun
مدة | n-ns-fs | Noun + noun suffix, feminine singular
الدفاع | d-n | Determiner + noun


As depicted in Figure 4.2 and Table 4.5, the parsing system developed for DLG produces, with every solution, the PoS tagging of every given word.

The Word Based Parsing Technique (part-4)

4.4. Arabic rules modeling

Most attempts to parse the Arabic sentence used the word-segmentation approach. Previous attempts (such as those mentioned in chapter 3) addressed the problem by cutting the analysis into several sub-processes, each solving the disambiguation problem at its own level.

The attempt presented here addresses the problem by putting all the information needed to disambiguate the words' PoS tags and syntactic relationships into a form usable by the Link-Parser.

While the word-segmentation approach simplifies the analysis by carrying less information per processed unit, the word-based approach requires that all information about a word's components and its morpho-syntactic rules be provided to the parser. The parser in this case serves as both a morpho-syntactic parser and a PoS tagger.

As a starting point, all rules presented by Casbeer et al. [42] were checked and studied in detail. Casbeer added a quite noticeable number of rules for morphological modeling.

Because Arabic is highly inflectional, because the Arabic writing system concatenates to their hosts some of the words that act as functional units in Arabic grammar (such as coordinators and prepositions), and given the new formalism introduced in the previous section, the author needed to write the rules from scratch.

In the developed rules, a considerable part of Arabic grammar is modeled: coordination, prepositional phrases, quantifiers, nominal sentences, verbal sentences, verbal negation, relative clauses, nominal compound clauses, adjective phrases, adverbs, adverbial clauses, tense markers, aspectual verbs, and, most importantly for the techniques of this chapter, the multi-function words (nouns, verbs, prepositions, adverbs, etc.). The developed grammar contains 211 rules written in DLG.

The Word Based Parsing Technique (part-3)

4.3. Directed Link grammar

Link grammars are a superset of projective dependency grammars: dependency grammar takes the direction of a relationship into account, while link grammar does not. Graphs generated by link grammar are not guaranteed to be acyclic; the English grammar shipped with the Link Grammar system is reported to model some grammar cases cyclically.

Information extraction needs base-phrase boundaries and relationship modeling, which are hard to achieve without directed acyclic graphs. Removing the cycles of link grammar output and converting it to directed graphs has been done before in information extraction and described as essential, though the conversion was reported not to be error free in that context [62].

Casbeer's modeling of the Arabic preposition used a link named P for prepositional-phrase attachment. Attaching a prepositional phrase to a word that follows it is not supported by his rule set. A simple demonstration of the rules is shown in Table 4.1.

نشرب: ADV+ & P+
كثيرا: ADV-
في: P- & PO+
الصيف: PO-
Table 4.1: Sample Arabic grammar using Link grammar formalism

It can be seen that the link P connects the verb to the preposition. If the verb were modified to also accept a P link with a preceding word (P-), two prepositions might connect to each other in an incomplete sentence. But the first task of the parser is to parse, that is, to judge correctness (which, for natural languages, implies completeness [58]), and two prepositional phrases alone do not compose a complete Arabic sentence (except in answers to questions).

In that case a new link would be needed to model this type of connection. As link grammar is projective and depends on positioning information that is implicitly modeled through the order of the presented links [39], the number of needed links will keep growing as the same problem recurs with other constructs.

This study claims that using a smaller number of links in modeling natural-language syntax is easier for the grammar maintainer, and reduces the chance of modeling a grammar rule with a link in a way that contradicts other linking decisions in the grammar.

One way to achieve the minimum number of links while also modeling dependency direction in the produced graph is to add a dependency-direction marker to the link grammar formalism.

A common implementation of dependency modeling over a typed set of connectors is to annotate each connector as either “in” or “out”. A word with an “in” connector means the direction of dependency runs from the other word to this word (that is, the other word depends on this word). A word with an “out” connector means this word depends on the word it connects to through that connector. For a link to be established between two words defined in the directed-link grammar formalism, the two connectors must satisfy the restrictions of Sleator & Temperley [39], and they must carry different markers (one “in” and the other “out”).
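The matching rule just described can be sketched as a small predicate. This is an illustrative simplification: real Link Grammar connector matching also handles subscripts and multi-connectors, which are ignored here, and the function name is hypothetical.

```python
def can_link(left_connector, right_connector):
    """Two directed connectors may link only if they satisfy the usual
    Link Grammar restrictions (same link label, the left word looking
    right with '+' and the right word looking left with '-') AND carry
    opposite direction markers ('I|' vs 'O|')."""
    dl, spec_l = left_connector.split("|")
    dr, spec_r = right_connector.split("|")
    same_label = spec_l[:-1] == spec_r[:-1]
    opposite_sides = spec_l.endswith("+") and spec_r.endswith("-")
    opposite_dirs = {dl, dr} == {"I", "O"}
    return same_label and opposite_sides and opposite_dirs

print(can_link("I|P+", "O|P-"))   # True: an in/out pair on the same link
print(can_link("O|P+", "O|P-"))   # False: both marked "out"
```

This is exactly why two prepositions, both carrying only Out-marked P connectors, can never link to each other.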

A sample grammar using the modified formalism is presented in Table 4.2.
نشرب: I|ADV+ & I|P+
كثيرا: O|ADV-
في: O|P- & I|PO+
الصيف: O|PO-
Table 4.2: Sample Arabic grammar using directed-link grammar formalism

It can be seen in Table 4.2 that the connectors are now preceded by either (I|) or (O|); I and O denote “in” and “out” respectively, and the vertical bar is a special separator between the direction of the connector and its linking specification.

The sample grammar defined in Table 4.2 is easily modified to support forward prepositional attachment; Table 4.3 illustrates the modification.


نشرب: I|ADV+ & (I|P+ or I|P-)
كثيرا: O|ADV-
في: (O|P- & I|PO+) or (I|PO+ & O|P+)
الصيف: O|PO-

Table 4.3: Sample Arabic grammar supporting forward prepositional phrase attachment

The grammar in Table 4.3 uses no new links, and it will not allow two prepositional phrases to connect to each other on a P link, since the P connector of the preposition is only defined with the Out direction. Because the matching rules of the directed-link grammar formalism guarantee that no two connectors of the same direction (“out” or “in”) can be linked, the number of needed link names is kept to a minimum.

The other important point that needs to be covered is proving that the new formalism does not break or contradict the link-grammar formalism and parser, and that it does not break the context-freeness of the Link Grammar formalism.

If it can be proved that the Directed-Link Grammar is a subset of (i.e. convertible to) Link Grammar, then all of the mentioned requirements are met.

I|X+ → LX+
I|X- → RX-
O|X+ → RX+
O|X- → LX-
Table 4.4: proposed conversion from directed-link grammar to link-grammar

Table 4.4 states a proposed conversion from Directed Link Grammar connector specifications to Link-Grammar connector specifications. As the directed connectors are convertible to link-grammar-compatible forms, the only remaining concern is that the conversion preserves the semantics for which the new formalism was introduced.

The connector I|X+ connects the current word to some word that follows it. If we consider the stream of words as progressing from left to right, then this word is on the left of the link, and the compatible connector for it is O|X-, belonging to a word to the right of the current word. For the link X in this case, the head word is on the left of the link. As Table 4.4 shows, these two connectors encode this information as the prefix “L” in the corresponding Link-Grammar-compatible connectors. Thus a connector that expects the head word on the left side of the link (prefix L) is never compatible with a connector that expects the head word on the right side (prefix R), since their prefixes differ regardless of direction.

As the connector conversion does not produce erroneous connections, and the semantics of the conversion has been shown compatible with the semantics of the Directed-Link Grammar formalism, it can be claimed that Directed Link Grammar is a subset of Link Grammar.

One additional rule that makes the Directed Link Grammar formalism (DLG) dependency-compatible with other dependency formalisms is that a single word cannot have two “Out” connectors in a single disjunct of its disjunction set (the disjunctive form of its connection requirements).

A preprocessing tool has been developed to convert DLG grammars to LG. It converts the connectors and checks the rules so that no disjunct carries more than one connector marked “Out”.
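The core of such a converter, per Table 4.4 plus the single-Out check, can be sketched as follows. This is a minimal illustration, not the actual tool; the function and variable names are hypothetical.

```python
# Conversion table from Table 4.4: (direction marker, +/- sign) -> L/R prefix.
CONVERSION = {("I", "+"): "L", ("I", "-"): "R",
              ("O", "+"): "R", ("O", "-"): "L"}

def convert_disjunct(connectors):
    """Convert one DLG disjunct (list of connectors like 'I|ADV+') to
    Link-Grammar connectors, rejecting disjuncts with more than one
    'Out' connector (the dependency-compatibility rule)."""
    if sum(c.startswith("O|") for c in connectors) > 1:
        raise ValueError("a disjunct may carry at most one 'Out' connector")
    converted = []
    for c in connectors:
        direction, spec = c.split("|")
        converted.append(CONVERSION[(direction, spec[-1])] + spec)
    return converted

print(convert_disjunct(["I|ADV+", "I|P+"]))  # ['LADV+', 'LP+']
print(convert_disjunct(["O|P-", "I|PO+"]))   # ['LP-', 'LPO+']
```

Applied to Table 4.2, this yields plain Link Grammar rules that the unmodified parser can consume.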

The Word Based Parsing Technique (part-2)

4.2. Dictionary generation

The concept of tokenization for English (the language that has received the most NLP study) differs from that for Arabic. While English tokenization splits character sequences at punctuation, the Arabic tokenization concept (the field of study for nearly all statistical attempts) separates word components from each other. As illustrated in the previous chapter, tokenization adds a new dimension of errors when used at large scale. It cannot be claimed that step 8 of algorithm 3.1 has wide coverage; it may need extra additions and exceptions to reach large scale.

Words in Arabic are either “Harf” (rigid words), nouns, or verbs. Arabic morphology is template-oriented for verbs and nouns (with special cases that are not template-based, such as the verb ليس). A word in Arabic is best analyzed in its context: a noun, for example, can be an adverb in one sentence and a subject/object in another, such as the word مقاتلا in (استجلب الجيش مقاتلا من القرية ، ذهب الرجل إلى ميدان المعركة مقاتلا). Apart from a small number of special cases, a noun or a verb in Arabic originates from a root (3-5 letters carrying the abstract meaning of the word) that fills a morphological pattern (which adds applied meaning to the root) [58].

A morphological analyzer based on the template concept is presented by Attia [59] and was addressed earlier by Sakhr [60]. A different approach to modeling Arabic morphology is the catenation-based model, in which stems, prefixes, and suffixes are concatenated through rules (concatenation restrictions) to model the language's words in compact form. Buckwalter's implementation of that approach added, alongside the possible PoS tags, an English gloss. An attempt to convert the catenation-based morphological analyzer (BAMA) to the root/pattern approach was made by Smrz [61].
An ultimate solution for Arabic parsing should incorporate morphological possibilities, syntactic rules, and morpho-syntactic restrictions into a single unified disambiguation and parsing system covering all possible levels (morphological, syntactic, semantic, etc.) [59].

As the word-based parsing attempt in this study does not incorporate semantic information, there is no need to incorporate the morphological patterns or the roots. This study used Buckwalter's Arabic morphological analyzer as its source of morphological information, as it is free and has been used extensively in several attempts reported in the literature.

In this attempt, word classification is retrieved from the morphological analyzer: the analyzer's tables are traversed to obtain the words with their PoS tags. Algorithm 4.1 traverses Buckwalter's tables to get a list of words with their possible PoS tags.

Algorithm 4.1: Generating words and PoS tags from BAMA tables
Input: BAMA tables
Output: list of records, each with two fields: the word text in Buckwalter transliteration, and the PoS tag of the word

Process:

  1. Get the unique list of all stem IDs into Lstem-ids.
  2. For every stem-id entry Sid in Lstem-ids do:
    • Get the list of all stems associated with Sid in Lstem<Itext, Ipos>.
    • Get the list of all compatible prefix-ids and suffix-ids in T<Ipre-id, Isuff-id>, where T<a, b> is a list of tuples containing fields a and b.
    • For every entry S in Lstem do:
      • For every entry T in T<Ipre-id, Isuff-id> do:
        1. Get the list of prefixes with their PoS tags in Lp<Iptext, IpPos>.
        2. Get the list of suffixes with their PoS tags in Ls<Istext, IsPos>.
        3. For every entry P in Lp<Iptext, IpPos> do:
          • For every entry F in Ls<Istext, IsPos> do:
            1. Produce the word record T<text, pos>(P.Iptext | S.Itext | F.Istext, P.IpPos | S.Ipos | F.IsPos)




The output of the algorithm is a list of tuples, each containing the unvocalized word text and the PoS tag of the word. Words having the same PoS tag are copied into a file designated for that PoS tag. By constructing an index over the word text, a word can be analyzed.
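The traversal of Algorithm 4.1 can be sketched with toy in-memory stand-ins for the BAMA tables; the real tables, field names, and compatibility lists differ, so this is illustrative only.

```python
def generate_words(stems, prefixes, suffixes, compat):
    """Yield (surface_form, pos_tag) for every compatible
    prefix + stem + suffix combination, mirroring the nested
    loops of Algorithm 4.1."""
    for stem_id, stem_entries in stems.items():
        for stem_text, stem_pos in stem_entries:
            for pre_id, suf_id in compat.get(stem_id, []):
                for pre_text, pre_pos in prefixes[pre_id]:
                    for suf_text, suf_pos in suffixes[suf_id]:
                        word = pre_text + stem_text + suf_text
                        pos = "+".join(p for p in (pre_pos, stem_pos, suf_pos) if p)
                        yield word, pos

# Toy tables: one stem, an optional Al- determiner prefix, an empty suffix.
stems = {"s1": [("walad", "NOUN")]}
prefixes = {"p1": [("Al", "DET"), ("", "")]}
suffixes = {"f1": [("", "")]}
compat = {"s1": [("p1", "f1")]}
records = sorted(generate_words(stems, prefixes, suffixes, compat))
print(records)  # [('Alwalad', 'DET+NOUN'), ('walad', 'NOUN')]
```

Run over the full BAMA tables, this cross product is what produces the 23 million unique words mentioned below.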

The number of resulting unique words exceeded 23 million.

The generated dictionary was tested by analyzing every word in it with the BAMA implementation, verifying that all generated words match in PoS tags those generated by BAMA.

The Word Based Parsing Technique (part-1)

In this series, the process of creating the “word based parser” is presented along with the developed algorithms. The next section presents an overview of the Link-Parser structure, followed by the dictionaries, rule preparation, word minimization, unknown-word handling, and finally the modification of the formalism to include dependency information.

4.1. Introduction

The Link-Parsing system depends on three components: a dictionary containing all words of the language classified into PoS groups, the grammar rules, and an implementation of the parsing algorithm. Figure 4.1 depicts the relations between the three components.

Figure 4.1: Link Parsing overview

Every PoS group and/or word group is assigned linking requirements. A single undiacritized word may have several PoS tags (and thus several sets of linking requirements). The parser tries to find all possible contexts in which the linking requirements of every word are met, resulting in a set of linkages (sentence contexts).

The resulting linkages contain all possible permutations of the possible linking requirements of every word in valid contexts. As of this writing, the algorithm does not group contexts that differ only in a word's group (PoS tag); thus, the number of linkages may overflow a 32-bit integer.
In the following sections, the dictionary-creation and rule-creation processes are explained, and the new formalism is then presented.

Developments on Syntax Analysis with Word Segmentation (part-3)

3.4. Statistical Disambiguation with Token-based Link Grammar Parsing

An algorithm integrating statistical tokenization and PoS tagging with token-based syntax analysis was developed to improve morphological disambiguation. Figure 3.1 depicts the steps of the algorithm.

Figure 3.1: Algorithm 3.1 Process Flow

Algorithm 3.1: 

Input: An Arabic sentence as a list of words W = {w1, w2, w3 … wy}, where wi is the Buckwalter transliteration [4] form of the word. Numeric tokens are replaced by the special token NUMB. Any other punctuation is removed.

Output: List LL of linkages, each of which has a list of tokens T = {t1, t2, t3 … tn} and a set of links L = {(tl1, tr1, l1), (tl2, tr2, l2) … (tlm, trm, lm)}, where n = NOTOK(w1) + NOTOK(w2) + … + NOTOK(wy) and Max(m) = n-1. NOTOK(wi) denotes the number of tokens obtained from BAMA for word wi; tlx points to the left word of link x, trx points to the right word of link x, and lx is the label of the link.

Process:

  1. PoS tag the set of words in W (tokenization and feminine-ending restoration are a prerequisite for this step). The output is a list of items, each holding a token and its PoS tag. Each token is a proclitic, stem, or enclitic.
  2. Tokenize every word wi in W using BAMA. The result is a list Swi of solutions for each word wi, each solution being a list of word tokens with their morphological tags attached.
  3. For every word in W, find its corresponding tokens ATwi with their PoS tags obtained from step 1, and remove from Swi any solution containing a token that does not start at a token boundary of ATwi.
  4. Map the morphological tags of Swi to the 24 collapsed tags available in the Arabic Treebank.
  5. Assign score = 1 to every solution in Swi, and divide this score by 2 for every mapped token (from step 4) that does not match the PoS of ATwi.
  6. Sort the solutions of Swi in descending order of the score from step 5 and pick the first one (SSwi).
  7. Put the tokens of SSwi (written TSSwi below) in a character stream with space delimiters and feed it to the Link parser.
  8. Remove from the resulting LL any linkage L containing a link (tlx, trx, lx) whose endpoints point to tokens q and d such that q ∈ TSSwi, no-of-links(q) = 1, d ∉ TSSwi, and |TSSwi| > 1, provided q is neither one of the verbal negation articles (لم, لن, ما, لا) nor the affirmative article (قد). |TSSwi| denotes the number of tokens in TSSwi.


Because the underlying parsing system needs its input in token form, the input is statistically tokenized and PoS tagged as a preparation step for selecting the best tokens from the available segmentations of every word. The tokenization used in the underlying link grammar is based on BAMA tokenization, which usually produces more tokens for a given word than the statistical tokenizer does. A mapping is implemented to select, from the BAMA tokenizations, those compatible with the statistical tokenizer. The following section details the tokenization and token-selection steps.
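The scoring and selection of steps 5-6 can be sketched as follows. This is a simplified illustration: it assumes the BAMA solutions are already mapped to the collapsed tag set and aligned by position with the statistical tagger's tags, glossing over the real boundary-based alignment of step 3.

```python
def select_solution(solutions, statistical_tags):
    """solutions: list of BAMA solutions, each a list of (token, tag)
    pairs already mapped to the collapsed tag set.
    statistical_tags: the statistical tagger's tags, aligned by
    position (a simplification of the real alignment)."""
    def score(solution):
        s = 1.0
        for (_, tag), ref_tag in zip(solution, statistical_tags):
            if tag != ref_tag:
                s /= 2  # halve the score per mismatched token tag
        return s
    return max(solutions, key=score)  # best-scoring solution wins

# Two hypothetical readings of one word: preposition vs. noun.
solutions = [
    [("mad~", "IN")],   # preposition reading
    [("mad~", "NN")],   # noun reading
]
best = select_solution(solutions, ["NN"])
print(best)  # the noun reading scores 1.0 and wins
```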

3.5. Algorithm description 

The algorithm starts by sending the list of words to AMIRA for tokenization and PoS tagging. AMIRA takes its input in the Buckwalter transliteration. A sentence like the following is sent to it:
Sentence #3.1:
* ود الولد لو يطير

A generated output for sentence #3.1 is:
* [VP ود/VBD] [NP الولد/NN] [SBAR لو/IN] [VP يطير/VBP]
(The asterisk denotes the first word and the direction of reading)

In the output, correct PoS tags and base-phrase chunking are produced. This sentence does not contain any word with clitics. Another sentence that demonstrates tokenization is:

Sentence #3.2:
* البنت التي لم تقرأ الكتاب نجحت في الإمتحان

A generated output for the sentence is:
* [NP البنة/NN] [SBAR التي/WP] [PRT لم/RP] [VP تقرأ/VBP] [NP الكتاب/NN] [VP نجحت/VBD] [PP في/IN الإمتحان/NN]

We can see in the presented output that AMIRA performed feminine-ending restoration in a place that does not need it. The PoS tagging and base-phrase chunking for this sentence are correct.
The next step in the algorithm is analyzing the same input sentence using BAMA. The analyses of the two sentences are:

Sentence #3.1:
LOOK-UP WORD: ود
 SOLUTION 1: (وَدَّ) [wad~-a_1] wad~/VERB_PERFECT+a/PVSUFF_SUBJ:3MS
     (GLOSS):  + want/would like + he/it <verb>
 SOLUTION 2: (وُدّ) [wud~_1] wud~/NOUN
     (GLOSS):  + affection/friendship +
 SOLUTION 3: (وِدّ) [wud~_1] wid~/NOUN
     (GLOSS):  + affection/friendship +

LOOK-UP WORD: الولد
 SOLUTION 1: (الوَلَد) [walad_1] Al/DET+walad/NOUN
     (GLOSS): the + child/son +

LOOK-UP WORD: لو
 SOLUTION 1: (لَو) [law_1] law/CONJ
     (GLOSS):  + if +
 SOLUTION 2: (لُو) [luw_1] luw/NOUN_PROP
     (GLOSS):  + Le +

LOOK-UP WORD: يطير
 SOLUTION 1: (يَطِير) [TAr-i_1] ya/IV3MS+Tiyr/VERB_IMPERFECT
     (GLOSS): he/it + fly +
 SOLUTION 2: (يُطُّيِّر) [Tay~ar_1] yu/IV3MS+Tay~ir/VERB_IMPERFECT
     (GLOSS): he/it + make fly +
Listing 3.2: BAMA output for sentence 3.1

Sentence #3.2:
LOOK-UP WORD: البنت
 SOLUTION 1: (البِنْت) [binot_1] Al/DET+binot/NOUN
     (GLOSS): the + daughter/girl +

LOOK-UP WORD: التي
 SOLUTION 1: (الَّتِي) [Al~a*iy_1] Al~atiy/REL_PRON
     (GLOSS):  + which/who/whom [fem.sg.] +
 SOLUTION 2: (آَلَتِي) [|lap_1] |l/NOUN+atayo/NSUFF_FEM_DU_ACCGEN_POSS
     (GLOSS):  + instrument/apparatus/appliance/machine + two
 SOLUTION 3: (آَلَتَيَّ) [|lap_1] |l/NOUN+ atayo/NSUFF_FEM_DU_ACCGEN_POSS+ ya/POSS_PRON_1S
     (GLOSS):  + instrument/apparatus/appliance/machine + my two
 SOLUTION 4: (آَلَتِي) [|lap_1] |l/NOUN+ap/NSUFF_FEM_SG+iy/POSS_PRON_1S
     (GLOSS):  + instrument/apparatus/appliance/machine + my

LOOK-UP WORD: لم
 SOLUTION 1: (لَم) [lam_1] lam/NEG_PART
     (GLOSS):  + not +
 SOLUTION 2: (لِمَ) [lima_1] lima/INTERROG_PART
     (GLOSS):  + why +
 SOLUTION 3: (لَمَّ) [lam~-u_1] lam~/VERB_PERFECT+a/PVSUFF_SUBJ:3MS
     (GLOSS):  + collect/put in order + he/it <verb>
LOOK-UP WORD: تقرأ
 SOLUTION 1: (تَقْرَأ) [qara>-a_1] ta/IV3FS+qora>/VERB_IMPERFECT
     (GLOSS): it/they/she + read +
 SOLUTION 2: (تُقْرَأ) [qara>-a_1] tu/IV3FS+qora>/VERB_IMPERFECT
     (GLOSS): it/they/she + be read +
 SOLUTION 3: (تَقْرَأ) [qara>-a_1] ta/IV2MS+qora>/VERB_IMPERFECT
     (GLOSS): you [masc.sg.] + read +
 SOLUTION 4: (تُقْرَأ) [qara>-a_1] tu/IV2MS+qora>/VERB_IMPERFECT
     (GLOSS): you [masc.sg.] + be read +

LOOK-UP WORD: الكتاب
 SOLUTION 1: (الكِتَاب) [kitAb_1] Al/DET+kitAb/NOUN
     (GLOSS): the + book +
 SOLUTION 2: (الكُتَّاب) [kut~Ab_1] Al/DET+kut~Ab/NOUN
     (GLOSS): the + kuttab (village school)/Quran school +
 SOLUTION 3: (الكُتَّاب) [kAtib_1] Al/DET+kut~Ab/NOUN
     (GLOSS): the + authors/writers +

LOOK-UP WORD: نجحت
 SOLUTION 1: (نَجَحْتُ) [najaH-a_1] najaH/VERB_PERFECT+tu/PVSUFF_SUBJ:1S
     (GLOSS):  + succeed + I <verb>
 SOLUTION 2: (نَجَحْتَ) [najaH-a_1] najaH/VERB_PERFECT+ta/PVSUFF_SUBJ:2MS
     (GLOSS):  + succeed + you [masc.sg.] <verb>
 SOLUTION 3: (نَجَحْتِ) [najaH-a_1] najaH/VERB_PERFECT+ti/PVSUFF_SUBJ:2FS
     (GLOSS):  + succeed + you [fem.sg.] <verb>
 SOLUTION 4: (نَجَحَت) [najaH-a_1] najaH/VERB_PERFECT+at/PVSUFF_SUBJ:3FS
     (GLOSS):  + succeed + it/they/she <verb>
 SOLUTION 5: (نَجَّحْتُ) [naj~aH_1] naj~aH/VERB_PERFECT+tu/PVSUFF_SUBJ:1S
     (GLOSS):  + make successful + I <verb>
 SOLUTION 6: (نَجَّحْتَ) [naj~aH_1] naj~aH/VERB_PERFECT+ta/PVSUFF_SUBJ:2MS
     (GLOSS):  + make successful + you [masc.sg.] <verb>
 SOLUTION 7: (نَجَّحْتِ) [naj~aH_1] naj~aH/VERB_PERFECT+ti/PVSUFF_SUBJ:2FS
     (GLOSS):  + make successful + you [fem.sg.] <verb>
 SOLUTION 8: (نَجَّحَت) [naj~aH_1] naj~aH/VERB_PERFECT+at/PVSUFF_SUBJ:3FS
     (GLOSS):  + make successful + it/they/she <verb>

LOOK-UP WORD: في
 SOLUTION 1: (فِي) [fiy_1] fiy/PREP
     (GLOSS):  + in +
 SOLUTION 2: (فِيَّ) [fiy_1] fiy/PREP+~a/PRON_1S
     (GLOSS):  + in + me
 SOLUTION 3: (فِي) [fiy_2] Viy/ABBREV
     (GLOSS):  + V. +

LOOK-UP WORD: الإمتحان
 SOLUTION 1: (الإِمْتِحَان) [{imotiHAn_1] Al/DET+{imotiHAn/NOUN
     (GLOSS): the + test/trial/examination +
Listing 3.3: BAMA solutions for sentence 3.2

Applying step 3 of the algorithm removes the solutions that conflict with those obtained from AMIRA. For the given two sentences, nothing is removed from the solution list S.
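As a rough illustration (not the thesis implementation), step 3's conflict filtering could be sketched as follows. The tag mapping and data structures here are assumptions for the example; only the BAMA analysis strings follow the format shown in the listings above.

```python
# Illustrative coarse mapping from BAMA morphological tags to
# AMIRA-style PoS classes (a hypothetical subset, for the sketch only).
BAMA_TO_COARSE = {
    "VERB_PERFECT": "VERB",
    "VERB_IMPERFECT": "VERB",
    "NOUN": "NOUN",
    "NOUN_PROP": "NOUN",
    "CONJ": "PART",
    "NEG_PART": "PART",
    "INTERROG_PART": "PART",
    "PREP": "PREP",
    "REL_PRON": "PRON",
}

def coarse_pos(bama_analysis: str) -> set:
    """Collect the coarse PoS classes present in one BAMA solution,
    e.g. 'Al/DET+walad/NOUN' -> {'NOUN'}."""
    classes = set()
    for segment in bama_analysis.split("+"):
        tag = segment.rsplit("/", 1)[-1]
        # Strip person/gender detail such as PVSUFF_SUBJ:3MS
        tag = tag.split(":")[0]
        if tag in BAMA_TO_COARSE:
            classes.add(BAMA_TO_COARSE[tag])
    return classes

def filter_solutions(solutions, amira_tag):
    """Keep only BAMA solutions compatible with the statistical
    tagger's tag; never empty the list completely."""
    kept = [s for s in solutions if amira_tag in coarse_pos(s)]
    return kept or solutions

# Example: for 'لم' tagged as a particle, the VERB_PERFECT reading
# (lam~ 'collect/put in order') would be removed.
sols = ["lam/NEG_PART", "lima/INTERROG_PART",
        "lam~/VERB_PERFECT+a/PVSUFF_SUBJ:3MS"]
print(filter_solutions(sols, "PART"))
```

The filter returns the original list unchanged when nothing survives, mirroring the behavior above where, for sentences 3.1 and 3.2, no solution is removed at this step.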
Applying steps 4, 5 and 6 yields the following list of solutions:

LOOK-UP WORD: ود
 SOLUTION 1: (وَدَّ) [wad~-a_1] wad~/VERB_PERFECT+a/PVSUFF_SUBJ:3MS
     (GLOSS):  + want/would like + he/it <verb>

LOOK-UP WORD: الولد
 SOLUTION 1: (الوَلَد) [walad_1] Al/DET+walad/NOUN
     (GLOSS): the + child/son +

LOOK-UP WORD: لو
 SOLUTION 1: (لَو) [law_1] law/CONJ
     (GLOSS):  + if +

LOOK-UP WORD: يطير
 SOLUTION 1: (يَطِير) [TAr-i_1] ya/IV3MS+Tiyr/VERB_IMPERFECT
     (GLOSS): he/it + fly +
 SOLUTION 2: (يُطُّيِّر) [Tay~ar_1] yu/IV3MS+Tay~ir/VERB_IMPERFECT
     (GLOSS): he/it + make fly +
Listing 3.4: BAMA solutions for sentence 3.1 after applying steps 4-6 of Alg. 3.1

Sentence #3.2:
LOOK-UP WORD: البنت
 SOLUTION 1: (البِنْت) [binot_1] Al/DET+binot/NOUN
     (GLOSS): the + daughter/girl +
LOOK-UP WORD: التي
 SOLUTION 1: (الَّتِي) [Al~a*iy_1] Al~atiy/REL_PRON
     (GLOSS):  + which/who/whom [fem.sg.] +
LOOK-UP WORD: لم
 SOLUTION 1: (لَم) [lam_1] lam/NEG_PART
     (GLOSS):  + not +
 SOLUTION 2: (لِمَ) [lima_1] lima/INTERROG_PART
     (GLOSS):  + why +
LOOK-UP WORD: تقرأ
 SOLUTION 1: (تَقْرَأ) [qara>-a_1] ta/IV3FS+qora>/VERB_IMPERFECT
     (GLOSS): it/they/she + read +
 SOLUTION 2: (تُقْرَأ) [qara>-a_1] tu/IV3FS+qora>/VERB_IMPERFECT
     (GLOSS): it/they/she + be read +
 SOLUTION 3: (تَقْرَأ) [qara>-a_1] ta/IV2MS+qora>/VERB_IMPERFECT
     (GLOSS): you [masc.sg.] + read +
 SOLUTION 4: (تُقْرَأ) [qara>-a_1] tu/IV2MS+qora>/VERB_IMPERFECT
     (GLOSS): you [masc.sg.] + be read +
LOOK-UP WORD: الكتاب
 SOLUTION 1: (الكِتَاب) [kitAb_1] Al/DET+kitAb/NOUN
     (GLOSS): the + book +
 SOLUTION 2: (الكُتَّاب) [kut~Ab_1] Al/DET+kut~Ab/NOUN
     (GLOSS): the + kuttab (village school)/Quran school +
 SOLUTION 3: (الكُتَّاب) [kAtib_1] Al/DET+kut~Ab/NOUN
     (GLOSS): the + authors/writers +

LOOK-UP WORD: نجحت
 SOLUTION 1: (نَجَحْتُ) [najaH-a_1] najaH/VERB_PERFECT+tu/PVSUFF_SUBJ:1S
     (GLOSS):  + succeed + I <verb>
 SOLUTION 2: (نَجَحْتَ) [najaH-a_1] najaH/VERB_PERFECT+ta/PVSUFF_SUBJ:2MS
     (GLOSS):  + succeed + you [masc.sg.] <verb>
 SOLUTION 3: (نَجَحْتِ) [najaH-a_1] najaH/VERB_PERFECT+ti/PVSUFF_SUBJ:2FS
     (GLOSS):  + succeed + you [fem.sg.] <verb>
 SOLUTION 4: (نَجَحَت) [najaH-a_1] najaH/VERB_PERFECT+at/PVSUFF_SUBJ:3FS
     (GLOSS):  + succeed + it/they/she <verb>
 SOLUTION 5: (نَجَّحْتُ) [naj~aH_1] naj~aH/VERB_PERFECT+tu/PVSUFF_SUBJ:1S
     (GLOSS):  + make successful + I <verb>
 SOLUTION 6: (نَجَّحْتَ) [naj~aH_1] naj~aH/VERB_PERFECT+ta/PVSUFF_SUBJ:2MS
     (GLOSS):  + make successful + you [masc.sg.] <verb>
 SOLUTION 7: (نَجَّحْتِ) [naj~aH_1] naj~aH/VERB_PERFECT+ti/PVSUFF_SUBJ:2FS
     (GLOSS):  + make successful + you [fem.sg.] <verb>
 SOLUTION 8: (نَجَّحَت) [naj~aH_1] naj~aH/VERB_PERFECT+at/PVSUFF_SUBJ:3FS
     (GLOSS):  + make successful + it/they/she <verb>

LOOK-UP WORD: في
 SOLUTION 1: (فِي) [fiy_1] fiy/PREP
     (GLOSS):  + in +
 SOLUTION 2: (فِيَّ) [fiy_1] fiy/PREP+~a/PRON_1S
     (GLOSS):  + in + me

LOOK-UP WORD: الإمتحان
 SOLUTION 1: (الإِمْتِحَان) [{imotiHAn_1] Al/DET+{imotiHAn/NOUN
     (GLOSS): the + test/trial/examination +

Listing 3.5: BAMA solutions for sentence 3.2 after applying steps 4-6 of Alg. 3.1

It can be seen that steps 4 and 5 removed solutions of the relative pronoun that affect the possible tokenizations. There is now no tokenization ambiguity in the remaining solutions, and this disambiguation is due to the statistical tokenizer and PoS tagger used.

Step 7 feeds the resulting token stream (based on the scoring of the tokenizations) to the link parser developed by Casbeer et al. [42].

Sentence #3.1:
* ود ال ولد لو ي طير

Sentence #3.2:
* ال بنت التي لم ت قرأ ال كتاب نجح ت في ال إمتحان
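The token streams above can be derived mechanically from the surviving BAMA analyses by splitting each solution at its morpheme boundaries. A minimal sketch, using Buckwalter-style transliterated forms for readability (the surface-form details of the actual system are not shown here):

```python
def to_token_stream(analyses):
    """Split each BAMA-style analysis (e.g. 'Al/DET+walad/NOUN') into its
    surface segments and join them into the space-separated token stream
    the link parser consumes. Simplified: clitic/affix surface forms are
    taken verbatim from the analysis string."""
    tokens = []
    for analysis in analyses:
        for segment in analysis.split("+"):
            tokens.append(segment.split("/")[0])
    return " ".join(tokens)

# Sentence #3.1, transliterated; analogous to: ود ال ولد لو ي طير
print(to_token_stream(["wad~/VERB_PERFECT", "Al/DET+walad/NOUN",
                       "law/CONJ", "ya/IV3MS+Tiyr/VERB_IMPERFECT"]))
```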

The correct analyses produced by the system are depicted in Figure 3.2.



Figure 3.2: The selected solutions for sentences 3.1 and 3.2

Step 8 removes from the solution list those solutions that contain undesired linkages introduced by the tokenization. For example, the word (قد), which can also be considered a token, can be a function word serving as an imperfect-verb modifier (adding a meaning of possibility or doubt), and it can also be the imperative form of the verb (قاد) meaning (to lead or to ignite). If that word appears before an imperfect verb bearing the feminine third-person prefix (ت), the token (قد) and the following token (ت) produced by the tokenization may be linked together, forming the verb (قدت) meaning (I led). A sample sentence for this case is sentence #3.3.

Sentence #3.3:
* قد تتغير الأمور

The tokenized form is:
* قد ت تغير ال أمور

A possible but wrong analysis that this step eliminates is shown in Figure 3.3; the correct one is in Figure 3.4.

Figure 3.3: Wrong analysis removed by step 8 for sentence #3.3


Figure 3.4: Correct analysis for sentence #3.3
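Step 8 can be sketched with a hypothetical parse representation (the actual link-parser data structures in [42] differ): a parse is a list of links (left_token, right_token, label), and any parse containing an affixation link that joins tokens from two different original words, such as reading (قد) + (ت) as the single verb (قدت), is rejected.

```python
def crosses_word_boundary(link, word_of_token):
    """An affix link is illegal if its two tokens originate from
    different words of the untokenized sentence. 'AFFIX' is a
    hypothetical label standing in for stem/affix link types."""
    left, right, label = link
    return label == "AFFIX" and word_of_token[left] != word_of_token[right]

def filter_parses(parses, word_of_token):
    """Keep only parses with no boundary-crossing affix links."""
    return [p for p in parses
            if not any(crosses_word_boundary(l, word_of_token) for l in p)]

# Sentence #3.3 tokens: qad(0) ta(1) taGay~ar(2) Al(3) >umuwr(4)
# Original words:       qad -> 0, ta+taGay~ar -> 1, Al+>umuwr -> 2
word_of_token = {0: 0, 1: 1, 2: 1, 3: 2, 4: 2}
good = [(1, 2, "AFFIX"), (0, 2, "MODIFIES")]  # qad modifies the verb
bad = [(0, 1, "AFFIX"), (2, 4, "MODIFIES")]   # qad+ta fused as qadta
print(filter_parses([good, bad], word_of_token))
```

Only the first parse survives, which corresponds to the correct analysis of Figure 3.4; the second is the boundary-crossing reading removed in Figure 3.3.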

Casbeer et al. [42] presented a grammar for MSA on the token level. They reported that in more than 90% of cases the first solution, as produced by BAMA, is the correct one. Adding a statistical tokenizer on top of the link parser and the token-based link grammar is therefore a natural way of increasing the accuracy of the whole parsing system.

Attia [57] claimed that building rule-based systems requires extensive experience and effort. Statistical approaches, on the other hand, need training material annotated by experienced people, and they are likely to have lower coverage than rule-based systems. A combination of the two (statistical and rule-based) was claimed to be the optimal approach for achieving the best results [57].
This attempt to add a statistical tokenizer to the rule-based parser selected Casbeer's [42] modeling of Arabic grammar, as it is the only one available for Arabic. It was reported to cover a considerable part of the Arabic grammar rules, and several enhancements were made to Casbeer's grammar to increase its coverage. Recalling sentence #3.2, analyzed previously, we can see a link named Ov, which connects a verb with a preceding noun, forming SV~ predicates.

Despite the fact that this approach is promising for the disambiguation problem (as shown, it limits the number of possible tokenizations based on the context), it is bounded by the precision of the statistical tokenizer and PoS tagger. Below is a sample sentence that cannot be correctly tokenized with AMIRA:

Sentence #3.4:
* ولديه لم يطعنه

It can be seen that the first word is a noun in the dual form with a third-person possessive clitic (وَلَدَيْهِ), meaning (his two sons). In this context, any analysis that marks the first word as a non-noun is incorrect. The following result is obtained from AMIRA for sentence #3.4:

* [CONJP و/CC] [PP لدي/IN ه/PRP] [PRT لم/RP] [VP يطعن/VBP ه/PRP]

The output obtained from AMIRA shows that the tokenization that will be sent to the parser is incorrect; thus the final output cannot be correct.
Another problem encountered with the work of Casbeer et al. is that the published dictionaries have clearly limited coverage. Consider the following sentence:

Sentence #3.5:
* لم يستطع عبور الجسر

Figure 3.5: the analysis for sentence #3.5

As depicted in Figure 3.5, the verb (يستطع) is correctly tokenized but marked as an out-of-dictionary word (by the question mark, as it does not exist in the dictionaries).

A tokenization algorithm using a statistical tokenizer and PoS tagger was presented as an attempt to increase the precision. Several problems were presented that affect the coverage of the token-based link grammar used and the precision of the statistical tokenizer and PoS tagger. The next chapter presents an attempt at overcoming the following problems in a single unified solution:

  • The special-case handling needed for the output of token-based parsing, to resolve the conflicts of having two different word streams with the same token stream.
  • The overhead of re-modeling the morphology rules in link grammar.
  • The need for sufficiently large, manually (morpho-syntactically) annotated corpora.
  • Being limited by the accuracy of statistical disambiguation tools.