3.4. Statistical Disambiguation with Token-based Link Grammar Parsing
An algorithm that integrates statistical tokenization and PoS tagging with token-based syntax analysis is developed and presented to improve morphological disambiguation. Figure 3.1 depicts the steps of the algorithm.
Figure 3.1: Algorithm 3.1 Process Flow
Algorithm 3.1:
Input: An Arabic sentence in the form of a list of words W = {w1, w2, w3 … wy}, where each wi is the Buckwalter transliteration [4] of the word. Numeric tokens are replaced by the special token NUMB; any other punctuation is removed.
Output: A list LL of linkages, each of which has a list of tokens T = {t1, t2, t3 … tn} and a set of links L = {(tl1, tr1, l1), (tl2, tr2, l2) … (tlm, trm, lm)}, where n = Σ_{i=1..y} NOTOK(wi) and Max(m) = n − 1. NOTOK(wi) denotes the number of tokens obtained from BAMA for word wi; tlx points to the left token of link x, trx points to the right token of link x, and lx is the label of the link.
Process:
1. PoS tag the set of words in W (tokenization and feminine-ending restoration are a pre-requisite for this step). The output of this step is a list of items, each of which has the token and its PoS tag. Each token is a proclitic, stem, or enclitic.
2. Tokenize every word wi in W using BAMA. The result of this step is a list Swi of solutions for each word wi, each of which is a list of word tokens with their morphological tags attached.
3. For every word in W, find its corresponding tokens ATwi with the PoS tags obtained from step 1, and remove from Swi any solution that has a token that does not start at a token boundary of ATwi.
4. Map the morphological tags of Swi to the 24 collapsed tags available in the Arabic Treebank.
5. Assign score = 1 to every solution in Swi, and halve this score for every mapped token (from step 4) that does not match the PoS of ATwi.
6. Sort the solutions of Swi in descending order of the score obtained in step 5 and pick the first one (SSwi).
7. Put the tokens of SSwi (referred to below as TSSwi) in a character stream with a space delimiter and feed it to the link parser.
8. Remove from the resulting LL any linkage L with (tlx, trx, lx) ∈ L for some link x, where tlx and trx point to q and d such that q ∈ TSSwi ∧ no-of-links(q) = 1 ∧ d ∉ TSSwi ∧ |TSSwi| > 1 ∧ q ∉ {verbal negation particles (لم, لن, ما, لا)} ∧ q ∉ {affirmative particle (قد)}. Here |TSSwi| denotes the number of tokens in TSSwi.
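The scoring and selection in steps 4–6 can be sketched as follows. The data structures are assumptions for illustration only: a solution is a list of (token, mapped tag) pairs assumed to be already aligned one-to-one with the statistical (token, tag) pairs, which simplifies away the boundary mapping; the helper names `score_solution` and `select_best` are not from the thesis.

```python
# Sketch of steps 5 and 6, under assumed data structures:
# a BAMA solution is a list of (token, mapped_tag) pairs, and at_tags is the
# statistical tokenizer/tagger output as aligned (token, tag) pairs.

def score_solution(solution, at_tags):
    """Start at 1 and halve for every mapped tag that disagrees (step 5)."""
    score = 1.0
    for (_, mapped_tag), (_, at_tag) in zip(solution, at_tags):
        if mapped_tag != at_tag:
            score /= 2
    return score

def select_best(solutions, at_tags):
    """Pick the highest-scoring solution (steps 5-6 combined)."""
    return max(solutions, key=lambda s: score_solution(s, at_tags))
```

With two candidate solutions for الولد, one whose mapped tags fully match the statistical tags and one with a single mismatch, the matching solution scores 1.0, the other 0.5, and `select_best` returns the former.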
Because the underlying parsing system needs its input in token form, the input is statistically tokenized and PoS tagged as a preparation step for selecting the best tokens from the available segmentations of every word. The tokenization used in the underlying link grammar is based on BAMA tokenization, which usually produces more tokens for a given word than the statistical tokenizer does. A mapping is implemented to select from the BAMA tokenizations those that are compatible with the statistical tokenizer. The following section details the steps of the tokenization and the token selection process.
3.5. Algorithm description
The algorithm starts by sending the list of words to AMIRA for tokenization and PoS tagging. AMIRA takes its input in the Buckwalter transliteration. A sentence like the following is sent to it:
Sentence #3.1:
* ود الولد لو يطير
A generated output for sentence #3.1 is:
* [VP ود/VBD] [NP الولد/NN] [SBAR لو/IN] [VP يطير/VBP]
(The asterisk denotes the first word and the direction of reading)
In the output, the PoS tags and base-phrase chunking are correct. This sentence does not contain any word with clitics. Another sentence that presents a sample of tokenization is:
Sentence #3.2:
* البنت التي لم تقرأ الكتاب نجحت في الإمتحان
A generated output for the sentence is:
* [NP البنة/NN] [SBAR التي/WP] [PRT لم/RP] [VP تقرأ/VBP] [NP الكتاب/NN] [VP نجحت/VBD] [PP في/IN الإمتحان/NN]
We can see in the presented output that AMIRA applies feminine-ending restoration in a place that does not need it. The PoS tagging and base-phrase chunking for this sentence are correct.
The next step in the algorithm is analyzing the same input sentences using BAMA. The analyses of the two sentences are:
Sentence #3.1:
LOOK-UP WORD: ود
SOLUTION 1: (وَدَّ) [wad~-a_1] wad~/VERB_PERFECT+a/PVSUFF_SUBJ:3MS
(GLOSS): + want/would like + he/it <verb>
SOLUTION 2: (وُدّ) [wud~_1] wud~/NOUN
(GLOSS): + affection/friendship +
SOLUTION 3: (وِدّ) [wud~_1] wid~/NOUN
(GLOSS): + affection/friendship +
LOOK-UP WORD: الولد
SOLUTION 1: (الوَلَد) [walad_1] Al/DET+walad/NOUN
(GLOSS): the + child/son +
LOOK-UP WORD: لو
SOLUTION 1: (لَو) [law_1] law/CONJ
(GLOSS): + if +
SOLUTION 2: (لُو) [luw_1] luw/NOUN_PROP
(GLOSS): + Le +
LOOK-UP WORD: يطير
SOLUTION 1: (يَطِير) [TAr-i_1] ya/IV3MS+Tiyr/VERB_IMPERFECT
(GLOSS): he/it + fly +
SOLUTION 2: (يُطُّيِّر) [Tay~ar_1] yu/IV3MS+Tay~ir/VERB_IMPERFECT
(GLOSS): he/it + make fly +
Listing 3.2: BAMA output for sentence 3.1
Sentence #3.2:
LOOK-UP WORD: البنت
SOLUTION 1: (البِنْت) [binot_1] Al/DET+binot/NOUN
(GLOSS): the + daughter/girl +
LOOK-UP WORD: التي
SOLUTION 1: (الَّتِي) [Al~a*iy_1] Al~atiy/REL_PRON
(GLOSS): + which/who/whom [fem.sg.] +
SOLUTION 2: (آَلَتِي) [|lap_1] |l/NOUN+atayo/NSUFF_FEM_DU_ACCGEN_POSS
(GLOSS): + instrument/apparatus/appliance/machine + two
SOLUTION 3: (آَلَتَيَّ) [|lap_1] |l/NOUN+ atayo/NSUFF_FEM_DU_ACCGEN_POSS+ ya/POSS_PRON_1S
(GLOSS): + instrument/apparatus/appliance/machine + my two
SOLUTION 4: (آَلَتِي) [|lap_1] |l/NOUN+ap/NSUFF_FEM_SG+iy/POSS_PRON_1S
(GLOSS): + instrument/apparatus/appliance/machine + my
LOOK-UP WORD: لم
SOLUTION 1: (لَم) [lam_1] lam/NEG_PART
(GLOSS): + not +
SOLUTION 2: (لِمَ) [lima_1] lima/INTERROG_PART
(GLOSS): + why +
SOLUTION 3: (لَمَّ) [lam~-u_1] lam~/VERB_PERFECT+a/PVSUFF_SUBJ:3MS
(GLOSS): + collect/put in order + he/it <verb>
LOOK-UP WORD: تقرأ
SOLUTION 1: (تَقْرَأ) [qara>-a_1] ta/IV3FS+qora>/VERB_IMPERFECT
(GLOSS): it/they/she + read +
SOLUTION 2: (تُقْرَأ) [qara>-a_1] tu/IV3FS+qora>/VERB_IMPERFECT
(GLOSS): it/they/she + be read +
SOLUTION 3: (تَقْرَأ) [qara>-a_1] ta/IV2MS+qora>/VERB_IMPERFECT
(GLOSS): you [masc.sg.] + read +
SOLUTION 4: (تُقْرَأ) [qara>-a_1] tu/IV2MS+qora>/VERB_IMPERFECT
(GLOSS): you [masc.sg.] + be read +
LOOK-UP WORD: الكتاب
SOLUTION 1: (الكِتَاب) [kitAb_1] Al/DET+kitAb/NOUN
(GLOSS): the + book +
SOLUTION 2: (الكُتَّاب) [kut~Ab_1] Al/DET+kut~Ab/NOUN
(GLOSS): the + kuttab (village school)/Quran school +
SOLUTION 3: (الكُتَّاب) [kAtib_1] Al/DET+kut~Ab/NOUN
(GLOSS): the + authors/writers +
LOOK-UP WORD: نجحت
SOLUTION 1: (نَجَحْتُ) [najaH-a_1] najaH/VERB_PERFECT+tu/PVSUFF_SUBJ:1S
(GLOSS): + succeed + I <verb>
SOLUTION 2: (نَجَحْتَ) [najaH-a_1] najaH/VERB_PERFECT+ta/PVSUFF_SUBJ:2MS
(GLOSS): + succeed + you [masc.sg.] <verb>
SOLUTION 3: (نَجَحْتِ) [najaH-a_1] najaH/VERB_PERFECT+ti/PVSUFF_SUBJ:2FS
(GLOSS): + succeed + you [fem.sg.] <verb>
SOLUTION 4: (نَجَحَت) [najaH-a_1] najaH/VERB_PERFECT+at/PVSUFF_SUBJ:3FS
(GLOSS): + succeed + it/they/she <verb>
SOLUTION 5: (نَجَّحْتُ) [naj~aH_1] naj~aH/VERB_PERFECT+tu/PVSUFF_SUBJ:1S
(GLOSS): + make successful + I <verb>
SOLUTION 6: (نَجَّحْتَ) [naj~aH_1] naj~aH/VERB_PERFECT+ta/PVSUFF_SUBJ:2MS
(GLOSS): + make successful + you [masc.sg.] <verb>
SOLUTION 7: (نَجَّحْتِ) [naj~aH_1] naj~aH/VERB_PERFECT+ti/PVSUFF_SUBJ:2FS
(GLOSS): + make successful + you [fem.sg.] <verb>
SOLUTION 8: (نَجَّحَت) [naj~aH_1] naj~aH/VERB_PERFECT+at/PVSUFF_SUBJ:3FS
(GLOSS): + make successful + it/they/she <verb>
LOOK-UP WORD: في
SOLUTION 1: (فِي) [fiy_1] fiy/PREP
(GLOSS): + in +
SOLUTION 2: (فِيَّ) [fiy_1] fiy/PREP+~a/PRON_1S
(GLOSS): + in + me
SOLUTION 3: (فِي) [fiy_2] Viy/ABBREV
(GLOSS): + V. +
LOOK-UP WORD: الإمتحان
SOLUTION 1: (الإِمْتِحَان) [{imotiHAn_1] Al/DET+{imotiHAn/NOUN
(GLOSS): the + test/trial/examination +
Listing 3.3: BAMA solutions for sentence 3.2
Applying step 3 of the algorithm to this result removes the solutions that conflict with those obtained from AMIRA. For the given two sentences, nothing is removed from the solution lists Swi.
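The boundary-compatibility test of step 3 can be sketched by comparing character offsets. One reading consistent with the worked examples, where a single AMIRA token such as البنت spans two BAMA tokens (Al + binot), is that every boundary of the coarser statistical tokenization must also be a boundary of the BAMA solution. The helper names below are illustrative assumptions.

```python
# Sketch of the step-3 boundary filter: a BAMA solution is kept only when
# every boundary of the (coarser) statistical tokenization coincides with a
# boundary of the solution. Helper names are assumptions, not thesis code.

def boundaries(tokens):
    """Character offsets at which each token starts inside the word."""
    offs, pos = set(), 0
    for t in tokens:
        offs.add(pos)
        pos += len(t)
    return offs

def compatible(solution_tokens, at_tokens):
    """True when the statistical boundaries are a subset of the solution's."""
    return boundaries(at_tokens) <= boundaries(solution_tokens)
```

For example, the BAMA solution ["Al", "binot"] is compatible with the single AMIRA token ["Albinot"], but not with a hypothetical statistical split ["Alb", "inot"], whose boundary at offset 3 the solution does not share.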
Applying steps 4, 5 and 6 yields the following list of solutions:
LOOK-UP WORD: ود
SOLUTION 1: (وَدَّ) [wad~-a_1] wad~/VERB_PERFECT+a/PVSUFF_SUBJ:3MS
(GLOSS): + want/would like + he/it <verb>
LOOK-UP WORD: الولد
SOLUTION 1: (الوَلَد) [walad_1] Al/DET+walad/NOUN
(GLOSS): the + child/son +
LOOK-UP WORD: لو
SOLUTION 1: (لَو) [law_1] law/CONJ
(GLOSS): + if +
LOOK-UP WORD: يطير
SOLUTION 1: (يَطِير) [TAr-i_1] ya/IV3MS+Tiyr/VERB_IMPERFECT
(GLOSS): he/it + fly +
SOLUTION 2: (يُطُّيِّر) [Tay~ar_1] yu/IV3MS+Tay~ir/VERB_IMPERFECT
(GLOSS): he/it + make fly +
Listing 3.4: BAMA solutions for sentence 3.1 after applying steps 4-6 of Alg. 3.1
Sentence #3.2:
LOOK-UP WORD: البنت
SOLUTION 1: (البِنْت) [binot_1] Al/DET+binot/NOUN
(GLOSS): the + daughter/girl +
LOOK-UP WORD: التي
SOLUTION 1: (الَّتِي) [Al~a*iy_1] Al~atiy/REL_PRON
(GLOSS): + which/who/whom [fem.sg.] +
LOOK-UP WORD: لم
SOLUTION 1: (لَم) [lam_1] lam/NEG_PART
(GLOSS): + not +
SOLUTION 2: (لِمَ) [lima_1] lima/INTERROG_PART
(GLOSS): + why +
LOOK-UP WORD: تقرأ
SOLUTION 1: (تَقْرَأ) [qara>-a_1] ta/IV3FS+qora>/VERB_IMPERFECT
(GLOSS): it/they/she + read +
SOLUTION 2: (تُقْرَأ) [qara>-a_1] tu/IV3FS+qora>/VERB_IMPERFECT
(GLOSS): it/they/she + be read +
SOLUTION 3: (تَقْرَأ) [qara>-a_1] ta/IV2MS+qora>/VERB_IMPERFECT
(GLOSS): you [masc.sg.] + read +
SOLUTION 4: (تُقْرَأ) [qara>-a_1] tu/IV2MS+qora>/VERB_IMPERFECT
(GLOSS): you [masc.sg.] + be read +
LOOK-UP WORD: الكتاب
SOLUTION 1: (الكِتَاب) [kitAb_1] Al/DET+kitAb/NOUN
(GLOSS): the + book +
SOLUTION 2: (الكُتَّاب) [kut~Ab_1] Al/DET+kut~Ab/NOUN
(GLOSS): the + kuttab (village school)/Quran school +
SOLUTION 3: (الكُتَّاب) [kAtib_1] Al/DET+kut~Ab/NOUN
(GLOSS): the + authors/writers +
LOOK-UP WORD: نجحت
SOLUTION 1: (نَجَحْتُ) [najaH-a_1] najaH/VERB_PERFECT+tu/PVSUFF_SUBJ:1S
(GLOSS): + succeed + I <verb>
SOLUTION 2: (نَجَحْتَ) [najaH-a_1] najaH/VERB_PERFECT+ta/PVSUFF_SUBJ:2MS
(GLOSS): + succeed + you [masc.sg.] <verb>
SOLUTION 3: (نَجَحْتِ) [najaH-a_1] najaH/VERB_PERFECT+ti/PVSUFF_SUBJ:2FS
(GLOSS): + succeed + you [fem.sg.] <verb>
SOLUTION 4: (نَجَحَت) [najaH-a_1] najaH/VERB_PERFECT+at/PVSUFF_SUBJ:3FS
(GLOSS): + succeed + it/they/she <verb>
SOLUTION 5: (نَجَّحْتُ) [naj~aH_1] naj~aH/VERB_PERFECT+tu/PVSUFF_SUBJ:1S
(GLOSS): + make successful + I <verb>
SOLUTION 6: (نَجَّحْتَ) [naj~aH_1] naj~aH/VERB_PERFECT+ta/PVSUFF_SUBJ:2MS
(GLOSS): + make successful + you [masc.sg.] <verb>
SOLUTION 7: (نَجَّحْتِ) [naj~aH_1] naj~aH/VERB_PERFECT+ti/PVSUFF_SUBJ:2FS
(GLOSS): + make successful + you [fem.sg.] <verb>
SOLUTION 8: (نَجَّحَت) [naj~aH_1] naj~aH/VERB_PERFECT+at/PVSUFF_SUBJ:3FS
(GLOSS): + make successful + it/they/she <verb>
LOOK-UP WORD: في
SOLUTION 1: (فِي) [fiy_1] fiy/PREP
(GLOSS): + in +
SOLUTION 2: (فِيَّ) [fiy_1] fiy/PREP+~a/PRON_1S
(GLOSS): + in + me
LOOK-UP WORD: الإمتحان
SOLUTION 1: (الإِمْتِحَان) [{imotiHAn_1] Al/DET+{imotiHAn/NOUN
(GLOSS): the + test/trial/examination +
Listing 3.5: BAMA solutions for sentence 3.2 after applying steps 4-6 of Alg. 3.1
It can be seen that steps 4 and 5 removed solutions of the relative pronoun that affect the possible tokenizations. No tokenization ambiguity remains in the surviving solutions, and this disambiguation is due to the statistical tokenizer and PoS tagger used.
Step 7 feeds the resulting token stream (based on the scoring of the tokenizations) to the link parser developed by Casbeer et al. [42]:
Sentence #3.1:
* ود ال ولد لو ي طير
Sentence #3.2:
* ال بنت التي لم ت قرأ ال كتاب نجح ت في ال إمتحان
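The construction of these streams in step 7 can be sketched as a simple join: the tokens of every word's best-scoring solution are concatenated with single spaces. The function name and the input shape (a list of per-word token lists, here shown in Buckwalter transliteration) are assumptions for illustration.

```python
# Sketch of step 7: flatten the selected per-word token lists into one
# space-delimited stream for the link parser.

def token_stream(best_solutions):
    """best_solutions: list of per-word token lists (selected solutions)."""
    return " ".join(tok for word_tokens in best_solutions for tok in word_tokens)
```

For sentence #3.1 the selected solutions [["wad~"], ["Al", "walad"], ["law"], ["ya", "Tiyr"]] yield the stream "wad~ Al walad law ya Tiyr", matching the six tokens shown above.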
The correct analysis produced by the system is depicted in Figure 3.2.
Figure 3.2: The selected solutions for sentences 3.1 and 3.2
Step 8 removes from the solutions list those that contain undesired linkages introduced by the tokenization. For example, the word (which can also be considered a token) (قد) can be a function word serving as an imperfect-verb modifier (adding the meaning of possibility or doubt), and it can also be the imperative form of the verb (قاد), which may mean (to lead or to ignite). If this word appears before an imperfect verb that has the third-person feminine prefix (ت), the word (قد) and that prefix (after tokenization) may be connected together, forming the verb (قدت), meaning (I led). A sample sentence for this case is presented in sentence #3.3.
Sentence #3.3:
* قد تتغير الأمور
The tokenized form is:
* قد ت تغير ال أمور
A possible but wrong analysis that this step eliminates is shown in Figure 3.3, and the correct one in Figure 3.4.
Figure 3.3: Wrong analysis removed by step 8 for sentence #3.3
Figure 3.4: Correct analysis for sentence #3.3
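The step-8 filter can be sketched as a predicate over linkages. The data structures are assumptions for illustration: a linkage is a set of (left, right, label) triples, `word_tokens` maps each token to the token set of its source word (TSSwi), `link_count` maps each token to its number of links, and the unvocalized Buckwalter spellings of the exempt particles are this sketch's guess.

```python
# Hedged sketch of step 8: drop a linkage when some link connects a token q
# to a token d outside q's word, while q carries exactly one link, q's word
# has more than one token, and q is not an exempt particle.

EXEMPT = {"lm", "ln", "mA", "lA", "qd"}  # negation particles + affirmative qd (assumed Buckwalter)

def keep_linkage(linkage, word_tokens, link_count):
    """linkage: set of (left, right, label) triples."""
    for q, d, _ in linkage:
        same_word = word_tokens[q]
        if (len(same_word) > 1 and link_count[q] == 1
                and d not in same_word and q not in EXEMPT):
            return False  # token linked only outside its own word: reject
    return True
```

For instance, with the word الولد tokenized as {Al, walad}, a linkage whose only link on Al goes to a token of another word is rejected, while a linkage where walad also links inside or carries more links survives.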
Casbeer et al. [42] presented a grammar for MSA at the token level. They reported that in more than 90% of the cases, the first solution produced by BAMA is the correct one. Adding a statistical tokenizer on top of the link parser and the token-based link grammar is a natural way to increase the accuracy of the whole parsing system.
Attia [57] claimed that building rule-based systems requires a lot of experience and work. Statistical approaches, on the other hand, need training material annotated by experienced people. Statistical approaches are also likely to have lower coverage than rule-based systems. A combination of the two (statistical and rule-based) was claimed to be the optimal approach for achieving the best results [57].
This attempt to add a statistical tokenizer to a rule-based parser selected Casbeer's [42] modeling of Arabic grammar, as it is the only one available for Arabic. It was reported to cover a considerable part of the Arabic grammar rules. Several enhancements were made to Casbeer's grammar to increase its coverage. Recalling sentence #3.2 analyzed previously, we can see a link named Ov, which connects a verb with a preceding noun, forming SV~ predicates.
Although this approach is promising for the disambiguation problem (as presented, it limits the number of possible tokenizations based on the context), it is bounded by the precision of the statistical tokenizer and PoS tagger. Below is a sample sentence that cannot be correctly tokenized with AMIRA:
Sentence #3.4:
* ولديه لم يطعنه
It can be seen that the first word is a dual-form noun with a third-person possessive clitic (وَلَدَيْهِ), meaning (his two sons). In this context, any analysis that marks the first word as a non-noun is incorrect. The following result is obtained from AMIRA when it is fed sentence #3.4:
* [CONJP و/CC] [PP لدي/IN ه/PRP] [PRT لم/RP] [VP يطعن/VBP ه/PRP]
The output obtained from AMIRA shows that the tokenization that will be sent to the parser is incorrect, so the parser output cannot be correct.
Another problem encountered with the work of Casbeer et al. is that the published dictionaries have obviously limited coverage. Consider the following sentence:
Sentence #3.5:
* لم يستطع عبور الجسر
Figure 3.5: The analysis for sentence #3.5
As depicted in Figure 3.5, the verb (يستطع) is correctly tokenized but is marked as an out-of-dictionary word (by the question mark, as it does not exist in the dictionaries).
A tokenization algorithm using a statistical tokenizer and PoS tagger was presented as an attempt to increase the precision. Several of the problems presented affect the coverage of the token-based rule grammar and the precision of the statistical tokenizer and PoS tagger. The next chapter presents an attempt to overcome the following problems in a single unified solution:
- The special-case handling needed for the output of token-based parsing, to overcome the conflicts of two different word streams sharing the same token stream.
- The overhead of re-modeling the morphology rules in link grammar.
- The need for manually annotated (morpho-syntactically) large enough corpora.
- Being limited by the accuracy limits of statistical disambiguation tools.