4.5. PoS tagging
PoS tagging of Arabic sentences, together with word disambiguation, has previously been addressed either indirectly, by developing language resources that serve as input to statistical disambiguators [22] [63], or directly, by rule-based or statistical systems [64] [34] [65] [59] [57] [45] [66] [67] [54] [68] [69].
For statistical disambiguation, it is important to keep the number of PoS tags to the minimum that is still adequate, so that a statistical algorithm can be trained on the limited manually disambiguated corpora available. The number of unique PoS tags extracted in the word generation phase, described earlier in this chapter, is above 4700. We selected only 3033 of them, to which the developed rules are applied. Using this many PoS tags with a purely statistical model (such as Diab’s [45]) would require a huge manually annotated corpus, and no such corpus has been reported in the literature as of this writing.
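As a rough illustration of why such a large tagset strains a limited corpus, the sketch below counts how many distinct tag bigrams and trigrams 3033 tags admit. The n-gram framing is an assumption made only for this example; it is not a description of how any tagger cited here is implemented, and the function name is hypothetical.

```python
# Rough illustration only: how many distinct tag n-grams a tagset admits.
# The n-gram view is an assumption for this sketch, not the design of any
# tagger cited in the text.

def possible_tag_ngrams(tagset_size: int, n: int) -> int:
    """Number of distinct length-n tag sequences over a tagset."""
    return tagset_size ** n

SELECTED_TAGS = 3033  # tags to which the rules are applied (from the text)

bigrams = possible_tag_ngrams(SELECTED_TAGS, 2)
trigrams = possible_tag_ngrams(SELECTED_TAGS, 3)

print(f"possible tag bigrams:  {bigrams:,}")   # about 9.2 million
print(f"possible tag trigrams: {trigrams:,}")  # about 2.8e10
```

Even if only a small fraction of these sequences is linguistically possible, a manually annotated corpus would have to be very large to observe them often enough for reliable estimates.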
Another problem facing statistical disambiguation tools is long-distance dependency. As reported by Diab [45], the window length used in the PoS tagging of AMIRA I is 5 tokens, which fails on the following sentence:
Sentence #4.1:
* طلب المحامي من المستشارين بعد الاطلاع على المستندات مد مدة الدفاع
The PoS tagging and BP chunking output is:
* [VP طلب/VBD] [NP المحامي/NN] [PP من/IN المستشارين/NNS] [PP بعد/IN الاطلاع/NN] [PP على/IN المستندات/NNS] [PP مد/IN مدة/NN الدفاع/NN]
It can be seen that the object of the sentence (مد) is tagged (IN), which denotes a preposition, although BAMA offers no solution that analyzes this word as a preposition. The output of BAMA for this word is:
LOOK-UP WORD: مد
SOLUTION 1: (مَدَّ)
[mad~-u_1] مَدّ/VERB_PERFECT+ـَ/PVSUFF_SUBJ:3MS
(GLOSS):
+ extend/stretch/spread out + he/it <verb>
SOLUTION 2: (مَدّ)
[mad~_1] مَدّ/NOUN
(GLOSS):
+ extension/lengthening/spreading +
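The mismatch between the tagger's output and the analyzer's solutions can be checked mechanically. The following is a minimal sketch, assuming a small hand-written mapping from Penn-style tags to analyzer categories. The mapping, the category label `PREP`, and the function name are illustrative assumptions, not part of AMIRA or BAMA.

```python
# Hypothetical check: is a tagger's tag compatible with any analysis that
# the morphological analyzer offers for the word?

# Categories taken from the two BAMA solutions listed above for the word.
ANALYZER_CATEGORIES = {"VERB_PERFECT", "NOUN"}

# Assumed, hand-written mapping from Penn-style tags to analyzer categories.
# It covers only the tags needed here and is not an official mapping.
COMPATIBLE = {
    "VBD": {"VERB_PERFECT"},
    "NN": {"NOUN"},
    "IN": {"PREP"},
}

def is_in_analysis_domain(tag, analyzer_categories):
    """True if the tag matches at least one category from the analyzer."""
    return bool(COMPATIBLE.get(tag, set()) & analyzer_categories)

print(is_in_analysis_domain("IN", ANALYZER_CATEGORIES))  # False
print(is_in_analysis_domain("NN", ANALYZER_CATEGORIES))  # True
```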
AMIRA is trained on the Penn Arabic Treebank [22], which used BAMA as its morphological analyzer. This shows that systems such as AMIRA may assign a tag that lies outside the set of analyses available for the word.
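The window limitation in Sentence #4.1 can be made concrete with a short sketch. It assumes whitespace tokenization and a symmetric 5-token window centred on the word being tagged. The exact window layout used by AMIRA is not stated here, so that layout and the helper name are assumptions.

```python
# Illustration of the long-distance dependency in Sentence #4.1.
# Assumptions: whitespace tokenization and a symmetric context window.

tokens = [
    "طلب", "المحامي", "من", "المستشارين", "بعد", "الاطلاع",
    "على", "المستندات", "مد", "مدة", "الدفاع",
]

def context_window(tokens, center, size=5):
    """Return the tokens in a symmetric window of `size` around `center`."""
    half = size // 2
    lo = max(0, center - half)
    hi = min(len(tokens), center + half + 1)
    return tokens[lo:hi]

verb_idx = tokens.index("طلب")  # the verb
obj_idx = tokens.index("مد")    # its object

context = context_window(tokens, obj_idx)
verb_visible = tokens[verb_idx] in context

# Smallest contiguous window that could contain both verb and object.
min_window = abs(obj_idx - verb_idx) + 1

print(verb_visible)  # False
print(min_window)    # 9
```

Under these assumptions, the verb and its object are 8 tokens apart, so no contiguous 5-token window around the object can include the verb, regardless of how the window is positioned.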
The proposed system produces the correct analysis by taking the long-distance dependency into consideration. Figure 4.2 shows one of the analyses produced by the proposed system.
Figure 4.2: Morpho-syntactic analysis and PoS tagging of Sentence #5 with DLG
It can be seen in Figure 4.2 that the correct object is linked to the verb with the link “On” (nominal object), and that the object itself is tagged “n” (noun). All of the other words are also assigned valid PoS tags. The following table summarizes the resulting PoS tags for this sentence:
| Word | PoS tag | Expansion |
| --- | --- | --- |
| طلب | pv-pvss-3ms | Perfect-Verb + Perfect-Verb-Suffix-3rd-person-masc.-singular |
| المحامي | d-n | Determiner + Noun |
| من | prep-0 | Preposition |
| المستشارين | d-n-ns-md-ag | Determiner + Noun + Noun-Suffix-Masc-Dual-accusative/genitive form |
| بعد | prep-n | Preposition (can be noun) |
| الاطلاع | d-n | Determiner + Noun |
| على | prep-0 | Preposition |
| المستندات | d-adj-ns-fp | Determiner + Adjective + Noun-Suffix-Feminine-Plural |
| مد | n | Noun |
| مدة | n-ns-fs | Noun + Noun-Suffix-Feminine-Singular |
| الدفاع | d-n | Determiner + Noun |
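The tags in the table read as hyphen-joined sequences of components. The sketch below expands a tag into its readable form by greedy longest-match lookup. The glossary is inferred only from the rows of this table, and the decomposition scheme and the function name `expand_tag` are assumptions rather than the parser's actual implementation.

```python
# Hypothetical expansion of the compound PoS tags shown in the table.
# The glossary is inferred from the table rows only; it is not the
# system's full tagset.

GLOSSARY = {
    "pv": "Perfect-Verb",
    "pvss-3ms": "Perfect-Verb-Suffix-3rd-person-masc.-singular",
    "d": "Determiner",
    "n": "Noun",
    "adj": "Adjective",
    "ns-md-ag": "Noun-Suffix-Masc-Dual-accusative/genitive",
    "ns-fp": "Noun-Suffix-Feminine-Plural",
    "ns-fs": "Noun-Suffix-Feminine-Singular",
    "prep-0": "Preposition",
    "prep-n": "Preposition (can be noun)",
}

def expand_tag(tag):
    """Expand a hyphen-joined tag using greedy longest-match lookup."""
    pieces = tag.split("-")
    labels = []
    i = 0
    while i < len(pieces):
        # Try the longest remaining run of pieces first.
        for j in range(len(pieces), i, -1):
            key = "-".join(pieces[i:j])
            if key in GLOSSARY:
                labels.append(GLOSSARY[key])
                i = j
                break
        else:
            raise ValueError(f"unknown tag component in {tag!r}")
    return " + ".join(labels)

for tag in ["pv-pvss-3ms", "d-n-ns-md-ag", "prep-n", "n-ns-fs"]:
    print(tag, "=>", expand_tag(tag))
```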
As depicted in Figure 4.2 and Table 4.5, the parsing system developed for DLG produces, with every solution, the PoS tag of every word in the input.

