4.7. Preprocessing and Post-Processing
Proper analysis of a sentence should take punctuation into account. Some
punctuation marks determine sentence boundaries, while others express semantic
relationships between sentence components. At this stage, however, it is not
practical to develop a wide-scale syntax analyzer that depends on correctly
punctuated text, because punctuation has not been treated strictly, either in
practice or in education, in most Arab world publications [70].
Because the system presented here, like most other systems that process Arabic
sentences, starts from bounded sentences, sentence boundary detection is
investigated first. Our target is a system that can disambiguate the uses of
punctuation that affect the sentence boundary.
The most common end-of-sentence mark is the period (.). The period is ambiguous
because it also appears in numeric expressions (where, with the proper locale
settings, it is displayed in Arabic as a comma although its Unicode code point
is still the period) and in abbreviations. Both kinds of conflicting tokens
should be detected before a sentence boundary is declared. Other marks also
signal sentence boundaries, such as the question mark “?” at the end of a
sentence and the colon “:”, which marks the start of a quoted (said) sentence.
{0} = أيه ايه بي بى
سي سى دي دى إي
اى إى اي اف إف
أف جي جى اتش أتش
آي آى أي أى جيه كيه ال أل إل ام
أم إم إن ان أو
او بيه بي كيو أر
آر ار اس أس إس
تي تى يو في دبليو اكس
أكس واي واى زد توداي
{1} = أ ا ب ت
ث ج ح خ
د ذ ر ز
س ش ص ض
ط ظ ع غ
ف ق ك
ل
م ن ه و
ي ى
abbrev_pattern = "(\b({0})(\s*\.\s*|\s+)(\b({0})(\s*\.\s*|\s+|(?!\w)))+)|(\b({1})(\s*\.\s*|\s+)(\b({1})(\s*\.\s*|\s+|(?!\w)))+)"
Listing 4.2: Regular expression for Abbreviation detection
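As a rough illustration of how the template in Listing 4.2 could be instantiated, the following Python sketch joins each list into an alternation and compiles the two branches. It assumes the list items are combined with "|"; the short lists, the helper `_branch`, and the function `find_abbreviations` are hypothetical names introduced here, not part of the original system.

```python
import re

# Hypothetical subsets standing in for the {0} and {1} lists of Listing 4.2.
LETTER_NAMES = ["بي", "سي", "دي", "إم", "إن", "آي"]  # {0}: Latin letter names in Arabic script
SINGLE_LETTERS = ["ص", "ب", "ت", "م", "ع"]            # {1}: single Arabic letters


def _branch(items):
    """Build one branch of the pattern: an item, a separator, then one or more items."""
    # Longest alternatives first, so a longer name is tried before its prefix.
    alt = "|".join(re.escape(x) for x in sorted(items, key=len, reverse=True))
    first = rf"\b(?:{alt})(?:\s*\.\s*|\s+)"
    rest = rf"(?:\b(?:{alt})(?:\s*\.\s*|\s+|(?!\w)))+"
    return first + rest


ABBREV_RE = re.compile(f"(?:{_branch(LETTER_NAMES)})|(?:{_branch(SINGLE_LETTERS)})")


def find_abbreviations(text):
    """Return the matched abbreviations with surrounding whitespace trimmed."""
    return [m.group().strip() for m in ABBREV_RE.finditer(text)]
```

Because the pattern requires at least two items in a row, a single letter name that merely starts an ordinary word is not reported as an abbreviation.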
A set of regular expressions is written to detect abbreviations, words, numeric
expressions, and end-of-sentence marks. Listing 4.2 shows the regular expression
for abbreviation detection. Detection follows this priority order:
abbreviations, words, numeric expressions, then end-of-sentence marks.
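One way to realize such a priority order is to place the token patterns in a single alternation, since a regular expression engine such as Python's `re` tries alternatives from left to right. The sketch below is only an assumption about how this could look: the four patterns are deliberately simplified stand-ins (the abbreviation pattern covers only dotted single letters), and `TOKEN_SPEC` and `tokenize` are hypothetical names. It also assumes undiacritized text.

```python
import re

# Simplified, hypothetical token patterns; list order encodes detection priority.
TOKEN_SPEC = [
    ("ABBREV", r"[^\W\d_](?:\.[^\W\d_])+(?![^\W\d_])"),  # dotted single letters, e.g. ص.ب
    ("WORD", r"[^\W\d_]+"),                               # a run of letters
    ("NUMB", r"\d+(?:\.\d+)?"),                           # integer or decimal number
    ("EOS", r"[.?!؟]"),                                   # end-of-sentence marks
]

MASTER_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Return (kind, text) pairs; characters matching no pattern are skipped."""
    return [(m.lastgroup, m.group()) for m in MASTER_RE.finditer(text)]
```

With this ordering, the period inside a number or a dotted abbreviation is consumed by the NUMB or ABBREV pattern before the EOS pattern can see it.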
Numeric expressions are replaced by the special word “NUMB”, and abbreviations
are replaced by the special word “ABBREV”. The DLG grammar has specific rules
for these two special words. Any other punctuation left in the sentence is
removed. Words are replaced by their class IDs, which are obtained from the
word-minimization process discussed in the previous section.
The text of the words and the characters of the abbreviations and numeric
expressions are restored after parsing.
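The replace-then-restore idea can be sketched as follows. This is a minimal illustration under stated assumptions: tokens arrive as (kind, text) pairs, `class_ids` is a hypothetical lookup table standing in for the output of the word-minimization process, and the parser result is assumed to be a list of (head index, dependent index, label) edges, which is an invented format rather than the actual output of the DLG parser.

```python
def encode_for_parser(tokens, class_ids):
    """Map tokens to parser symbols and keep the original text by position."""
    symbols, originals = [], []
    for kind, text in tokens:
        if kind == "WORD":
            symbols.append(class_ids[text])  # hypothetical class-ID lookup
        elif kind in ("ABBREV", "NUMB"):
            symbols.append(kind)             # special word replaces the characters
        else:
            continue                         # any other punctuation is dropped
        originals.append(text)
    return symbols, originals


def restore_edges(edges, originals):
    """Replace token indices in the assumed edge list with the original text."""
    return [(originals[h], originals[d], label) for h, d, label in edges]
```

Keeping `originals` aligned with `symbols` by position means that restoration needs no further bookkeeping: index i in the parser input always refers to the i-th saved text.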
4.8. System Overview
Recalling the ladder of linguistic processing layers, each layer takes a
designated input from the layer that precedes it and delivers its result to the
layer that succeeds it. In the best case, this means that each layer processes
its input, and the successor layer takes all possible analyses from it and
selects only the solutions that satisfy its own rules. A statistical system
based on a similar architecture is presented in [57].
The interplay between Arabic morphology and Arabic syntax forces the analyst of
the Arabic language to deal with both at the same time.
The system presented in this chapter merges the first
two layers to form a morpho-syntactic layer as shown in figure 4.4.
Figure 4.4: A new applied ladder for Arabic linguistic processing
The system combines a morpho-syntactic parser and a PoS tagger in a single
processing unit (layer). This unit searches for rules that satisfy the linking
requirements assigned to the PoS-tagged words. With this approach, the
complexity of disambiguation is reduced, because morphological analysis and
syntactic analysis are done in a single step. Although figure 4.3 does not show
the pre-processing step, Attia [57] describes it as an important step for
processing raw text; it is depicted in figure 4.5.
Figure 4.5: System overview
The input to the system is a stream of characters. Abbreviations, words, numeric
expressions, and end-of-sentence marks are identified. Abbreviations are
replaced by the special token “ABBREV”. Words are replaced by their class IDs (a
single number that identifies all the possible PoS tags for the word). Numeric
expressions are replaced by the special token “NUMB”. The stream of characters
is split into sentences at the end-of-sentence marks. Any other punctuation
(such as parentheses, brackets, and curly braces) is removed from the sentence.
This step serves as the tokenization and pre-processing step for the parser.
After the result is obtained from the parser, the words are placed back into the
stream, replacing their class IDs, and the special tokens (ABBREV and NUMB) are
replaced by their corresponding characters from the original character stream.
The resulting graphs, post-processed so that they carry the input text again,
are returned by the system to applications or to higher levels of NLP
processing.
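The sentence-splitting step described above can be sketched on an already tokenized stream. The example assumes tokens are (kind, text) pairs in which end-of-sentence marks carry the kind "EOS"; that representation and the function name `split_sentences` are hypothetical, chosen only for illustration.

```python
def split_sentences(tokens):
    """Group (kind, text) tokens into sentences, cutting at EOS tokens."""
    sentences, current = [], []
    for kind, text in tokens:
        if kind == "EOS":
            if current:                 # ignore repeated or leading EOS marks
                sentences.append(current)
                current = []
        else:
            current.append((kind, text))
    if current:                         # trailing sentence without a final mark
        sentences.append(current)
    return sentences
```

Because the split operates on tokens rather than raw characters, a period that was already absorbed into a NUMB or ABBREV token can no longer be mistaken for a sentence boundary.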

