Monday, February 17, 2014

The Word Based Parsing Technique (part-7)

4.7. Preprocessing and Post-Processing

Proper analysis of a sentence should take punctuation into account. Some punctuation marks affect sentence boundaries, while others express semantic relationships between sentence components. At this stage it is not practical to develop a wide-scale syntax analyzer that depends on correctly punctuated text, because punctuation has not been treated rigorously, either in practice or in education, in most Arab world publications [70].

Because the system presented here, like most other systems that process Arabic sentences, starts from bounded sentences, sentence boundary detection is investigated first. The target is a system that can disambiguate the uses of punctuation that affect sentence boundaries.

The most common end-of-sentence mark is the period (.). The period conflicts with numeric expressions (it is displayed in Arabic as a comma under the proper locale settings, but its Unicode code point is still the period) and with abbreviations. Both kinds of conflicting tokens should be detected before a sentence boundary is declared. Other marks also delimit sentences, such as the question mark “?” at the end of a sentence and the colon “:”, which marks the start of a “say” sentence (reported speech).


{0} = أيه ايه بي بى سي سى دي دى إي اى إى اي اف إف أف جي جى اتش أتش آي آى أي أى جيه  كيه ال أل إل ام أم إم إن ان أو او بيه بي كيو أر آر ار اس أس إس تي تى يو في  دبليو اكس أكس واي واى زد توداي

{1} = أ ا ب ت ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ل م ن ه و ي ى

abbrev_pattern = (\b({0})(\s*\.\s*|\s+)(\b({0})(\s*\.\s*|\s+|(?!\w)))+)
| (\b({1})(\s*\.\s*|\s+)(\b({1})(\s*\.\s*|\s+|(?!\w)))+)
Listing 4.2: Regular expression for Abbreviation detection

A set of regular expressions is written to detect abbreviations, words, numeric expressions, and end-of-sentence marks. Listing 4.2 shows the regular expression for abbreviation detection, in which {0} and {1} stand for the alternatives listed above the pattern. Detection priority follows this order: abbreviations, words, numeric expressions, then end-of-sentence marks.
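As a rough illustration of how a pattern shaped like Listing 4.2 could be assembled, here is a minimal Python sketch. The shortened word lists, the helper name `branch`, and the use of non-capturing groups are assumptions made for this example, not the original implementation.

```python
import re

# Shortened lists for illustration (assumption); Listing 4.2 gives the full ones.
LATIN_LETTER_NAMES = "ايه بي سي دي اف جي".split()  # plays the role of {0}
ARABIC_LETTERS = "أ ا ب ت ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ل م ن ه و ي ى".split()  # {1}


def branch(items):
    """Build one branch of the pattern: an item, a separator, then one or
    more further items, each followed by a separator or a non-word position."""
    alt = "|".join(re.escape(item) for item in items)
    first = r"\b(?:" + alt + r")(?:\s*\.\s*|\s+)"
    rest = r"(?:\b(?:" + alt + r")(?:\s*\.\s*|\s+|(?!\w)))+"
    return first + rest


# The two branches are tried in order: letter names first, single letters second.
ABBREV_RE = re.compile(branch(LATIN_LETTER_NAMES) + "|" + branch(ARABIC_LETTERS))


def find_abbreviations(text):
    """Return the matched abbreviations with surrounding whitespace trimmed."""
    return [match.group(0).strip() for match in ABBREV_RE.finditer(text)]
```

Note that the pattern can swallow the whitespace that follows an abbreviation, which is why the helper trims each match.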

Numeric expressions are replaced by the special word “NUMB”, and abbreviations by the special word “ABBREV”. The DLG grammar has specific rules for these two special words. Any other punctuation left in the sentence is removed. Words are replaced by their class IDs, which are obtained from the word-minimization process discussed in the previous section.

After parsing, the text of the words and the characters of the abbreviations and numeric expressions are restored.
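The replace-then-restore idea can be sketched as follows. This is only an illustration: the simplified abbreviation and number patterns, and the function names `protect` and `restore`, are assumptions, and the sketch assumes the input text does not itself contain the words “ABBREV” or “NUMB”.

```python
import re

# Simplified stand-in patterns (assumptions), not the patterns of the real system.
LETTER = r"[^\W\d_]"                                   # any Unicode letter
ABBREV = r"\b" + LETTER + r"(?:\." + LETTER + r")+\b"  # e.g. single letters joined by periods
NUMBER = r"\d+(?:[.,]\d+)*"                            # e.g. 3.5 or 1,000

# Abbreviations are listed first so they take priority over numbers.
PROTECT_RE = re.compile("(?P<ABBREV>" + ABBREV + ")|(?P<NUMB>" + NUMBER + ")")
PLACEHOLDER_RE = re.compile(r"\b(?:ABBREV|NUMB)\b")


def protect(text):
    """Replace abbreviations and numbers by placeholders; keep the originals in text order."""
    saved = []

    def stash(match):
        saved.append(match.group(0))
        return match.lastgroup  # the group name doubles as the placeholder

    return PROTECT_RE.sub(stash, text), saved


def restore(text, saved):
    """Put the saved originals back, one per placeholder, in order."""
    originals = iter(saved)
    return PLACEHOLDER_RE.sub(lambda match: next(originals), text)
```

Because the originals are saved in the order they occur, restoring them after parsing only requires walking the placeholders in the same order.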

4.8. System Overview

Recalling the ladder of linguistic processing layers, each layer takes a designated input from the layer that precedes it and delivers its result to the layer that succeeds it. In the best case, each layer passes on all possible analyses of its input, and its successor selects only the solutions that satisfy its own rules. A statistical system based on a similar architecture is presented in [57].

The interplay between Arabic morphology and syntax forces the analyst of Arabic to deal with both at the same time.

The system presented in this chapter merges the first two layers into a single morpho-syntactic layer, as shown in figure 4.4.

Figure 4.4: A new applied ladder for Arabic linguistic processing

The system combines a morpho-syntactic parser and a PoS tagger in a single processing unit (layer). This unit searches for rules that satisfy the linking requirements assigned to PoS-tagged words. With this approach the complexity of disambiguation is reduced, because morphological analysis and syntactic analysis are done in a single step. Although figure 4.3 does not show the pre-processing step, Attia [57] describes it as an important step for processing raw text, and it is depicted in figure 4.5.

Figure 4.5: System overview

The input to the system is a stream of characters, in which abbreviations, words, numeric expressions, and end-of-sentence marks are identified. Abbreviations are replaced by the special token “ABBREV”. Words are replaced by their class IDs (a single number that identifies all the possible PoS tags for the word). Numeric expressions are replaced by the special token “NUMB”. The stream is split into sentences at the end-of-sentence marks, and any other punctuation (such as parentheses, brackets, and curly braces) is removed. This step serves as the tokenization and pre-processing step for the parser. After the parser returns its result, the words are placed back into the stream in place of their class IDs, and the special tokens (ABBREV and NUMB) are replaced by their corresponding characters from the original character stream. The resulting graphs, with the input text restored by post-processing, are returned to applications or to higher levels of NLP processing.
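The pre-processing flow described above might be sketched in Python as a single prioritized tokenizer. Everything specific here is an assumption for illustration: the simplified token patterns, the set of end-of-sentence marks, the `class_ids` lexicon and its numbers, the use of 0 for unknown words, and the restriction to undiacritized text.

```python
import re

LETTER = r"[^\W\d_]"  # any Unicode letter (assumes undiacritized text)

# Alternatives are tried left to right, which gives the detection priority:
# abbreviations, words, numeric expressions, end-of-sentence marks.
TOKEN_RE = re.compile("|".join([
    r"(?P<ABBREV>\b" + LETTER + r"(?:\." + LETTER + r")+\b)",
    r"(?P<WORD>" + LETTER + r"+)",
    r"(?P<NUMB>\d+(?:[.,]\d+)*)",
    r"(?P<EOS>[.?؟])",
]))


def preprocess(stream, class_ids):
    """Split a character stream into sentences of (symbol, original_text) pairs.

    Words become class IDs (0 if unknown, an assumption); abbreviations and
    numbers become "ABBREV" / "NUMB". Other punctuation matches no
    alternative, so it is dropped.
    """
    sentences, current = [], []
    for match in TOKEN_RE.finditer(stream):
        kind, original = match.lastgroup, match.group(0)
        if kind == "EOS":
            if current:
                sentences.append(current)
                current = []
        elif kind == "WORD":
            current.append((class_ids.get(original, 0), original))
        else:
            current.append((kind, original))
    if current:
        sentences.append(current)
    return sentences


def restore_text(sentence):
    """Post-processing: recover the original text of one sentence."""
    return " ".join(original for _, original in sentence)
```

Keeping the original text next to each symbol is one simple way to make the post-processing step a plain lookup; the parser would consume only the symbols.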
