Some Developer: NLP - Arabic Parsing Overview (3/6)

2.3. Brief on well-known existing formalisms applied on Arabic

The grammar formalisms of context-free phrase structure grammars and transformational grammars form the theoretical basis of several formalisms used in modern parsing systems. The works of Bloomfield [9] and Chomsky [10] [11] were the origin of phrase structure grammars and they were designed to study the structure of phrases.
The Context Free Grammar (CFG) G = { N, T, S, P } consists of a finite set of non-terminal N, a finite set of terminal symbols T where T ∩ N = ∅, a head-start symbol S ∈ N, a finite set of production rules where P = {a ↦ b| a ∈ N, b ∈ (T ∪ N)*}. [12]
The set of production rules states how a non-terminal is going to be expanded to the items of the right-hand-side of any of the associated production rules that it exists on its left-hand-side. “Context-Free” refers to the avoidance of referring to the context in which the non-terminal is located, while the expansion (or reduction) operation occurs on the production rules [13]. A rule in phrase structure context-free grammar specifies two relations: immediate dominance between a non-terminal (left-hand-side of the rule) and its children (right hand side of it), and the linear precedence relation among children (right-hand-side) [14].

2.3.1. Definite Clause Grammar

2.3.1.1. Overview

Definite Clause grammars express the rules as clauses of first-order logic (also known as ‘definite clauses’ or ‘Horn clauses’), providing a generalization of CFG.
DCG components are Terms and Clauses. Terms are the data objects of it and a Term is a constant, a variable, or a compound term. A compound term comprises a functor (called the principle functor of the term), and a sequence of one or more terms called arguments. A functor is characterized by its name, which is an atom, and its arity (the number of arguments). A grammar defined in DCG is a sequence of clauses. A clause comprises a head and a body, and the body consists of a sequence of zero or more goals. A goal is true if it is the head of some clause instance and each of the goals in the body of that clause instance is true, where an instance of a clause is obtained by substituting, for each of zero or more of its variables, a new term for all occurrences of the variable. To execute a goal, the system searches for the first clause whose head matches with the goal. The searching finds the most general common instance of the two terms, which is unique if it exists. If a match is found, the matching clause instance is then activated by executing from left to right, each of the goals if any in its body. If at any time the system fails to find a match for a goal it backtracks, and at that state, it reconsiders the original goal which activated the rejected clause, and tries to find a subsequent clause which also matches the goal. [15]
Like Affix Grammar over Finite Lattice (AGFL, as will be mentioned later), DCG extends CFG by augmenting non-terminals with arguments and extra conditions to be included in the grammar rules; these conditions make the course of the parsing depend on auxiliary computations, up to unlimited extent. DCG allows also trees to be built in the course of the parsing, which can provide higher representation of the sentence. DCG enables context dependency on the grammar, so that the permissible forms for a phrase may depend on the context in which that phrase occurs in the string. With these features DCG extends the CFG and overcome up to some extent to its in-adequacies for natural language processing.

2.3.1.2. Application on Arabic Language

Shishiny’s [16] attempt in Prolog was the first reported trying to develop a large scale Arabic grammar. Several attempts were done for modeling some aspects of the Arabic sentence, but none of them was reported to be generic or eligible for large scale usage.
The lexical entries of the grammar are augmented with arguments needed for recognizing some structures and to enforce agreement among sentence components. Arguments cover transitivity, voice, person for verbs, number, gender, definiteness, person for nouns, and number, gender, and person for pronouns.
The grammar proposed formalization for ‘nominal sentence’, ‘verbal sentence’, and special sentences such as ‘vocative sentence’ (جملة القول). It also addressed conjunction and embedding. The embedded phrase types were not reported.
The performance of the developed system was reported to be large scale. No reports could be found regarding the speed of processing, the recall, and the precision. Nothing is also reported regarding the lexicon and its coverage. The grammar itself is not available to be checked.

2.3.2. Affix Grammar over a Finite Lattice

2.3.2.1. Overview

Affix Grammars over a Finite Lattice (AGFL) provides means of augmenting context-free grammar with feature matching rules named meta-rules. The part that is commonly modeled with affix rules is the part-of-speech matching.
One major advantage of using AGFL over CFG is the low number of needed rules to model some aspects of a language. Meta-rules (Affix definition rules) state the possible values for an Affix variable (known as Affix domains in [17]). A terminal or non-terminal in a phrase structure rule can have more than one affix variable as ordered set. A simplified depiction of Affix variable can be expressed as a bit-vector, with each of the values that exist in its domain is assigned a bit position in that bit-vector. Affix variables on the LHS of a production rule specifies the propagated variables from those mentioned in the RHS. During the parsing process, LHS affix variables are matched against each other, and if a variable has an empty set, then the parsing is back-tracked. During the parsing process some grammar symbols on the RHS or LHS of a production rule may have more than one value for the same Affix variable. Figure 2.2 depicts the possible values for a variable NUMBER that can model the person-property of PoS tag regarding who is saying the verb utterance.

Figure 2.2: The lattice of values for the affix NUMBER

The operations that can be done on this lattice-over-set of values are set-union and set-intersection, and named powerset lattice of the set valued affix. The matching is a set-intersection operation resulting in a new value (set of domain values) that is propagated to the LHS of the rule if it is mentioned for that non-terminal.

Listing 2.1 contains a small example for depicting applying AGFL on Arabic.

NUMB:: sing ; dual; plur.
GEND:: fem; masc.
Sentence :: Verb(GEND, sing), Subject(GEND) ;
Subject(GEND, NUMB), Verb(GEND, NUMB) .
Verb(masc, dual) :: ذهبا .
Verb(masc, sing) :: ذهب .
Verb(fem, sing) :: ذهبت .
Verb(fem, dual) :: ذهبتا .
Verb(fem, dual) :: ذهبتا .
Verb(masc, plural) :: ذهبوا .
Verb(fem, plural) :: ذهبن .
Subject(masc, dual) :: الرجلان .
Subject(masc, sing) :: الرجل .
Subject(masc, plural) :: الرجال .
Subject(fem, plural) :: النساء .
Subject(fem, sing) :: المرأة .
Subject(fem, dual) :: المرأتان .
Listing 2.1: Sample AGFL for Arabic

In that small example we have two affix domains, one for NUMB and the other for GEND. Subject and Verb nodes in the definition of Sentence, should match on the NUMB variable obtained from their children. If there is no intersection among their values, the algorithm backtracks, taking another branch. In the previous sample, the NUMB feature has terminal values corresponding to the feature projection of these words.
The algorithm used by the framework utilizing this formalism uses Recursive-Backup parsing algorithm [18]. The main factor of selecting this method is the simple method of implementing it with on-the-fly affix evaluation.
In AGFL, there is no notion of directionality among the production rule elements with regard to Affix evaluation.

2.3.2.2. Application on Arabic Language

AraParse project was built utilizing the AGFL framework and its formalism for modeling Modern Standard Arabic (MSA). The parsing unit of the system is composed of two units, a lexical analyzer and the parser (built with AGFL). The lexical analyzer gets a word and generates PoS tags for its components, which are fed into the parser generated with AGFL. The grammar written in AGFL is based on PoS tags not the words themselves. The front end of the parser is modified to accept list of possible features of every word in addition to the word itself. The word itself is just anchored for presentation and aids in result processing. [19]
The AraParse grammar covers wide range of usual expressions of MSA, including (negative forms, elliptical forms, several interrogative forms, some kind of coordination, complex determiners, agreement and dependencies within a sentence, complement description). [19]
AraParse utilized robustness on several levels. On the lexical level, it uses rules for detecting unknown verbs and nouns through regular expressions. To enhance the performance of the parser, some rules utilize commit marker through which the parsing will stop after detecting them. Due to the parsing algorithm which tries to match the rules in the order it is mentioned in the grammar, AraParse achieved robustness on the syntactic level, special rules are supplied for parsing unknown predicates (islands). [19]
It is mentioned that AraParse needs the aid of statistical lexical tagging to overcome the combinatorial behavior of the system. Robustness is reported as a future concern for AraParse, despite the fact that it was taken care of, in this project [19].

Pages

Monday, February 17, 2014

NLP - Arabic Parsing Overview (3/6)

No comments:

Post a Comment