Monday, February 17, 2014

NLP - Arabic Parsing Overview (2/6)

2.2. Formalization Properties

2.2.1. Syntactic Representation

Syntactic structures are often depicted as tree-shaped structures, commonly referred to as dependency trees or phrase-structure trees depending on the kind of representation they use. While dependency trees describe a sentence through dependencies between its words, phrase-structure trees illustrate its phrases, their abstractions, and the relationships between them.

Figure 2.1: Sample of Arabic sentence analysis using dependency
representation and phrase structure representation

Dependency trees show only the words, with binary directed links between them forming a tree (some representations allow the more general Directed Acyclic Graph, DAG), so the number of nodes is typically equal to the number of words. Phrase-structure trees show, from top to bottom, the constituent analysis in the form of non-terminals, ending with words (terminals) at the leaves.

Phrase structure represents the structure of the sentence in a linear form, as a chain of words and as a whole-part construction [1]. The dependencies between constituents are not shown explicitly. It is best suited to languages with very limited variation in word order [2]. While some phenomena (such as coordination) are best represented with it, others (such as discontinuous constituents) cause problems in this representation. Figure 2.1 depicts a simple Arabic sentence in both dependency and phrase-structure representation.
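To make the contrast concrete, the two representations can be sketched as data structures. This is an illustrative example, not the sentence from Figure 2.1; the toy English sentence, tag names, and relation labels are all hypothetical.

```python
# Dependency representation: one node per word, each carrying a 1-based
# head index (0 = root) and a relation label -- a directed tree over words.
words = ["the", "boy", "read", "the", "book"]
heads = [2, 3, 0, 5, 3]          # "read" is the root of the sentence
labels = ["det", "nsubj", "root", "det", "obj"]

# Phrase-structure representation: nested constituents (non-terminals)
# ending with the same words as terminal leaves.
tree = ("S",
        ("NP", ("DT", "the"), ("NN", "boy")),
        ("VP", ("VBD", "read"),
               ("NP", ("DT", "the"), ("NN", "book"))))

def leaves(t):
    """Collect the terminal words of a nested phrase-structure tuple."""
    if isinstance(t, str):
        return [t]
    out = []
    for child in t[1:]:
        out.extend(leaves(child))
    return out

# Both representations cover exactly the same words, and the dependency
# graph has as many nodes as the sentence has words (as noted above).
assert leaves(tree) == words
assert len(heads) == len(words)
```

Note how the dependency encoding keeps only word-to-word links, while the phrase-structure encoding introduces extra non-terminal nodes (S, NP, VP) that group the words into constituents.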

Dependency trees mark phrase categories only implicitly and often do not show word order. They are better suited to representing the structures of relatively free word-order languages. They can capture the predicate-argument structure that is often needed in NLP applications, and they also offer a straightforward interface between syntactic and semantic representation [3].

Under certain conditions, dependency trees with a specified word order can be converted into equivalent phrase-structure trees, and vice versa [2] [4] [5].
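One direction of that conversion can be sketched under a strong simplifying assumption: for a projective dependency tree, each head word projects a constituent over itself and the phrases of its dependents. The function below is a hypothetical minimal illustration, not the algorithm of [2], [4], or [5]; a real conversion would also assign proper category labels instead of the placeholder "XP".

```python
def dep_to_phrase(words, heads, i=None):
    """Convert a 1-based head-index dependency tree to nested brackets.

    Assumes a projective tree; every head projects a flat "XP" phrase
    over its dependents' phrases, preserving surface word order.
    """
    if i is None:                      # start from the root (head index 0)
        i = heads.index(0) + 1
    deps = [j + 1 for j, h in enumerate(heads) if h == i]
    # Children in surface order: dependents left of the head, the head
    # word itself, then dependents to its right.
    children = [dep_to_phrase(words, heads, j) for j in deps if j < i]
    children.append(words[i - 1])
    children += [dep_to_phrase(words, heads, j) for j in deps if j > i]
    # A head with no dependents stays a bare word, not a one-child phrase.
    return children[0] if len(children) == 1 else ("XP", *children)

words = ["the", "boy", "read", "the", "book"]
heads = [2, 3, 0, 5, 3]               # "read" is the root
print(dep_to_phrase(words, heads))
# ('XP', ('XP', 'the', 'boy'), 'read', ('XP', 'the', 'book'))
```

The reverse direction (phrase structure to dependency) additionally requires head-finding rules to pick, for each constituent, which child carries the head word.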

2.2.2. Representation depth

Parsers fall into two types with regard to the depth of their output representation: Shallow parsers and Deep parsers. Shallow parsers produce a flatter analysis that represents only parts of the structure [6], and most parsers of this type produce what is called “base phrase chunking.” Deep parsers, on the other hand, produce a complete analysis covering relationships across the whole sentence (after checking that the input matches the given grammar).
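Base phrase chunking can be illustrated with a toy noun-phrase chunker over POS tags. This is a hypothetical sketch: the tag set and the single "determiner, adjectives, nouns" pattern are simplifications, and real chunkers are typically learned or use richer patterns.

```python
def np_chunk(tagged):
    """Group maximal DT? JJ* NN+ runs into flat NP chunks.

    Input: a list of (word, POS-tag) pairs.
    Output: a flat sequence mixing NP chunks and unchunked tokens --
    the kind of partial, non-recursive analysis shallow parsers emit.
    """
    chunks, i = [], 0
    while i < len(tagged):
        j = i
        if tagged[j][1] == "DT":                       # optional determiner
            j += 1
        while j < len(tagged) and tagged[j][1] == "JJ":  # any adjectives
            j += 1
        start = j
        while j < len(tagged) and tagged[j][1] == "NN":  # one or more nouns
            j += 1
        if j > start:                  # at least one noun: emit an NP chunk
            chunks.append(("NP", [w for w, _ in tagged[i:j]]))
            i = j
        else:                          # no NP starting here: pass token through
            chunks.append(tagged[i])
            i += 1
    return chunks

sent = [("the", "DT"), ("old", "JJ"), ("man", "NN"),
        ("saw", "VBD"), ("a", "DT"), ("dog", "NN")]
print(np_chunk(sent))
# [('NP', ['the', 'old', 'man']), ('saw', 'VBD'), ('NP', ['a', 'dog'])]
```

Note that the output is flat: chunks never nest, and the verb is left unattached, which is exactly what distinguishes this shallow analysis from a deep parse of the whole sentence.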

The NLP application dictates which type will be used. Information Retrieval (IR), for instance, may not need more than a Shallow parser, while MT systems might need an in-depth (Deep) syntactic analyzer [7]. Shallow parsers are claimed to be faster than Deep syntactic parsers [8], and a Shallow parser is also easier to modify than a Deep one. Deep parsers, on the other hand, provide better generalizations across semantic relations and capture paraphrasing relations between syntactic constructs. Reliability scores for an analysis are also better obtained from Deep parsers [7].

The next section briefly surveys the most well-known formalisms that have been successfully applied in existing parsers, and mentions their application to Arabic.

2.3. Brief on well-known existing formalisms applied on Arabic

The grammar formalisms of context-free phrase structure grammars and transformational grammars form the theoretical basis of several formalisms used in modern parsing systems. The works of Bloomfield [9] and Chomsky [10] [11] were the origin of phrase structure grammars, which were designed to study the structure of phrases.

A Context-Free Grammar (CFG) G = (N, T, S, P) consists of a finite set of non-terminal symbols N, a finite set of terminal symbols T where T ∩ N = ∅, a start symbol S ∈ N, and a finite set of production rules P = {a → b | a ∈ N, b ∈ (T ∪ N)*} [12].

The set of production rules states how a non-terminal can be expanded into the right-hand side of any production rule on whose left-hand side it appears. “Context-free” refers to the fact that the expansion (or reduction) operation on a production rule does not consult the context in which the non-terminal occurs [13]. A rule in a phrase-structure context-free grammar thus specifies two relations: immediate dominance, between a non-terminal (the left-hand side of the rule) and its children (the right-hand side), and linear precedence, among those children [14].
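A CFG of the form G = (N, T, S, P) above can be sketched directly in code, together with a leftmost derivation that repeatedly expands the leftmost non-terminal. The toy grammar and sentence are illustrative inventions, not examples from [12]; note how each applied rule fixes both which symbols a non-terminal dominates and the order in which they appear.

```python
# Production rules P: each left-hand side (a non-terminal) maps to a
# list of alternative right-hand sides (sequences over T ∪ N).
P = {
    "S":  [["NP", "VP"]],
    "NP": [["DT", "NN"]],
    "VP": [["VBD", "NP"]],
    "DT": [["the"]],
    "NN": [["boy"], ["book"]],
    "VBD": [["read"]],
}
N = set(P)          # non-terminals are exactly the rule left-hand sides

def leftmost_derive(form, target):
    """Return True if `target` is derivable from the sentential form
    `form` by always expanding the leftmost non-terminal."""
    if form == target:
        return True
    for i, sym in enumerate(form):
        if sym in N:                   # leftmost non-terminal found
            return any(
                leftmost_derive(form[:i] + rhs + form[i + 1:], target)
                for rhs in P[sym])
    return False                       # all terminals, but not the target

# Starting from S, the grammar generates this sentence...
assert leftmost_derive(["S"], ["the", "boy", "read", "the", "book"])
# ...but not a reordering of it: linear precedence is part of each rule.
assert not leftmost_derive(["S"], ["boy", "the", "read", "the", "book"])
```

The recursion terminates here because this toy grammar has no recursive rules; a practical recognizer for arbitrary CFGs would instead use a chart algorithm such as CKY or Earley.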

