Monday, February 17, 2014

NLP - Arabic parsing overview (4/6)

2.3.3. Probabilistic context-free grammars (PCFG)
2.3.3.1. Overview

A PCFG extends a context-free grammar by assigning a probability of use to each production rule. A PCFG is a tuple G = (N, T, S, P, p), where N is a finite set of non-terminals; T is a finite set of terminals with T ∩ N = ∅; S ∈ N is the start symbol of the grammar; P = {a → b | a ∈ N, b ∈ (T ∪ N)*} is the set of production rules; and p : P → [0,1] is a probability function such that for each a ∈ N, ∑ p(a → b | a) = 1.
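As a sketch, a toy PCFG can be stored as a mapping from each non-terminal to its rules, and the normalization condition above checked directly (the grammar and its probabilities are invented here purely for illustration):

```python
# A toy PCFG: each non-terminal maps to a list of
# (right-hand side, probability) pairs. Illustrative only.
pcfg = {
    "S":  [(("NP", "VP"), 1.0)],
    "NP": [(("Det", "N"), 0.6), (("N",), 0.4)],
    "VP": [(("V", "NP"), 0.7), (("V",), 0.3)],
}

def is_properly_normalized(grammar, tol=1e-9):
    """Check the PCFG condition: for each non-terminal a,
    the probabilities of all rules a -> b sum to 1."""
    return all(abs(sum(p for _, p in rules) - 1.0) < tol
               for rules in grammar.values())

print(is_properly_normalized(pcfg))  # True
```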

By extending the subcategorization of PoS tagging, one might end up with every PoS assigned to a single word. Building a PCFG for such content results in a lexicalized PCFG, in which every production rule is associated with a non-terminal anchored to that single lexical unit. Although lexicalized PCFGs are competent at identifying functional words and their contexts, unlexicalized parsing has shown wider coverage when applied to content with new words that do not exist in the training data. Successful approaches to unlexicalized PCFGs focus on separating content words from functional words.

Because functional words have different grammatical meanings, each is considered as a separate unit [20].
While rule-based algorithms parse through the recursive application of hand-written rules, probabilistic models automatically discover the disambiguation criteria during the parsing process. By their nature, probabilistic algorithms can rank the possible solutions and speed up the search process based on the calculated probabilities [1].

The probabilistic parsing model is built in three steps. The first is the parameterization phase, in which the method of defining the probability of an analysis is chosen [21]. The second is the training phase, in which the probability distributions are estimated from a given previously analyzed text (a corpus). The third is selecting the evaluation method. Often, the given corpus is split into three parts: a training part, a development part, and a test part. The training part is used to train the model, and the development part is used to tune it. At the final evaluation, the test part (which was not seen during training or tuning) is used to measure the final quality of the model.
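The corpus split described above can be sketched as follows; the 80/10/10 proportions and the shuffling are illustrative assumptions, not prescribed by the article:

```python
import random

def split_corpus(sentences, train=0.8, dev=0.1, seed=0):
    """Shuffle and split an annotated corpus into training,
    development (tuning), and test (final evaluation) parts."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    data = list(sentences)
    rng.shuffle(data)
    n_train = int(len(data) * train)
    n_dev = int(len(data) * dev)
    return (data[:n_train],
            data[n_train:n_train + n_dev],
            data[n_train + n_dev:])

corpus = [f"sentence {i}" for i in range(100)]
train_part, dev_part, test_part = split_corpus(corpus)
print(len(train_part), len(dev_part), len(test_part))  # 80 10 10
```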

Each parse r of a sentence s in a probabilistic model is given a probability P(r|s), and the parsing process returns the most probable parse r̂ (the one maximizing P(r|s)):

r̂ = argmax_r P(r|s) = argmax_r P(r, s) / P(s) = argmax_r P(r, s)
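In code, the selection step is a plain argmax over the scored candidate analyses; the candidate parses and their scores below are hypothetical:

```python
def most_probable_parse(candidates):
    """Given (parse, P(parse | sentence)) pairs for one sentence,
    return the parse with the highest probability. Since P(s) is
    constant across candidates, dividing by it changes nothing."""
    return max(candidates, key=lambda pair: pair[1])[0]

# Hypothetical candidate analyses for a single sentence.
candidates = [("(S (NP ...) (VP ...))", 0.51),
              ("(S (VP ...) (NP ...))", 0.34),
              ("(S (NP ...))",          0.15)]
print(most_probable_parse(candidates))  # (S (NP ...) (VP ...))
```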

2.3.3.2. Application on Arabic Language

The goal of the Penn Arabic Treebank (PATB) project was to develop a 1-million-word corpus annotated at the word, phrase, and sentence levels. The project obtained its raw text from the Agence France Presse (AFP) newswire archive.

Building on traditional Arabic grammar, the Penn Arabic Treebank annotated the content using existing theories of modern linguistics and the experience of preparing treebanks for other languages (English, Chinese, and Korean). Although it has no distinct formal definition, MSA is mostly used in formal written communication and broadcast newswire [22].

One of the important challenges in building an Arabic treebank is the grammatical inconsistency of MSA, whose internalized grammar sometimes differs drastically from one annotator to another [22].
The first part of the PATB was used to train Bikel's [23] implementation of the Collins [24] parser, producing recall/precision of 75.4/76.0 on sentences of 40 words or fewer and 72.5/73.4 on all sentences. Successive improvements enabled the parser to achieve recall/precision of 78.14/80.26 on sentences of 40 words or fewer and 73.61/75.64 on all sentences.
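Recall/precision figures of this kind are typically computed PARSEVAL-style over labeled brackets; a minimal sketch, with toy gold and predicted spans invented for illustration:

```python
def bracket_scores(gold, predicted):
    """Labeled-bracket precision and recall, where each bracket is a
    (label, start, end) span over the sentence's token positions."""
    gold_set, pred_set = set(gold), set(predicted)
    correct = len(gold_set & pred_set)  # brackets matching exactly
    precision = correct / len(pred_set)
    recall = correct / len(gold_set)
    return precision, recall

# Toy example: one predicted bracket is mislabeled (PP instead of NP).
gold = {("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)}
pred = {("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5)}
p, r = bracket_scores(gold, pred)
print(round(p, 2), round(r, 2))  # 0.75 0.75
```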

2.3.4. Tree Adjoining Grammar
2.3.4.1. Overview

TAG uses tree structures as its elementary objects. Although TAGs can describe more complex languages than CFGs, this comes at a performance cost [25]. A tree adjoining grammar is a tuple G = (N, T, I, A, S), where N is a finite set of non-terminal symbols with N ∩ T = ∅; T is a finite set of terminal symbols; I is a finite set of finite initial trees; A is a finite set of finite auxiliary trees; and S ∈ N is a special symbol called the sentence symbol. The initial trees and the auxiliary trees together form the elementary trees. Recursion is not allowed in initial trees; auxiliary trees may be recursive. The syntactic analysis of a sentence is derived from initial and auxiliary trees through two operations, substitution and adjoining [14].

In the substitution operation, the label of the leaf node to be replaced (the substitution node) must match the label of the root node of the initial tree inserted in its place. The adjoining operation replaces a node in a tree with an auxiliary tree whose root carries the same non-terminal as the replaced node; the auxiliary tree must also contain a childless foot node with that same non-terminal, which receives the children of the replaced node [14].
The process of analyzing a sentence yields two types of trees: derived trees and derivation trees. Derived trees show the final analysis of the sentence, and derivation trees record the derivation operations performed to obtain them. Each elementary tree used in the derivation is a node in the derivation tree, with a labeled link to the tree into which it is either substituted or adjoined. The label refers to the position in the target tree at which the substitution or adjunction takes place [26]. Adjunction allows TAGs to describe languages that are not describable with a CFG, such as the copy-like language aⁿbⁿecⁿdⁿ. TAGs generally have constraints on adjunction, such as limiting the number of adjunctions that can occur at a particular node.
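The two operations can be sketched on trees represented as nested lists; the tree encoding, the "label*" foot-node convention, and the example sentence are all assumptions made for illustration, not from any particular TAG toolkit:

```python
import copy

# Trees as nested lists: [label, child1, ...]; a bare [label] is a leaf.
# "VP*" marks the foot node of an auxiliary tree (illustrative convention).

def find(tree, test):
    """Depth-first search for the first subtree satisfying `test`."""
    if test(tree):
        return tree
    for child in tree[1:]:
        hit = find(child, test)
        if hit is not None:
            return hit
    return None

def substitute(tree, initial):
    """Substitution: replace the first leaf whose label matches the
    root label of the inserted initial tree."""
    site = find(tree, lambda t: len(t) == 1 and t[0] == initial[0])
    site[:] = copy.deepcopy(initial)

def adjoin(tree, label, auxiliary):
    """Adjoining: replace the node labelled `label` with the auxiliary
    tree, re-attaching the node's children under the foot node."""
    site = find(tree, lambda t: t[0] == label)
    children = site[1:]
    site[:] = copy.deepcopy(auxiliary)
    foot = find(site, lambda t: t[0] == label + "*")
    foot[:] = [label] + children

# Initial tree for "slept" with an open NP substitution site, then
# an auxiliary tree adjoining the adverb "deeply" onto the VP.
sentence = ["S", ["NP"], ["VP", ["V", ["slept"]]]]
substitute(sentence, ["NP", ["N", ["Ali"]]])
adjoin(sentence, "VP", ["VP", ["VP*"], ["Adv", ["deeply"]]])
print(sentence)
```

After substitution and adjoining, the derived tree is ["S", ["NP", ["N", ["Ali"]]], ["VP", ["VP", ["V", ["slept"]]], ["Adv", ["deeply"]]]] — the original VP has moved under the auxiliary tree's foot node.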

Lexicalized TAG (LTAG), a lexicalized variant of TAG, associates a lexical head with each elementary tree. A lexicon is a main part of such grammars, within which each lexical unit can be associated with a finite number of tree structures. The combination of the lexical unit and the structure defines the domain within which constraints can be stated. LTAG can also have multi-component heads, which extend its coverage to model idioms without additional complexity [27][28].

In addition to the lexicalization of TAGs, feature-based LTAG [29] and probabilistic LTAG [30][31] have also been developed. Several works addressing semantics with TAGs show their suitability for large-scale construction, such as LTAG semantics with semantic unification [32] and large-scale semantic construction for Tree Adjoining Grammars [33].

2.3.4.2. Application on Arabic language

The first attempt at extracting a Tree Adjoining Grammar from the Penn Arabic Treebank was made by Habash & Rambow (2004) [26]. As it is reported to be work in progress, the coverage of the extracted trees is not confirmed, and the total number of trees acquired from that limited corpus was not reported. It was also reported that no single TAG capable of parsing the whole corpus was acquired.
