3.2.1. AMIRA tools
Diab et al. [45] presented a system, named AMIRA, that tokenizes, PoS tags, and chunks raw text into base phrases. It is based on SVM classification [47]. An SVM is a supervised learning algorithm that finds a hyperplane separating a set of positive examples from a set of negative examples with maximum margin [48]. There are two versions of the system; the second, named AMIRA II, was reported to outperform the first [49], but it was not accessible to researchers at the time of this study.
This section therefore focuses on AMIRA I, the tool used in this work.
The tokenization phase assigns a tag to every character, taking into account a window of -5/+5 characters centered at the character in focus and the tags already assigned to previous characters in the current context. Each character receives one of six tags: B-PRE1, B-PRE2, B-WRD, I-WRD, B-SUFF, and I-SUFF. B-PRE1 marks the beginning of the first proclitic, B-PRE2 marks the beginning of the second proclitic, B-WRD marks the first character of the body of the word, I-WRD marks subsequent characters of the body of the word, B-SUFF marks the first character of the enclitic (suffix), and I-SUFF marks subsequent characters within the enclitic. [45]
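To make the tag scheme concrete, the following sketch decodes a character tag sequence into clitic tokens. The word وبكتابه ("and with his book", transliterated here as wbktAbh) and its tag assignment are illustrative examples, not actual AMIRA output.

```python
# Sketch: decoding AMIRA-style character tags into clitic tokens.
# The tag set is from Diab et al. [45]; the word and its tag
# assignment below are illustrative, not real system output.

def decode_tokens(chars, tags):
    """Group characters into tokens: every B-* tag opens a new token."""
    tokens = []
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            tokens.append(ch)      # B-PRE1, B-PRE2, B-WRD, B-SUFF open a token
        else:
            tokens[-1] += ch       # I-WRD, I-SUFF extend the current token
    return tokens

# wbktAbh = w (conjunction) + b (preposition) + ktAb (stem) + h (enclitic)
chars = list("wbktAbh")
tags = ["B-PRE1", "B-PRE2", "B-WRD", "I-WRD", "I-WRD", "I-WRD", "B-SUFF"]
print(decode_tokens(chars, tags))   # ['w', 'b', 'ktAb', 'h']
```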
An obvious problem, solved in the second version of AMIRA, is that a word with three proclitic characters cannot be modeled with these tags. In the word (وللكتاب), for example, the first letter of the body of the word is the fourth letter. [45]
The tokenization phase is reported to achieve Fβ=1 = 99.12. AMIRA's tokenization also tries to restore the feminine ending letter to its final form, so that the detached stem has the correct shape [45]. The word (رغبته) can be either a perfect verb or a noun. In a context that implies a verb, it should be tokenized into the two tokens (رغبت) and (ه), where no feminine ending restoration is needed. As a noun, it should be tokenized into (رغبة) and (ه), with the feminine ending restored. The system was trained on the Penn Arabic Treebank corpus [22].
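The رغبته example can be sketched in a few lines. The verb/noun decision logic below is a hypothetical stand-in (AMIRA learns it from context); only the orthographic effect of the restoration is shown.

```python
# Sketch: context-dependent feminine-ending restoration for رغبته,
# following the example in the text. The `reading` argument stands in
# for the contextual decision AMIRA makes; it is not a real API.

def tokenize_rgbth(reading):
    word = "رغبته"                  # surface form: stem + enclitic ه
    stem, enclitic = word[:-1], word[-1]
    if reading == "noun":
        # The stem-final ت is an orthographic variant of the feminine
        # ending ة; restore it so the detached stem is well-formed.
        stem = stem[:-1] + "ة"
    return [stem, enclitic]

print(tokenize_rgbth("verb"))   # ['رغبت', 'ه']
print(tokenize_rgbth("noun"))   # ['رغبة', 'ه']
```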
Part-of-Speech tagging depends on the tokenization phase and its feminine ending restoration. For every token, this phase outputs one of 24 tags, collapsed from the 135 morpho-syntactic tags of Aramorph. Among these tags are: JJ for adjective, RB for adverb, CC for coordinating conjunction, DT for determiner/demonstrative pronoun, FW for foreign word, NN for common noun (singular), NNS for common noun (plural or dual), NNP for proper noun (singular), NNPS for proper noun (plural or dual), RP for particle, VBP for imperfect verb, VBN for passive verb, VBD for perfect verb, UH for interjection, PRP for personal pronoun, PRP$ for possessive personal pronoun, CD for cardinal number, IN for subordinating conjunction (FUNC_WORD) or preposition (PREP), WP for relative pronoun, and WRB for wh-adverb. [45]
The features given to the classification module include a window of -2/+2 tokens centered at the focus token, the type of each token (alphabetic or numeric), the PoS tags of the previous tokens within the context window, and the character N-grams (N ≤ 4) that occur in the focus token. [45]
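A minimal sketch of this feature set is given below. The feature names and encoding are illustrative (the actual AMIRA feature representation is not published in this form); only the feature categories come from [45].

```python
# Sketch of the PoS feature set described in [45]: a -2/+2 token window,
# a coarse token type, tags already assigned inside the window, and
# character n-grams (n <= 4) of the focus token. Feature names here are
# hypothetical; a real system would feed these to an SVM.

def pos_features(tokens, prev_tags, i):
    feats = {}
    for off in range(-2, 3):                       # -2/+2 token window
        j = i + off
        feats[f"tok[{off}]"] = tokens[j] if 0 <= j < len(tokens) else "<PAD>"
    feats["type"] = "num" if tokens[i].isdigit() else "alpha"
    for off in (-2, -1):                           # tags already assigned
        j = i + off
        feats[f"tag[{off}]"] = prev_tags[j] if j >= 0 else "<PAD>"
    for n in range(1, 5):                          # character n-grams, n <= 4
        for k in range(len(tokens[i]) - n + 1):
            feats[f"ng{n}:{tokens[i][k:k + n]}"] = 1
    return feats

# Focus token كتاب, with the tag DT already assigned to the previous token:
feats = pos_features(["هذا", "كتاب", "جديد"], ["DT"], 1)
print(feats["tok[0]"], feats["tag[-1]"])   # كتاب DT
```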
The reported performance of the PoS tagging phase alone (without tokenization) is Fβ=1 = 95.49. It was reported that 50% of the errors result from inconsistent annotation of nouns versus adjectives in the corpus used. [45]
Base phrase chunking (or shallow parsing) is the task of grouping the elements of a given token stream, with each group labeled with a grammatical category and composed of a head word together with a constellation of function words. [50]
The base phrase chunker of AMIRA takes as input the PoS-tagged stream of tokens produced by the previous phase, and produces chunk boundaries labeled with 9 tags (ADJP, ADVP, CONJP, NP, VP, PP, PRT, SBAR, and UCP). [45]
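Chunk boundaries of this kind are conventionally encoded with per-token IOB tags (B-X opens a chunk of type X, I-X continues it, O is outside any chunk). The sketch below decodes such a tag sequence into labeled groups; the example tokens and tags are illustrative, not AMIRA output.

```python
# Sketch: turning per-token IOB chunk tags (B-X / I-X / O) into labeled
# base-phrase groups. The chunk labels are those listed in the text; the
# example sentence fragment and its tags are hypothetical.

def iob_to_chunks(tokens, tags):
    chunks = []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            chunks.append((tag[2:], [tok]))    # open a new labeled chunk
        elif tag.startswith("I-") and chunks:
            chunks[-1][1].append(tok)          # extend the open chunk
        else:
            chunks.append(("O", [tok]))        # outside any base phrase
    return [(label, " ".join(toks)) for label, toks in chunks]

print(iob_to_chunks(["فى", "البيت", "الكبير"],
                    ["B-PP", "B-NP", "I-NP"]))
# [('PP', 'فى'), ('NP', 'البيت الكبير')]
```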
The reported performance for base-phrase chunking is Fβ=1 = 92.08. [45]
3.2.2. Other Tools
A clitic tokenizer was developed as a component of the diacritic restoration system for Arabic words in [51]. A trigram probabilistic model with weighted letter-based transducers is combined with a clitic tokenizer transducer in a sequential architecture, training models at different levels (word level and token level). It achieved a 7.33% word error rate without case ending restoration (Acc, Gen, and Nom). The only reported indication of the clitic separation performance is the decrease in word error rate obtained by concatenating clitics with the stems, an improvement of about 6%. The system also has some success in dealing with foreign transliterated names.
Smith et al. [53] presented an implementation of conditional random fields (CRFs) [52] for morphological disambiguation that takes full account of existing morphological dictionaries by estimating conditionally against the dictionary-accepted analyses of a sentence.
Habash & Rambow [44] developed a single operation for tokenization and morphological tagging (including PoS tagging) that consists of three steps. The first is to obtain all possible analyses for every word of the given sentence. Second, a classification process for ten morphological features (Part-of-Speech, Has-Conjunction, Has-Particle, Has-Pronoun, Has-Determiner, Gender, Number, Person, Voice, Aspect) is applied to the words. The third step is to choose, from the possible morphological analyses, those that match the output of the classification process.
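The three steps can be sketched as follows. Both `analyses` and `classify` are hypothetical stand-ins for a real morphological analyzer and trained per-feature classifiers; agreement counting is one plausible way to realize the matching step, not necessarily the exact procedure of [44].

```python
# Sketch of the three-step scheme of Habash & Rambow [44]: enumerate all
# analyses, classify ten morphological features independently, then keep
# the analysis that best agrees with the classifier predictions.
# Feature names and the agreement heuristic are illustrative.

FEATURES = ["pos", "conj", "part", "pron", "det",
            "gender", "number", "person", "voice", "aspect"]

def disambiguate(word, analyses, classify):
    """analyses: list of dicts mapping feature -> value, one per possible
    analysis; classify(word, feature) -> predicted value (one classifier
    per feature). Returns the analysis matching the most predictions."""
    predicted = {f: classify(word, f) for f in FEATURES}
    def agreement(a):
        return sum(a.get(f) == predicted[f] for f in FEATURES)
    return max(analyses, key=agreement)

# Toy usage: two fake analyses, classifiers that predict the noun reading.
analyses = [{"pos": "verb", "person": "1"},
            {"pos": "noun", "gender": "fem"}]
best = disambiguate("رغبته", analyses,
                    lambda w, f: {"pos": "noun", "gender": "fem"}.get(f))
print(best)   # {'pos': 'noun', 'gender': 'fem'}
```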
3.3. Rule based parsing and its problems
Casbeer et al. [42] presented a first grammar for Arabic in the link grammar formalism that follows the tokenization approach. The implementation uses the Link Grammar parser presented by Sleator and Temperley [39]. The written grammar uses a selected set of stems from BAMA, with the prefixes and suffixes hard-coded in the grammar. The criteria by which Casbeer selected these stems are not clear.
Starting from this paradigm, and as a step toward developing this solution, one important point was taken as the starting point: adding a statistical tokenizer to the system developed by Casbeer et al. [42]. It was reported in the same work that the system simply takes the first solution provided by BAMA. A major drawback of this choice for a wide-scope syntax analyzer is that it produces erroneous results: since BAMA sorts its solutions by unreported criteria, this ordering may be harmful for other domains that would prefer different criteria.
As mentioned earlier in this chapter, AMIRA (version 1) achieves around 99% accuracy tokenizing the Penn Arabic Treebank. In this study, AMIRA is used as a placeholder for a domain-specific tokenizer. Since it was trained on newswire text, it counts as a state-of-the-art tool for similar corpora; its coverage over very large corpora is not reported.
The system developed by Casbeer et al. [42] depends on the tokenization produced by BAMA: every prefix and suffix is separated from the stem. Below is the analysis produced by BAMA (with tokenization) for the word "رغبته". In the following example, the word is always analyzed by Casbeer's grammar as a perfect verb.
LOOK-UP WORD: رغبته
SOLUTION 1: (رَغِبْتُهُ) [ragib-a_1] ragib/VERB_PERFECT + tu/PVSUFF_SUBJ:1S + hu/PVSUFF_DO:3MS
(GLOSS): + wish/desire + I <verb> it/him
SOLUTION 2: (رَغِبْتَهُ) [ragib-a_1] ragib/VERB_PERFECT + ta/PVSUFF_SUBJ:2MS + hu/PVSUFF_DO:3MS
(GLOSS): + wish/desire + you [masc.sg.] <verb> it/him
SOLUTION 3: (رَغِبْتِهِ) [ragib-a_1] ragib/VERB_PERFECT + ti/PVSUFF_SUBJ:2FS + hu/PVSUFF_DO:3MS
(GLOSS): + wish/desire + you [fem.sg.] <verb> it/him
SOLUTION 4: (رَغِبَتْهُ) [ragib-a_1] ragib/VERB_PERFECT + at/PVSUFF_SUBJ:3FS + hu/PVSUFF_DO:3MS
(GLOSS): + wish/desire + it/they/she <verb> it/him
SOLUTION 5: (رَغْبَته) [ragobap_1] ragob/NOUN + ap/NSUFF_FEM_SG + hu/POSS_PRON_3MS
(GLOSS): + desire/wish + his/its
Listing 3.1: BAMA output for the word (رغبته)
The grammar presented by Casbeer et al. [42] uses these tokens as the basic units among which linking rules are deduced.
Tokenization, as used in [45], means the process of segmenting clitics. Clitics are those tokens (a subset of the prefixes and suffixes) that play a distinct role in the syntactic structure of the sentence. They are divided into two sets: proclitics and enclitics. Proclitics are clitics (a subset of prefixes) that attach to the beginning of a stem; not every prefix is a proclitic. The noun determiner (ال), for example, is a prefix but not a proclitic. Enclitics are clitics that attach to the end of the word, and likewise not every suffix is an enclitic; the number suffixes (ات, ان, …) are examples of the exception.
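The distinctions above can be illustrated with a few hand-coded segmentations. This is a lookup table for exposition only; a real tokenizer must decide each split from context rather than from a word list.

```python
# Toy illustration of the proclitic/enclitic distinctions described above.
# The conjunction و and preposition ب are proclitics and are split off;
# the determiner ال is a prefix but not a proclitic, so it stays attached;
# the number suffix ات is a suffix but not an enclitic, so it also stays.
# All segmentations are hand-coded for illustration.

SEGMENTATIONS = {
    "والكتاب": ["و", "الكتاب"],   # conjunction و split, determiner ال kept
    "بالبيت":  ["ب", "البيت"],    # preposition ب split, determiner ال kept
    "كتابها":  ["كتاب", "ها"],    # possessive enclitic ها split off
    "معلمات":  ["معلمات"],        # number suffix ات is not an enclitic
}

for word, tokens in SEGMENTATIONS.items():
    print(word, "->", " + ".join(tokens))
```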
Word segmentation is a step revisited by many researchers and has been dealt with in several other publications. Lee et al. [54] used a corpus of manually segmented words that appears to be a subset of the initial release of the ATB (110,000 words). They obtained a list of prefixes and suffixes from this corpus, apparently augmented by a manually derived list of other affixes; unfortunately, the full segmentation criteria are not given. A trigram model is then learned from the segmented training corpus and used to choose among competing segmentations of words in running text. In addition, a huge unannotated corpus (155 million words) is used to iteratively learn additional stems. Lee et al. show that this unsupervised use of the large corpus for stem identification increases accuracy. Overall, their error rate is higher than that of Habash & Rambow [44] (2.9% vs. 0.7%), presumably because they do not use a morphological analyzer. There has also been a fair amount of work on entirely unsupervised segmentation: Rogati et al. [55] investigate unsupervised learning of stemming (a variant of tokenization in which only the stem is retained) using Arabic as the example language, and Darwish [56] discusses unsupervised identification of roots. Root identification is outside the scope of this study.
Although Habash & Rambow [44] reported higher accuracy than Diab et al. [45], we chose the latter, as it was the only tagger available at the time of writing this study.
In this chapter, the word "tokenization" refers to the process of separating clitics from the stem in contexts where the system presented by Diab et al. [45] is used; otherwise, it refers to tokenization as produced by BAMA.
A typical syntax analysis run of the system developed by Casbeer et al. [42] consists of three steps: the given sentence is sent to BAMA; the first solution for every word is picked; and the tokens of these first solutions are sent to the Link Grammar parser, which produces the syntax analysis graphs.
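The three steps above can be sketched as a short pipeline. Both `bama_analyze` and `link_parse` are hypothetical stand-ins for the real tools; only the control flow (take the first BAMA solution per word, then parse) comes from the text.

```python
# Sketch of the pipeline attributed to Casbeer et al. [42]:
# BAMA analysis -> first solution per word -> Link Grammar parsing.
# The two callables are placeholders, not real tool interfaces.

def parse_sentence(sentence, bama_analyze, link_parse):
    tokens = []
    for word in sentence.split():
        solutions = bama_analyze(word)   # all BAMA solutions for the word
        first = solutions[0]             # only the first solution is used
        tokens.extend(first)             # its clitic-separated tokens
    return link_parse(tokens)            # syntax analysis graph(s)

# Toy usage: a fake analyzer returning one single-token solution per word,
# and a fake parser that just echoes the token stream it receives.
toks = parse_sentence("كتاب جديد",
                      lambda w: [[w]],
                      lambda ts: ts)
print(toks)   # ['كتاب', 'جديد']
```

This structure also makes the drawback discussed earlier visible: whatever solution BAMA happens to rank first is committed to before parsing begins.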
Although the Link parsing system presented by Sleator and Temperley [39] is capable of inferring the possible grammatical function of a word that is not in the dictionary, it cannot infer a solution when the word has different token segmentations among its BAMA analyses. Link grammar does not accept more than a single word per word position; nor does it accept more than one sentence per request, which would allow falling back from one sentence to another.
Adding a statistical tokenization phase before feeding the input to the Link parser will increase the accuracy of the parsing system, provided the statistical tokenizer is trained on data from the same domain to which syntax analysis is going to be applied.