Monday, February 17, 2014

Developments on Syntax Analysis with Word Segmentation (part-1)

3.1. Introduction

This chapter gives a brief overview of “token based language processing” tools for Arabic, followed by an attempt to use these resources to parse the Arabic sentence.

Syntax analysis of the Arabic sentence cannot be addressed without tight integration with the morphological analysis of its components. An Arabic word may contain several morphemes, each with a grammatical function that plays a role in the overall syntactic structure of the sentence. This morpho-syntactic nature of Arabic makes it hard to separate morphological analysis and syntax analysis into two distinct levels. Unlike the Latin-derived languages, where a word usually has a single grammatical function, a single Arabic word can carry all of the grammatical functions needed for a complete sentence [43]. Several attempts have been made to segment the Arabic word into tokens, each of which has a separate grammatical function [37] [44] [45]. The segmentation was used as a simplification step towards building higher-level analysis tools that operate on the grammatical functions of the segments.
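
To make the idea of segmenting a word into tokens with separate grammatical functions concrete, the following is a minimal sketch and not the method of any of the cited works. It assumes Buckwalter-style transliteration, and the clitic tables, tag names, and the `segment` function are hypothetical and deliberately tiny. It splits the single word "wsyktbwnhA" ("and they will write it") into a conjunction, a future particle, a verb stem, and an object pronoun.

```python
from typing import List, Tuple

# Toy clitic tables in Buckwalter-style transliteration.
# These are illustrative assumptions, far from a complete inventory.
PROCLITICS = [("w", "CONJ"), ("f", "CONJ"), ("s", "FUT_PART")]
ENCLITICS = [("hA", "OBJ_PRON"), ("hm", "OBJ_PRON"), ("h", "OBJ_PRON")]


def segment(word: str, min_stem: int = 3) -> List[Tuple[str, str]]:
    """Greedily strip known proclitics, then at most one enclitic."""
    prefixes: List[Tuple[str, str]] = []
    stem = word

    # Strip proclitics from the left while enough characters remain for a stem.
    stripped = True
    while stripped:
        stripped = False
        for form, tag in PROCLITICS:
            if stem.startswith(form) and len(stem) - len(form) >= min_stem:
                prefixes.append((form, tag))
                stem = stem[len(form):]
                stripped = True
                break

    # Strip at most one enclitic from the right.
    suffixes: List[Tuple[str, str]] = []
    for form, tag in ENCLITICS:
        if stem.endswith(form) and len(stem) - len(form) >= min_stem:
            suffixes.append((form, tag))
            stem = stem[:-len(form)]
            break

    return prefixes + [(stem, "STEM")] + suffixes


if __name__ == "__main__":
    for token, tag in segment("wsyktbwnhA"):
        print(token, tag)
```

A greedy rule like this mis-segments any word whose stem happens to begin or end with one of the listed letters (for example, it would split "slAm" into "s" + "lAm"), which is one reason practical segmenters rely on statistical models and context rather than fixed affix lists.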

However, this approach runs into the problem of obtaining a correct segmentation [46], and it has not been reported to work in large-scale applications without training on larger corpora.

This chapter first gives a brief overview of previous work on statistical tokenization and PoS tagging. The token based syntax analysis approach is then revisited, guided by that statistical work. The two approaches are integrated to overcome the problems that each has on its own. The attempt presented in this chapter can be summarized as building a morpho-syntactic parser for Arabic from existing tools that are accessible to researchers.
