Looking into the Operational Modalities Adopted in Some of the POS Tagging Tools in Identification of Contextual Part-of-Speech of Words in Texts

Kesavan Vadakalur Elumalai; Niladri Sekhar Das; Mufleh Salem M. Alqahtani; Anas Maktabi

doi:10.7575/aiac.ijalel.v.8n.6p.92

Abstract

Part-of-speech (POS) tagging is an indispensable method of text processing. Its main aim is to assign a part-of-speech to each word after considering its actual contextual syntactic-cum-semantic role in the text where it occurs (Siemund & Claridge 1997). This is a useful strategy in language processing, language technology, machine learning, machine translation, and computational linguistics, as it generates output that enables a system to work with natural language texts with greater accuracy and success. Part-of-speech tagging is also known as ‘grammatical annotation’ and ‘word category disambiguation’ in some areas of linguistics where analysis of the form and function of words is an important avenue for better comprehension and application of texts. Since the primary task of POS tagging is to assign a tag to each word, manually or automatically, in a piece of natural language text, it has to pay adequate attention to the contexts in which words are used. This is a tough challenge for a system, as it normally does not know how a word carries specific linguistic information in a text or what kind of larger syntactic frame the word requires for its operation. The present paper takes up this issue and critically explores how well some of the well-known POS tagging systems handle this challenge, and whether they succeed in assigning appropriate POS tags to words without accessing information from extratextual domains. The novelty of the paper lies in its attempt to examine some of the POS tagging schemes proposed so far to see whether the systems actually deal successfully with the complexities involved in tagging words in texts.
It also checks whether the performance of these systems is better than manual POS tagging, and whether the information and insights gathered from such enterprises are useful for enhancing our understanding of the identity and function of words used in texts. All these questions are addressed with reference to some of the POS taggers available to us. Moreover, the paper considers how a POS-tagged text is useful in various applications, thereby raising awareness among language users of the multifunctionality of tagged texts.
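As a rough illustration of why context matters when a tag is assigned to each word, the following Python sketch tags an ambiguous word differently depending on the tag of the preceding word. It is a toy example only: the lexicon, the tag names (DET, PRON, NOUN, VERB), the fallback to NOUN for unknown words, and the two contextual rules are all assumptions made for this illustration and are not drawn from any of the taggers examined in the paper.

```python
# Toy illustration only: the lexicon, tag names, and rules below are
# assumptions for this sketch, not part of any tagger discussed in the paper.
LEXICON = {
    "the": ["DET"],
    "a": ["DET"],
    "they": ["PRON"],
    "we": ["PRON"],
    "book": ["NOUN", "VERB"],  # ambiguous when seen out of context
    "flight": ["NOUN"],
    "read": ["VERB"],
}


def tag(tokens):
    """Assign one tag per token, using the previous tag to resolve ambiguity."""
    tagged = []
    prev_tag = None
    for token in tokens:
        # Unknown words fall back to NOUN (an arbitrary choice for this sketch).
        candidates = LEXICON.get(token.lower(), ["NOUN"])
        if len(candidates) == 1:
            choice = candidates[0]
        elif prev_tag == "DET" and "NOUN" in candidates:
            # After a determiner, prefer the noun reading.
            choice = "NOUN"
        elif prev_tag in ("PRON", None) and "VERB" in candidates:
            # After a pronoun, or sentence-initially, prefer the verb reading.
            choice = "VERB"
        else:
            choice = candidates[0]
        tagged.append((token, choice))
        prev_tag = choice
    return tagged


print(tag("They book a flight".split()))
print(tag("We read the book".split()))
```

In the first sentence "book" follows a pronoun and is tagged as a verb; in the second it follows a determiner and is tagged as a noun. Real systems replace such hand-written rules with much larger rule sets or statistical models, but the dependence on surrounding words is the same.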

Keywords

Annotation, Tagging, Part-of-Speech, Morphology, Syntax, Semantics, Contexts

References

Abney, S. 1997. Part-of-speech tagging and partial parsing. In: Schreibman, S., Siemens, R.G. & Unsworth, J.M. eds. Corpus-Based Methods in Language and Speech: A Companion to Digital Humanities. London: Blackwell. Pp. 118-136.

Archer, D. & Culpeper, J. 2003. Sociopragmatic annotation: New directions and possibilities in historical corpus linguistics. In: Wilson, A., Rayson, P. & McEnery, A.M. eds. Corpus Linguistics by the Lune: A Festschrift for Geoffrey Leech. Frankfurt: Peter Lang. Pp. 37-58.

Archer, D., McEnery, T., Rayson, P., & Hardie, A. 2003. Developing an automated semantic analysis system for Early Modern English. Proceedings of the Corpus Linguistics 2003 conference. UCREL, Lancaster University. Pp. 22-31.

Archer, D., Rayson, P., Piao, S., & McEnery, T. 2004. Comparing the UCREL Semantic Annotation Scheme with Lexicographical Taxonomies. In: Williams, G. & Vessier, S. eds. Proceedings of the 11th EURALEX (European Association for Lexicography) International Congress (Euralex 2004), Lorient, France, 6-10 July 2004. Université de Bretagne Sud. Volume III. Pp. 817-827.

Biber, D., Conrad, S. & Reppen, R. 1998. Corpus Linguistics: Investigating Language Structure and Use. Cambridge: Cambridge University Press.

Biber, D., Finegan, E., & Atkinson, D. 1994. ARCHER and its challenges: compiling and exploring a representative corpus of historical English registers. In: Fries, U., Tottie, G. & Schneider, P. eds. Creating and Using English Language Corpora. Amsterdam: Rodopi. Pp. 1-14.

Brants, T. 2000. TnT - A Statistical Part-of-Speech Tagger. Proceedings of the Sixth Applied Natural Language Processing Conference (ANLP-2000), Seattle, WA, USA. Pp. 37-42.

Brill, E. 1992. A simple rule-based part of speech tagger. Proceedings of the Workshop on Speech and Natural Language (HLT-91), Morristown, NJ, USA: Association for Computational Linguistics. Pp. 112-116.

Brill, E. 1995. Transformation-based error-driven learning and natural language processing: a case study in part-of-speech tagging. Computational Linguistics. 21(4): 543-565.

Britto, H., Galves, C., Ribeiro, I., Augusto, M., & Scher, A. 1999. Morphological Annotation System for Automatic Tagging of Electronic Textual Corpora: from English to Romance Languages. Proceedings of the 6th International Symposium of Social Communication. Santiago, Cuba. Pp. 582-589.

Charniak, E. 1997. Statistical Techniques for Natural Language Parsing. Artificial Intelligence Magazine. 18(4): 33-44.

Dandapat, S. 2009. Part-of-Speech tagging for Bengali. MS Thesis, Dept. of Computer Science and Engineering, Indian Institute of Technology, Kharagpur, India (MS).

Dash, N.S. 2011. Principles of Part-Of-Speech (POS) Tagging in Indian Language Corpora. In: Vetulani, Z. ed. Proceedings of 5th Language Technology Conference (LTC-2011): Human Language Technologies as a challenge for computer science and linguistics. Poznan, Poland, 25-27 November 2011, Pp. 101-105.

DeRose, S.J. 1988. Grammatical category disambiguation by statistical optimization. Computational Linguistics. 14(1): 31-39.

DeRose, S.J. 1990. Stochastic Methods for Resolution of Grammatical Category Ambiguity in Inflected and Uninflected Languages. Doctoral Dissertation, Department of Cognitive and Linguistic Sciences, Providence, RI: Brown University, USA.

Fligelstone, S., Pacey, M. & Rayson, P. 1997. How to generalise the task of annotation. In: Garside, R., Leech, G. & McEnery, A. eds. Corpus Annotation: Linguistic Information from Computer Text Corpora. London: Longman. Pp. 122-136.

Fligelstone, S., Rayson, P. & Smith, N. 1996. Template analysis: bridging the gap between grammar and the lexicon. In: Thomas, J. & Short, M. eds. Using Corpora for Language Research. Harlow: Longman. Pp. 181-207.

Forney, G.D. 1973. The Viterbi algorithm. Proceedings of the IEEE. 61(3): 268-278. doi:10.1109/PROC.1973.9030.

Garside, R. & Smith, N. 1997. A hybrid grammatical tagger: CLAWS4. In: Garside, R., Leech, G. & McEnery, A. eds. Corpus Annotation: Linguistic Information from Computer Text Corpora. London: Longman. Pp. 102-121.

Garside, R. 1987. The CLAWS Word-tagging System. In: Garside, R., Leech, G. & Sampson, G. eds. Computational Analysis of English: A Corpus-based Approach. London: Longman. Pp. 30-41.

Garside, R. 1995. Grammatical tagging of the spoken part of the British National Corpus: a progress report. In: Leech, G., Myers, G. & Thomas, J. eds. Spoken English on Computer: Transcription, Markup, and Application. London: Longman. Pp. 161-167.

Garside, R. 1996. The robust tagging of unrestricted text: the BNC experience. In: Thomas, J. & Short, M. eds. Using Corpora for Language Research: Studies in the Honour of Geoffrey Leech. London: Longman. Pp. 167-180.

Kupiec, J. 1992. Robust part-of-speech tagging using a hidden Markov model. Computer Speech and Language. 6(1): 3-15.

Kytö, M. & Rissanen, M. 1993. General introduction. In: Rissanen, M., Kytö, M., & Palander-Collin, M. eds. Early English in the computer age: explorations through the Helsinki Corpus. Berlin: Mouton de Gruyter. Pp. 1-17.

Kytö, M. & Voutilainen, A. 1995. Applying the Constraint Grammar Parser of English to the Helsinki Corpus. ICAME Journal 19: 23-48.

Leech, G. & Eyes, E. 1993. Syntactic annotation: linguistic aspects of grammatical tagging and skeleton parsing. In: Black, E., Garside, R. & Leech, G. eds. Statistically-driven Computer Grammars of English: the IBM/Lancaster Approach. Amsterdam: Rodopi. Pp. 36-61.

Leech, G. 1997. Introducing Corpus Annotation. In: Garside, R., Leech, G., & McEnery, A. eds. Corpus Annotation: Linguistic Information from Computer Text Corpora. London: Longman. Pp. 1-18.

Leech, G., Garside, R. & Atwell, E. 1983. The automatic tagging of the LOB Corpus. International Computer Archive of Modern English News. 7(1): 110-117.

Leech, G., Garside, R. & Bryant, M. 1994. CLAWS4: The tagging of the British National Corpus. Proceedings of the 15th International Conference on Computational Linguistics (COLING 94) Kyoto, Japan. Pp. 622-628.

Leech, G., Rayson, P., & Wilson, A. 2001. Word Frequencies in Written and Spoken English: based on the British National Corpus. London: Longman.

Mueller, M. 2005. The Nameless Shakespeare. Working Papers from the First and Second Canadian Symposium on Text Analysis Research (CaSTA). Computing in the Humanities Working Papers (CHWP 34).

Nissim, M., Matheson, C. & Reid, J. 2004. Recognizing Geographical Entities in Scottish Historical Documents. Proceedings of the Workshop on Geographic Information Retrieval at SIGIR 2004.

Osselton, N.E. 1984. Informal spelling systems in Early Modern English: 1500-1800. In: Blake, N.F. & Jones, C. eds. English Historical Linguistics: Studies in development. The Centre for English Cultural Tradition and Language, University of Sheffield. Pp. 123-137.

Pajarskaite, G. 2004. Designing HMM-based Part-of-Speech Tagging for Lithuanian Language. Informatica. 15(2): 231-242.

Piao, S.L., Rayson, P., Archer, D., & McEnery, T. 2004. Evaluating lexical resources for a semantic tagger. Proceedings of 4th International Conference on Language Resources and Evaluation (LREC 2004), 26-28 May 2004, Lisbon, Portugal, Volume II, Pp. 499-502.

Rayson, P., Archer, D., & Smith, N. 2005. VARD versus WORD: A comparison of the UCREL variant detector and modern spellcheckers on English historical corpora. Proceedings of Corpus Linguistics 2005, Birmingham University, July 14-17, 2005.

Santorini, B. 1990. Part-of-speech tagging guidelines for the Penn Treebank Project. Technical Report MS-CIS-90-47, Department of Computer and Information Science, University of Pennsylvania.

Thede, S.M. & Harper, M.P. 1999. A second-order Hidden Markov Model for part-of-speech tagging. Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics. Pp. 175-182.

Siemund, R. & Claridge, C. 1997. The Lampeter Corpus of Early Modern English Tracts. ICAME Journal, 21: 61-70.

Sinclair, J. 2004. Trust the Text: Language, Corpus, and Discourse. London and New York: Routledge.

Smith, N. 1997. Improving a tagger. In: Garside, R., Leech, G. & McEnery, A. eds. Corpus Annotation: Linguistic Information from Computer Text Corpora. London: Longman. Pp. 137-150.

This work is licensed under a Creative Commons Attribution 4.0 International License.

2012-2023 (CC-BY) Australian International Academic Centre PTY.LTD.

International Journal of Applied Linguistics and English Literature
