Article

Text Segmentation

Andrei Mikheev

in The Oxford Handbook of Computational Linguistics

Published in print January 2005 | ISBN: 9780199276349
Published online September 2012 | DOI: http://dx.doi.org/10.1093/oxfordhb/9780199276349.013.0010

Series: Oxford Handbooks in Linguistics

Preview

This article treats electronic text as essentially a sequence of characters. At a minimum, text needs to be segmented into linguistic units such as words, punctuation, numbers, and alphanumerics; this process is called tokenization. Most natural language processing techniques also require text to be segmented into sentences. The article briefly reviews some evaluation metrics and standard resources commonly used for text segmentation tasks. Languages in which tokens are written directly attached to one another, using pictogram characters or other native writing systems, present substantial challenges for computational analysis, and the article outlines various computational approaches to tackling them in different languages. It focuses on low-level tasks such as tokenization and sentence segmentation.
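
As a rough illustration of the two low-level tasks named above, the following Python sketch tokenizes a string with a regular expression and splits text into sentences with a naive punctuation rule. It is not the chapter's method: the names `TOKEN_RE`, `tokenize`, and `split_sentences` are hypothetical, the pattern assumes space-delimited text, and the sentence rule will mis-split after abbreviations such as "Dr.".

```python
import re

# Hypothetical token pattern; alternatives are tried left to right.
TOKEN_RE = re.compile(
    r"""
    \d+(?:[.,]\d+)*(?!\w)   # numbers such as 3, 3.50 or 1,000
  | \w+(?:[-']\w+)*         # words and alphanumerics, allowing internal - or '
  | [^\w\s]                 # any other single non-space character (punctuation)
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return the list of tokens found in text, in order."""
    return TOKEN_RE.findall(text)


def split_sentences(text):
    """Naive rule: break after . ! or ? when followed by whitespace and a capital letter."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())


if __name__ == "__main__":
    print(tokenize("The X5 costs $3.50, doesn't it?"))
    print(split_sentences("It rained. We stayed in! Why not?"))
```

A whitespace-and-punctuation rule of this kind does not apply to writing systems in which tokens are not separated by spaces, which is why such languages call for different approaches.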

Keywords: linguistic units; text; segmented; tokenization; computational analysis; evaluation metrics

Article. 7746 words.

Subjects: Linguistics ; Computational Linguistics
