Text Segmentation

Andrei Mikheev

in The Oxford Handbook of Computational Linguistics

Published in print January 2005 | ISBN: 9780199276349
Published online September 2012 | e-ISBN: 9780191743573 | DOI:

Series: Oxford Handbooks in Linguistics


This article treats electronic text as essentially a sequence of characters. Before further processing, text must be segmented at least into linguistic units such as words, punctuation, numbers, and alphanumerics; this process is called tokenization. Most natural language processing techniques also require text to be segmented into sentences. The article briefly reviews evaluation metrics and standard resources commonly used for text segmentation tasks. It notes that languages in which tokens are directly attached to each other, such as those written with pictogram characters or other native writing systems, present substantial challenges for computational analysis, and it outlines various computational approaches to tackle them in different languages. Its focus is on low-level tasks such as tokenization and sentence segmentation.
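
As a rough illustration of the tokenization step described above, the following minimal Python sketch splits a character sequence into words, punctuation, numbers, and alphanumerics. The pattern `TOKEN_RE` and the function `tokenize` are hypothetical names chosen for this example; they are assumptions and are not taken from the article, which may describe quite different methods.

```python
import re

# Hypothetical token pattern (an assumption for illustration only).
# Alternatives are tried left to right, so mixed letter/digit tokens
# are checked before plain numbers and plain words.
TOKEN_RE = re.compile(
    r"[A-Za-z]+\d[A-Za-z\d]*|\d+[A-Za-z][A-Za-z\d]*"  # alphanumerics, e.g. BA2490
    r"|\d+(?:[.,]\d+)*"                                # numbers, e.g. 1,250.50
    r"|[^\W\d_]+"                                      # words (letters only)
    r"|[^\w\s]"                                        # single punctuation marks
)


def tokenize(text):
    """Return the tokens found in text; whitespace is discarded."""
    return TOKEN_RE.findall(text)


tokens = tokenize("Flight BA2490 costs 1,250.50 dollars, right?")
print(tokens)
```

A pattern of this kind only works when tokens are separated by whitespace or punctuation; it does not address writing systems in which tokens are directly attached to each other.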

Keywords: linguistic units; text; segmented; tokenization; computational analysis; evaluation metrics

Article.  7757 words. 

Subjects: Computational Linguistics
