Finding and Using High Quality Word-Pairs
for Enhanced Text Categorization

Masters Thesis

Presented to the Computer Science Faculty
of the University of California at Santa Barbara

by Chade-Meng Tan

June 2000



Abstract

Text categorization is the task of automated assignment of natural language texts to predefined categories based on their content.  Most text categorization systems use single words (unigrams) as features.  A deceptively simple idea for enhancing text categorization is investigated here, an idea that has been previously shown not to work.  It is to identify useful word pairs (bigrams) made up of adjacent unigrams.  For example, to identify the bigram "artificial+intelligence" from the unigrams "artificial" and "intelligence".  It is intuitively obvious that  many bigrams describe concepts better than their component unigrams.

We devised and refined an elegantly simple statistical algorithm that is able to identify high-quality bigrams, with only two passes of the training data, and using only simple parameters such as word occurrence counts and information gain.  The bigrams it found, while very small in numbers, were able to substantially raise the quality of the feature set (45.8% increase in average information gain for one dataset).

The algorithm was run on 22 categories from 2 manually pre-classified datasets.  One dataset is a collection of websites from the Science hierarchy on the Yahoo! website (14,477 documents, 160,975 unique words), the other is the Reuters-21578 dataset of Reuters news articles (22,173 documents, 42,418 unique words).

All categories we studied showed increases in precision/recall break-even (BE) point, with the highest increase measured at 27.6%.  The increase in mirco-averaged BE point for the Yahoo dataset was 6.8 % (11.9% macro-averaged) and that for the Reuters dataset was 2.6% (6.3% macro-averaged).  The Yahoo dataset also showed significant increase in the F1 measure (micro-average: 6.7%, macro-average: 12.1 %).  However, the Reuters dataset showed only insignificant increases in the same measure (micro-average:  0.7%, macro-average: 3.0%).

Our study also revealed the main strength and weakness of our algorithm.  It is most successful at increasing the number of correctly classified positive documents, but its main weakness is that it causes more negative documents to be classified incorrectly.  This explains why it performed much better on the Yahoo-Science than the Reuters dataset, since the latter has a much higher recall and lower precision (ie, most of its positive documents are already correctly classified).  Finally, we suggest a possible solution to this problem: adding a second stage to our current classifier that is specially trained to identify negative documents incorrectly classified by the first stage.
 


Abstract, Contents, etc
(Postscript-gzipped)  (PDF-gzipped)

Chapters 1-3
(Postscript-gzipped)  (PDF-gzipped)

Chapter 4
(Postscript-gzipped)  (PDF-gzipped)

Chapters 5-6, References
(Postscript-gzipped)  (PDF-gzipped)