We devised and refined an elegantly simple statistical algorithm that is able to identify high-quality bigrams, with only two passes of the training data, and using only simple parameters such as word occurrence counts and information gain. The bigrams it found, while very small in numbers, were able to substantially raise the quality of the feature set (45.8% increase in average information gain for one dataset).
The algorithm was run on 22 categories from 2 manually pre-classified datasets. One dataset is a collection of websites from the Science hierarchy on the Yahoo! website (14,477 documents, 160,975 unique words), the other is the Reuters-21578 dataset of Reuters news articles (22,173 documents, 42,418 unique words).
All categories we studied showed increases in precision/recall break-even (BE) point, with the highest increase measured at 27.6%. The increase in mirco-averaged BE point for the Yahoo dataset was 6.8 % (11.9% macro-averaged) and that for the Reuters dataset was 2.6% (6.3% macro-averaged). The Yahoo dataset also showed significant increase in the F1 measure (micro-average: 6.7%, macro-average: 12.1 %). However, the Reuters dataset showed only insignificant increases in the same measure (micro-average: 0.7%, macro-average: 3.0%).
Our study also revealed the main strength and weakness of our algorithm.
It is most successful at increasing the number of correctly classified positive
documents, but its main weakness is that it causes more negative documents
to be classified incorrectly. This explains why it performed much
better on the Yahoo-Science than the Reuters dataset, since the latter
has a much higher recall and lower precision (ie, most of its positive
documents are already correctly classified). Finally, we suggest
a possible solution to this problem: adding a second stage to our current
classifier that is specially trained to identify negative documents incorrectly
classified by the first stage.
Abstract, Contents, etc
(Postscript-gzipped)
(PDF-gzipped)
Chapters 1-3
(Postscript-gzipped)
(PDF-gzipped)
Chapter 4
(Postscript-gzipped)
(PDF-gzipped)
Chapters 5-6, References
(Postscript-gzipped)
(PDF-gzipped)