Learning fast classifiers for image spam

Mark Dredze, Reuven Gevaryahu, Ari Elias-Bachrach

Research output: Contribution to conference › Paper › peer-review

Abstract

Recently, spammers have proliferated "image spam": emails that contain the text of the spam message in a human-readable image instead of the message body, making detection by conventional content filters difficult. New techniques are needed to filter these messages. Our goal is to automatically classify an image directly as being spam or ham. We present features that focus on simple properties of the image, making classification as fast as possible. Our evaluation shows that they classify spam images with accuracy in excess of 90% and up to 99% on real-world data. Furthermore, we introduce a new feature selection algorithm that selects features for classification based on their speed as well as their predictive power. This technique produces an accurate system that runs in a tiny fraction of the time. Finally, we introduce Just in Time (JIT) feature extraction, which creates features at classification time as needed by the classifier. We demonstrate JIT extraction using a JIT decision tree that further increases system speed. This paper makes image spam classification practical by providing both high-accuracy features and a method to learn fast classifiers.
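
The Just in Time idea in the abstract can be illustrated with a minimal sketch: a decision tree that computes a feature only when a node on the path actually tests it. This is not the authors' implementation. The feature names (`byte_size`, `aspect_ratio`, `mean_intensity`), the thresholds, the tree shape, and the dictionary representation of an image are all hypothetical, chosen only to show the lazy-evaluation pattern.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple, Union

Image = Dict[str, Any]  # hypothetical stand-in for a decoded image


class LazyFeatures:
    """Computes each feature at most once, and only when it is requested."""

    def __init__(self, image: Image,
                 extractors: Dict[str, Callable[[Image], float]]):
        self._image = image
        self._extractors = extractors
        self._cache: Dict[str, float] = {}
        self.computed: List[str] = []  # order in which features were extracted

    def __getitem__(self, name: str) -> float:
        if name not in self._cache:
            self._cache[name] = self._extractors[name](self._image)
            self.computed.append(name)
        return self._cache[name]


@dataclass
class Node:
    feature: str
    threshold: float
    low: Union["Node", str]   # followed when feature value <= threshold
    high: Union["Node", str]  # followed when feature value > threshold


def classify(node: Union[Node, str], feats: LazyFeatures) -> str:
    """Walk the tree; features are extracted only along the visited path."""
    while isinstance(node, Node):
        node = node.high if feats[node.feature] > node.threshold else node.low
    return node


# Hypothetical extractors, ordered roughly from cheap to expensive.
EXTRACTORS: Dict[str, Callable[[Image], float]] = {
    "byte_size": lambda img: img["byte_size"],
    "aspect_ratio": lambda img: img["width"] / img["height"],
    "mean_intensity": lambda img: sum(img["pixels"]) / len(img["pixels"]),
}

# Hypothetical tree: the cheap feature sits at the root, so many images
# are labeled without ever touching pixel data.
TREE = Node("byte_size", 10_000,
            low="ham",
            high=Node("mean_intensity", 200, low="ham", high="spam"))


def classify_image(image: Image) -> Tuple[str, List[str]]:
    feats = LazyFeatures(image, EXTRACTORS)
    label = classify(TREE, feats)
    return label, feats.computed
```

In this sketch, an image that the root node already resolves never pays for the pixel-level feature, which is the kind of saving the abstract attributes to JIT extraction. A speed-aware feature selection step, as described above, would favor placing cheap features near the root.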

Original language: English (US)
State: Published - 2007
Externally published: Yes
Event: 4th Conference on Email and Anti-Spam, CEAS 2007 - Mountain View, CA, United States
Duration: Aug 2 2007 → Aug 3 2007

Conference

Conference: 4th Conference on Email and Anti-Spam, CEAS 2007
Country/Territory: United States
City: Mountain View, CA
Period: 8/2/07 → 8/3/07

ASJC Scopus subject areas

  • Software
