Bayesian Logistic Regression for Text Classification and Mining

Dr David D Lewis

David D. Lewis Consulting

Date:  Wednesday, March 22, 2006
Time: 11:00am – 12:00pm
Place: 315 Ernst Bessey Hall

Host: R. Jin

Abstract: An advantage of logistic regression over other discriminative learners is its explicit probabilistic foundation. This allows incorporating task knowledge through priors on parameters and model structure. I will discuss our use, in content-based text categorization and author identification, of 1) priors that lead to dense vs. sparse models or positive vs. mixed-sign models, 2) priors that incorporate domain knowledge from reference books and other texts, and 3) polytomous (1-of-k) dependent variables. Time permitting, I will also discuss software engineering and numerical optimization issues in our open-source Bayesian logistic regression programs, BBR and BMR, make some comparisons with other logistic regression codes, and discuss directions for future improvements. (This is joint work with David Madigan, Alex Genkin, Aynur Dayanik, and Dmitriy Fradkin of Rutgers University and DIMACS.)
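The dense-vs.-sparse distinction in point 1 can be illustrated with off-the-shelf tools: at the MAP estimate, a Gaussian prior on the weights corresponds to an L2 (ridge) penalty, while a Laplace prior, the sparsity-inducing choice in BBR, corresponds to an L1 (lasso) penalty. The sketch below uses scikit-learn as a stand-in (it is not the BBR/BMR software) on synthetic data with a few informative features, mimicking the sparse feature vectors typical of text.

```python
# Sketch: Gaussian vs. Laplace priors on logistic regression weights,
# realized as L2 vs. L1 penalties at the MAP estimate.
# Uses scikit-learn as a stand-in for BBR; data is synthetic/illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))

# Only the first 5 of 50 features carry signal.
w_true = np.zeros(50)
w_true[:5] = 2.0
y = (X @ w_true + rng.normal(size=200) > 0).astype(int)

# Gaussian prior <-> L2 penalty: weights shrink but stay nonzero (dense model).
dense = LogisticRegression(penalty="l2", C=1.0).fit(X, y)

# Laplace prior <-> L1 penalty: many weights driven exactly to zero (sparse model).
sparse = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)

n_dense = np.count_nonzero(dense.coef_)
n_sparse = np.count_nonzero(sparse.coef_)
print(f"nonzero weights: L2={n_dense}, L1={n_sparse}")
```

The L1/Laplace model should retain far fewer nonzero weights than the L2/Gaussian one, which is why sparse priors are attractive for high-dimensional text problems. Polytomous (1-of-k) outcomes, point 3, correspond to multinomial rather than binary logistic regression.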

Biography: Dave Lewis is a consulting computer scientist (www.DavidDLewis.com) based in Chicago, IL. He works in the areas of information retrieval, data mining, and natural language processing. Prior to establishing a consulting practice, he held research positions at AT&T Labs, Bell Labs, and the University of Chicago. He received his Ph.D. in Computer Science from the University of Massachusetts, Amherst, and did his undergraduate work in computer science and math at MSU.