Hacker News | msalahi's comments

i've actually found the performance of gensim (the topic modeling python module i use here) to be pretty great. we're not at a scale where CPU performance is make-or-break just yet, so i haven't done any comprehensive benchmarking, but i've definitely not run into any performance issues worth complaining about. gensim uses lazy evaluation wherever it can, so it's relatively light on memory and CPU. i love NLTK as well, but it lacks in the dimensionality reduction/topic modeling department, which gensim handles beautifully. LDA + SVM seemed like an interesting approach to go with, and it didn't disappoint.
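the lazy-evaluation point above can be sketched in plain python: instead of loading every document into memory, a corpus can be an iterable that tokenizes and yields one bag-of-words at a time. this is a hypothetical sketch of the general pattern (gensim's own streamed-corpus classes work on the same principle; the class name here is made up):

```python
from collections import Counter

class StreamedCorpus:
    """Lazily yields one bag-of-words Counter per document.

    Documents are processed one at a time, so memory use stays flat
    no matter how large the collection is. (Illustrative sketch only,
    not gensim's actual implementation.)
    """

    def __init__(self, documents):
        # `documents` is any iterable of raw text strings,
        # e.g. lines of a file that is itself read lazily.
        self.documents = documents

    def __iter__(self):
        for doc in self.documents:
            tokens = doc.lower().split()
            yield Counter(tokens)

docs = ["the cat sat", "the dog sat on the cat"]
bows = list(StreamedCorpus(docs))
```

because `__iter__` re-tokenizes on each pass, the corpus can be iterated over multiple times (as LDA training requires) without ever materializing all documents at once.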


The issue with gensim is that you have to know what you're trying to analyze before you analyze it. It doesn't do well if you use the wrong corpus or if, like you mention, you start with a million-word corpus.

If you were analyzing emails in a single organization all day, you could probably sort out topics really well. Applied to all of the web, it breaks down, because it gets less accurate the larger the variety of content.


"doing all of the web" will cause pretty much any approach to AI/machine learning/NLP to break down. i'm a big believer that it's the responsibility of the engineer employing these techniques to take stock of the problem at hand and figure out which constraints you can take advantage of to achieve better performance/accuracy/prettiness of code. there's no silver bullet you can just release on the internet with the task of bringing back incredibly useful information without "knowing what you're trying to analyze before you analyze it."


Web developers are a neat bunch. It's amazing what kind of inference you can do by exploiting document structure alongside more traditional approaches like word-frequency analysis, LDA, or even deep learning/distributional word representations. NLP on the web, especially question answering and search, can still be greatly expanded upon.


Wait an hour. We decided to push that bit of code live in Alpha. :-)


SCIENCE!


As with most successful applications of machine learning, it's about finessing your approach based on the problem at hand. In our case, our classes are divided at the level of "Medicine," "Real Estate," etc., so we could throw away lots of words that occurred only once or twice in the massive corpus we crawled to build the language model and still have a pretty robust representation of the subject matter.


In fact, if your training corpus is sufficiently large, you'd be shocked how many words you can eliminate right away with a term-frequency cutoff of one or two. I went from millions of words in the vocabulary to something like 60k just by ignoring words that occur once or twice in the corpus. Plus, you're unlikely to learn much about a word's relationships to other words if it only occurs a few times.
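the pruning described above is just a corpus-wide frequency cutoff. a minimal standard-library sketch of the idea (in gensim itself, `Dictionary.filter_extremes(no_below=3)` does the equivalent job; the function name and example documents below are made up for illustration):

```python
from collections import Counter

def prune_vocabulary(docs, min_count=3):
    """Keep only words appearing at least `min_count` times across the
    whole corpus, i.e. drop terms with corpus frequency 1 or 2."""
    freqs = Counter(word for doc in docs for word in doc.split())
    return {word for word, count in freqs.items() if count >= min_count}

docs = [
    "real estate prices rise",
    "real estate listings grow",
    "medicine trial results rise",
    "estate auction",
]
vocab = prune_vocabulary(docs, min_count=3)
# "estate" appears 3 times and survives; one-off words like
# "auction" and "medicine" are dropped.
```

on a large crawled corpus the surviving set is dramatically smaller than the raw vocabulary, which is exactly the millions-to-60k reduction described above.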


Yeah, but consider that some rare words are much stronger indicators of topic than more common ones, even more so if you look at n-grams. If you use something like WordNet, you can get a lot of meaning out of low-frequency words and throw away the meaningless higher-frequency ones that occur in too many categories to be useful.
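one way to make "occurs in too many categories to be useful" concrete is to score each word by how concentrated its occurrences are in a single category: a word seen only under one label scores 1.0, while a word spread evenly across labels scores near 1/num_categories. this is a hedged sketch of that heuristic, not anything from the actual system discussed here; the labels and documents are invented:

```python
from collections import Counter, defaultdict

def category_concentration(labeled_docs):
    """For each word, return the fraction of its occurrences falling in
    its single most frequent category. 1.0 = perfectly indicative of
    one category; lower values = word is spread across categories."""
    per_word = defaultdict(Counter)
    for category, text in labeled_docs:
        for word in text.split():
            per_word[word][category] += 1
    return {
        word: max(cats.values()) / sum(cats.values())
        for word, cats in per_word.items()
    }

docs = [
    ("medicine", "the new drug trial"),
    ("medicine", "the drug dosage"),
    ("real_estate", "the house listing"),
]
scores = category_concentration(docs)
# "drug" occurs only under medicine -> score 1.0 (strong indicator)
# "the" occurs under both categories -> score 2/3 (weak indicator)
```

a rare word with a score near 1.0 is the kind worth keeping despite low frequency, while a high-frequency word with a low score is the kind worth throwing away.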


Sure, there's value in rare words, but I don't think anything that occurs across the corpus fewer than 3 times is going to tell you anything useful. You need a certain amount just to have it be a real signal. What was the least frequent useful word in the data set, msalahi?

