Hacker News

Often the last word is a descriptive noun. Take the last word of each product name and make it a candidate category. Then go through the resulting list of candidate words (much smaller than 150k) and mark which ones are and aren't actually categories. Now change your categorization script to choose the word closest to the end of the name that hasn't been marked as not a category. Go through the list again, this time filtering out anything you've already positively marked as a category. After a few iterations with these scripts you should have decent categories.

Bonus: have your script set multiple categories if the title has multiple words you have marked positively as a category.
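A rough sketch of that loop in Python, in case it helps. The function names, the sample titles, and the choice to lowercase and split on whitespace are all my own assumptions for illustration; the two sets stand in for whatever you use to record your manual marks.

```python
def candidate_category(title, rejected):
    """Return the word closest to the end of the title that is not rejected."""
    for word in reversed(title.lower().split()):
        if word not in rejected:
            return word
    return None  # every word in the title has been rejected


def review_words(titles, accepted, rejected):
    """Candidate words that still need a manual yes/no decision."""
    candidates = {candidate_category(t, rejected) for t in titles}
    return sorted(w for w in candidates if w and w not in accepted)


def all_categories(title, accepted):
    """Bonus: every accepted category word in the title, in title order."""
    return [w for w in title.lower().split() if w in accepted]


if __name__ == "__main__":
    # Hypothetical product titles, not taken from any real catalog.
    titles = [
        "Turnigy 2200mAh LiPo Battery",
        "Brushless Outrunner Motor",
        "Servo Extension Cable 30cm",
    ]
    # Round 1: nothing marked yet, so every last word needs review.
    print(review_words(titles, set(), set()))
    # Round 2: after marking by hand, only the new candidate remains.
    print(review_words(titles, {"battery", "motor"}, {"30cm"}))
```

Each round you only review the words that come back from `review_words`, add them to one set or the other, and rerun until the list is empty or small enough to ignore.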



This is also clever. I'm a big fan of the 80/20 rule and I think this will work great as a first pass. There are definitely cases where it won't pick up anything, like this product:

OUTRAGE 5C NRG35 3S1P 11.1V 800mAH 35C NRG355C-8003

That's a LiPo battery, and should be classified under "Batteries", but nothing in the title explicitly states that fact. You can tell it's a battery because of the parametrics (11.1 volts, 35C discharge, 3S, etc.), but none of that would show up as a categorical classification.

Then again, I probably want sub-categories in batteries for those parametrics, so perhaps I just need to allow categories to be nested under other categories and manually assign tiers later.

Thanks for the suggestion, I'm going to play around with this!


You're more familiar with your data, but the last word doesn't need to be a hard and fast rule. The last word there looks like a product code. However, the first word looks like a brand.

If titles like this are more of an exception, I'm sure you can generate a pretty small list of the products that fall into that group. If you regex for some number of volts together with some mAH, it's probably a battery or a motor. That at least reduces the number of things you have to manually go through later to clean up the data.


Glad it's helpful. In cases like that one I suppose you look for clues and have another script assign categories. For example, if most batteries have a voltage in the title, you can regex for \d+(\.\d+)?V\b (or whatever) and assign battery to matches. Could become complex quickly, though.
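Something like this, as a minimal sketch. The exact patterns and the function name are assumptions on my part; requiring both a voltage and a mAH capacity is just one way to cut down on false matches, and per the comment above a hit could still turn out to be a motor.

```python
import re

# Assumed patterns: a voltage such as "11.1V" and a capacity such as "800mAH".
VOLTAGE = re.compile(r"\b\d+(?:\.\d+)?V\b", re.IGNORECASE)
CAPACITY = re.compile(r"\b\d+mAH\b", re.IGNORECASE)


def looks_like_battery(title):
    """Guess 'battery' only when the title has both a voltage and a capacity."""
    return bool(VOLTAGE.search(title) and CAPACITY.search(title))


if __name__ == "__main__":
    title = "OUTRAGE 5C NRG35 3S1P 11.1V 800mAH 35C NRG355C-8003"
    print(looks_like_battery(title))
```

Titles that match can go into a review pile for "Batteries" rather than being assigned blindly, which keeps the manual cleanup small.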



