It seems predicting the score directly (regression) is almost impossible without considering the associated domain. E.g. headlines with the letters GPT in them from openai.com get an order of magnitude more votes than similar headlines from other sites.
My best model was developed about two years ago and hasn't been updated. It uses bag-of-words features as input to logistic regression. I tried a lot of things, like BERT+pooling, and they didn't help. A model that only considers the domain is not as good as the bag-of-words one.
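For concreteness, a minimal sketch of that kind of setup looks something like the following -- the toy titles, the "did it do well" cutoff, and the hyperparameters here are just illustrative assumptions, not the exact model:

  from sklearn.feature_extraction.text import CountVectorizer
  from sklearn.linear_model import LogisticRegression
  from sklearn.pipeline import make_pipeline

  # Toy data: titles, and whether each cleared some hypothetical vote threshold
  titles = ["Show HN: A GPT-powered spreadsheet", "Ask HN: Freelancing advice?"]
  hit = [1, 0]

  model = make_pipeline(
      CountVectorizer(lowercase=True, ngram_range=(1, 2)),  # bag-of-words features
      LogisticRegression(max_iter=1000),
  )
  model.fit(titles, hit)

  # The model outputs a probability rather than a raw score prediction
  print(model.predict_proba(["Show HN: GPT for your inbox"])[0, 1])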
This kind of model reaches a plateau once it has seen about 10,000-20,000 samples, so for any domain (e.g. nytimes.com, phys.org) that has more than a few thousand submissions it would make sense to train a model just for that domain.
YOShiNoN and I have also submitted so many articles in the last two years that it would be worth it for me personally to make a model based on our own submissions, because ultimately I'm drawing them from a different probability distribution. (I have no idea to what extent submissions behave differently depending on whether or not I submit them; I know I have both fans and haters.)
I see recommendation problems as involving two questions: "is the topic relevant?" and "is the article good quality?" The title is good for the first but very limited for the second. The domain is probably more indicative of the second, but my own concept of quality is nuanced and involves a bit of "the dose makes the poison" thinking. For instance, I think phys.org articles draw out a conclusion from a scientific paper that you might not get from a superficial read (good), but they also have obnoxious ads (bad). So I feel like I only want to post a certain fraction of those.
So far as regression goes, this is what bothers me. An article that has the potential to get 800 votes might get submitted 10 times and get
1,50,4,800,1,200,1,35,105,8
votes or something like that. The ultimate predictor would show me the probability distribution, but maybe that's asking too much, and all I can really expect is the mean, which is about 120 in that case. That's not a bad estimate on some level, but if I were using the L2 norm I'd get a very high loss in every case except the one that got 105. The loss is going to be high no matter what prediction I make, so it's not as if a better model could cut my loss in half; rather, a better model might reduce my loss by 0.1%, which doesn't seem like much of a victory -- though on some level that's an honest account of the fact that it's a crapshoot, of the real uncertainty of the problem that will never go away. On the other hand, the logistic regression model gives a probability, which is a very direct expression of that uncertainty.
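To put rough numbers on that (my own back-of-the-envelope, using the example sequence above): even the loss-minimizing constant prediction, the mean, leaves an enormous squared error:

  votes = [1, 50, 4, 800, 1, 200, 1, 35, 105, 8]

  mean = sum(votes) / len(votes)                    # 120.5
  mse = sum((v - mean) ** 2 for v in votes) / len(votes)
  print(mean, mse, mse ** 0.5)                      # ~120.5, MSE ~55,000, RMSE ~234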
It's an interesting problem. If most of the votes concentrate on the first submission, I wouldn't bother including subsequent submissions in the model. However, if this is not the case (as in your example), you could actually include the past voting sequence, submission times, and domain as predictors. In your example, the 800 votes might then (ideally) correspond to a better time slot and source/domain than the first single vote.
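Something along these lines, say -- the feature names and the toy numbers below are purely illustrative, not from any real dataset:

  # Features for predicting the next submission's score, as suggested above:
  # the past voting sequence, submission time, and domain.
  def featurize(prev_votes, hour_utc, weekday, domain_id):
      return [
          len(prev_votes),                                            # number of prior submissions
          max(prev_votes, default=0),                                 # best prior outcome
          sum(prev_votes) / len(prev_votes) if prev_votes else 0.0,   # average prior outcome
          hour_utc,                                                   # crude time-slot signal
          weekday,
          domain_id,                                                  # integer code per domain
      ]

  # e.g. the fourth submission of the example article, posted at 16:00 UTC on a Tuesday
  print(featurize([1, 50, 4], 16, 1, 42))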
A few years ago, on my birthday, I quickly checked the visitor stats for a little side project I had started (r-exercises.com). Instead of the usual 2 or 3 live visitors, there were hundreds. It looked odd to me—more like a glitch—so I quickly returned to the party, serving food and drinks to my guests.
Later, while cleaning up after the party, I remembered the unusual spike in visitors and decided to check again. To my surprise, there were still hundreds of live visitors. The total visitor count for the day was around 10,000. After tracking down the source, I discovered that a really kind person had shared the directory/landing page I had created just a few days earlier—right here on Hacker News. It had made it to the front page, with around 200 upvotes and 40 comments: https://news.ycombinator.com/item?id=12153811
For me, the value of hitting the HN front page was twofold. First, it felt like validation for my little side project, and it encouraged me to take it more seriously (despite having a busy daily schedule as a freelance data scientist). But perhaps more importantly, it broadened my horizons and introduced me to a whole new world of information, ideas, and discussions here on HN.
> it felt like validation for my little side project
Yep, that can be useful motivation to get a side project past "works for me" through to "works for others".
The pgautoupgrade project (https://github.com/pgautoupgrade/docker-pgautoupgrade) was one of those. It seems to be going ok too, as others have come along and picked up the majority of development (I'm ~outta time). :)
Gee, today is my birthday (36). I've never managed to get anything I've built to the HN front page. Always wondered if that means my ideas just weren’t that interesting, or if it's just the luck of the draw.
*edited my original comment without mentioning my project*
Ha, thank you for that! I actually didn't have prior experience in wearables. My background was in mobile app development. I co-founded a company during the app boom, and we built a lot of iOS and Android apps. The transition into health sensors happened pretty organically; I also felt there was nothing that let me measure my own data and make use of it, so I decided to create a wearable + free SDK for that. There's quite a nice article about how it all started here: https://howtoware.com/aidlab (good reading for anyone wondering what the struggles are when running a wearable startup).
We (and you) can never be sure why people downvote things, and whilst I think the parent commenter was well-intentioned, I think perhaps the downvotes were due to the perception of intruding on someone else's comment/subthread with promotion of one's own project.
As for your own comments, you seem to have a campaign against HN going on. Yes the HN audience can be hard to understand sometimes, but on the whole it seems easy enough to do well here if you bring a generosity of spirit, and not so well if you bring combativeness and hostility.
The guidelines [1] make it clear what we're expecting, and the first two words of the "In Comments" section are "Be kind". If you start with that then you'll be on solid ground.
Regarding your username, we can't have usernames like this, because it effectively trolls the community every time you comment. So I've banned the account for now. You're welcome to register a new account with a normal username, or email us (hn@ycombinator.com) asking us to change the username. We'd certainly be very happy to have you post interesting comments from your experience as a retired aerospace engineer or any other life experience you no doubt have.
They weren't trying to get on the front page of HN at age 1, so it's not 36 years in a row. Why do you need to be unpleasant to someone you don't even know?
That pushes the question back to "why did users flag it?", but it's not hard to speculate about the reasons. A dedicated minority roll their eyes at discussions of AI safety. On top of that, although the article itself is sound, it's targeted at non-technical people, and the domain it's hosted on isn't well regarded (trending banner on top: "What happens when you hold in a fart?").
At the end of the paper, they mention "three factors over which the crowdfunding founder has complete control: the goal, the number of reward options and the duration. To maximize the likelihood of success, Fig. 1 implies that all other things being equal, a founder should choose a Log(goal) which is less than 7, a number of reward options which is at least 40 and a duration of between 10 and 15 days."
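One caveat worth flagging: the quote doesn't say which log base Fig. 1 uses, and that changes the implied goal ceiling by several orders of magnitude:

  import math
  # "Log(goal) < 7" under the two common readings of "Log"
  print(math.exp(7))   # natural log: goal under ~$1,097
  print(10 ** 7)       # log base 10: goal under $10,000,000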