More

nickhuh · on Feb 14, 2021

There's a lot of interest in various ML communities on more efficient training and inference. Both vision and NLP have had a growing focus on these problems in recent years.

I think you make a good observation that much of ML progress is driven by tinkering with existing models, though instead of describing it as more "alchemy than science" it's probably more accurate to say it's very experimental right now. Being very experimental is neither unscientific nor unusual in the development of knowledge. James Watt worked as an instrument maker (not a theoretician) when he invented the Watt steam engine in 1776 [1], and at the time the idea of heat as Phlogiston [2] was still more prevalent than anything that looks like modern thermodynamics. Theory and practice naturally take turns outpacing each other, which is part of why we need both.

I'd also caution against the belief that experimental work doesn't require "particularly demanding thought". There are many things one can tweak in current ML models (the search space is exponential) and, as you point out, the experiments are expensive. Having a solid understanding of the system, great intuition, and good heuristics is necessary to reliably make progress.

For those who are interested in the theory of deep learning, the community has recently made great strides on developing a mathematical understanding of neural networks. The research is still very cutting edge, but the following PDF helps introduce the topic [3].

[1]: https://en.wikipedia.org/wiki/James_Watt

[2]: https://en.wikipedia.org/wiki/Phlogiston_theory

[3]: https://www.cs.princeton.edu/courses/archive/fall19/cos597B/...

ghufran_syed · on Feb 14, 2021

that course from princeton looks great! This paper is a nice short read and gives some geometric insight: https://arxiv.org/abs/1805.10451

nickhuh · on Jan 30, 2021

It's a good point and the study does some investigation of the question in Section 7 [1]. They find the trend seems to generalize across multiple speaker identities. Personal experiences appear more effective than facts at fostering respect for a wide range of different speakers.

[1]: https://www.pnas.org/content/118/6/e2008389118#sec-7

nickhuh · on Nov 9, 2020

I think this example misleads one's intuition for the following reason: in the proposed scenario, you'd only see the coin come up heads 10 times in a row about 1 in 1024 times you ran the experiment. While your conclusion would likely be incorrect, you almost never run into that scenario.

For example, if you conducted a study every week for 20 years, you'd both be extremely prolific and expect to have drawn about one wrong conclusion.

The example is a case of an absurd premise (i.e., a fair coin comes up heads 10 times in a row) leading to an absurd conclusion (that the coin is biased). Of course, this is exactly the guarantee the hypothesis test provides: under robust assumptions, you'll draw the wrong conclusion only rarely.

jbay808 · on Nov 9, 2020

A 1 in 1024 chance is not an absurd premise! Something that has a 1 in 1024 chance of ocurring to a person happens to 7 million people. If you get that evidence, you need to be capable of interpreting it correctly.

The OP's example would also worked well enough with even just 5 heads; you'd already pass p < 0.05, despite the actual probability of having picked the biased coin still being minuscule.

joshuamorton · on Nov 9, 2020

You're actually being misled. Formally, you have an observed event, 10H and so you want to know the probability that your coin was fair, conditioned on having seen 10 heads. You have fair coins (F) and biased coins (B)

You have P(H) == 1/1024, P(H|B) == 1/1. And,

Then the probability that you picked a biased coin (P(B))? 1/10^1000. The probability you picked a fair coin (P(F))? (1 - 1/10^1000)

All in all, P(B|H)? Well Bayes rule tells us P(A|B) = P(B|A) * P(A) / P(B). Or in this case, P(B|H) = 1 * 1024 / 10^1000, or approximately 1 / 10 ^ 997.

> For example, if you conducted a study every week for 20 years, you'd both be extremely prolific and expect to have drawn about one wrong conclusion.

This is a prior! In the kind of experimentation I do, I run literally tens of thousands of experiments a day.

feanaro · on Nov 9, 2020

I've been playing a board game daily for the last two weeks. As part of this game, a player has to draw one out of six unique cards. In the last five games, I've repeatedly drawn the same card. This is a real example.

1 / (1/6)^5 ≃ 7776

MaximumYComb · on Nov 9, 2020

You can't rely on odds after an event has happened to determine probably. I could shuffle a deck containing 1000 unique cards and then look at their order. It's astronomically low that this order occurs but it did happen.

joshuamorton · on Nov 9, 2020

When discussing conditional probability, this is absolutely what you can do.

Let's use your example. Conditional probability is P(A|B), the probability of event A, given that event B was observed. What's the probability that I am a magician, given that I shuffled the deck and when you saw it it was still in new deck order?

Now certainly, there is an astronomically small chance that this was observed due to random chance. And yet if you observed this, I'd expect that you would, with relatively high confidence, believe that I stacked the deck.

feanaro · on Nov 9, 2020

I'm not doing anything different than what we are doing when discussing the improbability of a hypothetical 10-heads-in-a-row event.

nickhuh · on April 19, 2020

I don't have any experience programming without visual aids; and you've probably already thought of this aspect though I'll mention it just in-case: if I were facing a similar situation, I would look into long-term disability insurance and see if there's anything I can do on that front. That way, I could still pursue being productive with blindness or deafblindness, but would have less economic pressure while doing so.

Many employers offer a combination of short-term and long-term disability that will provide income (in some cases until you retire) in the event that you can no longer work.

nickhuh · on Oct 14, 2018

I think this article omits the most important distinction between Bayesian and Frequentist statistics: subjective vs. frequentist interpretations of probability. In my own opinion, neither is "true", they're both just different tools for different purposes.

Bayesian inference is great when you have to make a decision and there are many theorems that illustrate this (for example, the arguments around coherence [1] and the complete class theorems [2]). In fact, Bayesian techniques are often useful for creating estimators with great frequentist properties! However, Bayesian interpretations of probability, and thereby the meaning of Bayesian statements, are inherently tied to the beliefs of an individual. That means that Bayesian statements usually aren't "true" in the objective / non-relative sense that we often expect from science. On the other hand, frequentist statements tend to have more of an objective flavor. The trick is: all our mathematical models have short comings and ways in which they're wrong when applied to any particular situation -- so neither really has a claim to being true.

The frequentist perspective often looks at worst case risk and tends to give a more global understanding of a procedure in terms of "how does this procedure shake out in all reasonably possible scenarios?". So, frequentist methods tend to be a bit more risk-averse which is often useful but can cost you for being to pessimistic. Ultimately, the real win is to know your tools well and to pick the right one for the job.

[1] https://en.wikipedia.org/wiki/Coherence_(philosophical_gambl...

[2] https://projecteuclid.org/euclid.aoms/1177730345

nickhuh · on Nov 19, 2016

Actually, in the frequentist paradigm you could choose to run a sequential hypothesis test which will end when you've acquired sufficient data[1]. Or, if you want to get fancy you could use a multi-armed bandit approach which is probably optimal in many situations in perhaps a more robust way than many Bayesian methods[2]. Really both can work well. My advice is, use whichever you know well enough to utilize effectively!

[1]: https://en.m.wikipedia.org/wiki/Sequential_analysis

[2]: https://en.m.wikipedia.org/wiki/Multi-armed_bandit

paulddraper · on Nov 19, 2016

Right, as I said, it can be done with frequentist statistics. This is what Optimizely does (http://pages.optimizely.com/rs/optimizely/images/stats_engin...). But (1) it is not simple and (2) it is not optimal.

---

Agreed 100% about multi-armed bandit which is what I was referring to. And the canonical solutions are in fact Bayesian :) See the Google Analytics link or lookup "Thompson sampling"

From your Wikipedia link:

"Probability matching strategies are also known as Thompson sampling or Bayesian Bandits, and surprisingly easy to implement if you can sample from the posterior for the mean value of each alternative."

nickhuh · on Nov 20, 2016

Oh yeah there are a few Bayesian methods which work great, Bayes-UCB is another. Personally though I think KL-UCB or just plain old UCB would be the ones I'd choose. Like I said earlier, I think these techniques are like programming languages: choose the one you know well enough to get the job done with it.

kgwgk · on Nov 19, 2016

> Actually, in the frequentist paradigm you could choose to run a sequential hypothesis test which will end when you've acquired sufficient data

And, as he said, you have to make adjustments to account for these interim analysis.

nickhuh · on Oct 3, 2016

In that case you'd want to use "git add -p" which allows you to pick only parts of a file to stage for a commit. It can be crucial in crafting a really solid project / commit history.

For even more complex cases you could use "git add -i"; however, that command can be tricky to work with and I find it's usually not to helpful to get that far into the weeds.

erichocean · on Oct 3, 2016

Atlassian has a nice UI for doing that visually, too. Super easy.

Honestly, people for whom the command line UI is "too difficult" should just use a GUI client. That's the target audience for them.

nickhuh · on Oct 2, 2016

In my experience, making "gl commit" equivalent to "git commit -a" seems like a bad idea. When leading a team, especially of newer coders, one of the most effective ways I've found of keeping the git logs and the code base sane more generally, is to force people to review their own code at the time when they commit it. Individually adding each file with "git add -p" achieves this, and "git commit -a" squashes it.

Reading the methodology used to develop gitless yesterday was interesting, but if I recall correctly, I think it left something out. They looked at how often a software design allowed users to complete their intention, but the tool ideally not only should allow users to complete their intention, but should encourage users towards practices that increase the quality of the end product. While people may struggle with staging at first, I think it in the end encourages better software, which is my biggest concern.

nickhuh · on Oct 2, 2016

I could be wrong, but I think the article that the Torvalds quote came from was purely satirical [1]. If you look at the tags it says it was posted under satire. Got me at first too!

[1]: http://typicalprogrammer.com/linus-torvalds-goes-off-on-linu...

nickhuh · on May 2, 2016

If you're interested in clustering text documents, the canonical algorithm would be latent Dirichlet allocation, which is a topic modeling algorithm. You can find latent Dirichlet allocation in sklearn; however, you're more looking for something that returns a raw similarity score it sounds like, in which case it might be interesting to check out word2vec. Perhaps checkout this stack overflow answer: https://stackoverflow.com/questions/22129943/how-to-calculat...

popra · on May 2, 2016

That you very much, I'll look into those.