The Central Limit Theorem and Its Misuse (lambdaclass.com)
176 points by lelf on Dec 30, 2019 | hide | past | favorite | 67 comments


This is a nice article, but if it's really meant as an explanation of the CLT for people who don't know much about it, then the (pretty) graphs should have their axes labelled clearly and the symbols used in the formulae should also be clearly explained (e.g. the article uses E for expectation without immediately explaining it).

It's no use "explaining" what is supposed to be a novel concept by presenting it in a way that is difficult to understand if one is not already familiar with it.


Interestingly, the O(1/sqrt(n)) error in the CLT provided by the Berry-Esseen Theorem is the best one can do for general sequences of RVs. However, if you are happy to put certain 'bumps' in the normal distribution then you can improve the error to O(n^(-r/2)), where I think r is some integer depending on the smoothness of your random variables. These bumps are called an Edgeworth expansion.


Dear authors: Please add timestamps showing when the article was first published and updated. This helps situate the article in a larger context.


Agreed.

It appears to have been published yesterday-ish. [1]

[1] https://web.archive.org/web/*/https://lambdaclass.com/data_e...


There are MANY CLTs. In general there is a memory-moment trade-off in assumptions for a given CLT. For example, you can relax the iid assumption to allow dependence between observations, such as a mixing condition, with the cost being you need a little bit stronger moment condition where the moment condition is a function of the mixing condition - more dependence (up to a point) requires a stronger moment.


Not 100% sure, but to me "The three coins (non) example" is false. You could easily construct a random variable which flips the three coins at random. Then you get an iid sequence again.

I'm writing this basically because I think it is troubling how "blogs" work: explaining, let's say, complex stuff with simple diagrams. Especially in this case it even looks wrong and gives you the wrong intuition.


You're correct, and the blogpost (and several replies) are wrong. The sum or sample average of a simple mixture of Bernoulli distributions still converges to a Normal.

There are too many errors and misconceptions in this blog and thread: confusion between mixture distributions and the sample mean of said mixture, and confusion between sufficient and necessary conditions (IID is sufficient to prove the Lindeberg–Levy CLT, but it's not necessary for the more general forms: Lindeberg-Feller, Lyapunov).

Unfortunately I'm at work and don't have time for a comprehensive rebuttal today. But people with some numpy or R skills can simply do the simulations themselves and see that the blog is not correct.
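
For what it's worth, here is a minimal numpy sketch of the kind of simulation I mean, under the reading where a fresh coin is drawn from {0.4, 0.5, 0.6} before every toss (the repetition count and seed are arbitrary choices of mine):

    import numpy as np

    rng = np.random.default_rng(0)
    biases = np.array([0.4, 0.5, 0.6])
    n_experiments, n_tosses = 20_000, 300

    # Re-draw the coin before every toss, so each toss is an iid draw
    # from the Bernoulli mixture.
    p = rng.choice(biases, size=(n_experiments, n_tosses))
    heads = (rng.random((n_experiments, n_tosses)) < p).sum(axis=1)

    counts, _ = np.histogram(heads, bins=60)
    print(heads.mean(), heads.std())  # ~150, ~8.7: a single, roughly normal bump
    print(counts)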


> The sum or sample average of a simple mixture of Bernoulli distributions still converges to a Normal.

The procedure given in the example is not a sum of samples from a mixture of Bernoulli distributions. It is a mixture of sums of Bernoulli distributions.


The writing is not very clear, but I understand it to be a mixture of Bernoulli RVs:

> Let's consider the following scenario: say we have three coins with different biases (their probability of coming up heads): 0.4, 0.5 and 0.6. We pick one of the three coins at random, toss it 300 times and count the number of heads. What is the distribution obtained?

This is approximating the sum of 300 Bernoulli RVs with a Normal, which is perfectly valid.

> As we have seen, if we fix the coin we're tossing, the number of heads can effectively be approximated by a distribution N(300p,300p(1−p)) (where p is the coin's bias). This time, however, each time we take a sample we might be tossing any of the three different coins.

I understand the procedure here to be (1) choosing one of the 3 coins at random and tossing it, (2) repeating step (1) 300 times and summing the resulting number of heads. In this case the CLT does apply: the distribution of the sum-of-number-of-heads is approximately Normal, not the plotted tri-modal density.


> We pick one of the three coins at random, toss it 300 times and count the number of heads.

This is the procedure. One of the coins is chosen, and that same coin is flipped 300 times. Conditional on which coin was chosen, the number of heads is Binomial, so the unconditional distribution is a mixture of three Binomials.
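
A quick numpy sketch of that mixture, in case anyone wants to see it (the 20,000 repetitions are an arbitrary choice):

    import numpy as np

    rng = np.random.default_rng(1)
    biases = np.array([0.4, 0.5, 0.6])

    chosen = rng.choice(biases, size=20_000)   # pick one coin per experiment
    heads = rng.binomial(300, chosen)          # flip that same coin 300 times

    counts, _ = np.histogram(heads, bins=60)
    print(counts)   # three humps, centred roughly at 120, 150 and 180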


But a few lines later, it states

> This time, however, each time we take a sample we might be tossing any of the three different coins.

which is a different procedure


The RV you suggest you could construct is not how the example works. From the article:

>Let's consider the following scenario: say we have three coins with different biases (their probability of coming up heads): 0.4, 0.5 and 0.6. We pick one of the three coins at random, toss it 300 times and count the number of heads.

So, not choosing a random coin every time, but choosing it once and tossing it 300 times.


Maybe that is correct. What I'm missing is a simple "proof", or at least an idea of a proof, of why some assumptions are violated. In this case you could again argue: picking a coin at random and then tossing it 300 times --> multidimensional CLT. Not saying the article is wrong. I'm just saying there should be a proof or at least a bit more evidence... Just a simple plot is not enough from my perspective.


The violated assumption is simply that the random variables are i.i.d.: in this case the individual coin flips would need to be i.i.d. for the CLT to hold. Obviously that's not the case at all by construction of the experiment. So really, it would be extraordinary if the plot showed anything other than it does. While the explanation of why it doesn't apply was handwavy ("can't really be written as a sum in the same way as before"), it is certainly correct and the graph helps give the intuition.


The CLT holds for independent, identically distributed RVs. In the three-coin example the RVs are independent but not identically distributed: there are three distributions, one for each coin.


No, it's the opposite: the 300 tosses are identically distributed, but they are not independent. They aren't independent because they all depend on a common cause: the choice of which coin to flip.

This property is also called "exchangeability", and it implies identical marginal distributions.


Right, thanks. The averages are independent and not identically distributed, but that's irrelevant to the theorem, right?


The example of misuse of the CLT illustrates that a mixture of normals is not a normal distribution. For example, the height of people is a mixture of two normals, one for men and the other for women; this mixture has two modes, so it is not bell-shaped and is not a normal distribution.


1) Human height isn't actually bimodal even if some histograms display two modes. See https://amstat.tandfonline.com/doi/abs/10.1198/00031300265#.... for an analysis.

2) The central limit theorem applies to the distribution of the sample average. It applies whenever the samples are iid and the second moment is finite. The fact that the samples are coming from a mixture of normals doesn't change that.
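
A quick simulation sketch of this (the means, spreads and sample size below are made up for illustration):

    import numpy as np

    rng = np.random.default_rng(2)
    n_means, n = 10_000, 200

    # iid draws from a two-component normal mixture ("men" and "women").
    is_tall = rng.random((n_means, n)) < 0.5
    heights = np.where(is_tall,
                       rng.normal(178, 7, (n_means, n)),
                       rng.normal(165, 7, (n_means, n)))

    means = heights.mean(axis=1)
    counts, _ = np.histogram(means, bins=30)
    print(means.mean(), means.std())  # a single bump around 171.5
    print(counts)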


> Human height isn't actually bimodal even if some histograms display two modes.

"Bimodal" doesn't really have a precise definition or test if you don't assume normal distributions.

That paper argues that only if means are separated by 2σ should the distribution be considered bimodal.

But there are many measures of bimodality. [1] [2] [3] [4] [5] [6]

---

In any case, I would be very much surprised if the population couldn't be selected enough (age, race, country, diet, family) to have human height be bimodal by any measure.

[1] https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/97WR...

[2] https://journals.sagepub.com/doi/10.4137/CIN.S2846

[3] https://link.springer.com/article/10.1007%2Fs11207-008-9170-...

[4] https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1469-1809....

[5] https://link.springer.com/article/10.3758%2FBF03205709

[6] https://esj-journals.onlinelibrary.wiley.com/doi/abs/10.1007...


The CLT is about the sum. In your case, the sum of heights will still converge to the normal distribution if you take the height of n random people (man or woman).


> In your case, the sum of heights will still converge to the normal distribution if you take the height of n random people (man or woman)

Yes and no. The sample must be independent and identically distributed. In your case the "identical" part is not correct, as men and women have different distributions (both are normal but with different mean and std). However, if both distributions are normal, then their sum is normal (even with different mean and std).

The fact that the sum is normal in this case has nothing to do with the CLT - it's just a quirk of the normal distribution that the sum is normal. Had men/women had non-normal distributions with different means/stds, then the sum would not be normal.


If you sample from the global distribution (all men and women of earth) then samples are identically distributed. It's just a new distribution that is not gaussian, but the sum of samples will converge to a gaussian.


I concede your point.


Can someone give me an example on how to properly use the central limit theorem in the real world?

For example, with stuff that is commonly assumed to be bell-curved, like test scores, IQ, etc. What are the iid variables being averaged? Each test question?


Here's the example I have my students do in class. Roll a die repeatedly and tally the results. You'll get a (roughly) uniform distribution of 1s, 2s, 3s, 4s, 5s, and 6s.

Now to illustrate the CLT, you roll a die 50 times, and average the result. AND your 300 classmates do the same. If you tally the 301 averages, the distribution of the averages will not be uniform but bell-shaped, with average (approximately) 3.5.

The CLT says (roughly) the distribution of the averages will be approximately normal, regardless of the original distribution.
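
If you want to try it outside the classroom, here is a short numpy version of the same exercise (301 students, 50 rolls each):

    import numpy as np

    rng = np.random.default_rng(3)
    rolls = rng.integers(1, 7, size=(301, 50))   # each row: one student's 50 rolls

    averages = rolls.mean(axis=1)                # 301 per-student averages
    counts, _ = np.histogram(averages, bins=15)
    print(averages.mean())                       # close to 3.5
    print(counts)                                # roughly bell-shaped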


Short and concise, as things should be. Thank you. :)


Here’s something that’s bothered me for a while. Why is it necessary to roll it 50 times? Why can’t I roll it just once? Does each X_i have to be a sample of size greater than 1, and if so is there some kind of minimum size required for each random sample? Does the CLT still hold if each random sample’s size is 1?


As the number of samples approaches infinity, the mean of all the iid samples will become normally distributed.

So it will not hold strictly for either n=50 or n=1, but n=50 may be a better approximation of infinity :)


As I understood it, each X_i is a random sample and the n refers to X_n, that is, n random samples. But what is the size of each sample? In the parent it was 50. The die is rolled 50 times by each person i, and the outcomes are given by the random variable X_i. Also, there are 300 people rolling a die, so n=300, thus 300 random variables each composed of 50 die roll outcomes. I’m curious why each X_i has to be a sample of size 50 and can we just have each person roll the die one time? Maybe we need 50*300 people to roll the die one time now? Does the CLT work when each X_i is a random sample of size 1?


That's a good question, and it goes back to the definition of the CLT and the distinction between a distribution vs. a sample. We get to pick how we define the terms for our problem!

So in the original example, we know that if we choose X_n to mean an individual die roll, then the random variable X is uniformly, not normally, distributed. However, the CLT tells us that the average of many rolls is indeed normally distributed.

So if you pick n=50, then the average number of pips will resemble a normal distribution. What's the point of having 300 students do it then? It's so we can gather evidence that we're correct :)

Let each individual student do a 50-roll run. We know each run is i.i.d, so let Y represent the distribution of the average number of pips, and Y_1 ... Y_300 represent 300 samples - we can graph the empirical distribution of Y and test it for normality.

However, going back to your actual question, what if we had each person roll just 1 time? Is the outcome normal?

It depends on what you mean - how do we aggregate the 300 individual dice rolls? If you take each individual dice roll and tally it up (how many 1s, how many 2s, etc.), you will find that the distribution is still uniform - so no, a 'sample size' of 1 was not sufficient.

What is normally distributed is the average of those 300 dice rolls. However, in this case, we only get a single 'sample' (something close to 3.5), so we lack evidence that the CLT holds (even though it does).

Same goes with your other question - what if you had 50 * 300 people roll the dice? The average number of pips over 15k die rolls is approximately normal, but you've only drawn a single 'sample' in this case.

Going a bit further, this implies that what matters is not just how many dice we throw, but how we choose to define the meaning of those throws - for 15k throws, you can choose to think of it as 15k samples of a single throw, or 15 samples of (the average of) 1000 throws, or anything in between - just pick the definition that's useful.


Ok, great explanation. Going to your last paragraph, what does this mean “in practice”? Say I have a dataset of 15k people throwing a die one time. Can I group it into 300 random samples of 50 die throws and now say I have 300 random samples and apply the CLT? Each random sample should be i.i.d, no?

I was just curious if one can go the other way around, because usually you only have 1 sample of size 15k and not 300 samples of size 50. If you have the raw data it’s just 15k samples of size 1 or 1 sample of size 15k, depending on how you look at it.

Then I could be wrong here, but doesn’t the proof of the CLT also assume that each random sample is the same size? So you can’t have one sample of size 30 and another of size 20 and another of size 25, etc? Each of the 300 samples must be size 50?


Yes, each dice roll is i.i.d, so you can choose to look at it as 300 random samples of 50 throws each, or any other choice of N.

Put another way, what you have is 15k dice throws, and the fact that they were thrown by 15k, 5k, or 300 people can be ignored, if you choose to. In fact, it may be useful to 'shuffle' dice rolls into new 'samples' - that gets into the use of resampling and bootstrap techniques.

And also yes, for CLT to apply, each sample must have the same N.
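
A sketch of that regrouping, following the numbers from the example above (the shuffle just emphasises that the grouping is arbitrary):

    import numpy as np

    rng = np.random.default_rng(4)
    single_rolls = rng.integers(1, 7, size=15_000)   # 15k people rolling once each

    rng.shuffle(single_rolls)
    group_means = single_rolls.reshape(300, 50).mean(axis=1)

    counts, _ = np.histogram(group_means, bins=12)
    print(group_means.mean())                        # close to 3.5
    print(counts)                                    # roughly bell-shaped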


Each person has to roll their die a number of times approaching infinity; then the reported means will be normally distributed. Now you will almost always get something quite indistinguishable from normal with n=50 throws, but not with n=1.

If you plot the means of the 300 dice throwers for n=1,2,3,4... dice, you will see the distribution of the means take on more and more of a bell shape. But it becomes fully normal only when n approaches infinity (and then it will be a quite spiked bell, of course :))

So, to be clear: the CLT talks about behavior when n approaches infinity. The CLT can be used approximately before that, but it's a very bad approximation when n=1.
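
A rough sketch of that plot in numpy/matplotlib (the particular n values and bin count are my own choices):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(5)
    fig, axes = plt.subplots(1, 4, figsize=(12, 3), sharey=True)
    for ax, n in zip(axes, (1, 5, 20, 50)):
        # 300 throwers, each reporting the mean of n rolls.
        means = rng.integers(1, 7, size=(300, n)).mean(axis=1)
        ax.hist(means, bins=20)
        ax.set_title(f"n = {n}")
    plt.show()   # flat-ish at n = 1, increasingly bell-shaped as n grows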


> I’m curious why each X_i has to be a sample of size 50 and can we just have each person roll the dice 1 time?

Smaller samples have higher variance and their means probably won’t converge as fast to a normal distribution.

“Samples” of one, in isolation, have no defined variance and you can’t use them to infer anything.


By the law of large numbers, the sample mean approaches the population mean as the sample size gets larger. The law of large numbers goes hand-in-hand with the central limit theorem.


Ok, but the real pdf is cut off at 1 and 6, while the normal distribution extends to +- infinity and can never model the cutoffs, no matter how large N is (though of course those tails get less significant).


like the other comment points out, the measured rv is shifted and scaled. re tails: there's a sqrt(n) factor that scales the measured distribution so that in the limit it has tails at +/- infinity.


The CLT can be used to justify modelling an effect with a normal distribution, and it raises questions about model assumptions when normal distributions don't appear where they should. The Normal Distribution is special because it is the maximum-entropy distribution for a given mean and standard deviation [0].

So what the CLT is saying is that, surprisingly, sums of i.i.d. random variables lose information from the individual variables quite quickly but reliably retain a little data about mean and second moment. Initially a given X_i has all sorts of information associated with it (higher order moments, other distribution characteristics, etc) that disappears as many X_i are summed together. All that is left is information about a mean and variance.

So say I have a situation where a large number of i.i.d. variables are going to be summed together. The CLT tells me that summing n of these variables together is going to be similar to summing together an equivalent number of normally distributed variables (!!). This is because the sum variable is equal in value to n * \bar{x}, and the CLT imposes a (normal) distribution on \bar{x}.

This justifies why the normal turns up everywhere in practical measurements. A lot of measurements (say, the number of people at the beach) are probably really measures of a sum of random variables (maybe there is some non-normal variable that captures the chance a given person goes to the beach). So if the number of people at the beach turns out to be normally distributed (maybe I measure it each day for a few weeks) it isn't shocking. If the total number isn't normally distributed, then that implies that there is no i.i.d. variable representing the probability that a given individual goes to the beach (e.g., high correlation between the individual variables).
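
A toy version of the beach example (the population size and the 0.3 attendance probability are invented for illustration):

    import numpy as np

    rng = np.random.default_rng(6)
    n_days, population, p_go = 5_000, 2_000, 0.3

    # Each day, every person independently decides whether to go to the beach.
    attendance = (rng.random((n_days, population)) < p_go).sum(axis=1)

    # Daily counts are sums of many iid Bernoullis, so they come out roughly
    # normal: mean ~600, std ~sqrt(2000 * 0.3 * 0.7) ~ 20.5.
    print(attendance.mean(), attendance.std())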

I didn't quite fail any of the statistics courses I've ever done, YMMV.

[0] https://en.wikipedia.org/wiki/Normal_distribution#Maximum_en...


Suppose you have a school with 30 groups of students. All students are randomly assigned to the groups (so independent of their skill at IQ tests). The distribution of measured IQ within classes can be any distribution. But let’s take the 30 averages of the groups. The mean of these averages of groups (asymptotically) follows the normal distribution according to the CLT.


not the mean of the means - just the means themselves.

The distribution of the original population of iqs in the school is what you're sampling. Each group is a sample. The distribution of the sample mean approaches a normal according to clt as the sample sizes increase.


there's a distinction between random variables that are assumed to actually be normal and RVs that we conceptually apply the clt to and then treat as normal.

in the case of test scores it could be either. for an individual test it's possible the distribution is roughly normal actually. but if you, for example, look at subsets of all SAT scores and take their means (the mean of the subset) then those averages will be bell-shaped because of the clt.

Just to directly answer your question: the iid random variable is the test score or the iq.


What does the histogram of coin tosses represent? [1]

We have 300 coin throws, so basically a list of H's and T's.

My go-to for a histogram would be to have the buckets represent categories. I throw a coin 300 times and plot the number of heads in one column, and the number of tails in the other column.

Next I would maybe do a bucket-size of 10-throws and count outcomes chronologically in each bucket.

Lastly, I would consider the same, but only with cumulative counts.

[1] https://lambdaclass.com/data_etudes/central_limit_theorem_mi...


It’s the estimated (via simulation) sampling distribution of the number of heads observed out of three hundred coin tosses.


The TL;DR here seems to be that the CLT is false when its hypotheses are violated. This does not seem to be news.


you’re kind of missing the point imo... the goal is to explain misconceptions and misuses of the CLT. it’s useful to people who don’t properly understand the hypotheses of the CLT.


CLT is abused among data scientists quite often, and usually due to the points in the article.

- Convergence takes infinite time when the sampling variable has infinite variance (imagine a random number generator).

- While the distribution of an aggregate statistic tends to a Gaussian, most underlying distributions are not Gaussian, and most stats is done using the underlying raw data.

Junior data scientists assume everything is Gaussian (sometimes attributing it to the CLT) when doing linear regression, even though most distributions being modeled are not. Then, analysis done using the variance of the coefficients is meaningless because the assumptions are incorrect.


Everyone assumes things that aren't true, not just "junior" data scientists. There is no standard methodology for non-Gaussian, non-independent random variables.


Linear regression doesn't require or assume the input data values to be normally distributed, just that the residuals are normally distributed and have constant variance.


Further, those assumptions are only necessary for confidence intervals and hypothesis tests.


Linear regression doesn't require much, honestly, just non-null numeric data. My point was more on the interpretations of the outputs.


The constant variance part is important though. If you apply linear regression to a severely heteroscedastic dataset, the inferences you obtain from the model will be incorrect.


This comment is surprising to me. Most data scientists use results and measures of statistical significance provided by the program they're using, which accounts for the distribution used. Do you have examples where either data scientists are not presenting aggregate statistics or where someone is using the wrong kinds of p-values?

To your specific examples - the coefficient of a linear regression is distributed normally (?). Similarly, we know the expected distribution of most maximum likelihood estimators (logistic regression, etc.), and programs will give you the right p-value.

Of course, omitted variable bias is still a problem and it is possible to mis-specify your model. However, I think most data scientists are presenting aggregate statistics (means, regression coefficients) like you said, and that we have a pretty good handle on the underlying distributions.


For example, there were times where features were excluded because their "p-value was too large to be significant", regardless of the fact that the underlying distribution was not Gaussian. The p-value from a t-test, like in lots of regression software, requires Gaussian distributions.


If the data set is large, then the coefficient is approximately Gaussian, because of the CLT. This is one appropriate setting to use the CLT (unless you think you are in a setting where the CLT doesn't hold, such as infinite variance).


For example, let's say height is a feature in your model. No matter how big the data, it will never be Gaussian; it is bimodal. So the t-test in regression won’t be valid. Most ML is done on raw data.


If the t test is on a regression coefficient, then the sampling distribution is approximately Gaussian (for big enough data). It doesn't matter how many modes that the original data feature has. This is standard asymptotics in hypothesis testing.


No, a t-test makes no assumption that the underlying data is Gaussian. Again, most ML is done on raw data. If the raw feature is bimodal then the raw data is bimodal.


I can't figure out if you are agreeing or disagreeing with me. If you do non-penalized regression on raw data, then the t statistic will be approximately Gaussian, even if the raw data is bimodal. This follows from the CLT.
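
A sketch of that claim (the bimodal "height" regressor and the noise level are made up): the sampling distribution of the fitted slope comes out approximately Gaussian even though x itself is bimodal.

    import numpy as np

    rng = np.random.default_rng(7)
    n, n_sims, true_slope = 500, 5_000, 0.2

    slopes = np.empty(n_sims)
    for i in range(n_sims):
        # Bimodal feature: a mixture of two normals, like male/female heights.
        x = np.where(rng.random(n) < 0.5,
                     rng.normal(165, 7, n), rng.normal(178, 7, n))
        y = 1.0 + true_slope * x + rng.normal(0, 5, n)
        slopes[i] = np.polyfit(x, y, 1)[0]   # fitted slope from OLS

    counts, _ = np.histogram(slopes, bins=20)
    print(slopes.mean(), slopes.std())   # centred on 0.2
    print(counts)                        # a single bell, despite the bimodal x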


What is the standard deviation of a bimodal distribution with modes at 0 and inf? Is that a meaningful stat?


>which accounts for the distribution used

what does this mean? are you trying to say the code uses the empirical distribution as a proxy for the true distribution?


Misspecification error is hard to deal with.


> Convergence takes infinite time when the sampling variable has infinite variance

When does that really even happen in practice though?

> most distributions being modeled are not

That's a rather assertive statement. I dare say most distributions are (due to CLT) Gaussian.


> That's a rather assertive statement. I dare say most distributions are (due to CLT) Gaussian.

I have been doing stats/ML/data-science for more than a decade. The above has rarely ever been true in the datasets that I have looked at. Unless of course the raw data has been transformed in ways to make them look Gaussian. Gaussian is the exception, not the rule.


Distributions are only assumed. There is no such thing as truly normal data or sampling distribution.


Sure, though that’s a bit like saying there aren’t any perfect squares, only things you assume are squares.


There is a difference. Distributions are tested in practice, for example, by using a goodness-of-fit test. Often if the hypothesis of normality is not rejected, people think this implies normality.



