Jason Phang, Sat 17 September 2016, Blog

I was playing around with the distributions of correlations at work a while back and found the following interesting pattern. I know that the distribution of correlations for normal distributions is probably pretty well understood, so I'm hoping someone might have some insight into this.

The full code for this can be found on GitHub.

The setup for the experiment goes like this:

- Generate two arrays of length $k$ drawn independently from some distribution.
- Compute the sample correlation $\rho$ between the two arrays.
- Do this $N$ times, and look at the distribution of correlations.
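The steps above can be sketched in a few lines of Python. This is a minimal standard-library sketch, not the code from the repository; the names `pearson`, `sample_correlations`, `draw_normal`, `draw_uniform` and `histogram` are all made up for illustration, and the choice of a standard normal and a uniform on $[0, 1)$ is an assumption.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def sample_correlations(k, n_trials, draw, seed=0):
    """Run the experiment n_trials times and return the list of correlations.

    `draw` takes a random.Random instance and returns a single value.
    """
    rng = random.Random(seed)
    correlations = []
    for _ in range(n_trials):
        xs = [draw(rng) for _ in range(k)]
        ys = [draw(rng) for _ in range(k)]
        correlations.append(pearson(xs, ys))
    return correlations


def draw_normal(rng):
    # Assumed: standard normal draws.
    return rng.gauss(0.0, 1.0)


def draw_uniform(rng):
    # Assumed: uniform draws on [0, 1).
    return rng.random()


def histogram(values, n_bins=10):
    """Counts of values in n_bins equal-width bins covering [-1, 1]."""
    counts = [0] * n_bins
    for v in values:
        index = min(int((v + 1.0) / 2.0 * n_bins), n_bins - 1)
        counts[index] += 1
    return counts


if __name__ == "__main__":
    for k in range(2, 8):
        print(k, histogram(sample_correlations(k, 10_000, draw_normal)))
```

Swapping `draw_normal` for `draw_uniform` in the last line gives the uniform version of the same experiment.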

So at each step, we're calculating something like this:

$$ \rho = \operatorname{corr}\left( \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{bmatrix}, \begin{bmatrix} x_6 \\ x_7 \\ x_8 \\ x_9 \\ x_{10} \end{bmatrix} \right) $$

Here, I decided to just try uniform and normal distributions. (Also note that I'm using the term *correlation* a little loosely. Strictly, *correlation* is a property of a joint distribution, while what I compute from data is the *sample* correlation, which is a statistic. Here, I'm really looking at the sampling distribution of that statistic for $k$ observations.)

Before taking a look at the results, here's what I expected to see going in:

- For $k=2$, all correlations will be either 1 or -1. (The correlation is undefined if the two values in either array are equal, but for continuous distributions that happens with probability zero.)
- For high $k$, most of the probability mass will be around 0 since these are random draws, and the distribution rapidly tapers out toward 1 and -1.
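The $k=2$ expectation can be checked by hand. Writing the two arrays as $(x_1, x_2)$ and $(y_1, y_2)$ (my notation for this check only), each deviation from the mean is $\pm(x_1 - x_2)/2$ or $\pm(y_1 - y_2)/2$, so the sample correlation reduces to:

$$ \rho = \frac{\tfrac{1}{2}(x_1 - x_2)(y_1 - y_2)}{\sqrt{\tfrac{1}{2}(x_1 - x_2)^2 \cdot \tfrac{1}{2}(y_1 - y_2)^2}} = \frac{(x_1 - x_2)(y_1 - y_2)}{|x_1 - x_2|\,|y_1 - y_2|} = \operatorname{sign}\big((x_1 - x_2)(y_1 - y_2)\big) $$

So whenever both differences are nonzero, $\rho$ is exactly $+1$ or $-1$, each with probability one half for independent draws.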

The question then is what happens for small $k$. Well, it's time to take a look at the results. Here are the histogram plots for uniform and normal distributions and accompanying code:

The plots for $k=2$ and $k=5,6,7$ match our intuition neatly, especially with the probability mass increasingly concentrating around $\rho=0$. In some ways, the histogram for $k=3$ matches our intuition as well: as the distribution goes from being bimodal to unimodal, the bimodal $k=3$ histogram looks reasonable.

But what's going on in $k=4$?

We end up with what looks like a completely uniform distribution over sample correlations. It's also interesting that $k=4$ is the exact sweet spot for both uniform and normal distributions. Nothing about $k=4$ or the sample correlation formula immediately jumps out at me to explain this. I'm guessing there's some kind of symmetry that evenly distributes the correlations. Hence, my first question:

**Why is the sampling distribution for correlations exactly uniform for $k=4$?**
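One way to put a number on "looks uniform" is to measure the largest gap between the empirical CDF of the simulated correlations and the CDF of a uniform distribution on $[-1, 1]$ (the Kolmogorov-Smirnov distance). The sketch below does that for normal draws using only the standard library; the function names are hypothetical and the standard normal is an assumption. For $N$ samples that really are uniform, this distance is typically on the order of $1/\sqrt{N}$, so values far above that indicate a non-uniform shape.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def normal_correlations(k, n_trials, seed=0):
    """Correlations between pairs of independent standard normal arrays."""
    rng = random.Random(seed)
    result = []
    for _ in range(n_trials):
        xs = [rng.gauss(0.0, 1.0) for _ in range(k)]
        ys = [rng.gauss(0.0, 1.0) for _ in range(k)]
        result.append(pearson(xs, ys))
    return result


def ks_distance_from_uniform(values):
    """Largest gap between the empirical CDF and the Uniform(-1, 1) CDF."""
    ordered = sorted(values)
    n = len(ordered)
    worst = 0.0
    for i, v in enumerate(ordered):
        cdf = (v + 1.0) / 2.0
        # The empirical CDF jumps from i/n to (i+1)/n at v; check both sides.
        worst = max(worst, abs(cdf - i / n), abs((i + 1) / n - cdf))
    return worst


if __name__ == "__main__":
    for k in (3, 4, 5):
        distance = ks_distance_from_uniform(normal_correlations(k, 20_000))
        print(k, round(distance, 4))
```

Running the same check with uniform draws instead of normal ones would show whether the two cases are equally close to uniform at $k=4$ or only look that way in a histogram.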

I also tried another experiment (in truth, this was actually done before the above). Instead of drawing two samples of length $k$ from the same distribution, what if I draw one sample from a distribution and compare it against a completely linear series? In other words, we are calculating something like this:

$$ \rho = \operatorname{corr}\left( \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{bmatrix}, \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \\ 5 \end{bmatrix} \right) $$

Our results for the normal distribution look pretty similar to before, but what's going on for the uniform distributions? In both $k=3$ and $k=4$, we're seeing this odd bump around $\rho=0$. We don't even get a uniform distribution for $k=4$ any more.
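This second setup can be sketched the same way: one random array correlated against the fixed series $1, 2, \dots, k$. Again this is a standard-library sketch with hypothetical names (`correlations_against_line`, `share_near_zero`), assuming uniform draws on $[0, 1)$ and standard normal draws. Comparing the share of correlations in a narrow window around zero is a rough way to see the bump without a plot.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correlations_against_line(k, n_trials, draw, seed=0):
    """Correlate k random draws with the linear series 1..k, n_trials times."""
    rng = random.Random(seed)
    line = list(range(1, k + 1))
    return [
        pearson([draw(rng) for _ in range(k)], line)
        for _ in range(n_trials)
    ]


def share_near_zero(values, half_width=0.1):
    """Fraction of values that fall inside (-half_width, half_width)."""
    return sum(1 for v in values if abs(v) < half_width) / len(values)


def draw_uniform(rng):
    # Assumed: uniform draws on [0, 1).
    return rng.random()


def draw_normal(rng):
    # Assumed: standard normal draws.
    return rng.gauss(0.0, 1.0)


if __name__ == "__main__":
    for k in (3, 4):
        for name, draw in (("uniform", draw_uniform), ("normal", draw_normal)):
            rs = correlations_against_line(k, 20_000, draw)
            print(k, name, round(share_near_zero(rs), 4))
```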

I have even less of an idea of why this might be. My guess is that it might have to do with the uniform distribution having bounded values. Hence my second question:

**Why is there a hump around $\rho=0$ for the sampling correlation of uniform distributions against a linear sample?**