How to sample data points for two variables so that they have the highest (close to +1) or lowest (close to zero) correlation coefficient?

Let’s assume that we have N data points (N = 212 in this case) for each of two variables, A and B. I need to sample n of them (n = 50 in this case) so that A and B have either the highest possible positive correlation coefficient or the lowest one (close to zero) within that sample. Is there an easy way to do this? Please note that the sampling must be index-based, i.e., if the i-th data point is selected, then both A and B are taken at that i-th index. Below is the sample dataframe (coded in R, but I am OK with any programming language):

Answer

Perhaps there is a better way, but it seems to me that this is something that could be solved with a genetic algorithm. The following approach returns the correlation value (i.e. the fitness) only if exactly n genes/variables are “turned on”; otherwise, zero is returned.

I had to initialize the population with individuals that have exactly 50 genes turned on to start the evolutionary process; otherwise nearly every random individual would score zero and selection would have nothing to work with. The result is pretty high (r = 0.97) after 1142 generations, with no improvement over the last 50 generations.
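To make the idea concrete, here is a minimal Python sketch of the same kind of search. It is not the original R code from this answer: it is a simplified, mutation-only evolutionary search in which each individual is stored directly as a set of exactly n indices, so the "exactly n genes on" constraint holds by construction instead of being enforced by a zero fitness. The names `pearson_r` and `evolve_subset`, the synthetic data, and all parameter values are assumptions made for illustration.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation of two 1-D arrays; 0.0 if either is constant."""
    if a.std() == 0 or b.std() == 0:
        return 0.0
    return float(np.corrcoef(a, b)[0, 1])

def evolve_subset(A, B, n, fitness=lambda r: r,
                  pop_size=60, generations=300, seed=0):
    """Search for n indices (n < len(A)) that maximise fitness(corr(A[idx], B[idx]))."""
    A, B = np.asarray(A, dtype=float), np.asarray(B, dtype=float)
    rng = np.random.default_rng(seed)
    N = len(A)

    def score(idx):
        return fitness(pearson_r(A[idx], B[idx]))

    # Each individual is an array of n distinct indices into A and B.
    pop = [rng.choice(N, size=n, replace=False) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        elite = pop[: pop_size // 2]          # keep the better half unchanged
        children = []
        for parent in elite:
            child = parent.copy()
            # Mutation: swap one selected index for a currently unselected one.
            outside = np.setdiff1d(np.arange(N), child)
            child[rng.integers(n)] = rng.choice(outside)
            children.append(child)
        pop = elite + children
    best = max(pop, key=score)
    return np.sort(best), pearson_r(A[best], B[best])

# Synthetic stand-in for the question's data (assumed, not the real dataframe).
data_rng = np.random.default_rng(1)
A = data_rng.normal(size=212)
B = 0.3 * A + data_rng.normal(size=212)

idx, r = evolve_subset(A, B, n=50)
print(len(idx), round(r, 3))
```

Because the better half of the population is carried over unchanged, the best correlation found never decreases from one generation to the next. A full genetic algorithm would also add crossover between parents, which this sketch leaves out for brevity.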

As per your comment on how to adjust the fitness function to target a specific correlation, see the example below. Since ga always maximizes fitness, you will need to flip the sign of the error: for example, -sqrt((res-targ)^2) is the negative absolute error to the target value, so the best possible fitness is 0, reached when res equals targ.
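The sign flip can be sketched in Python as follows. This is an illustrative translation rather than the original R code; `res` and `targ` follow the names used above, while `target_fitness`, `subset_fitness`, and the candidate values are assumed for the example.

```python
import numpy as np

def target_fitness(res, targ):
    """Negative absolute error: largest (0) when res equals targ."""
    return -np.sqrt((res - targ) ** 2)   # same as -abs(res - targ)

def subset_fitness(idx, A, B, targ):
    """Fitness of the sample A[idx], B[idx] with respect to a target correlation."""
    res = np.corrcoef(A[idx], B[idx])[0, 1]
    return target_fitness(res, targ)

# With a target of 0, a maximiser prefers the correlation closest to zero.
candidates = [0.97, 0.10, -0.05, 0.48]
closest = max(candidates, key=lambda res: target_fitness(res, 0.0))
print(closest)
```

Setting targ to 0 covers the "lowest correlation (close to zero)" case from the question, and any other value in [-1, 1] works the same way.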
