Part 2: Analysis of Relationship Between Two Variables

Linear regression
Linear correlation
Significance tests
Multiple regression
Linear Regression

Predictor and Predictand

In meteorology, we want to use a variable x to predict another variable y. In this case, the independent variable x is called the “predictor” and the dependent variable y is called the “predictand”.
Linear Regression

We have N paired data points (xi, yi) whose relationship we want to approximate with a linear regression:

y = a + b x

The errors produced by this linear approximation can be estimated by the sum of the squared residuals, Q.

The least-squares linear fit chooses the coefficients a and b that produce the minimum value of the error Q.
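The error measure Q appeared only as an equation image in the original slide; the standard least-squares definition it refers to is presumably

\[
Q \;=\; \sum_{i=1}^{N} \bigl( y_i - a - b\,x_i \bigr)^{2} ,
\]

i.e. the sum of the squared vertical distances between the data points and the regression line.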
Least Square Fit

Coefficients a and b are chosen such that the error Q is a minimum.

Setting the derivatives of Q with respect to a and b to zero leads to a pair of normal equations.

Solving these equations, we get the linear regression coefficients a and b.
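The equations themselves were images in the original slide; a standard reconstruction of the least-squares solution is

\[
\frac{\partial Q}{\partial a} = 0 ,\qquad
\frac{\partial Q}{\partial b} = 0
\quad\Longrightarrow\quad
b = \frac{\displaystyle\sum_{i}(x_i-\bar{x})(y_i-\bar{y})}{\displaystyle\sum_{i}(x_i-\bar{x})^{2}} ,
\qquad
a = \bar{y} - b\,\bar{x} ,
\]

where the overbars denote the sample means of x and y.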
Major Assumptions of Linear Regression

Significance of the Regression Coefficients
The regression coefficients a and b are statistics derived from a sample, not parameters of the population.

The regression coefficients therefore vary from one sample to another. We cannot use standard statistical theory to predict their variations, because that theory derives its results by assuming that successive pairs of observations (xi, yi) are independent, which is generally not true for geoscience variables.
How Good Is the Fit?

The quality of the linear regression can be analyzed using the “Analysis of Variance”.

The analysis separates the total variance of y (Sy²) into the part that can be accounted for by the linear regression (b²Sx²) and the part that cannot be accounted for by the regression (Se²):

Sy² = b²Sx² + Se²
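A brief sketch of where this decomposition comes from (not spelled out on the slide): writing the fitted value as ŷi = a + b xi, the least-squares conditions make the cross term vanish, so

\[
S_y^2 = \frac{1}{N}\sum_i (y_i-\bar{y})^2
      = \frac{1}{N}\sum_i (\hat{y}_i-\bar{y})^2 + \frac{1}{N}\sum_i (y_i-\hat{y}_i)^2
      = b^2 S_x^2 + S_e^2 ,
\]

since ŷi − ȳ = b (xi − x̄).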
Analysis of Variance

We then use the F statistic to test the ratio of the variance explained by the regression to the variance not explained by the regression:

F = (b²Sx²/1) / (Se²/(N−2))

Select a X% confidence level.

H0: b = 0
(i.e., the variation in y is not explained by the linear regression but rather by chance or fluctuations)
H1: b ≠ 0

Reject the null hypothesis at the α significance level if F > Fα(1, N−2).
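A minimal numerical sketch of this F test (not from the original slides; the data values are made up for illustration), using NumPy and SciPy:

    import numpy as np
    from scipy import stats

    # Illustrative paired data (x, y)
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
    y = np.array([2.1, 2.9, 3.2, 4.8, 5.1, 5.8, 7.2, 7.9])
    N = len(x)

    # Least-squares coefficients: b = Sxy / Sx^2, a = ybar - b*xbar
    xbar, ybar = x.mean(), y.mean()
    b = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    a = ybar - b * xbar

    # Variance decomposition: Sy^2 = b^2 Sx^2 + Se^2
    Sx2 = np.var(x)                    # variance of x
    Se2 = np.var(y - (a + b * x))      # unexplained (residual) variance
    F = (b ** 2 * Sx2 / 1) / (Se2 / (N - 2))

    # Critical value F_alpha(1, N-2) for alpha = 0.05
    F_crit = stats.f.ppf(0.95, 1, N - 2)
    print(f"F = {F:.2f}, F_crit = {F_crit:.2f}, reject H0: {F > F_crit}")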
Scattering

One way to estimate the “badness of fit” is to calculate the scatter:

scatter = (Se²)^0.5

The relation of the scatter to the regression line in the analysis of two variables is like the relation of the standard deviation to the mean in the analysis of one variable.

If lines are drawn parallel to the regression line at distances equal to ±(Se²)^0.5 above and below the line, measured in the y direction, about 68% of the observations should fall between the two lines.
Correlation and Regression

Linear Regression: Y = a + bX
A dimensional measure of the linear relationship between X and Y.
→ How does Y change with one unit of X?

Linear Correlation
A non-dimensional measure of the linear relationship between X and Y.
→ How does Y change (in standard deviations) with one standard deviation of X?
Linear Correlation

The linear regression coefficient (b) depends on the units of measurement.

If we want a non-dimensional measure of the association between two variables, we use the linear correlation coefficient (r).
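The formula for r was an equation image in the original slide; the standard definition it refers to is

\[
r \;=\; \frac{\displaystyle\sum_i (x_i-\bar{x})(y_i-\bar{y})}
             {\sqrt{\displaystyle\sum_i (x_i-\bar{x})^{2}}\,\sqrt{\displaystyle\sum_i (y_i-\bar{y})^{2}}}
     \;=\; \frac{S_{xy}}{S_x\,S_y} ,
\]

where Sxy is the covariance of x and y and Sx, Sy are their standard deviations.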
Correlation and Regression

Recall that in the linear regression we showed Sy² = b²Sx² + Se², so the fraction of the variance of y explained by the regression is b²Sx²/Sy².

We also know the least-squares expression for the regression coefficient b and the definition of the correlation coefficient r.

It turns out that this explained fraction of variance is exactly r².
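The algebra behind “it turns out that” was an equation image; a standard reconstruction, using the covariance Sxy, is

\[
b = \frac{S_{xy}}{S_x^{2}} ,\qquad
r = \frac{S_{xy}}{S_x S_y}
\quad\Longrightarrow\quad
\frac{b^{2} S_x^{2}}{S_y^{2}} = r^{2} ,
\]

so r² is the fraction of the variance of y explained by the regression and 1 − r² is the unexplained fraction.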
An Example

Suppose that the correlation coefficient between sunspots and five-year mean global temperature is 0.5 (r = 0.5).

The fraction of the variance of 5-year mean global temperature that is “explained” by sunspots is r² = 0.25.

The fraction of unexplained variance is 0.75.
Significance Test of Correlation Coefficient

When the true correlation coefficient is zero (H0: r = 0 and H1: r ≠ 0):
Use the Student’s t distribution to test the significance of r, with ν = N − 2 degrees of freedom.

When the true correlation coefficient is not expected to be zero:
We cannot use a symmetric normal distribution for the test. We must use Fisher’s Z transformation to convert the distribution of r to a normal distribution.
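The test statistics were equation images on the slide; the standard forms they refer to are

\[
t = r\,\sqrt{\frac{N-2}{1-r^{2}}} \quad (\nu = N-2 \text{ degrees of freedom}),
\qquad
Z = \tfrac{1}{2}\ln\!\frac{1+r}{1-r},
\qquad
\sigma_Z = \frac{1}{\sqrt{N-3}} ,
\]

where Z is approximately normally distributed with standard deviation σZ.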
An Example

Suppose N = 21 and r = 0.8. Find the 95% confidence limits on r.

Answer:

(1) Use Fisher’s Z transformation.

(2) Find the 95% significance limits on Z.

(3) Convert Z back to r.

(4) The 95% significance limits are: 0.56 < r < 0.92
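A minimal sketch (not from the original slides) of the same calculation in Python, using NumPy’s arctanh/tanh for the Fisher transformation:

    import numpy as np

    N, r = 21, 0.8
    z = np.arctanh(r)                # Fisher's Z = 0.5*ln((1+r)/(1-r)) ≈ 1.099
    sigma_z = 1.0 / np.sqrt(N - 3)   # standard deviation of Z ≈ 0.236
    z_lo, z_hi = z - 1.96 * sigma_z, z + 1.96 * sigma_z
    r_lo, r_hi = np.tanh(z_lo), np.tanh(z_hi)
    print(f"{r_lo:.2f} < r < {r_hi:.2f}")   # prints 0.56 < r < 0.92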
Test of the Difference Between Two Non-Zero Coefficients

We first convert each r to its Fisher’s Z statistic.

We then assume a normal distribution for Z1 − Z2 and use the z-statistic (not Fisher’s Z).
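The z-statistic itself was an equation image; the standard two-sample form it refers to is

\[
z \;=\; \frac{Z_1 - Z_2}{\sqrt{\dfrac{1}{N_1-3} + \dfrac{1}{N_2-3}}} ,
\]

where Z1, Z2 are the Fisher-transformed correlation coefficients and N1, N2 are the two sample sizes.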
Multiple Regression

If we want to regress y on more than one variable (x1, x2, x3, ….. xn), we perform the least-squares fit and remove the means from all variables.

We then solve the following matrix equation to obtain the regression coefficients a1, a2, a3, a4, ….., an.
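The regression model and the matrix of normal equations were images in the original. In the usual notation (after the means are removed there is no intercept term), the model is y ≈ a1 x1 + a2 x2 + … + an xn, and the least-squares coefficients satisfy

\[
\begin{pmatrix}
\overline{x_1 x_1} & \overline{x_1 x_2} & \cdots & \overline{x_1 x_n}\\
\overline{x_2 x_1} & \overline{x_2 x_2} & \cdots & \overline{x_2 x_n}\\
\vdots             &                    & \ddots & \vdots            \\
\overline{x_n x_1} & \overline{x_n x_2} & \cdots & \overline{x_n x_n}
\end{pmatrix}
\begin{pmatrix} a_1\\ a_2\\ \vdots\\ a_n \end{pmatrix}
=
\begin{pmatrix} \overline{x_1 y}\\ \overline{x_2 y}\\ \vdots\\ \overline{x_n y} \end{pmatrix} ,
\]

where the overbars denote averages over the N observations (equivalently covariances, since the means have been removed).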
Fourier Transform

The Fourier transform is an example of multiple regression. In this case, the independent (predictor) variables are sines and cosines of different harmonic frequencies.

These independent variables are orthogonal to each other; that is, the product of any two different predictors averages to zero over the record.

Therefore, all the off-diagonal terms are zero in the matrix of normal equations above, and we can easily solve for the regression (Fourier) coefficients.
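A sketch of what “we can easily get” looks like, assuming the predictors are the discrete harmonics cos(2πkt/N) and sin(2πkt/N) sampled at t = 1, …, N: their orthogonality,

\[
\sum_{t=1}^{N}\cos\!\frac{2\pi k t}{N}\cos\!\frac{2\pi m t}{N} = \frac{N}{2}\,\delta_{km},
\qquad
\sum_{t=1}^{N}\cos\!\frac{2\pi k t}{N}\sin\!\frac{2\pi m t}{N} = 0
\quad (0 < k, m < N/2),
\]

reduces the matrix to its diagonal, so each Fourier coefficient is obtained independently:

\[
a_k = \frac{2}{N}\sum_{t=1}^{N} y_t \cos\!\frac{2\pi k t}{N},
\qquad
b_k = \frac{2}{N}\sum_{t=1}^{N} y_t \sin\!\frac{2\pi k t}{N}.
\]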
How Many Predictors Are Needed?

Very often, one predictor is a function of the other predictors.

It then becomes an important question: how many predictors do we need in order to make a good regression (or prediction)? Does increasing the number of predictors improve the regression (or prediction)?

If too many predictors are used, some large coefficients may be assigned to variables that are not really highly correlated with the predictand (y). These coefficients are generated merely to help the regression relation fit y.

To answer this question, we have to figure out how fast (or slow) the “fraction of explained variance” increases as more predictors are added.
Explained Variance for Multiple Regression

As an example, we discuss the case of two predictors for the multiple regression.

We can repeat the derivation performed for the simple linear regression to find the fraction of variance explained by the two-predictor regression (R²), where the r’s below are the pairwise correlation coefficients.

We can show that if r2y is smaller than or equal to a “minimum useful correlation” value, it is not useful to include the second predictor in the regression:

minimum useful correlation = r1y * r12

This is the minimum correlation of x2 with y that is required to improve R², given that x2 is correlated with x1.
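The R² formula itself was an equation image on the slide; the standard two-predictor result it refers to is

\[
R^{2} \;=\; \frac{r_{1y}^{2} + r_{2y}^{2} - 2\,r_{1y}\,r_{2y}\,r_{12}}{1 - r_{12}^{2}} ,
\]

which reduces to r1y² when x2 adds no information (r2y = r12 = 0).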
An Example

For a 2-predictor case: r1y = r2y = r12 = 0.50

If we include only one predictor (x1), i.e. set r2y = r12 = 0 → R² = 0.25
By adding x2 to the regression (r2y = r12 = 0.50) → R² = 0.33
In this case, the 2nd predictor improves the regression.

For a 2-predictor case: r1y = r12 = 0.50 but r2y = 0.25

With only x1 → R² = 0.25
Adding x2 → R² = 0.25 (still the same!!)
In this case, the 2nd predictor is not useful. It is because
r2y ≤ r1y * r12 = 0.50 * 0.50 = 0.25
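Plugging the two cases into the two-predictor R² formula sketched above confirms these numbers:

\[
R^{2} = \frac{0.5^{2} + 0.5^{2} - 2(0.5)(0.5)(0.5)}{1 - 0.5^{2}} = \frac{0.25}{0.75} \approx 0.33 ,
\qquad
R^{2} = \frac{0.5^{2} + 0.25^{2} - 2(0.5)(0.25)(0.5)}{1 - 0.5^{2}} = \frac{0.1875}{0.75} = 0.25 .
\]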
Independent Predictors

Based on the previous analysis, we wish to use predictors that are independent of each other:
→ r12 = 0
→ minimum useful correlation = 0.

The worst predictors have r12 = 1.0.

The desire for independent predictors is part of the motivation for Empirical Orthogonal Function (EOF) analysis. EOF analysis attempts to find a relatively small number of independent quantities which convey as much of the original information as possible without redundancy.