Monday, July 27, 2020

(Simple and multiple) linear regression & binary logistic regression in R Studio

1. What is regression?

Regression is a statistical method for estimating the relationship between the mean value of one variable (the outcome) and the corresponding values of other variables. It is used in many disciplines to determine the strength and character of the relationship between one dependent variable and a series of independent variables.

2. Uses of regression

  • It estimates the relationship between the dependent variable and the explanatory variables.
  • It determines the effect of each explanatory variable on the dependent variable while controlling for the effects of the other explanatory variables.
  • It predicts the value of the dependent variable for a given value of an explanatory variable.
  • It is also used to forecast effects and trends in future values.

3. What is linear regression?

In linear regression, the relationship between two variables is modeled by fitting a linear equation to the observed data. One variable is called the explanatory variable and the other the dependent variable. Linear regression is the most basic and commonly used type of predictive analysis. The overall idea is to examine:
  • how well the independent variables predict the outcome (dependent) variable
  • which independent variables significantly affect the outcome variable
  • how the magnitude and sign of the beta estimates impact the dependent variable
Note: The dependent variable is often called the outcome variable, criterion variable, endogenous variable or regressand. Likewise, independent variables are also termed exogenous variables, predictor variables or regressors.

Before attempting to fit a linear model to observed data, the researcher should determine whether there is a relationship between the variables of interest. A relationship does not necessarily imply that one variable causes the other, only that there is a significant association between them. A scatter diagram is very useful for judging the strength of the relationship between these variables: an increasing or decreasing trend is seen only if the proposed explanatory and dependent variables are associated. The numerical measure of this association is the correlation coefficient, which lies between -1 and 1 and reflects the strength of association in the observed data.

Figure: Scatter diagram showing the type of relationship

Figure: Regression line, slope, intercept and error terms (linear regression)
In the figure above,

  • the blue line is the best-fit regression line
  • the intercept (b0) and slope (b1) are shown in green
  • the error terms are represented by vertical red lines
As shown in the figure, some of the values lie above the blue line and some below it, which indicates that not all data points fall exactly on the regression line. The sum of the squared residuals is called the residual sum of squares (RSS). The average variation of the points around the fitted regression line is called the residual standard error (RSE). It is an important metric for evaluating the quality of a fitted regression model, and a small value is desirable.
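To make these two quantities concrete, here is a minimal sketch with simulated data (the variable names and numbers are made up purely for illustration): the residual sum of squares and the residual standard error are computed by hand and compared with the value R reports.

> set.seed(1)
> x <- 1:20
> y <- 5 + 2*x + rnorm(20, sd = 3)    # linear trend plus random noise
> fit <- lm(y ~ x)
> rss <- sum(residuals(fit)^2)        # residual sum of squares (RSS)
> sqrt(rss / df.residual(fit))        # residual standard error (RSE), computed by hand
> sigma(fit)                          # the same RSE as reported by summary(fit)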

4. Types of linear regression 

  • Simple linear regression: 1 dependent variable (interval or ratio scale), 1 independent variable (interval, ratio or dichotomous)
The linear equation takes the form: Y = a + bX

Predicted Y' = β0 + β1X + ε  ……> Simple regression
Predicted Y' = β0 + β1X1 + … + βnXn + ε  ……> Multiple regression

Where,
Y' = Predicted value of the dependent variable
β0 = Intercept
β1 = Regression coefficient (beta coefficient)
X = Independent (or explanatory) variable
ε = Error term
  • Multiple linear regression: 1 dependent variable (interval or ratio scale), 2+ independent variables (interval, ratio or dichotomous)
  • Logistic regression: 1 dependent variable (dichotomous), 1+ independent variable(s) (interval, ratio or dichotomous)
The logistic equation:
Logit(Y) = ln[Pi / (1 - Pi)] = β0 + β1X1 + … + βnXn
Logit(Y) = ln[Odds] = β0 + β1X1 + … + βnXn

where,
Logit(Y) = Predicted value of the dependent variable
ln = Natural log
Pi = Probability of experiencing an event
(1 - Pi) = Probability of not experiencing an event
Pi / (1 - Pi) = Odds of experiencing an event
β0 = Intercept
β1 = Regression coefficient (logit coefficient)
Xn = Independent (or explanatory) variable

The value of the odds ratio ranges from 0 to +∞:
More than 1 = increased (higher) odds
Less than 1 = decreased (lower) odds
Equal to 1 = same odds (no difference)

The value of the logit coefficient ranges from -∞ to +∞:
Negative (-) = decreased log-odds
Positive (+) = increased log-odds

Figure: Logistic regression S curve for a dichotomous dependent variable (a short sketch plotting this curve follows the list of regression types below)
  • Ordinal regression: 1 dependent variable (ordinal), 1+ independent variable(s) (nominal or dichotomous)
  • Multinomial regression: 1 dependent variable (nominal), 1+ independent variable(s) (interval, ratio or dichotomous)
  • Discriminant analysis: 1 dependent variable (nominal), 1+ independent variable(s) (interval or ratio)
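The following sketch illustrates the S curve from the logistic-regression figure mentioned above. It plots the logistic (inverse-logit) function with base R's plogis(); the range of the linear predictor is arbitrary and chosen only for display.

> xb <- seq(-6, 6, by = 0.1)   # arbitrary values of the linear predictor β0 + β1X1 + … + βnXn
> p <- plogis(xb)              # inverse logit: p = exp(xb) / (1 + exp(xb))
> plot(xb, p, type = "l",
+      xlab = "Linear predictor (logit)", ylab = "Probability",
+      main = "Logistic S curve")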

5. Simple and multiple linear regression in R Studio 

A. For simple linear regression 

File used: data.xlsx https://drive.google.com/file/d/1US3FfzF4LnKwf-fn-Om3pC80467N0iBu/view?usp=sharing

Note: Outputs are shown in blue.


  • Import the file into R Studio
  • Attach the file
> attach(data)
  • Load required packages 
> library(tidyverse)

> library(ggpubr)
> theme_set(theme_pubr())
> head(data,4)
# A tibble: 4 x 3
  Nitrogen Mulching Yield
     <dbl> <chr>    <dbl>
1        0 No        15.7
2       20 No        18.2
3       30 No        20.5
4       40 No        24.5

  • Now create a scatter plot displaying the Nitrogen dose versus yield of a crop and add a smoothed line. 
> ggplot(data, aes(x = Nitrogen, y = Yield)) +
+   geom_point() +
+   stat_smooth()
The relationship between nitrogen dose and yield is approximately linear, so one important assumption of linear regression, that the relationship between the dependent and independent variables is linear and additive, is fulfilled.

We can also compute the correlation coefficient between the variables using the following function:
> cor(data$Yield, data$Nitrogen)
[1] 0.8748427
What we should know is that the correlation coefficient measures the association between two variables and ranges from -1 to +1: -1 represents a perfect negative correlation, whereas +1 represents a perfect positive correlation. In a positive correlation, Y increases as X increases; in a negative correlation, Y decreases as X increases. When the coefficient approaches 0, the relationship is weak; generally, a coefficient between -0.2 and +0.2 is considered weak. A weak coefficient suggests that the variation in the dependent variable is not well explained by the independent variable, and in such cases we should look for other predictors.
In our case there is a strong relationship, as the correlation coefficient is large enough to justify building a linear model of y as a function of x.
  • Computation of regression 
In simple linear regression, we find the best line for predicting yield from the dose of nitrogen. In this case the linear regression can be written as:

Yield = c + m * Nitrogen

In R Studio, the function lm() is used to estimate the coefficients of the linear model.


> model <- lm(Yield ~ Nitrogen, data = data)
> model
Call:
lm(formula = Yield ~ Nitrogen, data = data)

Coefficients:
(Intercept)     Nitrogen  
    17.0643       0.2316 
The result above shows the intercept and the beta coefficient for Nitrogen.
Interpretation:
The output states that the estimated regression equation can be written as:
Yield = 17.06 + 0.23 * Nitrogen

where
Intercept = 17.06, meaning that the predicted yield with a nitrogen dose of 0 is 17.06 units.

Regression beta coefficient = 0.23, also known as the slope. It means that each one-unit increase in the dose of nitrogen increases the yield by 0.23 units.
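Once the model is fitted, predicted yields for chosen nitrogen doses can be obtained with predict(); the doses 0 and 50 below are arbitrary examples, and from the coefficients above the predictions should be roughly 17.06 and 28.64 units.

> predict(model, newdata = data.frame(Nitrogen = c(0, 50)))   # predicted yield at nitrogen doses 0 and 50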

  • Regression line 

> ggplot(data, aes(Nitrogen, Yield)) +
+   geom_point() +
+   stat_smooth(method = lm)
The above code can be used to add the regression line onto the scatter plot. 

  • Model assessment and its summary
We built the linear model of yield as a function of nitrogen dose,
Yield = 17.06 + 0.23 * Nitrogen
but we can use this formula only if we are sure that the model is statistically significant, in other words, that there is a statistically significant relationship between the predictor and the outcome variable and that the model fits the data well. So we should check the regression model, which is done with the following command.

> summary(model)

Call:
lm(formula = Yield ~ Nitrogen, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5114 -2.0350 -0.0271  2.2493  3.4571 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 17.06429    1.57040  10.866 7.39e-07 ***
Nitrogen     0.23157    0.04055   5.711 0.000195 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.77 on 10 degrees of freedom
Multiple R-squared:  0.7653, Adjusted R-squared:  0.7419 
F-statistic: 32.62 on 1 and 10 DF,  p-value: 0.0001953

Interpretation: 
1. t statistic and p values: 

For any predictor, the t statistic and its p-value test whether there is a statistically significant relationship between that predictor and the outcome variable. The hypotheses are:
a. Null hypothesis: the coefficient is equal to 0 (there is no relationship between x and y)
b. Alternative hypothesis: the coefficient is not equal to 0 (there is some relationship between x and y)

For any beta coefficient b, the t statistic is computed as
t = (b - 0) / SE(b), where SE(b) is the standard error of the coefficient b.
The t statistic measures how many standard errors b is away from 0, so a large t statistic produces a small p-value. The higher the t statistic (and the lower the p-value), the more significant the predictor. A statistically significant coefficient indicates that there is an association between the predictor and the outcome.

In our example, the p-values for both the intercept and the predictor variable are highly significant, so the null hypothesis is rejected: there is a significant association between the predictor and the outcome variable.
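The t values in the summary can be reproduced directly from the coefficient table, since t = Estimate / Std. Error; a small check:

> coefs <- coef(summary(model))                  # matrix with Estimate, Std. Error, t value, Pr(>|t|)
> coefs[, "Estimate"] / coefs[, "Std. Error"]    # should match the t value column (about 10.87 and 5.71)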

2. Standard error and confidence interval 

The standard error measures the variability (accuracy) of the beta coefficients. It can also be used to compute the confidence interval of a coefficient.

For example, the approximate 95% confidence interval for the coefficient b1 is defined as b1 +/- 2*SE(b1), where
i. the lower limit of b1 = b1 - 2*SE = 0.23 - 2*0.040 = 0.15
ii. the upper limit of b1 = b1 + 2*SE = 0.23 + 2*0.040 = 0.31

That is, there is a 95% chance that the interval [0.15, 0.31] contains the true value of b1.
Similarly, we can calculate the 95% confidence interval of b0 with the formula b0 +/- 2*SE(b0).
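These approximate limits can also be computed in R from the coefficient table; note that confint() below uses the exact t quantile rather than the factor 2, so its limits differ slightly.

> est <- coef(summary(model))[, "Estimate"]
> se  <- coef(summary(model))[, "Std. Error"]
> cbind(lower = est - 2*se, upper = est + 2*se)   # approximate 95% limits for b0 and b1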

We can also get this information by following command: 
> confint(model)
                 2.5 %     97.5 %
(Intercept) 13.5652100 20.5633615
Nitrogen     0.1412257  0.3219172

3. Model accuracy

We found that nitrogen significantly affects the yield of the crop. Now we should continue the diagnostics by checking how well the model fits the data, a process referred to as assessing the goodness of fit. It can be assessed using three different quantities displayed in the model summary: the residual standard error (RSE), R squared (R2) and the F-statistic. We obtained the following values:

Residual standard error: 2.77 
Multiple R-squared:  0.7653
Adjusted R-squared:  0.7419 
F-statistic: 32.62
 p-value: 0.0001953

1. Residual standard error (RSE)
Also known as the model sigma, the RSE is the residual variation: it represents the average variation of the observations around the fitted regression line and is the standard deviation of the residual errors. RSE provides an absolute measure of the variation in the data that cannot be explained by the model itself. When comparing two models, the one with the smaller RSE is the better-fitting one.

Dividing RSE by the average value of the outcome variable gives the prediction error rate, which should be as small as possible. In our case RSE = 2.77, which indicates that the observed yield values deviate from the true regression line by 2.77 units on average.

The percentage error can be calculated by:
> sigma(model)*100/mean(data$Yield)
[1] 11.17659

2. R squared and adjusted R squared

The value of R squared ranges from 0 to 1 and represents the proportion of variation in the data that is explained by the model; it measures the goodness of fit. The adjusted R squared adjusts for the degrees of freedom. For a simple linear regression, R squared is the square of the Pearson correlation coefficient.

A high value of R squared is desirable. However, because R squared tends to increase as more predictors are added to the model, such as in a multiple linear regression model, we should consider the adjusted R squared, which penalizes R squared for a high number of predictors.

An adjusted R squared close to 1 indicates that a large proportion of the variation in the outcome is explained by the regression model, whereas a value close to 0 indicates that the model does not explain much of the variation.
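For this simple regression, the statement that R squared equals the squared Pearson correlation can be checked directly:

> cor(data$Yield, data$Nitrogen)^2   # about 0.7653, matching the Multiple R-squared above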

3. F statistic

The F statistic gives the overall significance of the model: it assesses whether at least one predictor variable has a non-zero coefficient. In simple linear regression the F test is not very interesting because it duplicates the information in the t test of the single predictor. A large F statistic corresponds to a significant p-value.

In gist, the following are desirable (the sketch after this list shows how to extract these quantities from the fitted model):

  • RSE close to 0
  • higher R squared
  • higher F statistic
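These quantities can all be pulled from the fitted model programmatically; a small sketch:

> s <- summary(model)
> sigma(model)       # residual standard error (RSE)
> s$r.squared        # multiple R-squared
> s$adj.r.squared    # adjusted R-squared
> s$fstatistic       # F statistic and its degrees of freedom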

B. For Multiple linear regression

The model we want to build for estimating yield from nitrogen and mulching is shown below:

Yield = b0 + b1*Nitrogen + b2*Mulching

The model coefficients for multiple linear regression can be computed in R Studio as below:

Note: Remember to load the package tidyverse; the file is the same as that used for simple linear regression (data.xlsx).

> attach(data)
> library(tidyverse)

> model1 <- lm(Yield~Nitrogen+Mulching, data = data)
> summary(model1)

Call:
lm(formula = Yield ~ Nitrogen + Mulching, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.1614 -1.1457  0.1071  0.9857  1.1071 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 14.71429    0.68571  21.458 4.88e-09 ***
Nitrogen     0.23157    0.01578  14.677 1.36e-07 ***
MulchingYes  4.70000    0.62228   7.553 3.49e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.078 on 9 degrees of freedom
Multiple R-squared:  0.968, Adjusted R-squared:  0.9609 
F-statistic: 136.2 on 2 and 9 DF,  p-value: 1.869e-07

Interpretation:

  • The model is significant: both Nitrogen and Mulching are significant predictors of yield (p < 0.05). The coefficients and summary statistics can be interpreted in the same way as described for simple linear regression above.


> confint(model1)
                 2.5 %     97.5 %
(Intercept) 13.1630910 16.2654804
Nitrogen     0.1958801  0.2672628
MulchingYes  3.2923137  6.1076863

> sigma(model1)/mean(data$Yield)
[1] 0.04348946

In the multiple regression the RSE is 1.078 (a 4.3% error rate) versus the simple linear regression's RSE of 2.77 (11.17%), so the multiple model fits better than the simple model. The R squared in this case is also much higher than in the simple regression.
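The comparison above is informal. A standard way to test whether adding Mulching significantly improves the fit, not shown in the original tutorial, is a nested-model F test with anova(); AIC() can also be compared:

> anova(model, model1)   # F test comparing Yield ~ Nitrogen with Yield ~ Nitrogen + Mulching
> AIC(model, model1)     # the model with the lower AIC is preferred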

C. Binary logistic regression 

File used: binary.xlsx
https://drive.google.com/file/d/1lHIzPI5fIQVjqEbp4sS37TIDm-F9UZaw/view?usp=sharing
Note: In the file, organic farming 1 = Yes, 0 = No

  • Import the file into R Studio
  • Attach the file
1. Significance of model 
> attach(binary)

> head(binary,4)
# A tibble: 4 x 2
  Literacy organic_farming
  <chr>              <dbl>
1 Yes                    1
2 Yes                    1
3 Yes                    0
4 Yes                    0

> model2 <- glm( organic_farming~ Literacy, data = binary, family = binomial)
> summary(model2)

Call:
glm(formula = organic_farming ~ Literacy, family = binomial, 
    data = binary)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.4823  -0.7244  -0.7244   0.9005   1.7125  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  -1.2040     0.6583  -1.829   0.0674 .
LiteracyYes   1.8971     0.8991   2.110   0.0349 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

  Null deviance: 34.296  on 24  degrees of freedom
Residual deviance: 29.322  on 23  degrees of freedom
AIC: 33.322

Number of Fisher Scoring iterations: 4

Interpretation: Literacy is a significant predictor of practicing organic farming (p < 0.05). 

2. Odds ratio and 95% confidence interval 

> exp(coefficients(model2))
(Intercept)        LiteracyYes 
   0.300000         6.666667 

> exp(confint(model2))
Waiting for profiling to be done...
                 2.5 %     97.5 %
(Intercept) 0.06725892  0.9806492
LiteracyYes 1.24897267 45.2701894

Because the outcome is binary, the odds ratio is obtained by exponentiating the coefficients.

Interpretation: The odds ratio is 6.67, which means that the odds of practicing organic farming among literate farmers are 6.67 times those of illiterate farmers.

The AIC (Akaike Information Criterion) is 33.32. It reflects the relative quality of the model and depends on two factors: the number of predictors in the model and the likelihood that the model reproduces the data. A lower AIC value is desirable; the model with the lower AIC is the better one, so AIC provides a means of model selection. No statistical model represents the data exactly: some information is always lost by using a model to represent the process. AIC estimates the relative amount of information lost by a given model; the less information a model loses, the higher the quality of that model.

As per the formula, AIC=-2log(L)+2K

Where,
L is the maximum likelihood
K is the number of parameters
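To see AIC used for model selection, the fitted model can be compared against an intercept-only (null) model. This comparison is not part of the original output, but from the null deviance reported above the null model's AIC should be about 36.3, higher than the 33.32 of model2.

> model_null <- glm(organic_farming ~ 1, data = binary, family = binomial)   # intercept-only (null) model
> AIC(model_null, model2)   # the model with the lower AIC (here model2) is the better one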

Moreover, adding predictors to the model widens the 95% confidence intervals, and if there are too many predictors the power of the model to detect significance may be reduced.

Hosmer-Lemeshow goodness of fit: this test examines how well the model fits the data. It is calculated as:
> library(ResourceSelection)
> library(dplyr)
> organic <- binary %>% filter(!is.na(Literacy))

Note: If there are more variables, say Age, add & !is.na(Age), and so on.
> hoslem.test(organic$organic_farming, fitted(model2))

Hosmer and Lemeshow goodness of fit (GOF) test

data:  organic$organic_farming, fitted(model2)
X-squared = 2.7953e-24, df = 8, p-value = 1

Interpretation: As the p-value is greater than 0.05, the model fit is good. The data frame "organic" is created to drop all observations with missing data, because the test cannot be performed when missing values are present.
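Finally, the logit coefficients can be converted into predicted probabilities of practicing organic farming with the inverse-logit function plogis(); from the coefficients above these should be roughly 0.23 for illiterate farmers and 0.67 for literate farmers (the same values are returned by predict(model2, type = "response")).

> b <- coef(model2)
> plogis(b["(Intercept)"])                      # predicted probability for illiterate farmers (~0.23)
> plogis(b["(Intercept)"] + b["LiteracyYes"])   # predicted probability for literate farmers (~0.67)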