Tuesday, June 16, 2020

T-test and correlation in RStudio


1. Introduction to t test
A t test is a type of inferential statistic used to determine whether there is a significant difference between the means of two groups. It is used as a hypothesis-testing tool that allows a researcher to test an assumption about a population. It looks at the t statistic, the t-distribution values, and the degrees of freedom to determine statistical significance. However, to compare the means of three or more groups, one must use analysis of variance (ANOVA). The t test compares the average values of the two data sets and determines whether they came from the same population.

Calculating a t test requires three key data values: the difference between the mean values of the two data sets (the mean difference), the standard deviation of each group, and the number of data values in each group.

1.1 What are the assumptions for a t test?
Figure: A normal curve. 
  1. Dichotomous independent variable (two groups, A and B)
  2. Continuous dependent variable
  3. Each observation of the dependent variable is independent of the other observations of the dependent variable (its probability distribution isn't affected by their values). Exception: For the paired t-test, we only require that the pair-differences (Ai - Bi) be independent from each other (across i). [Note: "independent" and "dependent" are used in two different senses here. Just think of a "dependent variable" as one thing, and "observations that are dependent" as another thing.]
  4. Dependent variable has a normal distribution, with the same variance, σ², in each group (as though the distribution for group A were merely shifted over to become the distribution for group B, without changing shape).

Note: σ, "sigma", the scale parameter of the normal distribution, also known as the population standard deviation, is easy to see on a picture of a normal curve. Located one σ to the left or right of the normal mean are the two places where the curve changes from convex to concave (the second derivative is zero).

1.2 What is the formula of the t test?
For the independent two-sample t test, the t value is calculated as:

t = (x̄1 − x̄2) / √[ s² (1/n1 + 1/n2) ]

In this formula, t is the t value, x̄1 and x̄2 are the means of the two groups being compared, s² is the pooled variance of the two groups, and n1 and n2 are the numbers of observations in each of the groups.
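To see how this formula maps onto R, here is a minimal sketch with made-up vectors a and b (not from this post's data set); the hand calculation should match the t statistic reported by t.test() with var.equal = TRUE:

a <- c(1680, 1150, 1780, 2100, 390)    # hypothetical group A values
b <- c(1050, 1660, 2080, 450, 1560)    # hypothetical group B values
n1 <- length(a); n2 <- length(b)
s2 <- ((n1 - 1) * var(a) + (n2 - 1) * var(b)) / (n1 + n2 - 2)   # pooled variance
(mean(a) - mean(b)) / sqrt(s2 * (1/n1 + 1/n2))                  # t value from the formula
t.test(a, b, var.equal = TRUE)$statistic                        # same t from the built-in test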

1.3 What are the types of t test?

There are three types of t test: the one-sample t test, the paired-sample t test, and the independent-sample t test.

A. One Sample t test:

It is used to test the mean of a single group against a known mean (a gold-standard or widely accepted value). E.g., we want to compare the land holding of respondents with the national average land holding of Nepal (0.6 ha). Here the test variable should be at a continuous level of measurement (interval or ratio).
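A minimal sketch of this example, assuming a hypothetical vector of land holdings in hectares (the variable name and values are made up):

land_holding <- c(0.4, 0.7, 0.5, 0.9, 0.6, 0.3, 0.8)   # hypothetical sample (ha)
t.test(land_holding, mu = 0.6)                          # test the sample mean against 0.6 ha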

B. Paired Sample t test: 

A paired-sample t test is used to compare means from the same group at different times (suitable for longitudinal data), e.g., comparing the income of farmers before and after off-season vegetable farming training. In this case too, both measurements should be at a continuous level of measurement (interval or ratio).
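A minimal sketch of this example, with hypothetical before/after incomes for the same farmers (values are made up):

income_before <- c(120, 95, 130, 110, 105)    # hypothetical incomes before training
income_after  <- c(140, 100, 150, 125, 118)   # hypothetical incomes after training (same farmers, same order)
t.test(income_before, income_after, paired = TRUE)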

C. Independent sample t test: 
An independent-sample t test is used to compare the means of two groups, e.g., comparing the income of male and female farmers. The independent variable should be categorical (with two options/groups, e.g., gender) and the dependent variable should be continuous (e.g., income).
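A minimal sketch, assuming a hypothetical data frame with an income column and a two-level gender column (names and values are made up):

farmers <- data.frame(
  income = c(120, 95, 130, 110, 105, 140, 100, 150),    # hypothetical incomes
  gender = c("male", "female", "male", "female",
             "male", "female", "male", "female")
)
t.test(income ~ gender, data = farmers, var.equal = TRUE)   # compare mean income of the two groups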

1.4 What is a t score?

It is the ratio of the difference between the two group means to the variation within the groups. The larger the t score, the greater the difference between the groups; the smaller the value, the more similar the groups are.

1.5 What is a p value?

Every t value has a p value associated with it. The p value is the probability that the results from your sample data occurred by chance. P values range from 0 to 1 (0% to 100%) and are usually written as decimals; for example, a p value of 5% is 0.05. Low p values are good: they indicate that your data are unlikely to have occurred by chance alone. For example, a p value of 0.01 means there is only a 1% probability that the results from an experiment happened by chance. We commonly use p-value thresholds of 0.05, 0.01, and 0.001 to test our data.
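As a minimal sketch (with a made-up t statistic and degrees of freedom), this is how a t value is converted into a two-sided p value in R; t.test() reports this automatically:

t_value <- 2.5              # hypothetical t statistic
df <- 14                    # hypothetical degrees of freedom
2 * pt(-abs(t_value), df)   # two-sided p value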

2. What is correlation analysis?
Correlation analysis is a statistical method used to evaluate the strength of the relationship between two quantitative variables. A high correlation means that two or more variables have a strong relationship with each other, while a weak correlation means that the variables are hardly related. In other words, it is the process of studying the strength of that relationship using available statistical data.

It is a bivariate analysis (two variables are used) that measures the strength of association between two variables and the direction of the relationship. In terms of strength, the value of the correlation coefficient varies between +1 and -1. A value of ±1 indicates a perfect degree of association between the two variables. As the correlation coefficient moves towards 0, the relationship between the two variables becomes weaker. The direction of the relationship is indicated by the sign of the coefficient: a + sign indicates a positive relationship and a − sign indicates a negative relationship.

2.1 What are the types of correlation analysis?

A. Pearson r correlation: 

Pearson r correlation is the most widely used correlation statistic to measure the degree of the relationship between linearly related variables. The following formula is used to calculate the Pearson r correlation:

r = Σ (xᵢ − x̄)(yᵢ − ȳ) / √[ Σ (xᵢ − x̄)² · Σ (yᵢ − ȳ)² ]

where xᵢ and yᵢ are the paired observations and x̄ and ȳ are their means.

Assumptions for Pearson correlation 
  1. Both variables should be normally distributed (normally distributed variables have a bell-shaped curve) and are in continuous scale of measurement. 
  2. Other assumptions include linearity and homoscedasticity. Linearity assumes a straight-line relationship between the two variables, and homoscedasticity assumes that the data are equally distributed about the regression line.
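A minimal sketch with made-up vectors, showing that the formula above and R's built-in cor() agree:

x <- c(45, 56, 47, 16, 23, 27)   # hypothetical values
y <- c(8, 8, 7, 4, 5, 5)         # hypothetical values
sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))    # Pearson r from the formula
cor(x, y, method = "pearson")                          # same result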
B. Kendall rank correlation:

Kendall rank correlation is a non-parametric test that measures the strength of dependence between two variables. If we consider two samples, a and b, where each sample size is n, we know that the total number of pairings between a and b is n(n−1)/2. The following formula is used to calculate the Kendall rank correlation:

τ = (Nc − Nd) / [ n(n−1)/2 ]

Nc = number of concordant pairs
Nd = number of discordant pairs

Key Terms
Concordant: Ordered in the same way.
Discordant: Ordered differently.
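In R, Kendall's tau can be obtained with the same cor.test() function used later in this post; a minimal sketch with made-up rank data:

x <- c(1, 2, 3, 4, 5, 6)             # hypothetical ranks
y <- c(2, 1, 4, 3, 6, 5)             # hypothetical ranks
cor.test(x, y, method = "kendall")   # reports tau and its p value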

C. Spearman rank correlation: 
Spearman rank correlation is a non-parametric test that is used to measure the degree of association between two variables.  The Spearman rank correlation test does not carry any assumptions about the distribution of the data and is the appropriate correlation analysis when the variables are measured on a scale that is at least ordinal.
The following formula is used to calculate the Spearman rank correlation:

ρ = 1 − ( 6 Σ dᵢ² ) / [ n(n² − 1) ]

ρ = Spearman rank correlation
dᵢ = the difference between the ranks of corresponding values
n = number of observations

The assumptions of the Spearman correlation are that data must be at least ordinal and the scores on one variable must be monotonically related to the other variable.
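A minimal sketch with made-up, untied data, showing that the rank-difference formula matches cor(..., method = "spearman"):

x <- c(45, 56, 47, 16, 23, 27)       # hypothetical values (no ties)
y <- c(8, 9, 7, 4, 5, 6)             # hypothetical values (no ties)
d <- rank(x) - rank(y)               # differences between ranks
n <- length(x)
1 - 6 * sum(d^2) / (n * (n^2 - 1))   # Spearman's rho from the formula
cor(x, y, method = "spearman")       # same result when there are no ties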

2.2 What is the correlation coefficient?

The main result of a correlation analysis is called the correlation coefficient (or "r"). It ranges from -1.0 to +1.0. The closer r is to +1 or -1, the more closely the two variables are related. If r is close to 0, it means there is little or no linear relationship between the variables. If r is positive, it means that as one variable gets larger the other gets larger. If r is negative, it means that as one gets larger the other gets smaller (often called an "inverse" correlation).


While correlation coefficients are normally reported as r = (a value between -1 and +1), squaring them makes them easier to understand. The square of the coefficient (or r squared) is equal to the proportion of the variation in one variable that is related to the variation in the other. After squaring r, ignore the decimal point: an r of .5 means 25% of the variation is related (.5 squared = .25), and an r of .7 means 49% of the variation is related (.7 squared = .49).

A correlation report can also show a second result of each test: statistical significance. In this case, the significance level tells you how likely it is that the reported correlation is due to chance in the form of random sampling error.
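The same arithmetic in R, as a minimal sketch with a made-up coefficient:

r <- 0.7    # hypothetical correlation coefficient
r^2         # 0.49, i.e. 49% of the variation is shared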

3. How to perform and interpret a t test in RStudio?

A. First check the normality 

Note: Output is shown in blue color. 
1. Import the data and check normality 
> attach(t)
> head(t)
# A tibble: 6 x 3
  yield_before yield_after INM  
         <dbl>       <dbl> <chr>
1         1680        1732 Yes  
2         1150        1323 No   
3         1780        1920 Yes  
4         2100        2150 Yes  
5          390         453 No   
6         1680        1690 Yes  

> hist(yield_after)
> hist(yield_after,breaks = 12,col = "yellow")
> qqnorm(yield_after);qqline(yield_after)
#Interpretation: What do you think? Is it normal or not? Though some points do not fall exactly on the QQ line, the values stay within an acceptable range.
> library(car)
Loading required package: carData
Warning messages:
1: package ‘car’ was built under R version 3.6.3 
2: package ‘carData’ was built under R version 3.6.3 
> qqPlot(t$yield_after)
[1]  5 15
Interpretation: cases 5 and 15 are deviated. 
> shapiro.test(yield_after)
        Shapiro-Wilk normality test
data:  yield_after
W = 0.88471, p-value = 0.05581
#Interpretation: As the p value>0.05, we can say the values are normally distributed. 
> library(rstatix)
Attaching package: ‘rstatix’
The following object is masked from ‘package:stats’:
    filter
Warning message:
package ‘rstatix’ was built under R version 3.6.3 
> t %>%
+   group_by(INM) %>%
+   identify_outliers(yield_after)
[1] INM          yield_before yield_after  is.outlier   is.extreme  
<0 rows> (or 0-length row.names)
#Interpretation: There are no outliers or extreme outliers (the output table is empty).
2. Perform t test 
a. One sample t test: (suppose the average yield of Nepal for the crop is 1250 kg/ha) 
Null hypothesis: mean yield of farmers = 1250 kg/ha
> t.test(yield_before,mu=1250)
 One Sample t-test
data:  yield_before
t = 0.91896, df = 14, p-value = 0.3737
alternative hypothesis: true mean is not equal to 1250
95 percent confidence interval:
 1062.361 1718.973
sample estimates:
mean of x 
 1390.667 
#Interpretation: The average yield is 1390.67 kg/ha. Although the farmers' average yield is higher than the national average, the difference is not statistically significant (t = 0.92, df = 14, p > 0.05).
b. Paired sample t test: 
Hypothesis: Yield before intervention=yield after intervention
> res<-t.test(yield_before,yield_after,paired = TRUE,alternative = "two.sided")
> res
Paired t-test
data:  yield_before and yield_after
t = -4.2769, df = 14, p-value = 0.0007671
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -76.67567 -25.45767
sample estimates:
mean of the differences 
              -51.06667
#Interpretation: The yields before and after treatment are significantly different from each other; in other words, the yield after treatment is significantly higher than the yield before treatment (t = -4.28, df = 14, p < 0.001).
Note: 

If you want to test whether the average yield before treatment is less than the average yield after treatment, type this:
> res1<-t.test(yield_before,yield_after,paired = TRUE,alternative = "less")
> res1
If you want to test whether the average yield before treatment is more than the average yield after treatment, type this:
> res1<-t.test(yield_before,yield_after,paired = TRUE,alternative = "greater")
> res1

How to plot? 
> library(PairedData)
> pd<-paired(yield_before,yield_after)
> plot(pd, type = "profile") + theme_bw()
If you want to see p value, mean and confidence interval: 
> res1$p.value

[1] 0.9996165

> res1$estimate
mean of the differences 
              -51.06667 
> res1$conf.int
[1] -72.09691       Inf
attr(,"conf.level")

[1] 0.95

c. Independent sample t test

Note: for an independent-sample t test, the observations should be independent and normally distributed, and the variances of the two groups should be equal; these assumptions are checked below.

i. Check your data
> print(t)
# A tibble: 15 x 3
   yield_before yield_after INM  
          <dbl>       <dbl> <chr>
 1         1680        1732 Yes  
 2         1150        1323 No   
 3         1780        1920 Yes  
 4         2100        2150 Yes  
 5          390         453 No   
 6         1680        1690 Yes  
 7         1050        1100 No   
 8         1660        1700 Yes  
 9         2080        2100 Yes  
10          450         500 No   
11         1560        1600 Yes  
12         1150        1175 No   
13         1600        1623 Yes  
14         2080        2100 Yes  
15          450         460 No 

> library(dplyr)
> group_by(t, INM) %>%
+   summarise(
+     count = n(),
+     mean = mean(yield_before, na.rm = TRUE),
+     sd = sd(yield_before, na.rm = TRUE)
+   )
# A tibble: 2 x 4
  INM   count  mean    sd
  <chr> <int> <dbl> <dbl>
1 No        6  773.  379.
2 Yes       9 1802.  222.

ii. Visualize your data
> library(ggpubr)
> ggboxplot(t, x = "INM", y = "yield_before", 
+           color = "INM", palette = c("#00AFBB", "#E7B800"),
+           ylab = "Yield of Okra", xlab = "Use of INM")
iii. Checking assumptions for independent  t test
* The samples are independent because the samples for yes and no are not related.
* The dependent variable follows a normal distribution in each group (see the Shapiro-Wilk tests below).

> with(t, shapiro.test(yield_before[INM == "Yes"]))
 Shapiro-Wilk normality test

data:  yield_before[INM == "Yes"]

W = 0.81562, p-value = 0.03074

> with(t, shapiro.test(yield_before[INM == "No"]))
Shapiro-Wilk normality test

data:  yield_before[INM == "No"]

W = 0.76055, p-value = 0.02521

Note: As the p value is less than 0.05 for both groups, the observations are not normally distributed, so we should use the Wilcoxon rank-sum test (the non-parametric equivalent of the independent t test). I will show it in the next blog. For now, I am showing the process of the independent-sample t test (not recommended to perform in this situation).
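For reference, a minimal one-line preview of that non-parametric alternative, assuming the same data frame t (details in the next post):

wilcox.test(yield_before ~ INM, data = t)   # Wilcoxon rank-sum (Mann-Whitney) test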

* Check the homogeneity of variance
> res.ftest <- var.test(yield_before ~ INM, data = t)

> res.ftest
 F test to compare two variances

data:  yield_before by INM
F = 2.9152, num df = 5, denom df = 8, p-value = 0.173
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
  0.6051586 19.6986154
sample estimates:
ratio of variances 

          2.915216 

#Interpretation: As p>0.05, there is no significant difference in variance. 

There are various tests for testing homogeneity of variance.

> bartlett.test(yield_before,INM)

               Bartlett test of homogeneity of variances

data:  yield_before and INM
Bartlett's K-squared = 1.6861, df = 1, p-value = 0.1941

> fligner.test(yield_before,INM)

               Fligner-Killeen test of homogeneity of variances

data:  yield_before and INM
Fligner-Killeen:med chi-squared = 0.72046, df = 1, p-value = 0.396

> leveneTest(yield_before,INM)
Levene's Test for Homogeneity of Variance (center = median)
      Df F value  Pr(>F) 
group  1  5.0217 0.04312 *
      13                 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Note:
Bartlett’s test is used to determine whether k samples come from populations with equal variances; however, Bartlett’s test is sensitive to deviations from normality. If you have verified that your data are normally distributed, Bartlett’s test is a good first test; but if your data are non-normal, you should always compare the results of Bartlett’s test with those of Levene’s and Fligner-Killeen’s tests, as there is a good chance of a false positive. In all of these tests, p-values less than 0.1, 0.05, or 0.001 (depending on your desired threshold) suggest that the variances are significantly different and the homogeneity-of-variance assumption has been violated.

Levene’s test is slightly more robust to departures from normality than Bartlett’s test. Levene’s test performs a one-way ANOVA on the deviation scores, that is, the absolute difference between each score and the mean of the group from which it came. Its result here differs from the Bartlett test above, likely because of the non-normality in our data.

As previously stated, Bartlett’s test can provide false positive results when the data is non-normal. Also, when the sample size is large, small differences in group variances can produce a statistically significant Levene’s test (similar to the Shapiro-Wilk test for normality). Fligner-Killeen’s test is a non-parametric test which is robust to departures from normality and provides a good alternative or double check for the previous parametric tests.

You can transform the data and check again, or go for a non-parametric test.
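A minimal sketch of the first option, assuming the same data frame t as above (log_yield is a hypothetical new column name):

t$log_yield <- log(t$yield_before)               # log-transform the yield
with(t, shapiro.test(log_yield[INM == "Yes"]))   # re-check normality in each group
with(t, shapiro.test(log_yield[INM == "No"]))
leveneTest(t$log_yield, as.factor(t$INM))        # re-check homogeneity of variance (car package)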


iv. Compute independent sample t test
> res4 <- t.test(yield_before ~ INM, data = t, var.equal = TRUE)

> res4
 Two Sample t-test

data:  yield_before by INM
t = -6.6824, df = 13, p-value = 1.511e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1361.5227  -696.2551
sample estimates:
 mean in group No mean in group Yes 

         773.3333         1802.2222

Interpretation: The average yield of farmers not doing INM is 773.33 kg/ha, which is significantly lower than the average yield of farmers doing INM (1802.22 kg/ha) (t = -6.68, df = 13, p < 0.001).

Try these commands and see the output:
res4$p.value
res4$conf.int

res4$estimate

4. How to perform correlation? 
Use the file: correlation.sav  
a. Visualize your data

Import the data 

> attach(corelation)
> head(corelation)
# A tibble: 6 x 4
     SN home_garden number_sps food_suffienciency_months
  <dbl>   <dbl+lbl>      <dbl>                     <dbl>
1     1     1 [Yes]         45                         8
2     2     1 [Yes]         56                         8
3     3     1 [Yes]         47                         7
4     4     2 [no]          16                         4
5     5     1 [Yes]         23                         5
6     6     2 [no]          27                         5
> library("ggpubr")

> print(corelation)
# A tibble: 20 x 4
      SN home_garden number_sps food_suffienciency_months
   <dbl>   <dbl+lbl>      <dbl>                     <dbl>
 1     1     1 [Yes]         45                         8
 2     2     1 [Yes]         56                         8
 3     3     1 [Yes]         47                         7
 4     4     2 [no]          16                         4
 5     5     1 [Yes]         23                         5
 6     6     2 [no]          27                         5
 7     7     2 [no]          12                         3
 8     8     2 [no]          17                         4
 9     9     1 [Yes]         45                         7
10    10     2 [no]          17                         3
11    11     1 [Yes]         55                         8
12    12     2 [no]          16                         3
13    13     2 [no]          18                         2
14    14     2 [no]          19                         2
15    15     1 [Yes]         48                         6
16    16     1 [Yes]         56                         7
17    17     2 [no]          14                         2
18    18     2 [no]          17                         3
19    19     2 [no]          17                         3
20    20     1 [Yes]         44                         7
> library(ggpubr)
> ggscatter(corelation, x = "number_sps", y = "food_suffienciency_months", 
+           add = "reg.line", conf.int = TRUE, 
+           cor.coef = TRUE, cor.method = "pearson",
+           xlab = "Number of species", ylab = "Food sufficiency in months")
(Note: While typing the commands, ignore the > and + prompts.)
Interpretation: See the p value and the strength of the correlation coefficient displayed on the plot.

b. Check the assumptions for correlation
* Check the linearity of covariation: 

From the plot above, we can say that the relationship is linear. If the scatter plot showed a curved pattern, we would be dealing with a nonlinear association between the two variables.

*Check the normality: 

> shapiro.test(corelation$number_sps)
            Shapiro-Wilk normality test
data:  corelation$number_sps

W = 0.81155, p-value = 0.001291

> shapiro.test(corelation$food_suffienciency_months)
            Shapiro-Wilk normality test
data:  corelation$food_suffienciency_months

W = 0.87965, p-value = 0.01744

Interpretation: As the p values are less than 0.05, we cannot assume normality. (However, I will proceed further.) You can also check for outliers.

We can also inspect normality visually using Q-Q plots (quantile-quantile plots). A Q-Q plot draws the correlation between a given sample and the normal distribution.

> library(ggpubr)

> ggqqplot(corelation$number_sps, ylab = " number of species in home garden")

> ggqqplot(corelation$food_suffienciency_months, ylab = "food sufficiency in months")


Interpretation: From the normality plots, it can be concluded that both variables may come from normally distributed populations.

OR
> library(car)
> qqPlot(corelation$number_sps)

[1]  2 16
Cases 2 and 16 are a bit deviated.

> qqPlot(corelation$food_suffienciency_months)
[1] 1 2


Cases 1 and 2 are a bit deviated.

Note: If the data are not normally distributed, it is recommended to use a non-parametric correlation, such as the Spearman or Kendall rank-based correlation tests.

c. Correlation test
> res5 <- cor.test(corelation$number_sps, corelation$food_suffienciency_months, method = "pearson")
> res5

            Pearson's product-moment correlation
data:  corelation$number_sps and corelation$food_suffienciency_months
t = 10.988, df = 18, p-value = 2.055e-09
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8351088 0.9735156
sample estimates:
      cor 

0.9328807 

or (note that when a vector of methods is supplied, cor() and cor.test() use only the first one, "pearson"; pass a single method to run a different test):

> cor(number_sps, food_suffienciency_months, method = c("pearson", "kendall", "spearman"))

[1] 0.9328807
> cor.test(number_sps, food_suffienciency_months, method=c("pearson", "kendall", "spearman"))
            Pearson's product-moment correlation
data:  number_sps and food_suffienciency_months
t = 10.988, df = 18, p-value = 2.055e-09
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8351088 0.9735156
sample estimates:
      cor 

0.9328807
Interpretation: There exists a strong and significant correlation between the number of species and food sufficiency months (r = 0.93, p < 0.001).

> corr <- cor.test(x=corelation$number_sps, y=corelation$food_suffienciency_months, method = 'spearman')
> corr
            Spearman's rank correlation rho
data:  corelation$number_sps and corelation$food_suffienciency_months
S = 256.56, p-value = 1.703e-05
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho

0.8071012 

Interpretation: Do it yourself.
