T test and correlation in RStudio
1. Introduction to t test
The t test is a type of inferential statistic used to determine whether there is a significant difference between the means of two groups. It is used as a hypothesis testing tool that allows a researcher to test an assumption about a population. It looks at the t statistic, the t distribution values and the degrees of freedom to determine statistical significance. For comparing the means of three or more groups, however, one must use analysis of variance (ANOVA). The t test compares the average values of two data sets and determines whether they came from the same population.
Calculating a t test requires three key data values: the difference between the mean values of the two data sets (the mean difference), the standard deviation of each group, and the number of data values in each group.
1.1 What are the assumptions for the t test?
Figure: A normal curve.
- Bivariate independent variable (A, B groups)
- Continuous dependent variable
- Each observation of the dependent variable is independent of the other observations of the dependent variable (its probability distribution isn't affected by their values). Exception: for the paired t-test, we only require that the pair-differences (Ai - Bi) be independent of each other (across i). [Note: "independent" and "dependent" are used in two different senses here. Just think of a "dependent variable" as one thing, and "observations that are dependent" as another thing.]
- Dependent variable has a normal distribution, with the same variance, σ², in each group (as though the distribution for group A were merely shifted over to become the distribution for group B, without changing shape).
Note: σ ("sigma"), the scale parameter of the normal distribution, also known as the population standard deviation, is easy to see on a picture of a normal curve. Located one σ to the left or right of the normal mean are the two places where the curve changes from convex to concave (where the second derivative is zero).
1.2 What is the formula of the t test?
The formula for the independent two-sample t test is:

t = (x̄1 - x̄2) / sqrt( s² (1/n1 + 1/n2) )

In this formula, t is the t-value, x̄1 and x̄2 are the means of the two groups being compared, s² is the pooled variance of the two groups, and n1 and n2 are the numbers of observations in the two groups.
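To make the formula concrete, here is a minimal R sketch (with made-up vectors a and b, not data from this post) that computes the pooled-variance t by hand and checks it against R's built-in t.test():

# Hypothetical data, for illustration only
a <- c(12, 15, 14, 10, 13)
b <- c(18, 20, 17, 22, 19)
n1 <- length(a); n2 <- length(b)
# Pooled variance: weighted average of the two group variances
s2 <- ((n1 - 1) * var(a) + (n2 - 1) * var(b)) / (n1 + n2 - 2)
t_value <- (mean(a) - mean(b)) / sqrt(s2 * (1/n1 + 1/n2))
t_value
t.test(a, b, var.equal = TRUE)   # reports the same t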
1.3 What are the types of t test?
There are three types of t test: the one sample t test, the paired sample t test and the independent sample t test.
A. One sample t test:
It is used to test the mean of a single group against a known mean (a gold standard or widely accepted value). E.g. we want to compare the land holding of respondents with the average national land holding of Nepal (0.6 ha). The test variable should be at a continuous level of measurement (interval or ratio).
B. Paired sample t test:
A paired sample t test is used to compare means from the same group at different times (suitable for longitudinal data). E.g. compare the income of farmers before and after an off-season vegetable farming training. Both measurements should be at a continuous level of measurement (interval or ratio).
C. Independent sample t test:
The independent sample t test is used to compare the means of two groups. E.g. compare the income of male and female farmers. The independent variable should be categorical (with two options/groups, e.g. gender) and the dependent variable should be continuous (e.g. income).
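As a preview, the three types map onto three call patterns of R's t.test(); the variable names below (landholding, income_before, income_after, income, gender, mydata) are hypothetical:

# One sample: compare a group mean with a known value (e.g. 0.6 ha)
t.test(landholding, mu = 0.6)
# Paired: the same farmers before and after the training
t.test(income_before, income_after, paired = TRUE)
# Independent: income by gender (a two-level grouping variable)
t.test(income ~ gender, data = mydata)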
1.4 What is t score?
It is the ratio of the difference between two groups to the difference within the groups. The larger the t score, the greater the difference between the groups; the smaller the t score, the more similar the groups are.
1.5 What is p value?
Every t value has a p value associated with it. The p value is the probability that the results from your sample data occurred by chance. P values range from 0% to 100% and are usually written as a decimal; for example, a p value of 5% is 0.05. Low p values are good: they indicate that your data did not occur by chance. For example, a p value of 0.01 means there is only a 1% probability that the results of the experiment happened by chance. We commonly use p value cut-offs of 0.05, 0.01 and 0.001 to test our data.
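For instance, the p value can be recovered in R from a t value and its degrees of freedom (the numbers below are made up for illustration):

# Two-sided p value for a hypothetical t = 0.919 with df = 14
2 * pt(-abs(0.919), df = 14)   # about 0.37, not significant at 0.05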
2. What is correlation analysis?
It is a statistical method used to evaluate the strength of the relationship between two quantitative variables. A high correlation means that two or more variables have a strong relationship with each other, while a weak correlation means that the variables are hardly related. In other words, it is the process of studying the strength of that relationship with the available statistical data.
It is the bivariate analysis (two variables are used) that measures the strength of association between two variables and the direction of the relationship. In terms of the strength of relationship, the value of the correlation coefficient varies between +1 and -1. A value of ± 1 indicates a perfect degree of association between the two variables. As the correlation coefficient value goes towards 0, the relationship between the two variables will be weaker. The direction of the relationship is indicated by the sign of the coefficient; a + sign indicates a positive relationship and a – sign indicates a negative relationship.
2.1 What are the types of correlation analysis?
A. Pearson r correlation:
Pearson r correlation is the most widely used correlation statistic to measure the degree of the relationship between linearly related variables. The following formula is used to calculate the Pearson r correlation:

r = Σ(xi - x̄)(yi - ȳ) / sqrt( Σ(xi - x̄)² Σ(yi - ȳ)² )
Assumptions for Pearson correlation
- Both variables should be normally distributed (normally distributed variables have a bell-shaped curve) and measured on a continuous scale.
- Other assumptions include linearity and homoscedasticity. Linearity assumes a straight-line relationship between the two variables, and homoscedasticity assumes that the data are equally distributed about the regression line.
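A minimal sketch (with toy vectors x and y, not data from this post) that applies the Pearson formula directly and checks it against R's cor():

x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
# Pearson r: covariation of x and y scaled by their spreads
r <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
r            # 0.7745967
cor(x, y)    # the same value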
B. Kendall rank correlation:
Kendall rank correlation is a non-parametric test that measures the strength of dependence between two variables. If we consider two samples, a and b, each of size n, we know that the total number of pairings of a with b is n(n-1)/2. The following formula is used to calculate the Kendall rank correlation:

τ = (Nc - Nd) / (n(n-1)/2)

Nc = number of concordant pairs
Nd = number of discordant pairs
Key terms
Concordant: ordered in the same way.
Discordant: ordered differently.
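This post does not demonstrate Kendall's test in the R session below, so here is a minimal sketch with hypothetical ordinal scores:

# Kendall rank correlation on hypothetical ranked data
a <- c(1, 2, 3, 4, 5, 6)
b <- c(2, 1, 4, 3, 6, 5)
cor.test(a, b, method = "kendall")   # reports tau and its p value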
C. Spearman rank correlation:
Spearman rank correlation is a non-parametric test that is used to measure the degree of association between two variables. The Spearman rank correlation test does not carry any assumptions about the distribution of the data and is the appropriate correlation analysis when the variables are measured on a scale that is at least ordinal. The following formula is used to calculate the Spearman rank correlation:

ρ = 1 - (6 Σ di²) / (n(n² - 1))

ρ = Spearman rank correlation
di = the difference between the ranks of corresponding variables
n = number of observations
The assumptions of the Spearman correlation are that data
must be at least ordinal and the scores on one variable must be monotonically related
to the other variable.
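A minimal sketch (toy vectors with no tied values) that applies the rank-difference formula and checks it against cor():

x <- c(35, 23, 47, 17, 10, 43, 9, 6, 28)
y <- c(30, 33, 45, 23, 8, 49, 12, 4, 31)
d <- rank(x) - rank(y)               # differences between ranks
n <- length(x)
rho <- 1 - (6 * sum(d^2)) / (n * (n^2 - 1))
rho                                  # 0.9
cor(x, y, method = "spearman")       # same value when there are no ties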
2.2 What is correlation coefficient?
The main result of a correlation is called the correlation coefficient (or "r"). It ranges from -1.0 to +1.0. The closer r is to +1 or -1, the more closely the two variables are related. If r is close to 0, there is little or no relationship between the variables. If r is positive, it means that as one variable gets larger the other gets larger. If r is negative, it means that as one gets larger, the other gets smaller (often called an "inverse" correlation).
While correlation coefficients are normally reported as r = (a value between -1 and +1), squaring them makes them easier to understand. The square of the coefficient (r squared) is equal to the percent of the variation in one variable that is related to the variation in the other. After squaring r, ignore the decimal point: an r of .5 means 25% of the variation is related (.5 squared = .25), and an r of .7 means 49% of the variation is related (.7 squared = .49). A correlation report can also show a second result of each test - statistical significance. In this case, the significance level will tell you how likely it is that the correlations reported may be due to chance in the form of random sampling error.
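Both results, the coefficient and its significance, come out of a single cor.test() call; a minimal sketch with simulated data:

set.seed(1)
x <- rnorm(30)
y <- x + rnorm(30)     # y constructed to correlate with x
res <- cor.test(x, y)
res$estimate           # r
res$estimate^2         # r squared: share of variation related
res$p.value            # chance of such an r if no true correlation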
3. How to perform and interpret t test in RStudio?
A. First check the normality
# File used: t.xls https://drive.google.com/file/d/1DEFqFbemqia3FlEPyIPxeGQSIApbs8c_/view?usp=sharing
Note: Output is shown in blue color.
1. Import the data and check normality
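The import step itself is not shown in the output below. If you are not using RStudio's Import Dataset menu, a minimal sketch with the readxl package (assuming t.xls sits in your working directory) is:

library(readxl)
t <- read_excel("t.xls")   # creates the tibble named t used below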
> attach(t)
> head(t)
# A tibble: 6 x 3
  yield_before yield_after INM  
         <dbl>       <dbl> <chr>
1         1680        1732 Yes  
2         1150        1323 No   
3         1780        1920 Yes  
4         2100        2150 Yes  
5          390         453 No   
6         1680        1690 Yes  
> hist(yield_after)
> hist(yield_after, breaks = 12, col = "yellow")
> qqnorm(yield_after); qqline(yield_after)
# Interpretation: What do you think, is it normal or not? Though some values do not fall exactly on the QQ line, they stay within an acceptable range.
> library(car)
Loading required package: carData
Warning messages:
1: package ‘car’ was built under R version 3.6.3
2: package ‘carData’ was built under R version 3.6.3
> qqPlot(t$yield_after)
[1]  5 15
# Interpretation: cases 5 and 15 deviate from the line.
> shapiro.test(yield_after)
Shapiro-Wilk normality test
data:  yield_after
W = 0.88471, p-value = 0.05581
# Interpretation: As the p value > 0.05, we can say the values are normally distributed.
> library(rstatix)
Attaching package: ‘rstatix’
The following object is masked from ‘package:stats’:
    filter
Warning message:
package ‘rstatix’ was built under R version 3.6.3
> t %>% group_by(INM) %>% identify_outliers(yield_after)
[1] INM          yield_before yield_after  is.outlier   is.extreme  
<0 rows> (or 0-length row.names)
# Interpretation: There are no outliers or extreme outliers (the output table is empty).
2. Perform t test
a. One sample t test (suppose the national average yield for this crop in Nepal is 1250 kg/ha)
Null hypothesis: mean yield of farmers = 1250 kg/ha
> t.test(yield_before, mu = 1250)
One Sample t-test
data:  yield_before
t = 0.91896, df = 14, p-value = 0.3737
alternative hypothesis: true mean is not equal to 1250
95 percent confidence interval:
 1062.361 1718.973
sample estimates:
mean of x 
 1390.667 
# Interpretation: The average yield is 1390.67 kg/ha. Though the farmers' average yield is higher than the national average, the difference is not statistically significant (t = 0.92, df = 14, p > 0.05).
b. Paired sample t test:
Hypothesis: yield before intervention = yield after intervention
> res <- t.test(yield_before, yield_after, paired = TRUE, alternative = "two.sided")
> res
Paired t-test
data:  yield_before and yield_after
t = -4.2769, df = 14, p-value = 0.0007671
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -76.67567 -25.45767
sample estimates:
mean of the differences 
              -51.06667 
# Interpretation: The yields before and after the intervention are significantly different. In other words, the yield after the intervention is significantly higher than the yield before it (t = -4.28, df = 14, p < 0.001).
Note: If you want to test whether the average yield before the intervention is less than the average yield after it, type:
> res1 <- t.test(yield_before, yield_after, paired = TRUE, alternative = "less")
> res1
If you want to test whether the average yield before the intervention is greater than the average yield after it, type:
> res1 <- t.test(yield_before, yield_after, paired = TRUE, alternative = "greater")
> res1
How to plot?
> library(PairedData)
> pd <- paired(yield_before, yield_after)
> plot(pd, type = "profile") + theme_bw()
If you want to see the p value, mean and confidence interval:
> res1$p.value
[1] 0.9996165
> res1$estimate
mean of the differences 
              -51.06667 
> res1$conf.int
[1] -72.09691       Inf
attr(,"conf.level")
[1] 0.95
c. Independent sample t test
Note: the variances of the two groups should be equal; we check this assumption below.
i. Check your data
> print(t)
# A tibble: 15 x 3
   yield_before yield_after INM  
          <dbl>       <dbl> <chr>
 1         1680        1732 Yes  
 2         1150        1323 No   
 3         1780        1920 Yes  
 4         2100        2150 Yes  
 5          390         453 No   
 6         1680        1690 Yes  
 7         1050        1100 No   
 8         1660        1700 Yes  
 9         2080        2100 Yes  
10          450         500 No   
11         1560        1600 Yes  
12         1150        1175 No   
13         1600        1623 Yes  
14         2080        2100 Yes  
15          450         460 No   
> library(dplyr)
> group_by(t, INM) %>%
+   summarise(
+     count = n(),
+     mean = mean(yield_before, na.rm = TRUE),
+     sd = sd(yield_before, na.rm = TRUE)
+   )
# A tibble: 2 x 4
  INM   count  mean    sd
  <chr> <int> <dbl> <dbl>
1 No        6  773.  379.
2 Yes       9 1802.  222.
ii. Visualize your data
> library(ggpubr)
> ggboxplot(t, x = "INM", y = "yield_before",
+   color = "INM", palette = c("#00AFBB", "#E7B800"),
+   ylab = "Yield of Okra", xlab = "Use of INM")
iii. Checking assumptions for independent t test
* The samples are independent because the Yes and No groups are unrelated.
* Follow normal distribution (see Shapiro test)
> with(t, shapiro.test(yield_before[INM == "Yes"]))
Shapiro-Wilk normality test
data:  yield_before[INM == "Yes"]
W = 0.81562, p-value = 0.03074
> with(t, shapiro.test(yield_before[INM == "No"]))
Shapiro-Wilk normality test
data:  yield_before[INM == "No"]
W = 0.76055, p-value = 0.02521
Note: As the p values are less than 0.05, the observations are not normally distributed, so we should use the Wilcoxon rank-sum test (the non-parametric equivalent of the independent t test). I will show it in the next blog; for now I am showing the process of the independent sample t test (not recommended in this case).
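For reference, a minimal sketch of that non-parametric alternative on the same data would be:

# Wilcoxon rank-sum test: compares the two INM groups without assuming normality
wilcox.test(yield_before ~ INM, data = t)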
* Check the homogeneity of variance
> res.ftest <- var.test(yield_before ~ INM, data = t)
> res.ftest
F test to compare two variances
data:  yield_before by INM
F = 2.9152, num df = 5, denom df = 8, p-value = 0.173
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
  0.6051586 19.6986154
sample estimates:
ratio of variances 
          2.915216 
# Interpretation: As p > 0.05, there is no significant difference between the variances.
There are various tests for testing homogeneity of variance.
> bartlett.test(yield_before, INM)
Bartlett test of homogeneity of variances
data:  yield_before and INM
Bartlett's K-squared = 1.6861, df = 1, p-value = 0.1941
> fligner.test(yield_before, INM)
Fligner-Killeen test of homogeneity of variances
data:  yield_before and INM
Fligner-Killeen: med chi-squared = 0.72046, df = 1, p-value = 0.396
> leveneTest(yield_before, INM)
Levene's Test for Homogeneity of Variance (center = median)
      Df F value  Pr(>F)  
group  1  5.0217 0.04312 *
      13                  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Note: Bartlett's test is used to test whether k samples come from populations with equal variances; however, it is sensitive to deviations from normality. If you have verified that your data are normally distributed, Bartlett's test is a good first choice, but if your data are non-normal you should always check its results against Levene's and Fligner-Killeen's tests, as there is a good chance of a false positive. In all of these tests, p-values below your chosen threshold (0.1, 0.05 or 0.001, depending on how strict you want to be) suggest the variances are significantly different and the homogeneity of variance assumption has been violated.
Levene's test is slightly more robust to departures from normality than Bartlett's test. It performs a one-way ANOVA on the deviation scores, that is, the absolute difference between each score and the mean of the group it came from. Here its result differs from the Bartlett test above, likely because of the non-normality in our data.
As previously stated, Bartlett's test can give false positives when the data are non-normal. Also, when the sample size is large, small differences in group variances can produce a statistically significant Levene's test (similar to the Shapiro-Wilk test for normality). Fligner-Killeen's test is a non-parametric test that is robust to departures from normality and provides a good alternative or double check for the previous parametric tests.
You can transform the data and check again, or go for a non-parametric test.
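For example, one common transformation to try is the logarithm; a minimal sketch (whether it actually restores normality depends on your data):

# Re-check normality after a log transformation
with(t, shapiro.test(log(yield_before[INM == "Yes"])))
with(t, shapiro.test(log(yield_before[INM == "No"])))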
iv. Compute independent sample t test
> res4 <- t.test(yield_before ~ INM, data = t, var.equal = TRUE)
> res4
Two Sample t-test
data:  yield_before by INM
t = -6.6824, df = 13, p-value = 1.511e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1361.5227  -696.2551
sample estimates:
 mean in group No mean in group Yes 
         773.3333         1802.2222 
Interpretation: The average yield of farmers not practising INM is 773.33 kg/ha, which is significantly lower than the average yield of farmers practising INM (1802.22 kg/ha) (t = -6.68, df = 13, p < 0.001).
Try these commands and see the output:
res4$p.value
res4$conf.int
res4$estimate
4. How to perform correlation?
Use the file: correlation.sav
a. Visualize your data
Import the data
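correlation.sav is an SPSS file; the import command is not shown below, but a minimal sketch with the haven package (assuming the file is in your working directory) is:

library(haven)
corelation <- read_sav("correlation.sav")   # keeps the SPSS value labels (<dbl+lbl>)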
> attach(corelation)
> head(corelation)
# A tibble: 6 x 4
     SN home_garden number_sps food_suffienciency_months
  <dbl> <dbl+lbl>        <dbl>                     <dbl>
1     1 1 [Yes]             45                         8
2     2 1 [Yes]             56                         8
3     3 1 [Yes]             47                         7
4     4 2 [no]              16                         4
5     5 1 [Yes]             23                         5
6     6 2 [no]              27                         5
> library("ggpubr")
> print(corelation)
# A tibble: 20 x 4
      SN home_garden number_sps food_suffienciency_months
   <dbl> <dbl+lbl>        <dbl>                     <dbl>
 1     1 1 [Yes]             45                         8
 2     2 1 [Yes]             56                         8
 3     3 1 [Yes]             47                         7
 4     4 2 [no]              16                         4
 5     5 1 [Yes]             23                         5
 6     6 2 [no]              27                         5
 7     7 2 [no]              12                         3
 8     8 2 [no]              17                         4
 9     9 1 [Yes]             45                         7
10    10 2 [no]              17                         3
11    11 1 [Yes]             55                         8
12    12 2 [no]              16                         3
13    13 2 [no]              18                         2
14    14 2 [no]              19                         2
15    15 1 [Yes]             48                         6
16    16 1 [Yes]             56                         7
17    17 2 [no]              14                         2
18    18 2 [no]              17                         3
19    19 2 [no]              17                         3
20    20 1 [Yes]             44                         7
> library(ggpubr)
> ggscatter(corelation, x = "number_sps", y = "food_suffienciency_months",
+   add = "reg.line", conf.int = TRUE,
+   cor.coef = TRUE, cor.method = "pearson",
+   xlab = "Number of species", ylab = "Food sufficiency in months")
(Note: While typing the commands, ignore the > and + prompts.)
Interpretation: Check the p value and the strength of the correlation coefficient displayed on the plot.
b. Check the assumptions for correlation
* Check the linearity of covariation:
From the plot above, we can say that the relationship is linear. In situations where the scatter plot shows curved patterns, we are dealing with a nonlinear association between the two variables.
* Check the normality:
> shapiro.test(corelation$number_sps)
Shapiro-Wilk normality test
data:  corelation$number_sps
W = 0.81155, p-value = 0.001291
> shapiro.test(corelation$food_suffienciency_months)
Shapiro-Wilk normality test
data:  corelation$food_suffienciency_months
W = 0.87965, p-value = 0.01744
Interpretation: As the p values are less than 0.05, we cannot assume normality. (However, I will proceed further.) You can check outliers also.
Visual inspection of data normality using Q-Q plots (quantile-quantile plots): a Q-Q plot draws the correlation between a given sample and the normal distribution.
> library(ggpubr)
> ggqqplot(corelation$number_sps, ylab = "number of species in home garden")
> ggqqplot(corelation$food_suffienciency_months, ylab = "food sufficiency in months")
Interpretation: From the normality plots, it can be concluded that both samples may come from normal distributions.
OR
> library(car)
> qqPlot(corelation$number_sps)
[1]  2 16
Cases 2 and 16 deviate a bit.
> qqPlot(corelation$food_suffienciency_months)
[1] 1 2
Cases 1 and 2 deviate a bit.
Note: If the data are not normally distributed, it is recommended to use a non-parametric correlation, such as the Spearman and Kendall rank-based correlation tests.
c. Correlation test
> res5 <- cor.test(corelation$number_sps, corelation$food_suffienciency_months, method = "pearson")
> res5
Pearson's product-moment correlation
data:  corelation$number_sps and corelation$food_suffienciency_months
t = 10.988, df = 18, p-value = 2.055e-09
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8351088 0.9735156
sample estimates:
      cor 
0.9328807 
or
> cor(number_sps, food_suffienciency_months, method = c("pearson", "kendall", "spearman"))
[1] 0.9328807
> cor.test(number_sps, food_suffienciency_months, method = c("pearson", "kendall", "spearman"))
Pearson's product-moment correlation
t = 10.988, df = 18, p-value = 2.055e-09
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8351088 0.9735156
sample estimates:
      cor 
0.9328807 
(Note: when a vector of methods is supplied like this, R uses only the first one, here "pearson".)
Interpretation: There exists a strong and significant correlation between the number of species and food sufficiency months (r = 0.93, p < 0.001).
> corr <- cor.test(x = corelation$number_sps, y = corelation$food_suffienciency_months, method = 'spearman')
> corr
Spearman's rank correlation rho
data:  corelation$number_sps and corelation$food_suffienciency_months
S = 256.56, p-value = 1.703e-05
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.8071012 
Interpretation: Do it yourself.