Sunday, June 14, 2020

Chi square test in R Studio

1. What is the Chi-square test?

The chi-square (χ2) statistic measures how the expected frequencies compare with the actual observed data. The data used in the test must be random, raw, mutually exclusive, drawn from independent variables, and drawn from a large enough sample.

The Chi-square test is used to test for a relationship between categorical variables. Applied to a cross tabulation of two variables, it serves as a test of independence. The null hypothesis states that there is no relationship between the selected categorical variables. An example of a research question that could be answered using this test is given below:

Is there a significant relationship between gender and the natural resource management activities people carry out?
The formula for calculating chi square is given below:

$$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$$

where $f_o$ = the observed frequency (the observed counts in the cells)
and $f_e$ = the expected frequency if NO relationship existed between the variables
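As a rough illustration, the statistic can be computed by hand in R: sum (observed - expected)^2 / expected over all cells, where each expected count is row total * column total / grand total. The table below is hypothetical (it is not the data used later in this post), and the names observed, expected, chi_sq and builtin are only for this sketch.

```r
# Hypothetical 2 x 3 table of counts (illustration only, not the post's data)
observed <- matrix(c(12, 18, 15, 15, 9, 21), nrow = 2,
                   dimnames = list(disease = c("yes", "no"),
                                   breed = c("A", "B", "C")))

# Expected count under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-square statistic: sum over all cells of (fo - fe)^2 / fe
chi_sq <- sum((observed - expected)^2 / expected)   # 2.5 for this table

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

# Cross-check against R's built-in test
builtin <- chisq.test(observed, correct = FALSE)
```

The hand-computed chi_sq and p_value should agree with builtin$statistic and builtin$p.value.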
In contrast to the Chi-square test, Fisher's exact test is typically used when the sample size is small, although it is valid for all sample sizes. It tests the association between two categorical variables when the cell sizes are small (expected count less than 5).
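One possible way to apply this rule of thumb in R is to inspect the expected counts first and then choose the test. This is only a sketch on a hypothetical table; the names observed, small_cells and result are made up for the example.

```r
# Hypothetical small-sample table (illustration only)
observed <- matrix(c(3, 7, 2, 6, 1, 6), nrow = 2,
                   dimnames = list(disease = c("yes", "no"),
                                   breed = c("A", "B", "C")))

# Expected counts under independence; suppressWarnings() hides the
# "approximation may be incorrect" warning while we inspect them
expected <- suppressWarnings(chisq.test(observed))$expected

# Rule of thumb: is any expected count below 5?
small_cells <- any(expected < 5)

# Use Fisher's exact test for small cells, otherwise the Chi-square test
result <- if (small_cells) {
  fisher.test(observed)
} else {
  chisq.test(observed, correct = FALSE)
}
result$method
```

For this table some expected counts are below 5, so the sketch falls through to fisher.test().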
2. How to do these tests in R? 
#File name is Chisq_oneway_2sav
#Null hypothesis: There is no significant relation between disease occurrence and breed of cattle. 

Import and print data
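# Import the file, for example with RStudio's Import Dataset menu or in code as here
library(haven)   # assumption: the data are an SPSS .sav file, as the <dbl+lbl> columns suggest
Chisq_oneway_2sav <- read_sav("Chisq_oneway_2.sav")   # assumed path; point this at your own file
attach(Chisq_oneway_2sav)   # makes the columns available by name
library(MASS)
print(Chisq_oneway_2sav)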
#Output is obtained as below: 
# A tibble: 25 x 5
   sample         vaccinated disease_occurance           Breed Dressing_prcnt
    <dbl>          <dbl+lbl>         <dbl+lbl>       <dbl+lbl>          <dbl>
 1      1 1 [vaccinated]               2 [no]  1 [ghatikhwile]             60
 2      2 1 [vaccinated]               2 [no]  1 [ghatikhwile]             54
 3      3 2 [non-vaccinated]           1 [yes] 1 [ghatikhwile]             67
 4      4 2 [non-vaccinated]           1 [yes] 1 [ghatikhwile]             76
 5      5 1 [vaccinated]               2 [no]  2 [sakini]                  67
 6      6 2 [non-vaccinated]           2 [no]  1 [ghatikhwile]             87
 7      7 1 [vaccinated]               1 [yes] 1 [ghatikhwile]             87
 8      8 2 [non-vaccinated]           2 [no]  1 [ghatikhwile]             77
 9      9 2 [non-vaccinated]           2 [no]  3 [broiler]                 87
10     10 1 [vaccinated]               2 [no]  3 [broiler]                 75
# ... with 15 more rows
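The variables are stored as numeric codes with value labels (for example 1 = yes, 2 = no). If you prefer tables that show the labels instead of the codes, one option is to convert the codes to factors. The sketch below uses only the first ten rows printed above, typed in by hand, so its counts are not those of the full data set; the names disease, breed and tab are hypothetical.

```r
# First ten rows as printed above (codes only)
disease_code <- c(2, 2, 1, 1, 2, 2, 1, 2, 2, 2)
breed_code   <- c(1, 1, 1, 1, 2, 1, 1, 1, 3, 3)

# Attach the value labels shown in the printed output
disease <- factor(disease_code, levels = c(1, 2), labels = c("yes", "no"))
breed   <- factor(breed_code, levels = 1:3,
                  labels = c("ghatikhwile", "sakini", "broiler"))

# Cross tabulation with readable row and column names
tab <- table(disease, breed)
tab
```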

Perform Chi square test
chisq.test(disease_occurance,Breed,correct = FALSE)
#Output is obtained as:
Pearson's Chi-squared test
data:  disease_occurance and Breed
X-squared = 0.54113, df = 2, p-value = 0.763
#Interpretation: There is no significant association between disease occurrence and breed. 
#See the warning message. 
Warning message:
In chisq.test(disease_occurance, Breed, correct = FALSE) :
  Chi-squared approximation may be incorrect
Or 
chisq <- chisq.test(disease_occurance,Breed)
Warning message:
In chisq.test(disease_occurance, Breed) :
  Chi-squared approximation may be incorrect
chisq
#Output is seen as : 
Pearson's Chi-squared test
data:  disease_occurance and Breed
X-squared = 0.54113, df = 2, p-value = 0.763
# If you see this warning message (shown in red in the console), some expected counts are small, so use Fisher's exact test
fisher.test(disease_occurance,Breed)
#See output
Fisher's Exact Test for Count Data
data:  disease_occurance and Breed
p-value = 0.8763
alternative hypothesis: two.sided
#Interpretation: No Significant association between the variables
table(disease_occurance,Breed)
#See output
                              


To see the proportions, run the following command:
prop.table(table(disease_occurance,Breed))
#See output
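By default prop.table() divides every cell by the grand total. Its margin argument gives row-wise or column-wise proportions instead, which are often easier to read. A small sketch on a hypothetical table (the names tab, overall, by_row and by_col are only for illustration):

```r
# Hypothetical contingency table (illustration only)
tab <- as.table(matrix(c(3, 7, 2, 6, 1, 6), nrow = 2,
                       dimnames = list(disease = c("yes", "no"),
                                       breed = c("A", "B", "C"))))

overall <- prop.table(tab)              # each cell / grand total
by_row  <- prop.table(tab, margin = 1)  # each cell / its row total
by_col  <- prop.table(tab, margin = 2)  # each cell / its column total

# Share of "yes" and "no" within each breed, rounded to 2 decimals
round(by_col, 2)
```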
Balloon plots: A balloon plot is a graphical matrix in which each cell contains a dot whose size reflects the relative magnitude of the corresponding component.
library(ggpubr)
ggballoonplot(Chisq_oneway_2sav)
#Output is obtained as :
#Or
library(ggpubr)
library(ggplot2)
theme_set(theme_pubr())
ggballoonplot(Chisq_oneway_2sav,fill="value")+scale_fill_viridis_c(option = "C")
#Output is obtained as: 
or 
library(gplots)
dt <- table(disease_occurance, Breed)   # contingency table of the two variables
balloonplot(t(dt), main ="disease_occurance", xlab ="", ylab="",
            label = FALSE, show.margins = FALSE)
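Method to perform corrplot: corrplot() draws a matrix as a grid of colored symbols. It is mostly used for correlation matrices, but with is.cor = FALSE it can display any numeric matrix, such as the chi-square residuals.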
library(corrplot)
chisq$observed
# To see the expected counts and the residuals rounded to 2 and 3 decimal places respectively, run the following commands.
round(chisq$expected, 2)
round(chisq$residuals, 3)
corrplot(chisq$residuals, is.cor = FALSE)
#Interpretation: Columns represent breed and rows represent disease occurrence. Blue indicates a positive residual (a positive association between that row and column), whereas red indicates a negative one. The values are shown in the figure, and more intense color indicates a stronger association.
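The residuals plotted here are Pearson residuals, (observed - expected) / sqrt(expected), and their squares add up to the chi-square statistic. A minimal check on a hypothetical table (the names observed, res and manual_resid are made up for this sketch):

```r
# Hypothetical 2 x 3 table of counts (illustration only)
observed <- matrix(c(12, 18, 15, 15, 9, 21), nrow = 2,
                   dimnames = list(disease = c("yes", "no"),
                                   breed = c("A", "B", "C")))

res <- chisq.test(observed, correct = FALSE)

# Pearson residual for each cell, computed by hand
manual_resid <- (res$observed - res$expected) / sqrt(res$expected)

# Positive values: more cases than expected; negative values: fewer
round(manual_resid, 3)

# The squared residuals sum to the chi-square statistic
sum(manual_resid^2)
```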
contrib <- 100*chisq$residuals^2/chisq$statistic
round(contrib, 3)
#Output is shown as: 
#Note: The contribution of a cell (in %) is its squared residual divided by the total chi-square statistic, so cells with larger contributions account for more of the departure from independence. Contributions are the main aid to interpretation (see Le Roux and Rouanet, 2004 and 2010).
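A quick sanity check on the contributions is that they add up to 100. The sketch below uses a hypothetical table rather than the post's data; res and contrib are illustrative names.

```r
# Hypothetical 2 x 3 table of counts (illustration only)
observed <- matrix(c(12, 18, 15, 15, 9, 21), nrow = 2,
                   dimnames = list(disease = c("yes", "no"),
                                   breed = c("A", "B", "C")))

res <- chisq.test(observed, correct = FALSE)

# Percentage contribution of each cell to the chi-square statistic
contrib <- 100 * res$residuals^2 / res$statistic
round(contrib, 3)

# The contributions over all cells total 100 percent
sum(contrib)
```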
corrplot(contrib, is.cor = FALSE)
#Output is seen as
chisq$p.value
#Output: [1] 0.76295 (Not Significant) 
chisq$estimate
#Output: NULL (the object returned by chisq.test() has no estimate component)
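To see which components the result of chisq.test() does contain, you can list its names. A short sketch on a hypothetical table (res and components are illustrative names):

```r
# Hypothetical 2 x 3 table of counts (illustration only)
res <- chisq.test(matrix(c(12, 18, 15, 15, 9, 21), nrow = 2))

# Names of the components stored in the test result
components <- names(res)
components

# "estimate" is not among them, which is why res$estimate is NULL
is.null(res$estimate)
```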