CHI SQUARE
Types of Data:
There are basically two types of random variables, and they yield two types of data: numerical and categorical. A chi square (χ²) statistic is used to investigate whether distributions of categorical variables differ from one another. Categorical variables yield data in categories, and numerical variables yield data in numerical form. Responses to such questions as "What is your major?" or "Do you own a car?" are categorical because they yield data such as "biology" or "no." In contrast, responses to such questions as "How tall are you?" or "What is your G.P.A.?" are numerical. Numerical data can be either discrete or continuous. The table below may help you see the differences between these two types of variables.
Data Type | Question Type | Possible Responses |
Categorical | What is your sex? | male or female |
Numerical | Discrete - How many cars do you own? | two or three |
Numerical | Continuous - How tall are you? | 72 inches |
Notice that discrete data arise from a counting process, while continuous data arise from a measuring process.
The Chi Square statistic compares the tallies or counts of categorical responses between two (or more) independent groups. (Note: Chi square tests can only be used on actual numbers (counts) and not on percentages, proportions, means, etc.)
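As a small illustration of what "tallies or counts" means in practice, the sketch below counts categorical responses with Python's standard library. The responses are invented for this example and do not come from any study.

```python
from collections import Counter

# Hypothetical survey responses (invented for illustration only):
# each tuple is (major, answer to "Do you own a car?").
responses = [
    ("biology", "no"), ("biology", "yes"), ("chemistry", "yes"),
    ("biology", "yes"), ("chemistry", "no"), ("chemistry", "yes"),
]

# Tally each (major, owns_car) combination. These raw counts, not
# percentages or proportions, are what a chi-square test operates on.
counts = Counter(responses)

for (major, owns_car), n in sorted(counts.items()):
    print(f"{major:10s} owns car: {owns_car:3s} count = {n}")
```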
2 x 2 Contingency Table
There are several types of chi square tests depending on the way the data were collected and the hypothesis being tested. We'll begin with the simplest case: a 2 x 2 contingency table. If we set the 2 x 2 table to the general notation shown below in Table 1, using the letters a, b, c, and d to denote the contents of the cells, then we would have the following table:
Table 1. General notation for a 2 x 2 contingency table.
Variable 1 / Variable 2 | Data type 1 | Data type 2 | Totals |
Category 1 | a | b | a + b |
Category 2 | c | d | c + d |
Total | a + c | b + d | a + b + c + d = N |
For a 2 x 2 contingency table the Chi Square statistic is calculated by the formula:
$$\chi^2 = \frac{N(ad - bc)^2}{(a + b)(c + d)(a + c)(b + d)}$$
Note: the four components of the denominator are the four totals from the table columns and rows.
Suppose you conducted a drug trial on a group of animals and
you hypothesized that the animals receiving the drug would show
increased heart rates compared to those that did not receive the
drug. You conduct the study and collect the following data:
Ho: The proportion of animals whose heart rate increased is independent of drug treatment.
Ha: The proportion of animals whose heart rate increased is associated with drug treatment.
Table 2. Hypothetical drug trial results.
 | Heart Rate Increased | No Heart Rate Increase | Total |
Treated | 36 | 14 | 50 |
Not treated | 30 | 25 | 55 |
Total | 66 | 39 | 105 |
Applying the formula above we get:
$$\chi^2 = \frac{105\,[(36)(25) - (14)(30)]^2}{(50)(55)(39)(66)} = 3.418$$
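The same arithmetic can be checked in a few lines of Python. This is a minimal sketch, not a library routine: the function name `chi_square_2x2` is made up for this example, and it implements N(ad - bc)² / [(a + b)(c + d)(a + c)(b + d)] with the cell letters of Table 1.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for a 2 x 2 table of counts.

    Cells follow the layout of Table 1:
        a  b
        c  d
    """
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    # The denominator is the product of the two row totals and two column totals.
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("every row and column total must be non-zero")
    return numerator / denominator


# Counts from Table 2 (treated: 36 / 14, not treated: 30 / 25).
chi2 = chi_square_2x2(36, 14, 30, 25)
print(round(chi2, 3))  # 3.418
```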
Before we can proceed we need to know how many degrees of freedom we have. When a comparison is made between one sample and another, a simple rule is that the degrees of freedom equal (number of columns minus one) x (number of rows minus one), not counting the totals for rows or columns. For our data this gives (2-1) x (2-1) = 1.
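The degrees-of-freedom rule is simple enough to write as a one-line helper. The function name below is hypothetical, and the 3 x 4 table in the second call is an invented example.

```python
def degrees_of_freedom(n_rows, n_cols):
    """(rows - 1) x (columns - 1), not counting the totals row or column."""
    if n_rows < 2 or n_cols < 2:
        raise ValueError("a contingency table needs at least 2 rows and 2 columns")
    return (n_rows - 1) * (n_cols - 1)


print(degrees_of_freedom(2, 2))  # the 2 x 2 drug trial table: 1
print(degrees_of_freedom(3, 4))  # a hypothetical 3 x 4 table: 6
```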
We now have our chi square statistic (χ² = 3.418), our predetermined alpha level of significance (0.05), and our degrees of freedom (df = 1). Entering the chi square distribution table with 1 degree of freedom and reading along the row, we find that our value of χ² (3.418) lies between 2.706 and 3.841. The corresponding probability is between the 0.10 and 0.05 probability levels. That means that the p-value is above 0.05 (it is actually 0.065). Since a p-value of 0.065 is greater than the conventionally accepted significance level of 0.05 (i.e. p > 0.05), we fail to reject the null hypothesis. In other words, there is no statistically significant difference in the proportion of animals whose heart rate increased.
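Instead of bracketing the p-value between two table columns, it can be computed directly when there is exactly 1 degree of freedom. The sketch below relies on a standard identity that is not stated in the text: with df = 1 the statistic is the square of a standard normal variable, so the upper-tail probability equals erfc(sqrt(χ²/2)). The function name is made up for this example, and the shortcut does not apply to other degrees of freedom.

```python
import math


def p_value_df1(chi2):
    """Upper-tail probability of a chi-square statistic with 1 degree of freedom.

    Valid for df = 1 only: P(X >= chi2) = erfc(sqrt(chi2 / 2)).
    """
    if chi2 < 0:
        raise ValueError("chi-square statistics are never negative")
    return math.erfc(math.sqrt(chi2 / 2.0))


p = p_value_df1(3.418)
print(f"p = {p:.3f}")
print("reject H0" if p < 0.05 else "fail to reject H0")
```

The result should fall between 0.05 and 0.10, close to the value quoted above, so the decision is to fail to reject the null hypothesis.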
What would happen if the number of control animals whose heart rate increased dropped to 29 instead of 30 and, consequently, the number of controls whose heart rate did not increase changed from 25 to 26? Try it. Notice that the new χ² value is 4.125 and this value exceeds the table value of 3.841 (at 1 degree of freedom and an alpha level of 0.05). This means that p < 0.05 (it is now 0.04) and we reject the null hypothesis in favor of the alternative hypothesis: the proportion of animals whose heart rate increased differs between the treatment groups. When p < 0.05 we generally refer to this as a significant difference.
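To "try it" without hand arithmetic, the following sketch runs both versions of the table side by side. It assumes the 2 x 2 formula N(ad - bc)² / [(a + b)(c + d)(a + c)(b + d)] and, for the p-value, the df = 1 identity p = erfc(sqrt(χ²/2)), which is a standard result rather than something taken from the text. The names are illustrative.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Chi-square for a 2 x 2 table with cells a, b (row 1) and c, d (row 2)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


CRITICAL_05_DF1 = 3.841  # table value at df = 1, alpha = 0.05

tables = {
    "original (30 / 25 controls)": (36, 14, 30, 25),
    "modified (29 / 26 controls)": (36, 14, 29, 26),
}

results = {}
for label, cells in tables.items():
    chi2 = chi_square_2x2(*cells)
    p = math.erfc(math.sqrt(chi2 / 2.0))  # valid for df = 1 only
    results[label] = (chi2, p)
    verdict = "significant" if chi2 > CRITICAL_05_DF1 else "not significant"
    print(f"{label}: chi2 = {chi2:.3f}, p = {p:.3f}, {verdict}")
```

Moving a single animal from one cell to another is enough to cross the 0.05 threshold, which shows how sensitive the conclusion is when the statistic sits near the critical value.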
Chi-square is a statistical test commonly used to compare observed data with data we would expect to obtain according to a specific hypothesis. For example, if, according to Mendel's laws, you expected 10 of 20 offspring from a cross to be male and the actual observed number was 8 males, then you might want to know about the "goodness of fit" between the observed and expected. Were the deviations (differences between observed and expected) the result of chance, or were they due to other factors? How much deviation can occur before you, the investigator, must conclude that something other than chance is at work, causing the observed to differ from the expected? The chi-square test is always testing what scientists call the null hypothesis, which states that there is no significant difference between the expected and observed result.
The formula for calculating chi-square (χ²) is:
$$\chi^2 = \sum \frac{(o - e)^2}{e}$$
That is, chi-square is the sum, over all possible categories, of the squared difference between the observed (o) and the expected (e) data (or the deviation, d), divided by the expected data.
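A minimal sketch of this sum in Python is shown below, applied to the offspring example mentioned earlier (8 males observed where 10 of 20 were expected). The function name is hypothetical, and the female counts (12 observed, 10 expected) are inferred from the total of 20 rather than stated in the text.

```python
def chi_square(observed, expected):
    """Sum of (o - e)^2 / e over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    if any(e <= 0 for e in expected):
        raise ValueError("expected counts must be positive")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Categories: [male, female]. Female counts are inferred from the total of 20.
observed = [8, 12]
expected = [10, 10]
chi2 = chi_square(observed, expected)
print(f"chi-square = {chi2:.2f}")
```

Each category contributes (2)²/10 = 0.4, so the statistic for this small example is 0.8.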
For example, suppose that a cross between two pea plants yields a population of 880 plants, 639 with green seeds and 241 with yellow seeds. You are asked to propose the genotypes of the parents. Your hypothesis is that the allele for green is dominant to the allele for yellow and that the parent plants were both heterozygous for this trait. If your hypothesis is true, then the predicted ratio of offspring from this cross would be 3:1 (based on Mendel's laws) as predicted from the results of the Punnett square (Figure B.1).
Figure B.1 - Punnett Square. Predicted offspring from cross between green and yellow-seeded plants. Green (G) is dominant (3/4 green; 1/4 yellow).
To calculate χ², first determine the number expected in each category. If the ratio is 3:1 and the total number of observed individuals is 880, then the expected numerical values should be 660 green and 220 yellow.
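Converting a predicted ratio into expected counts is a matter of splitting the observed total proportionally. The helper below is a sketch with a made-up name; it reproduces the 660 and 220 quoted above.

```python
def expected_counts(ratio, total):
    """Split `total` individuals according to a ratio such as (3, 1)."""
    parts = sum(ratio)
    if parts <= 0:
        raise ValueError("ratio must have a positive sum")
    return [total * r / parts for r in ratio]


# 3:1 green-to-yellow ratio applied to the 880 observed plants.
expected = expected_counts((3, 1), 880)
print(expected)  # [660.0, 220.0]
```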
Chi-square requires that you use numerical values, not percentages or
ratios.
Then calculate χ² using this formula, as shown in Table B.1. Note that we get a value of 2.668 for χ². But what does this number mean? Here's how to interpret the χ² value:
1. Determine degrees of freedom (df). Degrees of freedom can be calculated as the number of categories in the problem minus 1. In our example, there are two categories (green and yellow); therefore, there is 1 degree of freedom.
2. Determine a relative standard to serve as the basis for accepting or rejecting the hypothesis. The relative standard commonly used in biological research is p > 0.05. The p value is the probability of obtaining a deviation of the observed from the expected at least as large as the one you found when chance alone is acting (no other forces acting). In this case, using p > 0.05, you would reject the hypothesis only when a deviation that large would arise by chance alone 5% of the time or less.
3. Refer to a chi-square distribution table (Table B.2). Using the appropriate degrees of freedom, locate the value closest to your calculated chi-square in the table. Determine the closest p (probability) value associated with your chi-square and degrees of freedom. In this case (χ² = 2.668), the p value is about 0.10, which means that a deviation at least this large would be expected about 10% of the time from chance alone. Based on our standard p > 0.05, this is within the range of acceptable deviation. In terms of your hypothesis for this example, the observed results are not significantly different from expected. The observed numbers are consistent with those expected under Mendel's law.
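The table lookup in step 3 can be sketched as a nearest-value search. The two lists below copy the probability headings and the df = 1 row of Table B.2; the function name is hypothetical.

```python
# Column headings and the df = 1 row of Table B.2.
P_LEVELS = [0.95, 0.90, 0.80, 0.70, 0.50, 0.30, 0.20, 0.10, 0.05, 0.01, 0.001]
DF1_ROW = [0.004, 0.02, 0.06, 0.15, 0.46, 1.07, 1.64, 2.71, 3.84, 6.64, 10.83]


def closest_p(chi2, row=DF1_ROW, p_levels=P_LEVELS):
    """Return (p, table value) for the table entry closest to chi2."""
    idx = min(range(len(row)), key=lambda i: abs(row[i] - chi2))
    return p_levels[idx], row[idx]


p, table_value = closest_p(2.668)
print(f"closest table value = {table_value}, p is about {p}")
print("not significant" if p > 0.05 else "significant")
```

For 2.668 the nearest entry is 2.71 in the 0.10 column, which matches the reading described in step 3.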
Step-by-Step Procedure for Testing Your Hypothesis and Calculating Chi-Square
1. State the hypothesis being tested and the predicted results. Gather the data by
conducting the proper experiment (or, if working genetics problems, use the data provided
in the problem).
2. Determine the expected numbers for each observational class. Remember to use
numbers, not percentages.
Chi-square should not be calculated if the expected value in any
category is less than 5.
3. Calculate χ² using the formula. Complete all calculations to three significant digits. Round off your answer to two significant digits.
4. Use the chi-square distribution table to determine significance of the value.
- Determine degrees of freedom and locate the value in the appropriate column.
- Locate the value closest to your calculated χ² on that degrees of freedom (df) row.
- Move up the column to determine the p value.
5. State your conclusion in terms of your hypothesis.
- If the p value for the calculated χ² is p > 0.05, accept (fail to reject) your hypothesis. The deviation is small enough that chance alone can account for it. A p value of 0.6, for example, means that a deviation at least this large would be expected 60% of the time from chance alone. This is within the range of acceptable deviation.
- If the p value for the calculated χ² is p < 0.05, reject your hypothesis, and conclude that some factor other than chance is operating for the deviation to be so great. For example, a p value of 0.01 means that a deviation this large would arise from chance alone only 1% of the time. Therefore, other factors are likely involved.
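The five steps above can be sketched as one function. This is an illustrative outline under stated assumptions, not a replacement for a statistics package: the function name and return format are made up, the critical values are the 0.05 column of Table B.2 (so only 1 to 10 degrees of freedom are covered), and the hypothesis is supplied as a predicted ratio.

```python
# 0.05 column of Table B.2, keyed by degrees of freedom.
CRITICAL_05 = {1: 3.84, 2: 5.99, 3: 7.82, 4: 9.49, 5: 11.07,
               6: 12.59, 7: 14.07, 8: 15.51, 9: 16.92, 10: 18.31}


def goodness_of_fit(observed, ratio):
    """Run steps 2 to 5 for observed counts and a predicted ratio."""
    if len(observed) != len(ratio) or len(observed) < 2:
        raise ValueError("need one ratio term per category and at least 2 categories")

    # Step 2: expected numbers (counts, not percentages).
    total = sum(observed)
    expected = [total * r / sum(ratio) for r in ratio]
    if min(expected) < 5:
        raise ValueError("an expected count is below 5; chi-square should not be used")

    # Step 3: chi-square = sum of (o - e)^2 / e.
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

    # Step 4: degrees of freedom and the 0.05 critical value.
    df = len(observed) - 1
    if df not in CRITICAL_05:
        raise ValueError("this sketch only covers 1 to 10 degrees of freedom")

    # Step 5: reject the hypothesis when chi2 reaches the critical value.
    reject = chi2 >= CRITICAL_05[df]
    return chi2, df, reject


chi2, df, reject = goodness_of_fit([639, 241], (3, 1))
print(f"chi-square = {chi2:.2f}, df = {df}, reject hypothesis: {reject}")
```

For the pea example the statistic is below 3.84, so the 3:1 hypothesis is not rejected.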
The chi-square test will be used to test for the "goodness of fit" between observed and expected data from several laboratory investigations in this lab manual.
Table B.1
Calculating Chi-Square
 | Green | Yellow |
Observed (o) | 639 | 241 |
Expected (e) | 660 | 220 |
Deviation (o - e) | -21 | 21 |
Deviation² (d²) | 441 | 441 |
d²/e | 0.668 | 2 |
χ² = Σ d²/e = 2.668 |
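The rows of Table B.1 can be reproduced with a short loop. Note that the unrounded sum comes out near 2.673, slightly above the 2.668 in the table, because the table adds the rounded terms 0.668 and 2. The difference does not change the conclusion.

```python
observed = {"Green": 639, "Yellow": 241}
expected = {"Green": 660, "Yellow": 220}

chi2 = 0.0
for category in observed:
    d = observed[category] - expected[category]   # deviation (o - e)
    term = d ** 2 / expected[category]            # d^2 / e
    chi2 += term
    print(f"{category:6s} d = {d:4d}  d^2 = {d ** 2:4d}  d^2/e = {term:.3f}")

print(f"chi-square = {chi2:.3f}")
```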
Table B.2
Chi-Square Distribution
Degrees of Freedom (df) | Probability (p) |
 | 0.95 | 0.90 | 0.80 | 0.70 | 0.50 | 0.30 | 0.20 | 0.10 | 0.05 | 0.01 | 0.001 |
1 | 0.004 | 0.02 | 0.06 | 0.15 | 0.46 | 1.07 | 1.64 | 2.71 | 3.84 | 6.64 | 10.83 |
2 | 0.10 | 0.21 | 0.45 | 0.71 | 1.39 | 2.41 | 3.22 | 4.60 | 5.99 | 9.21 | 13.82 |
3 | 0.35 | 0.58 | 1.01 | 1.42 | 2.37 | 3.66 | 4.64 | 6.25 | 7.82 | 11.34 | 16.27 |
4 | 0.71 | 1.06 | 1.65 | 2.20 | 3.36 | 4.88 | 5.99 | 7.78 | 9.49 | 13.28 | 18.47 |
5 | 1.14 | 1.61 | 2.34 | 3.00 | 4.35 | 6.06 | 7.29 | 9.24 | 11.07 | 15.09 | 20.52 |
6 | 1.63 | 2.20 | 3.07 | 3.83 | 5.35 | 7.23 | 8.56 | 10.64 | 12.59 | 16.81 | 22.46 |
7 | 2.17 | 2.83 | 3.82 | 4.67 | 6.35 | 8.38 | 9.80 | 12.02 | 14.07 | 18.48 | 24.32 |
8 | 2.73 | 3.49 | 4.59 | 5.53 | 7.34 | 9.52 | 11.03 | 13.36 | 15.51 | 20.09 | 26.12 |
9 | 3.32 | 4.17 | 5.38 | 6.39 | 8.34 | 10.66 | 12.24 | 14.68 | 16.92 | 21.67 | 27.88 |
10 | 3.94 | 4.86 | 6.18 | 7.27 | 9.34 | 11.78 | 13.44 | 15.99 | 18.31 | 23.21 | 29.59 |
 | Nonsignificant | Significant |
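One row of the table can be sanity-checked without a statistics library. For 2 degrees of freedom the chi-square upper-tail probability has the closed form p = exp(-χ²/2), so the critical value for a given p is -2 ln(p). This identity is a standard result and is not stated in the text; it applies to df = 2 only. The sketch below compares it with the df = 2 row.

```python
import math

# Column headings and the df = 2 row of Table B.2.
P_LEVELS = [0.95, 0.90, 0.80, 0.70, 0.50, 0.30, 0.20, 0.10, 0.05, 0.01, 0.001]
DF2_ROW = [0.10, 0.21, 0.45, 0.71, 1.39, 2.41, 3.22, 4.60, 5.99, 9.21, 13.82]


def critical_value_df2(p):
    """Chi-square value whose upper-tail probability is p, for df = 2 only."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    return -2.0 * math.log(p)


max_diff = 0.0
for p, tabled in zip(P_LEVELS, DF2_ROW):
    exact = critical_value_df2(p)
    max_diff = max(max_diff, abs(exact - tabled))
    print(f"p = {p:<5}  table = {tabled:5.2f}  computed = {exact:6.3f}")

print(f"largest difference = {max_diff:.3f}")
```

Every computed value agrees with the table to within rounding at two decimal places.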