Reliable scientific conclusions are based on verifiable empirical evidence. But data must be transformed and interpreted before they become evidence, and statistical inference plays an important role in the process of interpretation. Biologists use statistics to organize and analyze data so that they can make inferences and use the data as evidence. Students should have opportunities to collect and analyze data in their biology classes as well. In this activity, students collect data on the surface areas of sun leaves and shade leaves, then analyze the data using the independent-samples t-test. The t-test procedure can be used in investigations where two groups are compared on one dependent variable.

Introduction

Biologists use statistics to organize and analyze data so that they can make inferences and use the data as evidence. Students should have opportunities to collect and analyze data in their biology classes as well. Data analysis is promoted in all of the recent reform documents (AAAS, 2011; National Research Council, 2012; NGSS Lead States, 2013; College Board, 2019). For example, AP Science Practice 5 states that students should be able to “perform statistical tests and mathematical calculations to analyze and interpret data” (College Board, 2019, p. 15). One of the simplest ways to engage students with this practice is to have them compare two independent groups – for example, a treatment and a control group. This calls for use of the independent-samples t-test. Cooper (2019) explains the rationale for making decisions with data using hypothesis-testing procedures, then guides readers through a student activity using the independent-samples t-test with a previously collected data set on finch beak size that is available from HHMI BioInteractive.

Here, I present an activity in which students collect their own data on the surface areas of sun leaves and shade leaves from American sweetgum trees (Liquidambar styraciflua), then analyze the data using the independent-samples t-test. The activity also introduces students to an elegant method for determining the surface area of leaves, a method that is also useful when measuring transpiration rates.

Purpose

The goal of this activity is to give students experience in data collection and analysis using the independent-samples t-test. To identify possible sources of data that would let students experience data collection and analysis from beginning to end, I searched the archives of The American Biology Teacher. The search turned up a paper by Zales and Colosi (1998), who had their students collect leaves from the north (shade leaves) and south (sun leaves) sides of Forsythia shrubs and measure the leaf length from the tip of the blade to the end of the petiole. Their students used the data they collected to illustrate the meaning of “not statistically significantly different.” In the activity presented here, rather than measuring leaf length, students determine the surface areas of samples of sun leaves and shade leaves and compare their mean values.

This activity can be characterized as a structured inquiry, “where students investigate a teacher-presented question through a prescribed procedure” (Bell et al., 2005, p. 33). The question motivating the investigation and the methods of analysis are provided by the instructor, but the answer to the question is unknown to both the instructor and the student at the outset. Once completed, this activity raises additional questions that students can investigate in less structured inquiry activities. Possible extensions of this activity are discussed later in this article.

The Activity

The question students were charged with investigating was “Is there a difference in the mean surface areas of sun leaves and shade leaves?” Materials required for the activity include an extension pruner or ladder (these are optional, but helpful, since sun leaves are best collected from higher up on the tree), pencils, construction paper, scissors, a metric balance, a spreadsheet to record data and assist with analysis, and a source of leaves.

An American sweetgum tree standing in an open area on the school campus provided the source of leaves for this activity. The school is in the Northern Hemisphere at latitude 40.1736°N, so the sun is in the southern sky. The side of the tree facing south receives the most sunlight during the course of each day, especially those leaves higher up on the tree. These leaves are referred to as “sun leaves.” By contrast, leaves on the north side of the tree receive less sunlight, especially those leaves lower down on the tree. These leaves are referred to as “shade leaves.” A number of hypotheses have been proposed to explain the existence of sun leaves and shade leaves with their different morphologies (Schlichting, 1986; Nicotra et al., 2011; Poorter, 2019). The hypotheses are not mutually exclusive, and the morphology of angiosperm leaves seems to be influenced by a number of factors.

In order to get a reasonably large sample size, each of 20 students in an AP Biology class collected two sun leaves and two shade leaves from one sweetgum tree standing in an open area on the school grounds, thus providing sample sizes of n = 40 for each group of leaves. Sun leaves were collected by having one student stand on a six-foot step ladder and remove two small branches from the south side of the tree. Each student then took two sun leaves from these branches. Students collected shade leaves from the north side of the tree that were within reach while they were standing on the ground. Ideally, the sample should have included leaves from various heights and from more than one tree. The sampling method, though not ideal, was chosen for considerations of time, feasibility, and student safety. A discussion of ways to improve the sampling method should be part of a follow-up class discussion at the conclusion of the activity.

Determining the Surface Area of the Leaves

To determine the surface area of the leaves, students traced each of their four leaves on construction paper, then cut out the paper leaves and massed each one individually. The masses of the paper leaves were recorded in a class spreadsheet. If students very carefully and accurately trace the leaves, then the real leaves and their paper counterparts should have very nearly the same surface area. The masses of the paper leaves can easily be used to determine the surface area of each leaf (both real and paper) by finding a linear equation relating the mass of the construction paper to its surface area.

To generate the linear equation, five construction-paper squares of different sizes were measured and cut out by the instructor prior to the start of class. The squares were 16 cm², 36 cm², 64 cm², 100 cm², and 144 cm² (see Figure 1). After students traced and cut out their leaves, the squares with known surface areas were massed in front of the class for all to see. One student was asked to come to the front of the classroom and use a metric balance to measure the mass of each paper square, and the results were recorded in the Excel spreadsheet shown in Figure 2. The trendline option in Excel was used to draw the best-fitting line relating the masses of the five squares to their surface areas.

Figure 1. Construction-paper squares used to determine the relationship between the mass and surface area of the paper.

Figure 2. Spreadsheet used to generate the linear equation relating the mass of the construction paper to the surface area of the paper.

The best-fitting line shows the linear relationship between the surface area and mass of the construction paper. In the equation shown in Figure 2, y = 141.88x, y is the surface area of the paper (in cm²) and x is its mass. R², the coefficient of determination, is 0.9999. This indicates that the line fits the five measured points almost perfectly, as expected when the paper squares are carefully and accurately measured and cut out, so the equation y = 141.88x provides a very reliable estimate of the relationship between the mass and surface area of the paper.

It is a simple matter, then, for students to plug in the mass of their paper leaves for x in the equation and determine the surface area of each by multiplying the mass by 141.88. If students have traced their leaves carefully and accurately, the surface area of the real leaves will be very close to the surface area of the paper leaves, and the calculation will provide a reliable estimate of the surface area of each leaf. Sample data from one class of 20 students are shown in Appendix S1, and step-by-step instructions for determining the surface area of the leaves are provided in Appendix S2 (both are available as Supplemental Material with the online version of this article).
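For classes that prefer a script to a spreadsheet, the calibration and conversion steps can be sketched in Python. This is only an illustrative sketch under stated assumptions: the square masses below are hypothetical placeholders (the article reports only the fitted equation y = 141.88x), masses are assumed to be in grams, the function names are invented for this example, and the line is forced through the origin to match the form of the reported equation.

```python
# Areas of the five calibration squares (cm^2), as given in the article.
SQUARE_AREAS_CM2 = [16.0, 36.0, 64.0, 100.0, 144.0]

# HYPOTHETICAL masses (assumed grams) for those squares; the article does not
# list them. They were chosen only to be roughly consistent with y = 141.88x.
SQUARE_MASSES_G = [0.113, 0.254, 0.451, 0.705, 1.015]


def fit_slope_through_origin(masses, areas):
    """Least-squares slope for the model area = slope * mass (no intercept)."""
    numerator = sum(m * a for m, a in zip(masses, areas))
    denominator = sum(m * m for m in masses)
    return numerator / denominator


def leaf_area_cm2(paper_mass_g, slope=141.88):
    """Estimate a leaf's surface area from the mass of its paper tracing."""
    return slope * paper_mass_g


slope = fit_slope_through_origin(SQUARE_MASSES_G, SQUARE_AREAS_CM2)
print(f"fitted slope: {slope:.2f} cm^2 per g")
print(f"0.52 g tracing -> {leaf_area_cm2(0.52):.2f} cm^2")
```

With real class data, the fitted slope would replace the default value of 141.88, since the slope depends on the particular construction paper used.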

Data Analysis

Students began their analysis of the data by using the spreadsheet to calculate descriptive statistics. The mean, variance, and standard deviation of each sample are shown in Table 1. Students also constructed histograms for each sample like those shown in Figure 3. A histogram divides the range of possible values for the data points into equal size classes and displays the number of data points that fall into each class as a bar. The histogram shows the overall pattern, or distribution, of the data and allows students to make a rough comparison of the difference between the means of the two groups (Moore et al., 2009).
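The descriptive statistics and the histogram classes can also be computed outside a spreadsheet. The following is a minimal sketch using Python's standard library; the leaf areas are hypothetical values made up for illustration (not the class data in Appendix S1), and the helper names, class width, and starting value are arbitrary choices for this example.

```python
import statistics
from collections import Counter

# HYPOTHETICAL leaf areas (cm^2) for illustration; not the class data.
sun_areas = [58.2, 71.5, 66.9, 80.3, 92.1, 74.4, 69.0, 77.8]


def describe(sample):
    """Return n, mean, sample variance (n - 1 denominator), and sample SD."""
    return {
        "n": len(sample),
        "mean": statistics.mean(sample),
        "variance": statistics.variance(sample),
        "sd": statistics.stdev(sample),
    }


def histogram_counts(sample, class_width, start):
    """Count data points in equal-width classes [low, high) beginning at start."""
    counts = Counter(int((x - start) // class_width) for x in sample)
    return {
        (start + k * class_width, start + (k + 1) * class_width): counts[k]
        for k in sorted(counts)
    }


summary = describe(sun_areas)
print(summary)

# Text histogram: one '#' per leaf in each 10 cm^2 class.
for (low, high), count in histogram_counts(sun_areas, 10, 50).items():
    print(f"{low:>5.0f}-{high:<5.0f} {'#' * count}")
```

Note that `statistics.variance` and `statistics.stdev` use the n − 1 denominator, which matches the sample formulas used in this activity.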

Table 1. Descriptive statistics.

| Descriptive statistic | Sun leaves | Shade leaves |
| --- | --- | --- |
| Sample size ($n$) | 40 | 40 |
| Mean ($\bar{x}$) | 73.71 cm² | 96.76 cm² |
| Variance ($s^2$), $s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$ | 285.32 cm⁴ | 271.40 cm⁴ |
| Standard deviation ($s$), $s = \sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$ | 16.89 cm² | 16.47 cm² |

Figure 3. Histograms showing the distributions of sun and shade leaf surface areas.

Making Inferences from Data

Scientists use statistical inference procedures to draw conclusions about populations from sample data (Moore et al., 2009). We don’t know the true mean surface area of the entire population of sweetgum leaves, or the true mean surface areas of the subpopulations of sun and shade leaves, assuming they are different. But our sun and shade leaf samples can be used to make inferences about the total population of sweetgum leaves and determine whether there are two distinct subpopulations or not. Specifically, we can use the data to answer the question we began with, “Is there a difference in the mean surface areas of sun leaves and shade leaves?”

Using the t-Test to Make Inferences from the Sample Data

To compare the means of two independent, random samples, like the sun leaves and shade leaves, the appropriate statistical test is the independent-samples t-test (Moore et al., 2009). Certain conditions should be met in order to use the t-test. The data must be continuous, like the measurements of the heights of a group of people. In other words, the measurements can be meaningfully subdivided into smaller and smaller increments. In addition, the data should be approximately normally distributed. This can be tested visually by constructing histograms as in Figure 3, or it can be tested formally using the Shapiro-Wilk test of normality. The easiest way to do this is to enter the sample data into an online Shapiro-Wilk normality test calculator (Statistics Kingdom, n.d.). Both samples collected by students for this activity are approximately normally distributed. Finally, the two samples should have approximately the same variance. An online calculator for Levene’s test (Stangroom, n.d.) provides a simple method for testing whether this condition is met. The samples collected for this activity meet this condition.

The first step in a statistical hypothesis test, like the t-test, is to state the null and alternative hypotheses. The independent variable is categorical. The leaves are either sun leaves or shade leaves. The dependent variable, the surface area of the leaves, is a continuous variable. Sun leaves are exposed to a greater intensity of sunlight than shade leaves. If there is a difference this may be the cause, but other factors should be considered as well. The goal of this activity is only to determine whether a difference exists, not to determine the cause of any difference. The null hypothesis assumes that there is no difference between the means of the two subpopulations. In other words, any small difference in mean surface area will be due to chance factors like sampling error. The alternative hypothesis assumes some sort of causal factor that contributes to a difference between the two samples. The formal statements of the null and the alternative hypotheses are as follows:

  • Null Hypothesis (H0) – There is no significant difference in mean surface area between the sun leaves and shade leaves.

  • Alternative Hypothesis (H1) – There is a significant difference in mean surface area between the sun leaves and shade leaves.

Our alternative hypothesis simply proposes that a significant difference exists. It does not suggest which group will have larger leaves and which will have smaller leaves. This is called a nondirectional hypothesis and requires that we perform a two-tailed hypothesis test.

The next step is to determine the criterion used to decide whether the difference between the two means is large enough to be statistically significant. Two numbers are required: a significance level and the degrees of freedom. It is customary to use a significance level of α = 0.05. The significance level determines where to draw lines on the t distribution to divide the area under the curve into two regions: a region where the null hypothesis is rejected (5% of the area under the curve), and an area where the null hypothesis is not rejected (95% of the area under the curve). Since a two-tailed test is required, the 5% will be divided equally between the right tail of the t distribution and the left tail of the distribution (see Figure 4).

Figure 4. The t distribution.

The second number needed to set the decision criterion is the degrees of freedom – the number of values in the final calculation of a statistic that are free to vary. For the t-test, the degrees of freedom (df) is equal to the sum of the number of measurements in the two groups minus two. The students collected 40 sun leaves and 40 shade leaves, so df = 40 + 40 − 2 = 78.

The two numbers, α = 0.05 and df = 78, are used to identify a critical t value from a table of critical t values (see Table 2). Notice that 78 degrees of freedom is not on the table. When this is the case, it is wise to choose the critical t value associated with the next-lowest level on the table, in this case df = 70, with a critical t value of 1.994. The lower the degrees of freedom, the higher the critical t value; and the higher the critical value, the harder it is to achieve statistical significance. So, choosing the next lower value for df makes it less likely that the null hypothesis will be rejected when there really is no significant difference between the populations being compared in the study. Rejecting the null hypothesis when there really is no difference between the populations is called a Type I error (see Cooper, 2019).
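The lookup rule described above (use the next-lowest tabled degrees of freedom when the exact value is missing) can be sketched as a small Python function. This is an illustrative sketch, not part of the original activity: the dictionary holds only the rows printed in Table 2, so any df between 11 and 49 would fall back to the df = 10 row, which is more conservative than a full table would be. The function names are invented for this example.

```python
# Two-tailed critical t values at alpha = 0.05, limited to the rows in Table 2.
T_CRIT_05 = {
    1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
    6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228,
    50: 2.009, 60: 2.000, 70: 1.994, 80: 1.990, 100: 1.984,
}


def degrees_of_freedom(n1, n2):
    """df for the independent-samples t-test: n1 + n2 - 2."""
    return n1 + n2 - 2


def conservative_t_crit(df, table=T_CRIT_05):
    """Return (tabled df, critical t) for the largest tabled df <= df."""
    eligible = [d for d in table if d <= df]
    if not eligible:
        raise ValueError("df must be at least 1")
    tabled_df = max(eligible)
    return tabled_df, table[tabled_df]


df = degrees_of_freedom(40, 40)
tabled_df, t_crit = conservative_t_crit(df)
print(f"df = {df}; using tabled df = {tabled_df}, critical t = {t_crit}")
```

Rounding df down rather than up keeps the critical value slightly larger than necessary, which is the cautious choice with respect to Type I error.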

Table 2. Critical values for the two-tailed t-test.

| Degrees of freedom (df) | t_crit (α = 0.05) |
| --- | --- |
| 1 | 12.706 |
| 2 | 4.303 |
| 3 | 3.182 |
| 4 | 2.776 |
| 5 | 2.571 |
| 6 | 2.447 |
| 7 | 2.365 |
| 8 | 2.306 |
| 9 | 2.262 |
| 10 | 2.228 |
| ... | ... |
| 50 | 2.009 |
| 60 | 2.000 |
| 70 | 1.994 |
| 80 | 1.990 |
| 100 | 1.984 |
| Infinity | 1.960 |

The locations of the critical values −1.994 and +1.994 are marked as vertical lines on the x-axis in Figure 4, dividing the distribution into two regions. There is a region where the null is rejected (the regions with values less than −1.994 and greater than +1.994 collectively make up 5% of the area under the curve), and a region where the null hypothesis is not rejected (the central region of the distribution, with values greater than −1.994 and less than +1.994, which makes up 95% of the area under the curve). If there is a small difference between the means, the calculated t value should fall in the 95% region and the null is not rejected. This would mean that any small difference between the means of the surface areas of the sun leaves and shade leaves is likely due to some random factor like sampling error. If the difference between the means is large, then the calculated t value should fall in the 5% region. The null hypothesis is rejected in this case, and the data support the conclusion that the difference may not be due to random chance (see explanation of Type I error in Cooper, 2019). In this case, the alternative hypothesis must be considered. Since no prediction was made here about what may cause the difference, further investigations would be necessary to determine the cause.

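The t statistic is calculated using the formula in Figure 5 and the summary statistics (Table 1) calculated from the data collected by the students. The calculations are shown below:

$$|\bar{x}_1 - \bar{x}_2| = |73.71\ \text{cm}^2 - 96.76\ \text{cm}^2| = 23.05\ \text{cm}^2$$

$$\frac{s_1^2}{n_1} = \frac{285.32\ \text{cm}^4}{40} = 7.13\ \text{cm}^4$$

$$\frac{s_2^2}{n_2} = \frac{271.40\ \text{cm}^4}{40} = 6.79\ \text{cm}^4$$

$$\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} = 7.13\ \text{cm}^4 + 6.79\ \text{cm}^4 = 13.92\ \text{cm}^4$$

$$\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = \sqrt{13.92\ \text{cm}^4} = 3.73\ \text{cm}^2$$

$$t = \frac{|\bar{x}_1 - \bar{x}_2|}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \frac{23.05\ \text{cm}^2}{3.73\ \text{cm}^2} = 6.18$$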

Figure 5. The t statistic: $t = \dfrac{|\bar{x}_1 - \bar{x}_2|}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$

The observed t value calculated from the summary statistics is 6.18. This observed t value is compared to the critical t value. If the calculated t value is greater than or equal to the critical t value, the null hypothesis is rejected. The calculated t value of 6.18 is, in fact, far greater than the critical t value of 1.994. Under the assumption of the null hypothesis, a calculated t value that high represents a difference between the two means that is extremely unlikely to have occurred by chance. So, the null hypothesis that the difference is caused by chance is rejected, and it is concluded that there is a statistically significant difference in the mean surface areas of the sun leaves and shade leaves that were collected. In other words, the data support the conclusions that sun leaves and shade leaves are two distinct subpopulations and that some factor, perhaps intensity of sunlight, caused a difference between them.
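Students who want to check the hand calculation can reproduce it from the summary statistics alone. The sketch below uses the same formula as the worked calculation (difference of means divided by the square root of the sum of each variance over its sample size) and the values from Table 1; the function name is an invention for this example, and the critical value is the df = 70 entry from Table 2.

```python
import math


def t_from_summary(mean1, var1, n1, mean2, var2, n2):
    """Observed t = |mean1 - mean2| / sqrt(var1 / n1 + var2 / n2)."""
    standard_error = math.sqrt(var1 / n1 + var2 / n2)
    return abs(mean1 - mean2) / standard_error


# Summary statistics from Table 1 (means in cm^2, variances in cm^4).
t_observed = t_from_summary(73.71, 285.32, 40, 96.76, 271.40, 40)

# Two-tailed critical value at alpha = 0.05, using the df = 70 row of Table 2.
T_CRITICAL = 1.994

# Decision rule: reject the null hypothesis when t_observed >= T_CRITICAL.
reject_null = t_observed >= T_CRITICAL
print(f"t = {t_observed:.2f}, reject H0: {reject_null}")
```

Because the script carries full precision through each step instead of rounding intermediate values, small differences in the last decimal place relative to a hand calculation are possible with other data sets.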

Can it be said that the claim that shade leaves have, on average, a larger surface area than sun leaves has been proven? No! As statisticians Jerzy Neyman and Egon Pearson wrote, “The [statistical hypothesis] tests themselves give no final verdict but as tools help the worker who is using them to form his final decision” (quoted in Denworth, 2019, p. 67). So, how does one counter a claim that the difference found in this case was just a rare chance occurrence resulting from the particular samples of sun leaves and shade leaves chosen – that the result was just a Type I error? How can the conclusion be established with greater certainty? The investigation must be repeated. If the significant difference found in this case was just a chance occurrence, it is unlikely to be repeated in subsequent trials.

Biological Significance of the Statistically Significant Result

Upon completion of this activity, students should consider the biological significance of the result. Why do sun leaves have a smaller surface area, on average, than shade leaves? This may lead to further investigations. For example, Vogel (1968, p. 1203) observed: “The sun leaves are smaller, thicker, hairier, and more deeply lobed than shade leaves; in short, the sun leaves are more like those of plants characteristic of dry habitats.” Are the differences in sun leaves an adaptation for water conservation? This is a question that students could investigate. Students could collect some additional samples of sun leaves and shade leaves and compare their rates of transpiration under similar conditions. In addition, students could determine the stomatal density on each of the two types of leaves. Oddly enough, Vogel (1968) noted that sun leaves may have up to 12 times the number of stomata per unit area, calling into question the water conservation hypothesis. However, students could conduct investigations to test the water conservation hypothesis and come to their own conclusion.

In another possible extension of this activity, students could collect leaves from different species of plants and investigate whether sun leaves and shade leaves with different surface areas are present. For example, Zales and Colosi’s (1998) students collected leaves from the north and south sides of Forsythia and found that Forsythia leaves did not display a statistically significant difference in leaf length between sun leaves and shade leaves.

Conclusion

The scientific knowledge presented in textbooks often seems to have an air of certainty about it. Schwab (1962, p. 24) characterized textbook science as a “rhetoric of conclusions,” where there is often little indication of how scientists have come to trust in the reliability of that knowledge. By contrast, when scientists present their findings to other scientists, they express them in probabilistic language (Bowen & Bartley, 2014). They do this because their claims and arguments are rooted in the data they have collected, and data, no matter how carefully collected, always carry some degree of uncertainty (Curran-Everett, 2000). Even if scientists are meticulous in their efforts to collect and record their data and they avoid introducing any accidental errors, there will still be two sources of error: errors that stem from the limitations of the instruments, and sampling error (Wild, 2006).

Reliable scientific conclusions are based on data, verifiable empirical evidence. But data must be interpreted, and statistical inference plays an important role in the process of interpretation. Scientists often disagree on how evidence should be interpreted, and the resulting debate motivates further evidence collection until there is a general consensus among the experts on how the evidence should be interpreted. As a result, the most reliable scientific conclusions are derived from multiple, independent lines of investigation that all converge on a similar conclusion. Any one investigation is just one piece in a much larger puzzle.

In order for students to understand the role that statistical analysis plays in the process of science, they must have opportunities to collect some data, transform the data using their graphing and statistical analysis skills, experience the difficulties and uncertainties associated with data analysis, and make judgments about what the data mean. They will develop a deeper understanding of the nature of scientific knowledge as a result.

Supplemental Material

The following are available with the online version of this article:

  • Appendix S1: Table comparing sun leaves and shade leaves

  • Appendix S2: Determining the surface area of leaves

References

AAAS (2011). Vision & Change: A Call to Action, Final Report. Washington, DC: American Association for the Advancement of Science.

Bell, R.L., Smetana, L. & Binns, I. (2005). Simplifying inquiry instruction: assessing the inquiry level of classroom activities. Science Teacher, 72(7), 30–33.

Bowen, M. & Bartley, A. (2014). The Basics of Data Literacy: Helping Your Students (and You!) Make Sense of Data. Arlington, VA: NSTA Press.

College Board (2019). AP Biology Course and Exam Description (revised fall 2019). New York, NY: College Board.

Cooper, R.A. (2019). Making decisions with data: hypothesis testing and statistical significance. American Biology Teacher, 81, 535–542.

Curran-Everett, D. (2000). The process of scientific discovery: How certain can we be? The American Biology Teacher, 62, 266–275.

Denworth, L. (2019). A significant problem. Scientific American, 321(4), 62–67.

Moore, D.S., McCabe, G.P. & Craig, B.A. (2009). Introduction to the Practice of Statistics, 6th ed. New York: W.H. Freeman.

National Research Council (2012). A Framework for K–12 Science Education: Practices, Crosscutting Concepts, and Core Ideas. Washington, DC: National Academies Press.

NGSS Lead States (2013). Next Generation Science Standards: For States, by States. Washington, DC: National Academies Press.

Nicotra, A.B., Leigh, A., Boyce, C.K., Jones, C.S., Niklas, K.J., Royer, D.L. & Tsukaya, H. (2011). The evolution and functional significance of leaf shape in the angiosperms. Functional Plant Biology, 38, 535–552.

Poorter, H., Niinemets, Ü., Ntagkas, N., Siebenkäs, A., Mäenpää, M. & Matsubara, T. (2019). A meta-analysis of plant responses to light intensity for 70 traits ranging from molecules to whole plant performance. New Phytologist, 223, 1073–1105.

Schlichting, C.D. (1986). The evolution of phenotypic plasticity in plants. Annual Review of Ecology and Systematics, 17, 667–693.

Schwab, J.J. (1962). The teaching of science as inquiry. In J.J. Schwab & P.F. Brandwein, The Teaching of Science (pp. 1–103). Cambridge, MA: Harvard University Press.

Stangroom, J. (n.d.). Social science statistics: homogeneity of variance calculator – Levene’s test. https://www.socscistatistics.com/tests/levene/default.aspx.

Statistics Kingdom (n.d.). Shapiro-Wilk test calculator. http://www.statskingdom.com/320ShapiroWilk.html.

Vogel, S. (1968). “Sun leaves” and “shade leaves”: differences in convective heat dissipation. Ecology, 49, 1203–1204.

Wild, C. (2006). The concept of distribution. Statistics Education Research Journal, 5, 10–26.

Zales, C.R. & Colosi, J.C. (1998). An exercise where students demonstrate the meaning of “Not statistically significantly different.” American Biology Teacher, 60, 596–600.