The biological sciences are becoming increasingly reliant on computer science and associated technologies to quickly and efficiently analyze and interpret complex data sets. Introducing students to data analysis techniques is a critical part of their development as well-rounded, scientifically literate citizens. As part of a collaborative effort between the Biology and Computer Science departments at William & Mary, we sought to develop laboratory exercises that would introduce basic ideas of data analysis while also exposing students to Python, a commonly used computer programming language. We accomplished this by developing exercises within the interactive Jupyter Notebook platform, an open-source application that allows Python code to be written and executed as discrete blocks in real time. Students used the developed Jupyter Notebook to analyze data collected as part of a multiweek ecology field experiment aimed at determining the effect of white-tailed deer on aspects of biological diversity. These inquiry-based laboratory exercises generated scientifically relevant data and gave students a chance to experience and participate in ongoing scientific research while demonstrating the utility of computer science in the scientific process.

Introduction

Computer programming and an understanding of software applications, previously strictly the domain of computer scientists, have become critical skills in modern biology (Brewer & Smith, 2011). While training courses and inquiry-based exercises incorporating data analysis and bioinformatics are emerging at an increasing rate, educational opportunities for undergraduates that lie at the confluence of biology and computer science are still lacking (Stefan et al., 2015; Wang, 2017). At William & Mary, under the framework of a Creative Adaptation internal grant, an interdisciplinary collaboration between Biology and Computer Science faculty worked to improve congruency between these two once-disparate fields by developing course offerings that would emphasize the importance of both fields in the sciences.

One of the goals of this collaboration was to develop mechanisms to introduce biology-based data analysis using computer science to students enrolled in a large Introductory Biology lab course. This course serves 350–400 primarily freshman undergraduate students per semester and addresses cellular, molecular, and developmental biology in one semester, followed by evolutionary biology and ecology in the next. Aspects of ecology were covered using a multiweek guided inquiry-based experiment in which students investigated the impacts of white-tailed deer grazing on biological diversity. This experiment provided a natural backdrop against which to introduce key skills of biological data analysis while also exposing students to concepts of computer coding.

This integration was accomplished using the interactive Jupyter Notebook platform. The Jupyter Notebook is an open-source application that runs in a web browser. Jupyter Notebooks are organized into “cells” – discrete units that can contain text, images, or executable Python code, all of which are modifiable in real time. Jupyter Notebooks have been exploited for their ease of use and have been disseminated for use in biological research in diverse areas such as neuroscience (Rosenberg & Horn, 2016), microbial ecology (Howe & Chain, 2015), and radiology physics (Richardson & Amini, 2018). Using this platform, it is possible to write straightforward lines of Python code, complete with text descriptions, to perform simple data analysis functions like creating charts or performing statistical tests. This design is particularly useful when introducing students to coding because each cell can be executed discretely, allowing students to observe the outcome of each command. Additionally, this platform can be implemented with no upfront knowledge of computer coding. Students can easily make changes to the code, observe the impact on the analysis output, and experiment with commands to generate appropriate graphics or tests. We thought that this guided approach would help overcome the intimidation students often feel when first introduced to coding and would encourage students to consider courses in computer science they may not have thought about previously.

Ecology Lab Organization & Data Collection

The overall aim of this multiweek lab module was to assess the impact of white-tailed deer on biodiversity in the local forest ecosystem, a question with scientific merit. This area, like many in the state, experiences heavy grazing pressure from a large population of local white-tailed deer (Odocoileus virginianus). Once hunted to near extinction in the early 1900s, this population has now rebounded to historically high levels (VGDIF, 2015). While studies have demonstrated the impact of white-tailed deer on biodiversity, both positive and negative (Rooney et al., 2004; Wiegmann & Waller, 2006; Cook-Patton et al., 2014), there have not been any published studies detailing the impact of deer on the study area of this experiment. This multiweek lab was scaffolded to guide students through aspects of the scientific method, including developing hypotheses, collecting and analyzing data, and interpreting the results of the analysis in relation to the hypothesis. The lab included sessions providing background skills necessary for the field component. Weekly electronic lab reports were used to assess progress and provide feedback to students. The capstone of this exercise was a written lab report that took the form of a mini–scientific paper.

Prior to data collection, students (working in groups of three or four) were tasked with developing hypotheses to address several questions using knowledge gained from previous lab exercises: (1) Do deer impact the biodiversity of plants? (2) Do deer impact the biodiversity of invertebrates? (3) Choice of one of five additional questions: Does the act of setting up the control plot alter biodiversity? Is the biodiversity of trees, groundcover plants, or invertebrates related to the number of deer photographed in the unfenced site? Does the amount of sun reaching the forest floor alter plant biodiversity? Do deer affect the amount of groundcover in a site? Does the amount of groundcover affect invertebrate biodiversity? Within each question, students needed to make decisions to further refine their hypotheses, such as choosing to examine the effect of deer on trees or groundcover plants for question 1, and deciding which metric of diversity to use (species richness, Simpson index, or Shannon index).

William & Mary's campus includes 900 acres of mixed-hardwood forest, the College Woods, made accessible by a network of roads and trails. In 2014, sixteen sites were chosen throughout the College Woods as part of a departmental collaboration. Each site consisted of two marked plots: a fenced plot bounded by a metal and plastic mesh fence (7 feet high) intended to deter deer and an unfenced plot of the same size. These sites are in use by multiple professors in the Biology Department and experience foot traffic at several times during the field season. In order to control for any possible effect of human foot traffic on the parameters measured, a randomly placed transect located nearby but outside of the fenced and unfenced plots was set up. The transect plot was used only for the study described here. Additionally, one motion-activated infrared camera was installed at each site (aimed at the unfenced plot) with the goal of assessing deer activity quantitatively. The cameras were allowed to operate over a three-month period (June–August).

Students were transported to an assigned field site where they collected data including mature (>5 cm circumference) tree species abundance and circumference, groundcover plant abundance, canopy cover, and groundcover vegetative coverage, as well as noting any deer scat that might have been present. Prior lab exercises had trained students on tree and groundcover plant identification techniques. While in the field, students used color field guides for identification help as well as trained teaching assistants for identification questions. In addition, students placed pitfall traps to collect invertebrates. These traps were kept in the field for 24–48 hours before being retrieved. All data were recorded by each group on a datasheet (Figure 1) and compiled for all groups and sites.

Figure 1.

Worksheet students used to record field data. Each group was assigned a specific site and plot to survey. Data were compiled for all groups and sites.

In-lab exercises undertaken in the following lab sessions tasked students with examining the camera trap data to identify and tally animals recorded in the study site. Additionally, students used a dichotomous key to identify invertebrates collected in the pitfall traps. Students calculated the three different metrics of diversity (species richness, Simpson index, and Shannon index) for trees, groundcover plants, and invertebrates. Data from all lab sections were combined to provide complete coverage for all 16 field sites and distributed to students for data analysis (Figure 2).
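The three diversity metrics named above can be computed from a simple tally of identified individuals. The sketch below is illustrative only and is not taken from the course materials. It assumes the Simpson index is reported in its Gini-Simpson form (1 minus the sum of squared proportions) and that the Shannon index uses the natural logarithm; the source does not specify either convention. The function name and the species tally are hypothetical.

```python
import math
from collections import Counter

def diversity_metrics(observations):
    """Return (richness, Simpson, Shannon) for a list of species labels.

    Assumptions (not stated in the lab description):
    Simpson is the Gini-Simpson form, 1 - sum(p_i^2);
    Shannon uses the natural logarithm, -sum(p_i * ln p_i).
    """
    counts = Counter(observations)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one observation is required")
    proportions = [n / total for n in counts.values()]
    richness = len(counts)
    simpson = 1.0 - sum(p * p for p in proportions)
    shannon = -sum(p * math.log(p) for p in proportions)
    return richness, simpson, shannon

# Hypothetical tally of trees identified in one plot
trees = ["oak"] * 5 + ["beech"] * 3 + ["holly"] * 2
richness, simpson, shannon = diversity_metrics(trees)
print(richness, round(simpson, 3), round(shannon, 3))
```

Richness counts only how many species are present, whereas the Simpson and Shannon indices also weight each species by its relative abundance, which is why the choice of metric can change the outcome of a comparison between plots.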

Figure 2.

Screenshot of data for one set of plots. Data entries included estimates of percent canopy cover, percent vegetative groundcover, deer scat, deer photographs, and measures of diversity for trees, groundcover plants, and invertebrates.

Jupyter Notebook for Data Analysis

Students at William & Mary are required to have personal laptop computers and were asked to use them in this course. The Jupyter Notebooks were executed in the Anaconda application (https://www.anaconda.org), a free, open-source application that allows Python coding without using the command line interface. Instructional videos detailing Anaconda downloading and installation procedures were provided prior to the beginning of this lab. Students were introduced to the Jupyter Notebook format with an introductory exercise early in this module and so were familiar with the basic layout and functionality of the application prior to the start of this analysis. The Jupyter Notebook file was developed to step students through the analysis process and included cells containing instructions (distinguished by the # symbol) as well as Python code. Both the Jupyter Notebook file and the class data (in a Microsoft Excel file) were distributed to the student groups. Student datasheets, Jupyter Notebooks, and associated spreadsheets can be accessed at https://doi.org/10.5281/zenodo.1401210.

Following the provided instructions, students launched Anaconda and opened the Jupyter Notebook to begin the data analysis. The initial cells of the Jupyter Notebook did not require modification by the students. However, students were instructed to read the textual instructions (marked by the # symbol) contained in each cell describing the processes each line of Python code would perform. The first cell of the Jupyter Notebook (Figure 3) contained textual instructions and Python code designed to set up the working environment by bringing in useful data analysis packages known in Python as “libraries.” These libraries add functionality to the Jupyter Notebook, allowing users to write code to perform specific operations. The Jupyter Notebook developed for this lab used pandas, imported with the line import pandas as pd, so that users could load data from a Microsoft Excel spreadsheet rather than create a data frame from scratch. This alias allowed any later pandas operation to be called with the prefix pd. NumPy (coded as numpy) was similarly added to include functionality for large data sets as well as additional mathematical functions. The matplotlib.pyplot library allowed for the creation of data plots, scipy.stats added statistical functions, and seaborn enabled users to generate aesthetically pleasing graphics. The Python code in this cell was then executed by clicking the “run” button (Figure 3; circled button).

Figure 3.

Screenshot of the first section of the Jupyter Notebook. Each gray box is a cell containing textual instructions (denoted with # symbol) as well as Python code. The code was executed as students clicked on the “run” button on the top banner of the window. Cells needed to be executed in order from top to bottom.

Following the setup cell was a cell designed to bring in the data distributed to students as a Microsoft Excel file and store it as a data frame called “ecology,” coded as ecology = pd.read_excel('Ecology.xlsx'). The second line of code in this cell, ecology.head(), generated output (the first five rows of the data frame) that appeared when students executed this cell (Figure 4).

Figure 4.

Screenshot of the output of cell 2 from Figure 3 following execution of the code. This code brought in the data for analysis and allowed students to double-check that the import was successful by displaying the first five rows of the resulting data frame.

Data were then grouped by plot type (fenced, unfenced, or transect), and general descriptive statistics were generated for each plot type (cells 3 and 4; Figure 3). Next, each type of plot was defined as its own data set to allow students to restrict the data as appropriate for each question (cell 5; Figure 3). The last cell in this section (cell 6; Figure 3) served as a check on this process by displaying descriptive statistics, such as count, mean, and standard deviation, for each variable in one type of plot (transects) as an example (Figure 5), coded as transect.describe().
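The grouping and subsetting steps described above might look something like the following. This is a minimal sketch rather than the code from the distributed notebook: the small data frame stands in for the Excel file, and the column names ("Plot Type", "Groundcover Simpson", "Canopy Cover") and all values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the class data; the actual notebook
# loaded the data with pd.read_excel('Ecology.xlsx').
ecology = pd.DataFrame({
    "Plot Type": ["fenced", "unfenced", "transect"] * 3,
    "Groundcover Simpson": [0.71, 0.52, 0.60, 0.68, 0.47, 0.63, 0.75, 0.55, 0.58],
    "Canopy Cover": [80, 75, 78, 85, 70, 82, 90, 72, 76],
})

# Descriptive statistics for every numeric variable, split by plot type
by_plot = ecology.groupby("Plot Type")
print(by_plot.describe())

# One data frame per plot type, so later cells can restrict the analysis
fenced = ecology[ecology["Plot Type"] == "fenced"]
unfenced = ecology[ecology["Plot Type"] == "unfenced"]
transect = ecology[ecology["Plot Type"] == "transect"]

# Check on the subsetting step: count, mean, std, and quartiles for transects only
print(transect.describe())
```

Defining each plot type as its own data frame keeps the later analysis cells short, since a student only has to supply a column name rather than repeat the filtering condition.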

Figure 5.

Screenshot of output of cell 6. This cell displays descriptive statistics for variables within the transects only.

Once students had completed executing these initial cells, they progressed to the formal data analysis sections, divided by question. These sections included more detailed instructions and incomplete lines of Python code, requiring students to complete the code to tailor their analysis to fit their selected question and hypothesis developed prior to data collection (Figure 6). The first two cells of this section contained Python code generating histograms (or distribution plots) for the variable of choice within the fenced plot type, sb.distplot(fenced['']), and the unfenced plot type, sb.distplot(unfenced['']). Executing these cells without adding a variable name between the single quotes would result in an error message. A completed example for fenced data is shown (Figure 7). The third cell (Figure 6) generated a box plot after modification of the code to include the variable of choice.
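A completed histogram cell could resemble the sketch below. It substitutes matplotlib's hist for the notebook's sb.distplot call, in part because newer seaborn releases mark distplot as deprecated in favor of histplot. The data values are hypothetical, and the title and variable name follow the example given for Figure 7.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend for running as a script; omit inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical Simpson index values for groundcover plants in fenced plots
fenced = pd.DataFrame({
    "Groundcover Simpson": [0.71, 0.68, 0.75, 0.62, 0.80, 0.66, 0.73, 0.69],
})

# The variable name goes where the template left empty quotes
variable = "Groundcover Simpson"

fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(fenced[variable], bins=5)
ax.set_title("Groundcover Simpson histogram")
ax.set_xlabel(variable)
ax.set_ylabel("Number of plots")
```

In a notebook the figure renders below the cell. Leaving the column name as an empty string, as in the unmodified template, makes pandas raise a KeyError because no column with that name exists.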

Figure 6.

Screenshot of the first portion of data analysis prior to student modification. In this section, students were required to tailor the incomplete computer code to fit their selected variables. The textual component of each cell included instructions directing students on this process.

Figure 7.

Screenshot of the output of the first cell in this section. This example was tailored to address the question of the effect of deer on groundcover plant diversity as measured by the Simpson index. ‘Groundcover Simpson histogram’ was given as the plot title and ‘Groundcover Simpson’ was entered as the desired variable. The histogram graphic was the result of the execution of this cell.

Students were then instructed to consider the distribution of their selected data (as displayed in the histograms) and decide whether the data were normally distributed. These basic statistical concepts had been covered in previous exercises. Depending on the decision, students were instructed to modify and execute the Python code in the next two cells to run either a t-test (for normally distributed data) or a Mann-Whitney U-test (for non-normally distributed data) (Figure 8). The resulting P-value was then interpreted in light of the biological hypothesis.
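Both tests are available in scipy.stats, and the choice between them can be sketched as follows. The values are hypothetical, and the exact options used in the course notebook are not given in the text; here ttest_ind is left at its default (equal variances assumed) and the Mann-Whitney U-test is run two-sided.

```python
from scipy import stats

# Hypothetical Simpson index values for groundcover plants
fenced_values = [0.71, 0.68, 0.75, 0.62, 0.80, 0.66, 0.73, 0.69]
unfenced_values = [0.52, 0.47, 0.55, 0.60, 0.43, 0.58, 0.50, 0.49]

# Two-sample t-test: appropriate when both samples look roughly normal
t_stat, t_p = stats.ttest_ind(fenced_values, unfenced_values)

# Mann-Whitney U-test: rank-based alternative for non-normal data
u_stat, u_p = stats.mannwhitneyu(
    fenced_values, unfenced_values, alternative="two-sided"
)

print(f"t-test: t = {t_stat:.2f}, P = {t_p:.4f}")
print(f"Mann-Whitney U: U = {u_stat:.1f}, P = {u_p:.4f}")
```

A P-value below the chosen significance threshold (commonly 0.05) would lead a student to reject the null hypothesis that fenced and unfenced plots do not differ in the selected metric.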

Figure 8.

Screenshot of data analysis to generate P-values. Students were instructed to complete the computer code with their variable of choice, and select the appropriate test based on the assessment of normality completed in the earlier cells.

After completion of the analysis for question 1 (impact of deer on plant diversity), students moved on to the question 2 portion of the Jupyter Notebook (impact of deer on invertebrate diversity). The Jupyter Notebook setup for question 2 was very similar to that for question 1, in that students were required to tailor the Python code to address their hypothesis, specifically by entering the appropriate metric of diversity as the variable in the code string. They plotted histograms and determined the appropriate statistical test to perform and generated a P-value to inform their interpretation.

Students were given the choice of five questions for the third portion of the analysis. The Python code was again provided in a partially completed state. For this portion, the analysis did not include an assessment of normality. However, students were tasked with selecting the appropriate type of graph for their selected question, either a bar graph or a scatter graph (Figure 9). They also needed to select the most appropriate statistical test, either a t-test or a Spearman correlation. The Python code given required input of the correct variables, as with the previous questions, and resulted in a P-value to be interpreted in the lab report.
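For question options that relate two continuous measurements, such as tree diversity and the number of deer photographed, a Spearman correlation is the test named above. The sketch below shows one way to run it with scipy.stats.spearmanr; the per-site values and variable names are hypothetical and do not come from the class data.

```python
from scipy import stats

# Hypothetical per-site values for the unfenced plots
deer_photos = [2, 5, 9, 12, 15, 20, 24, 30]
tree_richness = [9, 8, 8, 6, 7, 5, 4, 3]

# Spearman correlation works on ranks, so it does not assume normality
rho, p_value = stats.spearmanr(deer_photos, tree_richness)
print(f"Spearman rho = {rho:.2f}, P = {p_value:.4f}")
```

A scatter graph suits this kind of paired, continuous data, whereas a comparison between two categories of plot (for example, unfenced plots versus transects) is more naturally shown as a bar graph and tested with a t-test.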

Figure 9.

Screenshot of data analysis cells for question 3. The first question option is shown here. Students were instructed to select the appropriate graph and statistical test and modify the code as needed to fit the selected question.

The capstone of these lab exercises and the entire ecology module was a written lab report in the form of a mini–scientific paper. In this report, students used the output of the Jupyter Notebook to report data and reflect back on their original hypotheses. The report included introduction, methods, results/discussion, and conclusion sections. Emphasis was placed on proper interpretation of the statistical tests and a demonstrated understanding of the overall experiment (Figure 10).

Figure 10.

Rubric used to assess the lab report generated from this data analysis. This report represented 12.5% of the final lab grade.

Conclusion

The main objective of this collaborative initiative between our Biology and Computer Science departments was to develop a lab exercise that would allow students to experience the scientific process in action while being exposed to the methods of data analysis using Python coding. The Jupyter Notebook provided a straightforward and structured environment where the instructor could add detail as needed to guide students of all skill levels through the exercise. Additionally, this environment encouraged students to modify and explore not just data, but also data analysis, by varying the visualization and statistical methods used. This multiweek lab allowed students to experience a real-world research problem for which the outcome was not known in advance. This lab will be repeated with only minor modifications in the next academic year to provide a multiyear data set with even more options for relevant ecological inquiries such as the impact of yearly weather patterns.

The authors thank the fall 2017 graduate student teaching assistants: Meredith Andersen, Cyril Anyetei-Anum, Morgan Claybrook, Rachel Davis, Robert Galvin, Nicole Gustafson, Carly Hawkins, Sam Mason, and Dylan Simpson. Thanks also to all the students enrolled in the fall 2017 BIOL 221 course. Special thanks to Tom Meier for physical upkeep of plots. This work was supported by a 2017 grant, “Well-aligned classes for a well-rounded education in the life sciences and computer science,” from William & Mary Creative Adaptation Fund, awarded to P.K. and M.S. through the Office of the Provost. Field site setup and maintenance was provided by a 2014 grant, “Learning research by doing research: establishing a long-term experiment on deer effects in the College Woods,” from William & Mary Morton Science Lab Fund, awarded to Harmony Dalgleish, Martha Case, Dan Cristol, Matthias Leu, and Drew LaMar. Continuing support has been provided by the Biology Department and by Dr. Martha Case (College Conservator of Botanical Collections) at William & Mary.

References

Brewer, C.A. & Smith, D. (2011). Vision and Change in Undergraduate Biology Education: A Call to Action. Washington, DC: AAAS.
Cook-Patton, S.C., LaForgia, M. & Parker, J.D. (2014). Positive interactions between herbivores and plant diversity shape forest regeneration. Proceedings of the Royal Society B, 281, 20140261.
Howe, A. & Chain, P.S.G. (2015). Challenges and opportunities in understanding microbial communities with metagenome assembly (accompanied by IPython Notebook tutorial). Frontiers in Microbiology, 6, 678.
Richardson, M.L. & Amini, B. (2018). Teaching radiology physics interactively with scientific notebook software. Academic Radiology, 25, 801–810.
Rooney, T.P., Wiegmann, S.M., Rogers, D.A. & Waller, D.M. (2004). Biotic impoverishment and homogenization in unfragmented forest understory communities. Conservation Biology, 18, 787–798.
Rosenberg, D.M. & Horn, C.C. (2016). Neurophysiological analytics for all! Free open-source software tools for documenting, analyzing, visualizing, and sharing using electronic notebooks. Journal of Neurophysiology, 116, 252–262.
Stefan, M.I., Gutlerner, J.L., Born, R.T. & Springer, M. (2015). The quantitative methods boot camp: teaching quantitative thinking and computing skills to graduate students in the life sciences. PLoS Computational Biology, 11(4).
VGDIF (2015). Virginia Deer Management Plan 2015–2024. http://www.dgif.virginia.gov.
Wang, J.T.H. (2017). Course-based undergraduate research experiences in molecular biosciences-patterns, trends, and faculty support. FEMS Microbiology Letters, 364(15).
Wiegmann, S.M. & Waller, D.M. (2006). Fifty years of change in northern upland forest understories: identity and traits of “winner” and “loser” plant species. Biological Conservation, 129, 109–123.