Since biomedical science has become increasingly data-intensive, acquisition of computational and quantitative skills by science students has become more important. For non-science students, an introduction to biomedical databases and their applications promotes the development of a scientifically literate population. Because typical college introductory biology laboratories do not include experiences of this type, we present a bioinformatics module that can easily be included in a 90-minute session of a biology course for both majors and non-majors. Students completing this computational, inquiry-based module observed the value of computer-assisted analysis. The module gave students an understanding of how to read files in a biological database (GenBank) and how to use a software tool (BLAST) to mine the database.
In response to biomedical science becoming increasingly data-intensive, Hunter College, like other undergraduate institutions (Miskowski et al., 2007), has implemented a curricular initiative designed to equip students in biology and related sciences with basic skills in computational and quantitative biology. Familiarizing students with bioinformatics – the use of computers to investigate and analyze biological data – lays the foundation for more advanced work in the biomedical sciences. According to the findings of the National Research Council’s 2003 report Biology 2010: Transforming Undergraduate Education for Future Research Biologists, the adequate training of students necessitates a multidisciplinary approach. Donald Kennedy notes that “the 2010 report recognizes – as many biologists have not – the reality that our discipline has been transformed into an interdiscipline” (Kennedy & Gentile, 2003). Rice et al. (2004) add that “in addition to being able to think about biological processes from molecular to organismal and community levels of organization, tomorrow’s biologists will also benefit from new bioinformatic ways of thinking derived from the fields of computer and information sciences.” In recognition of this reality, the curricular initiative at Hunter was undertaken by a diverse team of faculty from the departments of biology, chemistry, computer science, mathematics, and statistics. Courses selected for inclusion in this initiative range from large introductory-level classes to small upper-level electives. The introductory courses enroll a wide range of students, including traditional science majors, post-baccalaureates, and students with no professed interest in the sciences. Thus, a heterogeneous population of students is exposed to the field of bioinformatics and the principles of quantitative biology as a consequence of this project.
To introduce the curricular initiatives, Hunter’s team integrated innovative exercises that employed both active and cooperative learning across many courses. Where possible, real-world illustrations were used for student explorations because authenticity of data serves to increase the interest and enthusiasm of students (Campbell, 2003). The bioinformatics module presented here is the first example of an exercise suitable for an introductory biology course with a sizable non-science population.¹ This real-world, inquiry-based exercise is designed for use in a single 90-minute class period. It is readily adoptable by educators who wish to include bioinformatics content without significantly reworking an existing syllabus. Data collected from 2 years of implementation demonstrate that students who complete this module understand how to read files in a biological database (GenBank) and how to use a research tool (BLAST) to mine the database. Further, students’ responses suggest that, like other inquiry-based assignments, this exercise provides “the self-investment and excitement that comes with discovery of new knowledge” (Brame et al., 2008).
Hunter College of the City University of New York is a 4-year college in the heart of Manhattan with an undergraduate population of ∼15,500 students. Among the graduation requirements is a two-course natural science sequence, which students often satisfy during their freshman or sophomore years. Principles of Biology I and II (BIOL 100 and 102) are routinely selected by non-science students to satisfy this requirement. In addition to non-science students, however, BIOL 100/102 are also taken by biology and other natural science majors as well as by post-baccalaureates fulfilling prerequisites for graduate and medical school. This module was developed for BIOL 102 to introduce the field of bioinformatics to this heterogeneous population, with the knowledge that this is likely to be the only exposure to bioinformatics the non-science students in the course would experience.
Profile of Students in BIOL 102 (Spring 2010)
Post-baccalaureates made up ∼25% of the 453 students in BIOL 102 in spring 2010. Undergraduate enrollment that term (∼75%) was 17% freshmen, 28% sophomores, 24% juniors, and 6% seniors. Of these undergraduates, ∼20% were intending to major in biology, 18% psychology, 10% biochemistry, 9% medical lab sciences, 8% pre-med, 2% pre-health sciences, and 2% chemistry.²
Fifty-seven percent of the students had been exposed to the field of bioinformatics in previous coursework, likely BIOL 100;³ 10% had heard of it but had never performed any exercises; and 26% had never heard of bioinformatics.
The Assignment: General Description
The exercise introduces BLAST (Basic Local Alignment Search Tool; Altschul et al., 1990), one of the most widely used bioinformatics programs. Casey (2005) states that a “testament to the broad applicability of BLAST in the bioinformatics community” is the fact that the original paper (Altschul et al., 1990) “was one of the most highly cited papers published in scientific literature in the 1990s.”
The exercise, which begins on paper and progresses to the computer, was designed for a single 90-minute laboratory period. It applies basic molecular biology but assumes no previous knowledge of bioinformatics. Students do not have access to the entirety of the module at the beginning of the lab session; instead they complete the exercise in three parts, receiving successive installments from their instructors upon completion of preceding segments. Guiding the inquiry by parceling out information allows for gradual development of ideas and promotes collaborative exploration as students examine questions and formulate answers without the aid of search engines.
The paper portion of the exercise asks each student to locate a 30-character segment within a human DNA sequence (the identity of which is withheld) of As, Ts, Gs, and Cs, ∼3300 characters long. Scanning the sequence by eye affords students a small degree of insight into BLAST’s power, in that it takes many students 10–15 minutes to complete the scan, even for such a limited sequence. Two different segments are used so that each student within a pair has an opportunity to search for their own segment. In practice, a natural competition arose within each pair to see who could locate their segment first, which immediately engaged students in the exercise.
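For contrast with the by-eye scan, the same exact-match search takes a computer a fraction of a second. The sketch below is illustrative only: the function name `find_segment` and the toy sequence are assumptions made for this example and are not part of the module, and exact matching is much simpler than the similarity search that BLAST performs.

```python
def find_segment(sequence: str, segment: str) -> int:
    """Return the 1-based position of the first exact match, or 0 if absent."""
    # Strip whitespace so a sequence copied from a printed page still works.
    cleaned = "".join(sequence.split()).upper()
    target = "".join(segment.split()).upper()
    # str.find returns -1 when there is no match, so "not found" becomes 0.
    return cleaned.find(target) + 1


# Toy data (hypothetical; not the sequence handed to students).
toy_sequence = "ACGT ACGG TTGA CCAT"
print(find_segment(toy_sequence, "GGTTGA"))
```

An exact search like this fails as soon as the query differs from the target by a single base, which is one reason a similarity-based tool is needed for real database searches.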
Following completion of the paper portion, students continued to work in pairs to complete the computer portion of the exercise. With 12 computers for 24 students, the pairing was practical but not a necessity, given that the module could have been run twice during the 3-hour lab period. Because pairs of students collaborated and cooperated in answering the worksheet questions during the first implementation of the module, students were again paired off in subsequent semesters to encourage group learning. The computer portion acquaints students with GenBank and instructs them to apply BLAST to search GenBank for a match against an unknown DNA sequence.⁴ Students test fragments of differing lengths and sequence complexity as an introduction to the limitations of BLAST and are then asked questions about the DNA match returned by the search. Finally, students widen their search to compare their sequence against additional nucleotide databases and answer questions about their findings.
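One way to make "sequence complexity" concrete is to count how many distinct short words a fragment contains. The sketch below is a crude stand-in, assuming a hypothetical helper `kmer_diversity`; it is not the low-complexity filter that BLAST itself applies, only an illustration of why a repeat-rich fragment carries less information than a mixed one of the same length.

```python
def kmer_diversity(sequence: str, k: int = 3) -> float:
    """Fraction of overlapping k-mers that are distinct (crude complexity score)."""
    seq = "".join(sequence.split()).upper()
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0  # sequence shorter than k
    return len(set(kmers)) / len(kmers)


# Hypothetical repeat-rich fragment: the same trinucleotide ten times.
repetitive = "CGG" * 10
# A 30-nucleotide excerpt taken from the sequence printed in the student handout.
mixed = "ATGGAGGAGCTGGTGGTGGAAGTGCGGGGC"

print(kmer_diversity(repetitive), kmer_diversity(mixed))
```

A lower score means the fragment is dominated by a few repeated words, so a match to it is less informative about where in a genome it came from.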
Graduate students and post-baccalaureates serve as the laboratory instructors for BIOL 102. With each lab section accommodating up to 24 students, as many as 15 instructors may be on staff at any one time. With a goal of uniformity of rigor across the sections, the module developers, rather than the instructors, wrote both the postmodule and examination questions and provided comprehensive grading policies to standardize the scoring of the students’ responses on these assessment tools.
Assessment following the first implementation of the module in spring 2009 (n = 20) was encouraging;⁵ therefore, the scope of the assessment was expanded in 2010 to include sections taught by four different instructors (n = 78). The students fared very well on both assessment tools, averaging 92% on the questions administered during the lab period within minutes of completing the module (“Post_Module”) and 88% on an examination question measuring understanding and retention 1 week later (“Week_later”). As a cohort, they scored equivalently on the two assignments (Wilcoxon test, P = 0.34; Figure 1). Thus, students as a cohort seemed to retain the information they learned from performing the module and apply their newly acquired knowledge 1 week after the bioinformatics module was completed. Individually, however, students’ grasp of bioinformatics content weakened significantly after 1 week (paired Wilcoxon test, P = 0.0074). Although expected, the diminished retention suggests room for future enhancement and improvement in module design, implementation, or both.
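A cohort-level comparison and a paired comparison can disagree because pairing removes the large differences between students and looks only at each student's own change. The sketch below illustrates the paired case with an exact Wilcoxon signed-rank test written in plain Python. The function name `signed_rank_exact` and all scores are hypothetical; this is not the authors' analysis code or data, and the enumeration is practical only for small samples.

```python
from itertools import product


def signed_rank_exact(first, second):
    """Exact two-sided Wilcoxon signed-rank test for paired scores.

    Returns (W_plus, p_value). Enumerates all 2**n sign patterns, so it is
    suitable only for small n.
    """
    # Paired differences; pairs with no change are dropped.
    diffs = [b - a for a, b in zip(first, second) if b != a]
    n = len(diffs)
    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        average_rank = (start + end) / 2 + 1  # mean of 1-based ranks start+1..end+1
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = min(w_plus, total - w_plus)
    # Under the null hypothesis each difference is equally likely to be + or -.
    extreme = 0
    for signs in product((False, True), repeat=n):
        w = sum(r for r, positive in zip(ranks, signs) if positive)
        if min(w, total - w) <= observed:
            extreme += 1
    return w_plus, extreme / 2 ** n


# Hypothetical scores for six students (not data from this study).
post_module = [100, 90, 80, 70, 60, 50]
week_later = [97, 88, 76, 69, 55, 44]
print(signed_rank_exact(post_module, week_later))
```

In this made-up data set the group averages are close (75 vs. 71.5) relative to the spread between students, yet every student's score drops, which is the kind of consistent within-student change a paired test is designed to detect.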
Exposure to computational and quantitative biology is an important part of a comprehensive science education. This bioinformatics module, accessible to a broad range of students with diverse interests and preparation, can be incorporated into an introductory biology course of any size. Upon completion of this module, students demonstrated a facility with BLAST and professed a “modest understanding of the field of bioinformatics” when polled by an exit questionnaire. We believe that this module may familiarize students with a valuable and increasingly important area within the field of biology, preparing them for future coursework and careers.
Understand and be able to read GenBank files.
Understand and be able to use the BLAST program.
The sequence on the following page is an unknown human sequence. You are given the task of identifying the location of this sequence within the human genome. The problem is that the human genome is made up of 3 billion base pairs (bp). To check even 1000 bp by eye in search of this sequence is quite time consuming (as you will find out shortly). Imagine if you had to check a billion nucleotides for this sequence. (Could any amount of extra credit make that task worthwhile?!) One lab partner will be scanning the 3360-bp sequence in search of the location of the following short stretch of nucleotides:
The second lab partner should scan that same sequence in search of the location of the following short stretch of nucleotides:
Please note the time at the beginning of your search and answer the following questions once you have located your sequence.
What technique did you employ to find your sequence? Describe your method.
How long did it take for you to find your sequence?
ACGGCGAGCGCGGGCGGCGGCGGTGACGGAGGCGCCGCTGCCAGGG GGCGTGCGGCAGCGCGGCGGCGGCGGCGGCGGCGGCGGCGGCGGAG GCGGCGGCGGCGGCGGCGGCGGCGGCGGCTGGGCCTCGAGCGCCCG CAGCCCACCTCTCGGGGGCGGGCTCCCGGCGCTAGCAGGGCTGAAG AGAAGATGGAGGAGCTGGTGGTGGAAGTGCGGGGCTCCAATGGCGC TTTCTACAAGGCATTTGTAAAGGATGTTCATGAAGATTCAATAACA GTTGCATTTGAAAACAACTGGCAGCCTGATAGGCAGATTCCATTTC ATGATGTCAGATTCCCACCTCCTGTAGGTTATAATAAAGATATAAA TGAAAGTGATGAAGTTGAGGTGTATTCCAGAGCAAATGAAAAAGAG CCTTGCTGTTGGTGGTTAGCTAAAGTGAGGATGATAAAGGGTGAGT TTTATGTGATAGAATATGCAGCATGTGATGCAACTTACAATGAAAT TGTCACAATTGAACGTCTAAGATCTGTTAATCCCAACAAACCTGCC ACAAAAGATACTTTCCATAAGATCAAGCTGGATGTGCCAGAAGACT TACGGCAAATGTGTGCCAAAGAGGCGGCACATAAGGATTTTAAAAA GGCAGTTGGTGCCTTTTCTGTAACTTATGATCCAGAAAATTATCAG CTTGTCATTTTGTCCATCAATGAAGTCACCTCAAAGCGAGCACATA TGCTGATTGACATGCACTTTCGGAGTCTGCGCACTAAGTTGTCTCT GATAATGAGAAATGAAGAAGCTAGTAAGCAGCTGGAGAGTTCAAGG CAGCTTGCCTCGAGATTTCATGAACAGTTTATCGTAAGAGAAGATC TGATGGGTCTAGCTATTGGTACTCATGGTGCTAATATTCAGCAAGC TAGAAAAGTACCTGGGGTCACTGCTATTGATCTAGATGAAGATACC TGCACATTTCATATTTATGGAGAGGATCAGGATGCAGTGAAAAAAG CTAGAAGCTTTCTCGAATTTGCTGAAGATGTAATACAAGTTCCAAG GAACTTAGTAGGCAAAGTAATAGGAAAAAATGGAAAGCTGATTCAG GAGATTGTGGACAAGTCAGGAGTTGTGAGGGTGAGGATTGAGGCTG AAAATGAGAAAAATGTTCCACAAGAAGAGGAAATTATGCCACCAAA TTCCCTTCCTTCCAATAATTCAAGGGTTGGACCTAATGCCCCAGAA GAAAAAAAACATTTAGATATAAAGGAAAACAGCACCCATTTTTCTC AACCTAACAGTACAAAAGTCCAGAGGGTGTTAGTGGCTTCATCAGT TGTAGCAGGGGAATCCCAGAAACCTGAACTCAAGGCTTGGCAGGGT ATGGTACCATTTGTTTTTGTGGGAACAAAGGACAGCATCGCTAATG CCACTGTTCTTTTGGATTATCACCTGAACTATTTAAAGGAAGTAGA CCAGTTGCGTTTGGAGAGATTACAAATTGATGAGCAGTTGCGACAG ATTGGAGCTAGTTCTAGACCACCACCAAATCGTACAGATAAGGAAA AAAGCTATGTGACTGATGATGGTCAAGGAATGGGTCGAGGTAGTAG ACCTTACAGAAATAGGGGGCACGGCAGACGCGGTCCTGGATATACT TCAGGAACTAATTCTGAAGCATCAAATGCTTCTGAAACAGAATCTG ACCACAGAGACGAACTCAGTGATTGGTCATTAGCTCCAACAGAGGA AGAGAGGGAGAGCTTCCTGCGCAGAGGAGACGGACGGCGGCGTGGA GGGGGAGGAAGAGGACAAGGAGGAAGAGGACGTGGAGGAGGCTTCA AAGGAAACGACGATCACTCCCGAACAGATAATCGTCCACGTAATCC AAGAGAGGCTAAAGGAAGAACAACAGATGGATCCCTTCAGATCAGA 
GTTGACTGCAATAATGAAAGGAGTGTCCACACTAAAACATTACAGA ATACCTCCAGTGAAGGTAGTCGGCTGCGCACGGGTAAAGATCGTAA CCAGAAGAAAGAGAAGCCAGACAGCGTGGATGGTCAGCAACCACTC GTGAATGGAGTACCCTAAACTGCATAATTCTGAAGTTATATTTCCT ATACCATTTCCGTAATTCTTATTCCATATTAGAAAACTTTGTTAGG CCAAAGACAAATAGTAGGCAAGATGGCACAGGGCATGAAATGAACA CAAATTATGCTAAGAATTTTTTATTTTTTGGTATTGGCCATAAGCA ACAATTTTCAGATTTGCACAAAAAGATACCTTAAAATTTGAAACAT TGCTTTTAAAACTACTTAGCACTTCAGGGCAGATTTTAGTTTTATT TTCTAAAGTACTGAGCAGTGATATTCTTTGTTAATTTGGACCATTT TCCTGCATTGGGTGATCATTCACCAGTACATTCTCAGTTTTTCTTA ATATATAGCATTTATGGTAATCATATTAGACTTCTGTTTTCAATCT CGTATAGAAGTCTTCATGAAATGCTATGTCATTTCATGTCCTGTGT CAGTTTATGTTTTGGTCCACTTTTCCAGTATTTTAGTGGACCCTGA AATGTGTGTGATGTGACATTTGTCATTTTCATTAGCAAAAAAAGTT GTATGATCTGTGCCTTTTTTATATCTTGGCAGGTAGGAATATTATA TTTGGATGCAGAGTTCAGGGAAGATAAGTTGGAAACACTAAATGTT AAAGATGTAGCAAACCCTGTCAAACATTAGTACTTTATAGAAGAAT GCATGCTTTCCATATTTTTTTCCTTACATAAACATCAGGTTAGGCA GTATAAAGAATAGGACTTGTTTTTGTTTTTGTTTTGTTGCACTGAA GTTTGATAAATAGTGTTATTGAGAGAGATGTGTAATTTTTCTGTA TAGACAGGAGAAGAAAGAACTATCTTCATCTGAGAGAGGCTAAAA TGTTTTCAGCTAGGAACAAATCTTCCTGGTCGAAAGTTAGTAGGA TATGCCTGCTCTTTGGCCTGATGACCAATTTTAACTTAGAGCTTT TTTTTTTTAATTTTGTCTGCCCCAAGTTTTGTGAAATTTTTCATA TTTTAATTTCAAGCTTATTTTGGAGAGATAGGAAGGTCATTTCCA TGTATGCATAATAATCCTGCAAAGTACAGGTACTTTGTCTAAGAA ACATTGGAAGCAGGTTAAATGTTTTGTAAACTTTGAAATATATGG TCTAATGTTTAAGCAGAATTGGAAAAGACTAAGATCGGTTAACAA ATAACAACTTTTTTTTCTTTTTTTCTTTTGTTTTTTGAAGTGTTG GGGTTTGGTTTTGTTTTTTGAGTCTTTTTTTTTTAAGTGAAATTT ATTGAGGAAAAATA
To be given to students once sequences have been located:
We will now demonstrate the efficiency of using vast online databases and online search tools to locate and identify unknown nucleotide sequences. One such search tool is called BLAST (Basic Local Alignment Search Tool). This program compares a nucleotide (or protein) sequence of interest to online databases looking for regions of local similarity and calculates the statistical significance of matches. (One such online database is GenBank. GenBank contains the sequences of at least three full-length human genomes and is free to the public.) Finding sequences (of known function) in a database that have similarity to your sequence of interest may allow you to identify the gene family to which your sequence belongs or the functional significance of your sequence, if any. You will use a BLAST search to uncover information about an unknown sequence. The unknown sequence is saved on your computer in a Word document entitled Bioinformatics-Module-BIOL102. Locate and open the document.
Go to the NCBI BLAST website at http://www.ncbi.nlm.nih.gov/BLAST/.
Click the link (on left-hand side) “nucleotide blast.”
Copy the first line of the sequence (∼70 nucleotides) in the Word document and paste it in the “Enter Query Sequence” box.
Scroll to the bottom of the page and click “BLAST” in the left-hand corner. Wait for results. Did your sequence find any matches in the human genome database? Propose a reason for this result.
Now try a longer sequence. Copy the first three lines (∼210 nucleotides), paste them into the “Enter Query Sequence” box, and click “BLAST” again. Did your sequence find any matches in the human genome database? If so, what match did it locate?
Next copy one line (∼70 nucleotides) that is roughly in the middle of the provided sequence, paste it into the “Enter Query Sequence” box, and run the BLAST search again. Did you get a result this time? Propose a reason why this one line yielded a different result than the one line at the beginning of the sequence.
Click on the first of the matches that your search yielded. This match should be with a sequence within GenBank. What is the name of this gene? What is the chromosomal location of this gene?
To be handed out at the end of the exercise:
The expression of a eukaryotic gene is controlled by the gene’s regulatory regions. The regulatory regions include the gene’s promoter, which binds RNA polymerase once the transcription factors have bound the DNA and made that site accessible, and one or more enhancers that also bind transcription factors and contribute to the control of gene expression. Normal expression of a gene can be affected if one of its regulatory regions sustains a mutation. This mutation may be of immense significance, even if the change involves a single base substitution, since a transcription factor’s recognition of the site is sequence specific. Mutations may involve more substantial changes to the gene’s regulatory regions, such as multiple nucleotide deletions, or, as in the case of the gene under study today, multiple nucleotide additions. These nucleotide additions may alter the methylation pattern of the promoter, and an increase in methylation may result in the silencing of this gene.
The gene you searched for today codes for the fragile-X mental retardation protein (FMRP). The promoter of this gene contains a variable number of the trinucleotide repeat CGG. Individuals with no disease (normal phenotype) have promoters containing <60 CGG repeats. Individuals whose promoters contain 60–200 CGG repeats are said to possess a “premutation” that renders them susceptible to movement problems (ataxia) later in life. Individuals whose promoters have >200 CGG repeats are afflicted with Fragile-X syndrome and display a wide range of symptoms, including mental retardation and large testes. FMRP is involved in ferrying RNA transcripts to polyribosomes located at sites of protein synthesis. In neurons these sites include the terminals of axons. Loss of expression of FMRP has far-reaching consequences for the individual.
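The repeat thresholds above lend themselves to a short computational check. The following is a minimal sketch, assuming hypothetical helper names `longest_cgg_run` and `classify_repeats`; it uses only the cutoffs stated in this handout (<60, 60–200, >200), counts uninterrupted CGG runs only, and is not a diagnostic method.

```python
import re


def longest_cgg_run(sequence: str) -> int:
    """Number of repeats in the longest uninterrupted run of CGG."""
    seq = "".join(sequence.split()).upper()
    # Each match is one maximal stretch of back-to-back CGG triplets.
    runs = re.findall(r"(?:CGG)+", seq)
    return max((len(run) // 3 for run in runs), default=0)


def classify_repeats(n_repeats: int) -> str:
    """Map a repeat count onto the categories described in the handout."""
    if n_repeats > 200:
        return "full mutation (Fragile-X syndrome)"
    if n_repeats >= 60:
        return "premutation"
    return "normal"


# Hypothetical fragment: a run of 5 repeats, an interruption, then 2 more.
fragment = "AT" + "CGG" * 5 + "TA" + "CGG" * 2
count = longest_cgg_run(fragment)
print(count, classify_repeats(count))
```

Because a run broken by any other triplet is counted as separate runs, this sketch can report fewer repeats than a count that tolerates interruptions.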
Postmodule questions (first assignment used for assessment):
Look at the sequence you searched using the BLAST program. Would you predict that this gene comes from a normal person, a person with a premutation, or a person afflicted with Fragile-X Syndrome? Explain your reasoning.
We used the default database when conducting our BLAST search. This database contains only human genome sequences. Imagine that the sequence you subjected to the BLAST search yielded no matches (regardless of the length of the sequence you entered into the Query box). What would you infer about that sequence?
What result would you predict if we searched that sequence against all known sequences? A database containing all known nucleotide sequences exists and is called “nucleotide collection (nr/nt).” This database can be found on the BLAST site under “Choose Search Set.” At “Database” you will see that the “Human Genome + transcript” is selected. Select “Others” instead and you will find that the “nucleotide collection (nr/nt)” database is automatically selected. Run your search against this vast database.
How do your results differ from the original search?
Describe the power of a BLAST search.
What do you see as possible limitations of a BLAST search?
BLAST is often nicknamed the Google of DNA search tools. Compare a BLAST search to a Google search and list one possible similarity and one possible difference.
Examination question (second assignment used for assessment):
You are given a sequence of DNA and told that it is human. You are asked to find out its identity and whether it has similarity to sequences in other organisms. Please describe the bioinformatics tool, the database, and the procedure you would use to find such information. Give two possible outcomes of your search.
This research received approval from the local Institutional Review Board (IRB protocol no. 10-03-132-4471-HCMain). This project was supported by a National Institutes of Health curricular improvement grant (no. 1 T36 GM07800).
¹ The bioinformatics module developed by Bednarski et al. (2005) was integrated into the third semester of a 3-semester introductory course. The students in that course have completed Organic Chemistry I and are taking Organic Chemistry II concurrently. Hunter's BIOL 102 has no chemistry prerequisite. Other published bioinformatics modules were developed for and integrated into upper-level undergraduate or graduate-level biology courses (Feig & Jabri, 2002; Honts, 2003; Kumar, 2005).
² Many students did not select any of the seven possible majors provided on the questionnaire.
³ Hunter College's BIOL 100 includes a bioinformatics module featuring ClustalW.
⁴ This DNA sequence is the same sequence the students just searched by eye. BLAST reveals that this sequence codes for the fragile-X mental retardation protein (FMRP).
⁵ One section (n = 20) was selected for assessment in spring 2009. The students earned an average of 82.5% on the postmodule questions and 80% on an exam question 1 week later.