In the spring of 2020, remote learning was implemented in schools throughout the world due to the pandemic of SARS-CoV-2, the novel coronavirus that causes the disease COVID-19. Thrust into online instruction, many science teachers scrambled during this transition, and classes were severely hampered by a lack of hands-on investigations involving critical thinking and problem-solving skills. In response to a need for online experimentation, bioinformatics lessons centered on SARS-CoV-2 were developed. This article presents a multipart bioinformatics lesson that allows students to (1) compare spike protein sequences from the database portal NCBI Virus, to investigate whether this protein would be a good target for a vaccine against COVID-19; and (2) create phylogenetic trees and demonstrate evolutionary relatedness of human coronaviruses. This lesson allows for instruction in molecular biology, virology, immunology, bioinformatics, and phylogenetics, as well as analysis of scientific data. It is appropriate for high school AP Biology and biotechnology courses and can be taught entirely online.

Introduction

The SARS-CoV-2 pandemic created a need for online high school instruction throughout the world. Major school districts, including San Diego Unified School District, Los Angeles Unified School District, and Chicago Public Schools, postponed the return to in-person classrooms and instead used online instruction in the fall of 2020 (Stieb, 2020). As long as states are seeing increasingly widespread case numbers, opening classrooms would be irresponsible and dangerous to students and staff (LaFrance, 2020). Here, we describe a multipart bioinformatics lesson that can easily be taught online, requiring only a computer and an internet connection. The lesson is appropriate for high school AP Biology and biotechnology courses.

“Would SARS-CoV-2 spike protein be a good choice as a target for vaccine development?”

Advanced biology and biotechnology classes require that students interact with current, relevant concepts and master the tools used to answer authentic questions in molecular biology. Most secondary advanced bioscience coursework standards include hands-on laboratory work with an emphasis on inquiry-based investigations. In the classroom, students can develop science practices by using tools such as pipettors, electrophoresis equipment, and thermal cyclers to better understand proteins and DNA. In addition to bench laboratory techniques, state standards require that bioinformatics be included in biotechnology curricula. Bioinformatics is a growing field and a widely used research tool that is applied across a broad spectrum of life sciences (Wefer, 2008; Machluf et al., 2017). In the current climate of the COVID-19 pandemic, remote classrooms can take advantage of lessons that use bioinformatics.

The lesson described here uses sequences that have been uploaded to GenBank (Mizrachi, 2002), a public database made available through the National Center for Biotechnology Information (NCBI) (Cooper et al., 2010). The resources and tools provided through GenBank and NCBI are used by most life science researchers and are essential to the understanding of current biological and health sciences. Using these resources, students compare the spike protein, a viral surface protein (Figure 1), from several related strains of coronaviruses to determine whether the SARS-CoV-2 spike protein would be a good target for a vaccine against COVID-19.

Figure 1.

Spike proteins (in red) on the outer surface of SARS-CoV-2.

Prior to this lesson, students should have some knowledge of how the immune system functions in response to both infections and vaccines. In preparation for this lesson, students can explore the mechanisms of different types of vaccines, such as whole pathogen vaccines (inactivated or live-attenuated), subunit vaccines, and nucleic acid vaccines (DNA or RNA). This activity focuses on the comparison of spike protein amino acid sequences using the multiple alignment and phylogenetic tools in DNA Subway (Hilgert et al., 2014). Students will retrieve protein sequences from NCBI and align them in DNA Subway, which shows the similarities and differences between the spike proteins of SARS-CoV-2 and other coronaviruses. The lesson begins with the comparison of spike proteins from the seven coronaviruses known to infect humans. Each subsequent part of the lesson narrows the comparison to more closely related viruses and, presumably, greater sequence similarity of spike protein (Figure 2). In the end, students will have tools to assess whether spike protein is a good target for a vaccine against COVID-19.

Figure 2.

Flowchart of lesson sequence alignments. Steps 1–4 of the lesson focus on more closely related viruses, and step 5 compares sequences from coronaviruses that have been circulating in the population for many years.

Materials for this lesson, as well as others, can be downloaded from the University of Arizona BIOTECH Project website (http://biotech.bio5.org/activities/covid19) at no cost for educators to use and adapt.

Bioinformatics Lesson – Developing a Vaccine against COVID-19

The goal of a vaccine is to safely create an immune response in an individual so that they are protected against an infectious disease or pathogen. An effective vaccine needs to elicit an immune response against a target that is an exposed part of the virus (a surface protein) and therefore visible to the immune system. The vaccine target must also be stable: a target that changes (mutates) readily can elude the immune system. For example, HIV, the virus that causes AIDS, mutates rapidly, so attempts to create a vaccine against it have been unsuccessful. Mutations have created ~200 different strains of influenza, which necessitates new vaccines every year.

Students will need a DNA Subway account and will be using the Blue Line (Determining Sequence Relationships) for this analysis (Figure 3). Students will analyze the sequences of the spike protein, which is on the outside (surface) of all coronaviruses and is therefore exposed to the immune system, thus fulfilling one of the requirements for an effective vaccine target. There are seven different coronaviruses known to infect humans:

  • SARS-CoV-2, which causes COVID-19 (Beta-CoV)

  • SARS-CoV, the original SARS from 2003 (Beta-CoV)

  • MERS (Middle East Respiratory Syndrome) outbreak in 2012 (Beta-CoV)

  • OC43, which causes common cold symptoms (Beta-CoV)

  • HKU1, which causes common cold symptoms (Beta-CoV)

  • 229E, which causes common cold symptoms (Alpha-CoV)

  • NL63, which causes common cold symptoms (Alpha-CoV)

Figure 3.

Select DNA Subway’s Blue Line for sequence analysis.

To address the question of whether the spike protein is a good target for a COVID-19 vaccine, students will conduct five alignments using the spike proteins of the other coronaviruses that infect humans and ascertain the sequence similarities by comparing the amino acid sequences.

Part 1: How Similar Are the Spike Proteins between All of the Seven Coronaviruses?

Section A – Identifying and copying the spike protein translations for all seven coronaviruses that infect humans

In order to compare the spike proteins of these seven coronaviruses, students conduct an internet search for NCBI Virus and click on the blue box Search by virus (Figure 4A). They can then input information by name or ID number, or select Up-to-date SARS-CoV-2 (Figure 4B).

Figure 4.

(A) NCBI Virus search window: search for SARS-CoV-2 under Search by virus. (B) Select Up-to-date SARS-CoV-2. Alternatively, type in taxid in search window; the taxid is 2697049.

The taxid for all SARS-CoV-2 sequences in the NCBI database is 2697049 (Figure 5). The first result(s) shown are called RefSeq (reference sequence). Note that the first SARS-CoV-2 genome sequence was added on January 13, 2020, and has the accession number NC_045512. Every gene sequence uploaded to NCBI is given a unique accession number that serves as a catalog number in the genomic archive GenBank. After the RefSeq, the SARS-CoV-2 sequences are listed in chronological order, starting with the most recent deposit. The number of deposits is growing daily, with a total of 61,835 on February 5, 2021. As a comparison, that number was 17,855 on August 31, 2020, when this paper was first submitted.

Figure 5.

Selection of NC_045512 will open the complete GenBank record for this viral sequence, which contains information about publications related to the sequence, amino acid sequences based on translations of open reading frames and genes, and the nucleotide sequence itself.

When NC_045512 is selected, a Nucleotide Details window opens, with attributes about that sequence, such as its length (Figure 5). Be sure the sequence you are looking at is the entire genome, approximately 30,000 base pairs.

The newly opened GenBank record lists the accession number, NC_045512, as well as the genome size in base pairs, and notes that the genome is linear single-stranded RNA. Scroll down the page to see the FEATURES section, which details the translated sequences of open reading frames (such as ORF1ab) and genes associated with the virus. At the very bottom of the page, the ORIGIN section provides the nucleotide sequence, which is presented as DNA even though this is an RNA genome. In the middle of the FEATURES section, students will find the translated amino acid sequence of the spike protein, labeled “S” for gene and “surface glycoprotein” for product (protein product) (Figure 6).
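
For classes that want to see how the FEATURES section is organized, the following Python sketch pulls a translation out of GenBank-style text. The record below is a shortened, made-up stand-in rather than the real NC_045512 entry, and the function name extract_translation is hypothetical. It assumes that the /gene and /translation qualifiers sit inside a CDS block, as in Figure 6.

```python
import re

# Shortened, illustrative stand-in for a GenBank record; NOT the real NC_045512 entry.
SAMPLE_RECORD = '''\
FEATURES             Location/Qualifiers
     CDS             1..30
                     /gene="X"
                     /product="example protein"
                     /translation="MKTAYIAKQR"
     CDS             40..90
                     /gene="S"
                     /product="surface glycoprotein"
                     /translation="MFVFLVLLPL
                     VSSQCVNLTT"
ORIGIN
'''

def extract_translation(record_text, gene):
    """Return the /translation of the first CDS whose /gene matches, or None."""
    # Feature keys (CDS, gene, ...) start after exactly five spaces;
    # qualifier lines are indented further, so they stay inside their block.
    blocks = re.split(r"\n {5}(?=\S)", record_text)
    for block in blocks:
        if not block.startswith("CDS"):
            continue
        if f'/gene="{gene}"' not in block:
            continue
        match = re.search(r'/translation="([^"]+)"', block)
        if match:
            # Long translations wrap across lines; strip the whitespace.
            return re.sub(r"\s+", "", match.group(1))
    return None

spike = extract_translation(SAMPLE_RECORD, "S")
print(spike)
```

Copying the sequence by hand from the GenBank page, as described in the lesson, gives the same result; the sketch only illustrates how the record is laid out.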

Figure 6.

Translated amino acid sequence for spike protein, from GenBank, accession no. NC_045512.

Students will copy the entire translated amino acid sequence from MFV…HYT, and paste this sequence into a Word document and save it in FASTA format, in which each unique sequence is identified with a > followed by a defining name. For example:

>refseq_SARS-CoV-2_NC_045512_2020Jan13

MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNS…HYT

Save the SARS-CoV-2 sequence in FASTA format, which renders it ready for DNA Subway analysis. Add the other six coronaviruses’ spike protein sequences in FASTA format to the Word document. To add the sequence from the original SARS, from the 2003 outbreak, search SARS in NCBI Virus and select the first result, Severe acute respiratory syndrome-related CoV, taxid:694009 (Figure 7). On February 5, 2021, there were 63,342 results, including both the original SARS-CoV and the current SARS-CoV-2; almost 62,000 of these were SARS-CoV-2 sequences. The first result is the SARS-CoV-2 reference sequence and the second is the 2003 SARS reference sequence (accession NC_004718). Click the accession number, copy the spike protein translated sequence, and add the FASTA format sequence to your Word document. Be sure to start each new virus with a > followed by an identifying name.
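
FASTA is plain text, so the same file can also be produced with a short script. The Python sketch below writes FASTA records and reads them back. The helper names and the line width are assumptions for illustration, and the sequence is only the first residues shown above, not the full spike protein.

```python
def to_fasta(records, width=60):
    """Format {name: sequence} as FASTA text, wrapping long sequence lines."""
    lines = []
    for name, seq in records.items():
        lines.append(f">{name}")
        for i in range(0, len(seq), width):
            lines.append(seq[i:i + width])
    return "\n".join(lines) + "\n"

def parse_fasta(text):
    """Parse FASTA text back into {name: sequence}."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records

# Truncated example: only the first residues of the spike protein.
records = {"refseq_SARS-CoV-2_NC_045512_2020Jan13": "MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNS"}
fasta_text = to_fasta(records, width=20)
print(fasta_text)
```

Reading the text back with parse_fasta is a quick way to confirm that every sequence has its own > line before pasting the file into DNA Subway.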

Figure 7.

(A) Results for SARS-CoV search, which includes all the results for SARS-CoV-2. (B) The second result is SARS-CoV from 2003, accession no. NC_004718.

The sequence for the other five viruses can be found using the same instructions. MERS taxid is 1335626, OC43 taxid is 31631, HKU1 taxid is 290028, 229E taxid is 11137, and NL63 taxid is 277944. The FASTA-formatted spike protein translations for all seven coronaviruses that infect humans can then be added to DNA Subway for alignment of the sequences.
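
As a bookkeeping aid, the taxids and genus labels given in this lesson can be kept in a small lookup table. The Python sketch below is one possible layout; the dictionary and function names are hypothetical.

```python
# Taxids and genus labels as listed in this lesson.
# Note: 694009 is the SARS-related CoV taxid, whose results also include SARS-CoV-2.
HUMAN_CORONAVIRUSES = {
    "SARS-CoV-2": {"taxid": 2697049, "genus": "Beta-CoV"},
    "SARS-CoV": {"taxid": 694009, "genus": "Beta-CoV"},
    "MERS": {"taxid": 1335626, "genus": "Beta-CoV"},
    "OC43": {"taxid": 31631, "genus": "Beta-CoV"},
    "HKU1": {"taxid": 290028, "genus": "Beta-CoV"},
    "229E": {"taxid": 11137, "genus": "Alpha-CoV"},
    "NL63": {"taxid": 277944, "genus": "Alpha-CoV"},
}

def viruses_in_genus(genus):
    """Names of the listed viruses that belong to the given genus."""
    return [name for name, info in HUMAN_CORONAVIRUSES.items() if info["genus"] == genus]

def still_needed(collected):
    """Names of viruses whose spike sequence has not been collected yet."""
    return [name for name in HUMAN_CORONAVIRUSES if name not in collected]

print(viruses_in_genus("Beta-CoV"))
print(still_needed({"SARS-CoV-2", "SARS-CoV"}))
```

The genus filter is also a convenient way to pick out the five Beta-CoVs when a later comparison is limited to that group.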

Section B – Aligning coronavirus spike protein sequences

Open DNA Subway, Blue Line (Figure 8). Select project type protein, select enter sequences in FASTA format, and copy your seven FASTA-formatted sequences from your Word document and paste them in the window (or upload the file). Give your project a name and continue.

Figure 8.

Use the Blue Line of DNA Subway to Determine Sequence Relationships, select protein for Project Type, and enter FASTA-formatted sequences in the window under Select Sequence Source.

View the sequences in sequence viewer. Align the sequences using MUSCLE, selecting all of the sequences to align the seven CoVs. A barcode alignment will be displayed (Figure 9) and compared to the “consensus sequence.” The consensus sequence is somewhat arbitrary, in that it represents whichever amino acid is found at that position most often out of the sequences you have selected. One additional sequence could significantly change the consensus. Students can look at the differences between the CoV spike protein sequences and determine whether there are any patterns of similarities between any of the viruses. Select SEQUENCE SIMILARITY % to see the similarity between any two sequences.
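
To make the consensus and similarity ideas concrete, here is a minimal Python sketch. It assumes the sequences are already aligned to equal length with "-" marking gaps, uses made-up toy sequences rather than real spike data, and computes percent identity over ungapped columns only. DNA Subway's own calculation may differ.

```python
from collections import Counter
from itertools import combinations

def consensus(aligned):
    """Most common character in each alignment column (ties: first seen wins)."""
    columns = zip(*aligned.values())
    return "".join(Counter(col).most_common(1)[0][0] for col in columns)

def percent_identity(a, b):
    """Percent of columns where both sequences carry the same residue.
    Columns with a gap in either sequence are skipped (an assumption)."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)

# Toy aligned sequences, made up for illustration (not real spike data).
aligned = {
    "virus_A": "MFVFLVLLPL",
    "virus_B": "MFIFLVLLPL",
    "virus_C": "MFIFLLFLTL",
}

print("consensus:", consensus(aligned))
for first, second in combinations(aligned, 2):
    print(first, second, percent_identity(aligned[first], aligned[second]))
```

Because the consensus is a simple per-column vote, adding or removing a sequence from the dictionary can change it, which is the point made in the text about the consensus being somewhat arbitrary.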

Figure 9.

Sequence alignment of the seven human coronaviruses.

Section C – Building a phylogenetic tree

To determine evolutionary relationships between different groups, the analysis requires a root, or outgroup. This is an organism that is closely related to the study group but not part of it. Students can choose a variety of viruses as the outgroup, but a virus with a spike protein would allow the analysis to be a direct comparison to the sequences in the investigation. There are many coronaviruses that do not infect humans; choosing one in another genus would provide a good outgroup for evolutionary comparison of the seven coronaviruses that infect humans. Any Gamma-CoV would be a good choice to serve as a rooting outgroup for the phylogenetic tree. Go back to NCBI Virus, Search by virus and type in Gamma-CoV, then select Gamma-CoV, taxid:694013. Copy the spike protein sequence and add it in FASTA format to the DNA Subway project. Select the data, including the Gamma-CoV, align with MUSCLE, and then build a tree using PHYLIP ML (Figure 10).

Figure 10.

Phylogenetic tree of the seven coronaviruses that infect humans.

Students will be able to assess the evolutionary relationships of the viruses based on the spike protein. The length of a tree branch represents the number of amino acid differences between the sequences, which can be used as a proxy for time since the most recent common ancestor of the viruses. Students should be able to determine that the two SARS-CoVs have a common ancestor in the recent past, as do the two “common cold” Beta-CoVs. Although this tree places MERS closer to the SARS viruses, the common ancestor of these three is almost as distant as the common ancestor of all of the beta coronaviruses. Given the limitations of these data, the question of whether the beta coronaviruses share a more recent ancestor with the alpha coronaviruses or with the gamma coronavirus is left unresolved.

The next two parts of the lesson will utilize the sequences added during part 1.

Part 2: How Similar Are the Five Beta-CoVs’ Spike Proteins?

Using the sequences in DNA Subway, select all the Beta-CoVs (OC43, HKU1, MERS, SARS, and SARS-CoV-2) and align using MUSCLE (Figure 11). Students will be able to look for similarities in the alignment. They can begin to look at the regions of similarity; regions that match the consensus sequence are shown as whitish bands. In the end, these regions could be important targets for vaccine development.

Figure 11.

Alignment of spike protein sequences for five beta coronaviruses that infect humans.

Students will be able to build a phylogenetic tree using the Gamma-CoV as the outgroup and compare the tree with all seven human coronaviruses to the tree with just the five beta coronaviruses. This analysis may allow them to see that addition or removal of data can change the representation of the relationships. Another way to demonstrate this effect is to build the tree with different outgroups.

Part 3: How Similar Are the Two SARS-CoV (SARS-CoV, SARS-CoV-2) Spike Proteins to Each Other?

Select the sequences for SARS-CoV and SARS-CoV-2 and use MUSCLE for the alignment. Students will notice an even greater amount of similarity between these two sequences, getting them even closer to finding regions of the spike protein that could be a good target for a vaccine. However, even these two sequences have significant differences.

Part 4: What Is the Similarity among Different SARS-CoV-2 Spike Proteins?

Students will now have the experience to understand the importance of similarity in alignment of sequences. Our question as to whether the spike protein could be a good target for a vaccine can be further addressed by analyzing different SARS-CoV-2 spike proteins and seeing if mutations have accumulated in these sequences. To do this, students will need to return to NCBI Virus and search for several SARS-CoV-2 sequences. Search for the virus by typing in SARS-CoV-2 (taxid is 2697049) in the NCBI Virus search window (see instructions above). Select at least 10 other SARS-CoV-2 spike protein sequences to add to the DNA Subway project to compare with your reference sequence (RefSeq), which was collected in December 2019. Make sure to choose full-length SARS-CoV-2 sequences: look at the length of the sequence (top light blue row of results), and only select sequences with close to 30,000 nucleotides, which represents the entire SARS-CoV-2 genome.

To capture the scope of mutations or differences in SARS-CoV-2, select sequences based on time, particularly the date collected, and on location. Notice that after the RefSeq, the sequences are in order of most recently released. Some of those may be from a much earlier date; for example, the sequence with accession no. MW570899 was released on February 5, 2021, from a sample that was collected on August 19, 2020 (Figure 12). Scroll right in the results window to see the collection date.

Figure 12.

Up-to-date SARS-CoV-2 submissions to GenBank as of February 5, 2021.

Order your sequences based on collection date, selecting dates that span the time of the pandemic. Try also to select sequences from samples based on different geolocations (China, USA, Italy, etc.). The geolocation site and collection date are also available when you click on the accession number and open the Nucleotide Detail window.
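
The selection logic described above (keep near-full-length genomes, order by collection date, and spread the picks across the pandemic) can be sketched in Python. Every row below is a hypothetical placeholder rather than a real GenBank record, and the 29,000-nucleotide cutoff is an assumed reading of “close to 30,000.”

```python
from datetime import date

# Hypothetical rows standing in for an NCBI Virus results table.
# Accessions, lengths, dates, and locations are placeholders, not real records.
RESULTS = [
    {"accession": "EXAMPLE_01", "length": 29903, "collected": date(2020, 1, 5), "location": "China"},
    {"accession": "EXAMPLE_02", "length": 29880, "collected": date(2020, 3, 14), "location": "Italy"},
    {"accession": "EXAMPLE_03", "length": 1260, "collected": date(2020, 4, 2), "location": "USA"},
    {"accession": "EXAMPLE_04", "length": 29850, "collected": date(2020, 6, 20), "location": "USA"},
    {"accession": "EXAMPLE_05", "length": 29899, "collected": date(2020, 8, 19), "location": "USA"},
    {"accession": "EXAMPLE_06", "length": 29782, "collected": date(2021, 1, 10), "location": "Italy"},
]

def pick_spread(rows, how_many, min_length=29000):
    """Keep near-full-length genomes, sort them by collection date, and take
    evenly spaced rows so the picks span the whole date range."""
    full = sorted((r for r in rows if r["length"] >= min_length),
                  key=lambda r: r["collected"])
    if how_many >= len(full):
        return full
    if how_many == 1:
        return full[:1]
    step = (len(full) - 1) / (how_many - 1)
    return [full[round(i * step)] for i in range(how_many)]

picks = pick_spread(RESULTS, 3)
for row in picks:
    print(row["accession"], row["collected"], row["location"])
```

Spreading picks by date does not guarantee a mix of locations, so it is still worth checking the geolocation column by eye and swapping in a sample from another region where needed.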

As an alternative to copying and pasting the translated sequences of the spike protein, you can import the sequence directly into DNA Subway using the protein ID (Figure 13A). Select Upload Data (Figure 13B) and enter the protein ID from GenBank in the Import sequence from GenBank window in DNA Subway (Figure 13C).

Figure 13.

(A) Unique protein ID for spike protein; each GenBank submission will have a unique protein ID for each protein. (B) Use Upload Data within the project. (C) Type protein ID in window associated with Import sequence from GenBank.

At this point, align only SARS-CoV-2 sequences to compare these samples. Students can visually see the similarities between the sequences. They can also select SEQUENCE SIMILARITY % for a numerical representation of the sequence similarity. Students should investigate mutations that have accrued in the spike protein, and determine whether any mutation(s) have become established and assess whether these mutations could affect the efficacy of a vaccine.
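
One way to tabulate mutations is to list each sample's substitutions relative to the reference and then count how many samples share each one. The Python sketch below assumes pre-aligned, equal-length sequences with "-" for gaps. The sequences are made-up toy data and the function name is hypothetical. Insertions and deletions are ignored here.

```python
from collections import Counter

def substitutions(reference, sample):
    """List substitutions in an aligned sample relative to the reference, written
    as reference residue + position + sample residue. Positions count reference
    residues only, so gaps in the reference do not shift the numbering."""
    changes = []
    position = 0
    for ref_aa, new_aa in zip(reference, sample):
        if ref_aa != "-":
            position += 1
        if ref_aa == "-" or new_aa == "-":
            continue  # insertions and deletions are skipped in this sketch
        if ref_aa != new_aa:
            changes.append(f"{ref_aa}{position}{new_aa}")
    return changes

# Toy aligned sequences, made up for illustration (not real spike data).
reference = "MFVFLVLLPL"
samples = {
    "sample_1": "MFVFLVLLPL",
    "sample_2": "MFVFLVFLPL",
    "sample_3": "MFIFLVFLPL",
}

counts = Counter()
for name, seq in samples.items():
    found = substitutions(reference, seq)
    counts.update(found)
    print(name, found)
print(counts.most_common())
```

A substitution that appears in many samples, especially those with later collection dates, is a candidate for a mutation that has become established.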

Part 5: What Is the Similarity among Different Spike Proteins from a Coronavirus That Has Been Around in the Human Population, Such as One That Causes the Common Cold?

Since SARS-CoV-2 is a new virus in the human population, not seeing many differences in spike proteins may simply indicate that there has not been enough time for mutations to accumulate. The common cold coronaviruses have been around for decades and may be a better tool to look at how rapidly this protein can accumulate mutations.

As in part 4, in the search window of NCBI Virus, type in either OC43 or HKU1. Find 10 sequences with divergent collection dates and align these sequences in DNA Subway. Compare the similarity of the sequences. Students can see how similar or different the spike protein of another coronavirus is and assess whether these differences are significant enough to evade the immune system or a vaccine. Students should consider whether there are large sections of similarity within the protein. These could be important targets for the immune system as part of the immune response, and therefore suitable for a vaccine developer to utilize.
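
To look for large sections of similarity more systematically, students could scan an alignment for runs of columns in which every sequence has the same residue. The Python sketch below assumes equal-length aligned sequences. The toy sequences are made up, and the minimum run length of 3 is an arbitrary choice for illustration.

```python
def conserved_regions(aligned, min_length=3):
    """Return (start, end, residues) for runs of columns where every sequence
    has the same residue; positions are 1-based alignment columns."""
    seqs = list(aligned)
    regions, start = [], None
    for i, column in enumerate(zip(*seqs)):
        same = len(set(column)) == 1 and column[0] != "-"
        if same and start is None:
            start = i  # a conserved run begins here
        elif not same and start is not None:
            if i - start >= min_length:
                regions.append((start + 1, i, seqs[0][start:i]))
            start = None
    # Close a run that extends to the end of the alignment.
    if start is not None and len(seqs[0]) - start >= min_length:
        regions.append((start + 1, len(seqs[0]), seqs[0][start:]))
    return regions

# Toy aligned sequences, made up for illustration (not real spike data).
toy_alignment = [
    "MFVFLVLLPLVS",
    "MFIFLVLLPLVS",
    "MFVFLVLLTLVS",
]
regions = conserved_regions(toy_alignment)
print(regions)
```

With real spike alignments a much larger minimum length would be more meaningful, and sequence conservation alone does not show that a region is exposed to the immune system.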

Conclusion

The main objective of this lesson is to introduce students to computational analysis of biological questions. In this case, the question is “Would SARS-CoV-2 spike protein be a good choice as a target for vaccine development?” Students will have the opportunity to learn about immunology and the development of different kinds of vaccines, as well as whether the spike protein is a good vaccine target candidate. Students will gain experience navigating NCBI’s GenBank genome submissions and manipulating these sequences in DNA Subway. Students will compare diverse coronaviruses and successively make the comparisons more specific, beginning with the seven CoVs that infect humans, then the five Beta-CoVs, then the two SARS-CoVs, and, finally, the amino acid sequences of spike protein from several SARS-CoV-2 samples. These selected data will be imported into an alignment tool in DNA Subway and compared to each other. In addition, DNA Subway offers students the opportunity to see the evolutionary relationships of the viruses in the form of phylogenetic trees. One of the highlights of this lesson is that the alignment and tree-building exercises can yield different results depending on the data used. In the end, students will be able to see that the SARS-CoV-2 spike protein fits the criteria of a part of the virus that is exposed and a protein that does not appear to be changing rapidly, and therefore offers the majority of the protein as potential targets for vaccine development.

Addendum

This lesson was written before the emergence of the B.1.1.7, B.1.351, B.1.429, B.1.427, CAL.20C, and P.1 variants. Most of the steps in this lesson focus on reference sequences, and analysis of the variants need not be included. In part 4, however, students could include the sequences of current variants as part of the exercise and analyze their alignments. Although our understanding of SARS-CoV-2 is changing every day, sequence alignment and the evaluation of mutations can still provide insight into the spike protein and this virus.

References

Cooper, P., Lipshultz, D., Matten, W., McGinnis, S., Pechous, S., Rommiti, M., et al. (2010). Education resources of the National Center for Biotechnology Information. Briefings in Bioinformatics, 2, 563–569.

Hilgert, U., McKay, S., Ghiban, C., Khalfan, M., Micklos, D. & Williams, J. (2014). DNA Subway: making genome analysis egalitarian. In Proceedings of the XSEDE 2014 Conference: Engaging Communities. Association for Computing Machinery.

LaFrance, A. (2020). This push to open schools is guaranteed to fail. Atlantic, August 2.

Machluf, Y., Galbart, H., Ben-Dor, S. & Yarden, A. (2017). Making authentic science accessible – the benefits and challenges of integrating bioinformatics into a high-school science curriculum. Briefings in Bioinformatics, 18, 145–159.

Mizrachi, I. (2002) [updated 2007]. GenBank: the nucleotide sequence database. In J. McEntyre & J. Ostell (Eds.), The NCBI Handbook (pp. 1–17). Bethesda, MD: National Center for Biotechnology Information.

Stieb, M. (2020). The major school districts that will remain online-only this fall. New York Magazine, August 4.

Wefer, S. (2008). Bioinformatics in high school biology curricula: a study of state science standards. CBE–Life Sciences Education, 7, 155–162.