Computer modeling and protein structure visualization tools are effective and engaging ways of presenting various molecular biology concepts to high school and college students. Here, we describe a series of activities and exercises that use online bioinformatics databases and programs to search for and obtain protein sequence and structure data and use it to build homology models of proteins. Exercises in homology modeling can serve the pedagogical purpose of introducing and illustrating the concept of homology within gene and protein families, which results in conservation of the 3D structures of proteins and allows us to predict structures when experimental data are not available.

Introduction of the concept of homology is critical to teaching evolution. On a molecular level, homology (i.e. the evolutionary relationship between two or more gene/protein sequences) is reflected by measurable similarities in their sequences and structures. Homologous proteins form families and have similar sequences. Namely, many of the amino acids in their sequences are identical or similar (in chemical character), and conserved motifs (segments of amino acids that are present in all the related proteins) can be revealed by sequence alignments. Protein structures, within families, tend to be even better conserved than the corresponding sequences. This conservation of 3D structure among homologs (proteins that share a common ancestor) constitutes the basis for homology (or comparative) modeling, one of the most commonly used and most successful methods of predicting protein structure. Homology modeling predicts a structure for a protein with known sequence but unknown structure (target) if the sequence and structure for a close homolog (template) are known. The main steps of homology modeling are finding template structure(s), alignment of target and template sequences, building the 3D structure for the target, loop and side chain modeling, optimization, and evaluation. Most of the current bioinformatic resources for homology modeling integrate and streamline these steps (Hameduh et al., 2020; Hati & Bhattacharyya, 2016). Homology modeling has helped bridge the gap between the large amount of sequence data and limited structure data currently available to us. It facilitated studies of protein dynamics or mutations (e.g., Ikemura et al., 2019). It has allowed for structure-based drug design for proteins that were/are challenging to structural biology methods such as X-ray crystallography or Nuclear Magnetic Resonance (NMR) (Leelananda & Lindert, 2016). 
In the classroom, homology modeling is a relatively easy, inexpensive, and engaging experience with protein modeling tools that illustrates vital concepts in molecular biology and genetics. Herein we describe a lesson plan in which students (1) learn about a selected protein family (its cellular functions and implications for human health), (2) identify one member of this family for which a structure is not available, (3) use bioinformatic databases to obtain the sequence of this protein, (4) use an online homology modeling program to predict the structure of this protein, and (5) use visualization software to analyze the predicted model.

The first step in implementing this activity in the classroom (or as a project) is to decide which protein or domain family to use, taking into account the protein domain’s function, size, availability of structural data, and availability of both ortholog and paralog sequences. Orthologs and paralogs are two types of homologs; the former are present in different species, while the latter arise within a single species through gene duplication (Koonin, 2005). Sequence and structure data can be obtained from public databases, such as UniProt (https://www.uniprot.org/) and the Protein Data Bank (PDB; https://www.rcsb.org/). We described using the PDB in lesson plans in our previous paper (Szarecka & Dobson, 2019). Homology modeling programs are also available online free of charge, but many servers limit the size of the proteins they will model. We recommend choosing sequences of up to 1000 amino acids for this activity. Depending on the protein sequence selected and the computational capacity of the server, results are typically returned within a few days; thus, the activity could be implemented as two class periods or as one class period plus homework.

Here we suggest using the adenosine deaminase family. Interesting connections to other information presented in class could include enzymes, metabolism, the purine nucleotide cycle, expression of different paralogs in different tissues, and a variety of mutations causing a broad range of disorders. For example, mutations in ADA genes are involved in immunodeficiency disorders, while those in AMPD genes affect muscle function and neurodevelopment. More information on ADA/AMPD mutations and related disorders can be found in UniProt records P00813 and P23109 (links and screenshots are provided in the Supplemental Material provided with the online version of this article), as well as in Whitmore and Gaspar (2016) and Hayes et al. (2013). The UniProt records provide a list of sequence variants that are interesting and convenient to discuss with students.

The family members can be found in the HGNC database (Figures 1 & 2; https://www.genenames.org/).

Figure 1.

HGNC database homepage. A search can be conducted directly or, more easily, using the Gene group reports tab.

Figure 2.

HGNC database, searching the Gene group reports.

Of the six family members, AMPD1, AMPD3, and ADAL are good candidates for homology modeling as the experimental structures are not available (Figure 3). Selecting an HGNC ID link for a gene provides valuable information and links to other databases, for example access to sequences, links to PDB (to find experimental structures if available), and to OMIM (to find information on genetic disorders and mutations linked to a particular gene).

Figure 3.

HGNC database. Adenosine deaminase gene family.

Teachers or students can also choose other protein families depending on their interests and connections to other parts of the class material. Good candidates may, for example, include serpins, histone deacetylases, and hydroxysteroid 17-beta dehydrogenases. For ideas and inspiration, we recommend the PDB Molecule of the Month collection, or browsing through the Gene group reports or the OMIM database (https://www.ncbi.nlm.nih.gov/omim) to look for proteins involved in genetic disorders. Specific protein sequences can be found by searching the UniProt database by gene or protein name. UniProt also indicates whether a given sequence is represented by a structure in the PDB.

The easiest way to find and download a protein sequence is through a search of the UniProt database. In this lesson plan, students can find the sequences either by a direct search of UniProt (Figure 4) or through a link provided in HGNC (by clicking on the AMPD3 link shown in Figure 3 and scrolling down to the link to UniProt). The UniProt accession number for AMPD3 is Q01432 (Figure 4, box 5). The sequence is obtained by scrolling down the UniProt entry to the protein sequence segment and downloading the sequence in FASTA format (Figure 5).
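For classes comfortable with a little scripting, the same download can also be done programmatically. The sketch below is illustrative only and is not part of the lesson plan as published: it assumes that UniProt serves FASTA files at a URL of the form https://rest.uniprot.org/uniprotkb/<accession>.fasta, and all function names are hypothetical.

```python
from urllib.request import urlopen


def uniprot_fasta_url(accession: str) -> str:
    """Build the download URL for one UniProt entry (assumed REST URL pattern)."""
    return f"https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def parse_fasta(text: str) -> dict[str, str]:
    """Return {header: sequence} for every record in a FASTA-formatted string."""
    records: dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            header = line[1:]  # header line without the leading '>'
            records[header] = ""
        elif header is not None:
            records[header] += line  # sequence may span several lines
    return records


def fetch_sequence(accession: str) -> str:
    """Download one entry and return its sequence (requires internet access)."""
    with urlopen(uniprot_fasta_url(accession)) as response:
        text = response.read().decode("utf-8")
    return next(iter(parse_fasta(text).values()))
```

Under these assumptions, `fetch_sequence("Q01432")` should return the AMPD3 sequence as a single string that can be pasted into a modeling server.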

Figure 4.

Searching UniProt Database. Results table shows the sequence accession number, e.g., Q01432 (box 5), protein name (box 6), organism and sequence length (boxes 2–3). Sequences can be selected (box 4) and aligned (box 1). An example of alignment is shown in the Supplemental Material included with the online version of this article.

Figure 5.

Obtaining protein sequence from a UniProt database entry file Q01432. “Sequence & Isoforms” segment of the file allows the user to download the sequence in the FASTA format (bottom panel) that can be copied and pasted in as input into bioinformatics programs.

An interesting addition to this part of the lesson would be for students to select the family members’ sequences (by checking the boxes on the left, by each entry, Figure 4, box 4) and perform a sequence alignment (Figure 4, box 1). Sequence alignment will reveal the similarities among the sequences within the family, conserved amino acids, and family motifs, as well as differences that occurred during evolution of various paralogs (an example of an alignment is shown in the Supplemental Material included with the online version of this article).
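To make the idea of conserved positions concrete, students can also scan an alignment themselves. The following is a minimal sketch that assumes the sequences have already been aligned (all rows of equal length, gaps written as "-"); the function names and the toy sequences in the usage note are invented for illustration and are not real ADA/AMPD data.

```python
def conserved_columns(aligned: list[str]) -> list[int]:
    """Return 0-based alignment columns where every sequence has the same residue."""
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    conserved = []
    for col in range(length):
        residues = {seq[col] for seq in aligned}
        # A column counts as conserved only if it holds one residue type and no gap.
        if len(residues) == 1 and "-" not in residues:
            conserved.append(col)
    return conserved


def consensus_line(aligned: list[str]) -> str:
    """Mark conserved columns with '*' and all other columns with a space."""
    if not aligned:
        return ""
    marks = set(conserved_columns(aligned))
    return "".join("*" if i in marks else " " for i in range(len(aligned[0])))
```

For the toy alignment `["MKSHE-LV", "MRSHEALV", "MKSHD-LV"]`, the conserved columns are 0, 2, 3, 6, and 7, and runs of adjacent conserved columns are candidates for a family motif.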

There are several servers that can be used to build homology models online free of charge. The Supplemental Material provided with the online version of this article includes a list of resources for homology model building and assessment. Here we suggest using Swiss-Model (Waterhouse et al., 2018) and/or I-TASSER (Yang & Zhang, 2015). Swiss-Model requires three steps: input of a protein sequence to model, search for templates, and building models (Figure 6). It is advisable that students enter their email addresses to receive notifications from the server as well as bookmark the runs if they need to step away from the project.

Figure 6.

Setting up a homology modeling run for human AMPD3. After pasting in the input sequence, select project title, enter email address, and select “Search For Templates.”

The template search is a critical step in the modeling process and, in this activity, a pedagogically important one. Prediction of a protein structure through homology modeling hinges on the availability of a homologous protein whose structure is available in the PDB (the template). Successful (i.e., reasonably accurate) prediction depends on this structural homolog having as high a sequence identity to the target, and as high-quality a structure, as possible. Sequence identity is a measure of similarity between two sequences; it is defined as the percentage of identical amino acids when the two sequences are aligned. Sequence similarity is a related term, defined as the combined percentage of identical and similar (in physico-chemical properties, such as polarity or size) amino acids. The Supplemental Material provided with the online version of this article shows an example of a sequence alignment with identical and similar amino acids marked.
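The two definitions above can be illustrated with a short calculation on a pair of already-aligned sequences. This is a sketch under stated assumptions: percentages are taken over columns where neither sequence has a gap (other conventions divide by the full alignment length), and the similarity groups below are a simplified, illustrative grouping rather than the substitution matrix an alignment program would use.

```python
# Illustrative grouping of amino acids by physico-chemical character (an assumption;
# alignment programs normally rely on a substitution matrix instead).
SIMILAR_GROUPS = [set("AVLIM"), set("FWY"), set("ST"), set("NQ"), set("DE"), set("KRH")]


def _aligned_pairs(a: str, b: str) -> list[tuple[str, str]]:
    """Pair up residues column by column, skipping columns that contain a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    return [(x, y) for x, y in zip(a.upper(), b.upper()) if x != "-" and y != "-"]


def _similar(x: str, y: str) -> bool:
    """Identical residues, or two residues from the same illustrative group."""
    return x == y or any(x in group and y in group for group in SIMILAR_GROUPS)


def percent_identity(a: str, b: str) -> float:
    """Percentage of gap-free columns holding identical residues."""
    pairs = _aligned_pairs(a, b)
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def percent_similarity(a: str, b: str) -> float:
    """Percentage of gap-free columns holding identical or similar residues."""
    pairs = _aligned_pairs(a, b)
    if not pairs:
        return 0.0
    return 100.0 * sum(_similar(x, y) for x, y in pairs) / len(pairs)
```

Because every identical pair also counts as similar, the similarity of two sequences is never lower than their identity.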

The threshold for sequence identity between two sequences (of lengths >100 aa) to consider the two proteins to be homologs is typically 30%. The accuracy of structure modeling increases as the sequence identity goes up (Hati & Bhattacharyya, 2016). For a scientist modeling a protein structure, examining the available templates, target-template sequence alignments, and “coverage” (does the template sequence and structure cover all or most of the target?) is vital. Swiss-Model searches for templates and provides the user with a summary of available templates and their suitability for modeling (Figure 7). Students can scroll through the results and note what sequences and structures have been identified and note the optimal template(s). They can also create various models and compare their structures and quality assessments. Toward the overarching goal of this activity, we would recommend discussing the following aspects: (a) evolution of protein sequences produces multiple lineages; homology is recognizable by a sufficient sequence identity among them and the presence of conserved residues and motifs even if other segments of the sequences diverged; and (b) while homologous proteins will have the same fold type and show similar secondary and tertiary structures, the accuracy of homology modeling depends critically on the level of sequence similarity (in our case ~50%).
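The reasoning used when choosing a template (discard candidates below the identity threshold, then prefer higher identity and coverage) can be written down as a small exercise. This is a teaching simplification and not Swiss-Model's actual ranking, which relies on the server's own quality estimates; the class and function names are hypothetical, and only the 2A3L values (50% identity, 79% coverage, as in Figure 7) come from this article.

```python
from dataclasses import dataclass


@dataclass
class Template:
    pdb_id: str
    identity: float  # percent sequence identity to the target
    coverage: float  # percent of the target sequence covered by the template


def coverage_percent(aligned_target_residues: int, target_length: int) -> float:
    """Share of the target sequence that is aligned to the template."""
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    return 100.0 * aligned_target_residues / target_length


def rank_templates(templates: list[Template], min_identity: float = 30.0) -> list[Template]:
    """Keep templates at or above the identity threshold, best first."""
    usable = [t for t in templates if t.identity >= min_identity]
    # Simplified ordering: higher identity first, coverage breaks ties.
    return sorted(usable, key=lambda t: (t.identity, t.coverage), reverse=True)


candidates = [
    Template("XXX1", 25.0, 90.0),  # made-up placeholder, below the 30% threshold
    Template("2A3L", 50.0, 79.0),  # values reported in Figure 7
    Template("XXX2", 35.0, 60.0),  # made-up placeholder
]
best = rank_templates(candidates)[0]
```

A useful discussion point is that the placeholder with the highest coverage is rejected because its identity falls below the threshold, so coverage alone does not make a good template.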

Figure 7.

Templates identified for AMPD3. The best template is PDB structure 2A3L with sequence identity of 50% and coverage of 79%. Click “Build Models” after selecting a template.

Predicted models can be viewed and downloaded from the server as shown in Figure 8.

Figure 8.

Homology model of AMPD3. Note that multiple models are typically created and compared. The teacher can choose whether to work with a single model or extend the project. The structure of the model should be downloaded as a PDB file, as highlighted.

Recently, great progress in protein structure prediction was made by an AI-based method called AlphaFold (Jumper et al., 2021). Many (but not yet all) sequences in UniProt have been modeled using this algorithm, and the predicted structures are available for download from the respective sequence entry files (e.g., here Q01432). We recommend that students search for protein structures predicted by AlphaFold through UniProt (Figure 9). Comparing the models they built with those predicted by AlphaFold could be an interesting part of the classroom modeling exercise. Note that predicted models may contain errors due to many factors, but an accuracy assessment is provided for every model, which is an interesting aspect to discuss with students. For example, if a model has a structural segment of low confidence, this segment would not be reliable for use in protein-ligand or protein-protein docking.
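AlphaFold model files store the per-residue confidence score (pLDDT, on a 0 to 100 scale) in the B-factor column of each ATOM record, so low-confidence segments can be listed with a short script. The following is a minimal sketch and not part of the lesson plan as published: the function names are hypothetical, the default cutoff of 70 is an assumption based on the commonly used boundary between low and confident predictions, and a single-chain file is presumed. Other servers may store a different kind of value in the same column.

```python
def ca_confidence(pdb_text: str) -> dict[int, float]:
    """Map residue number -> B-factor column value for every C-alpha ATOM record."""
    scores: dict[int, float] = {}
    for line in pdb_text.splitlines():
        # PDB files are fixed-width: atom name in columns 13-16,
        # residue number in columns 23-26, B-factor in columns 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[int(line[22:26])] = float(line[60:66])
    return scores


def low_confidence_residues(pdb_text: str, cutoff: float = 70.0) -> list[int]:
    """Residue numbers whose stored confidence value falls below the cutoff."""
    return sorted(num for num, score in ca_confidence(pdb_text).items() if score < cutoff)
```

Students could read a downloaded model with `open(path).read()`, pass the text to `low_confidence_residues`, and then check whether the flagged residues coincide with regions where the AlphaFold and Swiss-Model structures disagree.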

Figure 9.

Homology model of AMPD3 created using AlphaFold. The structure of the model can be downloaded using the icon as highlighted.

Similarity between two homologous protein structures is best assessed by structural superposition, which can be calculated by free programs such as VMD (Humphrey et al., 1996) or UCSF Chimera (Pettersen et al., 2004), the latter being easier to use. More information on protein visualization software can be obtained from Barber (2021) and from the Supplemental Material included with the online version of this article.

Students can import any PDB files (for example, the AlphaFold model and the Swiss-Model structure) into Chimera using File→Open and then superimpose them using Tools→Structure Comparison→MatchMaker. One of the structures can be highlighted as a reference, leaving all other settings at their defaults. Students can observe that Model 1 from Swiss-Model is a dimer (a complex of two protein chains); one of the chains can be visualized by selecting and hiding the other. The superposition of the template and the structure predicted by Swiss-Model is presented in Figure 10A. Figure 10B shows a superposition of the template structure and the one predicted by AlphaFold. As the students will observe, there are differences between the predicted models, and areas of low confidence should be considered carefully, for example when deciding whether to use any of these models in drug design projects. From the evolutionary perspective, a comparison of AMPD3 and ADA1 (Figure 10C) is valuable in showing that, while the two protein structures have diverged, the central core is similar between the two. Students can observe that the beta-sheets of the two proteins superimpose very well, with the Zn ion and Zn-binding residues occupying the same positions (Figure 10D), although the alpha-helices are not as well conserved and there are a number of structural segments unique to AMPD. This is consistent with their shared catalytic mechanism but distant evolutionary relationship (Maier et al., 2005). Further analysis of the superposition (Figures 10C and 10D) can be found in the Supplemental Material provided with the online version of this article.
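For students curious about what a superposition computes, the core calculation can be sketched with NumPy using the Kabsch algorithm, which finds the rigid-body rotation that minimizes the root-mean-square deviation (RMSD) between two sets of paired atoms. This is an illustration of the general technique and not a description of MatchMaker's internals; it assumes the atom pairing (for example, matched C-alpha atoms) has already been decided, and the function name is hypothetical.

```python
import numpy as np


def superpose_rmsd(mobile: np.ndarray, reference: np.ndarray) -> float:
    """RMSD (in the units of the input) after optimal rigid-body superposition."""
    if mobile.ndim != 2 or mobile.shape != reference.shape or mobile.shape[1] != 3:
        raise ValueError("both inputs must be N x 3 arrays of paired coordinates")
    # Remove translation by moving both centroids to the origin.
    p = mobile - mobile.mean(axis=0)
    q = reference - reference.mean(axis=0)
    # Singular value decomposition of the 3 x 3 covariance matrix.
    u, _, vt = np.linalg.svd(p.T @ q)
    # Guard against an improper rotation (a mirror image).
    d = np.sign(np.linalg.det(vt.T @ u.T))
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    # Rotate the mobile set onto the reference and measure what is left.
    diff = p @ rotation.T - q
    return float(np.sqrt((diff ** 2).sum() / len(p)))
```

An RMSD near zero means the paired atoms coincide after fitting. Restricting the input to the core beta-sheet residues versus all paired residues would be one way to quantify the observation that the core is better conserved than the helices.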

Figure 10.

(A) Structure of AMPD3 predicted by Swiss-Model overlaid with the homologous template 2A3L (AMPD2 from A. thaliana). (B) Structure of AMPD3 predicted by AlphaFold overlaid with 2A3L. (C) Superposition of 2A3L with human ADA1 (7RTG). (D) Comparison of the core beta-sheets in 2A3L (tan) and 7RTG (dark blue) with the Zn ion (shown as yellow/red balls) and Zn-binding residues (shown as sticks). Images A–C were created with UCSF Chimera, and D with VMD.

In our experience, computer modeling techniques are inexpensive and highly effective methods of engaging students and presenting a variety of topics in biology. Here, we emphasize the use of bioinformatic databases, familiarizing students with biomolecular sequence and structure data, discussing homology and evolutionary relationships among proteins (for example, the difference between orthologs and paralogs), and discussing gene families in the context of gene expression and disease-causing mutations, particularly using an enzyme family involved in diverse disorders from muscle fatigue to immunodeficiencies. Conservation of 3D protein structure among homologs is a powerful illustration of evolutionary relationships and the foundation of homology modeling as a method of structure prediction. The same protocol can be applied to other protein families and sequences, allowing for diverse projects. An additional benefit is that students become familiar with protein structure visualization software that can also be used for analysis of binding pockets, protein-ligand interactions, and mapping sequence alignments to structural data (examples are presented in the Supplemental Material provided with the online version of this article). This exercise can also be used to expand the lesson plan for computational drug design and protein-ligand docking that we described previously (Szarecka & Dobson, 2019). The activity proposed in our previous paper relies on the availability of an experimental protein structure that serves as a target for drug design; homology modeling (as described here) can be used when such a structure is not available.

Assessment of student learning can be focused on students’ ability to retrieve correct sequence data, create a high-confidence homology model, and discuss the structural features and similarities among predicted and experimental structures.

Barber, R. D. (2021). Software to visualize proteins and perform structural alignments. Current Protocols, 1(11), e292.

Hameduh, T., Haddad, Y., Adam, W., & Heger, Z. (2020). Homology modeling in the time of collective and artificial intelligence. Computational and Structural Biotechnology Journal, 18, 3494–3506.

Hati, S., & Bhattacharyya, S. (2016). Incorporating modeling and simulations in undergraduate biophysical chemistry course to promote understanding of structure-dynamics-function relationships in proteins. Biochemistry and Molecular Biology Education: A Bimonthly Publication of the International Union of Biochemistry and Molecular Biology, 44(2), 140–159.

Hayes, L. D., Houston, F. E., & Baker, J. S. (2013). Genetic predictors of adenosine monophosphate deaminase deficiency. Sports Medicine & Doping Studies, 3(02), 124.

Humphrey, W., Dalke, A., & Schulten, K. (1996). VMD - Visual molecular dynamics. Journal of Molecular Graphics, 14(1), 33–38.

Ikemura, S., Yasuda, H., Matsumoto, S., Kamada, M., Hamamoto, J., Masuzawa, K., Kobayashi, K., Manabe, T., Arai, D., Nakachi, I., Kawada, I., Ishioka, K., Nakamura, M., Namkoong, H., Naoki, K., Ono, F., Araki, M., Kanada, R., Ma, B., Hayashi, Y., . . . Soejima, K. (2019). Molecular dynamics simulation-guided drug sensitivity prediction for lung cancer with rare EGFR mutations. Proceedings of the National Academy of Sciences of the United States of America, 116(20), 10025–10030.

Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., Bridgland, A., Meyer, C., Kohl, S. A. A., Ballard, A. J., Cowie, A., Romera-Paredes, B., Nikolov, S., Jain, R., Adler, J., Back, T., . . . Hassabis, D. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873), 583–589.

Koonin, E. V. (2005). Orthologs, paralogs, and evolutionary genomics. Annual Review of Genetics, 39, 309–338.

Leelananda, S. P., & Lindert, S. (2016). Computational methods in drug discovery. Beilstein Journal of Organic Chemistry, 12, 2694–2718.

Maier, S. A., Galellis, J. R., & McDermid, H. E. (2005). Phylogenetic analysis reveals a novel protein family closely related to adenosine deaminase. Journal of Molecular Evolution, 61(6), 776–794.

Pettersen, E. F., Goddard, T. D., Huang, C. C., Couch, G. S., Greenblatt, D. M., Meng, E. C., & Ferrin, T. E. (2004). UCSF Chimera—a visualization system for exploratory research and analysis. Journal of Computational Chemistry, 25(13), 1605–1612.

Szarecka, A., & Dobson, C. (2019). Protein structure analysis: introducing students to rational drug design. The American Biology Teacher, 81(6), 423–429.

Waterhouse, A., Bertoni, M., Bienert, S., Studer, G., Tauriello, G., Gumienny, R., Heer, F. T., de Beer, T. A. P., Rempfer, C., Bordoli, L., Lepore, R., & Schwede, T. (2018). SWISS-MODEL: homology modelling of protein structures and complexes. Nucleic Acids Research, 46(Suppl. 1), W296–W303.

Whitmore, K. V., & Gaspar, H. B. (2016). Adenosine deaminase deficiency – more than just immunodeficiency. Frontiers in Immunology, 7, 314.

Yang, J., Yan, R., Roy, A., Xu, D., Poisson, J., & Zhang, Y. (2015). The I-TASSER suite: Protein structure and function prediction. Nature Methods, 12(1), 7–8.

Supplementary Material