Computer modeling and protein structure visualization tools are effective and engaging ways of presenting various molecular biology concepts to high school and college students. Here, we describe a series of activities and exercises that use online bioinformatics databases and programs to search for and obtain protein sequence and structure data and use it to build homology models of proteins. Exercises in homology modeling can serve the pedagogical purpose of introducing and illustrating the concept of homology within gene and protein families, which results in conservation of the 3D structures of proteins and allows us to predict structures when experimental data are not available.
Teaching Evolutionary Relationships Among Proteins
Introduction of the concept of homology is critical to teaching evolution. On a molecular level, homology (i.e., the evolutionary relationship between two or more gene/protein sequences) is reflected by measurable similarities in their sequences and structures. Homologous proteins form families and have similar sequences: many of the amino acids in their sequences are identical or similar (in chemical character), and conserved motifs (segments of amino acids that are present in all the related proteins) can be revealed by sequence alignments. Protein structures, within families, tend to be even better conserved than the corresponding sequences. This conservation of 3D structure among homologs (proteins that share a common ancestor) constitutes the basis for homology (or comparative) modeling, one of the most commonly used and most successful methods of predicting protein structure. Homology modeling predicts a structure for a protein with known sequence but unknown structure (target) if the sequence and structure for a close homolog (template) are known. The main steps of homology modeling are finding template structure(s), aligning the target and template sequences, building the 3D structure for the target, modeling loops and side chains, optimization, and evaluation. Most of the current bioinformatic resources for homology modeling integrate and streamline these steps (Hameduh et al., 2020; Hati & Bhattacharyya, 2016). Homology modeling has helped bridge the gap between the large amount of sequence data and the limited structure data currently available to us. It has facilitated studies of protein dynamics or mutations (e.g., Ikemura et al., 2019) and has allowed for structure-based drug design for proteins that have been challenging for structural biology methods such as X-ray crystallography or nuclear magnetic resonance (NMR) (Leelananda & Lindert, 2016).
In the classroom, homology modeling is a relatively easy, inexpensive, and engaging experience with protein modeling tools that illustrates vital concepts in molecular biology and genetics. Herein we describe a lesson plan in which students (1) learn about a selected protein family (its members' cellular functions and implications for human health), (2) identify one member of this family for which a structure is not available, (3) use bioinformatic databases to obtain the sequence of this protein, (4) use an online homology modeling program to predict the structure of this protein, and (5) use visualization software to analyze the predicted model.
Choosing the Protein Family
The first step in implementing this activity in the classroom setting (or as a project) is to decide which protein or domain family to use, taking into account the protein domain’s function, size, availability of structural data, and availability of both ortholog and paralog sequences. Orthologs and paralogs are two types of homologs; the former are present in different species, while the latter arise within a single species through gene duplication (Koonin, 2005). Sequence or structure data can be obtained from public databases, such as UniProt (https://www.uniprot.org/) and Protein Data Bank (PDB; https://www.rcsb.org/). We have described using PDB in lesson plans in our previous paper (Szarecka & Dobson, 2019). Homology modeling programs are available online and free of charge as well, but many servers impose limitations on the size of the proteins to model. We recommend choosing sequences of up to 1000 amino acids for this activity. Depending on the protein sequence selected and the computational capacity of the server, the results will typically be returned within a few days; thus, the activity could be implemented as two class periods or as one class period and homework.
Here we suggest using the adenosine deaminase family. Interesting connections to other information presented in class could include enzymes, metabolism, purine nucleotide cycle, expression of different paralogs in different tissues, and a variety of mutations causing a broad range of disorders. For example, mutations in ADA genes are involved in immunodeficiency disorders, while those in AMPD genes affect muscle function and neurodevelopment. More information on ADA/AMPD mutations and related disorders can be found in UniProt records P00813 and P23109 (links and screenshots are provided in the Supplemental Material provided with the online version of this article), and also in (Whitmore & Gaspar, 2016) and (Hayes et al., 2013). The UniProt records provide a list of sequence variants that will be very interesting and convenient to discuss with students.
The family members can be found in the HGNC database (Figures 1 & 2; https://www.genenames.org/).
Of the six family members, AMPD1, AMPD3, and ADAL are good candidates for homology modeling as the experimental structures are not available (Figure 3). Selecting an HGNC ID link for a gene provides valuable information and links to other databases, for example access to sequences, links to PDB (to find experimental structures if available), and to OMIM (to find information on genetic disorders and mutations linked to a particular gene).
Teachers or students can also choose other protein families depending on their interests and connections to other parts of the class material. Good candidates may, for example, include serpins, histone deacetylases, and hydroxysteroid 17-beta dehydrogenases. For ideas and inspiration, we recommend browsing the PDB Molecule of the Month collection or the Gene group reports, or searching the OMIM database (https://www.ncbi.nlm.nih.gov/omim) for proteins involved in genetic disorders. Specific protein sequences can be found through a search of the UniProt database by the gene or protein name. UniProt also provides information on whether a given sequence is represented by a structure in the PDB.
How to Obtain Protein Sequences for Modeling
The easiest way to find and download a protein sequence is through a search of the UniProt database. In this lesson plan, students can find the sequences either by a direct search of UniProt (Figure 4) or through a link provided in HGNC (by clicking on the AMPD3 link shown in Figure 3 and scrolling down to the link to UniProt). The UniProt accession number for AMPD3 is Q01432 (Figure 4, box 5). The sequence is obtained by scrolling down the UniProt entry to the protein sequence segment and downloading the sequence in FASTA format (Figure 5).
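For classes comfortable with a little scripting, the same download can be sketched in Python. This is a minimal illustration and not part of the original lesson plan: the function names are hypothetical, and the URL pattern is an assumption about UniProt's current REST interface that should be checked against the UniProt help pages before use in class.

```python
from urllib.request import urlopen


def uniprot_fasta_url(accession: str) -> str:
    """Build the download URL for one entry (assumed UniProt REST pattern)."""
    return f"https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def parse_fasta(text: str) -> tuple[str, str]:
    """Return (header, sequence) for the first record in FASTA-formatted text."""
    header = ""
    seen_header = False
    residues = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seen_header:
                break  # a second record starts here; keep only the first
            seen_header = True
            header = line[1:]
        else:
            residues.append(line)
    return header, "".join(residues)


def fetch_sequence(accession: str) -> tuple[str, str]:
    """Download and parse one entry; requires an internet connection."""
    with urlopen(uniprot_fasta_url(accession), timeout=30) as response:
        return parse_fasta(response.read().decode("utf-8"))
```

Calling `fetch_sequence("Q01432")` would be expected to return the AMPD3 header line and its sequence as a single string, which students can compare with the file downloaded through the web page.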
An interesting addition to this part of the lesson would be for students to select the family members’ sequences (by checking the boxes on the left, by each entry, Figure 4, box 4) and perform a sequence alignment (Figure 4, box 1). Sequence alignment will reveal the similarities among the sequences within the family, conserved amino acids, and family motifs, as well as differences that occurred during evolution of various paralogs (an example of an alignment is shown in the Supplemental Material included with the online version of this article).
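To make the idea of conserved positions concrete, students could also mark them programmatically. The sketch below is illustrative only: it assumes the alignment has already been computed (for example, exported from the UniProt alignment tool) and that every sequence is padded with "-" gap characters to the same length. The function name and the toy sequences are made up for this example.

```python
def conserved_columns(aligned: list[str]) -> str:
    """Return a marker line with '*' under columns where all residues are identical."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    marks = []
    for column in zip(*aligned):
        residues = set(column)
        # A column counts as conserved only if it holds one residue type and no gap.
        fully_conserved = len(residues) == 1 and "-" not in residues
        marks.append("*" if fully_conserved else " ")
    return "".join(marks)


# Toy alignment (not real family members), printed with the marker line below it.
toy_alignment = ["MKSLHE", "MRSLHE", "MK-LHD"]
for seq in toy_alignment:
    print(seq)
print(conserved_columns(toy_alignment))
```

Runs of "*" in the marker line correspond to candidate family motifs, while unmarked columns show where the paralogs have diverged.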
Building Homology Models
There are several servers that can be used to build homology models online free of charge. The Supplemental Material provided with the online version of this article includes a list of resources for homology model building and assessment. Here we suggest using Swiss-Model (Waterhouse et al., 2018) and/or I-TASSER (Yang & Zhang, 2015). Swiss-Model requires three steps: input of a protein sequence to model, search for templates, and building models (Figure 6). It is advisable that students enter their email addresses to receive notifications from the server as well as bookmark the runs if they need to step away from the project.
The template search is a critical step in the modeling process and is equally important pedagogically. Prediction of a protein structure through homology modeling hinges on the availability of a homologous protein whose structure is available in the PDB (the template). A successful prediction (i.e., one with reasonable accuracy) requires a template with as high a sequence identity to the target as possible and a structure of as high a quality as possible. Sequence identity is a measure of similarity between two sequences; it is defined as the percentage of identical amino acids when two sequences are aligned. Sequence similarity is a related term, defined as the combined percentage of identical and similar (in physico-chemical properties, such as polarity or size) amino acids. The Supplemental Material provided with the online version of this article shows an example of sequence alignment with identical/similar amino acids marked.
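The two definitions can be illustrated with a short calculation on a pair of already-aligned sequences. The sketch rests on two labeled assumptions: the residue groups used to decide what counts as "similar" are a simplified, hypothetical grouping (alignment programs usually rely on substitution matrices such as BLOSUM62 instead), and percentages are computed over columns without gaps, which is only one of several conventions in use.

```python
# Simplified physico-chemical groups; an assumption made for illustration only.
SIMILAR_GROUPS = [
    set("AVLIM"),  # aliphatic / hydrophobic
    set("FWY"),    # aromatic
    set("ST"),     # small hydroxyl
    set("NQ"),     # amide
    set("DE"),     # acidic
    set("KRH"),    # basic
]


def identity_and_similarity(seq_a: str, seq_b: str) -> tuple[float, float]:
    """Return (% identity, % similarity) over the ungapped columns of an alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    identical = similar = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # skip columns with a gap in either sequence
        compared += 1
        if a == b:
            identical += 1
            similar += 1  # identical residues also count toward similarity
        elif any(a in group and b in group for group in SIMILAR_GROUPS):
            similar += 1
    if compared == 0:
        return 0.0, 0.0
    return 100 * identical / compared, 100 * similar / compared


# Toy alignment: 8 ungapped columns, 5 identical, 2 more similar (K/R and L/I).
print(identity_and_similarity("MKVLDEHS-", "MRVIDQHSG"))
```

Because similarity includes the identical positions, it is never lower than identity for the same alignment.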
The threshold for sequence identity between two sequences (of lengths >100 aa) to consider the two proteins to be homologs is typically 30%. The accuracy of structure modeling increases as the sequence identity goes up (Hati & Bhattacharyya, 2016). For a scientist modeling a protein structure, examining the available templates, target-template sequence alignments, and “coverage” (does the template sequence and structure cover all or most of the target?) is vital. Swiss-Model searches for templates and provides the user with a summary of available templates and their suitability for modeling (Figure 7). Students can scroll through the results and note what sequences and structures have been identified and note the optimal template(s). They can also create various models and compare their structures and quality assessments. Toward the overarching goal of this activity, we would recommend discussing the following aspects: (a) evolution of protein sequences produces multiple lineages; homology is recognizable by a sufficient sequence identity among them and the presence of conserved residues and motifs even if other segments of the sequences diverged; and (b) while homologous proteins will have the same fold type and show similar secondary and tertiary structures, the accuracy of homology modeling depends critically on the level of sequence similarity (in our case ~50%).
Predicted models can be viewed and downloaded from the server as shown in Figure 8.
Obtaining Homology Models from the AlphaFold Database or UniProt Database
Recently, great progress was made in the field of protein structure prediction by an AI-based method called AlphaFold (Jumper et al., 2021). Many (but not yet all) sequences in UniProt have been modeled using this algorithm, and the predicted structures are available for download from the respective sequence entries (e.g., Q01432 for AMPD3). We recommend that students search for protein structures predicted by AlphaFold through UniProt (Figure 9). Comparing the models they built with those predicted by AlphaFold could be an interesting part of the modeling exercise in the classroom. Note that predicted models may contain errors due to many factors, but an accuracy assessment is provided for every model, which is an interesting aspect to discuss with students. For example, if the model has a structural segment of low confidence, this segment would not be a reliable structure to use in protein-ligand or protein-protein docking.
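Low-confidence segments can also be listed with a few lines of code. In PDB-format files from the AlphaFold Database, the per-residue confidence score (pLDDT, on a 0 to 100 scale) is stored in the B-factor column, and scores below 70 are generally labeled as low confidence. The sketch below reads that column for the alpha-carbon of each residue; the function names are hypothetical and the cutoff is an adjustable assumption.

```python
def residue_plddt(pdb_text: str) -> dict[tuple[str, int], float]:
    """Map (chain, residue number) to the B-factor value of each CA atom."""
    scores = {}
    for line in pdb_text.splitlines():
        # Fixed-column PDB format: atom name in columns 13-16, chain in 22,
        # residue number in 23-26, B-factor (pLDDT here) in 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            chain = line[21]
            residue_number = int(line[22:26])
            scores[(chain, residue_number)] = float(line[60:66])
    return scores


def low_confidence_residues(scores, cutoff=70.0):
    """Return the sorted (chain, residue number) keys scoring below the cutoff."""
    return sorted(key for key, value in scores.items() if value < cutoff)
```

With a downloaded model saved as, say, `model.pdb` (a placeholder file name), `low_confidence_residues(residue_plddt(open("model.pdb").read()))` lists the residues that students should treat with caution, for instance when choosing a docking site.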
Analysis of Evolution of Protein Structures
Similarity between two homologous protein structures is best assessed by structural superposition, which can be calculated by free programs such as VMD (Humphrey et al., 1996) or UCSF Chimera (Pettersen et al., 2004), the latter being easier to use. More information on protein visualization software can be obtained from Barber (2021) and from the Supplemental Material included with the online version of this article.
Students can import any PDB files (for example, the AlphaFold model and the Swiss-Model structure files) into Chimera using File→Open and then superimpose them using Tools→Structure Comparison→MatchMaker. One of the structures can be highlighted as a reference, leaving all other settings at default. Students can observe that Model 1 from Swiss-Model is a dimer (a complex of two protein chains); one of the chains can be visualized by selecting and hiding the other. The superposition of the template and the structure predicted by Swiss-Model is presented in Figure 10A. Figure 10B shows a superposition of the template structure and the one predicted by AlphaFold. As the students will observe, there are differences between the predicted models, and areas of low confidence should be considered carefully, for example, when deciding whether to use any of these models in drug design projects. From the evolutionary perspective, a comparison of AMPD3 and ADA1 (Figure 10C) is valuable in showing that, while the two protein structures have diverged, the central core is similar between the two. Students can observe that the beta-sheets of the two proteins superimpose very well, with the Zn ion and Zn-binding residues occupying the same positions (Figure 10D), although the alpha-helices are not as well conserved and there are a number of structural segments unique to AMPD. This is consistent with their shared catalytic mechanism but distant evolutionary relationship (Maier et al., 2005). Further analysis of the superposition (Figures 10C and 10D) can be found in the Supplemental Material provided with the online version of this article.
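The quality of a superposition is commonly summarized as the root-mean-square deviation (RMSD) between matched atoms, usually alpha-carbons, in ångströms. The sketch below shows only that final arithmetic step: it assumes the two structures have already been superimposed (for example, in Chimera) and that the matched atom pairs are supplied in the same order. It does not perform the superposition itself, and the function name is hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of (x, y, z) points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared distance between one matched pair of atoms.
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Toy coordinates: each matched pair is exactly 1 unit apart.
model = [(0.0, 0.0, 0.0), (5.0, 5.0, 5.0)]
template = [(1.0, 0.0, 0.0), (5.0, 6.0, 5.0)]
print(rmsd(model, template))
```

A smaller RMSD over the matched core residues indicates a closer structural match, which gives students a single number to set beside the visual comparison.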
Conclusions
In our experience, computer modeling techniques are inexpensive and highly effective methods of engaging students and presenting a variety of topics in biology. Here, we emphasize the use of bioinformatic databases, familiarizing students with biomolecular sequence and structure data, discussing homology and evolutionary relationships among proteins (for example, the difference between orthologs and paralogs), and discussing gene families in the context of gene expression and disease-causing mutations, particularly using an enzyme family involved in diverse disorders from muscle fatigue to immunodeficiencies. 3D protein structure conservation among homologs is a powerful illustration of evolutionary relationships and the foundation of homology modeling as a method of structure prediction. The same protocol can be applied to other protein families and sequences, allowing for diverse projects. An additional benefit is that students become familiar with protein structure visualization software that can also be used for analysis of binding pockets, protein-ligand interactions, and mapping sequence alignments to structural data (examples are presented in the Supplemental Material provided with the online version of this article). This exercise can also be used to expand the lesson plan for computational drug design and protein-ligand docking that we described previously (Szarecka & Dobson, 2019). The activity proposed in our previous paper relies on the availability of an experimental protein structure that serves as a target for drug design; homology modeling (as described here) can be used when such a structure is not available.
Assessment of student learning can be focused on students’ ability to retrieve correct sequence data, create a high-confidence homology model, and discuss the structural features and similarities among predicted and experimental structures.