Protein Modeling:

Three Dimensional Modeling of Kallikrein-9

1. Introduction

Protein structure modeling is the process of taking a sequence of amino acids and predicting the three dimensional conformation of the protein. There are two main approaches to predicting protein structure: comparative modeling and ab initio. Comparative modeling uses known protein structures as the starting point either through homology modeling or protein threading. Homology modeling is based on the alignment of the target sequence with sequences with known structures. These known structures are then used as templates to build the target protein structure prediction. Because of the link between sequence and structure, the higher the percent identity between the target and the aligned sequences, the better the model will be. Protein threading differs from homology modling in that it seeks to align the target sequence directly with known structure, this is known as fold recognition. Homology modeling works best when the target has homologs with known structure and high percent identity. Protein threading works best when the target does not have highly similar homologs. The other main approach to protein structure prediction is ab initio modeling. This is a very computationally expensive approach that seeks to predict a three dimensional structure from a protein sequence without using a known structural template.\n

2. Protein Selection

This project required a sequence of a human enzyme of length 100-250 residues. It was required that this protein had a homolog in PDB [1] with at least 30% but no more than 60% sequence identity. This sequence was found in UniProt [2] using an advanced search with the following fields:

Organism [OS]: Homo sapiens
(AND) Function > Enzyme classification [EC]: Any
(AND) Sequence > Sequence length: From 100 to 250
(NOT) Structure > 3D structure available

The goal of this research is to examine protein structure modeling so it was important to find proteins with homologs between 30% and 60% identity. 30% identity is a rule of thumb for homologous proteins; if two proteins have >30% identity, then they are extremely likely to be homologous. The upper limit of 60% identity was somewhat aribitrary, but as percent identity increases so does structural similarity, therefore proteins that have very similar homologs with known structure are much easier to model. So, the limit of 60% identitiy was chosen to make 3D modelling more difficult and thereby increase potential differences between alternative modelling methods. To this end, several of the protein sequences retrieved from the UniProt search were aligned with PDB protein sequences using Blast [3]. The Blast search set was restricted to the PDB database because that is the repository of protein structures. The Blast results for the sequences were then examined to ensure percent identity values for the aligned proteins fell within the desired range. The protein Kallikrein-9 was selected in this manner.

3. Protein Function

Kallikrein-9 (KLK9) is a serine endopeptidase with the UniProtKB ID Q9UKQ9 and the enzyme classification 3.4.21. A serine endopeptidase, or serine protease, is a type of enzyme that breaks peptide bonds and the name indicates the presence of serine (S) in the active site. This is a distinguishing characteristic of serine proteases [4]. S acts as a nucleophile in the reaction and its function is dependent on the presence of Aspartic Acid (D) and Histidine (H). The trio is known as the charge relay system [2,4]. The canonical form from [4] is shown in Figure 1. The KLK9 sequence is shown below with the charge relay systems resiudes in red:

        10         20         30         40         50
MKLGLLCALL SLLAGHGWAD TRAIGAEECR PNSQPWQAGL FHLTRLFCGA
        60         70         80         90        100
TLISDRWLLT AAHCRKPYLW VRLGEHHLWK WEGPEQLFRV TDFFPHPGFN
       110        120        130        140        150
KDLSANDHND DIMLIRLPRQ ARLSPAVQPL NLSQTCVSPG MQCLISGWGA
       160        170        180        190        200
VSSPKALFPV TLQCANISIL ENKLCHWAYP GHISDSMLCA GLWEGGRGSC
       210        220        230        240        250
QGDSGGPLVC NGTLAGVVSG GAEPCSRPRR PAVYTSVCHY LDWIQEIMEN

Figure 1. The structure of trypsin (2PTN), a serine protease, with the active site residue S195, D102, and H57 residues labelled, as presented in [4].

4. Protein Structure Models

4.1 Model Generation

The first structure was generated by MULTICOM (multi-level combination) [5], which uses a comparative modeling approach that consists of template identification and 2D contact prediction using deep convolutional neural networks. These techniques are used to build several models which are then ranked from best to worst. The second structure was generated using the I-Tasser (Iterative Threading Assembly Refinement) approach [6,7,8], which uses the protein threading type of comparative modeling. The third structure was generated by Evolutionary Couplings [9], which uses an ab initio approach based on finding evolutionary couplings in a multiple sequence alignment built using the target sequence. Evolutionary couplings are pairs of residues that are likely to be near each other in the final structure. The idea behind this method is that evolution has allowed for sequence variation under the strict constraint of maintaining structure/function [10]. MULTICOM and I-Tasser were both highly ranked, 3rd and 5th, respectively, during the 13th Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction (CASP 13) [11]. The MULTICOM, I-Tasser, and EVcouplings models can be seen in Figures 2, 3, and 4, respectively.

Figure 2. MULTICOM predicted structure of KLK9.

Figure 3. I-Tasser predicted structure of KLK9.

Figure 4. Evolutionary Couplings predicted structure of KLK9.

4.2 Model Validation

The MULTICOM, I-Tasser, and EVcouplings models were validated using PROCHECK [12] and Verify3D [13]. PROCHECK analysis the stereochemistry of a protein model and outputs informations about main-chain bond angles and lengths, side-chain parameters and bond angles, and ramachandran plots. The main Ramachandran plot, which shows the Phi and Psi angles for each residue in the protein, is particularly useful for analyzing the quality of a model. There are three regions in a Ramachandran plot that are considered favored and correspond to secondary structure classifications beta strand, alpha helix, and left-handed alpha helix. In a good quality model, it is expected for over 90% of the residues to fall in these three regions [12]. Models with lower than 90% in the most favored regions are of lesser quality. Verify3D uses a different approach that measures the quality of a protein model with its amino acid sequence using a 3D profile. Each residue in the model is characterized by three characteristics of its environment: 1)the area that is buried, 2)the fraction of the side chain covered by polar atoms, and 3)the local secondary structure. These characteristics are then compared to the statistical preferences of each of the amino acids. Verify3D gives a passing score to models with at least 80% of the residues scoring over 0.2. The results of model validation can be seen in Table 1. All three models passed the Verify3D validation but had different results during PROCHECK analysis. None of the models surpassed the "90% in most favored regions" mark as defined by PROCHECK, but the MULTICOM model was closest at 88%. In contrast, EVcouplings received the worst scores from PROCHECK with only 45% of the residues falling in the most favored regions. The results suggest the the MULTICOM model is the best, I-Tasser is the next best, and EVcouplings is the worst.

Figure 5. Superposition of the MULTICOM (gold), I-Tasser (green), and EVcouplings (purple) models. In addition to the protein backbones, the active site of each of the models (S, D, and H resiudes) are represented in wireframe form.

Table 1. Model validation scores from Verify3D and PROCHECK.
Model	Verify3D Score	PROCHECK Ramachandran Plot Residue Position
		Most Favored	Additionally Allowed	Generously Allowed	Disallowed
EVCouplings	Pass - 90%	45%	36%	11%	8%
MULTICOM	Pass - 89%	88%	10%	1%	1%
I-Tasser	Pass - 100%	71%	25%	3%	1%

4. Mutated Protein Structure Models

The serine protease charge relay system, as discussed in section 2, is comprised of three resiudes: S, D, and H. S acts as the nucleophile and D and H complete the system with their respective negative and positive charges. My hypothesis was that mutating one of the resiudes in this system would affect protein structure. To this end, I chose to mutate position 63 of KLK9 from H to D. I decided against mutating the S residue because it would likely be important for protein alignments as it is the nucleophile, i.e., I wanted the alignments to include serine proteases. Of the other two resiudes, I decided to flip the charge of one as I thought this would have the most dramatic effect. Arbitrarily, the positive H residue was selected and mutated to a negatively charged D. The MULTICOM and EVcouplings models can be seen in Figures 6 and 7, respectively. It can be seen in Figure 6, that the MULTICOM models are quite similar along the backbone, except for the protruding disorded tail, and the acitve site residue are in approximately the same location, though the mutated side chain is oriented differently. Conversely, the EVcouplings models are not in close agreement along the backbone and the active site residue are near each other but their locations and side-chain orientations are quite different. Model validation was performed with Verify3D and PROCHECK. Both models passed Verify3D and surprisingly the residues positions of both fell in similar ranges in the PROCHECK Ramachandran areas.

Figure 6. MULTICOM predicted structures of KLK9 (gold) overlayed on mutated KLK9 (red) with position 63 altered from H to D .

Figure 7. Evolutionary Couplings predicted structure of KLK9 (purple) overlayed on mutated KLK9 (red) with position 63 altered from H to D.

Table 1. Model validation scores from Verify3D and PROCHECK.
Model	Verify3D Score	PROCHECK Ramachandran Plot Residue Position
		Most Favored	Additionally Allowed	Generously Allowed	Disallowed
EVCouplings - wild type	Pass - 90%	45%	36%	11%	8%
EVCouplings - mutated	Pass - 90%	45%	34%	11%	10%
MULTICOM - wild type	Pass - 89%	88%	10%	1%	1%
MULTICOM - mutated	Pass - 92%	86%	12%	1%	1%

5. Conclusions and Future Studies

Given the tight link between structure and function, it is useful to compare the active sites of the models built in this study with the model of 2PTN. The 2PTN model in Figure 1 from [4] shows the canonical active site of a serine protease. Histindine is one of the rarest amino acids found in proteins. It is distinguished by its large, positive side chain. The role of H in the enzymatic mechanism is to act as a base while serine's side-chain oxygen attacks the carbonyl of the substrate peptide [4]. This means that the H side-chain must be near the active S in order to function properly. This can be seen in Figure 1. In the MULTICOM model, seen in Figure 2, the side-chain of H is actualy oriented away from S. In both the I-Tasser (Figure 3) and EVcouplings (Figure 4) models, H is oriented toward serine, but the I-Tasser model looks more similar to the canonical form in Figure 1. This can also been seen in Figure 5, which includes all three models and active site. The MULTICOM (gold) model has the H side-chain pointing away from S, while the other two models have the opposite.

It is also important to comment on a more general problem presented in the models. The MULTICOM model has a long disordered tail protruding out of the main structure. This is not an optimal structure as that region of random coil is unlikely to hold a stable conformation and could interfere with protein function directly in this manner. Furthermore, this region could destabilize the main structure. This same problem appears to be present in the I-Tasser model, though the random coil seems to have some helical characteristics in this manner. I think that this region of disorder stem from the alignments used in generating these comparative models; this region likely did not have homologs with known structure. The EVcouplings model appears to have avoided this problem entirely by chopping off that portion of the sequence, effectively shortening the backbone. Further research into the why EVcouplings made this combination is required.

The mutated models seen in Figures 6 and 7 show interesting results. The MULTICOM model (Figure 6) has the mutated D oriented toward the wild type D, which seems unlikely given that both have negative charges and would repel. The agreement between the wild-type and mutated models is striking on the gloabl scale with the only disagreements occuring along the disordered tail region. The D mutation seems to have been inserted into the homologous template with relatively minor changes happening on the whole, which is surprising again given the negative charge of the mutated residue and its location near another negative charge. The EVcouplings models (Figure 7) are radically different in comparison. The backbone is roughly similar, but there are mainy regions with much disagreement. The model does seem to have captured the nature of the charged residues better than the MULTICOM model. The wild type model (purple) has the H oriented toward the S, as expected, and the D oriented toward the H, which makes sense given their complementary charges. The mutated model (red), on the other hand, has the two D residues oriented away from each other, which make sense given their like charges.

In conclusion, it is clear the the different modeling methods produce similar but distinct results. MULTICOM and I-Tasser, the comparative methods, both performed better at CASP 13 [11], produced generally similar models, and performed better on model validation, passing Verify3D and getting decent scores from PROCHECK. EVcouplings, the ab initio method, produced a markedly different model than MULTICOM and I-Tasser, as seen in Figure 5 where the MULTICOM (gold) and I-Tasser (green) backbones are quite similar but the EVcouplings (purple) backbone stands apart. The EVcouplings model did seem to perform better on the mutated model compared to MULTICOM because of its treatment of the mutated charged D residue. Future work could include mutating other residues in the charge relay system or nearby to see how it affects structure and the comparison of additionaly structure prediction approachs.

References

The Protein Data Bank H.M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T.N. Bhat, H. Weissig, I.N. Shindyalov, P.E. Bourne (2000) Nucleic Acids Research, 28: 235-242. doi:10.1093/nar/28.1.235
The UniProt Consortium. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res. 47: D506-515 (2019)
Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.
Di Cera E. Serine proteases. IUBMB Life. 2009;61(5):510–515. doi:10.1002/iub.186
Hou, J., Wu, T., Cao, R., & Cheng, J. (2019). Protein tertiary structure modeling driven by deep learning and contact distance prediction in CASP13. Proteins, accepted. (https://doi.org/10.1002/prot.25697)
Yang Zhang. I-TASSER: Fully automated protein structure prediction in CASP8. Proteins, 77 (Suppl 9): 100-113, 2009.
Ambrish Roy, Jianyi Yang, Yang Zhang. COFACTOR: an accurate comparative algorithm for structure-based protein function annotation. Nucleic Acids Research, 40: W471-W477, 2012.
Jianyi Yang, Yang Zhang. I-TASSER server: new development for protein structure and function predictions, Nucleic Acids Research, 43: W174-W181, 2015.
The EVcouplings Python framework for coevolutionary sequence analysis Thomas A. Hopf, Anna G. Green, Benjamin Schubert, Sophia Mersmann, Charlotta P. I. Schäerfe, John B. Ingraham, Agnes Toth-Petroczy, Kelly Brock, Adam Riesselman, Chan Kang, Christian Dallago, Chris Sander, Debora S. Marks bioRxiv 326918; doi: https://doi.org/10.1101/326918
Marks DS, Colwell LJ, Sheridan R, et al. Protein 3D structure computed from evolutionary sequence variation. PLoS One. 2011;6(12):e28766. doi:10.1371/journal.pone.0028766
Kryshtafovych, A, Schwede, T, Topf, M, Fidelis, K, Moult, J. Critical assessment of methods of protein structure prediction (CASP)—Round XIII. Proteins. 2019; 87: 1011– 1020. https://doi.org/10.1002/prot.25823
Laskowski R A, MacArthur M W, Moss D S, Thornton J M (1993). PROCHECK - a program to check the stereochemical quality of protein structures. J. App. Cryst., 26, 283-291.
Luthy, R., Bowie, J. U., & Eisenberg, D. (1992). Assessment of protein models with three-dimensional profiles. Nature, 356(6364), 83-5.