Tutorial 11 - Protein structure prediction

Recap

Make sure you can answer the following questions:

Describe the levels of protein structure.
How do we represent and store them?
Explain the meaning of the following terms when applied to genes: analog, homolog, paralog, ortholog and xenolog.
What is a protein ligand?

PDB file format from https://lammpstube.com
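As a reminder of how coordinates are stored, here is a minimal Python sketch of the fixed-column ATOM record used in PDB files. The column positions follow the PDB format (atom name in columns 13-16, chain in column 22, residue number in columns 23-26, coordinates in columns 31-54, B-factor in columns 61-66). The helper names `format_atom_line`, `parse_atom_line` and `ca_residue_ranges` are hypothetical, and the sample coordinates are made up for illustration.

```python
def format_atom_line(serial, name, res_name, chain, res_seq, x, y, z, occupancy, b_factor):
    """Build a fixed-width ATOM record (hypothetical helper; atom names up to 3 characters)."""
    return (
        f"ATOM  {serial:>5}  {name:<3} {res_name:>3} {chain}{res_seq:>4}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}{occupancy:6.2f}{b_factor:6.2f}"
    )


def parse_atom_line(line):
    """Read the main fields of an ATOM record by column position; return None for other records."""
    if not line.startswith("ATOM  "):
        return None
    return {
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        "b_factor": float(line[60:66]),
    }


def ca_residue_ranges(lines):
    """Map each chain ID to the (first, last) residue number seen among C-alpha atoms."""
    ranges = {}
    for line in lines:
        atom = parse_atom_line(line)
        if atom is None or atom["name"] != "CA":
            continue
        number = atom["res_seq"]
        low, high = ranges.get(atom["chain"], (number, number))
        ranges[atom["chain"]] = (min(low, number), max(high, number))
    return ranges


# Made-up sample records, not taken from any real structure.
sample = [
    format_atom_line(1, "N", "MET", "A", 1, 27.340, 24.430, 2.614, 1.00, 9.67),
    format_atom_line(2, "CA", "MET", "A", 1, 26.266, 25.413, 2.842, 1.00, 10.38),
    format_atom_line(10, "CA", "ASN", "A", 2, 24.000, 26.500, 5.100, 1.00, 11.20),
]
print(parse_atom_line(sample[1]))
print(ca_residue_ranges(sample))
```

A summary like the one returned by `ca_residue_ranges` is a quick way to check which chain and residue range a downloaded file actually contains before uploading it to an alignment tool.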

Homology modeling - protein structure prediction exercise

A simple, although not always reliable, way to discover the secondary structure of a peptide sequence is to look up a protein with a similar primary sequence in a database. Let us try this! The task is to obtain the secondary structure of the following peptide sequence: HYLCKYVINAIPPTLTAKIHFRPELPAERNQLIQRLA

Go to https://blast.ncbi.nlm.nih.gov/Blast.cgi and click “Protein blast”.
Enter the sequence and enter “Homo sapiens (taxid:9606)” as organism.
Click the blast button and wait. This may take up to several minutes.
Look for the best matching protein. It should be: “monoamine oxidase A”
Enter this protein name into UniProt.
Check whether the result has a secondary structure annotation and find the positions that correspond to the BLAST match.
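The last step, mapping the matched region onto the annotation, can be sketched in Python. This is a simplified illustration under stated assumptions: it looks for an exact occurrence of the peptide, whereas BLAST reports the subject start and end positions directly and also tolerates mismatches. The protein sequence and the feature list below are made up, and the function names `find_peptide` and `overlapping_features` are hypothetical.

```python
def find_peptide(protein_seq, peptide):
    """Return the 1-based inclusive (start, end) of the first exact match, or None."""
    index = protein_seq.find(peptide)
    if index == -1:
        return None
    return index + 1, index + len(peptide)


def overlapping_features(features, start, end):
    """Clip annotated features to the matched region.

    features is a list of (kind, feature_start, feature_end) tuples with
    1-based inclusive positions, as used in sequence annotations.
    """
    hits = []
    for kind, feature_start, feature_end in features:
        low = max(feature_start, start)
        high = min(feature_end, end)
        if low <= high:
            hits.append((kind, low, high))
    return hits


# Made-up example data, not a real database entry.
protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
annotation = [("helix", 5, 13), ("strand", 16, 22), ("turn", 25, 27)]

match = find_peptide(protein, "QISFVKSH")
print(match)
if match is not None:
    print(overlapping_features(annotation, *match))
```

In the toy data, the peptide covers positions 11 to 18, so it overlaps the end of the helix and the start of the strand, while the turn lies outside the match.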

Use the procedure described above to learn as much as possible about the following peptide sequence: TEYAINKLRQLYVLRC.

A hint: the sequence is a part of a frequent protein domain.

Automated protein folding

ESMFold is a deep learning-based method for predicting protein tertiary structures from amino acid sequences. In the following steps, we will show how the method works. We will use one of the existing ESMFold predictions and verify it against the PDB database.

Access the public ESMFold webserver here.
Use the first available example in ESMFold: plastic degradation protein PETase.
Inspect and download the 300-amino-acid primary sequence of PETase.
Predict/retrieve, visualize and download the ESMFold tertiary structure prediction (a PDB file).
Check the overall quality of the PDB file with ProSA.
Find the PETase enzyme in the PDB database using sequence search.
Use the highest-scoring hit; it should be 5XJH, Crystal structure of PETase from Ideonella sakaiensis.
Compare the predicted structure with the X-ray crystallography structure available in the PDB database. Use the Pairwise Structure Alignment tool available through the PDB site.
Use the 5XJH code for the PDB internal structure and upload the PDB file downloaded from ESMFold (set the chain to A and let residues go from 1 to 300).
Inspect the match. The structures match well visually. RMSD is the root-mean-square distance between corresponding atoms in the two structures after they have been superimposed; its value is 0.57 Å, which is below the threshold of 1 Å for high-quality models. The TM-score of 0.99 is also close to its maximum value of 1. Generally, scores below 0.20 correspond to randomly chosen unrelated proteins, whereas structures with a score higher than 0.5 generally share the same fold.
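To make the two scores concrete, here is a minimal Python sketch of how RMSD and a TM-score-style value are computed from paired C-alpha coordinates. It assumes the two structures are already superimposed and that residue i in one list corresponds to residue i in the other; the PDB alignment tool performs the alignment and superposition itself, and TM-align additionally searches for the superposition that maximizes the score. The function names are hypothetical, the coordinates are made up, and the lower bound of 0.5 on the distance scale d0 is an assumption made here for short chains.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square distance between paired, already superimposed points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


def tm_d0(length):
    """Length-dependent distance scale d0 = 1.24 * (L - 15)^(1/3) - 1.8.

    The floor of 0.5 for short chains is an assumption of this sketch.
    """
    if length <= 15:
        return 0.5
    return max(0.5, 1.24 * (length - 15) ** (1.0 / 3.0) - 1.8)


def tm_score(coords_model, coords_ref, target_length=None):
    """TM-score for a fixed superposition, normalized by the target length."""
    if len(coords_model) != len(coords_ref) or not coords_ref:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    length = target_length if target_length is not None else len(coords_ref)
    d0 = tm_d0(length)
    total = 0.0
    for point_model, point_ref in zip(coords_model, coords_ref):
        distance = math.dist(point_model, point_ref)
        total += 1.0 / (1.0 + (distance / d0) ** 2)
    return total / length


# Made-up coordinates: 300 points on a line, and a copy shifted by 1 along x.
reference = [(3.8 * i, 0.0, 0.0) for i in range(300)]
model = [(x + 1.0, y, z) for x, y, z in reference]

print(rmsd(model, reference))
print(tm_score(model, reference))
```

Because each residue pair contributes 1 / (1 + (d / d0)^2), small deviations barely lower the TM-score, while RMSD grows quadratically with every outlier. This is why the two measures are usually reported together.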

Independent work

Predict and validate your own protein structure. The output to be reported:

a PDB protein of interest,
a predicted structure,
a validation that reports the match between the experimental and predicted structure.

Test the limits of applicability of the prediction. You can focus on large proteins (such as dynein motor protein 4AKG with multiple interacting domains), disordered proteins, proteins with novel folds, or membrane proteins. However, it becomes more and more difficult to find proteins for which prediction fails. See for example the strengths and limitations of AlphaFold2.

Resources (other than the resources listed above):

ESMFold Colab for protein structure prediction,
AlphaFold2 Colab for protein structure prediction,
TM-align for protein structure alignment,
AlphaFold3 server for structure predictions containing proteins, DNA, RNA, ligands, ions.

References

Abramson et al.: Accurate structure prediction of biomolecular interactions with AlphaFold 3, Nature, 2024 (pdf preprint).

Poleksic: Algorithms for optimal protein structure alignment, Bioinformatics, 2009 (online).
