JOURNAL ARTICLE

Self-Supervised Pre-training for Protein Embeddings Using Tertiary Structures

Yuzhi Guo, Jiaxiang Wu, Hehuan Ma, Junzhou Huang

Year: 2022
Journal: Proceedings of the AAAI Conference on Artificial Intelligence
Vol: 36 (6)
Pages: 6801-6809
Publisher: Association for the Advancement of Artificial Intelligence

Abstract

The tertiary structure of a protein largely determines its interaction with other molecules. Although tertiary structure is important in various structure-related tasks, fully supervised data are often time-consuming and costly to obtain. Existing pre-training models mostly focus on amino-acid sequences or multiple sequence alignments, while structural information is not yet exploited. In this paper, we propose a self-supervised pre-training model for learning structure embeddings from protein tertiary structures. Native protein structures are perturbed with random noise, and the pre-training model aims to estimate gradients over the perturbed 3D structures. Specifically, we adopt SE(3)-invariant features as model inputs and reconstruct gradients over 3D coordinates with SE(3)-equivariance preserved. This paradigm avoids the use of sophisticated SE(3)-equivariant models and dramatically improves the computational efficiency of pre-training models. We demonstrate the effectiveness of our pre-training model on two downstream tasks: protein structure quality assessment (QA) and protein-protein interaction (PPI) site prediction. Hierarchical structure embeddings are extracted to enhance the corresponding prediction models. Extensive experiments indicate that such structure embeddings consistently improve prediction accuracy on both downstream tasks.
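The pre-training idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the noise model or the feature set, so the sketch assumes isotropic Gaussian noise (which makes the gradient target the usual denoising score-matching target) and uses pairwise distances as one example of SE(3)-invariant features. All function names here are hypothetical.

```python
import numpy as np

def perturb(coords, sigma, rng):
    """Add isotropic Gaussian noise to an (N, 3) coordinate array (assumed noise model)."""
    return coords + rng.normal(scale=sigma, size=coords.shape)

def score_target(native, perturbed, sigma):
    """Gradient of log N(perturbed; native, sigma^2 I) with respect to the perturbed coordinates."""
    return (native - perturbed) / sigma**2

def pairwise_distances(coords):
    """Inter-residue distance matrix: unchanged by any rotation or translation."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def random_rotation(rng):
    """Draw a random proper rotation matrix (determinant +1) via QR decomposition."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))      # fix column signs of the orthogonal factor
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]           # flip one axis to exclude reflections
    return q

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    native = rng.normal(size=(8, 3))         # toy "structure" with 8 residues
    noisy = perturb(native, sigma=0.5, rng=rng)
    rot, shift = random_rotation(rng), rng.normal(size=3)

    # Inputs (distances) are invariant under a rigid motion of the structure ...
    moved = noisy @ rot.T + shift
    print(np.allclose(pairwise_distances(noisy), pairwise_distances(moved)))

    # ... while the gradient target rotates along with the structure (equivariance).
    g = score_target(native, noisy, 0.5)
    g_moved = score_target(native @ rot.T + shift, moved, 0.5)
    print(np.allclose(g_moved, g @ rot.T))
```

In this toy setting, a network that consumes only invariant inputs would still need a step that maps its outputs back to equivariant per-coordinate gradients; how the paper does this is not stated in the abstract and is left out of the sketch.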

Keywords:
Protein tertiary structure; Computer science; Artificial intelligence; Protein structure prediction; Machine learning; Protein structure; Training set; Invariant (physics); Downstream (manufacturing); Focus (optics); Pattern recognition (psychology); Mathematics; Biology; Engineering

Metrics

Cited By: 26
FWCI (Field Weighted Citation Impact): 13.21
Refs: 46
Citation Normalized Percentile: 0.99
Is in top 1%
Is in top 10%

Topics

Machine Learning in Bioinformatics
Life Sciences →  Biochemistry, Genetics and Molecular Biology →  Molecular Biology
Protein Structure and Dynamics
Life Sciences →  Biochemistry, Genetics and Molecular Biology →  Molecular Biology
Computational Drug Discovery Methods
Physical Sciences →  Computer Science →  Computational Theory and Mathematics