CLIP and its descendants have become a staple of text-image models. Can we do the same thing but for text-to-protein? Yes!
➡️ Xu, Yuan et al. present ProtST, a framework for learning joint representations of textual protein descriptions (via PubMedBERT) and protein sequences (via ESM). Besides the contrastive loss, ProtST features a multimodal masked prediction objective, e.g., masking 15% of the tokens in the text and the protein sequence and predicting them jointly from the latent representations, as well as masked prediction losses based on the sequence or the text alone. The authors also build a new ProtDescribe dataset of 550,000 aligned protein sequence-description pairs. ProtST excels at many protein modeling tasks in the PEER benchmark, including protein function annotation and localization, and also enables retrieval of proteins directly from a textual description without any fine-tuning (see an example below). Looks like ProtST has a bright future as the backbone of many generative protein models 😉
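To get a rough feel for the training objectives, here is a minimal PyTorch sketch (not the actual ProtST code; the encoder outputs, names, and temperature are placeholders) of a CLIP-style contrastive loss between sequence and text embeddings plus the 15% token masking used for the masked prediction objective:

```python
# Minimal sketch, assuming precomputed sequence/text embeddings from ESM and
# PubMedBERT style encoders. Not the official ProtST implementation.
import torch
import torch.nn.functional as F

def contrastive_loss(seq_emb, txt_emb, temperature=0.07):
    """InfoNCE over a batch of aligned (protein sequence, description) pairs."""
    seq_emb = F.normalize(seq_emb, dim=-1)           # [B, D]
    txt_emb = F.normalize(txt_emb, dim=-1)           # [B, D]
    logits = seq_emb @ txt_emb.t() / temperature     # [B, B] similarity matrix
    targets = torch.arange(len(logits), device=logits.device)
    # symmetric cross-entropy: match each sequence to its text and vice versa
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def mask_tokens(tokens, mask_id, p=0.15):
    """Mask ~15% of tokens for the (multimodal) masked prediction objective."""
    mask = torch.rand(tokens.shape, device=tokens.device) < p
    corrupted = tokens.clone()
    corrupted[mask] = mask_id
    return corrupted, mask
```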
In fact, ICML features several protein generation works like Genie by Lin and AlQuraishi and FrameDiff by Yim, Trippe, De Bortoli, Mathieu et al. These are not yet conditioned on textual descriptions, so plugging ProtST into them looks like an obvious improvement 📈.
⚛️ MPNNs on molecules have a strict locality bias that inhibits modeling of long-range interactions. Kosmala et al. derive Ewald Message Passing, applying the idea of Ewald summation, which decomposes the interaction potential into short-range and long-range terms. The short-range interactions are modeled by any GNN, while the long-range part is the new bit and is modeled with a 3D Fourier transform and message passing over Fourier frequencies. It turns out this long-range term is quite flexible and can be plugged into any network modeling periodic or aperiodic systems (like crystals or molecules), such as SchNet, DimeNet, or GemNet. The model was evaluated on the OC20 and OE62 datasets. If you are interested in more details, check out the one-hour talk by Arthur Kosmala at the LOG2 Reading Group!
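For intuition about the long-range term, here is a small NumPy sketch of the classical reciprocal-space Ewald sum that the method builds on: interactions are summed over a truncated set of Fourier frequencies instead of over real-space neighbors (the paper replaces the fixed 1/k² filter with learned frequency filters; variable names, conventions, and the cutoff below are illustrative):

```python
# Minimal sketch of the reciprocal-space (long-range) part of Ewald summation,
# not the learned message passing from the paper.
import numpy as np

def ewald_long_range_energy(positions, charges, cell, alpha=0.3, k_cutoff=5):
    """positions: [N,3] Cartesian coords, charges: [N], cell: [3,3] rows = lattice vectors."""
    recip = 2 * np.pi * np.linalg.inv(cell).T        # reciprocal lattice vectors (rows)
    volume = abs(np.linalg.det(cell))
    ints = np.arange(-k_cutoff, k_cutoff + 1)
    energy = 0.0
    for nx in ints:
        for ny in ints:
            for nz in ints:
                if nx == ny == nz == 0:
                    continue
                k = nx * recip[0] + ny * recip[1] + nz * recip[2]
                k2 = k @ k
                # structure factor S(k) = sum_i q_i * exp(i k . r_i)
                s_k = np.sum(charges * np.exp(1j * positions @ k))
                # Gaussian damping splits short- and long-range contributions
                energy += np.exp(-k2 / (4 * alpha**2)) / k2 * abs(s_k) ** 2
    return 2 * np.pi / volume * energy
```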
A similar idea of using Ewald summation for 3D crystals is used in PotNet by Lin et al. where the long-range connection is modeled with incomplete Bessel functions. PotNet has been evaluated on the Materials Project dataset and JARVIS — so by reading these two papers you can get a good understanding of the benefits of Ewald summation for many crystal-related tasks 🙂
➡️ Another take on imbuing any GNN with equivariance for crystals and molecules is given by Duval, Schmidt et al. in FAENet. A standard approach is to bake certain symmetries and equivariances right into the GNN architecture (as in EGNN, GemNet, and Ewald Message Passing); this is a safe but computationally expensive route (especially when dealing with spherical harmonics and tensor products). Another option, often used in vision, is to show the model many augmentations of the same input so that it eventually learns invariance to those augmentations. The authors take the second route and design a rigorous way to sample invariant or equivariant augmentations of 2D/3D data (e.g., for energies or forces, respectively), all with fancy proofs ✍️. To this end, the data augmentation pipeline projects the 2D/3D inputs onto a canonical representation (based on a PCA of the distance covariance matrix) from which rotations can be sampled uniformly.
The proposed FAENet is a simple model that only uses distances but shows very good performance with stochastic frame averaging as the data augmentation, while being 6 to 20 times faster. It works for crystal structures, too!
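Here is a rough sketch of what stochastic frame averaging can look like (my own simplification, not the FAENet implementation: it uses a PCA of the covariance of centered atom positions and samples one proper rotation per forward pass):

```python
# Minimal sketch of stochastic frame averaging: build a canonical frame via PCA,
# then randomly sample one of the sign-ambiguous proper rotations.
import torch

def sample_frame(pos):
    """pos: [N, 3] atom coordinates. Returns coordinates in a randomly sampled PCA frame."""
    centered = pos - pos.mean(dim=0, keepdim=True)
    cov = centered.t() @ centered / pos.shape[0]          # 3x3 covariance of positions
    _, eigvecs = torch.linalg.eigh(cov)                   # columns = principal axes
    # each eigenvector's sign is arbitrary -> up to 8 frames; keep the 4 proper rotations
    frames = []
    for sx in (1, -1):
        for sy in (1, -1):
            for sz in (1, -1):
                R = eigvecs * torch.tensor([sx, sy, sz], dtype=pos.dtype)
                if torch.det(R) > 0:                       # det = +1, i.e. a rotation
                    frames.append(R)
    R = frames[torch.randint(len(frames), (1,)).item()]   # stochastic: pick one frame
    return centered @ R                                    # coordinates in the sampled frame
```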