Sitemap

A list of all the posts and pages found on the site. For you robots out there, there is an XML version available for digesting as well.

Pages

Posts

portfolio

AutoPeptideML

Computational tool for building ML models for predicting peptide bioactivity automatically (https://github.com/IBM/AutoPeptideML).

BioBrigit

Hybrid machine learning and knowledge-based approach for the prediction of metal diffusion pathways through proteins (https://github.com/insilichem/BioBrigit).

Hestia-GOOD

Open source library for evaluating machine learning models in out-of-distribution generalization (https://github.com/IBM/Hestia-GOOD).

publications

How to build machine learning models able to extrapolate from standard to modified peptides

Published in Journal of Cheminformatics, 2025

This paper explores different design choices, including learning algorithm and representation technique, for building machine learning models that are able to extrapolate from one data distribution (standard peptides) to another (modified peptides). This study opens the door for new drug discovery campaigns by allowing scientists to leverage data that is cheaper to acquire to make predictions for more expensive compounds.

Download Paper

talks

Evaluation of partitioning algorithms for trustworthy out-of-distribution evaluation of machine learning models in biochemistry.

Published:

Machine learning models in scientific discovery are expected to make predictions in new, unseen scenarios, i.e., out-of-distribution. Machine learning model evaluation is usually performed by dividing a dataset into two mutually exclusive subsets: training and testing. Model parameters are fitted to the training subset and the evaluation is performed against the testing subset. The process of creating these subsets is called partitioning. Traditionally, the machine learning literature relies on random partitioning. The problem with this approach is that it assumes that the prediction scenario will be in-distribution, because random sampling is an in-distribution sampling. Recently, we introduced the concept of similarity partitioning as a method for correcting this assumption. Similarity partitioning algorithms ensure that the testing subset contains molecules different from those the model has been exposed to during training, and thus better simulate the real-world out-of-distribution scenario. However, it is not clear which algorithms are best suited for generating these testing subsets. Thus, we have conducted a systematic benchmark of different partitioning algorithms previously described in the literature and examined which ones can generate the most challenging test subsets. We also propose a new algorithm called CCPart. Our results show that the three best similarity partitioning algorithms are Butina, CCPart, and UMAP, where UMAP is limited to small drug-like organic molecules, whereas both Butina and CCPart can be applied to any type of entity (biosequences, 3D structures, small molecules, etc.). They also show that the choice of partitioning algorithm is dataset-dependent and that a prior analysis of both algorithms and similarity metrics needs to be performed. These results open the way for a more trustworthy evaluation of machine learning models in the biochemical domain, one that better estimates their real-world performance.
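To make the idea of similarity partitioning concrete, here is a minimal sketch in Python. It is not the CCPart, Butina, or UMAP implementation from the talk or from Hestia-GOOD. It only illustrates the general principle: link entities whose similarity exceeds a threshold, take the connected components of that graph as clusters, and send whole clusters to the testing subset. The 2-mer Jaccard similarity, the 0.3 threshold, the toy sequences, and all function names are assumptions chosen for illustration.

```python
from itertools import combinations


def kmer_set(seq, k=2):
    """Represent a sequence as the set of its overlapping k-mers."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity between two sets (1.0 means identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_clusters(seqs, threshold=0.3, k=2):
    """Group sequences into connected components of the similarity graph.

    Two sequences are linked when their similarity is >= threshold, so no
    pair of sequences from different clusters reaches the threshold.
    """
    parent = list(range(len(seqs)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    feats = [kmer_set(s, k) for s in seqs]
    for i, j in combinations(range(len(seqs)), 2):
        if jaccard(feats[i], feats[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def similarity_split(seqs, test_fraction=0.2, threshold=0.3, k=2):
    """Assign whole clusters to the test set, smallest clusters first."""
    clusters = sorted(similarity_clusters(seqs, threshold, k), key=len)
    target = test_fraction * len(seqs)
    train, test = [], []
    for cluster in clusters:
        if len(test) + len(cluster) <= target:
            test.extend(cluster)
        else:
            train.extend(cluster)
    return sorted(train), sorted(test)


if __name__ == "__main__":
    # Toy peptide-like sequences (hypothetical data).
    toy = ["AAAAAG", "AAAAGA", "AAAGAA", "AAAAAA", "KLKLKL",
           "LKLKLK", "WPWPWP", "PWPWPW", "DEDEDE", "CCHHCC"]
    train_idx, test_idx = similarity_split(toy)
    print("train:", [toy[i] for i in train_idx])
    print("test: ", [toy[i] for i in test_idx])
```

Because clusters are never divided between subsets, every test sequence is below the similarity threshold with respect to every training sequence, which is the property a random split cannot guarantee. The test subset can come out smaller than requested, or even empty, when the clusters are large; that is the kind of dataset dependence the talk refers to.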

Download Slides

teaching

Deep learning in biomedicine - SECUAH VII

Workshop, University of Alcala de Henares, 2022

I taught a cohort of biosciences students (undergraduate and graduate) about artificial intelligence, machine learning, and deep learning techniques and how to apply them to biomedical research. The workshop included a guided practical example in which every student built their own deep convolutional neural network for diagnosing skin lesions as either benign or cancerous. Materials can be found in this GitHub Repository.

Demonstrator Bioinformatics UCD (MEIN30240)

Undergraduate teaching, University College Dublin, School of Medicine, 2023

I’ve been a Demonstrator in the Bioinformatics UCD Module (MEIN30240) for two years (2023–2024).

Deep learning in biomedicine - SECUAH IX

Workshop, University of Alcala de Henares, 2024

Workshop titled “Modelos que aprenden el lenguaje de las moléculas” (Models that learn the language of molecules). The guided practical example allowed every student to fine-tune MolFormer-XL to build their own predictive model of small-molecule toxicity. Materials can be found in this GitHub Repository.

Models that learn biochemistry

Workshop, University of Oviedo, 2025

Workshop titled “Modelos que aprenden bioquímica” (Models that learn biochemistry). This workshop answers the question of what artificial intelligence is and how it can be used in biochemistry. The course spans a wide range of techniques and use cases, including the underlying models behind modern chatbots like ChatGPT and how these technologies can be applied to the modelling of biosequences and small drug-like organic molecules for drug discovery. It also included AI approaches for molecular docking (mainly DiffDock) as well as molecular dynamics with machine-learnt force fields. The workshop provided students with a complete experience, including practical sessions with guided code examples where they could build their own toxicity predictors from Chemical Language Models, dock antipsychotic drugs to a GPCR, and run and analyse a simulation of the folding of a peptide with 15 alanines. The workshop was attended by undergraduate students from Biology and Biotechnology majors as well as PhD students from different disciplines.