MS Thesis Defense

Enhancing diverse prediction in LLM on single cell transcriptomic data

Presenter
Junfan Chen
Date
13 May, 2025
Time
05:30 PM – 07:30 PM

Abstract:
Single-cell analysis has enabled transcriptomic profiling at cellular resolution, with single-cell RNA sequencing (scRNA-seq) providing rich data for functional characterization. As these datasets grow, large language models (LLMs) have been adapted to model gene expression dynamics. Among these, Geneformer, a BERT-based model applied to ranked gene sequences, has shown strong performance on multiple downstream tasks. However, Geneformer inherits two key limitations from natural language models, prediction repetition and frequency bias, which undermine biological plausibility. To address these limitations, we propose BertCABλ, a novel architecture incorporating a probability recalibration mechanism and similarity-based regularization to promote more diverse and biologically aligned predictions. Experimental results across several downstream tasks demonstrate that BertCABλ matches or exceeds Geneformer’s performance while requiring significantly less pretraining data: it achieves comparable accuracy with just 1 million single-cell profiles, compared to Geneformer’s 30 million. This highlights its efficiency and improved robustness in data-constrained settings.

Bio:
Junfan Chen earned his Bachelor's degree in Intelligent Medicine from Nankai University, China. He joined KAUST in 2023 as an MS/PhD student in the Bioengineering program.
His thesis, “Enhancing diverse prediction in LLM on single cell transcriptomic data,” investigates Geneformer, a language model for single-cell transcriptomic data, and modifies the model to improve its performance.

Event Quick Information

Venue
Building 4 - Level 5 - Room 5209