
BERT for Sequence Classification

Transformer-based language models applied to biological sequence classification, exploring gene function prediction and regulatory element annotation.

Tags: BERT · Transformers · NLP · PyTorch · Genomics

Leveraged transformer-based language models (BERT) to classify biological sequences, treating DNA/protein sequences as “language” for deep learning analysis.

Approach

  • Adapted the BERT architecture to biological sequence tokenization (see the k-mer sketch after this list)
  • Fine-tuned pre-trained models on curated genomic datasets (fine-tuning sketch also below)
  • Explored applications in gene function prediction and regulatory element annotation
  • Benchmarked against traditional sequence analysis methods (BLAST, profile HMMs)
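
A minimal sketch of the tokenization idea, assuming overlapping k-mers over the A/C/G/T alphabet with BERT-style special tokens; the window size k = 6, the vocabulary layout, and the example sequence are illustrative assumptions rather than the project's documented choices:

    from itertools import product

    def seq_to_kmers(seq: str, k: int = 6) -> list[str]:
        """Split a DNA sequence into overlapping k-mers (stride 1)."""
        seq = seq.upper()
        return [seq[i:i + k] for i in range(len(seq) - k + 1)]

    def build_kmer_vocab(k: int = 6) -> dict[str, int]:
        """Enumerate all 4^k possible k-mers plus BERT-style special tokens."""
        specials = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
        kmers = ["".join(p) for p in product("ACGT", repeat=k)]
        return {tok: i for i, tok in enumerate(specials + kmers)}

    vocab = build_kmer_vocab(k=6)
    tokens = ["[CLS]"] + seq_to_kmers("ATGCGTACGTTAG", k=6) + ["[SEP]"]
    input_ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]

Overlapping k-mers preserve local sequence context at the cost of a 4^k vocabulary, which is why small k values are typical for DNA language models.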

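A hedged fine-tuning sketch built on the Hugging Face Trainer API; the bert-base-uncased checkpoint, the two-class setup, the toy batch, and every hyperparameter here are placeholders, not the project's actual configuration:

    import torch
    from transformers import BertForSequenceClassification, Trainer, TrainingArguments

    class SequenceDataset(torch.utils.data.Dataset):
        """Wraps pre-tokenized input IDs and integer class labels."""
        def __init__(self, input_ids, labels):
            self.input_ids = input_ids  # already padded to equal length
            self.labels = labels
        def __len__(self):
            return len(self.labels)
        def __getitem__(self, idx):
            return {
                "input_ids": torch.tensor(self.input_ids[idx]),
                "labels": torch.tensor(self.labels[idx]),
            }

    # Toy pre-tokenized batch standing in for a curated genomic dataset.
    train_ids = [[2, 17, 42, 3, 0, 0], [2, 99, 7, 3, 0, 0]]
    train_labels = [0, 1]

    model = BertForSequenceClassification.from_pretrained(
        "bert-base-uncased",  # placeholder; a domain-pretrained checkpoint fits better
        num_labels=2,         # placeholder class count
    )
    args = TrainingArguments(
        output_dir="out",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        learning_rate=2e-5,
    )
    Trainer(model=model, args=args,
            train_dataset=SequenceDataset(train_ids, train_labels)).train()

Pairing a checkpoint with the custom k-mer vocabulary above would additionally require resizing the model's embedding layer (model.resize_token_embeddings).
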
Technical Stack

  • PyTorch for model training and inference
  • Hugging Face Transformers for the BERT architecture and pre-trained checkpoints
  • Custom tokenizers for biological sequence encoding
  • Precision, recall, and F1 evaluation against gold-standard annotations (see the metrics sketch below)
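
A minimal sketch of that evaluation step using scikit-learn, assuming model predictions and gold-standard labels are available as integer arrays; the values shown are placeholders:

    from sklearn.metrics import precision_recall_fscore_support

    y_true = [0, 1, 1, 0, 1]   # placeholder gold-standard annotations
    y_pred = [0, 1, 0, 0, 1]   # placeholder model predictions

    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary"
    )
    print(f"precision={precision:.2f}  recall={recall:.2f}  F1={f1:.2f}")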