Jason Cui
Completed, 2026

Self-Supervised Vision Transformers for High-Recall Malaria Detection in Blood Smear Images

This paper compares three machine learning approaches (a DINOv2 Vision Transformer, an EfficientNet CNN, and a logistic regression baseline) for classifying parasitized and uninfected blood cell images from the NIH Malaria Cell Images dataset. The key finding is that DINOv2 achieves the highest recall and keeps its sensitivity advantage at low prevalence. With minimal fine-tuning, it also produces interpretable attention maps that concentrate on parasite regions.

Python · PyTorch · EfficientNet-B0 · DINOv2 (ViT-S/14) · scikit-learn · Attention Maps · UMAP

Problem & Motivation

Malaria remains a massive global health burden — 282 million cases and 610,000 deaths across 80 countries in 2024, with 95% concentrated in sub-Saharan Africa. The diagnostic gold standard, microscopic examination of Giemsa-stained blood smears, is accurate but slow, expensive, and highly dependent on trained technicians. In low-resource, high-burden settings, this creates a critical mismatch between where expertise is needed and where it exists. This project asks whether machine learning can close that gap — and more specifically, which type of model is best suited for high-recall clinical screening.

Methods

All models were trained on the NIH Malaria Cell Images dataset (27,558 Giemsa-stained erythrocytes, 50/50 class split, 70/15/15 train/val/test). Logistic regression used five color features extracted per cell. EfficientNet-B0 and DINOv2 both followed a two-stage transfer learning protocol — frozen backbone first, selective unfreezing second — with the deep models evaluated at five prevalence levels (5–50%) to simulate realistic screening populations.
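The prevalence sweep can be illustrated with a small sketch. The write-up does not say how the lower-prevalence test sets were built, so this assumes one common approach: keep every uninfected cell and randomly subsample parasitized cells until the target prevalence is reached. The function names `resample_to_prevalence` and `recall_and_specificity` are hypothetical, as is the label convention (1 = parasitized, 0 = uninfected).

```python
import random


def resample_to_prevalence(labels, prevalence, seed=0):
    """Return sorted indices of a subset whose positive fraction is ~prevalence.

    Assumption: all negatives (label 0) are kept and positives (label 1)
    are subsampled without replacement. This is an illustrative choice,
    not necessarily the procedure used in the project.
    """
    if not 0 < prevalence < 1:
        raise ValueError("prevalence must be strictly between 0 and 1")
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Solve n_pos / (n_pos + n_neg) = prevalence for n_pos.
    n_pos = round(prevalence * len(neg) / (1 - prevalence))
    n_pos = min(n_pos, len(pos))  # cannot draw more positives than exist
    return sorted(rng.sample(pos, n_pos) + neg)


def recall_and_specificity(labels, scores, threshold=0.5):
    """Compute recall and specificity when predicting positive if score >= threshold."""
    tp = fn = tn = fp = 0
    for y, s in zip(labels, scores):
        predicted_positive = s >= threshold
        if y == 1:
            if predicted_positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_positive:
                fp += 1
            else:
                tn += 1
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return recall, specificity
```

In use, one would call `resample_to_prevalence` on the held-out test labels for each prevalence level, then pass the selected labels and model scores to `recall_and_specificity`. Averaging over several seeds would reduce the sampling noise that comes from drawing only a few positives at the 5% level.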

Results/Conclusion

All models exceeded 0.95 accuracy and 0.98 AUC at the standard threshold, with the logistic regression baseline nearly matching the deep models, which suggests the Giemsa stain's color signal is largely linearly separable. DINOv2 achieved the highest recall (0.977) while EfficientNet was more balanced (recall 0.963, specificity 0.965). The most important finding came from the prevalence sweep: as prevalence dropped to 5%, EfficientNet recall fell to 0.881 while DINOv2 held at 0.917. At the default threshold, EfficientNet would therefore miss roughly one in eight infections, and DINOv2 roughly one in twelve. AUC barely moved across the same range, showing it would have masked the problem entirely. The central takeaway is that threshold calibration to deployment prevalence matters as much as model choice, and that DINOv2's self-supervised pretraining produces superior recall and robustness under realistic low-prevalence conditions.
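Threshold calibration can be sketched in a few lines. The write-up does not describe a specific calibration procedure, so this is one simple option under stated assumptions: given validation labels and model scores (ideally from a validation set that reflects the expected deployment prevalence), pick the largest threshold that still reaches a chosen recall target. The name `threshold_for_recall` and the convention that a cell is flagged when `score >= threshold` are hypothetical.

```python
import math


def threshold_for_recall(labels, scores, target_recall):
    """Return the largest threshold t such that flagging score >= t
    reaches at least target_recall on the supplied data.

    Assumption: labels use 1 = parasitized, 0 = uninfected, and higher
    scores mean "more likely parasitized".
    """
    if not 0 < target_recall <= 1:
        raise ValueError("target_recall must be in (0, 1]")
    pos_scores = sorted(
        (s for y, s in zip(labels, scores) if y == 1), reverse=True
    )
    if not pos_scores:
        raise ValueError("no positive examples to calibrate on")
    # Number of positives that must be caught; the small epsilon guards
    # against floating-point error pushing an exact product above an integer.
    k = math.ceil(target_recall * len(pos_scores) - 1e-9)
    k = max(k, 1)
    # The k-th highest positive score is the loosest cut that catches k positives.
    return pos_scores[k - 1]
```

Lowering the threshold this way trades specificity for recall, so the resulting false-positive rate should be checked on the same validation data before the threshold is adopted.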

Final Report

Collaborators

Artemis Xu