Robust Testing for Deep Learning using Human Label Noise

Gordon Lim1, Stefan Larson2, Kevin Leach2

1University of Michigan, 2Vanderbilt University

Paper Code Dataset

We trained 1500 ResNet models to compute label memorization scores for CIFAR-10N human noisy labels.
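Concretely, the held-out estimator compares how often models predict an example's (possibly noisy) label when the example was in their training subset versus when it was held out, averaged over many independently trained models. A minimal sketch of that estimator, assuming the per-model inclusion and correctness matrices have already been collected (the function name and interface here are our own, not the paper's):

```python
import numpy as np

def memorization_scores(in_mask, correct):
    """Held-out memorization estimate.

    in_mask : (n_models, n_examples) bool -- True if example j was in
              model i's training subset.
    correct : (n_models, n_examples) bool -- True if model i predicted
              example j's (possibly noisy) label.
    Returns a per-example score:
        P[correct | in training subset] - P[correct | held out].
    High scores indicate labels the model only fits by memorization.
    """
    in_mask = np.asarray(in_mask, dtype=bool)
    correct = np.asarray(correct, dtype=float)
    # Guard against division by zero when an example is always (or never) included.
    p_in = (correct * in_mask).sum(axis=0) / np.maximum(in_mask.sum(axis=0), 1)
    p_out = (correct * ~in_mask).sum(axis=0) / np.maximum((~in_mask).sum(axis=0), 1)
    return p_in - p_out
```

An example that every model gets right regardless of inclusion scores 0, while an example that is only predicted correctly when trained on scores close to 1.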

Overview

Human label noise challenges deep learning models, often degrading performance more than synthetic label noise.

We study label memorization in CIFAR-10N using held-out estimation to understand this effect. Unlike the recently adopted Polynomial Margin Diminishing (PMD) noise model, which generates feature-dependent noise along a model's decision boundary, our findings reveal that challenging human noisy labels can form tight subclusters in CLIP feature space.

Leveraging these insights, we propose Cluster-Based Noise (CBN), a method for simulating human label noise. Finally, we introduce Soft Neighbor Label Sampling (SNLS), which improves performance under CBN.

CBN vs PMD

Comparison of noise functions at the same noise rate
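To make the contrast concrete, the cluster-based idea can be sketched as follows: instead of flipping labels uniformly at random or along a decision boundary, corrupt tight neighborhoods of the feature space together. This is only an illustrative sketch, not the paper's exact algorithm; the seed selection, neighborhood size `k`, and choice of wrong class are all our own assumptions.

```python
import numpy as np

def cluster_based_noise(features, labels, n_classes, noise_rate, k=50, seed=0):
    """Illustrative cluster-based label noise (sketch, not the paper's exact CBN).

    Repeatedly pick an unflipped seed example, then flip it together with its
    k nearest still-unflipped neighbors in feature space to one shared wrong
    class, until the target noise budget is spent. This yields tight
    subclusters of identically mislabeled examples.
    """
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    n = len(labels)
    budget = int(noise_rate * n)
    flipped = np.zeros(n, dtype=bool)
    while flipped.sum() < budget:
        candidates = np.flatnonzero(~flipped)
        seed_idx = rng.choice(candidates)
        # Distances from the seed to all still-clean examples (seed included, d=0).
        d = np.linalg.norm(features[candidates] - features[seed_idx], axis=1)
        take = candidates[np.argsort(d)[: min(k + 1, budget - flipped.sum())]]
        # One shared wrong class for the whole subcluster; a neighbor whose true
        # label happens to equal it keeps its label, so the realized noise rate
        # can fall slightly below the budget.
        wrong = rng.choice([c for c in range(n_classes) if c != labels[seed_idx]])
        noisy[take] = wrong
        flipped[take] = True
    return noisy
```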

Soft Neighbor Label Sampling (SNLS)

SNLS generates a soft label distribution by leveraging the labels of the 100 nearest neighbors in CLIP space. Under our cluster-based noise assumption, incorporating richer label information from more distant neighbors in the CLIP feature space can provide signals about the true label.

SNLS Process Illustration
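The neighbor-voting step above can be sketched in a few lines, assuming uniform weighting over the k nearest neighbors and brute-force pairwise distances (the paper's exact sampling and weighting scheme may differ, and a large-scale implementation would use an approximate nearest-neighbor index rather than an n x n distance matrix):

```python
import numpy as np

def snls_soft_labels(features, noisy_labels, n_classes, k=100):
    """Sketch of Soft Neighbor Label Sampling: build a soft label for each
    example from the (noisy) labels of its k nearest neighbors in a frozen
    feature space (CLIP embeddings in the paper).
    """
    n = len(noisy_labels)
    # Pairwise Euclidean distances; fine for small n, use an ANN index at scale.
    d = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # exclude each example from its own neighborhood
    k = min(k, n - 1)
    nn = np.argsort(d, axis=1)[:, :k]
    soft = np.zeros((n, n_classes))
    for i in range(n):
        # Normalized histogram of the neighbors' (noisy) labels.
        counts = np.bincount(noisy_labels[nn[i]], minlength=n_classes)
        soft[i] = counts / counts.sum()
    return soft
```

Under the cluster-based noise assumption, a mislabeled example sitting inside a clean region inherits a soft label dominated by its neighbors' correct class, which is the signal SNLS feeds to the downstream learner.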

We implement SNLS with LRA-Diffusion and evaluate against several other Learning with Noisy Labels (LNL) methods on CIFAR-10 and CIFAR-100 datasets with varying levels of CBN and PMD noise.

Test accuracy (%) of different methods on CIFAR-10 and CIFAR-100 with varying noise levels and noise types (mean ± std; each cell reports PMD / CBN):

| Method | CIFAR-10, 35% noise | CIFAR-10, 70% noise | CIFAR-100, 35% noise | CIFAR-100, 70% noise |
| --- | --- | --- | --- | --- |
| Standard | 84.40 ± 0.18 / 75.44 ± 0.13 | 46.59 ± 0.33 / 27.22 ± 0.21 | 63.42 ± 0.15 / 46.17 ± 0.08 | 47.13 ± 0.13 / 17.48 ± 0.24 |
| Co-teaching+ | 67.08 ± 0.20 / 60.98 ± 0.45 | 35.35 ± 0.70 / 18.32 ± 0.14 | 55.09 ± 0.15 / 39.08 ± 0.11 | 39.36 ± 0.03 / 12.18 ± 0.09 |
| GCE | 84.70 ± 0.10 / 77.73 ± 0.28 | 39.06 ± 0.66 / 25.16 ± 0.45 | 63.08 ± 0.25 / 39.60 ± 0.56 | 43.00 ± 0.25 / 12.59 ± 0.41 |
| PLC | 86.11 ± 0.02 / 80.51 ± 0.19 | 42.66 ± 2.08 / 23.06 ± 4.08 | 62.23 ± 0.17 / 42.67 ± 0.15 | 47.86 ± 0.24 / 12.69 ± 0.37 |
| LRA-Diffusion | 97.12 ± 0.10 / 91.74 ± 0.48 | 47.17 ± 2.00 / 18.60 ± 1.29 | 77.86 ± 0.43 / 50.34 ± 0.34 | 57.18 ± 0.81 / 11.76 ± 0.24 |
| LRA-Diffusion+SNLS | 97.31 ± 0.03 / 92.77 ± 0.18 | 49.16 ± 2.01 / 19.05 ± 0.49 | 78.89 ± 0.28 / 58.80 ± 0.51 | 62.41 ± 0.51 / 15.13 ± 0.20 |

BibTeX

If you use our research in your work, please cite us:

@inproceedings{lim25snls,
    author = {Lim, Gordon and Larson, Stefan and Leach, Kevin},
    title = {Robust Testing for Deep Learning using Human Label Noise},
    year = {2025},
    series = {DeepTest '25}
}