Robust Testing for Deep Learning using Human Label Noise

Gordon Lim1, Stefan Larson2, Kevin Leach2

1University of Michigan, 2Vanderbilt University

Paper Code Dataset

We trained 1500 ResNet models to compute label memorization scores for CIFAR-10N human noisy labels.
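Concretely, the held-out estimator compares how often models predict an example's (possibly noisy) label when the example was in their training subset versus when it was held out, averaged over many independently trained models. A minimal sketch of that estimator, assuming the per-model inclusion and correctness matrices have already been collected (the function name and interface here are our own, not the paper's):

```python
import numpy as np

def memorization_scores(in_mask, correct):
    """Held-out memorization estimate.

    in_mask : (n_models, n_examples) bool -- True if example j was in
              model i's training subset.
    correct : (n_models, n_examples) bool -- True if model i predicted
              example j's (possibly noisy) label.
    Returns a per-example score:
        P[correct | in training subset] - P[correct | held out].
    High scores indicate labels the model only fits by memorization.
    """
    in_mask = np.asarray(in_mask, dtype=bool)
    correct = np.asarray(correct, dtype=float)
    # Guard against division by zero when an example is always (or never) included.
    p_in = (correct * in_mask).sum(axis=0) / np.maximum(in_mask.sum(axis=0), 1)
    p_out = (correct * ~in_mask).sum(axis=0) / np.maximum((~in_mask).sum(axis=0), 1)
    return p_in - p_out
```

An example that every model gets right regardless of inclusion scores 0, while an example that is only predicted correctly when trained on scores close to 1.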

Overview

Human label noise challenges deep learning models, often degrading performance more than synthetic label noise.

We study label memorization in CIFAR-10N using held-out estimation to understand this effect. Unlike the recently adopted Polynomial Margin Diminishing (PMD) noise model, which generates feature-dependent noise along a model's decision boundary, our findings reveal that challenging human noisy labels can form tight subclusters in CLIP feature space.

Leveraging these insights, we propose Cluster-Based Noise (CBN), a method for simulating human label noise. Finally, we introduce Soft Neighbor Label Sampling (SNLS), which improves performance under CBN.

CBN vs PMD

Comparison of noise functions at the same noise rate
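To make the contrast concrete, the cluster-based idea can be sketched as follows: instead of flipping labels uniformly at random or along a decision boundary, corrupt tight neighborhoods of the feature space together. This is only an illustrative sketch, not the paper's exact algorithm; the seed selection, neighborhood size `k`, and choice of wrong class are all our own assumptions.

```python
import numpy as np

def cluster_based_noise(features, labels, n_classes, noise_rate, k=50, seed=0):
    """Illustrative cluster-based label noise (sketch, not the paper's exact CBN).

    Repeatedly pick an unflipped seed example, then flip it together with its
    k nearest still-unflipped neighbors in feature space to one shared wrong
    class, until the target noise budget is spent. This yields tight
    subclusters of identically mislabeled examples.
    """
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    n = len(labels)
    budget = int(noise_rate * n)
    flipped = np.zeros(n, dtype=bool)
    while flipped.sum() < budget:
        candidates = np.flatnonzero(~flipped)
        seed_idx = rng.choice(candidates)
        # Distances from the seed to all still-clean examples (seed included, d=0).
        d = np.linalg.norm(features[candidates] - features[seed_idx], axis=1)
        take = candidates[np.argsort(d)[: min(k + 1, budget - flipped.sum())]]
        # One shared wrong class for the whole subcluster; a neighbor whose true
        # label happens to equal it keeps its label, so the realized noise rate
        # can fall slightly below the budget.
        wrong = rng.choice([c for c in range(n_classes) if c != labels[seed_idx]])
        noisy[take] = wrong
        flipped[take] = True
    return noisy
```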

Soft Neighbor Label Sampling (SNLS)

SNLS generates a soft label distribution by leveraging the labels of the 100 nearest neighbors in CLIP space. Under our cluster-based noise assumption, incorporating richer label information from more distant neighbors in the CLIP feature space can provide signals about the true label.

SNLS Process Illustration
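The neighbor-voting step above can be sketched in a few lines, assuming uniform weighting over the k nearest neighbors and brute-force pairwise distances (the paper's exact sampling and weighting scheme may differ, and a large-scale implementation would use an approximate nearest-neighbor index rather than an n x n distance matrix):

```python
import numpy as np

def snls_soft_labels(features, noisy_labels, n_classes, k=100):
    """Sketch of Soft Neighbor Label Sampling: build a soft label for each
    example from the (noisy) labels of its k nearest neighbors in a frozen
    feature space (CLIP embeddings in the paper).
    """
    n = len(noisy_labels)
    # Pairwise Euclidean distances; fine for small n, use an ANN index at scale.
    d = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # exclude each example from its own neighborhood
    k = min(k, n - 1)
    nn = np.argsort(d, axis=1)[:, :k]
    soft = np.zeros((n, n_classes))
    for i in range(n):
        # Normalized histogram of the neighbors' (noisy) labels.
        counts = np.bincount(noisy_labels[nn[i]], minlength=n_classes)
        soft[i] = counts / counts.sum()
    return soft
```

Under the cluster-based noise assumption, a mislabeled example sitting inside a clean region inherits a soft label dominated by its neighbors' correct class, which is the signal SNLS feeds to the downstream learner.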

We implement SNLS with LRA-Diffusion and evaluate against several other Learning with Noisy Labels (LNL) methods on CIFAR-10 and CIFAR-100 datasets with varying levels of CBN and PMD noise.

Test accuracy (%) of different methods on CIFAR-10 and CIFAR-100 with varying noise levels and noise types (mean ± std; each cell reports PMD / CBN):

| Method | CIFAR-10, 35% noise | CIFAR-10, 70% noise | CIFAR-100, 35% noise | CIFAR-100, 70% noise |
| --- | --- | --- | --- | --- |
| Standard | 84.40 ± 0.18 / 75.44 ± 0.13 | 46.59 ± 0.33 / 27.22 ± 0.21 | 63.42 ± 0.15 / 46.17 ± 0.08 | 47.13 ± 0.13 / 17.48 ± 0.24 |
| Co-teaching+ | 67.08 ± 0.20 / 60.98 ± 0.45 | 35.35 ± 0.70 / 18.32 ± 0.14 | 55.09 ± 0.15 / 39.08 ± 0.11 | 39.36 ± 0.03 / 12.18 ± 0.09 |
| GCE | 84.70 ± 0.10 / 77.73 ± 0.28 | 39.06 ± 0.66 / 25.16 ± 0.45 | 63.08 ± 0.25 / 39.60 ± 0.56 | 43.00 ± 0.25 / 12.59 ± 0.41 |
| PLC | 86.11 ± 0.02 / 80.51 ± 0.19 | 42.66 ± 2.08 / 23.06 ± 4.08 | 62.23 ± 0.17 / 42.67 ± 0.15 | 47.86 ± 0.24 / 12.69 ± 0.37 |
| LRA-Diffusion | 97.12 ± 0.10 / 91.74 ± 0.48 | 47.17 ± 2.00 / 18.60 ± 1.29 | 77.86 ± 0.43 / 50.34 ± 0.34 | 57.18 ± 0.81 / 11.76 ± 0.24 |
| LRA-Diffusion+SNLS | 97.31 ± 0.03 / 92.77 ± 0.18 | 49.16 ± 2.01 / 19.05 ± 0.49 | 78.89 ± 0.28 / 58.80 ± 0.51 | 62.41 ± 0.51 / 15.13 ± 0.20 |

BibTeX

If you use our research in your work, please cite us:

@inproceedings{lim25snls,
    author = {Lim, Gordon and Larson, Stefan and Leach, Kevin},
    title = {Robust Testing for Deep Learning using Human Label Noise},
    year = {2025},
    series = {DeepTest '25}
}