
Write a PREreview

An Empirical Investigation into Fine-Tuning Methodologies: A Comparative Benchmark of LoRA for Vision Transformers

Published
Server
Preprints.org
DOI
10.20944/preprints202510.2514.v1

In modern computer vision, the standard practice for solving new problems is not to train a model from scratch but to adapt a massive, pre-trained model. This process, known as fine-tuning, is a cornerstone of deep learning. This paper investigates two fundamentally different philosophies for fine-tuning. The first is the traditional, widely used approach in which the core of a pre-trained Convolutional Neural Network (CNN) is kept frozen and only a small, new classification layer is trained; this method is fast and computationally cheap. The second is a more modern, parameter-efficient fine-tuning (PEFT) technique called Low-Rank Adaptation (LoRA), which allows a deeper adaptation of a model's internal workings without the immense cost of full retraining.

To test these competing methods, we designed a benchmark using three powerful, pre-trained models. For the traditional approach, we used ResNet50 and EfficientNet-B0, two highly influential CNNs. For the modern approach, we used a Vision Transformer (ViT), an architecture that processes images with a self-attention mechanism, and adapted it with LoRA. We then evaluated these models on three datasets of increasing complexity: the simple MNIST (handwritten digits), the moderately complex Fashion-MNIST (clothing items), and the significantly more challenging CIFAR-10 (color photos of objects such as cars, dogs, and ships). This ladder of complexity was designed to reveal under which conditions each fine-tuning strategy excels or fails.

The outcomes were striking on the intricate CIFAR-10 dataset: the ViT-LoRA model performed exceptionally well, achieving 97.3% validation accuracy, while the ResNet50 and EfficientNet-B0 models managed only 83.4% and 80.5%, respectively. On the much easier MNIST dataset, however, the task was not difficult enough to distinguish between the models, and all of them achieved nearly flawless accuracy of about 99%. Critically, the ViT-LoRA model's superior performance came with remarkable efficiency: the LoRA method required training only 0.58% of the total model parameters, a tiny fraction of the 8-11% of parameters that had to be trained for the traditional CNN approach.

This leads to the central conclusion of our work: LoRA is not just a more efficient method, but a more effective one. For complex, real-world tasks, the ability to adapt a model's internal representations, as LoRA does for the ViT's attention layers, provides a decisive performance advantage that the rigid, classifier-only fine-tuning method cannot match.
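As a rough illustration of the classifier-only baseline described in the abstract, the sketch below freezes a pre-trained ResNet50 backbone and trains only a new classification head. This is not the authors' code; the torchvision weights, dataset size, and learning rate are assumptions made for the example.

```python
# Minimal sketch (assumed setup, not the paper's implementation):
# freeze a pre-trained ResNet50 and train only a new classification head.
import torch
import torch.nn as nn
from torchvision import models

num_classes = 10  # e.g., CIFAR-10

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Freeze every pre-trained parameter in the backbone.
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer with a new, trainable classifier.
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Only the new head's parameters are passed to the optimizer.
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)
```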
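Similarly, the ViT-LoRA setup can be sketched with the Hugging Face transformers and peft libraries. Again, this is not the paper's implementation: the checkpoint name, rank, scaling factor, and target modules are illustrative assumptions, chosen only to show how LoRA adapters are injected into the attention projections while the backbone stays frozen.

```python
# Minimal sketch (assumed configuration, not the paper's exact setup):
# adapt a Vision Transformer with LoRA on its attention projections.
from transformers import ViTForImageClassification
from peft import LoraConfig, get_peft_model

model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k",  # assumed checkpoint
    num_labels=10,                         # e.g., CIFAR-10
)

lora_config = LoraConfig(
    r=8,                                # low-rank dimension (assumed)
    lora_alpha=16,                      # scaling factor (assumed)
    target_modules=["query", "value"],  # attention projections in each ViT block
    lora_dropout=0.1,
    modules_to_save=["classifier"],     # also train the new classification head
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # reports the small trainable fraction
```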
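Either way, a trainable-parameter fraction such as the reported 0.58% (ViT-LoRA) versus 8-11% (CNN heads) can be measured with a few lines of generic PyTorch. The helper below is an assumption about how such a figure might be computed, not the authors' measurement code.

```python
import torch.nn as nn

def trainable_fraction(model: nn.Module) -> float:
    """Percentage of a model's parameters that the optimizer will actually update."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return 100.0 * trainable / total

# Usage (with models built as in the sketches above):
#   trainable_fraction(model)       -> on the order of the 8-11% quoted for the frozen CNNs
#   trainable_fraction(peft_model)  -> on the order of the 0.58% quoted for ViT-LoRA
```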

You can write a PREreview of An Empirical Investigation into Fine-Tuning Methodologies: A Comparative Benchmark of LoRA for Vision Transformers. A PREreview is a review of a preprint and can range from a few sentences to an extensive report, similar to a peer-review report organized by a journal.

Before you begin

We will ask you to log in with your ORCID iD. If you do not have an iD, you can create one.

What is an ORCID iD?

An ORCID iD is a unique identifier that distinguishes you from others with the same or a similar name.

Start now