Should we consider LoRA an “adapter”?

Is LoRA an adapter? Historically no. But should we call it one?

Introduction

This is a post purely about naming conventions, but it might be a worthwhile resource and a useful discussion to have. Especially with the large wave of newcomers to the field of NLP after the release of ChatGPT, I think it is worth going the extra mile to ensure we get naming conventions right.

The impetus for this post is a discussion I had with Carson Poole over on the EleutherAI Discord, and I figured that it would be useful to write all this down for future reference. Many of the points here were raised in our discussion by either myself or Carson.

I’ll start by laying out why my initial position was that LoRA is not an Adapter, and then circle back to arguments for why, as a field, we might be okay with calling it an adapter, actually.

Background

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning (PEFT) method published in 2021, which involves learning a low-rank additive modification to linear layers in a Transformer. Although it is a fairly general PEFT method, it was initially applied to BERT-type and GPT-2 models. It saw an immense resurgence in popularity after its application to Stable Diffusion models in 2022, followed by a second wave of interest with the dissemination of the LLaMA models and the ensuing open-source activity around fine-tuning LLaMA.

(Side note: the initial version of LoRA only fine-tuned the Q and V linear mappings (with some additional experiments on all four attention linear layers). However, later work has found that LoRA truly shines when you apply it to all linear and even embedding layers in a Transformer (e.g. in QLoRA).)
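
To make the mechanics concrete, here is a minimal PyTorch sketch of a LoRA-augmented linear layer. The class name and the r/alpha defaults are my own illustrative choices, not the paper’s or any library’s reference implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Hypothetical illustration: a frozen nn.Linear plus a trainable low-rank update.
    # Names and defaults are assumptions for this sketch, not the official implementation.
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the pretrained weights stay frozen
        # Low-rank factors: delta_W = B @ A, with rank r much smaller than the layer dims
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at step 0
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # frozen W x  +  scaled low-rank update (B A) x
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
```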

The original Adapter was introduced in 2019. Initially applied to BERT, it involves adding residual low-rank MLP modules to a frozen Transformer model and training only those new parameters. Since the release of subsequent Adapter variants, it has sometimes been retroactively referred to as a Houlsby Adapter, after its first author.
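
For comparison, a Houlsby-style adapter is roughly a bottleneck MLP with a residual connection, inserted after a frozen Transformer sub-layer. The sketch below is my own illustrative rendering (the class name and sizes are assumptions, not the authors’ code):

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    # Illustrative sketch of a Houlsby-style adapter block, not the authors' code.
    # Down-projection -> non-linearity -> up-projection, with a residual connection;
    # only these new weights are trained while the surrounding Transformer stays frozen.
    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Residual path around the low-rank MLP
        return hidden_states + self.up(self.act(self.down(hidden_states)))
```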

Why LoRA is not an Adapter.

Simply put, LoRA and Adapters are separate methods.

I believe a big reason why LoRA has been called an “adapter” informally is the similarity of the name: it is easy to confuse “Low-Rank Adaptation” for “Low-Rank Adapters”, as seen here and here. However, the original paper never once refers to LoRA as an Adapter, only as an “adaptation method”. Given that there is another method named Adapters, and that the reference to LoRA as an Adapter appears to arise from a misconception, I think we should undo this mistake, put our foot down, and say that LoRA is not an Adapter.

Q: Isn’t it okay to casually refer to LoRA as an adapter? They’re kind of the same thing, right?

I disagree. They are two different methods, and we should distinguish between them. The field has enough issues with misconceptions and misunderstandings amid the wave of newcomers; we do not need to confuse things further by being imprecise about naming. We have already had problems with confusion between Prompt Tuning, Prefix Tuning, and P-Tuning/P-Tuning v2.

(This counter-argument also applies for “naming conventions change, just go with the times.” If naming conventions are a democracy, then it is also valid for me to campaign against a naming convention, as in this post.)

Why LoRA might be an adapter.

The above discussion captures my position for why LoRA should not be called an Adapter. However, I think there are some smaller but valid arguments for why we might be okay with calling LoRA an adapter.

1. Conceptually, LoRA is not that different from an Adapter.

If we zoom out and squint, LoRA is actually not much different from a generalized version of a Houlsby Adapter (which I think is a secondary reason for the confusion in naming). Both methods involve a low-rank projection and a residual path. The primary differences are that LoRA does not include a non-linearity, and that the two are placed differently (LoRA: within the linear layers; Houlsby Adapters: around the Attention and MLP blocks). Pfeiffer Adapters have different placements, so evidently placement is not a disqualifying factor.

(Note: The above refers to how LoRA is trained. At inference, LoRA has the substantial benefit that the learned update can be merged directly into the linear weights, adding no extra latency.)
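
As a rough sketch of what that merge looks like, reusing the factor shapes from the hypothetical LoRALinear example above:

```python
import torch

@torch.no_grad()
def merge_lora_into_linear(base: torch.nn.Linear, lora_A: torch.Tensor,
                           lora_B: torch.Tensor, scaling: float) -> None:
    # Fold the trained low-rank update into the frozen weight in place:
    #   W <- W + scaling * (B @ A)
    # Shapes assume the hypothetical LoRALinear sketch above:
    #   lora_B: (out_features, r), lora_A: (r, in_features)
    base.weight += scaling * (lora_B @ lora_A)
```

After merging, inference runs through the plain linear layer, with no added parameters or latency.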

2. Who decides these names anyway?

My position is that if the LoRA paper had been titled “Low-Rank Adapters for Large Language Models”, or if the authors had called LoRA an adapter in the paper, I would readily accept that LoRA is an adapter. Since they did not (and continue not to), I continue to consider them not adapters. (Question: have any of the LoRA authors referred to it as an adapter since?)

However, is relying on the authors for naming always valid? Consider the case of LLaMA-Adapters. At its core, the method is essentially a modified version of Prefix Tuning, with a separate softmax and a much better initialization. However, the authors put “adapter” in the method name. Are LLaMA-Adapters adapters? (I still think the answer is no.) Are we then in the awkward position of saying “it’s called LLaMA-Adapters but it’s not really an adapter”, or that it really should not even have been ‘allowed’ to put ‘adapter’ in its name? Houlsby Adapters and Pfeiffer Adapters were also not called as such in their original papers, yet the field has accepted those names.

3. Maybe we should just call all PEFT methods adapters.

To be honest, the PEFT name is kind of clunky. Even more so when I’m trying to refer to LoRA or Adapter parameters (PEFT parameters? Parameter-efficient parameters?). A neat alternative name I like is delta tuning, but it never really took off. In the interest of naming simplicity, maybe we could just call all PEFT methods adapters: it is shorter and immediately conveys the core idea.

LoRA is not an Adapter, but I guess it can be an adapter.

My current position (as of August 2023) is that while I think capital A “Adapters” are a proper name and thus should refer only to methods that are called Adapters in their original papers, I think I am leaning toward accepting lowercase A “adapters” as an informal substitute for PEFT. So in that sense, LoRA is an adapter, but not an Adapter. Yes, this is horribly nitpicky, and yes, this will still be confusing when spoken. Adapters (See? ambiguous! I’m referring to actual Adapters here) have themselves fallen out of popularity, with methods such as LoRA and prefix-tuning superseding them. (Okay, maybe I’m the only person who still cares about prefix-tuning, but it’s neat!) It’s probably okay to let the old name go, and let “adapter” become a genericized name. So here’s my (current) takeaway.

*LoRA is short for “Low-Rank Adaptation”, and LoRA is an adapter.*