# Introduction

I probably should call it time to start. Okay, so next up, I'm delighted to introduce Colin Raffel, one of the people who has actually built large language models and knows all the guts of how they work, and who has also set out on a very interesting, and rewarding for the community, research program where he's been very committed to making research in this area a community exercise, as he will talk about more.

# Speech Starts

Thanks. Yeah, I'm really happy to be here. I'm actually in the middle of a move, and when I was invited to speak here I decided I should just pause everything and come, because I think it's a really exciting group of people and an exciting topic. And on that note, I actually have to leave right after the sessions today to take delivery of my belongings. So if people want to follow up on any of this, I won't be around to chat in person, but I'm obviously happy to chat about any of this stuff offline. At any rate, what I'm going to try to do today is convince you that the way that we are building general-purpose AI systems is misguided. And I'm going to propose a different way to build these systems that may also be misguided, but at least it's different. If you think it's misguided, say so and maybe we can talk about it. But at the very least, maybe it presents some new ideas and a new way to do things. So let me get started.

Four or so years ago, this was the dominant way to tackle problems in NLP and lots of other application areas of machine learning. We'd start with a pre-trained model, this little rainbow neural network in the middle, that was pre-trained on some general task, maybe language modeling, or maybe some image classification task with lots of possible classes. And then if we wanted the model to perform a specific task, we would take the dataset that represents that task and fine-tune on it: we would just train the model on that task to create a specialized model, a model that's specialized for the task and hopefully works pretty well. This is an incredibly effective paradigm. Using pre-training makes the model converge more quickly to a better solution, typically with less labeled data. So this is just what everyone did, and it was, as I said, extremely effective, extremely useful, extremely valuable.

But these days, the way that we attack tasks is often quite different, because we have large language models, and large language models we can often treat as general-purpose systems that are reasonably competent at performing lots and lots of tasks. So we train a single model that hopefully is very good at performing many, many tasks and that generalizes to lots of tasks it hasn't been trained on. And instead of doing additional training, we just keep the single model fixed, and we ask the model to do what we want it to do, in the case of a language model, in natural language. So you say, please summarize the following article; are these questions asking the same thing? Basically asking questions, making requests, and then we get the language model to do the thing that we asked. And again, this is extremely effective, right? We have these giant models, we ask them to perform any task we could possibly think of in natural language, and to a reasonable extent, they actually kind of work. And the larger we make these models, the better they work. It's extremely predictable.
And so this has turned out to be the dominant paradigm, the way the field is moving towards building general-purpose AI systems. But today I'm going to talk about a, whoops. No, this is strange, it's actually missing a slide. Let me try to figure out why the slide is missing. Okay, so there's supposed to be, oh, I should mention this is the first time I'm giving this talk, and there's a slide missing, so I'll just describe it verbally. One thing that we expect when we build these systems is that we have a reasonably competent system, and if we want the system to become better, we throw it out and we build a new one from scratch that's a lot bigger. And that new system is often a lot better. This approach was exemplified by a comment made yesterday by Yin Tat, and I'm sorry to pick on you, but the comment was, well, we just need to wait for GPT-5, right? That is exactly what I mean: if we want a better model, we throw out the old model, we build a new one that's bigger, and we get these jumps in performance because we've made a jump in investment in terms of training the model. And again, this has been super effective. But let's talk about a different way that we might do this, okay?

Today, we're going to talk about ecosystems of specialized models. This paradigm doesn't really exist yet, but I think it could work, and I'm going to try to provide you with some evidence that it might work. But let me describe it at a high level first. The first thing you'll notice about this diagram, this cartoon that I made, is that everything looks exactly the same except the box in the middle. So we're still going to be trying to build general-purpose systems that can perform any task that we throw at them. The difference is that instead of having a single monolithic model in the middle that aims to perform everything, we're going to have an ecosystem, a collection of a ton of specialized models. Within this ecosystem, we are going to hope that by combining the models in different ways, we can perform basically any task that we throw at the ecosystem as a whole. Inside this ecosystem might be thousands of models, tens of thousands of models, hundreds of thousands of specialized models, and each model is really competent, let's say, at performing some specific task. So that's ingredient one. Ingredient two is that in order to treat the ecosystem as a kind of black box, like we do with large language models, we need a way to automatically select which specialized models we're going to use for a particular task. That's what these little lines represent.

And there was a question, so I'll stop really quick. Can you also host different models in different places, different locations? Yeah, absolutely. And actually at the very end, I'll talk a little bit about the systems aspect of how you might run inference on a system like this, but mostly I'm not going to be talking about that. I think there are interesting possible considerations for distributed inference and so on that simplify some of these things. I'll talk more generally about efficiency in a moment. But yeah, I think that would be entirely possible. So the ingredients are: we have thousands or hundreds of thousands of specialized models, and we have automatic routing among those models.
And also, I think we probably need a way to combine models to combine capabilities, because as Sanjeev pointed out earlier today, in order to perform some particular task, we often need to combine skills. So maybe it's unlikely that we'll have one model that can, you know, write poetry in the style of Tupac about astrophysics or something like that. Maybe we need to combine our poetry-writing model with our Tupac-generation model and our astrophysics-knowledge model or something like that. And I'll discuss that later. But the crucial thing, if we are able to build systems like this, is that because these specialized models are often a lot cheaper, and sometimes more performant, than the large general-purpose monolithic model, this entire system could be dramatically less computationally expensive and, hopefully, more effective, or at the very least as effective as the monolithic models that we're using today.

Do you have a question? Yeah, I like the ecosystem concept a lot. Just in terms of the potential implementation, are those going to be completely separate models, like GPT-4, FLAN, etc., or are we going to have one model that is tailored or adapted to different domains? Where do you see the balance there? And also, in terms of selection, doesn't this kind of reduce to a mixture of experts to some extent? I will touch on both of those things later in the talk. On the first question, I think this works either way, but some things become a little easier if most of the models use the same base model, and I'll discuss that later. And I'll also discuss connections to mixture of experts a bit later too. I have a question. It looks like this is some sort of biologically inspired architecture, where some of the neurons are sort of shared between different subnetworks. Is that the kind of direction that you're going? Yeah, I recognize that I'm using biologically loaded terms like ecosystem here, but any relation to biological concepts is accidental. I'm not a biologist. And yeah, I will get to that too.

So the missing slide looked like this, except that there were blue blobs that got bigger as they went from one system to the next, and we had big jumps in capabilities. I've used that slide in past talks, so maybe you've seen it. But for an ecosystem, this is what I think the abstract notion of capabilities, what the system can do over time, might look like as the ecosystem grows, as more models are introduced, recombined, added, removed, and so on. The point that I want to make here is that if we have an ecosystem that allows for collaborative development, lots of people working together to add to the ecosystem, combine things together, etc., we'll see continual improvements rather than these big steps of improvement that result from a sudden increase in investment. So now, for basically all of the rest of the talk, I'm going to be sketching out how and why we might build ecosystems of specialist models instead of monolithic models. Again, this talk is mostly speculative; I'm going to try to provide evidence that this might work. A lot of this evidence comes from other people's work, but I'll talk about my lab's work a little bit too, because a lot of it is related. Hopefully by the end of it, you'll be at least a tiny bit convinced that this is not a completely ridiculous idea. And I'll make a couple of points along the way.
The first point, which I think has already been made pretty well at this workshop, is that specialist models are often cheaper and sometimes better. We had two good examples of this yesterday in Yin Tat's talk and in Yejin's talk: the code model trained on textbook data and the paraphrasing model. Both of those were significantly smaller models that arguably worked better than the monolithic generalist model. But let me give you a couple more examples in case you're not totally convinced. This first example I hesitated to include because it's very out of date, since the monolithic model we're comparing to is GPT-3, which is horribly out of date now. But this is from a paper that we published last year at NeurIPS. The very high-level takeaway of this plot is that you can take a reasonable pre-trained model and fine-tune it with a pretty simple and standardized recipe, exactly the same recipe across every task, every dataset you apply it to. You take the examples that would be used for in-context learning with GPT-3, so maybe you're doing 32-shot in-context learning with GPT-3, and instead of doing few-shot in-context learning, you just use those examples as a training dataset to fine-tune your model. Then the resulting fine-tuned model works better, it gets higher accuracy, and it is dramatically cheaper. The reason it's dramatically cheaper is partly because it's a lot smaller: this base model has 11 billion parameters, while GPT-3 has up to 175 billion parameters. But also, few-shot in-context learning is ridiculously expensive because you have to process the in-context examples before you make a prediction. Whereas in this case, after you fine-tune, you can sort of think of it like you're doing zero-shot prediction, even though it doesn't quite make sense to say it that way; the point is you're just feeding in the query example and getting the prediction right away.

Wouldn't you have to consider the cost of fine-tuning in this example? Yes and no. No, because once you do fine-tuning, you've amortized the cost; you never have to do it again. You have your specialized model, and you only have to fine-tune once, whereas in some sense, when you do few-shot in-context learning, you're processing the dataset every time. But even if you consider the cost of fine-tuning, fine-tuning this model is about 16 times more expensive than processing a single example. So by the time you've done 16 predictions with GPT-3, you could have fine-tuned your model, you never have to do that again, and all of a sudden your model is 1,000 times faster. So this is an improvement of about 1,000 times in efficiency. I know in NLP we don't ever draw this x-axis anymore, we only draw this y-axis, but the x-axis is very important.

What's the base model that you're fine-tuning? Yeah, I'm not going into a ton of detail here because I'm making a high-level point, but this base model is called T0. It's a kind of multi-task variant of T5; you can just think of it as a pre-trained language model that's reasonably competent. In this case, these are a subset of tasks that were intentionally held out from T0's training mixture to cover a range of task types. I'm not going to remember them all, maybe Sasha actually remembers the list of tasks, but they're NLP datasets. And it is important to mention that, again, this is kind of out-of-date work because it's pre-ChatGPT work. It's not about open-ended dialogue or poetry generation, et cetera. It's NLI, really boring academic stuff. Okay.
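To make the amortization argument concrete, here is the break-even arithmetic implied by the two figures quoted above (fine-tuning costs roughly 16 GPT-3 predictions' worth of compute, and the fine-tuned model is roughly 1,000 times cheaper per prediction afterward). The symbols are just labels I'm introducing for this sketch:

```latex
% C_{icl}  : cost of one few-shot in-context prediction with the large model
% C_{ft}   \approx 16\,C_{icl}      (one-time fine-tuning cost, figure quoted above)
% C_{spec} \approx C_{icl}/1000     (per-prediction cost of the small fine-tuned model)
% Break-even after N predictions:
\[
  C_{\mathrm{ft}} + N\,C_{\mathrm{spec}} \;<\; N\,C_{\mathrm{icl}}
  \quad\Longleftrightarrow\quad
  16\,C_{\mathrm{icl}} + \tfrac{N}{1000}\,C_{\mathrm{icl}} \;<\; N\,C_{\mathrm{icl}}
  \quad\Longleftrightarrow\quad
  N \gtrsim 16,
\]
% so after roughly 16 queries the specialist has paid for its own fine-tuning,
% and every subsequent prediction is about 1000x cheaper.
```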
Okay, so that's one data point. Now let's try to make it a little more up to date. Instead of talking about beating GPT-3, I'm just going to quote some results from the PaLM 2 paper. This is a paper that isn't even trying to convince you that specialized models can work well, and I'm just going to point out that they put these tables in the paper. This is kind of their common-sense-reasoning-benchmark-ish table, and this is their translation table. Among the baselines they consider are specialist models, because they're always quoting the state of the art. The "a", by the way, is GPT-4. So generalist models sometimes still do win. But for these two datasets in particular, where the gap is not insignificant, the state of the art is held by a specialist model, and these are like fine-tuned BERT models, very cheap, effective models. In translation, they also compare against Google Translate. Arguably, I mean, it's funny to call Google Translate a specialist model, but it can only perform translation, so it is a specialist model. And I don't actually know, because we don't know the system details of Google Translate or PaLM 2, but I suspect that inference with Google Translate is a lot cheaper. PaLM 2 does beat Google Translate, which again is conceptually interesting because it's a general-purpose monolithic model, not a specialist model, but the gap is very, very small, right? So we have a specialist model that's more efficient and works quite well. Again, this is picked out of a paper that isn't even trying to make this point. And I don't want to belabor this too much, because I think, hopefully at this point, you all can at least envision that specialist models can be cheaper and more effective. Couldn't you just beat that thing with BERT plus a calculator, right? Sure, yeah, and specialist models can use calculators or whatever.

These specialist models have other disadvantages, though. The size is good, and performance in these cases is good, but I understand that they're brittle; attacking them is easier. Yeah, I think that's hard to characterize. I mean, that's a good point, but I think that's hard to characterize, because all of these models are attackable, right? There are, especially recently, demonstrations of what you might call adversarial examples for large language models, not only to do jailbreaks, but also to do things like getting the model to output a different prediction by stating assumptions in a different way and so on. However, I agree that it's a good point, and it would be interesting to think about whether, as a whole, an ecosystem of specialized models would be more or less brittle than a generalist model. Also, some of the work that I'll describe later on combining the capabilities of models has been shown to improve robustness. I won't talk about that at all, but I'm happy to provide pointers offline. Cool. Okay, so specialist models are a reasonable place to start.

Now, this next point is not strictly necessary, but it's a nice bonus if we want to exploit it, which is that when I talk about a specialist model, I don't necessarily mean that we have hundreds of thousands of completely different model architectures that share nothing in common. It is also possible to build specialist models by making cheaply communicable updates to a common base model.
It doesn't mean we have to do it this way; it's just a nice bonus if we choose to. To go back to the T-Few paper that I was talking about a moment ago, those results actually come not from full-model fine-tuning but from what people call parameter-efficient fine-tuning. Many of you might be familiar with LoRA, for example, which is a parameter-efficient fine-tuning method, or prompt tuning. This was another parameter-efficient fine-tuning method that we came up with, which we call IA3. The idea with IA3 is that you take the intermediate activations in the transformer and you multiply them element-wise by a learned vector. The important bit is that because you're only introducing a few of these vectors and not updating all the parameters in the model, you update a very small fraction of the number of parameters compared to full-model fine-tuning. And you actually do quite well; in the few-shot setting this actually outperforms updating all of the model's parameters, so it's a better way to fine-tune the model in this case. It also outperforms other parameter-efficient fine-tuning methods, which is not that important for the purpose of making this point. The main point is that storing the base model here, T0, on disk takes, depending on how you store it, let's say tens of gigabytes, while storing the IA3 vectors takes a few megabytes. It's not quite small enough to fit on a floppy disk, but it's not far off. So you could imagine, and I think storage maybe came up a moment ago, that if I want to store 10,000 specialist models and I build my specialist models by training IA3 vectors, it doesn't cost that much in terms of storage space. And in fact, as I'll discuss briefly later, it also makes things simpler in terms of distributed inference. But again, you don't have to do this; it's just a nice bonus if you do.

And if you're not familiar, people are already building giant ecosystems, not ecosystems in the way I'm describing with automatic recombination of models and routing and so on, but at least big collections of models fine-tuned via parameter-efficient fine-tuning, what people will often call adapters. If you go on the Hugging Face model hub and look for uploaded models that use PEFT, the parameter-efficient fine-tuning library, you can see there are about 4,500 of them; on AdapterHub, which is another repository, there are about 500. And I didn't include this example because this is a language model workshop, but on repositories of adapters for Stable Diffusion, for example, there are thousands uploaded every day. These adapters are often based on LoRA, which again is another parameter-efficient fine-tuning method, and they do things like take Stable Diffusion and make it so that it always outputs things that look like cartoons, or so that it's really good at outputting backgrounds of a certain kind, or something like that. So what are they doing? They're taking a general-purpose model like Stable Diffusion and turning it into specialist models. So in some sense, if we rely on these model hubs, we already have a lot of the collection of specialist models that we might use to build an ecosystem. So that's my second point. My third point, an important ingredient that I mentioned earlier, is that if we want to treat the ecosystem as a replacement for a generalist model, then we need a way to choose which specialist models will be used to perform a given query.
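Before moving on to routing, here is a minimal sketch of the IA3 idea described above, rescaling a frozen layer's output element-wise with a learned vector. This is an illustrative toy wrapper in PyTorch, not the actual T-Few/IA3 implementation, and the class name is my own:

```python
import torch
import torch.nn as nn

class IA3Linear(nn.Module):
    """Wraps a frozen linear layer and rescales its output element-wise (illustrative)."""
    def __init__(self, base_linear: nn.Linear):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad = False              # keep the pre-trained weights frozen
        # One learned scale per output dimension, initialized to 1 so training starts at identity.
        self.scale = nn.Parameter(torch.ones(base_linear.out_features))

    def forward(self, x):
        return self.base(x) * self.scale         # element-wise rescaling of the activations

# Wrap, e.g., the key/value projections and feed-forward layers of a transformer,
# then fine-tune only the `scale` vectors on the task's few-shot examples.
layer = IA3Linear(nn.Linear(768, 768))
trainable = [p for p in layer.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable))         # 768 trainable parameters for this layer
```

A specialist built this way is just the collection of these small vectors, which is why it fits in a few megabytes.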
And this will tie into combining models to combine capabilities, but at the very least we need to do this routing automatically. I mean, you could imagine relying on the user to choose the appropriate model, but that might be onerous. When I use GPT-4, I don't have to scroll through a list of possible requests and choose one; I can make any request I want and it will perform it. So if we really want to replace generalist models, we need a way to choose a specialist model automatically. Okay, so let me provide you with some evidence that we might be able to do this. The first bit of evidence comes from a nice older paper from six or seven years ago on Task2Vec. What Task2Vec does is say: I'm going to represent each dataset by a vector, and that vector is going to be computed as a kind of summary of the diagonal of the Fisher information matrix, which sounds like a mess. But really what it's doing is measuring the gradient of the model's output with respect to the parameters, computed in expectation over the dataset, and summarizing that quantity and using it as a representation of the task. Why might this be a useful way to represent tasks? Because if I have two datasets that change the model in similar ways, that cause the parameters to be updated in similar ways when you compute the gradients, then those datasets might be similar; they're making the model do similar things. This paper is mostly focused on image classification, and they find that when they use these Task2Vec vectors, image classification tasks having to do with clothing get clustered in one area, color-related tasks get clustered in a different area, different species go in a different area, and so on: these are the clothing tasks, and those are the species tasks. The details aren't terribly important except to say that there is a way to associate, with a given piece or chunk of data, some notion of which model is appropriate for it, and I'm happy for you all to refer to that paper for more details. What is interesting, and is an extra bonus of using parameter-efficient fine-tuning, is that this nice recent paper showed that if you just use the values that get dumped out by parameter-efficient fine-tuning, depending on which method you use, that's actually a nice representation of your task too. Remember I said the IA3 update is a few megabytes; that's not that long a vector if you stack all the vectors up. If I compare the IA3 vector for one task to the IA3 vector for another task, their similarity actually captures reasonable notions of task similarity. So again, we kind of get, for free, a way of associating data with a particular appropriate model. Now, the last point I'll make about routing among specialized models is exactly the point that was made earlier, which is that we actually already have an example of a model architecture that does automatic routing among specialized networks. In this case we call them experts, and they are actually kind of sublayers in a larger model, but a mixture-of-experts style model can be thought of as a model that has a router that takes as input some intermediate activations and adaptively selects which subnetwork those activations should be sent to.
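As a rough illustration of how such routing could work, here is a sketch that builds a task embedding from the squared-gradient (diagonal Fisher) signal described above and routes to the specialists with the most similar embeddings. All function names are hypothetical, and details like normalization or which parameters to include are glossed over:

```python
import torch

def diagonal_fisher_embedding(model, dataloader, loss_fn):
    """Average squared gradient over a dataset -- a rough diagonal-Fisher task embedding."""
    sums = {n: torch.zeros_like(p) for n, p in model.named_parameters() if p.requires_grad}
    n_batches = 0
    for x, y in dataloader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if n in sums and p.grad is not None:
                sums[n] += p.grad.detach() ** 2
        n_batches += 1
    return torch.cat([v.flatten() for v in sums.values()]) / max(n_batches, 1)

def route_to_specialists(query_embedding, specialist_embeddings, k=1):
    """Return indices of the k specialists whose task embeddings best match the query's."""
    sims = torch.nn.functional.cosine_similarity(
        query_embedding.unsqueeze(0), specialist_embeddings, dim=-1)
    return sims.topk(k).indices
```

The same `route_to_specialists` comparison works if you swap in stacked PEFT vectors (e.g. IA3 vectors) as the task embeddings, which is the "for free" representation mentioned above.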
So this is not an ecosystem of models so much as an ecosystem of submodels, of layers, but these are quite effective, and I'll talk in a few slides about a mixture-of-experts style architecture that exhibits the kind of behavior that I think is desirable for these ecosystems. So, hopefully, that's some evidence that automatic selection of specialized networks might work. Now, some of you might be thinking: great, maybe you can give me a big collection of specialized models, and you can give me a way to choose which specialized model I should use for a given query, but what if, again, I have a task that requires lots of different skills? I'm really glad we already had the discussion about how tasks can be considered as a composition of skills, and Sanjeev has definitively convinced all of you that we can think of things this way, and also that there are exactly a thousand skills. But I will make the same point that Sanjeev made earlier today, which is that, at least conceptually, we can think of a given task that we might want to perform as a composition of skills. So we might say, for this query, how long would it take for a penny to hit the ground from the top of the Empire State Building? That requires maybe physical reasoning, world knowledge, and arithmetic. And a lot of these skills, like world knowledge, for example, if you buy that that's an actual skill, can also be encapsulated in tasks that focus on that one particular skill. So let's assume that among our many thousands of specialized models, we often have a specialized model that can perform some particular skill. Now the question becomes: do we have a way to combine the capabilities of models? Because, again, Sanjeev had this nice argument earlier that one of the benefits of scale is increasing the capability for a kind of compositional generalization, combining more skills, so we can go from combining three skills to four skills as we scale up, and so on. That's an incredibly useful capability of these monolithic models. If we want our ecosystem to have the same capability, we need a way to combine specialist models to compositionally combine skills. What might that look like? Well, I'm going to call this action of taking specialized models and combining them to create a composite model that can perform the individual tasks of the specialized models "merging", merging models. Maybe the reason that I chose that terminology will become clear later, but it's also common terminology in the field. The important thing about merging is that if we could do it, it would enable new paths for transferring capabilities across models. Typically, if we want a model to have some particular capability, we perform some training, right? In the old days of transfer learning, we would perform fine-tuning on a dataset that represents that particular task. Currently, we just scale up and hope that the model learns that task from the data somewhere and can then compose it with other tasks. But if we had an operation that, again, I'll call merging, what that would allow us to do is take, for example, two specialized models and combine them to create a composite model that can perform both of the tasks of the models being combined. It also allows for more unusual ways of combining models. Imagine that in our ecosystem, we have a model that's derived from some path of sequential gradient-based training.
And then we take the original model and train it on some other dataset. Then we could merge these models and get some kind of new composite model, okay? So that's what merging might enable us to do. And I'm going to use the word merging as a generic term to describe a few different methods, which I'll mention in a moment. Yes, go ahead. So I should think of each circle as one of your adapter models? As a specialized model, exactly. But again, most of the methods that I'll describe assume that the model architecture among the models being merged is uniform, it's consistent, so they have the same architecture, with a one-to-one correspondence between parameters. You could imagine merging methods that don't make this assumption, but all of the ones I'll describe do. And they don't have to be parameter-efficient; they could just be fine-tuned models too. But for the purpose of what I think is the most plausible path towards this, sure, we can think of them as parameter-efficient updates too. To give you an example of something that I think looks like merging, that has been shown to work okay and shows this compositional generalization ability, there's this nice paper, Overcoming Catastrophic Forgetting in Zero-Shot Cross-Lingual Generation, where they show that if you train, via one particular form of parameter-efficient fine-tuning called prompt tuning, learned prompts for languages and for tasks, then you can compose those prompt vectors and generalize to new language-task combinations. So I train my language model on a bunch of Vietnamese data, and I only update the parameters of this adapter, this learned prompt, via this parameter-efficient fine-tuning method. I take the same language model and train it on a variety of tasks in the same way. And if I want my model to perform task two in Vietnamese, I just concatenate those two prompt vectors, and I get a model that performs the task in Vietnamese reasonably well, okay? So this works okay; it's an interesting paper.

Another example: task vectors. This is such a fun idea because it's so simple and it works surprisingly well. I take my pre-trained model, perform some fine-tuning, and then take the fine-tuned parameter values and subtract the pre-trained values. So I'm getting a vector that represents the movement in parameter space that my model underwent when I fine-tuned it. I'm going to call that my task vector. Now I can do all kinds of weird things with these task vectors.

## Task Arithmetic of Semantic Content

![](https://www.youtube.com/watch?v=AeRmk-gn8I8&t=1943)

For example, I can take a language model, train it on a bunch of toxic text, going from here to here, compute the task vector, and then negate it, effectively subtracting it from the pre-trained model. The language model will then generate less toxic text. Very interesting. If I want to compose tasks, which is the thing that we really care about, I mean, this is very cool, but it's not exactly what we're talking about today. If I want the model to compose tasks, I take the task vectors for two tasks and add them together, and now I have a multitask model, and this works pretty well.
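Here is a minimal sketch of task-vector arithmetic over parameter dictionaries, just to make the subtract/negate/add operations concrete. The helper names are mine, and in the task-vector paper the scaling coefficient is typically tuned rather than fixed:

```python
import torch

def task_vector(pretrained: dict, finetuned: dict) -> dict:
    """Task vector = fine-tuned parameters minus pre-trained parameters."""
    return {k: finetuned[k] - pretrained[k] for k in pretrained}

def apply_vector(pretrained: dict, vector: dict, scale: float = 1.0) -> dict:
    """Add a (scaled) task vector back onto the pre-trained parameters."""
    return {k: pretrained[k] + scale * vector[k] for k in pretrained}

# Toy example with a single "parameter" tensor:
pre = {"w": torch.tensor([1.0, 2.0])}
tox = {"w": torch.tensor([1.5, 2.5])}        # hypothetical fine-tune on toxic text
tv_toxic = task_vector(pre, tox)

detoxified = apply_vector(pre, tv_toxic, scale=-1.0)   # negation: remove a behavior
# multitask = apply_vector(pre, {k: tv_a[k] + tv_b[k] for k in tv_a})  # addition: compose tasks
print(detoxified["w"])                                  # tensor([0.5000, 1.5000])
```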
![[EditingModelsWithTaskArithmetics_Fig_1.png]] (Figure 1 of [[@ilharcoEditingModelsTask2023|Editing Models with Task Arithmetics]])

So again, another example: if I have specialized models that correspond to this point and this point, and I want a multitask model that can perform the tasks that the individual models can perform on their own, I just add up their task vectors. This requires no additional training, so it could be done adaptively for any input query. Maybe I'll try to surprise you a little more, if you're not surprised yet; maybe you won't be surprised at all. So the task vectors aren't necessarily small, but, and I didn't mention this work in this talk, they turn out to be low rank, or to have a low intrinsic dimension, right? One of the weird things about neural networks and gradient-based training is that as you train networks for longer and longer, even as your loss plateaus to its minimum value, the parameter values often continue to grow and shrink and change. You can kind of think of it like the model is orbiting a region in parameter space where the loss is all low. There are also these funny symmetries that neural networks have, like if I scale up the weights in one layer and scale down the weights in the next layer, the network computes the same function. So, you know, I can make the values huge; the actual magnitude of the task vector can be very big. But in practice, they have, for example, a low intrinsic dimensionality.

The arithmetic of NLP tasks is fun, but now that we have language that can describe very complex combinations and compositions, it feels like a much richer surface than it did in the word2vec days. And I'm curious, why not just use language? I mean, we have LLMs. Yeah, so I don't have a slide for this that represents existing work, but one of the things that I think would be really cool to do would be to build a joint embedding space of task instructions, datasets, and models, right? So let's say I have a bunch of specialized models. I want a way to go from a description of the task that the model performs, or from some examples of the task, to a vector that is close to that model's vector. But maybe I'm asking a different question, which is that all this stuff assumes a sort of linearity that we don't have to assume, nor are people naturally assuming it when they describe tasks. You have this metric space of tasks or something, which just feels unnecessary to me if you can just describe what you're looking for. But maybe I'm missing something. Well, I wouldn't say that the linearity of task combinations is a requirement for any of this to work. I'm mostly saying that, interestingly, that's what actually happens when you train models, and we can exploit that to do things like combine the capabilities, the skills, of different individual models. Okay. I want to follow up on the metrizability question, and whether or not this generalizes to something that isn't metric or is affine, and whether there are examples of doing this with something other than a vector. Yeah, to my knowledge, there hasn't been. I mean, the task vector paper is relatively recent, and I think this is an example of something that's incredibly simple and works surprisingly well, but probably isn't sufficient. And maybe that answers your question too. Right.
And actually, what I'll describe after I show you that people are merging models already, and that it works reasonably well, is some of the work in my group on more sophisticated merging methods that aren't just adding task vectors together. So maybe that will help answer those questions. The last thing I'll say, and again maybe this will surprise you, maybe it won't: this recent paper showed that it was possible to take a shared transformer backbone, train a text transformer and a vision transformer, merge them, and get a multimodal model that has a shared representation space for text and images and enables you to perform standard image-and-language tasks. It turns out that this works okay whether you just average together the parameters of the models, do task vector arithmetic, what they call modality arithmetic, or do more sophisticated things. It's not perfect, but as a proof of concept, it exists. Certainly people would think that processing images and processing text involve different skills, and here are two skills that you can combine to get a composite model. And in fact, I've mentioned these community collections of specialized models.

## Emerging models from composition of models

People are already merging these models to create composite models that inherit the abilities of the individual models. Here's an example of a language model on Hugging Face that is described as being composed in this way. I don't actually understand this syntax, they don't define it at all, but it involves taking individual models, in this case LoRAs, so parameter-efficient updates, and adding them together in different ways to, for example, make the model better at following instructions, make it so the model doesn't forget its world knowledge, etc. And again, I'm not including Stable Diffusion examples here, but this is even more common in the Stable Diffusion community. For example, as I mentioned earlier, I might have a LoRA, a parameter-efficient fine-tuning adapter for Stable Diffusion, that makes cartoon-like images, another one that makes synth-punk backgrounds, and another one that makes people with really nice beards, and I can combine those adapters to make a Stable Diffusion variant that is very good at making images composed of those characteristics. So this is already super common; as a proof of concept, it works. Now let me talk a little bit about addressing the earlier questions, because just saying that we're going to do this by adding things together and averaging things is probably insufficient when we want to do sophisticated things. So I'll talk about two pieces of work done by my lab and some collaborators on more sophisticated ways of merging models. This first work formalizes model merging as an optimization problem. If we formalize it this way, we say that we're going to try to find the parameter values that maximize this objective, and this objective is a sum of the posteriors of each individual model that is being merged. The hyperparameter lambda_i is just a model-specific weight that we can tune, or you can just assume it's 1/M if we're weighting each model equally.
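Written out, the merging objective just described, a weighted sum of each model's log posterior, looks like the following. This is my transcription of the formulation on the slide, so treat the notation as approximate:

```latex
% Merging as optimization: maximize a lambda-weighted sum of the M models' log posteriors.
\[
  \theta^{*} \;=\; \arg\max_{\theta} \; \sum_{i=1}^{M} \lambda_i \,\log p_i\!\left(\theta \mid \mathcal{D}_i\right),
  \qquad \lambda_i \ge 0, \quad \text{e.g. } \lambda_i = \tfrac{1}{M} \text{ for uniform weighting.}
\]
```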
But the difficult thing with formalizing it this way is that we don't actually have access to the posterior of our neural networks; when we train them with maximum likelihood, we just have a point estimate. And so in the work that I'm describing, we use the Laplace approximation, which, for those who aren't familiar, essentially takes a second-order Taylor expansion of the posterior around a mode, around the parameter values found via maximum likelihood training. That corresponds to assuming that the parameter values are Gaussian distributed, with a mean at the parameter values found during maximum likelihood training and a precision matrix given by the Hessian. But we can't really compute the Hessian for these large neural networks, so we use the Fisher information, which is very close to the Hessian at a mode of the posterior; and we can't compute the full Fisher either, so we use the diagonal Fisher. These are sort of unimportant details. The important bit is that if we make this approximation to the models' posterior distributions, we can solve this maximization problem in closed form, and we get this closed-form solution for the merged model's parameters. What we're basically doing is computing a weighted average of the models' parameters, where the weighting is done by the Fisher information values, so we're up-weighting parameters that are important to a given model and down-weighting parameters that aren't. If we do this, and I'm just going to throw one example out there, we have lots of other experiments in the paper, we can do that kind of esoteric merging pattern I described earlier. We can, for example, take an RTE model that underwent intermediate-task training on MNLI, which is a well-known, performant recipe, merge it with donor tasks, and get a boost in performance over the original model. So merging is not just for combining capabilities; we can also boost the performance of a model on a specific task by performing merging.

Now I'll describe one other piece of work on performing merging in a somewhat more sophisticated way. This one doesn't have the same kind of arguably principled underpinnings; it was found heuristically. What we're doing here is saying that for a given model, maybe certain parameters are especially important and some of them aren't. If we just average together parameters that aren't important to one model with those that are important to another model, maybe we'll wash out the changes to one of the models. So the first step, assuming we have a task vector, is to set all of the values in the task vector to zero if they're small, that is, if the parameter hasn't changed that much. Then we resolve differences in sign, in terms of which way the parameter values changed, by having the models essentially vote on what the final sign should be. And then we only average together parameter values whose sign agrees with this aggregated sign. Again, this was found heuristically, and we don't have a good justification for why it works, but it does actually help. If you focus on the application that I'm really talking about now, which is merging individual task models to get a multitask model, it works significantly better than simple averaging, which is what most people use, and also better than task vector arithmetic.
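For reference, under the diagonal-Fisher Laplace approximation the objective above has the closed-form, element-wise solution of a Fisher-weighted average, roughly theta* = (sum_i lambda_i F_i * theta_i) / (sum_i lambda_i F_i), which is the "up-weight what matters to each model" behavior just described. The sketch below instead illustrates the second, heuristic method (trim small task-vector entries, vote on signs, average only the agreeing values); the function name and trimming threshold are my own choices, not the paper's exact recipe:

```python
import torch

def merge_task_vectors(task_vectors: torch.Tensor, keep_fraction: float = 0.2) -> torch.Tensor:
    """task_vectors: (num_models, num_params). Returns one merged task vector."""
    num_params = task_vectors.shape[1]

    # 1) Trim: keep only each task vector's largest-magnitude entries, zero the rest.
    k = max(1, int(keep_fraction * num_params))
    thresholds = task_vectors.abs().kthvalue(num_params - k + 1, dim=1, keepdim=True).values
    trimmed = torch.where(task_vectors.abs() >= thresholds, task_vectors,
                          torch.zeros_like(task_vectors))

    # 2) Elect a sign per parameter by letting the models "vote" (sum of trimmed values).
    elected_sign = torch.sign(trimmed.sum(dim=0))

    # 3) Average only the nonzero entries whose sign agrees with the elected sign.
    agrees = (torch.sign(trimmed) == elected_sign) & (trimmed != 0)
    counts = agrees.sum(dim=0).clamp(min=1)
    return (trimmed * agrees).sum(dim=0) / counts

# merged_params = pretrained_flat + merge_task_vectors(torch.stack([tv1, tv2, tv3]))
```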
One of the nice things, and this is the normalized performance with respect to the individual task models, is that when we merge together two or even three models, we basically don't sacrifice any performance, so we really are retaining the individual capabilities. The last piece of work I'll describe on combining the capabilities of different submodels uses more of a mixture-of-experts style framing. In this case, we're assuming that rather than combining entire specialized models, we're just combining specialized experts. And rather than routing discretely to some particular expert, we're going to take the distribution over the experts and compute a weighted average of those experts; we're essentially going to merge them. If we do that and train on a multitask mixture, in this case the datasets from GLUE, a nice and at least somewhat intuitive pattern emerges: tasks that share something in common often share experts, tasks that don't get their own expert, and tasks that require lots of different capabilities, like MNLI, which is kind of a broad-domain dataset, often use lots of experts. So hopefully this convinces you. Yes, go ahead. This is a simple point, but I missed it: do you do the merging per prompt, per example, or, given the tasks that you want to combine, do you make a model that combines those? Yeah, so I haven't really bridged the part of the talk where I was talking about choosing which models to use with this part of the talk, which corresponds to combining capabilities, because that work doesn't exist yet. But the hope is to combine the work that allows you to select which models to use with the work on merging. So let's say you have intelligent routing based on task vectors or something else, you feed in your example, and it says you should use these five models. Merging provides you with a way to immediately, cheaply combine those models and get a composite model that kind of inherits the capabilities of those individual models. No one has actually done this, but given some of this past work, I think it might actually work; there's a sketch of this idea below. Have you experimentally compared your ecosystem method with putting all the data together and using instruction tuning? Yeah, so no, partially because an ecosystem like the one I'm describing just doesn't exist; it hasn't been built yet. The closest thing that I think maybe answers your question is: can I take K datasets, train K specialized models, merge them, and get a model that works about as well as multitask training on those K datasets? Does that roughly correspond to what you're asking? If you put multiple languages together, performance goes up a lot for some of those languages. Yeah, definitely. And I wonder whether the same kind of results hold here, and how it goes across different tasks; these probably need to be compared across different tasks. Yeah, absolutely. And to finish the direction I was going: people have shown that merging provides the same kind of positive-transfer benefits you're describing, right? If I have two datasets or two tasks that have things in common, then merging those models can boost the performance on one of the tasks, just like doing multitask training on languages that have something in common can boost performance on the low-resource language.
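To make the hypothetical routing-plus-merging pipeline mentioned in the exchange above concrete (the speaker stresses that no one has built this yet), here is a purely illustrative sketch, assuming each specialist is stored as a task vector with an associated task embedding. Every name here is made up:

```python
import torch

def answer_query(query_embedding, specialist_embeddings, specialist_task_vectors,
                 pretrained_params, k=5):
    """Hypothetical per-query pipeline: route to k specialists, merge them, apply to the base."""
    # 1) Routing: pick the k specialists whose task embeddings best match the query.
    sims = torch.nn.functional.cosine_similarity(
        query_embedding.unsqueeze(0), specialist_embeddings, dim=-1)
    chosen = sims.topk(k).indices

    # 2) Merging: here, just a similarity-weighted average of the chosen task vectors
    #    (any of the merging methods discussed above could be substituted).
    weights = torch.softmax(sims[chosen], dim=0)
    merged_vector = (weights.unsqueeze(1) * specialist_task_vectors[chosen]).sum(dim=0)

    # 3) Apply the composite update to the frozen base model's parameters.
    return pretrained_params + merged_vector
```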
People have shown that, but generally it doesn't work quite as well as multitask learning at this point. Does that answer your question? I don't know of any work in the language case specifically, but certainly in the multitask case this is a common setting. Do you have experimental results on this? Yeah, most of the papers I described include this exact setting, which is comparing training on K datasets at once, like multitask learning or multilingual learning, versus training on them separately and then merging them. And like I said, it typically doesn't work quite as well, but it's pretty close. Great. Oh, and I should mention at this point, just to throw out a note, that merging, and I'm not discussing any of this work today, has also been shown to be a way to improve robustness and out-of-domain generalization.

Okay. So the last note, and this goes back to an earlier question, is that in order to build an ecosystem like this, it's probably helpful to build systems that make this kind of collaboration, and running inference, and doing all of these operations, easier. We've actually built some of these systems, and I'll describe two of them now. The first one is called Git Theta, and the idea with Git Theta is that we are going to replicate, for model checkpoints, the exact workflow that people already use for version-controlling source code. And what does it mean to replicate that workflow? It means that we are going to be able to perform the exact same Git commands to track changes to a model checkpoint and also to merge model checkpoints. This is where the term merging comes from. So this is a kind of pseudo-code example of this pipeline for model development. We start by telling Git Theta that we want to track the original checkpoint, and we just commit the model as usual. We perform a made-up fine-tuning step and then commit the new model. Now we check out a new branch called RTE and do some fine-tuning there. We check out the main branch, do some fine-tuning there, and then just run git merge RTE to tell Git Theta that we want to merge the RTE branch with the main branch. We perform the merge; it gives us various ways to perform merging, representing some of the methods that I described earlier. And then we do some other operations to the model. So Git Theta works reasonably well. It's on GitHub, and actually, as long as your Git remote supports Git LFS, which GitHub does and the Hugging Face model hub does, you can use Git Theta natively with that remote. So how do we characterize the benefit of using Git Theta? Well, you could imagine that rather than using Git Theta, you just use Git LFS to track the checkpoint at every point in the version control pipeline. In that case, you wouldn't be able to do a merge. And you could ask, for example, how much space do we save versus just saving one, two, three, four, five, six copies of the model checkpoint? That's what this graph shows you. It shows that if you do parameter-efficient fine-tuning, which we did in this made-up example, you save a lot of disk space. Also, if you do something like we did in this operation, which just corresponded to removing some of the model's parameters, you also save a ton of space. And you actually save a little bit of space even if you do a dense update, where you update all of the model's parameters, just because of the way Git Theta compresses the checkpoint.
And in this example, if you actually run this pipeline, you can see that the model's performance gets better over time on the tasks that we're discussing. So Git Theta works for a somewhat real-world example. Okay, so we can track the provenance of the model ecosystem. If someone gives the model ecosystem a new model and we improve that model in some way or merge it with another model, we can use Git Theta to track that in a principled way instead of just having a big mess of models. And now, finally, we get to the question that I think came up earlier, which is basically: how would you do distributed inference in this case? I think, in general, it would make distributed inference simpler. We haven't built a system where many peers serve specialized models and we route things among them appropriately, but we have built a system that makes it easy to do distributed inference of a single, large-ish model. What this system does is take the layers of the model, chop them up into groups, and then peers can volunteer and say, I'm going to run inference with these layers. It then compresses the layers to make inference fast, and it compresses the activations so that when they're sent to the next peer over the internet, because this is all done in a totally distributed way on volunteer computing over the internet, the communication costs aren't terribly high. It also does intelligent routing to decide which peers should be used to serve some particular request. And it works reasonably well; it's a decent way of performing volunteer inference. In fact, you can use Petals, which is what the system is called, right now to do inference on various models. Here's the health dashboard, which basically shows, for a given checkpoint, whether we have peers that are willing to serve blocks of the model's layers. One crucial thing, which I don't have a nice slide or diagram for, is that Petals natively supports hot-swapping adapters, parameter-efficient fine-tuning vectors. So if I take the Stable Beluga base model and make a specialized model, and I want to run inference with that specialized model, I can just pass Petals the adapter and it will run inference using that adapter. It also does fine-tuning of adapters in the same way. So, you know, these monolithic, generalist models work really well by being served by a company with access to an expensive supercomputer. Systems like Petals would make it possible to do inference over an ecosystem on volunteer computing, so we don't have to rely on a centralized supercomputer or company.

So, that's all I'm going to say today. Whenever I give talks, I put this URL on my last slide, which is a way to give me anonymous talk feedback. Whether you thought I talked too quickly, too slowly, covered too much, too little, whatever, you can give me feedback there. And also, like I said, I have to leave, so if you want to follow up with me after the questions are done, there's my email; I'm happy to hear from any of you. Hi, thanks for the great talk. I have a question, which is that I think the word ecosystem is a bit overloaded, in the sense that there are two types of ecosystem: you can have an ecosystem of models, or you can have an ecosystem of different people, where the ownership is distributed.
I think you have argued that we can have an ecosystem of different models, which is more computationally efficient, but we can imagine that one single company just chooses this particular architecture of having an ecosystem of models, so in that sense the ownership is still monolithic. So what do you think is the comparative advantage of the community building this type of ecosystem over a single company building it? Yeah, absolutely, and it's entirely true that you could imagine this ecosystem being built by a single entity. And by the way, I'm not great with branding, so if anyone has a better word than ecosystem, I'm open to it. But to your point, I didn't mention this as a point of motivation, but I worry a lot about a future where this very powerful, very useful technology, these generalist AI systems, is controlled by a relatively small group of institutions or entities that are very resource-rich and also profit-motivated. I don't know how much detail I need to go into; it's a gut feeling, but that's just how I feel. And I think generally, when you distribute power, when you distribute responsibility, things usually go better. And I do think that this approach provides a way to decentralize power.

Thank you for the great talk. A question about the models that you envision in these ecosystems in the future: are you leaning more towards ecosystems of many, many models, some of which might be very redundant, or towards a really small set of very diverse, almost orthogonal models? Yeah, I think that's an interesting point, and it's not something I brought up, but you absolutely could imagine pruning the ecosystem, right? If you have a bunch of redundant models, maybe you can combine them into a single aggregate model, make sure that it retains the capabilities and performance of the individual models, and make the ecosystem smaller. I do think a lot of the stuff I described today is pretty scalable, though. Again, if you use parameter-efficient fine-tuning, the storage requirements are not a huge issue, and most of the methods you could use for choosing which specialized model to use can be done pretty effectively just by computing dot products of model vectors. So I think it is pretty scalable, but I agree that it would totally make sense to think about how to prune out models and so on.

Great talk. I'm wondering about the counterpart of bit rot in this world. Specifically, suppose you had this ecosystem set up around GPT-2, and then thousands of things have emerged, people are combining them, getting lots of things to work, and then someone invests and trains GPT-3. Okay, now what? Yeah. So I do think, again, if you consider the general case where the individual models in the ecosystem don't have to have anything in common, they don't have to share a base model, you could have an ecosystem that has GPT-3-based models and GPT-2-based models. Most of the merging methods I described would not really work in that setting. You could imagine maybe doing something intelligent where you only use these merging methods on compatible models or something like that.
I didn't talk about this as future work, but I do think it's valuable to start thinking about how to merge models that don't have architectures in common, for example. And again, this goes back to Daisy's question that she threw out right away, which is basically: does this work when the models don't share a base model? It totally does; I just think there are extra benefits when they do. And it also goes back to this other question of how you prune models that maybe aren't working as well, or are somehow suboptimal, or something like that. Yeah, so you mentioned that this could be kind of like a paid system where people get reimbursed or something, but it's not clear how you attribute the benefit of the individual models. I don't know if I said that, I may have, and I believe you more than I believe my own memory. But to give a specific example: in Petals, we thought about ways to incentivize peers, because these peers are doing volunteer computing, they're basically offering up their GPUs for free. The simplest thing to do is to say, if we have a distributed cloud, you get priority access by volunteering your compute. You can imagine similar incentives that say those whose models are used more often get prioritized inference, or something like that. But I don't have any concrete examples of what might work because, you know, it doesn't really exist yet. A question: I think in most of the work on merging, the user needs to specify what is merged and how, is that right? Is there some work on automating this as a search process in the space of specialized models to merge? Yeah, there was a conceptual leap that I didn't make explicit because, again, the work doesn't exist yet, but I think there is an interesting intersection between the model routing, the appropriate-model-selection work, and the model merging work. And certainly there has been work showing that if I have models trained on tasks that are similar to the task that I want my model to eventually perform, and I merge those models, I get a boost in performance on the target task. That is a thing that works. And again, I think if you have a sophisticated way of identifying which models have been trained on similar tasks, then you get the kind of benefit you're describing. Okay, today, rather than panels and receptions, we have another talk, so I think maybe we should thank Colin and move on.