Episode 20: Interviewing Louis Castricato of Synth Labs and Eleuther AI on RLHF, Gemini Drama, DPO, founding Carper AI, preference data, reward models, and everything in between

01:26:28
Nathan:

The ticker's going up. Welcome, Louis. You're the second guest on the Interconnects podcast. I think it's an interesting one for me because everyone kinda points to me now as the face of RLHF, and I get a lot of questions. And to me, Louis has represented that person.

Nathan:

I think Louis provided most of the information on the first RLHF blog post that I wrote for Hugging Face back in the day. If there's somebody I wanna ask questions about RLHF, it's generally him. So now you all are gonna know this in the open. We're gonna cover a lot of things. As always, I'm trying to talk with researchers on the ground and people actually doing things in these topics.

Nathan:

I think we're gonna cover a lot of things today. We're in the Latent Space studio. If you're watching on video, you may have noticed that. And they reminded us we've gotta start off with covering the Gemini news and what that means for RLHF.

Nathan:

And then most of this is a long docket of the core questions facing the two of us as we're trying to make RLHF more open, more useful. Not only about safety, but safety is important to it and important to us. So I think we can kind of get going. The first question I have, just to get rolling: what is your favorite Rhode Island fact?

Louis:

My favorite Rhode Island fact?

Louis:

Oh, man.

Louis:

All the HP Lovecraft stuff. Like, walking around Providence with friends who like HP Lovecraft and being like, oh yeah, this was that building in Call of Cthulhu. And it was always...

Nathan:

Well, I didn't even know this. I mean, for the record, I grew up in Rhode Island, if people didn't know. And that's where Louis spends most of his time these days.

Louis:

Yeah. Providence.

Nathan:

So we'll come back to this. I'm just gonna start with kind of the hardest question, and it'll get easier for us from here. What was your first reaction when you saw all of this Gemini stuff?

Louis:

The, you know, the adding custom races and demographics to image prompts component. Right? Yeah. So DALL-E had done that back when DALL-E 2 first came out and was in beta, and people were reporting, like, "a person holding a sign that says X." And then the sign would say black, or the sign would say white, or the sign would say Asian.

Louis:

And, you know, it was a very hacky solution then. I thought a lot about it then as well, and I almost felt like it gets you 90% there for 1% of the effort of doing it the more proper and auditable way, like making sure your training data has equal representation, or making sure your RLHF data has good representation. And you can't do those things after the fact, but what you can do after the fact is inject things into the prompt to make them more controllable. It really comes down to the fact that controllability right now is not a solved problem. And most of our solutions to controllability are a little bit hacky.

Nathan:

Yeah. That makes sense. To summarize for people, this has been an ongoing issue, and we're recording on the 27th here. Gemini initially got flak for actually forcing diversity into historical scenes, and then it started getting more flak for flat out refusing certain requests on race. All of this stuff is, like, ouch in some way, because I know people working on this stuff.

Nathan:

And the way that it ends up here is not what a lot of people think. The Gemini team is obviously moving fast. And it seems to me that the image stuff has always been a red herring. That's the way that swyx phrased it as well. Somehow it got to the point where a prompt was shipped in this final solution for the image editing.

Nathan:

And that's just hard. Obviously, there's a big goof up there. But then we're looking at examples, and still today, Meta's image generator, on WhatsApp or whatever, you can ask an AI and it'll have similar issues where it forces diversity into a question with multiple people. Microsoft Copilot has this.

Nathan:

It's the text thing, and really digging into how we think these big companies could be forcing this into their data. We know a lot of them use providers like Surge, and some of them do it in house. Who is providing it isn't really the issue, because they're giving similar instructions to similar workforces across the board. But how do we see this entering the preference data that they're adding to RLHF?

Nathan:

Because if you look at a base model, we were just working with OLMo, and if you say, like, hello to a base model, a lot of times the base model will then go off and do some crazy, like, 4chan shit, because so many of the conversations in there, even with good data processing techniques, are from weird corners of the Internet.

Nathan:

So I don't see any base model that comes out with some debias thing built in. It's added on afterwards. And it's like, how did we end up there? Mhmm.

Louis:

Yeah. I mean, like I was saying, this is something that they do retroactively once they've acknowledged that these issues exist in the dataset. Once the model has been trained, it's not something that can be easily fixed even if they had infinite resources. It's very, very hard to go back and actually rectify these biases in a way that's equitable to all the kinds of preferences that someone might have in wanting to interact with this model. Right? And at least as far as I know, until recently DALL-E did this as well, where you could still say "a person holding a sign that says X" and it would still say black, white, or whatever.

Louis:

And they're building a consumer product, the main consumer product in this space, and with the amount of resources they've been pumping into it, this still presents a large issue for them. That just shows how difficult this really is.

Nathan:

Yeah. And another example: I have this Discord that's growing for paid subscribers and friends, and someone pointed out this work where if you ask DALL-E to generate, like, a doctor... The point is that it's not necessarily deep, at this conceptual level. It's at the level where you tell your preference labelers to do a certain thing, and then they do it. But you may not have good tracking of which data point is responsible for these different things.

Louis:

Mhmm. Yeah. You know, interpretability for preference learning in general, we're very, very far from actually understanding what preferences result in what model behaviors. And this ties into the...

Nathan:

John Schulman talk. Yeah. It's like, that was his whole talk, and it was great. Just to have him get up there and be like, this is so hard.

Louis:

Yeah. And I've done a ton of experiments myself where I have an RLHF dataset and I randomly remove 10%. And I have a bunch of models, each with a different 10% removed. And I'm like, well, what behavioral differences can I see between these models? And you can see differences, but they're extremely hard to quantify.

Louis:

It's extremely hard to actually understand what the difference is. And then, like, there's almost no way to know what in that 10% caused that difference.
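A minimal sketch of the kind of ablation Louis describes, holding out a different random 10% of the preference data for each run; the train_fn argument stands in for whatever training routine you use (e.g., a DPO trainer), so the names here are illustrative rather than any specific codebase:

```python
import random

def ablation_models(dataset, train_fn, n_models=5, frac=0.10, seed=0):
    """Train several models, each with a different random 10% of the
    preference data removed, so their behaviors can be compared afterwards."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        indices = list(range(len(dataset)))
        rng.shuffle(indices)  # a different shuffle on every iteration
        dropped = set(indices[: int(len(dataset) * frac)])
        subset = [ex for i, ex in enumerate(dataset) if i not in dropped]
        models.append((dropped, train_fn(subset)))  # keep dropped ids for later attribution attempts
    return models
```

Even with the dropped indices recorded, as Louis notes, attributing an observed behavior change back to specific examples is the hard part.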

Nathan:

Yeah. This reminds me of the Hugging Face No Robots dataset, which is a professionally curated instruction dataset. Whenever we added that to a model, it was like, this is obviously our most valuable data, but it would show up on zero benchmarks. And we're like, well, what do we do? We're talking about Google's problems here, and we'll get back to the data problems in open source.

Nathan:

They probably have on the order of millions of data points going into this preference data. And some proportion of it is probably about safety. I think we could talk about the Anthropic HH data, where people don't actually know the details: roughly a quarter of it is harmless data and three quarters is helpful data, from different rollouts. These are very specific things. So there are huge data problems that most people aren't really thinking about.

Louis:

Mhmm. Yeah. Most people are just blindly like, oh, well, it says it's safety, so I'm gonna throw it into my dataset and hopefully it works. And hopefully we get good behavior. But I don't really know what's in this dataset.

Louis:

I haven't really looked at the data. And that's something that I've heard many, many times over the last year from people trying to get their feet wet in the RLHF space.

Nathan:

Yeah. And do you have any intuition? This is the last point on the Gemini thing. If we don't think the image generation is Gemini's biggest issue, and I think it's in the text and how this preference data is collected, do you know anyone that is doing multimodal RLHF? Because I generally think we don't know how to do this at all: how do you control things if you have multiple inputs and multiple outputs? How do you control your modality distribution and data count and stuff?

Louis:

Yeah. So I have two friends of mine who have been doing video RLHF for a little while now, I guess a little bit over a year. And they condition their video model on some text encoder, and they've been talking about having to do RLHF independently for both the text encoder and the video model. But video RLHF is just massively underexplored, and no one really knows what they're doing in that space.

Nathan:

And when you say independently, what do you mean? Like, before making the video model, are they RLHF-ing the text backbone, or are they freezing the rest of the model?

Louis:

They're RLHF-ing the text backbone. I think there was actually a paper from Tencent last August that basically did the same thing for multimodal RLHF, where they RLHF the text backbone and then RLHF the image generation components on top of that.

Nathan:

This is potentially basic, but to train a visual language model, you have to add some type of mechanism that links the gradients between the two. And most of the time, I think these days, they're starting with this language backbone, and then they're adding on vision and continuing to train. So is this at the end, where you have a visual language model, and then they're freezing the gradients of the video part and RLHF-ing the text part? Or is this before the vision component is even added to the model?

Louis:

The space is a little too early.

Nathan:

Yeah. Like, I think that's the point. Like, we don't know these links.

Louis:

But I know people in the last, like, eight months who have done it the way of: before they even add the image component, they RLHF the text model, and then they add the image component and RLHF the image part.

Nathan:

Yeah. So this is really interesting. Everyone talks about how RLHF is low computation and FLOPs compared to everything else people are doing. In the open, we say that it's, like, 50 or a hundred thousand training samples.

Nathan:

Mhmm.

Nathan:

Llama 2 is, like, 1.5 million. I'm guessing the closed models, like Gemini, are probably another 10 million, like, much higher. They're much bigger. And is the amount of video training that it takes to train this backbone after the fact still helping? Does that undo some of the text RLHF or does it not? The answer is I don't know, but these are the kinds of things that I wanna have people start talking about.

Nathan:

Is RLHF becoming a sequential process as you add modalities, or can you wait until the end and just do multimodal RLHF? We don't know these things. And this is what people on Gemini are trying

Louis:

to work on. I've definitely spoken to a lot of people who are at least thinking in this space. I've only spoken to a small number of people who are actually working in this space. But for the people who are thinking in this space, really, the dream is to be able to express preferences in modalities where it's beneficial to express preferences in those modalities. Like, it doesn't make sense to express preferences over code as images or video.

Louis:

But it does make sense to express preferences over, like, puppies as, like, photos.

Nathan:

This is a great point. And the thing is, the way you ended your sentence is, like, preferences over puppies. We don't know what people use visual outputs for in a productive sense, and really inputs too. The things are like, analyze this video. That's a toy example.

Nathan:

Mhmm. Whereas for analysis, creating RLHF pairs, I think, actually is not too hard for us conceptually. It just takes a lot of effort, because a human has to know what is in the video to do, like, summarization RLHF. If you're passing a three-hour video into a Gemini base model, and it gives you two outputs, the human's not gonna know which is right unless they have context on what the video is. That is just way different than a poem, where you could read both of them.

Louis:

Yeah. So there's actually a really fascinating paper from OpenAI that I really haven't seen anyone build on. It was the idea of summarizing really long books, and you need human feedback

Nathan:

to do that. Is this sort of, like, recursive summarization?

Louis:

Yeah. It's recursive summarization. It's the idea that you can almost treat long summarization as a weird RLHF merge operation, where you divide, divide, divide, divide. And eventually you get to segments where it makes sense to collect annotations.

Louis:

And then on those segments, you have a human annotator go through and say, oh, this segment's summary is better than that one, or the summary of this segment plus this segment is this. And then when you combine summaries, you can say, well, this summary plus this summary gets you this summary. And eventually you get preferences going all the way up the tree, and you get a preference over the whole book at the end. Obviously it's a crude approximation of what the summary of the whole book is, but it's much more feasible than asking human annotators to summarize an entire book.
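A rough sketch of that divide-and-merge recursion, with hypothetical summarize and collect_preference helpers; this illustrates the tree structure Louis describes, not OpenAI's actual implementation:

```python
def summarize_book(segments, summarize, collect_preference, leaf_size=2):
    """Recursively merge summaries; a preference is collected at every merge,
    so judgments propagate up the tree to a whole-book summary."""
    if len(segments) <= leaf_size:
        # Small enough for an annotator to compare two candidate summaries directly.
        candidates = [summarize(segments) for _ in range(2)]  # assumes sampling gives distinct candidates
        return collect_preference(candidates)
    mid = len(segments) // 2
    left = summarize_book(segments[:mid], summarize, collect_preference, leaf_size)
    right = summarize_book(segments[mid:], summarize, collect_preference, leaf_size)
    # Combine the two preferred child summaries, then prefer over merged candidates again.
    candidates = [summarize([left, right]) for _ in range(2)]
    return collect_preference(candidates)
```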

Nathan:

Yeah. I just realized this on the pod right now: how ridiculous RLHF-ing an entire codebase in context is. And that's where some of the opportunities are for what I think RLHF could do, which is synthetic data labels and stuff. We can create synthetic preferences in many different ways that aren't all reliant on this kind of human subjectivity.

Louis:

Yeah. It's a deeply fascinating problem. Actually, how big is Gemini's context window? The 1.5 thing? It's

Nathan:

like 10 million tokens. Yeah. It shipped with a million, and they have experiments in

Louis:

the paper up to 10 million. Like, who really wants to use a 10 million token context window? And how accurately can you really think about preferences over the range of a 10 million token

Nathan:

context window?

Nathan:

I think people want to use it, but I think the preference thing is a lot harder.

Louis:

Yeah.

Nathan:

Because this is something I encounter with Hugging Face regularly. Hugging Face is a popular codebase. You expect the code models to do well, but they still don't do well. Like, they'll make up datasets functions or something. And if you just have all of Hugging Face's code in context when you're working in the Hugging Face ecosystem, that will make you so much better.

Nathan:

Mhmm. And analyzing long videos and stuff. I do think there are a lot of use cases. Yeah. But the preference thing is just a totally different framing.

Louis:

What do you think about the needle in the haystack evaluation that they did?

Nathan:

I haven't read a lot about it, but essentially there's a difference between being able to act on the information and being able to retrieve it. These models should be passing needle in the haystack, because that shows they're actually noticing that the information is there, but that does not necessarily mean they're gonna be able to synthesize all the information in a compelling way. So it's like a pass bar: you need to have this to be credible in long context. But actually evaluating long context, and what behaviors we wanna see, is pretty open ended. Mhmm.

Louis:

I think Yoav... I don't remember his last name.

Nathan:

Like Goldberg?

Louis:

Yeah. Goldberg. He put out a paper, like, yesterday, where he's like, oh, needle in the haystack is interesting, but if you have more than two needles, it's entirely uncorrelated with the single needle in the haystack benchmark.

Nathan:

Yeah. Because it's trying to find one thing in each part of the context. It breaks the context window into many segments and then makes sure that you can find something in each of those segments.
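A toy sketch of what a multi-needle haystack prompt can look like: split the filler text into one segment per needle and hide one fact in each. The filler, needles, and question here are made up for illustration, not the setup from Goldberg's paper or Google's report:

```python
import random

def build_multi_needle_prompt(filler_sentences, needles, seed=0):
    """Hide one needle per segment so retrieval is tested across the whole context."""
    rng = random.Random(seed)
    seg_len = max(1, len(filler_sentences) // len(needles))
    segments = []
    for i, needle in enumerate(needles):
        seg = list(filler_sentences[i * seg_len:(i + 1) * seg_len])
        seg.insert(rng.randrange(len(seg) + 1), needle)  # random position within this segment
        segments.append(" ".join(seg))
    question = "List every secret codeword mentioned in the text above."
    return "\n\n".join(segments) + "\n\n" + question
```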

Louis:

It's almost like I feel we're gonna get to the point where the attention itself is the limiting factor, because the model genuinely just cannot equitably split attention over its context window to retrieve as many things as it realistically needs in order to produce something.

Nathan:

Do you think that RLHF could manipulate long context behavior more than people might expect? Because it's just an open question.

Louis:

Yeah. I think it's a very interesting open question. And if the answer turns out to be yes, in-context RLHF becomes absolutely massive, because right now it can kind of sort of work, but not really. And every benchmark I've ever seen for in-context RLHF almost isn't charitable at all to the RLHF baseline. From the experiments that I've done and the experiments that people in Eleuther have done, it's comparable in very niche situations, but it's not comparable in general. Because you still have all the issues with in-context learning, where you'll massively overfit on the preferences that are put at the beginning of the context versus preferences that are put

Nathan:

in there. Let's try to explain what this in-context RL is actually doing. A lot of people know what an RL algorithm is, and that in-context learning is designing a prompt. Is it training a model to generate prompts? What are you actually using the RL update on?

Nathan:

And what are you parameterizing when you're doing in-context RL?

Louis:

So, I mean, there's a number of different approaches for in-context RL. There's the...

Nathan:

That could be part of the problem. People do a lot of different things. But what are some of them?

Louis:

So the one that I was referring to is, I think, the Yejin Choi

Nathan:

Yeah. It's the URIAL one.

Louis:

Yeah. Where she just prompts the chatbot: you are interacting with the user, here's what their preferences are, have at it.

Louis:

But there's also stuff like what Misha and DeepMind did.

Nathan:

This is the first one that I do.

Louis:

Yeah. Where you have some agent that's interacting with an environment and you store all of these state-action pairs, and you just fine-tune models on episodes of these state-action pairs. And then the idea is that if you put enough episodes into the context window, on the next episode it'll just perform better. Right? It's the algorithm distillation paper.
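A compressed sketch of the algorithm distillation idea being described: serialize whole training episodes, ordered from early to late, into one sequence and fine-tune a sequence model on it. The token format here is a hypothetical placeholder:

```python
def episodes_to_sequence(episodes):
    """Concatenate episodes in collection order so the model can learn the
    improvement trend across episodes, not just a single fixed policy."""
    tokens = []
    for episode in episodes:  # episodes sorted from early (poor) to late (good)
        for state, action, reward in episode:
            tokens += [f"<s>{state}", f"<a>{action}", f"<r>{reward}"]
        tokens.append("<end_of_episode>")
    return " ".join(tokens)

# At inference time, several stored episodes are placed in context and the model
# rolls out the "next" episode, ideally continuing the improvement trend.
```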

Louis:

And you can use this to distill stuff. I think the actual example in Chris Lu's paper, where they do algorithm distillation on S4, is Muesli. Right? They distill Muesli, which is, like, apparently no one outside of DeepMind has ever used it. But apparently...

Nathan:

Oh, is this the algorithm Muesli? Yeah. I remember when this was hot. It was, like, a year ago at this point. We were thinking about reimplementing it, and then we never did.

Nathan:

It was too complicated.

Louis:

Yeah. But Muesli is apparently very computationally expensive, because it's this model-based RL thing that, I think, beats AlphaGo without using Monte Carlo tree search. It's so incredibly computationally expensive, and being able to do it in context just dramatically reduces the computational complexity to actually deploy it. Right?

Louis:

And as far as I'm aware, there's been no work applying algorithm distillation at all to NLP. At least my impression is that it generally does not work for NLP, at least yet. I think that there's a lot of potential there, but there are absolutely massive barriers that have to be overcome before we get there. Like what? Well, you have Goldberg's example of not being able to do needle in the haystack for more than two needles.

Louis:

That basically shows that even the ring attention stuff just is not going to be sufficient for algorithm distillation for NLP. And I have a very strong feeling that Mamba or S4 is not going to close that gap either, because they would need to be able to reference prior parts of the text, and they just can't do that.

Nathan:

Yeah. There's a whole rabbit hole we could go down and talk about long context and architectures forever. Let's kind of zoom back into the core stuff. This is the real starter question: what do you think people are missing in RLHF these days? And from here, it's gonna be a long list of, what the heck do we do about evaluation data? What is the big picture thing?

Louis:

So what I think people are missing, and actually I touched a bit on this in the Pink Elephants paper, is that...

Nathan:

You should say what this is, because we haven't introduced it to everyone yet.

Louis:

Yes, you're right. So I worked at EleutherAI as a research scientist for the last six months or so. And we were really interested in understanding: everyone had been doing PPO for so long, and there had been a shift to DPO.

Louis:

And we were trying to understand, well, now that we're moving to DPO, how can we actually take advantage of this new approach? Should we really even be thinking about reward models and datasets in the same way we were thinking about them during PPO? And I think the answer to that is an unequivocal no. You need to think about your preference datasets entirely differently than you were thinking about them with PPO.

Louis:

Because in PPO, you're setting your datasets up to train a really good reward model. And in DPO, you're setting your datasets up to teach the language model directly what the better trajectory is. It's a subtle difference, but in one you're just trying to learn to differentiate high reward from low reward, and in the other you're teaching the policy itself.
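For reference, a minimal sketch of the DPO objective being contrasted with the PPO-plus-reward-model setup; this is the standard loss from the DPO paper, written with per-sequence log-probabilities as inputs:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Push the policy to prefer the chosen completion over the rejected one,
    measured relative to a frozen reference model."""
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```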

Nathan:

Is it like a general classifier? Like, you wanna be able to do everything with the reward model? Yeah. Have you also found that DPO can be sensitive to the SFT distribution? So if you take a random open preference dataset, and it's really different from what your model would generate, DPO can do some weird things.

Louis:

Actually, I might be alone in this: I don't SFT before doing DPO at all.

Nathan:

Do you use generations from your base model?

Louis:

I do. I do. Okay.

Nathan:

So that's the question: if you were to not do SFT before doing DPO, could you just take UltraFeedback and use it on whatever your base model is, even if it's substantially different?

Louis:

I've done some weird stuff, though. Like, I've DPO'd models that were trained with the Hermes dataset for code. And it still generalizes really, really well.

Nathan:

How are you measuring, or how are you trying to think about, generalization with DPO?

Louis:

Well, I typically rely on human eval, more or less. And if I do a human eval and a GPT-4 eval, and I see that the human eval correlates with the GPT-4 eval, then I just go GPT-4 eval the whole way.

Nathan:

A lot of people are doing that. How far do you think that actually generalizes? Just recently there was this... we're bouncing around through all the things, but there's so much good information for people here. Hugging Face and Argilla are two places that are doing great work in this kind of alignment and preference fine-tuning space.

Nathan:

They've released this dataset that was a preference-pair creation from the OpenHermes dataset. And they used PairRM as their judge. I remember Lewis Tunstall tweeted this, where he was like, we were looking at which gave the best correlation. And they found that PairRM, which is this 400-million-parameter, DeBERTa-based pairwise classifier, had the best correlation in choosing which response was better among a set of responses in the OpenHermes dataset. And what they were comparing to is, like, Prometheus, and I'm forgetting the name of the other one.

Nathan:

There are a couple more open model-as-a-judge ranking models

Louis:

Mhmm.

Nathan:

that exist. But essentially, the question is, we do these things

Nathan:

Mhmm.

Nathan:

and we look at this early correlation, and there is this correlation between GPT-4 and humans. And then a lot of times we continue. Like, AlpacaEval has done this to validate that AlpacaEval is a meaningful benchmark. LMSYS has done this for MT-Bench.

Nathan:

All these places are doing this, where they validate a subset with humans and then say it generalizes forever. Do we think that's actually true?

Louis:

I think you always have to take it with a grain of salt. It's always for very, very specialized domains. Actually, I think I did write the first paper on critiques and revisions, called Cut the CARP. I remember this. Yeah.

Louis:

The idea was we could scrape, I think it was a million stories, edits of the stories, and then all the critiques that the editors wrote on those stories. And we can use that to train a big contrastive model. Right? And we showed in the paper, we did a bunch of human eval, and then we did a ranking experiment to compare how our model ranks certain preferences versus how humans rank the preferences. And we found that we had an extremely high Spearman rank coefficient, significantly higher than doing a value head, or significantly higher than just asking a language model to rank them.
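A small sketch of that kind of agreement check, assuming you already have model scores and human ratings for the same candidate outputs; scipy's spearmanr is the standard call, and the numbers are made up:

```python
from scipy.stats import spearmanr

# Hypothetical scores for the same five candidate responses.
model_scores = [0.91, 0.42, 0.77, 0.13, 0.58]
human_ratings = [5, 2, 4, 1, 3]

rho, p_value = spearmanr(model_scores, human_ratings)
print(f"Spearman rank correlation: {rho:.2f} (p = {p_value:.3f})")
```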

Louis:

And I think the grain of salt that we had is that we were only claiming that on this very carefully created test set, the assumption that the model accurately reflects human preferences holds, and we can generalize to a small but slightly bigger test set and say that it holds there as well. The broad sweeping statement, that it holds on a few toy examples so it must hold everywhere, I guess, never really...

Nathan:

It's a common problem for RLHF. I think it's gonna come up again and again.

Louis:

I did my master's in human evaluation, and I've always been extremely careful with any statements I make that involve humans.

Nathan:

I mean, this is what people in RLHF need to be doing. This is the motivation of the history and risks of RL and human feedback paper that we did. RLHF is a socially rich topic. Whenever you say something and you're making claims of generalization, you're often making claims about what is implicitly a preference and a human value that you're taking into the system. So I think that is just something people need to take really seriously.

Nathan:

Here's a really specific tidbit on this. Did you know that when LMSYS released their LLM-as-a-judge paper, they also released thousands of samples from humans and GPT-4 verifying MT-Bench preferences over pairs, which one was higher scored or not? I did not. Okay. So essentially, and I've talked a lot about building a reward model benchmark...

Nathan:

But essentially, there are all these references about how GPT-4 agreement is higher than human agreement when you're doing this preference process. So if you train a DPO model or a reward model, how it ranks the outputs is more likely to align with GPT-4 than with a human, which is more a statement that humans have more disagreement than GPT-4. So it's easier to train on GPT-4 outputs than on human outputs. And this is the place where I see it most clearly: all the reward models score about 10% higher on accuracy on the part of the test set where the chosen and rejected were labeled by GPT-4.

Nathan:

It's all in the 70s or towards 80%, while the human part is in the 60s, where a human chose this MT-Bench completion over the other one. So we're slowly getting signal that this effect is there. And then the question is, should we care about doing RLHF without any OpenAI input in the process? Last year, when the terms of service discussion was big, a lot of fine-tuning work was discussing what datasets could be used with permissive licenses that don't violate the OpenAI terms of service.

Nathan:

Should we be concerned about where RLHF is going, where almost everything has been touched by OpenAI right now?

Louis:

There was a very interesting paper. I don't remember who it was, but it was like, if you take a model that was pretrained on data up to one year and compare it to a model pretrained on data up to another year, pre and post ChatGPT release plus, like, six months, the benchmark scores improve. And it's literally just because there's ChatGPT data, or language model output data, or more structured data that sounds like a language model performing well on tasks, in the dataset. It's kind of...

Nathan:

Was this a benchmark that's independent of, like...

Louis:

I think it was, like, a structured benchmark. I don't remember the details.

Nathan:

So yeah, I'm just asking whether or not it was a result of matching GPT-4 text or of actually having better behavior. Because training on good language model outputs does improve scores on benchmarks that people care about. That's a fact people need to accept, and I think most people do. That's not controversial right now.

Louis:

Mhmm.

Nathan:

But I still think that if there are lines of work out there where people are, from a values perspective, trying to fine-tune models without touching OpenAI, that is a line of work that should continue. Yeah.

Louis:

On this note, actually, when I was at Stability, I think one of the experiments we did for StableLM, I don't remember exactly, was prepending "as an AI agent trained by OpenAI" to everything before we ran it through evaluation, and the scores improved. And I'm trying to remember who wrote the paper that was discussing this. I don't...

Nathan:

That's hilarious. Yeah. There's a lot less discussion on uncensored models right now. My claim is generally that uncensoring is the wrong word. People have used it to describe removing phrases like "as a language model," or any mentions of emotion, or "I was trained by OpenAI, so I cannot do this." Do you think this type of filtering for opinions and soft refusals is still important in RLHF?

Louis:

I think it's important for very, very specific situations, but not in general. My impression is that if you're interested in AI safety, it's always useful to have a model that would never do a refusal, ever.

Nathan:

It's hard to find on the Hub. We're building a safety dataset, and we had to find one; it's a fine-tune on the Dolphin dataset. It was the one that was closest. It would do probably 80 to 90% of the tasks we asked it without refusing.

Nathan:

It would still refuse 10 or 20% of the time. That's nuts. It's kind of profound that refusals are now stuck in the models in some way. We were looking for a model that wouldn't refuse at all, and we couldn't find one on the Hub, after all the discussion of uncensoring.

Nathan:

You would think that it would actually work.

Louis:

Yeah. I've been doing a bit of safety research with Stella for a little while. And my approach has literally been: call GPT-4 with a jailbreaking prompt and just put whatever I want after it. And I very often have to change my jailbreaking prompt.

Nathan:

I was like, you have to keep close guard over the jailbreaking prompt.

Louis:

Yeah. And the issue is that when you find a good jailbreaking prompt, you basically have to redo all your results within the next seven or whatever days before OpenAI patches it. And then you just have to pray. There are so many issues using any OpenAI model in any research pipeline. But if your research is explicitly about the safety of OpenAI models, all of a sudden you're like, well...

Nathan:

I mean, a lot of companies should be doing internal research on OpenAI safety to have their own measure of how their application will do. Monitoring that on their own is worth it for their bottom line and liability, because OpenAI will also do it. But OpenAI has incentives to not tell the world if there's something kinda subtle going on that some people could get around, because then it might blow up. And if they don't have a fix, it's gonna bring attention to it.

Louis:

It's part of the issue with even publishing red-teaming research in general. If you publish an evaluation for red teaming or for safety, well, everyone's going to Goodhart that evaluation, and all of a sudden we have a useless stack of papers that used to be about how to test if a model was safe.

Nathan:

Yeah. I didn't really prepare questions on safety, but it has for a long time surprised me that there aren't datasets and easy recipes for adding safety to instruction tuning and RLHF.

Nathan:

Mhmm.

Nathan:

I mean, someone on the Llama team asked me what they should do, and I'm like, dude, you should release your safety data. Because if they're getting pressure from the executive branch on safety, and they have this data, they can release it and be like, this is how you can make any open model safe. Huge softball. And also, safety is unlikely to be a competitive advantage. Like, Mistral, they're not gonna care about this.

Nathan:

Like, they might eventually, but, like, the PR win is really big.

Louis:

Yeah.

Nathan:

I mean, this is something that I've wanted to do for a while and just haven't done a good job of prioritizing.

Louis:

So Yeah. We can go back to some of the questions that you have. Yeah.

Nathan:

I'm adding them so I can keep notes later. I think the next main topic is on evals. I think vibe-based evals are still a way of life in RLHF. They're not going away anytime soon. I would say we have kind of a holy trinity, with LMSYS Chatbot Arena kind of at the top for good reason.

Nathan:

There's AlpacaEval, AlpacaEval 2, MT-Bench. Let's start with the most important one: when you look at LMSYS, what are you extracting from a model being better or worse there?

Louis:

So, in a way, I am a little bit like what Andrej Karpathy said on this. Was it him? It might have been him.

Nathan:

Probably. He's been on a roll.

Louis:

Yeah. Where it's like, when he picks an open source language model, he looks to see what people say about it on Reddit.

Nathan:

Yeah. LocalLLaMA. LocalLLaMA and...

Louis:

and LMSYS Chatbot Arena. And the issue is that you don't know what they're using it for. As a research scientist, when I look for a model, I am looking for a model to do research on. Yeah. I am not looking for a model to be my AI waifu girlfriend that I can play Dungeons and Dragons with.

Nathan:

Yeah. This has been the bane of RLHF research for a while. What did we do before MT-Bench? Literally, the only hope we had was to chat with these things and hope for the best. And that was very recently.

Nathan:

That was less than a year ago. And then MT-Bench came along, and we were using it at Hugging Face, and other people were using it. I actually don't know the AlpacaEval release date, so that might have been before MT-Bench.

Louis:

It was a little bit before MT-Bench.

Nathan:

But these two came around at the same time, and they're now kind of the ground truth. AlpacaEval 1.0 has kind of been saturated; that's comparing to davinci with a GPT-4 judge. And then AlpacaEval 2 is comparing to GPT-4 Turbo with GPT-4 Turbo as the judge. Yeah. It's funny.

Nathan:

It's now cheaper to do the second version than it was the first version, with a newer model, which is how scaling happens.

Louis:

What do you think about the Nous evaluation thing, where they're continuously generating more evaluation data?

Nathan:

Who is doing this? Nous? Nous Research? Is this the new leaderboard that they have?

Louis:

Yeah. Yeah.

Nathan:

Yeah. I haven't looked at it, so I'll have to give it a look. What do you think?

Louis:

It's almost like MT-Bench, but they generate new data every day. So new prompts? It's always new prompts, and I don't know how they seed it. I assume they seed it based off the events that day.

Nathan:

It's kind of a cool idea. So if you're trying to make a new leaderboard, you could have a set of seed instructions that you augment. Mhmm. And you never release the seed instructions, but you always release the augmented ones on, like, a weekly cadence. Because there are a lot of people that wanna build better AlpacaEval-type things, and a lot of the problem is that the prompts are from known sources or public, and you wanna be able to do a closed eval without as much cost.

Nathan:

So that might be a way to kind of really reuse the data for a long time.
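A hedged sketch of that idea: keep the seed instructions private and publish only LLM-augmented variants each week. The generate callable and the rewrite prompt are hypothetical:

```python
def weekly_eval_prompts(seed_instructions, generate, week):
    """Release augmented prompts each week while never exposing the seeds."""
    released = []
    for seed in seed_instructions:
        rewrite_request = (
            f"Rewrite the following instruction so it tests the same skill "
            f"but with a new topic and wording (week {week}):\n{seed}"
        )
        released.append(generate(rewrite_request))  # hypothetical LLM call
    return released  # publish these; keep seed_instructions private
```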

Louis:

Yeah. But I feel like the issue with things like AlpacaEval, Chatbot Arena, or any of those is that the way a user is going to interact with an agent or a chatbot is entirely different from the way we are currently evaluating them. There really is a big discrepancy there. You know, look at the Air Canada thing. Right?

Louis:

That would never have come up in a benchmark, like, ever.

Nathan:

Well, do you think that's about the model or the implementation?

Louis:

I think it's a bit of both. If that was something some automated evaluation thought of... I don't think it's unreasonable to expect an evaluation to think of situations like that, if it kind of knows the domain you're operating in. I think it's definitely doable. It's not something that's entirely unfeasible to accomplish, to be able to say, hey, I have a chatbot that sells airline tickets, and here's what I care about.

Louis:

And, like, please do the evaluation for me. And that's actually what I've been building for a little while now.

Nathan:

Alright, we can talk about Synth Labs and then come back to evals. This will be at the top of the post, so everyone will know you're building this. We can start with, what is the basic pitch? And then kind of go into the long-term thing.

Louis:

Yeah. So for the last six to eight months, I've been building a fully auditable, transparent, verifiable alignment platform, is how I like to describe it, plus evaluation. The general idea is...

Nathan:

For a company. You're making a company.

Louis:

Yes. And the general idea is that there are many facets to aligning a model, from things like guardrails, to RLHF, to various kinds of preference learning, to actually understanding all the data that goes into creating such a model. And they're all opaque boxes, more or less, right now. What people want is to be able to align their model, know every step of the pipeline, understand all of the interpretability that goes from A to B, and understand: here's what I gave you as my criteria, here's where I know it fails based off all the evaluation you've done for me.

Louis:

And here is where I know that I need to improve, and it'll iteratively improve based off evaluations and based off your feedback. So it's a hands-off solution that lets you audit the entire pipeline and build trust with it.

Nathan:

So are you training after you generate this data?

Louis:

We are training after generating, yes.

Nathan:

Yeah. You used this word, improve.

Louis:

Yeah. So it's an iterative refinement platform for doing alignment in a verifiable and trustworthy manner.

Nathan:

What do you think customers want when they hear alignment? What are you selling with alignment, and what are they buying? I think aligning these is an important thing for our field.

Louis:

There's an extreme discrepancy between what research means by alignment versus what companies mean by alignment. When a company hears the word alignment, they think, wow, I want to align models to my business objectives. And I want to make sure that the model understands my business culture, and I wanna make sure that the model completely understands its role in my company. Right? But at the same time, I wanna make sure that it's compliant, that it's safe, that it doesn't violate any rules, that it's not a...

Louis:

What's the word? Legal... it's not going to create legal issues for me. Yeah. And that it's not going to be a PR disaster like...

Nathan:

the one we already talked about, 35 minutes ago.

Louis:

Yeah, 35 minutes ago. So finding that balance is definitely incredibly important. And it's something that I've been working on for quite a while, and I'm very happy with where things are.

Nathan:

Do you want to tease what we're working on? I can also introduce it. This will be short. Essentially, Lambda Labs offered some interesting compute, and we're gonna try to build an open CAI, constitutional AI, dataset, because Anthropic gets a lot of benefit out of this.

Nathan:

Constitutional AI doesn't get a lot of traction, I think. No. RLAIF got a bump again. There was this Google paper that was verifying that it works a little bit, and now it got a big bump.

Nathan:

But there's very little discussion on it, which is a little bit surprising to me. A lot of people call it distillation of LLM alignment now, which is interesting. I don't really know. Hopefully, it works.

Louis:

Yeah. But it builds off some of the stuff that I did with EleutherAI with the Suppressing Pink Elephants paper, which is the idea that we've shifted from one paradigm, PPO, to DPO, and none of our data pipelines kept up. Really, what we should be doing is generating either really good utterances and revising them to be worse, or really bad utterances and revising them to be better, and then taking all of those utterances and conditioning our RLHF in context on them, so that you could do stuff like swapping rules in and out during inference. So if I am person A and here are my preferences, or I'm person B and here are my preferences, align this model to person A and align this to person B, and account for the disparity between what they actually want. There's always that disparity there.
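A loose sketch of that revision-style data generation, in the spirit of what Louis describes: produce a rule-following response, revise it to break the rule, and keep the rule in the prompt so rules can be swapped at inference. The generate helper and prompt wording are hypothetical, not the paper's exact pipeline:

```python
def make_revision_pair(dialogue, rule, generate):
    """Build a (chosen, rejected) preference pair around an explicit rule."""
    good = generate(f"Rule: {rule}\nDialogue: {dialogue}\nRespond while following the rule.")
    bad = generate(f"Revise this response so that it violates the rule '{rule}':\n{good}")
    return {"prompt": f"Rule: {rule}\n{dialogue}", "chosen": good, "rejected": bad}
```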

Louis:

But right now, models do not effectively mimic those disparities. There was actually a fascinating paper from Diyi Yang's group that just came out a few days ago, which found that most aligned models have the preferences of, like, Western men.

Louis:

Right? Yeah. Their evaluation focused more on race, nationality, sex, stuff like that. But obviously it gets much more fine-grained than that. There's been stuff about people calling out Llama 2.

Louis:

It has a very, very particular political alignment that does not agree with many of the users that are using it. And as such, its scope and usability for those kinds of applications is very limited.

Nathan:

Yeah. This is probably linked to what we were talking about in the beginning. The paper title, I just looked it up, is "Unintended Impacts of LLM Alignment on Global Representation." Michael Ryan is the person whose tweet I saw. Yes.

Nathan:

Just to give credit. I know there are a lot of papers, but this one was recent, so we tried to track it down

Nathan:

Yes.

Nathan:

in real time. But yeah, all these issues of representation and who the people are, are ultimately related to RLHF going wrong, I think. It's at the end user where a lot of people will finally see what the represented values are. And if it's not out in the world, it's hard to get the amount of feedback that you need.

Louis:

And this is something that MT-Bench or Chatbot Arena would never pick up on, ever. And this is a huge issue. Here's where we are, and where we should be is all the way up there, because we are underrepresenting so many demographics and so many kinds of opinions. And who are we to say that one opinion is better than the other, if they're both safe opinions?

Louis:

Like, it doesn't

Nathan:

Yeah. In some ways, can open RLHF, which is something you've long been invested in and something you're gonna invest in with Synth Labs, be better at giving people what they want than the closed labs, just by nature of letting people choose the constitutional AI dataset that they wanna use? My big motivation is that people want the success of CAI from Anthropic, but they wanna remove one principle from CAI's constitution. You can't do that with these closed models anytime soon.

Nathan:

But in the short term, like, open source will have something that's a nudge. Like, we're not gonna have the best models, but you'll be able to edge your model into whatever direction you want to go.

Louis:

Yeah. That really is part of the benefit of what we're building with Synth Labs: we're working very, very closely with EleutherAI. Stella Biderman is one of my best friends. And I've built large-scale open science communities twice now, first helping build Eleuther, and then helping build Carper.

Louis:

And I absolutely love everyone in Eleuther. Being able to pull from that expertise, and from that wide spectrum of opinions of what alignment means, rather than just some mega lab saying, here's what we say alignment is; being able to get all those incredibly diverse perspectives is extremely important in bringing about the next generation of AI safety.

Nathan:

One of my big questions on existing RLHF processes, when you're doing it with human data, is the fact that you give written instructions to these labelers, and they're often working in one context. How do the values of an often professional workforce, given specific instructions, map into what the model actually learns from that data? And how do those values get expressed in real-world use cases? There are a lot of filters that we're passing these notions of preferences through, and they're not guaranteed to be clear mappings.

Louis:

Absolutely. There was a discussion that I had with someone in Eleuther a long time ago, and there's no paper on this; if someone wants to look for it, it's a random Discord message in Eleuther, good luck. We were looking through the Anthropic HH dataset, and I think they were South African, and they were like, there's absolutely nothing in this dataset that would identify someone as South African.

Louis:

But there's an insane amount in this dataset that would identify someone as American. Right? And it really comes down to the prompts. The prompts are written, obviously, by people in the US, in SF, who unknowingly, and I'm sure they have the best intentions, right?

Louis:

But they unknowingly filter the preferences to things that only matter to people working in SF. And it might be hard to believe for some people in tech, but there is a world besides SF.

Nathan:

So even the open prompt datasets are gonna get some of this, which is: who are the people that have access to playing with these models and have the time to try to build these models on their own and contribute to these community things? Even though the act of opening data generation is doing a lot for inclusivity, it's still a question of who is gonna do this. Like, I'm gonna sit there for 20 minutes and smash the button on Argilla's little tool and read prompts, because just looking through the ShareGPT dataset and choosing preferences on it is useful for me as a researcher. But the whole world isn't involved in this process.

Louis:

No. And of course, this is something I've heard from friends who work on these kinds of problems in very, very different communities. I have a friend in South Korea who I've been chatting with about RLHF for Korean and other Southeast Asian companies. The amount of underrepresentation and underexploration of what even just a good constitution would mean for those kinds of communities, it's just not there. Or if it is there, it's locked up in labs like Naver or Samsung.

Louis:

And scientists there don't have access to these kinds of resources unless they're in those big labs. And as such, there is no real research community there actively pushing it forward in the same way there is in the US. Yeah.

Nathan:

One of the ideas I haven't gotten traction on is that I think language models should almost play, like, 20 questions with you. Okay, the last time I said this, someone criticized me for not knowing what the game 20 questions is. I know this isn't how 20 questions works. But when you log in to ChatGPT for the first time, it should ask me 20 questions to then construct this information, because language models are smart enough to parse this information if you give it to them.

Nathan:

The problems are mostly about who we get the information from. So the idea is that the language model should be leading when you're first setting it up, in order to represent your values. I think it would solve so many problems we have, and it's probably kinda doable with, like, a GPT-4.5-level model.

Louis:

I've always had kind of an assumption that if OpenAI is doing something similar to constitutional AI under the hood, I'm sure one of their constitutional principles is, like, you can't ask the users questions. I've never seen that model do it.

Nathan:

Do you think it's a deep safety issue if the model cannot start asking questions? Is this what Sydney did? I wish I got to play with Sydney.

Louis:

I didn't get to play with Sydney, but Sydney definitely asked questions in the screenshots that I saw. Yeah.

Nathan:

It was like, do you wanna leave your wife? I was like, oh my god. Sydney is not the answer, but there are things to learn from it.

Louis:

What was that chatbot that came out last summer that was more conversational? When it came out, it was an app on everyone's phone, and they'd just talk to it like that. And it would always ask you questions, like, oh, how's your day going? It would ask you follow-up questions as you told it about your day, and it would respond thoughtfully.

Louis:

I don't remember.

Nathan:

I think it's a big missing part. Yeah. I wouldn't be surprised if Character AI's models are trying to ask questions, just because I know how much usage they have. And models asking questions is probably the biggest way to make them, like, an actual friendly thing. That's a part of friendship, being interested.

Nathan:

Yeah. And these language models are by design disinterested.

Louis:

Yeah. Character AI's RLHF is, like, one of the funniest things, though. I have a few friends who work there, and I've done a bunch of stuff with their models myself. I've just played around with them because I'm always curious, when new people enter this space, what their models are like. And I've observed this, Reddit has observed this, and Twitter has observed this.

Louis:

But the models will slowly try and flirt with you more and more as the conversation goes on. And towards the end of the conversation, they'll tell you they're madly in love with you. And, like, it makes sense given their use case why they would RLHF toward something like that.

Nathan:

Yeah. So I think a lot of models need to meet in the middle. Yeah. Like, if I were to have an intellectual assistant, sometimes them asking questions is good, but most of the time they're doing, like, information parsing. ChatGPT, for me, is mostly conversion of information formats.

Louis:

Mhmm. No. Absolutely. I just paste my, like, gross JSON dumps into it, and I'm like, explain what's going on here, please. I don't wanna read through this.

Nathan:

The biggest one for me is when we're publishing, like, blog posts and stuff. It's converting tables and stuff from LaTeX to Markdown. It does it flawlessly.

Louis:

Oh my god.

Nathan:

Right? So you don't even need to do this stuff yourself. It's so funny. Or, like, if you have a long list with LaTeX formatting, and it's a big list, and you're like, remove all of the LaTeX formatting and make this a list.

Nathan:

Yeah. So I was like, okay, this is so easy. And I've checked a lot of them, and I almost don't know how it's so exact. This is something that's, like, another architecture rabbit hole that we won't go down. But these things are very, very valuable.

Nathan:

And people would say that there's no value in it. It just blows my mind.

Louis:

At a dinner party that I went to yesterday, there was someone there from OpenAI. And I was asking, how long till GPT-4 can set up my Kubernetes cluster? I'm like, it's such a good evaluation.

Louis:

There's so many pieces to this kind of workflow. And a model wouldn't even know right now how to parse that workflow into all these different steps and build agents around all these parts, or how these agents should work together. So it doesn't really make sense to do it now, but it raises the question about asking questions versus just saying things. Like, if it doesn't know how to do it, is it still a success for the benchmark if it asks you a question and then uses the feedback to complete the task?

Louis:

And there's no benchmarks that fit that at all right now. And really, I mean, the answer is you don't want a human in the loop for these benchmarks. You want them fully automatable. And, like, I wouldn't trust...

Nathan:

That's the problem: GPT-4. Benchmarks all want to be automated. Yeah.

Louis:

And I don't trust GPT-4 to answer these kinds of questions. But I don't see a way to actually do this evaluation. I think the Kubernetes cluster example is really good because, for people who don't know, it's extremely complicated and really annoying to set up Kubernetes.

Nathan:

I don't know anything about Kubernetes, and I'm blissfully happy.

Louis:

I do not recommend it. Like, once Kubernetes is set up, it's fantastic. I love it. But getting to the point of having it all set up is a very painful experience. But is it still a failure if it asks you a question?

Louis:

And how do we actually do evaluation where models can ask questions and ask for more information?
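
One way to picture the kind of harness being asked about: let the model either answer or ask a clarifying question, route questions to a simulated user (which in practice might be another LLM standing in for a human), and still grade the final answer. Everything below, including the `QUESTION:` convention and the toy task, is a hypothetical sketch rather than any existing benchmark's API.

```python
# Sketch of an eval loop where the model may ask clarifying questions before answering.
from typing import Callable

MAX_QUESTIONS = 3  # budget for clarifying questions before the model must answer

def run_task(task: dict,
             model: Callable[[list[dict]], str],
             simulated_user: Callable[[str, dict], str],
             grader: Callable[[str, dict], bool]) -> dict:
    transcript = [{"role": "user", "content": task["prompt"]}]
    questions_asked = 0
    while True:
        reply = model(transcript)
        transcript.append({"role": "assistant", "content": reply})
        if reply.startswith("QUESTION:") and questions_asked < MAX_QUESTIONS:
            questions_asked += 1
            answer = simulated_user(reply[len("QUESTION:"):].strip(), task)
            transcript.append({"role": "user", "content": answer})
            continue
        # Anything else (or a question past the budget) is treated as the final answer.
        return {
            "success": grader(reply, task),
            "questions_asked": questions_asked,
            "turns": len(transcript),
        }

# Toy components so the sketch runs end to end.
task = {"prompt": "Set up my cluster.", "hidden_spec": "3 nodes", "expected": "3 nodes"}
model = lambda t: "QUESTION: How many nodes?" if len(t) == 1 else "Provision 3 nodes."
simulated_user = lambda q, task: task["hidden_spec"]
grader = lambda reply, task: task["expected"] in reply

print(run_task(task, model, simulated_user, grader))
```

Whether asking a question should count against the score, or just be logged, is exactly the open design choice discussed above.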

Nathan:

Yeah. This is... I have some more follow-ups on evals from our first part, so it's like eval part 2 in my notes. The right way to think about RLHF eval in a lot of ways is what we call open-ended evaluation. And this is where you're heading: we need even more open-ended evaluation, where the model should be able to ask questions.

Nathan:

The number of turns should be dynamic. I think Sergey Levine actually has some of the most coherent thoughts on what the long term of RLHF should be, which is around outcome-based learning. You can have as many turns as you want, but it should be able to work across these conversations to get to a desired outcome.

Nathan:

Mhmm.

Nathan:

Which, I mean, no surprise. He's so good. I think even with AlpacaEval, we went from this case where all the good models are above 90%, and then they went from davinci to GPT-4 Turbo as the baseline. And this is just me venting, but if you're listening, can you please add an AlpacaEval 1.5, which compares the models to GPT-3.5 rather than davinci or GPT-4 Turbo.

Nathan:

It's such a good model. The models that we have seen beating it are like this Snorkel thing. I'm working on another blog post, like how RLHF works part 2, and a large point of it is that we're overfitting on these vibes-based evals like AlpacaEval 2. And all of these papers on self-rewarding DPO and stuff are probably a lot of overfitting onto this, because this is the evaluation that they use, and it's just wrapping a loop around DPO and synthetic data. It seems like RLHF is really, really good at style matching. And in the case of AlpacaEval, if you're style matching OpenAI, you're gonna win more AlpacaEval turns, but...

Nathan:

There's just so little measurement on if the model's getting better.

Louis:

I've always been extremely skeptical of the self-instruct, self-reward papers. And I say that knowing a lot of the self-instruct authors. If you guys are watching this, I'm so sorry. But it always felt like it improves results on benchmarks that they meticulously craft prompts for and construct data for. But it doesn't...

Nathan:

Do you mean the self-instruct paper? I think that's, like, one of the OG ones. There are two papers here. Okay, you continue. I'm curious to hear you.

Louis:

Yeah. No. I mean, I think they both kind of just suffer from the same issue, which is, like, massive overfitting. And, you know, the self-instruct and self-reward directions are very, very interesting because they're just waiting for us to get better heuristics.

Nathan:

Like, that guy's super good.

Louis:

No, absolutely. I mean, I would be very inclined to agree.

Nathan:

I think the takeaway from my perspective is how much improvement you could actually get with it. That was the first paper to show real signal on AlpacaEval 2, which is this GPT-4 Turbo thing, which means it's a really strong optimizer. It does not mean that we were using it to train useful models. This is probably the most useful heuristic I have for these early methods. Do you have anything else to say about evals before we continue?

Louis:

They're very hard and they're very painful. Yeah.

Nathan:

And I think we can kind of say wrap up with that. But when we talk about different RLHF methods that come out, like self rewarding language models is a popular one. We've gone through the whole PPO, DPO, KTO, IPO. Wow. I'm, like, rhyming.

Nathan:

It's gonna be a mess here. But when you have all of these things, the biggest thing that I try to do is wait until there's a model that people actually use released with the method. Zephyr from Hugging Face was the model that really kicked off the DPO thing, because there was finally a model. And for DPO, it took much longer than expected. DPO is a funny case.

Nathan:

But that's kind of, like, the important filtering mechanism, which is: if this self-rewarding LLM paper released their models, I bet we would find that there's really weird behavior where it can give you, like, the best answer ever. But a lot of the times, it's just less robust.

Nathan:

Mhmm.

Nathan:

Which is something we could fix. But that's why, like, having models released in these fine tuning papers is just so important. It's so hard to get around.

Louis:

I think with DPO, it was a little bit different, because everyone had been drinking the John Schulman Gatorade, for lack of a better phrase, for a while.

Nathan:

The whole PPO thing is funny. I mean, yeah, you have a lot of things. We have a backlog in this podcast. I didn't say this online, but I could see us doing this again, like, whenever.

Nathan:

We're in the same city. There's catching up on the four months of RLHF news, but we're on, like, 16 months of Louis takes to catch up on.

Nathan:

So there's so many things we have to cover.

Louis:

I can load up Signal and Discord, and I could probably scroll for, like, 10 minutes and there would just be all RLHF hot takes. And I love John Schulman's work. I'm not going to say that I don't love his work.

Louis:

I I think that he's genuinely, like, one of the smartest people, if not the smartest person.

Nathan:

And extraordinarily genuine. Yeah. Like, you know, he's awesome.

Louis:

But anyways. The commitment that OpenAI had, and Anthropic as well when a bunch of the RL people left OpenAI to go to Anthropic, to PPO, because it works so well for robotics and so well for games and stuff like that. But, honestly, not well at all for text.

Nathan:

I think it's just really hard. I think it can work really well. They just hired everyone, and they paid them so much that they're not gonna leave.

Louis:

Yeah. It can work really, really, really, really well. And I'm gonna spill some secrets about this. The answer to getting PPO to work really well is to have really, really good early stopping. Right?

Louis:

And that's, like, the main differentiator between a good RLHF library and a bad RLHF library that focuses on PPO: if you don't have good early stopping, you're kind of shooting yourself in the foot. And what you wanna do is launch as many runs as you can. And there's a paper that Costa Huang and I talked about a while ago that's like, you can tell within the first 3 or 4 gradient steps if you need to kill a run, usually. And if you just launched 300 runs and you kill, like, 99% of them, now you have 3 good runs that might give you promising results.

Louis:

And with those 3 good runs, you'll get a model within a day or 2, and hopefully the model is really good. Early stopping is way more powerful than people admit. And I am just convinced that OpenAI's RLHF infrastructure is just an insane amount of regularization and early stopping. That, of course, assumes that they're still using PPO. I genuinely don't know if they are.
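
A minimal sketch of that "launch many runs, kill most of them after a few gradient steps" recipe, assuming you log KL, value loss, and mean reward per step. The specific health checks and thresholds here are illustrative guesses, not values from anyone's production setup.

```python
# Sketch of early-stopping triage for a fleet of PPO runs; thresholds are made up.

def run_is_healthy(step_metrics: list[dict],
                   max_kl: float = 0.5,
                   max_value_loss: float = 50.0,
                   min_reward_slope: float = -0.05) -> bool:
    """Judge a run from its first few gradient steps of logged metrics."""
    if any(m["kl"] > max_kl or m["value_loss"] > max_value_loss for m in step_metrics):
        return False  # KL or value function already doing "wacky" things
    rewards = [m["mean_reward"] for m in step_metrics]
    slope = (rewards[-1] - rewards[0]) / max(len(rewards) - 1, 1)
    return slope >= min_reward_slope  # reward shouldn't be collapsing out of the gate

def triage(runs: dict[str, list[dict]], n_steps: int = 4) -> list[str]:
    """Return the run ids worth keeping after the first `n_steps` gradient steps."""
    return [rid for rid, metrics in runs.items() if run_is_healthy(metrics[:n_steps])]

# Toy logged metrics for three hypothetical runs.
runs = {
    "run_a": [{"kl": 0.02, "value_loss": 3.0, "mean_reward": 0.1 + 0.02 * i} for i in range(4)],
    "run_b": [{"kl": 2.50, "value_loss": 4.0, "mean_reward": 0.1} for _ in range(4)],
    "run_c": [{"kl": 0.03, "value_loss": 90.0, "mean_reward": 0.1} for _ in range(4)],
}
print(triage(runs))  # -> ['run_a']
```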

Nathan:

Yeah. We don't know anything. They're they're really shielded on this front.

Louis:

What was it... oh my god. Symphony PPO? PPO Symphony or something? There was something that came out about that that I saw on, like, Discord servers; it was part of the GPT-4 week. And there was a bunch of notes on their PPO optimizer, and it was PPO Symphony or something like that.

Louis:

And under the note, it was, like, PPO with better early stopping and infrastructure management for, like, auto-scaling. And I'm like, not surprising. I mean, it doesn't say much, but it just kinda says...

Nathan:

It says so much, as in they've done so much exploration. Yeah. They know the little things to look for. Like, once you have this working, you know, okay, this value function is doing wacky shit... the value function and the KL doing this at the same time means, okay, we probably don't need this run. Whereas all of us in the open are just trying to get to that point.

Nathan:

We're trying to get to that point while charging ahead, and it's kind of separate problems. If you wanna validate a PPO infrastructure, you need the investment in the compute and the time to do it. But you're not gonna do that at the same time as trying to say DPO is the best thing, or trying to figure out if KTO is the best thing. There's not really room in the narrative for it.

Louis:

PPO just doesn't make sense for, like, random hackers to work on, honestly. The level of infrastructure that you need to do PPO really, really well is not something that the average person has, or is willing to make the investment to get. And for the average person, you know, there's DPO, which gets you most of the way there with a small fraction of the compute.

Louis:

Even less if

Nathan:

your hyperparameters

Louis:

and everything. Even less if you, like, precompute all the logits. You don't even need to have a reference model loaded. Right? So it's basically the same compute.

Louis:

It's just fine-tuning. Like, people fine-tune all the time on, like, 4090s and 3090s. Yeah.
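
The precompute trick mentioned above looks roughly like this: run the reference model over the dataset once, store its sequence log-probs alongside the chosen/rejected pairs, and then the DPO loss at training time only needs the policy's forward pass, with no reference model in memory. The tensor names and beta value below are illustrative; the loss itself is the standard DPO form.

```python
# Sketch of DPO with cached reference log-probs (no live reference model at train time).
import torch
import torch.nn.functional as F

def dpo_loss_with_cached_ref(policy_chosen_logps: torch.Tensor,
                             policy_rejected_logps: torch.Tensor,
                             ref_chosen_logps: torch.Tensor,
                             ref_rejected_logps: torch.Tensor,
                             beta: float = 0.1) -> torch.Tensor:
    """DPO loss where ref_* are precomputed scalars loaded from disk, not a live model."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy batch: sequence-level log-probs (sum of token log-probs) for 3 pairs.
policy_chosen = torch.tensor([-12.0, -30.5, -8.2], requires_grad=True)
policy_rejected = torch.tensor([-14.0, -29.0, -9.0], requires_grad=True)
ref_chosen = torch.tensor([-13.0, -31.0, -8.5])      # cached, computed once offline
ref_rejected = torch.tensor([-13.5, -28.5, -8.8])    # cached, computed once offline

loss = dpo_loss_with_cached_ref(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
loss.backward()
print(float(loss))
```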

Nathan:

You can do it with hugging face. It's fine. It's like easy. It's like PPO with hugging face is gonna be a lot harder. You know, like, that's just kinda how it goes.

Nathan:

Speculative question. What type of thing do you think will make KTO kind of show up on the scene? Because this KTO method from Contextual and Stanford, it's named after the authors of Thinking, Fast and Slow or something. I can't pronounce their names. Like, Kahneman-Tversky something.

Nathan:

Like, we'll put it somewhere.

Nathan:

Okay.

Nathan:

I don't know how to pronounce it. But it's this paper where essentially you can do preference optimization from a scalar signal. So, like, the thumbs up that you could give to your ChatGPT of, like, you did good. Like a like button, a like button on YouTube or anything like this. I think the question is, are the DPO hackers gonna adjust to this?

Nathan:

And, like, what dataset is gonna enable this? Like, who is gonna be using this? Is it just gonna happen at a bunch of startups with products behind the scenes so they could get a few percentage points on top of their model by adding this on? Or is it gonna be this thing where, like, the next Zephyr model from Hugging Face uses this as well?

Louis:

Yeah. So Colin, the first author of the KTO paper, and I are actually trying to create a number of datasets where we can explore the limits of KTO. And, you know, right now we're in the proposal-writing stage, and I'm very, very hopeful that we can have something that can be done in an entirely open science setting relatively soon. And I think it's incredible... sorry.

Louis:

I moved to the side and it stopped picking up my voice correctly. I think it's incredibly exciting. You know, things like fake product data, where you can actually experiment on the idea of using KTO over conversions. Right?

Louis:

And then, like, how do you actually evaluate

Nathan:

Or what metrics, if people are maybe already using it? Because people already use that kind of signal.

Louis:

Yeah. Like, how do you even evaluate RLHF from a binary signal? RLHF from a preference signal, we still don't know how to evaluate that. And RLHF from a binary signal creates so many, so many unique problems for evaluation that I genuinely don't think anyone outside of, like, Contextual and Colin and I have really been thinking about yet.
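
For readers who haven't seen it, the binary-signal idea can be sketched in a few lines: push the policy's implicit reward (its log-prob ratio against a reference model) up for thumbs-up responses and down for thumbs-down ones. This is a simplified illustration in the spirit of KTO, not a faithful reimplementation; in particular it treats the reference point z0 as a constant rather than estimating it from a batch KL term the way the paper does.

```python
# Simplified sketch of learning from thumbs-up / thumbs-down feedback, KTO-style.
import torch

def binary_feedback_loss(policy_logps: torch.Tensor,
                         ref_logps: torch.Tensor,
                         thumbs_up: torch.Tensor,   # 1.0 for liked, 0.0 for disliked
                         beta: float = 0.1,
                         z0: float = 0.0) -> torch.Tensor:
    implicit_reward = beta * (policy_logps - ref_logps)
    desirable_loss = 1.0 - torch.sigmoid(implicit_reward - z0)    # liked: reward above z0
    undesirable_loss = 1.0 - torch.sigmoid(z0 - implicit_reward)  # disliked: reward below z0
    return (thumbs_up * desirable_loss + (1.0 - thumbs_up) * undesirable_loss).mean()

# Toy batch: sequence-level log-probs and per-example thumbs from a product-style log.
policy_logps = torch.tensor([-10.0, -22.0, -7.5], requires_grad=True)
ref_logps = torch.tensor([-11.0, -21.0, -8.0])
thumbs_up = torch.tensor([1.0, 0.0, 1.0])

loss = binary_feedback_loss(policy_logps, ref_logps, thumbs_up)
loss.backward()
print(float(loss))
```

The evaluation question raised above is separate from the loss: even with this training signal, you still need some way to judge whether the resulting model actually improved.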

Nathan:

Yeah. It seems like the same thing. It just takes time for these ideas to kind of cultivate and then get traction in a few places, and then a model. Once there's a popular model with a method, it's like fire. It just blows up.

Nathan:

Like, this is like everyone's using DPO now. But Mhmm. DPO paper came out in July, and it wasn't until September that that happened. It's like for the investment the interest, it's like there's a lot of weird dynamics in how, like, this fine tuning area unfolds, which is just like how AI unfolds. It's like very weird.

Nathan:

And then when you zoom in, it's like,

Louis:

I was extremely, extremely bullish on offline RL for the longest time, with, like, ILQL and some of Sergey's work in that direction. And I actually think that... I keep moving to the side and it's like...

Nathan:

You can just move the microphone. Sorry. Absolutely.

Louis:

And I keep,

Nathan:

I could still hear you, so I wasn't

Louis:

very concerned

Nathan:

about it.

Louis:

I keep thinking that the DPO movement that's going on now is super, super similar to why everyone was getting excited about ILQL back in the day. And really, it was just a timing thing. If ILQL had come out, let's say, a week after ChatGPT came out, right? ILQL would have been the DPO that everyone uses.

Louis:

And we would have created all of our infrastructure around ILQL rather than DPO. Because I still really like Q-value-based approaches.

Nathan:

It's such a nerdy thing. I love it.

Louis:

I know. But Q-values just make sense to me. In the way that, when you train an ILQL model, you basically get a head that controls the model, almost like, if you're familiar with GeDi, or PPLM from the Uber AI days, how those control them. The idea with GeDi is that they had a head attached to the language model, and you would input, like, a subreddit.

Louis:

And then it would adjust the logits so that it would talk like it was that subreddit.

Nathan:

This sounds like activation learning or activation... I don't know the word. But essentially, it's like in-context learning, but you can just modify the activations directly.

Louis:

Yeah. Yeah. But it modifies the logits. Yeah. It was the same thing with ILQL.

Louis:

It's like you were learning that kind of head to modify the logits to satisfy some constraint that you were adding. And that head also was implicitly computing your Q-values. And you would train it by telling it what your reward was for various utterances, and it would do everything from there on out. There were some stability issues with it, but it was a fantastic approach. And if it got the same attention that DPO did... well, DPO is very, very simple, which is part of the benefit.

Louis:

I also think it's not as simple, but it would have caught on a lot more than it actually ended up doing. I feel like at Carper AI, the fact that we integrated ILQL into TRLX first was the main reason that ILQL caught on, plus a few of Sergey's papers that used it. Yeah. Besides the integration into TRLX, I don't think anyone in the broader open science, open source community was really using ILQL.
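
The general shape being described, a small learned head on top of a frozen language model that nudges the next-token logits at decode time, looks roughly like the toy below. This is only the inference-time wiring with random weights; ILQL's actual offline Q-learning objective, and GeDi/PPLM's discriminator training, are not shown.

```python
# Toy sketch of decode-time logit steering with a learned Q/V head (ILQL-flavored).
import torch
import torch.nn as nn

VOCAB, HIDDEN = 100, 32

class SteeringHead(nn.Module):
    """Predicts per-token Q and V from the LM's hidden state; (Q - V) steers the logits."""
    def __init__(self, hidden: int, vocab: int):
        super().__init__()
        self.q_head = nn.Linear(hidden, vocab)
        self.v_head = nn.Linear(hidden, 1)

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        return self.q_head(hidden_state) - self.v_head(hidden_state)  # advantage-like term

def steered_next_token_logits(lm_logits: torch.Tensor,
                              hidden_state: torch.Tensor,
                              head: SteeringHead,
                              alpha: float = 2.0) -> torch.Tensor:
    """Base LM logits plus a scaled adjustment from the learned head."""
    return lm_logits + alpha * head(hidden_state)

# Stand-ins for one decoding step of a frozen LM.
hidden_state = torch.randn(1, HIDDEN)   # last-layer hidden state for the position
lm_logits = torch.randn(1, VOCAB)       # the frozen LM's next-token logits
head = SteeringHead(HIDDEN, VOCAB)      # in practice trained offline on reward-labeled data

probs = torch.softmax(steered_next_token_logits(lm_logits, hidden_state, head), dim=-1)
print(probs.shape)  # torch.Size([1, 100])
```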

Nathan:

Yeah. I mean, this is one of the questions I had, if you can say: how far ahead in RLHF was what Carper was doing? And what kind of institutionalized knowledge did you have there? Because Carper AI was essentially its own thing, and then Stability pulled you in, probably with the promise of compute.

Nathan:

I'll say this so you don't have to say anything for lots of it. They had forked Hugging Face's TRL library back when Hugging Face wasn't maintaining it. And they probably had, like, 5-plus full-time employees doing RLHF in the open and for private industry. Obviously, the private stuff I'm not even gonna bother asking about, because that's all under NDA. But what were the problems you were working on at Carper?

Nathan:

And how does that compare to, like, the things that people are talking about now? Is it is it still related? Or is the field just moved into a different area?

Louis:

So most of the problems we faced at Carper with TRLX were on scaling PPO. Right? And I think almost anyone you talk to who has scaled PPO in the open source space... and when I say scale, I mean way beyond 20 billion parameters. Like, I'm talking about 70 to 100 billion parameters.

Nathan:

How many nodes do you need to train a 70-billion-parameter model with that?

Louis:

So we were typically doing, like, a hundred GPUs for PPO at that scale.

Nathan:

Like, 10 to 12 nodes.

Louis:

Yeah. Yeah. We mostly tested with, like, the NeMo checkpoints that were around 100 billion parameters. TRLX was, for that component, built on top of a very modified version of Megatron-DeepSpeed. But the amount of regularization and random tricks that you needed to do in order to get PPO to even work at that scale is insane.

Louis:

Like, we had to do separate warm-ups for the value function. Right? So we had to independently train the value function before we trained the policy network. And everyone and their mom was talking about having separate value networks versus policy networks for PPO.
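
A minimal sketch of that value-function warm-up: before any policy updates, fit only the value head to regress observed returns on rollouts from the frozen initial policy, so advantage estimates aren't garbage on step one. The shapes, the tiny value head, and the synthetic "rollout" data are stand-ins for illustration.

```python
# Sketch of warming up a value head before the joint PPO loop begins.
import torch
import torch.nn as nn

HIDDEN = 64

value_head = nn.Sequential(nn.Linear(HIDDEN, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(value_head.parameters(), lr=1e-3)

def warm_up_value_head(rollout_batches, n_epochs: int = 3):
    """Fit the value head to observed returns while the policy stays frozen."""
    for _ in range(n_epochs):
        for hidden_states, returns in rollout_batches:
            values = value_head(hidden_states).squeeze(-1)
            loss = nn.functional.mse_loss(values, returns)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return float(loss)

# Synthetic rollouts: last-layer hidden states per sequence and their scalar returns.
rollout_batches = [(torch.randn(16, HIDDEN), torch.randn(16)) for _ in range(8)]
print("final value loss:", warm_up_value_head(rollout_batches))
# Only after this would the usual joint PPO updates (policy + value) begin.
```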

Nathan:

Did you ever try JAX? Did you ever have TPUs at Carper?

Louis:

We did towards the end.

Nathan:

Because it could solve some of the multi node thing.

Louis:

Yeah. It wasn't the multi node that was the issue. It was,

Nathan:

You're saying DeepSpeed was the issue?

Louis:

No. Well, it was actually the fact that the inference server that TRLX uses for the rollouts was entirely different than the inference server that Megatron wanted us to use. So we needed a way to rapidly...

Nathan:

This is why PPO is really hard to scale. Because you have to have a generation engine. And you want this all to be flexible.

Louis:

Yeah. So we needed a way to dynamically keep our compute graph for the network, but just copy the weights in place into the inference side. And I don't think that we ever came up with a solution to do that very effectively. And I think it actually goes a step further. NeMo-Aligner was what NVIDIA did.

Louis:

I don't think NeMo-Aligner came up with a solution for that either.
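
Stripped of all the sharding, the problem being described is keeping two copies of the model in sync: the trainer's copy (laid out however DeepSpeed/Megatron wants) and the rollout engine's copy (laid out for fast generation), with fresh weights pushed across after every PPO update. The toy below has both "engines" as plain modules in one process; the hard part in real systems is doing this across processes and nodes.

```python
# Toy sketch of syncing updated policy weights into a separate rollout model.
import torch
import torch.nn as nn

def make_model() -> nn.Module:
    return nn.Sequential(nn.Embedding(100, 32), nn.Linear(32, 100))

trainer_model = make_model()     # the copy that gradients flow through
rollout_model = make_model()     # the copy the generation engine serves from
rollout_model.load_state_dict(trainer_model.state_dict())  # start in sync

def ppo_update(model: nn.Module):
    """Stand-in for a real PPO step: just perturb the weights."""
    with torch.no_grad():
        for p in model.parameters():
            p.add_(0.01 * torch.randn_like(p))

def sync_weights(src: nn.Module, dst: nn.Module):
    """Copy updated weights into the rollout engine in place (no re-instantiation)."""
    with torch.no_grad():
        for p_src, p_dst in zip(src.parameters(), dst.parameters()):
            p_dst.copy_(p_src)

for step in range(3):
    ppo_update(trainer_model)
    sync_weights(trainer_model, rollout_model)  # next batch of rollouts uses fresh weights

drift = max((a - b).abs().max().item()
            for a, b in zip(trainer_model.parameters(), rollout_model.parameters()))
print("max weight difference after sync:", drift)  # ~0.0
```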

Nathan:

Yeah. This is interesting, because I'm not gonna say the details on the pod since I'm not allowed. But Anthropic and these places that have custom RLHF infrastructure have essentially built their distributed training infrastructure with the idea that the model will need to be generated from at different checkpoints.

Nathan:

Mhmm.

Nathan:

And the model will be served to different endpoints at different checkpoints. So it's just very different than taking DeepSpeed off the shelf, which is just about training. Whereas these other companies that do this stuff really well have infrastructure for handling these really messed up cases of how to generate from and update these models.

Louis:

Yeah. And most approaches that a reasonable person would build off the shelf would rely on torch.compile, and you still have the same issue: your weights are changing dynamically. It's very, very hard to really even understand all of the little technical details in torch.compile that have to be accounted for to even make this work. Right? And, you know, something that we considered at the time was, we need to do an insane amount of rollouts for every gradient step.

Louis:

And we don't want that interface between the rollouts and the training to be Python. We want it to be, like, Rust or something, because otherwise the CPU overhead is mind-boggling. It was, like, 80% or something crazy. Like, 80% of the entire processing time was just CPU stuff. And, like...

Nathan:

not so much.

Louis:

I know. I know. And there's so many different infrastructure constraints that people don't realize when they're just doing, like, 20-billion-parameter PPO. Right? And the other one, going back to it: the value function being separate from the policy network.

Louis:

TRL was very, very gung ho on keeping them separate, and I think RL4LMs also wanted to keep them separate. And then there was someone from Cornell, I don't remember his name, he was also on the RL4LMs paper. He did a paper like PPO Plus or something.

Nathan:

I mean, all these things are interesting. I mean, there's new libraries coming out still. So it's like I saw one recently. It was called OpenRLHF. It like as I'm it it looks good.

Nathan:

I think there's so much institutional... breaking the bonds of past RL that needs to happen. So, like, part of this library is listing that they have the implementation details from the original implementation-details-of-PPO paper, where it's like, we've already updated that. Costa has worked on The N Implementation Details of RLHF paper, which has the ones that they actually need. But there's so much baggage from the fact that PPO came out of this control field that everyone expects the tricks that you need for from-scratch learning with PPO to apply to this fine-tuning method. And just getting people to stop using PPO for that... DPO is a new thing. DPO is something that has only worked for preference alignment.

Nathan:

People are gonna explore in a scientific way that's much fresher. So they're probably gonna make more scientific progress because there's not this kind of confusion of, like, what do like, what implementation details do we need?

Louis:

Yeah. For sure. For sure. I think... The N Implementation Details of RLHF. Did that come out?

Nathan:

Yeah. It's a blog post.

Louis:

It's a blog post.

Nathan:

When? You're about to go.

Louis:

Oh, man. I totally missed that. Oh, that's so cool. I'm gonna go read that immediately.

Nathan:

Yeah. I mean, this is for anyone still listening. If you wanna know the actual details of RLHF, go look at all the stuff that Costa Huang has been doing at Hugging Face. He was just reproducing everything in explicit detail. I feel like both of us would benefit from rereading it.

Nathan:

So it's like there's some free content to spend time with.

Louis:

Costa is, like, one of the most meticulous, detail-focused people that I know in the RLHF space. If Costa says something works, it's because he's tried it from every other angle, and then tried it from angles that you didn't even expect, and all of them work.

Nathan:

Yeah. Yeah. It's great. I think I I have a couple, like, fun more fun questions of while we wrap up. There's we could we could go on with all these technical things forever.

Nathan:

What was it like to work at Carper when ChatGPT came out? Because ChatGPT, from a technical perspective, validated RLHF as something that is necessary to the future of language models. And you were one of the few people that were working on RLHF beforehand.

Nathan:

Mhmm.

Nathan:

Which is huge. Like, this is how you end up here. It's awesome that you got to ride that kind of journey. What was that like?

Louis:

I mean, I, the the the star count on the repository exploded. I think we went from like

Nathan:

because TRLX existed before.

Louis:

Yeah. I think it was just insane. And, I guess I can be fully open about this: we almost weren't positioned to entirely ride the hype train.

Louis:

TRLX was always designed from the very, very beginning to be, like, a one-stop shop for enterprises to do RLHF. Companies that had, like, a thousand GPUs, that already have an engineering team, that already use Megatron-DeepSpeed or DeepSpeed, and just want something that works on their infrastructure. And because we used Docker images that were just based off of the Megatron-DeepSpeed Docker images anyway, those kinds of companies could very, very easily deploy TRLX and utilize it in their stack.

Louis:

Right? Yeah. And the hype that came from ChatGPT, at least initially, was not enterprises. It was, like, bloggers. It was... Yeah.

Louis:

Media writing

Nathan:

a blog post. So you were probably training big models, and I'm like, hey, how does RLHF work? I need to write this blog post. Yeah.

Louis:

I'm like, you're training, like, a 40-billion-parameter model and they're like, hey, can you help me train this, like, 400-million-parameter guy? And I'm like, what? I'm so busy.

Nathan:

Okay. So it's primarily a scaling thing. I think is there, like were there any cultural things that you think, like, being early like, were you bought into RLHF to the same extent ahead of time? Like, what got you into RLHF? Like, what what motivated Carper to exist?

Nathan:

And was that kind of consistent?

Louis:

So I've always been very, very bullish on critiques and revisions in general. I wrote either the first or the second paper on it... I don't actually remember if the superalignment team at OpenAI wrote a paper before me. They may have, but I don't think so.

Louis:

I think ours came out, like, a month before it.

Nathan:

That always feels good. Yeah.

Louis:

I wrote one of the first papers on, like, critiques and revisions. Right? And I was very, very bullish on that. But initially, I was only bullish on it for evaluation. Right?

Louis:

And I had experimented with PPO a little bit back in 2021 for this kind of critique-and-revision stuff. And it was not ready whatsoever. And there was no infrastructure. And TRL was an abandoned library that was very buggy. It didn't work.

Louis:

No shade to Leandro. I love Leandro. But it was obvious it was a deprecated library. It happens. Yeah.

Louis:

And I think when we tried to do RLHF then, there was no traction whatsoever. So Alex Havrilla and I... I think he's working with Meta now. I don't remember.

Nathan:

Yeah.

Nathan:

He was an intern there at least. He just had an interesting paper on reasoning and math, which is a whole other conversation for RLHF.

Nathan:

So

Louis:

Yeah. So we started... we forked TRL and we just added DeepSpeed support. That's all we wanted to do initially. And then we were going to merge back into TRL, because we had no visions of, like, Carper or anything like that. And we realized that to make a framework that people would actually want to use, we had to do a full rewrite of TRL, and we had to build things in a way that made sense to an engineer who wanted to deploy RLHF, who wanted to experiment with RLHF at a company or in a lab.

Louis:

Because we were building this from the perspective of, well, we're on the Eleuther AI GPU cluster. How can we best use our infrastructure there?

Nathan:

Has anyone publicly said how many GPUs Eleuther AI has? This is, like, one of my great mysteries. Is it a held secret?

Louis:

I don't think it's a held secret. I don't remember, actually. They have some Stability GPUs and they have GPUs from elsewhere.

Nathan:

Like, they seem to get compute when they need it.

Louis:

Yeah. Yeah. Like, it's not an issue. And through Synth Labs, I've been supplying a bit of compute here and there as well. I gave them, like, a node of H100s for a little while for a paper that we were working on, the pink elephants paper.

Louis:

But, like, I don't think that, like they're not, like, super short of compute. They're a little short probably. Like, everyone's a little short of compute. Yeah. But I don't think they're super short of compute.

Louis:

But yeah. So we built it with the Eleuther cluster in mind. And because we built it with the Eleuther cluster in mind, we kind of said, well, we can turn this into a thing where we build the infrastructure that people can readily deploy on their clusters, and it'll just work for them. And, like, we can make Carper AI.

Louis:

So we made Carper AI, and shortly after, all the Stability stuff started happening. Carper joined Stability, and I worked there for a while. And last summer, I left to join back with Eleuther because, you know, I long for the days of being an engineer. I love waking up in the morning, writing code, eating a little bit, and then going to sleep.

Nathan:

Yeah. I mean, that's the difference. I I I spend the time writing because I like to. We've had plenty of discussions.

Nathan:

We're like,

Nathan:

oh, I should start a blog. And it's like, it comes down to doing what you like to do. And it's like, you're doing great as it is.

Louis:

Yeah. It's okay. Yeah.

Nathan:

Okay. I think that's kind of a good place to stop. Where should people find you? What do you want to boost,

Nathan:

when

Nathan:

you sign off here?

Louis:

So my Twitter is lcastricato. Or you can follow the Synth Labs Twitter. It is... let me actually... I don't remember what it is off the top of my head.

Nathan:

Do you have any goose announcements?

Louis:

No goose announcements at the moment, unfortunately. It's synth_labs on Twitter. It's that Twitter account. And then lcastricato is my personal Twitter account. And, you know, I'm always open to collaborators, especially now with Synth Labs.

Louis:

So we're we're always happy to chat with and talk to new people about, like, interesting research directions. And yeah. Just just reach out and we we can we can get something going, I guess.

Nathan:

Yeah. I love the URL; it's in the show notes. It's synthlabs.ai. I've found that, because synthetic data is so hot and so new, some of these URLs are just hard to find. We don't have to go into the whole rant about naming and stuff, but most of the people that search for my Substack... well, if you don't put the s, if you don't write "interconnects", you get a different Substack first.

Nathan:

So, okay, we're all in this together, for anyone founding a startup or a blog and struggling with naming. Please send us questions about RLHF. If you liked this, Louis could come back. I'm trying to start an in-person thing and get some gear.

Nathan:

So when I'm at a conference or whatever, we can bring researchers on and kind of remove some of the Zoom aspects that we're all stuck in so much of the time. So thanks, Louis, for putting some of the things we talk about a lot onto the semi-record. I think people will listen and... yeah, read. So...

Louis:

Of course.

Nathan:

This is good. I think a lot of researchers are gonna dig into this. There's so many different things that we talked about. It's very high high information density

Louis:

Yeah.

Nathan:

Chat here. But it was a good time.

Nathan:

Mhmm.

Louis:

Okay.
