Letter #228: Andrej Karpathy and Stephanie Zhan (2024)
OpenAI Founding Team, Tesla Sr. Director, & Eureka Labs Founder and Sequoia Partner | Making AI Accessible
Hi there! Welcome to A Letter a Day. If you want to know more about this newsletter, see “The Archive.” At a high level, you can expect to receive a memo/essay or speech/presentation transcript from an investor, founder, or entrepreneur (IFO) each edition. More here. If you find yourself interested in any of these IFOs and wanting to learn more, shoot me a DM or email and I’m happy to point you to more or similar resources.
If you like this piece, please consider tapping the ❤️ above or subscribing below! It helps me understand which types of letters you like best and helps me choose which ones to share in the future. Thank you!
Andrej Karpathy is the Founder of Eureka Labs, an AIxEducation company. He started his career in research, taking Geoff Hinton’s class as an undergraduate, working with Michael van de Panne during his MSc, and studying under Fei-Fei Li at Stanford to earn his PhD. At Stanford, he designed and was the primary instructor for the university’s first deep learning class, which became one of its largest, and also completed three internships (at Google Brain, Google Research, and DeepMind). After his PhD, Andrej joined OpenAI as a founding member and research scientist. He left to join Tesla as the Sr. Director of AI, where he led the computer vision team of Tesla Autopilot. After half a decade at Tesla, he returned to OpenAI, before leaving once again to found Eureka Labs.
Stephanie Zhan is a Partner at Sequoia Capital. Prior to joining Sequoia, Stephanie was on the investment team at Andreessen Horowitz, a product manager at Nest, and an associate product marketing manager at Google.
Today’s conversation is the transcript of a talk between Andrej and Stephanie at Base Camp, Sequoia’s annual founder retreat. In this conversation, Andrej tells Stephanie about the early days of OpenAI, his view of the future over the next N years, how people should be living their lives differently if that view of the future is true, where opportunities exist for other players to build new independent companies vs. where OpenAI will continue to dominate as its ambition grows, and how he sees the future of the LLM ecosystem playing out. He then gets a little more granular, starting with his thoughts on scale, some of the ingredients that matter besides scale, and which problems are meaty enough but also solvable. Andrej then steps back and shares insight into how Elon runs his companies and what gives him the most meaning as he thinks about the next chapter of his life. He then answers questions on whether he would recommend founders follow Elon’s management methods; whether there are any types of model composability he’s really excited about; whether there’s a path towards building a physicist-type model that has a self-consistent mental model of physics and can generate new ideas for how to actually do fusion, or if that’s a fundamentally different vector from AI model development; how he would prioritize cost reduction and revenue generation versus finding better-quality models with better reasoning capabilities; whether open source will continue to keep pace with closed-source development as models continue to improve and scale; what would make the AI ecosystem more vibrant; whether it is sufficient to modify the transformer architecture to get the next big performance leap or if scientists need to come up with some new fundamental building block to take that next big step forward towards AGI; and a parting message of advice for people as they dedicate their lives to helping shape the future of AGI.
I hope you enjoy this conversation as much as I did!
[Transcript and any errors are mine.]
Related Resources
EdTech Founders
OpenAI
Operators
Sam Altman Compilation (1,008 pages)
Investors
Tesla
Sequoia Capital
Letters
Compilations
Michael Moritz Compilation (642 pages)
Don Valentine Compilation (347 pages)
a16z
Transcript
Host: I'm thrilled to introduce our next and final speaker, Andrej Karpathy. Andrej Karpathy probably needs no introduction. Most of us have probably watched his YouTube videos at length. But he's renowned for his research in deep learning. He designed the first deep learning class at Stanford, was part of the founding team at OpenAI, led the computer vision team at Tesla, and is now a mystery man again now that he has just left OpenAI. So we're very lucky to have you here. And Andrej, you've been such a dream speaker, and so we're excited to have you and Stephanie close out the day. Thank you.
Stephanie Zhan: Andrej's first reaction as we walked up here was, Oh my god, to his picture.
Andrej Karpathy: It's a very intimidating photo.
Stephanie Zhan: I don't know what year it was taken, but he's impressed. Okay, amazing. Andrej, thank you so much for joining us today, and welcome back.
Andrej Karpathy: Yeah, thank you.
Stephanie Zhan: Fun fact that most people don't actually know. How many folks here know where OpenAI's original office was? It's amazing. Nick.
Andrej Karpathy: I'm gonna guess right here.
Stephanie Zhan: Right here. Right here on the opposite side of our San Francisco office, where actually many of you guys were just in huddles. So this is fun for us because it brings us back to our roots, back when I first started at Sequoia, and when Andrej was first cofounding OpenAI. Andrej, in addition to living out the Willy Wonka dream of working atop a chocolate factory, what were some of your favorite moments working from here?
Andrej Karpathy: Yes, so OpenAI was right there. And this was the first office after I guess Greg's apartment, which maybe doesn't count. And so yeah, we spent maybe two years here. And The Chocolate Factory was just downstairs, so it always smelled really nice. And yeah, I guess the team was 10, 20+. And yeah, we had a few very fun episodes here. One of them was alluded to by Jensen at GTC, which happened just yesterday or two days ago. So Jensen was describing how he brought the first DGX, and how he delivered it to OpenAI. So that happened right there. So that's where we all signed it. It's in the room over there.
Stephanie Zhan: So Andrej needs no introduction, but I wanted to give a little bit of backstory on some of his journey to date. As Sonia had introduced, he was trained by Geoff Hinton and then Fei-Fei. His first claim to fame was his deep learning course at Stanford. He cofounded OpenAI back in 2015, and in 2017, he was poached by Elon. I remember this very, very clearly. For folks who don't remember the context then, Elon had just transitioned through six different autopilot leaders, each of whom lasted six months. And I remember when Andrej took this job, I thought, Congratulations, and good luck. Not too long after that, he went back to OpenAI and has been there for the last year. Now, unlike all the rest of us today, he is basking in the ultimate glory of freedom in all time and responsibility. And so we're really excited to see what you have to share today. A few things that I appreciate the most from Andrej are that he is an incredible, fascinating futurist thinker. He is a relentless optimist. And he's a very practical builder. And so I think he'll share some of his insights around that today. To kick things off, AGI, even seven years ago, seemed like an incredibly impossible task to achieve, even in the span of our lifetimes. Now it seems within sight. What is your view of the future over the next N years?
Andrej Karpathy: Yes, so I think you're right. I think a few years ago, I sort of felt like AGI was--it wasn't clear how it was going to happen. It was very sort of academic, and you would like think about different approaches. And now I think it's very clear, and there's like a lot of space, and everyone is trying to fill it. And so there's a lot of optimization. And I think, roughly speaking, the way things are happening is, everyone is trying to build what I refer to as kind of like this LLM OS. And basically, I like to think of it as an operating system. You have a bunch of, like, basically peripherals that you plug into this new CPU or something like that. The peripherals are, of course, like text, images, audio, and all the modalities. And then you have a CPU, which is the LLM transformer itself. And then it's also connected to all the software 1.0 infrastructure that we've already built up for ourselves. And so I think everyone is kind of trying to build something like that, and then make it available as something that's customizable to all the different nooks and crannies of the economy. And so I think that's kind of roughly what everyone is trying to build out and what we sort of also heard about earlier today. So I think that's roughly where it's headed: we can bring up and down these relatively self-contained agents that we can give high level tasks to and specialize in various ways. So I think it's gonna be very interesting and exciting. And it's not just one agent, it's many agents, and what does that look like?
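[A minimal sketch of the "LLM OS" loop Andrej describes: the LLM as the CPU, with modality "peripherals" and software 1.0 tools plugged in, and an outer scheduler handing it high-level tasks. All function names here (call_llm, the toy tools) are hypothetical placeholders for illustration, not any particular vendor's API.]

```python
import json

def calculator(expression: str) -> str:
    """A software 1.0 'peripheral': exact arithmetic the LLM can call out to."""
    return str(eval(expression, {"__builtins__": {}}))  # toy sandbox, illustration only

def web_search(query: str) -> str:
    """Placeholder peripheral; a real system would hit a search index here."""
    return f"<top results for {query!r}>"

TOOLS = {"calculator": calculator, "web_search": web_search}

def call_llm(messages: list) -> dict:
    """Hypothetical LLM call: a real system would query a model API and parse
    its reply into either a final answer or a tool request."""
    raise NotImplementedError

def run_agent(task: str, max_steps: int = 10) -> str:
    """The 'OS' scheduler: loop until the LLM 'CPU' produces a final answer."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_llm(messages)                     # the LLM decides what to do next
        if reply["type"] == "final":
            return reply["content"]
        result = TOOLS[reply["tool"]](**json.loads(reply["arguments"]))  # dispatch to a peripheral
        messages.append({"role": "tool", "content": result})            # feed the result back in
    return "step budget exhausted"
```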
Stephanie Zhan: And if that view of the future is true, how should we all be living our lives differently?
Andrej Karpathy: I don't know. I guess we have to try to build it, influence it, make sure it's good. And yeah, just try to make sure it turns out well.
Stephanie Zhan: So now that you're a free independent agent, I want to address the elephant in the room, which is that OpenAI is dominating the ecosystem. And most of our audience here today are founders who are trying to carve out a little niche, praying that OpenAI doesn't take them out overnight. Where do you think opportunities exist for other players to build new independent companies versus what areas do you think OpenAI will continue to dominate even as its ambition grows?
Andrej Karpathy: So my high level impression is basically OpenAI is trying to build out this LLM OS. And I think, as we heard earlier today, it's trying to develop this platform on top of which you can position different companies in different verticals. Now, I think the OS analogy is also really interesting because when you look at something like Windows or something like that, these are also operating systems, they come with a few default apps. Like a browser comes with Windows, you can use the Edge browser. And so I think in the same way, OpenAI or any of the other companies might come up with a few "default apps," but that doesn't mean that you can't have different browsers that are running on it, just like you can have different chat agents sort of running on that infrastructure. And so there'll be a few default apps, but there will also be, potentially, a vibrant ecosystem of all kinds of apps that are finetuned to all the different nooks and crannies of the economy. And I really liked the analogy of like the early iPhone apps and what they looked like, and they're all kind of like jokes, and it took time for that to develop. And I think absolutely I agree that we're going through the same thing right now. People are trying to figure out, What is this thing good at? What is it not good at? How do I work it? How do I program with it? How do I debug it? How do I just actually get it to perform real tasks? And what kind of oversight--because it's quite autonomous, but not fully autonomous--so what does the oversight look like? What does the evaluation look like? So there's many things to think through. And just to understand sort of like the psychology of it. And I think that's what's going to take some time to figure out--exactly how to work with this infrastructure. So I think we will see that over the next few years.
Stephanie Zhan: So the race is on right now with LLMs. OpenAI, Anthropic, Mistral, Llama, Gemini. The whole ecosystem of open source models. Now a whole long tail of small models. How do you foresee the future of the ecosystem playing out?
Andrej Karpathy: So again, I think the operating systems analogy is interesting. Because we have say, like, we have basically an oligopoly of a few proprietary systems, like say, Windows, Mac OS, etc. And then we also have Linux. And Linux has an infinity of distributions. And so I think maybe it's going to look something like that. I also think we have to be careful with the naming because a lot of the ones that you listed, like Llama and Mistral, I wouldn't actually say they're open source. And so it's kind of like tossing over a binary for an operating system. Like you can kind of work with it and it's like useful, but it's not fully useful. And there are a number of what I would say are, like, fully open source LLMs. So there's the Pythia models, LLM360, OLMo, etc. And they're fully releasing the entire infrastructure that's required to compile the operating system, to train the model from the data, to gather the data, etc. And so when you're just given the binary, it's still much better than nothing, of course, because you can finetune the model, which is useful. But also, I think it's subtle, but you can't fully finetune the model, because the more you finetune the model, the more it's going to start regressing on everything else. And so what you actually really want to do--for example, if you want to add capability, and not regress the other capabilities, you may want to train on some kind of a mixture of the previous data set distribution and the new data set distribution. Because you don't want to regress the old distribution while you're adding the new knowledge. And if you're just given the weights, you can't do that, actually. You need the training loop, you need the dataset, etc. So you are actually constrained in how you can work with these models, and again, like I think it is definitely helpful, but I think we need like slightly better language for it, almost. So there's open weights models, open source models, and then proprietary models, I guess. And that might be the ecosystem. And, yeah, probably it's gonna look very similar to the ones that we have today.
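[A minimal sketch of the data-mixture point above: to add a capability without regressing everything else, finetuning batches are sampled mostly from the original data distribution with the new task mixed in. The 30/70 split and the loader/train-step helpers are illustrative assumptions; the catch Andrej is pointing at is that you need the original dataset and training loop, which weights-only releases don't give you.]

```python
import random

def mixed_batches(new_task_data, original_data, new_fraction=0.3,
                  num_batches=1000, batch_size=32):
    """Yield finetuning batches drawn mostly from the old distribution,
    with a slice of the new task mixed in, so the model gains the new
    capability without regressing on what it already knew.
    The 30/70 split is an arbitrary illustrative choice."""
    for _ in range(num_batches):
        batch = [
            random.choice(new_task_data if random.random() < new_fraction
                          else original_data)
            for _ in range(batch_size)
        ]
        yield batch

# Hypothetical usage:
# old = load_pretraining_sample()    # requires access to the original data...
# new = load_new_capability_data()   # ...which weights-only releases don't include
# for batch in mixed_batches(new, old):
#     train_step(model, batch)       # hypothetical finetuning step
```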
Stephanie Zhan: And hopefully you'll continue to help build some of that out. So I'd love to address the other elephant in the room, which is scale. Simplistically, it seems like scale is all that matters. Scale of data, scale of compute. And therefore the large research labs, large tech giants have an immense advantage today. What is your view of that, and is that all that matters? And if not, what else does?
Andrej Karpathy: So I would say scale is definitely number one. I do think there are details there to get right. And I think a lot also goes into the dataset preparation and so on, making it very good and clean, etc. That matters a lot. These are all sort of like compute efficiency gains that you can get. So there's the data, the algorithms, and then of course, the training of the model and making it really large. So I think scale will be the primary determining factor. It's like the first principal component of things for sure. But there are many of the other things that you need to get right. So it's almost like the scale sets some kind of a speed limit, almost, but you do need some of the other things. But it's like, if you don't have the scale, then you fundamentally just can't train some of these massive models, if you are gonna be training models. If you're just gonna be doing finetuning and so on, then I think maybe less scale is necessary. But we haven't really seen that fully play out just yet.
Stephanie Zhan: And can you share more about some of the ingredients that you think also matter, maybe lower in priority behind scale?
Andrej Karpathy: Yeah, so the first thing I think is like, you can't just train these models. If you were just given the money and the scale, it's actually still really hard to build these models. And part of that is that the infrastructure is still so new. And it's still being developed and not quite there. But training these models at scale is extremely difficult, and is a very complicated distributed optimization problem. And there's actually like--the talent for this is fairly scarce right now. And it just basically turns into this insane thing running on 10s of 1000s of GPUs, all of them are like failing at random at different points in time, and so like instrumenting that and getting it to work is actually an extremely difficult challenge. GPUs were not like intended for like 10,000 GPU workloads until very recently. And so I think a lot of the infrastructure is sort of like creaking under that pressure. And we need to like work through that. But right now, if you just give someone a ton of money, or a ton of scale or GPUs, it's not obvious to me that they can just produce one of these models, which is why it's not just about scale. You actually need a ton of expertise, both on the infrastructure side, the algorithms side, and then the data side, and being careful with that. So I think those are the major components.
Stephanie Zhan: The ecosystem is moving so quickly. Even some of the challenges we thought existed a year ago are being solved more and more today. Hallucinations, context windows, multimodal capabilities, inference getting better, faster, cheaper. What are the LLM research challenges today that keep you up at night? What do you think are meaty enough problems, but also solvable problems that we can continue to go after?
Andrej Karpathy: So I would say on the algorithm side, one thing I'm thinking about quite a bit is this like distinct split between diffusion models and autoregressive models. They're both ways of representing probability distributions. And it just turns out that different modalities are apparently a good fit for one of the two. I think that there's probably some space to unify them or to like, connect them in some way. And also, get some best of both worlds or sort of figure out how we can get a hybrid architecture and so on. So it's just odd to me that we have sort of like two separate split points in the space of models, and they're both extremely good, and it just feels wrong to me that there's nothing in between. So I think we'll see that sort of carved out, and I think there are interesting problems there. And then the other thing that maybe I would point to is, there's still like a massive gap in just the energetic efficiency of running all this stuff. So my brain runs on roughly 20 watts. Jensen was just talking at GTC about the massive supercomputers that they're going to be building out; the numbers are in megawatts. And so maybe you don't need all of that to run like a brain. I don't know how much you need exactly, but I think it's safe to say we're probably off by a factor of 1000 to like a million, somewhere there, in terms of like the efficiency of running these models. And I think part of it is just because the computers we've designed, of course, are just like not a good fit for this workload. And I think NVIDIA GPUs are like a good step in that direction--in terms of like the--you need extremely high parallelism--we don't actually care about sequential computation that is sort of like data dependent in some way. We just need to like blast the same algorithm across many different sort of array elements, or something. You can think about it that way. So I would say, number one is just adapting the compute architecture to the new data workflows. Number two is like pushing on a few things that we're currently seeing improvements on. So number one maybe is precision. We're seeing precision come down from what was originally like 64-bit... we're now down to I don't know what it is, four, or five, six, or even 1.58, depending on which papers you read. And so I think precision is one big lever of getting a handle on this. And then second one, of course, is sparsity. So that's also like another big delta, I would say. Like your brain is not always fully activated. And so sparsity, I think, is another big lever. But then the last lever, I also feel like just the Von Neumann architecture of computers and how they're built, where you're shuttling data in and out and doing a ton of data movement between memory and the cores that are doing all the compute, this is all broken as well, kind of, and it's not how your brain works, and that's why it's so inefficient. And so I think it should be a very exciting time in computer architecture. I'm not a computer architect, but it seems like we're off by a factor of 1000 to a million, something like that. And there should be really exciting sort of innovations there that bring that down.
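[A tiny numerical illustration of the precision lever he mentions: quantizing float32 weights down to int8 and back. The sizes and error printed here are just to show the trade-off; this is not a description of how any particular model is actually quantized.]

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric int8 quantization: store the weights in 8 bits plus one scale.
    Memory drops roughly 4x versus float32; reconstruction error is the price."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4096).astype(np.float32)   # stand-in for one layer's weights
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("float32 bytes:", w.nbytes, "| int8 bytes:", q.nbytes)
print("max abs reconstruction error:", float(np.abs(w - w_hat).max()))
```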
Stephanie Zhan: I think there are at least a few builders in the audience working on this problem. Okay, switching gears a little bit, you've worked alongside many of the greats of our generation. Sam, Greg from OpenAI and the rest of the OpenAI team. Elon Musk. Who here knows the joke about the rowing team, the American team versus the Japanese team? Okay, great, so this will be a good one. Elon shared this at our last base camp. And I think it reflects a lot of his philosophy around how he builds cultures and teams. So you have two teams. The Japanese team has four rowers and one steerer, and the American team has four steerers and one rower. And can anyone guess when the American team loses, what do they do? Shout it out. Exactly. They fire the rower. And Elon shared this example, I think, as a reflection of how he thinks about hiring the right people, building the right people, building the right teams at the right ratio. From working so closely with folks like these incredible leaders, what have you learned?
Andrej Karpathy: Yeah, so I would say definitely Elon runs his companies in an extremely unique style. I don't actually think that people appreciate how unique it is. You can sort of read about it, but you still don't really understand it, I think. It's like, even hard to describe. I don't even know where to start. But it's like a very unique, different thing. Like, I like to say that he runs the biggest startups. And I think... it's just--I don't even know, basically, like how to describe it. It almost feels like it's a longer sort of thing that I have to think through. But, well, number one is like, so he likes very small, strong, highly technical teams. So that's number one. So I would say at companies by default, they sort of like--the teams grow and they get large. Elon was always like a force against growth. I would have to work and expend effort to hire people, I would have to like, basically plead to hire people. And then the other thing is that big companies, usually you want--it's really hard to get rid of low performers. And I think Elon is very friendly to, by default, getting rid of low performers. So I actually had to fight for people to keep them on the team. Because he would, by default, want to remove people. And so that's one thing. So keep a small, strong, highly technical team. No middle management that is kind of like non technical, for sure. So that's number one. Number two is kind of like the vibes of how everything runs, and how it feels when he sort of like walks into the office. He wants it to be a vibrant place. People are walking around, they're pacing around, they're working on exciting stuff, they're charting something, they're coding. He doesn't like stagnation, he doesn't like for it to look that way. He doesn't like large meetings. He always encourages people to like leave meetings if they're not being useful. So actually you do see this--it's a large meeting, and if you're not contributing, and you're not learning, just walk out. And this is like, fully encouraged. And I think this is something that you don't normally see. So I think like vibes is like a second big lever that I think he really instills culturally. Maybe part of that also is like--I think a lot of--there are companies, they like pamper employees. I think like there's much less of that. The culture of it is you're there to do your best technical work, and there's the intensity, and so on. And I think maybe the last one that is very unique and very interesting and very strange is just how connected he is to the team. So usually, a CEO of a company is like a remote person five layers up who talks to their VPs who talk to their reports and directors and eventually talk to your manager. It's not how he runs companies. Like, he will come to the office, he will talk to the engineers. Many of the meetings that we had were like, okay, 50 people in the room with Elon, and he talks directly to the engineers. He doesn't want to talk just to the VPs and the directors. Normally, people would spend like 99% of the time maybe talking to the VPs; he spends maybe 50% of the time. And he just wants to talk to the engineers. So if the team is small and strong, then engineers and the code are the source of truth. And so they have the source of truth, not some manager. And he wants to talk to them to understand the actual state of things and what should be done to improve it. So I would say like the degree to which he's connected with the team and not something remote is also unique.
And also just like his large hammer and his willingness to exercise it within the organization. So maybe he talks to the engineers and they bring up what's blocking them: Okay, I just don't have GPUs to run my thing. And he's like, okay. And if he hears that twice, he's gonna be like, Okay, this is a problem. So like, what is our timeline? And when you don't have satisfying answers, he's like, Okay, I want to talk to the person in charge of the GPU cluster. And like someone dials the phone, and he's just like, Okay, double the cluster, right now. Like, let's have a meeting tomorrow. From now on, send me daily updates until the cluster is twice the size. And then they kind of like push back and they're like, Okay, well, we have this procurement setup. We have this timeline. Nvidia says that we don't have enough GPUs, and it will take six months or something. And then you get a raise of an eyebrow. And then he's like, Okay, I'm gonna talk to Jensen. And then he just kinda like removes bottlenecks. So I think the extent to which he's extremely involved and removes bottlenecks and applies his hammer, I think is also like, not appreciated. So I think there's like a lot of these kinds of aspects that are very unique, I would say, and very interesting. And honestly, like, going to a normal company outside of that is--you definitely miss aspects of that. And so I think--yeah, that's maybe... maybe that's a long rant. But that's just kind of like, I don't think I hit all the points, but it is a very unique thing, and it's very interesting. Yeah, I guess that's my rant.
Stephanie Zhan: Hopefully tactics that most people here can employ. Taking a step back, you've helped build some of the most generational companies, and you've also been such a key enabler for many people, many of whom are in the audience today, of getting into the field of AI. Knowing you, what you care most about is democratizing access to AI: education, tools, helping create more equality in the whole ecosystem at large, so that there are many more winners. As you think about the next chapter in your life, what gives you the most meaning?
Andrej Karpathy: Oh, yeah, I think you've described it in the right way. Like, where my brain goes by default is, like, I've worked for a few companies, but I think, ultimately, I care not about any one specific company, I care a lot more about the ecosystem. I want the ecosystem to be healthy. I want it to be thriving. I want it to be like a coral reef of a lot of cool, exciting startups in all the nooks and crannies of the economy. And I want the whole thing to be like this boiling soup of cool stuff. And--,
Stephanie Zhan: Genuinely, Andrej dreams about coral reefs.
Andrej Karpathy: I want it to be like a cool place. And I think... that's why I love startups and I love companies, and I want there to be a vibrant ecosystem of them. And by default, I would say I'm a bit more hesitant about kind of, like, five mega corps kind of like taking over. Especially with AGI being such a magnifier of power. I would be kind of--I'm kind of worried about what that could look like, and so on. So I'll have to think that through more, but yeah. I love the ecosystem, and I want it to be healthy and vibrant.
Stephanie Zhan: Amazing. We'd love to have some questions from the audience. Yes, Brian.
Audience Member: Hi, Brian Halligan. Would you recommend founders follow Elon's management methods, or is it kind of unique to him and you shouldn't try to copy him?
Andrej Karpathy: Um... yeah, I think it's a good question. I think it's up to the DNA of the founder. Like you have to have that same kind of a DNA and that same kind of vibe. And I think when you're hiring the team, it's really important that you're making it clear upfront that this is the kind of company that you have. And when people sign up for it, they're very happy to go along with it, actually. But if you change it later, I think people are unhappy with that, and that's very messy. So as long as you do it from the start and you're consistent, I think you can run a company like that. But it has its own pros and cons as well. And I think... so, up to people, but I think it's a consistent model of company building and running.
Stephanie Zhan: Yes, Alex.
Audience Member: Hi. I'm curious if there are any types of model composability that you're really excited about? Maybe other than mixture of experts. I'm not sure what you think about, like, model merges, Frankenmerges, or any other things to make model development more composable?
Andrej Karpathy: Yeah, it's a good question. I see like papers in this area, but I don't know that anything has like really stuck. Maybe the composability--I don't exactly know what you mean, but there's a ton of work on like parameter efficient training and things like that. I don't know if you would put that in the category of composability in the way I understand it, but it's certainly the case that like traditional code is very composable. And I would say neural nets are a lot more fully connected and less composable by default. But they do compose and can finetune as a part of a whole. So as an example, if you're building, like, a system where you want to have ChatGPT plus images or something like that, it's very common that you pretrain components, and then you plug them in and finetune, maybe through the whole thing, as an example. So there's composability in those aspects, where we can pretrain small pieces of the cortex outside and compose them later, through initialization and finetuning. So I think to some extent, it's... so maybe those are my scattered thoughts on it, but I don't know if I have anything very coherent otherwise.
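[A minimal PyTorch-style sketch of the "pretrain components, plug them in, finetune through the whole thing" pattern Andrej describes. The tiny encoder and language model here are stand-ins for real pretrained pieces, and freezing the vision encoder is just one illustrative choice, not a prescription.]

```python
import torch
import torch.nn as nn

class TinyVisionEncoder(nn.Module):
    """Stand-in for a pretrained image encoder; weights would normally be
    loaded from a separate pretraining run."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, dim))

    def forward(self, images):
        return self.net(images)

class TinyLanguageModel(nn.Module):
    """Stand-in for a pretrained language model that consumes embeddings."""
    def __init__(self, dim=256, vocab=1000):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, embeddings):
        return self.lm_head(self.backbone(embeddings))

class ComposedModel(nn.Module):
    """Compose the two pretrained pieces with a small new 'glue' projector,
    then finetune through the whole thing (or freeze parts of it)."""
    def __init__(self):
        super().__init__()
        self.vision = TinyVisionEncoder()
        self.projector = nn.Linear(256, 256)   # the newly initialized glue
        self.llm = TinyLanguageModel()

    def forward(self, images, text_embeddings):
        img_token = self.projector(self.vision(images)).unsqueeze(1)
        return self.llm(torch.cat([img_token, text_embeddings], dim=1))

model = ComposedModel()
for p in model.vision.parameters():   # one option: freeze the pretrained piece
    p.requires_grad = False           # and train only the glue plus the LLM
```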
Stephanie Zhan: Yes, Nick.
Audience Member: So we've got these next word prediction things. Do you think there's a path towards building a physicist or a Von Neumann type model that has a mental model of physics that's self consistent and can generate new ideas for how do you actually do fusion? How do you get faster than light travel, if it's even possible? Is there any path towards that, or is it like a fundamentally different vector in terms of these AI model developments?
Andrej Karpathy: I think it's fundamentally different in some--in one aspect. I guess like, what you're talking about maybe is just a capability question, because the current models are just like not good enough. And I think there are big rocks to be turned here. And I think people still haven't like really seen what's possible in this space. Like at all. And roughly speaking, I think we've done step one of AlphaGo--we've done the imitation learning part. There's step two of AlphaGo, which is the RL. And people haven't done that yet. And I think it's going to fundamentally--like this is the part that actually made it work and made something superhuman. And so I think this is--I think there's like big rocks in capability to still be turned over here. And the details of that are kind of tricky, potentially. But I think--we just haven't done step two of AlphaGo, long story short. And we've just done imitation. And I don't think that people appreciate like, for example, number one, like how terrible the data collection is for things like ChatGPT. Like, say you have a problem, like some prompt is some kind of mathematical problem. A human comes in and gives the ideal solution to that problem. The problem is that the human psychology is different from the model psychology. What's easy or hard for the human is different from what's easy or hard for the model. And so a human kind of fills out some kind of a trace that comes to the solution, but like some parts of that are trivial to the model and some parts of that are a massive leap that the model doesn't understand. And so you're kind of just like losing it, and then everything else is polluted by that later. And so like, fundamentally, what you need is, the model needs to practice for itself how to solve these problems. It needs to figure out what works for it or does not work for it. Maybe it's not very good at four digit addition so it's gonna fall back and use a calculator. But it needs to learn that for itself based on its own capability and its own knowledge. So that's number one, is like, that's totally broken, I think. It's a good initializer, though, for something agent-like. And then the other thing is like, we're doing reinforcement learning from human feedback, but that's like a super weak form of reinforcement learning. Doesn't even count as reinforcement learning, I think. Like, what is the equivalent in AlphaGo for RLHF? It's like, what is the reward model? It's a--what I call it is a vibe check. Like imagine--like, if you wanted to train an AlphaGo with RLHF, you would be giving two people two boards and asking, which one do you prefer? And then you would take those labels and you would train the model, and then you would RL against that. Well, what are the issues with that? It's like number one, that's--it's just vibes of the board. That's what you're training against. Number two, if it's a reward model that's a neural net, then it's very easy to overfit to that reward model for the model you're optimizing over. It's going to find all these spurious ways of hacking that massive model. This is a problem. So, AlphaGo gets around these problems because it has a very clear objective function you can RL against. So RLHF is like nowhere near, I would say, RL. This is like silly. And the other thing is imitation learning is super silly, RLHF is a nice improvement, but it's still silly.
And I think people need to look for better ways of training these models so that it's in the loop with itself and its own psychology. And I think there will probably be unlocks in that direction.
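[A minimal sketch of the "vibe check" reward model being described: a network trained on pairwise human preferences with a Bradley-Terry-style loss, which the policy is then optimized against, and which can be overfit or gamed precisely because it is just another neural net. The feature shapes and the tiny scorer here are illustrative assumptions, not any lab's actual setup.]

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Stand-in reward model: maps a response representation to a scalar
    'how much would a human prefer this' score, i.e. the vibe check."""
    def __init__(self, dim=128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, features):
        return self.score(features).squeeze(-1)

def preference_loss(rm, chosen_feats, rejected_feats):
    """Bradley-Terry-style pairwise loss: push the score of the human-preferred
    response above the rejected one. Preferences are all the reward model ever
    sees, which is why a strong policy can find spurious ways to game it."""
    return -F.logsigmoid(rm(chosen_feats) - rm(rejected_feats)).mean()

rm = RewardModel()
chosen = torch.randn(8, 128)    # placeholder features for preferred answers
rejected = torch.randn(8, 128)  # placeholder features for dispreferred answers
loss = preference_loss(rm, chosen, rejected)
loss.backward()
```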
Audience Member: So it's sort of like graduate school for AI models, to sit in a room with a book and quietly question itself for a decade.
Andrej Karpathy: Yeah. I think that would be part of it, yes. And I think like when you are learning stuff and you're going through textbooks, like there's exercises in the textbook. What are those? Those are prompts to you to exercise the material. And when you're learning material, you're not just reading left to right. Like, number one, you're exercising, but maybe you're taking notes, you're rephrasing, reframing--you're doing a lot of manipulation of this knowledge in the process of learning that knowledge. And we haven't seen equivalents of that at all in LLMs. So it's like, super early days, I think.
Stephanie Zhan: Yes, [Yusi].
Audience Member: Yeah, it's cool to be optimal and practical at the same time, so I want to ask, like, how would you align the priority of, like, a) doing cost reduction and revenue generation, or b) finding better quality models with better reasoning capabilities? How would you align that?
Andrej Karpathy: So maybe I understand the question. I think what I see a lot of people do is they start out with the most capable model that--doesn't matter what the cost is, so you use GPT-4, you use super prompting, etc., or you do RAG, etc. So you're just trying to get your thing to work. So you're going after sort of accuracy first, and then you make concessions later. You check if you can fall back to 3.5 for certain types of queries, and you sort of make it cheaper later. So I would say go after performance first, and then you make it cheaper later. It's kind of like the paradigm that I've seen a few people that I've talked to about this say works for them. And maybe it's not even just a single prompt. Think about what are the ways in which you can even just make it work at all? Because if you just can make it work at all, like say you make 10 prompts, or 20 prompts, and you pick the best one and you have some debate, or I don't know what kind of a crazy flow you can come up with, like, just get your thing to work really well. Because if you have a thing that works really well, then one of the things you can do is you can distill that. So you can get a large distribution of possible problem types, you run your super expensive thing on it to get your labels, and then you get a smaller, cheaper thing that you finetune on it. And so I would say, I would always go after sort of getting it to work as well as possible no matter what first, and then make it cheaper, is the thing I would suggest.
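[A minimal sketch of the "get it working expensively, then distill" workflow he describes: label a wide distribution of problems with the expensive, accuracy-first pipeline, then finetune a smaller, cheaper model on those labels. Every function name here is a hypothetical placeholder, not a real API.]

```python
def expensive_pipeline(problem: str) -> str:
    """Placeholder for the accuracy-first setup: the strongest model, heavy
    prompting, RAG, multiple samples plus a rerank/debate step, whatever it
    takes to get the answer right regardless of cost."""
    raise NotImplementedError

def distill(problems, small_model, finetune_step):
    """Run the expensive pipeline over a broad set of problems to produce
    labels, then finetune the small model on the (problem, answer) pairs."""
    labeled = [(p, expensive_pipeline(p)) for p in problems]
    for prompt, target in labeled:
        finetune_step(small_model, prompt, target)  # hypothetical finetuning step
    return small_model
```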
Stephanie Zhan: Hi, Sam.
Audience Member: Hi. One question. So this past year we saw a lot of kind of impressive results from the open source ecosystem. I'm curious what your opinion is of how that will continue to keep pace or not keep pace with closed source development, as the models continue to improve and scale.
Andrej Karpathy: Yeah, I think that's a very good question. I don't really know. Fundamentally, these models are so capital intensive. Like one thing that is really interesting is, for example, you have Facebook and Meta and so on who can afford to train these models at scale, but then it's also not the thing that they do, and it's not involved--like, their money printer is unrelated to that. And so they have an actual incentive to potentially release some of these models so that they empower the ecosystem as a whole so they can actually borrow all the best ideas. So that to me makes sense. But so far, I would say they've only just done the open weights model. And so I think they should actually go further. And that's what I would hope to see. And I think it would be better for everyone. And I think potentially maybe they're squeamish about some of the aspects of it eventually, with respect to data and so on--I don't know how to overcome that. Maybe they should like try to just find data sources that they think are very easy to use, or something like that, and try to constrain themselves to those. So I would say like, those are kind of our champions, potentially. And that's--I would like to see more transparency also coming from--and I think Meta and Facebook are doing pretty well, like they've released papers, they published a logbook, and so on. So they're doing--I think they're doing well, but they could do much better, in terms of fostering the ecosystem. And I think maybe that's coming. We'll see.
Stephanie Zhan: Peter.
Audience Member: Yeah, maybe this is like an obvious answer, given the previous question, but what do you think would make the AI ecosystem cooler and more vibrant, or what's holding it back? Is it openness, or do you think there's other stuff that is also like a big thing that you'd want to work on?
Andrej Karpathy: Yeah, I certainly think like one big aspect is just like the stuff that's available. I had a tweet recently about like, number one, build the thing, number two, build the ramp. I would say there's a lot of people building the thing. I would say there's a lot less happening of like building the ramps so that people can actually understand all this stuff. And I think we're all new to all of this, we're all trying to understand how it works. We all need to ramp up and collaborate to some extent to even figure out how to use this effectively. So I would love for people to be a lot more open with respect to what they've learned, how they trained all this, how--what works, what doesn't work for them, etc. And yes, just for us to learn a lot more from each other. That's number one. And then number two, I also think there's quite a bit of momentum in the open ecosystems as well, so I think that's already good to see. And maybe there's some opportunities for improvement I talked about already. So, yeah.
Stephanie Zhan: Last question from the audience. Michael.
Audience Member: To get to like the next big performance leap from models, do you think that it's sufficient to modify the transformer architecture with, say, thought tokens or activation beacons? Or do we need to throw that out entirely and come up with a new fundamental building block to take us to the next big step forward, or AGI?
Andrej Karpathy: Yeah, I think that's a good question. Well, the first thing I would say is like, Transformer is amazing. It was just like, so incredible. I don't think I would have seen that coming, for sure. Like for a while before the transformer arrived, I thought there would be insane diversification in neural networks. And that was not the case. It is the complete opposite, actually. It is complete--it's like all the same model, actually. So it's incredible to me that we have that. I don't know that it's like the final neural network--I think there will definitely be--I would say it's really hard to say that--given the history of the field, and I've been in it for a while, it's really hard to say that this is like the end of it. Absolutely it's not. And I think I feel very optimistic that someone will be able to find a pretty big change to how we do things today. I would say on the front of autoregressive versus diffusion, which is kind of like the modeling and the loss setup, I would say there's definitely some fruit there, probably. But also on the transformer. And like I mentioned, these levers of precision and sparsity, and as we drive that, and together with the co-design of the hardware and how that might evolve, and just making network architectures that are a lot more sort of well tuned to those constraints and how all that works. And to some extent, also, I would say like, transformer is kind of designed for the GPU, by the way. Like that was the big leap, I would say, in the transformer paper. And that's where they were coming from, is, we want an architecture that is fundamentally extremely parallelizable. Because the recurrent neural network has sequential dependencies, which is terrible for the GPU. Transformer basically broke that through the attention. And this was like the major sort of insight there. And it has some predecessors of insights like the neural GPU and other papers at Google that were sort of thinking about this. But that is a way of targeting the algorithm to the hardware that you have available. So I would say that's kind of like in that same spirit. But long story short, like, I think it's very likely we'll see changes to it, still. But it's been proven, like, remarkably resilient. I have to say, like, it came out many years ago now, like, I don't know, something like six... Yeah. So, like, the original transformer and what we're using today are like, not super different. Yeah.
Stephanie Zhan: As a parting message to all the founders and builders in the audience, what advice would you give them as they dedicate the rest of their lives to helping shape the future of AI?
Andrej Karpathy: So I don't have super--I don't usually have crazy generic advice. I think like maybe the thing that's top of my mind is, I think founders of course care a lot about like their startup. I also want, like, how do we have a vibrant ecosystem of startups? How do startups continue to win, especially with respect to like, big tech, and how do we--how does the ecosystem become healthier, and what can you do?
Stephanie Zhan: Sounds like you should become an investor. Amazing. Thank you so much for joining us, Andrej, for this, and also for the whole day today.
If you got this far and you liked this piece, please consider tapping the ❤️ above or sharing this letter! It helps me understand which types of letters you like best and helps me choose which ones to share in the future. Thank you!
Wrap-up
If you’ve got any thoughts, questions, or feedback, please drop me a line - I would love to chat! You can find me on twitter at @kevg1412 or my email at kevin@12mv2.com.
If you're a fan of business or technology in general, please check out some of my other projects!
Speedwell Research — Comprehensive research on great public companies including Constellation Software, Floor & Decor, Meta (Facebook) and interesting new frameworks like the Consumer’s Hierarchy of Preferences.
Cloud Valley — Beautifully written, in-depth biographies that explore the defining moments, investments, and life decisions of investing, business, and tech legends like Dan Loeb, Bob Iger, Steve Jurvetson, and Cyan Banister.
DJY Research — Comprehensive research on publicly-traded Asian companies like Alibaba, Tencent, Nintendo, Sea Limited (FREE SAMPLE), Coupang (FREE SAMPLE), and more.
Compilations — “An international treasure”.
Memos — A selection of some of my favorite investor memos.
Bookshelves — Collection of recommended booklists.