Let's build GPT: from scratch, in code, spelled out.

Price: Free


[Course Details]


Let's build GPT: from scratch, in code, spelled out.



6,193,078 views · Jan 18, 2023

We build a Generatively Pretrained Transformer (GPT), following the paper "Attention is All You Need" and OpenAI's GPT-2 / GPT-3. We talk about connections to ChatGPT, which has taken the world by storm. We watch GitHub Copilot, itself a GPT, help us write a GPT (meta :D!). I recommend people watch the earlier makemore videos to get comfortable with the autoregressive language modeling framework and basics of tensors and PyTorch nn, which we take for granted in this video.


Links:

Google colab for the video: https://colab.research.google.com/dri...

GitHub repo for the video: https://github.com/karpathy/ng-video-...

Playlist of the whole Zero to Hero series so far: • The spelled-out intro to neural networks a...

nanoGPT repo: https://github.com/karpathy/nanoGPT

my website: https://karpathy.ai

my twitter:   / karpathy  

our Discord channel:   / discord  


Supplementary links:

Attention is All You Need paper: https://arxiv.org/abs/1706.03762

OpenAI GPT-3 paper: https://arxiv.org/abs/2005.14165 

OpenAI ChatGPT blog post: https://openai.com/blog/chatgpt/

The GPU I'm training the model on is from Lambda GPU Cloud, I think the best and easiest way to spin up an on-demand GPU instance in the cloud that you can ssh to: https://lambdalabs.com . If you prefer to work in notebooks, I think the easiest path today is Google Colab.


Suggested exercises:

EX1: The n-dimensional tensor mastery challenge: Combine the `Head` and `MultiHeadAttention` into one class that processes all the heads in parallel, treating the heads as another batch dimension (answer is in nanoGPT).
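
One way EX1 can come out (a sketch, not the nanoGPT reference solution; the constructor arguments n_embd, n_head, block_size, dropout and the fused qkv projection are choices made here for illustration):

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class MultiHeadAttention(nn.Module):
    """All heads computed in parallel; heads become an extra batch dimension."""
    def __init__(self, n_embd, n_head, block_size, dropout=0.0):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd, bias=False)  # one projection for q, k, v
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        hs = C // self.n_head                                  # head size
        q, k, v = self.qkv(x).split(C, dim=2)                  # each (B, T, C)
        # reshape so heads sit next to the batch dimension: (B, n_head, T, hs)
        q = q.view(B, T, self.n_head, hs).transpose(1, 2)
        k = k.view(B, T, self.n_head, hs).transpose(1, 2)
        v = v.view(B, T, self.n_head, hs).transpose(1, 2)
        wei = q @ k.transpose(-2, -1) * hs**-0.5               # (B, n_head, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        wei = self.dropout(wei)
        out = wei @ v                                          # (B, n_head, T, hs)
        out = out.transpose(1, 2).contiguous().view(B, T, C)   # re-assemble the heads
        return self.proj(out)

# quick shape check
mha = MultiHeadAttention(n_embd=32, n_head=4, block_size=8)
print(mha(torch.randn(2, 8, 32)).shape)  # torch.Size([2, 8, 32])
```

The heads end up as dimension 1 of a (B, n_head, T, head_size) tensor, so the same matrix multiplies that broadcast over the batch dimension broadcast over the heads too.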

EX2: Train the GPT on your own dataset of choice! What other data could be fun to blabber on about? (A fun advanced suggestion if you like: train a GPT to do addition of two numbers, i.e. a+b=c. You may find it helpful to predict the digits of c in reverse order, as the typical addition algorithm (that you're hoping it learns) would proceed right to left too. You may want to modify the data loader to simply serve random problems and skip the generation of train.bin, val.bin. You may want to mask out the loss at the input positions of a+b that just specify the problem using y=-1 in the targets (see CrossEntropyLoss ignore_index). Does your Transformer learn to add? Once you have this, swole doge project: build a calculator clone in GPT, for all of +-*/. Not an easy problem. You may need Chain of Thought traces.)
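
A minimal sketch of the data side of EX2 under the hints above (the character vocabulary, the '.' end-of-answer token, and the helper name get_example are assumptions, not a reference setup):

```python
import random
import torch

vocab = '0123456789+=.'                    # '.' used here as an end-of-answer token
stoi = {ch: i for i, ch in enumerate(vocab)}

def get_example(ndigit=2):
    a = random.randint(0, 10**ndigit - 1)
    b = random.randint(0, 10**ndigit - 1)
    c = a + b
    prompt = f"{a}+{b}="
    answer = str(c)[::-1] + '.'            # predict the sum's digits right to left
    tokens = [stoi[ch] for ch in prompt + answer]
    x = torch.tensor(tokens[:-1], dtype=torch.long)
    y = torch.tensor(tokens[1:], dtype=torch.long)
    y[:len(prompt) - 1] = -1               # no loss on positions that only state the problem
    return x, y

x, y = get_example()
print(x, y)
```

A batched loader would pad or fix the sequence length; the -1 entries in the targets are exactly what CrossEntropyLoss(ignore_index=-1) skips.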

EX3: Find a dataset that is very large, so large that you can't see a gap between train and val loss. Pretrain the transformer on this data, then initialize with that model and finetune it on tiny shakespeare with a smaller number of steps and lower learning rate. Can you obtain a lower validation loss by the use of pretraining?
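
The EX3 recipe in miniature, as a self-contained sketch: an nn.Embedding bigram model stands in for the Transformer and random tensors stand in for the two datasets, so only the pretrain, initialize-from-checkpoint, lower-learning-rate finetune pattern carries over:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size = 65
pretrained = nn.Embedding(vocab_size, vocab_size)    # imagine this was trained on the large corpus

finetuned = nn.Embedding(vocab_size, vocab_size)
finetuned.load_state_dict(pretrained.state_dict())   # initialize from the pretrained weights
optimizer = torch.optim.AdamW(finetuned.parameters(), lr=3e-5)  # lower LR than pretraining

for step in range(100):                               # far fewer steps than pretraining
    xb = torch.randint(0, vocab_size, (4, 8))         # real code: batches of tiny shakespeare
    yb = torch.randint(0, vocab_size, (4, 8))
    logits = finetuned(xb)                            # (B, T, vocab_size)
    loss = F.cross_entropy(logits.view(-1, vocab_size), yb.view(-1))
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
```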

EX4: Read some transformer papers and implement one additional feature or change that people seem to use. Does it improve the performance of your GPT?


Chapters:

00:00:00 intro: ChatGPT, Transformers, nanoGPT, Shakespeare

baseline language modeling, code setup

00:07:52 reading and exploring the data

00:09:28 tokenization, train/val split

00:14:27 data loader: batches of chunks of data

00:22:11 simplest baseline: bigram language model, loss, generation

00:34:53 training the bigram model

00:38:00 port our code to a script

Building the "self-attention"

00:42:13 version 1: averaging past context with for loops, the weakest form of aggregation

00:47:11 the trick in self-attention: matrix multiply as weighted aggregation

00:51:54 version 2: using matrix multiply

00:54:42 version 3: adding softmax

00:58:26 minor code cleanup

01:00:18 positional encoding

01:02:00 THE CRUX OF THE VIDEO: version 4: self-attention

01:11:38 note 1: attention as communication

01:12:46 note 2: attention has no notion of space, operates over sets

01:13:40 note 3: there is no communication across batch dimension

01:14:14 note 4: encoder blocks vs. decoder blocks

01:15:39 note 5: attention vs. self-attention vs. cross-attention

01:16:56 note 6: "scaled" self-attention. why divide by sqrt(head_size)

Building the Transformer

01:19:11 inserting a single self-attention block to our network

01:21:59 multi-headed self-attention

01:24:25 feedforward layers of transformer block

01:26:48 residual connections

01:32:51 layernorm (and its relationship to our previous batchnorm)

01:37:49 scaling up the model! creating a few variables. adding dropout

Notes on Transformer

01:42:39 encoder vs. decoder vs. both (?) Transformers

01:46:22 super quick walkthrough of nanoGPT, batched multi-headed self-attention

01:48:53 back to ChatGPT, GPT-3, pretraining vs. finetuning, RLHF

01:54:32 conclusions


Corrections: 

00:57:00 Oops "tokens from the future cannot communicate", not "past". Sorry! :)

01:20:05 Oops I should be using the head_size for the normalization, not C
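
For reference, a tiny sketch of what the 01:20:05 correction looks like in code (the shapes are arbitrary illustration values):

```python
import torch

B, T, C, head_size = 4, 8, 32, 16
x = torch.randn(B, T, C)
key = torch.nn.Linear(C, head_size, bias=False)
query = torch.nn.Linear(C, head_size, bias=False)
k, q = key(x), query(x)
wei = q @ k.transpose(-2, -1) * head_size**-0.5   # scale by head_size, not by C
print(wei.shape)  # torch.Size([4, 8, 8])
```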


Let's build GPT: from scratch, in code, spelled out. [1 lesson]

AI First-Hand Info · 51 courses in total

This collection mainly shares videos from AI creators on YouTube.

Whether it expands into derivative content in the future will depend on demand.

That said, when I come across other high-quality videos, I will post them here as well.
