How I use LLMs
Free
1 lesson
1,845,776 views · Feb 28, 2025
The example-driven, practical walkthrough of Large Language Models and their growing list of related features, as a new entry to my general audience series on LLMs. In this more practical followup, I take you through the many ways I use LLMs in my own life.

Chapters
00:00:00 Intro into the growing LLM ecosystem
00:02:54 ChatGPT interaction under the hood
00:13:12 Basic LLM interactions examples
00:18:03 Be aware of the model you're using, pricing tiers
00:22:54 Thinking models and when to use them
00:31:00 Tool use: internet search
00:42:04 Tool use: deep research
00:50:57 File uploads, adding documents to context
00:59:00 Tool use: python interpreter, messiness of the ecosystem
01:04:35 ChatGPT Advanced Data Analysis, figures, plots
01:09:00 Claude Artifacts, apps, diagrams
01:14:02 Cursor: Composer, writing code
01:22:28 Audio (Speech) Input/Output
01:27:37 Advanced Voice Mode aka true audio inside the model
01:37:09 NotebookLM, podcast generation
01:40:20 Image input, OCR
01:47:02 Image output, DALL-E, Ideogram, etc.
01:49:14 Video input, point and talk on app
01:52:23 Video output, Sora, Veo 2, etc.
01:53:29 ChatGPT memory, custom instructions
01:58:38 Custom GPTs
02:06:30 Summary

Links
Tiktokenizer: https://tiktokenizer.vercel.app/
OpenAI's ChatGPT: https://chatgpt.com/
Anthropic's Claude: https://claude.ai/
Google's Gemini: https://gemini.google.com/
xAI's Grok: https://grok.com/
Perplexity: https://www.perplexity.ai/
Google's NotebookLM: https://notebooklm.google.com/
Cursor: https://www.cursor.com/
Histories of Mysteries AI podcast on Spotify: https://open.spotify.com/show/3K4LRyM...
The visualization UI I was using in the video: https://excalidraw.com/
The specific file of Excalidraw we built up: https://drive.google.com/file/d/1DN3L...
Discord channel for Eureka Labs and this video: /discord

Educational Use Licensing
This video is freely available for educational and internal training purposes. Educators, students, schools, universities, nonprofit institutions, businesses, and individual learners may use this content freely for lessons, courses, internal training, and learning activities, provided they do not engage in commercial resale, redistribution, external commercial use, or modify content to misrepresent its intent.
Let's reproduce GPT-2 (124M)
Free
1 lesson
We reproduce the GPT-2 (124M) from scratch. This video covers the whole process: First we build the GPT-2 network, then we optimize its training to be really fast, then we set up the training run following the GPT-2 and GPT-3 papers and their hyperparameters, then we hit run, and come back the next morning to see our results, and enjoy some amusing model generations. Keep in mind that in some places this video builds on the knowledge from earlier videos in the Zero to Hero playlist (see my channel). You could also see this video as building my nanoGPT repo, which by the end is about 90% similar.

Links
build-nanogpt GitHub repo, with all the changes in this video as individual commits: https://github.com/karpathy/build-nan...
nanoGPT repo: https://github.com/karpathy/nanoGPT
llm.c repo: https://github.com/karpathy/llm.c
my website: https://karpathy.ai
my twitter: /karpathy
our Discord channel: /discord

Supplementary links
Attention is All You Need paper: https://arxiv.org/abs/1706.03762
OpenAI GPT-3 paper: https://arxiv.org/abs/2005.14165
OpenAI GPT-2 paper: https://d4mucfpksywv.cloudfront.net/b...
The GPU I'm training the model on is from Lambda GPU Cloud, I think the best and easiest way to spin up an on-demand GPU instance in the cloud that you can ssh to: https://lambdalabs.com

Chapters
00:00:00 intro: Let's reproduce GPT-2 (124M)
00:03:39 exploring the GPT-2 (124M) OpenAI checkpoint
00:13:47 SECTION 1: implementing the GPT-2 nn.Module
00:28:08 loading the huggingface/GPT-2 parameters
00:31:00 implementing the forward pass to get logits
00:33:31 sampling init, prefix tokens, tokenization
00:37:02 sampling loop
00:41:47 sample, auto-detect the device
00:45:50 let's train: data batches (B,T) → logits (B,T,C)
00:52:53 cross entropy loss
00:56:42 optimization loop: overfit a single batch
01:02:00 data loader lite
01:06:14 parameter sharing wte and lm_head
01:13:47 model initialization: std 0.02, residual init
01:22:18 SECTION 2: Let's make it fast. GPUs, mixed precision, 1000ms
01:28:14 Tensor Cores, timing the code, TF32 precision, 333ms
01:39:38 float16, gradient scalers, bfloat16, 300ms
01:48:15 torch.compile, Python overhead, kernel fusion, 130ms
02:00:18 flash attention, 96ms
02:06:54 nice/ugly numbers. vocab size 50257 → 50304, 93ms
02:14:55 SECTION 3: hyperparameters, AdamW, gradient clipping
02:21:06 learning rate scheduler: warmup + cosine decay
02:26:21 batch size schedule, weight decay, FusedAdamW, 90ms
02:34:09 gradient accumulation
02:46:52 distributed data parallel (DDP)
03:10:21 datasets used in GPT-2, GPT-3, FineWeb (EDU)
03:23:10 validation data split, validation loss, sampling revive
03:28:23 evaluation: HellaSwag, starting the run
03:43:05 SECTION 4: results in the morning! GPT-2, GPT-3 repro
03:56:21 shoutout to llm.c, equivalent but faster code in raw C/CUDA
03:59:39 summary, phew, build-nanogpt github repo

Corrections
I will post all errata and followups to the build-nanogpt GitHub repo (link above).

SuperThanks
I experimentally enabled them on my channel yesterday. Totally optional and only use if rich. All revenue goes to supporting my work in AI + Education.
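The chapter list above names the main speed and optimization levers used in Sections 2 and 3 (TF32 matmuls, bfloat16 autocast, torch.compile, gradient clipping, AdamW). As a rough illustration only, not the video's actual code (that lives in the build-nanogpt repo), here is a minimal PyTorch training step wiring those pieces together; the tiny stand-in model, batch/sequence sizes, and hyperparameters are placeholders, and it assumes a CUDA GPU (ideally Ampere or newer for TF32/bfloat16).

```python
import torch

torch.manual_seed(42)
torch.set_float32_matmul_precision("high")   # allow TF32 matmuls on Tensor Cores

# Tiny stand-in model: a real run would use the full GPT-2 nn.Module instead.
model = torch.nn.Sequential(
    torch.nn.Embedding(50304, 768),          # "nice" padded vocab size (50257 -> 50304)
    torch.nn.Linear(768, 50304),
).cuda()
model = torch.compile(model)                 # kernel fusion, less Python overhead

optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4, weight_decay=0.1)

B, T = 4, 256                                # illustrative batch/sequence sizes
x = torch.randint(0, 50304, (B, T), device="cuda")   # input token ids
y = torch.randint(0, 50304, (B, T), device="cuda")   # (random) next-token targets

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    logits = model(x)                                        # (B, T, vocab)
    loss = torch.nn.functional.cross_entropy(
        logits.view(-1, logits.size(-1)), y.view(-1))
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)      # gradient clipping
optimizer.step()
optimizer.zero_grad(set_to_none=True)
print(loss.item())
```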
Let's build the GPT Tokenizer
Free
1 lesson
Feb 21, 2024
The Tokenizer is a necessary and pervasive component of Large Language Models (LLMs), where it translates between strings and tokens (text chunks). Tokenizers are a completely separate stage of the LLM pipeline: they have their own training sets, training algorithms (Byte Pair Encoding), and after training implement two fundamental functions: encode() from strings to tokens, and decode() back from tokens to strings. In this lecture we build from scratch the Tokenizer used in the GPT series from OpenAI. In the process, we will see that a lot of weird behaviors and problems of LLMs actually trace back to tokenization. We'll go through a number of these issues, discuss why tokenization is at fault, and why someone out there ideally finds a way to delete this stage entirely.

Chapters
00:00:00 intro: Tokenization, GPT-2 paper, tokenization-related issues
00:05:50 tokenization by example in a Web UI (tiktokenizer)
00:14:56 strings in Python, Unicode code points
00:18:15 Unicode byte encodings, ASCII, UTF-8, UTF-16, UTF-32
00:22:47 daydreaming: deleting tokenization
00:23:50 Byte Pair Encoding (BPE) algorithm walkthrough
00:27:02 starting the implementation
00:28:35 counting consecutive pairs, finding most common pair
00:30:36 merging the most common pair
00:34:58 training the tokenizer: adding the while loop, compression ratio
00:39:20 tokenizer/LLM diagram: it is a completely separate stage
00:42:47 decoding tokens to strings
00:48:21 encoding strings to tokens
00:57:36 regex patterns to force splits across categories
01:11:38 tiktoken library intro, differences between GPT-2/GPT-4 regex
01:14:59 GPT-2 encoder.py released by OpenAI walkthrough
01:18:26 special tokens, tiktoken handling of, GPT-2/GPT-4 differences
01:25:28 minbpe exercise time! write your own GPT-4 tokenizer
01:28:42 sentencepiece library intro, used to train Llama 2 vocabulary
01:43:27 how to set vocabulary set? revisiting gpt.py transformer
01:48:11 training new tokens, example of prompt compression
01:49:58 multimodal [image, video, audio] tokenization with vector quantization
01:51:41 revisiting and explaining the quirks of LLM tokenization
02:10:20 final recommendations
02:12:50 ??? :)

Exercises
Advised flow: reference this document and try to implement the steps before I give away the partial solutions in the video. The full solutions, if you're getting stuck, are in the minbpe code: https://github.com/karpathy/minbpe/bl...

Links
Google colab for the video: https://colab.research.google.com/dri...
GitHub repo for the video: minBPE https://github.com/karpathy/minbpe
Playlist of the whole Zero to Hero series so far: • The spelled-out intro to neural networks a...
our Discord channel: /discord
my Twitter: /karpathy

Supplementary links
tiktokenizer: https://tiktokenizer.vercel.app
tiktoken from OpenAI: https://github.com/openai/tiktoken
sentencepiece from Google: https://github.com/google/sentencepiece
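To make the BPE steps named in the chapters (count consecutive pairs, merge the most common pair, then decode) concrete, here is a compressed toy sketch. It is not the lecture's or minbpe's actual code; the training string and the three-merge budget are arbitrary choices for the demo.

```python
# Toy Byte Pair Encoding: learn a few merges over raw UTF-8 bytes, then decode back.
def get_pair_counts(ids):
    counts = {}
    for a, b in zip(ids, ids[1:]):
        counts[(a, b)] = counts.get((a, b), 0) + 1
    return counts

def merge(ids, pair, new_id):
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)      # replace the pair with the new token id
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

text = "aaabdaaabac"                    # toy training text
ids = list(text.encode("utf-8"))        # start from raw bytes (ids 0..255)
merges = {}                             # (id, id) -> new token id
for new_id in range(256, 256 + 3):      # learn 3 merges for the demo
    counts = get_pair_counts(ids)
    pair = max(counts, key=counts.get)  # most frequent consecutive pair
    ids = merge(ids, pair, new_id)
    merges[pair] = new_id

# decode(): expand each learned token back into bytes, then into a string
vocab = {i: bytes([i]) for i in range(256)}
for (a, b), idx in merges.items():
    vocab[idx] = vocab[a] + vocab[b]
print(ids, merges)
print(b"".join(vocab[i] for i in ids).decode("utf-8"))   # round-trips to "aaabdaaabac"
```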
Let's build GPT: from scratch, in code, spelled out
Free
1 lesson
6,193,078 views · Jan 18, 2023
We build a Generatively Pretrained Transformer (GPT), following the paper "Attention is All You Need" and OpenAI's GPT-2 / GPT-3. We talk about connections to ChatGPT, which has taken the world by storm. We watch GitHub Copilot, itself a GPT, help us write a GPT (meta :D!). I recommend people watch the earlier makemore videos to get comfortable with the autoregressive language modeling framework and basics of tensors and PyTorch nn, which we take for granted in this video.

Links
Google colab for the video: https://colab.research.google.com/dri...
GitHub repo for the video: https://github.com/karpathy/ng-video-...
Playlist of the whole Zero to Hero series so far: • The spelled-out intro to neural networks a...
nanoGPT repo: https://github.com/karpathy/nanoGPT
my website: https://karpathy.ai
my twitter: /karpathy
our Discord channel: /discord

Supplementary links
Attention is All You Need paper: https://arxiv.org/abs/1706.03762
OpenAI GPT-3 paper: https://arxiv.org/abs/2005.14165
OpenAI ChatGPT blog post: https://openai.com/blog/chatgpt/
The GPU I'm training the model on is from Lambda GPU Cloud, I think the best and easiest way to spin up an on-demand GPU instance in the cloud that you can ssh to: https://lambdalabs.com. If you prefer to work in notebooks, I think the easiest path today is Google Colab.

Suggested exercises
EX1: The n-dimensional tensor mastery challenge: Combine the `Head` and `MultiHeadAttention` into one class that processes all the heads in parallel, treating the heads as another batch dimension (answer is in nanoGPT).
EX2: Train the GPT on your own dataset of choice! What other data could be fun to blabber on about? (A fun advanced suggestion if you like: train a GPT to do addition of two numbers, i.e. a+b=c. You may find it helpful to predict the digits of c in reverse order, as the typical addition algorithm (that you're hoping it learns) would proceed right to left too. You may want to modify the data loader to simply serve random problems and skip the generation of train.bin, val.bin. You may want to mask out the loss at the input positions of a+b that just specify the problem, using y=-1 in the targets (see CrossEntropyLoss ignore_index). Does your Transformer learn to add? Once you have this, swole doge project: build a calculator clone in GPT, for all of +-*/. Not an easy problem. You may need Chain of Thought traces.)
EX3: Find a dataset that is very large, so large that you can't see a gap between train and val loss. Pretrain the transformer on this data, then initialize with that model and finetune it on tiny shakespeare with a smaller number of steps and lower learning rate. Can you obtain a lower validation loss by the use of pretraining?
EX4: Read some transformer papers and implement one additional feature or change that people seem to use. Does it improve the performance of your GPT?

Chapters
00:00:00 intro: ChatGPT, Transformers, nanoGPT, Shakespeare
Baseline language modeling, code setup
00:07:52 reading and exploring the data
00:09:28 tokenization, train/val split
00:14:27 data loader: batches of chunks of data
00:22:11 simplest baseline: bigram language model, loss, generation
00:34:53 training the bigram model
00:38:00 port our code to a script
Building the "self-attention"
00:42:13 version 1: averaging past context with for loops, the weakest form of aggregation
00:47:11 the trick in self-attention: matrix multiply as weighted aggregation
00:51:54 version 2: using matrix multiply
00:54:42 version 3: adding softmax
00:58:26 minor code cleanup
01:00:18 positional encoding
01:02:00 THE CRUX OF THE VIDEO: version 4: self-attention
01:11:38 note 1: attention as communication
01:12:46 note 2: attention has no notion of space, operates over sets
01:13:40 note 3: there is no communication across batch dimension
01:14:14 note 4: encoder blocks vs. decoder blocks
01:15:39 note 5: attention vs. self-attention vs. cross-attention
01:16:56 note 6: "scaled" self-attention. why divide by sqrt(head_size)
Building the Transformer
01:19:11 inserting a single self-attention block to our network
01:21:59 multi-headed self-attention
01:24:25 feedforward layers of transformer block
01:26:48 residual connections
01:32:51 layernorm (and its relationship to our previous batchnorm)
01:37:49 scaling up the model! creating a few variables. adding dropout
Notes on Transformer
01:42:39 encoder vs. decoder vs. both (?) Transformers
01:46:22 super quick walkthrough of nanoGPT, batched multi-headed self-attention
01:48:53 back to ChatGPT, GPT-3, pretraining vs. finetuning, RLHF
01:54:32 conclusions

Corrections
00:57:00 Oops, "tokens from the future cannot communicate", not "past". Sorry! :)
01:20:05 Oops, I should be using head_size for the normalization, not C
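Since the chapters center on a single head of masked, "scaled" self-attention (why divide by sqrt(head_size), softmax, weighted aggregation), here is a condensed PyTorch sketch of that one head. The sizes are illustrative and this is not the lecture's exact code.

```python
# One masked self-attention head: scaled scores, causal mask, softmax, aggregate values.
import torch
import torch.nn.functional as F

B, T, C, head_size = 4, 8, 32, 16            # batch, time, channels, head size
x = torch.randn(B, T, C)

key   = torch.nn.Linear(C, head_size, bias=False)
query = torch.nn.Linear(C, head_size, bias=False)
value = torch.nn.Linear(C, head_size, bias=False)

k, q, v = key(x), query(x), value(x)                      # each (B, T, head_size)
wei = q @ k.transpose(-2, -1) * head_size**-0.5           # (B, T, T) scaled scores
tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float("-inf"))           # decoder: no peeking at the future
wei = F.softmax(wei, dim=-1)                              # rows sum to 1
out = wei @ v                                             # weighted aggregation of values
print(out.shape)                                          # torch.Size([4, 8, 16])
```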
Deep Dive into LLMs like ChatGPT
Free
1 lesson
3,428,816 views · Feb 6, 2025
This is a general audience deep dive into the Large Language Model (LLM) AI technology that powers ChatGPT and related products. It covers the full training stack of how the models are developed, along with mental models of how to think about their "psychology", and how to get the best use of them in practical applications. I have one "Intro to LLMs" video already from about a year ago, but that is just a re-recording of a random talk, so I wanted to loop around and do a much more comprehensive version.

Instructor
Andrej was a founding member at OpenAI (2015) and then Sr. Director of AI at Tesla (2017-2022), and is now a founder at Eureka Labs, which is building an AI-native school. His goal in this video is to raise knowledge and understanding of the state of the art in AI, and empower people to effectively use the latest and greatest in their work.
Find more at https://karpathy.ai/ and https://x.com/karpathy

Chapters
00:00:00 introduction
00:01:00 pretraining data (internet)
00:07:47 tokenization
00:14:27 neural network I/O
00:20:11 neural network internals
00:26:01 inference
00:31:09 GPT-2: training and inference
00:42:52 Llama 3.1 base model inference
00:59:23 pretraining to post-training
01:01:06 post-training data (conversations)
01:20:32 hallucinations, tool use, knowledge/working memory
01:41:46 knowledge of self
01:46:56 models need tokens to think
02:01:11 tokenization revisited: models struggle with spelling
02:04:53 jagged intelligence
02:07:28 supervised finetuning to reinforcement learning
02:14:42 reinforcement learning
02:27:47 DeepSeek-R1
02:42:07 AlphaGo
02:48:26 reinforcement learning from human feedback (RLHF)
03:09:39 preview of things to come
03:15:15 keeping track of LLMs
03:18:34 where to find LLMs
03:21:46 grand summary

Links
ChatGPT: https://chatgpt.com/
FineWeb (pretraining dataset): https://huggingface.co/spaces/Hugging...
Tiktokenizer: https://tiktokenizer.vercel.app/
Transformer Neural Net 3D visualizer: https://bbycroft.net/llm
llm.c Let's Reproduce GPT-2: https://github.com/karpathy/llm.c/dis...
Llama 3 paper from Meta: https://arxiv.org/abs/2407.21783
Hyperbolic, for inference of base model: https://app.hyperbolic.xyz/
InstructGPT paper on SFT: https://arxiv.org/abs/2203.02155
HuggingFace inference playground: https://huggingface.co/spaces/hugging...
DeepSeek-R1 paper: https://arxiv.org/abs/2501.12948
TogetherAI Playground for open model inference: https://api.together.xyz/playground
AlphaGo paper (PDF): https://discovery.ucl.ac.uk/id/eprint...
AlphaGo Move 37 video: • Lee Sedol vs AlphaGo Move 37 reactions an...
LM Arena for model rankings: https://lmarena.ai/
AI News Newsletter: https://buttondown.com/ainews
LMStudio for local inference: https://lmstudio.ai/
The visualization UI I was using in the video: https://excalidraw.com/
The specific file of Excalidraw we built up: https://drive.google.com/file/d/1EZh5...
Discord channel for Eureka Labs and this video: /discord

Educational Use Licensing
This video is freely available for educational and internal training purposes. Educators, students, schools, universities, nonprofit institutions, businesses, and individual learners may use this content freely for lessons, courses, internal training, and learning activities, provided they do not engage in commercial resale, redistribution, external commercial use, or modify content to misrepresent its intent.
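The tokenization chapter and the Tiktokenizer link above can also be reproduced locally. The snippet below is a small illustration using OpenAI's tiktoken library (assumed installed via `pip install tiktoken`); it is not code from the video itself, and the printed ids in the comment are examples, not guaranteed output.

```python
# Encode a string into GPT-2 BPE token ids and decode each id back to its text chunk.
import tiktoken

enc = tiktoken.get_encoding("gpt2")            # the GPT-2 BPE vocabulary
text = "Hello world, this is tokenization!"
ids = enc.encode(text)
print(ids)                                     # e.g. [15496, 995, 11, ...]
print([enc.decode([i]) for i in ids])          # the text chunk behind each id
assert enc.decode(ids) == text                 # encode/decode round-trips exactly
```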
[1hr Talk] Intro to Large Language Models
Free
1 lesson
3,003,062 views · Nov 23, 2023
This is a 1 hour general-audience introduction to Large Language Models: the core technical component behind systems like ChatGPT, Claude, and Bard. What they are, where they are headed, comparisons and analogies to present-day operating systems, and some of the security-related challenges of this new computing paradigm. As of November 2023 (this field moves fast!).

Context: This video is based on the slides of a talk I gave recently at the AI Security Summit. The talk was not recorded but a lot of people came to me after and told me they liked it. Seeing as I had already put in one long weekend of work to make the slides, I decided to just tune them a bit, record this round 2 of the talk and upload it here on YouTube. Pardon the random background, that's my hotel room during the thanksgiving break.
Slides as PDF: https://drive.google.com/file/d/1pxx_... (42MB)
Slides as Keynote: https://drive.google.com/file/d/1FPUp... (140MB)

Few things I wish I said (I'll add items here as they come up):
The dreams and hallucinations do not get fixed with finetuning. Finetuning just "directs" the dreams into "helpful assistant dreams". Always be careful with what LLMs tell you, especially if they are telling you something from memory alone. That said, similar to a human, if the LLM used browsing or retrieval and the answer made its way into the "working memory" of its context window, you can trust the LLM a bit more to process that information into the final answer. But TLDR: right now, do not trust what LLMs say or do. For example, in the tools section, I'd always recommend double-checking the math/code the LLM did.
How does the LLM use a tool like the browser? It emits special words, e.g. |BROWSER|. When the code "above" that is inferencing the LLM detects these words, it captures the output that follows, sends it off to a tool, comes back with the result and continues the generation. How does the LLM know to emit these special words? Finetuning datasets teach it how and when to browse, by example. And/or the instructions for tool use can also be automatically placed in the context window (in the "system message"). (A minimal code sketch of this outer tool-use loop follows at the end of this entry.)
You might also enjoy my 2015 blog post "Unreasonable Effectiveness of Recurrent Neural Networks". The way we obtain base models today is pretty much identical on a high level, except the RNN is swapped for a Transformer. http://karpathy.github.io/2015/05/21/...
What is in the run.c file? A bit more full-featured 1000-line version here: https://github.com/karpathy/llama2.c/...

Chapters
Part 1: LLMs
00:00:00 Intro: Large Language Model (LLM) talk
00:00:20 LLM Inference
00:04:17 LLM Training
00:08:58 LLM dreams
00:11:22 How do they work?
00:14:14 Finetuning into an Assistant
00:17:52 Summary so far
00:21:05 Appendix: Comparisons, Labeling docs, RLHF, Synthetic data, Leaderboard
Part 2: Future of LLMs
00:25:43 LLM Scaling Laws
00:27:43 Tool Use (Browser, Calculator, Interpreter, DALL-E)
00:33:32 Multimodality (Vision, Audio)
00:35:00 Thinking, System 1/2
00:38:02 Self-improvement, LLM AlphaGo
00:40:45 LLM Customization, GPTs store
00:42:15 LLM OS
Part 3: LLM Security
00:45:43 LLM Security Intro
00:46:14 Jailbreaks
00:51:30 Prompt Injection
00:56:23 Data poisoning
00:58:37 LLM Security conclusions
End
00:59:23 Outro

Educational Use Licensing
This video is freely available for educational and internal training purposes. Educators, students, schools, universities, nonprofit institutions, businesses, and individual learners may use this content freely for lessons, courses, internal training, and learning activities, provided they do not engage in commercial resale, redistribution, external commercial use, or modify content to misrepresent its intent.
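The tool-use explanation in this entry (the model emits a special word like |BROWSER|, the surrounding code intercepts it, runs the tool, and feeds the result back into the context) boils down to a simple outer loop. The sketch below is a hypothetical illustration only: generate_until_stop and run_browser_search are made-up placeholder functions, and the |RESULT| marker is an assumption rather than any real protocol.

```python
# Hypothetical outer loop around an LLM: detect a tool-call marker, run the tool,
# append its result to the context, and let the model continue generating.
def generate_until_stop(context: str) -> str:
    """Placeholder for one LLM decoding pass that stops at a tool call or end of answer."""
    raise NotImplementedError

def run_browser_search(query: str) -> str:
    """Placeholder for the actual tool (web search, calculator, interpreter, ...)."""
    raise NotImplementedError

def answer(prompt: str, max_tool_calls: int = 3) -> str:
    context = prompt
    for _ in range(max_tool_calls):
        chunk = generate_until_stop(context)
        context += chunk
        if "|BROWSER|" not in chunk:
            return context                       # model finished without needing a tool
        query = chunk.split("|BROWSER|", 1)[1].strip()
        context += "\n|RESULT| " + run_browser_search(query) + "\n"
    return context
```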
Tattoo designs generated with Stable Diffusion
Free
1 lesson
Aug 17, 2022
Dreams of tattoos. (There are a few discrete jumps in the video because I had to erase portions that got just a little
Steampunk brain generated with Stable Diffusion
Free
1 lesson
Aug 18, 2022 #unrealengine
Prompt: "ultrarealistic steam punk neural network machine in the shape of a brain, placed on a pedestal, covered with neurons made of gears. dramatic lighting. #unrealengine"
Stable diffusion takes a noise vector as input and samples an image. To create this video I smoothly (spherically) interpolate between randomly chosen noise vectors and render frames along the way. This video was produced by one A100 GPU dreaming about the prompt overnight (~8 hours), while I slept and dreamt about other things. This is version 2 of the video for this prompt, with (I think?) a bit higher quality and trippy AGI music.
Music: Wonders by JVNA
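The description above mentions spherically interpolating between randomly chosen noise vectors to get smooth video frames. A minimal NumPy sketch of that slerp step is below; the latent shape and frame count are illustrative assumptions, and the actual Stable Diffusion sampling/decoding of each latent into an image is left out.

```python
# Spherical interpolation (slerp) between two random noise latents for a smooth sweep.
import numpy as np

def slerp(t, v0, v1, eps=1e-7):
    """Spherically interpolate between flat vectors v0 and v1 at fraction t in [0, 1]."""
    v0n, v1n = v0 / np.linalg.norm(v0), v1 / np.linalg.norm(v1)
    omega = np.arccos(np.clip(np.dot(v0n, v1n), -1.0, 1.0))   # angle between directions
    if omega < eps:                                           # nearly parallel: plain lerp
        return (1.0 - t) * v0 + t * v1
    return (np.sin((1.0 - t) * omega) * v0 + np.sin(t * omega) * v1) / np.sin(omega)

shape = (4, 64, 64)                      # illustrative latent shape, not the exact one used
z0 = np.random.randn(*shape).ravel()
z1 = np.random.randn(*shape).ravel()
frames = [slerp(t, z0, z1).reshape(shape) for t in np.linspace(0.0, 1.0, 60)]
# each frames[i] would then be fed to the diffusion sampler to render one video frame
print(len(frames), frames[0].shape)
```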
Psychedelic faces generated with Stable Diffusion
Free
1 lesson
Aug 20, 2022
Prompt: "psychedelic faces"
Stable diffusion takes a noise vector as input and samples an image. To create this video I smoothly (spherically) interpolate between randomly chosen noise vectors and render frames along the way. This video was produced by one A100 GPU taking about 10 tabs and dreaming about the prompt overnight (~8 hours), while I slept and dreamt about other things.
Music: Stars by JVNA
Links:
Stable diffusion: https://stability.ai/blog
Code used to make this video: https://gist.github.com/karpathy/0010...
My twitter: /karpathy
The spelled-out intro to language modeling: building makemore
Free
1 lesson
Sep 8, 2022
We implement a bigram character-level language model, which we will further complexify in followup videos into a modern Transformer language model, like GPT. In this video, the focus is on (1) introducing torch.Tensor and its subtleties and use in efficiently evaluating neural networks and (2) the overall framework of language modeling that includes model training, sampling, and the evaluation of a loss (e.g. the negative log likelihood for classification).

Links
makemore on github: https://github.com/karpathy/makemore
jupyter notebook I built in this video: https://github.com/karpathy/nn-zero-t...
my website: https://karpathy.ai
my twitter: /karpathy
(new) Neural Networks: Zero to Hero series Discord channel: /discord, for people who'd like to chat more and go beyond youtube comments

Useful links for practice
Python + Numpy tutorial from CS231n: https://cs231n.github.io/python-numpy... We use torch.tensor instead of numpy.array in this video. Their design (e.g. broadcasting, data types, etc.) is so similar that practicing one is basically practicing the other, just be careful with some of the APIs - how various functions are named, what arguments they take, etc. - these details can vary.
PyTorch tutorial on Tensor: https://pytorch.org/tutorials/beginne...
Another PyTorch intro to Tensor: https://pytorch.org/tutorials/beginne...

Exercises
E01: train a trigram language model, i.e. take two characters as an input to predict the 3rd one. Feel free to use either counting or a neural net. Evaluate the loss; did it improve over a bigram model?
E02: split up the dataset randomly into 80% train set, 10% dev set, 10% test set. Train the bigram and trigram models only on the training set. Evaluate them on dev and test splits. What can you see?
E03: use the dev set to tune the strength of smoothing (or regularization) for the trigram model - i.e. try many possibilities and see which one works best based on the dev set loss. What patterns can you see in the train and dev set loss as you tune this strength? Take the best setting of the smoothing and evaluate on the test set once and at the end. How good of a loss do you achieve?
E04: we saw that our 1-hot vectors merely select a row of W, so producing these vectors explicitly feels wasteful. Can you delete our use of F.one_hot in favor of simply indexing into rows of W?
E05: look up and use F.cross_entropy instead. You should achieve the same result. Can you think of why we'd prefer to use F.cross_entropy instead?
E06: meta-exercise! Think of a fun/interesting exercise and complete it.

Chapters
00:00:00 intro
00:03:03 reading and exploring the dataset
00:06:24 exploring the bigrams in the dataset
00:09:24 counting bigrams in a python dictionary
00:12:45 counting bigrams in a 2D torch tensor ("training the model")
00:18:19 visualizing the bigram tensor
00:20:54 deleting spurious (S) and (E) tokens in favor of a single . token
00:24:02 sampling from the model
00:36:17 efficiency! vectorized normalization of the rows, tensor broadcasting
00:50:14 loss function (the negative log likelihood of the data under our model)
01:00:50 model smoothing with fake counts
01:02:57 PART 2: the neural network approach: intro
01:05:26 creating the bigram dataset for the neural net
01:10:01 feeding integers into neural nets? one-hot encodings
01:13:53 the "neural net": one linear layer of neurons implemented with matrix multiplication
01:18:46 transforming neural net outputs into probabilities: the softmax
01:26:17 summary, preview to next steps, reference to micrograd
01:35:49 vectorized loss
01:38:36 backward and update, in PyTorch
01:42:55 putting everything together
01:47:49 note 1: one-hot encoding really just selects a row of the next Linear layer's weight matrix
01:50:18 note 2: model smoothing as regularization loss
01:54:31 sampling from the neural net
01:56:16 conclusion
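The first half of the chapter list (counting bigrams in a 2D torch tensor, smoothing with fake counts, normalizing rows, sampling) can be condensed into a short sketch. The tiny word list below is a stand-in for the real names dataset used in the video, and this is not the notebook's actual code.

```python
# Counting-based bigram character model: build a count matrix, normalize rows, sample a name.
import torch

words = ["emma", "olivia", "ava", "isabella", "sophia"]      # stand-in dataset
chars = sorted(set("".join(words)))
stoi = {c: i + 1 for i, c in enumerate(chars)}
stoi["."] = 0                                                # single start/end token
itos = {i: c for c, i in stoi.items()}

N = torch.zeros((len(stoi), len(stoi)), dtype=torch.int32)   # bigram counts
for w in words:
    ids = [0] + [stoi[c] for c in w] + [0]
    for a, b in zip(ids, ids[1:]):
        N[a, b] += 1

P = (N + 1).float()                                          # +1 fake counts = smoothing
P /= P.sum(dim=1, keepdim=True)                              # each row becomes a distribution

g = torch.Generator().manual_seed(42)
ix, out = 0, []
while True:
    ix = torch.multinomial(P[ix], num_samples=1, generator=g).item()
    if ix == 0:                                              # sampled the end token
        break
    out.append(itos[ix])
print("".join(out))                                          # a sampled (made-up) name
```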