Let's build the GPT Tokenizer


【Course Details】


Let's build the GPT Tokenizer


February 21, 2024

The Tokenizer is a necessary and pervasive component of Large Language Models (LLMs), where it translates between strings and tokens (text chunks). Tokenizers are a completely separate stage of the LLM pipeline: they have their own training sets, training algorithms (Byte Pair Encoding), and after training implement two fundamental functions: encode() from strings to tokens, and decode() back from tokens to strings. In this lecture we build from scratch the Tokenizer used in the GPT series from OpenAI. In the process, we will see that a lot of weird behaviors and problems of LLMs actually trace back to tokenization. We'll go through a number of these issues, discuss why tokenization is at fault, and why someone out there ideally finds a way to delete this stage entirely.
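
For a concrete feel for the encode()/decode() round trip described above, here is a minimal sketch using the tiktoken library linked below; it assumes tiktoken has been installed with pip and uses the GPT-2 encoding covered in the video.

# Minimal encode/decode round trip, assuming `pip install tiktoken`.
import tiktoken

enc = tiktoken.get_encoding("gpt2")   # the GPT-2 BPE vocabulary

text = "Hello world!"
tokens = enc.encode(text)             # string -> list of token ids
print(tokens)                         # e.g. [15496, 995, 0]
assert enc.decode(tokens) == text     # token ids -> original string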


Chapters:

00:00:00 intro: Tokenization, GPT-2 paper, tokenization-related issues

00:05:50 tokenization by example in a Web UI (tiktokenizer)

00:14:56 strings in Python, Unicode code points

00:18:15 Unicode byte encodings, ASCII, UTF-8, UTF-16, UTF-32

00:22:47 daydreaming: deleting tokenization

00:23:50 Byte Pair Encoding (BPE) algorithm walkthrough (a minimal code sketch follows this chapter list)

00:27:02 starting the implementation

00:28:35 counting consecutive pairs, finding most common pair

00:30:36 merging the most common pair

00:34:58 training the tokenizer: adding the while loop, compression ratio

00:39:20 tokenizer/LLM diagram: it is a completely separate stage

00:42:47 decoding tokens to strings

00:48:21 encoding strings to tokens

00:57:36 regex patterns to force splits across categories

01:11:38 tiktoken library intro, differences between GPT-2/GPT-4 regex

01:14:59 GPT-2 encoder.py released by OpenAI walkthrough

01:18:26 special tokens, tiktoken handling of, GPT-2/GPT-4 differences

01:25:28 minbpe exercise time! write your own GPT-4 tokenizer

01:28:42 sentencepiece library intro, used to train Llama 2 vocabulary

01:43:27 how to set vocabulary set? revisiting gpt.py transformer

01:48:11 training new tokens, example of prompt compression

01:49:58 multimodal [image, video, audio] tokenization with vector quantization

01:51:41 revisiting and explaining the quirks of LLM tokenization

02:10:20 final recommendations

02:12:50 ??? :)
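
The chapters from 00:27:02 through 00:48:21 build up the core BPE loop: count consecutive pairs, merge the most common pair, and repeat until the target vocabulary size is reached. As a rough illustration of those steps (a sketch only, not Karpathy's minbpe code; the toy text and merge count are made up):

# Illustrative byte-level BPE trainer: count pairs -> merge the most
# common -> repeat. A sketch of the chapter steps, not the minbpe code.

def get_pair_counts(ids):
    # Count occurrences of each consecutive pair of token ids.
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, new_id):
    # Replace every occurrence of `pair` in `ids` with `new_id`.
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

text = "aaabdaaabac"                   # toy training text
ids = list(text.encode("utf-8"))       # start from raw UTF-8 bytes (0..255)

num_merges = 3                         # vocab_size - 256 in a real tokenizer
merges = {}                            # (id, id) -> merged token id
for i in range(num_merges):
    counts = get_pair_counts(ids)
    top = max(counts, key=counts.get)  # most common consecutive pair
    merges[top] = 256 + i              # new ids start after the 256 bytes
    ids = merge(ids, top, 256 + i)

print(merges, len(ids))                # each merge shrinks the sequence

Decoding (00:42:47) just expands merged ids back down to bytes and UTF-8-decodes them; encoding (00:48:21) replays the learned merges in order on new text. Chapter 00:57:36 adds one refinement on top of this: GPT-2 first splits text with a regex pattern so that merges never cross letter/number/punctuation boundaries.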


Exercises:

Advised flow: reference this document and try to implement the steps before I give away the partial solutions in the video. The full solutions if you're getting stuck are in the minbpe code https://github.com/karpathy/minbpe/bl...


Links:

Google colab for the video: https://colab.research.google.com/dri...

GitHub repo for the video: minBPE https://github.com/karpathy/minbpe

Playlist of the whole Zero to Hero series so far: • The spelled-out intro to neural networks a...

our Discord channel: /discord

my Twitter: /karpathy


Supplementary links:

tiktokenizer: https://tiktokenizer.vercel.app

tiktoken from OpenAI: https://github.com/openai/tiktoken

sentencepiece from Google: https://github.com/google/sentencepiece
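
Since chapter 01:28:42 uses sentencepiece to train a vocabulary the way Llama 2 did, here is a hedged sketch of typical usage; the corpus path input.txt, the toy vocab_size, and the other options are illustrative assumptions, not the Llama 2 configuration (which used a 32000-token vocabulary).

# Sketch of training and loading a sentencepiece BPE model, assuming
# `pip install sentencepiece` and a plain-text corpus at input.txt
# (hypothetical path). Options are illustrative, not the Llama 2 config.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="input.txt",     # hypothetical training corpus
    model_prefix="toy",    # writes toy.model and toy.vocab
    vocab_size=400,        # toy value; Llama 2 used 32000
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="toy.model")
print(sp.encode("Hello world", out_type=str))  # pieces (strings)
print(sp.encode("Hello world"))                # token ids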


