Let's build the GPT Tokenizer
February 21, 2024
The Tokenizer is a necessary and pervasive component of Large Language Models (LLMs), where it translates between strings and tokens (text chunks). Tokenizers are a completely separate stage of the LLM pipeline: they have their own training sets, training algorithms (Byte Pair Encoding), and after training implement two fundamental functions: encode() from strings to tokens, and decode() back from tokens to strings. In this lecture we build from scratch the Tokenizer used in the GPT series from OpenAI. In the process, we will see that a lot of weird behaviors and problems of LLMs actually trace back to tokenization. We'll go through a number of these issues, discuss why tokenization is at fault, and why someone out there ideally finds a way to delete this stage entirely.
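The training algorithm mentioned above can be sketched in a few lines. This is a minimal illustration of the byte pair encoding idea, not minbpe's actual API; the function names and the toy text are made up for the example.

```python
# Minimal sketch of BPE training on raw UTF-8 bytes.
# Illustrative only: names like get_stats/merge echo the lecture's style
# but are not guaranteed to match minbpe's real interface.

def get_stats(ids):
    """Count occurrences of each consecutive pair of tokens."""
    counts = {}
    for a, b in zip(ids, ids[1:]):
        counts[(a, b)] = counts.get((a, b), 0) + 1
    return counts

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in ids with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

text = "aaabdaaabac"               # toy training text
ids = list(text.encode("utf-8"))   # start from raw bytes (token ids 0..255)

num_merges = 3
merges = {}                        # (pair) -> newly minted token id
for k in range(num_merges):
    stats = get_stats(ids)
    pair = max(stats, key=stats.get)   # most frequent adjacent pair
    new_id = 256 + k                   # fresh id beyond the 256 byte values
    ids = merge(ids, pair, new_id)
    merges[pair] = new_id

print(len(text.encode("utf-8")), "->", len(ids), "tokens")  # 11 -> 5 tokens
```

Each merge shortens the sequence, which is where the compression ratio discussed in the video comes from.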
Chapters:
00:00:00 intro: Tokenization, GPT-2 paper, tokenization-related issues
00:05:50 tokenization by example in a Web UI (tiktokenizer)
00:14:56 strings in Python, Unicode code points
00:18:15 Unicode byte encodings, ASCII, UTF-8, UTF-16, UTF-32
00:22:47 daydreaming: deleting tokenization
00:23:50 Byte Pair Encoding (BPE) algorithm walkthrough
00:27:02 starting the implementation
00:28:35 counting consecutive pairs, finding most common pair
00:30:36 merging the most common pair
00:34:58 training the tokenizer: adding the while loop, compression ratio
00:39:20 tokenizer/LLM diagram: it is a completely separate stage
00:42:47 decoding tokens to strings
00:48:21 encoding strings to tokens
00:57:36 regex patterns to force splits across categories
01:11:38 tiktoken library intro, differences between GPT-2/GPT-4 regex
01:14:59 GPT-2 encoder.py released by OpenAI walkthrough
01:18:26 special tokens, tiktoken handling of, GPT-2/GPT-4 differences
01:25:28 minbpe exercise time! write your own GPT-4 tokenizer
01:28:42 sentencepiece library intro, used to train Llama 2 vocabulary
01:43:27 how to set vocabulary set? revisiting gpt.py transformer
01:48:11 training new tokens, example of prompt compression
01:49:58 multimodal [image, video, audio] tokenization with vector quantization
01:51:41 revisiting and explaining the quirks of LLM tokenization
02:10:20 final recommendations
02:12:50 ??? :)
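The decoding and encoding chapters above (00:42:47 and 00:48:21) boil down to applying a learned merges table in both directions. The sketch below uses a tiny hypothetical `merges` table, not a real GPT vocabulary, and a simplified one-pass-per-merge encoder; the lecture's encoder instead repeatedly picks the lowest-rank mergeable pair, but the result is the same for this toy example.

```python
# Sketch of decode() and encode() for a byte-level BPE tokenizer.
# `merges` is a hypothetical two-entry table: "he" -> 256, "he"+"l" -> 257.
merges = {(104, 101): 256, (256, 108): 257}

# decode: token ids -> string, via a bytes vocabulary built from the merges
vocab = {i: bytes([i]) for i in range(256)}
for (a, b), idx in merges.items():
    vocab[idx] = vocab[a] + vocab[b]

def decode(ids):
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")

# encode: string -> token ids, applying merges in training order
def encode(text):
    ids = list(text.encode("utf-8"))
    for pair, idx in merges.items():  # dicts preserve insertion (training) order
        out, i = [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
                out.append(idx); i += 2
            else:
                out.append(ids[i]); i += 1
        ids = out
    return ids

print(encode("hello"))          # -> [257, 108, 111]
print(decode(encode("hello")))  # roundtrip back to "hello"
```

Note the `errors="replace"` in decode: as covered in the lecture, an arbitrary token sequence is not guaranteed to be valid UTF-8, so a real decoder must handle malformed byte sequences.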
Exercises:
Advised flow: reference this document and try to implement the steps before I give away the partial solutions in the video. If you're getting stuck, the full solutions are in the minbpe code: https://github.com/karpathy/minbpe/bl...
Links:
Google Colab for the video: https://colab.research.google.com/dri...
GitHub repo for the video: minBPE https://github.com/karpathy/minbpe
Playlist of the whole Zero to Hero series so far: • The spelled-out intro to neural networks a...
our Discord channel: / discord
my Twitter: / karpathy
Supplementary links:
tiktokenizer: https://tiktokenizer.vercel.app
tiktoken from OpenAI: https://github.com/openai/tiktoken
sentencepiece from Google: https://github.com/google/sentencepiece