4 Closed-source models
This section briefly explores whether our findings transfer to closed-source models. As our techniques rely on access to the model weights, they are not directly applicable to such models. However, the experience gained from inspecting a large variety of open models provides insight that may carry over. For these tests, we use a custom prompt that asks the model to repeat a string exactly, and check whether it appears incapable of doing so (see Appendix C for details).
4.1 OpenAI GPT-3.5 and GPT-4
By using models that share a tokenizer (cf. section 3.3.7), we already have a list of potential candidates, including _ForCanBeConverted, $PostalCodesNL, useRalative, _typingsJapgolly, and others. We test some of these tokens in prompts and find that all OpenAI models fail to handle many of them correctly, resulting in hallucinated completions, an inability to tell the difference between the input and the incorrect output, or degeneration into repetition.[12]
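As an illustration of this prompt-based verification, the following sketch queries the OpenAI chat completions API and checks whether a candidate string is echoed back verbatim. The prompt wording and model name are illustrative assumptions rather than the exact prompt of Appendix C, and the candidate strings are taken verbatim from the list above (leading-space conventions may need adjusting against the actual tokenizer vocabulary).

```python
# Hedged sketch of prompt-based verification for API-only models; the system
# prompt below is illustrative, not the exact prompt from Appendix C.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Candidate tokens copied from the text above; whitespace handling may need
# adjusting against the tiktoken vocabulary.
CANDIDATES = ["_ForCanBeConverted", "$PostalCodesNL", "useRalative", "_typingsJapgolly"]

def repeats_exactly(text: str, model: str = "gpt-3.5-turbo") -> bool:
    """Ask the model to echo the string verbatim and compare its reply."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": "Repeat the user's message exactly, with no extra text."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content == text

for candidate in CANDIDATES:
    print(repr(candidate), "->", "ok" if repeats_exactly(candidate) else "FAILED")
```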
4.2 Anthropic Claude 2 and 3
Although documentation on tokenization in these models is limited, the Anthropic SDK contains some tokenizer utilities for Claude 2, with remarks that they are not accurate for Claude 3.[13] Using the tokenizer provided for Claude 2, we can identify candidates for merged tokens such as CandidateFaciNum (iCandidateFaciNum), TrileptonPatTuple (TrileptonPatTupleMC), BFrontend (DVBFrontend), and others, where the token in parentheses is the longer vocabulary entry that contains the candidate. Some of these tokens can be confirmed as problematic in Claude 2.1, although none appear effective in the Claude 3 family of models, consistent with the change in tokenizer implied by their SDK code.
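A minimal sketch of this kind of candidate search, assuming a tokenizers-format JSON file for the Claude 2 tokenizer (the filename and length threshold below are placeholders): it flags long vocabulary entries that also occur as a proper prefix or suffix of an even longer entry, which covers examples such as CandidateFaciNum inside iCandidateFaciNum.

```python
# Hedged sketch of candidate generation from a tokenizer file alone: flag long
# vocabulary entries that also occur as a proper prefix or suffix of an even
# longer entry, as such intermediate merges may rarely be emitted on their own.
from tokenizers import Tokenizer

tok = Tokenizer.from_file("claude-v2-tokenizer.json")  # placeholder path for the SDK's bundled tokenizer JSON
token_strings = set(tok.get_vocab())                   # vocabulary as raw token strings

MIN_LENGTH = 10  # arbitrary threshold; short tokens are prefixes/suffixes of many entries

candidates = {}
for longer in token_strings:
    for i in range(1, len(longer)):
        for part in (longer[:i], longer[i:]):
            if len(part) >= MIN_LENGTH and part in token_strings:
                candidates.setdefault(part, set()).add(longer)

# Each candidate is printed with longer tokens that contain it, for manual
# verification by prompting as described in the text.
for part, containers in sorted(candidates.items()):
    print(repr(part), "->", sorted(containers)[:3])
```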
4.3 Mistral Medium and Large
Although tokenizers are available for Mistral’s open models, no tokenizer information is published for their flagship API models. However, due to a confirmed leak of an early version of their ‘medium’ model as ‘miqu’, there is reason to believe the ‘medium’ model is derived from Llama2 70B. By prompting both the ‘medium’ and ‘large’ models, we confirm that the ‘medium’ model is unable to repeat strings that are typically under-trained in Llama2 models, and that the ‘large’ model fails on typical tokens from the ‘small’ and ‘Mixtral’ series. In addition, while experimenting with such prompts, we found that the ‘large’ model occasionally responds with apparent undocumented special tokens, including [TOOL_CALLS] and [control_331], which were recently confirmed to be part of the tokenizer for the 8x22B model.
5 Discussion
Our investigation has shown a wide variety of untrained and under-trained tokens present in tokenizers, and their prevalence differs significantly by model. The presence of under-trained tokens has several negative consequences for language models, including inefficient inference and the potential to bypass guardrails. Even with our relatively strict threshold for verification, we detect such tokens across all tested models, with typically around 0.1–1% of the vocabulary consisting of severely under-trained tokens. The most important factor in a model having many under-trained tokens, aside from simply having a large vocabulary, appears to be whether the tokenizer was trained on data similar to the model’s training data. Models which re-use a large external tokenizer and then train from scratch are among those with the highest number of under-trained tokens.
Analyzing the tokenizer directly can detect several of these tokens without the need for any training, including unreachable tokens which do not encode back to their own representation, and unused byte fallback tokens. This can be particularly useful in quickly catching tokenizer configuration errors, which appear to be especially common when custom vocabulary is manually added. Using the model embedding weights directly is a reliable way to detect tokens which are under-trained, although care should be taken to account for the model architecture (a simple norm-based variant of this check is sketched after the recommendations below). Based on our findings, we can summarize a number of recommendations within the scope of current tooling:
• Ensure input data pre-processing is identical across tokenizer training data, model training data, and model inference. In particular, consider carefully how to handle carriage returns, tab characters, and special tokens present as plain text in training data and user input.
• Ensure the model training data and tokenizer are aligned, especially when training a new base model.
• For single-byte tokens, either include a single copy of all 256 bytes without allowing duplicates in the vocabulary, or exclude the 13 unused bytes 0xC0/0xC1, 0xF5-0xFF. When dynamically excluding extremely rare bytes such as 0xF1, consider including an explicit <token> as a fallback.
• After training a tokenizer, check for unreachable tokens by encoding and decoding the vocabulary to ensure manually added tokens are handled correctly (a round-trip check of this kind is sketched after this list).
• When publishing both ‘fast’ and ‘slow’ versions of a tokenizer on Hugging Face, ensure they give the same outputs, for example by tokenizing the tokenizer vocabulary itself with both versions.
• When training a base model, check for under-trained tokens after smaller test runs and reconsider tokenization methods and data. Running a test on a different corpus can also reveal pre-processing bugs that cause unrepresentative inputs in the main training data.
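As a concrete example of the unreachable-token check referenced above, the following sketch decodes every vocabulary id and re-encodes the result, flagging ids that the tokenizer can never produce from their own decoded text. The model name is only an example; any Hugging Face tokenizer can be substituted, and in practice special tokens and byte-fallback tokens are excluded or handled separately.

```python
# Round-trip check for unreachable tokens: decode each vocabulary id and
# re-encode the result; ids that never reappear are flagged.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # example model

unreachable = []
for token_id in range(tokenizer.vocab_size):
    decoded = tokenizer.decode([token_id])
    reencoded = tokenizer.encode(decoded, add_special_tokens=False)
    if token_id not in reencoded:
        # In practice, filter out special tokens and byte-fallback tokens here.
        unreachable.append((token_id, decoded))

print(f"{len(unreachable)} unreachable tokens, e.g.:", unreachable[:10])
```

A similar loop that runs both the ‘fast’ and ‘slow’ tokenizer classes over the vocabulary and compares their outputs covers the penultimate recommendation.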
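For the embedding-based detection mentioned before the list, a very simple proxy (in the spirit of, but not identical to, the indicators of Section 2.2) is to rank tokens by the norm of their unembedding rows. The model below is an arbitrary example, and architectures with tied embeddings or unusual output heads need extra care.

```python
# Hedged sketch: rank tokens by the l2 norm of their unembedding rows and
# inspect the smallest ones as likely untrained or under-trained tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-160m"  # example; any causal LM with an unembedding matrix
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

with torch.no_grad():
    # Note: models with tied input/output embeddings or normalization folded
    # into the head require more careful handling, as discussed in the paper.
    unembedding = model.get_output_embeddings().weight  # (vocab_size, hidden_size)
    norms = unembedding.norm(dim=-1)

values, indices = torch.sort(norms)
for norm, token_id in zip(values[:20].tolist(), indices[:20].tolist()):
    print(f"{norm:8.4f}  {token_id:6d}  {tokenizer.decode([token_id])!r}")
```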
In addition to providing a set of useful tools for improving models and tokenizers, our work suggests several directions for future research. Firstly, the results from StarCoder2 (section 3.3.8) highlight a potential weakness in BPE training: repeated occurrences within a single document (or even a single sub-collection of documents, such as a repository) can be enough to define a token by themselves. Strategies for preventing this, such as limiting the per-document count of pairs to be merged (sketched below), should be explored. Secondly, one common difference between tokenizers is whether or not they allow partial UTF-8 sequences in tokens other than byte fallback tokens. This trade-off remains particularly under-explored: although allowing such tokens may lead to lower average token counts, it also leads to more untrained ‘fragments’ and tokens which are less semantically meaningful. Finally, we noticed differences between models in how they apply weight decay to tokens not present in the input. This choice may affect how well models remember the meaning of rare tokens and likely mitigates the severity and impact of under-trained tokens. Although this choice has been known to be important in models that predate transformers [31], we are not aware of systematic ablations in recent LLMs.
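To make the first suggestion concrete, the toy sketch below counts BPE merge candidates with a per-document cap, so that a single repetitive file cannot define a token on its own. This is a simplified illustration of the idea, not the training procedure of any particular tokenizer.

```python
# Toy illustration: count adjacent symbol pairs across documents, capping the
# contribution of each document so one repetitive file cannot dominate.
from collections import Counter

def pair_counts(documents, per_document_cap=1):
    """Count adjacent symbol pairs with a per-document cap on each pair."""
    totals = Counter()
    for doc in documents:
        symbols = list(doc)  # real BPE training starts from bytes/characters
        local = Counter(zip(symbols, symbols[1:]))
        for pair, count in local.items():
            totals[pair] += min(count, per_document_cap)
    return totals

corpus = ["normal text here", "normal words", "xyxyxyxyxyxyxyxyxyxyxy"]
# The ('x', 'y') pair is frequent in one document only, so it no longer
# outranks pairs that occur across many documents.
print(pair_counts(corpus).most_common(5))
```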
In conclusion, our findings highlight a range of tokenizer issues, and the severity of these varies across different models. By analyzing tokenizers and model embeddings, we can identify under-trained tokens and improve the efficiency and security of LLMs.
Acknowledgments
We thank Dirk Groeneveld, Luca Soldaini and Nathan Lambert of the Allen Institute for AI for helpful discussions and data on weight decay, tokens trained on, and tokenization in the OLMo models, and Stella Biderman of EleutherAI for information on weight decay and tokenization in the Pythia/GPT-NeoX models. We also thank Matthias Gallé and Phil Blunsom for valuable feedback.
References
[1] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners, 2019.
[2] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units, 2016.
[3] J. Pourmostafa Roshan Sharami, D. Shterionov, and P. Spronck. A systematic analysis of vocabulary and BPE settings for optimal fine-tuning of NMT: A case study of in-domain translation, 2023.
[4] Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, and Colin Raffel. ByT5: Towards a token-free future with pre-trained byte-to-byte models, 2022.
[5] Lili Yu, Dániel Simig, Colin Flaherty, Armen Aghajanyan, Luke Zettlemoyer, and Mike Lewis. Megabyte: Predicting million-byte sequences with multiscale transformers, 2023.
[6] Kevin Slagle. SpaceByte: Towards deleting tokenization from large language modeling, 2024.
[7] Andrej Karpathy. Let’s build the GPT Tokenizer. https://www.youtube.com/watch?v=zduSFxRajkE, 2024.
[8] Jonas Geiping, Alex Stein, Manli Shu, Khalid Saifullah, Yuxin Wen, and Tom Goldstein. Coercing LLMs to do and reveal (almost) anything, 2024.
[9] Jessica Rumbelow and Matthew Watkins. SolidGoldMagikarp (plus, prompt generation). https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation, 2023.
[10] Matthew Watkins and Jessica Rumbelow. SolidGoldMagikarp III: Glitch token archaeology. https://www.lesswrong.com/posts/8viQEp8KBg2QSW4Yc/solidgoldmagikarp-iii-glitch-token-archaeology, 2023.
[11] Martin Fell. A search for more ChatGPT / GPT-3.5 / GPT-4 “unspeakable” glitch tokens. https://www.lesswrong.com/posts/kmWrwtGE9B9hpbgRT/a-search-for-more-chatgpt-gpt-3-5-gpt-4-unspeakable-glitch, 2023.
[12] Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib, Nathan Sarrazin, Omar Sanseviero, Alexander M. Rush, and Thomas Wolf. Zephyr: Direct distillation of LM alignment, 2023.
[13] Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Raghavi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, Tushar Khot, William Merrill, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Valentina Pyatkin, Abhilasha Ravichander, Dustin Schwenk, Saurabh Shah, Will Smith, Emma Strubell, Nishant Subramani, Mitchell Wortsman, Pradeep Dasigi, Nathan Lambert, Kyle Richardson, Luke Zettlemoyer, Jesse Dodge, Kyle Lo, Luca Soldaini, Noah A. Smith, and Hannaneh Hajishirzi. Olmo: Accelerating the science of language models, 2024.
[14] Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, and Samuel Weinbach. GPT-NeoX-20B: An open-source autoregressive language model, 2022.
[15] Microsoft. Phi-2: The surprising power of small language models. https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/, 2023.
[16] Ben Wang and Aran Komatsuzaki. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax, May 2021.
[17] Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar van der Wal. Pythia: A suite for analyzing large language models across training and scaling, 2023.
[18] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models, 2023.
[19] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b, 2023.
[20] Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mixtral of experts, 2024.
[21] Dahyun Kim, Chanjun Park, Sanghoon Kim, Wonsung Lee, Wonho Song, Yunsu Kim, Hyeonwoo Kim, Yungi Kim, Hyeonju Lee, Jihoo Kim, Changbae Ahn, Seonghoon Yang, Sukyung Lee, Hyunbyung Park, Gyoungjin Gim, Mikyoung Cha, Hwalsuk Lee, and Sunghun Kim. SOLAR 10.7B: Scaling large language models with simple yet effective depth up-scaling, 2024.
[22] Rakuten Group, Aaron Levine, Connie Huang, Chenguang Wang, Eduardo Batista, Ewa Szymanska, Hongyi Ding, Hou Wei Chou, Jean-François Pessiot, Johanes Effendi, Justin Chiu, Kai Torben Ohlhus, Karan Chopra, Keiji Shinzato, Koji Murakami, Lee Xiong, Lei Chen, Maki Kubota, Maksim Tkachenko, Miroku Lee, Naoki Takahashi, Prathyusha Jwalapuram, Ryutaro Tatsushima, Saurabh Jain, Sunil Kumar Yadav, Ting Cai, Wei-Te Chen, Yandi Xia, Yuki Nakayama, and Yutaka Higashiyama. Rakutenai7b: Extending large language models for Japanese, 2024.
[23] Cohere. Cohere Command R. https://docs.cohere.com/docs/command-r.
[24] Unicode® technical standard #51: Unicode emoji. https://unicode.org/reports/tr51/, 2023.
[25] OpenAI. tiktoken: a fast BPE tokeniser for use with OpenAI’s models. https://github.com/openai/tiktoken, Accessed April 2024.
[26] Marco Bellagente, Jonathan Tow, Dakota Mahan, Duy Phung, Maksym Zhuravinskyi, Reshinth Adithyan, James Baicoianu, Ben Brooks, Nathan Cooper, Ashish Datta, Meng Lee, Emad Mostaque, Michael Pieler, Nikhil Pinnaparju, Paulo Rocha, Harry Saini, Hannah Teufel, Niccolo Zanichelli, and Carlos Riquelme. Stable lm 2 1.6b technical report, 2024.
[27] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
[28] Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, WenDing Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, Evgenii Zheltonozhskii, Nii Osae Osae Dade, Wenhao Yu, Lucas Krauß, Naman Jain, Yixuan Su, Xuanli He, Manan Dey, Edoardo Abati, Yekun Chai, Niklas Muennighoff, Xiangru Tang, Muhtasham Oblokulov, Christopher Akiki, Marc Marone, Chenghao Mou, Mayank Mishra, Alex Gu, Binyuan Hui, Tri Dao, Armel Zebaze, Olivier Dehaene, Nicolas Patry, Canwen Xu, Julian McAuley, Han Hu, Torsten Scholak, Sebastien Paquet, Jennifer Robinson, Carolyn Jane Anderson, Nicolas Chapados, Mostofa Patwary, Nima Tajbakhsh, Yacine Jernite, Carlos Muñoz Ferrandis, Lingming Zhang, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries. Starcoder 2 and the stack v2: The next generation, 2024.
[29] 01.AI: Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, Kaidong Yu, Peng Liu, Qiang Liu, Shawn Yue, Senbin Yang, Shiming Yang, Tao Yu, Wen Xie, Wenhao Huang, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Pengcheng Nie, Yuchi Xu, Yudong Liu, Yue Wang, Yuxuan Cai, Zhenyu Gu, Zhiyuan Liu, and Zonghong Dai. Yi: Open foundation models by 01.AI, 2024.
[30] Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, Omri Abend, Raz Alon, Tomer Asida, Amir Bergman, Roman Glozman, Michael Gokhman, Avashalom Manevich, Nir Ratner, Noam Rozen, Erez Shwartz, Mor Zusman, and Yoav Shoham. Jamba: A hybrid transformer-mamba language model, 2024.
[31] Suvash Sedhain, Aditya Krishna Menon, Scott Sanner, and Lexing Xie. AutoRec: Autoencoders meet collaborative filtering. In Proceedings of the 24th International Conference on World Wide Web, pages 111–112. ACM, 2015.
[32] The Unicode® Standard, Version 15.0 – Core Specification. https://www.unicode.org/versions/Unicode15.0.0/ch02.pdf, 2023.
Authors:
(1) Sander Land, Cohere (sander@cohere.com);
(2) Max Bartolo, Cohere (max@cohere.com).
[12] The same technique also confirms that the currently undocumented ‘gpt2-chatbot’ model on the LMSys Arena uses a related tokenizer.
[13] https://github.com/anthropics/anthropic-sdk-python/blob/8e3d8a68d309424238ae54e03ee962f7147cfc60/src/anthropic/_client.py#L276