andy's blog

optimizing multilingual compression via biased tokenization

I stumbled upon this paper by Cohere Labs: One Tokenizer to Rule them All: Emergent Language Plasticity via Multilingual Tokenizers.

this is an interesting paper, as tokenization is one of the biggest bottlenecks in the pre-training of LMs, especially in multilingual settings.

they propose a novel approach: building a universal tokenizer that, in multilingual settings, outperforms tokenizers specialized for a single language.



a brief note on the LM training setup from the paper:

"for our experiments, as standard for most LLMs, we use the Transformer-based decoder-only architecture [Vaswani et al., 2017; Radford & Narasimhan, 2018]. Our architecture includes key optimizations such as Parallel Attention Blocks [Chowdhery et al., 2023], Grouped Query Attention [Ainslie et al., 2023], SwiGLU activation function [Shazeer, 2020], and Rotary Positional Embeddings [Su et al., 2024]."



the authors train their tokenizers, which are central to their investigation of language plasticity, using a particularly interesting and principled approach:

  1. Byte Pair Encoding (BPE) for training all the tokenizers across the language distributions; by using BPE consistently, the authors ensure that the core subword learning mechanism is uniform across all their tokenizer variants, allowing for a fair comparison.

  2. balanced language weighting with data distribution and language buckets:

$$w_i = \frac{w_i^d \cdot w_i^b}{\sum_j w_j^d \cdot w_j^b}$$

this scheme (formula above) combines the natural data distribution weight (w_i^d) for each language with a "language bucket" weight (w_i^b), where buckets group languages that share linguistic family and script (Section 2.3). within each bucket, languages are uniformly weighted.
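to make the weighting concrete, here's a tiny toy calculation of w_i; the language list and the example values of w_i^d and w_i^b are made up for illustration and are not the paper's actual weights:

```python
# toy illustration of the weighting formula above; the languages and the
# example values of w_i^d (data share) and w_i^b (bucket weight) are made up
data_weight   = {"en": 0.50, "es": 0.30, "hi": 0.15, "ja": 0.05}  # w_i^d
bucket_weight = {"en": 0.50, "es": 0.50, "hi": 1.00, "ja": 1.00}  # w_i^b
# (en and es share a "latin" bucket here, so they split that bucket's weight)

unnormalized = {l: data_weight[l] * bucket_weight[l] for l in data_weight}
total = sum(unnormalized.values())
weights = {l: round(w / total, 3) for l, w in unnormalized.items()}
print(weights)  # final sampling weights w_i, normalized to sum to 1
```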



the universal tokenizer performs exceptionally well due to its intrinsic efficiency and broad language representation established during pre-training. this design enables "emergent language plasticity": the model's capability to quickly adapt to new languages.

the authors' findings suggest this is because the tokenizer, trained with balanced language weighting, learns highly efficient sub-word units that result in lower compression ratios (fewer tokens per byte) across diverse languages.

crucially, the paper demonstrates that this proactive pre-training approach is superior to reactive Cross-Lingual Vocabulary Adaptation (CVA) [Yamaguchi et al., 2024] methods applied after initial model training (Section 5.1).

while the paper's universal tokenizer is broadly trained, how much of its success is due to its sheer coverage versus its ability to specifically capture shared linguistic patterns?

this made me wonder: what if we could further optimize this by actively guiding the tokenizer to discover and prioritize sub-word units that serve as highly efficient, shared anchors across a very diverse language set?

if we explicitly bias a tokenizer to prioritize sub-word units common across multiple languages, can we achieve an even sharper improvement in compression, thereby validating the existence and utility of these 'meta-tokens' for multilingual LLMs?

if all of this holds, it could lead to the conclusion that biasing the tokenizer, without adding much noise, enables faster learning on target languages during language model pre-training.

this experiment directly tests if such a biased tokenizer can surpass current compression benchmarks, especially for bridging vastly different scripts.1



building the "shared-bias" tokenizer

the idea is to force the tokenizer to learn common words / sub-words from the multilingual corpora: the simple hack i came up with was to find common n-grams (n=4 or 5) subject to a min_frequency criterion, and during training force the BPE algorithm to explicitly learn these merges.2

and when BPE trains on this combined (and ideally shuffled) corpus, the immense frequency of common_ngrams ensures they will be among the very first and highest-priority pairs to be merged, effectively "forcing" them into the vocabulary as single units.

the methodology:

  1. identifying "shared n-grams": i first identified common character n-grams (e.g., n=3 and 4) that appeared frequently across my diverse language corpus. these n-grams serve as my hypothesized "meta-tokens" for cross-lingual transfer.

  2. balanced corpus preparation: the corpus was prepared by combining 4 languages (english, hindi, spanish and japanese), with roughly 20k words per language.3

  3. injecting bias into BPE (frequency manipulation): to ensure BPE learned these "shared n-grams" effectively, i used a frequency manipulation strategy; this involved creating a seed bias file in which the identified shared n-grams were massively repeated. by combining this seed file with the balanced language corpus for training, the BPE algorithm was robustly biased towards forming merges corresponding to these desired cross-lingual units (a rough sketch of this pipeline follows the list).
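to make the bias-injection recipe concrete, here's a minimal sketch using the Hugging Face tokenizers library. the helper names (find_shared_ngrams, build_biased_corpus), the thresholds, and the per-language file layout are my own illustrative assumptions, not the paper's method or my exact experiment code:

```python
# minimal sketch: mine character n-grams shared across languages, then
# oversample them in a "seed bias" block so BPE merges them early.
# helper names, thresholds, and file names are illustrative assumptions.
from collections import Counter
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def find_shared_ngrams(corpora, n=4, min_frequency=50):
    """Character n-grams (no whitespace) occurring >= min_frequency times in every language."""
    frequent_per_lang = []
    for text in corpora.values():
        counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
        frequent_per_lang.append({g for g, c in counts.items()
                                  if c >= min_frequency and not any(ch.isspace() for ch in g)})
    return set.intersection(*frequent_per_lang)

def build_biased_corpus(corpora, shared_ngrams, seed_repeats=1000):
    """Prepend a seed block where the shared n-grams are massively repeated."""
    seed_block = (" ".join(sorted(shared_ngrams)) + "\n") * seed_repeats
    return seed_block + "\n".join(corpora.values())

# hypothetical file layout: one plain-text file per language
corpora = {lang: open(f"{lang}.txt", encoding="utf-8").read()
           for lang in ("en", "es", "hi", "ja")}
shared = find_shared_ngrams(corpora, n=4, min_frequency=50)
biased_text = build_biased_corpus(corpora, shared)

tokenizer = Tokenizer(models.BPE(unk_token="<UNK>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(
    vocab_size=10000,
    special_tokens=["<PAD>", "<UNK>", "<BOS>", "<EOS>"],
)
tokenizer.train_from_iterator(biased_text.splitlines(), trainer)
tokenizer.save("shared_bias_tokenizer.json")
```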



training details

outlining the specific parameters and configurations used for training the various tokenizers and models in this experiment:

  1. language corpus: the languages used for corpus preparation were English, Spanish, Hindi and Japanese, each with roughly 20k words, for a file size of ~2 MB.

  2. tokenizer training details: all tokenizers (variants of Universal and shared-bias) were trained using the Byte Pair Encoding (BPE) algorithm, as implemented in the Hugging Face tokenizers library, with a vocab_size of 10000. the pre-tokenizers used across experiments were the built-in Whitespace and ByteLevel4; Japanese text was pre-segmented into words/morphemes prior to training to ensure compatibility with the Whitespace pre-tokenizer. special tokens: <PAD>, <UNK>, <BOS>, <EOS>.

  3. specific tokenizer variants used for benchmarking: a universal tokenizer simulating the paper's UNIVERSAL approach, and the shared-bias tokenizer (my proposal).

  4. mini fine-tuning tests for plasticity, to test the paper's claims of superior performance for pre-training vs post-training methods (CVA)5: the pre-trained LLM used for fine-tuning was EleutherAI/gpt-neo-125M from Hugging Face Transformers, trained on an instruction-following task over the dataset projectbaraat/hindi-instruct-dataset-v0.1 (a rough sketch of this setup follows below).
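for completeness, a hedged sketch of what the mini fine-tuning run looks like with the Transformers Trainer; the dataset column name ("text"), sequence length, and hyperparameters are assumptions for illustration, not the exact configuration used:

```python
# hedged sketch of the mini fine-tuning test on gpt-neo-125M;
# the "text" column name and the hyperparameters are illustrative assumptions
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "EleutherAI/gpt-neo-125M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # gpt-neo has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

dataset = load_dataset("projectbaraat/hindi-instruct-dataset-v0.1")

def tokenize(batch):
    # assumes an instruction-formatted "text" field; adjust to the dataset's schema
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_set = dataset["train"].map(tokenize, batched=True,
                                 remove_columns=dataset["train"].column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt-neo-hindi-instruct",
                           num_train_epochs=1,
                           per_device_train_batch_size=4,
                           logging_steps=50),
    train_dataset=train_set,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```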



[fine-tuning run details: "best run" and "all-runs" plots]



key findings

  1. compression ratio results: the Shared-Bias Tokenizer consistently achieves lower (better) compression ratios across the diverse language set (English, Spanish, Hindi, Japanese) compared to the baseline Universal tokenizer; quantitatively, the improvement was in the range of approximately 0.02-0.05 tokens per byte. this result directly validates the hypothesis that explicitly biasing the Byte Pair Encoding (BPE) algorithm towards learning common, cross-lingual "meta-tokens" can lead to a more efficient representation of multilingual text (see bar chart below).



[bar chart: compression rates (lower is better)]

there has been past research indicating that better (lower) compression ratios in tokenizers lead to better pre-training of LMs, and in turn better plasticity.
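for reference, a minimal sketch of how the tokens-per-byte compression ratio can be computed to compare two trained tokenizers; the tokenizer and held-out file paths are placeholders:

```python
# compute compression ratio as tokens per UTF-8 byte (lower is better);
# tokenizer and held-out file paths below are placeholders
from tokenizers import Tokenizer

def compression_ratio(tokenizer, text):
    n_tokens = len(tokenizer.encode(text).ids)
    n_bytes = len(text.encode("utf-8"))
    return n_tokens / n_bytes

universal = Tokenizer.from_file("universal_tokenizer.json")
shared_bias = Tokenizer.from_file("shared_bias_tokenizer.json")

for lang in ("en", "es", "hi", "ja"):
    text = open(f"{lang}_heldout.txt", encoding="utf-8").read()
    print(lang,
          round(compression_ratio(universal, text), 4),
          round(compression_ratio(shared_bias, text), 4))
```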



  2. mini fine-tuning test results, a bottleneck in post-training adaptation: while the shared-bias tokenizer showed intrinsic efficiency, the mini fine-tuning test on a small pre-trained language model (EleutherAI/gpt-neo-125M) did not yield significant improvements in emergent plasticity or downstream performance. this outcome, though not a direct "win" for plasticity in this specific test, serves as a crucial learning point that validates a core argument of the "One Tokenizer To Rule Them All" paper: as the authors emphasize, "it is more effective to use a UNIVERSAL tokenizer from the start, rather than substituting it in after pretraining" (Section 5.1 of the paper)5.

the results suggest that the limited parameter count of the small base model (125M) likely acted as a significant bottleneck. such a model, already pre-trained with its own tokenizer, likely lacks the capacity to fully leverage the improved efficiency of a newly introduced tokenizer for complex adaptation tasks in a single epoch (or possibly more) of post-training fine-tuning.



a more robust test would involve pre-training a small model (like nanoGPT) from scratch with the shared-bias tokenizer, as this would align with the authors' emphasis on "interventions made early on" (pre-training effectiveness over post-training).



conclusion and future work: the promise of pre-training-time tokenizer design

this small-scale experiment/research, despite its limitations, supports the paper's core message: investing in thoughtful, "universal" tokenizer design during pre-training (or from the start) is not merely beneficial, but crucial for cultivating true multilingual language plasticity.

this work specifically highlights that explicitly biasing tokenizer training towards learning cross-lingual "meta-tokens" yields a tangible and measurable benefit: improved compression ratios. this intrinsic efficiency validates the hypothesis that specific shared subword units are vital for representing diverse languages more compactly.

looking ahead, the exciting potential lies in fully exploring these ideas by pre-training language models from scratch with such optimized, shared-bias tokenizers. this next step, aligning with the paper's methodology and robust scaling laws in LLM research, would rigorously test how a foundationally superior tokenizer influences the emergent language plasticity of larger models.

future work/research could further refine the data-driven discovery of "meta-tokens" and explore their impact across an even broader spectrum of linguistic tasks and low-resource settings, ultimately paving the way for more efficient, inclusive, and globally accessible LLMs.

the complete code and plots for this small-scale experiment are openly available and can be found here.






other core extension ideas:

  1. can the tokenization algorithm (BPE in the official paper) be swapped for alternatives (SentencePiece, SCRIPT-BPE, etc.)? the core question is whether the choice of algorithm interacts with the "universal" strategy: could a different algorithm yield even better plasticity or compression ratios, especially for certain language families or scripts?

  2. dynamic or adaptive language weighting: the authors use a fixed language weighting scheme based on buckets and data availability. could the weighting adjust based on the model's performance on different languages (e.g., increasing the weight for languages where the model is struggling)? could it adapt based on the semantic similarity or typological features of languages rather than just pre-defined buckets?

  3. what specific subword units are learned by the UNIVERSAL tokenizer that contribute to better cross-lingual transfer? how do the embeddings of shared and unshared tokens evolve differently during training with a UNIVERSAL vs. CLUSTER tokenizer? can we identify "meta-tokens" or abstract subword units that are truly universal across diverse languages?

  4. instead of relying on pre-defined "language buckets" based solely on script and family, we could explore a data-driven approach to discovering optimal language groupings for tokenizer training by analyzing language embedding spaces. one could apply clustering algorithms to identify natural language groupings based on their representational similarity (see the sketch after this list). this would directly inform a more intelligent, potentially dynamic, weighting and vocabulary selection strategy for "universal tokenizers".

  5. a compelling area for further research, building on the paper's use of a universal regex for pre-tokenization, would be to explore how different pre-tokenization schemes (e.g., byte-level, whitespace, or more advanced regexes) influence these plasticity gains, particularly for non-whitespace languages like CJK.
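as a rough illustration of idea 4 above, data-driven bucket discovery could look like k-means over per-language embeddings; the languages, the random placeholder vectors (standing in for real language representations), and the cluster count are all assumptions:

```python
# hedged sketch of extension idea 4: discovering "language buckets" by
# clustering language-level embeddings; random vectors are placeholders
# for real representations (e.g., mean-pooled multilingual embeddings)
import numpy as np
from sklearn.cluster import KMeans

languages = ["en", "es", "hi", "ja", "de", "ko"]
rng = np.random.default_rng(0)
lang_embeddings = rng.normal(size=(len(languages), 64))  # placeholder vectors

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(lang_embeddings)
buckets = {}
for lang, label in zip(languages, kmeans.labels_):
    buckets.setdefault(int(label), []).append(lang)
print(buckets)  # data-driven "language buckets" instead of script/family groups
```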






footnotes

  1. This experiment is conducted at a small scale. While the full impact of a "Shared-Bias Tokenizer" in large-scale LLM pre-training remains to be rigorously tested, the observed trends (e.g., improved compression) strongly suggest this approach would be effective in scaled settings, aligning with the paper's findings on universal tokenizers and established scaling laws in LLMs.

  2. There are two other strategies that i thought would work but didn't get fruitful results from: one is to post-seed n-grams, where we first train the tokenizer on the combined corpus and then add the shared n-grams to the tokenizer's token list; there is still room for experimentation with this one (maybe we need to select only the top-k most frequent shared n-grams, because if the desired vocab size is not large enough, the merges might not be learned at all, or might even incorporate noise). the second strategy i tried was a custom pre-tokenizer that "injects" bias by isolating the shared n-grams before BPE training.

  3. The tokenizer is actually trained on an oversampled version of this corpus, where the under-resourced languages (Hindi and Japanese) are oversampled and then merged into the original corpus via random shuffling.

  4. Although this experiment was also run with the ByteLevel pre-tokenization scheme in Hugging Face tokenizers, most of the lower compression rates were achieved using the Whitespace pre-tokenization scheme, and the plots shown in this write-up use that setting.

  5. The authors underscore a pivotal insight: the most effective strategy for cultivating language plasticity in LLMs involves implementing a universal tokenizer from the very beginning of pre-training. their extensive experiments (Section 5.1) reveal that this proactive approach significantly outperforms reactive, post-training methods like Cross-Lingual Vocabulary Adaptation (CVA)[Yamaguchi et al., 2024], where an existing tokenizer's vocabulary is modified after the model has already been trained. This finding highlights that foundational interventions in tokenizer design during the initial pre-training phase yield superior and more robust multilingual capabilities than attempts to adapt the tokenizer later.