The Byte Pair Encoding Strategy

Byte Pair Encoding (BPE) is the tokenization strategy used in most modern LLMs. Let's walk through an example:
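As a minimal sketch of the first iteration (the toy corpus and token names here are illustrative, not taken from the original example), we start from character-level tokens and count how often each adjacent pair occurs:

```python
from collections import Counter

# Toy corpus, tokenized at the character level.
# Word-end markers are omitted to keep the sketch minimal.
words = [list("low"), list("lower"), list("lowest")]

# Count how often each adjacent pair of tokens occurs.
pair_counts = Counter()
for word in words:
    for a, b in zip(word, word[1:]):
        pair_counts[(a, b)] += 1

# The most frequent pair becomes the first merge rule.
best = max(pair_counts, key=pair_counts.get)
print(best)  # ('l', 'o') — it appears in all three words
```

Here `('l', 'o')` wins because it occurs three times, so the first merge fuses it into a new token `lo`.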

Let’s look at the second iteration:
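Continuing the same illustrative corpus (the words and merge order are my own example, not the original figure's): after applying the first merge, we recount pairs over the updated sequences and pick the new winner:

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent token pairs across all words."""
    counts = Counter()
    for word in words:
        for pair in zip(word, word[1:]):
            counts[pair] += 1
    return counts

def merge(words, pair):
    """Replace every occurrence of `pair` with one fused token."""
    merged = []
    for word in words:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(out)
    return merged

words = [list("low"), list("lower"), list("lowest")]
words = merge(words, ('l', 'o'))  # first iteration produced the token 'lo'

counts = count_pairs(words)
second = max(counts, key=counts.get)
print(second)  # ('lo', 'w') — now the most frequent pair
```

Note that the newly created token `lo` immediately participates in pair counting, which is how BPE builds longer and longer subwords.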

We can iterate this process as many times as we need:
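The whole procedure can be sketched as a loop that repeats count-and-merge until a chosen number of merges is reached (the corpus and the merge budget here are illustrative):

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent token pairs across all words."""
    counts = Counter()
    for word in words:
        for pair in zip(word, word[1:]):
            counts[pair] += 1
    return counts

def merge(words, pair):
    """Replace every occurrence of `pair` with one fused token."""
    merged = []
    for word in words:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(out)
    return merged

words = [list("low"), list("lower"), list("lowest")]
merges = []
num_merges = 4  # hyperparameter: how many merge rules to learn

for _ in range(num_merges):
    counts = count_pairs(words)
    if not counts:
        break  # nothing left to merge
    best = max(counts, key=counts.get)
    merges.append(best)
    words = merge(words, best)

print(merges)
```

The learned merge rules (plus the base characters) form the final vocabulary; the size of that vocabulary is controlled by `num_merges`.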

To summarize:

  1. Start with Character-Level Tokenization
  2. Count Pair Frequencies
  3. Merge the Most Frequent Pair
  4. Repeat the Process
  5. Finalize the Vocabulary
  6. Tokenization of New Text

To tokenize new text with the resulting vocabulary, we greedily match the longest substring of each word of the input sentence against the vocabulary:
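A greedy longest-match lookup can be sketched as follows (the vocabulary here is a hand-picked illustration):

```python
def tokenize(word, vocab):
    """Greedily take the longest vocabulary entry at each position."""
    tokens, i = [], 0
    while i < len(word):
        # Try the longest possible substring first, shrinking until a match.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # Fall back to a single character (or an unknown-token marker).
            tokens.append(word[i])
            i += 1
    return tokens

vocab = {"low", "e", "r", "s", "t"}
print(tokenize("lowest", vocab))  # ['low', 'e', 's', 't']
print(tokenize("lower", vocab))   # ['low', 'e', 'r']
```

For each starting position the inner loop scans up to N candidate substrings, which is where the quadratic factor in the complexity below comes from.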

If N is the average number of characters per word and M is the average number of words in an input sequence, the time complexity of the tokenization process is O(M·N²).
