Tokenization

Notes on tokenization: how raw text is split into tokens and how tokens are mapped to numeric indices.

Introduction

Tokenization — splitting text into smaller units (tokens).
Tokenizer — the component that performs this splitting.

  • Character Tokenization
  • Word Tokenization

Character Tokenization

1. Character Tokenization

string = "How are you?"
tokenized_str = list(string)
print(tokenized_str)
  • Explanation:

    • The string "How are you?" is converted into a list of characters.
    • list(string) splits the string into individual characters.
  • Output:

    ['H', 'o', 'w', ' ', 'a', 'r', 'e', ' ', 'y', 'o', 'u', '?']
    

2. Numericalization

Step 1: Remove Duplicates and Sort Characters

unique_chars = sorted(set(tokenized_str))
  • Explanation:

    • set(tokenized_str) removes duplicate characters, keeping only unique ones.
    • sorted() sorts these unique characters in ascending order (based on ASCII values).
  • Example Output:

    [' ', '?', 'H', 'a', 'e', 'o', 'r', 'u', 'w', 'y']
    

Step 2: Assign a Unique Index to Each Character

token2idx = {}
for idx, ch in enumerate(unique_chars):
    token2idx[ch] = idx
  • Explanation:

    • enumerate(unique_chars) assigns an index to each character.
    • A dictionary token2idx is created where each character is a key, and its index is the corresponding value.
  • Example Output:

    {' ': 0, '?': 1, 'H': 2, 'a': 3, 'e': 4, 'o': 5, 'r': 6, 'u': 7, 'w': 8, 'y': 9}
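The loop above can also be written as a dictionary comprehension, which is the more idiomatic Python form for building such a mapping:

```python
# Equivalent to the for-loop above: build the char -> index mapping in one line
unique_chars = sorted(set(list("How are you?")))
token2idx = {ch: idx for idx, ch in enumerate(unique_chars)}
print(token2idx)
# {' ': 0, '?': 1, 'H': 2, 'a': 3, 'e': 4, 'o': 5, 'r': 6, 'u': 7, 'w': 8, 'y': 9}
```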
    

3. Mapping Characters to Indices

input_ids = [token2idx[token] for token in tokenized_str]
print(input_ids)
  • Explanation:

    • This step maps each character in the original string to its corresponding index from the token2idx dictionary.
    • input_ids will be a list of integers representing the original string.
  • Example Output:

    [2, 5, 8, 0, 3, 6, 4, 0, 9, 5, 7, 1]
    

Summary:

  • Character Tokenization: Converts the string into a list of individual characters.
  • Numericalization:
    1. Creates a sorted list of unique characters.
    2. Assigns a unique index to each character.
  • Mapping: Converts the original string into a list of indices based on the tokenization and numericalization process.
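The three steps above can be combined into one small function. This is a minimal sketch; the name char_tokenize is an illustrative choice, not a library API:

```python
# End-to-end character-level pipeline: tokenize, numericalize, map to indices.
def char_tokenize(string):
    tokens = list(string)                              # 1. character tokenization
    token2idx = {ch: i for i, ch in enumerate(sorted(set(tokens)))}  # 2. numericalization
    return [token2idx[t] for t in tokens]              # 3. mapping to indices

print(char_tokenize("How are you?"))
# [2, 5, 8, 0, 3, 6, 4, 0, 9, 5, 7, 1]
```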

Word Tokenization

1. Word Tokenization

# Word tokenization

string = "How are you?"
tokenized_str = string.split()
print(tokenized_str)
  • Explanation:

    • string.split() splits the string on whitespace, producing a list of words.
  • Output:

    ['How', 'are', 'you?']

2. Numericalization

The same as in character tokenization: build a sorted list of unique words and assign each a unique index in a token2idx dictionary.

3. Mapping Words to Indices

The same as before: each word in tokenized_str is looked up in token2idx to produce input_ids.
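Putting the word-level steps together (a minimal sketch reusing the same variable names as the character example):

```python
# Word-level tokenization, numericalization, and mapping in one pass.
string = "How are you?"
tokenized_str = string.split()                 # ['How', 'are', 'you?']
token2idx = {w: i for i, w in enumerate(sorted(set(tokenized_str)))}
input_ids = [token2idx[w] for w in tokenized_str]
print(token2idx)   # {'How': 0, 'are': 1, 'you?': 2}
print(input_ids)   # [0, 1, 2]
```

Note that the punctuation stays attached to the word ("you?" is a single token), which is one reason plain whitespace splitting is rarely used as-is.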

Next - Subword tokenization
