Tokenization

Notes on tokenization: how raw text is split into tokens and how tokens are mapped to numeric indices.

Introduction

Tokenization — splitting text into smaller units (tokens).
Tokenizer — the component that performs this splitting.

  • Character Tokenization
  • Word Tokenization

Character Tokenization

1. Character Tokenization

string = "How are you?"
tokenized_str = list(string)
print(tokenized_str)
  • Explanation:

    • The string "How are you?" is converted into a list of characters.
    • list(string) splits the string into individual characters.
  • Output:

    ['H', 'o', 'w', ' ', 'a', 'r', 'e', ' ', 'y', 'o', 'u', '?']
    

2. Numericalization

Step 1: Remove Duplicates and Sort Characters

unique_chars = sorted(set(tokenized_str))
  • Explanation:

    • set(tokenized_str) removes duplicate characters, keeping only unique ones.
    • sorted() sorts these unique characters in ascending order (based on ASCII values).
  • Example Output:

    [' ', '?', 'H', 'a', 'e', 'o', 'r', 'u', 'w', 'y']
    

Step 2: Assign a Unique Index to Each Character

token2idx = {}
for idx, ch in enumerate(unique_chars):
    token2idx[ch] = idx
  • Explanation:

    • enumerate(unique_chars) assigns an index to each character.
    • A dictionary token2idx is created where each character is a key, and its index is the corresponding value.
  • Example Output:

    {' ': 0, '?': 1, 'H': 2, 'a': 3, 'e': 4, 'o': 5, 'r': 6, 'u': 7, 'w': 8, 'y': 9}
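The loop above can also be written as a dictionary comprehension, which is the more idiomatic Python form for building such a mapping:

```python
# Equivalent to the for-loop above: build the char -> index mapping in one line
unique_chars = sorted(set(list("How are you?")))
token2idx = {ch: idx for idx, ch in enumerate(unique_chars)}
print(token2idx)
# {' ': 0, '?': 1, 'H': 2, 'a': 3, 'e': 4, 'o': 5, 'r': 6, 'u': 7, 'w': 8, 'y': 9}
```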
    

3. Mapping Characters to Indices

input_ids = [token2idx[token] for token in tokenized_str]
print(input_ids)
  • Explanation:

    • This step maps each character in the original string to its corresponding index from the token2idx dictionary.
    • input_ids will be a list of integers representing the original string.
  • Example Output:

    [2, 5, 8, 0, 3, 6, 4, 0, 9, 5, 7, 1]
    

Summary:

  • Character Tokenization: Converts the string into a list of individual characters.
  • Numericalization:
    1. Creates a sorted list of unique characters.
    2. Assigns a unique index to each character.
  • Mapping: Converts the original string into a list of indices based on the tokenization and numericalization process.
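The three steps above can be combined into one small function. This is a minimal sketch; the name char_tokenize is an illustrative choice, not a library API:

```python
# End-to-end character-level pipeline: tokenize, numericalize, map to indices.
def char_tokenize(string):
    tokens = list(string)                              # 1. character tokenization
    token2idx = {ch: i for i, ch in enumerate(sorted(set(tokens)))}  # 2. numericalization
    return [token2idx[t] for t in tokens]              # 3. mapping to indices

print(char_tokenize("How are you?"))
# [2, 5, 8, 0, 3, 6, 4, 0, 9, 5, 7, 1]
```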

Word Tokenization

1. Word Tokenization

# Word tokenization

string = "How are you?"
tokenized_str = string.split()
print(tokenized_str)
  • Explanation:

    • string.split() splits the string on whitespace, producing a list of words.
  • Output:

    ['How', 'are', 'you?']

2. Numericalization

The same as in character tokenization: build a sorted list of unique words and assign each a unique index in a token2idx dictionary.

3. Mapping Words to Indices

The same as before: each word in tokenized_str is looked up in token2idx to produce input_ids.
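Putting the word-level steps together (a minimal sketch reusing the same variable names as the character example):

```python
# Word-level tokenization, numericalization, and mapping in one pass.
string = "How are you?"
tokenized_str = string.split()                 # ['How', 'are', 'you?']
token2idx = {w: i for i, w in enumerate(sorted(set(tokenized_str)))}
input_ids = [token2idx[w] for w in tokenized_str]
print(token2idx)   # {'How': 0, 'are': 1, 'you?': 2}
print(input_ids)   # [0, 1, 2]
```

Note that the punctuation stays attached to the word ("you?" is a single token), which is one reason plain whitespace splitting is rarely used as-is.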

Next - Subword tokenization
