To process text in neural network models, we first need to encode the text as numbers (ids), since tensor operations act on numbers. And if the output of the network is to be words, we need to decode the predicted token ids back into text.
To encode text, the first decision to make is the level of granularity at which to consider it, because ultimately features will be built from these tokens. Many different experiments have been carried out using words, morphological units, phonemic units, and characters.
How we identify these units, such as words, is largely determined by the language they come from. For example, in many European languages a space is used to separate words, while some Asian languages have no spaces between words at all. Compare English and Mandarin.
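As a tiny illustration (the sentences below are made up for this example), a naive whitespace split produces word-like units for English but returns the whole Mandarin sentence as a single chunk:
english = 'I like drinking tea'
mandarin = '我喜欢喝茶'   # roughly the same sentence in Mandarin, written with no spaces between words

print(english.split())    # ['I', 'like', 'drinking', 'tea']
print(mandarin.split())   # ['我喜欢喝茶'] -- one undivided chunk, since there are no spaces to split on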
So the ability to tokenize, i.e. to split text into meaningful fundamental units, is not always straightforward.
There are also practical issues of how large our vocabulary of words, vocab_size, should be, considering memory limitations vs. coverage. A compromise may need to be made between covering the language well and keeping the vocabulary (and the memory it consumes) manageable, as the rough estimate below illustrates.
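As a purely illustrative, back-of-the-envelope estimate of the memory side of that trade-off (the embedding dimension of 512 below is just an assumed value), a float32 embedding table alone costs vocab_size × d_model × 4 bytes:
d_model = 512   # assumed embedding dimension, for illustration only
for vocab_size in [10_000, 32_000, 250_000]:
    n_bytes = vocab_size * d_model * 4   # float32 = 4 bytes per parameter
    print(f'vocab_size={vocab_size:>7,}: ~{n_bytes / 1e6:.0f} MB for the embedding table')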
In SentencePiece, unicode characters are grouped together using either a unigram language model (used in this week’s assignment) or BPE, byte-pair encoding. We will discuss BPE, since BERT and many of its variants use a modified version of BPE, and its pseudocode is easy to implement and understand… hopefully!
Unsurprisingly, even using unicode to initially tokenize text can be ambiguous, e.g.,
eaccent = '\u00E9'
e_accent = '\u0065\u0301'
print(f'{eaccent} = {e_accent} : {eaccent == e_accent}')
é = é : False
SentencePiece uses the Unicode standard normalization form, NFKC, so this isn’t an issue. Looking at our example from above but with normalization:
from unicodedata import normalize
norm_eaccent = normalize('NFKC', '\u00E9')
norm_e_accent = normalize('NFKC', '\u0065\u0301')
print(f'{norm_eaccent} = {norm_e_accent} : {norm_eaccent == norm_e_accent}')
é = é : True
Normalization has actually changed the underlying unicode code points (the unicode unique ids) for one of these two strings: the two-code-point sequence e + combining acute accent has been composed into the single code point for é.
def get_hex_encoding(s):
    return ' '.join(hex(ord(c)) for c in s)

def print_string_and_encoding(s):
    print(f'{s} : {get_hex_encoding(s)}')

for s in [eaccent, e_accent, norm_eaccent, norm_e_accent]:
    print_string_and_encoding(s)
é : 0xe9
é : 0x65 0x301
é : 0xe9
é : 0xe9
This normalization has other side effects which may be considered useful, such as converting other compatibility characters to plainer equivalents: the ﬁ ligature becomes fi, fullwidth letters such as Ａ become their ASCII forms, and the ellipsis … becomes three periods. (Curly quotes “ and ”, on the other hand, are left unchanged by NFKC.)
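A few quick checks of these mappings (each line below is just an illustration of NFKC’s compatibility decompositions):
print(normalize('NFKC', '\uFB01'))              # 'fi'  : the fi ligature becomes two ASCII letters
print(normalize('NFKC', '\uFF21'))              # 'A'   : a fullwidth letter becomes its ASCII form
print(normalize('NFKC', '\u2026'))              # '...' : the horizontal ellipsis becomes three periods
print(normalize('NFKC', '\u201C') == '\u201C')  # True  : the curly quote is left unchanged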
SentencePiece also ensures that when you tokenize and then detokenize your data, the original positions of the white space are preserved. (However, tabs and newlines are converted to spaces; please try this experiment yourself later below.)
To ensure this lossless tokenization, SentencePiece replaces white space with ▁ (U+2581), so that a simple join of the tokens followed by replacing the underscores with spaces can restore the white space, even if there are consecutive white-space symbols. But remember to normalize first and only then replace spaces with ▁ (U+2581), as the following example shows.
s = 'Tokenization is hard.'
s_ = s.replace(' ', '\u2581')
s_n = normalize('NFKC', 'Tokenization is hard.')
print(get_hex_encoding(s))
print(get_hex_encoding(s_))
print(get_hex_encoding(s_n))
0x54 0x6f 0x6b 0x65 0x6e 0x69 0x7a 0x61 0x74 0x69 0x6f 0x6e 0x20 0x69 0x73 0x20 0x68 0x61 0x72 0x64 0x2e
0x54 0x6f 0x6b 0x65 0x6e 0x69 0x7a 0x61 0x74 0x69 0x6f 0x6e 0x2581 0x69 0x73 0x2581 0x68 0x61 0x72 0x64 0x2e
0x54 0x6f 0x6b 0x65 0x6e 0x69 0x7a 0x61 0x74 0x69 0x6f 0x6e 0x20 0x69 0x73 0x20 0x68 0x61 0x72 0x64 0x2e
So in the first snippet the string with the replacement keeps the special unicode underscore (0x2581), while the string that was normalized keeps the plain ASCII space (0x20). Reversing the order of the second and third operations, normalizing first and replacing the spaces afterwards, we see that the special unicode underscore is retained in the final string. This is the order SentencePiece relies on: normalize first, then mark the white space with ▁.
s = 'Tokenization is hard.'
sn = normalize('NFKC', s)
sn_ = sn.replace(' ', '\u2581')
print(get_hex_encoding(s))
print(get_hex_encoding(sn))
print(get_hex_encoding(sn_))
0x54 0x6f 0x6b 0x65 0x6e 0x69 0x7a 0x61 0x74 0x69 0x6f 0x6e 0x20 0x69 0x73 0x20 0x68 0x61 0x72 0x64 0x2e
0x54 0x6f 0x6b 0x65 0x6e 0x69 0x7a 0x61 0x74 0x69 0x6f 0x6e 0x20 0x69 0x73 0x20 0x68 0x61 0x72 0x64 0x2e
0x54 0x6f 0x6b 0x65 0x6e 0x69 0x7a 0x61 0x74 0x69 0x6f 0x6e 0x2581 0x69 0x73 0x2581 0x68 0x61 0x72 0x64 0x2e
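As a quick sketch of the restore step described above, using hypothetical pieces rather than the output of a trained model:
pieces = ['\u2581Tokenization', '\u2581is', '\u2581hard', '.']   # hypothetical pieces with the ▁ marker
restored = ''.join(pieces).replace('\u2581', ' ')                # join, then map ▁ back to a space
print(repr(restored))   # ' Tokenization is hard.' -- note the leading space from the first ▁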
Now that we have discussed the preprocessing that SentencePiece performs, we will get our data, preprocess it, and apply the BPE algorithm. We will show how this reproduces the tokenization produced by training SentencePiece on our example dataset (from this week’s assignment).
First, we get our SQuAD data and process it as above.
import ast
def convert_json_examples_to_text(filepath):
    example_jsons = list(map(ast.literal_eval, open(filepath)))  # Read in the json from the example file
    texts = [example_json['text'].decode('utf-8') for example_json in example_jsons]  # Decode the byte sequences
    text = '\n\n'.join(texts)       # Separate different articles by two newlines
    text = normalize('NFKC', text)  # Normalize the text
    with open('example.txt', 'w') as fw:
        fw.write(text)
    return text
text = convert_json_examples_to_text('./data/data.txt')
print(text[:900])
Beginners BBQ Class Taking Place in Missoula!
Do you want to get better at making delicious BBQ? You will have the opportunity, put this on your calendar now. Thursday, September 22nd join World Class BBQ Champion, Tony Balay from Lonestar Smoke Rangers. He will be teaching a beginner level class for everyone who wants to get better with their culinary skills.
He will teach you everything you need to know to compete in a KCBS BBQ competition, including techniques, recipes, timelines, meat selection and trimming, plus smoker and fire information.
The cost to be in the class is $35 per person, and for spectators it is free. Included in the cost will be either a t-shirt or apron and you will be tasting samples of each meat that is prepared.
Discussion in 'Mac OS X Lion (10.7)' started by axboi87, Jan 20, 2012.
I've got a 500gb internal drive and a 240gb SSD.
When trying to restore using di
In the algorithm, the vocab variable is actually a frequency dictionary of the words. Further, those words have been prepended with an underscore to indicate that they are the beginning of a word. Finally, the characters have been delimited by spaces so that the BPE algorithm can greedily group the most common characters together in the dictionary. We will see how that is done shortly.
from collections import Counter
vocab = Counter(['\u2581' + word for word in text.split()])
vocab = {' '.join([l for l in word]): freq for word, freq in vocab.items()}
def show_vocab(vocab, end='\n', limit=20):
    """Show word frequencies in vocab up to the limit number of words"""
    shown = 0
    for word, freq in vocab.items():
        print(f'{word}: {freq}', end=end)
        shown += 1
        if shown > limit:
            break
show_vocab(vocab)
▁ B e g i n n e r s: 1
▁ B B Q: 3
▁ C l a s s: 2
▁ T a k i n g: 1
▁ P l a c e: 1
▁ i n: 15
▁ M i s s o u l a !: 1
▁ D o: 1
▁ y o u: 13
▁ w a n t: 1
▁ t o: 33
▁ g e t: 2
▁ b e t t e r: 2
▁ a t: 1
▁ m a k i n g: 2
▁ d e l i c i o u s: 1
▁ B B Q ?: 1
▁ Y o u: 1
▁ w i l l: 6
▁ h a v e: 4
▁ t h e: 31
We check the size of the vocabulary (frequency dictionary) because the number of merges is the one hyperparameter that BPE crucially depends on: it controls how far words are broken up into SentencePieces. It turns out that, on our small dataset, performing merges for 60% of the 455 unique words reproduces the tokenization of the model trained with an upper limit of a 32K vocab_size over the entire corpus of examples.
print(f'Total number of unique words: {len(vocab)}')
print(f'Number of merges required to reproduce SentencePiece training on the whole corpus: {int(0.60*len(vocab))}')
Total number of unique words: 455
Number of merges required to reproduce SentencePiece training on the whole corpus: 273
Directly from the BPE paper we have the following algorithm.
import re, collections

def get_stats(vocab):
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i+1]] += freq
    return pairs

def merge_vocab(pair, v_in):
    v_out = {}
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        w_out = p.sub(''.join(pair), word)
        v_out[w_out] = v_in[word]
    return v_out

def get_sentence_piece_vocab(vocab, frac_merges=0.60):
    sp_vocab = vocab.copy()
    num_merges = int(len(sp_vocab)*frac_merges)
    for i in range(num_merges):
        pairs = get_stats(sp_vocab)
        best = max(pairs, key=pairs.get)
        sp_vocab = merge_vocab(best, sp_vocab)
    return sp_vocab
To understand what’s going on, first take a look at the third function, get_sentence_piece_vocab. It takes in the current vocab word-frequency dictionary and the fraction, frac_merges, of the total vocab_size that determines num_merges, the number of times to merge characters in the words of the dictionary. Then, for each merge operation, get_stats counts how many occurrences there are of each pair of adjacent character sequences. The most frequent pair of symbols is taken as the best pair. That pair of symbols is then merged (the space between them removed) in every word in the vocab that contains this best pair. Consequently, merge_vocab creates a new vocab, v_out. This process is repeated num_merges times, and the result is the set of SentencePieces (the keys of the final sp_vocab).
Please feel free to skip the below if the above description was enough.
In a little more detail then: in get_stats we initially create a dictionary of bigram (two character sequence) frequencies from our vocabulary. In later iterations these entries may cover trigrams, quadgrams, etc. Note that the key of the pairs frequency dictionary is actually a 2-tuple, which is just shorthand notation for a pair.
In merge_vocab we take in an individual pair (of character sequences; note this is the most frequent best pair) and the current vocab as v_in. We create a new vocab, v_out, from the old by joining together the characters in the pair (removing the space) wherever the pair occurs in a word of the dictionary.
Note: the expression (?<!\S) means that either whitespace comes before the bigram or there is nothing before it (it is the beginning of the word); similarly, (?!\S) requires whitespace after the bigram or the end of the word.
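For instance, here is what a single merge step does to a tiny, made-up vocabulary (the words and counts below are purely illustrative):
toy_vocab = {'▁ l o w': 5, '▁ l o w e r': 2, '▁ n e w': 3}   # toy frequency dictionary in the same format as vocab

toy_pairs = get_stats(toy_vocab)
best = max(toy_pairs, key=toy_pairs.get)   # ('▁', 'l') with frequency 5 + 2 = 7; ties go to dictionary order
merged = merge_vocab(best, toy_vocab)

print(best)     # ('▁', 'l')
print(merged)   # {'▁l o w': 5, '▁l o w e r': 2, '▁ n e w': 3}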
sp_vocab = get_sentence_piece_vocab(vocab)
show_vocab(sp_vocab)
▁B e g in n ers: 1
▁BBQ: 3
▁Cl ass: 2
▁T ak ing: 1
▁P la ce: 1
▁in: 15
▁M is s ou la !: 1
▁D o: 1
▁you: 13
▁w an t: 1
▁to: 33
▁g et: 2
▁be t ter: 2
▁a t: 1
▁mak ing: 2
▁d e l ic i ou s: 1
▁BBQ ?: 1
▁ Y ou: 1
▁will: 6
▁have: 4
▁the: 31
First let us explore the SentencePiece model provided with this week’s assignment. Remember, you can always use Python’s built-in help command to see the documentation for any object or method.
import sentencepiece as spm
sp = spm.SentencePieceProcessor(model_file='./data/sentencepiece.model')
help(sp)
Help on SentencePieceProcessor in module sentencepiece object:
class SentencePieceProcessor(builtins.object)
| SentencePieceProcessor(model_file=None, model_proto=None, out_type=<class 'int'>, add_bos=False, add_eos=False, reverse=False, enable_sampling=False, nbest_size=-1, alpha=0.1)
|
| Methods defined here:
|
| Decode(self, input)
| Decode processed id or token sequences.
|
| DecodeIds = DecodeIdsWithCheck(self, ids)
|
| DecodeIdsAsSerializedProto = DecodeIdsAsSerializedProtoWithCheck(self, ids)
|
| DecodeIdsAsSerializedProtoWithCheck(self, ids)
|
| DecodeIdsWithCheck(self, ids)
|
| DecodePieces(self, pieces)
|
| DecodePiecesAsSerializedProto(self, pieces)
|
| Detokenize = Decode(self, input)
|
| Encode(self, input, out_type=None, add_bos=None, add_eos=None, reverse=None, enable_sampling=None, nbest_size=None, alpha=None)
| Encode text input to segmented ids or tokens.
|
| Args:
| input: input string. accepsts list of string.
| out_type: output type. int or str.
| add_bos: Add <s> to the result (Default = false)
| add_eos: Add </s> to the result (Default = false) <s>/</s> is added after
| reversing (if enabled).
| reverse: Reverses the tokenized sequence (Default = false)
| nbest_size: sampling parameters for unigram. Invalid for BPE-Dropout.
| nbest_size = {0,1}: No sampling is performed.
| nbest_size > 1: samples from the nbest_size results.
| nbest_size < 0: assuming that nbest_size is infinite and samples
| from the all hypothesis (lattice) using
| forward-filtering-and-backward-sampling algorithm.
| alpha: Soothing parameter for unigram sampling, and merge probability for
| BPE-dropout (probablity 'p' in BPE-dropout paper).
|
| EncodeAsIds(self, input)
|
| EncodeAsPieces(self, input)
|
| EncodeAsSerializedProto(self, input)
|
| GetEncoderVersion(self)
|
| GetPieceSize(self)
|
| GetScore = _batched_func(self, arg)
|
| IdToPiece = _batched_func(self, arg)
|
| Init(self, model_file=None, model_proto=None, out_type=<class 'int'>, add_bos=False, add_eos=False, reverse=False, enable_sampling=False, nbest_size=-1, alpha=0.1)
| Initialzie sentencepieceProcessor.
|
| Args:
| model_file: The sentencepiece model file path.
| model_proto: The sentencepiece model serialized proto.
| out_type: output type. int or str.
| add_bos: Add <s> to the result (Default = false)
| add_eos: Add </s> to the result (Default = false) <s>/</s> is added after
| reversing (if enabled).
| reverse: Reverses the tokenized sequence (Default = false)
| nbest_size: sampling parameters for unigram. Invalid for BPE-Dropout.
| nbest_size = {0,1}: No sampling is performed.
| nbest_size > 1: samples from the nbest_size results.
| nbest_size < 0: assuming that nbest_size is infinite and samples
| from the all hypothesis (lattice) using
| forward-filtering-and-backward-sampling algorithm.
| alpha: Soothing parameter for unigram sampling, and dropout probability of
| merge operations for BPE-dropout.
|
| IsByte = _batched_func(self, arg)
|
| IsControl = _batched_func(self, arg)
|
| IsUnknown = _batched_func(self, arg)
|
| IsUnused = _batched_func(self, arg)
|
| Load(self, model_file=None, model_proto=None)
| Overwride SentencePieceProcessor.Load to support both model_file and model_proto.
|
| Args:
| model_file: The sentencepiece model file path.
| model_proto: The sentencepiece model serialized proto. Either `model_file`
| or `model_proto` must be set.
|
| LoadFromFile(self, arg)
|
| LoadFromSerializedProto(self, serialized)
|
| LoadVocabulary(self, filename, threshold)
|
| NBestEncodeAsIds(self, input, nbest_size)
|
| NBestEncodeAsPieces(self, input, nbest_size)
|
| NBestEncodeAsSerializedProto(self, input, nbest_size)
|
| PieceToId = _batched_func(self, arg)
|
| ResetVocabulary(self)
|
| SampleEncodeAsIds(self, input, nbest_size, alpha)
|
| SampleEncodeAsPieces(self, input, nbest_size, alpha)
|
| SampleEncodeAsSerializedProto(self, input, nbest_size, alpha)
|
| SetDecodeExtraOptions(self, extra_option)
|
| SetEncodeExtraOptions(self, extra_option)
|
| SetEncoderVersion(self, encoder_version)
|
| SetVocabulary(self, valid_vocab)
|
| Tokenize = Encode(self, input, out_type=None, add_bos=None, add_eos=None, reverse=None, enable_sampling=None, nbest_size=None, alpha=None)
|
| __getitem__(self, piece)
|
| __getstate__(self)
|
| __init__ = Init(self, model_file=None, model_proto=None, out_type=<class 'int'>, add_bos=False, add_eos=False, reverse=False, enable_sampling=False, nbest_size=-1, alpha=0.1)
|
| __len__(self)
|
| __repr__ = _swig_repr(self)
|
| __setstate__(self, serialized_model_proto)
|
| bos_id(self)
|
| decode = Decode(self, input)
|
| decode_ids = DecodeIdsWithCheck(self, ids)
|
| decode_ids_as_serialized_proto = DecodeIdsAsSerializedProtoWithCheck(self, ids)
|
| decode_ids_as_serialized_proto_with_check = DecodeIdsAsSerializedProtoWithCheck(self, ids)
|
| decode_ids_with_check = DecodeIdsWithCheck(self, ids)
|
| decode_pieces = DecodePieces(self, pieces)
|
| decode_pieces_as_serialized_proto = DecodePiecesAsSerializedProto(self, pieces)
|
| detokenize = Decode(self, input)
|
| encode = Encode(self, input, out_type=None, add_bos=None, add_eos=None, reverse=None, enable_sampling=None, nbest_size=None, alpha=None)
|
| encode_as_ids = EncodeAsIds(self, input)
|
| encode_as_pieces = EncodeAsPieces(self, input)
|
| encode_as_serialized_proto = EncodeAsSerializedProto(self, input)
|
| eos_id(self)
|
| get_encoder_version = GetEncoderVersion(self)
|
| get_piece_size = GetPieceSize(self)
|
| get_score = _batched_func(self, arg)
|
| id_to_piece = _batched_func(self, arg)
|
| init = Init(self, model_file=None, model_proto=None, out_type=<class 'int'>, add_bos=False, add_eos=False, reverse=False, enable_sampling=False, nbest_size=-1, alpha=0.1)
|
| is_byte = _batched_func(self, arg)
|
| is_control = _batched_func(self, arg)
|
| is_unknown = _batched_func(self, arg)
|
| is_unused = _batched_func(self, arg)
|
| load = Load(self, model_file=None, model_proto=None)
|
| load_from_file = LoadFromFile(self, arg)
|
| load_from_serialized_proto = LoadFromSerializedProto(self, serialized)
|
| load_vocabulary = LoadVocabulary(self, filename, threshold)
|
| nbest_encode_as_ids = NBestEncodeAsIds(self, input, nbest_size)
|
| nbest_encode_as_pieces = NBestEncodeAsPieces(self, input, nbest_size)
|
| nbest_encode_as_serialized_proto = NBestEncodeAsSerializedProto(self, input, nbest_size)
|
| pad_id(self)
|
| piece_size(self)
|
| piece_to_id = _batched_func(self, arg)
|
| reset_vocabulary = ResetVocabulary(self)
|
| sample_encode_as_ids = SampleEncodeAsIds(self, input, nbest_size, alpha)
|
| sample_encode_as_pieces = SampleEncodeAsPieces(self, input, nbest_size, alpha)
|
| sample_encode_as_serialized_proto = SampleEncodeAsSerializedProto(self, input, nbest_size, alpha)
|
| serialized_model_proto(self)
|
| set_decode_extra_options = SetDecodeExtraOptions(self, extra_option)
|
| set_encode_extra_options = SetEncodeExtraOptions(self, extra_option)
|
| set_encoder_version = SetEncoderVersion(self, encoder_version)
|
| set_vocabulary = SetVocabulary(self, valid_vocab)
|
| tokenize = Encode(self, input, out_type=None, add_bos=None, add_eos=None, reverse=None, enable_sampling=None, nbest_size=None, alpha=None)
|
| unk_id(self)
|
| vocab_size(self)
|
| ----------------------------------------------------------------------
| Static methods defined here:
|
| __swig_destroy__ = delete_SentencePieceProcessor(...)
|
| ----------------------------------------------------------------------
| Data descriptors defined here:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
|
| thisown
| The membership flag
Let’s work with the first sentence of our example text.
s0 = 'Beginners BBQ Class Taking Place in Missoula!'
# encode: text => id
print(sp.encode_as_pieces(s0))
print(sp.encode_as_ids(s0))
# decode: id => text
print(sp.decode_pieces(sp.encode_as_pieces(s0)))
print(sp.decode_ids([12847, 277]))
['▁Beginn', 'ers', '▁BBQ', '▁Class', '▁', 'Taking', '▁Place', '▁in', '▁Miss', 'oul', 'a', '!']
[12847, 277, 15068, 4501, 3, 12297, 3399, 16, 5964, 7115, 9, 55]
Beginners BBQ Class Taking Place in Missoula!
Beginners
Notice how SentencePiece breaks the words into seemingly odd parts, but we’ve seen something similar in our work with BPE. But how close were we to this model, which was trained on the whole corpus of examples with a vocab_size of 32,000 instead of 455? Here you can also test what happens to white space, such as '\n' (see the short experiment after the discussion of special ids below).
But first, note that SentencePiece maps each SentencePiece (token) to an id and has reserved some of the ids for special tokens, as can be seen in this week’s assignment.
uid = 15068
spiece = "\u2581BBQ"
unknown = "__MUST_BE_UNKNOWN__"
# id <=> piece conversion
print(f'SentencePiece for ID {uid}: {sp.id_to_piece(uid)}')
print(f'ID for Sentence Piece {spiece}: {sp.piece_to_id(spiece)}')
# unknown tokens map to the <unk> id (here 2; the id reserved for UNK is configurable)
print(f'ID for unknown text {unknown}: {sp.piece_to_id(unknown)}')
SentencePiece for ID 15068: ▁BBQ
ID for Sentence Piece ▁BBQ: 15068
ID for unknown text __MUST_BE_UNKNOWN__: 2
print(f'Beginning of sentence id: {sp.bos_id()}')
print(f'Pad id: {sp.pad_id()}')
print(f'End of sentence id: {sp.eos_id()}')
print(f'Unknown id: {sp.unk_id()}')
print(f'Vocab size: {sp.vocab_size()}')
Beginning of sentence id: -1
Pad id: 0
End of sentence id: 1
Unknown id: 2
Vocab size: 32000
We can also check the ids for the first and last parts of the vocabulary.
print('\nId\tSentP\tControl?')
print('------------------------')
# <unk>, <s>, </s> are defined by default, but in this model ids (0, 1, 2) are <pad>, </s>, <unk>.
# <pad> and </s> are defined as 'control' symbols.
for uid in range(10):
print(uid, sp.id_to_piece(uid), sp.is_control(uid), sep='\t')
# for uid in range(sp.vocab_size()-10,sp.vocab_size()):
# print(uid, sp.id_to_piece(uid), sp.is_control(uid), sep='\t')
Id SentP Control?
------------------------
0 <pad> True
1 </s> True
2 <unk> False
3 ▁ False
4 X False
5 . False
6 , False
7 s False
8 ▁the False
9 a False
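Returning to the white-space question from above, here is one way to run that experiment yourself (the sentence is just an illustration; per the earlier note, the tab and newline are expected to come back as plain spaces after the round trip):
s_ws = 'Tokenization\tis\nhard.'
pieces_ws = sp.encode_as_pieces(s_ws)
print(pieces_ws)
print(repr(sp.decode_pieces(pieces_ws)))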
Finally, let’s train our own BPE model directly with the SentencePiece library and compare it to the results of our implementation of the algorithm from the BPE paper itself.
spm.SentencePieceTrainer.train('--input=example.txt --model_prefix=example_bpe --vocab_size=450 --model_type=bpe')
sp_bpe = spm.SentencePieceProcessor()
sp_bpe.load('example_bpe.model')
print('*** BPE ***')
print(sp_bpe.encode_as_pieces(s0))
*** BPE ***
['▁B', 'e', 'ginn', 'ers', '▁BBQ', '▁Cl', 'ass', '▁T', 'ak', 'ing', '▁P', 'la', 'ce', '▁in', '▁M', 'is', 's', 'ou', 'la', '!']
show_vocab(sp_vocab, end = ', ')
▁B e g in n ers: 1, ▁BBQ: 3, ▁Cl ass: 2, ▁T ak ing: 1, ▁P la ce: 1, ▁in: 15, ▁M is s ou la !: 1, ▁D o: 1, ▁you: 13, ▁w an t: 1, ▁to: 33, ▁g et: 2, ▁be t ter: 2, ▁a t: 1, ▁mak ing: 2, ▁d e l ic i ou s: 1, ▁BBQ ?: 1, ▁ Y ou: 1, ▁will: 6, ▁have: 4, ▁the: 31,
Our implementation of BPE from the paper matches up pretty well with the library itself! The differences are probably accounted for by the vocab_size. There is also another technical difference: the SentencePiece implementation of BPE uses a priority queue to keep track of the best pairs more efficiently. There happens to be a priority queue in the Python standard library, heapq, if you would like to give that a try below!
from heapq import heappush, heappop

def heapsort(iterable):
    h = []
    for value in iterable:
        heappush(h, value)
    return [heappop(h) for i in range(len(h))]
a = [1,4,3,1,3,2,1,4,2]
heapsort(a)
[1, 1, 1, 2, 2, 3, 3, 4, 4]
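As a starting point, here is a minimal sketch (not how the SentencePiece library itself is implemented) of using a heap to pull out the most frequent pair, by pushing negated counts onto the min-heap:
import heapq

def best_pair_with_heap(frequency_vocab):
    """Return (pair, frequency) for the most frequent adjacent pair, using a heap."""
    pairs = get_stats(frequency_vocab)
    heap = [(-freq, pair) for pair, freq in pairs.items()]   # negate counts so the min-heap pops the largest count first
    heapq.heapify(heap)
    neg_freq, pair = heapq.heappop(heap)
    return pair, -neg_freq

print(best_pair_with_heap(vocab))
Note that ties are broken differently than with max(pairs, key=pairs.get), and a full BPE implementation would update the heap incrementally after each merge (for example with lazy deletion of stale entries) rather than rebuilding it from scratch each time.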
For a more extensive example, consider looking at the SentencePiece repo. The last few sections of this code were repurposed from that tutorial. Thanks for your participation! Next stop: BERT and T5!