The Transformer is a seq2seq model architecture that Google introduced in the 2017 paper "Attention Is All You Need". Its key innovation is self-attention, which captures how each word in a sequence attends to every other word. The model has been enormously successful in NLP, and in recent years it has also made remarkable progress in computer vision, matching or surpassing CNN models on tasks such as image recognition and object detection. That makes the Transformer one of the most worthwhile architectures in AI to study right now. There are already plenty of articles that explain the Transformer architecture and its details, so I will not repeat that here. Instead, this post focuses on practice: building a Transformer model with TensorFlow to translate French into English.
The TensorFlow website has a detailed tutorial that shows how to build a Transformer to translate Portuguese into English. After working through that tutorial, I made a few changes to adapt it to French-to-English translation.
Preparing the dataset
The page Tab-delimited Bilingual Sentence Pairs from the Tatoeba Project (Good for Anki and Similar Flashcard Applications) offers sentence pairs between English and many other languages. Here we download the French-English data to use as the training and validation sets. After downloading http://www.manythings.org/anki/fra-eng.zip and unzipping it, we can see that each line contains an English sentence, a French sentence, and the contributor attribution, separated by tabs.
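If you would rather script the download than fetch the file by hand, a minimal sketch is shown below (assuming the URL above is reachable from Python; manythings.org sometimes rejects scripted requests, in which case just download and unzip it in a browser):

import urllib.request
import zipfile

# Fetch the French-English archive and extract fra.txt into the working directory
url = 'http://www.manythings.org/anki/fra-eng.zip'
urllib.request.urlretrieve(url, 'fra-eng.zip')
with zipfile.ZipFile('fra-eng.zip') as zf:
    zf.extract('fra.txt')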
ÒÔÏ´úÂëÊǶÁÈ¡ÎļþµÄÊý¾Ý²¢²é¿´·¨ÓïºÍÓ¢ÓïµÄ¾ä×Ó:
fra = []
eng = []
with open('fra.txt', 'r') as f:
    content = f.readlines()
for line in content:
    temp = line.split(sep='\t')
    eng.append(temp[0])
    fra.append(temp[1])
Looking at these sentences, we can see that some of them contain special characters, for example 'Cours\u202f!'. We need to strip out these invisible special characters (\u202f, \xa0, ...):
import re

new_fra = []
new_eng = []
for item in fra:
    new_fra.append(re.sub(r'\s', ' ', item).strip().lower())
for item in eng:
    new_eng.append(re.sub(r'\s', ' ', item).strip().lower())
Converting words into tokens
ÒòΪģÐÍÖ»ÄÜ´¦ÀíÊý×Ö,ÐèÒª°ÑÕâЩ·¨ÓïºÍÓ¢ÓïµÄµ¥´ÊתΪtoken¡£ÕâÀï²ÉÓÃBERT tokenizerµÄ·½Ê½À´´¦Àí,¾ßÌå¿ÉÒԲμûtensorflowµÄ½Ì³ÌSubword tokenizers ?|? Text ?|? TensorFlow
First, create two datasets containing the French and the English sentences respectively.
ds_fra = tf.data.Dataset.from_tensor_slices(new_fra)
ds_eng = tf.data.Dataset.from_tensor_slices(new_eng)
Next, call TensorFlow's bert_vocab library to build the vocabularies. A few reserved tokens are defined for special purposes, for example [START] to mark the beginning of a sentence and [UNK] for a word that does not appear in the vocabulary.
bert_tokenizer_params = dict(lower_case=True)
reserved_tokens = ["[PAD]", "[UNK]", "[START]", "[END]"]

bert_vocab_args = dict(
    # The target vocabulary size
    vocab_size=8000,
    # Reserved tokens that must be included in the vocabulary
    reserved_tokens=reserved_tokens,
    # Arguments for `text.BertTokenizer`
    bert_tokenizer_params=bert_tokenizer_params,
    # Arguments for `wordpiece_vocab.wordpiece_tokenizer_learner_lib.learn`
    learn_params={},
)

fr_vocab = bert_vocab.bert_vocab_from_dataset(
    ds_fra.batch(1000).prefetch(2),
    **bert_vocab_args
)

en_vocab = bert_vocab.bert_vocab_from_dataset(
    ds_eng.batch(1000).prefetch(2),
    **bert_vocab_args
)
Once the vocabularies are built, we can take a look at what they contain:
print(en_vocab[:10])
print(en_vocab[100:110])
print(en_vocab[1000:1010])
print(en_vocab[-10:])
The output is shown below. Note that the vocabulary is not split strictly by whole English words; for example, '##ers' is a token produced when a word ends in "ers":
['[PAD]', '[UNK]', '[START]', '[END]', '!', '"', '$', '%', '&', "'"]
['ll', 'there', 've', 'and', 'him', 'time', 'here', 'about', 'get', 'didn']
['##ers', 'chair', 'earth', 'honest', 'succeed', '##ted', 'animals', 'bill', 'drank', 'lend']
['##?', '##j', '##q', '##z', '##°', '##–', '##—', '##‘', '##’', '##€']
After saving the vocabularies to files, we can instantiate two tokenizers to tokenize French and English sentences.
def write_vocab_file(filepath, vocab):
    with open(filepath, 'w') as f:
        for token in vocab:
            print(token, file=f)

write_vocab_file('fr_vocab.txt', fr_vocab)
write_vocab_file('en_vocab.txt', en_vocab)

fr_tokenizer = text.BertTokenizer('fr_vocab.txt', **bert_tokenizer_params)
en_tokenizer = text.BertTokenizer('en_vocab.txt', **bert_tokenizer_params)
Now we can test tokenizing a few English sentences. We add the special [START] and [END] tokens at the beginning and end of each sentence, which will make training the model easier later on.
START = tf.argmax(tf.constant(reserved_tokens) == "[START]")
END = tf.argmax(tf.constant(reserved_tokens) == "[END]")

def add_start_end(ragged):
    count = ragged.bounding_shape()[0]
    starts = tf.fill([count, 1], START)
    ends = tf.fill([count, 1], END)
    return tf.concat([starts, ragged, ends], axis=1)

sentences = ["Hello Roy!", "The sky is blue.", "Nice to meet you!"]
add_start_end(en_tokenizer.tokenize(sentences).merge_dims(1, 2)).to_tensor()
The output is as follows:
<tf.Tensor: shape=(3, 7), dtype=int64, numpy=
array([[ 2, 1830, 45, 3450, 4, 3, 0],
[ 2, 62, 1132, 64, 996, 13, 3],
[ 2, 353, 61, 416, 60, 4, 3]])>
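As a quick round-trip check (illustrative only; detokenize returns the lower-cased wordpieces, so punctuation spacing will differ slightly from the input), we can map the token IDs back to text:

# Convert token IDs back to words and join them into sentences
tokens = en_tokenizer.tokenize(sentences).merge_dims(1, 2)
words = en_tokenizer.detokenize(tokens)
print(tf.strings.reduce_join(words, separator=' ', axis=-1))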
Building the datasets
Now we can build the training and validation sets. Both the French and the English sentences go into the dataset: the French sentence is the input to the Transformer encoder, while the English sentence serves as the decoder input and as the target of the model. We build a pandas DataFrame, randomly take 80% of the records as the training set and the rest as the validation set, and then convert them into TensorFlow datasets.
df = pd.DataFrame(data={'fra': new_fra, 'eng': new_eng})

# Shuffle the row indices and split 80/20 into training and validation sets
recordnum = df.count()['fra']
indexlist = list(range(recordnum))
random.shuffle(indexlist)
df_train = df.loc[indexlist[:int(recordnum*0.8)]]
df_val = df.loc[indexlist[int(recordnum*0.8):]]

ds_train = tf.data.Dataset.from_tensor_slices((df_train.fra.values, df_train.eng.values))
ds_val = tf.data.Dataset.from_tensor_slices((df_val.fra.values, df_val.eng.values))
Let's check how many tokens the longest sentences in the training set contain:
lengths = []

for fr_examples, en_examples in ds_train.batch(1024):
    fr_tokens = fr_tokenizer.tokenize(fr_examples)
    lengths.append(fr_tokens.row_lengths())

    en_tokens = en_tokenizer.tokenize(en_examples)
    lengths.append(en_tokens.row_lengths())
    print('.', end='', flush=True)

all_lengths = np.concatenate(lengths)

plt.hist(all_lengths, np.linspace(0, 100, 11))
plt.ylim(plt.ylim())
max_length = max(all_lengths)
plt.plot([max_length, max_length], plt.ylim())
plt.title(f'Max tokens per example: {max_length}');
From the histogram we can see that, after tokenization, the longest sentence in the training set contains 67 tokens.
Now we can batch the datasets, as in the following code:
BUFFER_SIZE = 20000
BATCH_SIZE = 64
MAX_TOKENS = 67

def filter_max_tokens(fr, en):
    num_tokens = tf.maximum(tf.shape(fr)[1], tf.shape(en)[1])
    return num_tokens < MAX_TOKENS

def tokenize_pairs(fr, en):
    fr = add_start_end(fr_tokenizer.tokenize(fr).merge_dims(1, 2))
    # Convert from ragged to dense, padding with zeros.
    fr = fr.to_tensor()

    en = add_start_end(en_tokenizer.tokenize(en).merge_dims(1, 2))
    # Convert from ragged to dense, padding with zeros.
    en = en.to_tensor()
    return fr, en

def make_batches(ds):
    return (
        ds
        .cache()
        .shuffle(BUFFER_SIZE)
        .batch(BATCH_SIZE)
        .map(tokenize_pairs, num_parallel_calls=tf.data.AUTOTUNE)
        .filter(filter_max_tokens)
        .prefetch(tf.data.AUTOTUNE))

train_batches = make_batches(ds_train)
val_batches = make_batches(ds_val)
We can pull one batch to take a look:
for a in train_batches.take(1):
    print(a)
The result is shown below. Each batch contains two tensors, the tokenized French sentences and the tokenized English sentences; every sentence starts with token 2 and ends with token 3:
(<tf.Tensor: shape=(64, 24), dtype=int64, numpy=
array([[ 2, 39, 9, ..., 0, 0, 0],
[ 2, 62, 43, ..., 0, 0, 0],
[ 2, 147, 70, ..., 0, 0, 0],
...,
[ 2, 4310, 14, ..., 0, 0, 0],
[ 2, 39, 9, ..., 0, 0, 0],
[ 2, 68, 64, ..., 0, 0, 0]])>, <tf.Tensor: shape=(64, 20), dtype=int64, numpy=
array([[ 2, 36, 76, ..., 0, 0, 0],
[ 2, 36, 75, ..., 0, 0, 0],
[ 2, 92, 80, ..., 0, 0, 0],
...,
[ 2, 68, 60, ..., 0, 0, 0],
[ 2, 36, 75, ..., 0, 0, 0],
[ 2, 67, 9, ..., 0, 0, 0]])>)
Adding positional information to the input
Feeding the batches above into an embedding layer turns each token into a high-dimensional vector, for example a 128-dimensional one. We then need to add positional information to this vector to indicate where the token sits in the sentence. The paper proposes the following positional encoding:
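PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))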
In the formula, pos is the position of the word; for a sentence of 50 words, pos ranges from 0 to 49. d_model is the embedding dimension; if each word is mapped to a 128-dimensional vector, then d_model = 128. i indexes those dimensions, so 2i and 2i+1 run over 0 to 127. The formula therefore says that for the N-th word, every dimension of its 128-dimensional embedding gets a position-dependent value added to it. Take the third word (pos = 2) as an example: each even dimension 2i (0, 2, 4, ...) of its vector gets sin(2 / 10000^(2i/128)) added, and each odd dimension 2i+1 (1, 3, 5, ...) gets cos(2 / 10000^(2i/128)) added.
ÒÔÏ´úÂ뽫Éú³ÉλÖñàÂëÏòÁ¿,Õâ¸öÏòÁ¿¿ÉÒÔ¼ÓÈëµ½tokenµÄǶÈëÏòÁ¿ÖС£
def get_angles(pos, i, d_model):
    angle_rates = 1 / np.power(10000, (2 * (i//2)) / np.float32(d_model))
    return pos * angle_rates

def positional_encoding(position, d_model):
    angle_rads = get_angles(np.arange(position)[:, np.newaxis],
                            np.arange(d_model)[np.newaxis, :],
                            d_model)

    # apply sin to even indices in the array; 2i
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])

    # apply cos to odd indices in the array; 2i+1
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])

    pos_encoding = angle_rads[np.newaxis, ...]

    return tf.cast(pos_encoding, dtype=tf.float32)
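As a quick check (illustrative only; 128 is just the example embedding size used above), we can generate the encoding for MAX_TOKENS positions and look at its shape:

# One encoding vector per position; this is later added to the token embeddings
pos_encoding = positional_encoding(MAX_TOKENS, 128)
print(pos_encoding.shape)  # (1, 67, 128)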
Creating the padding mask and look-ahead mask
The padding mask marks the positions of the input sequence that are 0: wherever the input token is 0, the mask is 1. This keeps the padding tokens from contributing to the model's training. The look-ahead mask hides future tokens during prediction. For example, when translating a French sentence, the corresponding English sentence is the target; during training, when predicting the first English word the whole English sentence must be masked, and when predicting the second English word everything after the first word must be masked. This prevents the model from seeing the very words it is supposed to predict, which would corrupt training.
def create_padding_mask(seq):
    seq = tf.cast(tf.math.equal(seq, 0), tf.float32)

    # add extra dimensions to add the padding
    # to the attention logits.
    return seq[:, tf.newaxis, tf.newaxis, :]  # (batch_size, 1, 1, seq_len)

def create_look_ahead_mask(size):
    mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
    return mask  # (seq_len, seq_len)
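A small example (illustrative only) makes the behaviour of both masks easier to see:

# Padding mask: 1 wherever the input token is 0 (padding)
print(create_padding_mask(tf.constant([[2, 15, 3, 0, 0]])))
# -> [[[[0. 0. 0. 1. 1.]]]]

# Look-ahead mask: position i may only attend to positions <= i
print(create_look_ahead_mask(4))
# -> [[0. 1. 1. 1.]
#     [0. 0. 1. 1.]
#     [0. 0. 0. 1.]
#     [0. 0. 0. 0.]]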
Computing self-attention
Now we come to the core idea of the Transformer. The input vectors are passed through three linear projection matrices to produce the Q, K, and V vectors. The similarity between Q and K gives the attention weights, which are then multiplied with V to produce the output, as illustrated by the scaled dot-product attention diagram in the paper.
The attention weights are computed with the following formula:
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, where d_k is the dimension of the key vectors.
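A minimal sketch of this computation, following the scaled_dot_product_attention function from the TensorFlow tutorial (the mask argument is optional; when supplied, it adds a large negative value to the masked logits so they vanish after the softmax):

def scaled_dot_product_attention(q, k, v, mask=None):
    """Compute softmax(Q K^T / sqrt(d_k)) V."""
    matmul_qk = tf.matmul(q, k, transpose_b=True)   # (..., seq_len_q, seq_len_k)

    # scale by sqrt(d_k), the dimension of the key vectors
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

    # mask out padded / future positions before the softmax
    if mask is not None:
        scaled_attention_logits += (mask * -1e9)

    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)  # (..., seq_len_q, seq_len_k)
    output = tf.matmul(attention_weights, v)                             # (..., seq_len_q, depth_v)
    return output, attention_weights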