# -*- coding: utf-8 -*-
"""
Automatically generated by Colaboratory.
Original file is located at
https://colab.research.google.com/drive/15WzEwKM9d9AJXDgFVw52RvSDhh1iMXg6
# Neural Machine Translation
In this project we are going to perform machine translation using two deep learning approaches: a Recurrent Neural Network (RNN) and a Transformer.
Specifically, we are going to train sequence-to-sequence models for Spanish-to-English translation. In this assignment you only need to implement the neural network models; we implement all the data loading for you. Please **refer** to the following resources for more details:
1. https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf
2. https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html
3. https://arxiv.org/pdf/1409.0473.pdf
<font color='green'><b>Hint:</b> While you work, we suggest that you keep your hardware accelerator set to "CPU" (the default for Colab). However, when you have finished debugging and are ready to train your models, you should select "GPU" as your runtime type. This will speed up the training of your models. You can find this by going to <TT>Runtime > Change Runtime Type</TT> and selecting "GPU" from the dropdown menu.</font>
We have imported all the libraries you need to do this project. <b>You should not import any extra libraries. Furthermore, you should not write any code outside of TODO sections.</b> If you do, the autograder will fail to run your code.
"""
### DO NOT EDIT ###
import pandas as pd
import unicodedata
import re
from torch.utils.data import Dataset
import torch
import random
import os
rnn_encoder, rnn_decoder, transformer_encoder, transformer_decoder = None, None, None, None
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
if __name__=='__main__':
print('Using device:', DEVICE)
"""# Provided Functions
This section contains several provided functions that are used throughout the notebook.
## Helper Functions
This cell contains helper functions for the notebook.
"""
### DO NOT EDIT ###
# Converts a unicode string to ASCII
def unicode_to_ascii(s):
"""Normalizes latin chars with accent to their canonical decomposition"""
return ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')
def preprocess_sentence(w):
'''
Preprocess the sentence to add the start, end tokens and make them lower-case
'''
w = unicode_to_ascii(w.lower().strip())
w = re.sub(r'([?.!,¿])', r' \1 ', w)
w = re.sub(r'[" "]+', ' ', w)
w = re.sub(r'[^a-zA-Z?.!,¿]+', ' ', w)
w = w.rstrip().strip()
w = '<start> ' + w + ' <end>'
return w
def pad_sequences(x, max_len):
padded = np.zeros((max_len), dtype=np.int64)
if len(x) > max_len:
padded[:] = x[:max_len]
else:
padded[:len(x)] = x
return padded
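# --- Illustrative usage sketch (editor's addition, not part of the assignment) ---
# A quick look at what the two helpers above produce. The sample sentence and the toy index
# list are made up; numpy is imported here so the sketch is self-contained.
if __name__ == '__main__':
    import numpy as np
    print(preprocess_sentence('¿Cómo estás?'))        # -> '<start> ¿ como estas ? <end>'
    print(pad_sequences([3, 7, 7, 5], max_len=6))     # -> [3 7 7 5 0 0]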
def preprocess_data_to_tensor(dataframe, src_vocab, trg_vocab):
# Vectorize the input and target languages
src_tensor = [[src_vocab.word2idx[s if s in src_vocab.vocab else '<unk>'] for s in es.split(' ')] for es in dataframe['es'].values.tolist()]
trg_tensor = [[trg_vocab.word2idx[s if s in trg_vocab.vocab else '<unk>'] for s in eng.split(' ')] for eng in dataframe['eng'].values.tolist()]
# Calculate the max_length of input and output tensor for padding
max_length_src, max_length_trg = max(len(t) for t in src_tensor), max(len(t) for t in trg_tensor)
print('max_length_src: {}, max_length_trg: {}'.format(max_length_src, max_length_trg))
# Pad all the sentences in the dataset with the max_length
src_tensor = [pad_sequences(x, max_length_src) for x in src_tensor]
trg_tensor = [pad_sequences(x, max_length_trg) for x in trg_tensor]
return src_tensor, trg_tensor, max_length_src, max_length_trg
def train_test_split(src_tensor, trg_tensor):
'''
Create training and test sets.
'''
total_num_examples = len(src_tensor) - int(0.2*len(src_tensor))
src_tensor_train, src_tensor_test = src_tensor[:int(0.75*total_num_examples)], src_tensor[int(0.75*total_num_examples):total_num_examples]
trg_tensor_train, trg_tensor_test = trg_tensor[:int(0.75*total_num_examples)], trg_tensor[int(0.75*total_num_examples):total_num_examples]
return src_tensor_train, src_tensor_test, trg_tensor_train, trg_tensor_test
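# --- Editor's note (illustrative) ---
# train_test_split above first drops the last 20% of the data and then splits the remainder
# 75/25. With the 50,000 examples used later in this notebook, that works out as follows:
if __name__ == '__main__':
    _n = 50000
    _kept = _n - int(0.2 * _n)                               # 40000 examples retained
    print(int(0.75 * _kept), _kept - int(0.75 * _kept))      # 30000 train, 10000 test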
"""## Sanity Check Function
This function will be used to perform a sanity check on your RNN and transformer models.
"""
### DO NOT EDIT ###
count_parameters = lambda model: sum(p.numel() for p in model.parameters() if p.requires_grad)
def sanityCheckModel(all_test_params, NN, expected_outputs, init_or_forward):
print('--- TEST: ' + ('Number of Model Parameters (tests __init__(...))' if init_or_forward=='init' else 'Output shape of forward(...)') + ' ---')
if init_or_forward == "forward":
# Create random batches of texts and labels
texts_batch = torch.randint(low=0, high=len(all_test_params[0]['src_vocab']), size=(10,16))
labels_batch = torch.randint(low=0, high=len(all_test_params[0]['src_vocab']), size=(10,12))
for tp_idx, (test_params, expected_output) in enumerate(zip(all_test_params, expected_outputs)):
if init_or_forward == "forward":
batch_size = test_params['batch_size']
texts = texts_batch[:batch_size]
if NN.__name__ == "RnnEncoder":
texts = texts.transpose(0,1)
# Construct the student model
tps = {k:v for k, v in test_params.items() if k != 'batch_size'}
stu_nn = NN(**tps)
input_rep = str({k:v for k,v in tps.items()})
if init_or_forward == "forward":
with torch.no_grad():
if NN.__name__ == "TransformerEncoder":
stu_out = stu_nn(texts)
else:
stu_out, _ = stu_nn(texts)
expected_output = torch.rand(expected_output).size()
ref_out_shape = expected_output
has_passed = torch.is_tensor(stu_out)
if not has_passed: msg = 'Output must be a torch.Tensor; received ' + str(type(stu_out))
else:
has_passed = stu_out.shape == ref_out_shape
msg = 'Your Output Shape: ' + str(stu_out.shape)
status = 'PASSED' if has_passed else 'FAILED'
message = '\t' + status + "\t Init Input: " + input_rep + '\tForward Input Shape: ' + str(texts.shape) + '\tExpected Output Shape: ' + str(ref_out_shape) + '\t' + msg
print(message)
else:
stu_num_params = count_parameters(stu_nn)
ref_num_params = expected_output
comparison_result = (stu_num_params == ref_num_params)
status = 'PASSED' if comparison_result else 'FAILED'
message = '\t' + status + "\tInput: " + input_rep + ('\tExpected Num. Params: ' + str(ref_num_params) + '\tYour Num. Params: '+ str(stu_num_params))
print(message)
del stu_nn
"""## Evaluation Functions
These functions will be used to evaluate the translations from your RNN and transformer models.
"""
### DO NOT EDIT ###
def get_reference_candidate(target, pred, trg_vocab):
def _to_token(sentence):
lis = []
for s in sentence[1:]:
x = trg_vocab.idx2word[s]
if x == "<end>": break
lis.append(x)
return lis
reference = _to_token(list(target.numpy()))
candidate = _to_token(list(pred.numpy()))
return reference, candidate
def compute_bleu_scores(target_tensor_val, target_output, final_output, trg_vocab):
bleu_1 = 0.0
bleu_2 = 0.0
bleu_3 = 0.0
bleu_4 = 0.0
smoother = SmoothingFunction()
save_reference = []
save_candidate = []
for i in range(len(target_tensor_val)):
reference, candidate = get_reference_candidate(target_output[i], final_output[i], trg_vocab)
bleu_1 += sentence_bleu(reference, candidate, weights=(1,), smoothing_function=smoother.method1)
bleu_2 += sentence_bleu(reference, candidate, weights=(1/2, 1/2), smoothing_function=smoother.method1)
bleu_3 += sentence_bleu(reference, candidate, weights=(1/3, 1/3, 1/3), smoothing_function=smoother.method1)
bleu_4 += sentence_bleu(reference, candidate, weights=(1/4, 1/4, 1/4, 1/4), smoothing_function=smoother.method1)
save_reference.append(reference)
save_candidate.append(candidate)
bleu_1 = bleu_1/len(target_tensor_val)
bleu_2 = bleu_2/len(target_tensor_val)
bleu_3 = bleu_3/len(target_tensor_val)
bleu_4 = bleu_4/len(target_tensor_val)
scores = {"bleu_1": bleu_1, "bleu_2": bleu_2, "bleu_3": bleu_3, "bleu_4": bleu_4}
print('BLEU 1-gram: %f' % (bleu_1))
print('BLEU 2-gram: %f' % (bleu_2))
print('BLEU 3-gram: %f' % (bleu_3))
print('BLEU 4-gram: %f' % (bleu_4))
return save_candidate, scores
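# --- Illustrative sketch (editor's addition, not part of the assignment) ---
# A toy call to NLTK's sentence_bleu, which compute_bleu_scores relies on. In the standard
# NLTK API the first argument is a list of reference token lists and the second is the
# candidate token list; the sentences below are made up and independent of the code above.
if __name__ == '__main__':
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
    _refs = [['the', 'cat', 'is', 'on', 'the', 'mat']]
    _cand = ['the', 'cat', 'sat', 'on', 'the', 'mat']
    _smoother = SmoothingFunction()
    print(sentence_bleu(_refs, _cand, weights=(1,), smoothing_function=_smoother.method1))  # unigram BLEU ~0.83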
"""# Step 1: Download & Prepare the Data
## Download and Visualize the Data
Here we will download the translation data. We will learn a model to translate Spanish to English.
"""
### DO NOT EDIT ###
if __name__ == '__main__':
os.system("wget http://www.manythings.org/anki/spa-eng.zip")
os.system("unzip -o spa-eng.zip")
"""Now we view the data."""
### DO NOT EDIT ###
if __name__ == '__main__':
total_num_examples = 50000
dat = pd.read_csv("spa.txt",
sep="\t",
header=None,
usecols=[0,1],
names=['eng', 'es'],
nrows=total_num_examples,
encoding="UTF-8"
).sample(frac=1).reset_index().drop(['index'], axis=1)
print(dat) # Visualize the data
"""Next we preprocess the data."""
### DO NOT EDIT ###
if __name__ == '__main__':
data = dat.copy()
data['eng'] = dat.eng.apply(lambda w: preprocess_sentence(w))
data['es'] = dat.es.apply(lambda w: preprocess_sentence(w))
print(data) # Visualizing the data
"""## Vocabulary & Dataloader Classes
First we create a class for managing our vocabulary. In this project, we have a separate class for the vocabulary because we need two different vocabularies: one for English and one for Spanish.
Then we prepare the dataloader and make sure it returns the source sentence and target sentence.
"""
### DO NOT EDIT ###
class Vocab_Lang():
def __init__(self, vocab):
self.word2idx = {'<pad>': 0, '<unk>': 1}
self.idx2word = {0: '<pad>', 1: '<unk>'}
self.vocab = vocab
for index, word in enumerate(vocab):
self.word2idx[word] = index + 2 # +2 because of <pad> and <unk> token
self.idx2word[index + 2] = word
def __len__(self):
return len(self.word2idx)
def __repr__(self):
if len(self.vocab) <= 5:
return str(self.vocab)
else:
return f'Vocab_Lang object with {len(self.vocab)} words'
class MyData(Dataset):
def __init__(self, X, y):
self.length = torch.LongTensor([np.sum(1 - np.equal(x, 0)) for x in X])
self.data = torch.LongTensor(X)
self.target = torch.LongTensor(y)
def __getitem__(self, index):
x = self.data[index]
y = self.target[index]
return x, y
def __len__(self):
return len(self.data)
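# --- Illustrative sketch (editor's addition, not part of the assignment) ---
# A tiny vocabulary and dataset, showing the index mapping ('<pad>'=0, '<unk>'=1, then the
# vocabulary words) and how MyData records the unpadded length of each example. The toy
# words and indices are made up; numpy is imported here so the sketch is self-contained.
if __name__ == '__main__':
    import numpy as np
    _toy_vocab = Vocab_Lang(['hello', 'world'])
    print(_toy_vocab.word2idx)                      # {'<pad>': 0, '<unk>': 1, 'hello': 2, 'world': 3}
    _toy_data = MyData(X=[[2, 3, 0, 0]], y=[[3, 2, 0, 0]])
    print(len(_toy_data), _toy_data.length)         # 1 tensor([2])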
### DO NOT EDIT ###
import numpy as np
import random
from torch.utils.data import DataLoader
### DO NOT EDIT ###
if __name__ == '__main__':
# HYPERPARAMETERS (You may change these if you want, though you shouldn't need to)
BATCH_SIZE = 64
EMBEDDING_DIM = 256
"""## Build Vocabulary"""
### DO NOT EDIT ###
def build_vocabulary(pd_dataframe):
sentences = [sen.split() for sen in pd_dataframe]
vocab = {}
for sen in sentences:
for word in sen:
if word not in vocab:
vocab[word] = 1
return list(vocab.keys())
if __name__ == '__main__':
src_vocab_list = build_vocabulary(data['es'])
trg_vocab_list = build_vocabulary(data['eng'])
"""## Instantiate Datasets
We instantiate our training and validation datasets.
"""
### DO NOT EDIT ###
if __name__ == '__main__':
src_vocab = Vocab_Lang(src_vocab_list)
trg_vocab = Vocab_Lang(trg_vocab_list)
src_tensor, trg_tensor, max_length_src, max_length_trg = preprocess_data_to_tensor(data, src_vocab, trg_vocab)
src_tensor_train, src_tensor_val, trg_tensor_train, trg_tensor_val = train_test_split(src_tensor, trg_tensor)
# Create train and val datasets
train_dataset = MyData(src_tensor_train, trg_tensor_train)
train_dataset = DataLoader(train_dataset, batch_size=BATCH_SIZE, drop_last=True, shuffle=True)
test_dataset = MyData(src_tensor_val, trg_tensor_val)
test_dataset = DataLoader(test_dataset, batch_size=BATCH_SIZE, drop_last=True, shuffle=False)
### DO NOT EDIT ###
if __name__ == '__main__':
idxes = random.choices(range(len(train_dataset.dataset)), k=5)
src, trg = train_dataset.dataset[idxes]
print('Source:', src)
print('Source Dimensions: ', src.size())
print('Target:', trg)
print('Target Dimensions: ', trg.size())
"""# Step 2: Train a Recurrent Neural Network (RNN) [50 points]
In this section you will write a recurrent model for machine translation, and then train and evaluate its results.
Here are some links that you may find helpful:
1. Attention paper: https://arxiv.org/pdf/1409.0473.pdf
2. Explanation of LSTM's & GRU's: https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21
3. Attention explanation: https://towardsdatascience.com/attention-in-neural-networks-e66920838742
4. Another attention explanation: https://towardsdatascience.com/attention-and-its-different-forms-7fc3674d14dc
"""
### DO NOT EDIT ###
import torch.nn as nn
import torch.nn.functional as F
import time
from tqdm.notebook import tqdm
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction, corpus_bleu
"""## <font color='red'>TODO:</font> Encoder Model [10 points]
First we build a recurrent encoder model. Instead of using a fully connected layer as the output, you should return the sequence of outputs of your GRU as well as its final hidden state. These will be used in the decoder.
In this cell, you should implement the `__init__(...)` and `forward(...)` functions, each of which is <b>5 points</b>.
"""
class RnnEncoder(nn.Module):
def __init__(self, src_vocab, embedding_dim, hidden_units):
super(RnnEncoder, self).__init__()
"""
Args:
src_vocab: Vocab_Lang, the source vocabulary
embedding_dim: the dimension of the embedding
hidden_units: The number of features in the GRU hidden state
"""
self.src_vocab = src_vocab # Do not change
vocab_size = len(src_vocab)
### TODO ###
# Initialize embedding layer
self.embedding = nn.Embedding(vocab_size, embedding_dim)
# Initialize a single directional GRU with 1 layer and batch_first=False
self.gru = nn.GRU(embedding_dim, hidden_units, batch_first = False)
def forward(self, x):
"""
Args:
x: source texts, [max_len, batch_size]
Returns:
output: [max_len, batch_size, hidden_units]
hidden_state: [1, batch_size, hidden_units]
Pseudo-code:
- Pass x through an embedding layer and pass the results through the recurrent net
- Return output and hidden states from the recurrent net
"""
### TODO ###
# Pass x through the embedding layer
x_embedded = self.embedding(x)
# Pass the embedded input through the GRU
output, hidden_state = self.gru(x_embedded)
return output, hidden_state
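# --- Illustrative sketch (editor's addition, not part of the assignment) ---
# Feeds a random [max_len, batch_size] batch of token indices through the RnnEncoder above
# and prints the two output shapes described in its docstring. The toy vocabulary and the
# sizes below are arbitrary, and the expected shapes assume the encoder is implemented as a
# single-layer, unidirectional GRU as specified.
if __name__ == '__main__':
    _toy_vocab = Vocab_Lang(['I', 'am', 'here'])
    _enc = RnnEncoder(src_vocab=_toy_vocab, embedding_dim=8, hidden_units=16)
    _x = torch.randint(low=0, high=len(_toy_vocab), size=(10, 4))   # [max_len=10, batch_size=4]
    with torch.no_grad():
        _out, _h = _enc(_x)
    print(_out.shape, _h.shape)   # torch.Size([10, 4, 16]) torch.Size([1, 4, 16])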
"""## Sanity Check: RNN Encoder Model
The code below runs a sanity check for your `RnnEncoder` class. The tests are similar to the hidden ones in Gradescope. However, note that passing the sanity check does <b>not</b> guarantee that you will pass the autograder; it is intended to help you debug.
"""
### DO NOT EDIT ###
if __name__ == '__main__':
# Set random seed
torch.manual_seed(42)
# Create test inputs
embedding_dim = [2, 5, 8]
hidden_units = [50, 100, 200]
sanity_vocab = Vocab_Lang(vocab=["I", "am", "here"])
params = []
inputs = []
for ed in embedding_dim:
for hu in hidden_units:
inp = {}
inp['src_vocab'] = sanity_vocab
inp['embedding_dim'] = ed
inp['hidden_units'] = hu
inputs.append(inp)
# Test init
expected_outputs = [8110, 31210, 122410, 8575, 32125, 124225, 9040, 33040, 126040]
sanityCheckModel(inputs, RnnEncoder, expected_outputs, "init")
print()
# Test forward
inputs = []
batch_sizes = [1, 2]
for hu in hidden_units:
for b in batch_sizes:
inp = {}
inp['embedding_dim'] = EMBEDDING_DIM
inp['src_vocab'] = sanity_vocab
inp["batch_size"] = b
inp['hidden_units'] = hu
inputs.append(inp)
expected_outputs = [torch.Size([16, 1, 50]), torch.Size([16, 2, 50]), torch.Size([16, 1, 100]), torch.Size([16, 2, 100]), torch.Size([16, 1, 200]), torch.Size([16, 2, 200])]
sanityCheckModel(inputs, RnnEncoder, expected_outputs, "forward")
"""## <font color='red'>TODO:</font> Decoder Model [15 points]
We will implement a Decoder model that uses an attention mechanism, as provided in https://arxiv.org/pdf/1409.0473.pdf. We have broken this up into three functions that you need to implement: `__init__(self, ...)`, `compute_attention(self, dec_hs, enc_output)`, and `forward(self, x, dec_hs, enc_output)`:
* <b>`__init__(self, ...)`: [5 points]</b> Instantiate the parameters of your model, and store them in `self` variables.
* <b>`compute_attention(self, dec_hs, enc_output)` [5 points]</b>: Compute the <b>context vector</b>, which is a weighted sum of the encoder output states. Suppose the decoder hidden state at time $t$ is $\mathbf{h}_t$, and the encoder hidden state at time $s$ is $\mathbf{\bar h}_s$. The pseudocode is as follows:
1. <b>Attention scores:</b> Compute real-valued scores for the decoder hidden state $\mathbf{h}_t$ and each encoder hidden state $\mathbf{\bar h}_s$: $$\mathrm{score}(\mathbf{h}_t, \mathbf{\bar h}_s)=
\mathbf{v}_a^T \tanh(\mathbf{W}_1 \mathbf{h}_t +\mathbf{W}_2 \mathbf{\bar h}_s)
$$
Here you should implement the scoring function. A higher score indicates a stronger "affinity" between the decoder state and a specific encoder state.
<font color='green'><b>Hint:</b> the matrices $\mathbf{W}_1$, $\mathbf{W}_2$ and the vector $\mathbf{v_a}$ can all be implemented with `nn.Linear(...)` in Pytorch.</font>
Note that in theory, $\mathbf{v_a}$ could have a different dimension than $\mathbf{h}_t$ and $\mathbf{\bar h}_s$, but you should use the same hidden size for this vector.
2. <b>Attention weights:</b> Normalize the attention scores to obtain a valid probability distribution: $$\alpha_{ts} = \frac{\exp \big (\mathrm{score}(\mathbf{h}_t, \mathbf{\bar h}_s) \big)}{\sum_{s'=1}^S \exp \big (\mathrm{score}(\mathbf{h}_t, \mathbf{\bar h}_{s'}) \big)}$$ Notice that this is just the softmax function, and can be implemented with `F.softmax(...)` in Pytorch.
3. <b>Context vector:</b> Compute a context vector $\mathbf{c}_t$ that is a weighted average of the encoder hidden states, where the weights are given by the attention weights you just computed: $$\mathbf{c}_t=\sum_{s=1}^S \alpha_{ts} \mathbf{\bar h}_s$$
You should return this context vector, along with the attention weights.
* <b>`forward(self, x, dec_hs, enc_output)`: [5 points]</b> Run a <b>single</b> decoding step, resulting in a distribution over the vocabulary for the next token in the sequence. Pseudocode can be found in the docstrings below.
<font color='green'><b>Hint:</b> You should be able to implement all of this <b>without any for loops</b> using the Pytorch library. Also, remember that these operations should operate in parallel for each item in your batch.</font>
"""
class RnnDecoder(nn.Module):
def __init__(self, trg_vocab, embedding_dim, hidden_units):
super(RnnDecoder, self).__init__()
"""
Args:
trg_vocab: Vocab_Lang, the target vocabulary
embedding_dim: The dimension of the embedding
hidden_units: The number of features in the GRU hidden state
"""
self.trg_vocab = trg_vocab # Do not change
vocab_size = len(trg_vocab)
### TODO ###
# Initialize embedding layer
self.embedding = nn.Embedding(vocab_size, embedding_dim)
# Initialize layers to compute attention score
self.W1 = nn.Linear(hidden_units, hidden_units)
self.W2 = nn.Linear(hidden_units, hidden_units)
self.va = nn.Linear(hidden_units, 1)
# Initialize a single directional GRU with 1 layer and batch_first=True
# NOTE: Input to your RNN will be the concatenation of your embedding vector and the context vector
self.gru = nn.GRU(embedding_dim + hidden_units, hidden_units, batch_first=True)
# Initialize fully connected layer
self.fc = nn.Linear(hidden_units, vocab_size)
def compute_attention(self, dec_hs, enc_output):
'''
This function computes the context vector and attention weights.
Args:
dec_hs: Decoder hidden state; [1, batch_size, hidden_units]
enc_output: Encoder outputs; [max_len_src, batch_size, hidden_units]
Returns:
context_vector: Context vector, according to formula; [batch_size, hidden_units]
attention_weights: The attention weights you have calculated; [batch_size, max_len_src, 1]
Pseudo-code:
(1) Compute the attention scores for dec_hs & enc_output
- Hint: You may need to permute the dimensions of the tensors in order to pass them through linear layers
- Output size: [batch_size, max_len_src, 1]
(2) Compute attention_weights by taking a softmax over your scores to normalize the distribution (Make sure that after softmax the normalized scores add up to 1)
- Output size: [batch_size, max_len_src, 1]
(3) Compute context_vector from attention_weights & enc_output
- Hint: You may find it helpful to use torch.sum & element-wise multiplication (* operator)
(4) Return context_vector & attention_weights
'''
context_vector, attention_weights = None, None
### TODO ###
# (1) Compute the attention scores for dec_hs & enc_output
attention_scores = self.va(torch.tanh(self.W1(dec_hs.permute(1,0,2)) + self.W2(enc_output.permute(1,0,2))))
# (2) Compute attention_weights by taking a softmax over the scores
attention_weights = F.softmax(attention_scores, dim=1)
# (3) Compute context_vector from attention_weights & enc_output
context_vector = torch.sum(attention_weights * enc_output.permute(1,0,2), dim=1) # [batch_size, hidden_units]
return context_vector, attention_weights
def forward(self, x, dec_hs, enc_output):
'''
This function runs the decoder for a **single** time step.
Args:
x: Input token; [batch_size, 1]
dec_hs: Decoder hidden state; [1, batch_size, hidden_units]
enc_output: Encoder outputs; [max_len_src, batch_size, hidden_units]
Returns:
fc_out: (Unnormalized) output distribution [batch_size, vocab_size]
dec_hs: Decoder hidden state; [1, batch_size, hidden_units]
attention_weights: The attention weights you have learned; [batch_size, max_len_src, 1]
Pseudo-code:
(1) Compute the context vector & attention weights by calling self.compute_attention(...) on the appropriate input
(2) Obtain embedding vectors for your input x
- Output size: [batch_size, 1, embedding_dim]
(3) Concatenate the context vector & the embedding vectors along the appropriate dimension
(4) Feed this result through your RNN (along with the current hidden state) to get output and new hidden state
- Output sizes: [batch_size, 1, hidden_units] & [1, batch_size, hidden_units]
(5) Feed the output of your RNN through linear layer to get (unnormalized) output distribution (don't call softmax!)
(6) Return this output, the new decoder hidden state, & the attention weights
'''
fc_out, attention_weights = None, None
### TODO ###
# (1) Compute the context vector & attention weights
context_vector, attention_weights = self.compute_attention(dec_hs, enc_output)
# (2) Obtain embedding vectors for input x
x_embedded = self.embedding(x).squeeze(1)
# (3) Concatenate the context vector & the embedding vectors
x_concat = torch.cat((x_embedded, context_vector), dim=1).unsqueeze(1)
# (4) Feed this result through the RNN (along with the current hidden state) to get output and new hidden state
rnn_output, dec_hs = self.gru(x_concat, dec_hs)
# (5) Feed the output of the RNN through linear layer to get (unnormalized) output distribution
fc_out = self.fc(rnn_output.squeeze(1))
return fc_out, dec_hs, attention_weights
"""## Sanity Check: RNN Decoder Model
The code below runs a sanity check for your `RnnDecoder` class. The tests are similar to the hidden ones in Gradescope. However, note that passing the sanity check does <b>not</b> guarantee that you will pass the autograder; it is intended to help you debug.
"""
### DO NOT EDIT ###
def sanityCheckDecoderModelForward(inputs, NN, expected_outputs):
print('--- TEST: Output shape of forward(...) ---\n')
expected_fc_outs = expected_outputs[0]
expected_dec_hs = expected_outputs[1]
expected_attention_weights = expected_outputs[2]
msg = ''
for i, inp in enumerate(inputs):
input_rep = '{'
for k,v in inp.items():
if torch.is_tensor(v):
input_rep += str(k) + ': ' + 'Tensor with shape ' + str(v.size()) + ', '
else:
input_rep += str(k) + ': ' + str(v) + ', '
input_rep += '}'
dec = RnnDecoder(trg_vocab=inp['trg_vocab'],embedding_dim=inp['embedding_dim'],hidden_units=inp['hidden_units'])
dec_hs = torch.rand(1, inp["batch_size"], inp['hidden_units'])
x = torch.randint(low=0,high=len(inp["trg_vocab"]),size=(inp["batch_size"], 1))
with torch.no_grad():
dec_out = dec(x=x, dec_hs=dec_hs,enc_output=inp['encoder_outputs'])
if not isinstance(dec_out, tuple):
msg = '\tFAILED\tYour RnnDecoder.forward() output must be a tuple; received ' + str(type(dec_out))
print(msg)
continue
elif len(dec_out)!=3:
msg = '\tFAILED\tYour RnnDecoder.forward() output must be a tuple of size 3; received tuple of size ' + str(len(dec_out))
print(msg)
continue
stu_fc_out, stu_dec_hs, stu_attention_weights = dec_out
del dec
has_passed = True
msg = ""
if not torch.is_tensor(stu_fc_out):
has_passed = False
msg += '\tFAILED\tOutput must be a torch.Tensor; received ' + str(type(stu_fc_out)) + " "
if not torch.is_tensor(stu_dec_hs):
has_passed = False
msg += '\tFAILED\tDecoder Hidden State must be a torch.Tensor; received ' + str(type(stu_dec_hs)) + " "
if not torch.is_tensor(stu_attention_weights):
has_passed = False
msg += '\tFAILED\tAttention Weights must be a torch.Tensor; received ' + str(type(stu_attention_weights)) + " "
status = 'PASSED' if has_passed else 'FAILED'
if not has_passed:
message = '\t' + status + "\t Init Input: " + input_rep + '\tForward Input Shape: ' + str(inp['encoder_outputs'].shape) + '\tExpected Output Shape: ' + str(expected_fc_outs[i]) + '\t' + msg
print(message)
continue
has_passed = stu_fc_out.size() == expected_fc_outs[i]
msg = 'Your Output Shape: ' + str(stu_fc_out.size())
status = 'PASSED' if has_passed else 'FAILED'
message = '\t' + status + "\t Init Input: " + input_rep + '\tForward Input Shape: ' + str(inp['encoder_outputs'].shape) + '\tExpected Output Shape: ' + str(expected_fc_outs[i]) + '\t' + msg
print(message)
has_passed = stu_dec_hs.size() == expected_dec_hs[i]
msg = 'Your Hidden State Shape: ' + str(stu_dec_hs.size())
status = 'PASSED' if has_passed else 'FAILED'
message = '\t' + status + "\t Init Input: " + input_rep + '\tForward Input Shape: ' + str(inp['encoder_outputs'].shape) + '\tExpected Hidden State Shape: ' + str(expected_dec_hs[i]) + '\t' + msg
print(message)
has_passed = stu_attention_weights.size() == expected_attention_weights[i]
msg = 'Your Attention Weights Shape: ' + str(stu_attention_weights.size())
status = 'PASSED' if has_passed else 'FAILED'
message = '\t' + status + "\t Init Input: " + input_rep + '\tForward Input Shape: ' + str(inp['encoder_outputs'].shape) + '\tExpected Attention Weights Shape: ' + str(expected_attention_weights[i]) + '\t' + msg
print(message)
stu_sum = stu_attention_weights.sum(dim=1).squeeze()
if torch.allclose(stu_sum, torch.ones_like(stu_sum), atol=1e-5):
print('\tPASSED\t The sum of your attention_weights along dim 1 is 1.')
else:
print('\tFAILED\t The sum of your attention_weights along dim 1 is not 1.')
print()
### DO NOT EDIT ###
if __name__ == '__main__':
# Set random seed
torch.manual_seed(42)
# Create test inputs
embedding_dim = [2, 5, 8]
hidden_units = [50, 100, 200]
sanity_vocab = Vocab_Lang(vocab=["I", "am", "here"])
params = []
inputs = []
for ed in embedding_dim:
for hu in hidden_units:
inp = {}
inp['trg_vocab'] = sanity_vocab
inp['embedding_dim'] = ed
inp['hidden_units'] = hu
inputs.append(inp)
# Test init
expected_outputs = [21016, 82016, 324016, 21481, 82931, 325831, 21946, 83846, 327646]
sanityCheckModel(inputs, RnnDecoder, expected_outputs, "init")
print()
# Test forward
inputs = []
hidden_units = [50, 100, 200]
batch_sizes = [1, 2, 4]
embedding_dims = iter([50,80,100,120,150,200,300,400,500])
encoder_outputs = iter([torch.rand([16, 1, 50]), torch.rand([16, 2, 50]), torch.rand([16, 4, 50]), torch.rand([16, 1, 100]), torch.rand([16, 2, 100]), torch.rand([16, 4, 100]), torch.rand([16, 1, 200]), torch.rand([16, 2, 200]),torch.rand([16, 4, 200])])
expected_fc_outs = [torch.Size([1, 5]),torch.Size([2, 5]),torch.Size([4, 5]),torch.Size([1, 5]),torch.Size([2, 5]),torch.Size([4, 5]),torch.Size([1, 5]),torch.Size([2, 5]),torch.Size([4, 5])]
expected_dec_hs = [torch.Size([1, 1, 50]), torch.Size([1, 2, 50]), torch.Size([1, 4, 50]), torch.Size([1, 1, 100]), torch.Size([1, 2, 100]), torch.Size([1, 4, 100]), torch.Size([1, 1, 200]), torch.Size([1, 2, 200]), torch.Size([1, 4, 200])]
expected_attention_weights = [torch.Size([1, 16, 1]), torch.Size([2, 16, 1]), torch.Size([4, 16, 1]), torch.Size([1, 16, 1]), torch.Size([2, 16, 1]), torch.Size([4, 16, 1]), torch.Size([1, 16, 1]), torch.Size([2, 16, 1]), torch.Size([4, 16, 1])]
expected_outputs = (expected_fc_outs, expected_dec_hs, expected_attention_weights)
for hu in hidden_units:
for b in batch_sizes:
inp = {}
edim = next(embedding_dims)
inp['embedding_dim'] = edim
inp['trg_vocab'] = sanity_vocab
inp["batch_size"] = b
inp['hidden_units'] = hu
inp['encoder_outputs'] = next(encoder_outputs)
inputs.append(inp)
sanityCheckDecoderModelForward(inputs, RnnDecoder, expected_outputs)
"""## Train RNN Model
We will train the encoder and decoder using cross-entropy loss.
"""
### DO NOT EDIT ###
def loss_function(real, pred):
mask = real.ge(1).float() # Only consider non-zero inputs in the loss
loss_ = F.cross_entropy(pred, real) * mask
return torch.mean(loss_)
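# --- Illustrative sketch (editor's addition, not part of the assignment) ---
# How real.ge(1) builds the mask used in loss_function to down-weight <pad> positions
# (index 0): the mask is 0 at padding entries and 1 everywhere else. The toy tensor is made up.
if __name__ == '__main__':
    _real = torch.tensor([5, 2, 0, 7])
    print(_real.ge(1).float())    # tensor([1., 1., 0., 1.])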
def train_rnn_model(encoder, decoder, dataset, optimizer, trg_vocab, device, n_epochs):
batch_size = dataset.batch_size
for epoch in range(n_epochs):
start = time.time()
n_batch = 0
total_loss = 0
encoder.train()
decoder.train()
for src, trg in tqdm(dataset):
n_batch += 1
loss = 0
enc_output, enc_hidden = encoder(src.transpose(0,1).to(device))
dec_hidden = enc_hidden
# use teacher forcing - feeding the target as the next input (via dec_input)
dec_input = torch.tensor([[trg_vocab.word2idx['<start>']]] * batch_size)
# run the code below for every timestep in the target batch
for t in range(1, trg.size(1)):
predictions, dec_hidden, _ = decoder(dec_input.to(device), dec_hidden.to(device), enc_output.to(device))
assert len(predictions.shape) == 2 and predictions.shape[0] == dec_input.shape[0] and predictions.shape[1] == len(trg_vocab.word2idx), "First output of decoder must have shape [batch_size, vocab_size], you returned shape " + str(predictions.shape)
loss += loss_function(trg[:, t].to(device), predictions.to(device))
dec_input = trg[:, t].unsqueeze(1)
batch_loss = (loss / int(trg.size(1)))
total_loss += batch_loss
optimizer.zero_grad()
batch_loss.backward()
### update model parameters
optimizer.step()
### TODO: Save checkpoint for model (optional)
print('Epoch:{:2d}/{}\t Loss: {:.4f} \t({:.2f}s)'.format(epoch + 1, n_epochs, total_loss / n_batch, time.time() - start))
print('Model trained!')
### DO NOT EDIT ###
if __name__ == '__main__':
# HYPERPARAMETERS - feel free to change
LEARNING_RATE = 0.001
HIDDEN_UNITS=256
N_EPOCHS=10
rnn_encoder = RnnEncoder(src_vocab, EMBEDDING_DIM, HIDDEN_UNITS).to(DEVICE)
rnn_decoder = RnnDecoder(trg_vocab, EMBEDDING_DIM, HIDDEN_UNITS).to(DEVICE)
rnn_model_params = list(rnn_encoder.parameters()) + list(rnn_decoder.parameters())
optimizer = torch.optim.Adam(rnn_model_params, lr=LEARNING_RATE)
print('Encoder and Decoder models initialized!')
### DO NOT EDIT ###
if __name__ == '__main__':
train_rnn_model(rnn_encoder, rnn_decoder, train_dataset, optimizer, trg_vocab, DEVICE, N_EPOCHS)
"""## <font color='red'>TODO:</font> Inference (Decoding) Function [15 points]
Now that we have trained the model, we can use it on test data.
Here, you will write a function that takes your trained model and a source sentence (Spanish), and returns its translation (English sentence). Instead of using teacher forcing, the input to the decoder at time step $t_i$ will be the prediction of the decoder at time $t_{i-1}$.
"""
def decode_rnn_model(encoder, decoder, src, max_decode_len, device):
"""
Args:
encoder: Your RnnEncoder object
decoder: Your RnnDecoder object
src: [max_src_length, batch_size] the source sentences you wish to translate
max_decode_len: The maximum desired length (int) of your target translated sentences
device: the device your torch tensors are on (you may need to call x.to(device) for some of your tensors)
Returns:
curr_output: [batch_size, max_decode_len] containing your predicted translated sentences
curr_predictions: [batch_size, max_decode_len, trg_vocab_size] containing the (unnormalized) probabilities of each
token in your vocabulary at each time step
Pseudo-code:
- Obtain encoder output and hidden state by encoding src sentences
- For 1 ≤ t ≤ max_decode_len:
- Obtain your (unnormalized) prediction probabilities and hidden state by feeding dec_input (the best words
from the previous time step), previous hidden state, and encoder output to decoder
- Save your (unnormalized) prediction probabilities in curr_predictions at index t
- Obtain your new dec_input by selecting the most likely (highest probability) token
- Save dec_input in curr_output at index t
"""
# Initialize variables
trg_vocab = decoder.trg_vocab
batch_size = src.size(1)
curr_output = torch.zeros((batch_size, max_decode_len))
curr_predictions = torch.zeros((batch_size, max_decode_len, len(trg_vocab.idx2word)))
# We start the decoding with the start token for each example
dec_input = torch.tensor([[trg_vocab.word2idx['<start>']]] * batch_size).to(device)
curr_output[:, 0] = dec_input.squeeze(1)
### TODO: Implement decoding algorithm ###
# Obtain encoder output and hidden state by encoding src sentences
enc_output, hidden = encoder(src)
for t in range(1, max_decode_len):
# Obtain (unnormalized) prediction probabilities and hidden state
prediction, hidden, _ = decoder(dec_input, hidden, enc_output)
# Save (unnormalized) prediction probabilities in curr_predictions at index t
curr_predictions[:, t, :] = prediction
# Obtain new dec_input by selecting the most likely (highest probability) token
dec_input = torch.argmax(prediction, dim = 1).unsqueeze(1)
# Save dec_input in curr_output at index t
curr_output[:, t] = dec_input.squeeze(1)
return curr_output, curr_predictions
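# --- Illustrative sketch (editor's addition, not part of the assignment) ---
# The greedy step used in decode_rnn_model: take the argmax over a [batch_size, vocab_size]
# logits tensor and reshape it to [batch_size, 1] so it can be fed back in as the next decoder
# input. The random logits below are only for illustration.
if __name__ == '__main__':
    _logits = torch.rand(3, 7)                               # [batch_size=3, vocab_size=7]
    _next_input = torch.argmax(_logits, dim=1).unsqueeze(1)  # [batch_size=3, 1]
    print(_next_input.shape)                                 # torch.Size([3, 1])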
"""You can run the cell below to qualitatively compare some of the sentences your model generates with the some of the correct translations."""
### DO NOT EDIT ###
if __name__ == '__main__':
rnn_encoder.eval()
rnn_decoder.eval()
idxes = random.choices(range(len(test_dataset.dataset)), k=5)
src, trg = test_dataset.dataset[idxes]
curr_output, _ = decode_rnn_model(rnn_encoder, rnn_decoder, src.transpose(0,1).to(DEVICE), trg.size(1), DEVICE)
for i in range(len(src)):
print("Source sentence:", ' '.join([x for x in [src_vocab.idx2word[j.item()] for j in src[i]] if x != '<pad>']))
print("Target sentence:", ' '.join([x for x in [trg_vocab.idx2word[j.item()] for j in trg[i]] if x != '<pad>']))
trg_decoding = [x for x in [trg_vocab.idx2word[j.item()] for j in curr_output[i]] if x != '<pad>']
for j in range(len(trg_decoding)):
if trg_decoding[j] == '<end>':
trg_decoding = trg_decoding[:(j+1)]
break
print("Predicted sentence:", ' '.join(trg_decoding))
print("----------------")
"""## Evaluate RNN Model [10 points]
We provide you with a function to run the test set through the model and calculate BLEU scores. We expect your BLEU scores to satisfy the following conditions:
* BLEU-1 > 0.290
* BLEU-2 > 0.081
* BLEU-3 > 0.059
* BLEU-4 > 0.056
Read more about the BLEU score at:
1. https://en.wikipedia.org/wiki/BLEU
2. https://www.aclweb.org/anthology/P02-1040.pdf
"""
### DO NOT EDIT ###
def evaluate_rnn_model(encoder, decoder, test_dataset, target_tensor_val, device):
trg_vocab = decoder.trg_vocab
batch_size = test_dataset.batch_size
n_batch = 0
total_loss = 0
encoder.eval()
decoder.eval()
final_output, target_output = None, None
with torch.no_grad():
for batch, (src, trg) in enumerate(test_dataset):
n_batch += 1
loss = 0
curr_output, curr_predictions = decode_rnn_model(encoder, decoder, src.transpose(0,1).to(device), trg.size(1), device)
for t in range(1, trg.size(1)):
loss += loss_function(trg[:, t].to(device), curr_predictions[:,t,:].to(device))
if final_output is None:
final_output = torch.zeros((len(target_tensor_val), trg.size(1)))
target_output = torch.zeros((len(target_tensor_val), trg.size(1)))
final_output[batch*batch_size:(batch+1)*batch_size] = curr_output
target_output[batch*batch_size:(batch+1)*batch_size] = trg
batch_loss = (loss / int(trg.size(1)))
total_loss += batch_loss
print('Loss {:.4f}'.format(total_loss / n_batch))
# Compute BLEU scores
return compute_bleu_scores(target_tensor_val, target_output, final_output, trg_vocab)
### DO NOT EDIT ###
if __name__ == '__main__':
rnn_save_candidate, rnn_scores = evaluate_rnn_model(rnn_encoder, rnn_decoder, test_dataset, trg_tensor_val, DEVICE)
"""# Step 3: Train a Transformer [50 points]
In this section you will write a transformer model for machine translation, and then train and evaluate its results.
Here are some helpful links:
<ul>
<li> Original transformer paper: <a href='https://arxiv.org/pdf/1706.03762.pdf'>https://arxiv.org/pdf/1706.03762.pdf</a>
<li> Helpful tutorial: <a href='http://jalammar.github.io/illustrated-transformer/'>http://jalammar.github.io/illustrated-transformer/</a>
<li> Another tutorial: <a href='http://peterbloem.nl/blog/transformers'>http://peterbloem.nl/blog/transformers</a>
</ul>
"""
### DO NOT EDIT ###
import math
"""## <font color='red'>TODO:</font> Positional Embeddings [5 points]
Similar to the RNN, we start with the Encoder model. A key component of the encoder is the Positional Embedding. As we know, word embeddings encode words in such a way that words with similar meaning have similar vectors. Because there are no recurrences in a Transformer, we need a way to tell the transformer the relative position of words in a sentence: so we will add a positional embedding to the word embeddings. Now, two words with similar embeddings will both be close in meaning and occur near each other in the sentence.
You will create a positional embedding matrix of size $(max\_len, embed\_dim)$ using the following formulae:
<br>
$\begin{align*} pe[pos,2i] &= \sin \Big (\frac{pos}{10000^{2i/embed\_dim}}\Big )\\pe[pos,2i+1] &= \cos \Big (\frac{pos}{10000^{2i/embed\_dim}}\Big ) \end{align*}$
<font color='green'><b>Hint:</b> You should probably take the logarithm of the denominator to avoid raising $10000$ to an exponent, and then exponentiate the result before plugging it into the fraction. This will help you avoid numerical (overflow/underflow) issues.</font>
<font color='green'><b>Hint:</b> We encourage you to try to implement this function with no for loops, which is the general practice (as it is faster). However, since we are using relatively small datasets, you are welcome to do this with for loops if you prefer.</font>
"""
def create_positional_embedding(max_len, embed_dim):
'''
Args:
max_len: The maximum length supported for positional embeddings
embed_dim: The size of your embeddings
Returns:
pe: [max_len, 1, embed_dim] computed as in the formulae above
'''
pe = None
### TODO ###
pe = np.zeros((max_len, 1, embed_dim))
positions = np.arange(max_len)[:, np.newaxis]
div_term = np.exp(-(math.log(10000.0) * np.arange(0, embed_dim, 2) / embed_dim))
pe[:, 0, 0::2] = np.sin(positions * div_term)