I set about learning the Transformer architecture from first principles. Before long I realized
I needed to read many arXiv research papers, starting with 'Attention Is All You Need'.
But there are also older foundational papers that lead up to it.
The most confounding aspect of machine learning for me is the math. One way to approach it is to watch
several videos, as I did, and take notes. Another way to develop intuition is to code a few algorithms
and execute them on a CPU or GPU, provided one is familiar with a framework.
I believe both efforts should proceed in parallel in order to understand the algorithm and the math.
But before understanding the Transformer architecture we need some intuition for the architectures
it improves upon, such as recurrent neural networks. So here I start with an RNN, a simple LSTM that generates
characters. There are several steps before reaching the target Transformer architecture.
TensorFlow Code
tf.lookup.StaticHashTable
There are simpler ways to store data in lookup data structures, but TensorFlow also provides a hash table,
which I use to store the individual characters in the text and their associated indices.
There is one table for looking up the index of a character and another for looking up the character
at a given index.
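As a minimal sketch of how such a pair of tables can be built (the placeholder corpus and variable names here are mine, not those of the original code):

import tensorflow as tf

text = "hello world"  # placeholder corpus; the original uses its own training text
vocab = sorted(set(text))

keys = tf.constant(vocab)
values = tf.range(len(vocab), dtype=tf.int64)

# Character -> index lookup table
char_to_index = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(keys, values),
    default_value=-1)

# Index -> character lookup table
index_to_char = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(values, keys),
    default_value="")

ids = char_to_index.lookup(tf.strings.unicode_split(text, "UTF-8"))
chars = index_to_char.lookup(ids)

The second table simply reverses the keys and values of the first, so a generated sequence of indices can be mapped back to readable characters.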
Part 1
The first part can be coded using our TensorFlow knowledge, as it is about data preparation. My code
is not based on any established pattern like Part 2 is, and it may not even be idiomatic. This is a learning exercise, and
the code was constructed by referring to the TensorFlow API.
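My code is not shown here, but one common way to prepare character data for this task, sketched below under assumptions of my own (the `char_to_index` table and `text` from the earlier sketch, and illustrative `seq_length` and `batch_size` values), is to slice the stream of character ids into overlapping input/target windows with tf.data:

import tensorflow as tf

seq_length = 100   # illustrative window length
batch_size = 64    # illustrative batch size

all_ids = char_to_index.lookup(tf.strings.unicode_split(text, "UTF-8"))
dataset = tf.data.Dataset.from_tensor_slices(all_ids)

# Cut the id stream into windows of seq_length + 1 characters.
sequences = dataset.batch(seq_length + 1, drop_remainder=True)

def split_input_target(chunk):
    # The input is the window without its last character,
    # the target is the window shifted one character to the right.
    return chunk[:-1], chunk[1:]

dataset = (sequences
           .map(split_input_target)
           .shuffle(10000)
           .batch(batch_size, drop_remainder=True))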
Part 2
This is the part that can be copied, provided one understands most of the network
architecture.
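As a hedged sketch of what such a character-level LSTM could look like in Keras (the layer sizes are illustrative and are not taken from the original code; `vocab` is from the earlier sketch):

import tensorflow as tf

vocab_size = len(vocab)
embedding_dim = 64
lstm_units = 256

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim),
    tf.keras.layers.LSTM(lstm_units, return_sequences=True),
    # One logit per vocabulary character at every time step.
    tf.keras.layers.Dense(vocab_size),
])

model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

# model.fit(dataset, epochs=20)

Training against the input/target windows from Part 1 drives the network to predict the next character at each position.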
Graph of epoch loss
I expected the epoch loss to be driven lower than what the graph shows. This remains to be investigated.
Self-Attention
An explanation follows. This diagram is based on Sebastian Raschka's
explanatory diagram. I use TikZ, and the code is here
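As a rough sketch of the scaled dot-product self-attention that the diagram illustrates (the names and shapes are my own, assuming a single sequence and a single attention head):

import tensorflow as tf

def self_attention(x, w_q, w_k, w_v):
    # x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_k) projection matrices.
    q = tf.matmul(x, w_q)
    k = tf.matmul(x, w_k)
    v = tf.matmul(x, w_v)
    d_k = tf.cast(tf.shape(k)[-1], tf.float32)
    # Pairwise similarity of every position with every other position.
    scores = tf.matmul(q, k, transpose_b=True) / tf.sqrt(d_k)  # (seq_len, seq_len)
    weights = tf.nn.softmax(scores, axis=-1)
    # Each output row is a weighted mixture of the value vectors.
    return tf.matmul(weights, v)  # (seq_len, d_k)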