The code is ported from PyTorch to TensorFlow 2.x using various references (for example, https://nn.labml.ai/transformers/rope/index.html).
TensorFlow is quite verbose, and the port is sometimes difficult because every line of PyTorch has to be translated.
In a few cases the code can be ported directly with minimal changes, but this is not common.
The math supporting the algorithm is only partially understood. There are several research papers to read.
The RoPE embeddings and attention need more insight and explanation. In fact, each section will need multiple diagrams and descriptions.
Batches
RMSNorm
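Since one of the questions below is whether the RMSNorm algorithm is correct, here is a minimal sketch of the standard RMSNorm formulation (normalize by the root mean square over the feature axis, then apply a learned scale) in TensorFlow. The class name, the eps value and the weight initialization are illustrative choices of mine, not necessarily those of the ported code.

import tensorflow as tf

class RMSNorm(tf.keras.layers.Layer):
    # Minimal sketch: y = x / sqrt(mean(x^2) + eps) * weight
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = self.add_weight(name="weight", shape=(dim,), initializer="ones")

    def call(self, x):
        # Mean of squares over the feature (last) axis, keeping dims for broadcasting
        ms = tf.reduce_mean(tf.square(x), axis=-1, keepdims=True)
        return x * tf.math.rsqrt(ms + self.eps) * self.weight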
TensorFlow Tests
Simple tests like these are supported by TensorFlow. I have learnt to use the Python console from within PyCharm, as the tests are sometimes not recognized by the IDE.
There is one other issue that I don't fully understand.
The test failed because the values are not close enough for the given tolerance.
My questions are these.
Is the RMSNorm algorithm correct? Should I read any material or code to improve it if it is wrong?
Can I use different tolerance levels to pass the test? The API _self.assertAllClose_ takes tolerance levels as parameters, and if I pass 50 (for example) for the upper and lower limits, the test passes (see the sketch below).
I can also ignore the failure as the values seem to be close.
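For reference, this is roughly how the tolerances are passed to _self.assertAllClose_; the rtol/atol values and the two RMSNorm helpers below are placeholders of mine, not the actual test.

import tensorflow as tf

class RMSNormTest(tf.test.TestCase):
    def test_rmsnorm_values_close(self):
        x = tf.random.normal((2, 8, 16))
        expected = reference_rmsnorm(x)   # hypothetical: values from the PyTorch reference
        actual = ported_rmsnorm(x)        # hypothetical: values from the TensorFlow port
        # rtol and atol are the relative and absolute tolerances; loosening them
        # (e.g. from the default 1e-6 to 1e-2) makes the comparison less strict.
        self.assertAllClose(expected, actual, rtol=1e-2, atol=1e-2)

if __name__ == "__main__":
    tf.test.main()

A tolerance of 50 makes almost any comparison pass, so a smaller value (on the order of 1e-3 to 1e-2 for float32) is usually a more meaningful compromise.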
Rotary Positional Embeddings (RoPE)
Code used to test RoPE embeddings
RoPE
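As a rough illustration of what this code computes, here is a minimal sketch of rotary embeddings in TensorFlow using the common rotate-half formulation; the names rope_cache and apply_rope, the base of 10000 and the tensor layout are illustrative choices of mine.

import tensorflow as tf

def rope_cache(seq_len, head_dim, base=10000.0):
    # Per-pair frequencies: theta_i = base^(-2i / head_dim)
    inv_freq = 1.0 / tf.pow(base, tf.range(0, head_dim, 2, dtype=tf.float32) / head_dim)
    positions = tf.range(seq_len, dtype=tf.float32)
    angles = tf.einsum("s,d->sd", positions, inv_freq)   # (seq_len, head_dim / 2)
    angles = tf.concat([angles, angles], axis=-1)        # (seq_len, head_dim)
    return tf.cos(angles), tf.sin(angles)

def apply_rope(x, cos, sin):
    # x: (batch, seq_len, n_heads, head_dim); rotate pairs of dimensions by position-dependent angles
    x1, x2 = tf.split(x, 2, axis=-1)
    rotated = tf.concat([-x2, x1], axis=-1)
    cos = cos[None, :, None, :]
    sin = sin[None, :, None, :]
    return x * cos + rotated * sin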
RoPE Attention
This is one section that exemplifies the lack of diagrams and descriptions; those can be added as more intuition is gained. But for now the code is here.
This visualization of attention weights seems to be different from the PyTorch code I was using as a reference. The difference is in the shape of the attention weights, but by slightly changing the indexing we can still visualize them. I need to revisit this later.
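To make the shapes concrete, a simplified sketch of attention with RoPE applied to the queries and keys could look like the following. It reuses the hypothetical rope_cache/apply_rope helpers above, and the (batch, seq_len, n_heads, head_dim) layout is one possible convention, not necessarily the one in my port.

import tensorflow as tf

def rope_attention(q, k, v, cos, sin):
    # q, k, v: (batch, seq_len, n_heads, head_dim)
    q = apply_rope(q, cos, sin)
    k = apply_rope(k, cos, sin)
    d = tf.cast(tf.shape(q)[-1], tf.float32)
    # scores and weights: (batch, n_heads, seq_len, seq_len)
    scores = tf.einsum("bqhd,bkhd->bhqk", q, k) / tf.sqrt(d)
    weights = tf.nn.softmax(scores, axis=-1)   # e.g. plot weights[0, h] to visualize head h
    out = tf.einsum("bhqk,bkhd->bqhd", weights, v)
    return out, weights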
Masked RoPE Attention
At this stage one loses count of the shapes and sizes of the matrices; it is hard to keep track of ranks and shapes because they cannot be inspected visually.
But using the idea of upper and lower triangular matrices we can mask the attention scores so that data from later time steps is not visible, because the model is supposed to predict it.
I use block_size (tf.shape(x)[1]) for this, as the shape of the data is (batch_size, block_size, embedding_dim).
Even though the mask seems to work, I couldn't debug it very deeply.
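For reference, the masking described above can be sketched as an additive mask built from a lower-triangular matrix. This is one common way to implement it and may differ in detail from the actual code.

import tensorflow as tf

def causal_mask(block_size):
    # Lower-triangular matrix of ones: position i may only attend to positions <= i
    tril = tf.linalg.band_part(tf.ones((block_size, block_size)), -1, 0)
    # Disallowed (upper-triangular) entries become -inf additive penalties
    return tf.where(tf.equal(tril, 1.0), 0.0, float("-inf"))

# Inside the attention, before the softmax:
#   block_size = tf.shape(x)[1]
#   scores = scores + causal_mask(block_size)   # broadcasts over batch and heads
#   weights = tf.nn.softmax(scores, axis=-1)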
The loss after all this effort is shown here.
Stacking multiple blocks
In this final part we segregate these blocks and stack them in the model.
There are now 4 of them stacked like this.
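A minimal sketch of that stacking, with a hypothetical TransformerBlock layer standing in for the masked RoPE attention plus feed-forward block, and the RMSNorm from the earlier sketch:

import tensorflow as tf

class LlamaStyleModel(tf.keras.Model):
    # Sketch: token embedding, 4 stacked blocks, a final norm and an output projection
    def __init__(self, vocab_size, embedding_dim, n_blocks=4):
        super().__init__()
        self.embed = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.blocks = [TransformerBlock(embedding_dim) for _ in range(n_blocks)]  # hypothetical block layer
        self.norm = RMSNorm(embedding_dim)          # from the sketch in the RMSNorm section
        self.head = tf.keras.layers.Dense(vocab_size)

    def call(self, tokens):
        x = self.embed(tokens)          # (batch_size, block_size, embedding_dim)
        for block in self.blocks:
            x = block(x)
        return self.head(self.norm(x))  # logits over the vocabulary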
Hyper-parameter tuning
This is a separate task that is pending; we need GPUs and more training epochs. But the Llama Transformer architecture itself works now.
The final losses are worse than before. Moreover, the text generated after this minimal training is gibberish and is not even as good as the text from a general GPT model with a similar architecture.