Rough Notes: Deep Learning From The Ground Up
September 3, 2023
Can be distilled into fundamental ops:
- load
- store
- multiply accumulate (basically very small matrix multiplication)
- elementwise
Elementwise ops have always been first-class ops on the GPU. The entire premise of using a GPU is that the hardware is specialized for these kinds of ops. Multiply accumulate can be viewed as a combination of elementwise ops + a reduction.
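A minimal NumPy sketch of that view: a small matmul is just a broadcasted elementwise multiply followed by a sum reduction over K.

```python
# Sketch: multiply-accumulate as elementwise multiply + reduction (NumPy).
import numpy as np

A = np.random.randn(8, 8).astype(np.float32)   # (M, K)
B = np.random.randn(8, 8).astype(np.float32)   # (K, N)

elementwise = A[:, :, None] * B[None, :, :]    # (M, K, N) broadcasted multiply
C = elementwise.sum(axis=1)                    # reduce over K

assert np.allclose(C, A @ B, atol=1e-4)
```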
In recent years, specialized ops + hardware have emerged so that multiply accumulate can be made even faster. The core premise is to make very small matrix multiplications an atomic operation. These are Tensor Cores in NVIDIA parlance. Leveraging these cores for matrix multiplications can greatly increase TFLOPs.
From the Flash Attention 2 post:
> As an example, the A100 GPU has a max theoretical throughput of 312 TFLOPs of FP16/BF16 matmul, but only 19.5 TFLOPs of non-matmul FP32. Another way to think about this is that each non-matmul FLOP is 16x more expensive than a matmul FLOP. To maintain high throughput, we want to spend as much time on matmul FLOPs as possible.
APIs for NVIDIA and Metal.
NVIDIA:
- `fragment`: a section of a matrix distributed across all threads in the warp (the warp being a grouping of threads); the rest of these functions operate on the `fragment` type
- `fill_fragment`
- `load_matrix_sync`, `store_matrix_sync`, `mma_sync` (`sync` refers to syncing across the warp lanes)
Metal:
- `simdgroup_load`
- `simdgroup_multiply_accumulate`
- `simdgroup_store`
Example use:
```metal
kernel void float_matmad(device float *pmata,
                         device float *pmatb,
                         device float *pmatc,
                         device float *pmatd)
{
    // 8x8 tiles distributed across the SIMD-group (Metal's analog of a warp).
    simdgroup_float8x8 sgmata;
    simdgroup_float8x8 sgmatb;
    simdgroup_float8x8 sgmatc;
    simdgroup_float8x8 sgmatd;

    // Load the 8x8 tiles from device memory.
    simdgroup_load(sgmata, pmata);
    simdgroup_load(sgmatb, pmatb);
    simdgroup_load(sgmatc, pmatc);

    // D = A * B + C as a single multiply-accumulate.
    simdgroup_multiply_accumulate(sgmatd, sgmata, sgmatb, sgmatc);

    // Write the result tile back out.
    simdgroup_store(sgmatd, pmatd);
}
```
We're likely to see much more hardware and more specialized instruction sets for deep learning. There are already a bunch of startups in this area, but the software is shit. Even NVIDIA's next best competitor (AMD) lags far behind.
Triton is a software abstraction layer over these deep learning instruction sets.
Implementation of matrix multiplication:
```python
import torch
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({'BLOCK_SIZE_M': 128, 'BLOCK_SIZE_N': 256, 'BLOCK_SIZE_K': 64, 'GROUP_SIZE_M': 8}, num_stages=3, num_warps=8),
        triton.Config({'BLOCK_SIZE_M': 64, 'BLOCK_SIZE_N': 256, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=4, num_warps=4),
        triton.Config({'BLOCK_SIZE_M': 128, 'BLOCK_SIZE_N': 128, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=4, num_warps=4),
        triton.Config({'BLOCK_SIZE_M': 128, 'BLOCK_SIZE_N': 64, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=4, num_warps=4),
        triton.Config({'BLOCK_SIZE_M': 64, 'BLOCK_SIZE_N': 128, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=4, num_warps=4),
        triton.Config({'BLOCK_SIZE_M': 128, 'BLOCK_SIZE_N': 32, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=4, num_warps=4),
        triton.Config({'BLOCK_SIZE_M': 64, 'BLOCK_SIZE_N': 32, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=5, num_warps=2),
        triton.Config({'BLOCK_SIZE_M': 32, 'BLOCK_SIZE_N': 64, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=5, num_warps=2),
    ],
    key=['M', 'N', 'K'],
)
@triton.jit
```
Inside the jitted kernel, the main loop accumulates over the K dimension in blocks:

```python
accumulator = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32)
for k in range(0, tl.cdiv(K, BLOCK_SIZE_K)):
    # Load the next block of A and B, generate a mask by checking the K dimension.
    # If it is out of bounds, set it to 0.
    a = tl.load(a_ptrs, mask=offs_k[None, :] < K - k * BLOCK_SIZE_K, other=0.0)
    b = tl.load(b_ptrs, mask=offs_k[:, None] < K - k * BLOCK_SIZE_K, other=0.0)
    # We accumulate along the K dimension.
    accumulator += tl.dot(a, b)
    # Advance the ptrs to the next K block.
    a_ptrs += BLOCK_SIZE_K * stride_ak
    b_ptrs += BLOCK_SIZE_K * stride_bk
# You can fuse arbitrary activation functions here
# while the accumulator is still in FP32!
c = accumulator.to(tl.float16)
```
^ Under the hood, `tl.dot` compiles down to the hardware matmul APIs mentioned above (Tensor Core MMA instructions on NVIDIA).
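For context, here is a rough sketch of how a kernel decorated like this is typically launched. The kernel name `matmul_kernel` and its argument list are hypothetical here, since the excerpt above omits the actual signature; the point is that the launch grid is a function of the meta-parameters, so it adapts to whichever autotuned config wins.

```python
# Hypothetical launch wrapper; `matmul_kernel` and its signature are illustrative only.
import torch
import triton

def matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float16)
    # The grid is computed from the meta-parameters the autotuner selects at launch.
    grid = lambda META: (triton.cdiv(M, META['BLOCK_SIZE_M']) * triton.cdiv(N, META['BLOCK_SIZE_N']),)
    matmul_kernel[grid](
        a, b, c,
        M, N, K,
        a.stride(0), a.stride(1),
        b.stride(0), b.stride(1),
        c.stride(0), c.stride(1),
    )
    return c
```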
Tinygrad vs. Others (my POV)
tinygrad is designed as an instruction set for DL first and foremost. A framework (layers, backprop, symbolics, etc.) is added on top of that, but at its core it is an instruction set.
```python
from enum import Enum, auto

class UnaryOps(Enum): NOOP = auto(); EXP2 = auto(); LOG2 = auto(); CAST = auto(); SIN = auto(); SQRT = auto(); RECIP = auto() # noqa: E702
class BinaryOps(Enum): ADD = auto(); SUB = auto(); MUL = auto(); DIV = auto(); CMPEQ = auto(); MAX = auto(); MOD = auto(); CMPLT = auto() # noqa: E702
class ReduceOps(Enum): SUM = auto(); MAX = auto() # noqa: E702
class FusedOps(Enum): MULACC = auto() # noqa: E702
class LoadOps(Enum): EMPTY = auto(); RAND = auto(); CONST = auto(); FROM = auto(); CONTIGUOUS = auto(); CUSTOM = auto() # noqa: E702
```
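To make "instruction set" concrete, here is a toy sketch (not how tinygrad is actually implemented) of what a backend has to provide for the enums above: one kernel per op, with everything higher-level composing them.

```python
# Toy NumPy "backend" for the op enums defined above -- purely illustrative,
# not tinygrad's real dispatch machinery.
import numpy as np

NUMPY_BACKEND = {
    BinaryOps.ADD: np.add,
    BinaryOps.MUL: np.multiply,
    BinaryOps.MAX: np.maximum,
    ReduceOps.SUM: lambda x, axis: x.sum(axis=axis, keepdims=True),
    ReduceOps.MAX: lambda x, axis: x.max(axis=axis, keepdims=True),
}

def run(op, *args, **kwargs):
    return NUMPY_BACKEND[op](*args, **kwargs)

# ReLU is just an elementwise MAX against 0; a row sum is ReduceOps.SUM.
x = np.random.randn(4, 8).astype(np.float32)
relu = run(BinaryOps.MAX, x, 0.0)
row_sums = run(ReduceOps.SUM, relu, axis=1)
```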
More on the software component ...
Deep learning does not require Turing completeness. Data access patterns are known up front, so you can allocate all the memory (buffers) you will need before any data is actually used.
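A tiny sketch of that idea with made-up shapes: when the graph is static, every intermediate shape is known before any data exists, so all buffers can be allocated once up front.

```python
# Sketch with hypothetical shapes: static graph => allocate everything up front.
import numpy as np

# (op name, output shape) is known from the graph itself, not from the data.
plan = [("matmul", (1024, 4096)), ("relu", (1024, 4096)), ("matmul", (1024, 10))]
buffers = {i: np.empty(shape, dtype=np.float32) for i, (_, shape) in enumerate(plan)}
# Kernels later write into these preallocated buffers; nothing is allocated at run time.
```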
TensorFlow was right - get a static graph and optimize the hell out of it. It just had a really cumbersome API, and RNNs were very popular back then. RNNs break nice data access patterns, so you wanted something more dynamic, i.e. PyTorch, which also had a much better developer UX.
Auxiliary thought: Transformers won because they are better for current hardware - much easier to scale up. Is there an example of an RNN at massive scale? Even the most well-known RL system, AlphaGo, is a conv net.
Extra: Hardware
GPT-4 is reportedly ~1.8 trillion parameters.
~1.8T params * 4 bytes (float32) = ~7.2 TB
~7.2 TB / 2 (float16) = ~3.6 TB
That's ~3.6 TB / 80 GB of HBM per H100 ≈ 45 H100s just to hold the weights for GPT-4 inference in float16.
Quantized to 4 bits (~0.9 TB of weights), it would still require 12 H100s. Assuming [~$30k each](https://www.hpcwire.com/2023/08/17/nvidia-h100-are-550000-gpus-enough-for-this-year/#:~:text=The%20flagship%20H100%20GPU%20(14%2C592,based%20supercomputer%20called%20Shaheen%20III.), that's ~$360k of GPUs to run 4-bit inference of GPT-4.
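Spelling that out as a rough sketch (assumptions: the rumored parameter count, 80 GB of HBM per card, weights only, ignoring KV cache and activations):

```python
# Back-of-envelope: how many H100s just to hold GPT-4's weights.
import math

params = 1.8e12        # rumored parameter count
h100_hbm = 80e9        # bytes of HBM per H100 (80 GB)

for name, bytes_per_param in [("float32", 4), ("float16", 2), ("int4", 0.5)]:
    total_bytes = params * bytes_per_param
    gpus = math.ceil(total_bytes / h100_hbm)
    print(f"{name}: {total_bytes / 1e12:.1f} TB -> {gpus} H100s")
# float32: 7.2 TB -> 90 H100s
# float16: 3.6 TB -> 45 H100s
# int4: 0.9 TB -> 12 H100s
```

At ~$30k per card, the 12-card int4 case is the ~$360k figure above.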
Assuming 800W for an H100, which seems to be the highest estimate.
That works out to roughly $80 per month to run an H100 continuously at typical electricity prices (math sketched below).
I guess that's how cloud providers make money. Energy arbitrage. It costs us cents to run this but we'll rent it to you for a few dollars.
But suppose you have 24 H100s (they seem to come in boxes of 8, so 24 instead of 20). That's ~$1,920/month in electricity to run your monster GPUs at full capacity, which is way cheaper than I thought it would be. You're paying at least 10x more to use them via a cloud you can't hack to your liking. So if you can pay the up-front cost, it's worth it to buy the monster GPUs, especially now that you can even use them as collateral.
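The electricity math behind those monthly numbers, as a sketch (assumptions: 800 W per card as above, and an electricity price of roughly $0.14/kWh):

```python
# Rough electricity cost for an H100 running flat out all month.
watts = 800                 # assumed draw per H100 (high-end estimate)
price_per_kwh = 0.14        # assumed electricity price in $/kWh
hours = 24 * 30

kwh_per_month = watts / 1000 * hours          # 576 kWh
cost_per_gpu = kwh_per_month * price_per_kwh  # ~$80.6
print(f"1 H100:   ~${cost_per_gpu:.0f}/month")        # ~$81/month
print(f"24 H100s: ~${24 * cost_per_gpu:,.0f}/month")  # ~$1,935/month
```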