Can be distilled into fundemental ops:

- load
- store
- multiply accumulate (basically very small matrix multiplication)
- elementwise

Elementwise ops were always first class ops on GPU. The entire premise of using the GPU is that the hardware is specialized for these kind of ops. Multiply accumulate can be viewed as a combination of elementwise ops + reduction.

In recent years there are now specialized ops + hardware so that multiply accumulate can be made even faster. The core premise to make very small matrix multiplications an atomic operation. These are Tensor Cores in NVIDIA parlance. Leverage these cores for matrix multiplications can greatly increase TFLOPs.

From the Flash Attention 2 post:

As an example, the A100 GPU has a max theoretical throughput of 312 TFLOPs of FP16/BF16 matmul, but only 19.5 TFLOPs of non-matmul FP32. Another way to think about this is that each non-matmul FLOP is 16x more expensive than a matmul FLOP. To maintain high throughput, we want to spend as much time on matmul FLOPs as possible.

APIs for NVIDIA and Metal.

NVIDIA:

```
# section of a matrix distributed across all threads in the warp (grouping of threads, the rest of the functions operate on `fragment` type)
`fill_fragment`
# `sync` refers to syncing across the warp lanes
`load_matrix_sync`
`store_matrix_sync`
`mma_sync`
```

Metal:

```
`simdgroup_load`
`simdgroup_multiply_accumulate`
`simdgroup_store`
```

Example use:

```
kernel void float_matmad(device float *pmata, device float *pmatb, device float *pmatc, device float *pmatd)
{
simdgroup_float8x8 sgmata;
simdgroup_float8x8 sgmatb;
simdgroup_float8x8 sgmatc;
simdgroup_float8x8 sgmatd;
simdgroup_load(sgmata, pmata);
simdgroup_load(sgmatb, pmatb);
simdgroup_load(sgmatc, pmatc);
simdgroup_multiply_accumulate(sgmatd, sgmata, sgmatb, sgmatc);
simdgroup_store(sgmatd, pmatd);
}
```

Likely to see much more hardware and specialized instruction sets for deep learning. Already a bunch of startups in this area but software is shit. Even the next best competitor (AMD) to NVIDIA lags far behind.

Triton is a software abstraction layer over the deep learning instruction set.

Implementation of matrix multiplication:

```
import torch
import triton
import triton.language as tl
@triton.autotune(
configs=[
triton.Config({'BLOCK_SIZE_M': 128, 'BLOCK_SIZE_N': 256, 'BLOCK_SIZE_K': 64, 'GROUP_SIZE_M': 8}, num_stages=3, num_warps=8),
triton.Config({'BLOCK_SIZE_M': 64, 'BLOCK_SIZE_N': 256, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=4, num_warps=4),
triton.Config({'BLOCK_SIZE_M': 128, 'BLOCK_SIZE_N': 128, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=4, num_warps=4),
triton.Config({'BLOCK_SIZE_M': 128, 'BLOCK_SIZE_N': 64, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=4, num_warps=4),
triton.Config({'BLOCK_SIZE_M': 64, 'BLOCK_SIZE_N': 128, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=4, num_warps=4),
triton.Config({'BLOCK_SIZE_M': 128, 'BLOCK_SIZE_N': 32, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=4, num_warps=4),
triton.Config({'BLOCK_SIZE_M': 64, 'BLOCK_SIZE_N': 32, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=5, num_warps=2),
triton.Config({'BLOCK_SIZE_M': 32, 'BLOCK_SIZE_N': 64, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=5, num_warps=2),
],
key=['M', 'N', 'K'],
)
@triton.jit
def matmul_kernel(
a_ptr, b_ptr, c_ptr,
M, N, K,
stride_am, stride_ak,
stride_bk, stride_bn,
stride_cm, stride_cn,
# Meta-parameters
BLOCK_SIZE_M: tl.constexpr, BLOCK_SIZE_N: tl.constexpr, BLOCK_SIZE_K: tl.constexpr,
GROUP_SIZE_M: tl.constexpr,
):
"""Kernel for computing the matmul C = A x B.
A has shape (M, K), B has shape (K, N) and C has shape (M, N)
"""
### LOAD
# -----------------------------------------------------------
# Map program ids `pid` to the block of C it should compute.
# This is done in a grouped ordering to promote L2 data reuse.
# See above `L2 Cache Optimizations` section for details.
pid = tl.program_id(axis=0)
num_pid_m = tl.cdiv(M, BLOCK_SIZE_M)
num_pid_n = tl.cdiv(N, BLOCK_SIZE_N)
num_pid_in_group = GROUP_SIZE_M * num_pid_n
group_id = pid // num_pid_in_group
first_pid_m = group_id * GROUP_SIZE_M
group_size_m = min(num_pid_m - first_pid_m, GROUP_SIZE_M)
pid_m = first_pid_m + (pid % group_size_m)
pid_n = (pid % num_pid_in_group) // group_size_m
# ----------------------------------------------------------
# Create pointers for the first blocks of A and B.
# We will advance this pointer as we move in the K direction
# and accumulate
# `a_ptrs` is a block of [BLOCK_SIZE_M, BLOCK_SIZE_K] pointers
# `b_ptrs` is a block of [BLOCK_SIZE_K, BLOCK_SIZE_N] pointers
# See above `Pointer Arithmetics` section for details
offs_am = (pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)) % M
offs_bn = (pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)) % N
offs_k = tl.arange(0, BLOCK_SIZE_K)
a_ptrs = a_ptr + (offs_am[:, None] * stride_am + offs_k[None, :] * stride_ak)
b_ptrs = b_ptr + (offs_k[:, None] * stride_bk + offs_bn[None, :] * stride_bn)
### COMPUTE (MATMUL ACC)
# -----------------------------------------------------------
# Iterate to compute a block of the C matrix.
# We accumulate into a `[BLOCK_SIZE_M, BLOCK_SIZE_N]` block
# of fp32 values for higher accuracy.
# `accumulator` will be converted back to fp16 after the loop.
accumulator = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32)
for k in range(0, tl.cdiv(K, BLOCK_SIZE_K)):
# Load the next block of A and B, generate a mask by checking the K dimension.
# If it is out of bounds, set it to 0.
a = tl.load(a_ptrs, mask=offs_k[None, :] < K - k * BLOCK_SIZE_K, other=0.0)
b = tl.load(b_ptrs, mask=offs_k[:, None] < K - k * BLOCK_SIZE_K, other=0.0)
# We accumulate along the K dimension.
accumulator += tl.dot(a, b)
# Advance the ptrs to the next K block.
a_ptrs += BLOCK_SIZE_K * stride_ak
b_ptrs += BLOCK_SIZE_K * stride_bk
# You can fuse arbitrary activation functions here
# while the accumulator is still in FP32!
c = accumulator.to(tl.float16)
### STORE
# -----------------------------------------------------------
# Write back the block of the output matrix C with masks.
offs_cm = pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)
offs_cn = pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)
c_ptrs = c_ptr + stride_cm * offs_cm[:, None] + stride_cn * offs_cn[None, :]
c_mask = (offs_cm[:, None] < M) & (offs_cn[None, :] < N)
tl.store(c_ptrs, c, mask=c_mask)
```

```
accumulator = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32)
for k in range(0, tl.cdiv(K, BLOCK_SIZE_K)):
# Load the next block of A and B, generate a mask by checking the K dimension.
# If it is out of bounds, set it to 0.
a = tl.load(a_ptrs, mask=offs_k[None, :] < K - k * BLOCK_SIZE_K, other=0.0)
b = tl.load(b_ptrs, mask=offs_k[:, None] < K - k * BLOCK_SIZE_K, other=0.0)
# We accumulate along the K dimension.
accumulator += tl.dot(a, b)
# Advance the ptrs to the next K block.
a_ptrs += BLOCK_SIZE_K * stride_ak
b_ptrs += BLOCK_SIZE_K * stride_bk
# You can fuse arbitrary activation functions here
# while the accumulator is still in FP32!
c = accumulator.to(tl.float16)
```

^ This calls the previous APIs we mentioned under the hood.

### Tinygrad vs. Others (my POV)

tinygrad is designed as an instruction set for DL first and foremost. A framework is added on top of that, by which I mean layers, backprop, symbolics, etc.

but at it's core this is an instruction set.

```
class UnaryOps(Enum): NOOP = auto(); EXP2 = auto(); LOG2 = auto(); CAST = auto(); SIN = auto(); SQRT = auto(); RECIP = auto() # noqa: E702
class BinaryOps(Enum): ADD = auto(); SUB = auto(); MUL = auto(); DIV = auto(); CMPEQ = auto(); MAX = auto(); MOD = auto(); CMPLT = auto() # noqa: E702
class ReduceOps(Enum): SUM = auto(); MAX = auto() # noqa: E702
class FusedOps(Enum): MULACC = auto() # noqa: E702
class LoadOps(Enum): EMPTY = auto(); RAND = auto(); CONST = auto(); FROM = auto(); CONTIGUOUS = auto(); CUSTOM = auto() # noqa: E702
```

More on the software component ...

It does not require turing completeness. Data access patterns are known up front. You can allocate all the memory (buffers) you will need before any data is actually used.

TensorFlow was right - get a static graph and optimize the hell out of it. It just had a really cumbersome API and RNNs were very popular back then. RNNs break nice data access patterns so you want to use something more dynamic, i.e. PyTorch which also had a much better developer UX.

Auxillary thought: Transformers won because they are better for current hardware - much easier to scale up. Is there an example of an RNN at massive scale? Even the most well known RL system, AlphaGo, is a conv net.

## Extra: Hardware

GPT-4 is ~1.8 trillion parameters

~1.8T * 4 (float32) = 7.2T 7.2T / 2 (float16) = 3.6T

45 H100s to run inference for GPT-4.

Quantized to 4 bits would still require 12 H100s. Assuming [30k each](https://www.hpcwire.com/2023/08/17/nvidia-h100-are-550000-gpus-enough-for-this-year/#:~:text=The%20flagship%20H100%20GPU%20(14%2C592,based%20supercomputer%20called%20Shaheen%20III.) That's 360k to run 4 bit inference of GPT-4.

Assuming 800W for an H100, which seems to be the highest estimate.

It will cost $80 per month to run a H100 continously.

I guess that's how cloud providers make money. Energy arbitrage. *It costs us cents to run this but we'll rent it to you for a few dollars.*

But suppose you have 24 H100s, they seem to come in 8, so 24 instead of 20. That's $1920/month to run your monster GPUs at full capacity. That's way cheaper than I thought it would be. You're paying at least 10x more to use it via cloud you can't hack to your liking. So if you can pay the up front cost it's worth it to buy the monster GPUs. Especially now since you can enough use them as collateral.