Matrix Multiplication

Matrix multiplication is a binary operation that produces a matrix from two matrices.
For the product to be defined, the number of columns in the first matrix must equal the number of rows in the second matrix. The resulting matrix, known as the matrix product, has the number of rows of the first matrix and the number of columns of the second. The product of matrices A and B is denoted AB.
Example:
A of size (2, 3), B of size (3, 2). Then A @ B gives AB of size (2, 2). (@ denotes matmul.)
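The shape rule can be checked quickly with NumPy, whose `@` operator performs matrix multiplication:

```python
import numpy as np

A = np.ones((2, 3))  # 2 rows, 3 columns
B = np.ones((3, 2))  # 3 rows, 2 columns

C = A @ B            # inner dimensions (3) match, so the product is defined
print(C.shape)       # (2, 2): rows of A, columns of B
```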
COMPUTATIONAL COMPLEXITY OF MATRIX MULTIPLICATION
1. GEMM – General Matrix Multiplication
This section covers GEMM (General Matrix Multiplication) using the regular iterative algorithm, as commonly found in textbooks. This implementation follows the standard triple-nested-loop approach for multiplying two matrices.
Algorithm Overview
The basic iterative algorithm for matrix multiplication computes the result of:
C = A × B
Time Complexity
The matrix multiplication algorithm that follows directly from the definition requires, in the worst case, n³ operations.
Why is it O(n³)?
For example, take square matrices of size n.
Operations per output element:
- Multiplications: n
- Additions: n - 1

Total operations per element: n + (n - 1) = 2n - 1
Total elements = n × n = n²
So, total flops = n² × (2n - 1) ≈ 2n³
However, in time complexity analysis using Big O notation, constant factors like 2 are omitted, so both 2n³ and n³ are simply written as O(n³).
- Using this calculation we can count flops exactly. For example, N = 1024: 2 × 1024³ = 2147483648 ≈ 2.15 GFLOPs.
Note: FLOPs = floating-point operations; FLOP/s = floating-point operations per second.
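The flop count above can be written as a small helper (`gemm_flops` is a name chosen here for illustration):

```python
def gemm_flops(n: int) -> int:
    """Exact flop count for an n x n GEMM: n multiplications and
    (n - 1) additions per output element, over n * n elements,
    giving n^2 * (2n - 1) ~ 2n^3."""
    return n * n * (2 * n - 1)

print(gemm_flops(1024))   # exact count: 2146435072
print(2 * 1024 ** 3)      # the 2n^3 approximation: 2147483648
```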
    Input: matrices A (n × m) and B (m × p)
    Output: matrix C (n × p)

    Let C be a new matrix of size n × p
    For i from 1 to n:
        For j from 1 to p:
            sum ← 0
            For k from 1 to m:
                sum ← sum + A[i][k] × B[k][j]
            C[i][j] ← sum
    Return C
Check the Python implementation of matmul in gemm.py.
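A minimal Python sketch of the triple-nested loop (the actual gemm.py may differ):

```python
def matmul(A, B):
    """Naive GEMM: A is n x m, B is m x p, result C is n x p."""
    n, m, p = len(A), len(B), len(B[0])
    assert len(A[0]) == m, "columns of A must equal rows of B"
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            s = 0.0
            for k in range(m):
                s += A[i][k] * B[k][j]
            C[i][j] = s
    return C

A = [[1, 2, 3],
     [4, 5, 6]]        # 2 x 3
B = [[7, 8],
     [9, 10],
     [11, 12]]         # 3 x 2
print(matmul(A, B))    # [[58.0, 64.0], [139.0, 154.0]]
```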
But this algorithm is very slow; it is essentially a brute-force GEMM.
Reasons why it is slow:
- Single-threaded execution
- Operands streamed from main memory (RAM) instead of cache
- Poor memory access patterns
- No vectorization (SIMD, AVX) …
NumPy is optimized for all of these things, so it is much faster.
BENCHMARKING MY MAC (FLOP/s)
Benchmarking on a Mac M2 using NumPy.
Let's take a square matrix of size N = 1024.
- Total number of flops = 2N³ = 2 × 1024³ = 2147483648 flops
- Memory per matrix (single precision, 4 bytes/element) = 4 × 1024 × 1024 = 4194304 bytes = 4 MB (≈ 12 MB total for A, B, and C)
- Total time to compute A @ B (N = 1024): t = 0.001682666999840876 s
- So FLOP/s = flops / t = 2147483648 / 0.001682666999840876 × 1e-9 = 1276 GFLOP/s (close to the theoretical FLOP/s of the M2 chip)

In double precision (float64) it is noticeably slower, roughly 2 to 4 times slower than fp32.
Benchmark your system using perf_gemm_np.py
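The benchmark boils down to timing `A @ B` and dividing the flop count by the elapsed time. A minimal sketch (perf_gemm_np.py may differ in details; absolute numbers depend on your machine):

```python
import time
import numpy as np

N = 1024
A = np.random.rand(N, N).astype(np.float32)
B = np.random.rand(N, N).astype(np.float32)

A @ B  # warm-up so one-time setup cost is not measured

reps = 10
t0 = time.perf_counter()
for _ in range(reps):
    C = A @ B
t1 = time.perf_counter()

flops = 2 * N ** 3                            # 2n^3 approximation
gflops = flops / ((t1 - t0) / reps) * 1e-9    # flops / seconds, in GFLOP/s
print(f"{gflops:.1f} GFLOP/s")
```

Averaging over several repetitions smooths out timer noise, and the warm-up call keeps BLAS initialization out of the measurement.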