Matrix Multiplication

Matrix multiplication is a binary operation that produces a matrix from two matrices.
For the product to be defined, the number of columns in the first matrix must equal the number of rows in the second matrix. The resulting matrix, known as the matrix product, has the number of rows of the first matrix and the number of columns of the second. The product of matrices A and B is denoted AB.
Example:
A of size (2, 3), B of size (3, 2). Then A @ B gives AB of size (2, 2). (@ denotes matmul.)
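The shape rule can be checked quickly with NumPy, whose `@` operator performs matrix multiplication:

```python
import numpy as np

A = np.ones((2, 3))  # 2 rows, 3 columns
B = np.ones((3, 2))  # 3 rows, 2 columns

C = A @ B            # inner dimensions (3) match, so the product is defined
print(C.shape)       # (2, 2): rows of A, columns of B
```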
COMPUTATIONAL COMPLEXITY OF MATRIX MULTIPLICATION
1. GEMM – General Matrix Multiplication
This section covers GEMM (General Matrix Multiplication) using the regular iterative algorithm, as commonly found in textbooks. This implementation follows the standard triple-nested-loop approach for multiplying two matrices.
Algorithm Overview
The basic iterative algorithm for matrix multiplication computes the result of:
C = A × B
Time Complexity
The matrix multiplication algorithm that follows directly from the definition requires, in the worst case, n³ operations.
Why is it O(n³)?
For example, take square matrices of size n.
Operations per output element:
- Multiplications: n
- Additions: n - 1

Total operations per element: n + (n - 1) = 2n - 1
Total elements = n × n = n²
So, total flops = n² × (2n - 1) ≈ 2n³
However, in time complexity analysis using Big O notation, constant factors like 2 are omitted, so both 2n³ and n³ are simply written as O(n³).
- Using this calculation we can count flops exactly. For example, N = 1024: 2 × 1024³ = 2147483648 ≈ 2.15 GFLOPs.
Note: FLOPs = floating-point operations; FLOP/s = floating-point operations per second.
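The flop count above can be written as a small helper (`gemm_flops` is a name chosen here for illustration):

```python
def gemm_flops(n: int) -> int:
    """Exact flop count for an n x n GEMM: n multiplications and
    (n - 1) additions per output element, over n * n elements,
    giving n^2 * (2n - 1) ~ 2n^3."""
    return n * n * (2 * n - 1)

print(gemm_flops(1024))   # exact count: 2146435072
print(2 * 1024 ** 3)      # the 2n^3 approximation: 2147483648
```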
    Input: matrices A (n × m) and B (m × p)
    Output: matrix C (n × p)

    Let C be a new matrix of size n × p
    For i from 1 to n:
        For j from 1 to p:
            sum ← 0
            For k from 1 to m:
                sum ← sum + A[i][k] × B[k][j]
            C[i][j] ← sum
    Return C
Check the Python implementation of matmul in gemm.py.
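A minimal Python sketch of the triple-nested loop (the actual gemm.py may differ):

```python
def matmul(A, B):
    """Naive GEMM: A is n x m, B is m x p, result C is n x p."""
    n, m, p = len(A), len(B), len(B[0])
    assert len(A[0]) == m, "columns of A must equal rows of B"
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            s = 0.0
            for k in range(m):
                s += A[i][k] * B[k][j]
            C[i][j] = s
    return C

A = [[1, 2, 3],
     [4, 5, 6]]        # 2 x 3
B = [[7, 8],
     [9, 10],
     [11, 12]]         # 3 x 2
print(matmul(A, B))    # [[58.0, 64.0], [139.0, 154.0]]
```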
But this algorithm is very slow; it is essentially a brute-force GEMM.
Reasons why it is slow:
- Single-threaded execution
- Operands streamed from main memory (RAM) instead of cache
- Poor memory access patterns
- No vectorization (SIMD, AVX) …
NumPy is optimized for all of these things, so it is much faster.
BENCHMARKING MY MAC (FLOP/s)
Benchmarking on a Mac M2 using NumPy.
Let's take a square matrix of size N = 1024.
- Total number of flops = 2N³ = 2 × 1024³ = 2147483648 flops
- Memory per matrix (single precision, 4 bytes/element) = 4 × 1024 × 1024 = 4194304 bytes = 4 MB (≈ 12 MB total for A, B, and C)
- Total time to compute A @ B (N = 1024): t = 0.001682666999840876 s
- So FLOP/s = flops / t = 2147483648 / 0.001682666999840876 × 1e-9 = 1276 GFLOP/s (close to the theoretical FLOP/s of the M2 chip)

In double precision (float64) it is noticeably slower, roughly 2 to 4 times slower than fp32.
Benchmark your system using perf_gemm_np.py
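The benchmark boils down to timing `A @ B` and dividing the flop count by the elapsed time. A minimal sketch (perf_gemm_np.py may differ in details; absolute numbers depend on your machine):

```python
import time
import numpy as np

N = 1024
A = np.random.rand(N, N).astype(np.float32)
B = np.random.rand(N, N).astype(np.float32)

A @ B  # warm-up so one-time setup cost is not measured

reps = 10
t0 = time.perf_counter()
for _ in range(reps):
    C = A @ B
t1 = time.perf_counter()

flops = 2 * N ** 3                            # 2n^3 approximation
gflops = flops / ((t1 - t0) / reps) * 1e-9    # flops / seconds, in GFLOP/s
print(f"{gflops:.1f} GFLOP/s")
```

Averaging over several repetitions smooths out timer noise, and the warm-up call keeps BLAS initialization out of the measurement.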