Matrix multiplication is a binary operation that produces a matrix from two matrices.

For matrix multiplication, the number of columns in the first matrix must equal the number of rows in the second matrix.

The resulting matrix, known as the matrix product, has the number of rows of the first matrix and the number of columns of the second. The product of matrices A and B is denoted AB.

Example:

  • A of size (2, 3)
  • B of size (3, 2)
  • Then A @ B gives AB of size (2, 2) (@ is Python's matmul operator)
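The shape rule above can be checked quickly with NumPy (the all-ones values here are just placeholders to illustrate the shapes):

```python
import numpy as np

A = np.ones((2, 3))   # 2 rows, 3 columns
B = np.ones((3, 2))   # 3 rows, 2 columns
C = A @ B             # @ is Python's matrix-multiplication operator
print(C.shape)        # rows of A, columns of B -> (2, 2)
```

Each element of C is a dot product of a length-3 row of A with a length-3 column of B, so with all ones every entry is 3.0.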

COMPUTATIONAL COMPLEXITY OF MATRIX MULTIPLICATION

1. GEMM – General Matrix Multiplication

GEMM (General Matrix Multiplication) can be implemented with the regular iterative algorithm commonly found in textbooks. This implementation follows the standard triple-nested-loop approach for multiplying two matrices.


Algorithm Overview

The basic iterative algorithm for matrix multiplication computes the result of:

C = A × B

Time Complexity

The matrix multiplication algorithm that follows directly from the definition requires, in the worst case, on the order of n³ operations.

Why is it O(n³)?

For example, take square matrices of size n.

Operations per element:

  • Multiplications: n

  • Additions: n - 1

Total operations per element: n + (n - 1) = 2n - 1

Total elements = n × n = n²

So, total FLOPs = n² × (2n - 1) = 2n³ - n² ≈ 2n³

However, in time-complexity analysis using Big O notation, constant factors like 2 (and lower-order terms) are omitted, so both 2n³ and n³ are written simply as O(n³).

  • Using this calculation we can count the FLOPs exactly. Ex: N = 1024; 2 × 1024³ = 2147483648 ≈ 2.15 GFLOPs

Note: FLOPs – floating-point operations; FLOP/s – floating-point operations per second
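The FLOP count above is easy to verify in a couple of lines of Python:

```python
# An N x N matrix multiplication takes ~2*N^3 floating-point operations
N = 1024
flops = 2 * N**3
print(flops)        # 2147483648
print(flops / 1e9)  # ~2.15 (GFLOPs)
```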

Input: matrices A (n × m) and B (m × p)
Output: matrix C (n × p)

Let C be a new matrix of size n × p

For i from 1 to n:
    For j from 1 to p:
        sum ← 0
        For k from 1 to m:
            sum ← sum + A[i][k] × B[k][j]
        C[i][j] ← sum

Return C

Check the Python implementation of matmul in gemm.py.
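The pseudocode maps almost line-for-line onto plain Python. A minimal sketch (gemm.py in the repo presumably contains a similar version):

```python
def gemm(A, B):
    """Naive triple-loop GEMM on lists of lists: returns C = A x B."""
    n, m = len(A), len(A[0])    # A is n x m
    m2, p = len(B), len(B[0])   # B is m x p
    assert m == m2, "columns of A must equal rows of B"
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            s = 0.0
            for k in range(m):
                s += A[i][k] * B[k][j]
            C[i][j] = s
    return C
```

For example, `gemm([[1, 2], [3, 4]], [[5, 6], [7, 8]])` returns `[[19.0, 22.0], [43.0, 50.0]]`.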

But this algorithm is very slow; it is essentially the brute-force algorithm for GEMM.

Reasons for why it’s slow:

  • Single-threaded execution
  • Operands loaded from main memory (RAM) instead of cache
  • Poor memory-access patterns
  • No vectorization (SIMD, AVX) …

NumPy is optimized for all of these things, so it is much faster.
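A rough, illustrative comparison (timings vary by machine; N is kept small here so the naive loop finishes quickly):

```python
import time
import numpy as np

N = 128
rng = np.random.default_rng(0)
A = rng.random((N, N), dtype=np.float32)
B = rng.random((N, N), dtype=np.float32)

# Naive triple loop over plain Python lists
Al, Bl = A.tolist(), B.tolist()
t0 = time.perf_counter()
C = [[sum(Al[i][k] * Bl[k][j] for k in range(N)) for j in range(N)]
     for i in range(N)]
t_naive = time.perf_counter() - t0

A @ B  # warm-up so one-time library setup isn't timed

t0 = time.perf_counter()
Cnp = A @ B
t_np = time.perf_counter() - t0

print(f"naive: {t_naive:.3f} s, numpy: {t_np:.6f} s")
```

On a typical machine the NumPy call is several orders of magnitude faster, even at this tiny size.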

BENCHMARKING MY MAC (FLOP/s)

Benchmarking on Mac M2 using numpy

Take a square matrix of size N = 1024.

  • Total number of FLOPs = 2N³ = 2 × 1024³ = 2147483648 FLOPs

  • Memory to store one matrix (single precision, 4 bytes per element) = 4 × 1024 × 1024 = 4194304 bytes ≈ 4 MB; A, B, and C together take ≈ 12 MB

  • Total time to compute A @ B (N = 1024): t = 0.001682666999840876 s

  • So FLOP/s = FLOPs / t = 2147483648 / 0.001682666999840876 × 1e-9 ≈ 1276 GFLOP/s (close to the theoretical FLOP/s of the M2 chip)

With double precision (float64) it is noticeably slower, roughly 2–4× slower than fp32.
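A minimal benchmark along these lines (perf_gemm_np.py presumably does something similar; the exact numbers depend on your machine):

```python
import time
import numpy as np

N = 1024
A = np.random.rand(N, N).astype(np.float32)
B = np.random.rand(N, N).astype(np.float32)

A @ B  # warm-up so one-time BLAS/library setup isn't timed

t0 = time.perf_counter()
C = A @ B
t = time.perf_counter() - t0

flops = 2 * N**3  # ~2N^3 FLOPs, as derived above
print(f"{t:.6f} s -> {flops / t / 1e9:.1f} GFLOP/s")
```

For a more stable number, run the timed multiply several times and take the minimum or the average.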

Benchmark your system using perf_gemm_np.py