# PyHPC – Self-Guided Parallel Programming Workshop with Python
July 2025
This project is a self-paced workshop designed to explore parallel and high-performance computing (HPC) concepts and tools using the Python programming language.
It's inspired by typical content from introductory graduate-level HPC courses, but adapted to be practical, flexible, and free from academic bureaucracy.
Each chapter covers a different technique to exploit parallelism at the CPU, GPU, or cluster level, with simple examples and performance comparisons.
## Workshop Structure
### 00 - How this course was made
- Source
- Workflow
- Why
### 01 - Multiprocessing and Concurrent Programming Fundamentals
- HPC landscape overview: where different techniques fit
- Introduction to the `multiprocessing` module
- Using `Process`, `Queue`, `Pipe`, and `Pool` (a minimal `Pool` sketch follows this outline)
- Introduction to `concurrent.futures` for cleaner parallel execution
- Examples of CPU-bound tasks with performance profiling
- Comparison with sequential execution
- Synchronization and locking considerations
- Profiling CPU-bound vs I/O-bound tasks
- Proposed exercises:
  - Parallel prime number calculation
  - Signal filtering comparison (sequential vs parallel)
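A minimal sketch of the kind of comparison this chapter builds toward, using `multiprocessing.Pool` for the prime-counting exercise (the range size and `chunksize` are arbitrary illustration values, not fixed by the chapter):

```python
# Counting primes in a range: sequential loop vs. a process pool.
import time
from multiprocessing import Pool

def is_prime(n):
    """Naive primality test, deliberately CPU-bound."""
    if n < 2:
        return False
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:
            return False
    return True

def count_primes(numbers):
    return sum(is_prime(n) for n in numbers)

if __name__ == "__main__":  # required for the spawn start method (Windows/macOS)
    numbers = range(2, 200_000)

    t0 = time.perf_counter()
    sequential = count_primes(numbers)
    t1 = time.perf_counter()

    with Pool() as pool:  # defaults to os.cpu_count() worker processes
        parallel = sum(pool.map(is_prime, numbers, chunksize=1_000))
    t2 = time.perf_counter()

    print(f"sequential: {sequential} primes in {t1 - t0:.2f}s")
    print(f"parallel:   {parallel} primes in {t2 - t1:.2f}s")
```

Because `is_prime` never waits on I/O, this is a case where processes (not threads) are the right tool; chapter 02 covers the opposite situation.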
### 02 - Multithreading, Async Programming, and the GIL
- What is the GIL (Global Interpreter Lock) in CPython
- Introduction to `asyncio` for I/O-bound tasks
- The `threading` module: using `Thread`, `Lock`, `RLock`, `Event`, `Condition`
- When `threading` is useful vs when to avoid it
- Practical differences between `threading`, `asyncio`, and `multiprocessing` (a comparative sketch follows this outline)
- Advanced synchronization patterns
- Proposed exercises:
  - Multiple threads reading files
  - Web scraping with asyncio
  - Concurrent data acquisition simulation
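A minimal sketch contrasting `ThreadPoolExecutor` and `asyncio` on a simulated I/O-bound workload (the 0.5 s sleep stands in for network latency and is an assumption, not part of the exercise spec):

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

def blocking_fetch(i):
    """Simulated blocking I/O; the GIL is released while sleeping."""
    time.sleep(0.5)
    return f"payload-{i}"

async def async_fetch(i):
    """Simulated non-blocking I/O; the event loop runs other tasks meanwhile."""
    await asyncio.sleep(0.5)
    return f"payload-{i}"

def with_threads(n):
    # Threads work here because the tasks spend their time waiting, not computing.
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(blocking_fetch, range(n)))

async def with_asyncio(n):
    # A single thread, many coroutines multiplexed by the event loop.
    return await asyncio.gather(*(async_fetch(i) for i in range(n)))

if __name__ == "__main__":
    t0 = time.perf_counter()
    with_threads(10)
    t1 = time.perf_counter()
    asyncio.run(with_asyncio(10))
    t2 = time.perf_counter()
    print(f"threads: {t1 - t0:.2f}s  asyncio: {t2 - t1:.2f}s")  # both take ~0.5 s
```

Both finish in roughly the time of a single task despite the GIL, because waiting on I/O does not hold the interpreter; a CPU-bound version of the same test would show no such speedup with threads.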
### 03 - Distributed Computing with MPI and Modern Alternatives
- Introduction to MPI and `mpi4py` (a minimal `mpi4py` sketch follows this outline)
- `mpiexec` and distributed execution
- Process communication: `send`, `recv`, `broadcast`, `scatter`, `gather`
- Modern alternatives: `joblib` and `Ray` for distributed computing
- Comparison: MPI vs modern distributed frameworks
- Proposed exercises:
  - Distributed FFT computation
  - Monte Carlo simulation across processes
- Fallback: `joblib` parallel processing if MPI setup fails
- Required infrastructure: real cluster or local simulation
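A minimal `mpi4py` sketch in the spirit of the Monte Carlo exercise, assuming `mpi4py` is installed and the script is launched with something like `mpiexec -n 4 python pi_mpi.py` (the file name and sample count are illustrative):

```python
import random
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

SAMPLES_PER_RANK = 1_000_000  # arbitrary workload per process

# Each rank draws its own random points in the unit square.
random.seed(rank)
hits = sum(
    1
    for _ in range(SAMPLES_PER_RANK)
    if random.random() ** 2 + random.random() ** 2 <= 1.0
)

# Sum the per-rank counts on rank 0 and report the estimate there.
total_hits = comm.reduce(hits, op=MPI.SUM, root=0)
if rank == 0:
    pi = 4.0 * total_hits / (SAMPLES_PER_RANK * size)
    print(f"pi ~ {pi:.5f} using {size} processes")
```

With most MPI installations, running the script directly (no `mpiexec`) simply executes it as a single rank, which can double as the local-simulation setup mentioned above.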
### 04 - GPU Programming with CUDA and ROCm
- Introduction to CUDA: concept of kernels and grids
- PyCUDA vs Numba CUDA: similarities and differences
- First kernel with `numba.cuda` (a minimal kernel sketch follows this outline)
- `CuPy` for NumPy-like GPU operations
- ROCm/HIP alternatives for AMD GPUs
- Memory management: host-device transfers, memory coalescing
- Benchmark: operations on CPU vs GPU
- Proposed exercises:
  - 2D convolution on GPU
  - Matrix multiplication with memory optimization
  - Signal processing kernels
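A minimal first-kernel sketch with `numba.cuda`, assuming Numba and a CUDA-capable GPU with working drivers (the array size and block size are arbitrary choices):

```python
import numpy as np
from numba import cuda

@cuda.jit
def add_kernel(x, y, out):
    i = cuda.grid(1)      # absolute thread index in a 1D grid
    if i < x.size:        # guard: the grid may be larger than the array
        out[i] = x[i] + y[i]

n = 1_000_000
x = np.random.rand(n).astype(np.float32)
y = np.random.rand(n).astype(np.float32)

# Explicit host -> device transfers (the chapter's memory-management topic).
d_x = cuda.to_device(x)
d_y = cuda.to_device(y)
d_out = cuda.device_array_like(d_x)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
add_kernel[blocks, threads_per_block](d_x, d_y, d_out)

result = d_out.copy_to_host()
assert np.allclose(result, x + y)
```

On machines without a GPU, `numba.cuda.is_available()` can be checked first so the workshop code can fall back to a CPU path.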
### 05 - Parallel Libraries and Performance Optimization
- Quick overview of parallel libraries (a minimal `Numba` sketch follows this outline):
  - `Numba` with `@jit(parallel=True)`
  - `Dask` for large structures and automatic parallelism
  - `JAX` for scientific computing and automatic differentiation
  - `PyTorch` and GPU usage (`torch.cuda`)
- Performance profiling tools: `time`, `timeit`, `line_profiler`, `py-spy`
- Numerical precision considerations in parallel computing
- Choosing the right tool for different problem types
- Comparative benchmark: CPU vs GPU vs distributed
- Proposed exercises:
  - Complete signal processing pipeline using multiple approaches
  - Performance comparison: multiprocessing vs Numba vs GPU
  - Real-world optimization problem
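A minimal `Numba` sketch showing `@njit(parallel=True)` with `prange` on a simple moving-average filter (the window length and signal size are illustrative):

```python
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def mean_filter(signal, window):
    """Moving-average filter; the outer loop is split across CPU threads."""
    out = np.empty_like(signal)
    half = window // 2
    for i in prange(signal.size):
        lo = max(0, i - half)
        hi = min(signal.size, i + half + 1)
        out[i] = signal[lo:hi].mean()
    return out

if __name__ == "__main__":
    x = np.random.rand(1_000_000)
    y = mean_filter(x, 101)   # first call includes JIT compilation
    print(y[:5])
```

The first call pays the JIT compilation cost, so any benchmark against the other approaches in this chapter should time a second, warmed-up call.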
## License
MIT – free to use, copy, and modify.