# PyHPC – Self-Guided Parallel Programming Workshop with Python
July 2025
This project is a self-paced workshop designed to explore parallel and high-performance computing (HPC) concepts and tools using the Python programming language.
It's inspired by typical content from introductory graduate-level HPC courses, but adapted to be practical, flexible, and free from academic bureaucracy.
Each chapter covers a different technique to exploit parallelism at the CPU, GPU, or cluster level, with simple examples and performance comparisons.
## Workshop Structure
### 00 - How this course was made
- Source
- Workflow
- Why
### 01 - Multiprocessing and Concurrent Programming Fundamentals
- HPC landscape overview: where different techniques fit
- Introduction to the `multiprocessing` module
- Using `Process`, `Queue`, `Pipe`, and `Pool` (a minimal `Pool` sketch follows this outline)
- Introduction to `concurrent.futures` for cleaner parallel execution
- Examples of CPU-bound tasks with performance profiling
- Comparison with sequential execution
- Synchronization and locking considerations
- Profiling CPU-bound vs I/O-bound tasks
- Proposed exercises:
  - Parallel prime number calculation
  - Signal filtering comparison (sequential vs parallel)
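A minimal sketch of the kind of comparison this chapter builds toward, using `multiprocessing.Pool` for the prime-counting exercise (the range size and `chunksize` are arbitrary illustration values, not fixed by the chapter):

```python
# Counting primes in a range: sequential loop vs. a process pool.
import time
from multiprocessing import Pool

def is_prime(n):
    """Naive primality test, deliberately CPU-bound."""
    if n < 2:
        return False
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:
            return False
    return True

def count_primes(numbers):
    return sum(is_prime(n) for n in numbers)

if __name__ == "__main__":  # required for the spawn start method (Windows/macOS)
    numbers = range(2, 200_000)

    t0 = time.perf_counter()
    sequential = count_primes(numbers)
    t1 = time.perf_counter()

    with Pool() as pool:  # defaults to os.cpu_count() worker processes
        parallel = sum(pool.map(is_prime, numbers, chunksize=1_000))
    t2 = time.perf_counter()

    print(f"sequential: {sequential} primes in {t1 - t0:.2f}s")
    print(f"parallel:   {parallel} primes in {t2 - t1:.2f}s")
```

Because `is_prime` never waits on I/O, this is a case where processes (not threads) are the right tool; chapter 02 covers the opposite situation.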
### 02 - Multithreading, Async Programming, and the GIL
- What is the GIL (Global Interpreter Lock) in CPython
- Introduction to `asyncio` for I/O-bound tasks
- The `threading` module: using `Thread`, `Lock`, `RLock`, `Event`, `Condition`
- When `threading` is useful vs when to avoid it
- Practical differences between `threading`, `asyncio`, and `multiprocessing` (a comparative sketch follows this outline)
- Advanced synchronization patterns
- Proposed exercises:
  - Multiple threads reading files
  - Web scraping with asyncio
  - Concurrent data acquisition simulation
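A minimal sketch contrasting `ThreadPoolExecutor` and `asyncio` on a simulated I/O-bound workload (the 0.5 s sleep stands in for network latency and is an assumption, not part of the exercise spec):

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

def blocking_fetch(i):
    """Simulated blocking I/O; the GIL is released while sleeping."""
    time.sleep(0.5)
    return f"payload-{i}"

async def async_fetch(i):
    """Simulated non-blocking I/O; the event loop runs other tasks meanwhile."""
    await asyncio.sleep(0.5)
    return f"payload-{i}"

def with_threads(n):
    # Threads work here because the tasks spend their time waiting, not computing.
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(blocking_fetch, range(n)))

async def with_asyncio(n):
    # A single thread, many coroutines multiplexed by the event loop.
    return await asyncio.gather(*(async_fetch(i) for i in range(n)))

if __name__ == "__main__":
    t0 = time.perf_counter()
    with_threads(10)
    t1 = time.perf_counter()
    asyncio.run(with_asyncio(10))
    t2 = time.perf_counter()
    print(f"threads: {t1 - t0:.2f}s  asyncio: {t2 - t1:.2f}s")  # both take ~0.5 s
```

Both finish in roughly the time of a single task despite the GIL, because waiting on I/O does not hold the interpreter; a CPU-bound version of the same test would show no such speedup with threads.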
### 03 - Distributed Computing with MPI and Modern Alternatives
- Introduction to MPI and `mpi4py` (a minimal `mpi4py` sketch follows this outline)
- `mpiexec` and distributed execution
- Process communication: `send`, `recv`, `broadcast`, `scatter`, `gather`
- Modern alternatives: `joblib` and `Ray` for distributed computing
- Comparison: MPI vs modern distributed frameworks
- Proposed exercises:
  - Distributed FFT computation
  - Monte Carlo simulation across processes
- Fallback: `joblib` parallel processing if MPI setup fails
- Required infrastructure: real cluster or local simulation
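A minimal `mpi4py` sketch in the spirit of the Monte Carlo exercise, assuming `mpi4py` is installed and the script is launched with something like `mpiexec -n 4 python pi_mpi.py` (the file name and sample count are illustrative):

```python
import random
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

SAMPLES_PER_RANK = 1_000_000  # arbitrary workload per process

# Each rank draws its own random points in the unit square.
random.seed(rank)
hits = sum(
    1
    for _ in range(SAMPLES_PER_RANK)
    if random.random() ** 2 + random.random() ** 2 <= 1.0
)

# Sum the per-rank counts on rank 0 and report the estimate there.
total_hits = comm.reduce(hits, op=MPI.SUM, root=0)
if rank == 0:
    pi = 4.0 * total_hits / (SAMPLES_PER_RANK * size)
    print(f"pi ~ {pi:.5f} using {size} processes")
```

With most MPI installations, running the script directly (no `mpiexec`) simply executes it as a single rank, which can double as the local-simulation setup mentioned above.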
### 04 - GPU Programming with CUDA and ROCm
- Introduction to CUDA: concept of kernels and grids
- PyCUDA vs Numba CUDA: similarities and differences
- First kernel with `numba.cuda` (a minimal kernel sketch follows this outline)
- `CuPy` for NumPy-like GPU operations
- ROCm/HIP alternatives for AMD GPUs
- Memory management: host-device transfers, memory coalescing
- Benchmark: operations on CPU vs GPU
- Proposed exercises:
  - 2D convolution on GPU
  - Matrix multiplication with memory optimization
  - Signal processing kernels
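A minimal first-kernel sketch with `numba.cuda`, assuming Numba and a CUDA-capable GPU with working drivers (the array size and block size are arbitrary choices):

```python
import numpy as np
from numba import cuda

@cuda.jit
def add_kernel(x, y, out):
    i = cuda.grid(1)      # absolute thread index in a 1D grid
    if i < x.size:        # guard: the grid may be larger than the array
        out[i] = x[i] + y[i]

n = 1_000_000
x = np.random.rand(n).astype(np.float32)
y = np.random.rand(n).astype(np.float32)

# Explicit host -> device transfers (the chapter's memory-management topic).
d_x = cuda.to_device(x)
d_y = cuda.to_device(y)
d_out = cuda.device_array_like(d_x)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
add_kernel[blocks, threads_per_block](d_x, d_y, d_out)

result = d_out.copy_to_host()
assert np.allclose(result, x + y)
```

On machines without a GPU, `numba.cuda.is_available()` can be checked first so the workshop code can fall back to a CPU path.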
### 05 - Parallel Libraries and Performance Optimization
- Quick overview of parallel libraries (a minimal `Numba` sketch follows this outline):
  - `Numba` with `@jit(parallel=True)`
  - `Dask` for large structures and automatic parallelism
  - `JAX` for scientific computing and automatic differentiation
  - `PyTorch` and GPU usage (`torch.cuda`)
- Performance profiling tools: `time`, `timeit`, `line_profiler`, `py-spy`
- Numerical precision considerations in parallel computing
- Choosing the right tool for different problem types
- Comparative benchmark: CPU vs GPU vs distributed
- Proposed exercises:
  - Complete signal processing pipeline using multiple approaches
  - Performance comparison: multiprocessing vs Numba vs GPU
  - Real-world optimization problem
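A minimal `Numba` sketch showing `@njit(parallel=True)` with `prange` on a simple moving-average filter (the window length and signal size are illustrative):

```python
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def mean_filter(signal, window):
    """Moving-average filter; the outer loop is split across CPU threads."""
    out = np.empty_like(signal)
    half = window // 2
    for i in prange(signal.size):
        lo = max(0, i - half)
        hi = min(signal.size, i + half + 1)
        out[i] = signal[lo:hi].mean()
    return out

if __name__ == "__main__":
    x = np.random.rand(1_000_000)
    y = mean_filter(x, 101)   # first call includes JIT compilation
    print(y[:5])
```

The first call pays the JIT compilation cost, so any benchmark against the other approaches in this chapter should time a second, warmed-up call.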
## License
MIT – free to use, copy, and modify.