Stanford CS149, Fall 2020
PARALLEL COMPUTING
This page contains lecture slides and recommended readings for the Fall 2020 offering of CS149.
(Motivations for parallel chip decisions, challenges of parallelizing code)
Further Reading:
- The Future of Microprocessors. by K. Olukotun and L. Hammond, ACM Queue 2005
- Power: A First-Class Architectural Design Constraint. by Trevor Mudge IEEE Computer 2001
(Forms of parallelism: multicore, SIMD, threading + understanding latency and bandwidth)
Further Reading:
- CPU DB: Recording Microprocessor History. A. Danowitz, K. Kelley, J. Mao, J.P. Stevenson, M. Horowitz, ACM Queue 2012. (You can also take a peek at the CPU DB website)
- The Compute Architecture of Intel Processor Graphics. Intel Technical Report, 2015 (a very nice description of a modern throughput processor)
- Intel's Haswell CPU Microarchitecture. D. Kanter, 2013 (realworldtech.com article)
- NVIDIA GV100 (Volta) Whitepaper. NVIDIA Technical Report 2017
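A minimal sketch of the multicore form of parallelism mentioned above, using C++ `std::thread` (the function name and chunking scheme are illustrative, not from the course materials): each thread sums a contiguous chunk of the input, and the main thread combines the partial results.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Sum a vector using `nthreads` worker threads, each owning one chunk.
double parallel_sum(const std::vector<double>& a, int nthreads) {
    std::vector<double> partial(nthreads, 0.0);   // one slot per thread
    std::vector<std::thread> workers;
    std::size_t chunk = (a.size() + nthreads - 1) / nthreads;
    for (int t = 0; t < nthreads; ++t) {
        workers.emplace_back([&, t] {
            std::size_t lo = t * chunk;
            std::size_t hi = std::min(a.size(), lo + chunk);
            for (std::size_t i = lo; i < hi; ++i) partial[t] += a[i];
        });
    }
    for (auto& w : workers) w.join();             // wait for all chunks
    double total = 0.0;
    for (double p : partial) total += p;          // combine partials serially
    return total;
}
```

Writing to per-thread slots in `partial` avoids synchronizing on a shared accumulator inside the loop.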
(Ways of thinking about parallel programs, and their corresponding hardware implementations, ISPC programming)
Further Reading:
- The story of ispc. by Matt Pharr (an amazing blog post about why a programming model imposing structure can be so important)
- ISPC Programmer's Manual
- Intel Threading Building Blocks
- MIT's StreamIt Project
- Data Parallel Haskell
- Brook for GPUs: Stream Computing on Graphics Hardware
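One way to picture ISPC's SPMD model is a serial C++ emulation of a gang (a sketch only, not ISPC itself; `gang_size` and `program_index` stand in for ISPC's `programCount` and `programIndex`): every program instance runs the same function body, and with interleaved iteration, instance `p` handles elements `p`, `p + gang_size`, and so on.

```cpp
#include <cstddef>
#include <vector>

// Serial emulation of an SPMD gang scaling an array in place.
// The outer loop enumerates program instances; on real SIMD hardware
// all instances would advance together, one per vector lane.
void scale_spmd(std::vector<float>& x, float s, int gang_size) {
    for (int program_index = 0; program_index < gang_size; ++program_index) {
        // Body of "one program instance", interleaved element mapping:
        for (std::size_t i = program_index; i < x.size(); i += gang_size)
            x[i] *= s;
    }
}
```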
(Thought process of parallelizing a program in data parallel and shared address space models)
(Achieving good work distribution while minimizing overhead, scheduling Cilk programs with work stealing)
Further Reading:
- CilkPlus documentation
- Scheduling Multithreaded Computations by Work Stealing. by Blumofe and Leiserson, JACM 1999
- The Implementation of the Cilk-5 Multithreaded Language. by Frigo et al., PLDI 1998
- Intel Threading Building Blocks
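The fork-join pattern that Cilk's work-stealing scheduler executes can be sketched in plain C++ with `std::async` standing in for `cilk_spawn`/`cilk_sync` (an approximation: a real Cilk runtime steals continuations from per-worker deques rather than launching OS threads per task):

```cpp
#include <cstddef>
#include <functional>
#include <future>
#include <numeric>
#include <vector>

// Divide-and-conquer sum: "spawn" the left half, recurse on the right,
// then "sync" by joining the spawned future.
long long sum_range(const std::vector<long long>& a,
                    std::size_t lo, std::size_t hi) {
    if (hi - lo < 1024)                        // serial base case
        return std::accumulate(a.begin() + lo, a.begin() + hi, 0LL);
    std::size_t mid = lo + (hi - lo) / 2;
    auto left = std::async(std::launch::async, // ~ cilk_spawn
                           sum_range, std::cref(a), lo, mid);
    long long right = sum_range(a, mid, hi);   // caller keeps working
    return left.get() + right;                 // ~ cilk_sync
}
```

The serial base case is the usual granularity control: without it, task overhead would swamp the useful work.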
(Message passing, async vs. blocking sends/receives, pipelining, increasing arithmetic intensity, avoiding contention)
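A blocking send/receive, as in the message-passing lecture, can be sketched as a bounded channel (the `Channel` class here is a hypothetical illustration): `send` blocks while the buffer is full and `recv` blocks while it is empty.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>

template <typename T>
class Channel {
    std::queue<T> buf_;
    std::size_t cap_;
    std::mutex m_;
    std::condition_variable not_full_, not_empty_;
public:
    explicit Channel(std::size_t cap) : cap_(cap) {}
    void send(T v) {                 // blocking send
        std::unique_lock<std::mutex> lk(m_);
        not_full_.wait(lk, [&] { return buf_.size() < cap_; });
        buf_.push(std::move(v));
        not_empty_.notify_one();
    }
    T recv() {                       // blocking receive
        std::unique_lock<std::mutex> lk(m_);
        not_empty_.wait(lk, [&] { return !buf_.empty(); });
        T v = std::move(buf_.front());
        buf_.pop();
        not_full_.notify_one();
        return v;
    }
};
```

Chaining such channels between stages is one way to build the pipelines this lecture discusses.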
(CUDA programming abstractions, and how they are implemented on modern GPUs)
Further Reading:
- You may enjoy the free Udacity Course: Intro to Parallel Programming Using CUDA, by Luebke and Owens
- The Thrust Library is a useful collection library for CUDA.
- Rise of the Graphics Processor. by D. Blythe, Proceedings of the IEEE 2008 (a nice overview of GPU history)
- NVIDIA Tesla V100 Whitepaper. NVIDIA Technical Report 2017
- The Compute Architecture of Intel Processor Graphics. Intel Technical Report, 2015 (a very nice description of a modern Intel integrated GPU)
- Volta CUDA Tuning Guide. NVIDIA CUDA Documentation
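One way to see how CUDA's grid/block abstractions map threads onto data is to emulate them serially in C++ (a sketch: the inner body is what a SAXPY kernel computes, and the two loops stand in for the GPU scheduler enumerating blocks and threads):

```cpp
#include <cstddef>
#include <vector>

// Serial emulation of a 1D CUDA launch: gridDim blocks of blockDim threads.
// Each emulated thread computes one element, exactly as a CUDA kernel would.
void saxpy_grid(float a, const std::vector<float>& x, std::vector<float>& y,
                int blockDim, int gridDim) {
    for (int blockIdx = 0; blockIdx < gridDim; ++blockIdx)
        for (int threadIdx = 0; threadIdx < blockDim; ++threadIdx) {
            std::size_t i =
                (std::size_t)blockIdx * blockDim + threadIdx; // global index
            if (i < x.size())          // bounds guard, as in a real kernel
                y[i] = a * x[i] + y[i];
        }
}
```

The bounds guard matters because the launch is rounded up to whole blocks, so the last block may have threads with no element to process.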
(Data parallel thinking: map, reduce, scan, prefix sum, groupByKey)
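The scan primitive named above can be pinned down with a sequential reference implementation of exclusive prefix sum (a sketch of the operation's semantics, not the parallel algorithm): `out[i]` is the sum of all inputs strictly before position `i`.

```cpp
#include <cstddef>
#include <vector>

// Exclusive prefix sum: out[0] = 0 (the identity), out[i] = in[0] + ... + in[i-1].
std::vector<int> exclusive_scan(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    int running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;   // everything strictly before i
        running += in[i];
    }
    return out;
}
```

A parallel implementation computes the same result in O(log n) steps via an up-sweep/down-sweep over a tree of partial sums.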
(Producer-consumer locality, RDD abstraction, Spark implementation and scheduling)
(Definition of memory coherence, invalidation-based coherence using MSI and MESI, false sharing)
(Consistency vs. coherence, relaxed consistency models and their motivation, acquire/release semantics, implementing locks and atomic operations)
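The acquire/release semantics and lock implementation topics above can be sketched as a test-and-set spinlock over `std::atomic_flag` (the `locked_count` demo function is hypothetical): acquire ordering on `lock` keeps later accesses inside the critical section, and release ordering on `unlock` keeps earlier ones inside it.

```cpp
#include <atomic>
#include <thread>

class SpinLock {
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
public:
    void lock() {
        // test_and_set returns the previous value; spin while it was held.
        while (flag_.test_and_set(std::memory_order_acquire))
            ;
    }
    void unlock() { flag_.clear(std::memory_order_release); }
};

// Demo: two threads increment a shared counter under the lock.
long long locked_count(int iters) {
    SpinLock lk;
    long long count = 0;
    auto work = [&] {
        for (int i = 0; i < iters; ++i) {
            lk.lock();
            ++count;          // critical section
            lk.unlock();
        }
    };
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    return count;
}
```

A production spinlock would also spin with plain loads before retrying the test-and-set, to avoid generating coherence traffic on every iteration.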
(Fine-grained synchronization via locks, basics of lock-free programming: single-reader/writer queues, lock-free stacks, the ABA problem, hazard pointers)
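The single-reader/writer queue named in this lecture can be sketched as a bounded ring buffer with atomics (an illustration under the usual SPSC assumptions): only the producer writes `tail_` and only the consumer writes `head_`, so one release/acquire pair on each index is the only synchronization needed, and no locks or CAS loops appear.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

template <typename T>
class SpscQueue {
    std::vector<T> buf_;
    std::atomic<std::size_t> head_{0};  // consumer cursor
    std::atomic<std::size_t> tail_{0};  // producer cursor
public:
    // One slot is left unused to distinguish "full" from "empty".
    explicit SpscQueue(std::size_t cap) : buf_(cap + 1) {}

    bool push(const T& v) {             // called by the producer only
        std::size_t t = tail_.load(std::memory_order_relaxed);
        std::size_t next = (t + 1) % buf_.size();
        if (next == head_.load(std::memory_order_acquire))
            return false;               // full
        buf_[t] = v;
        tail_.store(next, std::memory_order_release);  // publish element
        return true;
    }

    bool pop(T& out) {                  // called by the consumer only
        std::size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire))
            return false;               // empty
        out = buf_[h];
        head_.store((h + 1) % buf_.size(), std::memory_order_release);
        return true;
    }
};
```

The release store on `tail_` paired with the acquire load in `pop` is what guarantees the consumer sees the element's contents before it sees the advanced cursor.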
(Motivation for transactions, design space of transactional memory implementations, lazy-optimistic HTM)
(Energy-efficient computing, motivation for heterogeneous processing, fixed-function processing, FPGAs, mobile SoCs)
(Motivation for DSLs, case study on Halide image processing DSL)
(GraphLab, Ligra, and GraphChi, streaming graph processing, graph compression)
(Performance programming for FPGAs and CGRAs)
(Scheduling convolutional layers, exploiting precision and sparsity, DNN accelerators (e.g., GPU Tensor Cores, TPU))
(Enjoy your Winter holiday break!)