Lectures and Readings: Parallel Programming

Stanford CS149, Fall 2019

PARALLEL COMPUTING

This page contains lecture slides and recommended readings for the Fall 2019 offering of CS149. Lecture videos are available via SCPD.

(motivations for parallel chip designs, challenges of parallelizing code)

Further Reading:

The Future of Microprocessors. by K. Olukotun and L. Hammond, ACM Queue 2005
Power: A First-Class Architectural Design Constraint. by Trevor Mudge, IEEE Computer 2001

(forms of parallelism: multicore, SIMD, threading + understanding latency and bandwidth)

Further Reading:

CPU DB: Recording Microprocessor History. A. Danowitz, K. Kelley, J. Mao, J.P. Stevenson, M. Horowitz, ACM Queue 2012. (You can also take a peek at the CPU DB website)
The Compute Architecture of Intel Processor Graphics. Intel Technical Report, 2015 (a very nice description of a modern throughput processor)
Intel's Haswell CPU Microarchitecture. D. Kanter, 2013 (realworldtech.com article)
NVIDIA GV100 (Volta) Whitepaper. NVIDIA Technical Report 2017

(ways of thinking about parallel programs, and their corresponding hardware implementations)

Further Reading:

(message passing, async vs. blocking sends/receives, pipelining, increasing arithmetic intensity, avoiding contention)

Further Reading:

(CUDA programming abstractions, and how they are implemented on modern GPUs)

Further Reading:

You may enjoy the free Udacity Course: Intro to Parallel Programming Using CUDA, by Luebke and Owens
The Thrust library is a useful library of parallel algorithms and containers for CUDA.
Rise of the Graphics Processor. D. Blythe, Proceedings of the IEEE 2008 (a nice overview of GPU history)
NVIDIA GeForce GTX 1080 Whitepaper. NVIDIA Technical Report 2016
NVIDIA Tesla P100 Whitepaper. NVIDIA Technical Report 2016
NVIDIA Tesla V100 Whitepaper. NVIDIA Technical Report 2017
The Compute Architecture of Intel Processor Graphics. Intel Technical Report, 2015 (a very nice description of a modern Intel integrated GPU)
Pascal Tuning Guide. NVIDIA CUDA Documentation

(map, reduce, fold, scan, gather/scatter. Parallel implementations of scan. Data-parallel algorithm design.)