Stanford CS149, Fall 2019
PARALLEL COMPUTING
From smartphones, to multi-core CPUs and GPUs, to the world's largest supercomputers and websites, parallel processing is ubiquitous in modern computing. The goal of this course is to provide a deep understanding of the fundamental principles and engineering trade-offs involved in designing modern parallel computing systems, as well as to teach the parallel programming techniques needed to use these machines effectively. Because writing good parallel programs requires an understanding of key machine performance characteristics, this course covers both parallel hardware and software design.
Basic Info
Tues/Thurs 3:00-4:20pm
Gates B3
Instructors: Kayvon Fatahalian and Kunle Olukotun
See the course info page for details on course policies and logistics.
Fall 2019 Schedule
Sep 24 | Motivations for parallel chip designs, challenges of parallelizing code
Sep 26 | Forms of parallelism: multicore, SIMD, threading + understanding latency and bandwidth
Oct 1 | Ways of thinking about parallel programs, and their corresponding hardware implementations, ISPC programming
Oct 3 | Thought process of parallelizing a program in data parallel and shared address space models
Oct 8 | Achieving good work distribution while minimizing overhead, scheduling Cilk programs with work stealing
Oct 10 | Message passing, async vs. blocking sends/receives, pipelining, increasing arithmetic intensity, avoiding contention
Oct 15 | CUDA programming abstractions, and how they are implemented on modern GPUs
Oct 17 | Data parallel thinking: map, reduce, scan, prefix sum, groupByKey (a scan sketch appears below the schedule)
Oct 22 | Producer-consumer locality, RDD abstraction, Spark implementation and scheduling
Oct 24 | Definition of memory coherence, invalidation-based coherence using MSI and MESI, false sharing
Oct 29 | Directory-based coherence, machine-level atomic operations, implementing locks, implementing barriers
Oct 31 | Consistency vs. coherence, relaxed consistency models and their motivation, acquire/release semantics
Nov 5 | Midterm Exam
Nov 7 | Fine-grained synchronization via locks, basics of lock-free programming: single-reader/writer queues, lock-free stacks, the ABA problem, hazard pointers
Nov 12 | Motivation for transactions, design space of transactional memory implementations, lazy-optimistic HTM
Nov 14 | Energy-efficient computing, motivation for heterogeneous processing, fixed-function processing, FPGAs, mobile SoCs
Nov 19 | Motivation for DSLs, case study on Halide image processing DSL
Nov 21 | Performance programming for FPGAs and CGRAs using Spatial
Dec 3 | GraphLab, Ligra, and GraphChi, streaming graph processing, graph compression
Dec 5 | Scheduling convolution layers, exploiting precision and sparsity, DNN accelerators (e.g., GPU Tensor Cores, TPU)
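For a concrete taste of one scheduled topic, below is a minimal sketch of the scan (exclusive prefix sum) pattern from the Oct 17 lecture on data parallel thinking. This is hypothetical illustrative code, not course material: a serial reference implementation alongside a work-efficient (O(n) total work) upsweep/downsweep scan whose inner loops are data-parallel steps. Each iteration of a step updates disjoint elements, so the iterations of one step could map to SIMD lanes or CUDA threads, with a barrier between steps.

    // Exclusive prefix sum: out[i] = in[0] + ... + in[i-1], with out[0] = 0.
    // Hypothetical example code (not from the course).
    #include <cstdio>
    #include <vector>

    // Serial reference implementation.
    std::vector<int> scan_serial(const std::vector<int>& in) {
        std::vector<int> out(in.size());
        int sum = 0;
        for (size_t i = 0; i < in.size(); i++) {
            out[i] = sum;
            sum += in[i];
        }
        return out;
    }

    // Work-efficient in-place scan over a power-of-two-sized array.
    // Each inner loop is one data-parallel step over disjoint elements.
    void scan_inplace(std::vector<int>& a) {
        size_t n = a.size();  // assumed to be a power of two
        // Upsweep: build a tree of partial sums.
        for (size_t stride = 1; stride < n; stride *= 2)
            for (size_t i = 2 * stride - 1; i < n; i += 2 * stride)
                a[i] += a[i - stride];
        // Downsweep: push partial sums back down the tree.
        a[n - 1] = 0;
        for (size_t stride = n / 2; stride >= 1; stride /= 2)
            for (size_t i = 2 * stride - 1; i < n; i += 2 * stride) {
                int t = a[i - stride];
                a[i - stride] = a[i];
                a[i] += t;
            }
    }

    int main() {
        std::vector<int> data = {3, 1, 7, 0, 4, 1, 6, 3};
        std::vector<int> ref = scan_serial(data);
        scan_inplace(data);
        for (size_t i = 0; i < data.size(); i++)
            printf("%d %d\n", ref[i], data[i]);  // the two columns should match
        return 0;
    }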
Programming Assignments
Written Assignments
Oct 10 | Written Assignment 1
Oct 24 | Written Assignment 2
Nov 4 | Written Assignment 3
Nov 14 | Written Assignment 4
Dec 3 | Written Assignment 5