Stanford CS149, Fall 2019
PARALLEL COMPUTING

From smartphones, to multi-core CPUs and GPUs, to the world's largest supercomputers and websites, parallel processing is ubiquitous in modern computing. The goal of this course is to provide a deep understanding of the fundamental principles and engineering trade-offs involved in designing modern parallel computing systems, as well as to teach the parallel programming techniques necessary to effectively utilize these machines. Because writing good parallel programs requires an understanding of key machine performance characteristics, this course covers both parallel hardware and software design.

Basic Info
Tues/Thurs 3:00-4:20pm
Gates B3
Instructors: Kayvon Fatahalian and Kunle Olukotun
See the course info page for more info on course policies and logistics.
Fall 2019 Schedule
Sep 24
Motivations for parallel chip designs, challenges of parallelizing code
Sep 26
Forms of parallelism: multicore, SIMD, threading + understanding latency and bandwidth
Oct 1
Ways of thinking about parallel programs, and their corresponding hardware implementations, ISPC programming
Oct 3
Thought process of parallelizing a program in data parallel and shared address space models
Oct 8
Achieving good work distribution while minimizing overhead, scheduling Cilk programs with work stealing
Oct 10
Message passing, async vs. blocking sends/receives, pipelining, increasing arithmetic intensity, avoiding contention
Oct 15
CUDA programming abstractions, and how they are implemented on modern GPUs
Oct 17
Data-Parallel Thinking
Data parallel thinking: map, reduce, scan, prefix sum, groupByKey
Oct 22
Distributed Computing using Spark
Producer-consumer locality, RDD abstraction, Spark implementation and scheduling
Oct 24
Snooping-Based Cache Coherence
Definition of memory coherence, invalidation-based coherence using MSI and MESI, false sharing
Oct 29
Memory Consistency
Consistency vs. coherence, relaxed consistency models and their motivation, acquire/release semantics
Oct 31
Directory-Based Coherence + Implementing Synchronization
Directory-based coherence, machine-level atomic operations, implementing locks, implementing barriers
Nov 5
Midterm Exam
Nov 7
Fine-Grained Synchronization and Lock-Free Programming
Fine-grained synchronization via locks, basics of lock-free programming: single-reader/writer queues, lock-free stacks, the ABA problem, hazard pointers
Nov 12
Transactional Memory
Motivation for transactions, design space of transactional memory implementations, lazy-optimistic HTM
Nov 14
Heterogeneous Parallelism and Hardware Specialization
Energy-efficient computing, motivation for heterogeneous processing, fixed-function processing, FPGAs, mobile SoCs
Nov 19
Domain-Specific Programming Systems
Motivation for DSLs, case study on Halide image processing DSL
Nov 21
Parallel Graph Processing Frameworks + How DRAM Works
GraphLab, Ligra, and GraphChi, streaming graph processing, graph compression
Dec 3
Efficiently Evaluating DNNs (or alternative applications topic TBD)
Scheduling convolutional layers, exploiting precision and sparsity, DNN accelerators (e.g., GPU Tensor Cores, TPU)
Dec 5
Parallel DNN Training + Course Wrap Up
Have a great winter break!
Programming Assignments
Oct 4 Assignment 1: Analyzing Parallel Program Performance on a Quad-Core CPU
Oct 18 Assignment 2: A Runtime System for Scheduling Task Graphs
Oct 31 Assignment 3: A Simple Renderer in CUDA
Nov 19 Assignment 4: Big Graph Processing in OpenMP
Dec 3 Assignment 5: Cluster-Scale Processing in Spark
Written Assignments
Oct 10 Written Assignment 1
Oct 24 Written Assignment 2
Nov 1 Written Assignment 3
Nov 14 Written Assignment 4
Nov 21 Written Assignment 5