Stanford CS149, Fall 2020
PARALLEL COMPUTING
This page contains lecture slides and recommended readings for the Fall 2020 offering of CS149.
(Motivations for parallel chip decisions, challenges of parallelizing code)
Further Reading:
- The Future of Microprocessors. by K. Olukotun and L. Hammond, ACM Queue 2005
- Power: A First-Class Architectural Design Constraint. by Trevor Mudge IEEE Computer 2001
(Forms of parallelism: multicore, SIMD, threading + understanding latency and bandwidth)
Further Reading:
- CPU DB: Recording Microprocessor History. A. Danowitz, K. Kelley, J. Mao, J.P. Stevenson, M. Horowitz, ACM Queue 2012. (You can also take a peek at the CPU DB website)
- The Compute Architecture of Intel Processor Graphics. Intel Technical Report, 2015 (a very nice description of a modern throughput processor)
- Intel's Haswell CPU Microarchitecture. D. Kanter, 2013 (realworldtech.com article)
- NVIDIA GV100 (Volta) Whitepaper. NVIDIA Technical Report 2017
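A minimal sketch of the multicore form of parallelism mentioned above, using C++ `std::thread` (the function name and chunking scheme are illustrative, not from the course materials): each thread sums a contiguous chunk of the input, and the main thread combines the partial results.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Sum a vector using `nthreads` worker threads, each owning one chunk.
double parallel_sum(const std::vector<double>& a, int nthreads) {
    std::vector<double> partial(nthreads, 0.0);   // one slot per thread
    std::vector<std::thread> workers;
    std::size_t chunk = (a.size() + nthreads - 1) / nthreads;
    for (int t = 0; t < nthreads; ++t) {
        workers.emplace_back([&, t] {
            std::size_t lo = t * chunk;
            std::size_t hi = std::min(a.size(), lo + chunk);
            for (std::size_t i = lo; i < hi; ++i) partial[t] += a[i];
        });
    }
    for (auto& w : workers) w.join();             // wait for all chunks
    double total = 0.0;
    for (double p : partial) total += p;          // combine partials serially
    return total;
}
```

Writing to per-thread slots in `partial` avoids synchronizing on a shared accumulator inside the loop.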
(Ways of thinking about parallel programs, and their corresponding hardware implementations, ISPC programming)
Further Reading:
- The story of ispc. by Matt Pharr (an amazing blog post about why a programming model imposing structure can be so important)
- ISPC Programmer's Manual
- Intel Threading Building Blocks
- MIT's StreamIt Project
- Data Parallel Haskell
- Brook for GPUs: Stream Computing on Graphics Hardware
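One way to picture ISPC's SPMD model is a serial C++ emulation of a gang (a sketch only, not ISPC itself; `gang_size` and `program_index` stand in for ISPC's `programCount` and `programIndex`): every program instance runs the same function body, and with interleaved iteration, instance `p` handles elements `p`, `p + gang_size`, and so on.

```cpp
#include <cstddef>
#include <vector>

// Serial emulation of an SPMD gang scaling an array in place.
// The outer loop enumerates program instances; on real SIMD hardware
// all instances would advance together, one per vector lane.
void scale_spmd(std::vector<float>& x, float s, int gang_size) {
    for (int program_index = 0; program_index < gang_size; ++program_index) {
        // Body of "one program instance", interleaved element mapping:
        for (std::size_t i = program_index; i < x.size(); i += gang_size)
            x[i] *= s;
    }
}
```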
(Thought process of parallelizing a program in data parallel and shared address space models)
(Achieving good work distribution while minimizing overhead, scheduling Cilk programs with work stealing)
Further Reading:
- CilkPlus documentation
- Scheduling Multithreaded Computations by Work Stealing. by Blumofe and Leiserson, JACM 1999
- The Implementation of the Cilk-5 Multithreaded Language. by Frigo et al., PLDI 1998
- Intel Threading Building Blocks
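The fork-join pattern that Cilk's work-stealing scheduler executes can be sketched in plain C++ with `std::async` standing in for `cilk_spawn`/`cilk_sync` (an approximation: a real Cilk runtime steals continuations from per-worker deques rather than launching OS threads per task):

```cpp
#include <cstddef>
#include <functional>
#include <future>
#include <numeric>
#include <vector>

// Divide-and-conquer sum: "spawn" the left half, recurse on the right,
// then "sync" by joining the spawned future.
long long sum_range(const std::vector<long long>& a,
                    std::size_t lo, std::size_t hi) {
    if (hi - lo < 1024)                        // serial base case
        return std::accumulate(a.begin() + lo, a.begin() + hi, 0LL);
    std::size_t mid = lo + (hi - lo) / 2;
    auto left = std::async(std::launch::async, // ~ cilk_spawn
                           sum_range, std::cref(a), lo, mid);
    long long right = sum_range(a, mid, hi);   // caller keeps working
    return left.get() + right;                 // ~ cilk_sync
}
```

The serial base case is the usual granularity control: without it, task overhead would swamp the useful work.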
(Message passing, async vs. blocking sends/receives, pipelining, increasing arithmetic intensity, avoiding contention)
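A blocking send/receive, as in the message-passing lecture, can be sketched as a bounded channel (the `Channel` class here is a hypothetical illustration): `send` blocks while the buffer is full and `recv` blocks while it is empty.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>

template <typename T>
class Channel {
    std::queue<T> buf_;
    std::size_t cap_;
    std::mutex m_;
    std::condition_variable not_full_, not_empty_;
public:
    explicit Channel(std::size_t cap) : cap_(cap) {}
    void send(T v) {                 // blocking send
        std::unique_lock<std::mutex> lk(m_);
        not_full_.wait(lk, [&] { return buf_.size() < cap_; });
        buf_.push(std::move(v));
        not_empty_.notify_one();
    }
    T recv() {                       // blocking receive
        std::unique_lock<std::mutex> lk(m_);
        not_empty_.wait(lk, [&] { return !buf_.empty(); });
        T v = std::move(buf_.front());
        buf_.pop();
        not_full_.notify_one();
        return v;
    }
};
```

Chaining such channels between stages is one way to build the pipelines this lecture discusses.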
(CUDA programming abstractions, and how they are implemented on modern GPUs)
Further Reading:
- You may enjoy the free Udacity Course: Intro to Parallel Programming Using CUDA, by Luebke and Owens
- The Thrust Library is a useful collection library for CUDA.
- Rise of the Graphics Processor. by D. Blythe, Proceedings of the IEEE 2008 (a nice overview of GPU history)
- NVIDIA Tesla V100 Whitepaper. NVIDIA Technical Report 2017
- The Compute Architecture of Intel Processor Graphics. Intel Technical Report, 2015 (a very nice description of a modern Intel integrated GPU)
- Volta CUDA Tuning Guide. NVIDIA CUDA Documentation
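One way to see how CUDA's grid/block abstractions map threads onto data is to emulate them serially in C++ (a sketch: the inner body is what a SAXPY kernel computes, and the two loops stand in for the GPU scheduler enumerating blocks and threads):

```cpp
#include <cstddef>
#include <vector>

// Serial emulation of a 1D CUDA launch: gridDim blocks of blockDim threads.
// Each emulated thread computes one element, exactly as a CUDA kernel would.
void saxpy_grid(float a, const std::vector<float>& x, std::vector<float>& y,
                int blockDim, int gridDim) {
    for (int blockIdx = 0; blockIdx < gridDim; ++blockIdx)
        for (int threadIdx = 0; threadIdx < blockDim; ++threadIdx) {
            std::size_t i =
                (std::size_t)blockIdx * blockDim + threadIdx; // global index
            if (i < x.size())          // bounds guard, as in a real kernel
                y[i] = a * x[i] + y[i];
        }
}
```

The bounds guard matters because the launch is rounded up to whole blocks, so the last block may have threads with no element to process.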
(Data parallel thinking: map, reduce, scan, prefix sum, groupByKey)
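The scan primitive named above can be pinned down with a sequential reference implementation of exclusive prefix sum (a sketch of the operation's semantics, not the parallel algorithm): `out[i]` is the sum of all inputs strictly before position `i`.

```cpp
#include <cstddef>
#include <vector>

// Exclusive prefix sum: out[0] = 0 (the identity), out[i] = in[0] + ... + in[i-1].
std::vector<int> exclusive_scan(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    int running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;   // everything strictly before i
        running += in[i];
    }
    return out;
}
```

A parallel implementation computes the same result in O(log n) steps via an up-sweep/down-sweep over a tree of partial sums.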
(Producer-consumer locality, RDD abstraction, Spark implementation and scheduling)
(Definition of memory coherence, invalidation-based coherence using MSI and MESI, false sharing)
(Consistency vs. coherence, relaxed consistency models and their motivation, acquire/release semantics, implementing locks and atomic operations)
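The acquire/release semantics and lock implementation topics above can be sketched as a test-and-set spinlock over `std::atomic_flag` (the `locked_count` demo function is hypothetical): acquire ordering on `lock` keeps later accesses inside the critical section, and release ordering on `unlock` keeps earlier ones inside it.

```cpp
#include <atomic>
#include <thread>

class SpinLock {
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
public:
    void lock() {
        // test_and_set returns the previous value; spin while it was held.
        while (flag_.test_and_set(std::memory_order_acquire))
            ;
    }
    void unlock() { flag_.clear(std::memory_order_release); }
};

// Demo: two threads increment a shared counter under the lock.
long long locked_count(int iters) {
    SpinLock lk;
    long long count = 0;
    auto work = [&] {
        for (int i = 0; i < iters; ++i) {
            lk.lock();
            ++count;          // critical section
            lk.unlock();
        }
    };
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    return count;
}
```

A production spinlock would also spin with plain loads before retrying the test-and-set, to avoid generating coherence traffic on every iteration.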
(Fine-grained synchronization via locks, basics of lock-free programming: single-reader/writer queues, lock-free stacks, the ABA problem, hazard pointers)
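The single-reader/writer queue named in this lecture can be sketched as a bounded ring buffer with atomics (an illustration under the usual SPSC assumptions): only the producer writes `tail_` and only the consumer writes `head_`, so one release/acquire pair on each index is the only synchronization needed, and no locks or CAS loops appear.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

template <typename T>
class SpscQueue {
    std::vector<T> buf_;
    std::atomic<std::size_t> head_{0};  // consumer cursor
    std::atomic<std::size_t> tail_{0};  // producer cursor
public:
    // One slot is left unused to distinguish "full" from "empty".
    explicit SpscQueue(std::size_t cap) : buf_(cap + 1) {}

    bool push(const T& v) {             // called by the producer only
        std::size_t t = tail_.load(std::memory_order_relaxed);
        std::size_t next = (t + 1) % buf_.size();
        if (next == head_.load(std::memory_order_acquire))
            return false;               // full
        buf_[t] = v;
        tail_.store(next, std::memory_order_release);  // publish element
        return true;
    }

    bool pop(T& out) {                  // called by the consumer only
        std::size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire))
            return false;               // empty
        out = buf_[h];
        head_.store((h + 1) % buf_.size(), std::memory_order_release);
        return true;
    }
};
```

The release store on `tail_` paired with the acquire load in `pop` is what guarantees the consumer sees the element's contents before it sees the advanced cursor.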
(Motivation for transactions, design space of transactional memory implementations, lazy-optimistic HTM)
(Energy-efficient computing, motivation for heterogeneous processing, fixed-function processing, FPGAs, mobile SoCs)
(Motivation for DSLs, case study on Halide image processing DSL)
(GraphLab, Ligra, and GraphChi, streaming graph processing, graph compression)
(Performance programming for FPGAs and CGRAs)
(Scheduling convolutional layers, exploiting precision and sparsity, DNN accelerators (e.g., GPU Tensor Cores, TPU))
(Enjoy your Winter holiday break!)