What's Going On

Really appreciate Kayvon explaining this part, which helped clear up a lot of concerns I had before.


I wish I had taken this course a couple of years ago for the advice in this last lecture! Thanks so much Kayvon!


arkhan commented on Efficiently Evaluating DNNs, Slide 3

I wonder if things could be made more efficient by turning the floating-point multiplications into fixed-point integer multiplications?


haofeng commented on Efficiently Evaluating DNNs, Slide 24

Practically, convolution is converted into Toeplitz matrices using the function im2col (implementation by Caffe at https://caffe.berkeleyvision.org/tutorial/layers/im2col.html). After the convolution operation backed by matrix multiplication, the inverse function col2im is used to obtain the result in the original form.
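
To make this concrete, here is a minimal single-channel, stride-1, no-padding sketch of the idea (a big simplification of Caffe's actual im2col, which also handles channels, stride, and padding):

```python
import numpy as np

# Minimal im2col sketch: unroll each K x K window of the image into a column,
# so that the convolution becomes one matrix multiplication.
def im2col(img, K):
    H, W = img.shape
    out_h, out_w = H - K + 1, W - K + 1
    cols = np.empty((K * K, out_h * out_w))
    for i in range(out_h):
        for j in range(out_w):
            cols[:, i * out_w + j] = img[i:i + K, j:j + K].ravel()
    return cols

img = np.random.rand(5, 5)
kernel = np.random.rand(3, 3)
out = kernel.ravel() @ im2col(img, 3)   # the whole convolution is now a GEMM
out = out.reshape(3, 3)                 # for this simple forward case, "col2im" is just a reshape
```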


The communication of outputs can be very costly and restricts the level of parallelism that can be achieved by forcing some sort of synchronization.


teapot commented on Efficiently Evaluating DNNs, Slide 50

There have been various frameworks for architecture search. AutoML is one of these examples which provides automatic architecture and hyperparameter tuning.


This is quite an interesting point. Large mini-batches can also sometimes provide more stable gradient updates than smaller mini-batches, which helps the model learn faster.


Breaking up a batch of data and distributing it to other nodes or GPUs is a common approach when training DNNs. This requires that the parameters be updated in sync with each other on the main node, but distributing computation can give large speedups for training. This is such a common practice that PyTorch makes it quite easy with a dedicated distributed data parallel package that takes care of the details.
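
For reference, a minimal sketch of what using that package looks like (illustrative, not the exact tutorial code: MyModel and loader are placeholders, and it assumes one process per GPU on a single node, launched with something like torchrun):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")          # join the group of worker processes
rank = dist.get_rank()                            # single node assumed: rank == local GPU index
device = torch.device(f"cuda:{rank}")

model = DDP(MyModel().to(device), device_ids=[rank])   # gradients are all-reduced for us
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for inputs, targets in loader:                    # loader would use a DistributedSampler
    loss = torch.nn.functional.cross_entropy(model(inputs.to(device)), targets.to(device))
    optimizer.zero_grad()
    loss.backward()                               # DDP synchronizes gradients here
    optimizer.step()
```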


My understanding of Halide is that it contains very efficient implementations of commonly used algorithms in image processing, but also allows users to specify the scheduling themselves. Therefore, users with deep knowledge of their machines can exploit the computing resources to the fullest extent, while at the same time getting efficient algorithms "for free".


Here is an example where Intel processors expose APIs for data locality in multithreaded programs: https://software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/optimization-and-programming-guide/openmp-support/openmp-library-support/thread-affinity-interface-linux-and-windows.html


Can I use Spatial to compile code into a linkable file (e.g., a shared library) and use it in the performance-critical parts of programs implemented in higher-level languages (e.g., C++)?


Here is a tutorial on building data-parallel and model-parallel training using PyTorch: https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html


I came across some "model parallelism" concepts for neural networks, where the NNs are split and stored on different machines. It's another way to reduce the memory and computational load for each machine.
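
A minimal sketch of that idea in PyTorch (assuming two CUDA devices; the layer sizes are just illustrative):

```python
import torch
import torch.nn as nn

# Model parallelism sketch: the two halves of the network live on different GPUs,
# and the activations move between devices in forward().
class TwoDeviceNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        return self.part2(x.to("cuda:1"))   # only the activations cross devices
```

Only the activations at the cut point have to move between devices, which is the communication cost you pay for splitting the model.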


zecheng commented on Efficiently Evaluating DNNs, Slide 37

Is it the case that pruning is mainly useful for networks that use ReLU activations, due to the "dying ReLU" issue?


As mentioned earlier, can these stale momenta be used, like the noise from SGD itself, to provide a regularizing effect on training?


Does this strategy for building larger matrix-matrix multiplies work for matrices that are so big, there is no hope of storing even a single row of the matrix in the cache? I wonder if there are other strategies for partial evaluation of matrices (e.g. where you only need to load a constant number of elements into the cache at a time).


ishangaur commented on Efficiently Evaluating DNNs, Slide 14

It allows us to build layers of filters that can be increasingly abstract, providing more useful primitives as input to whatever we need, like a classifier.


Jonathan commented on Efficiently Evaluating DNNs, Slide 58

@anon33 I am not sure about the specifics of how htop interacts with interleaved multithreading but I believe the utilization distinction you mentioned is that waiting for a memory access does not count as utilizing the thread, but performing some kind of arithmetic operation does. This means that you can see quickly whether the current workload is cpu/memory/disk bound by looking at which has the highest utilization.


I'm curious, what's the process for running code on a TPU? How do you compile for it? Is this process relatively accessible or do you need to learn yet another toolchain? We learned how we can program GPUs with CUDA, but I'm curious if I can spend a couple hours and end up running something on a TPU.


@wanze One way of thinking about why an FPGA (or other specialized hardware) is preferable to something like a CPU is that, in practice, most workloads on a CPU spend an overwhelming amount of time on work that is not inherent to the problem (i.e. things like control flow), which can instead be baked into the hardware. CPUs are great because they can run a large variety of programs, but that generality comes at a cost (performance). This is very similar to the trade-off between DSLs and general-purpose languages.


donquixote commented on Efficiently Evaluating DNNs, Slide 4

@orz, "deep" simply refers to a network having many layers (anything more than a couple?). Having many layers makes the network more expressive and enables it to learn more complex representations. Almost all of the cool AI results of the past 1+ decades have been the result of deep networks.


donquixote commented on Efficiently Evaluating DNNs, Slide 4

Bigger and deeper networks*


donquixote commented on Efficiently Evaluating DNNs, Slide 4

@felixw17, the idea of an artificial neural network was proposed independently of GPUs or parallel computing, including the concept of having multiple hidden layers. However, the huge efficiency provided by GPUs and parallel systems enabled researchers to run the bigger and deeper networks that they could only fantasize about in the past. This enabled tremendous advances in the deep learning field over the past 15-20 years. Without these efficient systems, many of the innovations in deep learning today wouldn't have been possible. For example, you could never run really deep networks with billions of parameters if you didn't have the power of GPUs / parallel clustered machines, so many ML papers and many awesome architecture designs (that ultimately led to state-of-the-art accuracy on fundamental benchmarks like image recognition) would never have been possible! If you take an ML class you might hear Andrew Ng mention the rise/improvement of compute as one of the main reasons behind the success of deep learning today.


What's so special about DSPs? Why are they put in FPGAs by default? Also, a DSP can be considered a fixed-function ASIC, right? If so, have there been experiments where people embed ASICs within FPGAs? This would increase the design space of chips: now we can choose an FPGA and then choose what ASICs to put on it to best support our workloads. What if we put an ASIC optimized for deep learning / matrix multiplication as a "tensor block" inside an FPGA, just like the "DSP block"?


In SQL, the programmer's responsibility is to declare the data queries desired, and sometimes to declare them in a way or order that avoids unnecessary computation (e.g. if you will filter certain data, filter it before you do other computations, not after, so that you only compute what you need). The responsibility of the SQL system (the database engine) is to actually implement searches on data stored in specific formats (e.g. row store, column store), and to implement operations like join (e.g. hash join, sort-merge join) or sort (e.g. external merge sort). The system also optimizes your query as much as it can, e.g. by reordering operations.


At first I was gonna say I'd imagine you wouldn't run into memory issues because the size of the transaction record is a constant overhead. But a constant overhead could be bad in practice, e.g. if your transaction record is sized similarly to the data itself then your memory usage is 2x your actual data. I'm not sure I get what "pointer-sized" means here. It seems there would be a way around using too much memory for the txn records, but I can't think of anything.


donquixote commented on Transactional Memory, Slide 6

The point above alerts me to the fact that transactional memory is useless /not applicable when it comes to a single-threaded application. Does this mean that hardware with built-in TM constructs incurs unnecessary overhead when running a single-threaded application? Or are hardware TM constructs capable of making themselves effectively disappear without much effect on performance when they are not needed? Is there a way for the programmer to say "I'm gonna run this on a single thread, so I don't want the hardware to do any of the bookkeeping work needed for TM"?


At first, I was thinking that a workload imbalance among threads is an example of starvation for the thread that does little work / stays idle, but that's not quite true. Starvation is when a process/thread has work to do, but it's not making progress on that work because other processes/threads are using resources disproportionately more.

As a real-world example of starvation in a computer program, maybe a certain hardware processor is physically farther from main memory than another processor, so the latter processor is able to load from memory and perform computations multiple times while the far away processor is still trying to make one load. This might be a silly/unrealistic example, but I think the concept is realistic. As for solutions, it seems one should just be aware of the hardware resources at one's disposal and ensure that the code they write (or the computer architecture they are designing, if they are building a chip) makes it less likely to encounter starvation.


Directories make for "scalable" cache coherence because only the processors for whom a certain cache line state is relevant at a given point in time need to get that info from the directory. In snooping, all processors use one interconnect through which any cache coherence message travels, even if only one or two processors actually care about that broadcast message.


I think it is worth noting, however, that with cache locality we can expect orders-of-magnitude improvements in the best case, since a cache hit is orders of magnitude faster than a DRAM access. In optimizing DRAM locality, we can only hope to gain a much smaller benefit, since the cost of precharging/activating a row is only about 2/3 of the total access time.


To perform matrix-vector multiplication, data flows across diagonals of the processor elements. This is inefficient, as at best only 4 (in this example) processing elements are used at any one time.

Utilization can be maximized in the case of matrix-matrix multiplication, where each diagonal is essentially performing a different matrix-vector multiplication. This allows for 100% utilization of the processing elements.

On a different note, the systolic array is highly specialized for matrix operations, and would not be sensible to use for more general-purpose computing. This is a great example of how hardware can be specialized to certain tasks for greater efficiency, at the loss of generality.


To reiterate the comment above, the problem with synchronous training is that it's incredibly inefficient to serially update an array of weights one element at a time. A good way to remove this bottleneck is to find ways to parallelize updates while minimizing communication.


Thanks so much for being such a wonderful instructor, Kayvon! It's been a great time learning more about parallel programming and architecture.


Thank you for teaching this course Kayvon! It's been a great time learning more about parallel programming/architecture!


Is it possible to extend this idea out to higher-dimensional tensor multiplication, or other tensor operations that need to be performant? Or are those use-cases rare enough that we maintain this systolic array as the 'building block' and compose tensor operations from this instead?


In lecture, Kunle mentioned that these systolic arrays are, in practice, oftentimes around 256x256 in dimension.


pintos commented on Efficiently Evaluating DNNs, Slide 4

@orz I think what makes it deep is that the network sort of generates its own understanding of the problem as it refines its weights.


This is instead of slicing the network in half vertically, which would create a sequential dependency between the nodes computing the different halves of the network. Pipelining would help mitigate this, but transferring an output layer that's 256x27x27 is still unreasonable.


It is somewhat interesting that things like this may have similar inefficiencies to, say, when we did Mandelbrot with ISPC in a previous assignment. Although some elements may have converged and no longer contribute to the loss, work might need to continue until the error is satisfactory, and those calculations end up being wasted.


How much performance can we really squeeze out though? Or rather is squeezing out this extra performance worthwhile?


I wonder if DIMM designers made the mistake of storing cache lines this way when the first DIMMs were made. How were they able to coordinate the chips to stride memory addresses over them?


Can we also include things like ISPC, Cilk, and OpenMP in the category of Domain-Specific Programming Systems? Or are they too general purpose?


After seeing this picture in one of the first few lectures, I've really come to appreciate the heterogeneity of the design, which provides programmers a combination of resources to tackle many different problems.


To optimize matrix multiplication, because even accessing register files can have significant latency, processing elements can be arranged in a grid where data is passed on each iteration to the two directly adjacent grid elements. Data is passed down to continue the process of dot-producting a column with a row, and passed right so that this can be done for the next row. This direct transfer of data is extremely fast. Although with a single vector this leaves most processing elements unused most of the time, by doing matrix multiplication one column at a time, almost all of the processing elements' potential is used, although some cycles are still wasted at the beginning and end of the multiplication.
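
As a sanity check of this dataflow, here is a minimal cycle-level simulation of an output-stationary systolic array computing C = A @ B (a software sketch of the idea, not how a real TPU is programmed):

```python
import numpy as np

def systolic_matmul(A, B):
    """Cycle-level sketch of an output-stationary systolic array.

    PE (i, j) accumulates C[i, j]; A values stream rightward along rows and
    B values stream downward along columns, staggered so that A[i][k] and
    B[k][j] meet at PE (i, j) at cycle i + j + k.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))
    a_reg = np.zeros((M, N))   # value currently held by each PE from the left
    b_reg = np.zeros((M, N))   # value currently held by each PE from above

    for t in range(M + N + K - 2):
        # Shift: A moves right, B moves down (reverse order avoids clobbering).
        for i in range(M):
            for j in range(N - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        for j in range(N):
            for i in range(M - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
        # Inject staggered operands at the array edges (zeros outside the valid range).
        for i in range(M):
            k = t - i
            a_reg[i][0] = A[i][k] if 0 <= k < K else 0.0
        for j in range(N):
            k = t - j
            b_reg[0][j] = B[k][j] if 0 <= k < K else 0.0
        # Every PE multiplies its two inputs and accumulates into its output.
        C += a_reg * b_reg
    return C

A, B = np.random.rand(4, 3), np.random.rand(3, 5)
assert np.allclose(systolic_matmul(A, B), A @ B)
```

The staggering at the edges is exactly why some processing elements sit idle at the beginning and end of the multiplication.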


pintos commented on Transactional Memory, Slide 60

The sort of mistake tolerance that STM provides is analogous to the fault tolerance that Spark RDDs provide. The idea of keeping a log of actions, to either optimistically wait to do them or pessimistically be ready to undo them, appears to be a valuable approach for providing atomicity and isolation.


In this class we focused a lot on fitting things into memory/cache and lowering the communication cost between machines. I wonder if there are any settings where using faster persistent storage, like flash, is still significantly important?


icebear101 commented on Efficiently Evaluating DNNs, Slide 34

^^ Background: Originally, machine learning was applied in the cloud. Devices transmit their data to the cloud, computation happens in the cloud, and then the results are transmitted back to the devices. However, the latency depends heavily on network conditions. Also, security may become an issue, since users may not want their private data uploaded to the cloud. So computation is being moved onto the devices.


What are some common examples of naive solutions to problems that end up in starvation? In other classes, we always learn about the situations that lead to deadlock, but we hardly discuss situations that lead to starvation and how to avoid them.


Sequential consistency is what programmers typically think of, since all the single-threaded code we run on Intel processors runs in a sequentially consistent manner.


pintos commented on Cache Coherence, Slide 30

Something to note: not every memory operation generates bus traffic. If a processor has a cache line in the modified (or exclusive) state, it can read from and write to it all it wants without informing the other processors, and a line in the shared state can be read without any traffic (though writing to it requires announcing the write on the bus). Moreover, processors that have a cache line in the invalid state ignore all bus traffic related to that cache line.


One thing I think is an advantage of Spark is productivity. Since each transformation yields another RDD, the programmer can very easily chain transformations together in a single line.
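
For example, a minimal PySpark sketch (assuming an existing SparkContext sc; the file path and lambdas are just illustrative):

```python
# Chain transformations into a single pipeline; Spark builds the RDD lineage lazily.
counts = (sc.textFile("hdfs://logs/access.log")
            .filter(lambda line: "ERROR" in line)          # keep only error lines
            .map(lambda line: (line.split()[0], 1))         # key by the first field
            .reduceByKey(lambda a, b: a + b))                # count per key
counts.collect()   # nothing actually runs until an action like collect() forces evaluation
```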


yonkus commented on Efficiently Evaluating DNNs, Slide 39

There was also some discussion about pruning out values not exactly at 0 but near it, which I think ties in with the discussion here about the precision of the network: if you remove those before normalization, they could be far more impactful than you'd originally anticipate.


pintos commented on Data-Parallel Thinking, Slide 49

That last point is an interesting one I hadn't thought of until now. We've been focusing on getting as much parallelism and utilization as possible without thinking about how memory performance comes into play. Our algorithms in this lecture provide the most benefit for very large datasets if the memory system accompanying them can keep up.


@yayoh is this a bit like stochastic modeling? As I understand it, stochastic modeling updates parameters after every training example, so this would be like batch training in a sense?


Just to add on: if you update on the whole dataset every time (full-batch gradient descent), the training loss will decrease monotonically (given a small enough step size).


I would like to greatly thank this class. It's not only about the key concepts learned, but also about how to think in parallel and write efficient programs these days, which I believe will help a lot in future work.


Just a random question: I've always wondered what makes a network "deep". Is there any definition?


The figure just reminds me of the earlier comparison between single-threaded and multi-threaded execution.


This slide shows shared memory and L1 cache storage to be a single unit. Does that mean that a program that uses more of the shared memory will have a smaller L1 cache?


@haiyuem I think the shared memory is carved out of L1 cache storage (http://cs149.stanford.edu/fall20/lecture/gpuarch/slide_59).


I think the idea is that, in many cases, keeping a local (i.e. partial) graph in cache will be enough.


I think the term "kernel" is related to the fact that GPUs were made for performant image processing, and a common image-processing task is applying a convolution matrix (called a kernel) to an image.


Do we have the "definition" of library & language, though it's hard to tell the difference?


Having improved temporal locality improves performance if the program is bandwidth bound, but it does not help performance if the program is compute bound.


One thing to note: if the cache is smaller than the data that a program is scanning through, then it would be as if there was no cache at all.


An interesting thing to note about this slide is that the abstractions presented to the programmer are at different levels of abstraction. This allows for different amounts of software-based optimization by the programmer based on what/how resources are used. What unites these abstractions is that they all provide a way for the programmer to declare what kind of parallelism they want, while the compiler/system handles how the parallelism is implemented.


@kevtan made a good point. We saw with OpenMP on Assignment 4 that dynamic assignment would actually hurt performance, especially if the amount of work in the loop is small.


^^ I also found this conversation about energy tradeoff in various deep learning classes interesting, and came across this paper if anyone’s interested!

https://www.sciencedirect.com/science/article/pii/S0743731518308773

It essentially overviews how energy consumption is measured/estimated in machine learning applications. I thought the coolest thing is that this paper is quite accessible and readable after taking CS149!


We have timing diagrams on the right; the sequential diagram shows that we finish one iteration before we start the next. The effective latency and initiation interval are equal to the sum of the latencies of the inner controllers.

If we take the same foreach controller and we pipeline it, now we overlap execution of inner controllers, so we have three iterations that are running in parallel. Effective latency remains the same but the effective initiation interval is the latency of the longest-running inner controller.


marwan commented on Efficiently Evaluating DNNs, Slide 46

By designing the algorithm differently, we can get the same convolution using a different topology.


The idea of momentum is to maintain an update vector that is the amount the parameters will be updated by at each time step. This update vector is computed by adding a fraction of the previous time step's update vector to the current gradient. This makes it possible to apply gradients asynchronously, as the past gradients from a calculation that was scheduled earlier still impact the update we apply at this time step when a later-scheduled gradient is added.

Momentum was originally designed to improve the stability of SGD, making the hill-climbing down a "ravines" faster. It's very interesting to see these algorithmic techniques become relevant in the context of parallel computing.
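
A minimal sketch of the plain momentum update on a toy quadratic loss (the hyperparameters lr and beta are illustrative; in the asynchronous setting the gradient here could be a stale one):

```python
import numpy as np

# SGD with momentum on f(w) = 0.5 * ||w||^2, whose gradient is simply w.
def sgd_momentum(w0, lr=0.1, beta=0.9, steps=100):
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)            # the update ("velocity") vector
    for _ in range(steps):
        grad = w                    # gradient of 0.5 * ||w||^2
        v = beta * v + grad         # fold a fraction of the past update into the current one
        w = w - lr * v              # apply the update
    return w

print(sgd_momentum([5.0, -3.0]))    # converges toward the minimum at the origin
```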


I ended up coming back to this slide a lot with confusion during WA6 (particularly because I had to figure out what were version numbers, what were stored values, etc). But I’ll try to summarize my understanding now for those that might also be confused:

When you track a read, you want to keep track of the object you read and its version number at the time, which is helpful when committing and you need to validate. E.g., object foo is at version 3.

When you track a write, you also want to keep track of the object you are writing to, the version it was at when you first wrote to it, and the lock around it; this is helpful because the version number might change by the time you go to commit. And note, the lock is just around the object.

When you track an undo, that's the only place where you really need to keep track of an actual value and where it belongs. In reading and writing, you don't need to keep track of the actual values because in the eager versioning scheme you read optimistically and write pessimistically.
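
Here is a minimal sketch of that bookkeeping (purely illustrative Python, not any real STM's API): reads record a version, writes grab a per-object lock and update memory in place, and the undo log holds the old values needed for rollback.

```python
import threading

class TMObject:
    def __init__(self, value):
        self.value = value
        self.version = 0
        self.lock = threading.Lock()

class Transaction:
    def __init__(self):
        self.read_set = {}     # object -> version observed at first read
        self.write_set = {}    # object -> version at the time we first wrote (and locked) it
        self.undo_log = []     # (object, old value), for rollback on abort

    def read(self, obj):
        self.read_set.setdefault(obj, obj.version)
        return obj.value

    def write(self, obj, value):
        if obj not in self.write_set:
            obj.lock.acquire()                 # pessimistic writes: lock on first write
            self.write_set[obj] = obj.version
        self.undo_log.append((obj, obj.value))
        obj.value = value                      # eager versioning: update memory in place

    def abort(self):
        for obj, old in reversed(self.undo_log):
            obj.value = old                    # restore old values from the undo log
        for obj in self.write_set:
            obj.lock.release()

    def commit(self):
        # Simplified validation: make sure nothing we read was changed by someone else.
        for obj, version in self.read_set.items():
            if obj.version != version and obj not in self.write_set:
                self.abort()
                return False
        for obj in self.write_set:
            obj.version += 1                   # publish a new version, then release the lock
            obj.lock.release()
        return True
```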


Specifically, we want as high a block size as possible while each block still fits in one level of memory; then cache misses only occur once per block rather than for all n elements of each dimension D of an N-D hierarchy. The smaller the block size, the more overhead there is (e.g., the divisions of a square add up to more when the tiles are smaller).
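
For example, a minimal sketch of cache blocking (tiling) for matrix multiplication, where BLOCK would be tuned so that the three tiles being touched fit in cache together:

```python
import numpy as np

def blocked_matmul(A, B, BLOCK=64):
    """Tiled C = A @ B: work on one BLOCK x BLOCK tile of C at a time, reusing
    the A and B tiles while they are resident in cache."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for ii in range(0, n, BLOCK):
        for jj in range(0, n, BLOCK):
            for kk in range(0, n, BLOCK):
                C[ii:ii+BLOCK, jj:jj+BLOCK] += (
                    A[ii:ii+BLOCK, kk:kk+BLOCK] @ B[kk:kk+BLOCK, jj:jj+BLOCK])
    return C

A = np.random.rand(256, 256)
B = np.random.rand(256, 256)
assert np.allclose(blocked_matmul(A, B), A @ B)
```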


Takeaway: Autoschedulers (in this case, Halide's), when targeting CPUs, are starting to produce code that hits the performance of some of the best libraries, i.e. the other bars on this graph.


In lecture, Kayvon explained a bit more about the graph on the left. A student found that a schedule hand-optimized for one of the layers was nearly 2x slower than ideal when applied to other layers. The takeaway is that EACH layer may need to be hand-optimized (potentially using strategies we saw earlier in the lecture) depending on its input size/properties.


harrymellsop commented on Efficiently Evaluating DNNs, Slide 28

In practice, what kind of performance advantages do we get by using this direct implementation, as opposed to a highly optimised matrix multiplication? Obviously the lecture explained the linear improvements in memory footprints, but how does this translate to real-world speed?


Does hardware tend to go obsolete quickly? A common joke in the computer building community is that your top-of-the-line computer will be obsolete in five years due to advancements in CPUs, GPUs, etc. (e.g. NVIDIA recently releasing the RTX 3070 as a $500 graphics card that performs about the same as the 2080 TI, a $1k+ high-end graphics card from 2018). I'm curious if this is an issue companies face, or if supercomputers are built to last for a considerably large amount of time without going obsolete (in comparison to other systems built by rival companies).


If energy cost is proving to be an issue, it can at times be beneficial to recompute values (if possible) instead of loading them from DRAM, since DRAM access is so expensive (useful in mobile systems where power usage is often limited). If it's not possible to recompute values and the system is using a lot of power due to DRAM access, it may be worth looking into ways to compress the data being read from DRAM. On slide 40, we will see that compressing models can lead to significant decreases in size, which will also likely reduce power usage since DRAM usage will decrease.


pintos commented on Parallel Programming Basics, Slide 32

@yayoh I think the "communicate" steps imply that the processors engage in some sort of synchronous communication protocol where the processors send each other their data and acknowledge receiving data.


^^Agree with all the above. I feel equipped to tackle an entirely new set of problems thanks to CS 149!


I can see how AI is pushing the boundaries of computing, it seems like every component of the computer (memory, bandwidth, cache size, processing capability, etc) is being pushed to its limits by these large-scale machine learning algorithms.


I imagine the area of the crossbar is very large to better dissipate heat. The crossbar switch will be constantly active since all memory requests have to travel through it, so it will surely generate a lot of heat.


Having only really worked with software, I never thought about power being a limiting factor for performance. This slide reminds me of the importance of using the knowledge of multiple domains to approach a problem.


These graphs show that more modern DNNs are able to achieve both greater accuracy and vastly higher performance compared to older DNNs like VGG-16. This indicates in this instance that optimizing the algorithms within the domain was a more effective approach than general optimizations like sparsification which only hid some of the costs of using an inefficient network to begin with.


cbfariasc commented on Efficiently Evaluating DNNs, Slide 37

@wanze I think this still provides a benefit because we can omit the near-zero values for the rest of the entire computation, since a near-zero value contributes almost nothing to any other value that uses it as a weight.


endofmoore commented on Distributed Computing using Spark, Slide 49

For any data scientists and fans of R, an R package for Spark is also available online: https://spark.apache.org/docs/latest/sparkr.html#overview


endofmoore commented on Distributed Computing using Spark, Slide 37

In order for the single-loop fusion from the previous slide to work, you need narrow dependencies, where each partition of the parent RDD is referenced by at most one partition of the child RDD.


Nvidia, a giant in the HPC industry, acquired Mellanox, a network chip company, for nearly $7 billion. This is a clear signal of how valuable fast interconnects between machines are for distributed training (and all the other HPC and ML applications that Nvidia works on).


This slide is illustrating the idea of partitioning the network in the horizontal direction, essentially giving the upper half of all the parameters to one worker node, and the bottom half to another worker node. If we use small spatial convolutions and reduce/shrink fully-connected layers to be less connected across the two worker nodes, we can increase parallelism.


endofmoore commented on Distributed Computing using Spark, Slide 8

Just to tie up this end, yes @wanze it is.


endofmoore commented on Data-Parallel Thinking, Slide 14

Having done WA3 and PA3, I think one important application that was overlooked was the use of the prefix sum to compress information in arrays. For example, we used the prefix sum to reduce an array of elements A into an array that shows just the repeats where A[i]==A[i+1]. And in another example we used the prefix sum to reduce an array of boolean values into an array that shows the size of segments.
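
For example, a minimal sketch of that "find the repeats" compaction using an exclusive prefix sum (NumPy here just to show the data movement; on the GPU each step maps to a data-parallel primitive):

```python
import numpy as np

A = np.array([1, 1, 2, 3, 3, 3, 5])
flags = (A[:-1] == A[1:]).astype(int)             # 1 where A[i] == A[i+1]
offsets = np.cumsum(flags) - flags                # exclusive prefix sum of the flags
out = np.empty(flags.sum(), dtype=int)
out[offsets[flags == 1]] = np.nonzero(flags)[0]   # scatter each kept index to its output slot
print(out)                                        # -> [0 3 4], the positions of the repeats
```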


ISPC's foreach explicitly declares that each loop iteration is independent, so iterations can be executed as parallel program instances in groups known as gangs; Cilk has similar semantics, where each call to cilk_spawn implies that the invoked function can be executed independently (asynchronously) of the main thread and of any other call to cilk_spawn. However, when it comes to implementation, ISPC differs from Cilk in that ISPC maps its gang of program instances to SIMD instructions on a single core, while Cilk implements a fork-join pattern where each spawned task can be executed by a worker thread on any core.


endofmoore commented on Cache Coherence, Slide 40

Just to close this off, @ishangaur: yes, I think the graphs assume that the sizes of the caches, at least the L1 and L2 caches, are within the same order of magnitude. However, the key takeaway is that in the MP case there is access to shared data and the presence of a coherence protocol, which results in invalidations in the caches and hence decreased hit rates.


^ *an indicator that the exchange was successful


In the x86 version of test-and-set, EAX is the return value of the function; hence if the lock is not acquired (i.e. dst != EAX) we expect a return value of dst. If the value in EAX is equal to dst, then src is loaded into dst and EAX holds its old value; the return value in EAX will be its old value - an indicator that the lock has been acquired.


There was a question in class regarding what "TPU prime" in this figure represents – I went to find the original paper and this is what they had to say:

Figure 9. Relative performance/Watt (TDP) of GPU server (blue) and TPU server (red) to CPU server, and TPU server to GPU server (orange). TPU’ is an improved TPU that uses GDDR5 memory (see Section 7). The green bar shows its ratio to the CPU server, and the lavender bar shows its relation to the GPU server. Total includes host server power, but incremental doesn’t. GM and WM are the geometric and weighted means.

Thus, it seems like the TPU prime here is the same base TPU design with improved memory bandwidth from using better memory. Their conclusion from the mentioned Section 7 is that this has the biggest impact on overall performance, quoting the following line: "First, increasing memory bandwidth (memory) has the biggest impact: performance improves 3X on average when memory increases 4X." Another example of the importance of memory bandwidth!


@harrymellsop Yes, this can be practically done. I'm not aware of any intrinsics, but the size of a cache line is usually specified in the architecture of the CPU. Typically what engineers do is look it up (online or in the chip manual) and design their programs based on that reference. So yes, you can practically do this. In fact, a common practice for testing the coherence of multicore systems is to induce false sharing, where multiple cores load and store to different indices of an array that fits in one cache line.


nickbowman commented on Efficiently Evaluating DNNs, Slide 9

What these pictures represent is the output that you get when you apply the two filters from the previous slide (one for detecting horizontal gradients and one for detecting vertical gradients) to the original brick wall image that we started with a couple slides back. The result is that we get primitive "edge detection" along both the horizontal and vertical axes that traces out the grout lines from the original image. This is a simple example of how convolution filters with different weights can be used to identify primitive features in an image – stacking these convolutions together with other non-linearities is the basis of today's deep neural networks for image-based tasks.
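
A minimal sketch of that filtering step (the weights here are the classic Sobel-style gradient filters, used just for illustration; the slide's exact filter weights may differ):

```python
import numpy as np

# Direct 3x3 cross-correlation with no padding.
def conv2d(img, k):
    H, W = img.shape
    out = np.zeros((H - 2, W - 2))
    for i in range(H - 2):
        for j in range(W - 2):
            out[i, j] = np.sum(img[i:i+3, j:j+3] * k)
    return out

gx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])   # responds to horizontal gradients
gy = gx.T                                              # responds to vertical gradients
img = np.random.rand(64, 64)                           # stand-in for the brick wall image
edges = np.abs(conv2d(img, gx)) + np.abs(conv2d(img, gy))
```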


The use of a ticket lock would decrease the coherency traffic of test-and-set.


Immediately after P1 acquires the lock, its cache line is invalidated because the other processors are contending for the cache line, repeatedly acquiring it exclusively and then having their own copies invalidated. Once P1 is done, it has to reacquire the cache line in order to release the lock.