What's Going On
InnerLight commented on slide_043 of GPU Architecture and CUDA Programming

minor: implement -> implementation


InnerLight commented on slide_016 of Parallel Programming Abstractions

Small typo: peak -> peek


I missed this in the lecture. What exactly is strong atomicity?


bigtimecore commented on slide_041 of Distributed Computing using Spark

do we need mobileViews in memory only because it has multiple future dependencies?


@bigtimecore -- P0 is not "writing" to P1 and P2. P0 wants to write to the memory address X. Like in any invalidation-based coherence protocol, P0 must be the only holder of the line before writing. Therefore, P0 must inform all processors holding a copy of this line in their cache. Once these other processors invalidate the line (no longer hold a copy), P0 can proceed with its write since it now has exclusive access to the line.
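
To make the write path concrete, here is a toy, self-contained C sketch of the invalidation step (my own simplified model of an MSI-style protocol, not code from the lecture):

    #include <stdio.h>

    /* Toy single-line simulation: P0, P1, P2 each hold a copy of the
     * line containing X in the Shared state. Before P0 may write, every
     * other copy must be invalidated so P0 holds the line exclusively. */
    typedef enum { INVALID, SHARED, MODIFIED } state_t;

    #define NPROC 3
    static state_t line[NPROC] = { SHARED, SHARED, SHARED };

    void write_x(int p) {
        if (line[p] != MODIFIED) {
            for (int q = 0; q < NPROC; q++)    /* the invalidation round */
                if (q != p) line[q] = INVALID;
            line[p] = MODIFIED;                /* p is now the sole holder */
        }
        /* ... the actual store to X happens here ... */
    }

    int main(void) {
        write_x(0);   /* P0 writes: P1's and P2's copies are dropped first */
        for (int q = 0; q < NPROC; q++)
            printf("P%d line state: %d\n", q, line[q]);
        return 0;
    }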


@cs149-zzY7C. Note that the coherence protocol (MSI, MESI, MESIF, MOESI, etc.) is different from the implementation of that protocol (snooping vs. directories and combinations thereof).


Observe that test-and-set involves writing, hence there will be a significant amount of cache-coherence-driven traffic between processors 2 and 3, which are taking turns acquiring the line.
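
A minimal C11 sketch of such a lock (hypothetical names; the point is that every acquisition attempt is a write):

    #include <stdatomic.h>

    /* Test-and-set spin lock: atomic_exchange is a write, so every failed
     * attempt still takes the cache line in exclusive state, invalidating
     * the other spinner's copy -- that is the coherence traffic above. */
    void ts_lock(atomic_int *lock) {
        while (atomic_exchange(lock, 1) == 1)
            ;  /* spin: each iteration generates an invalidation */
    }

    void ts_unlock(atomic_int *lock) {
        atomic_store(lock, 0);
    }

The usual mitigation is test-and-test-and-set: spin on an ordinary read (all waiters can share the line quietly) and only attempt the atomic exchange once the lock looks free.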


I really like this slide as a reminder not to optimize prematurely, in particular because we often fall into the trap of over-optimizing sections of our application code that might not be touched very often, or, even worse, get replaced by a new implementation before long. All the time spent optimizing that particular piece of code could have been better spent on more useful things.


I'm not sure if this has been asked yet, but I got curious after re-reading the slides as to whether Intel uses directory-based cache coherence. After some Googling, I think I came to the conclusion that Intel uses an improved version of the MESI protocol called MESIF, but it isn't documented explicitly anywhere (there's a post from 2007 about MESIF: https://www.realworldtech.com/common-system-interface/5/). Directory-based cache coherence protocols seem superior to snooping-based protocols, so I was wondering why they aren't used in recent CPUs?

EDIT: Never mind, obviously I hadn't read the following slides. Surprised that I couldn't easily google this information.


reiterating my understanding -- P0 needs to write to P1 and P2 because this piece of data is shared by those two processors at the moment. this will maintain memory coherence between the multiple processors


revisiting this slide -- what exactly is the benefit of the multiple-ring bus? to allow "private" channels for communication?

following, what is the benefit of having each L3 bank connected twice?


mlakshmi commented on slide_048 of Why Parallelism? Why Efficiency?

The biggest takeaway for me was how the answer to 'Why parallelism?' has changed over the years; how ILP has been tapped out and frequency scaling is limited by power.


In class, I didn't quite understand the answers to these, and I'm not positive I understand what the problem was on the prior slide (is it that processors 2 and 3 send lots of requests for the lock, essentially busy-waiting, and also repeatedly invalidate the line owned by processor 1?). I felt rather lost during this example and was hoping somebody might be able to help clarify.


In fact, you can actually enforce that a variable should not be register allocated by using the keyword volatile.
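
A minimal sketch of the pattern (hypothetical flag name; in modern C you would reach for stdatomic.h, but volatile is what rules out the register allocation discussed here):

    /* Without volatile, the compiler may keep `lock` in a register inside
     * the loop, so a store by another thread is never observed. volatile
     * forces a fresh load from memory on every iteration. */
    volatile int lock = 1;

    void wait_for_release(void) {
        while (lock != 0)
            ;  /* re-reads lock from memory each time around */
    }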


Given simple code like this, the compiler may well optimize the code such that lock is allocated to a register. If it does so, then a change to lock by one thread won't actually be visible to another, since the threads are not going to memory for reads (instead, they just read a register).

So in general, global variables shared in a multithreaded application should be declared volatile so that the compiler won't keep them in registers.


Why do we have to make the assumption that *lock is NOT register allocated?


kayvonf commented on slide_081 of A Modern Multi-Core Processor

Actually, in these examples the OS is not involved. We are talking about hardware multi-threading. So the hardware is making a decision about which thread to run on the processor each cycle. If a thread stalls (cannot make progress in the current cycle because of a dependency on an operation that has not yet completed, such as a memory fetch), then the processor will attempt to run instructions from another thread. From the OS's perspective, both threads are running on the hardware concurrently.


kayvonf commented on slide_041 of Parallel Programming Basics

We have a discussion of this in lecture 10.



Good observation. In Cilk, there is an implicit sync at the end of every function. So before a function returns, it syncs with all functions it has spawned.
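
A tiny Cilk sketch of that rule (hypothetical work function; adding an explicit cilk_sync before the end of spawn_all would change nothing):

    #include <cilk/cilk.h>
    #include <stdio.h>

    void work(int i) { printf("task %d\n", i); }

    void spawn_all(int n) {
        for (int i = 0; i < n; i++)
            cilk_spawn work(i);
        /* no cilk_sync needed: the implicit sync at function end means
         * spawn_all returns only after all n spawned tasks complete */
    }

    int main(void) {
        spawn_all(4);
        printf("done\n");   /* guaranteed to print after all four tasks */
        return 0;
    }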


If this were instead pipelined communication, would the difference be that the start-up latency T0 only occurs once?
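
Roughly yes, under the usual linear cost model (my notation, not necessarily the slide's): with start-up latency $T_0$, per-message transfer time $T_w$, and $n$ messages,

    $T_{\text{unpipelined}} = n\,(T_0 + T_w)$
    $T_{\text{pipelined}} \approx T_0 + n\,T_w$

since the start-up of message $i$ overlaps the transfer of message $i-1$, $T_0$ is paid once rather than $n$ times.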


why don't we need a sync call in here to join back the spawned instances?


mlakshmi commented on slide_034 of A Modern Multi-Core Processor

How about i % 8 == 0 as an if condition?


itsalex commented on slide_043 of Parallel Programming Basics

I think it is because there isn't any data race if every thread is writing the value 0.0f to diff[(index+1)%3]


itsalex commented on slide_022 of Parallel Programming Basics

Given the previous slide saying that mapping related or unrelated threads to the same processor may be beneficial (by maximizing locality vs. using bandwidth more efficiently), is there any "right" answer to this question? Or does it just depend on the specific case of what each thread is doing?


perpendorthogon commented on slide_043 of Parallel Programming Basics

Why is diff[(index+1)%3] not in the lock statement?


kayvonf commented on slide_033 of Why Parallelism? Why Efficiency?

Sometimes students get confused between the terms ILP and superscalar execution.

ILP (instruction level parallelism) is a property of an instruction stream. For example, the program above has ILP 3 in the first part (3 instructions are independent) and ILP 1 in the latter stages.

Superscalar execution is a processor implementation technology where the processor determines when ILP is present in an instruction stream, and exploits the independence of instructions to dispatch multiple instructions simultaneously.
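
For instance, a stream shaped like the slide's example (three independent multiplies feeding a chain of adds):

    float sum_of_squares(float x, float y, float z) {
        /* three independent multiplies: ILP = 3, so a superscalar core
         * can issue all of them in the same cycle */
        float t0 = x * x;
        float t1 = y * y;
        float t2 = z * z;
        /* the adds form a dependence chain: ILP = 1 */
        float s = t0 + t1;   /* must wait for t0 and t1 */
        return s + t2;       /* must wait for s and t2 */
    }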


kayvonf commented on slide_057 of A Modern Multi-Core Processor

@rrastogi. Because this switching between hardware threads occurs at the granularity of a few cycles. Note that in this lecture, multi-threading means choosing which of the available hardware-resident threads should have its next instruction run by the processor core.

Involving the OS to perform a thread context switch is a different matter: that means replacing the OS thread assigned to a specific hardware execution context, which requires copying the register state of the old thread out to memory and swapping the register state of the new OS thread in. That operation may take thousands of cycles. The cost of the OS swapping threads is far greater than the latency of the memory stalls that hardware multi-threading is designed to hide.


jblee94 commented on slide_041 of Parallel Programming Basics

How is the barrier generally implemented?
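
One common approach (a hedged sketch; not necessarily the implementation discussed in lecture 10) is a centralized counter protected by a mutex and condition variable, where the last thread to arrive releases the rest:

    #include <pthread.h>

    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t  all_arrived;
        int count;         /* threads arrived in the current round */
        int generation;    /* guards against spurious wakeups / reuse */
        int num_threads;
    } barrier_t;

    void barrier_wait(barrier_t *b) {
        pthread_mutex_lock(&b->lock);
        int my_gen = b->generation;
        if (++b->count == b->num_threads) {
            b->count = 0;        /* reset for the next round */
            b->generation++;     /* release everyone in this round */
            pthread_cond_broadcast(&b->all_arrived);
        } else {
            while (my_gen == b->generation)
                pthread_cond_wait(&b->all_arrived, &b->lock);
        }
        pthread_mutex_unlock(&b->lock);
    }

More scalable variants (sense reversal, combining trees) reduce contention on the shared counter.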


mlakshmi commented on slide_067 of A Modern Multi-Core Processor

A multi-core processor is one that has multiple cores (Fetch/Decode + ALU + Execution Context).

SIMD execution: Single Instruction, Multiple Data - the same instruction sequence is applied to multiple data items that can be operated on in parallel.

Coherent control flow means the same sequence of control decisions (the same instruction sequence) applies to all elements being operated upon simultaneously.

Memory latency is the time it takes to fetch data from memory.

Memory bandwidth is the rate at which memory provides data to the processor.

A bandwidth-bound application is one where memory bandwidth is the limiting factor in the performance of the application (probably there is more to this?)

Arithmetic intensity is the ratio of arithmetic operations to memory access operations.

I am not quite sure about formally defining interleaved multi-threading or simultaneous multi-threading.
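
For the arithmetic intensity definition, a concrete example may help (saxpy, the usual bandwidth-bound poster child):

    /* saxpy: 2 arithmetic ops (one multiply, one add) per element,
     * against 3 memory ops (load x[i], load y[i], store y[i]), so
     * arithmetic intensity = 2/3 -- low, hence bandwidth bound. */
    void saxpy(int n, float a, float *x, float *y) {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }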


mlakshmi commented on slide_002 of A Modern Multi-Core Processor

Things that could prevent us from obtaining maximum speedup are: (1) communication overhead, (2) uneven distribution of work, and (3) inherently sequential portions of the program, which limit the gains available from parallelism.


rrastogi commented on slide_057 of A Modern Multi-Core Processor

This slide claims that the processor, and not the operating system, decides which thread to run in a hardware-supported multi-threading scheme. Why would this be preferable? In this scheme, it would be impossible (or at least harder) to support thread priorities. In addition, thread scheduling logic seems complicated enough that it seems better for the OS to do it, as opposed to dedicating more hardware resources for the processor to do it.


rrastogi commented on slide_039 of A Modern Multi-Core Processor

Yeah, I believe the operation you two are referring to is the multiply-accumulate operation: https://en.wikipedia.org/wiki/Multiply–accumulate_operation


MansNotHot commented on slide_081 of A Modern Multi-Core Processor

The OS, not the compiler, handles thread scheduling. If a thread is waiting for a resource (e.g., a lock), it will be put to sleep by the OS until the resource becomes available. In the case of deadlock, it will never wake.


mlakshmi commented on slide_039 of Why Parallelism? Why Efficiency?

I had to miss this lecture; could someone help me understand what we mean by a 'single instruction stream'?


rchalla commented on slide_081 of A Modern Multi-Core Processor

How does the compiler know when/how to break the thread in the case of deadlock?


rchalla commented on slide_034 of Why Parallelism? Why Efficiency?

I wonder what the time tradeoff is for the compiler to create this graph and execute it versus just compiling it and running it sequentially?


bigtimecore commented on slide_039 of A Modern Multi-Core Processor

^ Could it be that there exists a single instruction that allows for a multiply-add operation in one clock cycle?


bigtimecore commented on slide_034 of A Modern Multi-Core Processor

In response to the question - a set of operations on a vector of size 8 with a conditional statement on x[i-1] might be very inefficient, because each step relies on the previous step in order to execute.

Do I have the right idea here?


itsalex commented on slide_034 of Why Parallelism? Why Efficiency?

For a superscalar processor to "respect program order" - does it just mean that the order in which the instructions are executed is a topological sort of the dependency graph? Or is this the wrong way to think about it in a parallelized context?


ppenkov commented on slide_025 of A Modern Multi-Core Processor

I believe this is because, as you pointed out, there is only one fetch/decode.

SIMD is particularly effective when we are executing many independent iterations of a for loop as is the case with processing an image (each pixel is independent). Because we know the same operation will be performed on all pixels, we can amortize the fetch/decode and do it once per 8 pixels rather than once per pixel.

However, SIMD is challenging when we have conditional execution. What are the other downsides of SIMD?
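
To make the amortization concrete, a minimal AVX sketch (hypothetical brighten operation; assumes n is a multiple of 8):

    #include <immintrin.h>

    /* Each vector instruction is fetched and decoded once but operates
     * on 8 floats, amortizing instruction-stream overhead 8x. */
    void brighten(float *out, const float *in, int n) {
        __m256 scale = _mm256_set1_ps(1.2f);
        for (int i = 0; i < n; i += 8) {
            __m256 v = _mm256_loadu_ps(in + i);
            _mm256_storeu_ps(out + i, _mm256_mul_ps(v, scale));
        }
    }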


unique_name commented on slide_041 of Why Parallelism? Why Efficiency?

I was curious so I did a quick search on what 'heterogeneous processing' means - it seems that it is like parallel processing, but the cores used are architecturally different from one another. (https://queue.acm.org/detail.cfm?id=3038873)