What's Going On

@adbenson. Great catch! You are correct. NOTE TO ALL STUDENTS: The slide you see above has been updated to include @adbenson's fix.

At the top of insert() and delete(), the code now sets cur = prev->next AFTER taking the lock on prev.
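For reference, here is a minimal hand-over-hand sketch of the corrected ordering (hypothetical names and std::mutex rather than the slide's lock primitives; it also assumes a dummy head node). The key point is that cur is read from prev->next only after prev's lock is held, so no other thread can unlink the node cur points to in between:

    #include <mutex>

    struct Node { int value; Node* next; std::mutex lock; };
    struct List { Node* head; std::mutex lock; };   // head is a dummy/sentinel node

    void list_delete(List* list, int value) {
        list->lock.lock();
        Node* prev = list->head;
        prev->lock.lock();            // take prev's lock first...
        list->lock.unlock();
        Node* cur = prev->next;       // ...then read cur, so it cannot already be stale
        while (cur) {
            cur->lock.lock();
            if (cur->value == value) {
                prev->next = cur->next;   // unlink while holding both locks
                cur->lock.unlock();
                prev->lock.unlock();
                delete cur;               // no other thread can still reach this node
                return;
            }
            prev->lock.unlock();
            prev = cur;                   // hand-over-hand: keep cur's lock as the new prev
            cur = cur->next;
        }
        prev->lock.unlock();
    }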


barracuda commented on slide_042 of Distributed Computing in Spark ()

It's a big consideration. There will always be a nonzero probability of hardware failure, particularly in a distributed system of thousands of machines. Note that in this situation, fault tolerance is provided not only by Spark but also by HDFS. HDFS provides the underlying fault tolerance for the original log data by handling replication of that raw data (as alluded to in the 'Note' on the slide), while Spark provides fault tolerance by recomputing lost portions of RDDs from their lineage.


ParallelPorcupine commented on slide_056 of Transactional Memory ()

What is an xaction?


ParallelPorcupine commented on slide_052 of Transactional Memory ()

Why is the if statement on the right !=0 instead of == 0? Does tmRd not return the value?


I believe I see a bug in the delete implementation.

Consider two concurrent delete calls, where the first is trying to delete the second node in the list; it doesn't matter which node the second call is trying to delete.

The first grabs pointers to prev and cur, takes the prev lock, then unlocks the list. At that point the second is able to grab pointers to the same prev and cur, and then waits on the prev lock. The first then deletes cur. However, the second's cur pointer is now invalid, since it was read before the node was deleted.

I think a fix would be for delete to create the cur pointer after acquiring the prev lock.


kongty commented on slide_049 of Transactional Memory ()

Is it scalable? In this example, we only have two threads, so when T0 commits, it only compares with T1. If there are 100 different concurrent transactions, however, when T0 commits, it needs to compare with 99 different transactions. Is it not common to have many concurrent transactions?


If we use fine-grained locking, why do we still need the list-wide lock (list->lock)?


mihirg commented on slide_012 of Parallel Programming Basics ()

Is this a tighter version of the bound given by Amdahl's law?


vikulg commented on slide_028 of Snooping-Based Cache Coherence ()

When P1's cache controller issues a BusRd, P3's cache controller issues a BusWB, which P1's cache controller uses to update its own cache. The BusWB that P3's cache controller issues also updates main memory.

In general, I believe a BusWB from one cache controller updates the state and data for the corresponding cache line in all other caches (that contain this line). In other words, if a cache controller receives a BusWB for a cache line in state I, that cache line should move to state S.


ParallelPorcupine commented on slide_003 of A Modern Multi-Core Processor ()

Superscalar execution executes a single instruction stream, but issues multiple instructions from that stream at once when they do not depend on each other.


In this code, is the while (*lock != 0) the first test, and the second test is in the test_and_set expression in the if statement?
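That's the usual reading of test-and-test-and-set: the plain read in the while loop is the first test, and the atomic test_and_set inside the if is the second. A rough C++ sketch of the pattern (not the slide's exact code; std::atomic exchange stands in for test_and_set and returns the old value):

    #include <atomic>

    // Test-and-test-and-set: spin with plain reads first, and only attempt the
    // atomic exchange when the lock looks free, which keeps coherence traffic low.
    void lock(std::atomic<int>* l) {
        while (true) {
            while (l->load(std::memory_order_relaxed) != 0) { }   // first test: read-only spin
            if (l->exchange(1, std::memory_order_acquire) == 0)   // second test: atomic test-and-set
                return;                                           // old value was 0, so we hold the lock
        }
    }

    void unlock(std::atomic<int>* l) {
        l->store(0, std::memory_order_release);
    }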


Could we get a version of this slide that's filled in? Thanks!


ParallelPorcupine commented on slide_028 of Snooping-Based Cache Coherence ()

At step 4, how can P1 read from P3's cache? Does the memory first have to be written back to main memory after step 3 and then accessed by P1?


ParallelPorcupine commented on slide_057 of GPU Architecture and CUDA Programming ()

The green squares represent special execution units dedicated for instructions such as load/store or sin/cos.


anonymoose commented on slide_042 of Distributed Computing in Spark ()

How important is hardware failure as a consideration when designing distributed systems and algorithms like this? Is hardware so prone to failure that data backups always need to exist? In particular, I'm wondering if hardware failures are a 'thing of the past' or will always affect any distributed system.


What do the green squares represent in the diagram?


dkumazaw commented on slide_068 of GPU Architecture and CUDA Programming ()

This suggests that the total amount of shared memory a block requests is known to the scheduler beforehand, so that it can tell whether a new block will fit on each of the cores. This led me to wonder what would happen if the requested space were of variable length. One pathological example would be an array whose length is a random positive number generated inside the kernel. As it turns out, this is not allowed: whether the size is fixed at compile time or supplied at launch, the system must know the exact amount of space requested by the time the kernel is invoked.

One way to achieve this is to statically declare the space, like the example we saw earlier in this slide deck; in that case the size is known at compile time. If we want to dynamically allocate space, it turns out we need to pass the requested size as an execution configuration parameter when calling the kernel. This is an interesting design choice that makes for more efficient task scheduling!

The following link was a useful resource to understand this behavior. https://devblogs.nvidia.com/using-shared-memory-cuda-cc/
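To make the two options concrete, here is a small CUDA sketch (hypothetical kernels, not from the lecture): the static version fixes the shared-memory size at compile time, while the dynamic version receives its size as the third execution-configuration parameter, so in both cases the scheduler knows each block's requirement at launch.

    #include <cuda_runtime.h>

    // Static allocation: the size (256 floats) is known at compile time.
    __global__ void staticKernel(float* out) {
        __shared__ float buf[256];
        buf[threadIdx.x] = (float)threadIdx.x;
        __syncthreads();
        out[threadIdx.x] = buf[blockDim.x - 1 - threadIdx.x];   // reverse within the block
    }

    // Dynamic allocation: the size is whatever the launch requested.
    __global__ void dynamicKernel(float* out) {
        extern __shared__ float buf[];                          // sized by the launch below
        buf[threadIdx.x] = (float)threadIdx.x;
        __syncthreads();
        out[threadIdx.x] = buf[blockDim.x - 1 - threadIdx.x];
    }

    int main() {
        const int n = 256;
        float* d_out;
        cudaMalloc(&d_out, n * sizeof(float));
        staticKernel<<<1, n>>>(d_out);
        dynamicKernel<<<1, n, n * sizeof(float)>>>(d_out);      // shared-memory bytes passed here
        cudaDeviceSynchronize();
        cudaFree(d_out);
        return 0;
    }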


anonymoose commented on slide_015 of Data Parallel Thinking ()

I found this video to be helpful for understanding the parallelized scan and evaluating its efficiency: https://www.youtube.com/watch?v=RdfmxfZBHpo


anonymoose commented on slide_059 of GPU Architecture and CUDA Programming ()

I'm still confused about warps. In some slides, it seems like a warp is similar to an execution context for threads (like on this slide, where it holds an instruction stream). On others, it seems like it's thread-specific memory. Could someone with a better understanding help reword some of this?


I recall in class we said that since the enqueueing and dequeueing operations happen on opposite sides of the queue we don't need a lock (correct me if my memory is wrong). However, what about the edge case that multiple threads are trying to steal from the same work queue?


We said in class that since the gap between the end of washing and the start of drying will continue to grow, we'll eventually have a bunch of soggy wet clothes waiting to be put in the dryer, and we want to stop putting things in the washing machine. But since the limiting factor here is the length of time needed to dry, why don't we just always wait 15 minutes before putting clothes in the dryer? Throughput should stay the same, and this way we don't get that huge gap between the end of washing and the start of drying.


In case 2 (the second example), it seems like the compiler should be able to figure out that we're still effectively in case 1, since nothing happens between the last cilk_spawn and cilk_sync. From a readability standpoint, it seems much more clear to me that all independent methods get a "cilk_spawn" before them; in other words, case 2 seems like the more readable version of case 1. So, would it be possible and worth implementing that optimization in the compiler so that case 2 actually compiles the same as case 1?


anonymoose commented on slide_028 of Parallel Programming Basics ()

I can convince myself that these are the dependencies once I see them explicitly stated on the slide, but I'm not sure how I'd figure them out from the code on the previous slides. Anyone have tips or suggestions on translating code into parallelizable pieces of work + dependency trees?


You are correct, it's now fixed! Thanks.


Shouldn't the range of the subtree in the first grey block be (N/2, 5N/8) instead of (N/2, 7N/8), since the entire block is doing the work within (N/2, 3N/4)?


belce commented on slide_021 of Parallel Programming Basics ()

In class, we discussed that there are benefits to mapping threads to the same core (cache use, locality, etc.) but it also limits the computational/execution resources each thread has access to.


Interesting discussion here about ISPC abstractions: http://15418.courses.cs.cmu.edu/spring2014/lecture/progmodels/slide_007

Relevant Excerpt:

The point is that there is nothing in the basic ISPC model that dictates this choice and the programmer need not think about the implementation if he/she simply wants to create a correct program. However, as I'm sure it's now clear in this class, the programmer will need to think about the underlying implementation if an efficient program is desired.

Now, I'm being a little loose with my comments here because there are in fact certain details in ISPC that do put constraints on the implementation. One example is the cross-instance functions (e.g, reduceAdd), which do place a scheduling constraint on execution. Since reduceAdd entails communication of values between program instances, all program instances must run up until this point in the program to determine what values to communicate. (Otherwise, it's impossible to perform the reduction!) Consider an ISPC function that contains a reduceAdd. The ISPC system can not run instance 0 to completion before starting instance 1 because the reduceAdd operation, which must be executed as part of instance 0, depends on values computed by instance 1.


Summarising: http://15418.courses.cs.cmu.edu/spring2016/lecture/progabstractions/slide_016

Static interleaved assignment: the assignment is determined at compile time, and program instance i operates on the array elements for which arrayIndex % programCount == i
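Not ISPC, but a plain C-style sketch of what interleaved vs. blocked assignment means for a given program instance (the function arguments here are just stand-ins for ISPC's built-in programIndex and programCount):

    // Interleaved: instance i touches every index with index % programCount == i.
    void interleaved(float* x, int N, int programIndex, int programCount) {
        for (int idx = programIndex; idx < N; idx += programCount)
            x[idx] *= 2.0f;
    }

    // Blocked: instance i touches one contiguous chunk of roughly N / programCount elements.
    void blocked(float* x, int N, int programIndex, int programCount) {
        int chunk = (N + programCount - 1) / programCount;
        int start = programIndex * chunk;
        int end   = (start + chunk < N) ? (start + chunk) : N;
        for (int idx = start; idx < end; idx++)
            x[idx] *= 2.0f;
    }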

bmperez's thoughts on the difference between launch and foreach:

foreach expresses SIMD parallelism, and so it will (almost always) be mapped to vectorized instructions. Launch, on the other hand, specifies multi-core parallelism, and so it will be mapped using some thread-based implementation; it could either be by launching N pthreads, or by launching 1 pthread per processor, and then performing work stealing.

At a more abstract level, foreach is very declarative; it only specifies that this piece of data is fully parallelizable, and does not tell the compiler how to partition the data. Launch, on the other hand, is more imperative: the function that is launched tells the compiler how to partition the data, and the compiler must follow that exactly. Thus, foreach gives the compiler the freedom to partition the data, while the function that launch runs dictates the partitioning.


For anyone who is interested, I looked up the gather instruction in the manual; here's the definition:

VPGATHERDQ xmm1, vm32x

Using dword indices specified in vm32x, gather qword values from memory conditioned on mask specified by xmm2. Conditionally gathered elements are merged into xmm1.

I was unfamiliar with what dwords and qwords were, so I looked them up. Here's the relevant section of the Wikipedia article:

Data structures containing such different sized words refer to them as WORD (16 bits/2 bytes), DWORD (32 bits/4 bytes) and QWORD (64 bits/8 bytes) respectively.
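In scalar terms, a gather is just an indexed load per lane. A rough C++ sketch of the behavior described above (simplified; the real instruction also clears mask elements as they complete):

    // Masked gather sketch: for each active lane, load base[index[lane]] and merge
    // it into dst; inactive lanes keep their previous dst values.
    // VPGATHERDQ uses dword (32-bit) indices to gather qword (64-bit) values.
    void gather(long long* dst, const long long* base,
                const int* index, const bool* mask, int lanes) {
        for (int lane = 0; lane < lanes; lane++) {
            if (mask[lane])
                dst[lane] = base[index[lane]];
        }
    }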


barracuda commented on slide_014 of Parallel Programming Abstractions ()

(@WLTDO) Yes, the load at one time-step across the program instances is implemented by a vector instruction, so it would have to be completed before the next one.


barracuda commented on slide_003 of Parallel Programming Basics ()

You’re correct that this code is run by each gang member. However, from the perspective of a single gang member running this code, it’s just a sequential piece of code and each iteration occurs one after another serially. The parallelism was specified when this ISPC function was called, in which case the underlying ISPC implementation had the flexibility to parallelize the original task by initializing a gang of instances, each running this piece of code, by using vector instructions. Also, it’s worth noting there’s a subtle distinction between "concurrent" pieces of work (can be executed independently and in any order, and potentially in parallel) and "parallel" pieces of work (executed simultaneously).


danieltc commented on slide_010 of Parallel Programming Basics ()

In lecture, the justification for bounding speedup to 2 was taking the limit as p goes to infinity. However, in practice, if there are only N units of work, we can't split that amongst more than N processors (giving a processor a fraction of a unit of work doesn't make sense). Why do we use the idea of infinite processors when talking about the bounds? Are we just doing something similar to the style of big O, where we only care about orders of magnitude?
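For reference, a quick sketch of the limiting argument, assuming (as in the example described above) a sequential fraction of s = 1/2 of the total work:

    speedup(p) = 1 / (s + (1 - s)/p)

As p grows, the (1 - s)/p term shrinks toward 0, so speedup(p) approaches 1/s; with s = 1/2 that limit is 2. With only N units of work, processors beyond N contribute nothing, so the infinite-p limit is simply a clean upper bound that any finite configuration approaches but never exceeds -- in that sense it plays the same role as a big-O style statement.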


danieltc commented on slide_003 of Parallel Programming Basics ()

I understand that there is no explicit parallelism in this code, and that the implementation of the code is executed via 8-wide vector instructions; however, I don't understand why we say this code is not parallelized by ISPC. What do we mean exactly by "parallelized" here? Is it not part of the abstraction of ISPC that we are guaranteed a gang of program instances will execute this code concurrently?


danieltc commented on slide_017 of Parallel Programming Abstractions ()

My understanding is that for ISPC, we can summarize the following:

Abstractions: Within the C code, there is a call to sinx() that computes sine for every element of the input by spawning a gang of ISPC program instances, where each program instance runs the ISPC code. Upon return, all instances have completed and the result is correct.

Within the ISPC function, we use abstractions such as programCount, programIndex, and uniform to explicitly define assignment of inputs to program instances (interleaved vs. blocked), or we can raise the level of abstraction and use foreach to allow for any assignment.

Implementation: The number of instances in a gang is equal to the SIMD width of the underlying hardware. The ISPC compiler generates SIMD vector instructions and performs masking as necessary. This is executed all on one core.

If the program does not specify assignment (e.g., if the program uses foreach), then the implementation decides the assignment of inputs to program instances. By default, ISPC uses the interleaved assignment.

Is there anything I'm missing / misunderstanding?


danieltc commented on slide_078 of A Modern Multi-Core Processor ()

Is it true that our Myth machines are identical to this example, except in each core we have 2 SIMD Exec units instead of 1 scalar and 1 SIMD?


danieltc commented on slide_076 of A Modern Multi-Core Processor ()

I think the difference is as follows:

A superscalar dual core processor has two cores, where each core has 2 fetch/decode units, 2 ALUs, and 1 shared execution context. It exploits ILP by (potentially) executing multiple independent instructions at once e.g. the first two lines in the forall loop. In total, it can run 2 separate instruction streams with superscalar execution, totalling a maximum of 4 instructions per clock.

The SIMD quad-core processor has 4 cores, where each core has one fetch/decode unit, 8 ALUs, and 1 shared execution context. It exploits the fact that the same instruction is being applied to multiple data using 8-wide vector instructions, e.g. calculating result[i] for i = 0, ..., 7 at the same time. In total, it can run 4 separate instruction streams, each with 8-wide instructions, totalling a maximum of 32 vector lanes being computed per clock.


anonymoose commented on slide_014 of Parallel Programming Abstractions ()

It seems like another benefit of interleaving like this is that, depending on the context of what's being computed, work may be more evenly distributed among program instances, i.e., assuming element i takes a similar amount of time to compute as element i+1.


WLTDO commented on slide_018 of Parallel Programming Abstractions ()

I would think it is O(programCount), since reduce_add() just adds up the partial sum in each program instance (i.e., adding up 8 floats if we have 8 program instances).

Considering that the number of program instances in a gang is relatively small, reduce_add() is just a constant time operation.
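A conceptual sketch of the semantics (not ISPC's actual implementation, which is free to use horizontal/tree-structured vector adds): one partial value per program instance, summed into a single uniform result.

    // reduce_add semantics: sum the gang's partial values; programCount is small
    // (e.g., 8), so this is O(programCount) work, effectively constant time.
    float reduce_add_semantics(const float* partial, int programCount) {
        float sum = 0.0f;
        for (int i = 0; i < programCount; i++)
            sum += partial[i];
        return sum;
    }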


ParallelPorcupine commented on slide_058 of A Modern Multi-Core Processor ()

Yes, that clears things up. Thanks so much!


siyuliu3 commented on slide_018 of Parallel Programming Abstractions ()

What is the time complexity of reduce_add()?


WLTDO commented on slide_014 of Parallel Programming Abstractions ()

Is my understanding correct that the program could not advance to the next step (e.g., i=1) until all the program instances have finished their work at the current step (e.g., i=0)?

In addition, would all instances stall until the single memory load is completed? What about the case in which the program needs to make several memory loads?


kayvonf commented on slide_049 of A Modern Multi-Core Processor ()

@truenorthwest. That is correct.

In general, let's go back to first principles. The overall goal is to minimize processor stalls, since if a processor stalls it's not doing useful work, and thus it is not running at peak efficiency.

One way to reduce stalls is to reduce the time it takes to perform long latency operations, such as a memory fetch. A cache is one technique to reduce the time to fetch a value from a memory address, and thus caches can reduce the number of cycles a processor is stalled waiting on data from memory. Caches are effective when a program exhibits high temporal locality in its data accesses -- that is, it accesses the same memory address multiple times in a short period of time.

Hardware multi-threading is another technique for preventing processor stalls due to waiting on memory. However, instead of reducing the latency of a memory access, hardware multi-threading provides the opportunity for the processor to do other useful work while a high latency operation, such as a memory access, is taking place. No stall occurs, even though the latency of memory access remains high. It's common to say multi-threading is a mechanism to "hide memory latency", but what we really mean is that multi-threading is a mechanism to keep high memory access latency from resulting in processor stalls. In other words, the high latency of memory access is still there, but its effect on performance is "hidden".

Hardware multi-threading can reduce stalls even if a program does not have temporal locality, but using this technique comes at the cost of needing to have two threads of work available in a program (the programmer must expose additional concurrency). There are also hardware costs, such as the need to maintain execution contexts for two threads on chip at once.