2. Course Details
Delivery
◦ Lectures/discussions: English
◦ Assessments: English
◦ Ask questions in class if you don’t understand
◦ Email me after class if you do not want to ask in class
◦ DO NOT LEAVE QUESTIONS TILL THE DAY BEFORE THE EXAM!!!
Assessments (this may change)
◦ Homework (~1 per week): 10%
◦ Midterm: 20%
◦ 1 project + final exam OR 2 projects: 35%+35%
3. Course Details
Textbook
◦ Principles of Parallel Programming, Lin & Snyder
Other sources of information:
◦ COMP 322, Rice University
◦ CS 194, UC Berkeley
◦ Cilk lectures, MIT
There are many sources of information on the Internet about writing parallel code
4. Teaching Materials & Assignments
Everything is on Jusur
◦ Lectures
◦ Homeworks
Submit homework through Jusur
Homework is given out on Saturday
Homework is due the following Saturday
You lose 10% for each day it is late
5. Homework 1
First homework is available on Jusur
◦ Install Linux on your computer
It is needed for future homework
It is needed to access the supercomputers
◦ Check settings/hardware
Submit pictures of your settings
Submit description of your processor
◦ Deadline: 27/03/1431 (submit on Jusur)
6. Cheating in Homework/Projects
Cheating
◦ If you cheat, you get zero
◦ If you help others cheat, you will also get zero
◦ Copy + paste from the Internet, e.g. Wikipedia or elsewhere, is also cheating (this is called plagiarism)
◦ You can read any source of information, but you must write answers in your own words
◦ If you have problems, please ask for help.
7. Outline
Previous lecture:
◦ Why study parallel computing?
◦ Topics covered in this course
This lecture:
◦ Example problem
Next week:
◦ Parallel processor architectures
8. Example Problem
We will parallelize a simple problem
Begin to explore some of the issues related to parallel programming and the performance of parallel programs
9. Example Problem: Array Sum
Add all the numbers in a large array
It has 100 million elements
int size = 100000000;
int array[] = {7,3,15,10,13,18,6,4,…};
What code should we write for a sequential program?
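A straightforward sequential version (the same loop appears again on a later slide):
int sum = 0;
for(int i = 0; i < size; i++) {
    sum += array[i];
}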
12. How Do We Parallelize?
Objective: Thinking about parallelism
◦ Multiple processors need something to do
A program/software has to be split into parts
Each part can be executed on a different processor
◦ How do we improve performance over a single processor?
If a problem takes 2 seconds on a single processor,
and we break it into two (equal) parts taking 1 second each,
and we execute the two parts separately, but in parallel, on two processors,
then we improve performance
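In other words: speedup = sequential time / parallel time = 2 seconds / 1 second = 2, the ideal speedup for two processors.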
13. How Do We Parallelize?
Time | Sequential: CPU 0 | Parallel: CPU 0 | Parallel: CPU 1
1    | Part 0            | Part 0          | Part 1
2    | Part 1            |                 |
14. How Do We Start Parallelizing?
What parts can be done separately?
◦ What parts can we do on separate processors?
◦ Meaning: What parts have no data dependence
◦ Data dependence:
The execution of an instruction (or line of code) depends on the execution of a previous instruction (or line of code)
◦ Data independence:
The execution of an instruction (or line of code) does not depend on the execution of a previous instruction (or line of code)
15. Example of Data Dependence
int x = 0;
int y = 5;
x = 3;
y = y + x; //Is this line dependent on the previous line?
16. Data Dependence & Parallelism
In a sequential program, data dependence does not matter: instructions execute one by one, in sequence
In a parallel program, data independence allows parallel execution of instructions, while data dependence prevents parallel execution of instructions
◦ Reduces parallel performance
◦ Reduces the number of processors that can be used
17. Why is Data Dependence Bad For Parallel Programs?
Does not allow correct parallel execution
        CPU0      CPU1
        x = 3;    y = y + x;
Result: x = 3;    y = 5; //(5 + 0): CPU1 read the old value of x
18. Why is Data Dependence Bad For Parallel Programs?
Does not allow correct parallel execution
        CPU0      CPU1
        x = 3;    WAIT
                  y = y + x;
Result: x = 3;    y = 8; //(5 + 3): correct, but CPU1 had to wait
19. Why is Data Dependence Bad For Parallel Programs?
Does not allow correct parallel execution
        CPU0
        x = 3;
        y = y + x;
Result: x = 3;    y = 8; //(5 + 3): correct, but fully sequential
20. Example of Data Independence
int x = 0;
int y = 5;
x = 3;
y = y + 5; //Is this line dependent on the previous line?
21. Why is Data Independence Useful?
Allows correct parallel execution
        CPU0      CPU1
        x = 3;    y = y + 5;
Result: x = 3;    y = 10;
22. Back to Array Sum Example
Does the code have data dependence?
int sum = 0;
for(int i = 0; i < size; i++) {
    sum += array[i]; //sum=sum+array[i];
}
23. Back to Array Sum Example
Does the code have data dependence?
int sum = 0;
for(int i = 0; i < size; i++) {
    sum += array[i]; //sum=sum+array[i];
}
Not so easy to see
24. Back to Array Sum Example
Let’s unroll the loop:
int sum = 0;
sum += array[0]; //sum=sum+array[0];
sum += array[1]; //sum=sum+array[1];
sum += array[2]; //sum=sum+array[2];
sum += array[3]; //sum=sum+array[3];
…
Now we can see the dependence: each statement reads the sum written by the previous statement!
26. Removing Dependencies
Sometimes this is possible
◦ Dependencies are discussed in detail later
Tip: It can be useful to look at the problem being solved by the code, and not the code itself (for the array sum, see the sketch below)
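For the array sum, a sketch of this idea (these lines are illustrative, not from the slides): addition is associative, so the chain of dependent updates can be re-associated into independent partial sums.
//Dependent chain: every statement reads the sum written by the previous one
sum += array[0]; sum += array[1]; sum += array[2]; sum += array[3];
//Re-associated: s0 and s1 have no dependence and can run in parallel
int s0 = array[0] + array[1];
int s1 = array[2] + array[3];
sum = s0 + s1;
This re-association is exactly what the next slide does with S0 and S1.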
27. Break Sum into Pieces
P0 sums the first half; P1 sums the second half
P0: 7 3 1 0 2  ->  S0
P1: 9 5 8 3 6  ->  S1
SUM = S0 + S1
28. Some Details…
A program executes inside a process
If we want to use multiple processors
◦ We need multiple processes
◦ One process for each processor (not a fixed rule)
Processes are big and heavyweight
Threads are lighter than processes
◦ But the same strategy applies
◦ One thread for each processor (not a fixed rule)
We will talk about threads and processes later, if necessary (a minimal thread-creation sketch follows below)
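As an illustration (not from the slides), a minimal sketch of creating one POSIX thread per core in C; worker and NUM_THREADS are hypothetical names:
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 2 //assume one thread per core, and 2 cores

//Hypothetical worker function; each thread receives its own ID
void *worker(void *arg) {
    long threadID = (long)arg;
    printf("Thread %ld running\n", threadID);
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_create(&threads[t], NULL, worker, (void *)t);
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);
    return 0;
}
Each thread gets its own threadID, which is how the code on the next slide knows which part of the array to sum.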
29. What Does the Code Look Like?
int numThreads = 2; //Assume one thread per core, and 2 cores
int sum = 0;
int i = 0;
int middleSum[numThreads]; //partial sums; must be zeroed before the threads start
int threadSetSize = size/numThreads;
//Each thread will execute this code with a different threadID
for( i = threadID*threadSetSize; i < (threadID+1)*threadSetSize; i++)
{
    middleSum[threadID] += array[i];
}
//Only thread 0 will execute this code
if (threadID==0) {
    for(i = 0; i < numThreads; i++) {
        sum += middleSum[i];
    }
}
30. Load Balancing
Which processor is doing more work?
P0: 7 3 1 0 2  ->  S0
P1: 9 5 8 3 6  ->  S1
SUM = S0 + S1
31. Load Balancing
Time | Sequential: P0 | Parallel: P0 | Parallel: P1
1.0  | Part 0 done    | Part 0 done  | Part 1 still running
1.3  |                |              | Part 1 done
2.0  | Part 1 done    |              |
32. Example Problem: Array Sum
Parallelized code is more complex
It requires us to think differently about how to solve the problem
◦ Need to think about breaking it into parts
◦ Analyze data dependencies, and remove them if possible
◦ Need to load balance for better performance
33. Example Problem: Array Sum
However, the parallel code is broken
◦ Thread 0 adds all the middle sums
◦ What if thread 0 finishes its own work, but the other threads have not?
34. Synchronization
P0 will probably finish before P1
P0: 7 3 1 0 2  ->  S0
P1: 9 5 8 3 6  ->  S1
SUM = S0 + S1
35. How Can We Fix The Code to GUARANTEE It Works Correctly?
int numThreads = 2; //Assume one thread per core, and 2 cores
int sum = 0;
int i = 0;
int middleSum[numThreads]; //partial sums; must be zeroed before the threads start
int threadSetSize = size/numThreads;
//Each thread will execute this code with a different threadID
for( i = threadID*threadSetSize; i < (threadID+1)*threadSetSize; i++)
{
    middleSum[threadID] += array[i];
}
//Only thread 0 will execute this code
if (threadID==0) {
    for(i = 0; i < numThreads; i++) {
        sum += middleSum[i];
    }
}
36. Synchronization
Sometimes we need to coordinate/organize threads
If we don't, the code might calculate the wrong answer to the problem
This can happen even if the load balance is perfect
Synchronization is concerned with this coordination/organization
37. Code with Synchronization Fixed
int numThreads = 2; //Assume one thread per core, & 2 cores
int sum = 0;
int i = 0;
int middleSum[numThreads]; //partial sums; must be zeroed before the threads start
int threadSetSize = size/numThreads;
//Each thread will execute this code with a different threadID
for( i = threadID*threadSetSize; i < (threadID+1)*threadSetSize; i++)
{
    middleSum[threadID] += array[i];
}
waitForAllThreads(); //Wait for all threads to reach this point
//Only thread 0 will execute this code
if (threadID==0) {
    for(i = 0; i < numThreads; i++) {
        sum += middleSum[i];
    }
}
38. Synchronization
The example shows a barrier
This is one type of synchronization
Barriers require all threads to reach that point in the code before any thread is allowed to continue
It is like a gate: all threads come to the gate, and then it opens
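For reference, POSIX threads provide a real barrier type; a minimal sketch (assuming pthreads; initBarrier is a hypothetical helper) of how a waitForAllThreads() function like the one above could be built on it:
#include <pthread.h>

static pthread_barrier_t barrier; //shared by all threads

//Call once, before the threads are started (e.g. in main)
void initBarrier(int numThreads) {
    pthread_barrier_init(&barrier, NULL, numThreads);
}

//Each thread calls this; no thread passes until all have arrived
void waitForAllThreads(void) {
    pthread_barrier_wait(&barrier);
}
pthread_barrier_wait() blocks each arriving thread until numThreads threads have called it, which is exactly the gate behaviour described above.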
39. Generalizing the Solution
We only looked at how to parallelize for 2 threads
But the code is more general
◦ It can use any number of threads
◦ It is important that the code is written this way
◦ We will look at this in more detail later (one remaining detail, the remainder when size is not divisible by numThreads, is sketched below)
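One detail the slide code glosses over (this fix is an assumption, not from the slides): if size is not divisible by numThreads, the loop drops the last few elements. A common fix is to let the last thread take the remainder:
int threadSetSize = size/numThreads;
int start = threadID*threadSetSize;
//Hypothetical fix: the last thread also sums the leftover elements
int end = (threadID == numThreads-1) ? size : start + threadSetSize;
for(i = start; i < end; i++) {
    middleSum[threadID] += array[i];
}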
41. Performance
Now the program is correct
Let’s look at performance
[Bar chart: Time on 2-core Processor, comparing 1 Thread, 2 Threads, and 4 Threads]
42. Performance
Two threads are not 2x faster. Why?
◦ The problem is called false sharing
◦ To understand it, we have to look at the computer architecture
◦ We will study this in the next lecture (a preview sketch follows below)
Four threads are slower than two threads. Why?
◦ The processor only has two cores
◦ Four threads add scheduling overhead and waste time
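As a preview of the false sharing fix (a sketch, assuming a 64-byte cache line; the next lecture covers the details): the middleSum entries sit next to each other in memory, so they share a cache line, and padding gives each thread's sum its own line:
#define CACHE_LINE 64 //assumed cache line size in bytes

struct paddedSum {
    int sum;
    char pad[CACHE_LINE - sizeof(int)]; //keep each sum on its own cache line
};

struct paddedSum middleSum[2]; //one padded entry per thread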
43. Summary
Used an example to start looking at how to parallelize code, and some of the main issues
◦ Data dependence
◦ Load balancing
◦ Synchronization
Each will be discussed in more detail in later lectures