Supercomputing Education - UNIST

Parallel Programming

CONTENTS
I.   Introduction to Parallel Computing
II.  Parallel Programming using OpenMP
III. Parallel Programming using MPI
I. Introduction to Parallel Computing
Parallel Processing (1/3)
Parallel processing divides a sequentially executed computation into several parts and runs them simultaneously on multiple processors.
Parallel Processing (2/3)
[Figure: the same Inputs → Outputs computation shown as serial execution on one processor versus parallel execution on several processors]
Parallel Processing (3/3)
Main goal: solve larger problems faster
• Reduce the wall-clock time of a program
• Increase the size of the problems that can be solved
Computing resources for parallel computing
• A single computer with multiple processors (CPUs)
• Multiple computers connected by a network
Why Parallel?
Limits on developing ever-faster single-processor systems
• Signal transmission speed (copper wire: ~9 cm/nanosecond)
• Limits of miniaturization
• Economic limits
Faster networks, distributed systems, and multiprocessor architectures → the parallel computing environment
• Bundling many relatively cheap processors and using them together yields the desired performance gain
Programs and Processes
A process is an executable program, stored as a file on secondary storage, that has been loaded and placed under the execution control of the operating system (kernel).
• Program: stored on secondary storage
• Process: a program being executed by the computer system
• Task = process
Processes
The unit of resource allocation for program execution; one program can run as several processes.
A single-processor system supporting multiple processes
• Wastes allocated resources and incurs context-switching overhead
• Context switch
  - At any instant, only one process runs on a given processor
  - The state of the current process is saved and the state of another process is loaded
The unit of work distribution in the distributed-memory parallel programming model
Threads
A thread is the execution part of a process, separated from its environment.
• Process = execution unit (threads) + execution environment (shared resources)
• A process can contain multiple threads
• Threads in the same process share the execution environment
A single-processor system supporting multiple threads
• More efficient resource allocation than multiple processes
• More efficient context switching than multiple processes
The unit of work distribution in the shared-memory parallel programming model
Processes and Threads
[Figure: three processes with one thread each vs. one process with three threads]
Types of Parallelism
Data parallelism
• Domain decomposition
• Each task performs the same series of computations on different data
Task parallelism
• Functional decomposition
• Each task performs different computations on the same or different data
Data Parallelism (1/3)
Data parallelism: domain decomposition
[Figure: the problem data set is divided among Task 1, Task 2, Task 3, and Task 4]
Data Parallelism (2/3)
Code example: matrix multiplication (OpenMP)

Serial Code
      DO K=1,N
        DO J=1,N
          DO I=1,N
            C(I,J) = C(I,J) + A(I,K)*B(K,J)
          END DO
        END DO
      END DO

Parallel Code
!$OMP PARALLEL DO
      DO K=1,N
        DO J=1,N
          DO I=1,N
            C(I,J) = C(I,J) + A(I,K)*B(K,J)
          END DO
        END DO
      END DO
!$OMP END PARALLEL DO
Data Parallelism (3/3)
Data decomposition (4 processors, K=1,20)

Process   Iterations of K   Data Elements
Proc0     K = 1:5           A(I,1:5),   B(1:5,J)
Proc1     K = 6:10          A(I,6:10),  B(6:10,J)
Proc2     K = 11:15         A(I,11:15), B(11:15,J)
Proc3     K = 16:20         A(I,16:20), B(16:20,J)
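For comparison, here is a minimal C sketch of the same data-parallel idea; it is not part of the original material, and it parallelizes over the rows of C so that each thread owns a disjoint slice of the output (N, A, B, C are assumed to come from the surrounding program):

/* Hedged sketch: data-parallel matrix multiply in C with OpenMP. */
#include <omp.h>

void matmul(int N, const double *A, const double *B, double *C)
{
    /* Parallelize over rows of C: each thread writes a disjoint set of rows,
       so no synchronization is needed inside the loop nest. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                C[i*N + j] += A[i*N + k] * B[k*N + j];
}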
Task Parallelism (1/3)
Task parallelism: functional decomposition
[Figure: the problem instruction set is divided among Task 1, Task 2, Task 3, and Task 4]
Task Parallelism (2/3)
Code example (OpenMP)

Serial Code
PROGRAM MAIN
…
CALL interpolate()
CALL compute_stats()
CALL gen_random_params()
…
END

Parallel Code
PROGRAM MAIN
…
!$OMP PARALLEL
!$OMP SECTIONS
CALL interpolate()
!$OMP SECTION
CALL compute_stats()
!$OMP SECTION
CALL gen_random_params()
!$OMP END SECTIONS
!$OMP END PARALLEL
…
END
Task Parallelism (3/3)
Task decomposition (run concurrently on 3 processors)

Process   Code
Proc0     CALL interpolate()
Proc1     CALL compute_stats()
Proc2     CALL gen_random_params()
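A comparable C sketch of the same functional decomposition with OpenMP sections; the three routine names simply mirror the Fortran calls above and are assumed to be independent routines defined elsewhere:

/* Hedged sketch: task parallelism with OpenMP sections in C. */
#include <omp.h>

void interpolate(void);
void compute_stats(void);
void gen_random_params(void);

void run_tasks(void)
{
    #pragma omp parallel sections
    {
        #pragma omp section
        interpolate();          /* task 1 */
        #pragma omp section
        compute_stats();        /* task 2 */
        #pragma omp section
        gen_random_params();    /* task 3 */
    }
}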
Parallel Architectures (1/2)
Processor organizations (Flynn's taxonomy)
• Single Instruction, Single Data stream (SISD) : uniprocessor
• Single Instruction, Multiple Data stream (SIMD) : vector processor, array processor
• Multiple Instruction, Single Data stream (MISD)
• Multiple Instruction, Multiple Data stream (MIMD)
  - Shared memory (tightly coupled) : symmetric multiprocessor (SMP), non-uniform memory access (NUMA)
  - Distributed memory (loosely coupled) : clusters
Parallel Architectures (2/2)
Recent high-performance systems support distributed-shared memory
• Software DSM (Distributed Shared Memory)
  - Message passing on shared-memory systems
  - Shared variables on distributed-memory systems
• Hardware DSM: distributed-shared memory architectures
  - Each node of a distributed-memory system is itself a shared-memory system
  - NUMA: appears to users as a single shared-memory architecture
    ex) Superdome(HP), Origin 3000(SGI)
  - SMP cluster: appears as a distributed system built from SMPs
    ex) SP(IBM), Beowulf Clusters
Parallel Programming Models
Shared-memory parallel programming model
• Suited to shared-memory architectures
• Multi-threaded programs
• OpenMP, Pthreads
Message-passing parallel programming model
• Suited to distributed-memory architectures
• MPI, PVM
Hybrid parallel programming model
• Distributed-shared memory architectures
• OpenMP + MPI
Shared-Memory Parallel Programming Model
[Figure: a single-threaded process executes S1, P1–P4, S2 in order; a multi-threaded process executes S1, forks threads that run P1–P4 concurrently in a shared address space, joins, then executes S2]
Message-Passing Parallel Programming Model
[Figure: the serial program S1, P1–P4, S2 is split across Processes 0–3 on Nodes 1–4; each process runs S1, one of P1–P4, and S2, exchanging data over the interconnect]
Hybrid Parallel Programming Model
[Figure: Processes 0 and 1 on Nodes 1 and 2 communicate by message passing; within each process, threads are forked to run P1–P2 and P3–P4 in a shared address space, then joined]
Message Passing on a DSM System
[Figure: Processes 0–3 each run S1, one of P1–P4, and S2; Processes 0–1 reside on Node 1 and Processes 2–3 on Node 2, communicating by message passing within and between the nodes]
SPMD and MPMD (1/4)
SPMD (Single Program Multiple Data)
• One program is executed simultaneously by several processes
• At any instant the processes execute instructions from the same program; the instructions themselves may be the same or different
MPMD (Multiple Program Multiple Data)
• An MPMD application consists of several executable programs
• When the application runs in parallel, each process may execute the same program as, or a different program from, the other processes
SPMD and MPMD (2/4)
SPMD
[Figure: the same executable a.out runs on Node 1, Node 2, and Node 3]
SPMD and MPMD (3/4)
MPMD: Master/Worker (Self-Scheduling)
[Figure: a.out runs on Node 1 while b.out runs on Node 2 and Node 3]
SPMD and MPMD (4/4)
MPMD: Coupled Analysis
[Figure: a.out, b.out, and c.out run on Node 1, Node 2, and Node 3 respectively]
• Performance measurement
• Factors that affect performance
• Steps for writing a parallel program
Measuring Program Execution Time (1/2)
time
Usage (bash, ksh): $ time [executable]

$ time mpirun –np 4 –machinefile machines ./exmpi.x
real 0m3.59s
user 0m3.16s
sys  0m0.04s

• real = wall-clock time
• user = CPU time spent in the program itself and the libraries it calls
• sys = CPU time spent in system calls made by the program
• user + sys = CPU time
Measuring Program Execution Time (2/2)
Usage (csh): $ time [executable]

$ time testprog
1.150u 0.020s 0:01.76 66.4% 15+3981k 24+10io 0pf+0w
 ①     ②      ③       ④     ⑤        ⑥       ⑦   ⑧

① user CPU time (1.15 s)
② system CPU time (0.02 s)
③ real time (0 min 1.76 s)
④ fraction of real time spent as CPU time (66.4%)
⑤ memory use: shared (15 Kbytes) + unshared (3981 Kbytes)
⑥ input (24 blocks) + output (10 blocks)
⑦ no page faults
⑧ no swaps
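The same wall-clock versus CPU-time distinction can be measured inside a program; a minimal C sketch using standard C and POSIX timer calls (the timed work is a placeholder):

/* Hedged sketch: measuring wall-clock time and CPU time inside a C program. */
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec w0, w1;
    clock_t c0, c1;

    clock_gettime(CLOCK_MONOTONIC, &w0);   /* wall-clock start */
    c0 = clock();                          /* CPU-time start   */

    /* ... work to be timed ... */

    c1 = clock();
    clock_gettime(CLOCK_MONOTONIC, &w1);

    double wall = (w1.tv_sec - w0.tv_sec) + (w1.tv_nsec - w0.tv_nsec) / 1e9;
    double cpu  = (double)(c1 - c0) / CLOCKS_PER_SEC;
    printf("wall-clock: %f s, CPU: %f s\n", wall, cpu);
    return 0;
}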
Performance Measurement
Quantitative analysis of the performance gain obtained by parallelization
Metrics
• Speed-up
• Efficiency
• Cost
Speed-up (1/7)
Speed-up: S(n)

S(n) = (execution time of the serial program) / (execution time of the parallel program on n processors) = ts / tp

• Measures the performance gain of the parallel program over the serial program
• Execution time = wall-clock time
• If a serial program that takes 100 seconds is parallelized and runs in 50 seconds on 10 processors:
  S(10) = 100 / 50 = 2
Speed-up (2/7)
Ideal speed-up: Amdahl's Law
• f : serial fraction of the code (0 ≤ f ≤ 1)
• tp = f·ts + (1-f)·ts/n
  (serial-part execution time + parallel-part execution time)
Speed-up (3/7)
[Figure: the serial time ts splits into a serial section f·ts and parallelizable sections (1-f)·ts; with n processes the parallelizable work shrinks to (1-f)·ts/n, giving tp = f·ts + (1-f)·ts/n]
Speed-up (4/7)
S(n) = ts / tp = ts / (f·ts + (1-f)·ts/n) = 1 / (f + (1-f)/n)

Maximum speed-up (n → ∞):
S(n) → 1 / f

• As the number of processors grows, the speed-up converges to the reciprocal of the serial fraction
Speed-up (5/7)
f = 0.2, n = 4
[Figure: the serial run spends 20% of the time in the part that cannot be parallelized and 80% in the part that can; in the parallel run the 80% is split across processes 1–4 while the 20% remains serial]

S(4) = 1 / (0.2 + (1-0.2)/4) = 2.5
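A tiny C helper, added here only as a sketch, makes it easy to reproduce these numbers; the serial fraction f and processor count n are the only inputs:

/* Hedged sketch: Amdahl's Law speed-up S(n) = 1 / (f + (1-f)/n). */
#include <stdio.h>

static double amdahl_speedup(double f, int n)
{
    return 1.0 / (f + (1.0 - f) / n);
}

int main(void)
{
    printf("f=0.2, n=4   -> S = %.2f\n", amdahl_speedup(0.2, 4));    /* 2.50  */
    printf("f=0.2, n=100 -> S = %.2f\n", amdahl_speedup(0.2, 100));  /* ~4.81 */
    return 0;
}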
Speed-up (6/7)
Speed-up vs. number of processors
[Figure: speed-up curves for f = 0, 0.05, 0.1, 0.2 as the number of processors n grows from 0 to 24; larger serial fractions flatten out at lower speed-ups]
Speed-up (7/7)
Speed-up vs. serial fraction
[Figure: speed-up as a function of the serial fraction f for n = 16 and n = 256; both curves drop sharply as f grows from 0 toward 1]
Efficiency
Efficiency: E(n)

E(n) = ts / (tp × n) = S(n) / n

• Expresses how efficiently the parallel program uses its processors
• 2× speed-up on 10 processors: S(10) = 2 → E(10) = 20 %
• 10× speed-up on 100 processors: S(100) = 10 → E(100) = 10 %
Cost
Cost = execution time × number of processors
• Serial program: Cost = ts
• Parallel program: Cost = tp × n = ts·n / S(n) = ts / E(n)

Example: 2× speed-up on 10 processors, 10× speed-up on 100 processors

ts    tp   n     S(n)   E(n)   Cost
100   50   10    2      0.2    500
100   10   100   10     0.1    1000
Practical Considerations for Real Speed-up
Actual speed-up is limited by communication overhead and load imbalance
[Figure: the 20%/80% example again; in the parallel run, each process's share of the parallelizable 80% is inflated by communication overhead and the processes finish at different times (load imbalance)]
Ways to Increase Performance
1. Increase the parallelizable fraction (coverage) of the program → improve the algorithm
2. Distribute the workload evenly → load balancing
3. Reduce the time spent in communication → lower communication overhead
Factors that Affect Performance
• Coverage: Amdahl's Law
• Load balancing
• Synchronization
• Communication overhead
• Granularity
• I/O
Load Balancing
Distributing work so that the working time of all processes is as even as possible, minimizing waiting time
• Choose the data distribution scheme (block, cyclic, block-cyclic) carefully
• Very important when heterogeneous systems are connected
• Can also be achieved through dynamic work assignment (see the sketch below)
[Figure: timelines of task0–task3; unevenly distributed WORK leaves some tasks in WAIT until the slowest one finishes]
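A minimal OpenMP sketch of dynamic work assignment: with schedule(dynamic), threads grab chunks of iterations as they finish, which evens out per-iteration cost differences. The work() routine and its uneven cost are assumptions used only for illustration.

/* Hedged sketch: dynamic load balancing of an uneven loop with OpenMP. */
#include <stdio.h>
#include <unistd.h>
#include <omp.h>

static void work(int i)
{
    usleep((i % 7) * 1000);   /* stand-in for work whose cost varies by iteration */
}

int main(void)
{
    /* chunks of 4 iterations are handed out to threads on demand */
    #pragma omp parallel for schedule(dynamic, 4)
    for (int i = 0; i < 1000; i++)
        work(i);
    printf("done\n");
    return 0;
}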
Synchronization
Coordination that brings parallel tasks to a common state or point of agreement
• A major source of parallel overhead: bad for performance
• Implemented with barriers, locks, semaphores, synchronous communication operations, etc.

Parallel overhead
• Overhead from starting, terminating, and coordinating parallel tasks
  - Start: task identification, processor assignment, loading tasks and data, etc.
  - Termination: collecting and sending results, releasing operating-system resources, etc.
  - Coordination: synchronization, communication, etc.
Communication Overhead (1/4)
Overhead caused by data communication
• The network has its own latency and bandwidth
• Especially important for message passing
Factors that affect communication overhead
• Synchronous or asynchronous communication?
• Blocking or non-blocking?
• Point-to-point or collective communication?
• Number of transfers and size of the data transferred
Communication Overhead (2/4)
Communication time = latency + (message size / bandwidth)

• Latency: time for the first bit of the message to arrive
  - send latency + receive latency + propagation latency
• Bandwidth: amount of data that can be communicated per unit time (MB/sec)

Effective bandwidth = message size / communication time = bandwidth / (1 + latency × bandwidth / message size)
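Plugging in the numbers used on the next slide (latency 22 µs, bandwidth 133 MB/sec, both taken from the figure and used here as assumptions) shows how small messages waste most of the raw bandwidth:

/* Hedged sketch: communication-time and effective-bandwidth model. */
#include <stdio.h>

int main(void)
{
    const double latency   = 22e-6;      /* 22 microseconds (assumed) */
    const double bandwidth = 133e6;      /* 133 MB/sec (assumed)      */
    const double sizes[] = {1e2, 1e4, 1e6};

    for (int i = 0; i < 3; i++) {
        double t   = latency + sizes[i] / bandwidth;   /* communication time  */
        double eff = sizes[i] / t;                     /* effective bandwidth */
        printf("%8.0f bytes: time = %.6f s, effective BW = %.1f MB/s\n",
               sizes[i], t, eff / 1e6);
    }
    return 0;
}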
Communication Overhead (3/4)
[Figure: communication time vs. message size; the intercept is the latency and the inverse slope is the bandwidth]
Communication Overhead (4/4)
[Figure: effective bandwidth (MB/sec) vs. message size (bytes) for latency = 22 µs and bandwidth = 133 MB/sec; the effective bandwidth approaches the network bandwidth only for messages of roughly 100 KB and larger]
Granularity (1/2)
The ratio of computation time to communication time in a parallel program
• Fine-grained parallelism
  - Relatively little computation between communication/synchronization events
  - Favorable for load balancing
• Coarse-grained parallelism
  - Relatively much computation between communication/synchronization events
  - Unfavorable for load balancing
In general, coarse-grained parallelism is better for performance
• With fine granularity, computation time can fall below communication/synchronization time
• The best choice can still depend on the algorithm and the hardware environment
Granularity (2/2)
[Figure: timelines contrasting (a) fine-grained execution, where short bursts of computation alternate with frequent communication, and (b) coarse-grained execution, where long computation phases are separated by occasional communication]
I/O
I/O generally works against parallelism
• Writes: overwriting problems when processes share the same file space
• Reads: performance of the file server handling many simultaneous read requests
• I/O over the network (NFS, non-local) becomes a bottleneck
Reduce I/O as much as possible
• Confine I/O to specific serial regions
• Perform I/O in local file space
• Parallel file systems are being developed (GPFS, PVFS, PPFS, …)
• Parallel I/O programming interfaces are being developed (MPI-2: MPI I/O)
Scalability (1/2)
The ability to keep gaining performance as the environment is scaled up
• Hardware scalability
• Algorithmic scalability
Main hardware factors affecting scalability
• CPU-memory bus bandwidth
• Network bandwidth
• Memory capacity
• Processor clock speed
Scalability (2/2)
[Figure: speed-up vs. number of workers]
Dependency and Deadlock
Data dependency: the execution order of a program affects its result

DO k = 1, 100
  F(k+2) = F(k+1) + F(k)
ENDDO

Deadlock: two or more processes each wait for an event that only the other can produce

Process 1                 Process 2
X = 4                     Y = 8
SOURCE = TASK2            SOURCE = TASK1
RECEIVE (SOURCE,Y)        RECEIVE (SOURCE,X)
DEST = TASK2              DEST = TASK1
SEND (DEST,X)             SEND (DEST,Y)
Z = X + Y                 Z = X + Y
Dependency
DO k = 1, 100
  F(k+2) = F(k+1) + F(k)
ENDDO

[Figure: with F(1)=1 and F(2)=2, serial execution produces F = 1, 2, 3, 5, 8, 13, 21, …; if the loop iterations run in parallel and the dependency is violated, later elements are computed from stale values, e.g. F = 1, 2, 3, 5(4), 7, 11, 18, … — a different, incorrect result]
Steps for Writing a Parallel Program
① Write, analyze (profile), and optimize the serial code
   • Identify hotspots, bottlenecks, and data dependencies
   • Data parallelism or task parallelism?
② Develop the parallel code
   • MPI / OpenMP / … ?
   • Add code for task assignment and control, communication, and synchronization
③ Compile, run, debug
④ Optimize the parallel code
   • Improve performance through measurement and analysis
Debugging and Performance Analysis
Debugging
• Take a modular approach when writing code
• Watch out for communication, synchronization, data dependencies, and deadlock
• Debugger: TotalView
Performance measurement and analysis
• Use timer functions
• Profilers: prof, gprof, pgprof, TAU
Coffee break
II. Parallel Programming using OpenMP
What is OpenMP?
An application programming interface (API) for writing multi-threaded parallel programs in a shared-memory environment
History of OpenMP
1990s:
• High-performance shared-memory systems advance
• Each vendor uses its own directive set → a standard is needed
1994: ANSI X3H5 → 1996: openmp.org founded
1997: OpenMP API announced
Release history
• OpenMP Fortran API 1.0 : October 1997
• C/C++ API 1.0 : October 1998
• Fortran API 1.1 : November 1999
• Fortran API 2.0 : November 2000
• C/C++ API 2.0 : March 2002
• Combined C/C++ and Fortran API 2.5 : May 2005
• API 3.0 : May 2008
Goals of OpenMP
Standardization and portability
• The standard for shared-memory parallel programming
• OpenMP compilers exist for most Unix systems and Windows
• Supports Fortran and C/C++
Components of OpenMP (1/2)
• Directives
• Runtime library
• Environment variables
Components of OpenMP (2/2)
Compiler directives
• Handle work sharing, communication, and synchronization between threads
• OpenMP in the narrow sense
  ex) C$OMP PARALLEL DO
Runtime library
• Set and query parallel parameters (number of participating threads, thread IDs, etc.)
  ex) CALL omp_set_num_threads(128)
Environment variables
• Define parallel parameters of the executing system (number of threads, etc.)
  ex) export OMP_NUM_THREADS=8
The OpenMP Programming Model (1/4)
Based on compiler directives
• Insert directives at appropriate places in the serial code
• The compiler uses the directives to generate multi-threaded code
• Requires a compiler that supports OpenMP
• The programmer still has to handle synchronization, remove dependencies, etc.
The OpenMP Programming Model (2/4)
Fork-Join
• Multiple threads are created where parallelism is needed
• After the parallel computation, execution returns to sequential form
[Figure: the master thread forks a team of threads at each parallel region and joins them at its end, alternating serial regions and parallel regions]
The OpenMP Programming Model (3/4)
Inserting compiler directives

Serial Code
PROGRAM exam
…
ialpha = 2
DO i = 1, 100
  a(i) = a(i) + ialpha*b(i)
ENDDO
PRINT *, a
END

Parallel Code
PROGRAM exam
…
ialpha = 2
!$OMP PARALLEL DO
DO i = 1, 100
  a(i) = a(i) + ialpha*b(i)
ENDDO
!$OMP END PARALLEL DO
PRINT *, a
END
The OpenMP Programming Model (4/4)
Fork-Join   (※ export OMP_NUM_THREADS=4)
[Figure: the master thread executes ialpha = 2, forks four threads that run DO i=1,25 / DO i=26,50 / DO i=51,75 / DO i=76,100 respectively (one master, three slaves), joins, and the master thread executes PRINT *, a]
Strengths and Weaknesses of OpenMP
Strengths
• Easier to code and debug than MPI
• Data distribution is simple
• Incremental parallelization is possible
• One source can be compiled as either serial or parallel code
• Relatively small code size
Weaknesses
• Only works on shared-memory multiprocessor architectures
• Requires a compiler that supports OpenMP
• Relies heavily on loops → parallel efficiency can be low
• Limited by the scalability (processor count, memory, etc.) of shared-memory architectures
Typical Use of OpenMP
Parallelizing loops using data parallelism
1. Find the time-consuming loops (profiling)
2. Check dependencies and the scope of the data
3. Parallelize by inserting directives
Parallelization using task parallelism is also possible
Directives (1/5)
OpenMP directive syntax

Fortran (fixed form: f77)
• Directive sentinel: !$OMP <directive>, C$OMP <directive>, *$OMP <directive>
• Continuation line: !$OMP <directive> continued with !$OMP& …
• Conditional compilation: !$ …, C$ …, *$ …
• Must start in column 1

Fortran (free form: f90)
• Directive sentinel: !$OMP <directive>
• Continuation line: !$OMP <directive> &
• Conditional compilation: !$ …
• Starting column: anywhere

C
• Directive sentinel: #pragma omp <directive>
• Continuation line: #pragma omp … \
• Conditional compilation: #ifdef _OPENMP
• Starting column: anywhere
Directives (2/5)
Parallel region directive
• PARALLEL / END PARALLEL
• Marks a block of code as a parallel region
• The region is executed simultaneously by several threads
Work-sharing directive
• DO / FOR
• Used inside a parallel region
• Distributes loop iterations among the threads based on the loop index
Combined parallel work-sharing directive
• PARALLEL DO / FOR
• Acts as PARALLEL plus DO/FOR
Directives (3/5)
Specifying a parallel region

Fortran
!$OMP PARALLEL
DO i = 1, 10
  PRINT *, 'Hello World', i
ENDDO
!$OMP END PARALLEL

C
#pragma omp parallel
for(i=1; i<=10; i++)
  printf("Hello World %d\n", i);
Directives (4/5)
Parallel region and work sharing

Fortran
!$OMP PARALLEL
!$OMP DO
DO i = 1, 10
  PRINT *, 'Hello World', i
ENDDO
[!$OMP END DO]
!$OMP END PARALLEL

C
#pragma omp parallel
{
  #pragma omp for
  for(i=1; i<=10; i++)
    printf("Hello World %d\n", i);
}
Directives (5/5)
Parallel region and work sharing

Fortran
!$OMP PARALLEL
!$OMP DO
DO i = 1, n
  a(i) = b(i) + c(i)
ENDDO
[!$OMP END DO]      ← optional
!$OMP DO
…
[!$OMP END DO]
!$OMP END PARALLEL

C
#pragma omp parallel
{
  #pragma omp for
  for (i=1; i<=n; i++) {
    a[i] = b[i] + c[i];
  }
  #pragma omp for
  for(…){
    …
  }
}
Runtime Library and Environment Variables (1/3)
Runtime library
• omp_set_num_threads(integer) : set the number of threads
• omp_get_num_threads() : return the number of threads
• omp_get_thread_num() : return the thread ID
Environment variable
• OMP_NUM_THREADS : maximum number of threads available
  - export OMP_NUM_THREADS=16 (ksh)
  - setenv OMP_NUM_THREADS 16 (csh)
C : #include <omp.h>
Runtime Library and Environment Variables (3/3)
omp_set_num_threads, omp_get_thread_num

Fortran
INTEGER OMP_GET_THREAD_NUM
CALL OMP_SET_NUM_THREADS(4)
!$OMP PARALLEL
PRINT*, 'Thread rank: ', OMP_GET_THREAD_NUM()
!$OMP END PARALLEL

C
#include <omp.h>
omp_set_num_threads(4);
#pragma omp parallel
{
  printf("Thread rank: %d\n", omp_get_thread_num());
}
Main Clauses
• private(var1, var2, …)
• shared(var1, var2, …)
• default(shared|private|none)
• firstprivate(var1, var2, …)
• lastprivate(var1, var2, …)
• reduction(operator|intrinsic : var1, var2, …)
• schedule(type [,chunk])
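A short C sketch showing how a few of these clauses are typically combined on one loop; the loop body is illustrative only and not part of the original slides:

/* Hedged sketch: common data-scoping clauses on an OpenMP loop. */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    int n = 8;
    int scale = 3;                 /* read-only: shared               */
    int tmp = 0;                   /* per-thread scratch: private     */
    int a[8];

    #pragma omp parallel for default(none) shared(a, scale) private(tmp) firstprivate(n)
    for (int i = 0; i < n; i++) {
        tmp = i * scale;           /* each thread uses its own tmp */
        a[i] = tmp;
    }

    for (int i = 0; i < n; i++)
        printf("a[%d] = %d\n", i, a[i]);
    return 0;
}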
clause : reduction (1/4)
reduction(operator|intrinsic : var1, var2, …)
• A reduction variable must be shared
  - Arrays are allowed (Fortran only): deferred-shape and assumed-shape arrays cannot be used
  - In C, only scalar variables are allowed
• A private copy is created for each thread and initialized according to the operator (see the tables below), and the parallel computation runs on the copies
• The partial results computed by the threads are combined (reduced) into the final result held by the master thread
clause : reduction (2/4)
!$OMP DO reduction(+:sum)
DO i = 1, 100
  sum = sum + x(i)
ENDDO

Thread 0                    Thread 1
sum0 = 0                    sum1 = 0
DO i = 1, 50                DO i = 51, 100
  sum0 = sum0 + x(i)          sum1 = sum1 + x(i)
ENDDO                       ENDDO

            sum = sum0 + sum1
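The same pattern in C, as a complete hedged example (the array contents are arbitrary):

/* Hedged sketch: OpenMP reduction in C — each thread accumulates a private
   partial sum, and the partial sums are combined at the end of the loop. */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    double x[100], sum = 0.0;

    for (int i = 0; i < 100; i++)
        x[i] = i + 1;                       /* 1, 2, ..., 100 */

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 100; i++)
        sum += x[i];

    printf("sum = %.1f\n", sum);            /* 5050.0 */
    return 0;
}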
clause : reduction (3/4)
Reduction operators: Fortran

Operator   Data Types                                   Initial Value
+          integer, floating point (complex or real)    0
*          integer, floating point (complex or real)    1
-          integer, floating point (complex or real)    0
.AND.      logical                                      .TRUE.
.OR.       logical                                      .FALSE.
.EQV.      logical                                      .TRUE.
.NEQV.     logical                                      .FALSE.
MAX        integer, floating point (real only)          smallest representable value
MIN        integer, floating point (real only)          largest representable value
IAND       integer                                      all bits on
IOR        integer                                      0
IEOR       integer                                      0
clause : reduction (4/4)
Reduction operators: C

Operator   Data Types                Initial Value
+          integer, floating point   0
*          integer, floating point   1
-          integer, floating point   0
&          integer                   all bits on
|          integer                   0
^          integer                   0
&&         integer                   1
||         integer                   0
Coffee break
III. Parallel Programming using MPI
Current HPC Platforms : COTS-Based Clusters
COTS = Commercial off-the-shelf
[Figure: login node(s) with access control and file server(s) connected to many compute nodes (e.g. Nehalem, Gulftown)]
Memory Architectures
Shared Memory
• Single address space for all processors
[Figure: <UMA> and <NUMA> shared-memory organizations]
Distributed Memory
[Figure: distributed-memory organization]
What is MPI?
MPI = Message Passing Interface
MPI is a specification for the developers and users of message passing libraries. By itself, it is NOT a library – but rather the specification of what such a library should be.
MPI primarily addresses the message-passing parallel programming model: data is moved from the address space of one process to that of another process through cooperative operations on each process.
Simply stated, the goal of the Message Passing Interface is to provide a widely used standard for writing message passing programs. The interface attempts to be:
• Portable
• Efficient
• Practical
• Flexible
What is MPI?
The MPI standard has gone through a number of revisions, with the most recent version being MPI-3.
Interface specifications have been defined for C and Fortran90 language bindings:
• C++ bindings from MPI-1 are removed in MPI-3
• MPI-3 also provides support for Fortran 2003 and 2008 features
Actual MPI library implementations differ in which version and features of the MPI standard they support. Developers/users will need to be aware of this.
Programming Model
Originally, MPI was designed for distributed memory architectures, which were becoming increasingly popular at the time (1980s – early 1990s).
As architecture trends changed, shared memory SMPs were combined over networks, creating hybrid distributed memory/shared memory systems.
Programming Model
MPI implementers adapted their libraries to handle both types of underlying memory architectures seamlessly. They also adapted/developed ways of handling different interconnects and protocols.
Today, MPI runs on virtually any hardware platform:
• Distributed Memory
• Shared Memory
• Hybrid
The programming model clearly remains a distributed memory model however, regardless of the underlying physical architecture of the machine.
Reasons for Using MPI
Standardization
• MPI is the only message passing library which can be considered a standard. It is supported on virtually all HPC platforms. Practically, it has replaced all previous message passing libraries.
Portability
• There is little or no need to modify your source code when you port your application to a different platform that supports (and is compliant with) the MPI standard.
Performance Opportunities
• Vendor implementations should be able to exploit native hardware features to optimize performance.
Functionality
• There are over 440 routines defined in MPI-3, which includes the majority of those in MPI-2 and MPI-1.
Availability
• A variety of implementations are available, both vendor and public domain.
History and Evolution
MPI has resulted from the efforts of numerous individuals and groups that began in 1992.
1980s – early 1990s: Distributed memory parallel computing develops, as do a number of incompatible software tools for writing such programs – usually with tradeoffs between portability, performance, functionality and price. Recognition of the need for a standard arose.
Apr 1992: Workshop on Standards for Message Passing in a Distributed Memory Environment, sponsored by the Center for Research on Parallel Computing, Williamsburg, Virginia. The basic features essential to a standard message passing interface were discussed, and a working group established to continue the standardization process. Preliminary draft proposal developed subsequently.
History and Evolution
Nov 1992: Working group meets in Minneapolis. MPI draft proposal (MPI1) from ORNL presented. Group adopts procedures and organization to form the MPI Forum. It eventually comprised about 175 individuals from 40 organizations including parallel computer vendors, software writers, academia and application scientists.
Nov 1993: Supercomputing 93 conference – draft MPI standard presented.
May 1994: Final version of MPI-1.0 released.
MPI-1.0 was followed by versions MPI-1.1 (Jun 1995), MPI-1.2 (Jul 1997) and MPI-1.3 (May 2008).
MPI-2 picked up where the first MPI specification left off, and addressed topics which went far beyond the MPI-1 specification. It was finalized in 1996.
MPI-2.1 (Sep 2008) and MPI-2.2 (Sep 2009) followed.
Sep 2012: The MPI-3.0 standard was approved.
History and Evolution
Documentation for all versions of the MPI standard is available at:
• http://www.mpi-forum.org/docs/
A General Structure of the MPI Program
[Figure: general structure of an MPI program]
A Header File for MPI routines
Required for all programs that make MPI library calls.

C include file              Fortran include file
#include "mpi.h"            include 'mpif.h'

With MPI-3 Fortran, the USE mpi_f08 module is preferred over using the include file shown above.
The Format of MPI Calls
C names are case sensitive; Fortran names are not.
Programs must not declare variables or functions with names beginning with the prefix MPI_ or PMPI_ (profiling interface).

C Binding
• Format     : rc = MPI_Xxxxx(parameter, …)
• Example    : rc = MPI_Bsend(&buf, count, type, dest, tag, comm)
• Error code : returned as "rc"; MPI_SUCCESS if successful

Fortran Binding
• Format     : CALL MPI_XXXXX(parameter, …, ierr)
               call mpi_xxxxx(parameter, …, ierr)
• Example    : call MPI_BSEND(buf, count, type, dest, tag, comm, ierr)
• Error code : returned as the "ierr" parameter; MPI_SUCCESS if successful
Communicators and Groups
MPI uses objects called communicators and groups to define which collection of processes may communicate with each other.
Most MPI routines require you to specify a communicator as an argument.
Communicators and groups will be covered in more detail later. For now, simply use MPI_COMM_WORLD whenever a communicator is required – it is the predefined communicator that includes all of your MPI processes.
Rank
Within a communicator, every process has its own unique, integer identifier assigned by the system when the process initializes. A rank is sometimes also called a "task ID". Ranks are contiguous and begin at zero.
Used by the programmer to specify the source and destination of messages. Often used conditionally by the application to control program execution (if rank = 0 do this / if rank = 1 do that).
Error Handling
Most MPI routines include a return/error code parameter, as described in the "Format of MPI Calls" section above.
However, according to the MPI standard, the default behavior of an MPI call is to abort if there is an error. This means you will probably not be able to capture a return/error code other than MPI_SUCCESS (zero).
The standard does provide a means to override this default error handler. You can also consult the error handling section of the MPI Standard located at http://www.mpi-forum.org/docs/mpi-11-html/node148.html .
The types of errors displayed to the user are implementation dependent.
Environment Management Routines
MPI_Init
• Initializes the MPI execution environment. This function must be called in every MPI program, must be called before any other MPI functions, and must be called only once in an MPI program. For C programs, MPI_Init may be used to pass the command line arguments to all processes, although this is not required by the standard and is implementation dependent.

C                           Fortran
MPI_Init(&argc, &argv)      MPI_INIT(ierr)

• Input parameters
  - argc : pointer to the number of arguments
  - argv : pointer to the argument vector
• ierr : the error return argument
Environment Management Routines
MPI_Comm_size
• Returns the total number of MPI processes in the specified communicator, such as MPI_COMM_WORLD. If the communicator is MPI_COMM_WORLD, then it represents the number of MPI tasks available to your application.

C                                  Fortran
MPI_Comm_size(comm, &size)         MPI_COMM_SIZE(comm, size, ierr)

• Input parameters
  - comm : communicator (handle)
• Output parameters
  - size : number of processes in the group of comm (integer)
• ierr : the error return argument
Environment Management Routines
MPI_Comm_rank
• Returns the rank of the calling MPI process within the specified communicator. Initially, each process will be assigned a unique integer rank between 0 and (number of tasks - 1) within the communicator MPI_COMM_WORLD. This rank is often referred to as a task ID. If a process becomes associated with other communicators, it will have a unique rank within each of these as well.

C                                  Fortran
MPI_Comm_rank(comm, &rank)         MPI_COMM_RANK(comm, rank, ierr)

• Input parameters
  - comm : communicator (handle)
• Output parameters
  - rank : rank of the calling process in the group of comm (integer)
• ierr : the error return argument
Environment Management Routines
MPI_Finalize
• Terminates the MPI execution environment. This function should be the last MPI routine called in every MPI program – no other MPI routines may be called after it.

C                  Fortran
MPI_Finalize()     MPI_FINALIZE(ierr)

• ierr : the error return argument
Environment Management Routines
MPI_Abort
• Terminates all MPI processes associated with the communicator. In most MPI implementations it terminates ALL processes regardless of the communicator specified.

C                               Fortran
MPI_Abort(comm, errorcode)      MPI_ABORT(comm, errorcode, ierr)

• Input parameters
  - comm : communicator (handle)
  - errorcode : error code to return to invoking environment
• ierr : the error return argument
Environment Management Routines
MPI_Get_processor_name
• Returns the processor name. Also returns the length of the name. The buffer for "name" must be at least MPI_MAX_PROCESSOR_NAME characters in size. What is returned into "name" is implementation dependent – it may not be the same as the output of the "hostname" or "host" shell commands.

C                                              Fortran
MPI_Get_processor_name(&name, &resultlength)   MPI_GET_PROCESSOR_NAME(name, resultlength, ierr)

• Output parameters
  - name : a unique specifier for the actual (as opposed to virtual) node. This must be an array of size at least MPI_MAX_PROCESSOR_NAME.
  - resultlen : length (in characters) of the name
• ierr : the error return argument
Environment Management Routines
MPI_Get_version
• Returns the version (either 1 or 2) and subversion of MPI.

C                                         Fortran
MPI_Get_version(&version, &subversion)    MPI_GET_VERSION(version, subversion, ierr)

• Output parameters
  - version : major version of MPI (1 or 2)
  - subversion : minor version of MPI
• ierr : the error return argument
Environment Management Routines
MPI_Initialized
• Indicates whether MPI_Init has been called – returns flag as either logical true (1) or false (0).

C                          Fortran
MPI_Initialized(&flag)     MPI_INITIALIZED(flag, ierr)

• Output parameters
  - flag : true if MPI_Init has been called and false otherwise
• ierr : the error return argument
Environment Management Routines
MPI_Wtime
• Returns an elapsed wall clock time in seconds (double precision) on the calling processor.

C                Fortran
MPI_Wtime()      MPI_WTIME()

• Return value: time in seconds since an arbitrary time in the past

MPI_Wtick
• Returns the resolution in seconds (double precision) of MPI_Wtime.

C                Fortran
MPI_Wtick()      MPI_WTICK()

• Return value: resolution of MPI_Wtime, in seconds
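A minimal sketch of timing a code region with MPI_Wtime; the work being timed is a placeholder:

/* Hedged sketch: timing a region of an MPI program with MPI_Wtime. */
#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    double t0 = MPI_Wtime();

    /* ... work to be timed ... */

    double t1 = MPI_Wtime();
    printf("elapsed = %f s (resolution %g s)\n", t1 - t0, MPI_Wtick());

    MPI_Finalize();
    return 0;
}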
Example: Hello world
#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
  int rc;
  rc = MPI_Init(&argc, &argv);
  printf("Hello world.\n");
  rc = MPI_Finalize();
  return 0;
}
Example: Hello world
Execute an MPI program:
$ module load [compiler] [mpi]
$ mpicc hello.c
$ mpirun –np 4 –hostfile [hostfile] ./a.out

Make a hostfile:
ibs0001 slots=2
ibs0002 slots=2
ibs0003 slots=2
ibs0003 slots=2
…
Example : Environment Management Routines
#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
  int numtasks, rank, len, rc;
  char hostname[MPI_MAX_PROCESSOR_NAME];

  rc = MPI_Init(&argc, &argv);
  if (rc != MPI_SUCCESS) {
    printf("Error starting MPI program. Terminating.\n");
    MPI_Abort(MPI_COMM_WORLD, rc);
  }

  MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Get_processor_name(hostname, &len);
  printf("Number of tasks= %d My rank= %d Running on %s\n", numtasks, rank, hostname);

  /******* do some work *******/

  rc = MPI_Finalize();
  return 0;
}
Types of Point-to-Point Operations
MPI point-to-point operations typically involve message passing between two, and only two, different MPI tasks. One task is performing a send operation and the other task is performing a matching receive operation.
There are different types of send and receive routines used for different purposes:
• Synchronous send
• Blocking send / blocking receive
• Non-blocking send / non-blocking receive
• Buffered send
• Combined send/receive
• "Ready" send
Any type of send routine can be paired with any type of receive routine.
MPI also provides several routines associated with send – receive operations, such as those used to wait for a message's arrival or probe to find out if a message has arrived.
Buffering
In a perfect world, every send operation would be perfectly synchronized with its matching receive. This is rarely the case. Somehow or other, the MPI implementation must be able to deal with storing data when the two tasks are out of sync.
Consider the following two cases:
• A send operation occurs 5 seconds before the receive is ready – where is the message while the receive is pending?
• Multiple sends arrive at the same receiving task which can only accept one send at a time – what happens to the messages that are "backing up"?
Buffering
The MPI implementation (not the MPI standard) decides what happens to data in these types of cases. Typically, a system buffer area is reserved to hold data in transit.
[Figure: path of a message between the sender's and receiver's application buffers via system buffer space]
Buffering
System buffer space is:
• Opaque to the programmer and managed entirely by the MPI library
• A finite resource that can be easy to exhaust
• Often mysterious and not well documented
• Able to exist on the sending side, the receiving side, or both
• Something that may improve program performance because it allows send – receive operations to be asynchronous
Blocking vs. Non-blocking
Most of the MPI point-to-point routines can be used in either blocking or non-blocking mode.
Blocking
• A blocking send routine will only "return" after it is safe to modify the application buffer (your send data) for reuse. Safe means that modifications will not affect the data intended for the receive task. Safe does not imply that the data was actually received – it may very well be sitting in a system buffer.
• A blocking send can be synchronous, which means there is handshaking occurring with the receive task to confirm a safe send.
• A blocking send can be asynchronous if a system buffer is used to hold the data for eventual delivery to the receive.
• A blocking receive only "returns" after the data has arrived and is ready for use by the program.
Non-blocking
• Non-blocking send and receive routines behave similarly – they will return almost immediately. They do not wait for any communication events to complete, such as message copying from user memory to system buffer space or the actual arrival of the message.
• Non-blocking operations simply "request" the MPI library to perform the operation when it is able. The user cannot predict when that will happen.
• It is unsafe to modify the application buffer (your variable space) until you know for a fact that the requested non-blocking operation was actually performed by the library. There are "wait" routines used to do this.
• Non-blocking communications are primarily used to overlap computation with communication and exploit possible performance gains.
MPI Message Passing Routine Arguments
MPI point-to-point communication routines generally have an argument list that takes one of the following formats:

Blocking send          MPI_Send(buffer, count, type, dest, tag, comm)
Non-blocking send      MPI_Isend(buffer, count, type, dest, tag, comm, request)
Blocking receive       MPI_Recv(buffer, count, type, source, tag, comm, status)
Non-blocking receive   MPI_Irecv(buffer, count, type, source, tag, comm, request)

Buffer
• Program (application) address space that references the data that is to be sent or received. In most cases, this is simply the variable name that is to be sent/received. For C programs, this argument is passed by reference and usually must be prepended with an ampersand: &var1
Data count
• Indicates the number of data elements of a particular type to be sent.
MPI Message Passing Routine Arguments
Data type
• For reasons of portability, MPI predefines its elementary data types. The table below lists those required by the standard.

C Data Types
MPI_CHAR              signed char
MPI_SHORT             signed short int
MPI_INT               signed int
MPI_LONG              signed long int
MPI_SIGNED_CHAR       signed char
MPI_UNSIGNED_CHAR     unsigned char
MPI_UNSIGNED_SHORT    unsigned short int
MPI_UNSIGNED          unsigned int
MPI_UNSIGNED_LONG     unsigned long int
MPI_FLOAT             float
MPI_DOUBLE            double
MPI_LONG_DOUBLE       long double
MPI Message Passing Routine Arguments
Destination
• An argument to send routines that indicates the process where a message should be delivered. Specified as the rank of the receiving process.
Tag
• Arbitrary non-negative integer assigned by the programmer to uniquely identify a message. Send and receive operations should match message tags. For a receive operation, the wild card MPI_ANY_TAG can be used to receive any message regardless of its tag. The MPI standard guarantees that integers 0 – 32767 can be used as tags, but most implementations allow a much larger range than this.
Communicator
• Indicates the communication context, or set of processes for which the source or destination fields are valid. Unless the programmer is explicitly creating new communicators, the predefined communicator MPI_COMM_WORLD is usually used.
MPI Message Passing Routine Arguments
Status
• For a receive operation, indicates the source of the message and the tag of the message.
• In C, this argument is a pointer to the predefined structure MPI_Status (ex. stat.MPI_SOURCE, stat.MPI_TAG).
• In Fortran, it is an integer array of size MPI_STATUS_SIZE (ex. stat(MPI_SOURCE), stat(MPI_TAG)).
• Additionally, the actual number of elements received is obtainable from Status via the MPI_Get_count routine.
Request
• Used by non-blocking send and receive operations.
• Since non-blocking operations may return before the requested system buffer space is obtained, the system issues a unique "request number".
• The programmer uses this system-assigned "handle" later (in a WAIT type routine) to determine completion of the non-blocking operation.
• In C, this argument is a pointer to the predefined structure MPI_Request.
• In Fortran, it is an integer.
Example : Blocking Message Passing Routine (1/2)
#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
  int numtasks, rank, dest, source, rc, count, tag=1;
  char inmsg, outmsg='x';
  MPI_Status Stat;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 0) {
    dest = 1;
    source = 1;
    rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
    rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
  }
  else if (rank == 1) {
    dest = 0;
    source = 0;
    rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
    rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
  }
Example : Blocking Message Passing Routine (2/2)
  rc = MPI_Get_count(&Stat, MPI_CHAR, &count);
  printf("Task %d: Received %d char(s) from task %d with tag %d \n",
         rank, count, Stat.MPI_SOURCE, Stat.MPI_TAG);

  MPI_Finalize();
  return 0;
}
Example : Dead Lock
#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
  int numtasks, rank, dest, source, rc, count, tag=1;
  char inmsg, outmsg='x';
  MPI_Status Stat;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 0) {
    dest = 1;
    source = 1;
    rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
    rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
  }
  else if (rank == 1) {
    dest = 0;
    source = 0;
    /* both ranks send first: without sufficient buffering, both MPI_Send
       calls can block waiting for a matching receive - deadlock */
    rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
    rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
  }
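One standard way to avoid this deadlock is to reverse the send/receive order on one rank, or to let the library pair the two operations with MPI_Sendrecv. A minimal sketch of the latter, assuming exactly two ranks (this is not part of the original example):

/* Hedged sketch: a deadlock-free exchange using MPI_Sendrecv, which pairs the
   send and receive internally so neither rank can get stuck waiting. */
#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank, other, tag = 1;
    char inmsg, outmsg = 'x';
    MPI_Status stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    other = (rank == 0) ? 1 : 0;          /* assumes exactly 2 ranks */

    MPI_Sendrecv(&outmsg, 1, MPI_CHAR, other, tag,
                 &inmsg,  1, MPI_CHAR, other, tag,
                 MPI_COMM_WORLD, &stat);
    printf("Task %d received '%c'\n", rank, inmsg);

    MPI_Finalize();
    return 0;
}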
Example : Non-Blocking Message Passing Routine (1/2)
Nearest neighbor exchange in a ring topology

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
  int numtasks, rank, next, prev, buf[2], tag1=1, tag2=2;
  MPI_Request reqs[4];
  MPI_Status stats[4];

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  prev = rank - 1;
  next = rank + 1;
  if (rank == 0) prev = numtasks - 1;
  if (rank == (numtasks - 1)) next = 0;
Example : Non-Blocking Message Passing Routine (2/2)
  MPI_Irecv(&buf[0], 1, MPI_INT, prev, tag1, MPI_COMM_WORLD, &reqs[0]);
  MPI_Irecv(&buf[1], 1, MPI_INT, next, tag2, MPI_COMM_WORLD, &reqs[1]);
  MPI_Isend(&rank, 1, MPI_INT, prev, tag2, MPI_COMM_WORLD, &reqs[2]);
  MPI_Isend(&rank, 1, MPI_INT, next, tag1, MPI_COMM_WORLD, &reqs[3]);

  /* do some work */

  MPI_Waitall(4, reqs, stats);

  MPI_Finalize();
  return 0;
}
Advanced Example : Monte-Carlo Simulation
<Problem>
• Monte Carlo simulation
• Uses random numbers
• PI = 4 × Ac/As (area of the quarter circle of radius r over the area of the square)
<Requirement>
• Use N processors (ranks)
• Use point-to-point communication
Advanced Example : Monte-Carlo Simulation for PI
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main() {
  const long num_step=100000000;
  long i, cnt;
  double pi, x, y, r;

  printf("-----------------------------------------------------------\n");
  pi = 0.0;
  cnt = 0;
  r = 0.0;
  for (i=0; i<num_step; i++) {
    x = rand() / (RAND_MAX+1.0);
    y = rand() / (RAND_MAX+1.0);
    r = sqrt(x*x + y*y);
    if (r<=1) cnt += 1;
  }
  pi = 4.0 * (double)(cnt) / (double)(num_step);
  printf("PI = %17.15lf (Error = %e)\n", pi, fabs(acos(-1.0)-pi));
  printf("-----------------------------------------------------------\n");
  return 0;
}
Advanced Example : Numerical integration for PI
<Problem>
• Get PI using numerical integration:

  ∫[0,1] 4.0/(1+x²) dx = π ≈ Σ_{i=1..n} 4/(1 + ((i-0.5)/n)²) × (1/n)

  [Figure: the integrand evaluated at the midpoints x1 = (1-0.5)/n, x2 = (2-0.5)/n, …, xn = (n-0.5)/n, each interval of width 1/n]

<Requirement>
• Point-to-point communication
Advanced Example : Numerical integration for PI
#include <stdio.h>
#include <math.h>

int main() {
  const long num_step=100000000;
  long i;
  double sum, step, pi, x;

  step = (1.0/(double)num_step);
  sum = 0.0;
  printf("-----------------------------------------------------------\n");
  for (i=0; i<num_step; i++) {
    x = ((double)i + 0.5) * step;   /* midpoint of interval i (the loop starts at 0) */
    sum += 4.0/(1.0+x*x);
  }
  pi = step * sum;
  printf("PI = %5lf (Error = %e)\n", pi, fabs(acos(-1.0)-pi));
  printf("-----------------------------------------------------------\n");
  return 0;
}
Type of Collective Operations
Synchronization
• Processes wait until all members of the group have reached the synchronization point.
Data Movement
• Broadcast, scatter/gather, all-to-all.
Collective Computation (reductions)
• One member of the group collects data from the other members and performs an operation (min, max, add, multiply, etc.) on that data.
Programming Considerations and Restrictions
With MPI-3, collective operations can be blocking or non-blocking. Only blocking operations are covered in this tutorial.
Collective communication routines do not take message tag arguments.
Collective operations within subsets of processes are accomplished by first partitioning the subsets into new groups and then attaching the new groups to new communicators.
Can only be used with MPI predefined datatypes – not with MPI Derived Data Types.
MPI-2 extended most collective operations to allow data movement between intercommunicators (not covered here).
Collective Communication Routines
MPI_Barrier
• Synchronization operation. Creates a barrier synchronization in a group. Each task, when reaching the MPI_Barrier call, blocks until all tasks in the group reach the same MPI_Barrier call. Then all tasks are free to proceed.

C                     Fortran
MPI_Barrier(comm)     MPI_BARRIER(comm, ierr)
Collective Communication Routines
MPI_Bcast
• Data movement operation. Broadcasts (sends) a message from the process with rank "root" to all other processes in the group.

C       : MPI_Bcast(&buffer, count, datatype, root, comm)
Fortran : MPI_BCAST(buffer, count, datatype, root, comm, ierr)
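A minimal hedged sketch of broadcasting a value from rank 0 to every rank (the value 42 is arbitrary):

/* Hedged sketch: every rank ends up with the value that rank 0 held. */
#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) value = 42;                              /* only root sets it */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);       /* root = 0 */
    printf("rank %d: value = %d\n", rank, value);           /* all print 42 */

    MPI_Finalize();
    return 0;
}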
Collective Communication Routines
MPI_Scatter
• Data movement operation. Distributes distinct messages from a single source task to each task in the group.

C       : MPI_Scatter(&sendbuf, sendcnt, sendtype, &recvbuf, recvcnt, recvtype, root, comm)
Fortran : MPI_SCATTER(sendbuf, sendcnt, sendtype, recvbuf, recvcnt, recvtype, root, comm, ierr)
Collective Communication Routines
MPI_Gather
• Data movement operation. Gathers distinct messages from each task in the group to a single destination task. This routine is the reverse operation of MPI_Scatter.

C       : MPI_Gather(&sendbuf, sendcnt, sendtype, &recvbuf, recvcount, recvtype, root, comm)
Fortran : MPI_GATHER(sendbuf, sendcnt, sendtype, recvbuf, recvcount, recvtype, root, comm, ierr)
Collective Communication Routines
MPI_Allgather
• Data movement operation. Concatenation of data to all tasks in a group. Each task in the group, in effect, performs a one-to-all broadcasting operation within the group.

C       : MPI_Allgather(&sendbuf, sendcount, sendtype, &recvbuf, recvcount, recvtype, comm)
Fortran : MPI_ALLGATHER(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm, ierr)
Collective Communication Routines
MPI_Reduce
• Collective computation operation. Applies a reduction operation on all tasks in the group and places the result in one task.

C       : MPI_Reduce(&sendbuf, &recvbuf, count, datatype, op, root, comm)
Fortran : MPI_REDUCE(sendbuf, recvbuf, count, datatype, op, root, comm, ierr)
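A hedged sketch that sums the ranks of all processes onto rank 0:

/* Hedged sketch: MPI_Reduce summing one integer per rank onto root 0. */
#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank, numtasks, total = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

    MPI_Reduce(&rank, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)   /* total = 0 + 1 + ... + (numtasks-1) */
        printf("sum of ranks = %d\n", total);

    MPI_Finalize();
    return 0;
}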
Collective Communication Routines
The predefined MPI reduction operations appear below. Users can also define their own reduction functions by using the MPI_Op_create routine.

MPI Reduction Operation   Meaning                  C Data Types
MPI_MAX                   maximum                  integer, float
MPI_MIN                   minimum                  integer, float
MPI_SUM                   sum                      integer, float
MPI_PROD                  product                  integer, float
MPI_LAND                  logical AND              integer
MPI_BAND                  bit-wise AND             integer, MPI_BYTE
MPI_LOR                   logical OR               integer
MPI_BOR                   bit-wise OR              integer, MPI_BYTE
MPI_LXOR                  logical XOR              integer
MPI_BXOR                  bit-wise XOR             integer, MPI_BYTE
MPI_MAXLOC                max value and location   float, double and long double
MPI_MINLOC                min value and location   float, double and long double
Collective Communication Routines
MPI_Allreduce
• Collective computation operation + data movement. Applies a reduction operation and places the result in all tasks in the group. This is equivalent to an MPI_Reduce followed by an MPI_Bcast.

C       : MPI_Allreduce(&sendbuf, &recvbuf, count, datatype, op, comm)
Fortran : MPI_ALLREDUCE(sendbuf, recvbuf, count, datatype, op, comm, ierr)
Collective Communication Routines
MPI_Reduce_scatter
• Collective computation operation + data movement. First does an element-wise reduction on a vector across all tasks in the group. Next, the result vector is split into disjoint segments and distributed across the tasks. This is equivalent to an MPI_Reduce followed by an MPI_Scatter operation.

C       : MPI_Reduce_scatter(&sendbuf, &recvbuf, recvcount, datatype, op, comm)
Fortran : MPI_REDUCE_SCATTER(sendbuf, recvbuf, recvcount, datatype, op, comm, ierr)
Collective Communication Routines
MPI_Alltoall
• Data movement operation. Each task in a group performs a scatter operation, sending a distinct message to all the tasks in the group in order by index.

C       : MPI_Alltoall(&sendbuf, sendcount, sendtype, &recvbuf, recvcnt, recvtype, comm)
Fortran : MPI_ALLTOALL(sendbuf, sendcount, sendtype, recvbuf, recvcnt, recvtype, comm, ierr)
Collective Communication Routines
MPI_Scan
• Performs a scan operation with respect to a reduction operation across a task group.

C       : MPI_Scan(&sendbuf, &recvbuf, count, datatype, op, comm)
Fortran : MPI_SCAN(sendbuf, recvbuf, count, datatype, op, comm, ierr)
Collective Communication Routines
[Figure: data movement patterns across processes P0–P3; '*' denotes some operator]
• broadcast      : A on P0 → A on P0, P1, P2, P3
• scatter        : A,B,C,D on P0 → A on P0, B on P1, C on P2, D on P3
• gather         : A on P0, B on P1, C on P2, D on P3 → A,B,C,D on P0
• reduce         : A,B,C,D on P0–P3 → A*B*C*D on P0
• allreduce      : A,B,C,D on P0–P3 → A*B*C*D on every process
• allgather      : A,B,C,D on P0–P3 → A,B,C,D on every process
• scan           : A,B,C,D on P0–P3 → A on P0, A*B on P1, A*B*C on P2, A*B*C*D on P3
• alltoall       : P0 holds A0–A3, P1 holds B0–B3, … → P0 gets A0,B0,C0,D0; P1 gets A1,B1,C1,D1; …
• reduce_scatter : element-wise reduction across processes, with the result segments distributed (P0 gets A0*B0*C0*D0, P1 gets A1*B1*C1*D1, …)
Example : Collective Communication (1/2)
Perform a scatter operation on the rows of an array

#include "mpi.h"
#include <stdio.h>
#define SIZE 4

int main(int argc, char *argv[])
{
  int numtasks, rank, sendcount, recvcount, source;
  float sendbuf[SIZE][SIZE] = {
    {1.0, 2.0, 3.0, 4.0},
    {5.0, 6.0, 7.0, 8.0},
    {9.0, 10.0, 11.0, 12.0},
    {13.0, 14.0, 15.0, 16.0} };
  float recvbuf[SIZE];

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
Example : Collective Communication (2/2)
  if (numtasks == SIZE) {
    source = 1;
    sendcount = SIZE;
    recvcount = SIZE;
    MPI_Scatter(sendbuf, sendcount, MPI_FLOAT, recvbuf, recvcount,
                MPI_FLOAT, source, MPI_COMM_WORLD);
    printf("rank= %d Results: %f %f %f %f\n", rank, recvbuf[0],
           recvbuf[1], recvbuf[2], recvbuf[3]);
  }
  else
    printf("Must specify %d processors. Terminating.\n", SIZE);

  MPI_Finalize();
  return 0;
}
Advanced Example : Monte-Carlo Simulation for PI
Use the collective communication routines!

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main() {
  const long num_step=100000000;
  long i, cnt;
  double pi, x, y, r;

  printf("-----------------------------------------------------------\n");
  pi = 0.0;
  cnt = 0;
  r = 0.0;
  for (i=0; i<num_step; i++) {
    x = rand() / (RAND_MAX+1.0);
    y = rand() / (RAND_MAX+1.0);
    r = sqrt(x*x + y*y);
    if (r<=1) cnt += 1;
  }
  pi = 4.0 * (double)(cnt) / (double)(num_step);
  printf("PI = %17.15lf (Error = %e)\n", pi, fabs(acos(-1.0)-pi));
  printf("-----------------------------------------------------------\n");
  return 0;
}
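One possible way to approach this exercise – shown only as a hedged sketch, not the official solution – is to let each rank count hits on its own cyclic share of the samples (with an illustrative per-rank seed) and combine the counts with MPI_Reduce:

/* Hedged sketch: Monte-Carlo PI with per-rank counting and MPI_Reduce. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    const long num_step = 100000000;
    long i, cnt = 0, total = 0;
    double x, y, pi;
    int rank, numtasks;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

    srand(rank + 1);                            /* different stream per rank (illustrative) */
    for (i = rank; i < num_step; i += numtasks) {
        x = rand() / (RAND_MAX + 1.0);
        y = rand() / (RAND_MAX + 1.0);
        if (x*x + y*y <= 1.0) cnt++;
    }

    MPI_Reduce(&cnt, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        pi = 4.0 * (double)total / (double)num_step;
        printf("PI = %17.15lf (Error = %e)\n", pi, fabs(acos(-1.0) - pi));
    }
    MPI_Finalize();
    return 0;
}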
Advanced Example : Numerical integration for PI
Use the collective communication routines!

#include <stdio.h>
#include <math.h>

int main() {
  const long num_step=100000000;
  long i;
  double sum, step, pi, x;

  step = (1.0/(double)num_step);
  sum = 0.0;
  printf("-----------------------------------------------------------\n");
  for (i=0; i<num_step; i++) {
    x = ((double)i + 0.5) * step;   /* midpoint of interval i (the loop starts at 0) */
    sum += 4.0/(1.0+x*x);
  }
  pi = step * sum;
  printf("PI = %5lf (Error = %e)\n", pi, fabs(acos(-1.0)-pi));
  printf("-----------------------------------------------------------\n");
  return 0;
}
Any questions?

Más contenido relacionado

La actualidad más candente

Travaux Dirigée: Equipements d'interconnexion
Travaux Dirigée: Equipements d'interconnexionTravaux Dirigée: Equipements d'interconnexion
Travaux Dirigée: Equipements d'interconnexionInes Kechiche
 
Dc ch11 : routing in switched networks
Dc ch11 : routing in switched networksDc ch11 : routing in switched networks
Dc ch11 : routing in switched networksSyaiful Ahdan
 
Module: the modular p2 p networking stack
Module: the modular p2 p networking stack Module: the modular p2 p networking stack
Module: the modular p2 p networking stack Ioannis Psaras
 
Travaux Dirigée: Notions de bases dans les réseaux
Travaux Dirigée: Notions de bases dans les réseauxTravaux Dirigée: Notions de bases dans les réseaux
Travaux Dirigée: Notions de bases dans les réseauxInes Kechiche
 
Lte default and dedicated bearer / VoLTE
Lte default and dedicated bearer / VoLTELte default and dedicated bearer / VoLTE
Lte default and dedicated bearer / VoLTEmanish_sapra
 
Présentation etherchannel
Présentation etherchannelPrésentation etherchannel
Présentation etherchannelLechoco Kado
 
4G/5G RAN architecture: how a split can make the difference
4G/5G RAN architecture: how a split can make the difference4G/5G RAN architecture: how a split can make the difference
4G/5G RAN architecture: how a split can make the differenceEricsson
 
Introduction to tcpdump
Introduction to tcpdumpIntroduction to tcpdump
Introduction to tcpdumpLev Walkin
 
Intelligent transportation systems
Intelligent transportation systemsIntelligent transportation systems
Intelligent transportation systemsEngin Karabulut
 
VoLTE and ViLTE.pdf
VoLTE and ViLTE.pdfVoLTE and ViLTE.pdf
VoLTE and ViLTE.pdfAsitSwain5
 
Dynamic Routing IGRP
Dynamic Routing IGRPDynamic Routing IGRP
Dynamic Routing IGRPKishore Kumar
 
Ec8004 wireless networks unit 1 hiperlan 2
Ec8004 wireless networks unit 1 hiperlan 2Ec8004 wireless networks unit 1 hiperlan 2
Ec8004 wireless networks unit 1 hiperlan 2HemalathaR31
 
Basics of firewall, ebtables, arptables and iptables
Basics of firewall, ebtables, arptables and iptablesBasics of firewall, ebtables, arptables and iptables
Basics of firewall, ebtables, arptables and iptablesPrzemysław Piotrowski
 
Multicasting and multicast routing protocols
Multicasting and multicast routing protocolsMulticasting and multicast routing protocols
Multicasting and multicast routing protocolsAbhishek Kesharwani
 
Concepts et configuration de base de la commutation
Concepts et configuration de base de la commutationConcepts et configuration de base de la commutation
Concepts et configuration de base de la commutationEL AMRI El Hassan
 
Qos Quality of services
Qos   Quality of services Qos   Quality of services
Qos Quality of services HayderThary
 

La actualidad más candente (20)

Travaux Dirigée: Equipements d'interconnexion
Travaux Dirigée: Equipements d'interconnexionTravaux Dirigée: Equipements d'interconnexion
Travaux Dirigée: Equipements d'interconnexion
 
Dc ch11 : routing in switched networks
Dc ch11 : routing in switched networksDc ch11 : routing in switched networks
Dc ch11 : routing in switched networks
 
Module: the modular p2 p networking stack
Module: the modular p2 p networking stack Module: the modular p2 p networking stack
Module: the modular p2 p networking stack
 
Travaux Dirigée: Notions de bases dans les réseaux
Travaux Dirigée: Notions de bases dans les réseauxTravaux Dirigée: Notions de bases dans les réseaux
Travaux Dirigée: Notions de bases dans les réseaux
 
Lte default and dedicated bearer / VoLTE
Lte default and dedicated bearer / VoLTELte default and dedicated bearer / VoLTE
Lte default and dedicated bearer / VoLTE
 
Présentation etherchannel
Présentation etherchannelPrésentation etherchannel
Présentation etherchannel
 
4G/5G RAN architecture: how a split can make the difference
4G/5G RAN architecture: how a split can make the difference4G/5G RAN architecture: how a split can make the difference
4G/5G RAN architecture: how a split can make the difference
 
Introduction to tcpdump
Introduction to tcpdumpIntroduction to tcpdump
Introduction to tcpdump
 
Intelligent transportation systems
Intelligent transportation systemsIntelligent transportation systems
Intelligent transportation systems
 
VoLTE and ViLTE.pdf
VoLTE and ViLTE.pdfVoLTE and ViLTE.pdf
VoLTE and ViLTE.pdf
 
Dynamic Routing IGRP
Dynamic Routing IGRPDynamic Routing IGRP
Dynamic Routing IGRP
 
MQTT and CoAP
MQTT and CoAPMQTT and CoAP
MQTT and CoAP
 
Computer network
Computer networkComputer network
Computer network
 
Free FreeRTOS Course-Task Management
Free FreeRTOS Course-Task ManagementFree FreeRTOS Course-Task Management
Free FreeRTOS Course-Task Management
 
Ec8004 wireless networks unit 1 hiperlan 2
Ec8004 wireless networks unit 1 hiperlan 2Ec8004 wireless networks unit 1 hiperlan 2
Ec8004 wireless networks unit 1 hiperlan 2
 
Basics of firewall, ebtables, arptables and iptables
Basics of firewall, ebtables, arptables and iptablesBasics of firewall, ebtables, arptables and iptables
Basics of firewall, ebtables, arptables and iptables
 
Multicasting and multicast routing protocols
Multicasting and multicast routing protocolsMulticasting and multicast routing protocols
Multicasting and multicast routing protocols
 
Transportlayer tanenbaum
Transportlayer tanenbaumTransportlayer tanenbaum
Transportlayer tanenbaum
 
Concepts et configuration de base de la commutation
Concepts et configuration de base de la commutationConcepts et configuration de base de la commutation
Concepts et configuration de base de la commutation
 
Qos Quality of services
Qos   Quality of services Qos   Quality of services
Qos Quality of services
 

Destacado

Gpu Systems
Gpu SystemsGpu Systems
Gpu Systemsjpaugh
 
병렬처리와 성능향상
병렬처리와 성능향상병렬처리와 성능향상
병렬처리와 성능향상shaderx
 
2node cluster
2node cluster2node cluster
2node clustersprdd
 
Introduction to Linux #1
Introduction to Linux #1Introduction to Linux #1
Introduction to Linux #1UNIST
 
오픈소스컨설팅 클러스터제안 V1.0
오픈소스컨설팅 클러스터제안 V1.0오픈소스컨설팅 클러스터제안 V1.0
오픈소스컨설팅 클러스터제안 V1.0sprdd
 

Destacado (8)

Gpu Systems
Gpu SystemsGpu Systems
Gpu Systems
 
Introduction to Parallel Programming

  • 21. 병렬 프로그래밍 모델 공유메모리 병렬 프로그래밍 모델    공유 메모리 아키텍처에 적합 다중 스레드 프로그램 OpenMP, Pthreads 메시지 패싱 병렬 프로그래밍 모델   분산 메모리 아키텍처에 적합 MPI, PVM 하이브리드 병렬 프로그래밍 모델   분산-공유 메모리 아키텍처 OpenMP + MPI
  • 22. 공유 메모리 병렬 프로그래밍 모델 [그림: 단일 스레드 실행과 fork-join 기반 멀티스레드 실행 비교, 스레드들이 공유 주소 공간(shared address space)을 사용]
  • 23. 메시지 패싱 병렬 프로그래밍 모델 [그림: 순차 실행과 메시지 패싱 실행 비교, 각 노드의 프로세스들이 인터커넥트를 통한 데이터 전송(data transmission over the interconnect)으로 통신]
  • 24. 하이브리드 병렬 프로그래밍 모델 [그림: 노드 간에는 메시지 패싱, 노드 내에서는 fork-join 멀티스레드(공유 주소 공간)로 실행]
  • 25. DSM 시스템의 메시지 패싱 [그림: 분산-공유 메모리 시스템에서 노드당 여러 프로세스가 메시지 패싱으로 통신]
  • 26. SPMD와 MPMD (1/4) SPMD(Single Program Multiple Data)  하나의 프로그램이 여러 프로세스에서 동시에 수행됨  어떤 순간 프로세스들은 같은 프로그램내의 명령어들을 수행하며 그 명령어들은 같을 수도 다를 수도 있음 MPMD (Multiple Program Multiple Data)  한 MPMD 응용 프로그램은 여러 개의 실행 프로그램으로 구성  응용프로그램이 병렬로 실행될 때 각 프로세스는 다른 프로세스와 같거나 다른 프로그램을 실행할 수 있음
  • 28. SPMD와 MPMD (3/4) MPMD : Master/Worker (Self-Scheduling) a.out Node 1 b.out Node 2 Node 3
  • 29. SPMD와 MPMD (4/4) MPMD: Coupled Analysis a.out b.out c.out Node 1 Node 2 Node 3
  • 30. •성능측정 •성능에 영향을 주는 요인들 •병렬 프로그램 작성순서
  • 31. 프로그램 실행시간 측정 (1/2) time 사용방법(bash, ksh) : $time [executable] $ time mpirun –np 4 –machinefile machines ./exmpi.x real 0m3.59s user 0m3.16s sys 0m0.04s  real = wall-clock time  User = 프로그램 자신과 호출된 라이브러리 실행에 사용된 CPU 시간  Sys = 프로그램에 의해 시스템 호출에 사용된 CPU 시간  user + sys = CPU time
  • 32. 프로그램 실행시간 측정 (2/2) 사용방법(csh) : $time [executable] $ time testprog 1.150u 0.020s 0:01.76 66.4% 15+3981k 24+10io 0pf+0w ① ② ③ ④ ⑤ ⑥ ⑦ ⑧ ① user CPU time (1.15초) ② system CPU time (0.02초) ③ real time (0분 1.76초) ④ real time에서 CPU time이 차지하는 정도(66.4%) ⑤ 메모리 사용 : Shared (15Kbytes) + Unshared (3981Kbytes) ⑥ 입력(24 블록) + 출력(10 블록) ⑦ no page faults ⑧ no swaps
  • 33. 성능측정 병렬화를 통해 얻어진 성능이득의 정량적 분석 성능측정  성능향상도  효율  Cost
  • 34. 성능향상도 (1/7) 성능향상도 (Speed-up) : S(n) = ts / tp (ts = 순차 프로그램의 실행시간, tp = n개 프로세서를 사용한 병렬 프로그램의 실행시간)  순차 프로그램에 대한 병렬 프로그램의 성능이득 정도  실행시간 = Wall-clock time  실행시간이 100초가 걸리는 순차 프로그램을 병렬화하여 10개의 프로세서로 50초 만에 실행되었다면, S(10) = 100/50 = 2
  • 35. 성능향상도 (2/7) 이상(Ideal) 성능향상도 : Amdahl's Law  f : 코드의 순차부분 (0 ≤ f ≤ 1)  tp = f·ts + (1-f)·ts/n (앞 항은 순차부분 실행시간, 뒤 항은 병렬부분 실행시간)
  • 36. 성능향상도 (3/7) [그림: 순차부분 f·ts와 병렬화 가능 부분 (1-f)·ts를 n개 프로세스에 분배하면 tp = f·ts + (1-f)·ts/n]
  • 37. 성능향상도 (4/7)  S(n) = ts / tp = ts / (f·ts + (1-f)·ts/n) = 1 / (f + (1-f)/n)  최대 성능향상도 ( n → ∞ ) : S(n) → 1/f  프로세서의 개수를 증가하면, 순차부분 크기의 역수에 수렴
  • 38. 성능향상도 (5/7) f = 0.2, n = 4 [그림: 순차실행(20 + 80) 대비 병렬실행(20 + 20), 앞의 20%는 병렬화 불가, 나머지 80%는 4개 프로세스에 분배]  S(4) = 1 / (0.2 + (1-0.2)/4) = 2.5
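아래는 Amdahl's Law에 따른 성능향상도를 직접 계산해 보는 간단한 C 스케치이다. 슬라이드의 수식을 그대로 옮긴 것이며, 함수 이름 등은 설명을 위한 가정이다.

    #include <stdio.h>

    /* Amdahl's Law : S(n) = 1 / (f + (1-f)/n), f = 코드의 순차부분 비율 */
    double speedup(double f, int n)
    {
        return 1.0 / (f + (1.0 - f) / n);
    }

    int main(void)
    {
        /* 슬라이드의 예 : f = 0.2, n = 4  ->  S(4) = 2.5 */
        printf("S(4)   = %.2f\n", speedup(0.2, 4));
        /* n이 매우 커지면 1/f = 5.0 에 수렴 */
        printf("S(1e6) = %.2f\n", speedup(0.2, 1000000));
        return 0;
    }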
  • 39. 성능향상도 (6/7) 프로세서 개수 대 성능향상도 [그래프: x축 number of processors n, y축 Speed-up, f = 0, 0.05, 0.1, 0.2 곡선 비교]
  • 40. 성능향상도 (7/7) 순차부분 대 성능향상도 [그래프: x축 Serial fraction f, y축 Speed-up, n = 16과 n = 256 곡선 비교]
  • 41. 효율 효율 (Efficiency) : E(n) = ts / (tp ⅹ n) = S(n) / n  프로세서 개수에 따른 병렬 프로그램의 성능효율을 나타냄 • 10개의 프로세서로 2배의 성능향상 : S(10) = 2  E(10) = 20 % • 100개의 프로세서로 10배의 성능향상 : S(100) = 10  E(100) = 10 %
  • 42. Cost Cost = 실행시간 ⅹ 프로세서 개수  순차 프로그램 : Cost = ts  병렬 프로그램 : Cost = tp ⅹ n = ts·n / S(n) = ts / E(n)  예) 10개의 프로세서로 2배, 100개의 프로세서로 10배의 성능향상 : (ts=100, tp=50, n=10, S(n)=2, E(n)=0.2, Cost=500), (ts=100, tp=10, n=100, S(n)=10, E(n)=0.1, Cost=1000)
  • 43. 실질적 성능향상에 고려할 사항 실제 성능향상도 : 통신부하, 로드 밸런싱 문제 20 80 Serial parallel 20 20 process 1 cannot be parallelized process 2 can be parallelized process 3 communication overhead process 4 Load unbalance
  • 44. 성능증가를 위한 방안들 1. 프로그램에서 병렬화 가능한 부분(Coverage) 증가  알고리즘 개선 2. 작업부하의 균등 분배 : 로드 밸런싱 3. 통신에 소비하는 시간(통신부하) 감소
  • 45. 성능에 영향을 주는 요인들 Coverage : Amdahl’s Law 로드 밸런싱 동기화 통신부하 세분성 입출력
  • 46. 로드 밸런싱 모든 프로세스들의 작업시간이 가능한 균등하도록 작업을 분배하여 작업대기시간을 최소화 하는 것  데이터 분배방식(Block, Cyclic, Block-Cyclic) 선택에 주의  이기종 시스템을 연결시킨 경우, 매우 중요함  동적 작업할당을 통해 얻을 수도 있음 task0 WORK task1 WAIT task2 task3 time
  • 47. 동기화 병렬 태스크의 상태나 정보 등을 동일하게 설정하기 위한 조정작업   대표적 병렬부하 : 성능에 악영향 장벽, 잠금, 세마포어(semaphore), 동기통신 연산 등 이용 병렬부하 (Parallel Overhead)  병렬 태스크의 시작, 종료, 조정으로 인한 부하 • 시작 : 태스크 식별, 프로세서 지정, 태스크 로드, 데이터 로드 등 • 종료 : 결과의 취합과 전송, 운영체제 자원의 반납 등 • 조정 : 동기화, 통신 등
  • 48. 통신부하 (1/4) 데이터 통신에 의해 발생하는 부하  네트워크 고유의 지연시간과 대역폭 존재 메시지 패싱에서 중요 통신부하에 영향을 주는 요인들  동기통신? 비동기 통신?  블록킹? 논블록킹?  점대점 통신? 집합통신?  데이터전송 횟수, 전송하는 데이터의 크기
  • 49. 통신부하 (2/4) 통신시간 = 지연시간 + 메시지 크기/대역폭  지연시간 : 메시지의 첫 비트가 전송되는데 걸리는 시간 • 송신지연 + 수신지연 + 전달지연  대역폭 : 단위시간당 통신 가능한 데이터의 양(MB/sec)  유효 대역폭 = 메시지 크기/통신시간 = 대역폭 / (1 + 지연시간ⅹ대역폭/메시지 크기)
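위 식을 그대로 코드로 옮겨 메시지 크기에 따른 유효 대역폭을 계산해 보는 C 스케치이다. 지연시간과 대역폭 값은 슬라이드 51에 나오는 예시 값을 가정으로 사용했다.

    #include <stdio.h>

    /* 통신시간 = 지연시간 + 메시지크기/대역폭, 유효대역폭 = 메시지크기/통신시간 */
    int main(void)
    {
        const double latency   = 22e-6;     /* 22 us (슬라이드 51의 예시 값) */
        const double bandwidth = 133e6;     /* 133 MB/sec */
        double size;

        for (size = 1.0; size <= 1e6; size *= 10.0) {
            double t_comm = latency + size / bandwidth;
            double eff_bw = size / t_comm;
            printf("msg %10.0f bytes : effective bandwidth %8.3f MB/sec\n",
                   size, eff_bw / 1e6);
        }
        return 0;
    }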
  • 50. 통신부하 (3/4) Communication Time [그래프: x축 message size, y축 communication time, 절편 = Latency, 1/기울기 = Bandwidth]
  • 51. 통신부하 (4/4) Effective Bandwidth [그래프: x축 message size(bytes), y축 effective bandwidth(MB/sec), 메시지가 커질수록 network bandwidth에 근접 • latency = 22 ㎲ • bandwidth = 133 MB/sec]
  • 52. 세분성 (1/2) 병렬 프로그램내의 통신시간에 대한 계산시간의 비  Fine-grained 병렬성 • 통신 또는 동기화 사이의 계산작업이 상대적으로 적음 • 로드 밸런싱에 유리  Coarse-grained 병렬성 • 통신 또는 동기화 사이의 계산작업이 상대적으로 많음 • 로드 밸런싱에 불리 일반적으로 Coarse-grained 병렬성이 성능면에서 유리  계산시간 < 통신 또는 동기화 시간  알고리즘과 하드웨어 환경에 따라 다를 수 있음
  • 54. 입출력 일반적으로 병렬성을 방해함  쓰기 : 동일 파일공간을 이용할 경우 겹쳐 쓰기 문제  읽기 : 다중 읽기 요청을 처리하는 파일서버의 성능 문제  네트워크를 경유(NFS, non-local)하는 입출력의 병목현상 입출력을 가능하면 줄일 것  I/O 수행을 특정 순차영역으로 제한해 사용  지역적인 파일공간에서 I/O 수행 병렬 파일시스템의 개발 (GPFS, PVFS, PPFS…) 병렬 I/O 프로그래밍 인터페이스 개발 (MPI-2 : MPI I/O)
  • 55. 확장성 (1/2) 확장된 환경에 대한 성능이득을 누릴 수 있는 능력  하드웨어적 확장성  알고리즘적 확장성 확장성에 영향을 미치는 주요 하드웨어적 요인  CPU-메모리 버스 대역폭  네트워크 대역폭  메모리 용량  프로세서 클럭 속도
  • 57. 의존성과 교착 데이터 의존성 : 프로그램의 실행 순서가 실행 결과에 영향을 미치는 것 DO k = 1, 100 F(k + 2) = F(k +1) + F(k) ENDDO 교착 : 둘 이상의 프로세스들이 서로 상대방의 이벤트 발생을 기다리는 상태 Process 1 X = 4 SOURCE = TASK2 RECEIVE (SOURCE,Y) DEST = TASK2 SEND (DEST,X) Z = X + Y Process 2 Y = 8 SOURCE = TASK1 RECEIVE (SOURCE,X) DEST = TASK1 SEND (DEST,Y) Z = X + Y
  • 58. 의존성 F(1) F(2) F(3) F(4) F(5) F(6) F(7) … F(n) 1 2 3 4 5 6 7 … n DO k = 1, 100 F(k + 2) = F(k +1) + F(k) ENDDO Serial F(1) F(2) F(3) F(4) F(5) F(6) F(7) … F(n) 1 2 3 5 8 13 21 … … F(1) F(2) F(3) F(4) F(5) F(6) F(7) … F(n) 1 2 3 5(4) 7 11 18 … … Parallel
  • 59. 병렬 프로그램 작성 순서 ① 순차코드 작성, 분석(프로파일링), 최적화   ② hotspot, 병목지점, 데이터 의존성 등을 확인 데이터 병렬성/태스크 병렬성 ? 병렬코드 개발  MPI/OpenMP/… ?  태스크 할당과 제어, 통신, 동기화 코드 추가 ③ 컴파일, 실행, 디버깅 ④ 병렬코드 최적화  성능측정과 분석을 통한 성능개선
  • 60. 디버깅과 성능분석 디버깅  코드 작성시 모듈화 접근 필요  통신, 동기화, 데이터 의존성, 교착 등에 주의  디버거 : TotalView 성능측정과 분석  timer 함수 사용  프로파일러 : prof, gprof, pgprof, TAU
  • 62. II. Parallel Programming using OpenMP
  • 63. OpenMP란 무엇인가? 공유메모리 환경에서 다중 스레드 병렬 프로그램 작성을 위한 응용프로그램 인터페이스(API)
  • 64. OpenMP의 역사 1990년대 :  고성능 공유 메모리 시스템의 발전  업체 고유의 지시어 집합 사용  표준화의 필요성 1994년 ANSI X3H5  1996년 openmp.org 설립 1997년 OpenMP API 발표 Release History  OpenMP Fortran API 버전 1.0 : 1997년 10월  C/C++ API 버전 1.0 : 1998년 10월  Fortran API 버전 1.1 : 1999년 11월  Fortran API 버전 2.0 : 2000년 11월  C/C++ API 버전 2.0 : 2002년 3월  Combined C/C++ and Fortran API 버전 2.5 : 2005년 5월  API 버전 3.0 : 2008년 5월
  • 65. OpenMP의 목표 표준과 이식성 공유메모리 병렬 프로그래밍의 표준 대부분의 Unix와 Windows에 OpenMP 컴파일러 존재 Fortran, C/C++ 지원
  • 67. OpenMP의 구성 (2/2) 컴파일러 지시어  스레드 사이의 작업분담, 통신, 동기화를 담당  좁은 의미의 OpenMP 예) C$OMP PARALLEL DO 실행시간 라이브러리  병렬 매개변수(참여 스레드의 개수, 번호 등)의 설정과 조회 예) CALL omp_set_num_threads(128) 환경변수  실행 시스템의 병렬 매개변수(스레드 개수 등)를 정의 예) export OMP_NUM_THREADS=8
  • 68. OpenMP 프로그래밍 모델 (1/4) 컴파일러 지시어 기반  순차코드의 적절한 위치에 컴파일러 지시어 삽입  컴파일러가 지시어를 참고하여 다중 스레드 코드 생성  OpenMP를 지원하는 컴파일러 필요  동기화, 의존성 제거 등의 작업 필요
  • 69. OpenMP 프로그래밍 모델 (2/4) Fork-Join  병렬화가 필요한 부분에 다중 스레드 생성  병렬계산을 마치면 다시 순차적으로 실행 [그림: Master Thread가 FORK로 병렬영역(Parallel Region)을 시작하고 JOIN으로 끝내는 과정의 반복]
  • 70. OpenMP 프로그래밍 모델 (3/4) 컴파일러 지시어 삽입 Serial Code PROGRAM exam … ialpha = 2 DO i = 1, 100 a(i) = a(i) + ialpha*b(i) ENDDO PRINT *, a END Parallel Code PROGRAM exam … ialpha = 2 !$OMP PARALLEL DO DO i = 1, 100 a(i) = a(i) + ialpha*b(i) ENDDO !$OMP END PARALLEL DO PRINT *, a END
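참고로, 위 Fortran 예제를 C/OpenMP로 옮기면 대략 아래와 같은 형태가 된다. 배열 크기와 초기화 값은 설명을 위한 가정이다.

    #include <stdio.h>
    #include <omp.h>

    #define N 100

    int main(void)
    {
        double a[N], b[N];
        int i, ialpha = 2;

        for (i = 0; i < N; i++) { a[i] = i; b[i] = 1.0; }   /* 예시용 초기화 */

        /* 지시어 한 줄로 루프 반복을 여러 스레드에 분배 */
        #pragma omp parallel for
        for (i = 0; i < N; i++)
            a[i] = a[i] + ialpha * b[i];

        printf("a[N-1] = %f\n", a[N - 1]);
        return 0;
    }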
  • 71. OpenMP 프로그래밍 모델 (4/4) Fork-Join ※ export OMP_NUM_THREADS = 4 ialpha = 2 (Master Thread) (Fork) DO i=1,25 DO i=26,50 DO i=51,75 DO i=76,100 ... ... ... ... (Join) (Master) PRINT *, a (Slave) (Master Thread) (Slave) (Slave)
  • 72. OpenMP의 장점과 단점 장점 :  MPI보다 코딩, 디버깅이 쉬움  데이터 분배가 수월  점진적 병렬화가 가능  하나의 코드를 병렬코드와 순차코드로 컴파일 가능  상대적으로 코드 크기가 작음 / 단점 : • 공유메모리 환경의 다중 프로세서 아키텍처에서만 구현 가능 • OpenMP를 지원하는 컴파일러 필요 • 루프에 대한 의존도가 큼  낮은 병렬화 효율성 • 공유메모리 아키텍처의 확장성(프로세서 수, 메모리 등) 한계
  • 73. OpenMP의 전형적 사용 데이터 병렬성을 이용한 루프의 병렬화 1. 시간이 많이 걸리는 루프를 찾음 (프로파일링) 2. 의존성, 데이터 유효범위 조사 3. 지시어 삽입으로 병렬화 태스크 병렬성을 이용한 병렬화도 가능
  • 74. 지시어 (1/5) OpenMP 지시어 문법 — 지시어 시작(감시문자) : Fortran(고정형식:f77) ▪ !$OMP <지시어> ▪ C$OMP <지시어> ▪ *$OMP <지시어> / Fortran(자유형식:f90) ▪ !$OMP <지시어> / C ▪ #pragma omp — 줄 바꿈(계속행) : f77 ▪ !$OMP <지시어> 다음 줄을 !$OMP& … 로 계속 / f90 ▪ !$OMP <지시어> & … / C ▪ #pragma omp … — 선택적 컴파일 : f77 ▪ !$ … ▪ C$ … ▪ *$ … / f90 ▪ !$ … / C ▪ #ifdef _OPENMP — 시작위치 : f77 첫번째 열, f90 무관, C 무관
  • 75. 지시어 (2/5) 병렬영역 지시어    PARALLEL/END PARALLEL 코드부분을 병렬영역으로 지정 지정된 영역은 여러 스레드에서 동시에 실행됨 작업분할 지시어    DO/FOR 병렬영역 내에서 사용 루프인덱스를 기준으로 각 스레드에게 루프작업 할당 결합된 병렬 작업분할 지시어   PARALLEL DO/FOR PARALLEL + DO/FOR의 역할을 수행
  • 76. 지시어 (3/5) 병렬영역 지정 Fortran : !$OMP PARALLEL DO i = 1, 10 PRINT *, 'Hello World', i ENDDO !$OMP END PARALLEL — C : #pragma omp parallel for(i=1; i<=10; i++) printf("Hello World %d\n",i);
  • 77. 지시어 (4/5) 병렬영역과 작업분할 Fortran : !$OMP PARALLEL !$OMP DO DO i = 1, 10 PRINT *, 'Hello World', i ENDDO [!$OMP END DO] !$OMP END PARALLEL — C : #pragma omp parallel { #pragma omp for for(i=1; i<=10; i++) printf("Hello World %d\n",i); }
  • 78. 지시어 (5/5) 병렬영역과 작업분할 Fortran : !$OMP PARALLEL !$OMP DO DO i = 1, n a(i) = b(i) + c(i) ENDDO [!$OMP END DO] (optional) !$OMP DO … [!$OMP END DO] !$OMP END PARALLEL — C : #pragma omp parallel { #pragma omp for for (i=1; i<=n; i++) { a[i] = b[i] + c[i]; } #pragma omp for for(…){ … } }
  • 79. 실행시간 라이브러리와 환경변수 (1/3) 실행시간 라이브러리    omp_set_num_threads(integer) : 스레드 개수 지정 omp_get_num_threads() : 스레드 개수 리턴 omp_get_thread_num() : 스레드 ID 리턴 환경변수  OMP_NUM_THREADS : 사용 가능한 스레드 최대 개수 • export OMP_NUM_THREADS=16 (ksh) • setenv OMP_NUM_THREADS 16 (csh) C : #include <omp.h>
  • 80. 실행시간 라이브러리와 환경변수 (3/3) omp_set_num_threads omp_get_thread_num INTEGER OMP_GET_THREAD_NUM CALL OMP_SET_NUM_THREADS(4) Fortran !$OMP PARALLEL PRINT*, 'Thread rank: ', OMP_GET_THREAD_NUM() !$OMP END PARALLEL #include <omp.h> omp_set_num_threads(4); C #pragma omp parallel { printf("Thread rank:%d\n",omp_get_thread_num()); }
  • 81. 주요 Clauses private(var1, var2, …) shared(var1, var2, …) default(shared|private|none) firstprivate(var1, var2, …) lastprivate(var1, var2, …) reduction(operator|intrinsic:var1, var2,…) schedule(type [,chunk])
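위 clause들이 실제로 어떻게 함께 쓰이는지 감을 잡기 위한 간단한 C 스케치이다. 변수 이름과 루프 내용은 설명을 위한 가정이다.

    #include <stdio.h>
    #include <omp.h>

    #define N 16

    int main(void)
    {
        double a[N];
        double scale = 2.0;      /* 모든 스레드가 읽기만 하므로 shared */
        int i, tmp;              /* tmp는 스레드마다 따로 필요하므로 private */

        /* schedule(dynamic, 2) : 반복을 2개씩 묶어 먼저 끝난 스레드에 동적으로 할당 */
        #pragma omp parallel for private(tmp) shared(a, scale) schedule(dynamic, 2)
        for (i = 0; i < N; i++) {
            tmp  = i * i;
            a[i] = scale * tmp;
        }

        printf("a[N-1] = %f\n", a[N - 1]);
        return 0;
    }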
  • 82. clause : reduction (1/4) reduction(operator|intrinsic:var1, var2,…)  reduction 변수는 shared • 배열 가능(Fortran only): deferred shape, assumed shape array 사용 불가 • C는 scalar 변수만 가능  각 스레드에 복제돼 연산에 따라 다른 값으로 초기화되고(표 참조) 병렬 연산 수행  다중 스레드에서 병렬로 수행된 계산결과를 환산해 최종 결과를 마스터 스레드로 내놓음
  • 83. clause : reduction (2/4) !$OMP DO reduction(+:sum) DO i = 1, 100 sum = sum + x(i) ENDDO — Thread 0 : sum0 = 0 / DO i = 1, 50 / sum0 = sum0 + x(i) / ENDDO, Thread 1 : sum1 = 0 / DO i = 51, 100 / sum1 = sum1 + x(i) / ENDDO, 마지막에 sum = sum0 + sum1
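같은 아이디어를 C로 옮긴 스케치이다. 배열 초기화 값은 예시를 위한 가정이다.

    #include <stdio.h>
    #include <omp.h>

    #define N 100

    int main(void)
    {
        double x[N], sum = 0.0;
        int i;

        for (i = 0; i < N; i++) x[i] = 1.0;   /* 예시용 데이터 */

        /* 각 스레드가 부분합을 따로 계산한 뒤 마지막에 + 연산으로 환산 */
        #pragma omp parallel for reduction(+:sum)
        for (i = 0; i < N; i++)
            sum += x[i];

        printf("sum = %f\n", sum);            /* 100.0 */
        return 0;
    }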
  • 84. clause : reduction (3/4) Reduction Operators : Fortran Operator Data Types 초기값 + integer, floating point (complex or real) 0 * integer, floating point (complex or real) 1 - integer, floating point (complex or real) 0 .AND. logical .TRUE. .OR. logical .FALSE. .EQV. logical .TRUE. .NEQV. logical .FALSE. MAX integer, floating point (real only) 가능한 최소값 MIN integer, floating point (real only) 가능한 최대값 IAND integer all bits on IOR integer 0 IEOR integer 0
  • 85. clause : reduction (4/4) Reduction Operators : C Operator Data Types 초기값 + integer, floating point 0 * integer, floating point 1 - integer, floating point 0 & integer all bits on | integer 0 ^ integer 0 && integer 1 || integer 0
  • 88. Current HPC Platforms : COTS-Based Clusters COTS = Commercial off-the-shelf Nehalem Access Control File Server(s) Gulftown … Login Node(s) 88 Compute Nodes
  • 89. Memory Architectures Shared Memory  Single address space for all processors <NUMA> <UMA> Distributed Memory 89
  • 90. What is MPI? MPI = Message Passing Interface MPI is a specification for the developers and users of message passing libraries. By itself, it is NOT a library – but rather the specification of what such a library should be. MPI primarily addresses the message-passing parallel programming model : data is moved from the address space of one process to that of another process through cooperative operations on each process. Simply stated, the goal of the Message Passing Interface is to provide a widely used standard for writing message passing programs. The interface attempts to be :  Practical  Portable  Efficient  Flexible
  • 91. What is MPI? The MPI standard has gone through a number of revisions, with the most recent version being MPI-3. Interface specifications have been defined for C and Fortran90 language bindings :  C++ bindings from MPI-1 are removed in MPI-3  MPI-3 also provides support for Fortran 2003 and 2008 features Actual MPI library implementations differ in which version and features of the MPI standard they support. Developers/users will need to be aware of this. 91
  • 92. Programming Model Originally, MPI was designed for distributed memory architectures, which were becoming increasingly popular at time (1980s – early 1990s). As architecture trends changed, shared memory SMPs were combined over networks creating hybrid distributed memory/shared memory systems. 92
  • 93. Programming Model MPI implementers adapted their libraries to handle both types of underlying memory architectures seamlessly. They also adapted/developed ways of handling different interconnects and protocols. Today, MPI runs on virtually any hardware platform :  Distributed Memory  Shared Memory  Hybrid The programming model clearly remains a distributed memory model however, regardless of the underlying physical architecture of the machine.
  • 94. Reasons for Using MPI Standardization  MPI is the only message passing library which can be considered a standard. It is supported on virtually all HPC platforms. Practically, it has replaced all previous message passing libraries. Portability  There is little or no need to modify your source code when you port your application to a different platform that supports (and is compliant with) the MPI standard. Performance Opportunities  Vendor implementations should be able to exploit native hardware features to optimize performance. Functionality  There are over 440 routines defined in MPI-3, which includes the majority of those in MPI-2 and MPI-1. Availability  94 A Variety of implementations are available, both vendor and public domain.
  • 95. History and Evolution MPI has resulted from the efforts of numerous individuals and groups that began in 1992. 1980s – early 1990s : Distributed memory parallel computing develops, as do a number of incompatible software tools for writing such programs – usually with tradeoffs between portability, performance, functionality and price. Recognition of the need for a standard arose. Apr 1992 : Workshop on Standards for Message Passing in a Distributed Memory Environment, sponsored by the Center for Research on Parallel Computing, Williamsburg, Virginia. The basic features essential to a standard message passing interface were discussed, and a working group established to continue the standardization process. Preliminary draft proposal developed subsequently.
  • 96. History and Evolution Nov 1992 : Working group meets in Minneapolis. MPI draft proposal (MPI1) from ORNL presented. Group adopts procedures and organization to form the MPI Forum. It eventually comprised about 175 individuals from 40 organizations including parallel computer vendors, software writers, academia and application scientists. Nov 1993 : Supercomputing 93 conference – draft MPI standard presented. May 1994 : Final version of MPI-1.0 released. MPI-1.0 was followed by versions MPI-1.1 (Jun 1995), MPI-1.2 (Jul 1997) and MPI-1.3 (May 2008). MPI-2 picked up where the first MPI specification left off, and addressed topics which went far beyond the MPI-1 specification. It was finalized in 1996. MPI-2.1 (Sep 2008) and MPI-2.2 (Sep 2009) followed. Sep 2012 : The MPI-3.0 standard was approved.
  • 97. History and Evolution Documentation for all versions of the MPI standard is available at :  97 http://www.mpi-forum.org/docs/
  • 98. A General Structure of the MPI Program 98
  • 99. A Header File for MPI routines Required for all programs that make MPI library calls. C include file : #include "mpi.h"  Fortran include file : include 'mpif.h'  With MPI-3 Fortran, the USE mpi_f08 module is preferred over using the include file shown above.
  • 100. The Format of MPI Calls C names are case sensitive; Fortran names are not. Programs must not declare variables or functions with names beginning with the prefix MPI_ or PMPI_ (profiling interface). C Binding : Format rc = MPI_Xxxxx(parameter, …), Example rc = MPI_Bsend(&buf, count, type, dest, tag, comm), Error code returned as "rc", MPI_SUCCESS if successful. Fortran Binding : Format CALL MPI_XXXXX(parameter, …, ierr) / call mpi_xxxxx(parameter, …, ierr), Example call MPI_BSEND(buf, count, type, dest, tag, comm, ierr), Error code returned as "ierr" parameter, MPI_SUCCESS if successful.
  • 101. Communicators and Groups MPI uses objects called communicators and groups to define which collection of processes may communicate with each other. Most MPI routines require you to specify a communicator as an argument. Communicators and groups will be covered in more detail later. For now, simply use MPI_COMM_WORLD whenever a communicator is required - it is the predefined communicator that includes all of your MPI processes. 101
  • 102. Rank Within a communicator, every process has its own unique, integer identifier assigned by the system when the process initializes. A rank is sometimes also called a “task ID”. Ranks are contiguous and begin at zero. Used by the programmer to specify the source and destination of messages. Often used conditionally by the application to control program execution (if rank = 0 do this / if rank = 1 do that). 102
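아래는 rank 값으로 실행 경로를 나누는 전형적인 패턴을 보여 주는 짧은 C 스케치이다. 출력 문구는 설명을 위한 예시이다.

    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char *argv[])
    {
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0)
            printf("rank 0 : master가 할 일\n");        /* if rank = 0 do this */
        else
            printf("rank %d : worker가 할 일\n", rank); /* if rank != 0 do that */

        MPI_Finalize();
        return 0;
    }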
  • 103. Error Handling Most MPI routines include a return/error code parameter, as described in the "Format of MPI Calls" section above. However, according to the MPI standard, the default behavior of an MPI call is to abort if there is an error. This means you will probably not be able to capture a return/error code other than MPI_SUCCESS (zero). The standard does provide a means to override this default error handler. You can also consult the error handling section of the MPI Standard located at http://www.mpi-forum.org/docs/mpi-11-html/node148.html . The types of errors displayed to the user are implementation dependent.
  • 104. Environment Management Routines MPI_Init  Initializes the MPI execution environment. This function must be called in every MPI program, must be called before any other MPI functions and must be called only once in an MPI program. For C programs, MPI_Init may be used to pass the command line arguments to all processes, although this is not required by the standard and is implementation dependent. C MPI_Init(&argc, &argv)  Fortran MPI_INIT(ierr)  Input parameters • argc : Pointer to the number of arguments • argv : Pointer to the argument vector  ierr : the error return argument
  • 105. Environment Management Routines MPI_Comm_size  Returns the total number of MPI processes in the specified communicator, such as MPI_COMM_WORLD. If the communicator is MPI_COMM_WORLD, then it represents the number of MPI tasks available to your application. C MPI_Comm_size(comm, &size)    105 Fortran MPI_COMM_SIZE(comm, size, ierr) Input parameters • comm : communicator (handle) Output parameters • size : number of processes in the group of comm (integer) ierr : the error return argument
  • 106. Environment Management Routines MPI_Comm_rank  Returns the rank of the calling MPI process within the specified communicator. Initially, each process will be assigned a unique integer rank between 0 and number of tasks -1 within the communicator MPI_COMM_WORLD. This rank is often referred to as a task ID. If a process becomes associated with other communicators, it will have a unique rank within each of these as well. C MPI_Comm_rank(comm, &rank)  Fortran MPI_COMM_RANK(comm, rank, ierr)  Input parameters • comm : communicator (handle)  Output parameters • rank : rank of the calling process in the group of comm (integer)  ierr : the error return argument
  • 107. Environment Management Routines MPI_Finalize  Terminates the MPI execution environment. This function should be the last MPI routine called in every MPI program – no other MPI routines may be called after it. C MPI_Finalize()  107 ierr : the error return argument Fortran MPI_FINALIZE(ierr)
  • 108. Environment Management Routines MPI_Abort  Terminates all MPI processes associated with the communicator. In most MPI implementations it terminates ALL processes regardless of the communicator specified. C MPI_Abort(comm, errorcode)   108 Fortran MPI_ABORT(comm, errorcode, ierr) Input parameters • comm : communicator (handle) • errorcode : error code to return to invoking environment ierr : the error return argument
  • 109. Environment Management Routines MPI_Get_processor_name  Returns the processor name. Also returns the length of the name. The buffer for "name" must be at least MPI_MAX_PROCESSOR_NAME characters in size. What is returned into "name" is implementation dependent – may not be the same as the output of the "hostname" or "host" shell commands. C MPI_Get_processor_name(&name, &resultlength)  Fortran MPI_GET_PROCESSOR_NAME(name, resultlength, ierr)  Output parameters • name : A unique specifier for the actual (as opposed to virtual) node. This must be an array of size at least MPI_MAX_PROCESSOR_NAME. • resultlen : Length (in characters) of the name. ierr : the error return argument
  • 110. Environment Management Routines MPI_Get_version  Returns the version (either 1 or 2) and subversion of MPI. C MPI_Get_version(&version, &subversion)  Fortran MPI_GET_VERSION(version, subversion, ierr)  Output parameters • version : Major version of MPI (1 or 2) • subversion : Minor version of MPI. ierr : the error return argument
  • 111. Environment Management Routines MPI_Initialized  Indicates whether MPI_Init has been called – returns flag as either logical true(1) or false(0). C MPI_Initialized(&flag)   111 Fortran MPI_INITIALIZED(flag, ierr) Output parameters • flag : Flag is true if MPI_Init has been called and false otherwise. ierr : the error return argument
  • 112. Environment Management Routines MPI_Wtime  Returns an elapsed wall clock time in seconds (double precision) on the calling processor. C MPI_Wtime()  Fortran MPI_WTIME()  Return value • Time in seconds since an arbitrary time in the past. MPI_Wtick  Returns the resolution in seconds (double precision) of MPI_Wtime. C MPI_Wtick()  Fortran MPI_WTICK()  Return value • Time in seconds of the resolution of MPI_Wtime.
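MPI_Wtime으로 코드 구간의 wall-clock time을 재는 전형적인 사용법을 스케치하면 다음과 같다. 측정 대상 루프는 예시를 위한 가정이다.

    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char *argv[])
    {
        double t0, t1, s = 0.0;
        long   i;

        MPI_Init(&argc, &argv);

        t0 = MPI_Wtime();                  /* 구간 시작 시각 */
        for (i = 0; i < 100000000L; i++)   /* 예시용 계산 */
            s += 1.0 / (i + 1.0);
        t1 = MPI_Wtime();                  /* 구간 종료 시각 */

        printf("elapsed = %f sec (tick = %g sec, s = %f)\n",
               t1 - t0, MPI_Wtick(), s);

        MPI_Finalize();
        return 0;
    }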
  • 113. Example: Hello world
    #include <stdio.h>
    #include "mpi.h"
    int main(int argc, char *argv[])
    {
        int rc;
        rc = MPI_Init(&argc, &argv);
        printf("Hello world.\n");
        rc = MPI_Finalize();
        return 0;
    }
  • 114. Example: Hello world Execute an MPI program.
    $ module load [compiler] [mpi]
    $ mpicc hello.c
    $ mpirun -np 4 -hostfile [hostfile] ./a.out
    Make a hostfile.
    ibs0001 slots=2
    ibs0002 slots=2
    ibs0003 slots=2
    ibs0003 slots=2
    …
  • 115. Example : Environment Management Routine
    #include "mpi.h"
    #include <stdio.h>
    int main(argc,argv)
    int argc;
    char *argv[];
    {
        int numtasks, rank, len, rc;
        char hostname[MPI_MAX_PROCESSOR_NAME];
        rc = MPI_Init(&argc,&argv);
        if (rc != MPI_SUCCESS) {
            printf("Error starting MPI program. Terminating.\n");
            MPI_Abort(MPI_COMM_WORLD, rc);
        }
        MPI_Comm_size(MPI_COMM_WORLD,&numtasks);
        MPI_Comm_rank(MPI_COMM_WORLD,&rank);
        MPI_Get_processor_name(hostname, &len);
        printf("Number of tasks= %d My rank= %d Running on %s\n", numtasks,rank,hostname);
        /******* do some work *******/
        rc = MPI_Finalize();
        return 0;
    }
  • 116. Types of Point-to-Point Operations MPI point-to-point operations typically involve message passing between two, and only two, different MPI tasks. One task is performing a send operation and the other task is performing a matching receive operation. There are different types of send and receive routines used for different purposes.  Synchronous send  Blocking send/blocking receive  Non-blocking send/non-blocking receive  Buffered send  Combined send/receive  "Ready" send Any type of send routine can be paired with any type of receive routine. MPI also provides several routines associated with send – receive operations, such as those used to wait for a message's arrival or probe to find out if a message has arrived.
  • 117. Buffering In a perfect world, every send operation would be perfectly synchronized with its matching receive. This is rarely the case. Somehow or other, the MPI implementation must be able to deal with storing data when the two tasks are out of sync. Consider the following two cases :  A send operation occurs 5 seconds before the receive is ready – where is the message while the receive is pending?  Multiple sends arrive at the same receiving task which can only accept one send at a time – what happens to the messages that are "backing up"?
  • 118. Buffering The MPI implementation (not the MPI standard) decides what happens to data in these types of cases. Typically, a system buffer area is reserved to hold data in transit. 118
  • 119. Buffering System buffer space is :  Opaque to the programmer and managed entirely by the MPI library  A finite resource that can be easy to exhaust  Often mysterious and not well documented  Able to exist on the sending side, the receiving side, or both  Something that may improve program performance because it allows send – receive operations to be asynchronous.
  • 120. Blocking vs. Non-blocking Most of the MPI point-to-point routines can be used in either blocking or non-blocking mode. Blocking  A blocking send routine will only "return" after it is safe to modify the application buffer (your send data) for reuse. Safe means that modifications will not affect the data intended for the receive task. Safe does not imply that the data was actually received – it may very well be sitting in a system buffer.  A blocking send can be synchronous which means there is handshaking occurring with the receive task to confirm a safe send.  A blocking send can be asynchronous if a system buffer is used to hold the data for eventual delivery to the receive.  A blocking receive only "returns" after the data has arrived and is ready for use by the program. Non-blocking  Non-blocking send and receive routines behave similarly – they will return almost immediately. They do not wait for any communication events to complete, such as message copying from user memory to system buffer space or the actual arrival of the message.  Non-blocking operations simply "request" the MPI library to perform the operation when it is able. The user can not predict when that will happen.  It is unsafe to modify the application buffer (your variable space) until you know for a fact the requested non-blocking operation was actually performed by the library. There are "wait" routines used to do this.  Non-blocking communications are primarily used to overlap computation with communication and exploit possible performance gains.
  • 121. MPI Message Passing Routine Arguments MPI point-to-point communication routines generally have an argument list that takes one of the following formats : Blocking sends MPI_Send(buffer, count, type, dest, tag, comm) Non-blocking sends MPI_Isend(buffer, count, type, dest, tag, comm, request) Blocking receive MPI_Recv(buffer, count, type, source, tag, comm, status) Non-blocking receive MPI_Irecv(buffer, count, type, source, tag, comm, request) Buffer  Program (application) address space that references the data that is to be sent or received. In most cases, this is simply the variable name that is to be sent/received. For C programs, this argument is passed by reference and usually must be prepended with an ampersand : &var1 Data count  Indicates the number of data elements of a particular type to be sent.
  • 122. MPI Message Passing Routine Arguments Data type  For reasons of portability, MPI predefines its elementary data types. The table below lists those required by the standard. C Data Types :
    MPI_CHAR : signed char
    MPI_SHORT : signed short int
    MPI_INT : signed int
    MPI_LONG : signed long int
    MPI_SIGNED_CHAR : signed char
    MPI_UNSIGNED_CHAR : unsigned char
    MPI_UNSIGNED_SHORT : unsigned short int
    MPI_UNSIGNED : unsigned int
    MPI_UNSIGNED_LONG : unsigned long int
    MPI_FLOAT : float
    MPI_DOUBLE : double
    MPI_LONG_DOUBLE : long double
  • 123. MPI Message Passing Routine Arguments Destination  An argument to send routines that indicates the process where a message should be delivered. Specified as the rank of the receiving process. Tag  Arbitrary non-negative integer assigned by the programmer to uniquely identify a message. Send and receive operations should match message tags. For a receive operation, the wild card MPI_ANY_TAG can be used to receive any message regardless of its tag. The MPI standard guarantees that integers 0 – 32767 can be used as tags, but most implementations allow a much larger range than this. Communicator  Indicates the communication context, or set of processes for which the source or destination fields are valid. Unless the programmer is explicitly creating a new communicator, the predefined communicator MPI_COMM_WORLD is usually used.
  • 124. MPI Message Passing Routine Arguments Status  For a receive operation, indicates the source of the message and the tag of the message. In C, this argument is a pointer to the predefined structure MPI_Status (ex. stat.MPI_SOURCE, stat.MPI_TAG). In Fortran, it is an integer array of size MPI_STATUS_SIZE (ex. stat(MPI_SOURCE), stat(MPI_TAG)). Additionally, the actual number of bytes received is obtainable from Status via the MPI_Get_count routine. Request  Used by non-blocking send and receive operations. Since non-blocking operations may return before the requested system buffer space is obtained, the system issues a unique "request number". The programmer uses this system assigned "handle" later (in a WAIT type routine) to determine completion of the non-blocking operation. In C, this argument is a pointer to the predefined structure MPI_Request. In Fortran, it is an integer.
  • 125. Example : Blocking Message Passing Routine (1/2) #include "mpi.h" #include <stdio.h> int main(argc,argv) int argc; char *argv[]; { int numtasks, rank, dest, source, rc, count, tag=1; char inmsg, outmsg='x'; MPI_Status Stat; MPI_Init(&argc,&argv); MPI_Comm_size(MPI_COMM_WORLD, &numtasks); MPI_Comm_rank(MPI_COMM_WORLD, &rank); if (rank == 0) { dest = 1; source = 1; rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD); rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat); } else if (rank == 1) { dest = 0; source = 0; rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat); rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD); } 125
  • 126. Example : Blocking Message Passing Routine (2/2) rc = MPI_Get_count(&Stat, MPI_CHAR, &count); printf("Task %d: Received %d char(s) from task %d with tag %d \n", rank, count, Stat.MPI_SOURCE, Stat.MPI_TAG); MPI_Finalize(); return 0; }
  • 127. Example : Dead Lock #include "mpi.h" #include <stdio.h> int main(argc,argv) int argc; char *argv[]; { int numtasks, rank, dest, source, rc, count, tag=1; char inmsg, outmsg='x'; MPI_Status Stat; MPI_Init(&argc,&argv); MPI_Comm_size(MPI_COMM_WORLD, &numtasks); MPI_Comm_rank(MPI_COMM_WORLD, &rank); if (rank == 0) { dest = 1; source = 1; rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD); rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat); } else if (rank == 1) { dest = 0; source = 0; rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD); rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat); } 127
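위 예제는 두 rank가 모두 MPI_Send를 먼저 호출해 서로를 기다리다 교착에 빠질 수 있는 코드이다. 한 가지 해결 방법을 스케치하면, 송신과 수신을 MPI_Sendrecv 한 번으로 묶어 호출 순서 문제 자체를 없애는 것이다. 변수 구성은 위 예제를 따른 가정이다.

    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char *argv[])
    {
        int numtasks, rank, peer, tag = 1;
        char inmsg, outmsg = 'x';
        MPI_Status Stat;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank < 2) {                       /* rank 0과 1만 교환에 참여 */
            peer = (rank == 0) ? 1 : 0;
            /* 송신과 수신을 한 호출로 묶으면 호출 순서로 인한 교착이 생기지 않는다 */
            MPI_Sendrecv(&outmsg, 1, MPI_CHAR, peer, tag,
                         &inmsg,  1, MPI_CHAR, peer, tag,
                         MPI_COMM_WORLD, &Stat);
            printf("Task %d received '%c' from task %d\n", rank, inmsg, peer);
        }

        MPI_Finalize();
        return 0;
    }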
  • 128. Example : Non-Blocking Message Passing Routine (1/2) Nearest neighbor exchange in a ring topology #include "mpi.h" #include <stdio.h> int main(argc,argv) int argc; char *argv[]; { int numtasks, rank, next, prev, buf[2], tag1=1, tag2=2; MPI_Request reqs[4]; MPI_Status stats[2]; MPI_Init(&argc,&argv); MPI_Comm_size(MPI_COMM_WORLD, &numtasks); MPI_Comm_rank(MPI_COMM_WORLD, &rank); prev = rank-1; next = rank+1; if (rank == 0) prev = numtasks - 1; if (rank == (numtasks - 1)) next = 0; 128
  • 129. Example : Non-Blocking Message Passing Routine (2/2) MPI_Irecv(&buf[0], 1, MPI_INT, prev, tag1, MPI_COMM_WORLD, &reqs[0]); MPI_Irecv(&buf[1], 1, MPI_INT, next, tag2, MPI_COMM_WORLD, &reqs[1]); MPI_Isend(&rank, 1, MPI_INT, prev, tag2, MPI_COMM_WORLD, &reqs[2]); MPI_Isend(&rank, 1, MPI_INT, next, tag1, MPI_COMM_WORLD, &reqs[3]); { do some work } MPI_Waitall(4, reqs, stats); MPI_Finalize(); return 0; } 129
  • 130. Advanced Example : Monte-Carlo Simulation <Problem>  Monte Carlo simulation  Random number use  PI = 4 ⅹ Ac/As (Ac : 원의 면적, As : 원을 감싸는 정사각형의 면적) <Requirement>  Use N processes (ranks)  Use P2P (point-to-point) communication [그림: 반지름 r인 원과 이를 감싸는 정사각형]
  • 131. Advanced Example : Monte-Carlo Simulation for PI #include <stdio.h> #include <stdlib.h> #include <math.h> int main() { const long num_step=100000000; long i, cnt; double pi, x, y, r; printf(“-----------------------------------------------------------n”); pi = 0.0; cnt = 0; r = 0.0; for (i=0; i<num_step; i++) { x = rand() / (RAND_MAX+1.0); y = rand() / (RAND_MAX+1.0); r = sqrt(x*x + y*y); if (r<=1) cnt += 1; } pi = 4.0 * (double)(cnt) / (double)(num_step); printf(“PI = %17.15lf (Error = %e)n”, pi, fabs(acos(-1.0)-pi)); printf(“-----------------------------------------------------------n”); return 0; } 131
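위 순차 코드를 슬라이드 130의 요구사항대로 점대점 통신만 사용해 병렬화한다면 대략 아래와 같은 형태가 될 수 있다. 하나의 가능한 답안 스케치일 뿐이며, rank별 난수 시드 처리는 단순화한 가정이다.

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>
    #include "mpi.h"

    int main(int argc, char *argv[])
    {
        const long num_step = 100000000;
        long i, cnt = 0, tmp, total;
        double pi, x, y;
        int rank, nprocs, p;
        MPI_Status stat;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        srand(rank + 1);                          /* rank마다 다른 시드(단순화한 처리) */

        /* 각 rank가 전체 시도 횟수의 1/nprocs 만큼만 수행 */
        for (i = rank; i < num_step; i += nprocs) {
            x = rand() / (RAND_MAX + 1.0);
            y = rand() / (RAND_MAX + 1.0);
            if (sqrt(x * x + y * y) <= 1.0) cnt++;
        }

        if (rank == 0) {                          /* 점대점 통신으로 부분 결과 수집 */
            total = cnt;
            for (p = 1; p < nprocs; p++) {
                MPI_Recv(&tmp, 1, MPI_LONG, p, 0, MPI_COMM_WORLD, &stat);
                total += tmp;
            }
            pi = 4.0 * (double)total / (double)num_step;
            printf("PI = %17.15lf (Error = %e)\n", pi, fabs(acos(-1.0) - pi));
        } else {
            MPI_Send(&cnt, 1, MPI_LONG, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }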
  • 132. Advanced Example : Numerical integration for PI <Problem>  Get PI using numerical integration : ∫(0→1) 4.0/(1+x^2) dx = PI  중점을 x_i = (i-0.5)/n 로 잡아 근사하면 PI ≈ Σ(i=1..n) [ 4 / (1 + ((i-0.5)/n)^2) ] ⅹ (1/n) <Requirement>  Point to point communication
  • 133. Advanced Example : Numerical integration for PI #include <stdio.h> #include <math.h> int main() { const long num_step=100000000; long i; double sum, step, pi, x; step = (1.0/(double)num_step); sum=0.0; printf(“-----------------------------------------------------------n”); for (i=0; i<num_step; i++) { x = ((double)i - 0.5) * step; sum += 4.0/(1.0+x*x); } pi = step * sum; printf(“PI = %5lf (Error = %e)n”, pi, fabs(acos(-1.0)-pi)); printf(“-----------------------------------------------------------n”); return 0; } 133
  • 134. Type of Collective Operations Synchronization  processes wait until all members of the group have reached the synchronization point. Data Movement  broadcast, scatter/gather, all to all. Collective Computation (reductions)  134 one member of the group collects data from the other members and performs an operati on (min, max, add, multiply, etc.) on that data.
  • 135. Programming Considerations and Restrictions With MPI-3, collective operations can be blocking or non-blocking. Only blocking operations are covered in this tutorial. Collective communication routines do not take message tag arguments. Collective operations within subsets of processes are accomplished by first partitioning the subsets into new groups and then attaching the new groups to new communicators. Can only be used with MPI predefined datatypes – not with MPI Derived Data Types. MPI-2 extended most collective operations to allow data movement between intercommunicators (not covered here).
  • 136. Collective Communication Routines MPI_Barrier  Synchronization operation. Creates a barrier synchronization in a group. Each task, when reaching the MPI_Barrier call, blocks until all tasks in the group reach the same MPI_Barrier call. Then all tasks are free to proceed. C MPI_Barrier(comm) 136 Fortran MPI_BARRIER(comm, ierr)
  • 137. Collective Communication Routines MPI_Bcast  Data movement operation. Broadcasts (sends) a message from the process with rank "root" to all other processes in the group. C MPI_Bcast(&buffer, count, datatype, root, comm)  Fortran MPI_BCAST(buffer,count,datatype,root,comm,ierr)
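예를 들어 rank 0이 가진 파라미터 하나를 모든 프로세스에 뿌리는 전형적인 MPI_Bcast 사용을 스케치하면 다음과 같다. 파라미터 값은 설명을 위한 예시이다.

    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char *argv[])
    {
        int rank;
        double param = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0)
            param = 3.14;                 /* root(rank 0)만 값을 가지고 시작 */

        /* 호출 후에는 모든 rank의 param이 같은 값이 된다 */
        MPI_Bcast(&param, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        printf("rank %d : param = %f\n", rank, param);

        MPI_Finalize();
        return 0;
    }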
  • 138. Collective Communication Routines MPI_Scatter  Data movement operation. Distributes distinct messages from a single source task to each task in the group. C MPI_Scatter(&sendbuf,sendcnt,sendtype,&recvbuf,recvcnt,recvtype,root,comm)  Fortran MPI_SCATTER(sendbuf,sendcnt,sendtype,recvbuf,recvcnt,recvtype,root,comm,ierr)
  • 139. Collective Communication Routines MPI_Gather  Data movement operation. Gathers distinct messages from each task in the group to a single destination task. This routine is the reverse operation of MPI_Scatter. C MPI_Gather(&sendbuf,sendcnt,sendtype,&recvbuf,recvcount,recvtype,root,comm)  Fortran MPI_GATHER(sendbuf,sendcnt,sendtype,recvbuf,recvcount,recvtype,root,comm,ierr)
  • 140. Collective Communication Routines MPI_Allgather  Data movement operation. Concatenation of data to all tasks in a group. Each task in the group, in effect, performs a one-to-all broadcasting operation within the group. C MPI_Allgather(&sendbuf,sendcount,sendtype,&recvbuf,recvcount,recvtype,comm)  Fortran MPI_ALLGATHER(sendbuf,sendcount,sendtype,recvbuf,recvcount,recvtype,comm,info)
  • 141. Collective Communication Routines MPI_Reduce  Collective computation operation. Applies a reduction operation on all tasks in the group and places the result in one task. C MPI_Reduce (&sendbuf,&recvbuf,count,datatype, op,root,comm) 141 Fortran MPI_REDUCE (sendbuf,recvbuf,count,datatype,op, root,comm,ierr)
  • 142. Collective Communication Routines The predefined MPI reduction operations appear below. Users can also define their own reduction functions by using the MPI_Op_create routine. MPI Reduction Operation C Data Types MPI_MAX maximum integer, float MPI_MIN minimum integer, float MPI_SUM sum integer, float MPI_PROD product integer, float MPI_LAND logical AND integer MPI_BAND bit-wise AND integer, MPI_BYTE MPI_LOR logical OR integer MPI_BOR bit-wise OR integer, MPI_BYTE MPI_LXOR logical XOR integer MPI_BXOR bit-wise XOR integer, MPI_BYTE MPI_MAXLOC max value and location float, double and long double MPI_MINLOC min value and location float, double and long double 142
  • 143. Collective Communication Routines MPI_Allreduce  Collective computation operation + data movement. Applies a reduction operation and places the result in all tasks in the group. This is equivalent to an MPI_Reduce followed by an MPI_Bcast. C MPI_Allreduce(&sendbuf,&recvbuf,count,datatype,op,comm)  Fortran MPI_ALLREDUCE(sendbuf,recvbuf,count,datatype,op,comm,ierr)
  • 144. Collective Communication Routines MPI_Reduce_scatter  Collective computation operation + data movement. First does an element-wise reduction on a vector across all tasks in the group. Next, the result vector is split into disjoint segments and distributed across the tasks. This is equivalent to an MPI_Reduce followed by an MPI_Scatter operation. C MPI_Reduce_scatter(&sendbuf,&recvbuf,recvcount,datatype,op,comm)  Fortran MPI_REDUCE_SCATTER(sendbuf,recvbuf,recvcount,datatype,op,comm,ierr)
  • 145. Collective Communication Routines MPI_Alltoall  Data movement operation. Each task in a group performs a scatter operation, sending a distinct message to all the tasks in the group in order by index. C MPI_Alltoall(&sendbuf,sendcount,sendtype,&recvbuf,recvcnt,recvtype,comm)  Fortran MPI_ALLTOALL(sendbuf,sendcount,sendtype,recvbuf,recvcnt,recvtype,comm,ierr)
  • 146. Collective Communication Routines MPI_Scan  Performs a scan operation with respect to a reduction operation across a task group. C MPI_Scan (&sendbuf,&recvbuf,count,datatype, op,comm) 146 Fortran MPI_SCAN (sendbuf,recvbuf,count,datatype,op, comm,ierr)
  • 147. Collective Communication Routines [그림: broadcast, scatter/gather, allgather, alltoall, reduce, allreduce, reduce_scatter, scan 각각에서 P0~P3 사이의 데이터 이동/환산 패턴 (* : some operator)]
  • 148. Example : Collective Communication (1/2) Perform a scatter operation on the rows of an array #include "mpi.h" #include <stdio.h> #define SIZE 4 int main(argc,argv) int argc; char *argv[]; { int numtasks, rank, sendcount, recvcount, source; float sendbuf[SIZE][SIZE] = { {1.0, 2.0, 3.0, 4.0}, {5.0, 6.0, 7.0, 8.0}, {9.0, 10.0, 11.0, 12.0}, {13.0, 14.0, 15.0, 16.0} }; float recvbuf[SIZE]; MPI_Init(&argc,&argv); MPI_Comm_rank(MPI_COMM_WORLD, &rank); MPI_Comm_size(MPI_COMM_WORLD, &numtasks); 148
  • 149. Example : Collective Communication (2/2) if (numtasks == SIZE) { source = 1; sendcount = SIZE; recvcount = SIZE; MPI_Scatter(sendbuf,sendcount,MPI_FLOAT,recvbuf,recvcount, MPI_FLOAT,source,MPI_COMM_WORLD); printf("rank= %d Results: %f %f %f %fn",rank,recvbuf[0], recvbuf[1],recvbuf[2],recvbuf[3]); } else printf("Must specify %d processors. Terminating.n",SIZE); MPI_Finalize(); return 0; } 149
  • 150. Advanced Example : Monte-Carlo Simulation for PI Use the collective communication routines! #include <stdio.h> #include <stdlib.h> #include <math.h> int main() { const long num_step=100000000; long i, cnt; double pi, x, y, r; printf(“-----------------------------------------------------------n”); pi = 0.0; cnt = 0; r = 0.0; for (i=0; i<num_step; i++) { x = rand() / (RAND_MAX+1.0); y = rand() / (RAND_MAX+1.0); r = sqrt(x*x + y*y); if (r<=1) cnt += 1; } pi = 4.0 * (double)(cnt) / (double)(num_step); printf(“PI = %17.15lf (Error = %e)n”, pi, fabs(acos(-1.0)-pi)); printf(“-----------------------------------------------------------n”); return 0; } 150
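슬라이드의 요구대로 집합통신을 사용하면, 앞의 점대점 버전에서 수신 루프를 MPI_Reduce 호출 하나로 바꾸는 정도가 된다. 하나의 답안 스케치이며, 시드 처리는 단순화한 가정이다.

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>
    #include "mpi.h"

    int main(int argc, char *argv[])
    {
        const long num_step = 100000000;
        long i, cnt = 0, total = 0;
        double pi, x, y;
        int rank, nprocs;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        srand(rank + 1);                              /* rank마다 다른 시드(단순화) */

        for (i = rank; i < num_step; i += nprocs) {   /* 시도 횟수를 rank끼리 분담 */
            x = rand() / (RAND_MAX + 1.0);
            y = rand() / (RAND_MAX + 1.0);
            if (sqrt(x * x + y * y) <= 1.0) cnt++;
        }

        /* 모든 rank의 cnt를 합산해 rank 0의 total에 모은다 */
        MPI_Reduce(&cnt, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0) {
            pi = 4.0 * (double)total / (double)num_step;
            printf("PI = %17.15lf (Error = %e)\n", pi, fabs(acos(-1.0) - pi));
        }

        MPI_Finalize();
        return 0;
    }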
  • 151. Advanced Example : Numerical integration for PI Use the collective communication routines! #include <stdio.h> #include <math.h> int main() { const long num_step=100000000; long i; double sum, step, pi, x; step = (1.0/(double)num_step); sum=0.0; printf(“-----------------------------------------------------------n”); for (i=0; i<num_step; i++) { x = ((double)i - 0.5) * step; sum += 4.0/(1.0+x*x); } pi = step * sum; printf(“PI = %5lf (Error = %e)n”, pi, fabs(acos(-1.0)-pi)); printf(“-----------------------------------------------------------n”); return 0; } 151
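마찬가지로 수치적분 예제를 MPI_Reduce로 병렬화한 답안 스케치이다. 부분합만 모으면 되므로 구조가 더 단순하다.

    #include <stdio.h>
    #include <math.h>
    #include "mpi.h"

    int main(int argc, char *argv[])
    {
        const long num_step = 100000000;
        long i;
        double sum = 0.0, total = 0.0, step, pi, x;
        int rank, nprocs;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        step = 1.0 / (double)num_step;

        for (i = rank; i < num_step; i += nprocs) {   /* 구간을 rank끼리 나눠 부분합 계산 */
            x = ((double)i + 0.5) * step;             /* i가 0부터 시작하므로 중점은 (i+0.5)*step */
            sum += 4.0 / (1.0 + x * x);
        }

        /* 부분합을 rank 0으로 환산 */
        MPI_Reduce(&sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0) {
            pi = step * total;
            printf("PI = %17.15lf (Error = %e)\n", pi, fabs(acos(-1.0) - pi));
        }

        MPI_Finalize();
        return 0;
    }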