---
layout: page
title: Stanford CME 213/ME 339 Spring 2021 homepage
---

Introduction to parallel computing using MPI, OpenMP, and CUDA

This is the website for CME 213, Introduction to Parallel Computing using MPI, OpenMP, and CUDA. This material was created by Eric Darve with the help of course staff and students.

Syllabus

Syllabus

Policy for late assignments

Extensions can be requested in advance for exceptional circumstances (e.g., travel, sickness, injury, COVID-related issues) and for OAE-approved accommodations.

Submissions that are late by at most two days (up to 48 hours after the deadline) will be accepted with a 10% penalty. No submissions will be accepted more than two days after the deadline.

See Gradescope for all the current assignments and their due dates. Post on Slack if you cannot access the Gradescope class page. The 6-letter code to join the class is given on Canvas.

Datasheet on the Quadro RTX 6000

Final Project

Final project instructions and starter code:

Slides and videos explaining the final project:

  • Overview of the final project; [Slides](Lecture Slides/Lecture_14.pdf)
  • 33 Final Project 1, Overview; Video
  • 34 Final Project 2, Regularization; Video
  • 35 Final Project 3, CUDA GEMM and MPI; Video

See also the Module 8 videos on MPI.

Class modules and learning material

Introduction to the class

CME 213 First Live Lecture; Video, [Slides](Lecture Slides/Lecture_01.pdf)

C++ tutorial

  • [Tutorial slides](Lecture Slides/cpp tutorial/Tutorial_01.pdf)
  • [Tutorial code](Lecture Slides/cpp tutorial/code.zip)

Module 1 Introduction to Parallel Computing

  • [Slides](Lecture Slides/Lecture_02.pdf)
  • 01 Homework 1; Video
  • 02 Why Parallel Computing; Video
  • 03 Top 500; Video
  • 04 Example of Parallel Computation; Video
  • 05 Shared memory processor; Video
  • [Reading assignment 1](Reading Assignments/Introduction_Parallel_Computing)
  • Homework 1; starter code

Module 2 Shared Memory Parallel Programming

  • C++ threads; [Slides](Lecture Slides/Lecture_03.pdf); Code
  • Introduction to OpenMP; [Slides](Lecture Slides/Lecture_04.pdf); Code
  • 06 C++ threads; Video
  • 07 Promise and future; Video
  • 08 mutex; Video
  • 09 Introduction to OpenMP; Video
  • 10 OpenMP Hello World; Video
  • 11 OpenMP for loop; Video
  • 12 OpenMP clause; Video
  • [Reading assignment 2](Reading Assignments/OpenMP)

Module 3 Shared Memory Parallel Programming, OpenMP, advanced OpenMP

  • OpenMP, for loops, advanced OpenMP; [Slides](Lecture Slides/Lecture_05.pdf); Code
  • OpenMP, sorting algorithms; [Slides](Lecture Slides/Lecture_06.pdf); Code
  • 13 OpenMP tasks; Video
  • 14 OpenMP depend; Video
  • 15 OpenMP synchronization; Video
  • 16 Sorting algorithms Quicksort Mergesort; Video
  • 17 Sorting Algorithms Bitonic Sort; Video
  • 18 Bitonic Sort Exercise; Video
  • [Reading assignment 3](Reading Assignments/OpenMP_advanced)
  • Homework 2; starter code; radix sort tutorial

Module 4 Introduction to CUDA programming

  • Introduction to GPU computing; [Slides](Lecture Slides/Lecture_07.pdf)
  • Introduction to CUDA and nvcc; [Slides](Lecture Slides/Lecture_08.pdf); Code
  • 19 GPU computing introduction; Video
  • 20 Graphics Processing Units; Video
  • 21 Introduction to GPU programming; Video
  • 22 icme-gpu; Video
  • 23 a First CUDA program; Video
  • 23 b First CUDA program part 2; Video
  • 24 nvcc CUDA compiler; Video
  • [Reading assignment 4](Reading Assignments/CUDA_intro)
  • Homework 3; starter code
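For readers who want a concrete picture of a "first CUDA program" like the one discussed in videos 23a/b, here is a generic vector-addition sketch (an assumption about the content, not the course's actual example):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread adds one element of the two input vectors.
__global__ void add(const float* x, const float* y, float* z, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) z[i] = x[i] + y[i];  // guard against out-of-range threads
}

int main() {
    const int n = 1 << 20;
    float *x, *y, *z;
    // Unified memory keeps the example short: these pointers are
    // usable on both the host and the device.
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    cudaMallocManaged(&z, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // round up to cover all n
    add<<<blocks, threads>>>(x, y, z, n);
    cudaDeviceSynchronize();  // kernel launches are asynchronous

    printf("z[0] = %f\n", z[0]);
    cudaFree(x); cudaFree(y); cudaFree(z);
    return 0;
}
```

Compile with `nvcc` and run on a machine with an NVIDIA GPU (e.g., icme-gpu).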

Module 5 Code performance on NVIDIA GPUs

  • GPU memory and matrix transpose; [Slides](Lecture Slides/Lecture_09.pdf); Code
  • CUDA occupancy, branching, homework 4; [Slides](Lecture Slides/Lecture_10.pdf)
  • 25 GPU memory; Video
  • 26 Matrix transpose; Video
  • 27 Latency, concurrency, and occupancy; Video
  • 28 CUDA branching; Video
  • 29 Homework 4; Video
  • [Reading assignment 5](Reading Assignments/GPU_performance)
  • Homework 4; starter code
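The matrix-transpose lecture in this module concerns memory coalescing; the standard technique (sketched here from general CUDA practice, not copied from the lecture code) stages a tile in shared memory so that both the read and the write are coalesced:

```cuda
#define TILE 32

// Tiled matrix transpose for an n x n matrix. Threads in a block
// cooperatively load a TILE x TILE tile into shared memory, then
// write it back transposed. The +1 column of padding avoids
// shared-memory bank conflicts on the strided access.
__global__ void transpose(float* out, const float* in, int n) {
    __shared__ float tile[TILE][TILE + 1];
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];  // coalesced read
    __syncthreads();
    // Swap the block indices so the write is also coalesced.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < n && y < n)
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}
```

Launch with a 2D grid of `TILE` x `TILE` thread blocks covering the matrix.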

Module 6 NVIDIA guest lectures, OpenACC, CUDA optimization

  • 30 NVIDIA guest lecture, OpenACC; Video; [Slides](Lecture Slides/CME213_2021_OpenACC.pdf)
  • 31 NVIDIA guest lecture, CUDA optimization; Video; [Slides](Lecture Slides/CME213_2021_Optimization.pdf)
  • [Reading assignment 6](Reading Assignments/NVIDIA_openACC_optimization)

Module 7 NVIDIA guest lectures, CUDA profiling

  • 32 NVIDIA guest lecture, CUDA profiling; Video; [Slides](Lecture Slides/CME213_2021_CUDA_Profiling.pdf)
  • [Reading assignment 7](Reading Assignments/NVIDIA_CUDA_profiling)

Module 8 Group activity and introduction to MPI

The slides and videos below are needed for the final project.

  • Introduction to MPI; [Slides](Lecture Slides/Lecture_16.pdf); Code
  • 37 MPI Introduction; Video
  • 38 MPI Hello World; Video
  • 39 MPI Send Recv; Video
  • 40 MPI Collective Communications; Video
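Since this module is what the final project relies on, here is a minimal sketch of the "MPI Hello World" and Send/Recv patterns covered in videos 37-39 (an illustrative example, not the lecture code):

```cpp
#include <cstdio>
#include <mpi.h>

// Every rank reports its id; then rank 0 sends an integer token to
// rank 1 using the basic blocking point-to-point calls.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from rank %d of %d\n", rank, size);

    if (size >= 2) {
        int token = 42;
        if (rank == 0) {
            MPI_Send(&token, 1, MPI_INT, /*dest=*/1, /*tag=*/0,
                     MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&token, 1, MPI_INT, /*source=*/0, /*tag=*/0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("Rank 1 received %d\n", token);
        }
    }
    MPI_Finalize();
    return 0;
}
```

Compile with `mpicxx` and run with, e.g., `mpirun -np 2 ./a.out`.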

Material for the May 17 group activity:

Module 9 Advanced MPI

  • MPI Advanced Send and Recv; [Slides](Lecture Slides/Lecture_17.pdf); Code
  • 41 MPI Process Mapping; Video
  • 42 MPI Buffering; Video
  • 43 MPI Send Recv Deadlocks; Video
  • 44 MPI Non-blocking; Video
  • 45 MPI Send Modes; Video
  • Parallel efficiency and MPI communicators; [Slides](Lecture Slides/Lecture_18.pdf); Code
  • 46 MPI Matrix-vector product 1D schemes; Video
  • 47 MPI Matrix vector product 2D scheme; Video
  • 48 Parallel Speed-up; Video
  • 49 Isoefficiency; Video
  • 50 MPI Communicators; Video
  • [Reading assignment 8](Reading Assignments/MPI)
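The deadlock and non-blocking-communication videos above (43 and 44) fit together; the following sketch (generic MPI practice, not the lecture code) shows the standard fix, posting non-blocking receives and sends before waiting:

```cpp
#include <cstdio>
#include <mpi.h>

// Pairwise exchange with non-blocking calls. Posting both the
// receive and the send before waiting avoids the deadlock that two
// matching blocking MPI_Send calls can cause when the messages are
// too large for MPI's internal (eager) buffering.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int partner = rank ^ 1;  // pair up ranks: 0<->1, 2<->3, ...
    if (partner < size) {
        int sendbuf = rank, recvbuf = -1;
        MPI_Request reqs[2];
        MPI_Irecv(&recvbuf, 1, MPI_INT, partner, 0, MPI_COMM_WORLD,
                  &reqs[0]);
        MPI_Isend(&sendbuf, 1, MPI_INT, partner, 0, MPI_COMM_WORLD,
                  &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        printf("Rank %d got %d from rank %d\n", rank, recvbuf, partner);
    }
    MPI_Finalize();
    return 0;
}
```

Run with an even number of ranks so every rank has a partner.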

Module 10 SLAC guest lecture, Task-based parallel programming

  • Parallel Programming Models by Elliott Slaughter; [Slides](Lecture Slides/CME213_2021_Legion.pdf); Video

Reading and links

Lawrence Livermore National Lab Resources

C++ threads

OpenMP

CUDA

MPI

Open MPI hwloc documentation

Task-based parallel languages and APIs

Sorting algorithms