---
layout: page
title: Stanford CME 213/ME 339 Spring 2021 homepage
---

Introduction to parallel computing using MPI, OpenMP, and CUDA

This is the website for CME 213, Introduction to Parallel Computing using MPI, OpenMP, and CUDA. This material was created by Eric Darve with the help of course staff and students.

Syllabus

Syllabus

Policy for late assignments

Extensions can be requested in advance for exceptional circumstances (e.g., travel, sickness, injury, COVID-related issues) and for OAE-approved accommodations.

Submissions that are late by at most two days (up to 48 hours after the deadline) will be accepted with a 10% penalty. No submissions will be accepted more than two days after the deadline.

See Gradescope for all the current assignments and their due dates. Post on Slack if you cannot access the Gradescope class page. The 6-letter code to join the class is given on Canvas.

Datasheet on the Quadro RTX 6000

Final Project

Final project instructions and starter code:

Slides and videos explaining the final project:

  • Overview of the final project; [Slides](Lecture Slides/Lecture_14.pdf)
  • 33 Final Project 1, Overview; Video
  • 34 Final Project 2, Regularization; Video
  • 35 Final Project 3, CUDA GEMM and MPI; Video

See also the Module 8 videos on MPI.

Class modules and learning material

Introduction to the class

CME 213 First Live Lecture; Video, [Slides](Lecture Slides/Lecture_01.pdf)

C++ tutorial

  • [Tutorial slides](Lecture Slides/cpp tutorial/Tutorial_01.pdf)
  • [Tutorial code](Lecture Slides/cpp tutorial/code.zip)

Module 1 Introduction to Parallel Computing

  • [Slides](Lecture Slides/Lecture_02.pdf)
  • 01 Homework 1; Video
  • 02 Why Parallel Computing; Video
  • 03 Top 500; Video
  • 04 Example of Parallel Computation; Video
  • 05 Shared memory processor; Video
  • [Reading assignment 1](Reading Assignments/Introduction_Parallel_Computing)
  • Homework 1; starter code

Module 2 Shared Memory Parallel Programming

  • C++ threads; [Slides](Lecture Slides/Lecture_03.pdf); Code
  • Introduction to OpenMP; [Slides](Lecture Slides/Lecture_04.pdf); Code
  • 06 C++ threads; Video
  • 07 Promise and future; Video
  • 08 mutex; Video
  • 09 Introduction to OpenMP; Video
  • 10 OpenMP Hello World; Video
  • 11 OpenMP for loop; Video
  • 12 OpenMP clause; Video
  • [Reading assignment 2](Reading Assignments/OpenMP)

Module 3 Shared Memory Parallel Programming, OpenMP, advanced OpenMP

  • OpenMP, for loops, advanced OpenMP; [Slides](Lecture Slides/Lecture_05.pdf); Code
  • OpenMP, sorting algorithms; [Slides](Lecture Slides/Lecture_06.pdf); Code
  • 13 OpenMP tasks; Video
  • 14 OpenMP depend; Video
  • 15 OpenMP synchronization; Video
  • 16 Sorting algorithms Quicksort Mergesort; Video
  • 17 Sorting Algorithms Bitonic Sort; Video
  • 18 Bitonic Sort Exercise; Video
  • [Reading assignment 3](Reading Assignments/OpenMP_advanced)
  • Homework 2; starter code; radix sort tutorial

Module 4 Introduction to CUDA programming

  • Introduction to GPU computing; [Slides](Lecture Slides/Lecture_07.pdf)
  • Introduction to CUDA and nvcc; [Slides](Lecture Slides/Lecture_08.pdf); Code
  • 19 GPU computing introduction; Video
  • 20 Graphics Processing Units; Video
  • 21 Introduction to GPU programming; Video
  • 22 icme-gpu; Video
  • 23 a First CUDA program; Video
  • 23 b First CUDA program part 2; Video
  • 24 nvcc CUDA compiler; Video
  • [Reading assignment 4](Reading Assignments/CUDA_intro)
  • Homework 3; starter code
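For readers who want a concrete picture of a "first CUDA program" like the one discussed in videos 23a/b, here is a generic vector-addition sketch (an assumption about the content, not the course's actual example):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread adds one element of the two input vectors.
__global__ void add(const float* x, const float* y, float* z, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) z[i] = x[i] + y[i];  // guard against out-of-range threads
}

int main() {
    const int n = 1 << 20;
    float *x, *y, *z;
    // Unified memory keeps the example short: these pointers are
    // usable on both the host and the device.
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    cudaMallocManaged(&z, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // round up to cover all n
    add<<<blocks, threads>>>(x, y, z, n);
    cudaDeviceSynchronize();  // kernel launches are asynchronous

    printf("z[0] = %f\n", z[0]);
    cudaFree(x); cudaFree(y); cudaFree(z);
    return 0;
}
```

Compile with `nvcc` and run on a machine with an NVIDIA GPU (e.g., icme-gpu).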

Module 5 Code performance on NVIDIA GPUs

  • GPU memory and matrix transpose; [Slides](Lecture Slides/Lecture_09.pdf); Code
  • CUDA occupancy, branching, homework 4; [Slides](Lecture Slides/Lecture_10.pdf)
  • 25 GPU memory; Video
  • 26 Matrix transpose; Video
  • 27 Latency, concurrency, and occupancy; Video
  • 28 CUDA branching; Video
  • 29 Homework 4; Video
  • [Reading assignment 5](Reading Assignments/GPU_performance)
  • Homework 4; starter code
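The matrix-transpose lecture in this module concerns memory coalescing; the standard technique (sketched here from general CUDA practice, not copied from the lecture code) stages a tile in shared memory so that both the read and the write are coalesced:

```cuda
#define TILE 32

// Tiled matrix transpose for an n x n matrix. Threads in a block
// cooperatively load a TILE x TILE tile into shared memory, then
// write it back transposed. The +1 column of padding avoids
// shared-memory bank conflicts on the strided access.
__global__ void transpose(float* out, const float* in, int n) {
    __shared__ float tile[TILE][TILE + 1];
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];  // coalesced read
    __syncthreads();
    // Swap the block indices so the write is also coalesced.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < n && y < n)
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}
```

Launch with a 2D grid of `TILE` x `TILE` thread blocks covering the matrix.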

Module 6 NVIDIA guest lectures, OpenACC, CUDA optimization

  • 30 NVIDIA guest lecture, OpenACC; Video; [Slides](Lecture Slides/CME213_2021_OpenACC.pdf)
  • 31 NVIDIA guest lecture, CUDA optimization; Video; [Slides](Lecture Slides/CME213_2021_Optimization.pdf)
  • [Reading assignment 6](Reading Assignments/NVIDIA_openACC_optimization)

Module 7 NVIDIA guest lectures, CUDA profiling

  • 32 NVIDIA guest lecture, CUDA profiling; Video; [Slides](Lecture Slides/CME213_2021_CUDA_Profiling.pdf)
  • [Reading assignment 7](Reading Assignments/NVIDIA_CUDA_profiling)

Module 8 Group activity and introduction to MPI

The slides and videos below are needed for the final project.

  • Introduction to MPI; [Slides](Lecture Slides/Lecture_16.pdf); Code
  • 37 MPI Introduction; Video
  • 38 MPI Hello World; Video
  • 39 MPI Send Recv; Video
  • 40 MPI Collective Communications; Video
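Since this module is what the final project relies on, here is a minimal sketch of the "MPI Hello World" and Send/Recv patterns covered in videos 37-39 (an illustrative example, not the lecture code):

```cpp
#include <cstdio>
#include <mpi.h>

// Every rank reports its id; then rank 0 sends an integer token to
// rank 1 using the basic blocking point-to-point calls.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from rank %d of %d\n", rank, size);

    if (size >= 2) {
        int token = 42;
        if (rank == 0) {
            MPI_Send(&token, 1, MPI_INT, /*dest=*/1, /*tag=*/0,
                     MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&token, 1, MPI_INT, /*source=*/0, /*tag=*/0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("Rank 1 received %d\n", token);
        }
    }
    MPI_Finalize();
    return 0;
}
```

Compile with `mpicxx` and run with, e.g., `mpirun -np 2 ./a.out`.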

Material for the May 17 group activity:

Module 9 Advanced MPI

  • MPI Advanced Send and Recv; [Slides](Lecture Slides/Lecture_17.pdf); Code
  • 41 MPI Process Mapping; Video
  • 42 MPI Buffering; Video
  • 43 MPI Send Recv Deadlocks; Video
  • 44 MPI Non-blocking; Video
  • 45 MPI Send Modes; Video
  • Parallel efficiency and MPI communicators; [Slides](Lecture Slides/Lecture_18.pdf); Code
  • 46 MPI Matrix-vector product 1D schemes; Video
  • 47 MPI Matrix vector product 2D scheme; Video
  • 48 Parallel Speed-up; Video
  • 49 Isoefficiency; Video
  • 50 MPI Communicators; Video
  • [Reading assignment 8](Reading Assignments/MPI)
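The deadlock and non-blocking-communication videos above (43 and 44) fit together; the following sketch (generic MPI practice, not the lecture code) shows the standard fix, posting non-blocking receives and sends before waiting:

```cpp
#include <cstdio>
#include <mpi.h>

// Pairwise exchange with non-blocking calls. Posting both the
// receive and the send before waiting avoids the deadlock that two
// matching blocking MPI_Send calls can cause when the messages are
// too large for MPI's internal (eager) buffering.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int partner = rank ^ 1;  // pair up ranks: 0<->1, 2<->3, ...
    if (partner < size) {
        int sendbuf = rank, recvbuf = -1;
        MPI_Request reqs[2];
        MPI_Irecv(&recvbuf, 1, MPI_INT, partner, 0, MPI_COMM_WORLD,
                  &reqs[0]);
        MPI_Isend(&sendbuf, 1, MPI_INT, partner, 0, MPI_COMM_WORLD,
                  &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        printf("Rank %d got %d from rank %d\n", rank, recvbuf, partner);
    }
    MPI_Finalize();
    return 0;
}
```

Run with an even number of ranks so every rank has a partner.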

Module 10 SLAC guest lecture, Task-based parallel programming

  • Parallel Programming Models by Elliott Slaughter; [Slides](Lecture Slides/CME213_2021_Legion.pdf); Video

Reading and links

Lawrence Livermore National Lab Resources

C++ threads

OpenMP

CUDA

MPI

Open MPI hwloc documentation

Task-based parallel languages and APIs

Sorting algorithms