Textbook

Programming Massively Parallel Processors - 3rd Edition
Lecture 1
PPT(click here)
Lecture-1-cuda-introduction
Paper(click here)
1 An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems
3 Software and the concurrency revolution
4 Some computer organizations and their effectiveness
5 MCUDA: an Efficient Implementation of CUDA Kernels for Multi-Core CPUs
Related Materials(click here)
1 MPI – A Message Passing Interface Standard Version 2.2
2 Algorithms and theory of computation handbook
3 NVIDIA CUDA C Programming Guide
4 Introduction to computing systems: from bits and gates to C and beyond
5 First Draft of a Report on the EDVAC
Lecture 2
PPT(click here)
Lecture-2-kernel-multidimension
Lecture 3
PPT(click here)
Lecture-3-Memory and Data Locality
Lecture 4
PPT(click here)
Lecture-4-Performance considerations
Paper(click here)
1 Optimization principles and application performance evaluation of a multithreaded GPU using CUDA
2 Program optimization space pruning for a multithreaded GPU
Related Materials
1 CUDA C Best Practices Guide v. 4.2
2 CUDA Occupancy Calculator. Search the web for “CUDA Occupancy Calculator”.
Lecture 5
PPT(click here)
Lecture-5-histogram
Related Materials
1 Merrill, D. (2015). Using compression to improve the performance response of parallel histogram computation, NVIDIA Research Technical Report.
Lecture 6
PPT(click here)
Lecture-6-Scan
Paper(click here)
1 A regular layout for parallel adders
2 Fast scan algorithms on graphics processors
3 A study of persistent threads style GPU programming for GPGPU Workloads
4 A parallel algorithm for the efficient solution of a general class of recurrence equations
5 Single-pass parallel prefix scan with decoupled look-back
6 StreamScan: fast scan algorithms for GPUs without global barrier synchronization
Related Materials(click here)
1 Parallel prefix sum with CUDA
Lecture 7
PPT(click here)
Lecture-7-Joint CUDA-MPI Programming
Related Materials
1 Gropp, William, Lusk, Ewing, & Skjellum, Anthony (1999). Using MPI, 2nd edition: Portable parallel programming with the message passing interface. Cambridge, MA: MIT Press, Scientific and Engineering Computation Series. ISBN 978-0-262-57132-6.
Lecture 8
PPT(click here)
Lecture-8-Sparse-matrix
Paper(click here)
1 Implementing sparse matrix–vector multiplication on throughput-oriented processors
2 Methods of conjugate gradients for solving linear systems
Related Materials
1 Rice, J. R., & Boisvert, R. F. (1984). Solving Elliptic Problems Using ELLPACK. Springer-Verlag. 497 pages.
Lecture 9
PPT(click here)
Lecture-9-Parallel patterns
Paper(click here)
1 Efficient MPI implementation of a parallel, stable merge algorithm
Lecture 10
PPT(click here)
Lecture-10-Computational-Thinking
Paper
1 Rodrigues, C. I., Stone, J., Hardy, D., & Hwu, W. W. (2008). GPU acceleration of cutoff-based potential summation. In: ACM Computing Frontiers Conference 2008, Italy, May.
Related Materials(click here)
