COURSE DESCRIPTION: This course focuses on developing and optimizing application software for massively parallel graphics processing units (GPUs). Such processing units routinely come with hundreds to thousands of cores per chip and sell for a few hundred to a few thousand dollars. The massive parallelism they offer allows applications to run 2x-150x faster than on conventional multicores. However, to reach this performance improvement, the application must fully utilize the computational, storage, and communication resources provided by the device. This course discusses state-of-the-art parallel programming optimization methods to achieve this goal.

Ideally, this course will bring together people with strong programming skills and people with a strong need to solve compute-intensive problems that can benefit from programming graphics processors. The initial part of the course will cover a popular programming interface for graphics processors: CUDA, the programming toolkit for NVIDIA processors. The course will continue with a closer look at the internal architecture of graphics processors and how it impacts performance. Finally, implementations of applications and algorithms on graphics processors will be discussed.

The course is lab intensive, and it will utilize the machines at the Wilkinson Lab. Students taking the course for EECS-395 credit will work on a well-defined final mini project that utilizes advanced parallel programming, data layout, and algorithm decomposition concepts. Students taking the course for EECS-495 credit will work on a quarter-long open-ended final project that draws upon their own interests and line of research. Ideally, in their final project these students will form interdisciplinary teams and complete the first steps of optimizing a real-world compute-intensive problem in science or engineering (e.g., materials science, astrophysics, civil engineering, etc.).

COURSE COORDINATOR:  Prof. Nikos Hardavellas (Instructor)

PREREQUISITES AND RECOMMENDATIONS:

Students should have the following prerequisites, or obtain permission from the instructor:

  • Basic understanding of computer architecture (see prerequisites by topic, below)
  • EECS 213 or equivalent understanding of computer systems (see prerequisites by topic)
  • EECS 211 or EECS 230 or equivalent intermediate C programming experience
  • Useful but not required: EECS 311, EECS 361, EECS 358.

Prerequisites By Topic

  • Basic concepts of computer architecture: processors, ALUs, memories, caches, input-output
  • Basic concepts of computer systems: memory access and layout, padding, alignment
  • Intermediate C programming concepts: control flow, scoping, type casting, pointers
  • Simple data structures, such as arrays and linked lists, and their use in programs
  • Some knowledge of scientific and engineering applications

TEXTBOOKS:

Required:

  • Programming Massively Parallel Processors: A Hands-on Approach. D. Kirk and W.-M. Hwu. Morgan Kaufmann

Reference textbooks (not required):

  • CUDA by Example. J. Sanders and E. Kandrot. Pearson/Addison-Wesley
  • Patterns for Parallel Programming. T.G. Mattson, B.A. Sanders, and B.L. Massingill. Pearson/Addison-Wesley
  • Introduction to Parallel Computing. A. Grama, A. Gupta, G. Karypis, and V. Kumar. Pearson/Addison-Wesley
  • Heterogeneous Computing with OpenCL. B.R. Gaster, L. Howes, D.R. Kaeli, P. Mistry, and D. Schaa. Morgan Kaufmann

LAB ASSIGNMENTS

There will be several programming assignments, each requiring successively more sophisticated programming skills. Tentative list of labs:

  1. Matrix multiplication. The lab’s focus is on producing correct code. This lab reinforces basic GPU/CUDA programming skills, the software interface, and the basic architecture of the device (a minimal kernel sketch of this kind appears after this list).
  2. Vector reduction. This lab focuses on the application of efficient parallel algorithms that utilize shared memory and synchronization and minimize path divergence.
  3. Tiled matrix multiplication. This lab focuses on data layout and decomposition, and full utilization of shared memory resources and global bandwidth through bank conflict avoidance and memory coalescing.
  4. 2D convolution. While the previous labs involved detailed instructions and scaffolding code, this lab provides students only with a problem statement, along with optimization goals and hints. This lab’s focus is on independent thinking and reinforces concepts learned in the previous labs.
  5. Blocked, threaded parallel prefix sum. This lab follows the same format as lab 4 and reinforces the same concepts.
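
For illustration only, a minimal CUDA kernel of the kind produced in lab 1 might look like the sketch below. The square-matrix layout, the 16x16 block size, and the function names are assumptions made for the example, not lab specifications.

    // Minimal square matrix multiplication: one thread computes one element
    // of C = A * B. The width parameter and the 16x16 block size are
    // illustrative choices, not lab requirements.
    __global__ void matrixMulKernel(const float *A, const float *B, float *C,
                                    int width) {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < width && col < width) {
            float sum = 0.0f;
            for (int k = 0; k < width; ++k)
                sum += A[row * width + k] * B[k * width + col];
            C[row * width + col] = sum;
        }
    }

    // Host-side sketch: allocate device memory, copy inputs, launch, copy back.
    void matrixMul(const float *hA, const float *hB, float *hC, int width) {
        size_t bytes = (size_t)width * width * sizeof(float);
        float *dA, *dB, *dC;
        cudaMalloc(&dA, bytes);  cudaMalloc(&dB, bytes);  cudaMalloc(&dC, bytes);
        cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);
        dim3 block(16, 16);
        dim3 grid((width + block.x - 1) / block.x,
                  (width + block.y - 1) / block.y);
        matrixMulKernel<<<grid, block>>>(dA, dB, dC, width);
        cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
        cudaFree(dA);  cudaFree(dB);  cudaFree(dC);
    }

Later labs revisit this same computation and improve it with tiling, shared memory, and coalesced accesses.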

FINAL PROJECT: The project will be one of the following; project selection requires the instructor's permission.

  1. EECS-395 only: Histograms. This is a problem at the bleeding edge of the technology and can be selected by the student as the final mini project for EECS-395. To solve it, students need to consider not only parallel optimization techniques but also how to harness the statistical properties of the input data to simplify the problem (e.g., diagonalization). Students are asked to define optimization goals and a strategy, implement them, and keep a research lab journal in which they report statistics and analyze every optimization they tried, including ones that did not work or degraded performance. The students will demo their solution, submit a paper that describes their work, and present their work in a special class meeting. The students will also be required to read recent research papers that outline some of the best-known ways to solve this problem (a baseline kernel sketch appears after this list).
  2. EECS-395 or EECS-495: Open-ended research. The students are required to produce work similar to project (1), but at a more mature level, while working on a real-world research problem. This work will be a quarter-long, open-ended final project that draws upon the students’ own interests and line of research. Ideally, in their final project these students will form interdisciplinary teams and complete the first steps of optimizing a real-world, contemporary compute-intensive problem in science or engineering (e.g., materials science, astrophysics, civil engineering, etc.). The final project culminates in a demo, a paper, and a presentation. The students are expected to base their ideas on recent literature.
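
To give a flavor of the baseline techniques involved in project 1, a privatized histogram kernel might be sketched as follows. The 256-bin count, the unsigned-char input, and the kernel name are assumptions made for illustration, not project requirements.

    // Privatized histogram sketch: each block accumulates a private copy of
    // the histogram in shared memory using fast shared-memory atomics, then
    // merges it into the global histogram once per block.
    #define NUM_BINS 256

    __global__ void histogramKernel(const unsigned char *data, int n,
                                    unsigned int *histo) {
        __shared__ unsigned int privHisto[NUM_BINS];

        // Cooperatively zero the per-block private histogram.
        for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
            privHisto[b] = 0;
        __syncthreads();

        // Grid-stride loop over the input; atomics resolve intra-block races.
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = blockDim.x * gridDim.x;
        for (int i = idx; i < n; i += stride)
            atomicAdd(&privHisto[data[i]], 1u);
        __syncthreads();

        // Merge the private histogram into the global one.
        for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
            atomicAdd(&histo[b], privHisto[b]);
    }

Privatization keeps most of the atomic traffic in shared memory; the project builds on this kind of baseline with the statistical and architectural optimizations discussed in class.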

LATE ASSIGNMENTS POLICY:  20% penalty for the first 24 hours after the lab is due. No assignments will be accepted after that.

GRADING: Grades will be assigned according to the following distribution:

Lab assignments        60%
Final project          30%
Class participation    10%


COURSE OBJECTIVES:  
When a student completes this course, they should be able to:

(a) Apply knowledge of mathematics, science, and engineering (parallel programming techniques, kernel decomposition, synchronization, memory access optimizations, data layout transformations, branching/loop optimizations, algorithm cascading, Brent’s theorem, input diagonalization, FP representations, regularization, compaction, binning, thread coarsening).

(b) Design and conduct experiments, analyze, and interpret data (design and conduct experiments on real massively parallel applications written using CUDA, utilize industrial tools to identify and overcome performance bottlenecks, measure execution time and speedup on GPU devices).

(c) Design a software system to meet desired needs within realistic constraints (design optimized massively parallel programs for GPUs by maximizing the utilization of constrained resources, including PCIe bandwidth, global memory bandwidth, shared memory banks, texture and constant memory, warps, thread blocks, SMP registers, etc.).

(d) Ability to function on multidisciplinary teams (the collaborative term projects will ideally pair computer scientists and engineers with domain experts).

(e) Ability to identify, formulate, and solve engineering problems using industry-strength CUDA tools.

(g) Ability to communicate effectively (research papers and reports, presentations, posters).

(i) Recognize the need for, and have the ability to engage in life-long learning (read recent research papers in an unfamiliar subject and assimilate knowledge without direct supervision).

(j) Gain knowledge of contemporary issues (state-of-the-art in GPU programming, energy-efficiency in computer architectures, high-performance parallel programming, heterogeneous computing).

(k) Ability to use the techniques, skills, and modern engineering tools necessary for engineering practice (GPUs, CUDA, occupancy calculator, Ocelot, OpenCL, parallel programming techniques).

TENTATIVE SCHEDULE

  • Week 1: Introduction to GPUs and their Programming Model

Grids, blocks, threads, memory model, execution model, software architecture, basic API

  • Week 2: GPU Architecture Overview and Execution API

Streaming processors, streaming multiprocessors, texture clusters, streaming arrays, block scheduling and execution, warps, scoreboarding, memory parallelism, register file, execution staging, subtyping, measuring time, compilation overview

  • Week 3: Memory Performance and Control Flow; Example application: Matrix Transpose

Coalescing, bank conflicts, DRAM memory controller organization, DRAM burst mode, tiling, padding, flow divergence
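
For illustration, the matrix-transpose example might be sketched as follows: a tile is staged in shared memory so that both the global-memory read and the global-memory write are coalesced, and the extra padding column avoids shared-memory bank conflicts. The 32x32 tile size and the kernel name are illustrative assumptions.

    // Coalesced matrix transpose sketch. A TILE x TILE tile is staged in
    // shared memory; the +1 padding column changes the access stride so the
    // transposed reads hit different banks (no bank conflicts).
    #define TILE 32

    __global__ void transposeKernel(const float *in, float *out,
                                    int width, int height) {
        __shared__ float tile[TILE][TILE + 1];

        int x = blockIdx.x * TILE + threadIdx.x;   // input column
        int y = blockIdx.y * TILE + threadIdx.y;   // input row
        if (x < width && y < height)
            tile[threadIdx.y][threadIdx.x] = in[y * width + x];
        __syncthreads();

        // Swap the block indices so the output write is also coalesced.
        x = blockIdx.y * TILE + threadIdx.x;       // output column
        y = blockIdx.x * TILE + threadIdx.y;       // output row
        if (x < height && y < width)
            out[y * height + x] = tile[threadIdx.x][threadIdx.y];
    }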

  • Week 4: Putting everything to work: Reductions

Global synchronization, kernel decomposition, memory coalescing, non-divergent branching, eliminating shared-memory bank conflicts, loop unrolling, Brent’s Theorem, algorithm cascading
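
For illustration, a per-block sum reduction that combines several of these ideas might be sketched as follows. The block size of 256, the float data type, and the kernel name are illustrative assumptions.

    // Per-block sum reduction sketch: each block reduces BLOCK_SIZE * 2
    // elements. Sequential addressing keeps active threads in contiguous
    // warps (non-divergent branching) and avoids shared-memory bank
    // conflicts; adding two elements per thread on load is a first step
    // toward algorithm cascading.
    #define BLOCK_SIZE 256

    __global__ void reduceSumKernel(const float *in, float *blockSums, int n) {
        __shared__ float sdata[BLOCK_SIZE];

        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x * (BLOCK_SIZE * 2) + tid;

        // Coalesced load; each thread sums two input elements.
        float v = 0.0f;
        if (i < n)              v += in[i];
        if (i + BLOCK_SIZE < n) v += in[i + BLOCK_SIZE];
        sdata[tid] = v;
        __syncthreads();

        // Tree reduction with sequential addressing.
        for (unsigned int s = BLOCK_SIZE / 2; s > 0; s >>= 1) {
            if (tid < s)
                sdata[tid] += sdata[tid + s];
            __syncthreads();
        }

        // Thread 0 writes the block's partial sum; a second kernel launch
        // (or a host pass) combines the per-block results.
        if (tid == 0)
            blockSums[blockIdx.x] = sdata[0];
    }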

  • Week 5: Work with atomics: Histograms I

Data races, atomics, input diagonalization, privatization, bank mapping, warp voting

  • Week 6: Histograms II; Sparse Arrays; Textures

Warp voting, fences, compressed sparse row representation, textures, nearest-point sampling, linear filtering, clamping, CUDA arrays

  • Week 7: PTX Assembly and Profiler; Floating Point Considerations; Introduction to OpenCL

Hardware counters, visual profiler, predicated execution, floating-point representation, flush-to-zero, denormalization, rounding, units-in-the-last-place, mixed-precision methods, OpenCL kernel structure and invocation, context, devices, command queues, memory allocation, argument construction

  • Weeks 8-9: The Fermi Architecture; Parallel Programming

Horizontal vs. vertical scaling, shared memory programming, message passing, data sharing models, algorithm structures vs. parallel programming coding styles, parallel program models vs. programming models, regularization, compaction, binning, data layout transformations, thread coarsening, scatter/gather, tessellation, parallel kernel execution, double-warp scheduling, decoupled datapaths, operand collector, unified load/store addressing, ECC, caching, limiter theory

  • Week 10: Review; Project Demos

ABET Content Category: 100% Engineering (Design component)