Seminar Topic WS18/19: Fault Tolerance for HPC
Seminar Style
Attendance at all seminar presentations is obligatory for every
participant.
Successful participation consists of
- choosing, reading and understanding 1-2 papers from a list,
- presenting the papers to the other participants (slides, 30 minutes),
- and writing a summary of the papers (10-15 pages).
ECTS points: 3.0
ECTS points will be assigned to the area
that best fits the presented topic:
- Theory,
- Computer Engineering,
- Algorithms,
- Programming Languages,
- Software Engineering.
Key Dates
- Time: Thursdays, 13:00 - 15:00
- Place: Seminarraum 183/2
- First meeting (Vorbesprechung): October 11, 2018, 13:00 - 15:00
- Paper selection: October 18, 2018, 13:00 - 15:00
- All material will be published on TUWEL
Registration
Register in TISS by October 17, 2018. (The procedure will be explained during the first meeting.)
- 184.758 Seminar in Software Engineering
- 184.754 Seminar aus Algorithmik
- 184.753 Seminar in Theoretical Computer Science
- 191.108 Seminar in Technischer Informatik
Topics/Papers
Paper/topic advisors:
SH - Sascha Hunold
JLT - Jesper Larsson Träff
Topic | Advisor | Paper | ECTS area | Comment |
---|---|---|---|---|
1 | SH/JLT | T. Herault and Y. Robert, eds. Fault-Tolerance Techniques for High-Performance Computing. 1st. Springer, 2015. isbn: 3319209426, 9783319209425 (1 [Checkpointing]) | SE, TI, AL | |
2 | SH/JLT | T. Herault and Y. Robert, eds. Fault-Tolerance Techniques for High-Performance Computing. 1st. Springer, 2015. isbn: 3319209426, 9783319209425 (1 [ABFT]) | SE, TI, AL | |
3 | SH/JLT | T. Herault and Y. Robert, eds. Fault-Tolerance Techniques for High-Performance Computing. 1st. Springer, 2015. isbn: 3319209426, 9783319209425 (2 [Silent Errors]) | SE, TI, AL | |
4 | SH/JLT | T. Herault and Y. Robert, eds. Fault-Tolerance Techniques for High-Performance Computing. 1st. Springer, 2015. isbn: 3319209426, 9783319209425 (2 [Failures in Large Scale Machine]) | SE, TI, AL | |
5 | SH/JLT | T. Herault and Y. Robert, eds. Fault-Tolerance Techniques for High-Performance Computing. 1st. Springer, 2015. isbn: 3319209426, 9783319209425 (3 [Fault Tolerant MPI - Logging]) | SE, TI, AL | |
6 | SH/JLT | T. Herault and Y. Robert, eds. Fault-Tolerance Techniques for High-Performance Computing. 1st. Springer, 2015. isbn: 3319209426, 9783319209425 (3 [Fault Tolerant MPI - ULFM]) | SE, TI, AL | |
7 | SH/JLT | T. Herault and Y. Robert, eds. Fault-Tolerance Techniques for High-Performance Computing. 1st. Springer, 2015. isbn: 3319209426, 9783319209425 (4 [Replication for Resilience]) | SE, TI, AL, TH | |
8 | SH/JLT | Z. Chen and J. J. Dongarra. “Algorithm-Based Fault Tolerance for Fail-Stop Failures”. In: IEEE Trans. Parallel Distrib. Syst. 19.12 (2008), pp. 1628–1641. doi: 10.1109/TPDS.2008.58 | SE, TI, AL | |
9 | SH/JLT | J. Elliott, K. Kharbas, D. Fiala, F. Mueller, K. B. Ferreira, and C. Engelmann. “Combining Partial Redundancy and Checkpointing for HPC”. In: Proceedings of the 32nd IEEE International Conference on Distributed Computing Systems (ICDCS). IEEE Computer Society, 2012, pp. 615–626. doi: 10.1109/ICDCS.2012.56 | SE, TI, AL | |
10 | SH/JLT | N. El-Sayed and B. Schroeder. “Understanding Practical Tradeoffs in HPC Checkpoint-Scheduling Policies”. In: IEEE Trans. Dependable Sec. Comput. 15.2 (2018), pp. 336–350. doi: 10.1109/TDSC.2016.2548463 | SE, TI, AL | |
11 | SH/JLT | C. George and S. S. Vadhiyar. “Fault Tolerance on Large Scale Systems using Adaptive Process Replication”. In: IEEE Trans. Computers 64.8 (2015), pp. 2213–2225. doi: 10.1109/TC.2014.2360536 | SE, TI, AL | |
12 | SH/JLT | M. Gamell, K. Teranishi, J. Mayo, H. Kolla, M. A. Heroux, J. Chen, and M. Parashar. “Modeling and Simulating Multiple Failure Masking Enabled by Local Recovery for Stencil-Based Applications at Extreme Scales”. In: IEEE Trans. Parallel Distrib. Syst. 28.10 (2017), pp. 2881–2895. doi: 10.1109/TPDS.2017.2696538 | SE, TI, AL | |
13 | SH/JLT | X. Tang, J. Zhai, B. Yu, W. Chen, W. Zheng, and K. Li. “An Efficient In-Memory Checkpoint Method and its Practice on Fault-Tolerant HPL”. In: IEEE Trans. Parallel Distrib. Syst. 29.4 (2018), pp. 758–771. doi: 10.1109/TPDS.2017.2781257 | SE, TI, AL | |
14 | SH/JLT | J. Ansel, K. Arya, and G. Cooperman. “DMTCP: Transparent checkpointing for cluster computations and the desktop”. In: Proceedings of the 23rd IEEE International Symposium on Parallel and Distributed Processing (IPDPS). IEEE, 2009, pp. 1–12. doi: 10.1109/IPDPS.2009.5161063 | SE, TI, AL | |
15 | SH/JLT | Z. Chen. “Algorithm-based recovery for iterative methods without checkpointing”. In: Proceedings of the 20th ACM International Symposium on High Performance Distributed Computing (HPDC). ACM, 2011, pp. 73–84. doi: 10.1145/1996130.1996142 | SE, TI, AL | |
Dates
Contact
If you have further questions about the seminar, please contact Assistant Prof. Dr. Sascha Hunold.