PASC Conference
PASC24 Conference: June 3 to June 5, 2024

PASC18 – Video of Christian Engelmann on Characterizing Faults, Errors and Failures in Extreme-Scale Computing Systems

In this video from PASC18, Christian Engelmann of Oak Ridge National Laboratory presents "Characterizing Faults, Errors and Failures in Extreme-Scale Computing Systems."

“Building a reliable supercomputer that achieves the expected performance within a given cost budget and providing efficiency and correctness during operation in the presence of faults, errors, and failures requires a full understanding of the resilience problem. The Catalog project develops a fault taxonomy, catalog and models that capture the observed and inferred conditions in current supercomputers and extrapolates this knowledge to future-generation systems. To date, the Catalog project has analyzed billions of node hours of system logs from supercomputers at Oak Ridge National Laboratory and Argonne National Laboratory. This talk provides an overview of our findings and lessons learned.”

Thanks to Rich Brueckner from insideHPC Media Publications for recording the video.

© 2025 PASC Conference