
Research and Development Staff Member at Oak Ridge National Laboratory
Knoxville, Tennessee Area

Dr. Christian Engelmann has 8+ years of experience in software research and development for next-generation extreme-scale high-performance computing (HPC) systems, with a strong research funding and publication record. In collaboration with other U.S. Department of Energy laboratories and universities worldwide, his research addresses computer science challenges in HPC system software, such as dependability, scalability, and portability.
principal investigator for federally funded software research and development; MSc thesis advisor; small team lead; project- and milestone-oriented teamwork; publishing research and development results
(Government Agency; Research industry)
September 2009 — Present (3 months)
2009-...: Checkpoint storage virtualization to improve efficiency by aggregating a variety of resources, such as memory and flash, and software dual-modular redundancy (DMR) to eliminate rollback/recovery in HPC
2008-...: Light-weight simulation of future HPC architectures (~10,000,000 cores) to evaluate scalability/fault tolerance of key science algorithms [U.S. DOE Institute for Advanced Architecture and Algorithms]
2008-...: HPC system software resiliency solutions, including health monitoring, reliability analysis, fault prediction, proactive fault tolerance, reactive fault tolerance enhancements, and holistic fault tolerance
(Government Agency; Research industry)
May 2004 — August 2009 (5 years 4 months)
2008-...: Light-weight simulation of future HPC architectures (~10,000,000 cores) to evaluate scalability/fault tolerance of key science algorithms [U.S. DOE Institute for Advanced Architecture and Algorithms]
2008-...: HPC system software resiliency solutions, including health monitoring, reliability analysis, fault prediction, proactive fault tolerance, reactive fault tolerance enhancements, and holistic fault tolerance
2006-09: Enhancing productivity for scientific application development, deployment and execution with the Harness Workbench Toolkit offering a common view across diverse HPC hardware and software platforms
2006-08: Virtual system environments for "plug-and-play" supercomputing through desktop-to-cluster-to-petaflop computer system-level virtualization based on recent advances in hypervisor technologies
2004-07: HPC reliability, availability and serviceability solutions, such as scalable membership management for MPI high availability and asymmetric active/standby (n+m) replication for head and service nodes
2004-06: High availability for services running on HPC head and service nodes, such as Torque and PVFS MDS, using symmetric active/active (state-machine) replication with 99.9997% service uptime
2000-05: Pluggable lightweight heterogeneous distributed virtual machine (PVM successor) with an adaptive reconfigurable runtime environment, parallel plug-in paradigms, high availability, and fault-tolerant MPI
(Government Agency; Research industry)
June 2001 — April 2004 (2 years 11 months)
2002-04: Light-weight simulation of future HPC architectures (~1,000,000 processors) to evaluate scalability/fault tolerance of a new generation of super-scalable, naturally fault-tolerant scientific algorithms [IBM CRADA]
2000-05: Pluggable lightweight heterogeneous distributed virtual machine (PVM successor) with an adaptive reconfigurable runtime environment, parallel plug-in paradigms, high availability, and fault-tolerant MPI
(Government Agency; Research industry)
August 2000 — January 2001 (6 months)
2000-05: Pluggable lightweight heterogeneous distributed virtual machine (PVM successor) with an adaptive reconfigurable runtime environment, parallel plug-in paradigms, high availability, and fault-tolerant MPI
(Public Company; HPQ; Computer Hardware industry)
October 1998 - September 1999 (1 year)
1998-99: Object-oriented graphical user interface prototype system service using the model-view-controller software architecture for an embedded mobile patient monitoring system
PhD, Computer Science, 2004-2008
Thesis title: "Symmetric Active/Active High Availability for High-Performance Computing System Services". Thesis research performed at Oak Ridge National Laboratory. Advisor: Prof. Vassil N. Alexandrov (University of Reading).
MSc, Computer Science, 2000-2001
Thesis title: "Distributed Peer-to-Peer Control for Harness". Thesis research performed at Oak Ridge National Laboratory. Double diploma in conjunction with the Department of Engineering I, Technical College for Engineering and Economics (FHTW) Berlin, Germany. Advisors: Prof. V. N. Alexandrov (University of Reading); George A. (Al) Geist (Oak Ridge National Laboratory).
Dipl.-Ing. (FH), Computer Systems Engineering, 1996-2001
Thesis title: "Distributed Peer-to-Peer Control for Harness". Thesis research performed at Oak Ridge National Laboratory. Double diploma in conjunction with the Department of Computer Science, University of Reading, UK. Advisors: Prof. U. Metzler (Technical College for Engineering and Economics (FHTW) Berlin); George A. (Al) Geist (Oak Ridge National Laboratory).
skiing, travel
ACM, ACM SIGOPS, IEEE, IEEE CS, IEEE CS TCSC/TCPP/TCDP/TCFT