Computers are used everywhere in our daily lives, so one might think that their computing power is already sufficient. In reality, however, even the latest computers fall far short of the required performance in many fields. Numerical simulations on large-scale high-performance computing systems, so-called supercomputers, are now indispensable in science and technology, and new supercomputers are therefore being intensively developed all over the world. To further increase performance, future supercomputers will be much larger and more complex than any current system. An important technical issue is how to exploit the potential of such a future supercomputer as effectively as possible. To address this issue, we have been extensively studying high-performance computing technologies, from system design to programming and utilization. In particular, we have been leading an international joint research project on reducing the programming effort required to exploit the performance of a supercomputer, which has produced some pioneering results. Visit the following project page for more details.
Today, exploiting the performance of a supercomputer requires the expert knowledge and experience of professional programmers, so-called hackers. Meanwhile, many recent research projects have used machine learning to perform intelligent tasks in place of humans. In this situation, we are exploring effective ways of using machine learning technologies to take over the expert tasks of performance-aware programming for supercomputers.
We are also exploring effective ways of using HPC technologies for machine learning. One typical example is using supercomputers to automatically adjust hyperparameters, which must be set before training begins.
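The idea can be illustrated with a minimal sketch: many hyperparameter candidates are evaluated independently, so they map naturally onto parallel compute resources. The objective function below is a hypothetical stand-in for a full training run, and a local thread pool stands in for the nodes of a supercomputer; both are illustrative assumptions, not part of any specific system of ours.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical objective: the validation loss as a function of two
# hyperparameters (learning rate, regularization strength). In a real
# setting, evaluating one candidate means one full training run.
def validation_loss(params):
    lr, reg = params
    return (lr - 0.01) ** 2 + (reg - 0.1) ** 2

def parallel_search(candidates, workers=4):
    # Evaluate all candidates concurrently. On a supercomputer, each
    # evaluation would run on its own node; here a thread pool stands
    # in for those parallel resources.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        losses = list(pool.map(validation_loss, candidates))
    best_loss, best_params = min(zip(losses, candidates))
    return best_params, best_loss

if __name__ == "__main__":
    grid = [(lr, reg) for lr in (0.001, 0.01, 0.1)
                      for reg in (0.01, 0.1, 1.0)]
    print(parallel_search(grid))
```

Because the candidate evaluations do not communicate with each other, this kind of search scales almost linearly with the number of nodes, which is what makes supercomputers attractive for it.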
To further improve performance, future supercomputers will keep getting larger. As a result, the frequency of system failures will increase due to factors such as a greater number of hardware components and more complicated system software. One report estimates that the MTBF (mean time between failures) of a future system could be only a few minutes. For a long-running scientific program to execute to completion, we need fault tolerance that allows the program to restart from a previous state saved before the failure occurred. Such a fault-tolerance feature, so-called checkpointing, generally incurs a non-negligible runtime overhead, because it periodically stores the state of the running program to a file at a certain interval. Therefore, we are developing a checkpointing mechanism with low overhead.
Specific research topics include:
- Use of hierarchical storage systems to reduce the runtime overhead of a checkpointing mechanism.
- Automatic tuning of checkpointing intervals.
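The trade-off behind interval tuning is classical: checkpointing too often wastes time writing files, while checkpointing too rarely wastes time recomputing lost work after a failure. A well-known first-order approximation of the optimal interval is Young's formula, shown below as an illustration; it is not necessarily the method used in our mechanism.

```python
import math

def young_interval(checkpoint_cost, mtbf):
    # Young's first-order approximation of the optimal checkpoint
    # interval: tau = sqrt(2 * C * M), where C is the time to write
    # one checkpoint and M is the mean time between failures.
    return math.sqrt(2.0 * checkpoint_cost * mtbf)

# Example: a 60 s checkpoint on a system with a 24 h (86400 s) MTBF
# gives tau = sqrt(2 * 60 * 86400) ~ 3220 s, i.e. roughly every 54 min.
```

The formula makes the trend behind automatic tuning visible: as the MTBF of future systems shrinks, the optimal interval shrinks with its square root, so checkpoints must be taken more often and their per-checkpoint cost matters even more.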