LLM inference on SX-Aurora Vector Engine

Erich Focht (NEC Deutschland GmbH)

Abstract

This talk explores performance and optimization techniques for running large language models on the SX-Aurora Vector Engine (VE). We discuss the VE’s capabilities and enhancements, focusing on its performance in float32 and on the use of the bfloat16 floating-point format. The experiments run VE-native code based on the llama2.c project, which aims to provide a compact binary for large language model inference. We highlight the advantages of using bfloat16 on the first two vector engine generations, which do not support the format natively, as it still delivers an additional performance boost. The talk also covers the modifications made to the llama2.c code to support bfloat16 weights and bfloat16 matrix-vector multiplication. We showcase the results achieved with efficient bfloat16 code on VE3 using simple C programming with vectorized and OpenMP-parallelized loops. Finally, we discuss the VE’s relatively large HBM memory and its token generation speed, which make it well suited as a processor for isolated, on-premise generative AI solutions.
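To make the bfloat16 approach concrete, the following is a minimal, hypothetical sketch (not the author’s actual llama2.c modification) of a matrix-vector multiplication over bfloat16 weights in plain C: the outer loop is parallelized with OpenMP and the inner loop is written so a vectorizing compiler (such as the VE compiler) can map it onto vector operations. It assumes bfloat16 values are stored as `uint16_t` holding the upper 16 bits of an IEEE float32; the function and variable names are illustrative only.

```c
#include <stdint.h>
#include <stddef.h>

/* Expand a bfloat16 value (stored as uint16_t) to float32:
   bfloat16 is the upper 16 bits of an IEEE-754 float32. */
static inline float bf16_to_f32(uint16_t b)
{
    union { uint32_t u; float f; } v;
    v.u = (uint32_t)b << 16;
    return v.f;
}

/* out[i] = sum_j w[i*n + j] * x[j], with bfloat16 weights w.
   d rows of length n; x and out are float32. */
void matvec_bf16(float *out, const uint16_t *w, const float *x, int n, int d)
{
    #pragma omp parallel for
    for (int i = 0; i < d; i++) {
        float acc = 0.0f;
        /* Contiguous loads and a sum reduction: a pattern the
           vectorizing compiler can turn into long vector operations. */
        for (int j = 0; j < n; j++) {
            acc += bf16_to_f32(w[(size_t)i * n + j]) * x[j];
        }
        out[i] = acc;
    }
}
```

The design point is that no intrinsics are needed: storing weights as bfloat16 halves the memory traffic of the bandwidth-bound matvec, while the conversion to float32 is a cheap 16-bit shift that does not prevent auto-vectorization.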
