Sunway Supercomputer Architecture for 125 peak petaflops

The technical details of the new world's fastest supercomputer are now available in a 24-page report by Jack Dongarra.

The Sunway TaihuLight System was developed by the National Research Center of Parallel Computer Engineering and Technology (NRCPC) and installed at the National Supercomputing Center in Wuxi, in China's Jiangsu province. The center is a joint effort of Tsinghua University, the City of Wuxi, and Jiangsu Province. The CPU vendor is the Shanghai High Performance IC Design Center. The system is in full operation, with a number of applications implemented and running on it. The Center will be a public supercomputing center that provides services to users in China and around the world.

The complete system has a theoretical peak performance of 125.4 Pflop/s with 10,649,600 cores and 1.31 PB of primary memory. It is based on the SW26010 processor, designed by the Shanghai High Performance IC Design Center. The processor chip is composed of 4 core groups (CGs, see figure 1) connected via a network on chip (NoC, see figure 2); each CG includes a Management Processing Element (MPE) and a cluster of 64 Computing Processing Elements (CPEs) arranged in an 8 by 8 grid. Each CG has its own memory space, which is connected to the MPE and the CPE cluster through a memory controller (MC). The processor connects to outside devices through a system interface (SI).

Each core group pairs a Management Processing Element (MPE) with a CPE cluster. The MPE is a 64-bit RISC core supporting both user and system modes, 256-bit vector instructions, a 32 KB L1 instruction cache, a 32 KB L1 data cache, and a 256 KB L2 cache. The CPE cluster is composed of an 8×8 mesh of 64-bit RISC Computing Processing Elements (CPEs), which support only user mode, with 256-bit vector instructions, a 16 KB L1 instruction cache, and 64 KB of Scratch Pad Memory (SPM).

Sunway Computer Node

A computer node of this machine is based on one many-core processor chip, the SW26010. Each processor is composed of 4 MPEs and 4 CPE clusters (a total of 260 cores), 4 memory controllers (MCs), and a network on chip (NoC) connected to the system interface (SI). Each core group (its MPE, CPE cluster, and MC) has access to 8 GB of DDR3 memory. The total system has 40,960 nodes, for a total of 10,649,600 cores and 1.31 PB of memory.

The MPEs and CPEs are based on a 64-bit RISC architecture with SIMD support and an out-of-order microarchitecture. Both the MPE and the CPEs participate in the user's application. The MPE performs management, communication, and computation, while the CPEs mainly perform computations. (The MPE can also participate in the computations.)

Each CPE core has a single floating-point pipeline that can perform 8 flops per cycle (64-bit floating-point arithmetic), and each MPE core has a dual pipeline, each of which can perform 8 flops per cycle. The cores run at 1.45 GHz, so a CPE core has a peak performance of 8 flops/cycle * 1.45 GHz = 11.6 Gflop/s, and an MPE core has a peak performance of 16 flops/cycle * 1.45 GHz = 23.2 Gflop/s. There is just one thread of execution per physical core.

A node of the TaihuLight System has a peak performance of (256 CPE cores * 8 flops/cycle * 1.45 GHz) + (4 MPE cores * 16 flops/cycle * 1.45 GHz) = 3.0624 Tflop/s per node. The complete system has 40,960 nodes, for a theoretical peak performance of 125.4 Pflop/s.
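As a quick check on that arithmetic, here is a minimal C sketch (the inputs are the report's figures; the program itself is purely illustrative):

#include <stdio.h>

int main(void) {
    const double clock_ghz = 1.45;   /* core clock rate in GHz */
    const int cpe_cores = 256;       /* 4 clusters of 64 CPEs */
    const int mpe_cores = 4;         /* one MPE per core group */

    /* Gflop/s per core: flops per cycle times clock rate */
    double cpe_gflops = 8 * clock_ghz;    /* 11.6 Gflop/s */
    double mpe_gflops = 16 * clock_ghz;   /* 23.2 Gflop/s */

    /* node peak in Tflop/s */
    double node_tflops = (cpe_cores * cpe_gflops + mpe_cores * mpe_gflops) / 1000.0;
    printf("node peak: %.4f Tflop/s\n", node_tflops);  /* 3.0624 */
    return 0;
}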

Each CPE has 64 KB of local (scratchpad) SRAM memory and no data cache, plus a 16 KB instruction cache. Each of the 4 core groups has 8 GB of DDR3 memory, so a node has 32 GB of primary memory. Each processor connects to four 128-bit DDR3-2133 memory controllers, for a memory bandwidth of 136.51 GB/s. Non-volatile memory is not used in the system.
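The 136.51 GB/s figure follows from the memory configuration: four channels, each 128 bits (16 bytes) wide, at 2133 MT/s. An illustrative C sketch of that arithmetic:

#include <stdio.h>

int main(void) {
    const int channels = 4;           /* memory controllers per node */
    const int width_bytes = 128 / 8;  /* 128-bit channel = 16 bytes */
    const double rate_gts = 2.133;    /* DDR3-2133: 2133 MT/s */

    double bw = channels * width_bytes * rate_gts;     /* GB/s */
    printf("node memory bandwidth: %.2f GB/s\n", bw);  /* ~136.51 */
    return 0;
}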

Within the chip, the core groups are connected via the network on chip (NoC), and the system interface (SI) connects the node to the rest of the system. The SI is a standard PCIe interface with a bidirectional bandwidth of 16 GB/s and a latency of around 1 us.

The next large acquisition of supercomputers for the US Department of Energy will not be until 2017, with production beginning in 2018. The US Department of Energy schedule calls for a planned 200 Pflop/s machine called Summit at Oak Ridge National Lab by early 2018, a planned 150 Pflop/s machine called Sierra at Lawrence Livermore National Lab by mid-2018, and a planned 180 Pflop/s machine called Aurora at Argonne National Lab in late 2018.

The Sunway Interconnect

Sunway built its own interconnect, with a five-level integrated hierarchy connecting the computing nodes, computing boards, supernodes, cabinets, and the complete system. Each computing board (card) holds two nodes.

Sunway Supernode and Cabinet

The complete system is composed of 40 Cabinets. Each Cabinet contains 4 Supernodes, and each Supernode has 256 Nodes. Each Node has a peak floating-point performance of 3.0624 Tflop/s.

Each Supernode is then 256 * 3.0624 Tflop/s = 783.97 Tflop/s, and a Cabinet of 4 Supernodes is 3.1359 Pflop/s.
All numbers are for 64-bit floating-point arithmetic.
1 Node = 260 cores
1 Node = 3.0624 Tflop/s
1 Supernode = 256 Nodes = 783.97 Tflop/s
1 Cabinet = 4 Supernodes = 3.1359 Pflop/s
1 Sunway TaihuLight System = 40 Cabinets = 160 Supernodes = 40,960 Nodes = 10,649,600 cores
1 Sunway TaihuLight System = 125.4359 Pflop/s
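The same roll-up, expressed as an illustrative C sketch (all inputs from the report):

#include <stdio.h>

int main(void) {
    const double node_tflops = 3.0624;    /* per-node peak, 64-bit FP */
    const int nodes_per_supernode = 256;
    const int supernodes_per_cabinet = 4;
    const int cabinets = 40;

    double supernode = node_tflops * nodes_per_supernode;          /* 783.97 Tflop/s */
    double cabinet = supernode * supernodes_per_cabinet / 1000.0;  /* 3.1359 Pflop/s */
    double system = cabinet * cabinets;                            /* 125.4359 Pflop/s */
    printf("supernode: %.2f Tflop/s\ncabinet: %.4f Pflop/s\nsystem: %.4f Pflop/s\n",
           supernode, cabinet, system);
    return 0;
}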

Assuming 15.311 MW for HPL across all 40 Cabinets, each Cabinet draws 382.8 kW. Each Cabinet has 4 * 256 = 1,024 Nodes, or 373.8 W per Node.
The flops per watt at theoretical peak is about 8 Gflops/W, and for HPL the efficiency is 6.074 Gflops/W (93 Pflops / 15.311 MW).
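The power arithmetic, as another illustrative C sketch (again, all inputs are the report's figures):

#include <stdio.h>

int main(void) {
    const double power_mw = 15.311;  /* measured HPL power, MW */
    const int cabinets = 40;
    const int nodes = 40960;

    double per_cabinet_kw = power_mw * 1000.0 / cabinets;       /* 382.8 kW */
    double per_node_w = power_mw * 1e6 / nodes;                 /* 373.8 W */
    double peak_gflops_per_w = 125.4359e6 / (power_mw * 1e6);   /* ~8.2 */
    double hpl_gflops_per_w = 93.0e6 / (power_mw * 1e6);        /* ~6.07 */
    printf("per cabinet: %.1f kW, per node: %.1f W\n", per_cabinet_kw, per_node_w);
    printf("peak: %.2f Gflops/W, HPL: %.3f Gflops/W\n", peak_gflops_per_w, hpl_gflops_per_w);
    return 0;
}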

The Sunway Software Stack is Linux-based

The Sunway TaihuLight System is using Sunway Raise OS 2.0.5 based on Linux as the operating system.

The basic software stack for the many-core processor includes basic compiler components, such as C/C++ and Fortran compilers, an automatic vectorization tool, and basic math libraries. There is also Sunway OpenACC, a customized parallel compilation tool that supports OpenACC 2.0 syntax and targets the SW26010 many-core processor.
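Since Sunway OpenACC follows OpenACC 2.0 syntax, code written for it should look like standard directive-annotated C. Below is a minimal, hypothetical example using only standard OpenACC 2.0 directives (not confirmed Sunway-specific extensions); a Sunway OpenACC compiler would presumably map the parallel loop onto the 64 CPEs of a core group:

#include <stdio.h>

#define N 1024

int main(void) {
    static float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0f * i; }

    /* Offload a simple vector add; the copyin/copyout clauses move
       data between host memory and the accelerator's local memory. */
    #pragma acc parallel loop copyin(a, b) copyout(c)
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[N-1] = %f\n", c[N - 1]);  /* expect 3069.0 */
    return 0;
}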

Applications
There are currently four key application domains for the Sunway TaihuLight system:
• Advanced manufacturing: CFD, CAE applications.
• Earth system modeling and weather forecasting.
• Life science.
• Big data analytics.

There are three submissions that are finalists for the Gordon Bell Award at SC16 based on the new Sunway TaihuLight system. These three applications are:
(1) a fully-implicit nonhydrostatic dynamic solver for cloud-resolving atmospheric simulation;
(2) a highly effective global surface wave numerical simulation with ultra-high resolution;
(3) a large-scale phase-field simulation of coarsening dynamics based on the Cahn-Hilliard equation with degenerated mobility.

All three applications have scaled to around 8 million cores (close to the full system scale). The applications that use an explicit method (such as the wave simulation and the phase-field simulation) have achieved a sustained performance of 30 to 40 Pflops. In contrast, the implicit solver achieves a sustained performance of around 1.5 Pflops, with a good convergence rate for large-scale problems. These performance numbers may improve before the SC16 Conference in November 2016.

The Gordon Bell Prize is awarded each year to recognize outstanding achievement in high performance computing. The purpose of the award is to track the progress over time of parallel computing, with particular emphasis on rewarding innovation in applying high-performance computing to applications in science, engineering, and large-scale data analytics. Prizes may be awarded for peak performance or special achievements in scalability and time-to-solution on important science and engineering problems.

The system was funded from three sources: the central Chinese government, the province of Jiangsu, and the city of Wuxi. Each contributed approximately 600 million RMB, for a total of 1.8 billion RMB, or approximately $270M USD. That covers the building, hardware, R&D, and software costs. It does not cover the ongoing maintenance and operation of the system and center.

The Sunway TaihuLight System is very impressive, with over 10 million cores and a peak performance of 125 Pflop/s. It is almost three times (2.75 times) as fast and three times as efficient as the system it displaces in the number one spot. The HPL benchmark result of 93 Pflop/s, or 74% of theoretical peak performance, is also impressive, with an efficiency of 6 Gflops per watt. The HPCG performance at only 0.3% of peak, however, shows the weakness of the Sunway TaihuLight architecture: slow memory and modest interconnect performance. The ratio of floating-point operations per byte of data from memory on the SW26010 is 22.4 Flops(DP)/Byte, which shows an imbalance, or an overcapacity of floating-point operations relative to data transfers from memory. By comparison, the Intel Knights Landing processor is at 7.2 Flops(DP)/Byte. So for many "real" applications, the performance on the TaihuLight will be nowhere near the peak rate. Also, the primary memory for this system is on the low side at 1.31 PB (Tianhe-2 has 1.4 PB and Titan has 0.71 PB).
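The 22.4 figure is simply the node peak divided by the node memory bandwidth; an illustrative one-line check in C:

#include <stdio.h>

int main(void) {
    const double node_peak_gflops = 3062.4;  /* SW26010 node peak, 64-bit FP */
    const double node_bw_gbs = 136.51;       /* DDR3 bandwidth per node */
    printf("%.1f Flops(DP)/Byte\n", node_peak_gflops / node_bw_gbs);  /* ~22.4 */
    return 0;
}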

The Sunway TaihuLight system, based on a homegrown processor, demonstrates the significant progress that China has made in the domain of designing and manufacturing large-scale computation systems.

The fact that there are sizeable applications and Gordon Bell contender applications running on the system is impressive and shows that the system is capable of running real applications and not just a “stunt machine”.

China has made a big push into high performance computing. In 2001 there were no supercomputers listed on the Top500 in China. Today China has 167 systems on the June 2016 Top500 list compared to 165 systems in the US. This is the first time the US has lost the lead.

SOURCE: Dongarra report on the Sunway supercomputer