
3 The Pipeline Processor

What will the GRAPE-6 chip look like? Here, we briefly discuss the differences between the GRAPE-4 processor chip (the HARP chip) and the GRAPE-6 chip. The changes are introduced to make full use of advances in VLSI technology.

Advances in technology have two consequences. The first is an increase in the number of transistors available on a single chip. The HARP chip was fabricated with an older process technology, while a process with a roughly four times finer design rule will be used for GRAPE-6. Roughly speaking, we can use 16 times more transistors. The second is that the switching delay of a transistor improves roughly in proportion to its physical size, which we hope will give us around a factor of four increase in the clock frequency. Thus, we expect that the GRAPE-6 chip will have 64 times more processing power than the GRAPE-4 chip, through a larger number of pipelines and a higher clock speed.
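The estimate above is simple arithmetic, sketched below under ideal-scaling assumptions: transistor count grows with the square of the linear shrink factor, and clock frequency grows linearly with it. The function name `scaling_gains` and the shrink factor of 4 (inferred from the factor-of-four clock gain quoted in the text) are illustrative, not taken from the chip design itself.

```python
def scaling_gains(linear_shrink):
    """Idealized gains from shrinking the linear feature size by `linear_shrink`.

    Assumptions for illustration only: transistor count scales with area,
    and clock frequency scales inversely with feature size.
    """
    transistor_gain = linear_shrink ** 2   # more transistors -> more pipelines
    clock_gain = linear_shrink             # shorter switching delay -> faster clock
    return transistor_gain, clock_gain, transistor_gain * clock_gain


transistors, clock, throughput = scaling_gains(4)
print(transistors, clock, throughput)  # 16 4 64
```

Real processes rarely scale this cleanly, so the factor of 64 is best read as an upper-end expectation rather than a guaranteed figure.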

Table 1: Comparison of pipeline chips for GRAPE-4 and GRAPE-6

The power consumption is still relatively low, because of the shrink in the physical size of the transistors and the reduction in the supply voltage.

The increase in the number of pipelines and the clock frequency of the pipeline chip, however, forced us to reconsider the architecture of the chip. The pipeline chip of GRAPE-4 (the HARP chip) implemented only the force calculation pipeline. GRAPE-4 has a separate pipeline that evaluates the predictor polynomials for the positions of particles, so that it can be used with individual timestep algorithms (for details, see [MTES97]). This pipeline was implemented in another chip (the PROMETHEUS chip). One PROMETHEUS chip supplied particle data to 48 force calculation pipeline chips. The data transfer bandwidth between PROMETHEUS and HARP was 256 MB/s. The number of HARP chips connected to a PROMETHEUS chip was chosen so that we could achieve reasonable efficiency for the individual timestep algorithm. The number of particles for which the gravitational force is calculated in parallel must be relatively small, since the number of particles that can be integrated in parallel is small (otherwise the individual timestep algorithm would be useless). For , this number must be less than 100.
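The division of labor between the two chips can be illustrated in software. The sketch below is not the hardware design: it assumes a third-order Taylor predictor (position, velocity, acceleration, and its time derivative), which is one common choice for individual timestep schemes, whereas the text only says "predictor polynomials". The names `predict_position` and `accel_on`, the softening parameter `eps2`, and units with G = 1 are all hypothetical.

```python
def predict_position(x0, v0, a0, j0, dt):
    """Predictor stage (the role of the PROMETHEUS chip in this sketch).

    Extrapolates a particle's position over its own elapsed time `dt`
    using a third-order Taylor polynomial (an assumed form).
    """
    return [x + v * dt + a * dt**2 / 2 + j * dt**3 / 6
            for x, v, a, j in zip(x0, v0, a0, j0)]


def accel_on(i_pos, j_positions, j_masses, eps2):
    """Force stage (the role of a HARP pipeline in this sketch).

    Sums the softened gravitational acceleration on one particle from all
    predicted source particles. Requires eps2 > 0 if the particle itself
    appears among the sources, to avoid a division by zero.
    """
    acc = [0.0, 0.0, 0.0]
    for pj, mj in zip(j_positions, j_masses):
        d = [pj[k] - i_pos[k] for k in range(3)]
        r2 = sum(c * c for c in d) + eps2
        inv_r3 = r2 ** -1.5
        for k in range(3):
            acc[k] += mj * d[k] * inv_r3
    return acc


# One source particle predicted forward, then its pull on a particle at the origin.
src = predict_position([1.0, 0.0, 0.0], [0.0, 0.0, 0.0],
                       [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], dt=0.5)
print(accel_on([0.0, 0.0, 0.0], [src], [1.0], eps2=0.0))
```

In hardware, each predicted source particle is broadcast to many force pipelines at once, which is why the number of particles computed in parallel determines how much predictor and memory bandwidth is needed.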

If we use a similar architecture for GRAPE-6, a rather serious problem arises: we need a very high total memory bandwidth. With GRAPE-6, we can increase the number of particles handled in parallel to around 500. Even so, the required memory bandwidth is about 40 times higher than that of GRAPE-4, since the peak speed of GRAPE-6 will be 200 times that of GRAPE-4 while the number of particles sharing each memory access grows only by a factor of about five.
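The factor of 40 follows from the figures quoted above. The sketch below assumes that each particle datum read from memory is shared by all particles whose forces are computed in parallel, so the required bandwidth scales as peak speed divided by that degree of parallelism. The function and parameter names are illustrative.

```python
def memory_bandwidth_ratio(speed_ratio, parallel_old, parallel_new):
    """Ratio of required memory bandwidth, new machine over old.

    Assumes bandwidth is proportional to (peak speed) / (number of particles
    served by each memory read), as described in the lead-in.
    """
    return speed_ratio / (parallel_new / parallel_old)


# 200x peak speed, parallelism raised from about 100 to about 500
print(memory_bandwidth_ratio(200, 100, 500))  # 40.0
```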

There are a number of different approaches to achieving this high memory bandwidth. We analyzed several of them and concluded that a tightly coupled memory-processor chip pair is, at present, the most cost-effective solution. If we could integrate the memory and pipelines into a single chip, it would be an even better solution. However, as of late 1997, logic-memory integration still has too large an impact on density and performance. In a few years, advances in process technology might make logic-memory integration a practical option.

If we place the memory chips and a pipeline chip physically close together, it is not very difficult to achieve a high bandwidth, since we can use a high clock frequency without much trouble. Of course, we have to use advanced packaging technologies such as the MCM (multi-chip module), which was used in GRAPE-4. Figure 2 shows an example design. Here, a multi-chip module contains two GRAPE-6 chips, each connected to two SSRAM (synchronous static random access memory) chips. For the connection between the memory and pipeline chips, we use a high-speed (125 MHz) clock to achieve a data transfer rate of around 1.2 GB/s. The two GRAPE-6 chips share a common I/O port, through which they are connected to the communication path to the host. For the I/O port, we plan to use one port with 64-bit width and a 25 MHz data rate (200 MB/s). The physical wire length for the I/O port will be considerably longer than that of the port to the memory. Therefore, it is important to keep both the clock frequency and the number of wires (i.e., the I/O bandwidth) as low as possible.
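The port figures quoted above can be cross-checked with a short calculation. This sketch assumes one transfer per clock cycle and 1 GB = 1000 MB. The width of the memory port is not given in the text, so the second function only reports the width implied by the quoted rate. All names are illustrative.

```python
def port_bandwidth_mb_per_s(width_bits, clock_mhz):
    """Peak bandwidth of a parallel port, assuming one transfer per clock."""
    return width_bits / 8 * clock_mhz


def implied_bytes_per_cycle(bandwidth_mb_per_s, clock_mhz):
    """Bytes that must move per clock to sustain the given bandwidth."""
    return bandwidth_mb_per_s / clock_mhz


io_bw = port_bandwidth_mb_per_s(64, 25)          # host I/O port
mem_bytes = implied_bytes_per_cycle(1200, 125)   # memory port at about 1.2 GB/s
print(io_bw, mem_bytes, 1200 / io_bw)            # 200.0 9.6 6.0
```

Under these assumptions, the memory port carries about six times the traffic of the shared I/O port, consistent with keeping the fast, wide connection on the short wires inside the module.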

Figure 2: The GRAPE-6 processor module


Jun Makino
Tue Jun 23 14:17:17 JST 1998