Travel of Software Developer: Java Performance - Choosing the Right Platform and Evaluating a System

26.12.11

Java Performance - Choosing the Right Platform and Evaluating a System

The application may be running on an inappropriate CPU architecture or system.

That's interesting. A Java application is supposed to be cross-platform, which means it should run on different CPU architectures and operating systems. According to this, however, we still need to tune things so the application suits its new environment.
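As a small illustration of this point, the same Java code can ask the JVM which platform it has landed on. The sketch below only reads standard system properties; the class name `PlatformInfo` and the method `describe` are made up for this example.

```java
// Minimal sketch: report the platform the JVM is running on.
// PlatformInfo and describe() are hypothetical names for illustration.
final class PlatformInfo {

    // Joins the operating system name, CPU architecture and JVM name.
    static String describe() {
        return System.getProperty("os.name") + " / "
             + System.getProperty("os.arch") + " / "
             + System.getProperty("java.vm.name");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

The bytecode is identical everywhere, but the output differs per machine, which is a reminder that tuning decisions still depend on the platform underneath.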

Multiple cores per CPU

Multi-core processor - Wikipedia, the free encyclopedia

A multi-core processor is a single computing component with two or more independent actual processors (called "cores"), which are the units that read and execute program instructions.[1] The instructions are ordinary CPU instructions like add, move data, and branch, but the multiple cores can run multiple instructions at the same time, increasing overall speed for programs amenable to parallel computing. Manufacturers typically integrate the cores onto a single integrated circuit die (known as a chip multiprocessor or CMP), or onto multiple dies in a single chip package.

Multiple threads per core (CMT, chip multithreading)

Searching Wikipedia with CMT as the keyword, I found something related, but I'm not sure whether it is the same thing.

Simultaneous multithreading - Wikipedia, the free encyclopedia

Although many people reported that Sun Microsystems' UltraSPARC T1 (known as "Niagara" until its 14 November 2005 release) and the now defunct processor codenamed "Rock" (originally announced in 2005, but after many delays cancelled in 2009) are implementations of SPARC focused almost entirely on exploiting SMT and CMP techniques, Niagara is not actually using SMT. Sun refers to these combined approaches as "CMT", and the overall concept as "Throughput Computing". The Niagara has 8 cores, but each core has only one pipeline, so actually it uses fine-grained multithreading. Unlike SMT, where instructions from multiple threads share the issue window each cycle, the processor uses a round robin policy to issue instructions from the next active thread each cycle. This makes it more similar to a barrel processor. Sun Microsystems' Rock processor is different, it has more complex cores that have more than one pipeline.

In processor design, there are two ways to increase on-chip parallelism with fewer resource requirements: one is the superscalar technique, which tries to increase instruction-level parallelism (ILP); the other is the multithreading approach, which exploits thread-level parallelism (TLP).

Superscalar means executing multiple instructions at the same time while chip-level multithreading (CMT) executes instructions from multiple threads within one processor chip at the same time.

Interleaved multithreading: Interleaved issue of multiple instructions from different threads, also referred to as temporal multithreading. It can be further divided into fine-grain multithreading or coarse-grain multithreading depending on the frequency of interleaved issues. Fine-grain multithreading (such as in a barrel processor) issues instructions for different threads after every cycle, while coarse-grain multithreading only switches to issue instructions from another thread when the currently executing thread causes some long latency event (like a page fault). Coarse-grain multithreading is more common because it causes fewer context switches between threads. For example, Intel's Montecito processor uses coarse-grain multithreading, while Sun's UltraSPARC T1 uses fine-grain multithreading. For those processors that have only one pipeline per core, interleaved multithreading is the only possible way, because it can issue at most one instruction per cycle.
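To make the round-robin idea concrete for myself, here is a toy sketch of fine-grain issue on a single-pipeline core: each cycle, the next runnable hardware thread after the previous one gets to issue. It is not how any real processor is implemented; the class `FineGrainedIssue`, its methods, and the four-thread setup with one stalled thread are assumptions for illustration only.

```java
// Toy model of fine-grain (round-robin) instruction issue on one pipeline.
// All names here are hypothetical; real hardware is far more involved.
final class FineGrainedIssue {

    // Returns the index of the next runnable thread after `last`,
    // scanning in round-robin order, or -1 if no thread is runnable.
    static int nextThread(boolean[] runnable, int last) {
        int n = runnable.length;
        for (int step = 1; step <= n; step++) {
            int candidate = (last + step) % n;
            if (runnable[candidate]) {
                return candidate;
            }
        }
        return -1;
    }

    // Records which thread issues on each cycle (-1 means an idle cycle).
    static int[] issueOrder(boolean[] runnable, int cycles) {
        int[] order = new int[cycles];
        int last = -1; // no thread has issued yet
        for (int cycle = 0; cycle < cycles; cycle++) {
            int next = nextThread(runnable, last);
            order[cycle] = next;
            if (next >= 0) {
                last = next;
            }
        }
        return order;
    }

    public static void main(String[] args) {
        // Thread 1 is waiting on a long latency event, so it is skipped.
        boolean[] runnable = {true, false, true, true};
        // prints [0, 2, 3, 0, 2, 3]
        System.out.println(java.util.Arrays.toString(issueOrder(runnable, 6)));
    }
}
```

In this model the stalled thread simply never gets picked, so the pipeline stays busy as long as at least one other thread is runnable. A coarse-grain design would instead keep issuing from one thread until it stalls.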

CPU Cache

CPU cache - Wikipedia, the free encyclopedia

A CPU cache is a cache used by the central processing unit of a computer to reduce the average time to access memory. The cache is a smaller, faster memory which stores copies of the data from the most frequently used main memory locations. As long as most memory accesses are cached memory locations, the average latency of memory accesses will be closer to the cache latency than to the latency of main memory.
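The claim about average latency can be checked with a simple weighted average: hit rate times cache latency plus miss rate times main memory latency. The sketch below uses made-up latencies (1 ns for cache, 100 ns for main memory) purely as assumptions; `AverageLatency` and `averageNanos` are hypothetical names.

```java
// Sketch: average memory access latency as a weighted average.
// The 1 ns and 100 ns figures are illustrative assumptions, not measurements.
final class AverageLatency {

    // hitRate is the fraction of accesses served from the cache (0.0 to 1.0).
    static double averageNanos(double hitRate, double cacheNanos, double memoryNanos) {
        return hitRate * cacheNanos + (1.0 - hitRate) * memoryNanos;
    }

    public static void main(String[] args) {
        double cacheNanos = 1.0;    // assumed cache latency
        double memoryNanos = 100.0; // assumed main memory latency
        for (double hitRate : new double[] {0.50, 0.90, 0.99}) {
            System.out.println("hit rate " + hitRate + " -> "
                + averageNanos(hitRate, cacheNanos, memoryNanos) + " ns");
        }
    }
}
```

With these assumed numbers, a 95% hit rate gives an average of about 5.95 ns, much closer to the cache latency than to main memory, which matches the statement above.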

CPU Cache Misses

CPU cache - Wikipedia, the free encyclopedia

A cache miss refers to a failed attempt to read or write a piece of data in the cache, which results in a main memory access with much longer latency. There are three kinds of cache misses: instruction read miss, data read miss, and data write miss.
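One way to get a feel for data read misses from Java is to walk a large 2D array in two different orders. In Java each row of an `int[][]` is its own contiguous array, so walking row by row touches neighbouring memory, while walking column by column jumps between rows and tends to miss the cache more often. This is only a rough sketch with a hypothetical class `TraversalOrder`; a single timed pass is not a reliable benchmark, and the timings depend on the machine and the JIT.

```java
// Sketch: same sum, two traversal orders with different cache behaviour.
// TraversalOrder is a hypothetical name; timings are indicative only.
final class TraversalOrder {

    // Visits each row from start to end, so memory is touched sequentially.
    static long sumRowMajor(int[][] grid) {
        long sum = 0;
        for (int r = 0; r < grid.length; r++) {
            for (int c = 0; c < grid[r].length; c++) {
                sum += grid[r][c];
            }
        }
        return sum;
    }

    // Visits one column at a time, hopping across rows (assumes a rectangular grid).
    static long sumColumnMajor(int[][] grid) {
        long sum = 0;
        int cols = grid.length == 0 ? 0 : grid[0].length;
        for (int c = 0; c < cols; c++) {
            for (int r = 0; r < grid.length; r++) {
                sum += grid[r][c];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int n = 2048; // 2048 x 2048 ints, about 16 MB of element data
        int[][] grid = new int[n][n];
        for (int[] row : grid) {
            java.util.Arrays.fill(row, 1);
        }
        long t0 = System.nanoTime();
        long rowSum = sumRowMajor(grid);
        long t1 = System.nanoTime();
        long colSum = sumColumnMajor(grid);
        long t2 = System.nanoTime();
        System.out.println("row-major:    sum=" + rowSum + " in " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("column-major: sum=" + colSum + " in " + (t2 - t1) / 1_000_000 + " ms");
    }
}
```

Both methods return the same sum; only the memory access pattern differs.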

One of the major design points behind the SPARC T-series processors is to address CPU cache misses by introducing multiple hardware threads per core.

UltraSPARC T1 has four hardware threads per core and comes in four, six, or eight cores per CPU.
8 x 4 = 32
An UltraSPARC T1 processor with eight cores looks like a 32-processor system from an operating system viewpoint.
That is, the operating system views each of the four hardware threads per core as a processor.
Of the four hardware threads per core, only one executes on a given clock cycle, because there is only one pipeline per core.
However, when a long latency event occurs, such as a CPU cache miss, if there is another runnable hardware thread in the same UltraSPARC T1 core, that hardware thread executes on the next clock cycle.
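The arithmetic above (cores times hardware threads per core) is also what a Java program ends up seeing, since `Runtime.getRuntime().availableProcessors()` reports the number of processors available to the JVM, which on such a system would normally be the logical processors exposed by the operating system. A small sketch, with `LogicalProcessors` and `logicalProcessors` as made-up names:

```java
// Sketch: logical processors as seen by the OS, and by this JVM.
// LogicalProcessors is a hypothetical name for illustration.
final class LogicalProcessors {

    // Each hardware thread is presented to the OS as one processor.
    static int logicalProcessors(int cores, int hardwareThreadsPerCore) {
        return cores * hardwareThreadsPerCore;
    }

    public static void main(String[] args) {
        // UltraSPARC T1 variants from the text: 4, 6 or 8 cores, 4 threads each.
        for (int cores : new int[] {4, 6, 8}) {
            System.out.println(cores + " cores -> "
                + logicalProcessors(cores, 4) + " logical processors");
        }
        // What the JVM on the current machine reports.
        System.out.println("this JVM sees "
            + Runtime.getRuntime().availableProcessors() + " processors");
    }
}
```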

Does that mean the T1 won't switch threads at all in those situations?

In contrast, other modern CPUs with a single hardware thread per core, or even hyperthreaded cores, will block on long latency events such as CPU cache misses and may waste clock cycles while waiting for a long latency event to be satisfied.

I don't understand this.
In other modern CPUs, if another runnable application thread is ready to run and no other hardware threads are available, a thread context switch must occur before another runnable application thread can execute.
So, by comparison, the T1 always has spare hardware threads available, and the CPU just picks one and keeps running, right? But then aren't those spare threads a waste in themselves? I don't understand.
Thread context switches generally take hundreds of clock cycles to complete. Hence, on a highly threaded application with many threads ready to execute, the SPARC T-series processors have the capability to execute the application faster as a result of their capability to switch to another runnable thread within a core on the next clock cycle. The capability to have multiple hardware threads per core and switch to a different runnable hardware thread in the same core on the next clock cycle comes at the expense of a CPU with a slower clock rate. In other words, CPUs such as the SPARC T-series processors that have multiple hardware threads tend to execute at a lower clock rate than other modern CPUs that have a single hardware thread per core or do not offer the capability to switch to another runnable hardware thread on a subsequent clock cycle.
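To see how a slower clock can still win, here is a back-of-the-envelope model: total time is useful work plus switching overhead, divided by the clock rate. Every number in it is an assumption picked for illustration (300 cycles per OS context switch at 2 GHz versus 1 cycle per hardware thread switch at 1 GHz), and the model ignores everything else, including the stall time itself. `SwitchCostModel` and `seconds` are hypothetical names.

```java
// Rough model of the trade-off: cheap thread switches versus a faster clock.
// All figures are illustrative assumptions, not measurements of real CPUs.
final class SwitchCostModel {

    // Seconds to finish = (useful cycles + switches * cycles per switch) / clock rate.
    static double seconds(long workCycles, long switches, long cyclesPerSwitch, double clockHz) {
        return (workCycles + switches * cyclesPerSwitch) / clockHz;
    }

    public static void main(String[] args) {
        long workCycles = 1_000_000L; // assumed useful work
        long switches = 10_000L;      // assumed number of thread switches

        // Assumed: OS context switch costs 300 cycles on a 2 GHz CPU.
        double conventional = seconds(workCycles, switches, 300L, 2.0e9);
        // Assumed: hardware thread switch costs 1 cycle on a 1 GHz CPU.
        double multiThreadedCore = seconds(workCycles, switches, 1L, 1.0e9);

        System.out.println("context switching at 2 GHz:     " + conventional + " s");
        System.out.println("hardware threads at 1 GHz:      " + multiThreadedCore + " s");
    }
}
```

With these assumed figures the slower-clocked CPU finishes first because switching dominates the work. With few switches the result flips, which fits the point that the benefit shows up on highly threaded applications.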
