- The application may be running on an inappropriate CPU architecture or system.
That's interesting. Java applications are supposed to be cross-platform, meaning they can run on different CPU architectures and operating systems. According to this, however, some tuning may still be needed for an application to perform well in a new environment.
- Multiple cores per CPU
- Multiple threads per core (CMT, chip multithreading)
Searching Wikipedia with CMT as the keyword, I found the following related material, although I am not sure it describes exactly the same thing.
In processor design, there are two ways to increase on-chip parallelism with fewer resource requirements: one is the superscalar technique, which tries to increase instruction-level parallelism (ILP); the other is the multithreading approach, which exploits thread-level parallelism (TLP).
Superscalar means executing multiple instructions at the same time while chip-level multithreading (CMT) executes instructions from multiple threads within one processor chip at the same time.
Interleaved multithreading: Interleaved issue of multiple instructions from different threads, also referred to as temporal multithreading. It can be further divided into fine-grain multithreading and coarse-grain multithreading, depending on the frequency of interleaved issues. Fine-grain multithreading, such as in a barrel processor, issues instructions for different threads after every cycle, while coarse-grain multithreading switches to issuing instructions from another thread only when the currently executing thread causes a long-latency event (such as a page fault). Coarse-grain multithreading is more common because it involves fewer context switches between threads. For example, Intel's Montecito processor uses coarse-grain multithreading, while Sun's UltraSPARC T1 uses fine-grain multithreading. For processors that have only one pipeline per core, interleaved multithreading is the only possible way, because the core can issue at most one instruction per cycle.
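To make the two interleaving policies more concrete for myself, here is a toy sketch in Java. It is only an illustration of the issue order, not a model of any real processor: the class name `InterleavingSketch`, the method names, and the stall pattern are all made up for this example, and it assumes the next thread is always ready to run.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model: records which hardware thread issues in each cycle.
public class InterleavingSketch {

    // Fine-grain: rotate to the next hardware thread after every cycle.
    static List<Integer> fineGrain(int threads, int cycles) {
        List<Integer> issued = new ArrayList<>();
        for (int cycle = 0; cycle < cycles; cycle++) {
            issued.add(cycle % threads);
        }
        return issued;
    }

    // Coarse-grain: stay on the current thread until the instruction issued
    // in a cycle triggers a long-latency event, then move to the next thread.
    // Assumption: the next thread is always runnable.
    static List<Integer> coarseGrain(int threads, boolean[] stallAtCycle) {
        List<Integer> issued = new ArrayList<>();
        int current = 0;
        for (boolean stall : stallAtCycle) {
            issued.add(current);
            if (stall) {
                current = (current + 1) % threads;
            }
        }
        return issued;
    }

    public static void main(String[] args) {
        // Prints [0, 1, 2, 3, 0, 1, 2, 3]
        System.out.println(fineGrain(4, 8));

        // Long-latency events in cycles 2 and 6 (made-up pattern).
        boolean[] stalls = {false, false, true, false, false, false, true, false};
        // Prints [0, 0, 0, 1, 1, 1, 1, 2]
        System.out.println(coarseGrain(4, stalls));
    }
}
```

With four threads, the fine-grain schedule changes thread every cycle, while the coarse-grain schedule only moves on after a stall.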
- CPU cache misses
- One of the major design points behind the SPARC T-series processors is to address CPU cache misses by introducing multiple hardware threads per core.
- UltraSPARC T1 has four hardware threads per core and comes in four, six, or eight cores per CPU.
- 8 x 4 = 32
- An UltraSPARC T1 processor with eight cores looks like a 32-processor system from an operating system viewpoint.
- That is, the operating system views each of the four hardware threads per core as a processor.
- Only one of the four hardware threads per core executes on a given clock cycle, because there is only one pipeline per core.
- However, when a long latency event occurs, such as a CPU cache miss, if there is another runnable hardware thread in the same UltraSPARC T1 core, that hardware thread executes on the next clock cycle.
- Does this mean the T1 won't do a thread context switch at all in those situations?
- In contrast, other modern CPUs with a single hardware thread per core, or even hyperthreaded cores, will block on long latency events such as CPU cache misses and may waste clock cycles while waiting for a long latency event to be satisfied.
- I don't understand this.
- In other modern CPUs, if another runnable application thread is ready to run and no other hardware threads are available, a thread context switch must occur before another runnable application thread can execute.
- So, compared to those CPUs, the T1 always has spare hardware threads available, and the CPU just picks one and keeps running, right? But aren't the spare hardware threads a waste in themselves? I don't understand.
- Thread context switches generally take hundreds of clock cycles to complete. Hence, on a highly threaded application with many threads ready to execute, the SPARC T-series processors have the capability to execute the application faster as a result of their capability to switch to another runnable thread within a core on the next clock cycle. The capability to have multiple hardware threads per core and switch to a different runnable hardware thread in the same core on the next clock cycle comes at the expense of a CPU with a slower clock rate. In other words, CPUs such as the SPARC T-series processors that have multiple hardware threads tend to execute at a lower clock rate than other modern CPUs that have a single hardware thread per core or do not offer the capability to switch to another runnable hardware thread on a subsequent clock cycle.
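To check my understanding of the numbers above, here is a small back-of-the-envelope sketch in Java. The class and method names are hypothetical, and the concrete figures are assumptions for illustration only: 300 cycles stands in for "hundreds of clock cycles" per context switch, and 1 cycle stands in for "the next clock cycle" on the T1. It deliberately ignores the slower clock rate mentioned above. `Runtime.getRuntime().availableProcessors()` is the standard Java call that reports how many processors the JVM sees; on an eight-core UltraSPARC T1 I would expect it to report 32, but I have not verified that myself.

```java
// Hypothetical sketch: arithmetic from the notes above, not a benchmark.
public class HardwareThreadSketch {

    // Processors as seen by the operating system:
    // cores per CPU multiplied by hardware threads per core.
    static int logicalProcessors(int cores, int threadsPerCore) {
        return cores * threadsPerCore;
    }

    // Cycles spent switching threads, if every long-latency event
    // (for example a CPU cache miss) forces one switch.
    static long switchOverheadCycles(long longLatencyEvents, long cyclesPerSwitch) {
        return longLatencyEvents * cyclesPerSwitch;
    }

    public static void main(String[] args) {
        // Eight cores with four hardware threads each: prints 32
        System.out.println(logicalProcessors(8, 4));

        // What the JVM reports on the machine running this code.
        System.out.println(Runtime.getRuntime().availableProcessors());

        long events = 1_000;            // assumed number of cache misses
        long contextSwitchCycles = 300; // assumed: "hundreds of clock cycles"
        long hardwareSwitchCycles = 1;  // assumed: "the next clock cycle"

        // Prints 300000
        System.out.println(switchOverheadCycles(events, contextSwitchCycles));
        // Prints 1000
        System.out.println(switchOverheadCycles(events, hardwareSwitchCycles));
    }
}
```

Under these assumed figures the switching overhead differs by a factor of 300, which is the advantage the book describes for highly threaded applications. Whether it outweighs the lower clock rate depends on the workload.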