30.12.11

How to make reusable components/features/blocks or whatever?

Today our development group had a short discussion about a new feature of our system for the next release. As soon as I understood the requirement, I proposed a solution: just reuse one of the features I implemented about 15 months ago, which has been working ever since, without any changes after I finished it.

To be honest, that feature was designed for a totally different purpose. However, when I designed the module, I thought about it in a very generic way; I almost forgot the feature itself and just worked with an abstract requirement.

Developing a specific feature is much easier, because it is straightforward; I totally agree with this. I don't try to make things very generic to support many scenarios. Instead, I simply abstract the requirement, removing most of the specific scenarios from my scope, and try to support my abstract scenario. In other words, I still support only one scenario, but an abstract one. This way, I avoid the problem of developing a GENERIC system by considering too many things; yet my solution does support many scenarios.

Single Responsibility Principle


  • Adherence to a single concern makes the class robust and limits the chances that it will need to be modified.

28.12.11

More complicated optimizations


  • Execution of loops
  • Range check elimination
  • Loop unrolling
  • Loop invariant code motion
In compiler theory, loop optimization plays an important role in improving cache performance, making effective use of parallel processing capabilities, and reducing overheads associated with executing loops. Most execution time of a scientific program is spent on loops. Thus a lot of compiler analysis and compiler optimization techniques have been developed to make the execution of loops faster.
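To make one of these concrete, here is a small sketch of my own (not from the quoted article) showing the rewrite that loop-invariant code motion effectively performs:

public class LoopOpts {
    // Before: the invariant expression limit * 2 is evaluated on every iteration.
    static int sumNaive(int[] data, int limit) {
        int sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i] + limit * 2; // invariant subexpression inside the loop
        }
        return sum;
    }

    // After loop-invariant code motion: the compiler effectively rewrites it like this.
    static int sumHoisted(int[] data, int limit) {
        int sum = 0;
        int invariant = limit * 2; // computed once, outside the loop
        for (int i = 0; i < data.length; i++) {
            sum += data[i] + invariant;
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3};
        System.out.println(sumNaive(data, 10) == sumHoisted(data, 10)); // true
    }
}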


The Most Basic Classes of Optimizations


  • Simple identity transformations
  • Constant folding
  • Common subexpression elimination
  • Inlining of functions
The identity transform is a data transformation that copies the source data into the destination data without change.

The identity transformation is considered an essential process in creating a reusable transformation library. By creating a library of variations of the base identity transformation, a variety of data transformation filters can be easily maintained. These filters can be chained together in a format similar to UNIX shell pipes.
That is not the one I am looking for.

In mathematics, an identity function, also called identity map or identity transformation, is a function that always returns the same value that was used as its argument. In terms of equations, the function is given by f(x) = x.

So, Simple Identity Transformation means the compiler picks out some obvious code or block that can easily be transformed into other, more optimal code.

Constant folding and constant propagation are related compiler optimizations used by many modern compilers. An advanced form of constant propagation known as sparse conditional constant propagation can more accurately propagate constants and simultaneously remove dead code.

Constant folding

Constant folding is the process of simplifying constant expressions at compile time. Terms in constant expressions are typically simple literals, such as the integer 2, but can also be variables whose values are never modified, or variables explicitly marked as constant. Consider the statement: 

i = 320 * 200 * 32;


Most modern compilers would not actually generate two multiply instructions and a store for this statement. Instead, they identify constructs such as these, and substitute the computed values at compile time (in this case, 2,048,000), usually in the intermediate representation (IR) tree.

Common subexpression elimination - Wikipedia, the free encyclopedia
In computer science, common subexpression elimination (CSE) is a compiler optimization that searches for instances of identical expressions (i.e., they all evaluate to the same value), and analyses whether it is worthwhile replacing them with a single variable holding the computed value.
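A small sketch of my own showing the kind of rewrite CSE performs (the compiler does this internally; writing it out by hand is just for illustration):

public class Cse {
    static int before(int a, int b, int c) {
        int d1 = a * b + c;
        int d2 = a * b - c; // a * b is an identical subexpression
        return d1 * d2;
    }

    // After CSE: the compiler holds a * b in a temporary and reuses it.
    static int after(int a, int b, int c) {
        int t = a * b;      // computed once
        int d1 = t + c;
        int d2 = t - c;
        return d1 * d2;
    }

    public static void main(String[] args) {
        System.out.println(before(3, 4, 5) == after(3, 4, 5)); // true
    }
}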

hs_err_pid.log and OutOfMemoryError


  • An OutOfMemoryError also triggers the hs_err_pid<pid>.log file to be generated
That's not true: I tried it with the simple program below, and the log file just isn't generated.


import java.util.*;

public class HS {
    public static void main(String [] args) {
        // Keep allocating 1 MB blocks until the heap is exhausted and an
        // OutOfMemoryError is thrown; no hs_err_pid<pid>.log file appears.
        List<byte[]> list = new ArrayList<byte[]>();
        for(;;) {
            list.add(new byte[1024 * 1024]);
        }
    }
}

Thread Manager in Java

openjdk.java.net/groups/hotspot/docs/RuntimeOverview.html#VM%20Lifecycle


Thread Management

Thread management covers all aspects of the thread lifecycle, from creation through to termination, and the coordination of threads within the VM. This involves management of threads created from Java code (whether application code or library code), native threads that attach directly to the VM, or internal VM threads created for a range of purposes. While the broader aspects of thread management are platform independent, the details necessarily vary depending on the underlying operating system.

[Notes]
  • Thread types:
    • Threads created from Java Code.
    • Native Threads
    • Internal VM Threads
  • Platform independent vs. platform dependent
    • The details necessarily vary depending on the underlying operating system.


Threading Model

The basic threading model in Hotspot is a 1:1 mapping between Java threads (an instance of java.lang.Thread) and native operating system threads. The native thread is created when the Java thread is started, and is reclaimed once it terminates. The operating system is responsible for scheduling all threads and dispatching to any available CPU.
The relationship between Java thread priorities and operating system thread priorities is a complex one that varies across systems. These details are covered later.

[Notes]
  • 1:1 mapping!!!
  • The operating system is responsible for scheduling all threads and dispatching to any available CPU.


Thread Creation and Destruction

There are two basic ways for a thread to be introduced into the VM: execution of Java code that calls start() on a java.lang.Thread object; or attaching an existing native thread to the VM using JNI. Other threads created by the VM for internal purposes are discussed below.
There are a number of objects associated with a given thread in the VM (remembering that Hotspot is written in the C++ object-oriented programming language):
  • The java.lang.Thread instance that represents a thread in Java code
  • JavaThread instance that represents the java.lang.Thread instance inside the VM. It contains additional information to track the state of the thread. A JavaThread holds a reference to its associated java.lang.Thread object (as an oop), and the java.lang.Thread object also stores a reference to its JavaThread (as a raw int). A JavaThread also holds a reference to its associated OSThread instance.
  • An OSThread instance represents an operating system thread, and contains additional operating-system-level information needed to track thread state. The OSThread then contains a platform specific “handle” to identify the actual thread to the operating system
When a java.lang.Thread is started the VM creates the associated JavaThread and OSThread objects, and ultimately the native thread. After preparing all of the VM state (such as thread-local storage and allocation buffers, synchronization objects and so forth) the native thread is started. The native thread completes initialization and then executes a start-up method that leads to the execution of the java.lang.Thread object's run() method, and then, upon its return, terminates the thread after dealing with any uncaught exceptions, and interacting with the VM to check if termination of this thread requires termination of the whole VM. Thread termination releases all allocated resources, removes the JavaThread from the set of known threads, invokes destructors for the OSThread and JavaThread and ultimately ceases execution when its initial startup method completes.
A native thread attaches to the VM using the JNI call AttachCurrentThread. In response to this an associated OSThread and JavaThread instance is created and basic initialization is performed. Next a java.lang.Thread object must be created for the attached thread, which is done by reflectively invoking the Java code for the Thread class constructor, based on the arguments supplied when the thread attached. Once attached, a thread can invoke whatever Java code it needs to via the other JNI methods available. Finally when the native thread no longer wishes to be involved with the VM it can call the JNI DetachCurrentThread method to disassociate it from the VM (release resources, drop the reference to the java.lang.Thread instance, destruct the JavaThread and OSThread objects and so forth).
A special case of attaching a native thread is the initial creation of the VM via the JNI CreateJavaVM call, which can be done by a native application or by the launcher (java.c). This causes a range of initialization operations to take place and then acts effectively as if a call to AttachCurrentThread was made. The thread can then invoke Java code as needed, such as reflective invocation of the main method of an application. See the JNI section for further details.

Thread States

The VM uses a number of different internal thread states to characterize what each thread is doing. This is necessary both for coordinating the interactions of threads, and for providing useful debugging information if things go wrong. A thread's state transitions as different actions are performed, and these transition points are used to check that it is appropriate for a thread to proceed with the requested action at that point in time – see the discussion of safepoints below.
The main thread states from the VM perspective are as follows:
  • _thread_new: a new thread in the process of being initialized
  • _thread_in_Java: a thread that is executing Java code
  • _thread_in_vm: a thread that is executing inside the VM
  • _thread_blocked: the thread is blocked for some reason (acquiring a lock, waiting for a condition, sleeping, performing a blocking I/O operation and so forth)
For debugging purposes, additional state information is also maintained for reporting by tools, in thread dumps, stack traces, etc. This is maintained in the OSThread, and some of it has fallen into disuse, but states reported in thread dumps etc. include:
  • MONITOR_WAIT: a thread is waiting to acquire a contended monitor lock
  • CONDVAR_WAIT: a thread is waiting on an internal condition variable used by the VM (not associated with any Java level object)
  • OBJECT_WAIT: a thread is performing an Object.wait() call
Other subsystems and libraries impose their own state information, such as the JVMTI system and the ThreadState exposed by the java.lang.Thread class itself. Such information is generally not accessible to, nor relevant to, the management of threads inside the VM.
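The ThreadState exposed by java.lang.Thread is easy to observe directly; a quick sketch of my own:

public class ThreadStates {
    public static void main(String[] args) throws Exception {
        Thread sleeper = new Thread(new Runnable() {
            public void run() {
                try { Thread.sleep(5000); } catch (InterruptedException e) { }
            }
        });
        System.out.println(sleeper.getState()); // NEW
        sleeper.start();
        Thread.sleep(100);                      // give it time to reach sleep()
        System.out.println(sleeper.getState()); // TIMED_WAITING
        sleeper.interrupt();
        sleeper.join();
        System.out.println(sleeper.getState()); // TERMINATED
    }
}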

Internal VM Threads

People are often surprised to discover that even executing a simple “Hello World” program can result in the creation of a dozen or more threads in the system. These arise from a combination of internal VM threads, and library related threads (such as reference handler and finalizer threads). The main kinds of VM threads are as follows:
  • VM thread: This singleton instance of VMThread is responsible for executing VM operations, which are discussed below
  • Periodic task thread: This singleton instance of WatcherThread simulates timer interrupts for executing periodic operations within the VM
  • GC threads: These threads, of different types, support parallel and concurrent garbage collection
  • Compiler threads: These threads perform runtime compilation of bytecode to native code
  • Signal dispatcher thread: This thread waits for process directed signals and dispatches them to a Java level signal handling method
All threads are instances of the Thread class, and all threads that execute Java code are JavaThread instances (a subclass of Thread). The VM keeps track of all threads in a linked list known as the Threads_list, which is protected by the Threads_lock, one of the key synchronization locks used within the VM.
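You can see part of this for yourself. A sketch of my own that dumps the Java-visible threads of a trivial program (purely internal threads such as the VMThread and GC threads are not JavaThreads and will not show up here, but library threads such as Reference Handler and Finalizer do):

import java.util.Map;

public class ThreadDump {
    public static void main(String[] args) {
        // Even this trivial program shows several threads besides main.
        for (Map.Entry<Thread, StackTraceElement[]> e
                : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            System.out.println(t.getName() + " (daemon=" + t.isDaemon() + ")");
        }
    }
}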

Java Performance - Memory and Runtime Analysis - Tutorial

Java Performance - Memory and Runtime Analysis - Tutorial

The right way to optimize

The Hotspot Virtual Machine

This article is pretty old. But still there are some points there.


The right way to optimize

Of course, the basic principles of optimizing any kind of software apply to Java. Java programmers who live by these principles have little cause to choose performance tweaks over good designs, except in rare cases. Unfortunately, not every Java programmer subscribes to these principles.
The fundamental principle of optimization is: Don't optimize until you know you have a problem. As Donald Knuth once said, "Premature optimization is the root of all evil." In general, you should forget about optimization and just create good quality designs and clear code. After you get your well-designed program working, if you then find that its performance is lacking, that is the time to optimize.
Another basic principle is: Measure the program before and after your optimization efforts. If you find that a particular effort did not make a significant improvement in the program's performance, then revert back to the original, clear code.
A third principle takes into account that most programs spend 80 to 90 percent of their time executing 10 to 20 percent of the code: You should profile the program to isolate the code that really matters to performance (that 10 to 20 percent), and just focus your optimization efforts there.
Once you isolate the time-critical areas of your program, you should first try to devise a better algorithm, use APIs in a smarter way, or use standard code optimization techniques such as strength reduction, common sub-expression elimination, code motion, and loop unrolling. Only as a last resort should you sacrifice good object-oriented, thread-safe design and maintainable code in the name of performance.

27.12.11

HotSpot Runtime / DestroyJavaVM

openjdk.java.net/groups/hotspot/docs/RuntimeOverview.html#VM%20Lifecycle

DestroyJavaVM

This method can be called from the launcher to tear down the VM; it can also be called by the VM itself when a very serious error occurs.
The tear down of the VM takes the following steps:
  1. Wait until we are the last non-daemon thread to execute, noting that the VM is still functional.
  2. Call java.lang.Shutdown.shutdown(), which will invoke Java level shutdown hooks and run finalizers if finalization-on-exit is set.
  3. Call before_exit(), prepare for VM exit, run VM level shutdown hooks (they are registered through JVM_OnExit()), and stop the Profiler, StatSampler, Watcher and GC threads. Post the status events to JVMTI/PI, disable JVMPI, and stop the Signal thread.
  4. Call JavaThread::exit() to release JNI handle blocks, remove stack guard pages, and remove this thread from the Threads list. From this point on we cannot execute any more Java code.
  5. Stop the VM thread; it will bring the remaining VM to a safepoint and stop the compiler threads. At a safepoint, care must be taken not to use anything that could get blocked by a safepoint.
  6. Disable tracing at JNI/JVM/JVMPI barriers.
  7. Set the _vm_exited flag for threads that are still running native code.
  8. Delete this thread.
  9. Call exit_globals(), which deletes IO and PerfMemory resources.
  10. Return to caller.

Hotspot VM Overview / JNI_CreateJavaVM

openjdk.java.net/groups/hotspot/docs/RuntimeOverview.html#VM%20Lifecycle

JNI_CreateJavaVM

The JNI invocation method performs the following:
  1. Ensures that no two threads call this method at the same time and that no two VM instances are created in the same process. Note that a VM cannot be created in the same process space once a certain point in initialization is reached, the “point of no return”, because the VM creates static data structures that cannot be re-initialized at this time.
  2. Checks to make sure the JNI version is supported, and that the ostream is initialized for gc logging. The OS modules are initialized, such as the random number generator, the current pid, high-resolution time, memory page sizes, and the guard pages.
  3. The arguments and properties passed in are parsed and stored away for later use. The standard java system properties are initialized.
  4. The OS modules are further created and initialized, based on the parsed arguments and properties, for synchronization, stack, memory, and safepoint pages. At this time other libraries such as libzip, libhpi, libjava, and libthread are loaded, signal handlers are initialized and set, and the thread library is initialized.
  5. The output stream logger is initialized. Any agent libraries (hprof, jdi) required are initialized and started.
  6. The thread states and the thread local storage (TLS), which holds several pieces of thread specific data required for the operation of threads, are initialized.
  7. The global data is initialized as part of the init phase, such as the event log, OS synchronization primitives, perfMemory (performance memory), and chunkPool (memory allocator).
  8. At this point, we can create Threads. The Java version of the main thread is created and attached to the current OS thread. However this thread will not yet be added to the known list of the Threads. The Java level synchronization is initialized and enabled.
  9. The rest of the global modules are initialized, such as the BootClassLoader, CodeCache, Interpreter, Compiler, JNI, SystemDictionary, and Universe. Note that we have now reached our “point of no return”, i.e., we can no longer create another VM in the same process address space.
  10. The main thread is added to the list, by first locking the Thread_Lock. The Universe, a set of required global data structures, is sanity checked. The VMThread, which performs all the VM's critical functions, is created. At this point the appropriate JVMTI events are posted to notify the current state.
  11. The following classes are loaded and initialized: java.lang.String, java.lang.System, java.lang.Thread, java.lang.ThreadGroup, java.lang.reflect.Method, java.lang.ref.Finalizer, java.lang.Class, and the rest of the System classes. At this point, the VM is initialized and operational, but not yet fully functional.
  12. The Signal Handler thread is started, the compilers are initialized, and the CompileBroker thread is started. The other helper threads, StatSampler and WatcherThread, are started; at this time the VM is fully functional, the JNIEnv is populated and returned to the caller, and the VM is ready to service new JNI requests.

Hotspot Runtime Overview / VM Lifecycle

openjdk.java.net/groups/hotspot/docs/RuntimeOverview.html

A very good resource for learning how the JVM works. Maybe even better than my book. And I suspect both this web page and the book were from the same writer(s).

  • There are several HotSpot VM launchers in the Java Standard Edition; the general purpose launcher typically used is the java command on Unix, and the java and javaw commands on Windows.
    • java
    • javaw
    • javaws
    • appletviewer
  • The launcher operations pertaining to VM startup are:
    • Parse the command line options, 
      • some of the command line options are consumed by the launcher itself, 
        • for example -client or -server is used to determine and load the appropriate VM library, 
      • others are passed to the VM using JavaVMInitArgs.
    • Establish the heap sizes and the compiler type (client or server) if these options are not explicitly specified on the command line.
    • Establish the environment variables such as LD_LIBRARY_PATH and CLASSPATH.
    • If the java Main-Class is not specified on the command line it fetches the Main-Class name from the JAR's manifest.
    • Creates the VM using JNI_CreateJavaVM in a newly created thread (non primordial thread). 
      • Note: creating the VM in the primordial thread greatly reduces the ability to customize the VM, for example the stack size on Windows, and many other limitations
    • Once the VM is created and initialized, the Main-Class is loaded, and the launcher gets the main method's attributes from the Main-Class.
    • The java main method is then invoked in the VM using CallStaticVoidMethod, using the marshalled arguments from the command line.
    • Once the java main method completes, it is very important to check and clear any pending exceptions that may have occurred, and also to pass back the exit status. The exception is cleared by calling ExceptionOccurred; the return value of this method is 0 if successful, any other value otherwise, and this value is passed back to the calling process.
      • Maybe I need to dig a little bit into this part, which is not very intuitive to me so far.
    • The main thread is detached using DetachCurrentThread; by doing so we decrement the thread count so that DestroyJavaVM can be called safely, and also ensure that the thread is not performing operations in the VM and that there are no active Java frames on its stack.

The Invocation API

The Invocation API

This is some content from JDK 1.4's documentation.

  • The Invocation API allows software vendors to load the Java VM into an arbitrary native application. Vendors can deliver Java-enabled applications without having to link with the Java VM source code.
  • This chapter begins with an overview of the Invocation API. This is followed by reference pages for all Invocation API functions.
  • To enhance the embeddability of the Java VM, the Invocation API is extended in JDK 1.1.2 in a few minor ways.

And this is from JDK 1.6.
  • The Invocation API allows software vendors to load the Java VM into an arbitrary native application. Vendors can deliver Java-enabled applications without having to link with the Java VM source code.
  • This chapter begins with an overview of the Invocation API. This is followed by reference pages for all Invocation API functions.
Not too many changes.


A Simple Test for Distributed Lock with ActiveMQ's Exclusive Consumer Feature

import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.Message;
import javax.jms.MessageListener;
import javax.jms.Queue;
import javax.jms.Session;

import org.apache.activemq.ActiveMQConnectionFactory;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

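/*
 * My reading of this test: the queue is created with consumer.exclusive=true,
 * so the broker dispatches messages to only one consumer at a time. The single
 * message acts as the "lock": the listener that receives it holds the lock,
 * and closing its session without acknowledging makes the broker redeliver the
 * message to the next exclusive consumer. All 100 listeners therefore run one
 * after another, each releasing the semaphore once, until sem.acquire(100)
 * succeeds.
 */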
public class DistributedLockTest
{
    private static final String BROKER_URL = "vm://dlt?broker.persistent=false&broker.useJmx=false";
    private final Semaphore sem = new Semaphore(0);
    
    @Before
    public void setUp() throws Exception
    {
    }

    @After
    public void tearDown() throws Exception
    {
    }
    
    static interface Closable
    {
        void close() throws Exception;
    }
    
    class MyConn implements Closable
    {
        Connection conn;
        Session session;
        
        public MyConn(ConnectionFactory factory) throws Exception
        {
            conn = factory.createConnection();
            conn.start();            
        }
        
        public void start() throws Exception
        {
            session = conn.createSession(false, Session.CLIENT_ACKNOWLEDGE);
            Queue queue = session.createQueue("queue?consumer.exclusive=true");
            session.createConsumer(queue).setMessageListener(getListener(this));
        }
        
        public void close() throws Exception
        {
            session.close();
        }
    }
    
    @Test(timeout=2000)
    public void test() throws Exception
    {
        ActiveMQConnectionFactory factory = new ActiveMQConnectionFactory(BROKER_URL);
        Connection conn1 = factory.createConnection();
        conn1.start();
        
        Session producerSession = conn1.createSession(false, Session.CLIENT_ACKNOWLEDGE);
        Queue queue = producerSession.createQueue("queue?consumer.exclusive=true");
        producerSession.createProducer(queue).send(producerSession.createMessage());
        //producerSession.close();

        for(int k=0; k<100; ++k)
        {
            MyConn my = new MyConn(factory);
            my.start();
        }

        
        //TimeUnit.MINUTES.sleep(1);
        sem.acquire(100);
    }

    private AtomicInteger idCounter = new AtomicInteger(1);
    
    private MessageListener getListener(final Closable closable)
    {
        return new MessageListener()
        {
            private final int id = idCounter.getAndIncrement();
            
            @Override
            public void onMessage(Message message)
            {
                try 
                {
                    System.out.println(String.format("ID = %04d - Begin", id));                    
                    //TimeUnit.SECONDS.sleep(1);
                    closable.close();
                    sem.release();
                    System.out.println(String.format("ID = %04d - End", id));
                }
                catch(Exception ex)
                {
                    // Don't swallow failures; the semaphore would never be released.
                    ex.printStackTrace();
                }
            }
        };
    }

}

ActiveMQ - Virtual Destination

Apache ActiveMQ ™ -- Virtual Destinations

JMS Queue Depth Monitoring using JMX


Apache ActiveMQ ™ -- How do I find the Size of a Queue

You can view the queue depth using the MBeans in ActiveMQ 4.x. Use any JMX management console to see the statistics. See How can I monitor ActiveMQ.

You can also browse the contents of a queue using the JMS QueueBrowser.

Or you can access statistics programmatically.
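Here is a hedged sketch of the programmatic route over JMX (the object-name format follows the ActiveMQ 5.x convention; the broker name, queue name, and JMX URL below are placeholder assumptions):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class QueueDepth {
    public static void main(String[] args) throws Exception {
        // Assumes the broker was started with JMX enabled on port 1099.
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://localhost:1099/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        MBeanServerConnection mbsc = connector.getMBeanServerConnection();

        // Queue MBean name; broker and destination names are placeholders.
        ObjectName queue = new ObjectName(
            "org.apache.activemq:BrokerName=localhost,Type=Queue,Destination=TEST.QUEUE");
        Long depth = (Long) mbsc.getAttribute(queue, "QueueSize");
        System.out.println("Queue depth: " + depth);

        connector.close();
    }
}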

26.12.11

Performance Object


The Windows Server 2003 family of operating systems obtains performance data from components in your computer as those components are utilized. That data is described as a performance object and is typically named for the component generating the data. For example, the Processor object is a collection of performance data about processors on your system.

Performance objects are built into the operating system, typically corresponding to the major hardware components such as memory, processors, and so on. Other programs might install their own performance objects. For example, services such as Windows Internet Name Service (WINS) or server programs such as Microsoft Exchange provide performance objects, and performance graphs and logs can monitor these objects.

Each performance object provides performance counters that represent data on specific aspects of a system or service. For example, the Pages/sec counter provided by the Memory object tracks the rate of memory paging.

CPI & IPC


  • IPC

Java Performance - Ideal CPU Utilization?


  • High kernel or system CPU utilization can be an indication of shared resource contention or a large number of interactions between I/O devices.
  • The ideal situation for maximum application performance and scalability is to have 0% kernel or system CPU utilization since CPU cycles spent executing in operating system kernel code are CPU cycles that could be utilized by application code.
  • Hence, one of the objectives to achieving maximum application performance and scalability is to reduce kernel or system CPU utilization as much as possible.
This is something I didn't know before.

Understanding CPU Utilization


  • Utilization
Utilization is a statistical concept (Queueing Theory) as well as a primary business measure for the rental industry.

  • CPU load vs. CPU utilization
The comparative study of different load indices carried out by Ferrari et al.[1] reported that CPU load information based upon the CPU queue length does much better in load balancing compared to CPU utilization. The reason CPU queue length did better is probably because when a host is heavily loaded, its CPU utilization is likely to be close to 100% and it is unable to reflect the exact load level of the utilization. In contrast, CPU queue lengths can directly reflect the amount of load on a CPU. As an example, two systems, one with 3 and the other with 6 processes in the queue, will probably have utilizations close to 100% although they obviously differ.

  • High CPU Utilization in Windows

This month, I show you how to troubleshoot situations in which your server is sluggish or unresponsive because of high CPU utilization. When a server's CPU or CPUs are working at or above 80 percent to 90 percent utilization, applications on the server can become sluggish or stop responding completely. When this situation occurs, you need to determine which process is monopolizing the CPUs.

  • PC Guide - CPU Utilization
CPU utilization is one of those performance factors that is both grossly underrated and overrated at the same time. :^) Most people have never even heard of it; it often seems though that a big percentage of those who do understand its role worry about it way too much. :^) Like most performance issues, sweating small differences in numbers is usually pointless; it doesn't matter much if your CPU utilization is 5% or 10%; but if it is 80% or 90% then you are going to see an impact on the usability of the system if you multitask.

  • Windows Server Performance Team Blog - Interpreting CPU Utilization for Performance Analysis
CPU hardware and features are rapidly evolving, and your performance testing and analysis methodologies may need to evolve as well. If you rely on CPU utilization as a crucial performance metric, you could be making some big mistakes interpreting the data. Read this post to get the full scoop; experts can scroll down to the end of the article for a summary of the key points.

  • IBM Informix Dynamic Server Performance Guide - CPU Utilization
You can use the resource-utilization formula from the previous section to estimate the response time for a heavily loaded CPU. However, high utilization for the CPU does not always indicate a performance problem. The CPU performs all calculations that are needed to process transactions. The more transaction-related calculations that it performs within a given period, the higher the throughput will be for that period. As long as transaction throughput is high and seems to remain proportional to CPUutilization, a high CPU utilization indicates that the computer is being used to the fullest advantage.

  • How do I Find out Linux CPU Utilization
You can see Linux CPU utilization under CPU stats. The task’s share of the elapsed CPU time since the last screen update, expressed as a percentage of total CPU time. In a true SMP environment (multiple CPUS), top will operate in number of CPUs. Please note that you need to type q key to exit the top command display.

Please note that you need to install a special package called sysstat to take advantage of the following commands. This package includes system performance tools for Linux (Red Hat Linux / RHEL includes these tools by default).


Java Performance - Definitions for Performance Engineering


  • Performance monitoring
    • Is an act of nonintrusively collecting or observing performance data from an operating or running application.
    • Monitoring is usually a preventative or proactive type of action and is usually performed in a production environment, qualification environment, or development environment.
    • Monitoring is also usually the first step in a reactive situation where an application stakeholder has reported a performance issue but has not provided sufficient information or clues as to a potential root cause.
    • In this situation, performance profiling likely follows performance monitoring.
  • Performance profiling
    • In contrast to performance monitoring, profiling is an act of collecting performance data from an operating or running application in a way that may be intrusive on application responsiveness or throughput.
    • Performance profiling tends to be a reactive type of activity, or an activity in response to a stakeholder reporting a performance issue, and usually has a more narrow focus than performance monitoring.
    • Profiling is rarely done in production environments. It is typically done in qualification, testing, or development environments and is often an act that follows a monitoring activity that indicates some kind of performance issue.
  • Performance tuning
    • In contrast to performance monitoring and performance profiling, tuning is an act of changing tunables, source code, or configuration attribute(s) for the purpose of improving application responsiveness or throughput.
    • Performance tuning often follows performance monitoring or performance profiling activities.
Reference:

Java Performance - Evaluating a System's Performance


  • A common approach used to qualify or evaluate the performance of a new system has been to place a portion of the expected target load on the system, or execute one or more micro-benchmarks and observe how the system performs or observe the amount of work the application does per some unit of time.
  • However, to evaluate the performance of a SPARC T-series processor, it must be loaded with enough concurrent application threads to keep the large number of hardware threads busy.
  • The workload needs to be large enough for the SPARC T-series to reap the benefit of switching to a different runnable thread on the next clock cycle when a long latency event, such as a CPU cache miss, occurs.
It seems to me that the author is trying to explain why the SPARC T-series performs so badly, instead of telling us the right approach to testing system performance. He paid very little attention to other platforms or CPU architectures and just tried to downplay them. Maybe that is not true, but that was my feeling when I read this section.

Java Performance - Choosing the Right Platform and Evaluating a System


  • The application may be running on an inappropriate CPU architecture or system.
That's interesting. A Java application is supposed to be cross-platform, which means it should fit itself onto different CPU architectures and operating systems. However, according to this, we need to tune something to let the application suit its new environment.

  • Multiple cores per CPU

A multi-core processor is a single computing component with two or more independent actual processors (called "cores"), which are the units that read and execute program instructions.[1] The instructions are ordinary CPU instructions like add, move data, and branch, but the multiple cores can run multiple instructions at the same time, increasing overall speed for programs amenable to parallel computing. Manufacturers typically integrate the cores onto a single integrated circuit die (known as a chip multiprocessor or CMP), or onto multiple dies in a single chip package.


  • Multiple threads per core (CMT, chip multithreading)
Using CMT as the keyword, I found something related on Wikipedia, but I am not sure whether it is the same thing.


Although many people reported that Sun Microsystems' UltraSPARC T1 (known as "Niagara" until its 14 November 2005 release) and the now defunct processor codenamed "Rock" (originally announced in 2005, but after many delays cancelled in 2009) are implementations of SPARC focused almost entirely on exploiting SMT and CMP techniques, Niagara is not actually using SMT. Sun refers to these combined approaches as "CMT", and the overall concept as "Throughput Computing". The Niagara has 8 cores, but each core has only one pipeline, so actually it uses fine-grained multithreading. Unlike SMT, where instructions from multiple threads share the issue window each cycle, the processor uses a round robin policy to issue instructions from the next active thread each cycle. This makes it more similar to a barrel processor. Sun Microsystems' Rock processor is different: it has more complex cores that have more than one pipeline.



In processor design, there are two ways to increase on-chip parallelism with less resource requirements: one is the superscalar technique, which tries to increase instruction level parallelism (ILP); the other is the multithreading approach, exploiting thread level parallelism (TLP).

Superscalar means executing multiple instructions at the same time while chip-level multithreading (CMT) executes instructions from multiple threads within one processor chip at the same time.

Interleaved multithreading: Interleaved issue of multiple instructions from different threads, also referred to as temporal multithreading. It can be further divided into fine-grain multithreading or coarse-grain multithreading depending on the frequency of interleaved issues. Fine-grain multithreading, such as in a barrel processor, issues instructions for different threads after every cycle, while coarse-grain multithreading only switches to issue instructions from another thread when the current executing thread causes some long latency events (like page fault etc.). Coarse-grain multithreading is more common for less context switch between threads. For example, Intel's Montecito processor uses coarse-grain multithreading, while Sun's UltraSPARC T1 uses fine-grain multithreading. For those processors that have only one pipeline per core, interleaved multithreading is the only possible way, because it can issue at most one instruction per cycle.


A CPU cache is a cache used by the central processing unit of a computer to reduce the average time to access memory. The cache is a smaller, faster memory which stores copies of the data from the most frequently used main memory locations. As long as most memory accesses are cached memory locations, the average latency of memory accesses will be closer to the cache latency than to the latency of main memory.  A cache miss refers to a failed attempt to read or write a piece of data in the cache, which results in a main memory access with much longer latency. There are three kinds of cache misses: instruction read miss, data read miss, and data write miss.

  • One of the major design points behind the SPARC T-series processors is to address CPU cache misses by introducing multiple hardware threads per core.
    • UltraSPARC T1 has four hardware threads per core and comes in four, six, or eight cores per CPU.
    • 8 x 4 = 32
    • An UltraSPARC T1 processor with eight cores looks like a 32-processor system from an operating system viewpoint.
    • That is, the operating system views each of the four hardware threads per core as a processor.
    • Of the four hardware threads per core, only one of the four threads per core executes on a given clock cycle - because there is only one pipeline per core.
    • However, when a long latency event occurs, such as a CPU cache miss, if there is another runnable hardware thread in the same UltraSPARC T1 core, that hardware thread executes on the next clock cycle.
      • Does it mean the T1 won't switch threads at all in those situations?
    • In contrast, other modern CPUs with a single hardware thread per core, or even hyperthreaded cores, will block on long latency events such as CPU cache misses and may waste clock cycles while waiting for a long latency event to be satisfied.
      • I don't understand this.
      • In other modern CPUs, if another runnable application thread is ready to run and no other hardware threads are available, a thread context switch must occur before another runnable application thread can execute.
      • So, compared to the T1, there are always redundant hardware threads over there and the CPU just picks one up and continues running, right? So the redundant threads are a waste in themselves, right? I don't understand.
      • Thread context switches generally take hundreds of clock cycles to complete. Hence, on a highly threaded application with many threads ready to execute, the SPARC T-series processors have the capability to execute the application faster as a result of their capability to switch to another runnable thread within a core on the next clock cycle. The capability to have multiple hardware threads per core and switch to a different runnable hardware thread in the same core on the next clock cycle comes at the expense of a CPU with a slower clock rate. In other words, CPUs such as the SPARC T-series processors that have multiple hardware threads tend to execute at a lower clock rate than other modern CPUs that have a single hardware thread per core or do not offer the capability to switch to another runnable hardware thread on a subsequent clock cycle.

Apache ActiveMQ ™ -- Advisory Message

Apache ActiveMQ ™ -- Advisory Message

ActiveMQ supports advisory messages, which allow you to watch the system using regular JMS messages. Currently we have advisory messages that support
  • consumers, producers and connections starting and stopping
  • temporary destinations being created and destroyed
  • messages expiring on topics and queues
  • brokers sending messages to destinations with no consumers.
  • connections starting and stopping
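A hedged sketch of subscribing to one of these advisory topics (the topic name follows the documented ActiveMQ.Advisory.* naming convention; the broker URL and queue name are placeholders):

import javax.jms.Connection;
import javax.jms.Message;
import javax.jms.MessageListener;
import javax.jms.Session;
import javax.jms.Topic;

import org.apache.activemq.ActiveMQConnectionFactory;

public class AdvisoryWatcher {
    public static void main(String[] args) throws Exception {
        Connection conn = new ActiveMQConnectionFactory("tcp://localhost:61616")
                .createConnection();
        conn.start();
        Session session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);

        // Advisory topic fired when consumers on the queue start and stop.
        Topic advisory = session.createTopic("ActiveMQ.Advisory.Consumer.Queue.TEST.QUEUE");
        session.createConsumer(advisory).setMessageListener(new MessageListener() {
            public void onMessage(Message message) {
                System.out.println("Consumer advisory: " + message);
            }
        });
        // Runs until killed; the JMS threads keep the process alive.
    }
}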

25.12.11

Spring Message Listener Container

Chapter 19. JMS (Java Message Service)

19.2.4. Message Listener Containers

One of the most common uses of JMS messages in the EJB world is to drive message-driven beans (MDBs). Spring offers a solution to create message-driven POJOs (MDPs) in a way that does not tie a user to an EJB container. (See the section entitled Section 19.4.2, “Asynchronous Reception - Message-Driven POJOs” for detailed coverage of Spring's MDP support.)

A message listener container is used to receive messages from a JMS message queue and drive the MessageListener that is injected into it. The listener container is responsible for all threading of message reception and dispatches into the listener for processing. A message listener container is the intermediary between an MDP and a messaging provider, and takes care of registering to receive messages, participating in transactions, resource acquisition and release, exception conversion and suchlike. This allows you as an application developer to write the (possibly complex) business logic associated with receiving a message (and possibly responding to it), and delegates boilerplate JMS infrastructure concerns to the framework.

There are three standard JMS message listener containers packaged with Spring, each with its specialised feature set.

19.2.4.1. SimpleMessageListenerContainer

This message listener container is the simplest of the three standard flavors. It simply creates a fixed number of JMS sessions at startup and uses them throughout the lifespan of the container. This container doesn't allow for dynamic adaptation to runtime demands or participate in externally managed transactions. However, it does have the fewest requirements on the JMS provider: This listener container only requires simple JMS API compliance.

19.2.4.2. DefaultMessageListenerContainer

This message listener container is the one used in most cases. In contrast to SimpleMessageListenerContainer, this container variant does allow for dynamic adaptation to runtime demands and is able to participate in externally managed transactions. Each received message is registered with an XA transaction (when configured with a JtaTransactionManager); processing can take advantage of XA transaction semantics. This listener container strikes a good balance between low requirements on the JMS provider and good functionality including transaction participation.

19.2.4.3. ServerSessionMessageListenerContainer

This listener container leverages the JMS ServerSessionPool SPI to allow for dynamic management of JMS sessions. The use of this variety of message listener container enables the provider to perform dynamic runtime tuning, but at the expense of requiring the JMS provider to support the ServerSessionPool SPI. If there is no need for provider-driven runtime tuning, look at the DefaultMessageListenerContainer or the SimpleMessageListenerContainer instead.
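For reference, a minimal programmatic setup of the DefaultMessageListenerContainer might look roughly like this (my sketch against the API described above; the broker URL and queue name are placeholders):

import javax.jms.ConnectionFactory;
import javax.jms.Message;
import javax.jms.MessageListener;

import org.apache.activemq.ActiveMQConnectionFactory;
import org.springframework.jms.listener.DefaultMessageListenerContainer;

public class MdpExample {
    public static void main(String[] args) throws Exception {
        ConnectionFactory cf = new ActiveMQConnectionFactory("tcp://localhost:61616");

        DefaultMessageListenerContainer container = new DefaultMessageListenerContainer();
        container.setConnectionFactory(cf);
        container.setDestinationName("TEST.QUEUE");
        container.setConcurrentConsumers(3); // dynamic adaptation to runtime demand
        container.setMessageListener(new MessageListener() {
            public void onMessage(Message message) {
                System.out.println("Received: " + message);
            }
        });
        container.afterPropertiesSet(); // normally invoked by the Spring container
        container.start();
    }
}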

Apache Commons Pool

Pool – Overview

Pool provides an Object-pooling API, with three major aspects:

  1. A generic object pool interface that clients and implementors can use to provide easily interchangeable pooling implementations.
  2. A toolkit for creating modular object pools.
  3. Several general purpose pool implementations.
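A quick sketch of my own against the Commons Pool 1.x API (the StringBuffer factory is only an illustration):

import org.apache.commons.pool.BasePoolableObjectFactory;
import org.apache.commons.pool.ObjectPool;
import org.apache.commons.pool.impl.GenericObjectPool;

public class PoolExample {
    public static void main(String[] args) throws Exception {
        // A factory that tells the pool how to create objects.
        ObjectPool pool = new GenericObjectPool(new BasePoolableObjectFactory() {
            public Object makeObject() {
                return new StringBuffer();
            }
        });

        StringBuffer buf = (StringBuffer) pool.borrowObject();
        try {
            buf.append("hello pool");
            System.out.println(buf);
        } finally {
            pool.returnObject(buf); // always give the object back
        }
        pool.close();
    }
}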

ActiveMQ–KahaDB Architecture

Synchronous dispatch through a persistent broker

Synchronous Dispatch through a Persistent Broker

After receiving a message from a producer, the broker dispatches the messages to the consumers, as follows:

  1. The broker pushes the message into the message store. Assuming that the enableJournalDiskSyncs option is true, the message store also writes the message to disk, before the broker proceeds.

  2. The broker now sends the message to all of the interested consumers (but does not wait for consumer acknowledgments). For topics, the broker dispatches the message immediately, while for queues, the broker adds the message to a destination cursor.

  3. The broker then sends a receipt back to the producer. The receipt can thus be sent back before the consumers have finished acknowledging messages (in the case of topic messages, consumer acknowledgments are usually not required anyway).

Concurrent store and dispatch

  • Concurrent store and dispatch is enabled by default for queues.

Concurrent Store and Dispatch

After receiving a message from a producer, the broker dispatches the messages to the consumers, as follows:

  1. The broker pushes the message onto the message store and, concurrently, sends the message to all of the interested consumers. After sending the message to the consumers, the broker then sends a receipt back to the producer, without waiting for consumer acknowledgments or for the message store to synchronize to disk.

  2. As soon as the broker receives acknowledgments from all the consumers, the broker removes the message from the message store. Because consumers typically acknowledge messages faster than a message store can write them to disk, this often means that write to disk is optimized away entirely. That is, the message is removed from the message store before it is ever physically written to disk.

KahaDB Architecture

  • The bulk of the data is stored in rolling journal files (data logs).
    • Where all broker events are continuously appended.
  • In particular, pending messages are also stored in the data logs.
  • BTree
    • In order to facilitate rapid retrieval of messages from the data log.
    • The complete B-tree index is stored on disk and part or all of the B-tree index is held in a cache in memory.

So, the secret of KahaDB is the rolling data storage plus B-tree indexing, which is good for fast appends and fast deletes but not good for long-term data storage, a pattern the traditional RDBMSs try to avoid.

I designed similar storage solutions for about three systems, based on different read/write requirements. I tried a flat index file with a B-tree storage, and I also tried an in-memory index. The common part is that we need to separate the indexing from the massive data storage.

ActiveMQ in Action–KahaDB

  • Since 5.3
  • File based
  • Combined with a transactional journal
    • Reliable message storage
    • Recovery
  • Good performance
  • Good Scalability

File based

  • No prerequisite for a third-party database.
    • Easy to use
    • Allowing ActiveMQ to be downloaded and running in literally minutes.
  • The structure of the KahaDB store has been streamlined especially for the requirements of a message broker.

I am interested to take a look at the design of the KahaDB structure to see how it tries to meet the requirements of a message broker. Before that, maybe I need to understand the requirements of a message broker.

 

  • The KahaDB message store uses a transactional log for its indexes and only uses one index file for all its destinations.
  • The configurability of the KahaDB store means that it can be tuned for most usage scenarios, from high throughput applications (for example, trading platform), to storing large amounts of messages (for example, GPS tracking).
  • <persistenceAdapter>
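That <persistenceAdapter> element goes into the broker's activemq.xml, roughly like this (the directory and journal size are illustrative values, not recommendations):

<broker xmlns="http://activemq.apache.org/schema/core" brokerName="localhost">
  <persistenceAdapter>
    <kahaDB directory="activemq-data/kahadb" journalMaxFileLength="32mb"/>
  </persistenceAdapter>
</broker>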

ActiveMQ in Action–How are message stored by ActiveMQ?

  • Messages sent to queues and topics are stored differently, because there are some storage optimizations that can be made with topics that don’t make sense with queues.
  • Queues
    • Straightforward
      • FIFO
      • Only when that message has been consumed and acknowledged can it be deleted from the broker’s message store.
  • Topics
    • Durable subscribers to a topic.
      • Each consumer gets a copy of the message.
    • Only one copy of message is stored by the broker
    • A durable subscriber object in the store maintains a pointer to its next stored message and dispatches a copy of it to its consumer.
    • A message can’t be deleted from the store until it’s been successfully delivered to every interested durable subscriber.

How to stop the Cygwin command line beeping at you on Tab completion

vim ~/.inputrc

set bell-style none

!! gone

What about VIM?

Re: How do you shut off the beeps in vim?

Cool?
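For vim itself, the fix usually suggested (from memory, not quoted from that thread) is to switch to the visual bell and then blank the visual bell out, e.g. in ~/.vimrc:

" route beeps to the visual bell, then disable the visual bell too
set visualbell
set t_vb=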

ActiveMQ–Discovery

  • From the book
    • Discovery
      • is a process of detecting remote broker services.
    • Clients usually want to discover all available brokers.
    • Brokers usually want to find other available brokers so that they can establish a network of brokers.

Apache ActiveMQ ™ – Discovery

This document is a little bit old, because it was using maven 1.

  • Discovery Agents
    • ActiveMQ uses an abstraction called a Discovery Agent to detect remote services such as remote brokers. We can use discovery for JMS clients to auto-detect a Message Broker to connect to, or to provide Networks of Brokers
  • Kinds of discovery agents
    • Multicast
    • Zeroconfig
    • LDAP Discovery
  • Trying out discovery

Apache ActiveMQ ™ -- Discovery Transport Reference

  • The Discovery transport works just like the Failover transport, except that it uses a discovery agent to locate the list of uri to connect to.
  • The Discovery transport is also used by the Fanout transport for discovering brokers to send a fanout message to.
  • Note that to be able to use Discovery to find brokers, the brokers need to have the multicast discovery agent enabled on the broker.
    • <transportConnector … discoveryUri=”multicast://default” />
  • discovery:(discoveryAgentURI)?transportOptions
  • discovery:discoveryAgentURI
  • Example URI:
    • discovery:(multicast://default)?initialReconnectDelay=100
Transport Options:
  • reconnectDelay (default: 10): How long to wait for discovery.
  • initialReconnectDelay (default: 10): How long to wait before the first reconnect attempt to a discovered URL.
  • maxReconnectDelay (default: 30000): The maximum amount of time we ever wait between reconnect attempts.
  • useExponentialBackOff (default: true): Whether an exponential backoff should be used between reconnect attempts.
  • backOffMultiplier (default: 2): The exponent used in the exponential backoff attempts.
  • maxReconnectAttempts (default: 0): If not 0, the maximum number of reconnect attempts before an error is sent back to the client.
  • group (default: default): An identifier for the group to partition multicast traffic among collaborating peers; the group forms part of the shared identity of a discovery datagram (since 5.2).
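On the client side the discovery URI plugs straight into the connection factory; a sketch of my own, assuming a broker on the LAN has multicast discovery enabled as described above:

import javax.jms.Connection;
import javax.jms.ConnectionFactory;

import org.apache.activemq.ActiveMQConnectionFactory;

public class DiscoveryClient {
    public static void main(String[] args) throws Exception {
        // The client multicasts to find a broker instead of naming one explicitly.
        ConnectionFactory factory = new ActiveMQConnectionFactory(
            "discovery:(multicast://default)?initialReconnectDelay=100");
        Connection conn = factory.createConnection();
        conn.start();
        System.out.println("Connected via discovery");
        conn.close();
    }
}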