How to troubleshoot and monitor a Java or J2EE application that has performance and scalability problems. Here are the techniques used for production systems.
1. Take a series of JDK thread dumps to locate the following possible problems:
- Application bottleneck: Identify application bottlenecks by locating the most common stack trace. Optimize the code paths that appear most often in the stack traces.
- Bad SQL: If most threads are waiting on JDBC calls, trace the offending SQL statements back to the DB.
- Slow DB: If many SQL statements are slow, profile the DB to locate the problem.
- DB or external system outages: Check whether many threads are waiting while making external connections.
- Concurrency issue: Check whether many stack traces are waiting for a lock at the same place in the code.
- Infinite loop: Check whether threads remain running for minutes in a similar part of the source code.
- Connectivity problem: An unexpectedly high number of idle threads (that is, unexpectedly few busy ones) indicates that requests are not reaching the application server.
- Thread count misconfiguration: Increase the thread count if CPU utilization is low yet most threads are in the runnable state.
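The "most common stack trace" check can also be done from inside the JVM. The following is a minimal sketch, assuming the standard java.lang.management API is acceptable in place of an external dump tool. The class name ThreadDumpSummary and its grouping key (thread state plus top stack frame) are illustrative choices, not something the article prescribes.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not from the article): summarizes an in-process thread dump.
final class ThreadDumpSummary {

    // Groups threads by state and top stack frame and counts each group.
    static Map<String, Integer> topFrameCounts(ThreadInfo[] threads) {
        Map<String, Integer> counts = new HashMap<>();
        for (ThreadInfo t : threads) {
            if (t == null) {
                continue; // defensive: skip entries without thread information
            }
            StackTraceElement[] stack = t.getStackTrace();
            String frame = stack.length == 0 ? "<no stack>" : stack[0].toString();
            counts.merge(t.getThreadState() + " @ " + frame, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        ThreadInfo[] dump = mx.dumpAllThreads(false, false);

        // Print the ten most common (state, top frame) pairs, most frequent first.
        topFrameCounts(dump).entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(10)
                .forEach(e -> System.out.println(e.getValue() + "  " + e.getKey()));

        // Threads deadlocked on object monitors; the result is null when there are none.
        long[] deadlocked = mx.findMonitorDeadlockedThreads();
        System.out.println("Deadlocked threads: " + (deadlocked == null ? 0 : deadlocked.length));
    }
}
```

Running the summary several times a few seconds apart gives the "series" of dumps: a group that stays at the top across runs is the likely bottleneck, lock or loop.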
2. Monitor CPU utilization
- High CPU utilization suggests design or coding inefficiency. Take a thread dump to locate the bottleneck. If no problem is found, the system may have reached full capacity.
- Low CPU utilization with abnormally high response time implies that many threads are blocked. Take a thread dump to narrow down the problem.
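The two CPU rules can be written down as a small decision function. This is only a sketch: the thresholds (85% and 30% CPU, twice the normal response time) and the names CpuCheck and Hint are assumptions made for illustration, and the system load average used in main is a rough stand-in for CPU utilization.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper: turns a CPU reading and a response time into a next step.
final class CpuCheck {

    enum Hint { LOOK_FOR_HOT_CODE, LOOK_FOR_BLOCKED_THREADS, NO_ACTION }

    // cpuLoad is a fraction between 0 and 1. All thresholds are illustrative assumptions.
    static Hint classify(double cpuLoad, double responseMillis, double normalResponseMillis) {
        if (cpuLoad >= 0.85) {
            return Hint.LOOK_FOR_HOT_CODE;          // high CPU: thread dump to find the hot path
        }
        if (cpuLoad <= 0.30 && responseMillis > 2 * normalResponseMillis) {
            return Hint.LOOK_FOR_BLOCKED_THREADS;   // low CPU but slow: threads are likely blocked
        }
        return Hint.NO_ACTION;
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double load = os.getSystemLoadAverage();    // negative when the platform does not report it
        int cpus = os.getAvailableProcessors();
        double cpuLoad = load < 0 ? 0.0 : Math.min(1.0, load / cpus);

        // The response times would come from the application's own measurements.
        double measuredMillis = 900.0;
        double normalMillis = 200.0;
        System.out.println(classify(cpuLoad, measuredMillis, normalMillis));
    }
}
```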
3. Monitor process health including the Java application server
Monitor whether all web servers, application servers, middle-tier systems and DB servers are running. Configure each of them as a service so that it is restarted automatically when the process dies suddenly.
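One simple way to check whether a server process is still alive is to probe its listening port. The sketch below assumes that a successful TCP connect is a good enough sign of life; the class name PortProbe, the host and the port in the usage note are hypothetical.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper: checks whether something accepts TCP connections on host:port.
final class PortProbe {

    // Returns true when a connection is established within the timeout, false otherwise.
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false; // refused, timed out or unreachable
        }
    }
}
```

A monitoring job could call PortProbe.isListening("app-host", 8080, 2000) once a minute and raise an alert after a few consecutive failures. An open port does not prove that the application is healthy, so a request against a real page is a useful second check.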
4. Monitor the Java Heap Utilization
Monitor the amount of Java heap memory that can be reclaimed after a major garbage collection. If the reclaimed amount keeps dropping consistently, the application is leaking memory. Use memory profiling to locate the leak. If no memory is leaking yet major garbage collections are frequent, tune the Java heap accordingly.
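A shrinking reclaimed amount is the same as a rising amount of heap still in use after each collection, and that value can be sampled from inside the JVM. The following is a sketch under assumptions: it relies on MemoryPoolMXBean.getCollectionUsage(), which reports usage after the most recent collection of each pool (not necessarily a major one), and the "every sample is higher than the last" rule in steadilyRising is an illustrative simplification.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.List;

// Hypothetical helper: tracks how much heap is still in use after garbage collection.
final class HeapTrend {

    // Sums, over all heap pools, the bytes in use right after each pool's last collection.
    static long usedAfterLastGc() {
        long total = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue;
            }
            MemoryUsage afterGc = pool.getCollectionUsage(); // null when the pool does not support it
            if (afterGc != null) {
                total += afterGc.getUsed();
            }
        }
        return total;
    }

    // True when each sample is higher than the one before: the post-GC floor keeps rising.
    static boolean steadilyRising(List<Long> samples) {
        if (samples.size() < 3) {
            return false; // too few samples to call it a trend
        }
        for (int i = 1; i < samples.size(); i++) {
            if (samples.get(i) <= samples.get(i - 1)) {
                return false;
            }
        }
        return true;
    }
}
```

Sampling usedAfterLastGc() every few minutes and feeding the last several values to steadilyRising gives an early warning. A heap profiler is still needed to find which objects are being retained.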
5. Monitor unusual exceptions in the application log & application server log
Monitor and resolve any exceptions detected in the application and server logs. Examine the source code to ensure that all resources, in particular DB, file, socket and JMS resources, are properly closed when the application throws an exception.
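In Java 7 and later, try-with-resources is the usual way to guarantee that a resource is closed even when an exception is thrown. The sketch below uses a made-up TrackedResource class as a stand-in for a JDBC connection, file, socket or JMS session, so that it runs without any external system.

```java
// Hypothetical demo: shows that try-with-resources closes a resource on failure.
final class CloseOnFailure {

    // Stand-in for a JDBC connection, file, socket or JMS session.
    static final class TrackedResource implements AutoCloseable {
        boolean closed;

        void use(boolean fail) {
            if (fail) {
                throw new IllegalStateException("simulated failure");
            }
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    // Uses the resource, lets it fail, and reports whether it was closed anyway.
    static boolean closedAfterFailure() {
        TrackedResource tracked = new TrackedResource();
        try (TrackedResource resource = tracked) {
            resource.use(true);
        } catch (IllegalStateException e) {
            // close() has already run by the time this handler executes
        }
        return tracked.closed;
    }

    public static void main(String[] args) {
        System.out.println("Closed after failure: " + closedAfterFailure());
    }
}
```

Code that predates Java 7 needs the equivalent close() call in a finally block. A missing finally block around a connection or statement is a common thing to look for during the source review.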
6. Monitor memory & paging activities
Growing resident (native) memory implies a memory leak in native code. The source of the leak may include the application's non-Java native code, C code in the JVM and third-party libraries. Also monitor paging activity closely. Frequent paging means memory misconfiguration.
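On Linux, the resident set size of the JVM process can be read from /proc/self/status and compared with the Java heap size. This sketch assumes a Linux host and the usual "VmRSS: <number> kB" line format; the class name ResidentMemory is hypothetical. Resident memory that keeps growing while the committed heap stays flat points at native code rather than at Java objects.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper: reads the process's resident memory on Linux.
final class ResidentMemory {

    // Parses a line such as "VmRSS:    204800 kB" and returns the value in kB, or -1 if absent.
    static long parseVmRssKb(List<String> statusLines) {
        for (String line : statusLines) {
            if (line.startsWith("VmRSS:")) {
                String[] parts = line.trim().split("\\s+");
                return Long.parseLong(parts[1]);
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        Path status = Paths.get("/proc/self/status"); // Linux only
        if (!Files.isReadable(status)) {
            System.out.println("No /proc/self/status on this platform");
            return;
        }
        long rssKb = parseVmRssKb(Files.readAllLines(status));
        long heapKb = Runtime.getRuntime().totalMemory() / 1024; // heap currently committed
        System.out.println("Resident: " + rssKb + " kB, Java heap committed: " + heapKb + " kB");
    }
}
```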
7. Perform DB profiling
Monitor the following metrics closely
- Top SQL statements by logical reads, latency and execution count - Rewrite or tune poorly performing SQL or DB programming code.
- Top DB wait and latch events - Identify bad DB coding, or bad DB instance or table configuration.
- Number of hard parses - Identify scalability problems caused by improper DB programming.
- Hit ratios for the different buffers and caches - Evidence of bad SQL or improper buffer size configuration.
- File I/O statistics - Evidence of bad SQL, or of disk misconfiguration or poor layout.
- Rollback ratio - Identify improper application logic.
- Sorting efficiency - Identify improper sort buffer configuration.
- Undo log or rollback segment performance - Identify DB tuning problems.
- Number of SQL statements and transactions per second - A sudden jump reveals bad application coding.
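Most of these metrics come from the database's own profiling reports. The first one, top SQL by latency and count, can also be approximated on the application side. The sketch below is one hypothetical way to do that: the class SqlStats and its methods are made up for illustration, it is not thread-safe, and it assumes the caller times each statement around the JDBC call.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: accumulates execution count and total latency per SQL text.
final class SqlStats {

    // SQL text -> { execution count, total elapsed nanoseconds }
    private final Map<String, long[]> stats = new HashMap<>();

    // Call after each statement, with the time measured around the JDBC call.
    void record(String sql, long elapsedNanos) {
        long[] entry = stats.computeIfAbsent(sql, key -> new long[2]);
        entry[0]++;
        entry[1] += elapsedNanos;
    }

    long count(String sql) {
        long[] entry = stats.get(sql);
        return entry == null ? 0 : entry[0];
    }

    // Returns up to n statements, the one with the highest total latency first.
    List<String> topByTotalTime(int n) {
        List<Map.Entry<String, long[]>> entries = new ArrayList<>(stats.entrySet());
        entries.sort((a, b) -> Long.compare(b.getValue()[1], a.getValue()[1]));
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }
}
```

A typical call site would take System.nanoTime() before and after executing the statement and pass the difference to record. Logical reads, parse counts and wait events are only visible on the DB side.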
8. JMS Resources
Monitor the queue length and resource utilization
- Poison messages: Check whether many messages remain unprocessed and stay in the queues for a long time.
- JMS queue deadlocks: Check whether no messages can be dequeued and completed.
- JMS listener problems: Check whether no messages are being processed in a particular queue.
- Memory consumption: Ensure that queues holding a large number of pending messages can be paged out of physical memory.
- JMS retry: Ensure that failed messages are not reprocessed immediately. Otherwise, poison messages may consume most of the CPU.
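The retry rule is normally set in the JMS provider's redelivery configuration, and the options differ by product. The logic behind it can be sketched in plain Java: wait longer before each redelivery and give up after a fixed number of attempts. The class RedeliveryPolicy and its parameters are hypothetical. The delivery count would typically come from the JMSXDeliveryCount message property where the provider supplies it.

```java
// Hypothetical helper: exponential backoff with a cap, plus a poison-message cutoff.
final class RedeliveryPolicy {

    private final int maxAttempts;
    private final long initialDelayMillis;
    private final long maxDelayMillis;

    RedeliveryPolicy(int maxAttempts, long initialDelayMillis, long maxDelayMillis) {
        this.maxAttempts = maxAttempts;
        this.initialDelayMillis = initialDelayMillis;
        this.maxDelayMillis = maxDelayMillis;
    }

    // Delay before redelivery number "attempt" (1 = first retry); doubles each time up to the cap.
    long delayMillis(int attempt) {
        long delay = initialDelayMillis;
        for (int i = 1; i < attempt && delay < maxDelayMillis; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxDelayMillis);
    }

    // A message delivered more often than maxAttempts is treated as poison
    // and should be moved to a dead-letter queue instead of being retried.
    boolean isPoison(int deliveryCount) {
        return deliveryCount > maxAttempts;
    }
}
```

With new RedeliveryPolicy(5, 1000, 30000), the retries wait 1, 2, 4, 8 and 16 seconds, and a sixth delivery marks the message as poison.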
9. Monitor file I/O performance
Track the trend of I/O access and wait times. Redesign or reconfigure the disk layout if necessary for better I/O performance, in particular for the DB server.
10. Monitor resource utilization including file descriptors
Monitor resources closely to identify any application code that is depleting OS-level resources.
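On Unix-like systems, the JVM exposes its own file descriptor usage through com.sun.management.UnixOperatingSystemMXBean. The sketch below assumes a HotSpot-style JDK that provides this interface and returns -1 elsewhere; the class name FdUsage and the 80% alert level mentioned afterwards are illustrative.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper: reports what fraction of the file descriptor limit is in use.
final class FdUsage {

    // Returns open/max as a fraction, or -1 when the platform does not expose the counts.
    static double usedFraction() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            com.sun.management.UnixOperatingSystemMXBean unix =
                    (com.sun.management.UnixOperatingSystemMXBean) os;
            long open = unix.getOpenFileDescriptorCount();
            long max = unix.getMaxFileDescriptorCount();
            return max <= 0 ? -1.0 : (double) open / max;
        }
        return -1.0;
    }

    public static void main(String[] args) {
        System.out.println("File descriptors in use (fraction of limit): " + usedFraction());
    }
}
```

A value that climbs steadily toward the limit, for example past 0.8, usually means that files or sockets are opened and never closed.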
11. Monitor HTTP access
Monitor the top IP addresses accessing the system. Detect any intruder trying to steal content and data from the web site. Use the access log to trace any non-200 HTTP responses.
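Both checks can be run over the web server's access log. The following sketch assumes the log is in Common Log Format (client IP first, status code right after the quoted request line). The class AccessLogStats is hypothetical, and lines that do not match the pattern are skipped.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: per-IP request counts and non-200 responses from an access log.
final class AccessLogStats {

    // Group 1: client IP. Group 2: HTTP status code. Assumes Common Log Format.
    private static final Pattern LINE = Pattern.compile(
            "^(\\S+) \\S+ \\S+ \\[[^\\]]*\\] \"[^\"]*\" (\\d{3})\\b");

    // Counts requests per client IP.
    static Map<String, Integer> requestsPerIp(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line);
            if (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Returns the lines whose status code is not 200.
    static List<String> nonOkLines(List<String> lines) {
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line);
            if (m.find() && !"200".equals(m.group(2))) {
                result.add(line);
            }
        }
        return result;
    }
}
```

Sorting the map from requestsPerIp by value gives the top IP addresses. Not every non-200 line is a problem (redirects and 304 responses are normal), so the output of nonOkLines is a starting point for review rather than an error list.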
12. Monitor security access log
Monitor the OS-level security log and the web server log to detect intrusions. The logs also give hints on how attackers are attacking the system.
13. Monitor network connectivity and TCP status
Run netstat constantly to monitor the TCP socket state.
- A high number of sockets in a TCP wait state (for example TIME_WAIT) implies TCP misconfiguration.
- A high number of sockets in SYN or FIN states (for example SYN_RECV or FIN_WAIT) implies a possible denial-of-service (DoS) attack.
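Counting sockets per state by hand gets tedious, so the netstat output can be summarized with a small program. This sketch assumes the output of "netstat -an" is piped to standard input and that the state is the last column of each TCP line, which holds for the plain "-an" form on common platforms but not when extra columns such as process names are requested. The class name TcpStateCount is hypothetical.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical helper: counts TCP sockets per state from "netstat -an" output.
final class TcpStateCount {

    // Assumes the state (TIME_WAIT, SYN_RECV, ...) is the last column of each tcp line.
    static Map<String, Integer> countStates(List<String> netstatLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : netstatLines) {
            String[] cols = line.trim().split("\\s+");
            if (cols.length < 4 || !cols[0].toLowerCase().startsWith("tcp")) {
                continue; // header, UDP or Unix-socket line
            }
            counts.merge(cols[cols.length - 1], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Reads the netstat output from standard input.
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        List<String> lines = in.lines().collect(Collectors.toList());
        countStates(lines).forEach((state, n) -> System.out.println(state + " " + n));
    }
}
```

Recording these counts at regular intervals makes a sudden rise in one state easy to spot.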