Cassandra Performance Tuning and Monitoring

Performance Tuning for Cassandra Write Operations

Optimize Cassandra for Write Operations

  • The Cassandra write path is very simple and requires little tuning
  • The biggest performance gain for writes is to put the commit log on a separate disk drive (see the cassandra.yaml sketch after this list)
    • The commit log uses sequential writes, and most hard drives will meet the throughput requirement
    • However, if SSTables share the same drive as the commit log, I/O contention between the two can degrade both commit log writes and SSTable reads
    • Unfortunately, current cloud computing services do not provide a truly standalone drive
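
For example, a minimal conf/cassandra.yaml sketch that puts the commit log on its own drive (the mount points are hypothetical; very old releases configure the same directories in storage-conf.xml instead):

    # SSTables on one drive
    data_file_directories:
        - /mnt/data/cassandra/data
    # commit log on a separate, dedicated drive
    commitlog_directory: /mnt/commitlog/cassandra/commitlog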

Performance Tuning for Cassandra Read Operations

Tuning Cassandra Concurrent Reads & Concurrent Writes

concurrent_reads (default: 8)

A good rule of thumb is 4 concurrent reads per processor core. The value may be increased for systems with fast I/O storage

concurrent_writes (default: 32)

  • Usually does not need tuning since writes are fast
  • If needed, increase the value for systems with many cores (a cassandra.yaml sketch follows this list)
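
As an illustration only (assuming a cassandra.yaml-based release), an 8-core node with fast storage might follow the rule of thumb above with:

    concurrent_reads: 32     # 4 per core x 8 cores
    concurrent_writes: 32    # the default is usually sufficient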

Contention From Cassandra Compaction

Cassandra compaction increases I/O contention on SSTable data reads

  • Writing Cassandra data is usually fast, so write degradation may not be noticeable
  • Reading Cassandra data (SSTables) continues to slow down as I/O contention increases
  • Compaction can be a major cause of read performance degradation

Monitoring Cassandra Compaction Contention

Monitor Cassandra compaction statistics to identify problems

  • Use nodetool cfstats to monitor the number of SSTables (see the filtering example after this list)
    • If the count is continually growing, there may be I/O contention between reading the SSTables and compacting them
    • Reads slow down because data is fragmented across many SSTables while compaction runs continually trying to reduce their number
  • If pending tasks and bytes total in progress are unreasonably high, compaction may be contending for I/O with reads of the SSTables
    bin/nodetool -h 10.208.115.203 compactionstats
    compaction type: n/a
    column family: n/a
    bytes compacted: n/a
    bytes total in progress: n/a
    pending tasks: 0
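
To watch whether the SSTable count keeps growing, the cfstats output can be filtered with a simple shell loop (a sketch; the exact field labels may vary slightly between Cassandra versions):

    # print the SSTable count of every column family once a minute
    while true; do
        bin/nodetool -h 10.208.115.203 cfstats | grep -E 'Column Family:|SSTable count:'
        sleep 60
    done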

Remedy for Cassandra Compaction Contention

  • Top priority: reduce or merge application update/insert requests
  • Reduce the frequency of memtable flushes
    • By increasing the memtable size or preventing premature flushing
      • Less frequent memtable flushes result in fewer SSTable files and less compaction
      • Less compaction reduces SSTable I/O contention, and therefore improves read operations
      • Bigger memtables absorb more overwrites for updates to the same keys, and therefore accommodate more read/write operations between flushes
        memtable_flush_after_mins: The maximum time to leave a dirty memtable unflushed.
        It needs to be large enough to avoid flushing the memtable prematurely.
        For production, a larger value such as 1440 is recommended.
        memtable_operations_in_millions: The maximum number of columns to store in memory per ColumnFamily before flushing to disk.
        memtable_throughput_in_mb: The maximum amount of data to store in memory per ColumnFamily before flushing to disk.
    • Defaults are (a worked example follows this list):
      60 minutes, HeapSize/2**29 * 0.3 million operations, and HeapSize/2**23 MB
  • If the problem is not frequent or severe, lowering the compaction thread priority may reduce the I/O contention
    • The lower priority may slow down compaction writes, but may not help if there is plenty of idle CPU
    • Add the following to JVM_OPTS in conf/cassandra-env.sh to lower the compaction priority
      JVM_OPTS="$JVM_OPTS -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Dcassandra.compaction.priority=1"
  • May need to add nodes or increase drive I/O bandwidth when there is a real I/O capacity problem

    However, most problems originate from application code that performs overly frequent data updates or inserts
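
As a rough worked example of the defaults above (assuming HeapSize in the formulas is measured in bytes), a node running with a 4 GB heap would get approximately:

    memtable_operations_in_millions = (4 * 2**30) / 2**29 * 0.3 = 2.4 (million columns)
    memtable_throughput_in_mb       = (4 * 2**30) / 2**23       = 512 (MB)
    memtable_flush_after_mins       = 60 (minutes)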

Cassandra Memory Cache Tuning

Basic principles

  • Do not increase Cassandra cache sizes unless there is enough physical memory. Avoid memory swapping at all costs
  • The available memory for Cassandra is less than most people think
    • Cassandra heavily depends on the OS file cache to optimize data reads
    • The real physical memory available for Cassandra is (a hypothetical example follows this list):
      physical memory size - OS required memory - target memory size for caching data files by the OS
                           - memory used by other applications
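
For example (all numbers hypothetical), on a node with 16 GB of physical memory:

      16 GB total - 1 GB for the OS - 8 GB reserved for the OS file cache - 1 GB for other applications
      = about 6 GB left for the Cassandra JVM heap and its caches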

Cassandra uses a three-level caching strategy

  • Row cache: high memory consumption; lowest effectiveness (much harder to tune, and may have the opposite effect if the data working set changes); lowest return when physical memory is scarce
  • Key cache: low memory consumption; most effective given its small memory footprint; highest return when physical memory is scarce
  • OS file cache: high memory consumption; more effective than the row cache at handling changes in the data working set; high return when physical memory is scarce

In the worst case:

  • Cassandra reads the index file to find the row's location in the data file
  • Cassandra reads the data file to load the column data
  • The objective of tuning the Cassandra caches is to reduce the total number of disk reads given a limited amount of physical memory

Cassandra read path:

  1. Check if the data can be found in the in-memory row cache (look up the column data by the row key)
    • If found, return the row cache data to the client
    • If row columns are large, the cached data may take up a lot of memory and the row cache becomes less effective since it can hold fewer keys
    • Use the row cache for column families that have hot data spots
    • However, avoid the row cache if data is frequently evicted from it; that will trigger frequent JVM GC and kill performance
    • Use nodetool to monitor row cache hit ratio
      bin/nodetool cfstats
    • Increase the row cache if the hit ratio increases significantly with a relatively small memory footprint
      bin/nodetool setcachecapacity keyspace mycolumn_family key_cache_capacity row_cache_capacity

      OR

      create column family Standard1
          with comparator = BytesType
          and keys_cached = 10000
          and rows_cached = 1000
      • By default rows_cached is set to 0
  2. Check if the key can be found in the in-memory key cache (the key cache stores the physical location of data for each cached key)
    • Read the data location from the in-memory key cache if found; otherwise read it from the index file
    • Saves one file I/O seek if the key is found in the cache
    • The key cache stores only the key and the data location, and therefore has the smallest memory footprint
    • By default, the key cache holds 200000 keys
    • With limited memory, the key cache provides the biggest performance gain compared with other caches using the same amount of memory
  3. Read the column data: if the requested data can be found in the OS file cache, it is returned from the cache
    • The OS file cache adjusts to changing access patterns better than the row cache
    • Setting the Cassandra row cache and key cache too high will deplete the OS file cache and degrade performance
    • Cassandra performance assumes an effective OS file cache, and some deployments reserve 50% of physical memory for the OS file cache
    • Linux uses the remaining physical memory for the file cache and releases part of it whenever other applications are low on memory
  4. If the requested data is not in the OS file cache, Cassandra reads the data from disk

Cassandra Row Cache Tuning

The row cache holds the entire content of a row in memory. It serves data from memory instead of reading it from disk

Row Cache

  • A small row cache may result in a low hit ratio
  • A large row cache may deplete physical memory and adversely impact performance
    • Use nodetool cfstats to find the average row size
  • Good if the column data is small, so the cache is big enough to hold most of the hotspot data
  • Bad if the column data is too large, so the cache is not big enough to hold most of the hotspot data
  • Bad for high write/read ratios
  • By default, the row cache is off
  • If the hit ratio is below 30%, the row cache should be disabled

To enable row cache

update column family my_column_family with rows_cached=5000;

Use an absolute number for rows_cached instead of a percentage

To monitor the row cache hit ratio

bin/nodetool -h 10.10.10.1 cfstats

Cassandra Key Cache Tuning

The key cache holds the location of data in memory for each column family:

  • Effective if there are hot data spots but the row cache cannot be used effectively because of large column sizes
  • Reduces one of the two disk seeks

To monitor the key cache hit ratio

bin/nodetool -h 10.10.10.1 cfstats
Column Family: products
...
Pending Tasks: 0
Key cache capacity: 200000
Key cache size: 3000
Key cache hit rate: NaN
  • By default, Cassandra caches 200000 keys per column family
  • Use an absolute number for keys_cached instead of a percentage

To change the key cache

update column family my_column_family with keys_cached=25000;
  • keys_cached defines how many key locations will be kept in memory per column family

Cassandra JVM Parameters Tuning

Java Heap Tunable Configuration

JVM Parameters & Default Values Description
-Xms${MAX_HEAP_SIZE} Min Java Heap Size: Default to half of available physical memory
-Xmx${MAX_HEAP_SIZE} Max Java Heap Size: Default to half of available physical memory
-Xmn${HEAP_NEWSIZE} Size of young generation heap (1/4 of Java Heap)
-Xss128k Max native stack size of a thread
-XX:SurvivorRatio=8 Young Heap survivor ratio
  • Tune the Heap Size and the young generation heap according to the memory needs and the data access pattern
  • Do NOT increase the sizes without confirming there is enough available physical memory. Always reserve memory for the OS file cache (see the sketch below)
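
For example, the heap can be pinned explicitly in conf/cassandra-env.sh instead of relying on the auto-calculated defaults (the 4 GB / 800 MB figures below are purely illustrative):

    MAX_HEAP_SIZE="4G"
    HEAP_NEWSIZE="800M"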

Java GC Tunable Configuration

JVM Parameters & Default Values Description
-XX:+UseParNewGC Use parallel garbage collection in the young generation
-XX:+UseConcMarkSweepGC Use concurrent garbage collection in the old generation
-XX:+CMSParallelRemarkEnabled Decrease remark pauses
-XX:MaxTenuringThreshold=1 How fast an object will move to the old generation
-XX:CMSInitiatingOccupancyFraction=75 Execute GC when old generation is N% full
-XX:+UseCMSInitiatingOccupancyOnly Use only the configured occupancy fraction (not JVM heuristics) to decide when to start a CMS collection
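
These flags are normally assembled in conf/cassandra-env.sh; a minimal sketch that mirrors the defaults listed above:

    JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled"
    JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"
    JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly"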

Java Thread Tunable Configuration

JVM Parameters & Default Values Description
-XX:+UseThreadPriorities Use native thread priority
-XX:ThreadPriorityPolicy=42 A workaround to allow changing thread priorities

Misc Parameters

  • Cassandra dynamic snitching (enabled by default) avoids reading from slow hosts (a cassandra.yaml sketch follows the settings below)

    cassandra.yaml:

    dynamic_snitch_update_interval_in_ms The interval for calculating read latency. Default: 100
    dynamic_snitch_reset_interval_in_ms The interval for resetting scores and allowing a bad node to recover. Default: 60000
    dynamic_snitch_badness_threshold A performance threshold for dynamically routing requests away from a node. Default: 0.0
    A value of 0.2 means Cassandra keeps using the static snitch until a host performs 20% worse than the fastest node
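
In cassandra.yaml (on releases that expose these keys), the defaults above look like:

    dynamic_snitch_update_interval_in_ms: 100
    dynamic_snitch_reset_interval_in_ms: 60000
    dynamic_snitch_badness_threshold: 0.0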

Cassandra Performance Monitoring

Locate the Cassandra bottleneck by examining all the internal thread pools

  • Use nodetool tpstats to dump the number of pending requests in each pool
    bin/nodetool -h 10.208.115.203 tpstats
    Pool Name                    Active   Pending      Completed
    ReadStage                         0         0              1
    RequestResponseStage              0         0              0
    MutationStage                     0         0              3
    ReadRepairStage                   0         0              0
    GossipStage                       0         0              0
    AntiEntropyStage                  0         0              0
    MigrationStage                    0         0              0
    MemtablePostFlusher               0         0              2
    StreamStage                       0         0              0
    FlushWriter                       0         0              2
    MiscStage                         0         0              0
    FlushSorter                       0         0              0
    InternalResponseStage             0         0              0
    HintedHandoff                     0         0              0
  • Track down which pool is backlogging requests
  • Focus on the pools with high pending counts to narrow down the possible area of contention

Monitoring CPU utilization, swap activity and memory usage with vmstat

vmstat - Reports memory usage, swap space, I/O activity, interrupts and CPU utilization

% vmstat -S M 5 200
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache     si   so    bi    bo  in   cs us sy id  wa st
 0  0      0  61116 113628 6569796    0    0     0     5   10    7  0  0 100  0  0
 0  0      0  61108 113628 6569788    0    0     0     7   44   26  0  0 100  0  0
 0  0      0  61108 113628 6569796    0    0     0    11  226  138  2  0 97   0  1
  • Use megabytes as the unit (-S M)
  • Sample every 5 seconds
  • For 200 iterations

How to read/interpret vmstat output

Column Description
r Number of runnable processes
b Number of blocked processes
swpd Virtual memory used
free Free memory
buff Memory used as buffers
cache Memory used as cache
si Memory swapped in from disk (per second)
so Memory swapped out to disk (per second)
bi Blocks received from a block device such as a disk (blocks/s)
bo Blocks sent to a block device such as a disk (blocks/s)
in Interrupts per second
cs Context switches per second
us User time
sy Kernel (system) time
id Idle time
wa Time waiting for I/O
st Time stolen from the virtual machine by the hypervisor
  • High activity in the swap columns implies
    • Possible misconfiguration of the Java heap, or
    • Misconfiguration of the Cassandra caches
    • Reduce memory usage or add more physical memory
  • High CPU utilization implies
    • Poorly designed application code that makes too many requests to Cassandra
    • The system is running out of capacity
  • Low CPU utilization with a high number of blocked processes implies other contention, which may include
    • I/O contention
    • Internal Cassandra contention

Monitoring memory utilization with free

Monitor memory utilization to identify memory misconfiguration

  • Make sure there is enough physical memory for the configured Java Heap and Cassandra caches
  • A healthy system should have very low swap activity
    free
                 total       used       free     shared    buffers     cached
    Mem:        611212     508196     103016          0      90992     236640
    -/+ buffers/cache:     180564     430648
    Swap:            0          0          0

The -/+ buffers/cache row shows the used and free memory, respectively, if all cached memory were reclaimed. In the output above: used = 508196 - 90992 - 236640 = 180564 and free = 103016 + 90992 + 236640 = 430648

Another host might show, for example:

-/+ buffers/cache:        885       6587
  • Cassandra depends heavily on the file cache. Make sure there is enough memory for buffers and cache.
  • Some applications may require 40-60% of physical memory allocated to the file cache for optimal performance (but this depends heavily on the data access pattern)

Monitoring I/O activities

Identify high utilization of the disk holding the data files (SSTables)

  • High %util on the SSTable drive indicates I/O contention (%util is shown in the extended iostat -mx output further below)
    % iostat 5 100
    Linux  (ip-10-10-10-10)    11/06/2010      _x86_64_     (1 CPU)
    
    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               0.02    0.00    0.01    0.23    0.03   99.71
    
    Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
    xdap1            0.06         1.63         0.88    2890754    1558192
    xda              0.67         4.89         2.76    8670738    4889288

Possible causes:

  • Too many data reads and writes from the application: application code needs refactoring
  • Frequent compaction causing I/O contention: improper memtable configuration
  • I/O capacity problem

How to read/interpret iostat output

Column Description
tps Number of transfers (I/O requests) issued per second for the device
Blk_read/s Blocks read per second (usually 512 bytes per block)
Blk_wrtn/s Blocks written per second
Blk_read Total blocks read
Blk_wrtn Total blocks written
iostat -m        # Display information in Mbytes

Print more detail in Mbytes

% iostat -mx 5 100
Linux (ip-10-10-10-10)    11/06/2010      _x86_64_        (1 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.02    0.00    0.01    0.23    0.03   99.71

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
xap1            0.00     0.09    0.04    0.02     0.00     0.00    39.35     0.00   31.44   3.12   0.02
xdh              0.00     0.28    0.61    0.07     0.00     0.00    11.34     0.01    8.21   1.74   0.12
Column Description
rrqm/s The number of merged read requests per second (adjacent requests can sometimes be merged and handled together)
wrqm/s The number of merged write requests per second
avgqu-sz Average queue length of the requests issued to the device
await Average wait time in milliseconds for I/O requests
svctm Average service time (in milliseconds) for I/O requests
%util Percentage of CPU time during which I/O requests were issued to the device; values close to 100% indicate device saturation