Rhea's Blog: What's the best way to avoid garbage collector pauses with big heaps in Java (Question excerpt from Quora)

Answered by Li Pi

https://www.quora.com/Whats-the-best-way-to-avoid-garbage-collector-pauses-with-big-heaps-in-Java

If the GC (Concurrent Mark Sweep, or CMS; see How does garbage collection work in the JVM?) is operating as expected, then the stop-the-world pauses should not be significant. CMS is designed to have as few stop-the-world pauses as possible. The Parallel New collector will stop the world, but as long as your young generation is reasonably sized, you should be fine.

To recap, CMS first stops the world to mark the root nodes (the initial mark), then concurrently traces through the rest of the objects. Memory is freed at the end of this process, but objects are never moved around, which makes fragmentation a problem.

However, there are a few failure modes where CMS will be forced to stop the world for a significant amount of time:

1. Concurrent Mode Failure - This occurs when the tenured generation fills up before CMS has completed its work. When this happens, the JVM falls back to a stop-the-world garbage collection mode.

For efficiency's sake, the JVM attempts to start the garbage collection process as late as it can get away with. CMS tracks the growth in heap size and attempts to time its collection so that the collection ends right before the tenured generation fills up. Sometimes the JVM guesses wrong, for example if the object tenuring rate increases dramatically during the collection process.

You'll know you're experiencing this failure mode if you see the words [CMS (concurrent mode failure): in your GC log.

If this failure mode is bugging you, simply set -XX:CMSInitiatingOccupancyFraction to a conservative (lower) value. This tells the JVM to start garbage collection earlier, so it is less likely to run out of space before the collection finishes.

2. Promotion Failure Due to Fragmentation:

This is the other biggie. If you see ParNew (promotion failed) in your GC log, you're experiencing this.
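If you want to check a log for both failure modes at once, a few lines of Java will do. This is only a rough sketch: the class name GcLogScan is made up for illustration, and the two marker strings are just the phrases quoted in this answer, so the exact wording in your log may differ between JVM versions.

```java
import java.util.List;

// Hypothetical helper: counts log lines containing a given failure marker.
final class GcLogScan {
    // Marker phrases taken from the answer text, not from a JVM specification.
    static final String CONCURRENT_MODE_FAILURE = "concurrent mode failure";
    static final String PROMOTION_FAILED = "promotion failed";

    /** Returns how many lines contain the marker as a substring. */
    static long count(List<String> lines, String marker) {
        return lines.stream().filter(line -> line.contains(marker)).count();
    }
}
```

You could feed it the result of Files.readAllLines(Paths.get("gc.log")), where gc.log stands in for wherever your own GC log is written.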

I'm gonna steal more content from Todd Lipcon and this blog post again: http://www.cloudera.com/blog/201...

This failure mode is a little bit more complicated. Recall that the CMS collector does not relocate objects, but simply tracks all of the separate areas of free space in the heap. As a thought experiment, imagine that I allocate 1 million objects, each 1KB, for a total usage of 1GB in a heap that is exactly 1GB. Then I free every odd-numbered object, so I have 500MB live. However, the free space will be solely made up of 1KB chunks. If I need to allocate a 2KB object, there is nowhere to put it, even though I ostensibly have 500MB of space free. This is termed memory fragmentation. No matter how early I ask the CMS collector to start, since it does not relocate objects, it cannot solve this problem!

When this problem occurs, the collector again falls back to the copying collector, which is able to compact all the objects and free up space.
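The thought experiment above is easy to play with in code. The sketch below is a toy model, not how the JVM tracks free space: the heap is just an array of 1KB slots, and the class and method names are invented for illustration. It shows that after freeing every odd-numbered object, half the heap is free but no free run is longer than one slot.

```java
import java.util.Arrays;

// Toy model: each array element is one 1KB slot; true means "in use".
final class FragmentationDemo {
    /** Fills the heap with one-slot objects, then frees every odd-numbered one. */
    static boolean[] checkerboard(int slots) {
        boolean[] used = new boolean[slots];
        Arrays.fill(used, true);
        for (int i = 1; i < slots; i += 2) {
            used[i] = false;
        }
        return used;
    }

    /** Total number of free slots, contiguous or not. */
    static int totalFree(boolean[] used) {
        int free = 0;
        for (boolean u : used) {
            if (!u) free++;
        }
        return free;
    }

    /** Length of the longest contiguous run of free slots. */
    static int largestFreeRun(boolean[] used) {
        int best = 0;
        int run = 0;
        for (boolean u : used) {
            run = u ? 0 : run + 1;
            best = Math.max(best, run);
        }
        return best;
    }
}
```

With checkerboard(1_000_000), totalFree reports 500,000 slots while largestFreeRun reports 1, so a two-slot (2KB) request cannot be satisfied without moving objects.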

Solutions:

Dealing with this failure mode is more difficult. As answerers have mentioned above, try to do things in a way that doesn't create garbage in the tenured generation. The new generation is always collected by a copying collector, so fragmentation doesn't occur there.

The tenured generation makes the assumption that objects that are allocated together die together, but if we violate this tenet, fragmentation becomes a big problem. You can get around this by manually allocating memory in such a way that objects next to each other die at the same time, and can thus be collected together.

Todd Lipcon gives an awesome writeup of this approach with a Local Allocation Buffer in this blog post. http://www.cloudera.com/blog/201...
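To give a feel for the idea, here is a minimal sketch of a chunk-based allocation buffer. It is not Todd's implementation: the class name, the Slice type, and the chunk size parameter are all assumptions made for illustration. Small pieces of data are copied into large byte[] chunks, so what the tenured generation sees is a handful of big, same-sized arrays that become garbage together rather than many small objects with unrelated lifetimes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: packs small byte arrays into large fixed-size chunks.
final class LocalAllocationBuffer {
    /** Points at copied data: the chunk it lives in, plus offset and length. */
    static final class Slice {
        final byte[] chunk;
        final int offset;
        final int length;

        Slice(byte[] chunk, int offset, int length) {
            this.chunk = chunk;
            this.offset = offset;
            this.length = length;
        }
    }

    private final int chunkSize;
    private final List<byte[]> chunks = new ArrayList<>();
    private byte[] current;
    private int nextFree;

    LocalAllocationBuffer(int chunkSize) {
        this.chunkSize = chunkSize;
    }

    /** Copies data into the current chunk, starting a new chunk when it does not fit. */
    Slice copyIn(byte[] data) {
        if (data.length > chunkSize) {
            throw new IllegalArgumentException("data is larger than one chunk");
        }
        if (current == null || nextFree + data.length > chunkSize) {
            current = new byte[chunkSize];
            chunks.add(current);
            nextFree = 0;
        }
        System.arraycopy(data, 0, current, nextFree, data.length);
        Slice slice = new Slice(current, nextFree, data.length);
        nextFree += data.length;
        return slice;
    }

    int chunkCount() {
        return chunks.size();
    }

    /** Drops the buffer's references to all chunks at once. */
    void release() {
        chunks.clear();
        current = null;
        nextFree = 0;
    }
}
```

Note that a chunk only becomes collectable once the buffer is released and every Slice pointing into it has been dropped, which is exactly the "die together" property you are trying to arrange.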

Another possible solution I'm working on at Cloudera is to move the most memory-hungry elements of the application off-heap, either through the use of DirectByteBuffers or via JNI. A slab allocation model similar to Memcached's can be used to trade some space efficiency for less fragmentation.
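As a rough illustration of the slab idea, the sketch below carves one direct ByteBuffer into fixed-size slots and recycles them through a free list. It is a single size class only (Memcached uses several), the class and method names are invented, and it makes no attempt at thread safety or double-free detection.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: one off-heap region split into equal-sized slots.
final class DirectSlab {
    private final ByteBuffer backing;
    private final int slotSize;
    private final Deque<Integer> freeSlots = new ArrayDeque<>();

    DirectSlab(int slotSize, int slotCount) {
        this.slotSize = slotSize;
        // Off-heap allocation: the bytes live outside the Java heap.
        this.backing = ByteBuffer.allocateDirect(slotSize * slotCount);
        for (int i = 0; i < slotCount; i++) {
            freeSlots.add(i);
        }
    }

    /** Returns a free slot index, or -1 when the slab is exhausted. */
    int allocate() {
        Integer slot = freeSlots.poll();
        return slot == null ? -1 : slot;
    }

    /** Returns a slot to the free list; the caller must not free a slot twice. */
    void free(int slot) {
        freeSlots.push(slot);
    }

    /** A slot-sized window onto the backing buffer for the given slot. */
    ByteBuffer view(int slot) {
        ByteBuffer window = backing.duplicate();
        window.position(slot * slotSize);
        window.limit((slot + 1) * slotSize);
        return window.slice();
    }
}
```

Because every slot has the same size, a freed slot can always be reused by the next allocation, so the region never fragments. The cost is wasted space whenever a value is smaller than its slot.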

You can either manage the memory manually and copy data onto the heap when necessary, or wrap references to the off-heap allocations in phantom references, use a reference queue to keep track of which referents have been garbage collected, and then free the corresponding memory using some form of free(). If you just need a cache, BigMemory provides a commercial, off-the-shelf solution.
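The phantom reference route can be sketched roughly as follows. Everything here is illustrative: OffHeapTracker and Handle are invented names, the long address stands in for whatever identifies your off-heap allocation, and the LongConsumer passed to the constructor stands in for "some form of free()".

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongConsumer;

// Hypothetical sketch: frees off-heap memory once its on-heap owner is collected.
final class OffHeapTracker {
    /** Remembers the off-heap address after the owner object is gone. */
    static final class Handle extends PhantomReference<Object> {
        final long address;

        Handle(Object owner, long address, ReferenceQueue<Object> queue) {
            super(owner, queue);
            this.address = address;
        }
    }

    private final ReferenceQueue<Object> queue = new ReferenceQueue<>();
    // Handles must stay strongly reachable, or they are never enqueued.
    private final Set<Handle> live = ConcurrentHashMap.newKeySet();
    private final LongConsumer free;

    OffHeapTracker(LongConsumer free) {
        this.free = free;
    }

    /** Registers an owner object together with the off-heap address it owns. */
    Handle track(Object owner, long address) {
        Handle handle = new Handle(owner, address, queue);
        live.add(handle);
        return handle;
    }

    /** Frees every allocation whose handle has been enqueued; returns how many. */
    int drain() {
        int freed = 0;
        Reference<?> ref;
        while ((ref = queue.poll()) != null) {
            Handle handle = (Handle) ref;
            live.remove(handle);
            free.accept(handle.address);
            freed++;
        }
        return freed;
    }

    int liveCount() {
        return live.size();
    }
}
```

You would call drain() periodically, or from a dedicated thread that blocks on the queue instead of polling. When a handle gets enqueued is up to the garbage collector, so the off-heap memory may linger for a while after the owner becomes unreachable.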

You're also free to rip out the allocator and cache I implemented in https://issues.apache.org/jira/b....

Obviously, both approaches employed by Todd and me are engineering-intensive and not simple to implement. But if GC tuning fails, and if you really want to minimize pauses with your Java/Scala/etc. apps, then you might want to experiment with these approaches.

Tuesday, August 18, 2015
