Updating to Nexus 3.23.0 causes 100% CPU load with GC

Hello!
Last night we updated our Docker instance of Nexus to 3.23.0.
In the morning, when developers started triggering their builds and requests began hitting Nexus, we saw the following:
Log message:
2020-05-07 07:25:29,543+0000 INFO [elasticsearch[97E976DC-49B0040B-E218D207-BCD3BD80-233B626D][scheduler][T#1]] *SYSTEM org.elasticsearch.monitor.jvm - [97E976DC-49B0040B-E218D207-BCD3BD80-233B626D] [gc][old][5032][29] duration [5s], collections [1]/[5.3s], total [5s]/[44s], memory [2.6gb]->[2.6gb]/[3.9gb], all_pools {[young] [1.2mb]->[23.9mb]/[1.1gb]}{[survivor] [0b]->[0b]/[119mb]}{[old] [2.6gb]->[2.6gb]/[2.6gb]}

At that moment the CPU was at 100% load.


The interesting part is that Nexus still had a huge amount of RAM free…

After that, the web interface kicked everyone off with the error “server disconnected”.

The number of open file descriptors started to grow.

The server doesn’t do anything: no logs, no responses, nothing. The last log line was the one about GC starting.

After restarting Nexus everything runs fine for about 2 hours, then it all repeats.

Our JVM memory settings are as follows:

-Xms4G 
-Xmx4G 
-XX:MaxDirectMemorySize=17530M

These follow the official documentation: System Requirements

As an experiment, for now we have increased -Xms and -Xmx to 5G.

Did any GC settings change in this release?
I did not find anything about that in the Release Notes.
Any advice or workaround to fix this? If a minor increase of -Xms and -Xmx doesn’t solve the problem, we will roll back to the previous version.

Maybe this is not the best place for such a question, so I’ve also created a ticket: https://issues.sonatype.org/browse/NEXUS-23826

Hello Dmitry,

I have disabled the old scheduled tasks which were created in the old version of Nexus, and that solved the problem with CPU spikes.

Hi.
The problem is not the CPU “spikes” themselves. The spikes were caused by the garbage collector, not by scheduled tasks.

We have had performance issues with Nexus for quite some time, as you can see from the first post.

We decided to conduct some experiments: DEBUG logs on, jmxtrans and JConsole (it ships with the Java SDK) connected. Time for some profiling, manually triggered GC and other cool stuff. Here we go.

Our server configuration:
12-core CPU / 32 GB RAM / 1.5 TB of active data. CentOS 7.7 as the host system, Nexus running as a Docker container. No other services on the server at all.
In our ticket Sonatype shared with us a formula to calculate heap and direct memory. Unfortunately it was not a saviour and we faced the same problem again.

Total memory to be used = minimum of 50% of the physical RAM on the host
min/max heap = from 50% to 66% of the total memory to be used
direct memory = from 33% to 50% of the total memory to be used (what you choose just depends on how much you've allocated to the jvm heap)

Exactly how much of the physical RAM you allocate is difficult to gauge because of OS-specific factors. Some OSes will consume quite a bit of physical memory for disk caching, etc. The goal of figuring out the "total memory to be used" above is primarily to keep JVM memory from being swapped to disk.
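
To make the formula concrete for a host like ours, here is a rough sketch of the arithmetic for 32 GB of RAM. The numbers are just the formula applied as written; treat them as an illustration, not as values Sonatype recommended for our specific box.

```python
# Rough calculation of the Sonatype formula for a host with 32 GB of RAM.
# All values in GB; the exact split between heap and direct memory is a judgment call.

physical_ram = 32

# "Total memory to be used" = at most 50% of physical RAM
total = physical_ram * 0.50                          # 16 GB

# min/max heap = 50%..66% of the total memory to be used
heap_min, heap_max = total * 0.50, total * 0.66      # 8.0 .. 10.56 GB

# direct memory = 33%..50% of the total memory to be used
direct_min, direct_max = total * 0.33, total * 0.50  # 5.28 .. 8.0 GB

print(f"total to use: {total:.1f} GB")
print(f"-Xms/-Xmx:    {heap_min:.1f} .. {heap_max:.1f} GB")
print(f"direct mem:   {direct_min:.1f} .. {direct_max:.1f} GB")
```

For 32 GB this works out to roughly a 16 GB budget, an 8–10.5 GB heap and 5–8 GB of direct memory. As mentioned above, the formula-based values did not save us, and the configuration we ended up with (below) is noticeably larger.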

For now we have the following JVM configuration and we are good to go. GC works as we expected: no memory leaks, no endless GC, stable CPU load around 30% with “spikes” to 50–60% a couple of times a day.

-Xms13G
-Xmx13G
-XX:MaxDirectMemorySize=14G
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCApplicationConcurrentTime
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=10
-XX:GCLogFileSize=50M
-Xloggc:/opt/sonatype/sonatype-work/nexus3/log/gc.%t.log
-XX:+UseG1GC
-XX:MaxGCPauseMillis=300
-XX:InitiatingHeapOccupancyPercent=70
-XX:+ParallelRefProcEnabled
-XX:+UseStringDeduplication
-XX:ConcGCThreads=5
-XX:ParallelGCThreads=8
-Dcom.sun.management.jmxremote=true
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.port=7688
-Dcom.sun.management.jmxremote.rmi.port=7688
-Djava.rmi.server.hostname=your_instance_name
-Dcom.sun.management.jmxremote.password.file=/nexus-data/jmx/jmxremote.password
-Dcom.sun.management.jmxremote.access.file=/nexus-data/jmx/jmxremote.access

We wrote a simple script which emulates the web interface’s search queries, using a random GUID as the package name, because we had some suspicions about the Elasticsearch index and thought it might be triggering the endless garbage collection for some reason.
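
We won’t reproduce the exact script here, but the idea is roughly like the sketch below. It assumes the public search REST endpoint /service/rest/v1/search and the Python requests library; the host name, thread count and duration are placeholders (our actual script went through the search the same way the web interface does).

```python
# Minimal sketch of a search load generator (not our exact script).
# Assumptions: Nexus reachable at NEXUS_URL, the v1 search REST endpoint,
# and the third-party "requests" library installed.
import time
import uuid
from concurrent.futures import ThreadPoolExecutor

import requests

NEXUS_URL = "http://nexus.example.local:8081"   # placeholder host
WORKERS = 20                                    # placeholder concurrency
DURATION_SECONDS = 30 * 60                      # ~30 minutes, like our test

def one_search(session: requests.Session) -> int:
    # Search for a random GUID as the package name; it will never match,
    # so every request forces Elasticsearch to run a real query.
    name = str(uuid.uuid4())
    resp = session.get(
        f"{NEXUS_URL}/service/rest/v1/search",
        params={"name": name},
        timeout=30,
    )
    return resp.status_code

def worker(deadline: float) -> None:
    with requests.Session() as session:
        while time.time() < deadline:
            try:
                one_search(session)
            except requests.RequestException:
                # Ignore individual failures; keep generating load until the deadline.
                pass

if __name__ == "__main__":
    deadline = time.time() + DURATION_SECONDS
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        for _ in range(WORKERS):
            pool.submit(worker, deadline)
```

The random GUID guarantees that every request is a new, non-matching search, which is exactly the kind of load we suspected of stressing the Elasticsearch index.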

As you can see from the graph, at the start we sent about 60–70 RPS to the service, then increased it to 260–270 RPS. As the other graphs show, GC works pretty well: it cleans the heap when it is 70% full, as set in the settings. The load test ran for about 30 minutes; Nexus felt quite well, CPU load was about 25–30%, and there were no memory leaks. GC runs in parallel with 8 threads.

We will live with these settings for a while, but it seems we have found the right balance and the service is stable.

Hope this helps anyone who faces the same problems we did.
Have a nice day! :slight_smile:
