Updating to Nexus 3.23.0 causes 100% CPU load with GC

Hello!
Last night we updated our Docker instance of Nexus to 3.23.0.
In the morning, when developers started triggering their builds and requests began hitting Nexus, we saw the following:
Log message:
2020-05-07 07:25:29,543+0000 INFO [elasticsearch[97E976DC-49B0040B-E218D207-BCD3BD80-233B626D][scheduler][T#1]] *SYSTEM org.elasticsearch.monitor.jvm - [97E976DC-49B0040B-E218D207-BCD3BD80-233B626D] [gc][old][5032][29] duration [5s], collections [1]/[5.3s], total [5s]/[44s], memory [2.6gb]->[2.6gb]/[3.9gb], all_pools {[young] [1.2mb]->[23.9mb]/[1.1gb]}{[survivor] [0b]->[0b]/[119mb]}{[old] [2.6gb]->[2.6gb]/[2.6gb]}

At this moment, the CPU was at 100% load.


The interesting thing is that Nexus still had a huge amount of RAM free…

After that, the web interface kicked everyone off with a “server disconnected” error.

The number of open file descriptors started to grow:

The server then does nothing: no logs, no responses, nothing. The last log line was about GC starting.

After restarting Nexus, everything is fine for about 2 hours, and then it all repeats.

Our JVM memory settings are as follows:

-Xms4G 
-Xmx4G 
-XX:MaxDirectMemorySize=17530M

We followed the official documentation: System Requirements

As an experiment, for now we have increased Xms and Xmx to 5G.

Did any of the GC settings change in this release?
I did not find anything about that in the Release Notes.
Any advice or workaround to fix this? If a minor increase of Xms and Xmx doesn’t solve the problem, we will roll back to the previous version.

Maybe this is not the best place for such a question, so I’ve created a ticket: https://issues.sonatype.org/browse/NEXUS-23826

Hello Dmitry,

I disabled the old scheduled tasks that were created in the previous version of Nexus, and that solved the problem with the CPU spikes.

Hi.
The problem is not the CPU “spikes” themselves. The spikes were caused by the garbage collector, not by the scheduled tasks.

We have had performance issues with Nexus for quite some time, as you can see from the first post.

We decided to conduct some experiments: DEBUG logs on, jmxtrans and JConsole (it ships with the Java SDK) connected. Time for some profiling, manually triggered GC and other cool stuff. Here we go.

Our server configuration:
12-core CPU / 32 GB RAM / 1.5 TB of active data. CentOS 7.7 as the host system, Nexus running as a Docker container. No other services on the server at all.
In our ticket, Sonatype shared a formula for calculating heap and direct memory. Unfortunately it was not a savior, and we ran into the same problem again.

Total memory to be used = minimum of 50% of the physical RAM on the host
min/max heap = from 50% to 66% of the total memory to be used
direct memory = from 33% to 50% of the total memory to be used (what you choose just depends on how much you've allocated to the jvm heap)

How much of the physical RAM you allocate is difficult to gauge because of OS-specific factors. Some OSes will consume quite a bit of physical memory for disk caching, etc. The goal of figuring out the "Total memory to be used" above is primarily to keep JVM memory from being swapped to disk.
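To put some numbers on that formula: taking 50% of our 32 GB host as the total gives about 16 GB to work with, which would mean roughly 8–10.5 GB of heap (50–66%) and roughly 5.3–8 GB of direct memory (33–50%).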

For now we have the following JVM configuration and we are good to go. GC works as we expected: no memory leaks, no infinite GC, and a stable CPU load around 30% with “spikes” to 50–60% a couple of times a day.

-Xms13G
-Xmx13G
-XX:MaxDirectMemorySize=14G
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCApplicationConcurrentTime
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=10
-XX:GCLogFileSize=50M
-Xloggc:/opt/sonatype/sonatype-work/nexus3/log/gc.%t.log
-XX:+UseG1GC
-XX:MaxGCPauseMillis=300
-XX:InitiatingHeapOccupancyPercent=70
-XX:+ParallelRefProcEnabled
-XX:+UseStringDeduplication
-XX:ConcGCThreads=5
-XX:ParallelGCThreads=8
-Dcom.sun.management.jmxremote=true
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.port=7688
-Dcom.sun.management.jmxremote.rmi.port=7688
-Djava.rmi.server.hostname=your_instance_name
-Dcom.sun.management.jmxremote.password.file=/nexus-data/jmx/jmxremote.password/
-Dcom.sun.management.jmxremote.access.file=/nexus-data/jmx/jmxremote.access/
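
(In case it helps anyone: we run the official sonatype/nexus3 Docker image, and as far as I know these flags can be supplied either in the bin/nexus.vmoptions file or via the container’s INSTALL4J_ADD_VM_PARAMS environment variable, something like

INSTALL4J_ADD_VM_PARAMS=-Xms13G -Xmx13G -XX:MaxDirectMemorySize=14G -XX:+UseG1GC ...

Double-check the variable name against the README of your image version.)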

We wrote a simple script that emulates the web interface’s search queries, using a random GUID as the package name, because we had some suspicions about the Elasticsearch index and thought it might be what triggers the infinite garbage collection.
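
A rough sketch of the idea (not our exact script) looks like this; it assumes the REST search endpoint /service/rest/v1/search and a Nexus instance on localhost, so adjust the URL, credentials and request rate to your setup:

# load_search.py - naive search-load generator for Nexus (sketch, see assumptions above)
import time
import uuid

import requests

NEXUS_URL = "http://localhost:8081"   # assumption: your Nexus base URL
TARGET_RPS = 60                       # how many search requests to fire per second

def search_random_package():
    # query for a random GUID so every search misses and has to hit the Elasticsearch index
    name = str(uuid.uuid4())
    response = requests.get(
        f"{NEXUS_URL}/service/rest/v1/search",
        params={"name": name},
        timeout=10,
    )
    return response.status_code

if __name__ == "__main__":
    while True:
        started = time.time()
        for _ in range(TARGET_RPS):
            try:
                search_random_package()
            except requests.RequestException as exc:
                print(f"request failed: {exc}")
        # naive pacing: sleep out whatever is left of the current second
        elapsed = time.time() - started
        if elapsed < 1.0:
            time.sleep(1.0 - elapsed)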

As you can see from the graph, at the start we sent about 60–70 RPS to the service, then increased it to 260–270 RPS. As the other graphs show, GC works pretty well: it cleans the heap when it is 70% full, as set in our settings. The load test ran for about 30 minutes, Nexus felt quite well, CPU load was about 25–30%, and there were no memory leaks. GC was running in parallel with 8 threads.

For now we will live with these settings for a while, but it seems that we have found the right balance and the service is stable.

Hope this helps anyone who faces the same problems as we did.
Have a nice day! :slight_smile:


Thanks for this.

We are also experiencing OOM issues, probably due to NuGet v2, in our instance running 3.27.0. I still have the recommended maximum 4 GB heap configured, but I see you have set it to 13 GB and are using G1GC. Is your instance still running stably?

Thanks, Mariska.

It was pretty stable from 3.23 to 3.25.
Right now we have the same problem again. I asked the Sonatype engineers in the issue tracker; it seems the nuget-group endpoint misbehaves. The only solution recommended by Sonatype is to convert all proxy repositories to the V3 version of the NuGet API.
Which, obviously, we cannot do, because we have no direct access to the remote repository that is proxied by Nexus.
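(For context, converting a proxy to V3 essentially means pointing its remote URL at a V3 service index; for nuget.org, for example, that would be https://api.nuget.org/v3/index.json instead of the old V2 endpoint. That only works when the upstream gallery actually exposes a V3 index, which ours does not.)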

Very sad :frowning:

I’m afraid we suffer from the same issue, then. Sonatype suggested it was also possible to turn off the feature by setting nexus.nuget.multiple.latest.fix=false,
but I’m not sure what they mean by “if you don’t rely on PowerShell”.

Solving NuGet Performance Problems in Nexus Repo 3 – Sonatype Support.
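
If I understood the support article correctly, the property goes into $data-dir/etc/nexus.properties (for the Docker image that should be /nexus-data/etc/nexus.properties; create the file if it doesn’t exist) and requires a restart to take effect:

nexus.nuget.multiple.latest.fix=false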

Under the hood, PowerShell modules are just NuGet packages as well.

So if your NuGet repo contains a mix of, let’s say, C# and PowerShell packages, you might run into other issues.

In our organisation we have separate repositories for the C# folks and for the PowerShell folks. We will try this advice next week.

@tallandtree We ended up between a rock and a hard place with the Nuget V2 implementation. V2 relies on OData which can produce a wide variety of queries which makes it very difficult / impossible to know what a client might use. At the same time the current version of NXRM3 relies on OrientDB which has some performance quirks. Each time we’ve tried to improve the performance of one query we’ve made a trade off for another. The PowerShell problem is an example of that.

The PowerShell client makes use of a paging mechanism that the NuGet client doesn’t. We found a bug that prevented the PowerShell client from finding packages; once we fixed that bug, the fix had a performance impact on the normal NuGet client.

Microsoft are starting to deprecate some OData (v2) endpoints. We’ve worked to keep builds via NXRM working but, even with these changes, Nuget users should be thinking about moving to V3.

The good news is that Nuget v3 has a much more streamlined design which has allowed us to break this cycle where we fix one thing and break another. Our testing shows that on average builds via NXRM are ~42% faster, searching is ~25% faster and memory usage is decreased by ~48% when using v3.

We are also attacking this problem from two sides and are simultaneously working hard on a new database implementation.

Which, obviously, we cannot do, because we have no direct access to the remote repository that is proxied by Nexus.

@ado @tallandtree Does this mean you are proxying another repository manager that isn’t configured to use V3 yet, or does it mean that you don’t have access to administer the proxy repositories in NXRM?


It doesn’t mean either of the things you suggested :slight_smile:

There is a certain NuGet Gallery out on the public internet, and we are proxying this gallery through Nexus.
Of course I can do anything with this repository INSIDE Nexus, like enabling/disabling the negative cache, changing how long artifacts are cached, and so on.

But I can’t do anything about the gallery itself. It’s a pretty old version, so it doesn’t support API V3, only V2. I can’t update this remote gallery by any means.

Unfortunately, for us this means regular performance issues with infinite GC. If you are interested in a little more detail, our nuget-group looks like this:

  1. Hosted Nexus repo (which supports API V3)
  2. Proxy of NuGet.org (which supports API V3)
  3. Proxy of a specific NuGet Gallery (which does not support API V3)

With our configuration (12 cores, 42 GB RAM on the host, of which 32 GB is Xms/Xmx, and about 2.2 TB of active data) we found that Nexus runs into infinite Full GC at a load of about 30–35 RPS directed only at the nuget-group endpoint.

One option is to rate-limit requests to this location with nginx (a rough sketch is below), but when CI is triggered and dotnet build starts restoring packages from Nexus, it can generate much more than 30–35 RPS, especially for big projects or when several builds are triggered simultaneously.
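
A minimal sketch of what such an nginx limit could look like; the hostname, location path and upstream address are placeholders, not our real config:

http {
    # allow roughly 30 requests per second per client IP to the nuget-group endpoint
    limit_req_zone $binary_remote_addr zone=nuget_group:10m rate=30r/s;

    server {
        listen 80;
        server_name nexus.example.com;            # placeholder hostname

        location /repository/nuget-group/ {
            limit_req zone=nuget_group burst=50;  # queue short bursts instead of rejecting them
            proxy_pass http://127.0.0.1:8081;     # placeholder: wherever the Nexus container is published
        }
    }
}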

Have you tried it using nexus.nuget.multiple.latest.fix=false?

Will try this on Monday, thank you

We are proxying chocolatey.org as well, but this is a separate group and is not used as much as the other NuGet repositories. As far as I could see, Chocolatey does not yet have a V3 API.
Would it make sense to increase the heap size and tune GC like Dmitry did, or doesn’t that make any difference at all?
I will set nexus.nuget.multiple.latest.fix=false until all users have moved to V3, but I don’t know which users will suffer from the bug. I guess we will find out.

Tried it today. We created a little load test at 50–70 RPS (increasing by 5 RPS every 5 minutes).
The result was the same: after 3 minutes, infinite Full GC. Even stopping the load test had no effect; only restarting the Nexus container helped.


The load test was started at 13:25. The heap was completely filled in less than a minute and Full GC was triggered.

So, in our case this solution doesn’t work at all :frowning:

I’ve also set nexus.nuget.multiple.latest.fix=false and in our case it seems to work, though I have not done a stress test yet. I can see that the heap no longer grows beyond 86%. I still have the recommended heap size settings (max 4 GB) and have not changed the GC settings either. I will let you know how it goes after a stress test.
The load on our instance is probably a lot lower than on yours, and I only had 2 OOMs in one month: the first a few days after I upgraded to 3.26, and the second last week on 3.27 (which had already been running for 2 weeks). We didn’t have any major issues before 3.26, actually.

CMS is okay for heaps of 4 GB or less, but in our case it’s not an option because of the load and the database size. For heaps larger than 4 GB, G1GC is the way to go, and that’s what we are using :slight_smile: