Occasional java.lang.reflect.InvocationTargetException when using S3 blob storage

When running Nexus Repository 3 (OSS 3.55.0-01), we occasionally see a java.lang.reflect.InvocationTargetException when accessing artifacts stored in S3.

This causes any applications/systems accessing the Nexus at that time to receive HTTP 500 responses.

Restarting the system service or rebooting the machine solves the issue.

The log at the time of the exception:

2023-06-07 04:49:20,872+0200 DEBUG [periodic-4-thread-1 <query>SELECT FROM quartz_trigger WHERE name = ? AND group = ?</query>] *SYSTEM com.amazonaws.http.conn.ClientConnectionManagerFactory -
java.lang.reflect.InvocationTargetException: null
at sun.reflect.GeneratedMethodAccessor102.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76)
at com.amazonaws.http.conn.$Proxy196.requestConnection(Unknown Source)
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:176)
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1346)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1157)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:814)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:781)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:755)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:715)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:697)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:561)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:541)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5456)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5403)
at com.amazonaws.services.s3.AmazonS3Client.getAcl(AmazonS3Client.java:4062)
at com.amazonaws.services.s3.AmazonS3Client.getBucketAcl(AmazonS3Client.java:1278)
at com.amazonaws.services.s3.AmazonS3Client.getBucketAcl(AmazonS3Client.java:1268)
at com.amazonaws.services.s3.AmazonS3Client.doesBucketExistV2(AmazonS3Client.java:1406)
at org.sonatype.nexus.blobstore.s3.internal.S3BlobStore.isStorageAvailable(S3BlobStore.java:831)
at org.sonatype.nexus.blobstore.s3.internal.S3BlobStore$$EnhancerByGuice$$951667108.GUICE$TRAMPOLINE(<generated>)
at com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:74)
at com.palominolabs.metrics.guice.TimedInterceptor.invoke(TimedInterceptor.java:26)
at com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:75)
at com.google.inject.internal.InterceptorStackCallback.invoke(InterceptorStackCallback.java:55)
at org.sonatype.nexus.blobstore.s3.internal.S3BlobStore$$EnhancerByGuice$$951667108.isStorageAvailable(<generated>)
at org.sonatype.nexus.repository.internal.blobstore.BlobStoreStateHealthCheck.lambda$0(BlobStoreStateHealthCheck.java:61)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.Collections$2.tryAdvance(Collections.java:4719)
at java.util.Collections$2.forEachRemaining(Collections.java:4727)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
at org.sonatype.nexus.repository.internal.blobstore.BlobStoreStateHealthCheck.check(BlobStoreStateHealthCheck.java:67)
at com.codahale.metrics.health.HealthCheck.execute(HealthCheck.java:374)
at com.codahale.metrics.health.HealthCheckRegistry.runHealthCheck(HealthCheckRegistry.java:160)
at org.sonatype.nexus.rapture.internal.LocalSystemCheckService$1.load(LocalSystemCheckService.java:120)
at org.sonatype.nexus.rapture.internal.LocalSystemCheckService$1.load(LocalSystemCheckService.java:1)
at com.google.common.cache.CacheLoader.reload(CacheLoader.java:101)
at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3532)
at com.google.common.cache.LocalCache$Segment.loadAsync(LocalCache.java:2287)
at com.google.common.cache.LocalCache$Segment.refresh(LocalCache.java:2360)
at com.google.common.cache.LocalCache.refresh(LocalCache.java:4134)
at com.google.common.cache.LocalCache$LocalLoadingCache.refresh(LocalCache.java:4965)
at org.sonatype.nexus.rapture.internal.LocalSystemCheckService.lambda$3(LocalSystemCheckService.java:97)
at java.lang.Iterable.forEach(Iterable.java:75)
at java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1082)
at org.sonatype.nexus.rapture.internal.LocalSystemCheckService.lambda$2(LocalSystemCheckService.java:95)
at org.sonatype.nexus.scheduling.internal.PeriodicJobServiceImpl.lambda$2(PeriodicJobServiceImpl.java:109)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.IllegalStateException: Connection pool shut down
at org.apache.http.util.Asserts.check(Asserts.java:34)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.requestConnection(PoolingHttpClientConnectionManager.java:269)
... 65 common frames omitted

Hi Niels,

I would suggest contacting support on this one. However, I do have a few questions…

Do you have any other info like:

  1. How long does it take before the issue appears?
  2. Is there any relation to high load? It might be a connection limit of some kind.
  3. Do you have any insight into other triggers?
  4. Is it a full-stop scenario, or is only a percentage of traffic impacted?

Thanks!

Regards,

Joris

P.S. I’m not an expert here, but it could be that the connection pool is closed somehow, so that when S3 is accessed there are no active connections left to use: apache httpclient 4.x - java.lang.IllegalStateException: Connection pool shut down - Stack Overflow
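To illustrate that failure mode: Apache HttpClient's `PoolingHttpClientConnectionManager` guards `requestConnection` with the `Asserts.check` visible at the bottom of the trace, and once the manager has been shut down every later request fails with exactly this message. A toy Python sketch of the same guard pattern (purely illustrative; the class and method names are made up):

```python
class ConnectionPool:
    """Toy model of Apache HttpClient's pooling connection manager."""

    def __init__(self):
        self.shut_down = False

    def shutdown(self):
        # After this, the pool must never hand out connections again.
        self.shut_down = True

    def request_connection(self):
        # Mirrors Asserts.check(!shutdown, "Connection pool shut down")
        if self.shut_down:
            raise RuntimeError("Connection pool shut down")
        return "connection"


pool = ConnectionPool()
pool.shutdown()
try:
    pool.request_connection()
except RuntimeError as exc:
    print(exc)  # → Connection pool shut down
```

This also explains why a restart helps: the fresh process builds a fresh pool object.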


Hi Joris,

Thank you for your response.

  1. That’s quite hard to measure to be honest. The issue seems to appear rather randomly/sporadically.
  2. That’s what I thought as well. While I can’t fully rule it out yet, according to our monitoring CPU/memory usage seems reasonable even when the issue occurs.
  3. Unfortunately not.
  4. Full stop, all pipelines accessing the Nexus to download artifacts fail consistently while this issue occurs. After a restart (systemctl or reboot) it’s working normally again.

I did open a support ticket. Unfortunately, we’re not on a premium support plan, so the support agent referred me to this public forum. :slight_smile:

The connection pool could be a plausible explanation, especially because restarting fixes the issue altogether. In that case, implementing a health check and automatically restarting the service might be a viable workaround.

Thank you.

Best regards,
Niels

Hi Niels,

Yesterday I implemented a health check against the API endpoint “/service/rest/v1/status/writable”; maybe that will cause your container to restart automatically when you hit the issue?

One other thing: can you outline which database you are using, and perhaps its sizing?
The reason for asking: the longer a request takes to process, the more connections you consume. For example, blob references are stored in the database and cause significant load on the database side when not properly cached. I recall that a standalone setup uses Elasticsearch for indexing, while a cluster setup does not.
On S3, do you use “S3 Standard” or some other storage class?
Do you run the app on ECS/Kubernetes or EC2?
What type of storage are you using for persistent storage? (I’m mainly interested in the performance of the storage you picked.)

Keep me posted on this one.

Regards,

Joris

Hi Joris,

Yes, we’ve also found the following two endpoints, which we have added to our uptime monitoring:

  • /service/rest/v1/status
  • /service/rest/v1/status/writable

Once we can confirm that these endpoints report the issue correctly (for example during the next outage), we can implement a health check to automatically restart the system service.
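As a sketch of such a watchdog (assumptions on my side: Nexus listens on localhost:8081 and the systemd unit is called `nexus`; adjust both), with the probe separated from the restart action:

```python
import subprocess
import urllib.error
import urllib.request

NEXUS_URL = "http://localhost:8081"  # assumption: default Nexus port
SERVICE_NAME = "nexus"               # assumption: name of the custom systemd unit


def is_writable(base_url: str, timeout: float = 10.0) -> bool:
    """Probe /service/rest/v1/status/writable; healthy only on HTTP 200."""
    try:
        with urllib.request.urlopen(
            base_url + "/service/rest/v1/status/writable", timeout=timeout
        ) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, and non-2xx responses all count as unhealthy.
        return False


def restart_service() -> None:
    """Restart the unit; run the watchdog from cron or a systemd timer as root."""
    subprocess.run(["systemctl", "restart", SERVICE_NAME], check=True)


if __name__ == "__main__":
    if not is_writable(NEXUS_URL):
        print("not writable; would restart", SERVICE_NAME)
        # restart_service()  # enable once the probe is confirmed to catch the outage
```

The restart call is left commented out until the endpoint is confirmed to actually report the outage.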

We’re running a plain EC2 instance. The Nexus3 binary comes from the official bundle (https://download.sonatype.com/nexus/3/nexus-3.55.0-01-unix.tar.gz) and is started via a custom systemd service.

Blob storage is S3 (Standard Storage Class), the rest is stored on the system using a mounted EFS volume.

Thank you and best regards,
Niels

Hi Niels,

Keep me posted on the endpoints and if they report an error status on hitting your issue.

So you are using the embedded (default) OrientDB database on EFS storage? To be honest, I think that solution will scale really badly: as far as I am aware, EFS is not advised for storing the OrientDB database. The embedded database is already a weak option anyway; however, if your license does not allow anything better, I would advise using EBS storage instead of EFS…

Regards,

Joris

Hey Joris,

I will. :slight_smile: Thank you for your support so far.

Regarding the database: we generally agree. While the system was set up before my time, I believe the reasoning was that our Nexus installation isn’t a system-critical component, so the default database was considered acceptable.

Though, considering how many of our CI/CD pipelines are blocked if the Nexus is unavailable, we may need to rethink that statement.

I’ll let you know if I have any news. :slight_smile:

Best regards,
Niels

Unfortunately, the endpoints don’t return an HTTP status other than 200 when the issue occurs. The only endpoint that indicates the issue seems to be “/service/rest/v1/status/check”. This one requires authorization, though, and also doesn’t respond with a non-200 status when there’s an error. Instead, you’d need to parse the JSON body:

{
  "Blob Stores Ready": {
    "healthy": false,
    "message": "Connection pool shut down",
    [...]
  }
}
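Since that endpoint answers 200 even when unhealthy, a monitor has to inspect the body. A minimal sketch (the payload shape and field names are taken from the sample above):

```python
import json


def unhealthy_checks(payload: str) -> list:
    """Return the names of health checks reporting "healthy": false."""
    checks = json.loads(payload)
    return [
        name
        for name, result in checks.items()
        if not result.get("healthy", True)
    ]


# Trimmed sample body from /service/rest/v1/status/check
sample = """{
  "Blob Stores Ready": {
    "healthy": false,
    "message": "Connection pool shut down"
  }
}"""

print(unhealthy_checks(sample))  # → ['Blob Stores Ready']
```

A watchdog could then restart the service whenever this list is non-empty, rather than relying on the HTTP status code.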