Type: Bug
Resolution: Done
Priority: Medium
Fix Version/s: New Delhi Release
Affects Version/s: New Delhi Release
Component/s: None
Labels: None

Description:

We hit a Hazelcast OOM failure on the stability environment. Afterwards, NCMP and Hazelcast appear not to have recovered properly, which caused further failures and constantly failing test runs on that environment.

Logs:

Pod that had the first OOM:

eric-oss-ncmp-6c9f45c4d5-pwjf6_fen28.txt
eric-oss-ncmp-6c9f45c4d5-pwjf6_feb29.txt
02_28_ncmp-04_pwjf6.log

2024-02-28T05:21:51.206Z@eric-oss-ncmp-04@ncmp@Stopping container due to an Error, logger: org.springframework.kafka.listener.KafkaMessageListenerContainer, thread_name: org.springframework.kafka.KafkaListenerEndpointContainer#4-0-C-1, stack_trace: java.lang.OutOfMemoryError: Java heap space@

Other pod:

eric-oss-ncmp-6c9f45c4d5-gpxlx_fen28.txt
eric-oss-ncmp-6c9f45c4d5-gpxlx_feb29.txt
02_28_ncmp-05_gpxlx.log
02_29_ncmp-05_gpxlx_1.log
02_29_ncmp-05_gpxlx_2.log

2024-02-29T09:25:25.466Z@eric-oss-ncmp-05@ncmp@java.lang.OutOfMemoryError: Java heap space
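
To locate the OOM events in the attached pod logs, a small filter along these lines may help. It is only a sketch: it assumes every entry follows the '@'-separated layout of the two excerpts above (timestamp, host, service, message), and the names `OomEvent` and `parse_oom_line` are hypothetical helpers, not part of CPS.

```python
from typing import NamedTuple, Optional


class OomEvent(NamedTuple):
    timestamp: str
    host: str
    service: str
    message: str


def parse_oom_line(line: str) -> Optional[OomEvent]:
    """Return an OomEvent if the line reports a Java OutOfMemoryError, else None."""
    if "java.lang.OutOfMemoryError" not in line:
        return None
    # Assumed layout: timestamp@host@service@message (the message may contain more '@').
    parts = line.rstrip("\n").split("@", 3)
    if len(parts) < 4:
        return None
    timestamp, host, service, message = parts
    return OomEvent(timestamp, host, service, message)
```

Applying `parse_oom_line` to each line of a log file and keeping the non-None results gives the OOM timeline per pod.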

Faulty version:

CPS 3.4.3

Reproduction:

Stability env:

Since the first failure, NCMP-related TCs have been failing constantly on the stability environment.
That environment has two big ENMs (30k + 50k) already discovered, and the PE functional tests (discovery of a small ENM) and the CH load test are run against it.

Manual reproduction:

EIC deployment with big ENMs discovered (80k or 30k + 50k)
Send multiple requests to the searches or id-searches endpoints with model filters
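
The manual reproduction steps could be scripted roughly as follows. This is a minimal sketch, not the script used on the stability environment: the endpoint paths, the request body shape, and the module name are assumptions that must be checked against the deployed NCMP REST API, and `build_module_filter`, `post_json`, and `run_load` are hypothetical helper names.

```python
import json
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# Assumed paths for the searches and id-searches endpoints; verify before use.
ENDPOINTS = ["/ncmp/v1/ch/searches", "/ncmp/v1/ch/id-searches"]


def build_module_filter(module_name: str) -> dict:
    """Build a search body filtering cm handles by module (assumed schema)."""
    return {
        "cmHandleQueryParameters": [
            {
                "conditionName": "hasAllModules",
                "conditionParameters": [{"moduleName": module_name}],
            }
        ]
    }


def post_json(url: str, payload: dict, timeout: float = 60.0) -> int:
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are reported as status codes rather than raised.
        return err.code


def run_load(send: Callable[[str, dict], int], base_url: str, payload: dict,
             rounds: int, workers: int = 8) -> List[int]:
    """Send `rounds` requests to each endpoint concurrently; return the statuses."""
    urls = [base_url + path for path in ENDPOINTS] * rounds
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda url: send(url, payload), urls))
```

A call such as `run_load(post_json, "http://<ncmp-host>:<port>", build_module_filter("<module-name>"), rounds=50)` would issue 100 concurrent searches; the host, port, and module name are placeholders to fill in for the environment under test.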

Expected Behavior:

No OOM errors; if one does occur accidentally, proper error handling and a healthy recovery.


Attachments:

02_28_eric-oss-ncmp-database-pg.log (188 kB, 12/Mar/24 3:04 PM)
CPS_OOM_Hazelcast.tgz (4.22 MB, 06/Mar/24 10:14 AM)

Issue Links:

blocks:
CPS-2122 I/O error while reading input message (Closed)
CPS-2156 Postgres DB run out of shared_buffers (Closed)

relates to:
CPS-2150 Async task execution failed by TimeoutException (Closed)

Assignee: Priyank Maheshwari
Reporter: efreend
Votes: 0
Watchers: 9
Created: 06/Mar/24 10:39 AM
Updated: 12/Apr/24 1:05 PM
Resolved: 11/Apr/24 9:47 AM
