Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Issue]: Assertion false failed in get_multiplier_from_str and core dumped #182

Closed
milindsmart opened this issue Jun 28, 2024 · 2 comments
Closed

Comments

@milindsmart
Copy link

Problem Description

rocm-smi crashes on running with the following output:

[milind@desktop-linux ~]$ rocm-smi
518578 detected dead process, and making mutex /rocm_smi_card2 consistent.


Exception caught: map::at
python3: /home/milind/rpmbuild/BUILD/rocm_smi_lib-rocm-6.1.2/src/rocm_smi.cc:133: uint64_t get_multiplier_from_str(char): Assertion `false' failed.
Aborted (core dumped)

Operating System

Fedora Linux 40 (Workstation Edition)

CPU

AMD Ryzen 7 5700G with Radeon Graphics

GPU

AMD Radeon VII

ROCm Version

ROCm 6.1.0

ROCm Component

rocm_smi_lib

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

ROCk module is loaded

HSA System Attributes

Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES

==========
HSA Agents


Agent 1


Name: AMD Ryzen 7 5700G with Radeon Graphics
Uuid: CPU-XX
Marketing Name: AMD Ryzen 7 5700G with Radeon Graphics
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 4673
BDFID: 0
Internal Node ID: 0
Compute Unit: 16
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 15632744(0xee8968) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 15632744(0xee8968) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 15632744(0xee8968) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:


Agent 2


Name: gfx90c
Uuid: GPU-XX
Marketing Name: AMD Radeon Graphics
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 1024(0x400) KB
Chip ID: 5688(0x1638)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2000
BDFID: 3072
Internal Node ID: 1
Compute Unit: 8
SIMDs per CU: 4
Shader Engines: 1
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 40(0x28)
Max Work-item Per CU: 2560(0xa00)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 472
SDMA engine uCode:: 40
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 524288(0x80000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 524288(0x80000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx90c:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***

Additional Information

I debugged the problem to

ret = GetDevValueVec(type, dv_ind, &val_vec);
where GetDevValueVec returns val_vec with the following content

(gdb) print val_vec
$2 = std::vector of length 3, capacity 4 = {"2: 400Mhz ", "3: 1067Mhz ", "   1066Mhz *"}

Each piece of text is processed by freq_string_to_int, which assumes the presence of a number: https://github.com/ROCm/rocm_smi_lib/blob/develop/src/rocm_smi.cc#L160-L170 .

@ppanchad-amd
Copy link

@milindsmart Internal ticket has been created to investigate this issue. Thanks!

@jamesxu2
Copy link
Contributor

jamesxu2 commented Aug 8, 2024

Hello @milindsmart, thanks for debugging the problem to this point. Your issue seems similar to #167.

Please try disabling your 5700G's integrated graphics in your BIOS and rerunning rocm-smi. There is an advisory on the official docs (https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/amdgpu-install.html) to disable integrated graphics in the BIOS before running ROCm, and this may be the cause of failure. I think this warning in the docs should be more widespread however and I'll talk to our docs team to get this fixed.

I'll close this ticket for now, but if you encounter it after disabling integrated graphics, feel free to reopen it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants