Views: 4127 | Replies: 1
sic029 (Iron Bug, new to the forum)

[Help] Submitting parallel siesta with qsub fails, seeking help
Hi everyone, I'd like to share a problem I've run into using programs on our cluster; thanks in advance. Submitting the job with qsub fails almost immediately with:

[node21:10714] *** An error occurred in MPI_Comm_rank
[node21:10714] *** on communicator MPI_COMM_WORLD
[node21:10714] *** MPI_ERR_COMM: invalid communicator
[node21:10714] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
mpirun has exited due to process rank 3 with PID 10711 on node node21 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in the job did. This can cause a job to hang indefinitely while it waits for all processes to call "init". By rule, if one process calls "init", then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize". By rule, all processes that call "init" MUST call "finalize" prior to exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[node21:10707] 7 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[node21:10707] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

This is a freshly compiled siesta; a job submitted with qsub ends almost at once with the message above. Could someone help me diagnose it? On another cluster the same build runs fine when launched directly with mpirun -np 4 siesta, so I do not understand why qsub on the new cluster produces this error. The new cluster does not allow logging in to the compute nodes, so getting qsub to work is my only option. Thanks.

I am not sure where the problem lies. lammps and vasp, compiled for parallel execution in the same environment, both run without trouble; only siesta fails whenever a job goes through qsub, even though the same parallel siesta build runs fine with mpirun -np 4 siesta on a compute node of the other cluster. Very frustrating.

I did also try mpirun on the login node; please have a look. It seems the administrator has restricted it, so it cannot be used there either:

mpirun -np 4 siesta
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes. This will severely limit memory registrations.
(the libibverbs warning is printed once per rank, four times in total)
--------------------------------------------------------------------------
The OpenFabrics (openib) BTL failed to initialize while trying to allocate some locked memory. This typically can indicate that the memlock limits are set too low. For most HPC installations, the memlock limits should be set to "unlimited". The failure occured here:

Local host: manage1
OMPI source: btl_openib_component.c:1115
Function: ompi_free_list_init_ex_new()
Device: mlx4_0
Memlock limit: 32768

You may need to consult with your system administrator to get this problem fixed. This FAQ entry on the Open MPI web site may also be helpful: http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

Local host: manage1
Local device: mlx4_0
--------------------------------------------------------------------------
[manage1:16214] *** An error occurred in MPI_Comm_rank
[manage1:16214] *** on communicator MPI_COMM_WORLD
[manage1:16214] *** MPI_ERR_COMM: invalid communicator
[manage1:16214] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 16212 on node manage1 exiting improperly. (Same two-reason explanation as in the first log.)
--------------------------------------------------------------------------
[manage1:16211] 3 more processes have sent help message help-mpi-btl-openib.txt / init-fail-no-mem
[manage1:16211] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[manage1:16211] 3 more processes have sent help message help-mpi-btl-openib.txt / error in device init
[manage1:16211] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal

The compute nodes cannot be reached at all; access is completely locked down. The cluster runs PBS with the gridview job manager, and my submission script is:

====================================
#PBS -N test
#PBS -l nodes=1:ppn=8
#PBS -j oe
#PBS -l walltime=24:00:00
cd $PBS_O_WORKDIR
NP=`cat $PBS_NODEFILE|wc -l`
source /public/software/mpi/openmpi1.5.4-intel.sh
mpirun -machinefile $PBS_NODEFILE -np $NP \
/home/sw/siesta/siesta-3.1/Obj/siesta < fe.fdf | tee output
=====================================

Thanks in advance, fellow forum members.
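The "MPI_ERR_COMM: invalid communicator" failure right at MPI_Comm_rank, before any real work starts, is often a sign that the binary was linked against a different MPI than the mpirun that launches it, and the libibverbs warnings point at the locked-memory limit inside the job. A small diagnostic sketch one could add to the job script before the mpirun line to gather evidence (the siesta path is the one from the script above; everything else is an illustrative assumption, not a confirmed fix):

```shell
#!/bin/sh
# Sketch: sanity checks to run inside the PBS job, before mpirun.

# 1. Which mpirun is actually on PATH after sourcing the MPI env script?
command -v mpirun || echo "mpirun not found on PATH"

# 2. Does the siesta binary link against that same MPI? (path taken from
#    the job script above; skipped if the binary is not present here)
SIESTA=/home/sw/siesta/siesta-3.1/Obj/siesta
[ -x "$SIESTA" ] && ldd "$SIESTA" | grep -i mpi

# 3. Is the locked-memory limit large enough for the openib BTL?
check_memlock() {
    limit=$(ulimit -l)
    if [ "$limit" = "unlimited" ]; then
        echo "memlock OK: unlimited"
    else
        echo "memlock limit is ${limit} kB; openib may fail to register memory"
    fi
}
check_memlock
```

Because qsub jobs run in a non-login environment, the output of these checks inside the job can differ from what the same commands print on the login node, which is exactly the kind of mismatch the error messages suggest.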

Silver Bug (gaining a reputation)

I ran into a similar problem with vasp a couple of days ago and have just solved it.

--------------------------------------------------------------------------
The OpenFabrics (openib) BTL failed to initialize while trying to allocate some locked memory. This typically can indicate that the memlock limits are set too low. For most HPC installations, the memlock limits should be set to "unlimited". The failure occured here:

Local host: node21
OMPI source: btl_openib_component.c:1055
Function: ompi_free_list_init_ex_new()
Device: mlx4_0
Memlock limit: 65536

You may need to consult with your system administrator to get this problem fixed. This FAQ entry on the Open MPI web site may also be helpful: http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

Items 15, 16 and 17 on that FAQ page explain it quite clearly. In my case ulimit -a showed a normal locked-memory limit on every node, yet the job still complained that memory could not be allocated. The FAQ says this can happen when the system-configured locked-memory limit is not actually applied at login, or when the job scheduler does not grant the application enough locked memory. In the end I restarted the PBS scheduler daemon on every node and the problem went away.

Alternatively, you could add ulimit -l unlimited before the mpirun line and try submitting with qsub again.

Hope this helps!
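Combining the reply's suggestion with the original submission script gives a variant like the following. This is only a sketch: whether the ulimit call takes effect inside the job depends on the limit the PBS daemon itself was started with, which is why the reply's actual fix was restarting the scheduler daemon on each node (an administrator task).

```shell
#PBS -N test
#PBS -l nodes=1:ppn=8
#PBS -j oe
#PBS -l walltime=24:00:00
cd $PBS_O_WORKDIR
NP=`cat $PBS_NODEFILE | wc -l`
source /public/software/mpi/openmpi1.5.4-intel.sh

# Raise the locked-memory limit for this job before launching MPI, as
# suggested in the reply. Note: this cannot exceed the hard limit the
# PBS daemon inherited; if it fails, the daemons need to be restarted
# with a larger limit instead.
ulimit -l unlimited

mpirun -machinefile $PBS_NODEFILE -np $NP \
    /home/sw/siesta/siesta-3.1/Obj/siesta < fe.fdf | tee output
```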