sic029鐵蟲 (new to the forum)

[Help] Submitting parallel SIESTA with qsub fails, asking for help
Hello everyone, I'd like to ask about a problem I've run into running programs on a cluster. Many thanks. This is the output I get:

[node21:10714] *** An error occurred in MPI_Comm_rank
[node21:10714] *** on communicator MPI_COMM_WORLD
[node21:10714] *** MPI_ERR_COMM: invalid communicator
[node21:10714] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
mpirun has exited due to process rank 3 with PID 10711 on node node21 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in the job did. This can cause a job to hang indefinitely while it waits for all processes to call "init". By rule, if one process calls "init", then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize". By rule, all processes that call "init" MUST call "finalize" prior to exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[node21:10707] 7 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[node21:10707] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

The freshly compiled SIESTA, submitted with qsub, finishes almost immediately with the messages above. Could someone help me diagnose what is going on? After compiling it on another cluster I can run it directly with mpirun -np 4 siesta without any trouble, so I don't understand why submitting through qsub on the new cluster gives this error. The new cluster does not allow logging into the compute nodes, so I have to get qsub working. Many thanks.

I'm not sure where the problem lies. LAMMPS and VASP compiled in parallel in this same environment both run fine; only SIESTA never computes properly when the job is submitted with qsub, even though a parallel SIESTA build runs fine with mpirun -np 4 siesta on a compute node in the other environment. Very frustrating.

Oh, and I did try mpirun on the login node; please take a look. It seems the administrator has configured things so it cannot be used there either:

mpirun -np 4 siesta
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes. This will severely limit memory registrations.
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes. This will severely limit memory registrations.
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes. This will severely limit memory registrations.
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes. This will severely limit memory registrations.
--------------------------------------------------------------------------
The OpenFabrics (openib) BTL failed to initialize while trying to allocate some locked memory. This typically can indicate that the memlock limits are set too low. For most HPC installations, the memlock limits should be set to "unlimited".

The failure occured here:

  Local host:    manage1
  OMPI source:   btl_openib_component.c:1115
  Function:      ompi_free_list_init_ex_new()
  Device:        mlx4_0
  Memlock limit: 32768

You may need to consult with your system administrator to get this problem fixed. This FAQ entry on the Open MPI web site may also be helpful:

  http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   manage1
  Local device: mlx4_0
--------------------------------------------------------------------------
[manage1:16214] *** An error occurred in MPI_Comm_rank
[manage1:16214] *** on communicator MPI_COMM_WORLD
[manage1:16214] *** MPI_ERR_COMM: invalid communicator
[manage1:16214] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 16212 on node manage1 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in the job did. This can cause a job to hang indefinitely while it waits for all processes to call "init". By rule, if one process calls "init", then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize". By rule, all processes that call "init" MUST call "finalize" prior to exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[manage1:16211] 3 more processes have sent help message help-mpi-btl-openib.txt / init-fail-no-mem
[manage1:16211] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[manage1:16211] 3 more processes have sent help message help-mpi-btl-openib.txt / error in device init
[manage1:16211] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal

The compute nodes cannot be entered at all; access is completely restricted. The cluster uses GridView for PBS job management, and the submission script I use is:

====================================
#PBS -N test
#PBS -l nodes=1:ppn=8
#PBS -j oe
#PBS -l walltime=24:00:00
cd $PBS_O_WORKDIR
NP=`cat $PBS_NODEFILE|wc -l`
source /public/software/mpi/openmpi1.5.4-intel.sh
mpirun -machinefile $PBS_NODEFILE -np $NP \
/home/sw/siesta/siesta-3.1/Obj/siesta < fe.fdf | tee output
=====================================

Thanks to any forum friends who can help.
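Since the compute nodes cannot be reached interactively, one way to see what the batch environment actually provides is a short diagnostic job like the sketch below. It reuses the environment script and binary path from the submission script above (the job name and shortened walltime are arbitrary choices for the test) and prints the effective locked-memory limit, which mpirun the job picks up, and which MPI libraries the siesta binary is linked against.

====================================
#PBS -N siesta-debug
#PBS -l nodes=1:ppn=8
#PBS -j oe
#PBS -l walltime=00:10:00
cd $PBS_O_WORKDIR
source /public/software/mpi/openmpi1.5.4-intel.sh
# Locked-memory limit seen by batch processes (compare with ulimit -l on the login node)
ulimit -l
# Which mpirun and which Open MPI version the job actually uses
which mpirun
mpirun --version
# Which MPI libraries the siesta binary is linked against
ldd /home/sw/siesta/siesta-3.1/Obj/siesta | grep -i mpi
=====================================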

銀蟲 (somewhat well-known)
I had a similar problem with VASP a couple of days ago and only just solved it.

The OpenFabrics (openib) BTL failed to initialize while trying to allocate some locked memory. This typically can indicate that the memlock limits are set too low. For most HPC installations, the memlock limits should be set to "unlimited".

The failure occured here:

  Local host:    node21
  OMPI source:   btl_openib_component.c:1055
  Function:      ompi_free_list_init_ex_new()
  Device:        mlx4_0
  Memlock limit: 65536

You may need to consult with your system administrator to get this problem fixed. This FAQ entry on the Open MPI web site may also be helpful:

  http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

Items 15, 16 and 17 on that FAQ page explain it fairly clearly. In my case ulimit -a showed a normal locked-memory limit on every node, yet the job still failed complaining that memory could not be allocated. The FAQ says this can happen when the locked-memory limit configured on the system is not actually applied at login, or when the job scheduler does not give the application a large enough limit. In the end I restarted the PBS daemon on every node and the problem went away.

Alternatively, you could add ulimit -l unlimited before the mpirun line and resubmit with qsub to see whether that helps.

Hope this information is useful to you.
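For reference, here is a sketch of the submission script from the first post with that suggestion applied. The ulimit line is the only change, and it can only take effect if the hard limit on the nodes permits it; otherwise the administrator has to raise the limit (for example in /etc/security/limits.conf) and restart the PBS daemons, as described above.

====================================
#PBS -N test
#PBS -l nodes=1:ppn=8
#PBS -j oe
#PBS -l walltime=24:00:00
cd $PBS_O_WORKDIR
NP=`cat $PBS_NODEFILE|wc -l`
source /public/software/mpi/openmpi1.5.4-intel.sh
# Raise the locked-memory limit for this job before launching MPI,
# as suggested in the reply above; only works if the hard limit permits it.
ulimit -l unlimited
mpirun -machinefile $PBS_NODEFILE -np $NP \
/home/sw/siesta/siesta-3.1/Obj/siesta < fe.fdf | tee output
=====================================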