|
View:
New views
7 Messages
—
Rating Filter:
Alert me
|
|
|
Temporary random kernel hangHello,
I have some trouble with a multithreaded java network server running on SLES10. At random times I see the kernel take 80% of the CPU leaving iddle to 0% for 30 seconds. After this period the system returns to normal operation state. Below is a vmstat -a 3 recording that shows the problem: 1 0 0 773068 529184 693048 0 0 0 0 272 201 0 0 100 0 0 0 0 0 773068 529184 693064 0 0 0 25 317 334 1 0 99 1 0 0 0 0 772944 529216 693248 0 0 0 24 477 1017 3 0 96 0 0 0 0 0 772820 529256 693316 0 0 0 0 525 1376 4 1 95 0 0 0 0 0 772448 529344 693636 0 0 0 107 1098 3306 11 2 86 0 0 0 0 0 772324 529404 693456 0 0 0 0 723 2247 7 2 91 0 0 0 0 0 772076 529496 693656 0 0 0 132 770 2488 7 2 91 1 0 0 0 0 772200 529528 693608 0 0 0 91 528 1168 4 1 94 1 0 0 0 0 772200 529532 693728 0 0 0 0 334 387 1 0 99 0 0 0 0 0 772076 529568 693680 0 0 0 24 564 1250 4 1 95 0 0 0 0 0 771828 529636 693784 0 0 0 0 787 2144 7 2 91 0 0 0 0 0 771580 529744 694232 0 0 0 111 995 3081 11 2 86 1 0 107 0 0 771316 529792 694904 0 0 0 153 829 1650 12 37 51 0 0 113 0 0 771316 529792 694912 0 0 0 0 323 169 15 85 0 0 0 116 0 0 771216 529792 694728 0 0 0 25 292 190 14 86 0 0 0 122 0 0 771340 529792 694728 0 0 0 21 311 191 15 85 0 0 0 138 0 0 771464 529792 694728 0 0 0 0 365 196 14 86 0 0 0 146 0 0 771464 529792 694728 0 0 0 0 331 189 16 84 0 0 0 150 0 0 771472 529792 694728 0 0 0 0 336 183 15 85 0 0 0 146 0 0 771472 529792 694728 0 0 0 4 310 201 14 86 0 0 0 145 0 0 771472 529792 694728 0 0 0 0 285 163 15 85 0 0 0 procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------ r b swpd free inact active si so bi bo in cs us sy id wa st 146 0 0 771472 529792 694728 0 0 0 0 277 159 14 86 0 0 0 145 0 0 771472 529792 694728 0 0 0 32 275 133 15 85 0 0 0 0 0 0 771208 529892 694176 0 0 0 0 1012 3408 12 4 84 0 0 0 0 0 770712 529972 694488 0 0 0 149 774 2869 8 2 90 0 0 0 0 0 770712 529972 694488 0 0 0 0 271 195 0 0 100 0 0 0 0 0 770728 529972 694488 0 0 0 35 269 167 0 0 100 1 0 0 0 0 770728 529972 694488 0 0 0 7 269 189 0 0 100 0 0 The application is memory stable ( no leaks ) and a deadlock is out of the question since in a deadlock case the system would freeze forever and not temporarily. There are around 200 - 250 tcp/ip clients connected to the application and 550 threads ( streaming blocking sockets are used so every client is managed by one reading thread and one writing thread) The same application works fine on SLES9.3 Hanging Evironment: ----------------------------------------------------------------------------- mustang:~ # uname -a Linux mustang 2.6.16.21-0.25-smp #1 SMP Tue Sep 19 07:26:15 UTC 2006 x86_64 x86_64 x86_64 GNU/Linux mustang:~ # java -version java version "1.6.0-rc" Java(TM) SE Runtime Environment (build 1.6.0-rc-b104) Java HotSpot(TM) Server VM (build 1.6.0-rc-b104, mixed mode) mustang:~ # cat /etc/SuSE-release SUSE Linux Enterprise Server 10 (x86_64) VERSION = 10 ----------------------------------------------------------------------------- Working environment: ----------------------------------------------------------------------------- apollo:~ # uname -a Linux apollo 2.6.5-7.252-smp #1 SMP Tue Feb 14 11:11:04 UTC 2006 x86_64 x86_64 x86_64 GNU/Linux apollo:~ # java -version java version "1.6.0-rc" Java(TM) SE Runtime Environment (build 1.6.0-rc-b95) Java HotSpot(TM) 64-Bit Server VM (build 1.6.0-rc-b95, mixed mode) apollo:~ # cat /etc/SuSE-release SUSE LINUX Enterprise Server 9 (x86_64) VERSION = 9 PATCHLEVEL = 3 ----------------------------------------------------------------------------- Can you give me some pointers about where to start debugging this issue? Regards, Horia |
|
|
Re: Temporary random kernel hang> On Fri, 8 Dec 2006 02:38:45 -0800 (PST) seven <horia.muntean@...> wrote:
> > Hello, > > I have some trouble with a multithreaded java network server running on > SLES10. At random times I see the kernel take 80% of the CPU leaving iddle > to 0% for 30 seconds. After this period the system returns to normal > operation state. > > Below is a vmstat -a 3 recording that shows the problem: > > 1 0 0 773068 529184 693048 0 0 0 0 272 201 0 0 100 > 0 0 > 0 0 0 773068 529184 693064 0 0 0 25 317 334 1 0 99 > 1 0 > 0 0 0 772944 529216 693248 0 0 0 24 477 1017 3 0 96 > 0 0 > 0 0 0 772820 529256 693316 0 0 0 0 525 1376 4 1 95 > 0 0 > 0 0 0 772448 529344 693636 0 0 0 107 1098 3306 11 2 86 > 0 0 > 0 0 0 772324 529404 693456 0 0 0 0 723 2247 7 2 91 > 0 0 > 0 0 0 772076 529496 693656 0 0 0 132 770 2488 7 2 91 > 1 0 > 0 0 0 772200 529528 693608 0 0 0 91 528 1168 4 1 94 > 1 0 > 0 0 0 772200 529532 693728 0 0 0 0 334 387 1 0 99 > 0 0 > 0 0 0 772076 529568 693680 0 0 0 24 564 1250 4 1 95 > 0 0 > 0 0 0 771828 529636 693784 0 0 0 0 787 2144 7 2 91 > 0 0 > 0 0 0 771580 529744 694232 0 0 0 111 995 3081 11 2 86 > 1 0 > 107 0 0 771316 529792 694904 0 0 0 153 829 1650 12 37 51 > 0 0 > 113 0 0 771316 529792 694912 0 0 0 0 323 169 15 85 0 > 0 0 > 116 0 0 771216 529792 694728 0 0 0 25 292 190 14 86 0 > 0 0 > 122 0 0 771340 529792 694728 0 0 0 21 311 191 15 85 0 > 0 0 > 138 0 0 771464 529792 694728 0 0 0 0 365 196 14 86 0 > 0 0 > 146 0 0 771464 529792 694728 0 0 0 0 331 189 16 84 0 > 0 0 > 150 0 0 771472 529792 694728 0 0 0 0 336 183 15 85 0 > 0 0 > 146 0 0 771472 529792 694728 0 0 0 4 310 201 14 86 0 > 0 0 > 145 0 0 771472 529792 694728 0 0 0 0 285 163 15 85 0 > 0 0 > procs -----------memory---------- ---swap-- -----io---- -system-- > -----cpu------ > r b swpd free inact active si so bi bo in cs us sy id > wa st > 146 0 0 771472 529792 694728 0 0 0 0 277 159 14 86 0 > 0 0 > 145 0 0 771472 529792 694728 0 0 0 32 275 133 15 85 0 > 0 0 > 0 0 0 771208 529892 694176 0 0 0 0 1012 3408 12 4 84 > 0 0 > 0 0 0 770712 529972 694488 0 0 0 149 774 2869 8 2 90 > 0 0 > 0 0 0 770712 529972 694488 0 0 0 0 271 195 0 0 100 > 0 0 > 0 0 0 770728 529972 694488 0 0 0 35 269 167 0 0 100 > 1 0 > 0 0 0 770728 529972 694488 0 0 0 7 269 189 0 0 100 > 0 0 > > The application is memory stable ( no leaks ) and a deadlock is out of the > question since in a deadlock case the system would freeze forever and not > temporarily. There are around 200 - 250 tcp/ip clients connected to the > application and 550 threads ( streaming blocking sockets are used so every > client is managed by one reading thread and one writing thread) > > The same application works fine on SLES9.3 > > Hanging Evironment: > ----------------------------------------------------------------------------- > mustang:~ # uname -a > Linux mustang 2.6.16.21-0.25-smp #1 SMP Tue Sep 19 07:26:15 UTC 2006 x86_64 > x86_64 x86_64 GNU/Linux > mustang:~ # java -version > java version "1.6.0-rc" > Java(TM) SE Runtime Environment (build 1.6.0-rc-b104) > Java HotSpot(TM) Server VM (build 1.6.0-rc-b104, mixed mode) > mustang:~ # cat /etc/SuSE-release > SUSE Linux Enterprise Server 10 (x86_64) > VERSION = 10 > ----------------------------------------------------------------------------- > > Working environment: > ----------------------------------------------------------------------------- > apollo:~ # uname -a > Linux apollo 2.6.5-7.252-smp #1 SMP Tue Feb 14 11:11:04 UTC 2006 x86_64 > x86_64 x86_64 GNU/Linux > apollo:~ # java -version > java version "1.6.0-rc" > Java(TM) SE Runtime Environment (build 1.6.0-rc-b95) > Java HotSpot(TM) 64-Bit Server VM (build 1.6.0-rc-b95, mixed mode) > apollo:~ # cat /etc/SuSE-release > SUSE LINUX Enterprise Server 9 (x86_64) > VERSION = 9 > PATCHLEVEL = 3 > ----------------------------------------------------------------------------- > > Can you give me some pointers about where to start debugging this issue? > A kernel profile will tell us were the kernel is burning CPU. Something like this (run as root): #!/bin/sh while true do opcontrol --stop opcontrol --shutdown rm -rf /var/lib/oprofile opcontrol --vmlinux=/boot/vmlinux-$(uname -r) opcontrol --start-daemon opcontrol --start date sleep 5 opcontrol --stop opcontrol --shutdown opreport -l /boot/vmlinux-$(uname -r) | head -50 done | tee /tmp/oprofile-output Then locate the output record which correlates with one of these episodes. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@... More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ |
|
|
Re: Temporary random kernel hang
Hello Andrew, First, thank you for the pointers. I did what you suggested ( installed oprofile and run the script ). In the same time, I added a timestamp column to vmstat recording. I can see dramatic changes in the output of opreport during the freeze period, but I do not know what it means. Here are my recordings: vmstat -a 3 | awk {print $0"-"system("date")} > vmstat.txt & 0 0 0 211888 980736 755460 0 0 0 600 1599 2679 30 8 62 0 0 Mon Dec 11 18:30:07 GMT 2006 4 0 0 209356 982216 756600 0 0 0 604 2201 6401 44 8 48 0 0 Mon Dec 11 18:30:10 GMT 2006 88 0 0 208616 982648 756912 0 0 0 0 2340 3690 45 10 44 0 0 Mon Dec 11 18:30:13 GMT 2006 5 0 0 206892 984024 757228 0 0 0 752 2780 5246 53 9 38 0 0 Mon Dec 11 18:31:00 GMT 2006 179 0 0 206428 984244 758152 0 0 0 287 834 787 17 81 2 0 0 Mon Dec 11 18:31:00 GMT 2006 245 0 0 206344 984244 758256 0 0 0 0 527 244 14 86 0 0 0 Mon Dec 11 18:31:00 GMT 2006 297 0 0 206476 984248 758048 0 0 0 32 535 238 15 85 0 0 0 Mon Dec 11 18:31:00 GMT 2006 procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------ Mon Dec 11 18:31:00 GMT 2006 r b swpd free inact active si so bi bo in cs us sy id wa st Mon Dec 11 18:31:00 GMT 2006 303 0 0 206476 984256 758060 0 0 0 161 375 210 15 85 0 0 0 Mon Dec 11 18:31:00 GMT 2006 309 0 0 206468 984260 758064 0 0 0 0 352 209 15 85 0 0 0 Mon Dec 11 18:31:00 GMT 2006 316 0 0 206468 984264 758068 0 0 0 21 345 220 14 86 0 0 0 Mon Dec 11 18:31:00 GMT 2006 315 0 0 206592 984268 758072 0 0 0 15 315 229 15 85 0 0 0 Mon Dec 11 18:31:00 GMT 2006 323 0 0 206592 984272 758072 0 0 0 21 330 230 14 86 0 0 0 Mon Dec 11 18:31:00 GMT 2006 332 0 0 206608 984272 758072 0 0 0 0 439 235 13 87 0 0 0 Mon Dec 11 18:31:00 GMT 2006 333 0 0 206608 984272 758072 0 0 0 12 374 199 14 86 0 0 0 Mon Dec 11 18:31:00 GMT 2006 337 0 0 206608 984292 758096 0 0 0 28 333 233 15 85 0 0 0 Mon Dec 11 18:31:00 GMT 2006 341 0 0 206484 984292 758096 0 0 0 0 328 201 14 86 0 0 0 Mon Dec 11 18:31:00 GMT 2006 6 0 0 205344 984368 758164 0 0 0 21 597 887 21 78 1 0 0 Mon Dec 11 18:31:03 GMT 2006 0 0 0 204420 984768 757704 0 0 0 0 3911 4646 76 14 10 0 0 Mon Dec 11 18:31:06 GMT 2006 5 0 0 293960 917620 736532 0 0 0 707 3191 5204 62 9 28 0 0 Mon Dec 11 18:31:09 GMT 2006 28 0 0 294556 917764 735684 0 0 0 831 3128 5132 62 14 23 0 0 Mon Dec 11 18:31:12 GMT 2006 9 0 0 291836 919304 736960 0 0 0 0 2935 5228 57 9 34 0 0 Mon Dec 11 18:31:15 GMT 2006 ------------------------------------------------------------------------------------- The script output corresponding to the freeze period is below. Notice that the third sample has completely different look compared to sample 1, 2, 4, 5 witch correspond to normal operation situation. Mon Dec 11 18:29:56 GMT 2006 Stopping profiling. Stopping profiling. Killing daemon. CPU: P4 / Xeon, speed 3000.18 MHz (estimated) Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000 samples % symbol name 700 3.9093 ia32_sysenter_target 608 3.3955 find_vma 603 3.3676 copy_user_generic 570 3.1833 sysenter_do_call 486 2.7142 schedule 485 2.7086 tcp_sendmsg 445 2.4852 __find_get_block 429 2.3958 do_futex 365 2.0384 page_fault 355 1.9826 main_timer_handler 293 1.6363 do_gettimeofday 284 1.5861 try_to_wake_up 277 1.5470 __switch_to 258 1.4409 fget 255 1.4241 clear_page 220 1.2286 ia32_syscall 213 1.1895 ll_rw_block 197 1.1002 find_lock_page 194 1.0834 gs_change 187 1.0443 __up_read 177 0.9885 lock_sock 177 0.9885 tcp_ack 171 0.9550 __down_read 164 0.9159 IRQ0xa9_interrupt 156 0.8712 fget_light 152 0.8489 fput 146 0.8154 kfree 144 0.8042 math_state_restore 144 0.8042 tcp_v4_rcv 128 0.7148 __brelse 128 0.7148 recalc_task_prio 123 0.6869 __handle_mm_fault 123 0.6869 __kmalloc 122 0.6813 copy_page 122 0.6813 kmem_cache_alloc 120 0.6702 __tcp_push_pending_frames 117 0.6534 unlock_buffer 116 0.6478 rw_verify_area 115 0.6422 thread_return 114 0.6367 unmap_vmas 112 0.6255 kmem_cache_free 112 0.6255 radix_tree_lookup 110 0.6143 device_not_available 105 0.5864 ip_output 104 0.5808 pfn_to_page 104 0.5808 release_sock 90 0.5026 release_pages Using default event: GLOBAL_POWER_EVENTS:100000:1:1:1 Daemon started. Profiler running. Mon Dec 11 18:30:02 GMT 2006 Stopping profiling. Stopping profiling. Killing daemon. CPU: P4 / Xeon, speed 3000.18 MHz (estimated) Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000 samples % symbol name 741 4.3771 ia32_sysenter_target 682 4.0286 sysenter_do_call 622 3.6742 find_vma 523 3.0894 copy_user_generic 515 3.0421 schedule 495 2.9240 do_futex 419 2.4750 __find_get_block 373 2.2033 do_gettimeofday 372 2.1974 tcp_sendmsg 325 1.9198 main_timer_handler 312 1.8430 try_to_wake_up 298 1.7603 __switch_to 260 1.5358 fget 239 1.4118 gs_change 227 1.3409 ia32_syscall 222 1.3114 __down_read 219 1.2936 page_fault 193 1.1401 find_lock_page 189 1.1164 __up_read 180 1.0633 fget_light 178 1.0515 clear_page 178 1.0515 tcp_ack 175 1.0337 ll_rw_block 167 0.9865 math_state_restore 156 0.9215 thread_return 154 0.9097 lock_sock 151 0.8920 fput 138 0.8152 recalc_task_prio 135 0.7974 tcp_v4_rcv 129 0.7620 kfree 128 0.7561 IRQ0xa9_interrupt 124 0.7325 release_sock 123 0.7266 device_not_available 120 0.7088 __tcp_push_pending_frames 114 0.6734 __kmalloc 111 0.6557 unlock_buffer 110 0.6498 inet_sendmsg 108 0.6380 __brelse 105 0.6202 do_gettimeoffset_tsc 100 0.5907 rw_verify_area 95 0.5612 tcp_init_tso_segs 94 0.5553 kmem_cache_free 94 0.5553 radix_tree_lookup 92 0.5434 ip_output 89 0.5257 file_update_time 86 0.5080 dnotify_parent 86 0.5080 kmem_cache_alloc Using default event: GLOBAL_POWER_EVENTS:100000:1:1:1 Daemon started. Profiler running. Mon Dec 11 18:30:09 GMT 2006 Stopping profiling. Stopping profiling. Killing daemon. CPU: P4 / Xeon, speed 3000.18 MHz (estimated) Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000 samples % symbol name 196048 9.3307 page_fault 184945 8.8023 force_sig_info 153475 7.3045 do_signal 120071 5.7146 get_signal_to_deliver 116196 5.5302 ia32_setup_sigcontext 89368 4.2534 ia32_syscall 82987 3.9497 __down_read_trylock 82396 3.9215 save_i387_ia32 79436 3.7807 copy_user_generic 77587 3.6927 __sigqueue_alloc 74439 3.5428 sys32_rt_sigreturn 69484 3.3070 recalc_sigpending_tsk 65733 3.1285 do_page_fault 49694 2.3651 retint_restore_args 45756 2.1777 ia32_setup_rt_frame 45717 2.1759 try_to_wake_up 45658 2.1730 restore_i387_ia32 31626 1.5052 __up_read 30286 1.4414 __dequeue_signal 29849 1.4206 memcpy 27139 1.2917 sys32_sigaltstack 25838 1.2297 ia32_restore_sigcontext 25126 1.1958 kmem_cache_free 24757 1.1783 do_sigaltstack 22884 1.0891 __sigqueue_free 22818 1.0860 error_sti 22007 1.0474 signal_wake_up 19816 0.9431 kmem_cache_alloc 17810 0.8476 retint_signal 17208 0.8190 is_prefetch 16315 0.7765 _atomic_dec_and_lock 14320 0.6815 specific_send_sig_info 13881 0.6607 send_signal 13704 0.6522 find_vma 12392 0.5898 free_uid 9283 0.4418 get_sigframe 9184 0.4371 copy_siginfo_to_user32 8525 0.4057 pfn_to_page 7980 0.3798 unhandled_signal 7465 0.3553 error_entry 7309 0.3479 convert_rip_to_linear 6926 0.3296 notifier_call_chain 6367 0.3030 kprobe_exceptions_notify 5948 0.2831 copy_to_user 5408 0.2574 error_exit 4407 0.2097 int_ret_from_sys_call 4128 0.1965 dequeue_signal Using default event: GLOBAL_POWER_EVENTS:100000:1:1:1 Daemon started. Profiler running. Mon Dec 11 18:31:02 GMT 2006 Stopping profiling. Stopping profiling. Killing daemon. CPU: P4 / Xeon, speed 3000.18 MHz (estimated) Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000 samples % symbol name 1113 4.4269 ia32_sysenter_target 911 3.6234 find_vma 887 3.5280 copy_user_generic 850 3.3808 sysenter_do_call 715 2.8438 do_futex 712 2.8319 tcp_sendmsg 679 2.7007 __find_get_block 657 2.6132 schedule 508 2.0205 do_gettimeofday 399 1.5870 try_to_wake_up 375 1.4915 fget 366 1.4557 main_timer_handler 341 1.3563 __switch_to 314 1.2489 IRQ0xa9_interrupt 307 1.2211 ia32_syscall 297 1.1813 clear_page 292 1.1614 ll_rw_block 285 1.1336 unlock_buffer 278 1.1057 gs_change 277 1.1017 fget_light 276 1.0978 find_lock_page 274 1.0898 page_fault 268 1.0659 tcp_ack 247 0.9824 lock_sock 242 0.9625 __down_read 233 0.9267 kfree 223 0.8870 __tcp_push_pending_frames 213 0.8472 tcp_v4_rcv 210 0.8353 math_state_restore 203 0.8074 __up_read 202 0.8034 fput 196 0.7796 recalc_task_prio 194 0.7716 release_sock 187 0.7438 kmem_cache_free 182 0.7239 __brelse 180 0.7159 thread_return 173 0.6881 test_clear_page_dirty 170 0.6762 inet_sendmsg 161 0.6404 ip_output 159 0.6324 device_not_available 156 0.6205 pfn_to_page 151 0.6006 __kmalloc 149 0.5926 tcp_transmit_skb 145 0.5767 file_update_time 143 0.5688 dnotify_parent 142 0.5648 kmem_cache_alloc 141 0.5608 do_gettimeoffset_tsc Using default event: GLOBAL_POWER_EVENTS:100000:1:1:1 Daemon started. Profiler running. Mon Dec 11 18:31:08 GMT 2006 Stopping profiling. Stopping profiling. Killing daemon. CPU: P4 / Xeon, speed 3000.18 MHz (estimated) Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000 samples % symbol name 1054 4.3503 ia32_sysenter_target 880 3.6322 copy_user_generic 859 3.5455 sysenter_do_call 842 3.4753 find_vma 694 2.8645 tcp_sendmsg 675 2.7860 __find_get_block 656 2.7076 do_futex 645 2.6622 schedule 487 2.0101 do_gettimeofday 409 1.6881 try_to_wake_up 384 1.5849 fget 382 1.5767 main_timer_handler 360 1.4859 __switch_to 323 1.3332 tcp_ack 304 1.2547 clear_page 304 1.2547 ll_rw_block 302 1.2465 IRQ0xa9_interrupt 293 1.2093 fget_light 289 1.1928 page_fault 279 1.1516 find_lock_page 273 1.1268 ia32_syscall 271 1.1185 gs_change 256 1.0566 lock_sock 233 0.9617 kfree 229 0.9452 __up_read 227 0.9369 __down_read 223 0.9204 tcp_v4_rcv 207 0.8544 fput 194 0.8007 unlock_buffer 190 0.7842 thread_return 189 0.7801 math_state_restore 186 0.7677 release_sock 178 0.7347 __tcp_push_pending_frames 175 0.7223 __kmalloc 168 0.6934 inet_sendmsg 164 0.6769 dnotify_parent 164 0.6769 recalc_task_prio 158 0.6521 kmem_cache_free 154 0.6356 tcp_transmit_skb 153 0.6315 device_not_available 144 0.5944 tcp_init_tso_segs 140 0.5778 __brelse 139 0.5737 pfn_to_page 132 0.5448 kmem_cache_alloc 131 0.5407 file_update_time 131 0.5407 ip_output 128 0.5283 tcp_current_mss --------------------------------------------------------------------------------------------------- I did notice something that can be related to those freeze periods: I started to monitor the temperatures of the CPUs and noticed that the freeze (I could reproduce only one freeze today) happened in the same interval as a temperature spike ( around 50 degrees C ). My system is a 1U, 2-way Intel Xeon 3.0 GHz blade ( IBM x336 ) and I think it has some kind of temperature control that kicks in when the CPU temp reaches certain levels. AFAIK the temperature control system will change the CPU speed in order to cool it down. Anyway, I will try to heat up the other system tomorrow (the stable one) to see if I can bring it to the same temp. level and see how it responds. In the same time I will try to heat up again the unstable system to see if there is really relation between the CPU temp. and system freezes (only one match could be just a coincidence). Is is possible that this kernel behavior could be related to temp. control events? Regards and thank you again, Horia |
|
|
Re: Temporary random kernel hangOn Mon, 11 Dec 2006 10:46:25 -0800 (PST)
seven <horia.muntean@...> wrote: > > > > > A kernel profile will tell us were the kernel is burning CPU. Something > > like this (run as root): > > > > #!/bin/sh > > while true > > do > > opcontrol --stop > > opcontrol --shutdown > > rm -rf /var/lib/oprofile > > opcontrol --vmlinux=/boot/vmlinux-$(uname -r) > > opcontrol --start-daemon > > opcontrol --start > > date > > sleep 5 > > opcontrol --stop > > opcontrol --shutdown > > opreport -l /boot/vmlinux-$(uname -r) | head -50 > > done | tee /tmp/oprofile-output > > > > Then locate the output record which correlates with one of these episodes. > > - > > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > > the body of a message to majordomo@... > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > Please read the FAQ at http://www.tux.org/lkml/ > > > > > > > Hello Andrew, No, I wasn't ignoring you for four days. Please, always do reply-to-all. > First, thank you for the pointers. I did what you suggested ( installed > oprofile and run the script ). In the same time, I added a timestamp column > to vmstat recording. > > I can see dramatic changes in the output of opreport during the freeze > period, but I do not know what it means. > > .. > > CPU: P4 / Xeon, speed 3000.18 MHz (estimated) > Counted GLOBAL_POWER_EVENTS events (time during which processor is not > stopped) with a unit mask of 0x01 (mandatory) count 100000 > samples % symbol name > 741 4.3771 ia32_sysenter_target > 682 4.0286 sysenter_do_call > 622 3.6742 find_vma > 523 3.0894 copy_user_generic > 515 3.0421 schedule > 495 2.9240 do_futex > 419 2.4750 __find_get_block > 373 2.2033 do_gettimeofday > 372 2.1974 tcp_sendmsg > 325 1.9198 main_timer_handler > 312 1.8430 try_to_wake_up > 298 1.7603 __switch_to > 260 1.5358 fget > 239 1.4118 gs_change > 227 1.3409 ia32_syscall > 222 1.3114 __down_read > 219 1.2936 page_fault > 193 1.1401 find_lock_page > 189 1.1164 __up_read > 180 1.0633 fget_light > 178 1.0515 clear_page > 178 1.0515 tcp_ack > 175 1.0337 ll_rw_block > 167 0.9865 math_state_restore > 156 0.9215 thread_return > 154 0.9097 lock_sock > 151 0.8920 fput > 138 0.8152 recalc_task_prio > 135 0.7974 tcp_v4_rcv > 129 0.7620 kfree > 128 0.7561 IRQ0xa9_interrupt > 124 0.7325 release_sock > 123 0.7266 device_not_available > 120 0.7088 __tcp_push_pending_frames > 114 0.6734 __kmalloc > 111 0.6557 unlock_buffer > 110 0.6498 inet_sendmsg > 108 0.6380 __brelse > 105 0.6202 do_gettimeoffset_tsc > 100 0.5907 rw_verify_area > 95 0.5612 tcp_init_tso_segs > 94 0.5553 kmem_cache_free > 94 0.5553 radix_tree_lookup > 92 0.5434 ip_output > 89 0.5257 file_update_time > 86 0.5080 dnotify_parent > 86 0.5080 kmem_cache_alloc versus > Using default event: GLOBAL_POWER_EVENTS:100000:1:1:1 > Daemon started. > Profiler running. > Mon Dec 11 18:30:09 GMT 2006 > Stopping profiling. > Stopping profiling. > Killing daemon. > CPU: P4 / Xeon, speed 3000.18 MHz (estimated) > Counted GLOBAL_POWER_EVENTS events (time during which processor is not > stopped) with a unit mask of 0x01 (mandatory) count 100000 > samples % symbol name > 196048 9.3307 page_fault > 184945 8.8023 force_sig_info > 153475 7.3045 do_signal > 120071 5.7146 get_signal_to_deliver > 116196 5.5302 ia32_setup_sigcontext > 89368 4.2534 ia32_syscall > 82987 3.9497 __down_read_trylock > 82396 3.9215 save_i387_ia32 > 79436 3.7807 copy_user_generic > 77587 3.6927 __sigqueue_alloc > 74439 3.5428 sys32_rt_sigreturn > 69484 3.3070 recalc_sigpending_tsk > 65733 3.1285 do_page_fault > 49694 2.3651 retint_restore_args > 45756 2.1777 ia32_setup_rt_frame > 45717 2.1759 try_to_wake_up > 45658 2.1730 restore_i387_ia32 > 31626 1.5052 __up_read > 30286 1.4414 __dequeue_signal > 29849 1.4206 memcpy > 27139 1.2917 sys32_sigaltstack > 25838 1.2297 ia32_restore_sigcontext > 25126 1.1958 kmem_cache_free > 24757 1.1783 do_sigaltstack > 22884 1.0891 __sigqueue_free > 22818 1.0860 error_sti > 22007 1.0474 signal_wake_up > 19816 0.9431 kmem_cache_alloc > 17810 0.8476 retint_signal > 17208 0.8190 is_prefetch > 16315 0.7765 _atomic_dec_and_lock It's certainly different. It is hard to work out what might be going on in there, but there's no obvious sign of misbehaviour. It could be that the application has simply gone berzerk and is doing a larger number or expensive system calls. The next step would be to try to catch one of these eposides with strace. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@... More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ |
|
|
Re: Temporary random kernel hangI am using nabble's interface to post messages ( www.nabble.com ) and I am not subscribed to the mailing list, so it's possible that this is the cause of the direct mail. I will try. Thank you. Regards, Horia |
|
|
Re: Temporary random kernel hangOn Thu, 2006-12-14 at 15:07 -0800, seven wrote:
> > Andrew Morton wrote: > > > > No, I wasn't ignoring you for four days. Please, always do reply-to-all. > > > > I am using nabble's interface to post messages ( www.nabble.com ) and I am > not subscribed to the mailing list, so it's possible that this is the cause > of the direct mail. However, all these replies are send to your gmail account, so you _could_ just 'reply to all' there, right? Please do, finding lone responses on lkml is hard. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@... More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ |
|
|
Re: Temporary random kernel hangWell, nabble sends an alert mail to my gmail account whenever there is a reply to the thread I monitor accompanied by a second message that is 'coming' from the author of the reply and a CC to linux-kernel@vger.kernel.org. So I use gmail only for reading the replays. In the Reply form used by nabble one can find the following so I assumed that nabble will do proper posting to the mailing list. It appears that it does not. I will use gmail with 'Reply to all' from now on in order to post messages. Sorry for any inconveniences. Regards, Horia |
| Free Forum Powered by Nabble | Forum Help |