
How to Locate a Kernel Module Crash

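Published: 2017-06-15 15:15:54  Source: linux网站  Author: 蚁公仔
Kernel module development often brings the whole system down. Once the machine has hung there is nothing left to inspect, so the problem cannot be located or analyzed on the spot.
A common remedy is to install kdump-tools, which saves the kernel log from just before the crash so that it can be analyzed after the next boot.
This article does not cover installing or configuring kdump-tools. It covers how to analyze the crash log and find the faulty line in the code.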
 
kdump-tools normally stores the crash log as /var/crash/<crash time>/dmesg.<time>, for example /var/crash/201706131703/dmesg.201706131703. Opening that file shows the following:
[1493201.293587] buflen=2097152,gwid=223344,addr=33554671
[1493258.160173] fq=300 full,will be change fq
[1493258.160179] max_gw_buf_len0=81984,max_gw_buf_len1=0
[1493258.160199] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
[1493258.160204] IP: [<ffffffffc02ef10a>] search_fq_to_insert+0x1d2/0x239 [HNRcore]
[1493258.160216] PGD 0 
[1493258.160219] Oops: 0000 [#1] SMP 
[1493258.160222] Modules linked in: binfmt_misc fou(OE) HNRcore(OE) iptable_filter xt_nat iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables x_tables ipip tunnel4 ip_tunnel ip6_udp_tunnel udp_tunnel bonding joydev input_leds intel_powerclamp coretemp kvm ipmi_ssif ipmi_devintf irqbypass gpio_ich crct10dif_pclmul crc32_pclmul 8250_fintek dcdbas shpchp aesni_intel serio_raw aes_x86_64 lrw gf128mul glue_helper lpc_ich ablk_helper i7core_edac cryptd edac_core ipmi_si ipmi_msghandler acpi_power_meter mac_hid parport_pc ppdev lp parport autofs4 hid_generic psmouse usbhid hid pata_acpi megaraid_sas bnx2 fjes [last unloaded: fou]
[1493258.160271] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G          IOE   4.4.0-21-generic #37-Ubuntu
[1493258.160274] Hardware name: Dell Inc. PowerEdge R710/0XDX06, BIOS 2.2.10 11/09/2010
[1493258.160276] task: ffffffff81e11500 ti: ffffffff81e00000 task.ti: ffffffff81e00000
[1493258.160278] RIP: 0010:[<ffffffffc02ef10a>]  [<ffffffffc02ef10a>] search_fq_to_insert+0x1d2/0x239 [HNRcore]
[1493258.160286] RSP: 0018:ffff88032f603ca8  EFLAGS: 00010097
[1493258.160288] RAX: 0000000000000000 RBX: ffff88032f60dd00 RCX: ffff8801a6101000
[1493258.160290] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff88032f603ce0
[1493258.160292] RBP: ffff88032f603ce8 R08: 000000000000000a R09: ffff88032f603cd0
[1493258.160294] R10: 0000000000000020 R11: 0000000000000bfb R12: ffffffffc02f37c0
[1493258.160296] R13: 0000000000000100 R14: ffffffffc02e9bdb R15: 0000000000000000
[1493258.160299] FS:  0000000000000000(0000) GS:ffff88032f600000(0000) knlGS:0000000000000000
[1493258.160301] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[1493258.160303] CR2: 0000000000000028 CR3: 0000000001e0a000 CR4: 00000000000006f0
[1493258.160304] Stack:
[1493258.160306]  ffff88032f603d18 ffff8801a6101000 0000000100000000 0000000000000001
[1493258.160310]  0000000000000000 0000000000000000 0000000000000000 fa723263f77ef18e
[1493258.160313]  ffff88032f603d48 ffffffffc02ef5bb ffffffffc02f3b30 0000000200000001
[1493258.160316] Call Trace:
[1493258.160318]  <IRQ> 
[1493258.160325]  [<ffffffffc02ef5bb>] frequency_buf_full_process+0x35f/0x567 [HNRcore]
[1493258.160331]  [<ffffffffc02ef82d>] frequency_change_main+0x39/0x50 [HNRcore]
[1493258.160337]  [<ffffffffc02ee19f>] alloc_data_per_timer+0x6b0/0xbd9 [HNRcore]
[1493258.160343]  [<ffffffffc02eed69>] fq_alloc_timer+0xa9/0x278 [HNRcore]
[1493258.160348]  [<ffffffffc02e9217>] ? gw_manage_via_hfc+0x49/0x49 [HNRcore]
[1493258.160356]  [<ffffffff810fe4f0>] ? tick_sched_handle.isra.14+0x60/0x60
[1493258.160361]  [<ffffffffc02e9c08>] gw_send_ts_timer+0x2d/0x5e [HNRcore]
[1493258.160368]  [<ffffffff810ec345>] call_timer_fn+0x35/0x120
[1493258.160373]  [<ffffffffc02e9bdb>] ? gw_send_ts_process+0x9c4/0x9c4 [HNRcore]
[1493258.160377]  [<ffffffff810eccfa>] run_timer_softirq+0x23a/0x2f0
[1493258.160383]  [<ffffffff810859a1>] __do_softirq+0x101/0x290
[1493258.160387]  [<ffffffff81085ca3>] irq_exit+0xa3/0xb0
[1493258.160393]  [<ffffffff81826fa2>] smp_apic_timer_interrupt+0x42/0x50
[1493258.160398]  [<ffffffff81825262>] apic_timer_interrupt+0x82/0x90
[1493258.160399]  <EOI> 
[1493258.160405]  [<ffffffff816bb9ee>] ? cpuidle_enter_state+0x10e/0x2b0
[1493258.160408]  [<ffffffff816bb9df>] ? cpuidle_enter_state+0xff/0x2b0
[1493258.160412]  [<ffffffff816bbbc7>] cpuidle_enter+0x17/0x20
[1493258.160418]  [<ffffffff810c3d52>] call_cpuidle+0x32/0x60
[1493258.160421]  [<ffffffff816bbba3>] ? cpuidle_select+0x13/0x20
[1493258.160424]  [<ffffffff810c4010>] cpu_startup_entry+0x290/0x350
[1493258.160430]  [<ffffffff81817f2c>] rest_init+0x7c/0x80
[1493258.160438]  [<ffffffff81f5a011>] start_kernel+0x481/0x4a2
[1493258.160442]  [<ffffffff81f59120>] ? early_idt_handler_array+0x120/0x120
[1493258.160445]  [<ffffffff81f59339>] x86_64_start_reservations+0x2a/0x2c
[1493258.160448]  [<ffffffff81f59485>] x86_64_start_kernel+0x14a/0x16d
[1493258.160450] Code: 05 24 3b 2f c0 8b 00 3b 45 d0 0f 8f b1 fe ff ff c7 45 d0 00 00 00 00 eb 31 8b 45 d0 48 98 48 8b 44 c5 e8 48 89 45 e0 48 8b 45 e0 <8b> 40 28 3d f9 00 00 00 7f 10 48 8b 45 e0 8b 40 24 3d 95 00 00 
[1493258.160483] RIP  [<ffffffffc02ef10a>] search_fq_to_insert+0x1d2/0x239 [HNRcore]
[1493258.160489]  RSP <ffff88032f603ca8>
[1493258.160491] CR2: 0000000000000028
 
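How do we analyze this crash log? First, the message "unable to handle kernel NULL pointer dereference" shows that a NULL pointer was dereferenced. Next, the RIP line reads search_fq_to_insert+0x1d2/0x239 [HNRcore]: the CPU was executing 0x1d2 bytes into the function search_fq_to_insert, which is 0x239 bytes long and belongs to the module HNRcore. To map that position back to a source line, load the module file into gdb (this only yields a file and line if the module was built with debug information):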
root@cjtx-PowerEdge-R710:/var/crash/201706131703# gdb /usr/local/HNR_target/bin/HNRcore.ko 
GNU gdb (Ubuntu 7.11-0ubuntu1) 7.11
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/local/HNR_target/bin/HNRcore.ko...done.
(gdb) b *search_fq_to_insert+0x1d2
Breakpoint 1 at 0x9133: file /home/work/HNR/core/frequency_info.c, line 762.
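
The symbol+offset/size [module] notation that appears in the RIP line and in every Call Trace entry can also be pulled apart mechanically, which helps when many crash logs need to be checked. The following is only a minimal sketch: the names oops_loc and parse_oops_loc are made up for this illustration and are not part of kdump-tools, gdb or the kernel, and the 63-character name limit is an arbitrary assumption.

```c
#include <stdio.h>
#include <string.h>

/* One decoded "symbol+0xOFF/0xSIZE [module]" location from an oops line.
 * Hypothetical helper type, not taken from any existing tool. */
struct oops_loc {
    char symbol[64];
    unsigned long offset;  /* bytes from the start of the function */
    unsigned long size;    /* total size of the function in bytes */
    char module[64];       /* empty string for code built into the kernel */
};

/* Parse the first "symbol+0x.../0x..." token found in line.
 * Returns 0 on success and -1 if the line holds no such token. */
int parse_oops_loc(const char *line, struct oops_loc *out)
{
    const char *plus = strstr(line, "+0x");
    const char *start;
    const char *mod;
    size_t len;

    if (plus == NULL)
        return -1;

    /* Walk back from '+' to the start of the symbol name. */
    start = plus;
    while (start > line && start[-1] != ' ' && start[-1] != ']')
        start--;
    len = (size_t)(plus - start);
    if (len == 0 || len >= sizeof(out->symbol))
        return -1;
    memcpy(out->symbol, start, len);
    out->symbol[len] = '\0';

    /* Both numbers are hexadecimal: offset into the function / function size. */
    if (sscanf(plus, "+0x%lx/0x%lx", &out->offset, &out->size) != 2)
        return -1;

    /* The module name, if present, follows in square brackets. */
    out->module[0] = '\0';
    mod = strchr(plus, '[');
    if (mod != NULL)
        sscanf(mod, "[%63[^]]]", out->module);
    return 0;
}
```

Passing the RIP line from the log above to parse_oops_loc fills in the symbol search_fq_to_insert, the offset 0x1d2, the size 0x239 and the module HNRcore, which are exactly the pieces used in the gdb breakpoint expression.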
 
The gdb breakpoint output places the fault at line 762 of frequency_info.c. The code there is:
for (i=0;i<2;i++)
{
    fq = new_fq_tmp[i];
    if (fq->fq_full_count >= 250)    // buffer full for more than 2 seconds: nothing can be inserted on this frequency
        continue;
    if (fq->gw_num >= MAX_GW_PER_FQ)
        continue;
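The actual faulting statement is one line past the reported one, that is, line 763. There the pointer fq is used directly without first being checked for NULL, and that is the cause of the crash. The register dump agrees: RAX is 0 and CR2 is 0x28, and the instruction marked in the Code: line (8b 40 28) loads from offset 0x28 of RAX, so a field of a NULL fq was being read.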
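
The two points above, why the faulting address is 0x28 rather than 0 and how a NULL check avoids the crash, can be sketched in ordinary user-space C. This is an illustration under stated assumptions, not the module's real code: struct fq_info, its padding, the value of MAX_GW_PER_FQ and the function count_insertable are all hypothetical. Only the field names come from the article, and the offsets 0x24 and 0x28 are inferred from the Code: bytes in the log.

```c
#include <stddef.h>

#define MAX_GW_PER_FQ 150   /* assumed value, the article does not give it */

/* Hypothetical layout: the real structure is not shown in the article.
 * The padding only serves to place the two known fields at the offsets
 * seen in the crash (0x24 and 0x28). */
struct fq_info {
    char pad[0x24];      /* stand-in for the fields not shown */
    int gw_num;          /* offset 0x24 */
    int fq_full_count;   /* offset 0x28: reading it through a NULL
                            pointer touches address 0x28, the CR2 value */
};

/* Count the candidates that could still accept an insert.
 * Same two conditions as the loop in the article, plus the NULL check. */
int count_insertable(struct fq_info *const cand[], int n)
{
    int i, ok = 0;

    for (i = 0; i < n; i++) {
        const struct fq_info *fq = cand[i];

        if (fq == NULL)                     /* the check the original loop lacks */
            continue;
        if (fq->fq_full_count >= 250)       /* buffer has been full too long */
            continue;
        if (fq->gw_num >= MAX_GW_PER_FQ)    /* frequency already has enough gateways */
            continue;
        ok++;
    }
    return ok;
}
```

With this layout offsetof(struct fq_info, fq_full_count) evaluates to 0x28, matching the address in the log. Whether skipping a NULL entry is the right reaction in the real module, or whether a NULL entry in new_fq_tmp points to an earlier bug, has to be decided from the surrounding code.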
 
Permanent link to this article: http://www.linuxdiyf.com/linux/31503.html