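Bugs always catch you off guard where you least expect them, only what they bring is not a pleasant surprise but a Blue Screen Of Death!
Since that is how it is, all one can do is deal with each one as it comes.
First, an outline of the program's overall flow: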
NTSTATUS
XXXProcessDirents(…)
{
    do {
        KeEnterCriticalRegion();
        ExAcquireResourceSharedLite(&fcb->Resource, TRUE);
        /* access several members of fcb structure */
        ExReleaseResourceLite(&fcb->Resource);
        KeLeaveCriticalRegion();
        XXXXProcessDirent(…);
    } while (list_is_not_empty(….));
    return status;
}
NTSTATUS
XXXXProcessDirent(…)
{
    HANDLE handle = NULL;
    XXXX_FILE_HEADE fileHead;
    ……
    /* open file */
    status = ZwCreateFile(&handle, GENERIC_READ, &oa, &iosb, NULL, 0,
                          FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                          FILE_OPEN, 0, NULL, 0);
    /* read file header */
    status = ZwReadFile(handle, ioevent, NULL, NULL, &iosb, (PVOID)&fileHead,
                        sizeof(XXXX_FILE_HEADE), &offset, NULL);
    /* check whether file is interesting to us */
    if (status == STATUS_SUCCESS && iosb.Information == sizeof(……)) {
        /* it’s my taste, haha */
    }
    /* close file, not interested in it any more */
    if (handle) {
        ZwClose(handle);
    }
    return status;
}
The flow is simple: XXXProcessDirents() calls XXXXProcessDirent() in a loop until every entry in the list has been checked.
Now for the windbg analysis:
1: kd> !analyze -v
*******************************************************************************
* *
* Bugcheck Analysis *
* *
*******************************************************************************
IRQL_NOT_LESS_OR_EQUAL (a)
An attempt was made to access a pageable (or completely invalid) address at an
interrupt request level (IRQL) that is too high. This is usually
caused by drivers using improper addresses.
If a kernel debugger is available get the stack backtrace.
Arguments:
Arg1: 0abc9867, memory referenced
Arg2: 00000002, IRQL
Arg3: 00000001, bitfield :
bit 0 : value 0 = read operation, 1 = write operation
bit 3 : value 0 = not an execute operation, 1 = execute operation (only on chips which support this level of status)
Arg4: 806e7a2a, address which referenced memory
Debugging Details:
------------------
WRITE_ADDRESS: 0abc9867
CURRENT_IRQL: 2
FAULTING_IP:
hal!KeAcquireInStackQueuedSpinLock+3a
806e7a2a 8902 mov dword ptr [edx],eax
DEFAULT_BUCKET_ID: DRIVER_FAULT
BUGCHECK_STR: 0xA
PROCESS_NAME: System
TRAP_FRAME: b9019bbc -- (.trap 0xffffffffb9019bbc)
ErrCode = 00000002
eax=b9019c40 ebx=00000000 ecx=c0000211 edx=0abc9867 esi=c0000128 edi=8842d268
eip=806e7a2a esp=b9019c30 ebp=b9019c68 iopl=0 nv up ei ng nz na pe nc
cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00010286
hal!KeAcquireInStackQueuedSpinLock+0x3a:
806e7a2a 8902 mov dword ptr [edx],eax ds:0023:0abc9867=????????
Resetting default scope
LAST_CONTROL_TRANSFER: from 806e7a2a to 80544768
STACK_TEXT:
b9019bbc 806e7a2a badb0d00 0abc9867 804f4e77 nt!KiTrap0E+0x238
b9019c68 806e7ef2 00000000 00000000 b9019c80 hal!KeAcquireInStackQueuedSpinLock+0x3a
b9019c68 b9019d24 00000000 00000000 b9019c80 hal!HalpApcInterrupt+0xc6
WARNING: Frame IP not in any known module. Following frames may be wrong.
b9019cf0 80535873 00000000 8896fb20 00000000 0xb9019d24
b9019d10 b79d87ff ba668a30 8859b7e8 00000440 nt!ExReleaseResourceLite+0x8d
b9019d2c b79d8a5c 8a3ff2f0 00000003 ba6685f0 XXXXX!XXXProcessDirents+0xef
b9019d88 b79e163a e2f6b170 00000001 00000001 XXXXX!XXXKernelQueryDirectory+0x20c
b9019ddc 8054616e b79e1530 88a8ae00 00000000 nt!PspSystemThreadStartup+0x34
00000000 00000000 00000000 00000000 00000000 nt!KiThreadStartup+0x16
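The fault is in the system functions ExReleaseResourceLite() and KeAcquireInStackQueuedSpinLock(), and the address being written, 0abc9867, is obviously wrong, so stack corruption is a reasonable inference here.
The first suspect is the lock-protected part of XXXProcessDirents(), since that part really does contain buffer copy operations, the most likely cause of stack corruption. But after careful inspection and testing, that part was ruled out.
With the first suspect cleared there was no obvious target left, so back to the windbg log:
It appears that the address KeAcquireInStackQueuedSpinLock() is writing is LockQueue->Next of the LockHandle, and a LockHandle is normally allocated on the current stack, which backs up the earlier stack-corruption inference. The question is who corrupted the stack.
The stack contains a record of a call to hal!HalpApcInterrupt(), the software interrupt that handles APCs. hal!HalpApcInterrupt() normally calls nt!KiDeliverApc() to process the thread's APC queue. When ExReleaseResourceLite() is called, however, the thread is still inside a critical region (between KeEnterCriticalRegion() and KeLeaveCriticalRegion()), where user-mode APCs and normal kernel-mode APCs are disabled but special kernel-mode APCs are not.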
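To make the rule above concrete, here is a small user-mode model of it, not kernel code. ApcKind, ThreadState, apc_can_interrupt() and the MODEL_* constants are hypothetical names made up for this sketch, and the model covers only a thread that is currently running kernel-mode code.

```c
#include <stdbool.h>

/* Simplified model only; none of these names exist in the WDK. */
typedef enum { APC_USER, APC_KERNEL_NORMAL, APC_KERNEL_SPECIAL } ApcKind;

enum { MODEL_PASSIVE_LEVEL = 0, MODEL_APC_LEVEL = 1, MODEL_DISPATCH_LEVEL = 2 };

typedef struct {
    int  irql;               /* one of the MODEL_*_LEVEL values */
    bool in_critical_region; /* between KeEnterCriticalRegion/KeLeaveCriticalRegion */
    bool in_guarded_region;  /* between KeEnterGuardedRegion/KeLeaveGuardedRegion */
} ThreadState;

/* Can an APC of this kind interrupt a thread that is running kernel code? */
static bool apc_can_interrupt(ApcKind kind, ThreadState t)
{
    /* At APC_LEVEL or above, or in a guarded region, no APC is delivered. */
    if (t.irql >= MODEL_APC_LEVEL || t.in_guarded_region)
        return false;

    switch (kind) {
    case APC_KERNEL_SPECIAL:
        return true;                   /* a critical region does not stop it */
    case APC_KERNEL_NORMAL:
        return !t.in_critical_region;  /* blocked by a critical region */
    case APC_USER:
        return false;                  /* never interrupts kernel-mode code */
    }
    return false;
}
```

For the state XXXProcessDirents() is in when it calls ExReleaseResourceLite() (PASSIVE_LEVEL, inside a critical region), the model answers "yes" only for APC_KERNEL_SPECIAL, which is the one kind left to suspect.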
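The most common source of a special kernel APC is IoCompleteRequest(): it queues one so that IopCompleteRequest() runs at APC level to perform the second stage of IRP completion cleanup.
At this point things start to take shape. The only place in the code under analysis that can cause an APC to be queued is the ZwReadFile() call in XXXXProcessDirent(), and fileHead is allocated on the stack.
With that, the root cause of the bug comes to the surface:
XXXXProcessDirent() does not handle the case where ZwReadFile() returns STATUS_PENDING. In that case XXXXProcessDirent() returns and execution carries on while the IRP completion for the earlier ZwReadFile() is still in progress, and the fileHead address that the completion is going to write lies in stack space that has long since been reclaimed and reused.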
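The ordering problem can be replayed in a toy user-mode model, where an array stands in for the thread's kernel stack. Everything in it is hypothetical: ModelStack, the model_* functions and the slot number are invented for illustration, and it does not claim to reproduce the real stack layout from the dump.

```c
#include <stdint.h>
#include <string.h>

#define STACK_WORDS  16
#define HEADER_SLOT  4   /* arbitrary slot standing in for fileHead's address */

typedef struct {
    uint32_t words[STACK_WORDS]; /* stands in for the thread's kernel stack */
    int      pending_slot;       /* slot a still-pending read will write, or -1 */
} ModelStack;

/* Models XXXXProcessDirent(): the read is left pending and the function
 * returns without waiting, so its frame (and HEADER_SLOT) is released. */
static void model_process_dirent(ModelStack *s)
{
    s->words[HEADER_SLOT] = 0;      /* fileHead */
    s->pending_slot = HEADER_SLOT;  /* "ZwReadFile() returned STATUS_PENDING" */
}

/* Models a later call whose frame reuses the same slot for a pointer. */
static void model_later_frame(ModelStack *s, uint32_t pointer_value)
{
    s->words[HEADER_SLOT] = pointer_value;
}

/* Models the completion writing file data to the address recorded earlier. */
static void model_complete_read(ModelStack *s, uint32_t file_data)
{
    if (s->pending_slot >= 0) {
        s->words[s->pending_slot] = file_data;
        s->pending_slot = -1;
    }
}

/* Replays the sequence and returns what the later frame now finds in its slot. */
static uint32_t model_run(uint32_t pointer_value, uint32_t file_data)
{
    ModelStack s;
    memset(&s, 0, sizeof s);
    s.pending_slot = -1;
    model_process_dirent(&s);
    model_later_frame(&s, pointer_value);
    model_complete_read(&s, file_data);
    return s.words[HEADER_SLOT];
}
```

Whatever pointer the later frame stored is silently replaced by file contents, which is the kind of overwrite that could leave a routine dereferencing a meaningless address such as 0abc9867.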
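Once that was clear, the fix was to call ZwWaitForSingleObject() after ZwReadFile() specifically when it returns STATUS_PENDING, so that the read has fully completed before the next step runs.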
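The shape of that fix can be sketched as a portable user-mode model rather than as driver code. model_read() and model_wait() are hypothetical stand-ins for ZwReadFile() and ZwWaitForSingleObject(), and ModelFile, ModelIoStatus and read_header_sync() are likewise invented for this sketch. In the driver the wait would presumably be on the event passed to ZwReadFile() (ioevent in the listing), which is an assumption since the post does not say, and the final status would be taken from the I/O status block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* The two values match STATUS_SUCCESS and STATUS_PENDING in ntstatus.h;
 * every other name in this block is hypothetical. */
typedef long ModelStatus;
#define MODEL_STATUS_SUCCESS 0x00000000L
#define MODEL_STATUS_PENDING 0x00000103L

typedef struct { ModelStatus status; size_t information; } ModelIoStatus;

typedef struct {
    const char    *data;        /* file contents */
    size_t         size;
    bool           always_pend; /* force the asynchronous path */
    bool           in_flight;   /* a read still refers to the caller's buffers */
    void          *dest;
    size_t         length;
    ModelIoStatus *iosb;
} ModelFile;

/* Stand-in for ZwWaitForSingleObject(): returns once the I/O has completed. */
static void model_wait(ModelFile *f)
{
    if (!f->in_flight)
        return;
    memcpy(f->dest, f->data, f->length);
    f->iosb->status = MODEL_STATUS_SUCCESS;
    f->iosb->information = f->length;
    f->in_flight = false;
}

/* Stand-in for ZwReadFile(): may return PENDING while keeping the pointers. */
static ModelStatus model_read(ModelFile *f, ModelIoStatus *iosb,
                              void *buf, size_t len)
{
    f->dest = buf;
    f->length = len < f->size ? len : f->size;
    f->iosb = iosb;
    f->in_flight = true;
    if (f->always_pend)
        return MODEL_STATUS_PENDING;
    model_wait(f);              /* synchronous completion */
    return iosb->status;
}

/* The corrected pattern: never return while the read is still outstanding. */
static ModelStatus read_header_sync(ModelFile *f, void *header,
                                    size_t header_size, size_t *bytes_read)
{
    ModelIoStatus iosb = { MODEL_STATUS_PENDING, 0 };
    ModelStatus status = model_read(f, &iosb, header, header_size);

    if (status == MODEL_STATUS_PENDING) {
        model_wait(f);          /* driver: ZwWaitForSingleObject() */
        status = iosb.status;   /* final result lives in the I/O status block */
    }
    assert(!f->in_flight);      /* nothing may still point at our locals */
    *bytes_read = iosb.information;
    return status;
}
```

The point of the design is that both the header buffer and the status block are locals, so the function may only return once nothing outstanding can still write to them.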
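And with that, the problem is solved!
That a single blue screen could be this convoluted reminds me of Liu Zhenyun's 《一句顶一万句》 (literally "one sentence worth ten thousand sentences"); only which sentence is the one worth ten thousand?
<Next I plan to write something about APCs; the operating system hides them so deep that they always feel elusive!>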