摘要: 
  本文通过一次Oracle Job任务异常案例诊断,分析其原因及解决过程,从内部揭示Oracle Job任务调度及内部计时机制。 
  问题及环境 
  接到研发人员报告,数据库定时任务未正常执行,导致某些操作失败。 
  开始介入处理该事故. 
  系统环境: 
  以下为引用的内容: 
[code]SunOS DB 5.8 Generic_108528-21 sun4u sparc SUNW,Ultra-4 
Oracle9i Enterprise Edition Release 9.2.0.3.0 - Production[/code]
  解决过程 
  首先介入检查数据库任务 
  以下为引用的内容: 
[code]$ sqlplus "/ as sysdba" 
SQL*Plus: Release 9.2.0.3.0 - Production on Wed Nov 17 20:23:53 2004 
Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved. 
Connected to: 
Oracle9i Enterprise Edition Release 9.2.0.3.0 - Production 
With the Partitioning, OLAP and Oracle Data Mining options 
JServer Release 9.2.0.3.0 - Production 
SQL> select job,last_date,last_sec,next_date,next_sec,broken,failures from 
dba_jobs; 
JOB LAST_DATE LAST_SEC NEXT_DATE NEXT_SEC B FAILURES 
INTERVAL 
---------- --------- ---------------- --------- ---------------- - ---------- 
---------------------------- 
31 16-NOV-04 01:00:02 17-NOV-04 01:00:00 N 0 
trunc(sysdate+1)+1/24 
27 16-NOV-04 00:00:04 17-NOV-04 00:00:00 N 0 
TRUNC(SYSDATE) + 1 
35 16-NOV-04 01:00:02 17-NOV-04 01:00:00 N 0 
trunc(sysdate+1)+1/24 
29 16-NOV-04 00:00:04 17-NOV-04 00:00:00 N 0 
TRUNC(SYSDATE) + 1 
30 01-NOV-04 06:00:01 01-DEC-04 06:00:00 N 0 
trunc(add_months(sysdate,1),’MM’)+6/24 
65 16-NOV-04 04:00:03 17-NOV-04 04:00:00 N 0 
trunc(sysdate+1)+4/24 
46 16-NOV-04 02:14:27 17-NOV-04 02:14:27 N 0 
sysdate+1 
66 16-NOV-04 03:00:02 17-NOV-04 18:14:49 N 0 
trunc(sysdate+1)+3/24 
8 rows selected.[/code]
 
  发现JOB任务是都没有正常执行,最早一个应该在17-NOV-04 01:00:00执行。但是没有执行。 
  建立测试JOB 
  以下为引用的内容: 
[code]create or replace PROCEDURE pining 
IS 
BEGIN 
NULL; 
END; 
/ 
variable jobno number; 
variable instno number; 
begin 
select instance_number into :instno from v$instance; 
dbms_job.submit(:jobno, ’pining;’, trunc(sysdate+1/288,’MI’), 
’trunc(SYSDATE+1/288,’’MI’’)’, TRUE, :instno); 
end; 
/[/code]
 
  发现同样的,不执行。 
  但是通过dbms_job.run()执行没有任何问题。
  进行恢复尝试 
  怀疑是CJQ0进程失效,首先设置JOB_QUEUE_PROCESSES为0,Oracle会杀掉CJQ0及相应job进程 
  以下为引用的内容: 
[code]SQL> ALTER SYSTEM SET JOB_QUEUE_PROCESSES = 0;[/code] 
   等2~3分钟,重新设置 以下为引用的内容: 
[code]SQL> ALTER SYSTEM SET JOB_QUEUE_PROCESSES = 5;[/code] 
   此时PMON会重起CJQ0进程 以下为引用的内容: 
[code]Thu Nov 18 11:59:50 2004 
ALTER SYSTEM SET job_queue_processes=0 SCOPE=MEMORY; 
Thu Nov 18 12:01:30 2004 
ALTER SYSTEM SET job_queue_processes=10 SCOPE=MEMORY; 
Thu Nov 18 12:01:30 2004 
Restarting dead background process CJQ0 
CJQ0 started with pid=8 
但是Job仍然不执行,而且在再次修改的时候,CJQ0直接死掉了。 
Thu Nov 18 13:52:05 2004 
ALTER SYSTEM SET job_queue_processes=0 SCOPE=MEMORY; 
Thu Nov 18 14:09:30 2004 
ALTER SYSTEM SET job_queue_processes=10 SCOPE=MEMORY; 
Thu Nov 18 14:10:27 2004 
ALTER SYSTEM SET job_queue_processes=0 SCOPE=MEMORY; 
Thu Nov 18 14:10:42 2004 
ALTER SYSTEM SET job_queue_processes=10 SCOPE=MEMORY; 
Thu Nov 18 14:31:07 2004 
ALTER SYSTEM SET job_queue_processes=0 SCOPE=MEMORY; 
Thu Nov 18 14:40:14 2004 
ALTER SYSTEM SET job_queue_processes=10 SCOPE=MEMORY; 
Thu Nov 18 14:40:28 2004 
ALTER SYSTEM SET job_queue_processes=0 SCOPE=MEMORY; 
Thu Nov 18 14:40:33 2004 
ALTER SYSTEM SET job_queue_processes=1 SCOPE=MEMORY; 
Thu Nov 18 14:40:40 2004 
ALTER SYSTEM SET job_queue_processes=10 SCOPE=MEMORY; 
Thu Nov 18 15:00:42 2004 
ALTER SYSTEM SET job_queue_processes=0 SCOPE=MEMORY; 
Thu Nov 18 15:01:36 2004 
ALTER SYSTEM SET job_queue_processes=15 SCOPE=MEMORY;[/code]
 
  尝试重起数据库,这个必须在晚上进行: 
  以下为引用的内容: 
[code]PMON started with pid=2 
DBW0 started with pid=3 
LGWR started with pid=4 
CKPT started with pid=5 
SMON started with pid=6 
RECO started with pid=7 
CJQ0 started with pid=8 
QMN0 started with pid=9 
....[/code]
  CJQ0正常启动,但是Job仍然不执行。 
  没办法了... 
  继续研究...居然发现Oralce有这样一个bug 以下为引用的内容: 
[code]1. Clear description of the problem encountered: 
slgcsf() / slgcs() on Solaris will stop incrementing after 
497 days 2 hrs 28 mins (approx) machine uptime. 
2. Pertinent configuration information 
No special configuration other than long machine uptime. . 
3. Indication of the frequency and predictability of the problem 
100% but only after 497 days. 
4. Sequence of events leading to the problem 
If the gethrtime() OS call returns a value > 42949672950000000 
nanoseconds then slgcs() stays at 0xffffffff. This can 
cause some problems in parts of the code which rely on 
slgcs() to keep moving. 
eg: In kkjssrh() does "now = slgcs(&se)" and compares that 
to a previous timestamp. After 497 days uptime slgcs() 
keeps returning 0xffffffff so "now - kkjlsrt" will 
always return 0. . 
5. Technical impact on the customer. Include persistent after effects. 
In this case DBMS JOBS stopped running after 497 days uptime. 
Other symptoms could occur in various places in the code.[/code]
 
  好么,原来是计时器溢出了,一检查我的主机: 
  以下为引用的内容: 
[code]bash-2.03$ uptime 
10:00pm up 500 day(s), 14:57, 1 user, load average: 1.31, 1.09, 1.08 
bash-2.03$ date 
Fri Nov 19 22:00:14 CST 2004[/code]
  刚好到事发时是497天多一点.ft. 
  安排重起主机系统.. 
  这个问题够郁闷的,NND,谁曾想Oracle这都成... 
  Oracle最后声称: 
  fix made it into 9.2.0.6 patchset 
  在Solaris上的9206尚未发布...晕. 
  好了,就当是个经历吧,如果有问题非常不可思议的话,那么大胆怀疑Oracle吧,是Bug,可能就是Bug。 
  重起以后问题解决,状态如下: 
  以下为引用的内容: 
[code]$ sqlplus "/ as sysdba" 
SQL*Plus: Release 9.2.0.3.0 - Production on Fri Nov 26 09:21:21 2004 
Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved. 
Connected to: 
Oracle9i Enterprise Edition Release 9.2.0.3.0 - Production 
With the Partitioning, OLAP and Oracle Data Mining options 
JServer Release 9.2.0.3.0 - Production 
SQL> select job,last_date,last_sec,next_date,next_sec from user_jobs; 
JOB LAST_DATE LAST_SEC NEXT_DATE NEXT_SEC 
---------- --------- ---------------- --------- ---------------- 
70 26-NOV-04 09:21:04 26-NOV-04 09:26:00 
SQL> / 
JOB LAST_DATE LAST_SEC NEXT_DATE NEXT_SEC 
---------- --------- ---------------- --------- ---------------- 
70 26-NOV-04 09:26:01 26-NOV-04 09:31:00 
SQL> 
SQL> select * from v$timer; 
HSECS 
---------- 
3388153 
SQL> select * from v$timer; 
HSECS 
---------- 
3388319 
SQL> 
FAQ[/code]
                  	
				
