Friday, May 19, 2006

Flash Recovery Area on ASM?

In yesterday's post, "VLDB with ASM?", I showed the problems with ASM mirroring for a high-volume database. Today, I want to emphasize that you should be very careful placing a flash recovery area (FRA) on an ASM disk group with normal or high redundancy.

The reason is the same as before. Whenever all or a majority of the disks in a failure group are "broken", ASM immediately starts rebalancing. If there is not enough space left in the failed failure group (well, zero if all disks fail), then ASM will mark those disks as HUNG and no files can be created in that particular disk group. For a flash recovery area this is a very bad state: since no files can be created in the disk group, the ARCx processes cannot archive the online redo logs, so at some point the database will simply freeze if the situation is not resolved promptly. Flashback logs cannot be written either, and any database backups will fail. Resolution requires manual intervention and might actually take a while.

If the FRA needs to be placed on ASM, then consider using external redundancy even for relatively small FRAs. Use mirroring inside the storage box, or between storage boxes, as some vendors support that.

In fact, the latter looks like a good approach for mirroring a large ASM database over different storage boxes: use external redundancy and let the SAN boxes mirror using the vendor's technology (though it might require additional licensing).
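Just to sketch the idea (device names are invented, so adjust to your setup): create the FRA disk group with external redundancy on the ASM instance and point the database at it.

-- On the ASM instance; the devices are assumed to be
-- already mirrored by the storage layer:
CREATE DISKGROUP fra EXTERNAL REDUNDANCY
  DISK '/dev/raw/raw5', '/dev/raw/raw6';

-- On the database instance:
ALTER SYSTEM SET db_recovery_file_dest_size = 100G SCOPE=BOTH;
ALTER SYSTEM SET db_recovery_file_dest = '+FRA' SCOPE=BOTH;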

Thursday, May 18, 2006

VLDB with ASM?

Today I attended the first day of the ASM course/workshop led by Martin Gosejacob. He is quite good and knows his stuff. Since I had played with ASM before and installed the practice clusters for the course, I more or less knew the general content. Nevertheless, I picked up enough new and interesting pieces of information. Today I will mention my concerns regarding managing large amounts of storage with ASM. Later I will try to post some other interesting twists and quirks that are not commonly mentioned.

A serious limitation of ASM is that it doesn't have dirty region logs. Oracle claims that you don't need them, as the Oracle database is ASM-aware and recovery at the ASM level is not required. However, there is a serious flaw: resyncing/rebalancing/resilvering of a large part of a disk group has to be done from scratch.

Imagine a database of a few TB with data mirrored by ASM across two SAN boxes accessed via separate FC cards and separate switches. You will need to organize all disks into two failure groups (each group with disks from the same SAN box) so that all extents are mirrored across the two SAN boxes.
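Just to illustrate (device names are made up), the disk group would be created along these lines, with one failure group per SAN box:

CREATE DISKGROUP data NORMAL REDUNDANCY
  FAILGROUP san1 DISK '/dev/raw/raw1', '/dev/raw/raw2'
  FAILGROUP san2 DISK '/dev/raw/raw3', '/dev/raw/raw4';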

Maintenance or an unplanned outage on any one of those components will make the whole mirror unavailable, and ASM will try to rebalance; but having no available disks in one of the failure groups, it won't get anywhere. Database operational status is not impacted at this point (unless you need to add a new datafile or resize one - see below). Again, ASM will start "ejecting" the failed disks, and the status of those disks will be HUNG, i.e. stuck in the attempt to drop them from the disk group. If multipathing is used and both switches are connected to both enclosures, then a switch or FC card failure is not that bad, but there is still no protection from a SAN box outage or firmware upgrade.
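You can watch all of this from the ASM instance. A couple of queries I would use (column names as I read them in the 10g reference):

-- Disk states; the failed disks show up as HUNG in 10g:
SELECT group_number, name, failgroup, state
FROM v$asm_disk;

-- Progress (if any) of the rebalance:
SELECT operation, state, est_minutes
FROM v$asm_operation;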

Let's assume that access to the secondary box is back in 30 minutes and we try to put the disks back. Since there is no change log, there is no way to perform fast resilvering. But this is not the full story. As mentioned before, ASM has already initiated the removal of the failed disks from the disk group, and the fact that the disks are back doesn't affect ASM in any way now. You cannot drop the failed disks until ASM finishes rebalancing, and you cannot resync them as ASM is actually dropping them - there is no way back. The only way out is to add enough disks to the failure group with the HUNG disks so that ASM can finish rebalancing and finally drop the disks from the failed SAN box.

This translates into hours or days of rebalancing for several TB, plus additional space equal to the size of the failed disks. In reality, it's probably possible to reuse the failed capacity, but that involves reorganizing it at the SAN level, presenting it as new devices, cleaning them up and adding them as new disks. There are two options to re-add the same disks (both sketched below):
1. Clean up the ASM labels (dd zeroes over roughly the first 1 MB of each device) prior to adding.
2. Add the disks to the disk group with the FORCE option.
The first method requires additional effort, and the second is quite dangerous, as one can force-add a good disk that is used somewhere else - especially if you use wildcards (would you like to enter all 100 devices manually?). Then again, in the first case you may just as well overwrite good disks.
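Here is how both options might look; the device and disk group names are invented, and the dd step is spelled out deliberately because it destroys the ASM disk header:

-- Option 1: wipe the old ASM label first (from the OS shell):
--   dd if=/dev/zero of=/dev/raw/raw3 bs=1024k count=1
-- then add the disk normally:
ALTER DISKGROUP data ADD FAILGROUP san2 DISK '/dev/raw/raw3';

-- Option 2: let ASM overwrite the old label in one step:
ALTER DISKGROUP data ADD FAILGROUP san2 DISK '/dev/raw/raw3' FORCE;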

As I mentioned already, if there is not enough free space for rebalancing in one of the failure groups, then additional space cannot be allocated - i.e. files cannot be extended and new files cannot be added in this disk group. So until enough space is provided, you are at risk of your tablespaces getting full.

It is interesting what is going to happen with archive logs when one failure group is "out". I need to test it, as I believe that new files will not be created until some disks are added.

In fact, ASM with its current methodology couldn't really use the dirty-region-log principle: as soon as a failure is detected, ASM initiates the drop of those disks. To solve this, there should be an option to avoid automatic rebalancing, in addition to the introduction of dirty region logs.
So how can this problem be mitigated with the current ASM implementation?
It's possible to use many small storage boxes instead of a couple of high-end enclosures. However, this requires more complex SAN networking with multiple switches and FC cards.

Server-side FAN callouts

Update: This is relevant to RAC environments.

The Fast Application Notification (FAN) mechanism was introduced in 10g as a development of TAF (Transparent Application Failover). While the most important benefits come from using FAN on the client side, it's also possible to receive notifications on the server and react appropriately. The beauty of it is its simplicity.

Setting up a custom server-side callout is very easy and doesn't require writing any code with OCI or Java. It can be a simple shell or Perl script. The script, with executable permissions, should be put in $ORA_CRS_HOME/racg/usrco. Note that you will most probably have to create this directory, as it doesn't exist after installation (well, at least in my case with 10.2 on Linux). Here is the sample script I use just for logging:
#!/bin/sh
# Log every FAN event, with a timestamp, to a per-node file.
HOST=`hostname`
FAN_LOGFILE=/opt/oracle/fan_log_${HOST}.log
# All event attributes arrive as command-line arguments ($*).
echo $* "reported="`date` >>$FAN_LOGFILE
The script receives the event attributes as command-line arguments, and they can be processed however you want: emailing/paging DBAs, logging events, relocating services, or any other automated action.

Wednesday, May 17, 2006

v$object_usage empty?

Yesterday, one of my colleagues asked why he couldn't see anything in the v$object_usage view when he enabled monitoring of indexes. I was scratching my head trying to remember the trick, because I recalled that I'd had a similar problem and the reason wasn't obvious. I couldn't help him at the time, as my memory didn't serve me well on this one.

Today I came across Kirti Deshpande's post on Oracle-L that reminded me of the issue. Apparently, v$object_usage contains information only about objects owned by the current user. This time I decided it should sink in deeper, so that next time I remember it. I looked in the docs (Oracle Database Reference 10g Release 2) and figured out there is a documentation bug:
"You can use this view to monitor index usage. The view displays statistics about index usage gathered from the database. All indexes that have been used at least once can be monitored and displayed in this view."
Nothing about objects of the current user. I also checked the "Administrator's Guide", and there is no mention of this peculiarity.

Kirti also gave a very nice piece of advice: create an improved version of v$object_usage. An excellent solution to simplify our lives when many schemas are involved. In fact, if you set up your environment properly, the users owning the objects will most probably be locked, so you can't easily log in as those users. ALTER SESSION SET CURRENT_SCHEMA wouldn't help either, because V$OBJECT_USAGE includes a filter on userenv('SCHEMAID'), which apparently is not affected by setting current_schema.

So, thanks to Kirti, here is the solution:
create or replace view V$ALL_OBJECT_USAGE
(OWNER
,INDEX_NAME
,TABLE_NAME
,MONITORING
,USED
,START_MONITORING
,END_MONITORING
)
as
select u.name
, io.name
, t.name
, decode(bitand(i.flags, 65536), 0, 'NO', 'YES')
, decode(bitand(ou.flags, 1), 0, 'NO', 'YES')
, ou.start_monitoring
, ou.end_monitoring
from
sys.user$ u
, sys.obj$ io
, sys.obj$ t
, sys.ind$ i
, sys.object_usage ou
where
i.obj# = ou.obj#
and io.obj# = ou.obj#
and t.obj# = i.bo#
and u.user# = io.owner#
/
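Usage is then straightforward from a DBA account; the index owner and name below are just placeholders:

ALTER INDEX scott.emp_name_ix MONITORING USAGE;

SELECT owner, index_name, used, start_monitoring
FROM v$all_object_usage
WHERE owner = 'SCOTT';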

Tuesday, May 16, 2006

Meeting Goran Bogdanovic

Today I met Goran Bogdanovic, whom I got to know from the Oracle-L list. We had a Mass of beer at the Hofbräukeller and spent a few hours chatting nicely about life, Oracle and whatever else came to mind. Thanks, Goran, for a nice time that let me relax a bit after a tough day.

Monday, May 15, 2006

Struggle "keeping up"...

A very nice one today: The myth of "keeping up". I see myself in every line of that blog post, but I need something stronger to make me change. What should that be? "John isn't keeping up either" doesn't work... that's John's problem!

Tuesday, May 09, 2006

Is do-it-yourself PC upgrade painless these days?

Everything started 10 days ago when the PSU (power supply unit) of my desktop went belly up. I verified that it was definitely the PSU: the standby +5V was there according to my tester, but shorting pins 14 and 15 didn't start it (these are the pins used in ATX to turn the supply on). I didn't have time until last Saturday to go and get a new one. Actually, I thought about going for an upgrade, but I decided that my old Athlon 1400 MHz was enough for word processing, email, internet and even simple photo processing. My kid was happy with his games too, so I thought I'd keep it. An old friend is better than ten new ones, as we say in Russia.

So I got one of the cheapest PSUs, an LC6350 (350W) from LC-Power, somewhat silent with a 12 cm fan. The surprises started there: it turned out that my old PSU had actually had a reason to die. I figured that out when my system didn't stay on for longer than 2-3 seconds with the new PSU. I couldn't really check which component killed my power supply, but it looks like either the CPU or the motherboard. So... now it was definitely upgrade time.

I am an Athlon fan so... Intel - "just say no". I hope Intel fans forgive me.



I decided to go modest: an Athlon64 3200+ Venice, as it was relatively cheap (only 112 Euro), while further MHz increases, and especially the next-generation dual-cores, were priced much higher. But it's quite a step up from my old one. Of course, I needed a new graphics card, and Nvidia is to me in graphics what AMD is in CPUs ;-). Again going relatively modest from their latest line of cards: a LEADTEK WinFast PX7600GT.

Nvidia has resurrected the quite nice SLI technology: pairing two video cards. I remember this idea dying back when I had my 3dfx Voodoo card. Like any good idea, SLI was ahead of its time in the 3dfx days. So, to have some reserve, I needed a mobo that could easily support the next-generation Athlon64 X2 dual core and SLI mode. Obviously, the Nforce4 SLI chipset would do. After looking at some reviews I decided to go for a Gigabyte K8N-SLI. All I need for about 80+ Euro - a good deal. And it can also do RAID 1, 0, 1+0 and 5 (well, who needs that 5 anyway???).

And of course, 2 GB of memory, just in case I need to run Oracle on this machine. :-) It's also important that I can take it up to 4 GB, which is probably enough for a half-split single-head box.

After deciding on my config, still on Saturday, I went to our local computer shop. Ah... I forgot the cooler: a really nice and quiet Freezer 64 PRO. I also had to swap my PSU for a more powerful one; I took the 550W LC6550G from the same company. This one is even better - it has a 14 cm fan!

At this point the story takes another direction and becomes a 2+ day/night nightmare. Apparently, the new ACPI technology has not proven to be very stable, and because of that, as I figured out later, I was struggling both during Windows installation and when running any of the existing Windows versions on my hard drives. The installer kept rebooting when the status bar turned to "Starting Windows..." (some would say "big surprise"). I was getting mad! The internet proved that I wasn't the only one, and people were spending weeks resolving it. Somehow I managed to install Windows on a couple of partitions, but both times it identified the system not as ACPI-compliant but as MPS Uniprocessor or Standard PC. In both cases it was rebooting every now and again, finding duplicate IDE controllers all the time, and hardly working.

After a sleepless weekend I decided (Sunday night) that it was a mobo/BIOS problem (a BIOS upgrade didn't help, of course). Monday evening after work I happily swapped it for an ASUS A8N-SLI SE (a bit more expensive, but still under 100 Euro) and was looking forward to finally playing the "new" Need For Speed: Most Wanted. Boy, how mistaken I was. The same problem was waiting for me... and I couldn't solve it again!

The final resolution was... you'd be surprised... my darling Olga. She said that if I couldn't fix it, then we needed to call professionals. SHE SAID THAT TO ME!!! Did I deserve so much shame!? Boy, that was it... I wasn't going anywhere the next day/night, whatever it took, until I fixed it... and so, after another sleepless night, my new PC was up and running by 5am. Boy, was I happy! The fact that I was going to bed 90 minutes before going to work didn't bother me at all - AMD rocks. M$ - s... damn Windows.

I still can't be 100% sure what the issue was, but I believe that Windows/mobo/BIOS were confused by the fact that I had a second hard disk on the second IDE channel, along with my DVD burner. Nevertheless, I was able to install it twice, proving that it was just my luck and that the professionals are still "in-house".

So now, instead of taking "my" Porsche for a spin, I am burning CPU cycles on my new hardware writing this blog... is that really me? Please don't tell anyone who knows me - they won't believe it. 8-)

Oracle supercomputer

...ORACLE was the world's fastest computer in 1953...


No comments.

Thursday, May 04, 2006

Using index for IS NULL

Until recently I truly believed that Oracle doesn't store NULLs in a b-tree index. Apparently, this is not exactly true, i.e. it is false. Oracle omits an entry from the index ONLY and ONLY when ALL columns in the index key are null. If any of the index columns has a non-null value, the key is put in the index. My misconception was so strong that I actually couldn't believe it until I dumped a leaf block and saw for myself that Oracle does store index keys with NULL columns.


Yesterday (or the day before?) an interesting idea came to my mind while sitting in Tom Kyte's seminar here in Munich. The statement "an IS NULL filter predicate cannot use an index because there is no guarantee that the rows with NULL values are actually within the index" is not correct either. It's quite easy to make the optimizer use an index range scan to find rows with NULL columns. This might be especially useful if it's not possible to change the application source code. We just need to create an index that includes the nullable column used in IS NULL plus another column declared as NOT NULL. This can be any small NOT NULL column, or a dummy column created for that purpose. This way, Oracle will always include the NULL values of the nullable column in the index.


And here is an example.


SQL> CREATE TABLE t (
2 keycol NUMBER(10,0) NOT NULL,
3 c CHAR(1) DEFAULT '1' NOT NULL ,
4 ncol NUMBER(6,0) NULL,
5 datacol NUMBER,
6 CONSTRAINT t_pk PRIMARY KEY (keycol)
7 );
Table created.

SQL> CREATE INDEX t_ind ON t (ncol,c);
Index created.

SQL> INSERT INTO t
2 (keycol, ncol, datacol)
3 SELECT ROWNUM, CASE
4 WHEN DBMS_RANDOM.VALUE (0, 1000) < 1 THEN NULL
5 ELSE TRUNC (DBMS_RANDOM.VALUE (0, 100000))
6 END, ROUND (DBMS_RANDOM.VALUE (1, 99999999999999999999999999))
7 FROM (SELECT NULL
8 FROM DUAL
9 CONNECT BY LEVEL <= 10000);
10000 rows created.

SQL> COMMIT ;
Commit complete.

SQL> BEGIN
2 DBMS_STATS.gather_table_stats (USER, 'T');
3 END;
4 /
PL/SQL procedure successfully completed.

SQL> SET AUTOTRACE ON
SQL> SELECT *
2 FROM t
3 WHERE ncol IS NULL;

KEYCOL C NCOL DATACOL
---------- - ---------- ----------
1111 1 8.9202E+25
4742 1 8.7495E+25
5021 1 8.3964E+25
8956 1 2.2568E+24
269 1 6.8797E+25
487 1 4.5290E+25
7669 1 4.5904E+25
7878 1 1.5710E+25

8 rows selected.

Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=2 Card=8 Bytes=200)
1 0 TABLE ACCESS (BY INDEX ROWID) OF 'T' (Cost=2 Card=8 Bytes=
200)

2 1 INDEX (RANGE SCAN) OF 'T_IND' (NON-UNIQUE) (Cost=2 Card=
8)


Statistics
----------------------------------------------------------
5 recursive calls
0 db block gets
16 consistent gets
0 physical reads
72 redo size
903 bytes sent via SQL*Net to client
651 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
2 sorts (memory)
0 sorts (disk)
8 rows processed

SQL> SET AUTOTRACE OFF


Now let's have a look at an index leaf block. We need to find the ROWID of a leaf block holding the NULL values of ncol. We don't need to dump the whole tree, thanks to the undocumented function sys_op_lbid. Thanks to Jonathan Lewis for mentioning it in an Oracle-L thread.

SQL> SELECT object_id
2 FROM user_objects
3 WHERE object_name = 'T_IND';

OBJECT_ID
----------
1823665

SQL> SELECT DBMS_ROWID.rowid_relative_fno (myrowid) file_no,
2 DBMS_ROWID.rowid_block_number (myrowid) block_no
3 FROM (SELECT sys_op_lbid(1823665, 'L', t.ROWID) myrowid
4 FROM t
5 WHERE ncol IS NULL AND ROWNUM <= 1;

FILE_NO BLOCK_NO
---------- ----------
3 58521

SQL> ALTER SESSION SET tracefile_identifier = 'null_in_index';
Session altered.

SQL> ALTER SYSTEM DUMP DATAFILE 3 BLOCK 58521;
System altered.


Looking at the trace file (it will have 'null_in_index' embedded in its name), we can actually see that all rows with NULL values in the ncol column are referenced in the index:
... skip ...
Leaf block dump
===============
... skip ...
row#264[7872] flag: -----, lock: 0
col 0; NULL
col 1; len 1; (1): 31
col 2; len 6; (6): 00 c1 06 8e 00 0e
row#265[7884] flag: -----, lock: 0
col 0; NULL
col 1; len 1; (1): 31
col 2; len 6; (6): 00 c1 06 8f 00 d7
... skip ...
----- end of leaf block dump -----

HASH GROUP BY can give wrong results in Oracle 10.2

We hit bug 4604970 in our DW environment. This is a really bad one: you might not notice that your query returns wrong results until you get a negative number on your payslip at the end of the month.
Some one-off patches are available, and the workaround is to set _GBY_HASH_AGGREGATION_ENABLED=FALSE. Good luck with 10.2!
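For the record, this is how I would apply the workaround (hidden parameters must be double-quoted; as usual, check with Oracle Support before setting them):

ALTER SESSION SET "_gby_hash_aggregation_enabled" = FALSE;

-- or instance-wide, if the bug bites you everywhere:
ALTER SYSTEM SET "_gby_hash_aggregation_enabled" = FALSE SCOPE=BOTH;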

Tom Kyte's seminar, Day 2

I got so excited about the ORADEBUG feature I described in the previous post that I completely forgot to mention the main news of the day: the second day of Tom's seminar. It was an excellent day, packed with lots of useful tricks and tips, plus I cleared up quite a few things in my mind. The best one that I've got straight now is the so-called "write consistency", in Tom's terms.

Small but neat 10g R2 feature:

SQL> SET SERVEROUTPUT ON SIZE UNLIMITED

I can also output lines of up to 32K using DBMS_OUTPUT.PUT_LINE. No comments.
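A quick sanity check, assuming 10.2 (on earlier releases the PUT_LINE call below fails with ORU-10028, since lines were limited to 255 bytes):

SET SERVEROUTPUT ON SIZE UNLIMITED
BEGIN
  -- one single 32,000-character line
  DBMS_OUTPUT.PUT_LINE(RPAD('x', 32000, 'x'));
END;
/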

And by the way, Tom's story with Best Buy was real as he mentioned today.

Wednesday, May 03, 2006

Closing trace file with ORADEBUG

I was playing with ORADEBUG today and discovered a cool feature: you can actually close a trace file using ORADEBUG CLOSE_TRACE. The only way I knew of before was to shrink the file. So now, after a 10046 trace, you don't need to restart the process to close its trace file.

If your SMON, PMON, or other Oracle background process is dumping like crazy (kind of abnormal, but who needs a DBA anyway if SMON is not dumping regularly?), you can close its trace file and actually delete it without bouncing the instance. I've never really dealt with Windoracle on that matter, but I suspect that removing SMON's trace file, for instance, is not possible there (it's locked) without stopping the Oracle instance, UNLESS one uses ORADEBUG CLOSE_TRACE.

How many times have you gotten into the situation where the dump destination filled up and someone (perhaps you) cleaned it up with something trivial like 'rm *trc'? And out of those cases, how often have you noticed that the several-GB filesystem with a single 100 KB alert.log is still 99% full after the "clean up", because SMON and/or other background processes keep the inodes open? My bet: at least 50% of the time.

Now all you need is simply:
SQL> oradebug setospid <os pid of the process>
Statement processed.
SQL> oradebug close_trace
Statement processed.

Isn't that amazing? I just checked: it's available at least in 9iR2, 10gR1 and 10gR2.
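By the way, to attach to a background process you first need its OS PID. One way to look it up (SMON here, just as an example) before running setospid:

SELECT p.spid
FROM v$process p, v$bgprocess b
WHERE p.addr = b.paddr
  AND b.name = 'SMON';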

Tuesday, May 02, 2006

First post about Oracle: no more quoting nightmare in 10g

So here I am, writing my first blog post, after a while during which everyone else seems to have been blogging for at least half of their internet-enabled lives. Let's see how it goes.

Anyway... Today I was at the first day of Tom Kyte's 2-day seminar in Munich. I enjoyed the day very much and I'm already looking forward to tomorrow. One thing caught my attention: Oracle 10g Release 1 introduced a nice addition to the string literal syntax. I can now forget about troublesome quoting when generating dynamic SQL code in SQL and PL/SQL. It works similarly to Perl's quote operator.

Imagine my PL/SQL procedure that creates new partitions for the next day in some table t1. Now, instead of
stmt := 'ALTER TABLE t1 SPLIT PARTITION p_last AT (TO_DATE('''
     || TO_CHAR (TRUNC(sysdate+2), 'YYYY-MM-DD HH24:MI:SS')
     || ''',''SYYYY-MM-DD HH24:MI:SS'', ''NLS_CALENDAR=GREGORIAN'')) INTO (PARTITION p_'
     || TO_CHAR(sysdate+1,'YYYYMMDD') || ', PARTITION p_last)';

I can simply write
stmt := q'[ALTER TABLE t1 SPLIT PARTITION p_last AT (TO_DATE(']'
     || TO_CHAR(TRUNC(sysdate+2), 'YYYY-MM-DD HH24:MI:SS')
     || q'[','SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN')) INTO (PARTITION p_]'
     || TO_CHAR(sysdate+1,'YYYYMMDD') || ', PARTITION p_last)';

Well, it's still not as readable and elegant as in Perl. RTFM for the details: Oracle® Database SQL Reference, 10g Release 2.
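For completeness, a minimal illustration of the operator on its own; any delimiter pair works, including [], {}, <> and ():

SELECT q'[It's so much easier without doubling quotes]' AS str
FROM dual;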
PS: After some editing I still couldn't get the formatting to look the way I wanted... the WYSIWYG editor on blogger.com doesn't seem to be the best one. Also, some tags break the layout of the whole page even though they are closed correctly within the post. What annoyed me most is that the pipes (||) kept getting removed by the WYSIWYG editor!