Friday, 4 November 2011

Replacing a faulty tape drive

The scenario: you have Veritas NetBackup Enterprise running on a *nix server connected to a tape library. Media have been constantly freezing, you are unable to label new media, and your drives are asking to be cleaned every hour. An engineer from your backup support provider arrives and replaces the faulty drive. What do you do next?

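The first thing is to find out what drives you have and what their paths are, so make sure you have the serial number of the faulty drive before the engineer inserts the new one. The following command reports the drives NetBackup considers missing alongside any new drives it has found: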
 
[root@backup1:/]# tpautoconf -report_disc
======================= Missing Device (Drive) =======================
 Drive Name = HP.ULTRIUM4-SCSI.001
 Drive Path = /dev/rmt/0cbn
 Inquiry = "HP      Ultrium 4-SCSI  H21Z"
 Serial Number = AABBCCDDEE
 TLD(0) definition Drive = 1
 Hosts configured for this device:
  Host = backup1
======================= Missing Device (Drive) =======================
 Drive Name = HP.ULTRIUM4-SCSI.002
 Drive Path = /dev/rmt/7cbn
 Inquiry = "HP      Ultrium 4-SCSI  H58W"
 Serial Number = FFGGHHIIJJ
 TLD(1) definition Drive = 2
 Hosts configured for this device:
  Host = backup1
======================= New Device (Drive) =======================
 Inquiry = "HP      Ultrium 4-SCSI  H44W"
 Serial Number = KKLLMMNNOO
 Drive Path = /dev/rmt/7cbn
 Found as TLD(1), Drive = 2


Here we note that our faulty drive is drive 2 (serial FFGGHHIIJJ) and the newly inserted drive is also drive 2 (serial KKLLMMNNOO). Both have the same path (/dev/rmt/7cbn), but NetBackup still does not recognise the new drive. The following command replaces the old drive definition with the new drive:

[root@backup1:/]# tpautoconf -replace_drive HP.ULTRIUM4-SCSI.002 -path /dev/rmt/7cbn
Found a matching device in global DB, HP.ULTRIUM4-SCSI.002 on host backup1
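
The matching done by eye above (a missing drive and a new drive sharing one device path) can be sketched in Python. This is only an illustration: it assumes the report looks like the sample output in this post, the function names are made up for the example, and it only builds the command string rather than running anything.

```python
import re

# Trimmed copy of the report shown above, used as sample input.
SAMPLE = '''======================= Missing Device (Drive) =======================
 Drive Name = HP.ULTRIUM4-SCSI.001
 Drive Path = /dev/rmt/0cbn
 Serial Number = AABBCCDDEE
======================= Missing Device (Drive) =======================
 Drive Name = HP.ULTRIUM4-SCSI.002
 Drive Path = /dev/rmt/7cbn
 Serial Number = FFGGHHIIJJ
======================= New Device (Drive) =======================
 Inquiry = "HP      Ultrium 4-SCSI  H44W"
 Serial Number = KKLLMMNNOO
 Drive Path = /dev/rmt/7cbn
 Found as TLD(1), Drive = 2
'''

def parse_report_disc(text):
    """Split the report into one dict per device block (hypothetical helper)."""
    devices = []
    current = None
    for line in text.splitlines():
        header = re.match(r"=+\s*(Missing|New) Device \(Drive\)\s*=+", line.strip())
        if header:
            # A banner line starts a new device block.
            current = {"state": header.group(1)}
            devices.append(current)
        elif current is not None and "=" in line:
            # "Key = value" lines belong to the current block.
            key, _, value = line.partition("=")
            current[key.strip()] = value.strip().strip('"')
    return devices

def replacement_commands(devices):
    """Pair each new drive with the missing drive that used the same path."""
    missing = {d["Drive Path"]: d for d in devices if d["state"] == "Missing"}
    commands = []
    for d in devices:
        if d["state"] == "New" and d.get("Drive Path") in missing:
            name = missing[d["Drive Path"]]["Drive Name"]
            commands.append(f"tpautoconf -replace_drive {name} -path {d['Drive Path']}")
    return commands

for command in replacement_commands(parse_report_disc(SAMPLE)):
    print(command)
```

Printing the command instead of executing it leaves the final check of the serial numbers to the person at the keyboard.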

After this we restart the NetBackup daemons and test some backups. If you get an error when attempting to UP a drive (e.g. "IPC Error: Daemon may not be running"), it may be because a faulty drive is still in the library and not operational, and NetBackup cannot seem to work around it. It is best to delete the drive and restart NetBackup:

[root@backup1:/]# tpconfig -delete -drive ID

Friday, 21 October 2011

Hardware oddities - X4450 memory issue Part 2

Following on from my previous post, our support lines had few ideas about what could be happening. One avenue we had not explored was replacing the four 4GB DIMMs, because the error could not be pinpointed to one particular DIMM: the same LED lit up on the mezzanine for any DIMM we placed in that slot. We decided to order four new DIMMs and swap them in to see what would happen. It turned out that two of the original DIMMs were faulty and were causing the issue. We still don't know why the same LED lit up despite several different DIMMs being put in the slot. This made it difficult to pinpoint the exact issue and unfortunately wasted a lot of our time.

In the end we replaced two DIMMs and returned the server to working order. Another one to add to the list of hardware oddities!


Monday, 26 September 2011

Hardware oddities - X4450 memory issue Part 1

Not so long ago we bought an extra 16GB of RAM in 4GB sticks for one of our X4450s. This would enable us to increase the memory from 32GB to 48GB to handle some of our memory-hungry applications. Inserting the sticks is not a difficult job; in fact it is quite easy, but unfortunately I have had a bad track record, mostly due to faulty DIMMs. On an X4450 the DIMMs are installed in pairs, so when you open up the server you are presented with 32 neat slots. Four of these are coloured blue and the rest are black. Your largest paired DIMMs (4GB in our case) go in the blue slots, and after that you populate the black slots with the smaller DIMMs (2GB for us).
After I had finished up and powered on the server, an orange light appeared, indicating a fault somewhere. So I opened the server again and, with it powered on, saw an orange light right beside slot A0. Thinking that it was a faulty DIMM, I switched it with the one from slot B0. Again the orange light at slot A0 appeared, yet the light at slot B0, which now held what I thought was the faulty DIMM, remained off. I then switched with the other 4GB pair (slots C0/D0) and the A0 light still appeared. So the fault was most likely the memory mezzanine, which luckily can be removed and replaced easily. After I opened a call with Oracle, a new one was sent out. After replacing the allegedly faulty mezzanine and repopulating the slots, I powered up the server and found, to my dismay, the same A0 light. I again swapped the DIMMs around, which unfortunately did not fix the issue. So at the moment there are a few candidates for the cause:
  1. Faulty DIMM(s)
  2. Faulty mezzanine connector
  3. Faulty motherboard
  4. Firmware issue
  5. OS has not cleared old error
After production hours I will take down the server and a spare X4450 and swap the mezzanine and the DIMMs between them. If the A0 light appears on the spare server I can narrow it down to the DIMMs or the mezzanine kit. Hopefully we will be able to use the spare server's mezzanine with the new DIMMs. I will follow up with an update.

Wednesday, 14 September 2011

Things I never knew about DNS – EDNS

We run a few internal and external DNS servers in the company I work for. Keeping them up to date is something we must do each year to stay one step ahead of any newly discovered exploits. I find that with each BIND upgrade I learn something new about it. The documentation is not impressive, to say the least!

One new thing I learned about was EDNS – Extension Mechanisms for DNS. I discovered this new (to me) option after I finished installing and configuring BIND and enabled verbose logging to test some DNS queries. I found that some queries were a bit slow to begin with. I checked the log file and discovered plenty of these messages:
             edns-disabled: info: success resolving 'ns6.netnorth.net/AAAA' (in 'netnorth.net'?) after reducing the advertised EDNS UDP packet size to 512 octets

So what is EDNS? Well, first I have to tell you about ordinary DNS. Classic DNS messages sent over UDP are limited to 512 octets. Anything larger will be rejected or fragmented, depending on your firewall. EDNS, on the other hand, carries more information and allows for larger packets, up to 4096 octets in BIND's default configuration. The reason for this (in RFC 2671) is: “Many of DNS’s protocol limits are too small for uses which are or which are desired to become common. There is no way for implementations to advertise their capabilities.” So new versions of BIND, and whatever other DNS software you use, will support EDNS.

What happens is that when we send a request to a DNS server, by default the query carries an OPT pseudo-record announcing that we support EDNS and stating the largest UDP packet we can accept. The DNS server receives this request and replies with either a normal packet of up to 512 octets, if it does not support EDNS yet, or an EDNS packet of up to 4096 octets.

The problem we saw was that our firewall only accepted DNS packets of up to 512 octets. Anything above that was discarded or fragmented. To resolve this, instead of turning off EDNS on our server, I asked our network engineers to allow DNS packets of up to 4096 octets through the network. As soon as this was implemented the messages disappeared from the log file and DNS resolution was much faster.
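
To make the advertised size concrete, here is a minimal Python sketch of a query that carries the EDNS OPT pseudo-record. It uses only the standard library, the function name and query ID are made up for the example, and it is an illustration of the wire format rather than how BIND itself builds queries.

```python
import struct

def build_query(name, qtype=28, udp_size=4096, query_id=0x1234):
    """Build a DNS query for `name` with an OPT record advertising udp_size.

    qtype 28 is AAAA, matching the log message above.
    """
    # Header: ID, flags (recursion desired), 1 question, 0 answers,
    # 0 authority records, 1 additional record (the OPT pseudo-record).
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 1)

    # Question name: each label is prefixed with its length, ending with a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", qtype, 1)  # QTYPE, QCLASS=IN

    # OPT pseudo-record: root name, TYPE=41, then the CLASS field is reused
    # to carry the advertised UDP payload size. TTL (extended flags) and
    # RDLENGTH are both zero here.
    opt = b"\x00" + struct.pack("!HHIH", 41, udp_size, 0, 0)
    return header + question + opt

packet = build_query("ns6.netnorth.net")
advertised = struct.unpack("!H", packet[-8:-6])[0]
print(len(packet), advertised)
```

A resolver that falls back to 512 octets, as in the log message, is in effect retrying with a smaller value in that class field (or with no OPT record at all).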


edns-udp-size
“edns-udp-size sets the advertised EDNS UDP buffer size. Valid values are 512 to 4096 (values outside this range will be silently adjusted). The default value is 4096. The usual reason for setting edns-udp-size to a non-default value is to get UDP answers to pass through broken firewalls that block fragmented packets and/or block UDP packets that are greater than 512 bytes.”


Friday, 9 September 2011

NetApp monitoring with check_netapp.pl

In my previous post I showed how to enable Nagios to monitor a NetApp device. The only issue you may have noticed is the output of the Nagios check. Having Nagios return “SNMP 73 OK” is not a particularly interesting result. We would also like some performance metrics so that RRDTool can graph them in Nagios.
So how can we achieve this? Nagios Exchange provides check_netapp.pl, a Perl script which allows you to monitor disk usage and presents the results in a more readable format. Run without arguments, the script lists the options it accepts:
/opt/nagios/libexec/check_netapp.pl
Missing arguments!
check_netapp -H <ip_address> -v variable [-w warn_range] [-c crit_range]
 [-C community] [-t timeout] [-p port-number]
 [-P snmp version] [-L seclevel] [-U secname] [-a authproto]
 [-A authpasswd] [-X privpasswd] [-o volume]
The ones we are interested in are:
-H <ip_address>
-v <variable>
-o <volume>
-w <warn_range> -c <crit_range>
-t <timeout>
For our test volume we would like to check the disk space used, so we execute the following command:
/opt/nagios/libexec/check_netapp.pl -H netapp -v DISKUSED -o /vol/test/ -w 60 -c 70 -t 60
DISKUSED CRITICAL - /vol/test/ - total: 600 Gb - used 501 Gb (84%) - free: 98 Gb|NetApp /vol/test/ Used Space=501GB;360;420;0;600
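
Everything after the "|" in that output is performance data in the form label=value;warn;crit;min;max, which is what the graphing tools consume. As a rough illustration, the Python sketch below splits the sample line apart and shows that the -w 60 and -c 70 percentages come back as absolute figures (360 and 420 out of 600). The function name is made up for the example, and it assumes a single performance metric per line, as in the output above.

```python
import re

def parse_plugin_output(line):
    """Split one line of plugin output into status text and perfdata fields."""
    status, _, perf = line.partition("|")
    # The label may contain spaces, so split on the last "=" only.
    label, _, fields = perf.strip().rpartition("=")
    value, warn, crit, minimum, maximum = fields.split(";")
    # Separate the number from its unit of measure, e.g. "501GB".
    number, unit = re.match(r"([\d.]+)(\D*)$", value).groups()
    return {
        "status": status.strip(),
        "label": label,
        "value": float(number),
        "unit": unit,
        "warn": float(warn),
        "crit": float(crit),
        "min": float(minimum),
        "max": float(maximum),
    }

LINE = ("DISKUSED CRITICAL - /vol/test/ - total: 600 Gb - used 501 Gb (84%) "
        "- free: 98 Gb|NetApp /vol/test/ Used Space=501GB;360;420;0;600")

parsed = parse_plugin_output(LINE)
# Convert the absolute thresholds back into percentages of the volume size.
warn_pct = parsed["warn"] * 100 / parsed["max"]
crit_pct = parsed["crit"] * 100 / parsed["max"]
print(parsed["label"], parsed["value"], parsed["unit"], warn_pct, crit_pct)
```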
This provides much nicer output than the standard SNMP result. To have Nagios use this command we add it to the commands.cfg file:
define command{
        command_name check-test-volume
        command_line $USER1$/check_netapp.pl -H netapp -v DISKUSED -o /vol/test/ -w 80 -c 90 -t 60
}
After this is added we add the service to our groups config file:
define service{
        use                             netapp-template-unix
        host_name                       netapp
        service_description             Check Test Volume Disk Space
        check_command                   check-test-volume
}
To activate this check, restart Nagios and wait for the monitoring results to appear on the Nagios site. If you have RRDTool and PNP4Nagios installed, they should begin graphing the results shortly after the first checks are received.

Keeping your NetApp in check with Nagios

Recently we took delivery of a new NetApp: two 3210s with roughly 50TB of storage across five disk shelves. For monitoring we use Nagios, which can check the NetApp effectively and alert us to any problems quickly. Luckily our existing NetApp is already monitored by Nagios, so getting the checks up and running is not a problem. Nagios relies on SNMP to monitor NetApp storage. Normally SNMP checking runs fine, but we have been seeing Nagios time out when the existing NetApp is heavily utilized. According to NetApp this is a known bug and we should not see it in the new setup (Data ONTAP 8).

Our first step is to retrieve all the SNMP OIDs using our snmpwalk command:

       snmpwalk -v 1  -c public netapp 1.3.6.1.4.1.789 > netapp-snmpwalk1-oid.txt
The 1.3.6.1.4.1.789 number is the top-level OID. This is our starting point for finding information such as CPU usage, disk space and other system status information. If you have created a volume on your NetApp then you should be able to find its corresponding OID. To find this we need to download the MIB zip folder from the NetApp site. This folder contains a file called traps.dat which lists all the OIDs needed. After unzipping the file and opening the traps.dat the first keyword we look for is dfFileSys which is the base OID for the filesystem.

       dfFileSys snmp.1.3.6.1.4.1.789.1.5.4.1.2.1

The base OID we are interested in is 1.5.4.1.2.1 so any volumes created will have the OID 1.5.4.1.2.2 or 1.5.4.1.2.3 and so on with the final digit incrementing as new volumes are added. Searching through our snmpwalk file we find the OID:

        SNMPv2-SMI::enterprises.789.1.5.4.1.2.5 = STRING: "/vol/test/"

This corresponds to the dfFileSys base OID, with 1.5.4.1.2.5 being the OID for our test volume. Now that we have the volume OID, we must look for the OID that reports disk space utilization. Searching through the traps.dat file we find dfPerCentKBytesCapacity, which gives the disk space used on the volume as a percentage.

        dfPerCentKBytesCapacity snmp.1.3.6.1.4.1.789.1.5.4.1.6.1
 
The test volume identifier is 1.5.4.1.2.5, and the reference number we need is its last digit, 5. The dfPerCentKBytesCapacity entry in the traps.dat file shows the base OID 1.5.4.1.6.1. The final digit is the reference number, so in that OID we replace the 1 with 5, which gives 1.5.4.1.6.5. Let’s query it and see what happens:


        snmpwalk -v 1  -c public netapp 1.3.6.1.4.1.789.1.5.4.1.6.5


result:
        SNMPv2-SMI::enterprises.789.1.5.4.1.6.5 = INTEGER: 1


Looking at FilerView I can see that 1% is in fact being used. But to be sure, let’s check the other volumes. Searching our OID list I find the other volumes at:
    SNMPv2-SMI::enterprises.789.1.5.4.1.2.3 = STRING: "/vol/test1/" 
    SNMPv2-SMI::enterprises.789.1.5.4.1.2.7 = STRING: "/vol/test2/" 

So with our reference numbers “3″ and “7″ let us see how much disk space is used:

        snmpwalk -v 1 -c public netapp 1.3.6.1.4.1.789.1.5.4.1.6.3


result:
        SNMPv2-SMI::enterprises.789.1.5.4.1.6.3 = INTEGER: 2

and:
        snmpwalk -v 1 -c public netapp 1.3.6.1.4.1.789.1.5.4.1.6.7

result:
        SNMPv2-SMI::enterprises.789.1.5.4.1.6.7 = INTEGER: 0


Looking at FilerView I can see that test1 does indeed have 2% used and test2 has 0% used. Now that we have the commands needed, let’s plug them in to Nagios.

In the commands.cfg file create the command as follows: 
    define command{ 
    command_name check-test-diskspace 
    command_line $USER1$/check_snmp -H netapp -C public -o .1.3.6.1.4.1.789.1.5.4.1.6.5 -w 80 -c 90 
    } 


Note the ‘.’ preceding the OID. Next we add it as a service:
    define service{
    use                      netapp-template
    host_name                netapp
    service_description      Check Test Volume Disk Space
    check_command            check-test-diskspace
    }

Restart Nagios and it should start checking the new service. This is only one of many services you can monitor through Nagios. For example, we monitor NFS operations, CIFS operations, CPU load, uptime, fans, global status and many more. All you need to do is find the corresponding OID in the traps.dat file, send some test queries to make sure you are hitting the right object, and then add it as a Nagios service. You could also create a script which lists all OIDs and their corresponding objects. With this, all you would need in your Nagios command is the object name (/vol/test for example) rather than the OID string. This would centralize all your OIDs, so you could just add new ones to your script as new volumes are created.
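
A rough Python sketch of that lookup idea is below. It reads the saved snmpwalk output, maps each volume name to its reference number and derives the matching dfPerCentKBytesCapacity OID. The function names are made up for the example, and the sample text is just the three volume lines found earlier in this post.

```python
import re

# Column that holds the percentage used (dfPerCentKBytesCapacity).
DF_PERCENT = "1.3.6.1.4.1.789.1.5.4.1.6"

# The dfFileSys lines found in the snmpwalk output earlier in this post.
WALK = '''SNMPv2-SMI::enterprises.789.1.5.4.1.2.3 = STRING: "/vol/test1/"
SNMPv2-SMI::enterprises.789.1.5.4.1.2.5 = STRING: "/vol/test/"
SNMPv2-SMI::enterprises.789.1.5.4.1.2.7 = STRING: "/vol/test2/"
'''

def volume_indexes(snmpwalk_text):
    """Map each volume name to the final number of its dfFileSys OID."""
    pattern = re.compile(
        r'enterprises\.789\.1\.5\.4\.1\.2\.(\d+) = STRING: "([^"]+)"'
    )
    return {name: int(index) for index, name in pattern.findall(snmpwalk_text)}

def percent_used_oid(volume, indexes):
    """Return the OID to hand to check_snmp, with the leading '.'."""
    return f".{DF_PERCENT}.{indexes[volume]}"

indexes = volume_indexes(WALK)
for volume in sorted(indexes):
    print(volume, percent_used_oid(volume, indexes))
```

In practice the text would come from the netapp-snmpwalk1-oid.txt file created earlier, and the script would need re-running whenever volumes are added or removed, since the reference numbers are only known from the walk.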