Friday, 9 September 2011

Keeping your NetApp in check with Nagios

Recently we took delivery of a new NetApp: two 3210s with roughly 50TB of storage across five disk shelves. For monitoring we use Nagios, which can check the NetApp effectively and alert us to problems quickly. Luckily our existing NetApp is already monitored by Nagios, so getting the checks up and running is not a problem. Nagios relies on SNMP to monitor NetApp storage. Normally SNMP checking runs fine, but we have been seeing Nagios time out when the NetApp is heavily utilized. According to NetApp this is a known bug and we should not see it in the new setup (Data ONTAP 8).

Our first step is to retrieve all the SNMP OIDs using our snmpwalk command:

       snmpwalk -v 1  -c public netapp 1.3.6.1.4.1.789 > netapp-snmpwalk1-oid.txt

The number 1.3.6.1.4.1.789 is NetApp's top-level OID. This is our starting point for finding information such as CPU usage, disk space and other system status information. If you have created a volume on your NetApp, you should be able to find its corresponding OID. To find it we need to download the MIB zip archive from the NetApp site. The archive contains a file called traps.dat which lists all the OIDs needed. After unzipping the archive and opening traps.dat, the first keyword we look for is dfFileSys, which is the base OID for the filesystem.

       dfFileSys snmp.1.3.6.1.4.1.789.1.5.4.1.2.1

The part we are interested in is 1.5.4.1.2.1 (relative to 1.3.6.1.4.1.789), so any volumes created will have the OID 1.5.4.1.2.2 or 1.5.4.1.2.3 and so on, with the final digit incrementing as new volumes are added. Searching through our snmpwalk file we find the OID:

        SNMPv2-SMI::enterprises.789.1.5.4.1.2.5 = STRING: "/vol/test/"

This corresponds to the dfFileSys base OID, with 1.5.4.1.2.5 being the OID for our test volume. Now that we have the volume OID, we must look for the OID that reports utilized disk space. Searching through the traps.dat file we find dfPerCentKBytesCapacity, which gives the disk space utilized on the volume as a percentage.

        dfPerCentKBytesCapacity snmp.1.3.6.1.4.1.789.1.5.4.1.6.1
 
The test volume's OID is 1.5.4.1.2.5, and the reference number we need is its last digit, 5. The dfPerCentKBytesCapacity entry in traps.dat shows the base OID 1.5.4.1.6.1. The final digit is the reference number, so we replace the 1 with 5, which gives 1.5.4.1.6.5. Let’s query it and see what happens:


        snmpwalk -v 1  -c public netapp 1.3.6.1.4.1.789.1.5.4.1.6.5


result:
        SNMPv2-SMI::enterprises.789.1.5.4.1.6.5 = INTEGER: 1
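The index substitution described above can be scripted against the saved walk output. The following is only a sketch: the function name percent_oid_for_volume is made up for illustration, and it assumes the dump has the same "OID = STRING: "/vol/name/"" layout shown in this post.

```shell
#!/bin/sh
# percent_oid_for_volume DUMP_FILE VOLUME_PATH   (hypothetical helper)
# Finds the dfFileSys row (...789.1.5.4.1.2.N) whose value is VOLUME_PATH and
# prints the matching dfPerCentKBytesCapacity OID (...789.1.5.4.1.6.N).
percent_oid_for_volume() {
    awk -v vol="$2" '
        $1 ~ /789\.1\.5\.4\.1\.2\.[0-9]+$/ {
            # Strip everything around the quoted volume path.
            value = $0
            sub(/^.*STRING: "/, "", value)
            sub(/"[ \t]*$/, "", value)
            if (value == vol) {
                # The index is the last dot-separated component of the OID.
                n = split($1, parts, ".")
                print "1.3.6.1.4.1.789.1.5.4.1.6." parts[n]
                exit
            }
        }
    ' "$1"
}
```

For example, percent_oid_for_volume netapp-snmpwalk1-oid.txt /vol/test/ would print the OID queried above, provided the dump contains that volume.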


Looking at FilerView I can see that 1% is in fact being used. But to be sure, let's check the other volumes. Searching our OID list I find the other volumes at:
    SNMPv2-SMI::enterprises.789.1.5.4.1.2.3 = STRING: "/vol/test1/" 
    SNMPv2-SMI::enterprises.789.1.5.4.1.2.7 = STRING: "/vol/test2/" 

So with our reference numbers "3" and "7", let us see how much disk space is used:

        snmpwalk -v 1 -c public swe-filer1 1.3.6.1.4.1.789.1.5.4.1.6.3


result:
        SNMPv2-SMI::enterprises.789.1.5.4.1.6.3 = INTEGER: 2

and:
        snmpwalk -v 1 -c public swe-filer1 1.3.6.1.4.1.789.1.5.4.1.6.7

result:
        SNMPv2-SMI::enterprises.789.1.5.4.1.6.7 = INTEGER: 0


Looking at FilerView I can see that indeed test1 has 2% used and test2 has 0% used. Now that we have the commands needed, let’s plug them into Nagios.
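The same cross-check can be done for every volume at once from the saved walk output, since that file holds both the dfFileSys rows and the dfPerCentKBytesCapacity rows. This is a rough sketch, not a tested tool: volume_usage is a hypothetical name, and it assumes the two kinds of rows share the same trailing index, as in the examples above.

```shell
#!/bin/sh
# volume_usage DUMP_FILE   (hypothetical helper)
# Prints "<volume> <percent-used>" for every dfFileSys row that has a
# dfPerCentKBytesCapacity row with the same trailing index.
volume_usage() {
    awk '
        # dfFileSys rows: remember the volume path for each index.
        $1 ~ /789\.1\.5\.4\.1\.2\.[0-9]+$/ {
            n = split($1, parts, ".")
            value = $0
            sub(/^.*STRING: "/, "", value)
            sub(/"[ \t]*$/, "", value)
            name[parts[n]] = value
        }
        # dfPerCentKBytesCapacity rows: remember the percentage for each index.
        $1 ~ /789\.1\.5\.4\.1\.6\.[0-9]+$/ {
            n = split($1, parts, ".")
            pct[parts[n]] = $NF
        }
        END {
            for (i in name)
                if (i in pct) print name[i], pct[i]
        }
    ' "$1" | LC_ALL=C sort
}
```

Running volume_usage netapp-snmpwalk1-oid.txt gives one line per volume, which is easy to compare against FilerView.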

In the commands.cfg file create the command as follows: 
    define command{
    command_name check_test_diskspace
    command_line $USER1$/check_snmp -H netapp -C public -o .1.3.6.1.4.1.789.1.5.4.1.6.5 -w 80 -c 90
    }


Note the ‘.’ preceding the OID. Next we add it as a service:
    define service{
    use                      netapp-template
    host_name                netapp
    service_description      Check Test Volume Disk Space
    check_command            check_test_diskspace
    }
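
For reference, the -w 80 -c 90 arguments in the command definition set the warning and critical thresholds. The sketch below shows roughly how a percentage maps to a Nagios state. It assumes the standard plugin threshold convention, where a bare number N alerts on values above N. The function nagios_state is invented for illustration and is not part of check_snmp.

```shell
#!/bin/sh
# nagios_state VALUE WARN CRIT   (hypothetical helper)
# Prints the state name and returns the matching plugin exit code
# (0 = OK, 1 = WARNING, 2 = CRITICAL).
nagios_state() {
    if [ "$1" -gt "$3" ]; then
        echo CRITICAL
        return 2
    elif [ "$1" -gt "$2" ]; then
        echo WARNING
        return 1
    fi
    echo OK
    return 0
}
```

With the thresholds used here, a volume at 85% would be reported as WARNING and one at 95% as CRITICAL.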

Restart Nagios and it should start checking the new service. This is only one of many services you can monitor through Nagios. For example, we monitor NFS operations, CIFS operations, CPU load, uptime, fans, global status and many more. All you need to do is find the corresponding OID in the traps.dat file, send some test queries to make sure you are hitting the right object, and then add it as a Nagios service. You could also create a script which lists all OIDs and their corresponding objects. With this, all you would need in your Nagios command is the object name (/vol/test for example) rather than the OID string. This would centralize all your OIDs, so you could just add new ones to your script as new volumes are created.
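
One way such a lookup script might look is sketched below. Everything in it is an assumption made for illustration: the VOLUME_TABLE contents (taken from the test volumes in this post), the function names oid_for_volume and check_volume, the CHECK_SNMP variable, and the idea that check_snmp is on the PATH unless CHECK_SNMP points elsewhere.

```shell
#!/bin/sh
# Hypothetical table of "volume-path index" pairs; add a line per new volume.
VOLUME_TABLE='
/vol/test/  5
/vol/test1/ 3
/vol/test2/ 7
'

# oid_for_volume VOLUME_PATH
# Prints the dfPerCentKBytesCapacity OID for a known volume; fails otherwise.
oid_for_volume() {
    index=$(printf '%s\n' "$VOLUME_TABLE" |
        awk -v vol="$1" '$1 == vol { print $2; exit }')
    [ -n "$index" ] || return 1
    printf '.1.3.6.1.4.1.789.1.5.4.1.6.%s\n' "$index"
}

# check_volume HOST VOLUME_PATH
# Looks up the OID and hands over to check_snmp with the thresholds used above.
check_volume() {
    oid=$(oid_for_volume "$2") || {
        echo "UNKNOWN: no OID known for $2"
        return 3    # 3 is the Nagios exit code for UNKNOWN
    }
    "${CHECK_SNMP:-check_snmp}" -H "$1" -C public -o "$oid" -w 80 -c 90
}
```

Saved as a standalone plugin ending in check_volume "$@", the Nagios command line would then carry the host and the volume path instead of the raw OID.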
 

1 comment:

  1. Great guide. Thanks.
    I have a slight problem. I'm testing the check disk space and I don't get the reply with INTEGER.

    I've dumped the snmp from my nas to the text file. I've searched through it and found a volume I want to test but the thing is I see multiple entries for the volume?


    cat /root/netapp-snmpwalk1-oid.txt | grep 789.1.5.4.1.2
    iso.3.6.1.4.1.789.1.5.4.1.2.5 = STRING: "/vol/volofimatica/"
    iso.3.6.1.4.1.789.1.5.4.1.2.6 = STRING: "/vol/volofimatica/.."

    When I combine the OIDs as you mention above I don't get the INTEGER reply, which is what I guess I need for Nagios to display correctly. Any ideas?

    snmpwalk -v 1 -c public nas-sosaria 1.3.6.1.4.1.789.1.5.4.1.10.6
    iso.3.6.1.4.1.789.1.5.4.1.10.6 = STRING: "/vol/volofimatica/.."

    Thanks,
    Rob
