Tuesday, 24 April 2012

grep'n'dash

So you have somehow ended up with a file called '-f' containing various strings, and one of those strings is '-lookatme!', which is exactly what you need to find. The way the UNIX shell and its commands interact can both confuse and surprise users with unexpected output. Learning these neat tricks will help you understand the intricacies of your *NIX system.

So here is our absurdly named file:

> ls -la
total 32
-rw-r--r--   1 will sysadmin      80 Apr 18 11:08 -f
drwxr-xr-x   3 will sysadmin    4096 Apr 18 11:08 .
drwxr-xr-x   3 will sysadmin    4096 Apr 18 09:50 ..


And look at what's inside:

> cat ./-f
string1
string2
string3
string4
string1
string2
-lookatme!
string3
string4
-end


In this very basic example we want to find the string '-lookatme!'. Using grep we must escape the '-' character in the string we are looking for, otherwise grep reads the pattern as a set of command-line options:

> grep "\-lookatme!" ./-f
-lookatme!


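Or we can give the filename as a bare -f. The double quotes make no difference here, because the shell strips them before grep ever sees the argument. This works because the grep used here stops parsing options at the first non-option argument (the pattern), so whatever follows is taken as a filename. A grep that accepts options anywhere on the command line, such as GNU grep, would still read -f as an option, so ./-f remains the portable form:

> grep "\-lookatme!" "-f"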
-lookatme!


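You can even search an entire directory for that ridiculous string (writing ./* instead of * is the safer habit, since it keeps names such as -f from being read as options):

> grep "\-lookatme!" *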
-f:-lookatme!
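
Escaping the dash is not the only way out. As a rough sketch of the alternatives, the script below rebuilds a cut-down copy of the file in a scratch directory (the mktemp location and the variable names are my own choices for illustration) and finds the string three ways: with -e, with -- and through a ./* glob.

```shell
#!/bin/sh
# Recreate the awkward file in a scratch directory (the location is arbitrary).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' string1 string2 '-lookatme!' string3 '-end' > ./-f

# -e marks the next argument as the pattern, even if it starts with '-'.
found_e=$(grep -e '-lookatme!' ./-f)

# '--' ends option parsing: what follows is the pattern, then the file names.
found_dd=$(grep -- '-lookatme!' -f)

# ./* expands to ./-f, so the file name can never look like an option.
found_glob=$(grep -e '-lookatme!' ./*)

printf '%s\n' "$found_e" "$found_dd" "$found_glob"
```

Single quotes are used around the pattern so that an interactive shell with history expansion enabled leaves the '!' alone.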

Tuesday, 17 April 2012

Binding Solaris processes

Sometimes you may need to bind a particular process to a CPU so it gets all the resources it needs. This is done with two simple commands:

psradm -i 1

First we disable interrupt handling on processor 1 (the argument is a processor ID, not a process ID) while leaving it available to run LWPs. Since the CPU will be dedicated to the process we assign to it, we don't want it servicing network or I/O interrupts as well.

pbind -b 1 2222

Here we bind process ID 2222 to CPU 1.

To view current bindings use:

pbind -q

And finally to unbind any specific process:

pbind -u <PID>  (pbind -u 2222 for our process above)
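
The steps above can be wrapped into a pair of small shell functions. This is only a sketch: dedicate_cpu and release_cpu are hypothetical names of my own, and the functions simply refuse to run on a host that has no pbind or psradm. The psradm -n call in release_cpu is the counterpart to psradm -i and returns the processor to normal service.

```shell
#!/bin/sh
# dedicate_cpu CPU PID: hypothetical helper wrapping the commands above.
dedicate_cpu() {
    cpu=$1
    pid=$2
    # pbind and psradm ship with Solaris; bail out early anywhere else.
    if ! command -v pbind >/dev/null 2>&1 || ! command -v psradm >/dev/null 2>&1; then
        echo "pbind/psradm not found: is this a Solaris host?" >&2
        return 1
    fi
    psradm -i "$cpu" || return 1          # stop the CPU servicing interrupts
    pbind -b "$cpu" "$pid" || return 1    # bind the process to that CPU
    pbind -q "$pid"                       # show the resulting binding
}

# release_cpu CPU PID: hypothetical helper that undoes dedicate_cpu.
release_cpu() {
    pbind -u "$2"     # remove the binding from the process
    psradm -n "$1"    # let the CPU handle interrupts again
}
```

For the example in this post the calls would be dedicate_cpu 1 2222 and, later, release_cpu 1 2222. Both need to be run as root.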




Monday, 16 April 2012

Solaris 10 maximum TCP connections

We have quite a few servers whose applications rely on very fast TCP connections. Thousands of ports are opened and closed every minute, all internally. Although it is nothing like a large-scale website, it is still something to deal with. Our Solaris 10 servers came with a default of 128 for tcp_conn_req_max_q, the maximum number of established connections that can queue on a listening port while waiting for the application to accept them. The issue we saw was that when an otherwise quiet server received an increased amount of traffic, connections started failing. The server hit its maximum number of queued connections per port (128 for us) and wouldn't receive any more until a slot freed up. Unfortunately the server's CPU and memory usage hit the roof, which slowed down the OS and any running applications. So when the application on the client side told the server to close the connection, the server asked the application holding the port open whether it could close it, but under load the application was unable to respond. The server slowed to a halt and, well, production chaos!

So to alleviate this we increased the maximum to 4192, which gave us plenty of room for now:

ndd -set /dev/tcp tcp_conn_req_max_q 4192
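
Before and after a change like this it helps to look at the current value and at whether the listen queue has actually been overflowing. The function below is a sketch under a few assumptions: backlog_status is a made-up name, and it relies on the tcpListenDrop counters that Solaris netstat prints for TCP. Bear in mind that values set with ndd do not survive a reboot, so the ndd -set line also needs to go into a startup script.

```shell
#!/bin/sh
# backlog_status: hypothetical helper, prints the queue limit and drop counters.
backlog_status() {
    if ! command -v ndd >/dev/null 2>&1; then
        echo "ndd not found: is this a Solaris host?" >&2
        return 1
    fi
    # Without -set, ndd prints the current value of the parameter.
    echo "tcp_conn_req_max_q = $(ndd /dev/tcp tcp_conn_req_max_q)"
    # The listen-drop counters rise when connections are refused
    # because a listen queue was full.
    netstat -s -P tcp | grep -i 'listendrop'
}
```

If the drop counters keep climbing after the change, the queue is still filling up and the application is probably not accepting connections fast enough.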

Friday, 4 November 2011

Replacing a faulty tape drive

The scenario: you have Veritas Netbackup Enterprise running on a *nix server connected to a tape library. Unfortunately for you, media have been constantly freezing, you are unable to label new media, and your drives are asking to be cleaned every hour. An engineer from your backup support arrives and replaces the faulty drive, so what do you do next?

The first thing is to find out what drives you have and what their paths are, so before the engineer inserts the new drive, make sure you have the serial number of the faulty one. Once the new drive is in, the discovery report shows both the missing and the new devices:
 
[root@backup1:/]# tpautoconf -report_disc
======================= Missing Device (Drive) =======================
 Drive Name = HP.ULTRIUM4-SCSI.001
 Drive Path = /dev/rmt/0cbn
 Inquiry = "HP      Ultrium 4-SCSI  H21Z"
 Serial Number = AABBCCDDEE
 TLD(0) definition Drive = 1
 Hosts configured for this device:
  Host = backup1
======================= Missing Device (Drive) =======================
 Drive Name = HP.ULTRIUM4-SCSI.002
 Drive Path = /dev/rmt/7cbn
 Inquiry = "HP      Ultrium 4-SCSI  H58W"
 Serial Number = FFGGHHIIJJ
 TLD(1) definition Drive = 2
 Hosts configured for this device:
  Host = backup1
======================= New Device (Drive) =======================
 Inquiry = "HP      Ultrium 4-SCSI  H44W"
 Serial Number = KKLLMMNNOO
 Drive Path = /dev/rmt/7cbn
 Found as TLD(1), Drive = 2


Here we note that our faulty drive is drive 2 (FFGGHHIIJJ) and the new drive inserted is also drive 2 (KKLLMMNNOO). Both drives have the same path (/dev/rmt/7cbn) but Netbackup still does not recognise the new drive. So with the following command we replace the drive:

[root@backup1:/]# tpautoconf -replace_drive HP.ULTRIUM4-SCSI.002 -path /dev/rmt/7cbn
Found a matching device in global DB, HP.ULTRIUM4-SCSI.002 on host backup1

After this we restart the Netbackup daemons and test some backups. You may get an error when attempting to UP a drive, e.g. "IPC Error: Daemon may not be running". This can happen when a faulty, non-operational drive is still configured in the library, as Netbackup does not seem able to work around it. In that case it is best to delete the drive and restart Netbackup:

[root@backup1:/]# tpconfig -delete -drive ID
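
On a library with many drives the -report_disc output shown earlier gets long, and the order of the fields differs between the "Missing" and "New" sections. As a rough sketch, the awk filter below boils the report down to one line per device. The function name summarise_report is hypothetical, and the sample text is a trimmed copy of the output above. On Solaris, use nawk or /usr/xpg4/bin/awk, since the old /usr/bin/awk does not support user-defined functions.

```shell
#!/bin/sh
# summarise_report: hypothetical helper; reads `tpautoconf -report_disc`
# output on stdin and prints one "section: serial=... path=..." line per device.
summarise_report() {
    awk '
        function flush() {
            if (section != "") print section ": serial=" serial " path=" path
        }
        # A line of "=====  Title  =====" starts a new device record.
        /^=+ / {
            flush()
            section = $0
            gsub(/^=+ | =+$/, "", section)
            serial = path = ""
            next
        }
        /Serial Number =/ { serial = $NF }
        /Drive Path =/    { path = $NF }
        END { flush() }
    '
}

# Trimmed copy of the report from this post, used as sample input.
sample_report='======================= Missing Device (Drive) =======================
 Drive Name = HP.ULTRIUM4-SCSI.001
 Drive Path = /dev/rmt/0cbn
 Serial Number = AABBCCDDEE
======================= Missing Device (Drive) =======================
 Drive Name = HP.ULTRIUM4-SCSI.002
 Drive Path = /dev/rmt/7cbn
 Serial Number = FFGGHHIIJJ
======================= New Device (Drive) =======================
 Serial Number = KKLLMMNNOO
 Drive Path = /dev/rmt/7cbn'

summary=$(printf '%s\n' "$sample_report" | summarise_report)
printf '%s\n' "$summary"
```

On a live server the same filter would be fed directly, as in tpautoconf -report_disc | summarise_report, which makes it easy to spot a missing and a new device sharing one path.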

Friday, 21 October 2011

Hardware oddities - X4450 memory issue Part 2

Following on from my previous post, our support lines had few ideas about what could be happening. One avenue we had not looked into was replacing the four 4GB DIMMs, because the error could not be pinpointed to one particular DIMM: the same LED lit up on the mezzanine for any DIMM we placed in that slot. We decided to order four new DIMMs and swap them in to see what would happen. It turned out that two of the DIMMs were faulty and were causing the issue. We still don't know why the same LED lit up despite our putting several different DIMMs into the slot. This made it difficult to pinpoint the exact issue and unfortunately wasted a lot of our time.

In the end we replaced two DIMMs and returned the server to working order. Another one to add to the list of hardware oddities!


Monday, 26 September 2011

Hardware oddities - X4450 memory issue Part 1

Not so long ago we bought an extra 16GB RAM in 4GB sticks for one of our X4450s. This would enable us to increase the memory from 32GB to 48GB to handle some of our memory-hungry applications. Inserting the sticks is not a difficult job; in fact it is quite easy, but unfortunately I have a bad track record, mostly due to faulty DIMMs. In an X4450 the DIMMs are installed in pairs, and when you open up the server you are presented with 32 neat slots. Four of these are coloured blue and the rest are coloured black. Your largest paired DIMMs (4GB in our case) go in the blue slots, and after that you populate the black slots with the smaller DIMMs (2GB for us).
After I had finished up and powered on the server, an orange light appeared, indicating a fault somewhere. So I opened the server again and, with it powered on, saw an orange light right beside slot A0. Thinking that it was a faulty DIMM, I swapped it with the one from slot B0. Again the orange light at slot A0 appeared, yet the light at slot B0, which now held what I thought was the faulty DIMM, remained off. I then swapped in the other 4GB pair (slots C0/D0) and the A0 light still appeared. So now the fault was most likely the memory mezzanine, which luckily can be removed and replaced easily. After I opened a call with Oracle, a new one was sent out. After replacing the allegedly faulty mezzanine and repopulating the slots, I powered up the server and found to my dismay the same A0 light. I again swapped the DIMMs around, which unfortunately did not fix the issue. So at the moment there are a few options for the cause:
  1. Faulty DIMM(s)
  2. Faulty mezzanine connector
  3. Faulty motherboard
  4. Firmware issue
  5. OS has not cleared old error
After production hours I will take down the server and a spare X4450 and swap the mezzanine and the DIMMs between them. If the A0 light appears on the spare server I can narrow it down to the DIMMs or the mezzanine kit. Hopefully we will be able to use the spare server's mezzanine with the new DIMMs. I will follow with an update.

Wednesday, 14 September 2011

Things I never knew about DNS – EDNS

We run a few internal and external DNS servers in the company I work for. Keeping them up to date is something we must do each year to keep one step ahead of any exploits discovered. I find that with each BIND upgrade I learn something new about it. The documentation is not impressive, to say the least!

One new thing I learned about was EDNS – Extension Mechanisms for DNS. I discovered this new (to me) option when, having finished installing and configuring BIND, I enabled verbose logging to test some DNS queries. I found that some queries were a bit slow to begin with. I checked the log file and discovered plenty of these messages:
             edns-disabled: info: success resolving 'ns6.netnorth.net/AAAA' (in 'netnorth.net'?) after reducing the advertised EDNS UDP packet size to 512 octets

So what is EDNS? Well, first I have to tell you about ordinary DNS. Ordinary DNS UDP packets are limited to a maximum of 512 octets. Anything larger will be rejected or fragmented, depending on your firewall. EDNS, on the other hand, carries more information and allows for larger packets, up to 4096 octets in BIND's default configuration. The reason for this (in RFC 2671) is: “Many of DNS’s protocol limits are too small for uses which are or which are desired to become common. There is no way for implementations to advertise their capabilities.” So new versions of BIND, and whatever other DNS software you use, will support EDNS.

What happens is that when we send a request to a DNS server, by default it includes an OPT pseudo-record in the additional section, which advertises the largest UDP packet we are prepared to receive. A server that does not support EDNS yet replies with a normal packet of at most 512 octets; one that does can reply with an EDNS packet of up to 4096 octets.

The problem we saw was that our firewall only accepted DNS packets of up to 512 octets. Anything above that was discarded or fragmented. To resolve this, instead of turning off EDNS on our server, I asked our network engineers to allow DNS packets of up to 4096 octets through the network. As soon as this was implemented the messages disappeared from the log file and DNS resolution was much faster.
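
A quick way to see whether large replies are getting through is to repeat the same query with EDNS off, with a 512 octet buffer and with a 4096 octet buffer, then compare the results. The function below is only a sketch: edns_probe is a made-up name, the server argument is optional, and it uses dig's +noedns and +bufsize options. A timeout or a truncated reply at 4096 that succeeds at 512 would point at a firewall like ours.

```shell
#!/bin/sh
# edns_probe NAME [SERVER]: hypothetical helper, not part of BIND.
# Runs the same AAAA query with three EDNS settings and prints the
# status, advertised UDP size and message size lines from each reply.
edns_probe() {
    name=$1
    server=${2:+@$2}    # becomes "@SERVER", or empty if no server was given
    for opts in "+noedns" "+bufsize=512" "+bufsize=4096"; do
        echo "== dig $opts =="
        # $server is unquoted on purpose so that an empty value disappears.
        dig $server "$opts" +tries=1 +time=3 "$name" AAAA |
            grep -E 'status:|udp:|MSG SIZE'
    done
}
```

For the name from the log above the call would be edns_probe ns6.netnorth.net, optionally followed by the address of the resolver to test.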

For reference, this is how the edns-udp-size option is described:

edns-udp-size
“edns-udp-size sets the advertised EDNS UDP buffer size. Valid values are 512 to 4096 (values outside this range will be silently adjusted). The default value is 4096. The usual reason for setting edns-udp-size to a non-default value is to get UDP answers to pass through broken firewalls that block fragmented packets and/or block UDP packets that are greater than 512 bytes.”