MGE Ellipse 750 UPS on Debian Squeeze

My home server has been protected by an MGE Ellipse 750 UPS for years. I bought it for several reasons: it’s affordable, has good capacity and is Ubuntu certified.

Back then I also read rumors that Nut’s maintainer was employed by MGE. A hardware manufacturer employing a fellow open-source hacker certainly influenced my purchase decision.

MGE is no more and has been merged with EATON. But my UPS is still supported, and the release of Debian Squeeze is a good opportunity to consolidate my knowledge in the form of this tutorial.

So here is how I set up Nut on Debian Squeeze to monitor my UPS.

First things first, we have to install the main package and its USB driver:

$ aptitude install nut nut-usb

Now let’s configure Nut and run it:

$ sed -i 's/MODE=none/MODE=standalone/g' /etc/nut/nut.conf
$ echo '
[MGE-Ellipse750]
driver = usbhid-ups
port = auto
desc = "MGE UPS Systems"
' >> /etc/nut/ups.conf
$ sed -i 's/# LISTEN 127\.0\.0\.1 3493/LISTEN 127\.0\.0\.1/g' /etc/nut/upsd.conf
$ echo '
[kevin]
password = badpassword
upsmon master
' >> /etc/nut/upsd.users
$ sed -i 's/# NOTIFYCMD \/usr\/local\/ups\/bin\/notifyme/NOTIFYCMD \/sbin\/upssched/g' /etc/nut/upsmon.conf
$ echo '
MONITOR MGE-Ellipse750@localhost 1 kevin badpassword master
NOTIFYFLAG ONBATT SYSLOG+WALL+EXEC
NOTIFYFLAG ONLINE SYSLOG+WALL+EXEC
' >> /etc/nut/upsmon.conf
$ sed -i 's/CMDSCRIPT \/upssched-cmd/CMDSCRIPT \/etc\/nut\/upssched-cmd/g' /etc/nut/upssched.conf
$ sed -i 's/# PIPEFN \/var\/run\/nut\/upssched\/upssched.pipe/PIPEFN \/var\/run\/nut\/upssched.pipe/g' /etc/nut/upssched.conf
$ sed -i 's/# LOCKFN \/var\/run\/nut\/upssched\/upssched.lock/LOCKFN \/var\/run\/nut\/upssched.lock/g' /etc/nut/upssched.conf
$ echo '
AT ONBATT * START-TIMER onbatt 30
AT ONLINE * CANCEL-TIMER onbatt
' >> /etc/nut/upssched.conf
$ printf '#!/bin/sh\nexit 0\n' > /etc/nut/upssched-cmd
$ chmod +x /etc/nut/upssched-cmd
$ /etc/init.d/nut restart

As you can see, there is a lot to configure before Nut can do what it was designed for. But after all of these commands, you should have a working UPS.
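The upssched-cmd script created above is only a placeholder that does nothing. As far as I understand upssched, it runs CMDSCRIPT with the name of the expired timer as its first argument, so a slightly more useful version could look like the sketch below. The handle_timer function name and the messages are my own inventions for illustration.

```shell
#!/bin/sh
# Hypothetical body for /etc/nut/upssched-cmd.
# Assumption: upssched passes the name of the expired timer as $1.
handle_timer() {
    case "$1" in
        onbatt)
            # Timer started by "AT ONBATT * START-TIMER onbatt 30".
            echo "UPS on battery for 30 seconds"
            ;;
        *)
            echo "unknown timer: $1"
            return 1
            ;;
    esac
}

# In the real script the last line would be:
#   handle_timer "$1"
```

A real script would probably do something more useful in the onbatt branch, such as calling upsmon -c fsd to force a shutdown. That choice is mine and not part of the setup above.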

You can now test that your system works by using the command below, which lists the statistics of a given UPS:

$ upsc MGE-Ellipse750@localhost
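upsc prints one "name: value" pair per line, which makes it easy to script. Here is a minimal sketch that extracts a single variable and checks a status flag. The function names ups_var and ups_has_flag are hypothetical, and I’m assuming the usual Nut status flags (OL for on line, OB for on battery).

```shell
#!/bin/sh
# Print the value of one variable ($1) from `upsc` output read on stdin.
ups_var() {
    awk -v key="$1" 'index($0, key ": ") == 1 {
        print substr($0, length(key) + 3)
        exit
    }'
}

# Succeed if the space-separated status string $1 contains the flag $2.
ups_has_flag() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example (not run here):
#   status=$(upsc MGE-Ellipse750@localhost | ups_var ups.status)
#   ups_has_flag "$status" OB && echo "running on battery"
```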

But in some rare cases, your UPS will not be recognized and, like me, you’ll get the following messages in your /var/log/syslog:

May  5 16:12:36 paris-server upsmon[10773]: Poll UPS [MGE-Ellipse750@127.0.0.1] failed - Driver not connected

In this case, you should run Nut’s driver in debug mode:

$ /lib/nut/usbhid-ups -DDD -a MGE-Ellipse750
Network UPS Tools - Generic HID driver 0.34 (2.4.3)
USB communication driver 0.31
   0.000000     debug level is '3'
   0.013911     upsdrv_initups...
   0.189541     Checking device (0463/FFFF) (005/003)
   0.189705     - VendorID: 0463
   0.189741     - ProductID: ffff
   0.189767     - Manufacturer: unknown
   0.189794     - Product: unknown
   0.189819     - Serial Number: unknown
   0.189842     - Bus: 005
   0.189862     Trying to match device
   0.189906     Device matches
   0.189954     failed to claim USB device: could not claim interface 0: Operation not permitted
   0.189995     failed to detach kernel driver from USB device: could not detach kernel driver from interface 0: Operation not permitted
   0.190033     failed to claim USB device: could not claim interface 0: Operation not permitted
   0.190070     failed to detach kernel driver from USB device: could not detach kernel driver from interface 0: Operation not permitted
   0.190108     failed to claim USB device: could not claim interface 0: Operation not permitted
   0.190145     failed to detach kernel driver from USB device: could not detach kernel driver from interface 0: Operation not permitted
   0.190181     failed to claim USB device: could not claim interface 0: Operation not permitted
   0.190217     failed to detach kernel driver from USB device: could not detach kernel driver from interface 0: Operation not permitted
   0.190252     Can't claim USB device [0463:ffff]: could not detach kernel driver from interface 0: Operation not permitted

As you can see in the messages above, Nut finds my UPS but is not allowed to claim it. Fortunately, forcing Nut to run as the root user lets it access my UPS:

$ /lib/nut/usbhid-ups -DDD -u root -a MGE-Ellipse750
Network UPS Tools - Generic HID driver 0.34 (2.4.3)
USB communication driver 0.31
   0.000000     debug level is '3'
   0.001678     upsdrv_initups...
   0.172877     Checking device (0463/FFFF) (005/003)
   1.112408     - VendorID: 0463
   1.112464     - ProductID: ffff
   1.112489     - Manufacturer: MGE OPS SYSTEMS
   1.112516     - Product: ELLIPSE
   1.112542     - Serial Number: BDCJ3800Q
   1.112569     - Bus: 005
   1.112595     Trying to match device
   1.112647     Device matches
   1.112726     failed to claim USB device: could not claim interface 0: Device or resource busy
   1.113239     detached kernel driver from USB device...
   1.251394     HID descriptor, method 1: (9 bytes) => 09 21 00 01 21 01 22 01 03
   1.251460     HID descriptor, method 2: (9 bytes) => 09 21 00 01 21 01 22 01 03
   1.251491     HID descriptor length 769
   1.351379     Report Descriptor size = 769
   1.351456     Report Descriptor: (769 bytes) => 05 84 09 04 a1 01 09 24 a1 00 09 02 a1 00
   1.351509      55 00 65 00 85 01 75 01 95 05 15 00 25 01 05 85 09 d0 09 44 09 45 09 42 0b
(...)

So the issue is now clear and is related to permissions. I was able to fix this issue by changing the permissions on the USB device corresponding to my UPS:

$ chmod 0666 /dev/bus/usb/005/003

Another working way to fix this is to change the group of the device to nut:

$ chown :nut /dev/bus/usb/005/003

BTW, to get the bus number (005 here) and device number (003 in my case) of your UPS, run lsusb:

$ lsusb
Bus 005 Device 003: ID 0463:ffff MGE UPS Systems UPS
Bus 005 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 004 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 003 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
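Since the device number can change when the UPS is replugged, the path can also be derived from the lsusb output instead of being typed by hand. A small sketch, where usb_dev_path is a hypothetical helper name:

```shell
#!/bin/sh
# Read `lsusb` output on stdin and print the /dev/bus/usb path of the
# first device whose "vendor:product" ID equals $1.
usb_dev_path() {
    awk -v id="$1" '$6 == id {
        sub(":", "", $4)                 # "003:" -> "003"
        print "/dev/bus/usb/" $2 "/" $4
        exit
    }'
}

# Example (not run here):
#   chown :nut "$(lsusb | usb_dev_path 0463:ffff)"
```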

Of course this fix is only temporary, as you’ll need to perform the change above after every reboot. This is far from practical. In fact, as described in this Fedora 10 bug report, and also in some Debian bug reports, this issue is directly tied to conflicting Udev rules.

Based on clues from these bug reports, you can fix Udev using different strategies. As I can’t decide which one is the cleanest, I just did something that is quite brutal, but works. It consists of replacing, in /lib/udev/rules.d/91-permissions.rules, the line setting rights for usbfs-like devices:

--- /lib/udev/rules.d/91-permissions.rules-orig 2011-05-05 18:49:08.015538434 +0200
+++ /lib/udev/rules.d/91-permissions.rules      2011-05-05 18:49:16.663537978 +0200
@@ -33,7 +33,7 @@

 # usbfs-like devices
 SUBSYSTEM=="usb", ENV{DEVTYPE}=="usb_device", \
-                               MODE="0664"
+                               MODE="0666"

 # serial devices
 SUBSYSTEM=="tty",
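A less brutal strategy might be a dedicated rule that only touches the UPS and hands it to the nut group, instead of opening every USB device to everyone. I have not tested this on Squeeze, so take it as a sketch: the function name nut_udev_rule and the file name 92-nut-ups.rules are hypothetical, and I’m assuming that a rule file sorting after 91-permissions.rules gets the last word on MODE and GROUP.

```shell
#!/bin/sh
# Print a udev rule giving the nut group read/write access to one USB device.
# $1 = vendor ID, $2 = product ID, as shown by lsusb.
nut_udev_rule() {
    printf 'SUBSYSTEM=="usb", ATTR{idVendor}=="%s", ATTR{idProduct}=="%s", MODE="0664", GROUP="nut"\n' "$1" "$2"
}

# Example (not run here; the file name is an assumption):
#   nut_udev_rule 0463 ffff > /etc/udev/rules.d/92-nut-ups.rules
```

The UPS has to be replugged (or the machine rebooted) before a new rule applies to it.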

Now all you have to do is unplug the power cord and wait until your machine gracefully shuts down as soon as the batteries are low! :)

Heroic journey to RAID-5 data recovery

Last week a power grid failure broke my server’s RAID array. I have no UPS (as I’m a skinflint) and no automatic email alerts (because I’m too lazy to set them up). As a result, for 5 days, my 3-disk RAID-5 array was relying on only 2 disks until I noticed the issue…

Using a combination of the following commands, I soon became aware of the gravity of the situation:

cat /proc/mdstat
mdadm --examine /dev/sda1
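Noticing the problem 5 days earlier would only have taken a tiny check run from cron. Here is a rough sketch that reads /proc/mdstat and reports arrays with a missing member, relying on the "[U_U]" style member map where "_" marks an absent disk. The function name degraded_arrays is hypothetical.

```shell
#!/bin/sh
# Read /proc/mdstat-style text on stdin and print the name of every array
# whose member map (e.g. "[U_U]") contains "_", i.e. a missing member.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }
        name != "" && match($0, /\[[U_]+\]/) {
            if (index(substr($0, RSTART, RLENGTH), "_"))
                print name
            name = ""
        }
    '
}

# Example (not run here):
#   degraded_arrays < /proc/mdstat
```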

My /dev/sda1 disk had been kicked out of the array, so I did the right thing and started rebuilding the array:

mdadm /dev/md0 -a /dev/sda1

Then, in an unlucky combination of cosmic ray bombardment, spooky action at a distance and astrological misalignment, half-way through the rebuilding process (which can take up to 5 hours), another disk failed! It was late, I was tired and utterly worried about losing 1.5 TB of precious data. In such bad shape, I was afraid of making the situation worse. So I decided to shut down the server and sleep on the problem.

The next day I booted my server, only to find it (surprise!) stuck in the middle of the boot process, with the famous message:

hit control-D to continue or give root password to fix manually

This is “normal”, as my server tried to mount the ext3 filesystem from the /dev/md0 device that had just been assembled by mdadm. Of course md0, although assembled and visible to the system, was not running, because only one disk out of three was in a clean state.

I’ll skip the epic substory in which I wasted days searching for a working keyboard, but I’ll let you imagine how such adventures made my week…

Eventually, I was able to analyze the situation in detail. My first reflex? Check that the disks are not physically dead:

fdisk -l /dev/sda
fdisk -l /dev/sdb
fdisk -l /dev/sdc

“Linux raid partitions” (type code “fd“) are still there. Good. I assumed from this that the disks were not physically damaged. Maybe I should have looked at S.M.A.R.T. data and statistics (via smartmontools). But remember, I’m lazy (and a bit crazy).
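For the less lazy, the S.M.A.R.T. check I skipped could be as short as the sketch below. It assumes the smartctl -H report for ATA disks contains a line of the form "SMART overall-health self-assessment test result: PASSED". The function name smart_health is hypothetical.

```shell
#!/bin/sh
# Read `smartctl -H` output on stdin and print the overall health verdict
# (e.g. "PASSED"). Prints nothing if the verdict line is absent.
smart_health() {
    awk -F': ' '/overall-health self-assessment test result/ {
        print $2
        exit
    }'
}

# Example (not run here, needs root):
#   for d in /dev/sda /dev/sdb /dev/sdc; do
#       echo "$d: $(smartctl -H "$d" | smart_health)"
#   done
```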

The next step was to get information about the RAID array itself using:

mdadm --detail /dev/md0

which printed the status table below (probably inaccurate, as I reconstructed it afterwards):

Number   Major   Minor   RaidDevice State
   0       0        0        0      removed
   1       0        0        1      faulty removed
   2       8       33        2      active sync   /dev/sdc1
   3       8       17        3      spare

What does this table tell us?

  • The array is up, but not running. One of its devices (sdc1) is clean and active, but that’s not enough to get a working RAID-5.
  • My first attempt to rebuild the array led to an unexpected result: it added sda1 as a spare device (in slot #3).
  • It confirms that sdb1 unexpectedly failed and is now in a bad state (“faulty removed“).

Then I stopped the array and fearlessly tried to (re)assemble it using 3 different methods:

mdadm -S /dev/md0
mdadm -A /dev/md0
mdadm --assemble /dev/md0 --verbose /dev/sd[abc]1
mdadm --assemble --force --scan /dev/md0 --verbose

It always failed with messages like:

mdadm: failed to RUN_ARRAY /dev/md0: Input/output error
mdadm: /dev/md0 assembled from 1 drives and 1 spare - not enough to start the array.

So I examined each drive from mdadm‘s point of view:

mdadm -E /dev/sda1
mdadm -E /dev/sdb1
mdadm -E /dev/sdc1
mdadm -E /dev/sd[abc]1 | grep Event

The last command compares the “Events” attribute of all devices. It outputs something like:

Events : 0.53120
Events : 0.53108
Events : 0.53120

which indicates that sda1 and sdc1 are in sync (they share the same number) and that sdb1 is “late” (lower number).
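With more than three disks, eyeballing the counters gets tedious, so here is a sketch that picks out the lagging members automatically. It expects "device events" pairs on stdin. The function name stale_members is hypothetical, and I’m assuming the counter is either a plain integer or a "high.low" pair as in the output above.

```shell
#!/bin/sh
# Read "device events" pairs on stdin (one per line) and print the devices
# whose event counter is lower than the highest one seen.
# Assumption: counters look like "0.53120" (high.low) or a plain integer.
stale_members() {
    awk '
        function rank(ev,    p, n) {
            n = split(ev, p, ".")
            return (n == 2) ? p[1] * 4294967296 + p[2] : p[1] + 0
        }
        { dev[NR] = $1; r[NR] = rank($2); if (r[NR] > max) max = r[NR] }
        END { for (i = 1; i <= NR; i++) if (r[i] < max) print dev[i] }
    '
}

# Example (not run here):
#   for d in /dev/sd[abc]1; do
#       echo "$d $(mdadm -E "$d" | awk '/Events/ { print $3 }')"
#   done | stale_members
```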

That’s when I got the idea of recreating the RAID array without sdb1, relying only on sda1 and sdc1, by using the “magic” (hence dangerous) --assume-clean option. With it, mdadm writes new superblocks but skips the initial resync, so the data already on the disks is left “as is”. Here is the command:

mdadm --create /dev/md0 --assume-clean --level=5 --verbose --raid-devices=3 /dev/sda1 missing /dev/sdc1

And it worked ! :D

I checked the filesystem on md0 and mounted it:

fsck.ext3 -v /dev/md0
mount /dev/md0

I updated my mdadm configuration before rebooting my server:

mdadm --detail --scan >> /etc/mdadm/mdadm.conf
vi /etc/mdadm/mdadm.conf
reboot

But history repeats itself, and again, the system hung during boot. Except this time I knew what was happening: the boot process detected the remaining sdb1 device as part of the old array (the one before the regeneration I did above) and tried to run it. Remembering a post I wrote last year, I zeroed the superblock of sdb1:

mdadm -S /dev/md0
mdadm --zero-superblock /dev/sdb1

A reboot proved me right: md0 was automagically assembled and mounted, in a degraded state:

localhost:~# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdb1[3] sda1[0] sdc1[2]
      1465143808 blocks level 5, 64k chunk, algorithm 2 [3/2] [U_U]

unused devices: <none>

I just had to re-add sdb1 to fill the available slot and update the mdadm configuration to get my array back to its initial state:

mdadm --manage /dev/md0 --add /dev/sdb1
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
vi /etc/mdadm/mdadm.conf
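Because >> appends to the file, the old ARRAY line stays in mdadm.conf next to the new one, which is presumably why a manual edit is needed afterwards. The sketch below does both steps at once by dropping the existing ARRAY lines before adding the fresh ones. The function name refresh_mdadm_conf and the temporary file name are hypothetical.

```shell
#!/bin/sh
# Print a new mdadm.conf on stdout: every line of the old file ($1) except
# its ARRAY lines, followed by the fresh ARRAY lines read from stdin.
refresh_mdadm_conf() {
    awk '!/^ARRAY/' "$1"
    cat
}

# Example (not run here; review the result before replacing the real file):
#   mdadm --detail --scan | refresh_mdadm_conf /etc/mdadm/mdadm.conf > /tmp/mdadm.conf.new
```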