Need help with advfs corruption and lsm
Didier Godefroy
2009-05-13 20:26:08 UTC
Hi all,

I posted earlier about my lsm issue and now it's turning into a nightmare
with an advfs problem on top of it.

What happened in the first place was a crash when I deleted a big folder
that contained a corrupt file. I had accidentally found that corrupt file; it
wouldn't show up with ls and would cause an error when running du on the
contents of the directory.

I deleted that big folder (over 600 MB) and when that corrupted file was hit,
it caused a system crash.

The system wouldn't come back because fsck would cause the crash again every
time it booted, causing a loop.

I removed a drive from the mirrored set containing the volume with that
corrupt file, which caused the lsm volume to be disabled and the mirror
relocation to the hot spare but allowed the boot to succeed.
Then the problem was that the volume wouldn't resync properly and another
crash happened before syncing was done.

I then had a mirror with a stale plex and the volume would stay disabled and
I couldn't clear that status.
After plugging the removed drive back in, the system was seeing it, but not
lsm, so it needed a reboot. Before rebooting again, I used scu to run a
complete surface scan of all drives and they were all fine.

So I rebooted again with all the drives back in normally, the system booted
fine, the disabled volume was still in that state and the previously removed
drive was back in lsm normally, although it wasn't put back into place
because everything that used to be in it was relocated to the hot spare.

However I was able to bring back the biggest plex from that drive by adding
it back to the mirror set, having a mirror set with 3 plexes.

I was then able to change the state of the plexes so the volume could get a
needsync and I triggered the resync and it was syncing fine.

And that's where the advfs problem showed up again: when I tried mounting
that volume again, the system crashed and I had to do the drive removal
again to break out of the boot loop.

Once rebooted, I had the volume syncing going on again, except on that
failed volume with the advfs issue and the removed plex from that pulled out
drive.

That mirror set on the 2 drives has 3 lsm volumes, 2 small ones and the
large one. The two small ones resync just fine, only the large one won't
resync by itself.

The big problem now is with the advfs domain that's supposed to be on that
large volume. While the resyncing was still going on, I tried doing a
showfdmn on it and there was another crash.

Basically every time I try to access that advfs domain in any way, whether by
trying to mount it, looking up some info on it, or when the lsm syncing is
finished on it and then it tries to mount, I get a crash.

I don't think this is an advfs domain panic; I think that wouldn't bring the
system down entirely and it would keep everything else running.
The system just crashes every time the advfs data is "touched" in any way.

Now how could I get that fixed and back under control if it crashes without
provocation???

That system has now been down for more than half a day and it's getting
critical.

Is it possible to clean up and repair advfs corruption without causing the
system crashes?????


Help please,
--
Didier Godefroy
mailto:***@ulysium.net
Didier Godefroy
2009-06-21 21:37:58 UTC
Hello all,

Sorry for the tardiness of this summary but I've been busy since this issue
happened and I've had to catch up on a lot of work.

My message post is at the bottom and sums up what happened and the things
I've tried to recover from it before posting.

What made things difficult is that not only was there advfs corruption
causing systematic system crashes, but that advfs volume was also on top of
an lsm mirrored volume.

It wasn't an ordinary advfs panic: the whole system would crash every time
that advfs volume with the corruption was "touched" in any way, such as
trying to mount it or trying to look up info on the unmounted volume with
showfsets or other utilities. verify couldn't run because the volume
couldn't mount, and since verify would try to run at reboot, the system was
in a boot-crash loop that I had to break to attempt some kind of fix.

To break that boot-crash loop, I just popped out one of the lsm mirror's
drives, breaking the mirror, saving a copy of the data at the same time in
case I needed to go back to it. Breaking that mirror prevented the attempts
at verifying that advfs volume during the boot, which allowed the system to
finish booting, back into a "somewhat" usable state, at least for a while,
so I could try something to fix the issue.

I caused more crashes later while attempting fixes, such as trying to mount
that advfs volume or trying to look up something about it.

Finally the only utility that worked to get back in control and get some
fixing done on that advfs volume was fixfdmn, which didn't cause any system
crashes when working on the corrupt domain, and it didn't try to mount it.

After running fixfdmn, I ran it again a couple more times and found a bit
more corruption and finally got it to "appear" clean and mount without
causing any more crashes.
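
For anyone scripting this, the "run it again until it stops finding things"
approach above can be sketched roughly as below. This is only an
illustration, not what I actually ran: run_pass is a hypothetical callable
that would wrap one fixfdmn pass and return how many problems it repaired
(how you get that number from the tool's output is left open), and the
limit of 5 passes is an arbitrary assumption.

```python
from typing import Callable


def repair_until_clean(run_pass: Callable[[], int], max_passes: int = 5) -> int:
    """Call run_pass() until it reports zero repairs; return the passes used.

    run_pass is a hypothetical callable returning the number of problems
    repaired in that pass. Nothing here talks to a real filesystem.
    """
    for passes in range(1, max_passes + 1):
        if run_pass() == 0:
            return passes
    # Give up rather than loop forever on a domain that never comes up clean.
    raise RuntimeError(f"still finding corruption after {max_passes} passes")


# Simulated passes: 7 repairs, then 2, then a clean pass.
fake_results = iter([7, 2, 0])
passes_needed = repair_until_clean(lambda: next(fake_results))
```

As the rest of this summary shows, a clean pass only means the tool found
nothing more, not that the domain is actually free of corruption.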

However, a couple of days later, although I ran verify a few times on the
mounted domain, there must've been some leftover advfs corruption that
neither fixfdmn nor verify could find and fix, and I had more system
crashes.

So I fixed whatever could be fixed and copied the whole contents of that
domain to a backup before wiping out and recreating that domain from
scratch. Then I copied everything back from the backup into that fresh new
domain and I haven't had more crashes (so far).
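
The order of operations matters here, so here is a rough sketch of it as a
plan builder that only produces the command strings and runs nothing. The
domain, fileset, volume and file names are made up for illustration, and
using vdump/vrestore for the copy is an assumption (any copy method would
do). Please check the exact options against the man pages on your own
system before running anything like this.

```python
def rebuild_plan(domain, fileset, volume, mount_point, dump_file):
    """Return the ordered commands for backup, recreate and restore.

    Nothing is executed; the caller reviews and runs the steps by hand.
    All names passed in are hypothetical placeholders.
    """
    target = f"{domain}#{fileset}"
    return [
        f"fixfdmn {domain}",                             # repair without mounting
        f"mount -t advfs {target} {mount_point}",        # mount only after repair
        f"vdump -0 -f {dump_file} {mount_point}",        # full backup (assumed tool)
        f"umount {mount_point}",
        f"rmfdmn {domain}",                              # wipe the suspect domain
        f"mkfdmn {volume} {domain}",                     # recreate on the same volume
        f"mkfset {domain} {fileset}",
        f"mount -t advfs {target} {mount_point}",
        f"vrestore -x -f {dump_file} -D {mount_point}",  # copy everything back
    ]


plan = rebuild_plan(
    "data_dmn", "data", "/dev/vol/rootdg/vol03", "/data", "/backup/data.vdump"
)
for step in plan:
    print(step)
```

The point of the ordering is that the backup has to be finished and
checked before the old domain is removed, since rmfdmn is the step you
can't take back.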

Perhaps this might help someone having this rare issue happening, saving
time by avoiding the things I did that wouldn't work and going straight to
what worked...



Thanks to all for all the suggestions (too many to list here),
What happened in the first place was a crash when I deleted a big folder that
contained a corrupt file. I had accidentally found that corrupt file; it
wouldn't show up with ls and would cause an error when doing du on the
contents of the directory.
I deleted that big folder (over 600 MB) and when that corrupted file was hit,
it caused a system crash.
The system wouldn't come back because fsck would cause the crash again every
time it booted, causing a loop.
I removed a drive from the mirrored set containing the volume with that
corrupt file, which caused the lsm volume to be disabled and the mirror
relocation to the hot spare but allowed the boot to succeed.
Then the problem was that the volume wouldn't resync properly and another
crash happened before syncing was done.
I then had a mirror with a stale plex and the volume would stay disabled and
I couldn't clear that status.
After plugging the removed drive back in, the system was seeing it, but not
lsm, so it needed a reboot. Before rebooting again, I used scu to run a
complete surface scan of all drives and they were all fine.
So I rebooted again with all the drives back in normally, the system booted
fine, the disabled volume was still in that state and the previously removed
drive was back in lsm normally, although it wasn't put back into place because
everything that used to be in it was relocated to the hot spare.
However I was able to bring back the biggest plex from that drive by adding it
back to the mirror set, having a mirror set with 3 plexes.
I was then able to change the state of the plexes so the volume could get a
needsync and I triggered the resync and it was syncing fine.
And that's where the advfs problem showed up again, when I tried mounting that
volume back on, the system crashed again and I had to do the drive removal
again to break out of the boot loop.
Once rebooted, I had the volume syncing going on again, except on that failed
volume with the advfs issue and the removed plex from that pulled out drive.
That mirror set on the 2 drives has 3 lsm volumes, 2 small ones and the large
one. The two small ones resync just fine, only the large one won't resync by
itself.
The big problem now is with the advfs domain that's supposed to be on that
large volume. While the resyncing was still going on, I tried doing a showfdmn
on it and there was another crash.
Basically every time I try to access that advfs domain in any way, whether by
trying to mount it, looking up some info on it, or when the lsm syncing is
finished on it and then it tries to mount, I get a crash.
I don't think this is an advfs domain panic; I think that wouldn't bring the
system down entirely and it would keep everything else running.
The system just crashes every time the advfs data is "touched" in any way.
Now how could I get that fixed and back under control if it crashes without
provocation???
That system has now been down for more than half a day and it's getting
critical.
Is it possible to clean up and repair advfs corruption without causing the
system crashes?????