6.2.1 windows 10 vm on cache disk lost weeks of data (for the 2nd time)

lordoxide · January 28, 2017

Team,

This is the 2nd time this has happened to my server, basically after a server reboot, my windows VM reverts to a past state, and by past state I mean like 2-3 weeks in the past. The first time this occurred I rebooted my server from the unraid console and had not stopped my VM, I just assumed this caused the issue. But this time, I rebooted my VM, then shut it down, then rebooted the server, upon updating my vm's disk had reverted to something like two week in the past. I hope this is a settings proplem that someone can help me track down. Basically it feels like the entire images is committed to the actual disk, I would understand if the was something a btrfs sync issue it this was like 30min of lost data that didn't commit, but we are talking weeks.

unraid version 6.2.4 (was 6.2.1)

VM setup file attached

Cache drive = btrfs 2x SanDisk_Ultra_II_480GB_154798446452 - 480 GB (sdb)

If there is anything I can do to help troubleshoot this please let me know.

windows10.txt

trurl · January 28, 2017

I moved this to the KVM subforum. It doesn't seem like your subject line really matches the body of the post. Do you know of any actual files from the cache disk that are missing?

Go to Tools - Diagnostics and post the complete diagnostics zip.

lordoxide · January 28, 2017

trurl,

Diagnostic attached. I consider files incomplete rather than actually missing, basically the raw virtual disk image is written to a cache disk only user share. While the unraid is running there are no problems, I can reboot the VM, all of my data appears to be in tact. But if I reboot the unraid server (the correct way) when it comes back up, its like my virtual disk drive is missing all of its "unsaved" data or something.

oxnet-diagnostics-20170128-1407.zip

JorgeB · January 28, 2017

So we can see the pool status post the output of:

btrfs fi show /mnt/cache

btrfs fi df /mnt/cache

btrfs device stats /mnt/cache

trurl · January 28, 2017

Jan 28 13:10:02 Oxnet emhttp: SanDisk_Ultra_II_480GB_154798446452 (sdb) 468851512
Jan 28 13:10:02 Oxnet emhttp: SanDisk_Ultra_II_480GB_163131425712 (sdc) 468851512
Jan 28 13:10:02 Oxnet emhttp: import 31 cache device: sdc
Jan 28 13:10:02 Oxnet emhttp: import 32 cache device: no device
Jan 28 13:10:02 Oxnet emhttp: check_pool: /sbin/btrfs filesystem show 3bc272d8-5206-4911-b397-f97792364bc2 2>&1
Jan 28 13:10:02 Oxnet emhttp: cacheUUID: 3bc272d8-5206-4911-b397-f97792364bc2
Jan 28 13:10:02 Oxnet emhttp: cacheNumDevices: 2
Jan 28 13:10:02 Oxnet emhttp: cacheTotDevices: 2
Jan 28 13:10:02 Oxnet emhttp: cacheNumMissing: 0
Jan 28 13:10:02 Oxnet emhttp: cacheNumMisplaced: 0
Jan 28 13:10:02 Oxnet emhttp: cacheNumExtra: 0

I have 2 in pool but my syslog doesn't say no device for one of them so something not right.

I'll leave it to johnnie.black to get you straight.

lordoxide · January 28, 2017

johnnie.black,

root@Oxnet:~# btrfs fi show /mnt/cache

Label: none uuid: 3bc272d8-5206-4911-b397-f97792364bc2

Total devices 2 FS bytes used 132.10GiB

devid 1 size 447.13GiB used 339.03GiB path /dev/sdb1

devid 3 size 447.13GiB used 339.03GiB path /dev/sdc1

root@Oxnet:~# btrfs fi df /mnt/cache

Data, RAID1: total=337.00GiB, used=131.99GiB

System, RAID1: total=32.00MiB, used=64.00KiB

Metadata, RAID1: total=2.00GiB, used=113.09MiB

GlobalReserve, single: total=48.00MiB, used=0.00B

root@Oxnet:~# btrfs device stats /mnt/cache

[/dev/sdb1].write_io_errs 0

[/dev/sdb1].read_io_errs 0

[/dev/sdb1].flush_io_errs 0

[/dev/sdb1].corruption_errs 0

[/dev/sdb1].generation_errs 0

[/dev/sdc1].write_io_errs 0

[/dev/sdc1].read_io_errs 0

[/dev/sdc1].flush_io_errs 0

[/dev/sdc1].corruption_errs 0

[/dev/sdc1].generation_errs 0

JorgeB · January 28, 2017

Everything looks mostly normal but was one of the devices ever replaced? If not they should be devid 1 and 2, which could mean that one of them dropped offline and then rejoined the pool, this could cause data loss.

devid 1 size 447.13GiB used 339.03GiB path /dev/sdb1

devid 3 size 447.13GiB used 339.03GiB path /dev/sdc1

There's also a lot of slack on the file system:

Data, RAID1: total=337.00GiB, used=131.99GiB

System, RAID1: total=32.00MiB, used=64.00KiB

Metadata, RAID1: total=2.00GiB, used=113.09MiB

GlobalReserve, single: total=48.00MiB, used=0.00B

You should run a balance:

btrfs balance start -dusage=75 /mnt/cache

When it finishes repost output of:

btrfs fi df /mnt/cache

lordoxide · January 28, 2017

Initially it was an unbalanced pair, a 256G and a 480G i replaced the 256 with a 480 about a year ago. Both of these losses happened well after that.

Thanks,

lordoxide · January 28, 2017

Johnnie.black

root@Oxnet:~# btrfs balance start -dusage=75 /mnt/cache

Done, had to relocate 207 out of 340 chunks

root@Oxnet:~# btrfs fi df /mnt/cache

Data, RAID1: total=139.00GiB, used=132.14GiB

System, RAID1: total=32.00MiB, used=48.00KiB

Metadata, RAID1: total=2.00GiB, used=109.06MiB

GlobalReserve, single: total=48.00MiB, used=0.00B

JorgeB · January 28, 2017

The replacement explains the devid and I can't see nothing wrong with pool, usage is much better after the balance but don't think it was bad enough as to run out of space, I'm not sure the pool is to blame for whatever happened but can't say what happened without the syslog before rebooting.

Would recommend to update to latest unRAID release, or the latest v6.3-rc, as it has newer kernel and btrfs-progs, and next time you reboot grab the diagnostics before doing it.

lordoxide · January 29, 2017

The reboot in question was during the upgrade to 6.2.4, should I try out beta 6.3 or just stay in the stable branch?

Thanks,

JorgeB · January 29, 2017

6.3-rc has been rock solid for me and it has newer kernel, btrfs-progs and lot of security fixes among other improvements.

lordoxide · January 29, 2017

OK,

Installed mainline plugin, ran diagnostics, stopped windows vm, about to reboot now. Of coarse no issues happen generally when you prep for them, but in this case that would be a good thing. Let you know what happens!.

Thanks.

lordoxide · January 29, 2017

Johnnie.black,

Ok good news and bad news. Good news, upgrade went smoothly, and there appeared to be no data rollback. The bad news is, performance on my VM has deffinitely taken a hit. General web and OS usage is fine, but gaming (and light gaming, world of warcraft for testing) takes about 2x as long to load, and FPS appears almost cut in half. I benchmarked the system and the videocard is doing fine as expected, the processor always tests low because i'm only pinning in 4 vcpus, but disk seems incredibly slow, not sure if something changed with the driver or what. Here are my crystalDiskMark test results:

-----------------------------------------------------------------------

Crystal Dew World : http://crystalmark.info/

-----------------------------------------------------------------------

* MB/s = 1,000,000 bytes/s [sATA/600 = 600,000,000 bytes/s]

* KB = 1000 bytes, KiB = 1024 bytes

Sequential Read (Q= 32,T= 1) : 1046.275 MB/s

Sequential Write (Q= 32,T= 1) : 586.129 MB/s

Random Read 4KiB (Q= 32,T= 1) : 55.913 MB/s [ 13650.6 IOPS]

Random Write 4KiB (Q= 32,T= 1) : 37.448 MB/s [ 9142.6 IOPS]

Sequential Read (T= 1) : 1072.727 MB/s

Sequential Write (T= 1) : 568.593 MB/s

Random Read 4KiB (Q= 1,T= 1) : 15.660 MB/s [ 3823.2 IOPS]

Random Write 4KiB (Q= 1,T= 1) : 12.236 MB/s [ 2987.3 IOPS]

Test : 1024 MiB [C: 32.7% (65.2/199.4 GiB)] (x5) [interval=5 sec]

Date : 2017/01/29 13:50:10

OS : Windows 10 [10.0 Build 14393] (x64)

Sadly I don't have re-upgrade results to compare, because I never had an issue before. But loading into the game and FPS are greatly lowered now. You can see my vm config attached to the initial post. The VM is using the vstorage share, which is cache only.

Let me know if there is anything I can do to help troubleshoot.

Thanks

lordoxide · January 29, 2017

If it can help attached are the before and after diagnostics for the upgrade.

oxnet-diagnostics-20170129-1302.zip

oxnet-diagnostics-20170129-1309.zip

lordoxide · January 29, 2017

OK, even non-gaming OS level things are taking 3-4x as long to start and use post upgrade, it's almost unusable, is there a way to "downgrade" back, or should I try and solve the problem here.

Just for information sake, basically I run my primary workstation as a windows 10 VM on top of unraid with PCI passthrough. Haven't really had any issues with performance in the last year, aside from the 2 data rollbacks. Not sure where to start here.

The following warning shows up in the qemu logs (might be completely unrelated):

2017-01-29T18:12:04.796334Z qemu-system-x86_64: warning: Unknown firmware file in legacy mode: etc/msr_feature_control

Update2:

Changed VM to pc-i440fx-2.7 from 2.5, the above warning went away but the loading issue is still there. It feels like the VM is writing to the standard array rather than the ssd...

Updated

Thanks

JorgeB · January 29, 2017

You can downgrade by replacing bzroot, bzimage and bzroot-gui on your flash drive.

lordoxide · January 29, 2017

Johhnie,

Do you suggest immediately rolling back, or should I try and plug away and see what the issue is, and if its fixable? (with the communities help =)

Thanks,

JorgeB · January 29, 2017

They should be fixable, create a new post on the vm section.

lordoxide · January 29, 2017

Done

6.2.1 windows 10 vm on cache disk lost weeks of data (for the 2nd time)

Recommended Posts

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Archived