
6.2.1 Windows 10 VM on cache disk lost weeks of data (for the 2nd time)


lordoxide


Team,

 

This is the 2nd time this has happened to my server. Basically, after a server reboot, my Windows VM reverts to a past state, and by past state I mean 2-3 weeks in the past. The first time this occurred I had rebooted the server from the unRAID console without stopping my VM first, so I just assumed that caused the issue. But this time, I rebooted my VM, then shut it down, then rebooted the server, and after the update my VM's disk had reverted to something like two weeks in the past. I hope this is a settings problem that someone can help me track down. Basically, it feels like the VM's disk image is never actually being committed to the real disk. I would understand if there was some btrfs sync issue and this was 30 minutes of lost data that didn't commit, but we are talking weeks.
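Side note: before the next planned reboot I'm going to flush everything manually first, just to try to rule out un-flushed writes. This is only my own guess at a sanity check, assuming the pool is mounted at /mnt/cache:

sync
btrfs filesystem sync /mnt/cache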

 

unraid version 6.2.4 (was 6.2.1)

VM setup file attached

Cache drive = btrfs 2x SanDisk_Ultra_II_480GB_154798446452 - 480 GB (sdb)

 

If there is anything I can do to help troubleshoot this please let me know.

windows10.txt


trurl,

 

Diagnostics attached. I'd consider the files incomplete rather than actually missing. Basically, the raw virtual disk image is written to a cache-only user share. While unRAID is running there are no problems; I can reboot the VM and all of my data appears to be intact. But if I reboot the unRAID server (the correct way), when it comes back up it's like my virtual disk is missing all of its "unsaved" data or something.
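For the next reboot I'll also note the image's size and timestamp directly on the cache mount right before shutting down and right after it comes back, just as a sanity check (the path below is only an example; my actual vdisk name may differ):

ls -lh /mnt/cache/vstorage/Windows10/vdisk1.img
stat /mnt/cache/vstorage/Windows10/vdisk1.img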

oxnet-diagnostics-20170128-1407.zip


Jan 28 13:10:02 Oxnet emhttp: SanDisk_Ultra_II_480GB_154798446452 (sdb) 468851512
Jan 28 13:10:02 Oxnet emhttp: SanDisk_Ultra_II_480GB_163131425712 (sdc) 468851512
Jan 28 13:10:02 Oxnet emhttp: import 31 cache device: sdc
Jan 28 13:10:02 Oxnet emhttp: import 32 cache device: no device
Jan 28 13:10:02 Oxnet emhttp: check_pool: /sbin/btrfs filesystem show 3bc272d8-5206-4911-b397-f97792364bc2 2>&1
Jan 28 13:10:02 Oxnet emhttp: cacheUUID: 3bc272d8-5206-4911-b397-f97792364bc2
Jan 28 13:10:02 Oxnet emhttp: cacheNumDevices: 2
Jan 28 13:10:02 Oxnet emhttp: cacheTotDevices: 2
Jan 28 13:10:02 Oxnet emhttp: cacheNumMissing: 0
Jan 28 13:10:02 Oxnet emhttp: cacheNumMisplaced: 0
Jan 28 13:10:02 Oxnet emhttp: cacheNumExtra: 0

I have 2 in my pool, but my syslog doesn't say "no device" for one of them, so something's not right.
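If you want to compare, you can pull those lines straight out of the syslog with something like:

grep 'cache device' /var/log/syslog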

 

I'll leave it to johnnie.black to get you straight. 8)


johnnie.black,

 

root@Oxnet:~# btrfs fi show /mnt/cache

Label: none  uuid: 3bc272d8-5206-4911-b397-f97792364bc2

        Total devices 2 FS bytes used 132.10GiB

        devid    1 size 447.13GiB used 339.03GiB path /dev/sdb1

        devid    3 size 447.13GiB used 339.03GiB path /dev/sdc1

 

root@Oxnet:~# btrfs fi df /mnt/cache

Data, RAID1: total=337.00GiB, used=131.99GiB

System, RAID1: total=32.00MiB, used=64.00KiB

Metadata, RAID1: total=2.00GiB, used=113.09MiB

GlobalReserve, single: total=48.00MiB, used=0.00B

 

root@Oxnet:~# btrfs device stats /mnt/cache

[/dev/sdb1].write_io_errs  0

[/dev/sdb1].read_io_errs    0

[/dev/sdb1].flush_io_errs  0

[/dev/sdb1].corruption_errs 0

[/dev/sdb1].generation_errs 0

[/dev/sdc1].write_io_errs  0

[/dev/sdc1].read_io_errs    0

[/dev/sdc1].flush_io_errs  0

[/dev/sdc1].corruption_errs 0

[/dev/sdc1].generation_errs 0

 


Everything looks mostly normal, but was one of the devices ever replaced? If not, they should be devid 1 and 2; seeing devid 3 could mean that one of them dropped offline and then rejoined the pool, and that could cause data loss.

 

        devid    1 size 447.13GiB used 339.03GiB path /dev/sdb1

        devid    3 size 447.13GiB used 339.03GiB path /dev/sdc1
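If you want to check whether one of them is currently dropping and rejoining, something like this might show it (just a rough filter, assuming the pool members are still sdb/sdc as in your syslog):

dmesg | grep -i btrfs
grep -i btrfs /var/log/syslog | grep -iE 'error|missing|lost|reset'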

 

There's also a lot of slack on the file system:

 

Data, RAID1: total=337.00GiB, used=131.99GiB

System, RAID1: total=32.00MiB, used=64.00KiB

Metadata, RAID1: total=2.00GiB, used=113.09MiB

GlobalReserve, single: total=48.00MiB, used=0.00B

 

You should run a balance:

 

btrfs balance start -dusage=75 /mnt/cache

 

When it finishes repost output of:

 

btrfs fi df /mnt/cache
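For reference, -dusage=75 only rewrites data chunks that are at most 75% full, so it should be much quicker than a full balance. If it takes a while you can check progress from another console with:

btrfs balance status /mnt/cache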


Johnnie.black

 

root@Oxnet:~# btrfs balance start -dusage=75 /mnt/cache

Done, had to relocate 207 out of 340 chunks

 

root@Oxnet:~# btrfs fi df /mnt/cache

Data, RAID1: total=139.00GiB, used=132.14GiB

System, RAID1: total=32.00MiB, used=48.00KiB

Metadata, RAID1: total=2.00GiB, used=109.06MiB

GlobalReserve, single: total=48.00MiB, used=0.00B

 

 


The replacement explains the devid, and I can't see anything wrong with the pool. Usage is much better after the balance, but I don't think it was bad enough to run out of space. I'm not sure the pool is to blame for whatever happened, but I can't say what happened without the syslog from before the reboot.

 

I'd recommend updating to the latest unRAID release, or the latest v6.3-rc, since it has a newer kernel and btrfs-progs, and next time grab the diagnostics before rebooting.


Johnnie.black,

 

OK, good news and bad news. Good news: the upgrade went smoothly and there appeared to be no data rollback. The bad news is that performance in my VM has definitely taken a hit. General web and OS usage is fine, but gaming (even light gaming; I used World of Warcraft for testing) takes about 2x as long to load, and FPS appears almost cut in half. I benchmarked the system: the video card is doing fine as expected, and the processor always tests low because I'm only pinning 4 vCPUs, but the disk seems incredibly slow. Not sure if something changed with the driver or what. Here are my CrystalDiskMark test results:

 

-----------------------------------------------------------------------

CrystalDiskMark 5.2.1 x64 (UWP) © 2007-2017 hiyohiyo

                          Crystal Dew World : http://crystalmark.info/

-----------------------------------------------------------------------

* MB/s = 1,000,000 bytes/s [SATA/600 = 600,000,000 bytes/s]

* KB = 1000 bytes, KiB = 1024 bytes

 

  Sequential Read (Q= 32,T= 1) :  1046.275 MB/s

  Sequential Write (Q= 32,T= 1) :  586.129 MB/s

  Random Read 4KiB (Q= 32,T= 1) :    55.913 MB/s [ 13650.6 IOPS]

  Random Write 4KiB (Q= 32,T= 1) :    37.448 MB/s [  9142.6 IOPS]

        Sequential Read (T= 1) :  1072.727 MB/s

        Sequential Write (T= 1) :  568.593 MB/s

  Random Read 4KiB (Q= 1,T= 1) :    15.660 MB/s [  3823.2 IOPS]

  Random Write 4KiB (Q= 1,T= 1) :    12.236 MB/s [  2987.3 IOPS]

 

  Test : 1024 MiB [C: 32.7% (65.2/199.4 GiB)] (x5)  [interval=5 sec]

  Date : 2017/01/29 13:50:10

    OS : Windows 10  [10.0 Build 14393] (x64)

 

Sadly I don't have pre-upgrade results to compare against, because I never had an issue before, but loading into the game and FPS are both much worse now. You can see my VM config attached to the initial post. The VM is using the vstorage share, which is cache-only.
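One thing I'm going to double-check is which disk bus and cache mode the vdisk is actually configured with after the upgrade, roughly like this (the VM name here is just what mine happens to be called):

virsh dumpxml "Windows 10" | grep -A5 '<disk'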

 

Let me know if there is anything I can do to help troubleshoot.

 

Thanks


OK, even non-gaming, OS-level things are taking 3-4x as long to start and use post-upgrade; it's almost unusable. Is there a way to "downgrade" back, or should I try to solve the problem here?

 

Just for information's sake: basically I run my primary workstation as a Windows 10 VM on top of unRAID with PCI passthrough. I haven't really had any issues with performance in the last year, aside from the two data rollbacks. Not sure where to start here.

 

The following warning shows up in the qemu logs (might be completely unrelated):

2017-01-29T18:12:04.796334Z qemu-system-x86_64: warning: Unknown firmware file in legacy mode: etc/msr_feature_control

 

Update 2:

Changed the VM to pc-i440fx-2.7 from 2.5; the above warning went away, but the loading issue is still there. It feels like the VM is writing to the standard array rather than the SSD...
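To test that theory I'm going to watch which physical drives actually take the writes while the benchmark runs in the VM, roughly like this (sdb/sdc are my cache SSDs; this is just a quick-and-dirty check):

# snapshot the sectors-written counter for every whole disk
awk '$3 ~ /^sd[a-z]+$/ {print $3, $10}' /proc/diskstats > /tmp/io_before
# ...run CrystalDiskMark inside the VM...
awk '$3 ~ /^sd[a-z]+$/ {print $3, $10}' /proc/diskstats > /tmp/io_after
diff /tmp/io_before /tmp/io_after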

 


 

Thanks


Archived

This topic is now archived and is closed to further replies.
