henris

Members
  • Posts: 295
  • Joined
  • Last visited
  • Gender: Undisclosed
  • Location: Finland

henris's Achievements

  • Rank: Contributor (5/14)
  • Reputation: 7

  1. Just got an SMT1500RMI2UC and was getting desperate. Thank you for posting the solution! With my model the behavior is exactly the same: enabling Modbus on the UPS and setting the Unraid APC UPS daemon to use USB/Modbus results in garbage values. I restarted the APC UPS daemon after I had unplugged and replugged the USB cable, so you can do it in that order too, I guess. Simply restarting the daemon without unplugging the USB did not work. I have set my UPS to power off after shutdown, so I will be starting the server manually and can check that the APC UPS daemon initializes properly. Still, it is an extra thing to worry about, so it would be great to have this working properly.
     I was getting the basic information with a USB/USB connection but was missing nominal power, usage and others. With USB/Modbus I get it all, which is nice for my InfluxDB/Grafana setup. My previous Back-UPS Pro worked with USB/USB out of the box.
     As a side note, this model has integrated Ethernet but it works only with the cloud-based monitoring. No local web management or control. I did not understand this when ordering. To get proper web-based management and PCNet/SNMP functionality I would have to buy the optional network management card. I feel a bit cheated...
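     A quick way to sanity-check that the daemon is actually getting Modbus data after the replug/restart is to query it from the shell (a sketch; apcaccess ships with apcupsd, which the Unraid APC UPS daemon is built on, and the field names below are standard apcupsd ones):
       # Show the fields that were missing over plain USB/USB
       apcaccess status | grep -E 'STATUS|NOMPOWER|LOADPCT|BCHARGE|TIMELEFT'
     If NOMPOWER and LOADPCT come back with sensible values instead of garbage, the Modbus link is working.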
  2. I was able to read the missing Home Assistant VM image when I mounted the SSD from an Ubuntu Live USB. The BTRFS file system was mounted in degraded mode and all the files were readable. When I tried the same with the SSD mounted in Unraid, some of the files were unreadable. In the meantime the re-created cache pool with the new SSDs has been functioning properly. It is still a BTRFS pool; I will make the switch to something else once I have upgraded my server. I still think that BTRFS is missing critical troubleshooting and management tools for pools and is not meant for production. To my mind it is a summer project that has the functionality but was left unfinished regarding the non-functional aspects.
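     For anyone attempting the same recovery, the Live USB mount was roughly along these lines (a sketch, not my exact commands; the device name and target directory are examples):
       # Mount the surviving pool member read-only in degraded mode, then copy files off it
       mkdir -p /mnt/recovery
       mount -o ro,degraded /dev/nvme0n1p1 /mnt/recovery
       cp -a /mnt/recovery/vm/domains/hassos_ova-4.16.qcow2 /media/backup-disk/
     Read-only plus degraded avoids writing anything to the remaining device while you copy.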
  3. Re-created the docker image. Read the instructions here:
     - Made sure I had all the docker templates backed up in case anything went haywire in the process. These are stored on the flash in /boot/config/plugins/dockerMan/templates-user
     - Deleted the docker image using the GUI (the docker service was already stopped)
     - Started the docker service
     - Installed all needed dockers through the Apps / Previous Apps page
     I also checked the cache pool's filesystem integrity before the above:
     - Started the array in maintenance mode
     - Ran the check through the GUI (btrsfs check /mnt/cache/ did not work for some reason)
     - Results seemed OK -> cannot explain the corruption of the docker image and the Windows VM image
     [1/7] checking root items
     [2/7] checking extents
     [3/7] checking free space tree
     [4/7] checking fs roots
     [5/7] checking only csums items (without verifying data)
     [6/7] checking root refs
     [7/7] checking quota groups skipped (not enabled on this FS)
     Opening filesystem to check...
     Checking filesystem on /dev/nvme0n1p1
     UUID: 39320790-03d4-4117-a978-033abe08a975
     found 309332566016 bytes used, no error found
     total csum bytes: 301153416
     total tree bytes: 941031424
     total fs tree bytes: 576176128
     total extent tree bytes: 43843584
     btree space waste bytes: 138846305
     file data blocks allocated: 1659942002688
      referenced 308089815040
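     Note that btrfs check wants the unmounted device rather than the mount point, so for anyone wanting the shell equivalent of the GUI check, something like this should work (a sketch; device name taken from the output above, array in maintenance mode):
       btrfs check --readonly /dev/nvme0n1p1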
  4. Like always, the celebration was too early... There seems to be quite some corruption in different places. My other Windows VM has a filesystem corrupted beyond repair. The docker image file also seems corrupt: the docker service hung and I was unable to get the dockers to stop. Now rebooting the server. I will try to get the cache pool into an unmounted state so I can run a filesystem check. Then I will decide whether to just revert to previous docker backups and re-create the docker image.
  5. I just successfully re-created my 2 x NVMe SSD cache pool, replacing the old 500GB drives with 1TB ones. Steps:
     - Stopped VM/Docker services
     - Created a full backup of the cache pool contents with "rsync -av" to a separate SSD
     - Shut down the server
     - Replaced the SSDs
     - Started the server. Had some jitters since the server refused to boot from USB. I have had this issue occasionally and finally it booted; I did not change any settings. I think it is due to the Asus motherboard BIOS getting confused by the 25+ potential boot drives. Took the USB out, made a copy of it to make sure it was still fine, and put it back in. And Unraid booted.
     - Stopped the array
     - Assigned the new SSDs to the cache pool
     - Formatted the SSDs
     - Restored the cache pool contents with "rsync -av"
     - Started VM/Docker services
     - Started verifying docker services. Still going through them, but the main ones like Plex seemed to be fully functional. I will check the logs for any suspicious issues but it looks good so far.
     Short rant about BTRFS pool management and troubleshooting tools. It is a short rant since there ain't no tools for seeing the pool or device status:
     - The pool was in read-only mode and there was no way to see it
     - One of the two devices of the pool had failed and there was no way to see it
     - The only thing "visible" of any issue was the BTRFS device error counts, which are NOT reflected in the Unraid GUI
     I cannot be sure whether the data on the remaining SSD was OK or not, though apart from one file I was able to copy the data off it. I will be building a new server in the near future and will be looking very closely at ZFS pools to see if they would provide a better experience.
     The only file I lost was hassos_ova-4.16.qcow2. Initially I thought this was no biggie since I could just re-download it if needed. But I soon realised that it was the actual disk image of my Home Assistant environment. And then I realised that I had no backup of it anywhere... Arghh... Having no backup is on me; I cannot understand how I missed backing it up.
     I still have the old SSDs. I think I will put the non-failed one in an M.2 NVMe enclosure and try to see if the missing file could somehow be recovered. If someone has an idea how to do this, please chime in. If this fails, I guess it is always good to start from scratch sometimes. Fortunately I had mostly prototyping stuff in HASS, but some special things like the KNX integration contained parts I developed myself.
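     For the record, the per-device error counts mentioned in the rant can be read from the shell (a sketch; mount point as used elsewhere in this thread):
       # Cumulative read/write/flush/corruption/generation error counters per pool member
       btrfs device stats /mnt/cache
       # After replacing or repairing a device, zero the counters so new errors stand out
       btrfs device stats --reset /mnt/cache
     The counters are cumulative since the last reset, which is also why they only show historical errors rather than the current pool state.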
  6. Started the replacement process by doing a full rsync archive copy to a standalone SSD:
     rsync -av /mnt/cache/ /mnt/disks/download/cache_backup/ --log-file=log.txt
     This seemed to run fine except for one error reported:
     vm/domains/hassos_ova-4.16.qcow2
     rsync: [sender] read errors mapping "/mnt/cache/vm/domains/hassos_ova-4.16.qcow2": Input/output error (5)
     ERROR: vm/domains/hassos_ova-4.16.qcow2 failed verification -- update discarded.
     sent 315,302,030,978 bytes  received 10,276,919 bytes  447,568,925.33 bytes/sec
     total size is 296,255,909,280  speedup is 0.94
     rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1330) [sender=3.2.3]
     If this is the only corrupted file I will be glad. The "hassos_ova-4.16.qcow2" can just be re-downloaded. I will next shut down the server and replace the two 500GB SSDs with new 1TB ones, then create a new pool and restore the data to it.
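     For extra confidence that the copy is complete before wiping the pool, a dry-run comparison can be done along these lines (a sketch; --checksum re-reads both sides, so it is slow):
       # List anything that still differs between the pool and the backup, without copying
       rsync -avn --checksum /mnt/cache/ /mnt/disks/download/cache_backup/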
  7. Reading from this:
     - When a drive fails in a two-disk BTRFS RAID1 pool, the pool continues to operate in read-write mode (though some comments indicate that it might go to read-only mode)
     - If you reboot in this state, the pool will be mounted in read-only mode
     - You can mount the pool in read-write mode with a special command (a degraded mount)
     I could not find the BTRFS command to see the current state of the pool (read-write vs read-only, general health, anything). The closest is "device stats" but it provides cumulative historical data, not the current state. Am I missing something here?
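     (Edit: as far as I can tell, the two partial checks that come closest are the mount options and the device list; a sketch, mount point as above:)
       # Is the pool currently mounted read-only or read-write?
       findmnt -no TARGET,OPTIONS /mnt/cache
       # List the devices btrfs associates with the pool; a missing member is flagged here
       btrfs filesystem show /mnt/cache
     Neither is a real health status, which is exactly the complaint.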
  8. I really appreciate your replies but I'm now really confused. According to the SMART report my second cache pool NVMe SSD has failed. Surely I cannot just re-format without replacing the failed drive first? The cache pool seems to be in read-only mode and files can be read from it without causing any errors in syslog. Should I just start reading the BTRFS manual and try to figure out what is going on? How can this (a btrfs pool) be part of Unraid's critical cache feature if it is so fragile and untroubleshootable? Should I just start from scratch, and if so, is there something better than BTRFS cache pools? ZFS pools?
     I have already purchased two larger replacement NVMe SSDs since the current ones are already four years old and close to their recommended TBW. I'm willing to bite the bullet and start from scratch, but it would be great to know that the new mechanism actually worked. The only reason for using a RAID1 cache pool was to get protection from drive failure, and when the drive failure occurred Unraid was totally unaware of it.
     Sorry for the ranting, I really like Unraid, it's been serving me well for over a decade. This issue happened at the most inconvenient time and I don't have enough time to investigate this properly.
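     (The SMART report I'm referring to is the drive's own data; a sketch of pulling it outside the GUI, device name as in my diagnostics:)
       smartctl -a /dev/nvme1n1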
  9. When I try to run a scrub via the GUI I get "aborted" as the status: And pretty much the same thing via the shell: I forgot that I had run the scrub via the GUI last night when doing the initial troubleshooting. Initially I got the same "aborted", but after I stopped the VMs and Dockers I was able to start the scrub, which ran for ~6 minutes and reported millions of unrecoverable errors. Unfortunately I did not get a screenshot of that result before hitting Scrub again now...
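     For anyone following along, the shell-side commands for this are roughly (a sketch; -B keeps the scrub in the foreground so the summary prints at the end):
       btrfs scrub start -B /mnt/cache
       # or check progress / the last result from another shell
       btrfs scrub status /mnt/cache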
  10. I will do this and post the results. Just to be sure: is it safe to run a scrub on the pool regardless of the state of the devices in it? It cannot cause any more corruption / data loss?
  11. I have a two NVMe SSD cache pool, taken into use in 2019, on my second server. We had a power outage and although the UPS seemed to allow for a controlled shutdown, the system was not behaving correctly after restarting the server. I noticed some dockers (Plex) not working correctly and started troubleshooting. I had not received any notifications about issues, and the main Unraid GUI pages (Main, Dashboard, Docker) did not indicate any issue either. When I took a look in syslog I saw a flood of BTRFS-related warnings and errors. Seemed like the whole server was on fire, well at least the cache pool.
      I started reading the FAQ and similar problem threads. I got confused fast. I've been using Unraid since 2009 and am pretty good with it, but the cache pool BTRFS mechanism, how to see the status of it, how to troubleshoot it and in this case how to fix it seems overwhelming. I've read this FAQ entry and this addition to it. And several troubleshooting threads. And also this "how to monitor btrfs pool for errors" which I will take into use.
      My questions are:
      - How can I see what is actually broken? From the SMART logs and "btrfs dev stats /mnt/cache/" it seems like it is my /dev/nvme1n1p1 SSD which has failed. It just baffles me that this is not at all reflected in the Unraid GUI.
      - How can I see what data is corrupted or lost? Is there some specific command I can run to see a list of corrupted files?
      - Why would I have corrupted data? I thought running a RAID1 cache pool would protect me from a single cache drive failure, but now I seem to have a single drive failure and am still experiencing at least functional loss (i.e. unable to run dockers properly).
      - What is the recommended way to fix this? I have replacement SSDs ready but I cannot connect them at the same time (only two M.2 slots). I'm especially unsure about trusting the data currently in the cache pool. I do have CA backups available.
      My whole system is currently down so all help is greatly appreciated! I promise to document my path and end result in this thread. Diagnostics attached; this is though AFTER one shutdown but seems to show the same behavior. tms-740-diagnostics-20230922-0733.zip
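      (A partial answer to the second question above, as far as I understand it: btrfs logs checksum failures to the kernel log with the inode and, during a scrub, often the path, so something along these lines gives a rough list. A sketch; the inode number in the second command is a placeholder taken from such a log line:)
        # Pull btrfs checksum errors out of the kernel log
        dmesg | grep -iE 'btrfs.*(csum failed|checksum error)'
        # Map an inode number from those messages back to a file path (pool mount point)
        btrfs inspect-internal inode-resolve 257 /mnt/cache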
  12. No changelog in Unraid to my knowledge. I keep my changelog in OneNote: a few lines for each "trivial change" and a separate subpage for larger changes or complex troubleshooting. Also some tailored pages or tables for things like disks and the more complex dockers. It is just so much easier to have a compressed, logical description of the changes rather than trying to reverse-engineer it from logs, if that is even possible. At work I use things like Jira for change management, but I don't like it for personal use (feels too much like work).
      99% of all problems come from changes. I can document my own changes and I can try to control other changes with scheduled updates. I have docker updates running on Friday/Saturday night so I have all weekend to fix things. To emphasize the point, I just ran into the 1% and have a failed cache pool drive and a potentially corrupted cache pool. That is why I'm on this forum right now, to make a new troubleshooting thread. The last time I had to do troubleshooting was 11.4.2022. I've been running Unraid since 2009. I just love it, I can let it run for months and months without any manual intervention. Sometimes things just break.
  13. The latest Plex release broke hw transcoding (tone mapping) and, in my case, also PlexKodiConnect's ability to direct play. Are you considering downgrading to the latest working release, like some other docker publishers have done?
  14. Here you go. Thank you for the fast response and for the plugin itself, it has been a core plugin for many years and worked wonderfully.
      parityTuningIncrements="1"
      parityTuningFrequency="0"
      parityTuningResumeCustom=""
      parityTuningResumeHour="1"
      parityTuningResumeMinute="0"
      parityTuningPauseCustom=""
      parityTuningPauseHour="5"
      parityTuningPauseMinute="30"
      parityTuningUnscheduled="1"
      parityTuningRecon="1"
      parityTuningClear="1"
      parityTuningNotify="0"
      parityTuningHeat="1"
      parityTuningDebug="no"
      parityTuningAutomatic="0"
      parityTuningRestart="0"
      parityTuningHeatHigh="3"
      parityTuningHeatLow="8"
      parityTuningHeatNotify="1"
      parityTuningHeatShutdown="0"
      parityTuningLogging="0"
      parityTuningScheduled="1"
      parityTuningManual="0"
      parityTuningResumeDay="0"
      parityTuningPauseDay="0"
      parityTuningMover="1"
      parityTuningCABackup="1"
      parityTuningLogTarget="0"
  15. This morning I noticed that the parity check was running although it has been configured to run in increments between 1:00:00 and 5:30:00 AM. I went through syslog and realised that the parity check was resumed when the mover finished. Mover is configured at 6:00:00 (after the parity check has ended), so the parity check should definitely not be resumed when it finishes. I haven't noticed this behavior before (i.e. during last month's scheduled check) but cannot be certain. Below is the syslog excerpt containing the relevant entries:
      --- Parity check scheduled to run between 1:00:00 and 5:30:00 -> OK
      Apr 4 01:00:02 TMS-740 Parity Check Tuning: Resumed: Scheduled Correcting Parity-Check
      Apr 4 01:00:02 TMS-740 Parity Check Tuning: Resumed: Scheduled Correcting Parity-Check (28.1% completed)
      Apr 4 01:00:07 TMS-740 kernel: mdcmd (40): check resume
      Apr 4 01:00:07 TMS-740 kernel:
      Apr 4 01:00:07 TMS-740 kernel: md: recovery thread: check P Q ...
      Apr 4 05:30:01 TMS-740 Parity Check Tuning: Paused: Scheduled Correcting Parity-Check
      Apr 4 05:30:06 TMS-740 kernel: mdcmd (41): nocheck pause
      Apr 4 05:30:06 TMS-740 kernel:
      Apr 4 05:30:06 TMS-740 kernel: md: recovery thread: exit status: -4
      Apr 4 05:30:12 TMS-740 Parity Check Tuning: Paused: Scheduled Correcting Parity-Check (40.7% completed)
      --- Mover scheduled to run at 6:00:00 -> OK
      Apr 4 06:00:01 TMS-740 root: mover: started
      --- Mover took ~14 mins this time -> OK
      Apr 4 06:14:25 TMS-740 root: mover: finished
      --- Parity check resuming -> NOK
      Apr 4 06:18:43 TMS-740 Parity Check Tuning: Resumed: Mover no longer running
      Apr 4 06:18:48 TMS-740 kernel: mdcmd (42): check resume
      Apr 4 06:18:48 TMS-740 kernel:
      Apr 4 06:18:48 TMS-740 kernel: md: recovery thread: check P Q ...
      Apr 4 06:18:48 TMS-740 Parity Check Tuning: Resumed: Mover no longer running: Scheduled Correcting Parity-Check (40.7% completed)
      --- Manually pausing the parity check after noticing that it was still running
      Apr 4 09:56:34 TMS-740 kernel: mdcmd (43): nocheck Pause
      Apr 4 09:56:35 TMS-740 kernel: md: recovery thread: exit status: -4
      Apr 4 09:58:23 TMS-740 ool www[3302]: /usr/local/emhttp/plugins/parity.check.tuning/parity.check.tuning.php 'updatecron'
      Apr 4 09:58:23 TMS-740 Parity Check Tuning: Configuration: Array#012(#012 [parityTuningScheduled] => 1#012 [parityTuningManual] => 0#012 [parityTuningAutomatic] => 0#012 [parityTuningFrequency] => 0#012 [parityTuningResumeCustom] => #012 [parityTuningResumeDay] => 0#012 [parityTuningResumeHour] => 1#012 [parityTuningResumeMinute] => 0#012 [parityTuningPauseCustom] => #012 [parityTuningPauseDay] => 0#012 [parityTuningPauseHour] => 5#012 [parityTuningPauseMinute] => 30#012 [parityTuningNotify] => 0#012 [parityTuningRecon] => 1#012 [parityTuningClear] => 1#012 [parityTuningRestart] => 0#012 [parityTuningMover] => 1#012 [parityTuningCABackup] => 1#012 [parityTuningHeat] => 1#012 [parityTuningHeatHigh] => 3#012 [parityTuningHeatLow] => 8#012 [parityTuningHeatNotify] => 1#012 [parityTuningHeatShutdown] => 0#012 [parityTuningHeatCritical] => 2#012 [parityTuningHeatTooLong] => 30#012 [parityTuningLogging] => 0#012 [parityTuningLogTarget] => 0#012 [parityTuningMonitorDefault] => 17#012 [parityTuningMonitorHeat] => 7#012 [parityTuningMonitorBusy] => 6#012 [parityTu