---
pageType: source
id: source.zfs-casaos-project
title: ZFS-CasaOS-Project
sourceType: local-file
sourcePath: /home/topher/.openclaw/workspace-crash-bot/projects/zfs-casaos.md
ingestedAt: 2026-05-02T21:02:31.564Z
updatedAt: 2026-05-02T21:02:31.564Z
status: active
---

# ZFS-CasaOS-Project

## Source

- Type: `local-file`
- Path: `/home/topher/.openclaw/workspace-crash-bot/projects/zfs-casaos.md`
- Bytes: 10991
- Updated: 2026-05-02T21:02:31.564Z

## Content

````text
# ZFS on CasaOS — Project Plan

**Thread:** #setting-up-zfs-on-casaOS
**Status:** Planning
**Created:** 2026-04-16

## Overview

Migrate from current CasaOS setup to ZFS-backed storage with proper redundancy and an offsite backup strategy. Long-term play tied to new rig build.

## Current Hardware

- **OS:** CasaOS on existing Linux box
- **Drives (11 total, all spinners):** 6×4TB, 3×3TB, 1×8TB, 1×10TB — 51TB raw
- **Current usage:** Under 6TB
- **8TB + 10TB:** Currently in CasaOS — Plex, old machine backups, Minecraft servers, non-critical stuff
- **SSDs available:** Smaller sizes (details TBD)
- **Note:** 2TB drive replaced with 4TB (2026-04-18 update)

## ZFS Topology (Agreed)

### Main Pool ("tank")

- **Vdev 1:** 5×4TB raidz2 → ~12TB usable, 2-drive redundancy (the vault — irreplaceable data)
- **Vdev 2:** 3×3TB raidz1 → ~6TB usable, 1-drive redundancy (media, Minecraft, replaceable stuff)
- **Caveat:** ZFS stripes writes across every vdev in a pool, so data cannot be pinned to Vdev 1 or Vdev 2, and losing either vdev loses the whole pool. Keeping vault and replaceable data on separate redundancy levels would take two pools.

### Backup (8TB + 10TB)

- Standalone backup targets, NOT part of the main pool
- Format TBD: ZFS for zfs send/recv OR ext4/NTFS for SHTF portability
- Consider: one ZFS for automated snapshots, one NTFS/exFAT for "grab and run" scenarios

### 4TB Drive (replaced 2TB)

- Scratch/overflow/ISO storage
- Possible L2ARC candidate (probably not needed — working set likely fits in RAM)
- Could also serve as hot spare for the raidz2 if desired

## Key Decisions Made

- **CasaOS stays** — not switching back to TrueNAS. Bare metal access (OpenClaw) matters.
- **ZFS layered underneath CasaOS** — CasaOS sees mountpoints, doesn't manage ZFS
- **Cockpit + ZFS plugin** for GUI management of ZFS layer
- **No dedup** — ever
- **Striped mirrors considered** but rejected for this drive mix (too much wasted capacity with mismatched sizes)
- **raidz2 on 5×4TB** chosen for main vdev — 2-drive redundancy, best for the vault
- **raidz1 on 3×3TB** — acceptable risk for replaceable data

## Key ZFS Rules Learned

- Can't add drives to an existing raidz vdev (before OpenZFS 2.3's raidz expansion) — only replace drives or add new vdevs
- Mixed-size vdevs waste capacity (capped at smallest drive)
- Vdev expansion only after ALL drives in vdev are replaced with larger ones
- Striped mirrors grow with the fewest swaps: capacity increases once both drives of one mirror vdev are replaced, not the whole pool
- Two-way mirrors give up 50% of raw capacity to redundancy

## Offsite Backup Strategy (Long-Term)

### 3-2-1 Compliance

1. **Main copy:** raidz2 pool on new rig
2. **Local backup:** 8/10TB standalone drives
3. **Offsite backup:** Old box relocated to brewery after migration

### Brewery Infrastructure

- HA box (HAOS) — hands off
- N100 (Batocera) — hands off
- Older Mac — janky ZFS, skip
- Raspberry Pis — possible backup target but slow
- **Best option:** Old server box moved to brewery as dedicated backup target

### Brewery Backup Setup (Future)

- Fresh Linux install on old box
- ZFS pool on backup drives
- `zfs send/recv` over SSH for automated incremental snapshots
- Cron job or systemd timer

## Migration Roadmap

1. Build new rig (see AI thread for hardware planning)
2. Set up ZFS pools on new rig with planned topology
3. Migrate CasaOS + all services
4. Verify everything works on new iron
5. Wipe old box, fresh Linux + ZFS as backup target
6. Relocate old box to brewery
7. Set up zfs send/recv over SSH for nightly incremental backups
8. 
   Decommission old setup with confidence

## SHTF Portability Notes

- ZFS: Linux-readable only (no native Windows, read-only Mac)
- ext4: Linux native, Windows/Mac need tools
- NTFS/exFAT: Universal — any random box can read
- Consider keeping one backup drive as NTFS for "grab and run" scenarios
- Linux live USB can read ZFS/ext4 on any machine in a pinch

## Special ZFS Drive Uses (Reference)

| Type | Use | Drive Type | Verdict |
|---|---|---|---|
| L2ARC | Read cache | SSD only | Skip unless ARC hit rate is low |
| SLOG/ZIL | Sync write log | SSD with PLP | Only for NFS/VMs/databases |
| Special vdev | Metadata storage | SSD preferred | High risk if not mirrored, overkill for home |
| Dedup vdev | Dedup tables | SSD | NO. Just no. |

## SMART Monitoring (Critical for Old Drives)

### Why It Matters

- All drives are old with no RMA coverage — failure = replace from own pocket
- SMART warnings give days to weeks of notice before total failure
- Running a degraded pool with old drives is risking data loss

### Key Metrics to Watch

| Attribute | Warning Threshold | Critical |
|---|---|---|
| Reallocated Sectors (5) | >0 | >100 |
| Current Pending Sectors (197) | >0 | >10 |
| Uncorrectable Errors (198) | >0 | any |
| Power-On Hours (9) | — | >50,000 |
| Temperature (194) | >40°C | >45°C |

### Setup

```bash
# Install smartmontools
sudo apt install smartmontools

# Enable and start smartd
sudo systemctl enable smartd
sudo systemctl start smartd

# Run short test (5-10 min)
smartctl -t short /dev/sdX

# Run long/comprehensive test (1-2 hours)
smartctl -t long /dev/sdX

# Check results
smartctl -l selftest /dev/sdX
smartctl -a /dev/sdX
```

### smartd.conf Configuration

```
# Email alerts on smartd warnings
DEVICESCAN -a -m user@example.com -M daily

# Or per-drive with specific schedules
/dev/sda -a -m admin@example.com -M daily -s (S/../.././02|L/../../6/03)
# Short test daily at 2am, long test Saturdays at 3am
```

### Power Cycle Count (12) — Also Worth Tracking

- Drive spinup 
  count. High number = old drive that's been powered on/off a lot
- Not a failure predictor on its own, but tells you wear history

### Pre-Pool Drive Health Check

Before building the pool, run full SMART tests on all drives:

```bash
smartctl -t long /dev/sda
smartctl -t long /dev/sdb
# ... etc
```

And check `smartctl -a /dev/sdX | grep -E '(Reallocated|Current_Pending|Uncorrectable)'`

Drives with any reallocated sectors or pending sectors should be considered questionable — use for non-critical vdevs or retire entirely.

### ZFS Integration

ZFS doesn't do its own SMART polling, but `zpool status` shows drive errors. A rising error count in `zpool status` alongside SMART warnings = replace that drive now.

```bash
# Check for ZFS errors
zpool status -v
```

## Pre-Pool Drive Testing (2026-04-18)

### Strategy

- 11 drives to test, 4 at a time (SATA port limitation)
- Sequential testing — run one batch, swap drives, run next batch
- Two-phase testing per drive:
  1. SMART long test (1-4 hours depending on drive size)
  2. badblocks non-destructive scan (6-12 hours per 4TB drive)

### Phase 1: SMART Long Tests

Tests all drives in current batch simultaneously.

```bash
#!/bin/bash
# drive-smart-test.sh
# Usage: ./drive-smart-test.sh /dev/sda /dev/sdb /dev/sdc /dev/sdd

DRIVES=("$@")
LOGDIR="/root/drive-health-logs"
mkdir -p "$LOGDIR"

for DRIVE in "${DRIVES[@]}"; do
    echo "[$(date)] Starting SMART long test on $DRIVE" | tee -a "$LOGDIR/test.log"
    # smartctl -t returns as soon as the drive accepts the test;
    # the test itself keeps running on the drive in the background.
    smartctl -t long "$DRIVE" 2>&1 | tee -a "$LOGDIR/test.log"
    echo "[$(date)] SMART long test queued on $DRIVE (runs in the background)" | tee -a "$LOGDIR/test.log"
done

echo "[$(date)] All SMART tests initiated. Check results with:"
echo "  smartctl -l selftest /dev/sdX"
echo "  smartctl -a /dev/sdX | grep -E '(Reallocated|Current_Pending|Uncorrectable|Power_On_Hours)'"
```

### Phase 2: badblocks Scan (Sequential)

One drive at a time to avoid port contention.
```bash
#!/bin/bash
# drive-badblocks.sh
# Usage: ./drive-badblocks.sh /dev/sda

DRIVE="$1"
LOGDIR="/root/drive-health-logs"

# Validate the argument before deriving the log path from it
if [ -z "$DRIVE" ]; then
    echo "Usage: $0 /dev/sdX"
    exit 1
fi

mkdir -p "$LOGDIR"
LOGFILE="$LOGDIR/badblocks-$(basename "$DRIVE").log"

echo "[$(date)] Starting badblocks non-destructive scan on $DRIVE" | tee "$LOGFILE"
# -b 4096 keeps the block count within badblocks' 32-bit limit on the 8TB/10TB drives
badblocks -b 4096 -nvs "$DRIVE" 2>&1 | tee -a "$LOGFILE"
echo "[$(date)] badblocks complete on $DRIVE" | tee -a "$LOGFILE"
```

### Full Sequential Test Workflow

```bash
#!/bin/bash
# drive-test-batch.sh
# Run one batch of 4 drives through full testing pipeline

DRIVES=("$@")  # pass 4 drives as args
LOGDIR="/root/drive-health-logs"
mkdir -p "$LOGDIR"

for DRIVE in "${DRIVES[@]}"; do
    SERIAL=$(smartctl -a "$DRIVE" | grep 'Serial Number' | awk '{print $NF}')
    SIZE=$(smartctl -a "$DRIVE" | grep 'User Capacity' | awk '{print $5,$6}')

    echo "==========" | tee -a "$LOGDIR/batch-$(date +%Y%m%d).log"
    echo "Drive: $DRIVE | Serial: $SERIAL | Size: $SIZE" | tee -a "$LOGDIR/batch-$(date +%Y%m%d).log"
    echo "==========" | tee -a "$LOGDIR/batch-$(date +%Y%m%d).log"

    # Phase 1: SMART long test
    echo "[$(date)] Phase 1: SMART long test" | tee -a "$LOGDIR/batch-$(date +%Y%m%d).log"
    smartctl -t long "$DRIVE" 2>&1 | tee -a "$LOGDIR/batch-$(date +%Y%m%d).log"

    # Wait for SMART test to complete (poll every 60s).
    # smartctl -H reports overall health immediately, so poll the self-test
    # execution status from smartctl -c (ATA/SATA drives) instead.
    while true; do
        if ! smartctl -c "$DRIVE" | grep -q 'Self-test routine in progress'; then
            STATUS=$(smartctl -H "$DRIVE" | grep 'SMART overall-health' | awk '{print $NF}')
            echo "[$(date)] SMART test result: $STATUS" | tee -a "$LOGDIR/batch-$(date +%Y%m%d).log"
            break
        fi
        echo "[$(date)] Waiting for SMART test to complete..." \
            | tee -a "$LOGDIR/batch-$(date +%Y%m%d).log"
        sleep 60
    done

    # Capture SMART attributes
    echo "[$(date)] Capturing SMART attributes" | tee -a "$LOGDIR/batch-$(date +%Y%m%d).log"
    smartctl -a "$DRIVE" > "$LOGDIR/smart-$(basename "$DRIVE")-$(date +%Y%m%d).log"

    # Phase 2: badblocks (only if SMART passed)
    if [ "$STATUS" = "PASSED" ]; then
        echo "[$(date)] Phase 2: badblocks non-destructive scan" | tee -a "$LOGDIR/batch-$(date +%Y%m%d).log"
        # -b 4096 keeps the block count within badblocks' 32-bit limit on the 8TB/10TB drives
        badblocks -b 4096 -nvs "$DRIVE" > "$LOGDIR/badblocks-$(basename "$DRIVE")-$(date +%Y%m%d).log" 2>&1
        echo "[$(date)] badblocks complete" | tee -a "$LOGDIR/batch-$(date +%Y%m%d).log"
    else
        echo "[$(date)] SKIPPING badblocks — SMART test $STATUS" | tee -a "$LOGDIR/batch-$(date +%Y%m%d).log"
    fi
done

echo "[$(date)] Batch complete. Results in $LOGDIR"
```

### Current Drive Availability (2026-04-18 update)

- **2×4TB hot** (SATA power available): WDC WD40EFRX (sdb), MDD4000GSA (sde)
- **4×4TB cold** (no power connectors available yet) — swap in batches for testing
- **3×3TB, 8TB, 10TB** — also cold, same power limitation

### Drive Batch Schedule

| Batch | Drives | Status |
|---|---|---|
| 1 | 2×4TB (sdb, sde) | **Ready to test** |
| 2 | 2×4TB | Cold swap |
| 3 | 2×4TB | Cold swap |
| 4 | 1×3TB + 8TB + 10TB | Cold swap |

### Pass/Fail Criteria

- **PASS:** SMART `Reallocated Sectors Count = 0`, `Current Pending Sectors = 0`, `Uncorrectable Errors = 0`, badblocks finds 0 bad sectors
- **CONDITIONAL:** Any reallocated/pending sectors — demote to non-critical vdev (3×3TB raidz1)
- **FAIL:** Any uncorrectable errors, badblocks bad sectors, or SMART health = FAILED — retire drive

### Next Steps

- [ ] Get drive inventory (exact models, ages, health)
- [ ] Finalize new rig hardware (cross-ref AI thread)
- [ ] Decide backup drive format (ZFS vs NTFS)
- [ ] Plan CasaOS migration steps when new rig is ready
- [ ] Set up SMART monitoring on all drives before pool creation
- [ ] Source cold spare 4TB drive (to keep on shelf for old-drive replacement)
- [ ] Run pre-pool 
  drive tests (batches 1-3)
````

## Notes

## Related

- No related pages yet.
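
As a sanity check on the capacity figures in the plan's Current Hardware and ZFS Topology sections, the sketch below recomputes them with the usual rule of thumb: a raidz vdev yields roughly (drive count minus parity drives) times the smallest drive. The helper names `raidz_usable` and `raw_total` are hypothetical, introduced only for this illustration. The arithmetic also ignores metadata overhead, padding, and the TB versus TiB gap, so real `zfs list` numbers will come in somewhat lower.

```bash
#!/bin/bash
# Hypothetical helpers for back-of-the-envelope capacity math (not part of ZFS).

# raidz_usable COUNT SIZE_TB PARITY
# Approximate usable TB of one raidz vdev: (drives - parity drives) x drive size.
raidz_usable() {
    local count="$1" size_tb="$2" parity="$3"
    echo $(( (count - parity) * size_tb ))
}

# raw_total COUNTxSIZE...
# Sum of raw capacity in TB over pairs such as 6x4 (six 4TB drives).
raw_total() {
    local total=0 pair
    for pair in "$@"; do
        # ${pair%x*} is the count before the "x", ${pair#*x} the size after it
        total=$(( total + ${pair%x*} * ${pair#*x} ))
    done
    echo "$total"
}

echo "raw:    $(raw_total 6x4 3x3 1x8 1x10) TB"
echo "raidz2: $(raidz_usable 5 4 2) TB"
echo "raidz1: $(raidz_usable 3 3 1) TB"
```

Run as written, this prints 51 TB raw for the eleven listed drives, 12 TB for the 5×4TB raidz2 vdev, and 6 TB for the 3×3TB raidz1 vdev.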