---
pageType: source
id: source.zfs-casaos
title: zfs-casaos
sourceType: local-file
sourcePath: /home/topher/.openclaw/workspace-crash-bot/projects/zfs-casaos.md
ingestedAt: 2026-05-02T21:21:16.853Z
updatedAt: 2026-05-02T21:21:16.853Z
status: active
growth: sprout
---

# zfs-casaos

## Source
- Type: `local-file`
- Path: `/home/topher/.openclaw/workspace-crash-bot/projects/zfs-casaos.md`
- Bytes: 10991
- Updated: 2026-05-02T21:21:16.853Z

## Content
````text
# ZFS on CasaOS — Project Plan

**Thread:** #setting-up-zfs-on-casaOS
**Status:** Planning
**Created:** 2026-04-16

## Overview
Migrate from current CasaOS setup to ZFS-backed storage with proper redundancy and an offsite backup strategy. Long-term play tied to new rig build.

## Current Hardware
- **OS:** CasaOS on existing Linux box
- **Drives (11 total, all spinners):** 6×4TB, 3×3TB, 1×8TB, 1×10TB — 51TB raw
- **Current usage:** Under 6TB
- **8TB + 10TB:** Currently in CasaOS — Plex, old machine backups, Minecraft servers, non-critical stuff
- **SSDs available:** Smaller sizes (details TBD)
- **Note:** 2TB drive replaced with 4TB (2026-04-18 update)

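## ZFS Topology (Agreed)

### Main Pool ("tank")
- **Vdev 1:** 5×4TB raidz2 → ~12TB usable, 2-drive redundancy (the vault — irreplaceable data)
- **Vdev 2:** 3×3TB raidz1 → ~6TB usable, 1-drive redundancy (media, Minecraft, replaceable stuff)
- **Caveat:** ZFS stripes writes across every vdev in a pool, so datasets can't be pinned to one vdev, and losing either vdev loses the whole pool. Keeping the vault strictly on raidz2 means making the raidz1 set its own pool.
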
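The usable figures above follow from (drives minus parity) × drive size. A minimal sketch of that arithmetic, assuming equal-size drives per vdev and ignoring ZFS metadata overhead and TB/TiB rounding; the function name `raidz_usable_tb` is made up for illustration:

```bash
# raidz_usable_tb DRIVES SIZE_TB PARITY
# Usable capacity of one raidz vdev: parity drives hold no user data.
raidz_usable_tb() {
    echo $(( ($1 - $3) * $2 ))
}

raidz_usable_tb 5 4 2   # vdev 1: 5×4TB raidz2 -> 12
raidz_usable_tb 3 3 1   # vdev 2: 3×3TB raidz1 -> 6
```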
### Backup (8TB + 10TB)
- Standalone backup targets, NOT part of the main pool
- Format TBD: ZFS for zfs send/recv OR ext4/NTFS for SHTF portability
- Consider: one ZFS for automated snapshots, one NTFS/exFAT for "grab and run" scenarios

### 4TB Drive (replaced 2TB)
- Scratch/overflow/ISO storage
- Poor L2ARC candidate: it's a spinner, L2ARC only pays off on SSD, and the working set likely fits in RAM anyway
- Could also serve as hot spare for the raidz2 if desired

## Key Decisions Made
- **CasaOS stays** — not switching back to TrueNAS. Bare metal access (OpenClaw) matters.
- **ZFS layered underneath CasaOS** — CasaOS sees mountpoints, doesn't manage ZFS
- **Cockpit + ZFS plugin** for GUI management of ZFS layer
- **No dedup** — ever
- **Striped mirrors considered** but rejected for this drive mix (too much wasted capacity with mismatched sizes)
- **raidz2 on 5×4TB** chosen for main vdev — 2-drive redundancy, best for the vault
- **raidz1 on 3×3TB** — acceptable risk for replaceable data

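The raidz2 + raidz1 layout decided above could be created roughly as follows. This is a dry-run sketch: it only prints the `zpool create` command so it can be reviewed before anything touches a disk. The disk names are placeholders, `ashift=12` is an assumption (4K-sector drives), and `print_pool_cmd` is a hypothetical helper, not part of ZFS. `zpool create` is likely to complain about mismatched replication levels and ask for `-f`, since the two vdevs differ in parity.

```bash
# print_pool_cmd: hypothetical helper that prints (does not run) the
# zpool create command for the documented layout.
# Disk names are placeholders; substitute real /dev/disk/by-id names.
print_pool_cmd() {
    vault="4tb-a 4tb-b 4tb-c 4tb-d 4tb-e"   # 5×4TB -> raidz2
    media="3tb-a 3tb-b 3tb-c"               # 3×3TB -> raidz1
    cmd="zpool create -o ashift=12 tank raidz2"
    for d in $vault; do cmd="$cmd /dev/disk/by-id/$d"; done
    cmd="$cmd raidz1"
    for d in $media; do cmd="$cmd /dev/disk/by-id/$d"; done
    echo "$cmd"
}

print_pool_cmd
```

Using `/dev/disk/by-id` names rather than `/dev/sdX` keeps the pool importable when drive letters shuffle between boots.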
## Key ZFS Rules Learned
- Before OpenZFS 2.3, drives can't be added to an existing raidz vdev (only replaced, or new vdevs added); 2.3+ supports raidz expansion one drive at a time
- Mixed-size vdevs waste capacity (every drive is capped at the smallest drive's size)
- A vdev only grows after ALL drives in it are replaced with larger ones (and `autoexpand=on` is set or `zpool online -e` is run)
- Striped mirrors are the cheapest topology to grow: swap both drives of one mirror, or add another mirror pair, and capacity goes up
- 50% overhead on 2-way mirrored setups

## Offsite Backup Strategy (Long-Term)

### 3-2-1 Compliance
1. **Main copy:** raidz2 pool on new rig
2. **Local backup:** 8/10TB standalone drives
3. **Offsite backup:** Old box relocated to brewery after migration

### Brewery Infrastructure
- HA box (HAOS) — hands off
- N100 (Batocera) — hands off
- Older Mac — janky ZFS, skip
- Raspberry Pis — possible backup target but slow
- **Best option:** Old server box moved to brewery as dedicated backup target

### Brewery Backup Setup (Future)
- Fresh Linux install on old box
- ZFS pool on backup drives
- `zfs send/recv` over SSH for automated incremental snapshots
- Cron job or systemd timer

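A rough sketch of the incremental replication described above. It only prints the commands (dry run) so they can be checked first. The source pool `tank`, target pool `backup`, SSH host `brewery-box`, and the snapshot naming are all assumptions for illustration, and `print_replication_cmds` is a hypothetical helper.

```bash
# print_replication_cmds PREV_SNAP NEW_SNAP
# Prints the snapshot + incremental send/recv pipeline instead of running it.
# Assumed names: source pool "tank", target pool "backup", SSH host "brewery-box".
print_replication_cmds() {
    prev="$1"; new="$2"
    # Recursive snapshot of the whole pool
    echo "zfs snapshot -r tank@$new"
    # Send only the changes since PREV_SNAP; receive unmounted on the far side
    echo "zfs send -R -i tank@$prev tank@$new | ssh brewery-box zfs recv -Fdu backup"
}

print_replication_cmds nightly-2026-05-01 nightly-2026-05-02
```

Once the printed commands look right, the same two lines could go into a script run nightly from cron or a systemd timer. The first run needs a full send (no `-i`), since the target has no earlier snapshot to build on.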
## Migration Roadmap
1. Build new rig (see AI thread for hardware planning)
2. Set up ZFS pools on new rig with planned topology
3. Migrate CasaOS + all services
4. Verify everything works on new iron
5. Wipe old box, fresh Linux + ZFS as backup target
6. Relocate old box to brewery
7. Set up zfs send/recv over SSH for nightly incremental backups
8. Decommission old setup with confidence

## SHTF Portability Notes
- ZFS: native on Linux and FreeBSD; Windows and macOS only through third-party OpenZFS ports
- ext4: Linux native, Windows/Mac need tools
- NTFS/exFAT: Universal — any random box can read
- Consider keeping one backup drive as NTFS for "grab and run" scenarios
- Linux live USB can read ZFS/ext4 on any machine in a pinch

## Special ZFS Drive Uses (Reference)
| Type | Use | Drive Type | Verdict |
|---|---|---|---|
| L2ARC | Read cache | SSD only | Skip unless ARC hit rate is low |
| SLOG/ZIL | Sync write log | SSD with PLP | Only for NFS/VMs/databases |
| Special vdev | Metadata storage | SSD preferred | High risk if not mirrored, overkill for home |
| Dedup vdev | Dedup tables | SSD | NO. Just no. |

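The L2ARC verdict hinges on the ARC hit rate. On Linux, OpenZFS exposes the counters in `/proc/spl/kstat/zfs/arcstats`; the sketch below turns the `hits` and `misses` rows into a percentage. The helper name `arc_hit_pct` is made up, and it takes the file path as an argument so it can be tried on a sample file first.

```bash
# arc_hit_pct FILE: print the ARC hit rate (%) from an arcstats-style file
# (rows of "name type value"; only the hits and misses rows are used).
arc_hit_pct() {
    awk '$1 == "hits"   { h = $3 }
         $1 == "misses" { m = $3 }
         END { if (h + m > 0) printf "%.1f\n", 100 * h / (h + m) }' "$1"
}

# On a live system:
#   arc_hit_pct /proc/spl/kstat/zfs/arcstats
```

A hit rate that stays high under normal load suggests the working set already fits in RAM, which is the "skip" case in the table.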
## SMART Monitoring (Critical for Old Drives)

### Why It Matters
- All drives are old with no RMA coverage — failure = replace from own pocket
- SMART warnings can give days to weeks of notice before total failure, though some drives die with no warning at all
- Running a degraded pool with old drives is risking data loss

### Key Metrics to Watch
| Attribute (ID) | Warning Threshold | Critical |
|---|---|---|
| Reallocated Sectors (5) | >0 | >100 |
| Current Pending Sectors (197) | >0 | >10 |
| Uncorrectable Errors (198) | — | >0 |
| Power-On Hours (9) | — | >50,000 |
| Temperature (194) | >40°C | >45°C |

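The thresholds in the table can be encoded in a few lines of shell. A sketch only: `smart_level` is a hypothetical helper that takes an attribute ID and its raw value, and it covers just the rows listed above.

```bash
# smart_level ID RAW: classify a raw SMART value as ok / warning / critical
# using the thresholds from the table above.
smart_level() {
    id="$1"; raw="$2"
    case "$id" in
        5)   warn=0;     crit=100 ;;    # Reallocated Sectors
        197) warn=0;     crit=10 ;;     # Current Pending Sectors
        198) warn=0;     crit=0 ;;      # Uncorrectable Errors: any is critical
        9)   warn=50000; crit=50000 ;;  # Power-On Hours: critical level only
        194) warn=40;    crit=45 ;;     # Temperature in °C
        *)   echo "unknown"; return 1 ;;
    esac
    if [ "$raw" -gt "$crit" ]; then echo "critical"
    elif [ "$raw" -gt "$warn" ]; then echo "warning"
    else echo "ok"
    fi
}

smart_level 5 8      # a few reallocated sectors -> warning
smart_level 194 38   # cool drive -> ok
```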
### Setup
```bash
# Install smartmontools
sudo apt install smartmontools

# Enable and start smartd
# (on some Debian/Ubuntu releases the unit is named smartmontools)
sudo systemctl enable smartd
sudo systemctl start smartd

# Run short test (usually a few minutes)
sudo smartctl -t short /dev/sdX

# Run long/comprehensive test (hours; scales with capacity,
# `smartctl -c /dev/sdX` prints the drive's own estimate)
sudo smartctl -t long /dev/sdX

# Check results
sudo smartctl -l selftest /dev/sdX
sudo smartctl -a /dev/sdX
```

### smartd.conf Configuration
```
# Email alerts on smartd warnings
# (smartd ignores every entry after a DEVICESCAN line, so use this OR the per-drive form)
DEVICESCAN -a -m user@example.com -M daily

# Or per-drive with specific schedules
# Short test daily at 2am, long test Saturdays at 3am
/dev/sda -a -m admin@example.com -M daily -s (S/../.././02|L/../../6/03)
```

### Power Cycle Count (12) — Also Worth Tracking
- Counts full power-off/power-on cycles. High number = old drive that’s been powered on/off a lot
- Not a failure predictor on its own, but tells you wear history

### Pre-Pool Drive Health Check
Before building the pool, run full SMART tests on all drives:
```bash
sudo smartctl -t long /dev/sda
sudo smartctl -t long /dev/sdb
# ... etc
```
And check `sudo smartctl -a /dev/sdX | grep -E '(Reallocated|Current_Pending|Uncorrectable)'`

Drives with any reallocated sectors or pending sectors should be considered questionable — use for non-critical vdevs or retire entirely.

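The `grep` check can be tightened so it only prints attributes whose raw value is actually non-zero. A sketch assuming the usual ATA attribute table printed by `smartctl -A`, where the raw value is the tenth column; `smart_bad_counts` is a hypothetical name.

```bash
# smart_bad_counts: read `smartctl -A` output on stdin and print
# "NAME RAW" for attributes 5, 197 and 198 whose raw value is non-zero.
smart_bad_counts() {
    awk '($1 == 5 || $1 == 197 || $1 == 198) && $10 + 0 > 0 { print $2, $10 }'
}

# Usage on a real drive (needs root):
#   smartctl -A /dev/sdX | smart_bad_counts
```

Empty output means all three counters are zero; any line printed marks the drive as questionable under the rule above.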
### ZFS Integration
ZFS doesn't do its own SMART polling, but `zpool status` shows per-device READ, WRITE and CKSUM error counts. A rising error count in `zpool status` alongside SMART warnings = replace that drive now.

```bash
# Check for ZFS errors
zpool status -v
```

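A sketch of how that check could be scripted: it reads `zpool status` output and lists every row with a non-zero READ, WRITE or CKSUM counter. `zpool_error_devices` is a hypothetical helper, and it relies on the column layout `zpool status` normally prints.

```bash
# zpool_error_devices: read `zpool status` output on stdin and print the
# name of every pool/vdev/disk row whose READ, WRITE or CKSUM count is not 0.
zpool_error_devices() {
    awk 'NF >= 5 &&
         $2 ~ /^(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)$/ &&
         ($3 != "0" || $4 != "0" || $5 != "0") { print $1 }'
}

# Usage:
#   zpool status tank | zpool_error_devices
```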
## Pre-Pool Drive Testing (2026-04-18)

### Strategy
- 11 drives to test, 4 at a time (SATA port limitation)
- Sequential testing — run one batch, swap drives, run next batch
- Two-phase testing per drive:
  1. SMART long test (several hours and up, scaling with drive size; `smartctl -c /dev/sdX` prints the drive's own estimate)
  2. badblocks non-destructive scan (reads and rewrites every block, so often a day or more per 4TB drive)

### Phase 1: SMART Long Tests
Starts the long test on every drive in the current batch; the drives then run their tests in parallel.
```bash
#!/bin/bash
# drive-smart-test.sh
# Usage: ./drive-smart-test.sh /dev/sda /dev/sdb /dev/sdc /dev/sdd

DRIVES=("$@")
LOGDIR="/root/drive-health-logs"
mkdir -p "$LOGDIR"

for DRIVE in "${DRIVES[@]}"; do
    echo "[$(date)] Starting SMART long test on $DRIVE" | tee -a "$LOGDIR/test.log"
    # smartctl returns right away; the test itself keeps running on the drive
    smartctl -t long "$DRIVE" 2>&1 | tee -a "$LOGDIR/test.log"
    echo "[$(date)] SMART long test started on $DRIVE" | tee -a "$LOGDIR/test.log"
done

echo "[$(date)] All SMART tests initiated. Check results with:"
echo "  smartctl -l selftest /dev/sdX"
echo "  smartctl -a /dev/sdX | grep -E '(Reallocated|Current_Pending|Uncorrectable|Power_On_Hours)'"
```

### Phase 2: badblocks Scan (Sequential)
One drive at a time to avoid port contention.
```bash
#!/bin/bash
# drive-badblocks.sh
# Usage: ./drive-badblocks.sh /dev/sda
# The drive must not be mounted or part of an imported pool.

DRIVE="$1"
LOGDIR="/root/drive-health-logs"

# Validate the argument before building paths from it
if [ -z "$DRIVE" ]; then
    echo "Usage: $0 /dev/sdX"
    exit 1
fi

mkdir -p "$LOGDIR"
LOGFILE="$LOGDIR/badblocks-$(basename "$DRIVE").log"

echo "[$(date)] Starting badblocks non-destructive scan on $DRIVE" | tee "$LOGFILE"
# -b 4096 keeps the block count within badblocks' 32-bit limit on the 8TB/10TB drives
badblocks -b 4096 -nvs "$DRIVE" 2>&1 | tee -a "$LOGFILE"
echo "[$(date)] badblocks complete on $DRIVE" | tee -a "$LOGFILE"
```

### Full Sequential Test Workflow
```bash
#!/bin/bash
# drive-test-batch.sh
# Run one batch of 4 drives through full testing pipeline

DRIVES=("$@")  # pass 4 drives as args
LOGDIR="/root/drive-health-logs"
STAMP="$(date +%Y%m%d)"
BATCHLOG="$LOGDIR/batch-$STAMP.log"
mkdir -p "$LOGDIR"

for DRIVE in "${DRIVES[@]}"; do
    NAME="$(basename "$DRIVE")"
    SERIAL=$(smartctl -i "$DRIVE" | grep 'Serial Number' | awk '{print $NF}')
    SIZE=$(smartctl -i "$DRIVE" | grep 'User Capacity' | awk '{print $5,$6}')

    echo "==========" | tee -a "$BATCHLOG"
    echo "Drive: $DRIVE | Serial: $SERIAL | Size: $SIZE" | tee -a "$BATCHLOG"
    echo "==========" | tee -a "$BATCHLOG"

    # Phase 1: SMART long test
    echo "[$(date)] Phase 1: SMART long test" | tee -a "$BATCHLOG"
    smartctl -t long "$DRIVE" 2>&1 | tee -a "$BATCHLOG"

    # Wait for SMART test to complete (poll every 60s).
    # `smartctl -H` answers immediately even mid-test, so poll the
    # self-test execution status from `smartctl -c` instead.
    while smartctl -c "$DRIVE" | grep -q 'in progress'; do
        echo "[$(date)] Waiting for SMART test to complete..." | tee -a "$BATCHLOG"
        sleep 60
    done

    STATUS=$(smartctl -H "$DRIVE" | grep 'SMART overall-health' | awk '{print $NF}')
    echo "[$(date)] SMART test result: $STATUS" | tee -a "$BATCHLOG"

    # Capture SMART attributes (the output also includes the self-test log)
    echo "[$(date)] Capturing SMART attributes" | tee -a "$BATCHLOG"
    smartctl -a "$DRIVE" > "$LOGDIR/smart-$NAME-$STAMP.log"

    # Phase 2: badblocks (only if SMART passed)
    if [ "$STATUS" = "PASSED" ]; then
        echo "[$(date)] Phase 2: badblocks non-destructive scan" | tee -a "$BATCHLOG"
        # -b 4096 keeps the block count within badblocks' 32-bit limit on the 8TB/10TB drives
        badblocks -b 4096 -nvs "$DRIVE" > "$LOGDIR/badblocks-$NAME-$STAMP.log" 2>&1
        echo "[$(date)] badblocks complete" | tee -a "$BATCHLOG"
    else
        echo "[$(date)] SKIPPING badblocks: SMART health $STATUS" | tee -a "$BATCHLOG"
    fi
done

echo "[$(date)] Batch complete. Results in $LOGDIR"
```

### Current Drive Availability (2026-04-18 update)
- **2×4TB hot** (SATA power available): WDC WD40EFRX (sdb), MDD4000GSA (sde)
- **4×4TB cold** (no power connectors available yet) — swap in batches for testing
- **3×3TB, 8TB, 10TB** — also cold, same power limitation

### Drive Batch Schedule
| Batch | Drives | Status |
|---|---|---|
| 1 | 2×4TB (sdb, sde) | **Ready to test** |
| 2 | 2×4TB | Cold swap |
| 3 | 2×4TB | Cold swap |
| 4 | 1×3TB + 8TB + 10TB | Cold swap |

### Pass/Fail Criteria
- **PASS:** SMART `Reallocated Sectors Count = 0`, `Current Pending Sectors = 0`, `Uncorrectable Errors = 0`, badblocks finds 0 bad sectors
- **CONDITIONAL:** Any reallocated/pending sectors — demote to non-critical vdev (3×3TB raidz1)
- **FAIL:** Any uncorrectable errors, badblocks bad sectors, or SMART health = FAILED — retire drive

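The criteria above reduce to a small decision function. A sketch, with `drive_verdict` and its argument order invented for illustration:

```bash
# drive_verdict REALLOC PENDING UNCORRECTABLE BADBLOCKS HEALTH
# HEALTH is the `smartctl -H` result (PASSED or FAILED).
drive_verdict() {
    if [ "$3" -gt 0 ] || [ "$4" -gt 0 ] || [ "$5" = "FAILED" ]; then
        echo "FAIL"          # retire the drive
    elif [ "$1" -gt 0 ] || [ "$2" -gt 0 ]; then
        echo "CONDITIONAL"   # demote to the raidz1 vdev
    else
        echo "PASS"
    fi
}

drive_verdict 0 0 0 0 PASSED   # PASS
drive_verdict 4 0 0 0 PASSED   # CONDITIONAL
drive_verdict 0 0 1 0 PASSED   # FAIL
```

FAIL is checked first so that a drive with both reallocated sectors and uncorrectable errors is retired rather than demoted.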
### Next Steps
- [ ] Get drive inventory (exact models, ages, health)
- [ ] Finalize new rig hardware (cross-ref AI thread)
- [ ] Decide backup drive format (ZFS vs NTFS)
- [ ] Plan CasaOS migration steps when new rig is ready
- [ ] Set up SMART monitoring on all drives before pool creation
- [ ] Source cold spare 4TB drive (to keep on shelf for old-drive replacement)
- [ ] Run pre-pool drive tests (batches 1-3)
````

## Notes
<!-- openclaw:human:start -->
<!-- openclaw:human:end -->

## Related
<!-- openclaw:wiki:related:start -->
### Referenced By

- [ai-rig-upgrade](sources/ai-rig-upgrade.md)
<!-- openclaw:wiki:related:end -->