---
pageType: source
id: source.zfs-casaos
title: zfs-casaos
sourceType: local-file
sourcePath: /home/topher/.openclaw/workspace-crash-bot/projects/zfs-casaos.md
ingestedAt: 2026-05-02T21:21:16.853Z
updatedAt: 2026-05-02T21:21:16.853Z
status: active
growth: sprout
---

# zfs-casaos

## Source
- Type: `local-file`
- Path: `/home/topher/.openclaw/workspace-crash-bot/projects/zfs-casaos.md`
- Bytes: 10991
- Updated: 2026-05-02T21:21:16.853Z

## Content
````text
# ZFS on CasaOS: Project Plan

**Thread:** #setting-up-zfs-on-casaOS
**Status:** Planning
**Created:** 2026-04-16

## Overview
Migrate from current CasaOS setup to ZFS-backed storage with proper redundancy and an offsite backup strategy. Long-term play tied to new rig build.

## Current Hardware
- **OS:** CasaOS on existing Linux box
- **Drives (11 total, all spinners):** 6×4TB, 3×3TB, 1×8TB, 1×10TB = 51TB raw
- **Current usage:** Under 6TB
- **8TB + 10TB:** Currently in CasaOS (Plex, old machine backups, Minecraft servers, non-critical stuff)
- **SSDs available:** Smaller sizes (details TBD)
- **Note:** 2TB drive replaced with 4TB (2026-04-18 update)

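As a quick sanity check on the raw figure, the nominal sizes in the drive list can be summed in the shell. This is only arithmetic on the sizes listed above; the names `DRIVES_TB` and `raw_total_tb` are made up for the example.

```bash
#!/bin/bash
# Nominal sizes (TB) of the 11 drives listed above: 6x4, 3x3, 1x8, 1x10
DRIVES_TB=(4 4 4 4 4 4 3 3 3 8 10)

# raw_total_tb SIZE_TB... -> prints the sum of all sizes
raw_total_tb() {
    local total=0 size
    for size in "$@"; do
        total=$(( total + size ))
    done
    echo "$total"
}

echo "Drives: ${#DRIVES_TB[@]}, raw: $(raw_total_tb "${DRIVES_TB[@]}") TB"
```
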
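## ZFS Topology (Agreed)

### Main Pool ("tank")
- **Vdev 1:** 5×4TB raidz2 → ~12TB usable, 2-drive redundancy (the vault, irreplaceable data)
- **Vdev 2:** 3×3TB raidz1 → ~6TB usable, 1-drive redundancy (media, Minecraft, replaceable stuff)
- **Caveat:** ZFS stripes data across every vdev in a pool, so data cannot be pinned to one vdev and losing either vdev loses the whole pool. Keeping the vault separate from replaceable data means two pools, not two vdevs in `tank`.
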
### Backup (8TB + 10TB)
- Standalone backup targets, NOT part of the main pool
- Format TBD: ZFS for zfs send/recv OR ext4/NTFS for SHTF portability
- Consider: one ZFS for automated snapshots, one NTFS/exFAT for "grab and run" scenarios

### 4TB Drive (replaced 2TB)
- Scratch/overflow/ISO storage
- Poor L2ARC candidate: it is a spinner and L2ARC wants an SSD (probably not needed anyway, working set likely fits in RAM)
- Could also serve as hot spare for the raidz2 if desired

## Key Decisions Made
- **CasaOS stays:** not switching back to TrueNAS. Bare metal access (OpenClaw) matters.
- **ZFS layered underneath CasaOS:** CasaOS sees mountpoints, doesn't manage ZFS
- **Cockpit + ZFS plugin** for GUI management of ZFS layer
- **No dedup**, ever
- **Striped mirrors considered** but rejected for this drive mix (too much wasted capacity with mismatched sizes)
- **raidz2 on 5×4TB** chosen for main vdev: 2-drive redundancy, best for the vault
- **raidz1 on 3×3TB:** acceptable risk for replaceable data

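## Key ZFS Rules Learned
- Before OpenZFS 2.3, drives can't be added to an existing raidz vdev: only replace drives or add new vdevs. OpenZFS 2.3+ adds raidz expansion (`zpool attach`, one drive at a time)
- Mixed-size vdevs waste capacity (every drive is capped at the size of the smallest drive in the vdev)
- Without raidz expansion, a vdev only grows after ALL drives in it are replaced with larger ones (with `autoexpand=on`)
- Striped mirrors are the topology where a small swap pays off fastest: a mirror vdev grows as soon as both of its drives are replaced
- 50% capacity overhead on two-way mirrored setups
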
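The capacity rules above reduce to simple arithmetic, sketched here on nominal TB values. The function names are invented for the example, and the numbers ignore ZFS metadata, padding and TB/TiB differences, so real pools will report somewhat less.

```bash
#!/bin/bash
# raidz_usable_tb PARITY SIZE_TB...
# Nominal usable capacity: (drives - parity) * smallest drive.
raidz_usable_tb() {
    local parity=$1; shift
    local min=$1 size
    for size in "$@"; do
        if (( size < min )); then
            min=$size
        fi
    done
    echo $(( ($# - parity) * min ))
}

# mirror_usable_tb SIZE_TB SIZE_TB: a two-way mirror holds one copy per drive,
# so usable capacity is the smaller of the two drives.
mirror_usable_tb() {
    echo $(( $1 < $2 ? $1 : $2 ))
}

echo "5x4TB raidz2:    $(raidz_usable_tb 2 4 4 4 4 4) TB"
echo "3x3TB raidz1:    $(raidz_usable_tb 1 3 3 3) TB"
echo "3+4+4TB raidz1:  $(raidz_usable_tb 1 3 4 4) TB"
echo "4+4TB mirror:    $(mirror_usable_tb 4 4) TB"
```

The third line shows the mixed-size penalty: two 4TB drives paired with a 3TB drive yield the same 6TB as three 3TB drives.
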
## Offsite Backup Strategy (Long-Term)

### 3-2-1 Compliance
1. **Main copy:** raidz2 pool on new rig
2. **Local backup:** 8/10TB standalone drives
3. **Offsite backup:** Old box relocated to brewery after migration

### Brewery Infrastructure
- HA box (HAOS): hands off
- N100 (Batocera): hands off
- Older Mac: janky ZFS, skip
- Raspberry Pis: possible backup target but slow
- **Best option:** Old server box moved to brewery as dedicated backup target

### Brewery Backup Setup (Future)
- Fresh Linux install on old box
- ZFS pool on backup drives
- `zfs send/recv` over SSH for automated incremental snapshots
- Cron job or systemd timer

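The send/recv step could look roughly like the following dry run, which only prints the commands. Everything named here is a placeholder: the datasets `tank/vault` and `backup/vault`, the host `backup@brewery-box`, the snapshot naming by date, and the helper `build_send_cmd`.

```bash
#!/bin/bash
# Dry run: prints the replication pipeline instead of running it.
# Dataset, host and snapshot names are hypothetical placeholders.
SRC_DATASET="tank/vault"
DST_DATASET="backup/vault"
REMOTE="backup@brewery-box"

# build_send_cmd PREV_SNAP NEW_SNAP
# With an empty PREV_SNAP a full stream is sent (first run);
# otherwise only the changes between the two snapshots are sent.
build_send_cmd() {
    local prev=$1 new=$2 send
    if [ -z "$prev" ]; then
        send="zfs send $SRC_DATASET@$new"
    else
        send="zfs send -i $SRC_DATASET@$prev $SRC_DATASET@$new"
    fi
    echo "$send | ssh $REMOTE zfs receive -u $DST_DATASET"
}

TODAY=$(date +%Y-%m-%d)
echo "zfs snapshot $SRC_DATASET@$TODAY"
build_send_cmd "" "$TODAY"            # initial full send
build_send_cmd "2026-05-01" "$TODAY"  # later incremental run
```

A nightly schedule could then be a cron entry such as `0 3 * * * /usr/local/sbin/zfs-offsite.sh` (path hypothetical) or an equivalent systemd timer.
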
## Migration Roadmap
1. Build new rig (see AI thread for hardware planning)
2. Set up ZFS pools on new rig with planned topology
3. Migrate CasaOS + all services
4. Verify everything works on new iron
5. Wipe old box, fresh Linux + ZFS as backup target
6. Relocate old box to brewery
7. Set up zfs send/recv over SSH for nightly incremental backups
8. Decommission old setup with confidence

## SHTF Portability Notes
- ZFS: readable on Linux and FreeBSD; no native Windows support, macOS only through the third-party OpenZFS port
- ext4: Linux native, Windows/Mac need tools
- NTFS/exFAT: Universal, any random box can read
- Consider keeping one backup drive as NTFS for "grab and run" scenarios
- Linux live USB can read ext4 on any machine in a pinch, and ZFS too if the live image ships the ZFS modules

## Special ZFS Drive Uses (Reference)
| Type | Use | Drive Type | Verdict |
|---|---|---|---|
| L2ARC | Read cache | SSD only | Skip unless ARC hit rate is low |
| SLOG/ZIL | Sync write log | SSD with PLP | Only for NFS/VMs/databases |
| Special vdev | Metadata storage | SSD preferred | High risk if not mirrored, overkill for home |
| Dedup vdev | Dedup tables | SSD | NO. Just no. |

## SMART Monitoring (Critical for Old Drives)

### Why It Matters
- All drives are old with no RMA coverage: failure = replace from own pocket
- SMART warnings can give days to weeks of notice before total failure, though some drives die with no warning at all
- Running a degraded pool with old drives is risking data loss

### Key Metrics to Watch
| Attribute | Warning Threshold | Critical |
|---|---|---|
| Reallocated Sectors (5) | >0 | >100 |
| Current Pending Sectors (197) | >0 | >10 |
| Uncorrectable Errors (198) | >0 | any |
| Power-On Hours (9) | - | >50,000 |
| Temperature (194) | >40°C | >45°C |

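The thresholds in the table can be turned into a small classifier, sketched below. The function name `classify_attr` is made up, it expects the raw value as a plain integer (some drives report temperature with extra text that would need stripping first), and it reads "any" for attribute 198 as "any nonzero value is critical".

```bash
#!/bin/bash
# classify_attr ID RAW -> prints OK, WARNING or CRITICAL
# Thresholds are copied from the table above.
classify_attr() {
    local id=$1 raw=$2 warn crit
    case $id in
        5)   warn=0;     crit=100   ;;  # Reallocated Sectors
        197) warn=0;     crit=10    ;;  # Current Pending Sectors
        198) warn=0;     crit=0     ;;  # Uncorrectable: any nonzero value is critical
        9)   warn=50000; crit=50000 ;;  # Power-On Hours: no warning tier
        194) warn=40;    crit=45    ;;  # Temperature in degrees C
        *)   echo "UNKNOWN"; return 1 ;;
    esac
    if (( raw > crit )); then
        echo "CRITICAL"
    elif (( raw > warn )); then
        echo "WARNING"
    else
        echo "OK"
    fi
}

classify_attr 5 0      # clean drive
classify_attr 5 8      # a few reallocated sectors
classify_attr 197 12   # many pending sectors
classify_attr 194 42   # running warm
```
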
### Setup
```bash
# Install smartmontools
sudo apt install smartmontools

# Enable and start smartd
sudo systemctl enable smartd
sudo systemctl start smartd

# Run short test (a few minutes)
sudo smartctl -t short /dev/sdX

# Run long/comprehensive test (often several hours on multi-TB drives;
# `smartctl -c` shows the drive's own estimate)
sudo smartctl -t long /dev/sdX

# Check results
sudo smartctl -l selftest /dev/sdX
sudo smartctl -a /dev/sdX
```

### smartd.conf Configuration
```
# Email alerts on smartd warnings
DEVICESCAN -a -m user@example.com -M daily

# Or per-drive with specific schedules (use instead of DEVICESCAN:
# smartd ignores any lines that follow a DEVICESCAN directive)
# Short test daily at 2am, long test Saturdays at 3am
/dev/sda -a -m admin@example.com -M daily -s (S/../.././02|L/../../6/03)
```

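The `-s` argument is a regular expression matched against strings of the form `T/MM/DD/d/HH` (test type, month, day, day of week with 1 = Monday, hour). The sketch below imitates that matching in bash to show which tests the pattern above would trigger at a given time. The helper `scheduled_tests` is invented for the example, and anchoring the pattern with `^...$` is an approximation of smartd's behaviour, not a copy of it.

```bash
#!/bin/bash
# The -s regex from the per-drive example above
SCHEDULE='(S/../.././02|L/../../6/03)'

# scheduled_tests MM DD DOW HH -> prints which test types (S, L) would match.
# DOW is 1 (Monday) to 7 (Sunday), as in smartd.conf.
scheduled_tests() {
    local mm=$1 dd=$2 dow=$3 hh=$4 type out=""
    for type in S L; do
        if [[ "$type/$mm/$dd/$dow/$hh" =~ ^${SCHEDULE}$ ]]; then
            out+="$type"
        fi
    done
    echo "${out:-none}"
}

scheduled_tests 04 15 3 02   # a Wednesday at 02:00
scheduled_tests 04 18 6 03   # a Saturday at 03:00
scheduled_tests 04 18 6 14   # a Saturday at 14:00
```
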
### Power Cycle Count (12): Also Worth Tracking
- Drive spinup count. High number = old drive that's been powered on/off a lot
- Not a failure predictor on its own, but tells you wear history

### Pre-Pool Drive Health Check
Before building the pool, run full SMART tests on all drives:
```bash
smartctl -t long /dev/sda
smartctl -t long /dev/sdb
# ... etc
```
Then check the key attributes with `smartctl -a /dev/sdX | grep -E '(Reallocated|Current_Pending|Uncorrectable)'`.

Drives with any reallocated sectors or pending sectors should be considered questionable: use for non-critical vdevs or retire entirely.

### ZFS Integration
ZFS doesn't do its own SMART polling, but `zpool status` shows drive errors. A rising error count in `zpool status` alongside SMART warnings = replace that drive now.

```bash
# Check for ZFS errors
zpool status -v
```

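For unattended checks, `zpool status -x` is convenient because it prints the single line `all pools are healthy` when nothing is wrong. A minimal sketch of a cron-friendly wrapper follows; the function names are made up, and anything other than the healthy line (including "no pools available") is treated as needing attention.

```bash
#!/bin/bash
# needs_attention STATUS_TEXT -> succeeds unless the text is the
# "all pools are healthy" line that `zpool status -x` prints.
needs_attention() {
    [ "$1" != "all pools are healthy" ]
}

# check_pools: prints a short report; returns 1 on problems, 2 if zpool is missing
check_pools() {
    command -v zpool >/dev/null || { echo "zpool not found"; return 2; }
    local status
    status=$(zpool status -x 2>&1)
    if needs_attention "$status"; then
        echo "ZFS needs attention:"
        echo "$status"
        return 1
    fi
    echo "ZFS OK"
}
```

Calling `check_pools` from cron and mailing its output on a nonzero exit would complement the smartd alerts.
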
## Pre-Pool Drive Testing (2026-04-18)

### Strategy
- 11 drives to test, 4 at a time (SATA port limitation)
- Sequential testing: run one batch, swap drives, run next batch
- Two-phase testing per drive:
  1. SMART long test (several hours, scaling with drive size; `smartctl -c` shows the drive's estimate)
  2. badblocks non-destructive scan (6-12 hours per 4TB drive)

### Phase 1: SMART Long Tests
Tests all drives in current batch simultaneously.
```bash
#!/bin/bash
# drive-smart-test.sh
# Usage: ./drive-smart-test.sh /dev/sda /dev/sdb /dev/sdc /dev/sdd
# smartctl -t returns immediately; the tests then run in parallel
# inside the drives' firmware.

DRIVES=("$@")
LOGDIR="/root/drive-health-logs"
mkdir -p "$LOGDIR"

for DRIVE in "${DRIVES[@]}"; do
    echo "[$(date)] Starting SMART long test on $DRIVE" | tee -a "$LOGDIR/test.log"
    smartctl -t long "$DRIVE" 2>&1 | tee -a "$LOGDIR/test.log"
    echo "[$(date)] SMART long test started on $DRIVE" | tee -a "$LOGDIR/test.log"
done

echo "[$(date)] All SMART tests initiated. Check results with:"
echo "  smartctl -l selftest /dev/sdX"
echo "  smartctl -a /dev/sdX | grep -E '(Reallocated|Current_Pending|Uncorrectable|Power_On_Hours)'"
```

### Phase 2: badblocks Scan (Sequential)
One drive at a time to avoid port contention.
```bash
#!/bin/bash
# drive-badblocks.sh
# Usage: ./drive-badblocks.sh /dev/sda

DRIVE="$1"
LOGDIR="/root/drive-health-logs"

if [ -z "$DRIVE" ]; then
    echo "Usage: $0 /dev/sdX"
    exit 1
fi

mkdir -p "$LOGDIR"
LOGFILE="$LOGDIR/badblocks-$(basename "$DRIVE").log"

echo "[$(date)] Starting badblocks non-destructive scan on $DRIVE" | tee "$LOGFILE"
# -n = non-destructive read-write test; the drive must not be mounted.
# -b 4096 keeps the block count within badblocks' 32-bit limit on the 8TB/10TB drives.
badblocks -b 4096 -nvs "$DRIVE" 2>&1 | tee -a "$LOGFILE"
echo "[$(date)] badblocks complete on $DRIVE" | tee -a "$LOGFILE"
```

### Full Sequential Test Workflow
```bash
#!/bin/bash
# drive-test-batch.sh
# Run one batch of 4 drives through full testing pipeline, one drive at a time

DRIVES=("$@")  # pass 4 drives as args
LOGDIR="/root/drive-health-logs"
LOG="$LOGDIR/batch-$(date +%Y%m%d).log"
mkdir -p "$LOGDIR"

for DRIVE in "${DRIVES[@]}"; do
    NAME=$(basename "$DRIVE")
    SERIAL=$(smartctl -a "$DRIVE" | grep 'Serial Number' | awk '{print $NF}')
    SIZE=$(smartctl -a "$DRIVE" | grep 'User Capacity' | awk '{print $5,$6}')

    echo "==========" | tee -a "$LOG"
    echo "Drive: $DRIVE | Serial: $SERIAL | Size: $SIZE" | tee -a "$LOG"
    echo "==========" | tee -a "$LOG"

    # Phase 1: SMART long test
    echo "[$(date)] Phase 1: SMART long test" | tee -a "$LOG"
    smartctl -t long "$DRIVE" 2>&1 | tee -a "$LOG"

    # Wait for SMART test to complete (poll every 60s).
    # `smartctl -H` answers immediately, so poll the self-test execution status instead.
    while smartctl -c "$DRIVE" | grep -q 'Self-test routine in progress'; do
        echo "[$(date)] Waiting for SMART test to complete..." | tee -a "$LOG"
        sleep 60
    done

    # Result = overall health plus the outcome of the newest self-test log entry
    STATUS=$(smartctl -H "$DRIVE" | grep 'SMART overall-health' | awk '{print $NF}')
    if ! smartctl -l selftest "$DRIVE" | grep -m1 '^# *1 ' | grep -q 'Completed without error'; then
        STATUS="FAILED"
    fi
    echo "[$(date)] SMART test result: $STATUS" | tee -a "$LOG"

    # Capture SMART attributes
    echo "[$(date)] Capturing SMART attributes" | tee -a "$LOG"
    smartctl -a "$DRIVE" > "$LOGDIR/smart-$NAME-$(date +%Y%m%d).log"

    # Phase 2: badblocks (only if SMART passed)
    if [ "$STATUS" = "PASSED" ]; then
        echo "[$(date)] Phase 2: badblocks non-destructive scan" | tee -a "$LOG"
        # -b 4096 keeps the block count within badblocks' 32-bit limit on the 8TB/10TB drives
        badblocks -b 4096 -nvs "$DRIVE" > "$LOGDIR/badblocks-$NAME-$(date +%Y%m%d).log" 2>&1
        echo "[$(date)] badblocks complete" | tee -a "$LOG"
    else
        echo "[$(date)] SKIPPING badblocks: SMART test $STATUS" | tee -a "$LOG"
    fi
done

echo "[$(date)] Batch complete. Results in $LOGDIR"
```

### Current Drive Availability (2026-04-18 update)
- **2×4TB hot** (SATA power available): WDC WD40EFRX (sdb), MDD4000GSA (sde)
- **4×4TB cold** (no power connectors available yet): swap in batches for testing
- **3×3TB, 8TB, 10TB:** also cold, same power limitation

### Drive Batch Schedule
| Batch | Drives | Status |
|---|---|---|
| 1 | 2×4TB (sdb, sde) | **Ready to test** |
| 2 | 2×4TB | Cold swap |
| 3 | 2×4TB | Cold swap |
| 4 | 1×3TB + 8TB + 10TB | Cold swap |

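A small helper can split the full drive list into batches of any size, which is a handy cross-check that every one of the 11 drives lands in some batch. This is a sketch: `make_batches` and the drive labels are placeholders, not real device names.

```bash
#!/bin/bash
# make_batches SIZE DRIVE... -> prints one batch per line
make_batches() {
    local size=$1; shift
    local n=1
    while (( $# > 0 )); do
        echo "Batch $n: ${*:1:$size}"
        # Drop the drives just printed (fewer than SIZE in the last batch)
        shift $(( $# < size ? $# : size ))
        n=$(( n + 1 ))
    done
}

# Labels stand in for the 11 drives; two at a time matches the power limit
make_batches 2 4TB-a 4TB-b 4TB-c 4TB-d 4TB-e 4TB-f 3TB-a 3TB-b 3TB-c 8TB 10TB
```
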
### Pass/Fail Criteria
- **PASS:** SMART `Reallocated Sectors Count = 0`, `Current Pending Sectors = 0`, `Uncorrectable Errors = 0`, badblocks finds 0 bad sectors
- **CONDITIONAL:** Any reallocated/pending sectors: demote to non-critical vdev (3×3TB raidz1)
- **FAIL:** Any uncorrectable errors, badblocks bad sectors, or SMART health = FAILED: retire drive

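The criteria above can be written as a single decision function, sketched here with FAIL checked first so that it takes precedence over CONDITIONAL. The name `drive_verdict` and the argument order are made up; the counts are assumed to have been extracted from `smartctl` and `badblocks` output already.

```bash
#!/bin/bash
# drive_verdict REALLOCATED PENDING UNCORRECTABLE BADBLOCKS HEALTH
# HEALTH is the smartctl -H result (PASSED or FAILED); the rest are counts.
drive_verdict() {
    local realloc=$1 pending=$2 uncorr=$3 bad=$4 health=$5
    if (( uncorr > 0 || bad > 0 )) || [ "$health" != "PASSED" ]; then
        echo "FAIL"
    elif (( realloc > 0 || pending > 0 )); then
        echo "CONDITIONAL"
    else
        echo "PASS"
    fi
}

drive_verdict 0 0 0 0 PASSED   # clean drive
drive_verdict 8 0 0 0 PASSED   # reallocated sectors only
drive_verdict 0 0 1 0 PASSED   # uncorrectable error
```
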
### Next Steps
- [ ] Get drive inventory (exact models, ages, health)
- [ ] Finalize new rig hardware (cross-ref AI thread)
- [ ] Decide backup drive format (ZFS vs NTFS)
- [ ] Plan CasaOS migration steps when new rig is ready
- [ ] Set up SMART monitoring on all drives before pool creation
- [ ] Source cold spare 4TB drive (to keep on shelf for old-drive replacement)
- [ ] Run pre-pool drive tests (batches 1-3)
````

## Notes
<!-- openclaw:human:start -->
<!-- openclaw:human:end -->

## Related
<!-- openclaw:wiki:related:start -->
### Referenced By

- [ai-rig-upgrade](sources/ai-rig-upgrade.md)
<!-- openclaw:wiki:related:end -->