---
pageType: source
id: source.zfs-casaos-project
title: ZFS-CasaOS-Project
sourceType: local-file
sourcePath: /home/topher/.openclaw/workspace-crash-bot/projects/zfs-casaos.md
ingestedAt: 2026-05-02T21:02:31.564Z
updatedAt: 2026-05-02T21:02:31.564Z
status: active
growth: sprout
---
# ZFS-CasaOS-Project
## Source
- Type: `local-file`
- Path: `/home/topher/.openclaw/workspace-crash-bot/projects/zfs-casaos.md`
- Bytes: 10991
- Updated: 2026-05-02T21:02:31.564Z
## Content
````text
# ZFS on CasaOS — Project Plan
**Thread:** #setting-up-zfs-on-casaOS
**Status:** Planning
**Created:** 2026-04-16
## Overview
Migrate from current CasaOS setup to ZFS-backed storage with proper redundancy and an offsite backup strategy. Long-term play tied to new rig build.
## Current Hardware
- **OS:** CasaOS on existing Linux box
- **Drives (11 total, all spinners):** 6×4TB, 3×3TB, 1×8TB, 1×10TB (51TB raw)
- **Current usage:** Under 6TB
- **8TB + 10TB:** Currently in CasaOS — Plex, old machine backups, Minecraft servers, non-critical stuff
- **SSDs available:** Smaller sizes (details TBD)
- **Note:** 2TB drive replaced with 4TB (2026-04-18 update)
## ZFS Topology (Agreed)
### Main Pool ("tank")
- **Vdev 1:** 5×4TB raidz2 → ~12TB usable, 2-drive redundancy (the vault — irreplaceable data)
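- **Vdev 2:** 3×3TB raidz1 → ~6TB usable, 1-drive redundancy (media, Minecraft, replaceable stuff)
- **Caveat:** ZFS stripes data across every vdev in a pool, so vault data cannot be pinned to the raidz2 vdev, and losing the raidz1 vdev would take the whole pool with it. Keeping the two tiers truly separate means two pools, one per vdev.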
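
The usable figures above follow from a simple rule: a raidz vdev of n equal drives with p parity drives yields roughly (n - p) × drive size, before ZFS metadata and padding overhead. A minimal sketch of that arithmetic, where the function name `raidz_usable_tb` is purely illustrative:

```bash
#!/bin/bash
# Rough usable capacity of a raidz vdev built from equal-size drives.
# Ignores ZFS metadata, padding, and the TB-vs-TiB difference.
# Args: drive_count drive_size_tb parity_drives
raidz_usable_tb() {
  local count=$1 size_tb=$2 parity=$3
  echo $(( (count - parity) * size_tb ))
}

echo "raidz2, 5x4TB: ~$(raidz_usable_tb 5 4 2)TB usable"
echo "raidz1, 3x3TB: ~$(raidz_usable_tb 3 3 1)TB usable"
```

Real pools report somewhat less than these numbers, so treat them as upper bounds for planning.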
### Backup (8TB + 10TB)
- Standalone backup targets, NOT part of the main pool
- Format TBD: ZFS for zfs send/recv OR ext4/NTFS for SHTF portability
- Consider: one ZFS for automated snapshots, one NTFS/exFAT for "grab and run" scenarios
### 4TB Drive (replaced 2TB)
- Scratch/overflow/ISO storage
- Not a sensible L2ARC device: L2ARC wants an SSD, and the working set likely fits in RAM anyway
- Could also serve as hot spare for the raidz2 if desired
## Key Decisions Made
- **CasaOS stays** — not switching back to TrueNAS. Bare metal access (OpenClaw) matters.
- **ZFS layered underneath CasaOS** — CasaOS sees mountpoints, doesn't manage ZFS
- **Cockpit + ZFS plugin** for GUI management of ZFS layer
- **No dedup** — ever
- **Striped mirrors considered** but rejected for this drive mix (too much wasted capacity with mismatched sizes)
- **raidz2 on 5×4TB** chosen for main vdev — 2-drive redundancy, best for the vault
- **raidz1 on 3×3TB** — acceptable risk for replaceable data
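
As a rough illustration of how the raidz2 decision and the "CasaOS sees mountpoints" decision could translate into commands, the sketch below only prints them and creates nothing. The device IDs, the `ashift=12` setting (which assumes 4K-sector drives), the dataset name `tank/vault`, and the mountpoint `/mnt/vault` are all assumptions made for the example:

```bash
#!/bin/bash
# Dry run: this script only PRINTS commands, it never creates a pool.
# The by-id names are placeholders, not real device IDs.
VAULT_DISKS=(
  /dev/disk/by-id/ata-EXAMPLE-4TB-1
  /dev/disk/by-id/ata-EXAMPLE-4TB-2
  /dev/disk/by-id/ata-EXAMPLE-4TB-3
  /dev/disk/by-id/ata-EXAMPLE-4TB-4
  /dev/disk/by-id/ata-EXAMPLE-4TB-5
)

build_pool_cmd() {
  # ashift=12 assumes 4K-sector drives
  echo "zpool create -o ashift=12 tank raidz2 ${VAULT_DISKS[*]}"
}

build_dataset_cmd() {
  # Hypothetical dataset and mountpoint that CasaOS would see as a plain directory
  echo "zfs create -o mountpoint=/mnt/vault tank/vault"
}

build_pool_cmd
build_dataset_cmd
```

Using `/dev/disk/by-id` paths rather than `/dev/sdX` keeps the pool stable when drive letters shift between boots, which matters with this much cold-swapping.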
## Key ZFS Rules Learned
- Before OpenZFS 2.3, drives can't be added to an existing raidz vdev (only replaced, or new vdevs added); 2.3+ supports raidz expansion one drive at a time
- Mixed-size vdevs waste capacity (every drive is capped at the smallest drive's size)
- Without raidz expansion, a vdev only grows after ALL of its drives are replaced with larger ones
- Striped mirrors are the cheapest topology to grow: replace both drives of one pair, or add another pair
- 2-way mirrors cost 50% of raw capacity
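
To make the mixed-size rule concrete, here is a small sketch that treats every member of a raidz vdev as if it were the smallest drive and reports what is left over. The function name and the example drive mix (four 4TB drives plus one 3TB in a raidz2) are hypothetical and not part of the agreed topology:

```bash
#!/bin/bash
# Usable and wasted capacity when drives of different sizes share one raidz vdev.
# Every member counts as the smallest drive. Rough figures, overhead ignored.
# Args: parity_drives size_tb [size_tb ...]
mixed_vdev_report() {
  local parity=$1
  shift
  local sizes=("$@") min=$1 raw=0 s
  for s in "${sizes[@]}"; do
    if (( s < min )); then min=$s; fi
    raw=$(( raw + s ))
  done
  local count=${#sizes[@]}
  local usable=$(( (count - parity) * min ))
  local wasted=$(( raw - count * min ))
  echo "usable=${usable}TB wasted=${wasted}TB"
}

# Hypothetical raidz2 of four 4TB drives and one 3TB drive
mixed_vdev_report 2 4 4 4 4 3
```

In that example each 4TB drive gives up 1TB, which is why the plan keeps the 4TB and 3TB drives in separate vdevs.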
## Offsite Backup Strategy (Long-Term)
### 3-2-1 Compliance
1. **Main copy:** raidz2 pool on new rig
2. **Local backup:** 8/10TB standalone drives
3. **Offsite backup:** Old box relocated to brewery after migration
### Brewery Infrastructure
- HA box (HAOS) — hands off
- N100 (Batocera) — hands off
- Older Mac — janky ZFS, skip
- Raspberry Pis — possible backup target but slow
- **Best option:** Old server box moved to brewery as dedicated backup target
### Brewery Backup Setup (Future)
- Fresh Linux install on old box
- ZFS pool on backup drives
- `zfs send/recv` over SSH for automated incremental snapshots
- Cron job or systemd timer
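
One possible shape for the nightly job, sketched as a function that prints the commands instead of running them. The dataset `tank/vault`, the snapshot names, the SSH host `backup-box`, and the target `backup/vault` are placeholders invented for the example:

```bash
#!/bin/bash
# Prints the commands for one incremental backup run; nothing is executed.
# Dataset, snapshot, and host names are hypothetical placeholders.
# Args: dataset previous_snapshot new_snapshot ssh_host target_dataset
incremental_backup_cmds() {
  local dataset=$1 prev=$2 new=$3 host=$4 target=$5
  echo "zfs snapshot ${dataset}@${new}"
  # -i sends only the changes between the two snapshots; recv -u leaves the copy unmounted
  echo "zfs send -i ${dataset}@${prev} ${dataset}@${new} | ssh ${host} zfs recv -u ${target}"
}

incremental_backup_cmds tank/vault nightly-2026-04-17 nightly-2026-04-18 backup-box backup/vault
```

An incremental receive only works if the target still holds the previous snapshot unmodified, so the very first run has to be a full `zfs send` without `-i`.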
## Migration Roadmap
1. Build new rig (see AI thread for hardware planning)
2. Set up ZFS pools on new rig with planned topology
3. Migrate CasaOS + all services
4. Verify everything works on new iron
5. Wipe old box, fresh Linux + ZFS as backup target
6. Relocate old box to brewery
7. Set up zfs send/recv over SSH for nightly incremental backups
8. Decommission old setup with confidence
## SHTF Portability Notes
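- ZFS: readable on Linux and FreeBSD via OpenZFS; no native Windows support, and macOS needs a third-party OpenZFS port
- ext4: Linux native, Windows/Mac need tools
- NTFS/exFAT: the most portable choice. exFAT is read/write on Windows, macOS, and current Linux; NTFS is native on Windows and read-only on macOS without extra drivers
- Consider keeping one backup drive as NTFS for "grab and run" scenarios
- A Linux live USB can read ext4 on any machine in a pinch, and ZFS too if the image includes the ZFS tools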
## Special ZFS Drive Uses (Reference)
| Type | Use | Drive Type | Verdict |
|---|---|---|---|
| L2ARC | Read cache | SSD only | Skip unless ARC hit rate is low |
| SLOG/ZIL | Sync write log | SSD with PLP | Only for NFS/VMs/databases |
| Special vdev | Metadata storage | SSD preferred | High risk if not mirrored, overkill for home |
| Dedup vdev | Dedup tables | SSD | NO. Just no. |
## SMART Monitoring (Critical for Old Drives)
### Why It Matters
- All drives are old with no RMA coverage — failure = replace from own pocket
- SMART warnings often give days to weeks of notice before total failure, but some drives die with no warning at all
- Running a degraded pool with old drives is risking data loss
### Key Metrics to Watch
| Attribute | Warning Threshold | Critical |
|---|---|---|
| Reallocated Sectors (5) | >0 | >100 |
| Current Pending Sectors (197) | >0 | >10 |
| Uncorrectable Errors (198) | >0 | any |
| Power-On Hours (9) | — | >50,000 |
| Temperature (194) | >40°C | >45°C |
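
The table can be encoded as a small lookup so that raw values pulled from `smartctl` are graded the same way every time. This is only a sketch: the function name is made up, and it assumes the raw value is already a plain integer (some drives pack extra data into the raw temperature field).

```bash
#!/bin/bash
# Classify a raw SMART value against the thresholds in the table above.
# Args: attribute_id raw_value -> prints ok, warning, or critical
classify_smart() {
  local id=$1 value=$2 warn crit
  case "$id" in
    5)   warn=0;     crit=100   ;;  # Reallocated Sectors
    197) warn=0;     crit=10    ;;  # Current Pending Sectors
    198) warn=0;     crit=0     ;;  # Uncorrectable Errors: any count is critical
    9)   warn=50000; crit=50000 ;;  # Power-On Hours: the table has no warning level
    194) warn=40;    crit=45    ;;  # Temperature in °C
    *)   echo "unknown"; return 1 ;;
  esac
  if (( value > crit )); then
    echo "critical"
  elif (( value > warn )); then
    echo "warning"
  else
    echo "ok"
  fi
}

classify_smart 5 3
classify_smart 194 42
```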
### Setup
```bash
# Install smartmontools
sudo apt install smartmontools
# Enable and start smartd
sudo systemctl enable smartd
sudo systemctl start smartd
# Run short test (typically a few minutes)
sudo smartctl -t short /dev/sdX
# Run long/comprehensive test (hours on multi-TB spinners; smartctl prints the expected finish time)
sudo smartctl -t long /dev/sdX
# Check results
sudo smartctl -l selftest /dev/sdX
sudo smartctl -a /dev/sdX
```
### smartd.conf Configuration
```
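# Option A: email alerts for every drive smartd detects
# (smartd ignores every line after a DEVICESCAN directive, so use A or B, not both)
DEVICESCAN -a -m user@example.com -M daily
# Option B: per-drive entries with specific schedules
# Short test daily at 2am, long test Saturdays at 3am
/dev/sda -a -m admin@example.com -M daily -s (S/../.././02|L/../../6/03)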
```
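
The `-s` argument is a regular expression that smartd matches against a string of the form `T/MM/DD/d/HH` (test type, month, day, weekday with 1 = Monday, hour). The sketch below replays that matching in bash to show which timestamps would trigger a test. The anchoring with `^...$` and the helper name are assumptions made for the illustration:

```bash
#!/bin/bash
# smartd compares a string of the form T/MM/DD/d/HH against the -s regex,
# where T is the test type and d is the weekday (1=Monday ... 7=Sunday).
# Anchors are added here for the illustration.
SCHEDULE='^(S/../.././02|L/../../6/03)$'

# Args: test_type month day weekday hour -> prints "run" or "skip"
schedule_matches() {
  local stamp="$1/$2/$3/$4/$5"
  if [[ $stamp =~ $SCHEDULE ]]; then
    echo "run"
  else
    echo "skip"
  fi
}

schedule_matches S 04 18 6 02   # short test, Saturday 02:00
schedule_matches L 04 18 6 03   # long test, Saturday 03:00
schedule_matches L 04 19 7 03   # long test, Sunday 03:00
```

The first two calls print `run` and the third prints `skip`, since the long test is limited to weekday 6.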
### Power Cycle Count (12) — Also Worth Tracking
- Counts full power on/off cycles. A high number means an old drive that's been powered on and off a lot
- Not a failure predictor on its own, but tells you wear history
### Pre-Pool Drive Health Check
Before building the pool, run full SMART tests on all drives:
```bash
smartctl -t long /dev/sda
smartctl -t long /dev/sdb
# ... etc
```
And check `smartctl -a /dev/sdX | grep -E '(Reallocated|Current_Pending|Uncorrectable)'`
Drives with any reallocated sectors or pending sectors should be considered questionable — use for non-critical vdevs or retire entirely.
### ZFS Integration
ZFS doesn't do its own SMART polling, but `zpool status` shows drive errors. A rising error count in `zpool status` alongside SMART warnings = replace that drive now.
```bash
# Check for ZFS errors
zpool status -v
```
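
For unattended checks, `zpool status -x` is handy because it prints the single line `all pools are healthy` when nothing is wrong. A minimal sketch of an alert condition built on that; the status text is passed in as an argument so the logic can be tried on a machine without ZFS, and the function name is illustrative:

```bash
#!/bin/bash
# `zpool status -x` prints "all pools are healthy" when nothing is wrong.
# Succeeds (exit 0) when the given status text is anything else.
pool_needs_attention() {
  local status_text=$1
  [ "$status_text" != "all pools are healthy" ]
}

# Real use (requires ZFS):
#   pool_needs_attention "$(zpool status -x)" && zpool status -v
if pool_needs_attention "all pools are healthy"; then
  echo "ALERT"
else
  echo "OK"
fi
```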
## Pre-Pool Drive Testing (2026-04-18)
### Strategy
- 11 drives to test, 4 at a time (SATA port limitation)
- Sequential testing — run one batch, swap drives, run next batch
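- Two-phase testing per drive:
  1. SMART long test (typically several hours for a 4TB spinner, longer for the 8TB and 10TB; `smartctl -c` reports the drive's own estimate)
  2. badblocks non-destructive scan (several read and write passes over the whole surface, so expect a day or more per 4TB drive)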
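
A back-of-the-envelope way to budget the testing time is capacity divided by sustained throughput, multiplied by the number of full-surface passes. The sketch below assumes 150 MB/s average throughput and treats the non-destructive badblocks mode as about four passes (read, write pattern, verify, restore). Both numbers are assumptions, not measurements from these drives:

```bash
#!/bin/bash
# Lower-bound duration of full-surface passes: capacity / sustained throughput.
# Args: size_tb throughput_mb_per_s passes -> whole hours, rounded down
scan_hours() {
  local size_tb=$1 mbps=$2 passes=$3
  local total_mb=$(( size_tb * 1000 * 1000 * passes ))
  echo $(( total_mb / mbps / 3600 ))
}

echo "4TB, 1 pass:   ~$(scan_hours 4 150 1)h"
echo "10TB, 1 pass:  ~$(scan_hours 10 150 1)h"
echo "4TB, 4 passes: ~$(scan_hours 4 150 4)h"
```

Old drives slow down toward the inner tracks, so real runs will take longer than these estimates.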
### Phase 1: SMART Long Tests
Starts the long test on every drive in the current batch; the tests then run in parallel inside the drives.
```bash
#!/bin/bash
# drive-smart-test.sh
# Usage: sudo ./drive-smart-test.sh /dev/sda /dev/sdb /dev/sdc /dev/sdd
DRIVES=("$@")
LOGDIR="/root/drive-health-logs"
mkdir -p "$LOGDIR"
for DRIVE in "${DRIVES[@]}"; do
  echo "[$(date)] Starting SMART long test on $DRIVE" | tee -a "$LOGDIR/test.log"
  # smartctl returns immediately; the test itself keeps running inside the drive
  smartctl -t long "$DRIVE" 2>&1 | tee -a "$LOGDIR/test.log"
  echo "[$(date)] SMART long test started on $DRIVE" | tee -a "$LOGDIR/test.log"
done
echo "[$(date)] All SMART tests initiated. Once they finish, check each drive with:"
echo "  smartctl -l selftest /dev/sdX"
echo "  smartctl -a /dev/sdX | grep -E '(Reallocated|Current_Pending|Uncorrectable|Power_On_Hours)'"
```
### Phase 2: badblocks Scan (Sequential)
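One drive at a time to avoid port contention. The drive must not be mounted while it is scanned.
```bash
#!/bin/bash
# drive-badblocks.sh
# Usage: sudo ./drive-badblocks.sh /dev/sda
DRIVE="$1"
LOGDIR="/root/drive-health-logs"
if [ -z "$DRIVE" ]; then
  echo "Usage: $0 /dev/sdX"
  exit 1
fi
mkdir -p "$LOGDIR"
LOGFILE="$LOGDIR/badblocks-$(basename "$DRIVE").log"
echo "[$(date)] Starting badblocks non-destructive scan on $DRIVE" | tee "$LOGFILE"
# -n non-destructive read-write test, -v verbose, -s show progress
# -b 4096 keeps the block count inside badblocks' 32-bit limit on the 8TB and 10TB drives
badblocks -b 4096 -nvs "$DRIVE" 2>&1 | tee -a "$LOGFILE"
echo "[$(date)] badblocks complete on $DRIVE" | tee -a "$LOGFILE"
```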
### Full Sequential Test Workflow
```bash
#!/bin/bash
# drive-test-batch.sh
# Run one batch of drives through the full testing pipeline, one drive at a time
# Usage: sudo ./drive-test-batch.sh /dev/sda /dev/sdb
DRIVES=("$@") # pass the batch's drives as args
LOGDIR="/root/drive-health-logs"
STAMP=$(date +%Y%m%d) # fixed once so a run that crosses midnight keeps one log
BATCHLOG="$LOGDIR/batch-$STAMP.log"
mkdir -p "$LOGDIR"
log() { echo "[$(date)] $*" | tee -a "$BATCHLOG"; }
for DRIVE in "${DRIVES[@]}"; do
  NAME=$(basename "$DRIVE")
  SERIAL=$(smartctl -a "$DRIVE" | grep 'Serial Number' | awk '{print $NF}')
  SIZE=$(smartctl -a "$DRIVE" | grep 'User Capacity' | awk '{print $5,$6}')
  echo "==========" | tee -a "$BATCHLOG"
  echo "Drive: $DRIVE | Serial: $SERIAL | Size: $SIZE" | tee -a "$BATCHLOG"
  echo "==========" | tee -a "$BATCHLOG"
  # Phase 1: SMART long test (smartctl returns at once; the drive runs the test itself)
  log "Phase 1: SMART long test"
  smartctl -t long "$DRIVE" 2>&1 | tee -a "$BATCHLOG"
  sleep 10
  # Wait for the self-test to finish (poll every 60s, ATA/SATA output format).
  # `smartctl -H` is no use here: it reports overall health immediately.
  while smartctl -c "$DRIVE" | grep -q 'Self-test routine in progress'; do
    log "Waiting for SMART test to complete..."
    sleep 60
  done
  # Newest self-test log entry (the "# 1" line) plus the overall health verdict
  SELFTEST=$(smartctl -l selftest "$DRIVE" | grep -m1 '^# 1 ')
  STATUS=$(smartctl -H "$DRIVE" | grep 'SMART overall-health' | awk '{print $NF}')
  log "Self-test entry: $SELFTEST"
  log "SMART health: $STATUS"
  # Capture SMART attributes
  log "Capturing SMART attributes"
  smartctl -a "$DRIVE" > "$LOGDIR/smart-$NAME-$STAMP.log"
  # Phase 2: badblocks (only if health passed and the self-test found no errors)
  if [ "$STATUS" = "PASSED" ] && [[ "$SELFTEST" == *"Completed without error"* ]]; then
    log "Phase 2: badblocks non-destructive scan"
    # -b 4096 keeps the block count inside badblocks' 32-bit limit on drives over 4TB
    badblocks -b 4096 -nvs "$DRIVE" > "$LOGDIR/badblocks-$NAME-$STAMP.log" 2>&1
    log "badblocks complete"
  else
    log "SKIPPING badblocks: SMART health $STATUS, self-test: ${SELFTEST:-no result}"
  fi
done
echo "[$(date)] Batch complete. Results in $LOGDIR"
```
### Current Drive Availability (2026-04-18 update)
- **2×4TB hot** (SATA power available): WDC WD40EFRX (sdb), MDD4000GSA (sde)
- **4×4TB cold** (no power connectors available yet) — swap in batches for testing
- **3×3TB, 8TB, 10TB** — also cold, same power limitation
### Drive Batch Schedule
| Batch | Drives | Status |
|---|---|---|
| 1 | 2×4TB (sdb, sde) | **Ready to test** |
| 2 | 2×4TB | Cold swap |
| 3 | 2×4TB | Cold swap |
| 4 | 1×3TB + 8TB + 10TB | Cold swap |
### Pass/Fail Criteria
- **PASS:** SMART `Reallocated Sectors Count = 0`, `Current Pending Sectors = 0`, `Uncorrectable Errors = 0`, badblocks finds 0 bad sectors
- **CONDITIONAL:** Any reallocated/pending sectors — demote to non-critical vdev (3×3TB raidz1)
- **FAIL:** Any uncorrectable errors, badblocks bad sectors, or SMART health = FAILED — retire drive
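
The three outcomes can be written down as a single decision function, which makes it harder to grade drives inconsistently across batches. A minimal sketch, assuming the four counts and the health string have already been collected by hand or by the scripts above; the function name and argument order are illustrative:

```bash
#!/bin/bash
# Grade a drive using the pass/fail criteria above.
# Args: reallocated pending uncorrectable badblocks_found smart_health
grade_drive() {
  local realloc=$1 pending=$2 uncorrectable=$3 badblocks=$4 health=$5
  if (( uncorrectable > 0 || badblocks > 0 )) || [ "$health" = "FAILED" ]; then
    echo "FAIL"          # retire the drive
  elif (( realloc > 0 || pending > 0 )); then
    echo "CONDITIONAL"   # demote to the non-critical raidz1 vdev
  else
    echo "PASS"
  fi
}

grade_drive 0 0 0 0 PASSED
grade_drive 8 0 0 0 PASSED
grade_drive 0 0 2 0 PASSED
```

FAIL is checked first so that a drive with both reallocated sectors and uncorrectable errors is retired rather than demoted.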
### Next Steps
- [ ] Get drive inventory (exact models, ages, health)
- [ ] Finalize new rig hardware (cross-ref AI thread)
- [ ] Decide backup drive format (ZFS vs NTFS)
- [ ] Plan CasaOS migration steps when new rig is ready
- [ ] Set up SMART monitoring on all drives before pool creation
- [ ] Source cold spare 4TB drive (to keep on shelf for old-drive replacement)
- [ ] Run pre-pool drive tests (batches 1-3)
````
## Notes
<!-- openclaw:human:start -->
<!-- openclaw:human:end -->
## Related
<!-- openclaw:wiki:related:start -->
- No related pages yet.
<!-- openclaw:wiki:related:end -->