Why I Stopped Using Btrfs for My AI Workflow

A few days ago, I moved my AI rig from MX Linux AHS to CachyOS. Coming from MX, I was used to a brilliant, transparent snapshot system where excluding specific paths from backups was as simple as editing a text file. It was a surgical approach to system safety that worked perfectly for my needs.

When I moved to CachyOS, I was introduced to the Btrfs ecosystem. Because I trusted the distribution’s optimization and the general praise for Btrfs, I proceeded with the default installation. I assumed the snapshotting logic would be similar to what I had experienced before.

That assumption turned out to be a costly mistake. I soon discovered that the way Btrfs handles snapshots is fundamentally different from the path-based exclusions in MX Linux, and for a heavy AI workload, that difference is catastrophic.

The Conflict: Copy-on-Write vs. Large Language Models

The problem lies in how Btrfs handles data through Copy-on-Write (CoW). While CoW is brilliant for system backups, it is a nightmare for the way AI tools actually operate.

I found that my storage was disappearing in ways that didn’t make sense. I discovered that tools like Forge WebUI Neo write massive temporary files every hour. In a CoW system, every time these files are modified, Btrfs creates new blocks instead of overwriting the old ones. When you add Ollama swapping 35B models in and out of memory, the disk activity becomes relentless.

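The worst part is the snapshots. Because I was using the defaults, my system was automatically taking snapshots that included these massive temporary files. A snapshot keeps every block it references alive, so the space used by temp files that had since been rewritten or deleted was never released. Instead of protecting my data, the snapshots were acting like a sponge and silently eating my NVMe capacity.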
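To make the sponge effect concrete, here is a toy model in Python. It is not how Btrfs is implemented; the class ToyCowVolume and its block counts are invented purely for illustration. It only shows the accounting rule: a block stays allocated as long as either the live filesystem or any snapshot still references it.

```python
class ToyCowVolume:
    """Toy copy-on-write volume: tracks which block ids are still referenced."""

    def __init__(self):
        self.next_block = 0
        self.live = {}        # filename -> list of block ids in the live view
        self.snapshots = []   # each snapshot is a frozen copy of the live view

    def write(self, name, n_blocks):
        # CoW: every (re)write lands in freshly allocated blocks.
        ids = list(range(self.next_block, self.next_block + n_blocks))
        self.next_block += n_blocks
        self.live[name] = ids

    def delete(self, name):
        self.live.pop(name, None)

    def snapshot(self):
        self.snapshots.append({k: list(v) for k, v in self.live.items()})

    def drop_snapshots(self):
        self.snapshots.clear()

    def blocks_in_use(self):
        # A block is in use if the live view or any snapshot references it.
        used = set()
        for view in [self.live, *self.snapshots]:
            for ids in view.values():
                used.update(ids)
        return len(used)


vol = ToyCowVolume()
vol.write("temp.bin", 100)
vol.snapshot()
vol.write("temp.bin", 100)      # rewrite: old blocks stay pinned by the snapshot
print(vol.blocks_in_use())      # 200
vol.delete("temp.bin")
print(vol.blocks_in_use())      # 100, still held by the snapshot
vol.drop_snapshots()
print(vol.blocks_in_use())      # 0
```

The file is only ever 100 blocks, yet the volume holds 200 after a single rewrite, and deleting the file frees nothing until the snapshot itself is gone.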

The Brutal Math of SSD Wear

Beyond the storage space, I started worrying about the actual physical health of my drive.

Consumer NVMe drives have a finite lifespan measured in Terabytes Written (TBW). A typical 1TB drive might offer 600TB of endurance. In my high-intensity AI workflow, I was easily writing 50GB per hour during active use. By keeping my models and temp directories on my primary system drive, I was essentially putting the drive in my motherboard’s main slot on a countdown timer.
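The countdown can be worked out from the two figures above. This is a rough sketch: the 600TB and 50GB per hour numbers come from this post, while the 4 active hours per day is an assumed duty cycle chosen only for illustration. It counts host writes only and uses decimal units (1TB = 1000GB).

```python
RATED_ENDURANCE_TB = 600      # example rating quoted above
WRITE_RATE_GB_PER_HOUR = 50   # my estimate during active use
ACTIVE_HOURS_PER_DAY = 4      # assumption for illustration, not a measurement


def hours_until_rated_endurance(endurance_tb, rate_gb_per_hour):
    """Active hours of writing before the rated TBW figure is reached."""
    return endurance_tb * 1000 / rate_gb_per_hour


def years_until_rated_endurance(endurance_tb, rate_gb_per_hour, hours_per_day):
    """Calendar years at a given number of active hours per day."""
    hours = hours_until_rated_endurance(endurance_tb, rate_gb_per_hour)
    return hours / hours_per_day / 365


hours = hours_until_rated_endurance(RATED_ENDURANCE_TB, WRITE_RATE_GB_PER_HOUR)
years = years_until_rated_endurance(
    RATED_ENDURANCE_TB, WRITE_RATE_GB_PER_HOUR, ACTIVE_HOURS_PER_DAY
)
print(f"{hours:,.0f} active hours, about {years:.1f} years at "
      f"{ACTIVE_HOURS_PER_DAY} h/day")
```

With these inputs the budget is 12,000 active hours. At 4 hours a day that is roughly 8 years, and running around the clock it shrinks to 500 days, so the duty cycle matters as much as the write rate.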

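Now, I make it a habit to check my drive health monthly using smartctl. A command along the lines of sudo smartctl -a /dev/nvme0 prints the full report; the device path is an example and will differ from system to system.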

If the wear percentage climbs too fast, I know it is time to make changes.


For reference, here is most of the actual output:

=== START OF INFORMATION SECTION ===
Model Number:                       INTEL SSDPEKNU010TZ
NVMe Version:                       1.4
Namespace 1 Size/Capacity:          1.02 TB
... [Truncated for brevity] ...

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02, NSID 0x1)
Critical Warning:                   0x00
Temperature:                        36 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    3%  <-- LOOK AT THIS: the wear level
Data Units Read:                    17.2 TB
Data Units Written:                 14.3 TB <-- LOOK AT THIS: total TB written
Power Cycles:                       12,383
Power On Hours:                     6,649
Unsafe Shutdowns:                   196
Media and Data Integrity Errors:    0

Here are the key metrics from the output of my 2021 Zephyrus G15 laptop:

  • Percentage Used: 3%. This is the most important number. It means I have used only 3% of the drive’s estimated lifespan. I have 97% of my “write budget” remaining.
  • Data Units Written: 14.3 TB. For a 1TB drive, this is very low. I am not in the “danger zone” yet, but the point of this post is to keep that number from skyrocketing.
  • Available Spare: 100%. The drive hasn’t had to retire any failing NAND blocks yet.
  • Temperature: 36 Celsius. Perfectly cool.

The “Red Flag” (Minor):

  • Power Cycles: 12,383. This is quite high relative to the Power On Hours (6,649), close to two power cycles per hour of use. It suggests the drive is being powered on and off very frequently, or that I’ve had many hard reboots or crashes (supported by the 196 Unsafe Shutdowns). This reinforces the need for a stable architecture.
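For a monthly check, the handful of fields discussed above can be pulled out with a short script. This is a minimal sketch that assumes the simplified "Name: value" layout shown in this post; raw smartctl output may format some values differently, so treat the parsing as a starting point. The helper names parse_fields and leading_number are my own.

```python
import re

# Sample lines copied from the output above.
SAMPLE = """\
Percentage Used:                    3%
Data Units Written:                 14.3 TB
Power Cycles:                       12,383
Power On Hours:                     6,649
Unsafe Shutdowns:                   196
"""


def parse_fields(text):
    """Map 'Name: value' lines to a dict, skipping lines without a value."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def leading_number(value):
    """Return the number at the start of a value such as '12,383' or '3%'."""
    match = re.match(r"[\d,]+(?:\.\d+)?", value)
    if match is None:
        raise ValueError(f"no leading number in {value!r}")
    return float(match.group().replace(",", ""))


fields = parse_fields(SAMPLE)
wear = leading_number(fields["Percentage Used"])
cycles_per_hour = (leading_number(fields["Power Cycles"])
                   / leading_number(fields["Power On Hours"]))
print(f"wear: {wear:.0f}%, power cycles per power-on hour: {cycles_per_hour:.2f}")
```

Logging these two numbers each month makes it easy to see whether the wear percentage is climbing faster than expected.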

My “Nuclear Fix” Architecture

To build a workflow that actually survives reality, I decided to decouple my OS from my data. I stopped treating my rig like a desktop and started treating it like a production server.

1. Moving to EXT4

I switched my root partition to EXT4. I lost the fancy snapshots, but I gained predictable performance and zero bloat. I figured that if my OS implodes due to a bad driver, a 30 minute reinstall is a fair trade for a stable system.

2. The External Model Vault

I moved Forge WebUI Neo and Ollama to a dedicated external NVMe SSD. This gave me two huge advantages. First, if I burn through the endurance of an 80 dollar external drive, it is a cheap and easy replacement. Second, it makes my environment portable. I can plug the same drive into the Mac Studio and keep working without moving a single file.

3. The Recovery Workflow

This setup changed my recovery time entirely. If my OS crashes, I just reinstall Linux on the EXT4 partition. Then I plug in the SSD and run my start script. I no longer have to re-download 100GB models. My production resumes instantly.
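My actual start script is not reproduced here, but the core idea can be sketched in a few lines. Ollama reads its model directory from the OLLAMA_MODELS environment variable; everything else below, including the function name ollama_env and the ollama/models layout on the external drive, is a hypothetical example rather than my exact setup.

```python
import os
from pathlib import Path


def ollama_env(vault_root, base_env=None):
    """Build an environment that points Ollama at models on the external drive.

    vault_root is wherever the external SSD is mounted (hypothetical layout:
    <vault_root>/ollama/models). Raises if the drive is not plugged in, so a
    fresh install never silently starts re-downloading models.
    """
    models_dir = Path(vault_root) / "ollama" / "models"
    if not models_dir.is_dir():
        raise FileNotFoundError(f"model vault not found at {models_dir}")
    env = dict(os.environ if base_env is None else base_env)
    env["OLLAMA_MODELS"] = str(models_dir)
    return env
```

A start script could then pass the result to subprocess.run(["ollama", "serve"], env=...) after a reinstall, with the mount point being the only thing that changes between machines.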

A Note for CachyOS Users

I love CachyOS, but its default Btrfs setup is designed for general users. The moment you install Forge and Ollama, the snapshotting system becomes a liability. If you are building an AI rig, I highly recommend choosing EXT4 during the installation process.

Final Thoughts

By using EXT4 for the system and an external NVMe for the models, I finally have a professional workstation. My models are too valuable to leave to the mercy of snapshot roulette.
