Found at: gopher.blog.benjojo.co.uk:70/imaging-mounted-disk-volumes-live
Imaging mounted disk volumes under duress
===
Backups are critical. If you are lucky and organised you
have a set of useful backup primitives, such as Point in Time
snapshots on your Infrastructure As A Service (IaaS), your disk array
controller, or volume manager. However, there always seems to be some
critical machine in my life that does not fall into these buckets.
Ideally, I prefer to have full disk images to restore
from. I prefer to boot a system as it was 30 days ago and
extract files from it rather than having to piece things together
using an unbootable copy of all of its files. Bootable systems
in my world always win. However, making a bootable image of a
system is only really possible if the file system the system is
installed on has been set up with something to enable this (for
example LVM, DRBD, or MD).
Annoyingly, I often find myself in the position where I
know I need to back up a system before some imminent demise, but
it doesn’t have any point in time snapshot ability. Rebooting
it or reconfiguring its storage setup is usually very
undesirable in this position, so migrating it to LVM is not possible.
Nor can I use MD to add a replica disk to get a block
level backup copy of the disk, or set up the disk to be
replicated over to another machine using DRBD.
All of these options are amazing if you can get away with
making deep changes to the system. But what do you do if your
system looks like this?
```
root@doomed:~# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sr0 11:0 1 1024M 0 rom
sda 254:0 0 64G 0 disk
├─sda1 254:1 0 63.5G 0 part /
├─sda2 254:2 0 1K 0 part
└─sda5 254:5 0 508M 0 part [SWAP]
```
We can't simply read off the block device (sda) here
because the system could be actively making changes to the disk
behind where the imaging program has already read.
So, while a `dd if=/dev/sda of=/somewhere-else` might
complete, it will not produce a disk image that you could
confidently mount without data corruption… the exact thing we are
trying to avoid while backing up in the first place.
[Image: dd not seeing dirty blocks]
But what if we could build something like dd
(http://en.wikipedia.org/wiki/Dd_(Unix)) that could
see the changes happening to the disk in real time? Well
thanks to a 2006 Linux tracing API that looks like it was
designed for SAN performance debugging, we can!
## Enter blktrace
blktrace is some user space and kernel space code that
allows for decent insight into block device performance by
providing tracing data on what actions are happening to them. It was
made by HP, who I assume were using it to debug their SAN
performance at the time. But it seems the blktrace APIs have almost
never been touched outside of the included blktrace programs
themselves. This is sad since it's a genuinely useful tool (the same
job could honestly also be done with eBPF, but hey! This API is
way simpler).
We can give it a quick go on a modern Debian 11 system
to check that everything still works as expected:
```
root@test-debian11:~# blktrace /dev/vda
^C=== vda ===
CPU 0: 1 events, 1 KiB data
CPU 1: 102 events, 5 KiB data
Total: 103 events (dropped 0), 5 KiB data
```
In another terminal I ran `touch foo` to force some disk
I/O to happen, then hit `^C` on blktrace and it seems to have
recorded some events! Lovely!
Then we can dump out the actions using blkparse:
```
root@test-debian11:~# blkparse -i vda.blktrace.
Input file vda.blktrace.0 added
Input file vda.blktrace.1 added
254,0 1 1 0.000000000 181 A WS 8673512 + 8 <- (254,1) 8671464
…...
254,0 1 84 1.741724200 761 Q RM 15200 + 8 [touch]
254,0 1 85 1.741724432 761 M RM 15200 + 8 [touch]
254,0 1 86 1.741725516 761 U N [touch] 1
254,0 1 87 1.741726831 761 I RA 15104 + 104 [touch]
254,0 1 88 1.741730665 761 D RA 15104 + 104 [touch]
254,0 1 89 1.742324123 0 C RA 15104 + 104 [0]
CPU1 (vda):
Reads Queued: 14, 56KiB       Writes Queued: 9, 36KiB
Read Dispatches: 4, 56KiB     Write Dispatches: 2, 36KiB
Reads Requeued: 0             Writes Requeued: 0
Reads Completed: 4, 56KiB     Writes Completed: 3, 36KiB
Read Merges: 12, 48KiB        Write Merges: 7, 28KiB
Read depth: 1                 Write depth: 1
IO unplugs: 2                 Timer unplugs: 0
Throughput (R/W): 32KiB/s / 20KiB/s
Events (vda): 89 entries
Skips: 0 forward (0 - 0.0%)
```
This is great since it tells us which sectors on the disk
are being altered and how much data is being changed. We can
use this to build a tool like dd that does not have the flaw
mentioned above!
## Reusing the blktrace API
To get started we need to add the blktrace ioctls and
system structs into golang’s unix API package, since this is the
first time it seems anyone is using blktrace in Go (my generally
preferred language).
This is basically a case of finding the C structs used
and telling the go/sys generator to regenerate all the structs
used for syscalls.
Admittedly this is not my first time doing this: the
Splitting the Ping post used the PPS API for the first time.
Thankfully it's reasonably easy to track down the ioctl identifiers
for setup and the structs used for events emitted, making
it an easy task to add support for this.
All of that work results in a reasonably small diff to
go's sys/unix package:
```
diff --git a/sys/unix/linux/types.go b/sys/unix/linux/types.go
index 515e3b6..d5e3cd5 100644
--- a/sys/unix/linux/types.go
+++ b/sys/unix/linux/types.go
@@ -95,6 +95,7 @@ struct termios2 {
#include
#include
#include
+#include
#include
#include
#include
@@ -3212,6 +3213,17 @@ const (
)
+// BLKTRACE API
+
+type BLK_user_trace_setup C.struct_blk_user_trace_setup
+type BLK_io_trace C.struct_blk_io_trace
+
+const (
+
+
+
+
+)
+
```
Then it is a case of setting the blktrace parameters in a
struct, and then ioctl-ing it on a file descriptor for the target
(to be traced) device:
```
unix.BLKTRACESETUP, uintptr(unsafe.Pointer(&traceOpts)))
```
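For anyone who wants to see the whole call sequence, here is a minimal sketch of the setup step using only the Go standard library, so it does not depend on the patched sys/unix package. The struct layout and ioctl numbers are transcribed from the kernel's `linux/blktrace_api.h` and `linux/fs.h` as I understand them for common 64-bit Linux targets. The helper names (`ioc`, `ioctl`, `startTrace`, `stopTrace`), the buffer sizes and the write-only action mask are illustrative assumptions, not necessarily what hot-clone does.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
	"unsafe"
)

// blkUserTraceSetup mirrors struct blk_user_trace_setup from
// linux/blktrace_api.h (layout assumed from the kernel header).
type blkUserTraceSetup struct {
	Name     [32]byte // filled in by the kernel with the trace name
	ActMask  uint16   // which event categories to trace
	BufSize  uint32   // size of each relay sub-buffer in bytes
	BufNr    uint32   // number of sub-buffers
	StartLba uint64
	EndLba   uint64
	Pid      uint32
}

// ioc builds an ioctl request number the way the kernel's _IOC macro
// does on x86 and arm (dir: 0 = none, 3 = read|write).
func ioc(dir, typ, nr, size uintptr) uintptr {
	return dir<<30 | size<<16 | typ<<8 | nr
}

var (
	blkTraceSetup    = ioc(3, 0x12, 115, unsafe.Sizeof(blkUserTraceSetup{}))
	blkTraceStart    = ioc(0, 0x12, 116, 0)
	blkTraceStop     = ioc(0, 0x12, 117, 0)
	blkTraceTeardown = ioc(0, 0x12, 118, 0)
)

func ioctl(fd, req uintptr, arg unsafe.Pointer) error {
	if _, _, errno := syscall.Syscall(syscall.SYS_IOCTL, fd, req, uintptr(arg)); errno != 0 {
		return errno
	}
	return nil
}

// startTrace sets up and starts tracing on an opened block device and
// returns the trace name the kernel picked (the debugfs directory name).
func startTrace(dev *os.File) (string, error) {
	traceOpts := blkUserTraceSetup{
		ActMask: 1 << 1,     // BLK_TC_WRITE: only writes (assumed choice)
		BufSize: 512 * 1024, // arbitrary example values
		BufNr:   4,
	}
	if err := ioctl(dev.Fd(), blkTraceSetup, unsafe.Pointer(&traceOpts)); err != nil {
		return "", fmt.Errorf("BLKTRACESETUP: %w", err)
	}
	if err := ioctl(dev.Fd(), blkTraceStart, nil); err != nil {
		_ = ioctl(dev.Fd(), blkTraceTeardown, nil)
		return "", fmt.Errorf("BLKTRACESTART: %w", err)
	}
	n := 0
	for n < len(traceOpts.Name) && traceOpts.Name[n] != 0 {
		n++
	}
	return string(traceOpts.Name[:n]), nil
}

// stopTrace stops tracing and releases the kernel-side buffers.
func stopTrace(dev *os.File) {
	_ = ioctl(dev.Fd(), blkTraceStop, nil)
	_ = ioctl(dev.Fd(), blkTraceTeardown, nil)
}

func main() {
	if len(os.Args) != 2 {
		fmt.Printf("usage: %s /dev/sdX (BLKTRACESETUP = %#x)\n", os.Args[0], blkTraceSetup)
		return
	}
	dev, err := os.Open(os.Args[1])
	if err != nil {
		fmt.Println(err)
		return
	}
	defer dev.Close()
	name, err := startTrace(dev)
	if err != nil {
		fmt.Println(err)
		return
	}
	defer stopTrace(dev)
	// A real tool would now read events from the per-CPU trace files.
	fmt.Println("tracing started for", name)
}
```

Running it needs root and a mounted debugfs. The teardown matters: a trace that is set up but never torn down will make the next `BLKTRACESETUP` on that device fail.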
After running `BLKTRACESTART` some handy files appear in
debugfs:
```
root@test-debian11:/sys/kernel/debug/block/sda# ls -alh
total 0
drwxr-xr-x 5 root root 0 Sep 19 15:22 .
drwxr-xr-x 4 root root 0 Sep 19 15:20 ..
-r--r--r-- 1 root root 0 Sep 19 15:22 dropped
...
-r-------- 1 root root 0 Sep 19 15:22 trace0
-r-------- 1 root root 0 Sep 19 15:22 trace1
…
```
Now the last step is opening the `trace(n)` files (one per
CPU) and reading `blk_io_trace` structs from them. Combining
those events with a simple sector bit mask lets us track every disk
sector that received a write event during our imaging, and go
back to copy those sectors again after imaging finishes!
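To make the record format concrete, here is a rough sketch of decoding `blk_io_trace` records from any `io.Reader` and keeping only the write events. The field order, the magic value and the category bits are taken from the kernel's `linux/blktrace_api.h` as I understand it. Little-endian byte order is an assumption (the kernel writes records in native order), and `ioTrace`, `writeRange`, `readWrites` and `sampleTrace` are hypothetical names made up for this example.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"io"
)

// ioTrace mirrors struct blk_io_trace (48 bytes, no padding).
type ioTrace struct {
	Magic    uint32 // 0x65617400 | version
	Sequence uint32
	Time     uint64 // nanoseconds
	Sector   uint64 // first sector touched
	Bytes    uint32 // length of the transfer
	Action   uint32 // low 16 bits: action, high 16 bits: category mask
	Pid      uint32
	Device   uint32
	CPU      uint32
	Error    uint16
	PduLen   uint16 // extra payload bytes following this record
}

const (
	traceMagic = 0x65617400
	tcWrite    = 1 << 1 // BLK_TC_WRITE
	tcShift    = 16     // BLK_TC_SHIFT
)

// writeRange is a run of sectors that saw a write event.
type writeRange struct{ Sector, Count uint64 }

// readWrites decodes records until EOF and returns every write-category
// event, whatever its stage (queued, dispatched, completed, ...).
func readWrites(r io.Reader) ([]writeRange, error) {
	var out []writeRange
	for {
		var ev ioTrace
		err := binary.Read(r, binary.LittleEndian, &ev)
		if errors.Is(err, io.EOF) {
			return out, nil
		}
		if err != nil {
			return out, err
		}
		if ev.Magic&0xffffff00 != traceMagic {
			return out, fmt.Errorf("bad magic %#x", ev.Magic)
		}
		// Skip the optional payload that trails some records.
		if _, err := io.CopyN(io.Discard, r, int64(ev.PduLen)); err != nil {
			return out, err
		}
		if ev.Action&(tcWrite<<tcShift) != 0 && ev.Bytes > 0 {
			out = append(out, writeRange{ev.Sector, (uint64(ev.Bytes) + 511) / 512})
		}
	}
}

// sampleTrace fabricates two records for the demo: one read with a
// 4-byte payload and one write, loosely based on the blkparse output.
func sampleTrace() *bytes.Buffer {
	var buf bytes.Buffer
	read := ioTrace{Magic: traceMagic | 7, Sector: 15104, Bytes: 104 * 512,
		Action: 1<<tcShift | 1, PduLen: 4}
	write := ioTrace{Magic: traceMagic | 7, Sector: 8673512, Bytes: 4096,
		Action: tcWrite<<tcShift | 1}
	binary.Write(&buf, binary.LittleEndian, read)
	buf.Write([]byte{0, 0, 0, 0})
	binary.Write(&buf, binary.LittleEndian, write)
	return &buf
}

func main() {
	writes, err := readWrites(sampleTrace())
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, w := range writes {
		fmt.Printf("write at sector %d, %d sectors\n", w.Sector, w.Count)
	}
}
```

Treating every write-category event as dirty, regardless of stage, is deliberately conservative: copying a sector twice is harmless, missing one is not.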
Since we must track writes at the sector level (512-byte
sections of a disk), the bit mask needs 1 bit per sector, or 1 byte
for every 8 sectors. This means that for every TiB being imaged, around
256MiB of RAM is needed to track dirty sectors for that device.
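The arithmetic is easy to check: 1 TiB is 2^40 bytes, which is 2^31 sectors of 512 bytes, and at one bit per sector that is 2^28 bytes, or 256 MiB. A minimal sketch of such a bit mask could look like the following; `dirtyMap`, `mapBytes`, `mark` and `isDirty` are names invented for this example rather than anything from the actual tool.

```go
package main

import "fmt"

const sectorSize = 512

// dirtyMap holds one bit per sector; bit (s % 8) of byte (s / 8)
// is set when sector s has been written to.
type dirtyMap []byte

// mapBytes returns how many bytes of bit mask a device of the given
// size needs.
func mapBytes(deviceBytes uint64) uint64 {
	sectors := (deviceBytes + sectorSize - 1) / sectorSize
	return (sectors + 7) / 8
}

func newDirtyMap(deviceBytes uint64) dirtyMap {
	return make(dirtyMap, mapBytes(deviceBytes))
}

// mark flags count sectors starting at sector as dirty. Sectors past
// the end of the map are ignored.
func (m dirtyMap) mark(sector, count uint64) {
	for s := sector; s < sector+count; s++ {
		if idx := s / 8; idx < uint64(len(m)) {
			m[idx] |= 1 << (s % 8)
		}
	}
}

// isDirty reports whether a sector has been flagged.
func (m dirtyMap) isDirty(sector uint64) bool {
	idx := sector / 8
	return idx < uint64(len(m)) && m[idx]&(1<<(sector%8)) != 0
}

func main() {
	fmt.Printf("1 TiB needs %d MiB of bit mask\n", mapBytes(1<<40)>>20)

	m := newDirtyMap(1 << 20) // a 1 MiB toy device
	m.mark(8, 8)              // e.g. a 4 KiB write at sector 8
	fmt.Println(m.isDirty(7), m.isDirty(8), m.isDirty(15), m.isDirty(16))
}
```

A flat bit mask is wasteful when writes are sparse, but it has a fixed, predictable size and marking a sector is constant time, which is what you want while events are streaming in.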
I ended up writing a simple proof of concept, and then
needed a way to test the integrity of the whole thing to make
sure it can copy data without missing altered sectors (and
thus causing corruption).
## Verification
My plan to prove that this worked was:
* Zero out a block device to ensure no old data
remains on it
* Begin imaging the device with no data on it
* Create an ext4 file system, mount it, write some
files
* Unmount the newly made filesystem before the imaging
finishes
If the system works correctly, we should get back an image
that is byte for byte identical to what is on the
target disk! Annoyingly all of my computer storage was cursed for
this use by being too fast, however my day was saved by an
incredibly sluggish hot pink 4GiB USB flash disk!
```
[16:30:21] ben@metropolis:~$ lsblk | grep sdb
sdb 8:16 1 3.7G 0 disk /media/ben/800B-EAB7
[16:30:25] ben@metropolis:~$ sudo umount /dev/sdb
[sudo] password for ben:
[16:30:37] ben@metropolis:~$ sudo -i
root@metropolis:~# dd if^C
root@metropolis:~# pv -L 5M /dev/zero > /dev/sdb
620MiB 0:02:04 [5.02MiB/s] [====> ] 16% ETA 0:10:31
```
Now that the device is wiped and ready, we can begin our
test:
[Image: hot-clone imaging a device]
Then we reassemble the output into a contiguous image and compare its checksum with the device:
```
[18:57:32] ben@metropolis:~/tmp$ hot-clone -reassemble sdb.hc -reassemble-output sdb.img
2021/09/19 18:57:35 Restoring section (Sector: 0 (len 3959422976 bytes) (debug: 'S:0
...
[18:58:07] ben@metropolis:~/tmp$ sudo md5sum sdb.img /dev/sdb
e6f63ef26853354fa60245ae16fb209b sdb.img
e6f63ef26853354fa60245ae16fb209b /dev/sdb
```
They are the same! Since all of the altered blocks got
copied over, and since we unmounted the device before the imaging
finished, the device and the image are identical!
---
It's worth pointing out this tool is for quite narrow
situations where you can't use a point in time snapshot on the block
level. So please don't use it when you can do those things.
This tool has already saved my ass a few times, and I
(personally) would trust it to be correct. However, as with almost all
software, it comes with no warranty.
You can find the tool, source code and pre-built binaries
over at https://github.com/benjojo/hot-clone
---
If you want to stay up to date with the blog you can
use the RSS feed or you can follow me on Twitter
Until next time!