Found at: gopher.blog.benjojo.co.uk:70/imaging-mounted-disk-volumes-live

 Imaging mounted disk volumes under duress
 ===

 drive-jenga
 Backups are critical. If you are lucky and organised you
have a set of useful backup primitives, such as Point in Time
snapshots on your Infrastructure As A Service (IaaS), your disk array
controller, or volume manager. However there always seems to be some
critical machine in my life that does not fall into these buckets.

 Ideally, I prefer to have full disk images to restore
from. I prefer to boot a system as it was 30 days ago and
extract files from it rather than having to piece things together
using an unbootable copy of all of its files. Bootable systems
in my world always win. However, making a bootable image of a
system is only really possible if the file system the system is
installed on has been set up with something to enable this (for
example LVM, DRBD, or MD).

 Annoyingly, I often find myself in the position where I
know I need to back up a system ahead of some imminent demise, but
it doesn’t have any point in time snapshot ability. Rebooting
it or reconfiguring its storage setup is usually very
undesirable in this position, so I can’t migrate it to LVM,
use MD to add a replica disk to get a block
level backup copy of the disk, or set up the disk to be
replicated over to another machine using DRBD.
 All of these options are amazing if you can get away with
making deep changes to the system. But what do you do if your
system looks like this?
 ```
 root@doomed:~# lsblk
 NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
 sr0     11:0    1 1024M  0 rom
 sda    254:0    0   64G  0 disk
 ├─sda1 254:1    0 63.5G  0 part /
 ├─sda2 254:2    0    1K  0 part
 └─sda5 254:5    0  508M  0 part [SWAP]
 ```
 We can't simply read off the block device (sda) here
because the system could be actively making changes to the disk
behind where the imaging program has already read.
 So, while a `dd if=/dev/sda of=/somewhere-else` might
complete, it will not produce a disk image that you could
confidently mount without data corruption… the exact thing we are
trying to avoid while backing up in the first place.
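
 To make that failure mode concrete, here is a toy simulation. It is only an illustration under made-up assumptions: the "disk" is a slice of four strings, and the names `naiveCopy` and `simulate` are hypothetical, not part of any real tool. A sequential copy runs while one logical update touches a block that has already been copied and a block that has not.

```go
package main

import "fmt"

// naiveCopy copies disk into a new image one block at a time, calling
// between(i) after block i has been copied. The callback stands in for
// the live system writing to the disk while the copy is in flight.
func naiveCopy(disk []string, between func(i int)) []string {
	image := make([]string, len(disk))
	for i := range disk {
		image[i] = disk[i]
		between(i)
	}
	return image
}

// simulate runs the copy while a two-block update lands halfway through.
func simulate() (image, disk []string) {
	disk = []string{"meta-v1", "a", "b", "data-v1"}
	image = naiveCopy(disk, func(i int) {
		if i == 1 {
			// One logical update touching block 0 (already copied)
			// and block 3 (not yet copied).
			disk[0] = "meta-v2"
			disk[3] = "data-v2"
		}
	})
	return image, disk
}

func main() {
	image, disk := simulate()
	fmt.Println("disk: ", disk)
	fmt.Println("image:", image)
}
```

 The image ends up with the old block 0 next to the new block 3, a combination that never existed on the disk at any single moment.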

 dd not seeing dirty blocks

 But what if we could build something like dd that could
see the changes happening to the disk in real time? Well
thanks to a 2006 Linux tracing API that looks like it was
designed for SAN performance debugging, we can!
 ## Enter blktrace

 blktrace is a combination of user space and kernel space code
that allows for decent insight into block device performance by
providing tracing data on what actions are happening to them. It was
made by HP, who I assume were using it to debug their SAN
performance at the time. But it seems the blktrace APIs have almost
never been touched outside of the included blktrace programs
themselves. This is sad since it's a genuinely useful tool (that
honestly could also be implemented with eBPF, but hey! This API is
much simpler).
 We can give it a quick go on a modern Debian 11 system
to check that everything still works as expected:
 ```
 root@test-debian11:~# blktrace /dev/vda
 ^C=== vda ===
   CPU  0:                    1 events,        1 KiB data
   CPU  1:                  102 events,        5 KiB data
   Total:                   103 events (dropped 0),        5 KiB data
 ```
 In another terminal I ran `touch foo` to force some disk
I/O to happen, then hit `^C` on blktrace and it seems to have
recorded some events! Lovely!
 Then we can dump out the actions using blkparse:
 ```
 root@test-debian11:~# blkparse -i vda.blktrace.
 Input file vda.blktrace.0 added
 Input file vda.blktrace.1 added
 254,0    1        1     0.000000000   181  A  WS 8673512 + 8 <- (254,1) 8671464
 ...
 254,0    1       84     1.741724200   761  Q  RM 15200 + 8 [touch]
 254,0    1       85     1.741724432   761  M  RM 15200 + 8 [touch]
 254,0    1       86     1.741725516   761  U   N [touch] 1
 254,0    1       87     1.741726831   761  I  RA 15104 + 104 [touch]
 254,0    1       88     1.741730665   761  D  RA 15104 + 104 [touch]
 254,0    1       89     1.742324123     0  C  RA 15104 + 104 [0]
 CPU1 (vda):
  Reads Queued:          14,       56KiB  Writes Queued:           9,       36KiB
  Read Dispatches:        4,       56KiB  Write Dispatches:        2,       36KiB
  Reads Requeued:         0               Writes Requeued:         0
  Reads Completed:        4,       56KiB  Writes Completed:        3,       36KiB
  Read Merges:           12,       48KiB  Write Merges:            7,       28KiB
  Read depth:             1               Write depth:             1
  IO unplugs:             2               Timer unplugs:           0
 Throughput (R/W): 32KiB/s / 20KiB/s
 Events (vda): 89 entries
 Skips: 0 forward (0 -   0.0%)
 ```
 This is great since it tells us which sectors on the disk
are being altered and how much data is being changed. We can
use this to build a tool like dd that does not have the flaw
mentioned above!
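
 As a rough illustration of what each event line carries, here is a minimal sketch that pulls the action, the RWBS flags, the start sector and the sector count out of blkparse's default text output. The names `event` and `parseLine` are hypothetical, and it assumes the default column layout shown above (device, CPU, sequence, time, PID, action, RWBS, then `sector + count`). A real imaging tool would more likely read the binary trace records directly than parse this text.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// event holds the parts of a blkparse line that matter for dirty tracking.
type event struct {
	Action  string // e.g. Q (queued), D (dispatched), C (completed)
	RWBS    string // e.g. WS, RM, RA; contains 'W' for writes
	Sector  uint64 // first 512-byte sector touched
	Sectors uint64 // number of sectors touched
}

// IsWrite reports whether the RWBS flags mark this event as a write.
func (e event) IsWrite() bool { return strings.Contains(e.RWBS, "W") }

// parseLine parses one event line. It returns false for lines that do not
// have the "sector + count" shape (summaries, unplug events and so on).
func parseLine(line string) (event, bool) {
	f := strings.Fields(line)
	// Columns: device cpu seq time pid action rwbs sector + count ...
	if len(f) < 10 || f[8] != "+" {
		return event{}, false
	}
	sector, err := strconv.ParseUint(f[7], 10, 64)
	if err != nil {
		return event{}, false
	}
	count, err := strconv.ParseUint(f[9], 10, 64)
	if err != nil {
		return event{}, false
	}
	return event{Action: f[5], RWBS: f[6], Sector: sector, Sectors: count}, true
}

func main() {
	lines := []string{
		"254,0    1        1     0.000000000   181  A  WS 8673512 + 8 <- (254,1) 8671464",
		"254,0    1       84     1.741724200   761  Q  RM 15200 + 8 [touch]",
		"254,0    1       86     1.741725516   761  U   N [touch] 1",
	}
	for _, l := range lines {
		if ev, ok := parseLine(l); ok && ev.IsWrite() {
			fmt.Printf("write: sector %d, %d sectors\n", ev.Sector, ev.Sectors)
		}
	}
}
```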
 ## Reusing the blktrace API
 To get started we need to add the blktrace ioctls and
system structs to Go's `sys/unix` package, since it seems this is
the first time anyone has used blktrace from Go (my generally
preferred language).

 This is basically a case of finding the C structs used
and telling the go/sys generator to regenerate all the structs
used for syscalls.

 Admittedly this is not my first time doing this: the
Splitting the Ping post did the same for the PPS API.
Thankfully it's reasonably easy to track down the ioctl identifiers
for setup and the structs used for events emitted, making
adding support for this an easy task.
 All of that work results in a reasonably small diff to
go's sys/unix package:
 ```
 diff --git a/sys/unix/linux/types.go b/sys/unix/linux/types.go
 index 515e3b6..d5e3cd5 100644
 --- a/sys/unix/linux/types.go
 +++ b/sys/unix/linux/types.go
 @@ -95,6 +95,7 @@ struct termios2 {
  #include 
  #include 
  #include 
 +#include 
  #include 
  #include 
  #include 
 @@ -3212,6 +3213,17 @@ const (
  )
 +// BLKTRACE API
 +
 +type BLK_user_trace_setup C.struct_blk_user_trace_setup
 +type BLK_io_trace C.struct_blk_io_trace
 +
 +const (
 +
 +
 +
 +
 +)
 +
 ```

 Then it is a case of setting the blktrace parameters in a
struct, and then ioctl-ing it on a file descriptor for the target
(to be traced) device:
 ```
unix.BLKTRACESETUP, uintptr(unsafe.Pointer(&traceOpts)))
 ```
 After running `BLKTRACESTART` some handy files appear in
debugfs:
 ```
 root@test-debian11:/sys/kernel/debug/block/sda# ls -alh
 total 0
 drwxr-xr-x  5 root root 0 Sep 19 15:22 .
 drwxr-xr-x  4 root root 0 Sep 19 15:20 ..
 -r--r--r--  1 root root 0 Sep 19 15:22 dropped
 ...
 -r--------  1 root root 0 Sep 19 15:22 trace0
 -r--------  1 root root 0 Sep 19 15:22 trace1
 ```
 Now the last step is opening the `trace(n)` files (one per
CPU) and reading `blk_io_trace` structs from them. Combining
that with a simple sector bitmap makes it easy to track which disk
sectors received a write event during our imaging, so that we can
go back and copy them again after imaging finishes!
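
 A minimal sketch of that read loop follows. The field layout is my reading of `struct blk_io_trace` in `<linux/blktrace_api.h>` (48 bytes, optionally followed by `pdu_len` bytes of payload), the magic value and the write category bit are likewise assumptions to verify against your headers, and the records are assumed to be little-endian. The names `scanWrites` and `demo` are hypothetical. The demo feeds the parser a synthetic in-memory stream instead of a real `trace(n)` file, which would also need polling because a read returns nothing when no events are pending.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"io"
)

// blkIOTrace mirrors struct blk_io_trace (assumption: check your headers).
type blkIOTrace struct {
	Magic    uint32 // magic in the upper 24 bits, version in the low byte
	Sequence uint32
	Time     uint64 // nanoseconds
	Sector   uint64 // start sector of the request
	Bytes    uint32 // transfer length in bytes
	Action   uint32 // low 16 bits: action, high 16 bits: category mask
	Pid      uint32
	Device   uint32
	CPU      uint32
	Error    uint16
	PduLen   uint16 // extra payload bytes following this record
}

const (
	traceMagic = 0x65617400 // assumed BLK_IO_TRACE_MAGIC
	tcShift    = 16         // assumed BLK_TC_SHIFT
	tcWrite    = 1 << 1     // assumed BLK_TC_WRITE
)

// scanWrites reads records from r until EOF and calls mark for every event
// whose category mask includes the write bit.
func scanWrites(r io.Reader, mark func(sector, count uint64)) error {
	for {
		var ev blkIOTrace
		err := binary.Read(r, binary.LittleEndian, &ev)
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		if ev.Magic&0xffffff00 != traceMagic {
			return errors.New("bad blk_io_trace magic")
		}
		// Skip any payload attached to this record.
		if _, err := io.CopyN(io.Discard, r, int64(ev.PduLen)); err != nil {
			return err
		}
		if ev.Action&(tcWrite<<tcShift) != 0 && ev.Bytes > 0 {
			mark(ev.Sector, (uint64(ev.Bytes)+511)/512)
		}
	}
}

// demo builds a synthetic stream with one write and one read event.
func demo() [][2]uint64 {
	var buf bytes.Buffer
	events := []blkIOTrace{
		{Magic: traceMagic | 7, Sector: 15200, Bytes: 4096, Action: tcWrite<<tcShift | 1},
		{Magic: traceMagic | 7, Sector: 15104, Bytes: 53248, Action: 1<<tcShift | 1},
	}
	for _, ev := range events {
		binary.Write(&buf, binary.LittleEndian, ev)
	}
	var marked [][2]uint64
	err := scanWrites(&buf, func(s, n uint64) { marked = append(marked, [2]uint64{s, n}) })
	if err != nil {
		panic(err)
	}
	return marked
}

func main() {
	for _, m := range demo() {
		fmt.Printf("dirty: sector %d, %d sectors\n", m[0], m[1])
	}
}
```

 Marking on every write-category event, whichever stage of the request it reports, errs on the side of re-copying too much, which is the safe direction for a backup.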
 Since we must track writes on the sector level (512 byte
sections of a disk), we need 1 bit per sector, or 1 byte for
every 8 sectors. This means that for every TiB being imaged around
256 MiB of RAM is needed to track dirty sectors for that device.
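
 The arithmetic and the data structure are small enough to sketch. This is a minimal illustration, not hot-clone's actual implementation: the names `dirtyMap`, `bitmapBytes` and `Runs` are hypothetical, and it assumes 512 byte sectors as described above.

```go
package main

import "fmt"

const sectorSize = 512

// bitmapBytes returns how many bytes of bitmap a device of the given size
// needs at one bit per 512-byte sector.
func bitmapBytes(deviceBytes uint64) uint64 {
	sectors := (deviceBytes + sectorSize - 1) / sectorSize
	return (sectors + 7) / 8
}

// dirtyMap tracks one bit per sector.
type dirtyMap struct{ bits []byte }

func newDirtyMap(deviceBytes uint64) *dirtyMap {
	return &dirtyMap{bits: make([]byte, bitmapBytes(deviceBytes))}
}

// Mark flags count sectors starting at sector as dirty, ignoring any
// sectors that fall outside the device.
func (m *dirtyMap) Mark(sector, count uint64) {
	for s := sector; s < sector+count; s++ {
		if idx := s / 8; idx < uint64(len(m.bits)) {
			m.bits[idx] |= 1 << (s % 8)
		}
	}
}

// IsDirty reports whether a sector has been marked.
func (m *dirtyMap) IsDirty(sector uint64) bool {
	idx := sector / 8
	return idx < uint64(len(m.bits)) && m.bits[idx]&(1<<(sector%8)) != 0
}

// Runs coalesces the dirty sectors into {start, length} pairs, which is the
// shape a second copy pass would want to iterate over.
func (m *dirtyMap) Runs() [][2]uint64 {
	var runs [][2]uint64
	total := uint64(len(m.bits)) * 8
	for s := uint64(0); s < total; s++ {
		if !m.IsDirty(s) {
			continue
		}
		start := s
		for s < total && m.IsDirty(s) {
			s++
		}
		runs = append(runs, [2]uint64{start, s - start})
	}
	return runs
}

func main() {
	fmt.Printf("1 TiB needs %d MiB of bitmap\n", bitmapBytes(1<<40)>>20)

	m := newDirtyMap(1 << 20) // a tiny 1 MiB "device" for the demo
	m.Mark(100, 8)
	m.Mark(108, 4)
	m.Mark(500, 1)
	fmt.Println("dirty runs:", m.Runs())
}
```

 One TiB is 2^31 sectors, so the bitmap is 2^28 bytes, which is where the 256 MiB figure comes from.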
 I ended up writing a simple proof of concept, and then
needed a way to test the integrity of the whole thing to make
sure it can copy data without missing altered sectors (and
thus causing corruption).
 ## Verification
 My plan to prove that this worked was:
   * Zero out a block device to ensure no old data
remained on it
   * Begin imaging the device with no data on it
   * Create an ext4 file system, mount it, write some
files
   * Unmount the newly made filesystem before the imaging
finishes
 If the system works correctly, we should get back an image
that is byte for byte identical to what is on the
target disk! Annoyingly all of my computer storage was cursed for
this use by being too fast, however my day was saved by an
incredibly sluggish hot pink 4GiB USB flash disk!

 hot pink flash drive
 ```
 [16:30:21] ben@metropolis:~$ lsblk | grep sdb
 sdb                     8:16   1   3.7G  0 disk  /media/ben/800B-EAB7
 [16:30:25] ben@metropolis:~$ sudo umount /dev/sdb
 [sudo] password for ben:
 [16:30:37] ben@metropolis:~$ sudo -i
 root@metropolis:~# dd if^C
 root@metropolis:~# pv -L 5M /dev/zero > /dev/sdb
  620MiB 0:02:04 [5.02MiB/s] [====>              ] 16% ETA 0:10:31
 ```
 Now that the device is wiped and ready, we can begin our
test:

 hot-clone imaging a device
 Then we reassemble the output into a contiguous image:
 ```
 [18:57:32] ben@metropolis:~/tmp$ hot-clone -reassemble sdb.hc -reassemble-output sdb.img
 2021/09/19 18:57:35 Restoring section (Sector: 0 (len 3959422976 bytes) (debug: 'S:0
 ...
 [18:58:07] ben@metropolis:~/tmp$ sudo md5sum sdb.img /dev/sdb
 e6f63ef26853354fa60245ae16fb209b  sdb.img
 e6f63ef26853354fa60245ae16fb209b  /dev/sdb
 ```
 They are the same! Since all of the altered blocks got
copied over, and since we unmounted the device before the imaging
finished, the device and the image are identical!
 ---
 It's worth pointing out this tool is for quite narrow
situations where you can't use a point in time snapshot on the block
level. So please don't use it when you can do those things.
 This tool has already saved my ass a few times, and I
(personally) would trust it to be correct. However, as with almost all
software, it comes with no warranty.
 You can find the tool, source code and pre-built binaries
over at https://github.com/benjojo/hot-clone
 ---

 If you want to stay up to date with the blog you can
use the RSS feed or you can follow me on Twitter.
 Until next time!