ssplit - split files into lists by size, keep hardlinks in same list
ssplit [-hmuvw] [-0] [-p pfx] [-s size] [-t topdir] [find-output]
-0 Record separator for each filename is a null instead of a newline.
-h Print a brief help message and exit.
-m Print the manual page and exit.
-p pfx Prefix for output filenames; defaults to 'x'.
-s size Create a new list after files totalling "size" bytes have been read; defaults to 900 Mbytes. A trailing character can be 'b' (bytes), 'k' (Kbytes), or 'm' (Mbytes).
-t topdir Top directory for output filenames; defaults to $TMPDIR if set, or "/tmp".
-u Print the script UUID and exit.
-v Print the version and exit.
-w Print the source location and exit.
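The -s and -t defaults above can be sketched in plain sh. This is only an illustration, not the script's actual code: the function names parse_size and default_topdir are made up here, and I'm assuming 1 Kbyte = 1024 bytes and 1 Mbyte = 1024 Kbytes, which the real script may or may not do.

```shell
#!/bin/sh
# Hypothetical helper: turn a "-s" argument such as "500m" into bytes.
# ASSUMPTION: k = 1024 bytes, m = 1024 * 1024 bytes.
parse_size() {
    num=${1%[bkm]}                # drop one trailing b/k/m, if present
    case "$num" in
        ''|*[!0-9]*) echo "bad size: $1" >&2; return 1 ;;
    esac
    case "$1" in
        *k) echo $((num * 1024)) ;;
        *m) echo $((num * 1024 * 1024)) ;;
        *)  echo "$num" ;;        # plain number or trailing "b"
    esac
}

# Hypothetical helper: the "-t" default, $TMPDIR if set, else /tmp.
default_topdir() {
    echo "${TMPDIR:-/tmp}"
}
```

With those assumptions, "parse_size 500m" prints 524288000 and "parse_size 64k" prints 65536.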
ssplit accepts output from GNU find describing a set of regular files and splits the set into smaller lists depending on two things: the cumulative sizes of the files being read, and the presence of any hard-linked files (duplicate inodes).
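A minimal sketch of that splitting rule, written as a shell function around awk. It is not the real ssplit: the name split_sketch, the four-digit list suffix, and the choice to count every record's size (even for extra names of the same inode) are all assumptions, and it skips the -0 option and the per-list sorting. It expects "type|inode|size|path" records that are already sorted so equal inodes are adjacent.

```shell
#!/bin/sh
# Hypothetical sketch: split_sketch LIMIT_BYTES PREFIX < sorted-records
split_sketch() {
    awk -F'|' -v limit="$1" -v pfx="$2" '
    BEGIN { n = 1; total = 0; prev = "" }
    {
        # Only start a new list at an inode boundary, so every name
        # that shares an inode lands in the same list.
        if (NR > 1 && ($2 "") != prev && total >= limit) {
            close(out)
            n++
            total = 0
        }
        out = sprintf("%s%04d", pfx, n)
        # Rebuild the path in case it contains the "|" separator itself.
        path = $4
        for (i = 5; i <= NF; i++) path = path "|" $i
        print path > out
        total += $3          # ASSUMPTION: every record counts toward the limit
        prev = $2 ""         # keep the inode as a string, not a number
    }'
}
```

Something like "split_sketch 524288000 /tmp/list/x < /tmp/toc.regular" would then write /tmp/list/x0001, /tmp/list/x0002, and so on. A list can run past the limit, because the check happens before a record is written and never in the middle of a run of hard links.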
I needed something like this for sending files to a backup server. I don't like running rsync as root, but I do like keeping permissions and ownerships intact on backed-up files. My solution (which I freely admit is a pig-rig):
Create unprivileged users on the production and backup servers, cleverly named "bkup".
Let root create a tar/cpio/pax/whatever file on the production box, and store that file on a staging drive readable by user "bkup".
"bkup" copies that archive file via SSH to the backup server, using a similar staging area on that box. That staging area is the only place "bkup" can create files.
root on the backup box watches the staging area, and when something appears, unpacks it to the real backup area, keeping permissions and ownerships intact.
The staging areas aren't huge, so I needed a script to take a set of files to be backed up (possibly gigabytes in size) and break it into a set of smaller chunks for transfer to the backup box. The script is demonstrated below; it's called ssplit, for "size split".
% touch -d 'yesterday' /tmp/modtime
% cd /some/path
% find . -newer /tmp/modtime -printf "%y|%i|%s|%p\n" | sort > /tmp/toc
% grep '^f' /tmp/toc > /tmp/toc.regular
% grep -v '^f' /tmp/toc > /tmp/toc.other
Running sort is essential for ssplit to work: it orders the records by file type and then by inode number, so hard-linked files (which share an inode) end up in adjacent records.
% mkdir /tmp/list
% ssplit -p x -s 500m -t /tmp/list /tmp/toc.regular
% ssplit -p y -s 500m -t /tmp/list /tmp/toc.other
% rm /tmp/modtime /tmp/toc.*
These commands list any files under /some/path modified since yesterday and write their names to a set of lists starting with /tmp/list/x0001. Each list consists of files totalling no (or not much) more than 500 Mbytes; any hard-linked files are kept in the same list, so whatever archiving utility you use can figure out what should be linked.
The last list (starting with /tmp/list/y) holds any other files (directories, symlinks, etc.) in the collection. We copy those last because it's easier for most utilities like pax/cpio/tar to set those permissions and modification times last.
If ssplit is run like this:
% mkdir /tmp/work/{list,stage}
% ssplit -t /tmp/work/list /tmp/work/fdb.toc.regular
sorting /tmp/work/list/x000001...
sorting /tmp/work/list/x000002...
sorting /tmp/work/list/x000003...
sorting /tmp/work/list/x000004...
sorting /tmp/work/list/x000005...
sorting /tmp/work/list/x000006...
sorting /tmp/work/list/x000007...
sorting /tmp/work/list/x000008...
606 files, 8 archives
% ssplit -p y -t /tmp/work/list /tmp/work/fdb.toc.other
sorting /tmp/work/list/y000001...
68 files, 1 archives
and find was run from /, then we can use GNU tar to copy the regular files first and directories/symlinks/etc last:
# cd /
# tar --no-recursion -b128 -T /tmp/work/list/x000001 \
-cf /tmp/work/stage/x000001.tar
[...]
# tar --no-recursion -b128 -T /tmp/work/list/y000001 \
-cf /tmp/work/stage/y000001.tar
Each generated list of filenames is sorted in reverse using the system sort program, assumed to be /bin/sort. This is the equivalent of running find ... -depth, which is recommended for most archive programs.
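A small sketch of why a reverse sort behaves like -depth: a directory's path is a prefix of everything beneath it, so in byte order the directory sorts first, and reversing the order puts its contents ahead of it. The function name reverse_list is made up for this example, and forcing LC_ALL=C is my assumption to get plain byte ordering; the script itself may just run sort as-is.

```shell
#!/bin/sh
# Hypothetical helper: reverse-sort a list of paths so that entries
# inside a directory come before the directory itself.
# ASSUMPTION: LC_ALL=C, for byte-wise comparison.
reverse_list() {
    LC_ALL=C sort -r "$@"
}

printf '%s\n' ./a ./a/b ./a/b/c.txt ./d | reverse_list
# prints:
#   ./d
#   ./a/b/c.txt
#   ./a/b
#   ./a
```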
Comments welcome.
Name | Last modified | Size | Description
---|---|---|---
Parent Directory | 02-May-2019 20:55 | - |
ssplit | 23-Jun-2015 19:01 | 9k | Split files into lists by size, keep hardlinks in same lists
UUID: 5531b03f-95ff-3b80-b37b-f59884b6e7a0 | Sun, 12 Sep 2021 03:30:40 -0400 |