Windows-patch_maker-PROSAGA-码农传奇

Patch : patch making and applying utility

 
 
Originally  developed as MS-DOS command line utility back in 1997, when computer
memory was  quite limited and  early  Internet speeds  very  slow, in  order  to
transfer binary patches as opposed to full executables. 
 
The  utility  was originally split into two separate executables (named `script`
and `play`). The  current version  combinies both functions in one (quite small)
executable called `patch`, which performs both creation of  patches and patching
(see flags below). 
 
As it  turned out,  over  the years  many similar utilities have  been developed
(both commerical and free). 
 
I am publishing this code mostly  for nostalgic reasons. I made some performance
fixes, switched parts  of code to modern C++ and made  sure it compiles and runs
on both  modern Linux (tested on Ubuntu 18.04) and Windows  10  (it's  slower on
Windows - because of the slower disk I/O, see note on performance below).

Usage

 
 
type `./patch` to see all options. 
 
`Syntax: ./patch [flags] oldfile.ext newfile.ext [file.pat]` 
     `   flags: -c - create patch`
     `          -a - apply patch`
     `          -f - force overwrite`
     `          -i - ignore timestamps`
     `          -s - print stats`
     `          -q - quiet mode`
     `          -v - verbose mode`
     `          -d - disable I/O buffering`
     `          -x - maximize compression`
 
Flags can be combined into a single argument. Example: 
 
`./patch -cfv picture1.bmp picture2.bmp patch.foc` 
 
The utility can be used on both binary and  text files. It  treats all files  as
binary.

Creating a patch file

 
 
The `patch -c` option creates a patch file with specified file name or (when not
set) automatically  assigns  the  patch  file  name.  The  output  file  can  be
compressed further with compression software (such as zip, 7zip etc).

Appying a patch

 
 
The `patch  -a`  option applies existing  patch  to  the  `oldfile`,  and should
produce `newfile` (identical to that given when creating a patch). It will check
timestamp, length and CRC checksum of `oldfile` (prior to execution) and of  the
`newfile`  (after creating it) to verify  that all went  correctly. It will also
apply  the latest  modification  timestamp  to  `newfile`  unless `-i` option is
given.

Algorithm description and patch schema

 
 
Algorithm creates a hashtable (or a sorted array  - with optional #define)  with
checksums and corresponding file positions evenly distributed over `oldfile`. It
then searches for the longest match between current position inside `newfile` to
find the duplicate parts. If such  part  is  at least eight bytes long, it  gets
referenced  inside  the  patch. Otherwise  it  writes sequences  of  bytes  from
`newfile`  to the patch,  until  such matching  part is  located.  An  important
difference of this utility from similar packages is that it also checks for long
sequences of identical bytes and compresses them as well. 
 
The patch file structure/schema is described in detail inside `main.cpp`.

Performance on Windows 10

 
 
A quick test of the patch-making part compiled with Visual Studio on  Windows 10
shows that it runs substantially slower  compared to Linux version, on a similar
hardware - release and all optimization flags being set. Visual  Studio profiler
shows the bottleneck is I/O functions such as `fread()`. 
 
As result, I added setvbuf() calls to  use memory  to speed up disk I/O. You can
disable  I/O  buffering with  `-d` option,  to save  more memory for hash table.

Limitations of current version

 
 
The  input  file size of `oldfile` is limited to 1GB.  This number is product of
22-bit  unsigned integer  used  as block id, and  256, the current maximum block
size. The simplest way to  increase  this limit is  to make  maximum block  size
larger (1024 or more) - the header's byte #4 and unused 6 bits of byte #5 can be
used to store new block size.

Possible improvements

 
 
The current maximum memory used for block ids hashtable  table (asuming all keys
being unique) is 2^22 times 8 bytes, or 2^25 (corresponds to 32 MB). As of 2020,
most modern  computers have gigabytes of RAM, which means we  have an  option of
increasing the maximum  hashtable size to 2^30 - block ids will still  fit  into
32-bit number in memory,  but we will need to use one extra byte for block id in
output patch  file - only when input  file  size  is  large enough - making  the
maximum memory requirement to be 8GB. 
 
The header also stores file  sizes  as unsigned 32-bit integers,  limiting  file
size to 4GB. This can easily be fixed by changing to 48-bit or 64-bit integers.

License

MIT