Closest string problem. Binary decisional version.
Binary decisional closest string problem definition: given a set of binary strings s_1, ..., s_n, all of the same length, and a non-negative integer K, decide whether there exists a string s such that d(s, s_i) <= K for every i.
Where d(s1, s2) is the Hamming distance between strings s1 and s2.
This problem is NP-complete.
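For intuition, the decision version can be checked by exhaustive search on tiny instances. Below is a minimal Python sketch (my own illustration; the function names are hypothetical and not part of this repository):

```python
from itertools import product

def hamming(s1, s2):
    # Number of positions where the two strings differ
    return sum(a != b for a, b in zip(s1, s2))

def closest_string_exists(strings, k):
    """Brute-force check: is there a binary string s with
    hamming(s, t) <= k for every input string t?
    Exponential in the string length, so only usable on tiny inputs."""
    length = len(strings[0])
    for bits in product("01", repeat=length):
        candidate = "".join(bits)
        if all(hamming(candidate, t) <= k for t in strings):
            return True
    return False

print(closest_string_exists(["0000", "0011", "1100"], 2))  # True:  "0000" is within 2 of all
print(closest_string_exists(["0000", "0011", "1100"], 1))  # False: "0011" and "1100" are 4 apart
```

Note that the second case fails for a simple reason: any center within distance 1 of two strings forces those strings to be at most 2 apart, but d("0011", "1100") = 4.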
Here I suggest a polynomial-time approximation for this problem, together with precision and performance measurements.
The algorithm is explained in detail in the following article:
Briefly, the suggested algorithm solves the problem in:
O(W^3 * H^2)
in the worst case (if K ~ 33% of the string length, the probability of hitting the worst case is the highest).
Where:
To achieve this, I built on some of the ideas published in “Fixed-Parameter Algorithms for Closest String and Related Problems” by Jens Gramm, Rolf Niedermeier and Peter Rossmanith (Algorithmica, September 2003, Volume 37, Issue 1, pp 25–42) and added a couple of my own solutions. Also, using the binary alphabet simplifies the search a lot.
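For reference, the core branching idea from that paper can be sketched in a few lines of Python: start from the first input string as the candidate, and while some string is farther than K from it, branch over at most K+1 mismatch positions, moving the candidate toward that string. This is a simplified illustration with my own naming, not the code of this repository:

```python
def closest_center(strings, k):
    """Fixed-parameter branching search for a center string within
    Hamming distance k of all input strings; returns None if none exists.
    Simplified sketch of the Gramm-Niedermeier-Rossmanith algorithm."""
    def dist(a, b):
        return sum(x != y for x, y in zip(a, b))

    def branch(cand, budget):
        # Find an input string still farther than k from the candidate
        far = next((s for s in strings if dist(cand, s) > k), None)
        if far is None:
            return cand          # candidate is a valid center
        if budget == 0:
            return None
        # Positions where candidate and far string disagree; any valid
        # center must agree with `far` on one of the first k+1 of them
        diffs = [i for i, (a, b) in enumerate(zip(cand, far)) if a != b][:k + 1]
        for p in diffs:
            moved = cand[:p] + far[p] + cand[p + 1:]
            result = branch(moved, budget - 1)
            if result is not None:
                return result
        return None

    return branch(strings[0], k)

print(closest_center(["0000", "0011", "1100"], 2))  # "0000"
print(closest_center(["0000", "0011", "1100"], 1))  # None
```

The recursion depth is bounded by k, and each node branches at most k+1 ways, which is what makes the problem fixed-parameter tractable in k.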
On Linux/macOS:
git clone git@gitlab.com:kirilenkobm/bdcsp.git
cd bdcsp/
make
On Windows:
git clone git@gitlab.com:kirilenkobm/bdcsp.git
cd bdcsp\
.\Win_make.bat
To run the notebooks in tests/*.ipynb, install the requirements first:
pip install -r requirements.txt
To generate input files:
make rnd
./generate_input
To reproduce the testing dataset:
./repeat_dataset.sh
This will compile the random dataset generator written in C.
Program usage:
Usage: ./CSP [input file] [k] [-v] [-p]
[input file]: text file containing the input, or stdin
[k]: minimal distance to check, a positive integer
[-v] <number 0 .. 3>: enable verbose mode, set logging level from 1 to 3, 0 - nothing
[-h]: show this message
[-V]: show version
[-p]: show patterns
[-nr]/[--nr]: you promise there are no repetitive strings (not recommended) =)
[-r]: render initial state (not recommended on big datasets)
[-f]: optimize first line
[-s]: sanity checks, just check the input correctness and quit
[-sr]/[--sr] <filename>: save final render to file
[-z]: turn on magic mode
[-t]: transpose dataset
For better results (on average), add the -f and -z options.
Also, results might be improved with:
../call_for_each_line.py [input_file] [k]
Warning: this takes str_num times more operations!
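The idea behind call_for_each_line.py can be illustrated in pure Python: instead of seeding the search from a single string, try every input line as the starting candidate and keep the best center found. The sketch below uses a simple greedy bit-flip search as a stand-in for the real program; all names are my own:

```python
def max_dist(cand, strings):
    # Largest Hamming distance from the candidate to any input string
    return max(sum(a != b for a, b in zip(cand, s)) for s in strings)

def greedy_center(seed, strings):
    """Greedy local search: repeatedly flip the single bit that
    reduces the maximum Hamming distance, until no flip helps."""
    cand, best = seed, max_dist(seed, strings)
    improved = True
    while improved:
        improved = False
        for i in range(len(cand)):
            flipped = cand[:i] + ("1" if cand[i] == "0" else "0") + cand[i + 1:]
            score = max_dist(flipped, strings)
            if score < best:
                cand, best, improved = flipped, score, True
                break
    return cand, best

def best_over_all_lines(strings):
    # Run the search once per input line, as the wrapper does, and keep
    # the best result -- at the cost of str_num times more operations.
    return min((greedy_center(s, strings) for s in strings), key=lambda t: t[1])

print(best_over_all_lines(["0000", "0011", "1100"]))
```

Each seed can land in a different local optimum, which is why trying all of them can improve the answer on average.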
In some cases, this also might be useful:
./iterative_call.py [input_file] [k]
In this case the program runs several times, each run starting from the best state of the previous one.
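The iterative scheme can be sketched the same way: treat one improvement pass as a single "program run", and keep re-running, seeded from the previous best state, until the score stops improving. This is a conceptual sketch; one_run is a hypothetical stand-in for the actual binary:

```python
def max_dist(cand, strings):
    # Largest Hamming distance from the candidate to any input string
    return max(sum(a != b for a, b in zip(cand, s)) for s in strings)

def one_run(cand, strings):
    # Stand-in for a single program run: try every single bit flip
    # and return the state with the smallest maximum distance.
    flips = [cand[:i] + ("1" if cand[i] == "0" else "0") + cand[i + 1:]
             for i in range(len(cand))]
    return min(flips + [cand], key=lambda c: max_dist(c, strings))

def iterative_call(strings):
    # Re-run, seeding each run with the best state of the previous one,
    # and stop as soon as a run fails to improve the score.
    cand = strings[0]
    best = max_dist(cand, strings)
    while True:
        nxt = one_run(cand, strings)
        score = max_dist(nxt, strings)
        if score >= best:
            return cand, best
        cand, best = nxt, score

print(iterative_call(["0000", "0011", "1100"]))  # ('0000', 2)
```

Because the score is strictly decreasing between runs and bounded below by zero, the loop always terminates.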
Briefly, the program is based on the following ideas:
Precision measurements were performed in the following way:
We see that the best results are obtained with the -z and -f flags. Each group of 4 plots represents one dataset.
Dataset name is written as [strlen] [str_num] [K expected].
F, Z and FZ mean different corrections applied.
We see that when K is ~35% of strlen, the program gives the worst results even after corrections.
On average, the F and Z corrections improve the results.
However, if we compare the different corrections with the uncorrected model, we see that in a minority of cases it is better to call the program without any corrections:
In the worst case, the algorithm shows O(W^3 * H^2) complexity.
However, in the best case, the program runs in almost linear time.
This depends on the step at which the program finds the answer. The search tree step is the bottleneck; if the program fails to find the closest string at this stage, that is effectively the worst case.
I performed the performance tests on datasets with a fixed str_num (100 strings) and different string lengths (100 - 2000 characters).
The expected K is 33% of str_len in each dataset, so that the tests cover the potentially worst case.
As we can see, the runtime is highly variable, but quite low in the average case. According to the algorithm's structure, the worst runtime is shown for cases where the program fails to find the closest string.
The same plot for “only False” answers looks like:
The runtime also depends on the applied corrections, because with corrections the search goes deeper on average. So this plot shows the runtime in the “False” case for different types of corrections: