Compare many text files that contain duplicate "stubs" from the previous and next file, and remove the duplicate text automatically

0# 部落用户 | 2019-08-31 10-32

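You have a very significant problem. Writing code to find the duplicated text at the end of file 1 and the beginning of file 2 is easy. But you don't want to delete the duplicated text; you want to split it where the second article begins. Getting the split right may be tricky: one marker is a line in all caps, and another is the BY at the start of the next line.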

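It would have helped to have examples from consecutive files, but the script below works on one test case. Back up all your files before trying this code. The code overwrites existing files.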

The implementation is in Lua.
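The algorithm is roughly:

1. Ignore blank lines at the end of file 1 and at the start of file 2.
2. Find a long sequence of lines common to the end of file 1 and the start of file 2.
   - This can be done by trying a sequence of 40 lines, then 39, and so on.
3. Remove the sequence from both files and call it overlap.
4. Split the overlap at the headline.
5. Append the first part of the overlap to file1; prepend the second part to file2.
6. Overwrite the contents of the files with the lists of lines.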

Here's the code:

#!/usr/bin/env lua

local ext = arg[1] == '-xxx' and '.xxx' or ''
if #ext > 0 then table.remove(arg, 1) end

local function lines(filename)
  local l = { }
  for line in io.lines(filename) do
    -- NOTE: both gsub arguments are empty strings here, so this call leaves the line unchanged
    table.insert(l, (line:gsub('', '')))
  end
  assert(#l > 0, "No lines in file " .. filename)
  return l
end

local function write_lines(filename, lines)
  local f = assert(io.open(filename .. ext, 'w'))
  for i = 1, #lines do
    f:write(lines[i], '\n')
  end
  f:close()
end

local function lines_match(line1, line2)
  io.stderr:write(string.format("%q ==? %q\n", line1, line2))
  return line1 == line2 -- could do an approximate match here
end

local function lines_overlap(l1, l2, k)
  if k > #l2 or k > #l1 then return false end
  io.stderr:write('* k = ', k, '\n')
  for i = 1, k do
    if not lines_match(l2[i], l1[#l1 - k + i]) then
      if i > 1 then
        io.stderr:write('After ', i-1, ' matches: FAILED <====\n')
      end
      return false
    end
  end
  return true
end

function find_overlaps(fname1, fname2)
  local l1, l2 = lines(fname1), lines(fname2)
  -- strip trailing and leading blank lines
  -- ('*' added to the pattern so that empty lines also count as blank)
  while l1[#l1]:find '^[%s]*$' do table.remove(l1) end
  while l2[1]  :find '^[%s]*$' do table.remove(l2, 1) end
  local matchsize  -- # of lines at end of file 1 that are equal to the same
                   -- # at the start of file 2
  for k = math.min(40, #l1, #l2), 1, -1 do
    if lines_overlap(l1, l2, k) then
      matchsize = k
      io.stderr:write('Found match of ', k, ' lines\n')
      break
    end
  end

  if matchsize == nil then
    return false -- failed to find an overlap
  else
    local overlap = { }
    for j = 1, matchsize do
      table.remove(l1) -- remove line from first set
      table.insert(overlap, table.remove(l2, 1))
    end
    return l1, overlap, l2
  end
end

local function split_overlap(l)
  for i = 1, #l-1 do
    if l[i]:match '%u' and not l[i]:match '%l' then -- has caps but no lowers
      -- io.stderr:write('Looking for byline following ', l[i], '\n')
      if l[i+1]:match '^%s*BY%s' then
        local first = {}
        for j = 1, i-1 do
          table.insert(first, table.remove(l, 1))
        end
        -- io.stderr:write('Split with first line at ', l[1], '\n')
        return first, l
      end
    end
  end
end

local function strip_overlaps(filename1, filename2)
  local l1, overlap, l2 = find_overlaps(filename1, filename2)
  if not l1 then
    io.stderr:write('No overlap in ', filename1, ' an
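The numbered steps above can also be sketched compactly on in-memory lists of lines. This is only an illustration under stated assumptions, written in Python rather than Lua: the function names (find_overlap, split_at_headline, redistribute) are hypothetical, blank-line trimming is left out, nothing is read from or written to disk, and the headline test (an all-caps line followed by a line starting with BY) is the same heuristic described above, which may not fit every file.

```python
import re

MAX_OVERLAP = 40  # same cap as in the algorithm description; an arbitrary limit


def find_overlap(lines1, lines2, max_overlap=MAX_OVERLAP):
    """Return the largest k such that the last k lines of lines1
    equal the first k lines of lines2 (0 if there is no overlap)."""
    for k in range(min(max_overlap, len(lines1), len(lines2)), 0, -1):
        if lines1[-k:] == lines2[:k]:
            return k
    return 0


def split_at_headline(overlap):
    """Split the overlap where an all-caps line is followed by a 'BY ...' line.
    Returns (end_of_first_article, start_of_second_article), or None."""
    for i in range(len(overlap) - 1):
        line = overlap[i]
        all_caps = any(c.isupper() for c in line) and not any(c.islower() for c in line)
        if all_caps and re.match(r"\s*BY\s", overlap[i + 1]):
            return overlap[:i], overlap[i:]
    return None


def redistribute(lines1, lines2):
    """Give each file its own share of the overlap.
    Both lists are returned unchanged if no overlap or no headline is found."""
    k = find_overlap(lines1, lines2)
    if k == 0:
        return lines1, lines2
    parts = split_at_headline(lines2[:k])
    if parts is None:
        return lines1, lines2
    first, second = parts
    return lines1[:-k] + first, second + lines2[k:]
```

Returning new lists instead of overwriting files keeps the sketch safe to try on copies of the data before committing to anything destructive.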

1# 妖邪 | 2019-08-31 10-32

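Are the title & author always on a single line? And does that line always contain the upper-case word "BY"? If so, you can probably do a fair job with AWK, using those criteria as start/end markers.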
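As a rough illustration of the marker idea (in Python here, not AWK), the sketch below keeps the text from the first marker line up to, but not including, the second one. It assumes the title and author do share a single line containing the upper-case word BY, which is exactly the open question above; the names BYLINE, marker_lines and article_between_markers are hypothetical.

```python
import re

# Assumption: a title/author line contains the upper-case word BY.
BYLINE = re.compile(r"\bBY\b")


def marker_lines(lines):
    """Return the 0-based indexes of lines that look like a title/author marker."""
    return [i for i, line in enumerate(lines) if BYLINE.search(line)]


def article_between_markers(lines):
    """Keep the lines from the first marker up to (not including) the second.
    With a single marker, keep everything from it to the end of the file;
    with no marker at all, return the lines unchanged."""
    marks = marker_lines(lines)
    if not marks:
        return list(lines)
    start = marks[0]
    end = marks[1] if len(marks) > 1 else len(lines)
    return lines[start:end]
```

If a file contains only one marker, the sketch cannot tell whether it belongs to the file's own article or to the stub of the next one, so such files would need a manual check.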

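Edit: I really don't think using diff will work, as it is a tool for comparing broadly similar files. Your files (from diff's point of view) are actually completely different; I think it would lose sync immediately. But then, I'm no diff guru :-)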

2# v-star*위위 | 2019-08-31 10-32

A quick stab at it, assuming the stubs are strictly identical in both files:


    
#!/usr/bin/perl
use strict;
use List::MoreUtils qw/ indexes all pairwise /;

my @files = @ARGV;
my @previous_text;

for my $filename ( @files ) {
    open my $in_fh,  '<', $filename          or die;
    open my $out_fh, '>', $filename.'.clean' or die;

    my @lines = <$in_fh>;
    print $out_fh destub( \@previous_text, @lines );
    @previous_text = @lines;
}

sub destub {
    my @previous = @{ shift() };
    my @lines = @_;

    my @potential_stubs = indexes { $_ eq $lines[0] } @previous;

    for my $i ( @potential_stubs ) {
        # check if the two documents overlap for that index
        my @p = @previous[ $i.. $#previous ];
        my @l = @lines[ 0..$#previous-$i ];

        return @lines[ $#previous-$i + 1 .. $#lines ]
            if all { $_ } pairwise { $a eq $b } @p, @l;
    }

    # no stub detected
    return @lines;
}

3# 荧惑 | 2019-08-31 10-32

Are the stubs identical to the end of the previous file? Or are there differing line endings / OCR errors?
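If the stubs turn out not to be byte-for-byte identical, the line comparison could be made tolerant. The sketch below is one possible approach in Python, not something taken from the posted scripts: it ignores line endings and whitespace differences, then accepts lines whose similarity ratio from the standard difflib module reaches a threshold. The function names and the 0.9 threshold are assumptions that would need tuning against real data.

```python
from difflib import SequenceMatcher


def normalize(line):
    """Drop the line ending and collapse runs of whitespace to single spaces."""
    return " ".join(line.split())


def lines_match(line1, line2, threshold=0.9):
    """Return True if the two lines are equal after normalization, or
    similar enough to be the same line with a few OCR errors.
    The threshold of 0.9 is a guess, not a tested value."""
    a, b = normalize(line1), normalize(line2)
    if a == b:
        return True
    return SequenceMatcher(None, a, b).ratio() >= threshold
```

A lower threshold tolerates more OCR noise but raises the risk of treating two different short lines as the same one.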

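Is there a way to recognize the start of an article? Maybe an indented abstract? Then you could go through each file and discard everything before the first title, and everything from the second title onward (including that title).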