How to match and highlight all terms in any order from an array of strings?

<div class="post-text" itemprop="text">
  <H1>
    Update 2
  </H1>
  <P>
    Abandoned the concept of shrinking the set, due to issues with restoring the working strings in Vue.
  </p>
  <P>
    Now, the method is simply as follows:
  </p>
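  <OL>
    <LI>
      Preprocess the option set so that the display strings stay in sync with the working strings.
    </LI>
    <LI>
      Process the terms.
    </LI>
    <LI>
      Reduce (filter) the option set by iterating over it and looping through the terms, short-circuiting on the first mismatch.
    </LI>
    <LI>
      Using the reduced set, iterate over each option and find the match ranges.
    </LI>
    <LI>
      Insert HTML strings around each match range.
    </LI>
  </OL>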
  <P>
    The code is commented.
  </p>
  <P>
    Raw JavaScript (logs the filtered/manipulated options array):
    <a href="https://jsfiddle.net/pvLj9uxe/14/" rel="nofollow noreferrer">
      https://jsfiddle.net/pvLj9uxe/14/
    </A>
  </p>
  <P>
    New Vue implementation:
    <a href="https://jsfiddle.net/15prcpxn/30/" rel="nofollow noreferrer">
      https://jsfiddle.net/15prcpxn/30/
    </A>
  </p>
  <P>
    The calculations seem to be reasonably fast; the DOM updates are what kill it.
  </p>
  <P>
    Added to the comparison*:
    <a href="https://jsfiddle.net/ektyx133/4/" rel="nofollow noreferrer">
      https://jsfiddle.net/ektyx133/4/
    </A>
  </p>
  <P>
    *Caveat: preprocessing the options (treated as "static") is part of the strategy, so it is done outside of the benchmark.
  </p>
   <pre>
    <code>
      var separator = /\s|\*|,/;

// this function enhances the raw options array 
function enhanceOptions(options) {
  return options.map(option => ({
    working: option.toLowerCase(), // for use in filtering the set and matching
    display: option // for displaying
  }))
}

// this function lowercases the input, splits it into terms, drops empty strings,
// attaches each term's size and wiping string, and sorts the terms longest first
function processInput(input) {
  return input.trim().toLowerCase().split(separator).filter(term => term.length).map(term => ({
    value: term, // already lower case
    size: term.length,
    wipe: " ".repeat(term.length)
  })).sort((a, b) => b.size - a.size);
}

// this function filters the data set, then finds the match ranges, and finally returns an array with HTML tags inserted
function filterAndHighlight(terms, enhancedOptions) {
  let options = enhancedOptions,
    l = terms.length;

  // filter the options - consider recursion instead
  options = options.filter(option => {
    let i = 0,
      working = option.working,
      term;
    while (i < l) {
      if (!~working.indexOf((term = terms[i]).value)) return false;
      working = working.replace(term.value, term.wipe);
      i++;
    }
    return true;
  })

  // generate the display string array
  let displayOptions = options.map(option => {
    let rangeSet = [],
      working = option.working,
      display = option.display;

    // find the match ranges
    terms.forEach(term => {
      working = working.replace(term.value, (match, offset) => { // duplicate the wipe string replacement from the filter, but grab the offsets
        rangeSet.push({
          start: offset,
          end: offset + term.size
        });
        return term.wipe;
      })
    })

    // sort the match ranges, last to first
    rangeSet.sort((a, b) => b.start - a.start);

    // insert the html tags within the string around each match range
    rangeSet.forEach(range => {
      display = display.slice(0, range.start) + '<u>' + display.slice(range.start, range.end) + '</u>' + display.slice(range.end)
    })

    return display;
  })

  return displayOptions;
}

</code>
  </pre>
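  <P>
    As a rough illustration of how these steps fit together, the condensed sketch below applies the same wipe-and-record idea to a single option. The helper name <code>highlightOption</code> and the sample strings are assumptions made for this example; they are not taken from the fiddles.
  </p>

```javascript
// Condensed sketch (hypothetical helper, not from the fiddles): returns the
// option with <u> tags around each matched term, or null if a term is missing.
function highlightOption(option, input) {
  // Split on the same separators as above and try the longest terms first
  const terms = input.trim().toLowerCase().split(/\s|\*|,/)
    .filter(term => term.length)
    .sort((a, b) => b.length - a.length);

  let working = option.toLowerCase();
  const ranges = [];
  for (const term of terms) {
    const start = working.indexOf(term);
    if (start === -1) return null; // short-circuit: this option is filtered out
    ranges.push({ start, end: start + term.length });
    // Wipe the match with spaces so the offsets stay valid for the next term
    working = working.slice(0, start) + " ".repeat(term.length) + working.slice(start + term.length);
  }

  // Insert tags from the last range to the first so earlier offsets are not shifted
  ranges.sort((a, b) => b.start - a.start);
  let display = option;
  for (const { start, end } of ranges) {
    display = display.slice(0, start) + "<u>" + display.slice(start, end) + "</u>" + display.slice(end);
  }
  return display;
}

console.log(highlightOption("United States", "sta uni"));  // <u>Uni</u>ted <u>Sta</u>tes
console.log(highlightOption("United Kingdom", "sta uni")); // null
```

  <P>
    Because the tags are inserted into the original-case string using offsets found in the lower-case copy, the two strings must keep the same length, which is why matches are wiped with spaces and not removed.
  </p>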
  <H1>
    Old attempt
  </H1>
  <P>
    <a href="https://jsfiddle.net/15prcpxn/25/" rel="nofollow noreferrer">
      https://jsfiddle.net/15prcpxn/25/
    </A>
  </p>
  <P>
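    My attempt, using Vue for rendering. The steps are sequential, so you could probably put them all into one monolithic function without much effort: the inputs would be the terms and the full option set; the outputs would be the filtered option set and the highlighted ranges.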
  </p>
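  <OL>
    <LI>
      Split the input into individual terms
    </LI>
    <LI>
      Sort the terms by length (longest term first, so that when you have an option such as <code>"abc ab"</code> and the terms <code>"a abc"</code>, i.e. one term is a substring of another term, it will still be able to match <code>"abc"</code>)
    </LI>
    <LI>
      Copy/change the terms to lower case
    </LI>
    <LI>
      Copy the options (our "display set") to lower case (our "working set")
    </LI>
    <LI>
      For each term, remove the working options without a match from the "working set" and, in parallel, remove the corresponding display options from the "display set". While doing so, remove the matched term string from each surviving working option string, e.g. matching term <code>"a"</code> in option <code>"abc"</code> yields <code>"bc"</code>
      <EM>
        [the actual implementation is the inverse: for each term, recreate the "working set" from the options that match and, in parallel, add the corresponding display options to the "display set", then pass these sets on to the next term]
      </EM>
      - this gives us the filtered display set
    </LI>
    <LI>
      Copy the filtered display set to lower case, to give us a new filtered working set
    </LI>
    <LI>
      For each working option in the remaining filtered working set, create a range-set by recording the range where each term matches (i.e. start and end, e.g. matching term <code>"a"</code> in option <code>"abc"</code>: <code>start = 0, end = 1</code>), taking the offset (start) of the match and the length of the term/match. Replace the matched string with spaces (or another unused character) equal in number to the length of the term and feed the result to the next term, e.g. matching term <code>"a"</code> in option <code>"abc"</code> yields <code>" bc"</code> - this preserves the length of the working option, ensuring that offsets in the filtered working set (lower case) line up with the filtered display set (original case). The total number of range-sets equals the number of options remaining in the filtered option set.
    </LI>
    <LI>
      Also, sort the ranges within each range-set (in descending order, to work in reverse), to allow for string insertion.
    </LI>
    <LI>
      For each option in the filtered display set (working in reverse so as not to disturb the indexes when manipulating the string), insert the <code>&lt;u&gt;</code> <code>&lt;/u&gt;</code> tags around the matched ranges by slicing up the display option, e.g. matching term <code>"a"</code> in option <code>"abc"</code>: <code>new option = "&lt;u&gt;" + "a" + "&lt;/u&gt;" + "bc"</code>
    </LI>
    <LI>
      Render it
    </LI>
  </OL>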
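  <P>
    The reason for sorting the terms longest first can be seen with the <code>"abc ab"</code> / <code>"a abc"</code> example from the list. The sketch below is only an illustration; <code>matchesInOrder</code> is a hypothetical helper that tries the terms in the order given and blanks out each match.
  </p>

```javascript
// Hypothetical helper for illustration: checks whether every term can be
// matched in the option without reusing characters, trying the terms in the
// order given and always taking the leftmost match.
function matchesInOrder(option, terms) {
  let working = option.toLowerCase();
  for (const term of terms) {
    const start = working.indexOf(term);
    if (start === -1) return false;
    // Blank out the matched characters so no later term can claim them
    working = working.slice(0, start) + " ".repeat(term.length) + working.slice(start + term.length);
  }
  return true;
}

const option = "abc ab";
const terms = ["a", "abc"];

// Input order: "a" claims the first character, so "abc" no longer fits
console.log(matchesInOrder(option, terms)); // false

// Longest first: "abc" claims positions 0-2, then "a" matches at position 4
const longestFirst = [...terms].sort((a, b) => b.length - a.length);
console.log(matchesInOrder(option, longestFirst)); // true
```

  <P>
    Sorting by length covers the case where one term is a substring of another. It remains a greedy heuristic: with the option <code>"xab ab"</code> and the equally long terms <code>"ab"</code> and <code>"xa"</code>, taking the leftmost <code>"ab"</code> first blocks <code>"xa"</code> even though a non-overlapping assignment exists.
  </p>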
  <P>
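    Performance is poor when there are many matches or non-useful terms (e.g. when you input a single character). For end use, I'd probably put in an input-calculation delay.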
  </p>
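  <P>
    One common way to add such a delay is to debounce the input handler, so that the filtering runs only after the user pauses typing. A minimal sketch follows; <code>debounce</code>, <code>recompute</code> and the 200 ms delay are illustrative choices and not part of the fiddles.
  </p>

```javascript
// Minimal debounce sketch: delays calling fn until delayMs have passed
// without another call. Names and the delay value are illustrative.
function debounce(fn, delayMs) {
  let timer = null;
  return function (...args) {
    clearTimeout(timer); // cancel the pending call, if any
    timer = setTimeout(() => fn.apply(this, args), delayMs);
  };
}

// Stand-in for the expensive filter-and-highlight step
const calls = [];
const recompute = debounce(query => calls.push(query), 200);

// Three quick keystrokes lead to a single recomputation with the last value
recompute("u");
recompute("un");
recompute("uni");
```

  <P>
    In the browser, the debounced wrapper could be passed to <code>addEventListener</code> in place of the raw handler.
  </p>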
  <P>
    I should be able to roll some of these steps up into fewer steps, which should improve performance. I'll revisit this tomorrow.
  </p>
  <P>
    Presumably Vue also takes care of some optimisations through its virtual DOM etc., so this won't necessarily reflect vanilla JavaScript / DOM rendering.
  </p>
</DIV>

<div class="post-text" itemprop="text">
  <P>
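    This is an entirely different approach from the one in my previous answer, to which I could not add all of the below (size restriction), so it is posted as a separate answer.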
  </p>
  <H3>
    Generalized Suffix Tree: Preprocessing the options
  </H3>
  <P>
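    A
    <a href="https://en.wikipedia.org/wiki/Generalized_suffix_tree" rel="nofollow noreferrer">
      generalized suffix tree
    </A>
     is a structure that in theory allows searching for substrings in a set of strings in an efficient way. So I thought I would have a go at it.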
  </p>
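  <P>
    To make the idea concrete, here is a deliberately naive sketch of the lookup such a structure offers. It is an uncompressed suffix trie built in quadratic time and space, not Ukkonen's linear-time construction, and <code>NaiveSuffixTrie</code> with its methods is a hypothetical name chosen for this example.
  </p>

```javascript
// Naive generalized suffix trie, for illustration only: every suffix of every
// text is inserted character by character (quadratic time and space), whereas
// Ukkonen's algorithm builds a compressed tree in linear time.
class NaiveSuffixTrie {
  constructor() {
    this.root = { children: new Map(), hits: [] };
  }

  // Register all suffixes of `text`; each visited node remembers where the
  // substring spelled out by its path starts.
  addText(text, textId) {
    for (let start = 0; start < text.length; start++) {
      let node = this.root;
      for (let i = start; i < text.length; i++) {
        const ch = text[i];
        if (!node.children.has(ch)) {
          node.children.set(ch, { children: new Map(), hits: [] });
        }
        node = node.children.get(ch);
        node.hits.push({ textId, offset: start });
      }
    }
  }

  // Walk down one character at a time; the walk depends on the length of the
  // term, not on how many texts were added.
  findAll(term) {
    let node = this.root;
    for (let i = 0; i < term.length; i++) {
      node = node.children.get(term[i]);
      if (!node) return [];
    }
    return node.hits;
  }
}

const trie = new NaiveSuffixTrie();
["united states", "united kingdom", "tunisia"].forEach((text, id) => trie.addText(text, id));

console.log(JSON.stringify(trie.findAll("uni")));
// [{"textId":0,"offset":0},{"textId":1,"offset":0},{"textId":2,"offset":1}]
console.log(JSON.stringify(trie.findAll("xyz"))); // []
```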
  <P>
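    Building such a tree in an efficient way is far from trivial, as can be seen from
    <a href="https://stackoverflow.com/a/9513423/5459839">
      this awesome explanation of Ukkonen's algorithm
    </A>
    , which concerns building a
    <a href="https://en.wikipedia.org/wiki/Suffix_tree" rel="nofollow noreferrer">
      suffix tree
    </A>
     for one phrase (option).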
  </p>
  <P>
    I have taken inspiration from the implementation
    <a href="https://felix-halim.net/misc/suffix-tree/" rel="nofollow noreferrer">
      found here
    </A>
    , which needed adaptation to:
  </p>
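  <UL>
    <LI>
      Apply a better coding style (e.g. get rid of non-explicitly declared global variables)
    </LI>
    <LI>
      Make it work without needing to add a delimiter after each text. This was really tricky, and I hope I did not miss any boundary conditions
    </LI>
    <LI>
      Make it work for multiple strings (i.e. generalized)
    </LI>
  </UL>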
  <P>
    So here it is:
  </p>
  <div class="snippet" data-lang="js" data-hide="false" data-console="true" data-babel="false">
    <div class="snippet-code">
       <pre class="snippet-code-js lang-js prettyprint-override">
        <code>
          "use strict";
// Implementation of a Generalized Suffix Tree using Ukkonen's algorithm
// See also: https://stackoverflow.com/q/9452701/5459839
class Node {
    constructor() {
        this.edges = {};
        this.suffixLink = null;
    }
    addEdge(ch, textId, start, end, node) {
        this.edges[ch] = { textId, start, end, node };
    }
}

class Nikkonen extends Node {
    constructor() {
        super(); // root node of the tree
        this.texts = [];
    }
    findNode(s) {
        if (!s.length) return;
        let node = this,
            len,
            suffixSize = 0,
            edge;
        for (let i = 0; i < s.length; i += len) {
            edge = node.edges[s.charAt(i)];
            if (!edge) return;
            len = Math.min(edge.end - edge.start, s.length - i);
            if (this.texts[edge.textId].substr(edge.start, len) !== s.substr(i, len)) return;
            node = edge.node;
        }
        return { edge, len };
    }
    findAll(term, termId = 1) {
        const { edge, len } = this.findNode(term) || {};
        if (!edge) return new Map(); // not found: return an empty Map so callers can rely on .size
        // Find all leaves
        const matches = new Map;
        (function recurse({ node, textId, start, end }, suffixLen) {
            suffixLen += end - start;
            const edges = Object.values(node.edges);
            if (!edges.length) { // leaf node: calculate the match
                if (!(matches.has(textId))) matches.set(textId, []);
                matches.get(textId).push({ offset: end - suffixLen, termId });
                return;
            }
            edges.forEach( edge => recurse(edge, suffixLen) );
        })(edge, term.length - len);
        return matches;
    }
    addText(text) { 
        // Implements Ukkonen's algorithm for building the tree
        // Inspired by https://felix-halim.net/misc/suffix-tree/
        const root = this,
            active = {
                node: root,
                textId: this.texts.length,
                start: 0,
                end: 0,
            },
            texts = this.texts;
        
        // Private functions
        function getChar(textId, i) {
            return texts[textId].charAt(i) || '$' + textId;
        }
        
        function addEdge(fromNode, textId, start, end, node) {
            fromNode.addEdge(getChar(textId, start), textId, start, end, node);
        }
        
        function testAndSplit() {
            const ch = getChar(active.textId, active.end);
            if (active.start < active.end) {
                const edge = active.node.edges[getChar(active.textId, active.start)],
                    splitPoint = edge.start + active.end - active.start;
                if (ch === getChar(edge.textId, splitPoint)) return;
                const newNode = new Node();
                addEdge(active.node, edge.textId, edge.start, splitPoint, newNode);
                addEdge(newNode, edge.textId, splitPoint, edge.end, edge.node);
                return newNode;
            }
            if (!(ch in active.node.edges)) return active.node;
        }

        texts.push(text);

        if (!root.suffixLink) root.suffixLink = new Node(); 
        for (let i = 0; i < text.length; i++) {
            addEdge(root.suffixLink, active.textId, i, i+1, root);
        }

        // Main Ukkonen loop: add each character from left to right to the tree
        while (active.end <= text.length) {
            update();
            active.end++;
            canonize(); // because active.end changed
        }
    }
}

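// NOTE: the signature and term setup below are assumptions, inferred from the call
// in processInput() further down and from how terms, termMap and options are used.
function trincotSuffixTree(query, options, suffixTree, separator) {
    // Split the query into terms
    const terms = query.split(separator).filter(Boolean);
    if (!terms.length) return options; // assumed: an empty query shows every option
    // Per unique term: its id, how often it occurs in the query, and a working counter
    const termMap = new Map(terms.map((term, termId) => [term, { termId, count: 0, leftOver: 0 }]));
    terms.forEach(term => termMap.get(term).count++);

    let allTermsAllOptionsOffsets;
    // Loop through the unique terms: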
    for (let [term, termInfo] of termMap) {
        // Get the offsets of the matches of this term in all options (in the preprocessed tree)
        const thisTermAllOptionsOffsets = suffixTree.findAll(term, termInfo.termId);
        //console.log('findAll:', JSON.stringify(Array.from(thisTermAllOptionsOffsets)));
        if (!thisTermAllOptionsOffsets.size) return []; // No option has this term, so bail out
        if (!allTermsAllOptionsOffsets) {
            allTermsAllOptionsOffsets = thisTermAllOptionsOffsets;
        } else {
            // Merge with all previously found offsets for other terms (intersection)
            for (let [optionId, offsets] of allTermsAllOptionsOffsets) {
                let newOffsets = thisTermAllOptionsOffsets.get(optionId);
                if (!newOffsets || newOffsets.length < termInfo.count) {
                    // this option does not have enough occurrences of this term
                    allTermsAllOptionsOffsets.delete(optionId); 
                } else {
                    allTermsAllOptionsOffsets.set(optionId, offsets.concat(newOffsets));
                }
            }
            if (!allTermsAllOptionsOffsets.size) return []; // No option has all terms, so bail out
        }
    }
    // Per option, see if (and where) the offsets can serve non-overlapping matches for each term
    const matches = Array.from(allTermsAllOptionsOffsets, ([optionId, offsets]) => {
            // Indicate how many of each term must (still) be matched:
            termMap.forEach( obj => obj.leftOver = obj.count );
            return [optionId, getNonOverlaps(offsets.sort( (a, b) => a.offset - b.offset ), terms.length)];
        })
        // Remove options that could not provide non-overlapping offsets
        .filter( ([_, offsets]) => offsets ) 
        // Sort the remaining options in their original order
        .sort( (a,b) => a[0] - b[0] )
        // Replace optionId, by the corresponding text and apply mark-up at the offsets
        .map( ([optionId, offsets]) => {
            let option = options[optionId];
            offsets.forEach((index, i) => {
                option = option.slice(0, index) 
                    + (i%2 ? "<u>" : "</u>")
                    + option.slice(index);
            });
            return option;            
        });
    //console.log(JSON.stringify(matches));
    return matches;
}

function trincotPreprocess(options) {
    const nikkonen = new Nikkonen();
    // Add all the options (lowercased) to the suffix tree
    options.map(option => option.toLowerCase()).forEach(nikkonen.addText.bind(nikkonen));
    return nikkonen;
}

const options = ['abbbba', 'United States', 'United Kingdom', 'Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia, Plurinational State of', 'Bonaire, Sint Eustatius and Saba', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo, The Democratic Republic of The', 'Cook Islands', 'Costa Rica', 'Cote D\'ivoire', 'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-bissau', 'Guyana', 'Haiti', 'Heard Island and Mcdonald Islands', 'Holy See (Vatican City State)', 'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran, Islamic Republic of', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Korea, Democratic People\'s Republic of', 'Korea, Republic of', 'Kuwait', 'Kyrgyzstan', 'Lao People\'s Democratic Republic', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Macedonia, The Former Yugoslav Republic of', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico', 'Micronesia, Federated States of', 'Moldova, Republic of', 'Monaco', 'Mongolia', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestinian Territory, Occupied', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Reunion', 'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthelemy', 'Saint Helena, Ascension and Tristan da Cunha', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon', 'Saint Vincent and The Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and The South Sandwich Islands', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden', 'Switzerland', 'Syrian Arab Republic', 'Taiwan, Province of China', 'Tajikistan', 'Tanzania, United Republic of', 'Thailand', 'Timor-leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States', 'United States Minor Outlying Islands', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela, Bolivarian Republic of', 'Viet Nam', 'Virgin Islands, British', 'Virgin Islands, U.S.', 'Wallis and Futuna', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe'];

/*
 * I/O and performance measurements 
 */

let preprocessed;
 
function processInput() {
    if (!preprocessed) { // Only first time
        const t0 = performance.now();
        preprocessed = trincotPreprocess(options);
        const spentTime = performance.now() - t0;
        // Output the time spent on preprocessing
        pretime.textContent = spentTime.toFixed(2);
    }
    var query = this.value.toLowerCase();
    const t0 = performance.now();
    const matches = trincotSuffixTree(query, options, preprocessed, ' ');
    const spentTime = performance.now() - t0;
    // Output the time spent
    time.textContent = spentTime.toFixed(2);
    // Output the matches
    result.innerHTML = '';
    for (var match of matches) {
        // Append it to the result list
        var li = document.createElement('li');
        li.innerHTML = match;
        result.appendChild(li);
    }
}

findTerms.addEventListener('keyup', processInput);
processInput.call(findTerms);
        </code>
      </pre>
       <pre class="snippet-code-css lang-css prettyprint-override">
        <code>
          ul { 
    height:300px;
    font-size: smaller;
    overflow: auto;
}
        </code>
      </pre>
       <pre class="snippet-code-html lang-html prettyprint-override">
        <code>
          Input terms: <input type="text" id="findTerms"><br>

<h3>Trincot's Suffix Tree Search</h3>
Preprocessing Time: <span id="pretime"></span>ms (only done once)<br>
Time: <span id="time"></span>ms<br>
<ul id="result"></ul>
        </code>
      </pre>
    </DIV>
  </DIV>
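  <P>
    The listing hands the sorted candidate offsets to <code>getNonOverlaps</code> to decide whether every term can get its own non-overlapping match. As a rough sketch of how such a selection could work (this is not the author's helper; <code>pickNonOverlapping</code>, its parameters and the sample data are hypothetical, and each term instance is given its own <code>termId</code> for simplicity), a backtracking search over candidates sorted by offset might look like this:
  </p>

```javascript
// Hypothetical sketch, not the getNonOverlaps helper used in the listing.
// candidates:  [{ offset, termId }] sorted by ascending offset
// termLengths: termLengths[termId] is the length of that term
// Returns one { start, end } range per term, in ascending order, or null.
function pickNonOverlapping(candidates, termLengths) {
  const used = new Array(termLengths.length).fill(false);
  const chosen = [];

  function search(from, minOffset) {
    if (chosen.length === termLengths.length) return true; // every term placed
    for (let i = from; i < candidates.length; i++) {
      const { offset, termId } = candidates[i];
      if (offset < minOffset || used[termId]) continue; // overlaps, or term already placed
      used[termId] = true;
      chosen.push({ start: offset, end: offset + termLengths[termId] });
      if (search(i + 1, offset + termLengths[termId])) return true;
      chosen.pop(); // undo this choice and try the next candidate
      used[termId] = false;
    }
    return false;
  }

  return search(0, 0) ? chosen : null;
}

// Option "ab a" with the terms "a" (termId 0) and "ab" (termId 1)
const termLengths = [1, 2];
const candidates = [
  { offset: 0, termId: 0 }, // "a"  at position 0
  { offset: 0, termId: 1 }, // "ab" at position 0
  { offset: 3, termId: 0 }, // "a"  at position 3
];

// Taking "a" at 0 first leaves no room for "ab", so the search backtracks
console.log(JSON.stringify(pickNonOverlapping(candidates, termLengths)));
// [{"start":0,"end":2},{"start":3,"end":4}]
```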
  <P>
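    There is quite a lot of code behind this method, so I suppose it might not show interesting performance for small data sets, while for larger data sets it will be memory consuming: the tree takes much more memory than the original options array.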
  </p>
</DIV>