Project author: wroberts

Project description:
UNIX line counting utilities
Language: C++
Repository: git://github.com/wroberts/count.git
Created: 2014-11-27T14:27:01Z
Project page: https://github.com/wroberts/count

License: MIT License

count - UNIX line counting utilities

Copyright (c) 2014 Will Roberts <wildwilhelm@gmail.com>

Homepage: https://github.com/wroberts/count

This project is licensed under the terms of the MIT license (see
LICENSE.md).

Overview

count works similarly to sort fruit | uniq -c. The output is
tab-separated and in alphabetical order.
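
The behaviour described above can be sketched in Python. This is only an illustration, not the project's C++ implementation: the function name `count_lines` is hypothetical, the `COUNT<TAB>LINE` record layout is inferred from the awk equivalent given later in this README, and Python's default string ordering (by code point) is assumed to stand in for "alphabetical order".

```python
from collections import Counter


def count_lines(lines):
    """Tally identical lines and return 'COUNT<TAB>LINE' records sorted by line text."""
    # Every distinct line is held in memory at once (hypothetical sketch).
    tallies = Counter(line.rstrip("\n") for line in lines)
    return [f"{n}\t{text}" for text, n in sorted(tallies.items())]


if __name__ == "__main__":
    for record in count_lines(["pear\n", "apple\n", "pear\n"]):
        print(record)
    # prints "1<TAB>apple" then "2<TAB>pear"
```

Keeping a table of every distinct line avoids sorting the full input, which is one plausible reason a tool of this kind could be faster than sort | uniq -c while using more memory.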

addcount sums two count files produced by count, assuming that the
files are sorted in alphabetical order.
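
Because both inputs are assumed to be sorted, the sum can be computed with a single merge pass. The following is a minimal sketch of that idea, not the actual addcount source; `parse` and `add_counts` are hypothetical names, and the `COUNT<TAB>KEY` layout is an assumption based on the awk equivalents in this README.

```python
def parse(record):
    """Split a 'COUNT<TAB>KEY' record into (key, count)."""
    n, key = record.rstrip("\n").split("\t", 1)
    return key, int(n)


def add_counts(first, second):
    """Merge two key-sorted count listings, summing counts for shared keys."""
    a = [parse(r) for r in first]
    b = [parse(r) for r in second]
    i = j = 0
    merged = []
    while i < len(a) and j < len(b):
        (key_a, n_a), (key_b, n_b) = a[i], b[j]
        if key_a == key_b:
            # Same key in both files: emit one record with the summed count.
            merged.append((key_a, n_a + n_b))
            i += 1
            j += 1
        elif key_a < key_b:
            merged.append(a[i])
            i += 1
        else:
            merged.append(b[j])
            j += 1
    # At most one of the inputs still has records left.
    merged.extend(a[i:])
    merged.extend(b[j:])
    return [f"{n}\t{key}" for key, n in merged]
```

If either input is not sorted, this merge produces wrong results, which is why the sorted-order assumption matters.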

sortalph takes count data as produced by count and sorts it
alphabetically; it can also be used to sum two (or more) count files
together (even if they’re not in alphabetical order):

    cat COUNT1 COUNT2 | sortalph
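
A rough Python sketch of what summing unsorted count data involves is given here. It is an illustration under assumptions, not sortalph's real code: `sort_alph` is a hypothetical name, records are assumed to look like `COUNT<TAB>KEY`, and Python's default string ordering is assumed for the alphabetical sort.

```python
from collections import defaultdict


def sort_alph(records):
    """Sum counts per key (input order does not matter) and sort by key."""
    totals = defaultdict(int)
    for record in records:
        n, key = record.rstrip("\n").split("\t", 1)
        totals[key] += int(n)
    return [f"{totals[key]}\t{key}" for key in sorted(totals)]
```

Concatenated count files simply contain the same key more than once, so accumulating per key before sorting handles both the summing and the ordering.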

sortnum is a script that calls sort -nr.
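
For readers unfamiliar with sort -nr, a small Python sketch of the ordering it produces on count data follows. The name `sort_num` is hypothetical and records are assumed to be `COUNT<TAB>KEY`; records with equal counts may come out in a different order than sort -nr would give.

```python
def sort_num(records):
    """Order 'COUNT<TAB>KEY' records by count, largest first."""
    return sorted(records, key=lambda r: int(r.split("\t", 1)[0]), reverse=True)
```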

threshcount reads a count file as produced by count and outputs
only those lines whose counts are greater than the given threshold
argument.
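
The filtering rule can be sketched as follows. This is a hypothetical illustration (`thresh_count` is not the project's function), assuming `COUNT<TAB>KEY` records and a strict "greater than" comparison as stated above.

```python
def thresh_count(records, threshold):
    """Keep only records whose count is strictly greater than threshold."""
    kept = []
    for record in records:
        n = int(record.split("\t", 1)[0])
        if n > threshold:
            kept.append(record)
    return kept
```

With a threshold of 2, a record with count 2 is dropped and a record with count 3 is kept.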

shuffle is a short Python script which reads in a file and outputs
its lines in random order. shuf from GNU Coreutils is faster and
more flexible.
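
A line shuffler of this kind can be written in a few lines of Python. The sketch below is not the project's shuffle script; `shuffle_lines` and its `seed` parameter are hypothetical, added here only to make the example reproducible.

```python
import random
import sys


def shuffle_lines(lines, seed=None):
    """Return the given lines in random order, leaving the input untouched."""
    rng = random.Random(seed)  # seed is a hypothetical convenience parameter
    shuffled = list(lines)
    rng.shuffle(shuffled)
    return shuffled


if __name__ == "__main__":
    # Read all of standard input, then write the lines back in random order.
    sys.stdout.writelines(shuffle_lines(sys.stdin.readlines()))
```

Like this sketch, any approach that shuffles a list in place has to read the whole input into memory first.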

Install

From tarball:

    tar xf count-1.0.tar.gz
    cd count-1.0/
    ./configure
    make install

From GitHub:

    autoreconf --install
    mkdir build
    cd build
    ../configure
    make install

Speed Test

count is faster than sort | uniq -c, but can use much more memory:

    $ cat BIGFILE | wc
    1653677 21751482 75598346
    $ time (cat BIGFILE | sort | uniq -c > /dev/null)
    real 0m50.933s
    user 0m55.267s
    sys 0m0.347s
    $ time (cat BIGFILE | count > /dev/null)
    real 0m9.233s
    user 0m9.357s
    sys 0m0.453s

Awk Equivalents

Most of the count tools can be replicated with trivial awk scripts.
Usually, the compiled binaries are faster.

count is equivalent to, though faster than:

    awk '{c[$0]++} END {OFS="\t"; for (x in c) print c[x], x}' | sort -k2

sortalph is equivalent to, though faster than:

    awk 'BEGIN{FS=OFS="\t"} {v=$1; $1=""; c[substr($0,2)]+=v} END {for (x in c) print c[x], x}' | sort -k2

threshcount 2 is equivalent to, but slower than:

    awk '{if (2 < $1) print $0}'