Author: preciz

Description: A library for cosine similarity & simhash calculation
Language: Elixir
Repository: git://github.com/preciz/similarity.git
Created: 2019-06-18T14:03:58Z
Project page: https://github.com/preciz/similarity

License: MIT License


Similarity

Cosine similarity & Simhash implementation

Full documentation can be found at https://hexdocs.pm/similarity.

Installation

Add similarity to your list of dependencies in mix.exs:

    def deps do
      [
        {:similarity, "~> 0.4"}
      ]
    end

Cosine Similarity

Cosine similarity is not sensitive to the scale of the vector:

    Similarity.cosine([1, 2, 3], [1, 2, 3])
    # => 1.0
    Similarity.cosine([1, 2, 3], [2, 4, 6])
    # => 1.0
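The scale invariance follows from the definition: the dot product is divided by the product of both norms, so multiplying either vector by a positive constant cancels out. A minimal sketch of that calculation, assuming plain lists of numbers (`CosineSketch` is a hypothetical module written for illustration, not part of the library):

```elixir
defmodule CosineSketch do
  # Hypothetical helper, not the library's implementation.
  # Cosine similarity: dot(a, b) / (|a| * |b|).
  def cosine(a, b) when length(a) == length(b) do
    dot = a |> Enum.zip(b) |> Enum.map(fn {x, y} -> x * y end) |> Enum.sum()
    dot / (norm(a) * norm(b))
  end

  # Euclidean norm of a list of numbers.
  defp norm(v), do: v |> Enum.map(&(&1 * &1)) |> Enum.sum() |> :math.sqrt()
end

base = CosineSketch.cosine([1, 2, 3], [1, 2, 8])
scaled = CosineSketch.cosine([1, 2, 3], [10, 20, 80])

# The second vector was scaled by 10; both results agree up to rounding.
IO.inspect({base, scaled})
```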

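The Similarity.Cosine module builds a struct of named elements and streams the similarity of every pair.
Elements do not need to share the exact same set of attributes; non-matching attributes are handled for you.
Note that the returned values can exceed 1: the figures in this example are consistent with the Similarity.cosine_srol/2 scaling described below.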

    s = Similarity.Cosine.new()
    s = s |> Similarity.Cosine.add("a", [{"bananas", 9}, {"hair_color_r", 124}, {"hair_color_g", 8}, {"hair_color_b", 122}])
    s = s |> Similarity.Cosine.add("b", [{"bananas", 19}, {"hair_color_r", 124}, {"hair_color_g", 8}, {"hair_color_b", 122}])
    s = s |> Similarity.Cosine.add("c", [{"bananas", 9}, {"hair_color_r", 124}])

    s |> Similarity.Cosine.stream |> Enum.to_list
    # => [
    #      {"a", "b", 1.9967471152702767},
    #      {"a", "c", 1.4142135623730951},
    #      {"b", "c", 1.409736747211141}
    #    ]

    s |> Similarity.Cosine.between("a", "b")
    # => 1.9967471152702767
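One way to picture the handling of non-matching attributes is to keep only the attributes that both elements define before comparing them. The sketch below shows just that alignment step. It is an assumption about the approach, not the library's code, and `AlignSketch.align/2` is a hypothetical name. The assumption fits the output above: "a" and "c" share two identical attributes, and 1.0 multiplied by the square root of 2 gives the reported 1.4142135623730951.

```elixir
defmodule AlignSketch do
  # Hypothetical helper: keep only the attributes present in both elements
  # and return two equal-length value lists in the same (sorted) key order.
  def align(left, right) do
    left = Map.new(left)
    right = Map.new(right)

    common =
      left
      |> Map.keys()
      |> Enum.filter(&Map.has_key?(right, &1))
      |> Enum.sort()

    {Enum.map(common, &Map.fetch!(left, &1)), Enum.map(common, &Map.fetch!(right, &1))}
  end
end

a = [{"bananas", 9}, {"hair_color_r", 124}, {"hair_color_g", 8}, {"hair_color_b", 122}]
c = [{"bananas", 9}, {"hair_color_r", 124}]

# Only "bananas" and "hair_color_r" are shared, so both lists are [9, 124].
{va, vc} = AlignSketch.align(a, c)
IO.inspect({va, vc}, charlists: :as_lists)
```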

Similarity.cosine_srol/2
Cosine similarity between two vectors, multiplied by the square root of the vectors' length.
(In my experience, this gives a more useful value when the number of common attributes differs from one pair of vectors to the next.)

    a = [1, 2, 3, 4]
    b = [1, 2, 3]
    c = [1, 2, 3, 4]

    Similarity.cosine_srol(a |> Enum.take(3), b)
    # => 1.7320508075688772
    Similarity.cosine_srol(a, c)
    # => 2.0

In the example above, the first three elements of a match b exactly, just as a matches c exactly,
yet the a & c comparison returns a higher value because more elements match.
In real-world use, I suggest this function when the vectors being compared are not all the same length.
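The scaling can be sketched in a few lines. This is a hypothetical re-implementation for illustration (`SrolSketch` is not part of the library), and it assumes that "length" means the number of elements in the equally sized input lists:

```elixir
defmodule SrolSketch do
  # Hypothetical re-implementation for illustration; not the library's source.
  def cosine(a, b) when length(a) == length(b) do
    dot = a |> Enum.zip(b) |> Enum.map(fn {x, y} -> x * y end) |> Enum.sum()
    dot / (norm(a) * norm(b))
  end

  # Assumption: "length of the vectors" means the number of elements.
  def cosine_srol(a, b), do: cosine(a, b) * :math.sqrt(length(a))

  # Euclidean norm of a list of numbers.
  defp norm(v), do: v |> Enum.map(&(&1 * &1)) |> Enum.sum() |> :math.sqrt()
end

three = SrolSketch.cosine_srol([1, 2, 3], [1, 2, 3])
four = SrolSketch.cosine_srol([1, 2, 3, 4], [1, 2, 3, 4])

# Identical vectors have cosine 1, so the results are close to sqrt(3) and sqrt(4).
IO.inspect({three, four})
```

With this scaling, a perfect match over four elements outranks a perfect match over three, which plain cosine similarity cannot express.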

Simhash

    left = "pork belly jerky brisket tenderloin shank kevin spare ribs"
    right = "porchetta pork loin. Leberkas ball tip biltong, beef ribs"

    Similarity.simhash(left, right, ngram_size: 3)
    # => 0.484375
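The result above equals 31/64, which is what a 64-bit fingerprint comparison would give when 31 bits agree. The sketch below illustrates the general simhash technique under that reading. Everything in it is an assumption made for illustration: the `SimhashSketch` module, character n-grams, and MD5 as the hash source are hypothetical choices, and the library's own hash function and n-gram rules may differ, so this sketch is not expected to reproduce the number above.

```elixir
defmodule SimhashSketch do
  import Bitwise

  @bits 64

  # Hypothetical illustration of simhash; not the library's implementation.
  # Similarity is the share of fingerprint bits on which both texts agree.
  def similarity(left, right, ngram_size \\ 3) do
    distance = hamming(fingerprint(left, ngram_size), fingerprint(right, ngram_size))
    1 - distance / @bits
  end

  # Every n-gram hash votes on each bit; the majority decides the fingerprint bit.
  def fingerprint(text, n) do
    hashes = text |> ngrams(n) |> Enum.map(&hash64/1)

    Enum.reduce(0..(@bits - 1), 0, fn bit, acc ->
      vote =
        Enum.reduce(hashes, 0, fn h, sum ->
          if ((h >>> bit) &&& 1) == 1, do: sum + 1, else: sum - 1
        end)

      if vote > 0, do: acc ||| (1 <<< bit), else: acc
    end)
  end

  # Assumption: overlapping character n-grams.
  defp ngrams(text, n) do
    text
    |> String.graphemes()
    |> Enum.chunk_every(n, 1, :discard)
    |> Enum.map(&Enum.join/1)
  end

  # Assumption: the first 64 bits of MD5 stand in for a 64-bit hash.
  defp hash64(ngram) do
    <<h::unsigned-64, _rest::binary>> = :erlang.md5(ngram)
    h
  end

  # Number of differing bits between two fingerprints.
  defp hamming(a, b) do
    bxor(a, b) |> Integer.digits(2) |> Enum.count(&(&1 == 1))
  end
end

left = "pork belly jerky brisket tenderloin shank kevin spare ribs"
right = "porchetta pork loin. Leberkas ball tip biltong, beef ribs"

IO.inspect(SimhashSketch.similarity(left, left))
IO.inspect(SimhashSketch.similarity(left, right))
```

Identical inputs produce identical fingerprints and therefore a similarity of 1.0; unrelated texts tend toward 0.5, since about half of the bits agree by chance.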

Performance

Similarity.simhash is roughly 2x faster than the simhash-ex v1.1.0 package (2.10x in the benchmark run below).

    Benchmark suite executing with the following configuration:
    warmup: 2 s
    time: 5 s
    memory time: 0 ns
    parallel: 1
    inputs: none specified
    Estimated total run time: 14 s

    Benchmarking simhash-ex...
    Benchmarking similarity.simhash...

    Name                         ips        average  deviation         median         99th %
    similarity.simhash        3.67 K      272.69 μs     ±6.50%      267.84 μs      353.05 μs
    simhash-ex                1.75 K      572.14 μs    ±12.31%      552.22 μs      781.02 μs

    Comparison:
    similarity.simhash        3.67 K
    simhash-ex                1.75 K - 2.10x slower +299.46 μs

License

Similarity is MIT licensed.