Project author: q-m

Project description:
Extract the structure of ingredient lists on food products
Language: Ruby
Repository: git://github.com/q-m/food-ingredient-parser-ruby.git
Created: 2018-06-05T11:15:01Z
Project page: https://github.com/q-m/food-ingredient-parser-ruby

License: MIT License


Food ingredient parser

Ingredients are listed on food products in various ways. This Ruby
gem and program parse the ingredient text and return a structured representation.

Installation

    gem install food_ingredient_parser

This will also install the dependency treetop.
If you want colored output for the test program, also install pry: gem install pry.

Example

    require 'food_ingredient_parser'

    s = "Water* 60%, suiker 30%, voedingszuren: citroenzuur, appelzuur, zuurteregelaar: E576/E577, " \
      + "natuurlijke citroen-limoen aroma's 0,2%, zoetstof: steviolglycosiden, * = Biologisch. " \
      + "E = door de E.U. goedgekeurde toevoeging."
    parser = FoodIngredientParser::Strict::Parser.new
    puts parser.parse(s).to_h.inspect

Results in

    {
      :contains=>[
        {:name=>"Water", :amount=>"60%", :marks=>["*"]},
        {:name=>"suiker", :amount=>"30%"},
        {:name=>"voedingszuren", :contains=>[
          {:name=>"citroenzuur"}
        ]},
        {:name=>"appelzuur"},
        {:name=>"zuurteregelaar", :contains=>[
          {:name=>"E576"},
          {:name=>"E577"}
        ]},
        {:name=>"natuurlijke citroen-limoen aroma's", :amount=>"0,2%"},
        {:name=>"zoetstof", :contains=>[
          {:name=>"steviolglycosiden"}
        ]}
      ],
      :notes=>[
        "* = Biologisch",
        "E = door de E.U. goedgekeurde toevoeging"
      ]
    }
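
The nested :contains arrays in this result can be walked recursively. The following is a minimal sketch that works on a plain hash in the shape shown above (abridged, so it runs without the gem installed); the helper name each_ingredient is hypothetical and not part of the gem.

```ruby
# Hash in the shape returned by to_h (abridged from the example result).
parsed = {
  contains: [
    { name: "Water", amount: "60%", marks: ["*"] },
    { name: "suiker", amount: "30%" },
    { name: "zuurteregelaar", contains: [
      { name: "E576" },
      { name: "E577" }
    ] }
  ],
  notes: ["* = Biologisch"]
}

# Hypothetical helper: depth-first walk that yields every ingredient hash
# together with its nesting depth. Returns an Enumerator without a block.
def each_ingredient(node, depth = 0, &block)
  return enum_for(:each_ingredient, node, depth) unless block

  (node[:contains] || []).each do |ingredient|
    yield ingredient, depth
    each_ingredient(ingredient, depth + 1, &block)
  end
end

# Print an indented outline with amounts where present.
each_ingredient(parsed) do |ingredient, depth|
  amount = ingredient[:amount] ? " (#{ingredient[:amount]})" : ""
  puts "#{'  ' * depth}#{ingredient[:name]}#{amount}"
end
```

With the gem installed, the same helper should accept the output of parser.parse(s).to_h directly, since that uses the same symbol keys.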

Test tool

The executable food_ingredient_parser is available after installing the gem. If you’re
running this from the source tree, use bin/food_ingredient_parser instead.

    $ food_ingredient_parser -h
    Usage: bin/food_ingredient_parser [options] --file|-f <filename>
           bin/food_ingredient_parser [options] --string|-s <ingredients>
        -f, --file FILE             Parse all lines of the file as ingredient lists.
        -s, --string INGREDIENTS    Parse specified ingredient list.
        -q, --[no-]quiet            Only show summary.
        -p, --parsed                Only show lines that were successfully parsed.
        -n, --noresult              Only show lines that had no result.
        -r, --parser PARSER         Use specific parser (strict, loose).
        -e, --[no-]escape           Escape newlines
        -c, --[no-]color            Use color
            --[no-]html             Print as HTML with parsing markup
        -v, --[no-]verbose          Show more data (parsed tree).
            --version               Show program version.
        -h, --help                  Show this help

    $ food_ingredient_parser -v -s "tomato"
    "tomato"
    RootNode+Root3 offset=0, "tomato" (contains,notes):
    SyntaxNode offset=0, ""
    SyntaxNode offset=0, ""
    SyntaxNode offset=0, ""
    ListNode+List13 offset=0, "tomato" (contains):
    SyntaxNode+List12 offset=0, "tomato" (ingredient):
    SyntaxNode+Ingredient0 offset=0, "tomato":
    SyntaxNode offset=0, ""
    IngredientNode+IngredientSimpleWithAmount3 offset=0, "tomato" (ing):
    IngredientNode+IngredientSimple5 offset=0, "tomato" (name):
    SyntaxNode+IngredientSimple4 offset=0, "tomato" (word):
    SyntaxNode offset=0, "tomato":
    SyntaxNode offset=0, "t"
    SyntaxNode offset=1, "o"
    SyntaxNode offset=2, "m"
    SyntaxNode offset=3, "a"
    SyntaxNode offset=4, "t"
    SyntaxNode offset=5, "o"
    SyntaxNode offset=6, ""
    SyntaxNode offset=6, ""
    SyntaxNode offset=6, ""
    SyntaxNode+Root2 offset=6, "":
    SyntaxNode offset=6, ""
    SyntaxNode offset=6, ""
    SyntaxNode offset=6, ""
    SyntaxNode offset=6, ""
    {:contains=>[{:name=>"tomato"}]}

    $ food_ingredient_parser --html -s "tomato"
    <div class="root"><span class='depth0'><span class='name'>tomato</span></span></div>

    $ food_ingredient_parser -v -r loose -s "tomato"
    "tomato"
    Node interval=0..5
    Node interval=0..5, name="tomato"
    {:contains=>[{:name=>"tomato"}]}

    $ food_ingredient_parser -q -f data/test-cases
    parsed 35 (100.0%), no result 0 (0.0%)

If you want to use the output in (shell) scripts, the options -e and -c may be quite useful.

to_html

When ingredient lists are entered manually, it can be very useful to show how the text is
recognized. This can help in understanding why a certain ingredient list cannot be parsed.

For this you can use the to_html method on the parsed output, which returns the original
text, augmented with CSS classes for different parts.

    require 'food_ingredient_parser'

    parsed = FoodIngredientParser::Strict::Parser.new.parse("Saus (10% tomaat*, zout). * = bio")
    puts parsed.to_html

    <span class='depth0'>
      <span class='name'>Saus</span> (
      <span class='contains depth1'>
        <span class='amount'>10%</span> <span class='name'>tomaat</span><span class='mark'>*</span>,
        <span class='name'>zout</span>
      </span>)
    </span>.
    <span class='note'>* = bio</span>
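
To make the markup visible in a browser, the fragment can be wrapped in a page that styles the classes seen above (name, amount, mark, note, contains). This is only a sketch: the helper ingredients_page and the CSS rules are illustrative assumptions, not something the gem provides, and the fragment is passed in as a plain string so the example runs without the gem.

```ruby
# Illustrative styles for the CSS classes that appear in the to_html output.
STYLE = <<~CSS
  .name     { font-weight: bold; }
  .amount   { color: #0066cc; }
  .mark     { color: #cc6600; }
  .note     { font-style: italic; }
  .contains { background: rgba(0, 0, 0, 0.05); }
CSS

# Hypothetical helper: wraps an HTML fragment (such as the string returned
# by to_html) in a minimal standalone page.
def ingredients_page(fragment, style: STYLE)
  <<~HTML
    <!DOCTYPE html>
    <html>
      <head><meta charset="utf-8"><style>#{style}</style></head>
      <body><div class="ingredients">#{fragment}</div></body>
    </html>
  HTML
end

# A shortened fragment in the shape shown above.
fragment = "<span class='depth0'><span class='name'>Saus</span></span>"
puts ingredients_page(fragment)
```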

For an example of an interactive editor, see examples/editor.rb.

Loose parser

The strict parser only parses ingredient lists that conform to one of the many different
formats it expects. If you would like to always get a result, even if it is not necessarily
completely correct, you can use the loose parser. It does not use Treetop, but looks
at the input character by character and tries to make the best of it. The result may be
imperfect, but if you just want to have some result, this can still be very useful.

    require 'food_ingredient_parser'

    parsed = FoodIngredientParser::Loose::Parser.new.parse("Saus [10% tomaat*, (zout); peper.")
    puts parsed.to_h

Even though the strict parser would not give a result, the loose parser returns:

    {
      :contains=>[
        {:name=>"Saus", :contains=>[
          {:name=>"tomaat", :marks=>["*"], :amount=>"10%"},
          {:contains=>[
            {:name=>"zout"}
          ]},
          {:name=>"peper"}
        ]}
      ]
    }
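
A common way to combine the two parsers is to try the strict one first and fall back to the loose one when it returns nil. The sketch below assumes only that both objects respond to parse(text) and that the strict parser returns nil on failure; the helper parse_with_fallback is hypothetical and not part of the gem.

```ruby
# Hypothetical helper: returns [parser_used, result]. Both parsers only need
# to respond to #parse(text); the strict one is expected to return nil when
# it cannot parse the input.
def parse_with_fallback(text, strict:, loose:)
  result = strict.parse(text)
  return [:strict, result] unless result.nil?

  [:loose, loose.parse(text)]
end

# With the gem installed, this could be called as follows (not run here):
#
#   require 'food_ingredient_parser'
#   used, parsed = parse_with_fallback(
#     "Saus [10% tomaat*, (zout); peper.",
#     strict: FoodIngredientParser::Strict::Parser.new,
#     loose:  FoodIngredientParser::Loose::Parser.new
#   )
#   puts "#{used}: #{parsed.to_h}"
```

Returning which parser was used makes it possible to flag loosely parsed lists for manual review.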

Compatibility

From the 1.0.0 release onward, the main interface will be stable. This comprises the two parsers’ parse
methods (including documented options), their nil result when parsing fails, and the parsed output’s
to_h and to_html methods. Please note that parsed node trees may be subject to change, even within
a major release. Within a minor release, node trees are expected to remain stable.

So if you only use the stable interface (parse, to_h and to_html), you can lock your version
to e.g. ~> 1.0. If you depend on more, lock your version against e.g. ~> 1.0.0 and test when you
upgrade to 1.1.
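
The difference between the two constraints can be checked with RubyGems' own version classes. In this sketch the version numbers are arbitrary examples, not a statement about which releases exist.

```ruby
require 'rubygems'

# "~> 1.0" allows any 1.x release; "~> 1.0.0" allows only 1.0.x releases.
interface_only = Gem::Requirement.new('~> 1.0')    # >= 1.0,   < 2.0
node_trees     = Gem::Requirement.new('~> 1.0.0')  # >= 1.0.0, < 1.1.0

# Arbitrary example versions, used purely for illustration.
%w[1.0.3 1.1.0 2.0.0].each do |v|
  version = Gem::Version.new(v)
  puts format('%-6s ~> 1.0: %-5s ~> 1.0.0: %s', v,
              interface_only.satisfied_by?(version),
              node_trees.satisfied_by?(version))
end
```

In a Gemfile the corresponding line would be gem 'food_ingredient_parser', '~> 1.0' (or '~> 1.0.0' when depending on node trees).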

Languages

While most of the parsing is language-independent, some parts need knowledge about certain words
(like abbreviations and amount specifiers). The gem was developed with ingredient lists in Dutch (nl),
plus a bit of English and German. Support for other languages is already good, but is lacking in certain
areas. Improvements are welcome (starting with a corpus in data/).

Many ingredient lists from the USA are structured a bit differently from those in Europe, so they
parse less well (that is probably a matter of fine-tuning).

Test data

data/ingredient-samples-qm-nl contains about 150k
real-world ingredient lists found on the Dutch market. Each line contains one ingredient
list (newlines are encoded as \n, empty lines and those starting with # are ignored).
The strict parser currently parses 80% of them, while the loose parser returns something for all of them.
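
The file format described above can be read with a few lines of Ruby. The sketch below makes some assumptions: "\n" in the file is taken to be a literal backslash followed by n, the helper names each_sample and summarize are hypothetical, and the summary string merely imitates the test tool's output. The parser call is passed in as a block, so the example runs without the gem.

```ruby
# Hypothetical helper: yields each ingredient list from sample-file lines,
# skipping blank lines and "#" comments, and turning the encoded "\n"
# (assumed to be a literal backslash + n) back into real newlines.
def each_sample(lines)
  return enum_for(:each_sample, lines) unless block_given?

  lines.each do |line|
    line = line.chomp
    next if line.strip.empty? || line.start_with?("#")

    yield line.gsub('\n', "\n")
  end
end

# Hypothetical helper: counts how many samples produce a result. The block
# stands in for a parser call and should return nil when parsing fails.
def summarize(lines)
  total = 0
  parsed = 0
  each_sample(lines) do |text|
    total += 1
    parsed += 1 if yield(text)
  end
  return "no samples" if total.zero?

  pct = ->(n) { (100.0 * n / total).round(1) }
  failed = total - parsed
  "parsed #{parsed} (#{pct.call(parsed)}%), no result #{failed} (#{pct.call(failed)}%)"
end

# With the gem installed, this could be used as follows (not run here):
#
#   require 'food_ingredient_parser'
#   parser = FoodIngredientParser::Strict::Parser.new
#   puts summarize(File.foreach("data/ingredient-samples-qm-nl")) { |text| parser.parse(text) }
```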

License

This software is distributed under the MIT license. Data may have a different license.