Project author: BK-SCOSS

Project description:
A Source Code Tokenizer
Language: Python
Project URL: git://github.com/BK-SCOSS/sctokenizer.git
Created: 2020-08-14T06:06:44Z
Project community: https://github.com/BK-SCOSS/sctokenizer

License: MIT License


sctokenizer

A Source Code Tokenizer

Supported languages: C, C++, Java, Python, PHP

How to install

  pip install sctokenizer

How to use

Use sctokenizer:

  import sctokenizer
  tokens = sctokenizer.tokenize_file(filepath='tests/data/hello_world.cpp', lang='cpp')
  for token in tokens:
      print(token)

Or create a new CppTokenizer:

  from sctokenizer import CppTokenizer
  tokenizer = CppTokenizer()  # this object can be reused for multiple source files
  with open('tests/data/hello_world.cpp') as f:
      source = f.read()
  tokens = tokenizer.tokenize(source)
  for token in tokens:
      print(token)
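
Since one tokenizer object can serve several source files, a small loop is enough to reuse it. The sketch below is illustrative only: `tokenize_many` is a hypothetical helper name, not part of sctokenizer, and it assumes nothing beyond the `tokenize(source)` method shown above.

```python
from pathlib import Path


def tokenize_many(tokenizer, paths):
    """Tokenize several files with one tokenizer object.

    `tokenizer` is any object with a tokenize(source: str) method,
    for example the CppTokenizer created above. This helper is a
    hypothetical convenience, not part of sctokenizer.
    """
    results = {}
    for path in paths:
        # Read each file as text and hand the source string to the tokenizer.
        source = Path(path).read_text(encoding="utf-8")
        results[str(path)] = tokenizer.tokenize(source)
    return results
```

With sctokenizer installed, a call might look like tokenize_many(CppTokenizer(), ['a.cpp', 'b.cpp']), which returns a dict mapping each path to its token list.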

Or, better, use Source:

  from sctokenizer import Source
  src = Source.from_file('tests/data/hello_world.cpp', lang='cpp')
  tokens = src.tokenize()
  for token in tokens:
      print(token)

The result is a list of Token objects. Each Token has four attributes: token_value, token_type, line, and column:

  (#, TokenType.SPECIAL_SYMBOL, (1, 1))
  (include, TokenType.KEYWORD, (1, 2))
  (<, TokenType.OPERATOR, (1, 10))
  (bits/stdc++.h, TokenType.IDENTIFIER, (1, 11))
  (>, TokenType.OPERATOR, (1, 24))
  (using, TokenType.KEYWORD, (3, 1))
  (namespace, TokenType.KEYWORD, (3, 7))
  (std, TokenType.IDENTIFIER, (3, 17))
  (;, TokenType.SPECIAL_SYMBOL, (3, 20))
  (int, TokenType.KEYWORD, (5, 1))
  (main, TokenType.IDENTIFIER, (5, 5))
  ((, TokenType.SPECIAL_SYMBOL, (5, 9))
  (), TokenType.SPECIAL_SYMBOL, (5, 10))
  ({, TokenType.SPECIAL_SYMBOL, (6, 1))
  (cout, TokenType.IDENTIFIER, (7, 5))
  (<<, TokenType.OPERATOR, (7, 11))
  (", TokenType.SPECIAL_SYMBOL, (7, 13))
  (Hello World, TokenType.STRING, (7, 14))
  (", TokenType.SPECIAL_SYMBOL, (7, 25))
  (;, TokenType.SPECIAL_SYMBOL, (7, 26))
  (return, TokenType.KEYWORD, (8, 5))
  (0, TokenType.CONSTANT, (8, 12))
  (;, TokenType.SPECIAL_SYMBOL, (8, 13))
  (}, TokenType.SPECIAL_SYMBOL, (9, 1))
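
Because each Token carries token_value, token_type, line, and column, the result list is easy to post-process. The sketch below shows one possible way to count tokens per type and to collect the tokens on a given line. To stay runnable without sctokenizer, it uses a stand-in `Token` namedtuple that mirrors only the four attribute names listed above, with token types written as plain strings; the real classes may differ. `count_by_type` and `tokens_on_line` are hypothetical helper names.

```python
from collections import Counter, namedtuple

# Stand-in for sctokenizer's Token, used only so this sketch runs on its own.
# It mirrors the four attribute names listed above; the real class may differ.
Token = namedtuple("Token", ["token_value", "token_type", "line", "column"])


def count_by_type(tokens):
    """Count how many tokens of each token_type appear."""
    return Counter(token.token_type for token in tokens)


def tokens_on_line(tokens, line):
    """Return the token values found on a given source line, in order."""
    return [token.token_value for token in tokens if token.line == line]


# A few tokens copied from the sample output (types shown as strings here).
sample = [
    Token("using", "KEYWORD", 3, 1),
    Token("namespace", "KEYWORD", 3, 7),
    Token("std", "IDENTIFIER", 3, 17),
    Token(";", "SPECIAL_SYMBOL", 3, 20),
]

print(count_by_type(sample))
print(tokens_on_line(sample, 3))
```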

TODO

  • Support other languages: MATLAB, JavaScript, TypeScript, ...
  • Auto-detect the language
  • Parse source into a tree of tokens?
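
As a rough idea of how the language auto-detection item could start, the sketch below guesses a language from the file extension. It is not part of sctokenizer. `guess_lang` and `EXTENSION_TO_LANG` are hypothetical names, and only the 'cpp' value appears as a lang argument in the usage examples; every other lang string in the table is an assumption.

```python
from pathlib import Path

# Extension-to-language table. Only 'cpp' appears as a lang value in the
# usage examples; every other lang string here is an assumption.
EXTENSION_TO_LANG = {
    ".c": "c",
    ".cpp": "cpp",
    ".cc": "cpp",
    ".java": "java",
    ".py": "python",
    ".php": "php",
}


def guess_lang(filepath):
    """Guess a language name from the file extension, or return None."""
    suffix = Path(filepath).suffix.lower()
    return EXTENSION_TO_LANG.get(suffix)
```

Extension matching is only a first approximation: it cannot tell a C header from a C++ header, for example, so a content-based check would likely be needed on top of it.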