Project author: jwrona

Project description:
Create a categorized image of an unorganized directory
Primary language: Python
Project URL: git://github.com/jwrona/categorize-files.git
Created: 2019-06-05T15:31:02Z
Project community: https://github.com/jwrona/categorize-files

License: MIT License



File Categorization Utility

Create a categorized image of an unorganized directory (e.g. a disk dump).
Each contained file is classified according to its suffix or signatures (also
known as magic numbers or magic bytes) and moved/copied/linked to a
corresponding directory in the output directory.
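
To illustrate what classification by signature means, here is a minimal sketch that inspects the first bytes of a file. The `SIGNATURES` table and the `sniff_type` function are hypothetical names made up for this example. They are not taken from categorize_files.py, which may well rely on a dedicated library instead.

```python
# Hypothetical signature table: a handful of well-known magic numbers.
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\x1a\x45\xdf\xa3", "video/x-matroska"),  # EBML header, shared with WebM
]


def sniff_type(path, default="application/octet-stream"):
    """Guess a MIME type from the leading bytes of the file at `path`."""
    with open(path, "rb") as f:
        head = f.read(16)  # long enough for every signature in the table
    for magic, mime in SIGNATURES:
        if head.startswith(magic):
            return mime
    return default
```

A PDF file that was renamed to not_a_doc.doc would still be reported as application/pdf here, because the file name is never consulted.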

Quick Example

If you want to organize a directory which looks like this (or possibly a
thousand times worse):

  big_mess
  ├── definitely_not_porn
  │   ├── cool.mkv
  │   ├── omg.mkv
  │   └── secret
  │       ├── nice.jpg
  │       └── wow.mkv
  ├── documents
  │   ├── dir1
  │   │   ├── not_a_doc.doc
  │   │   └── untitled.ods
  │   ├── dir2
  │   │   ├── document.odt
  │   │   └── untitled.ods
  │   └── recording.mp3
  ├── random_stuff
  │   ├── document.docx
  │   ├── howto.pdf
  │   └── paper.docx
  └── song.mp3

you can run categorize_files.py -c mime_content -p copy -i flat_name input to
get it done:

  big_mess_categorized_by_mime_content
  ├── application
  │   ├── pdf
  │   │   └── howto.pdf
  │   ├── vnd.oasis.opendocument.spreadsheet
  │   │   ├── untitled.ods
  │   │   └── untitled.ods.1
  │   ├── vnd.oasis.opendocument.text
  │   │   ├── document.odt
  │   │   └── not_a_doc.doc
  │   └── vnd.openxmlformats-officedocument.wordprocessingml.document
  │       ├── document.docx
  │       └── paper.docx
  ├── audio
  │   └── mpeg
  │       ├── recording.mp3
  │       └── song.mp3
  ├── image
  │   └── jpeg
  │       └── nice.jpg
  └── video
      └── x-matroska
          ├── cool.mkv
          ├── omg.mkv
          └── wow.mkv

Parameters

There are three essential parameters: classification criterion, file system
operation used to create the categorized files, and output image structure.
Several other parameters are available; for more information, see the built-in
help (-h/--help argument).
To demonstrate how these parameters affect the output, the following directory
tree will be used as input:

  input
  ├── app1
  │   ├── python_bytecode.pyc
  │   ├── python_script.py
  │   └── shell_script
  └── app2
      ├── python_bytecode.pyc
      ├── python_script.py
      └── shell_script

Classification Criterion

A criterion by which files are classified.
Set by the -c/--criterion argument with one of the following options:

  • suffix — the file suffix (extension)
    input_categorized_by_suffix
    ├── py
    │   ├── python_script.py
    │   └── python_script.py.1
    ├── pyc
    │   ├── python_bytecode.pyc
    │   └── python_bytecode.pyc.1
    └── unknown
        ├── shell_script
        └── shell_script.1
  • mime_name — guess the MIME type of the file based on its filename
    input_categorized_by_mime_name
    ├── application
    │   └── x-python-code
    │       ├── python_bytecode.pyc
    │       └── python_bytecode.pyc.1
    ├── text
    │   └── x-python
    │       ├── python_script.py
    │       └── python_script.py.1
    └── unknown
        ├── shell_script
        └── shell_script.1
  • mime_content — guess the MIME type of the file based on its content
    input_categorized_by_mime_content
    ├── application
    │   └── octet-stream
    │       ├── python_bytecode.pyc
    │       └── python_bytecode.pyc.1
    └── text
        ├── x-python
        │   ├── python_script.py
        │   └── python_script.py.1
        └── x-shellscript
            ├── shell_script
            └── shell_script.1
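
The two name-based criteria can be approximated with the Python standard library alone. The sketch below is an assumption about how such a lookup might be written, not the code of categorize_files.py; the function names `category_by_suffix` and `category_by_mime_name` are hypothetical. The mime_content criterion is left out because it has to read the file itself.

```python
import mimetypes
from pathlib import PurePath

# Private lookup table instance from the standard library.
_MIME_DB = mimetypes.MimeTypes()


def category_by_suffix(name):
    """Return the suffix without the leading dot, or "unknown" if there is none."""
    suffix = PurePath(name).suffix  # ".py" for "python_script.py"
    return suffix[1:] if suffix else "unknown"


def category_by_mime_name(name):
    """Return the MIME type guessed from the file name only, or "unknown"."""
    mime, _encoding = _MIME_DB.guess_type(name)
    return mime if mime else "unknown"
```

A result such as text/x-python maps naturally onto the two directory levels text and x-python seen in the trees above, while shell_script has no suffix and therefore ends up in unknown under both criteria.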

File System Operation

A file system operation used to create the categorized file.
Set by the -p/--operation argument with one of the following options:

  • move — move (rename) the file
  • copy — copy the file
  • hard_link — create a hard link pointing to the file
  • symbolic_link — create a symbolic link (symlink) pointing to the file
    input_categorized_by_suffix
    ├── py
    │   ├── python_script.py -> /path/to/the/input/app2/python_script.py
    │   └── python_script.py.1 -> /path/to/the/input/app1/python_script.py
    ├── pyc
    │   ├── python_bytecode.pyc -> /path/to/the/input/app2/python_bytecode.pyc
    │   └── python_bytecode.pyc.1 -> /path/to/the/input/app1/python_bytecode.pyc
    └── unknown
        ├── shell_script -> /path/to/the/input/app2/shell_script
        └── shell_script.1 -> /path/to/the/input/app1/shell_script
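
Each of the four operations has a close counterpart in the Python standard library. The following sketch shows one possible mapping; the `OPERATIONS` table and the `place_file` helper are hypothetical and only assumed to resemble what the utility does internally.

```python
import os
import shutil

# Hypothetical mapping from option name to a callable taking (src, dst).
OPERATIONS = {
    "move": shutil.move,
    "copy": shutil.copy2,  # copy2 also tries to preserve file metadata
    "hard_link": os.link,
    # Link to the absolute source path so the link works from any directory.
    "symbolic_link": lambda src, dst: os.symlink(os.path.abspath(src), dst),
}


def place_file(src, dst, operation):
    """Create `dst` from `src` using the named file system operation."""
    parent = os.path.dirname(dst) or "."
    os.makedirs(parent, exist_ok=True)  # create the category directory
    OPERATIONS[operation](src, dst)
```

Note that a hard link only works when source and destination are on the same file system, and that move is the only operation that modifies the input directory.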

Output Image Structure

Determines the file names and the directory structure of the output image directory.
Set by the -i/--image-structure argument with one of the following options:

  • flat_name — input directories are not preserved.
    Collisions are possible.
    input_categorized_by_suffix
    ├── py
    │   ├── python_script.py
    │   └── python_script.py.1
    ├── pyc
    │   ├── python_bytecode.pyc
    │   └── python_bytecode.pyc.1
    └── unknown
        ├── shell_script
        └── shell_script.1
  • flat_path — input directories encoded in the file name (path separator
    characters are replaced with underscores).
    Collisions are possible.
    input_categorized_by_suffix
    ├── py
    │   ├── app1_python_script.py
    │   └── app2_python_script.py
    ├── pyc
    │   ├── app1_python_bytecode.pyc
    │   └── app2_python_bytecode.pyc
    └── unknown
        ├── app1_shell_script
        └── app2_shell_script
  • nested — input directories are preserved.
    Collisions are not possible.
    input_categorized_by_suffix
    ├── py
    │   ├── app1
    │   │   └── python_script.py
    │   └── app2
    │       └── python_script.py
    ├── pyc
    │   ├── app1
    │   │   └── python_bytecode.pyc
    │   └── app2
    │       └── python_bytecode.pyc
    └── unknown
        ├── app1
        │   └── shell_script
        └── app2
            └── shell_script
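
The three structures differ only in how the path of a file relative to the input directory is turned into a path below its category directory. The sketch below shows one way to express that, together with a numeric-suffix scheme for collisions similar to the .1 names in the examples. Both `output_relpath` and `resolve_collision` are hypothetical helpers, and the behavior beyond the first collision is an assumption.

```python
from pathlib import PurePosixPath


def output_relpath(rel_path, category, structure):
    """Map an input-relative path to a path inside the output image."""
    rel = PurePosixPath(rel_path)
    if structure == "flat_name":
        tail = PurePosixPath(rel.name)  # drop all input directories
    elif structure == "flat_path":
        tail = PurePosixPath("_".join(rel.parts))  # separators -> underscores
    elif structure == "nested":
        tail = rel  # keep the input directories as they are
    else:
        raise ValueError(f"unknown image structure: {structure}")
    return PurePosixPath(category) / tail


def resolve_collision(candidate, taken):
    """Append .1, .2, ... to `candidate` until it is not in `taken`."""
    if candidate not in taken:
        return candidate
    n = 1
    while f"{candidate}.{n}" in taken:
        n += 1
    return f"{candidate}.{n}"
```

For example, app1/python_script.py in category py becomes py/python_script.py, py/app1_python_script.py, or py/app1/python_script.py, depending on the structure. With flat_path a collision is still possible, because a file named app1_x in the input root and a file x in app1 map to the same name.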

Known Bugs and TODOs

  • Using the flat_path output image structure may create file names longer than
    the file system can handle.
    • Currently [Errno 36] File name too long: 'filename' is reported and the
      affected file is skipped.
    • In POSIX, the limit is defined by NAME_MAX, which is usually set to 255
      bytes.
    • To eliminate this, some kind of name ellipsization would be necessary.
  • Some information about file format identification can be found
    here or
    here
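
As a possible direction for the file name length issue listed above, the sketch below shortens a name so that its UTF-8 encoding fits into a byte limit while the suffix is kept. The `shorten_name` function, its `limit` default of 255, and the `...` marker are all assumptions made for illustration. Nothing like this exists in the utility yet.

```python
import os


def shorten_name(name, limit=255, marker="..."):
    """Shorten `name` to at most `limit` UTF-8 bytes, keeping its suffix."""
    if len(name.encode("utf-8")) <= limit:
        return name
    stem, suffix = os.path.splitext(name)
    # Bytes left for the stem once the marker and the suffix are accounted for.
    budget = limit - len((marker + suffix).encode("utf-8"))
    if budget < 0:
        raise ValueError("limit is too small for the marker and suffix")
    # Cut on a byte boundary, then drop a possibly incomplete last character.
    head = stem.encode("utf-8")[:budget].decode("utf-8", errors="ignore")
    return head + marker + suffix
```

On POSIX systems the actual limit of a given directory can be queried with os.pathconf(directory, "PC_NAME_MAX") instead of assuming 255. Shortened names can collide with each other, so a scheme like this would still need the same collision handling as the flat structures.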