Create a categorized image of an unorganized directory
Create a categorized image of an unorganized directory (e.g. a disk dump).
Each contained file is classified according to its suffix or signatures (also
known as magic numbers or magic bytes) and moved/copied/linked to a
corresponding directory in the output directory.
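As a rough illustration of what classification by signature means, the sketch below checks the first bytes of a file against a tiny hand-written table. The table, `sniff`, and `sniff_file` are hypothetical names made up for this example. How categorize_files.py itself detects content types is not shown here, and a real detector knows far more signatures.

```python
from pathlib import Path

# Hypothetical, deliberately tiny signature table (assumption for illustration).
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"\x1a\x45\xdf\xa3", "video/x-matroska"),  # EBML header, also used by WebM
]


def sniff(data: bytes) -> str:
    """Return a MIME type for the leading bytes, or 'unknown' if none match."""
    for magic, mime in SIGNATURES:
        if data.startswith(magic):
            return mime
    return "unknown"


def sniff_file(path: Path) -> str:
    """Read only the first few bytes of a file and classify them."""
    with path.open("rb") as handle:
        return sniff(handle.read(16))
```

This is why a file such as not_a_doc.doc can end up in a category that its suffix does not suggest: only the content is inspected.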
If you want to organize a directory which looks like this (or possibly a
thousand times worse):
big_mess
├── definitely_not_porn
│   ├── cool.mkv
│   ├── omg.mkv
│   └── secret
│       ├── nice.jpg
│       └── wow.mkv
├── documents
│   ├── dir1
│   │   ├── not_a_doc.doc
│   │   └── untitled.ods
│   ├── dir2
│   │   ├── document.odt
│   │   └── untitled.ods
│   └── recording.mp3
├── random_stuff
│   ├── document.docx
│   ├── howto.pdf
│   └── paper.docx
└── song.mp3
you can run
categorize_files.py -c mime_content -p copy -i flat_name input
to get it done:
big_mess_categorized_by_mime_content
├── application
│   ├── pdf
│   │   └── howto.pdf
│   ├── vnd.oasis.opendocument.spreadsheet
│   │   ├── untitled.ods
│   │   └── untitled.ods.1
│   ├── vnd.oasis.opendocument.text
│   │   ├── document.odt
│   │   └── not_a_doc.doc
│   └── vnd.openxmlformats-officedocument.wordprocessingml.document
│       ├── document.docx
│       └── paper.docx
├── audio
│   └── mpeg
│       ├── recording.mp3
│       └── song.mp3
├── image
│   └── jpeg
│       └── nice.jpg
└── video
    └── x-matroska
        ├── cool.mkv
        ├── omg.mkv
        └── wow.mkv
There are three essential parameters: classification criterion, file system
operation used to create the categorized files, and output image structure.
Several other parameters are available; for more information, see the built-in
help (-h/--help argument).
To demonstrate how these parameters affect the output, the following directory
tree will be used as input:
input
├── app1
│   ├── python_bytecode.pyc
│   ├── python_script.py
│   └── shell_script
└── app2
    ├── python_bytecode.pyc
    ├── python_script.py
    └── shell_script
A criterion by which files are classified.
Set by the -c/--criterion argument with one of the following options:
suffix: the file suffix (extension)
input_categorized_by_suffix
├── py
│   ├── python_script.py
│   └── python_script.py.1
├── pyc
│   ├── python_bytecode.pyc
│   └── python_bytecode.pyc.1
└── unknown
    ├── shell_script
    └── shell_script.1
mime_name: guess the MIME type of the file based on its filename
input_categorized_by_mime_name
├── application
│   └── x-python-code
│       ├── python_bytecode.pyc
│       └── python_bytecode.pyc.1
├── text
│   └── x-python
│       ├── python_script.py
│       └── python_script.py.1
└── unknown
    ├── shell_script
    └── shell_script.1
mime_content: guess the MIME type of the file based on its content
input_categorized_by_mime_content
├── application
│   └── octet-stream
│       ├── python_bytecode.pyc
│       └── python_bytecode.pyc.1
└── text
    ├── x-python
    │   ├── python_script.py
    │   └── python_script.py.1
    └── x-shellscript
        ├── shell_script
        └── shell_script.1
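The two name-based criteria can be approximated with the Python standard library alone. The following is a minimal sketch, not the tool's actual code: `by_suffix` and `by_mime_name` are hypothetical helper names, and the "unknown" fallback simply mirrors the directory name seen in the example trees.

```python
import mimetypes
from pathlib import Path


def by_suffix(name: str) -> str:
    """Category for the 'suffix' criterion: the extension without its dot."""
    suffix = Path(name).suffix  # ".py" for "python_script.py", "" if absent
    return suffix[1:] if len(suffix) > 1 else "unknown"


def by_mime_name(name: str) -> str:
    """Category for the 'mime_name' criterion: MIME type guessed from the name."""
    mime, _encoding = mimetypes.guess_type(name)
    return mime or "unknown"


for example in ("python_script.py", "shell_script"):
    print(example, by_suffix(example), by_mime_name(example))
```

Neither helper opens the file, which is why shell_script lands in unknown for both criteria, while mime_content can still recognize it as a shell script.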
A file system operation used to create the categorized file.
Set by the -p/--operation argument with one of the following options:
move: move (rename) the file
copy: copy the file
hard_link: create a hard link pointing to the file
symbolic_link: create a symbolic link (symlink) pointing to the file
input_categorized_by_suffix
├── py
│   ├── python_script.py -> /path/to/the/input/app2/python_script.py
│   └── python_script.py.1 -> /path/to/the/input/app1/python_script.py
├── pyc
│   ├── python_bytecode.pyc -> /path/to/the/input/app2/python_bytecode.pyc
│   └── python_bytecode.pyc.1 -> /path/to/the/input/app1/python_bytecode.pyc
└── unknown
    ├── shell_script -> /path/to/the/input/app2/shell_script
    └── shell_script.1 -> /path/to/the/input/app1/shell_script
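Each of the four operations has a direct counterpart in the Python standard library. The sketch below shows one way to dispatch on the option name. `OPERATIONS` and `place` are hypothetical names, and the use of absolute symlink targets is an assumption based on the example tree above, not a statement about the tool's internals.

```python
import os
import shutil

# Map each option name to a callable taking (source, destination).
OPERATIONS = {
    "move": shutil.move,
    "copy": shutil.copy2,  # copy2 also preserves timestamps where possible
    "hard_link": os.link,
    # Absolute target, so the link stays valid wherever the output lives.
    "symbolic_link": lambda src, dst: os.symlink(os.path.abspath(src), dst),
}


def place(operation: str, source: str, destination: str) -> None:
    """Create the categorized file at `destination` using the chosen operation."""
    os.makedirs(os.path.dirname(destination) or ".", exist_ok=True)
    OPERATIONS[operation](source, destination)
```

Note that hard links only work within a single file system, and that move is the only operation that changes the input directory.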
The image structure determines the file names and the directory structure of
the output image directory.
Set by the -i/--image-structure argument with one of the following options:
flat_name: input directories are not preserved.
input_categorized_by_suffix
├── py
│   ├── python_script.py
│   └── python_script.py.1
├── pyc
│   ├── python_bytecode.pyc
│   └── python_bytecode.pyc.1
└── unknown
    ├── shell_script
    └── shell_script.1
flat_path: input directories are encoded in the file name (path separators
appear as underscores in the example).
input_categorized_by_suffix
├── py
│   ├── app1_python_script.py
│   └── app2_python_script.py
├── pyc
│   ├── app1_python_bytecode.pyc
│   └── app2_python_bytecode.pyc
└── unknown
    ├── app1_shell_script
    └── app2_shell_script
nested: input directories are preserved.
input_categorized_by_suffix
├── py
│   ├── app1
│   │   └── python_script.py
│   └── app2
│       └── python_script.py
├── pyc
│   ├── app1
│   │   └── python_bytecode.pyc
│   └── app2
│       └── python_bytecode.pyc
└── unknown
    ├── app1
    │   └── shell_script
    └── app2
        └── shell_script
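The three image structures differ only in how a path relative to the input directory is turned into a path inside a category directory. The sketch below reproduces the names seen in the example trees. `destination` and `unique` are hypothetical helpers, and both the underscore separator and the numeric ".1" collision suffix are inferred from those trees rather than taken from the tool's source.

```python
from pathlib import PurePosixPath


def destination(relative_path: str, structure: str) -> PurePosixPath:
    """Map an input-relative path to its path inside a category directory."""
    path = PurePosixPath(relative_path)
    if structure == "flat_name":
        return PurePosixPath(path.name)  # drop all directories
    if structure == "flat_path":
        return PurePosixPath("_".join(path.parts))  # encode directories in name
    if structure == "nested":
        return path  # keep directories as they are
    raise ValueError(f"unknown image structure: {structure}")


def unique(name: str, taken: set) -> str:
    """Append .1, .2, ... until the name does not collide with earlier ones."""
    candidate, counter = name, 0
    while candidate in taken:
        counter += 1
        candidate = f"{name}.{counter}"
    taken.add(candidate)
    return candidate
```

Only flat_name needs the collision handling in these examples, because flat_path and nested keep enough of the original path to tell app1 and app2 apart.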
The flat_path output image structure may create file names longer than
NAME_MAX, which is usually set to 255. When that happens,
[Errno 36] File name too long: 'filename' is reported.
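To see in advance whether a flattened name would exceed the limit, its encoded length can be compared with the limit of the target file system. This is a minimal sketch under stated assumptions: `name_max` and `fits` are hypothetical helpers, and the fallback of 255 is simply the usual value mentioned above.

```python
import os


def name_max(directory: str = ".", default: int = 255) -> int:
    """Ask the file system for its NAME_MAX; fall back to a common default."""
    try:
        return os.pathconf(directory, "PC_NAME_MAX")
    except (AttributeError, ValueError, OSError):
        # os.pathconf is POSIX only, and not every file system reports a value.
        return default


def fits(file_name: str, limit: int) -> bool:
    """The limit counts encoded bytes, not characters."""
    return len(os.fsencode(file_name)) <= limit


long_name = "_".join(["some_directory"] * 20) + "_file.txt"
print(len(long_name), fits(long_name, name_max()))
```

Switching to the nested or flat_name structure avoids the problem, since neither makes file names longer than they were in the input.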