Project author: mirage

Project description:
Like Detective Conan, find clues about the type of a file
Language: OCaml
Repository: git://github.com/mirage/conan.git
Created: 2020-03-20T16:18:25Z
Project community: https://github.com/mirage/conan


Conan, a detective which aggregates some clues to recognize a MIME type

conan is a re-implementation of the famous command file:

    $ file --mime image.png
    image/png
    $ conan.file --mime image.png
    image/png

This program/library (see libmagic) is widely used by protocols which transmit
the MIME type of a file. The receiver can then call the right program to
manipulate the given file.

For instance, the HTTP protocol transmits via the Content-Type field the
MIME type of the body:

    HTTP/1.1 200 OK
    Content-Type: image/png
    Content-Length: 4096
    <your image>

file is used in many places, such as your web browser (to be able to execute
the right application to interpret the given file) or your Mail User Agent.

However, file is pretty old (1987), its implementation is in C, it was never
formalized and it has no standard. Using file as the engine for file
recognition can be a risk (segmentation fault, undefined behavior, unstable
release process, etc.)

But file has evolved over many years and it comes with a large, extensible
database which can be considered reliable thanks to its age. So, some famous
software projects decided to re-implement a subset of file/libmagic which is
less expressive/powerful but does the job within certain expectations.

The DSL - libmagic

file’s database uses a language described by man magic:

  • a line describes an operation
  • an operation is:
    • a test of a certain value at a certain position into the given file
    • an anchor
    • a jump instruction to an anchor
    • a MIME value
    • a strength value

These operations are organized as a tree. Each operation is prefixed by its
level (one > per level) and, from it, we are able to construct the decision
tree which describes multiple paths to recognize the MIME type of the given
file.

For instance:

    [0]
    > [1]
    >> [2]
    > [3]
    >> [4]
    >>> [5]

produces this decision tree:

    [0]
     | \
    [1] [3]
     |   |
    [2] [4]
         |
        [5]
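
As a minimal sketch of how such a tree can be rebuilt from the levels, assuming
each line was already split into a (level, label) pair: the type `tree` and the
functions `parse` and `forest_of_lines` are hypothetical names used for
illustration only, they are not part of conan.

```ocaml
(* A rose tree: a node carries a label and its children, in source order. *)
type tree = Node of string * tree list

(* [parse level lines] reads the consecutive nodes found at [level], each
   with its deeper descendants, and returns them with the remaining lines. *)
let rec parse level = function
  | (l, label) :: rest when l = level ->
      let children, rest = parse (level + 1) rest in
      let siblings, rest = parse level rest in
      (Node (label, children) :: siblings, rest)
  | rest -> ([], rest)

let forest_of_lines lines = fst (parse 0 lines)

(* The six lines of the example above. *)
let example =
  forest_of_lines
    [ (0, "[0]"); (1, "[1]"); (2, "[2]"); (1, "[3]"); (2, "[4]"); (3, "[5]") ]
```

Here `example` holds a single root `[0]` whose children are `[1]` (with child
`[2]`) and `[3]` (with child `[4]`, itself the parent of `[5]`).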

The test operation

An operation is usually a test which compares the data starting at a particular
offset in the file with a byte value, a string or a numeric value. If the test
succeeds, we continue along the corresponding path of the decision tree. For
instance, if [0] succeeds, we will try [1] and [3].

During the process, we aggregate multiple solutions, each with a priority (see
the strength value), and we choose the one with the highest priority.
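
The selection step can be sketched as follows. This is only an illustration:
the record `solution`, the function `best` and the tie-breaking rule (the first
candidate wins) are assumptions, the text above does not say how conan ranks
equal strengths.

```ocaml
(* A candidate answer collected on one path of the decision tree. *)
type solution = { mime : string; strength : int }

(* Keep the candidate with the highest strength; on a tie, the first wins. *)
let best = function
  | [] -> None
  | x :: xs ->
      Some
        (List.fold_left
           (fun acc s -> if s.strength > acc.strength then s else acc)
           x xs)
```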

A test operation consists of the following fields:

  • offset: A number specifying the offset (in bytes) into the file of the data
    which is to be tested. This offset can be relative to the previous
    operation’s offset if it begins with &.
  • type: The type of the data to be tested. We implemented many types such as
    byte, short, long, string or date.
  • test: the value to be compared with the value from the file.
  • message: the message to be printed if the comparison succeeds.
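
To make these four fields concrete, here is a small sketch in plain OCaml. It
models only absolute offsets, three numeric types and an equality test; the
names `ty`, `test_op`, `read` and `eval` are hypothetical and do not come from
conan’s API.

```ocaml
(* A subset of the types named above. *)
type ty = Byte | Beshort | Leshort

(* offset, type, test and message, with the test reduced to "= expected". *)
type test_op = { offset : int; ty : ty; expected : int; message : string }

(* Read the value of type [ty] at [offset], or None if out of bounds. *)
let read contents offset ty =
  let byte i = Char.code contents.[offset + i] in
  match ty with
  | Byte when offset >= 0 && offset < String.length contents -> Some (byte 0)
  | (Beshort | Leshort)
    when offset >= 0 && offset + 1 < String.length contents ->
      Some
        (if ty = Beshort then (byte 0 lsl 8) lor byte 1
         else (byte 1 lsl 8) lor byte 0)
  | _ -> None

(* Return the message when the value read equals the expected one. *)
let eval contents op =
  match read contents op.offset op.ty with
  | Some v when v = op.expected -> Some op.message
  | _ -> None
```

For instance, `eval "MZ" { offset = 0; ty = Leshort; expected = 0x5a4d;
message = "MZ" }` succeeds because `M` (0x4d) is the low byte and `Z` (0x5a)
the high byte of the little-endian value.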

Let’s play!

An example is more interesting than the theory. Let’s try to recognize a
zlib archive. According to [RFC1950][] (where CMF and FLG are the first bytes
in MSB order of a zlib archive):

The FCHECK value must be such that CMF and FLG, when viewed as
a 16-bit unsigned integer stored in MSB order (CMF*256 + FLG),
is a multiple of 31.

Then, the first byte must encode CM = 8:

CM (Compression method)
This identifies the compression method used in the file. CM = 8
denotes the “deflate” compression method with a window size up
to 32K.

The RFC also marks CM = 15 as reserved and, for CM = 8, does not allow CINFO
(the 4 most significant bits of the first byte) to exceed 7. We can therefore
require that the most significant bit of the first byte, byte & 0x80, is not
set.

Finally, we have 3 tests to do:
1) the 16-bit number (big-endian order) must be a multiple of 31
2) CM, which is the 4 least significant bits of the first byte, must be equal
to 8
3) the most significant bit of the first byte must be equal to 0

In our syntax and according to the idea of a decision tree, we must test
step by step these assertions. At the end, we can say that the file is
probably an application/zlib:

    0 beshort%31 =0
    >0 byte&0xf =8
    >>0 byte&0x80 =0
    !:mime application/zlib
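
To check the reasoning independently of any library, the same three tests can
be sketched in plain OCaml. The function name `looks_like_zlib` is
hypothetical; each comment points to the line of the description it mirrors.

```ocaml
(* Apply the three checks to the first two bytes (CMF and FLG). *)
let looks_like_zlib contents =
  if String.length contents < 2 then false
  else
    let cmf = Char.code contents.[0] in
    let flg = Char.code contents.[1] in
    ((cmf * 256) + flg) mod 31 = 0 (* 0 beshort%31 =0 *)
    && cmf land 0x0f = 8 (* >0 byte&0xf =8 *)
    && cmf land 0x80 = 0 (* >>0 byte&0x80 =0 *)
```

A stream starting with the bytes 0x78 0x9c passes all three checks: 0x789c is
30876 = 31 * 996, the low nibble of 0x78 is 8 and its most significant bit is
not set.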

Now, let’s play with conan:

    open Rresult

    (* The description written above, embedded as a quoted string. *)
    let zlib =
    {file|0 beshort%31 =0
    >0 byte&0xf =8
    >>0 byte&0x80 =0
    !:mime application/zlib
    |file}

    (* Parse the description into a decision tree, or fail with a message. *)
    let tree = R.failwith_error_msg @@ Conan_unix.tree_of_string zlib

    let () =
      if Array.length Sys.argv >= 2
      then
        let m = R.failwith_error_msg @@
          Conan_unix.run_with_tree tree Sys.argv.(1) in
        match Conan.Metadata.mime m with
        | Some v -> Fmt.pr "%s\n%!" v
        | None -> Fmt.epr "MIME type not found.\n%!"
      else Fmt.epr "%s <filename>\n%!" Sys.argv.(0)

This little program will only recognize “application/zlib” according to our
description above. Of course, the DSL can be more complex than that!

Complex recognition

Indirect offset

Offsets do not need to be constant, but can also be read from the file being
examined. If the first character following the last > is a parenthesis, then
the string inside the parentheses is interpreted as an indirect offset: the
value at that offset is read, and is used again as an offset in the file.

For instance, the following tree does an indirection through the unsigned long
(little-endian) value stored at offset 0x3c:

    0 string MZ
    >0x18 leshort >0x3f
    >>(0x3c.l) string PE\0\0 PE executable (MS-Windows)
    >>(0x3c.l) string LE\0\0 LX executable (OS/2)

You should check man magic to see the syntax and the available types. You can
also apply a calculation when the indirect offset cannot be used directly, as
in this example where we multiply the indirect offset by 512:

    >0x18 leshort <0x40
    >>(4.s*512) leshort 0x014c COFF executable (MS-DOS, DJGPP)
    >>(4.s*512) leshort !0x014c MZ executable (MS-DOS)
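
The indirection itself can be sketched in plain OCaml. This is not how conan
implements it: `le32`, `string_at` and `is_pe` are hypothetical helpers which
only mirror the line `>>(0x3c.l) string PE\0\0`, and the sketch assumes a
64-bit OCaml where a 32-bit value fits in an `int`.

```ocaml
(* Read an unsigned 32-bit little-endian integer, if 4 bytes are available. *)
let le32 contents pos =
  if pos < 0 || pos + 4 > String.length contents then None
  else
    let b i = Char.code contents.[pos + i] in
    Some (b 0 lor (b 1 lsl 8) lor (b 2 lsl 16) lor (b 3 lsl 24))

(* Does [contents] hold [expected] at [pos]? *)
let string_at contents pos expected =
  let len = String.length expected in
  pos >= 0
  && pos + len <= String.length contents
  && String.sub contents pos len = expected

(* First read the offset stored at 0x3c, then test the string found there. *)
let is_pe contents =
  match le32 contents 0x3c with
  | Some target -> string_at contents target "PE\000\000"
  | None -> false
```

The value read at 0x3c is never compared with anything: it only tells where
the real test must take place.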

Relative offset

Moreover you can specify an offset relative to the end of the last up-level
field using & as a prefix to the offset:

    0 string MZ
    >0x18 leshort >0x3f
    >>(0x3c.l) string PE\0\0 PE executable (MS-Windows)
    >>>&0 leshort 0x14c for Intel 80386
    >>>&0 leshort 0x184 for DEC Alpha

And, of course, indirect and relative offsets can be combined.
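
A relative offset can be sketched by letting a successful test return the
position where its match ends. The helpers `match_string`, `le16` and
`machine` below are hypothetical and only mirror the two `>>>&0 leshort` lines
of the example; `pe_offset` stands for the offset obtained through the
indirection.

```ocaml
(* On success, return the position just after the matched string. *)
let match_string contents pos expected =
  let len = String.length expected in
  if pos >= 0
     && pos + len <= String.length contents
     && String.sub contents pos len = expected
  then Some (pos + len)
  else None

(* Read an unsigned 16-bit little-endian integer, if 2 bytes are available. *)
let le16 contents pos =
  if pos < 0 || pos + 2 > String.length contents then None
  else Some (Char.code contents.[pos] lor (Char.code contents.[pos + 1] lsl 8))

(* &0 is resolved against the end of the parent match, not the file start. *)
let machine contents pe_offset =
  match match_string contents pe_offset "PE\000\000" with
  | None -> None
  | Some parent_end -> (
      match le16 contents (parent_end + 0) with
      | Some 0x14c -> Some "for Intel 80386"
      | Some 0x184 -> Some "for DEC Alpha"
      | _ -> None)
```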

Jump and recursion

It is possible to define a “named” magic instance that can be called from
another use magic entry, like a subroutine call. The offset of the subroutine
is relative to the caller.

To call a subroutine, we use the use operation with the name of the
subroutine. You don’t need to define the subroutine before the caller: file
and conan collect all subroutines first and then process the decision tree.

This is a simple example which determines whether the length of the given
file is odd or even:

    0 name even
    >0 byte x even
    >>1 use odd
    0 name odd
    >0 byte x odd
    >>1 use even
    0 byte x
    >0 use odd
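
The behavior of this example can be modelled in a few lines of OCaml. It is a
simplified sketch under stated assumptions: the table `subs`, the record `sub`
and the functions `use` and `parity` are hypothetical, `byte x` is modelled as
"a byte exists at this offset", and the last message emitted is taken as the
answer.

```ocaml
(* A subroutine: the message it emits and the subroutine it calls next. *)
type sub = { message : string; next : string }

(* All subroutines are registered first, so "even" may refer to "odd"
   before "odd" is defined. *)
let subs : (string, sub) Hashtbl.t = Hashtbl.create 2

let () =
  Hashtbl.replace subs "even" { message = "even"; next = "odd" };
  Hashtbl.replace subs "odd" { message = "odd"; next = "even" }

(* [use name contents base acc] runs [name] with offsets relative to [base]:
   if a byte exists at [base + 0], emit the message and call the next
   subroutine at [base + 1]. *)
let rec use name contents base acc =
  let sub = Hashtbl.find subs name in
  if base < String.length contents
  then use sub.next contents (base + 1) (sub.message :: acc)
  else acc

(* The top-level entry calls "odd" at offset 0; the last message wins. *)
let parity contents =
  match use "odd" contents 0 [] with
  | last :: _ -> Some last
  | [] -> None
```

With this model, a one-byte input yields `Some "odd"`, a two-byte input yields
`Some "even"` and an empty input yields `None`.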

Other operations

The libmagic DSL implements many things but, as we said, no standard of it
exists. We mostly reverse-engineered it to implement operations. Some of them
are not implemented, either because they lack a definition or because we did
not find them in file’s database. Some others are explicitly not implemented
because we consider them a hack rather than a coherent feature.

We also focus mostly on delivering the MIME type rather than a full
description of the given file. file shows you many things such as the size
of the image, the bitrate of the sound, etc. We tried to implement them, but
MIME recognition remains our priority.

Experimental

According to what we said above, conan is experimental and, from the user’s
point of view, it can leak exceptions such as Unimplemented feature.

Moreover, even if a lot of work was done on types, where we try to unify the
type of the expected value and the type of the test, the type expected by the
message is still weakly checked (for many reasons). In other words, even if we
can parse and process the decision tree, we can still fail when we print out
messages (because we cannot unify the type of the value with the type expected
by the given message).

Finally, file does not define any standard for the database, and the man
pages are a bit outdated compared to what the file command does. For these
reasons, it’s hard to prove or claim that we have the same behavior as file.
We try to stay close to what it does but, in some edge cases, we cannot ensure
that we produce the same result as file.

Also, we have not explored everything in the given database. Even if we can
parse and generate a decision tree from the database, some specific execution
paths can lead to an unexpected failure. We will of course fix them step by
step. Feel free to test and open an issue!

MirageOS support

The other goal of conan is to be able to integrate the database into a
unikernel and to let an application (such as a web server) recognize the MIME
types of files.

syscalls

Like any MirageOS project, conan abstracts the syscalls required to inspect
a file. This is why conan.string exists: it is able to recognize the MIME
type of a given string (instead of a file). lwt support exists too, and it
manipulates a stream.
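
The idea of abstracting syscalls can be sketched as a record of functions
which a recognizer receives as an argument. This is an illustration only: the
type `reader`, the value `string_reader` and the function `starts_with_mz` are
hypothetical and do not reflect conan’s actual interface.

```ocaml
(* The only operation this recognizer needs: [len] bytes at [pos]. *)
type 'src reader = { read : 'src -> pos:int -> len:int -> string option }

(* A backend for in-memory strings; no syscall is involved. *)
let string_reader =
  { read =
      (fun s ~pos ~len ->
        if pos < 0 || len < 0 || pos + len > String.length s then None
        else Some (String.sub s pos len)) }

(* Written once against [reader], usable with any backend. *)
let starts_with_mz reader src = reader.read src ~pos:0 ~len:2 = Some "MZ"
```

A file-based or stream-based backend would provide another `reader` value, and
`starts_with_mz` would stay unchanged.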

Database

conan is able to parse a database and serialize it as a full OCaml value. The
distribution provides 2 databases:

  • file’s database
  • the same database without the extra paths which do not tag the MIME type

The second is lighter than the first and should be used only to get the MIME
type. Indeed, any other information, such as the size of the image or the
bitrate of the sound, is removed.

For instance, a unikernel for Solo5 with the lighter database is around 6 MB.

You can also build your own special database. If you know that you want to
recognize only a few kinds of objects, you can merge the tree values for these
objects and make a smaller database:

    #require "conan-unix" ;;
    #require "conan-database" ;;

    let tree0 = Conan_compress.tree
    let tree1 = Conan_ocaml.tree
    let tree2 = Conan_audio.tree

    (* Merge the three per-format trees into one smaller database. *)
    let tree =
      List.fold_left Conan.Tree.merge Conan.Tree.empty
        [ tree0; tree1; tree2 ]

    let recognize_ocaml_or_archive_or_audio filename =
      Conan_unix.run_with_tree tree filename