项目作者: openbases

项目描述 :
An example to extract metadata from a Dockerfile using schema.org
高级语言: HTML
项目地址: git://github.com/openbases/extract-dockerfile.git
创建时间: 2018-11-05T15:51:25Z
项目社区:https://github.com/openbases/extract-dockerfile

开源协议:BSD 3-Clause "New" or "Revised" License

下载


Schema.org Dockerfile Parser

https://github.com/openschemas/spec-template/raw/master/img/hexagon_square_small.png

This is an example of using schema.org to label a Dockerfile. Specifically, we will
walk through examples that use already defined schema.org specifications, and other
examples that use specifications for containers
that are under development. The goal is to demonstrate utility in being able to
label a Dockerfile (a container recipe) and then different ideas for how this could be done.

Why should I care?

Labeling our containers and container recipes with schema.org metadata means that
we can serve it alongside the containers, and have the properties indexed by Google Search.
This means that regardless of where your recipe is hosted (a registry or Github pages)
your container and recipes will (finally) be discoverable. Note that the official
definition for a container “recipe” has not been integrated into schema.org, and we
have had good discussion with the OCI community about a proper name. Any of the
following would be a good contender:

  1. Thing > CreativeWork > SoftwareSourceCode > BuildDefinition
  2. Thing > CreativeWork > SoftwareSourceCode > BuildInstructions
  3. Thing > CreativeWork > SoftwareSourceCode > BuildPlan
  4. Thing > CreativeWork > SoftwareSourceCode > BuildRecipe
  5. Thing > CreativeWork > SoftwareSourceCode > Configuration
  6. Thing > CreativeWork > SoftwareSourceCode > ContainerConfig
  7. Thing > CreativeWork > SoftwareSourceCode > ImageDefinition

Just to provide you with more examples (for the same position in the hierarchy)
I have used several of the names above in the examples below. The only significantly
different is the “SoftwareSourceCode” that cuts away some of the additional recipe
metadata. This is also the only specification that is production schema.org.

What examples are here?

Toward this goal, we are going to walk through several simple examples
representing a Dockerfile as each of:

For each of the above, the metadata shown is also embedded in the page as json-ld
(when you “View Source.”) The examples are minimal in that I didn’t do any special
parsing of the containers to extract more meaningful tags for software, this would
be valuable to do to improve the search.

Each folder above includes a script to extract (extract.py),
a recipe to follow (recipe.yml), and the specification in yaml format (in the
case of a specification not served by production schema.org). When you view
the github pages of this
repository served by the master branch, you will see that the index.html
in each folder serves a simple page to show the extracted metadata. The
Dockerfile is the recipe that we aim to describe using schema.org,
and is for the container toasterlint/storjshare-cli on Docker Hub. I chose it pretty randomly because it started with
“toast.”

Usage

Before running these examples, make sure you have installed the module (and note
this module is under development, contributions are welcome!) An install.sh script
is provided that will install Singularity python (spython), Schema Org Python (schemaorg)
along with ContainerDiff.

  1. $ /bin/bash install.sh

For those curious, ContainerDiff
allows extraction of different metadata for containers.
To extract a recipe for a particular datatype, just run the extract.py file
in the corresponding folder. You can look at any of the extractors to get a gist
of what we do to generate the final metadata for Github pages. Generally we:

  1. Read in a specific version of the schemaorg definitions provided by the library
  2. Read in a recipe for a template that we want to populate (e.g., google/dataset)
  3. Use helper functions provided by the template (or our own) to extract
  4. Extract, validate, and generate the final dataset

The goal of the software is to provide enough structure to help the user (typically a developer)
but not so much as to be annoying to use generally.

What are the files in each folder?

recipe.yml Files

If I am a provider of a service and want my users to label their data for my service,
I need to tell them how to do this. I do this by way of a recipe file, in each
example folder there is a file called recipe.yml that is a simple listing of required fields defined for the entities that are needed. For example, the recipe.yml in the “SoftwareSourceCode” folder tells the parser that we need to define
properties for “SoftwareSourceCode” and an Organization or Person. For example.
with the schemaorg Python module
I can learn that the “SoftwareSourceCode” definition has 121 properties,
but the recipe tells us that we only need a subset of those
properties for a valid extraction.

specification.yml

The specification.yml file is only provided in the ContainerRecipe
folder because this isn’t a production specification provided by schema.org. For
those that are published (e.g., SoftwareSourceCode and Dataset) the definitions are
provided in the python module.

extract.py

This is the code snippet that shows how you extract metadata and use the
schemaorg Python module
to generate the final template page. This file could be run in multiple places!

  • In a continuous integration setup so that each change to master updates the Github Pages metadata.
  • Using a tool like datalad that allows for version control of such metadata, and definition of extractors (also in Python).
  • As a Github hook (or action) that is run at any stage in the development process.
  • Rendered by a web server that provides Container Recipes for users that should be indexed with Google Search (e.g., Singularity Hub).

Check out any of the subfolders for:

Resources