CSD for Apache Airflow
This repository allows you to install Apache Airflow as a service manageable by Cloudera Manager.
Copy the CSD jar into the /opt/cloudera/csd location on the Cloudera Manager server, then restart the Cloudera Manager server:
service cloudera-scm-server restart
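For example, assuming you have already built or downloaded the CSD jar (the jar name below is illustrative), the install steps could look like:
cp AIRFLOW-1.0.jar /opt/cloudera/csd/
chown cloudera-scm:cloudera-scm /opt/cloudera/csd/AIRFLOW-1.0.jar
service cloudera-scm-server restart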
Set AIRFLOWDB_PASSWORD to a sufficiently strong value. For example, run the following in your Linux shell session to generate a random 20-character password:
< /dev/urandom tr -dc A-Za-z0-9 | head -c 20; echo
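To keep the generated value around for the database setup steps below, you can capture it in a shell variable for the current session:
AIRFLOWDB_PASSWORD="$(< /dev/urandom tr -dc A-Za-z0-9 | head -c 20)"
echo "$AIRFLOWDB_PASSWORD"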
Example for MySQL:
CREATE DATABASE airflow DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci;
GRANT ALL ON airflow.* TO 'airflow'@'localhost' IDENTIFIED BY 'AIRFLOWDB_PASSWORD';
GRANT ALL ON airflow.* TO 'airflow'@'%' IDENTIFIED BY 'AIRFLOWDB_PASSWORD';
Alternatively, you can use the Airflow/MySQL deployment script to create the MySQL database:
create_mysql_dbs-airflow.sh --host <host_name> --user <username> --password <password>
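For instance, the SQL above can be run non-interactively from a shell with the mysql client (the host and root credentials are placeholders, and 'AIRFLOWDB_PASSWORD' should be replaced with the value you generated earlier):
mysql -h <host_name> -u root -p <<'SQL'
CREATE DATABASE airflow DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci;
GRANT ALL ON airflow.* TO 'airflow'@'localhost' IDENTIFIED BY 'AIRFLOWDB_PASSWORD';
GRANT ALL ON airflow.* TO 'airflow'@'%' IDENTIFIED BY 'AIRFLOWDB_PASSWORD';
SQL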
Example for PostgreSQL:
CREATE ROLE airflow LOGIN ENCRYPTED PASSWORD 'AIRFLOWDB_PASSWORD' NOSUPERUSER INHERIT CREATEDB NOCREATEROLE;
CREATE DATABASE airflow WITH OWNER = airflow ENCODING = 'UTF8' TABLESPACE = pg_default CONNECTION LIMIT = -1;
ALTER ROLE airflow SET search_path = airflow, "$user", public;
Alternatively, you can use the Airflow/PostgreSQL deployment script to create the PostgreSQL database:
create_postgresql_dbs-airflow.sh --host <host_name> --user <username> --password <password>
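Likewise, the PostgreSQL statements can be run from a shell with psql (the host and superuser are placeholders, and 'AIRFLOWDB_PASSWORD' should be replaced with your generated value):
psql -h <host_name> -U postgres <<'SQL'
CREATE ROLE airflow LOGIN ENCRYPTED PASSWORD 'AIRFLOWDB_PASSWORD' NOSUPERUSER INHERIT CREATEDB NOCREATEROLE;
CREATE DATABASE airflow WITH OWNER = airflow ENCODING = 'UTF8' TABLESPACE = pg_default CONNECTION LIMIT = -1;
ALTER ROLE airflow SET search_path = airflow, "$user", public;
SQL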
There are six roles available for deployment:
Webserver: The Airflow Webserver role runs the Airflow Web UI. The Webserver role can be deployed on more than one instance; the instances are identical, so the extra ones can serve as backups.
Scheduler: The Airflow Scheduler role schedules the Airflow jobs. It is limited to one instance to reduce the risk of duplicate jobs.
Worker: The Airflow Worker role picks up jobs from the Scheduler and executes them. Multiple instances can be deployed.
Flower Webserver: The Flower Webserver role is used to monitor Celery clusters. Celery allows for the expansion of Workers. Only one instance is needed.
Kerberos: The Airflow Kerberos role enables the Kerberos protocol for the other Airflow roles and for DAGs. This role should exist on each host with an Airflow Worker role.
Gateway: The purpose of the gateway role is to make the configuration available to CLI clients.
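For reference, each of these roles maps to a long-running Airflow 1.x CLI process; a rough sketch of what runs on each host (flags omitted, and this mapping is an assumption based on the standard Airflow commands rather than something stated in this README):
airflow webserver    # Webserver role
airflow scheduler    # Scheduler role
airflow worker       # Worker role (Celery)
airflow flower       # Flower Webserver role
airflow kerberos     # Kerberos ticket renewer role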
Here are some examples of Airflow commands:
airflow list_dags
airflow trigger_dag <DAG Name>
For a complete list of Airflow commands, refer to the Airflow Command Line Interface documentation.
Note: The DAG file has to be copied into the dags_folder directory on every node where an Airflow role is deployed; this distribution must be done manually.
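A minimal sketch of that manual distribution over SSH, assuming dags_folder is /var/lib/airflow/dags (both the path and the host names here are illustrative):
for host in scheduler1 worker1 worker2; do
  scp my_dag.py ${host}:/var/lib/airflow/dags/
done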
In order to enable authentication for the Airflow Web UI, check the “Enable Airflow Authentication” option. You can create Airflow users using one of the two options below.
One way to add Airflow users to the database is through the Airflow user settings in the service configuration in Cloudera Manager.
Note: Although only the last created user shows up in the Airflow configurations, you can still use the previously created users.
Another way to add Airflow users to the database is using the airflow-mkuser script. Users can be added as follows:
airflow-mkuser <username> <email> <password>
For example:
airflow-mkuser admin admin@localdomain password123
git clone https://github.com/teamclairvoyant/apache-airflow-cloudera-csd
cd apache-airflow-cloudera-csd
make dist
Update the version file before running make dist if creating a new release.
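For a new release, the flow might look like the following, assuming the version file holds a bare version string (the value shown is illustrative):
echo '1.2.0' > version
make dist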
After many deployments, you may encounter a ‘Markup file already exists’ error while trying to stop a role, and the process never stops. In that case, stop the process using the “Abort” command, navigate to /var/run/cloudera-scm-agent/process, and delete all the GracefulRoleStopRunner directories.
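A cleanup one-liner for the affected host, assuming the directory names contain GracefulRoleStopRunner as described above (run as root, and only after aborting the role):
rm -rf /var/run/cloudera-scm-agent/process/*GracefulRoleStopRunner*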