Elastic Data Processing (EDP) SPI

EDP job engine objects provide methods for creating, monitoring, and terminating jobs on Sahara clusters. Provisioning plugins that support EDP must return an EDP job engine object from the get_edp_engine(cluster, job_type) method described in Plugin SPI.

Sahara provides subclasses of the base job engine interface that support EDP on clusters running Oozie or on Spark standalone clusters. These are described below.

Job Types

Some of the methods below test job type. Sahara supports the following string values for job types:

  • Hive
  • Java
  • Pig
  • MapReduce
  • MapReduce.Streaming
  • Spark
  • Shell

Note, constants for job types are defined in sahara.utils.edp

Job Status Values

Several of the methods below return a job status value. A job status value is a dictionary of the form:

{‘status’: job_status_value}

where job_status_value is one of the following string values:

  • DONEWITHERROR
  • FAILED
  • TOBEKILLED
  • KILLED
  • PENDING
  • RUNNING
  • SUCCEEDED

Note, constants for job status are defined in sahara.utils.edp

EDP Job Engine Interface

The sahara.service.edp.base_engine.JobEngine class is an abstract class with the following interface:

cancel_job(job_execution)

Stops the running job whose id is stored in the job_execution object.

Returns: None if the operation was unsuccessful or an updated job status value

get_job_status(job_execution)

Returns the current status of the job whose id is stored in the job_execution object.

Returns: a job status value

run_job(job_execution)

Starts the job described by the job_execution object

Returns: a tuple of the form (job_id, job_status_value, job_extra_info)

  • job_id is required and must be a string that allows the EDP engine to uniquely identify the job.
  • job_status_value may be None or a job status value
  • job_extra_info may be None or optionally a dictionary that the EDP engine uses to store extra information on the job_execution_object.

validate_job_execution(cluster, job, data)

Checks whether or not the job can run on the cluster with the specified data. Data contains values passed to the /jobs/<job_id>/execute REST API method during job launch. If the job cannot run for any reason, including job configuration, cluster configuration, or invalid data, this method should raise an exception.

Returns: None

get_possible_job_config(job_type)

Returns hints used by the Sahara UI to prompt users for values when configuring and launching a job. Note that no hints are required.

See Elastic Data Processing (EDP) for more information on how configuration values, parameters, and arguments are used by different job types.

Returns: a dictionary of the following form, containing hints for configs, parameters, and arguments for the job type:

{‘job_config’: {‘configs’: [], ‘params’: {}, ‘args’: []}}

  • args is a list of strings
  • params contains simple key/value pairs
  • each item in configs is a dictionary with entries for ‘name’ (required), ‘value’, and ‘description’

get_supported_job_types()

This method returns the job types that the engine supports. Not all engines will support all job types.

Returns: a list of job types supported by the engine

Oozie Job Engine Interface

The sahara.service.edp.oozie.engine.OozieJobEngine class is derived from JobEngine. It provides implementations for all of the methods in the base interface but adds a few more abstract methods.

Note, the validate_job_execution(cluster, job, data) method does basic checks on the job configuration but probably should be overloaded to include additional checks on the cluster configuration. For example, the job engines for plugins that support Oozie add checks to make sure that the Oozie service is up and running.

get_hdfs_user()

Oozie uses HDFS to distribute job files. This method gives the name of the account that is used on the data nodes to access HDFS (such as ‘hadoop’ or ‘hdfs’). The Oozie job engine expects that HDFS contains a directory for this user under /user/

Returns: a string giving the username for the account used to access HDFS on the cluster.

create_hdfs_dir(remote, dir_name)

The remote object remote references a node in the cluster. This method creates the HDFS directory dir_name under the user specified by get_hdfs_user() in the HDFS accessible from the specified node. For example, if the HDFS user is ‘hadoop’ and the dir_name is ‘test’ this method would create ‘/user/hadoop/test’.

The reason that this method is broken out in the interface as an abstract method is that different versions of Hadoop treat path creation differently.

Returns: None

get_oozie_server_uri(cluster)

Returns the full URI for the Oozie server, for example http://my_oozie_host:11000/oozie. This URI is used by an Oozie client to send commands and queries to the Oozie server.

Returns: a string giving the Oozie server URI.

get_oozie_server(self, cluster)

Returns the node instance for the host in the cluster running the Oozie server

Returns: a node instance

get_name_node_uri(self, cluster)

Returns the full URI for the Hadoop NameNode, for example http://master_node:8020.

Returns: a string giving the NameNode URI.

get_resource_manager_uri(self, cluster)

Returns the full URI for the Hadoop JobTracker for Hadoop version 1 or the Hadoop ResourceManager for Hadoop version 2.

Returns: a string giving the JobTracker or ResourceManager URI

Spark Job Engine

The sahara.service.edp.spark.engine.SparkJobEngine class provides a full EDP implementation for Spark standalone clusters.

Note, the validate_job_execution(cluster, job, data) method does basic checks on the job configuration but probably should be overloaded to include additional checks on the cluster configuration. For example, the job engine returned by the Spark plugin checks that the Spark version is >= 1.0.0 to ensure that spark-submit is available.

get_driver_classpath(self)

Returns driver class path.

Returns: a string of the following format ‘ –driver-class-path class_path_value