Spark Temporary Job Data Retention and Cleanup¶
https://blueprints.launchpad.net/sahara/+spec/spark-cleanup
Creates a configurable cron job at cluster configuration time to clean up data from Spark jobs, in order to ease maintenance of long-lived clusters.
Problem description¶
The current Spark plugin stores data from any job run in the /tmp directory, without an expiration policy. While this is acceptable for short-lived clusters, it increases the maintenance burden on long-running clusters, which are likely to run out of space over time. A mechanism to automatically reclaim this space is needed.
Proposed change¶
On the creation of any new Spark cluster, a script (from sahara.plugins.spark/resources) will be templated with the following variables (defined in Spark's config_helper module, and therefore configurable per cluster):
- Minimum Cleanup Seconds
- Maximum Cleanup Seconds
- Minimum Cleanup Megabytes
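For illustration only, the parameters might be declared in config_helper roughly as follows. The Config signature is assumed to match the existing sahara.plugins.provisioning usage in the plugin, and the names, types, and default values shown here are placeholders rather than proposed defaults:

    # Hypothetical sketch: the Config signature, type names and defaults are
    # assumptions, not the final definitions.
    from sahara.plugins import provisioning as p

    MINIMUM_CLEANUP_SECONDS = p.Config(
        'Minimum cleanup seconds', 'Spark', 'cluster',
        config_type='int', default_value=86400, is_optional=True)

    MAXIMUM_CLEANUP_SECONDS = p.Config(
        'Maximum cleanup seconds', 'Spark', 'cluster',
        config_type='int', default_value=1209600, is_optional=True)

    MINIMUM_CLEANUP_MEGABYTES = p.Config(
        'Minimum cleanup megabytes', 'Spark', 'cluster',
        config_type='int', default_value=4096, is_optional=True)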
The templated script will then be pushed to /etc/hadoop/tmp_cleanup.sh. No script will be pushed in either of the following cases:
- Maximum Cleanup Seconds is 0 (or less)
- Minimum Cleanup Seconds and Minimum Cleanup Megabytes are both 0 (or less)
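A minimal sketch of that decision follows; the function name is hypothetical, and in practice the values would be read from the cluster configuration rather than passed in directly:

    # Hypothetical helper: decide whether the cleanup script should be pushed.
    def _need_cleanup_script(max_seconds, min_seconds, min_megabytes):
        if max_seconds <= 0:
            return False
        if min_seconds <= 0 and min_megabytes <= 0:
            return False
        return True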
Also at cluster configuration time, a cron job will be created to run this script once per hour.
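One possible shape for that configuration-time step is sketched below. It assumes the remote helper exposes write_file_to and execute_command as it does elsewhere in Sahara plugin code; the cron file path and schedule shown are illustrative:

    # Hypothetical sketch of pushing the script and the hourly cron entry.
    # The cron file location and the exact remote-helper usage are assumptions.
    from sahara.utils import remote

    def _push_cleanup_job(instance, script_content):
        with remote.get_remote(instance) as r:
            r.write_file_to('/etc/hadoop/tmp_cleanup.sh', script_content,
                            run_as_root=True)
            r.execute_command('chmod 755 /etc/hadoop/tmp_cleanup.sh',
                              run_as_root=True)
            # Run the cleanup script once per hour.
            r.execute_command(
                "echo '0 * * * * root /etc/hadoop/tmp_cleanup.sh' "
                "> /etc/cron.d/spark-cleanup",
                run_as_root=True)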
This script will iterate over each extant job directory on the cluster; if it finds one older than Maximum Cleanup Seconds, it will delete that directory. It will then check the size of the set of remaining directories. If there is more data than Minimum Cleanup Megabytes, then it will delete directories older than Minimum Cleanup Seconds, starting with the oldest, until the remaining data is smaller than Minimum Cleanup Megabytes or no sufficiently aged directories remain.
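The pushed artifact will be a templated shell script, but the intended logic can be sketched in Python as follows; the job directory root and helper names are assumptions made for this illustration:

    # Illustrative rendering of the cleanup logic described above; the real
    # artifact is a shell script run hourly from cron.
    import os
    import shutil
    import time

    def dir_megabytes(path):
        """Total size of a directory tree, in megabytes."""
        total = 0
        for root, _dirs, files in os.walk(path):
            for name in files:
                total += os.path.getsize(os.path.join(root, name))
        return total / (1024.0 * 1024.0)

    def cleanup(job_root, max_seconds, min_seconds, min_megabytes):
        now = time.time()
        job_dirs = [os.path.join(job_root, d) for d in os.listdir(job_root)
                    if os.path.isdir(os.path.join(job_root, d))]

        # Pass 1: delete any job directory older than the maximum age.
        remaining = []
        for d in job_dirs:
            if now - os.path.getmtime(d) > max_seconds:
                shutil.rmtree(d, ignore_errors=True)
            else:
                remaining.append(d)

        # Pass 2: if the surviving data still exceeds the size threshold,
        # delete directories older than the minimum age, oldest first, until
        # the total drops below the threshold or no eligible directory remains.
        sizes = dict((d, dir_megabytes(d)) for d in remaining)
        total_mb = sum(sizes.values())
        for d in sorted(remaining, key=os.path.getmtime):
            if total_mb <= min_megabytes:
                break
            if now - os.path.getmtime(d) > min_seconds:
                shutil.rmtree(d, ignore_errors=True)
                total_mb -= sizes[d]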
Alternatives¶
Any number of more complex schemes could be developed to address this problem, including per-job retention information, data priority assignment (effectively creating a priority queue for deletion), and others. The plan above, however, while it does allow individual cluster types to have individual retention policies, does not demand ongoing maintenance of, or interaction with, that policy after cluster creation, which will likely be appropriate for most users. A complex retention and archival strategy exceeds the intended scope of this convenience feature and could easily become an entire project.
Data model impact¶
None; all new data will be stored as cluster configuration.
REST API impact¶
None; the new Spark cluster template configuration parameters will be documented, and the current interface already allows this change.
Other end user impact¶
None.
Deployer impact¶
None.
Developer impact¶
None.
Sahara-image-elements impact¶
None.
Sahara-dashboard / Horizon impact¶
None. (config_helper.py variables will be automatically represented in Horizon.)
Implementation¶
Assignee(s)¶
- Primary assignee: egafford
Work Items¶
- Creation of periodic job
- Creation of deletion script
- Testing
Dependencies¶
None.
Testing¶
Because this feature is entirely Sahara-internal and requires only a remote shell connection to the Spark cluster (without which many, many other tests would fail), I believe that Tempest tests of this feature are unnecessary. Unit tests should be sufficient to cover this feature.
Documentation Impact¶
The variables used to set retention policy will need to be documented.