commit f166202b41705f17946c0184f456bbc322bde9a9 Author: Balazs Gibizer Date: Thu Sep 24 17:53:37 2020 +0200 Move implemented specs Change-Id: I95baa31ac4eb363e0bda779e24f3b338d175a061 diff --git a/specs/victoria/approved/add-emulated-virtual-tpm.rst b/specs/victoria/approved/add-emulated-virtual-tpm.rst deleted file mode 100644 index f0b2e2b..0000000 --- a/specs/victoria/approved/add-emulated-virtual-tpm.rst +++ /dev/null @@ -1,693 +0,0 @@ -.. - This work is licensed under a Creative Commons Attribution 3.0 Unported - License. - - http://creativecommons.org/licenses/by/3.0/legalcode - -============================================== -Add support for encrypted emulated virtual TPM -============================================== - -https://blueprints.launchpad.net/nova/+spec/add-emulated-virtual-tpm - -There are a class of applications which expect to use a TPM device to store -secrets. In order to run these applications in a virtual machine, it would be -useful to expose a virtual TPM device within the guest. Accordingly, the -suggestion is to add flavor/image properties which a) translate to placement -traits for scheduling and b) cause such a device to be added to the VM by the -relevant virt driver. - -Problem description -=================== - -Currently there is no way to create virtual machines within nova that provide -a virtual TPM device to the guest. - -Use Cases ---------- - -Support the virtualizing of existing applications and operating systems which -expect to make use of physical TPM devices. At least one hypervisor -(libvirt/qemu) currently supports the creation of an emulated TPM device which -is associated with a per-VM ``swtpm`` process on the host, but there is no way -to tell nova to enable it. - -Proposed change -=============== - -In recent libvirt and qemu (and possibly other hypervisors as well) there is -support for an emulated vTPM device. We propose to modify nova to make use -of this capability. - -This spec describes only the libvirt implementation. - -XML ---- - -The desired libvirt XML arguments are something like this (`source -`_):: - - ... - - - - - - - - ... - -Prerequisites -------------- - -Support for encrypted emulated TPM requires at least: - -* libvirt version 5.6.0 or greater. -* qemu 2.11 at a minimum, though qemu 2.12 is recommended. The virt driver code - should add suitable version checks (in the case of LibvirtDriver, this would - include checks for both libvirt and qemu). Currently emulated TPM is only - supported for x86, though this is an implementation detail rather than an - architectural limitation. -* The ``swtpm`` binary and libraries on the host. -* Access to a castellan-compatible key manager, such as barbican, for storing - the passphrase used to encrypt the virtual device's data. (The key manager - implementation's public methods must be capable of consuming the user's auth - token from the ``context`` parameter which is part of the interface.) -* Access to an object-store service, such as swift, for storing the file the - host uses for the virtual device data during operations such as shelve. - -Config ------- - -All of the following apply to the compute (not conductor/scheduler/API) -configs: - -* A new config option will be introduced to act as a "master switch" enabling - vTPM. 
This config option would apply to future drivers' implementations as - well, but since this spec and current implementation are specific to libvirt, - it is in the ``libvirt`` rather than the ``compute`` group:: - - [libvirt] - vtpm_enabled = $bool (default False) - -* To enable move operations (anything involving rebuilding a vTPM on a new - host), nova must be able to lay down the vTPM data with the correct ownership - -- that of the ``swtpm`` process libvirt will create -- but we can't detect - what that ownership will be. Thus we need a pair of config options on the - compute indicating the user and group that should own vTPM data on that - host:: - - [libvirt] - swtpm_user = $str (default 'tss') - swtpm_group = $str (default 'tss') - -* (Existing, known) options for ``[key_manager]``. - -* New standard keystoneauth1 auth/session/adapter options for ``[swift]`` will - be introduced. - -Traits, Extra Specs, Image Meta -------------------------------- - -In order to support this functionality we propose to: - -* Use the existing ``COMPUTE_SECURITY_TPM_1_2`` and - ``COMPUTE_SECURITY_TPM_2_0`` traits. These represent the two different - versions of the TPM spec that are currently supported. (Note that 2.0 is not - backward compatible with 1.2, so we can't just ignore 1.2. A summary of the - differences between the two versions is currently available here_.) When all - the Prerequisites_ have been met and the Config_ switch is on, the libvirt - compute driver will set both of these traits on the compute node resource - provider. -* Support the following new flavor extra_specs and their corresponding image - metadata properties (which are simply ``s/:/_/`` of the below): - - * ``hw:tpm_version={1.2|2.0}``. This will be: - - * translated to the corresponding - ``required=COMPUTE_SECURITY_TPM_{1_2|2_0}`` in the allocation candidate - request to ensure the instance lands on a host capable of vTPM at the - requested version - * used by the libvirt compute driver to inject the appropriate guest XML_. - - .. note:: Whereas it would be possible to specify - ``trait:COMPUTE_SECURITY_TPM_{1_2|2_0}=required`` directly in the - flavor extra_specs or image metadata, this would only serve to - land the instance on a capable host; it would not trigger the libvirt - driver to create the virtual TPM device. Therefore, to avoid - confusion, this will not be documented as a possibility. - - * ``hw:tpm_model={TIS|CRB}``. Indicates the emulated model to be used. If - omitted, the default is ``TIS`` (this corresponds to the libvirt default). - ``CRB`` is only compatible with TPM version 2.0; if ``CRB`` is requested - with version 1.2, an error will be raised from the API. - -To summarize, all and only the following combinations are supported, and are -mutually exclusive (none are inter-compatible): - -* Version 1.2, Model TIS -* Version 2.0, Model TIS -* Version 2.0, Model CRB - -Note that since the TPM is emulated (a process/file on the host), the -"inventory" is effectively unlimited. Thus there are no resource classes -associated with this feature. - -If both the flavor and the image specify a TPM trait or device model and the -two values do not match, an exception will be raised from the API by the -flavor/image validator. - -.. _here: https://en.wikipedia.org/wiki/Trusted_Platform_Module#TPM_1.2_vs_TPM_2.0 - -Instance Lifecycle Operations ------------------------------ - -Descriptions below are libvirt driver-specific. 
However, it is left to the -implementation which pieces are performed by the compute manager vs. the -libvirt ComputeDriver itself. - -.. note:: In deciding whether/how to support a given operation, we use "How - does this work on baremetal" as a starting point. If we can support a - VM operation without introducing inordinate complexity or user-facing - weirdness, we do. - -Spawn -~~~~~ - -#. Even though swift is not required for spawn, ensure a swift endpoint is - present in the service catalog (and reachable? version discovery? - implementation detail) so that a future unshelve doesn't break the instance. -#. Nova generates a random passphrase and stores it in the configured key - manager, yielding a UUID, hereinafter referred to as ``$secret_uuid``. -#. Nova saves the ``$secret_uuid`` in the instance's ``system_metadata`` under - key ``tpm_secret_uuid``. -#. Nova uses the ``virSecretDefineXML`` API to define a private (value can't be - listed), ephemeral (state is stored only in memory, never on disk) secret - whose ``name`` is the instance UUID, and whose UUID is the ``$secret_uuid``. - The ``virSecretSetValue`` API is then used to set its value to the generated - passphrase. We already provide a wrapper around this API at - ``nova.virt.libvirt.host.Host.create_secret`` for use with encrypted volumes - and will expand this to cover vTPM also. -#. Nova injects the XML_ into the instance's domain. The ``model`` and - ``version`` are gleaned from the flavor/image properties, and the ``secret`` - is ``$secret_uuid``. -#. Once libvirt has created the guest, nova uses the ``virSecretUndefine`` API - to delete the secret. The instance's emulated TPM continues to function. - -.. note:: Spawning from an image created by snapshotting a VM with a vTPM will - result in a fresh, empty vTPM, even if that snapshot was created by - ``shelve``. By contrast, `spawn during unshelve`_ will restore such - vTPM data. - -Cold Boot -~~~~~~~~~ - -...and any other operation that starts the guest afresh. (Depending on the `key -manager`_ security model, these may be restricted to the instance owner.) - -#. Pull the ``$secret_uuid`` from the ``tpm_secret_uuid`` of the instance's - ``system_metadata``. -#. Retrieve the passphrase associated with ``$secret_uuid`` via the configured - key manager API. - -Then perform steps 4-6 as described under Spawn_. - -Migrations and their ilk -~~~~~~~~~~~~~~~~~~~~~~~~ - -For the libvirt implementation, the emulated TPM data is stored in -``/var/lib/libvirt/swtpm/``. Certain lifecycle operations require -that directory to be copied verbatim to the "destination". For (cold/live) -migrations, only the user that nova-compute runs as is guaranteed to be able to -have SSH keys set up for passwordless access, and it's only guaranteed to be -able to copy files to the instance directory on the destination node. We -therefore propose the following procedure for relevant lifecycle operations: - -* Copy the directory into the local instance directory, changing the ownership - to match it. -* Perform the move, which will automatically carry the data along. -* Change ownership back and move the directory out to - ``/var/lib/libvirt/swtpm/`` on the destination. -* On confirm/revert, delete the directory from the source/destination, - respectively. (This is done automatically by libvirt when the guest is torn - down.) -* On revert, the data directory must be restored (with proper permissions) on - the source. 
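A minimal sketch of the copy-and-restore steps in the list above, expressed as
Python helpers (the helper names, the per-instance path layout, and the use of
``shutil``/``os`` directly rather than privsep-escalated calls are assumptions
for illustration, not the actual implementation):

.. code-block:: python

    import grp
    import os
    import pwd
    import shutil


    def stash_vtpm_data(swtpm_dir, instance_dir, nova_uid, nova_gid):
        """Copy the vTPM data into the instance directory for the move.

        Ownership is switched to the nova-compute user so the normal
        instance-directory transfer between hosts can read the files.
        """
        dest = os.path.join(instance_dir, 'swtpm')
        shutil.copytree(swtpm_dir, dest)
        os.chown(dest, nova_uid, nova_gid)
        for root, dirs, files in os.walk(dest):
            for name in dirs + files:
                os.chown(os.path.join(root, name), nova_uid, nova_gid)
        return dest


    def restore_vtpm_data(stashed_dir, swtpm_dir, swtpm_user, swtpm_group):
        """Move the data back out and re-own it for the swtpm process.

        ``swtpm_user``/``swtpm_group`` correspond to the proposed
        ``[libvirt]swtpm_user``/``[libvirt]swtpm_group`` options; the
        target directory is assumed not to exist yet on the destination.
        """
        uid = pwd.getpwnam(swtpm_user).pw_uid
        gid = grp.getgrnam(swtpm_group).gr_gid
        shutil.move(stashed_dir, swtpm_dir)
        os.chown(swtpm_dir, uid, gid)
        for root, dirs, files in os.walk(swtpm_dir):
            for name in dirs + files:
                os.chown(os.path.join(root, name), uid, gid)
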
- -Since the expected ownership on the target may be different than on the source, -and is (we think) impossible to detect, the admin must inform us of it via the -new ``[libvirt]swtpm_user`` and ``[libvirt]swtpm_group`` Config_ options if -different from the default of ``tss``. - -This should allow support of cold/live migration and resizes that don't change -the device. - -.. todo:: Confirm that the above "manual" copying around is actually necessary - for migration. It's unclear from reading - https://github.com/qemu/qemu/blob/6a5d22083d50c76a3fdc0bffc6658f42b3b37981/docs/specs/tpm.txt#L324-L383 - -Resize can potentially add a vTPM to an instance that didn't have one before, -or remove the vTPM from an instance that did have one, and those should "just -work". When resizing from one version/model to a different one the data can't -and won't carry over (for same-host resize, we must *remove* the old backing -file). If both old and new flavors have the same model/version, we must ensure -we convey the virtual device data as described above (for same-host resize, we -must *preserve* the existing backing file). - -Shelve (offload) and Unshelve -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Restoring vTPM data when unshelving a shelve-offloaded server requires the vTPM -data to be persisted somewhere. We can't put it with the image itself, as it's -data external to the instance disk. So we propose to put it in object-store -(swift) and maintain reference to the swift object in the instance's -``system_metadata``. - -The shelve operation needs to: - -#. Save the vTPM data directory to swift. -#. Save the swift object ID and digital signature (sha256) of the directory to - the instance's ``system_metadata`` under the (new) ``tpm_object_id`` and - ``tpm_object_sha256`` keys. -#. Create the appropriate ``hw_tpm_version`` and/or ``hw_tpm_model`` metadata - properties on the image. (This is to close the gap where the vTPM on - original VM was created at the behest of image, rather than flavor, - properties. It ensures the proper scheduling on unshelve, and that the - correct version/model is created on the target.) - -The unshelve operation on a shelved (but not offloaded) instance should "just -work" (except for deleting the swift object; see below). The code path for -unshelving an offloaded instance needs to: - -#. Ensure we land on a host capable of the necessary vTPM version and model - (we get this for free via the common scheduling code paths, because we did - step 3 during shelve). -#. Look for ``tpm_object_{id|sha256}`` and ``tpm_secret_uuid`` in the - instance's ``system_metadata``. -#. Download the swift object. Validate its checksum and fail if it doesn't - match. -#. Assign ownership of the data directory according to - ``[libvirt]swtpm_{user|group}`` on the host. -#. Retrieve the secret and feed it to libvirt; and generate the appropriate - domain XML (we get this for free via ``spawn()``). -#. Delete the object from swift, and the ``tpm_object_{id|sha256}`` from the - instance ``system_metadata``. This step must be done from both code paths - (i.e. whether the shelved instance was offloaded or not). - -.. note:: There are a couple of ways a user can still "outsmart" our checks and - make horrible things happen on unshelve. For example: - - * The flavor specifies no vTPM properties. - * The *original* image specified version 2.0. - * Between shelve and unshelve, edit the snapshot to specify version - 1.2. - - We will happily create a v1.2 vTPM and restore the (v2.0) data into - it. 
The VM will (probably) boot just fine, but unpredictable things - will happen when the vTPM is accessed. - - We can't prevent *all* stupidity. - -.. note:: As mentioned in `Security impact`_, if shelve is performed by the - admin, only the admin will be able to perform the corresponding - unshelve operation. And depending on the `key manager`_ security - model, if shelve is performed by the user, the admin may not be able - to perform the corresponding unshelve operation. - -Since the backing device data is virt driver-specific, it must be managed by -the virt driver; but we want the object-store interaction to be done by compute -manager. We therefore propose the following interplay between compute manager -and virt driver: - -The ``ComputeDriver.snapshot()`` contract currently does not specify a return -value. It will be changed to allow returning a file-like with the (prepackaged) -backing device data. The libvirt driver implementation will open a ``tar`` pipe -and return that handle. The compute manager is responsible for reading from -that handle and pushing the contents into the swift object. (Implementation -detail: we only do the swift thing for snapshots during shelve, so a) the virt -driver should not produce the handle except when the VM is in -``SHELVE[_OFFLOADED]`` state; and/or the compute manager should explicitly -close the handle from other invocations of ``snapshot()``.) - -.. _`spawn during unshelve`: - -The compute driver touchpoint for unshelving an offloaded instance is -``spawn()``. This method will get a new kwarg which is a file-like. If not -``None``, virt driver implementations are responsible for streaming from that -handle and reversing whatever was done during ``snapshot()`` (in this case un-\ -``tar``\ -ing). For the unshelve path for offloaded instances, the compute -manager will pull down the swift object and stream it to ``spawn()`` via this -kwarg. - -createImage and createBackup -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Because vTPM data is associated with the **instance**, not the **image**, the -``createImage`` and ``createBackup`` flows will not be changed. In particular, -they will not attempt to save the vTPM backing device to swift. - -This, along with the fact that fresh Spawn_ will not attempt to restore vTPM -data (even if given an image created via ``shelve``) also prevents "cloning" -of vTPMs. - -This is analogous to the baremetal case, where spawning from an image/backup on -a "clean" system would get you a "clean" (or no) TPM. - -Rebuild -~~~~~~~ - -Since the instance is staying on the same host, we have the ability to leave -the existing vTPM backing file intact. This is analogous to baremetal behavior, -where restoring a backup on an existing system will not touch the TPM (or any -other devices) so you get whatever's already there. However, it is also -possible to lock your instance out of its vTPM by rebuilding with a different -image, and/or one with different metadata. A certain amount of responsibility -is placed on the user to avoid scenarios like using the TPM to create a master -key and not saving that master key (in your rebuild image, or elsewhere). - -That said, rebuild will cover the following scenarios: - -* If there is no existing vTPM backing data, and the rebuild image asks for a - vTPM, create a fresh one, just like Spawn_. -* If there is an existing vTPM and neither the flavor nor the image asks for - one, delete it. -* If there is an existing vTPM and the flavor or image asks for one, leave the - backing file alone. 
However, if different versions/models are requested by - the old and new image in combination with the flavor, we will fail the - rebuild. - -Evacuate -~~~~~~~~ - -Because the vTPM data belongs to libvirt rather than being stored in the -instance disk, the vTPM is lost on evacuate, *even if the instance is -volume-backed*. This is analogous to baremetal behavior, where the (hardware) -TPM is left behind even if the rest of the state is resurrected on another -system via shared storage. - -(It may be possible to mitigate this by mounting ``/var/lib/libvirt/swtpm/`` on -shared storage, though libvirt's management of that directory on guest -creation/teardown may stymie such attempts. This would also bring in additional -security concerns. In any case, it would be an exercise for the admin; nothing -will be done in nova to support or prevent it.) - -Destroy -~~~~~~~ - -#. Delete the key manager secret associated with - ``system_metadata['tpm_secret_uuid']``. -#. libvirt deletes the vTPM data directory as part of guest teardown. -#. If ``system_metadata['tpm_object_id']`` exists, the *API side* will delete - the swift object it identifies. Since this metadata only exists while an - instance is shelved, this should only be applicable in corner cases like: - - * If the ``destroy()`` is performed between shelve and offload. - * Cleaning up a VM in ``ERROR`` state from a shelve, offload, or unshelve - that failed (at just the right time). - * Cleaning up a VM that is deleted while the host was down. - -Limitations ------------ - -This is a summary of odd or unexpected behaviors resulting from this design. - -* Except for migrations and shelve-offload, vTPM data sticks with the - instance+host. In particular: - - * vTPM data is lost on Evacuate_. - * vTPM data is not carried with "reusable snapshots" - (``createBackup``/``createImage``). - -* The ability of instance owners or admins to perform certain instance - lifecycle operations may be limited depending on the `security model - `_ used for the `key manager`_. -* Since secret management is done by the virt driver, deleting an - instance when the compute host is down can orphan its secret. If the host - comes back up, the secret will be reaped when compute invokes the virt - driver's ``destroy``. But if the host never comes back up, it would have to - be deleted manually. - -Alternatives ------------- - -* Rather than using a trait, we could instead use arbitrarily large inventories - of ``1_2``/``2_0`` resource classes. Unless it can be shown that there's an - actual limit we can discover, this just isn't how we do things. -* Instead of using specialized ``hw:tpm*`` extra_spec/image_meta properties, - implicitly configure based on the placement-ese syntax - (``resources:COMPUTE_SECURITY_TPM_*``). Rejected because we're trying to move - away from this way of doing things in general, preferring instead to support - syntax specific to the feature, rather than asking the admin to understand - how the feature maps to placement syntax. Also, whereas in some cases the - mapping may be straightforward, in other cases additional configuration is - required at the virt driver level that can't be inferred from the placement - syntax, which would require mixing and matching placement and non-placement - syntax. -* That being the case, forbid placement-ese syntax using - ``resources[$S]:COMPUTE_SECURITY_TPM_*``. 
Rejected mainly due to the - (unnecessary) additional complexity, and because we don't want to get in the - business of assuming there's no use case for "land me on a vTPM (in)capable - host, but don't set one up (yet)". -* Use physical passthrough (````) of a real - (hardware) TPM device. This is not feasible with current TPM hardware because - (among other things) changing ownership of the secrets requires a host - reboot. -* Block the operations that require object store. This is deemed nonviable, - particularly since cross-cell resize uses shelve under the covers. -* Use glance or the key manager instead of swift to store the vTPM data for - those operations. NACKed because those services really aren't intended for - that purpose, and (at least glance) may block such usages in the future. -* Save vTPM data on any snapshot operation (including ``createImage`` and - ``createBackup``). This adds complexity as well as some unintended behaviors, - such as the ability to "clone" vTPMs. Users will be less surprised when their - vTPM acts like a (hardware) TPM in these cases. -* Rather than checking for swift at spawn time, add an extra spec / image prop - like ``vtpm_I_promise_I_will_never_shelve_offload=True`` or - ``vtpm_is_totally_ephemeral=True`` which would either error or simply not - back up the vTPM, respectively, on shelve-offload. - -Data model impact ------------------ - -The ``ImageMetaProps`` and ``ImageMetaPropsPayload`` objects need new versions -adding: - -* ``hw_tpm_version`` -* ``hw_tpm_model`` -* ``tpm_object_id`` -* ``tpm_object_sha256`` - -REST API impact ---------------- - -The image/flavor validator will get new checks for consistency of properties. -No new microversion is needed. - -Security impact ---------------- - -The guest will be able to use the emulated TPM for all the security enhancing -functionality that a physical TPM provides, in order to protect itself against -attacks from within the guest. - -The `key manager`_ and `object store`_ services are assumed to be adequately -hardened against external attack. However, the deployment must consider the -issue of authorized access to these services, as discussed below. - -Data theft -~~~~~~~~~~ - -The vTPM data file is encrypted on disk, and is therefore "safe" (within the -bounds of encryption) from simple data theft. - -We will use a passphrase of 384 bytes, which is the default size of an SSH key, -generated from ``/dev/urandom``. It may be desirable to make this size -configurable in the future. - -Compromised root -~~~~~~~~~~~~~~~~ - -It is assumed that the root user on the compute node would be able to glean -(e.g. by inspecting memory) the vTPM's contents and/or the passphrase while -it's in flight. Beyond using private+ephemeral secrets in libvirt, no further -attempt is made to guard against a compromised root user. - -Object store -~~~~~~~~~~~~ - -The object store service allows full access to an object by the admin user, -regardless of who created the object. There is currently no facility for -restricting admins to e.g. only deleting objects. Thus, if a ``shelve`` has -been performed, the contents of the vTPM device will be available to the admin. -They are encrypted, so without access to the key, we are still trusting the -strength of the encryption to protect the data. However, this increases the -attack surface, assuming the object store admin is different from whoever has -access to the original file on the compute host. 
- -By the same token (heh) if ``shelve`` is performed by the admin, the vTPM data -object will be created and owned by the admin, and therefore only the admin -will be able to ``unshelve`` that instance. - -Key manager -~~~~~~~~~~~ - -The secret stored in the key manager is more delicate, since it can be used to -decrypt the contents of the vTPM device. The barbican implementation scopes -access to secrets at the project level, so the deployment must take care to -limit the project to users who should all be trusted with a common set of -secrets. Also note that project-scoped admins are by default allowed to access -and decrypt secrets owned by any project; if the admin is not to be trusted, -this should be restricted via policy. - -However, castellan backends are responsible for their own authentication -mechanisms. Thus, the deployment may wish to use a backend that scopes -decryption to only the individual user who created the secret. (In any case it -is important that admins be allowed to delete secrets so that operations such -as VM deletion can be performed by admins without leaving secrets behind.) - -Note that, if the admin is restricted from decrypting secrets, lifecycle -operations performed by the admin cannot result in a running VM. This includes -rebooting the host: even with `resume_guests_state_on_host_boot`_ set, an -instance with a vTPM will not boot automatically, and will instead have to be -powered on manually by its owner. Other lifecycle operations which are by -default admin-only will only work when performed by the VM owner, meaning the -owner must be given the appropriate policy roles to do so; otherwise these -operations will be in effect disabled. - -...except live migration, since the (already decrypted) running state of the -vTPM is carried along to the destination. (To clarify: live migration, unlike -other operations, would actually work if performed by the admin because of the -above.) - -.. _resume_guests_state_on_host_boot: https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.resume_guests_state_on_host_boot - -Notifications impact --------------------- - -None - -Other end user impact ---------------------- - -None - -Performance Impact ------------------- - -* An additional API call to the key manager is needed during spawn (to register - the passphrase), cold boot (to retrieve it), and destroy (to remove it). -* Additional API calls to libvirt are needed during spawn and other boot-like - operations to define, set the value, and undefine the vTPM's secret in - libvirt. -* Additional API calls to the object store (swift) are needed to create - (during shelve), retrieve (unshelve), and delete (unshelve/destroy) the vTPM - device data object. - -Other deployer impact ---------------------- - -None - -Developer impact ----------------- - -The various virt drivers would be able to implement the emulated vTPM as -desired. - -Upgrade impact --------------- - -None - - -Implementation -============== - -Assignee(s) ------------ - -Primary assignee: - stephenfin - -Other contributors: - cfriesen - efried - -Feature Liaison ---------------- - -stephenfin - -Work Items ----------- - -* API changes to prevalidate the flavor and image properties. -* Scheduler changes to translate flavor/image properties to placement-isms. -* Libvirt driver changes to - - * detect Prerequisites_ and Config_ and report traits to placement. - * communicate with the key manager API. - * manage libvirt secrets via the libvirt API. 
- * translate flavor/image properties to domain XML_. - * copy vTPM files on relevant `Instance Lifecycle Operations`_. - * communicate with object store to save/restore the vTPM files on (other) - relevant `Instance Lifecycle Operations`_. - -* Testing_ - -Dependencies -============ - -None - -Testing -======= - -Unit and functional testing will be added. New fixtures for object store and -key manager services will likely be necessary. - -Because of the eccentricities of a) user authentication for accessing the -encryption secret, and b) management of the virtual device files for some -operations, CI coverage will be added for: - -- Live migration -- Cold migration -- Host reboot (how?) -- Shelve (offload) and unshelve -- Backup and rebuild - -Documentation Impact -==================== - -Operations Guide and End User Guide will be updated appropriately. -Feature support matrix will be updated. - -References -========== - -* TPM on Wikipedia: https://en.wikipedia.org/wiki/Trusted_Platform_Module - -* ``swtpm``: https://github.com/stefanberger/swtpm/wiki - -* Qemu docs on tpm: - https://github.com/qemu/qemu/blob/master/docs/specs/tpm.txt - -* Libvirt XML to request emulated TPM device: - https://libvirt.org/formatdomain.html#elementsTpm - -History -======= - -.. list-table:: Revisions - :header-rows: 1 - - * - Release Name - - Description - * - Stein - - Introduced - * - Train - - Re-proposed - * - Ussuri - - Re-proposed with refinements including encryption pieces - * - Victoria - - Re-proposed diff --git a/specs/victoria/approved/nova-image-download-via-rbd.rst b/specs/victoria/approved/nova-image-download-via-rbd.rst deleted file mode 100644 index 146c216..0000000 --- a/specs/victoria/approved/nova-image-download-via-rbd.rst +++ /dev/null @@ -1,201 +0,0 @@ -.. - This work is licensed under a Creative Commons Attribution 3.0 Unported - License. - - http://creativecommons.org/licenses/by/3.0/legalcode - -===================================================== -Allow Nova to download Glance images directly via RBD -===================================================== - -https://blueprints.launchpad.net/nova/+spec/nova-image-download-via-rbd - - -Problem description -=================== - -When using compute-local storage with qcow2 based VM root disks, Glance images -are downloaded into the libvirt image store by way of the Glance HTTP API. -For images in the 10s-100s of GB, this download can be _very_ slow. -If the compute node has access to Ceph, it can instead perform an 'rbd export' -on the Glance image, bypassing the Glance API entirely and directly download -the image from Ceph. This direct download can result in a drastic reduction -in download time, from tens of minutes to tens of seconds. - -Use Cases ---------- - -As a user with a Ceph-backed image storage, I want to configure some compute -hosts for qcow2 images local to the compute host but quickly get the images -from Ceph rather than slow downloads from the Glance API. - -Proposed change -=============== - -A special download handler will be registered for Glance images when the 'rbd' -value is present in ``allowed_direct_url_schemes`` option. - -This download handler will be called only when a VM is scheduled on a node and -the required Glance image is not already present in the local libvirt image -cache. It will execute the OS native 'rbd export' command, using ``privsep``, -in order to perform the download operation instead of using the Glance HTTP -API. 
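For illustration, such a handler might look roughly like the sketch below; the
function name, the URL parsing, and the direct use of
``oslo_concurrency.processutils`` (rather than a dedicated privsep entry
point) are assumptions, not the final implementation:

.. code-block:: python

    from urllib import parse

    from oslo_concurrency import processutils


    def rbd_download(url, dest_path, ceph_conf, rbd_user):
        """Export a Glance image directly from Ceph to a local file.

        ``url`` is the image location advertised by Glance, e.g.
        ``rbd://<fsid>/<pool>/<image-id>/<snapshot>``.
        """
        parsed = parse.urlparse(url)
        pool, image, snapshot = parsed.path.lstrip('/').split('/')
        # In the real handler this command would be run via privsep.
        processutils.execute(
            'rbd', 'export',
            '--conf', ceph_conf,
            '--id', rbd_user,
            '%s/%s@%s' % (pool, image, snapshot),
            dest_path)
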
- -The mechanism for per-scheme download handlers was previously available -as a plugin point, which is now deprecated, along with the -allowed_direct_url_schemes config option. This effort will close out on that -deprecation by moving the per-scheme support into the nova.images.glance module -itself, undeprecating the allowed_direct_url_schemes config, and removing the -old nova.images.download plug point. - -The glance module also never used to perform image signature verification when -the per-scheme module was used. Since we are moving this into core code, -we will also fix this so that per-scheme images are verified like all the rest. - -Alternatives ------------- - -VM root disks can be run directly within Ceph as creation of these VM root -disks are fast as they are COW clones for the Glance image, also in Ceph. -However, running the VM root disks from Ceph introduces additional latency to -the running VM and needlessly wastes network bandwidth and Ceph IOPS. This -specific functionality was added in Mitaka but is aimed at a different use case -where the VM root disks remain in Ceph and are not run as qcow2 local disks. - -https://specs.openstack.org/openstack/nova-specs/specs/mitaka/implemented/rbd-instance-snapshots.html - -The other alternative is to continue with existing approach only. - -Data model impact ------------------ - -None - -REST API impact ---------------- - -None - -Security impact ---------------- - -None - -Notifications impact --------------------- - -None - -Other end user impact ---------------------- - -None - -Performance Impact ------------------- - -None - -Other deployer impact ---------------------- - -As proposed, there are no new configuration items, simply configuration of -existing items. - -The following configuration options are required to ensure qcow2 local images -are downloaded from Ceph and cached on the local compute host: - -On the Glance API node in glance-api.conf: - -``DEFAULT.show_image_direct_url=true`` - -On the Nova compute node in nova.conf: - -``DEFAULT.force_raw_images=false`` - -``libvirt.images_type=qcow2`` - -``libvirt.images_rbd_ceph_conf=`` - -``libvirt.rbd_user=`` - -``glance.allowed_direct_url_schemes = rbd`` - -Looking ahead, it may be desired to create additional entries in the libvirt -section of ``nova.conf`` for this feature as the current implementation assumes -that the ``rbd_user`` will have access to the Glance images. This may not be -the case depending upon how the Ceph pool permissions are configured. - -Developer impact ----------------- - -The ``allowed_direct_url_schemes`` option was deprecated in Queens. Proposed -implementation of this feature would halt the deprecation of this option and -we would need to "un-deprecate" it. - -Upgrade impact --------------- - -None - -Implementation -============== - -Assignee(s) ------------ - -Primary assignee: - Jiri Suchomel - -Feature Liaison ---------------- - -Feature liaison: - Dan Smith (danms) - -Work Items ----------- - -* Refactor existing in-house out-of-tree implementation and integrate it fully - into current codebase -* Write tests for implementation -* Update the admin guide with the description of how to set up the config if - the new feature is required. 
- -Dependencies -============ - -None - -Testing -======= - -* Unit tests -* Add an experimental on-demand queue job which uses Ceph with local qcow2 - images and 'direct from rbd' feature enabled - -Documentation Impact -==================== - -The admin guide should be updated to call out this use case and how it differs -from the Ceph-native snapshot feature. A good place to document this may be: - -https://docs.openstack.org/nova/latest/admin/configuration/hypervisor-kvm.html#configure-compute-backing-storage - -References -========== - -http://lists.openstack.org/pipermail/openstack-dev/2018-May/131002.html - -http://lists.openstack.org/pipermail/openstack-operators/2018-June/015384.html - -History -======= - -.. list-table:: Revisions - :header-rows: 1 - - * - Release Name - - Description - * - Victoria - - Introduced diff --git a/specs/victoria/approved/provider-config-file.rst b/specs/victoria/approved/provider-config-file.rst deleted file mode 100644 index db55ede..0000000 --- a/specs/victoria/approved/provider-config-file.rst +++ /dev/null @@ -1,379 +0,0 @@ -.. - This work is licensed under a Creative Commons Attribution 3.0 Unported - License. - - http://creativecommons.org/licenses/by/3.0/legalcode - -=========================== -Provider Configuration File -=========================== - -https://blueprints.launchpad.net/nova/+spec/provider-config-file - -This is a proposal to configure resource provider inventory and traits using a -standardized YAML file format. - -.. note:: This work is derived from `Jay's Rocky provider-config-file - proposal`_ and `Konstantinos's device-placement-model spec`_ (which - is derived from `Eric's device-passthrough spec`_), but differs in - several substantive ways. - -.. note:: This work is influenced by requirements to Nova to support non - native compute resources that are managed by Resource Management - Daemon for finer grain control. PTG discussion notes available at - `Resource Management Daemon_PTG Summary`_ - -.. note:: We currently limit the ownership and consumption of the provider - config YAML as described by the file format to Nova only. - -.. note:: The provider config will currently only accept placement overrides - to create and manage inventories and traits for resources not - natively managed by the Nova virt driver. - -.. note:: This is intended to define a) a file format for currently active use - cases, and b) Nova's consumption of such files. Subsequent features - can define the semantics by which the framework can be used by other - consumers or enhanced to satisfy particular use cases. - -Problem description -=================== -In order to facilitate the proper management of resource provider information -in the placement API by agents within Nova (such as virt drivers and the -PCI passthrough subsystem), we require a way of expressing various -overrides for resource provider information. While we could continue to use -many existing and new configuration options for expressing this information, -having a standardized, versioned provider descriptor file format allows us to -decouple the management of provider information from the configuration of the -service or daemon that manages those resource providers. - -Use Cases ---------- -Note that the file format/schema defined here is designed to accommodate the -following use cases. 
The file format/schema currently addresses a few use cases -that require changes to resource provider information as consumed by virt -drivers in Nova but it should allow options for extensions to be consumed -by Nova or other services as described in the problem statement in the future. - -Inventory Customization -~~~~~~~~~~~~~~~~~~~~~~~ - -**An operator would like to describe inventories for new platform features** - -These features could be experimental or not yet completely supported by Nova. -The expectation is that Nova can manage these inventories and help schedule -workloads requesting support for new platform features against their -capacities. For instance, to report ``CUSTOM_LLC`` (last-level cache) -inventories. - -The file defined by this spec must allow its author to: - -* Identify a provider unambiguously. -* Create and manage inventories for resource classes not natively managed by - Nova virt driver (``CUSTOM_LLC``, ``CUSTOM_MEMORY_BANDWIDTH`` etc.) - -Trait Customization -~~~~~~~~~~~~~~~~~~~ - -**An operator wishes to associate new custom traits with a provider.** - -These features could be experimental or not yet completely supported by Nova. -The expectation is that Nova can manage these traits and help schedule -workloads with support to new platform features against their traits. - -The file defined by this spec must allow its author to: - -* Identify a provider unambiguously. -* Specify arbitrary custom traits which are to be associated with the provider. - -Proposed change -=============== - -Provider Config File Schema ---------------------------- -A versioned YAML file format with a formal schema is proposed. The scope of -this spec is the schema, code to parse a file into a Python dict, code to -validate the dict against the schema, and code to merge the resulting dict with -the provider tree as processed by the resource tracker. - -The code shall be introduced into the ``openstack/nova`` project initially and -consumed by the resource tracker. Parts of it (such as the schema definition, -file loading, and validation) may be moved to a separate oslo-ish library in -the future if it can be standardized for consumption outside of Nova. - -The following is a simplified pseudo-schema for the file format. - -.. code-block:: yaml - - meta: - # Version ($Major, $minor) of the schema must successfully parse documents - # conforming to ($Major, *). I.e. additionalProperties must be allowed at - # all levels; but code at a lower $minor will ignore fields it does not - # recognize. Schema changes representing optional additions should bump - # $minor. Any breaking schema change (e.g. removing fields, adding new - # required fields, imposing a stricter pattern on a value, etc.) must bump - # $Major. The question of whether/how old versions will be deprecated or - # become unsupported is left for future consideration. - schema_version: $Major.$minor - - providers: - # List of dicts - # Identify a single provider to configure. - # Exactly one of uuid or name is mandatory. Specifying both is an error. - # The consuming nova-compute service will error and fail to start if the - # same value is used more than once across all provider configs for name - # or uuid. - # NOTE: Caution should be exercised when identifying ironic nodes, - # especially via the `$COMPUTE_NODE` special value. If an ironic node - # moves to a different compute host with a different provider config, its - # attributes will change accordingly. - - identification: - # Name or UUID of the provider. 
- # The uuid can be set to the specialized string `$COMPUTE_NODE` which - # will cause the consuming compute service to apply the configuration - # in this section to each node it manages unless that node is also - # identified by name or uuid. - uuid: ($uuid_pattern|"$COMPUTE_NODE") - # Name of the provider. - name: $string - # Customize provider inventories - inventories: - # This section allows the admin to specify various adjectives to - # create and manage providers' inventories. This list of adjectives - # can be extended in the future as the schema evolves to meet new - # use cases. For now, only one adjective, `additional`, is supported. - additional: - # The following inventories should be created on the identified - # provider. Only CUSTOM_* resource classes are permitted. - # Specifying inventory of a resource class natively managed by - # nova-compute will cause the compute service to fail. - $resource_class: - # `total` is required. Other optional fields not specified - # get defaults from the Placement service. - total: $int - reserved: $int - min_unit: $int - max_unit: $int - step_size: $int - allocation_ratio: $float - # Next inventory dict, keyed by resource class... - ... - # Customize provider traits. - traits: - # This section allows the admin to specify various adjectives to - # create and manage providers' traits. This list of adjectives - # can be extended in the future as the schema evolves to meet new - # use cases. For now, only one adjective, `additional`, is supported. - additional: - # The following traits are added on the identified provider. Only - # CUSTOM_* traits are permitted. The consuming code is - # responsible for ensuring the existence of these traits in - # Placement. - - $trait_pattern - - ... - # Next provider... - - identification: - ... - -Example -~~~~~~~ -.. note:: This section is intended to describe at a very high level how this - file format could be consumed to provide ``CUSTOM_LLC`` inventory - information. - -.. note:: This section is intended to describe at a very high level how this - file format could be consumed to provide P-state compute trait - information. - -.. code-block:: yaml - - meta: - schema_version: 1.0 - - providers: - # List of dicts - - identification: - uuid: $COMPUTE_NODE - inventories: - additional: - CUSTOM_LLC: - # Describing LLC on this compute node - # max_unit indicates maximum size of single LLC - # total indicates sum of sizes of all LLC - total: 22 - reserved: 2 - min_unit: 1 - max_unit: 11 - step_size: 1 - allocation_ratio: 1 - traits: - additional: - # Describing that this compute node enables support for - # P-state control - - CUSTOM_P_STATE_ENABLED - -Provider config consumption from Nova -------------------------------------- -Provider config processing will be performed by the nova-compute process as -described below. There are no changes to virt drivers. In particular, virt -drivers have no control over the loading, parsing, validation, or integration -of provider configs. Such control may be added in the future if warranted. - -Configuration - A new config option is introduced:: - - [compute] - # Directory of yaml files containing resource provider configuration. - # Default: /etc/nova/provider_config/ - # Files in this directory will be processed in lexicographic order. - provider_config_location = $directory - -Loading, Parsing, Validation - On nova-compute startup, files in ``CONF.compute.provider_config_location`` - are loaded and parsed by standard libraries (e.g. 
``yaml``), and - schema-validated (e.g. via ``jsonschema``). Schema validation failure or - multiple identifications of a node will cause nova-compute startup to fail. - Upon successful loading and validation, the resulting data structure is - stored in an instance attribute on the ResourceTracker. - -Provider Tree Merging - A generic (non-hypervisor/virt-specific) method will be written that merges - the provider config data into an existing ``ProviderTree`` data structure. - The method must detect conflicts whereby provider config data references - inventory of a resource class managed by the virt driver. Conflicts should - log a warning and cause the conflicting config inventory to be ignored. - The exact location and signature of this method, as well as how it detects - conflicts, is left to the implementation. In the event that a resource - provider is identified by both explicit UUID/NAME and $COMPUTE_NODE, only the - UUID/NAME record will be used. - -``_update_to_placement`` - In the ResourceTracker's ``_update_to_placement`` flow, the merging method is - invoked after ``update_provider_tree`` and automatic trait processing, *only* - in the ``update_provider_tree`` flow (not in the legacy ``get_inventory`` or - ``compute_node_to_inventory_dict`` flows). On startup (``startup == True``), - if the merge detects a conflict, the nova-compute service will fail. - -Alternatives ------------- -Ad hoc provider configuration is being performed today through an amalgam of -oslo.config options, more of which are being proposed or considered to deal -with VGPUs, NUMA, bandwidth resources, etc. The awkwardness of expressing -hierarchical data structures has led to such travesties as -``[pci]passthrough_whitelist`` and "dynamic config" mechanisms where config -groups and their options are created on the fly. YAML is natively suited for -this purpose as it is designed to express arbitrarily nested data structures -clearly, with minimal noisy punctuation. In addition, the schema is -self-documenting. - -Data model impact ------------------ -None - -REST API impact ---------------- -None - -Security impact ---------------- -Admins should ensure that provider config files have appropriate permissions -and ownership. Consuming services may wish to check this and generate an error -if a file is writable by anyone other than the process owner. - -Notifications impact --------------------- -None - -Other end user impact ---------------------- -None - -Performance Impact ------------------- -None - -Other deployer impact ---------------------- -An understanding of this file and its implications is only required when the -operator desires provider customization. The deployer should be aware of the -precedence of records with UUID/NAME identification over $COMPUTE_NODE. - -Developer impact ----------------- -Subsequent specs will be needed for services consuming this file format. - -Upgrade impact --------------- -None. (Consumers of this file format will need to address this - e.g. decide -how to deprecate existing config options which are being replaced). 
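For illustration only, the loading and validation flow described above (under
"Provider config consumption from Nova") might look roughly like the sketch
below; the helper name and the drastically reduced stand-in schema are
assumptions, not the formal schema to be written during implementation:

.. code-block:: python

    import glob
    import os

    import jsonschema
    import yaml

    # Stand-in for the formal schema; it only checks top-level structure.
    MINIMAL_SCHEMA = {
        'type': 'object',
        'properties': {
            'meta': {
                'type': 'object',
                'properties': {'schema_version': {'type': 'string'}},
                'required': ['schema_version'],
            },
            'providers': {'type': 'array'},
        },
        'required': ['meta', 'providers'],
    }


    def load_provider_configs(config_dir):
        """Load and validate every YAML file in lexicographic order."""
        configs = []
        for path in sorted(glob.glob(os.path.join(config_dir, '*.yaml'))):
            with open(path) as fh:
                data = yaml.safe_load(fh)
            # A validation failure causes nova-compute startup to fail.
            jsonschema.validate(data, MINIMAL_SCHEMA)
            configs.append(data)
        return configs
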
- -Implementation -============== - -Assignee(s) ------------ - -Primary assignee: - tony su - -Other contributors: - dustinc - efried dakshinai - -Feature Liaison ---------------- - -Feature liaison: - gibi - -Work Items ----------- - -* Construct a formal schema -* Implement parsing and schema validation -* Implement merging of config to provider tree -* Incorporate above into ResourceTracker -* Compose a self-documenting sample file - -Dependencies -============ -None - - -Testing -======= -* Schema validation will be unit tested. -* Functional and integration testing to move updates from provider config file - to Placement via Nova virt driver. - -Documentation Impact -==================== -* The formal schema file and a self-documenting sample file for provider - config file. -* Admin-facing documentation on guide to update the file and how Nova - processes the updates. -* User-facing documentation (including release notes). - -References -========== -.. _Jay's Rocky provider-config-file proposal: https://review.openstack.org/#/c/550244/2/specs/rocky/approved/provider-config-file.rst -.. _Konstantinos's device-placement-model spec: https://review.openstack.org/#/c/591037/8/specs/stein/approved/device-placement-model.rst -.. _Eric's device-passthrough spec: https://review.openstack.org/#/c/579359/10/doc/source/specs/rocky/device-passthrough.rst -.. _Resource Management Daemon_PTG Summary: http://lists.openstack.org/pipermail/openstack-discuss/2019-May/005809.html -.. _Handling UUID/NAME and $COMPUTE_NODE conflicts: http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2019-11-19.log.html#t2019-11-19T21:25:26 - -History -======= - -.. list-table:: Revisions - :header-rows: 1 - - * - Release Name - - Description - * - Stein - - Introduced - * - Train - - Re-proposed, simplified - * - Ussuri - - Re-proposed - * - Victoria - - Re-proposed diff --git a/specs/victoria/approved/rbd-glance-multistore.rst b/specs/victoria/approved/rbd-glance-multistore.rst deleted file mode 100644 index 3167c63..0000000 --- a/specs/victoria/approved/rbd-glance-multistore.rst +++ /dev/null @@ -1,266 +0,0 @@ -.. - This work is licensed under a Creative Commons Attribution 3.0 Unported - License. - - http://creativecommons.org/licenses/by/3.0/legalcode - -======================================================= -Libvirt RBD image backend support for glance multistore -======================================================= - -https://blueprints.launchpad.net/nova/+spec/rbd-glance-multistore - -Currently, Nova does not natively support a deployment where there are -multiple Ceph RBD backends that are known to glance. If there is only -one, Nova and Glance collaborate for fast-and-light image-to-VM -cloning behaviors. If there is more than one, Nova generally does not -handle the situation well, resulting in silent slow-and-heavy behavior -in the worst case, and a failed instance boot failsafe condition in -the best case. We can do better. - -Problem description -=================== - -There are certain situations where it is desirable to have multiple -independent Ceph clusters in a single openstack deployment. The most -common would be a multi-site or edge deployment where it is important -that the Ceph cluster is physically close to the compute nodes that it -serves. Glance already has the ability to address multiple ceph -clusters, but Nova is so naive about this that such a configuration -will result in highly undesirable behavior. 
- -Normally when Glance and Nova collaborate on a single Ceph deployment, -images are stored in Ceph by Glance when uploaded by the operator or -the user. When Nova starts to boot an instance, it asks Ceph to make a -Copy-on-Write clone of that image, which extremely fast and -efficient, resulting in not only reduced time to boot and lower -network traffic, but a shared base image across all compute nodes. - -If, on the other hand, you have two groups of compute nodes, each with -their own Ceph deployment, extreme care must be taken currently to -ensure that an image stored in one is not booted on a compute node -assigned to the other. Glance can represent that a single logical -image is stored in one or both of those Ceph stores and Nova looks at -this during instance boot. However, if the image is not in its local -Ceph cluster, it will quietly download the image from Glance and then -upload it to its local Ceph as a raw flat image each time an instance -from that image is booted. This results in more network traffic and -disk usage than is expected. We merged a workaround to make Nova -refuse to do this antithetical behavior, but it just causes a failed -instance boot. - -Use Cases ---------- - -- As an operator I want to be able to have a multi-site single Nova - deployment with one Ceph cluster per site and retain the - high-performance copy-on-write behavior that I get with a single - one. - -- As a power user which currently has to pre-copy images to a - remote-site ceph backend with glance before being able to boot an - instance, I want to not have to worry about such things and just - have Nova do that for me. - -Proposed change -=============== - -Glance can already represent that a single logical image is stored in -multiple locations. Recently, it gained an API to facilitate copying -images between backend stores. This means that an API consumer can -request that it copy an image from one store to another by doing an -"import" operation where the method is "copy-image". - -The change proposed in this spec is to augment the existing libvirt -RBD imagebackend code so that it can use this image copying API when -needed. Currently, we already look at all the image locations to find -which one matches our Ceph cluster, and then use that to do the -clone. After this spec is implemented, that code will still examine -all the *current* locations, and if none match, ask Glance to copy the -image to the appropriate backend store so we can continue without -failure or other undesirable behavior. - -In the case where we do need Glance to copy the image to our store, -Nova can monitor the progress of the operation through special image -properties that Glance maintains on the image. These indicate that the -process is in-progress (via ``os_glance_importing_to_stores``) and -also provide notice when an import has failed (via -``os_glance_failed_import``). Nova will need to poll the image, -waiting for the process to complete, and some configuration knobs will -be needed to allow for appropriate tuning. - -Alternatives ------------- - -One alternative is always to do nothing. This is enhanced behavior on -top of what we already support. We *could* just tell people not to use -multiple Ceph deployments or add further checks to make sure we do not -do something stupid if they do. 
- -We could teach nova about multiple RBD stores in a more comprehensive -way, which would basically require either pulling ceph information out -of Glance, or configuring Nova with all the same RBD backends that -Glance has. However, we would need to teach Nova about the topology -and configure it to not do stupid things like use a remote Ceph just -because the image is there. - -Data model impact ------------------ - -None. - -REST API impact ---------------- - -None. - -Security impact ---------------- - -Users can already use the image import mechanism in Glance, so Nova -using it on their behalf does not result in privilege escalation. - -Notifications impact --------------------- - -None. - -Other end user impact ---------------------- - -This removes the need for users to know details about the deployment -configuration and topology, as well as eliminates the need to manually -pre-place images in stores. - -Performance Impact ------------------- - -Image boot time will be impacted in the case when a copy needs to -happen, of course. Performance overall will be much better because -operators will be able to utilize more Ceph clusters if they wish, -and locate them closer to the compute nodes they serve. - -Other deployer impact ---------------------- - -Some additional configuration will be needed in order to make this -work. Specifically, Nova will need to know the Glance store name that -represents the RBD backend it is configured to use. Additionally, -there will be some timeout tunables related to how often we poll the -Glance server for status on the copy, as well as an overall timeout -for how long we are willing to wait. - -One other deployer consideration is that Glance requires an API setup -capable of doing background tasks in order to support the -``image_import`` API. That means ``mod_wsgi`` or similar, as ``uwsgi`` -does not provide reliable background task support. This is just a -Glance requirement, but worth noting here. - -Developer impact ----------------- - -The actual impact to the imagebackend code is not large as we are just -using a new mechanism in Glance's API to do the complex work of -copying images between backends. - -Upgrade impact --------------- - -In order to utilize this new functionality, at least Glance from -Ussuri will be required for a Victoria Nova. Individual -``nova-compute`` services can utilize this new functionality -immediately during a partial upgrade scenario so no minimum service -version checks are required. The control plane does not know which RBD -backend each compute node is connected to, and thus there is no need -for control-plane-level upgrade sensitivity to this feature. - - -Implementation -============== - -Assignee(s) ------------ -Primary assignee: - danms - -Feature Liaison ---------------- - -Feature liaison: - danms - -Work Items ----------- - -* Plumb the ``image_import`` function through the - ``nova.image.glance`` modules - -* Teach the libvirt RBD imagebackend module how to use the new API to - copy images to its own backend when necessary and appropriate. - -* Document the proper setup requirements for administrators - - -Dependencies -============ - -* Glance requirements are already landed and available - -Testing -======= - -* Unit testing, obviously. - -* Functional testing turns out to be quite difficult, as we stub out - massive amounts of the underlying image handling code underneath our - fake libvirt implementation. 
Adding functional tests for this would - require substantial refactoring of all that test infrastructure, - dwarfing the actual code in this change. - -* Devstack testing turns out to be relatively easy. I think we can get - a solid test of this feature on every run, by altering that job to: - - * Enable Glance and Nova multistore support. - * Enable Glance image conversion support, to auto-convert the default - QCOW Cirros image to raw when we upload it. - * Create two stores, one file-backed (like other jobs) and one - RBD-backed (like the current Ceph job). - * Default the Cirros upload to the file-backed store. - * The first use of the Cirros image in a tempest test will cause Nova - to ask Glance to copy the image from the file-backed store to the - RBD-backed store. Subsequent tests will see it as already in the - RBD store and proceed as normal. - - The real-world goal of this is to facilitate RBD-to-RBD backend - store copying, but from Nova's perspective file-to-RBD is an - identical process, so it's a good analog without having to - bootstrap two independent Ceph clusters in a devstack job. - -Documentation Impact -==================== - -This is largely admin-focused. Users that are currently aware of this -limitation already have admin-level knowledge if they are working -around it. Successful implementation will just eliminate the need to -care about multiple Ceph deployments going forward. Thus admin and -configuration documentation should be sufficient. - -References -========== - -* https://blueprints.launchpad.net/glance/+spec/copy-existing-image - -* https://docs.openstack.org/glance/latest/admin/interoperable-image-import.html - -* https://review.opendev.org/#/c/699656/8 - -History -======= - -.. list-table:: Revisions - :header-rows: 1 - - * - Release Name - - Description - * - Victoria - - Introduced diff --git a/specs/victoria/approved/sriov-interface-attach-detach.rst b/specs/victoria/approved/sriov-interface-attach-detach.rst deleted file mode 100644 index 25f4c13..0000000 --- a/specs/victoria/approved/sriov-interface-attach-detach.rst +++ /dev/null @@ -1,174 +0,0 @@ -.. - This work is licensed under a Creative Commons Attribution 3.0 Unported - License. - - http://creativecommons.org/licenses/by/3.0/legalcode - -========================================= -Support SRIOV interface attach and detach -========================================= - -https://blueprints.launchpad.net/nova/+spec/sriov-interface-attach-detach - -Nova supports booting servers with SRIOV interfaces. However, attaching and -detaching an SRIOV interface to an existing server is not supported as the PCI -device management is missing from the attach and detach code path. - - -Problem description -=================== - -SRIOV interfaces cannot be attached or detached from an existing nova server. - -Use Cases ---------- - -As an end user I need to connect my server to another neutron network via an -SRIOV interface to get high throughput connectivity to that network direction. - -As an end user I want to detach an existing SRIOV interface as I don't use that -network access anymore and I want to free up the scarce SRIOV resource. - -Proposed change -=============== - -In the compute manager, during interface attach, the compute needs to generate -``InstancePCIRequest`` for the requested port if the vnic_type of the port -indicates an SRIOV interface. Then run a PCI claim on the generated PCI request -to check if there is a free PCI device, claim it, and get a ``PciDevice`` -object. 
If this is successful then connect the PCI request to the -``RequestedNetwork`` object and call Neutron as today with that -``RequestedNetwork``. Then call the virt driver as of today. - -If the PCI claim fails then the interface attach instance action will fail but -the instance state will not be set to ERROR. - -During detach, we have to recover the PCI request from the VIF being destroyed -then from that, we can get the PCI device that we need to unclaim in the PCI -tracker. - -Note that detaching an SRIOV interface succeeds today from API user -perspective. However, the detached PCI device is not freed from resource -tracking and therefore leaked until the nova server is deleted or live -migrated. This issue will be gone when the current spec is implemented. Also -as a separate bugfix SRIOV detach will be blocked on stable branches to prevent -the resource leak. - -There is a separate issue with SRIOV PF detach due to the way the libvirt -domain XML is generated. While the fix for that is needed for the current spec, -it also needed for the existing SRIOV live migration feature because that also -detaches the SRIOV interfaces during the migration. So the SRIOV PF detach -issue will be fixed as an independent bugfix of the SRIOV live migration -feature and the implementation of this spec will depend on that bugfix. - -Alternatives ------------- - -None - -Data model impact ------------------ - -None - -REST API impact ---------------- - -None - -Security impact ---------------- - -None - -Notifications impact --------------------- - -None - -Other end user impact ---------------------- - -None - -Performance Impact ------------------- - -There will be an extra neutron call during interface attach as well as -additional DB operations. The ``interface_attach`` RPC method is synchronous -today, so this will be an end user visible change. - -Other deployer impact ---------------------- - -None - -Developer impact ----------------- - -None - -Upgrade impact --------------- - -None - -Implementation -============== - -Assignee(s) ------------ - - -Primary assignee: - balazs-gibizer - - -Feature Liaison ---------------- - -Feature liaison: - gibi - - -Work Items ----------- - -* change the attach and detach code path -* add unit and functional tests -* add documentation - - -Dependencies -============ - -None - - -Testing -======= - -Tempest test cannot be added since the upstream CI does not have SRIOV devices. -Functional tests with libvirt driver will be added instead. - - -Documentation Impact -==================== - -* remove the limitation from the API documentation - -References -========== - -None - -History -======= - -.. list-table:: Revisions - :header-rows: 1 - - * - Release Name - - Description - * - Victoria - - Introduced diff --git a/specs/victoria/approved/use-pcpu-vcpu-in-one-instance.rst b/specs/victoria/approved/use-pcpu-vcpu-in-one-instance.rst deleted file mode 100644 index d9c067f..0000000 --- a/specs/victoria/approved/use-pcpu-vcpu-in-one-instance.rst +++ /dev/null @@ -1,417 +0,0 @@ -.. - This work is licensed under a Creative Commons Attribution 3.0 Unported - License. 

 http://creativecommons.org/licenses/by/3.0/legalcode

=========================================
Use ``PCPU`` and ``VCPU`` in One Instance
=========================================

https://blueprints.launchpad.net/nova/+spec/use-pcpu-and-vcpu-in-one-instance

The spec `CPU resource tracking`_ splits host CPUs into ``PCPU`` and ``VCPU``
resources, making it possible to run instances with the ``dedicated`` CPU
allocation policy and instances with the ``shared`` CPU allocation policy on
the same host. This spec aims to allow creating an instance in which some of
the vCPUs are dedicated (``PCPU``) CPUs and the remaining vCPUs are shared
(``VCPU``) CPUs, and to expose this information via the metadata API.

Problem description
===================

The current CPU allocation policies, ``dedicated`` and ``shared``, are applied
to all vCPUs of an instance. However, with the introduction of
`CPU resource tracking`_, it is possible to propose a more fine-grained CPU
allocation policy that gives control over individual instance vCPUs,
specifying the ``dedicated`` or ``shared`` CPU allocation policy per instance
vCPU.

Use Cases
---------

As an operator, I would like to have an instance with some realtime CPUs for
high performance, and at the same time, in order to increase instance density,
I wish to make the remaining CPUs, which do not demand high performance,
shared with other instances, because I only care about the performance of the
realtime CPUs. One example is deploying an NFV workload enhanced with the DPDK
framework in the instance, in which the data plane threads are processed by
the realtime CPUs and the control plane tasks are scheduled on CPUs that may
be shared with other instances.

As a Kubernetes administrator, I wish to run a multi-tier or auto-scaling
application in Kubernetes, running in a single OpenStack VM, with the
expectation of using dedicated high-performance CPUs for the application
itself and deploying the containers on shared cores.

Proposed change
===============

Introduce a new CPU allocation policy ``mixed``
-----------------------------------------------

``dedicated`` and ``shared`` are the existing instance CPU allocation policies
that determine how instance CPUs are scheduled on host CPUs. This
specification proposes a new CPU allocation policy, named ``mixed``, to create
a CPU *mixed* instance in such a way that some instance vCPUs are allocated
from the compute node's ``PCPU`` resource and the rest of the instance vCPUs
are allocated from the ``VCPU`` resource. The CPUs allocated from the ``PCPU``
resource will be pinned to particular host CPUs defined in
``CONF.compute.cpu_dedicated_set``, and the CPUs from the ``VCPU`` resource
will float over the host CPUs defined in ``CONF.compute.cpu_shared_set``.
In this proposal, we call these two kinds of vCPUs *dedicated* vCPUs and
*shared* vCPUs respectively.

Instance CPU policy matrix
~~~~~~~~~~~~~~~~~~~~~~~~~~

A Nova operator may set the instance CPU allocation policy through the
``hw:cpu_policy`` flavor extra spec and the ``hw_cpu_policy`` image property,
which may conflict with each other.
The conflict is proposed to be resolved with the following policy matrix:

+---------------------------+-----------+-----------+-----------+-----------+
|                           |                 hw:cpu_policy                 |
+ INSTANCE CPU POLICY       +-----------+-----------+-----------+-----------+
|                           | DEDICATED | MIXED     | SHARED    | undefined |
+---------------+-----------+-----------+-----------+-----------+-----------+
| hw_cpu_policy | DEDICATED | dedicated | conflict  | conflict  | dedicated |
+               +-----------+-----------+-----------+-----------+-----------+
|               | MIXED     | dedicated | mixed     | conflict  | mixed     |
+               +-----------+-----------+-----------+-----------+-----------+
|               | SHARED    | dedicated | conflict  | shared    | shared    |
+               +-----------+-----------+-----------+-----------+-----------+
|               | undefined | dedicated | mixed     | shared    | undefined |
+---------------+-----------+-----------+-----------+-----------+-----------+

For example, if a ``dedicated`` CPU policy is specified in the instance flavor
through ``hw:cpu_policy``, then the instance CPU policy is ``dedicated``,
regardless of the setting specified in the image property ``hw_cpu_policy``.
If ``shared`` is explicitly set in ``hw:cpu_policy``, then a ``mixed`` policy
specified in ``hw_cpu_policy`` is a conflict; an exception will be thrown and
the instance boot request will be rejected.

If there is no explicit instance CPU policy specified in the flavor or image
property, the matrix result is 'undefined', and the final instance policy is
further determined and resolved by the ``resources:PCPU`` and
``resources:VCPU`` values specified in the flavor extra specs. Refer to
:ref:`section <mixed-instance-PCPU-VCPU>` and the spec
`CPU resource tracking`_.

Effect on real-time vCPUs
~~~~~~~~~~~~~~~~~~~~~~~~~

A real-time vCPU also occupies a host CPU exclusively and does not share that
CPU with other instances, so all real-time vCPUs are dedicated vCPUs. With
this proposal, for a *mixed* instance with some real-time vCPUs, the vCPUs not
in the instance real-time vCPU list are shared vCPUs.

Effect on emulator thread policy
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If the emulator thread policy is ``ISOLATE``, the *mixed* instance will look
for a *dedicated* host CPU for the instance emulator thread, which is very
similar to the case of a ``dedicated`` policy instance.

If the emulator thread policy is ``SHARE``, then the instance emulator thread
will float over the host CPUs defined in the configuration option
``CONF.compute.cpu_shared_set``.

Set dedicated CPU bit-mask in ``hw:cpu_dedicated_mask`` for ``mixed`` instance
------------------------------------------------------------------------------

As an interface to create a ``mixed`` policy instance through legacy flavor
extra specs or image properties, the flavor extra spec
``hw:cpu_dedicated_mask`` is introduced. If the extra spec
``hw:cpu_dedicated_mask`` is found in the instance flavor, the information
about the *dedicated* CPUs can be found by parsing
``hw:cpu_dedicated_mask``.

Here is an example of creating an instance with the ``mixed`` policy:

.. code::

    $ openstack flavor set \
        --property hw:cpu_policy=mixed \
        --property hw:cpu_dedicated_mask=0-3,7

And the following is the proposed command to create a *mixed* instance which
consists of multiple NUMA nodes, setting the *dedicated* vCPUs in
``hw:cpu_dedicated_mask``:

.. code::

    $ openstack flavor set \
        --property hw:cpu_policy=mixed \
        --property hw:cpu_dedicated_mask=2,7 \
        --property hw:numa_nodes=2 \
        --property hw:numa_cpus.0=0-2 \
        --property hw:numa_cpus.1=3-7 \
        --property hw:numa_mem.0=1024 \
        --property hw:numa_mem.1=2048

.. note::
    Please be aware that there is no equivalent setting in image properties
    for the flavor extra spec ``hw:cpu_dedicated_mask``. Creating a *mixed*
    instance through image properties will not be supported.

.. note::
    The dedicated vCPU list of a *mixed* instance can be specified either
    through the newly introduced dedicated CPU mask or through the real-time
    CPU mask (``hw:cpu_realtime_mask`` or ``hw_cpu_realtime_mask``); you
    cannot set both the dedicated CPU mask extra spec and the real-time CPU
    mask at the same time.

.. _mixed-instance-PCPU-VCPU:

Create *mixed* instance via ``resources:PCPU`` and ``resources:VCPU``
---------------------------------------------------------------------

`CPU resource tracking`_ introduced a way to create an instance with the
``dedicated`` or ``shared`` CPU allocation policy through the
``resources:PCPU`` and ``resources:VCPU`` interfaces, but did not allow
requesting both the ``PCPU`` resource and the ``VCPU`` resource for one
instance.

This specification proposes to let an instance request the ``PCPU`` resource
along with the ``VCPU`` resource, effectively applying the ``mixed`` CPU
allocation policy if ``cpu_policy`` is not explicitly specified in the flavor.
So an instance with a flavor like the following potentially creates a
``mixed`` policy instance:

.. code::

    $ openstack flavor set \
        --property "resources:PCPU"="" \
        --property "resources:VCPU"="" \

For a *mixed* instance created in this way, both the requested ``PCPU`` amount
and the requested ``VCPU`` amount must be greater than zero. Otherwise, it
effectively creates a ``dedicated`` or ``shared`` policy instance, in which
all vCPUs of the instance use the same allocation policy.

The ``resources:PCPU`` and ``resources:VCPU`` interfaces only tell the
placement service how many ``PCPU`` and ``VCPU`` resources are required to
fulfill the instance vCPU thread and emulator thread requirements. The
``PCPU`` and ``VCPU`` resources, especially for an instance with multiple NUMA
nodes, will be spread across the NUMA nodes in a round-robin way, with
``VCPU`` placed ahead of ``PCPU``. Here is one example, where the instance is
created with the flavor below::

    flavor:
      vcpus:8
      memory_mb=512
      extra_specs:
        hw:numa_nodes:2
        resources:VCPU=3
        resources:PCPU=5

The instance emulator thread policy is not specified in the flavor, so no
dedicated ``PCPU`` resource is reserved for it; all ``PCPU`` and ``VCPU``
resources will be used by vCPU threads, and the expected distribution over
the NUMA nodes is::

    NUMA node 0: VCPU VCPU PCPU PCPU
    NUMA node 1: VCPU PCPU PCPU PCPU

.. note::
    The demanded instance CPU count is the number of vCPUs, specified by
    ``flavor.vcpus``, plus the number of CPUs reserved for the emulator
    thread: if the emulator thread policy is ``ISOLATE``, the instance
    requests ``flavor.vcpus`` + 1 CPUs; if the policy is not ``ISOLATE``,
    the instance requests just ``flavor.vcpus`` CPUs.
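
To make the round-robin distribution described above concrete, the following
minimal sketch (not the actual Nova implementation; the helper name is
invented) reproduces the NUMA layout from the example:

.. code:: python

    # Illustrative sketch only: spread shared (VCPU) and dedicated (PCPU)
    # vCPUs over the instance NUMA cells round-robin, handing out VCPUs
    # before PCPUs, as described above.
    def distribute_cpus(numa_nodes, vcpu_count, pcpu_count):
        cells = [[] for _ in range(numa_nodes)]
        kinds = ['VCPU'] * vcpu_count + ['PCPU'] * pcpu_count
        for i, kind in enumerate(kinds):
            cells[i % numa_nodes].append(kind)
        return cells

    # The example flavor above: 2 NUMA nodes, resources:VCPU=3, resources:PCPU=5.
    print(distribute_cpus(2, 3, 5))
    # [['VCPU', 'VCPU', 'PCPU', 'PCPU'], ['VCPU', 'PCPU', 'PCPU', 'PCPU']]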

Alternatives
------------

Creating CPU mixed instance by extending the ``dedicated`` policy
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Instead of adding a special instance CPU allocation policy, the CPU mixed
instance is supported by extending the existing ``dedicated`` policy and
specifying the vCPUs that are pinned to the host CPUs chosen from the ``PCPU``
resource.

The following extra spec and image property are defined to keep the
*dedicated* vCPUs of a ``mixed`` policy instance::

    hw:cpu_dedicated_mask=
    hw_cpu_dedicated_mask=

The mask value shares the definition given above.

This was rejected as it overloads the ``dedicated`` policy to mean two things,
depending on the value of another configuration option.

Creating ``mixed`` instance with ``hw:cpu_policy`` and ``resources:(P|V)CPU``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following commands were proposed as an example of creating a *mixed*
instance by an explicit request for ``PCPU`` resources, inferring the ``VCPU``
count from ``flavor.vcpus`` and the ``PCPU`` count:

.. code::

    $ openstack flavor create mixed_vmf --vcpus 4 --ram 512 --disk 1
    $ openstack flavor set mixed_vmf \
        --property hw:cpu_policy=mixed \
        --property resources:PCPU=2

This was rejected due to the mixed use of ``hw:cpu_policy`` and
``resources:PCPU``. It is not recommended to mix placement-style syntax with
traditional extra specs.

Data model impact
-----------------

Add the ``pcpuset`` field to the ``InstanceNUMACell`` object to track the
dedicated vCPUs of the instance NUMA cell; the original
``InstanceNUMACell.cpuset`` is then reserved for shared vCPUs.

This change will introduce some database migration work for existing
instances with a ``dedicated`` CPU allocation policy, since all vCPUs in such
an instance are dedicated vCPUs which should be kept in the ``pcpuset`` field,
but they are historically stored in ``cpuset``.

REST API impact
---------------

The metadata API will be extended with the *dedicated* vCPU info, and a new
OpenStack metadata version will be added to indicate this is a new metadata
API.

A new field will be added to ``meta_data.json``::

    dedicated_cpus=

This field lists the *dedicated* vCPU set of the instance, which might be the
content of ``hw:cpu_dedicated_mask`` or ``hw:cpu_realtime_mask`` or
``hw_cpu_realtime_mask``, or the CPU list generated with the *round-robin*
policy as described in :ref:`section <mixed-instance-PCPU-VCPU>`.

The new CPU policy ``mixed`` is added to the extra spec ``hw:cpu_policy``.

Security impact
---------------

None

Notifications impact
--------------------

None

Other end user impact
---------------------

If the end user wants to create an instance with a ``mixed`` CPU allocation
policy, the user is required to set the corresponding flavor extra specs or
image properties.

Performance Impact
------------------

This proposal affects the selection of the instance CPU allocation policy, but
the performance impact is trivial.

Other deployer impact
---------------------

None

Developer impact
----------------

None

Upgrade impact
--------------

The ``mixed`` CPU policy is only available once the whole cluster upgrade has
finished. A service version will be bumped to detect the upgrade.
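
As a rough illustration, such a version gate is typically written along these
lines; the constant name and its value are placeholders rather than the real
ones:

.. code:: python

    # Hypothetical sketch of gating the new policy on a minimum nova-compute
    # service version across all cells.
    from nova.objects import service as service_obj

    MIXED_POLICY_MIN_COMPUTE_VERSION = 52  # placeholder value

    def mixed_policy_supported(context):
        min_version = service_obj.get_minimum_version_all_cells(
            context, ['nova-compute'])
        return min_version >= MIXED_POLICY_MIN_COMPUTE_VERSION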
- -The ``InstanceNUMACell.pcpuset`` is introduced for dedicated vCPUs and the -``InstanceNUMACell.cpuset`` is special for shared vCPUs, all existing -instances in a ``dedicated`` CPU allocation policy should be updated by moving -content in ``InstanceNUMACell.cpuset`` filed to -``InstanceNUMACell.pcpuset`` field. The underlying database keeping the -``InstanceNUACell`` object also need be updated to reflect this change. - -Implementation -============== - -Assignee(s) ------------ - -Primary assignee: - Wang, Huaqiang - -Feature Liaison ---------------- - -Feature liaison: - Stephen Finucane - -Work Items ----------- - -* Add a new field, the ``pcpuset``, for ``InstanceNUMACell`` for dedicated - vCPUs. -* Add new instance CPU allocation policy ``mixed`` property and resolve - conflicts -* Bump nova service version to indicate the new CPU policy in nova-compute -* Add flavor extra spec ``hw:cpu_dedicated_mask`` and create *mixed* instance -* Translate *dedicated* and *shared* CPU request to placement ``PCPU`` and - ``VCPU`` resources request. -* Change libvirt driver to create ``PCPU`` mapping and ``VCPU`` mapping -* Add nova metadata service by offering final pCPU layout in - ``dedicated_cpus`` field -* Validate real-time CPU mask for ``mixed`` instance. - -Dependencies -============ - -None - -Testing -======= - -Functional and unit tests are required to cover: - -* Ensure to solve the conflicts between the CPU policy matrix -* Ensure only *dedicated* vCPUs are possible to be real-time vCPUs -* Ensure creating ``mixed`` policy instance properly either by flavor - settings or by ``resources::PCPU=xx`` and ``resources::VCPU=xx`` settings. -* Ensure *shared* vCPUs is placed before the ``dedicated`` vCPUs -* Ensure the emulator CPU is properly scheduled according to its policy. - -Documentation Impact -==================== - -The documents should be changed to introduce the usage of new ``mixed`` CPU -allocation policy and the new flavor extra specs. - -Metadata service will be updated accordingly. - -References -========== - -* `CPU resource tracking`_ - -.. _CPU resource tracking: http://specs.openstack.org/openstack/nova-specs/specs/train/approved/cpu-resources.html - -History -======= - -.. list-table:: Revisions - :header-rows: 1 - - * - Release Name - - Description - * - Train - - Introduced, abandoned - * - Ussuri - - Approved - * - Victoria - - Re-proposed diff --git a/specs/victoria/approved/victoria-template.rst b/specs/victoria/approved/victoria-template.rst deleted file mode 120000 index c0440d1..0000000 --- a/specs/victoria/approved/victoria-template.rst +++ /dev/null @@ -1 +0,0 @@ -../../victoria-template.rst \ No newline at end of file diff --git a/specs/victoria/implemented/add-emulated-virtual-tpm.rst b/specs/victoria/implemented/add-emulated-virtual-tpm.rst new file mode 100644 index 0000000..f0b2e2b --- /dev/null +++ b/specs/victoria/implemented/add-emulated-virtual-tpm.rst @@ -0,0 +1,693 @@ +.. + This work is licensed under a Creative Commons Attribution 3.0 Unported + License. + + http://creativecommons.org/licenses/by/3.0/legalcode + +============================================== +Add support for encrypted emulated virtual TPM +============================================== + +https://blueprints.launchpad.net/nova/+spec/add-emulated-virtual-tpm + +There are a class of applications which expect to use a TPM device to store +secrets. In order to run these applications in a virtual machine, it would be +useful to expose a virtual TPM device within the guest. 
Accordingly, the +suggestion is to add flavor/image properties which a) translate to placement +traits for scheduling and b) cause such a device to be added to the VM by the +relevant virt driver. + +Problem description +=================== + +Currently there is no way to create virtual machines within nova that provide +a virtual TPM device to the guest. + +Use Cases +--------- + +Support the virtualizing of existing applications and operating systems which +expect to make use of physical TPM devices. At least one hypervisor +(libvirt/qemu) currently supports the creation of an emulated TPM device which +is associated with a per-VM ``swtpm`` process on the host, but there is no way +to tell nova to enable it. + +Proposed change +=============== + +In recent libvirt and qemu (and possibly other hypervisors as well) there is +support for an emulated vTPM device. We propose to modify nova to make use +of this capability. + +This spec describes only the libvirt implementation. + +XML +--- + +The desired libvirt XML arguments are something like this (`source +`_):: + + ... + + + + + + + + ... + +Prerequisites +------------- + +Support for encrypted emulated TPM requires at least: + +* libvirt version 5.6.0 or greater. +* qemu 2.11 at a minimum, though qemu 2.12 is recommended. The virt driver code + should add suitable version checks (in the case of LibvirtDriver, this would + include checks for both libvirt and qemu). Currently emulated TPM is only + supported for x86, though this is an implementation detail rather than an + architectural limitation. +* The ``swtpm`` binary and libraries on the host. +* Access to a castellan-compatible key manager, such as barbican, for storing + the passphrase used to encrypt the virtual device's data. (The key manager + implementation's public methods must be capable of consuming the user's auth + token from the ``context`` parameter which is part of the interface.) +* Access to an object-store service, such as swift, for storing the file the + host uses for the virtual device data during operations such as shelve. + +Config +------ + +All of the following apply to the compute (not conductor/scheduler/API) +configs: + +* A new config option will be introduced to act as a "master switch" enabling + vTPM. This config option would apply to future drivers' implementations as + well, but since this spec and current implementation are specific to libvirt, + it is in the ``libvirt`` rather than the ``compute`` group:: + + [libvirt] + vtpm_enabled = $bool (default False) + +* To enable move operations (anything involving rebuilding a vTPM on a new + host), nova must be able to lay down the vTPM data with the correct ownership + -- that of the ``swtpm`` process libvirt will create -- but we can't detect + what that ownership will be. Thus we need a pair of config options on the + compute indicating the user and group that should own vTPM data on that + host:: + + [libvirt] + swtpm_user = $str (default 'tss') + swtpm_group = $str (default 'tss') + +* (Existing, known) options for ``[key_manager]``. + +* New standard keystoneauth1 auth/session/adapter options for ``[swift]`` will + be introduced. + +Traits, Extra Specs, Image Meta +------------------------------- + +In order to support this functionality we propose to: + +* Use the existing ``COMPUTE_SECURITY_TPM_1_2`` and + ``COMPUTE_SECURITY_TPM_2_0`` traits. These represent the two different + versions of the TPM spec that are currently supported. 
(Note that 2.0 is not + backward compatible with 1.2, so we can't just ignore 1.2. A summary of the + differences between the two versions is currently available here_.) When all + the Prerequisites_ have been met and the Config_ switch is on, the libvirt + compute driver will set both of these traits on the compute node resource + provider. +* Support the following new flavor extra_specs and their corresponding image + metadata properties (which are simply ``s/:/_/`` of the below): + + * ``hw:tpm_version={1.2|2.0}``. This will be: + + * translated to the corresponding + ``required=COMPUTE_SECURITY_TPM_{1_2|2_0}`` in the allocation candidate + request to ensure the instance lands on a host capable of vTPM at the + requested version + * used by the libvirt compute driver to inject the appropriate guest XML_. + + .. note:: Whereas it would be possible to specify + ``trait:COMPUTE_SECURITY_TPM_{1_2|2_0}=required`` directly in the + flavor extra_specs or image metadata, this would only serve to + land the instance on a capable host; it would not trigger the libvirt + driver to create the virtual TPM device. Therefore, to avoid + confusion, this will not be documented as a possibility. + + * ``hw:tpm_model={TIS|CRB}``. Indicates the emulated model to be used. If + omitted, the default is ``TIS`` (this corresponds to the libvirt default). + ``CRB`` is only compatible with TPM version 2.0; if ``CRB`` is requested + with version 1.2, an error will be raised from the API. + +To summarize, all and only the following combinations are supported, and are +mutually exclusive (none are inter-compatible): + +* Version 1.2, Model TIS +* Version 2.0, Model TIS +* Version 2.0, Model CRB + +Note that since the TPM is emulated (a process/file on the host), the +"inventory" is effectively unlimited. Thus there are no resource classes +associated with this feature. + +If both the flavor and the image specify a TPM trait or device model and the +two values do not match, an exception will be raised from the API by the +flavor/image validator. + +.. _here: https://en.wikipedia.org/wiki/Trusted_Platform_Module#TPM_1.2_vs_TPM_2.0 + +Instance Lifecycle Operations +----------------------------- + +Descriptions below are libvirt driver-specific. However, it is left to the +implementation which pieces are performed by the compute manager vs. the +libvirt ComputeDriver itself. + +.. note:: In deciding whether/how to support a given operation, we use "How + does this work on baremetal" as a starting point. If we can support a + VM operation without introducing inordinate complexity or user-facing + weirdness, we do. + +Spawn +~~~~~ + +#. Even though swift is not required for spawn, ensure a swift endpoint is + present in the service catalog (and reachable? version discovery? + implementation detail) so that a future unshelve doesn't break the instance. +#. Nova generates a random passphrase and stores it in the configured key + manager, yielding a UUID, hereinafter referred to as ``$secret_uuid``. +#. Nova saves the ``$secret_uuid`` in the instance's ``system_metadata`` under + key ``tpm_secret_uuid``. +#. Nova uses the ``virSecretDefineXML`` API to define a private (value can't be + listed), ephemeral (state is stored only in memory, never on disk) secret + whose ``name`` is the instance UUID, and whose UUID is the ``$secret_uuid``. + The ``virSecretSetValue`` API is then used to set its value to the generated + passphrase. 
We already provide a wrapper around this API at + ``nova.virt.libvirt.host.Host.create_secret`` for use with encrypted volumes + and will expand this to cover vTPM also. +#. Nova injects the XML_ into the instance's domain. The ``model`` and + ``version`` are gleaned from the flavor/image properties, and the ``secret`` + is ``$secret_uuid``. +#. Once libvirt has created the guest, nova uses the ``virSecretUndefine`` API + to delete the secret. The instance's emulated TPM continues to function. + +.. note:: Spawning from an image created by snapshotting a VM with a vTPM will + result in a fresh, empty vTPM, even if that snapshot was created by + ``shelve``. By contrast, `spawn during unshelve`_ will restore such + vTPM data. + +Cold Boot +~~~~~~~~~ + +...and any other operation that starts the guest afresh. (Depending on the `key +manager`_ security model, these may be restricted to the instance owner.) + +#. Pull the ``$secret_uuid`` from the ``tpm_secret_uuid`` of the instance's + ``system_metadata``. +#. Retrieve the passphrase associated with ``$secret_uuid`` via the configured + key manager API. + +Then perform steps 4-6 as described under Spawn_. + +Migrations and their ilk +~~~~~~~~~~~~~~~~~~~~~~~~ + +For the libvirt implementation, the emulated TPM data is stored in +``/var/lib/libvirt/swtpm/``. Certain lifecycle operations require +that directory to be copied verbatim to the "destination". For (cold/live) +migrations, only the user that nova-compute runs as is guaranteed to be able to +have SSH keys set up for passwordless access, and it's only guaranteed to be +able to copy files to the instance directory on the destination node. We +therefore propose the following procedure for relevant lifecycle operations: + +* Copy the directory into the local instance directory, changing the ownership + to match it. +* Perform the move, which will automatically carry the data along. +* Change ownership back and move the directory out to + ``/var/lib/libvirt/swtpm/`` on the destination. +* On confirm/revert, delete the directory from the source/destination, + respectively. (This is done automatically by libvirt when the guest is torn + down.) +* On revert, the data directory must be restored (with proper permissions) on + the source. + +Since the expected ownership on the target may be different than on the source, +and is (we think) impossible to detect, the admin must inform us of it via the +new ``[libvirt]swtpm_user`` and ``[libvirt]swtpm_group`` Config_ options if +different from the default of ``tss``. + +This should allow support of cold/live migration and resizes that don't change +the device. + +.. todo:: Confirm that the above "manual" copying around is actually necessary + for migration. It's unclear from reading + https://github.com/qemu/qemu/blob/6a5d22083d50c76a3fdc0bffc6658f42b3b37981/docs/specs/tpm.txt#L324-L383 + +Resize can potentially add a vTPM to an instance that didn't have one before, +or remove the vTPM from an instance that did have one, and those should "just +work". When resizing from one version/model to a different one the data can't +and won't carry over (for same-host resize, we must *remove* the old backing +file). If both old and new flavors have the same model/version, we must ensure +we convey the virtual device data as described above (for same-host resize, we +must *preserve* the existing backing file). 
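
The copy-and-restore dance described above might look roughly like the
following sketch. It is illustrative only: the helper names are invented, the
per-instance state is assumed to live in a subdirectory of
``/var/lib/libvirt/swtpm/`` named after the instance UUID, and a real
implementation would go through nova's privsep utilities rather than calling
``os.chown`` directly.

.. code:: python

    # Illustrative sketch of staging vTPM state for a migration and restoring
    # it on the destination host; helper names are hypothetical.
    import os
    import shutil

    SWTPM_BASE = '/var/lib/libvirt/swtpm'

    def _chown_tree(path, uid, gid):
        os.chown(path, uid, gid)
        for root, dirs, files in os.walk(path):
            for name in dirs + files:
                os.chown(os.path.join(root, name), uid, gid)

    def stage_vtpm_state(instance_uuid, instance_dir, nova_uid, nova_gid):
        """Copy the swtpm state into the instance directory so the normal
        migration code can carry it to the destination host."""
        src = os.path.join(SWTPM_BASE, instance_uuid)
        dst = os.path.join(instance_dir, 'swtpm')
        shutil.copytree(src, dst)
        _chown_tree(dst, nova_uid, nova_gid)

    def restore_vtpm_state(instance_uuid, instance_dir, swtpm_uid, swtpm_gid):
        """Move the staged state back under libvirt's swtpm directory and
        hand it to the configured [libvirt]swtpm_user/swtpm_group."""
        src = os.path.join(instance_dir, 'swtpm')
        dst = os.path.join(SWTPM_BASE, instance_uuid)
        shutil.move(src, dst)
        _chown_tree(dst, swtpm_uid, swtpm_gid)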
+ +Shelve (offload) and Unshelve +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Restoring vTPM data when unshelving a shelve-offloaded server requires the vTPM +data to be persisted somewhere. We can't put it with the image itself, as it's +data external to the instance disk. So we propose to put it in object-store +(swift) and maintain reference to the swift object in the instance's +``system_metadata``. + +The shelve operation needs to: + +#. Save the vTPM data directory to swift. +#. Save the swift object ID and digital signature (sha256) of the directory to + the instance's ``system_metadata`` under the (new) ``tpm_object_id`` and + ``tpm_object_sha256`` keys. +#. Create the appropriate ``hw_tpm_version`` and/or ``hw_tpm_model`` metadata + properties on the image. (This is to close the gap where the vTPM on + original VM was created at the behest of image, rather than flavor, + properties. It ensures the proper scheduling on unshelve, and that the + correct version/model is created on the target.) + +The unshelve operation on a shelved (but not offloaded) instance should "just +work" (except for deleting the swift object; see below). The code path for +unshelving an offloaded instance needs to: + +#. Ensure we land on a host capable of the necessary vTPM version and model + (we get this for free via the common scheduling code paths, because we did + step 3 during shelve). +#. Look for ``tpm_object_{id|sha256}`` and ``tpm_secret_uuid`` in the + instance's ``system_metadata``. +#. Download the swift object. Validate its checksum and fail if it doesn't + match. +#. Assign ownership of the data directory according to + ``[libvirt]swtpm_{user|group}`` on the host. +#. Retrieve the secret and feed it to libvirt; and generate the appropriate + domain XML (we get this for free via ``spawn()``). +#. Delete the object from swift, and the ``tpm_object_{id|sha256}`` from the + instance ``system_metadata``. This step must be done from both code paths + (i.e. whether the shelved instance was offloaded or not). + +.. note:: There are a couple of ways a user can still "outsmart" our checks and + make horrible things happen on unshelve. For example: + + * The flavor specifies no vTPM properties. + * The *original* image specified version 2.0. + * Between shelve and unshelve, edit the snapshot to specify version + 1.2. + + We will happily create a v1.2 vTPM and restore the (v2.0) data into + it. The VM will (probably) boot just fine, but unpredictable things + will happen when the vTPM is accessed. + + We can't prevent *all* stupidity. + +.. note:: As mentioned in `Security impact`_, if shelve is performed by the + admin, only the admin will be able to perform the corresponding + unshelve operation. And depending on the `key manager`_ security + model, if shelve is performed by the user, the admin may not be able + to perform the corresponding unshelve operation. + +Since the backing device data is virt driver-specific, it must be managed by +the virt driver; but we want the object-store interaction to be done by compute +manager. We therefore propose the following interplay between compute manager +and virt driver: + +The ``ComputeDriver.snapshot()`` contract currently does not specify a return +value. It will be changed to allow returning a file-like with the (prepackaged) +backing device data. The libvirt driver implementation will open a ``tar`` pipe +and return that handle. The compute manager is responsible for reading from +that handle and pushing the contents into the swift object. 
(Implementation +detail: we only do the swift thing for snapshots during shelve, so a) the virt +driver should not produce the handle except when the VM is in +``SHELVE[_OFFLOADED]`` state; and/or the compute manager should explicitly +close the handle from other invocations of ``snapshot()``.) + +.. _`spawn during unshelve`: + +The compute driver touchpoint for unshelving an offloaded instance is +``spawn()``. This method will get a new kwarg which is a file-like. If not +``None``, virt driver implementations are responsible for streaming from that +handle and reversing whatever was done during ``snapshot()`` (in this case un-\ +``tar``\ -ing). For the unshelve path for offloaded instances, the compute +manager will pull down the swift object and stream it to ``spawn()`` via this +kwarg. + +createImage and createBackup +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Because vTPM data is associated with the **instance**, not the **image**, the +``createImage`` and ``createBackup`` flows will not be changed. In particular, +they will not attempt to save the vTPM backing device to swift. + +This, along with the fact that fresh Spawn_ will not attempt to restore vTPM +data (even if given an image created via ``shelve``) also prevents "cloning" +of vTPMs. + +This is analogous to the baremetal case, where spawning from an image/backup on +a "clean" system would get you a "clean" (or no) TPM. + +Rebuild +~~~~~~~ + +Since the instance is staying on the same host, we have the ability to leave +the existing vTPM backing file intact. This is analogous to baremetal behavior, +where restoring a backup on an existing system will not touch the TPM (or any +other devices) so you get whatever's already there. However, it is also +possible to lock your instance out of its vTPM by rebuilding with a different +image, and/or one with different metadata. A certain amount of responsibility +is placed on the user to avoid scenarios like using the TPM to create a master +key and not saving that master key (in your rebuild image, or elsewhere). + +That said, rebuild will cover the following scenarios: + +* If there is no existing vTPM backing data, and the rebuild image asks for a + vTPM, create a fresh one, just like Spawn_. +* If there is an existing vTPM and neither the flavor nor the image asks for + one, delete it. +* If there is an existing vTPM and the flavor or image asks for one, leave the + backing file alone. However, if different versions/models are requested by + the old and new image in combination with the flavor, we will fail the + rebuild. + +Evacuate +~~~~~~~~ + +Because the vTPM data belongs to libvirt rather than being stored in the +instance disk, the vTPM is lost on evacuate, *even if the instance is +volume-backed*. This is analogous to baremetal behavior, where the (hardware) +TPM is left behind even if the rest of the state is resurrected on another +system via shared storage. + +(It may be possible to mitigate this by mounting ``/var/lib/libvirt/swtpm/`` on +shared storage, though libvirt's management of that directory on guest +creation/teardown may stymie such attempts. This would also bring in additional +security concerns. In any case, it would be an exercise for the admin; nothing +will be done in nova to support or prevent it.) + +Destroy +~~~~~~~ + +#. Delete the key manager secret associated with + ``system_metadata['tpm_secret_uuid']``. +#. libvirt deletes the vTPM data directory as part of guest teardown. +#. 
If ``system_metadata['tpm_object_id']`` exists, the *API side* will delete + the swift object it identifies. Since this metadata only exists while an + instance is shelved, this should only be applicable in corner cases like: + + * If the ``destroy()`` is performed between shelve and offload. + * Cleaning up a VM in ``ERROR`` state from a shelve, offload, or unshelve + that failed (at just the right time). + * Cleaning up a VM that is deleted while the host was down. + +Limitations +----------- + +This is a summary of odd or unexpected behaviors resulting from this design. + +* Except for migrations and shelve-offload, vTPM data sticks with the + instance+host. In particular: + + * vTPM data is lost on Evacuate_. + * vTPM data is not carried with "reusable snapshots" + (``createBackup``/``createImage``). + +* The ability of instance owners or admins to perform certain instance + lifecycle operations may be limited depending on the `security model + `_ used for the `key manager`_. +* Since secret management is done by the virt driver, deleting an + instance when the compute host is down can orphan its secret. If the host + comes back up, the secret will be reaped when compute invokes the virt + driver's ``destroy``. But if the host never comes back up, it would have to + be deleted manually. + +Alternatives +------------ + +* Rather than using a trait, we could instead use arbitrarily large inventories + of ``1_2``/``2_0`` resource classes. Unless it can be shown that there's an + actual limit we can discover, this just isn't how we do things. +* Instead of using specialized ``hw:tpm*`` extra_spec/image_meta properties, + implicitly configure based on the placement-ese syntax + (``resources:COMPUTE_SECURITY_TPM_*``). Rejected because we're trying to move + away from this way of doing things in general, preferring instead to support + syntax specific to the feature, rather than asking the admin to understand + how the feature maps to placement syntax. Also, whereas in some cases the + mapping may be straightforward, in other cases additional configuration is + required at the virt driver level that can't be inferred from the placement + syntax, which would require mixing and matching placement and non-placement + syntax. +* That being the case, forbid placement-ese syntax using + ``resources[$S]:COMPUTE_SECURITY_TPM_*``. Rejected mainly due to the + (unnecessary) additional complexity, and because we don't want to get in the + business of assuming there's no use case for "land me on a vTPM (in)capable + host, but don't set one up (yet)". +* Use physical passthrough (````) of a real + (hardware) TPM device. This is not feasible with current TPM hardware because + (among other things) changing ownership of the secrets requires a host + reboot. +* Block the operations that require object store. This is deemed nonviable, + particularly since cross-cell resize uses shelve under the covers. +* Use glance or the key manager instead of swift to store the vTPM data for + those operations. NACKed because those services really aren't intended for + that purpose, and (at least glance) may block such usages in the future. +* Save vTPM data on any snapshot operation (including ``createImage`` and + ``createBackup``). This adds complexity as well as some unintended behaviors, + such as the ability to "clone" vTPMs. Users will be less surprised when their + vTPM acts like a (hardware) TPM in these cases. 
+* Rather than checking for swift at spawn time, add an extra spec / image prop + like ``vtpm_I_promise_I_will_never_shelve_offload=True`` or + ``vtpm_is_totally_ephemeral=True`` which would either error or simply not + back up the vTPM, respectively, on shelve-offload. + +Data model impact +----------------- + +The ``ImageMetaProps`` and ``ImageMetaPropsPayload`` objects need new versions +adding: + +* ``hw_tpm_version`` +* ``hw_tpm_model`` +* ``tpm_object_id`` +* ``tpm_object_sha256`` + +REST API impact +--------------- + +The image/flavor validator will get new checks for consistency of properties. +No new microversion is needed. + +Security impact +--------------- + +The guest will be able to use the emulated TPM for all the security enhancing +functionality that a physical TPM provides, in order to protect itself against +attacks from within the guest. + +The `key manager`_ and `object store`_ services are assumed to be adequately +hardened against external attack. However, the deployment must consider the +issue of authorized access to these services, as discussed below. + +Data theft +~~~~~~~~~~ + +The vTPM data file is encrypted on disk, and is therefore "safe" (within the +bounds of encryption) from simple data theft. + +We will use a passphrase of 384 bytes, which is the default size of an SSH key, +generated from ``/dev/urandom``. It may be desirable to make this size +configurable in the future. + +Compromised root +~~~~~~~~~~~~~~~~ + +It is assumed that the root user on the compute node would be able to glean +(e.g. by inspecting memory) the vTPM's contents and/or the passphrase while +it's in flight. Beyond using private+ephemeral secrets in libvirt, no further +attempt is made to guard against a compromised root user. + +Object store +~~~~~~~~~~~~ + +The object store service allows full access to an object by the admin user, +regardless of who created the object. There is currently no facility for +restricting admins to e.g. only deleting objects. Thus, if a ``shelve`` has +been performed, the contents of the vTPM device will be available to the admin. +They are encrypted, so without access to the key, we are still trusting the +strength of the encryption to protect the data. However, this increases the +attack surface, assuming the object store admin is different from whoever has +access to the original file on the compute host. + +By the same token (heh) if ``shelve`` is performed by the admin, the vTPM data +object will be created and owned by the admin, and therefore only the admin +will be able to ``unshelve`` that instance. + +Key manager +~~~~~~~~~~~ + +The secret stored in the key manager is more delicate, since it can be used to +decrypt the contents of the vTPM device. The barbican implementation scopes +access to secrets at the project level, so the deployment must take care to +limit the project to users who should all be trusted with a common set of +secrets. Also note that project-scoped admins are by default allowed to access +and decrypt secrets owned by any project; if the admin is not to be trusted, +this should be restricted via policy. + +However, castellan backends are responsible for their own authentication +mechanisms. Thus, the deployment may wish to use a backend that scopes +decryption to only the individual user who created the secret. (In any case it +is important that admins be allowed to delete secrets so that operations such +as VM deletion can be performed by admins without leaving secrets behind.) 
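
For concreteness, the passphrase generation and storage described under
Spawn_ and `Data theft`_ could look roughly like the following sketch; the
castellan calls reflect its generic key manager interface, but the helper name
and the exact wiring into the libvirt driver are assumptions.

.. code:: python

    # Illustrative sketch only: generate the 384-byte passphrase and store it
    # via castellan, yielding the $secret_uuid kept in system_metadata.
    import os

    from castellan.common.objects import passphrase
    from castellan import key_manager

    def create_vtpm_secret(context):
        secret_bytes = os.urandom(384)  # size as discussed under Data theft
        km = key_manager.API()
        secret_uuid = km.store(context, passphrase.Passphrase(secret_bytes))
        return secret_uuid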
+ +Note that, if the admin is restricted from decrypting secrets, lifecycle +operations performed by the admin cannot result in a running VM. This includes +rebooting the host: even with `resume_guests_state_on_host_boot`_ set, an +instance with a vTPM will not boot automatically, and will instead have to be +powered on manually by its owner. Other lifecycle operations which are by +default admin-only will only work when performed by the VM owner, meaning the +owner must be given the appropriate policy roles to do so; otherwise these +operations will be in effect disabled. + +...except live migration, since the (already decrypted) running state of the +vTPM is carried along to the destination. (To clarify: live migration, unlike +other operations, would actually work if performed by the admin because of the +above.) + +.. _resume_guests_state_on_host_boot: https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.resume_guests_state_on_host_boot + +Notifications impact +-------------------- + +None + +Other end user impact +--------------------- + +None + +Performance Impact +------------------ + +* An additional API call to the key manager is needed during spawn (to register + the passphrase), cold boot (to retrieve it), and destroy (to remove it). +* Additional API calls to libvirt are needed during spawn and other boot-like + operations to define, set the value, and undefine the vTPM's secret in + libvirt. +* Additional API calls to the object store (swift) are needed to create + (during shelve), retrieve (unshelve), and delete (unshelve/destroy) the vTPM + device data object. + +Other deployer impact +--------------------- + +None + +Developer impact +---------------- + +The various virt drivers would be able to implement the emulated vTPM as +desired. + +Upgrade impact +-------------- + +None + + +Implementation +============== + +Assignee(s) +----------- + +Primary assignee: + stephenfin + +Other contributors: + cfriesen + efried + +Feature Liaison +--------------- + +stephenfin + +Work Items +---------- + +* API changes to prevalidate the flavor and image properties. +* Scheduler changes to translate flavor/image properties to placement-isms. +* Libvirt driver changes to + + * detect Prerequisites_ and Config_ and report traits to placement. + * communicate with the key manager API. + * manage libvirt secrets via the libvirt API. + * translate flavor/image properties to domain XML_. + * copy vTPM files on relevant `Instance Lifecycle Operations`_. + * communicate with object store to save/restore the vTPM files on (other) + relevant `Instance Lifecycle Operations`_. + +* Testing_ + +Dependencies +============ + +None + +Testing +======= + +Unit and functional testing will be added. New fixtures for object store and +key manager services will likely be necessary. + +Because of the eccentricities of a) user authentication for accessing the +encryption secret, and b) management of the virtual device files for some +operations, CI coverage will be added for: + +- Live migration +- Cold migration +- Host reboot (how?) +- Shelve (offload) and unshelve +- Backup and rebuild + +Documentation Impact +==================== + +Operations Guide and End User Guide will be updated appropriately. +Feature support matrix will be updated. 
+ +References +========== + +* TPM on Wikipedia: https://en.wikipedia.org/wiki/Trusted_Platform_Module + +* ``swtpm``: https://github.com/stefanberger/swtpm/wiki + +* Qemu docs on tpm: + https://github.com/qemu/qemu/blob/master/docs/specs/tpm.txt + +* Libvirt XML to request emulated TPM device: + https://libvirt.org/formatdomain.html#elementsTpm + +History +======= + +.. list-table:: Revisions + :header-rows: 1 + + * - Release Name + - Description + * - Stein + - Introduced + * - Train + - Re-proposed + * - Ussuri + - Re-proposed with refinements including encryption pieces + * - Victoria + - Re-proposed diff --git a/specs/victoria/implemented/nova-image-download-via-rbd.rst b/specs/victoria/implemented/nova-image-download-via-rbd.rst new file mode 100644 index 0000000..146c216 --- /dev/null +++ b/specs/victoria/implemented/nova-image-download-via-rbd.rst @@ -0,0 +1,201 @@ +.. + This work is licensed under a Creative Commons Attribution 3.0 Unported + License. + + http://creativecommons.org/licenses/by/3.0/legalcode + +===================================================== +Allow Nova to download Glance images directly via RBD +===================================================== + +https://blueprints.launchpad.net/nova/+spec/nova-image-download-via-rbd + + +Problem description +=================== + +When using compute-local storage with qcow2 based VM root disks, Glance images +are downloaded into the libvirt image store by way of the Glance HTTP API. +For images in the 10s-100s of GB, this download can be _very_ slow. +If the compute node has access to Ceph, it can instead perform an 'rbd export' +on the Glance image, bypassing the Glance API entirely and directly download +the image from Ceph. This direct download can result in a drastic reduction +in download time, from tens of minutes to tens of seconds. + +Use Cases +--------- + +As a user with a Ceph-backed image storage, I want to configure some compute +hosts for qcow2 images local to the compute host but quickly get the images +from Ceph rather than slow downloads from the Glance API. + +Proposed change +=============== + +A special download handler will be registered for Glance images when the 'rbd' +value is present in ``allowed_direct_url_schemes`` option. + +This download handler will be called only when a VM is scheduled on a node and +the required Glance image is not already present in the local libvirt image +cache. It will execute the OS native 'rbd export' command, using ``privsep``, +in order to perform the download operation instead of using the Glance HTTP +API. + +The mechanism for per-scheme download handlers was previously available +as a plugin point, which is now deprecated, along with the +allowed_direct_url_schemes config option. This effort will close out on that +deprecation by moving the per-scheme support into the nova.images.glance module +itself, undeprecating the allowed_direct_url_schemes config, and removing the +old nova.images.download plug point. + +The glance module also never used to perform image signature verification when +the per-scheme module was used. Since we are moving this into core code, +we will also fix this so that per-scheme images are verified like all the rest. + +Alternatives +------------ + +VM root disks can be run directly within Ceph as creation of these VM root +disks are fast as they are COW clones for the Glance image, also in Ceph. 
+However, running the VM root disks from Ceph introduces additional latency to +the running VM and needlessly wastes network bandwidth and Ceph IOPS. This +specific functionality was added in Mitaka but is aimed at a different use case +where the VM root disks remain in Ceph and are not run as qcow2 local disks. + +https://specs.openstack.org/openstack/nova-specs/specs/mitaka/implemented/rbd-instance-snapshots.html + +The other alternative is to continue with existing approach only. + +Data model impact +----------------- + +None + +REST API impact +--------------- + +None + +Security impact +--------------- + +None + +Notifications impact +-------------------- + +None + +Other end user impact +--------------------- + +None + +Performance Impact +------------------ + +None + +Other deployer impact +--------------------- + +As proposed, there are no new configuration items, simply configuration of +existing items. + +The following configuration options are required to ensure qcow2 local images +are downloaded from Ceph and cached on the local compute host: + +On the Glance API node in glance-api.conf: + +``DEFAULT.show_image_direct_url=true`` + +On the Nova compute node in nova.conf: + +``DEFAULT.force_raw_images=false`` + +``libvirt.images_type=qcow2`` + +``libvirt.images_rbd_ceph_conf=`` + +``libvirt.rbd_user=`` + +``glance.allowed_direct_url_schemes = rbd`` + +Looking ahead, it may be desired to create additional entries in the libvirt +section of ``nova.conf`` for this feature as the current implementation assumes +that the ``rbd_user`` will have access to the Glance images. This may not be +the case depending upon how the Ceph pool permissions are configured. + +Developer impact +---------------- + +The ``allowed_direct_url_schemes`` option was deprecated in Queens. Proposed +implementation of this feature would halt the deprecation of this option and +we would need to "un-deprecate" it. + +Upgrade impact +-------------- + +None + +Implementation +============== + +Assignee(s) +----------- + +Primary assignee: + Jiri Suchomel + +Feature Liaison +--------------- + +Feature liaison: + Dan Smith (danms) + +Work Items +---------- + +* Refactor existing in-house out-of-tree implementation and integrate it fully + into current codebase +* Write tests for implementation +* Update the admin guide with the description of how to set up the config if + the new feature is required. + +Dependencies +============ + +None + +Testing +======= + +* Unit tests +* Add an experimental on-demand queue job which uses Ceph with local qcow2 + images and 'direct from rbd' feature enabled + +Documentation Impact +==================== + +The admin guide should be updated to call out this use case and how it differs +from the Ceph-native snapshot feature. A good place to document this may be: + +https://docs.openstack.org/nova/latest/admin/configuration/hypervisor-kvm.html#configure-compute-backing-storage + +References +========== + +http://lists.openstack.org/pipermail/openstack-dev/2018-May/131002.html + +http://lists.openstack.org/pipermail/openstack-operators/2018-June/015384.html + +History +======= + +.. list-table:: Revisions + :header-rows: 1 + + * - Release Name + - Description + * - Victoria + - Introduced diff --git a/specs/victoria/implemented/provider-config-file.rst b/specs/victoria/implemented/provider-config-file.rst new file mode 100644 index 0000000..db55ede --- /dev/null +++ b/specs/victoria/implemented/provider-config-file.rst @@ -0,0 +1,379 @@ +.. 
+ This work is licensed under a Creative Commons Attribution 3.0 Unported + License. + + http://creativecommons.org/licenses/by/3.0/legalcode + +=========================== +Provider Configuration File +=========================== + +https://blueprints.launchpad.net/nova/+spec/provider-config-file + +This is a proposal to configure resource provider inventory and traits using a +standardized YAML file format. + +.. note:: This work is derived from `Jay's Rocky provider-config-file + proposal`_ and `Konstantinos's device-placement-model spec`_ (which + is derived from `Eric's device-passthrough spec`_), but differs in + several substantive ways. + +.. note:: This work is influenced by requirements to Nova to support non + native compute resources that are managed by Resource Management + Daemon for finer grain control. PTG discussion notes available at + `Resource Management Daemon_PTG Summary`_ + +.. note:: We currently limit the ownership and consumption of the provider + config YAML as described by the file format to Nova only. + +.. note:: The provider config will currently only accept placement overrides + to create and manage inventories and traits for resources not + natively managed by the Nova virt driver. + +.. note:: This is intended to define a) a file format for currently active use + cases, and b) Nova's consumption of such files. Subsequent features + can define the semantics by which the framework can be used by other + consumers or enhanced to satisfy particular use cases. + +Problem description +=================== +In order to facilitate the proper management of resource provider information +in the placement API by agents within Nova (such as virt drivers and the +PCI passthrough subsystem), we require a way of expressing various +overrides for resource provider information. While we could continue to use +many existing and new configuration options for expressing this information, +having a standardized, versioned provider descriptor file format allows us to +decouple the management of provider information from the configuration of the +service or daemon that manages those resource providers. + +Use Cases +--------- +Note that the file format/schema defined here is designed to accommodate the +following use cases. The file format/schema currently addresses a few use cases +that require changes to resource provider information as consumed by virt +drivers in Nova but it should allow options for extensions to be consumed +by Nova or other services as described in the problem statement in the future. + +Inventory Customization +~~~~~~~~~~~~~~~~~~~~~~~ + +**An operator would like to describe inventories for new platform features** + +These features could be experimental or not yet completely supported by Nova. +The expectation is that Nova can manage these inventories and help schedule +workloads requesting support for new platform features against their +capacities. For instance, to report ``CUSTOM_LLC`` (last-level cache) +inventories. + +The file defined by this spec must allow its author to: + +* Identify a provider unambiguously. +* Create and manage inventories for resource classes not natively managed by + Nova virt driver (``CUSTOM_LLC``, ``CUSTOM_MEMORY_BANDWIDTH`` etc.) + +Trait Customization +~~~~~~~~~~~~~~~~~~~ + +**An operator wishes to associate new custom traits with a provider.** + +These features could be experimental or not yet completely supported by Nova. 
+The expectation is that Nova can manage these traits and help schedule +workloads with support to new platform features against their traits. + +The file defined by this spec must allow its author to: + +* Identify a provider unambiguously. +* Specify arbitrary custom traits which are to be associated with the provider. + +Proposed change +=============== + +Provider Config File Schema +--------------------------- +A versioned YAML file format with a formal schema is proposed. The scope of +this spec is the schema, code to parse a file into a Python dict, code to +validate the dict against the schema, and code to merge the resulting dict with +the provider tree as processed by the resource tracker. + +The code shall be introduced into the ``openstack/nova`` project initially and +consumed by the resource tracker. Parts of it (such as the schema definition, +file loading, and validation) may be moved to a separate oslo-ish library in +the future if it can be standardized for consumption outside of Nova. + +The following is a simplified pseudo-schema for the file format. + +.. code-block:: yaml + + meta: + # Version ($Major, $minor) of the schema must successfully parse documents + # conforming to ($Major, *). I.e. additionalProperties must be allowed at + # all levels; but code at a lower $minor will ignore fields it does not + # recognize. Schema changes representing optional additions should bump + # $minor. Any breaking schema change (e.g. removing fields, adding new + # required fields, imposing a stricter pattern on a value, etc.) must bump + # $Major. The question of whether/how old versions will be deprecated or + # become unsupported is left for future consideration. + schema_version: $Major.$minor + + providers: + # List of dicts + # Identify a single provider to configure. + # Exactly one of uuid or name is mandatory. Specifying both is an error. + # The consuming nova-compute service will error and fail to start if the + # same value is used more than once across all provider configs for name + # or uuid. + # NOTE: Caution should be exercised when identifying ironic nodes, + # especially via the `$COMPUTE_NODE` special value. If an ironic node + # moves to a different compute host with a different provider config, its + # attributes will change accordingly. + - identification: + # Name or UUID of the provider. + # The uuid can be set to the specialized string `$COMPUTE_NODE` which + # will cause the consuming compute service to apply the configuration + # in this section to each node it manages unless that node is also + # identified by name or uuid. + uuid: ($uuid_pattern|"$COMPUTE_NODE") + # Name of the provider. + name: $string + # Customize provider inventories + inventories: + # This section allows the admin to specify various adjectives to + # create and manage providers' inventories. This list of adjectives + # can be extended in the future as the schema evolves to meet new + # use cases. For now, only one adjective, `additional`, is supported. + additional: + # The following inventories should be created on the identified + # provider. Only CUSTOM_* resource classes are permitted. + # Specifying inventory of a resource class natively managed by + # nova-compute will cause the compute service to fail. + $resource_class: + # `total` is required. Other optional fields not specified + # get defaults from the Placement service. 
+ total: $int + reserved: $int + min_unit: $int + max_unit: $int + step_size: $int + allocation_ratio: $float + # Next inventory dict, keyed by resource class... + ... + # Customize provider traits. + traits: + # This section allows the admin to specify various adjectives to + # create and manage providers' traits. This list of adjectives + # can be extended in the future as the schema evolves to meet new + # use cases. For now, only one adjective, `additional`, is supported. + additional: + # The following traits are added on the identified provider. Only + # CUSTOM_* traits are permitted. The consuming code is + # responsible for ensuring the existence of these traits in + # Placement. + - $trait_pattern + - ... + # Next provider... + - identification: + ... + +Example +~~~~~~~ +.. note:: This section is intended to describe at a very high level how this + file format could be consumed to provide ``CUSTOM_LLC`` inventory + information. + +.. note:: This section is intended to describe at a very high level how this + file format could be consumed to provide P-state compute trait + information. + +.. code-block:: yaml + + meta: + schema_version: 1.0 + + providers: + # List of dicts + - identification: + uuid: $COMPUTE_NODE + inventories: + additional: + CUSTOM_LLC: + # Describing LLC on this compute node + # max_unit indicates maximum size of single LLC + # total indicates sum of sizes of all LLC + total: 22 + reserved: 2 + min_unit: 1 + max_unit: 11 + step_size: 1 + allocation_ratio: 1 + traits: + additional: + # Describing that this compute node enables support for + # P-state control + - CUSTOM_P_STATE_ENABLED + +Provider config consumption from Nova +------------------------------------- +Provider config processing will be performed by the nova-compute process as +described below. There are no changes to virt drivers. In particular, virt +drivers have no control over the loading, parsing, validation, or integration +of provider configs. Such control may be added in the future if warranted. + +Configuration + A new config option is introduced:: + + [compute] + # Directory of yaml files containing resource provider configuration. + # Default: /etc/nova/provider_config/ + # Files in this directory will be processed in lexicographic order. + provider_config_location = $directory + +Loading, Parsing, Validation + On nova-compute startup, files in ``CONF.compute.provider_config_location`` + are loaded and parsed by standard libraries (e.g. ``yaml``), and + schema-validated (e.g. via ``jsonschema``). Schema validation failure or + multiple identifications of a node will cause nova-compute startup to fail. + Upon successful loading and validation, the resulting data structure is + stored in an instance attribute on the ResourceTracker. + +Provider Tree Merging + A generic (non-hypervisor/virt-specific) method will be written that merges + the provider config data into an existing ``ProviderTree`` data structure. + The method must detect conflicts whereby provider config data references + inventory of a resource class managed by the virt driver. Conflicts should + log a warning and cause the conflicting config inventory to be ignored. + The exact location and signature of this method, as well as how it detects + conflicts, is left to the implementation. In the event that a resource + provider is identified by both explicit UUID/NAME and $COMPUTE_NODE, only the + UUID/NAME record will be used. 
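+
+A minimal sketch of the load/validate/merge flow described above; the schema
+fragment, helper names, and conflict handling below are illustrative
+assumptions only, not the final implementation (the startup failure case is
+covered by the ``_update_to_placement`` item that follows):
+
+.. code-block:: python
+
+    import glob
+    import logging
+    import os
+
+    import jsonschema
+    import yaml
+
+    LOG = logging.getLogger(__name__)
+
+    # Hypothetical, trimmed-down schema; writing the real one is a work item.
+    SCHEMA_V1 = {
+        'type': 'object',
+        'properties': {
+            'meta': {
+                'type': 'object',
+                'properties': {'schema_version': {'type': 'string'}},
+                'required': ['schema_version'],
+            },
+            'providers': {'type': 'array'},
+        },
+        'required': ['meta', 'providers'],
+    }
+
+    def load_provider_configs(config_dir):
+        """Load and schema-validate every YAML file, in lexicographic order."""
+        configs = []
+        for path in sorted(glob.glob(os.path.join(config_dir, '*.yaml'))):
+            with open(path) as fh:
+                data = yaml.safe_load(fh)
+            # A validation error here must abort nova-compute startup.
+            jsonschema.validate(data, SCHEMA_V1)
+            configs.append(data)
+        return configs
+
+    def merge_additional_inventory(tree_inventory, additional, startup=False):
+        """Merge 'additional' config inventory into a provider's inventory.
+
+        Resource classes already managed by the virt driver are conflicts:
+        warn and ignore them, or fail outright when detected at startup.
+        """
+        for rc, inv in additional.items():
+            if rc in tree_inventory:
+                if startup:
+                    raise RuntimeError(
+                        'provider config conflicts with virt driver: %s' % rc)
+                LOG.warning('Ignoring provider config inventory for %s: the '
+                            'resource class is managed by the virt driver', rc)
+                continue
+            tree_inventory[rc] = inv
+        return tree_inventory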
+ +``_update_to_placement`` + In the ResourceTracker's ``_update_to_placement`` flow, the merging method is + invoked after ``update_provider_tree`` and automatic trait processing, *only* + in the ``update_provider_tree`` flow (not in the legacy ``get_inventory`` or + ``compute_node_to_inventory_dict`` flows). On startup (``startup == True``), + if the merge detects a conflict, the nova-compute service will fail. + +Alternatives +------------ +Ad hoc provider configuration is being performed today through an amalgam of +oslo.config options, more of which are being proposed or considered to deal +with VGPUs, NUMA, bandwidth resources, etc. The awkwardness of expressing +hierarchical data structures has led to such travesties as +``[pci]passthrough_whitelist`` and "dynamic config" mechanisms where config +groups and their options are created on the fly. YAML is natively suited for +this purpose as it is designed to express arbitrarily nested data structures +clearly, with minimal noisy punctuation. In addition, the schema is +self-documenting. + +Data model impact +----------------- +None + +REST API impact +--------------- +None + +Security impact +--------------- +Admins should ensure that provider config files have appropriate permissions +and ownership. Consuming services may wish to check this and generate an error +if a file is writable by anyone other than the process owner. + +Notifications impact +-------------------- +None + +Other end user impact +--------------------- +None + +Performance Impact +------------------ +None + +Other deployer impact +--------------------- +An understanding of this file and its implications is only required when the +operator desires provider customization. The deployer should be aware of the +precedence of records with UUID/NAME identification over $COMPUTE_NODE. + +Developer impact +---------------- +Subsequent specs will be needed for services consuming this file format. + +Upgrade impact +-------------- +None. (Consumers of this file format will need to address this - e.g. decide +how to deprecate existing config options which are being replaced). + +Implementation +============== + +Assignee(s) +----------- + +Primary assignee: + tony su + +Other contributors: + dustinc + efried dakshinai + +Feature Liaison +--------------- + +Feature liaison: + gibi + +Work Items +---------- + +* Construct a formal schema +* Implement parsing and schema validation +* Implement merging of config to provider tree +* Incorporate above into ResourceTracker +* Compose a self-documenting sample file + +Dependencies +============ +None + + +Testing +======= +* Schema validation will be unit tested. +* Functional and integration testing to move updates from provider config file + to Placement via Nova virt driver. + +Documentation Impact +==================== +* The formal schema file and a self-documenting sample file for provider + config file. +* Admin-facing documentation on guide to update the file and how Nova + processes the updates. +* User-facing documentation (including release notes). + +References +========== +.. _Jay's Rocky provider-config-file proposal: https://review.openstack.org/#/c/550244/2/specs/rocky/approved/provider-config-file.rst +.. _Konstantinos's device-placement-model spec: https://review.openstack.org/#/c/591037/8/specs/stein/approved/device-placement-model.rst +.. _Eric's device-passthrough spec: https://review.openstack.org/#/c/579359/10/doc/source/specs/rocky/device-passthrough.rst +.. 
_Resource Management Daemon_PTG Summary: http://lists.openstack.org/pipermail/openstack-discuss/2019-May/005809.html +.. _Handling UUID/NAME and $COMPUTE_NODE conflicts: http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2019-11-19.log.html#t2019-11-19T21:25:26 + +History +======= + +.. list-table:: Revisions + :header-rows: 1 + + * - Release Name + - Description + * - Stein + - Introduced + * - Train + - Re-proposed, simplified + * - Ussuri + - Re-proposed + * - Victoria + - Re-proposed diff --git a/specs/victoria/implemented/rbd-glance-multistore.rst b/specs/victoria/implemented/rbd-glance-multistore.rst new file mode 100644 index 0000000..3167c63 --- /dev/null +++ b/specs/victoria/implemented/rbd-glance-multistore.rst @@ -0,0 +1,266 @@ +.. + This work is licensed under a Creative Commons Attribution 3.0 Unported + License. + + http://creativecommons.org/licenses/by/3.0/legalcode + +======================================================= +Libvirt RBD image backend support for glance multistore +======================================================= + +https://blueprints.launchpad.net/nova/+spec/rbd-glance-multistore + +Currently, Nova does not natively support a deployment where there are +multiple Ceph RBD backends that are known to glance. If there is only +one, Nova and Glance collaborate for fast-and-light image-to-VM +cloning behaviors. If there is more than one, Nova generally does not +handle the situation well, resulting in silent slow-and-heavy behavior +in the worst case, and a failed instance boot failsafe condition in +the best case. We can do better. + +Problem description +=================== + +There are certain situations where it is desirable to have multiple +independent Ceph clusters in a single openstack deployment. The most +common would be a multi-site or edge deployment where it is important +that the Ceph cluster is physically close to the compute nodes that it +serves. Glance already has the ability to address multiple ceph +clusters, but Nova is so naive about this that such a configuration +will result in highly undesirable behavior. + +Normally when Glance and Nova collaborate on a single Ceph deployment, +images are stored in Ceph by Glance when uploaded by the operator or +the user. When Nova starts to boot an instance, it asks Ceph to make a +Copy-on-Write clone of that image, which extremely fast and +efficient, resulting in not only reduced time to boot and lower +network traffic, but a shared base image across all compute nodes. + +If, on the other hand, you have two groups of compute nodes, each with +their own Ceph deployment, extreme care must be taken currently to +ensure that an image stored in one is not booted on a compute node +assigned to the other. Glance can represent that a single logical +image is stored in one or both of those Ceph stores and Nova looks at +this during instance boot. However, if the image is not in its local +Ceph cluster, it will quietly download the image from Glance and then +upload it to its local Ceph as a raw flat image each time an instance +from that image is booted. This results in more network traffic and +disk usage than is expected. We merged a workaround to make Nova +refuse to do this antithetical behavior, but it just causes a failed +instance boot. + +Use Cases +--------- + +- As an operator I want to be able to have a multi-site single Nova + deployment with one Ceph cluster per site and retain the + high-performance copy-on-write behavior that I get with a single + one. 
+ +- As a power user which currently has to pre-copy images to a + remote-site ceph backend with glance before being able to boot an + instance, I want to not have to worry about such things and just + have Nova do that for me. + +Proposed change +=============== + +Glance can already represent that a single logical image is stored in +multiple locations. Recently, it gained an API to facilitate copying +images between backend stores. This means that an API consumer can +request that it copy an image from one store to another by doing an +"import" operation where the method is "copy-image". + +The change proposed in this spec is to augment the existing libvirt +RBD imagebackend code so that it can use this image copying API when +needed. Currently, we already look at all the image locations to find +which one matches our Ceph cluster, and then use that to do the +clone. After this spec is implemented, that code will still examine +all the *current* locations, and if none match, ask Glance to copy the +image to the appropriate backend store so we can continue without +failure or other undesirable behavior. + +In the case where we do need Glance to copy the image to our store, +Nova can monitor the progress of the operation through special image +properties that Glance maintains on the image. These indicate that the +process is in-progress (via ``os_glance_importing_to_stores``) and +also provide notice when an import has failed (via +``os_glance_failed_import``). Nova will need to poll the image, +waiting for the process to complete, and some configuration knobs will +be needed to allow for appropriate tuning. + +Alternatives +------------ + +One alternative is always to do nothing. This is enhanced behavior on +top of what we already support. We *could* just tell people not to use +multiple Ceph deployments or add further checks to make sure we do not +do something stupid if they do. + +We could teach nova about multiple RBD stores in a more comprehensive +way, which would basically require either pulling ceph information out +of Glance, or configuring Nova with all the same RBD backends that +Glance has. However, we would need to teach Nova about the topology +and configure it to not do stupid things like use a remote Ceph just +because the image is there. + +Data model impact +----------------- + +None. + +REST API impact +--------------- + +None. + +Security impact +--------------- + +Users can already use the image import mechanism in Glance, so Nova +using it on their behalf does not result in privilege escalation. + +Notifications impact +-------------------- + +None. + +Other end user impact +--------------------- + +This removes the need for users to know details about the deployment +configuration and topology, as well as eliminates the need to manually +pre-place images in stores. + +Performance Impact +------------------ + +Image boot time will be impacted in the case when a copy needs to +happen, of course. Performance overall will be much better because +operators will be able to utilize more Ceph clusters if they wish, +and locate them closer to the compute nodes they serve. + +Other deployer impact +--------------------- + +Some additional configuration will be needed in order to make this +work. Specifically, Nova will need to know the Glance store name that +represents the RBD backend it is configured to use. 
Additionally, +there will be some timeout tunables related to how often we poll the +Glance server for status on the copy, as well as an overall timeout +for how long we are willing to wait. + +One other deployer consideration is that Glance requires an API setup +capable of doing background tasks in order to support the +``image_import`` API. That means ``mod_wsgi`` or similar, as ``uwsgi`` +does not provide reliable background task support. This is just a +Glance requirement, but worth noting here. + +Developer impact +---------------- + +The actual impact to the imagebackend code is not large as we are just +using a new mechanism in Glance's API to do the complex work of +copying images between backends. + +Upgrade impact +-------------- + +In order to utilize this new functionality, at least Glance from +Ussuri will be required for a Victoria Nova. Individual +``nova-compute`` services can utilize this new functionality +immediately during a partial upgrade scenario so no minimum service +version checks are required. The control plane does not know which RBD +backend each compute node is connected to, and thus there is no need +for control-plane-level upgrade sensitivity to this feature. + + +Implementation +============== + +Assignee(s) +----------- +Primary assignee: + danms + +Feature Liaison +--------------- + +Feature liaison: + danms + +Work Items +---------- + +* Plumb the ``image_import`` function through the + ``nova.image.glance`` modules + +* Teach the libvirt RBD imagebackend module how to use the new API to + copy images to its own backend when necessary and appropriate. + +* Document the proper setup requirements for administrators + + +Dependencies +============ + +* Glance requirements are already landed and available + +Testing +======= + +* Unit testing, obviously. + +* Functional testing turns out to be quite difficult, as we stub out + massive amounts of the underlying image handling code underneath our + fake libvirt implementation. Adding functional tests for this would + require substantial refactoring of all that test infrastructure, + dwarfing the actual code in this change. + +* Devstack testing turns out to be relatively easy. I think we can get + a solid test of this feature on every run, by altering that job to: + + * Enable Glance and Nova multistore support. + * Enable Glance image conversion support, to auto-convert the default + QCOW Cirros image to raw when we upload it. + * Create two stores, one file-backed (like other jobs) and one + RBD-backed (like the current Ceph job). + * Default the Cirros upload to the file-backed store. + * The first use of the Cirros image in a tempest test will cause Nova + to ask Glance to copy the image from the file-backed store to the + RBD-backed store. Subsequent tests will see it as already in the + RBD store and proceed as normal. + + The real-world goal of this is to facilitate RBD-to-RBD backend + store copying, but from Nova's perspective file-to-RBD is an + identical process, so it's a good analog without having to + bootstrap two independent Ceph clusters in a devstack job. + +Documentation Impact +==================== + +This is largely admin-focused. Users that are currently aware of this +limitation already have admin-level knowledge if they are working +around it. Successful implementation will just eliminate the need to +care about multiple Ceph deployments going forward. Thus admin and +configuration documentation should be sufficient. 
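+
+A rough sketch of the copy-and-poll flow described under *Proposed change*.
+The ``os_glance_importing_to_stores`` and ``os_glance_failed_import`` image
+properties come from the text above; the client call, tunable values and
+helper name are assumptions rather than the final interface:
+
+.. code-block:: python
+
+    import time
+
+    # Hypothetical tunables; the spec only says such knobs will be needed.
+    POLL_INTERVAL = 15   # seconds between checks of the image properties
+    COPY_TIMEOUT = 600   # give up after this many seconds
+
+    def wait_for_image_copy(glance, image_id, store_name):
+        """Ask Glance to copy an image to our store, then poll until done."""
+        # 'copy-image' is the import method exposed by the Glance import API.
+        glance.images.image_import(image_id, method='copy-image',
+                                   stores=[store_name])
+
+        deadline = time.monotonic() + COPY_TIMEOUT
+        while time.monotonic() < deadline:
+            image = glance.images.get(image_id)
+            failed = image.get('os_glance_failed_import', '') or ''
+            importing = image.get('os_glance_importing_to_stores', '') or ''
+            if store_name in failed.split(','):
+                raise RuntimeError('Glance failed to copy %s to store %s'
+                                   % (image_id, store_name))
+            if store_name not in importing.split(','):
+                return  # the copy finished; clone from the local store now
+            time.sleep(POLL_INTERVAL)
+        raise RuntimeError('timed out waiting for Glance to copy %s' % image_id)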
+ +References +========== + +* https://blueprints.launchpad.net/glance/+spec/copy-existing-image + +* https://docs.openstack.org/glance/latest/admin/interoperable-image-import.html + +* https://review.opendev.org/#/c/699656/8 + +History +======= + +.. list-table:: Revisions + :header-rows: 1 + + * - Release Name + - Description + * - Victoria + - Introduced diff --git a/specs/victoria/implemented/sriov-interface-attach-detach.rst b/specs/victoria/implemented/sriov-interface-attach-detach.rst new file mode 100644 index 0000000..25f4c13 --- /dev/null +++ b/specs/victoria/implemented/sriov-interface-attach-detach.rst @@ -0,0 +1,174 @@ +.. + This work is licensed under a Creative Commons Attribution 3.0 Unported + License. + + http://creativecommons.org/licenses/by/3.0/legalcode + +========================================= +Support SRIOV interface attach and detach +========================================= + +https://blueprints.launchpad.net/nova/+spec/sriov-interface-attach-detach + +Nova supports booting servers with SRIOV interfaces. However, attaching and +detaching an SRIOV interface to an existing server is not supported as the PCI +device management is missing from the attach and detach code path. + + +Problem description +=================== + +SRIOV interfaces cannot be attached or detached from an existing nova server. + +Use Cases +--------- + +As an end user I need to connect my server to another neutron network via an +SRIOV interface to get high throughput connectivity to that network direction. + +As an end user I want to detach an existing SRIOV interface as I don't use that +network access anymore and I want to free up the scarce SRIOV resource. + +Proposed change +=============== + +In the compute manager, during interface attach, the compute needs to generate +``InstancePCIRequest`` for the requested port if the vnic_type of the port +indicates an SRIOV interface. Then run a PCI claim on the generated PCI request +to check if there is a free PCI device, claim it, and get a ``PciDevice`` +object. If this is successful then connect the PCI request to the +``RequestedNetwork`` object and call Neutron as today with that +``RequestedNetwork``. Then call the virt driver as of today. + +If the PCI claim fails then the interface attach instance action will fail but +the instance state will not be set to ERROR. + +During detach, we have to recover the PCI request from the VIF being destroyed +then from that, we can get the PCI device that we need to unclaim in the PCI +tracker. + +Note that detaching an SRIOV interface succeeds today from API user +perspective. However, the detached PCI device is not freed from resource +tracking and therefore leaked until the nova server is deleted or live +migrated. This issue will be gone when the current spec is implemented. Also +as a separate bugfix SRIOV detach will be blocked on stable branches to prevent +the resource leak. + +There is a separate issue with SRIOV PF detach due to the way the libvirt +domain XML is generated. While the fix for that is needed for the current spec, +it also needed for the existing SRIOV live migration feature because that also +detaches the SRIOV interfaces during the migration. So the SRIOV PF detach +issue will be fixed as an independent bugfix of the SRIOV live migration +feature and the implementation of this spec will depend on that bugfix. 
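+
+A simplified sketch of the attach-side flow described above. The collaborator
+objects are duck-typed and every method called on them is hypothetical; the
+real change lives in the compute manager, the PCI tracker and the Neutron API
+module:
+
+.. code-block:: python
+
+    def attach_sriov_interface(context, instance, port, pci_tracker, neutron,
+                               driver):
+        """Sketch: claim a PCI device for an SRIOV port, then attach it."""
+        # 1. Generate an InstancePCIRequest for the port, mirroring what is
+        #    done at boot time for SRIOV vnic types.
+        pci_request = pci_tracker.build_request_for_port(context, port)
+
+        # 2. Run a PCI claim. If no free device is left the attach fails,
+        #    but the instance is not put into ERROR state.
+        device = pci_tracker.claim_device(context, instance, pci_request)
+        if device is None:
+            raise RuntimeError('interface attach failed: no free PCI device')
+
+        # 3. Connect the PCI request to the RequestedNetwork and call
+        #    Neutron with it, as is done today at boot.
+        requested_network = {'port_id': port['id'],
+                             'pci_request_id': pci_request['request_id']}
+        network_info = neutron.allocate_for_instance(
+            context, instance, requested_networks=[requested_network])
+
+        # 4. Hand the resulting VIF to the virt driver, unchanged from today.
+        driver.attach_interface(context, instance, network_info[0])
+        return network_info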
+ +Alternatives +------------ + +None + +Data model impact +----------------- + +None + +REST API impact +--------------- + +None + +Security impact +--------------- + +None + +Notifications impact +-------------------- + +None + +Other end user impact +--------------------- + +None + +Performance Impact +------------------ + +There will be an extra neutron call during interface attach as well as +additional DB operations. The ``interface_attach`` RPC method is synchronous +today, so this will be an end user visible change. + +Other deployer impact +--------------------- + +None + +Developer impact +---------------- + +None + +Upgrade impact +-------------- + +None + +Implementation +============== + +Assignee(s) +----------- + + +Primary assignee: + balazs-gibizer + + +Feature Liaison +--------------- + +Feature liaison: + gibi + + +Work Items +---------- + +* change the attach and detach code path +* add unit and functional tests +* add documentation + + +Dependencies +============ + +None + + +Testing +======= + +Tempest test cannot be added since the upstream CI does not have SRIOV devices. +Functional tests with libvirt driver will be added instead. + + +Documentation Impact +==================== + +* remove the limitation from the API documentation + +References +========== + +None + +History +======= + +.. list-table:: Revisions + :header-rows: 1 + + * - Release Name + - Description + * - Victoria + - Introduced diff --git a/specs/victoria/implemented/use-pcpu-vcpu-in-one-instance.rst b/specs/victoria/implemented/use-pcpu-vcpu-in-one-instance.rst new file mode 100644 index 0000000..d9c067f --- /dev/null +++ b/specs/victoria/implemented/use-pcpu-vcpu-in-one-instance.rst @@ -0,0 +1,417 @@ +.. + This work is licensed under a Creative Commons Attribution 3.0 Unported + License. + + http://creativecommons.org/licenses/by/3.0/legalcode + +========================================= +Use ``PCPU`` and ``VCPU`` in One Instance +========================================= + +https://blueprints.launchpad.net/nova/+spec/use-pcpu-and-vcpu-in-one-instance + +The spec `CPU resource tracking`_ splits host CPUs into ``PCPU`` and ``VCPU`` +resources, making it possible to run instances of ``dedicated`` CPU allocation +policy and instances of ``shared`` CPU allocation policy in the same host. +This spec aims to create such kind of instance that some of the vCPUs are +dedicated (``PCPU``) CPUs and the remaining vCPUs are shared (``VCPU``) vCPUs +and expose this information via the metadata API. + +Problem description +=================== + +The current CPU allocation policy, ``dedicated`` or ``shared``, is applied to +all vCPUs of an instance. However, with the introduction of +`CPU resource tracking`_, it is possible to propose a more fine-grained CPU +allocation policy, which is based on the control over individual instance vCPU, +and specifying the ``dedicated`` or ``shared`` CPU allocation policy to each +instance vCPU. + +Use Cases +--------- + +As an operator, I would like to have an instance with some realtime CPUs for +high performance, and at the same time, in order to increase instance density, +I wish to make the remaining CPUs, which do not demand high performance, +shared with other instances because I only care about the performance of +realtime CPUs. 
One example is deploying the NFV task that is enhanced with +DPDK framework in the instance, in which the data plane threads could be +processed with the realtime CPUs and the control-plane tasks are scheduled +on CPUs that may be shared with other instances. + +As a Kubernetes administrator, I wish to run a multi-tier or auto-scaling +application in Kubernetes, which is running in single OpenStack VM, with +the expectation that using dedicated high-performance CPUs for application +itself and deploying the containers on shared cores. + +Proposed change +=============== + +Introduce a new CPU allocation policy ``mixed`` +----------------------------------------------- + +``dedicated`` and ``shared`` are the existing instance CPU allocation policies +that determine how instance CPU is scheduled on host CPU. This specification +proposes a new CPU allocation policy, with the name ``mixed``, to +create a CPU *mixed* instance in such way that some instance vCPUs are +allocated from computing node's ``PCPU`` resource, and the rest of instance +vCPUs are allocated from the ``VCPU`` resources. The CPU allocated from +``PCPU`` resource will be pinned on particular host CPUs which are defined in +``CONF.compute.dedicated_cpu_set``, and the CPU from ``VCPU`` resource will be +floating on the host CPUs which are defined in ``CONF.compute.shared_cpu_set``. +In this proposal, we call these two kinds of vCPUs as *dedicated* vCPU and +*shared* vCPU respectively. + +Instance CPU policy matrix +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Nova operator may set the instance CPU allocation policy through the +``hw:cpu_policy`` and ``hw_cpu_policy`` interfaces, which may raise conflict. +The CPU policy conflict is proposed to be solved with the following policy +matrix: + ++---------------------------+-----------+-----------+-----------+-----------+ +| | hw:cpu_policy | ++ INSTANCE CPU POLICY +-----------+-----------+-----------+-----------+ +| | DEDICATED | MIXED | SHARED | undefined | ++---------------+-----------+-----------+-----------+-----------+-----------+ +| hw_cpu_policy | DEDICATED | dedicated | conflict | conflict | dedicated | ++ +-----------+-----------+-----------+-----------+-----------+ +| | MIXED | dedicated | mixed | conflict | mixed | ++ +-----------+-----------+-----------+-----------+-----------+ +| | SHARED | dedicated | conflict | shared | shared | ++ +-----------+-----------+-----------+-----------+-----------+ +| | undefined | dedicated | mixed | shared | undefined | ++---------------+-----------+-----------+-----------+-----------+-----------+ + +For example, if a ``dedicated`` CPU policy is specified in instance flavor +``hw:cpu_policy``, then the instance CPU policy is ``dedicated``, regardless +of the setting specified in image property ``hw_cpu_policy``. If ``shared`` +is explicitly set in ``hw:cpu_policy``, then a ``mixed`` policy specified +in ``hw_cpu_policy`` is conflict, which will throw an exception, the instance +booting request will be rejected. + +If there is no explicit instance CPU policy specified in flavor or image +property, the flavor matrix result would be 'undefined', and the final +instance policy is further determined and resolved by ``resources:PCPU`` +and ``resources:VCPU`` specified in flavor extra specs. Refer to +:ref:`section ` and the spec +`CPU resource tracking`_. + +Affect over real-time vCPUs +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Real-time vCPU also occupies the host CPU exclusively and does not share CPU +with other instances, all real-time vCPUs are dedicated vCPUs. 
For a *mixed* +instance with some real-time vCPUs, with this proposal, the vCPUs not in the +instance real-time vCPU list are shared vCPUs. + +Affect over emulator thread policy +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +If emulator thread policy is ``ISOLATE``, the *mixed* instance will look for +a *dedicated* host CPU for instance emulator thread, which is very similar +to the case introduced by ``dedicated`` policy instance. + +If the emulator thread policy is ``SHARE``, then the instance emulator thread +will float over the host CPUs defined in configuration +``CONF.compute.cpu_shared_set``. + +Set dedicated CPU bit-mask in ``hw:cpu_dedicated_mask`` for ``mixed`` instance +------------------------------------------------------------------------------ + +As an interface to create the ``mixed`` policy instance through legacy flavor +extra specs or image properties, the flavor extra spec +``hw:cpu_dedicated_mask`` is introduced. If the extra spec +``hw:cpu_dedicated_mask`` is found in the instance flavor, then the +information of the *dedicated* CPU could be found through +parsing ``hw:cpu_dedicated_mask``. + +Here is the example to create an instance with ``mixed`` policy: + +.. code:: + + $ openstack flavor set \ + --property hw:cpu_policy=mixed \ + --property hw:cpu_dedicated_mask=0-3,7 + +And, following is the proposing command to create a *mixed* instance which +consists of multiple NUMA nodes by setting the *dedicated* vCPUs in +``hw:cpu_dedicated_mask``: + +.. code:: + + $ openstack flavor set \ + --property hw:cpu_policy=mixed \ + --property hw:cpu_dedicated_mask=2,7 \ + --property hw:numa_nodes=2 \ + --property hw:numa_cpus.0=0-2 \ + --property hw:numa_cpus.1=3-7 \ + --property hw:numa_mem.0=1024 \ + --property hw:numa_mem.1=2048 + +.. note:: + Please be aware that there is no equivalent setting in image properties + for flavor extra spec ``hw:cpu_dedicated_mask``. It will not be supported + to create *mixed* instance through image properties. + +.. note:: + The dedicated vCPU list of a *mixed* instance could be specified through + the newly introduced dedicated CPU mask or the cpu-time CPU mask, the + ``hw:cpu_realtime_mask`` or ``hw_cpu_realtime_mask``, you cannot set it + by setting dedicated CPU mask extra spec and real-time CPU mask at the + same time. + +.. _mixed-instance-PCPU-VCPU: + +Create *mixed* instance via ``resources:PCPU`` and ``resources:VCPU`` +--------------------------------------------------------------------- + +`CPU resource tracking`_ introduced a way to create an instance with +``dedicated`` or ``shared`` CPU allocation policy through ``resources:PCPU`` +and ``resources:VCPU`` interfaces, but did not allow requesting both ``PCPU`` +resource and ``VCPU`` resource for one instance. + +This specification proposes to let an instance request ``PCPU`` resource along +with ``VCPU``, and effectively applying for the ``mixed`` CPU allocation +policy if the ``cpu_policy`` is not explicitly specified in the flavor list. +So an instance with such flavors potentially creates a ``mixed`` policy +instance: + +.. code:: + + $ openstack flavor set \ + --property "resources:PCPU"="" \ + --property "resources:VCPU"="" \ + + +For *mixed* instance created in such way, both and + must be greater than zero. Otherwise, it effectively +creates the ``dedicated`` or ``shared`` policy instance, that all vCPUs in the +instance is in a same allocation policy. 
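+
+A small sketch of how the effective policy could be resolved from the
+requested resource amounts when no explicit CPU policy is given; the function
+and argument names are illustrative only:
+
+.. code-block:: python
+
+    def resolve_cpu_policy(pcpu_count, vcpu_count, explicit_policy=None):
+        """Derive the effective CPU policy from resources:PCPU/VCPU counts.
+
+        The counts come from the flavor extra specs (0 when absent).
+        """
+        if explicit_policy is not None:
+            # An explicit hw:cpu_policy/hw_cpu_policy wins (or conflicts),
+            # per the policy matrix above; nothing is inferred here.
+            return explicit_policy
+        if pcpu_count > 0 and vcpu_count > 0:
+            return 'mixed'
+        if pcpu_count > 0:
+            return 'dedicated'
+        # Assumed default: flavor.vcpus maps to VCPU, i.e. the shared policy.
+        return 'shared'
+
+    assert resolve_cpu_policy(2, 6) == 'mixed'
+    assert resolve_cpu_policy(4, 0) == 'dedicated'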
+ +The ``resources:PCPU`` and ``resources::VCPU`` interfaces only put the request +toward ``Placement`` service for how many ``PCPU`` and ``VCPU`` resources are +required to fulfill the instance vCPU thread and emulator thread requirement. +The ``PCPU`` and ``VCPU`` distribution on the instance, especially on the +instance with multiple NUMA nodes, will be spread across the NUMA nodes in the +round-robin way, and ``VCPU`` will be put ahead of ``PCPU``. Here is one +example and the instance is created with flavor below:: + + flavor: + vcpus:8 + memory_mb=512 + extra_specs: + hw:numa_nodes:2 + resources:VCPU=3 + resources:PCPU=5 + +Instance emulator thread policy is not specified in the flavor, so it does not +occupy any dedicated ``PCPU`` resource for it, all ``PCPU`` and ``VCPU`` +resources will be used by vCPU threads, and the expected distribution on NUMA +nodes is:: + + NUMA node 0: VCPU VCPU PCPU PCPU + NUMA node 1: VCPU PCPU PCPU PCPU + +.. note:: + The demanding instance CPU number is the number of vCPU, specified by + ``flavor.vcpus``, plus the number of CPU that is special for emulator + thread, and if the emulator thread policy is ``ISOLATE``, the instance + requests ``flavor.vcpus`` + 1 vCPUs, if the policy is not ``ISOLATE``, + the instance just requests ``flavor.vcpus`` vCPU. + +Alternatives +------------ + +Creating CPU mixed instance by extending the ``dedicated`` policy +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Instead of adding a special instance CPU allocation policy, the CPU mixed +instance is supported by extending the existing ``dedicated`` policy and +specifying the vCPUs that are pinned to the host CPUs chosen from ``PCPU`` +resource. + +Following extra spec and the image property are defined to keep the +*dedicated* vCPUs of a ``mixed`` policy instance:: + + hw:cpu_dedicated_mask= + hw_cpu_dedicated_mask= + +The ```` shares the same definition defined above. + +This was rejected at it overloads the ``dedicated`` policy to mean two things, +depending on the value of another configuration option. + +Creating ``mixed`` instance with ``hw:cpu_policy`` and ``resources:(P|V)CPU`` +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Following commands was proposed as an example to create a *mixed* instance by +an explicit request of ``PCPU`` resources, and infer the ``VCPU`` count by +``flavor::vcpus`` and ``PCPU`` count: + +.. code:: + + $ openstack flavor create mixed_vmf --vcpus 4 --ram 512 --disk 1 + $ openstack flavor set mixed_vmf \ + --property hw:cpu_policy=mixed \ + --property resources:PCPU=2 + +This was rejected due to the mixing use of ``hw:cpu_policy`` and +``resources:PCPU``. It is not recommended to mix placement style syntax with +traditional extra specs. + +Data model impact +----------------- + +Add the ``pcpuset`` field in ``InstanceNUMACell`` object to track the dedicated +vCPUs of the instance NUMA cell, and the original ``InstanceNUMACell.cpuset`` +is special for shared vCPU then. + +This change will introduce some database migration work for the existing +instance in a ``dedicated`` CPU allocation policy, since all vCPUs in such an +instance are dedicated vCPUs which should be kept in ``pcpuset`` field, but +they are stored in ``cpuset`` historically. + +REST API impact +--------------- + +The metadata API will be extended with the *dedicated* vCPU info and a new +OpenStack metadata version will be added to indicate this is a new metadata +API. 
+ +The new field will be added to the ``meta_data.json``:: + + dedicated_cpus= + +The ```` lists the *dedicated* vCPU set of the instance, which +might be the content of ``hw:cpu_dedicated_mask`` or +``hw:cpu_realtime_mask`` or ``hw_cpu_realtime_mask`` or the CPU list +generated with the *round-robin* policy as described in +:ref:`section `. + +The new cpu policy ``mixed`` is added to extra spec ``hw:cpu_policy``. + +Security impact +--------------- + +None + +Notifications impact +-------------------- + +None + +Other end user impact +--------------------- + +If the end user wants to create an instance with a ``mixed`` CPU allocation +policy, the user is required to set corresponding flavor extra specs or image +properties. + +Performance Impact +------------------ + +This proposal affects the selection of instance CPU allocation policy, but the +performance impact is trivial. + +Other deployer impact +--------------------- + +None + +Developer impact +---------------- + +None + +Upgrade impact +-------------- + +The ``mixed`` cpu policy is only available when the whole cluster upgrade +finished. A service version will be bumped for detecting the upgrade. + +The ``InstanceNUMACell.pcpuset`` is introduced for dedicated vCPUs and the +``InstanceNUMACell.cpuset`` is special for shared vCPUs, all existing +instances in a ``dedicated`` CPU allocation policy should be updated by moving +content in ``InstanceNUMACell.cpuset`` filed to +``InstanceNUMACell.pcpuset`` field. The underlying database keeping the +``InstanceNUACell`` object also need be updated to reflect this change. + +Implementation +============== + +Assignee(s) +----------- + +Primary assignee: + Wang, Huaqiang + +Feature Liaison +--------------- + +Feature liaison: + Stephen Finucane + +Work Items +---------- + +* Add a new field, the ``pcpuset``, for ``InstanceNUMACell`` for dedicated + vCPUs. +* Add new instance CPU allocation policy ``mixed`` property and resolve + conflicts +* Bump nova service version to indicate the new CPU policy in nova-compute +* Add flavor extra spec ``hw:cpu_dedicated_mask`` and create *mixed* instance +* Translate *dedicated* and *shared* CPU request to placement ``PCPU`` and + ``VCPU`` resources request. +* Change libvirt driver to create ``PCPU`` mapping and ``VCPU`` mapping +* Add nova metadata service by offering final pCPU layout in + ``dedicated_cpus`` field +* Validate real-time CPU mask for ``mixed`` instance. + +Dependencies +============ + +None + +Testing +======= + +Functional and unit tests are required to cover: + +* Ensure to solve the conflicts between the CPU policy matrix +* Ensure only *dedicated* vCPUs are possible to be real-time vCPUs +* Ensure creating ``mixed`` policy instance properly either by flavor + settings or by ``resources::PCPU=xx`` and ``resources::VCPU=xx`` settings. +* Ensure *shared* vCPUs is placed before the ``dedicated`` vCPUs +* Ensure the emulator CPU is properly scheduled according to its policy. + +Documentation Impact +==================== + +The documents should be changed to introduce the usage of new ``mixed`` CPU +allocation policy and the new flavor extra specs. + +Metadata service will be updated accordingly. + +References +========== + +* `CPU resource tracking`_ + +.. _CPU resource tracking: http://specs.openstack.org/openstack/nova-specs/specs/train/approved/cpu-resources.html + +History +======= + +.. 
list-table:: Revisions + :header-rows: 1 + + * - Release Name + - Description + * - Train + - Introduced, abandoned + * - Ussuri + - Approved + * - Victoria + - Re-proposed diff --git a/specs/victoria/implemented/victoria-template.rst b/specs/victoria/implemented/victoria-template.rst deleted file mode 120000 index c0440d1..0000000 --- a/specs/victoria/implemented/victoria-template.rst +++ /dev/null @@ -1 +0,0 @@ -../../victoria-template.rst \ No newline at end of file diff --git a/specs/victoria/redirects b/specs/victoria/redirects index e69de29..570881b 100644 --- a/specs/victoria/redirects +++ b/specs/victoria/redirects @@ -0,0 +1,6 @@ +approved/add-emulated-virtual-tpm.rst ../implemented/add-emulated-virtual-tpm.rst +approved/nova-image-download-via-rbd.rst ../implemented/nova-image-download-via-rbd.rst +approved/provider-config-file.rst ../implemented/provider-config-file.rst +approved/rbd-glance-multistore.rst ../implemented/rbd-glance-multistore.rst +approved/sriov-interface-attach-detach.rst ../implemented/sriov-interface-attach-detach.rst +approved/use-pcpu-vcpu-in-one-instance.rst ../implemented/use-pcpu-vcpu-in-one-instance.rst