Quality Assurance
=================

.. contents:: Table of Contents
   :local:

Tested Platforms
----------------

* Vanilla Kubernetes, 1 Tesla GPU node - AWS.
* Vanilla Kubernetes, 1 Tesla GPU node - Ubuntu 18.04.
* OpenShift 4.1, 3 CPU nodes and 2 RHCOS Tesla GPU nodes.

End to End Stories
------------------

As a cluster admin, I want to be able to install the GPU Operator with Helm, Kubernetes, Ubuntu and Docker.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* Validate that all pods are running.
* Validate that we can run a CUDA application.
* Validate that we can run a TensorFlow notebook.

A sketch of these checks is given in `Example Validation Commands`_ at the end of this document.

As a cluster admin, I want to be able to install the GPU Operator with Helm, OpenShift 4.1, RHCOS and CRI-O.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* Validate that all pods are running.
* Validate that we can run a CUDA application.
* Validate that we can run a TensorFlow notebook.

As a cluster admin, I want to be able to gather GPU metrics after installing the GPU Operator.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* Validate that we can gather metrics.
* Validate that we can plug the GPU metrics into something like Grafana or Prometheus (see the metrics sketch in `Example Validation Commands`_).

Open question: does this story validate the documented instructions, and is it automatable? Probably not.

ipmi_msghandler isn't loaded
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As a cluster admin, I want to ensure that CPU nodes are not configured by the GPU Operator.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* Validate that all pods in the ``gpu-operator-resources`` namespace carry the GPU node label requirement (see the node-selector sketch in `Example Validation Commands`_).

Tainted Nodes
~~~~~~~~~~~~~

As a cluster admin, I want to ensure that the GPU Operator doesn't deploy a failing monitoring container.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* Validate that DCGM doesn't get deployed on OpenShift <= 4.2.

Key Performance Indicators
--------------------------

Quality Assurance Score Card
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* Number of E2E Tests
* Unit Test Coverage
* Number of Manual Tests / Number of Tests
* Number of Automated Tests / Number of Tests
* Number of user stories covered / Number of stories written
* Number of regression tests
* [REJECTED] Burndown Rate: rejected because the amount of work required is too high.

Performance Score Card
~~~~~~~~~~~~~~~~~~~~~~

* Installation Time
* Scalability of the GPU Operator
* Memory Consumption
* CPU Consumption
* Network Consumption (kB/minute)

Security Score Card
~~~~~~~~~~~~~~~~~~~

**Best practices**

* Small attack surface (ports opened, software installed, permissions given)
* Container Best Practices (slim image, latest version)
* Bill of materials present
* Bonus: provide a link to a website that provides security information
* TBD

**Threat modeling (-100)**

**Automation in place**

* CVE Analysis
* Automated publication

Bill of Materials Score Card
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* Number of Dependencies
* Number of Licenses
* License Blacklist and Whitelist
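
Example Validation Commands
---------------------------

The following is a minimal sketch of how the "all pods are running" and "run a CUDA
application" checks from the installation stories could be automated. The chart
repository URL, release name, and the ``nvidia/samples:vectoradd-cuda10.2`` image tag
are assumptions and may differ in the actual test environment.

.. code-block:: bash

   # Install the GPU Operator with Helm (repository URL is an assumption;
   # adjust if the test environment mirrors the charts elsewhere).
   helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
   helm repo update
   helm install gpu-operator nvidia/gpu-operator --wait

   # Validate that all pods are running (or have completed).
   kubectl get pods --all-namespaces --no-headers \
     | grep -Ev 'Running|Completed' \
     && echo "FAIL: some pods are not ready"

   # Validate that we can run a CUDA application: schedule a vector-add
   # sample pod that requests one GPU (image tag is an assumption).
   kubectl apply -f - <<EOF
   apiVersion: v1
   kind: Pod
   metadata:
     name: cuda-vectoradd
   spec:
     restartPolicy: OnFailure
     containers:
     - name: cuda-vectoradd
       image: nvidia/samples:vectoradd-cuda10.2
       resources:
         limits:
           nvidia.com/gpu: 1
   EOF
   kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/cuda-vectoradd --timeout=300s
   kubectl logs cuda-vectoradd    # expect "Test PASSED" in the output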
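
Similarly, the metrics story could be checked by scraping the DCGM exporter endpoint
directly; this is the same endpoint that Prometheus (and, through it, Grafana) would
consume. The service name, namespace, and port below are assumptions taken from a
default deployment.

.. code-block:: bash

   # Port-forward the DCGM exporter service (name and port are assumptions).
   kubectl -n gpu-operator-resources port-forward svc/nvidia-dcgm-exporter 9400:9400 &

   # Scrape once and check that GPU metrics are exposed in Prometheus format.
   sleep 2
   curl -s http://localhost:9400/metrics | grep -q 'DCGM' \
     && echo "PASS: GPU metrics exposed" \
     || echo "FAIL: no GPU metrics found"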
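
Finally, for the "CPU nodes are not configured" story, one could assert that every pod
in the ``gpu-operator-resources`` namespace carries a node selector requiring a GPU
node. The ``nvidia.com/gpu.present`` label key is an assumption; check the operator's
DaemonSet specs for the key it actually sets.

.. code-block:: bash

   # Print each pod's node selector; every entry should require a GPU node,
   # e.g. map[nvidia.com/gpu.present:true] (label key is an assumption).
   kubectl -n gpu-operator-resources get pods \
     -o custom-columns='NAME:.metadata.name,SELECTOR:.spec.nodeSelector'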