
Storage Service Tests During Upgrade

Requirements

A default storage class must be defined before running these tests. If the variable tester_storage_class is defined, that storage class is used instead. In addition, the provisioner associated with this storage class must implement the CSI CLONE_VOLUME capability. For more information on storage classes, see the OCP documentation.
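For instance, the prerequisites can be inspected from the CLI. The commands below only list the relevant objects; whether the provisioner implements CLONE_VOLUME still has to be verified in the driver's documentation:

# The default storage class is flagged with "(default)" in the NAME column
oc get storageclass

# List the CSI drivers installed on the cluster
oc get csidrivers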

Three test scenarios

These tests can be enabled by setting the boolean storage_upgrade_tester to true when the dci-openshift-agent is running an upgrade (see the launch sketch after the list below). The idea is to test the impact of the cluster upgrade and the storage operator upgrade on the availability of the storage service, that is, whether volumes can still be used normally. Three different Kubernetes CronJobs are created to test the three access modes of a PVC:

  • RWO: ReadWriteOnce, a single pod binds the volume and writes to a file inside it.
  • ROX: ReadOnlyMany, two pods are bound to the same volume simultaneously and read a value from it.
  • RWX: ReadWriteMany, two pods are bound to the same volume simultaneously and write to a file inside it.
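As a hypothetical example, assuming the agent playbook is launched with ansible-playbook and accepts extra variables on the command line, enabling the tester could look like this (the storage class name is illustrative):

# Illustrative invocation; adapt to how the agent is launched in your setup
ansible-playbook dci-openshift-agent.yml \
  -e storage_upgrade_tester=true \
  -e tester_storage_class=my-csi-storage-class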

Each CronJob launches a Job every minute, starting just before the platform upgrade, and the results are gathered at the end of the storage service upgrade. The sketch below illustrates what such a per-minute Job could run.
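A minimal sketch, assuming the PVC is mounted at /data (the commands are illustrative, not the agent's actual scripts):

# RWO/RWX writer: append one timestamped line per run; gaps in the file
# after the upgrade reveal windows where the volume was unavailable.
echo "$(date -u +%FT%TZ) written by $(hostname)" >> /data/upgrade-test.log

# ROX reader: exit non-zero if the expected value cannot be read,
# which marks that minute's Job as failed.
grep -q "test-value" /data/upgrade-test.txt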

Test results

The number of files and lines written in the volumes is used to estimate the downtime of the storage service for the RWO and RWX cases; the number of failed ROX Jobs is used for the ROX case. At the end of the upgrade process, the results are gathered via three Bash scripts (one per test case) into the files gather-storage-upgrade-tester-ROX.log, gather-storage-upgrade-tester-RWO.log, and gather-storage-upgrade-tester-RWX.log, which are then published in DCI. These output files contain general information on the Kubernetes objects used by the test (CronJobs, Jobs, and Pods), as well as an estimate of the downtime. As an example, for the ROX test case:

>>> Quick gathering object info in the Namespace of test
pod/storage-volume-reader-rox-27742362-ld5l5          0/1     Completed           0          3m20s   10.130.2.151   worker-2   <none>           <none>
pod/storage-volume-reader-rox-27742362-pxlcp          0/1     Completed           0          3m20s   10.130.2.152   worker-2   <none>           <none>
pod/storage-volume-reader-rox-27742364-bfst2          0/1     Completed           0          80s     10.130.2.156   worker-2   <none>           <none>
pod/storage-volume-reader-rox-27742364-xf7ns          0/1     Completed           0          80s     10.130.2.157   worker-2   <none>           <none>
pod/storage-volume-reader-rox-27742365-g5zg7          0/1     Completed           0          20s     10.130.2.159   worker-2   <none>           <none>
pod/storage-volume-reader-rox-27742365-j2kml          0/1     Completed           0          20s     10.130.2.158   worker-2   <none>           <none>
cronjob.batch/storage-volume-reader-rox          */1 * * * *   False     0        20s             123m   volume-reader-rox   registry.dfwt5g.lab:4443/rhel8/support-tools   <none>
job.batch/storage-volume-reader-rox-1664535840          0/1 of 2      101m       101m    volume-reader-rox   registry.dfwt5g.lab:4443/rhel8/support-tools   controller-uid=1ea14747-25a9-4b96-9e30-d38e1cc36709
job.batch/storage-volume-reader-rox-27742362            2/1 of 2      16s        3m20s   volume-reader-rox   registry.dfwt5g.lab:4443/rhel8/support-tools   controller-uid=e4e56035-ce2e-41f0-85a8-4a44958157f9
job.batch/storage-volume-reader-rox-27742364            2/1 of 2      16s        80s     volume-reader-rox   registry.dfwt5g.lab:4443/rhel8/support-tools   controller-uid=1a49e5ab-7660-4487-956f-afcce633b674
job.batch/storage-volume-reader-rox-27742365            2/1 of 2      16s        20s     volume-reader-rox   registry.dfwt5g.lab:4443/rhel8/support-tools   controller-uid=c7872426-3bce-4c81-a8c6-3917185b677d
> Time of the test: 124m
> Number of failures during this time: 1            # Because of the gathering method, this is only an estimate; in particular, the last tests could be counted as failures even though the volume was working fine
> Estimated percentage of failure: .800             # 0.8% of downtime during the ~2h the upgrade was running
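For reference, a minimal sketch of how such a failure percentage could be computed from the Job list, assuming each ROX Job must reach 2 successful completions (shown as 2/... in the COMPLETIONS column) and that TEST_NAMESPACE is a hypothetical variable holding the test namespace:

# Count all ROX test Jobs, then those that did not reach 2 successful completions
total=$(oc -n "$TEST_NAMESPACE" get jobs --no-headers | wc -l)
failed=$(oc -n "$TEST_NAMESPACE" get jobs --no-headers | awk '$2 !~ /^2\//' | wc -l)

# Failed Jobs over total Jobs gives the estimated percentage of downtime
echo "scale=3; $failed * 100 / $total" | bc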