E2E Testing Best Practices, Reloaded
End-to-end (E2E) testing in Kubernetes is how the project validates functionality with real clusters. Contributors sooner or later encounter it when asked to write E2E tests for new features or to help with debugging test failures. Cluster admins or vendors might run the conformance tests, a subset of all tests in the E2E test suite.
The underlying E2E framework for writing these E2E tests has been around for a long time. Functionality was added to it as needed, leading to code that became hard to maintain and use. The testing commons WG started cleaning it up, but dissolved before completely achieving their goals.
After the migration to Ginkgo v2 in Kubernetes 1.25, I picked up several of the loose ends and started untangling them. This blog post is a summary of those changes. Some of this content is also found in the Kubernetes contributor document about writing good E2E tests and gets reproduced here to raise awareness that the document has been updated.
Overall architecture
At the moment, the framework is used in-tree for testing against a cluster (test/e2e), testing kubeadm (test/e2e_kubeadm) and kubelet (test/e2e_node). The goal is to make the core test/e2e/framework a package that has no dependencies on internal code and that can be used in different E2E suites without polluting them with features or options that make no sense for them. This is currently only a technical goal. There are no plans anymore to actually move the code into a staging repository.
The framework acts like a normal client of an apiserver and thus doesn’t need much more than client-go. Since the sub-package refactoring, additional sub-packages like test/e2e/framework/pod depend on the framework, not the other way around. Those other sub-packages therefore can still use internal code. The import boss configuration enforces these constraints.
What’s left to clean up is that the framework contains a TestContext with fields that are used only by some tests or some test suites. The configuration for test/e2e_node is the last remaining dependency on internal code. Such settings should get moved into the different test suites and/or tests. Besides avoiding such dependencies, this has the advantage that an option will only show up in the command line of a suite when it really has an effect there.
Debuggability
If your test fails, its failure message should explain the reasons for the failure in as much detail as possible. The failure message is the string that gets passed (directly or indirectly) to ginkgo.Fail[f]. That text is what gets shown in the overview of failed tests for a Prow job and what gets aggregated by https://go.k8s.io/triage.
A good failure message:
- identifies the test failure
- has enough details to provide some initial understanding of what went wrong
It’s okay for it to contain information that changes during each test run. Aggregation simplifies the failure message with regular expressions before looking for similar failures.
Helper libraries like Gomega or testify can be used to produce informative failure messages. Gomega is a bit easier to use in combination with Ginkgo.
The E2E framework itself only has one helper function for assertions that is still recommended. The others are deprecated. Compared to gomega.Expect(err).NotTo(gomega.HaveOccurred()), framework.ExpectNoError(err) is shorter and produces better failure messages because it logs the full error and then includes only the shorter err.Error() in the failure message.
As with any other assertion, it is recommended to include additional context in cases where the parameters being checked by an assertion helper lack relevant information:
framework.ExpectNoError(err, "tried creating %d foobars, only created %d", foobarsReqd, foobarsCreated)
Use assertions that match the check in the test. Using Go code to evaluate some condition and then checking the result often isn’t informative. For example this check should be avoided:
gomega.Expect(strings.Contains(actualStr, expectedSubStr)).To(gomega.Equal(true))
Comparing a boolean like this against true or false with gomega.Equal or framework.ExpectEqual is not useful because dumping the actual and expected value just distracts from the underlying failure reason.
It is better to pass the actual values to Gomega, which will automatically include them in the failure message. Add an annotation that explains what the assertion is about:
gomega.Expect(actualStr).To(gomega.ContainSubstring("xyz"), "checking log output")
This produces the following failure message:
[FAILED] checking log output
Expected
<string>: hello world
to contain substring
<string>: xyz
If there is no suitable Gomega assertion, call framework.Failf directly:
import "github.com/onsi/gomega/format"
ok := someCustomCheck(abc)
if !ok {
    // format.Object requires an indentation level as its second argument.
    framework.Failf("check xyz failed for object:\n%s", format.Object(abc, 1))
}
It is good practice to include details like the object that failed some assertion in the failure message because then a) the information is available when analyzing a failure that occurred in the CI and b) it only gets logged when some assertion fails. Always dumping objects via log messages can make the test output very large and may distract from the relevant information.
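As a rough illustration of the points above, the following standard-library sketch contrasts a check on a pre-computed boolean with a check that receives the actual values and an annotation. The helpers checkBool and checkSubstring are hypothetical and are not part of Gomega or the E2E framework; the message layout only imitates the Gomega output shown earlier.

```go
package main

import (
	"fmt"
	"strings"
)

// checkBool mimics asserting on a pre-computed boolean. On failure, all it
// can report is that true was expected, which says nothing about the cause.
func checkBool(ok bool) string {
	if ok {
		return ""
	}
	return "Expected\n    <bool>: false\nto equal\n    <bool>: true"
}

// checkSubstring receives the actual values and therefore can describe them.
// The annotation explains what the check is about. An empty result means
// that the check passed.
func checkSubstring(actual, substr, annotation string) string {
	if strings.Contains(actual, substr) {
		return ""
	}
	return fmt.Sprintf("%s\nExpected\n    <string>: %s\nto contain substring\n    <string>: %s",
		annotation, actual, substr)
}

func main() {
	actual := "hello world"
	fmt.Println(checkBool(strings.Contains(actual, "xyz")))
	fmt.Println()
	fmt.Println(checkSubstring(actual, "xyz", "checking log output"))
}
```

Only the second message tells a reader which string was inspected and what was missing from it, and the details are produced only when the check fails.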
Dumping structs with format.Object is recommended. Starting with Kubernetes 1.26, format.Object will pretty-print Kubernetes API objects or structs as YAML and omit unset fields, which is more readable than other alternatives like fmt.Sprintf("%+v").
import (
    "fmt"
    "k8s.io/api/core/v1"
    "k8s.io/kubernetes/test/utils/format"
)
var pod v1.Pod
fmt.Printf("Printf: %+v\n\n", pod)
fmt.Printf("format.Object:\n%s", format.Object(pod, 1 /* indent one level */))
=>
Printf: {TypeMeta:{Kind: APIVersion:} ObjectMeta:{Name: GenerateName: Namespace: SelfLink: UID: ResourceVersion: Generation:0 CreationTimestamp:0001-01-01 00:00:00 +0000 UTC DeletionTimestamp:<nil> DeletionGracePeriodSeconds:<nil> Labels:map[] Annotations:map[] OwnerReferences:[] Finalizers:[] ManagedFields:[]} Spec:{Volumes:[] InitContainers:[] Containers:[] EphemeralContainers:[] RestartPolicy: TerminationGracePeriodSeconds:<nil> ActiveDeadlineSeconds:<nil> DNSPolicy: NodeSelector:map[] ServiceAccountName: DeprecatedServiceAccount: AutomountServiceAccountToken:<nil> NodeName: HostNetwork:false HostPID:false HostIPC:false ShareProcessNamespace:<nil> SecurityContext:nil ImagePullSecrets:[] Hostname: Subdomain: Affinity:nil SchedulerName: Tolerations:[] HostAliases:[] PriorityClassName: Priority:<nil> DNSConfig:nil ReadinessGates:[] RuntimeClassName:<nil> EnableServiceLinks:<nil> PreemptionPolicy:<nil> Overhead:map[] TopologySpreadConstraints:[] SetHostnameAsFQDN:<nil> OS:nil HostUsers:<nil> SchedulingGates:[] ResourceClaims:[]} Status:{Phase: Conditions:[] Message: Reason: NominatedNodeName: HostIP: PodIP: PodIPs:[] StartTime:<nil> InitContainerStatuses:[] ContainerStatuses:[] QOSClass: EphemeralContainerStatuses:[] Resize:}}
format.Object:
    <v1.Pod>:
        metadata:
          creationTimestamp: null
        spec:
          containers: null
        status: {}
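The effect of omitting unset fields can be reproduced on a small scale with the standard library. The sketch below is not how format.Object is implemented: it uses JSON with omitempty instead of YAML because the Go standard library has no YAML encoder, and miniPod is a hypothetical stand-in for a real API object.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// miniPod is a hypothetical, much smaller stand-in for an API object.
type miniPod struct {
	Name      string            `json:"name,omitempty"`
	Namespace string            `json:"namespace,omitempty"`
	NodeName  string            `json:"nodeName,omitempty"`
	Labels    map[string]string `json:"labels,omitempty"`
}

// dumpVerbose prints every field, whether it is set or not.
func dumpVerbose(p miniPod) string {
	return fmt.Sprintf("%+v", p)
}

// dumpCompact serializes the object and drops unset fields via omitempty.
func dumpCompact(p miniPod) string {
	data, err := json.MarshalIndent(p, "", "  ")
	if err != nil {
		return "<error: " + err.Error() + ">"
	}
	return string(data)
}

func main() {
	pod := miniPod{Name: "web"}
	fmt.Println(dumpVerbose(pod))
	fmt.Println(dumpCompact(pod))
}
```

With only one field set, the verbose dump still lists all four fields while the compact dump contains just the name. The difference grows with the size of the struct.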
Recovering from test failures
All tests should ensure that a cluster is restored to the state that it was in before the test ran. ginkgo.DeferCleanup is recommended for this because it can be called similar to defer directly after setting up something. It is better than defer because Ginkgo will show additional details about which cleanup code is running and (if possible) handle timeouts for that code (see next section). It is better than ginkgo.AfterEach because it is not necessary to define additional variables and because ginkgo.DeferCleanup executes code in the more useful last-in-first-out order, i.e. things that get set up first get removed last.
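The last-in-first-out order can be sketched without Ginkgo. The cleanupStack type below is a hypothetical, simplified stand-in for the bookkeeping that a test framework does for cleanup callbacks; it is not Ginkgo code and leaves out reporting and timeouts.

```go
package main

import "fmt"

// cleanupStack is a hypothetical, simplified stand-in for the list of
// cleanup callbacks that a test framework maintains.
type cleanupStack struct {
	callbacks []func()
}

// Defer registers a callback, typically right after the matching setup step.
func (s *cleanupStack) Defer(cb func()) {
	s.callbacks = append(s.callbacks, cb)
}

// Run invokes the callbacks in last-in-first-out order.
func (s *cleanupStack) Run() {
	for i := len(s.callbacks) - 1; i >= 0; i-- {
		s.callbacks[i]()
	}
	s.callbacks = nil
}

// runExample "creates" three objects that depend on each other and returns
// the order in which they get removed again.
func runExample() []string {
	var removed []string
	var stack cleanupStack
	for _, name := range []string{"namespace", "service", "pod"} {
		name := name // per-iteration copy, needed before Go 1.22
		stack.Defer(func() { removed = append(removed, name) })
	}
	stack.Run()
	return removed
}

func main() {
	fmt.Println(runExample()) // prints [pod service namespace]
}
```

The object that was set up first (the namespace) gets removed last, so nothing is deleted while something else still depends on it.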
Objects created in the test namespace do not need to be deleted because deleting the namespace will also delete them. However, if deleting an object may fail, then explicitly cleaning it up is better because then failures or timeouts related to it will be more obvious.
In cases where the test may have removed the object, framework.IgnoreNotFound can be used to ignore the “not found” error:
podClient := f.ClientSet.CoreV1().Pods(f.Namespace.Name)
pod, err := podClient.Create(ctx, testPod, metav1.CreateOptions{})
framework.ExpectNoError(err, "create test pod")
ginkgo.DeferCleanup(framework.IgnoreNotFound(podClient.Delete), pod.Name, metav1.DeleteOptions{})
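The idea behind such a wrapper can be sketched with the standard library. This is not the framework's implementation: errNotFound, store and ignoreNotFound are hypothetical names, and a real client reports "not found" through API status errors rather than a sentinel error.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// errNotFound is a hypothetical sentinel error used only in this sketch.
var errNotFound = errors.New("not found")

// store is a toy in-memory replacement for an API client.
type store map[string]bool

// Delete removes the named object or fails if it does not exist.
func (s store) Delete(ctx context.Context, name string) error {
	if !s[name] {
		return fmt.Errorf("delete %q: %w", name, errNotFound)
	}
	delete(s, name)
	return nil
}

// ignoreNotFound wraps a delete function so that "not found" counts as
// success. All other errors are passed through unchanged.
func ignoreNotFound(del func(context.Context, string) error) func(context.Context, string) error {
	return func(ctx context.Context, name string) error {
		if err := del(ctx, name); err != nil && !errors.Is(err, errNotFound) {
			return err
		}
		return nil
	}
}

// deleteTwice simulates a test that already removed its object and then
// runs cleanup, once without and once with the wrapper.
func deleteTwice() (plain, wrapped error) {
	ctx := context.Background()
	s := store{"pod-a": true}
	_ = s.Delete(ctx, "pod-a") // the test itself removes the object
	plain = s.Delete(ctx, "pod-a")
	wrapped = ignoreNotFound(s.Delete)(ctx, "pod-a")
	return plain, wrapped
}

func main() {
	plain, wrapped := deleteTwice()
	fmt.Println("plain delete:", plain)
	fmt.Println("wrapped delete:", wrapped)
}
```

Without the wrapper, the cleanup would report a failure although the cluster is already in the desired state.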
Interrupting tests
When aborting a manual ginkgo ./test/e2e invocation with CTRL-C or a signal, the currently running test(s) should stop immediately. This gets achieved by accepting a ctx context.Context as first parameter in the Ginkgo callback function and then passing that context through to all code that might block. When Ginkgo notices that it needs to shut down, it will cancel that context and all code trying to use it will immediately return with a context canceled error. Cleanup callbacks get a new context which will time out eventually to ensure that tests don’t get stuck. For a detailed description, see https://onsi.github.io/ginkgo/#interrupting-aborting-and-timing-out-suites.
Most of the E2E tests were updated to use the Ginkgo context at the start of the 1.27 development cycle.
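The mechanism relies on ordinary Go context cancellation. The following sketch uses only the standard library; waitForever and simulateInterrupt are hypothetical names, and the goroutine that cancels the context merely plays the role that Ginkgo has when it handles an interrupt.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitForever stands in for any operation that might block. Because it
// watches ctx, it returns as soon as the context gets canceled.
func waitForever(ctx context.Context) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(time.Hour):
		return nil
	}
}

// simulateInterrupt cancels the context shortly after the blocking call
// started and returns the text of the resulting error.
func simulateInterrupt() string {
	ctx, cancel := context.WithCancel(context.Background())
	go func() {
		time.Sleep(10 * time.Millisecond)
		cancel() // in a real run, the test runner does this on CTRL-C
	}()
	if err := waitForever(ctx); err != nil {
		return err.Error()
	}
	return ""
}

func main() {
	fmt.Println(simulateInterrupt()) // prints "context canceled"
}
```

Code that ignores the context, for example by sleeping unconditionally, would keep running for the full hour in this sketch.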
There are some gotchas:
- Don’t use the ctx passed into ginkgo.It in a ginkgo.DeferCleanup callback because the context will be canceled when the cleanup code runs. This is wrong:
    ginkgo.It("something", func(ctx context.Context) {
        ...
        ginkgo.DeferCleanup(func() {
            // do something with ctx
        })
    })
  Instead, register a function which accepts a new context:
    ginkgo.DeferCleanup(func(ctx context.Context) {
        // do something with the new ctx
    })
  Anonymous functions can be avoided by passing some existing function and its parameters directly to ginkgo.DeferCleanup. Again, beware to not pass the wrong ctx. This is wrong:
    ginkgo.It("something", func(ctx context.Context) {
        ...
        ginkgo.DeferCleanup(myDeleteFunc, ctx, objName)
    })
  Instead, just pass the other parameters and let ginkgo.DeferCleanup add a new context:
    ginkgo.DeferCleanup(myDeleteFunc, objName)
- When starting some background goroutine in a ginkgo.BeforeEach callback, use context.WithCancel(context.Background()). The context passed into the callback will get canceled when the callback returns, which would cause the background goroutine to stop before the test runs. This works:
    backgroundCtx, cancel := context.WithCancel(context.Background())
    ginkgo.DeferCleanup(cancel)
    _, controller = cache.NewInformer( ... )
    go controller.Run(backgroundCtx.Done())
- When adding a timeout to the context for one particular operation, beware of overwriting the ctx variable. This code here applies the timeout to the next call and everything else that follows:
    ctx, cancel := context.WithTimeout(ctx, 5 * time.Second)
    defer cancel()
    someOperation(ctx)
    ...
    anotherOperation(ctx)
  Better use some other variable name:
    timeoutCtx, cancel := context.WithTimeout(ctx, 5 * time.Second)
    defer cancel()
    someOperation(timeoutCtx)
- When the intention is to set a timeout for the entire callback, use ginkgo.NodeTimeout:
    ginkgo.It("something", ginkgo.NodeTimeout(30 * time.Second), func(ctx context.Context) {
    })
  There is also a ginkgo.SpecTimeout, but that then also includes the time taken for BeforeEach, AfterEach and DeferCleanup callbacks.
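The gotcha about overwriting the ctx variable can be verified with a few lines of standard-library Go. In this sketch, hasDeadline, overwritten and separate are hypothetical helpers; checking for a deadline stands in for calling a real operation.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// hasDeadline reports whether an operation called with ctx would be
// subject to a timeout.
func hasDeadline(ctx context.Context) bool {
	_, ok := ctx.Deadline()
	return ok
}

// overwritten reuses the name ctx, so the timeout that was meant for the
// first operation also applies to the second one.
func overwritten() (first, second bool) {
	ctx := context.Background()
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()
	first = hasDeadline(ctx)  // intended
	second = hasDeadline(ctx) // unintended: inherits the 5s timeout
	return first, second
}

// separate keeps the original ctx intact by using a different name for
// the context with the timeout.
func separate() (first, second bool) {
	ctx := context.Background()
	timeoutCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()
	first = hasDeadline(timeoutCtx) // only this call has the timeout
	second = hasDeadline(ctx)
	return first, second
}

func main() {
	fmt.Println(overwritten()) // prints true true
	fmt.Println(separate())    // prints true false
}
```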
Polling and timeouts
When waiting for something to happen, use a reasonable timeout. Without it, a test might keep running until the entire test suite gets killed by the CI. Beware that the CI under load may take a lot longer to complete some operation compared to running the same test locally. On the other hand, a timeout that is too long also has drawbacks:
- When a feature is broken so that the expected state doesn’t get reached, a test waiting for that state first needs to time out before the test fails.
- If a state is expected to be reached within a certain time frame, then a timeout that is much higher will cause test runs to be considered successful although the feature was too slow. A dedicated performance test in a well-known environment may be a better solution for testing such performance expectations.
The framework provides some common timeouts through the framework instance. When writing a test, check whether one of those fits before defining a custom timeout in the test.
Good code that waits for something to happen meets the following criteria:
- accepts a context for test timeouts
- depending on how the test suite was invoked:
  - informative during interactive use (i.e. intermediate reports, either periodically or on demand)
  - little to no output during a CI run except when it fails
- full explanation when it fails: when it observes some state and then encounters errors reading the state, then dumping both the latest observed state and the latest error is useful
- extension mechanism for writing custom checks
- early abort when condition cannot be reached anymore
gomega.Eventually satisfies all of these criteria and therefore is recommended, but not required.
In https://github.com/kubernetes/kubernetes/pull/113298, test/e2e/framework/pod/wait.go and the framework were modified to use Gomega. Typically, Eventually is passed a function which gets an object or lists several of them, then Should checks against the expected result. Errors and retries specific to Kubernetes are handled by wrapping client-go functions.
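To make some of the criteria listed above more concrete, here is a minimal hand-written polling loop based only on the standard library. It is a sketch, not a replacement for gomega.Eventually: pollUntil, errStop and waitForReplicas are hypothetical names, and progress reports during interactive use are left out. It does accept a context, reports the last observed state together with the last error, takes a custom check function, and aborts early when the check says that the condition cannot be reached anymore.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errStop is a hypothetical marker error. A check wraps it when the
// condition cannot be reached anymore.
var errStop = errors.New("condition cannot be reached anymore")

// pollUntil calls get until check accepts the state, check aborts via
// errStop, or ctx is done. Failures include the last state and error.
func pollUntil(ctx context.Context, interval time.Duration,
	get func(context.Context) (int, error), check func(int) error) (int, error) {
	var last int
	var lastErr error
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if state, err := get(ctx); err != nil {
			lastErr = err
		} else {
			last = state
			if lastErr = check(state); lastErr == nil {
				return state, nil
			}
			if errors.Is(lastErr, errStop) {
				return last, fmt.Errorf("giving up, last state %d: %w", last, lastErr)
			}
		}
		select {
		case <-ctx.Done():
			return last, fmt.Errorf("%w, last state %d, last error: %v", ctx.Err(), last, lastErr)
		case <-ticker.C:
		}
	}
}

// waitForReplicas simulates an object that gains one ready replica per
// poll. If failAt is reached first, waiting is aborted (0 disables this).
func waitForReplicas(want, failAt int) (int, string) {
	ready := 0
	get := func(context.Context) (int, error) { ready++; return ready, nil }
	check := func(got int) error {
		if got == failAt {
			return fmt.Errorf("replica %d crashed: %w", got, errStop)
		}
		if got < want {
			return fmt.Errorf("only %d of %d replicas ready", got, want)
		}
		return nil
	}
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	got, err := pollUntil(ctx, time.Millisecond, get, check)
	if err != nil {
		return got, err.Error()
	}
	return got, ""
}

func main() {
	fmt.Println(waitForReplicas(3, 0))
	fmt.Println(waitForReplicas(3, 2))
}
```

The second call stops after two polls instead of waiting for the timeout, and its error text names both the last observed state and the reason for giving up.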
Using normal Gomega assertions in helper packages is problematic for two reasons:
- The stacktrace associated with the failure starts with the helper unless extra care is taken to pass in a stack offset.
- Additional explanations for a potential failure must be prepared beforehand and passed in.
The E2E framework therefore uses a different approach:
- framework.Gomega() offers similar functions as the gomega package, except that they return a normal error instead of failing the test.
- That error gets wrapped with fmt.Errorf("<explanation>: %w", err) to add additional information, just as in normal Go code.
- Wrapping the error (%w instead of %v) is important because then framework.ExpectNoError directly uses the error message as failure without additional boilerplate text. It also is able to log the stacktrace where the error occurred and not just where it was finally treated as a test failure.
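The general Go mechanics behind the %w versus %v distinction can be shown with the standard library alone. The sketch below does not show how the framework inspects errors; errTimeout and the helper functions are hypothetical. It only demonstrates that %w keeps the original error reachable for code that later examines the error, while %v keeps just its text.

```go
package main

import (
	"errors"
	"fmt"
)

// errTimeout is a hypothetical low-level error.
var errTimeout = errors.New("timed out waiting for pod")

// wrapped adds an explanation and keeps the original error in the chain.
func wrapped() error {
	return fmt.Errorf("starting test pod: %w", errTimeout)
}

// flattened adds the same explanation but only keeps the error text.
func flattened() error {
	return fmt.Errorf("starting test pod: %v", errTimeout)
}

// isTimeout reports whether the original error can still be found.
func isTimeout(err error) bool {
	return errors.Is(err, errTimeout)
}

func main() {
	fmt.Println(wrapped(), "/ original reachable:", isTimeout(wrapped()))
	fmt.Println(flattened(), "/ original reachable:", isTimeout(flattened()))
}
```

Both errors print the same text, but only the wrapped one still gives access to the original error and whatever extra information is attached to it.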
Tips for writing and debugging long-running tests
- Use ginkgo.By to record individual steps. Ginkgo will use that information when describing where a test timed out.
- Invoke the ginkgo CLI with --poll-progress-after=30s or some other suitable duration to be informed early why a test doesn’t complete and where it is stuck. A SIGINFO or SIGUSR1 signal can be sent to the CLI and/or e2e.test processes to trigger an immediate progress report.
- Use gomega.Eventually to wait for some condition. When it times out or gets stuck, the last failed assertion will be included in the report automatically. A good way to invoke it is:
    gomega.Eventually(ctx, func(ctx context.Context) (book Book, err error) {
        // Retrieve book from API server and return it.
        ...
    }).WithPolling(5 * time.Second).WithTimeout(30 * time.Second).
        Should(gomega.HaveField("Author.DOB.Year()", gomega.BeNumerically("<", 1900)))
  Avoid testing for some condition inside the callback and returning a boolean because then failure messages are not informative (see above). See https://github.com/kubernetes/kubernetes/pull/114640 for an example where gomega/gcustom was used to write assertions.
  Some of the E2E framework sub-packages have helper functions that wait for certain domain-specific conditions. Currently most of these functions don’t follow best practices (not using gomega.Eventually, error messages not very informative). Work is planned in that area, so beware that these APIs may change at some point.
- Use gomega.Consistently to ensure that some condition is true for a while. As with gomega.Eventually, make assertions about the value instead of checking the value with Go code and then asserting that the code returns true.
- Both gomega.Consistently and gomega.Eventually can be aborted early via gomega.StopTrying.
- Avoid polling with functions that don’t take a context (wait.Poll, wait.PollImmediate, wait.Until, …) and replace with their counterparts that do (wait.PollWithContext, wait.PollImmediateWithContext, wait.UntilWithContext, …) or even better, with gomega.Eventually.
Next steps
Using wait.Poll in E2E tests can be detected with forbidigo since import alias support was merged. In Kubernetes, that can be enabled in a golangci-lint invocation as soon as a configuration extension is merged. Another enhancement would be useful, but not absolutely required.
Because a lot of existing code wouldn’t pass such a check, it probably will only be enabled in the new stricter pull request linting initially. Converting individual sub-packages similar to test/e2e/framework/pod to match current best practices would be a good way for new contributors to get involved.
The SIG Testing’s Slack channel is a good place to start. At KubeCon EU 2023, the “Keeping the lights on and the bugs away” talk will cover some of the material of this blog post. Catch me there or meet me at the Intel booth to discuss this further!