Pipeline

Pipeline Dependencies

Because we have pipelines that should not run while other pipelines are running, we created a way to lock a pipeline outside of the continuous delivery server. For example, if we have these pipelines:

Build --------> Deploy --------> Test

When deploy is running, test should not run or the tests will fail because the site is deploying. When test is running, deploy should not run for the same reason.

To address this situation, we created a process that uses "lock files" to pause a pipeline whenever it detects a lock file. When deploy starts, its first task is to call the IsLocked target. The name of the lock file has to be known by both of the dependent pipelines; this can be codified in a property in the task script used to drive CD.

The IsLocked target checks all of the lock files configured for the pipeline. If a lock file does not exist, the target creates it and the pipeline continues. If the lock file exists, the target starts a timer and polls for the lock file periodically until the configured timeout period ends. If a poll does not find the lock file, the target creates it and the pipeline continues. If the lock file is still present at the end of the timeout period, the pipeline fails with a lock file timeout error.
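The check, poll, and timeout behavior described above can be sketched roughly as follows. This is a minimal illustration, not the actual IsLocked target: the names `acquire_lock`, `release_lock`, and `LockTimeoutError`, the default timeout and poll interval, and the single shared lock file are all assumptions made for the example. It also creates the file atomically instead of checking and then creating, so that two pipelines polling at the same moment cannot both take the lock.

```python
import os
import time


class LockTimeoutError(Exception):
    """Raised when the lock file is still present after the timeout period."""


def try_create_lock(path):
    """Atomically create the lock file; return False if it already exists."""
    try:
        # O_EXCL makes creation fail if another pipeline already holds the lock.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def acquire_lock(path, timeout_seconds=600.0, poll_seconds=5.0):
    """Create the lock file, polling until the timeout if it already exists.

    The default timeout and poll interval are placeholder values.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        if try_create_lock(path):
            return  # lock acquired, the pipeline can continue
        if time.monotonic() >= deadline:
            raise LockTimeoutError(f"lock file timeout: {path}")
        time.sleep(poll_seconds)


def release_lock(path):
    """Remove the lock file when the pipeline finishes (hypothetical helper)."""
    try:
        os.remove(path)
    except FileNotFoundError:
        pass  # already released
```

In this sketch a pipeline would call `acquire_lock` as its first task and `release_lock` as its last, including when the run fails; otherwise a stale lock file would block the dependent pipeline until its timeout.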

Pipeline Failure

Keeping pipelines green is a high priority. A broken build, deployment, test or other pipeline task should trigger an immediate response from the team. A failure prevents changes from being promoted to the next step in the delivery pipeline.

Monitoring

Dev Automation will have the primary responsibility of monitoring pipelines. Everyone on the team should subscribe to all pipeline failures for the application they are working on.

Notification

If you know that your change caused a failure or you feel you are in a position to work on a fix, you should notify the team that you will be working on the fix.

If you have insight into the cause of a failure, you should notify the team with your insight.

Types of Failure

Build

Development will have primary responsibility for fixing build failures. When a build failure occurs, it must be fixed within 15 minutes or the commit that caused the failure will be rolled back by Dev Automation.

Test Failure

Test Script Defect

Dev Automation will have primary responsibility for fixing test failures due to test script defects. When a test failure is caused by a test script defect, the defect should be treated as a production incident and fixed as soon as possible. If the test cannot be fixed within 15 minutes, the commit that caused the failure will be rolled back.

Application Defect

Development has the primary responsibility for fixing test failures due to application defects. When a test failure is caused by an application defect:

  • Verify and document the failure by executing and recording manual test steps.

  • Discuss the failure with the team to verify it as a defect.

  • If it is a defect,

    • Create a defect ticket and assign to business.

    • Tag the test with the defect ticket number.

    • The failing test can only be ignored or removed if

      • The tested functionality is removed from the application.

      • There are new feature specifications that negate the test.

      • The team agrees on ignoring the test.

Indeterminate Reason (Flaky Test)

Dev Automation has the primary responsibility for fixing test failures due to flaky tests. When a test failure has an indeterminate cause (e.g. server down, timing issue, inconsistent test data…) we call the test flaky. That is to say, the test fails and then passes for no apparent reason; this random passing is called a flake. When we suspect a flaky test:

  • Verify the feature functionality by executing and recording manual test steps.

  • Identify the source and frequency of occurrence of the failure.

  • If it is flaky,

    • Create a defect ticket and assign to Dev Automation.

    • The failing test can be ignored or removed only after discussion with the team. This ensures that we do not quarantine high priority tests without agreement from the team.

    • Tag the test as Flaky.

Flaky tests need special care since they cause a failure that is not related to changes in the application. It is important to keep test pipelines green to keep confidence in the tests.

We are working on functionality in the test framework to automatically handle flaky tests. Until that is complete, we will have to manually identify and tag flaky tests. Tagging a flaky test quarantines it: normal test runs ignore the tagged tests, and a separate pipeline is set up to run them. It is also important to add flaky tests to the manual regression plan, since we won’t have confidence in running the tests in the pipeline until the flakiness is fixed.

If a test has been quarantined for 30 days, the flaky test ticket should be moved to medium priority. After 60 days, it should be moved to high priority. High priority tickets must be mitigated within 30 days. Mitigation means resolving the flake by fixing the test, fixing the application, adjusting the specifications, or removing the test.
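The escalation thresholds can be expressed as a small function, for example to drive a report of quarantined tests. This is only a sketch: the function name `escalated_priority` is hypothetical, and it returns `None` below 30 days because the ticket's starting priority is not specified here.

```python
from datetime import date
from typing import Optional


def escalated_priority(quarantined_on: date, today: date) -> Optional[str]:
    """Return the priority a flaky test ticket should be escalated to.

    Returns None when the test has been quarantined for fewer than 30 days,
    meaning the ticket keeps whatever priority it already has.
    """
    days_quarantined = (today - quarantined_on).days
    if days_quarantined >= 60:
        return "high"
    if days_quarantined >= 30:
        return "medium"
    return None
```

For instance, `escalated_priority(date(2024, 1, 1), date(2024, 2, 15))` covers 45 days of quarantine and returns `"medium"`.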

http://www.thoughtworks.com/insights/blog/no-more-flaky-tests-go-team
https://wiki.jenkins-ci.org/display/JENKINS/Flaky+Test+Handler+Plugin
http://martinfowler.com/articles/nonDeterminism.html

To automatically handle flakes, we will attempt to rerun failed tests a configurable number of times. If a test passes in a rerun, it will be marked as flaky. If every test that failed in the initial run passes in a rerun, the test job will be marked as passed and the flaky tests will be reported.
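The rerun logic might look something like the sketch below. It is an illustration under stated assumptions, not the planned test framework feature: tests are modeled as plain callables that return True on success, and `run_with_reruns` and `max_reruns` are hypothetical names.

```python
from typing import Callable, Dict, List, Tuple


def run_with_reruns(
    tests: Dict[str, Callable[[], bool]],
    max_reruns: int = 2,
) -> Tuple[bool, List[str], List[str]]:
    """Run each test once, then rerun failures up to max_reruns times.

    Returns (job_passed, flaky_tests, failed_tests). A test that fails and
    then passes in a rerun is reported as flaky. The job passes only if no
    test is still failing after its reruns.
    """
    flaky: List[str] = []
    failed: List[str] = []
    for name, test in tests.items():
        if test():
            continue  # passed on the first run
        for _ in range(max_reruns):
            if test():
                flaky.append(name)  # passed in a rerun: mark as flaky
                break
        else:
            failed.append(name)  # never passed: a genuine failure
    return (not failed, flaky, failed)
```

Reporting the flaky list even when the job passes matters here, because a green job would otherwise hide the tests that still need a ticket.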

When testing against doubles we can use a technique described by Martin Fowler to verify that the double and the live system are in sync based on a contract. If the double uses the same API contract that the live system does, then running tests against the double and the live system and comparing the results can give some confidence that the double is an adequate stand in for tests. This type of testing would be done outside of the delivery pipeline and run on a periodic basis.
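As a rough sketch of that comparison, the same requests can be sent to both the double and the live system and any differing results collected. Everything named here is hypothetical: `find_contract_drift`, `call_double`, and `call_live` stand in for whatever clients the team actually uses, and comparing whole responses for equality is a simplifying assumption (a real check would more likely compare only the fields the contract defines).

```python
from typing import Any, Callable, Dict, Iterable, List


def find_contract_drift(
    requests: Iterable[Any],
    call_double: Callable[[Any], Any],
    call_live: Callable[[Any], Any],
) -> List[Dict[str, Any]]:
    """Send each request to the double and the live system and collect mismatches."""
    drift: List[Dict[str, Any]] = []
    for request in requests:
        double_result = call_double(request)
        live_result = call_live(request)
        if double_result != live_result:
            drift.append(
                {"request": request, "double": double_result, "live": live_result}
            )
    return drift
```

An empty result gives some confidence that the double still matches the live system for the sampled requests; a non-empty result points at the requests where the double needs updating.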
