Continuous integration/Architecture/Troubleshooting
The continuous integration infrastructure is a machine with many moving parts. As with any machine of sufficient complexity, sometimes things go wrong. This page amalgamates information on troubleshooting the continuous integration infrastructure so that when things go wrong there is one unified place to go.
Ideally, any troubleshooting tips, tricks, or diagnostic measures should be included in this page. Effectively, this page is a CI Troubleshooting Cheatsheet.
This page is divided into sections by infrastructure piece: where each piece lives, the known issues with it, and first-step solutions to common problems.
Jenkins
- gallium.wikimedia.org
Common Issues
Stuck Nodes
From time to time, Jenkins nodes will get stuck waiting for executors even though nothing is running on those machines.
There are two possible causes: a Jenkins executor lock or a Gearman deadlock.
Jenkins executor lock
- Take the node offline in Jenkins
https://integration.wikimedia.org/ci/computer/[node]/markOffline
- Kill any Jenkins jobs running on the node via the Jenkins UI
- Kill all pending jobs in the Jenkins queue that are "waiting on executors"
- Disconnect the node
https://integration.wikimedia.org/ci/computer/[node]/disconnect
- Bring node back online (button labeled "Bring this node back online")
- Launch slave agent (there's a button that says this)
- Check agent log to see that it connected
https://integration.wikimedia.org/ci/computer/[node]/log
Sometimes you have to go through this whole dance several times before Jenkins realizes there are executors it can use.
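If you prefer to script the cycle instead of clicking through the UI, the sketch below drives the HTTP endpoints behind those buttons. Treat it as an assumption-laden sketch, not a documented procedure: the endpoint names (toggleOffline, doDisconnect, launchSlaveAgent) are the stock Jenkins ones, the node name and credentials are placeholders, killing the running and queued jobs still has to happen separately, and depending on the Jenkins version and its CSRF settings you may also need a crumb header.
# Hedged sketch: the same offline/disconnect/online dance via Jenkins' HTTP API.
# NODE and AUTH are placeholders; endpoint names assume stock Jenkins.
JENKINS=https://integration.wikimedia.org/ci
NODE=some-integration-slave        # hypothetical node name
AUTH=user:apitoken                 # your Jenkins username and API token
curl -u "$AUTH" -X POST "$JENKINS/computer/$NODE/toggleOffline?offlineMessage=unsticking+executors"   # take offline
curl -u "$AUTH" -X POST "$JENKINS/computer/$NODE/doDisconnect?offlineMessage=unsticking+executors"    # disconnect
curl -u "$AUTH" -X POST "$JENKINS/computer/$NODE/toggleOffline"                                       # bring back online
curl -u "$AUTH" -X POST "$JENKINS/computer/$NODE/launchSlaveAgent"                                    # relaunch the agent
# then check $JENKINS/computer/$NODE/log to see that the agent reconnected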
Gearman Deadlock
- Go to https://integration.wikimedia.org/ci/manage
- Search page for "Enable Gearman"
- Un-check the checkbox
- Save
- Wait 30s
- Check the "Enable Gearman" checkbox
- Save
This second method may interrupt communication between running Jenkins jobs and Zuul, but it seems to work even when the offline/online method fails to clear the deadlock.
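To confirm you are actually looking at a Gearman deadlock before toggling the plugin, the Gearman server speaks a plain-text admin protocol. The check below is a sketch that assumes the Gearman server Zuul uses is listening on the default port 4730 on gallium; adjust the host and port if the setup differs.
# Hedged sketch: dump Gearman function/queue status (assumes port 4730 on gallium).
echo status | nc -w 2 gallium.wikimedia.org 4730
# Output columns: function name, total jobs, currently running jobs, available workers.
# Many queued build functions with zero available workers is the classic deadlock picture.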
don'ts
- Don't restart gallium.wikimedia.org: fsck takes 2 hours to run, during which time CI will be down.
Zuul
- gallium.wikimedia.org
- Dashboard
Common Issues
Zuul is unresponsive
Zuul schedules jobs for Jenkins through Gearman. Rarely, Zuul will fail with an exception and stop scheduling jobs on Jenkins entirely. The only fix is to restart Zuul completely.
Restarting Zuul drops every job currently in the queue. Everyone who submitted or +2'd a patch will have to recheck or resubmit, so be sure that Zuul is really stuck before pulling the trigger.
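Before pulling that trigger, it is worth confirming that the scheduler really did die with an exception. A rough check (the log path is the one used in the restart steps below; the grep pattern is only a guess at what a Python stack trace or error entry looks like):
# Hedged sketch: look for a recent unhandled exception in the scheduler log.
ssh gallium
tail -n 500 /var/log/zuul/zuul.log | grep -E -A 10 'Traceback|ERROR'
If the log confirms Zuul is wedged, restart it: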
ssh gallium
sudo -su zuul
/etc/init.d/zuul stop
/etc/init.d/zuul start
tail -n100 /var/log/zuul/zuul.log
don'ts
- Don't restart Zuul when deploying a configuration change (see the reload sketch below).
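Configuration changes should be picked up with a reload rather than a restart, which keeps the queues intact. The following is a sketch only: it assumes the packaged init script exposes a reload action or, failing that, that the scheduler reconfigures on SIGHUP as upstream Zuul v2 does; the pid file path is likewise an assumption.
# Hedged sketch: reload the Zuul layout without dropping queued jobs.
ssh gallium
sudo /etc/init.d/zuul reload
# fallback (assumption): send SIGHUP to the scheduler process
# sudo kill -HUP $(cat /var/run/zuul/zuul.pid)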
Nodepool
- labnodepool1001.eqiad.wmnet
- Node pool administration
Common Issues
Instance deletion breaks
This prevents new Nodepool jobs from running, as there are no instances available to run the builds. It is easy to diagnose:
ssh labnodepool1001.eqiad.wmnet
nodepool list
If there are many instances all marked delete in the State column, instance deletion may be broken. Also, sudo service nodepool status will probably show timeouts waiting for image deletion.
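A quick way to see how bad it is (a sketch; it assumes the word "delete" only shows up in the State column of the table output):
# Hedged sketch: count instances stuck in the delete state.
nodepool list | grep -c ' delete '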
The common fix is to find someone in #wikimedia-labs to restart RabbitMQ. When this happens, instance creation via Horizon/Wikitech is broken as well as Nodepool instance deletion. To verify a fix, try:
nodepool delete --now [instance-id-marked-for-deletion]
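If a single deletion goes through, the sketch below retries deletion for every instance still stuck in the delete state. The awk field number is an assumption about where the ID column sits in this Nodepool version's table output, so eyeball nodepool list first and adjust.
# Hedged sketch: retry --now deletion for everything stuck in the delete state.
# Field $2 is assumed to be the ID column of nodepool list's table output.
nodepool list | awk -F'|' '/ delete /{gsub(/ /, "", $2); print $2}' | while read -r id; do
  nodepool delete --now "$id"
done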
don'ts
¯\_(ツ)_/¯