Wikimedia Release Engineering Team/Checkin archive/2024-08-28
2024-08-28
Agenda
- Wins/anti-wins
- Important dates
- Train
- FYI, Scap: require_tty_multiplexer support
- Post-work investigation: docker-hub mirror
- DISCUSS: GitLab private repo for PrivateSettings: https://phabricator.wikimedia.org/T355026
- DISCUSS: Reggie image availability documentation: https://phabricator.wikimedia.org/T324361#10079303
Wins/winterrogation
- https://www.mediawiki.org/wiki/Wikimedia_Release_Engineering_Team/Monthly_notable_accomplishments
- August 2024
- Fixed issue with deployment-deploy04 free space. Added a 40GB volume and copied /srv to it.
- Buildkit 0.15.1 release deployed
- Helped data-engineering Airflow DAGs with their GitLab CI.
- Rewrote remainder of make-container-image stuff in Python: https://gitlab.wikimedia.org/repos/releng/release/-/merge_requests/99 \o/
- Scap invokes this repo during deployment
- Current status: Create a php7.4 image + debugging packages
- Future: php7.4 + php8.1
- Single version images: there's a change in mw-config to override wikiversions.json
- Updated train-dev to use debian:11 base image.
- Kicked off nomination process to reboot the Toolforge standards committee (https://phabricator.wikimedia.org/T370474)
- Moved a tool to toolforge build service
- Fixed links in patchdemo for catalyst wikis
- Merged persistence for k8s patchdemo
- Added read-only flag for patchdemo
- Fixed a Phab code bug not checking user permissions creating a form
- Merged more Phorge upstream stuff to get bugfixes + features once we pull, e.g. logging errors for broken Herald rules. See some stuff as deps: https://phabricator.wikimedia.org/T370266 (when downstream tasks exist)
- Played with checking for active Phab accounts linked to locked WMF SUL accounts (TODO: other way round)
- Started working on a Kubernetes cluster for deployment-prep using OpenTofu and Magnum as provisioning tools. Lots of things to figure out still, but a proof of concept cluster was provisioned, destroyed, and provisioned again. https://gitlab.wikimedia.org/bd808/deployment-prep-opentofu + deploymentpreps3
- Scap deploy with rewrite of build-image script
- Merged catalyst/patchdemo environment redirects
- Repos under https://gitlab.wikimedia.org/toolforge-repos/ are now indexed by codesearch as part of the "wmcs" collection. https://codesearch.wmcloud.org/wmcs/?q=mwclient
- Images built by Kokkuri on DO runners are now usable from WMCS runners. This is being used in the tech spike on creating a deployment-prep Kubernetes cluster using Magnum to build and then run an image containing OpenTofu and other tools needed for gitops automation of the process. https://gitlab.wikimedia.org/bd808/deployment-prep-opentofu/-/blob/main/.gitlab-ci.yml
- 8 folks have been nominated to reboot the membership of the Toolforge Standards Committee. Bryan will be working in the coming weeks to get them all vetted by the Toolforge admins and to facilitate them signing an NDA with the Foundation. https://wikitech.wikimedia.org/wiki/Help_talk:Toolforge/Toolforge_standards_committee#August_2024_committee_nominations
- train dev fixin'
- Fixed mw-web deployment in train-dev: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1060464
- https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1064124 mw-debug/mw-web: Reduce CPU requests/limits for train-dev
- Renewed the DEPLOY_TOKEN in gitlab-cloud-runners project.
- System needed other than Ahmon's reminders :)
- Upgraded buildkitd to 0.15.2 in all the places.
- https://gitlab.wikimedia.org/repos/releng/buildkit/-/merge_requests/69 (README.md: Initial notes on handling new releases)
- Wrapped up build-images stuff w/ help from Jeena and Scott French
- T372921: scap deploy blank checks bug fixed.
- https://phabricator.wikimedia.org/T361724 scap should check if it is running within a tmux/screen
- Better remote build context support in Kokkuri
- A handy `.kokkuri:remote-context` mixin
- Kokkuri can now resolve the frontend ref ("syntax" line in .pipeline/blubber.yaml) from a remote build context (via the GitLab API)
- New releases-jenkins job to cut wmf/next is ready \o/
- Played with upstream Phorge doc tool (Diviner), wrote a dozen upstream patches to fix 404s of methods in search results, PHP 8 exceptions (unit tests for phorge :(( ), some PhpDoc cleanup, etc.
- Calm train last week
Vacations/Important dates
- https://office.wikimedia.org/wiki/HR_Corner/Holiday_List#2024
- https://wikitech.wikimedia.org/wiki/Deployments/Yearly_calendar
- https://www.mediawiki.org/wiki/Wikimedia_Release_Engineering_Team/Time_off
- Aug 02: Bryan
- Aug 05-08: Dan
- Fri 09 Aug - Global holiday: International Day of the World's Indigenous Peoples
- Aug 12: Dan
- Mon 12 Aug - Fri 16 Aug: Ahmon out
- Mon 12 Aug - Fri 23 Aug: Antoine
- Aug 16: Bryan
- Aug 23: Bryan
- Aug 23: Jaime
- Sat 24 Aug - 03 Sep: Brennen
- Aug 30: Bryan
- Sept 02: US Labor day (WMF US holiday)
- Sept 06: Bryan
- Sept 13: Bryan
- Sept 18-19: Brennen Winfield
- Sept 19-20, 23: Bryan Riot Fest in Chicago!
- Sept 23-27: (likely) Andre
- Sept 24: Dancy
- Sept 27: Bryan
- Sept 9-30: Jeena
- Sept 27, 30-Oct 02: Dan
- Oct 03-06: WikiCon North America (Indianapolis)
- Oct 6: Dancy
- Oct 1-11: Jeena
- Oct 14: Indigenous Peoples' Day (also Columbus Day) US Staff w/reqs
- Oct 28 - Nov 01 maybe maybe: Andre
Future
Train
- https://versions.toolforge.org/
- https://train-blockers.toolforge.org/
- https://wikitech.wikimedia.org/wiki/Deployments/Yearly_calendar
Rotation
- 05 Aug (05-09) - 1.43.0-wmf.17 - Jaime + Brennen (Dan out, Global holiday Friday)
- 12 Aug (12-16) - 1.43.0-wmf.18 - Jeena + Jaime (Ahmon out, Antoine out)
- 19 Aug (19-23) - 1.43.0-wmf.19 - Andre + Jeena (Antoine out)
- 26 Aug (26-30) - 1.43.0-wmf.20 - Antoine + Andre (Brennen out)
- 02 Sep (02-06) - 1.43.0-wmf.21 - Ahmon + Antoine (US holiday Monday, Brennen out Tues)
- 09 Sep (09-13) - 1.43.0-wmf.22 - Dan + Ahmon
- 16 Sep (16-20) - 1.43.0-wmf.23 - Jaime + Dan (Brennen out)
- 23 Sep (23-27) - 1.43.0-wmf.24 - Brennen + Jaime (Andre likely out)
Team Discussions
- Train issues
- Pre-sync failed: scap stage train fails
- Followed by: scap train clean is working
- Then run: scap train
- Problems:
- (seems fine) files owned by hashar vs mwpresync.
- rebuild-localisation cache runs as www-data which could not run: [x] TODO https://phabricator.wikimedia.org/T373425 (cf: https://wikitech.wikimedia.org/wiki/UID#Permission/security_hierarchy)
- mwpresync problems
- Problems:
- Next week: post-work investigation
Post-work investigation: docker-hub mirror
- Reported: 2024-08-15 14:40 UTC to Resolved: 2024-08-15 19:40 UTC (5 hrs)
- Task: https://phabricator.wikimedia.org/T372568
- Recap:
- docker hub mirror persistent volume filled
- Jaime and Dan ran terraform in https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner to replace the docker hub mirror resource in terraform.
- Now documented how to add flags to a terraform run to force unlock (and we can use this for other issues): https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner#how-to-force-unlock-terraform-state
- Terraform problems
- Terraform knew the release existed via helm, but didn't know it needed to do anything to fix it
- Told terraform to force-replace a particular resource: https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner#how-to-force-replacement-of-resources
- Persistent volume _still_ in terminating state: why? Unclear.
- Can also remove locks via: https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/terraform
- What went well?
- Pipeline supports extra arguments to plan and apply
- What can we improve?
- Followup: edit readme to show how to edit the terraform state via https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/terraform
- Followup: edit https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner#how-to-increase-the-size-of-the-buildkitd-volumes to have a more general resizing
- [x] TODO: File a task for investigating PVC terminating: Check PVC cleaner logs for terminating bound PVC
- [x] TODO: File a task for alerting for disk space issues
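As a rough illustration of the disk space alerting TODO above, the check itself could look like the sketch below. This is not an existing alert in any WMF repository: the function names, the 85% threshold, and the path are all placeholders chosen for the example.

```python
import shutil


def usage_fraction(path):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some pseudo-filesystems report a zero size; treat them as empty.
        return 0.0
    return usage.used / usage.total


def needs_alert(fraction_used, threshold=0.85):
    """True when usage meets or exceeds the (hypothetical) alert threshold."""
    return fraction_used >= threshold


if __name__ == "__main__":
    # Placeholder path: the mirror's real volume mount is not given in these notes.
    path = "/"
    fraction = usage_fraction(path)
    state = "ALERT" if needs_alert(fraction) else "ok"
    print(f"{state}: {path} is {fraction:.0%} full")
```

Alerting well before a volume is full leaves time to resize it, rather than having to replace the resource after it fills.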
Scap & terminal multiplexers
- Followup work from incident https://phabricator.wikimedia.org/T361724
- Requires folks to use screen or tmux for interactive commands
- Documentation needed and questions?
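For readers unfamiliar with how such a requirement can be detected, here is a minimal sketch. It is not Scap's actual implementation. It relies on the fact that tmux exports `TMUX` and GNU screen exports `STY` to the processes they start; the function name `inside_multiplexer` is hypothetical.

```python
import os


def inside_multiplexer(environ=None):
    """Return "tmux", "screen", or None based on environment variables.

    Assumption: tmux exports TMUX and GNU screen exports STY to child
    processes. This is a heuristic sketch, not Scap's own check.
    """
    env = os.environ if environ is None else environ
    if env.get("TMUX"):
        return "tmux"
    if env.get("STY"):
        return "screen"
    return None


if __name__ == "__main__":
    found = inside_multiplexer()
    if found is None:
        print("Not inside tmux or screen; an interactive command could be refused here.")
    else:
        print(f"Running inside {found}.")
```

An environment check like this is only a heuristic: the variables can be set or unset by hand, so it guards against accidents, not deliberate bypasses.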
Private repo 4 PrivateSettings
- /srv/mediawiki/private but in GitLab
Reggie and Image availability
- What docs are needed?
Decision making on Gerrit +2 requests for random non-deployed repos
How do we (?) handle requests like https://phabricator.wikimedia.org/T372073? Or do we not, and someone else is supposed to?
Issue trackers
Andre wonders which teams/projects we have allowed to use Issues in GitLab, and how to find that out.
Open source/Upstream contributions