Wikimedia Release Engineering Team/Offsites/2018-05-Barcelona/Notes
These are the raw notes from our 2 days of offsite discussions.
Summary of action items
Data Data Data
- Talk with Analytics - JR
- Talk with CE/Bitergia - JR
- Explore Bitergia - JR
- Identify data sources we want to collect - RelEng (who know what systems)
- Erik Bernhardson / Guillaume Lederrey
SWATs/Trains
- Tyler to reassess scap swat in mw-config from Mukunda
- Look into parsing scap messages for known patterns and pulling out the data
- Look into enabling scap start/done
- Look into recording if mwdebug was used during the deploy (eg: 'scap stage')
- H/Now will we get time for this?
- Have Mukunda do a couple weeks of SWATs
- Mukunda has a lot to say about this subject.... writeup incoming
Staging
- Greg to talk with Deb about what to do next with talking to Victoria
- Greg to figure out how we can better market what we are accomplishing (eg "monthly showcase")
- Get a k8s cluster from SRE for CI to deploy to.
Data Data Data
Lead: Jean-René
- Data for code-stewardship reviews (historic data)
- Commits & patch sets
- Jenkins & CI: test results are discarded after 15 or 30 days
- Where can I put new kind of data/metrics. Is there a shared environment to store them?
- jr: for example, talking to exploratory testers. We have no idea about the result of their work. It is hard to get new QA testers on board. The role is broad, but a sure thing is they will either produce or consume testing data.
- We have lots of data and dashboards, but no long-term statistics
- antoine: raita was the dashboard (but it has been decommissioned)
- Historic dashboard for metrics and data
- Dan: targeted towards browser-tests
- Hypothetical Entity Relationship (ER) diagram
- Patchsets relate to deployments
- Deployments relate to outages
- Relationships in a tree format
- Relationships between gerrit change and phabricator tasks
- Developer/maintainers page. For an extension/skin JR would like to:
- Activity (commits and changes)
- Outstanding tasks
- How well it follows the latest MediaWiki standards (e.g. extension.json, versions of linters, test coverage)
- Tests that are running:
- How frequent are errors
- How many tests are failing
- Average time to resolve a failed test (E2E or unit tests failing on an unrelated change because core changed months ago and the extension is barely active)
- The pace of changes being merged
- Extension status: alpha, maintenance, Wikimedia-deployed, obsolete. That is mostly on mediawiki.org (partly in CI config as "archived")
- Overview of stewardship
- github pulse ( https://github.com/wikimedia/mediawiki/pulse ) -- do we want that?
- Human process oriented vs repository oriented (merges vs task closing)
- time to resolution (TTR) for tasks (filed to resolved/declined/whatever)
- but this is only meaningful for "bugs" not other planning type tasks
- what are the systems that we have, how do we normalize the data for those systems, where do we put it?
- A consistent interface for retrieving data
- We need to keep all the data that we can -- get data outside of jenkins (for example we could send that data to elasticsearch, but currently this is locked-up in jenkins)
- We have an agreement that we'd like to collect all the test data...somewhere somehow
- RelEng is the best place for this data
- Do we set this up? Or do we work with other teams to do this?
- Proposal: prepare for a 20-minute session with the Analytics team at the hackathon
- A system: https://wikimedia.biterg.io/app/kibana#/dashboard/Overview (see also https://www.mediawiki.org/wiki/Community_metrics )
- Stewardship creates these open questions, useful for annual planning as well
- Going through, system-by-system, and finding out what data we want to store
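The hypothetical ER diagram and the TTR metric discussed above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class names, field names, and the "bug"/"planning" labels are invented here and do not reflect any existing schema. It models patchsets relating to tasks, deployments relating to patchsets, and outages relating to deployments, and computes a median TTR that only counts bugs, as the notes suggest.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Task:
    task_id: str                       # Phabricator-style id, e.g. "T456"
    kind: str                          # hypothetical label: "bug", "planning", ...
    filed: datetime
    closed: Optional[datetime] = None  # resolved/declined/whatever


@dataclass
class Patchset:
    change_id: str                     # Gerrit change identifier
    tasks: list[Task] = field(default_factory=list)


@dataclass
class Deployment:
    when: datetime
    patchsets: list[Patchset] = field(default_factory=list)


@dataclass
class Outage:
    started: datetime
    deployments: list[Deployment] = field(default_factory=list)


def median_ttr(tasks: list[Task]) -> Optional[timedelta]:
    """Median time to resolution, counting only closed tasks of kind 'bug'."""
    durations = [
        t.closed - t.filed
        for t in tasks
        if t.kind == "bug" and t.closed is not None
    ]
    return median(durations) if durations else None
```

Storing the relationships as plain references keeps the tree format from the notes (outage -> deployments -> patchsets -> tasks) easy to walk, whatever storage ends up being chosen.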
Open Questions
- Is our current analytics stack open for use by others in open ended ways?
- Example: https://pivot.wikimedia.org/ for page view/requests ( upstream: https://imply.io ). Lets one easily build whatever graph by country/browser etc
- Analytics: Can we start dumping various data sources into a place and figure out how we're going to view/make sense of it later?
- How can we interact with Bitergia to extend the data sources and views (poke Quim/Andre)
- identify reviewers/maintainers: https://www.mediawiki.org/wiki/Git/Reviewers | https://www.mediawiki.org/wiki/Developers/Maintainers
Next Steps:
- Talk with Analytics - JR
- Talk with CE/Bitergia - JR
- Explore Bitergia - JR
- Identify data sources we want to collect - RelEng (who know what systems)
- Erik Bernhardson / Guillaume Lederrey
SWATs/Trains
Lead: Tyler
- Automating/improving logging of SWATs and Trains - https://phabricator.wikimedia.org/T193311 :
- It would be nice to have concrete data about SWAT windows without having to dig in the SAL. Some nice-to-have info: number of syncs per SWAT window and time spent deploying patches for a given SWAT window.
- Problem: We've wanted to change SWAT windows/deploys. People hated that we wanted to change things (namely: reduce # of patchsets deployed and how they are done). We need data to make informed decisions. eg: correlating syncs with swats and outages.
- Definition: SWAT is three 1-hour windows per day for developers to propose hotfixes/config changes. Served by releng / deployment group users.
- now we have syncs and we have windows, and their only relation is through the wiki pages
- out of scope:
- relating patches -> swat window
- proposing patches in a window
- Zeljko: we are just pushing buttons. We do not have much added value
- NEEDs:
- Given a time window, get the list of syncs / patchsets deployed (and ultimately a developer / point of contact)
- we need the data
- a place to display/query it
- Minimal Viable Solution
- Have scap ask "is this a SWAT? y/n" each time it's not a full scap or --force
- This Deployment did this Change associated with this Task.
- what about...
- scap swat start (or: `scap swat` starts a shell)
- (query wiki page, list changes, etc)
- scap swat done
- See: "scap swat" patch from Mukunda
- ( https://gerrit.wikimedia.org/r/#/c/306259/ / https://phabricator.wikimedia.org/T142880 ). Demo: https://asciinema.org/a/1x54kw77tvatxiqv45ba6ael7
- current documentation https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#Full_deployment
- current command: scap sync-file path/to/file 'SWAT: Commit message (T456)'
- if the comment is not in this format, scap asks you swat/gerrit/phabricator
- do not allow deploys without first indicating what window you're starting
- scap swat start or scap deploy start (or --force)
- that informs scap how to act/log
- mw-config.php
- assume as soon as it's merged it's deployed
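As a rough sketch of the "given a time window, get the list of syncs" need listed above, assuming sync records have already been collected somewhere with a timestamp, a deployer, and a log message: the `Sync` record and both helper functions are hypothetical, not part of scap. "Time spent deploying" is approximated here as the span between the first and last sync in the window, which is an assumption rather than an agreed definition.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Sync:
    when: datetime   # time the sync was logged
    deployer: str    # point of contact
    message: str     # log message passed to scap


def syncs_in_window(syncs, start, length=timedelta(hours=1)):
    """Return the syncs logged in [start, start + length)."""
    end = start + length
    return [s for s in syncs if start <= s.when < end]


def window_summary(syncs, start, length=timedelta(hours=1)):
    """Number of syncs, deployers, and first-to-last sync span for a window."""
    hits = sorted(syncs_in_window(syncs, start, length), key=lambda s: s.when)
    span = hits[-1].when - hits[0].when if hits else timedelta(0)
    return {
        "syncs": len(hits),
        "deployers": sorted({s.deployer for s in hits}),
        "first_to_last": span,
    }
```

The default window length of one hour matches the SWAT definition in these notes; a train window would pass a different `length`.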
TODO
- Tyler to reassess scap swat in mw-config from Mukunda
- Look into parsing scap messages for known patterns and pulling out the data
- Look into enabling scap start/done
- Look into recording if mwdebug was used during the deploy (eg: 'scap stage')
- H/Now will we get time for this?
- Have Mukunda do a couple weeks of SWATs
- Mukunda has a lot to say about this subject.... writeup incoming
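The "parsing scap messages for known patterns" item might start from something like the following sketch. It only assumes the message format quoted earlier in these notes ('SWAT: Commit message (T456)'); the regular expression and the `parse_sync_message` helper are hypothetical and would need checking against real SAL entries.

```python
import re
from typing import Optional

# Matches messages shaped like: SWAT: Commit message (T456)
SWAT_MESSAGE = re.compile(
    r"^SWAT:\s*(?P<summary>.*?)\s*\((?P<task>T\d+)\)\s*$"
)


def parse_sync_message(message: str) -> Optional[dict]:
    """Extract the summary and task id, or None if the format is not followed."""
    match = SWAT_MESSAGE.match(message)
    if match is None:
        return None
    return {"summary": match.group("summary"), "task": match.group("task")}
```

Messages that return `None` are the ones where, per the discussion above, scap would have to ask the deployer for the missing swat/gerrit/phabricator information.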
Staging
https://docs.google.com/document/d/1CT_pKjwiDmFhZZ9LW9mz0z434-wgr3NFdapUPWUvMNA/edit?ts=5aba5398#heading=h.ra4sbg2fs7zl
2018-2019 annual plan: https://www.mediawiki.org/wiki/Wikimedia_Technology/Annual_Plans/FY2019
Lead: Greg
- The presentation
- The project as defined by operations is incomplete
- The response to Victoria
- We are here due to the initial issue of a choice between doing the Pipeline project vs a Staging project. That either/or is now a both/and.
- Operations wants an environment that can potentially prevent outages depending on how they define it. It could potentially prevent outages of services that we don't control nor deploy.
- We are making a survey to gather the current usage of the Beta Cluster that can help inform SRE's decisions/planning.
- We have defined use cases
- The other questions are best answered by SRE as they heavily depend on technical implementation decisions
- Protocol changes as proposed are out of scope for this discussion and truthfully feel like reach-through micromanagement without any real data or reasoning.
What RelEng needs:
- Just to continue our positive interaction with SRE in our weekly Pipeline meetings
- A simple part of that is for SRE to provide a k8s cluster and/or namespace for CI to deploy to (as previously discussed and agreed upon)
- Idea (Dan) rebrand "deployment pipeline" project to "Continuous Delivery of MediaWiki Stack"
NEXT:
- Greg to talk with Deb about what to do next with talking to Victoria
- Greg to figure out how we can better market what we are accomplishing (eg "monthly showcase")
- Get a k8s cluster from SRE for CI to deploy to.
Developer Productivity JD
Lead: Greg
Blog post: https://squiggle.city/~frencil/archives/20150625.html#anatomy_of_a_healthy_job_post
You will be leading the effort to improve overall developer productivity. We will want you to create a replacement for our homebuilt Vagrant-based local development environment using the latest technologies such as Kubernetes (minikube), Docker, and Helm. You will be working closely with several teams and volunteers in the community.
Responsibilities
- Help engineer container based tooling for MediaWiki application development and deployment
- Maintain integration of developer tooling into a continuous delivery pipeline
- Proactively find and create productivity improvements
- Working in a highly collaborative and open organization and community
Requirements
- Proficiency with software, systems, or devops engineering
- Collaboration skills are as important as technical skills, if not more so
- Experience with continuous integration/deployment systems
- Experience with virtualization or container technologies
- Experience with server configuration management software
Nice to haves
- Free Software experience
- Experience working in a remote-first organization
- Experience using a Kubernetes environment
- MediaWiki and/or Wikimedia project experience
- Golang experience
Moving to an "everyone deploys their own changes" model (for SWAT)
- Why are SWATs scheduled?
- Why are there only a limited number of people in-charge of doing them?
Z: Would like everyone who is already staff or a contractor to be able to do their own deploys.
Z: A lot of European SWAT users now self-deploy (eg: Amir, David Causse).
- Turn SWATs into "volunteer patch deployment" windows. If you are staff/contractor, you deploy your own thing when you need to do it.
Pipeline Demo
Lead: Dan/Tyler
https://integration.wikimedia.org/ci/job/service-pipeline-test-only-debug
Job using Jenkins Pipeline, defined in Groovy.
- Presentation of Blubber and pipeline
- What is minikube
Blubber and MediaWiki + extensions
- We use docker-pkg w/ Quibble and Blubber in the pipeline. Is problem? No. Not really.
- Use of docker-pkg is appropriate in domains that require/allow full control of Dockerfile and image build (root)
- Base images are controlled by SRE (operations/docker-images/production-images)
- CI images for use with Quibble are controlled by RelEng (integration/config)
- Talked about whether we should use Quibble as entrypoint in pipeline testing. Should we? No. Probably not.
- Different use case. Quibble depends on environment that has superset of MW+ext dependencies. Blubber is meant to be repo-authoritative.
- EVERYTHING IS GREAT, AGAIN.
- What does a Blubberized MediaWiki look like? For limited scope of FY1718Q4 goal ((MediaWiki + Math) + Mathoid)? For far future?
- Discussion about how to deal with Debian dependencies and extensions depending on each other.
- For Q4 goal, we don't technically need to solve the ext dependency issue (Math does not depend on other extensions or skins)
Are we testing a lot
All Quibble jobs -- combinations: mysql/vendor/php70, mysql/composer/php70, mysql/vendor/php55, mysql/vendor/hhvmT:
- php/js lint/eslint
- qunit/phpunit
- webdriver.io