Wikimedia Performance Team/Sprints
This page is obsolete. It is being retained for archival purposes. It may document extensions or features that are obsolete and/or no longer supported. Do not rely on the information here being up-to-date. The Wikimedia Performance Team was disbanded with the WMF re-org that happened in July 2023.
2023
Outreach:
- Complete refresh of the frontend and backend guidelines and best practices (Timo, Peter, Aaron)
- Blog about the completion of the Multi-DC project (Aaron, Timo)
Insights:
- Frontend Synthetic: Move synthetic tests from AWS to bare metal (Peter) — In Q2 we evaluated in-house and external suppliers. We ended up choosing Hetzner. The server is available and accounted for in our budget.
- Frontend Synthetic: Reliably measure how fast a Wikipedia article would be without JavaScript (Peter)
- Frontend RUM: Add Long Tasks metrics to Navigation Timing (Barakat)
- Frontend RUM: Decommission coal and coal-web (Timo)
- Frontend RUM: Migrate navtiming processor from Graphite to Prometheus (Peter, Timo) — The python-prometheus client became a bottleneck in our non-parallelized setup. We reduced cardinality to resolve this.
- Backend: Profile time spent per component/extension in MW entry points (and visualise in Grafana) (Aaron)
- Backend: Increase retention of ArcLamp SVGs to 2 years (Timo)
- Backend: Add per-request flamegraph option to WikimediaDebug (Tim, Timo)
- Blog post: Flame graphs arrive in WikimediaDebug!
Improvement:
- ResourceLoader: Implement support for Source Maps (Tim, Timo)
- ResourceLoader: Implement continuous verification of MediaWiki core's foreign resources in WMF CI (Timo)
- ResourceLoader: Raise Grade A JavaScript requirement from ES5 (2009) to ES6 (2015) (Timo)
- Rdbms: Reduce complexity of LB and LBF (Amir, Aaron, Timo)
- Rdbms: Evaluate LoadMonitor connection weighing improvements (Aaron, Tim)
- Support Serve production traffic via Kubernetes (Timo)
Internal:
- Onboard one new team member (Aaron, Peter, Tim, Timo, Larissa).
- Support for various teams outside perf scope (Tim)
- Better Diffs: Wikidiff2 revise algorithm
- Async Fragments (officewiki document)
- (late arrival) IP Masking 2.0
Other goals that we considered but were postponed, cancelled, or incomplete:
2022
Insights:
- Frontend RUM: Migrate navtiming processor from Graphite to Prometheus (Peter, Timo) — Continue in 2023
- Frontend RUM: Expand navigation timing metrics to include modern user experience metrics (Peter, Timo)
- Frontend RUM: Update how we measure Layoutshift in Navigation Timing to reflect CLS metrics (Peter)
- Frontend Synthetic: Migrate synthetic tests infrastructure from AWS to bare metal (Larissa, Peter) — Moved to Q3 (Jan-Mar 2023)
- Frontend Synthetic: Bitbar: Add Firefox capabilities (Peter)
- Backend: Understand the status of SLOs on Product side (Larissa) — Have been talking with Suman and Desiree trying to restart the discussions
- Backend: Cross-DC query Alerts (Aaron)
Improvement:
- Prepare MediaWiki for PHP 8.1 (Tim, Timo). — Done from our side. Waiting on SRE ServiceOps. They prioritize mediawiki-on-k8s until end of Q3, but will be able to tackle PHP 8.1 in the beginning of Q4 (April-June 2023)
- Rdbms: Better LoadBalancer connection pooling (Aaron)
- Research opportunities in static.php traffic to identify simpler and longer-lasting caching policies. Reduce backend traffic to static.php by more than 70%, and remove a custom WMF-specific endpoint in the process, in favour of standard MediaWiki routes that require less maintenance going forward. (T285232, T302465)
Other goals that we considered but were postponed, cancelled, or incomplete:
- Multi-DC BagOStuff interfaces (Aaron)
- Find someone to run user interviews (Larissa) — Both Desiree and Marshal cannot help us at this time. Marshal suggested I run a couple of interviews on my own first, but we currently don't have the bandwidth to come up with a solid interview script and do the necessary pre-work
2021
See also internal 2021-2022 roadmap and internal Jan-Mar 2022 achievements.
Outreach:
- Support product development by Inuka Team (Wikipedia Preview), Reading Web (NearbyPages, and RelatedArticles), CPT (WebAuthn), Design Systems Team (WVUI/Vue.js), and WMDE (Kartographer-revid)
- Participate in the SLO working group to help establish an SLO around MediaWiki Save Timing.
- Participate in W3C WebPerf WG, provide feedback to Chrome team on Google Web Vitals and Chrome bugs.
- Organise the Web Performance devroom for FOSDEM 2021 (recordings).
- Speak at the We Love Speed conference (recording).
- Organise four Web Perf Hero awards.
Insights:
- Migrate our device lab to BitBar.
- Evaluate and build proof-of-concept synthetic testing on bare metal instead of at AWS.
- Write runbooks for investigating RUM alerts, WPT alerts, and WPR alerts.
- Support to SRE Observability in developing a new Prometheus-compatible MW-Stats client library.
- On-going maintenance of WebPageTest, WebPageReplay, and Fresh-node.
Improvement:
- Multi-DC: Deploy MainStash DB and migrate away from Redis-based MainStash (T212129).
- Multi-DC: MariaDB-TLS tested and enabled for all wikis.
- Multi-DC: CDN routing logic written and deployed to Beta and Prod behind feature flag.
- ResourceLoader debug mode v2, reduce wait time on complex pages from ~1 minute to ~1 second.
- Guidance and code review for DBA-led normalization of "templatelinks" MediaWiki database table, to reduce storage pressure and improve query performance. (T299417)
- Support to SRE ServiceOps for MW-on-K8s project.
- Develop precache-based GlobalUserEdit API for CentralAuth, following an incident.
2020
See also internal 2020-2021 roadmap.
Outreach:
- Support product launch by Anti-Harassment Team (IPInfo extension), and CPT (API Portal skin, API Portal OAuth extension, Changes to OAuth ext).
- Support development kick-off of Abstract Wikipedia (WikiLambda) through early check-in and 1-month team residency/matrixing in both directions.
- Organise the first Web Performance conference at FOSDEM (blogpost, recordings).
- Organise the first Web Perf Hero award.
- Get published in the Web Performance Calendar (4x: Human performance metrics, Profiling PHP at scale, Future of Web Vitals from a non-Googler, Setting up a device lab).
- Enable teams to create their own production error dashboards in Logstash with a template, written guide, and video presentation.
Insights:
- Expand navtiming RUM metrics pipeline with new Layout Shift metric.
- Kobiton setup for our device lab, expand to include iOS in addition to Android.
- Explore BitBar for our device lab.
- Explore moving WPT/WPR infra away from AWS.
Improvement:
- Multi-DC: Implement multi-dc strategy for ChronologyProtector (T254634).
- Multi-DC: Determine and start implementing strategy for MainStash DB (T212129).
2019
See also 2019-20 Q1#Performance and internal 2019-2020 roadmap.
Outreach:
- Design and implement the AS Report, to expand and formalize collaborations to leverage our influence with browser vendors and ISPs. (Announcement on Techblog).
- Initiate and work on Wikimedia Foundation becoming an official W3C member organization. This expands the Performance Team's participation in web standards and moves us from an "invited expert" (individual) to a represented membership organisation. (Announcement on wikimediafoundation.org)
- Support product launches by Parsing Team (Parsoid-PHP launch), Editing Team (DiscussionTools launch), Growth Team (GrowthExperiments launch), and Inuka Team (Wikipedia KaiOS app launch).
- Support RelEng around establishing production error triage workflows and semi-automation thereof.
- Organise WMF-wide frontend web performance training.
- Provide performance expertise to Frontend Architecture Working Group (FAWG).
- Get published in the Web Performance Calendar (2x: Measuring LT and FID, Big questions on RUM)
Insights:
- Research, develop, and test new RUM metrics that better match user perception (T187299, Meta-Wiki, Rossi 2019 paper).
- Organise and oversee implementation of First Paint metric in WebKit for Apple Safari (blog post).
- Introduce automatic developer-facing performance metrics for specific chunks of MediaWiki code in core and extensions, powered by WANObjectCache (T197849).
- Add more RUM metrics to the navtiming pipeline, including instrumentation for First Input Delay (T332012).
- Participate in Chrome Origin trial for Element Timing and provide feedback on upcoming W3C standard (blog post).
- Release WikimediaDebug v2 (blog post).
- Create our own Mobile Device Lab.
- On-going first response to synthetic testing alerts, including investigating regressions after Chrome/Firefox releases and comms with upstream browser vendors.
- On-going maintenance of WebPageTest and WebPageReplay.
- On-going maintenance of XHGui, including dealing with MongoDB becoming non-free software by developing and upstreaming MySQL drivers for XHGui, and migrating our install from MongoDB to MySQL.
Improvements:
- PHP7 Transition: Finish the transition from HHVM and support SRE with instrumentation, sampling, and benchmarking.
- Multi-DC: Start work on MainStash DB.
- Faster MediaWiki backend startup time to reclaim PHP7 latency increase in certain areas. (T233886, T189966).
- Faster page load time, by reducing ResourceLoader startup cost (blog post).
- Guidance, CR and testing for new AbuseFilter parser (development by Daimona) to improve Save Timing (T156095).
2018
See also 2018-19 Q1, 2018-19 Q2, and internal 2018-2019 roadmap.
Insights:
- Annual Plans/FY2019/TEC1: Current levels of service are maintained and/or improved.
- Enhance performance testing infrastructure, including addition of Chrome Tracelog (T182510), and introduction of WebPageReplay+Browsertime (based on last year's research) to complement and eventually replace WebPageTest (T153360). Blog post: Performance testing in a controlled lab environment
- Introduce Excimer, a new sampling profiler for PHP 7 to replace HHVM Xenon (T176916). Includes creation of the new php-excimer extension (blog post).
- Implement new "Backend-Timing" metric on Apache PHP web servers, as first full measurement of MediaWiki latencies. Backed by Prometheus. (T131894)
- Migrate WebPageTest hosting from Windows to Linux (T165626)
- Expand synthetic testing to more non-English wikis.
- Introduce Fresnel, performance testing in MediaWiki CI jobs. (T133646).
- Review current research on performance perception (T165272, T187299). Essay: Perceived Performance (2018). Blog posts: Mobile web performance: the importance of the device, Machine learning: how to undersample the wrong way.
- Develop new "navtiming2" metric definitions, addressing what we learned since 2015, and enable use of stacked graphs (T104902).
- On-going maintenance of navtiming.py service, including migration to dedicated hardware, and support for failover to secondary datacenter.
Outreach:
- Measure performance from Asia both before and after the Singapore data center came online (T169180, T168416), including a new navtiming capability for geographic oversampling (T169522). (blog post)
- Publish the first post in the Perf Matters at Wikipedia series.
- Get published in the Web Performance Calendar (5x: Magic numbers, Comparing HAR, Measuring Wikipedia, Why perf matters, AVIF).
Improvement:
- Annual Plans/FY2019/TEC1: Improve MediaWiki availability and reduce read-only impact from data center switchovers.
- Multi-DC: Develop integration and support for Mcrouter service in MediaWiki's WANObjectCache, support SRE's rollout of mcrouter service. (T198239)
- Annual Plans/FY2019/TEC4: PHP7 Migration: Guide the work and support other teams.
- Introduce support for packageFiles to ResourceLoader (T133462).
- Introduce support for WebP compression format to Thumbor.
- Reduce page load time by refactoring the startup module to need only one roundtrip instead of two, effectively loading jQuery in parallel outside the critical path. (T192623).
2017
See also Annual Plan/2017-2018#Technology, 2017-18 Q3, 2017-18 Q4, and internal 2017-2018 roadmap.
Outreach:
- Publish in the Web Performance Calendar (Automate performance regression alerts).
Insights:
- Program 1. Availability, performance, and maintenance.
- All production sites and services maintain current levels of availability or better.
- Maintain a comprehensive toolset to measure the performance of our platforms.
- Research reverse proxy technologies with the objective of obtaining more stable metrics from the synthetic testing infrastructure, increasing confidence and reducing the minimum regression size that can be detected. Evaluated Mahimahi, WebPageReplay, and mitmproxy; selected WebPageReplay. Deployed WebPageReplay+Browsertime to complement and eventually replace WebPageTest (T153360).
- Implement a performance alerting system atop Grafana. Establish it as a practice for other teams to follow. Two teams used it in the first year. T153169
- Develop new "navtiming2" metric definitions, addressing what we learned since 2015, and enable use of stacked graphs (T104902, blog post).
Improvement:
- Support for HHVM-PHP7 migration and upgrade, including development of php-excimer (T176916, blog post)
- Support regular data center switchovers, including development of EtcdConfig in MediaWiki core (T156924, T160178)
- Expand support in Thumbor to private wikis. Thumbor service replaces MediaWiki ImageHandler (3-part blog post series).
- Program 8. Progress towards multi-datacenter support (wikitech:Performance/Multi-DC MediaWiki).
- Faster Wikipedia time-to-logo. (blog post, T100999)
- Faster edit save timing. (blog post)
- Faster page load time. Reduce load time on 3G-Slow connections by one whole second, from 14s to 13s. T164299#3572231
- Phase out "mediawiki.legacy.wikibits" module to reduce page view cost. T122755
- Migrate MediaWiki core and all deployed extensions to jQuery 3, multi-month cross-team effort. T124742
2016
See also Perf Matters at Wikipedia in 2016 (Blog post), and Annual Plan/2016-2017 Program 4: Improve site performance.
Insights:
- Enhance performance testing infrastructure, including speeding up the infrastructure to achieve hourly testing instead of every 3 hours (T151197), and adding new metrics for DOM size (T159362).
Improvement:
- Help develop Thumbor as a service to replace MediaWiki FileHandler in production (3-part blog post series).
- Help guide and prepare for HTTP/2 roll out to Wikimedia CDN (blog post).
- Progress towards multi-datacenter support (wikitech:Performance/Multi-DC MediaWiki).
2015
See also Perf Matters at Wikipedia in 2015 (Blog post).