Manual:Database layout/MySQL Optimization/Tutorial

From mediawiki.org
Speed up your queries!

Welcome to the tutorial on MySQL Optimization for MediaWiki, originally given at the Berlin Hackathon 2012. The live tutorial will cover:

  • Why Optimization Matters
  • Why Indexes Are Most Important
  • How to avoid Unindexed and Unlimited Queries
  • Using EXPLAIN

The slide deck for this tutorial has specific examples of optimized queries and simple practices that you can use to speed up your queries.

Simple prep you need in order to take this tutorial:
For the practice exercise, you will access a database with sample data in Wikimedia Cloud Services. All you need is a Wikimedia developer account and membership in the 'bastion' project (anyone who is a member of any project is also a member of 'bastion'). You don't need to be a member of the tutorial project. Before the tutorial, please check that you can reach this database: ssh into bastion (ssh bastion.wmflabs.org) and, once you're in, run:

mysql -h tutorial-mysql -u tutorial commonswiki_partial

You may also want to suggest a query to be used in the demo, via the list below.

Introduction

To many MediaWiki developers, SQL query performance and optimization are shrouded in mystery. Most know that there are efficient and inefficient queries, and that an inefficient query will either be caught during code review or be noticed when it takes down a wiki, prompting an ops person to fix the breakage and yell at the developer who caused it. But few people seem to really understand how query performance works.

How can you tell if a query is inefficient? How do you write efficient queries, and avoid inefficient ones? If so few people know this, it must be this difficult, mysterious thing, right? Fortunately, you don't have to be Domas or Tim to understand this. If you understand how a phone book works, you can learn this too.

This tutorial will cover the basics of how database engines in general, and MySQL specifically, execute different kinds of queries. It will explain why certain queries are executed more efficiently than others and what role indexes play in this process. We will demonstrate better practices by writing efficient queries and by showing how to set up tables and indexes that support them, and we will discuss common pitfalls that result in inefficient queries and how to address them. We will also demonstrate how to obtain a query analysis from MySQL and how to make sense of it.
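To make the role of indexes concrete before the live session, here is a minimal sketch of asking a database how it plans to run a query, before and after an index exists. It is not taken from the slide deck. It assumes SQLite (through Python's built-in sqlite3 module) as a stand-in for MySQL so that it runs without a server, and it uses SQLite's EXPLAIN QUERY PLAN, whose output format differs from MySQL's EXPLAIN. The table is a hypothetical three-column slice of a `page` table, and the index name `name_title` is chosen for illustration.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL so the sketch runs anywhere.
# The table is a hypothetical, simplified slice of a `page` table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page ("
    "page_id INTEGER PRIMARY KEY, "
    "page_namespace INTEGER NOT NULL, "
    "page_title TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO page (page_namespace, page_title) VALUES (?, ?)",
    [(i % 4, f"Title_{i}") for i in range(1000)],
)

QUERY = (
    "SELECT page_id FROM page "
    "WHERE page_namespace = 0 AND page_title = 'Title_8'"
)


def query_plan(connection, sql):
    """Return the human-readable plan lines SQLite reports for `sql`."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each row holds the plan description.
    return [row[-1] for row in rows]


# No usable index yet: the engine has to read every row, like reading a
# phone book page by page to find one name.
plan_without_index = query_plan(conn, QUERY)

# Add an index on the columns used in the WHERE clause and ask again.
conn.execute("CREATE INDEX name_title ON page (page_namespace, page_title)")
plan_with_index = query_plan(conn, QUERY)

print("without index:", plan_without_index)
print("with index:   ", plan_with_index)
```

The first plan reports a scan of the whole table, the second a search that uses the new index. In MySQL the equivalent check is to prefix the query with EXPLAIN and look at which key, if any, is used and how many rows the engine expects to examine.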

Suggest a Query to be Used in the Demo

Add a query you want to see optimized in the tutorial demo to the list below:

  1. ContributionScores query

Extension:ContributionScores polls the wiki database to locate contributors with the highest contribution volume. It has NOT been tested on a high-volume wiki. The extension is intended for fledgling wikis looking to add a fun metric that lets contributors see how much they are helping out.

It is used at translatewiki.net and occasionally causes (very) slow queries there. Example query:

# Time: 120525  6:51:18
# User@Host: twn[twn] @ localhost []
# Query_time: 28.124669  Lock_time: 0.000105 Rows_sent: 50  Rows_examined: 18860849
SELECT /*  Erdemaslancan */ user_id, user_name, user_real_name, page_count, rev_count,
page_count+SQRT(rev_count-page_count)*2 AS wiki_rank
FROM `bw_user` u
JOIN (
  (
    SELECT rev_user,
    COUNT(DISTINCT rev_page) AS page_count,
    COUNT(rev_id) AS rev_count
    FROM `bw_revision`
    WHERE rev_user NOT IN (SELECT ug_user FROM `bw_user_groups` WHERE ug_group='bot')
    GROUP BY rev_user
    ORDER BY page_count DESC
    LIMIT 50
  ) UNION (
    SELECT rev_user,
    COUNT(DISTINCT rev_page) AS page_count,
    COUNT(rev_id) AS rev_count
    FROM `bw_revision`
    WHERE rev_user NOT IN (SELECT ug_user FROM `bw_user_groups` WHERE ug_group='bot')
    GROUP BY rev_user
    ORDER BY rev_count DESC
    LIMIT 50
  )
) s ON (user_id=rev_user)
ORDER BY wiki_rank DESC LIMIT 50;

Uses:

  2. Batch query vs. many queries

I don't have an example off the top of my head, but this may be a less obvious optimisation with potentially big rewards.
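As a starting point for this suggestion, the difference between many small queries and one batched query can be sketched as follows. This is an illustration under stated assumptions, not an example from the tutorial: it uses SQLite through Python's sqlite3 module in place of MySQL, a hypothetical two-column `user` table, and made-up helper names (`names_one_by_one`, `names_batched`).

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; the table is a hypothetical
# two-column version of a `user` table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user (user_id INTEGER PRIMARY KEY, user_name TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO user (user_id, user_name) VALUES (?, ?)",
    [(i, f"User{i}") for i in range(1, 101)],
)

wanted_ids = [3, 14, 15, 92]


def names_one_by_one(connection, ids):
    """Many queries: one SELECT (and one round trip) per ID."""
    names = {}
    for user_id in ids:
        row = connection.execute(
            "SELECT user_name FROM user WHERE user_id = ?", (user_id,)
        ).fetchone()
        if row is not None:
            names[user_id] = row[0]
    return names


def names_batched(connection, ids):
    """Batch query: a single SELECT with an IN (...) list of placeholders."""
    if not ids:
        return {}
    placeholders = ", ".join("?" for _ in ids)
    rows = connection.execute(
        f"SELECT user_id, user_name FROM user WHERE user_id IN ({placeholders})",
        list(ids),
    ).fetchall()
    return dict(rows)


# Both approaches return the same data; only the number of queries differs.
print(names_one_by_one(conn, wanted_ids) == names_batched(conn, wanted_ids))
```

Both functions return the same mapping. An in-memory SQLite database will not show a meaningful speed difference; against a MySQL server reached over a network, each separate query adds a round trip and per-query overhead, which is where batching is likely to pay off.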

Feedback and Discussion

  • Collect participants' feedback and questions
  • Reminder to document your discoveries, bugs and optimization tips