Talk:Multi-Content Revisions/Database Schema

<code>slot_page</code> and <code>content_page</code>

Summary by Daniel Kinzler (WMDE)

Should we put the page ID into the new slots and revision tables, for sharding, analysis, and recovery?

Anomie (talkcontribs)

I'd like to hear from JCrespo as to whether this is needed or desirable. I'm skeptical that access patterns will actually make sharding by page ID more efficient.

The proposed "slot history" would potentially be more efficient if the slot table were sharded by page ID, since it will have to join that table to hit the slot_inherited field. Even without sharding, I suppose the DB could select on slot_page, slot_inherited rather than using rev_page and then filtering based on the join.
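
Roughly, the two query shapes would look like this (a sketch only; the table and column names follow the proposal, and the literal page ID is a placeholder):

<syntaxhighlight lang="sql">
-- Current shape: walk the page's revisions, then join to skip inherited slots.
SELECT r.rev_id
FROM revision r
JOIN slots s ON s.slot_revision = r.rev_id
WHERE r.rev_page = 12345
  AND s.slot_inherited = 0;

-- With slot_page: select directly on the slots table, no join needed.
SELECT s.slot_revision
FROM slots s
WHERE s.slot_page = 12345
  AND s.slot_inherited = 0;
</syntaxhighlight>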

I can't think of any common situation where sharding the content table by page ID would be more efficient. Nor can I think of a situation where we'd want to use content_page directly in a query.

Daniel Kinzler (WMDE) (talkcontribs)

As far as I know, there is only one use case where we query the revision table without looking for a specific page: user contributions. In all other situations we can and should provide a page ID.

Sharding isn't the only reason to have this, though. It also provides some low-cost redundancy that should help a lot with analysis and repair.

Anomie (talkcontribs)

API list=allrevisions also queries the revision table without looking for a specific page.

RecentChanges and Watchlist, both API and web UI, have many pages to query but join revision by rev_id rather than page, so they'd likely see little or no benefit from this sharding either.

Also, API prop=revisions can specify up to 5000 specific pages and fetch one revision per page, so sharding would be of no benefit there either.

I also note that sharding gives no advantage if you're fetching information for just one revision, which covers the most common cases. And it's probably not that useful if you're fetching information for just two revisions either, which covers diffs. Page histories probably don't need the content and may not even need the slot information. What's left? Just API prop=revisions in single-page mode and dumps?

I'm even more skeptical of this proposal if the only remaining justification is "analysis and repair". How often do we ever need to repair, and how often would there be an analysis that needs to worry about optimization at this level and that would even benefit from it?

Daniel Kinzler (WMDE) (talkcontribs)

The main reason to introduce sharding is not query performance. The main reason is keeping the total size of tables down, to keep them manageable for schema updates. On-disk size is also a concern.

At least, this is what I gathered from the feedback I got from Jaime and others during earlier rounds of MCR schema discussions. My original proposal did not have the page ID here and did not consider sharding. I added this to reflect discussions we had at the Summit and at the Hackathon.

In any case: sharding the revision table is not part of this proposal. I'm only suggesting to define the new slots and content tables in a way that allows sharding to be introduced easily. The comments on sharding the revision table are idle musing at this point. Changing revision to use page_id+rev_id as its primary key would indeed be a bit of work, and a separate project.

As to the cases you mentioned:

  • Listing all revisions would need some extra logic to iterate over all shards. Order would not be by revision ID or page ID, but by shard+page-rev. This needs extra code, but should not be a problem for performance.
  • Anything that selects by or joins against rev_id would need to be changed to include page_id in the condition, if we start sharding the revision table.
  • prop=revisions is a problem without page IDs: finding specific revisions becomes expensive.
  • When selecting a single revision, or two revisions, of a single page, sharding gives a slight advantage, since you are hitting a smaller DB index (provided you also know the page ID, which you should).

But none of this is required for MCR, nor implied by the proposed schema.
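
For concreteness, here is a rough sketch of what shard-friendly slots and content tables could look like (illustrative only; the column types, the content_id and slot_content columns, and the exact key layout are my assumptions, not part of the proposal):

<syntaxhighlight lang="sql">
-- Sketch: the page ID leads the primary key, so all rows belonging to
-- one page (and hence one shard) cluster together.
CREATE TABLE slots (
  slot_page      INT UNSIGNED NOT NULL,     -- redundant copy of rev_page
  slot_revision  BIGINT UNSIGNED NOT NULL,
  slot_content   BIGINT UNSIGNED NOT NULL,  -- hypothetical reference to content.content_id
  slot_inherited TINYINT(1) NOT NULL DEFAULT 0,
  PRIMARY KEY (slot_page, slot_revision, slot_content)
);

CREATE TABLE content (
  content_page    INT UNSIGNED NOT NULL,    -- redundant, for sharding/analysis/recovery
  content_id      BIGINT UNSIGNED NOT NULL, -- hypothetical surrogate key
  content_address VARBINARY(255) NOT NULL,  -- blob address, see the content_address topic
  content_sha1    VARBINARY(32) NOT NULL,
  content_size    INT UNSIGNED NOT NULL,
  PRIMARY KEY (content_page, content_id)
);
</syntaxhighlight>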

Daniel Kinzler (WMDE) (talkcontribs)

@Anomie: do you think slot_page and content_page are problematic or need further discussion? Otherwise, I think this thread can be closed.

Anomie (talkcontribs)

I really want to hear from Jaime on this one. The schema would be a lot cleaner without them.

<code>rev_sha1</code>

Summary by Daniel Kinzler (WMDE)

The discussion on wikitech-l indicates that we are not ready to drop rev_sha1 at the moment. https://lists.gt.net/wiki/wikitech/844067

We could, however, re-calculate rev_sha1 for every revision based on the actual blobs, instead of using pre-calculated content_sha1 fields. We would trade the overhead of loading and hashing all content against the overhead of storing the pre-calculated hashes in the DB.

Anomie (talkcontribs)

That is, rev_sha1, ar_rev_sha1, and content_sha1. Other teams have been discussing revert detection, which is the only point of having these fields. It would be useful to get their input on these. Post to wikitech-l?

Daniel Kinzler (WMDE) (talkcontribs)

I was thinking of turning the schema proposal into an RFC anyway. But I can send a separate email right away, sure.

I believe the person looking into revert detection/tracking is Roan.

Miscellaneous questions on existing tables

Summary by Daniel Kinzler (WMDE)

No changes to the page, revision, or archive tables are needed for MCR. However, some fields in these tables become meaningless, in particular the fields containing content model and format. These can be dropped later.

Anomie (talkcontribs)

Most of these seem unrelated to MCR, and unneeded for MCR.

  • A bigint for rev_id would probably make sense, although that seems mostly a separate concern from MCR. I'd say leave that change for a separate project. The only dependency is the type of slot_revision, but there are many other fields in the DB that have a similar dependency.
    • If we want to do this at the same time as any MCR-related schema changes, it shouldn't be too hard to do it.
  • Again, your questions for rev_page seem unrelated to MCR. Let's leave that for a separate project, MCR is big enough as it is.
  • ar_text and ar_flags still exist because old rows deleted in MediaWiki before 1.5 were never migrated to the text table. We should really do that someday so we can get rid of these columns.
  • Regarding ar_rev_id not referring to an existing revision, it is the "existing" revision. A migration to merge it with ar_id would need to invent IDs for existing pre-1.5 rows without conflicting with any existing rev_id or ar_rev_id.
  • Your question on page_namespace is entirely unrelated to MCR.
  • page_restrictions may need old rows migrated before being dropped (a count using the enwiki replica on tool labs reports 4102 rows that have a value here). Again, this seems unrelated to MCR.
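
For reference, a count like that can be obtained with a query along these lines (a sketch; the empty-string predicate is an assumption about how unused rows look):

<syntaxhighlight lang="sql">
-- Count pages still carrying a value in the legacy page_restrictions column.
SELECT COUNT(*)
FROM page
WHERE page_restrictions != '';
</syntaxhighlight>
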
Daniel Kinzler (WMDE) (talkcontribs)

Yes, none of this is needed for MCR. In fact, MCR was intentionally designed so it would function with no change to the page, revision, or archive table. My intention was to provide an overview of other planned or possible changes that will affect these tables, and also to say which fields become redundant through MCR.

I suppose I should make it clear for each change whether it is (a) needed for MCR, (b) made possible by MCR, or (c) causally unrelated and mentioned purely for information.

<code>content_format</code>

Summary by Daniel Kinzler (WMDE)

content_format can sensibly be derived from the model, or detected from the data. We don't need to track it.

Anomie (talkcontribs)

We can only omit this if we make a more global change requiring all content handlers that support multiple formats to auto-detect the format instead of having it passed in. For all existing code that I'm aware of, this certainly seems doable.

Daniel Kinzler (WMDE) (talkcontribs)

Since most handlers only support a single format in the first place, this should be trivial. It would also save us quite a bit of work and reduce the complexity of the schema somewhat.

<code>slot_inherited</code> versus <code>content_origin</code>

Summary by Daniel Kinzler (WMDE)

We prefer a boolean slot_inherited over full content_origin.

Anomie (talkcontribs)

In retrospect, while doing this on-wiki versus in Gerrit made it easier for you to add implementation notes, it makes it harder to comment on things. Flow makes it even more of a pain.

slot_inherited seems more natural to me. A content row could theoretically be reused for later revisions. For example, if someone makes an edit and that edit is reverted using the rollback feature, we're probably going to take advantage of that trivial case to reuse the old content row. With content_origin, though, we'd either be prevented from doing that (because we'd have to update the origin), or the existing origin would cause the slot-history feature to skip showing the revert.

slot_inherited would also be more efficient for your proposed use case, since it won't require joining the content table to determine whether a revision can be omitted.
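
To illustrate (a sketch; slot_content and content_id are join columns I'm making up for the example):

<syntaxhighlight lang="sql">
-- With slot_inherited, the filter stays on the slots table:
SELECT s.slot_revision
FROM slots s
WHERE s.slot_page = 12345
  AND s.slot_inherited = 0;

-- With content_origin, the same filter needs a join against content,
-- keeping only slots whose content originated in that very revision:
SELECT s.slot_revision
FROM slots s
JOIN content c ON c.content_id = s.slot_content
WHERE s.slot_page = 12345
  AND c.content_origin = s.slot_revision;
</syntaxhighlight>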

Daniel Kinzler (WMDE) (talkcontribs)

I'm also leaning towards slot_inherited for practical reasons, though content_origin retains more information.

Daniel Kinzler (WMDE) (talkcontribs)

...and yes, using Flow for this discussion is a bit painful. Inline would be better... I'll think about it.

I really want inline notes for wiki pages! But where do we store them? I wish we could just attach additional content objects to a page... ;)

Anomie (talkcontribs)

The bigger trick with inline notes for wiki pages would be in not losing them when something else in the page is edited.

Daniel Kinzler (WMDE) (talkcontribs)

We could just switch to HTML for primary storage, then we wouldn't even need MCR for the inline notes :)

<code>content_address</code>

Summary by Daniel Kinzler (WMDE)

255 bytes it is for now.

Anomie (talkcontribs)

This seems fine to me as proposed, assuming 255 bytes is enough. I don't see much point to separating the prefix from the rest of the URL here.

Daniel Kinzler (WMDE) (talkcontribs)

255 bytes should be enough, but if it costs nothing, let's just declare 1023 bytes or something. But this does make a difference for indexes, right? Do we need an index on this?

Anomie (talkcontribs)

If the varchar is more than 255 bytes, it uses an extra byte per row.
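
That is the usual MySQL rule: a VARCHAR or VARBINARY whose declared maximum fits in 255 bytes stores a 1-byte length prefix per value; anything longer needs 2 bytes. For example (throwaway table names, just to show the two cases):

<syntaxhighlight lang="sql">
-- 1-byte length prefix per stored value (max length fits in 255 bytes):
CREATE TABLE content_255  (content_address VARBINARY(255));

-- 2-byte length prefix per stored value:
CREATE TABLE content_1023 (content_address VARBINARY(1023));
</syntaxhighlight>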

It also makes the rows potentially larger, although I don't know to what extent that might be a concern.

It may also affect indexing, if you decide to put an index on it. I can't think of any reason an index would be necessary, though, except for potential maintenance.

Daniel Kinzler (WMDE) (talkcontribs)

OK, 255 bytes it is for now.

<code>rev_len</code> and <code>page_len</code>

Summary by Daniel Kinzler (WMDE)

We need rev_len to be the sum of the relevant content_size fields. Nothing changes otherwise.

Anomie (talkcontribs)

page_len is queried for various purposes, including Special:ShortPages. It's presumably denormalized from the rev_len of the page_latest revision, to avoid having to join and to allow for the page_len index on the page table.

rev_len seems to be used just for displaying the size of the revision (and the size change compared to the previous revision) on history pages and such. It may still be useful as the sum of content_size for that purpose, to avoid history pages having to join slots and content and add the lengths together (either with a GROUP BY or in PHP).
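
Concretely, the aggregation a history page would otherwise have to run would look something like this (a sketch; slot_content and content_id are made-up join columns):

<syntaxhighlight lang="sql">
-- Per-revision size computed as the sum of the sizes of its slots' content.
SELECT s.slot_revision, SUM(c.content_size) AS rev_len
FROM slots s
JOIN content c ON c.content_id = s.slot_content
WHERE s.slot_page = 12345
GROUP BY s.slot_revision;
</syntaxhighlight>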

Daniel Kinzler (WMDE) (talkcontribs)

Yes, we should keep rev_len as a sum of content_size. Denormalizing page_len seems dubious, but whether it's needed really depends on the access patterns.

I would love to see the mutable part of page (page_len, page_touched, page_latest) split off into a separate table though, so that the "primary" part becomes append-only. But that's definitely a separate project.

Anomie (talkcontribs)

Denormalizing page_len is probably needed for everything that uses the page_len or page_redirect_namespace_len indexes. Otherwise, for the former index it would need to scan all revisions by rev_len (which would need an index) and filter just the latest revisions, and for the latter index it would either have to do the same or would have to fetch all redirects in the namespace and filesort by rev_len.

A lot more of page is mutable than just those three fields, although those three (along with page_links_updated) are probably the ones most often mutated. The only really constant fields are page_id and page_random.

Daniel Kinzler (WMDE) (talkcontribs)

Yeah, you are right.

So, we need rev_len to be the sum of the relevant content_size fields. Nothing changes otherwise. Thoughts about page_len are idle musing.
