
Flow/Architecture

From mediawiki.org

Big ideas

  • Flow was designed to be about workflow management. A "discussion" is a type of workflow – a very simple one, structured in a way that resembles Reddit comments, with the addition of versatile wiki text support.
  • Flow was designed to be cross-wiki. The developers hoped to allow discussions to take place across wikis and appear on pages on different wikis, but that feature never materialized.
  • Flow reduces server storage consumption, because adding a comment does not require recording the entire existing discussion page in the revision history again.
  • Less maintenance is necessary: there is no need to move finished discussions to an archive subpage when a discussion board grows too long. Instead, each new discussion thread creates a separate entry in the Topic: namespace.
  • No possibility of edit conflicts when people add comments.

Templates are often used to implement ad-hoc workflows.


In many cases local wikis use templates to encourage workflow within them. The goal for Flow's workflow models is to be dynamic enough to be managed by local wiki administrators to cover use cases currently handled by workflow suggestions inside templates. In other words, Flow will implement a whole bunch of Lego pieces, and the individual communities will stick them together into the various workflows they need.

Cross-wiki database


Flow metadata is vertically partitioned away from core MediaWiki into a single database shared by all wikis. While Flow stores comments from all wikis in a single database, it does not implement display of a piece of data on wikis other than the wiki it was created on. There are checks in a variety of places to ensure cross-wiki data is not displayed until we have a chance to focus on the implications (user IDs, page IDs, configuration differences of the wiki and its extensions, and many things we don't even know yet).

Flow revisions are kept in ExternalStore.

Data layer

See also: Flow/Database


IDs


Flow uses 88-bit timestamped identifiers that are unique across machines. In theory, this design would permit a Flow board that is a "feed" of items from multiple wikis. Flow stores the identifiers in the database as binary(11) values. The UUID model class handles IDs in either binary or alphadecimal (a-z0-9) form as necessary; internally it generates IDs using the UIDGenerator class in MediaWiki core. Because these identifiers are timestamped, sorting by UUID gives chronological order.

Because of this, Flow generally does not store timestamps separately and just extracts the timestamp from the UUID as needed. A procedure for extracting the timestamp is[1]:

  • Convert the ID to hexadecimal
  • Left-pad the hex string with zeros so that its total length is 22
  • Take the first 12 characters
  • Parse that as a number
  • Divide by 4 (or shift right 2 bits)
  • This is a UNIX timestamp in milliseconds
  • To get a standard UNIX timestamp (in seconds), divide by 1000

To get a MariaDB datetime, you can use: date_format(from_unixtime((conv(substring(hex(a.rev_id), 1, 12), 16, 10) >> 2) / 1000),"%Y%m%d%H%i%S")
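
The extraction steps above can be sketched in Python. This is a minimal illustration, not code from Flow: the function names are hypothetical, and it assumes the ID is supplied as a hexadecimal string of at most 22 characters. Unlike from_unixtime, which uses the database session time zone, this sketch always formats in UTC.

```python
from datetime import datetime, timezone

HEX_LENGTH = 22  # 88 bits = 11 bytes = 22 hexadecimal characters


def flow_uuid_timestamp_ms(uuid_hex: str) -> int:
    """Return the UNIX timestamp in milliseconds embedded in an 88-bit ID."""
    if len(uuid_hex) > HEX_LENGTH:
        raise ValueError("expected at most 22 hexadecimal characters")
    padded = uuid_hex.rjust(HEX_LENGTH, "0")  # left-pad with zeros
    # The first 12 hex characters are 48 bits; dropping the low 2 bits
    # leaves the 46-bit millisecond timestamp.
    return int(padded[:12], 16) >> 2


def flow_uuid_datetime(uuid_hex: str) -> datetime:
    """Return the embedded timestamp as a timezone-aware UTC datetime."""
    seconds = flow_uuid_timestamp_ms(uuid_hex) / 1000
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def flow_uuid_datetime_string(uuid_hex: str) -> str:
    """Format the timestamp like the MariaDB expression (YYYYMMDDHHMMSS)."""
    return flow_uuid_datetime(uuid_hex).strftime("%Y%m%d%H%M%S")


if __name__ == "__main__":
    # Build a synthetic ID: 46-bit millisecond timestamp followed by 42 other bits.
    sample = format((1_400_000_000_000 << 42) | 12345, "x")
    print(flow_uuid_timestamp_ms(sample), flow_uuid_datetime_string(sample))
```

The synthetic ID in the usage example is built only for illustration; IDs taken from a real database should be converted to hexadecimal first, as in the first step of the procedure.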

Workflow


A Workflow is an individual instance of a workflow, identified by a globally unique identifier. A discussion topic is a Workflow instance (for example, /wiki/User_talk:Maryana?workflow=0506af29cfae6e5b09a3fa163e68c4ac).

  • ID - A unique identifier of this workflow
  • wikiId - The wiki id of the owning Title
  • pageId - The article id of the owning Title
  • namespace - The numeric namespace of the owning Title
  • titleText - The db key of the owning Title
  • userId - The user id of the initial workflow creator
  • userText - A user name of the initial workflow creator
  • definitionId - Id of the flow Definition this workflow is a type of

TopicListEntry


The topic list is an N to M relation between workflows. The initial use case is a parent discussion workflow that is related to many topic workflows within the discussion. These topics can then be included in other discussions. (Renaming it and adding a type field is under consideration, to allow generic N to M relations if use cases arise.)

  • topicListId - UID of the parent workflow
  • topicId - UID of the child workflow
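
As a rough illustration of the N to M relation, the following Python sketch models topic list entries as plain pairs and looks them up in both directions. The class and function names are hypothetical and only mirror the two fields listed above; Flow itself stores these rows in its database.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TopicListEntry:
    topic_list_id: str  # UID of the parent (discussion) workflow
    topic_id: str       # UID of the child (topic) workflow


def topics_in(topic_list_id: str, entries: List[TopicListEntry]) -> List[str]:
    """All topic workflows included in one parent discussion workflow."""
    return [e.topic_id for e in entries if e.topic_list_id == topic_list_id]


def lists_including(topic_id: str, entries: List[TopicListEntry]) -> List[str]:
    """All parent discussion workflows that include one topic workflow."""
    return [e.topic_list_id for e in entries if e.topic_id == topic_id]


if __name__ == "__main__":
    entries = [
        TopicListEntry("board-a", "topic-1"),
        TopicListEntry("board-a", "topic-2"),
        TopicListEntry("board-b", "topic-1"),  # the same topic included elsewhere
    ]
    print(topics_in("board-a", entries))
    print(lists_including("topic-1", entries))
```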

AbstractRevision

  • revId - UID of this revision, generated with UIDGenerator::newTimestampedUID88()
  • userId - Id of the user that created this (from the wiki id of the owning workflow, identified by the concrete revisions)
  • userText - The user name that created this
  • parentId - A revision id that this revision is based on
  • changeType - A string identifying the type of action that created this revision (not user generated)
  • type - A string identifying the concrete revision type
  • content - The content or, if ExternalStore is enabled, a URL from ES
  • flags - Array (comma-separated in storage) of string flags that apply specifically to the content. Examples include utf-8, html, etc.
  • modState - String identifier of the revision's current moderation state
  • modUserId - Denormalized Id of the user that most recently moderated this revision (from the wiki id of the owning workflow)
  • modUserText - Denormalized user name of the moderating user
  • modTimestamp - Denormalized wfTimestampNow() created when moderation most recently occurred
  • lastEditId - Denormalized UID of the revision that is the last content edit
  • lastEditUserId - Denormalized id of the user (from the wiki id of the owning workflow) that performed the last content edit
  • lastEditUserText - Denormalized name of the user that performed the last content edit

Notes:

  • The canonical source of who moderated what, and when, is the changeType of all related revisions, together with the creators of the revisions that have moderation-related change types. The denormalized moderation information describes the most recent moderation action performed against a series of revisions, without changing the canonical information they store.
  • The canonical source of who edited the content, and when, is the set of revisions whose content does not match that of their parent. This is denormalized so the data model can expose the most recent content editor to the UI without performing extra lookups.

HeaderRevision


A very simple piece of revisionable content displayed at the top of every discussion, with no custom aspects beyond AbstractRevision.

  • workflowId - The workflow that this header belongs to

PostRevision


A revisionable piece of content related to other posts in a 1 to N parent/child relationship.

While not explicitly defined in each post, the owning workflow has the same UID as the postId of the only post in a tree of posts that has a null replyToId. This root post represents the title of the topic. This is cached elsewhere.

  • replyToId - UID of the post's parent
  • postId - UID of this post
  • revId - UID of the related AbstractRevision
  • origCreateTime - Denormalized creation time (extracted from UID) of the first revision of this post
  • origUserId - Denormalized user id (from the wiki id of the owning workflow) that created the first revision of this post
  • origUserText - Denormalized user name that created the first revision of this post
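
The parent/child relationship can be illustrated with a short Python sketch that rebuilds a topic tree from replyToId values. The Post class and build_topic_tree function are hypothetical and carry only the two fields needed for the illustration; the sketch assumes every replyToId refers to a post in the same list.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Post:
    post_id: str
    reply_to_id: Optional[str]  # None for the root post (the topic title)
    replies: List["Post"] = field(default_factory=list)


def build_topic_tree(posts: List[Post]) -> Post:
    """Attach each post to its parent and return the single root post."""
    by_id = {post.post_id: post for post in posts}
    roots = []
    for post in posts:
        if post.reply_to_id is None:
            roots.append(post)
        else:
            by_id[post.reply_to_id].replies.append(post)
    if len(roots) != 1:
        raise ValueError("a topic must have exactly one post with a null replyToId")
    return roots[0]


if __name__ == "__main__":
    root = build_topic_tree([
        Post("t1", None),   # topic title; its postId equals the workflow UID
        Post("p2", "t1"),
        Post("p3", "p2"),
    ])
    print(root.post_id, [reply.post_id for reply in root.replies])
```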

Revisions


Posts, headers, and possibly a wide variety of other content within Flow need to be revisioned, much like normal wiki pages.

  • Flow revisions have a different use case than wiki Article revisions
    • Articles have hundreds to (tens of?) thousands of revisions
    • Flow currently stores a revision per post, and most posts will only ever have a single revision. Posts that go through a few edits and moderation cycles will still only have tens of revisions.
    • Other pieces of Flow, such as discussion headers, will have perhaps dozens or hundreds (thousands wouldn't be unreasonable) of revisions, but still not nearly as many as an article.
  • Flow revisions need different metadata depending on where the piece is used (header, post, etc).
  • Content is stored in External Store (the same as article revisions in core MW). You can configure the ExternalStore servers that Flow uses independently of core.
    • When rendering pages, the actual content will typically come from memcache, so needing multiple queries or joins to get revision + content is less of a concern.
    • External Store has been extended to batch requests for multiple pieces of content such that only 1 query is issued per server.


Abuse Filter, SpamBlacklist, SpamRegex, etc.



Within Flow, the `Flow\SpamFilter\Controller` class applies the automated spam prevention techniques. Before any new revision is written out, it is passed to the Controller, which responds with a `Status` object. Small wrappers for each of the core spam prevention implementations, such as AbuseFilter and SpamBlacklist, are implemented to satisfy the `Flow\SpamFilter\SpamFilter` interface. The Controller queries these implementations individually; all SpamFilter implementations must agree that the content is safe for it to pass.
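
The "all filters must agree" behaviour can be sketched in Python as follows. This is an illustration of the pattern only, not a port of the PHP classes: the method names, the Status fields, and the BannedPhraseFilter example are assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Protocol


@dataclass
class Status:
    errors: List[str] = field(default_factory=list)

    def is_ok(self) -> bool:
        return not self.errors


class SpamFilter(Protocol):
    """Interface each wrapper satisfies (hypothetical method name)."""

    def validate(self, content: str) -> Status: ...


class BannedPhraseFilter:
    """Toy filter standing in for a wrapper around a real spam extension."""

    def __init__(self, phrase: str) -> None:
        self.phrase = phrase

    def validate(self, content: str) -> Status:
        if self.phrase in content.lower():
            return Status([f"content contains banned phrase: {self.phrase}"])
        return Status()


class Controller:
    def __init__(self, filters: Iterable[SpamFilter]) -> None:
        self.filters = list(filters)

    def validate(self, content: str) -> Status:
        # Query each filter in turn; the first rejection fails the revision.
        for spam_filter in self.filters:
            status = spam_filter.validate(content)
            if not status.is_ok():
                return status
        return Status()  # every filter agreed the content is safe


if __name__ == "__main__":
    controller = Controller([BannedPhraseFilter("cheap pills"), BannedPhraseFilter("casino")])
    print(controller.validate("A normal reply").is_ok())
    print(controller.validate("Visit my casino").errors)
```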

Monitoring


Users monitor revisions and actions on the wiki in various places. Whenever an action (e.g. edit-post, lock-topic, patrol, etc.) is added or changed, consider the following places. Not all of them necessarily apply, but you should check whether each one does (because the action maps to an existing core action, such as patrol) or should:

  • Board history
  • Topic history
  • Recent Changes
  • Watchlist
  • Special:Log

Front-end architecture



Suppressed revisions



Mapping URLs to Flow workflows



Performance considerations


How much data will Flow need to store? For estimation purposes, Wikipedia Statistics shows there are approximately 26M articles across all wikis. Not all of these will have a talk page, but many will. Assuming they all have talk pages ranging from just a post or two to a couple of thousand posts on the largest, full deployment of Flow to all talk pages (which never happened) would mean a lower bound of perhaps 100M individual replies stored in perhaps 20M separate discussion graphs (not today, or even within the first year, but an approximation of size within a few years). If each reply consumes 1 KB of space, that puts a lower bound of at least 100 GB on post content before we even get into metadata, indexes, etc.
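
The arithmetic behind this lower bound is simple enough to check directly. The sketch below assumes decimal units (1 KB = 1000 bytes, 1 GB = 10^9 bytes); the constant names are chosen for illustration.

```python
REPLIES = 100_000_000          # estimated individual replies
DISCUSSION_GRAPHS = 20_000_000  # estimated separate discussion graphs
BYTES_PER_REPLY = 1_000         # 1 KB per reply, decimal units assumed

total_gb = REPLIES * BYTES_PER_REPLY / 1e9
replies_per_graph = REPLIES / DISCUSSION_GRAPHS

print(f"{total_gb:.0f} GB of post content")
print(f"{replies_per_graph:.0f} replies per discussion graph on average")
```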

To help get an idea of the space required I applied the EchoDiscussionParser to one of the enwiki database dumps ( enwiki-latest-pages-meta-current10.xml-p000925001p001325000 ). Within this file it detected:

  • 43952 pages in either Talk: or *_Talk: namespaces
  • 211904 individual section headers
  • 514916 user signatures

That works out to an average of:

  • 5 sections per talk page
  • 2.5 signatures per section
  • 12 signatures per talk page
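
The averages can be reproduced from the counts above with a few lines of Python. The unrounded values come out slightly below the rounded figures in the list.

```python
pages = 43_952        # pages in Talk: or *_Talk: namespaces
sections = 211_904    # individual section headers
signatures = 514_916  # user signatures

sections_per_page = sections / pages
signatures_per_section = signatures / sections
signatures_per_page = signatures / pages

print(f"{sections_per_page:.1f} sections per talk page")
print(f"{signatures_per_section:.1f} signatures per section")
print(f"{signatures_per_page:.1f} signatures per talk page")
```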

These pages were of course built up over time, but give us a general idea of the size of the problem we need to handle.

It may be worthwhile to re-run this code and split the stats between article talk pages and user talk pages. There will be several orders of magnitude more article talk pages than user talk pages, so if they have different characteristics that may be useful information.

Caching possibilities


See /Memcache

Interactions with other systems


When Flow is enabled on a page (usually a Talk page), the page becomes a Flow board. MediaWiki creates a new revision of the page with a different contentmodel property ('flow-board' instead of 'wikitext').

A Flow board is different from a wiki page: it stores its content, revisions, and metadata in an external cross-wiki Flow database. If you query the MediaWiki API for a Flow board's content, or use Special:Export on a Flow board, you will see only a pointer to a UUID in this external database.

On WMF wikis, Flow stores posts as HTML, using the output from Parsoid. When you edit a post as wikitext, the wikitext shown is generated from that HTML, and potentially different from any original wikitext.

Flow implements many expected interactions with other parts of MediaWiki.

  • Flow generates Echo notifications
  • Flow adds flow-{edit-post,hide,delete,suppress} rights to $wgAvailableRights so that they are available for global groups/staff rights
  • Rendering of content:
    • Fire wikipage.content hook when new posts are loaded.
  • Filtering of edits:
  • Logging of user actions:
    • Flow inserts rows in RecentChanges, and formats them for display
    • Flow edits show up in CheckUser.
    • Flow inserts rows in Special:Contributions, and formats them for display, so user actions in Flow appear in Special:Contributions
    • Flow creates entries in deletion and suppression logs, and formats them for display
  • Flow creates entries in the IRC feed
  • On a Flow board, [View history] is replaced with its own implementation, as is viewing differences. Users can see history of an individual topic as well.
  • Flow does not interact with RevisionDelete

Flow detects references in items, stores them in new flow_wiki_ref and flow_ext_ref tables, and appends to the standard MediaWiki link tables. So Special:WhatLinksHere works for pages, links between Flow boards, images, and templates; and the "File usage" section of a File: page shows the Flow boards using the image.

Gerrit change 115860 adds a WhatLinksHereProps hook in MW 1.24 so that Flow can add "from the _header_; from a _post_" links to the line in WhatLinksHere.

Statistics


Adding a new topic increments the page and edit counts in Special:Statistics, since it creates a new page in the Topic namespace. But posts replying to a topic and edits to Flow board header/topic titles/posts do not, since they are not revisions in the regular MediaWiki page tables.

URL actions


You can add ?action=something to Flow URLs, although many regular page actions don't apply (delete, edit, veaction=edit, etc.). $wgFlowCoreActionWhitelist lists the actions that Flow doesn't override, including 'protect', 'watch', etc.

Flow boards accept actions such as view (default), and new Flow actions board-history, edit-header, etc.

Many more actions apply to individual topics and posts, identified by workflow=UUID. If the browser has JavaScript enabled, many GET requests with action URLs are replaced by in-page API calls.
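
As a rough sketch, such URLs can be assembled with Python's standard library. The flow_action_url helper, the example.org host, and the placeholder UUID are hypothetical; the URL shape is an assumption based on the /wiki/Title?workflow=UUID example earlier on this page, and the action names are ones mentioned in this document.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def flow_action_url(base: str, title: str, action: str,
                    workflow: Optional[str] = None) -> str:
    """Build a URL for a Flow action on a board, or on a topic if workflow is given."""
    params = {"action": action}
    if workflow is not None:
        params["workflow"] = workflow  # identifies the individual topic
    # Keep ':' and '/' unescaped so namespaced titles stay readable.
    return f"{base}/wiki/{quote(title, safe='/:')}?{urlencode(params)}"


if __name__ == "__main__":
    print(flow_action_url("https://example.org", "User_talk:Maryana", "board-history"))
    print(flow_action_url("https://example.org", "User_talk:Maryana", "lock-topic",
                          workflow="PLACEHOLDER_UUID"))
```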

See also
