Talk:Requests for comment/New sites system

Dantman (talkcontribs)

Since we're replacing interwiki with sites the constraints used by links tables are going to be broken. We're probably going to need code changes and perhaps some schema changes to fit them into the new sites system.

iwlinks has an iwl_prefix column and langlinks has an ll_lang column, both of which point to interwiki.iw_prefix.

Since we're going to be breaking things in this area anyways I think we should take this chance to replace iwlinks and langlinks with a sitelinks table. Besides the schema changes this will also make it possible to have things like project links/sister site links (i.e.: different si_types) without needing new tables.

# prefix
sitelinks {
  sl_from INT UNSIGNED,
  sl_prefix VARBINARY(25), # points to site_identifiers.si_key
  sl_title VARBINARY(255)
}

# site
sitelinks {
  sl_from INT UNSIGNED,
  sl_site ???, # either an INT UNSIGNED pointing to site_id or a VARBINARY pointing to site_global_key, we can discuss that later
  sl_title VARBINARY(255)
}

Both of these options have advantages and disadvantages that we need to sort out.

By using sl_prefix, the prefix method makes site links work only when a site has a local prefix. This means that extensions adding interlanguage links to pages from other sources can't add arbitrary interlanguage links to a page.

On the other hand, the site method effectively deals with that issue (and also technically leaves the window open for us to let normal interwiki links be added as language links with something like {{#langlink:Wikipedia:Foo}} or {{#langlink:enwikipedia|Foo}}). But because it points to a site directly, when a site_identifiers row for a site is removed, all sitelinks pointing to that site have to be refreshed, even when the prefix used in the page is completely different.

I think we might have to review how the {iw,lang}links system works and is used and figure out how site links will have to work, especially for cases like modifications to the interwiki table. Thinking about it, additions to interwiki/site_identifiers may actually be worse: all of a sudden pagelinks become interwiki links.
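A minimal sketch of the site variant, written in Python against an in-memory SQLite database rather than MediaWiki's real schema. Column types are simplified, and the sample rows (enwiki, dewiki, page id 10) and the helper name links_from are made up for illustration. It only shows the trade-off described above: a link stored against a site still exists when that site has no local prefix, so whatever renders it has to cope with a missing si_key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE site (
  site_id INTEGER PRIMARY KEY,
  site_global_key TEXT NOT NULL UNIQUE
);
CREATE TABLE site_identifiers (
  si_site INTEGER NOT NULL,
  si_type TEXT NOT NULL,
  si_key TEXT NOT NULL
);
-- "site" variant: a link row points at the site, not at a local prefix.
CREATE TABLE sitelinks (
  sl_from INTEGER NOT NULL,
  sl_site INTEGER NOT NULL,
  sl_title TEXT NOT NULL
);
""")

# Two known sites, but only enwiki has a local prefix (hypothetical data).
conn.executemany("INSERT INTO site VALUES (?, ?)", [(1, "enwiki"), (2, "dewiki")])
conn.execute("INSERT INTO site_identifiers VALUES (1, 'equivalent', 'en')")

# Page 10 links to both sites; the dewiki link could come from an
# external source that does not rely on a local prefix.
conn.executemany(
    "INSERT INTO sitelinks VALUES (?, ?, ?)",
    [(10, 1, "Foo"), (10, 2, "Foo")],
)

def links_from(page_id):
    """Return (global key, local prefix or None, title) for each link on a page."""
    rows = conn.execute(
        """SELECT s.site_global_key, si.si_key, sl.sl_title
           FROM sitelinks sl
           JOIN site s ON s.site_id = sl.sl_site
           LEFT JOIN site_identifiers si ON si.si_site = sl.sl_site
           WHERE sl.sl_from = ?
           ORDER BY s.site_global_key""",
        (page_id,),
    )
    return rows.fetchall()
```

With the prefix variant the dewiki row could not be stored at all, since there would be no si_key for sl_prefix to point to.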

93.220.88.78 (talkcontribs)

site_identifiers contains the equivalent information to what was interwiki.iw_prefix. si_type="interwiki" for siter-links and si_type="equivalent" for language links. I don't see how that breaks anything.

Also, I suggest not making this change even more complex by messing with the database some more. Generalizing langlinks and iwlinks is nice, but can be done as an optional follow up. This should not block the deployment of the Sites stuff.

The change as proposed does not even replace the interwiki table, Title will still use that. Sites is supplying an alternative/additional mechanism for maintaining information about external sites. It may *eventually* replace interwiki, but that is what is being discussed here.

So, again: what you propose seems a good idea to me, but it's two steps ahead and should in no way be blocking deployment of the change as proposed.

This post was posted by 93.220.88.78, but signed as Daniel Kinzler (WMDE).

Daniel Kinzler (WMDE) (talkcontribs)

Ooops, forgot to log in.

Dantman (talkcontribs)
site_identifiers contains the equivalent information to what was interwiki.iw_prefix. si_type="interwiki" for siter-links and si_type="equivalent" for language links. I don't see how that breaks anything.

The problem is the case where you have a site link not using a local prefix. eg: Added by Wikidata, some central interwiki extension, etc...

Reply to "Sitelinks"

Patch removing core sites system

Parent5446 (talkcontribs)

Just an FYI of this patch: gerrit:141724. It removes the Sites classes from core, since they are not used anywhere.

Reply to "Patch removing core sites system"
Sharihareswara (WMF) (talkcontribs)

Hi! Will you be completing this RfC?

Dantman (talkcontribs)

This was a joint RFC between me and the Wikidata team. They have implemented bits from this RFC and are using it in Wikidata, however they did not seem interested in finishing the implementation, and I do not have the time for it.

Sharihareswara (WMF) (talkcontribs)

Since neither Daniel nor the Wikidata team has time/interest in working on this I am marking it abandoned as a way of saying that it's dormant and does not need any decisions made; anyone is welcome to reactivate it and notify wikitech-l if they are interested in working on it.

Reply to "Status"
Nemo bis (talkcontribs)

Is this going to address this bug?

Dantman (talkcontribs)

Not necessarily.

Dantman (talkcontribs)

However, if we add a good UI and sync things properly, it may not be an issue anymore.

At the least adding local prefixes someplace like meta won't be as disliked and can be used to pretty much address that issue.

Nemo bis (talkcontribs)

Local interwikis are a horrible idea and I don't see how they could be relevant: the use cases were all of global use. Adding more interwikis for that stuff is easy and not disliked at all, but requires learning many codes.

Dantman (talkcontribs)

Not local to the wiki. Local to a central wiki. The request basically wanted an interwiki with interwiki links inside that. If you add forwarding interwikis to a central place like Meta (something easier with a nice ui that makes people not care about it not being dynamic) then you can get the same result by using Meta as that central interwiki.

Reply to "Bug 24748 – Create generic Wikimedia interwiki"

Database schema proposal

Dantman (talkcontribs)

Denny Vrandečić proposed this database schema on the page.

-- Holds all the sites known to the wiki.
-- This includes their associated data and handling configuration.
-- In case a synchronization tool is used (ie Wikibase), the table
-- can be obtained from an external source, in which case
-- they should not be modified locally.
CREATE TABLE /*_*/site (
  -- Numeric id of the site
  site_id                    int unsigned        NOT NULL PRIMARY KEY AUTO_INCREMENT,

  -- Global identifier for the site, ie enwiktionary
  site_global_key            varchar(25)         NOT NULL,

  -- Type of the site, ie SITE_TYPE_MW
  site_type                  int unsigned        NOT NULL,

  -- Group of the site, ie SITE_GROUP_WIKIPEDIA
  site_group                 int unsigned        NOT NULL,

  -- Base URL of the site, ie http://en.wikipedia.org
  site_url                   varchar(255)        NOT NULL,

  -- Path of pages relative to the base url, ie /wiki/$1
  site_page_path             varchar(255)        NOT NULL,

  -- Path of files relative to the base url, ie /w/
  site_file_path             varchar(255)        NOT NULL,

  -- Language code of the site's primary language.
  -- We do not have real multilingual handling here by design,
  -- as implementing it would require expensive changes in core
  -- and would overcomplicate things. If you have a multilingual
  -- site, for instance imdb, you can just create multiple rows
  -- for it, ie imdben and imdbbe.
  site_language              varchar(10)         NOT NULL,

  -- Type dependent site data.
  site_data                  blob                NOT NULL
) /*$wgDBTableOptions*/;

-- Holds all the local site keys and data for the sites in site
CREATE TABLE /*_*/sitelocal (
  -- local key
  sitelocal_key              VARCHAR(25)         NOT NULL,
  -- Key to site.site_id
  site_id                    int unsigned        NOT NULL,
  -- If the site should be linkable inline as an "interwiki link" using
  -- [[sitelocal_key:pageTitle]].
  sitelocal_link_inline      bool                NOT NULL,

  -- If equivalent pages of this site should be listed.
  -- For example in the "language links" section.
  sitelocal_link_navigation  bool                NOT NULL,

  -- If site.tld/path/key:pageTitle should forward users to the page on
  -- the actual site, where "key" is the local identifier.
  sitelocal_forward          bool                NOT NULL,

  -- Type dependent site config.
  -- For instance if template transclusion should be allowed if it's a MediaWiki.
  sitelocal_config           blob                NOT NULL
) /*$wgDBTableOptions*/;

Mapping of use cases to schema

  • (1) GlobalIDs: site.site_global_key
  • (2) Multiple IDs: there can be several sitelocal_key for a single site_id
  • (3) Types and Typed data: site_type and site_data, for local differences also sitelocal_config
  • (4) Languages: site_language
  • (5) Arbitrary language links: sitelocal_link_inline, sitelocal_link_navigation
  • (6) Groups: site_group
  • (7) Custom URLs: site_url, site_page_path, site_file_path, as well as further data in site_data and sitelocal_config
  • (8) Unprefixed sites: no entry in sitelocal for a site_id means an unprefixed site, basically
  • (9) Synchronization: the split between site and sitelocal splits the global and local data
  • (10) Site title: not present now. Is there consensus for this?
  • (11) iw_api: covered by site_url and site_file_path (or site_data)
  • (12) iw_wikiid: covered either by site_global_key or site_data
  • (13) iw_local: sitelocal_forward
  • (14) iw_trans: sitelocal_config or site_data
  • (15) UI: not covered by schema
Denny Vrandečić (WMDE) (talkcontribs)
Dantman (talkcontribs)

It's missing the type and group being varbinary instead of ints as was said to be fixed there. Also the dropping of site_file_path. However I still don't like the site_url and site_page_path separation. IMHO the actual url still looks like type specific data.

Btw when I said "Do we want to split the data into two different tables?" I was mostly serious with that as a question. I haven't quite figured out if that's a good idea or not. The one technical reason to do that I can think of would be to share the database. But when I think about it again trying to manage it in that situation where a tiny change on one wiki suddenly affects every wiki makes me think things could quickly go wrong. So I don't know which is the option to use yet. We also need to have a separate discussion on whether the site table is going to be a first-class table of data or an index built up of configured sources. That decision will probably also affect what we do in this part of the schema.

Notes on sitelocal:

  • Presumably sitelocal is a multi-row 0+ table. Do we need link_inline anymore? Is there actually a situation where we can have an interwiki prefix and have it be a language link but not be an interwiki link?
  • Also sitelocal_config. Do we really want separate site config for each prefix? ie: If Asdf: and en: point to the same site, is there any reason only one of them should be usable in interwiki transclusion?
    • If we can't come up with a use for it I'd like to avoid having the table storing our prefixes have extra data besides the flag that says whether the prefix is an interlanguage prefix. The UI for separating interlanguage and interwiki links is one thing. For that we would just let the user input a comma-separated list of interwiki prefixes and a separate box would have a comma-separated list of interlanguage prefixes. But if we add any more data than that to the prefix the UI suddenly explodes from just editing a simple form of site data to something with subforms containing configuration for every single prefix.
Jeroen De Dauw (talkcontribs)
We also need to have a separate discussion on whether the site table is going to be a first-class table of data or an index built up of configured sources.

The thing obviously needs to work on single wiki installs. So if we make this inherently be an index, we need to introduce another thing storing the site data, which would presumably be similar to what we're proposing now, and be completely useless for very nearly everyone. I don't quite understand how seeing the table as primary data by default will cause problems when some code decides to use it as index - obviously other code interacting with the table should be aware of this, or use some suitable interface to the table that is, but what would be different in the table??

Jeroen De Dauw (talkcontribs)

This is more like what I had in mind:

-- Holds all the sites known to the wiki.
-- This includes their associated data and handling configuration.
-- In case a synchronization tool is used (ie Wikibase), the table
-- can be obtained from an external source, in which case
-- they should not be modified locally.
CREATE TABLE /*_*/site (
  -- Numeric id of the site
  site_id                    int unsigned        NOT NULL PRIMARY KEY AUTO_INCREMENT,

  -- Global identifier for the site, ie 'enwiktionary'
  site_global_key            varbinary(32)       NOT NULL,

  -- Type of the site, ie 'mediawiki'
  site_type                  varbinary(32)        NOT NULL,

  -- Group of the site, ie 'wikipedia'
  site_group                 varbinary(32)        NOT NULL,

  -- Source of the site data, ie 'local', 'wikidata', 'my-magical-repo'
  site_source                varbinary(32)        NOT NULL,

  -- Domain of the site in reverse order, ie 'org.mediawiki.www'
  -- This field is an index for lookups and is built from type specific data in site_data.
  site_domain               varchar(255)        NOT NULL,

  -- Protocol of the site, ie 'http://', 'irc://', '//'
  -- This field is an index for lookups and is built from type specific data in site_data.
  site_protocol             varchar(255)        NOT NULL,

  -- Language code of the site's primary language.
  -- We do not have real multilingual handling here by design,
  -- as implementing it would require expensive changes in core
  -- and would overcomplicate things. If you have a multilingual
  -- site, for instance imdb, you can just create multiple rows
  -- for it, ie imdben and imdbbe.
  site_language              varbinary(32)       NOT NULL,

  -- Type dependent site data.
  site_data                  blob                NOT NULL,

  -- If site.tld/path/key:pageTitle should forward users to the page on
  -- the actual site, where "key" is the local identifier.
  site_forward              bool                NOT NULL,

  -- Type dependent site config.
  -- For instance if template transclusion should be allowed if it's a MediaWiki.
  site_config               blob                NOT NULL
) /*$wgDBTableOptions*/;

-- Holds all the local site keys and data for the sites in site
CREATE TABLE /*_*/site_identifiers (
  --   Key to site.site_id
  si_site                    int unsigned        NOT NULL,

  -- local key type, ie 'interwiki' or 'langlink'
  si_type                   varbinary(32)       NOT NULL,

  -- local key value, ie 'en' or 'wiktionary'
  si_key                    varbinary(32)       NOT NULL

) /*$wgDBTableOptions*/;

-- unique key on ( si_type, si_key )
  • Can now have an arbitrary amount of langlink or interwiki identifiers per site, eliminating the case where we were previously forced to duplicate stuff.
  • Killed site_path field as it's type specific (site_page_path is something we need for every site and something I want to keep separate from site_url so we can easily change the latter or select on it).
  • Modified field types of identifiers, types and lang to varbinary(32) to be consistent w/ core.
  • Removed split of config from the main table - I see more hassle arise with having it split than when not split.
  • Removed site_link_inline and site_link_navigation as they are obsolete due to the key link table
Dantman (talkcontribs)

Yeah this looks better. Course I'm still not keen on the site_url/site_page_path separation and think it should be type data.

I think we can drop site_link_inline. It was needed before because local_key could be a copy of the global key and not be a prefix. But now that a site with no interwiki links simply is one with no site_identifiers (site_prefix?) rows I can't think of a purpose for site_link_inline.

Also if you don't mind the bikeshedding:

  • site_global_key can probably just be site_global, or maybe just site_key since we're already talking about the unique data
  • si_key_type is basically the replacement for site_link_navigation, so we don't need site_link_navigation.
  • For si_site_id using si_site would match the other tables, see rev_page.
  • si_key_type can probably be just si_type. (Btw, I like 'type', I couldn't figure out what to call it before)
Jeroen De Dauw (talkcontribs)
Course I'm still not keen on the site_url/site_page_path separation and think it should be type data.

In order to be able to make a link to a site, we need to know where to put the page name. In the current interwiki table this is done with a single field holding for instance https://encrypted.google.com/?q=$1. I've now split this up into the base url and the part being appended. I don't see how this is type specific. It's true that the part being appended has a typical format per type, and often even tends to have the same value per type of site (ie /wiki/ suggests MediaWiki). But those are the values of the field being type specific, not the field itself.

It'd be great to find an approach with which we're both happy ofc - I'd love to see a suggestion coming from you. Just moving the field into ALL of the type classes would be rather stupid obviously (and makes it apparent it's not type specific), so that's not an approach I could settle for.

I think we can drop site_link_inline.

Definitely, gone now :)

site_global_key can probably just be site_global, or maybe just site_key since we're already talking about the unique data

Disagree - the current name is clearer, and the 4 extra bytes are not going to kill anyone. site_global could be confused with a setting indicating if it's a global site or not (ie you just wasted a minute of every new dev looking at the code).

For si_site_id using si_site would match the other tables, see rev_page.

You're right, did not know of this "convention". Updated now.

si_key_type can probably be just si_type.

Yeah.

Dantman (talkcontribs)

I don't really see moving the url stuff into the type data as strange, though I also don't see types using the same format for this.

Here, I'll give some examples of the possible situation I've been thinking of the whole time. Where site_url and site_*_path are gone and we just use site_data. (Using JSON so you can read it)

A GenericSite type site_data (just a url with a $1 replacement)

{ "url": "https://encrypted.google.com/?q=$1" }

A MediaWikiSite type site_data (data in the same format we always work with):

{ "server": "//mediawiki.org", "script_path": "/w", "article_path": "/wiki/$1" }

A GerritSite type site_data (A base url, if we used something like https://git.wikimedia.org/gerrit/r/4016 instead of https://gerrit.wikimedia.org/r/4016 the base_url would be https://git.wikimedia.org/gerrit so it's not the same as server in MWSite) that knows the differences between change numbers, change ids, and commit hashes and knows what url to build:

{ "base_url": "https://gerrit.wikimedia.org" }

A very custom TwitterSite type site_data which doesn't need any url and does special things like making [[twitter:@nadir_seen_fire]] point to a profile while [[twitter:MediaWiki]] links to a search of tweets (yes this one is a little ridiculous but I have a feeling we'll end up with some people wanting some types so custom that the type itself doesn't want any instruction what the url is):

{}
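To make the idea concrete, here is a rough Python sketch (not MediaWiki code) of type classes that build their URLs purely from site_data. The class names follow the examples above; get_url, SITE_TYPES and load_site are hypothetical names, and the encoding rules (spaces to underscores for MediaWiki, percent-encoding elsewhere) are assumptions made for the example.

```python
import json
from urllib.parse import quote

class GenericSite:
    """Site whose site_data is just a URL pattern with a $1 placeholder."""

    def __init__(self, site_data: str):
        self.data = json.loads(site_data)

    def get_url(self, title: str) -> str:
        return self.data["url"].replace("$1", quote(title, safe=""))

class MediaWikiSite(GenericSite):
    """Site described by server / script_path / article_path."""

    def get_url(self, title: str) -> str:
        # Assumption: spaces become underscores and ':' and '/' stay readable.
        encoded = quote(title.replace(" ", "_"), safe=":/")
        return self.data["server"] + self.data["article_path"].replace("$1", encoded)

# Hypothetical mapping from a site_type value to the class that handles it.
SITE_TYPES = {"generic": GenericSite, "mediawiki": MediaWikiSite}

def load_site(site_type: str, site_data: str) -> GenericSite:
    return SITE_TYPES[site_type](site_data)

google = load_site("generic", '{ "url": "https://encrypted.google.com/?q=$1" }')
mw = load_site(
    "mediawiki",
    '{ "server": "//mediawiki.org", "script_path": "/w", "article_path": "/wiki/$1" }',
)
```

The caller never looks inside site_data; it only asks the type class for a URL, which is what lets a type like TwitterSite ignore URLs in its data entirely.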
Jeroen De Dauw (talkcontribs)

You can do that yes, but then you cannot:

  • Link to a site's domain (w/o doing evil regex stuff)
  • List sites by domain (w/o doing evil regex stuff)
  • Select sites based on domain
  • Display all sites on a specific domain
  • Update the domain of a site (w/o doing evil regex stuff)

Really, what's the harm done in having a page_path field for all sites? It will make the above things a lot easier/nicer and for those weird edge case site types you can always override behavior in their associated site class.

Dantman (talkcontribs)

Hmmmm... ok I do see a use for those. (Although I don't know if a column is necessary for 1 or 5, and 2-4 are almost the same thing)

How would you handle protocol then?

site_url is basically a user-inputted string. Not only can it be http:// or https://, it can also be protocol-relative (//), and technically there is nothing stopping someone from adding (freenode, irc://irc.freenode.net/$1) so they can make irc links like freenode:mediawiki ( ;) in fact that's already being done!!!).

So using site_url to do things by-domain might not work so well.

How about this instead.

We store the url data inside of site_data. A type class has a method that returns the domain of a site (we can probably make a trick default implementation that calls $this->getUrl( '' ); and then uses wfParseUrl to get the domain). Using that method we store a new site_domain column we index and use in queries.

And if you really do want to make it so we can use this in the ui instead of storing the raw domain name we'll store a reverse-dot-postfixed domain like "org.mediawiki.www." so that we can optimize queries like site_domain LIKE 'org.mediawiki.%'. The reversal keeps the heaviest information at the start properly inside the index data and the trailing dot lets us make a mediawiki.org query match www.mediawiki.org without matching mediawiki-sucks.org in the same query or requiring a separate test for complete equality.
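A small Python sketch of how such a site_domain value could be derived. The function name reversed_domain is hypothetical, and Python's urlsplit stands in here for wfParseUrl; this illustrates the format only, not the actual implementation.

```python
from urllib.parse import urlsplit

def reversed_domain(url: str) -> str:
    """Return the host of `url` reversed and dot-terminated, e.g. 'org.mediawiki.www.'.

    Returns an empty string when the URL has no host part.
    """
    host = urlsplit(url).hostname or ""
    if not host:
        return ""
    # Reverse the labels and append the trailing dot used for LIKE matching.
    return ".".join(reversed(host.split("."))) + "."
```

Because only the host is used, http://, https://, protocol-relative // and even irc:// URLs all end up in the same comparable form.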

Jeroen De Dauw (talkcontribs)

Ok, modified schema accordingly. And since we're adding such an index anyway, also include a field for the protocol.

Dantman (talkcontribs)

Don't need site_page_path anymore, right?

Also site_domain needs that trailing . for LIKE queries to be effective.

Jeroen De Dauw (talkcontribs)

Oops - maintaining the stuff in LQT sort of is a pain :)

Don't see why we need the dot - can you explain? Either way, that does not affect the schema :)

Dantman (talkcontribs)

I mentioned before but I'll try to clarify.

The dot acts as a guarantee that every chunk of the domain will terminate with a dot even when it is the last chunk. It allows us to do partial matches on a domain using a single LIKE, instead of either a LIKE combined with an equality test or a lone LIKE that makes bad matches.

For example if we want to do a match on anything ending in mediawiki.org (which is almost always what you really want) we would do a query like this site_domain LIKE 'org.mediawiki.%'. Because mediawiki.org is 'org.mediawiki.' this LIKE query will match both mediawiki.org and www.mediawiki.org.

However if we use just 'org.mediawiki' instead we can't do this with one simple test. If we do site_domain LIKE 'org.mediawiki.%' then we won't match mediawiki.org without the www. If we do site_domain LIKE 'org.mediawiki%' we will match mediawiki-something.org in addition to mediawiki.org, which is a different domain name. The only option is a long query: site_domain = 'org.mediawiki' OR site_domain LIKE 'org.mediawiki.%'.

So reversing and postfixing together keep the important parts in the index and make it possible to match the base domain name in a single query (the query we will likely be using the most).

Additionally if databases are using sorting similar to `sort` then it is also important to keep domain names together.

org.mediawiki
org.mediawiki-something.www
org.mediawiki.www
org.mediawikisomething.www

vs.

org.mediawiki-something.www.
org.mediawiki.
org.mediawiki.www.
org.mediawikisomething.www.

Basically without the trailing . an ORDER BY would sort mediawiki.org separately from www.mediawiki.org if a mediawiki-something.org or www.mediawiki-something.org was also in the database.
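The matching and sorting behaviour described in this post can be checked with a short Python sketch. It uses SQLite's LIKE and default byte-order sorting as a stand-in; MySQL collations may order punctuation differently, so treat the ordering part as an assumption about a binary-style sort. The helper name match is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE site (site_domain TEXT NOT NULL)")

# Reversed domains stored WITH the trailing dot.
with_dot = [
    "org.mediawiki.",
    "org.mediawiki.www.",
    "org.mediawiki-something.www.",
    "org.mediawikisomething.www.",
]
conn.executemany("INSERT INTO site VALUES (?)", [(d,) for d in with_dot])

def match(pattern):
    """Return all site_domain values matching a LIKE pattern, sorted."""
    rows = conn.execute(
        "SELECT site_domain FROM site WHERE site_domain LIKE ? ORDER BY site_domain",
        (pattern,),
    )
    return [row[0] for row in rows]

# One LIKE matches the bare domain and its subdomains, nothing else.
exact = match("org.mediawiki.%")

# Without the dot in the pattern, unrelated domains leak into the result.
too_broad = match("org.mediawiki%")

# The same names stored WITHOUT the trailing dot, sorted bytewise:
# the bare domain is separated from its own www. subdomain.
no_dot_sorted = sorted(d.rstrip(".") for d in with_dot)
```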

Dantman (talkcontribs)

Oh right. We don't allow an interlanguage link, etc... to have the same prefix as an interwiki link, vice versa, and whatnot.

So rather than a UNIQUE on (si_key, si_type) we probably want si_key to be our PRIMARY index.

Jeroen De Dauw (talkcontribs)

Really - we don't allow for that? You sure?

I thought we did allow for this. If we don't, this seems like either policy or unrelated technical restriction, neither of which should affect design of this. This table should not be specific to the interwiki and "interlanguage" stuff at all - if you need another type of identifier that can be the same as one of another type, that'd just work with the current schema proposal, while it'd require changes if we put that primary index on si_key.
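The difference between the two constraints under discussion can be seen in a small Python/SQLite sketch. The DDL is simplified from the proposal, and try_insert plus the sample rows are hypothetical; it is only meant to show which rows each constraint accepts.

```python
import sqlite3

# Proposal as it stands: uniqueness per (type, key) pair.
COMPOSITE = """CREATE TABLE site_identifiers (
  si_site INTEGER NOT NULL,
  si_type TEXT NOT NULL,
  si_key TEXT NOT NULL,
  UNIQUE (si_type, si_key)
)"""

# Alternative: si_key alone is the primary key.
KEY_ONLY = """CREATE TABLE site_identifiers (
  si_site INTEGER NOT NULL,
  si_type TEXT NOT NULL,
  si_key TEXT NOT NULL PRIMARY KEY
)"""

def try_insert(ddl, rows):
    """Create the table from `ddl` and report whether all rows can be inserted."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    try:
        conn.executemany("INSERT INTO site_identifiers VALUES (?, ?, ?)", rows)
        return True
    except sqlite3.IntegrityError:
        return False
    finally:
        conn.close()

# The same key 'en' used once as an interwiki and once as a langlink identifier.
same_key_two_types = [(1, "interwiki", "en"), (1, "langlink", "en")]
```

Which behaviour is wanted depends on whether an identifier is meant to be a wiki-wide unique prefix or a per-type key, which is the open question here.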

Dantman (talkcontribs)

Yes we only allow one single unique prefix. Whether it's an interwiki or a language link is just a simple bit of meta information on that.

We created this table practically just for the interwiki prefix stuff.

How are our interwiki links supposed to work when [[foo:Bar]] can simultaneously point to an interwiki, an intersite, and a sister site, and possibly three different sites at the same time?

Where do you see local site identifiers being used besides our interwiki/interlanguage system?

Dantman (talkcontribs)

Let's make it site_prefix (sp_). The prefix si_ is already used by searchindex while sp_ is unused.

Jeroen De Dauw (talkcontribs)

Where do you see having these similar key prefixes cause problems? We never need to join this table against searchindex I would think...

If it could cause problems, we ought to change the prefix, but not at the cost of the table name being good. site_prefix implies it contains prefixes, while it does not (the identifiers/keys might be used as one obviously, but this does not make them prefixes).

Reply to "Database schema proposal"

First-class data or index

Dantman (talkcontribs)

We need to figure out what the site table is going to be. Our first class data source or a table like pagelinks indexed from other sources.

First-class

If it's a first-class data source like interwiki was we'll be doing all of our editing on the site table. Anything done through a web interface will rely on our limited log system. We'll need to come up with some way to do synchronization without making the UI and sync fight each other.

Advantages:

  • We don't have to write code for rebuilding the table.

Index

If the site table is an index we'll have a setting for configuring the source that site data comes from and the rows in the sites table will be rebuilt from that data instead of edited.

Advantages:

  • Synchronization can be done using that source configuration. We'll just have a source type that uses the site data from another wiki. Probably two, one that looks at the site table in another wiki's database and another that uses the API.
  • We will not be restricted to editing the site table. We can take our time implementing the UI for the sites system if we implement a source that reads sites from a text file first and use that. Or read it from a wiki page. Additionally when we do implement the web UI we can implement it with a proper system that tracks the history of modifications to each site instead of just using a log table.
Jeroen De Dauw (talkcontribs)

I'm a bit confused by apparently having to decide between either having it as first class data or as index. Seems like we can easily make it work as either depending on the use case.

This is what I propose:

  • The site table is first class by default but can act like an index
  • All interaction with the table is done through an interface that knows if it's an index or not using some new wiki configuration

That would be all in the initial patchset. We would follow up with:

  • Wikidata makes use of this interface and modifies the wiki config to indicate the site config is behaving as an index

Other people could write editing UIs or whatever on top of it without having to modify the site interface and without even having to care whether the info is coming from the table or some random other source.

Dantman (talkcontribs)

Mmmmm... ok, yeah I was starting to lean a little towards somewhere in the middle a little earlier.

How about some notes/changes to that:

  • Note that if we get the web UI out and make it the recommended method the site table will quickly turn into something that is always treated as an index.
  • This web UI is actually probably going to end up used for editing the local stuff even when you're using something like Wikidata to fill it.
  • How about instead of making it a firstclass/index boolean we call it a 'source'. If the source is the string that the web UI uses it knows that it can edit the global data since it inserted it. If the source is something like 'wikidata' then it knows that it can't edit it and must only let the user modify the local data that it manages.
  • If you don't mind, while we're not aiming for multiple sources here we could actually put source as a column in the database. It could be useful to deal with some situations like a wiki transitioning from local to global data. So that we know what sites from a wiki being turned into one that just reads the global data have not been put into the central database yet. It would also let us safely purge data from an old source without damaging new data added to the table.
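A rough Python sketch of how a per-row source string could drive those decisions. All names here (SiteRow, WEB_UI_SOURCE, can_edit_global_data, purge_source) are hypothetical, and the source values 'local' and 'wikidata' are just examples.

```python
from dataclasses import dataclass

# Assumed source string that the web UI writes for rows it owns.
WEB_UI_SOURCE = "local"

@dataclass
class SiteRow:
    global_key: str
    source: str

def can_edit_global_data(site: SiteRow, editor_source: str = WEB_UI_SOURCE) -> bool:
    """An editor may change global data only for rows it inserted itself."""
    return site.source == editor_source

def purge_source(sites, old_source):
    """Drop every row that came from `old_source`, keeping all other rows."""
    return [s for s in sites if s.source != old_source]

sites = [
    SiteRow("enwiki", "wikidata"),
    SiteRow("mycompanywiki", "local"),
]
```

During a transition from local to global data, the rows still marked 'local' are exactly the ones that have not been moved to the central database yet.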
Jeroen De Dauw (talkcontribs)
Note that if we get the web UI out and make it the recommended method the site table will quickly turn into something that is always treated as an index.

I don't understand this. Are you talking about what will happen for Wikipedia, or for MediaWiki installs in general? For the latter I really don't see how having an edit UI really affects the table being an index or not.

This web UI is actually probably going to end up used for editing the local stuff even when you're using something like Wikidata to fill it.

I don't understand this either - do you mean it could behave incorrectly here if the below is not implemented at all?

How about instead of making it a firstclass/index boolean we call it a 'source'. If the source is the string that the web UI uses it knows that it can edit the global data since it inserted it. If the source is something like 'wikidata' then it knows that it can't edit it and must only let the user modify the local data that it manages.

I'd prefer having settings such as:

  • site data = one of ( primary, editable index, non-editable index )
  • site config = one of ( primary, editable index, non-editable index )

This way the UI is not forced to care about which source it's actually coming from, and the logic to determine if it should allow editing remains in the site interface, which is IMO where it should be (since you can obviously have many editing UIs, APIs, etc).

put source as a column in the database

Sure. This will work if only the site data can come from an external source. If we also want to allow this for the config, we need a second field. So what would you suggest doing? Add site_source, add site_data_source and site_config_source, or something else?

Dantman (talkcontribs)
Note that if we get the web UI out and make it the recommended method the site table will quickly turn into something that is always treated as an index.

I don't understand this. Are you talking about what will happen for Wikipedia, or for MediaWiki installs in general? For the latter I really don't see how having an edit UI really affects the table being an index or not.

I just mean that if I go with making the web UI use a revision system like we do for edits instead of modifying the site table directly and everyone starts using that for actual editing of the sites list then we'll basically just have a web ui using sites as an index and a sync system using sites as an index and pretty quickly almost no-one will be using site as a first-class table.

This web UI is actually probably going to end up used for editing the local stuff even when you're using something like Wikidata to fill it.

I don't understand this either - do you mean it could behave incorrectly here if the below is not implemented at all?

This was just a note to provide the background to the rationale why we might want source instead of a boolean. It's combined with the above.

How about instead of making it a firstclass/index boolean we call it a 'source'. If the source is the string that the web UI uses it knows that it can edit the global data since it inserted it. If the source is something like 'wikidata' then it knows that it can't edit it and must only let the user modify the local data that it manages.

I'd prefer having settings such as:

  • site data = one of ( primary, editable index, non-editable index )
  • site config = one of ( primary, editable index, non-editable index )
This way the UI is not forced to care about which source it's actually coming from, and the logic to determine if it should allow editing remains in the site interface, which is IMO where it should be (since you can obviously have many editing UIs, APIs, etc.).

The line of thought behind a source string instead of primary/non-editable (what is an editable index?) is this:

  • The web UI manages its data with a revision system and indexes the site table off that, so it says the site table is an index
  • Wikidata/MediaWiki sync site data from a foreign source, so it says the site table is an index
  • In both situations the site table is set as an index. How does the web UI tell the difference and know if it is allowed to update the site table with its locally edited revision data?
put source as a column in the database

Sure. This will work if only the site data can come from an external source. If we also want to allow this for the config, we need a second field. So what would you suggest doing? Add site_source, add site_data_source and site_config_source, or something else?

Hmmm... I didn't think about that before.

I know I talked about things like having a text and page based interwiki source. But those are actually only things that come to mind when I think about what we'd do to make this easy for users to use soon. Honestly, what I really want is one really good web UI. Once that is done and we're using this global id based system with local prefixes, I don't really see any use for any other user interface anymore.

I do believe that if I put global data about sites inside of revisions in a web UI I'm probably going to do the same thing with local data. So whether we want site_source or two columns for this will depend on if you believe we're going to have multiple editing interfaces on one single wiki disagreeing on where the local data comes from. The idea of site_source came up because global data can come from anywhere (direct local db edit, wiki UI edit, synced from wikidata, synced from another database, synced from a wiki's API) which at the start I didn't think applied to local data that always came from somewhere on the same wiki.

Jeroen De Dauw (talkcontribs)
I just mean that if I go with making the web UI use a revision system like we do for edits, instead of modifying the site table directly, and everyone starts using that for actual editing of the sites list, then we'll basically just have a web UI using sites as an index and a sync system using sites as an index, and pretty quickly almost no one will be using site as a first-class table.

Oh, sure, if you add revisioning, then yes. That's not trivial to do nicely though, unless you wait till the ContentHandler stuff is fully merged into core and use that. Either way, we cannot block on the creation of such a system, so we need this to be able to work as primary data as well.

Jeroen De Dauw (talkcontribs)
The line of thought behind a source string instead of primary/non-editable (what is an editable index?) is this:
  • The web UI manages its data with a revision system and indexes the site table off that, so it says the site table is an index
  • Wikidata/MediaWiki sync site data from a foreign source, so it says the site table is an index
  • In both situations the site table is set as an index. How does the web UI tell the difference and know if it is allowed to update the site table with its locally edited revision data?

That is a reason to have a source field in the table, not to expose this field to the UI. The site interface would figure out if it's editable or not and provide this info to the UI, for instance in the form I suggested. So I suspect we're actually agreeing here? :)
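
To make the split concrete, here is a minimal Python sketch of a site interface that keeps the source string internal and only hands the UI one of the three settings suggested earlier. Everything named here (`Mode`, `WEB_UI_SOURCE`, `data_mode`, `ui_may_edit`) is hypothetical, and reading "editable index" as "an index that the local web UI itself filled" is an assumption, since the thread leaves that term undefined.

```python
from enum import Enum


class Mode(Enum):
    PRIMARY = "primary"
    EDITABLE_INDEX = "editable index"
    NON_EDITABLE_INDEX = "non-editable index"


# Hypothetical identifier the web UI would store in site_source.
WEB_UI_SOURCE = "web-ui"


def data_mode(site_source):
    """Derive the mode for a row from its (hypothetical) site_source value.

    Assumption: None means the row was written directly into the table,
    so the table itself is the primary data for that row.
    """
    if site_source is None:
        return Mode.PRIMARY
    if site_source == WEB_UI_SOURCE:
        return Mode.EDITABLE_INDEX
    # Anything else (e.g. 'wikidata') was synced from a foreign source.
    return Mode.NON_EDITABLE_INDEX


def ui_may_edit(site_source):
    """What the UI asks; it never needs to inspect the source string itself."""
    return data_mode(site_source) is not Mode.NON_EDITABLE_INDEX
```

With this shape, adding another foreign source only changes the site interface; any editing UI or API keeps calling `ui_may_edit`.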

Dantman (talkcontribs)

Sure, as long as the site system actually understands the difference. 'Editable' does not necessarily mean the table contents can be edited; it could actually mean something more like "Can I update this with my data?", and the answer to that might be no for one interface if another one added the row.
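
A rough Python sketch of that writer-relative check, assuming each interface identifies itself with the same kind of string that ends up in the source field. The function names, the dict standing in for the site table, and the `'wikidata'` / `'web-ui'` identifiers are all made up for illustration.

```python
def may_overwrite(row_source, writer):
    """A writer may only replace global data that it inserted itself."""
    return row_source == writer


def apply_update(sites, global_key, new_data, writer):
    """Try to write new_data for global_key on behalf of writer.

    `sites` is a plain dict standing in for the site table.
    Returns True if the row was written, False if another source owns it.
    """
    row = sites.get(global_key)
    if row is not None and not may_overwrite(row["site_source"], writer):
        return False
    sites[global_key] = {"site_source": writer, "data": new_data}
    return True


sites = {}
# The sync job inserts the row first, so it owns the global data.
synced = apply_update(sites, "enwiki", {"type": "mediawiki"}, "wikidata")
# The web UI then asks "can I update this with my data?" and is refused.
edited = apply_update(sites, "enwiki", {"type": "unknown"}, "web-ui")
```

So the same row is "editable" for one interface and not for the other, which a single boolean could not express.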

Jeroen De Dauw (talkcontribs)

To address

Add site_source, add site_data_source and site_config_source, or something else?

I guess the question comes down to whether we want to assume configuration will either be local or come from the same source as the actual site data. I suspect this assumption is going to hold for a while, and will keep holding forever for nearly all use cases. So what about just having the single site_source field for now? If at a later point more info is needed, we can add a new field (or even do something else and drop this one). This ought to be easy, as the table is small and the only thing aware of the fields and how they should be handled should be the sites interface.

Dantman (talkcontribs)

Yup, a single site_source defining the source of the global data for now pending future issues should work.
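
For illustration only, the agreed shape could look roughly like this, using Python's built-in sqlite3 so it can be run as-is. The table name `sites`, the `'local'` default, and the later `site_config_source` column are assumptions; only site_id, site_global_key and site_source are names that appear in this discussion, and the real schema would be MySQL rather than SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Single source column for now: it records where the global data came from.
conn.execute(
    """
    CREATE TABLE sites (
        site_id         INTEGER PRIMARY KEY,
        site_global_key TEXT NOT NULL UNIQUE,
        site_source     TEXT NOT NULL DEFAULT 'local'  -- assumed default
    )
    """
)
conn.execute(
    "INSERT INTO sites (site_global_key, site_source) VALUES (?, ?)",
    ("enwiki", "wikidata"),
)
conn.execute("INSERT INTO sites (site_global_key) VALUES (?)", ("mywiki",))


def add_config_source(db):
    """Possible later migration if config ever needs its own source.

    NULL is taken to mean 'local, or same source as the site data',
    which matches the assumption made for the single-column version.
    """
    db.execute("ALTER TABLE sites ADD COLUMN site_config_source TEXT")


def column_names(db):
    return [row[1] for row in db.execute("PRAGMA table_info(sites)")]
```

Because only the sites interface reads these columns, running something like `add_config_source(conn)` later should not affect any UI code.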
