Talk:Wikimedia Services/Revision storage for HTML and structured data: Use cases

NRuiz (WMF) (talkcontribs)
Erik Zachte (talkcontribs)
@NRuiz (WMF): Hadoop covers the data WMF extracts from its own databases, but the dumps are also used heavily outside WMF for all kinds of purposes, and shrinking them would improve download and processing times. Compression can only do so much. A random example I encountered today: https://ti.wikipedia.org/w/index.php?title=%E1%89%B2%E1%89%AA&action=history — 25 of its 29 revisions were about interwiki links. For popular topics there might be up to 200 or so language links, and as many revisions.
Those interwiki links have since been migrated to Wikidata, but the edit history is still there. My suggestion was to migrate those edits as well and replace them with dummy edits (timestamp and user only, no raw text). I know this sounds radical, and not exactly trivial to implement, but shouldn't we deal with our history bloat someday?
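To make "interwiki-only" concrete, here is a minimal sketch (Python; the regex and function names are illustrative assumptions, not an existing WMF tool) of how one might decide that a revision changed nothing besides interwiki links and is therefore a candidate for such a dummy edit:

import re

# Illustrative sketch only: decide whether an edit changed nothing besides
# interwiki links, so its stored text could be replaced by a stub while the
# revision's timestamp and user are kept. The regex assumes the plain
# [[xx:Title]] interwiki syntax and is naive (it would also match lowercase
# namespace links such as [[file:...]]).
INTERWIKI_RE = re.compile(r"\[\[\s*[a-z]{2,12}(?:-[a-z]+)*\s*:[^\]]*\]\]")

def strip_interwiki(wikitext: str) -> str:
    """Drop interwiki links and collapse leftover whitespace."""
    return re.sub(r"\s+", " ", INTERWIKI_RE.sub("", wikitext)).strip()

def interwiki_only_edit(old_text: str, new_text: str) -> bool:
    """True if the two revisions differ only in their interwiki links."""
    return strip_interwiki(old_text) == strip_interwiki(new_text)

# Example: a revision that merely added a Tigrinya language link.
old = "Some article text.\n[[en:Example]]\n"
new = "Some article text.\n[[en:Example]]\n[[ti:Example]]\n"
print(interwiki_only_edit(old, new))  # True -> candidate for a dummy revision

Applied over a pages-meta-history dump, a check like this would let the bulk of those interwiki-maintenance revisions be stubbed out while their attribution is preserved.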
Reply to "Dumps"
71.35.184.198 (talkcontribs)

I have removed analytics from the use cases because I do not think we need access to this data beyond what we already plan to do: importing text revisions into the cluster to generate dumps. In other words, the source of the data we need is the cluster, which we populate from the database, not any other storage.

Reply to "Analytics use cases"