Talk:Wikimedia Services/Revision storage for HTML and structured data: Use cases

NRuiz (WMF) (talkcontribs)
Erik Zachte (talkcontribs)
@NRuiz (WMF): Hadoop covers the data WMF extracts from its own databases, but the dumps are also used heavily outside WMF for all kinds of purposes, and shrinking them would improve download and processing times. Compression can only do so much. A random example I encountered today: https://ti.wikipedia.org/w/index.php?title=%E1%89%B2%E1%89%AA&action=history — 25 of its 29 revisions were about interwiki links. For popular topics there might be up to 200 or so language links, and as many revisions.
Those interwiki links have since been migrated to Wikidata, but the edit history is still there. My suggestion was to migrate those edits as well and replace them with dummy edits (timestamp and user only, no raw text). I know this sounds radical, and not exactly trivial to implement, but shouldn't we deal with our history bloat someday?
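To make "interwiki-only" concrete, here is a minimal sketch (Python; the regex and function names are illustrative assumptions, not an existing WMF tool) of how one might decide that a revision changed nothing besides interwiki links and is therefore a candidate for such a dummy edit:

import re

# Illustrative sketch only: decide whether an edit changed nothing besides
# interwiki links, so its stored text could be replaced by a stub while the
# revision's timestamp and user are kept. The regex assumes the plain
# [[xx:Title]] interwiki syntax and is naive (it would also match lowercase
# namespace links such as [[file:...]]).
INTERWIKI_RE = re.compile(r"\[\[\s*[a-z]{2,12}(?:-[a-z]+)*\s*:[^\]]*\]\]")

def strip_interwiki(wikitext: str) -> str:
    """Drop interwiki links and collapse leftover whitespace."""
    return re.sub(r"\s+", " ", INTERWIKI_RE.sub("", wikitext)).strip()

def interwiki_only_edit(old_text: str, new_text: str) -> bool:
    """True if the two revisions differ only in their interwiki links."""
    return strip_interwiki(old_text) == strip_interwiki(new_text)

# Example: a revision that merely added a Tigrinya language link.
old = "Some article text.\n[[en:Example]]\n"
new = "Some article text.\n[[en:Example]]\n[[ti:Example]]\n"
print(interwiki_only_edit(old, new))  # True -> candidate for a dummy revision

Applied over a pages-meta-history dump, a check like this would let the bulk of those interwiki-maintenance revisions be stubbed out while their attribution is preserved.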
Reply to "Dumps"
71.35.184.198 (talkcontribs)

I have removed analytics from the use cases because I do not think we need access to this data beyond what we already plan to do: importing text revisions into the cluster to generate dumps. In other words, the source of the data we need is the cluster, which we populate from the database, not any other storage.

Reply to "Analytics use cases"