Dumps use cases are much better addressed by the existing, working code that parses MediaWiki text (content) in Hadoop together with the MediaWiki edit history (metadata). Please see: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake#Schemas
and: