User:ABaso (WMF)/Dev Summit 2018 full extract

This is the expanded extract for Adam Baso for the Wikimedia Developer Summit 2018 (refer to the condensed extract, where you can also see other extracts).

Structure Most Things with Schema.org

Can we standardize template parameters so as to maximize findability on discovery platforms like search engines, voice assistants, and social media?

The future of digital information will likely be brokered by major platform providers such as Google, Apple, Amazon, Microsoft, and a few international equivalents and social networks. We’re thankful for the hard work of small and large intermediaries extending the Wikimedia movement’s reach, even as we seek to identify pathways on these platforms for consumers to join our movement.

Today, these platform providers mine a considerable amount of information from Wikipedia and Wikidata, as well as other sister projects. They are becoming ever more sophisticated at extracting unstructured, semi-structured, and ontological constructs, and that advance is not going to stop. Yet we could help them, their users, and our users solve problems better by adopting the open Schema.org standard in Wikipedia pages mapped with templates and, ideally, federated and synchronized Wikidata properties (exported for machine consumption in the widely used JSON-LD format).
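
As a minimal sketch of what such an export could look like (written in Python purely for concreteness), the snippet below builds a Schema.org description of an article subject and serializes it as JSON-LD. The hard-coded subject and the idea of embedding the output in page HTML are illustrative assumptions, not a description of an existing Wikimedia feature.

```python
import json

# Minimal sketch: a Schema.org "Person" description of an article subject,
# serialized as JSON-LD. In practice the values would be drawn from infobox
# template parameters and/or Wikidata statements rather than hard-coded.
person_jsonld = {
    "@context": "https://schema.org",
    "@type": "Person",
    "name": "Ada Lovelace",
    "birthDate": "1815-12-10",
    "deathDate": "1852-11-27",
    "birthPlace": {"@type": "Place", "name": "London"},
    "sameAs": "https://www.wikidata.org/wiki/Q7259",
}

# A consumer (search engine, voice assistant, etc.) could read this directly,
# for example from a <script type="application/ld+json"> block in the page HTML.
print(json.dumps(person_jsonld, indent=2))
```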

The benefits could be manifold:

  1. Automata can operate with greater confidence, and Wikipedia will have even better presentation and placement in search engines (including our own) and other data-rich experiences.
  2. We provide a more consistent data model for template authors and the people and bots filling in template values, and the richly defined Schema.org entities provide a good target for all entities represented in the Wikipedia/Wikimedia corpora. A level of standardization reduces duplication of effort and inconsistencies among projects and languages, while making it easier to ensure that critical information is present for readers.
  3. We introduce an easier vector for mobile contribution, which could include simpler data entry, opportunities for gestural or other modes of data mapping and modeling, and so on.
  4. We can elevate an open standard and push its adoption forward while increasing the movement’s standing in the open standards community.
  5. Schema.org-compliant data is more amenable to machine learning models that cover data structures, the relations between entities, and the dynamics of sociotechnical systems. This applies not only to search automata, but also to machine learning models used for a range of practical applications such as vandalism detection, coverage analysis, and much more.
  6. This might give the education sector a means to teach students about knowledge creation, data modeling, and more. It might also give scientists and other practitioners a further standardized way to model the knowledge in their fields more directly on the Wikimedia projects.

What would it take to reach a more semantic web on the projects through this mechanism? And can/should this be done in harmony with the existing {{Template}} system?

This session will discuss the following:

  1. Are we aligned on the benefits, and if so, which ones?
  2. Implementation options. Specifically:
    1. Is it easy to relate the vast majority of existing and proposed Wikidata entity types and properties to existing Schema.org entities and properties?
    2. Can we extend templates so they can be mapped to Schema.org?
      1. Would it be okay to empirically derive the mapping by manual and semi-automated analysis at WMF/WMDE and apply it behind the scenes so as not to introduce extra work for template authors? Would that be sustainable? (A rough mapping sketch follows this list.)
      2. Could we make it easy for template authors to mark up their templates for Schema.org compatibility, with some level of enforcement? Could Schema.org attributes and entity types be autosuggested for template creators?
    3. Assuming we can do #1, what would it take to streamline multi-content revision (MCR) Schema.org data structures, or MCR Wikibase property clusters mapped to Schema.org, on defined entity types?
    4. Furthermore, if we can do #1 and #2, what’s to prevent us from letting templates, as they are, simply be the interface for Schema.org-compliant Wikibase entities and properties (e.g., by duck typing / autosynthesis)?
    5. By what means could data be bidirectionally synchronized between Wikipedia and Wikidata, with confidence and in a way compatible with patroller expectations? What storage and event processing would be needed? Can the systems be scaled to accommodate the arrival of real-time and increasingly fine-grained information?
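
To make the first two implementation questions above more concrete, here is a rough sketch (in Python, purely for illustration) of how a behind-the-scenes mapping might synthesize Schema.org JSON-LD from infobox template parameters via Wikidata properties. The infobox parameter names and the mapping tables are assumptions invented for this example; they are not an existing MediaWiki, Wikibase, or TemplateData API.

```python
# Rough sketch only. Mapping tables like these could be derived empirically
# (manual/semi-automated analysis) or declared by template authors, and then
# applied behind the scenes when a page is rendered or parsed.

# Wikidata property -> Schema.org property (a tiny, well-known subset).
WIKIDATA_TO_SCHEMA_ORG = {
    "P569": "birthDate",   # date of birth
    "P570": "deathDate",   # date of death
    "P19": "birthPlace",   # place of birth
    "P20": "deathPlace",   # place of death
}

# Hypothetical person-infobox parameter -> Wikidata property.
INFOBOX_PARAM_TO_WIKIDATA = {
    "birth_date": "P569",
    "death_date": "P570",
    "birth_place": "P19",
    "death_place": "P20",
}


def infobox_to_jsonld(name: str, params: dict) -> dict:
    """Synthesize a Schema.org JSON-LD object from infobox template values."""
    doc = {"@context": "https://schema.org", "@type": "Person", "name": name}
    for param, value in params.items():
        wikidata_property = INFOBOX_PARAM_TO_WIKIDATA.get(param)
        schema_property = WIKIDATA_TO_SCHEMA_ORG.get(wikidata_property)
        if schema_property:
            doc[schema_property] = value
    return doc


# Example template values as they might appear in a person infobox; parameters
# without a known mapping (here, "signature") are simply skipped.
print(infobox_to_jsonld("Ada Lovelace", {
    "birth_date": "1815-12-10",
    "death_date": "1852-11-27",
    "signature": "Ada Lovelace signature.svg",
}))
```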

Understand that this discussion is not about Schema.org’s sameAs field mapping to Wikidata entities, although it’s complementary. Additionally, it is not a suggestion to implement Semantic MediaWiki, although that may or may not be one possible implementation strategy (it hasn’t been a supported Wikimedia technology for a while).

It’s possible Wikimedia projects outside of Wikipedia could also adopt this approach, although Wikipedia is the starting point for so many people and machines.

I must acknowledge the earlier and existing efforts and discourse by the WMDE, Wikidata, and WikiCite teams & communities; the Semantic MediaWiki community; Magnus Manske; Andy Mabbett; Dan Brickley (I don’t know Dan); Gabriel Wicke; Yaron Koren; Parsoid; Citoid; DBPedia; and probably numerous others in this space.