
Wikimedia Enterprise/Updates

From mediawiki.org
This is an archive of all technical updates for the Wikimedia Enterprise project.


2024 - Q2


Machine Readability

  • Goal: To include structured data in our feeds and to make unstructured Wikimedia content available in pre-parsed formats
  • Launches:
    • Structured Contents snapshots: early beta release of Structured Contents Snapshots endpoint, including pre-parsed articles (abstracts, main images, descriptions, infoboxes, sections) in bulk, and covering several languages. Alongside this release, we’re also making available a Hugging Face dataset of the new beta Structured Contents snapshots and inviting the general public to freely use and provide feedback. All of the information regarding the Hugging Face dataset is posted on our blog here.
    • Beta Structured Contents endpoint within On-demand API which gives users access to our team’s latest machine readability features, including the below:
      • Short Description (available in Structured Contents On-demand)
        • A concise explanation of the scope of the page written by Wikipedia and Wikidata editors. This allows rapid clarification and helps with topic disambiguation
      • Pre-parsed infoboxes (available in Structured Contents On-demand)
        • Infoboxes from Wikipedia articles to easily extract the important facts of the topic to enrich your entities.
      • Pre-parsed sections (available in Structured Contents On-demand)
        • Content sections from Wikipedia articles to easily extract and access information hidden deeper in the page.
      • Main Image (available in all Wikimedia Enterprise APIs)
        • The main image is curated by editors to represent a given article’s content. This can be used as a visual representation of the topic.
      • Summaries (aka `abstract`) (available in all Wikimedia Enterprise APIs)
        • Easy to ingest text included with each revision to provide a concise summary of the content without any need to parse HTML or Wikitext.
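
As a rough illustration of how the pre-parsed fields listed above might be consumed, the sketch below builds a compact "entity card" from one record. Only `abstract` is named in the text; the other keys (`name`, `description`, `image.content_url`, `infoboxes`, `sections`) and the sample values are assumptions made for the example, not a confirmed schema.

```python
import json

# Hypothetical Structured Contents record. Only `abstract` is named in the
# text above; every other key and value here is an illustrative assumption.
SAMPLE_RECORD = json.loads("""
{
  "name": "Example article",
  "description": "short description written by editors",
  "abstract": "A concise summary of the example article.",
  "image": {"content_url": "https://example.org/main-image.jpg"},
  "infoboxes": [{"name": "Infobox example", "fields": {"Founded": "2001"}}],
  "sections": [{"name": "History"}, {"name": "Reception"}]
}
""")


def entity_card(record):
    """Collect the pre-parsed fields into one flat dict, tolerating gaps."""
    image = record.get("image") or {}
    facts = {}
    for box in record.get("infoboxes") or []:
        # Merge the key/value facts of every infobox on the page.
        facts.update(box.get("fields") or {})
    return {
        "title": record.get("name"),
        "short_description": record.get("description"),
        "abstract": record.get("abstract"),
        "image_url": image.get("content_url"),
        "facts": facts,
        "section_titles": [s.get("name") for s in record.get("sections") or []],
    }


card = entity_card(SAMPLE_RECORD)
```

Because every lookup falls back to an empty value, the same function can be run over records where a beta field is absent.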

Content Integrity

  • Goal: To provide more contextual information alongside each revision to help judge whether or not to trust the revision.
  • Launches:
    • Maintenance Tags
      • Key enWiki tags that point to changes in credibility.
      • Small scale POC
    • Breaking News Beta [Realtime Streaming v2]
      • A boolean field detecting breaking news events to support prioritization when doing real-time ingestion of new Wikipedia pages
    • Liftwing ‘Revertrisk’
      • ORES ‘goodfaith’ and ‘damaging’ scores have been deprecated from our API responses. We are working on integrating the ‘revertrisk’ score into our API response objects.
    • No-Index tag per revision
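
One way a reuser might act on the breaking-news boolean described above is to let flagged events jump the ingestion queue. The sketch below is a minimal priority queue built on the standard library; the field name `is_breaking_news` is a placeholder assumption, since the text does not give the actual field name.

```python
import heapq
from itertools import count


class IngestQueue:
    """Queue that serves breaking-news events first, otherwise FIFO."""

    def __init__(self):
        self._heap = []
        self._sequence = count()  # tie-breaker that preserves arrival order

    def push(self, event):
        # `is_breaking_news` is a hypothetical name for the boolean field.
        priority = 0 if event.get("is_breaking_news") else 1
        heapq.heappush(self._heap, (priority, next(self._sequence), event))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


queue = IngestQueue()
queue.push({"name": "Routine edit", "is_breaking_news": False})
queue.push({"name": "Developing story", "is_breaking_news": True})
queue.push({"name": "Another routine edit"})
order = [queue.pop()["name"] for _ in range(len(queue))]
```

The sequence counter keeps events with equal priority in arrival order and ensures the event dicts themselves are never compared.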

API Usability

  • Goal: To improve the usability of Wikimedia Enterprise APIs
  • Launches:
    • Snapshots
      • Filtering available snapshots to group snapshots to download
      • Parallel downloading capabilities to optimize ingestion speeds
    • On-demand
      • Cross language project entity lookups to connect different language projects for faster knowledge graph ingestion.
      • NDJSON responses to enable data consistency across WME APIs
      • Filtering and customized response payloads
    • Realtime Batch
      • Filtering available batch updates to group files to download
      • Parallel downloading capabilities to optimize ingestion speeds
    • Realtime Streaming
      • Realtime Streaming reconnection performance improvement
      • Shared credibility signals accuracy results
      • Shared latency distribution for Realtime Streaming events
      • Parallel consumption - enable users to open multiple connections to a stream simultaneously
      • More precise tracking - empower users to reconnect and seamlessly resume message consumption from the exact point where they left off
      • Event filtering by data field/value to narrow down revisions
      • Customized response payloads to control event size
      • Proper ordering of revisions to remove accidental overwrites
      • Lower event latency to ensure faster updates
      • NDJSON responses to enable data consistency across WME APIs
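
For readers unfamiliar with NDJSON (one JSON object per line), the responses mentioned above can be consumed line by line without loading the whole payload into memory. This is a generic sketch using an in-memory sample, not a client for any specific endpoint; the `name` key in the sample is illustrative.

```python
import io
import json


def iter_ndjson(stream):
    """Yield one parsed object per non-empty line of an NDJSON text stream."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"invalid JSON on line {line_number}") from exc


sample = io.StringIO('{"name": "A"}\n\n{"name": "B"}\n')
names = [obj["name"] for obj in iter_ndjson(sample)]
```

The same generator works on any iterable of text lines, such as an open file or a decoded HTTP response body.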


2024 - Q1


Machine Readability

  • Goal: To include structured data in our feeds and to make unstructured Wikimedia content available in pre-parsed formats
  • Launches:
    • The Structured Contents (beta) endpoint which gives users access to our team’s latest machine readability features, including:
      • Short Description: A concise explanation of the scope of the page written by Wikipedia and Wikidata editors. This allows rapid clarification and helps with topic disambiguation.
      • Pre-parsed infoboxes to easily extract the important facts of the topic to enrich your entities.
      • Pre-parsed sections from Wikipedia articles to easily extract and access information hidden deeper in the page.
    • Main Image available in all Wikimedia Enterprise APIs
      • The main image is curated by editors to represent a given article’s content. This can be used as a visual representation of the topic.
    • Summaries (aka `abstract`) available in all Wikimedia Enterprise APIs:
      • Easy to ingest text included with each revision to provide a concise summary of the content without any need to parse HTML or Wikitext.

Content Integrity

  • Goal: To provide more contextual information alongside each revision to help judge whether or not to trust the revision.
  • Launches:
    • Maintenance Tags
      • Key enWiki tags that point to changes in credibility.
      • Small scale POC
      • Breaking News Beta [Realtime Streaming v2]
        • A boolean field detecting breaking news events to support prioritization when doing real-time ingestion of new Wikipedia pages
      • Liftwing
        • ORES ‘goodfaith’ and ‘damaging’ scores have been deprecated from our API responses. We are working on integrating the ‘revertrisk’ score into our API response objects.
      • No-Index tag per revision

API Usability

  • Goal: To improve the usability of Wikimedia Enterprise APIs
  • Launches:
    • Snapshots
      • Filtering available snapshots to group snapshots to download
      • Parallel downloading capabilities to optimize ingestion speeds
    • On-demand
      • Cross language project entity lookups to connect different language projects for faster knowledge graph ingestion.
      • NDJSON responses to enable data consistency across WME APIs
      • Filtering and customized response payloads
    • Realtime Batch
      • Filtering available batch updates to group files to download
      • Parallel downloading capabilities to optimize ingestion speeds
    • Realtime Streaming
      • Shared credibility signals accuracy results
      • Shared latency distribution for Realtime Streaming events
      • Parallel consumption - enable users to open multiple connections to a stream simultaneously
      • More precise tracking - empower users to reconnect and seamlessly resume message consumption from the exact point where they left off
      • Event filtering by data field/value to narrow down revisions
      • Customized response payloads to control event size
      • Proper ordering of revisions to remove accidental overwrites
      • Lower event latency to ensure faster updates
      • NDJSON responses to enable data consistency across WME APIs
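
The "more precise tracking" item above implies that a consumer remembers how far it has read so it can resume after a disconnect. A minimal sketch of that bookkeeping follows, assuming each event carries a `partition` number and an `offset` within that partition; both names are assumptions for illustration, as is the idea that the next position to request is the last offset plus one.

```python
import json


class StreamCheckpoint:
    """Track the highest offset seen per partition and persist it as JSON."""

    def __init__(self, offsets=None):
        self.offsets = dict(offsets or {})

    def record(self, event):
        # `partition` and `offset` are hypothetical event fields.
        partition = event["partition"]
        offset = event["offset"]
        if offset > self.offsets.get(partition, -1):
            self.offsets[partition] = offset

    def resume_positions(self):
        """Return the next offset to request for each partition."""
        return {p: o + 1 for p, o in self.offsets.items()}

    def dumps(self):
        return json.dumps(self.offsets, sort_keys=True)

    @classmethod
    def loads(cls, text):
        # JSON object keys are strings, so convert them back to integers.
        return cls({int(p): o for p, o in json.loads(text).items()})


checkpoint = StreamCheckpoint()
for event in [
    {"partition": 0, "offset": 10},
    {"partition": 1, "offset": 4},
    {"partition": 0, "offset": 11},
]:
    checkpoint.record(event)

restored = StreamCheckpoint.loads(checkpoint.dumps())
```

Keeping one position per partition also fits the parallel consumption item, since each connection can checkpoint the partitions it reads independently.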

2023 - Q4


Machine Readability

  • Goal: To include structured data in our feeds and to make unstructured Wikimedia content available in pre-parsed formats
  • Launch:
    • Sections (in Structured Contents beta endpoint)
      • Pre-parsed sections from Wikipedia articles to easily extract and access information hidden deeper in the page.
      • The Structured Contents (beta) endpoint which gives users access to our team’s latest machine readability features, including:
      • Short Description: A concise explanation of the scope of the page written by Wikipedia and Wikidata editors. This allows rapid clarification and helps with topic disambiguation.
      • Pre-parsed infoboxes to easily extract the important facts of the topic to enrich your entities.
    • Main Image link (in Snapshots and Realtime Streaming)
      • The main image is curated by editors to represent a given article’s content. This can be used as a visual representation of the topic.
    • Summaries (aka `abstract`) available in all Wikimedia Enterprise APIs:
      • Easy to ingest text included with each revision to provide a concise summary of the content without any need to parse HTML or Wikitext.

Content Integrity

  • Goal: To provide more contextual information alongside each revision to help judge whether or not to trust the revision.
  • Recent Launch:
    • Maintenance Tags
      • Key enWiki tags that point to changes in credibility.
      • Small scale POC
      • Slight change in schema
  • Launches:
    • Version Diffs [Realtime Streaming v2]
      • Quantitative word changes in a new revision grouped by word attributes to provide understanding of the risk of a new revision’s changes.
    • Breaking News Beta [Realtime Streaming v2]
      • A boolean field detecting breaking news events to support prioritization when doing real-time ingestion of new Wikipedia pages
    • Liftwing
      • ORES ‘goodfaith’ and ‘damaging’ scores have been deprecated from our API responses. We are working on integrating the ‘revertrisk’ score into our API response objects.
    • No-Index tag per revision

API Usability

  • Goal: To improve the usability of Wikimedia Enterprise APIs
  • Launches:
    • Snapshots
      • Filtering available snapshots to group snapshots to download
      • Parallel downloading capabilities to optimize ingestion speeds
    • On-demand
      • Cross language project entity lookups to connect different language projects for faster knowledge graph ingestion.
      • NDJSON responses to enable data consistency across WME APIs
      • Filtering and customized response payloads
    • Realtime Batch
      • Filtering available batch updates to group files to download
      • Parallel downloading capabilities to optimize ingestion speeds
    • Realtime Streaming
      • Shared credibility signals accuracy results
      • Shared latency distribution for Realtime Streaming events
      • Parallel consumption - enable users to open multiple connections to a stream simultaneously
      • More precise tracking - empower users to reconnect and seamlessly resume message consumption from the exact point where they left off
      • Event filtering by data field/value to narrow down revisions
      • Customized response payloads to control event size
      • Proper ordering of revisions to remove accidental overwrites
      • Lower event latency to ensure faster updates
      • NDJSON responses to enable data consistency across WME APIs
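
To make the "event filtering by data field/value" and "customized response payloads" items above concrete, the sketch below applies the same two ideas on the client side: keep only events whose fields equal given values, then trim each event to a chosen set of fields. The dotted-path convention, the `field`/`value` filter shape, and the sample event keys are assumptions for illustration, not the documented request format.

```python
def get_path(obj, path):
    """Follow a dotted path such as "is_part_of.identifier" through nested dicts."""
    current = obj
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def matches(event, filters):
    """True when every {"field": ..., "value": ...} filter equals the event's value."""
    return all(get_path(event, f["field"]) == f["value"] for f in filters)


def select_fields(event, fields):
    """Return a copy of the event containing only the requested dotted paths."""
    trimmed = {}
    for path in fields:
        value = get_path(event, path)
        if value is None:
            continue  # skip fields the event does not carry
        keys = path.split(".")
        target = trimmed
        for key in keys[:-1]:
            target = target.setdefault(key, {})
        target[keys[-1]] = value
    return trimmed


# Hypothetical events; the key names are illustrative only.
events = [
    {"name": "Page A", "is_part_of": {"identifier": "enwiki"}, "body": "long html"},
    {"name": "Page B", "is_part_of": {"identifier": "dewiki"}, "body": "long html"},
]
filters = [{"field": "is_part_of.identifier", "value": "enwiki"}]
slim = [
    select_fields(e, ["name", "is_part_of.identifier"])
    for e in events
    if matches(e, filters)
]
```

Doing this server-side, as the launch items describe, has the added benefit that the dropped events and fields never cross the network at all.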

2023 - Q3


Machine Readability

  • Goal: To include structured data in our feeds and to make unstructured Wikimedia content available in pre-parsed formats
  • Launches:
    • The Structured Contents (beta) endpoint which gives users access to our team’s latest machine readability features, including:
      • Short Description: A concise explanation of the scope of the page written by Wikipedia and Wikidata editors. This allows rapid clarification and helps with topic disambiguation.
      • Pre-parsed infoboxes to easily extract the important facts of the topic to enrich your entities.
      • Main Image link (in Snapshots and Realtime Streaming)
        • The main image is curated by editors to represent a given article’s content. This can be used as a visual representation of the topic.
    • Summaries (aka `abstract`) available in all Wikimedia Enterprise APIs:
      • Easy to ingest text included with each revision to provide a concise summary of the content without any need to parse HTML or Wikitext.


Content Integrity

  • Goal: To provide more contextual information alongside each revision to help judge whether or not to trust the revision.
  • Launches:
    • Version Diffs [Realtime Streaming v2]
      • Quantitative word changes in a new revision grouped by word attributes to provide understanding of the risk of a new revision’s changes.
    • Breaking News Beta [Realtime Streaming v2]
      • A boolean field detecting breaking news events to support prioritization when doing real-time ingestion of new Wikipedia pages

API Usability

  • Goal: To improve the usability of Wikimedia Enterprise APIs
  • Launches:
    • Snapshots
      • Filtering available snapshots to group snapshots to download
      • Parallel downloading capabilities to optimize ingestion speeds
    • On-demand
      • Cross language project entity lookups to connect different language projects for faster knowledge graph ingestion.
      • NDJSON responses to enable data consistency across WME APIs
      • Filtering and customized response payloads
    • Realtime Batch
      • Filtering available batch updates to group files to download
      • Parallel downloading capabilities to optimize ingestion speeds
    • Realtime Streaming
      • Parallel consumption - enable users to open multiple connections to a stream simultaneously
      • More precise tracking - empower users to reconnect and seamlessly resume message consumption from the exact point where they left off
      • Event filtering by data field/value to narrow down revisions
      • Customized response payloads to control event size
      • Proper ordering of revisions to remove accidental overwrites
      • Lower event latency to ensure faster updates
      • NDJSON responses to enable data consistency across WME APIs
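
The parallel downloading items above can be pictured as splitting one large file into byte ranges and fetching the ranges concurrently. The sketch below shows only that splitting and reassembly logic; `fetch_range` is a caller-supplied function (for example one that issues an HTTP request with a `Range: bytes=start-end` header), and whether a given endpoint honours range requests is an assumption to verify against its documentation.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(total_size, chunk_size):
    """Return inclusive (start, end) byte ranges covering total_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, total_size) - 1)
        for start in range(0, total_size, chunk_size)
    ]


def parallel_fetch(fetch_range, total_size, chunk_size, workers=4):
    """Fetch every range concurrently and join the parts in order."""
    ranges = split_ranges(total_size, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, so parts stay aligned.
        parts = pool.map(lambda r: fetch_range(*r), ranges)
        return b"".join(parts)


# Stand-in for a remote file so the sketch runs without network access.
payload = bytes(range(256)) * 10


def fake_fetch(start, end):
    return payload[start:end + 1]


downloaded = parallel_fetch(fake_fetch, len(payload), chunk_size=300)
```

Threads suit this workload because each worker spends most of its time waiting on I/O rather than computing.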


2023 - Q1&2


Machine Readability

  • Goal: To include structured data in our feeds and to make unstructured Wikimedia content available in pre-parsed formats
  • Recent Launch (in On-Demand and Realtime Batch):
    • Main Image link
      • The main image is curated by editors to represent a given article’s content. This can be used as a visual representation of the topic.
  • Launches:
    • “Summaries” available in all Wikimedia Enterprise APIs:
      • Easy to ingest text included with each revision to provide a concise description of the content without any need to parse HTML or Wikitext.

Content Integrity

  • Goal: To provide more contextual information alongside each revision to help judge whether or not to trust the revision.
  • Active Public Beta Offerings:
    • Version Diffs [Realtime Streaming v2]:
      • Quantitative word changes in a new revision grouped by word attributes to provide understanding of the risk of a new revision’s changes.
    • Breaking News:
      • A boolean field detecting breaking news events to support prioritization when doing real-time ingestion of new Wikipedia pages

API Usability

  • Goal: To improve the usability of Wikimedia Enterprise APIs
  • Launches:
    • Snapshots
      • Filtering available snapshots to group snapshots to download
      • Parallel downloading capabilities to optimize ingestion speeds
    • On-demand
      • Cross language project entity lookups to connect different language projects for faster knowledge graph ingestion.
      • NDJSON responses to enable data consistency across WME APIs
      • Filtering and customized response payloads
    • Realtime Batch
      • Filtering available batch updates to group files to download
      • Parallel downloading capabilities to optimize ingestion speeds
    • Realtime Streaming
      • Event filtering by data field/value to narrow down revisions
      • Customized response payloads to control event size
      • Proper ordering of revisions to remove accidental overwrites
      • Lower event latency to ensure faster updates
      • NDJSON responses to enable data consistency across WME APIs
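
The "accidental overwrites" that proper ordering is meant to remove happen when an older revision arrives after a newer one and replaces it in the reuser's store. Even with ordered delivery, a consumer can guard against this cheaply. The sketch below assumes each event carries a `page_id` and a `revision_id` that increases with newer revisions; both field names are placeholders for illustration.

```python
def apply_update(store, event):
    """Store the event unless a same-or-newer revision is already held.

    Returns True when the store was changed.
    """
    # `page_id` and `revision_id` are hypothetical field names.
    page_id = event["page_id"]
    current = store.get(page_id)
    if current is not None and current["revision_id"] >= event["revision_id"]:
        return False  # stale or duplicate delivery, keep what we have
    store[page_id] = event
    return True


store = {}
results = [
    apply_update(store, {"page_id": 1, "revision_id": 100, "text": "v100"}),
    apply_update(store, {"page_id": 1, "revision_id": 102, "text": "v102"}),
    # An out-of-order arrival of an older revision is ignored.
    apply_update(store, {"page_id": 1, "revision_id": 101, "text": "v101"}),
]
```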

2022-Q4: Machine Readability POCs, Credibility Signals, and a new Realtime API feed in Beta


New Realtime API is in closed beta:

  • As part of some of our larger infrastructural work to accommodate some of the expanding dataset needs, we
  • The beta Realtime API is a significant update and is a much more flexible event system providing:
    • Event filtering by data field/value to narrow down revisions
    • Customized response payloads to control event size
    • Proper ordering of revisions to remove accidental overwrites
    • Lower event latency to ensure faster updates
    • NDJSON responses to enable data consistency across WME APIs

Machine Readability:

  • We are still working out a larger roadmap, but have prioritized parsing out the first paragraph of Wikipedia articles (lede/summary) to add to the Wikimedia Enterprise APIs. Work on this feature is beginning.

Credibility Signals:

  • We’ve released the first version of “Diffs” into a closed beta: a JSON payload that quantifies changes in language between two revisions. We’re testing the feature across a few popular Wikipedia languages for accuracy and usefulness.
  • Our Breaking news signal has a proof of concept. We’re testing reliability and accuracy of results on this signal that detects if new entries on Wikipedia relate to exogenous breaking news.
  • More context on this work: What are Credibility Signals?
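
To give a feel for what "quantifying changes in language between two revisions" can mean, the sketch below counts words added and removed between two texts using Python's standard `difflib`. It is an illustrative stand-in, not the method behind the Diffs beta, and the output keys `words_added` and `words_removed` are invented for the example.

```python
from difflib import SequenceMatcher


def word_change_counts(old_text, new_text):
    """Count words added and removed between two revisions of a text."""
    old_words = old_text.split()
    new_words = new_text.split()
    added = removed = 0
    matcher = SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed += i2 - i1  # words present only in the old revision
        if tag in ("replace", "insert"):
            added += j2 - j1  # words present only in the new revision
    return {"words_added": added, "words_removed": removed}


counts = word_change_counts("the cat sat", "the dog sat down")
```

A production signal would likely also group the changed words by attributes, which this sketch does not attempt.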

We welcomed three new team members!

2022-Q3: Preparing the future of WME APIs


New API Versions in the works:

  • We’re working on a new version of the WME Snapshot, Realtime, and On-demand APIs with a focus on filtering/flexibility, scalability, and the ability to more easily expand provided data signals without overloading the architecture.

Credibility Signals:

  • Francisco joined to produce a longer-term roadmap of what Credibility Signals could be, based on a deep dive into research done over the summer. A summary of his work is to come in February 2023.

New Team Members:

  • Francisco Navas, Product Research lead for Content Integrity and Credibility Signals

2022-Q2: Self Registration and Credibility Signals

  • Self registration:
    • Responding to feedback around accessibility, we have been working to improve the ability for individuals and companies to get started working with Wikimedia Enterprise APIs. We are building a turnkey flow to sign up and get started using our products.
    • A major goal of this access is to give more interested people the ability to work with our APIs, as well as to garner more feedback to help us understand how we can tackle problems around using Wikimedia data outside of the Wikimedia ecosystem - something we have done quite a bit of qualitative research on - see Research Study below.
  • Credibility Signals:
    • To help Wikimedia data reusers understand what they are receiving, especially when ingesting all of the changes from a project in real time, we are creating a series of "signals", or individual data points, to give more context to what has changed in a revision as it happens. Our first effort on this front is focused on turning changes into quantitative measures like "text differences" on new revisions. We plan to release this work into beta to try it out, and to continue to evolve and experiment towards a better answer to some of these challenges.

2022-Q1: Release work, Uptime Monitoring, and new team members!

  • Release work:
    • We have received an enormous amount of great feedback on Phabricator and from initial users of Wikimedia Enterprise APIs, which has kept us busy improving the stability of the product.
    • We have had some delays in finishing our new architecture work and fully moving over to it, as we weighed that against prioritizing some of the new feature work on version 1.0. In the coming months, we plan to wrap up the new architecture work and release it as version 2.0.
  • Uptime Monitoring:
    • As our SLAs are a major value offering of Wikimedia Enterprise APIs, we have done quite a bit of work to improve the reliability of our uptime monitoring. You can see our status page here.
  • New Team Members
    • We welcomed Haroon Shaikh to the team as our Engineering Manager. He is welcomed at an important time as we start to take in great technical feedback on our projects to triage and improve.

2021-10: Website Launch and Wikimedia Dumps release!

  • Website Launch:
    • Our website is live! Check it out
    • The launch includes our initial product offering details along with some pricing and sign-up information.
  • Wikimedia Dumps release!
    • Wikimedia Dumps now has Wikimedia Enterprise dumps! Give it a download and please provide feedback to our team as you see fit.
    • Reminder: The Daily and Hourly Diffs are available on WMCS currently

2021-09: Launch! Building towards the next version and public access

  • V1 launched on 9/15/2021: This month we stepped out of beta and fully launched v1 of Wikimedia Enterprise APIs. V1 APIs include:
    • Real Time:
      • Streaming: Three real time streams of all of the current events happening across our projects. You can hold this connection open indefinitely, and it returns the same data model as the other APIs so that you can get all of the information in just one event object. The three streams are:
        • page-update: all revisions and changes to a page across the projects
        • page-delete: all page deletions to remove from records
        • page-visibility: highly urgent community driven events within the projects to reset
      • Batch: An API that returns a zip file containing all of the changes within a day for all "text-based" Wikimedia projects
    • Snapshot: An API that returns a zip file containing all of the changes within a day for all "text-based" Wikimedia projects
    • On-demand: An API that allows you to look up a single page in the same JSON structure as the other endpoints.
  • Implementing new architecture:
    • We are starting to implement the architecture that we've been working on in past months to move towards a more flexible system that is built around streaming data. More information to be shared on our mediawiki page soon.
    • We are also working on rewriting some of our existing launch work into the new process - this is a lot of repurposing code but making for a stronger and more scalable system.
    • After this, we will begin the implementation of Wikidata, more credibility signals, and flexible filtering into the suite of APIs.
  • Public Access:
    • The Daily and Hourly Diffs are available on WMCS currently
    • We are planning to launch with Wikimedia Dumps soon as we launch hashing capabilities in the APIs in v1! Stay tuned.
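
As a sketch of how a consumer might keep a local mirror in sync with the three streams described above (page-update, page-delete, page-visibility), the function below dispatches on the stream name. The `page_id` key and the handling of page-visibility events (dropping the stored content and flagging the page for a re-fetch) are assumptions made for illustration; the text only says these events signal that something should be reset.

```python
def apply_event(store, stream, event):
    """Apply one event from a named stream to a local page store."""
    key = event["page_id"]  # hypothetical identifier field
    if stream == "page-update":
        store[key] = event  # insert or replace with the latest revision
    elif stream == "page-delete":
        store.pop(key, None)  # remove the page if we hold it
    elif stream == "page-visibility":
        # Assumed handling: discard local content and mark it for refresh.
        if key in store:
            store[key] = {"page_id": key, "needs_refresh": True}
    else:
        raise ValueError(f"unknown stream: {stream}")


store = {}
apply_event(store, "page-update", {"page_id": 1, "text": "first"})
apply_event(store, "page-update", {"page_id": 2, "text": "second"})
apply_event(store, "page-visibility", {"page_id": 1})
apply_event(store, "page-delete", {"page_id": 2})
```

Because all three streams share one data model, a single handler like this can serve every connection.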

2021-08: Roadmap Design and Building towards our September Launch!

  • Roadmapping the next six months:
    • Wikidata:
      • Wikidata is heavily used by Wikimedia Enterprise's persona of commercial content reusers. Looking into the future, it is important for us to include "text-based" projects as well as Wikidata in the feeds that we create.
      • Our goal is to add Wikidata to the Firehose streams, Hourly Diffs, and Daily Exports giving Enterprise users the ability to access all of the projects (except Commons) in one API suite.
    • Credibility Signals
      • As we work to solve the challenges of reliably ingesting in real time Wikimedia data at scale, there are two big problems that still come with our data: Content Integrity and Machine Readability.
      • Wikimedia data reusers are not necessarily savvy in the nuances of the communities' efforts to keep the projects as credible as possible, and they miss much of the context that comes with revisions that might help inform whether or not a new revision is worth replacing in an external system. This is exacerbated as reusers aim to move towards real time data on projects that are always in flux.
      • We plan to draw out the landscape of what signals can be included alongside real time and bulk feeds of new revisions to help end users add more context to their systems. Stay tuned here.
    • Flexible APIs:
      • Customizable Payload: With the ever expanding data added to our schemas, we need more flexibility on the payloads that end users would like. This is not easy or possible for Hourly Diffs or Daily Exports since those files are pre-generated and static but we aim to work on this capability across the Firehose and Structured Content APIs.
      • Enhanced Filtering: Since there are so many different data points coming through the feeds, end users will start to build their comfort with ingestion around a few feeds. It is imperative that we provide the ability to filter beyond the client side so that we can limit the direct traffic on end users' systems. This also provides a much easier user experience for users of the APIs.
  • September Launch:
    • We are all hands on deck, building and progressing towards the launch of our initial product.

2021-07: Onboarding, Architecture, and Launch Schema

  • Added some new folks to our engineering team:
    • Welcome Prabhat Tiwary, Daniel Memije, and Tim Abdullin! They each join us with different perspectives and backgrounds, adding substantial experience and capacity to our team.
    • With this came a lot of work stepping back and building onboarding documentation to make sure our team can grow and folks can join and contribute to our work.
  • New Architecture
    • As Wikimedia Enterprise APIs become more defined and complicated, we have started to draw out what a target architecture would look like. We are doing a lot of planning and taking time to think through what a streaming pipe should look like.
    • Our original architecture was centered around the solution of "Exports" and less around the real-time component, which in the long run will create flexibility issues with how we store and move data around our architecture.
  • Data Model / API Schema:
    • We have decided on a target schema, dataset, and set of APIs for our move out of beta in September. See more on our documentation page here


2021-06: Parsing HTML, Schema, API Organization, and Public Access

  • Parsing HTML
    • We are entering the world of "what we can do to make the data easier to use" as we near having reliable pipes as the core of the Enterprise product.
    • First stop, parsing HTML. We are working with the Parsing team to find ways that Enterprise can support the open-source project to make parsing Parsoid HTML easier at scale for our end users.
  • Data Model / API Schema:
    • We are sending our schema work into the technical decision making process at the Wikimedia Foundation; follow along on this ticket from the architecture team.
    • We have decided to adopt snake_case in our APIs as it has more flexibility with non-English languages, as we look down the line towards more accessible APIs.
  • Launch API Organization
    • Next week we will add our final API name-spacing and structure for launch to our docs page. We are including endpoints to quickly discern if anything has changed from project to project. Stay tuned here, I'm just typing them up in draft.
  • Public Access

2021-05: Schema, Public Access, Documentation, and Firehose

  • Data Model / API Schema:
  • Public Access:
  • Documentation:
    • For now, we are hosting our documentation on-wiki here until we build out our larger sitemap for the Wikimedia Enterprise product. This work is in progress but feel free to watch that page for updates.
    • We are live on Phabricator and all Wikimedia Enterprise related technical work is documented on our board!
  • Firehose API:
    • We have scoped the v1 release of the Firehose API and it will include filtering of Project and Page-Types (namespaces) for easier ingestion. Track progress here.
    • The Firehose will include the data from the above schema in a real time feed.
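
The project and page-type (namespace) filtering scoped above can be pictured as a simple predicate applied to each event. The sketch below is a client-side illustration only; the event keys `project` and `namespace` are assumed names, and namespace 0 is used because it is the main article namespace in MediaWiki.

```python
def make_filter(projects=None, namespaces=None):
    """Build a predicate that keeps events for the given projects and namespaces.

    Passing None for either argument disables that part of the filter.
    """
    project_set = set(projects) if projects is not None else None
    namespace_set = set(namespaces) if namespaces is not None else None

    def keep(event):
        # `project` and `namespace` are hypothetical event keys.
        if project_set is not None and event.get("project") not in project_set:
            return False
        if namespace_set is not None and event.get("namespace") not in namespace_set:
            return False
        return True

    return keep


events = [
    {"project": "enwiki", "namespace": 0, "title": "Article"},
    {"project": "enwiki", "namespace": 1, "title": "Talk:Article"},
    {"project": "frwiki", "namespace": 0, "title": "Article (fr)"},
]
articles_only = make_filter(projects=["enwiki"], namespaces=[0])
kept = [e["title"] for e in events if articles_only(e)]
```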

2021-04: Beta, Transparency, and Roadmap

  • Beta Launch!:
    • The team launched a "closed beta" for our bulk and structured-content API endpoints! So far, great feedback but still working through kinks that come with a beta offering.
    • Follow this ticket for more information on when public access will be available via Wikimedia Database Dumps. Note that these will be experimental. If you are interested in providing feedback, feel free to post on our Phabricator board - we appreciate it!
    • We are finalizing a timeline with the Technical Engagement team to find how we can provide access to folks with access to their tools. Stay tuned.
  • Project transparency improvements:
    • We are moving all of Wikimedia Enterprise's project management to our Phabricator board over the next week or two.
    • We are reflecting/iterating on our open-source workflow to provide a better window into our GitHub push schedule for those who are interested in following along. More to come here.
  • Roadmap:
    • The next big roadmap item is refining the "data schema" work we have already done and publishing updates here. We are looking to include more contextual data to revisions as part of our ingestion feeds.

2021-03: Community conversations

  • Refreshed documentation
    • Publication of completely refreshed documentation on MediaWiki.org and Meta. See the Meta talk page, which has a significant amount of community feedback/comment.
  • Landing-page website
    • Launched! Incremental improvements in temporary code.
    • The website content itself is temporary and a placeholder until a fully featured page is launched alongside the product in a few months