Wikimedia Enterprise
APIs for commercial users of Wikimedia content
The Wikimedia Enterprise API is a new service focused on high-volume commercial reusers of Wikimedia content. It will provide a new funding stream for the Wikimedia movement; greater reliability for commercial reusers; and greater reach for Wikimedia content.
For general information, the relationship to the Wikimedia strategy, operating principles, and FAQ, see Wikimedia Enterprise on Meta. The project was formerly known as "Okapi".
See also our website for up-to-date API documentation. Current development work is tracked on our Phabricator board. Our source code is on Github. For information about Wikimedia community access to this service, please see Access on the project's Meta homepage.
Contact the team if you would like to arrange a conversation about this project with your community.
Updates
These are the most recent months of technical updates. All previous updates can be found in the archive.
2024 - Q2
____________
Machine Readability
- Goal: To include structured data into our feeds and to make unstructured Wikimedia content available in pre-parsed formats
- Launches:
- Structured Contents snapshots: early beta release of Structured Contents Snapshots endpoint, including pre-parsed articles (abstracts, main images, descriptions, infoboxes, sections) in bulk, and covering several languages. Alongside this release, we’re also making available a Hugging Face dataset of the new beta Structured Contents snapshots and inviting the general public to freely use and provide feedback. All of the information regarding the Hugging Face dataset is posted on our blog here.
- Beta Structured Contents endpoint within the On-demand API, which gives users access to our team’s latest machine readability features, including those listed below (a request sketch follows this list):
- Short Description (available in Structured Contents On-demand)
- A concise explanation of the scope of the page, written by Wikipedia and Wikidata editors. This allows rapid clarification and helps with topic disambiguation.
- Pre-parsed infoboxes (available in Structured Contents On-demand)
- Infoboxes from Wikipedia articles to easily extract the important facts of the topic to enrich your entities.
- Pre-parsed sections (available in Structured Contents On-demand)
- Content sections from Wikipedia articles to easily extract and access information hidden deeper in the page.
- Main Image (available in all Wikimedia Enterprise APIs)
- The main image is curated by editors to represent a given article’s content. This can be used as a visual representation of the topic.
- Summaries (aka `abstract`) (available in all Wikimedia Enterprise APIs)
- Easy to ingest text included with each revision to provide a concise summary of the content without any need to parse HTML or Wikitext.
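To make the On-demand features above concrete, here is a minimal Python sketch of a Structured Contents request. It assumes the bearer-token login flow and the `/v2/structured-contents/{name}` path described in our public API documentation; exact hostnames, request bodies, and response field names may differ from what ships, so treat this as an illustration rather than reference code.

```python
# Minimal sketch: fetch pre-parsed fields for one article from the beta
# Structured Contents On-demand endpoint. Hostnames, paths, and field
# names are assumptions based on the public API documentation.
import requests

AUTH_URL = "https://auth.enterprise.wikimedia.com/v1/login"               # assumed
API_URL = "https://api.enterprise.wikimedia.com/v2/structured-contents"   # assumed

def get_token(username: str, password: str) -> str:
    """Exchange account credentials for a bearer access token."""
    resp = requests.post(AUTH_URL, json={"username": username, "password": password})
    resp.raise_for_status()
    return resp.json()["access_token"]

def structured_contents(name: str, token: str) -> list[dict]:
    """Request pre-parsed article data (abstract, infoboxes, sections, main image)."""
    body = {
        "filters": [{"field": "is_part_of.identifier", "value": "enwiki"}],
        "limit": 1,
    }
    resp = requests.post(
        f"{API_URL}/{name}",
        json=body,
        headers={"Authorization": f"Bearer {token}"},
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    token = get_token("WME_USERNAME", "WME_PASSWORD")      # placeholder credentials
    for article in structured_contents("Earth", token):
        print(article.get("abstract"))
        print(article.get("short_description"))
        print(article.get("image", {}).get("content_url"))
```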
Content Integrity
- Goal: To provide more contextual information alongside each revision to help judge whether or not to trust the revision.
- Launches:
- Maintenance Tags
- Key enWiki tags that point to changes in credibility.
- Small-scale proof of concept (POC)
- Breaking News Beta [Realtime Streaming v2]
- A boolean field detecting breaking news events to support prioritization when doing real-time ingestion of new Wikipedia pages
- Liftwing ‘Revertrisk’
- The ORES ‘goodfaith’ and ‘damaging’ scores have been deprecated and removed from our API responses. We are working on integrating the Liftwing ‘revertrisk’ score into our API response objects (see the triage sketch after this list).
- No-Index tag per revision
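To illustrate how a reuser might act on these signals, the sketch below applies a simple triage rule to an incoming revision object. Every field name used here (`maintenance_tags`, the `breaking_news` flag, the Liftwing revert-risk probability) is a placeholder standing in for the signals described above, not the final schema.

```python
# Sketch of a reuser-side triage rule built on the content-integrity signals
# described above. All field names are placeholders, not the final schema.

def triage(revision: dict, revert_risk_threshold: float = 0.8) -> str:
    """Return 'publish', 'prioritize', or 'hold' for an incoming revision."""
    scores = revision.get("scores", {})
    revert_risk = scores.get("revertrisk", {}).get("probability", {}).get("true", 0.0)
    tags = revision.get("maintenance_tags", {})

    if revert_risk >= revert_risk_threshold:
        return "hold"            # likely to be reverted; wait for a later revision
    if tags.get("pov_count", 0) or tags.get("citation_needed_count", 0):
        return "hold"            # flagged by editors as needing attention
    if revision.get("event", {}).get("breaking_news"):
        return "prioritize"      # breaking-news pages benefit from fast ingestion
    return "publish"

# Example event shaped like a Realtime revision object (shape is assumed):
example = {
    "event": {"breaking_news": True},
    "scores": {"revertrisk": {"probability": {"true": 0.12}}},
    "maintenance_tags": {"citation_needed_count": 0},
}
print(triage(example))  # -> "prioritize"
```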
API Usability
- Goal: To improve the usability of Wikimedia Enterprise APIs
- Launches:
- Snapshots
- Filtering of available snapshots to select which groups of snapshots to download
- Parallel downloading capabilities to optimize ingestion speeds
- On-demand
- Cross-language project entity lookups to connect different language projects for faster knowledge graph ingestion.
- NDJSON responses to enable data consistency across WME APIs
- Filtering and customized response payloads
- Realtime Batch
- Filtering of available batch updates to select which groups of files to download
- Parallel downloading capabilities to optimize ingestion speeds
- Realtime Streaming
- Realtime Streaming reconnection performance improvement
- Shared credibility signals accuracy results
- Shared latency distribution for Realtime Streaming events
- Parallel consumption - enable users to open multiple connections to a stream simultaneously
- More precise tracking - empower users to reconnect and seamlessly resume message consumption from the exact point where they left off
- Event filtering by data field/value to narrow down revisions
- Customized response payloads to control event size
- Proper ordering of revisions to prevent accidental overwrites
- Lower event latency to ensure faster updates
- NDJSON responses to enable data consistency across WME APIs (see the parsing sketch after this list)
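Several of the launches above (NDJSON responses, filtering, customized payloads) change how responses are read rather than what they contain. The sketch below shows one generic way to consume an NDJSON body line by line with a streaming HTTP client; the endpoint URL and request body are placeholders.

```python
# Generic NDJSON consumption sketch: stream a response and decode one JSON
# object per line. The endpoint URL and request body are placeholders.
import json
import requests

def iter_ndjson(url: str, token: str, body: dict):
    """Yield one decoded object per NDJSON line without buffering the whole body."""
    with requests.post(
        url,
        json=body,
        headers={"Authorization": f"Bearer {token}"},
        stream=True,
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:                      # skip keep-alive blank lines
                yield json.loads(line)

# Example usage against a hypothetical filtered On-demand request:
# for obj in iter_ndjson("https://api.enterprise.wikimedia.com/v2/articles/Earth",
#                        token, {"fields": ["name", "abstract"]}):
#     print(obj["name"])
```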
Past updates
For updates from previous months, see the archive.
Overview
Background
Given the myriad sources of information on the internet, aggregating public and private datasets has become a key proprietary asset (manifested in customer knowledge graphs) for large technology companies when building their products. It is through this work that a company’s voice assistants and search engines can be more effective than those of their competitors. Wikimedia data is the largest public data source on the internet and is used as the "common knowledge" backbone of knowledge graphs. Not having Wikimedia data in a knowledge graph is detrimental to a product’s value, as we've proven through customer research.
For Wikimedia Enterprise API customers to build effective user experiences, they need two essential qualities from the Wikimedia dataset: completeness and timeliness.
Wikimedia content provides the largest collection of freely available information on the web. It maps broad topics across hundreds of languages and gives consumer products the sense of "full knowledge" and "completeness" that drives positive user experiences.
Wikimedia content originates from a community that authors content in real time, as history unfolds. Leveraging this community's work gives customer products the sense of being "informed" (i.e., "timeliness") when events happen, likewise generating positive user experiences.
There is currently no way for a data-consuming customer to make one or two API requests and retrieve a complete, up-to-date document containing all of the relevant information about the requested topic. This has led customers to build complex bespoke solutions that are hard to maintain; expensive, because of the significant internal investment required; error-prone, because of inconsistencies in Wikimedia data; and brittle, because of changes in Wikimedia responses.
2020 research study
From June 2020 to October 2020, the Wikimedia Enterprise team conducted a series of interviews with third-party reusers of Wikimedia data to better understand which companies use our data, how they use it, in which products they use it, and what challenges they face when working with our APIs. Our research showed that:
- Users store our data externally rather than querying our APIs for live data
- Each user handles our current offering differently, with unique challenges and requests
- Wikimedia APIs are not seen as a trusted ingestion mechanism for data collection and are subject to rate limits, uptime issues, and overuse for their purposes
- All users face the same general issues when working with our content, and we have received similar requests from users of all sizes
The Enterprise API team identified four pain points that cause a significant number of third-party users to struggle when using our public suite of APIs for commercial purposes. Note: several of these concepts overlap with other initiatives currently underway within the Wikimedia movement, for example the API Gateway initiative.
- Freshness: Commercial users want to be able to ingest our content "hot off the press" so they have the most up-to-date worldview of general knowledge when presenting information to their users.
- System reliability: Commercial users want reliable uptime for critical APIs and file downloads so they can build with our tools without added maintenance or increased risk to their products.
- Content integrity: Commercial users inherit the same challenges that Wikimedia projects face around vandalism and developing stories. Commercial users want more metadata with each revision update to inform their decisions about whether or not to publish a revision to their products.
- Machine readability: Commercial users want a clean, consistent schema for working with data across all of our projects, because of the challenges that come with parsing and understanding the data they get from our current APIs.
For Content Integrity and Machine Readability, the Wikimedia Enterprise team created the following list of areas of particular interest to focus our work for external reusers. This list was created in March 2021 and has since been refined and prioritized into the roadmap features described below; nevertheless, it serves as an artifact of this research and something that can be used to point to some of the issues users face.
Theme | Feature | Details |
---|---|---|
Machine Readability | Parsed Wikipedia Content | Break out the HTML and Wikitext content into clear sections that customers can use when processing our content into their external data structures |
Machine Readability | Optimized Wikidata Ontology | Wikidata entries mapped into a commercially consistent ontology |
Machine Readability | Wikimedia-Wide Schema | Combine Wikimedia project data together to create “single-view” for multiple projects around topics. |
Machine Readability | Topic Specific Exports | Segment corpus into distinct groupings for more targeted consumption. |
Content Integrity | Anomaly Signals | Update schema with information guiding customers to understand the context of an edit. Examples: page view / edit data |
Content Integrity | Credibility Signals | Packaged data from the community useful to detect larger industry trends in disinfo, misinfo, or bad actors |
Content Integrity | Improved Wikimedia Commons license access | More machine readable licensing on Commons media |
Content Integrity | Content Quality Scoring (Vandalism detection, “best last revision”) | Packaged data used to understand the editorial decision-making of how communities catch vandalism. |
Product roadmap
The Wikimedia Enterprise APIs are designed to help external content reusers seamlessly and reliably mirror Wikimedia content in real time on their systems. However, even with this system in place, reusers still struggle with the Content Integrity and the Machine Readability of Wikimedia content when they try to make it actionable on their end. This section lays out the work we are actively doing to help alleviate some of those struggles; for reference, see the research table in the 2020 research study section above.
In Flight Work
New Functionality
- Content Integrity: External reusers that choose to work with Wikimedia data in real time, or even with a slight delay, increase their exposure to the most fluid components of the projects and raise the risk of propagating vandalism, disinformation/misinformation, unstable article content, etc. Our goal is not to prescribe content with a decision as to its credibility, but rather to increase the contextual data "signals" around a revision to allow Wikimedia Enterprise reusers to have a better picture of what a revision is doing and how they might want to handle it on their end. This will manifest in new fields in our responses in the Realtime, Snapshot, and On-demand APIs. We are focused on two main categories of signals (an illustrative sketch follows this list):
- Credibility Signals: "Context" of a revision. This looks like diving into "what changed", editor reputation, and general article level flagging. The goal initially is to lean on the information that is publicly used by editors and translate those concepts to the reusers that are otherwise unfamiliar. Track this work here.
- Anomaly Signals: "Activity" around a revision. This looks like temporal edit, page views, or talk page activity. The goal initially is to compile quantitative signals to unpack popularity that can be used to help reusers prioritize updates as well as calibrate around our trends and what that might mean for the reliability of the content.
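As a purely illustrative example of how anomaly-style signals could be used once they are exposed, the sketch below flags a page when its recent edit or page-view counts spike well above a trailing baseline. The numbers and the heuristic itself are placeholders; the production signals and their shape are still being designed.

```python
# Illustrative anomaly heuristic: flag a page when recent edit or page-view
# activity spikes well above its trailing baseline. Signal names and shapes
# are placeholders; the production schema is still being designed.
from statistics import mean

def is_anomalous(recent: list[int], baseline: list[int], factor: float = 3.0) -> bool:
    """True when mean recent activity exceeds `factor` x the baseline mean."""
    if not recent or not baseline or mean(baseline) == 0:
        return False
    return mean(recent) > factor * mean(baseline)

# Hourly edit counts for a page: quiet baseline, then a sudden burst.
baseline_edits = [1, 0, 2, 1, 1, 0, 1, 2]
recent_edits = [9, 14, 11]
print(is_anomalous(recent_edits, baseline_edits))  # -> True
```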
General Improvements
- Accessibility: In order to increase the availability of access to Wikimedia Enterprise APIs, we are developing a new self-signup tier so people can get started working with our APIs. Track this work here.
- Reliability: Continuous improvement on our system's health in order to comfortably scale, with more context as to the problems that we'll need to continually solve for. We are building what will become a v2 architecture of Wikimedia Enterprise APIs. Track this work for the Snapshots and Realtime APIs. View our status page.
- Freshness: We are working with Wikimedia Foundation teams (Platform and Data Engineering) to better understand and flag where we may have revisions missing in the feeds, in order to improve performance for both our systems and the public systems.
Wikimedia Enterprise (Version 1.0)
See also: Up to date API documentation and more information about the general value offerings on our commercial website.
Name | Compare To | What is it? | What’s New? |
---|---|---|---|
Enterprise Realtime API | EventStream HTTP API | A stable, push HTTP stream of real-time activity across "text-based" Wikimedia Enterprise projects | |
Enterprise On-demand API | RESTBase APIs | Current article content in Wikimedia Enterprise JSON format. Structured Contents beta endpoint with experimental parsing. | |
Enterprise Snapshot API | Wikimedia Dumps | Recent, compressed Wikimedia data exports for bulk content ingestion. | |
On-demand API
High-volume reusers that use an infrastructure reliant on the EventStream platform depend on services like RESTBase to pull HTML from page titles and current revisions to update their products. High-volume reusers have requested a reliable means to gather this data, as well as structures other than HTML, when incorporating our content into their knowledge graphs and products. A minimal lookup sketch follows the list below.
Wikimedia Enterprise On-demand API contains:
- A commercial schema
- SLA
- Beta Structured Contents endpoint (not covered by the SLA)
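Below is a minimal sketch of such a lookup, assuming the `/v2/articles/{name}` path, POST request bodies, and bearer-token authentication described in our API documentation; the exact parameters and response fields may differ.

```python
# Minimal On-demand lookup sketch. The path, request body, and field names are
# assumptions based on the public API documentation and may differ.
import requests

API = "https://api.enterprise.wikimedia.com/v2"   # assumed base URL

def lookup_article(name: str, project: str, token: str) -> list[dict]:
    """Fetch the current revision of one article in Wikimedia Enterprise JSON."""
    body = {
        "filters": [{"field": "is_part_of.identifier", "value": project}],
        "fields": ["name", "abstract", "version", "article_body.html"],
    }
    resp = requests.post(f"{API}/articles/{name}", json=body,
                         headers={"Authorization": f"Bearer {token}"})
    resp.raise_for_status()
    return resp.json()

# Usage (token obtained from the auth endpoint, as in the earlier sketch):
# for doc in lookup_article("Earth", "enwiki", token):
#     print(doc["version"]["identifier"], len(doc["article_body"]["html"]))
```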
Realtime API
High-volume reusers currently rely heavily on the changes that are pushed from our community to update their products in real time, using EventStream APIs to access those changes. High-volume reusers are interested in a service that will allow them to filter the changes they receive to limit their processing, guarantee stable HTTP connections to ensure no data loss, and supply a more useful schema to limit the number of API calls they need to make per event. A consumption sketch follows the list below.
Enterprise Realtime API contains:
- Update streams that provide real-time events of changes across supported projects
- Batch processing files updated hourly with each day's project changes (formerly classified as part of the Snapshot API)
- Commercially useful schema similar* to those that we are building in our On-demand API and Snapshot API
- SLA
*We are still in the process of mapping out the technical specifications to determine the limitations of schema in event platforms and will post here when we have finalized our design.
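The sketch below shows one way a reuser might hold a long-lived connection to the stream and resume from stored offsets after a disconnect. The endpoint path, the offset parameters, and the event fields are assumptions and may not match the final specification.

```python
# Realtime consumption sketch: hold a streaming connection open and resume
# from stored per-partition offsets after a disconnect. Paths and parameter
# names are assumptions, not the final specification.
import json
import time
import requests

STREAM_URL = "https://api.enterprise.wikimedia.com/v2/realtime/article-update"  # assumed

def consume(token: str, offsets: dict):
    """Yield article-update events, reconnecting and resuming on failure."""
    while True:
        try:
            with requests.post(
                STREAM_URL,
                json={"since_per_partition": offsets} if offsets else {},
                headers={"Authorization": f"Bearer {token}"},
                stream=True,
                timeout=(10, 300),
            ) as resp:
                resp.raise_for_status()
                for line in resp.iter_lines():
                    if not line:
                        continue
                    event = json.loads(line)
                    # Remember where we are so a reconnect can resume here.
                    part = event.get("event", {}).get("partition")
                    offset = event.get("event", {}).get("offset")
                    if part is not None and offset is not None:
                        offsets[str(part)] = offset
                    yield event
        except requests.RequestException:
            time.sleep(5)  # back off briefly, then reconnect with saved offsets
```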
Snapshot API
For high volume reusers that currently rely on the Wikimedia Dumps to access our information, we have created a solution to ingest Wikimedia content in near real time without excessive API calls (On-demand API) or maintaining hooks into our infrastructure (Realtime API - Streaming).
Enterprise Snapshot API contains:
- 24-hour JSON*, Wikitext, or HTML compressed dumps of supported Wikimedia projects
- SLA
*JSON dumps will contain the same schema per page as the On-demand API.
These dumps are available for public use fortnightly on Wikimedia Dumps and daily to WMCS users. A download sketch follows.
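A rough sketch of pulling one snapshot and reading the articles inside it, assuming a `/v2/snapshots/{identifier}/download` path and tar.gz-packaged NDJSON as described in our public documentation; identifiers and packaging details are assumptions.

```python
# Snapshot download sketch: stream one compressed snapshot to disk, then read
# the NDJSON file(s) inside it. Path and packaging (tar.gz of NDJSON) are
# assumptions based on the public documentation.
import json
import tarfile
import requests

API = "https://api.enterprise.wikimedia.com/v2"   # assumed base URL

def download_snapshot(identifier: str, token: str, dest: str) -> str:
    """Stream a snapshot archive (e.g. 'enwiki_namespace_0') to a local file."""
    url = f"{API}/snapshots/{identifier}/download"
    with requests.get(url, headers={"Authorization": f"Bearer {token}"},
                      stream=True) as resp:
        resp.raise_for_status()
        with open(dest, "wb") as fh:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                fh.write(chunk)
    return dest

def iter_articles(archive_path: str):
    """Yield one article object per NDJSON line inside the tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            fh = tar.extractfile(member)
            if fh is None:
                continue
            for line in fh:
                yield json.loads(line)

# Usage:
# path = download_snapshot("enwiki_namespace_0", token, "enwiki.tar.gz")
# for article in iter_articles(path):
#     print(article["name"])
```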
Past development
In response to the initial 2020 research study, the Wikimedia Enterprise team is focused on building tools for commercial reusers that provide the benefits of a relationship while expanding the usability of the content we offer.
The roadmap was divided into two structured phases focused on helping large third-party users to:
- Build a "commercial ingestion pipe" (COMPLETE)
- Create more useful data to feed into the "commercial ingestion pipe" (IN PROGRESS)
Building a "Commercial Ingestion Pipe" aka Version 1.0 (Launched September 2021)
The goal of the first phase was to build infrastructure that ensures the Wikimedia Foundation can reasonably guarantee Service Level Agreements (SLAs) for 3rd-party reusers as well as create a "single product" where commercial reusers can confidently ingest our content in a clear and consistent manner. While the main goal of this is not explicitly to remove the load of the large reusers from Wikimedia Foundation infrastructure, it is a significant benefit, for we do not currently know the total capacity of these large reusers on donor-funded infrastructure. For more information on the APIs that are currently available, please reference the section Version 1.0 above or our public API documentation.
Daily HTML Dumps (Launched December 2020)
The Enterprise team's first product was building daily dump files of HTML for every "text-based" Wikimedia project. These dumps help content reusers work with a more familiar data format when handling Wikimedia content.
Reusers have four immediate needs from a service that supports large-scale content reuse: system reliability, freshness or real-time access, content integrity, and machine readability.
Web interface
A downloader interface, now in the design stage, allows users to download a daily dump for each "text-based" project, search for and download individual pages, and save their preferences for return visits. The software is currently in alpha and still undergoing usage and quality testing. This dashboard is built in React, with internal-facing client endpoints built on top of our infrastructure. The downloads are hosted and served through S3.
Rationale behind choosing this as the Enterprise API's first product
- Already validated: Before the Enterprise team ran research to discover the needs of high-volume data reusers, this was the most historically requested feature. Large technology partners, researchers, and internal stakeholders within the Wikimedia Foundation have long sought a comprehensive way to access all of the Wikimedia "text-based" wikis in a form outside of Wikitext.
- Take pressure off internal Wikimedia infrastructure: While not proven, anecdotally we can conclude there is a significant band of traffic to our APIs by high-volume reusers aiming to get the most up-to-date content cached on their systems for reuse. Building a tool where they can achieve this has been the first step to pulling high-volume reusers away from WMF infrastructure and onto a new service.
- Standalone in nature: Of the projects already laid out for consideration by the Enterprise team, this is the most standalone. We can easily understand the specs without working with a specific partner. We were not forced to make technical decisions that would affect a later product or offering. In fact, in many ways, this flexibility forced us to build a data platform that produced many of the APIs that we are offering in the near future.
- Strong business development case: This project gave the Enterprise team a lot of room to talk through solutions with reusers and open up business development conversations.
- Strong introductory project for contractors: The Enterprise team started with a team of outside contractors. This forced the team to become reusers of Wikimedia in order to build this product. In the process, the team was able to identify and relate to the problems with the APIs that our customer base faces, giving them a broader understanding of the issues at hand.
Design documents
Application Hosting
The engineering goal of this project is to rapidly prototype and build solutions that could scale to the needs of the Enterprise API's intended customers – high volume, high speed, commercial reusers. To do this, the product has been optimized for quick iteration, infrastructural separation from critical Wikimedia projects, and to utilize downstream Service Level Agreements (SLAs). To achieve these goals in the short term, we have built the Enterprise API upon a third-party cloud provider (specifically Amazon Web Services [AWS]). While there are many advantages of using external cloud for our use case, we acknowledge there are also fundamental tensions – given the culture and principles of how applications are built at the Foundation.
Consequently, the goal with the Enterprise API is to create an application that is "cloud-agnostic" and can be spun up on any provider's platform. We have taken reasonable steps to architect abstraction layers within our application to remove any overt dependencies on our current host, Amazon Web Services. This was also a pragmatic decision, due to the unclear nature of where this project will live long-term.
The following steps were taken to ensure that principle. We have:
- Designed and built service interfaces to create abstractions from provider-specific tools. For instance, we have layers that tie to general file-storage capabilities, decoupling us from using exclusively "AWS S3" or creating undue dependency on other potential cloud options (see the sketch after this list)
- Built the application using Terraform as Infrastructure as Code to manage our cloud services. [The Terraform code will be published in the near future and this documentation will be updated when it is]
- Used Docker for containerization throughout the application
- Implemented hard drive encryption to ensure that the data is protected (we are working to expand our data encryption and will continue to do so as this project develops)
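To make the first point concrete, here is a simplified, illustrative sketch (in Python, for brevity) of the kind of storage abstraction described: callers depend on a generic interface, and only one class knows about S3. The actual implementation on Github is organized differently; this only demonstrates the decoupling pattern.

```python
# Illustrative sketch of a provider-agnostic file-storage interface, mirroring
# the abstraction-layer idea described above. The real implementation differs;
# this only shows the decoupling pattern.
from abc import ABC, abstractmethod
from pathlib import Path

class FileStorage(ABC):
    """What the application depends on: a storage capability, not a vendor."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...

class LocalStorage(FileStorage):
    """Filesystem-backed implementation, useful for tests or other hosts."""

    def __init__(self, root: str):
        self.root = Path(root)

    def put(self, key: str, data: bytes) -> None:
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key: str) -> bytes:
        return (self.root / key).read_bytes()

class S3Storage(FileStorage):
    """AWS-backed implementation; only this class knows about boto3/S3."""

    def __init__(self, bucket: str):
        import boto3                      # optional dependency, kept local
        self.bucket = bucket
        self.client = boto3.client("s3")

    def put(self, key: str, data: bytes) -> None:
        self.client.put_object(Bucket=self.bucket, Key=key, Body=data)

    def get(self, key: str) -> bytes:
        return self.client.get_object(Bucket=self.bucket, Key=key)["Body"].read()

# The rest of the application only sees FileStorage, so swapping providers
# means adding another subclass, not rewriting callers.
def archive_dump(storage: FileStorage, name: str, payload: bytes) -> None:
    storage.put(f"dumps/{name}", payload)
```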
We have intentionally kept our technical stack as general, libre & open source, and lightweight as possible. There is a temptation to use a number of proprietary services that may provide easy solutions to hard problems (including EMR, DynamoDB, etc.). However, we have restricted our reliance on Amazon services to those that can be found in most other cloud providers. Below is a list of the services used by the Enterprise API within Amazon and each one's purpose in our infrastructure:
- Amazon EC2 - Compute
- Amazon S3 - File Storage
- Amazon Relational Database Service (PostgreSQL) - PostgreSQL database
- Amazon ElastiCache for Redis - Cache
- Amazon Elasticsearch Service - Search Engine
- Amazon MSK - Apache Kafka Cluster
- Amazon ELB - Load Balancer
- Amazon VPC - Virtual Private Cloud
- Amazon Cognito - Authentication
We are looking to provide Service Level Agreements (SLAs) to customers similar to those guaranteed by Amazon's EC2. We don't have equivalent uptime information from the Wikimedia Foundation's existing infrastructure; however, this is something we are exploring with Wikimedia Site Reliability Engineering. Any alternative hosting in the future would require equivalent services, or the time to add more staff to our team, in order to give us confidence in meeting the SLA we are promising.
In the meantime, we are researching alternatives to AWS (and remain open to ideas that might fit our use case) for when this project is more established and we are confident we know what the infrastructure needs are in reality.
Team
For the most up-to-date list of people involved in the project, see Meta:Wikimedia Enterprise#Team.
See also
- Wikitech: Data Services portal – A list of community-facing services that allow for direct access to databases and dumps, as well as web interfaces for querying and programmatic access to data stores.
- Enterprise hub – a page for those interested in using the MediaWiki software in corporate contexts:
- MediaWiki Stakeholders group – an independent affiliate organisation that advocates for the needs of MediaWiki users outside the Wikimedia Foundation, including commercial enterprises.
- Enterprise MediaWiki Conference – an independent conference series for that community.
- Wikimedia update feed service – A defunct paid data service that enabled third parties to maintain and update local databases of Wikimedia content.
API | Availability | URI base | Example |
---|---|---|---|
MediaWiki Action API | Included with MediaWiki; enabled on Wikimedia projects | /api.php | https://en.wikipedia.org/w/api.php?action=query&prop=info&titles=Earth |
MediaWiki REST API | Included with MediaWiki 1.35 and later; enabled on Wikimedia projects | /rest.php | https://en.wikipedia.org/w/rest.php/v1/page/Earth |
Wikimedia REST API | Not included with MediaWiki; available for Wikimedia projects only | /api/rest | https://en.wikipedia.org/api/rest_v1/page/title/Earth |
For commercial-grade APIs to Wikimedia projects, see Wikimedia Enterprise.