Platform Engineering Team/Event Platform Value Stream
Event Platform
What is it?
Engineers from across Technology (Platform, Data Engineering, Search and Enterprise) will collaborate on a shared event streaming platform capability that benefits each group and the Foundation as a whole.
Existing event streams signal a change of state but lack many of the details required to make sense of that change (see T291120). The event platform will enable us to build enriched data streams that allow the Foundation and the community to build and share better knowledge experiences.
What do we aim to achieve?
- Evaluation of event streaming platforms
- Implementation of the chosen event streaming solution as a proof of concept (no SLOs)
- Implementation of the following services/stream processors:
  - Simple Enrichment - transform a single stream by enriching it with calls to MediaWiki APIs
  - Research Use Case - transform a single stream to provide data for a Research use case
  - Data Integration - integrate streams and databases
- Understanding the pathway and considerations to take the chosen solution to production
- Creating tooling and pathways for other engineering groups to build streaming services/processors
How does this benefit the movement?
- Knowledge as a service - Publishing enriched event streams to the world will allow anyone to build on them to create new knowledge experiences
- Knowledge equity - By publishing enriched streams we break down technical barriers in navigating and accessing data that could be used to build new knowledge experiences
Links
- Assess what is required for the enrichment pipeline to run on k8s
- Build simple stateless service using Flink SQL
- Build simple stateless service using PyFlink
- Contribution
- Evaluate a pyflink version of Mediawiki Stream Enrichment
- Event Catalog
- Event Driven Use Cases
- PoC Mediawiki Stream Enrichment
- Pyflink Enrichment Service Deployment
- Stream Processing Framework Evaluation
- Use case: Event Platform SDLC practices
- Use case: compute needs for streaming pipelines
Now that we are moving forward with Flink as the solution, the first service will consolidate existing streams, enrich messages with page content (wikitext, JSON, etc.) and output the result to a new topic.
More details can be found here
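The flow described above (consolidate existing streams, enrich each message with page content, emit to a new topic) can be sketched in plain Python. This is an illustration only, not the PoC service: the stream names, the event fields (`page_id`, `rev_id`), and the `fetch_content` lookup are hypothetical stand-ins, and a real job would run in Flink and call a MediaWiki API rather than read an in-memory dict.

```python
from itertools import chain
from typing import Callable, Dict, Iterable, Iterator, Optional

Event = Dict[str, object]


def consolidate(*streams: Iterable[Event]) -> Iterator[Event]:
    """Concatenate several input streams into one (no ordering guarantee)."""
    return chain.from_iterable(streams)


def enrich(events: Iterable[Event],
           fetch_content: Callable[[int], Optional[str]]) -> Iterator[Event]:
    """Yield a copy of each event with page content attached.

    fetch_content is a hypothetical lookup keyed by revision id; in a real
    service it would be a call to a MediaWiki API.
    """
    for event in events:
        enriched = dict(event)  # leave the input event untouched
        enriched["content"] = fetch_content(int(event["rev_id"]))
        yield enriched


# Hypothetical input streams and content store, for illustration only.
page_create = [{"stream": "page-create", "page_id": 1, "rev_id": 10}]
revision_create = [{"stream": "revision-create", "page_id": 1, "rev_id": 11}]
content_by_rev = {10: "First draft", 11: "Second draft"}

# The list stands in for the new output topic.
output_topic = list(enrich(consolidate(page_create, revision_create),
                           content_by_rev.get))
```

Keeping the lookup as a parameter means the same enrichment step can be exercised against a stub, as here, without any network access.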
As part of the POC work we also built tooling to make consuming existing event platform streams easy; see here.
MILESTONE: Demo (see video here)
(In Progress): Building on Flink Learnings and the POC Service
To be groomed and defined:
Ticket | Title | Description | Lead/Backup | Timebox | Status |
---|---|---|---|---|---|
T310082 | State Changelog schema design | Streaming event data represents changes to an entity, e.g. a page. If we can represent these changes in a way that can be used to update 'current state', such that after consuming all past events the current state is materializable from events alone, then the event stream is a changelog. Flink has support for automatically consuming changelog streams and presenting them as materialized views of current state. We should research designing our event streams as changelogs so they can be consumed by Flink in this way. | Andrew Otto/David Causse | 1 week | Planned/Needs ticket grooming |
T309784 | Consolidated and Ordered Page Change Stream | POC service to mimic what it would be like to have a consolidated single stream with ordered events. | David Causse/Gabriele Modena | 4 weeks? | Planned |
T306627 | Integrate Image Suggestions Feedback with Cassandra | Design, implement and deploy a service that listens for image suggestions feedback and writes the data to the Cassandra schema so that the feedback can be persisted | Thomas Chin/Group | Unbounded | In Progress |
To do | Research Use Case | Demo and explain event stream to Research, discuss potential use cases or useful streams - diffs, enrichments. Work with Research to implement a POC using events - TBD | Group for now | 4 weeks? TBD | Planned |
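The changelog idea in T310082 and the ordered consolidation in T309784 can be illustrated together in plain Python. This is a minimal sketch under assumptions: the event shape (`ts`, `op`, `page_id`, `title`) and the `upsert`/`delete` operation names are hypothetical, not the schema under design, and Flink would perform the equivalent materialization itself.

```python
import heapq
from typing import Dict, Iterable, Iterator

Event = Dict[str, object]


def merge_ordered(*streams: Iterable[Event]) -> Iterator[Event]:
    """Merge streams that are each already sorted by 'ts' into one ordered stream."""
    return heapq.merge(*streams, key=lambda e: e["ts"])


def materialize(changelog: Iterable[Event]) -> Dict[int, Event]:
    """Replay a changelog to rebuild current state, keyed by page_id."""
    state: Dict[int, Event] = {}
    for event in changelog:
        page_id = event["page_id"]
        if event["op"] == "delete":
            state.pop(page_id, None)
        else:  # "upsert": the latest event for a key wins
            state[page_id] = {"title": event["title"]}
    return state


# Hypothetical per-stream events, each list sorted by timestamp.
creates = [
    {"ts": 1, "op": "upsert", "page_id": 1, "title": "Foo"},
    {"ts": 2, "op": "upsert", "page_id": 2, "title": "Bar"},
]
moves = [{"ts": 3, "op": "upsert", "page_id": 1, "title": "Foo (renamed)"}]
deletes = [{"ts": 4, "op": "delete", "page_id": 2}]

current_state = materialize(merge_ordered(creates, moves, deletes))
print(current_state)  # {1: {'title': 'Foo (renamed)'}}
```

Replaying depends on the merged stream being ordered: if the delete for page 2 were applied before its create, the final state would wrongly include that page.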
Future Phases: Tooling and abstractions
To be groomed and defined:
Ticket | Title | Description | Lead/Backup | Timebox | Status |
---|---|---|---|---|---|
T310218 | Flink output support for Event Platform | We now have Table API abstractions for using Event Platform streams as a Table source. We should automate emitting events too, likely by wrapping JsonEventGenerator. | Andrew Otto/David Causse? | 4 weeks? TBD | Planned/Needs ticket grooming |
To do | AsyncLookupTable for the MW API | Can/Should we make an AsyncLookupTable for the MW API? This could wrap handling retries, etc, and would make using the MW API in Flink quite nice. | Andrew Otto/? | 4 weeks? TBD | Planned/Needs ticket grooming |
T309699 | Retry Logic/Error Handling |