Analytics/Archive/Pixel Service


This page is archived! Find up-to-date documentation at https://wikitech.wikimedia.org/wiki/Analytics

The Pixel service is the "front door" to the analytics system: a public endpoint with a simple interface for getting data into the datastore.

Components

  • Request Endpoint: HTTP server that handles GET requests to the pixel service endpoint, responding with 204 No Content or an actual, honest-to-god 1x1 transparent GIF. Data is submitted into the cluster via query parameters.
  • Messaging System: routes messages (request information + data content) from the request endpoint to the datastore. This component is intended to be implemented by Apache Kafka.
  • Datastore Consumer: consumes messages, shunting them into the datastore utilizing HDFS staging and/or append.
  • Processing Toolkit: a standard template for a Pig job to process (count, aggregate, etc.) event-data query-string parameters, handling the standard indirection for referrer and timestamp and Apache Avro de/serialization, and providing tools for conversion-funnel and A/B-testing analysis.
  • Event Logging Library: a JS library with an easy interface that abstracts sending data to the service. Handles event-data conventions for proxied timestamp and referrer, plus the normal web-request components. (A sketch of such a client follows this list.)
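To make the interface concrete, here is a minimal client sketch in TypeScript. The sendEvent name, the product key, and the hostname are illustrative assumptions; only the GET-with-query-parameters convention comes from the design above.

```typescript
// Hypothetical event-logging client: fire-and-forget beacon via a 1x1 image.
function sendEvent(productCode: string, data: Record<string, string>): void {
  // Everything rides in the query string, per the service interface.
  const params = new URLSearchParams({ product: productCode, ...data });
  // The browser issues the GET and ignores the 204 / transparent-GIF body.
  new Image().src = `https://bits.wikimedia.org/event.gif?${params}`;
}

// Example: record a click with a client-side timestamp.
sendEvent('navtiming', { action: 'click', ts: String(Date.now()) });
```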

Service prototype

To get up and running right away, we're going to start with an alpha prototype, and work with teams to see where it goes.

  • /event.gif on bits multicast stream -> udp2log (1:1) running in Analytics cluster
    • Until bits caches are ready, we'll also have a publicly accessible endpoint on analytics1001
  • A Kafka producer consumes the udp2log stream, creating one topic per product-code -- no intermediate aggregation at the cache DC
  • Cron to run the Kafka-Hadoop consumer, importing all topics into Hadoop at datetime + product-code paths (sketched below)
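As a rough illustration of that layout, here is a path-mapping sketch; the directory structure below is an assumption, since the prototype only pins down "datetime + product-code".

```typescript
// Hypothetical mapping from (product-code, event time) to an HDFS import path.
function hdfsPath(productCode: string, eventTime: Date): string {
  const y = eventTime.getUTCFullYear();
  const m = String(eventTime.getUTCMonth() + 1).padStart(2, '0');
  const d = String(eventTime.getUTCDate()).padStart(2, '0');
  const h = String(eventTime.getUTCHours()).padStart(2, '0');
  // Bucketing by hour keeps import batches small; the path prefix is made up.
  return `/wmf/raw/event/${productCode}/${y}-${m}-${d}/${h}`;
}

// hdfsPath('navtiming', new Date('2012-12-01T14:00:00Z'))
//   -> '/wmf/raw/event/navtiming/2012-12-01/14'
```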

EventLogging Integration TODOs

  • Make sure all event data goes into Kraken (I think it may only be esams at the moment, not sure). [ottomata] (Dec)
  • Divvy up some TODOs with Ori:
    • Keeping udp2log sequence-id counters for each bits host and emitting an alert if gaps are detected (see the sketch after this list)
    • Until https://rt.wikimedia.org/Ticket/Display.html?id=4094 is resolved, monitor for truncated URIs (detectable because missing trailing ';') and set up some alerting scheme
    • Speaking of that RT ticket: check w/Mark if we can do something useful to move that along (like update the patch so it applies against the versions deployed to prod).
  • Figure out a useful arrangement for server-side events (basic idea: call wfDebugLog(..) on hooks that represent "business" events, and have wfDebugLog write to a UDP/TCP socket pointing at Kraken; see the EventLogging extension for some idea of what I mean).
  • Things Ori needs and would repay in dev time and/or sexual favors:
    • Puppetization of stuff on Vanadium
    • Help w/MySQL admin
  • Other EventLogging TODOs: mw:Extension:EventLogging/Todos
    • Figure out how to map event schemas to Avro(?) or some other way to make Hadoop schema-aware, so the data is actually useful rather than just blob-like
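A minimal sketch of the sequence-gap check from the TODO above, assuming monotonically increasing per-host sequence ids in the udp2log stream; the alert hook is left abstract.

```typescript
// Last sequence id seen per bits host.
const lastSeq = new Map<string, number>();

function checkSeq(host: string, seq: number, alert: (msg: string) => void): void {
  const prev = lastSeq.get(host);
  if (prev !== undefined && seq !== prev + 1) {
    // A forward jump means dropped messages; a backward one is likely a restart.
    alert(`udp2log gap on ${host}: expected ${prev + 1}, got ${seq}`);
  }
  lastSeq.set(host, seq);
}
```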

Getting to production

We're pretty settled on Kafka as the messaging transport, but to use the dynamic load-balancing and failover features we need a ZooKeeper-aware producer — unfortunately, only the Java and C# clients have this functionality. (This is a blocker for both the Pixel Service AND general request logging.)

Three options:

  1. Pipe logging output from Squid & Varnish into the console producer (which implies running the JVM in production);
  2. Write code (a Varnish plugin plus configuration as described here, as well as a Squid module, both in something C-like) to do ZK integration and publish to Kafka;
  3. Continue to use udp2log -> Kafka, with the caveat that the stream is unreliable until it reaches Kafka (a bridge along these lines is sketched below).
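For option 3, here is a minimal bridge sketch: read udp2log datagrams and produce them to Kafka. It uses Node's dgram module and the kafkajs client purely for illustration (not the tooling actually in play here); the port, broker, and topic names are placeholders.

```typescript
import { createSocket } from 'dgram';
import { Kafka } from 'kafkajs';

const kafka = new Kafka({ clientId: 'udp2log-bridge', brokers: ['analytics1001:9092'] });
const producer = kafka.producer();

async function main(): Promise<void> {
  await producer.connect();
  const socket = createSocket('udp4');
  socket.on('message', (msg) => {
    // Datagrams dropped before this point are simply gone -- this is the
    // "unreliable until it gets to Kafka" caveat from option 3.
    producer.send({ topic: 'event', messages: [{ value: msg }] });
  });
  socket.bind(8420); // placeholder udp2log relay port
}

main();
```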

Frequently Asked Questions

What HTTP actions will the service support?

GET.

What about POSTs?

No POST. Only GET. Other than content-length, there's no real justification for a POST, and if you're sending strings that are greater than 2k, you kind of already have a problem.

Can I send JSON?

Sure, but we're probably not going to do anything special with it -- the JSON values will show up as strings that you'll have to parse to aggregate, count, etc. Ex: GET /event.gif?json={"foo":1,"bar":[1,2,3]} (and recall you'll have to encodeURIComponent(json)).
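For example, building that exact request in the browser:

```typescript
// Values lifted from the example above; encodeURIComponent is required
// because the JSON contains reserved URL characters.
const payload = { foo: 1, bar: [1, 2, 3] };
const url = '/event.gif?json=' + encodeURIComponent(JSON.stringify(payload));
// -> '/event.gif?json=%7B%22foo%22%3A1%2C%22bar%22%3A%5B1%2C2%2C3%5D%7D'
new Image().src = url;
```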

As we want to build tools to cover the normal cases first, this is not really recommended. (Just use form-urlencoded KV pairs as usual.) If anyone has a REEEEALLY good use-case, we can talk about having a key convention for sending a JSON payload -- like, say, calling the key json.

If I send crazy HTTP headers, will the service record them?

No. We will not parse anything other than the query string.

Custom headers are exactly what we want to avoid -- think of the metadata in an HTTP request as being an interface. You want it to be minimal and well-defined, so little custom parsing needs to occur. KV-pairs in the query string are both flexible and generic enough to meet all reasonable use-cases. If you really need typing, send JSON as the value (as mentioned above).
