About this board

Edit description

Previous page history was archived for backup purposes at Talk:Requests for comment/Structured logging/LQT Archive 1 on 2015-06-25.

Start a new topic

IP address and other personal identifying information

9 comments • 23:01, 23 October 2022 2 years ago

9

Nemo bis (talkcontribs)

Is there any technical reason to keep actual IPs as data points of structured logging entries? The Internet Archive replaced them with hashes generated with a daily key. Especially for operations logs, hashes should be more than enough (if a day is too frequent, make it a week or whatever).

Privacy in this case is not just a philosophical matter but also a very practical one: the less private information a log contains, the easier it is to distribute it, have more eyeballs checking it and identify problems quickly.

Reply 17:57, 4 December 2013 10 years ago

BDavis (WMF) (talkcontribs)

This is definitely a point worth discussing. I happen to personally fall into the camp of persons who don't believe that IP addresses constitute private information, but I'm just one voice in the discussion. Having an IP address is useful for diagnosing some bugs but for most use cases I can think of an opaque token that corresponds to the IP (or even /24 block) would be just as useful. I think it would be nice to end up with a core standard that takes sharing log data with the community into consideration but my primary goal is to enable better log analysis for the operational wikis run by the Foundation.

Reply 18:16, 4 December 2013 10 years ago

Hashar (talkcontribs)

Wikimedia Foundation and the community it represents does consider the IP address to be private. https://wikimediafoundation.org/wiki/Privacy_policy#IP_and_other_technical_information

We do need IP addresses (client and X-Forwarded-For header) in logs, they are very valuable. Obfuscating them would make debugging harder. Similar thing are the Cookies which we obviously need in logs.

We could redact them whenever they are published, might even have some utility to handle it for us/ But honestly I think it is out of the scope of this document.

Reply 20:50, 4 December 2013 10 years ago

BDavis (WMF) (talkcontribs)

Having structure to the log events themselves would be a great help for scrubbing/filtering operations needed to redact logs for wider distribution. This is similar to the discussions that have taken place^{[citation needed]} about filtering exception stack traces in logs so that they can be shown.

Reply Edited by Nemo bis 21:33, 5 February 2014 10 years ago

CSteipp (WMF) (talkcontribs)

It may be nice to somehow flag private data at the top level of the log structure. I'd like to avoid having a consumer filter out top level fields like ip, then assume the context/message is safe to make public. Or have to grep through those unstructured bits trying to redact particular messages.

This post was posted by CSteipp (WMF), but signed as CSteipp.

Reply Edited by Flow talk page manager 04:54, 25 June 2015 9 years ago

BDavis (WMF) (talkcontribs)

I think the spec could accommodate a "private" collection of information, but it would ultimately be up to code review to ensure that private info isn't placed in non-private containers.

Reply 23:24, 4 December 2013 10 years ago

Nemo bis (talkcontribs)

Thanks Hashar. Is this true of all logs or are there some where unobfuscated IP addresses are not required for debugging? Also, do IPs get useless for debugging after a short period of time? If not, it would be useful to encrypt them so that you have at least some (permanent) information, as opposed to being forced to purge everything after 3 months or so per the privacy policy (as with CheckUser data).

Reply 09:41, 9 December 2013 10 years ago

Nemo bis (talkcontribs)

EFF best practices may contain some useful tips, I think: https://www.eff.org/wp/osp (for instance around "Other key solutions to wipe PII from logs"). Most of that is probably not relevant or already in progress (e.g. EventLogging/UserAgentSanitization), but dropping the link here can't harm I think. :)

Reply 21:14, 5 February 2014 10 years ago

Nemo bis (talkcontribs)

Quoting Adam Wight and Bawolff here to be able to find their notes again later:

++the EFF for more ideas, they are actively doing great work on so-called
perfect forward secrecy.

There are simple things we could do to achieve a better balance between privacy and sockpantsing, such as cryptolog , in which IP addresses are hashed using a salt that changes every day. In theory, nobody can reverse the function to reveal the IP, but you can still correlate all of an address's edits for the day, week, or whatever, making CheckUser possible.

IP range blocking obviously needs to happen up-front, before the IP is mangled. I have no suggestions, but maybe browser and preferences fingerprinting would be more effective anyway, since: tor.

The cryptolog approach - This has the property that there's a specific time where all anon identifiers suddenly change (e.g. Midnight every day in the setup cryptolog uses). Having an arbitrary point in time where suddenly identifiers shift is probably an unwanted property. (Although maybe it doesn't matter that much in practice? Someone who actually deals with abuse on wiki would be better able to answer that).

Reply 14:58, 30 July 2014 10 years ago

Reply to "IP address and other personal identifying information"

log4j

2 comments • 04:54, 25 June 2015 9 years ago

2

SSastry (WMF) (talkcontribs)

log4j

log4j and associated logging facilities might be useful to look at while developing this.

This post was posted by SSastry (WMF), but signed as Ssastry.

Reply Edited by Flow talk page manager 04:54, 25 June 2015 9 years ago

BDavis (WMF) (talkcontribs)

The monolog product that has been mentioned was obviously heavily influenced by log4j and it's descendants. Separating the concerns of gathering log event information, formatting the information and outputting the formatted record to storage should definitely be considered.

One feature of log4j that monolog is missing is the idea of a hierarchy of loggers. I'd like to see something like that in MediaWiki eventually but pushing for it in the initial implementation may be overreaching.

Reply 23:28, 4 December 2013 10 years ago

Reply to "log4j"

feedback needed

One comment • 15:58, 17 April 2014 10 years ago

1

Sumanah (talkcontribs)

See http://lists.wikimedia.org/pipermail/wikitech-l/2014-April/075921.html .

Reply 15:57, 17 April 2014 10 years ago

Reply to "feedback needed"

Google Protocol Buffers

3 comments • 22:11, 12 February 2014 10 years ago

3

Tfinc (talkcontribs)

Curious to see how our approach here compares to https://developers.google.com/protocol-buffers/ .

PHP: https://github.com/allegro/php-protobuf/

thanks

Reply 21:45, 12 February 2014 10 years ago

BDavis (WMF) (talkcontribs)

I think protocol buffers are solving a slightly different problem. They allow for specification of a rigid message structure with efficient binary representation such as would typically be used for some sort of interprocess (and possibly inter-language) message passing. For event logging I think that having a rigid message structure is unwanted. Not all events will look alike and thus not all events will contain the exact same fields. We are looking to specify a "core" set of fields but would like the ability to invent additional ones as appropriate. I think there are also advantages to using plain text rather than binary formats. Plain text is more easily debugged and manipulated with a variety of tools which supports the primary use case of recording events so that developers and operational support teams can debug runtime issues.

Reply 22:11, 12 February 2014 10 years ago

Dantman (talkcontribs)

Protocol Buffers requires defining the whole format with all the keys beforehand and outputs binary. The RFC seems to suggest simple JSON which is a good format for data with variable keys, especially if it's supposed to be consumed by a variety of things. Even 3rd party logging services like Loggly accept JSON.

Reply 22:11, 12 February 2014 10 years ago

Reply to "Google Protocol Buffers"

Proof of concept implementation

One comment • 17:10, 11 February 2014 10 years ago

1

BDavis (WMF) (talkcontribs)

I've opened Gerrit change 112699 with a proof of concept implementation that leverages PSR-3 and Monolog. I'm sure it's not ready to merge yet but hopefully it will get some attention from interested parties to move this discussion forward.

Reply 17:10, 11 February 2014 10 years ago

Reply to "Proof of concept implementation"

We are considering a similar approach for Parsoid

3 comments • 18:31, 6 December 2013 10 years ago

3

GWicke (talkcontribs)

The logging library we are considering is https://github.com/trentm/node-bunyan, which also logs JSON messages. Example messages:

{"name":"myapp","hostname":"banana.local","pid":40161,"level":30,"msg":"hi","time":"2013-01-04T18:46:23.851Z","v":0}
{"name":"myapp","hostname":"banana.local","pid":40161,"level":40,"lang":"fr","msg":"au revoir","time":"2013-01-04T18:46:23.853Z","v":0}

Reply 22:06, 4 December 2013 10 years ago

GWicke (talkcontribs)

We can also easily log to GELF. Together with the available transport infrastructure that looks like a very good option to me.

Reply Edited 23:23, 4 December 2013 10 years ago

BDavis (WMF) (talkcontribs)

It looks like the current proposed data points mostly map into GELF.

Proposal	GELF 1.1	Notes
timestamp	timestamp
host	host
source	_source	(facility is deprecated in 1.1)
pid	_pid
severity	level	(via string->number transform)
channel	_channel
message	full_message
file	file
line	line
errorcode	short_message
exception	?	GELF doesn't have collection support
args	?	GELF doesn't have collection support
vhost	_vhost
ip	_ip
user	_user
req	_req

One thing I don't like about GELF is that it doesn't support fields that are collections. This leads to clumsy representations for things like argument lists or structured stack traces. I do really like the support for chunking and compression in the GELF transport protocol.

One of my use cases for structured output is feeding a Logstash server with a Kibana frontend. Logstash will happily index and search anything that can be represented in JSON via it's ElasticSearch backend. With this combination you can do some pretty neat things like find all errors in the last N days with stack traces that mention file X or have argument foo = 'bar'. This can be done with stack traces as big blobs of text but if you get the JSON document structure right it becomes much easier and less prone to false positives/negatives. It may well turn out that defining the "one log event output format to rule them all" is too hard to spend much time thinking about.

Reply 18:31, 6 December 2013 10 years ago

Reply to "We are considering a similar approach for Parsoid"

Third party libraries

2 comments • 16:58, 6 December 2013 10 years ago

2

Jeroen De Dauw (talkcontribs)

The RFC contains a mention that Monolog might be used, together with a question if it should be used directly or behind a facade. One should not bind to a specific logging library. If MW defines its own interface, which might be very similar to the one of the logging library, such binding can be avoided. One simple adepter would need to be created for each logging library that needs to be supported. This way other libraries can be used in the future, and changes to the library interfaces will not require changes to all of MW.

Reply 10:03, 5 December 2013 10 years ago

BDavis (WMF) (talkcontribs)

If we go the route of using a third party library to implement the logging changes I would +1 using the PSR-3 Psr\Log\LoggerInterface as the expected API between MW and the logger implementation.

Reply 16:58, 6 December 2013 10 years ago

Reply to "Third party libraries"

Location of Logger registry

One comment • 23:03, 4 December 2013 10 years ago

1

Parent5446 (talkcontribs)

One of the big questions here is where will the loggers be stored. One option is in ContextSource, or maybe just a global singleton. Not sure which would be most appropriate here.

Reply 23:03, 4 December 2013 10 years ago

Reply to "Location of Logger registry"

Tracking components

3 comments • 20:53, 4 December 2013 10 years ago

3

Cneubauer (talkcontribs)

Here's a related bug I filed back in 2008 about using wfDebugLog instead of wfDebug so that a system admin can determine more easily where log messages are coming from. https://bugzilla.wikimedia.org/show_bug.cgi?id=14601. Something to think about.

Reply 15:16, 4 December 2013 10 years ago

BDavis (WMF) (talkcontribs)

Thanks for pointing this out. I've added it to the "See also" section of the RFC.

Reply 16:23, 4 December 2013 10 years ago

Hashar (talkcontribs)

And since Dec 2012 ( https://gerrit.wikimedia.org/r/37442 ) the log bucket is prepended to the message when it falls in the main $wgDebugLogFile. So we could potentially replace all the wfDebug() calls right now, though having wfDebugLog( 'default', 'some message' ); is not idea.

Reply 20:53, 4 December 2013 10 years ago

Reply to "Tracking components"

There are no older topics