Talk:Requests for comment/Text extraction

Dantman (talkcontribs)
If extracts will be integrated into core, custom extraction classes could go to a separate extension (e.g. WikimediaTextExtraction); otherwise they could be part of the main extraction extension.

Personally, even if this was implemented in a TextExtraction extension instead of core (though I think it should be implemented in core), I wouldn't want Wikimedia-specific stuff in the generic MediaWiki extension. I.e., I'd prefer that in both situations WMF would have a WikimediaTextExtraction extension.

Such timing is less than optimal, I propose to extract text during LinksUpdate and store it in page_props.

page_props is for storage of indexed and queryable data that results from the canonical parse run. I.e., something should only ever be stored there when there is also an equivalent parser cache entry.

page_props is for data you want to be able to query for, not for general storage. Since you're not going to be making SQL queries that try to match extraction results, the extraction data should be stored in the parser cache instead, using either ParserOutput::setExtensionData or a new property plus accessor methods on ParserOutput.

Alternatively if you want to do this completely separate from the parser cache the proposed DataStore would probably be the best method of storage.

MaxSem (talkcontribs)

We don't need text extracts in parser output:

  • I want to make extract retrieval a batch operation; it would never be like that if it only came with ParserOutput.
  • You need to generate an extract once per revision, not on every parse.
MZMcBride (talkcontribs)

Some wikis, such as Wiktionaries, rely heavily on templates. I'm not sure you can only generate an extract once... if templates change and the resulting page output changes, you'll need to re-generate an extract, right? Plus there will be incremental improvements to the extractor itself, which people will want to benefit from without needing to make dummy edits to pages.
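
One way to reconcile "generate once" with the template and extractor-upgrade concerns above is to key the stored extract on the rendered output and on an extractor version, rather than on the revision alone. The following is only a minimal sketch in Python, not MediaWiki code: `ExtractCache`, `first_sentence`, and the key layout are all hypothetical names chosen for illustration.

```python
import hashlib
from typing import Callable, Dict


def first_sentence(text: str) -> str:
    """Toy extractor (an assumption): keep everything up to the first period."""
    return text.split(".")[0].strip() + "."


class ExtractCache:
    """Caches extracts keyed by page, extractor version, and rendered output."""

    def __init__(self, extractor: Callable[[str], str], extractor_version: int):
        self._extractor = extractor
        self._version = extractor_version
        self._store: Dict[str, str] = {}
        self.extractions = 0  # counts how often the extractor actually ran

    def _key(self, page_id: int, rendered_html: str) -> str:
        # Hashing the rendered output means a template change that alters the
        # page invalidates the extract even though the revision is unchanged.
        digest = hashlib.sha1(rendered_html.encode("utf-8")).hexdigest()
        return f"extract:{page_id}:v{self._version}:{digest}"

    def get(self, page_id: int, rendered_html: str) -> str:
        key = self._key(page_id, rendered_html)
        if key not in self._store:
            self._store[key] = self._extractor(rendered_html)
            self.extractions += 1
        return self._store[key]


cache = ExtractCache(first_sentence, extractor_version=1)
cache.get(1, "A cat is an animal. More text.")
cache.get(1, "A cat is an animal. More text.")  # served from the cache
cache.get(1, "A cat is a pet. More text.")      # output changed, so re-extract
```

With this layout, bumping the extractor version changes every key, so improved extracts appear without dummy edits. The sketch never evicts stale entries, which a real store would need to handle.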

Reply to "Some notes"

Inherit ALL the things

Jeroen De Dauw (talkcontribs)
class ExtractFormatter extends HtmlFormatter
class WiktionaryExtractFormatter extends ExtractFormatter

Code reuse via inheritance much? What happened to favoring composition over inheritance?
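
For illustration, a composition-based alternative might look roughly like the sketch below. It is written in Python rather than PHP, and everything in it is hypothetical: this `ExtractFormatter` is not the class from the RfC, and the rule functions are invented. The idea is that a site-specific formatter is configured with extra rules instead of subclassing.

```python
import re
from typing import Callable, Iterable

# A "rule" is any function that takes text and returns cleaned text.
Rule = Callable[[str], str]


def strip_tags(html: str) -> str:
    """Generic rule: drop anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", html)


def collapse_whitespace(text: str) -> str:
    """Generic rule: squeeze runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def drop_bracketed_notes(text: str) -> str:
    """Hypothetical site-specific rule: remove [bracketed] annotations."""
    return re.sub(r"\[[^\]]*\]", "", text)


class ExtractFormatter:
    """Applies the rules it was given, in order; no subclassing required."""

    def __init__(self, rules: Iterable[Rule]):
        self._rules = list(rules)

    def format(self, html: str) -> str:
        text = html
        for rule in self._rules:
            text = rule(text)
        return text


# The generic and the site-specific formatter differ only in configuration.
generic = ExtractFormatter([strip_tags, collapse_whitespace])
wiktionary = ExtractFormatter(
    [strip_tags, drop_bracketed_notes, collapse_whitespace]
)
```

Under these assumptions, a Wikimedia-specific extension would only need to ship its own rules and pass them in, rather than extend a class owned by the generic code.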

Reply to "Inherit ALL the things"
Nemo bis (talkcontribs)
Reply to "Mobile and Wikidata"

What's this waiting on?

Sumanah (talkcontribs)

The DataStore RfC has been approved, but not implemented. Is the text extraction RfC awaiting DataStore's implementation?

Reply to "What's this waiting on?"
Quiddity (talkcontribs)

[Nutshell context: There is a recurring ("perennial") proposal at Enwiki and at Meta, to create a "synopsis version" of Wikipedia articles. The most recent is from late 2012, at m:Concise Wikipedia]

I recently took a swing at summarizing everything, from NavPopups to Google Knowledge Graph, as briefly as possible, at m:Concise Wikipedia#A summary of existing short-options, using an example. That info might be relevant to this proposal, or just interesting to some of the folks who are following this.

Reply to "Related endeavours"