The following patch for the maintenance script generateSitemap.php, from https://gerrit.wikimedia.org/r/c/620746, works (it removes noindex pages from the sitemap file) only for the behavior switch magic word (__NOINDEX__). It does not remove pages marked 'noindex' via LocalSettings.php from the generated sitemap file.
I think there must be a solution to this, because if there weren't, Wikipedia would have a problem excluding talk pages from its sitemap, which I don't think it does: https://en.wikipedia.org/wiki/Wikipedia:Controlling_search_engine_indexing
Now, the wiki in question is noindex by default. Pages that are to be indexed have Template:INDEX added to them, but the entire wiki is otherwise noindex because of $wgDefaultRobotPolicies = true; in LocalSettings.php. Thus the desired solution is to generate a sitemap only for pages that have __INDEX__ or Template:INDEX in them, or that indicate 'index' in the HTML output of the page.
```
diff --git a/maintenance/generateSitemap.php b/maintenance/generateSitemap.php
index 6060567..bc5e865 100644
--- a/maintenance/generateSitemap.php
+++ b/maintenance/generateSitemap.php
@@ -305,15 +305,27 @@
* @return IResultWrapper
*/
private function getPageRes( $namespace ) {
- return $this->dbr->select( 'page',
+ return $this->dbr->select(
+ [ 'page', 'page_props' ],
[
'page_namespace',
'page_title',
'page_touched',
- 'page_is_redirect'
+ 'page_is_redirect',
+ 'pp_propname',
],
[ 'page_namespace' => $namespace ],
- __METHOD__
+ __METHOD__,
+ [],
+ [
+ 'page_props' => [
+ 'LEFT JOIN',
+ [
+ 'page_id = pp_page',
+ 'pp_propname' => 'noindex'
+ ]
+ ]
+ ]
);
}
@@ -335,7 +347,13 @@
$fns = $contLang->getFormattedNsText( $namespace );
$this->output( "$namespace ($fns)\n" );
$skippedRedirects = 0; // Number of redirects skipped for that namespace
+ $skippedNoindex = 0; // Number of pages with switch for that NS
foreach ( $res as $row ) {
+ if ( $row->pp_propname === 'noindex' ) {
+ $skippedNoindex++;
+ continue;
+ }
+
if ( $this->skipRedirects && $row->page_is_redirect ) {
$skippedRedirects++;
continue;
@@ -380,6 +398,10 @@
}
}
+ if ( $skippedNoindex > 0 ) {
+ $this->output( " skipped $skippedNoindex page(s) with switch\n" );
+ }
+
if ( $this->skipRedirects && $skippedRedirects > 0 ) {
$this->output( " skipped $skippedRedirects redirect(s)\n" );
}
```
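For the noindex-by-default wiki described above, one possible direction is to invert the check in the patch: instead of skipping rows whose property is 'noindex', keep only rows whose property is 'index'. The sketch below is a rough illustration, not a tested change to generateSitemap.php. The function names `buildSitemapJoinConds` and `shouldSkipPage` and the `$noindexByDefault` flag are hypothetical, and it assumes that __INDEX__ (including when it is transcluded through Template:INDEX) is recorded in page_props as 'index', the same way the patch relies on __NOINDEX__ being recorded as 'noindex'. It also does nothing for robot policies that are set only in LocalSettings.php, since those are not stored in page_props.

```php
<?php
declare( strict_types=1 );

/**
 * Hypothetical helper (not part of generateSitemap.php): builds the join
 * conditions for the page/page_props query used in getPageRes().
 *
 * Assumption: __INDEX__ is stored in page_props as 'index'.
 */
function buildSitemapJoinConds( bool $noindexByDefault ): array {
	// Opt-in wiki: look for the 'index' property; opt-out wiki: 'noindex'.
	$propName = $noindexByDefault ? 'index' : 'noindex';

	return [
		'page_props' => [
			'LEFT JOIN',
			[
				'page_id = pp_page',
				'pp_propname' => $propName,
			],
		],
	];
}

/**
 * Hypothetical per-row check. $propName is the pp_propname column of a
 * result row, or null when the LEFT JOIN found no matching property.
 */
function shouldSkipPage( ?string $propName, bool $noindexByDefault ): bool {
	if ( $noindexByDefault ) {
		// Opt-in: skip everything that is not explicitly marked 'index'.
		return $propName !== 'index';
	}

	// Opt-out: skip only pages explicitly marked 'noindex'.
	return $propName === 'noindex';
}
```

In the loop from the patch, the `$row->pp_propname === 'noindex'` test would then become something like `shouldSkipPage( $row->pp_propname, true )`, with the join array passed as the last argument to `select()`.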