
Manual talk:GenerateSitemap.php

This script will generate errors on many wikis

Ppehrson (talkcontribs)

Because the URLs are not HTML-sanitized, Google will reject the sitemaps if they contain unescaped HTML characters.


You simply need to adapt the script to sanitize the URLs further.


At line 384, change:

$entry = $this->fileEntry( $title->getCanonicalURL(), $date, $this->priority( $namespace ) );

to:

$entry = $this->fileEntry( encodeURL($title->getCanonicalURL()), $date, $this->priority( $namespace ) );

$title = htmlentities($title);

Before the private function open, around line 424, add:

   private function encodeUrl($url) {
       return str_replace(array('(',')','$','&','\,'@','*','#'),array('%28','%29','%24','%26','%27','%40','%2A','%23'), $url);

//return $url;

   }


This will sanitize to match official sitemap rules. Your generator FAILS the tests without this code, especially if someone enters special characters in a wiki title, like a dollar sign, an asterisk, or parentheses/apostrophes.

Klaugust (talkcontribs)

Hello!

I tried this but got:

PHP Parse error:  syntax error, unexpected token "@", expecting ")" in maintenance/generateSitemap.php on line 434

Also, I think one ' is missing after the backslash \ in the array, but I tried adding it and got the same error.
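
The parse error is consistent with the apostrophe entry in the search array having lost its quotes: as posted, the backslash is followed directly by ,'@' so the string does not close where intended, and adding a single ' still leaves it unbalanced. Below is a minimal sketch of what the helper was probably meant to be, written as a standalone function so it can be tried outside MediaWiki. The name encodeSitemapUrl and the example URL are made up for illustration; inside the script it would stay a private method, and the call site would need $this->encodeUrl( ... ) rather than a bare function call. Note that replacing & and # wholesale also rewrites query-string separators and fragments, so this only demonstrates the quoting fix and is not a recommended patch.

```php
<?php
// Hypothetical standalone version of the helper proposed in this thread.
// The apostrophe is written as '\'' (escaped inside a single-quoted string).
function encodeSitemapUrl( string $url ): string {
	return str_replace(
		[ '(', ')', '$', '&', '\'', '@', '*', '#' ],
		[ '%28', '%29', '%24', '%26', '%27', '%40', '%2A', '%23' ],
		$url
	);
}

// Example URL is invented for this sketch.
echo encodeSitemapUrl( "https://example.org/wiki/Rock_(band)_'68" ), "\n";
// prints https://example.org/wiki/Rock_%28band%29_%2768
```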

Kghbln (talkcontribs)

Thanks for describing the issue and providing a solution. Honestly, from experience gained by sticking around here for a while, I believe that you should file an issue at Phabricator to address this issue and ideally provide a patch to be merged. Otherwise only a few people will notice this which is kinda sad.

Ciencia Al Poder (talkcontribs)

Can someone back up how the URLs aren't correctly encoded?

From the source code, the URL is passed to PHP's htmlspecialchars function, which encodes XML-problematic characters. The URLs are also URL-encoded: they are generated by Title::getCanonicalURL(), which builds on Title::getLocalURL(). If you look at the source code, the "dbKey" is returned from wfUrlencode(), which correctly URL-encodes any non-ASCII characters and special URL characters such as ? and #.
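
To make the two layers concrete, here is a small sketch using only stock PHP functions. It assumes urlencode() as a stand-in for MediaWiki's wfUrlencode(), which differs in detail, and the function name, host, and page title are invented for the example. The extra &action=view parameter is there only so that a literal & reaches the XML-escaping step.

```php
<?php
// Hypothetical illustration of the two encoding layers; not MediaWiki code.
function sketchSitemapLoc( string $pageTitle ): string {
	// Layer 1: URL-encode the title (spaces become underscores first).
	$dbKey = str_replace( ' ', '_', $pageTitle );
	$url = 'https://wiki.example.org/index.php?title=' . urlencode( $dbKey ) . '&action=view';

	// Layer 2: escape XML-significant characters for the <loc> element.
	return htmlspecialchars( $url );
}

echo sketchSitemapLoc( 'Q&A #1' ), "\n";
// prints https://wiki.example.org/index.php?title=Q%26A_%231&amp;action=view
```

The & and # inside the title are percent-encoded, while the & separating the parameters becomes &amp; so the XML stays well-formed.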


"title=" missing in sitemap urls

76.102.130.155 (talkcontribs)

From my homepage, links look like "/index.php?title=My_Page_Name". I turned on $wgEnableCanonicalServerLink, so my pages contain meta data, and the URL is the same. So far so good!


Unfortunately, generateSitemap.php is making <loc> entries that look like "/index.php/My_Page_Name", i.e. without the "title=". (Note that they do contain the scheme and domain, but this forum software thinks I'm link spamming, so they're not shown here)


Google's indexing is mad about this discrepancy. What's the magic incantation to make them all contain "title="?
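
One possible explanation, offered as an assumption rather than a confirmed diagnosis: when $wgArticlePath is not set explicitly, MediaWiki derives it from $wgUsePathInfo, and the default of that setting depends on how PHP is invoked. A maintenance script run from the command line could therefore produce the "/index.php/Page" form while the web server produces "?title=Page". If that is what is happening here, pinning the setting in LocalSettings.php may make both agree. An untested configuration sketch:

```php
// LocalSettings.php (sketch, assuming $wgArticlePath is not set elsewhere)
// Force the "index.php?title=Page" URL form for web requests and CLI scripts alike.
$wgUsePathInfo = false;
```

After changing it, regenerate the sitemap and compare a few <loc> entries against the canonical link in the page source.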

Jcwild (talkcontribs)

When I run this script I get the following:

Content-type: text/html

<br />

<b>Parse error</b>:  syntax error, unexpected T_STRING, expecting T_CONSTANT_ENCAPSED_STRING or '(' in <b>/path/wiki/maintenance/generateSitemap.php</b> on line <b>29</b><br />

Any ideas?

Jcwild (talkcontribs)

Ah, turns out "php" was v4.4.9. I needed to use a newer version.

generateSitemap.php should remove __NOINDEX__ pages added via $wgNamespaceRobotPolicies or $wgDefaultRobotPolicies in LocalSettings.php

Goodman Andrew (talkcontribs)

The following patch for the maintenance script generateSitemap.php, from https://gerrit.wikimedia.org/r/c/620746, works (it removes noindex pages from the sitemap file) only for the behavior switch magic word (__NOINDEX__). It does not remove pages marked 'noindex' via LocalSettings.php from the generated sitemap file.

I think there might be a solution to this because, if there wasn't, Wikipedia would have a problem excluding talkpages from its sitemap, which I think it doesn't: https://en.wikipedia.org/wiki/Wikipedia:Controlling_search_engine_indexing

Now, the wiki in question is noindex by default. Pages that are to be indexed have Template:INDEX added to them, but the entire wiki is noindex by default because of $wgDefaultRobotPolicies = true; in LocalSettings.php. Thus the desired solution is to generate a sitemap only for pages that contain __INDEX__ or Template:INDEX, or that indicate 'index' in their HTML output.

```diff
diff --git a/maintenance/generateSitemap.php b/maintenance/generateSitemap.php
index 6060567..bc5e865 100644
--- a/maintenance/generateSitemap.php
+++ b/maintenance/generateSitemap.php
@@ -305,15 +305,27 @@
 	 * @return IResultWrapper
 	 */
 	private function getPageRes( $namespace ) {
-		return $this->dbr->select( 'page',
+		return $this->dbr->select(
+			[ 'page', 'page_props' ],
 			[
 				'page_namespace',
 				'page_title',
 				'page_touched',
-				'page_is_redirect'
+				'page_is_redirect',
+				'pp_propname',
 			],
 			[ 'page_namespace' => $namespace ],
-			__METHOD__
+			__METHOD__,
+			[],
+			[
+				'page_props' => [
+					'LEFT JOIN',
+					[
+						'page_id = pp_page',
+						'pp_propname' => 'noindex'
+					]
+				]
+			]
 		);
 	}
 
@@ -335,7 +347,13 @@
 			$fns = $contLang->getFormattedNsText( $namespace );
 			$this->output( "$namespace ($fns)\n" );
 			$skippedRedirects = 0; // Number of redirects skipped for that namespace
+			$skippedNoindex = 0; // Number of pages with __NOINDEX__ switch for that NS
 			foreach ( $res as $row ) {
+				if ( $row->pp_propname === 'noindex' ) {
+					$skippedNoindex++;
+					continue;
+				}
+
 				if ( $this->skipRedirects && $row->page_is_redirect ) {
 					$skippedRedirects++;
 					continue;
@@ -380,6 +398,10 @@
 				}
 			}
 
+			if ( $skippedNoindex > 0 ) {
+				$this->output( "  skipped $skippedNoindex page(s) with __NOINDEX__ switch\n" );
+			}
+
 			if ( $this->skipRedirects && $skippedRedirects > 0 ) {
 				$this->output( "  skipped $skippedRedirects redirect(s)\n" );
 			}
```
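
For the default-noindex case described above, the decision the script would have to make can be sketched as a plain function. This is a hypothetical, simplified helper, not part of generateSitemap.php or the Gerrit change: the name isIndexable, its parameters, and the precedence (per-page switch first, then per-namespace policy, then site default) are assumptions made for illustration. The $pageProp values 'index' and 'noindex' mirror the pp_propname value the patch joins on.

```php
<?php
/**
 * Hypothetical helper: should a page appear in the sitemap?
 *
 * @param string $defaultPolicy Site-wide policy string, e.g. 'noindex,nofollow'
 * @param array<int,string> $namespacePolicies Namespace ID => policy string
 * @param int $namespace Namespace ID of the page
 * @param string|null $pageProp 'index', 'noindex', or null if the page sets neither
 */
function isIndexable(
	string $defaultPolicy,
	array $namespacePolicies,
	int $namespace,
	?string $pageProp
): bool {
	// In this sketch a per-page switch overrides the configured policies.
	if ( $pageProp === 'index' ) {
		return true;
	}
	if ( $pageProp === 'noindex' ) {
		return false;
	}

	// Otherwise use the namespace policy, falling back to the site default.
	$policy = $namespacePolicies[$namespace] ?? $defaultPolicy;
	$directives = array_map( 'trim', explode( ',', strtolower( $policy ) ) );

	return !in_array( 'noindex', $directives, true );
}

// On a default-noindex wiki, only pages that opt in are kept.
var_dump( isIndexable( 'noindex,nofollow', [], 0, 'index' ) ); // bool(true)
var_dump( isIndexable( 'noindex,nofollow', [], 0, null ) );    // bool(false)
```

In the script itself, such a check would sit in the row loop next to the existing pp_propname test, with the policies read from the wiki configuration.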

Goodman Andrew (talkcontribs)

How does one skip redirects or namespace redirects that are added via LocalSettings.php during sitemap generation?
