Extension talk:MultiReplace
Stripping accents and diacritics from a text
I was trying to do this with #replace but quickly found out that it has a limited nesting level, so I looked for alternatives and found this great extension. Here's the simple function to strip all diacritics from the Latin alphabet, perhaps it may be useful for someone else:
{{#multireplace:{{{1}}} |à=a|á=a|ä=a|ã=a|â=a|å=a|ā=a|ă=a|ą=a|è=e|é=e|ê=e|ë=e|ē=e|ě=e|ĕ=e|ė=e|ę=e| ì=i|í=i|î=i|ï=i|ĩ=i|ī=i|ĭ=i|į=i|ò=o|ó=o|ô=o|õ=o|ö=o|ō=o|ŏ=o|ő=o|ø=o|ù=u|ú=u|û=u|ü=u|ū=u|ŭ=u|ů=u|ý=y|ÿ=y| ç=c|ć=c|č=c|đ=d|ď=d|ğ=g|ģ=g|ķ=k|ł=l|ĺ=l|ľ=l|ļ=l|ń=n|ñ=n|ň=n|ņ=n|ŕ=r|ř=r|ŗ=r|ś=s|š=s|ş=s|ß=ss|ť=t|ź=z|ż=z|ž=z}}
I inserted it in a template called StripAccents and it's working fine. Very few letters with diacritics were left out. Don't forget to include the upper case equivalents if necessary. —Capmo 03:24, 13 May 2009 (UTC)
Just to make it more clear, it's not impossible to do the same using the #replace function, nested like this: {{#replace:{{#replace:{{#replace:{{{1}}}|à|a}}|á|a}}|ä|a}}, but as soon as the maximum level of nested #replaces is reached, you'll have to continue the replaces using a second (a third, etc.) template. Then, to achieve the same result as the StripAccents template of the above example, you'd have to write something like {{StripAEI|{{StripOUY|{{StripC-Z|<text>}}}}}}. —Capmo 05:51, 13 May 2009 (UTC)
- Thanks! This might be very useful in many situations. But there's an easier way to do this with regular expressions:
{{#multireplace:{{{1}}}|/[àáäãâåāăą]/u=a|/[èéêëēěĕėę]/u=e|/[ìíîïĩīĭį]/u=i|/[òóôõöōŏőø]/u=o|/[ùúûüūŭů]/u=u| /[ýÿ]/u=y|/[çćč]/u=c|/[đď]/u=d|/[ğģ]/u=g|ķ=k|/[łĺľļ]/u=l|/[ńñňņ]/u=n|/[ŕřŗ]/u=r|/[śšş]/u=s|ß=ss|ť=t|/[źżž]/u=z}}
- Great, Matěj! I tried doing this with RegExp but couldn't find the exact syntax, thanks for that! Do you have any idea of which of the solutions is less memory/processor demanding? Capmo 18:40, 13 May 2009 (UTC)
- Hi Matěj, me again. Sorry to inform, but something didn't work well when I tried to use your syntax: "Antonín Dvořák" became "Antonain Dvooeaak" and "Gegrüßet" displayed as "Gegrauaget". I had to revert to my previous syntax. It seems the RegExp is getting messed up with all those Unicode characters... I read somewhere that RegExp requires Unicode parameters in hex format, but then it wouldn't be worth using RegExp at all. Any idea on how to fix it in a simple way? Capmo 20:45, 13 May 2009 (UTC)
- Crap, I'll take a look into it. —Matěj Grabovský 05:33, 14 May 2009 (UTC)
- Hey Matěj, solution found! We need to use the option /u "which turns on the Unicode matching mode, instead of the default 8-bit matching mode"[1]. I already updated your example above with this option, ok! By the way, only now did I notice that you're the developer of this extension, thanks a lot for it! :) —Capmo 18:46, 16 May 2009 (UTC)
- Hell yeah! That's it, thank you very much for finding it. —Matěj Grabovský 05:38, 19 May 2009 (UTC)
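For anyone wondering why the output was garbled without the /u option, here is a minimal Python sketch (not the extension's PHP code). It assumes that in the 8-bit mode the pattern and the text are compared byte by byte as UTF-8, so a character class such as [àáä] becomes a class of individual bytes. The rules are copied from the regex example in this section; the function names are made up for illustration.

```python
import re

# Rules copied from the regex example in this section: (pattern, replacement).
RULES = [
    ("[àáäãâåāăą]", "a"), ("[èéêëēěĕėę]", "e"), ("[ìíîïĩīĭį]", "i"),
    ("[òóôõöōŏőø]", "o"), ("[ùúûüūŭů]", "u"), ("[ýÿ]", "y"),
    ("[çćč]", "c"), ("[đď]", "d"), ("[ğģ]", "g"), ("ķ", "k"),
    ("[łĺľļ]", "l"), ("[ńñňņ]", "n"), ("[ŕřŗ]", "r"), ("[śšş]", "s"),
    ("ß", "ss"), ("ť", "t"), ("[źżž]", "z"),
]


def strip_unicode_mode(text):
    """Apply the rules to whole characters, as the /u option does."""
    for pattern, repl in RULES:
        text = re.sub(pattern, repl, text)
    return text


def strip_byte_mode(text):
    """Apply the same rules to raw UTF-8 bytes (assumed 8-bit behaviour)."""
    data = text.encode("utf-8")
    for pattern, repl in RULES:
        # Each multi-byte letter in a class contributes its single bytes.
        data = re.sub(pattern.encode("utf-8"), repl.encode("ascii"), data)
    return data.decode("utf-8", errors="replace")


print(strip_byte_mode("Antonín Dvořák"))     # Antonain Dvooeaak
print(strip_unicode_mode("Antonín Dvořák"))  # Antonin Dvorak
```

The byte-mode result matches the garbled output reported above: each accented letter is two bytes in UTF-8, and each byte is replaced separately by whichever class happens to contain it.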
needless cache for replacing
Have you ever tried something like: {{#multireplace: abababa |a=b| b=a}}
The result is aaaaaaa, not bababab as expected. I think this behavior is nonsensical: anyone who wanted aaaaaaa as the result wouldn't specify a=b in the first place, because a is already a. I would consider this behavior a bug. --Danwe 12:49, 13 May 2009 (UTC)
- Hum, that's interesting! Based on your example I see that the extension scans the whole text for the first replace argument, then it scans again all the text for the second argument, and so on. I would expect the opposite too: that the text would be scanned just once, and all replacements made during this process.
- As an alternative, you can use an intermediary variable. For example,
{{#multireplace: abababa |a=x|b=a|x=b}}
produces the result you want. —Capmo 18:53, 13 May 2009 (UTC)
- Well, I'll take a look into this, too. —Matěj Grabovský 05:33, 14 May 2009 (UTC)
- The idea with the variable isn't bad but the risk is that the variable appears somewhere else in the string and then you have a problem. Or you have to use a very complex variable string which makes the whole function call longer and confusing. --Danwe 12:27, 14 May 2009 (UTC)
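The two behaviours discussed in this section can be compared in a small Python sketch. It is only an illustration for literal search strings, not the extension's code, and both function names are hypothetical. The single-pass version needs no placeholder, so it avoids the risk of the placeholder already occurring in the text.

```python
import re


def replace_sequential(text, pairs):
    """Apply each pair in turn; later pairs also see earlier replacements."""
    for search, repl in pairs:
        text = text.replace(search, repl)
    return text


def replace_single_pass(text, pairs):
    """Scan the text once; text that was just inserted is never rescanned."""
    mapping = dict(pairs)
    if not mapping:
        return text
    # Longest keys first, so a longer search string wins over its own prefix.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return pattern.sub(lambda match: mapping[match.group(0)], text)


pairs = [("a", "b"), ("b", "a")]
print(replace_sequential("abababa", pairs))   # aaaaaaa
print(replace_single_pass("abababa", pairs))  # bababab
```

The trade-off is that a single pass only works this simply for literal strings; combining several regular expressions into one alternation would need more care with groups and flags.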
Error Message
I've noticed that when I run the PHP script runJobs.php from the UNIX prompt, there is an incessant error message:
PHP Warning: preg_match(): Compilation failed: reference to non-existent subpattern at offset 33 in /**/**/**/extensions/MultiReplace.php on line 86
How can this be fixed? I run MediaWiki 1.16alpha (r50326), PHP 5.2.1 (cgi-fcgi) and MySQL 5.0.22. --Aquatiki 10:59, 22 May 2009 (UTC)
Problem with look around assertions??
I just tried to replace something with a regex which uses look-around assertions, but it won't work. Is there a way to make look-around assertions work? --Danwe 20:18, 26 May 2009 (UTC)
- Just ran into this problem again... What causes it is the "=" that is needed in the positive look-ahead and positive look-behind syntax, (?=MUSTER) and (?<=MUSTER), where MUSTER stands for the pattern. --Danwe 16:22, 9 February 2010 (UTC)
Bug which prevents valid regular expressions from working
I just tried the following regex: (?<![#,\.\d])[\d]+(?(?=\.)\.[\d]+)*(?(?=,)\,[\d]+)? You can see a working example at [2].
But with MultiReplace it won't work: {{#multireplace: #4 times 20,40, 30, 70, and #7 times 200.000,00. |/(?<![#,\.\d])[\d]+(?(?=\.)\.[\d]+)*(?(?=,)\,[\d]+)?/=}}
If I cut something off the regex, like: {{#multireplace: #4 times 20,40, 30, 70, and #7 times 200.000,00. |%(?<![#,\.\d])[\d]+%=""}}
it works, but it doesn't cover everything I need.
It looks like there is a bug which stops multireplace from supporting some parts of regular expressions. With Extension:RegexParserFunctions the whole regular expression works well! --Danwe 19:13, 18 June 2009 (UTC)
- What exactly is the result you are expecting from the RegEx? The expression in the example above seems to return an array, but MultiReplace only works with single strings: it receives a string as argument and returns another string. So you need to define a RegEx that also returns a string; if your RegEx has arguments, you need to somehow concatenate them after the equal sign in order to obtain the desired string. See for instance this piece of code I used in one template (FluteAcc and ViolAcc are other templates we use):
{{#multireplace: {{{1|}}} |/(flute)([^s])/i={{FluteAcc|$1}}$2 |/(viols?)([^io])/i={{ViolAcc|$1}}$2 }}
- Observe that after the equal signs, the numbered arguments $1 and $2 found by the RegEx were applied. Capmo 20:08, 18 June 2009 (UTC)
- The example will simply replace all matches with nothing. This has nothing to do with an array. See my example at [3]: all the matches there should be replaced with nothing by using the expression with multireplace. It's also weird that the simpler expression (?<![#,\.\d])[\d]+ works well with multireplace, but the more complex (?<![#,\.\d])[\d]+(?(?=\.)\.[\d]+)*(?(?=,)\,[\d]+)? won't work with multireplace, although it works well with the regex extension. Both regexes are pretty similar in what they do and in how they work and return. I guess the extension has problems with some kinds of look-around assertions or with if-then-else conditionals. --Danwe 14:36, 21 June 2009 (UTC)
- Yes I had seen your example, but there the "Ausgabe" (that I understood as being the expected output) is an array. So what exactly is the text you want as result from this example? Is it the string titled "Treffer" without the parts in yellow? Capmo 04:36, 22 June 2009 (UTC)
- It's because of the equal sign (=). I might take a look into it and add something like a "nesting checker". As a workaround you can replace the first foreach loop with:
foreach( $args as $expr ) {
    $tmp = array();
    // preg_match() already receives $tmp by reference; a call-time "&" is not needed
    preg_match( "/(.*?)=([^=]*)/", $expr, $tmp );
    array_shift( $tmp ); // drop the full match, keep search and replacement
    $exprs[] = $tmp;
}
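One way to let "=" appear inside a pattern, as in (?=...) or (?<=...), is to split each argument at its last "=" instead of its first. The Python sketch below illustrates that idea only; it is not the extension's code, the helper name split_rule is hypothetical, and it assumes the replacement text itself never contains "=".

```python
def split_rule(arg):
    """Split 'search=replacement' at the LAST '=' (hypothetical helper)."""
    search, sep, repl = arg.rpartition("=")
    if not sep:
        return None  # no '=' at all, so this is not a rule
    return search.strip(), repl.strip()


rule = "/(?<=x)y/=z"
print(rule.partition("=")[0])  # /(?<   <- splitting at the first '=' cuts the regex
print(split_rule(rule))        # ('/(?<=x)y/', 'z')
```

If the split is done with a regular expression instead, the pattern part has to be greedy and anchored at the end of the argument; an unanchored, non-greedy match would still stop at the first "=".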
Another essential extension
Thanks a lot. Totally essential :) --Subfader 19:46, 1 December 2009 (UTC)
Replace "="?
How would I replace a string being or containing =? --Subfader 19:48, 1 December 2009 (UTC)
- Create template {{= }} and use it instead of the equal sign in the string. Matěj Grabovský 06:26, 2 December 2009 (UTC)
- Or use \x3D instead of = in your regex (3D is the hex value for ASCII =). --Danwe 15:40, 16 December 2009 (UTC)
Bugs with escaped character "/"
Because = and / have a meaning in MultiReplace, they don't work as expected in a MultiReplace regular expression.
I can understand that = won't work and you need a workaround for that, but at least / should work when it is escaped as \/. For now I use \x2F. --Danwe 15:44, 16 December 2009 (UTC)
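The hex escapes mentioned here and in the previous section can be checked quickly. The sketch below uses Python's re module, which accepts the same \xHH notation as PCRE; the helper hex_escape is a hypothetical convenience for single ASCII characters and is not part of the extension.

```python
import re


def hex_escape(char):
    """Return the \\xHH regex escape for one ASCII character."""
    return "\\x{:02X}".format(ord(char))


print(hex_escape("="))  # \x3D
print(hex_escape("/"))  # \x2F

# The escaped forms match the literal characters in a regex.
print(re.sub(r"\x3D", ":", "key=value"))   # key:value
print(re.sub(r"\x2F", "-", "2009/12/16"))  # 2009-12-16
```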
Escape special characters: = | (feature request)
I think there should be a way to deal with the special characters needed by both regexp and MediaWiki, like | or =.
For myself I found a solution that works with ;-->|, where the | in the regexp is hardcoded for the expression I used. My idea now is to use some special escape characters for these problems. What do you think?
--Schubi87 20:12, 8 January 2011 (UTC)
- I agree about the = but for | you can use a template like {{!}} with nothing but a | inside instead. You can also use \x3D for the =. --Danwe 09:51, 20 July 2011 (UTC)
Problem with SMW
I had some problems with SMW. A dirty hack solved the problem for me. Below the line
$tmp= array_map( "trim", $tmp ); // Will maybe have to be removed
I added
$tmp= preg_replace('/\[\[SMW::(.*)\]\]/',"",$tmp);
--Schubi87 20:12, 8 January 2011 (UTC)
- I am not sure but I think this is due to an old SMW bug, this shouldn't happen anymore so this should be obsolete with the new SMW versions. I had some problems with those SMW strings a while ago and found out about that bug which is solved now. --Danwe 09:53, 20 July 2011 (UTC)
Bug when empty search and replace strings are given
This will end up in an ugly PHP message due to insufficient parsing of the replacement regex strings:
{{#multireplace: a_____b | a=x | = | b=x | | }}
This is not an impractical case; just imagine something like {{#multireplace: a_____b | {{{1|}}} }} or similar.
I solved it by replacing (line 61)
$exprs[] = $tmp;
with
if( strlen( implode( '', $tmp ) ) > 0 )
$exprs[] = $tmp;
Perhaps I will contribute another fix some day to get rid of all the annoying bugs which prevent the use of all the features regex has to offer, like look-around assertions and some special characters. This should be possible with more use of regular expressions within the code. --Danwe 10:00, 20 July 2011 (UTC)
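The effect of the check above can be sketched in Python. This is an illustration only: the function name parse_rules is hypothetical, and it assumes each argument is trimmed and split at the first "=", which may differ from what the extension actually does.

```python
def parse_rules(args):
    """Turn raw 'search=replacement' arguments into pairs, skipping empty ones."""
    rules = []
    for arg in args:
        search, _, repl = arg.partition("=")
        search, repl = search.strip(), repl.strip()
        # Mirrors strlen( implode( '', $tmp ) ) > 0: drop entries with no content.
        if search == "" and repl == "":
            continue
        rules.append((search, repl))
    return rules


# Arguments of {{#multireplace: a_____b | a=x | = | b=x | | }}
print(parse_rules([" a=x ", " = ", " b=x ", " ", " "]))  # [('a', 'x'), ('b', 'x')]
```

Like the PHP check, this only skips entries that are empty on both sides; an entry with an empty search string but a non-empty replacement would still get through and may need separate handling.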