The PHP 'explode' function splits a string into an array on a separator string. That is not enough to build a parser for a template language, because most languages allow string literals to contain any character, including the separator. In this post we show one function that splits a string while respecting quotes, and a second one that removes the quotes while allowing escaped quotes inside the string.
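To make the limitation concrete, here is a minimal sketch of what 'explode' does with an escaped separator. The sample string is a shortened version of the test case used later in this post.

```php
<?php
// explode() knows nothing about escape characters,
// so the "|" in "^|" is still treated as a split point.
$parts = explode('|', "one^|uno||three");
echo json_encode($parts), "\n"; // ["one^","uno","","three"]
```

The first value comes back as two pieces, "one^" and "uno", instead of the intended "one|uno".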
The easy assignment
Write a function or program that can split a string at each non-escaped occurrence of a separator character.
It should accept three input parameters:
- The string
- The separator character
- The escape character
It should output a list of strings. (source)
Test case
The input string:
"one^|uno||three^^^^|four^^^|^cuatro|"
Should result in an array of 5 strings:
[ "one|uno", "", "three^^", "four^|cuatro", "" ]
In this example the ‘^’ is the escape character and the ‘|’ is the separator.
The code
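<?php
function token_with_escape($str, $escape = '^', $separator = '|')
{
    $tokens = [];
    $token = '';
    $escaped = false;
    $length = strlen($str);
    for ($i = 0; $i < $length; $i++) {
        $c = $str[$i];
        if (!$escaped) {
            if ($c === $escape) {
                // Escape character: take the next character literally.
                $escaped = true;
            } elseif ($c === $separator) {
                // Unescaped separator: close the current token.
                $tokens[] = $token;
                $token = '';
            } else {
                $token .= $c;
            }
        } else {
            // The previous character was the escape: keep this one as-is.
            $token .= $c;
            $escaped = false;
        }
    }
    // Whatever follows the last separator is the final token (possibly empty).
    $tokens[] = $token;
    return $tokens;
}

$input = "one^|uno||three^^^^|four^^^|^cuatro|";
$output = token_with_escape($input);
echo json_encode($output)."\n";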
And it does in fact output the expected array of five strings.
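The opposite direction can be handy for producing test input. The sketch below is not part of the original assignment: 'join_with_escape' is a hypothetical helper name, and it assumes a single-character, non-empty escape and separator. It escapes every escape and separator character in each value and then joins the values.

```php
<?php
// Hypothetical helper (not from the assignment): builds a string that
// an escape-aware splitter can turn back into the same list of values.
function join_with_escape(array $tokens, string $escape = '^', string $separator = '|'): string
{
    // strtr() with an array replaces every key in a single pass, so the
    // escape characters it inserts are never escaped a second time.
    $map = [
        $escape => $escape . $escape,
        $separator => $escape . $separator,
    ];
    $escapedTokens = [];
    foreach ($tokens as $token) {
        $escapedTokens[] = strtr($token, $map);
    }
    return implode($separator, $escapedTokens);
}

echo join_with_escape(["one|uno", "", "three^^", "four^|cuatro", ""]), "\n";
// one^|uno||three^^^^|four^^^|cuatro|
```

The result differs from the test input only in "cuatro": the test input also escapes the "c", which is harmless but not required.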
The hard assignment (complex templates)
Write a function or program that can split a string at each occurrence of a separator character that is not within non-escaped quotes.
It should accept four input parameters:
- The string
- The quote character
- The escape character
- The separator character
It should output a list of strings.
Test case
You need to avoid splitting inside a quoted string. So you want:
"'one|uno'||'three^'^''|'four^^^'^cuatro'|"
to be split into (step 1):
[ "'one|uno'", "", "'three^'^''", "'four^^^'^cuatro'", "" ]
and to be parsed into (step 2):
[ "one|uno", "", "three''", "four^'cuatro", "" ]
As you can see you never split within a quoted string.
The code
This function will take care of the first step:
<?php
function token_with_quote($str, $quote = "'", $escape = '^', $separator = '|')
{
    $tokens = [];
    $token = '';
    $escaped = false;
    $quoted = false;
    $length = strlen($str);
    $seplen = strlen($separator);
    for ($i = 0; $i < $length; $i++) {
        $c = $str[$i];
        if (!$quoted) {
            if ($c === $quote) {
                $quoted = true;
            } elseif (substr($str, $i, $seplen) === $separator) {
                // Separator outside quotes: close the token and skip the separator.
                $tokens[] = $token;
                $token = '';
                $i += $seplen - 1;
                continue;
            }
        } else {
            if (!$escaped) {
                if ($c === $quote) {
                    $quoted = false;
                } elseif ($c === $escape) {
                    $escaped = true;
                }
            } else {
                // Escaped character inside quotes: an escaped quote does not end the string.
                $escaped = false;
            }
        }
        // Quotes and escape characters stay in the token; step 2 removes them.
        $token .= $c;
    }
    $tokens[] = $token;
    return $tokens;
}

$input = "'one|uno'||'three^'^''|'four^^^'^cuatro'|";
$output = token_with_quote($input);
echo json_encode($output)."\n";
This function will take care of the second step:
function token_unquote($arr, $quote = "'", $escape = '^')
{
    for ($i = 0; $i < count($arr); $i++) {
        $str = trim($arr[$i]);
        // Only tokens wrapped in a pair of quotes are unquoted; all others are left untouched.
        if (strlen($str) > 1 && $str[0] === $quote && $str[strlen($str) - 1] === $quote) {
            $escaped = false;
            $token = '';
            // Strip the surrounding quotes.
            $str = substr($str, 1, strlen($str) - 2);
            for ($j = 0; $j < strlen($str); $j++) {
                $c = $str[$j];
                if (!$escaped) {
                    if ($c === $escape) {
                        // Drop the escape character and keep the next character literally.
                        $escaped = true;
                        continue;
                    }
                } else {
                    $escaped = false;
                }
                $token .= $c;
            }
            $arr[$i] = $token;
        }
    }
    return $arr;
}

$input = "'one|uno'||'three^'^''|'four^^^'^cuatro'|";
$output = token_unquote(token_with_quote($input));
echo json_encode($output)."\n";
And as expected the output is parsed correctly.
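If you do not need the intermediate result of step 1, the two steps can be folded into one loop. The following is only a sketch under a few assumptions: 'split_and_unquote' is a hypothetical name, the separator is a single character, and quotes are removed wherever they occur. That last point means a mixed token such as a'b'c becomes abc, whereas the two-step version only unquotes tokens that are quoted as a whole, and unquoted tokens are not trimmed here.

```php
<?php
// Hypothetical one-pass variant: splits on separators outside quotes and
// drops the quotes and escape characters while it scans the string.
function split_and_unquote(string $str, string $quote = "'", string $escape = '^', string $separator = '|'): array
{
    $tokens = [];
    $token = '';
    $quoted = false;
    $length = strlen($str);
    for ($i = 0; $i < $length; $i++) {
        $c = $str[$i];
        if ($quoted) {
            if ($c === $escape && $i + 1 < $length) {
                $token .= $str[++$i]; // keep the escaped character literally
            } elseif ($c === $quote) {
                $quoted = false;      // closing quote is not part of the token
            } else {
                $token .= $c;
            }
        } elseif ($c === $quote) {
            $quoted = true;           // opening quote is not part of the token
        } elseif ($c === $separator) {
            $tokens[] = $token;       // separator outside quotes ends the token
            $token = '';
        } else {
            $token .= $c;
        }
    }
    $tokens[] = $token;
    return $tokens;
}

$input = "'one|uno'||'three^'^''|'four^^^'^cuatro'|";
echo json_encode(split_and_unquote($input)), "\n";
// ["one|uno","","three''","four^'cuatro",""]
```

For the test case this gives the same five strings as the two-step approach. Keeping the steps separate remains useful when a parser needs to know whether a value was quoted in the first place.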
Enjoy!