TQdev.com

Blog about software development

Split while respecting quotes in PHP

27 Apr 2019 - by 'Maurits van der Schee'

The PHP 'explode' function splits a string into an array on a separator string. That is not enough to build a parser for a template language, because most languages allow string literals to contain any character, including the separator. In this post we show one function that splits while respecting quotes, and a second one that removes the quotes while allowing escaped quotes inside the string.
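To make the limitation concrete, here is a minimal sketch (the input string is made up for illustration) of what 'explode' does to a quoted value that contains the separator:

```php
<?php
// explode() knows nothing about quotes, so the quoted 'one|uno' is cut in two.
$parts = explode('|', "'one|uno'|two");
echo json_encode($parts) . "\n"; // ["'one","uno'","two"]
```

A parser would want two values here, not three.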

The easy assignment

Write a function or program that can split a string at each non-escaped occurrence of a separator character.

It should accept three input parameters: the string, the escape character and the separator character.

It should output a list of strings. (source)

Test case

The input string:

"one^|uno||three^^^^|four^^^|^cuatro|"

Should result in an array of 5 strings:

[ "one|uno", "", "three^^", "four^|cuatro", "" ]

In this example the '^' is the escape character and the '|' is the separator.

The code

<?php
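function token_with_escape($str, $escape = '^', $separator = '|')
{
    $tokens = [];
    $token = '';
    $escaped = false; // true when the previous character was an unescaped escape
    $length = strlen($str);
    for ($i = 0; $i < $length; $i++) {
        $c = $str[$i];
        if (!$escaped) {
            if ($c === $escape) {
                // start an escape sequence; the escape character itself is dropped
                $escaped = true;
            } elseif ($c === $separator) {
                // unescaped separator: finish the current token
                $tokens[] = $token;
                $token = '';
            } else {
                $token .= $c;
            }
        } else {
            // the character after an escape is always taken literally
            $token .= $c;
            $escaped = false;
        }
    }
    // the last token has no trailing separator (a dangling escape is dropped)
    $tokens[] = $token;
    return $tokens;
}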

$input = "one^|uno||three^^^^|four^^^|^cuatro|";
$output = token_with_escape($input);
echo json_encode($output) . "\n";

And it does in fact print the expected array:

["one|uno","","three^^","four^|cuatro",""]
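One edge case is worth knowing about: when the input ends right after an escape character, the function above silently drops that escape. If you would rather keep it, a variant along these lines could work. This is only a sketch; the name 'token_with_escape_keep' is made up for this example and is not part of the original assignment.

```php
<?php
// Hypothetical variant of token_with_escape() (the name is invented for this sketch).
// Only difference: a dangling escape at the end of the input is kept instead of dropped.
function token_with_escape_keep($str, $escape = '^', $separator = '|')
{
    $tokens = [];
    $token = '';
    $escaped = false;
    $length = strlen($str);
    for ($i = 0; $i < $length; $i++) {
        $c = $str[$i];
        if ($escaped) {
            $token .= $c;       // escaped character is always literal
            $escaped = false;
        } elseif ($c === $escape) {
            $escaped = true;    // remember the escape, emit nothing yet
        } elseif ($c === $separator) {
            $tokens[] = $token; // unescaped separator closes the token
            $token = '';
        } else {
            $token .= $c;
        }
    }
    if ($escaped) {
        $token .= $escape;      // input ended right after an escape character
    }
    $tokens[] = $token;
    return $tokens;
}

echo json_encode(token_with_escape_keep('a^|b|c^')) . "\n";            // ["a|b","c^"]
echo json_encode(token_with_escape_keep('a\\,b,c', '\\', ',')) . "\n"; // ["a,b","c"]
```

The second call also shows that the escape and separator are ordinary parameters: here a backslash escapes a comma.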

The hard assignment (complex templates)

Write a function or program that can split a string at each occurrence of a separator character that is not within non-escaped quotes.

It should accept four input parameters: the string, the quote character, the escape character and the separator.

It should output a list of strings.

Test case

You need to avoid splitting inside a quoted string. So you want:

"'one|uno'||'three^'^''|'four^^^'^cuatro'|"

to be split into (step 1):

[ "'one|uno'", "", "'three^'^''", "'four^^^'^cuatro'", "" ]

and to be parsed into (step 2):

[ "one|uno", "", "three''", "four^'cuatro", "" ]

As you can see you never split within a quoted string.

The code

This function will take care of the first step:

<?php
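function token_with_quote($str, $quote = "'", $escape = '^', $separator = '|')
{
    $tokens = [];
    $token = '';
    $escaped = false; // only tracked inside quotes
    $quoted = false;
    $length = strlen($str);
    $seplen = strlen($separator); // the separator may be longer than one character
    for ($i = 0; $i < $length; $i++) {
        $c = $str[$i];
        if (!$quoted) {
            if ($c === $quote) {
                $quoted = true;
            } elseif (substr($str, $i, $seplen) === $separator) {
                // separator outside quotes: finish the token and skip the separator
                $tokens[] = $token;
                $token = '';
                $i += $seplen - 1;
                continue;
            }
        } else {
            if (!$escaped) {
                if ($c === $quote) {
                    $quoted = false;
                } elseif ($c === $escape) {
                    $escaped = true;
                }
            } else {
                // the escaped character (for example a quote) does not end the string
                $escaped = false;
            }
        }
        // quotes and escape characters are kept; token_unquote() removes them
        $token .= $c;
    }
    $tokens[] = $token;
    return $tokens;
}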

$input = "'one|uno'||'three^'^''|'four^^^'^cuatro'|";
$output = token_with_quote($input);
echo json_encode($output) . "\n";

This function will take care of the second step:

function token_unquote($arr, $quote = "'", $escape = '^')
{
    for ($i = 0; $i < count($arr); $i++) {
        $str = trim($arr[$i]);
        // only tokens that start and end with a quote are unquoted; others are left untouched
        if (strlen($str) > 1 && $str[0] === $quote && $str[strlen($str) - 1] === $quote) {
            $escaped = false;
            $token = '';
            $str = substr($str, 1, strlen($str) - 2); // strip the surrounding quotes
            for ($j = 0; $j < strlen($str); $j++) {
                $c = $str[$j];
                if (!$escaped) {
                    if ($c === $escape) {
                        // drop the escape character and keep the next character as is
                        $escaped = true;
                        continue;
                    }
                } else {
                    $escaped = false;
                }
                $token .= $c;
            }
            $arr[$i] = $token;
        }
    }
    return $arr;
}

$input = "'one|uno'||'three^'^''|'four^^^'^cuatro'|";
$output = token_unquote(token_with_quote($input));
echo json_encode($output) . "\n";

And as expected the output is parsed correctly:

["one|uno","","three''","four^'cuatro",""]
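If you never need the intermediate (still quoted) tokens, the two steps can also be merged. The following is a minimal sketch of a single-pass version, not the approach used above: the name 'split_quoted' is made up, and it assumes the quote, escape and separator are each a single character. Unlike 'token_unquote' it does not trim whitespace, and it also strips quotes that appear in the middle of a token.

```php
<?php
// Hypothetical helper (not from the post): splits and unquotes in one pass.
// Assumes a single-character separator, quote and escape.
function split_quoted($str, $quote = "'", $escape = '^', $separator = '|')
{
    $tokens = [];
    $token = '';
    $quoted = false;
    $length = strlen($str);
    for ($i = 0; $i < $length; $i++) {
        $c = $str[$i];
        if ($quoted) {
            if ($c === $escape && $i + 1 < $length) {
                // inside quotes the escape makes the next character literal
                $token .= $str[++$i];
            } elseif ($c === $quote) {
                $quoted = false; // closing quote is dropped
            } else {
                $token .= $c;
            }
        } elseif ($c === $quote) {
            $quoted = true; // opening quote is dropped
        } elseif ($c === $separator) {
            $tokens[] = $token;
            $token = '';
        } else {
            $token .= $c;
        }
    }
    $tokens[] = $token;
    return $tokens;
}

$input = "'one|uno'||'three^'^''|'four^^^'^cuatro'|";
echo json_encode(split_quoted($input)) . "\n";
// ["one|uno","","three''","four^'cuatro",""]
```

For the test case this gives the same result as the two-step version. The two-step version remains the better fit when you need to know afterwards whether a value was quoted.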

Enjoy!

