Welcome! Log In Create A New Profile

Advanced

[PHP-DEV] fputcsv() and $escape character

Posted by Christoph M. Becker 
Christoph M. Becker
[PHP-DEV] fputcsv() and $escape character
September 21, 2017 01:50PM
Hi everybody!

There are several bug reports regarding "broken" fputcsv() behavior in
our tracker, namely, because the $escape parameter causes unexpected
results. For instance:

<?php
$row = ['a\\"', 'bbb'];

$fp = fopen('php://memory', 'w+');
fputcsv($fp, $row);
rewind($fp);
echo stream_get_contents($fp);
fclose($fp);
?>

outputs

"a\"",bbb

instead of the expected

"a\""",bbb

I don't think the current behavior is a bug, but rather the escape
character is an extension to the CSV "standard" (RFC 7111). One can
practically disable the escape character by passing any character that
is not contained in any of the strings in the array; "\0" is usually a
good choice, so changing line 5 in the script above to the following
gives the desired behavior:

fputcsv($fp, $row, ',', '"', "\0");

Cf. https://3v4l.org/InlUv vs. https://3v4l.org/tVFBo.

It is, however, not possible to pass an empty string as $escape
argument, because fputcsv() bails out in this case raising

Warning: fputcsv(): escape must be a character

I suggest to allow an empty string instead, and to consider making this
the default in a future version, probably after some time of deprecating
any other $escape argument.

Thoughts?

--
Christoph M. Becker

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
Andreas Hennings
Re: [PHP-DEV] fputcsv() and $escape character
September 21, 2017 07:40PM
So empty string would enable the standard behavior RFC 7111 with no escape char.
If so, I support this.

I don't know if '' or true / false / null should be this "special value".
It has to be something that was not legal before, and that people
should use intentionally and not by accident.
I guess '' is good enough for this, or not worse than other options.

One question: Does this also require a change in fgetcsv()?
So it can read csv without escape character?
I remember that fgetcsv() does currently tolerate some broken CSV. It
should continue to do so for BC reasons.

For the record, here are some links:
https://stackoverflow.com/questions/44427926/data-gets-garbled-when-writing-to-csv-with-fputcsv-fgetcsv/46342634#46342634
https://bugs.php.net/bug.php?id=74713&edit=2

On Thu, Sep 21, 2017 at 1:43 PM, Christoph M. Becker <[email protected]> wrote:
> Hi everybody!
>
> There are several bug reports regarding "broken" fputcsv() behavior in
> our tracker, namely, because the $escape parameter causes unexpected
> results. For instance:
>
> <?php
> $row = ['a\\"', 'bbb'];
>
> $fp = fopen('php://memory', 'w+');
> fputcsv($fp, $row);
> rewind($fp);
> echo stream_get_contents($fp);
> fclose($fp);
> ?>
>
> outputs
>
> "a\"",bbb
>
> instead of the expected
>
> "a\""",bbb
>
> I don't think the current behavior is a bug, but rather the escape
> character is an extension to the CSV "standard" (RFC 7111). One can
> practically disable the escape character by passing any character that
> is not contained in any of the strings in the array; "\0" is usually a
> good choice, so changing line 5 in the script above to the following
> gives the desired behavior:
>
> fputcsv($fp, $row, ',', '"', "\0");
>
> Cf. https://3v4l.org/InlUv vs. https://3v4l.org/tVFBo.
>
> It is, however, not possible to pass an empty string as $escape
> argument, because fputcsv() bails out in this case raising
>
> Warning: fputcsv(): escape must be a character
>
> I suggest to allow an empty string instead, and to consider making this
> the default in a future version, probably after some time of deprecating
> any other $escape argument.
>
> Thoughts?
>
> --
> Christoph M. Becker
>
> --
> PHP Internals - PHP Runtime Development Mailing List
> To unsubscribe, visit: http://www.php.net/unsub.php
>

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
Rowan Collins
Re: [PHP-DEV] fputcsv() and $escape character
September 21, 2017 08:10PM
On 21 September 2017 18:32:57 BST, Andreas Hennings <[email protected]> wrote:
>So empty string would enable the standard behavior RFC 7111 with no
>escape char.
>If so, I support this.

Just a note regarding standards: please make sure the behaviour of common applications, most notably Excel, is considered and tested. In my experience it has its own quirks, and it's likely that a large proportion of users require interoperability with it. It may be there's nothing relevant to consider, but I thought I'd mention it, in case people get too caught up with a specification that nobody actually follows.

Regards,

--
Rowan Collins
[IMSoP]

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
Dan Ackroyd
Re: [PHP-DEV] fputcsv() and $escape character
September 21, 2017 11:10PM
On 21 September 2017 at 12:43, Christoph M. Becker <[email protected]> wrote:
> Hi everybody!
>
> There are several bug reports regarding "broken" fputcsv() behavior in
> our tracker, namely, because the $escape parameter causes unexpected
> results. For instance:


I looked at fixing some of the CSV related bugs about a year or so
ago. My conclusions were:

i) There is no way to fix the problems that wouldn't cause horrible BC
breaks for code that is only coincidentally working currently.

ii) Handling strings in C is much more error prone than handling them in PHP.

I'm reasonably certain that trying to fix the current functions is the
wrong approach, and one of the following would be much better.

Either, find a C library that has already been proven to handle CSV
parsing/generating 'correctly' and bring that into PHP core under
either new function names or namespace.

Or, write the code in PHP (or just use
https://github.com/thephpleague/csv) and find a way to make that fast
enough for people to use.

Touching the existing code is pretty certain to bring a lot of pain,
without resulting in a fully compliant csv parser/generator.

cheers
Dan
Ack

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
Andreas Hennings
Re: [PHP-DEV] fputcsv() and $escape character
September 22, 2017 02:50AM
On Thu, Sep 21, 2017 at 11:08 PM, Dan Ackroyd <[email protected]> wrote:
> On 21 September 2017 at 12:43, Christoph M. Becker <[email protected]> wrote:
>> Hi everybody!
>>
>> There are several bug reports regarding "broken" fputcsv() behavior in
>> our tracker, namely, because the $escape parameter causes unexpected
>> results. For instance:
>
>
> I looked at fixing some of the CSV related bugs about a year or so
> ago. My conclusions were:
>
> i) There is no way to fix the problems that wouldn't cause horrible BC
> breaks for code that is only coincidentally working currently.

How so?
What we can do:
If $escape_char !== '', fputcsv() should work exactly as it does now.
It can even use the same C code.
If $escape_char === '', fputcsv() will have a different behavior than
currently, where no character is treated as special.

If we want to be 100% sure, these two cases can use completely
separate C code, with one if() at the entry point.

>
> ii) Handling strings in C is much more error prone than handling them in PHP.
>
> I'm reasonably certain that trying to fix the current functions is the
> wrong approach, and one of the following would be much better.
>
> Either, find a C library that has already been proven to handle CSV
> parsing/generating 'correctly' and bring that into PHP core under
> either new function names or namespace.

Yeah why not. But it will require a lot more work and design decisions
than just introducing a new allowed value for $enclosure parameter.

>
> Or, write the code in PHP (or just use
> https://github.com/thephpleague/csv) and find a way to make that fast
> enough for people to use.

It can never be as fast as native functions.
I vaguely remember something like factor 2 or 3 when I tried it, but
for now treat this as hearsay.

>
> Touching the existing code is pretty certain to bring a lot of pain,
> without resulting in a fully compliant csv parser/generator.
>
> cheers
> Dan
> Ack
>
> --
> PHP Internals - PHP Runtime Development Mailing List
> To unsubscribe, visit: http://www.php.net/unsub.php
>

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
Andreas Hennings
Re: [PHP-DEV] fputcsv() and $escape character
September 22, 2017 03:00AM
On Thu, Sep 21, 2017 at 1:43 PM, Christoph M. Becker <[email protected]> wrote:
> I don't think the current behavior is a bug, but rather the escape
> character is an extension to the CSV "standard" (RFC 7111).

Are you sure you mean RFC 7111 ?
I was just parrotting the number in my previous email, but now I
looked it up and only find this:
https://tools.ietf.org/html/rfc7111
This talks about uri fragments with CSV, not CSV itself.

RFC 4180 seems to be closer to what we are looking for:
https://tools.ietf.org/html/rfc4180#section-2

fputcsv() and fgetcsv() already have a number of extensions to this
format, which I think are not harmful and that we should keep:
- option to choose a different delimiter
- option to choose a different enclosure
- option to have a different number of cells per row.
- fgetcsv() has some tolerance for broken CSV, that we should continue
to support.

In the stackoverflow discussion, someone argues that line breaks are
not part of the standard / not portable:
https://stackoverflow.com/questions/44427926/data-gets-garbled-when-writing-to-csv-with-fputcsv-fgetcsv/46342634#comment75882780_44427926
However, the RFC 4180 that I found clearly mentions them:

> 6. Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes.

So, all we would change is the escape behavior. In RFC 4180, it says:

> If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote.

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
Christoph M. Becker
Re: [PHP-DEV] fputcsv() and $escape character
September 22, 2017 01:30PM
On 21.09.2017 at 20:01, Rowan Collins wrote:

> On 21 September 2017 18:32:57 BST, Andreas Hennings <[email protected]> wrote:
>
>> So empty string would enable the standard behavior RFC 7111 with no
>> escape char.
>> If so, I support this.
>
> Just a note regarding standards:

Actually, there is no accepted standard regarding CSV. RFC 7111 is just
an update to RFC 4180 (sorry, I've messed that up), but anyway both are
just informational.

> please make sure the behaviour of common applications, most notably
> Excel, is considered and tested. In my experience it has its own quirks,
> and it's likely that a large proportion of users require
> interoperability with it. It may be there's nothing relevant to
> consider, but I thought I'd mention it, in case people get too caught up
> with a specification that nobody actually follows.

That's exactly the point of this proposal. As it is, fputcsv() outputs
CSV that won't be understood by Excel (or any other CSV aware
implementation I'm aware of), as soon as the escape character is
actually part of any string. So being able to avoid any such escaping
would be helpful wrt. major CSV implementations, and making that the
default even more so.

The only other issue that I can see is that currently our fputcsv()
hard-codes the line endings to LF (while RFC 4180 demands CRLF), but
that may not be a real problem anymore. (Well, there might be issues
with non ASCII compatible character encodings as well).

--
Christoph M. Becker


--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
Andreas Hennings
Re: [PHP-DEV] fputcsv() and $escape character
September 22, 2017 08:50PM
On Fri, Sep 22, 2017 at 1:20 PM, Christoph M. Becker <[email protected]> wrote:
>> please make sure the behaviour of common applications, most notably
>> Excel, is considered and tested. In my experience it has its own quirks,
>> and it's likely that a large proportion of users require
>> interoperability with it. It may be there's nothing relevant to
>> consider, but I thought I'd mention it, in case people get too caught up
>> with a specification that nobody actually follows.
>
> That's exactly the point of this proposal. As it is, fputcsv() outputs
> CSV that won't be understood by Excel (or any other CSV aware
> implementation I'm aware of), as soon as the escape character is
> actually part of any string. So being able to avoid any such escaping
> would be helpful wrt. major CSV implementations, and making that the
> default even more so.

The other problem that this proposal wants to fix is that CSV cannot
currently be used reliably to store or transport data between
different PHP scripts, due to this bug:
https://bugs.php.net/bug.php?id=74713&edit=2

> If one cell of the data sent to fputcsv() contains "{$enclosure}{$escape_char}{$escape_char}{$enclosure}{$delimiter}", this cell will be split after a round trip of fputcsv() + fgetcsv().


>
> The only other issue that I can see is that currently our fputcsv()
> hard-codes the line endings to LF (while RFC 4180 demands CRLF), but
> that may not be a real problem anymore. (Well, there might be issues
> with non ASCII compatible character encodings as well).

I have opened a lot of php-generated CSV files with LibreOffice Calc,
and the line endings have not been a problem.
I assume Excel and OpenOffice would also be ok with it.
I'd say stick with existing behavior for BC reasons, or discuss it separately.

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
Christoph M. Becker
Re: [PHP-DEV] fputcsv() and $escape character
September 23, 2017 12:20AM
On 21.09.2017 at 23:08, Dan Ackroyd wrote:

> On 21 September 2017 at 12:43, Christoph M. Becker <[email protected]> wrote:
>
>> There are several bug reports regarding "broken" fputcsv() behavior in
>> our tracker, namely, because the $escape parameter causes unexpected
>> results. For instance:
>
> I looked at fixing some of the CSV related bugs about a year or so
> ago. My conclusions were:
>
> i) There is no way to fix the problems that wouldn't cause horrible BC
> breaks for code that is only coincidentally working currently.
>
> ii) Handling strings in C is much more error prone than handling them in PHP.
>
> I'm reasonably certain that trying to fix the current functions is the
> wrong approach, and one of the following would be much better.
>
> Either, find a C library that has already been proven to handle CSV
> parsing/generating 'correctly' and bring that into PHP core under
> either new function names or namespace.
>
> Or, write the code in PHP (or just use
> https://github.com/thephpleague/csv) and find a way to make that fast
> enough for people to use.
>
> Touching the existing code is pretty certain to bring a lot of pain,
> without resulting in a fully compliant csv parser/generator.

I agree that php_fgetcsv() has serious issues, and it might not be
possible to fix it without causing severe BC breaks. php_fputcsv(), on
the other hand, is less of a problem, though.

Overall, the most demanding issue is that both functions try to regard
the current locale, but already fail that generally, since several
parameters are declared as char, which can't work for (some) multibyte
encodings. For instance, it is impossible to generate proper UTF-16
encoded CSV files, or to read them. This issue continues, because
several (mostly?) whitespace characters are hard-coded assuming an ASCII
compatible character encoding.

A minor issue are the hard-coded record terminators, which are currently
LF (RFC 4180 specifies CRLF). Apparently, this isn't a real issue
nowadays (besides the missing support for non ASCII compatible character
encodings).

Another issue concerns the escape character. Frankly, I don't even have
the slightest clue how that is supposed to work, and why it even has
been introduced in the first place. Maybe it has been introduced for
compatibility with some application requiring it; maybe it has been
introduced to support "DSV style"[1]. If the latter, at least
php_fputcsv() doesn't support it (anymore). Unless it's clear what the
escape character exactly is supposed to do, we *cannot* even *hope* to
fix the implementation.

Introducing new functions with a clearly defined behavior would be nice,
but appears to me somewhat as pie in the sky (somebody would have to do
the actual work!). But even if we do so, at least the actual behavior
of the existing functions would have to be documented.

And frankly, I don't see why it would be a problem to allow to use no
escape character for fputcsv(). That certainly wouldn't be a BC break,
since currently the function bails out if `escape_str_len < 1`. Of
course, that wouldn't fix all issues, but it appears to make the
function work as expected for ASCII compatible character encodings (for
other character encodings the function appears to be broken anyway).

Ad league/csv: rather impressive! However, including this functionality
into ext/standard is totally over the top, in my opinion. I guess that
fgetcsv() and fputcsv() are mostly used for importing from and exporting
to CSV, respectively, but not as replacement for an SQL database engine.

[1] http://www.catb.org/~esr/writings/taoup/html/ch05s02.html

--
Christoph M. Becker

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
Sorry, only registered users may post in this forum.

Click here to login