[PHP-DEV] Progress or just 'a mess'?

Posted by Lester Caine 
Lester Caine
[PHP-DEV] Progress or just 'a mess'?
September 17, 2017 11:00AM
It's a question I've asked before, but there still does not seem to be a
proper answer ... just where is PHP in relation to Unicode? The thread
on 'case-insensitive constants' cherry-picks a particular aspect without
picking up on the base problem: just what character set is PHP7 designed
to work with?

The SQL standard provides a working solution to the problem and one that
is still applied 25 years on ... it lists the subset of characters
available for writing SQL code. Essentially the Latin character set with
well-defined special characters. The irritating part of course is that
this standard is one you have to pay for copies of, but the principle
can easily be copied, along perhaps with some of the extensions relating
to handling Unicode data within the constrained framework.

Everything in SQL is essentially 'upper case', although I still have fun
moving datasets to PHP arrays where the keys end up as 'lower case'
versions of the default UPPER CASE returned by the standard. THIS is an
area where case-insensitive operations would be very useful, but that is
not going to happen any time soon.
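
To illustrate the key-case mismatch described above, here is a minimal sketch in Python (not PHP, and not taken from any real database driver). The function name normalise_keys and the sample column names are hypothetical. It assumes a result row arrives as a mapping whose keys may be in either case, and re-keys it in upper case so that lookups agree with upper-case metadata.

```python
def normalise_keys(row: dict) -> dict:
    """Return a copy of `row` with every key upper-cased.

    str.upper() is Unicode-aware, so non-ASCII letters are converted too.
    Two keys that differ only by case would silently overwrite each other,
    so that situation is reported as an error instead.
    """
    out = {}
    for key, value in row.items():
        folded = key.upper()
        if folded in out:
            raise ValueError(f"keys collide after case normalisation: {folded!r}")
        out[folded] = value
    return out


# Hypothetical row whose keys came back in mixed case.
row = normalise_keys({"customer_id": 1, "Name": "x"})
print(row["CUSTOMER_ID"], row["NAME"])
```

Whether upper or lower case is the better canonical form depends on what the metadata uses; the point is only that one rule is chosen and applied in one place.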

For PHP8 is it not time to lay out a similar set of rules as provided by
SQL and identify just what 'case-insensitive' means and where it does apply?

--
Lester Caine - G8HFL
-----------------------------
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
Rowan Collins
Re: [PHP-DEV] Progress or just 'a mess'?
September 17, 2017 01:00PM
On 17 September 2017 09:54:54 BST, Lester Caine <[email protected]> wrote:
> Just what character set is PHP7 designed
> to work with.

Focusing on the answerable part of this, PHP actually allows a very wide variety of characters in identifiers (names of variables, classes, functions, etc).

I checked the PHP lang-spec repo expecting to find a set of Unicode classes, but it currently mentions "U+0080-U+00FF": https://github.com/php/php-langspec/blob/master/spec/09-lexical-structure.md#names That seems wrong to me, unless I'm looking at the wrong definition - the first part of that range is control characters, and you can have variables called things like $
Christoph M. Becker
Re: [PHP-DEV] Progress or just 'a mess'?
September 17, 2017 02:30PM
On 17.09.2017 at 12:53, Rowan Collins wrote:

> I checked the PHP lang-spec repo expecting to find a set of Unicode classes, but it currently mentions "U+0080-U+00FF": https://github.com/php/php-langspec/blob/master/spec/09-lexical-structure.md#names That seems wrong to me, unless I'm looking at the wrong definition - the first part of that range is control characters, and you can have variables called things like $
Rowan Collins
Re: [PHP-DEV] Progress or just 'a mess'?
September 17, 2017 02:40PM
On 17 September 2017 13:18:44 BST, "Christoph M. Becker" <[email protected]> wrote:
>On 17.09.2017 at 12:53, Rowan Collins wrote:
>
>> I checked the PHP lang-spec repo expecting to find a set of Unicode
>> classes, but it currently mentions "U+0080-U+00FF":
>> https://github.com/php/php-langspec/blob/master/spec/09-lexical-structure.md#names
>> That seems wrong to me, unless I'm looking at the wrong definition -
>> the first part of that range is control characters, and you can have
>> variables called things like $
Christoph M. Becker
Re: [PHP-DEV] Progress or just 'a mess'?
September 17, 2017 03:50PM
On 17.09.2017 at 14:37, Rowan Collins wrote:

> That makes much more sense, but doesn't answer the other question, of if there's a working definition of what we mean by "case insensitive".

For case-insensitive constants zend_register_constant() uses
zend_str_tolower_copy() which uses zend_tolower_ascii() which looks up
in tolower_map:
<https://github.com/php/php-src/blob/php-7.0.23/Zend/zend_operators.c#L46-L63>.
As the name already says, this is a simple ASCII lower case mapping
(A-Z are mapped to a-z; all others map to themselves). So only
identifiers consisting solely of ASCII characters can actually be
case-insensitive.

I presume that this map is also used for other case-insensitive identifiers.
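
The behaviour described above can be sketched as follows. This is an illustration in Python, not the engine's C code: TOLOWER_MAP, ascii_lower and same_identifier are hypothetical names, and the table is rebuilt from the description (A-Z map to a-z, every other byte maps to itself) rather than copied from zend_operators.c. It assumes identifiers are UTF-8 encoded.

```python
# 256-entry byte table: A-Z map to a-z, every other byte maps to itself.
TOLOWER_MAP = bytes(
    b + 32 if ord("A") <= b <= ord("Z") else b
    for b in range(256)
)


def ascii_lower(name: str) -> str:
    """Lower-case only the ASCII letters of a UTF-8 encoded identifier.

    Every byte of a multi-byte UTF-8 sequence is >= 0x80, so the table
    leaves non-ASCII characters untouched and the result decodes cleanly.
    """
    return name.encode("utf-8").translate(TOLOWER_MAP).decode("utf-8")


def same_identifier(a: str, b: str) -> bool:
    """Compare two names the way an ASCII-only case-insensitive lookup would."""
    return ascii_lower(a) == ascii_lower(b)


print(same_identifier("FOO", "foo"))        # ASCII letters fold together
print(same_identifier("ÄPFEL", "äpfel"))    # "Ä" is not folded to "ä"
```

Under these assumptions the first comparison matches and the second does not, which is the sense in which only all-ASCII identifiers can be case-insensitive.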

--
Christoph M. Becker

Lester Caine
Re: [PHP-DEV] Progress or just 'a mess'?
September 17, 2017 04:00PM
On 17/09/17 11:53, Rowan Collins wrote:
> On 17 September 2017 09:54:54 BST, Lester Caine <[email protected]> wrote:
>> Just what character set is PHP7 designed
>> to work with.
>
> Focusing on the answerable part of this, PHP actually allows a very wide variety of characters in identifiers (names of variables, classes, functions, etc).
>
> I checked the PHP lang-spec repo expecting to find a set of Unicode classes, but it currently mentions "U+0080-U+00FF": https://github.com/php/php-langspec/blob/master/spec/09-lexical-structure.md#names That seems wrong to me, unless I'm looking at the wrong definition - the first part of that range is control characters, and you can have variables called things like $
Christoph M. Becker
Re: [PHP-DEV] Progress or just 'a mess'?
September 17, 2017 04:20PM
On 17.09.2017 at 15:45, Christoph M. Becker wrote:

> On 17.09.2017 at 14:37, Rowan Collins wrote:
>
>> That makes much more sense, but doesn't answer the other question, of if there's a working definition of what we mean by "case insensitive".
>
> For case-insensitive constants zend_register_constant() uses
> zend_str_tolower_copy() which uses zend_tolower_ascii() which looks up
> in tolower_map:
> <https://github.com/php/php-src/blob/php-7.0.23/Zend/zend_operators.c#L46-L63>.
> As the name already says, this is a simple ASCII lower case mapping
> (A-Z are mapped to a-z; all others map to themselves). So only
> identifiers consisting solely of ASCII characters can actually be
> case-insensitive.
>
> I presume that this map is also used for other case-insensitive identifiers.

See also Sara's reply to the other thread:
http://news.php.net/php.internals/100602.

--
Christoph M. Becker

Stanislav Malyshev
Re: [PHP-DEV] Progress or just 'a mess'?
September 20, 2017 09:30AM
Hi!

> picking up on the base problem? Just what character set is PHP7 designed
> to work with.

What do you mean by "work with"?

> For PHP8 is it not time to lay out a similar set of rules as provided by
> SQL and identify just what 'case-insensitive' means and where it does apply?

I'm not sure which problem you are trying to solve here. Could you
explain what you'd be using these rules for?
--
Stas Malyshev
smalyshev@gmail.com

Lester Caine
Re: [PHP-DEV] Progress or just 'a mess'?
September 20, 2017 11:40AM
On 20/09/17 08:26, Stanislav Malyshev wrote:
>> picking up on the base problem? Just what character set is PHP7 designed
>> to work with.
>
> What do you mean by "work with"?

Actually that HAS already been identified in this thread, and it is only
the basic ASCII character set, but this is not actually specified anywhere?

>> For PHP8 is it not time to lay out a similar set of rules as provided by
>> SQL and identify just what 'case-insensitive' means and where it does apply?
>
> I'm not sure which problem you are trying to solve here. Could you
> explain what you'd be using these rules for?

Having established that the only characters that are case-insensitive in
PHP7 are the Unicode Basic Latin set, the discussion SHOULD be on
either expanding that to cover all case folding or simply removing this
rather limited case? Tony Marston is making an impassioned demand to
retain this very limited case, and therefore expand it to cover all
character sets, and as a fellow 'English only' coder, I can accept that
argument. However many of my clients do not use English as a first
language so any data handling has to be unicode based, and case in that
data can be important, so is case-insensitive really as universal as
Tony thinks? Certainly we need data case-insensitivity to handle unicode
properly and not just a few english characters ( should I really add a
capital 'E' to english just to please the spell checker? )

People are using their own languages when writing PHP variables and
function names, and apart from a few edge cases this does seem to be
working for them. As with SQL, the key programming words are in English,
and I don't think anybody would suggest adding aliases for them, so
restricting keywords to the 'Unicode Basic Latin set' can be defined. But
does THEN making that case-insensitive add to the problems of making PHP
more user friendly in handling Unicode names elsewhere? I am seeing SQL
field names coming in with Unicode content, and these are then array
keys in PHP ... the Latin characters get lower-cased at times and this
DOES cause a problem if the metadata defines upper case. I suspect
that is something that will never be changed now, but the actual rules
applied would be nice to know?

--
Lester Caine - G8HFL

Stanislav Malyshev
Re: [PHP-DEV] Progress or just 'a mess'?
September 21, 2017 12:30AM
Hi!

> Having established that the only characters that are case-insensitive in
> PHP7 ... the unicode basic latin set ... the discussion SHOULD be on
> either expanding that to cover all case folding or simply removing this
> rather limited case?

Why? Does anybody seriously need Russian case folding in PHP constants?
I mean, sure, nice demo, but does anybody *need* it? I don't see much
code on github - in any language - that uses Russian identifiers, for
example.

> argument. However many of my clients do not use English as a first
> language so any data handling has to be unicode based, and case in that

You seem to be mixing data and code here. So what you are talking about
- data or code?
--
Stas Malyshev
smalyshev@gmail.com
