Alan Feuerbacher
[PHP] Counting File Lines in XMLReader with a Large File
August 30, 2017 07:30PM
Hi,

As a new PHP user, I've recently completed a PHP program that extracts a
bunch of data from a relatively unstructured XML file. The file has
roughly 500,000 lines and I have no control over its generation.

The file generally has one XML tag like <foo> per line, but sometimes
lines are more complicated.

After a lot of reading and experimenting, I found that XMLReader was the
tool for getting the data.

As part of my debugging process, I used the call $LineNumber =
$reader->expand()->getLineNo(); (after doing
$reader->open("InputFileName");) to get the file line number that the
XMLReader cursor was pointing to. Eventually I found that files larger
than about 65535 lines returned wrong line numbers. After some more
online searching, I found a discussion from about 2006 between a PHP
user and a developer that pretty much explained what was going on:
XMLReader uses a 16-bit integer to count file line numbers, which of
course is limited to 65535. The developer said he would not fix this,
for various reasons.

I ended up splitting the original XML file into smaller pieces under
65535 lines each, and concatenating the results.

It appears that this line numbering issue remains today. Are there any
plans to make file line numbering work with larger files?

One of the PHP developer's points was that XML does not necessarily
include Newlines that would result in file lines, but that all content
could be in one giant string. True in principle, but not in practice
where human readers are involved. I know that I would have been hard put
to debug my PHP code without being able to correlate file lines with
XMLReader cursor positions.

Comments?

Alan

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
> Date: Wednesday, August 30, 2017 11:18:51 -0600
> From: Alan Feuerbacher <[email protected]>

I don't think it is ever a particularly good idea to try to read in
the whole of some arbitrarily sized file over which you have no
control. If something like this specific issue doesn't get you,
something else will, e.g., machine memory constraints.

So instead, write your program to read in a specific number of
bytes/characters (or if appropriate, lines) that you know are safe
for your environment and tools, process those, and go on to the next
chunk.
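
A minimal sketch of that chunked approach, assuming the data can be
handled line by line. The helper name readInChunks() and the chunk size
are illustrative only and do not come from this thread. Note that a
chunk boundary can fall in the middle of an XML element, so individual
chunks are not guaranteed to be well-formed XML.

```php
<?php
/**
 * Yield successive chunks of at most $linesPerChunk lines from a file,
 * so that only one chunk is held in memory at a time.
 * (Hypothetical helper; name and signature are illustrative.)
 */
function readInChunks(string $path, int $linesPerChunk): Generator
{
    if ($linesPerChunk < 1) {
        throw new InvalidArgumentException('Chunk size must be at least 1');
    }
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    try {
        $chunk = [];
        while (($line = fgets($handle)) !== false) {
            $chunk[] = $line;
            if (count($chunk) === $linesPerChunk) {
                yield $chunk;      // hand a full chunk to the caller
                $chunk = [];
            }
        }
        if ($chunk !== []) {
            yield $chunk;          // final, possibly shorter chunk
        }
    } finally {
        fclose($handle);           // runs even if the caller stops early
    }
}
```

It could be used as foreach (readInChunks("InputFileName", 50000) as
$chunk) { ... }, with the loop body processing one chunk before the
next is read.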

Christoph M. Becker
[PHP] Re: Counting File Lines in XMLReader with a Large File
August 31, 2017 12:10AM
On 30.08.2017 at 19:18, Alan Feuerbacher wrote:

> As part of my debugging process, I used the function LineNumber =
> $reader->expand()->getLineNo(); (after doing $reader->open(
> "InputFileName" ); ) to get the file line number that the XMLReader
> cursor was pointing to. Eventually I found that files larger than about
> 65535 lines returned wrong line numbers. Again after some online
> searching, I found a discussion from about 2006 between a PHP user and a
> developer that pretty much explained what was going on: the XMLReader
> program uses a 16-bit integer to count file line numbers, which of
> course is limited to 65535. The developer said he would not fix this,
> for various reasons.

See https://bugs.php.net/bug.php?id=54138 for a more recent discussion.

TL;DR: You can pass LIBXML_BIGLINES
(<http://php.net/manual/en/libxml.constants.php#constant.libxml-biglines>)
as $options to XMLReader::open(), if you're using PHP ≥ 7.0.0 and libxml
≥ 2.9.0.
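
A rough sketch of how that might look, assuming PHP ≥ 7.0.0 built
against libxml ≥ 2.9.0. The flag goes in the third argument of
XMLReader::open() (the second is the encoding). The helper name
elementLineNumbers() is made up for illustration; the
expand()->getLineNo() call is the one from the original post.

```php
<?php
/**
 * Collect the line number of every element named $tagName in $path.
 * (Hypothetical helper for illustration only.)
 */
function elementLineNumbers(string $path, string $tagName): array
{
    $reader = new XMLReader();
    // LIBXML_BIGLINES asks libxml to report line numbers above 65535.
    if (!$reader->open($path, null, LIBXML_BIGLINES)) {
        throw new RuntimeException("Cannot open $path");
    }
    $lines = [];
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT
            && $reader->name === $tagName) {
            // expand() returns a DOMNode copy of the current node,
            // or false on failure.
            $node = $reader->expand();
            if ($node !== false) {
                $lines[] = $node->getLineNo();
            }
        }
    }
    $reader->close();
    return $lines;
}
```

Calling elementLineNumbers("InputFileName", "foo") would then return
the reported line of each <foo> element, which can be compared against
the file in an editor.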

--
Christoph M. Becker
