Parser suddenly quits while parsing Discogs dump #13

Open
aleksblendwerk opened this issue Jul 21, 2024 · 4 comments
Labels: bug (Something isn't working)

@aleksblendwerk

Hi there,

As I am currently looking to speed up my database import code for Discogs' dump files, I just tried your library with this file: https://discogs-data-dumps.s3-us-west-2.amazonaws.com/data/2024/discogs_20240701_labels.xml.gz. I might be using it wrong, but it seems to stop after a couple of thousand nodes.

This is more or less my code:

$stream = fopen('compress.zlib://[...]/discogs/discogs_20240701_labels.xml.gz', 'rb');

foreach (new XMLParser($stream) as $node) {
    if ($node instanceof XMLNodeContent && $node->name === 'label') {
        var_dump($node->content);
    }
}

fclose($stream);

The output ends with:

string(67) "https://web.archive.org/web/20160427071301/http://www.exogenic.com/"
string(17) "Breakbeat Science"
string(17) "Breakbeat Science"

Somehow parsing suddenly ends at about 1% into the file.

I haven't investigated this further yet and will look elsewhere for now, but I thought I'd report it.

@macbre
Member

macbre commented Jul 22, 2024

@aleksblendwerk - first of all, thanks for giving my library a try and reporting the bug!

Is PHP reporting any error? Is the gzip'ed XML properly formatted? What's the exit code of that script when you run it?

macbre added the bug label on Jul 22, 2024
@aleksblendwerk
Author

aleksblendwerk commented Jul 26, 2024

> @aleksblendwerk - first of all, thanks for giving my library a try and reporting the bug!

You're welcome!

> Is PHP reporting any error? Is the gzip'ed XML properly formatted? What's the exit code of that script when you run it?

PHP doesn't report any error and the process just exits normally, exit code 0.
A timestamp I echo after the fclose is also printed.

The XML should be fine, I successfully loaded it using PHP's built-in XMLReader.
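For reference, the XMLReader check mentioned above can be sketched roughly as follows. This is a minimal sketch, not the actual import code: the tiny gzipped file written to the temp directory is a hypothetical stand-in for the real dump, and it assumes the zlib and xmlreader extensions are enabled.

```php
<?php
// Hypothetical stand-in for the real dump: a tiny gzipped XML file in the temp directory.
$path = tempnam(sys_get_temp_dir(), 'labels');
file_put_contents(
    $path,
    gzencode('<labels><label><name>A</name></label><label><name>B</name></label></labels>')
);

$reader = new XMLReader();
// Read through the compress.zlib:// stream wrapper, as in the fopen() call from the report.
$reader->open('compress.zlib://' . $path);

$count = 0;
while ($reader->read()) {
    // Count every opening element named "label".
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'label') {
        $count++;
    }
}

$reader->close();
unlink($path);

echo $count, PHP_EOL;
```

Comparing a count like this against the number of nodes the library yields for the same file would show how far into the document the iteration actually gets.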

One thing I noticed in the given XML file is that within the label nodes it might contain a sublabels node with child nodes called label again. Maybe that's a case you haven't encountered with your parser before.
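To illustrate the nesting, here is a small sketch using PHP's built-in XMLReader. The sample XML is made up from the description above and is not taken from the Discogs dump (the real elements may carry attributes and more children). It only shows that elements named label appear at two different depths, so a check on the name alone matches both.

```php
<?php
// Made-up sample mirroring the described structure; not real Discogs data.
$xml = <<<XML
<labels>
  <label>
    <name>Parent</name>
    <sublabels>
      <label>Child One</label>
      <label>Child Two</label>
    </sublabels>
  </label>
  <label>
    <name>Solo</name>
  </label>
</labels>
XML;

$reader = new XMLReader();
$reader->XML($xml);

$topLevel = 0;
$nested = 0;
while ($reader->read()) {
    if ($reader->nodeType !== XMLReader::ELEMENT || $reader->name !== 'label') {
        continue;
    }
    // The root <labels> is at depth 0, so its direct <label> children are at depth 1.
    if ($reader->depth === 1) {
        $topLevel++;
    } else {
        $nested++;
    }
}
$reader->close();

printf("top-level: %d, nested: %d\n", $topLevel, $nested);
```

Whether this nesting is related to the early stop reported here is not established; the sketch only makes the structure concrete.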

@macbre
Member

macbre commented Jul 26, 2024

> One thing I noticed in the given XML file is that within the label nodes it might contain a sublabels node with child nodes called label again. Maybe that's a case you haven't encountered with your parser before.

Might be. Can you submit the XML you're trying to parse? Or at least a small sample that can be used to reproduce the problem?

@aleksblendwerk
Author

aleksblendwerk commented Jul 26, 2024

> Might be. Can you submit the XML you're trying to parse? Or at least a small sample that can be used to reproduce the problem?

It is the file I linked in the initial post:

https://discogs-data-dumps.s3-us-west-2.amazonaws.com/data/2024/discogs_20240701_labels.xml.gz

As for providing a small sample to reproduce it, that would probably require me to dig in too deep right now.
