Streaming interface to warc files. #4

Open
tef opened this issue Oct 5, 2013 · 3 comments
Comments

@tef
Contributor

tef commented Oct 5, 2013

  • Avoid parsing entirety of warc file
  • Don't parse http records inside

Any improvements we can make so that very large warc files can be read and processed quickly.
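
The two points above could be sketched roughly as follows. This is a minimal illustration, not code from this library: `iter_record_headers` and `_skip` are hypothetical names, the input is assumed to be an uncompressed binary stream, and header continuation lines are not handled. It reads only the WARC headers of each record, then uses `Content-Length` to skip the block, so any HTTP message inside is never parsed or held in memory.

```python
import io


def _skip(stream, count, chunk_size=64 * 1024):
    """Read and discard `count` bytes so non-seekable streams work too."""
    while count > 0:
        chunk = stream.read(min(chunk_size, count))
        if not chunk:
            break  # truncated file: stop quietly
        count -= len(chunk)


def iter_record_headers(stream):
    """Yield (version, headers) per record without keeping block contents."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line[:40])
        version = line.strip().decode("ascii")
        headers = {}
        while True:
            line = stream.readline()
            if not line.strip():
                break  # blank line (or EOF) ends the header section
            name, _, value = line.partition(b":")
            key = name.strip().decode("utf-8").lower()
            headers[key] = value.strip().decode("utf-8")
        # Skip the block instead of parsing it.
        _skip(stream, int(headers.get("content-length", "0")))
        yield version, headers
```

Because the loop accepts both `\r\n` and bare `\n` and ignores extra or missing blank lines between records, it also tolerates some of the newline problems seen in real files.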

@rajbot

rajbot commented Oct 7, 2013

Many users of the warc library would need parsed http headers, so it would be nice to at least have a convenience function to do so. In addition, it might be useful to have a function that streams through the payload and calculates the sha1 if the WARC-Payload-Digest header is not present.

I have some changes that implement parsing of http records and calculating the sha1 while streaming the payload. However, this happens internally in the library, and these changes are not suitable for upstream. https://bitbucket.org/rajbot/warc-tools
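
For illustration only, a streaming digest along these lines might look like the sketch below. It is not taken from the linked changes: `payload_sha1` is a hypothetical helper, the caller is assumed to have already positioned the stream at the start of the payload and to know its length, and the `sha1:` label with a base32 value is a common convention for WARC-Payload-Digest that is assumed here rather than required.

```python
import base64
import hashlib
import io


def payload_sha1(stream, length, chunk_size=64 * 1024):
    """Hash exactly `length` bytes from `stream` in fixed-size chunks."""
    digest = hashlib.sha1()
    remaining = length
    while remaining > 0:
        chunk = stream.read(min(chunk_size, remaining))
        if not chunk:
            break  # payload shorter than declared
        digest.update(chunk)
        remaining -= len(chunk)
    # Assumed output format: "sha1:" followed by the base32 digest.
    return "sha1:" + base64.b32encode(digest.digest()).decode("ascii")
```

Only one chunk is held in memory at a time, and the stream is left positioned just after the payload, so the caller can continue with the next record.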

@nibrahim
Member

nibrahim commented Oct 7, 2013

The warc library at https://github.com/internetarchive/warc has a number
of these features.

@tef
Contributor Author

tef commented Oct 9, 2013

It's GPLed. This is MIT licensed.

Edit: For the record, the other major difference is that this library has had to handle more corrupt warc files, or weirder variants:

  • Wrong newlines in records, missing trailing newlines after the block
  • arc files containing warc records
  • Files gzipped as a whole instead of record by record

(and the http library handles far too much weirdness)

That said, the interface to warc is /far/ nicer.
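
As a rough illustration of the gzip variant mentioned in the list above, the sketch below walks the gzip members of a stream one at a time. It is an assumption-laden example, not this library's code: `iter_gzip_members` is a hypothetical name, and each member is buffered in full, which is fine for record-by-record files but defeats streaming when the whole file is a single member.

```python
import gzip
import io
import zlib


def iter_gzip_members(stream, chunk_size=64 * 1024):
    """Yield the decompressed bytes of each gzip member in `stream`."""
    buf = b""
    while True:
        if not buf:
            buf = stream.read(chunk_size)
            if not buf:
                return  # clean end of input
        decomp = zlib.decompressobj(wbits=31)  # 31 selects the gzip container
        parts = []
        while True:
            parts.append(decomp.decompress(buf))
            if decomp.eof:
                buf = decomp.unused_data  # start of the next member, if any
                break
            buf = stream.read(chunk_size)
            if not buf:
                raise EOFError("truncated gzip member")
        yield b"".join(parts)
```

A file compressed record by record yields one member per record, while a file gzipped as a whole yields a single member holding every record, so counting members is one way to tell the two layouts apart.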
