Web scraping section? #331

Open
egpbos opened this issue Jun 23, 2024 · 0 comments
egpbos commented Jun 23, 2024

At the SSH SIG meeting of 20 June, it turned out that four recent (or even still running) projects, led by Reggie, Kody, Flavio and Olga, were or are doing web scraping of several kinds, and yet another (FIRST, led by Laura) was using pre-scraped data.

This is not a completely new phenomenon either; 10 years ago we scraped a KB (royal library) newspaper dataset and used it in many projects. It seems an especially SSH-y topic, but it was also relevant in the past for deep learning when that was new (e.g. we scraped car images for project Sherlock).

Given all this, perhaps it makes sense to devote some words to how to perform this task well. One could do this shallowly (just describe the scraping tools and techniques we have experience with) or a bit more deeply (e.g. how to go from scraping to a clean, shareable, open dataset). I think it would make a nice addition to the Dataset chapter.

The Turing Way only mentions scraping in passing (here).

egpbos added the dash label Sep 24, 2024