At the SSH SIG meeting of 20 June, it turned out that four recent (or even still running) projects, led by Reggie, Kody, Flavio and Olga, were or are doing web scraping of several kinds, and yet another (FIRST, led by Laura) was using pre-scraped data.
This is not a completely new phenomenon either; 10 years ago we scraped a KB (royal library) newspaper dataset and used it in many projects. It seems an especially SSH-y topic, but it was also relevant for deep learning when that was new (e.g. we scraped car images for project Sherlock).
Perhaps given all this it makes sense to devote some words to how to perform this task well. One could do this shallowly (just describe the scraping tools and techniques we have experience with) or a bit more deeply (e.g. how to go from scraping to a clean, shareable, open dataset). I think it would make a nice addition to the Dataset chapter.
The Turing Way only mentions scraping in passing (here).