Web scraping / API re-introduction #264

trevorcampbell · 2023-09-22T23:52:48Z

Closes #64
Closes #270
Closes #272

reintroduces the web scraping / API material
makes the web scraping section reproducible (downloaded the .html file from the wiki; the book loads that file behind the scenes)
cleans up / syntax highlights the HTML code on the page
adds lxml to the book image
switches the Twitter API for the NASA API
adds Google Analytics
fixes bugs in the update_environment workflow
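
To illustrate the reproducibility change, here is a minimal sketch of scraping a saved copy of a page instead of the live site. It is not the book's code: the book uses Beautiful Soup with lxml, whereas this sketch swaps in the standard library's `html.parser` so it runs without extra packages. The file name `saved_page.html`, the class names `city` and `pop`, and the table contents are all hypothetical placeholders.

```python
from html.parser import HTMLParser
from pathlib import Path
import tempfile

# Hypothetical stand-in for the downloaded page; the values are placeholders.
SAMPLE_HTML = """<html><body>
<table>
  <tr><td class="city">City A</td><td class="pop">100</td></tr>
  <tr><td class="city">City B</td><td class="pop">200</td></tr>
</table>
</body></html>"""


class CellCollector(HTMLParser):
    """Collect the text of every <td> whose class equals target_class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.in_target = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td" and dict(attrs).get("class") == self.target_class:
            self.in_target = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_target = False

    def handle_data(self, data):
        if self.in_target:
            self.cells[-1] += data


def scrape_local(path, target_class):
    """Parse a saved HTML file from disk; no network request is made."""
    parser = CellCollector(target_class)
    parser.feed(Path(path).read_text(encoding="utf-8"))
    return [cell.strip() for cell in parser.cells]


# Write the placeholder page to disk, then scrape the saved copy.
with tempfile.TemporaryDirectory() as tmp:
    saved_page = Path(tmp) / "saved_page.html"
    saved_page.write_text(SAMPLE_HTML, encoding="utf-8")
    cities = scrape_local(saved_page, "city")
    populations = scrape_local(saved_page, "pop")
```

Because the input is a fixed file rather than a live page, the output stays the same on every build even if the website changes.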

github-actions · 2023-09-24T02:13:47Z

Hello! I've built a preview of your PR so that you can compare it to the current main branch.

PR deploy preview available here
Current main deploy preview available here
Public production build available here

trevorcampbell · 2023-09-24T23:42:35Z

@joelostblom this should be good now, but I did want your opinion on one thing: the last block of code in the API section. It looks like this:

data_dict = {
	"date":[],
	"title": [],
	"copyright" : [],
	"url": []
}

for item in nasa_data:
	data_dict["copyright"].append(item["copyright"] if "copyright" in item else None)
	for entry in ["url", "title", "date"]:
		data_dict[entry].append(item[entry])

nasa_df = pd.DataFrame(data_dict)
nasa_df

It is a little complicated for chapter 2 of the book. I currently have a warning note box before it.

But do you have any idea for how to simplify it?

joelostblom

I didn't go through this in detail, but from skimming it looks great overall!

I don't have anything drastically simpler for your snippet in the end. Maybe students would find it easier without the ternary expression?

data_dict = {
    "date": [],
    "title": [],
    "copyright": [],
    "url": []
}

for item in nasa_data:
    # Not every item has a "copyright" entry; store None when it is missing
    if "copyright" in item:
        data_dict["copyright"].append(item["copyright"])
    else:
        data_dict["copyright"].append(None)

    for entry in ["url", "title", "date"]:
        data_dict[entry].append(item[entry])

nasa_df = pd.DataFrame(data_dict)
nasa_df

One note is that we use 4 spaces for indentation in the rest of the book, but it seems like you used 8 here. Another note is that if we want to extend this chapter sometime in the future, I think https://scrapy.org/ is both more powerful and more intuitive in many cases than using Beautiful Soup directly.
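
One further option, sketched here rather than taken from the PR, is to replace the membership check with `dict.get`, which returns `None` when a key is absent. The records below are hypothetical placeholders shaped like the items in `nasa_data`; the real API response is not shown in this thread.

```python
# Hypothetical records shaped like the entries in nasa_data (placeholder values).
nasa_data = [
    {
        "date": "2023-01-01",
        "title": "Example A",
        "url": "https://example.com/a.jpg",
        "copyright": "Jane Doe",
    },
    {
        # This record has no "copyright" key.
        "date": "2023-01-02",
        "title": "Example B",
        "url": "https://example.com/b.jpg",
    },
]

columns = ["date", "title", "copyright", "url"]

# item.get(column) returns None for a missing key instead of raising KeyError.
data_dict = {
    column: [item.get(column) for item in nasa_data]
    for column in columns
}
```

The resulting `data_dict` can be passed to `pd.DataFrame` exactly as in the snippets above. One behavioural difference: a missing `url`, `title`, or `date` would also become `None` here, whereas the original loop would raise a `KeyError`. The dictionary comprehension may also be no easier for chapter 2 readers than the explicit loop.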

trevorcampbell added 11 commits September 22, 2023 16:52

re-add old web scraping / api section

4fdb696

minor wordsmith

a6a73fd

Merge branch 'main' into web-scraping-api

e742a1b

uncomment learning objectives, scraping/api section

64543e5

minor wordsmith

0255a75

added website source to img/reading/ per link

d9f05db

fix broken html snippet in scraping

e3c96cd

add back missing image in web scraping

9379c45

delete duplicate copy of img

c91e637

fix image path

a082785

add lxml to dockerfile; add wiki html data; update selectorgadget image; scraping section done

83f6972

trevorcampbell marked this pull request as ready for review September 23, 2023 21:30

trevorcampbell added 4 commits September 23, 2023 14:32

force rebuild Dockerfile

1cd2efb

try to debug dockerfile workflow...

00891b3

bugfix workflow...

3cbe154

commit to dockerfile to force rebuild

9421030

trevorcampbell mentioned this pull request Sep 23, 2023

Scraping section not reproducible UBC-DSCI/introduction-to-datascience#537

Closed

actions-user and others added 8 commits September 23, 2023 22:00

update build_html.sh script with new docker image

389a1e7

update build_pdf.sh script with new docker image

95e5eb0

minor formatting api section

9df4ca1

working on new NASA API section

f85423f

added more nasa api figs; rate limits added to book

52941b0

added rho ophiuchi nasa example; WIP

da615fa

WIP nasa api section

8ec2a60

reading wip

15b0cf8

trevorcampbell added 4 commits September 23, 2023 19:15

api WIP

ee0f3c4

nasa api wip

203871e

api section rough version done

d7b426d

add google analytics

936c7a8

minor polish

e2f19b9

trevorcampbell mentioned this pull request Sep 24, 2023

Add analytics UBC-DSCI/introduction-to-datascience#540

Merged

trevorcampbell added 7 commits September 24, 2023 17:41

always update build env to allow deploy pr to trigger

2d9b724

remove text referencing nonexistent tag price

3f1aaf8

minor ed (new wiki scrape is beyond 2016)

ef36897

minor selector fix

12f3a0e

typo fix

49c1815

typo fix

c2e1f24

minor adjustments to learning objs

c1985de

trevorcampbell mentioned this pull request Sep 25, 2023

Add google analytics to py and R books, worksheets #270

Closed

trevorcampbell requested a review from joelostblom September 25, 2023 18:37

joelostblom reviewed Sep 26, 2023

trevorcampbell added 5 commits September 27, 2023 13:05

mlee comments addressed

a7b38fa

simpler json parse code in api

4a30452

tabs to spaces in reading

4e50339

remove unnecessary heading nasa parameters img

8a53639

recrop rho ophiuchi img

8173f41

trevorcampbell merged commit 8f5e933 into main Sep 28, 2023
2 checks passed
