Time | Agenda |
---|---|
0930 | Registration start |
1020 | Opening |
1030 | Analytic Talk by Fusionex |
1200 | Q & A |
1230 | Lunch |
1330 | Workshop start |
1620 | Ending Speech |
1630 | End |
Workshop Agenda
Time | Agenda |
---|---|
1330 | Ice Breaking |
1333 | Introduce TAs |
1334 | Install requirements |
1350 | Basic Web Scraping |
1415 | Scrape CIA world factbook |
1445 | Pandas + Matplotlib |
1620 | End |
-
Install all the requirements before starting
$ pip install -r ./requirements.txt
OR
$ python -m pip install -r ./requirements.txt
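If the requirements.txt file is not with you, the packages used throughout this workshop are roughly the following (an assumption based on the imports used later, not the official file; pin versions as you see fit):

# assumed contents of requirements.txt
Scrapy
pandas
matplotlib
seaborn
scipy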
-
Error (Windows)
Lack of Tk
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/site-packages/matplotlib/pyplot.py", line 115, in <module>
    _backend_mod, new_figure_manager, draw_if_interactive, _show = pylab_setup()
  File "/usr/local/lib/python3.6/site-packages/matplotlib/backends/__init__.py", line 32, in pylab_setup
    globals(),locals(),[backend_name],0)
  File "/usr/local/lib/python3.6/site-packages/matplotlib/backends/backend_tkagg.py", line 6, in <module>
    from six.moves import tkinter as Tk
  File "/usr/local/lib/python3.6/site-packages/six.py", line 92, in __get__
    result = self._resolve()
  File "/usr/local/lib/python3.6/site-packages/six.py", line 115, in _resolve
    return _import_module(self.mod)
  File "/usr/local/lib/python3.6/site-packages/six.py", line 82, in _import_module
    __import__(name)
  File "/usr/local/lib/python3.6/tkinter/__init__.py", line 36, in <module>
    import _tkinter # If this fails your Python may not be configured for Tk
ModuleNotFoundError: No module named '_tkinter'
Solution: Reinstall Python and make sure the tcl/tk component is selected during installation
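A quick way to confirm Tk is available after reinstalling is to launch the bundled Tkinter demo; it should open a small test window:

$ python -m tkinter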
Microsoft Visual C++ 14.0
running build_ext
building 'twisted.test.raiser' extension
error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools
Solution:
-
Manually download Twisted from here
-
Run
$ pip install Twisted-18.9.0-cp37-cp37m-win32.whl
Lack of pywin32
ModuleNotFoundError: No module named 'pywin32'
Solution:
$ pip install pywin32
OR
$ pip install pypiwin32
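As a quick smoke test that the install worked, you can try importing one of the modules pywin32 provides (win32api here is just an example module from the package):

$ python -c "import win32api"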
-
-
First, create a Scrapy project
$ scrapy startproject sasxstc
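The command generates a project skeleton roughly like the one below (exact files may vary slightly between Scrapy versions):

sasxstc/
    scrapy.cfg            # deploy configuration file
    sasxstc/              # the project's Python package
        __init__.py
        items.py          # item definitions
        middlewares.py
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # your spiders go here
            __init__.py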
-
We will start by scraping this website
-
Create a new file in
projectdir/sasxstc/sasxstc/spiders/SampleSpider.py
-
Insert below boilerplate code into
SampleSpider.py
import scrapy


class SampleSpider(scrapy.Spider):
    name = "sample"
    start_urls = [
        'https://sunwaytechclub.github.io/2018-SASxSTCWorkshop/1.html'
    ]

    '''
    Called after every request
    This is where your scraping code should be
    '''
    def parse(self, response):
        result = response.body
        return {"result": result}
-
Run the spider
$ scrapy crawl sample
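If you want to keep the scraped items instead of only watching the console log, Scrapy can export them to a file with the -o flag, for example:

$ scrapy crawl sample -o result.json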
-
If you get the following error:
File "/home/gaara/.virtualenvs/sasxstc/lib/python3.7/site-packages/twisted/conch/manhole.py", line 154 def write(self, data, async=False): ^ SyntaxError: invalid syntax
Run
$ pip install git+https://github.com/twisted/twisted.git@trunk
-
XPath basics
# Extract every html tag
result = response.xpath('//html').extract()

# Extract the body tag directly under html
result = response.xpath('//html/body').extract()

# Extract every body tag
result = response.xpath('//body').extract()

# Extract the p tags under html > body > div
result = response.xpath('//html/body/div/p').extract()

# Extract the text within every p tag
result = response.xpath('//p/text()').extract()
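If you prefer to experiment with XPath expressions interactively instead of re-running the spider each time, scrapy shell is handy for this (shown here with the same page the sample spider starts from):

$ scrapy shell 'https://sunwaytechclub.github.io/2018-SASxSTCWorkshop/1.html'
>>> response.xpath('//p/text()').extract()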
-
XPath with id and class
# Extract every element within the div with id "d01"
result = response.xpath('//div[@id="d01"]').extract()

# Extract all the text with class blue
result = response.xpath('//p[@class="blue"]/text()').extract()

# Extract the word SAS
result = response.xpath('//div[@id="d02"]/p[@class="red"]/text()').extract()
The CIA is the Central Intelligence Agency. Various data can be found in the CIA World Factbook, such as country GDP and population growth rate. The list of data can be found here.
-
So first, visit this URL and observe the website.
-
Let's create an empty spider
FactbookSpider.py
from pprint import pprint

import scrapy


class FactbookSpider(scrapy.Spider):
    name = "factbook"
    start_urls = [
        'https://www.cia.gov/library/publications/the-world-factbook/rankorder/rankorderguide.html'
    ]

    '''
    Called after every request
    This is where your scraping code should be
    '''
    def parse(self, response):
        pass
-
Time to get the links
links = response.xpath('//body//div[@id="profileguide"]/div[@class="answer"]//a')
for index, link in enumerate(links):
    text = link.xpath('text()').extract_first()
    link = link.xpath('@href').extract_first()
    print(text)
    print(link)
-
Now, join the URL
link = response.urljoin(link.xpath('@href').extract_first())
-
Put the links into a results object
links = response.xpath('//body//div[@id="profileguide"]/div[@class="answer"]//a')
results = {}
for index, link in enumerate(links):
    text = link.xpath('text()').extract_first()
    link = response.urljoin(link.xpath('@href').extract_first())
    results[text] = link
pprint(results)
-
Crawl into one of the links
def parse(self, response):
    links = response.xpath('//body//div[@id="profileguide"]/div[@class="answer"]//a')
    results = {}
    for index, link in enumerate(links):
        text = link.xpath('text()').extract_first()
        link = response.urljoin(link.xpath('@href').extract_first())
        results[text] = link

    yield scrapy.Request(
        results["Population growth rate:"],
        callback=self.parse_population,
        meta={"links": results}
    )

def parse_population(self, response):
    meta = response.meta
    pprint(meta)
-
Scrape the rows and store them in results
rows = response.xpath('//div[@class="wfb-text-box"]//table[@id="rankOrder"]/tbody/tr')
results = {}
for index, row in enumerate(rows):
    if not row.xpath('@class').extract_first() == "rankHeading":
        id = row.xpath('@id').extract_first()
        name = row.xpath('td[@class="region"]//text()').extract_first()
        population_growth = row.xpath('td[3]/text()').extract_first()
        print(id + " " + name + " " + population_growth)
        results[id] = {
            "name": name,
            "population_growth_rate": population_growth
        }
-
Do the same to extract the infant mortality rate
def parse_population(self, response):
    meta = response.meta
    rows = response.xpath('//div[@class="wfb-text-box"]//table[@id="rankOrder"]/tbody/tr')
    results = {}
    for index, row in enumerate(rows):
        if not row.xpath('@class').extract_first() == "rankHeading":
            id = row.xpath('@id').extract_first()
            name = row.xpath('td[@class="region"]//text()').extract_first()
            population_growth = row.xpath('td[3]/text()').extract_first()
            results[id] = {
                "name": name,
                "population_growth_rate": population_growth
            }

    meta["results"] = results
    yield scrapy.Request(
        meta["links"]["Infant mortality rate:"],
        callback=self.parse_infant_mortality,
        meta=meta
    )

def parse_infant_mortality(self, response):
    meta = response.meta
    results = meta["results"]
    rows = response.xpath('//div[@class="wfb-text-box"]//table[@id="rankOrder"]/tbody/tr')
    for index, row in enumerate(rows):
        if not row.xpath('@class').extract_first() == 'rankHeading':
            id = row.xpath('@id').extract_first()
            infant_mortality_rate = row.xpath('td[3]/text()').extract_first()
            results[id]["infant_mortality_rate"] = infant_mortality_rate

    return results
-
Getting an error?
Traceback (most recent call last):
  File "/home/gaara/.virtualenvs/sasxstc/lib/python3.7/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/gaara/Desktop/2018-SASxSTCWorkshop/sasxstc/sasxstc/spiders/FactbookSpider.py", line 60, in parse_gdp
    results[id]["infant_mortality_rate"] = infant_mortality_rate
KeyError: 'kv'
Surround it with try and except
try:
    results[id]["infant_mortality_rate"] = infant_mortality_rate
except KeyError:
    pass
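An equivalent approach, if you would rather avoid raising the exception at all, is to check for the key first; a small sketch with the same effect:

# only update rows whose id was already seen in the population table
if id in results:
    results[id]["infant_mortality_rate"] = infant_mortality_rate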
-
Now you have the Factbook data with you!
-
However, before we move forward, let's make use of Scrapy's item pipelines
Add a field to
projectdir/sasxstc/sasxstc/items.py
class SasxstcItem(scrapy.Item):
    # define the fields for your item here like:
    results = scrapy.Field()
Uncomment the following lines in
projectdir/sasxstc/sasxstc/settings.py
# ITEM_PIPELINES = {
#     'sasxstc.pipelines.SasxstcPipeline': 300,
# }
Import the item in
projectdir/sasxstc/sasxstc/spiders/FactbookSpider.py
from sasxstc.items import SasxstcItem
Change the last line of
projectdir/sasxstc/sasxstc/spiders/FactbookSpider.py
# return results
item = SasxstcItem()
item["results"] = results
return item
-
You are good to go now!
-
Go to
projectdir/sasxstc/sasxstc/pipelines.py
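The generated pipelines.py should already contain a pipeline class with a process_item method; the code on the next few slides is meant to live inside that method. A rough sketch of the stub (your generated file may differ slightly):

class SasxstcPipeline(object):

    def process_item(self, item, spider):
        # the pandas / seaborn code from the next slides goes here
        return item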
-
Add imports
import pandas
import seaborn
from matplotlib import pyplot
from pprint import pprint
from scipy import stats
-
Separate the results into different lists
results = item["results"]
country_name = []
population_growth = []
infant_mortality = []
for country_code in list(results.keys()):
    country_name.append(results[country_code]["name"])
    population_growth.append(float(results[country_code]["population_growth_rate"]))
    try:
        infant_mortality.append(float(results[country_code]["infant_mortality_rate"]))
    except KeyError:
        infant_mortality.append(None)
-
Put the data into a pandas DataFrame
data = pandas.DataFrame(
    {
        "infant_mortality": infant_mortality,
        "population_growth": population_growth
    },
    index=country_name
)
pprint(data)
-
Run it and see what the data looks like
-
Drop the rows with empty fields
data = data.dropna(how='any')
pprint(data)
-
Plot the graph
seaborn.jointplot(x="infant_mortality", y="population_growth", data=data, kind="reg")
pyplot.show()
-
Run it!
-
Now, add the R and P values
seaborn.jointplot(x="infant_mortality", y="population_growth", data=data, kind="reg", stat_func=stats.pearsonr)
-
Add the equation of the regression line
slope, intercept, r_value, p_value, std_err = stats.linregress(
    data["infant_mortality"].tolist(),
    data["population_growth"].tolist()
)
seaborn.jointplot(x="infant_mortality", y="population_growth", data=data,
                  kind="reg", stat_func=stats.pearsonr)
pyplot.annotate("y={0:.1f}x+{1:.1f}".format(slope, intercept),
                xy=(0.05, 0.95), xycoords='axes fraction')
pyplot.show()
-
And, you are done!