Merge branch 'okfn-brasil:main' into connect_qd

Luisa-Coelho · Apr 16, 2024 · 72f2e23
2 parents 556572d + a6e515a
Showing 198 changed files with 2,746 additions and 860 deletions.
diff --git a/.github/ISSUE_TEMPLATE/revisao_retroativa.yaml b/.github/ISSUE_TEMPLATE/revisao_retroativa.yaml
@@ -1,6 +1,6 @@
 name: Revisão retroativa
 description: Dar manutenção em código legado de raspadores no repositório
-title: "[Revisão retroativa]: Raspador de <MUNICÍPIO-UF>"
+title: "[Revisão retroativa]: <MUNICÍPIO-UF>"
 labels: ["refactor"]
 body: 
   - type: dropdown
@@ -10,7 +10,7 @@ body:
       description: Selecione a opção abaixo
       multiple: false
       options:
-        - Neste repositório, há muitos códigos de raspadores que foram desenvolvidos no começo do projeto e não estão sendo usados. Para passar a usar o raspador deste município, é necessário testar para verificar se segue funcionando e revisá-lo caso não esteja. 
+        - Neste repositório, há muitos códigos de raspadores que foram desenvolvidos no começo do projeto e não estão sendo usados. Para passar a usar o raspador deste município, é necessário testar para verificar se segue funcionando e revisá-lo caso não esteja. Consulte a <a href='https://docs.queridodiario.ok.org.br/pt-br/latest/contribuindo/raspadores.html'>documentação</a> para te ajudar.
     validations:
       required: true
   - type: input
@@ -39,10 +39,3 @@ body:
       placeholder: ex. mês/ano até atualmente; de 2016 à 2020
     validations:
       required: true
-  - type: textarea
-    id: test-list
-    attributes:
-      label: Lista de testes
-      description: "Utilize a lista a seguir de referência para teste. O raspador precisa atender todos os itens para estar pronto para ser usado. \n 1. [ ] Você executou uma extração completa do spider localmente e os dados retornados estavam corretos.\n 2. [ ] Você executou uma extração por período (start_date e end_date definidos) ao menos uma vez e os dados retornados estavam corretos. \n 3. [ ] Você verificou que não existe nenhum erro nos logs (log/ERROR igual a zero).\n 4. [ ] Você definiu o atributo de classe start_date no seu spider com a data do Diário Oficial mais antigo disponível na página da cidade.\n 5. [ ] Você garantiu que todos os campos que poderiam ser extraídos foram extraídos <a href='https://docs.queridodiario.ok.org.br/pt/latest/escrevendo-um-novo-spider.html#definicao-de-campos'>de acordo com a documentação</a>.  \n \n Por favor, inclua qualquer informação relevante para o desenvolvimento."
-    validations:
-      required: false
diff --git a/.github/pull_request_template.md b/.github/pull_request_template.md
@@ -1,12 +1,27 @@
-**AO ABRIR** um Pull Request de um novo raspador (spider), marque com um `X` cada um dos items do checklist 
-abaixo. **NÃO ABRA** um novo Pull Request antes de completar todos os items abaixo.
-
-#### Checklist - Novo spider
-- [ ] Você executou uma extração completa do spider localmente e os dados retornados estavam corretos.
-- [ ] Você executou uma extração por período (`start_date` e `end_date` definidos) ao menos uma vez e os dados retornados estavam corretos.
-- [ ] Você verificou que não existe nenhum erro nos logs (`log_count/ERROR` igual a zero).
-- [ ] Você definiu o atributo de classe `start_date` no seu spider com a data do Diário Oficial mais antigo disponível na página da cidade.
-- [ ] Você garantiu que todos os campos que poderiam ser extraídos foram extraídos [de acordo com a documentação](https://docs.queridodiario.ok.org.br/pt/latest/escrevendo-um-novo-spider.html#definicao-de-campos).
+**AO ABRIR** uma *Pull Request* de um novo raspador (*spider*), marque com um `X` cada um dos items da checklist abaixo. Caso algum item não seja marcado, JUSTIFIQUE o motivo.
+
+#### Layout do site publicador de diários oficiais
+Marque apenas um dos itens a seguir:
+- [ ] O *layout* não se parece com nenhum caso [da lista de *layouts* padrão](https://docs.queridodiario.ok.org.br/pt-br/latest/contribuindo/lista-sistemas-replicaveis.html)
+- [ ] É um *layout* padrão e esta PR adiciona a spider base do padrão ao projeto junto com alguns municípios que fazem parte do padrão.
+- [ ] É um *layout* padrão e todos os municípios adicionados usam a [classe de spider base](https://github.com/okfn-brasil/querido-diario/tree/main/data_collection/gazette/spiders/base) adequada para o padrão.
+
+#### Código da(s) spider(s)
+- [ ] O(s) raspador(es) adicionado(s) tem os [atributos de classe exigidos](https://docs.queridodiario.ok.org.br/pt-br/latest/contribuindo/raspadores.html#UFMunicipioSpider).
+- [ ] O(s) raspador(es) adicionado(s) cria(m) objetos do tipo Gazette coletando todos [os metadados necessários](https://docs.queridodiario.ok.org.br/pt-br/latest/contribuindo/raspadores.html#Gazette).
+- [ ] O atributo de classe [start_date](https://docs.queridodiario.ok.org.br/pt-br/latest/contribuindo/raspadores.html#UFMunicipioSpider.start_date) foi preenchido com a data da edição de diário oficial mais antiga disponível no site.
+- [ ] Explicitar o atributo de classe [end_date](https://docs.queridodiario.ok.org.br/pt-br/latest/contribuindo/raspadores.html#UFMunicipioSpider.end_date) não se fez necessário.
+- [ ] Não utilizo `custom_settings` em meu raspador.
+
+#### Testes
+- [ ] Uma coleta-teste **da última edição** foi feita. O arquivo de `.log` deste teste está anexado na PR.
+- [ ] Uma coleta-teste **por intervalo arbitrário** foi feita. Os arquivos de `.log`e `.csv` deste teste estão anexados na PR.
+- [ ] Uma coleta-teste **completa** foi feita. Os arquivos de `.log` e `.csv` deste teste estão anexados na PR.
+
+#### Verificações
+- [ ] Eu experimentei abrir alguns arquivos de diários oficiais coletados pelo meu raspador e verifiquei eles [conforme a documentação](https://docs.queridodiario.ok.org.br/pt-br/latest/contribuindo/raspadores.html#diarios-oficiais-coletados) não encontrando problemas.
+- [ ] Eu verifiquei os arquivos `.csv` gerados pela minha coleta [conforme a documentação](https://docs.queridodiario.ok.org.br/pt-br/latest/contribuindo/raspadores.html#arquivos-auxiliares) não encontrando problemas.
+- [ ] Eu verifiquei os arquivos de `.log` gerados pela minha coleta [conforme a documentação](https://docs.queridodiario.ok.org.br/pt-br/latest/contribuindo/raspadores.html#arquivos-auxiliares) não encontrando problemas.
 
 #### Descrição
 

diff --git a/.github/workflows/periodic_crawl.yaml b/.github/workflows/daily_crawl.yaml
@@ -1,10 +1,10 @@
-name: Daily execution of Spiders
+name: Daily Crawl of Enabled Spiders
 
 on:
   schedule:
-    # Execute twice a day at 8AM/6PM (BRT)
-    - cron: "0 11 * * *"
+    # Execute once a day at 6PM (BRT)
     - cron: "0 21 * * *"
+  workflow_dispatch:
 
 jobs:
   schedule-jobs:
@@ -29,6 +29,8 @@ jobs:
     - name: Prepare environment
       run: |
         python -m pip install --upgrade pip
-        pip install click python-decouple scrapinghub
+        pip install click python-decouple scrapinghub SQLAlchemy psycopg2
     - name: Schedule jobs
-      run: python scripts/scheduler.py schedule-enabled-spiders
+      run: |
+        cd data_collection/
+        python scheduler.py schedule-enabled-spiders
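GitHub Actions evaluates `schedule` cron expressions in UTC, so the hour field in this workflow is offset from the BRT time named in its comment. The following is a minimal sketch of that conversion, assuming BRT is a fixed UTC-3 offset. The helper `cron_hour_to_brt` is hypothetical and is not part of the repository.

```python
from datetime import datetime, timedelta, timezone

# Assumption: BRT (Brasilia Time) is a fixed UTC-3 offset with no daylight saving.
BRT = timezone(timedelta(hours=-3), name="BRT")


def cron_hour_to_brt(utc_hour: int) -> int:
    """Convert the hour field of a UTC cron expression to the BRT wall-clock hour."""
    # The date is arbitrary; only the hour matters for this conversion.
    utc_time = datetime(2024, 4, 16, utc_hour, 0, tzinfo=timezone.utc)
    return utc_time.astimezone(BRT).hour


kept_run = cron_hour_to_brt(21)     # hour field of the kept "0 21 * * *" entry
removed_run = cron_hour_to_brt(11)  # hour field of the removed "0 11 * * *" entry
print(kept_run, removed_run)
```

Under that assumption, the kept entry corresponds to the 6PM run and the removed entry to the former 8AM run.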
diff --git a/.github/workflows/deploy.yml b/.github/workflows/deploy.yml
@@ -3,6 +3,7 @@ name: Deploy to Scrapy Cloud
 on:
   push:
     branches: [main]
+  workflow_dispatch:
 
 jobs:
   deploy_to_scrapy_cloud:
@@ -21,4 +22,4 @@ jobs:
         SHUB_APIKEY: ${{ secrets.SHUB_APIKEY }}
       run: |
         cd data_collection/
-        shub deploy
+        shub deploy ${{ secrets.SCRAPY_CLOUD_PROJECT_ID }}
diff --git a/.github/workflows/monthly_crawl.yaml b/.github/workflows/monthly_crawl.yaml
@@ -1,9 +1,10 @@
-name: Weekly execution of Spiders
+name: Monthly Crawl of Enabled Spiders
 
 on:
   schedule:
     # Execute once a month at 8PM (BRT)
     - cron: "0 23 1 * *"
+  workflow_dispatch:
 
 jobs:
   schedule-jobs:
@@ -28,6 +29,8 @@ jobs:
     - name: Prepare environment
       run: |
         python -m pip install --upgrade pip
-        pip install click python-decouple scrapinghub
+        pip install click python-decouple scrapinghub SQLAlchemy psycopg2
     - name: Schedule jobs
-      run: python scripts/scheduler.py last-month-schedule-enabled-spiders
+      run: |
+        cd data_collection/
+        python scheduler.py last-month-schedule-enabled-spiders
diff --git a/.github/workflows/schedule_spider.yaml b/.github/workflows/schedule_spider.yaml
@@ -32,14 +32,18 @@ jobs:
     - uses: actions/checkout@v2
     - uses: actions/setup-python@v2
       with:
-        python-version: "3.10"
+        python-version: '3.10'
     - name: Prepare environment
       run: |
         python -m pip install --upgrade pip
-        pip install click python-decouple scrapinghub
+        pip install click python-decouple scrapinghub SQLAlchemy psycopg2
     - name: Schedule full crawl
       if: ${{ !github.event.inputs.start_date }}
-      run: python scripts/scheduler.py schedule-spider --spider_name=${{ github.event.inputs.spider_name }}
+      run: |
+        cd data_collection/
+        python scheduler.py schedule-spider --spider_name=${{ github.event.inputs.spider_name }}
     - name: Schedule partial crawl
       if: ${{ github.event.inputs.start_date }}
-      run: python scripts/scheduler.py schedule-spider --spider_name=${{ github.event.inputs.spider_name }} --start_date=${{ github.event.inputs.start_date }}
+      run: |
+        cd data_collection/
+        python scheduler.py schedule-spider --spider_name=${{ github.event.inputs.spider_name }} --start_date=${{ github.event.inputs.start_date }} --end_date=${{ github.event.inputs.end_date }}
diff --git a/.github/workflows/schedule_spider_by_date.yaml b/.github/workflows/schedule_spider_by_date.yaml
@@ -30,7 +30,9 @@ jobs:
     - name: Prepare environment
       run: |
         python -m pip install --upgrade pip
-        pip install click python-decouple scrapinghub
+        pip install click python-decouple scrapinghub SQLAlchemy psycopg2
     - name: Schedule partial crawl
       if: ${{ github.event.inputs.start_date }}
-      run: python scripts/scheduler.py schedule-all-spiders-by-date --start_date ${{ github.event.inputs.start_date }}
+      run: |
+        cd data_collection/
+        python scheduler.py schedule-all-spiders-by-date --start_date ${{ github.event.inputs.start_date }}
diff --git a/.github/workflows/schedule_spider_date_range.yaml b/.github/workflows/schedule_spider_date_range.yaml
diff --git a/.github/workflows/update_spider_status.yaml b/.github/workflows/update_spider_status.yaml
@@ -0,0 +1,40 @@
+name: Update spider status on production
+
+on:
+  workflow_dispatch:
+    inputs:
+      spider_name:
+        description: 'Spider name'
+        required: true
+      status:
+        type: choice
+        description: 'New Spider status in production'
+        options: 
+        - enabled
+        - disabled
+        required: true
+
+jobs:
+  update_status:
+    runs-on: ubuntu-latest
+    env:
+      QUERIDODIARIO_DATABASE_URL: ${{ secrets.QUERIDODIARIO_DATABASE_URL }}
+    steps:
+    - uses: actions/checkout@v2
+    - uses: actions/setup-python@v2
+      with:
+        python-version: '3.10'
+    - name: Prepare environment
+      run: |
+        python -m pip install --upgrade pip
+        pip install click python-decouple scrapinghub SQLAlchemy psycopg2
+    - name: Enable spider in production
+      if: ${{ github.event.inputs.status == 'enabled' }}
+      run: |
+        cd data_collection/
+        python scheduler.py enable-spider --spider_name=${{ github.event.inputs.spider_name }}
+    - name: Disable spider in production
+      if: ${{ github.event.inputs.status == 'disabled' }}
+      run: |
+        cd data_collection/
+        python scheduler.py disable-spider --spider_name=${{ github.event.inputs.spider_name }}
diff --git a/data_collection/.local.env b/data_collection/.local.env
@@ -0,0 +1,6 @@
+AWS_ACCESS_KEY_ID=minio-access-key
+AWS_SECRET_ACCESS_KEY=minio-secret-key
+AWS_ENDPOINT_URL=http://localhost:9000
+AWS_REGION_NAME=nyc3
+FILES_STORE=s3://queridodiariobucket/
+QUERIDODIARIO_DATABASE_URL=postgresql://queridodiario:queridodiario@localhost:5432/queridodiariodb
diff --git a/data_collection/gazette/commands/__init__.py b/data_collection/gazette/commands/__init__.py
diff --git a/data_collection/gazette/commands/qd-list-enabled.py b/data_collection/gazette/commands/qd-list-enabled.py
@@ -0,0 +1,53 @@
+import datetime
+
+from scrapy.commands import ScrapyCommand
+from scrapy.exceptions import UsageError
+
+from gazette.utils import get_enabled_spiders
+
+
+class Command(ScrapyCommand):
+    requires_project = True
+
+    def add_options(self, parser):
+        ScrapyCommand.add_options(self, parser)
+        parser.add_argument(
+            "--start_date",
+            dest="start_date",
+            default=None,
+            metavar="VALUE",
+            help="List spiders enabled from date (format: YYYY-MM-DD)",
+        )
+        parser.add_argument(
+            "--end_date",
+            dest="end_date",
+            default=None,
+            metavar="VALUE",
+            help="List spiders enabled until date (format: YYYY-MM-DD)",
+        )
+
+    def short_desc(self):
+        return "List production enabled spiders"
+
+    def run(self, args, opts):
+        start_date, end_date = None, None
+
+        if opts.start_date is not None:
+            try:
+                start_date = datetime.datetime.strptime(opts.start_date, "%Y-%m-%d")
+            except ValueError:
+                raise UsageError("'start_date' must match YYYY-MM-DD format")
+
+        if opts.end_date is not None:
+            try:
+                end_date = datetime.datetime.strptime(opts.end_date, "%Y-%m-%d")
+            except ValueError:
+                raise UsageError("'end_date' must match YYYY-MM-DD format")
+
+        print("\nEnabled spiders\n===============")
+        for spider_name in get_enabled_spiders(
+            database_url=self.settings["QUERIDODIARIO_DATABASE_URL"],
+            start_date=start_date,
+            end_date=end_date,
+        ):
+            print(spider_name)
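The two date options above are validated the same way. The sketch below isolates that validation so it can be tried without a Scrapy project. The function name `parse_date_option` is hypothetical, and it raises `ValueError` where the command raises `UsageError`, only to keep the example dependency-free.

```python
import datetime
from typing import Optional


def parse_date_option(
    value: Optional[str], option_name: str
) -> Optional[datetime.datetime]:
    """Parse a YYYY-MM-DD string; None (option not given) passes through."""
    if value is None:
        return None
    try:
        return datetime.datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        # The command in the diff raises scrapy.exceptions.UsageError here;
        # ValueError is an assumption made to avoid importing Scrapy.
        raise ValueError(f"'{option_name}' must match YYYY-MM-DD format") from None


print(parse_date_option("2024-04-16", "start_date"))
print(parse_date_option(None, "end_date"))
```

A value such as `16/04/2024` fails the `strptime` call and produces the format error instead of being passed on to `get_enabled_spiders`.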
diff --git a/data_collection/gazette/database/models.py b/data_collection/gazette/database/models.py
@@ -49,46 +49,51 @@ def load_territories(engine):
     logger.info("Populating 'territories' table - Done!")
 
 
+def get_new_spiders(session, territory_spider_map):
+    registered_spiders = session.query(QueridoDiarioSpider).all()
+    registered_spiders_set = {
+        (spider.spider_name, territory.id, spider.date_from)
+        for spider in registered_spiders
+        for territory in spider.territories
+    }
+    only_new_spiders = [
+        spider_info
+        for spider_info in territory_spider_map
+        if spider_info not in registered_spiders_set
+    ]
+    return only_new_spiders
+
+
 def load_spiders(engine, territory_spider_map):
     Session = sessionmaker(bind=engine)
     session = Session()
 
-    if session.query(QueridoDiarioSpider).count() > 0:
-        return
+    table_is_populated = session.query(QueridoDiarioSpider).count() > 0
+    new_spiders = (
+        get_new_spiders(session, territory_spider_map)
+        if table_is_populated
+        else territory_spider_map
+    )
 
     logger.info("Populating 'querido_diario_spider' table - Please wait!")
 
-    spiders = []
-    territory_ids = set()
-    for info in territory_spider_map:
-        spider_name, territory_id, date_from = info
-        spiders.append(
-            QueridoDiarioSpider(spider_name=spider_name, date_from=date_from)
-        )
-        territory_ids.add(territory_id)
-
-    session.add_all(spiders)
-    session.commit()
-
-    spiders = (
-        session.query(QueridoDiarioSpider)
-        .filter(
-            QueridoDiarioSpider.spider_name.in_(set(s[0] for s in territory_spider_map))
-        )
-        .all()
-    )
-    spider_map = {spider.spider_name: spider for spider in spiders}
-
-    territories = session.query(Territory).filter(Territory.id.in_(territory_ids)).all()
+    territories = session.query(Territory).all()
     territory_map = {t.id: t for t in territories}
 
-    for info in territory_spider_map:
-        spider_name, territory_id, _ = info
-        spider = spider_map.get(spider_name)
+    spiders = []
+    for info in new_spiders:
+        spider_name, territory_id, date_from = info
         territory = territory_map.get(territory_id)
-        if spider is not None and territory is not None:
-            spider.territories.append(territory)
+        if territory is not None:
+            spiders.append(
+                QueridoDiarioSpider(
+                    spider_name=spider_name,
+                    date_from=date_from,
+                    territories=[territory],
+                )
+            )
 
+    session.add_all(spiders)
     session.commit()
     logger.info("Populating 'querido_diario_spider' table - Done!")
 

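The filtering in `get_new_spiders` amounts to a membership test of `(spider_name, territory_id, date_from)` tuples against a set built from the rows already stored. Here is a rough sketch of that idea with plain tuples and no database. The spider names, territory IDs, dates, and the helper `only_new` are illustrative assumptions, not data or code from the repository.

```python
from datetime import date

# Hypothetical sample data: (spider_name, territory_id, date_from) tuples.
registered = {
    ("sp_sao_paulo", "3550308", date(2020, 1, 1)),
}
territory_spider_map = [
    ("sp_sao_paulo", "3550308", date(2020, 1, 1)),
    ("rj_rio_de_janeiro", "3304557", date(2021, 6, 1)),
]


def only_new(candidates, registered_set):
    """Keep the candidate tuples that are not already registered."""
    return [info for info in candidates if info not in registered_set]


new_spiders = only_new(territory_spider_map, registered)
print(new_spiders)
```

Because the whole tuple is compared, an entry whose `date_from` differs from the stored value is also treated as new in this sketch.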
diff --git a/data_collection/gazette/monitors.py b/data_collection/gazette/monitors.py
@@ -27,7 +27,7 @@ def test_requests_items_ratio(self):
             ratio = n_requests_count / n_scraped_items
             percent = round(ratio * 100, 2)
             allowed_percent = round(max_ratio * 100, 2)
-            self.assertLess(
+            self.assertLessEqual(
                 ratio,
                 max_ratio,
                 msg=f"""{percent}% is greater than the allowed {allowed_percent}%