error when trying to extend bundesanzeiger search #75

Open · time4breakfast opened this issue Sep 23, 2022 · 8 comments

@time4breakfast commented Sep 23, 2022

I thought about contributing to your package by adding extended search functionality, i.e. not only searching across all documents but also allowing the search to be limited to certain document types.
Unfortunately, this only works for some companies, while for others the captcha solver always fails. Any ideas why that might be?
(e.g. it works without errors for "Deutsche Bahn AG" but keeps failing for "Deutsche Bank AG")

Change: add the value area_select=22 to the search request:

response = self.session.get(
    f"https://www.bundesanzeiger.de/pub/de/start?0-2.-top%7Econtent%7Epanel-left%7Ecard-form=&fulltext={company_name}&area_select=22&search_button=Suchen"
)
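
For illustration, a minimal sketch of how this could be parameterized (the helper name and the area parameter are assumptions; only area_select=22 is confirmed in this thread):

from urllib.parse import quote_plus

# hypothetical helper, not part of the package; only area_select=22
# is confirmed here, other values are assumptions
def search_with_area(session, company_name: str, area: int = 22):
    # perform the search restricted to one document category
    return session.get(
        "https://www.bundesanzeiger.de/pub/de/start"
        "?0-2.-top%7Econtent%7Epanel-left%7Ecard-form="
        f"&fulltext={quote_plus(company_name)}"
        f"&area_select={area}&search_button=Suchen"
    )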

@time4breakfast (Author)

Just learned that there is a new format called ESEF. Reports using this format do not have a captcha that needs to be solved, which is why soup.find() returns None.

@wirthual (Member)

Thanks for looking into this.

Does this mean we need to adapt or extend our code?

@mariedittmer

> Just learned that there is a new format called ESEF. Reports using this format do not have a captcha that needs to be solved, which is why soup.find() returns None.

Hi, did you solve it? I guess I have the same problem. It would be super nice to adapt the code! :)

@wirthual (Member)

Well, you could add an additional check here to see if a captcha is present. Something like:

if soup.find("div", {"class": "captcha_wrapper"}) is not None:
    # solve the captcha here
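
For instance, a minimal sketch of that branching (solve_captcha and handle_esef are hypothetical helper names, not functions from this package):

def process_report_page(soup):
    # branch on the presence of the captcha element suggested above
    if soup.find("div", {"class": "captcha_wrapper"}) is not None:
        # old-format report: run the existing captcha-solving path
        return solve_captcha(soup)  # hypothetical helper
    # no captcha element: most likely an ESEF report
    return handle_esef(soup)  # hypothetical helper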

@time4breakfast (Author)

Hi,

we kind of solved it / implemented a workaround for our use case:
If soup.find() returns None, assume that this is an ESEF report (so there is no captcha to solve) and just find and "click" the accept button on the website. After that, we implemented a function or two that can read and process the ESEF viewer (which painfully slows down your browser when you try to work with it or even just view something).
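
In outline, something like this (a minimal sketch; the "esef-select-container" and "btn btn-primary" selectors are the same ones used in the script at the end of this thread):

def accept_esef_report(session, soup):
    # soup.find() found no captcha: assume an ESEF report and follow
    # the accept/confirm button instead of solving a captcha
    container = soup.find("div", {"class": "esef-select-container"})
    if container is None:
        return None  # not an ESEF page after all
    accept_link = container.find("a", {"class": "btn btn-primary"})
    return session.get(accept_link["href"]) if accept_link else None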

I don't have the code here with me but will provide it after the holidays.

@mdittmer-A
Hi,

thank you very much. It would be great if you could share the function for the ESEF viewer. Do you know if there are plans to make a PR for this feature?

Thanks in advance

@jurekmff commented Dec 6, 2023

Hi @time4breakfast,
I am running into the same issue.
Do you mind sharing your code? Thanks a lot

@time4breakfast (Author) commented Jan 25, 2024

Hi jurekmff,
sorry for the late reply.

Current situation: I switched companies and no longer have access to the code. But I found a test sample on my machine (which unfortunately imports my own corrected version of the deutschland API that I also don't have anymore -.-) that I will share with you.

In theory, what you need to do to fix the error is:

  • assume Bundesanzeiger only has 2 formats for reports: the old one and the new (= ESEF) one
  • for the old one, the existing deutschland API Python package works, because the old format requires solving a captcha
  • for the new one, the Bundesanzeiger website sends you to another page without a captcha, so you need to adapt the code from this package by "just" wrapping it inside a try-except block: if the try part succeeds, you have an old-format report; if the except part hits, you most probably have an ESEF report (see the sketch below)
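
A minimal sketch of that wrapping (the exact exception type is an assumption based on the None error described above; handle_esef_report is a hypothetical helper):

from deutschland.bundesanzeiger import Bundesanzeiger

ba = Bundesanzeiger()
try:
    # old format: the captcha element exists and the solver runs
    reports = ba.get_reports("Deutsche Bank AG")
except AttributeError:
    # soup.find() returned None, so there was no captcha to solve:
    # most probably an ESEF report
    reports = handle_esef_report("Deutsche Bank AG")  # hypothetical helper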

The ESEF report itself works a little differently than the old format: in a standard browser (just as a normal user looking at it), it opens inside its very own "viewer" implementation, which took forever to load on my machine and usually also kind of slowed down the whole computer. The format itself is subdivided into several (kind of) pages containing different contents and use cases. Using the code sample further down in this post, you can try accessing the ESEF report(s) for yourself. The sample was done for Deutsche Bank. You should be fine replacing the first import "from handelsregister_updates import Bundesanzeiger" with just "from deutschland.bundesanzeiger import Bundesanzeiger" and making the necessary try-except adaptation.

Also, keep in mind that the domain bundesanzeiger.de will change in the future to unternehmensregister.de. In the background it is the same company, but they are trying to separate the data, domains and everything more clearly.

Hope that helps. If you have any further questions, don't hesitate to ask. I hope to be able to answer more quickly in the future.

Best regards
time4breakfast


# -*- coding: utf-8 -*-
"""
Spyder Editor

This is a temporary script file.
"""

from handelsregister_updates import Bundesanzeiger
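# note from the thread: replace this import with
#   from deutschland.bundesanzeiger import Bundesanzeiger
# and add the try-except adaptation described above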
ba = Bundesanzeiger()
reports = ba.get_reports("Deutsche Bahn AG")

#GET /pub/de/start?12-2.-top%7Econtent%7Epanel-left%7Ecard-form=&fulltext=Deutsche+Bank+AG&area_select=22&search_button=Suchen HTTP/1.1

import requests
from bs4 import BeautifulSoup
import dateparser

session = requests.Session()
session.cookies["cc"] = "1663315556-37c8ed90cc5e8d6c-10"
session.headers.update(
            {
                "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
                "Accept-Encoding": "gzip, deflate, br",
                "Accept-Language": "de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7,et;q=0.6,pl;q=0.5",
                "Cache-Control": "no-cache",
                "Connection": "keep-alive",
                "DNT": "1",
                "Host": "www.bundesanzeiger.de",
                "Pragma": "no-cache",
                "Referer": "https://www.bundesanzeiger.de/",
                "sec-ch-ua-mobile": "?0",
                "Sec-Fetch-Dest": "document",
                "Sec-Fetch-Mode": "navigate",
                "Sec-Fetch-Site": "same-origin",
                "Sec-Fetch-User": "?1",
                "Upgrade-Insecure-Requests": "1",
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36",
            }
        )
# get the jsessionid cookie
response = session.get("https://www.bundesanzeiger.de")
# go to the start page
response = session.get("https://www.bundesanzeiger.de/pub/de/start?0")
# perform the search
response = session.get(
    "https://www.bundesanzeiger.de/pub/de/start?0-2.-top%7Econtent%7Epanel-left%7Ecard-form=&fulltext=Deutsche+Bank+AG&area_select=22&search_button=Suchen"
)

def find_all_entries_on_page(page_content: str):
    soup = BeautifulSoup(page_content, "html.parser")
    wrapper = soup.find("div", {"class": "result_container"})
    if wrapper is None:
        return  # no results on this page
    rows = wrapper.find_all("div", {"class": "row"})
    for row in rows:
        info_element = row.find("div", {"class": "info"})
        if not info_element:
            continue

        link_element = info_element.find("a")
        if not link_element:
            continue

        entry_link = link_element.get("href")
        entry_name = link_element.contents[0].strip()

        date_element = row.find("div", {"class": "date"})
        if not date_element:
            continue

        date = dateparser.parse(date_element.contents[0], languages=["de"])

        company_name_element = row.find("div", {"class": "first"})
        if not company_name_element:
            continue

        company_name = company_name_element.contents[0].strip()

        yield date, entry_name, entry_link, company_name
 
# get menu of esef report
def get_esef_menu(find_res):
    menulist = []
    for menu_item in find_res:
        menulist.append({"link":menu_item.find("a", {"class": "link-file"})['href'],
                         "name":menu_item.find("a", {"class": "link-file"})['title']})
    return menulist

# list all results
result = []
for element in find_all_entries_on_page(response.text):
    result.append(element)

# extract an esef report as a BeautifulSoup object (lxml parser) from its link
def get_esef_report(esef_link):
    response = session.get(esef_link)
    return BeautifulSoup(response.text, "lxml")

# find esefs within the results
esef_list = []
for entry in result:
    get_element_response = session.get(entry[2])
    soup = BeautifulSoup(get_element_response.text, "html.parser")
    if soup.find("div", {"class": "esef-select-container"}) is not None:
        esef_session = session.get(soup.find("div", {"class": "esef-select-container"}).find("a", {"class": "btn btn-primary"})['href'])
        esef_bs = BeautifulSoup(esef_session.text.encode("utf-8"), "html.parser")
        esef_menu = get_esef_menu(esef_bs.find_all("div", {"class": "file-list-item level-1"}))
        esef_list.append(esef_menu)
        # get esef reports
        for entry in esef_menu:
            get_esef_report(entry['link'])

# find all text nodes containing "Honorar" in the last fetched ESEF report
hits = mysoup.find_all(string=lambda t: "Honorar" in t)
