This Python script extracts tables from HTML files and converts them into Excel spreadsheets. It handles hyperlinks by creating separate URL columns and can add a custom 'type' column to all extracted tables.
- Extracts all tables from a given HTML file
- Preserves hyperlinks by creating separate URL columns
- Adds a custom 'type' column to all tables
- Outputs data to an Excel file with multiple sheets (one per table)
- Prepends "https://scan.merlinchain.io" to all extracted URLs
- Python 3.x
- pandas
- BeautifulSoup4
- openpyxl
Run the script from the command line with two arguments:
- Path to the HTML file
- Value for the 'type' column
The script generates an Excel file named '[original_filename]_tables.xlsx' in the same directory as the input HTML file. Each table from the HTML file is placed in a separate sheet within the Excel file.
read_html_file(file_path)
: Reads the HTML filehtml_table_to_dataframe(table, type_value)
: Converts an HTML table to a pandas DataFrameextract_tables_to_excel(html_content, output_file, type_value)
: Extracts all tables and saves them to an Excel filemain()
: Handles command-line arguments and orchestrates the extraction process
- The script assumes relative URLs in the HTML and prepends "https://scan.merlinchain.io" to all URLs
- Empty URL columns are automatically removed from the output