Tables in pdf not getting saved into csv file #824

vijayproxima · 2024-06-03T05:59:53Z

Hi,
In my PDF file I have 4 tables (one for each of 4 regions) listing the holidays for a year. Each table has the columns Sr.No, Date, Day and Festival, and each is titled "Region Name Holiday List 2024". However, when I execute the code below, neither a CSV file nor the pdfdocs.jsonl file is created; only the data.jsonl file is created.
import time

from llmware.configs import LLMWareConfig
from llmware.library import Library
from llmware.retrieval import Query

# NOTE: input_data and output_data (folder paths) are not defined in this snippet

def parsing_the_pdfs():
    t0 = time.time()

    # Create a library backed by SQLite
    LLMWareConfig().set_active_db("sqlite")
    lib = Library().create_new_library("pdfdocs")

    # Parse and extract all of the contents from the documents in the input folder
    parsing_output = lib.add_files(input_folder_path=input_data)

    print("Update: parsing time :", time.time() - t0)
    print("Update: parsing output :", parsing_output)

    # Export all of the content of the library into a jsonl file with metadata
    output1 = lib.export_library_to_jsonl_file(output_data, "data.jsonl")

    # Export all of the tables
    output2 = Query(lib).export_all_tables(query="Holiday", output_fp=output_data)

    return 0

p = parsing_the_pdfs()
This is the output when I execute the code:
Update: parsing time : 0.0057866573333740234
Update: parsing output : {'docs_added': 0, 'blocks_added': 0, 'images_added': 0, 'pages_added': 0, 'tables_added': 0, 'rejected_files': []}
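
A parsing time of about 0.006 s, together with 'docs_added': 0 and an empty 'rejected_files' list, suggests that add_files found no files at all in input_data, rather than that table extraction failed. A minimal sketch for checking this before parsing, assuming the PDFs sit directly in the input folder; list_pdf_files is a hypothetical helper written for illustration and is not part of llmware:

```python
from pathlib import Path


def list_pdf_files(folder):
    """Return the sorted PDF paths found directly inside `folder` (non-recursive)."""
    root = Path(folder)
    if not root.is_dir():
        # A wrong or relative path is a common reason for an empty parse result
        raise FileNotFoundError(f"Input folder does not exist: {root.resolve()}")
    # Match the extension case-insensitively so "REPORT.PDF" is also found
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

Calling list_pdf_files(input_data) right before lib.add_files(input_folder_path=input_data) and printing the result shows whether the folder path resolves to where the PDFs actually are. If the list is empty, the table export has nothing to work on.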


noman1321 · 2024-10-03T11:47:55Z

Please assign this issue to me.

noman1321 · 2024-10-03T12:42:23Z

@vijayproxima you can fix this in 3 ways:

1. tabula

import tabula

# Path to your PDF file
pdf_file = 'your_pdf_file.pdf'

# Extract tables from the PDF (all pages); the result is a list of DataFrames
tables = tabula.read_pdf(pdf_file, pages='all', multiple_tables=True)

# Check how many tables were extracted
print(f'Total tables extracted: {len(tables)}')

# Export the tables to CSV files
for i, table in enumerate(tables):
    output_csv = f'output_table_{i}.csv'
    table.to_csv(output_csv, index=False)
    print(f'Table {i} saved to {output_csv}')

noman1321 · 2024-10-03T12:43:06Z

2. camelot

import camelot

# Path to your PDF file
pdf_file = 'your_pdf_file.pdf'

# Extract tables from all pages of the PDF
tables = camelot.read_pdf(pdf_file, pages='all')

# Check how many tables were extracted
print(f'Total tables extracted: {len(tables)}')

# Export each extracted table to a separate CSV file
for i, table in enumerate(tables):
    output_csv = f'output_table_{i}.csv'
    table.to_csv(output_csv)
    print(f'Table {i} saved to {output_csv}')

noman1321 · 2024-10-03T12:43:58Z

3. pdfplumber

import csv

import pdfplumber

# Path to your PDF file
pdf_file = 'your_pdf_file.pdf'

# Open the PDF with pdfplumber
with pdfplumber.open(pdf_file) as pdf:
    for page_number, page in enumerate(pdf.pages):
        # Extract tables from the page (each table is a list of rows)
        tables = page.extract_tables()

        # Check if any tables were found
        if tables:
            for i, table in enumerate(tables):
                # Save each table as a CSV
                output_csv = f'output_table_page_{page_number}_table_{i}.csv'

                # csv.writer quotes cells that contain commas or line breaks;
                # empty cells (None) are written as empty strings
                with open(output_csv, 'w', newline='', encoding='utf-8') as f:
                    writer = csv.writer(f)
                    for row in table:
                        writer.writerow(['' if cell is None else cell for cell in row])

                print(f'Table {i} from page {page_number} saved to {output_csv}')
        else:
            print(f'No tables found on page {page_number}')

Please let me know if any of this is helpful for your repository.
