
Tables in pdf not getting saved into csv file #824

Open
vijayproxima opened this issue Jun 3, 2024 · 4 comments

Comments

@vijayproxima

Hi,
My PDF file has 4 tables (one per region) listing the holidays for a year. Each table has the columns Sr.No, Date, Day and Festival, and is titled "Region Name Holiday List 2024". However, when I run the code below, no CSV file is created and no pdfdocs.jsonl file is created either; only the data.jsonl file is written.
import time

from llmware.configs import LLMWareConfig
from llmware.library import Library
from llmware.retrieval import Query

# input_data and output_data (folder paths) are defined elsewhere

def parsing_the_pdfs():
    t0 = time.time()
    # Create a library
    LLMWareConfig().set_active_db("sqlite")

    lib = Library().create_new_library("pdfdocs")
    # Parse and extract all of the contents from these documents
    # Add files to the library
    parsing_output = lib.add_files(input_folder_path=input_data)

    print("Update: parsing time :", time.time() - t0)
    print("Update: parsing output :", parsing_output)
    # Export all of the content of the library into jsonl files with metadata
    output1 = lib.export_library_to_jsonl_file(output_data, "data.jsonl")
    # Export all of the tables
    output2 = Query(lib).export_all_tables(query="Holiday", output_fp=output_data)

    return 0

p = parsing_the_pdfs()
This is the output when I execute the code:
Update: parsing time : 0.0057866573333740234
Update: parsing output : {'docs_added': 0, 'blocks_added': 0, 'images_added': 0, 'pages_added': 0, 'tables_added': 0, 'rejected_files': []}
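
The parsing output above reports 'docs_added': 0 and a parsing time of a few milliseconds, which suggests that no file was parsed at all, so there would be no tables to export. One possible cause (an assumption, not confirmed in this thread) is that input_data does not point to a folder that contains the PDF. A minimal sketch of a check to run before lib.add_files, using only the standard library; list_pdf_files is a hypothetical helper name:

```python
from pathlib import Path


def list_pdf_files(folder):
    """Return the PDF files directly inside `folder` (non-recursive).

    Raises FileNotFoundError if the folder does not exist, so a wrong
    path is reported up front instead of yielding an empty parse.
    """
    root = Path(folder)
    if not root.is_dir():
        raise FileNotFoundError(f"Input folder does not exist: {root}")
    # Match the extension case-insensitively (.pdf, .PDF, ...)
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

Calling list_pdf_files(input_data) and printing the result before adding files shows whether the folder is reachable and whether it holds any PDFs. An empty list would be consistent with the 'docs_added': 0 result above.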

@noman1321

Please assign this issue to me.

@noman1321

@vijayproxima you can work around this in 3 ways, each using a different library to extract the tables.

1. tabula

import tabula

# Path to your PDF file
pdf_file = 'your_pdf_file.pdf'

# Extract tables from the PDF (all pages)
tables = tabula.read_pdf(pdf_file, pages='all', multiple_tables=True)

# Check how many tables were extracted
print(f'Total tables extracted: {len(tables)}')

# Export the tables to CSV files
for i, table in enumerate(tables):
    output_csv = f'output_table_{i}.csv'
    table.to_csv(output_csv, index=False)
    print(f'Table {i} saved to {output_csv}')

@noman1321

2. camelot

import camelot

# Path to your PDF file
pdf_file = 'your_pdf_file.pdf'

# Extract tables from all pages of the PDF
tables = camelot.read_pdf(pdf_file, pages='all')

# Check how many tables were extracted
print(f'Total tables extracted: {len(tables)}')

# Export each extracted table to a separate CSV file
for i, table in enumerate(tables):
    output_csv = f'output_table_{i}.csv'
    table.to_csv(output_csv)
    print(f'Table {i} saved to {output_csv}')

@noman1321

3. pdfplumber

import csv

import pdfplumber

# Path to your PDF file
pdf_file = 'your_pdf_file.pdf'

# Open the PDF with pdfplumber
with pdfplumber.open(pdf_file) as pdf:
    # page_number is 0-based here
    for page_number, page in enumerate(pdf.pages):
        # Extract tables from the page (each table is a list of rows)
        tables = page.extract_tables()

        # Check if any tables were found
        if tables:
            for i, table in enumerate(tables):
                # Save each table as a CSV
                output_csv = f'output_table_page_{page_number}_table_{i}.csv'

                # csv.writer quotes cells that contain commas and
                # writes empty (None) cells as empty fields
                with open(output_csv, 'w', newline='') as f:
                    csv.writer(f).writerows(table)

                print(f'Table {i} from page {page_number} saved to {output_csv}')
        else:
            print(f'No tables found on page {page_number}')
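
Extracted cells can contain commas (for example a date written as "January 26, 2024") or be empty (None), and a plain ','.join would split or mislabel such cells. A minimal sketch of writing a list-of-rows table with the standard csv module instead; write_table_csv is a hypothetical helper name, and the rows a caller passes in are assumed to have the list-of-lists shape returned by page.extract_tables():

```python
import csv


def write_table_csv(rows, path):
    """Write a table given as a list of rows to `path` as CSV.

    Cells containing commas or quotes are quoted by csv.writer,
    and None cells are written as empty fields.
    """
    # newline="" lets the csv module control line endings itself
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for row in rows:
            writer.writerow(["" if cell is None else str(cell) for cell in row])
```

Reading the file back with csv.reader then yields the same cells that were written, with each None replaced by an empty string.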

Please let me know if any of this is helpful for your repository.


2 participants