Not extracting any tables! #950

Open
SayeedAbid opened this issue Aug 8, 2024 · 0 comments

from llmware.library import Library
from llmware.retrieval import Query

def extract_pdf_tables(library_name):
    
    print("\nExample: Parsing PDF Documents and Extracting Tables")
    
    # Step 1 - create library
    lib = Library().create_new_library(library_name)
    
    # Step 2 - pull sample files
    sample_files_path = "./pdfs"
    
    # Step 3 - parse and extract all of the content from the PDF Documents
    parsing_output = lib.add_files(input_folder_path=sample_files_path)
    
    # Review the parsing output summary info - all of the text and table blocks are in Mongo collection
    print("Update: parsing_output - ", parsing_output)
    
    # Step 4 - export all of the content into .jsonl files with metadata
    output_fp = "./output_csv"
    print(f"Update: Step 4 - exporting all blocks into file path - {output_fp}")
    
    output1 = lib.export_library_to_jsonl_file(output_fp, f"{library_name}_export")
    
    # Step 5 - export all of the tables matching the query term as csv
    print(f"Update: Step 5 - exporting all tables into file path - {output_fp}")
    output2 = Query(lib).export_all_tables(query="Topline", output_fp=output_fp)
    
    return output2

if __name__ == "__main__":
    extract_pdf_tables("pdf_table_lib_example")
I have tried different query values ("politicians ", "elections", "Response"), but no tables are extracted with any of them. The parsing output reports:
Update: parsing_output -  {'docs_added': 1, 'blocks_added': 16, 'images_added': 1, 'pages_added': 3, 'tables_added': 0, 'rejected_files': []}
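
The summary above shows 'tables_added': 0, which suggests the parser did not detect any table blocks in the PDF, so the query value passed to export_all_tables may not be the deciding factor. A minimal sketch of a guard that checks this before attempting the export; the helper name `tables_were_detected` is hypothetical and not part of llmware, and the dict is copied from the output above:

```python
def tables_were_detected(parsing_output):
    """Return True if the parsing summary reports at least one table block."""
    # 'tables_added' is the key shown in the summary printed by the script above
    return parsing_output.get("tables_added", 0) > 0


# Summary dict copied from the run reported in this issue
parsing_output = {
    "docs_added": 1,
    "blocks_added": 16,
    "images_added": 1,
    "pages_added": 3,
    "tables_added": 0,
    "rejected_files": [],
}

if not tables_were_detected(parsing_output):
    print("No table blocks were parsed, so a table export has nothing to match.")
```

Calling this check between Step 3 and Step 5 would separate a table-detection problem at parse time from a query-matching problem at export time.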

I have attached the PDF and the exported JSON file.

survey.pdf
pdf_table_lib_example_export.json

Thanks.
