Develop a solution to extract specific content from documents with the primary goal of ensuring extraction accuracy and completeness. The solution must follow these requirements:
Data Extraction Workflow:
a. Extract the “page_begin” and “page_end” field values from the Excel document “02_024.xlsx”
b. Using these page numbers, extract the following from the PDF file “02_024.pdf”:
- From page_begin: Article title and author information
- From page_end: Author biography and author affiliation
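The two workflow steps above can be sketched as follows. This is a minimal illustration, not a definitive implementation. It assumes that the first worksheet holds a header row followed by one data row, that the page numbers are 1-based indices into the PDF, and that the PDF has a text layer (a scanned PDF would need OCR instead). The helper names are hypothetical. `openpyxl` and PyMuPDF (`fitz`) are third-party packages chosen here for illustration.

```python
def pick_page_fields(headers, values):
    """Return (page_begin, page_end) as ints from one header row and one data row."""
    # Normalise header names so that " Page_Begin" still matches.
    lookup = {
        str(h).strip().lower(): v
        for h, v in zip(headers, values)
        if h is not None
    }
    missing = [k for k in ("page_begin", "page_end") if lookup.get(k) is None]
    if missing:
        raise KeyError(f"missing fields: {missing}")
    # Excel often stores integers as floats (3.0), so convert explicitly.
    begin, end = int(lookup["page_begin"]), int(lookup["page_end"])
    if begin < 1 or end < begin:
        raise ValueError(f"invalid page range: {begin}-{end}")
    return begin, end


def read_page_range(xlsx_path):
    """Read the header row and first data row of the first worksheet (assumed layout)."""
    from openpyxl import load_workbook  # third-party: pip install openpyxl

    workbook = load_workbook(xlsx_path, read_only=True, data_only=True)
    sheet = workbook.worksheets[0]
    rows = sheet.iter_rows(min_row=1, max_row=2, values_only=True)
    headers, values = next(rows), next(rows)
    return pick_page_fields(headers, values)


def extract_page_text(pdf_path, page_number):
    """Return the text layer of a page, given its 1-based page number."""
    import fitz  # third-party: pip install pymupdf

    with fitz.open(pdf_path) as doc:
        return doc[page_number - 1].get_text("text")
```

A caller would run `read_page_range("02_024.xlsx")` and pass each returned number to `extract_page_text("02_024.pdf", n)`. Isolating the title, authors, biography, and affiliation within that text is a separate step.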
Technical Requirements:
a. Implement the solution in Python
b. Create and use a dedicated Anaconda virtual environment named “torch_gpu” with Python 3.12
c. Utilize GPU acceleration where possible, given the NVIDIA GeForce RTX 3080 GPU with CUDA 12.6 support
d. All program files must be installed in the “g:\work\apps” directory
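Once the environment exists (for example via `conda create -n torch_gpu python=3.12` followed by `conda activate torch_gpu`), a short check can confirm whether GPU acceleration is actually available before any approach relies on it. The sketch below assumes PyTorch is the GPU framework, which the environment name suggests but the requirements do not state. The function name is hypothetical.

```python
import importlib.util


def describe_gpu():
    """Report whether PyTorch is installed and whether it can see a CUDA device."""
    if importlib.util.find_spec("torch") is None:
        # PyTorch is not installed in this environment, so only CPU is usable.
        return {"torch_installed": False, "cuda": False, "device": "cpu"}

    import torch  # third-party; imported lazily so the check never crashes

    cuda = torch.cuda.is_available()
    return {
        "torch_installed": True,
        "cuda": cuda,
        # On the target machine this is expected to name the RTX 3080.
        "device": torch.cuda.get_device_name(0) if cuda else "cpu",
    }


if __name__ == "__main__":
    print(describe_gpu())
```

If `cuda` comes back `False` inside `torch_gpu`, the usual cause is a CPU-only PyTorch build rather than a driver problem.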
Available Resources:
a. Local LLM models via Ollama: deepseek-r1:1.5b, bge-m3:latest, qwen3-vl:latest, qwen3-vl:8b
b. Docker container with PaddleOCR-VL-1.5 running on port 8080
c. Existing software: Anaconda, OpenCV, OpenVINO, Visual Studio, Tomcat, Maven
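As an illustration of how one of the local Ollama models could turn raw page text into structured fields, the sketch below posts a prompt to Ollama's `/api/generate` endpoint and parses a JSON answer. It assumes Ollama is listening on its default address `http://localhost:11434`. The field names and helper functions are hypothetical, and `qwen3-vl:8b` is only one possible choice from the listed models. The PaddleOCR-VL container is not covered here because its request format is not specified in this document.

```python
import json
import urllib.request

# Ollama's default local endpoint; adjust if the server is configured differently.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(page_text, fields, model="qwen3-vl:8b"):
    """Build the JSON payload asking the model for exactly the given fields."""
    prompt = (
        "Extract the following fields from the page text and answer with a JSON "
        f"object whose keys are exactly {fields}. Use null for anything absent.\n\n"
        f"{page_text}"
    )
    # stream=False returns one complete reply; format="json" requests JSON output.
    return {"model": model, "prompt": prompt, "stream": False, "format": "json"}


def parse_reply(raw_body, fields):
    """Pull the requested fields out of Ollama's reply, defaulting to None."""
    answer = json.loads(json.loads(raw_body)["response"])
    return {name: answer.get(name) for name in fields}


def extract_fields(page_text, fields, model="qwen3-vl:8b", timeout=120):
    """Send the page text to the local model and return a field dictionary."""
    data = json.dumps(build_request(page_text, fields, model)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_reply(response.read(), fields)
```

Keeping payload construction and reply parsing in separate functions lets both be tested without a running model.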
Deliverables:
a. Provide 3-5 distinct implementation approaches
b. For each approach, clearly document:
- Detailed implementation steps
- Required libraries and dependencies
- Exact commands for environment setup
- Advantages, with specific technical justifications
- Disadvantages, with specific technical limitations
c. Rank the approaches by extraction accuracy first, with performance and resource utilization as secondary considerations
Evaluation Criteria:
a. Extraction accuracy (primary): 100% correct identification of all required fields
b. Completeness: No missing information from the specified page regions
c. Reliability: Consistent performance across multiple test runs
d. Resource efficiency: Optimal use of GPU and system resources
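Criteria b and c lend themselves to automated checks. The sketch below assumes each extraction run yields a dictionary of fields. The key names in `REQUIRED_FIELDS` and both helper names are hypothetical. Accuracy (criterion a) still requires comparison against manually verified values, which this sketch does not attempt.

```python
# Hypothetical key names for the four required fields.
REQUIRED_FIELDS = ("title", "authors", "biography", "affiliation")


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or empty in one result."""
    return [name for name in required if not str(record.get(name) or "").strip()]


def is_consistent(runs):
    """True when every run produced the same record, ignoring whitespace differences.

    Expects a non-empty list of dictionaries, one per test run.
    """
    def normalise(record):
        # Collapse runs of whitespace so line-wrapping differences do not count.
        return {key: " ".join(str(value).split()) for key, value in record.items()}

    first = normalise(runs[0])
    return all(normalise(run) == first for run in runs[1:])
```

An approach would satisfy criterion b when `missing_fields` returns an empty list, and criterion c when `is_consistent` holds across repeated runs on the same input.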
Present the approaches in a structured format, with clear headings and enough technical detail to enable straightforward implementation and comparison.