282 conversations
Extracting, transforming, and structuring information from documents at scale.
All Conversations (188)
test-seed
Mordechai is refining his GitHub profile strategy to better embody the 'Bottleneck Principle'. The core idea is to showcase the journey from a large number of private, raw repos to a curated set of public, simplified tools and frameworks. This approach aims to demonstrate depth without overwhelming users, highlighting a clear and smooth path for engagement. The focus is on the process of sanitizat
CSV File Download
The user is seeking a robust Python script to process scanned 8850 forms. The primary goal is to programmatically identify exactly seven checkboxes on each form, determine if they are marked or empty, and output this information. Previous attempts using various image processing techniques (thresholding, adaptive thresholding, contour detection) have been inconsistent, often detecting too many or t
Untitled
The conversation revolves around building a PDF extraction lab system. Key decisions include leveraging Supabase for database and real-time features, and GitHub for version control. The system aims to dynamically load Python code sections from Supabase, process PDFs using various AI models (like Claude and Gemini via OpenRouter), and display real-time analytics on a dashboard. Iterative improvemen
help me plan a few tables for my extraction lab in the db i want to user...
Mordechai is refining a dynamic PDF extraction system. The system now successfully connects to Supabase, loads Python modules from the database, and executes a 4-step extraction pipeline. He's focused on improving UI styling, fixing component issues, and ensuring real-time monitoring and cost tracking are functional. The system is confirmed to be using real API calls and is being tested with actua
Calculate invoice: 17% VAT.
The script aims to process PDF files based on data from an Excel file. It sets up a directory structure, reads an Excel file named 'kobi2.xls' (specifically the '2023' sheet), and iterates through its rows. For each row, it checks for the existence of a PDF file corresponding to the 'PMT_ID' and, if found, moves it to a newly created 'Email' subfolder.
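A minimal sketch of the move step described above, not the original script. It assumes each PDF is named `<PMT_ID>.pdf` and sits directly in the working folder; the function name `move_matching_pdfs` and its parameters are hypothetical. The IDs would come from the 'PMT_ID' column of the '2023' sheet, for example via `pandas.read_excel("kobi2.xls", sheet_name="2023")`, which is left out here so the block runs on the standard library alone.

```python
import shutil
from pathlib import Path


def move_matching_pdfs(pmt_ids, source_dir, subfolder="Email"):
    """Move <PMT_ID>.pdf files from source_dir into source_dir/subfolder.

    The file-naming scheme is an assumption; the summary only says the PDF
    "corresponds to" the PMT_ID.
    """
    source = Path(source_dir)
    target = source / subfolder
    # Create the 'Email' subfolder if it does not exist yet.
    target.mkdir(parents=True, exist_ok=True)

    moved = []
    for pmt_id in pmt_ids:
        pdf = source / f"{pmt_id}.pdf"
        # Rows without a matching PDF are skipped silently.
        if pdf.is_file():
            shutil.move(str(pdf), str(target / pdf.name))
            moved.append(pdf.name)
    return moved
```

Returning the list of moved file names makes it easy to compare against the spreadsheet rows and see which PMT_IDs had no PDF.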
Convert PDF on Mac
Mordechai is working on extracting and reformatting text from a PDF document. The initial goal was to convert the PDF to HTML while preserving formatting for app integration. This involved several attempts with different libraries (PyMuPDF, pdfminer) due to conversion interruptions and formatting challenges. Subsequently, the focus shifted to identifying and labeling specific sections within the d
Untitled
The conversation focused on evaluating various Large Language Models (LLMs) for PDF processing tasks via the OpenRouter API. Initial tests revealed significant issues with hallucinated PII across multiple models when using standard PDF text extraction. A breakthrough occurred when implementing the Mistral OCR engine, which enabled models like Gemini 2.5 Flash to accurately extract real data. Furth
PDF to PNG Conversion
Mordechai is working on a script to automate the creation of PDFs from image pairs. The primary goal is to extract first and last names from the images using OCR (Tesseract) and then use these names to name the generated PDF files in the format {last_name}_{first_name}.pdf. The process has involved several iterations to correct errors in API usage (OpenAI), name extraction logic, and PDF generatio
Remove Images from .docx
The user is working on a Python script to extract structured data from a PDF file, specifically focusing on 'Day' sections, their titles, and subsequent content broken down into subheadings and body paragraphs. The goal is to convert this into a JSON format for a web application. Initial attempts with PyMuPDF and regex patterns have been made, but the script is not producing the expected output, l
NLP Platforms for WebApps
The user requested a PDF with an A4 page divided into 10 equal columns and 4 equal rows, with light grey dotted lines for cutting. The goal is to be able to cut an A2 page into 40 equal rectangles using the entire page. The user then refined the request to specify that the dotted line should have a dot every 4th character.
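The grid geometry in that request can be worked out with a few lines of arithmetic. The sketch below assumes a portrait A4 page (210 mm by 297 mm) and computes only the positions of the interior cut lines; the function name `cut_lines` is hypothetical, and drawing the dotted lines into a PDF is left to whichever PDF library is used.

```python
A4_WIDTH_MM = 210.0
A4_HEIGHT_MM = 297.0


def cut_lines(width, height, cols, rows):
    """Return (xs, ys): positions of the interior vertical and horizontal cut lines."""
    # cols columns need cols - 1 interior vertical lines; likewise for rows.
    xs = [width * i / cols for i in range(1, cols)]
    ys = [height * j / rows for j in range(1, rows)]
    return xs, ys


xs, ys = cut_lines(A4_WIDTH_MM, A4_HEIGHT_MM, cols=10, rows=4)
# Each of the 40 rectangles is (width / cols) by (height / rows) millimetres.
cell_w, cell_h = A4_WIDTH_MM / 10, A4_HEIGHT_MM / 4
```

With these assumptions there are 9 vertical and 3 horizontal cut lines, and each rectangle measures 21 mm by 74.25 mm.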
Untitled
The conversation focused on developing a Python script to parse Hebrew rental listing PDFs using Google Gemini 2.5 Flash-Lite via the OpenRouter API. Initial attempts involved direct PDF processing, but issues arose with OpenRouter's PDF handling and neighborhood extraction. The team explored different API formats, prompt engineering, and error handling to ensure comprehensive extraction from all
Untitled
The user requested a Python script to accurately transcribe Hebrew text from 34 sefer images. The process involved setting up an OCR system, handling HEIC to JPEG conversions, implementing preprocessing steps, and a systematic manual review to ensure 100% accuracy. The assistant successfully developed and executed a script that processed all images, converted necessary formats, and produced a comp
help me organise my takeout file it mixed everything around
The conversation focused on organizing the user's Google Takeout folder, which was a mess of mixed files. The assistant systematically created a structured directory, moved photos and videos to dedicated locations, and then addressed the remaining files in the original TAKEOUT folder. Decisions were made to move documents, metadata, and remove development-specific files, with ongoing efforts to re
Extract Day Sections Accurately
The user wants to extract structured content from a document, focusing on identifying and separating bold headings from regular text paragraphs within each 'Day' entry. The goal is to associate these extracted sections with the already identified day number and title, ignoring any large bold text that might appear directly below the day number if a title is already present. This exploration aims t
Discussion 97341d3d
The user is attempting to set up and run a Python script for processing WOTC PDFs using Claude 4. Initial attempts to install packages encountered an 'externally-managed-environment' error, leading to the decision to use a virtual environment. The test script execution failed due to a missing ANTHROPIC_API_KEY. The user also provided feedback to focus on a lightweight app for WOTC forms and to rem
Discussion 64616402
The user requested to download and convert 'הגאון מווילנה על משלי' from Wikisource into Markdown format. Initial attempts involved converting EPUB and MOBI files using pandoc, which were successful. PDF conversion proved more challenging, requiring the use of `pdftotext` with the `-layout` option and a custom Python script (`enhance_hebrew_pdf.py`) for better formatting and Hebrew text handling. A
Discussion 23f74c82
The user requested to convert various file formats (EPUB, MOBI, PDF) of a Hebrew religious text ('הגאון מווילנה על משלי') into Markdown. Initial attempts with `pandoc` for EPUB to MD were successful. PDF conversion proved more challenging, requiring a combination of `pdftotext -layout` for text extraction and a custom Python script for enhancement and cleaning. The process of downloading chapters
Discussion 00d023f5
The user requested to download and convert Hebrew texts from Wikisource to Markdown. Initial attempts to download chapters using a Python script failed due to errors. Pandoc was found to be unable to convert directly from PDF, leading to the use of `pdftotext` with a `-layout` option for better text extraction. An enhancement script was developed to clean the extracted text. The process involved i
Discussion a7693751
The process of downloading and converting 'הגאון מווילנה על משלי' from Wikisource into Markdown format has been initiated. Initial attempts focused on converting existing EPUB and MOBI files to Markdown using pandoc, which were successful. PDF conversion proved more challenging, requiring the use of `pdftotext` with layout preservation. A Python script was developed to automate the download of cha
Discussion bed79f52
The user requested to download all chapters of 'הגאון מווילנה על משלי' from Wikisource into Markdown format. Initial attempts involved direct PDF to Markdown conversion using pandoc, which failed due to pandoc's inability to read PDFs. `pdftotext` was then employed, with variations like `-layout` for better preservation of structure. A Python script was developed to automate the download of availa
Discussion f3ba9bc7
The user wants to download and convert all chapters of 'הגאון מווילנה על משלי' from Wikisource into Markdown format. I've identified the main page and a list of available chapters. I've started downloading them systematically, beginning with chapter א. I've also encountered issues with automated download scripts and am now resorting to manual fetching and conversion for each chapter.
Discussion 05c66c85
The user is attempting to convert various formats (EPUB, MOBI, PDF) of 'הגאון מווילנה על משלי' into Markdown. Initial attempts with Pandoc for EPUB to Markdown were successful. PDF conversion proved more challenging, requiring the use of `pdftotext` with layout options. There were also issues with downloading chapters from Wikisource using a Python script, necessitating a shift to `curl` and `pand
Discussion 613c0f7e
The user is trying to set up and run a Python script (`test_wotc_pdfs.py`) for WOTC PDF processing using Claude 4. The process encountered issues with package installation due to an 'externally-managed-environment' error, requiring the creation and activation of a virtual environment. The script also requires the `ANTHROPIC_API_KEY`, which was initially missing and then attempted to be provided vi
Discussion 7e48a107
The user is working on converting Hebrew religious texts, specifically 'הגאון מווילנה על משלי', from various formats (PDF, EPUB, MOBI) into Markdown. Initial attempts using `pandoc` for PDF conversion failed as `pandoc` cannot convert from PDF. The assistant pivoted to using `pdftotext -layout` for PDF extraction, and `pandoc` for EPUB and MOBI conversions. There were also attempts to systematical
Discussion 411fb9a4
The user initiated a task to convert various file formats (EPUB, PDF, MOBI) of 'הגאון מווילנה על משלי' into Markdown. Initial attempts involved `pandoc` for EPUB/MOBI and `pdftotext` for PDF. Challenges arose with PDF conversion, leading to exploration of `pdftotext -layout` and brainstorming for advanced Hebrew PDF parsing. A systematic approach for downloading chapters from Wikisource was develo
Discussion fc1c1a1d
Mordechai is exploring methods to convert various Hebrew text formats (EPUB, MOBI, PDF) to Markdown, with a focus on the work 'הגאון מווילנה על משלי'. Initial attempts using `pandoc` for EPUB and MOBI were successful, yielding Markdown files. PDF conversion proved more challenging, with `pandoc` failing to convert from PDF and `pdftotext` being used as an alternative. An attempt to download all ch
Structured Outputs for Robust AI Applications
The conversation focuses on integrating OpenAI's vision capabilities with Python scripting to extract information from form images. Initial attempts to use structured outputs and function calling faced challenges with model output formatting and parameter compatibility. The process involved iterative debugging, prompt refinement, and code updates to handle missing modules, incorrect model usage, a
PDF Text Extraction Challenge
The user is encountering persistent issues with extracting Last Name and Receipt Number from PDF documents, even after implementing OCR and attempting to refine regular expressions. The script is successfully renaming some files but failing on others, indicated by 'Could not extract info' messages. The current focus is on diagnosing and fixing the regular expression patterns to match the diverse f
Discussion 5abf1dfa
The user requested conversion of various file formats (EPUB, MOBI, PDF) to Markdown, and downloading content from Wikisource. The EPUB to Markdown conversion was successful using pandoc. PDF conversion proved challenging, leading to the development and application of an improved strategy using `pdftotext -layout`. Downloading content from Wikisource was attempted with a Python script, which failed
Discussion e27a2689
Mordechai is exploring the technical setup for processing WOTC test PDFs using Claude 4. Initial attempts to run a test script failed due to missing environment variables (ANTHROPIC_API_KEY) and Python installation issues. The assistant is guiding Mordechai through setting up the environment, installing necessary packages (anthropic, python-dotenv), and correctly executing the test script. The goa