At my previous job, I spent the last 3 years building, deploying, and improving a document conversion service that, as the name implies, converted many types of input documents (image files, PDFs, etc.) into a structured XML format. It was the first step in various machine learning and natural language processing (NLP) pipelines, and was effectively an OCR service on steroids, somewhat comparable to AWS Textract.
The service was distributed across a few backend API nodes that communicated with worker nodes via Apache Kafka.
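To make that architecture concrete, here is a minimal sketch of the API-node/worker-node split. It is not the original service's code: an in-process `queue.Queue` stands in for the Kafka topic, and the names `submit_document`, `worker`, and `run_demo` are hypothetical, as is the placeholder XML the worker emits.

```python
import queue
import threading
import uuid
from xml.sax.saxutils import quoteattr

# Assumption: an in-process queue stands in for a Kafka topic.
jobs = queue.Queue()
# Assumption: a plain dict stands in for wherever results would be stored.
results = {}


def submit_document(filename):
    """API-node side: enqueue a conversion job and return its ID."""
    job_id = str(uuid.uuid4())
    jobs.put({"job_id": job_id, "filename": filename})
    return job_id


def worker():
    """Worker-node side: consume jobs until a None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:
            break
        # Placeholder for the real conversion step (OCR, layout analysis, ...).
        results[job["job_id"]] = f"<document source={quoteattr(job['filename'])}/>"


def run_demo(filenames):
    """Start one worker, submit every file, and return results in order."""
    thread = threading.Thread(target=worker)
    thread.start()
    job_ids = [submit_document(name) for name in filenames]
    jobs.put(None)  # Sentinel: tell the worker to stop once the queue drains.
    thread.join()
    return [results[job_id] for job_id in job_ids]


if __name__ == "__main__":
    for xml in run_demo(["scan.png", "report.pdf"]):
        print(xml)
```

The point of the queue in the middle is that the API side returns a job ID immediately and never waits on the conversion itself, so workers can be scaled independently of the API nodes.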
Rebuilding a simplified version of this service will make a good demo project for my portfolio and will help me keep learning and improving my skills. Specifically, I will use this as an opportunity to explore FastAPI. I'll initially deploy the service in AWS, since we deployed the original in an on-prem cloud at State Street. I'll explore deploying via Terraform as well. The previous service used proprietary third-party OCR software, but we'll use the open source (and excellent) poppler-utils.
Users will need only make and docker installed on their local system.
Future posts will describe components (API, database, testing, etc.) in detail.
All code will be open and available at https://github.com/jschaub30/demo-conversion-service under the MIT license.