Project requirements for a distributed cloud app

At my previous job, I spent the last 3 years building, deploying and improving a document conversion service that, as the name implies, converted many types of input documents (image files, PDFs, etc.) into a structured XML format. It was the first step in various machine learning and natural language processing (NLP) pipelines: effectively an OCR service on steroids, somewhat comparable to AWS Textract.

The service was distributed across a few backend API nodes that communicated with worker nodes via Apache Kafka.
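As a rough illustration of that split, here is a minimal sketch of the kind of job message an API node might publish for a worker to consume. The `ConversionJob` class and all of its field names are hypothetical, chosen only for this example; the original service's message format is not described here, and the Kafka producer and consumer calls are deliberately left out.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class ConversionJob:
    """Hypothetical job message; field names are assumptions for illustration."""
    document_id: str
    user_id: str
    output_formats: list[str]  # e.g. ["text", "xml"]
    job_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def to_bytes(self) -> bytes:
        # Message brokers carry raw bytes, so serialize to UTF-8 encoded JSON.
        return json.dumps(asdict(self)).encode("utf-8")

    @classmethod
    def from_bytes(cls, payload: bytes) -> "ConversionJob":
        # A worker would call this on each message it consumes.
        return cls(**json.loads(payload.decode("utf-8")))
```

An API node would call `to_bytes()` before publishing and a worker would call `from_bytes()` after consuming, so both sides agree on a single schema.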

Rebuilding a simplified version of this service will make a good demo project for my portfolio and will help me keep learning and improving my skills. Specifically, I will use it as an opportunity to explore FastAPI. I'll initially deploy the service in AWS, in contrast to the original, which we deployed in an on-prem cloud at State Street. I'll explore deploying via Terraform as well. The previous service used proprietary third-party OCR software, but this version will use the open source (and excellent) poppler-utils.

Functional requirements

  1. Allow a new user to register for the service.
  2. Users can upload new documents in common image and PDF formats.
  3. Users can submit jobs where previously uploaded documents are converted into plain text and/or structured XML format.
  4. Users can view the status of previously submitted jobs.
  5. Input and output documents are retained for a configurable period (e.g. 7 days).
  6. A user cannot read other users' documents, but a user can share documents with another user.
  7. To keep costs manageable and predictable, we limit:
    • number of active users
    • size of input documents
    • jobs/day/user
    • total pages/day/user
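Requirements 5 through 7 could be captured in a small configuration object plus a few checks. The following is only a sketch under stated assumptions: the names (`Limits`, `DailyUsage`, `job_allowed`, `can_read`, and so on) are hypothetical, and every default value is a placeholder rather than a decided limit.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Limits:
    """Hypothetical limits; every default below is a placeholder value."""
    max_active_users: int = 100
    max_upload_bytes: int = 10 * 1024 * 1024
    max_jobs_per_day: int = 20
    max_pages_per_day: int = 200
    retention: timedelta = timedelta(days=7)


@dataclass
class DailyUsage:
    """Jobs and pages a single user has consumed so far today."""
    jobs: int = 0
    pages: int = 0


def upload_allowed(size_bytes: int, limits: Limits) -> bool:
    """Reject input documents larger than the configured size limit."""
    return size_bytes <= limits.max_upload_bytes


def job_allowed(usage: DailyUsage, pages: int, limits: Limits) -> bool:
    """Accept a job only if it keeps the user within both daily limits."""
    return (usage.jobs + 1 <= limits.max_jobs_per_day
            and usage.pages + pages <= limits.max_pages_per_day)


def is_expired(stored_at: datetime, now: datetime, limits: Limits) -> bool:
    """A document is eligible for deletion once the retention period has passed."""
    return now - stored_at > limits.retention


def can_read(user_id: str, owner_id: str, shared_with: set[str]) -> bool:
    """Only the owner and users the document was shared with may read it."""
    return user_id == owner_id or user_id in shared_with
```

Keeping the limits in one immutable object would let a local deployment and the cloud deployment use different values without touching the checks themselves.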

Non-functional requirements

  1. The final service will be deployed in the cloud, but a reasonably skilled developer can run a local version of the service if they have make and Docker installed on their local system.
  2. For the cloud service, the system should be scalable. Submitting 5 simultaneous jobs should take about as long as submitting 1 job, if the input documents are similar. A local deployment need not be scalable.
  3. Availability and latency are less important. This is a free service, after all.

Future posts will describe components (API, database, testing, etc.) in detail.

All code will be open and available under the MIT license.