### Tags:
#AI #AI/OCR #OCR #RomanCoins #Digital_Humanities #DigitalProcess #Digital_Archives #Digital_Archaeology #Python
---
# What I was trying to do
This was my first attempt at performing OCR on a PDF using Google Cloud Vision, guided by Gemini 2.5 Pro in Google AI Studio.
- [Google Vision OCR Layout](https://aistudio.google.com/prompts/1IV-O-ld8vbxTtT-gymcf1Gf0aFWpOJ9n)
## What I did
>[! note]
>I first opened a Gemini chat and prompted the AI for options to perform OCR on PDF files that I accessed via the Internet Archive. These PDFs are scanned copies of 16th- to 19th-century printed manuscripts focused on Roman history and coins. My objective is to test various approaches to OCR in order to establish a streamlined process that will let me assess many PDFs of scanned manuscripts without converting each page into JPG or PNG files. After prompting the AI, I chose the `DOCUMENT_TEXT_DETECTION` feature of the Google Cloud Vision (GCV) API. I then prompted the AI with the following: `can you provide step by step instructions on how to use document_text_detection`. Gemini returned the following:
>[! Caution] **Phase 1: Prerequisites (One-time Setup)**
>1. **Create a Google Cloud Project (if you don't have one):**
> - Go to the [Google Cloud Console](https://www.google.com/url?sa=E&q=https%3A%2F%2Fconsole.cloud.google.com%2F).
> - Create a new project or select an existing one.
>2. **Enable Billing for your Project:**
> - In the Cloud Console, navigate to "Billing."
> - Ensure your project is linked to an active billing account. (You get a free tier, but billing must be enabled).
>3. **Enable the Cloud Vision API:**
> - In the Cloud Console, go to "APIs & Services" > "Library."
> - Search for "Cloud Vision API" and click "Enable."
>4. **Create a Service Account and Download Credentials:**
> - This is the recommended way for applications to authenticate.
> - Go to "IAM & Admin" > "Service Accounts."
> - Click "+ CREATE SERVICE ACCOUNT."
> - Give it a name (e.g., "vision-api-user") and an optional description. Click "CREATE AND CONTINUE."
> - **Grant Role:** For the Vision API, the role "Cloud Vision AI User" (or a more permissive one like "Editor" if you're testing, but "Cloud Vision AI User" is more secure and sufficient) is needed. Select it and click "CONTINUE."
> - Click "DONE" (you can skip granting user access to this service account for now).
> - Find your newly created service account in the list. Click the three dots (Actions) next to it, then "Manage keys."
> - Click "ADD KEY" > "Create new key."
> - Choose **JSON** as the key type and click "CREATE."
> - A JSON file will be downloaded. **Keep this file secure!** It contains your credentials. Let's say you save it as your-project-credentials.json.
>5. **Set Up Your Local Environment:**
> - **Install Python:** Ensure you have Python 3.6+ installed.
> - **Install Google Cloud Client Library for Python:**
>```
>pip install google-cloud-vision
>```
>
>- **Set Environment Variable for Authentication:** Tell the client library where to find your credentials JSON file.
>- **Linux/macOS:**
>```
>export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your-project-credentials.json"
>```
>(Add this line to your .bashrc, .zshrc, or equivalent shell startup file to make it permanent for your terminal sessions.)
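Once the key file is in place, it is worth confirming that it is where the client library will look before running any OCR code. A minimal stdlib-only sketch (the `check_credentials` helper is my own, not part of the Google client library; it only checks that the file exists and looks like a service-account key):

```python
import json
import os

def check_credentials(path=None):
    """Return True if the key file looks like a valid service-account key.

    `path` is a hypothetical helper parameter for testing; by default the
    function reads GOOGLE_APPLICATION_CREDENTIALS, the same variable the
    Google client libraries consult.
    """
    path = path or os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path or not os.path.isfile(path):
        return False
    with open(path) as f:
        key = json.load(f)
    # A downloaded JSON key identifies itself as a service account
    # and names the project it belongs to.
    return key.get("type") == "service_account" and "project_id" in key

if __name__ == "__main__":
    print("Credentials OK:", check_credentials())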
>[! note]
> The steps Gemini 2.5 Pro provided for setting up my Google Cloud Console (GCC) were easy to follow and execute. I set up my GCC before exploring the Python code that Gemini provided to perform the OCR and output the data as JSON files. The initial code did not work, and it took numerous prompts to arrive at working code. The final code needs to be cleaned up and will be posted via GitHub soon. It is provided below under the ### Code header.
> After Gemini had produced the initial code, it provided step-by-step instructions on how to set up a [[`Bucket`]] in the GCC. I found this confusing: Gemini should have provided these steps before providing the code, since the bucket must exist before the code is executed (the code needs the paths for the input and output folders). The bucket is named `coin_ocr_bucket1`, with subfolders `coin_ocr_results`, `input_manuscripts`, and a test folder, `Output_results`, to differentiate between the two manuscripts I tested the code on. The code provided by Gemini went through four iterations (I think; sections of the chat were lost) in VS Code. A virtual environment was established to avoid dependency conflicts.
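For reference, the asynchronous PDF request that `DOCUMENT_TEXT_DETECTION` boils down to can be sketched as a plain JSON body (this assumes the shape of Vision's `files:asyncBatchAnnotate` REST endpoint; the bucket and folder names mirror my setup, while `manuscript1.pdf` is a placeholder file name):

```python
import json

# Sketch of the request body for Vision's files:asyncBatchAnnotate
# endpoint: read a PDF from the bucket's input folder, write JSON
# results back to the results folder. "manuscript1.pdf" is a
# placeholder file name.
request_body = {
    "requests": [{
        "inputConfig": {
            "gcsSource": {"uri": "gs://coin_ocr_bucket1/input_manuscripts/manuscript1.pdf"},
            "mimeType": "application/pdf",
        },
        "features": [{"type": "DOCUMENT_TEXT_DETECTION"}],
        "outputConfig": {
            "gcsDestination": {"uri": "gs://coin_ocr_bucket1/coin_ocr_results/"},
            "batchSize": 20,  # pages bundled into each output JSON file
        },
    }]
}

if __name__ == "__main__":
    print(json.dumps(request_body, indent=2))
```

The Python client library (`google-cloud-vision`) builds the same structure with typed objects, but seeing the raw shape makes it clear why the bucket and its folders must exist before the code runs.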
## Challenges
The main challenge was getting the code to execute correctly. At times the AI was not clear on whether I needed to address issues in the terminal I was in or in a new terminal session. The AI determined that the errors occurred because the .zshrc shell configuration was set up incorrectly, partly due to the Anaconda environment conflicting with the 'regular' Python 3 environment. The code could not authenticate the Google Application Credentials (GAC), and .zshrc needed to be modified using `nano`. The initial attempts to fix the code and credential issues were problematic because `nano` was not running properly; Gemini provided alternative solutions to work around the `nano` problems, which eventually allowed the GAC to be recognized. Unfortunately, I do not have the error messages or scripts due to an internet issue: the scripts were not saved in the Gemini chat, where I was logging them and troubleshooting the errors, and the outage cost me about half of my documentation. The final code is saved in my Documents folder and will be deposited into GitHub. Total token use was over 61,000.
## Thoughts on where to go next
My initial observations of the JSON outputs for both manuscripts are good but not great. The OCR did not pick up the Latin long s (ſ) in these manuscripts and instead output it as 's'. I have not confirmed any other issues at this time; I need to extract the texts from the JSON files and clean them up to confirm whether the OCR missed other Latin forms. The results appear promising, but I am unsure if this is the right route for OCR, as the image quality of the PDFs is inconsistent. However, converting each PDF into JPG or PNG files seems tedious, and I am unsure it would produce satisfactory results. The best results I have received came from depositing a JPG image directly into Gemini and asking it to perform OCR; see [[OCR and Google AI Studio 1]] for the output. Next, I will try [OCR with Google Vision API and Tesseract | Programming Historian](https://programminghistorian.org/en/lessons/ocr-with-google-vision-and-tesseract#prerequisites) and see if the results are better.
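As a first pass at that clean-up, each output file Vision writes holds a `responses` list whose entries carry a `fullTextAnnotation.text` field. A stdlib sketch that pulls the page text together and tallies non-ASCII characters, a quick way to spot any surviving ſ, ligatures, or other Latin forms (the sample text in the demo is fabricated):

```python
import json
from collections import Counter

def extract_text(output_json: dict) -> str:
    """Concatenate the page text from one Vision output JSON file."""
    pages = []
    for resp in output_json.get("responses", []):
        annotation = resp.get("fullTextAnnotation", {})
        pages.append(annotation.get("text", ""))
    return "\n".join(pages)

def non_ascii_counts(text: str) -> Counter:
    """Tally characters outside ASCII, e.g. a surviving long s (ſ) or æ."""
    return Counter(ch for ch in text if ord(ch) > 127)

if __name__ == "__main__":
    # Tiny fabricated sample; in practice, load each JSON file Vision
    # wrote to coin_ocr_results/ with json.loads(Path(...).read_text()).
    sample = {"responses": [{"fullTextAnnotation": {"text": "ſenatus populuſque Romæ"}}]}
    print(non_ascii_counts(extract_text(sample)))
```

If the tallies come back empty for a manuscript that clearly uses the long s, that confirms the OCR normalized it to 's' throughout rather than missing it only occasionally.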
### Screenshots
*Google Bucket and associated folders*
![[Google Cloud Storage OCR.png | 750]]
*Gemini Output and Token Count*
![[Gemini 2.5 Screenshot.png | 750]]
*OCR error due to .zshrc Shell*
![[Unexpected Error OCR.png | 750]]
*OCR Output in JSON with prompt to locate files.*
![[Output of OCR.png | 750]]
### Code