October 10th, 2023
# Exploring Vertex AI, SageMaker, & Colab
## Vertex AI
Training a model in Google Vertex AI requires
1. configuring a Vertex workspace,
2. uploading data to a Bucket,
3. accessing the data from Vertex, and
4. using sufficient acceleration.
### Loading in Data with Buckets
The first obstacle I encountered was how to upload my image data (approximately 4k JPEG images around 500x400 px each) to Google Cloud. The web console froze when I tried to upload the folder, so instead I used the [CLI](https://cloud.google.com/sdk/docs/install).
```
gsutil cp -r <local folder path> gs://<bucket name>
```
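For what it's worth, the same upload can also be scripted from Python with the `google-cloud-storage` client; this is just a sketch with placeholder bucket and folder names (gsutil was all I actually needed):
```
# Sketch: upload a local folder of JPEGs with the google-cloud-storage client.
# Bucket and folder names are placeholders; auth comes from gcloud's default credentials.
from pathlib import Path
from google.cloud import storage

client = storage.Client()
bucket = client.bucket('my-image-bucket')      # placeholder bucket name

for path in Path('images').glob('*.jpg'):      # placeholder local folder
    blob = bucket.blob(f'images/{path.name}')  # destination object name in the bucket
    blob.upload_from_filename(str(path))
```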
### GPU Availability Woes
Unfortunately I could not get the GPU resources I needed in any region or instance configuration, although cross-referencing these two pages did help me get close:
- [Platforms <-> GPUs](https://cloud.google.com/compute/docs/gpus/#gpus-list)
- [GPUs <-> Regions](https://cloud.google.com/compute/docs/gpus/gpu-regions-zones)
## SageMaker
After no luck with Vertex, the next stop was AWS's offering: SageMaker Studio in conjunction with S3 (cloud storage). The main steps were
1. uploading data to an S3 bucket using the AWS CLI,
2. making the data accessible from a SageMaker notebook, and
3. accessing sufficient GPU resources.
### Data Wrangling Part 1: Local to S3
This included
1. setting up the AWS CLI on my local machine, and
2. creating an IAM user and IAM user key.
I installed the CLI using `curl`, and the install script added `aws` to my PATH with no trouble. Running `aws configure` then prompted me for an AWS Access Key ID and Secret Access Key.
To obtain these, I went to the IAM console in AWS, created a new IAM user, and then created an access key for that user.
This permitted me to use `aws s3 cp <local file> "s3://<destination-bucket>"` (with `--recursive` to copy a whole folder) to upload the images to my Amazon S3 bucket.
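The upload can also be scripted with `boto3` instead of the CLI; again, just a sketch with placeholder names, reusing the credentials that `aws configure` saved:
```
# Sketch: the same upload via boto3 instead of the CLI.
# Bucket and folder names are placeholders; credentials come from `aws configure`.
from pathlib import Path
import boto3

s3 = boto3.client('s3')
for path in Path('images').glob('*.jpg'):      # placeholder local folder
    s3.upload_file(str(path), 'my-destination-bucket', f'images/{path.name}')
```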
### Part 2: S3 to SageMaker Studio Notebook Instance
I didn't end up using the SageMaker tool of the same name (Data Wrangler) to stage my data. Instead I used `sagemaker.s3.S3Downloader` to copy the files from S3 into the SageMaker runtime environment, so I could work with them locally (which is faster than reading from S3 each time). The copy took a few minutes and went off without a hitch. But if you get an error like `no file or folder s3://<bucket-name>/folder`, you probably need to grant your instance's execution role full S3 access (e.g., by attaching the AmazonS3FullAccess policy). Have fun with that 🫡.
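The call is roughly this shape (bucket name and local path are placeholders):
```
# Sketch: copy a prefix from S3 into the notebook's local filesystem.
from sagemaker.s3 import S3Downloader

S3Downloader.download(
    s3_uri='s3://<bucket-name>/images',  # source bucket/prefix
    local_path='./images',               # local dir inside the SageMaker runtime
)
```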
### The Achilles Heel: Resource Quotas
Once again the showstopper here was getting a GPU. Quotas are tracked per instance type (and per usage type), and each running instance counts against the quota; I found myself with no remaining quota for any GPU machine type. I requested an increase to 1, which was quickly granted, but my subsequent request is still pending as of 8:42 pm. Also keep an eye on which region you are requesting and using resources in, and stick to the same one, or you might notice instances and storage seemingly disappearing.
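If you'd rather check your remaining limits programmatically than dig through the console, `boto3`'s Service Quotas client can list them. I just used the console; this is only a sketch, and the instance-type filter is illustrative:
```
# Sketch: list SageMaker quotas (e.g. GPU instance limits) with boto3's
# Service Quotas client. Region is a placeholder.
import boto3

client = boto3.client('service-quotas', region_name='us-east-1')
paginator = client.get_paginator('list_service_quotas')
for page in paginator.paginate(ServiceCode='sagemaker'):
    for quota in page['Quotas']:
        name = quota['QuotaName']
        if 'ml.g' in name or 'ml.p' in name:   # rough filter for GPU instance types
            print(name, quota['Value'])
```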
## Google Colab
This was by far the quickest setup of all. Of course, that's not by accident on the developers' part. This came down to
1. creating a notebook,
2. getting my data via Google Drive,
3. copying it to the instance, and (since I finally succeeded on this one)
4. turning on the GPU and modifying my code to use it.
***WARNINGS***
- Colab does not save your code. Download it frequently if you're not on the Pro tier.
- It does not save your instance files after closing out a notebook (even if you immediately navigate back to it); one workaround is sketched at the end of Setup below.
- It will yell at you for having too many instances open.
Be prepared to contort yourself a bit at first dealing with these restrictions (it's designed to be annoying to get you to buy Pro), and *don't lose your stuff*. Anyway...
### Setup
I was able to simply create a new notebook from the .ipynb on my local machine, then upload my data to a Google Drive folder.
From there, I mounted the Drive folder in my Colab instance and copied the files to the instance with
```
from google.colab import drive
drive.mount('/content/drive')             # prompts for Drive authorization on first run

import shutil

src = '/content/drive/MyDrive/images'     # folder uploaded to Drive
dest = '/content/images'                  # local path on the Colab instance
destination = shutil.copytree(src, dest)  # returns the destination path
```
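Since the mounted Drive folder persists between sessions (unlike the instance's `/content` disk), writing anything you care about there, e.g. checkpoints, is one way around the disappearing-files warning above. A minimal sketch with a placeholder model and path:
```
# Sketch: save checkpoints to the mounted Drive folder so they survive the session.
import os
import torch
import torch.nn as nn

ckpt_dir = '/content/drive/MyDrive/checkpoints'   # Drive persists; /content does not
os.makedirs(ckpt_dir, exist_ok=True)

net = nn.Linear(4, 2)                             # placeholder model
torch.save(net.state_dict(), os.path.join(ckpt_dir, 'net.pt'))
```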
I turned on the GPU via menu: Runtime > Change Runtime Type. Finally I modified my code.
This included setting the device and moving my CNN model's parameters to the GPU
```
print("CUDA Available: " + str(torch.cuda.is_available()))
print("Using " + torch.cuda.get_device_name(0))

torch.cuda.set_device(0)          # make GPU 0 the default CUDA device
device = torch.device('cuda:0')
net.cuda()                        # move the model's parameters to the GPU
```
and moving each batch of inputs and labels to the GPU at each step of the training loop.
```
# inside the training loop: for i, data in enumerate(dataloader):
inputs, labels = data[0].to(device), data[1].to(device)
```
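For reference, here's the whole device-handling pattern as one minimal, self-contained sketch; the tiny model and random data are placeholders, not my actual network or dataset:
```
# Sketch: the full device-handling pattern in one runnable piece.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# placeholder data: 64 RGB "images" at 32x32, 10 classes
dataset = TensorDataset(torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,)))
loader = DataLoader(dataset, batch_size=16)

# placeholder model, moved to the GPU once
net = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.Flatten(), nn.Linear(8 * 32 * 32, 10),
).to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)

for i, data in enumerate(loader):
    inputs, labels = data[0].to(device), data[1].to(device)  # move each batch to the GPU
    optimizer.zero_grad()
    loss = criterion(net(inputs), labels)
    loss.backward()
    optimizer.step()
```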
The GPU ran through the data 5-6 times faster than my MacBook's "2.4 GHz 8-Core Intel Core i9", which seems a bit underwhelming, but for now I'll take it.
# Resources
- [Blog article with steps for using CUDA when training a model](https://cnvrg.io/pytorch-cuda/)
- [Google: Run a calculation on a Cloud TPU VM using PyTorch](https://cloud.google.com/tpu/docs/run-calculation-pytorch)
- [Guide to Picking CNN Architecture for Vision](https://levelup.gitconnected.com/a-practical-guide-to-selecting-cnn-architectures-for-computer-vision-applications-4a07ef90234)
- [Trying to Calculate Number of Features Given to 1st FC Layer](https://datascience.stackexchange.com/questions/40906/determining-size-of-fc-layer-after-conv-layer-in-pytorch)
- [Pytorch Image Classifier Startup Guide](https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html)
- [COMPREHENSIVE Discussion of CNN Model for Pytorch Vision Application](https://medium.com/thecyphy/train-cnn-model-with-pytorch-21dafb918f48)