October 10th, 2023

# Exploring Vertex AI, SageMaker, & Colab

## Vertex AI

Training a model in Google Vertex AI requires

1. configuring a Vertex workspace,
2. uploading data to a bucket,
3. accessing the data from Vertex, and
4. using sufficient acceleration.

### Loading in Data with Buckets

The first obstacle I encountered was how to upload my image data (approximately 4,000 JPEG images, each around 500x400 px) to Google Cloud. The web console froze when I tried to upload the folder, so instead I used the [CLI](https://cloud.google.com/sdk/docs/install).

```
gsutil cp <local object path> gs://<bucket name>
```

### GPU Availability Woes

Unfortunately I could not access the GPU resources I needed from any region in any instance configuration, although comparing these directories did help me get close:

- [Platforms <-> GPUs](https://cloud.google.com/compute/docs/gpus/#gpus-list)
- [GPUs <-> Regions](https://cloud.google.com/compute/docs/gpus/gpu-regions-zones)

## SageMaker

After no luck with Vertex, the next stop was AWS's offering: SageMaker Studio in conjunction with S3 (cloud storage). The main steps were

1. uploading data to an S3 bucket using the AWS CLI,
2. making the data accessible from a SageMaker notebook, and
3. accessing sufficient GPU resources.

### Data Wrangling Part 1: Local to S3

This included

1. setting up the AWS CLI on my local machine, and
2. creating an IAM user and an access key for that user.

I installed the CLI using `curl`, and the install script added `aws` to my PATH with no trouble. Running `aws configure` prompted me for an AWS Access Key ID and Secret Access Key. To obtain these I went to the IAM console, created a new IAM user, and then created an access key for that user. This let me use `aws s3 cp <local file> "s3://<destination-bucket>"` to upload the files to my Amazon S3 bucket.

### Part 2: S3 to SageMaker Studio Notebook Instance

I didn't end up using the SageMaker tool of the same name (Data Wrangler) to queue up my data. Instead I used `sagemaker.s3.S3Downloader` to copy the files into the SageMaker runtime environment, so I could work with them from local disk instead of reading each object out of S3. The copy took a few minutes and went off without a hitch. But if you get an error like `no file or folder s3://<bucket-name>/folder`, you probably need to grant your instance's execution role full S3 access (the AmazonS3FullAccess policy). Have fun with that 🫡.

### The Achilles Heel: Resource Quotas

Once again the showstopper was access to a GPU. Each running instance of a given machine type uses up one unit of that type's quota, and I found myself with a quota of zero for every GPU machine type. I requested an increase to 1, which was quickly granted, but my subsequent request is still pending as of 8:42 pm. Be sure to keep an eye on which region you are requesting and using resources in, and stick to the same one, or you might notice instances and storage seemingly disappearing.

## Google Colab

This was by far the quickest setup of all. Of course, that's not by accident on the developers' part. It came down to

1. creating a notebook,
2. getting my data in via Google Drive,
3. copying it to the instance, and, since I finally succeeded on this one,
4. turning on the GPU and modifying my code to use it.

***WARNINGS***

- Colab does not save your code. Download it frequently if you're not on the Pro tier.
- It does not save your instance files after you close out of a notebook (even if you immediately navigate back to it).
- It will yell at you for having too many instances open.
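One workaround for that second warning is to copy anything worth keeping back into the mounted Drive before the runtime goes away. Here's a minimal sketch, assuming the Drive mount shown in the Setup section below (the `outputs` paths are just placeholders):

```
# Minimal sketch: persist instance files by copying them into the mounted
# Drive before the runtime is recycled. Assumes Drive is mounted as in the
# Setup section below; the "outputs" paths are placeholders.
import shutil
from google.colab import drive

drive.mount('/content/drive')

# Copy a results folder from the ephemeral instance disk into Drive
shutil.copytree('/content/outputs', '/content/drive/MyDrive/outputs')
```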
Be prepared to contort yourself a bit at first dealing with these restrictions (it's designed to be annoying enough to push you toward Pro), and *don't lose your stuff*. Anyway...

### Setup

I was able to simply create a new notebook from the .ipynb on my local machine. Then I uploaded my data to a Google Drive folder, mounted that folder in my Colab instance, and copied the files onto the instance with

```
from google.colab import drive
import shutil

# Mount my Google Drive into the Colab filesystem
drive.mount('/content/drive')

# Copy the image folder from Drive onto the instance's local disk
src = '/content/drive/MyDrive/images'
dest = '/content/images'
destination = shutil.copytree(src, dest)
```

I turned on the GPU via the menu: Runtime > Change Runtime Type. Finally, I modified my code. This included setting the device and moving the CNN's parameters to the GPU

```
import torch

# net is the CNN defined earlier in the notebook
print("CUDA Available: " + str(torch.cuda.is_available()))
print("Using " + torch.cuda.get_device_name(0))

torch.cuda.set_device(0)         # use the first (and only) GPU
device = torch.device('cuda:0')
net.cuda()                       # move the model's parameters into GPU memory
```

and moving each batch of inputs and labels to the GPU on each loop step.

```
# inside the training loop, e.g. for i, data in enumerate(trainloader):
inputs, labels = data[0].to(device), data[1].to(device)
```

The GPU ran through the data 5-6 times faster than my MacBook's "2.4 GHz 8-Core Intel Core i9", which seems a bit underwhelming, but for now I'll take it.

# Resources

- [Blog article with steps for using CUDA on a model training](https://cnvrg.io/pytorch-cuda/)
- [Google: Run a calculation on a Cloud TPU VM using PyTorch](https://cloud.google.com/tpu/docs/run-calculation-pytorch)
- [Guide to Picking CNN Architecture for Vision](https://levelup.gitconnected.com/a-practical-guide-to-selecting-cnn-architectures-for-computer-vision-applications-4a07ef90234)
- [Trying to Calculate Number of Features Given to 1st FC Layer](https://datascience.stackexchange.com/questions/40906/determining-size-of-fc-layer-after-conv-layer-in-pytorch)
- [PyTorch Image Classifier Startup Guide](https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html)
- [COMPREHENSIVE Discussion of CNN Model for PyTorch Vision Application](https://medium.com/thecyphy/train-cnn-model-with-pytorch-21dafb918f48)