January 8, 2020 by Irio Musskopf
I was recently faced with the problem of finding an apartment in Berlin. As you might have heard, this task is not getting easier over time. Especially for foreigners who don’t speak German fluently, the search can take hours of research every week. When you find a good-enough option, you must be quick to make first contact and follow up with a proposal.
As an engineer, I work on automating everything that can reliably be automated by an algorithm, so people can focus on tougher tasks or just enjoy their free time.
Following this principle and my own experience, I decided to automate part of this effort – my task would be to write a program to alert me about all the best deals in the city, every day. In this article, I explain how I designed the foundations of this platform, which required software to collect all the available apartments at the push of a button.
The platform I’ve written is a Go application deployed to Google Cloud using Terraform. Also, it has Continuous Deployment from a private GitHub repository.
After some quick research, I came up with the following list of platforms to monitor:
- eBay Kleinanzeigen
A few hours later, I had a Go binary that did everything I needed to run the application locally. It uses a web scraping framework called Colly to browse all the platforms’ listings, extract basic attributes, and export them to CSV files in the local filesystem.
Since I didn’t want to keep the application running locally, my first choice was to get a cheap instance at Google Cloud. Once I had this rented virtual machine, I could write a startup script to compile the app from GitHub, and set up a crontab to scrape the platforms on a daily basis.
That was probably the best decision for this specific project, but could I use this personal problem as an opportunity to explore the integration of Google Cloud services?
Since I had worked on multiple projects involving some sort of scraping application in the past, I believed it was worth the effort. I could easily reuse this setup in the future.
My architecture started with a few premises:
- It should use Google Cloud services.
- It should support data collection every few minutes, even though I would start collecting only once a day.
- It should be as cost-effective as a cheap droplet at DigitalOcean (US$ 5).
- It should be easy to deploy. Ideally, it should implement Continuous Deployment.
- It should support triggering a data collection process on demand, e.g., after an HTTP POST request.
My hypothesis was that I didn’t need a virtual machine running 24/7; thus, it should not cost as much as a full month of one. In fact, my application was able to download all the properties I was interested in within 3 minutes, so I expected something significantly lower.
# The architecture
My exploration of the latest Google Cloud services led me to Cloud Run, a service that “run(s) stateless containers on a fully managed environment or in your own GKE cluster.” Still classified as a beta product by Google Cloud, it is built on top of Knative and Kubernetes. Its key selling point is its pricing model: it charges in chunks of milliseconds rather than hours of runtime.
With a few tweaks, my Go application was wrapped in a Docker container to be runnable by Cloud Run. Once it gets an HTTP POST request, it collects attributes from all the advertised properties and publishes them as CSV files to a Google Cloud Storage bucket. For my use case, I created two ways to hit this endpoint: an Internet-accessible address, so I can trigger it whenever I want, and a Cloud Scheduler job, which is configured to hit it once a day.
# The application
The application is fairly simple: it consists of an HTTP server with a single endpoint. On every hit, it scrapes all the platforms and saves results in CSVs inside a Storage bucket.
Other application files can be found in this Gist. All the feedback is appreciated, as this is one of my first Go projects.
# The deployment
Authenticate the gcloud CLI with
- $ gcloud auth login
Create a Google Cloud project and configure the CLI to use it with
- $ gcloud config set project PROJECT_NAME
Create a Google Cloud Service Account for use with Terraform, giving it the “Owner” role. Create and download a JSON key for this new service account. Place it in deployment/credentials.json.
Enable the following Cloud APIs:
- App Engine Admin API
- Cloud Build API
- Cloud Run API
- Cloud Scheduler API
Give appropriate roles to the API service account ending with @cloudbuild.gserviceaccount.com:
- Cloud Run Admin
- Cloud Run Service Agent
Create a Cloud Source Repository based on your GitHub repository.
Set appropriate variable values in terraform.tfvars.
Now that the permissions are in place, use Terraform to set up the rest of the infrastructure.
- $ cd deployment
- $ terraform init
- $ terraform apply
The initial deployment may take about five minutes since Terraform waits for Cloud Run to build and start before configuring Cloud Scheduler.
Since Cloud Run is still in beta, with API endpoints in the alpha stage, I was not able to declare all the infrastructure in Terraform files. As a temporary workaround, I’ve written a couple of auxiliary bash scripts that call the Cloud API through its CLI. Fortunately, all this happens in the background when a developer runs terraform apply.
# The result
Every day, without any human interaction, Cloud Scheduler triggers the application, which creates a new folder containing CSV files with the most recently available apartments in my city.
# The costs
Not all the services in use are available in the official calculator. Either way, I’ve made a rough estimate for my personal use, assuming an unrealistically high rate of one deployment per day.
# Cloud Storage - US$ 0.02/month
- Location: US
- Class A operations: 4*30 = 120
- 1st month
- Storage: 2MB*30 = 60MB
- 12th month
- Storage: 2MB*365 = 730MB
# Cloud Run - US$ 0.00/month
- Location: us-east1
- CPU allocated: 1
- Memory allocated: 1GB
- Concurrent requests per container instance: 1
- Execution Time per Request (ms): 5000
- Outbound Network Bandwidth per request execution (KB): 1000
- Requests per Month: 30
# Cloud Build - US$ 0.00/month
- free quota of 120 build-minutes/day
- 4 build-minutes/day
# Container Registry - US$ 0.02–0.19/month
- 1st month
- 20MB*30 = 600MB
- 600/1024 * 0.026 = 0.02
- 12th month
- Storage: 20MB*365 = 7300MB
- 7300/1024 * 0.026 = 0.19
# Cloud Source Repositories - US$ 0.00/month
- free quota of 5 project-users
- 1 project
# Cloud Scheduler - US$ 0.00/month
- free quota of 3 jobs/month
- 1 job
For comparison, an f1-micro instance, with 0.6GB of RAM, running for a full month on Google Cloud is included in the free tier; a g1-small instance, with 1.7GB, would cost US$ 13.80 per month. Also, the actual cost could end up lower or higher, depending on how accurate my initial assumptions were and on further optimizations.
Originally published at https://iriomk.com/ on September 27, 2019.