
Running a scraping platform at Google Cloud for as little as US$ 0.05/month

Irio Musskopf
5 min read

I was recently faced with the problem of finding an apartment in Berlin. As you might have heard, this task is not getting easier over time. Especially for foreigners who don’t speak German fluently, the effort easily adds up to hours of research every week. And when a good-enough option appears, you must be quick to make the first contact, followed by a proposal.

As an engineer, I strive to automate everything that can reliably be automated by an algorithm, so people can focus on tougher tasks or simply enjoy their free time.

Following this principle, and drawing on my own experience, I decided to automate part of this effort: I would write a program to alert me to the best deals in the city, every day. In this article, I explain how I designed the foundations of this platform, starting with software that collects all the available apartments at the push of a button.

Berlin

Photo taken by dronepicr

The platform I’ve written is a Go application deployed to Google Cloud using Terraform, with Continuous Deployment from a private GitHub repository.


After some quick research, I came up with the following list of platforms to monitor:

  • eBay Kleinanzeigen
  • ImmobilienScout24
  • Immowelt
  • Nestpick

A few hours later, I had a Go binary that did everything I needed to run the application locally. It uses a web scraping framework called Colly to browse all the platforms’ listings, extract basic attributes, and export them to CSV files on the local filesystem.
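To give an idea of what that looks like, here is a minimal sketch of a Colly scraper for one platform. The URL, CSS selectors, and struct fields are made up for illustration; they are not the ones from the real project.

package main

import (
    "encoding/csv"
    "log"
    "os"

    "github.com/gocolly/colly"
)

// Listing holds the basic attributes extracted from one advertisement.
type Listing struct {
    Title string
    Price string
    URL   string
}

func main() {
    c := colly.NewCollector()
    var listings []Listing

    // ".result-item", ".title" and ".price" are placeholder selectors;
    // each platform needs its own.
    c.OnHTML(".result-item", func(e *colly.HTMLElement) {
        listings = append(listings, Listing{
            Title: e.ChildText(".title"),
            Price: e.ChildText(".price"),
            URL:   e.Request.AbsoluteURL(e.ChildAttr("a", "href")),
        })
    })

    if err := c.Visit("https://www.example-portal.de/wohnungen/berlin"); err != nil {
        log.Fatal(err)
    }

    // Export the collected listings to a CSV file on the local filesystem.
    f, err := os.Create("listings.csv")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    w := csv.NewWriter(f)
    defer w.Flush()
    w.Write([]string{"title", "price", "url"})
    for _, l := range listings {
        w.Write([]string{l.Title, l.Price, l.URL})
    }
}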

Since I didn’t want to keep the application running locally, my first instinct was to get a cheap instance on Google Cloud. Once I had this rented virtual machine, I could write a startup script to compile the app from GitHub and set up a crontab to scrape the platforms on a daily basis.

That would probably have been the best decision for this specific project, but could I use this personal problem as an opportunity to explore the integration of Google Cloud services?

Since I had been involved in multiple projects with some sort of scraping application in the past, I believed it was worth the effort: I could easily reuse this setup in the future.

My architecture started with a few premises:

  • It should use Google Cloud services.
  • It should support data collection every few minutes, even though I would start collecting only once a day.
  • It should be as cost-effective as a cheap droplet at DigitalOcean (US$ 5/month).
  • It should be easy to deploy. Ideally, it should implement Continuous Deployment.
  • It should support triggering a data collection process on demand - e.g., via an HTTP POST request.

My hypothesis was that I didn’t need a virtual machine running 24/7, so I shouldn’t have to pay a full month’s price. In fact, my application was able to download all the properties I was interested in, in under three minutes, so I expected something significantly lower.

The architecture

Software architecture

My exploration of the latest Google Cloud services led me to Cloud Run, a service that “run(s) stateless containers on a fully managed environment or in your own GKE cluster.” Still classified as a beta product by Google Cloud, it is built on top of Knative and Kubernetes. Its key selling point is the pricing model: it charges for chunks of milliseconds of runtime rather than for hours.

With a few tweaks, my Go application was wrapped in a Docker container so it could run on Cloud Run. Once it receives an HTTP POST request, it collects attributes from all the advertised properties and publishes them as CSV files to a Google Cloud Storage bucket. For my use case, I created two ways to hit this endpoint: directly, through an Internet-accessible address, so I can trigger it whenever I want; and through Cloud Scheduler, which is configured to hit it once a day.
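Triggering a collection by hand then comes down to a single request against the public address (the URL below is a made-up example of the kind Cloud Run assigns):

$ curl -X POST https://apartment-scraper-abc123-ue.a.run.app/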

The application

The application is fairly simple: it consists of an HTTP server with a single endpoint. On every hit, it scrapes all the platforms and saves the results as CSV files in a Cloud Storage bucket.

Application source tree

./main.go

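The actual file is in the Gist linked below; as a rough sketch of what it does, the handler boils down to something like this (the bucket name and the scrapeAll helper are placeholders):

package main

import (
    "fmt"
    "log"
    "net/http"
    "os"
    "time"

    "cloud.google.com/go/storage"
)

func main() {
    http.HandleFunc("/", handler)

    // Cloud Run tells the container which port to listen on via $PORT.
    port := os.Getenv("PORT")
    if port == "" {
        port = "8080"
    }
    log.Fatal(http.ListenAndServe(":"+port, nil))
}

// handler runs one full collection and uploads the resulting CSV files.
func handler(w http.ResponseWriter, r *http.Request) {
    ctx := r.Context()

    client, err := storage.NewClient(ctx)
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    defer client.Close()

    // scrapeAll stands in for the Colly-based scrapers; it returns one
    // CSV payload per platform, keyed by file name.
    for name, data := range scrapeAll() {
        object := fmt.Sprintf("%s/%s", time.Now().Format("2006-01-02"), name)
        wc := client.Bucket("my-apartments-bucket").Object(object).NewWriter(ctx)
        if _, err := wc.Write(data); err != nil {
            http.Error(w, err.Error(), http.StatusInternalServerError)
            return
        }
        if err := wc.Close(); err != nil {
            http.Error(w, err.Error(), http.StatusInternalServerError)
            return
        }
    }
    fmt.Fprintln(w, "done")
}

// scrapeAll is omitted here; see the Colly example above and the Gist.
func scrapeAll() map[string][]byte { return nil }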

./Dockerfile

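Likewise, the real Dockerfile lives in the Gist; a typical multi-stage build for a small Go service on Cloud Run looks roughly like this:

FROM golang:1.13 AS build
WORKDIR /app
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -o /scraper .

FROM alpine:3.10
# ca-certificates are needed so the scraper can make HTTPS requests.
RUN apk add --no-cache ca-certificates
COPY --from=build /scraper /scraper
CMD ["/scraper"]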

Other application files can be found in this Gist. All feedback is appreciated, as this is one of my first Go projects.

The deployment

  1. Install Terraform.
  2. Install the Google Cloud CLI and sign in to your account with
    • $ gcloud auth login
  3. Create a Google Cloud project and configure the CLI to use it with
    • $ gcloud config set project PROJECT_NAME
  4. Create a Google Cloud service account for Terraform, giving it the “Owner” role. Create and download a JSON key for this new service account and place it in deployment/credentials.json.
  5. Enable the following Cloud APIs:
    • App Engine Admin API
    • Cloud Build API
    • Cloud Run API
    • Cloud Scheduler API
  6. Give appropriate roles to the Cloud Build service account ending with @cloudbuild.gserviceaccount.com:
    • Cloud Run Admin
    • Cloud Run Service Agent
  7. Create a Cloud Source Repository based on your GitHub repository.
  8. Set appropriate variable values in terraform.tfvars, as sketched below.
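A terraform.tfvars along these lines works; the variable names below are illustrative and should match whatever your Terraform files actually declare:

project     = "my-apartments-project"
region      = "us-east1"
repository  = "apartment-scraper"
credentials = "credentials.json"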

With permissions in place, use Terraform to set up the rest of the infrastructure.

$ cd deployment
$ terraform init
$ terraform apply

The initial deployment may take about five minutes since Terraform waits for Cloud Run to build and start before configuring Cloud Scheduler.

Since Cloud Run is still in beta - with API endpoints in the alpha stage - I was not able to declare all the infrastructure in Terraform files. As a temporary workaround, I’ve written a couple of auxiliary bash scripts that call the Cloud APIs through the gcloud CLI. Fortunately, all of this happens in the background when a developer runs terraform apply.
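As a sketch of that split, the Cloud Scheduler job can be declared natively while the Cloud Run service is deployed through a local-exec provisioner. The resource names, the deploy_cloud_run.sh script, and the cloud_run_url variable below are illustrative, not the project’s actual definitions:

resource "null_resource" "cloud_run_service" {
  # No stable Terraform resource for Cloud Run yet, so a helper script
  # wraps the gcloud beta commands instead.
  provisioner "local-exec" {
    command = "./deploy_cloud_run.sh"
  }
}

resource "google_cloud_scheduler_job" "daily_scrape" {
  name     = "daily-scrape"
  schedule = "0 6 * * *"

  http_target {
    http_method = "POST"
    uri         = var.cloud_run_url
  }
}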


The result

Every day, without any human interaction, Cloud Scheduler triggers the creation of a new folder of CSV files listing the most recently available apartments in my city.
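Listing the bucket shows one folder per day, with one CSV per platform (the bucket name below is a made-up example):

$ gsutil ls gs://my-apartments-bucket/2019-09-27/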


The costs

Not all the services in use are available in the official pricing calculator. Either way, I’ve made a rough estimate for my personal use, assuming an unrealistically high one deployment each day.

Cloud Storage - US$ 0.02/month

  • Location: US
  • Class A operations: 4*30 = 120
  • 1st month
    • Storage: 2MB*30 = 60MB
  • 12th month
    • Storage: 2MB*365 = 730MB

Cloud Run - US$ 0.00/month

  • Location: us-east1
  • CPU allocated: 1
  • Memory allocated: 1GB
  • Concurrent requests per container instance: 1
  • Execution Time per Request (ms): 5000
  • Outbound Network Bandwidth per request execution (KB): 1000
  • Requests per Month: 30
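These parameters stay well inside Cloud Run’s free tier: 30 requests * 5 s * 1 vCPU is only 150 vCPU-seconds per month, and 30 * 5 s * 1GB is 150 GB-seconds, against a free allowance that, at the time of writing, is on the order of 180,000 vCPU-seconds and 360,000 GB-seconds per month - hence the US$ 0.00 estimate.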

Cloud Build - US$ 0.00/month

  • free quota of 120 build-minutes/day
  • 4 build-minutes/day

Container Registry - US$ 0.02–0.19/month

  • $0.026/GB
  • 1st month
    • 20MB*30 = 600MB
    • 600/1024 * 0.026 = 0.02
  • 12th month
    • Storage: 20MB*365 = 7300MB
    • 7300/1024 * 0.026 = 0.19

Cloud Source Repositories - US$ 0.00/month

  • free quota of 5 project-users
  • 1 project

Cloud Scheduler - US$ 0.00/month

  • free quota of 3 jobs/month
  • 1 job

For comparison, an f1-micro instance - with 0.6GB of RAM - running over a full month on Google Cloud is included in the free tier; a g1-small instance, with 1.7GB, would cost US$ 13.80 per month. Also, it is reasonable to expect the cost to decrease or increase depending on how accurate my initial assumptions were and on further optimizations.

Originally published at https://iriomk.com/ on September 27, 2019.

