- Reference: Crawl Budget Management for Large Sites (Google Search Central documentation): what crawl budget is and how to optimize Google's crawling of large and frequently updated websites.
Summary (by durumis AI):
- We built a Cloud Run-based redirect service to speed up page responses and make more efficient use of Google's search crawl budget.
- We moved static files such as robots.txt and favicon.ico, along with the redirection logic, to Cloud Run, cutting response times from roughly 200ms on the previous GKE Autopilot setup to about 20ms.
- By deploying Cloud Run in regions around the world, each request is served from a server closer to the user, which improves speed and also saves cost.
All services that primarily focus on content aim to achieve high visibility in Google search results.
To achieve this visibility, it's crucial that Google's crawlers visit the site frequently, so the latest content is crawled and promptly reflected in the index.
How often crawlers visit a website is governed by its "crawl budget": a higher crawl budget allows more frequent visits, and earning a larger one typically requires a certain level of website traffic.
So, how can we make the most of this limited crawl budget and maximize the number of pages crawled within it?
The answer lies in optimizing page load speed. Within the same crawl budget, a site whose pages download in 20ms can be crawled far more often than one that takes 200ms, simply because the crawler can fetch more pages in the same amount of time.
Fundamentally, durumis aims to be a "global service" and therefore serves traffic from 8 regions worldwide.
GCP Server Region
Currently, our service uses a subset of Google's 40 regions: Seoul, Singapore, and Mumbai in Asia; Belgium in Europe; and Los Angeles and South Carolina in the United States. This ensures that users from virtually anywhere in the world can access the service with minimal latency.
However, due to the reasons discussed earlier, we've decided to fine-tune the service further to achieve even greater speed.
Currently, our service is undergoing a transition from the initial launch URL scheme to a new subdomain scheme. This involves some redirects, which are attracting a significant amount of Google crawler traffic. We need to ensure that Google quickly recognizes the new URLs associated with the existing ones.
Therefore, to maximize the use of the allocated crawl budget, we need to maintain the highest possible speed for redirect traffic.
Previously, GKE clusters were deployed in the 8 regions mentioned above, each running on Google's GKE Autopilot, and they handled the redirects using data that does not depend on the database.
As a result, we've decided to migrate some logic currently residing in GKE and other Cloud Run instances to a new Cloud Run environment. The following items are included in the migration list:
- robots.txt: Although a static file, it serves content dynamically across various subdomains based on specific logic.
- favicon.ico: A purely static file.
- durumis.com/[lang]/@userId/postID: redirects to the new URL userId.durumis.com/[lang]/postID
Previously, when served through a slightly heavier framework within GKE Autopilot, the response time was approximately 200ms.
The new approach uses Cloud Run, and given how simple the logic is, we've avoided frameworks entirely: just plain node:http with no libraries, along the lines of the basic code below.
```javascript
const http = require("http");
const fs = require("fs");
const path = require("path");

// Cloud Run injects the listening port via the PORT environment variable.
const port = process.env.PORT || 8080;

let cachedFavicon = null;

// Load favicon.ico from disk into memory once at startup.
function loadFavicon() {
  const faviconPath = path.join(__dirname, "favicon.ico");
  try {
    cachedFavicon = fs.readFileSync(faviconPath);
    console.log("Favicon loaded into memory");
  } catch (err) {
    console.error("Error loading favicon:", err);
  }
}

// --- omitted ---

const server = http.createServer((req, res) => {
  if (req.url === "/favicon.ico") {
    res.writeHead(200, {
      "Content-Type": "image/x-icon",
      "Cache-Control": "public,max-age=2419200",
    });
    res.end(cachedFavicon);
  }
});

server.listen(port, () => {
  console.log(`Server running at http://localhost:${port}/`);
});
```
(Please note that directly copying and pasting the code above will not produce a working service. It's only intended to convey the general idea.)
In the case of favicon.ico, since it's an image file, we've chosen to load it from the file system into memory and serve it from there.
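The other two items on the migration list are handled in the same handler. As a rough illustration only (this is not the actual durumis code; the host parsing, the language pattern, and the robots.txt rules are all assumptions), the per-subdomain robots.txt and the 301 redirect could look something like this:

```javascript
// Illustrative sketch only -- not the actual durumis handler.
// Assumptions: old post URLs look like /[lang]/@userId/postID, and robots.txt
// only varies by the Host header (e.g. to point at the subdomain's sitemap).
const http = require("http");

const port = process.env.PORT || 8080;

// Hypothetical pattern for the legacy URL scheme: /ko/@someUser/12345
const POST_PATH = /^\/([a-z]{2})\/@([^/]+)\/([^/]+)$/;

const server = http.createServer((req, res) => {
  const host = req.headers.host || "durumis.com";

  if (req.url === "/robots.txt") {
    // Dynamic per-subdomain content: same rules, but the sitemap follows the host.
    const body = ["User-agent: *", "Allow: /", `Sitemap: https://${host}/sitemap.xml`, ""].join("\n");
    res.writeHead(200, { "Content-Type": "text/plain", "Cache-Control": "public,max-age=3600" });
    res.end(body);
    return;
  }

  const match = POST_PATH.exec(req.url);
  if (match) {
    const [, lang, userId, postId] = match;
    // Permanent redirect so Google associates the old URL with the new one.
    res.writeHead(301, { Location: `https://${userId}.durumis.com/${lang}/${postId}` });
    res.end();
    return;
  }

  res.writeHead(404, { "Content-Type": "text/plain" });
  res.end("Not Found");
});

server.listen(port);
```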
We utilize shell scripts with gcloud to deploy the service across multiple regions simultaneously.
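The actual scripts aren't shown in this post, but the idea is simply to run `gcloud run deploy` once per target region. Below is a rough sketch of that loop, written as a small Node script to match the rest of the examples here; the service name, image, and region list are placeholders, not the real durumis configuration:

```javascript
// Rough sketch of a multi-region deploy loop (all names are placeholders).
// The real deployment uses shell scripts with gcloud; this only shows the idea.
const { execSync } = require("child_process");

const SERVICE = "redirect-service";          // hypothetical Cloud Run service name
const IMAGE = "gcr.io/my-project/redirect";  // hypothetical container image
const REGIONS = [
  "asia-northeast3", // Seoul
  "asia-southeast1", // Singapore
  "asia-south1",     // Mumbai
  "europe-west1",    // Belgium
  "us-west2",        // Los Angeles
  "us-east1",        // South Carolina
];

// Deploys regions one after another; the loop could be parallelized if needed.
for (const region of REGIONS) {
  console.log(`Deploying ${SERVICE} to ${region}...`);
  execSync(
    `gcloud run deploy ${SERVICE} --image ${IMAGE} --region ${region} --allow-unauthenticated`,
    { stdio: "inherit" }
  );
}
```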
Ta-da!
The deployment results are as follows:
Cloud Run Screen
We've deployed to "all regions" in the United States, along with 2 regions each in Europe and Asia, 1 in Africa, and 2 in South America. This showcases the advantage of serverless: deploying this way doesn't increase costs at all. (In fact, serving traffic from regions closer to users might even reduce costs.)
Now, let's see how much faster the service has become.
The previous server-side response times were as follows:
GKE Autopilot
The improved speeds are as follows:
Static File using Cloud Run
You might be wondering about the cold starts inherent in Cloud Run's serverless model. In practice, with crawls arriving steadily at regular intervals, this isn't a significant concern. It's one of the key features of GCP's serverless architecture: instances remain on standby, idle, while waiting for requests, and that idle time doesn't add cost.
In the next post, we'll delve into handling image files and other files, going beyond the simple static files and 301 redirects discussed in this post.