Advanced Web Performance Optimization [cont'd]
Caching Frequently Used Objects
Caching is the temporary storage of frequently accessed data in higher-speed media (typically SRAM or RAM), or in media that is closer to the user, for more efficient retrieval. Web caching stores frequently used objects closer to the client through browser, proxy, or server caches. By storing "fresh" objects closer to your users, you avoid unnecessary HTTP requests and minimize network "hops." This reduces bandwidth consumption and server load, and improves response times. Yahoo! estimates that between 62% and 95% of the time that it takes to fetch a web page is spent making HTTP requests for objects.[132] Caching helps to reduce costly HTTP requests to improve performance.
Unfortunately, caching is underutilized and often misunderstood on the Web. A July 2007 survey of Fortune 1000 company websites revealed that 37.9% used cache control headers.[133] What the survey doesn't tell you is that most of these sites use "don't cache" headers. Developers routinely bust caches for fear of delivering stale content. Browsers have not helped the situation either. To avoid the "304" requests that would come from revalidating previously downloaded objects, browser developers have adopted a check-once-per-session scheme. If a user doesn't shut down her browser, however, she can see stale content. One solution is to cache web objects for longer periods (some developers set their expiry times 20 years into the future), change object filenames for updates, and use shorter expiration times for HTML documents, which tend to change more frequently.
Caching is not just for static sites; even dynamic sites can benefit from it. Caching dynamically generated content is less useful than caching all the dependent objects, such as scripts, styles, images, and Flash, which are often re-requested or at least revalidated by browsers or intermediaries. Dependent objects such as multimedia objects typically don't change as frequently as HTML files. Graphics that seldom change, such as logos, headers, and navigation bars, can be given longer expiration times, whereas resources that change more frequently, such as HTML and XML files, can be given shorter expiration times. By designing your site with caching in mind, you can target different classes of resources to give them different expiration times with only a few lines of code. You can test how well caching is set up on your site using Port80 Software's Cache Check tool (see Figure 9.3, "Checking the caching on CNN.com with Port80Software.com's Cache Check tool").
Three ways to cache in
- Via <meta> tags (<meta http-equiv="Expires">)
- Programmatically, by setting HTTP headers (CGI scripts, etc.)
- Through the web server general configuration files (httpd.conf)
In the section that follows, we'll explore the third method of cache control: server configuration files. Although the first method works with browsers, most intermediate proxy servers don't parse HTML files; they look for HTTP headers to set caching policy, which undermines this method. The second method, programmatically setting cache control headers (e.g., Expires and Cache-Control), is useful for CGI scripts that output dynamic data. The third and preferred method is to use web server configuration files to set cache control rules. In addition, we'll explore mod_cache, which provides a powerful caching architecture to accelerate HTTP traffic.
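As a brief illustration of the third method, here is a minimal sketch that assumes the mod_expires module is loaded. The lifetimes are illustrative assumptions rather than recommendations; the idea is simply that each class of resource gets its own expiration rule.

```apache
# Sketch only: assumes mod_expires is available; lifetimes are placeholders.
<IfModule mod_expires.c>
    ExpiresActive On

    # Fallback for any type not listed below
    ExpiresDefault "access plus 1 month"

    # Graphics that seldom change (logos, headers, navigation bars)
    ExpiresByType image/gif  "access plus 1 year"
    ExpiresByType image/jpeg "access plus 1 year"
    ExpiresByType image/png  "access plus 1 year"

    # Stylesheets change occasionally
    ExpiresByType text/css "access plus 1 month"

    # HTML tends to change more frequently
    ExpiresByType text/html "access plus 1 hour"
</IfModule>
```

With rules like these in place, mod_expires adds an Expires header (and a matching Cache-Control max-age) to each response based on its MIME type.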
Example cache control conversation. To cache web objects, browsers and proxy servers upstream from the origin server must be able to calculate a time to live (TTL), or a limit on the period of time you can display an object from the cache since the last time it was accessed or modified. HTTP does this digital melon-squeezing primarily through brief HTTP header conversations between client, proxy, and origin servers to determine whether it is OK to reuse a cached object or whether it should reload the resource to get a fresh one. Here is an example HTTP request and response sequence for Google's logo image, logo.gif (see Figure 9.4: "Google's logo: back to the future").
First the browser requests the image:
One of Google's servers replies with the following:
This image was last modified June 7, 2006 and includes an Expires header set to January 17, 2038, far into the future. In its minimalist reply header, Google does not use the Cache-Control header, an entity tag (ETag), or the Accept-Ranges header. The Cache-Control header was introduced in HTTP 1.1 to provide a more flexible alternative to the Expires header. Rather than setting a hardcoded time into the future, as the Expires header does, the max-age setting of the Cache-Control header provides a relative offset (in seconds) from the time the response was generated. Here is an example that sets the cache control maximum age to one year (in seconds):
The Expires header still serves browsers that encounter a server that switches to HTTP 1.0, which should send only an Expires header. Of course, because the logo carries a far-future Expires header and no ETag, once Google substitutes one of its patented seasonal logos it would need to change the filename to make sure the logo updates in browsers (see Figure 9.5, "Happy Halloween logo from Google").
Use a future Expires header. By using an Expires header set far into the future, Google ensures that its logo will be cached by browsers. According to the HTTP specification, the Expires header tells the browser "the date/time after which the response is considered stale." When the browser encounters this header and has the image in its cache, the cached image is returned on subsequent page views, saving one HTTP request and HTTP response.
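A far-future policy is safest when it is scoped to files whose names change whenever their contents change. The sketch below assumes mod_expires is loaded and uses a hypothetical directory path, /var/www/example/static, to stand in for wherever such renamed-on-update files live.

```apache
# Sketch only: the directory path is hypothetical; assumes mod_expires.
<IfModule mod_expires.c>
    <Directory "/var/www/example/static">
        ExpiresActive On
        # Far-future lifetime; update the filename when the file changes
        ExpiresDefault "access plus 10 years"
    </Directory>
</IfModule>
```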
Configure or eliminate ETags. ETags were designed to be a more flexible caching alternative to determine whether a component in the browser's cache matches the one on the origin server. The problem with ETags is that they are constructed to be unique to a specific resource on a specific server. For busy sites with multiple servers, ETags can cause identical resources to not be cached, degrading performance. Here is an example ETag:
In Apache, ETags are made out of three components: the INode, MTime, and Size.
You can configure your Apache server (in your httpd.conf file) to strip the server-specific component (the INode) out of each ETag, like so:
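A minimal sketch, assuming Apache's core FileETag directive: listing only MTime and Size leaves the inode number out of the tag, so identical files served from different machines can produce matching ETags (provided their modification times and sizes agree).

```apache
# Build ETags from modification time and size only, omitting the inode
FileETag MTime Size
```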
However, most of the websites that we tested don't bother configuring their ETags, so a simpler solution is to turn off ETags entirely and rely on Expires or Cache-Control headers to enable efficient caching of resources. To turn off ETags, add the following lines to one of your configuration files in Apache (this requires mod_headers, which ships with Apache but may need to be enabled in your build):
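A minimal sketch of the usual pair of directives, assuming mod_headers is loaded: the first removes any ETag header from outgoing responses, and the second tells Apache not to generate ETags for static files in the first place.

```apache
# Remove the ETag header from responses (mod_headers)
Header unset ETag
# Stop Apache from generating ETags for files (core directive)
FileETag None
```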
The effect of cookies on caching. Cookies are commonly used on the Web for tracking and saving state across browser sessions, but they are often overused. Researchers have found that popular sites indiscriminately set cookies for all their URIs, denying themselves the benefits of Content Delivery Networks (CDNs) and caching, both of which are impeded by cookies. For example, one study found that 66% of responses were uncacheable or required cache validation. A significant fraction of these uncacheable responses was due to the use of cookies (cookies were used in 47% of all requests).
Most sites use a Set-Cookie header path of root (/), which causes cookies to be sent with the request for every object. If you segregate cookied content, move images to a separate directory or server, and use more specific paths to assign cookies, you can minimize their impact on performance.