ISP Tech Talk
by Avi Freedman
I notice that Jack's all hot about caching technology, so I thought I'd give a bit of a technical background on it. In a month or two, I'll hopefully be able to give some real-world experience, as we're considering building some web caches to improve quality for our downstream ISP customers. Of course, U.S.-based ISPs know almost nothing about caching compared to our European and Australian/Pacific brethren, since our bandwidth is essentially free compared to theirs, but we're interested and should be capable of learning quickly...
WHAT IS CACHING?
The idea/goal of caching web (HTTP) content is this:
When someone requests a web page, intercept that request, either by playing TCP/IP tricks or by having the user's browser send the request to you. If you (the intercepting box, or "proxy") have the web page requested, and you think it's a fairly recent copy, and it's something that you think is "static content" (not likely to be constantly changing) and it's something that your user is authorized to retrieve, send it back to the user without having to use Internet bandwidth to get the data or wait for the data to come back.
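As a rough sketch of that decision (not how any particular proxy implements it), here's the check in Python. The names CachedPage, MAX_AGE_SECONDS, and serve_from_cache are made up for illustration, and the 24-hour freshness window is an assumption.

```python
import time
from dataclasses import dataclass

# Assumption: treat anything older than 24 hours as too old to serve.
MAX_AGE_SECONDS = 24 * 3600

@dataclass
class CachedPage:
    body: bytes
    fetched_at: float   # Unix time when the proxy stored this copy
    is_static: bool     # our guess that the content rarely changes

def serve_from_cache(cache, url, user_allowed, now=None):
    """Return the cached body if all four conditions hold, else None."""
    now = time.time() if now is None else now
    page = cache.get(url)
    if page is None:                                # we don't have it
        return None
    if now - page.fetched_at > MAX_AGE_SECONDS:     # not a recent copy
        return None
    if not page.is_static:                          # likely to have changed
        return None
    if not user_allowed:                            # user may not retrieve it
        return None
    return page.body
```

A None result means the proxy has to go out to the Net for the page; anything else goes straight back to the user.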
What this means is that the data is returned faster to the end user - and the end user, his ISP, and you (if you're the provider to the ISP) all save on Internet bandwidth.
If bandwidth is expensive, this is a huge win! For example, until recently in Australia, bandwidth cost AU$.19/megabyte retrieved from the Net to the ISP. Outbound bandwidth was (and is currently) "free." But disk space only costs US$.04/megabyte, give or take a few pennies. So it's cheaper to save everything ever retrieved on disk if someone's going to access it just once more!
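To make the arithmetic concrete, here's a small Python sketch using the two prices quoted above. Note that they're in different currencies (AU$ and US$) and this sketch ignores the exchange rate, so treat it as a back-of-the-envelope comparison; the function names are just illustrative.

```python
# Figures quoted in the text; exchange rate deliberately ignored.
FETCH_COST_PER_MB = 0.19   # bandwidth, per megabyte retrieved from the Net
DISK_COST_PER_MB = 0.04    # disk, per megabyte stored

def cost_without_cache(accesses, size_mb=1.0):
    """Every access pulls the object across the paid link."""
    return accesses * size_mb * FETCH_COST_PER_MB

def cost_with_cache(accesses, size_mb=1.0):
    """One paid fetch plus the disk to keep it; later accesses are local."""
    if accesses == 0:
        return 0.0
    return size_mb * (FETCH_COST_PER_MB + DISK_COST_PER_MB)

for n in (1, 2, 5):
    print(n, round(cost_without_cache(n), 2), round(cost_with_cache(n), 2))
```

For a 1 MB object fetched once and never again, the cache costs slightly more (0.23 versus 0.19). On the second access it's already ahead (0.23 versus 0.38).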
In the U.S. (and elsewhere), quality rather than cost is the big win from caching. When a national provider's network is having a very bad day, aggressive caching is a way to get the users zippier access to the data. This makes them happy and reduces the number of "the Net is slow" calls you take.
HOW HUGE A WIN?
With a 4 GB disk, 128-256 MB of RAM, and a decently fast CPU, it's fairly easy to get 25-30 percent "cache hit rate" (percentage of HTTP requests that are serviced by the cache).
Getting much more than that is trickier, but 40 to 45 percent is somewhat achievable. More disk is necessary, but the key factor is a larger user population.
USER POPULATION SIZE
The more end users hitting the cache, the higher its hit rate is likely to be, because in a larger user population the chances are better that two people want to go to the same place on the Web.
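A toy example in Python shows the effect. It assumes an unlimited cache that never expires anything, and the two request lists are invented; the point is only that pooling users turns one user's miss into another user's hit.

```python
def hit_rate(requests):
    """Fraction of requests already seen by an unlimited, never-expiring cache."""
    seen = set()
    hits = 0
    for url in requests:
        if url in seen:
            hits += 1
        else:
            seen.add(url)
    return hits / len(requests) if requests else 0.0

# Invented request streams for two users.
alice = ["news", "sports", "news"]
bob = ["news", "weather", "sports"]

separate = (hit_rate(alice), hit_rate(bob))   # each user behind their own cache
shared = hit_rate(alice + bob)                # both users behind one cache
print(separate, shared)
```

Behind separate caches, only one of the six requests is a hit. Behind one shared cache, three of the six are.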
It's possible to set up hierarchies of caches using ICP, the Internet Cache Protocol, and to configure your cache to get the data requested by users of other caches in the cache hierarchy, thus simulating a larger user population and making your cache more effective. Basically, cooperating caches (either on the same network or on networks of friendly ISPs) query each other for content before going out to get it from the Internet.
A simple modification allows you to simulate the larger user population. When your sibling asks "do you have http://www.reallystick.com/sticky.jpg?" you say "no." Then wait a minute and ask the sibling for that URL. It'll have it (if it's to be found on the Net). You retrieve it, and have it in case one of your sticky-keyboard-crowd users wants it.
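Here's a minimal in-memory model of that trick in Python. It is not the ICP wire protocol: the Cache class, its method names, and the dict standing in for the Internet are all assumptions made for illustration.

```python
class Cache:
    def __init__(self, origin):
        self.origin = origin            # dict standing in for the Internet
        self.store = {}                 # what this cache holds
        self.siblings = []
        self.wanted_by_siblings = []    # URLs siblings asked for that we lacked

    def icp_query(self, url):
        """Answer a sibling's 'do you have this URL?' question."""
        if url in self.store:
            return True
        self.wanted_by_siblings.append(url)   # remember it for later
        return False

    def get(self, url):
        """Serve a user request: local copy, then siblings, then the Net."""
        if url in self.store:
            return self.store[url], "local hit"
        for sibling in self.siblings:
            if sibling.icp_query(url):
                self.store[url] = sibling.store[url]
                return self.store[url], "sibling hit"
        self.store[url] = self.origin[url]
        return self.store[url], "miss"

    def prefetch_from_siblings(self):
        """Later on, copy what siblings asked about from whoever has it now.
        (A real cache would send a query; here we peek at the sibling's store.)"""
        for url in self.wanted_by_siblings:
            for sibling in self.siblings:
                if url in sibling.store:
                    self.store[url] = sibling.store[url]
                    break
        self.wanted_by_siblings.clear()

URL = "http://www.reallystick.com/sticky.jpg"
origin = {URL: b"jpeg bytes"}
a, b = Cache(origin), Cache(origin)
a.siblings, b.siblings = [b], [a]

status_b = b.get(URL)[1]        # b asks a, a says no, b fetches from the Net
a.prefetch_from_siblings()      # a quietly grabs the copy from b
status_a = a.get(URL)[1]        # a's own user now gets it locally
print(status_b, status_a)
```

None of a's users ever asked for the image before it landed in a's cache, which is the whole idea: b's user population is working for a as well.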
A common architecture is to have caches peer (through routers) across smaller regional exchange points or via private interconnects. If ISP A's cache sends twice as much data to ISP B's cache as it gets back, payment may need to flow from ISP B to ISP A. Then again, ISP A already had the data, so it's "found money," and this data doesn't necessarily need to be charged for at the full rate.
Cooperating caches working together as above are one way of prepopulating caches with content. The idea is to get data into your cache - even if you'll never use it - on the off chance that a user will want it, and that you'll then save expensive Internet bandwidth - and give better service to your users.
Another method is to examine which sites are typically in the cache, and have your cache check the validity of the popular ones and re-scan them every morning at, say, 9-10 AM, after the news sites have usually been updated. Often, this prepopulation is done when bandwidth is cheaper or when the network is just less utilized. Your upstream provider may not agree, but you might as well blow some bandwidth at this time (if you're paying "flat-rate pricing," as in the U.S.) in hopes of better service later.
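Picking the "popular sites" can be as simple as counting hosts in the cache's request log. The sketch below assumes you already have the requested URLs as a plain list; the popular_hosts name and the sample log are invented.

```python
from collections import Counter
from urllib.parse import urlsplit

def popular_hosts(requested_urls, top_n=3):
    """Return the top_n most-requested hostnames, most popular first."""
    counts = Counter(urlsplit(url).hostname for url in requested_urls)
    return [host for host, _ in counts.most_common(top_n)]

# Invented sample of logged requests.
log = [
    "http://news.example/front",
    "http://news.example/world",
    "http://sports.example/scores",
    "http://news.example/front",
    "http://sports.example/scores",
    "http://rare.example/page",
]
print(popular_hosts(log, top_n=2))
```

The resulting list is what a morning job would walk through, re-checking each site's pages before the users show up.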
We're going to experiment with this with some other ISPs, so it's quite possible that we'll have some idea of good prepopulation strategies to present here in a few months.
Another approach is to have a managed service that feeds you with a "cache prepopulation" feed via satellite. That way, terrestrial backbones aren't congested with the prepopulation traffic - and, without participating in ICP meshes (meshes of cooperating caches), you get the benefit of a potentially huge user population. As I write this, things are just starting to get in gear with SkyCache, so hopefully there'll be some technical news about it to report in a few months.
Without going into detail, there's some question about whether keeping copies of other people's data is legal - whether it's a copyright violation. Common sense tells me no, it shouldn't be, but the legal system may decide otherwise.
PROBLEMS: STALE DATA
You're the proxy cache. If you have a copy of a web document in your cache, you could just return it to the client. But should you? How do you know it didn't change? Well, you could make a conditional HTTP request every time to see whether the document has been modified since you retrieved it (using the If-Modified-Since request header). Even though that request takes a small amount of bandwidth, it adds latency to the operation, defeating the idea of using caching to raise quality.
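For reference, this is roughly what such a check looks like when built with Python's standard library. The function name is made up, and the sketch only constructs the request; it doesn't send it.

```python
from email.utils import formatdate
from urllib.request import Request

def build_revalidation_request(url, fetched_at):
    """Build a GET that asks the server to reply only if the document
    changed after fetched_at (a Unix timestamp)."""
    request = Request(url)
    # HTTP dates are expressed in GMT, hence usegmt=True.
    request.add_header("If-Modified-Since", formatdate(fetched_at, usegmt=True))
    return request

req = build_revalidation_request("http://example.com/index.html", 0)
print(req.header_items())
```

If the document hasn't changed, the server answers with a short "304 Not Modified" and no body, so the bandwidth cost is small. The round trip is what hurts.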
Usually people set some intermediate max-time-to-cache that they feel they can live with. If 24 hours proves to be reasonable, you should just seek cache software that can re-check all documents overnight.
PROBLEMS: CACHE BUSTING
Many content providers don't like the idea that you could be caching content. They need to know how many times someone saw a banner ad, for example.
How do they keep you from caching it? Modern browsers and caches should recognize and honor the "Pragma: no-cache" directive and "Expires: 0" lines.
Also, caches can't store CGI-bin and other dynamic content for later retrieval. Making ads come back from CGI and server-side includes winds up defeating caches. If the site designers are smart, they'll write CGI to generate references to banner GIFs and JPEGs, which themselves can be cached, saving bandwidth and download time, and making the site seem "faster" to the user.
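Put together, a cache's "may I keep this?" test might look something like the Python sketch below. The is_cacheable function is hypothetical, and treating any URL with a query string as dynamic is my assumption, not a rule taken from any particular cache.

```python
from urllib.parse import urlsplit

def is_cacheable(url, response_headers):
    """Rough check: honor the cache-busting headers and skip dynamic content."""
    headers = {name.lower(): value.strip().lower()
               for name, value in response_headers.items()}
    if "no-cache" in headers.get("pragma", ""):
        return False                      # "Pragma: no-cache"
    if headers.get("expires") == "0":
        return False                      # "Expires: 0"
    parts = urlsplit(url)
    if "/cgi-bin/" in parts.path:
        return False                      # CGI output is generated per request
    if parts.query:
        return False                      # assumption: query string means dynamic
    return True

print(is_cacheable("http://example.com/banner.gif", {"Content-Type": "image/gif"}))
print(is_cacheable("http://example.com/cgi-bin/ad.cgi", {}))
```

The banner GIF passes and the CGI script doesn't, which is why generating a reference to a static banner image is friendlier to caches than generating the banner itself.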
SO WHAT CAN I EXPECT?
With a moderately fast Pentium, 64 MB, and a 4 GB disk, you can expect a 25 percent cache hit rate. More important than the bandwidth savings, though, is probably that you'll have much lower latency on those 25 percent of your customer requests.
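To get a feel for the latency effect, here's the weighted-average arithmetic in Python. The 25 percent hit rate is from the text; the 20 ms and 400 ms figures are invented for illustration, so plug in your own measurements.

```python
def average_latency_ms(hit_rate, hit_ms, miss_ms):
    """Expected latency when a fraction hit_rate of requests is served locally."""
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms

# Assumed numbers: 20 ms from the cache, 400 ms from the origin server.
print(average_latency_ms(0.25, hit_ms=20.0, miss_ms=400.0))   # 305.0
```

The average only drops from 400 ms to 305 ms, but it hides the real story: one request in four comes back in 20 ms instead of 400.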
It'll take more memory and disk - and, more importantly, more traffic - to get higher hit rates. 35-40 percent hit rates should be obtainable with a few T-1s' worth of web requests going through the box.
HOW DO I PUT A CACHE IN MY NETWORK?
One way to go about it is to convince all of your customers and your customers' customers to set their web browsers up to use your proxy cache server specifically.
Another approach is to use routers to redirect all traffic destined to port 80 (the standard HTTP port) to the proxy cache box, which then responds to the client as if it were the remote host. Most web proxies will use their own IP address to fetch content. If they do that, they only have to be in the outbound path from the web client. Some web proxies (such as Mirror Image's) need to be bidirectionally in the loop because when data needs to be fetched from a real off-site server, they make requests as the web client, not with the proxy's own IP address.
For our application, the Mirror Image approach doesn't work, since we see outbound packets from many of our dual-homed T-1 customers, but are not always symmetrically in the "return path."
PROXY CACHE SOFTWARE
The most popular free software is Squid, available at www.nlanr.net/Squid. A list of other proxy-caching software is available at http://ircache.nlanr.net, as is a description of their Global Cache Mesh project.