Apr 29 2005 03:16:00 PM EDT

My Interview With CacheLogic

My earlier blog entry questioning CacheLogic’s statistics about BitTorrent and peer-to-peer usage on today’s Internet stirred up a small controversy. I was criticized in private e-mail and even in a public posting to Dave Farber’s list. In a snarky public attack, Brett Glass said folks should be skeptical of me since I’m an “advocate of P2P,” even though all I was calling for was skepticism about CacheLogic’s “study” (which, it turns out, is not actually a study as that word is normally used, but instead is a set of PowerPoint slides). How one should be skeptical of a call for skepticism is best left as an exercise for the recursively inclined reader. I am, of course, not any great advocate of peer-to-peer file-trading, although I use the technology and necessarily am called upon to research it and/or demonstrate it from time to time.

At any rate, I was contacted by Andrew Parker of CacheLogic, who offered to explain to me what it is that his company does, and what he’s actually been saying with regard to CacheLogic’s statistics. What I came away with was the sense that there’s no particular reason to suspect the guy’s motives, but also that there’s plenty of reason to be skeptical of CacheLogic’s claims about P2P usage — not because they’re necessarily trying to misrepresent any facts, but because their data haven’t been published and can’t be checked independently.

For this reason, those who want to know how much P2P is being used would be better off relying on peer-reviewed public research such as the CAIDA study, which carefully qualifies its claims and explains its methodology. That the press (both the trade press and the mainstream press) uncritically reiterates CacheLogic’s numbers (which are published not as a paper but as part of a sales-oriented PowerPoint presentation) is as much a reason to criticize a credulous press as it is to criticize a company that’s just trying to make a buck. Parker expressly disavows any attempt to represent CacheLogic’s numbers as academic research; the press should take the hint and quit repeating CacheLogic’s sales-related claims as if they signified scientific research.

At any rate, here are my notes from my telephone interview with Andrew Parker. I should state here that I have shown him the following notes, and he helpfully offered corrections on a couple of points where he believed I misunderstood him. I have dutifully incorporated those corrections below. I should also note that Parker originally referred to the CacheLogic figures as a “study” (he even does it in a comment posting on this weblog), but now disavows any claim that there is a CacheLogic study: “I don’t really do ‘studies,’” Parker said in e-mail to me prior to the phone interview. “I published a series of slides last year covering what was going on in the world of P2P and since then journalists and others keep calling and asking for updated stats,” he said. So the hype about the CacheLogic number is really the journalists’ fault, in other words. I can’t say Parker’s wrong about that.

Andrew Parker says CacheLogic can’t give access to the methodology and the datasets. He says the datasets are “terabytes” in size, and that this is why they can’t be easily distributed or reviewed. He says the methodology is to “interrogate all traffic across the wire.” This approach is a response to the move of P2P services to dynamic port usage and their efforts to disguise their applications as other applications.

Parker says the CacheLogic dataset is to some extent based on Tier 1 providers, but that he “couldn’t go into specifics” as to how many Tier 1 providers he’s relying on. (Parker says his reason for not going into specifics is client confidentiality and mutual NDAs.)

He says, with regard to ISPs that are willing to participate in the company’s data gathering, CacheLogic is “more than happy to give them our devices.” He said the devices that are used to interrogate and measure traffic are standalone devices. The data in the presentations, he said, are based on 14 ISPs around the world, employing 14-20 of CacheLogic’s devices to measure traffic and report their results back to CacheLogic.

Parker says that a couple of universities of “mainland Europe” have approached CacheLogic to make use of their dataset and (maybe) their devices to generate academic papers of the traditional sort. He couldn’t say who they were because of NDAs, but could say they were not in the UK. “They’ve approached us,” he says, adding that CacheLogic is willing to share resources with them because “it’s not our business to do academic research.” Says Parker: “That’s something we can’t do.”

Parker says the methodology is to interrogate “the packet payload” and discover what protocol is being used. He is careful to say that CacheLogic doesn’t purport to know how much of the content distributed through a particular protocol is copyrighted, much less infringing.
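
For readers wondering what identifying a protocol from “the packet payload” can look like in practice, here is a minimal sketch. It is my own illustration and not CacheLogic’s method, which has not been published. It relies on one fact I am sure of: a BitTorrent peer connection opens with the byte 19 followed by the string “BitTorrent protocol”. The function names, the HTTP prefixes, and the sample flows are assumptions made up for the example, and a real classifier would need far more signatures than this.

```python
# Illustrative only: a toy payload classifier, not CacheLogic's technology.
# The BitTorrent peer handshake begins with a length byte (19) followed by
# the protocol string, regardless of which TCP port is in use.
BT_HANDSHAKE = b"\x13BitTorrent protocol"
HTTP_PREFIXES = (b"GET ", b"POST ", b"HEAD ", b"HTTP/")


def classify_payload(payload: bytes) -> str:
    """Guess the application protocol from the first bytes of a flow."""
    if payload.startswith(BT_HANDSHAKE):
        return "bittorrent"
    if payload.startswith(HTTP_PREFIXES):
        return "http"
    return "unknown"


def share_by_protocol(flows):
    """Return each protocol's fraction of total bytes.

    `flows` is an iterable of (first_payload, byte_count) pairs; this input
    shape is hypothetical and chosen only to keep the example small.
    """
    totals = {}
    for payload, nbytes in flows:
        label = classify_payload(payload)
        totals[label] = totals.get(label, 0) + nbytes
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {label: n / grand_total for label, n in totals.items()}


if __name__ == "__main__":
    # Made-up flows, purely to exercise the functions above.
    sample = [
        (b"\x13BitTorrent protocol" + b"\x00" * 8, 600),
        (b"GET / HTTP/1.1\r\n", 300),
        (b"\x00\x01\x02", 100),
    ]
    print(share_by_protocol(sample))
```

Note what a sketch like this cannot tell you: it labels the protocol, not the content, which is consistent with Parker’s caveat that the measurements say nothing about whether what is being transferred is copyrighted or infringing.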

To reproduce his company’s results, he says, a researcher would have to have “access to lots of service providers” and “the ability to recognize the applications on the wire.” When I asked whether it would be possible to duplicate his research without the CacheLogic proprietary technology, he said there were some open-source projects, like P2P Filter, that could be refined to do something like what CacheLogic’s technology does, although that refinement would take “man-days.” Parker pointed out that service providers around the world are employing different tactics and strategies to measure traffic and application use, and that this varies according to “geography.”

Parker says he “certainly wouldn’t attend a web-caching conference and present what we have here” — he is quick to disclaim any pretense of having produced academic research as such, although he adds that he believes CacheLogic’s conclusions are “reasonably accurate.”

With regard to my posted comment that the widely quoted CacheLogic figures were essentially taken from promotional literature, and as such should be approached skeptically, he said, “the point you made is not one that I would argue with.”

He later adds, however, the following:

“Despite not feeling that our reports have been verified by 3rd parties. I would suggest that your statement surrounding ‘figures taken from marketing literature’ to be stretching the truth somewhat. I said I would not argue that the numbers had not been verified and would not constitute an academic weight study, and as such I would not present it as a complete study of P2P activity on a global basis. The figures presented are not marketing waffle and our ISPs would not participate if they felt this was a) a marketing exercise or b) that the data was inaccurate.”

I’m not sure, however, that this really answers the charge that the figures are taken from promotional literature. Clearly, CacheLogic is offering products that purport to help ISPs assess and handle P2P traffic. To the extent that the problem of handling P2P traffic is made to look bigger, it creates a stronger argument that someone should be buying CacheLogic’s products. I would be happier if the CacheLogic claims were independently verified, and if the press didn’t report their statistical claims as if a real “study” had taken place. It is absolutely clear from my discussion with Parker that no study using the CacheLogic technology has taken place. Ergo, every press report that refers to the CacheLogic “study” is written by someone who didn’t bother to question how CacheLogic can claim to know what it says it knows about P2P traffic.

In a subsequent e-mail message to me, Parker did add a final qualification to CacheLogic’s claims. Despite some press reports that CacheLogic’s figures have file-sharing making up 60 to 80 percent of Internet bandwidth, Parker explains that the real figure is probably lower: “[A]s for the Internet as a whole it’s going to be less as [corporate networks] don’t have the same prolific filesharing habits as end-users on ISP networks.”
