Anonymizing the logs

Tom Owad's picture
Offline
Last seen: 2 weeks 2 days ago
Joined: Dec 16 2003 - 15:14
Posts: 3385

Hi,
I'm a grad student working with David Patterson at the RAD Lab at the CS
department at UC Berkeley. Our lab is interested in building more reliable
internet services. You can read more about our lab here:

* our wiki page: http://radlab.cs.berkeley.edu
* slashdot article: http://slashdot.org/articles/05/12/15/1426223.shtml

To do interesting research, it's extremely useful to have traces/logs from
real-world internet sites that we can use in our experiments. For example,
here's a paper where we used data from Ebates.com to evaluate new algorithms
for detecting and localizing failures in Internet services:
http://www.cs.berkeley.edu/~bodikp/publications/icac05.pdf

I'm wondering whether you have any logs from the Applefritter website that
you could make available to our research group. For example, web server
access logs are very useful for understanding the traffic to a site.

Of course, we could help you to anonymize the data in case it contains any
confidential information.

If you cannot provide such data, could you recommend other people or
websites that could make their logs available or simply forward this email
to them?

Thank you!
Peter

I'd like to help them out, but first we obviously need to make sure we can anonymize the logs. So far, we're looking at replacing IP addresses with a hash, and doing the same for user IDs in /user accesses and message IDs in /privatemessage. Anything else you'd like to see anonymized? BDub and I will be taking a closer look at the logs for other potentially identifying information, but we'd also like to get a few more opinions.
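
For anyone who wants something concrete to poke at, here is a minimal Python sketch of scrubbing the IDs out of request paths. It assumes the IDs show up as numeric path segments such as /user/123 and /privatemessage/456; that URL layout, the SECRET_SALT value, and the helper names are all assumptions for illustration, not taken from the actual logs.

```python
import hashlib
import re

# Hypothetical secret; in practice generate it randomly and never publish it.
SECRET_SALT = b"replace-with-a-random-secret"

# Assumed URL layout: /user/<digits> and /privatemessage/<digits>.
ID_PATTERN = re.compile(r"/(user|privatemessage)/(\d+)")


def pseudonym(value: str) -> str:
    """Map an ID to a stable 12-hex-character token using a salted SHA-256."""
    digest = hashlib.sha256(SECRET_SALT + value.encode("utf-8")).hexdigest()
    return digest[:12]


def scrub_path(line: str) -> str:
    """Replace numeric user/message IDs in a log line with pseudonyms."""
    return ID_PATTERN.sub(
        lambda m: f"/{m.group(1)}/{pseudonym(m.group(2))}", line
    )
```

The same ID always maps to the same token, so requests for one profile can still be grouped together, but the original number is not written to the output.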

moosemanmoo's picture
Offline
Last seen: 10 years 1 month ago
Joined: Aug 17 2004 - 15:24
Posts: 686

I don't know what kind of logs you're talking about, but the kind of logs that my Apache server gives don't have a whole lot of information in them. What I think would be better than hashing the IPs is to resolve them to a hostname, then just omit everything but the last part of the hostname. For example, 24.14.2.91 might resolve to 24-14-2-91.hr.hr.cox.net, but you would only put cox.net. That is a fair amount of work for the nameservers, though, and it's probably not feasible.
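
A rough Python sketch of the resolve-and-trim idea, using the standard socket module. The function names and the "unresolved" placeholder are made up for illustration, and keeping the last two labels is a naive rule (it would turn something under co.uk into just "co.uk"). The cache is there so each distinct IP costs at most one reverse lookup.

```python
import socket
from functools import lru_cache


def last_two_labels(hostname: str) -> str:
    """Keep only the last two dot-separated labels of a hostname."""
    labels = hostname.rstrip(".").lower().split(".")
    return ".".join(labels[-2:])


@lru_cache(maxsize=None)
def provider_for_ip(ip: str) -> str:
    """Reverse-resolve an IP and return the trimmed hostname.

    Returns the placeholder "unresolved" when the lookup fails.
    """
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
    except OSError:
        return "unresolved"
    return last_two_labels(hostname)
```

Calling last_two_labels("24-14-2-91.hr.hr.cox.net") gives "cox.net".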

Tom Owad's picture
Offline
Last seen: 2 weeks 2 days ago
Joined: Dec 16 2003 - 15:14
Posts: 3385

Yes - Apache logs. Could you explain your reservations about hashing the IPs?

moosemanmoo's picture
Offline
Last seen: 10 years 1 month ago
Joined: Aug 17 2004 - 15:24
Posts: 686

MD5 hashes can be easily cracked-- with the proper tools, someone could brute-force an MD5 hash in about a minute. That, and it would probably make formatting the data a little harder. Replacing the IPs with some repeated letter like x would work better from a security standpoint.

Dr. Webster's picture
Offline
Last seen: 2 days 2 hours ago
Joined: Dec 19 2003 - 17:34
Posts: 1760

I wouldn't give them the IPs in any form, but I don't think there would be a problem with giving them the domains. You could resolve them to comcast.net, cox.net, etc., or just truncate the IPs to xxx.xxx.---.--- (truncating the IP at the subnet level).
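
A minimal Python sketch of the truncation option, using the standard ipaddress module. The function name and the default of keeping the first 16 bits (the first two octets, matching the xxx.xxx.---.--- pattern) are assumptions; the dropped bits are written out as zeros rather than dashes.

```python
import ipaddress


def truncate_ip(ip: str, prefix: int = 16) -> str:
    """Zero out every bit of the address past the given prefix length."""
    # strict=False lets us pass a host address instead of a network address.
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)
```

With the default prefix, truncate_ip("24.14.2.91") returns "24.14.0.0", so every visitor in the same /16 block looks identical in the output.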

BDub's picture
Offline
Last seen: 2 years 10 months ago
Joined: Dec 20 2003 - 10:38
Posts: 703
If a salted hash is insufficient

I'm against xxx.xxx'ing out the IPs because they likely want to trace how users navigate around a website.

Another option: we could give each IP that shows up its own unique identifier (the first IP in the logs is replaced with "1", the second with "2"), with every later occurrence of the same IP reusing the same identifier. This would allow them to trace a user around the site without giving any clues to the user's identity.
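
A minimal Python sketch of the numbering scheme. It assumes the client IP is the first space-separated field on each line, as in Apache's common log format; the class and function names are made up for illustration.

```python
class SequentialPseudonymizer:
    """Hand out 1, 2, 3, ... to IPs in order of first appearance."""

    def __init__(self) -> None:
        self._ids: dict[str, int] = {}

    def id_for(self, ip: str) -> int:
        # A new IP gets the next unused number; a known IP keeps its number.
        if ip not in self._ids:
            self._ids[ip] = len(self._ids) + 1
        return self._ids[ip]


def rewrite_line(line: str, mapper: SequentialPseudonymizer) -> str:
    """Swap the leading IP field of a log line for its sequential ID."""
    ip, sep, rest = line.partition(" ")
    return f"{mapper.id_for(ip)}{sep}{rest}"
```

The mapping table itself has to be discarded (or kept private) afterwards, since it is the only thing linking the numbers back to real addresses.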

BDub's picture
Offline
Last seen: 2 years 10 months ago
Joined: Dec 20 2003 - 10:38
Posts: 703
Re: MD5 hashes can be easily crac

moosemanmoo wrote: "MD5 hashes can be easily cracked-- with the proper tools, someone could brute-force an MD5 hash in about a minute. That, and it would probably make formatting the data a little harder. Replacing the IPs with some repeated letter like x would work better from a security standpoint."

First off, we'd obviously salt the hash. That's not even a question. Brute forcing becomes quite a bit harder in that case. Could you cite a source for the 'in about a minute' comment?

We could also use a hash scheme other than MD5.
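
A minimal Python sketch of one way a salted, non-MD5 hash could look, using HMAC-SHA256 from the standard library. The function names, the 32-byte salt, and the 16-character truncation are assumptions for illustration, not a decided design. The secret salt matters more than the choice of algorithm: there are only about 4.3 billion IPv4 addresses, so any unsalted hash can be reversed by hashing every possible address.

```python
import hashlib
import hmac
import secrets


def new_salt() -> bytes:
    """Generate a random secret salt; keep it private and never release it."""
    return secrets.token_bytes(32)


def pseudonymize_ip(ip: str, salt: bytes) -> str:
    """Return a stable 16-hex-character token for an IP under a given salt."""
    digest = hmac.new(salt, ip.encode("ascii"), hashlib.sha256).hexdigest()
    return digest[:16]
```

The same IP always maps to the same token for one salt, so a visitor can still be followed through the logs, and throwing the salt away after processing leaves no key for anyone to rebuild the mapping.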
