Data Deduplication

Deduplication: Good for the Bottom Line, Bad for Security

As practically every aspect of our daily lives begins moving slowly to the cloud, it behooves us to inspect this “cloud” and find all the chinks in the armor that is protecting an increasingly large percentage of our lives. While the goals of cloud providers and customers seem to be one and the same where security is concerned – no one wants the cost of a breach and a lawsuit, after all – there are certain calculated risks that cloud providers often take with your data that you may not be aware of. In some cases they may not even be aware of the risk themselves!

Data deduplication is just such a risk and it’s absolutely rampant. Cloud storage provider DropBox performs data deduplication, as does cloud backup provider BackBlaze – but what is it and why is it so dangerous?

Data deduplication, in its simplest terms, means that the service looks at a file before it’s uploaded and, if it already has a copy of that file somewhere, doesn’t actually upload it – it just makes a note of where that file resides and that this user has a copy of it too. On the face of it this seems harmless enough: it saves cloud services the extra cost of bandwidth and storage space and it saves you on upload times for common files – win/win, right? Except that it provides a handy side channel for an attacker to gather information.
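
To make that concrete, here is a minimal sketch of the idea in Python. It is not how DropBox or BackBlaze actually implement it; the `DedupStore` class, its method names, and the choice of SHA-256 as the fingerprint are all assumptions for illustration.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store: one stored copy per distinct file content."""

    def __init__(self):
        self.blobs = {}   # content hash -> file bytes (stored once)
        self.owners = {}  # content hash -> set of users who "have" the file

    def upload(self, user, data):
        """Record that `user` has `data`; return True only if bytes were stored."""
        digest = hashlib.sha256(data).hexdigest()
        is_new = digest not in self.blobs
        if is_new:
            self.blobs[digest] = data  # first copy: actually keep the bytes
        # Every uploader is noted as an owner, new file or not.
        self.owners.setdefault(digest, set()).add(user)
        return is_new


store = DedupStore()
first = store.upload("alice", b"common file contents")
second = store.upload("bob", b"common file contents")
print(first, second, len(store.blobs))  # True False 1
```

Note the `owners` table: the service has to remember who holds each file, and that is exactly the record a warrant would ask for.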

Let’s say I’m the MPAA and I’ve got a copy of a pirated movie that I suspect people are storing in their DropBox account or have backed up using BackBlaze. Now if I use either service to upload that video (and only that video) while monitoring my network traffic it’s pretty easy to see the difference between the hundreds of megabytes of traffic uploading a full video creates compared to the several kilobytes it takes to confirm the file already exists somewhere and log my copy of it. This should be enough to obtain a warrant, assuming a reasonably technically adept judge, and legally coerce these providers into providing records of which customers have these files.
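
A rough simulation of that measurement, assuming a protocol where the client sends a hash first and uploads the file only if the server doesn't recognize it. The `Server` class and both function names are hypothetical, and real clients will differ in the details, but the traffic asymmetry is the point.

```python
import hashlib


class Server:
    """Hypothetical provider that remembers the hash of every file it holds."""

    def __init__(self):
        self.known = set()

    def has(self, digest):
        return digest in self.known

    def store(self, digest, data):
        self.known.add(digest)


def upload(server, data):
    """Upload `data` and return how many bytes the client put on the wire."""
    digest = hashlib.sha256(data).digest()
    sent = len(digest)  # the 32-byte hash always goes out
    if not server.has(digest):
        server.store(digest, data)
        sent += len(data)  # full upload only when the server lacks the file
    return sent


def someone_already_has(server, data):
    """Attacker's probe: a tiny upload means the file was already stored."""
    return upload(server, data) < len(data)


server = Server()
movie = b"x" * 1_000_000
upload(server, movie)  # some other customer stored it earlier

print(someone_already_has(server, movie))               # True
print(someone_already_has(server, b"y" * 1_000_000))    # False
```

The probe is not free of side effects: checking for a file that isn't there uploads it, so the attacker ends up storing a copy too.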

Now a bunch of people in the audience just jumped up and said “But I’m not a pirate! It’s illegal and they should get caught anyway!” or “Hey neat! A new tool we can use to catch child pornographers!” – OK, probably not; those sorts of people aren’t usually in my demographic, but let’s pretend they did. I would remind those well-meaning but ultimately dangerously wrong folks that if the MPAA can do it and the government can do it, so can anyone else. With a good lawyer or a little social engineering, it’s not hard to come up with a half-dozen contrived attacks to flush out the non-criminals too.

But let’s be clear: Without a warrant or some very clever social engineering, it’s unlikely that anyone can get confirmation that you, specifically, are storing a specific file – just that someone is – but that’s an important starting point in the footprinting process, revealing data about the internal workings of a system that can, combined with other seemingly trivial bits of information, lead to absolutely massive breaches.

What’s worse is that this completely circumvents all of the other security measures such services typically use! DropBox, for example, encrypts every last file you upload, but not before deduplication. Given that they also hold the keys necessary to decrypt everything they store, this is a nightmare for anyone who finds themselves in the crosshairs. Technically speaking, deduplication and encryption shouldn’t work together at all, given that a properly encrypted file is indiscernible from random data, but by checking prior to encryption DropBox et al. circumvent this limitation… and their own security measures.

So how do you protect yourself? You’ve got a few options. First, you can take the encryption out of the cloud and put it on your own terms. Use software like TrueCrypt to encrypt potentially sensitive or damning data before uploading it to the cloud; that way your provider can neither check for duplication nor get at the encryption keys. Alternatively, look for a better provider. There are a handful of providers, like Wuala or SpiderOak, that do what SpiderOak has dubbed “zero-knowledge data backup” – all files are encrypted with a key that only you have before leaving your computer. This costs the providers a few extra dollars in storage and bandwidth, sure, and realistically those costs do get passed on to the customer, but how much is your privacy worth?
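
Here is a small sketch of why encrypting on your own machine defeats deduplication. The cipher below is a toy keystream built from SHA-256, made up for this example so it runs with nothing but the standard library; it is an assumption for illustration only and is not what TrueCrypt or SpiderOak use. Don't protect real data with it.

```python
import hashlib
import os


def keystream_xor(key, nonce, data):
    """Toy stream cipher: XOR data with SHA-256(key || nonce || counter) blocks."""
    out = bytearray()
    for i in range(0, len(data), 32):
        counter = (i // 32).to_bytes(8, "big")
        block = hashlib.sha256(key + nonce + counter).digest()
        out.extend(b ^ k for b, k in zip(data[i:i + 32], block))
    return bytes(out)


def encrypt(key, data):
    nonce = os.urandom(16)  # fresh randomness for every upload
    return nonce + keystream_xor(key, nonce, data)


def decrypt(key, blob):
    return keystream_xor(key, blob[:16], blob[16:])


plaintext = b"the same file on two laptops" * 100
alice_key, bob_key = os.urandom(32), os.urandom(32)
alice_blob = encrypt(alice_key, plaintext)
bob_blob = encrypt(bob_key, plaintext)

# What the provider would fingerprint for deduplication:
alice_fp = hashlib.sha256(alice_blob).hexdigest()
bob_fp = hashlib.sha256(bob_blob).hexdigest()
print(alice_fp == bob_fp)  # the two fingerprints should not match
```

Identical files leave the two machines as unrelated blobs, so the provider has nothing to match on and no key to undo it with.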

Aside from the benefits such zero-knowledge techniques provide to the customer, they also provide significant legal protection to the host. You can’t be sued for knowingly hosting any kind of content when you literally have no way of discovering what your clients are storing. Cloud providers typically enjoy “safe harbor” protections, here in the U.S. at least, but the anti-piracy campaigns are slowly eating away at that and the day may come when client-side encryption isn’t just a convenience, but an absolute necessity.

Comments

  1. deduper says

    >Technically speaking, deduplication and encryption shouldn't work together at all, given that a properly encrypted file is indiscernible from random data, but by checking prior to encryption DropBox et al. circumvent this limitation… and their own security measures.

    This is wrong, it is possible to deduplicate post-encryption. Microsoft Research developed an algorithm to do just this several years ago. The way it works is you encrypt the block with its plaintext hash as the key with a symmetric cipher and upload that to the service. Then, encrypt the plaintext hash with the user's key and upload that too. Hence, you only transfer the encrypted pair (E1(B,H(B)), E2(H(B), K)) = (e,k), which may be decrypted with the key K via D1(e, D2(k,K)).

    • says

      Interesting, I wasn't aware of Microsoft Research's algorithm – still, adding deduplication to encrypted data adds a side channel attack against said encrypted data, however small an attack. And properly encrypted data should still appear random. This definitely merits research on my part, however little it affects my conclusions in the article. Thanks for the post!
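
For the curious, the scheme deduper describes can be sketched roughly as follows. This follows only the notation in the comment, (E1(B, H(B)), E2(H(B), K)); it is not Microsoft Research's actual implementation. The toy SHA-256 keystream cipher and all function names are assumptions chosen so the example runs on the standard library alone.

```python
import hashlib
import os


def xor_stream(key, nonce, data):
    """Toy stream cipher: XOR data with SHA-256(key || nonce || counter) blocks."""
    out = bytearray()
    for i in range(0, len(data), 32):
        counter = (i // 32).to_bytes(8, "big")
        pad = hashlib.sha256(key + nonce + counter).digest()
        out.extend(b ^ p for b, p in zip(data[i:i + 32], pad))
    return bytes(out)


def convergent_encrypt(block, user_key):
    h = hashlib.sha256(block).digest()          # H(B)
    e = xor_stream(h, b"", block)               # E1(B, H(B)): no randomness
    nonce = os.urandom(16)
    k = nonce + xor_stream(user_key, nonce, h)  # E2(H(B), K): per-user
    return e, k


def convergent_decrypt(e, k, user_key):
    h = xor_stream(user_key, k[:16], k[16:])    # D2(k, K) recovers H(B)
    return xor_stream(h, b"", e)                # D1(e, H(B)) recovers B


block = b"a block that many users happen to share" * 50
alice_key, bob_key = os.urandom(32), os.urandom(32)
e_alice, k_alice = convergent_encrypt(block, alice_key)
e_bob, k_bob = convergent_encrypt(block, bob_key)

print(e_alice == e_bob)  # True: identical ciphertext, so it can be deduplicated
print(convergent_decrypt(e_bob, k_bob, bob_key) == block)  # True
```

Because `e` depends only on the block itself, anyone who already holds the plaintext can compute the same `e` and check whether the service has it, which is the side channel mentioned in the reply above.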
