What is Big Data, and who cares? A non-technical take

If you work in information technology, or a related field, you can’t have failed to notice one buzzphrase: “Big Data”. Seemingly every job description, architectural roadmap, or sales pitch contains it. What does it mean?

The technology underpinning Big Data (and the history of that is for another post) means that storing, and more importantly, processing and accessing, incredibly large amounts of data is cheaper and more accessible than ever before. Most companies of a reasonable size can keep and analyse more information than they’ve ever been able to.

So far, so good. The explosion of the World Wide Web, and now mobile personal computing devices like phones and tablets, means that there’s far, far more data for companies to look at than ever. Looking back 25 years, financial transactions were probably the highest-volume data generated by the average person (unless I’m overlooking something). EPOS (electronic point of sale) data, from when stores went from manually keying values on a till to scanning barcodes, would also have been (and remains) high volume.

Buying items in a supermarket, though, is nothing compared to the page views, the clicks, the texts, emails, and messages we generate daily. Photos uploaded, tweets posted, steps counted, calories logged: we are now excellent at getting data into electronic repositories.

All of this is hardly earth-shattering, or surprising. What doesn’t often come to mind is what happens to this data “below the surface”. We’re all comfortable with the fact that data displayed on various websites is stored by said sites, but what else do they know about us?

Let’s take Facebook as an example. If you have a Facebook account, and have logged in on a computer, they can track every visit you make to a site which has a Facebook integration, whether that is to log in at that site, or just to share the content. Even if you’ve logged out, they still know which of these sites you visited, and when.

And there’s more. Facebook bought Instagram, so they can (probably, I don’t have any inside information) associate your activity and photos from there with your Facebook account. Which, by the way, they very strongly want you to populate with your real identity. So your anonymous Instagram account can be associated with your Facebook account.

None of this explicitly requires big data technologies, but they make the retention and analysis of data at this scale much, much easier. Previously, this sort of activity would have taken some seriously expensive kit and months of planning to implement. Now, ideas can be implemented almost as quickly as they’re conceived. Companies can pivot how and why they work far more easily.

Now, maybe you don’t have a Facebook account. You’re not out of reach for their data analysis, unfortunately. Facebook also owns WhatsApp, and WhatsApp is on your phone. Well, it might not be on your phone, but that doesn’t matter. Here’s why.

WhatsApp makes finding your friends really easy, because it doesn’t have user names. It uses phone numbers, which is really convenient, and a pretty good way to identify that someone is who they say they are. Of course, as a phone app, it can get this information seamlessly (after asking for permission). And you give it permission, because it doesn’t work otherwise. Now WhatsApp has all the phone numbers in your contacts, or at least a version of them that can be compared against other people’s phone numbers.

So, if your phone number is on the phone of a friend who uses WhatsApp, WhatsApp can (and does) compare it to those of other people signing up for the app, so it can show people their friends. It’s perfectly possible that WhatsApp (and by extension Facebook) has a very good idea of your group of friends and acquaintances, and all without you going near or using any of their products.

I’ve chosen Facebook as an example, and they do have a vast pool of your information to draw on, but it’s not just them. Everyone who offers a service for free or nearly free is looking to make money by selling you to advertisers, and the more they know, the better they can sell you. All of this activity is supported, and in large part enabled, by the big data technology stack.

I don’t think that any of this is necessarily a bad thing, but I do think that people aren’t aware of the amount of their data that is being captured, stored, and analysed. And that is a bad thing.