BIG DATA. Quality vs. Quantity

By: Bogdan Iancu - BEng

Data. Big data. More data. Everywhere. On every device. From every input. Digital. Analogue. Constant. Programmed. Scheduled or random.

Maybe the time has come for us all to move away from quantity and choose quality data instead. To drive all industries forward.

Social media platforms, big data analytics firms, AI and ML software companies, and everyone and anyone in between argue that the best way to get ahead is more data, more inputs, more often.

But what gets me is the one size fits all approach.

And emerging evidence suggests that this approach doesn't always work.

I'm not sure why we've done it this way, and why we insist on keeping it the same. Maybe because at the helm there is a bunch of brilliant kids: young, supremely capable programmers banded together to develop advanced algorithms that solve complex problems, where data is key. Great minds.

Exquisite mathematical skills. However, none of them truly understands real business. And they don't seem to grasp that a big corporation with thousands of employees, multiple geographies and maybe tens of thousands of clients will be vastly different from a business with, say, 30 or 50 people and 20 or 100 clients in a year.

And that B2B and B2C are not the same. Retail differs from hospitality or tourism, eCommerce, finance, digital marketing or heavy engineering, mining or defence. It is not simple.

Still, what you'll find is that we're all asked the same questions when we set up our companies on Facebook, Instagram or even LinkedIn. We're asked to select industry, company size and type, plus a URL and website. That's it. It is at least questionable why I cannot select multiple geographies for my business. Or for my profile.

I mean, I cover multiple countries in a year… And no, I do not want to create a different company for our US subsidiary. It doesn't make sense. It may make sense for BHP or Coca Cola (mind you, that creates many sub-companies that don't actually exist and skew the data to no benefit; that is data pollution).

But not for us. On Facebook it's even easier. You name your Page (as in your business), add a category to describe it and enter business information, such as the address, category (you can pick only three) and contact information. And boom. You're up and running.

And then you're taught how to advertise. And things get really confusing. Real fast.

Not because it is complicated to create ads. It is not overly complex, and certainly not an insurmountable challenge. I'll have a piece on that topic soon.

But once all the ads are set and you're ready to go, it is rather sad and downright perplexing to see how many "blank" hits you get. And by blank, I don't mean false hits from automated bots. Those are a problem too. The issue driven by too much cluttered data and not enough quality data is that you pay for results that are not relevant. At all. Regardless of the coarse filters put in place, allegedly selecting only the relevant data.

Case in point: we ran a recent advertising campaign on LinkedIn where we specifically targeted medium-sized companies. Businesses with fewer than 1,000 employees. Only in Australia.

First, quite a few hits came from sole traders (which were purposely excluded when we set up the campaign), but more concerningly, our ads managed to reach companies with more than 5,000, or even 10,001+, employees despite all the filters.

To confirm that, we had hits from companies like BHP, NAB, ANZ, Rio Tinto, etc. (all with more than 1,000 employees), but also from SNC Lavalin Suez (not in Australia).

And not just one or two hits or impressions, but quite a few actually. So you end up paying not just, say, $5 per click as the system tells you, but rather $8 or $10 per relevant click. And that is the key word. Relevant. To you. And your business.
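To put rough numbers on it (the figures below are hypothetical, chosen only to illustrate the arithmetic): if only 60 of 100 billed clicks are actually relevant, a $5 cost per click becomes roughly $8.33 per relevant click.

```python
# Hypothetical figures, only to illustrate the effective cost of a "relevant" click.
cost_per_click = 5.00      # what the platform bills you per click
billed_clicks = 100        # total clicks you paid for
relevant_clicks = 60       # clicks that actually matched your targeting

effective_cost = cost_per_click * billed_clicks / relevant_clicks
print(f"Effective cost per relevant click: ${effective_cost:.2f}")  # ~$8.33
```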

The rest is nothing but noise. And if the system's filters cannot cut through the data noise to generate only relevant results (the ones you should be charged for), then that's the platform's problem, not yours.

You should not pay for noise.

See below. We always aim to reach CXO level (decision-making personnel) in whatever company we target (this ad included). But many "unpaid", "entry" or "training" level profiles have been reached by LinkedIn's algorithm, which is based on preset filters that are not properly calibrated and cater, more than they'd like to admit, to B2C businesses rather than B2B.

And it seems like they don’t properly update many of these filters to reflect where people fit in a company (up or down the food chain).

What about countries?

This one I do not have a clear answer for. It seems to be an error, as in the algorithm is clearly not perfect and every now and then targets another company with our ad, regardless of location. Hence "Greater Tucson Area" or "NYC Metro". Maybe…

What I find questionable, though, is the fact that these companies force you to pay for these "blank clicks".

And there is little you can do about it. And you know what? We would actually have no problem paying for those if their touted ML/AI algorithms actually worked. As in, after say 30 or 60 days you'd see that their robots have learned what your business is all about and can filter through all of that noise to deliver relevant results.

Unfortunately, that is not what is actually happening here. Not to our business anyway. Nor to many businesses that we work with.

This may look just like a rant, but that’s not what this is all about.

I'm just trying to frame what is actually wrong with too much data. One possible explanation is that the data is too "coarse". It doesn't have the "granulometry" required for an advanced AI/ML system to make decisions and apply successive learnings to provide valuable returns.

One size fits all doesn't translate into the same results for B2B vs B2C businesses, nor for similar companies in the same sector, even if they are the same size.

The second explanation is the old adage "garbage in, garbage out". Too many profiles created on LinkedIn lack quality, so when targeted they return "false positives" or "false negatives", depending on whether the profile has been updated (but with wrong data) or the data is simply too old: of good quality when created, but not relevant anymore.

The third is that too much data is being generated without knowing what to do with it from the get-go. In other words, we have a lazy approach to data creation and collection.

Because on every such project, we always start with the idea that we should get as much information as we possibly can and then we’ll see what we can do with it, who wants to buy it, in what form and so on. So, the more the merrier.

Sure, there are many areas where this is exactly what you want to do. Mapping, military applications, software analytics, behavioural science and analysis, retail shopping, etc.

But there are many more fields where you need to be more precise with data collection from the beginning, because you must at least entertain the idea that a defined outcome should exist, and it can be helped by a particular data stream.

That outcome should be defined and understood in scope and boundaries.

Or at least we ought to use the information generated from the data points to answer a particular question that we enunciated before we started the collection.

So, the data should be repetitive, predictable, collected from clearly defined and pre-specified inputs, and accurate enough to match:

(1) a corresponding process, which it intends to trend and control, or

(2) a clearly specified problem requiring one (or more) solution(s), which it intends to solve or address, or

(3) a structured system with certain variation(s), which it intends to enable, empower, correct or drive.
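As a rough sketch only (the class and field names below are my own illustration, not an established standard), this is the kind of specification that could be written down before a single data point is collected:

```python
from dataclasses import dataclass
from typing import List

# Hypothetical "collection spec": the question, inputs and intended use are
# defined up front, before any data is gathered.
@dataclass
class DataCollectionSpec:
    question: str              # the outcome or question the data must answer
    inputs: List[str]          # clearly defined, pre-specified input sources
    intended_use: str          # "trend/control", "solve/address" or "enable/correct/drive"
    accuracy_required: str     # how accurate the data must be to be useful
    collection_cadence: str    # repetitive, predictable collection schedule

ad_spec = DataCollectionSpec(
    question="Which ad clicks come from decision-makers at Australian companies under 1,000 employees?",
    inputs=["company size", "company location", "seniority level"],
    intended_use="solve/address",
    accuracy_required="company size and location current within the last 12 months",
    collection_cadence="per ad click",
)
print(ad_spec)
```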

At the moment all we do is collect data, and that generates in return a lot of pain, confusion and quite a few fundamental errors. Sure, sometimes we need a lot of data to generate algorithms and forecast trends, for example. No doubt that in many, many fields this is the right approach, because we might not always know what we need or what we can find, so data will generate possible options, scenarios and variants for us.

What we are questioning here though is the idea that this is “the only approach”. It is not.

Many times, quality trumps quantity. That doesn’t mean that we should not have enough quantity. We absolutely must. But we should have the right amount of quality data to solve a puzzle. And the definition of right amount of quality data (and collection points/ways) changes based on the problem that needs a solution.  

Let me give you a brief example. I have a friend who just bought a state-of-the-art SMART fridge. IoT-enabled. One of the features this fridge has is that it records how many times you open and close the fridge per day, and trends that for you. That is what we call "dumb data" collection.

No doubt it can provide some insights when it comes to behavioural analysis. Maybe if you keep increasing the number of times you open and close that fridge, you might have bulimia, or maybe you buy groceries way too often, or whatnot.

But because the data is too coarse, it doesn't actually tell us if you are opening the fridge to get something out (or to put something in). It doesn't tell us what that item might be, nor how often you do it. Or who has actually opened that fridge. Was it you, or someone else? Maybe you have a compulsion (OCD or something) and love to open and close the fridge. Many scenarios. Way too many.

Now let's consider the quality approach and imagine a scenario where that fridge collects data from a SMART sensor located where you normally put your bottle of white wine.

And the data shows that you've opened the fridge and taken the bottle out four times. Then, based on historical data, an ML algorithm concludes that you do that only five times before a new bottle is stored.

So, the system triggers the SMART fridge to order a replacement bottle for you. That would make sense, and it would solve a defined problem (ordering a new bottle of wine).
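A minimal sketch of the kind of trigger logic implied here (the function name and the five-withdrawal threshold are assumptions, standing in for whatever the ML model would actually learn):

```python
# Hypothetical sketch: reorder once the learned usage threshold is about to be hit.
LEARNED_WITHDRAWALS_PER_BOTTLE = 5   # assumed output of the ML model's historical analysis

def should_reorder(withdrawals_since_last_restock: int) -> bool:
    """True once the current bottle is one withdrawal away from being finished."""
    return withdrawals_since_last_restock >= LEARNED_WITHDRAWALS_PER_BOTTLE - 1

# The shelf sensor has logged four withdrawals since the last new bottle.
if should_reorder(4):
    print("Order a replacement bottle of white wine")
```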

Collecting a lot of dumb data because maybe one day it will be useful makes very little sense in this scenario. Not to mention that we're all different, so some level of customisation will still be required to accommodate the needs of any family.

If you think that this applies only to B2C or some fringe applications or social media, then think again. It applies to many areas we collect data for. Whether industrial, commercial or retail. Healthcare too.

In the industrial sector our thinking always starts with what we want to control or use the data for; only then do we proceed by adding sensors to various controllable (or monitoring-only) instruments or devices to remotely perform those functions/reviews based on a defined logic.

If more data can be generated in this process (and used for some other purpose later), then sure. Great. But you never confuse essential data and information with good-to-have data that may or may not be used by some yet-to-be-developed algorithm that might help us solve some other discrete problem in the future. Fit for purpose comes first.

For example, IIoT-enabled control of a pumping station, or continuous monitoring of all process motors on a minerals processing site to increase MTBF (mean time between failures). All of these use a certain number of inputs to generate solutions.

More data is generated in the process, and it could potentially be stored and used later for various improvements in efficiency. But that is data that you trust: it is not tampered with, you usually have at least two sensors confirming each output, plus redundancy.
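A minimal sketch of that kind of cross-check (the tolerance and function name are illustrative assumptions, not taken from any particular control system):

```python
from typing import Optional

def validated_reading(sensor_a: float, sensor_b: float, tolerance: float = 0.05) -> Optional[float]:
    """Accept a reading only when two independent sensors agree within a relative tolerance."""
    reference = max(abs(sensor_a), abs(sensor_b), 1e-9)
    if abs(sensor_a - sensor_b) / reference <= tolerance:
        return (sensor_a + sensor_b) / 2.0
    return None  # disagreement: do not trust this data point

# Example: two vibration sensors on the same process motor (values in mm/s).
print(validated_reading(4.02, 4.05))  # sensors agree -> trusted average
print(validated_reading(4.02, 5.60))  # sensors disagree -> None, flag for review
```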

And that’s how you look at the data. Make sure you can collect enough data from as many relevant key points as possible (quantity) so you can use only the necessary, trusted inputs to find a solution to a specific problem (quality).

Allow for customisation (granulometry) and fine-tuning aligned with a particular problem, and understand that while the inputs might come from the same place, the problems might vary slightly; therefore, most times no two solutions will look the same.

You can have the same problem in two different places and although you know what the inputs should be, fine-tuning is required to render similar outputs. Or the problem might vary with time, and in this case, you need to adjust the data stream (and sometimes change or add data points, without compromising the data quality) to be able to continue solving that problem.


Special Edition