Big Data, BI, Data Analytics 3 — The Big Intermezzo
When the Internet took off and data became big
Already back in 1992, when I was studying, there were predictions that distributed databases were coming. And years later, when I was working with reporting and data warehouse construction, a colleague of mine told me with a big smile that now he had Hadoop!
Hadoop was not the kind of distributed database we had learned about. It grew out of the need to handle the ever-growing amounts of data behind web-scale search engines like Google's. "Big" originally meant "bigger than fits on one computer", and what the open source community built, inspired by papers Google had published on its distributed file system and MapReduce, was a file system and a set of tools for managing and querying large amounts of data by distributing them across many computers and disks.
However, in the beginning, big data was something we talked about, but most normal companies and organisations didn't actually have it. It was not until large web shops became widespread and retailers such as supermarket chains began collecting data from Point of Sale (POS) systems, the computerised cash registers, that we saw a more common need for big data tools.
I got into it, a bit, in 2015. At that time, there were several ideas of what exactly big data was - how big is big? An analytics tool like R, for example, could work with up to around 64 GB of data in memory, and that was considered a very big dataset at the time. But, of course, Amazon and Google had datasets many times bigger. I remember, though, how excited the teachers at the R course I took were about its capabilities. Remember, in those days it was a challenge to even find a computer with 64 GB of memory, so the limit was, in practice, of an almost fantasy size.
It was around that time that the first Data Scientists came out of the universities with dedicated degrees in data science. Job titles like Data Scientist and Data Analyst had begun appearing, and the area was getting hyped - many people wanted to jump in and become "something with data", mainly big data.
Just a few years later, however, I talked to one of the big vendors of Hadoop tools, and they told me that the idea of "big data" was fading. Hadoop wasn't needed any more in most cases, as other data management products had become capable of handling much larger datasets than before.
But from around 2004 to 2015, we saw a transition in the common mindset about data analytics: from working mainly on one data source (even if it was a data warehouse fed by several OLTP databases in the organisation) to working on data from several, often external, sources. Sales data from the webshop would be combined with demographic data from research institutions or commercial vendors, and enriched with social media data or whatever additional information could be obtained.
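As a minimal sketch of what that combination of sources can look like in practice - the file names and column names here are purely hypothetical - a few lines of Python with pandas could join webshop sales with purchased demographic data:

import pandas as pd

# Hypothetical source files and column names - purely illustrative.
sales = pd.read_csv("webshop_sales.csv")             # e.g. order_id, customer_zip, amount
demographics = pd.read_csv("zip_demographics.csv")   # e.g. zip, median_income

# Join internal sales data with the external demographic data on the shared postal code.
combined = sales.merge(demographics, left_on="customer_zip", right_on="zip", how="left")

# A simple cross-source question: average order amount per income bracket.
combined["income_bracket"] = pd.cut(combined["median_income"], bins=4)
print(combined.groupby("income_bracket", observed=True)["amount"].mean())

The point is not the few lines of code but the habit of reaching beyond a single internal source whenever it adds context.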
Of course, not all analytical tasks would require that, but the mindset was focused on the ability to work with big data.
The talk continued for a while, but since around 2020 it has faded. We still work with big datasets and still combine data from different sources, but we no longer talk about it as big data. Now it is just data.
While the talk focused on big data, we also saw many of the dedicated BI consulting companies being bought up by larger enterprises - the big accounting firms did that, and so did the big IT companies.
Tool-wise, the main focus settled on Tableau, Qlik, and Microsoft Power BI. To stay relevant, a data analyst would need skills in one of these plus Excel, and sometimes also Python or R. With such a toolset, most analytical tasks can be done.
For a data engineer or a data scientist, skills in database products are needed as well. Nobody asks for Hadoop skills.