Vocabulary

Special terminology used in Data Analytics

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z     

A

Actionable

This word is mostly used in the combination “actionable insights”, meaning that any insight provided by data analytics should come with a direct usefulness – a guidance on what to do. The alternative could be “fun”, “interesting”, “of philosophical value” or similar, but in a business context, all that is normally not wanted. Data analytics is performed to provide directly actionable new knowledge that can guide management decisions.

ANSI SQL

Also sometimes called Standard SQL, and today referring to ISO/IEC 9075 “Database Language SQL”, in its relevant version (typically the latest, which, at the time of writing, is ISO/IEC 9075:2023). The original ANSI document, ANSI X3.135, dates back to 1986. Since ISO adopted it shortly after, it has effectively been an ISO standard almost from the start, and it is described in several parts, each with its own document. More on this from ANSI and ISO.

App analytics

The analytics related to users’ behavior in an app. Click analysis, conversion analysis (if you are selling something or trying to make the users do something specific), and retention analysis are some aspects, and these are closely related to what you want to do in web analytics. This means that tools for one of these areas often handle the other as well, especially if the app’s user interface runs in a web browser.

B

Benchmark

In order to understand if a value is to be considered high or low, or a development over time good or bad, it needs to be measured up against something. A benchmark to measure up against can be, for instance, the average number or development for competing businesses, or a specific competitor. If it is found that our sales of a certain category of products have increased by 10% year over year, but the benchmark says 15%, then our sales must be seen as less good. We then know that we did something wrong, or that there is an unfulfilled potential for us to exploit, if this sales increase is a continuing trend in the market.
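The comparison described above can be sketched in a few lines of Python. All figures here are invented illustration values, not from any real dataset:

```python
# Comparing year-over-year sales growth to a benchmark (made-up numbers).

def growth_rate(previous: float, current: float) -> float:
    """Year-over-year growth as a fraction, e.g. 0.10 for 10%."""
    return (current - previous) / previous

our_growth = growth_rate(previous=200_000, current=220_000)  # 10%
benchmark = 0.15                                             # market grew 15%

gap = benchmark - our_growth
print(f"Our growth: {our_growth:.0%}, benchmark: {benchmark:.0%}, gap: {gap:.0%}")
```

A positive gap, as here, signals that we grew more slowly than the market – exactly the “less good” situation described above.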

Business Intelligence (BI)

(coming soon)

C

Cloud-based

This typically means software-as-a-service delivered through the Internet, established and maintained in a cloud data center. There is, hence, no software to install on-premise, but it may be that we need to do the maintenance in the cloud ourselves. So, it can either be a black box service, delivered out of the cloud, or simply a different place for us to install and administer the software. For the user, it makes no difference: it will in all cases be used the same way.

A hybrid solution is often also possible, with parts of the system cloud-based and other parts on-premise.

D

Dashboard

An overview of selected metrics, provided in a shape that is immediately decodable by the intended audience – for instance the management of the company. Simple shapes can be pie charts, traffic lights, plain numbers, curves, or other common representations of numbers.

For certain audiences, more complex representations may be used, if these are both useful and understandable for exactly these people. A dashboard is the typical result of visualization of data, even though a dashboard most often will hold several such visualizations on the screen at the same time.

A dashboard may provide functionality for drill-down or other interactions with the visuals.

Data

A representation of measurements or other collected representations of something that is or has happened, such as keyboard input. Data is plural, technically speaking with “datum” as the singular, even though this is rarely used, and “data” is typically used also for single pieces of data.

When some data have a recognizable value, suitable for informing someone, they can be called information. Data without such a value can often be combined through summarization or other calculations, or they can be counted, or used together with other data, to provide the wanted informational value. That is one goal of data analysis; the further goal is to find pieces of information that are new or can lead to a new understanding of something – these can then be called insights.

Data analysis

The process of making information out of data, looking for insights – often in the shape of trends or changes in data over time or across space or business function. The analysis typically has a goal of finding the answers to some initial questions, but will often reveal more than originally asked for. A sorting process is therefore needed, to decide what is useful to bring forward as insights.

Data analytics

A broader scope of activities than data analysis, basically covering the whole business of handling the analysis: the business analysis, talking to the customer of the analysis work, preparation of datasets (sometimes with the use of machine learning or other advanced preparation techniques), and the follow-up after the analysis – preparing visualizations, for instance in the shape of reports or dashboards, and adding data storytelling to highlight the interesting aspects, their explanations, and their value.

Database

A storage for data. Often, “database” is also used as a short form for “database server software”, so that, e.g., Microsoft SQL Server is called “a database”. Technically, a database is, however, the specific set of tables, stored procedures, and other elements that make up the frame for handling a specific set of data, including the data themselves.

Database engine

The software that handles the functionality of a database server – effectively the complete database server software package. Microsoft SQL Server is, hence, a database engine. More precisely, the engine should comprise only the functionality that handles things such as a query, a request to create a table, or other actions toward the data and their storage and retrieval. This means that user interface elements or secondary functions, for instance for upgrading the software or many other purposes, aren’t part of the engine.

Data cleansing / data cleaning / data scrubbing

This is the preparation of a dataset to be useful for analytics. It consists of a number of practical improvements of the data, such as deduplication, adding missing values, removing irrelevant data, and whatever is needed for the dataset to become useful for making statistics or other treatments planned with it.

The different names for the process mean the same thing, even though some people want them to mean something different – often with data cleansing being the more thorough process, while data cleaning is quick and often automated.

Data scrubbing often means exactly the same, but sometimes it has a completely different meaning, focusing on the ongoing maintenance of data in memory, which is a definition promoted on Wikipedia, among other places.

What these different meanings imply is that you should check with whoever you are communicating with, so that you both understand the term you use the same way.
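A cleaning pass like the one described above can be sketched with only the Python standard library. The records and field names here are invented for illustration:

```python
# Minimal data cleaning: deduplication and filling missing values.
rows = [
    {"id": 1, "city": "Berlin", "age": 34},
    {"id": 1, "city": "Berlin", "age": 34},  # exact duplicate
    {"id": 2, "city": "",       "age": 28},  # missing city
]

# Deduplicate while keeping the original order.
seen, cleaned = set(), []
for row in rows:
    key = tuple(sorted(row.items()))
    if key not in seen:
        seen.add(key)
        cleaned.append(row)

# Fill in missing values with an explicit placeholder.
for row in cleaned:
    if not row["city"]:
        row["city"] = "UNKNOWN"

print(cleaned)
```

In practice, tools or dedicated libraries do this at scale, but the operations – deduplicate, fill, discard – are the same.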

Data-driven

An adjective implying that data and its analytical treatment are an important contributor to whatever is data-driven – typically decisions. The depth of the term varies: it can mean that nothing happens without the correct data, or, at the other end of the scale, that data somehow influences what happens.

Often, it doesn’t mean a lot, to be honest. Decisions may still be made on a gut feeling, even if a company is proclaimed to be data-driven. But, at least, having such a term in use typically means that there are data analytics activities going on (perhaps somewhat automated), using suitable software, people spending time, and someone taking care of creating or obtaining the needed data and delivering the information and insights in a formalized way to the intended users.

Data insights

The qualified new information that came out of a data analysis process. Often just called “insights”. They are typically connected with the term “actionable”, since most companies don’t want pointless insights, or insights that do not help them make decisions – hence, the qualification process, that should lead to the presentation of only the actionable insights, while any others are seen as worthless (even though the analyst may keep a note of them, to be considered in future analyses).

Data lake

A collection of data from different sources that are not unified into a common structure but rather left as they are – with proper tools for retrieving them. In contrast to a data warehouse, there can be different formats, for instance some relational databases, some text files, and other things.

The idea of creating data lakes spawned out of the big data wave, where it became clear that it could be problematic to attempt to bring all the acquired data together into one common structure, especially if it wasn’t known which potential uses the data would have in the future.

Data management

(coming soon)

Data mart

A smaller data warehouse, often used for a reduced set of purposes, for instance by one department. The data may be an extract from a central data warehouse, or they may be loaded in from operational databases or other sources.

Data mining

The process of searching for both patterns and anomalies in datasets, thereby getting to an understanding of the nature of the data. A simple example is a correlation diagram, where two attributes are held up against each other to see if they are related, but clustering techniques, classification, and other aspects are also included.

Wikipedia describes six classes of tasks included: anomaly detection, association rule learning, clustering, classification, regression, and summarization.

Data pipeline

A series of activities for moving data from one place to another. ETL and ELT are data pipelines, but a pipeline can be constructed as needed, and it can be for one-time use or a regular, perhaps scheduled, activity.

Several tools exist for designing, managing, and operating data pipelines.
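At its core, a pipeline is just a sequence of steps that data flows through. A minimal sketch, with invented data and an in-memory list standing in for the target store:

```python
# A data pipeline as plain functions: extract -> transform -> load.
def extract():
    # In reality: read from an API, a database, or files.
    return ["  Alice , 34 ", "Bob,28", "Alice , 34"]

def transform(lines):
    rows = [tuple(part.strip() for part in line.split(",")) for line in lines]
    return sorted(set(rows))  # trim whitespace and deduplicate

def load(rows, target):
    target.extend(rows)       # in reality: insert into the target store

warehouse = []
load(transform(extract()), warehouse)
print(warehouse)
```

Pipeline tools mainly add scheduling, monitoring, and retry logic around steps like these.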

Dataset

A set of data – this can have any form, but it is typically ordered in a table, or several tables, and will often be in CSV format if it is meant for general distribution to data analysts.

There can be restrictions on how a dataset may be used, and also how it must be referenced. “Open data” is a term often used for datasets with some kind of license that gives the most levels of freedom when using the data.

The world produces enormous amounts of data, and a large portion of them are gathered into publicly available datasets that can be downloaded from, for instance, government or university websites, or the websites of international organizations.
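Reading a small CSV dataset takes only a few lines of Python. The dataset here is inlined for illustration; normally you would open a downloaded file instead, and the city figures are made-up examples:

```python
import csv
import io

raw = """city,population
Berlin,3645000
Oslo,693494
Aarhus,349983
"""

rows = list(csv.DictReader(io.StringIO(raw)))
total = sum(int(row["population"]) for row in rows)
print(len(rows), total)
```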

Data science

A scientific discipline that blends several other disciplines, such as statistics, mathematics, and computer science, and with a central element of data analytics.

When seen as a job title in a company, a Data Scientist works with a broad scope of data analytics tasks, but often with a weight on programming and algorithms, also including machine learning and other AI development, while a Data Analyst has more weight on front-end analytical tools like spreadsheets and visualization tools – but it can vary a lot. Historically, data science, statistics, and data analysis have overall meant the same, but with different people defining the terms differently.

If nothing else is defined in the situation: expect more programming, and perhaps more mathematics, when working as a data scientist, than when working as a data analyst.

Data source

Any place where data can be found when preparing for data analytics, population of a data warehouse, or any other purpose that requires data. A system that asks for a data source will typically be limited with regard to which types of data sources it can handle.

There are a number of technologies that can abstract data sources, such as ODBC and JDBC, and thereby help make more source types useful for a database or analytics product that supports one of these abstraction technologies.

Data storytelling (Storytelling)

There are different ideas on what exactly data storytelling is: some want it to be “storytelling through data”, others are more into making it “a blend of visualization and text”, and yet others favor a textual narrative with added data.

What is certain is that data, visualization, and a textual narrative of some kind should all be included. This can lead to traditional reports with included diagrams, to infographics, or simply to a dashboard with some textual notes on it, pointing out what a local minimum of a graph means, for instance.

If you take the “story by data” approach, you may carefully pick a series of different angles of the same overall story, and illustrate them with different data representations. This way, you can draw your audience in, step by step, to see and understand those insights you want to share, using just a few words to tie the story together.

If you take the narrative approach, you will use the text to describe the world as you see it, and illustrate it with visualized data where needed for a better understanding.

The hype about data storytelling appeared as a reaction to a sad tendency to just present a bunch of diagrams and traffic lights in a dashboard, without further explanation. Now, having this terminology, it is easier to both remember and get the time budget for explaining things more thoroughly.

Adding some text to these dashboards makes them more comprehensible, and is a representation of the “blend” approach.

Data warehouse (DW)

A collection of data from several sources, structured to facilitate business intelligence and data analytics – hence, with a focus on fast output based on larger amounts of data, in contrast to an operational (typically relational) database’s organization around handling one set of data, optimized for integrity and singular updates.

The data structure is typically multi-dimensional, using OLAP.

E

Embedded

Integrated as a component of something else. An embedded database server, for instance, is functional only as part of the application it has been embedded into, and is used mainly for providing its functionality to that one application.

Extract, Load, Transform (ELT)

As compared to ETL, this is a similar approach to populating a data warehouse, except that the transformation of data happens after data have been loaded into the data warehouse.

This approach may be preferable when it is not known at load time which format is preferred – or if several different formats could be wanted over time. The loaded data exist in a staging area inside the data warehouse itself, available for being transformed when needed. However, the transformation can also be done immediately after loading, which then effectively just adds the possibility of using the data warehouse system’s resources for the transformation, and is otherwise very similar to ETL.

ELT is typically preferred over ETL when adding data to a data lake.

Extract, Transform, Load (ETL)

When populating a data warehouse, data need to be extracted from their original source, converted or transformed into a format that will fit the data warehouse, and then loaded into the data warehouse. There are special tools for this complete process, but it is also possible to do the steps using separate tools.

The transformation will typically take place in a separate staging area, which can be arranged as files or a database, and the main purpose of the transformation process is to homogenize data from different sources, so that they follow the same format and can be seen in the data warehouse as a homogeneous set of data.
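The three steps can be sketched with Python and an in-memory SQLite database standing in for the data warehouse. The two “sources”, their formats, and all values are invented for illustration:

```python
import sqlite3

# Extract: records from two hypothetical sources with different formats.
source_a = [("2024-01-05", "DK", 100.0)]       # already in target format
source_b = [("05/01/2024", "Denmark", 250.0)]  # dd/mm/yyyy, full country name

# Transform: homogenize source B into the format used by source A.
def transform_b(row):
    date, country, amount = row
    day, month, year = date.split("/")
    codes = {"Denmark": "DK"}
    return (f"{year}-{month}-{day}", codes.get(country, country), amount)

# Load: insert everything into the "warehouse".
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE sales (sale_date TEXT, country TEXT, amount REAL)")
dw.executemany("INSERT INTO sales VALUES (?, ?, ?)",
               source_a + [transform_b(r) for r in source_b])
rows = dw.execute(
    "SELECT sale_date, country, amount FROM sales ORDER BY amount").fetchall()
print(rows)
```

After the load, both records follow the same date and country conventions, which is exactly the homogenization the transformation step is for.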

F

G

Geographic data

(coming soon)

Geographic Information System (GIS)

(coming soon)

H

I

Integrated Development Environment (IDE)

(coming soon)

Indexed Sequential Access Method (ISAM)

(coming soon)

J

K

L

M

Massively Parallel Processing (MPP)

A type of database that splits the load and access over multiple servers.

Metric

Element of measurement – something you can measure and determine a value of. In a table describing the features of a person, such as height, shoe size, etc., each of these parameters is a metric.

When working with data analytics, the available metrics for anything you are analyzing will set some limits for what you can do. At times, it is possible to arrange for more metrics to be collected, for future analyses, but it may also be possible to find the needed data in a different database, possibly in an external database that relates to the topic.

For instance, obtaining a dataset with demographic information, such as the number of inhabitants in a city, means that you can filter the analyzed data to include only people living in cities with more than a certain number of inhabitants – provided that you already know which city your data subjects live in.
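That kind of enrichment and filtering is simple to sketch. All names and figures here are invented:

```python
# Enrich subject data with an external metric (city population) and filter.
city_population = {"Berlin": 3_645_000, "Oslo": 693_494, "Aarhus": 349_983}

people = [
    {"name": "Anna", "city": "Berlin"},
    {"name": "Bent", "city": "Aarhus"},
    {"name": "Cleo", "city": "Oslo"},
]

# Keep only people living in cities with more than 500,000 inhabitants.
big_city_people = [p["name"] for p in people
                   if city_population.get(p["city"], 0) > 500_000]
print(big_city_people)
```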

Multidimensional Expressions (MDX)

A database query language used by some OLAP databases, as the SQL query language traditionally used for OLTP relational databases is less suited for querying OLAP cubes. However, there are other alternatives as well, and each database engine has its own set of possible query methods, as described in a section of a long page of OLAP server comparisons on Wikipedia.

N

NoSQL database

The term is short for a “Not only SQL” database system, where data aren’t relationally arranged. You may actually be able to use SQL as a query tool, but it will then be adapted by the system to give you the requested data.

There are many different types of NoSQL databases, and each has its purpose, as well as advantages and disadvantages.

One type is the graph database, which focuses on making it easy to map and search for related data of the same type, for instance people – very useful in social media contexts, among others.

Another type is the document-oriented database, which contains full sets of data, for instance a complete person record, in a separate document that can then be retrieved in one go – which is fast, if this is what you need. For instance, an XML database has data readily available as XML documents that can be forwarded as-is to a client application.

Notebook

This can mean different things, depending on context, but one is the document with program code and its results, nicely arranged after each other, that some programming environments can offer. Such a notebook can be shared with others who can see exactly what you have done in your analysis, and they can verify that the results match the code. As you can annotate everything by adding text lines where you want, it can become a good and precise account of a well-made analysis.

O

Object-relational database

(coming soon)

Online Analytical Processing (OLAP)

A type of database whose data structure and engine processes are optimized for analytical treatment of the data, i.e., output, by a few users at a time.

Often, a data warehouse is an OLAP database that has been loaded with data from one or more OLTP databases – the latter could be finance and production systems, and perhaps several other systems such as CRM or HR. Performing analyses directly on these systems’ OLTP databases would both burden them and slow them down for their intended use, and the analyses would be slow as well.

Therefore, it can make sense to extract (copy) the data needed for analysis, for instance once a day, during off-hours, and then load them into an OLAP database for the analytics work.

OLAP databases use OLAP cubes, or hypercubes, to relate data to each other. This concept is multidimensional, and, hence, one piece of data can be related to many others. This, however, is still often arranged in a relational database, hence the term Relational Online Analytical Processing (ROLAP), that is occasionally seen.

Other varieties are, among others, Multidimensional OLAP (MOLAP) and Hybrid OLAP (HOLAP).

Online Transaction Processing (OLTP)

A type of database whose structure and engine processes are optimized for creation and updates of data by many users simultaneously. A traditional relational database is of this type. The normalized data ensure that updates will go well, and fast, as there’s only one place to update each data element, even though the full set of data in an update can be split over several tables.

Through a mechanism with “logs” (simple files listing all requested transactions, complete with all data) for storing incoming create and update requests, these are effectively queued, making the client able to move on with other tasks. Should there happen to be a conflict, for instance if two clients try to update the same data, then one will be carried through, the other rolled back, and a message typically sent back to the client, which then must handle the communication with the user.
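The carry-through-or-roll-back behavior can be demonstrated with Python’s built-in sqlite3 module as a stand-in for an OLTP database. The account table and amounts are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL)")
con.execute("INSERT INTO account VALUES (1, 100.0)")
con.commit()

try:
    with con:  # commits on success, rolls back on an exception
        con.execute("UPDATE account SET balance = balance - 150 WHERE id = 1")
        (balance,) = con.execute(
            "SELECT balance FROM account WHERE id = 1").fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")
except ValueError:
    pass  # in a real system, a message would go back to the client

(balance,) = con.execute("SELECT balance FROM account WHERE id = 1").fetchone()
print(balance)  # back at 100.0 - the update was rolled back
```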

One thing that makes the relational/OLTP database less suitable for analytics is that a number of data tuples will be locked when accessed, thereby blocking other operations from taking place at the same time – this, to reduce the number of collisions when updating. As the OLAP database isn’t meant to be updated ad hoc, such mechanisms can be switched off there, making access to data fast and unconstrained. Another aspect is the need in a relational database to create several extracts (cursors) from different tables in order to get one complete set of data corresponding to the query. In the OLAP database, it is more typical to have all data that belong together in the same tuple in the same table, so that just one look-up is needed. This inherently leads to redundancy, but that is accepted with OLAP.

On-premise

An adjective used, typically, with software, sometimes also with the hardware it runs on, and perhaps other equipment, such as network infrastructure. The term is used in opposition to having “cloud-based” software – or, indeed, software that runs anywhere else than on the company’s own premises. It could, technically speaking, be in the neighboring building and doesn’t have to be in the cloud.

Whatever it is opposed to, on-premise means that the software runs in your own server room, on your own servers, and needs installation, maintenance, upgrades, backups, etc., to be done by your own IT department or their consultants. Also, running software on-premise implies a requirement for electricity and cooling installations, and an assessment of the security aspects. Another typical argument for using a cloud-based installation instead is that the on-premise servers typically are used only for a few hours per day, but you’ll have to pay for the full server anyway, thereby wasting the extra capacity and the money it costs. Cloud-based often means that the capacity can be shared with other companies, thus maximizing the utilization of the hardware, the human resources, the infrastructure, the installations, etc., and sometimes also the software.

However, on-premise means that you have full control, and that your software may be used even if there are problems with the Internet or with an external datacenter.

You will very often have the choice between installing some software on-premise, or subscribing to a service that gives you access to a cloud-based edition of it.

Open source

(coming soon)

P

Q

Query

A request message to a database server to return the specified data. A query will have to follow a defined format, such as SQL, and the results are usually sent back as a table with text contents.

There are, however, also other ways of working with queries, such as the query composer in Microsoft Access, where you add the tables in a graphical window, then select by drop-downs which fields you want to include, and which criteria count for the selection. With this input, Access produces an SQL statement behind the scenes and sends it to the database engine. The result can then come back in a grid window, where you can resize columns and potentially also alter the data directly.

For most typical database requests, you will either write the SQL statement yourself, or some application will do it in the background as part of its functionality, and you will just use its buttons, menus, etc., without thinking much about what communication takes place with the database. But queries are there, nonetheless.

Query federation

The possibility for a query tool to query across several databases and return one combined result to you.

R

Relational Database Management System (RDBMS)

Often talked about as simply a relational database, even though that would technically be only the data and their defining structure (table information, user access information, etc.), not the system that manages them.

A relational database is arranged with normalized data, meaning that these exist only once in the database and are distributed over several tables with links in between them, in order to show the connections between related data while still avoiding duplicating them.
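A minimal normalization sketch, using Python’s built-in sqlite3 module with invented tables and values: each customer name is stored once, and orders reference it by id, so a join can reassemble the related data without duplication.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customer(id),
                         amount REAL);
    INSERT INTO customer VALUES (1, 'Ada'), (2, 'Bo');
    INSERT INTO orders VALUES (10, 1, 50.0), (11, 1, 75.0), (12, 2, 20.0);
""")

# The join follows the link between the tables.
rows = con.execute("""
    SELECT customer.name, SUM(orders.amount)
    FROM orders JOIN customer ON orders.customer_id = customer.id
    GROUP BY customer.name
    ORDER BY customer.name
""").fetchall()
print(rows)
```

If a customer’s name changes, it is updated in exactly one place – the point of normalization.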

This type of database was seen as a significant step forward when it was developed by Edgar Codd around 1970, and it became the dominant type of database for many years.

The RDBMS – meaning the database engine, i.e., the system for managing such databases – follows some specific ideas for handling data storage, reception of commands, optimization of retrieval requests, etc., so the different products on the market tend to have a lot in common. Today, they all support SQL, but in the early days, some of them used other query languages.

Report

A document describing something in an ordered way. In relation to data and databases, there has been a long tradition of making reports in the shape of listings of data, possibly with some calculations included, for instance sums and averages, and, as printer technologies evolved, also with charts and other illustrations of the data.

This kind of report has largely been replaced by dashboards and other means of getting access to the data, both in their raw form and as processed overviews and insights, but reports can still make sense at times.

There is a discipline within journalism called data-driven journalism, and in a sense, an article that is rich in data, typically represented by infographics and diagrams, with some textual descriptions as well, could be called a report – resembling some of the most elaborate examples from the earlier days.

The availability of traditional reporting tools is quite limited today, but it is possible to set up something similar in various other tools, such as a modern text processor like Word or LibreOffice – with integrated data from spreadsheets, diagrams from a diagram tool, or from a presentation tool like PowerPoint. All of this can be integrated so that it updates as data change, making it a live report, or use fixed data, if that is preferred.

And, of course, today, a report mostly doesn’t need to be printed – it works well in an electronic form, such as a PDF file, to be read on a screen.

S

Script

For a data analyst, or a programmer, a script is typically a piece of software source code that is meant to be run as is, i.e., not compiled into a program first. The typical use for scripts is in the operating system’s command line interface, which often gives the possibility to write several commands after each other in a file, and then execute all of them in one go. That file is a script.

Some command line interfaces offer quite advanced scripting possibilities – such as Bash, which has existed in the Unix/Linux world for a long time, hence also on a Mac, since macOS is Unix-based (even though it looks like its own system on the surface), and it has been implemented for Windows too.

You can make scripts in many other places too, but the terminology is typically used only for such lightweight programming. It may also be called a batch file, even though this term is mostly used with Windows and DOS.

And one more place where program code is typically called a script is in an interactive programming environment, typical for Python or R – for instance in the tools that let you make notebooks.

Scripting

The process of writing a script. It is also used as an adjective, for such things as scripting tools, meaning tools that can be used for making scripts.

Statistics

Another, and older, word for data analysis. The typical distinction between the two terms is based on the historical use of statistics as formulas and methods used to analyze data, not necessarily by computer; the newer term data analysis then came into use as computers became common.

An aspect of statistics is data collection, and this may include surveys or other mechanisms that gather data.

As a scientific area, statistics has its traditionally included subareas, but there is no general consensus to be found on the exact division between statistics and data analysis.

Structured Query Language (SQL)

One of several possible ways of querying a database. SQL was developed in the early days of relational databases and is specifically suited for specifying the relations that form the set of data you are after, as well as the wanted filtering, sorting, and aggregation functions.

SQL is used through a textual command, but it can be abstracted by a graphical interface, which is commonly seen in many tools.

The result is typically a list of data.

As you can perform all kinds of communication with the database through SQL, it covers four overall functional areas, each with its own name:

  • Data Query Language (DQL): For querying data
  • Data Definition Language (DDL): For creating/altering tables, user accounts, and other structural database objects
  • Data Control Language (DCL): For managing the rights of the users
  • Data Manipulation Language (DML): For inserting, updating, and deleting data

T

U

V

Visualization

This can mean different things, depending on the context, but for data analytics it typically means making a graphical or otherwise simplified visual representation of some data.

It can be as simple as a number written with large digits, and a couple of words to tell what it is (as found on some of the pages of this and many other websites), or it can be a pie chart or another traditional or more inventive kind of diagram. The main thing is that it should reveal valuable insights at first glance, possibly offering some more details for those who study it closer. It should be easier to see what is important through the visualization than by looking at a list or a table full of numbers.

For a data analyst, a main task is to select the visualizations carefully, as the easy overview will be gone if all angles of all data are made into lots of diagrams. Hence the idea of a dashboard, where only a few, thoughtfully selected visualizations are gathered, in order to provide the insights that are most needed by the intended audience for making decisions.
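A deliberately primitive example makes the principle clear: even a text bar chart turns numbers into shapes that can be compared at a glance. Real dashboards use charting tools, but the idea is the same; the values here are invented:

```python
# A text bar chart: each value becomes a bar scaled to the largest value.
sales = {"Q1": 120, "Q2": 180, "Q3": 90, "Q4": 210}
longest = max(sales.values())

for quarter, value in sales.items():
    bar = "#" * round(20 * value / longest)
    print(f"{quarter} {bar} {value}")
```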

W

Web analytics

Also called website analytics. It covers the activities for collecting and analyzing data related to the use of a website, which includes things such as traffic, geographical data, clicks, conversion (whether the user buys something), the efficacy of a “call to action”, etc.

The purpose of web analytics is often to optimize the design to the purpose, which can be to sell something, or to make people want to stay longer on the site, to increase the value of advertising – or to lead the users to take the next step in a contact process, for instance by subscribing to a newsletter.

Web analytics becomes much easier when the users aren’t anonymous, so making people create a login or setting a cookie in the browser will help create much more useful data for the analysis.

X

Y

Z