Vocabulary

Special terminology used in Data Analytics

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z     

A

Actionable

This word is mostly used in the combination “actionable insights”, meaning that any insight provided by data analytics should come with a direct usefulness – a guidance on what to do. The alternative could be “fun”, “interesting”, “of philosophical value” or similar, but in a business context, all that is normally not wanted. Data analytics is performed to provide directly actionable new knowledge that can guide management decisions.

ANSI SQL

Also sometimes called Standard SQL, and today referring to ISO/IEC 9075 “Database Language SQL”, in its relevant version (typically the latest, which, at the time of writing, is ISO/IEC 9075:2023). The original ANSI document, ANSI X3.135, dates back to 1986. Since ISO adopted it shortly after, it has effectively been an ISO standard almost from the start, and it is described in several parts, each with its own document. More on this from ANSI and ISO.

App analytics

The analytics related to users’ behavior in an app. Click analysis, conversion analysis (if you are selling something or trying to make the users do something specific), and retention analysis are some aspects, and these are closely related to what you want to do in web analytics. This means that tools for one of these areas often handle the other as well, especially if the app’s user interface runs in a web browser.

B

Benchmark

In order to understand if a value is to be considered high or low, or a development over time good or bad, it needs to be measured up against something. A benchmark to measure up against can be, for instance, the average number or development for competing businesses, or a specific competitor. If it is found that our sales of a certain category of products have increased by 10% year over year, but the benchmark says 15%, then our sales must be seen as less good. We then know that we did something wrong, or that there is an unfulfilled potential for us to exploit, if this sales increase is a continuing trend in the market.
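The comparison described above can be sketched in a few lines of Python. All figures here are invented illustration values, not from any real dataset:

```python
# Comparing year-over-year sales growth to a benchmark (made-up numbers).

def growth_rate(previous: float, current: float) -> float:
    """Year-over-year growth as a fraction, e.g. 0.10 for 10%."""
    return (current - previous) / previous

our_growth = growth_rate(previous=200_000, current=220_000)  # 10%
benchmark = 0.15                                             # market grew 15%

gap = benchmark - our_growth
print(f"Our growth: {our_growth:.0%}, benchmark: {benchmark:.0%}, gap: {gap:.0%}")
```

A positive gap, as here, signals that we grew more slowly than the market – exactly the “less good” situation described above.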

Business Intelligence (BI)

(coming soon)

C

Cloud-based

This typically means software-as-a-service delivered through the Internet, established and maintained in a cloud data center. There is, hence, no software to install on-premise, but it may be that we need to do the maintenance in the cloud ourselves. So, it can either be a black box service, delivered out of the cloud, or simply a different place for us to install and administer the software. For the user, it makes no difference: it will in all cases be used the same way.

A hybrid solution is often also possible, with parts of the system cloud-based and other parts on-premise.

D

Dashboard

An overview of selected metrics, provided in a shape that is immediately decodable by the intended audience – for instance the management of the company. Simple shapes can be pie charts, traffic lights, plain numbers, curves, or other common representations of numbers.

For certain audiences, more complex representations may be used, if these are both useful and understandable for exactly these people. A dashboard is the typical result of visualization of data, even though a dashboard most often will hold several such visualizations on the screen at the same time.

A dashboard may provide functionality for drill-down or other interactions with the visuals.

Data

A representation of measurements or other collected representations of something that is or has happened, such as keyboard input. Data is plural, technically speaking with “datum” as the singular, even though this is rarely used, and “data” is typically used also for single pieces of data.

When some data have a recognizable value, suitable for informing someone, they can be called information. Data without such a value can often be combined through summarization or other calculations, or they can be counted, or used together with other data, to provide the wanted informational value. That is one goal of data analysis; the further goal is to find pieces of information that are new or can lead to a new understanding of something – these can then be called insights.

Data analysis

The process of making information out of data, looking for insights – often in the shape of trends or changes in data over time or across space or business function. The analysis typically has a goal of finding the answers to some initial questions, but will often reveal more than originally asked for. A sorting process is therefore needed, to decide what is useful to bring forward as insights.

Data analytics

A broader scope of activities than data analysis, basically covering the whole business of handling the analysis: the business analysis, talking to the customer of the analysis work, preparation of datasets (sometimes with the use of machine learning or other advanced preparation techniques), and the follow-up after the analysis – preparing visualizations, for instance in the shape of reports or dashboards, and adding data storytelling to highlight the interesting aspects, their explanations, and their value.

Database

A storage for data. Often, “database” is also used as a short form for “database server software”, so that, e.g., Microsoft SQL Server is called “a database”. Technically, a database is, however, the specific set of tables, stored procedures, and other elements that make up the frame for handling a specific set of data, including the data themselves.

Database engine

The software that handles the functionality of a database server – effectively the complete database server software package. Microsoft SQL Server is, hence, a database engine. More precisely, the engine should comprise only the functionality that handles things such as a query, a request to create a table, or other actions toward the data and their storage and retrieval. This means that user interface elements or secondary functions, for instance for upgrading the software or many other purposes, aren’t part of the engine.

Data cleansing / data cleaning / data scrubbing

This is the preparation of a dataset to be useful for analytics. It consists of a number of practical improvements of the data, such as deduplication, adding missing values, removing irrelevant data, and whatever is needed for the dataset to become useful for making statistics or other treatments planned with it.

The different names for the process mean the same thing, even though some people want them to mean something different – often with data cleansing being the more thorough process, while data cleaning is quick and often automated.

Data scrubbing often means exactly the same, but sometimes it has a completely different meaning, focusing on the ongoing maintenance of data in memory, which is a definition promoted on Wikipedia, among other places.

What these different meanings imply is that you should check with whoever you are communicating with, so that you both understand the term you use the same way.
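A cleaning pass like the one described above can be sketched with only the Python standard library. The records and field names here are invented for illustration:

```python
# Minimal data cleaning: deduplication and filling missing values.
rows = [
    {"id": 1, "city": "Berlin", "age": 34},
    {"id": 1, "city": "Berlin", "age": 34},  # exact duplicate
    {"id": 2, "city": "",       "age": 28},  # missing city
]

# Deduplicate while keeping the original order.
seen, cleaned = set(), []
for row in rows:
    key = tuple(sorted(row.items()))
    if key not in seen:
        seen.add(key)
        cleaned.append(row)

# Fill in missing values with an explicit placeholder.
for row in cleaned:
    if not row["city"]:
        row["city"] = "UNKNOWN"

print(cleaned)
```

In practice, tools or dedicated libraries do this at scale, but the operations – deduplicate, fill, discard – are the same.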

Data-driven

An adjective implying that data and its analytical treatment are an important contributor to whatever is data-driven – typically decisions. The depth of the term varies: it can mean that nothing happens without the correct data, or, at the other end of the scale, that data somehow influences what happens.

Often, it doesn’t mean a lot, to be honest. Decisions may still be made on a gut feeling, even if a company is proclaimed to be data-driven. But, at least, having such a term in use typically means that there are data analytics activities going on (perhaps somewhat automated), using suitable software, people spending time, and someone taking care of creating or obtaining the needed data and delivering the information and insights in a formalized way to the intended users.

Data insights

The qualified new information that came out of a data analysis process. Often just called “insights”. They are typically connected with the term “actionable”, since most companies don’t want pointless insights, or insights that do not help them make decisions – hence, the qualification process, that should lead to the presentation of only the actionable insights, while any others are seen as worthless (even though the analyst may keep a note of them, to be considered in future analyses).

Data lake

A collection of data from different sources that are not unified into a common structure but rather left as they are – with proper tools for retrieving them. In contrast to a data warehouse, there can be different formats, for instance some relational databases, some text files, and other things.

The idea of creating data lakes spawned out of the big data wave, where it became clear that it could be problematic to attempt to bring all the acquired data together into one common structure, especially if it wasn’t known which potential uses the data would have in the future.

Data management

(coming soon)

Data mart

A smaller data warehouse, often used for a reduced set of purposes, for instance by one department. The data may be an extract from a central data warehouse, or they may be loaded in from operational databases or other sources.

Data mining

The process of searching for both patterns and anomalies in datasets, thereby getting to an understanding of the nature of the data. A simple example is a correlation diagram, where two attributes are held up against each other to see if they are related, but clustering techniques, classification, and other aspects are also included.

Wikipedia describes six classes of tasks included: anomaly detection, association rule learning, clustering, classification, regression, and summarization.

Data pipeline

A series of activities for moving data from one place to another. ETL and ELT are data pipelines, but a pipeline can be constructed as needed, and it can be for one-time use or a regular, perhaps scheduled, activity.

Several tools exist for designing, managing, and operating data pipelines.
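At its core, a pipeline is just a sequence of steps that data flows through. A minimal sketch, with invented data and an in-memory list standing in for the target store:

```python
# A data pipeline as plain functions: extract -> transform -> load.
def extract():
    # In reality: read from an API, a database, or files.
    return ["  Alice , 34 ", "Bob,28", "Alice , 34"]

def transform(lines):
    rows = [tuple(part.strip() for part in line.split(",")) for line in lines]
    return sorted(set(rows))  # trim whitespace and deduplicate

def load(rows, target):
    target.extend(rows)       # in reality: insert into the target store

warehouse = []
load(transform(extract()), warehouse)
print(warehouse)
```

Pipeline tools mainly add scheduling, monitoring, and retry logic around steps like these.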

Dataset

A set of data – this can have any form, but it is typically ordered in a table, or several tables, and will often be in CSV format if it is meant for general distribution to data analysts.

There can be restrictions on how a dataset may be used, and also how it must be referenced. “Open data” is a term often used for datasets with some kind of license that gives the most levels of freedom when using the data.

The world produces enormous amounts of data, and a large portion of them are gathered into publicly available datasets that can be downloaded from, for instance, government or university websites, or the websites of international organizations.
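Reading a small CSV dataset takes only a few lines of Python. The dataset here is inlined for illustration; normally you would open a downloaded file instead, and the city figures are made-up examples:

```python
import csv
import io

raw = """city,population
Berlin,3645000
Oslo,693494
Aarhus,349983
"""

rows = list(csv.DictReader(io.StringIO(raw)))
total = sum(int(row["population"]) for row in rows)
print(len(rows), total)
```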

Data science

A scientific discipline that blends several other disciplines, such as statistics, mathematics, and computer science, and with a central element of data analytics.

When seen as a job title in a company, a Data Scientist works with a broad scope of data analytics tasks, but often with a weight on programming and algorithms, also including machine learning and other AI development, while a Data Analyst has more weight on front-end analytical tools like spreadsheets and visualization tools – but it can vary a lot. Historically, data science, statistics, and data analysis have overall meant the same, but with different people defining the terms differently.

If nothing else is defined in the situation: expect more programming, and perhaps more mathematics, when working as a data scientist, than when working as a data analyst.

Data source

Any place where data can be found when preparing for data analytics, population of a data warehouse, or any other purpose that requires data. A system that asks for a data source will typically be limited with regard to which types of data sources it can handle.

There are a number of technologies that can abstract data sources, such as ODBC and JDBC, and thereby help make more source types useful for a database or analytics product that supports one of these abstraction technologies.

Data storytelling (Storytelling)

There are different ideas on what exactly data storytelling is: some want it to be “storytelling through data”, others are more into making it “a blend of visualization and text”, and yet others favor a textual narrative with added data.

What is certain is that data, visualization, and a textual narrative of some kind should all be included. This can lead to traditional reports with included diagrams, to infographics, or simply to a dashboard with some textual notes on it, pointing out what a local minimum of a graph means, for instance.

If you take the “story by data” approach, you may carefully pick a series of different angles of the same overall story, and illustrate them with different data representations. This way, you can draw your audience in, step by step, to see and understand those insights you want to share, using just a few words to tie the story together.

If you take the narrative approach, you will use the text to describe the world as you see it, and illustrate it with visualized data where needed for a better understanding.

The hype about data storytelling appeared as a reaction to a sad tendency to just present a bunch of diagrams and traffic lights in a dashboard, without further explanation. Now, having this terminology, it is easier to both remember and get the time budget for explaining things more thoroughly.

Adding some text to these dashboards makes them more comprehensible, and is a representation of the “blend” approach.

Data warehouse (DW)

A collection of data from several sources, structured to facilitate business intelligence and data analytics – hence, with a focus on fast output based on larger amounts of data, in contrast to an operational (typically relational) database’s organization around handling one set of data, optimized for integrity and singular updates.

The data structure is typically multi-dimensional, using OLAP.

E

Embedded

Integrated as a component of something else. An embedded database server, for instance, is functional only as part of the application it has been embedded into, and is used mainly for providing its functionality to that one application.

Extract, Load, Transform (ELT)

As compared to ETL, this is a similar approach to populating a data warehouse, except that the transformation of data happens after data have been loaded into the data warehouse.

This approach may be preferable when it is not known at load time which format is preferred – or if several different formats could be wanted over time. The loaded data exist in a staging area inside the data warehouse itself, available for being transformed when needed. However, the transformation can also be done immediately after loading, which then effectively just adds the possibility of using the data warehouse system’s resources for the transformation, and is otherwise very similar to ETL.

ELT is typically preferred over ETL when adding data to a data lake.

Extract, Transform, Load (ETL)

When populating a data warehouse, data need to be extracted from their original source, converted or transformed into a format that will fit the data warehouse, and then loaded into the data warehouse. There are special tools for this complete process, but it is also possible to do the steps using separate tools.

The transformation will typically take place in a separate staging area, which can be arranged as files or a database, and the main purpose of the transformation process is to homogenize data from different sources, so that they follow the same format and can be seen in the data warehouse as a homogeneous set of data.
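The three steps can be sketched with Python and an in-memory SQLite database standing in for the data warehouse. The two “sources”, their formats, and all values are invented for illustration:

```python
import sqlite3

# Extract: records from two hypothetical sources with different formats.
source_a = [("2024-01-05", "DK", 100.0)]       # already in target format
source_b = [("05/01/2024", "Denmark", 250.0)]  # dd/mm/yyyy, full country name

# Transform: homogenize source B into the format used by source A.
def transform_b(row):
    date, country, amount = row
    day, month, year = date.split("/")
    codes = {"Denmark": "DK"}
    return (f"{year}-{month}-{day}", codes.get(country, country), amount)

# Load: insert everything into the "warehouse".
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE sales (sale_date TEXT, country TEXT, amount REAL)")
dw.executemany("INSERT INTO sales VALUES (?, ?, ?)",
               source_a + [transform_b(r) for r in source_b])
rows = dw.execute(
    "SELECT sale_date, country, amount FROM sales ORDER BY amount").fetchall()
print(rows)
```

After the load, both records follow the same date and country conventions, which is exactly the homogenization the transformation step is for.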

F

G

Geographic data

(coming soon)

Geographic Information System (GIS)

(coming soon)

H

I

Integrated Development Environment (IDE)

(coming soon)

Indexed Sequential Access Method (ISAM)

(coming soon)

J

K

L

M

Massively Parallel Processing (MPP)

A type of database that splits the load and access over multiple servers.

Metric

Element of measurement – something you can measure and determine a value of. In a table describing the features of a person, such as height, shoe size, etc., each of these parameters is a metric.

When working with data analytics, the available metrics for anything you are analyzing will set some limits for what you can do. At times, it is possible to arrange for more metrics to be collected, for future analyses, but it may also be possible to find the needed data in a different database, possibly in an external database that relates to the topic.

For instance, obtaining a dataset with demographic information, such as the number of inhabitants in a city, means that you can filter the analyzed data to include only people living in cities with more than a certain number of inhabitants – provided that you already know which city your data subjects live in.
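That kind of enrichment and filtering is simple to sketch. All names and figures here are invented:

```python
# Enrich subject data with an external metric (city population) and filter.
city_population = {"Berlin": 3_645_000, "Oslo": 693_494, "Aarhus": 349_983}

people = [
    {"name": "Anna", "city": "Berlin"},
    {"name": "Bent", "city": "Aarhus"},
    {"name": "Cleo", "city": "Oslo"},
]

# Keep only people living in cities with more than 500,000 inhabitants.
big_city_people = [p["name"] for p in people
                   if city_population.get(p["city"], 0) > 500_000]
print(big_city_people)
```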

Multidimensional Expressions (MDX)

A database query language used by some OLAP databases, as the SQL query language traditionally used for OLTP relational databases is less suited for querying OLAP cubes. However, there are other alternatives as well, and each database engine has its own set of possible query methods, as described in a section of a long page of OLAP server comparisons on Wikipedia.

N

NoSQL database

The term is short for a “Not only SQL” database system, where data aren’t relationally arranged. You may actually be able to use SQL as a query tool, but it will then be adapted by the system to give you the requested data.

There are many different types of NoSQL databases, and each has its purpose, as well as advantages and disadvantages.

One type is the graph database, which focuses on making it easy to map and search for related data of the same type, for instance people – very useful in social media contexts, among others.

Another type is the document-oriented database, which contains full sets of data, for instance a complete person record, in a separate document that can then be retrieved in one go – which is fast, if this is what you need. For instance, an XML database has data readily available as XML documents that can be forwarded as-is to a client application.

Notebook

This can mean different things, depending on context, but one is the document with program code and its results, nicely arranged after each other, that some programming environments can offer. Such a notebook can be shared with others who can see exactly what you have done in your analysis, and they can verify that the results match the code. As you can annotate everything by adding text lines where you want, it can become a good and precise account of a well-made analysis.

O

Object-relational database

(coming soon)

Online Analytical Processing (OLAP)

A type of database whose data structure and engine processes are optimized for analytical treatment of the data, i.e., output, by a few users at a time.

Often, a data warehouse is an OLAP database that has been loaded with data from one or more OLTP databases – the latter could be finance and production systems, and perhaps several other systems such as CRM or HR. Performing analyses directly on these systems’ OLTP databases would both burden them and slow them down for their intended use, and the analyses would be slow as well.

Therefore, it can make sense to extract (copy) the data needed for analysis, for instance once a day, during off-hours, and then load them into an OLAP database for the analytics work.

OLAP databases use OLAP cubes, or hypercubes, to relate data to each other. This concept is multidimensional, and, hence, one piece of data can be related to many others. This, however, is still often arranged in a relational database, hence the term Relational Online Analytical Processing (ROLAP), that is occasionally seen.

Other varieties are, among others, Multidimensional OLAP (MOLAP) and Hybrid OLAP (HOLAP).

Online Transaction Processing (OLTP)

A type of database whose structure and engine processes are optimized for creation and updates of data by many users simultaneously. A traditional relational database is of this type. The normalized data ensure that updates will go well, and fast, as there’s only one place to update each data element, even though the full set of data in an update can be split over several tables.

Through a mechanism with “logs” (simple files listing all requested transactions, complete with all data) for storing incoming create and update requests, these are effectively queued, making the client able to move on with other tasks. Should there happen to be a conflict, for instance if two clients try to update the same data, then one will be carried through, the other rolled back, and a message typically sent back to the client, which then must handle the communication with the user.
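The carry-through-or-roll-back behavior can be demonstrated with Python’s built-in sqlite3 module as a stand-in for an OLTP database. The account table and amounts are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL)")
con.execute("INSERT INTO account VALUES (1, 100.0)")
con.commit()

try:
    with con:  # commits on success, rolls back on an exception
        con.execute("UPDATE account SET balance = balance - 150 WHERE id = 1")
        (balance,) = con.execute(
            "SELECT balance FROM account WHERE id = 1").fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")
except ValueError:
    pass  # in a real system, a message would go back to the client

(balance,) = con.execute("SELECT balance FROM account WHERE id = 1").fetchone()
print(balance)  # back at 100.0 - the update was rolled back
```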

One thing that makes the relational/OLTP database less suitable for analytics is that a number of data tuples will be locked when accessed, thereby blocking other operations from taking place at the same time – this, to reduce the number of collisions when updating. As the OLAP database isn’t meant to be updated ad hoc, such mechanisms can be switched off there, making access to data fast and unconstrained. Another aspect is the need in a relational database to create several extracts (cursors) from different tables in order to get one complete set of data corresponding to the query. In the OLAP database, it is more typical to have all data that belong together in the same tuple in the same table, so that just one look-up is needed. This inherently leads to redundancy, but that is accepted with OLAP.

On-premise

An adjective used, typically, with software, sometimes also with the hardware it runs on, and perhaps other equipment, such as network infrastructure. The term is used in opposition to having “cloud-based” software – or, indeed, software that runs anywhere else than on the company’s own premises. It could, technically speaking, be in the neighboring building and doesn’t have to be in the cloud.

Whatever it is opposed to, on-premise means that the software runs in your own server room, on your own servers, and needs installation, maintenance, upgrades, backups, etc., to be done by your own IT department or their consultants. Also, running software on-premise implies a requirement for electricity and cooling installations, and an assessment of the security aspects. Another typical argument for using a cloud-based installation instead is that the on-premise servers typically are used only for a few hours per day, but you’ll have to pay for the full server anyway, thereby wasting the extra capacity and the money it costs. Cloud-based often means that the capacity can be shared with other companies, thus maximizing the utilization of the hardware, the human resources, the infrastructure, the installations, etc., and sometimes also the software.

However, on-premise means that you have full control, and that your software may be used even if there are problems with the Internet or with an external datacenter.

You will very often have the choice between installing some software on-premise, or subscribing to a service that gives you access to a cloud-based edition of it.

Open source

(coming soon)

P

Q

Query

A request message to a database server to return the specified data. A query will have to follow a defined format, such as SQL, and the results are usually sent back as a table with text contents.

There are, however, also other ways of working with queries, such as the query composer in Microsoft Access, where you add the tables in a graphical window, then select by drop-downs which fields you want to include, and which criteria count for the selection. With this input, Access produces an SQL statement behind the scenes and sends it to the database engine. The result can then come back in a grid window, where you can resize columns and potentially also alter the data directly.

For most typical database requests, you will either write the SQL statement yourself, or some application will do it in the background as part of its functionality, and you will just use its buttons, menus, etc., without thinking much about what communication takes place with the database. But queries are there, nonetheless.

Query federation

The possibility for a query tool to query across several databases and return one combined result to you.

R

Relational Database Management System (RDBMS)

Often talked about as simply a relational database, even though that would technically be only the data and their defining structure (table information, user access information, etc.), not the system that manages them.

A relational database is arranged with normalized data, meaning that these exist only once in the database and are distributed over several tables with links in between them, in order to show the connections between related data while still avoiding duplicating them.
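A minimal normalization sketch, using Python’s built-in sqlite3 module with invented tables and values: each customer name is stored once, and orders reference it by id, so a join can reassemble the related data without duplication.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customer(id),
                         amount REAL);
    INSERT INTO customer VALUES (1, 'Ada'), (2, 'Bo');
    INSERT INTO orders VALUES (10, 1, 50.0), (11, 1, 75.0), (12, 2, 20.0);
""")

# The join follows the link between the tables.
rows = con.execute("""
    SELECT customer.name, SUM(orders.amount)
    FROM orders JOIN customer ON orders.customer_id = customer.id
    GROUP BY customer.name
    ORDER BY customer.name
""").fetchall()
print(rows)
```

If a customer’s name changes, it is updated in exactly one place – the point of normalization.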

This type of database was seen as a significant step forward when it was developed by Edgar Codd around 1970, and it became the dominant type of database for many years.

The RDBMS – meaning the database engine, i.e., the system for managing such databases – follows some specific ideas for handling data storage, reception of commands, optimization of retrieval requests, etc., so the different products on the market tend to have a lot in common. Today, they all support SQL, but in the early days, some of them used other query languages.

Report

A document describing something in an ordered way. In relation to data and databases, there has been a long tradition of making reports in the shape of listings of data, possibly with some calculations included, for instance sums and averages, and, as printer technologies evolved, also with charts and other illustrations of the data.

This kind of report has largely been replaced by dashboards and other means of getting access to the data, both in their raw form and as processed overviews and insights, but reports can still make sense at times.

There is a discipline within journalism called data-driven journalism, and in a sense, an article that is rich in data, typically represented by infographics and diagrams, with some textual descriptions as well, could be called a report – resembling some of the most elaborate examples from the earlier days.

The availability of traditional reporting tools is quite limited today, but it is possible to set up something similar in various other tools, such as a modern text processor like Word or LibreOffice – with integrated data from spreadsheets, diagrams from a diagram tool, or from a presentation tool like PowerPoint. All of this can be integrated so that it updates as data change, making it a live report, or use fixed data, if that is preferred.

And, of course, today, a report mostly doesn’t need to be printed – it works well in an electronic form, such as a PDF file, to be read on a screen.

S

Script

For a data analyst, or a programmer, a script is typically a piece of software source code that is meant to be run as is, i.e., not compiled into a program first. The typical use for scripts is in the operating system’s command line interface, which often gives the possibility to write several commands after each other in a file, and then execute all of them in one go. That file is a script.

Some command line interfaces offer quite advanced scripting possibilities – such as Bash, which has existed in the Unix/Linux world for a long time, hence also on a Mac, since macOS is Unix-based (even though it looks like its own system on the surface), and it has been implemented for Windows too.

You can make scripts in many other places too, but the terminology is typically used only for such lightweight programming. It may also be called a batch file, even though this term is mostly used with Windows and DOS.

And one more place where program code is typically called a script is in an interactive programming environment, typical for Python or R – for instance in the tools that let you make notebooks.

Scripting

The process of writing a script. It is also used as an adjective, for such things as scripting tools, meaning tools that can be used for making scripts.

Statistics

Another, and older, word for data analysis. The typical distinction between the two terms is based on the historical use of statistics as formulas and methods used to analyze data, not necessarily by computer; the newer term data analysis then came into use as computers became common.

An aspect of statistics is data collection, and this may include surveys or other mechanisms that gather data.

As a scientific area, statistics has its traditionally included subareas, but there is no general consensus to be found on the exact division between statistics and data analysis.

Structured Query Language (SQL)

One of several possible ways of querying a database. SQL was developed in the early days of relational databases and is specifically suited for specifying the relations that form the set of data you are after, as well as the wanted filtering, sorting, and aggregation functions.

SQL is used through a textual command, but it can be abstracted by a graphical interface, which is commonly seen in many tools.

The result is typically a list of data.

As you can perform all kinds of communication with the database through SQL, it covers four overall functional areas, each with its own name:

  • Data Query Language (DQL): For querying data
  • Data Definition Language (DDL): For creating/altering tables, user accounts, and other structural database objects
  • Data Control Language (DCL): For managing the rights of the users
  • Data Manipulation Language (DML): For inserting, updating, and deleting data

T

U

V

Visualization

This can mean different things, depending on the context, but for data analytics it typically means making a graphical or otherwise simplified visual representation of some data.

It can be as simple as a number written with large digits, and a couple of words to tell what it is (as found on some of the pages of this and many other websites), or it can be a pie chart or another traditional or more inventive kind of diagram. The main thing is that it should reveal valuable insights at first glance, possibly offering some more details for those who study it closer. It should be easier to see what is important through the visualization than by looking at a list or a table full of numbers.

For a data analyst, a main task is to select the visualizations carefully, as the easy overview will be gone if all angles of all data are made into lots of diagrams. Hence the idea of a dashboard, where only a few, thoughtfully selected visualizations are gathered, in order to provide the insights that are most needed by the intended audience for making decisions.
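A deliberately primitive example makes the principle clear: even a text bar chart turns numbers into shapes that can be compared at a glance. Real dashboards use charting tools, but the idea is the same; the values here are invented:

```python
# A text bar chart: each value becomes a bar scaled to the largest value.
sales = {"Q1": 120, "Q2": 180, "Q3": 90, "Q4": 210}
longest = max(sales.values())

for quarter, value in sales.items():
    bar = "#" * round(20 * value / longest)
    print(f"{quarter} {bar} {value}")
```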

W

Web analytics

Also called website analytics. It covers the activities for collecting and analyzing data related to the use of a website, which includes things such as traffic, geographical data, clicks, conversion (whether the user buys something), the efficacy of a “call to action”, etc.

The purpose of web analytics is often to optimize the design to the purpose, which can be to sell something, or to make people want to stay longer on the site, to increase the value of advertising – or to lead the users to take the next step in a contact process, for instance by subscribing to a newsletter.

Web analytics becomes much easier when the users aren’t anonymous, so making people create a login or setting a cookie in the browser will help create much more useful data for the analysis.

X

Y

Z