Agile methodology is a type of project management that is mainly applied to software development, although agile approaches are becoming popular among other disciplines and industries. Unlike a product assembly line, in which each stage can only begin once the previous stage is completed, agile software development breaks a project into short, repeated cycles. These cycles are known as ‘sprints’, in which cross-functional teams work on a certain part of the project for a specified period of time. Agile development requires close collaboration with the client or customer and regular reflections on progress by all involved parties.
An algorithm is a bit like a set of instructions. On a piece of paper, you could write an algorithm for baking a cake, washing the car or planting potatoes. When applied to computers, an algorithm tells a computer how to carry out a certain task. Whenever you tell your computer to ‘do’ something, such as look up an item online, calculate some sums on a spreadsheet, save a document etc., your computer needs to know how. It may use a number of different algorithms to execute one task. It is good to note that an algorithm is just the steps needed to carry out instructions, and not the implementation.
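To make the difference between an algorithm and its implementation more concrete, here is a minimal sketch. The algorithm is the list of numbered steps in the comments; the Python code around them is just one possible implementation. The function name `find_largest` and the sample numbers are made up for illustration.

```python
def find_largest(numbers):
    """Return the largest value in a non-empty list of numbers."""
    # Step 1: assume the first number is the largest seen so far.
    largest = numbers[0]
    # Step 2: look at every remaining number in turn.
    for n in numbers[1:]:
        # Step 3: if a bigger number turns up, remember it instead.
        if n > largest:
            largest = n
    # Step 4: once every number has been checked, report the result.
    return largest


print(find_largest([3, 17, 8, 42, 5]))  # prints 42
```

The same four steps could just as well be written in another programming language, or carried out by hand on paper.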
An API (application programming interface) is a bit like a Roman foot messenger, just considerably faster. It runs back and forth between two applications, taking away a request and bringing back a response. Take logging into a social media account as an example. Your browser and the social media’s server are two different locations, meaning they need a way of communicating with each other. This is the job of the API. It allows a message to travel to the social media’s server (‘Person X is trying to log in’), ask for a response (‘authorise the log in’), and then bring back this response to your browser (either ‘Person X is logged in’ or ‘password/username was incorrect’).
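The request-and-response pattern described above can be sketched in a few lines. This is a simplified illustration that runs on one machine, so no real network or social media service is involved. The names `USERS` and `login_api`, and the sample username and password, are invented for the example.

```python
# Hypothetical user records held by the server. A real service would
# store password hashes, never the passwords themselves.
USERS = {"person_x": "correct-horse"}


def login_api(request):
    """Server side: take a request and hand back a response."""
    username = request.get("username")
    password = request.get("password")
    if username in USERS and USERS[username] == password:
        return {"status": "ok", "message": f"{username} is logged in"}
    return {"status": "error", "message": "password/username was incorrect"}


# Browser side: send the request and read the response that comes back.
response = login_api({"username": "person_x", "password": "correct-horse"})
print(response["message"])  # prints person_x is logged in
```

In a real system, the request and the response would travel over the internet, typically via HTTP, rather than through a direct function call.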
Some programmers specialise in back-end development, which means that they create the ‘behind the scenes’, the back end, of a website or software. They are the interior designers of the programming world. Instead of worrying about how the final product is displayed on the screen for the user, a back-end developer is more concerned with the core computational logic. They create the components that are accessed by the user through the front-end application. Often, back-end developers specialise in a back-end programming language such as PHP, Ruby or Python.
Of all the terms in this glossary, big data is probably both the most widely used and the most widely misused. Even among non-specialist audiences, it has become a technological buzzword in recent times. Big data is a field working on solutions for collecting, transporting and storing very large amounts of data (i.e. data that is very ‘big’). For this reason, it is mainly large organisations that are concerned with big data. For example, if a major tech company decided to move all of its data to a new data centre, it couldn’t just put it all on a memory stick. It would require specialist data-carrying vehicles to come to the old data centre, get wired up, spend hours downloading all the data, and then repeat the process in reverse at the new centre. This is a very time- and cost-intensive process.
On a basic level, cloud computing is the interaction between various computers. When many people visualise ‘the cloud’, they often imagine some sort of digital, nebulous structure floating above their heads. In reality, the ‘clouds’ of cloud computing do actually exist as tangible entities. The term ‘cloud’ simply refers to a cluster of computers in remote data centres, which often are dotted all over the world. Many large service providers, such as Amazon, have their own data centres, and sometimes sell a fraction of their computational capacity to other services, such as Netflix. This is also known as ‘big computation’, which describes the process of using the combined power of many computers to run large services or solve complex tasks. Combining the capacity of many CPUs can speed up these processes dramatically. By way of example: if it would take you one million hours to decrypt a document on one computer then, with access to one million computers and a task that divides up perfectly, you could do it in one hour. However, this is not to say that only big business benefits from cloud computing. On a daily basis, many people also use the cloud for storage and sharing - from uploading photos to Dropbox to working on a text with a colleague via Google Docs.
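The arithmetic behind the decryption example can be written out as a short sketch. It assumes the best possible case, in which the work splits evenly across all machines and no time is lost coordinating them, which real systems rarely achieve. The function name `ideal_hours` is made up for illustration.

```python
def ideal_hours(total_hours, computers):
    """Best-case running time when work is shared evenly across computers."""
    return total_hours / computers


print(ideal_hours(1_000_000, 1_000_000))  # prints 1.0
print(ideal_hours(1_000_000, 1_000))      # prints 1000.0
```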
Many enterprises use software called a content management system (CMS) to manage website content. Traditional CMSs, also known as monolithic CMSs, comprise a back-end database, a CRUD UI and a front-end display. However, CMSs are increasingly going headless, meaning they only comprise a back-end database and a CRUD UI (plus an API). By removing the front-end presentation layer and adding an API, the user is able to push content from the headless CMS to different websites and apps. This means that, unlike a monolithic CMS, a headless CMS does not concern itself with where or how its content is displayed.
The CPU (central processing unit) is the ‘brains’ behind your computer. It contains the important technology that generates what you see on your screen. Because the CPU is, essentially, the ‘computer’ part of your computer, many non-specialists erroneously use the terms interchangeably. For example, you might hear someone say that a data centre contains thousands of computers. What they probably meant, however, is that the data centre contains thousands of CPUs. This is because the term ‘computer’ refers not only to the CPU (the part that does the ‘computing’), but also to the screen, the keyboard, the casing etc. Saying that a data centre contains lots of computers, therefore, suggests that it contains rows and rows of domestic, desktop computers. In reality, however, it contains CPUs, as these are what generate the computational power needed for storing and processing large amounts of data.
In computing, some data is saved in what is known as a CSV format. CSV stands for ‘comma-separated values’. As its name suggests, a CSV file divides pieces of data with commas. This makes them very easy to export into tables such as spreadsheets, given that delineating pieces of data by commas essentially gives them their own ‘field’. Just like in a spreadsheet or table, each data record gets its own line.
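As a small illustration, the sketch below reads a CSV text using Python’s built-in `csv` module. The column names and values are invented for the example, and the text is held in memory rather than in a file to keep the example self-contained.

```python
import csv
import io

# Each line is one record; commas separate the fields within a record.
csv_text = "name,instrument,year\nAda,violin,2019\nGrace,piano,2021\n"

# DictReader uses the first line as field names for every later record.
records = list(csv.DictReader(io.StringIO(csv_text)))

for record in records:
    print(record["name"], record["instrument"], record["year"])
```

Note that every value is read as text, so a number such as the year would need converting before it could be used in a calculation.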
CRUD stands for ‘create, read, update, delete’ and UI for ‘user interface’. A CRUD UI, therefore, is a type of user interface that allows users to create, view, edit and delete information in a database.
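The four CRUD operations can be sketched with a plain Python dictionary standing in for the database. This is only an illustration of the four actions, not of how a real CRUD UI is built. The function names and the sample record are made up for the example.

```python
# A dictionary standing in for a database table (an assumption made
# to keep the example short).
records = {}


def create(record_id, data):
    records[record_id] = dict(data)


def read(record_id):
    return records.get(record_id)


def update(record_id, changes):
    records[record_id].update(changes)


def delete(record_id):
    records.pop(record_id, None)


create(1, {"title": "Draft"})
update(1, {"title": "Final"})
print(read(1))  # prints {'title': 'Final'}
delete(1)
print(read(1))  # prints None
```

A CRUD UI would put buttons and forms in front of these four actions so that users never have to write code themselves.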
Imagine a data catalog as a bit like an encyclopedia. If someone asked you to list all the key objects inside from memory, you most likely wouldn’t be able to. There is simply too much data. Many large organisations face a similar problem - they have so much data that it is almost impossible to keep track of it all. However, if you knew vaguely what category or kind of object you were looking for, say, instruments, then you could look up the word ‘instrument’ in the encyclopedia’s index. It would then tell you on which pages all the world’s instruments can be found. A data catalog functions in much the same way. If you know vaguely what you are looking for, you could run a catalog search on all of your organisation’s data. Specifically, you would be searching for ‘metadata’ about that data.
It used to be the case that those who produced data, or information, for public consumption were those in positions of power, the ‘lettered elite’. We might think here of governmental documents, academic papers or medical records. However, nowadays, anyone can create and use data. This may be in the form of sales tables, social media posts, web browser searches etc. Data has been ‘democratised’, i.e. given to the people.
Just as an engineer builds structures like houses and bridges, a data engineer creates structures for housing and connecting data. For a data scientist to be able to analyse large data sets, they first need a data engineer to build the mechanisms needed to harvest and process this data.
Data governance is the process of ensuring the quality, usability and safety of an organisation’s data. It involves ensuring that data is well-maintained and complies with the correct data standards. Think of those in charge of data governance as both the gatekeepers and police of an organisation’s data.
Often, when reading about concepts such as frictionless data or big data, we come across the analogy of organising data into packages or ‘crates’ to allow for easier handling and ‘transportation’ of data. This is because handling large amounts of raw data is a very tricky process, in the same way as trying to carry water with your hands is very difficult. A data lake is where raw data is kept before it is processed into more manageable formats (in this sense, you might like to think of a data lake as more of a data ‘reservoir’ from which data is tapped off and bottled up).
Data management is a broad term that describes the process of collecting, cataloguing and processing data within an organisation to achieve a certain outcome. This may be to publish data openly so that others may use it, or to process data internally in such a way as to generate insight. Increasingly, organisations worldwide are adopting ‘data management systems’, such as CKAN, to advance their organisation’s cause through data. By simplifying complex data processes, data management systems are turning data management into an everyday activity for non-technical staff across the public, private and third sectors.
Contrary to what its name may suggest, data mining does not mean ‘mining for data’, but rather ‘mining data’. The process aims to extract as much value and ‘usable’ information from raw, fresh data as possible.
Data quality is not the same as data value, although in most cases, high-quality data is of higher value. The better the data quality, the greater the likelihood that you will be able to gain insight of value from that data. This is because high-quality data is accurate, complete, consistent and valid.
Think of data science as the study of data. Whereas data engineers build the systems and programs needed to store and display data, data scientists are more interested in the data itself. Much like other scientists, they conduct experiments and research, committed to solving problems and finding answers. They piece different data together, dissect patterns, scrutinise anomalies and generate tables and graphs. They also work on wider data projects, such as machine learning and artificial intelligence. If data mining is about extracting value, data science is about generating value.
Data is often presented in data sets, which are simply collections of data with a shared theme. This might be a group of tree images, a table of university exam scores or an address directory. Often, people look for data sets to help with their research. For example, if you wanted to find out about your risk of falling victim to crime in a certain area, it wouldn’t help you very much to only have one piece of data. You’d need to look for public data sets containing information on types of crime, crime frequency, crime hotspots etc.
Many data-driven organisations are implementing data management systems as a means of maximising the insight they can gain from their data. DMSs allow organisations to: collate all of their data; collect data from sources outside of their organisation; query the data in their database for data identification or discovery.
ETL is the name given to the process by which data is taken from one source and moved to a larger container with lots of other data. Its name describes the process: data is taken (‘extracted’) from a source, converted (‘transformed’) into a uniform format, and placed (‘loaded’) into a much larger storage facility. Much like a farmer harvests potatoes, sorts them into crates and moves these into a warehouse, data is selected, packaged up for easy handling and put into storage with other data until it is needed.
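The three steps can be sketched in miniature using only Python’s standard library. In this illustration, the ‘source’ is a small CSV text and the ‘storage facility’ is an in-memory SQLite database. The table name `harvest`, the column names and the sample values are all assumptions made for the example.

```python
import csv
import io
import sqlite3

# A small, untidy source: stray spaces and inconsistent capitalisation.
SOURCE = "crop,weight_kg\n Potato ,12\nCARROT,7\n"


def extract(text):
    """Extract: read the raw records from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: tidy names and turn weights into whole numbers."""
    return [(row["crop"].strip().lower(), int(row["weight_kg"])) for row in rows]


def load(rows, connection):
    """Load: place the uniform records into the larger store."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS harvest (crop TEXT, weight_kg INTEGER)"
    )
    connection.executemany("INSERT INTO harvest VALUES (?, ?)", rows)
    connection.commit()


warehouse = sqlite3.connect(":memory:")
load(transform(extract(SOURCE)), warehouse)
print(warehouse.execute("SELECT crop, weight_kg FROM harvest ORDER BY crop").fetchall())
# prints [('carrot', 7), ('potato', 12)]
```

Real ETL pipelines follow the same shape, but usually draw on many sources and load into a dedicated data warehouse.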
In an ideal world, data would flow between systems, people and institutions efficiently and seamlessly. It would also be easy to use data to generate insight, as data would be readily available and easily accessible. Currently, many people and institutions spend most of their time trying to collect and organise data, leaving them with little time to extract value from it. In a world where data was entirely frictionless, instead of having small pieces of data flying around chaotically, data would be packaged up into a standardised container. This would not only make it easier to send data back and forth between different systems, but increase its usefulness, thereby fostering data-driven decision making.
Whereas many developers specialise in either back-end or front-end development, a full-stack developer can work on both. Because they have an overview of all aspects of building a website or software, full-stack developers often also work on project management.
You might hear someone say that a new version of a piece of software has come out with ‘added functionality’. This just means the software now has more ‘functions’ than it did before, i.e. things that it is capable of doing.
Occasionally, computer programmers get together (either in person or digitally) to work intensely on a certain project for a specific period of time. These events, known as hackathons, may be organised to come up with solutions for organisation-specific or even global problems. Sometimes, hackathons are called in response to crises, such as pandemics or natural disasters.
While most software is made up of a ‘head’ (a front-end) and a ‘body’ (a back-end), some software only consists of a body. The analogy is that the ‘head’ of the software has been chopped off from the body, leaving it headless.
Anyone needing access to large amounts of data would benefit from a data integration solution. Data integration is a strand of data management that concentrates on amalgamating data from many different sources. Taking pains to integrate data properly minimises room for error in all data-driven decisions undertaken by an organisation. A good example of this is synchronisation. Without a system that automatically updated an organisation’s data whenever changes were made to the external data sets on which it relied, employees would have to check for updates manually and continually.
An interface is the part of a system that connects that system to other, distinct systems. Some interfaces are mechanical, such as the headphone jack on your phone, or the USB port on your laptop. The term ‘user interface’ refers to interfaces that connect humans with computers.
This term is very similar in meaning to compatibility. When used in the context of technology, it refers to the ability of computer systems to work together. Designing systems with interoperability in mind forms a crucial part of the work of computer and data scientists alike. This is because, the more interoperable our global systems, the more information is able to flow freely. The negative impact on both the economy and society of low levels of interoperability can be demonstrated through some everyday examples. Actions as small as not being able to plug in your devices abroad without an adapter, or not being able to edit a document sent to you as a PDF, show how even low-level instances of non-interoperability can halt the free flow of information.
Nowadays, we are seeing more and more domestic devices that connect to the internet. Even some washing machines have WiFi. This explains the name IoT, or ‘Internet of Things’ - it’s the art of connecting ‘things’ to the internet. In a domestic context, IoT aims to make the lives of consumers easier. You could turn your heating on with your phone so that it’s warm when you get home; set your washing machine to wash while you’re at work; or turn your house lights off from the office if you realise you left them on. Increasingly, IoT technology is being applied on a much larger scale, in the urban setting, giving rise to what are called ‘smart cities’.
Machine learning is a type of artificial intelligence (AI). ‘Machine’ really refers to computer algorithms that ‘learn’ how to improve through experience. To give an idea of how machine learning works, imagine that you are sitting in front of an AI with a built-in camera. You want to teach the AI to recognise a tree when it sees one in an image. For days, you tell the computer that you are going to show it pictures of something called a ‘tree’ and let it take pictures of your tree images. After a while, the computer will have enough data about what a tree looks like that it will be able to recognise a tree of its own accord. It has ‘learned’ what a tree looks like. Some algorithms are so advanced that they can even produce their own images of a tree that look indistinguishable from a real image (this particular type of machine learning is known as a Generative Adversarial Network, ‘GAN’). Self-driving cars and voice recognition services are examples of products made possible by machine learning.
Similar to but distinct from a programming language, a markup language is used to tell a computer how a text should be structured and displayed. Many non-programmers are used to formatting texts in a word processor. If we want to italicise words, we simply click on the ‘I’ button. When creating a text for a website, however, we cannot use a word processor. We have to use a special program that tells a text what to look like. To use one of these programs, we need to know how to write a markup language. In the markup language Markdown, for example, italicising a word works by enclosing it in underscores, like this: _word_. Unlike typical programming syntax, markup languages use recognisable words. Examples of markup languages are HTML and Markdown.
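To show how one markup language can be turned into another, the sketch below converts Markdown-style italics (a word or phrase wrapped in single underscores) into the equivalent HTML `<em>` tags. It handles this one rule only and is far simpler than a real Markdown converter. The function name `italics_to_html` is made up for illustration.

```python
import re


def italics_to_html(text):
    """Turn _word_ into <em>word</em>, leaving snake_case_names alone."""
    # \b requires the underscores to sit at the edge of a word, so
    # underscores in the middle of a name are not treated as markup.
    return re.sub(r"\b_(\w[^_]*?)_\b", r"<em>\1</em>", text)


print(italics_to_html("This _word_ is italic."))
# prints This <em>word</em> is italic.
```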
Simply put, metadata is data about data. For one piece of data, there are often many pieces of metadata, i.e. pieces of information that describe that data. A good example is a photograph on your phone. The photograph itself is the data, and information such as time and date of capture, image size and storage location are the metadata.
Unstructured data, such as whole documents, texts or videos, is often stored in NoSQL databases. These are databases that are designed with different data types in mind. A NoSQL database is non-relational, meaning that it can store data that does not fit neatly into tabular form.
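The same idea applies to any file on a computer. The sketch below creates a small temporary file (the data) and then asks the operating system for some facts about it (the metadata). The file contents and the dictionary keys are chosen purely for illustration.

```python
import os
import tempfile
from datetime import datetime

# The data: a small temporary file containing five bytes of text.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as f:
    f.write(b"hello")
    path = f.name

# The metadata: facts about the file rather than its contents.
info = os.stat(path)
metadata = {
    "size_bytes": info.st_size,
    "modified": datetime.fromtimestamp(info.st_mtime).isoformat(),
    "location": path,
}
print(metadata["size_bytes"])  # prints 5

os.remove(path)  # tidy up the temporary file
```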
Open data is data that is freely available for anyone to access, distribute and copy. It is ‘public’ data and as such is not protected by intellectual property rights.
Many organisations choose to publish data sets (usually their own) on open data portals. These are online user interfaces that allow users to access collections of open data. Two of the most common types of organisation that publish data via open data portals are governments (usually for the purposes of transparency and freedom of information) and research organisations (mostly for the purpose of sharing data for the benefit of other researchers). CKAN is a high-profile example of software used to build open data portals.
In many respects, open-source software is the opposite of proprietary software. This is mainly because its licence allows anyone to use, modify and distribute it freely. However, some open-source licences are ‘copyleft’, which means that the software is free to distribute and modify so long as all derivatives are subject to the same licensing. Although the software itself is usually available at no cost, companies are at liberty to sell products and services related to it, such as consulting or added features. The name ‘open-source’ stems from the ‘open’ nature of the source code. This means that the code behind the software can be viewed and modified by anyone with coding skills. A key advantage of this is transparency, as anyone can see what is going on ‘behind the scenes’ of a piece of software.
PII is the name given to any information that relates to a specific individual. This could be very basic information, such as a name or number, or highly sensitive information, such as bank details or medical records. Because PII can tell you something about private individuals, it is often regulated by data protection legislation.
In computing, the term plain text is given to text that cannot be formatted in any way. Whereas many word processors come with formatting options such as font, font size, bold, italics etc., plain text only displays words. Often, however, plain text can be used to determine how digital content should be formatted by means of a markup language. This allows you to use letter/number/symbol patterns in your plain text file to determine what, say, a website text should look like (for example writing words in asterisks to turn them bold). In this case, plain text acts a bit like code.
In contrast to open-source software, proprietary software (also known as closed-source software) is not free to use and is protected by intellectual property rights such as copyright. Unlike open-source software, proprietary software does not permit end users to view or modify the source code.
When used in relation to databases, the word ‘query’ is very similar in meaning to ‘search’. If someone carries out a database query, it just means they have searched for a specific piece or set of data.
Much like raw materials, raw data is data that has not been processed or transformed in any way. It is data that has been taken straight from the source. Often, large amounts of raw data are condensed into smaller, more manageable summaries. While this can help to organise data in many instances, it can also be unproductive for many use cases. For example, scientists often choose to publish an article summarising experiment findings, as opposed to the raw data from the experiment itself. Publishing this raw data, however, could support the work of other scientists with similar projects.
Often, instead of installing expensive and complex software, organisations choose to subscribe to software services, an approach known as SaaS (‘software as a service’). These are an example of a cloud computing solution. Because it can be accessed and used online, the software is not tied to a specific device. An added advantage of SaaS is that you only pay for what you consume, as if the software were a utility.
Data scraping is a means of getting data from a website onto a local computer file, such as a spreadsheet. Whereas data transfer between programs is usually written in code (because only the computer needs to ‘read’ and interpret it), data that is scraped is intended for the end-user. In this sense, readable data from a website is ‘scraped’ from one program into another.
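A minimal sketch of scraping is shown below. To keep the example self-contained, the ‘web page’ is a short piece of HTML held in a variable rather than downloaded from a real website. The class name `TableScraper` and the table contents are made up for illustration.

```python
from html.parser import HTMLParser

# A stand-in for a downloaded web page containing a small table.
PAGE = (
    "<table>"
    "<tr><td>Oak</td><td>12</td></tr>"
    "<tr><td>Birch</td><td>7</td></tr>"
    "</table>"
)


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])  # a new table row begins
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._in_cell and self.rows:
            self.rows[-1].append(data.strip())


scraper = TableScraper()
scraper.feed(PAGE)
print(scraper.rows)  # prints [['Oak', '12'], ['Birch', '7']]
```

From here, the rows could be written to a spreadsheet or CSV file for the end user.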
Sometimes, instead of data in an organisation being accessed through a central data management system, it is siloed. This means that information is stored across different systems that are incompatible and not integrated with each other. This makes data analysis cumbersome and slow.
Advancements in IoT technology have brought about a concept known as smart cities. A smart city collects data on things like traffic, CO2 levels, electricity consumption, weather and movement patterns in order to optimise city life for its inhabitants, protect the environment and preserve resources. Different cities continue to create new solutions for improving urban living. Let’s take the simple example of parking. Often, people drive around and around car parks looking for a free space. This wastes time and fuel, and pollutes the environment unnecessarily. To try and make this experience better for drivers and the environment, a smart city car park could contain sensors in each parking bay that send data on whether the bay is empty to a parking meter at the car park entrance. This could show the driver a virtual map of the car park that highlights the location of free spaces.
Some database management systems, known as Relational Database Management Systems (RDBMS), need to be managed using the programming language SQL. Only by phrasing commands in SQL can you discover, edit or delete data found in an RDBMS. Relative to other programming languages, SQL is easy to learn, as it shares much of its syntax with the English language. SQL can either be pronounced as individual letters, or as the word “sequel”.
In most cases, structured data is quantitative data. It is the sort of data that is easy to organise into spreadsheets, relational databases and visual representations such as graphs and charts. Examples of structured data are things like names, addresses, order numbers and geolocation. Because it lends itself to fast analysis, it is much easier to generate insight from structured data than from unstructured data.
Some databases are relational, which means that they organise data in relation to other pieces of data, for example in the form of tables. A Relational Database Management System (RDBMS) is the name given to the program that allows you to administer relational databases. To communicate with data stored in an RDBMS, you need to speak the programming language SQL.
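To give a flavour of what SQL commands look like, the sketch below uses SQLite, a small relational database that ships with Python, to discover, edit and delete data. The table name `orders`, its columns and the sample values are invented for the example.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (order_number INTEGER, customer TEXT, total REAL)")
db.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "Ada", 19.5), (2, "Grace", 42.0), (3, "Ada", 5.0)],
)

# Discover: find every order placed by one customer.
rows = db.execute(
    "SELECT order_number, total FROM orders WHERE customer = ? ORDER BY order_number",
    ("Ada",),
).fetchall()
print(rows)  # prints [(1, 19.5), (3, 5.0)]

# Edit and delete: change one record, remove another.
db.execute("UPDATE orders SET total = 6.0 WHERE order_number = 3")
db.execute("DELETE FROM orders WHERE order_number = 2")
remaining = db.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(remaining)  # prints 2
```

Even without any programming background, words such as SELECT, FROM, WHERE, UPDATE and DELETE give a fair idea of what each command does.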
Most software has a user interface. The UI is the part of a piece of software that allows a user to interact with it. A website is a common user interface that may contain numerous interactive elements such as boxes to type in, drop-down menus, pop-up boxes or progress bars.
In most cases, unstructured data is qualitative data. It is data that cannot be displayed well in analytics tools such as graphs and tables. This is not to say that unstructured data is not ‘useful’, but that it is not particularly ‘usable’ for the purposes of analysis and insight generation. Examples of unstructured data are videos, audio, social media activity and satellite images. Unstructured data is the opposite of structured data and is often stored in NoSQL databases.
People interested in user experience, such as UX designers, want to find out how users react towards and feel about certain digital products, such as websites or apps. UX designers want to design user interfaces in such a way as to maximise user interaction and nudge their behaviour. For example, a UX designer creating an ecommerce smartphone app might ensure that any ‘buy now’ button was within thumb-reach.
Sometimes, organisations become dependent on certain product or service vendors, usually because the cost of switching to a different vendor is too high. As a result, they find themselves ‘locked in’ to a certain vendor. For example, imagine you work in Human Resources at an organisation and your staff complain about the quality of the coffee in the office. In most instances, you’d simply try out another coffee brand. However, in this instance, there’s a problem. You currently purchase coffee in the form of pods that only fit into certain coffee machines. Switching to different coffee would mean replacing all of these expensive machines - of which there are dozens across the building - and you simply can’t afford this. You’re locked in to your current coffee vendor.