Dark data can be defined as all the data that exists within an organization but is currently useless or unusable, either because it is redundant, forgotten, ignored, hidden, or simply unknown to the organization at a given point in time, or because it is difficult to find, manage, or exploit for valuable insights.
Where Do These Data Come From, and Why?
These data are either collected by the organization for a purpose that was never clearly defined or implemented, or produced by people or IT systems during normal operations but no longer used or useful, or even produced by IT systems without the organization’s knowledge. Other causes include:
- Processing of the same data by multiple stakeholders.
- Processing of the same data in multiple systems or by multiple functions.
- Changes in applications/IT systems.
- Improperly managed business processes.
- Incorrectly configured or managed applications/IT systems.
- Lack of cleansing processes for data resulting from normal business processes.
- Lack of validation criteria for collected or processed data.
- Poorly defined or monitored system configuration and administration, or incorrect backup and archiving processes.
- Overall, the absence of a data governance framework.
Examples of “Dark Data”
Often we can overlook, or even forget, data we have collected for a project, a report, a presentation, etc. However, all this data is still in our operational systems. Some cases where dark data may come from or be stored:
- Attachments to old emails, often sent to large groups, resent, saved in various locations
- Intermediate files used in the preparation of a report or presentation, with various comments and revisions
- Raw data used to produce a report or presentation
- Operating system, database, application, or IT/network system logs that span too long a period, or that we are unaware of, and that are no longer used or useful
- Databases of outdated IT systems that have been replaced but have not been migrated and cannot be used anymore
- Data collected by an IT system, transmitted to others for processing but stored by both systems in the same format
- Erroneous data, which cannot be used for planned purposes, but has not been deleted
- Temporary files created by certain applications or IT systems, kept long after the processing purpose has been fulfilled
- Data behind access restrictions, which we are unaware of and cannot access
- Archived files that have been extracted, where the extracted copies have not been deleted after use
- Data from projects that have been started and abandoned or put on hold for a long time
- Data belonging to former employees that can no longer be accessed or are no longer useful or relevant
- Archived data that is still being kept in operational systems
- Archived data whose storage/archive time has expired but has not been deleted
- Multiple backups, or backups kept for longer than the efficient usage duration
- Data collected for a defined purpose that never materialized
- Data collected “just in case” it may be useful for something in the future
Why Should We Care About Them? – Individual Use
At a personal level, I suppose I am not the only one who feels annoyed when searching for a document created some time ago and trying to figure out which is the latest version. To be sure, I check my sent emails, where I find even more versions than anticipated. I assume I am also one of many who have seen a message saying that there is not enough space and that some files should be cleaned up.
That is the best-case scenario; the worst is when the system crashes or runs very slowly due to a lack of disk space.
Some of us are aware of the need for periodic cleanup, but we are still surprised when we see the number of temporary or large files we have not accessed for years.
Of course, this includes the pictures we took on various occasions, events, and trips, or those of our children, of which there are never too many and which are never protected enough. Therefore, just in case, we save them in several locations: on our phones, on our computers, in one or more clouds, and also sent (and stored) in emails or messaging applications.
The most curious of us look at system or application logs and history, and we are amazed at the volume and the timeframes we have information about. Information we never use.
Those who are aware of technological risks make periodic backups of their computers, phones, or accounts. This is a good thing, but less so if we keep and store every backup from the past 10 years.
The list of data we store that is not only useless but also harms our effectiveness or efficiency could go on. These are the so-called “dark data”.
At An Organizational Level
Unfortunately, all of the above applies just as much at an organizational level, only at a higher order of magnitude in terms of data sources, volume, and potential use.
Moreover, organizations have additional reasons to care about them: reasons that could be irrelevant for an individual but are critical for a business.
Some of these are:
- Meeting legal requirements
- Complying with assumed standards
- Rising unproductive costs, direct and indirect, of managing data
- Managing security and business continuity risks
- Impacting work efficiency
- Losing customer trust
- Missing new business opportunities
The reasons look obvious, so one could ask why they have not already been addressed if they can cause such trouble for an organization. The answer is that the decision makers are not aware of these data, or at least not of their volume or impact. This is the reason we are calling them “dark data”.
What Can Or Should We Do With Them?
All the people, systems, and processes in an organization collect or generate masses of data, and some of them will inevitably become “dark data”. We cannot ignore them, but we should not even aim to eliminate all of the “dark data”, as the cost of doing so could be higher than the cost of living with them. Instead, we should try to minimize them, keeping them at a level that does not impact business effectiveness and efficiency.
An approach to treat “dark data”, or any data which looks like “dark data”, would contain several logical steps, such as the following:
- Dark data identification
- Quantitative evaluation
- Classification
- Impact estimation
- Identification of (probable) causes
- Decide on “dark data” treatment
- Implement measures to avoid/minimize dark data volume and impact
- Continuous monitoring of data
Now, let us take a deeper look at them one by one:
“Dark Data” Identification
As the name suggests, these data are not always obvious, at least not if we are not looking for them. So where could we start? Based on the examples proposed here, or the classification proposed later in this material, we should look for them source by source, process by process, system by system. Of course, using automated tools would make our life easier, and there are a lot of them on the market, including many free ones. Your IT team or admin could also help. In fact, without IT help, this initiative is almost doomed: aside from having the access rights, they know where to look and how to do it.
The beginning could be difficult, but once you start to discover dark data, it becomes easier to apply similar search patterns and logic to other processes or IT systems.
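As an illustration of what such a search pattern might look like, here is a minimal Python sketch that walks a directory tree and reports two common symptoms: files that have not been modified for a long time, and files with identical content. The function names, the age threshold, and the choice of SHA-256 are assumptions made for this example; it is not a replacement for a dedicated discovery tool.

```python
import hashlib
import time
from collections import defaultdict
from pathlib import Path


def find_stale_files(root, max_age_days, now=None):
    """Return files under root last modified more than max_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    stale = []
    for path in Path(root).rglob("*"):
        if path.is_file() and path.stat().st_mtime < cutoff:
            stale.append(path)
    return sorted(stale)


def find_duplicates(root):
    """Group files under root by content hash; return groups of 2+ files."""
    by_hash = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Reads the whole file into memory: acceptable for a sketch,
            # but large files would be better hashed in chunks.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_hash[digest].append(path)
    return [sorted(paths) for paths in by_hash.values() if len(paths) > 1]
```

Calling `find_stale_files("/some/share", 365)` and `find_duplicates("/some/share")` (the path is a placeholder) gives a first candidate list to review with the data owners. The same two questions, "how old?" and "how many copies?", can then be asked of mailboxes, databases, or backup stores with the appropriate tools.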
Quantitative Evaluation
As we mentioned earlier, we should not aim to get rid of all “dark data”, as the cost could easily be higher than the negative impact. So we should aim to remove the most critical ones, which could have a real impact.
So, the first criterion would be the volume of data. The more data that we do not need or use, the worse it is. However, the overall volume is not the only criterion. For example, 10,000 small (10 KB) files could have a larger negative impact than a single 1 GB file, even if their overall volume is ten times lower.
Moreover, the quantitative aspect is not the only criterion. Sometimes one single small file, in the wrong place, containing sensitive information, could be the main target of the clean-up.
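The arithmetic behind the example above can be checked with a short Python sketch. The `VolumeSummary` type and the `summarize` helper are hypothetical names introduced here, and decimal units (1 KB = 1,000 bytes) are assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VolumeSummary:
    file_count: int
    total_bytes: int


def summarize(sizes_in_bytes):
    """Reduce a collection of file sizes to a count and a total volume."""
    sizes = list(sizes_in_bytes)
    return VolumeSummary(file_count=len(sizes), total_bytes=sum(sizes))


KB = 1_000          # decimal units, assumed for this example
GB = 1_000_000_000

many_small = summarize([10 * KB] * 10_000)  # 10,000 files of 10 KB each
one_large = summarize([1 * GB])             # a single 1 GB file

# The single large file holds ten times the volume of the small files...
volume_ratio = one_large.total_bytes / many_small.total_bytes
# ...while the small files represent 10,000 times more objects to manage.
count_ratio = many_small.file_count / one_large.file_count
```

Here `volume_ratio` is 10.0 and `count_ratio` is 10000.0, which is why it helps to record both the number of items and their total size when evaluating dark data.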
Classification – The “Classical” Way
The first attempt to classify these useless or unused data was the concept of ROT data.
The term “ROT data” is derived from the acronym “ROT”, which stands for “redundant, obsolete, or trivial” data. This term has been used for many years in the information management industry to refer to data that is no longer useful or relevant but is still being stored and managed by an organization.
All of these concepts have been included in the scope of this article as they overlap with the concept of ‘dark data’.
Classification means assigning the data to different categories and subcategories, based on different criteria. The criteria could vary from the perception of the data, to its source, to its impact, depending on the specific job requirements of the person doing the classification.
An example could be the following:
- Useless Data: Unusable data, Redundant data, Outdated data, Trivial data, Low-quality data, Temporary data.
- (Just) Unused Data: Hidden data, Inaccessible data, Ignored/forgotten data, Data on hold/awaiting, Data collected “just in case”.
However, this is not an easy task. The classification depends on the purpose, the source, the organization, and the knowledge of the people and functions involved in processing and analysis, because data usually crosses many functions on its way from generation and collection to processing, storage, analysis, etc.
A More Effective Way of Classifying Dark Data
Taking the same sample of “dark data” and asking different people from different functions is likely to give completely different results. Most likely, IT, Marketing, Legal, Finance, Operational, Compliance, etc., would have different classifications. Even within the same department, like IT, a system or application admin would have a different classification than a cybersecurity specialist, a project manager, a business analyst, or IT management.
The first attempt in classifying “dark data” would be to separate those which are clearly useless from those which are just unused but could potentially be useful. The question is “useful for whom”? The person responsible for collection is not always, and in fact rarely, the main beneficiary of the data.
Thus, a better alternative to pure classification is using keywords to describe certain characteristics used in classification. This way, using as many keywords as we feel relevant, from all relevant stakeholders, we will not need to fit into a certain category, and it even gives some hints about the perceived issues with these dark data and the possible solutions to address the issue.
Some possible keywords could be (not exhaustive and in alphabetical order):
- Abandoned, awaiting, broken, conflicting, counterproductive, disorganized, disregarded, fragmentary, hidden, ignored, impractical, inaccessible, incomplete, incongruent, inconsistent, insignificant, irreconcilable, irrelevant, low-quality, masked, meaningless, messed-up, negligible, not understood, omitted, out of view, outdated, out-of-date, partial, pending, purposeless, redundant, scrambled, substandard, temporary, too complex, transitory, uncertain, undecided, undetected, unknown, unnecessary, unorganized, unreachable, unstable, unstructured, unusable, unverified, unwanted.
However, we will see that perception is different, and what is useless to one could be useful for other stakeholders, and what is too complex for one could be trivial for another. But all this information together would give important hints for the next steps.
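One way to picture keyword-based classification is a small structure that records which stakeholder attached which keyword to which data set. The Python sketch below is only an illustration: the `KeywordTags` class, its method names, and the data set name are invented for this example.

```python
from collections import defaultdict


class KeywordTags:
    """Collects free-form keywords per data set and who assigned them."""

    def __init__(self):
        # data set name -> keyword -> set of stakeholders using that keyword
        self._tags = defaultdict(lambda: defaultdict(set))

    def tag(self, dataset, stakeholder, *keywords):
        for keyword in keywords:
            self._tags[dataset][keyword.strip().lower()].add(stakeholder)

    def keywords_for(self, dataset):
        """Return {keyword: sorted list of stakeholders} for one data set."""
        tagged = self._tags.get(dataset, {})
        return {kw: sorted(who) for kw, who in sorted(tagged.items())}

    def shared_keywords(self, dataset, minimum=2):
        """Keywords that at least `minimum` stakeholders agree on."""
        return [kw for kw, who in self.keywords_for(dataset).items()
                if len(who) >= minimum]


tags = KeywordTags()
tags.tag("crm_export_2019", "IT", "redundant", "outdated")
tags.tag("crm_export_2019", "Marketing", "Outdated", "incomplete")
```

In this made-up case, `tags.shared_keywords("crm_export_2019")` returns `['outdated']`: both functions agree the export is outdated, while "redundant" and "incomplete" reflect each function's own perspective. Keeping all keywords, rather than forcing a single category, preserves exactly the differences in perception described above.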
Impact Estimation
Once we have classified the data, or even before (the order is not of paramount importance in this case), the next step is to estimate the impact. Estimating the impact yields two different pieces of information: where the impact is higher (for example compliance, legal, operational, finance) and the magnitude of the impact.
For each impact area, thresholds should be defined to decide what is worth addressing and what is not. This is iterative work, however, and it becomes easier with each iteration.
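A minimal sketch of per-area thresholds could look like the following Python snippet. The 1-to-5 scoring scale, the threshold values, and the function names are all assumptions chosen for illustration; each organization would define its own.

```python
# Hypothetical thresholds on a scale from 1 (negligible) to 5 (severe).
IMPACT_THRESHOLDS = {"compliance": 2, "legal": 2, "operational": 3, "finance": 4}


def areas_to_address(impact_scores, thresholds=IMPACT_THRESHOLDS):
    """Return the impact areas whose score reaches the area's threshold."""
    missing = set(impact_scores) - set(thresholds)
    if missing:
        # Fail loudly so that a missing threshold gets defined, not ignored.
        raise KeyError(f"no threshold defined for: {sorted(missing)}")
    return sorted(area for area, score in impact_scores.items()
                  if score >= thresholds[area])


def main_impact_area(impact_scores):
    """Return the area with the highest estimated impact."""
    return max(impact_scores, key=impact_scores.get)


# Example scores for one data set (made-up values).
scores = {"compliance": 4, "legal": 1, "operational": 3, "finance": 2}
```

With these example values, `areas_to_address(scores)` returns `['compliance', 'operational']` and `main_impact_area(scores)` returns `'compliance'`. Since the work is iterative, the thresholds would typically be adjusted after the first few rounds.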
Identification of (Probable) Cause
In order to decide what to do with them, we should understand why these “dark data” are there. Are they a normal artefact of a business process, which we should delete after a while? Are they due to a misconfiguration of an IT system or a broken process (especially a cross-functional one)? Or are they a result of changes in the IT systems or the business process? This is important to know in order to prevent their generation after we clean up the mess.
Decide On “Dark Data” Treatment
After we have a quantitative estimation of these data, impact estimation, probable cause, and classification, we can decide what to do with them. Generally speaking, we have to choose from the following alternatives:
- Let them be – when quantity/impact is irrelevant
- Keep them at a minimum – when we need them, but not for a long period of time
- Delete them – when nobody sees a use for them, or they are causing more trouble
- Archive them – when they are not needed for operational purposes, but there are legal requirements to keep them
- Organize them – when these could be useful for a certain function
- Improve their quality – when possible
- Process them – when these would be useful, but not in their current format/status
Of course, the decision is not always obvious, but the keywords used in classification could give a hint about the needed actions.
For example:
- Abandoned, awaiting, disregarded, ignored – ask the intended usage owners to decide: delete/organize for use
- Broken, fragmentary, incomplete, inconsistent, insignificant, counterproductive, impractical, irrelevant – delete
- Negligible – delete or let them be
- Disorganized, irreconcilable, incongruent – label, analyze, then decide
- Hidden, masked, meaningless, omitted, too complex – analyze possible usage, then decide
- Inaccessible – get access and analyze or delete/archive
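The keyword-to-treatment hints above can be captured as a simple lookup table. The following Python sketch encodes exactly the examples listed; the names `TREATMENT_BY_KEYWORD` and `suggest_treatments`, and the fallback text for unlisted keywords, are assumptions added for illustration.

```python
# Each rule pairs a group of keywords with the treatment hinted at above.
_RULES = [
    (("abandoned", "awaiting", "disregarded", "ignored"),
     "ask the intended usage owners to decide: delete or organize for use"),
    (("broken", "fragmentary", "incomplete", "inconsistent", "insignificant",
      "counterproductive", "impractical", "irrelevant"),
     "delete"),
    (("negligible",), "delete or let them be"),
    (("disorganized", "irreconcilable", "incongruent"),
     "label, analyze, then decide"),
    (("hidden", "masked", "meaningless", "omitted", "too complex"),
     "analyze possible usage, then decide"),
    (("inaccessible",), "get access and analyze, or delete/archive"),
]

TREATMENT_BY_KEYWORD = {kw: treatment for kws, treatment in _RULES for kw in kws}


def suggest_treatments(keywords):
    """Map each keyword to its suggested treatment (a hint, not a decision)."""
    suggestions = {}
    for keyword in keywords:
        key = keyword.strip().lower()
        # The fallback text is an assumption for keywords not listed above.
        suggestions[key] = TREATMENT_BY_KEYWORD.get(
            key, "no suggestion: review manually")
    return suggestions
```

For instance, `suggest_treatments(["Negligible", "hidden"])` maps "negligible" to "delete or let them be" and "hidden" to "analyze possible usage, then decide". When a data set carries keywords that point to different treatments, the conflict itself is useful information for the stakeholders who make the final decision.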
Implement Measures to Avoid or Minimize “Dark Data” Volume and Impact
After we have treated the data, it is more efficient to implement the needed changes in the business processes or IT system configuration, in order to minimize these “dark data” or at least to identify them in due time for proper treatment.
Implementing a data management and governance framework, if one is not already in place, would definitely make this process easier; some categories of dark data would not even appear, as they would be discovered and treated in due time.
If you already have one, it would be useful to update your business glossary, data dictionary, and catalog after this effort.
Continuous Monitoring Of “Dark Data” Categories
It seems we have done everything that needed to be done. Not quite. There still remains one activity to be performed: monitoring the “dark data” existing in our business environment. IT systems are continuously changing, and business processes change as well, not to mention people. As a result, what was under control yesterday could become a problem tomorrow. Monitoring the dark data we know about, or overall categories such as application logs, and implementing similar controls for new systems or initiatives as soon as possible could save us a lot of time and money.
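As a rough illustration of such monitoring, the Python sketch below compares two snapshots of volume per “dark data” category and flags categories that are new or have grown beyond a tolerated rate. The function name, the category names, the sizes, and the 25% default are all assumptions for the example.

```python
def flag_growth(previous, current, max_growth=0.25):
    """Return categories that are new or grew by more than max_growth.

    previous and current map a category name to its volume (any unit,
    as long as both snapshots use the same one). max_growth is a
    fraction, so 0.25 means 25%.
    """
    flagged = []
    for category, size in current.items():
        before = previous.get(category)
        if before is None or before == 0:
            # New category, or one that was empty in the last snapshot.
            if size > 0:
                flagged.append(category)
        elif (size - before) / before > max_growth:
            flagged.append(category)
    return sorted(flagged)


# Made-up snapshots, for example in gigabytes per category.
last_month = {"application_logs": 100, "temp_files": 40, "backups": 500}
this_month = {"application_logs": 180, "temp_files": 42, "backups": 500,
              "abandoned_projects": 10}
```

With these made-up numbers, `flag_growth(last_month, this_month)` returns `['abandoned_projects', 'application_logs']`: the logs grew by 80%, and a category appeared that was not tracked before. Running such a comparison on a schedule is one simple way to notice when something that was under control yesterday starts to become a problem.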