The final part of our three-part blog series looks at the role of Artificial Intelligence (AI) and Machine Learning (ML) in data classification, and what makes a good fit between AI and classification. If you have not yet read part one and part two, catch up on these now.
- How do Machine Learning and Artificial Intelligence fit into the world of data classification?
In specific situations, they can certainly play a supporting part. Some types of information, such as PII or PCI, can be more easily identified by a Machine Learning algorithm. However, these more easily identifiable types of sensitive information are only a small part of the story. Boldon James is actively researching AI and ML techniques to identify areas where Machine Learning can make a positive impact in the field of data classification.
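To illustrate why PCI data counts as "more easily identified", here is a minimal sketch of a card number check. It is a plain pattern-plus-checksum rule rather than Machine Learning, and the regular expression and function names are our own assumptions for illustration, not part of any product.

```python
import re

# Hypothetical pattern: 13 to 19 digits, optionally separated by spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for index, char in enumerate(reversed(number)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Return digit strings in the text that look like card numbers and pass Luhn."""
    found = []
    for match in CARD_CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            found.append(digits)
    return found
```

Data with this kind of fixed structure gives automation something concrete to test against, which is exactly what a strategy document or a drawing lacks.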
Most organisations’ sensitive information is spread across a variety of unstructured data, much of which does not fall under a specific definition such as PII or PCI. It could be sensitive company strategy documents, R&D information or intellectual property, which may take various forms, from text to drawings and charts. This makes automated classification of such information extremely difficult: how can a machine understand whether a drawing file is sensitive or not? A data sheet for a new product should be protected, but a data sheet for an existing product would be “Public”. The system must also know how this data relates to the company and to its competitors, which means taking context into account. What tends to happen, as with all automated approaches, is that files are either over- or under-classified.
Take a classification label for a sensitive special project, ‘Project Atlanta’. Imagine how many documents and emails might contain both of those keywords in a variety of contexts: it would run into tens of thousands. This means the potential for tens of thousands of false positives or false negatives, with files either over-classified as ‘Sensitive’ or under-classified as ‘Public’. The risk of sensitive project information being leaked is concerningly high.
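The ‘Project Atlanta’ problem can be sketched in a few lines. The rule below is a deliberately naive keyword match, and the three sample documents are invented for illustration only.

```python
def naive_label(text: str) -> str:
    """Label a document 'Sensitive' if it mentions both keywords, else 'Public'."""
    lowered = text.lower()
    if "project" in lowered and "atlanta" in lowered:
        return "Sensitive"
    return "Public"


# Hypothetical documents, made up to show both failure modes.
documents = {
    # Harmless, but contains both keywords: over-classified.
    "travel": "Expense claim for the Atlanta conference; project code to follow.",
    # Genuinely about the special project: correctly flagged.
    "plan": "Project Atlanta launch plan and pricing strategy.",
    # About the special project but never names it: under-classified.
    "codename_only": "Notes on the new initiative we discussed (see the usual codename).",
}

labels = {name: naive_label(text) for name, text in documents.items()}
```

Here the travel expense claim is marked ‘Sensitive’ while the note that avoids the project name is marked ‘Public’. Only someone who knows the context can tell which is which.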
ML has developed well at classifying the content of a document, and gives good results in understanding the subject of the text material, but not its business sensitivity. Certain documents, like invoices and legal documentation, are relatively easy to classify, but other documents are much harder. For example, classifying a document by its business sensitivity is very hard without human intervention.
The user is still key to classification. They know the context of a file and its relevance to the business. ML can learn the most likely classification for a file from the user, as a user will often generate the same types of file.
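As a rough sketch of what "learning from the user" could look like at its simplest, the class below counts the labels each user has applied to each file type and suggests the most frequent one. This is a frequency count standing in for a real ML model, and the class name, method names and labels are all hypothetical.

```python
from collections import Counter, defaultdict
from typing import Optional


class LabelSuggester:
    """Suggest the label a user has most often applied to a given file type."""

    def __init__(self) -> None:
        # Maps (user, file_type) to a count of the labels that user has chosen.
        self._history = defaultdict(Counter)

    def record(self, user: str, file_type: str, label: str) -> None:
        """Store one classification decision made by a user."""
        self._history[(user, file_type)][label] += 1

    def suggest(self, user: str, file_type: str) -> Optional[str]:
        """Return the most frequent past label, or None if there is no history."""
        counts = self._history.get((user, file_type))
        if not counts:
            return None
        return counts.most_common(1)[0][0]


suggester = LabelSuggester()
suggester.record("alice", "invoice", "Internal")
suggester.record("alice", "invoice", "Internal")
suggester.record("alice", "invoice", "Confidential")
suggestion = suggester.suggest("alice", "invoice")
```

Without any recorded decisions the suggester has nothing to offer, which mirrors the point that the user's choices are the training data.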
The main challenges of using ML in a classification project are time and resources. To reach any level of classification accuracy with ML, you not only need to involve the user in making decisions (so the machine can learn), but you also need a massive amount of training data to enable the machine to learn, and this comes from your people.
With most classification programmes there are still too many variables for any ML system to handle with any degree of accuracy, and ML still relies on huge amounts of data to predict anything reliably. Your starting point for training would be a data set classified by users (hopefully subject matter experts). Without users you have no basis for the system to learn.
How will the system improve accuracy over time? For certain file types, automation will definitely play a part when used in conjunction with the user for newly created data. For stored data, results are likely to be unreliable, as the ML has no concept of the external influences on the data.
- What is the right blend between AI, ML and data classification?
As with any approach, a balance is required. Utilise automation intelligently, in the form of AI or ML, to help the user identify and categorise data such as PII or PCI efficiently. Then blend this with the need for users to contribute actively to the process, either through endorsement or verification of automation selections, or through the manual application of labels where the correct context needs to be applied.
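One way to picture this blend is a small decision step in which automation only proposes a label and the user always has the final say. The sketch below is an assumption about how such a step might be modelled; the `Decision` type, the `resolve_label` function and the source names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    label: str   # The label finally applied to the file.
    source: str  # "user-endorsed", "user-override" or "user-manual".


def resolve_label(suggestion: Optional[str], user_choice: str) -> Decision:
    """Combine an automated suggestion (if any) with the user's choice."""
    if suggestion is None:
        # Automation had nothing to offer, so the user labelled manually.
        return Decision(user_choice, "user-manual")
    if user_choice == suggestion:
        # The user verified the automated selection.
        return Decision(user_choice, "user-endorsed")
    # The user knew context the automation did not, and corrected it.
    return Decision(user_choice, "user-override")


endorsed = resolve_label("Confidential", "Confidential")
overridden = resolve_label("Public", "Sensitive")
manual = resolve_label(None, "Internal")
```

Recording the source alongside the label also shows how often automation is being endorsed or corrected.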
Getting the right blend between automation and user involvement will enable organisations to minimise training requirements, reduce administrative workload and reduce the risk of data loss incidents. It also empowers the user community to take an active part in security, bringing with it all the positives around awareness and cultural improvement.
Empowering people to take an active role in identification and decision making around data protection is vitally important if organisations are to make a serious shift in security culture and improve security awareness and accountability. Considering how much organisations spend on user awareness and security training each year (well into the millions of dollars), we believe this is an important and extremely valuable benefit.
The era of the machine may be here, and organisations need to understand and embrace its benefits in an appropriate way. At the same time, it has never been more important to involve humans in the process: the user is fighting back.
Boldon James is actively researching AI and ML techniques to identify areas where Machine Learning can help in data classification. If you would like to discuss our approach to data classification, please get in touch.