Part two in our latest blog series looks at the role of Machine Learning and Artificial Intelligence in the Security Environment, and the challenges surrounding these technologies.
- What about Machine Learning and Artificial Intelligence’s track record in security?
From a security technology perspective, Artificial Intelligence (AI) and Machine Learning (ML) aren’t particularly new. Rule-based AI has been part of DLP and other security solutions for a very long time.
Automated routines and intelligent algorithms underpin how DLP works and assist with security-related decisions about the sensitivity of data. However, historical use of ML and AI has consistently underachieved, because a fully automated approach to sensitive data scanning and classification lacks true context about the actual sensitivity of the data. The result is a sharp rise in false positives and false negatives, which means either more administrative resources to deal with these errors, or a reduced level of protection as rules are relaxed to stem the flow of false positives – all of which costs time and money. People soon lose trust in the classification process if it is not reliable.
In the first installment of this series we looked at digital assistants using ML for speech-to-text and AI for basic word identification. ML programs now exist that can perform basic comprehension checks on documents, and it is possible to get answers about the contents of a document – but is that suitable for data classification?
A data classification system must be able to understand the context of the whole document and how the document relates to the wider business. Typically, that wider business information is not part of the training data. A few specific data types, such as PII or PHI, can be pre-programmed into the learning; otherwise it is extremely difficult for the ML to understand contextual data.
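To illustrate what “pre-programmed” detection of data types like PII looks like in practice, here is a minimal sketch of rule-based pattern matching. The rule names and regexes are simplified illustrations, not production-grade validators or any vendor’s actual implementation:

```python
import re

# Illustrative rule-based detectors for two common PII patterns.
# Real DLP rules are far more elaborate, but the principle is the same:
# a fixed pattern either matches or it doesn't.
PII_RULES = {
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

def find_pii(text):
    """Return the names of the rules whose pattern appears in the text."""
    return [name for name, pattern in PII_RULES.items() if pattern.search(text)]

print(find_pii("SSN 123-45-6789 for jane@example.com"))
```

Note what the match does not tell you: the same SSN-shaped string could appear in test data, documentation or a genuinely sensitive record. The pattern fires either way, which is exactly the missing-context problem described above.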
Speed is an additional concern. Due to host application limitations, live scanning of a document is often only possible when it is saved or printed, so there is frequently a delay while the document is scanned before a classification can be recommended.
Let’s also look at the term Machine Learning. It is what it says: the machine is learning. Learning, by its very nature, involves making mistakes, and while high levels of mistakes in the form of false positive or false negative errors might be expected in the short term, the question is how long a business has to wait until these reach an acceptable level. Often an ML system will suggest a label with a percentage certainty. What would a user do with a recommendation that is 75%-85% certain? It is unlikely that this figure would be good enough for most businesses, but a “lazy” user might simply accept the classification every time. Boldon James research into scanning emails indicates that this rate is the norm: certain data types produce a higher figure, others a lower one.
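A sketch of how such a confidence-gated recommendation might work follows. The threshold value, function name and example scores are illustrative assumptions, not figures from the research above:

```python
# A minimal sketch of confidence-gated classification: below the threshold,
# the suggested label is routed to the user instead of being applied
# automatically. The 0.90 threshold and the scores below are assumptions.
CONFIDENCE_THRESHOLD = 0.90

def recommend_label(label, confidence, threshold=CONFIDENCE_THRESHOLD):
    """Auto-apply a label only when the model is confident enough."""
    if confidence >= threshold:
        return (label, "auto-applied")
    return (label, "needs user confirmation")

print(recommend_label("Company Confidential", 0.80))  # in the 75%-85% band
print(recommend_label("Public", 0.97))
```

The design choice here is the point of the paragraph: a score in the 75%-85% band is not strong enough to act on silently, so involving the user at that step is what keeps the error rate, and the user’s trust, under control.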
The good news is that this trend has shifted away from pure automation in the past five years, towards greater user involvement, in order to improve the accuracy and performance of security decisions and the supporting technology, and to drive greater operational efficiency.
Certain data types are very predictable and can be easily classified. It would be simple for an ML system to identify, for example, an invoice and classify it correctly. On the other hand, how would an ML system classify this blog? It contains technical discussion, references to company names, and financial information. Should it be company confidential? Clearly not: with context applied, this is a blog, so the classification should be public.
For more information, you can watch our recent webcast “Actual vs Artificial Intelligence” here now.