Can AI help us to protect our data?
The answer is more or less straightforward, as there are already examples of applied AI algorithms in this area. I suppose the first online data protection technique is well known to the reader. The purpose of this tool is typically to prevent automated access to online systems by inserting a human check into certain parts of the online workflow. This method is well known by its abbreviation: CAPTCHA. The early solutions did not involve AI algorithms on the protection side, only on the attacking side. Nowadays, there are various solutions on the market. The most widely used technique, reCAPTCHA, involves AI. In this case, the primary role of AI is not the classification between humans and bots but dataset building. When a visitor solves a CAPTCHA presented by reCAPTCHA, the images are annotated automatically by a no-cost human workforce. The software probably presents exactly those images that an AI algorithm cannot label with high confidence.
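The following minimal sketch illustrates this uncertainty-sampling idea. The classifier outputs, class count, and confidence threshold are my own assumptions; the actual reCAPTCHA internals are not public.

```python
import numpy as np

def select_images_for_humans(probabilities: np.ndarray, threshold: float = 0.9):
    """Pick images whose top-class confidence falls below the threshold.

    probabilities: (n_images, n_classes) softmax outputs of some classifier.
    Returns indices of images to route to human CAPTCHA solvers; their
    answers can then be used as labels to grow the training set.
    """
    top_confidence = probabilities.max(axis=1)
    return np.where(top_confidence < threshold)[0]

# Example: 4 images, 3 classes; images 1 and 3 are uncertain.
probs = np.array([
    [0.97, 0.02, 0.01],
    [0.40, 0.35, 0.25],
    [0.95, 0.03, 0.02],
    [0.55, 0.30, 0.15],
])
print(select_images_for_humans(probs))  # -> [1 3]
```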
Biometric identification has also found its way into commercial products. Some smartphones and laptops already offer face and fingerprint identification. In addition to such techniques, alternative solutions are also possible, for example, skin impedance-based identification and breath analysis based on exhaled biomarkers. All of the mentioned techniques strongly rely on AI algorithms. At the moment, such solutions are applied mainly in law enforcement and financial scenarios.
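Under the hood, such systems typically reduce a face or fingerprint to a numeric embedding and compare distances. A minimal sketch of that idea follows; the four-dimensional embeddings and the threshold are invented for illustration (real systems use embeddings with hundreds of dimensions):

```python
import numpy as np

def verify(embedding_a, embedding_b, threshold: float = 0.7):
    """Compare two biometric embeddings (e.g. face vectors produced by a
    neural network) with cosine similarity; 'same person' above threshold."""
    a = np.asarray(embedding_a, dtype=float)
    b = np.asarray(embedding_b, dtype=float)
    sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return sim >= threshold, sim

enrolled = [0.9, 0.1, 0.3, 0.2]
probe_same = [0.85, 0.15, 0.28, 0.22]
probe_other = [0.1, 0.9, 0.2, 0.7]
print(verify(enrolled, probe_same))   # high similarity -> accepted
print(verify(enrolled, probe_other))  # low similarity  -> rejected
```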
Device/browser fingerprinting is also a rich source of information. The fingerprint contains relevant information about the user that can be acquired online. Such attributes include, e.g., browser type and version, IP address, geolocation, and the history of these values. AI algorithms can be applied to such data to detect account fraud, protect payment processing, etc.
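As an illustration, here is a hypothetical sketch of how such attributes could be combined into a stable fingerprint and compared against a user's history. The attribute names and the hard "never seen before" rule are my own assumptions; real fraud systems feed such features into trained classifiers instead.

```python
import hashlib
import json

def fingerprint(attrs: dict) -> str:
    """Hash the sorted attribute dict into a compact device fingerprint."""
    canonical = json.dumps(attrs, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def looks_suspicious(current: dict, history: list[dict]) -> bool:
    """Naive rule: flag if the current fingerprint never occurred
    in the user's login history."""
    seen = {fingerprint(h) for h in history}
    return fingerprint(current) not in seen

history = [{"browser": "Firefox 126", "ip": "203.0.113.7", "geo": "DE"}]
current = {"browser": "Chrome 125", "ip": "198.51.100.9", "geo": "VN"}
print(looks_suspicious(current, history))  # -> True
```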
In addition to user account protection, enterprise-grade solutions are also applied in practice. Enterprises are not trivial to protect, due to the relatively high success rate of social engineering. The typical application areas of AI algorithms, in this case, are antivirus software and smart firewall systems. The specialty of antivirus classifiers is the need for an extremely low false-negative rate. The difficulty of smart firewall systems lies in the heterogeneous and distributed data such systems rely on.
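Such a false-negative requirement is usually met by tuning the decision threshold on validation data, not by the model alone. A sketch of that tuning step, with invented classifier scores and an assumed 0.1% false-negative budget:

```python
import numpy as np

def threshold_for_max_fnr(scores, labels, max_fnr=0.001):
    """Choose the highest decision threshold whose false-negative rate
    (malware classified as benign) stays below max_fnr on validation data.

    scores: malware probabilities from some classifier; labels: 1 = malware.
    A lower threshold misses less malware at the cost of false positives.
    """
    malware_scores = np.sort(scores[labels == 1])
    # Allow at most max_fnr of the malware samples below the threshold.
    k = int(max_fnr * len(malware_scores))
    return malware_scores[k]  # everything >= this value is flagged

rng = np.random.default_rng(0)
scores = np.concatenate([rng.uniform(0.0, 0.6, 5000),   # benign
                         rng.uniform(0.3, 1.0, 5000)])  # malware
labels = np.concatenate([np.zeros(5000), np.ones(5000)]).astype(int)
print(threshold_for_max_fnr(scores, labels))
```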
Can AI help with data acquisition activities?
As already mentioned, AI can be used to overcome both automated and human-operated IT protection. Such attacks include, for example, database acquisition, propagation of fake news, breaking data protection, password cracking, identification of weak spots, selection of social engineering targets, and impersonation of people (voice, video).
How did AI affect common data protection practices?
AI algorithms are ubiquitous these days. State-of-the-art techniques can be found under the hood of most of the common online platforms of tech giants, such as Google, Facebook, Booking, Spotify, Amazon, YouTube, etc.
The goal of such algorithms is to increase the efficiency of these systems with respect to a previously defined business goal.
From the business perspective, the best strategy should lead to a win-win situation, which means that AI should be applied in a way that is attractive both to the service providers and to the users. This principle should make the business work.
However, there are also dissenting voices. Looking at the problem in black and white, the application of AI algorithms can be treated from two different perspectives: AI can be malicious or good. One can say that malicious companies use their data to coldly calculate incentives for a higher consumption rate; they are interested only in profit. On the other hand, the application of AI in Information Retrieval (IR) can be treated as a purposeful Intelligent Augmentation (IA) tool, which provides value to online users. In general, the task of IR is to process a huge set of items. By huge we mean that it cannot be processed item by item by humans, because of time and capacity limitations. IR algorithms help users to find relevant items in this set. This relevance can be calculated relative to a search query or to a specific user. In the first case, we talk about web search engines, such as Google or Bing. In the latter case, we talk about personalization. In most cases, web search goes hand in hand with personalization.
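To make the query-relevance case concrete, here is a toy sketch using TF-IDF vectors and cosine similarity. The three-document corpus and the query are invented, and production search engines combine many more signals:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "cheap flights to Amsterdam",
    "hotel booking in Amsterdam city center",
    "best jazz albums of the year",
]
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)

query = "Amsterdam hotel"
query_vector = vectorizer.transform([query])

# Rank documents by cosine similarity to the query.
scores = cosine_similarity(query_vector, doc_vectors).ravel()
for idx in scores.argsort()[::-1]:
    print(f"{scores[idx]:.2f}  {docs[idx]}")
```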
From the business perspective, organizations equipped with AI can provide a better service, as users tend to find the relevant products/services more easily. This is true for search results, advertisements, audio tracks, travel destinations, videos, etc. In general, better service leads to higher engagement and, in the end, higher profit.
AI algorithms are trained on examples, so-called datasets. This means that a relatively massive number of examples must be presented to the algorithm. Based on its modeling ability, the algorithm then learns the interdependencies in the data with a certain accuracy. The essence of algorithm development is to find the right algorithm for the right task. If the algorithm is too simple, it will not be able to fulfill its task at all. If the algorithm is too powerful, it will memorize the training data instead of learning its interdependencies. In general, this means that AI algorithms are hungry for training data. The quality of the data strongly determines the quality of the algorithm. An organization that manages to acquire sufficient, high-quality data will be able to develop high-quality algorithms and will hopefully be able to convert them into a revenue stream. This paradigm is often summarized as "data is the new oil": the organizations that manage to gather relevant, high-quality data have the chance for a higher profit.
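The too-simple versus too-powerful trade-off can be demonstrated in a few lines. The sketch below fits polynomials of increasing degree to invented noisy samples of a sine wave; the growing gap between training and test error at high degree is exactly the memorization described above:

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)  # noisy training data
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test)                     # clean ground truth

for degree in (1, 3, 9):  # too simple, about right, too powerful
    coeffs = np.polyfit(x, y, degree)
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")
```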
There were times when the friend list of any user, along with other sensitive data, could be queried from Facebook without any significant restriction. Back then, it was culturally accepted and also unregulated. The first moves toward data protection were made by the organizations themselves, to protect their own databases, their own business. The popularization of search engines, social media platforms, and massive content providers, and the expansion of applied AI algorithms, attracted the spotlight to this sector. Spotlight and revenue typically come with regulations sooner or later, and so it happened. We call it GDPR, or MDR in the medical world. Data collection became regulated.
The primary message bound to GDPR is data protection, which the average citizen generally welcomes and which is a noble goal. The question lies more in the implementation of the principles of the regulation. In practice, data protection is conducted with administrative/bureaucratic techniques. This means many dedicated working hours for legal, economic, and IT professionals to elaborate IT workflows that operate according to the principles described in the GDPR.
In short, the cost of data collection has increased.
Does GDPR protect the citizens? The answer could be: in some cases yes, and in some cases no.
I still receive phone calls from a broker I have never met, who wants me to transfer them my money and recommends that I buy high-performance stocks. I have already asked the broker to remove me from their database, without any effect. I still receive emails from a "bank in the Netherlands" about billions of euros that will be transferred to my account as soon as I pay the transaction fee. I still receive a lot of spam from companies I have never heard of. It seems that GDPR had a short-lived attenuation effect on spamming/scamming activity, but that effect is gone. Basically, this reflects the efficiency of the law enforcement techniques currently applied.
Various business models have appeared in the online world to acquire data. In the case of tech giants, the data can be found directly in their online systems. Some software libraries/devkits provide AI algorithms and also help developers with data acquisition and with dashboards to manage data collection user settings. I guess that most users keep the default settings by just clicking the 'ok' button and are more or less indifferent to the terms and conditions and to the GDPR settings.
I think that the economic effects of GDPR are also not communicated clearly and are not sufficiently visible. In general, from the economic side, regulations like MDR, GDPR, and Standard Essential Patents (SEP) typically help big companies and hinder small ones, as such regulations increase the costs of entering and operating on the market. Furthermore, big companies have more resources, more routine in lobbying, and better access to legislative procedures. Such mechanisms can be used to conduct market cleaning, a process that has its advantages and disadvantages.
AI-capable IoT applications are gaining more and more attention these days. As already mentioned, the first step of developing an AI algorithm is data collection. Fortunately, this type of data typically does not involve personal information and can therefore be collected in a less regulated manner.
This means that there is a significant pool of AI algorithms that can be developed without asking users for permission.
Finally, privacy is a key issue in AI algorithm development. Even if the training data is properly anonymized, the design of the algorithm itself can leak information and lead to privacy violations. An example can be taken from the world of recommender systems, i.e., personalization algorithms. Such algorithms typically calculate user preferences and recommended items from user interactions, namely: who purchased what.
In the case of common items, when a lot of purchases are available, the recommendations are driven by statistics, for example, which items are purchased by similar users. In extreme cases, when rarely purchased items appear in the context, the identity of the purchaser can be revealed, as the sketch below illustrates.
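The purchase log below is invented for illustration. The point is that if only one user ever bought a niche item, any recommendation or statistic tied to that item points straight back to that user:

```python
from collections import Counter

# Hypothetical anonymized purchase log: (user_id, item)
purchases = [
    ("u1", "phone case"), ("u2", "phone case"), ("u3", "phone case"),
    ("u1", "usb cable"),  ("u2", "usb cable"),
    ("u3", "antique theremin sheet music"),  # rarely purchased item
]

item_counts = Counter(item for _, item in purchases)
for item, count in item_counts.items():
    buyers = {u for u, i in purchases if i == item}
    if count == 1:
        # A single buyer: publishing anything about this item
        # (e.g. "customers who bought X also bought...") identifies them.
        print(f"'{item}' uniquely identifies user {buyers.pop()}")
    else:
        print(f"'{item}' bought by {count} users; statistics dominate")
```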
To summarize this article, AI has several direct and indirect effects on data protection and authentication. The first part of the article (the first two questions) discussed the more technical aspects, focusing directly on applications and practical issues. The last question approached the effects more from a societal perspective, raising questions and reflecting on the mechanisms related to this topic.