
Handling Update Hotspots in Distributed Database Systems

Motivation

Database systems must deal with the fact that real workloads often exhibit hotspots: some items, at certain times, are accessed by concurrent transactions with high probability. This arises in telecoms, sensing, stock trading, shopping, banking, and numerous other applications. Some are as simple as counting events, such as user votes or advertisement impressions on websites. Others, such as prepaid telco plans, selling event tickets, or keeping track of remaining inventory, need not only to count but also to enforce a bound invariant, ensuring that the quantity being tracked does not cross a set threshold.

 

Update hotspots mean that locking and validation mechanisms used for isolation have a severe impact on usable throughput. This is particularly challenging in emerging cloud and edge database systems as classical techniques such as escrow locking are not applicable (e.g., locks in MySQL Group Replication are local and transactions in different nodes run optimistically), distributed synchronization has a considerable impact on latency (e.g., waiting for a stable time in Spanner), or the unpredictability makes them ineffective (e.g., how many separate splits are needed in a serverless system). Moreover, in industrial applications, one needs a solution that works in current cloud-based and off-the-shelf systems, that is, using only their application programming interfaces.

MRVs in a Nutshell

Structure

The AIDA project proposes Multi-Record Values (MRVs), a new approach to handle update hotspots in scale-out cloud and edge database systems. It builds on the general strategy of value splitting: each contended value is split into multiple database records, each holding part of the total value, so that they can be accessed concurrently. To add to or subtract from the value, one adds to or subtracts from any of these records. To read the current value, one reads and sums them all. The main novelty of MRVs is how each transaction is assigned to a physical record and how the various records, holding parts of the total value, are managed efficiently.

 

MRVs can be portrayed as a circular structure of size N. Of these N positions, n hold a physical record, each storing a share of the total value. In the example below, we have an MRV with N=23 and n=9, with the records represented by black circles.

Figure 1. Representation of MRV pki.

Operations

As different clients might have different access patterns, MRVs avoid statically assigning them to records, as done in previous splitting techniques. To ensure that accesses are evenly spread, our first insight in MRVs is to use a random number between 0 and N-1, for each access, to determine which record to use. Assuming that the number of records for each item is big enough, this results in a small probability of conflict. This avoids the need for explicit coordination of clients, which would be costly in a distributed environment. 

 

In the example below, transaction T1 wants to add 2 units to the MRV. As such, it performs a lookup with a random key, 4, and ends up updating the next physical record, assigned at index 6.

Figure 2. Example of an add operation on MRV psi. Only one record is accessed and modified.

Subtractions might not be fully possible on a single record if its current value is lower than the amount being subtracted, so a subtraction might require multiple accesses to complete. This could be done simply by keeping the remainder after a first subtraction and carrying it over to a second random record, and so forth, until the operation is fully done. This, however, makes it difficult to determine unsuccessful termination, which happens when all records have been visited and there is still a remainder. MRVs address this by performing only one random lookup and then scanning to the next record. After the last record, we wrap around and restart the process on the first, hence the circular structure. With a tree-type index, both the lookup and next operations execute efficiently.
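
To make the mechanics concrete, here is a minimal, single-threaded Python sketch of the circular structure and its add/sub operations. It is an in-memory stand-in: record placement, concurrency control, indexing, and persistence are handled by the underlying DBMS in the real system, and all names are ours, not the paper's API.

```python
import random

class MRV:
    def __init__(self, N, positions):
        self.N = N                                 # size of the circular structure
        self.records = {p: 0 for p in positions}   # position -> partial value

    def _scan(self, start):
        # Yield record positions clockwise, starting at the first record
        # at or after 'start', wrapping around the circular structure.
        keys = sorted(self.records)
        first = next((i for i, k in enumerate(keys) if k >= start), 0)
        for i in range(len(keys)):
            yield keys[(first + i) % len(keys)]

    def add(self, amount):
        # One random lookup; the whole amount goes into a single record.
        pos = next(self._scan(random.randrange(self.N)))
        self.records[pos] += amount

    def sub(self, amount):
        # Random lookup, then scan forward carrying the remainder;
        # fails if every record was visited and a remainder is left.
        remainder = amount
        for pos in self._scan(random.randrange(self.N)):
            taken = min(self.records[pos], remainder)
            self.records[pos] -= taken
            remainder -= taken
            if remainder == 0:
                return True
        return False

    def read(self):
        # Reading the current value requires summing all records.
        return sum(self.records.values())
```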

 

If two transactions try to update the same record concurrently, one of them will have to either roll back or wait. MRVs rely on the underlying concurrency control for this, which simplifies the implementation.

 

In the example below, transaction T2 wants to subtract 3 units from the MRV. It first accesses the record at index 8, which only has 2 units. As such, it sets its value to 0 and carries the remainder (1) to the next record.

Figure 3. Example of a sub operation on MRV psi. To complete the operation, two records needed to be accessed and updated.

Adjusting

Choosing an optimal n depends on the current load. On one hand, a low number will lead to a high conflict probability under high loads. On the other, a large number increases storage and read overheads, and can be counterproductive for MRVs with low values. Therefore, MRVs employ a background worker that dynamically adjusts the number of records per MRV based on the workload, using, e.g., the MRV’s conflict rate. 
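
As a rough illustration of the worker's decision rule, extending the sketch above (the thresholds and step size are illustrative assumptions, not the paper's tuning):

```python
import random

def adjust(mrv, conflict_rate, high=0.10, low=0.01, step=2):
    empty = [p for p in range(mrv.N) if p not in mrv.records]
    if conflict_rate > high and empty:
        # High conflict rate: add records, filling previously empty positions.
        for p in random.sample(empty, min(step, len(empty))):
            mrv.records[p] = 0
    elif conflict_rate < low and len(mrv.records) > 1:
        # Low conflict rate: remove a record, moving its amount to the others.
        victim = random.choice(sorted(mrv.records))
        mrv.add(mrv.records.pop(victim))
```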

 

In the example below, a higher load leads to an increased abort rate. To offset this, the adjust worker adds two new records, filling previously empty positions.

Figure 4. Example of the adjust worker adding two new records to MRV pki.

Balancing

Finally, skewed workloads can lead to imbalanced records. For example, in stock reservation use cases, we might have multiple small subtractions – clients buying the item – and a few large additions – the store restocking the item. This increases the conflict probability of subtract operations, reducing the usefulness of MRVs. Our solution is another background worker that periodically balances the amount between records, as exemplified in the animation below.

Figure 5. Example of the balance worker balancing amount among records in MRV pki.

Evaluation

MRVs are evaluated with experiments on different database management systems, including distributed systems where other solutions are not applicable. Namely, implementations of MRVs have been tested on: a centralized SQL system with PostgreSQL, often used in cloud-based managed services; a single-writer NoSQL data store with MongoDB; a multi-writer SQL database with MySQL Group Replication; and a cloud-native, multi-writer NewSQL system that we call System X.

 

This test mimics a shopping application that keeps stocks of products. We increase contention both by increasing the number of concurrent clients (X-axis) as well as decreasing the number of products (Y-axis). The heatmaps below display the scale-up in throughput against the native single-record solution. These experiments show that the MRVs technique is widely feasible and advantageous in a spectrum of different database management systems, including NoSQL and distributed systems where an improvement of 100x can be observed in cases of extreme contention.

Figure 6. Performance comparison between MRVs and the native single-record solution in a variety of database systems.

Final Remarks

Multi-Record Values improve the performance of applications affected by numeric hotspots by reducing the collision probability. The background workers ensure that MRVs adapt to the current load, improving write performance while keeping read and storage overheads low.

The open challenge that remains is extending randomized splitting to data structures other than numeric values.


The full paper can be accessed here: https://nuno-faria.github.io/publications/mrv. The code used in the experiments can be accessed here: https://github.com/nuno-faria/mrv.

 

References

Nuno Faria and José Pereira. 2023. MRVs: Enforcing Numeric Invariants in Parallel Updates to Hotspots with Randomized Splitting. Proc. ACM Manag. Data 1, 1, Article 43 (May 2023). https://doi.org/10.1145/3588723


There is always one more data bug

If we, as data scientists, receive a dataset from a reliable source, we should go ahead with the analysis (classification, clustering, deep learning, etc), right?

Well, yes, that’s what most of us (myself included) often do, especially if there is a tight deadline. However, this could be dangerous. Let me describe some rude awakenings I suffered over the past decades, as well as remind you of some fast and easy preventive measures.

 

Examples (a.k.a. horror stories)

E1 Geographical data

Two decades ago, we got access to a public dataset of cross-roads in California, that is, about 10,000 points in two dimensions ((x, y) pairs). We plotted it to include it in the spatial-indexing paper we wanted to submit – the plot looked mostly empty, with a lot of points in the shape of California at the bottom-left corner (see panel (a) below).

Why so much empty space?

The answer was that there was a single point somewhere in Baltimore (Atlantic coast), thousands of miles away from California. Clearly a typo – maybe the coordinates had the wrong sign or so. Since it was just one point, we deleted it.

 

 (a) Q: why is the CA dataset at the bottom left and the rest, empty?

 

 (b) A: because of a tiny stray point in Baltimore…

 

E2 Medical data – many ‘centenarians’

I heard this instance from a colleague at CMU (let’s call him ’Mike’). He was working with patient records (anonymized-patient-id, age, symptoms, etc). ’Mike’ was very careful, and did a histogram of the age distribution – and he noticed that ’99’ was an abnormally popular age! He asked the doctors that owned this dataset – they replied that, yeah, that’s what they used as a ’null’ value – since for some patients the age is unknown.

 

Remedies – Conclusion

Given that “There is always one more data bug”, what should we do as practitioners? While there is no perfect solution, I found the following defensive measures useful:

R1: Visual inspection – ’Plot something’: For any dataset, it often helps to plot the histogram (’pdf’) of each numerical column. We could even try logarithmic scales when we expect power-laws: Zipf, Pareto, etc. Spikes and outliers may reveal anomalies (like the ’99’ age in the medical-records incident) – see the sketch after these remedies.

R2: Keep in touch with the data owner: The medical doctors knew that ’99’ was the null value; in general, domain experts can help us focus on the real anomalies, as opposed to artifacts.
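
For R1, a minimal ‘plot something’ sketch, assuming pandas and matplotlib, with a log-scale option for suspected power-laws:

```python
import pandas as pd
import matplotlib.pyplot as plt

def plot_something(df: pd.DataFrame, logscale: bool = False):
    # One histogram ('pdf') per numeric column; a spike like the
    # '99' null-age would stand out immediately.
    for col in df.select_dtypes("number").columns:
        plt.hist(df[col].dropna(), bins=50, log=logscale)
        plt.title(col)
        plt.show()
```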

In conclusion, Data Science is fun and typically gives high value to the data owner. However, there is always room for errors in the data points, sometimes with painful impact. Anticipating and spotting such errors can help us provide even higher value to our customers. Paraphrasing our software engineering colleagues: There is always one more data bug.

 

Written by CMU


AIDA Research on Suspicious Behavior and Anomaly Detection

One of the goals of the AIDA Project is to investigate and identify new ways to help Analysts find Anomalous Behavior (of any kind) in large and complex pools of data. We hope this can ultimately lead to significant improvements in the detection of fraudulent activity occurring on Telecom Networks, especially in light of new technologies like 5G already expanding in the market. Moreover, we now see ever more sophisticated, complex and robust methods being used for committing fraud on Telecom Networks, methods that are also increasingly expanding to the world of Communication Service Providers and the Internet.

In essence, our goal is to create and develop new tools that provide new ways to help the Data Analyst and Data Scientist better understand and make sense of information originating from huge and dynamic pools of data, such as a Telecom Operator’s traffic activity. This type of data displays very particular characteristics, as it is based upon human interactions that are prone to exhibiting strong and established patterns – which is why we emphasize Data Visualization functionalities when designing and building these new tools.

We believe this can ultimately lead to better results at detecting Anomalous Behavior and uncover suspicious activity that would otherwise have remained undetected, which in turn can potentially stand as the basis for developing the next generation of commercial products and solutions aimed at tackling Anomalous Behavior in Telecom Operators’ traffic activity and other related scenarios. 

 

Case Building and Smart Fraud

 

One of the ideas behind this approach was a strong focus on facilitating the opening and development of new cases by the end user (usually the Analyst). Case Building is crucial for the Analyst to manage an investigation: it allows the Analyst to follow a logical, sequential analytical process, with the possibility to deep-dive on data of interest and to open and create new cases with ease. Current solutions such as those used at Mobileum include smart monitoring features that sense unusual activity in the Operator's traffic data. When a suspicious event occurs, the application can set a temporary alert on that specific agent, attempting to keep fraud losses to an absolute minimum without interrupting legitimate activity. This approach functions more as a broad analysis that can detect alterations to the normal flow patterns of traffic data. Therefore, usually only the more obvious and evident anomalous activity is detected, and a lot of illegal activity passes under the radar of these solutions. The Analyst here usually does not have the ability to perform a sequential deep dive on the data, reflecting on multiple perspectives sequentially layered in a manner that facilitates the workflow.

This is made more difficult by the rise of Smart Fraud, where fraudsters develop ever more sophisticated, efficient and complex ways to commit fraud, be it to increase success rates or to adapt to the anti-fraud mechanisms adopted by the Telecom industry.

The advent of AI and the ever greater sophistication, creativity and expertise of fraudsters are major factors behind this. New methods used by fraudsters focus on mimicking “normal”, non-fraudulent behavior to increase the chances of avoiding detection, in several ways. For example, regarding call traffic activity, we have seen approximations of normal call duration and frequency, and even the inclusion of verified contacts in the fraud-committing process.

We believe the work carried out in the AIDA Project will significantly contribute to new solutions that increase the accuracy of detecting Anomalous Behavior. We provide a brief overview of some of the work conducted so far in the next section.

 

Graph and Node Analysis in Anomaly Detection

 

One of the goals of this work was to provide Analysts with new ways to analyze Telecom traffic activity that can lead to the detection of anomalous behavior not caught by classic methods. Along those lines, we have introduced Graph Analysis to our research, together with powerful statistical and feature exploration, focused on providing insightful Data Visualization tools and techniques that can strongly help Analysts analyze traffic data. One of the methods of choice in such research is the study of Anomalous Behavior and suspicious activity using time-evolving graphs. Graphs are very powerful for analyzing human interactions and therefore make exceptional candidates for studying anomalous behavior in Telecommunication Networks’ traffic activity. Here, individual agents (people) are represented by nodes of the graph and the links between them represent connection events established by those nodes. We are interested in understanding the patterns that occur in this type of data, like who-calls-whom or who-sends-money-to-whom. The focus was on the unsupervised case, where there are no labels.

Our work so far with Anomalous Behavior has led us to use these approaches and techniques in the study of fraud occurring in Telecom Networks. We have been developing new analytical solutions, tested with a real-life dataset from an existing Telecom Operator. With the help of experts in this business domain, we were able to detect significant anomalous behavior in novel and potent ways, which reinforces the potential and value of this application. More specifically, we were able to identify new cases with high potential for Bypass Fraud. Furthermore, we present a case that the Telecom Operator later confirmed as a Bypass Fraud case that had not been previously identified. A solution we have recently published is depicted in figure 1.

Figure 1. Developed solution that uses the power of graphs to analyze, detect and determine Anomalous Behavior with the potential for Fraudulent Activity. Several relevant features can be selected as the basis for further exploration and investigation.

The proposed solution follows three sequential steps:

  • Step 1: ‘Feature-selection’: by carefully choosing features to extract from each node;
  • Step 2: ‘Summary’: high-level, interactive summary of the data;
  • Step 3: ‘Deep-dive’: allowing the user to focus on suspicious nodes.
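
As a rough illustration of Step 1 (the feature names and the edge-list layout are assumptions of this example, not the tool’s actual API), per-node features can be extracted from a who-calls-whom edge list with networkx and pandas:

```python
import networkx as nx
import pandas as pd

def node_features(calls: pd.DataFrame) -> pd.DataFrame:
    # calls: one row per call, with 'src', 'dst' and 'duration' columns.
    g = nx.from_pandas_edgelist(calls, "src", "dst", edge_attr="duration",
                                create_using=nx.MultiDiGraph)
    feats = pd.DataFrame(index=list(g.nodes))
    feats["out_calls"] = pd.Series(dict(g.out_degree()))
    feats["in_calls"] = pd.Series(dict(g.in_degree()))
    feats["out_secs"] = pd.Series(dict(g.out_degree(weight="duration")))
    feats["in_secs"] = pd.Series(dict(g.in_degree(weight="duration")))
    return feats.fillna(0)
```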

The initial idea was to start with several informative features that can help Analysts get a better understanding of the data they are looking at (as shown below in fig. 2 (a)). Upon selection of desired features to visualize, we can dive into the structure of the data and try to identify and find patterns or nodes that seem anomalous or somehow suspicious.

The Summary comprises a high-level yet detailed interactive view of the data (as shown in fig. 2 (a)). It is possible to visualize data in a variety of ways that highlight different aspects, patterns and behaviors that can be further investigated.

The deep-dive consists of a drill-down on suspicious or interesting nodes that call for further investigation (as shown in fig. 2 (b) and (c)). Here we can explore several important metrics exhibited by the selected group of nodes. This can help detect patterns and relationships between different nodes and support inferences about anomalous behavior. Additionally, we explore the use of EgoNets, a very powerful way to look at relationships in the data. The idea is to be able to drill down on any nodes and relationships of interest and understand the specific metrics and statistics associated with them.
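
Continuing the sketch above, a minimal EgoNet drill-down could look like this (again, the names are illustrative), using networkx’s ego_graph:

```python
import networkx as nx

def ego_summary(g: nx.MultiDiGraph, node, radius: int = 1):
    # Subgraph around a suspicious node, following edges in both directions.
    ego = nx.ego_graph(g, node, radius=radius, undirected=True)
    durations = [d.get("duration", 0) for _, _, d in ego.edges(data=True)]
    return {
        "nodes": ego.number_of_nodes(),
        "edges": ego.number_of_edges(),
        "avg_call_secs": sum(durations) / max(len(durations), 1),
    }
```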

This approach was designed to work with both labeled and unlabeled data. The possibility of investigating Anomalous Behavior on unlabeled data is of special significance, as unlabeled data will inevitably constitute the vast majority of newly generated data.

One of the main ideas in the design of this solution was to allow the user to interact with the data by selecting specific areas of interest in the graph for further investigation. This enables the analysis of suspicious groups of nodes that may share similar attributes, patterns or behavior. A major advantage is the ability to switch from a case-by-case approach, where the Analyst has to look at each potential case individually – less informative and more time-consuming – to one of group behavior analysis. By analyzing whole groups of nodes, we broaden the scope of analysis and gain a better idea of the overall potential fraud events occurring in the data. Additionally, when a suspicious case is detected, it becomes easier to spot related events exhibiting the same kind of behavior. Generally, the visualization of the graph nodes and their interactions allows us to evaluate the behavior of a particular group and identify the fraudulent nodes and the corresponding victims.

Also, the use of parallel graphs provides a powerful means to visualize and analyze different metrics (more than twenty at the moment) that can assist in the investigation process.

 

Success Case Study

 

In this publication we present a success case showcasing how fraud events previously undetected by traditional methods can be caught. Starting from a dataset comprising a pool of traffic-activity events (calls) from a Telecom Operator, we were able, through Data Visualization techniques, to successfully identify a new fraud case. This was an instance of International Bypass fraud, one of the most prevalent and damaging types of fraud.

The dataset used for the analysis presented here was not labeled and comprises a pool of 2 days of traffic activity (events are phone calls) from a large Telecom Operator.

Figure 2. Current proposed solution at work: (a) several nodes are on the 45-degree line (red dashed box), away from the majority (notice that both axes, as well as the color-scale, are in log). (b) ‘Deep-dive’ for the red triangle: the parallel axis plot of the EgoNet of the ‘red triangle’ shows that the nodes receive 1-second phone calls. (c) Experts are investigating the nodes like the ones in red ovals, and confirmed that the callers in yellow-highlight have all the evidence of the ‘International Bypass’ type of fraud.

Figure 2 depicts the analytical process undertaken in the analysis that led to the identification of the success case presented here.

In figure 2 (a) we can see a plot representing the relationship between calls made and calls received. Here we detected a group of suspicious nodes (red dashed rectangle) displaying a pattern that diverges from the normal behavior seen in the rest of the dataset.

From there we looked at the various features computed by the application to get a better grasp of the behavior displayed in that case. We noticed a pattern of 1-second phone calls, which is not standard behavior.

We deepened our investigation by computing a graphical visualization of the EgoNet of the data of interest, and learned that many of these 1-second phone calls were, strangely, being made to a hotel. We suspected this was indeed Anomalous Behavior and likely indicative of fraudulent activity related to International Bypass.

 

Conclusion

 

One of the many outcomes of the AIDA Project has been the research and development of new algorithms and data visualization solutions that add to the catalogue of tools for detecting and identifying Anomalous Behavior. Here we have discussed a solution, developed by AIDA partners Carnegie Mellon University and Mobileum, designed for the study and detection of Anomalous Behavior in large pools of data. We have addressed new ways for Analysts tackling fraud to study Anomalous Behavior using the power of time-evolving graphs. The proposed solution is designed in a sequential manner that allows the user to dive into specific areas of interest in the data. The user can also deepen the analysis by investigating the multiple ways the data is interconnected.

This provides the ability to detect fraud patterns in the data at the level of groups of nodes, in contrast with the narrower scope of the single-case analysis used in current state-of-the-art commercial solutions.

We have built a tool focusing on group analysis and attention routing that allows the identification and visualization of fraud patterns in a network.

By applying this approach to an initial set of unlabeled data, we were able to quickly and efficiently detect Anomalous Behavior that was confirmed to constitute fraud activity.

We reinforce the strong emphasis on a solution with powerful Data Visualization features that allow the user to successfully tackle Anomalous Behavior and fraud activity. The possibility to deep-dive into data points (nodes) of interest and collect valuable information from the relationships between these data points is of special significance. Here, the ability to drill down through a Lasso Selection of the data of interest should prove highly valuable in this type of analytical work.

This application also makes it possible to address and detect new cases of Smart Fraud, which aims to mask itself within “normal” behavior by adopting new and ever more unsuspected mechanisms and techniques.

 

Next Steps

 

This is ongoing work as we continue to explore these research vectors. One of the main areas of interest we want to focus our work on is Attention Routing. This will help improve the study of areas of interest in the data that are blurred within the normal distribution of behavior and are often ignored.

We are exploring new ways to analyze groups or clusters of data, and we continue our research on parallel graphs and the use of spring models for visualization and interaction.

Finally, we are thinking about how to evolve towards solutions that are more proactive, possibly suggesting suspicious cases based on automated, informed analysis. We hope this could ultimately have a great impact on the Fraud Management industry.

 

References

 

TgrApp: Anomaly Detection and Visualization of Large-Scale Call Graphs (2022 International Conference on Big Data (IEEE BigData 2022), AAAI Proceedings (2022)).

CallMine: Fraud detection and visualization of million-scale call graphs (22nd IEEE International Conference on Data Mining (IEEE ICDM 2022)).

 

By Mobileum

 


Protecting the Security and Privacy of AIDA and its Data

Adapting the RAID platform comes with many security and privacy issues that need to be addressed. These issues arise mainly from the transition to an edge architecture, which imposes the use of computation resources at the edges of the network, but also from the need to support multiple tenants and network slicing with 5G technology. As mobile phones connect and disconnect, the network is always changing, requiring adaptation and monitoring tools to maximize resources. And since the complexity of the system increases, attackers have more opportunities to exploit it. With these issues in mind, various mechanisms and protocols were put in place to assure the overall security of the platform.

 

Secure Communications and Malware Identification

The transition to an edge computing architecture introduces new vulnerabilities. Attacks like Man-in-the-Middle or Eavesdropping become more likely, so it is of utmost importance to ensure secure communication between the different devices and components. Edge architectures can leverage different solutions, whether or not they seek compliance with industry standards for mobile applications like Mobile Edge Computing (MEC). Such standards facilitate data processing at the edge, but also open new threats due to the exposure of new APIs, which may require data flows between edge nodes and cloud components.

 

Secure communications for applications leveraging APIs can rely on the advances of the HTTP protocol: HTTP/3 introduces functionality for increased security (e.g., reduced round-trip time for the initial session handshake) and support for unreliable links, for instance with high packet loss. The AIDA platform leverages the advances of HTTP/3, which relies on the QUIC protocol. In addition, KubeEdge orchestrates the diverse microservices of the AIDA platform at edge nodes and supports multiple solutions to secure communications at the control plane. Besides TLS connections, there is also support for the HTTP/3 protocol.

 

The AIDA platform also allows ISPs to detect malware on the fly. First, the platform captures the network traffic (mainly DNS queries) and presents it to a botnet-analyzing service. The service then produces a floating-point evaluation of the probability of infection of a specific device, which can be used to trigger a response based on the presented value.

 

The steps in the detection are: (1) blacklist/whitelist analysis, (2) query-rate analysis, (3) domain analysis (whether or not the domain is DGA-generated) and, finally, (4) a machine learning step. This pipeline-oriented scheme aims at high speed and scalability; therefore, packets leave the pipeline as soon as a consistent evaluation is formed. In every step a packet can be marked as infected or not (except (2), which can only deem it infected) and leave the pipeline. Only if a packet does not meet the upper or lower bound criteria for infection does it traverse the pipeline into the next evaluation step.
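
A hedged sketch of this early-exit pipeline (the thresholds, names, and model interface are illustrative assumptions, not the AIDA implementation):

```python
RATE_UPPER = 100.0            # queries/min; step (2) can only deem 'infected'
DGA_UPPER, DGA_LOWER = 0.9, 0.1

def evaluate(domain, src_rate, blacklist, whitelist, dga_score, ml_model):
    if domain in blacklist:
        return 1.0                         # (1) known-bad: infected
    if domain in whitelist:
        return 0.0                         # (1) known-good: clean
    if src_rate > RATE_UPPER:
        return 1.0                         # (2) abnormal query rate
    s = dga_score(domain)                  # (3) DGA-likeness of the name
    if s > DGA_UPPER:
        return 1.0
    if s < DGA_LOWER:
        return 0.0
    return ml_model.infection_probability(domain, src_rate)   # (4) ML step
```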

 

Security within Software Components

The AIDA platform is based on microservices, which require special attention to ensure their security. The platform is also expected to generate large amounts of data; therefore, intrusion detection mechanisms must be lightweight while also being able to cope with a dynamic number of replicas and a large number of services.

 

When looking at the system as a whole, we were able to increase detection rates up to 60%, while only around 25% of the alarms were false positives. Without the techniques employed, the results would be overwhelmed by a very large number of false positives. Also, considering adaptation environments, results improved by up to 80%, improving the overall security of the system.

 

Fig.1 – Results of the intrusion detection methods employed 

 

Hoping to improve the intrusion detection rate and minimize false alarms, we also explored machine-learning techniques for intrusion detection. With this intent, we collected system calls from the microservice systems as our data and used classification techniques to detect intrusions. The results demonstrate a high detection rate for two of the five attacks among the tested vulnerabilities. Although only some attacks were detected, the false alarm rate had excellent results, staying below 1% for all attacks. We further improved the machine learning results by using a sliding window as a post-processing technique.

 

Based on the machine learning results, we understood the need for a better system-call representation for intrusion detection techniques. For this, we decided to work on a representation that could convey more information about the connections between system calls. First, we devised a classification where system calls were divided into classes and subclasses. Later, we established relationships of different costs between these system calls to create a system call graph. Part of the system call graph can be seen in Fig. 2. Since our representation was susceptible to our subjectivity, we designed a validation process that gathered information from other researchers in the area. Based on that validation, we adjusted 17.28% of the system call classifications. The graph validation has started and is in progress.

 

Fig.2 – Small representation of the system calls as a graph

 

Since the AIDA platform receives dynamic loads, adaptations are possible throughout execution, maximizing the resources being allocated. Self-adaptation mechanisms were therefore put in place, monitoring the system and executing pre-defined actions that let the system react to changes, allowing for high performance and availability. To do that, we use the Trustworthiness Monitoring and Assessment (TMA) Framework. TMA enables self-adaptation mechanisms in cloud and edge applications through its REST interfaces, which interact with the probes and actuators of the managed element (i.e., the AIDA platform). As TMA can be easily tailored to any aspect to be monitored (e.g., performance, availability, security), it was chosen for the AIDA platform. In addition, we have recently made available a new dashboard for managing and visualizing TMA configurations.

 

Data Privacy

Assuring data privacy in the AIDA platform demands suitable anonymization approaches to store and process large amounts of data. A privacy framework was developed to overcome the difficulty of selecting and configuring the appropriate mechanism that fulfills the project requirements. This privacy framework allows implementing, applying, and assessing Privacy-Preserving Mechanisms (PPMs) according to the pipeline below.

 

Fig. 3 – Privacy Framework architecture 

 

The architecture of the privacy framework consists of a main Python package with a set of subpackages, where each subpackage contains the corresponding adapter, that is, an abstract class that can be extended by implementing its abstract methods (i.e., the methods relevant for the component). These adapters make the framework easily extensible, allowing the implementation of new features (e.g., new PPMs or metrics).
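
A minimal sketch of the adapter idea (class and method names are our own choices for illustration; the framework’s actual classes may differ):

```python
from abc import ABC, abstractmethod
import pandas as pd

class PPMAdapter(ABC):
    @abstractmethod
    def anonymize(self, data: pd.DataFrame) -> pd.DataFrame:
        """Apply the privacy-preserving mechanism to a dataset."""

class Suppression(PPMAdapter):
    # A new PPM plugs in by implementing the abstract methods.
    def __init__(self, columns):
        self.columns = columns

    def anonymize(self, data: pd.DataFrame) -> pd.DataFrame:
        return data.drop(columns=self.columns)   # drop direct identifiers
```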

 

These strategies are complemented by SOTERIA, which uses machine learning techniques to create a distributed privacy-preserving system. It was built taking into consideration both scalability and fault tolerance, allowing the processing of large datasets. See our December post for details.

 

By University of Coimbra


Fraud detection, micro-clusters and scatterplots

 

Acknowledgements

The results and analysis presented here were done with contributions from Mirela Cazzolato (USP, and CMU), Saranya Vijayakumar (CMU), Xinyi (Carol) Zheng (CMU), Meng-Chieh (Jeremy) Lee (CMU), Namyong Park (CMU), Pedro Fidalgo (Mobileum), Bruno Lages (Mobileum), and Agma Traina (USP).

 

 

Reminders – Problem definition and past insights

As we mentioned in the February 2022 blog post, the problem we are focusing on is to spot fraudulent behavior in a who-calls-whom-and-when graph. We distinguished between the supervised case (where we are given a list of fraudulent subscribers (labeled data)), and the un-supervised one, where we are not given any such labeled data.

The two main insights were: (a) the labeled data could be wrong: a subscriber labeled as ’honest’ may turn out to be fraudulent after closer inspection; and (b) there are many types of fraud, including telemarketers/scammers (’you owe taxes’); subscribers bypassing the legal ways of making international calls; subscribers using fake or stolen credit cards; and many more.

 

 

New Insights

For the supervised case, we can build classifiers (random forests, AutoGluon), which will only be as good as the quality of our labels, and which will never be able to spot new types of fraud.

For the unsupervised case, which we focus on here, the main insights are:

  1. Fraudsters often exhibit lock-step behavior: for example, too many callers call the same (large) number of destinations, at about the same time.

  2. Visualization is a powerful way to explain to a non-analyst why our algorithm suspects a given set of subscribers.

We elaborate with an example, next.

 

 

Deep dive

Figure 1 illustrates the main ideas and insights.

 

 

Raw data: Figure 1 (a) and (d)

The first column shows two of the many scatter-plots we are proposing. The first row, ’(a)’, plots the in-degree (total # of friends calling in) versus the weighted in-degree, in log-log scales, since we have the power-laws we expected. Like all the other scatter plots, this is a heatmap – notice that even the color scale of the heatmap is logarithmic. The vast majority of subscribers, in red/orange colors, receive a few phonecalls (≈ 10^1) and their total talking time is relatively short (≈ 10^2 seconds). Not surprisingly, there is a correlation: the more people call you, the larger the total duration of the phone calls. What is surprising is the micro-cluster inside the white circle, at (10^2, 10^4): many people (≈ 10^3 – light-yellow colormap) have a surprisingly similar in-degree of about 100, and a similar total duration of about 10,000 seconds.

Let’s dissect (d), the scatter-plot below it. Every dot is a subscriber; the axes are the in- and out-degree of each subscriber, again in logarithmic scales (log(x+1), to be exact). There is heavy overplotting (red/orange/yellow colors) for small values – that is, most subscribers receive and initiate only a few phonecalls. There are two unexpected issues. The first is that there is little reciprocity: analysis of earlier who-calls-whom networks [1] reported that usually people who make x phonecalls also receive about x phonecalls. Thus, we would expect to see a strong diagonal along the 45-degree line – but it is not there.

The second observation is that there are a lot of people who never return a phonecall (all the dots on the horizontal axis – our domain experts call them ’black holes’), and similarly many people that receive no incoming phonecalls (’volcanoes’). These behaviors are not necessarily bad: for example, an emergency room or a help desk would behave like a ’black hole’. However, several fraudulent behaviors are just as asymmetric, like, e.g., scammers (’you owe taxes’).

 

 

Our Guesses (b), (e)

The second column of Figure 1 shows our guesses for fraudsters (orange diamonds and crosses, for ’volcanoes’ and ’black holes’ respectively). The crucial plot is (b), which shows the ’core-number’ versus the inter-arrival time (IAT). The core-number has a complicated definition¹, but the intuition is that nodes with a high core number belong to a densely connected community.

In Figure 1(b), notice that while the vast majority of nodes (red/yellow dots at the bottom-middle of the graph) have a low core-number (2-5), there are several with a very high core number (20). Plotting them on the in- vs out-degree plot (’(e)’), about half of them are ’volcanoes’ (diamonds on the horizontal axis) and the rest are ’black holes’ (crosses, on the vertical one). We shall refer to them as leads from now on.

It turns out that most of our leads actually call each other, forming a very dense bipartite core; such subgraphs are typically fraudulent in social networks (like Twitter [3] and Facebook [2]).

Our domain experts confirmed that our leads are indeed fraudulent.
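
A minimal sketch of this lead-finding recipe with networkx (the core-number threshold is illustrative; the real analysis also uses the IAT and visual inspection):

```python
import networkx as nx

def find_leads(g: nx.DiGraph, core_threshold: int = 20):
    u = nx.Graph(g)                          # undirected view for the k-core
    u.remove_edges_from(nx.selfloop_edges(u))
    core = nx.core_number(u)
    leads = [n for n, c in core.items() if c >= core_threshold]
    volcanoes = [n for n in leads if g.in_degree(n) == 0]     # only call out
    black_holes = [n for n in leads if g.out_degree(n) == 0]  # only receive
    return volcanoes, black_holes
```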

 

 

Comparison with the ’ground truth’

Our dataset already had labels, and we show them in the last column of Figure 1, (c) and (f). Notice that (c) and (f) exactly parallel the middle column ((b) and (e)), with the only difference that the ’ground truth’ column has the labeled nodes in red. As in the ’leads’ case, diamonds and crosses correspond to ’volcanoes’ and ’black holes’ respectively.

The main observation is the following:

  • The investigators who labeled the data completely missed our ’leads’ with the high core-number.

This is exactly why we put ’ground truth’ within quotes: several of the fraudsters may be flying below the radar, as we showed with our ’leads’ in Figure 1(b) and (e).

 

 

Conclusions

We repeat our two insights:

  1. Fraudsters often exhibit lock-step behavior, resulting in micro-clusters and dense communities or bi-partite cores

  2. Visualization helps explain our ’leads’ (as we did with the micro-clusters in Figure 1(a) and (b)).

 

Citations

  • [1] Akoglu, L., de Melo, P. O. S. V., and Faloutsos, C. Quantifying reciprocity in large weighted communication networks. In PAKDD (2) (2012), vol. 7302 of Lecture Notes in Computer Science, Springer, pp. 85–96.
  • [2] Beutel, A., Xu, W., Guruswami, V., Palow, C., and Faloutsos, C. Copy-catch: stopping group attacks by spotting lockstep behavior in social networks. In WWW (2013), International World Wide Web Conferences Steering Committee / ACM, pp. 119–130.
  • [3] Hooi, B., Song, H. A., Beutel, A., Shah, N., Shin, K., and Faloutsos, C. FRAUDAR: bounding graph fraud in the face of camouflage. In KDD (2016), ACM, pp. 895–904.

     

     

    ¹ Formally, the k-core of a graph is a maximal subgraph that contains nodes of degree k or more; the core-number of a node is the largest value of k of a k-core that contains that node.

     

By Carnegie Mellon University


Federated Machine Learning

Federated Learning (FL) is a collaborative, decentralized, privacy-preserving technology to overcome the challenges of data storage and data sensitivity [1]. The last few years have been strongly marked by artificial intelligence, machine learning, smart devices, and deep learning. As a result, two challenges arose in data science, impacting how data can be accessed and used. First, with the creation of the General Data Protection Regulation (GDPR) [2], data became protected by regulation: institutions cannot store or share data without users’ authorization. The second challenge is that in the era of big data, a large volume of data is generated, and it becomes increasingly difficult to store it in a single location. The information is therefore distributed across different servers or generated by smart devices, which creates the need to build models or perform computations without these data leaving their origin. Thus, a new paradigm emerged, coined Federated Learning: a sub-area of machine learning that seeks to solve the problem of building distributed models with privacy concerns.

The first work on FL was published in 2017 by McMahan et al. [3]. The authors developed the Federated Averaging (FedAvg) algorithm to improve recommendation and automatic text revision from thousands of users’ devices running the Android mobile system.
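
To make the aggregation step concrete, here is a minimal FedAvg round in Python; local training is abstracted away, and the shape of `client_updates` is an assumption of this example:

```python
import numpy as np

def fedavg(client_updates):
    # client_updates: list of (weights, n_samples) pairs, where weights
    # is a list of np.ndarray layers trained locally by one client.
    total = sum(n for _, n in client_updates)
    n_layers = len(client_updates[0][0])
    # Sample-weighted average, layer by layer (McMahan et al., 2017).
    return [
        sum(w[i] * (n / total) for w, n in client_updates)
        for i in range(n_layers)
    ]
```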
 
Federated Learning falls into three types of architecture based on data distribution among edges in the feature and sample space: horizontal, vertical, and federated transfer learning [4]. An example of these architectures is shown in Figure 1. Horizontal federated learning, also known as sample-based federated learning, is characterized by scenarios where each node holds the same features but different individuals. Vertical Federated Learning or feature-based federated learning is suitable for cases where data is vertically partitioned according to the feature dimension. Unlike the horizontal and vertical architectures, in Federated Transfer Learning (FTL), data shares neither sample nor feature space.

Figure 1 | Federated Learning Architectures

 
Privacy is often a significant concern in FL. The existing privacy-preserving methods mainly focus on information encryption for the client, secure aggregation at the server side, and security protection for the FL framework. FL’s variety of privacy definitions is classified into global and local privacy [5]. According to Figure 2, the main difference between the two categories is where the privacy-preserving methods are implemented.
 

Figure 2 | Categories of Privacy-Preserving Mechanisms (a) Global Privacy (b) Local Privacy

 
The development of open-source frameworks for FL simulation has the potential to accelerate the research progress. The first framework designed at the production level was TensorFlow Federated (TFF) [6]. The Federated AI Technology Enabler (FATE) is an open-source industrial-level framework [7] built to deal with federated anomaly detection issues such as credit risk control and anti-money laundering.
 
The evaluation of an FL model consists of assessing the aggregated model after assigning it to each client, using the local evaluation datasets. Each client then shares its performance with the server, which combines the local performances into global evaluation metrics. This subject is still at an early stage, and there are open challenges related to the ideal metrics for each type of problem.
 
This new research area has several open challenges, such as the definition of metrics for model evaluation, communication interoperability, and energy efficiency. It is also interesting to research the possibility of adopting white-box distributed learning algorithms to promote the models’ explainability, mainly when dealing with interdisciplinary applications. The diversity of applications is still limited to experiments with simulated or traditional machine learning datasets. Several works can deal with horizontally distributed data, but there are still research opportunities in vertical FL and federated transfer learning.
 

References

[1] Yu, B., Mao, W., Lv, Y., Zhang, C., & Xie, Y. (2022). A survey on federated learning in data mining. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 12(1), e1443.


[2] Voigt, P., & Von dem Bussche, A. (2017). The EU general data protection regulation (gdpr). A Practical Guide, 1st Ed., Cham: Springer International Publishing, 10(3152676), 10-5555.


[3] McMahan, B., Moore, E., Ramage, D., Hampson, S., & y Arcas, B. A. (2017, April). Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics (pp. 1273-1282). PMLR.


[4] Yang, Q., Liu, Y., Chen, T., & Tong, Y. (2019). Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST), 10(2), 1-19.


[5] Li, T., Sahu, A. K., Talwalkar, A., & Smith, V. (2020). Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine, 37(3), 50-60.


[6] Google. (2020). TensorFlow Federated. https://www.tensorflow.org/federated Accessed 2022.


[7] Webank. (2019). Federated AI Technology Enabler. (FATE). https://github.com/webankfintech/fate Accessed 2022.


By INESC TEC


Finding Anomalies in Large Scale Graphs

 

Problem definition

 

Given a large who-calls-whom graph, how can we find anomalies and fraud? How can we explain the results of our algorithms? This is exactly the focus of this project. We distinguish two settings: static graphs (no timestamps) and time-evolving graphs (with timestamps for each phonecall). We further subdivide each into two sub-cases: supervised and unsupervised. In the supervised case, we have the labels for some of the nodes (‘fraud’/’honest’), while in the unsupervised one we have no labels at all.

 

Major lessons

 

For the supervised case, the natural assumption is that the labels are absolutely correct and thus we only need to build a classifier, using our labelled set as the ‘gold set’. However, this is a pitfall: the ‘fraud’ labels are almost always correct (notice the ‘almost’), while the ‘honest’ labels are often wrong. Thus, we need to develop algorithms that can tolerate (and flag) those labels that seem erroneous. There is work along this line in the area of ‘weak labels’, with some excellent work on so-called ‘confident learning’ [8]. The second insight is that there are many types of fraud, as well as many types of ‘honest’ behaviour. We already mentioned that in the context of Twitter followers [10]. For phonecall networks, there are also many types of fraud: telemarketers, phishing attempts, redirections to expensive 900 numbers, to name a few.

 

Our tools – unsupervised case

 

The most creative and most interesting part of the project is the feature extraction: what (numerical) features should we extract from each node, to try to find strange nodes? The obvious ones are the in- and out-degree of each node, and the total number of minutes of in- and out-calls. More elaborate ones include centrality measures, like Google’s super-successful PageRank [2], and the number of triangles in the egonet of the node [1]. Thus, every node becomes a point in n-dimensional space, and we can employ all the unsupervised algorithms, like clustering (DBSCAN [9], OPTICS), outlier detection (isolation forests [7]), and micro-cluster detection [6].
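
A minimal sketch of this recipe (the ‘minutes’ edge attribute and the feature set are assumptions of the example), combining networkx features with scikit-learn’s isolation forest:

```python
import networkx as nx
import numpy as np
from sklearn.ensemble import IsolationForest

def score_nodes(g: nx.DiGraph):
    pagerank = nx.pagerank(g)
    feats = np.array([
        [g.in_degree(n), g.out_degree(n),
         g.in_degree(n, weight="minutes"), g.out_degree(n, weight="minutes"),
         pagerank[n]]
        for n in g.nodes
    ])
    x = np.log1p(feats)                       # log scales, as in the plots
    scores = IsolationForest(random_state=0).fit(x).score_samples(x)
    return dict(zip(g.nodes, scores))         # lower score = more anomalous
```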

 

Our tools – supervised case

 

Here we have two families of tools: the first is to build a classifier for the n-dimensional space that we can create with the feature extraction above. There is a wealth of classifiers to choose from – ‘AutoGluon’ [3] automatically tries several of them and picks the best.

The second is to exploit network effects, with tools like label propagation, belief propagation, and semi-supervised learning (e.g., FaBP [5], ZooBP [4]).

 

Preliminary results

 

Figure 1 gives some results for the supervised case. The scatterplots (actually heatmaps, to highlight the overplotting) have one dot for each customer, with the axes being the (weighted) in-degree versus the (weighted) out-degree. Both axes are logarithmic (log(x + 1), so that we keep the zeros).

Notice that most points (fraud/honest) are along the diagonal, indicating that reciprocity is to be expected.

Also notice that there are some extreme deviations from reciprocity, namely, points along the axis. This means that there are customers that call, but never get called back (like, eg., telemarketers) and the other way around (like, eg., help-lines).

What distinguishes the fraudsters from the honest customers is the magnitude of activity. Notice that most of the fraud customers tend to be around the (10^4, 10^4) point, while most honest customers are close to the (10^3, 10^3) point.

 

(a) Fraud

(b) Honest

Figure 1: Visualization helps: heatmaps of in-versus out-degree (weighted): Fraud (left) vs honest subscribers (right)

 

Conclusion

 

Looking for patterns and anomalies in large, real graphs never gets boring: there are always new patterns to look for, new activities by the fraudsters (as well as new activities by the honest ones). Despite the fact that there are already some excellent tools for graph analysis, there is always room for more.

 

References

 

[1] Akoglu, L., McGlohon, M., and Faloutsos, C. OddBall: Spotting anomalies in weighted graphs. In PAKDD (2) (2010), vol. 6119 of Lecture Notes in Computer Science, Springer, pp. 410-421.

[2] Brin, S., and Page, L. The anatomy of a large-scale hypertextual web search engine. Comput. Networks 30, 1-7 (1998), 107-117.

[3] Erickson, N., Mueller, J., Shirkov, A., Zhang, H., Larroy, P., Li, M., and Smola, A. J. AutoGluon-Tabular: Robust and accurate AutoML for structured data. CoRR abs/2003.06505 (2020).

[4] Eswaran, D., Günnemann, S., Faloutsos, C., Makhija, D., and Kumar, M. ZooBP: Belief propagation for heterogeneous networks. Proc. VLDB Endow. 10, 5 (2017), 625-636.

[5] Koutra, D., Ke, T., Kang, U., Chau, D. H., Pao, H. K., and Faloutsos, C. Unifying guilt-by-association approaches: Theorems and fast algorithms. In ECML/PKDD (2) (2011), vol. 6912 of Lecture Notes in Computer Science, Springer, pp. 245-260.

[6] Lee, M., Shekhar, S., Faloutsos, C., Hutson, T. N., and Iasemidis, L. D. Gen2Out: Detecting and ranking generalized anomalies. In IEEE BigData (2021), IEEE, pp. 801-811.

[7] Liu, F. T., Ting, K. M., and Zhou, Z. Isolation forest. In ICDM (2008), IEEE Computer Society, pp. 413-422.

[8] Northcutt, C. G., Jiang, L., and Chuang, I. L. Confident learning: Estimating uncertainty in dataset labels. J. Artif. Intell. Res. 70 (2021), 1373-1411.

[9] Schubert, E., Sander, J., Ester, M., Kriegel, H., and Xu, X. DBSCAN revisited, revisited: Why and how you should (still) use DBSCAN. ACM Trans. Database Syst. 42, 3 (2017), 19:1-19:21.

[10] Shah, N., Lamba, H., Beutel, A., and Faloutsos, C. The many faces of link fraud. In ICDM (2017), IEEE Computer Society, pp. 1069-1074.

 

By Carnegie Mellon University

 


A Data Management Architecture for AIDA

One of the major challenges in the evolution of the RAID platform during the AIDA project is the need to further distribute the platform components to achieve greater levels of scalability, by leveraging the increasing edge computing capacity made available by the IoT and the imminent large-scale deployment of 5G cellular technology.

 

The advent of 5G networks and growing adoption of Internet of Things (IoT) devices lead to more opportunities for data collection and processing with hybrid edge-cloud systems.

 

In this architecture, edge devices — placed near where the data is being collected/accessed — execute some of the processing while offloading other more complex work to the cloud, which is scalable on demand. However, edge-cloud architectures present several challenges when it comes to data management. Most of the difficulties are due to their inherently large-scale, distributed, and heterogeneous deployments, namely:

 

  • Replicating stateful (edge or otherwise) components for scalability requires synchronization, which can be expensive and makes fault tolerance more complex; 

 

  • Large-scale data replication between the edge and the cloud raises issues in terms of network latency and storage capacity; 

 

  • The heterogeneity of edge devices, namely in the context of IoT, encounters a very diverse set of data models, which requires data processing frameworks to handle each one on a case-by-case approach.

 

For example, for network services such as firewalls and load balancers, there has been a shift toward virtualizing functions to cut operation and management costs and increase elasticity. However, Virtual Network Functions (VNFs) that store data often rely on one of two techniques: operating-system-level memory sharing for scalability, or delegating data management to a standalone centralized database. The former helps reduce latency, but fault tolerance is harder to achieve and all VNF replicas are limited to the same virtual machine. The latter improves scalability and fault tolerance but increases latency. Although some solutions rely on eventual consistency to improve both scalability and latency, that is not enough for functions that require strong consistency or present non-deterministic behavior; this can be circumvented, but only with expensive distributed locking.

 

On the other hand, the system must answer analytical queries on data collected at the edge. For example, it must be capable of efficiently answering ad-hoc queries for exploratory data analysis. One of the main challenges of this workload is that the data to be fetched is unpredictable. Therefore, the system must be able to adapt to different workloads, achieving an optimal tradeoff between the network and the computing power of the edge. In an environment with a substantial number of IoT devices, sending all collected data to the cloud consumes network and CPU resources on the edge, which are often limited due to cost and energy factors. Additionally, it delays the data that is actually needed at the moment. Finally, the heterogeneity of the devices and the collected data makes it difficult for the cloud to have a consistent and holistic view of the data, increasing the complexity of analytical workloads.

 

The edge computing paradigm aims to leverage the computational and storage capabilities of edge devices while resorting to cloud computing services for more demanding processing tasks that cannot be done at the edge. Edge devices generate large volumes of data that may need to be transferred to the cloud and that come from several types of data sources. Therefore, we propose AIDA-DB, a unified data management architecture for the edge and cloud continuum, summarized in Figure 1, that tackles both analytical and transactional workloads, with both relying on a polyglot middleware.

 

Figure 1: AIDA-DB unified data management architecture for an edge and cloud continuum

 

In detail, the polyglot middleware processes data from different sources, with different formats and encodings. This component can efficiently run queries over an integrated view of the data by exploiting the query engines of the underlying edge systems.
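
As an illustration, the sketch below shows what such an integrated query could look like through a standard xDBC-style driver. The data source name, the table layout, and the assumption that a document store on an edge node is exposed as a relation are all hypothetical; this is not the actual AIDA-DB API.

```python
# Hypothetical polyglot query through an xDBC-style interface.
# DSN, schema, and engine mapping are illustrative assumptions.
import pyodbc  # any xDBC driver manager would do

conn = pyodbc.connect("DSN=aida-db")  # placeholder data source name
cur = conn.cursor()

# One query spanning two engines: 'readings' could live in a document
# store on an edge node, 'sensors' in a relational store in the cloud.
# The middleware would rewrite each fragment for the underlying engine.
cur.execute("""
    SELECT s.site, AVG(r.value) AS avg_value
    FROM sensors AS s JOIN readings AS r ON r.sensor_id = s.id
    WHERE r.ts > ?
    GROUP BY s.site
""", ("2022-01-01",))
for site, avg_value in cur.fetchall():
    print(site, avg_value)
```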

 

The synchronization middleware is in charge of efficiently transferring data from the edge to the cloud. It does so by considering the data currently required by the cloud and the data already cached there, and then synchronizes the missing data based on a balance between network delay and the impact on edge resources.
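
A minimal sketch of this decision, assuming a simple additive cost model and hypothetical per-partition cost estimates (the actual middleware policy is not specified here):

```python
# Sketch of the edge-to-cloud synchronization decision described above.
# All names and the linear cost model are illustrative assumptions.
def plan_sync(requested, cached, network_cost, edge_cost):
    """Return the partitions to pull from the edge, cheapest first.

    requested    -- partitions the cloud query currently needs
    cached       -- partitions already materialized in the cloud
    network_cost -- estimated transfer cost per partition
    edge_cost    -- estimated load imposed on the edge per partition
    """
    missing = [p for p in requested if p not in cached]
    # Balance network delay against the impact on edge resources.
    return sorted(missing, key=lambda p: network_cost[p] + edge_cost[p])

plan = plan_sync(
    requested={"p1", "p2", "p3"},
    cached={"p2"},
    network_cost={"p1": 0.4, "p3": 2.0},
    edge_cost={"p1": 0.1, "p3": 0.5},
)
print(plan)  # ['p1', 'p3'] -- p2 is already cached in the cloud
```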

 

Finally, the transactional middleware guarantees consistent reads and isolated and atomic updates across the edge and cloud continuum.
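
One classical way to obtain such atomicity across stores is two-phase commit. The sketch below shows a coordinator's core loop under that assumption; it only illustrates the guarantee, and is not a description of the protocol AIDA-DB actually implements.

```python
# Illustrative two-phase commit over edge and cloud participants.
# The participant interface and coordinator are assumptions for the sketch.
class Participant:
    def prepare(self) -> bool: ...
    def commit(self) -> None: ...
    def abort(self) -> None: ...

def atomic_update(participants):
    # Phase 1: every store must vote yes before anything becomes visible.
    if all(p.prepare() for p in participants):
        for p in participants:   # Phase 2: make the update durable everywhere.
            p.commit()
        return True
    for p in participants:       # Any "no" vote rolls everything back.
        p.abort()
    return False
```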

The AIDA-DB architecture aims to enable complex VNFs and edge-based application services in emerging 5G networks to manage data across the cloud and edge continuum. The key advantages of AIDA-DB that allow this are:

 

  • Global transactional reliability: Managing distributed data across distributed system boundaries and multiple SQL and NoSQL database systems is hard and prone to inefficiencies and errors. By providing a seamless transactional layer, AIDA-DB ensures that business-critical workflows execute across these boundaries and makes edge-based application components and services first-class citizens in complex distributed applications;

 

  • Hybrid transactional-analytical processing: The edge adds an axis of complexity to traditional extract, transform, and load (ETL) procedures, as data needs to be copied across the network to a centralized data warehouse. In contrast, a key driver for the adoption of the edge is to make use of fresher data in applications and services. Therefore, in AIDA-DB we strive for hybrid transactional analytical processing systems that take advantage of cloud elasticity and automatic synchronization to run interactive analytics, such as needed for data exploration, on live or freshly updated data;

 

  • Polyglot processing with a standard API: It has become clear that current applications and services need a variety of data management paradigms and tools, but this variety pushes additional complexity into applications and calls for polystores to bridge between them. AIDA-DB innovates by catering for polyglot workloads across different systems, for both transactional and analytical processing, while exposing all functionality through a standard xDBC interface that accepts polyglot queries;

 

  • Separation of concerns: Abstract management of the cloud and edge divide, together with support for the SQL query language, makes it possible to separate application development concerns (what functionality is needed) from deployment and optimization (how it is provided). This extends to the cloud and edge computing environment the traditional strength of database management systems: providing separate, declarative interfaces for both developers and database administrators (DBAs), making applications reusable and deployable in different contexts. This is particularly suited to platform as a service (PaaS) offerings, as it allows the platform provider to offer managed services.

 

 

By INESC TEC

How can we protect the security and the privacy of the AIDA platform? https://aida.inesctec.pt/how-can-we-protect-the-security-and-the-privacy-of-the-aida-platform/?utm_source=rss&utm_medium=rss&utm_campaign=how-can-we-protect-the-security-and-the-privacy-of-the-aida-platform Mon, 13 Dec 2021 10:39:46 +0000 https://aida.inesctec.pt/?p=4604

How can we protect the security and the privacy of the AIDA platform?

The evolution of the RAID platform during the AIDA project brings many benefits, but also many security and privacy concerns to be considered. Thus, discovering and implementing measures that address these new risks, without degrading performance, is of utmost importance. The main challenges relate to the transition to the edge, pushing computational power to the edges of the network; to the integration of 5G, supporting multiple tenants and network slicing; and, finally, to the privacy of the data gathered and analyzed.

 

Given these changes to the network, communications need to be verified to ensure that they are secure and that performance is not affected. The network changes constantly as devices connect to and disconnect from it and traffic rises and falls, creating the need to monitor the platform so that any change can be responded to quickly. These changes to the network also create new potential entry points for attackers to take advantage of, and make it harder to defend against attacks that were already possible.

 

Figure 1: Overview of the changes to the Architecture

 

Providing Secure Communication among the Components

The AIDA platform, with components running at the edge of the network and at the core, requires secure communication channels to ensure that the exchanged information is protected against threats such as eavesdropping and man-in-the-middle attacks. A critical security aspect is the support for authentication, authorization, and accounting (AAA) by all services/functions of the AIDA platform, no matter where they run.
 
Robust and secure communication approaches exist, such as the Transport Layer Security (TLS) protocol, which is widely used to ensure the authentication, confidentiality, and integrity of the exchanged data. In this perspective, components should support TLS v1.2 and beyond, preferably v1.3, given its higher protection levels and the reduced time to perform the handshake.
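
As a minimal illustration of this recommendation, the following sketch configures a Python service endpoint that accepts only TLS 1.3 and requires client certificates, in line with the AAA requirement above; the file names and port are placeholders.

```python
# Minimal sketch: a server-side TLS context restricted to TLS 1.3.
# Certificate, key, and CA paths are placeholders.
import socket
import ssl

ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ctx.minimum_version = ssl.TLSVersion.TLSv1_3  # refuse anything older
ctx.load_cert_chain(certfile="service.crt", keyfile="service.key")
# Mutual authentication: clients must present a certificate too.
ctx.verify_mode = ssl.CERT_REQUIRED
ctx.load_verify_locations(cafile="aida-ca.crt")

with socket.create_server(("0.0.0.0", 8443)) as sock:
    with ctx.wrap_socket(sock, server_side=True) as tls_sock:
        conn, addr = tls_sock.accept()  # TLS handshake happens here
```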
 
Nonetheless, a plug-and-play solution is not simple. The existence of several microservices, which can run at the edge or at the core of the network, leads to issues in managing the keys and certificates required by TLS connections. Even a seamless integration with federated identity management approaches such as OpenID Connect can lead to scalability issues if not managed properly.
 
Assuring a Secure Operation of the Software Components
AIDA microservices need to be monitored using lightweight, fast, and efficient approaches while maintaining a high level of effectiveness. The constant modification of deployment scenarios, with auto-scaling adaptation, forces the behavior profiles used to identify deviations to be generalizable, so that the security level is not compromised in these dynamic environments. Some intrusions may still go undetected, which is why incorporating intrusion tolerance provides a way to increase security levels and to assure that the system provides the intended service level despite intrusions that successfully evade the detection mechanisms.
 
Many security strategies are being evaluated and improved, such as the use of machine learning techniques and classifiers to detect intrusions. The goal is to construct benign behavior profiles and detect deviations from the “normal behavior” used to train the algorithms. After a configurable number of deviations, alarms are raised and the suspicious activity is reported, as sketched below. Intrusion tolerance will be most effective when applied to the key services of the architecture. Commonly used solutions are under study to identify possible applications in the AIDA scenario. The approaches that provide tolerance to the application range from diversity of services, requiring different versions of the technologies or techniques used to develop them, to architectural patterns that can be applied statically or dynamically according to information collected from the system in operation.
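
A minimal sketch of the profile-and-deviation scheme, assuming a simple per-metric Gaussian profile and a 3-sigma rule; these are illustrative choices, not the classifiers actually under evaluation:

```python
# Sketch of deviation counting against a benign behavior profile.
# The Gaussian profile and the 3-sigma rule are illustrative choices.
import statistics

class BehaviorProfile:
    def __init__(self, benign_samples, max_deviations=5):
        self.mean = statistics.mean(benign_samples)
        self.std = statistics.stdev(benign_samples)
        self.max_deviations = max_deviations
        self.deviations = 0

    def observe(self, value):
        """Return True when enough deviations accumulate to raise an alarm."""
        if abs(value - self.mean) > 3 * self.std:  # outside normal behavior
            self.deviations += 1
        else:
            self.deviations = max(0, self.deviations - 1)  # decay on benign
        return self.deviations >= self.max_deviations

profile = BehaviorProfile([102, 98, 101, 99, 100, 103, 97], max_deviations=3)
for requests_per_s in [100, 250, 260, 240]:  # a sudden traffic burst
    if profile.observe(requests_per_s):
        print("alarm: suspicious activity reported")
```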
 
 

Figure 2: High level perspective of the secure operation of the main software components

 
Also, to maintain high availability, a self-adaptation mechanism can be used to monitor and adapt the various components of the architecture, applying known actions to different components in response to changes in the environment. These actions aim to mitigate problems and improve the performance of the platform, steering resources to where they are needed in order to achieve high performance and availability. The mechanism also takes care of fixing identified security and privacy problems in the platform.
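
A highly simplified monitor-analyze-act loop of this kind could look as follows; the metric names, thresholds, and actions are hypothetical examples:

```python
# Simplified self-adaptation loop (monitor -> analyze -> act).
# Metric names, thresholds, and actions are hypothetical examples.
import time

RULES = [
    # (condition on metrics, corrective action)
    (lambda m: m["cpu"] > 0.85,        "scale_out"),
    (lambda m: m["error_rate"] > 0.05, "restart_replica"),
    (lambda m: m["tls_expiring"],      "rotate_certificates"),
]

def adaptation_loop(read_metrics, apply_action, interval_s=30):
    while True:
        metrics = read_metrics()         # monitor the components
        for condition, action in RULES:  # analyze against known actions
            if condition(metrics):
                apply_action(action)     # act to mitigate the problem
        time.sleep(interval_s)
```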
 
Maintaining the Privacy of the Data Used
Regulations such as GDPR and HIPAA, together with the need to outsource data and computation to third-party infrastructures, make it critical to have privacy-preserving solutions that can be deployed in potentially untrusted environments. Machine learning is a case in point, as it deals with the analysis of sensitive data, many times unprotected, which may leak sensitive information to adversaries on the untrusted premises. Even if this information is encrypted, there are other types of attacks that may compromise confidentiality, as depicted in Figure 3.
 
 

Figure 3: Examples of attacks that can affect ML: Adversarial Samples, Model Extraction, Model Inversion, Reconstruction Attacks, and Membership Inference

 

Although the use of software-based cryptographic schemes is far from coming to a halt, Trusted Execution Environments (TEEs) are increasingly sought as an alternative that can reduce the performance overhead associated with traditional privacy-preserving schemes. In AIDA, we are exploring this technology to provide a privacy-preserving machine learning solution that can be used in practice while scaling out to large datasets. SOTERIA is a system for distributed privacy-preserving machine learning that leverages Apache Spark's design and its MLlib APIs. Our solution was designed to avoid changing the architecture and processing flow of Apache Spark, keeping its scalability and fault-tolerance properties.
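
Because SOTERIA keeps Spark's processing flow intact, a standard MLlib program such as the sketch below is, per that design goal, meant to keep working unchanged; the dataset path, features, and parameters are illustrative only:

```python
# A standard Spark MLlib job; SOTERIA's stated design goal is that code
# like this keeps working unchanged. Paths and parameters are examples.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("mllib-example").getOrCreate()
df = spark.read.parquet("hdfs:///data/telemetry")  # hypothetical dataset
features = VectorAssembler(
    inputCols=["bytes_in", "bytes_out", "duration"], outputCol="features"
).transform(df)
model = KMeans(k=8, featuresCol="features").fit(features)  # unsupervised
model.transform(features).select("prediction").show()
```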

 

Apart from cryptographic mechanisms, privacy guarantees can be provided by applying adequate anonymization mechanisms. However, selecting a privacy-preserving mechanism is quite challenging, not only because of the lack of a standardized and universal privacy definition, but also because mechanisms must be properly selected and configured according to the data types and privacy requirements. Moreover, the type of anonymization approach employed may affect the performance of the machine learning mechanisms considered in the project. Focusing on the data types relevant to the AIDA project, we are developing a privacy framework that allows us to test configurations and to apply and assess privacy-preserving mechanisms according to the achieved privacy and utility levels of the data.
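
As an example of the kind of check such a framework can perform, the sketch below verifies whether a generalized table satisfies k-anonymity over a set of quasi-identifiers; k-anonymity is used here purely for illustration, and the column names are invented:

```python
# Illustrative utility: check k-anonymity of a table after generalization.
# Column names and the chosen quasi-identifiers are examples only.
import pandas as pd

def is_k_anonymous(df, quasi_identifiers, k):
    """Every combination of quasi-identifier values must occur >= k times."""
    return int(df.groupby(quasi_identifiers).size().min()) >= k

records = pd.DataFrame({
    "age_range": ["20-29", "20-29", "30-39", "30-39", "30-39"],
    "zip3":      ["123**", "123**", "456**", "456**", "456**"],
    "diagnosis": ["A", "B", "A", "C", "B"],  # sensitive attribute
})
print(is_k_anonymous(records, ["age_range", "zip3"], k=2))  # True
```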



Fraud Risk Management https://aida.inesctec.pt/fraud-risk-management/?utm_source=rss&utm_medium=rss&utm_campaign=fraud-risk-management Mon, 18 Oct 2021 08:38:18 +0000 https://aida.inesctec.pt/?p=4404

Fraud Risk Management

5G presents an opportunity for telecom operators to capture new revenue streams from industrial digitization. In cases such as network-as-a-service (NaaS), network exposure is becoming a reality through the transformation of core telecom network assets into digital assets. With 5G, the dynamic provisioning and scaling of network capacity and resources are available for the first time.


The vision of managing the network-as-a-service in the same way as a developer might manage cloud resources on Azure, AWS, or Google Cloud is happening through a combination of scalable infrastructure and the next generation of digital business support systems (BSS).


The 5G network evolution has opened up an abundance of new business opportunities for communication service providers (CSPs) in verticals such as industrial automation, security, health care, and automotive. To capture these opportunities and leverage their NaaS capabilities, CSPs are deploying automated BSS capable of expanding into non-telecom value chains while supporting new business models through open interfaces.


Figure 1: 5G Open Interfaces for Business Models


 

The world’s digital connections are becoming broader and faster, providing a platform for every industry to boost productivity and innovation. To illustrate the range of possibilities, let’s look at the healthcare industry where connectivity-enabled innovations can make it possible to monitor patients remotely, use AI-powered tools for more accurate diagnoses, and automate many tasks so that caregivers can spend more time with patients.


This technological transformation of the healthcare sector offers numerous opportunities for telecom operators to penetrate new value chains and initiate partnerships that benefit the entire ecosystem. Still, it is just one example of how CSPs can partner with a wide range of vertical industries.


Expanding business models through partners can bring significant benefits and help bring about successful innovation, but it inevitably offers less direct control than delivering services in one's own controlled environment. It is often said that a business is only as strong as the chain of suppliers it works with.

 

An example of how service delivery chains are becoming complex, and thereby harder to manage from a risk perspective, is the recent hacker attack on Uganda's Mobile Money business, which processes phone-based transactions. The mobile money value chain is made up of mobile network operators (MNOs), banks, and end-users, and the underlying technology allows people to receive, store, and spend money using a mobile phone.


In the mobile money value chain, risks are blurred, mostly due to the often-undefined roles of banks and telecommunications companies in financial services, as shown by the recent hack of a gateway that links bank and mobile money transactions. There is a clear line between “banking” and “mobile money” as a standalone business, but that line blurs when MNOs expand their services to connect with banks and allow money to be withdrawn from regular ATMs.


Figure 2: Responsibility Matrix


At Mobileum, we believe this is the opportunity to leverage the data exchange brought by the digital transformation and create the capacity to analyze distributed big data for integrated risk management (IRM) purposes, instead of pursuing a more reactive approach that focuses on finding more data sets and understanding how to use them to address risk. An IRM strategy reduces siloed risk domains and supports dynamic business decision-making via risk-data correlations and shared risk processes.

 

Figure 3: Integrated Risk Management Overview

 
Along with the connectivity platform, CSPs are well positioned to understand and manage a wide scope of risk through a comprehensive view across business units, risk and compliance functions, and key business partners, suppliers, and outsourced entities.
 

The goal should not be to create one big repository that can handle any data set, no matter how large. Instead, it should be to fully automate the linkage among relevant insights from a wide variety of internal and external sources, a process that analyzes data at the various nodes of the supply chain, triggering action immediately when possible and queueing data for deeper analysis.


The Adaptive, Intelligent, and Distributed Assurance Platform (AIDA) project aims to deliver this vision: an end-to-end 5G-ready fraud management platform that is able to protect the 5G ecosystem in its multiple layers and to deploy an IRM strategy that manages high data volumes with real-time visibility through edges close to the monitoring points, contributing scalability and local learning to global models. Additionally, 5G introduces challenges that previous generations did not have. The multitude of deployment scenarios (isolated or shared, private or public networks) and the multiple business entities and partners involved in the new business models introduce intrusion, tampering, confidentiality, and data privacy requirements that need to be monitored and analyzed, ensuring system-wide protection of the ecosystem and value chain.


Another key aspect of 5G is the number of new stakeholders in the fraud landscape, which brings new types of fraud that are difficult to anticipate now. This is where AI, especially unsupervised learning algorithms for abnormal behavior detection, can help to address unknown patterns or smart fraud, designed to disguise abnormal patterns and create blind spots that evade detection. In an Integrated Risk Management approach, data feeds that are traditionally not considered in fraud management systems will strengthen the linkage between the different sources, enhancing the relations between domains such as fraud, security, and network fault and performance.


While detecting fraud in clear data can be a challenge, doing it on encrypted streams is far more challenging. Whether due to legislation, with the intention of covering fraud, or simply as good practice, data is encrypted, revealing nothing more than a relation between two entities. As an example, most malware used today in telecom fraud depends on command-and-control botnets over encrypted connections, which control infected devices and monetize fraud by using services owned by the fraudsters, one of many possible monetization methods. Likewise, premium content can be streamed through illegal services by any node in the network, and the data will only reflect a connection to a VPN provider, covering all the illegal activity behind it. In a 5G context where it is difficult to anticipate the new types of fraud, the challenge of identifying them in encrypted data grows even larger.
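
To make the unsupervised-detection idea concrete for encrypted traffic, the sketch below flags anomalous flows using only connection metadata; the feature set and the use of scikit-learn's IsolationForest are illustrative assumptions, not the algorithms AIDA necessarily deploys:

```python
# Sketch: unsupervised anomaly detection on encrypted-flow metadata.
# Encryption hides payloads, but volumes, timing, and endpoints remain
# observable. Features and IsolationForest are illustrative choices.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Columns: bytes_up, bytes_down, duration_s, distinct_peers
normal = rng.normal([2e4, 8e4, 30, 3], [5e3, 2e4, 10, 1], size=(500, 4))
botnet_like = np.array([[1e3, 1e3, 3600, 250]])  # long-lived, many peers

detector = IsolationForest(contamination=0.01, random_state=0).fit(normal)
print(detector.predict(botnet_like))  # [-1] -> flagged as anomalous
```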


Resilient organizations anticipate risks, develop controls, monitor events, and, whenever possible, apply automatic actions to remediate risks. At Mobileum, we believe CSPs will position themselves to lead the emergence of new ecosystems and play their full role in transforming industries and society. Our technology and telecommunications risk management services can assess and protect against risk-related issues specific to the telecom industry and assure the industries that are leveraged by connectivity. Currently, we provide a vast stack of solutions that can support the changing imperatives of risk management when it comes to the monetization, security, and trust brought by the telco platform economy.


The Mobileum portfolio is uniquely positioned to strengthen the relationship between enterprise risk management and our customers' ability to track risk. At Mobileum, we believe that business resilience and risk management should be tightly linked.

By bringing together network services, security, and testing and monitoring results, we create a comprehensive view and analysis of risks from fraud, monetization failures, and customer/partner experience, while enabling the success of the digital transformation.


By Mobileum
