Motivation
Database systems must deal with the fact that real workloads often exhibit hotspots: at certain times, some items are accessed by concurrent transactions with high probability. This arises in telecoms, sensing, stock trading, shopping, banking, and numerous other applications. Some are as simple as counting events, such as user votes or advertisement impressions on Web sites. Others, such as prepaid telco plans, selling event tickets, or keeping track of remaining inventory, in addition to counting, also need to enforce a bound invariant that ensures the quantity being tracked does not cross a set threshold.
Update hotspots mean that the locking and validation mechanisms used for isolation have a severe impact on usable throughput. This is particularly challenging in emerging cloud and edge database systems, where classical techniques such as escrow locking are not applicable (e.g., locks in MySQL Group Replication are local and transactions in different nodes run optimistically), where distributed synchronization has a considerable impact on latency (e.g., waiting for a stable time in Spanner), or where unpredictability makes them ineffective (e.g., how many separate splits are needed in a serverless system). Moreover, in industrial applications, one needs a solution that works in current cloud-based and off-the-shelf systems, that is, using only their application programming interfaces.
MRVs in a Nutshell
Structure
The AIDA project proposes Multi-Record Values (MRVs), a new approach to handling update hotspots in scale-out cloud and edge database systems. It builds on the general strategy of value splitting: splitting each contended value into multiple database records, each holding part of the total value, such that they can be accessed concurrently. To add to or subtract from the value, one adds to or subtracts from any one of these records. To read the current value, one reads and sums them all. The main novelty of MRVs is how each transaction is assigned to a physical record and how the various records, holding parts of the total value, are managed efficiently.
MRVs can be portrayed as a circular structure of size N. Of these N positions, n are each assigned a physical record, which holds part of the total value. In the example below, we have an MRV with N=23 and n=9, with the latter represented by black circles.
Operations
As different clients might have different access patterns, MRVs avoid statically assigning them to records, as done in previous splitting techniques. To ensure that accesses are evenly spread, our first insight in MRVs is to use a random number between 0 and N-1, for each access, to determine which record to use. Assuming that the number of records for each item is large enough, this results in a small probability of conflict. It also avoids the need for explicit coordination of clients, which would be costly in a distributed environment.
In the example below, transaction T1 wants to add 2 units to the MRV. As such, it performs a lookup with a random key, 4, and ends up updating the next physical record, assigned at index 6.
Subtractions might not be fully possible on a single record if its current value is lower than the amount being subtracted. Thus, a subtraction might require multiple accesses to complete. This could be done simply by keeping the remainder after a first subtraction and carrying it over to a second random one, and so forth, until the amount is fully subtracted. This, however, makes it difficult to determine unsuccessful termination, which happens when all records have been visited and there is still a remainder. This is addressed by performing only one random lookup and then scanning forward to the next records. After the last record, we wrap around and restart the process on the first, hence the circular structure. With a tree-type index, both the lookup and next operations execute efficiently.
If two transactions try to update the same record concurrently, one of them will have to either roll back or wait. MRVs rely on the underlying concurrency control for this, which simplifies the implementation.
In the example below, transaction T2 wants to subtract 3 units from the MRV. It first accesses the record at index 8, which only has 2 units. As such, it sets its value to 0 and carries the remainder (1) to the next record.
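To make the add and subtract operations concrete, here is a minimal, single-process sketch in Python. It assumes the MRV is a plain dict mapping the n occupied positions to their partial amounts; in a real deployment, each position is a database record, the DBMS's concurrency control arbitrates conflicting updates, and a failed subtraction is rolled back by the enclosing transaction.

```python
import random

N = 23  # size of the circular structure

def next_index(records, k):
    """First occupied position at or after k, wrapping around (circular lookup)."""
    occupied = sorted(records)
    return next((i for i in occupied if i >= k), occupied[0])

def add(records, amount):
    i = next_index(records, random.randrange(N))  # one random lookup
    records[i] += amount                          # single-record update

def sub(records, amount):
    """Subtract greedily, carrying the remainder to the next records;
    a full lap without clearing the remainder means the bound invariant
    would break, so the operation fails."""
    start = next_index(records, random.randrange(N))
    i, remainder = start, amount
    while True:
        taken = min(records[i], remainder)
        records[i] -= taken
        remainder -= taken
        if remainder == 0:
            return True                           # success
        i = next_index(records, (i + 1) % N)      # scan to the next record
        if i == start:
            return False                          # visited all records: abort

records = {0: 5, 2: 1, 6: 3, 8: 2, 12: 4}         # an MRV currently worth 15
add(records, 2)
ok = sub(records, 3)
print(ok, sum(records.values()))                  # reading = summing all records
```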
Adjusting
Choosing an optimal n depends on the current load. On one hand, a low number will lead to a high conflict probability under high loads. On the other, a large number increases storage and read overheads, and can be counterproductive for MRVs with low values. Therefore, MRVs employ a background worker that dynamically adjusts the number of records per MRV based on the workload, using, e.g., the MRV’s conflict rate.
In the example below, a higher load leads to an increased abort rate. To offset this, the adjust worker adds two new records, filling previously empty positions.
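A sketch of the adjust worker's policy, continuing the dict model above; the thresholds and the growth step of two records are illustrative, not the paper's exact heuristics.

```python
import random

N = 23  # size of the circular structure

def adjust(records, conflict_rate, high=0.05, low=0.01):
    """Grow the MRV when conflicts are frequent; shrink it when records idle."""
    if conflict_rate > high and len(records) < N:
        free = [i for i in range(N) if i not in records]
        for i in random.sample(free, min(2, len(free))):
            records[i] = 0                 # new empty records at free positions
    elif conflict_rate < low and len(records) > 1:
        empty = [i for i, v in records.items() if v == 0]
        if empty:
            del records[empty[0]]          # drop an empty record: cheaper reads
```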
Balancing
Finally, skewed workloads can lead to imbalanced records. For example, in stock reservation use cases, we might have many small subtractions – clients buying the item – and a few large additions – the store restocking the item. This increases the conflict probability of subtract operations, reducing the usefulness of MRVs. Our solution is another background worker that periodically balances the amounts across records, as exemplified in the animation below.
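A sketch of the balance worker under the same assumptions; a real implementation would do this inside a transaction, concurrently with readers and writers.

```python
def balance(records):
    """Redistribute the total evenly so no record is too small to serve subtractions."""
    total = sum(records.values())
    share, extra = divmod(total, len(records))
    for j, i in enumerate(sorted(records)):
        records[i] = share + (1 if j < extra else 0)  # total is preserved
```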
Evaluation
MRVs are evaluated with experiments on different database management systems, including distributed systems where other solutions are not applicable. Namely, implementations of MRVs have been tested on: a centralized SQL system with PostgreSQL, often used in cloud-based managed services; a single-writer NoSQL data store with MongoDB; a multi-writer SQL database with MySQL Group Replication; and a cloud-native, multi-writer NewSQL system that we call System X.
This test mimics a shopping application that keeps stocks of products. We increase contention both by increasing the number of concurrent clients (X-axis) and by decreasing the number of products (Y-axis). The heatmaps below display the scale-up in throughput against the native single-record solution. These experiments show that the MRVs technique is widely feasible and advantageous across a spectrum of different database management systems, including NoSQL and distributed systems, where an improvement of 100x can be observed in cases of extreme contention.
Final Remarks
Multi-Record Values improve the performance of applications affected by numeric hotspots by reducing the collision probability. The background workers ensure that MRVs adapt to the current load, improving write performance while keeping read and storage overheads low.
The open challenge that remains is the applicability of randomized splitting to data structures other than numeric values.
The full paper can be accessed here: https://nuno-faria.github.io/publications/mrv. The code used in the experiments can be accessed here: https://github.com/nuno-faria/mrv.
References
Nuno Faria and José Pereira. 2023. MRVs: Enforcing Numeric Invariants in Parallel Updates to Hotspots with Randomized Splitting. Proc. ACM Manag. Data 1, 1, Article 43 (May 2023). https://doi.org/10.1145/3588723
If we, as data scientists, receive a dataset from a reliable source, we should go ahead with the analysis (classification, clustering, deep learning, etc), right?
Well, yes, that's what most of us (myself included) often do, especially if there is a tight deadline. However, this could be dangerous. Let me describe some rude awakenings I suffered over the past decades, as well as remind you of some fast and easy preventive measures.
E1 Geographical data
Two decades ago, we got access to a public dataset of cross-roads in California, that is, about 10,000 points in two dimensions ((x, y) pairs). We plotted it to include it in the spatial-indexing paper we wanted to submit – the plot looked mostly empty, with a lot of points in the shape of California at the bottom-left corner (see panel (a) below).
Why so much empty space?
The answer was that there was a single point somewhere in Baltimore (Atlantic coast), thousands of miles away from California. Clearly a typo – maybe the coordinates had the wrong sign or something similar. Since it was just one point, we deleted it.
(a) Q: why is the CA dataset at the bottom left and the rest, empty?
(b) A: because of a tiny stray point in Baltimore…
E2 Medical data – many ‘centenarians’
I heard this instance from a colleague at CMU (let's call him 'Mike'). He was working with patient records (anonymized-patient-id, age, symptoms, etc.). 'Mike' was very careful and did a histogram of the age distribution – and he noticed that '99' was an abnormally popular age! He asked the doctors who owned this dataset – they replied that, yeah, that's what they used as a 'null' value, since for some patients the age is unknown.
Given that “There is always one more data bug”, what should we do as practitioners? While there is no perfect solution, I found useful the following defensive measures:
R1: Visual inspection – 'Plot something': For any dataset, it often helps to plot the histogram ('pdf') of each numerical column. We could even try logarithmic scales when we expect power laws: Zipf, Pareto, etc. Spikes/outliers may signal anomalies (like the '99' age in the medical-records incident); see the sketch after R2.
R2: Keep in touch with the data owner: The medical doctors knew that ’99’ was the null value; in general, domain experts can help us focus on the real anomalies, as opposed to artifacts.
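To make R1 concrete, here is a minimal sketch, assuming the dataset is already loaded in a pandas DataFrame df (the toy column is ours): one linear histogram per numeric column, plus a log-log variant where heavy tails and spikes – like the '99' ages – stand out.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"age": [34, 57, 99, 99, 99, 41, 99, 28, 62, 99]})  # toy data

for col in df.select_dtypes(include="number").columns:
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    df[col].hist(bins=50, ax=ax1)                       # plain 'pdf'
    ax1.set_title(f"{col} (linear)")
    bins = np.logspace(0, np.log10(df[col].max() + 1), 30)
    df[col].hist(bins=bins, ax=ax2)                     # log-log view for power laws
    ax2.set_xscale("log")
    ax2.set_yscale("log")
    ax2.set_title(f"{col} (log-log)")
    plt.show()
```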
In conclusion, Data Science is fun and typically gives high value to the data owner. However, there is always room for errors in the data points, sometimes with painful impact. Anticipating and spotting such errors can help us provide even higher value to our customers. Paraphrasing our software engineering colleagues: There is always one more data bug.
Written by CMU
One of the goals of the AIDA Project is to investigate and identify new ways to help Analysts find Anomalous Behavior (of any kind) in large and complex pools of data. We hope this can ultimately lead to significant improvements in the detection of fraudulent activity occurring on Telecom Networks, especially in light of new technologies like 5G already expanding in the market. Moreover, we now see ever more sophisticated, complex, and robust methods being used to commit fraud, not only on Telecom Networks but also, increasingly, in the world of Communication Service Providers and the Internet.
In essence, our goal is to create and develop new tools that provide new ways to help the Data Analyst and Data Scientist better understand and make sense of information originating from huge and dynamic pools of data, such as a Telecom Operator's traffic activity. This type of data displays very particular characteristics: as it is based upon human interactions, it is prone to exhibiting strong and established patterns, which makes Data Visualization functionalities central when designing and building these new tools.
We believe this can ultimately lead to better results at detecting Anomalous Behavior and uncovering suspicious activity that would otherwise remain undetected. This, in turn, can stand as the basis for developing the next generation of commercial products and solutions aimed at tackling Anomalous Behavior in Telecom Operators' traffic activity and other related scenarios.
One of the ideas behind this approach was to establish a strong focus on facilitating the opening and development of new cases by the end user (usually the Analyst). Case building is a crucial part of how the Analyst manages investigation work: it allows the Analyst to follow a logical, sequential analytical process, with the possibility to deep-dive on data of interest and to open new cases with ease. Current solutions, such as those used at Mobileum, include smart monitoring features that sense unusual activity in the Operator's traffic data. When a suspicious event occurs, the application can set a temporary alert on that specific agent, attempting to keep fraud losses to an absolute minimum without interrupting legitimate activity. This approach functions more as a broad analysis that can detect alterations to the normal flow patterns of traffic data. Therefore, usually only the more obvious and evident anomalous activity is detected through such methods, and a lot of illegal activity passes under the radar of these solutions. The Analyst usually does not have the ability to perform a sequential deep dive on the data and to examine multiple perspectives, sequentially layered, in a manner that facilitates the user's workflow.
This is made more difficult by the rise of Smart Fraud, where fraudsters develop increasingly sophisticated, efficient, and complex ways to commit fraud, whether to increase success rates or to adapt to the anti-fraud mechanisms adopted by the Telecom industry.
The advent of AI and the ever-increasing sophistication, creativity, and expertise of fraudsters are major factors here. New methods used by fraudsters focus on mimicking "normal", non-fraudulent behavior to increase the chances of avoiding detection, in several ways. For example, regarding call traffic activity, we have seen approximations of normal call duration and frequency, and even the inclusion of verified contacts in the fraud-committing process.
We believe the work carried out in the AIDA Project will significantly contribute to new solutions that increase the accuracy in detecting Anomalous Behavior. We provide a brief overview on some of the work conducted so far in the next section.
One of the goals of this work was to provide Analysts with new ways to analyze Telecom traffic activity that can lead to the detection of anomalous behavior missed by classic methods. Along that line, we have introduced Graph Analysis to our research, together with statistical and feature exploration, focused on providing powerful and insightful Data Visualization tools and techniques to help Analysts analyze traffic data. One of the methods of choice in this research is the study of Anomalous Behavior and suspicious activity using graphic objects such as time-evolving graphs. Graphs are very powerful for analyzing human interactions and therefore make exceptional candidates for studying anomalous behavior in Telecommunication Networks' traffic activity. Here, individual agents (people) are represented by nodes of the graph, and the links between them represent connection events established by those nodes. We are interested in understanding the patterns that occur in this type of data, like who-calls-whom and who-sends-money-to-whom; the focus was on the unsupervised case, where there are no labels.
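As a minimal sketch of this representation (field names are illustrative, not an operator's real schema), a who-calls-whom graph can be built from call records with networkx, keeping one edge per call so the graph can evolve over time:

```python
import networkx as nx

calls = [  # (caller, callee, start timestamp, duration in seconds) – toy records
    ("A", "B", 1650000000, 60),
    ("A", "C", 1650000300, 1),
    ("C", "B", 1650000500, 120),
]

G = nx.MultiDiGraph()              # a multigraph keeps one edge per call event
for src, dst, ts, dur in calls:
    G.add_edge(src, dst, ts=ts, duration=dur)

calls_made = dict(G.out_degree())      # per-node statistics feeding the
calls_received = dict(G.in_degree())   # visualizations, e.g. made vs. received
print(calls_made, calls_received)
```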
Our work so far with Anomalous Behavior has led us to use these approaches and techniques in the study of fraud occurring in Telecom Networks. We have been developing new analytical solutions that we have been testing with a real-life dataset originating from an existing Telecom Operator. With the help of experts in this business domain, we were able to detect significant anomalous behavior in novel and potent ways, which positively reinforces the potential and value of this application. More specifically, we were able to identify new cases with high potential for Bypass Fraud. Furthermore, we present a case that was later confirmed by the Telecom Operator as a new Bypass Fraud case that had not been previously identified. A solution we recently published is depicted in figure 1.
The proposed solution follows three sequential steps:
The initial idea was to start with several informative features that can help Analysts get a better understanding of the data they are looking at (as shown below in fig. 2 (a)). Upon selection of desired features to visualize, we can dive into the structure of the data and try to identify and find patterns or nodes that seem anomalous or somehow suspicious.
The Summary comprises a high-level yet detailed interactive view of the data (as shown in fig. 2 (a)). It is possible to visualize data in a variety of ways that can highlight different aspects, patterns, and behaviors to be further investigated.
The deep-dive consists of a drill-down on suspicious or interesting nodes that call for further investigation (as shown in fig. 2 (b) and (c)). Here we can explore several important metrics exhibited by the selected group of nodes. This can help to detect patterns and relationships between different nodes and to infer anomalous behavior. Additionally, we explore the use of EgoNets, a very powerful way to look at relationships in the data: the idea is to be able to drill down on any nodes and relationships of interest and understand specific metrics and statistics associated with them, as in the sketch below.
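A minimal EgoNet sketch with networkx, assuming a call graph like the one built earlier; nx.ego_graph extracts a node's direct neighborhood, over which group metrics can then be computed.

```python
import networkx as nx

G = nx.MultiDiGraph()                            # toy call graph
G.add_edge("X", "Hotel", duration=1)
G.add_edge("Y", "Hotel", duration=1)
G.add_edge("X", "Y", duration=30)

ego = nx.ego_graph(G, "X", radius=1, undirected=True)  # node plus direct contacts
print(ego.number_of_nodes(), "nodes,", ego.number_of_edges(), "edges")
total_talk = sum(d["duration"] for _, _, d in ego.edges(data=True))
print("total talk time in the EgoNet:", total_talk)
```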
This approach was designed to work with labeled and unlabeled data. The possibility of addressing Anomalous Behavior investigation on unlabeled data is of special significance, as unlabeled data will inevitably constitute the vast majority of newly generated data.
One of the main ideas in the design of this solution was to allow the user to interact with the data in a way that makes it possible to select specific areas of interest in the graph for further investigation. This then allows the analysis of suspicious groups of nodes in the graph that may share similar attributes, patterns, or behavior. A major advantage of this is the ability to switch from a case-by-case approach, where the Analyst has to look at each potential case individually and which is thus less informative and more time-consuming, to one of group behavior analysis. By analyzing whole groups of nodes, we can broaden our scope of analysis and gain a better idea of the overall potential fraud events occurring in the data. Additionally, when detecting a suspicious case, it becomes easier to infer related events exhibiting the same kind of behavior. Generally, the visualization of the graph nodes and respective interactions allows us to evaluate the behavior of a particular group and identify the fraudulent nodes and corresponding victims.
Also, the use of parallel graphs provides a powerful means to visualize and analyze different metrics (more than twenty at the moment) that can assist in the investigation process.
In this publication we present a success case that showcases how new fraud events, previously undetected by traditional methods, can be detected. We started from an initial dataset comprising a pool of traffic-activity events (calls) from a Telecom Operator and were able, through Data Visualization techniques, to identify a new fraud case. This was an instance of International Bypass fraud, one of the most prevalent and damaging types of fraud.
The dataset used for the analysis presented here was not labeled and comprises a pool of two days of traffic activity (events are phone calls) from a large Telecom Operator.
Figure 2 depicts the analytical process undertaken in the analysis that led to the identification of the success case presented here.
In figure 2 (a) we can see a plot representing the relationship between the number of calls made and calls received. Here we detected a group of suspicious nodes (red dashed rectangle) that displayed a pattern diverging from the normal behavior seen in the representation of the full dataset.
From there we looked at the various features computed by the application to get a better grasp of the behavior displayed in that case. We noticed a pattern of 1-second phone calls that is not standard behavior.
We further deepened our investigation by computing a graphical visualization of the EgoNet of the data of interest, and we learned that many of these 1-second phone calls were strangely being made to a hotel. We suspect this was indeed Anomalous Behavior and likely indicative of fraudulent activity related to International Bypass.
One of the many outcomes of the AIDA Project has been the research and development of new algorithms and data visualization solutions that add to the catalogue of tools for the detection and identification of Anomalous Behavior. Here we have discussed a solution, developed by AIDA partners Carnegie Mellon University and Mobileum, designed for the study and detection of Anomalous Behavior in large pools of data. We have addressed new ways for Analysts tackling fraud to study Anomalous Behavior using the power of time-evolving graphs. The proposed solution is designed in a sequential manner that allows the user to dive into specific areas of interest in the data. Also, the user can deepen the analysis by investigating the multiple ways the data is interconnected.
This provides the ability to detect fraud patterns occurring in the data across groups of nodes, which contrasts with the narrower scope of single-case analysis used in current state-of-the-art commercial solutions.
We have built a tool focusing on group analysis and attention routing that allows the identification and visualization of fraud patterns in a network.
By using this approach with an initial set of unlabeled data, we were able to quickly and efficiently detect Anomalous Behavior that was confirmed to constitute fraud activity.
We reinforce the strong emphasis on a solution that provides powerful Data Visualization features allowing the user to successfully tackle Anomalous Behavior and fraud activity. The possibility to deep-dive into data points (nodes) of interest and collect valuable information from the relationships between those data points is of special significance. Here, the possibility to perform a drill-down through a Lasso Selection of the data of interest should prove highly valuable in this type of analytical work.
This application also provides the possibility of addressing and detecting new cases of Smart Fraud, which aims to mask itself within "normal" behavior by adopting new and ever more unsuspected mechanisms and techniques.
This is ongoing work as we continue to explore these research vectors. One of the main areas of interest we want to focus our work on is Attention Routing. This will help to improve the study of areas of interest in the data that are blurred across the normal distribution behavior and are often ignored.
We are exploring new ways to analyze groups or clusters of data and continuing our research on parallel graphs and the use of spring models for visualization and interaction.
Finally, we are thinking about how to evolve towards solutions that are more proactive, possibly suggesting suspicious cases based on an automated, informed analysis. We hope this could ultimately have a great impact on the Fraud Management industry.
TgrApp: Anomaly Detection and Visualization of Large-Scale Call Graphs (IEEE International Conference on Big Data, IEEE BigData 2022; AAAI Proceedings, 2022).
CallMine: Fraud Detection and Visualization of Million-Scale Call Graphs (22nd IEEE International Conference on Data Mining, IEEE ICDM 2022).
By Mobileum
Adapting the RAID platform comes with many security and privacy issues that need to be addressed. These issues arise mainly from the transition to an edge architecture, which imposes the use of computation resources at the edges of the network, but also from the need to support multiple tenants and network slicing with 5G technology. As mobile phones connect and disconnect, the network is always changing, requiring adaptation and monitoring tools to maximize resources. And since the complexity of the system increases, attackers have more opportunities to exploit it. With these issues in mind, various mechanisms and protocols were put in place, assuring the overall security of the platform.
The transition to an edge computing architecture introduces new vulnerabilities. Attacks like Man-in-the-Middle or eavesdropping can occur with higher probability, so it is of utmost importance to ensure secure communication between the different devices and components. Edge architectures can leverage different solutions, whether or not they seek compliance with industry standards for mobile applications like Mobile Edge Computing (MEC). Such standards facilitate data processing at the edge, but also open new threats due to the exposure of new APIs, which may require data flows between edge nodes and cloud components.
Secure communications for applications leveraging APIs can rely on the advances of the HTTP protocol – HTTP/3 – which introduces functionality for increased security and performance (e.g., reduced round-trip time for the initial session handshake) and support for unreliable links, for instance with high packet loss. The AIDA platform leverages the advances of HTTP/3, which relies on the QUIC protocol. In addition, KubeEdge orchestrates the diverse microservices of the AIDA platform at edge nodes and also supports multiple solutions to secure the communications at the control plane. Besides TLS connections, there is also support for the HTTP/3 protocol.
The AIDA platform also allows ISPs to detect malware on the fly. First, the platform captures network traffic (mainly DNS queries) and presents it to a botnet-analysis service. The service then produces a floating-point evaluation of the probability of infection of a specific device, which can be used to trigger a response based on the presented value.
The detection steps are: (1) blacklist/whitelist analysis, (2) query-rate analysis, (3) domain analysis (whether or not it is DGA-generated), and finally (4) a machine learning step. This pipeline-oriented scheme aims for high speed and scalability; packets therefore leave the pipeline as soon as a consistent evaluation is formed. In all steps a packet can be marked as infected or clean (except (2), which can only deem it infected) and leave the pipeline; only if a packet does not meet the upper or lower bound criteria for infection does it traverse the pipeline into the next evaluation step.
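A schematic of such a pipeline as a minimal sketch; the thresholds, the toy DGA heuristic, and the stubbed ML step are illustrative, not the AIDA implementation.

```python
from dataclasses import dataclass

BLACKLIST = {"evil.example"}      # illustrative lists and thresholds
WHITELIST = {"good.example"}
RATE_THRESHOLD = 100              # queries per minute

@dataclass
class DnsQuery:
    domain: str
    rate_per_minute: float

def dga_score(domain: str) -> float:
    """Toy heuristic: long, digit-heavy labels look DGA-generated."""
    label = domain.split(".")[0]
    digits = sum(c.isdigit() for c in label)
    return min(1.0, len(label) / 30 + digits / max(len(label), 1))

def evaluate(q: DnsQuery) -> float:
    """Infection probability; packets leave as early as a consistent verdict forms."""
    if q.domain in BLACKLIST:               # (1) blacklist: infected, exit
        return 1.0
    if q.domain in WHITELIST:               # (1) whitelist: clean, exit
        return 0.0
    if q.rate_per_minute > RATE_THRESHOLD:  # (2) query rate: can only deem infected
        return 1.0
    s = dga_score(q.domain)                 # (3) domain analysis
    if s > 0.9 or s < 0.1:                  # consistent verdict either way: exit
        return s
    return 0.5                              # (4) placeholder for the ML step

print(evaluate(DnsQuery("x7k2q9w1.example", rate_per_minute=3.0)))
```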
The AIDA platform is based on microservices, which require special attention to ensure their security. The platform is also expected to generate large amounts of data; therefore, intrusion detection mechanisms must be lightweight while coping with a dynamic number of replicas and a large number of services.
When looking at the system as a whole, we were able to increase detection rates by up to 60%, while only around 25% of the alarms were false positives. Without the employed techniques, the results would be overwhelmed by a very large number of false positives. Also, in adaptation environments, results improved by up to 80%, improving the overall security of the system.
Fig.1 – Results of the intrusion detection methods employed
Hoping to improve the intrusion detection rate and minimize false alarms, we also decided to explore machine-learning techniques for intrusion detection. With this intent, we collected system calls from the microservice systems as our data and used classification techniques to detect intrusions. The results demonstrate a high detection rate for two of the five attacks from the tested vulnerabilities. Although only some attacks were detected, the false alarm rate showed excellent results, staying below 1% for all attacks. We further improved the machine learning results by using a sliding window as a post-processing technique, sketched below.
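A sketch of the sliding-window post-processing (window size and threshold are illustrative): per-window anomaly scores are smoothed so that isolated spikes do not raise alarms, only sustained anomalous stretches.

```python
import numpy as np

def smooth_alerts(scores, window=5, threshold=0.5):
    """scores: per-system-call-window anomaly scores in [0, 1]."""
    kernel = np.ones(window) / window
    smoothed = np.convolve(scores, kernel, mode="same")  # moving average
    return smoothed > threshold                          # alarm on majority vote

scores = np.array([0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0], dtype=float)
print(smooth_alerts(scores).astype(int))  # the lone spike at index 2 is dropped
```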
Based on the machine learning results, we understood the need for a better system call representation for intrusion detection techniques. For this, we decided to work on a representation that could convey more information about the connections between system calls. First, we devised a classification where system calls were divided into classes and subclasses. Later, we established relationships of different costs between these system calls to create a system call graph. Some parts of the system call graph can be seen in Fig. 2. Since our representation was susceptible to our own subjectivity, we designed a validation process that gathered input from other researchers in the area. In the classification, we adjusted 17.28% of system calls based on this validation. The graph validation has started and is in progress.
Fig.2 – Small representation of the system calls as a graph
Since the AIDA platform will receive dynamic loads, adaptations are possible throughout execution, maximizing the resources being allocated. Self-adaptation mechanisms were therefore put in place, monitoring the system and executing pre-defined actions that let the system react to changes, allowing for high performance and availability. For this, we use the Trustworthiness Monitoring and Assessment (TMA) Framework, which enables self-adaptation mechanisms in cloud and edge applications. This is done through its REST interfaces, which interact with the probes and actuators of the managed element (i.e., the AIDA platform). As TMA can be easily tailored to any aspect to be monitored (e.g., performance, availability, security), it was chosen for the AIDA platform. In addition, we have recently made available a new dashboard for managing and visualizing TMA configurations.
Assuring data privacy in the AIDA platform demands suitable anonymization approaches for storing and processing large amounts of data. A privacy framework was developed to overcome the difficulty of selecting and configuring the appropriate mechanisms to fulfill the project requirements. This privacy framework allows implementing, applying, and assessing Privacy-Preserving Mechanisms (PPMs) according to the pipeline below.
Fig. 3 – Privacy Framework architecture
The architecture of the privacy framework consists of a main Python package with a set of subpackages, where each subpackage contains the corresponding adapter, that is, an abstract class that can be extended by implementing the abstract methods (i.e., the methods relevant for the component). These adapters make the framework easily extendable, by allowing the implementation of new features (e.g., new PPMs or metrics), along the lines of the sketch below.
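A hedged sketch of that adapter pattern; class and method names are illustrative, not the framework's actual API.

```python
from abc import ABC, abstractmethod
import pandas as pd

class PPMAdapter(ABC):
    """Abstract adapter: each PPM subpackage extends this class by
    implementing the abstract (i.e., component-relevant) methods."""
    @abstractmethod
    def anonymize(self, df: pd.DataFrame) -> pd.DataFrame: ...

class Suppression(PPMAdapter):
    """Example extension: a trivial PPM that drops direct identifiers."""
    def __init__(self, columns):
        self.columns = columns

    def anonymize(self, df):
        return df.drop(columns=self.columns)

df = pd.DataFrame({"name": ["Ana", "Rui"], "zip": ["3000", "4000"]})
print(Suppression(["name"]).anonymize(df))   # new PPMs plug in the same way
```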
These strategies are complemented by SOTERIA, which uses machine learning techniques to create a distributed privacy-preserving system. It was built taking into consideration both scalability and fault tolerance, allowing the processing of large datasets. See our December post for details.
By University of Coimbra
The results and analysis presented here were done with contributions from Mirela Cazzolato (USP, and CMU), Saranya Vijayakumar (CMU), Xinyi (Carol) Zheng (CMU), Meng-Chieh (Jeremy) Lee (CMU), Namyong Park (CMU), Pedro Fidalgo (Mobileum), Bruno Lages (Mobileum), and Agma Traina (USP).
As we mentioned in the February 2022 blog post, the problem we are focusing on is to spot fraudulent behavior in a who-calls-whom-and-when graph. We distinguished between the supervised case (where we are given a list of fraudulent subscribers, i.e., labeled data) and the unsupervised one, where we are not given any such labeled data.
The two main insights were that (a) the labeled data could be wrong: a subscriber labeled as 'honest' may turn out to be fraudulent after closer inspection; and (b) there are many types of fraud, including telemarketers/scammers ('you owe taxes'); subscribers bypassing the legal ways of making international calls; subscribers using fake or stolen credit cards; and many more.
For the supervised case, we can build classifiers (random forests, autoGluon), which will only be as good as the quality of our labels, and which will never be able to spot new types of fraud.
For the unsupervised case, which we focus on here, the main insights are:
Fraudsters often exhibit lock-step behavior: for example, too many callers call the same (large) number of destinations, at about the same time.
Visualization is a powerful way to explain to a non-analyst why our algorithm suspects a given set of subscribers.
We elaborate with an example, next.
Figure 1 illustrates the main ideas and insights.
The first column shows two of the many scatter plots we are proposing. The first row, '(a)', plots the in-degree (total # of friends calling in) versus the weighted in-degree, in log-log scales, since we have power laws, as expected. Like all the other scatter plots, this is a heatmap – notice that even the scales of the heatmap are logarithmic. The vast majority of subscribers, in red/orange colors, receive a few phone calls (≈ 10^1), and the total talking time is relatively short (≈ 10^2). Not surprisingly, there is a correlation: the more people call you, the larger the total duration of the phone calls. What is surprising is the micro-cluster inside the white circle, at (10^2, 10^4): many people (≈ 10^3 – light-yellow colormap) have a surprisingly similar in-degree of about 100, and a similar total duration of about 10,000 seconds.
Let's dissect (d), the scatter plot below it. Every dot is a subscriber; the axes are the in- and out-degree of each subscriber, again in logarithmic scales (log(x+1), to be exact). There is heavy overplotting (red/orange/yellow colors) for small values – that is, most subscribers receive and initiate only a few phone calls. There are two unexpected issues. The first is that there is little reciprocity: analysis of earlier who-calls-whom networks [1] reported that usually people who make x phone calls also receive about x phone calls. Thus, we would expect to see a strong diagonal along the 45-degree line – but it is not there.
The second observation is that there are a lot of people who never return a phone call (all the dots on the horizontal axis – our domain experts call them 'black holes'), and similarly many people that receive no incoming phone calls ('volcanoes'). These behaviors are not necessarily bad: for example, an emergency room or a help desk would behave like a 'black hole'. However, several fraudulent behaviors are just as asymmetric, like, e.g., scammers ('you owe taxes').
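This kind of plot is easy to reproduce on synthetic data (toy numbers, for shape only): a heatmap of log(x+1)-scaled in- versus out-degree, where nodes that never call back, or are never called, fall on the axes.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 10_000
out_deg = rng.zipf(2.0, n)                       # power-law-ish activity
in_deg = np.where(rng.random(n) < 0.05, 0,       # a few nodes with no in-calls
                  out_deg + rng.poisson(2, n))

x = np.log10(in_deg + 1)                         # log(x+1) keeps the zeros
y = np.log10(out_deg + 1)
plt.hexbin(x, y, gridsize=40, bins="log", cmap="hot")  # log color scale too
plt.xlabel("log10(in-degree + 1)")
plt.ylabel("log10(out-degree + 1)")
plt.colorbar(label="log10(count)")
plt.show()
```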
The second column of Figure 1 shows our guesses for fraudsters (orange diamonds and crosses, for 'volcanoes' and 'black holes' respectively). The crucial plot is (b), which shows the 'core-number' versus the inter-arrival time (IAT). The core-number has a complicated definition¹, but the intuition is that nodes with a high core-number belong to a densely connected community.
In Figure 1(b) notice that, while the vast majority of nodes (red/yellow dots at the bottom-middle of the graph) have a low core-number (2-5), there are several with a very high core-number (≈ 20). Plotting them on the in- vs out-degree plot ('(e)'), about half of them are 'volcanoes' (diamonds on the horizontal axis) and the rest are 'black holes' (crosses, on the vertical one). We shall refer to them as leads from now on.
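Computing the core-number is a one-liner with networkx; in this sketch a random graph stands in for the call graph, so the illustrative cutoff of 20 will rarely fire on it, while on the real data it matches the observation above.

```python
import networkx as nx

G = nx.gnm_random_graph(1000, 5000, seed=1)   # stand-in for the call graph
cores = nx.core_number(G)                     # core-number of every node
leads = [v for v, c in cores.items() if c >= 20]
print(len(leads), "lead(s) with core-number >= 20")
```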
It turns out that most of our leads actually call each other, forming a very dense bipartite core; such subgraphs are typically fraudulent in social networks (like Twitter [3], Facebook [2]).
Our domain experts confirmed that our leads are indeed fraudulent.
Our dataset already had labels, and we show them in the last column of Figure 1, (c) and (f). Notice that (c) and (f) are exactly parallel to the middle column ((b) and (e)), with the only difference that the 'ground truth' column has the labeled nodes in red. As in the 'leads' case, diamonds and crosses correspond to 'volcanoes' and 'black holes' respectively.
The main observation is the following:
The investigators who labeled the data completely missed our 'leads' with the high core-number.
This is exactly the reason we put the labeled set's 'ground truth' within quotes: several of the fraudsters may be flying below the radar, as we showed with our 'leads' of Figure 1(b) and (e).
We repeat our two insights:
Fraudsters often exhibit lock-step behavior, resulting in micro-clusters and dense communities or bipartite cores.
Visualization helps explain our 'leads' (as we did with the micro-clusters in Figure 1(a), (b)).
¹ Formally, the k-core of a graph is a maximal subgraph that contains nodes of degree k or more; the core-number of a node is the largest value of k of a k-core that contains that node.
By Carnegie Mellon University
Given a large who-calls-whom graph, how can we find anomalies and fraud? How can we explain the results of our algorithms? This is exactly the focus of this project. We distinguish two settings: static graphs (no timestamps) and time-evolving graphs (with timestamps for each phone call). We further subdivide each into two sub-cases: supervised and unsupervised. In the supervised case, we have labels for some of the nodes ('fraud'/'honest'), while in the unsupervised one, we have no labels at all.
For the supervised case, the natural assumption is that the labels are absolutely correct and thus we only need to build a classifier, using our labeled set as the 'gold set'. However, this is a pitfall: the 'fraud' labels are almost always correct (notice the 'almost'), while the 'honest' labels are often wrong. Thus, we need to develop algorithms that can tolerate (and flag) those labels that seem erroneous. There is work along this line, in the area of 'weak labels', with some excellent work on so-called 'confident learning' [8]. The second insight is that there are many types of fraud, as well as many types of 'honest' behavior. We already mentioned that in the context of Twitter followers [10]. For phone-call networks, there are also many types of fraud: telemarketers, phishing attempts, redirections to expensive 900 numbers, to name a few.
The most creative and most interesting part of the project is the feature extraction: what (numerical) features should we extract from each node, to try to find strange nodes? The obvious ones are the in- and out-degree of each node, and the total number of minutes in in-calls and out-calls. More elaborate ones include centrality measures, like Google's super-successful PageRank [2], and the number of triangles in the egonet of the node [1]. Thus, every node becomes a point in n-dimensional space, and we can employ all the unsupervised algorithms, like clustering (DBSCAN [9], OPTICS), outlier detection (isolation forests [7]), and micro-cluster detection [6]. A minimal sketch of this idea follows.
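The sketch below uses a synthetic graph and our own small feature set (degrees plus PageRank) with an off-the-shelf isolation forest; it illustrates the nodes-as-points idea, not the project's exact features or settings.

```python
import networkx as nx
import numpy as np
from sklearn.ensemble import IsolationForest

G = nx.DiGraph(nx.scale_free_graph(2000, seed=42))  # toy graph, parallel edges collapsed
pr = nx.pagerank(G)

# every node becomes a point in n-dimensional feature space
features = np.array([[G.in_degree(v), G.out_degree(v), pr[v]] for v in G])

labels = IsolationForest(random_state=0).fit_predict(features)  # -1 = outlier
suspects = [v for v, flag in zip(G, labels) if flag == -1]
print(len(suspects), "strange node(s)")
```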
Here we have two families of tools. The first is to build a classifier for the n-dimensional space that we can create with the feature extraction above. There is a wealth of classifiers to choose from – 'autoGluon' [3] automatically tries several of them and picks the best.
The second is to exploit network effects, with tools like label propagation, belief propagation, and semi-supervised learning (e.g., FaBP [5], zooBP [4]).
Figure 1 gives some results for the supervised case. The scatter plots (actually heatmaps, to highlight the overplotting) have one dot for each customer, with the axes being the (weighted) in-degree versus the (weighted) out-degree. Both axes are logarithmic (log(x+1), so that we keep the zeros).
Notice that most points (fraud/honest) are along the diagonal, indicating that reciprocity is to be expected.
Also notice that there are some extreme deviations from reciprocity, namely, points along the axes. This means that there are customers that call but never get called back (like, e.g., telemarketers), and the other way around (like, e.g., help-lines).
What distinguishes the fraudsters from the honest customers is the magnitude of activity. Notice that most of the fraud customers tend to be around the (10^4, 10^4) point, while most honest customers are close to the (10^3, 10^3) point.
(a) Fraud
(b) Honest
Figure 1: Visualization helps: heatmaps of in-versus out-degree (weighted): Fraud (left) vs honest subscribers (right)
Looking for patterns and anomalies in large, real graphs never gets boring: there are always new patterns to look for, new activities by the fraudsters (as well as new activities by the honest ones). Despite the fact that there are already some excellent tools for graph analysis, there is always room for more.
[1] Akoglu, L., McGlohon, M., and Faloutsos, C. oddball: Spotting anomalies in weighted graphs. In PAKDD (2) (2010), vol. 6119 of Lecture Notes in Computer Science, Springer, pp. 410–421.
[2] Brin, S., and Page, L. The anatomy of a large-scale hypertextual web search engine. Comput. Networks 30, 1-7 (1998), 107–117.
[3] Erickson, N., Mueller, J., Shirkov, A., Zhang, H., Larroy, P., Li, M., and Smola, A. J. AutoGluon-Tabular: Robust and accurate AutoML for structured data. CoRR abs/2003.06505 (2020).
[4] Eswaran, D., Günnemann, S., Faloutsos, C., Makhija, D., and Kumar, M. ZooBP: Belief propagation for heterogeneous networks. Proc. VLDB Endow. 10, 5 (2017), 625–636.
[5] Koutra, D., Ke, T., Kang, U., Chau, D. H., Pao, H. K., and Faloutsos, C. Unifying guilt-by-association approaches: Theorems and fast algorithms. In ECML/PKDD (2) (2011), vol. 6912 of Lecture Notes in Computer Science, Springer, pp. 245–260.
[6] Lee, M., Shekhar, S., Faloutsos, C., Hutson, T. N., and Iasemidis, L. D. gen2Out: Detecting and ranking generalized anomalies. In IEEE BigData (2021), IEEE, pp. 801–811.
[7] Liu, F. T., Ting, K. M., and Zhou, Z. Isolation forest. In ICDM (2008), IEEE Computer Society, pp. 413–422.
[8] Northcutt, C. G., Jiang, L., and Chuang, I. L. Confident learning: Estimating uncertainty in dataset labels. J. Artif. Intell. Res. 70 (2021), 1373–1411.
[9] Schubert, E., Sander, J., Ester, M., Kriegel, H., and Xu, X. DBSCAN revisited, revisited: Why and how you should (still) use DBSCAN. ACM Trans. Database Syst. 42, 3 (2017), 19:1–19:21.
[10] Shah, N., Lamba, H., Beutel, A., and Faloutsos, C. The many faces of link fraud. In ICDM (2017), IEEE Computer Society, pp. 1069–1074.
By Carnegie Mellon University
One of the major challenges in the evolution of the RAID platform during the AIDA project is the need to further distribute the platform components to achieve greater levels of scalability, by leveraging the increasing edge computing capacity made available by the IoT and the imminent large-scale deployment of 5G cellular technology.
The advent of 5G networks and growing adoption of Internet of Things (IoT) devices lead to more opportunities for data collection and processing with hybrid edge-cloud systems.
In this architecture, edge devices – placed near where the data is being collected/accessed – execute some of the processing while offloading other, more complex work to the cloud, which is scalable on demand. However, edge-cloud architectures present several challenges when it comes to data management, most of them due to their inherently large-scale, distributed, and heterogeneous deployments.
For example, in the network of edge-cloud services, such as firewalls and load balancers, there has been a shift towards virtualizing functions to cut operation and management costs and increase elasticity. However, Virtual Network Functions (VNFs) that store data often rely on one of two techniques: using operating-system-level memory-sharing techniques for scalability, or delegating data management to a standalone centralized database. The former helps to reduce latency, but fault tolerance is harder to achieve and it limits all VNF replicas to the same virtual machine. The latter improves scalability and fault tolerance but increases latency. Although some solutions rely on eventual consistency to improve both scalability and latency, that is not enough for functions that require strong consistency or present non-deterministic behavior; this can be circumvented, but only with expensive distributed locking.
On the other hand, the system must answer analytical queries on data collected in the edge. For example, it must be capable of efficiently answering ad-hoc queries for exploratory data analysis. One of the main challenges of this workload is that fetched data is unpredictable. Therefore, the system must be able to adapt to different workloads, by achieving an optimal tradeoff between the network and the computing power of the edge. In an environment with a substantial number of IoT devices, sending all collected data to the cloud consumes network and CPU resources on the edge, which is often limited due to cost and energy factors. Additionally, it introduces delays to the data that is actually needed at the moment. Finally, the heterogeneity of the devices and the collected data makes it difficult for the cloud to have a consistent and holistic view of data, increasing the complexity of analytical workloads.
The edge computing paradigm aims at leveraging the computational and storage capabilities of edge devices while resorting to cloud computing services for more demanding processing tasks that cannot be done at the edge. Edge devices generate large volumes of data that may need to be transferred to the cloud and that come from several types of data sources. Therefore, we propose AIDA-DB, a unified data management architecture for an edge and cloud continuum, summarized in Fig. 1, that is able to tackle both analytical and transactional workloads, with both relying on a polyglot middleware.
AIDA-DB unified data management architecture for an edge and cloud continuum
In detail, the polyglot middleware processes data from different sources, which have different formats and encodings. This component is capable of efficiently performing queries over an integrated view of the data, by exploiting the underlying edge’s query engines.
The synchronization middleware is in charge of efficiently transferring data from the edge to the cloud. It does so by considering the data that is currently required by the cloud and the data that is already cached. Then, it synchronizes the missing data based on a balance between the network delay and the impact on the edge resources.
Finally, the transactional middleware guarantees consistent reads and isolated and atomic updates across the edge and cloud continuum.
Our proposal of the AIDA-DB architecture is aimed at enabling complex VNFs and edge-based application services in emerging 5G networks to manage data across a cloud and edge continuum. The key advantages of AIDA-DB which allow this are:
By INESC TEC
The evolution of the RAID platform during the AIDA project brings many benefits, but also many security and privacy concerns to be considered. Thus, discovering and implementing measures that address these new risks, while not degrading performance, is of utmost importance. The main challenges are related to the transition to the edge, pushing computational power to the edges of the network; to the integration of 5G, supporting multiple tenants and network slicing; and finally to the privacy of the data gathered and analyzed.
Given these changes to the network, communications need to be verified to assure that they are secure and that performance is not affected. The network is constantly changing, as many devices connect to and disconnect from it with varying traffic, creating the need to monitor the platform so as to allow a fast response to any change. These changes to the network create new potential entry points for attackers to take advantage of, or make it harder to defend against attacks that were already possible.
Figure 1: Overview of the changes to the Architecture
Providing Secure Communication among the Components
Figure 2: High level perspective of the secure operation of the main software components
Figure 3: Examples of attacks that can affect ML: Adversarial Samples, Model Extraction, Model Inversion, Reconstruction Attacks, and Membership Inference
Although the use of software-based cryptographic schemes is far from coming to a halt, Trusted Execution Environments (TEEs) are increasingly sought as an alternative solution that can reduce the performance overhead associated with traditional privacy-preserving secure schemes. In AIDA we are exploring this technology to provide a privacy-preserving machine learning solution that can be used in practice, while scaling out for large datasets. SOTERIA is a system for distributed privacy-preserving machine learning, which leverages Apache Spark’s design and its MLlib APIs. Our solution was designed to avoid changing the architecture and processing flow of Apache Spark, keeping its scalability and fault tolerance properties.
Apart from cryptographic mechanisms, privacy guarantees can be provided by applying adequate anonymization mechanisms. However, selecting a privacy-preserving mechanism is quite challenging, not only because of the lack of a standardized and universal privacy definition, but also because of the need to properly select and configure mechanisms according to the data types and privacy requirements. Moreover, the type of anonymization approach employed may affect the performance of the machine learning mechanisms considered in the project. Focusing on the data types relevant to the AIDA project, we are developing a privacy framework that allows us to test configurations and to apply and assess privacy-preserving mechanisms according to the achieved privacy and utility level of the data.
By University of Coimbra and INESC TEC
5G presents an opportunity for telecom operators to capture new revenue streams from industrial digitization. In cases such as network-as-a-service (NaaS), network exposure is becoming a reality through the transformation of core telecom network assets into digital assets. With 5G, the dynamic provisioning and scaling of network capacity and resources are available for the first time.
The vision of managing the network-as-a-service in the same way as a developer might manage cloud resources on Azure, AWS, or Google Cloud is happening through a combination of scalable infrastructure and the next generation of digital business support systems (BSS).
The 5G network evolution has opened up an abundance of new business opportunities for communication service providers (CSPs) in verticals such as industrial automation, security, health care, and automotive. To capture the opportunities and leverage their NaaS capabilities, CSPs are deploying automated business support systems (BSS) capable of expanding non-telecom value chains, while supporting new business models through open interfaces.
Figure 1: 5G Open Interfaces for Business Models
The world’s digital connections are becoming broader and faster, providing a platform for every industry to boost productivity and innovation. To illustrate the range of possibilities, let’s look at the healthcare industry where connectivity-enabled innovations can make it possible to monitor patients remotely, use AI-powered tools for more accurate diagnoses, and automate many tasks so that caregivers can spend more time with patients.
This technological transformation of the healthcare sector offers numerous opportunities for telecom operators to penetrate new value chains and initiate partnerships that benefit the entire ecosystem. Still, it is just one example of how CSPs can partner with a wide range of vertical industries.
Expanding the business models through partners can bring significant benefits and help bring about successful innovation, but inevitably offers less direct control than delivering by themselves in their own controlled environment. It is often said that a business is only as strong as the chain of suppliers it works with.
How service delivery chains are becoming more complex, and thereby more difficult to manage for risk, is exemplified by the recent hacker attack on Uganda's Mobile Money business, which processes phone-based transactions. The mobile money value chain is made up of mobile network operators (MNOs), banks, and end-users, and is a technology that allows people to receive, store, and spend money using a mobile phone.
In the mobile money value chain, there are blurred risks, mostly due to the often-undefined roles of banks and telecommunications companies in financial services, as proven by the recent hack of a gateway that links bank-to-mobile money transactions. There is a clear line between "banking" and "mobile money" as a standalone business. But the big question arises when the lines become blurred: when MNOs expand their services to connect with banks and allow the withdrawal of money from regular ATMs.
Figure 2: Responsibility Matrix
At Mobileum, we believe there is an opportunity to leverage the digital-transformation data exchange and create the capacity to analyze distributed big data for integrated risk management (IRM) purposes, instead of pursuing a more reactive approach that focuses on finding more data sets and understanding how to use them to address risk. An IRM strategy reduces siloed risk domains and supports dynamic business decision-making via risk-data correlations and shared risk processes.
Figure 3: Integrated Risk Management Overview
The goal should not be to create one big repository that can handle any data set, no matter how large. Instead, it should be to fully automate the linkage among relevant insights from a wide variety of internal and external sources – a process that uses data in various nodes of the supply chain, triggering action immediately when possible and adding data to a queue for deeper analysis.
The Adaptive, Intelligent, and Distributed Assurance Platform (AIDA) project aims to deliver this vision: an end-to-end 5G-ready fraud management platform that is able to protect the 5G ecosystem in its multiple layers and to deploy an IRM strategy that manages high data volumes with real-time visibility through edges close to the monitoring points, contributing scalability and local learning to global models. Additionally, 5G introduces challenges that previous generations did not have. The multitude of deployment scenarios – isolated or shared, private or public networks – and the multiple business entities and partners involved in the new business models introduce intrusion, tampering, confidentiality, and data privacy requirements that need to be monitored and analyzed, ensuring system-wide protection of the ecosystem and value chain.
Another key aspect of 5G is the number of new stakeholders in the fraud landscape, which brings new types of fraud that are difficult to anticipate now. This is where the use of AI, especially unsupervised learning algorithms for abnormal behavior detection, can help to address unknown patterns or smart fraud, designed to dissimulate any abnormal patterns and create blind spots, evading detection. In an Integrated Risk Management approach, data feeds that traditionally are not considered in fraud management systems will strengthen the linkage between the different sources, enhancing the relations between different domains, like fraud, security, or network fault and performance.
While detecting fraud in clear data can be a challenge, doing it on encrypted streams is far more challenging. Whether due to legislation, with the intention to cover fraud, or simply as good practice, data is encrypted, providing nothing more than a relation between two entities. As an example, most malware used today in telecom fraud depends on command-and-control botnets over encrypted connections. They control infected devices and monetize fraud by using services owned by the fraudsters – one of many monetization methods that can be used. Likewise, premium content can be streamed through illegal services by any node in the network, and the data will only reflect a connection to a VPN provider, covering all the illegal activity behind it. In a 5G context, where it is difficult to anticipate the new types of fraud, the challenge of identifying them in encrypted data grows.
Resilient organizations anticipate risks, develop controls, monitor events, and, whenever possible, apply automatic actions to remediate risks. At Mobileum, we believe CSPs will position themselves to lead the emergence of new ecosystems and play their full role in transforming industries and society. Our technology and telecommunications risk management services can assess and protect against risk-related issues specific to the telecom industry and assure the industries that are leveraged by connectivity. Currently, we provide a vast stack of solutions that can support the changing imperatives of risk management when it comes to monetization, security, and trust brought by the telco platform economy.
The Mobileum portfolio is uniquely positioned to help our customers build a strong relationship between enterprise risk management and their ability to track risk. At Mobileum, we believe that business resilience and risk management should be tightly linked.
By bringing an integrated view of network services, security, and testing and monitoring results, we create a comprehensive view and analysis of risks from fraud, monetization failures, and customer/partner experience while enabling the success of the digital transformation.
By Mobileum