AIDA – AIDA https://aida.inesctec.pt Thu, 13 Jul 2023 13:37:01 +0000

Three years and many results later, the AIDA project comes to an end https://aida.inesctec.pt/three-years-and-many-results-later-the-aida-project-comes-to-an-end/?utm_source=rss&utm_medium=rss&utm_campaign=three-years-and-many-results-later-the-aida-project-comes-to-an-end Fri, 07 Jul 2023 21:05:04 +0000 https://aida.inesctec.pt/?p=5612

Three years and many results later, the AIDA project comes to an end

After three years of work on harnessing the power of edge computing and federated machine learning, the AIDA project has come to an end with many significant outcomes in the fields of data processing, scalability, and privacy protection:

 

  • 8 software tools
  • 42 scientific publications:
    • 3 with PT partners and CMU
    • 3 with Mobileum
  • 2 use cases:
    • Botnet detection
    • Anomalous Behavior Detection

 

 

This is what our coordinator has to say about the work developed in the last years and the main contributions from the AIDA project:

 

“Unlike previous evolutions, 5G rethinks and re-architects how the network is built and managed by introducing emerging use cases, namely URLLC (Ultra-Reliable Low-Latency Communications), mMTC (massive Machine-Type Communications), and eMBB (enhanced Mobile Broadband), and business models, affecting not only consumers but also enterprises and industries. In 5G, most subscribers will not be consumers as before; the bulk of 5G subscribers will be IoT devices, which behave quite differently from human subscribers. In fact, depending on the use case, IoT devices can themselves behave in totally different ways (e.g., a smart meter and a connected car).

 

Furthermore, the capabilities needed to support the divergent use cases in 5G can only be achieved with a flexible and efficient architecture, with the agility of modular network functions that can be quickly deployed and scaled on demand.

 

Taken together, 5G networks bring significant changes to fraud management, namely by introducing new use cases and consequently new attack vectors and types of stakeholders involved in fraud, as well as massive volumes of information that require a distributed approach to process.

 

The main contribution of the AIDA Project is to address the changes in the current threat model for 5G, explicitly proposing (I) a distributed edge computing Federated ML architecture on the 5G Edge, using decentralized data to train ML models and supporting the processing of massive volumes of information in real time; (II) linear, scalable, ranked out, parameter-free methods that are able to identify abnormal behavior in a smart fraud context, where artificial intelligence is used to perpetrate fraud and avoid detection; (III) visualization of million-scale call graphs, with meaningful and explainable visualization extracts to assist analysts in spotting fraudsters and suspicious behavior.

 

AIDA represents the effort of the industry with a consortium of academic partners from multiple areas, with the goal of providing a solution that enables us to face the challenges of 5G fraud and distributed platform components, achieving greater levels of scalability by leveraging the increasing edge computing capacity made available by the IoT and the imminent large-scale deployment of 5G cellular technology.”

 

— Pedro Fidalgo, Mobileum

AIDA’s results presented at the IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) https://aida.inesctec.pt/aidas-results-presented-at-the-ieee-ifip-international-conference-on-dependable-systems-and-networks-dsn/?utm_source=rss&utm_medium=rss&utm_campaign=aidas-results-presented-at-the-ieee-ifip-international-conference-on-dependable-systems-and-networks-dsn Thu, 29 Jun 2023 10:32:14 +0000 https://aida.inesctec.pt/?p=5591

AIDA's results presented at the IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)

AIDA’s partners travelled to Porto, Portugal, where they participated in the IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), held from June 27 to 29. Their participation was an opportunity to present the project’s results as it reached its concluding phase. Over the last three years, AIDA’s partners produced 8 software tools, 42 scientific publications (3 with PT partners and CMU and 3 with Mobileum), and 2 use cases (Botnet detection and International Revenue Share Fraud).

Handling Update Hotspots in Distributed Database Systems https://aida.inesctec.pt/handling-update-hotspots-in-distributed-database-systems/?utm_source=rss&utm_medium=rss&utm_campaign=handling-update-hotspots-in-distributed-database-systems Wed, 21 Jun 2023 13:39:16 +0000 https://aida.inesctec.pt/?p=5531

Handling Update Hotspots in Distributed Database Systems

Motivation

Database systems must deal with the fact that real workloads often exhibit hotspots: some items, at certain times, are accessed by concurrent transactions with high probability. This arises in telecoms, sensing, stock trading, shopping, banking, and numerous other applications. Some are as simple as counting events, such as user votes or advertisement impressions on websites. Some of these applications, such as prepaid telco plans, selling event tickets, or keeping track of remaining inventory, need not only to count but also to enforce a bound invariant, which ensures that the quantity being tracked does not cross a set threshold.

 

Update hotspots mean that locking and validation mechanisms used for isolation have a severe impact on usable throughput. This is particularly challenging in emerging cloud and edge database systems as classical techniques such as escrow locking are not applicable (e.g., locks in MySQL Group Replication are local and transactions in different nodes run optimistically), distributed synchronization has a considerable impact on latency (e.g., waiting for a stable time in Spanner), or the unpredictability makes them ineffective (e.g., how many separate splits are needed in a serverless system). Moreover, in industrial applications, one needs a solution that works in current cloud-based and off-the-shelf systems, that is, using only their application programming interfaces.

MRVs in a Nutshell

Structure

The AIDA project proposes Multi-Record Values (MRVs), a new approach to handle update hotspots in scale-out cloud and edge database systems. It builds on the general strategy of value splitting: each contended value is split into multiple database records, each holding part of the total value, such that they can be accessed concurrently. To add to or subtract from the value, one adds to or subtracts from any one of these records. To read the current value, one reads and sums them all. The main novelty of MRVs is how each transaction is assigned to a physical record and how the various records, holding parts of the total value, are managed efficiently.

 

MRVs can be portrayed as a circular structure of size N. Of these N positions, n are assigned to a physical record, which holds a subset of the total value. In the example below, we have an MRV with N=23 and n=9, with the latter represented by black circles.

Figure 1. Representation of MRV pki.
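The structure can be sketched in a few lines of Python. This is only an in-memory illustration, not the authors' implementation: the class name `MRV`, its field names, the even initial split of the total, and the use of a sorted list in place of a database index are all assumptions made for the example.

```python
from bisect import bisect_left


class MRV:
    """Toy in-memory multi-record value: n records placed on a ring of size N."""

    def __init__(self, ring_size, positions, total):
        self.ring_size = ring_size             # N, the number of ring positions
        self.positions = sorted(positions)     # the n positions holding a record
        share, extra = divmod(total, len(self.positions))
        # Assumption: spread the initial total as evenly as possible.
        self.amounts = {
            pos: share + (1 if i < extra else 0)
            for i, pos in enumerate(self.positions)
        }

    def successor(self, key):
        """First occupied position >= key, wrapping around the ring."""
        i = bisect_left(self.positions, key)
        return self.positions[i % len(self.positions)]

    def read(self):
        """Reading the current value means summing every record."""
        return sum(self.amounts.values())


# Hypothetical MRV with N = 23 and n = 9, holding a total of 20 units.
mrv = MRV(23, [0, 3, 6, 8, 11, 14, 17, 19, 21], total=20)
print(mrv.read(), mrv.successor(4), mrv.successor(22))
```

In a real database the `successor` lookup would be served by a tree-type index over the record keys rather than by a Python list.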

Operations

As different clients might have different access patterns, MRVs avoid statically assigning them to records, as done in previous splitting techniques. To ensure that accesses are evenly spread, our first insight in MRVs is to use a random number between 0 and N-1, for each access, to determine which record to use. Assuming that the number of records for each item is big enough, this results in a small probability of conflict. This avoids the need for explicit coordination of clients, which would be costly in a distributed environment. 
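To get a feel for what "big enough" means, the sketch below computes the chance that at least two concurrent accesses land on the same record. It rests on a simplifying assumption that is not stated in the text: each access picks one of the n records uniformly and independently, which only holds when the records are evenly spaced around the ring. The function name is hypothetical.

```python
def conflict_probability(n_records, n_transactions):
    """Chance that at least two concurrent accesses pick the same record.

    Assumes each access picks one of n_records uniformly and independently.
    """
    if n_transactions > n_records:
        return 1.0                      # more accesses than records: certain clash
    p_no_conflict = 1.0
    for i in range(n_transactions):
        # The i-th access must avoid the i records already taken.
        p_no_conflict *= (n_records - i) / n_records
    return 1.0 - p_no_conflict


for n in (4, 16, 64, 256):
    print(n, round(conflict_probability(n, 4), 3))
```

Under this assumption the probability falls as records are added, which is the effect the random assignment relies on.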

 

In the example below, transaction T1 wants to add 2 units to the MRV. As such, it performs a lookup with a random key, 4, and ends up updating the next physical record, assigned at index 6.

Figure 2. Example of an add operation on MRV psi. Only one record is accessed and modified.
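The add operation from this example can be sketched as follows. The dictionary of records, the amounts in it, and the function names are hypothetical; a real deployment would run the lookup and the update as one database transaction.

```python
import random
from bisect import bisect_left


def successor(positions, key):
    """Return the first occupied position >= key, wrapping around the ring."""
    i = bisect_left(positions, key)
    return positions[i % len(positions)]


def mrv_add(records, ring_size, delta, key=None):
    """Add delta to the single record that follows a random key.

    records maps an occupied ring position to the partial amount stored there.
    """
    if key is None:
        key = random.randrange(ring_size)   # uniform in 0..N-1
    pos = successor(sorted(records), key)
    records[pos] += delta
    return pos


# Hypothetical MRV with N = 23; T1 adds 2 units and happens to draw key 4.
records = {0: 5, 3: 1, 6: 4, 8: 2, 11: 7}
touched = mrv_add(records, ring_size=23, delta=2, key=4)
print(touched, records[touched], sum(records.values()))
```

Passing `key` explicitly is only there to make the example reproducible; leaving it out draws a fresh random key per access, as the text describes.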

Subtractions might not be fully possible on a single record if its current value is lower than what is being subtracted. Thus, the subtraction might require multiple accesses to complete. This could be done simply by keeping the remainder after a first subtraction and carrying it on to a second random one, and so forth, until it is fully done. This, however, makes it difficult to determine unsuccessful termination, which happens after all records have been visited and there is still a remainder. This is addressed by performing only one random lookup and then scanning to the next record. After the last record, we wrap around and restart the process on the first, hence the circular structure. With a tree-type index, both the lookup and next operations execute efficiently.

 

If two transactions try to update the same record concurrently, one of them will have to either roll back or wait. MRVs rely on the underlying concurrency control for this, which simplifies their implementation.

 

In the example below, transaction T2 wants to subtract 3 units from the MRV. It first accesses the record at index 8, which only has 2 units. As such, it sets its value to 0 and carries the remainder (1) to the next record.

Figure 3. Example of a sub operation on MRV psi. To complete the operation, two records needed to be accessed and updated.
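A minimal sketch of the subtraction with carry, assuming a lower bound of zero on every record. The function name and the sample amounts are hypothetical, and collecting the writes before applying them merely stands in for the transaction that a database would roll back on failure.

```python
from bisect import bisect_left


def mrv_sub(records, delta, key):
    """Subtract delta starting at the record following key, carrying remainders.

    Returns True on success. Returns False, leaving records untouched, when a
    full lap of the ring cannot cover delta (the bound invariant would break).
    """
    positions = sorted(records)
    start = bisect_left(positions, key) % len(positions)
    updates, remainder = {}, delta
    for step in range(len(positions)):          # at most one lap of the ring
        pos = positions[(start + step) % len(positions)]
        taken = min(records[pos], remainder)
        updates[pos] = records[pos] - taken
        remainder -= taken
        if remainder == 0:
            records.update(updates)             # apply all writes together
            return True
    return False                                # every record visited, still short


# T2 subtracts 3 units; its lookup lands on the record at index 8 (2 units).
records = {0: 5, 3: 1, 6: 4, 8: 2, 11: 7}
print(mrv_sub(records, 3, key=7), records)
```

Because the scan starts at one random point and walks the ring once, failure is detected exactly when the walk returns to its starting record with a remainder left over.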

Adjusting

Choosing an optimal n depends on the current load. On one hand, a low number will lead to a high conflict probability under high loads. On the other, a large number increases storage and read overheads, and can be counterproductive for MRVs with low values. Therefore, MRVs employ a background worker that dynamically adjusts the number of records per MRV based on the workload, using, e.g., the MRV’s conflict rate. 

 

In the example below, a higher load leads to an increased abort rate. To offset this, the adjust worker adds two new records, filling previously empty positions.

Figure 4. Example of the adjust worker adding two new records to MRV pki.
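One round of such an adjust worker might look like the sketch below. The thresholds, the step size, and the policy for removing a record (folding the smallest record into another one) are assumptions made for illustration; the text only states that the number of records follows the workload, for example through the conflict rate.

```python
import random


def adjust(records, ring_size, abort_rate,
           grow_above=0.05, shrink_below=0.01, step=1):
    """One adjust round: add empty records under contention, merge when idle.

    The thresholds and step are hypothetical tuning knobs.
    """
    if abort_rate > grow_above:
        free = [p for p in range(ring_size) if p not in records]
        for pos in random.sample(free, min(step, len(free))):
            records[pos] = 0                    # a new record starts empty
    elif abort_rate < shrink_below and len(records) > 1:
        victim = min(records, key=records.get)  # assumption: drop the smallest
        amount = records.pop(victim)
        survivor = min(records, key=records.get)
        records[survivor] += amount             # keep the total unchanged
    return records


records = {0: 5, 6: 4}
adjust(records, ring_size=23, abort_rate=0.20, step=2)   # high abort rate
print(len(records), sum(records.values()))
```

Either direction leaves the sum of the records unchanged, so readers of the MRV never observe the adjustment.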

Balancing

Finally, skewed workloads can lead to imbalanced records. For example, in stock reservation use cases, we might have multiple small subtractions – clients buying the item – and a few large additions – the store restocking the item. This increases the conflict probability of subtract operations, reducing the usefulness of MRVs. Our solution is another background worker that periodically balances the amount between records, as exemplified in the animation below.

Figure 5. Example of the balance worker balancing the amount among records in MRV pki.
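A balancing round can be as simple as the following sketch. The policy shown here, moving half of the difference from the fullest record to the emptiest one, is an assumption chosen for brevity and is not necessarily the policy used by the authors.

```python
def balance(records):
    """Move half of the gap from the fullest record to the emptiest one.

    Returns the amount transferred; the total across records is unchanged.
    """
    richest = max(records, key=records.get)
    poorest = min(records, key=records.get)
    transfer = (records[richest] - records[poorest]) // 2
    records[richest] -= transfer
    records[poorest] += transfer
    return transfer


# After a large restock, one record holds almost everything.
records = {0: 40, 6: 0, 8: 2}
moved = balance(records)
print(moved, records)
```

Running such rounds periodically keeps small subtractions from repeatedly hitting empty records and having to carry a remainder around the ring.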

Evaluation

MRVs are evaluated with experiments on different database management systems, including distributed systems where other solutions are not applicable. Namely, implementations of MRVs have been tested on: a centralized SQL system with PostgreSQL, often used in cloud-based managed services; a single-writer NoSQL data store with MongoDB; a multi-writer SQL database with MySQL Group Replication; and a cloud-native, multi-writer NewSQL system that we call System X.

 

This test mimics a shopping application that keeps stocks of products. We increase contention both by increasing the number of concurrent clients (X-axis) as well as decreasing the number of products (Y-axis). The heatmaps below display the scale-up in throughput against the native single-record solution. These experiments show that the MRVs technique is widely feasible and advantageous in a spectrum of different database management systems, including NoSQL and distributed systems where an improvement of 100x can be observed in cases of extreme contention.

Figure 6. Performance comparison between MRVs and the native single-record solution in a variety of database systems.

Final Remarks

Multi-Record Values improve the performance of applications affected by the problem of numeric hotspots, by reducing the collision probability. The background workers ensure that MRVs adapt based on the current load, meaning write performance is improved while optimizing to reduce read and storage overheads.

The open challenge that remains is whether randomized splitting can be applied to data structures other than numeric values.


The full paper can be accessed here: https://nuno-faria.github.io/publications/mrv. The code used in the experiments can be accessed here: https://github.com/nuno-faria/mrv.

 

References

Nuno Faria and José Pereira. 2023. MRVs: Enforcing Numeric Invariants in Parallel Updates to Hotspots with Randomized Splitting. Proc. ACM Manag. Data 1, 1, Article 43 (May 2023). https://doi.org/10.1145/3588723

AIDA’s researchers present poster at EuroSys Conference https://aida.inesctec.pt/aidas-researchers-present-poster-at-eurosys-conference/?utm_source=rss&utm_medium=rss&utm_campaign=aidas-researchers-present-poster-at-eurosys-conference Wed, 17 May 2023 14:31:10 +0000 https://aida.inesctec.pt/?p=5507

AIDA's researchers present poster at EuroSys 2023

Earlier this month, two researchers from INESC TEC, one of the partners of the AIDA project, presented a poster at this year’s EuroSys Conference, in Rome, with the title “Emission-Aware Federated Learning: A Case Study on Transportation and Carbon Footprint”. The poster focuses on the addition of Federated Learning, which enables the promotion of sustainable and personalised travel behaviours while preserving data privacy.

As written in the poster, the main goal of the research “is to preserve the privacy of users data while increasing awareness on their carbon footprint”. 

EuroSys Conference is one of the most important events related to systems software research and development.

AIDA promoted a networking event with PhD students https://aida.inesctec.pt/aida-promoted-a-networking-event-with-phd-students/?utm_source=rss&utm_medium=rss&utm_campaign=aida-promoted-a-networking-event-with-phd-students Fri, 31 Mar 2023 12:38:32 +0000 https://aida.inesctec.pt/?p=5489

AIDA promoted a networking event with PhD students

On March 29, eight PhD students got together to share and discuss the work they have been developing within the AIDA project. This event was an excellent networking opportunity for the students to meet each other and to get to know other research topics.

 

There were presentations in three main areas related to the project outputs: Distributed Systems, Security, and Deep Learning and Anomaly Detection.

 

The meeting was moderated by Pedro Fidalgo, from Mobileum, and had the participation of Luís Ferreira, from INESC TEC, José Flora, Iury Araújo, Mariana Cunha and Jessica Castro, from University of Coimbra, and Minji Yoon, Saranya Vijayakumar and Jeremy Lee, from Carnegie Mellon University.

There is always one more data bug https://aida.inesctec.pt/there-is-always-one-more-data-bug/?utm_source=rss&utm_medium=rss&utm_campaign=there-is-always-one-more-data-bug Mon, 13 Feb 2023 10:42:55 +0000 https://aida.inesctec.pt/?p=5419

There is always one more data bug

If we, as data scientists, receive a dataset from a reliable source, we should go ahead with the analysis (classification, clustering, deep learning, etc), right?

Well, yes, that’s what most of us (myself included) often do, especially if there is a tight deadline. However, this could be dangerous. Let me describe some rude awakenings I suffered over the past decades, as well as remind you of some fast and easy preventive measures.

 

Examples (a.k.a. horror stories)

E1 Geographical data

Two decades ago, we got access to a public dataset of cross-roads in California, that is, about 10,000 points in two dimensions ((x, y) pairs). We plotted it to include it in the spatial-indexing paper we wanted to submit – the plot looked mostly empty, with a lot of points in the shape of California, at the bottom-left corner (see figure (a) below).

Why so much empty space?

The answer was that there was a single point somewhere in Baltimore (Atlantic coast), thousands of miles away from California. Clearly a typo – maybe the coordinates had the wrong sign or so. Since it was just one point, we deleted it.

 

 (a) Q: why is the CA dataset at the bottom left and the rest, empty?

 

 (b) A: because of a tiny stray point in Baltimore…
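A plot is what exposed the stray point here, but a rough numeric check can flag the same thing. The sketch below is not from the original analysis: the function name, the threshold `factor`, and the synthetic coordinates are assumptions, and the median absolute deviation (MAD) is just one convenient robust yardstick.

```python
from statistics import median


def far_from_bulk(points, factor=10.0):
    """Return points whose x or y lies more than factor * MAD from the median."""
    def flagged(values):
        med = median(values)
        # Median absolute deviation; the tiny floor avoids a zero threshold.
        mad = median(abs(v - med) for v in values) or 1e-9
        return [abs(v - med) > factor * mad for v in values]

    bad_x = flagged([x for x, _ in points])
    bad_y = flagged([y for _, y in points])
    return [p for p, bx, by in zip(points, bad_x, bad_y) if bx or by]


# Synthetic stand-in: 50 points in a narrow band plus one far to the east.
points = [(-120 + i * 0.1, 35 + i * 0.05) for i in range(50)] + [(-76.6, 39.3)]
print(far_from_bulk(points))
```

A check like this complements the plot rather than replacing it: it tells us that a point is far away, while the plot tells us whether that is plausible.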

 

E2 Medical data – many ‘centenarians’

I heard this story from a colleague at CMU (let’s call him ‘Mike’). He was working with patient records (anonymized-patient-id, age, symptoms, etc.). ‘Mike’ was very careful, and did a histogram of the age distribution – and he noticed that ‘99’ was an abnormally popular age! He asked the doctors who owned this dataset – they replied that, yeah, that’s what they used as a ‘null’ value – since for some patients the age is unknown.

 

Remedies – Conclusion

Given that “There is always one more data bug”, what should we do as practitioners? While there is no perfect solution, I have found the following defensive measures useful:

R1: Visual inspection – ‘Plot something’: For any dataset, it often helps if we plot the histogram (‘pdf’) of each numerical column. We could even try logarithmic scales, when we expect power-laws: Zipf, Pareto, etc. Spikes/outliers may reveal anomalies (like the ‘99’ age of the medical records incident).

R2: Keep in touch with the data owner: The medical doctors knew that ’99’ was the null value; in general, domain experts can help us focus on the real anomalies, as opposed to artifacts.
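As a quick illustration of R1, the sketch below prints a text histogram of a numeric column and lists the values whose frequency towers over the typical one. The function names, the `ratio` threshold, and the synthetic ages are assumptions made for the example.

```python
from collections import Counter
from statistics import median


def text_histogram(values, width=40):
    """Render one line per distinct value, with a bar scaled to its count."""
    counts = Counter(values)
    peak = max(counts.values())
    return "\n".join(
        f"{value:>4} {'#' * max(1, round(width * count / peak))} {count}"
        for value, count in sorted(counts.items())
    )


def spikes(values, ratio=5.0):
    """Values whose count exceeds ratio times the median count."""
    counts = Counter(values)
    typical = median(counts.values())
    return sorted(v for v, c in counts.items() if c > ratio * typical)


# Synthetic ages: a flat distribution plus a suspicious pile-up at 99.
ages = list(range(20, 90)) * 3 + [99] * 40
print(text_histogram(ages).splitlines()[-1])
print(spikes(ages))
```

A spike found this way is only a question, not an answer; as R2 says, the data owner is the one who can tell whether it is a placeholder value or a real effect.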

In conclusion, Data Science is fun and typically gives high value to the data owner. However, there is always room for errors in the data points, sometimes with painful impact. Anticipating and spotting such errors can help us provide even higher value to our customers. Paraphrasing our software engineering colleagues: There is always one more data bug.

 

Written by CMU

AIDA presented to PhD students https://aida.inesctec.pt/aida-presented-to-phd-students/?utm_source=rss&utm_medium=rss&utm_campaign=aida-presented-to-phd-students Tue, 13 Dec 2022 12:24:04 +0000 https://aida.inesctec.pt/?p=5370

AIDA presented to PhD students

INESC TEC, one of AIDA’s partners, presented AIDA to the first-year students of the Doctoral Program in Computer Science (MAP-I). The presentation took place on December 6, 2022, during a visit to the High-Assurance Software Laboratory of INESC TEC at the University of Minho.

 

This was an opportunity to present the work carried out by the AIDA project and get to know a good example of international collaboration between industry and academia.

 

It is important to mention that the initiative was also an opportunity to share the project’s vision and expected outcomes, while promoting the project’s activities and benefits.

AIDA Research on Suspicious Behavior and Anomaly Detection https://aida.inesctec.pt/protecting-the-security-and-privacy-of-aida-and-its-data-2/?utm_source=rss&utm_medium=rss&utm_campaign=protecting-the-security-and-privacy-of-aida-and-its-data-2 Tue, 13 Dec 2022 10:01:24 +0000 https://aida.inesctec.pt/?p=5306

AIDA Research on Suspicious Behavior and Anomaly Detection

One of the goals of the AIDA Project is to investigate and identify new ways to help Analysts find Anomalous Behavior (of any kind) in large and complex pools of data. We hope this can ultimately lead to significant improvements in the detection of fraudulent activity occurring on Telecom Networks, especially in light of new technologies like 5G already expanding in the market. Moreover, we now see ever more sophisticated, complex and robust methods being used for committing fraud on Telecom Networks, and increasingly expanding to the world of Communication Service Providers and the Internet.

In essence, our goal is to create and develop new tools that can provide new ways to help the Data Analyst and Data Scientist better understand and make sense of information originating from huge and dynamic pools of data, such as Telecom Operators’ traffic activity. This type of data displays very particular characteristics as it is based upon human interactions prone to exhibiting strong and established patterns  and Data Visualization functionalities when designing and building these new tools.

We believe this can ultimately lead to better results at detecting Anomalous Behavior and uncover suspicious activity that would otherwise have remained undetected, which in turn can potentially stand as the basis for developing the next generation of commercial products and solutions aimed at tackling Anomalous Behavior in Telecom Operators’ traffic activity and other related scenarios. 

 

Case Building and Smart Fraud

 

One of the ideas behind this approach was to establish a strong focus on making it easy for the end user (usually the Analyst) to open and develop new cases. Case Building is crucial for Analysts to manage their investigation work: it lets them follow a logical, sequential analytical process, deep dive into data of interest, and open and create new cases with ease. Current solutions, such as those used at Mobileum, include smart monitoring features that sense unusual activity in the Operator traffic data. When a suspicious event occurs, the application can set a temporary alert on that specific agent, attempting to keep fraud losses to an absolute minimum without interrupting legitimate activity. This approach functions more as a broad analysis that can detect alterations to the normal flow patterns of traffic data. Therefore, usually only the more obvious anomalous activity can be detected through such methods, which leads to a lot of illegal activity passing under the radar of these solutions. With these solutions, the Analyst usually does not have the ability to perform a sequential deep dive on the data and to reflect on multiple perspectives, sequentially layered in a manner that facilitates the user’s workflow.

This is made more difficult by the increase of Smart Fraud, where fraudsters are developing ever more sophisticated, efficient and complex ways to commit Fraud, be it to increase success rates or to adapt to the adoption of anti-fraud mechanisms by the Telecom industry.

The advent of AI and the ever-increasing sophistication, creativity and expertise of Fraudsters are major factors in this. New methods used by Fraudsters focus on mimicking “normal” non-fraudulent behavior in several ways, which can increase the chances of avoiding detection. For example, regarding call traffic activity, we have seen approximations of normal call duration and frequency, and even verified contacts being included in the fraud process.

We believe the work carried out in the AIDA Project will significantly contribute to new solutions that increase the accuracy in detecting Anomalous Behavior. We provide a brief overview on some of the work conducted so far in the next section.

 

Graph and Node Analysis in Anomaly Detection

 

One of the goals of this work was to provide Analysts with new ways to analyze Telecom traffic activity that can lead to the detection of anomalous behavior that is not detected using classic methods. To that end, we have introduced Graph Analysis to our research, together with statistical and feature exploration focused on providing powerful and insightful Data Visualization tools and techniques that can help Analysts analyze traffic data. One of the methods of choice in such research is the study of Anomalous Behavior and suspicious activity using graph structures such as time-evolving graphs. Graphs are very powerful for analyzing human interactions and therefore make exceptional candidates for studying anomalous behavior in Telecommunication Networks’ traffic activity. Here, individual agents (people) are represented by nodes of the graph, and the links between them represent connection events established by those nodes. We are interested in understanding the patterns that occur in this type of data, like who-calls-whom and who-sends-money-to-whom. The focus was on the unsupervised case, where there are no labels.
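As a small illustration of this representation, the sketch below turns a list of call events into per-node features. The event layout `(caller, callee, seconds)`, the function name, and the chosen features are assumptions for the example, not the feature set of the published tools.

```python
from collections import defaultdict


def node_features(calls):
    """Compute simple per-node features from (caller, callee, seconds) events."""
    callees = defaultdict(set)      # caller -> distinct nodes it called
    callers = defaultdict(set)      # callee -> distinct nodes that called it
    made = defaultdict(int)
    received = defaultdict(int)
    for caller, callee, _seconds in calls:
        callees[caller].add(callee)
        callers[callee].add(caller)
        made[caller] += 1
        received[callee] += 1
    nodes = set(made) | set(received)
    return {
        node: {
            "calls_made": made[node],
            "calls_received": received[node],
            "out_degree": len(callees[node]),
            "in_degree": len(callers[node]),
        }
        for node in nodes
    }


# Hypothetical who-calls-whom events.
calls = [("A", "B", 10), ("A", "B", 5), ("A", "C", 1), ("C", "A", 60)]
print(node_features(calls)["A"])
```

No labels are needed at any point: every feature is derived from the events alone, which matches the unsupervised setting described above.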

Our work so far with Anomalous Behavior has led us to use these approaches and techniques in the study of Fraud occurring in Telecom Networks. We have been developing new analytical solutions and testing them with a real-life dataset originating from an existing Telecom Operator. With the help of experts in this business domain, we were able to detect significant anomalous behavior in novel ways, which reinforces the potential and value of this application. More specifically, we were able to identify new cases with high potential for Bypass Fraud. Furthermore, we present a case that was later confirmed by the Telecom Operator as a Bypass Fraud case that had not been previously identified. A solution we have recently published is depicted in Figure 1.

Figure 1. Developed solution that uses the power of graphs to analyze, detect and determine Anomalous Behavior that has the potential for Fraudulent Activity. Several relevant features can be selected as the basis for further exploration and investigation.

The proposed solution follows three sequential steps:

  • Step 1: ‘Feature-selection’: by carefully choosing features to extract from each node;
  • Step 2: ‘Summary’: high-level, interactive summary of the data;
  • Step 3: ‘Deep-dive’: allowing the user to focus on suspicious nodes.
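The three steps can be mimicked on a toy scale as follows. Everything in this sketch is hypothetical: the feature dictionary, the function names, and the choice of a log-scaled grid as the "summary" stand in for the interactive plots of the actual tool.

```python
import math
from collections import Counter


def summary_grid(features, x, y):
    """Step 2: count nodes per (log10 x, log10 y) cell, as a heatmap would."""
    grid = Counter()
    for values in features.values():
        cell = (math.floor(math.log10(values[x] + 1)),
                math.floor(math.log10(values[y] + 1)))
        grid[cell] += 1
    return grid


def deep_dive(features, x, y, cell):
    """Step 3: list the nodes that fall in one cell chosen by the analyst."""
    return sorted(
        node for node, values in features.items()
        if (math.floor(math.log10(values[x] + 1)),
            math.floor(math.log10(values[y] + 1))) == cell
    )


# Step 1: the analyst picks two features to look at (names are hypothetical).
features = {
    "a": {"calls_made": 500, "calls_received": 450},
    "b": {"calls_made": 3, "calls_received": 2},
    "c": {"calls_made": 4, "calls_received": 5},
}
x, y = "calls_made", "calls_received"
print(summary_grid(features, x, y))
print(deep_dive(features, x, y, (2, 2)))
```

The point of the design is the ordering: a cheap overview of all nodes first, and detailed inspection only for the few cells that look unusual.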

The initial idea was to start with several informative features that can help Analysts get a better understanding of the data they are looking at (as shown below in fig. 2 (a)). Upon selection of desired features to visualize, we can dive into the structure of the data and try to identify and find patterns or nodes that seem anomalous or somehow suspicious.

The Summary comprises a high-level yet detailed interactive view of the data (as shown in fig. 2 (a)). It is possible to visualize data in a variety of ways that can highlight different aspects, patterns and behaviors that can be further investigated.

The deep-dive consists of a drill-down on suspicious or interesting nodes that call for further investigation (as shown in fig. 2 (b) and (c)). Here we can explore several important metrics exhibited by the selected group of nodes. This can help to detect patterns and relationships between different nodes and to infer anomalous behavior. Additionally, we explore the use of EgoNets, a very powerful way to look at relationships in the data. The idea is to be able to drill down on any nodes and relationships of interest and understand specific metrics and statistics associated with them.
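For readers unfamiliar with the term, an EgoNet is the subgraph formed by one node (the ego), its direct neighbours, and the links among them. A minimal sketch of extracting it from call events follows, with a hypothetical event layout `(caller, callee, seconds)` and function name.

```python
def egonet(calls, ego):
    """Return the ego plus its direct neighbours, and the events among them.

    calls is a list of (caller, callee, seconds) events.
    """
    members = {ego}
    for caller, callee, _seconds in calls:
        if caller == ego:
            members.add(callee)
        elif callee == ego:
            members.add(caller)
    edges = [c for c in calls if c[0] in members and c[1] in members]
    return members, edges


calls = [("A", "B", 1), ("C", "A", 30), ("B", "C", 5), ("D", "E", 9), ("B", "D", 2)]
members, edges = egonet(calls, "A")
print(sorted(members), edges)
```

Note that the call between B and C is kept even though A takes part in neither end: links among neighbours are part of the EgoNet, and they are often what makes a neighbourhood look unusual.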

This approach was designed to work with labeled and unlabeled data. Of special significance is the possibility of investigating Anomalous Behavior on unlabeled data, as this will always constitute the vast majority of newly generated data.

One of the main ideas in the design of this solution was to allow the user to interact with the data and select specific areas of interest in the graph for further investigation. This allows the analysis of suspicious groups of nodes in the graph that may share similar attributes, patterns or behavior. A major advantage is being able to switch from a case-by-case approach, where the Analyst has to look at each potential case individually and which is thus less informative and more time-consuming, to one of group behavior analysis. By analyzing whole groups of nodes, we can broaden our scope of analysis and gain a better idea of the overall potential fraud events occurring in the data. Additionally, when detecting a suspicious case, it becomes easier to infer related events exhibiting the same kind of behavior. Generally, the visualization of the graph nodes and their interactions allows us to evaluate the behavior of a particular group and identify the fraudulent nodes and the corresponding victims.

Also, the use of parallel graphs provides a powerful means to visualize and analyze different metrics (more than twenty at the moment) that can assist in the investigation process.

 

Success Case Study

 

In this publication we present a success case that showcases how fraud events previously undetected by traditional methods can be detected. We started from an initial dataset comprising a pool of events (calls) of traffic activity from a Telecom Operator and were able, through Data Visualization techniques, to successfully identify a new fraud case. This was an instance of International Bypass fraud, one of the most prevalent and damaging types of fraud.

The data set used for the analysis presented here was not labeled and is comprised of a pool of 2 days of traffic activity (events are phone calls) from a large Telecom Operator.

Figure 2. Current proposed solution at work: (a) several nodes are on the 45-degree line (red dashed box), away from the majority (notice that both axes, as well as the color-scale, are in log). (b) ‘Deep-dive’ for the red triangle: parallel axis plot of the EgoNet of the ‘red triangle’, shows that the nodes receive 1-second phone calls (c) experts are investigating the nodes like the ones in red ovals, and confirmed that the callers in yellow-highlight have all the evidence of ‘International Bypass’ type of fraud.

Figure 2 depicts the analytical process undertaken in the analysis that led to the identification of the success case presented here.

In figure 2 (a) we can see a plot representing the relationship between number of calls and calls received. Here we detected a group of suspicious nodes (red dashed rectangle) that displayed a pattern diverging from normal behaviour as seen by the representation of the full dataset.

From there we looked at the various features computed by the application to get a better grasp of the behavior displayed in that case. We noticed a pattern of 1-second phone calls, which is not standard behavior.

We further deepened our investigation by computing a graphical visualization of the EgoNet of the data of interest, and we learned that many of these 1-second phone calls were, strangely, being made to a hotel. We suspected this was indeed Anomalous Behavior and likely indicative of fraudulent activity related to International Bypass.
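The pattern that stood out here, a node receiving many 1-second calls, can also be searched for directly once it is known. The sketch below is an illustration only: the event layout `(caller, callee, seconds)`, the function name, and the thresholds are assumptions, and the case described above was found through visual exploration rather than through a fixed rule like this one.

```python
from collections import Counter


def short_call_receivers(calls, max_seconds=1, min_calls=3, min_share=0.8):
    """Nodes that receive mostly very short calls, and enough of them.

    All three thresholds are hypothetical and would need tuning per dataset.
    """
    received = Counter()
    short = Counter()
    for _caller, callee, seconds in calls:
        received[callee] += 1
        if seconds <= max_seconds:
            short[callee] += 1
    return sorted(
        node for node in received
        if short[node] >= min_calls and short[node] / received[node] >= min_share
    )


# Synthetic events: five 1-second calls and one normal call to "hotel".
calls = [(f"x{i}", "hotel", 1) for i in range(5)]
calls += [("a", "b", 120), ("c", "b", 1), ("a", "hotel", 60)]
print(short_call_receivers(calls))
```

A fixed rule of this kind only finds what we already know to look for, which is why the exploratory, visualization-first workflow matters for spotting new patterns.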

 

Conclusion

 

One of the many outcomes of the AIDA Project has been the research and development of new algorithms and data visualization solutions that can add to the catalogue of tools addressing the detection and identification of Anomalous Behavior. Here we have discussed a solution, developed by AIDA partners Carnegie Mellon University and Mobileum, designed for the study and detection of Anomalous Behavior in large pools of data. We have addressed new ways for Analysts tackling Fraud to study Anomalous Behavior using the power of time-evolving graphs. The proposed solution is designed in a sequential manner that allows the user to dive into specific areas of interest in the data. Also, the user can deepen the analysis by investigating the multiple ways in which the data is interconnected.

This provides the ability to detect fraud patterns occurring in the data by groups of nodes, which contrasts with the narrower scope of single-case analysis used in current state-of-the-art commercial solutions.

We have built a tool focusing on group analysis and attention routing that allows the identification and visualization of fraud patterns in a network.

Using this approach on an initial set of unlabeled data, we were able to quickly and efficiently detect Anomalous Behavior that was later confirmed to constitute Fraud activity.

We reinforce the strong emphasis on a solution with powerful Data Visualization features that allow the user to successfully tackle Anomalous Behavior and fraud activity. The ability to dive deep into data points (nodes) of interest and collect valuable information from the relationships between them is of special significance. Here, the ability to drill down through a Lasso Selection of the data of interest should prove highly valuable in this type of analytical work.

This application also provides the possibility for addressing and detecting new cases of Smart Fraud that aims to mask itself within “normal” behavior by adopting new and ever more unsuspected mechanisms and techniques.

 

Next Steps

 

This is ongoing work as we continue to explore these research vectors. One of the main areas we want to focus on is Attention Routing. This will help improve the study of areas of interest in the data that are blurred into the normal distribution of behavior and are often ignored.

We are exploring new ways to analyze groups or clusters of data, and continuing our research on parallel graphs and the use of spring models for visualization and interaction.

Finally, we are thinking about how to evolve towards more proactive solutions, possibly suggesting suspicious cases based on an automated, informed analysis. We hope this could ultimately have a great impact on the Fraud Management industry.

 

References

 

TgrApp: Anomaly Detection and Visualization of Large-Scale Call Graphs (2022 International Conference on Big Data (IEEE BigData 2022), AAAI Proceedings (2022)).

CallMine: Fraud detection and visualization of million-scale call graphs (22nd IEEE International Conference on Data Mining (IEEE ICDM 2022)).

 

By Mobileum

 

AIDA participated in the CMU Portugal Summit 2022 https://aida.inesctec.pt/aida-participated-in-the-cmu-portugal-summit-2022/?utm_source=rss&utm_medium=rss&utm_campaign=aida-participated-in-the-cmu-portugal-summit-2022 Wed, 16 Nov 2022 10:47:36 +0000 https://aida.inesctec.pt/?p=5290

AIDA participated in the CMU Portugal Summit 2022

Pedro Fidalgo, from our coordinating partner Mobileum, participated in the roundtable “Adaptive, automated, and autonomic computing” at this year’s CMU Portugal Summit, which was held under the motto “New Frontiers in tech”.

During the conference, the AIDA project had a meeting with the CMU Portugal External Review Committee, an advisory board charged with assessing the project’s performance and making recommendations.   

Paula Raissa Silva, from INESC TEC, also participated in the conference with a paper entitled “Federated Anomaly Detection over Distributed Data Streams”.

The conference aimed to bring together experts to present research progress in various areas, such as Health, Cybersecurity, Forests, Artificial Intelligence, Language Technologies, and Machine Learning, among others. The CMU Portugal Summit 2022 took place in Lisbon, on November 9 and 10.

Protecting the Security and Privacy of AIDA and its Data https://aida.inesctec.pt/protecting-the-security-and-privacy-of-aida-and-its-data/?utm_source=rss&utm_medium=rss&utm_campaign=protecting-the-security-and-privacy-of-aida-and-its-data Mon, 07 Nov 2022 15:24:06 +0000 https://aida.inesctec.pt/?p=5262

Protecting the Security and Privacy of AIDA and its Data

Adapting the RAID platform raises many security and privacy issues that need to be addressed. These issues arise mainly from the transition to an edge architecture, which requires using computation resources at the edges of the network, but also from the need to support multiple tenants and network slicing with 5G technology. As mobile phones connect and disconnect, the network is always changing, so adaptation and monitoring tools are required to make the most of the available resources. And as the complexity of the system increases, attackers have more opportunities to exploit it. With these issues in mind, various mechanisms and protocols were put in place to ensure the overall security of the platform.

 

Secure Communications and Malware Identification

The transition to an edge computing architecture introduces new vulnerabilities. Attacks such as Man-in-the-Middle or Eavesdropping become more likely, so it is of utmost importance to ensure secure communication between the different devices and components. Edge architectures can leverage different solutions, which may or may not comply with industry standards for mobile applications such as Mobile Edge Computing (MEC). Such standards facilitate data processing at the edge, but they also introduce new threats by exposing new APIs, which may require data flows between edge nodes and cloud components.

 

Applications that communicate through APIs can secure their traffic with HTTP/3, the latest version of the HTTP protocol. HTTP/3 runs over QUIC, which encrypts traffic with TLS 1.3 by default, reduces the round trips needed for the initial session handshake, and copes better with unreliable links, for instance those with high packet loss. The AIDA platform leverages these advances. In addition, KubeEdge orchestrates the diverse microservices of the AIDA platform at edge nodes and supports multiple solutions to secure communications at the control plane: besides TLS connections, there is also support for the HTTP/3 protocol.

 

The AIDA platform also allows ISPs to detect malware on the fly. First, the platform captures the network traffic (mainly DNS queries) and passes it to a botnet analysis service. The service then produces a floating-point score estimating the probability that a specific device is infected, which can be used to trigger a response based on its value.

 

The detection consists of four steps: (1) blacklist/whitelist analysis, (2) query rate analysis, (3) domain analysis (whether or not the domain is DGA-generated) and finally (4) a machine learning step. This pipeline-oriented scheme aims for high speed and scalability, so packets leave the pipeline as soon as a consistent evaluation is formed. In every step a packet can be marked as infected or not infected and leave the pipeline, except in step (2), which can only mark a packet as infected. Only if a packet meets neither the upper nor the lower bound criteria for infection does it move on to the next evaluation step.
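The early-exit logic of the four steps above can be sketched as follows. This is a minimal illustration under assumptions, not the AIDA implementation: the bounds `UPPER` and `LOWER`, the example lists, the `max_qps` limit, the entropy heuristic standing in for DGA detection, and the fixed score returned by `ml_stage` are all hypothetical.

```python
import math
from collections import Counter

UPPER, LOWER = 0.9, 0.1           # hypothetical infection / clean bounds
BLACKLIST = {"bad.example"}        # hypothetical list contents
WHITELIST = {"good.example"}


def list_stage(query):
    """Step 1: blacklist/whitelist lookup; undecided (None) if unlisted."""
    if query["domain"] in BLACKLIST:
        return 1.0
    if query["domain"] in WHITELIST:
        return 0.0
    return None


def rate_stage(query, max_qps=50.0):
    """Step 2: query rate; can only flag infection, never clear a packet."""
    return 1.0 if query["queries_per_second"] > max_qps else None


def entropy(label):
    """Shannon entropy (bits per character) of a string."""
    counts = Counter(label)
    return -sum(c / len(label) * math.log2(c / len(label)) for c in counts.values())


def domain_stage(query):
    """Step 3: crude entropy heuristic standing in for a real DGA detector."""
    h = entropy(query["domain"].split(".")[0])
    if h >= 3.5:
        return 0.95
    if h <= 1.5:
        return 0.05
    return None


def ml_stage(query):
    """Step 4: placeholder for a trained model's probability output."""
    return 0.5


STAGES = [("list", list_stage), ("rate", rate_stage), ("domain", domain_stage)]


def evaluate(query):
    """Return (deciding stage, score); leave as soon as a bound is met."""
    for name, stage in STAGES:
        score = stage(query)
        if score is not None and (score >= UPPER or score <= LOWER):
            return name, score
    return "ml", ml_stage(query)
```

For example, `evaluate({"domain": "bad.example", "queries_per_second": 1.0})` exits at the first stage, while a query that no earlier stage can decide falls through to the machine learning step.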

 

Secure within Software Components

The AIDA platform is based on microservices, which require special attention to ensure their security. The platform is also expected to generate large amounts of data, so intrusion detection mechanisms must be lightweight while also being able to cope with a dynamic number of replicas and a large number of services.

 

When looking at the system as a whole, we were able to increase detection rates up to 60%, while only around 25% of the alarms were false positives. Without the techniques employed, the results would be overwhelmed by a very large number of false positives. In adaptation environments, results improved by up to 80%, improving the overall security of the system.

 

Fig. 1 – Results of the intrusion detection methods employed

 

To improve the intrusion detection rate and minimize false alarms, we also explored machine-learning techniques for intrusion detection. We collected system calls from the microservice systems as our data and used classification techniques to detect intrusions. The results show a high detection rate for two of the five attacks on the tested vulnerabilities. Although only some attacks were detected, the false alarm rate was excellent, staying below 1% for all attacks. We also improved the machine learning results by using a sliding window as a post-processing technique.
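As an illustration of sliding-window post-processing, the sketch below raises an alarm only when enough of the most recent classifier outputs are positive. The exact rule used in the project is not described here, so the majority-style threshold, the function name `smooth_alarms`, and the parameters `window` and `min_hits` are assumptions.

```python
from collections import deque


def smooth_alarms(predictions, window=5, min_hits=3):
    """Turn raw 0/1 predictions into alarms using a sliding window.

    An alarm (1) is emitted only when at least `min_hits` of the last
    `window` raw predictions are positive, which suppresses isolated
    false positives.
    """
    recent = deque(maxlen=window)  # oldest prediction drops out automatically
    smoothed = []
    for prediction in predictions:
        recent.append(prediction)
        smoothed.append(1 if sum(recent) >= min_hits else 0)
    return smoothed


# Toy sequence (invented): one isolated positive, then a sustained burst.
raw = [0, 1, 0, 0, 0, 1, 1, 1, 1, 0]
smoothed = smooth_alarms(raw)
```

The isolated positive at the second position is suppressed, while the sustained burst still raises an alarm, at the cost of a short detection delay.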

 

The machine learning results showed us the need for a better system call representation for intrusion detection techniques. We therefore worked on a representation that conveys more information about the connections between system calls. First, we devised a classification in which system calls are divided into classes and subclasses. Then, we established relationships of different costs between these system calls to create a system call graph. Some parts of the system call graph can be seen in Fig. 2. Since our representation was susceptible to our own subjectivity, we designed a validation process that gathered input from other researchers in the area. In the classification, we adjusted 17.28% of the system calls based on this validation. The graph validation has started and is in progress.

 

Fig. 2 – Small representation of the system calls as a graph

 

Since the AIDA platform will receive dynamic loads, it can adapt throughout execution to make the most of the allocated resources. Self-adaptation mechanisms were therefore put in place: they monitor the system and execute pre-defined actions that let it react to changes, allowing for high performance and availability. To do that, we use the Trustworthiness Monitoring and Assessment (TMA) Framework, which enables self-adaptation mechanisms in cloud and edge applications. This is done through its REST interfaces, which interact with the probes and actuators of the managed element (i.e., the AIDA platform). TMA was chosen for the AIDA platform because it can be easily tailored to any aspect to be monitored (e.g., performance, availability, security). In addition, we have recently made available a new dashboard for managing and visualizing TMA configurations.

 

Data Privacy

Ensuring data privacy in the AIDA platform demands suitable anonymization approaches for storing and processing large amounts of data. The privacy framework was developed to overcome the difficulty of selecting and configuring the appropriate mechanism to fulfill the project requirements. It makes it possible to implement, apply, and assess Privacy-Preserving Mechanisms (PPMs) according to the pipeline below.

 

Fig. 3 – Privacy Framework architecture 

 

The privacy framework consists of a main Python package with a set of subpackages. Each subpackage contains the corresponding adapter, that is, an abstract class that can be extended by implementing its abstract methods (i.e., the methods relevant to that component). These adapters make the framework easy to extend with new features (e.g., new PPMs or metrics).
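The adapter pattern described above can be illustrated with Python's `abc` module. This is a sketch only: the class names `PPMAdapter`, `MetricAdapter`, `Generalization` and `MeanAbsoluteError`, and the method names `apply` and `evaluate`, are hypothetical and do not come from the framework itself.

```python
from abc import ABC, abstractmethod


class PPMAdapter(ABC):
    """Hypothetical adapter for Privacy-Preserving Mechanisms."""

    @abstractmethod
    def apply(self, records):
        """Return a protected copy of `records`."""


class MetricAdapter(ABC):
    """Hypothetical adapter for metrics that assess a PPM."""

    @abstractmethod
    def evaluate(self, original, protected):
        """Return a score comparing original and protected records."""


class Generalization(PPMAdapter):
    """Example PPM: round a numeric field down to a multiple of `bucket`."""

    def __init__(self, field, bucket):
        self.field = field
        self.bucket = bucket

    def apply(self, records):
        return [
            {**r, self.field: (r[self.field] // self.bucket) * self.bucket}
            for r in records
        ]


class MeanAbsoluteError(MetricAdapter):
    """Example metric: average distortion introduced on one field."""

    def __init__(self, field):
        self.field = field

    def evaluate(self, original, protected):
        diffs = [abs(o[self.field] - p[self.field]) for o, p in zip(original, protected)]
        return sum(diffs) / len(diffs)


# Toy records (invented for illustration).
records = [{"user": "u1", "duration": 61}, {"user": "u2", "duration": 125}]
protected = Generalization("duration", 60).apply(records)
error = MeanAbsoluteError("duration").evaluate(records, protected)
```

Adding a new mechanism or metric then only requires a new subclass that implements the abstract method; the abstract base classes themselves cannot be instantiated.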

 

These strategies are complemented by SOTERIA, which uses machine learning techniques to create a distributed privacy-preserving system. It was built taking into consideration both scalability and fault tolerance, allowing the processing of large datasets. See our December post for details.

 

By University of Coimbra
