
Fraud detection, micro-clusters and scatterplots

 

Acknowledgements

The results and analysis presented here were done with contributions from Mirela Cazzolato (USP, and CMU), Saranya Vijayakumar (CMU), Xinyi (Carol) Zheng (CMU), Meng-Chieh (Jeremy) Lee (CMU), Namyong Park (CMU), Pedro Fidalgo (Mobileum), Bruno Lages (Mobileum), and Agma Traina (USP).

 

 

Reminders – Problem definition and past insights

As we mentioned in the February 2022 blog post, the problem we are focusing on is to spot fraudulent behavior in a who-calls-whom-and-when graph. We distinguished between the supervised case, where we are given a list of fraudulent subscribers (labeled data), and the unsupervised one, where we are not given any such labeled data.

The two main insights were: (a) the labeled data could be wrong – a subscriber labeled as 'honest' may turn out to be fraudulent after closer inspection; and (b) there are many types of fraud, including telemarketers/scammers ('you owe taxes'), subscribers bypassing the legal ways of making international calls, subscribers using fake or stolen credit cards, and many more.

 

 

New Insights

For the supervised case, we can build classifiers (random forests, AutoGluon), which will only be as good as the quality of our labels, and which will never be able to spot new types of fraud.

For the unsupervised case, which we focus on here, the main insights are:

  1. Fraudsters often exhibit lock-step behavior: for example, suspiciously many callers call the same (large) set of destinations, at about the same time.

  2. Visualization is a powerful way to explain to a non-analyst why our algorithm suspects a given set of subscribers.

We elaborate with an example, next.

 

 

Deep dive

Figure 1 illustrates the main ideas and insights.

 

 

Raw data: Figure 1 (a) and (d)

The first column shows two of the many scatter-plots we are proposing. The first row, '(a)', plots the in-degree (total # of friends calling in) versus the weighted in-degree, in log-log scales, since we have power laws, as expected. Like all the other scatter plots, this is a heatmap – notice that even the color scale of the heatmap is logarithmic: the vast majority of subscribers, in red/orange colors, receive a few phonecalls (≈ 10¹), and the total talking time is relatively short (≈ 10² seconds). Not surprisingly, there is a correlation: the more people call you, the larger the total duration of the phonecalls. What is surprising is the micro-cluster inside the white circle, at (10², 10⁴): many people (≈ 10³ – light-yellow colormap) have a surprisingly similar in-degree of about 100, and a similar total duration of about 10,000 seconds.
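For readers who want to reproduce this kind of plot: below is a minimal sketch with pandas and matplotlib. The call-record schema (caller, callee, duration) and the synthetic data are purely illustrative – the post's actual pipeline is not shown here.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm

# Hypothetical call-detail records: one row per phonecall. The column
# names (caller, callee, duration) are illustrative, not the real schema.
rng = np.random.default_rng(0)
n = 100_000
calls = pd.DataFrame({
    "caller": rng.zipf(2.0, n) % 10_000,     # heavy-tailed activity
    "callee": rng.zipf(2.0, n) % 10_000,
    "duration": rng.exponential(120.0, n),   # seconds
})

# Per-subscriber in-degree and weighted in-degree (total incoming seconds).
stats = calls.groupby("callee").agg(
    in_degree=("caller", "nunique"),
    w_in_degree=("duration", "sum"),
)

# Heatmap in log-log scale, with a logarithmic *color* scale as well, so
# that both the bulk of the subscribers and small micro-clusters show up.
plt.hist2d(
    np.log10(stats["in_degree"] + 1),
    np.log10(stats["w_in_degree"] + 1),
    bins=100, norm=LogNorm(), cmap="jet",
)
plt.xlabel("log10(in-degree + 1)")
plt.ylabel("log10(weighted in-degree + 1)")
plt.colorbar(label="# of subscribers")
plt.show()
```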

Let's dissect (d), the scatter-plot below it. Every dot is a subscriber; the axes are the in- and out-degree of each subscriber, again in logarithmic scales (log(x+1), to be exact). There is heavy overplotting (red/orange/yellow colors) for small values – that is, most subscribers receive and initiate only a few phonecalls. There are two unexpected issues. The first is that there is little reciprocity: analysis of earlier who-calls-whom networks [1] reported that people who make x phonecalls usually also receive about x phonecalls. Thus, we would expect to see a strong diagonal along the 45-degree line – but it is not there.

The second observation is that there are a lot of people who never return a phonecall (all the dots on the horizontal axis – our domain experts call them 'black holes'), and similarly many people that receive no incoming phonecalls ('volcanoes'). These behaviors are not necessarily bad: for example, an emergency room or a help desk would behave like a 'black hole'. However, several fraudulent behaviors are just as asymmetric – e.g., scammers ('you owe taxes').
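Flagging these two behaviors then amounts to checking for zero in- or out-degree. A sketch, reusing the illustrative calls table from above:

```python
import pandas as pd

# In- and out-degree per subscriber (same illustrative 'calls' table).
in_deg = calls.groupby("callee")["caller"].nunique()
out_deg = calls.groupby("caller")["callee"].nunique()
deg = pd.DataFrame({"in": in_deg, "out": out_deg}).fillna(0)

# 'Black holes' receive calls but never call anyone back;
# 'volcanoes' place calls but receive none.
black_holes = deg[(deg["in"] > 0) & (deg["out"] == 0)].index
volcanoes = deg[(deg["out"] > 0) & (deg["in"] == 0)].index
```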

 

 

Our guesses: Figure 1 (b) and (e)

The second column of Figure 1 shows our guesses for fraudsters (orange diamonds and crosses, for 'volcanoes' and 'black-holes' respectively). The crucial plot is (b), which shows the core-number versus the inter-arrival time (IAT). The core-number has a somewhat technical definition (see the footnote at the end of this post), but the intuition is that nodes with a high core-number belong to a densely connected community.

In Figure 1(b), notice that while the vast majority of nodes (red/yellow dots at the bottom-middle of the graph) have a low core-number (2-5), there are several with a very high core-number (20). Plotting them on the in- vs out-degree plot, (e), about half of them are 'volcanoes' (diamonds on the horizontal axis) and the rest are 'black-holes' (crosses, on the vertical one). We shall refer to them as leads from now on.
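Core-numbers are easy to compute with networkx (the post does not say which tool was actually used, and the threshold of 20 below is simply the value read off Figure 1(b)):

```python
import networkx as nx

# Undirected view of the call graph, from the illustrative 'calls' table.
# The real pipeline and any edge filtering are not described in the post.
G = nx.from_pandas_edgelist(calls, source="caller", target="callee")
G.remove_edges_from(nx.selfloop_edges(G))  # core_number forbids self-loops

core = nx.core_number(G)  # node -> largest k such that a k-core contains it

# The bulk of the nodes have core-number 2-5; flag the far outliers.
leads = [node for node, k in core.items() if k >= 20]
```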

It turns out that most of our leads actually call each other, forming a very dense bipartite core; such subgraphs are typically fraudulent in social networks (like Twitter [3] and Facebook [2]).

Our domain experts confirmed that our leads are indeed fraudulent.

 

 

Comparison with the ’ground truth’

Our dataset already had labels, and we show them in the last column of Figure 1, (c) and (f). Notice that (c) and (f) are exactly parallel to the middle column ((b) and (e)), with the only difference that the 'ground truth' column shows the labeled nodes in red. As in the 'leads' case, diamonds and crosses correspond to 'volcanoes' and 'black-holes' respectively.

The main observation is the following:

  • The investigators who labeled the data completely missed our 'leads' with the high core-number.

This is exactly why we put 'ground truth' within quotes: several of the fraudsters may be flying below the radar, as we showed with our 'leads' of Figure 1(b) and (e).

 

 

Conclusions

We repeat our two insights:

  1. Fraudsters often exhibit lock-step behavior, resulting in micro-clusters and dense communities or bipartite cores.

  2. Visualization helps explain our 'leads' (as we did with the micro-clusters in Figure 1(a) and (b)).

 

Citations

  • [1] Akoglu, L., de Melo, P. O. S. V., and Faloutsos, C. Quantifying reciprocity in large weighted communication networks. In PAKDD (2) (2012), vol. 7302 of Lecture Notes in Computer Science, Springer, pp. 85–96.
  • [2] Beutel, A., Xu, W., Guruswami, V., Palow, C., and Faloutsos, C. CopyCatch: Stopping group attacks by spotting lockstep behavior in social networks. In WWW (2013), International World Wide Web Conferences Steering Committee / ACM, pp. 119–130.
  • [3] Hooi, B., Song, H. A., Beutel, A., Shah, N., Shin, K., and Faloutsos, C. FRAUDAR: Bounding graph fraud in the face of camouflage. In KDD (2016), ACM, pp. 895–904.

     

     

Footnote: Formally, the k-core of a graph is a maximal subgraph in which every node has degree k or more; the core-number of a node is the largest k for which some k-core contains that node.

     

By Carnegie Mellon University


Finding Anomalies in Large Scale Graphs

 

Problem definition

 

Given a large who-calls-whom graph, how can we find anomalies and fraud? How can we explain the results of our algorithms? This is exactly the focus of this project. We distinguish two settings: static graphs (no timestamps) and time-evolving graphs (with timestamps for each phonecall). We further subdivide each into two sub-cases: supervised and unsupervised. In the supervised case, we have labels ('fraud'/'honest') for some of the nodes, while in the unsupervised one, we have no labels at all.

 

Major lessons

 

For the supervised case, the natural assumption is that the labels are absolutely correct, and thus we only need to build a classifier, using our labelled set as the 'gold set'. However, this is a pitfall: the 'fraud' labels are almost always correct (notice the 'almost'), while the 'honest' labels are often wrong. Thus, we need to develop algorithms that can tolerate (and flag) those labels that seem erroneous. There is work along this line, in the area of 'weak labels', with some excellent work on the so-called 'confident learning' [8]. The second insight is that there are many types of fraud, as well as many types of 'honest' behaviour. We already mentioned this in the context of Twitter followers [10]. For phonecall networks, there are also many types of fraud: telemarketers, phishing attempts, redirections to expensive 900 numbers, to name a few.
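As a concrete illustration, the cleanlab library implements confident learning [8] and can rank which labels look erroneous. Below is a minimal, hedged sketch; the feature matrix X and the noisy labels y are synthetic placeholders, and the classifier choice is arbitrary:

```python
import numpy as np
from cleanlab.filter import find_label_issues
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# Synthetic stand-ins: per-subscriber features and noisy 0/1 labels
# (0 = 'honest', 1 = 'fraud'). Real features would come from the graph.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))
y = rng.integers(0, 2, size=1000)

# Confident learning needs out-of-sample predicted probabilities.
pred_probs = cross_val_predict(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X, y, cv=5, method="predict_proba",
)

# Indices of the most suspicious labels -- candidates for re-inspection.
suspects = find_label_issues(
    labels=y, pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)
```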

 

Our tools – unsupervised case

 

The most creative and most interesting part of the project is the feature extraction: what (numerical) features should we extract from each node, to try to find strange nodes? The obvious ones are the in- and out-degree of each node, and the total number of minutes of in-calls and out-calls. More elaborate ones include centrality measures, like Google's super-successful PageRank [2], or the number of triangles in the egonet of the node [1]. Thus, every node becomes a point in n-dimensional space, and we can employ all the unsupervised algorithms, like clustering (DBSCAN [9], OPTICS), outlier detection (isolation forests [7]), and micro-cluster detection [6].
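To make the pipeline concrete, here is a minimal sketch with networkx and scikit-learn: extract a few per-node features, then hand every node, as a point in n-dimensional space, to an isolation forest [7]. The call-record schema and all parameters are illustrative, not the project's actual choices:

```python
import networkx as nx
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical call records; column names are illustrative.
rng = np.random.default_rng(0)
calls = pd.DataFrame({
    "caller": rng.zipf(2.0, 50_000) % 5_000,
    "callee": rng.zipf(2.0, 50_000) % 5_000,
    "minutes": rng.exponential(3.0, 50_000),
})

# Directed call graph, with per-edge minutes as the weight.
G = nx.from_pandas_edgelist(calls, "caller", "callee",
                            edge_attr="minutes", create_using=nx.DiGraph)
G.remove_edges_from(nx.selfloop_edges(G))

# One row of numerical features per node.
features = pd.DataFrame({
    "in_deg": dict(G.in_degree()),
    "out_deg": dict(G.out_degree()),
    "w_in_deg": dict(G.in_degree(weight="minutes")),
    "w_out_deg": dict(G.out_degree(weight="minutes")),
    "pagerank": nx.pagerank(G),
    "triangles": nx.triangles(G.to_undirected()),  # triangles through each node
}).fillna(0)

# Every node is now a point in n-dimensional space, so any unsupervised
# detector applies -- here, an isolation forest.
iso = IsolationForest(random_state=0).fit(features)
features["anomaly"] = -iso.score_samples(features)  # higher = stranger
print(features.nlargest(10, "anomaly"))
```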

 

Our tools – supervised case

 

Here we have two families of tools. The first is to build a classifier for the n-dimensional space that we can create with the feature extraction above. There is a wealth of classifiers to choose from – 'AutoGluon' [3] automatically tries several of them and picks the best.
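A hedged sketch of what this looks like with AutoGluon's tabular API; train_df and test_df are hypothetical per-subscriber tables containing the extracted features plus a 'label' column ('fraud'/'honest'):

```python
from autogluon.tabular import TabularPredictor

# train_df / test_df: hypothetical per-subscriber feature tables,
# each with a 'label' column ('fraud' / 'honest').
predictor = TabularPredictor(label="label").fit(train_df)
predictions = predictor.predict(test_df.drop(columns=["label"]))
print(predictor.leaderboard(test_df))  # the models AutoGluon tried, ranked
```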

The second is to exploit network effects, with tools like label propagation, belief propagation, and semi-supervised learning (e.g., FaBP [5] and ZooBP [4]).
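To illustrate the core guilt-by-association idea (a minimal sketch, not FaBP or ZooBP themselves): clamp the known labels, and let every other node repeatedly average its neighbors' beliefs:

```python
import numpy as np

def propagate_labels(A, labels, alpha=0.85, n_iter=20):
    """Minimal guilt-by-association sketch -- not FaBP [5] or ZooBP [4].

    A:      n x n adjacency matrix of the call graph (dense, for brevity).
    labels: +1 for known fraud, -1 for known honest, 0 for unknown.
    Returns a belief score per node; positive leans 'fraud'.
    """
    W = A / np.maximum(A.sum(axis=1, keepdims=True), 1)  # row-normalize
    belief = labels.astype(float)
    known = labels != 0
    for _ in range(n_iter):
        belief = alpha * (W @ belief)   # average the neighbors' beliefs
        belief[known] = labels[known]   # clamp the known labels
    return belief

# Toy 4-node chain: node 0 is known fraud, node 3 is known honest.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(propagate_labels(A, np.array([1, 0, 0, -1])))
```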

 

Preliminary results

 

Figure 1 gives some results for the supervised case. The scatterplots (actually heatmaps, to highlight the over-plotting) have one dot for each customer, with the axes being the weighted in-degree versus the weighted out-degree. Both axes are logarithmic (log(x + 1), so that we keep the zeros).

Notice that most points (fraud/honest) are along the diagonal, indicating that reciprocity is to be expected.

Also notice that there are some extreme deviations from reciprocity, namely points along the axes. This means that there are customers that call but never get called back (e.g., telemarketers), and the other way around (e.g., help-lines).

What distinguishes the fraudsters from the honest customers is the magnitude of activity: most of the fraud customers tend to be around the (10⁴, 10⁴) point, while most honest customers are close to the (10³, 10³) point.

 

(a) Fraud

(b) Honest

Figure 1: Visualization helps: heatmaps of in- versus out-degree (weighted): fraud (left) vs. honest subscribers (right)

 

Conclusion

 

Looking for patterns and anomalies in large, real graphs never gets boring: there are always new patterns to look for, new activities by the fraudsters (as well as new activities by the honest ones). Despite the fact that there are already some excellent tools for graph analysis, there is always room for more.

 

References

 

[1] Akoglu, L., McGlohon, M., and Faloutsos, C. OddBall: Spotting anomalies in weighted graphs. In PAKDD (2) (2010), vol. 6119 of Lecture Notes in Computer Science, Springer, pp. 410–421.

[2] Brin, S., and Page, L. The anatomy of a large-scale hypertextual web search engine. Comput. Networks 30, 1-7 (1998), 107–117.

[3] Erickson, N., Mueller, J., Shirkov, A., Zhang, H., Larroy, P., Li, M., and Smola, A. J. AutoGluon-Tabular: Robust and accurate AutoML for structured data. CoRR abs/2003.06505 (2020).

[4] Eswaran, D., Günnemann, S., Faloutsos, C., Makhija, D., and Kumar, M. ZooBP: Belief propagation for heterogeneous networks. Proc. VLDB Endow. 10, 5 (2017), 625–636.

[5] Koutra, D., Ke, T., Kang, U., Chau, D. H., Pao, H. K., and Faloutsos, C. Unifying guilt-by-association approaches: Theorems and fast algorithms. In ECML/PKDD (2) (2011), vol. 6912 of Lecture Notes in Computer Science, Springer, pp. 245–260.

[6] Lee, M., Shekhar, S., Faloutsos, C., Hutson, T. N., and Iasemidis, L. D. gen2Out: Detecting and ranking generalized anomalies. In IEEE BigData (2021), IEEE, pp. 801–811.

[7] Liu, F. T., Ting, K. M., and Zhou, Z. Isolation forest. In ICDM (2008), IEEE Computer Society, pp. 413–422.

[8] Northcutt, C. G., Jiang, L., and Chuang, I. L. Confident learning: Estimating uncertainty in dataset labels. J. Artif. Intell. Res. 70 (2021), 1373–1411.

[9] Schubert, E., Sander, J., Ester, M., Kriegel, H., and Xu, X. DBSCAN revisited, revisited: Why and how you should (still) use DBSCAN. ACM Trans. Database Syst. 42, 3 (2017), 19:1–19:21.

[10] Shah, N., Lamba, H., Beutel, A., and Faloutsos, C. The many faces of link fraud. In ICDM (2017), IEEE Computer Society, pp. 1069–1074.

 

By Carnegie Mellon University

 
