The Role of Machine Learning in Cybersecurity

1 Introduction

2 Background and Motivation

2.1 Soft Introduction to Machine Learning

| Metric | Description | Formula |
| --- | --- | --- |
| Accuracy (\(Acc\)) | The percentage of correct predictions among all predictions. It is misleading in the presence of imbalanced distributions (common in cybersecurity). | \(Acc = \frac{TP + TN}{TP + TN + FP + FN}\) |
| Detection Rate (\(DR\)) | Measures the capacity to identify attacks, but does not consider false alarms. Also known as Recall or True Positive Rate. | \(DR = \frac{TP}{TP + FN}\) |
| FP Rate (\(FPR\)) | The percentage of incorrect "positive" predictions. Useful for measuring the amount of false alarms. | \(FPR = \frac{FP}{FP+TN}\) |
| Precision (\(Prec\)) | The percentage of correct predictions among all "positive" predictions. High values imply few false positives, but nothing can be said about false negatives. | \(Prec = \frac{TP}{TP + FP}\) |
| F1-score (\(F1\)) | Combines Precision and Detection Rate into a single metric. Useful for "overviews," but difficult to interpret. | \(F1 = \frac{TP}{TP+0.5(FN+FP)}\) |
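
To make the table concrete, here is a minimal Python sketch (not from the paper) that computes all five metrics from raw confusion-matrix counts; the example counts are invented to show how accuracy can look strong while the detection rate is mediocre on imbalanced data.

```python
def detection_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute the five metrics above from confusion-matrix counts."""
    return {
        "Acc":  (tp + tn) / (tp + tn + fp + fn),   # accuracy
        "DR":   tp / (tp + fn),                    # detection rate (recall / TPR)
        "FPR":  fp / (fp + tn),                    # false positive rate
        "Prec": tp / (tp + fp),                    # precision
        "F1":   tp / (tp + 0.5 * (fn + fp)),       # F1-score
    }

# Invented example: an imbalanced test set (10,000 benign, 100 attacks).
# Accuracy looks excellent (~0.99) while the detection rate is only 0.80.
print(detection_metrics(tp=80, tn=9900, fp=100, fn=20))
```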

2.2 Scope and Target Audience

2.3 Related Work

3 Machine Learning for Threat Detection

3.1 Machine Learning in Network Intrusion Detection

3.2 Machine Learning in Malware Detection

3.3 Machine Learning in Phishing Detection

4 Beyond Detection: Additional Roles of Machine Learning in Cybersecurity

4.1 Alert Management

4.2 Raw-data Analysis

4.3 Risk Exposure Assessment

4.4 Threat Intelligence

5 Intrinsic Problems of Machine Learning in Cybersecurity

5.1 General Problems of ML in Cybersecurity

5.2 Problems with In-house Development of ML Systems

5.3 Problems of Commercial-off-the-Shelf ML Products

6 The Future of Machine Learning in Cybersecurity

6.1 Certification (Sovereign Entities)

6.2 Data Availability (Executives and Legislation Authorities)

6.3 Usable Security Research (Scientific Community)

6.4 Orchestration of Machine Learning (Engineers)

7 Case Studies: Industrial Applications of ML for Cybersecurity

7.1 Detection of Cache Poisoning Attacks in Named Data Networks

| Routing Strategy | Metric | 5 Interest/s | 10 Interest/s | 20 Interest/s | 50 Interest/s |
| --- | --- | --- | --- | --- | --- |
|  | % TP | 95.0% | 95.3% | 97.0% | 98.3% |
|  | % FP | 1.0% | 1.0% | 1.0% | 1.0% |
|  | % TP | 63.3% | 72.8% | 79.3% | 96.3% |
|  | % FP | 1.0% | 1.0% | 1.0% | 1.0% |

7.2 Combining ML with Non-ML Methods to Protect Industry 4.0 Environments

8 Conclusion

| # | Misconception | Ref. |
| --- | --- | --- |
| 1 | Deep Learning vs Shallow Learning | Section |
| 2 | Machine Learning and Anomaly Detection | Section |
| 3 | Legitimacy of Adversarial Samples | Section |
| 4 | Minimal Adversarial Perturbations | Section |
| 5 | Size of training data | Section |
| 6 | Updating ML models with new data | Section |

  • Survey Paper
  • Open access
  • Published: 01 July 2020

Cybersecurity data science: an overview from machine learning perspective

  • Iqbal H. Sarker (ORCID: orcid.org/0000-0003-1740-5517)
  • A. S. M. Kayes
  • Shahriar Badsha
  • Hamed Alqahtani
  • Paul Watters
  • Alex Ng

Journal of Big Data, volume 7, Article number 41 (2020)


Abstract

In a computing context, cybersecurity is undergoing massive shifts in technology and operations, and data science is driving the change. Extracting security incident patterns or insights from cybersecurity data and building a corresponding data-driven model is the key to making a security system automated and intelligent. To understand and analyze the actual phenomena with data, various scientific methods, machine learning techniques, processes, and systems are used, which is commonly known as data science. In this paper, we focus on and briefly discuss cybersecurity data science, where the data is gathered from relevant cybersecurity sources and the analytics complement the latest data-driven patterns for providing more effective security solutions. The concept of cybersecurity data science allows making the computing process more actionable and intelligent compared to traditional approaches in the domain of cybersecurity. We then discuss and summarize a number of associated research issues and future directions. Furthermore, we provide a machine learning based multi-layered framework for the purpose of cybersecurity modeling. Overall, our goal is not only to discuss cybersecurity data science and relevant methods but also to focus on their applicability towards data-driven intelligent decision making for protecting systems from cyber-attacks.

Introduction

Due to the increasing dependency on digitalization and the Internet-of-Things (IoT) [ 1 ], various security incidents such as unauthorized access [ 2 ], malware attacks [ 3 ], zero-day attacks [ 4 ], data breaches [ 5 ], denial of service (DoS) [ 2 ], and social engineering or phishing [ 6 ] have grown at an exponential rate in recent years. For instance, in 2010, there were fewer than 50 million unique malware executables known to the security community. By 2012, this figure had roughly doubled to around 100 million, and in 2019 there were more than 900 million malicious executables known to the security community, a number that is likely to grow, according to the statistics of the AV-TEST institute in Germany [ 7 ]. Cybercrime and attacks can cause devastating financial losses and affect organizations and individuals alike. It is estimated that a data breach costs 8.19 million USD in the United States and 3.9 million USD on average [ 8 ], and that the annual cost to the global economy from cybercrime is 400 billion USD [ 9 ]. According to Juniper Research [ 10 ], the number of records breached each year is expected to nearly triple over the next five years. Thus, it is essential that organizations adopt and implement a strong cybersecurity approach to mitigate the loss. According to [ 11 ], the national security of a country depends on business, government, and individual citizens having access to applications and tools which are highly secure, and on the capability of detecting and eliminating such cyber-threats in a timely way. Therefore, effectively identifying various cyber incidents, either previously seen or unseen, and intelligently protecting the relevant systems from such cyber-attacks is a key issue that must be solved urgently.

Figure 1: Popularity trends of data science, machine learning, and cybersecurity over time (x-axis: timestamp; y-axis: popularity)

Cybersecurity is a set of technologies and processes designed to protect computers, networks, programs, and data from attack, damage, or unauthorized access [ 12 ]. In recent days, cybersecurity is undergoing massive shifts in technology and operations in the context of computing, and data science (DS) is driving the change, where machine learning (ML), a core part of "Artificial Intelligence" (AI), can play a vital role in discovering insights from data. Machine learning can significantly change the cybersecurity landscape, and data science is leading a new scientific paradigm [ 13 , 14 ]. The popularity of these related technologies is increasing day by day, as shown in Fig. 1, based on data from the last five years collected from Google Trends [ 15 ]. The figure represents timestamps as particular dates on the x-axis and the corresponding popularity, in the range of 0 (minimum) to 100 (maximum), on the y-axis. As shown in Fig. 1, the popularity values of these areas were below 30 in 2014, while they exceeded 70 in 2019, i.e., more than doubling in popularity. In this paper, we focus on cybersecurity data science (CDS), which is broadly related to these areas in terms of security data processing techniques and intelligent decision making in real-world applications. Overall, CDS is security data-focused, applies machine learning methods to quantify cyber risks, and ultimately seeks to optimize cybersecurity operations. Thus, this paper is intended for those in academia and industry who want to study and develop a data-driven smart cybersecurity model based on machine learning techniques. Therefore, great emphasis is placed on a thorough description of various types of machine learning methods, and on their relations and usage in the context of cybersecurity. This paper does not describe all of the different techniques used in cybersecurity in detail; instead, it gives an overview of cybersecurity data science modeling based on artificial intelligence, particularly from a machine learning perspective.

The ultimate goal of cybersecurity data science is data-driven intelligent decision making from security data for smart cybersecurity solutions. CDS represents a partial paradigm shift from traditional well-known security solutions such as firewalls, user authentication and access control, and cryptography systems, which might not be effective according to today's needs in the cyber industry [ 16 , 17 , 18 , 19 ]. The problem is that these are typically handled statically by a few experienced security analysts, with data management done in an ad-hoc manner [ 20 , 21 ]. However, as an increasing number of cybersecurity incidents in the different formats mentioned above continuously appear over time, such conventional solutions have encountered limitations in mitigating cyber risks. As a result, numerous advanced attacks are created and spread very quickly throughout the Internet. Although several researchers use various data analysis and learning techniques to build cybersecurity models, as summarized in " Machine learning tasks in cybersecurity " section, a comprehensive security model based on the effective discovery of security insights and the latest security patterns could be more useful. To address this issue, we need to develop more flexible and efficient security mechanisms that can respond to threats and update security policies to mitigate them intelligently and in a timely manner. To achieve this goal, it is inherently required to analyze a massive amount of relevant cybersecurity data generated from various sources, such as network and system sources, and to discover insights or proper security policies with minimal human intervention in an automated manner.

Analyzing cybersecurity data and building the right tools and processes to successfully protect against cybersecurity incidents goes beyond a simple set of functional requirements and knowledge about risks, threats, or vulnerabilities. For effectively extracting insights or patterns of security incidents, several machine learning techniques, such as feature engineering, data clustering, classification, and association analysis, or neural network-based deep learning techniques, can be used, as briefly discussed in " Machine learning tasks in cybersecurity " section. These learning techniques are capable of finding anomalies or malicious behavior and data-driven patterns of associated security incidents in order to make intelligent decisions. Thus, based on the concept of data-driven decision making, we aim to focus on cybersecurity data science, where the data is gathered from relevant cybersecurity sources such as network activity, database activity, application activity, or user activity, and the analytics complement the latest data-driven patterns for providing corresponding security solutions.

The contributions of this paper are summarized as follows.

We first briefly discuss the concept of cybersecurity data science and relevant methods to understand its applicability towards data-driven intelligent decision making in the domain of cybersecurity. For this purpose, we also review and briefly discuss different machine learning tasks in cybersecurity, and summarize various cybersecurity datasets, highlighting their usage in different data-driven cyber applications.

We then discuss and summarize a number of associated research issues and future directions in the area of cybersecurity data science that could help both academia and industry pursue further research and development in relevant application areas.

Finally, we provide a generic multi-layered framework of the cybersecurity data science model based on machine learning techniques. In this framework, we briefly discuss how the cybersecurity data science model can be used to discover useful insights from security data and to make data-driven intelligent decisions to build smart cybersecurity systems.

The remainder of the paper is organized as follows. " Background " section summarizes the background of our study and gives an overview of the technologies related to cybersecurity data science. " Cybersecurity data science " section defines and briefly discusses cybersecurity data science, including various categories of cyber incident data. In "  Machine learning tasks in cybersecurity " section, we briefly discuss various categories of machine learning techniques, including their relations with cybersecurity tasks, and summarize a number of machine learning based cybersecurity models in the field. " Research issues and future directions " section briefly discusses and highlights various research issues and future directions in the area of cybersecurity data science. In "  A multi-layered framework for smart cybersecurity services " section, we suggest a machine learning-based framework for building a cybersecurity data science model and discuss the various layers and their roles. In "  Discussion " section, we highlight several key points regarding our study. Finally,  " Conclusion " section concludes this paper.

Background

In this section, we give an overview of the technologies related to cybersecurity data science, including various types of cybersecurity incidents and defense strategies.

Cybersecurity

Over the last half-century, the information and communication technology (ICT) industry has evolved greatly and become ubiquitous and closely integrated with our modern society. Thus, protecting ICT systems and applications from cyber-attacks has become a major concern for security policymakers in recent days [ 22 ]. The act of protecting ICT systems from various cyber-threats or attacks has come to be known as cybersecurity [ 9 ]. Several aspects are associated with cybersecurity: measures to protect information and communication technology; the raw data and information it contains and their processing and transmission; associated virtual and physical elements of the systems; the degree of protection resulting from the application of those measures; and eventually the associated field of professional endeavor [ 23 ]. Craigen et al. defined "cybersecurity as a set of tools, practices, and guidelines that can be used to protect computer networks, software programs, and data from attack, damage, or unauthorized access" [ 24 ]. According to Aftergood et al. [ 12 ], "cybersecurity is a set of technologies and processes designed to protect computers, networks, programs and data from attacks and unauthorized access, alteration, or destruction". Overall, cybersecurity is concerned with understanding diverse cyber-attacks and devising corresponding defense strategies that preserve the several properties defined below [ 25 , 26 ].

Confidentiality is a property used to prevent the access and disclosure of information to unauthorized individuals, entities or systems.

Integrity is a property used to prevent any modification or destruction of information in an unauthorized manner.

Availability is a property used to ensure timely and reliable access to information assets and systems for an authorized entity.

The term cybersecurity applies in a variety of contexts, from business to mobile computing, and can be divided into several common categories. These are: network security, which mainly focuses on securing a computer network from cyber attackers or intruders; application security, which takes into account keeping software and devices free of risks or cyber-threats; information security, which mainly considers the security and privacy of relevant data; and operational security, which includes the processes of handling and protecting data assets. Typical cybersecurity systems are composed of network security systems and computer security systems containing a firewall, antivirus software, or an intrusion detection system [ 27 ].

Cyberattacks and security risks

The risks typically associated with any attack consider three security factors: threats, i.e., who is attacking; vulnerabilities, i.e., the weaknesses they are attacking; and impacts, i.e., what the attack does [ 9 ]. A security incident is an act that threatens the confidentiality, integrity, or availability of information assets and systems. Several types of cybersecurity incidents may result in security risks to an organization's systems and networks, or to an individual [ 2 ]. These are:

Unauthorized access describes the act of accessing a network, systems, or data without authorization, resulting in a violation of a security policy [ 2 ];

Malware, known as malicious software, is any program or software intentionally designed to cause damage to a computer, client, server, or computer network, e.g., botnets. Different types of malware include computer viruses, worms, Trojan horses, adware, ransomware, spyware, malicious bots, etc. [ 3 , 26 ]. Ransom malware, or ransomware, is an emerging form of malware that prevents users from accessing their systems, personal files, or devices, and then demands an anonymous online payment in order to restore access;

Denial-of-Service is an attack meant to shut down a machine or network, making it inaccessible to its intended users by flooding the target with traffic that triggers a crash. A Denial-of-Service (DoS) attack typically uses one computer with an Internet connection, while a distributed denial-of-service (DDoS) attack uses multiple computers and Internet connections to flood the targeted resource [ 2 ];

Phishing, a type of social engineering, is used for a broad range of malicious activities accomplished through human interactions, in which a fraudulent attempt is made to obtain sensitive information such as banking and credit card details, login credentials, or personally identifiable information by disguising oneself as a trusted individual or entity via an electronic communication such as email, text, or instant message [ 26 ];

Zero-day attack is the term used to describe the threat of an unknown security vulnerability for which either a patch has not been released or the application developers were unaware [ 4 , 28 ].

Besides the attacks mentioned above, privilege escalation [ 29 ], password attacks [ 30 ], insider threats [ 31 ], man-in-the-middle attacks [ 32 ], advanced persistent threats [ 33 ], SQL injection attacks [ 34 ], cryptojacking attacks [ 35 ], and web application attacks [ 30 ] are well-known security incidents in the field of cybersecurity. A data breach is another type of security incident, also known as a data leak, which involves the unauthorized access of data by an individual, application, or service [ 5 ]. Thus, all data breaches are security incidents, but not all security incidents are data breaches. Most data breaches occur in the banking industry, involving credit card numbers and personal information, followed by the healthcare sector and the public sector [ 36 ].

Cybersecurity defense strategies

Defense strategies are needed to protect data or information, information systems, and networks from cyber-attacks or intrusions. More granularly, they are responsible for preventing data breaches or security incidents, and for monitoring and reacting to intrusions, which can be defined as any kind of unauthorized activity that causes damage to an information system [ 37 ]. An intrusion detection system (IDS) is typically defined as "a device or software application that monitors a computer network or systems for malicious activity or policy violations" [ 38 ]. The traditional well-known security solutions such as anti-virus, firewalls, user authentication, access control, data encryption, and cryptography systems might not be effective according to today's needs in the cyber industry [ 16 , 17 , 18 , 19 ]. On the other hand, an IDS resolves these issues by analyzing security data from several key points in a computer network or system [ 39 , 40 ]. Moreover, intrusion detection systems can be used to detect both internal and external attacks.

Intrusion detection systems fall into different categories according to their usage scope. For instance, host-based intrusion detection systems (HIDS) and network intrusion detection systems (NIDS) are the most common types, spanning the scope from single computers to large networks. A HIDS monitors important files on an individual system, while a NIDS analyzes and monitors network connections for suspicious traffic. Similarly, based on methodology, signature-based IDS and anomaly-based IDS are the most well-known variants [ 37 ].

Signature-based IDS : A signature can be a predefined string, pattern, or rule that corresponds to a known attack. In a signature-based IDS, detecting a particular pattern is identified as the detection of the corresponding attack. An example of a signature can be a known pattern or byte sequence in network traffic, or sequences used by malware. To detect attacks, anti-virus software uses such sequences or patterns as signatures when performing matching operations. Signature-based IDS is also known as knowledge-based or misuse detection [ 41 ]. This technique can be efficient for processing a high volume of network traffic; however, it is strictly limited to known attacks. Thus, detecting new or unseen attacks is one of the biggest challenges faced by signature-based systems.
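
As a toy illustration of the signature-matching idea (a sketch, not a production IDS; the signatures below are invented examples):

```python
# Invented example signatures: byte sequences that correspond to known attacks.
SIGNATURES = {
    b"' OR '1'='1": "SQL injection probe (example)",
    b"\x90\x90\x90\x90": "NOP sled (example)",
}

def match_signatures(payload: bytes) -> list[str]:
    """Return the names of all known signatures found in a payload."""
    return [name for sig, name in SIGNATURES.items() if sig in payload]

print(match_signatures(b"GET /login?u=admin' OR '1'='1 HTTP/1.1"))
# ['SQL injection probe (example)'] -- an unseen attack would pass silently
```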

Anomaly-based IDS : The concept of anomaly-based detection overcomes the issues of signature-based IDS discussed above. In an anomaly-based intrusion detection system, the behavior of the network is first examined to find dynamic patterns and to automatically create a data-driven model profiling the normal behavior; deviations from this profile are then flagged as anomalies [ 41 ]. Thus, anomaly-based IDS can be treated as a dynamic approach that follows behavior-oriented detection. The main advantage of anomaly-based IDS is the ability to identify unknown or zero-day attacks [ 42 ]. However, the issue is that an identified anomaly or abnormal behavior is not always an indicator of intrusion; it may instead result from factors such as policy changes or the offering of a new service.
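
A minimal sketch of the anomaly-based idea, assuming numeric per-connection features and using scikit-learn's IsolationForest as the profiling model; the features, scales, and contamination rate are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Profile of normal behavior: e.g., [packets/s, mean packet size] per connection.
normal_traffic = rng.normal(loc=[50, 500], scale=[5, 50], size=(1000, 2))
model = IsolationForest(contamination=0.01, random_state=0).fit(normal_traffic)

# One typical connection and one flood-like outlier (possible novel attack).
new = np.array([[52.0, 510.0], [900.0, 60.0]])
print(model.predict(new))  # expected: [ 1 -1 ], where -1 marks a deviation
```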

In addition, a hybrid detection approach [ 43 , 44 ] that takes into account both the misuse-based and anomaly-based techniques discussed above can be used to detect intrusions. In a hybrid system, the misuse detection system is used for detecting known types of intrusions, and the anomaly detection system is used for novel attacks [ 45 ]. Besides these approaches, stateful protocol analysis can also be used to detect intrusions; it identifies deviations of protocol state similarly to the anomaly-based method, but it uses predetermined universal profiles based on accepted definitions of benign activity [ 41 ]. In Table 1 , we have summarized these common approaches, highlighting their pros and cons. Once detection has been completed, an intrusion prevention system (IPS), which is intended to prevent malicious events, can be used to mitigate the risks in different ways, such as manually, by providing notification, or through an automatic process [ 46 ]. Among these approaches, an automatic response system could be more effective, as it does not involve a human interface between the detection and response systems.
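
Combining the two sketches above, a hybrid detector might consult the signature set first for known attacks and fall back to the anomaly model for everything else; this is purely illustrative glue code, with `signature_fn` and `anomaly_model` standing in for the components shown earlier:

```python
def hybrid_verdict(payload, features, signature_fn, anomaly_model):
    """Misuse detection first (known attacks), anomaly detection second."""
    hits = signature_fn(payload)
    if hits:                                        # matches a known pattern
        return ("known-attack", hits)
    if anomaly_model.predict([features])[0] == -1:  # deviates from the profile
        return ("anomaly", [])                      # possibly a novel attack
    return ("benign", [])
```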

Data science

We are living in the age of data, advanced analytics, and data science, which are related to data-driven intelligent decision making. Although the process of searching for patterns or discovering hidden and interesting knowledge from data is known as data mining [ 47 ], in this paper we use the broader term "data science" rather than data mining. The reason is that data science, in its most fundamental form, is all about understanding data. It involves studying, processing, and extracting valuable insights from a set of information. In addition to data mining, data analytics is also related to data science. The development of data mining, knowledge discovery, and machine learning, which refers to creating algorithms and programs that learn on their own, together with original data analysis and descriptive analytics from the statistical perspective, forms the general concept of "data analytics" [ 47 ]. Nowadays, many researchers use the term "data science" to describe the interdisciplinary field of data collection, preprocessing, inference, and decision making by analyzing data. To understand and analyze actual phenomena with data, various scientific methods, machine learning techniques, processes, and systems are used, which is commonly known as data science. According to Cao et al. [ 47 ], "data science is a new interdisciplinary field that synthesizes and builds on statistics, informatics, computing, communication, management, and sociology to study data and its environments, to transform data to insights and decisions by following a data-to-knowledge-to-wisdom thinking and methodology". As a high-level statement in the context of cybersecurity, we can conclude that it is the study of security data to provide data-driven solutions for given security problems, also known as "the science of cybersecurity data". Figure 2 shows the typical data-to-insight-to-decision transfer at different periods and the general analytic stages in data science, in terms of a variety of analytics goals (G) and approaches (A) to achieve the data-to-decision goal [ 47 ].

Figure 2: Data-to-insight-to-decision analytic stages in data science [ 47 ]

Based on the analytic power of data science, including machine learning techniques, it can be a viable component of security strategies. By using data science techniques, security analysts can manipulate and analyze security data more effectively and efficiently, uncovering valuable insights from the data. Thus, data science methodologies, including machine learning techniques, can be well utilized in the context of cybersecurity in terms of problem understanding, gathering security data from diverse sources, preparing data to feed into a model, and data-driven model building and updating, for providing smart security services. This motivates us to define cybersecurity data science and to work in this research area.

Cybersecurity data science

In this section, we briefly discuss cybersecurity data science, including various categories of cyber incident data and their usage in different application areas, as well as the key terms and areas related to our study.

Understanding cybersecurity data

Data science is largely driven by the availability of data [ 48 ]. Datasets typically represent a collection of information records that consist of several attributes or features and related facts, on which cybersecurity data science is based. Thus, it is important to understand the nature of cybersecurity data containing various types of cyber-attacks and relevant features. The reason is that raw security data collected from relevant cyber sources can be used to analyze the various patterns of security incidents or malicious behavior, and to build a data-driven security model to achieve our goal. Several datasets exist in the area of cybersecurity, covering intrusion analysis, malware analysis, and anomaly, fraud, or spam analysis, and they are used for various purposes. In Table 2 , we summarize several such datasets, including their various features and attacks, that are accessible on the Internet, and highlight their usage based on machine learning techniques in different cyber applications. Effectively analyzing and processing these security features, building a target machine learning-based security model according to the requirements, and, eventually, data-driven decision making could play a role in providing intelligent cybersecurity services, as discussed briefly in " A multi-layered framework for smart cybersecurity services " section.

Defining cybersecurity data science

Data science is transforming the world's industries. It is critically important for the future of intelligent cybersecurity systems and services because "security is all about data". When we seek to detect cyber threats, we analyze security data in the form of files, logs, network packets, or other relevant sources. Traditionally, security professionals did not use data science techniques to make detections based on these data sources. Instead, they used file hashes, custom-written rules like signatures, or manually defined heuristics [ 21 ]. Although these techniques have their own merits in several cases, they need too much manual work to keep up with the changing cyber threat landscape. By contrast, data science can bring about a massive shift in technology and its operations, where machine learning algorithms can be used to learn or extract insights about security incident patterns from training data for their detection and prevention. For instance, these techniques can be used to detect malware or suspicious trends, or to extract policy rules.

In recent days, the entire security industry is moving towards data science because of its capability to transform raw data into decision making. To this end, several data-driven tasks can be associated: (i) data engineering, focusing on practical applications of data gathering and analysis; (ii) reducing data volume, which deals with filtering significant and relevant data for further analysis; (iii) discovery and detection, which focuses on extracting insights, incident patterns, or knowledge from data; (iv) automated models, which focus on building data-driven intelligent security models; (v) targeted security alerts, focusing on the generation of remarkable security alerts based on discovered knowledge while minimizing false alerts; and (vi) resource optimization, which deals with allocating the available resources to achieve the target goals in a security system. While making data-driven decisions, behavioral analysis could also play a significant role in the domain of cybersecurity [ 81 ].

Thus, the concept of cybersecurity data science incorporates the methods and techniques of data science and machine learning as well as the behavioral analytics of various security incidents. The combination of these technologies has given birth to the term "cybersecurity data science", which refers to collecting a large amount of security event data from different sources and analyzing it using machine learning technologies to detect security risks or attacks, either through the discovery of useful insights or of the latest data-driven patterns. It is, however, worth remembering that cybersecurity data science is not just a collection of machine learning algorithms; rather, it is a process that can help security professionals or analysts to scale and automate their security activities in a smart and timely manner. Therefore, the formal definition can be as follows: "Cybersecurity data science is a research or working area existing at the intersection of cybersecurity, data science, and machine learning or artificial intelligence, which is mainly security data-focused, applies machine learning methods, attempts to quantify cyber-risks or incidents, and promotes inferential techniques to analyze behavioral patterns in security data. It also focuses on generating security response alerts, and eventually seeks to optimize cybersecurity solutions, to build automated and intelligent cybersecurity systems."

Table 3 highlights some key terms associated with cybersecurity data science. Overall, the outputs of cybersecurity data science are typically security data products, which can be a data-driven security model, policy rule discovery, risk or attack prediction, a potential security service and recommendation, or a corresponding security system, depending on the given security problem in the domain of cybersecurity. In the next section, we briefly discuss various machine learning tasks with examples within the scope of our study.

Machine learning tasks in cybersecurity

Machine learning (ML) is typically considered a branch of "Artificial Intelligence" that is closely related to computational statistics, data mining and analytics, and data science, particularly focusing on making computers learn from data [ 82 , 83 ]. Thus, machine learning models typically comprise a set of rules, methods, or complex "transfer functions" that can be applied to find interesting data patterns, or to recognize or predict behavior [ 84 ], which could play an important role in the area of cybersecurity. In the following, we discuss the different methods that can be used to solve machine learning tasks and how they relate to cybersecurity tasks.

Supervised learning

Supervised learning is performed when specific targets are defined to be reached from a certain set of inputs, i.e., a task-driven approach. In the area of machine learning, the most popular supervised learning techniques are known as classification and regression methods [ 129 ]. These techniques are popular for classifying or predicting the future for a particular security problem. For instance, to predict a denial-of-service attack (yes, no) or to identify different classes of network attacks such as scanning and spoofing, classification techniques can be used in the cybersecurity domain. ZeroR [ 83 ], OneR [ 130 ], Naive Bayes [ 131 ], Decision Tree [ 132 , 133 ], K-nearest neighbors [ 134 ], support vector machines [ 135 ], adaptive boosting [ 136 ], and logistic regression [ 137 ] are well-known classification techniques. In addition, Sarker et al. have recently proposed the BehavDT [ 133 ] and IntruDtree [ 106 ] classification techniques, which are able to effectively build data-driven predictive models. On the other hand, to predict a continuous or numeric value, e.g., the total number of phishing attacks in a certain period or network packet parameters, regression techniques are useful. Regression analyses can also be used to detect the root causes of cybercrime and other types of fraud [ 138 ]. Linear regression [ 82 ] and support vector regression [ 135 ] are popular regression techniques. The main difference between classification and regression is that the output variable in regression is numerical or continuous, while the predicted output for classification is categorical or discrete. Ensemble learning is an extension of supervised learning that mixes different simple models, e.g., Random Forest learning [ 139 ], which generates multiple decision trees to solve a particular security task.
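
As a minimal illustration of the classification case, the following sketch trains a decision tree on a small synthetic set of labeled connection records; the feature names and data are invented for demonstration only:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

rng = np.random.default_rng(1)
# Invented connection records: [duration_s, bytes_sent, failed_logins].
benign = rng.normal([30, 2_000, 0.2], [10, 500, 0.4], size=(500, 3))
attack = rng.normal([2, 50_000, 4.0], [1, 8_000, 1.0], size=(500, 3))
X = np.vstack([benign, attack])
y = np.array([0] * 500 + [1] * 500)  # 0 = benign, 1 = attack

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
clf = DecisionTreeClassifier(max_depth=4, random_state=1).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te),
                            target_names=["benign", "attack"]))
```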

Unsupervised learning

In unsupervised learning problems, the main task is to find patterns, structures, or knowledge in unlabeled data, i.e., a data-driven approach [ 140 ]. In the area of cybersecurity, cyber-attacks like malware stay hidden in various ways, including changing their behavior dynamically and autonomously to avoid detection. Clustering techniques, a type of unsupervised learning, can help to uncover the hidden patterns and structures in datasets and to identify indicators of such sophisticated attacks. Similarly, clustering techniques can be useful in identifying anomalies and policy violations, and in detecting and eliminating noisy instances in data. K-means [ 141 ] and K-medoids [ 142 ] are popular partitioning clustering algorithms, and single linkage [ 143 ] and complete linkage [ 144 ] are well-known hierarchical clustering algorithms used in various application domains. Moreover, a bottom-up clustering approach proposed by Sarker et al. [ 145 ] can also be used, taking into account the data characteristics.
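
A short clustering sketch under the same caveat (synthetic, invented features): K-means groups unlabeled flow records so that a small, unusual cluster, here resembling port-scanning behavior, stands out for inspection:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
# Unlabeled flows: [packets/s, distinct destination ports contacted].
flows = np.vstack([
    rng.normal([40, 3], [5, 1], size=(480, 2)),      # ordinary traffic
    rng.normal([300, 200], [30, 20], size=(20, 2)),  # scanning-like group
])

labels = KMeans(n_clusters=2, n_init=10, random_state=2).fit_predict(
    StandardScaler().fit_transform(flows))
print(np.bincount(labels))  # the small cluster is a candidate for inspection
```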

Besides, feature engineering tasks like optimal feature selection or extraction related to a particular security problem could be useful for further analysis [ 106 ]. Recently, Sarker et al. [ 106 ] proposed an approach for selecting security features according to their importance score values. Moreover, principal component analysis, linear discriminant analysis, Pearson correlation analysis, and non-negative matrix factorization are popular dimensionality reduction techniques for such issues [ 82 ]. Association rule learning is another example, where machine learning based policy rules can prevent cyber-attacks. In an expert system, the rules are usually manually defined by a knowledge engineer working in collaboration with a domain expert [ 37 , 140 , 146 ]. Association rule learning, on the contrary, is the discovery of rules or relationships among a set of available security features or attributes in a given dataset [ 147 ]. To quantify the strength of relationships, correlation analysis can be used [ 138 ]. Many association rule mining algorithms have been proposed in the machine learning and data mining literature, such as logic-based [ 148 ], frequent pattern based [ 149 , 150 , 151 ], and tree-based [ 152 ] algorithms. Recently, Sarker et al. [ 153 ] proposed an association rule learning approach considering non-redundant generation, which can be used to discover a set of useful security policy rules. Moreover, AIS [ 147 ], Apriori [ 149 ], Apriori-TID and Apriori-Hybrid [ 149 ], FP-Tree [ 152 ], RARM [ 154 ], and Eclat [ 155 ] are well-known association rule learning algorithms capable of solving such problems by generating a set of policy rules in the domain of cybersecurity.
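
In the spirit of importance-score-based feature selection (a generic sketch, not the IntruDtree algorithm itself), one can rank candidate security features by a tree ensemble's importance scores and keep only the top-ranked ones; the feature names and data are invented:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 5))               # five candidate security features
y = (X[:, 1] + 2 * X[:, 3] > 0).astype(int)  # labels driven by features 1 and 3

names = ["duration", "src_bytes", "urgent", "failed_logins", "num_root"]
forest = RandomForestClassifier(n_estimators=100, random_state=3).fit(X, y)
for name, score in sorted(zip(names, forest.feature_importances_),
                          key=lambda pair: -pair[1]):
    print(f"{name:>14s}: {score:.3f}")       # keep only the top-ranked features
```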

Neural networks and deep learning

Deep learning is a part of machine learning in the area of artificial intelligence; it is a computational model inspired by the biological neural networks in the human brain [ 82 ]. Artificial Neural Networks (ANNs) are frequently used in deep learning, and the most popular neural network algorithm is backpropagation [ 82 ]. It performs learning on a multi-layer feed-forward neural network consisting of an input layer, one or more hidden layers, and an output layer. The main difference between deep learning and classical machine learning is their performance as the amount of security data increases. Deep learning algorithms typically perform well when the data volumes are large, whereas machine learning algorithms perform comparatively better on small datasets [ 44 ]. In our earlier work, Sarker et al. [ 129 ], we illustrated the effectiveness of these approaches considering contextual datasets. Deep learning approaches mimic the human brain mechanism to interpret large amounts of data, or complex data such as images, sounds, and texts [ 44 , 129 ]. In terms of feature extraction for building models, deep learning reduces the effort of designing a feature extractor for each problem compared to classical machine learning techniques. Besides these characteristics, deep learning typically takes much longer to train than a classical machine learning algorithm; however, the test time is exactly the opposite [ 44 ]. Thus, deep learning relies more on high-performance machines with GPUs than classical machine learning algorithms do [ 44 , 156 ]. The most popular deep neural network learning models include the multi-layer perceptron (MLP) [ 157 ], convolutional neural network (CNN) [ 158 ], and recurrent neural network (RNN) or long short-term memory (LSTM) network [ 121 , 158 ]. In recent days, researchers use these deep learning techniques for different purposes, such as detecting network intrusions and malware traffic detection and classification, in the domain of cybersecurity [ 44 , 159 ].
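
A minimal multi-layer perceptron sketch using scikit-learn's small-scale MLP implementation (a GPU framework would be preferred for large security datasets); the architecture and synthetic data are illustrative assumptions:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0.0, 1.0, size=(500, 10)),   # benign feature vectors
               rng.normal(1.5, 1.0, size=(500, 10))])  # attack feature vectors
y = np.array([0] * 500 + [1] * 500)

# Multi-layer feed-forward network: input -> two hidden layers -> output.
scaler = StandardScaler().fit(X)
mlp = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=4)
mlp.fit(scaler.transform(X), y)
print(f"training accuracy: {mlp.score(scaler.transform(X), y):.3f}")
```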

Other learning techniques

Semi-supervised learning can be described as a hybridization of the supervised and unsupervised techniques discussed above, as it works on both labeled and unlabeled data. In the area of cybersecurity, it could be useful when data needs to be labeled automatically without human intervention, in order to improve the performance of cybersecurity models. Reinforcement learning is another type of machine learning, characterized by an agent creating its own learning experiences through interacting directly with the environment, i.e., an environment-driven approach, where the environment is typically formulated as a Markov decision process and decisions are made based on a reward function [ 160 ]. Monte Carlo learning, Q-learning, and Deep Q-Networks are the most common reinforcement learning algorithms [ 161 ]. For instance, in a recent work [ 126 ], the authors present an approach for detecting botnet traffic or malicious cyber activities using reinforcement learning combined with a neural network classifier. In another work [ 128 ], the authors discuss the application of deep reinforcement learning to intrusion detection for supervised problems, where they obtained the best results with the Deep Q-Network algorithm. In the context of cybersecurity, genetic algorithms, which use fitness, selection, crossover, and mutation to find optima, could also be used to solve a similar class of learning problems [ 119 ].
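
As a toy sketch of the reinforcement idea, the following tabular Q-learning loop, reduced to a one-step (bandit-style) setting with an invented state space and reward function, learns to block suspicious events; real security environments are far richer than this:

```python
import random

random.seed(5)
STATES = ["benign", "suspicious"]   # invented, highly simplified state space
ACTIONS = ["allow", "block"]
Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
alpha, eps = 0.1, 0.1               # learning rate and exploration rate

def reward(state, action):
    """Toy reward: +1 for the right call, -1 otherwise."""
    right = "block" if state == "suspicious" else "allow"
    return 1.0 if action == right else -1.0

for _ in range(5000):
    s = random.choice(STATES)
    a = random.choice(ACTIONS) if random.random() < eps \
        else max(ACTIONS, key=lambda x: Q[(s, x)])
    # Episodes end after one step here, so the update target is just the reward.
    Q[(s, a)] += alpha * (reward(s, a) - Q[(s, a)])

print(max(ACTIONS, key=lambda a: Q[("suspicious", a)]))  # expected: 'block'
```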

The various types of machine learning techniques discussed above can be useful in the domain of cybersecurity for building an effective security model. In Table 4 , we have summarized several machine learning techniques that are used to build various types of security models for various purposes. Although these models typically represent a learning-based security model, in this paper we aim to focus on a comprehensive cybersecurity data science model and relevant issues, in order to build a data-driven intelligent security system. In the next section, we highlight several research issues and potential solutions in the area of cybersecurity data science.

Research issues and future directions

Our study opens up several research issues and challenges in the area of cybersecurity data science concerning how to extract insights from relevant data towards data-driven intelligent decision making for cybersecurity solutions. In the following, we summarize these challenges, ranging from data collection to decision making.

Cybersecurity datasets : Source datasets are the primary component for working in the area of cybersecurity data science. Most of the existing datasets are old and might be insufficient for understanding the recent behavioral patterns of various cyber-attacks. Although the data can be transformed into a meaningful level of understanding after performing several processing tasks, there is still a lack of understanding of the characteristics of recent attacks and the patterns in which they occur. Thus, further processing or machine learning algorithms may provide a low accuracy rate for the target decisions. Therefore, establishing a large number of recent datasets for a particular problem domain, like cyber risk prediction or intrusion detection, is needed, which could be one of the major challenges in cybersecurity data science.

Handling quality problems in cybersecurity datasets : Cyber datasets might be noisy, incomplete, insignificant, or imbalanced, or may contain inconsistent instances related to a particular security incident. Such problems in a dataset may affect the quality of the learning process and degrade the performance of machine learning-based models [ 162 ]. To make data-driven intelligent decisions for cybersecurity solutions, such problems in the data need to be dealt with effectively before building the cyber models. Therefore, understanding such problems in cyber data and effectively handling them using existing or newly proposed algorithms for a particular problem domain, like malware analysis or intrusion detection and prevention, is needed, which could be another research issue in cybersecurity data science.

Security policy rule generation : Security policy rules reference security zones and enable a user to allow, restrict, and track traffic on the network based on the corresponding user or user group, service, or application. The policy rules, including both general and more specific rules, are compared against the incoming traffic in sequence during execution, and the rule that matches the traffic is applied. The policy rules used in most cybersecurity systems are static, generated by human expertise, or ontology-based [ 163 , 164 ]. Although association rule learning techniques produce rules from data, there is a problem of redundant rule generation [ 153 ] that makes the policy rule-set complex. Therefore, understanding such problems in policy rule generation and effectively handling them using existing or newly proposed algorithms for a particular problem domain, like access control [ 165 ], is needed, which could be another research issue in cybersecurity data science.

Hybrid learning method : Most commercial products in the cybersecurity domain contain signature-based intrusion detection techniques [ 41 ]. However, missing features or insufficient profiling can cause these techniques to miss unknown attacks. In that case, anomaly-based detection techniques, or a hybrid technique combining signature-based and anomaly-based detection, can be used to overcome such issues. A hybrid technique combining multiple learning techniques, or a combination of deep learning and classical machine learning methods, can be used to extract the target insights for a particular problem domain, like intrusion detection, malware analysis, or access control, and to make intelligent decisions for the corresponding cybersecurity solutions.

Protecting valuable security information : Another issue of a cyber data attack is the loss of extremely valuable data and information, which could be damaging for an organization. With the use of encryption or highly complex signatures, one can stop others from probing into a dataset. In such cases, cybersecurity data science can be used to build a data-driven impenetrable protocol to protect such security information. To achieve this goal, cyber analysts can develop algorithms that analyze the history of cyber-attacks to detect the most frequently targeted chunks of data. Thus, understanding such data protection problems and designing corresponding algorithms to effectively handle them could be another research issue in the area of cybersecurity data science.

Context-awareness in cybersecurity : Existing cybersecurity work mainly relies on relevant cyber data containing several low-level features. When data mining and machine learning techniques are applied to such datasets, related patterns can be identified that describe them properly. However, broader contextual information [ 140 , 145 , 166 ], such as temporal and spatial context, relationships among events or connections, and dependencies, can be used to decide whether a suspicious activity exists or not. For instance, some approaches may consider individual connections as DoS attacks, while security experts might not treat them as malicious by themselves. Thus, a significant limitation of existing cybersecurity work is the lack of contextual information for predicting risks or attacks. Therefore, context-aware adaptive cybersecurity solutions could be another research issue in cybersecurity data science.

Feature engineering in cybersecurity : The efficiency and effectiveness of a machine learning-based security model have always been a major challenge due to the high volume of network data with a large number of traffic features. The large dimensionality of the data has been addressed using several techniques such as principal component analysis (PCA) [ 167 ] and singular value decomposition (SVD) [ 168 ]. In addition to the low-level features in the datasets, the contextual relationships between suspicious activities might be relevant. Such contextual data can be stored in an ontology or taxonomy for further processing. Thus, how to effectively select the optimal features, or extract the significant features, considering both the low-level and the contextual features, for effective cybersecurity solutions could be another research issue in cybersecurity data science.

Remarkable security alert generation and prioritization : In many cases, the cybersecurity system may not be well defined and may cause a substantial number of false alarms, which are unexpected in an intelligent system. For instance, an IDS deployed in a real-world network generates around nine million alerts per day [ 169 ]. A network-based intrusion detection system typically examines incoming traffic for matches against associated patterns to detect risks, threats, or vulnerabilities and to generate security alerts. However, responding to every such alert might not be effective, as it consumes relatively huge amounts of time and resources and may consequently result in a self-inflicted DoS. To overcome this problem, high-level alert management is required that correlates security alerts considering the current context and their logical relationships, including their prioritization, before reporting them to users, which could be another research issue in cybersecurity data science.
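
As a sketch of such alert management, the following groups alerts from the same source within a time window and ranks the groups by peak severity and volume; the record fields and window length are invented assumptions:

```python
from collections import defaultdict

# Invented alert records; a real system would ingest these from an IDS.
alerts = [
    {"src": "10.0.0.5", "ts": 100, "severity": 2},
    {"src": "10.0.0.5", "ts": 130, "severity": 3},
    {"src": "10.0.0.9", "ts": 140, "severity": 1},
    {"src": "10.0.0.5", "ts": 150, "severity": 5},
]

WINDOW = 300  # seconds; same-source alerts within a window are correlated
groups = defaultdict(list)
for a in alerts:
    groups[(a["src"], a["ts"] // WINDOW)].append(a)

# Rank correlated groups by peak severity, then by volume.
ranked = sorted(groups.items(),
                key=lambda kv: (max(a["severity"] for a in kv[1]), len(kv[1])),
                reverse=True)
for (src, _), grp in ranked:
    print(src, "alerts:", len(grp),
          "peak severity:", max(a["severity"] for a in grp))
```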

Recency analysis in cybersecurity solutions : Machine learning-based security models typically use a large amount of static data to generate data-driven decisions. Anomaly detection systems rely on constructing such a model considering normal behavior and anomalies, according to their patterns. However, normal behavior in a large and dynamic security system is not well defined, and it may change over time, which can be considered as incremental growth of the dataset. The patterns in incremental datasets may change in several cases. This often results in a substantial number of false alarms, known as false positives. Thus, a recent malicious behavioral pattern is more likely to be interesting and significant than older ones for predicting unknown attacks. Therefore, effectively using the concept of recency analysis [ 170 ] in cybersecurity solutions could be another issue in cybersecurity data science.
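
One simple way to act on recency, shown as a sketch: weight training samples by age with an exponential decay and pass the weights to an estimator's fit method via sample_weight (supported by many scikit-learn estimators); the 90-day half-life is an invented assumption:

```python
import numpy as np

ages_days = np.array([365, 180, 30, 7, 1])  # age of each training sample
half_life = 90.0                            # invented 90-day half-life
weights = 0.5 ** (ages_days / half_life)    # newer samples weigh more
print(weights.round(3))  # pass as sample_weight=... to an estimator's fit()
```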

The most important task for an intelligent cybersecurity system is to develop an effective framework that supports data-driven decision making. In such a framework, we need to consider advanced data analysis based on machine learning techniques, so that the framework is capable of minimizing these issues and providing automated and intelligent security services. Thus, a well-designed security framework for cybersecurity data, along with its experimental evaluation, is a very important direction and a big challenge as well. In the next section, we suggest and discuss a data-driven cybersecurity framework based on machine learning techniques considering multiple processing layers.

A multi-layered framework for smart cybersecurity services

As discussed earlier, cybersecurity data science is data-focused: it applies machine learning methods, attempts to quantify cyber risks, promotes inferential techniques to analyze behavioral patterns, focuses on generating security response alerts, and ultimately seeks to optimize cybersecurity operations. Hence, we briefly discuss a multi-layered data processing framework that can be used to discover security insights from raw data and to build smart cybersecurity systems, e.g., dynamic policy rule-based access control or intrusion detection and prevention systems. Making data-driven intelligent decisions in the resulting cybersecurity system requires understanding the security problems, the nature of the corresponding security data, and how to analyze it at scale. For this purpose, our suggested framework not only uses machine learning techniques to build the security model, but also incorporates incremental learning and dynamism to keep the model up to date, together with the corresponding response generation, making the delivered services more effective and intelligent. Figure 3 shows an overview of the framework, involving several processing layers, from raw security event data to services. In the following, we briefly discuss the working procedure of the framework.

Figure 3: A generic multi-layered framework based on machine learning techniques for smart cybersecurity services

Security data collection

Collecting valuable cybersecurity data is a crucial step that links security problems in the cyberinfrastructure to the corresponding data-driven solution steps of the framework, as shown in Fig. 3. Cyber data serves as the ground truth on which the security model is built, and thus directly affects model performance. The quality and quantity of cyber data determine the feasibility and effectiveness of solving the security problem at hand. The concern, therefore, is how to collect valuable data that fits the unique needs of building data-driven security models.

The general procedure for collecting and managing security data from diverse sources depends on the particular security problem and project within the enterprise. Data sources can be classified into several broad categories such as network, host, and hybrid [ 171 ]. Within the network infrastructure, the security system can leverage different types of security data, such as IDS logs, firewall logs, network traffic data, packet data, and honeypot data, to provide the target security services. For instance, whether a given IP is malicious can be determined by analyzing the IP addresses and their cyber activities. In the domain of cybersecurity, the network source mentioned above is considered the primary security event source to analyze. The host category covers data collected from an organization's host machines, where the sources include operating system logs, database access logs, web server logs, email logs, and application logs. Collecting data from both the network and the host machines constitutes the hybrid category. Overall, in the data collection layer, network activity, database activity, application activity, and user activity are the possible security event sources in the context of cybersecurity data science.

Security data preparation

After the raw security data has been collected from the various sources relevant to the problem domain discussed above, this layer is responsible for preparing it for model building by applying the necessary processing steps. Not all of the collected data contributes to the model building process in the cybersecurity domain [ 172 ]; useless data captured, for example, by a network sniffer should therefore be removed. Moreover, the data may be noisy, contain missing or corrupted values, or comprise attributes of widely varying types and scales. Since a data-driven model learns a function that maps inputs to outputs from example input-output pairs, high-quality data is necessary for achieving high accuracy; procedures for data cleaning and for handling missing or corrupted values may thus be required. Security data features or attributes can also be of different types, such as continuous, discrete, or symbolic [ 106 ]. Beyond a solid understanding of these types and their permissible operations, the data and attributes need to be preprocessed and converted into the target types. The raw data itself can also be structured, semi-structured, or unstructured; normalization, transformation, or collation can be useful to organize it in a structured manner, and in some cases natural language processing techniques may help, depending on the data type and characteristics, e.g., for textual content. As both the quality and quantity of data determine the feasibility of solving the security problem, effective pre-processing, management, and representation of the data can play a significant role in building an effective security model for intelligent services.
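
The sketch below illustrates how such a preparation layer might be assembled with scikit-learn, combining imputation of missing values, scaling of numeric attributes, and encoding of symbolic ones; the column names are hypothetical placeholders for real security-log fields:

```python
# Minimal sketch: a data preparation pipeline for mixed-type security records.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny stand-in for raw security records; note the missing values.
raw = pd.DataFrame({
    "duration": [1.2, None, 0.3],
    "src_bytes": [512, 1024, None],
    "dst_bytes": [2048, 100, 300],
    "protocol": ["tcp", "udp", None],
    "service": ["http", "dns", "http"],
    "flag": ["SF", "S0", "SF"],
})

numeric_cols = ["duration", "src_bytes", "dst_bytes"]
symbolic_cols = ["protocol", "service", "flag"]

prepare = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("sym", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]), symbolic_cols),
])

X = prepare.fit_transform(raw.drop_duplicates())
print(X.shape)
```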

Machine learning-based security modeling

This is the core step, where insights and knowledge are extracted from data through the application of cybersecurity data science. Here we focus in particular on machine learning-based modeling, since machine learning techniques can significantly change the cybersecurity landscape. The security features or attributes, and the patterns they form in data, are what we want to discover and analyze in order to extract security insights. To achieve this goal, a deeper understanding of the data, combined with machine learning-based analytical models built on large volumes of cybersecurity data, is effective. Various machine learning tasks can thus be involved in this model building layer, depending on the solution. The first is security feature engineering, which is mainly responsible for transforming raw security data into informative features that effectively represent the underlying security problem to the data-driven models. Depending on the characteristics of the security data, this module may involve several data-processing tasks: feature transformation and normalization; feature selection, taking into account a subset of the available security features according to their correlations or importance in modeling; and feature generation and extraction, creating new features such as principal components. For instance, the chi-squared test, analysis of variance, correlation coefficient analysis, feature importance, as well as discriminant and principal component analysis or singular value decomposition, can be used to analyze the significance of the security features in these feature engineering tasks [ 82 ].
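
As a concrete illustration of the feature selection task, the sketch below ranks synthetic stand-in security features with the chi-squared test and keeps the top ten (the choice of k = 10 is an assumption):

```python
# Minimal sketch: chi-squared feature selection for security features.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(2)
X = rng.random((800, 30))             # placeholder features (chi2 needs non-negative values)
y = rng.integers(0, 2, size=800)      # 0 = benign, 1 = attack (synthetic labels)

selector = SelectKBest(chi2, k=10).fit(X, y)
top_features = np.flatnonzero(selector.get_support())
print("selected feature indices:", top_features)
```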

Another significant module is security data clustering, which uncovers hidden patterns and structures in huge volumes of security data to identify where new threats exist. It typically involves grouping security data with similar characteristics and can be used to solve several cybersecurity problems, such as detecting anomalies or policy violations. The malicious behavior or anomaly detection module is responsible for identifying deviations from known behavior; clustering-based analysis and techniques can also be used for this purpose. In the cybersecurity area, attack classification or prediction is one of the most significant modules: it builds a prediction model that classifies attacks or threats and predicts future events for a particular security problem. Predicting denial-of-service attacks, or a spam filter separating spam from other messages, are relevant examples. The association learning or policy rule generation module can serve to build an expert security system comprising IF-THEN rules that characterize attacks. Thus, for generating policy rules for a rule-based access control system, association learning can be used, as it discovers the associations or relationships among the available security features in a given security dataset. The popular machine learning algorithms in these categories are briefly discussed in the "Machine learning tasks in cybersecurity" section. The model selection or customization module decides whether to use an existing machine learning model or to customize one. Analyzing data and building models with traditional machine learning or deep learning methods can achieve acceptable results in certain cybersecurity cases. However, to improve effectiveness and efficiency, or other performance measures such as time complexity, generalization capacity, and, most importantly, the impact of the algorithm on the system's detection rate, machine learning models may need to be customized for a specific security problem. Customizing the related techniques and data can improve the performance of the resulting security model and make it better suited to the cybersecurity domain. The modules discussed above can work separately or in combination, depending on the target security problems.
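
For the attack classification or prediction module, a minimal sketch with a decision tree on synthetic stand-in data might look as follows; the feature values and class labels are hypothetical:

```python
# Minimal sketch: attack classification with a decision tree on synthetic data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 12))                          # stand-in security features
y = rng.choice(["normal", "dos", "probe"], size=1000)    # hypothetical attack classes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = DecisionTreeClassifier(max_depth=8, random_state=0).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```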

Incremental learning and dynamism

In our framework, this layer is concerned with finalizing the resulting security model by incorporating additional intelligence as needed, through further processing in several modules. For instance, the post-processing and improvement module can simplify the extracted knowledge according to particular requirements by incorporating domain-specific knowledge. Because attack classification or prediction models based on machine learning techniques rely strongly on the training data, they can hardly be generalized to other datasets, which matters for some applications. To address this limitation, the module utilizes domain knowledge, in the form of a taxonomy or ontology, to improve attack correlation in cybersecurity applications.

Another significant module, recency mining and security model updating, is responsible for keeping the security model up to date by extracting the latest data-driven security patterns. The knowledge extracted in the earlier layer is based on a static initial dataset and reflects its overall patterns. Such knowledge, however, cannot guarantee high performance in all cases, because security data arrives incrementally with recent patterns that may conflict with the existing knowledge. Applying the concept of RecencyMiner [ 170 ] to incremental security data and extracting new patterns can therefore be more effective than relying on old patterns, because recent security patterns and rules are more likely than older ones to be significant for predicting cyber risks or attacks. Rather than reprocessing the whole security dataset, recency-based dynamic updating according to the new patterns is more efficient in terms of both processing and outcome, making the resulting cybersecurity model intelligent and dynamic. Finally, the response planning and decision making module makes decisions based on the extracted insights and takes the actions necessary to protect the system against cyber-attacks, providing automated and intelligent services. These services may differ depending on the particular requirements of a given security problem.
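
One way to realize such dynamic updating without reprocessing the full history is incremental (online) learning, sketched below with scikit-learn's partial_fit interface on a simulated stream of batches. This is an illustrative mechanism only, not the RecencyMiner algorithm itself:

```python
# Minimal sketch: incremental model updating on streaming security batches.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(4)
classes = np.array([0, 1])                     # 0 = benign, 1 = malicious
model = SGDClassifier()

for _ in range(10):                            # simulated stream of daily batches
    X_batch = rng.normal(size=(200, 15))
    y_batch = rng.integers(0, 2, size=200)
    # Update the model with the newest batch only, rather than retraining on all data.
    model.partial_fit(X_batch, y_batch, classes=classes)
```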

Overall, this framework is a generic blueprint that can be used to discover useful insights from security data and to build smart cybersecurity systems addressing complex security challenges, such as intrusion detection, access control management, anomaly and fraud detection, or denial-of-service attacks, in the area of cybersecurity data science.

Although several research efforts in different directions have been devoted to cybersecurity solutions, as discussed in the "Background", "Cybersecurity data science", and "Machine learning tasks in cybersecurity" sections, this paper presents a comprehensive view of cybersecurity data science. To this end, we conducted a literature review covering cybersecurity data, various defense strategies including intrusion detection techniques, and the different types of machine learning techniques applied to cybersecurity tasks. Based on our discussion of existing work, we identified several research issues, related to security datasets, data quality problems, policy rule generation, learning methods, data protection, feature engineering, security alert generation, and recency analysis, that require further attention in the domain of cybersecurity data science.

The scope of cybersecurity data science is broad. Several data-driven tasks, such as intrusion detection and prevention, access control management, security policy generation, anomaly detection, spam filtering, fraud detection and prevention, and detection of and defense against various types of malware attacks, fall within it. Such a task-based categorization can help security professionals, including researchers and practitioners, who are interested in domain-specific aspects of security systems [ 171 ]. The output of cybersecurity data science can be used in many application areas, such as Internet of Things (IoT) security [ 173 ], network security [ 174 ], cloud security [ 175 ], mobile and web applications [ 26 ], and other relevant cyber areas. Moreover, intelligent cybersecurity solutions are important for the banking industry, the healthcare sector, and the public sector, where data breaches typically occur [ 36 , 176 ]. Data-driven security solutions could also be effective in AI-based blockchain technology, where AI processes huge volumes of security event data to extract useful insights using machine learning techniques, and the blockchain serves as a trusted platform to store such data [ 177 ].

Although this paper discusses cybersecurity data science as the path from raw security data to data-driven decision making for intelligent security solutions, it also relates to big data analytics in terms of data processing and decision making. Big data deals with datasets that are too large or complex, characterized by high volume, velocity, and variety. Big data analytics has two main parts: data management, involving data storage, and analytics [ 178 ]. The analytics part describes the process of analyzing such datasets to discover patterns, unknown correlations, rules, and other useful insights [ 179 ]. Several advanced data analysis techniques, including AI, data mining, and machine learning, can thus play an important role in processing big data by converting big problems into small ones [ 180 ]. To do this, strategies such as parallelization, divide-and-conquer, incremental learning, sampling, granular computing, and feature or instance selection can be used to make better decisions, reduce costs, and enable more efficient processing. In such cases, the concepts of cybersecurity data science, and machine learning-based modeling in particular, can support process automation and decision making for intelligent security solutions. Researchers could also consider modified algorithms or models for handling big data on parallel computing platforms like Hadoop and Storm [ 181 ].

Based on the concept of cybersecurity data science discussed in this paper, future work could build a data-driven security model for a particular security problem and empirically evaluate its effectiveness, efficiency, and usability in a real-world application domain.

Motivated by the growing significance of cybersecurity, data science, and machine learning technologies, in this paper we have discussed how cybersecurity data science supports data-driven intelligent decision making in smart cybersecurity systems and services. We have also discussed how it can impact security data, both in terms of extracting insights from security incidents and in terms of the dataset itself. We surveyed the state of the art concerning security incident data and the corresponding security services, discussed how machine learning techniques can impact the cybersecurity domain, and examined the security challenges that remain. Existing research has focused largely on traditional security solutions, with less work available on security systems based on machine learning techniques. For each common technique, we discussed relevant security research. The purpose of this article is to share an overview of the conceptualization, understanding, modeling, and thinking about cybersecurity data science.

We have further identified and discussed key issues in security analysis in order to signpost future research directions in the domain of cybersecurity data science. On this basis, we have also provided a generic multi-layered framework for a cybersecurity data science model based on machine learning techniques, in which data is gathered from diverse sources and the analytics complement the latest data-driven patterns to provide intelligent security services. The framework consists of several main phases: security data collection, data preparation, machine learning-based security modeling, and incremental learning and dynamism for smart cybersecurity systems and services. We specifically focused on extracting insights from security data, from setting up a research design to concepts for data-driven intelligent security solutions.

Overall, this paper aimed to discuss not only cybersecurity data science and relevant methods, but also their applicability to data-driven intelligent decision making in cybersecurity systems and services from a machine learning perspective. Our analysis and discussion have several implications for both security researchers and practitioners. For researchers, we have highlighted several issues and directions for future research; other areas of potential research include empirical evaluation of the suggested data-driven model and comparative analysis with other security systems. For practitioners, the multi-layered machine learning-based model can serve as a reference when designing intelligent cybersecurity systems for organizations. We believe that our study of cybersecurity data science opens a promising path and can serve as a reference guide for both academia and industry in future research and applications in the area of cybersecurity.

Availability of data and materials

Not applicable.

Abbreviations

  • ML: Machine learning
  • AI: Artificial intelligence
  • ICT: Information and communication technology
  • IoT: Internet of Things
  • DDoS: Distributed denial of service
  • IDS: Intrusion detection system
  • IPS: Intrusion prevention system
  • HIDS: Host-based intrusion detection system
  • NIDS: Network intrusion detection system
  • SIDS: Signature-based intrusion detection system
  • AIDS: Anomaly-based intrusion detection system

References

Li S, Da Xu L, Zhao S. The internet of things: a survey. Inform Syst Front. 2015;17(2):243–59.


Sun N, Zhang J, Rimba P, Gao S, Zhang LY, Xiang Y. Data-driven cybersecurity incident prediction: a survey. IEEE Commun Surv Tutor. 2018;21(2):1744–72.

McIntosh T, Jang-Jaccard J, Watters P, Susnjak T. The inadequacy of entropy-based ransomware detection. In: International conference on neural information processing. New York: Springer; 2019. p. 181–189

Alazab M, Venkatraman S, Watters P, Alazab M, et al. Zero-day malware detection based on supervised learning algorithms of api call signatures (2010)

Shaw A. Data breach: from notification to prevention using pci dss. Colum Soc Probs. 2009;43:517.

Gupta BB, Tewari A, Jain AK, Agrawal DP. Fighting against phishing attacks: state of the art and future challenges. Neural Comput Appl. 2017;28(12):3629–54.

AV-TEST Institute, Germany, https://www.av-test.org/en/statistics/malware/ . Accessed 20 Oct 2019.

IBM security report, https://www.ibm.com/security/data-breach . Accessed 20 Oct 2019.

Fischer EA. Cybersecurity issues and challenges: In brief. Congressional Research Service (2014)

Juniper research. https://www.juniperresearch.com/ . Accessed on 20 Oct 2019.

Papastergiou S, Mouratidis H, Kalogeraki E-M. Cyber security incident handling, warning and response system for the european critical information infrastructures (cybersane). In: International Conference on Engineering Applications of Neural Networks, p. 476–487 (2019). New York: Springer

Aftergood S. Cybersecurity: the cold war online. Nature. 2017;547(7661):30.

Hey AJ, Tansley S, Tolle KM, et al. The fourth paradigm: data-intensive scientific discovery. 2009.

Cukier K. Data, data everywhere: A special report on managing information, 2010.

Google Trends. https://trends.google.com/trends/ , 2019.

Anwar S, Mohamad Zain J, Zolkipli MF, Inayat Z, Khan S, Anthony B, Chang V. From intrusion detection to an intrusion response system: fundamentals, requirements, and future directions. Algorithms. 2017;10(2):39.


Mohammadi S, Mirvaziri H, Ghazizadeh-Ahsaee M, Karimipour H. Cyber intrusion detection by combined feature selection algorithm. J Inform Sec Appl. 2019;44:80–8.

Tapiador JE, Orfila A, Ribagorda A, Ramos B. Key-recovery attacks on kids, a keyed anomaly detection system. IEEE Trans Depend Sec Comput. 2013;12(3):312–25.

Tavallaee M, Stakhanova N, Ghorbani AA. Toward credible evaluation of anomaly-based intrusion-detection methods. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 40(5), 516–524 (2010)

Foroughi F, Luksch P. Data science methodology for cybersecurity projects. arXiv preprint arXiv:1803.04219 , 2018.

Saxe J, Sanders H. Malware data science: Attack detection and attribution, 2018.

Rainie L, Anderson J, Connolly J. Cyber attacks likely to increase. Digital Life in 2025. 2014.

Fischer EA. Creating a national framework for cybersecurity: an analysis of issues and options. Library of Congress, Congressional Research Service; 2005.

Craigen D, Diakun-Thibault N, Purse R. Defining cybersecurity. Technol Innov Manag Rev. 2014;4(10):13–21.

National Research Council, et al. Toward a safer and more secure cyberspace, 2007.

Jang-Jaccard J, Nepal S. A survey of emerging threats in cybersecurity. J Comput Syst Sci. 2014;80(5):973–93.


Mukkamala S, Sung A, Abraham A. Cyber security challenges: designing efficient intrusion detection systems and antivirus tools. In: Vemuri VR, editor. Enhancing computer security with smart technology. Auerbach; 2006. p. 125–63.

Bilge L, Dumitraş T. Before we knew it: an empirical study of zero-day attacks in the real world. In: Proceedings of the 2012 ACM conference on computer and communications security. ACM; 2012. p. 833–44.

Davi L, Dmitrienko A, Sadeghi A-R, Winandy M. Privilege escalation attacks on android. In: International conference on information security. New York: Springer; 2010. p. 346–60.

Jovičić B, Simić D. Common web application attack types and security using asp .net. ComSIS, 2006.

Warkentin M, Willison R. Behavioral and policy issues in information systems security: the insider threat. Eur J Inform Syst. 2009;18(2):101–5.

Kügler D. “man in the middle” attacks on bluetooth. In: International Conference on Financial Cryptography. New York: Springer; 2003, p. 149–61.

Virvilis N, Gritzalis D. The big four-what we did wrong in advanced persistent threat detection. In: 2013 International Conference on Availability, Reliability and Security. IEEE; 2013. p. 248–54.

Boyd SW, Keromytis AD. Sqlrand: Preventing sql injection attacks. In: International conference on applied cryptography and network security. New York: Springer; 2004. p. 292–302.

Sigler K. Crypto-jacking: how cyber-criminals are exploiting the crypto-currency boom. Comput Fraud Sec. 2018;2018(9):12–4.

2019 data breach investigations report, https://enterprise.verizon.com/resources/reports/dbir/ . Accessed 20 Oct 2019.

Khraisat A, Gondal I, Vamplew P, Kamruzzaman J. Survey of intrusion detection systems: techniques, datasets and challenges. Cybersecurity. 2019;2(1):20.

Johnson L. Computer incident response and forensics team management: conducting a successful incident response, 2013.

Brahmi I, Brahmi H, Yahia SB. A multi-agents intrusion detection system using ontology and clustering techniques. In: IFIP international conference on computer science and its applications. New York: Springer; 2015. p. 381–93.

Qu X, Yang L, Guo K, Ma L, Sun M, Ke M, Li M. A survey on the development of self-organizing maps for unsupervised intrusion detection. In: Mobile networks and applications. 2019;1–22.

Liao H-J, Lin C-HR, Lin Y-C, Tung K-Y. Intrusion detection system: a comprehensive review. J Netw Comput Appl. 2013;36(1):16–24.

Alazab A, Hobbs M, Abawajy J, Alazab M. Using feature selection for intrusion detection system. In: 2012 International symposium on communications and information technologies (ISCIT). IEEE; 2012. p. 296–301.

Viegas E, Santin AO, Franca A, Jasinski R, Pedroni VA, Oliveira LS. Towards an energy-efficient anomaly-based intrusion detection engine for embedded systems. IEEE Trans Comput. 2016;66(1):163–77.

Xin Y, Kong L, Liu Z, Chen Y, Li Y, Zhu H, Gao M, Hou H, Wang C. Machine learning and deep learning methods for cybersecurity. IEEE Access. 2018;6:35365–81.

Dutt I, Borah S, Maitra IK, Bhowmik K, Maity A, Das S. Real-time hybrid intrusion detection system using machine learning techniques. 2018, p. 885–94.

Ragsdale DJ, Carver C, Humphries JW, Pooch UW. Adaptation techniques for intrusion detection and intrusion response systems. In: Smc 2000 conference proceedings. 2000 IEEE international conference on systems, man and cybernetics.’cybernetics evolving to systems, humans, organizations, and their complex interactions’(cat. No. 0). IEEE; 2000. vol. 4, p. 2344–2349.

Cao L. Data science: challenges and directions. Commun ACM. 2017;60(8):59–68.

Rizk A, Elragal A. Data science: developing theoretical contributions in information systems via text analytics. J Big Data. 2020;7(1):1–26.

Lippmann RP, Fried DJ, Graf I, Haines JW, Kendall KR, McClung D, Weber D, Webster SE, Wyschogrod D, Cunningham RK, et al. Evaluating intrusion detection systems: The 1998 darpa off-line intrusion detection evaluation. In: Proceedings DARPA information survivability conference and exposition. DISCEX’00. IEEE; 2000. vol. 2, p. 12–26.

Kdd cup 99. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html . Accessed 20 Oct 2019.

Tavallaee M, Bagheri E, Lu W, Ghorbani AA. A detailed analysis of the kdd cup 99 data set. In: 2009 IEEE symposium on computational intelligence for security and defense applications. IEEE; 2009. p. 1–6.

Caida ddos attack 2007 dataset. http://www.caida.org/data/passive/ddos-20070804-dataset.xml/ . Accessed 20 Oct 2019.

Caida anonymized internet traces 2008 dataset. https://www.caida.org/data/passive/passive-2008-dataset . Accessed 20 Oct 2019.

Isot botnet dataset. https://www.uvic.ca/engineering/ece/isot/datasets/index.php/ . Accessed 20 Oct 2019.

The honeynet project. http://www.honeynet.org/chapters/france/ . Accessed 20 Oct 2019.

Canadian institute of cybersecurity, university of new brunswick, iscx dataset, http://www.unb.ca/cic/datasets/index.html/ . Accessed 20 Oct 2019.

Shiravi A, Shiravi H, Tavallaee M, Ghorbani AA. Toward developing a systematic approach to generate benchmark datasets for intrusion detection. Comput Secur. 2012;31(3):357–74.

The ctu-13 dataset. https://stratosphereips.org/category/datasets-ctu13 . Accessed 20 Oct 2019.

Moustafa N, Slay J. Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set). In: 2015 Military Communications and Information Systems Conference (MilCIS). IEEE; 2015. p. 1–6.

Cse-cic-ids2018 [online]. available: https://www.unb.ca/cic/datasets/ids-2018.html/ . Accessed 20 Oct 2019.

Cic-ddos2019 [online]. available: https://www.unb.ca/cic/datasets/ddos-2019.html/ . Accessed 28 Mar 2019.

Jing X, Yan Z, Jiang X, Pedrycz W. Network traffic fusion and analysis against ddos flooding attacks with a novel reversible sketch. Inform Fusion. 2019;51:100–13.

Xie M, Hu J, Yu X, Chang E. Evaluating host-based anomaly detection systems: application of the frequency-based algorithms to adfa-ld. In: International conference on network and system security. New York: Springer; 2015. p. 542–49.

Lindauer B, Glasser J, Rosen M, Wallnau KC, ExactData L. Generating test data for insider threat detectors. JoWUA. 2014;5(2):80–94.

Glasser J, Lindauer B. Bridging the gap: A pragmatic approach to generating insider threat data. In: 2013 IEEE Security and Privacy Workshops. IEEE; 2013. p. 98–104.

Enronspam. https://labs-repos.iit.demokritos.gr/skel/i-config/downloads/enron-spam/ . Accessed 20 Oct 2019.

Spamassassin. http://www.spamassassin.org/publiccorpus/ . Accessed 20 Oct 2019.

Lingspam. https://labs-repos.iit.demokritos.gr/skel/i-config/downloads/lingspampublic.tar.gz/ . Accessed 20 Oct 2019.

Alexa top sites. https://aws.amazon.com/alexa-top-sites/ . Accessed 20 Oct 2019.

Bambenek consulting—master feeds. available online: http://osint.bambenekconsulting.com/feeds/ . Accessed 20 Oct 2019.

Dgarchive. https://dgarchive.caad.fkie.fraunhofer.de/site/ . Accessed 20 Oct 2019.

Zago M, Pérez MG, Pérez GM. Umudga: A dataset for profiling algorithmically generated domain names in botnet detection. Data in Brief. 2020;105400.

Zhou Y, Jiang X. Dissecting android malware: characterization and evolution. In: 2012 IEEE Symposium on security and privacy. IEEE; 2012. p. 95–109.

Virusshare. http://virusshare.com/ . Accessed 20 Oct 2019.

Virustotal. https://virustotal.com/ . Accessed 20 Oct 2019.

Comodo. https://www.comodo.com/home/internet-security/updates/vdp/database . Accessed 20 Oct 2019.

Contagio. http://contagiodump.blogspot.com/ . Accessed 20 Oct 2019.

Kumar R, Xiaosong Z, Khan RU, Kumar J, Ahad I. Effective and explainable detection of android malware based on machine learning algorithms. In: Proceedings of the 2018 international conference on computing and artificial intelligence. ACM; 2018. p. 35–40.

Microsoft malware classification (big 2015). https://arxiv.org/abs/1802.10135 . Accessed 20 Oct 2019.

Koroniotis N, Moustafa N, Sitnikova E, Turnbull B. Towards the development of realistic botnet dataset in the internet of things for network forensic analytics: bot-iot dataset. Future Gen Comput Syst. 2019;100:779–96.

McIntosh TR, Jang-Jaccard J, Watters PA. Large scale behavioral analysis of ransomware attacks. In: International conference on neural information processing. New York: Springer; 2018. p. 217–29.

Han J, Pei J, Kamber M. Data mining: concepts and techniques, 2011.

Witten IH, Frank E. Data mining: Practical machine learning tools and techniques, 2005.

Dua S, Du X. Data mining and machine learning in cybersecurity, 2016.

Kotpalliwar MV, Wajgi R. Classification of attacks using support vector machine (svm) on kddcup’99 ids database. In: 2015 Fifth international conference on communication systems and network technologies. IEEE; 2015. p. 987–90.

Pervez MS, Farid DM. Feature selection and intrusion classification in nsl-kdd cup 99 dataset employing svms. In: The 8th international conference on software, knowledge, information management and applications (SKIMA 2014). IEEE; 2014. p. 1–6.

Yan M, Liu Z. A new method of transductive svm-based network intrusion detection. In: International conference on computer and computing technologies in agriculture. New York: Springer; 2010. p. 87–95.

Li Y, Xia J, Zhang S, Yan J, Ai X, Dai K. An efficient intrusion detection system based on support vector machines and gradually feature removal method. Expert Syst Appl. 2012;39(1):424–30.

Raman MG, Somu N, Jagarapu S, Manghnani T, Selvam T, Krithivasan K, Sriram VS. An efficient intrusion detection technique based on support vector machine and improved binary gravitational search algorithm. Artificial Intelligence Review. 2019, p. 1–32.

Kokila R, Selvi ST, Govindarajan K. Ddos detection and analysis in sdn-based environment using support vector machine classifier. In: 2014 Sixth international conference on advanced computing (ICoAC). IEEE; 2014. p. 205–10.

Xie M, Hu J, Slay J. Evaluating host-based anomaly detection systems: Application of the one-class svm algorithm to adfa-ld. In: 2014 11th international conference on fuzzy systems and knowledge discovery (FSKD). IEEE; 2014. p. 978–82.

Saxena H, Richariya V. Intrusion detection in kdd99 dataset using svm-pso and feature reduction with information gain. Int J Comput Appl. 2014;98:6.

Chandrasekhar A, Raghuveer K. Confederation of fcm clustering, ann and svm techniques to implement hybrid nids using corrected kdd cup 99 dataset. In: 2014 international conference on communication and signal processing. IEEE; 2014. p. 672–76.

Shapoorifard H, Shamsinejad P. Intrusion detection using a novel hybrid method incorporating an improved knn. Int J Comput Appl. 2017;173(1):5–9.

Vishwakarma S, Sharma V, Tiwari A. An intrusion detection system using knn-aco algorithm. Int J Comput Appl. 2017;171(10):18–23.

Meng W, Li W, Kwok L-F. Design of intelligent knn-based alarm filter using knowledge-based alert verification in intrusion detection. Secur Commun Netw. 2015;8(18):3883–95.

Dada E. A hybridized svm-knn-pdapso approach to intrusion detection system. In: Proc. Fac. Seminar Ser., 2017, p. 14–21.

Sharifi AM, Amirgholipour SK, Pourebrahimi A. Intrusion detection based on joint of k-means and knn. J Converg Inform Technol. 2015;10(5):42.

Lin W-C, Ke S-W, Tsai C-F. Cann: an intrusion detection system based on combining cluster centers and nearest neighbors. Knowl Based Syst. 2015;78:13–21.

Koc L, Mazzuchi TA, Sarkani S. A network intrusion detection system based on a hidden naïve bayes multiclass classifier. Exp Syst Appl. 2012;39(18):13492–500.

Moon D, Im H, Kim I, Park JH. Dtb-ids: an intrusion detection system based on decision tree using behavior analysis for preventing apt attacks. J Supercomput. 2017;73(7):2881–95.

Ingre B, Yadav A, Soni AK. Decision tree based intrusion detection system for nsl-kdd dataset. In: International conference on information and communication technology for intelligent systems. New York: Springer; 2017. p. 207–18.

Malik AJ, Khan FA. A hybrid technique using binary particle swarm optimization and decision tree pruning for network intrusion detection. Cluster Comput. 2018;21(1):667–80.

Relan NG, Patil DR. Implementation of network intrusion detection system using variant of decision tree algorithm. In: 2015 international conference on nascent technologies in the engineering field (ICNTE). IEEE; 2015. p. 1–5.

Rai K, Devi MS, Guleria A. Decision tree based algorithm for intrusion detection. Int J Adv Netw Appl. 2016;7(4):2828.

Sarker IH, Abushark YB, Alsolami F, Khan AI. Intrudtree: a machine learning based cyber security intrusion detection model. Symmetry. 2020;12(5):754.

Puthran S, Shah K. Intrusion detection using improved decision tree algorithm with binary and quad split. In: International symposium on security in computing and communication. New York: Springer; 2016. p. 427–438.

Balogun AO, Jimoh RG. Anomaly intrusion detection using an hybrid of decision tree and k-nearest neighbor, 2015.

Azad C, Jha VK. Genetic algorithm to solve the problem of small disjunct in the decision tree based intrusion detection system. Int J Comput Netw Inform Secur. 2015;7(8):56.

Jo S, Sung H, Ahn B. A comparative study on the performance of intrusion detection using decision tree and artificial neural network models. J Korea Soc Dig Indus Inform Manag. 2015;11(4):33–45.

Zhan J, Zulkernine M, Haque A. Random-forests-based network intrusion detection systems. IEEE Trans Syst Man Cybern C. 2008;38(5):649–59.

Tajbakhsh A, Rahmati M, Mirzaei A. Intrusion detection using fuzzy association rules. Appl Soft Comput. 2009;9(2):462–9.

Mitchell R, Chen R. Behavior rule specification-based intrusion detection for safety critical medical cyber physical systems. IEEE Trans Depend Secure Comput. 2014;12(1):16–30.

Alazab M, Venkataraman S, Watters P. Towards understanding malware behaviour by the extraction of api calls. In: 2010 second cybercrime and trustworthy computing Workshop. IEEE; 2010. p. 52–59.

Yuan Y, Kaklamanos G, Hogrefe D. A novel semi-supervised adaboost technique for network anomaly detection. In: Proceedings of the 19th ACM international conference on modeling, analysis and simulation of wireless and mobile systems. ACM; 2016. p. 111–14.

Ariu D, Tronci R, Giacinto G. Hmmpayl: an intrusion detection system based on hidden markov models. Comput Secur. 2011;30(4):221–41.

Årnes A, Valeur F, Vigna G, Kemmerer RA. Using hidden markov models to evaluate the risks of intrusions. In: International workshop on recent advances in intrusion detection. New York: Springer; 2006. p. 145–64.

Hansen JV, Lowry PB, Meservy RD, McDonald DM. Genetic programming for prevention of cyberterrorism through dynamic and evolving intrusion detection. Decis Supp Syst. 2007;43(4):1362–74.

Aslahi-Shahri B, Rahmani R, Chizari M, Maralani A, Eslami M, Golkar MJ, Ebrahimi A. A hybrid method consisting of ga and svm for intrusion detection system. Neural Comput Appl. 2016;27(6):1669–76.

Alrawashdeh K, Purdy C. Toward an online anomaly intrusion detection system based on deep learning. In: 2016 15th IEEE international conference on machine learning and applications (ICMLA). IEEE; 2016. p. 195–200.

Yin C, Zhu Y, Fei J, He X. A deep learning approach for intrusion detection using recurrent neural networks. IEEE Access. 2017;5:21954–61.

Kim J, Kim J, Thu HLT, Kim H. Long short term memory recurrent neural network classifier for intrusion detection. In: 2016 international conference on platform technology and service (PlatCon). IEEE; 2016. p. 1–5.

Almiani M, AbuGhazleh A, Al-Rahayfeh A, Atiewi S, Razaque A. Deep recurrent neural network for iot intrusion detection system. Simulation Modelling Practice and Theory. 2019;102031.

Kolosnjaji B, Zarras A, Webster G, Eckert C. Deep learning for classification of malware system call sequences. In: Australasian joint conference on artificial intelligence. New York: Springer; 2016. p. 137–49.

Wang W, Zhu M, Zeng X, Ye X, Sheng Y. Malware traffic classification using convolutional neural network for representation learning. In: 2017 international conference on information networking (ICOIN). IEEE; 2017. p. 712–17.

Alauthman M, Aslam N, Al-kasassbeh M, Khan S, Al-Qerem A, Choo K-KR. An efficient reinforcement learning-based botnet detection approach. J Netw Comput Appl. 2020;150:102479.

Blanco R, Cilla JJ, Briongos S, Malagón P, Moya JM. Applying cost-sensitive classifiers with reinforcement learning to ids. In: International conference on intelligent data engineering and automated learning. New York: Springer; 2018. p. 531–38.

Lopez-Martin M, Carro B, Sanchez-Esguevillas A. Application of deep reinforcement learning to intrusion detection for supervised problems. Exp Syst Appl. 2020;141:112963.

Sarker IH, Kayes A, Watters P. Effectiveness analysis of machine learning classification models for predicting personalized context-aware smartphone usage. J Big Data. 2019;6(1):1–28.

Holte RC. Very simple classification rules perform well on most commonly used datasets. Mach Learn. 1993;11(1):63–90.

John GH, Langley P. Estimating continuous distributions in bayesian classifiers. In: Proceedings of the eleventh conference on uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc.; 1995. p. 338–45.

Quinlan JR. C4.5: Programs for machine learning. Machine Learning, 1993.

Sarker IH, Colman A, Han J, Khan AI, Abushark YB, Salah K. Behavdt: a behavioral decision tree learning to build user-centric context-aware predictive model. Mobile Networks and Applications. 2019, p. 1–11.

Aha DW, Kibler D, Albert MK. Instance-based learning algorithms. Mach Learn. 1991;6(1):37–66.

Keerthi SS, Shevade SK, Bhattacharyya C, Murthy KRK. Improvements to platt’s smo algorithm for svm classifier design. Neural Comput. 2001;13(3):637–49.

Freund Y, Schapire RE, et al: Experiments with a new boosting algorithm. In: Icml, vol. 96, p. 148–156 (1996). Citeseer

Le Cessie S, Van Houwelingen JC. Ridge estimators in logistic regression. J Royal Stat Soc C. 1992;41(1):191–201.

Watters PA, McCombie S, Layton R, Pieprzyk J. Characterising and predicting cyber attacks using the cyber attacker model profile (camp). J Money Launder Control. 2012.

Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.

Sarker IH. Context-aware rule learning from smartphone data: survey, challenges and future directions. J Big Data. 2019;6(1):95.

MacQueen J. Some methods for classification and analysis of multivariate observations. In: Fifth Berkeley symposium on mathematical statistics and probability, vol. 1, 1967.

Rokach L. A survey of clustering algorithms. In: Data Mining and Knowledge Discovery Handbook. New York: Springer; 2010. p. 269–98.

Sneath PH. The application of computers to taxonomy. J Gen Microbiol. 1957;17:1.

Sorensen T. A method of establishing groups of equal amplitude in plant sociology based on similarity of species. Biol Skr. 1948;5.

Sarker IH, Colman A, Kabir MA, Han J. Individualized time-series segmentation for mining mobile phone user behavior. Comput J. 2018;61(3):349–68.

Kim G, Lee S, Kim S. A novel hybrid intrusion detection method integrating anomaly detection with misuse detection. Exp Syst Appl. 2014;41(4):1690–700.


Agrawal R, Imieliński T, Swami A. Mining association rules between sets of items in large databases. In: ACM SIGMOD Record. ACM; 1993. vol. 22, p. 207–16.

Flach PA, Lachiche N. Confirmation-guided discovery of first-order rules with tertius. Mach Learn. 2001;42(1–2):61–95.

Agrawal R, Srikant R, et al: Fast algorithms for mining association rules. In: Proc. 20th Int. Conf. Very Large Data Bases, VLDB, 1994, vol. 1215, p. 487–99.

Houtsma M, Swami A. Set-oriented mining for association rules in relational databases. In: Proceedings of the eleventh international conference on data engineering. IEEE; 1995. p. 25–33.

Liu B, Hsu W, Ma Y. Integrating classification and association rule mining. In: Proceedings of the fourth international conference on knowledge discovery and data mining, 1998.

Han J, Pei J, Yin Y. Mining frequent patterns without candidate generation. In: ACM Sigmod Record. ACM; 2000. vol. 29, p. 1–12.

Sarker IH, Salim FD. Mining user behavioral rules from smartphone data through association analysis. In: Proceedings of the 22nd Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Melbourne, Australia. New York: Springer; 2018. p. 450–61.

Das A, Ng W-K, Woon Y-K. Rapid association rule mining. In: Proceedings of the tenth international conference on information and knowledge management. ACM; 2001. p. 474–81.

Zaki MJ. Scalable algorithms for association mining. IEEE Trans Knowl Data Eng. 2000;12(3):372–90.

Coelho IM, Coelho VN, Luz EJS, Ochi LS, Guimarães FG, Rios E. A gpu deep learning metaheuristic based model for time series forecasting. Appl Energy. 2017;201:412–8.

Van Efferen L, Ali-Eldin AM. A multi-layer perceptron approach for flow-based anomaly detection. In: 2017 International symposium on networks, computers and communications (ISNCC). IEEE; 2017. p. 1–6.

Liu H, Lang B, Liu M, Yan H. Cnn and rnn based payload classification methods for attack detection. Knowl Based Syst. 2019;163:332–41.

Berman DS, Buczak AL, Chavis JS, Corbett CL. A survey of deep learning methods for cyber security. Information. 2019;10(4):122.

Bellman R. A markovian decision process. J Math Mech. 1957;1:679–84.

Kaelbling LP, Littman ML, Moore AW. Reinforcement learning: a survey. J Artif Intell Res. 1996;4:237–85.

Sarker IH. A machine learning based robust prediction model for real-life mobile phone data. Internet of Things. 2019;5:180–93.

Kayes ASM, Han J, Colman A. OntCAAC: an ontology-based approach to context-aware access control for software services. Comput J. 2015;58(11):3000–34.

Kayes ASM, Rahayu W, Dillon T. An ontology-based approach to dynamic contextual role for pervasive access control. In: AINA 2018. IEEE Computer Society, 2018.

Colombo P, Ferrari E. Access control technologies for big data management systems: literature review and future trends. Cybersecurity. 2019;2(1):1–13.

Aleroud A, Karabatis G. Contextual information fusion for intrusion detection: a survey and taxonomy. Knowl Inform Syst. 2017;52(3):563–619.

Sarker IH, Abushark YB, Khan AI. Contextpca: Predicting context-aware smartphone apps usage based on machine learning techniques. Symmetry. 2020;12(4):499.

Madsen RE, Hansen LK, Winther O. Singular value decomposition and principal component analysis. Neural Netw. 2004;1:1–5.

Qiao L-B, Zhang B-F, Lai Z-Q, Su J-S. Mining of attack models in ids alerts from network backbone by a two-stage clustering method. In: 2012 IEEE 26th international parallel and distributed processing symposium workshops & Phd Forum. IEEE; 2012. p. 1263–9.

Sarker IH, Colman A, Han J. Recencyminer: mining recency-based personalized behavior from contextual smartphone data. J Big Data. 2019;6(1):49.

Ullah F, Babar MA. Architectural tactics for big data cybersecurity analytics systems: a review. J Syst Softw. 2019;151:81–118.

Zhao S, Leftwich K, Owens M, Magrone F, Schonemann J, Anderson B, Medhi D. I-can-mama: Integrated campus network monitoring and management. In: 2014 IEEE network operations and management symposium (NOMS). IEEE; 2014. p. 1–7.

Abomhara M, et al. Cyber security and the internet of things: vulnerabilities, threats, intruders and attacks. J Cyber Secur Mob. 2015;4(1):65–88.

Helali RGM. Data mining based network intrusion detection system: A survey. In: Novel algorithms and techniques in telecommunications and networking. New York: Springer; 2010. p. 501–505.

Ryoo J, Rizvi S, Aiken W, Kissell J. Cloud security auditing: challenges and emerging approaches. IEEE Secur Priv. 2013;12(6):68–74.

Densham B. Three cyber-security strategies to mitigate the impact of a data breach. Netw Secur. 2015;2015(1):5–8.

Salah K, Rehman MHU, Nizamuddin N, Al-Fuqaha A. Blockchain for ai: review and open research challenges. IEEE Access. 2019;7:10127–49.

Gandomi A, Haider M. Beyond the hype: big data concepts, methods, and analytics. Int J Inform Manag. 2015;35(2):137–44.

Golchha N. Big data-the information revolution. Int J Adv Res. 2015;1(12):791–4.

Hariri RH, Fredericks EM, Bowers KM. Uncertainty in big data analytics: survey, opportunities, and challenges. J Big Data. 2019;6(1):44.

Tsai C-W, Lai C-F, Chao H-C, Vasilakos AV. Big data analytics: a survey. J Big data. 2015;2(1):21.


Acknowledgements

The authors would like to thank all the reviewers for their rigorous reviews and comments over several revision rounds. Their detailed and helpful feedback improved and finalized the manuscript, and the authors are highly grateful to them.

Author information

Authors and affiliations

Swinburne University of Technology, Melbourne, VIC, 3122, Australia

Iqbal H. Sarker

Chittagong University of Engineering and Technology, Chittagong, 4349, Bangladesh

Iqbal H. Sarker

La Trobe University, Melbourne, VIC, 3086, Australia

A. S. M. Kayes, Paul Watters & Alex Ng

University of Nevada, Reno, USA

Shahriar Badsha

Macquarie University, Sydney, NSW, 2109, Australia

Hamed Alqahtani


Contributions

This article provides not only a discussion of cybersecurity data science and relevant methods, but also of their applicability to data-driven intelligent decision making in cybersecurity systems and services. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Iqbal H. Sarker .

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article

Sarker, I.H., Kayes, A.S.M., Badsha, S. et al. Cybersecurity data science: an overview from machine learning perspective. J Big Data 7 , 41 (2020). https://doi.org/10.1186/s40537-020-00318-5


Received : 26 October 2019

Accepted : 21 June 2020

Published : 01 July 2020

DOI : https://doi.org/10.1186/s40537-020-00318-5


Keywords

  • Decision making
  • Cyber-attack
  • Security modeling
  • Intrusion detection
  • Cyber threat intelligence


  • Open access
  • Published: 17 May 2023

A holistic and proactive approach to forecasting cyber threats

  • Zaid Almahmoud 1 ,
  • Paul D. Yoo 1 ,
  • Omar Alhussein 2 ,
  • Ilyas Farhat 3 &
  • Ernesto Damiani 4 , 5  

Scientific Reports, volume 13, Article number: 8049 (2023)

  • Computer science
  • Information technology

Traditionally, cyber-attack detection relies on reactive, assistive techniques, where pattern-matching algorithms help human experts to scan system logs and network traffic for known virus or malware signatures. Recent research has introduced effective Machine Learning (ML) models for cyber-attack detection, promising to automate the task of detecting, tracking and blocking malware and intruders. Much less effort has been devoted to cyber-attack prediction, especially beyond the short-term time scale of hours and days. Approaches that can forecast attacks likely to happen in the longer term are desirable, as this gives defenders more time to develop and share defensive actions and tools. Today, long-term predictions of attack waves are mostly based on the subjective perceptiveness of experienced human experts, which can be impaired by the scarcity of cyber-security expertise. This paper introduces a novel ML-based approach that leverages unstructured big data and logs to forecast the trend of cyber-attacks at a large scale, years in advance. To this end, we put forward a framework that utilises a monthly dataset of major cyber incidents in 36 countries over the past 11 years, with new features extracted from three major categories of big data sources, namely the scientific research literature, news, blogs, and tweets. Our framework not only identifies future attack trends in an automated fashion, but also generates a threat cycle that drills down into five key phases that constitute the life cycle of all 42 known cyber threats.

Introduction

Running a global technology infrastructure in an increasingly de-globalised world raises unprecedented security issues. In the past decade, we have witnessed waves of cyber-attacks that caused major damage to governments, organisations and enterprises, affecting their bottom lines 1 . Nevertheless, cyber-defences remained reactive in nature, involving significant overhead in terms of execution time. This latency is due to the complex pattern-matching operations required to identify the signatures of polymorphic malware 2 , which shows different behaviour each time it is run. More recently, ML-based models were introduced relying on anomaly detection algorithms. Although these models have shown a good capability to detect unknown attacks, they may classify benign behaviour as abnormal 3 , giving rise to a false alarm.

We argue that data availability can enable a proactive defense, acting before a potential threat escalates into an actual incident. Concerning non-cyber threats, including terrorism and military attacks, proactive approaches alleviate, delay, and even prevent incidents from arising in the first place. Massive software programs are available to assess the intention, potential damages, attack methods, and alternative options for a terrorist attack 4 . We claim that cyber-attacks should be no exception, and that nowadays we have the capabilities to carry out proactive, low latency cyber-defenses based on ML 5 .

Indeed, ML models can provide accurate and reliable forecasts. For example, ML models such as AlphaFold2 6 and RoseTTAFold 7 can predict a protein's three-dimensional structure from its linear sequence. Cyber-security data, however, poses unique challenges. Cyber-incidents are highly sensitive events and are usually kept confidential, since they affect the reputation of the organisations involved. It is often difficult to keep track of these incidents, because they can go unnoticed even by the victim. It is also worth mentioning that pre-processing cyber-security data is challenging due to characteristics such as lack of structure, diversity in format, and high rates of missing values, which distort the findings.

When devising an ML-based method, one can rely on manual feature identification and engineering, or try to learn the features from raw data. In the context of cyber-incidents, there are many factors (i.e., potential features) that could lead to the occurrence of an attack. Wars and political conflicts between countries often lead to cyber-warfare 8 , 9 . The number of mentions of a certain attack appearing in scientific articles may correlate well with the actual incident rate. Also, cyber-attacks often take place on holidays, anniversaries and other politically significant dates 5 . Finding the right features in unstructured big data is one of the key strands of our proposed framework.
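
As an illustration of such a warning-signal feature, the sketch below counts monthly keyword mentions in a small hypothetical document corpus using pandas; the real framework draws on large corpora of research articles, news, blogs, and tweets:

```python
# Minimal sketch: a monthly mention-count feature from unstructured text.
import pandas as pd

docs = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-10", "2022-01-22", "2022-02-03"]),
    "text": ["New ransomware campaign observed...",
             "Study of ransomware payment flows...",
             "Phishing kits traded on forums..."],
})

docs["mentions"] = docs["text"].str.lower().str.count("ransomware")
monthly_mentions = docs.set_index("date")["mentions"].resample("MS").sum()
print(monthly_mentions)   # one candidate feature column in a monthly forecasting dataset
```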

The remainder of the paper is structured as follows. The “ Literature review ” section presents an overview of the related work and highlights the research gaps and our contributions. The “ Methods ” section describes the framework design, including the construction of the dataset and the building of the model. The “ Results ” section presents the validation results of our model, the trend analysis and forecast, and a detailed description of the developed threat cycle. Lastly, the “ Discussion ” section offers a critical evaluation of our work, highlighting its strengths and limitations, and provides recommendations for future research.

Literature review

In recent years, the literature has extensively covered different cyber threats across various application domains, and researchers have proposed several solutions to mitigate these threats. In the Social Internet of Vehicles (SIoV), one of the primary concerns is the interception and tampering of sensitive information by attackers 10 . To address this, a secure authentication protocol has been proposed that utilises confidential computing environments to ensure the privacy of vehicle-generated data. Another application domain that has been studied is the privacy of image data, specifically lane images in rural areas 11 . The proposed methodology uses Error Level Analysis (ELA) and artificial neural network (ANN) algorithms to classify lane images as genuine or fake, with the U-Net model for lane detection in bona fide images. The final images are secured using the proxy re-encryption technique with RSA and ECC algorithms, and maintained using fog computing to protect against forgery.

A further domain that has been studied is the security of Wireless Mesh Networks (WMNs) in the context of the Internet of Things (IoT) 12 . WMNs rely on cooperative forwarding, making them vulnerable to various attacks, including packet drop/modification, badmouthing, on-off, and collusion attacks. To address this, a novel trust mechanism framework has been proposed that differentiates between legitimate and malicious nodes using direct and indirect trust computation. The framework utilises a two-hop mechanism to observe the packet forwarding behaviour of neighbours, and a weighted D-S theory to aggregate recommendations from different nodes. While these solutions have shown promising results in addressing cyber threats, it is important to anticipate the types of threat that may arise, to ensure that solutions can be deployed effectively. By proactively identifying and anticipating cyber threats, organisations can better prepare themselves to protect their systems and data from potential attacks.

While we are relatively successful in detecting and classifying cyber-attacks when they occur 13 , 14 , 15 , success in predicting them has been much more limited. Some studies exist on short-term predictive capability 16 , 17 , 18 , 19 , 20 , 21 , 22 , 23 , 24 , 25 , 26 , such as predicting the number or source of attacks to be expected in the next hours or days. The majority of this work performs the prediction in restricted settings ( e.g. , against a specific entity or organisation) where historical data are available 18 , 19 , 25 . Forecasting attack occurrences has been attempted using statistical methods, especially when parametric data distributions could be assumed 16 , 17 , as well as using ML models 20 . Other methods adopt a Bayesian setting and build event graphs suitable for estimating the conditional probability of an attack following a given chain of events 21 . Such techniques rely on libraries of predefined attack graphs: they can identify the known attack most likely to happen, but are helpless against never-experienced-before, zero-day attacks.

Other approaches try to identify potential attackers by using network entity reputation and scoring 26 . A small but growing body of research explores the fusion of heterogeneous features (warning signals) to forecast cyber-threats using ML. Warning signs may include the number of mentions of a victim organisation on Twitter 18 , mentions in news articles about the victim entity 19 , and digital traces from dark web hacker forums 20 . Our literature review is summarised in Table 1 .

Forecasting the cyber-threats that are most likely to turn into attacks in the medium and long term is of significant importance. It not only gives cyber-security agencies the time to evaluate existing defence measures, but also assists them in identifying areas in which to develop preventive solutions. Long-term prediction of cyber-threats, however, still relies on the subjective perceptions of human security experts 27 , 28 . Unlike a fully automated procedure based on quantitative metrics, the human-based approach is prone to bias rooted in scientific or technical interests 29 . Moreover, quantitative predictions are crucial to scientific objectivity 30 . In summary, we highlight the following research gaps:

Current research primarily focuses on detecting ( i.e. , reactive) rather than predicting cyber-attacks ( i.e. , proactive).

Available predictive methods for cyber-attacks are mostly limited to short-term predictions.

Current predictive methods for cyber-attacks are limited to restricted settings ( e.g. , a particular network or system).

Long-term prediction of cyber-attacks is currently performed by human experts, whose judgement is subjective and prone to bias and disagreement.

Research contributions

Our objective is to fill these research gaps with a proactive, long-term, and holistic approach to attack prediction. The proposed framework gives cyber-security agencies sufficient time to evaluate existing defence measures while also providing an objective and accurate representation of the forecast. Our study aims to predict the trend of cyber-attacks up to three years in advance, utilising big data sources and ML techniques. Our ML models are learned from heterogeneous features extracted from massive, unstructured data sources, namely, Hackmageddon 9 , Elsevier 31 , Twitter 32 , and Python APIs 33 . Hackmageddon provides more than 15,000 records of global cyber-incidents since the year 2011, while the Elsevier API offers access to the Scopus database, the largest abstract and citation database of peer-reviewed literature, with over 27,000,000 documents 34 . The number of relevant tweets we collected is around 9 million. Our study covers 36 countries and 42 major attack types. The proposed framework not only provides the forecast and categorisation of the threats, but also generates a threat life-cycle model, whose five key phases underlie the life cycle of all 42 known cyber-threats. The key contributions of this study are the following:

A novel dataset is constructed using big unstructured data ( i.e. , Hackmageddon), including news and government advisories, in addition to data from the Elsevier, Twitter, and Python holidays APIs. The dataset comprises monthly counts of cyber-attacks and other unique features, covering 42 attack types across 36 countries.

Our proactive approach offers long-term forecasting by predicting threats up to 3 years in advance.

Our approach is holistic in nature, as it does not limit itself to specific entities or regions. Instead, it provides projections of attacks across 36 countries situated in diverse parts of the world.

Our approach is completely automated and quantitative, effectively addressing the issue of bias in human predictions and providing a precise forecast.

By analysing past and predicted future data, we have classified threats into four main groups and provided a forecast of 42 attacks until 2025.

The first threat cycle is proposed, delineating the distinct phases in the life cycle of 42 cyber-attack types.

The framework of forecasting cyber threats

The architecture of our framework for forecasting cyber threats is illustrated in Fig. 1 . As seen in the Data Sources component (left-hand side), to harness all the relevant data and extract meaningful insights, our framework utilises various sources of unstructured data. One of our main sources is Hackmageddon, which includes massive textual data on major cyber-attacks (approx. 15,334 incidents) dating back to July 2011. We refer to the monthly number of attacks in this list as the Number of Incidents (NoI). Also, Elsevier’s Application Programming Interface (API) gives access to a very large corpus of scientific articles and data sets from thousands of sources. Utilising this API, we obtained the monthly Number of Mentions (NoM) of each attack appearing in scientific publications. The NoM data is of particular importance, as it can be used as the ground truth for attack types that do not appear in Hackmageddon. During the preliminary research phase, we examined all the potentially relevant features and noticed that wars/political conflicts are highly correlated with the number of cyber-events. These data were extracted via the Twitter API as Armed Conflict Areas/Wars (ACA). Lastly, as attacks often take place around holidays, Python’s holidays package was used to obtain the number of public holidays per month for each country, which is referred to as Public Holidays (PH).

To ensure the accuracy and quality of the Hackmageddon data, we validated it against statistics from official sources across government, academia, research institutes and technology organisations. As a ransomware example, the Cybersecurity & Infrastructure Security Agency stated in its 2021 trend report that cybersecurity authorities in the United States, Australia, and the United Kingdom observed an increase in sophisticated, high-impact ransomware incidents against critical infrastructure organisations globally 35 . The WannaCry attack in the dataset was also validated against Ghafur et al. ’s 1 statement in their article: “WannaCry ransomware attack was a global epidemic that took place in May 2017”.

An example of an entry in the Hackmageddon dataset is shown in Table 2 . Each entry includes the incident date, the description of the attack, the attack type, and the target country. Data pre-processing (Fig. 1 ) focused on noise reduction through imputing missing values ( e.g. , countries), which were often observed in the earlier years. We were able to impute these values from the description column or occasionally, by looking up the entity location using Google.

The textual data were quantified via our Word Frequency Counter (WFC), which counted the number of occurrences of each attack type per month, as in Table 3 . Cumulative Aggregation (CA) obtained the number of attacks for all countries combined; an example of a data entry after transformation includes the month and the number of attacks against each country (and all countries combined) for each attack type. By adding features such as NoM, ACA, and PH, we obtained additional columns that were appended to the dataset, as shown in Table 4 . Our final dataset covers 42 common types of attacks in 36 countries. The full list of attacks is provided in Table 5 . The list of the countries is given in Supplementary Table S1 .
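To make the WFC and CA steps concrete, the sketch below shows one way to derive the monthly per-country counts from the raw entries using pandas. The column names and the attack/country subsets are illustrative assumptions, not the exact code used in this study.

```python
import pandas as pd

ATTACKS = ["malware", "ransomware", "ddos"]    # subset of the 42 attack types
COUNTRIES = ["US", "UK", "IN"]                 # subset of the 36 countries

def monthly_counts(entries: pd.DataFrame) -> pd.DataFrame:
    """WFC + CA sketch: count entries mentioning each attack per month and
    country, then aggregate over all countries (assumed columns: Date,
    Description, Attack, Country)."""
    entries = entries.copy()
    entries["Month"] = pd.to_datetime(entries["Date"]).dt.to_period("M")
    rows = []
    for month, group in entries.groupby("Month"):
        for attack in ATTACKS:
            # WFC: an entry counts if the attack appears in the Attack
            # column or anywhere in the free-text Description.
            hits = group[
                group["Attack"].str.contains(attack, case=False, na=False)
                | group["Description"].str.contains(attack, case=False, na=False)
            ]
            row = {"Month": month, "Attack": attack}
            for country in COUNTRIES:
                row[country] = int((hits["Country"] == country).sum())
            row["All"] = sum(row[c] for c in COUNTRIES)  # CA over countries
            rows.append(row)
    return pd.DataFrame(rows)
```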

To analyse and investigate the main characteristics of our data, an exploratory analysis was conducted, focusing on the visualisation and identification of key patterns such as trend and seasonality, correlated features, missing data and outliers. For seasonal data, we smoothed out the seasonality so that we could identify the trend while removing the noise in the time series 36 . The smoothing type and constants were optimised along with the ML model (see “ Optimisation ” for details). We applied Stochastic Selection of Features (SoF) to find the subset of features that minimises the prediction error, and compared the univariate approach against the multivariate one.

For the modelling, we built a Bayesian encoder-decoder Long Short-Term Memory (B-LSTM) network. B-LSTM models have been proposed to predict “perfect wave” events, like the onset of stock market “bear” periods, on the basis of multiple warning signs, each having different time dynamics 37 . Encoder-decoder architectures can manage inputs and outputs that both consist of variable-length sequences. The encoder stage encodes a sequence into a fixed-length vector representation (known as the latent representation). The decoder uses the latent representation to predict a sequence. By applying an efficient latent representation, we train the model to consider all the useful warning information from the input sequence, regardless of its position, and to disregard the noise.

Our Bayesian variation of the encoder-decoder LSTM network considers the weights of the model as random variables. This way, we extract epistemic uncertainty via (approximate) Bayesian inference, which quantifies the prediction error due to insufficient information 38 . This is an important parameter, as epistemic uncertainty can be reduced by better intelligence, i.e. , by acquiring more samples and new informative features. Details are provided in “ Bayesian long short-term memory ” section.

Our overall analytical platform learns an operational model for each attack type. Here, we evaluated each model’s performance in predicting the threat trend 36 months in advance. A modified Symmetric Mean Absolute Percentage Error (M-SMAPE) was devised as the evaluation metric, in which we added a penalty term that accounts for the trend direction. More details are provided in the “ Evaluation metrics ” section.

Feature extraction

Below, we provide the details of the process that transforms raw data into numerical features, obtaining the ground truth NoI and the additional features NoM, ACA and PH.

NoI: The number of daily incidents in Hackmageddon was transformed from the purely unstructured daily description of attacks, together with the attack and country columns, into the monthly count of incidents for each attack in each country. Within the description, multiple related attacks may appear, which are not necessarily in the attack column. Let \(E_{x_i}\) denote the set of entries during the month \(x_i\) in the Hackmageddon dataset. Let \(a_j\) and \(c_k\) denote the j th attack and k th country. Then NoI can be expressed as follows:

\(NoI(x_i,a_j,c_k) = \sum _{e \in E_{x_i}} Z(a_j,c_k,e)\)   (1)

where \(Z(a_j,c_k,e)\) is a function that evaluates to 1 if \(a_j\) appears either in the description or in the attack column of entry e and \(c_k\) appears in the country column of e . Otherwise, the function evaluates to 0. Next, we performed CA to obtain the monthly count of attacks in all countries combined for each attack type as follows:

\(NoI(x_i,a_j) = \sum _{k=1}^{36} NoI(x_i,a_j,c_k)\)   (2)

NoM: We wrote a Python script to query the Elsevier API for the number of mentions of each attack during each month 31 . The search covers the title, abstract and keywords of published research papers that are stored in the Scopus database 39 . Let \(P_{x_i}\) denote the set of research papers in Scopus published during the month \(x_i\) . Also, let \(W_{p}\) denote the set of words in the title, abstract and keywords of research paper p . Then NoM can be expressed as follows:

\(NoM(x_i,a_j) = \sum _{p \in P_{x_i}} \sum _{w \in W_p} U(w,a_j)\)   (3)

where \(U(w,a_j)\) evaluates to 1 if \(w=a_j\) , and to 0 otherwise.
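As an illustration of the NoM extraction, the snippet below queries the Scopus Search API for the number of records whose title, abstract or keywords mention an attack. The endpoint and parameter names follow Elsevier’s public documentation, but the API key is a placeholder and month-level date filtering is elided in this sketch.

```python
import requests

API_KEY = "YOUR-ELSEVIER-API-KEY"   # placeholder
BASE = "https://api.elsevier.com/content/search/scopus"

def mentions(attack: str, year: int) -> int:
    """Count Scopus records mentioning `attack` in title/abstract/keywords
    for a given year (finer-grained monthly filtering is elided here)."""
    params = {"query": f'TITLE-ABS-KEY("{attack}")',
              "date": str(year), "count": 1}
    resp = requests.get(BASE, params=params,
                        headers={"X-ELS-APIKey": API_KEY})
    resp.raise_for_status()
    return int(resp.json()["search-results"]["opensearch:totalResults"])
```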

ACA: Using the Twitter API in Python 32 , we wrote a query to obtain the number of tweets with keywords related to political conflicts or military attacks associated with each country during each month. The keywords used for each country are summarised in Supplementary Table S2 , representing our query. Formally, let \(T_{x_i}\) denote the set of all tweets during the month \(x_i\) . Then ACA can be expressed as follows:

\(ACA(x_i,c_k) = \sum _{t \in T_{x_i}} Q(t,c_k)\)   (4)

where \(Q(t,c_k)\) evaluates to 1 if the query in Supplementary Table S2 evaluates to 1 given t and \(c_k\) . Otherwise, it evaluates to 0.
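The ACA feature can be gathered with the Twitter API’s tweet-counts endpoint, for example via the tweepy client as sketched below. The query string stands in for the per-country keyword query of Supplementary Table S2, the bearer token is a placeholder, and full-archive counts require academic-track access.

```python
import tweepy

client = tweepy.Client(bearer_token="YOUR-BEARER-TOKEN")  # placeholder

def conflict_tweet_count(country_query: str, start: str, end: str) -> int:
    """Sum daily tweet counts matching a country's conflict-keyword query
    over [start, end), e.g. start='2022-02-01T00:00:00Z'."""
    total = 0
    for page in tweepy.Paginator(client.get_all_tweets_count,
                                 query=country_query,
                                 start_time=start, end_time=end,
                                 granularity="day"):
        total += sum(bucket["tweet_count"] for bucket in page.data)
    return total
```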

PH: We used the Python holidays library 33 to count the number of days that are considered public holidays in each country during each month. More formally, this can be expressed as follows:

\(PH(x_i,c_k) = \sum _{d \in x_i} H(d,c_k)\)   (5)

where \(H(d,c_k)\) evaluates to 1 if the day d in the country \(c_k\) is a public holiday, and to 0 otherwise. In ( 4 ) and ( 5 ), CA was used to obtain the count for all countries combined as in ( 2 ).
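With the holidays package, Eq. (5) reduces to a few lines; a minimal sketch:

```python
import holidays

def public_holidays(country_code: str, year: int, month: int) -> int:
    """PH (Eq. 5): number of public holidays in `country_code` during the
    given month, e.g. public_holidays("US", 2021, 12)."""
    calendar = holidays.country_holidays(country_code, years=year)
    return sum(1 for day in calendar if day.month == month)
```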

Data integration

Based on Eqs. ( 1 )–( 5 ), we obtain the following columns for each month:

NoI_C: The number of incidents for each attack type in each country ( \(42 \times 36\) columns) [Hackmageddon].

NoI: The total number of incidents for each attack type (42 columns) [Hackmageddon].

NoM: The number of mentions of each attack type in research articles (42 columns) [Elsevier].

ACA_C: The number of tweets about wars and conflicts related to each country (36 columns) [Twitter].

ACA: The total number of tweets about wars and conflicts (1 column) [Twitter].

PH_C: The number of public holidays in each country (36 columns) [Python].

PH: The total number of public holidays (1 column) [Python].

In the aforementioned list of columns, the name enclosed within square brackets denotes the source of data. By matching and combining these columns, we derive our monthly dataset, wherein each row represents a distinct month. A concrete example can be found in Tables 3 and 4 , which, taken together, constitute a single observation in our dataset. The dataset can be expanded through the inclusion of other monthly features as supplementary columns. Additionally, the dataset may be augmented with further samples as additional monthly records become available. Some suggestions for extending the dataset are provided in the “ Discussion ” section.
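Assuming each column group is held in a pandas DataFrame indexed by month, the integration step reduces to a sequence of joins. The toy sketch below uses made-up numbers and only a subset of the column groups (the real frames hold the column counts listed above).

```python
import pandas as pd

months = pd.period_range("2011-07", periods=3, freq="M")
# Illustrative stand-ins for the column groups above.
noi = pd.DataFrame({"NoI_malware": [5, 7, 6]}, index=months)
nom = pd.DataFrame({"NoM_malware": [120, 95, 130]}, index=months)
aca = pd.DataFrame({"ACA": [800, 650, 900]}, index=months)
ph = pd.DataFrame({"PH": [30, 25, 28]}, index=months)

# Integration: join the groups on the shared month index, one row per month.
dataset = noi.join(nom).join(aca).join(ph)
print(dataset)
```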

Data smoothing

We tested multiple smoothing methods and selected the one that resulted in the model with the lowest M-SMAPE during the hyper-parameter optimisation process. The methods we tested include exponential smoothing (ES), double exponential smoothing (DES) and no smoothing (NS). Let \(\alpha \) be the smoothing constant. Then the ES formula is:

\(S(x_i) = \alpha D(x_i) + (1-\alpha )\, S(x_{i-1})\)

where \(D(x_{i})\) denotes the original data at month \(x_{i}\) . For the DES formula, let \(\alpha \) and \(\beta \) be the smoothing constants. We first define the level \(l(x_{i})\) and the trend \(\tau (x_{i})\) as follows:

\(l(x_i) = \alpha D(x_i) + (1-\alpha )\,(l(x_{i-1}) + \tau (x_{i-1}))\)

\(\tau (x_i) = \beta \,(l(x_i) - l(x_{i-1})) + (1-\beta )\,\tau (x_{i-1})\)

then, DES is expressed as follows:

\(S(x_i) = l(x_i) + \tau (x_i)\)

The smoothing constants ( \(\alpha \) and \(\beta \) ) in the aforementioned methods are chosen as those yielding the ML model with the lowest M-SMAPE during the hyper-parameter optimisation process. Supplementary Fig. S5 depicts an example of the DES result.
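A direct implementation of the two smoothing recursions is sketched below; initialising the level and trend from the first two observations is a common convention assumed here, not a detail taken from this study.

```python
import numpy as np

def exp_smooth(d: np.ndarray, alpha: float) -> np.ndarray:
    """ES: S(x_i) = alpha*D(x_i) + (1 - alpha)*S(x_{i-1})."""
    s = np.empty(len(d), dtype=float)
    s[0] = d[0]
    for i in range(1, len(d)):
        s[i] = alpha * d[i] + (1 - alpha) * s[i - 1]
    return s

def double_exp_smooth(d: np.ndarray, alpha: float, beta: float) -> np.ndarray:
    """DES: track level l and trend tau, output S(x_i) = l(x_i) + tau(x_i).
    Requires at least two observations (l and tau are initialised from the
    first two points, an assumed convention)."""
    level, trend = float(d[0]), float(d[1] - d[0])
    out = [float(d[0])]
    for i in range(1, len(d)):
        prev_level = level
        level = alpha * d[i] + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        out.append(level + trend)
    return np.array(out)
```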

Bayesian long short-term memory

LSTM is a type of recurrent neural network (RNN) that uses lagged observations to forecast future time steps 30 . It was introduced as a solution to the so-called vanishing/exploding gradient problem of traditional RNNs 40 , in which the partial derivatives of the loss function may abruptly shrink towards zero (or grow uncontrollably) during training. In LSTM, the input is passed to the network cell, which combines it with the hidden state and cell state values from previous time steps to produce the next states. The hidden state can be thought of as a short-term memory, since it stores information from recent periods in a weighted manner. The cell state, on the other hand, is meant to remember all the past information from previous intervals and store it in the LSTM cell. The cell state thus represents the long-term memory.

LSTM networks are well-suited for time-series forecasting, due to their proficiency in retaining both long-term and short-term temporal dependencies 41 , 42 . By leveraging their ability to capture these dependencies within cyber-attack data, LSTM networks can effectively recognise recurring patterns in the attack time-series. Moreover, the LSTM model is capable of learning intricate temporal patterns in the data and can uncover inter-correlations between various variables, making it a compelling option for multivariate time-series analysis 43 .

Given a sequence of LSTM cells, each processing a single time-step from the past, the final hidden state is encoded into a fixed-length vector. Then, a decoder uses this vector to forecast future values. Using such architecture, we can map a sequence of time steps to another sequence of time steps, where the number of steps in each sequence can be set as needed. This technique is referred to as encoder-decoder architecture.

Because we have relatively short sequences within our refined data ( e.g. , 129 monthly data points over the period from July 2011 to March 2022), it is crucial to extract the source of uncertainty known as epistemic uncertainty 44 , which is caused by lack of knowledge. In principle, epistemic uncertainty can be reduced with more knowledge, either in the form of new features or more samples. Deterministic (non-stochastic) neural network models are not adequate for this task, as they provide only point estimates of model parameters. Rather, we utilise a Bayesian framework to capture epistemic uncertainty. Namely, we adopt the Monte Carlo dropout method proposed by Gal et al. 45 , who showed that dropout training (and inference) in deep networks provides a Bayesian approximation of deep Gaussian processes. Specifically, during the training of our LSTM encoder-decoder network, we applied the same dropout mask at every time-step (rather than sampling a new dropout mask from time-step to time-step). This technique, known as recurrent dropout, is readily available in Keras 46 . During the inference phase, we run the trained model multiple times with recurrent dropout active to produce a distribution of predictive results, as sketched below. Such a prediction is shown in Fig. 4 .
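In Keras, keeping dropout active at inference amounts to calling the model with training=True; the sketch below (mc_predict is an assumed helper name, not from the paper) returns the mean prediction and its epistemic spread across Monte Carlo samples.

```python
import numpy as np

def mc_predict(model, x, n_samples: int = 100):
    """Monte Carlo dropout: run the model with dropout active and return
    the mean prediction and its standard deviation across samples."""
    draws = np.stack([model(x, training=True).numpy()
                      for _ in range(n_samples)])
    return draws.mean(axis=0), draws.std(axis=0)
```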

Figure 2 shows our encoder-decoder B-LSTM architecture. The hidden state and cell state are denoted respectively by \(h_{i}\) and \(C_{i}\) , while the input is denoted by \(X_{i}\) . Here, the length of the input sequence (lag) is a hyper-parameter tuned to produce the optimal model, where the output is a single time-step. The number of cells ( i.e. , the depth of each layer) is tuned as a hyper-parameter in the range between 25 and 200 cells. Moreover, we used one or two layers, tuning the number of layers to each attack type. For the univariate model we used a standard Rectified Linear Unit (ReLU) activation function, while for the multivariate model we used a Leaky ReLU. Standard ReLU computes the function \(f(x)=max(0,x)\) , thresholding the activation at zero. In the multivariate case, zero-thresholding may generate the same ReLU output for many input vectors, slowing model convergence 47 . With Leaky ReLU, instead of defining ReLU as zero for \(x < 0\) , we introduce a small negative slope \(\alpha =0.2\) . Additionally, we used recurrent dropout ( i.e. , the red arrows in Fig. 2 ), where the dropout probability is another hyper-parameter that we tune as described above, following Gal’s method 48 . The tuned dropout value is maintained during testing and prediction, as previously mentioned. Once the final hidden vector \(h_{0}\) is produced by the encoder, the Repeat Vector layer is used as an adapter to reshape it from the two-dimensional output of the encoder ( e.g. , \(h_{0}\) ) to the three-dimensional input expected by the decoder. The decoder processes the input and produces the hidden state, which is then passed to a dense layer to produce the final output.
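A minimal Keras sketch of this encoder-decoder follows. The lag, width, dropout rate and learning rate are placeholders drawn from the tuning ranges described in the “ Optimisation ” section, not the per-attack values found by the search.

```python
from tensorflow import keras
from tensorflow.keras import layers

lag, n_features = 6, 4        # e.g., NoI + NoM + ACA + PH (multivariate)
units, rdrop = 100, 0.3       # placeholder values within the tuned ranges

model = keras.Sequential([
    layers.Input(shape=(lag, n_features)),
    # Encoder: compresses the input sequence into its final hidden state.
    layers.LSTM(units, activation=layers.LeakyReLU(alpha=0.2),
                recurrent_dropout=rdrop),
    # Adapter: reshape the 2-D encoder output to the 3-D decoder input.
    layers.RepeatVector(1),   # single output time-step
    # Decoder: produces the hidden state for the predicted step.
    layers.LSTM(units, activation=layers.LeakyReLU(alpha=0.2),
                recurrent_dropout=rdrop, return_sequences=True),
    layers.TimeDistributed(layers.Dense(n_features)),
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3), loss="mse")
```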

Each time-step corresponds to a month in our model. Since the model is trained to predict a single time-step (a single month), we use a sliding window during the prediction phase to forecast 36 (monthly) data points. In other words, we predict a single month at each step, and the predicted value is fed back as input for the prediction of the following month. This concept is illustrated in the table shown in Fig. 2 . Utilising a single time-step in the model’s output minimises the size of the sliding window, which in turn allows training with as many observations as possible given such limited data.
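The sliding-window loop can be written as follows, assuming a trained model as sketched above and a history array holding the most recent observations:

```python
import numpy as np

def forecast(model, history: np.ndarray, lag: int, horizon: int = 36):
    """Predict `horizon` months one step at a time, feeding each predicted
    month back into the input window (history has shape (T, n_features))."""
    window = history[-lag:].copy()
    preds = []
    for _ in range(horizon):
        x = window[np.newaxis, ...]                  # (1, lag, n_features)
        y = model(x, training=False).numpy()[0, 0]   # next-month vector
        preds.append(y)
        window = np.vstack([window[1:], y])          # slide the window
    return np.array(preds)
```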

The difference between the univariate and multivariate B-LSTMs is that the latter carries additional features in each time-step. Thus, instead of passing a scalar input value to the network, we pass a vector of features including the ground truth at each time-step. The model predicts a vector of features as an output, from which we retrieve the ground truth, and use it along with the other predicted features as an input to predict the next time-step.

Evaluation metrics

The evaluation metric SMAPE is a percentage-based (relative) error measure that judges prediction performance purely on how far the predicted value is from the actual value 49 . It is expressed by the following formula:

\(SMAPE = \frac{100\%}{n} \sum _{t=1}^{n} \frac{|F_t - A_t|}{|A_t| + |F_t|}\)

where \(F_{t}\) and \(A_{t}\) denote the predicted and actual values at time t , and n is the number of data points. This metric returns a value between 0% and 100%. Given that our data has zero values in some months ( e.g. , for emerging threats), the issue of division by zero may arise, a problem that often emerges when using the standard MAPE (Mean Absolute Percentage Error). SMAPE is resilient to this problem, since it has both the actual and predicted values in the denominator.
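In code, this bounded SMAPE reads:

```python
import numpy as np

def smape(actual: np.ndarray, forecast: np.ndarray) -> float:
    """SMAPE in [0%, 100%]; 0/0 terms (both values zero) are defined as 0."""
    num = np.abs(forecast - actual)
    den = np.abs(actual) + np.abs(forecast)
    safe_den = np.where(den == 0, 1.0, den)      # avoid division by zero
    ratio = np.where(den == 0, 0.0, num / safe_den)
    return 100.0 * float(ratio.mean())
```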

Recall that our model aims to predict a curve (corresponding to multiple time steps). Using plain SMAPE as the evaluation metric, the “best” model may turn out to be simply a straight line passing through the fluctuating actual curve. This is undesirable in our case, since our priority is to predict the trend direction (or slope) over its intensity or value at a certain point. We hence add a penalty term to SMAPE, applied when the height of the predicted curve is relatively smaller than that of the actual curve. This yields the modified SMAPE (M-SMAPE). More formally, let I ( V ) be the height of the curve V , calculated as follows:

where n is the curve width or the number of data points. Let A and F denote the actual and predicted curves. We define M-SMAPE as follows:

where \(\gamma \) is a penalty constant between 0 and 1, and d is another constant \(\ge \) 1. In our experiment, we set \(\gamma \) to 0.3 and d to 3, as we found these to be reasonable values by trial and error. We note that, after this modification, the range of possible values of M-SMAPE is between 0% and (100 + 100 \(\gamma \) )%. Through multiple experiments, we found the modified evaluation metric to be more suitable for our scenario, and we therefore adopted it for the model’s evaluation.

Optimisation

On average, our model was trained on around 67% of the refined data, which is equivalent to approximately 7.2 years. We kept the rest, approximately 33% (3 years + lag period), for validation. These percentages may differ slightly for different attack types, depending on the optimal lag period selected.

For hyper-parameter optimisation, we performed a random search with 60 iterations to obtain the set of features, smoothing methods and constants, and model hyper-parameters that results in the model with the lowest M-SMAPE. Random search is a simple and efficient technique for hyper-parameter optimisation, with advantages including efficiency, flexibility, robustness, and scalability. The technique has been studied extensively in the literature and was found to be superior to grid search in many cases 50 . For each set of hyper-parameters, the model was trained using the mean squared error (MSE) as the loss function and Adam as the optimisation algorithm 51 . The model was then validated by forecasting 3 years using M-SMAPE as the evaluation metric, and the average performance was recorded over 3 different seeds. Once the set of hyper-parameters with the minimum M-SMAPE was obtained, we used it to train the model on the full data, after which we predicted the trend for the next 3 years (until March 2025).

The first group of hyper-parameters is the subset of features, in the case of the multivariate model. Here, we experimented with each of the 3 features separately (NoM, ACA or PH) along with the ground truth (NoI), in addition to the combination of all features. The second group comprises the smoothing methods and constants. The set of methods includes ES, DES and NS, as previously discussed. The set of values for the smoothing constant \(\alpha \) ranges from 0.05 to 0.7, while the set of values for the smoothing constant \(\beta \) (for DES) ranges from 0.3 to 0.7. Next is the optimisation of the lag period, with values ranging from 1 to 12 months. This is followed by the model’s hyper-parameters, which include the learning rate with values ranging from \(6\times 10^{-4}\) to \(1\times 10^{-2}\) , the number of epochs between 30 and 200, the number of layers in the range 1 to 2, the number of units in the range 25 to 200, and the recurrent dropout value between 0.2 and 0.5. The ranges of these values were obtained from the literature and online code repositories 52 .
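The search loop itself is short. In the sketch below, build_and_score is a hypothetical helper that smooths the data, trains the B-LSTM with MSE loss and Adam, and returns the mean M-SMAPE over 3 seeds for a given configuration; the value grids paraphrase the ranges above.

```python
import random

SPACE = {
    "features": [["NoI"], ["NoI", "NoM"], ["NoI", "ACA"], ["NoI", "PH"],
                 ["NoI", "NoM", "ACA", "PH"]],
    "smoothing": ["NS", "ES", "DES"],
    "alpha": [round(0.05 + 0.05 * i, 2) for i in range(14)],  # 0.05-0.70
    "beta": [0.3, 0.4, 0.5, 0.6, 0.7],
    "lag": list(range(1, 13)),                                # 1-12 months
    "lr": [6e-4, 1e-3, 3e-3, 1e-2],
    "epochs": [30, 50, 100, 200],
    "layers": [1, 2],
    "units": [25, 50, 100, 200],
    "dropout": [0.2, 0.3, 0.4, 0.5],
}

def random_search(build_and_score, n_iter: int = 60, seed: int = 0):
    """Sample configurations uniformly from SPACE and keep the best one.
    `build_and_score` (hypothetical) trains a model for a configuration
    and returns its average M-SMAPE over 3 seeds."""
    random.seed(seed)
    best_score, best_cfg = float("inf"), None
    for _ in range(n_iter):
        cfg = {k: random.choice(v) for k, v in SPACE.items()}
        score = build_and_score(cfg)
        if score < best_score:
            best_score, best_cfg = score, cfg
    return best_cfg, best_score
```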

Validation and comparative analysis

The results of our model’s validation are provided in Fig. 3 and Table 5 . As shown in Fig. 3 , the predicted data points are well aligned with the ground truth. Our models successfully predicted the next 36 months of all the attacks’ trends with an average M-SMAPE of 0.25. Table 5 summarises the validation results of the univariate and multivariate approaches using B-LSTM. The results show that for approximately 69% of all the attack types, the multivariate approach outperformed the univariate approach. As seen in Fig. 3 , threats with a consistently increasing or emerging trend seemed to be more suitable for the univariate approach, while threats with a fluctuating or decreasing trend showed less validation error when using the multivariate approach. The ACA feature resulted in the best model for 33% of all the attack types, making it the most informative of the three additional features for boosting prediction performance. PH accounts for 17% of all the attacks, followed by NoM, which accounts for 12%.

We additionally compared the performance of the proposed B-LSTM model with other models, namely LSTM and ARIMA. The comparison covers the univariate and multivariate approaches of LSTM and B-LSTM, with two features in the case of the multivariate approach, namely NoI and NoM. The comparison is in terms of the Mean Absolute Percentage Error (MAPE) when predicting four common attack types, namely DDoS, Password Attack, Malware, and Ransomware. A comparison table is provided in Supplementary Table S3 . The results illustrate the superiority of the B-LSTM model for most of the attack types.

Trends analysis

The forecast of each attack trend until the end of the first quarter of 2025 is given in Supplementary Figs. S1 – S4 . By visualising the historical data of each attack as well as the prediction for the next three years, we were able to analyse the overall trend of each attack. The attacks generally follow 4 types of trends: (1) rapidly increasing, (2) overall increasing, (3) emerging and (4) decreasing. The names of attacks for each category are provided in Fig. 4 .

The first trend category is the rapidly increasing trend (Fig. 4 a); approximately 40% of the attacks belong to this trend. We can see that the attacks belonging to this category have increased dramatically over the past 11 years. Based on the model’s prediction, some of these attacks will exhibit steep growth until 2025. Examples include session hijacking, supply chain, account hijacking, zero-day and botnet. Some of the attacks under this category have reached their peak, have recently started stabilising, and will probably remain steady over the next 3 years. Examples include malware, targeted attack, dropper and brute force attack. Other attacks in this category, after a recent increase, are likely to level off in the coming years. These are password attack, DNS spoofing and vulnerability-related attacks.

The second trend category is the overall increasing trend, as seen in Fig. 4 b. Approximately 31% of the attacks follow this trend. The attacks under this category have a slower rate of increase over the years compared to the attacks in the first category, with occasional fluctuations, as can be observed in the figure. Although some of these attacks show a slight recent decline ( e.g. , malvertising, keylogger and URL manipulation), malvertising and keylogger are likely to recover and return to a steady state, while URL manipulation is projected to continue a smooth decline. Other attacks typical of “cold” cyber-warfare, like Advanced Persistent Threats (APT) and rootkits, are already recovering from a small drop and will likely rise to a steady state by 2025. Spyware and data breach have already reached their peak and are predicted to decline in the near future.

Next is the emerging trend, shown in Fig. 4 c. These are the attacks that started to grow significantly after the year 2016, although many of them existed much earlier. In our study, around 17% of the attacks follow this trend. Some attacks have been growing steeply and are predicted to continue this trend until 2025. These are Internet of Things (IoT) device attack and deepfake. Other attacks have also been increasing rapidly since 2016, but are likely to slow down after 2022. These include ransomware and adversarial attacks. Interestingly, some attacks that emerged after 2016 have already reached their peak and recently started a slight decline ( e.g. , cryptojacking and the WannaCry ransomware attack). It is likely that WannaCry will become relatively steady in the coming years, whereas cryptojacking will probably continue to decline until 2025, thanks to the rise of proof-of-stake consensus mechanisms 53 .

The fourth and last trend category is the decreasing trend (Fig. 4 d); only 12% of the attacks follow this trend. Some attacks in this category peaked around 2012 and have been slowly decreasing since then ( e.g. , SQL injection and defacement). The drive-by attack also peaked in 2012, but had further local peaks in 2016 and 2018, after which it declined noticeably. Cross-site scripting (XSS) and pharming peaked more recently than the other attacks, but have been smoothly declining since then. All the attacks under this category are predicted to become relatively stable from 2023 onward; however, they are unlikely to disappear in the next 3 years.

The threat cycle

This large-scale analysis, involving the historical data and the predictions for the next three years, enables us to propose a generalisable model that traces the evolution and adoption of threats as they pass through successive stages. These stages are launch, growth, maturity, trough and stability/decline. We refer to this model as The Threat Cycle (TTC), depicted in Fig. 5 . In the launch phase, a few incidents start appearing over a short period. This is followed by a sharp increase in the number of incidents, growth and visibility, as more and more cyber actors learn and adopt the new attack. Attacks in the launch phase are likely to have many variants, as observed in the case of the WannaCry attack in 2017. At some point, the number of incidents reaches a peak, where the attack enters the maturity phase and the curve becomes steady for a while. Via the trough (when the attack experiences a slight decline as new security measures prove effective), some attacks recover and adapt to the security defences, entering the slope of plateau, while others continue to decline smoothly, although they do not completely disappear ( i.e. , the slope of decline). It is worth noting that the speed of transition between the different phases may vary significantly between attacks.

As seen in Fig. 5 , the attacks are placed on the cycle based on the slope of their current trend, while considering their historical trend and prediction. In the trough phase, we can see that the attacks will either follow the slope of plateau or the slope of decline. Based on the predicted trend in the blue zone in Fig. 4 , we were able to indicate the future direction for some of the attacks close to the split point of the trough using different colours (blue or red). Brute force, malvertising, the Distributed Denial-of-Service attack (DDoS), insider threat, WannaCry and phishing are denoted in blue meaning that these are likely on their way to the slope of plateau. In the first three phases, it is usually unclear and difficult to predict whether a particular attack will reach the plateau or decline, thus, denoted in grey.

There are some similarities and differences between TTC and the well-known Gartner hype cycle (GHC) 54 . A standard GHC is shown in a vanishing green colour in Fig. 5 . As TTC is specific to cyber threats, it has a much wider peak compared to GHC. Although both GHC and TTC have a trough phase, in TTC the threats decline only slightly (versus a significant drop in GHC) as they exit their maturity phase, after which they recover and move to stability (slope of plateau) or decline.

Many of the attacks in the emerging category are observed in the growth phase. These include IoT device attack, deepfake and data poisoning. While ransomware (except WannaCry) is in the growth phase, WannaCry has already reached the trough and is predicted to follow the slope of plateau. The adversarial attack has just entered the maturity stage, and cryptojacking is about to enter the trough. Although the adversarial attack is generally regarded as a growing threat, this machine-based prediction and introspection interestingly shows that it is maturing. The majority of the rapidly increasing threats are either in the growth or in the maturity phase. The attacks in the growth phase include session hijacking, supply chain, account hijacking, zero-day and botnet. The attacks in the maturity phase include malware, targeted attack, vulnerability-related attacks and the Man-In-The-Middle attack (MITM). Some rapidly increasing attacks, such as phishing, brute force, and DDoS, are in the trough and are predicted to enter stability. We also observe that most of the attacks in the category of overall increasing threats have passed the growth phase and are mostly branching to the slope of plateau or the slope of decline, while a few are still in the maturity phase ( e.g. , spyware). All of the decreasing threats are on the slope of decline. These include XSS, pharming, drive-by, defacement and SQL injection.

Highlights and limitations

This study presents the development of an ML-based proactive approach for long-term prediction of cyber-attacks, offering early insight into potential attacks and the relevant security measures so that defenders can plan for the future. This approach can contribute to the prevention of incidents by allowing more time to develop optimal defensive actions/tools in a contested cyberspace. Proactive approaches can also effectively reduce uncertainty when prioritising existing security measures or initiating new security solutions. We argue that cyber-security agencies should prioritise their resources to provide the best possible support in preventing the fastest-growing attacks, namely those that appear in the launch phase of TTC or in the rapidly increasing or emerging trend categories of Fig. 4 a and c, based on the predictions for the coming years.

In addition, our fully automated approach is promising for overcoming the well-known issues of human-based analysis, above all expertise scarcity. Because the procedure is purely quantitative and data-driven, the resulting predictions are free of human subjective bias and are expected to be more consistent. By fully automating this analytic process, the results are reproducible and can potentially be made explainable with the help of recent advancements in Explainable Artificial Intelligence.

Thanks to the massive data volume and wide geographic coverage of the data sources we utilised, this study covers a broad cross-section of today’s cyber-attack landscape. Our holistic approach performs the long-term prediction on the scale of 36 countries and is not confined to a specific region. Indeed, cyberspace is limitless, and a cyber-attack on critical infrastructure in one country can affect the continent as a whole, or even the globe. We argue that our Threat Cycle (TTC) provides a sound basis for awareness of, and investment in, new security measures that could prevent attacks from taking place. We believe that our tool can enable a collective defence effort by sharing the long-term predictions and trend analysis generated via quantitative processes and data, and by furthering the analysis of their regional and global impacts.

Zero-day attacks exploit a previously unknown vulnerability before the developer has had a chance to release a patch or fix for the problem 55 . Zero-day attacks are particularly dangerous because they can be used to target even the most secure systems and go undetected for extended periods of time. As a result, these attacks can cause significant damage to an organisation’s reputation, financial well-being, and customer trust. Our approach takes the existing research on using ML in the field of zero-day attacks to another level, offering a more proactive solution. By leveraging the power of deep neural networks to analyse complex, high-dimensional data, our approach can help agencies prepare ahead of time, in order to prevent zero-day attacks from happening in the first place, a problem for which no fix exists despite our ability to detect such attacks. Our results in Fig. 4 a suggest that the zero-day attack is likely to continue a steep growth until 2025. Knowing this, we can proactively invest in solutions to prevent it or slow its rise in the future since, after all, ML detection approaches alone may not be sufficient to reduce its effect.

A limitation of our approach is its reliance on a restricted dataset that encompasses data since 2011 only. This is due to the challenges we encountered in accessing confidential and sensitive information. Extending the prediction phase requires the model to make predictions further into the future, where there may be more variability and uncertainty. This could lead to a decrease in prediction accuracy, especially if the underlying data patterns change over time or if there are unforeseen external factors that affect the data. This uncertainty is highlighted by the results of the Bayesian model itself, which expresses it through the widening of the confidence interval over time (Fig. 3 a and b). Despite incorporating the Bayesian model to tackle epistemic uncertainty, our model could benefit substantially from additional data to acquire a comprehensive understanding of past patterns, ultimately improving its capacity to forecast long-term trends. Moreover, an augmented dataset would allow ample opportunity for testing, providing greater confidence in the model’s resilience and capability to generalise.

Further enhancements can be made to the dataset by including pivotal dates (such as anniversaries of political events and war declarations) as a feature, specifically those that experience a high frequency of cyber-attacks. Additionally, augmenting the dataset with digital traces that reflect attackers’ intentions and motivations, obtained from the dark web, would be valuable. Other informative features could facilitate short-term prediction, specifically forecasting the onset of each attack.

Future work

Moving forward, future research can focus on augmenting the dataset with additional samples and informative features to enhance the model’s performance and its ability to forecast the trend over the longer term. This work also opens a new area of research focused on forecasting the gap between the trend of cyber-attacks and that of the associated technological solutions and other variables, with the aim of guiding research investment decisions. Subsequently, TTC could be improved by adopting another curve model that can visualise the current development of relevant security measures. The threat trend categories (Fig. 4 ) and TTC (Fig. 5 ) show how attacks will evolve over the next three years and beyond; however, we do not know where the relevant security measures will be. For example, data poisoning is an AI-targeted adversarial attack that attempts to manipulate the training dataset in order to control the prediction behaviour of a machine-learned model. From the scientific literature data ( e.g. , Scopus), we could analyse the published articles studying the data poisoning problem and identify the relevant keywords of these articles ( e.g. , Reject on Negative Impact (RONI) and Probability of Sufficiency (PS)). RONI and PS are typical methods for detecting poisonous data by evaluating the effect of individual data points on the performance of the trained model. Likewise, features that are informative, discriminating or uncertainty-reducing for tracking how the relevant security measures evolve exist within such online sources, in the form of authors’ keywords, number of citations, research funding, number of publications, etc .

Figure 1

The workflow and architecture of forecasting cyber threats. The ground truth of Number of Incidents (NoI) was extracted from Hackmageddon, which has over 15,000 daily records of cyber incidents worldwide over the past 11 years. Additional features were obtained, including the Number of Mentions (NoM) of each attack in the scientific literature using the Elsevier API, which gives access to over 27 million documents. The number of tweets about Armed Conflict Areas/Wars (ACA) was also obtained using the Twitter API for each country, with a total of approximately 9 million tweets. Finally, the number of Public Holidays (PH) in each country was obtained using the holidays library in Python. The data preparation phase includes data re-formatting, imputation and quantification using the Word Frequency Counter (WFC) to obtain the monthly occurrence of attacks per country and Cumulative Aggregation (CA) to obtain the sum for all countries. The monthly NoM, ACA and PHs were quantified and aggregated using CA. The numerical features were then combined and stored in the refined database. The percentages in the refined database are based on the contribution of each data source. In the exploratory analysis phase, the analytic platform analyses the trend and performs data smoothing using Exponential Smoothing (ES), Double Exponential Smoothing (DES) and No Smoothing (NS). The smoothing methods and Smoothing Constants (SCs) were chosen for each attack, followed by the Stochastic Selection of Features (SoF). In the model development phase, the refined data was partitioned into approximately 67% for training and 33% for testing. The models were learned using the encoder-decoder architecture of the Bayesian Long Short-Term Memory (B-LSTM). The optimisation component finds the set of hyper-parameters that minimises the error (i.e., M-SMAPE), which is then used for learning the operational models. In the forecasting phase, we used the operational models to predict the next three years’ NoIs. Analysing the predicted data, trend types were identified and attacks were categorised into four different trends. The slope of each attack was then measured and the Magnitude of Slope (MoS) was analysed. The final output is The Threat Cycle (TTC), illustrating the attack trends, status, and direction over the next 3 years.

Figure 2

The encoder-decoder architecture of the Bayesian Long Short-Term Memory (B-LSTM). \(X_{i}\) stands for the input at time-step i . \(h_{i}\) stands for the hidden state, which stores information from the recent time steps (short-term). \(C_{i}\) stands for the cell state, which stores all processed information from the past (long-term). The number of input time steps in the encoder is a variable tuned as a hyper-parameter, while the output in the decoder is a single time-step. The depth and number of layers are another set of hyper-parameters tuned during model optimisation. The red arrows indicate recurrent dropout, maintained during testing and prediction. The figure shows an example for an input with time lag=6 and a single layer. The final hidden state \(h_{0}\) produced by the encoder is passed to the Repeat Vector layer to convert it from a two-dimensional output to the three-dimensional input expected by the decoder. The decoder processes the input and produces the final hidden state \(h_{1}\) . This hidden state is finally passed to a dense layer to produce the output. The table illustrates the concept of the sliding-window method used to forecast multiple time steps during testing and prediction (i.e., using the output at one time-step as an input to forecast the next). Using this concept, we can predict as many time steps as needed. In the table, an output vector of 6 time steps was predicted.

Figure 3

The B-LSTM validation results for predicting the number of attacks from April 2019 to March 2022. (U) indicates a univariate model, while (M) indicates a multivariate model. ( a ) Botnet attack with M-SMAPE=0.03. ( b ) Brute force attack with M-SMAPE=0.13. ( c ) SQL injection attack with M-SMAPE=0.04, using the NoM feature. ( d ) Targeted attack with M-SMAPE=0.06, using the NoM feature. The Y axis is normalised in the case of multivariate models to account for the different ranges of feature values.

Figure 4

A bird’s eye view of threat trend categories. The period of the trend plots is between July, 2011 and March, 2025, with the period between April, 2022 and March, 2025 forecasted using B-LSTM. ( a ) Among rapidly increasing threats, as observed in the forecast period, some threats are predicted to continue a sharp increase until 2025, while others will probably level off. ( b ) Threats under this category have been increasing overall, while fluctuating, over the past 11 years. Recently, some of the overall increasing threats declined slightly; however, many of these are likely to recover and level off by 2025. ( c ) Emerging threats began to appear and grow sharply after the year 2016; some are expected to continue growing at this increasing rate, while others are likely to slow down or stabilise by 2025. ( d ) Decreasing threats peaked in the earlier years and have slowly been declining since then. This decreasing group is likely to level off, but will probably not disappear in the coming 3 years. The Y axis is normalised to account for the different ranges of values across different attacks. The 95% confidence interval is shown for each threat prediction.

Figure 5

The threat cycle (TTC). The attacks go through 5 stages, namely launch, growth, maturity, trough, and stability/decline. A standard Gartner hype cycle (GHC) is shown in a vanishing green colour for comparison with TTC. Both GHC and TTC have a peak; however, TTC’s peak is much wider, with a slightly less steep curve during the growth stage. Some attacks in TTC do not recover after the trough and slide into the slope of decline. TTC captures the state of each attack in 2022, where the colour of each attack indicates which slope it will likely follow (e.g., plateau or decline) based on the predictive results until 2025. Within the trough stage, the attacks shown as blue dots are likely to arrive at the slope of plateau by 2025, while the attacks shown as red dots will probably be on the slope of decline by 2025. The attacks whose final destination is unknown are coloured in grey.

Data availability

As requested by the journal, the data used in this paper are available to editors and reviewers upon request. The data will be made publicly available after the paper is published and can be accessed at the following link: https://github.com/zaidalmahmoud/Cyber-threat-forecast .

Ghafur, S. et al. A retrospective impact analysis of the WannaCry cyberattack on the NHS. NPJ Digit. Med. 2 , 1–7 (2019).

Alrzini, J. R. S. & Pennington, D. A review of polymorphic malware detection techniques. Int. J. Adv. Res. Eng. Technol. 11 , 1238–1247 (2020).

Lazarevic, A., Ertoz, L., Kumar, V., Ozgur, A. & Srivastava, J. A comparative study of anomaly detection schemes in network intrusion detection. In: Proceedings of the 2003 SIAM International Conference on Data Mining , 25–36 (SIAM, 2003).

Kebir, O., Nouaouri, I., Rejeb, L. & Said, L. B. Atipreta: An analytical model for time-dependent prediction of terrorist attacks. Int. J. Appl. Math. Comput. Sci. 32 , 495–510 (2022).

Anticipating cyber attacks: There’s no Abbottabad in cyber space. Infosecurity Magazine https://www.infosecurity-magazine.com/white-papers/anticipating-cyber-attacks (2015).

Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596 , 583–589 (2021).

Baek, M. et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 373 , 871–876 (2021).

Gibney, E. et al. Where is Russia’s cyberwar? Researchers decipher its strategy. Nature 603 , 775–776 (2022).

Passeri, P. Hackmageddon data set. Hackmageddon https://www.hackmageddon.com (2022).

Chen, C.-M. et al. A provably secure key transfer protocol for the fog-enabled social internet of vehicles based on a confidential computing environment. Veh. Commun. 39 , 100567 (2023).

Nagasree, Y. et al. Preserving privacy of classified authentic satellite lane imagery using proxy re-encryption and UAV technologies. Drones 7 , 53 (2023).

Kavitha, A. et al. Security in IoT mesh networks based on trust similarity. IEEE Access 10 , 121712–121724 (2022).

Salih, A., Zeebaree, S. T., Ameen, S., Alkhyyat, A. & Shukur, H. M. A survey on the role of artificial intelligence, machine learning and deep learning for cybersecurity attack detection. In 2021 7th International Engineering Conference “Research and Innovation amid Global Pandemic” (IEC) , 61–66 (IEEE, 2021).

Ren, K., Zeng, Y., Cao, Z. & Zhang, Y. ID-RDRL: A deep reinforcement learning-based feature selection intrusion detection model. Sci. Rep. 12 , 1–18 (2022).

Liu, X. & Liu, J. Malicious traffic detection combined deep neural network with hierarchical attention mechanism. Sci. Rep. 11 , 1–15 (2021).

Werner, G., Yang, S. & McConky, K. Time series forecasting of cyber attack intensity. In Proceedings of the 12th Annual Conference on Cyber and Information Security Research , 1–3 (2017).

Werner, G., Yang, S. & McConky, K. Leveraging intra-day temporal variations to predict daily cyberattack activity. In 2018 IEEE International Conference on Intelligence and Security Informatics (ISI) , 58–63 (IEEE, 2018).

Okutan, A., Yang, S. J., McConky, K. & Werner, G. Capture: cyberattack forecasting using non-stationary features with time lags. In 2019 IEEE Conference on Communications and Network Security (CNS) , 205–213 (IEEE, 2019).

Munkhdorj, B. & Yuji, S. Cyber attack prediction using social data analysis. J. High Speed Netw. 23 , 109–135 (2017).

Goyal, P. et al. Discovering signals from web sources to predict cyber attacks. arXiv preprint arXiv:1806.03342 (2018).

Qin, X. & Lee, W. Attack plan recognition and prediction using causal networks. In 20th Annual Computer Security Applications Conference , 370–379 (IEEE, 2004).

Husák, M. & Kašpar, J. AIDA framework: Real-time correlation and prediction of intrusion detection alerts. In Proceedings of the 14th International Conference on Availability, Reliability and Security , 1–8 (2019).

Liu, Y. et al. Cloudy with a chance of breach: Forecasting cyber security incidents. In: 24th USENIX Security Symposium (USENIX Security 15) , 1009–1024 (2015).

Malik, J. et al. Hybrid deep learning: An efficient reconnaissance and surveillance detection mechanism in SDN. IEEE Access 8 , 134695–134706 (2020).

Bilge, L., Han, Y. & Dell’Amico, M. RiskTeller: Predicting the risk of cyber incidents. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security , 1299–1311 (2017).

Husák, M., Bartoš, V., Sokol, P. & Gajdoš, A. Predictive methods in cyber defense: Current experience and research challenges. Futur. Gener. Comput. Syst. 115 , 517–530 (2021).

Stephens, G. Cybercrime in the year 2025. Futurist 42 , 32 (2008).

Adamov, A. & Carlsson, A. The state of ransomware. Trends and mitigation techniques. In EWDTS , 1–8 (2017).

Shoufan, A. & Damiani, E. On inter-rater reliability of information security experts. J. Inf. Secur. Appl. 37 , 101–111 (2017).

Cha, Y.-O. & Hao, Y. The dawn of metamaterial engineering predicted via hyperdimensional keyword pool and memory learning. Adv. Opt. Mater. 10 , 2102444 (2022).

Elsevier research products apis. Elsevier Developer Portal https://dev.elsevier.com (2022).

Twitter api v2. Developer Platform https://developer.twitter.com/en/docs/twitter-api (2022).

holidays 0.15. PyPI. The Python Package Index https://pypi.org/project/holidays/ (2022).

Visser, M., van Eck, N. J. & Waltman, L. Large-scale comparison of bibliographic data sources: Scopus, web of science, dimensions, crossref, and microsoft academic. Quant. Sci. Stud. 2 , 20–41 (2021).

2021 trends show increased globalized threat of ransomware. Cybersecurity and Infrastructure Security Agency https://www.cisa.gov/uscert/ncas/alerts/aa22-040a (2022).

Lai, K. K., Yu, L., Wang, S. & Huang, W. Hybridizing exponential smoothing and neural network for financial time series predication. In International Conference on Computational Science , 493–500 (Springer, 2006).

Huang, B., Ding, Q., Sun, G. & Li, H. Stock prediction based on Bayesian-LSTM. In Proceedings of the 2018 10th International Conference on Machine Learning and Computing , 128–133 (2018).

Mae, Y., Kumagai, W. & Kanamori, T. Uncertainty propagation for dropout-based Bayesian neural networks. Neural Netw. 144 , 394–406 (2021).

Scopus preview. Scopus https://www.scopus.com/home.uri (2022).

Jia, P., Chen, H., Zhang, L. & Han, D. Attention-LSTM based prediction model for aircraft 4-D trajectory. Sci. Rep. 12 (2022).

Chandra, R., Goyal, S. & Gupta, R. Evaluation of deep learning models for multi-step ahead time series prediction. IEEE Access 9 , 83105–83123 (2021).

Gers, F. A., Schmidhuber, J. & Cummins, F. Learning to forget: Continual prediction with LSTM. Neural Comput. 12 , 2451–2471 (2000).

Article   CAS   PubMed   Google Scholar  

Sagheer, A. & Kotb, M. Unsupervised pre-training of a deep lstm-based stacked autoencoder for multivariate time series forecasting problems. Sci. Rep. 9 , 1–16 (2019).

Article   ADS   Google Scholar  

Swiler, L. P., Paez, T. L. & Mayes, R. L. Epistemic uncertainty quantification tutorial. In Proceedings of the 27th International Modal Analysis Conference (2009).

Gal, Y. & Ghahramani, Z. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. arXiv preprint arXiv:1506.02142v6 (2016).

Chollet, F. Deep Learning with Python , 2 edn. (Manning Publications, 2017).

Xu, J., Li, Z., Du, B., Zhang, M. & Liu, J. Reluplex made more practical: Leaky relu. In 2020 IEEE Symposium on Computers and Communications (ISCC) , 1–7 (IEEE, 2020).

Gal, Y., Hron, J. & Kendall, A. Concrete dropout. Adv. Neural Inf. Process. Syst. 30 (2017).

Shcherbakov, M. V. et al. A survey of forecast error measures. World Appl. Sci. J. 24 , 171–176 (2013).

Bergstra, J. & Bengio, Y. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13 (2012).

Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).

Krizhevsky, A., Sutskever, I. & Hinton, G. E. Imagenet classification with deep convolutional neural networks. Commun. ACM 60 , 84–90 (2017).

Shifferaw, Y. & Lemma, S. Limitations of proof of stake algorithm in blockchain: A review. Zede J. 39 , 81–95 (2021).

Dedehayir, O. & Steinert, M. The hype cycle model: A review and future directions. Technol. Forecast. Soc. Chang. 108 , 28–41 (2016).

Abri, F., Siami-Namini, S., Khanghah, M. A., Soltani, F. M. & Namin, A. S. Can machine/deep learning classifiers detect zero-day malware with high accuracy?. In 2019 IEEE International Conference on Big Data (Big Data) , 3252–3259 (IEEE, 2019).

Download references

Acknowledgements

The authors are grateful to DASA's machine learning team for their invaluable discussions and feedback, with special thanks to EBTIC and British Telecom's (BT) cyber security team for their constructive criticism of this work.

Author information

Authors and Affiliations

Department of Computer Science and Information Systems, University of London, Birkbeck College, London, United Kingdom

Zaid Almahmoud & Paul D. Yoo

Huawei Technologies Canada, Ottawa, Canada

Omar Alhussein

Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, Canada

Ilyas Farhat

Department of Computer Science, Università degli Studi di Milano, Milan, Italy

Ernesto Damiani

Center for Cyber-Physical Systems (C2PS), Khalifa University, Abu Dhabi, United Arab Emirates


Contributions

Z.A., P.D.Y., I.F., and E.D. were in charge of the framework design and theoretical analysis of the trend analysis and TTC. Z.A., O.A., and P.D.Y. contributed to the B-LSTM design and experiments. O.A. proposed the concepts of B-LSTM. All of the authors contributed to the discussion of the framework design and experiments, and to the writing of this paper. P.D.Y. proposed the big data approach and supervised the whole project.

Corresponding author

Correspondence to Paul D. Yoo.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article

Almahmoud, Z., Yoo, P.D., Alhussein, O. et al. A holistic and proactive approach to forecasting cyber threats. Sci Rep 13 , 8049 (2023). https://doi.org/10.1038/s41598-023-35198-1


Received : 21 December 2022

Accepted : 14 May 2023

Published : 17 May 2023

DOI : https://doi.org/10.1038/s41598-023-35198-1


This article is cited by

Integrating AI-driven threat intelligence and forecasting in the cyber security exercise content generation lifecycle

  • Alexandros Zacharis
  • Vasilios Katos
  • Constantinos Patsakis

International Journal of Information Security (2024)


  • Open access
  • Published: 10 August 2020

Using deep learning to solve computer security challenges: a survey

  • Yoon-Ho Choi 1 , 2 ,
  • Peng Liu 1 ,
  • Zitong Shang 1 ,
  • Haizhou Wang 1 ,
  • Zhilong Wang 1 ,
  • Lan Zhang 1 ,
  • Junwei Zhou 3 &
  • Qingtian Zou 1  

Cybersecurity volume 3, Article number: 15 (2020)

16k Accesses

23 Citations

1 Altmetric


Although using machine learning techniques to solve computer security challenges is not a new idea, the rapidly emerging Deep Learning technology has recently triggered a substantial amount of interest in the computer security community. This paper seeks to provide a dedicated review of the very recent research works on using Deep Learning techniques to solve computer security challenges. In particular, the review covers eight computer security problems being solved by applications of Deep Learning: security-oriented program analysis, defending return-oriented programming (ROP) attacks, achieving control-flow integrity (CFI), defending network attacks, malware classification, system-event-based anomaly detection, memory forensics, and fuzzing for software security.

Introduction

Using machine learning techniques to solve computer security challenges is not a new idea. For example, in 1998, Ghosh and others ( Ghosh et al. 1998 ) proposed to train a (traditional) neural network based anomaly detection scheme (i.e., detecting anomalous and unknown intrusions against programs); in 2003, Hu and others ( Hu et al. 2003 ) and Heller and others ( Heller et al. 2003 ) applied Support Vector Machines to build anomaly detection schemes (e.g., detecting anomalous Windows registry accesses).

The machine-learning-based computer security research investigations during 1990-2010, however, have not been very impactful. For example, to the best of our knowledge, none of the machine learning applications proposed in ( Ghosh et al. 1998 ; Hu et al. 2003 ; Heller et al. 2003 ) has been incorporated into a widely deployed intrusion-detection commercial product.

Regarding why they were not very impactful, although researchers in the computer security community seem to have different opinions, the following remarks by Sommer and Paxson ( Sommer and Paxson 2010 ) (in the context of intrusion detection) have resonated with many researchers:

Remark A: “It is crucial to have a clear picture of what problem a system targets: what specifically are the attacks to be detected? The more narrowly one can define the target activity, the better one can tailor a detector to its specifics and reduce the potential for misclassifications.” ( Sommer and Paxson 2010 )

Remark B: “If one cannot make a solid argument for the relation of the features to the attacks of interest, the resulting study risks foundering on serious flaws.” ( Sommer and Paxson 2010 )

The concerns behind these insightful remarks, though well aligned with the machine learning techniques used by security researchers during 1990-2010, could become less significant with Deep Learning (DL), a rapidly emerging machine learning technology, due to the following observations. First, Remark A implies that even if the same machine learning method is used, one algorithm employing a cost function that is based on a more specifically defined target attack activity could perform substantially better than another algorithm deploying a less specifically defined cost function. This could be a less significant concern with DL, since a few recent studies have shown that even if the target attack activity is not narrowly defined, a DL model could still achieve very high classification accuracy. Second, Remark B implies that if feature engineering is not done properly, the trained machine learning models could be plagued by serious flaws. This could be a less significant concern with DL, since many deep learning neural networks require less feature engineering than conventional machine learning techniques.

As stated in NSCAI Intern Report for Congress (2019 ), “DL is a statistical technique that exploits large quantities of data as training sets for a network with multiple hidden layers, called a deep neural network (DNN). A DNN is trained on a dataset, generating outputs, calculating errors, and adjusting its internal parameters. Then the process is repeated hundreds of thousands of times until the network achieves an acceptable level of performance. It has proven to be an effective technique for image classification, object detection, speech recognition, and natural language processing–problems that challenged researchers for decades. By learning from data, DNNs can solve some problems much more effectively, and also solve problems that were never solvable before.”

Now let’s take a high-level look at how DL could make it substantially easier to overcome the challenges identified by Sommer and Paxson ( Sommer and Paxson 2010 ). First, one major advantage of DL is that it makes learning algorithms less dependent on feature engineering. This characteristic of DL makes it easier to overcome the challenge indicated by Remark B. Second, another major advantage of DL is that it could achieve high classification accuracy with minimum domain knowledge. This characteristic of DL makes it easier to overcome the challenge indicated by Remark A.

Key observation. The above discussion indicates that DL could be a game changer in applying machine learning techniques to solving computer security challenges.

Motivated by this observation, this paper seeks to provide a dedicated review of the very recent research works on using Deep Learning techniques to solve computer security challenges. It should be noticed that since this paper aims to provide a dedicated review, non-deep-learning techniques and their security applications are out of the scope of this paper.

The remainder of the paper is organized as follows. In the “ A four-phase workflow framework can summarize the existing works in a unified manner ” section, we present a four-phase workflow framework which we use to summarize the existing works in a unified manner. In the “ A closer look at applications of deep learning in solving security-oriented program analysis challenges ” - “ A closer look at applications of deep learning in security-oriented fuzzing ” sections, we provide a review of the eight computer security problems being solved by applications of Deep Learning, respectively. In the “ Discussion ” section, we discuss certain similarities and dissimilarities among the existing works. In the “ Further areas of investigation ” section, we mention four further areas of investigation. In the “ Conclusion ” section, we conclude the paper.

A four-phase workflow framework can summarize the existing works in a unified manner

We found that a four-phase workflow framework can provide a unified way to summarize all the research works surveyed by us. In particular, we found that each work surveyed by us employs a particular workflow when using machine learning techniques to solve a computer security challenge, and we found that each workflow consists of two or more phases. By “a unified way”, we mean that every workflow surveyed by us is essentially an instantiation of a common workflow pattern which is shown in Fig.  1 .

figure 1

Overview of the four-phase workflow

Definitions of the four phases

The four phases, shown in Fig.  1 , are defined as follows. To make the definitions of the four phases more tangible, we use a running example to illustrate each of the four phases.

Phase I. (Obtaining the raw data)

In this phase, certain raw data are collected. Running Example: When Deep Learning is used to detect suspicious events in a Hadoop distributed file system (HDFS), the raw data are usually the events (e.g., a block is allocated, read, written, replicated, or deleted) that have happened to each block. Since these events are recorded in Hadoop logs, the log files hold the raw data. Since each event is uniquely identified by a particular (block ID, timestamp) tuple, we could simply view the raw data as n event sequences. Here n is the total number of blocks in the HDFS. For example, the raw data collected in Xu et al. (2009) in total consists of 11,197,954 events. Since 575,139 blocks were in the HDFS, there were 575,139 event sequences in the raw data, and on average each event sequence had 19 events. One such event sequence is shown as follows:

[Figure: an example HDFS block event sequence from the raw data]

Phase II.(Data preprocessing)

Both Phase II and Phase III aim to properly extract and represent the useful information held in the raw data collected in Phase I. Both Phase II and Phase III are closely related to feature engineering. A key difference between Phase II and Phase III is that Phase III is completely dedicated to representation learning, while Phase II is focused on all the information extraction and data processing operations that are not based on representation learning. Running Example: Let’s revisit the aforementioned HDFS. Each recorded event is described by unstructured text. In Phase II, the unstructured text is parsed to a data structure that shows the event type and a list of event variables in (name, value) pairs. Since there are 29 types of events in the HDFS, each event is represented by an integer from 1 to 29 according to its type. In this way, the aforementioned example event sequence can be transformed to:

22, 5, 5, 7
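
To make this Phase II parsing step concrete, here is a minimal Python sketch; the log lines and event templates below are hypothetical stand-ins, not the actual 29 HDFS event types:

```python
# Minimal sketch of Phase II for HDFS-style logs: match each unstructured log
# line against an event template, then map every template to an integer ID so
# that a block's history becomes an integer sequence.
import re
from collections import defaultdict

# Hypothetical templates standing in for the 29 HDFS event types.
TEMPLATES = {
    r"Allocated block": "ALLOC",
    r"Receiving block": "RECEIVE",
    r"Deleting block": "DELETE",
}

type_ids = defaultdict(lambda: len(type_ids) + 1)  # assign IDs 1, 2, 3, ... on first sight

def parse_line(line: str) -> int:
    """Return the integer event-type ID for one raw log line."""
    for pattern, name in TEMPLATES.items():
        if re.search(pattern, line):
            return type_ids[name]
    return type_ids["UNKNOWN"]

raw = [
    "INFO dfs.FSNamesystem: Allocated block blk_123",
    "INFO dfs.DataNode: Receiving block blk_123",
    "INFO dfs.DataNode: Receiving block blk_123",
    "INFO dfs.FSNamesystem: Deleting block blk_123",
]
print([parse_line(l) for l in raw])  # -> [1, 2, 2, 3]
```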

Phase III.(Representation learning)

As stated in Bengio et al. (2013) , “Learning representations of the data that make it easier to extract useful information when building classifiers or other predictors.” Running Example: Let’s revisit the same HDFS. Although DeepLog ( Du et al. 2017 ) directly employed one-hot vectors to represent the event types without representation learning, if we view an event type as a word in a structured language, one may actually use the word embedding technique to represent each event type. It should be noticed that the word embedding technique is a representation learning technique.

Phase IV.(Classifier learning)

This phase aims to build specific classifiers or other predictors through Deep Learning. Running Example: Let’s revisit the same HDFS. DeepLog ( Du et al. 2017 ) used Deep Learning to build a stacked LSTM neural network for anomaly detection. For example, let’s consider event sequence {22,5,5,5,11,9,11,9,11,9,26,26,26} in which each integer represents the event type of the corresponding event in the event sequence. Given a window size h = 4, the input sample and the output label pairs to train DeepLog will be: {22,5,5,5 → 11 }, {5,5,5,11 → 9 }, {5,5,11,9 → 11 }, and so forth. In the detection stage, DeepLog examines each individual event. It determines if an event is treated as normal or abnormal according to whether the event’s type is predicted by the LSTM neural network, given the history of event types. If the event’s type is among the top g predicted types, the event is treated as normal; otherwise, it is treated as abnormal.
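
The detection logic described above can be sketched in a few lines of PyTorch. This is a minimal illustration of the DeepLog-style scheme, not the authors' implementation; the window size h, cutoff g, and layer sizes are illustrative choices:

```python
# Minimal sketch of a DeepLog-style detector: predict the next event type from
# a window of h previous types, and flag an event as abnormal if its true type
# is not among the model's top-g predictions.
import torch
import torch.nn as nn

NUM_TYPES, H, G = 29, 4, 3  # 29 HDFS event types; illustrative window/cutoff

class NextEventLSTM(nn.Module):
    def __init__(self, num_types=NUM_TYPES, emb=16, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(num_types + 1, emb)   # type IDs start at 1
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, num_types + 1)

    def forward(self, x):                 # x: (batch, H) integer type IDs
        h, _ = self.lstm(self.emb(x))
        return self.out(h[:, -1, :])      # logits over the next event type

def is_abnormal(model, window, next_type, g=G):
    with torch.no_grad():
        logits = model(torch.tensor([window]))
        topg = logits.topk(g, dim=-1).indices[0].tolist()
    return next_type not in topg

seq = [22, 5, 5, 5, 11, 9, 11, 9, 11, 9, 26, 26, 26]
model = NextEventLSTM()   # untrained here; train on the {window -> next} pairs
print(is_abnormal(model, seq[:H], seq[H]))
```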

Using the four-phase workflow framework to summarize some representative research works

In this subsection, we use the four-phase workflow framework to summarize two representative works for each security problem. System security includes many research subtopics, but not every topic is suitable for deep-learning-based methods, due to intrinsic characteristics. Among the subjects that can be combined with deep learning, some have undergone intensive research in recent years, while others are just emerging. We notice that there are five mainstream research directions in system security. This paper mainly focuses on system security, so other mainstream research directions (e.g., deepfake) are out of scope. Therefore, we cover these five widely noticed research directions and three emerging research directions in our survey:

In security-oriented program analysis, malware classification (MC), system-event-based anomaly detection (SEAD), memory forensics (MF), and defending network attacks, deep learning based methods have already undergone intensive research.

In defending return-oriented programming (ROP) attacks, Control-flow integrity (CFI), and fuzzing, deep learning based methods are emerging research topics.

We select two representative works for each research topic in our survey. Our criteria for selecting papers mainly include: 1) Pioneer (one of the first papers in the field); 2) Top (published at a top conference or in a top journal); 3) Novelty; 4) Citation (the paper is highly cited); 5) Effectiveness (the paper's results are strong); 6) Representative (the paper is a representative work for a branch of the research direction). Table  1 lists the reasons why we chose each paper, ordered according to their importance.

The summary is shown in Table  2 . There are three columns in the table. In the first column, we list the eight security problems, including security-oriented program analysis, defending return-oriented programming (ROP) attacks, control-flow integrity (CFI), defending network attacks (NA), malware classification (MC), system-event-based anomaly detection (SEAD), memory forensics (MF), and fuzzing for software security. In the second column, we list the two very recent representative works for each security problem. In the “Summary” column, we sequentially describe how the four phases are deployed in each work; then, we list the evaluation results for each work in terms of accuracy (ACC), precision (PRC), recall (REC), F1 score (F1), false-positive rate (FPR), and false-negative rate (FNR), respectively.
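
For reference, all six metrics can be computed from the four confusion-matrix counts; the small helper below, with made-up counts, shows the standard formulas:

```python
# Evaluation metrics used in the "Summary" column, computed from raw
# confusion-matrix counts (true/false positives and negatives).
def metrics(tp, tn, fp, fn):
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "PRC": tp / (tp + fp),
        "REC": tp / (tp + fn),
        "F1":  tp / (tp + 0.5 * (fp + fn)),
        "FPR": fp / (fp + tn),
        "FNR": fn / (fn + tp),
    }

print(metrics(tp=90, tn=880, fp=20, fn=10))  # illustrative counts
```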

Methodology for reviewing the existing works

Data representation (or feature engineering) plays an important role in solving security problems with Deep Learning. This is because data representation is a way to take advantage of human ingenuity and prior knowledge to extract and organize the discriminative information from the data. Much of the effort in deploying machine learning algorithms in the security domain actually goes into the design of preprocessing pipelines and data transformations that result in a representation of the data that can support effective machine learning.

In order to expand the scope and ease of applicability of machine learning in the security domain, it would be highly desirable to find a proper way to represent security data, one that can disentangle the different explanatory factors of variation behind the data. To let this survey adequately reflect the important role played by data representation, our review focuses on how the following three questions are answered by the existing works:

Question 1: Is Phase II pervasively done in the literature? When Phase II is skipped in a work, are there any particular reasons?

Question 2: Is Phase III employed in the literature? When Phase III is skipped in a work, are there any particular reasons?

Question 3: When solving different security problems, is there any commonality in terms of the (types of) classifiers learned in Phase IV? Among the works solving the same security problem, is there dissimilarity in terms of classifiers learned in Phase IV?

To group the Phase III methods of different applications of Deep Learning in solving the same security problem, we introduce a classification tree as shown in Fig.  2 . The classification tree categorizes the Phase III methods in our selected survey works into four classes. First, class 1 includes the Phase III methods which do not consider representation learning. Second, class 2 includes the Phase III methods which consider representation learning but do not adopt it. Third, class 3 includes the Phase III methods which consider and adopt representation learning but do not compare the performance with other methods. Finally, class 4 includes the Phase III methods which consider and adopt representation learning and compare the performance with other methods.

figure 2

Classification tree for different Phase III methods. Here, consideration , adoption , and comparison indicate that a work considers Phase III, adopts Phase III and makes comparison with other methods, respectively
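
The taxonomy can be read as three yes/no questions about a work's treatment of Phase III; the following illustrative snippet encodes the four classes of Fig.  2 :

```python
# Illustrative encoding of the four-class taxonomy: a surveyed work is placed
# by whether it considers, adopts, and compares its representation learning.
def phase3_class(considers: bool, adopts: bool, compares: bool) -> int:
    if not considers:
        return 1   # representation learning not considered
    if not adopts:
        return 2   # considered but not adopted
    return 4 if compares else 3   # adopted; class 4 if also compared

print(phase3_class(True, True, True))   # -> 4
```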

In the remainder of this paper, we take a closer look at how each of the eight security problems is being solved by applications of Deep Learning in the literature.

A closer look at applications of deep learning in solving security-oriented program analysis challenges

In recent years, security-oriented program analysis has been widely used in software security. For example, symbolic execution and taint analysis are used to discover, detect and analyze vulnerabilities in programs. Control flow analysis, data flow analysis and pointer/alias analysis are important components when enforcing many security strategies, such as control flow integrity, data flow integrity and dangling pointer elimination. Reverse engineering is used by defenders and attackers to understand the logic of a program without source code.

In security-oriented program analysis, there are many open problems, such as precise pointer/alias analysis, accurate and complete reverse engineering, complex constraint solving, program de-obfuscation, and so on. Some problems have been theoretically proven to be NP-hard, and others still need a lot of human effort to solve. All of them require substantial domain knowledge and expert experience to develop better solutions. Essentially speaking, the main challenges when solving them through traditional approaches are due to the sophisticated rules between the features and labels, which may change in different contexts. Therefore, on the one hand, it takes a large quantity of human effort to develop rules to solve the problems; on the other hand, even the most experienced expert cannot guarantee completeness. Fortunately, deep learning methods are adept at finding relations between features and labels if given a large amount of training data. They can quickly and comprehensively find all the relations if the training samples are representative and effectively encoded.

In this section, we review four very recent representative works that use Deep Learning for security-oriented program analysis. We observed that they focused on different goals. Shin et al. designed a model ( Shin et al. 2015 ) to identify function boundaries. EKLAVYA ( Chua et al. 2017 ) was developed to learn function types. Gemini ( Xu et al. 2017 ) was proposed to detect similarity among functions. DEEPVSA ( Guo et al. 2019 ) was designed to learn the memory region of an indirect addressing from the code sequence. Among these works, we select two representative works ( Shin et al. 2015 ; Chua et al. 2017 ) and summarize the analysis results in Table  2 in detail.

Our review will be centered around the three questions described in the “ Methodology for reviewing the existing works ” section. In the remainder of this section, we first provide a set of observations, then the indications, and finally some general remarks.

Key findings from a closer look

From a close look at the very recent applications using Deep Learning for solving security-oriented program analysis challenges, we observed the following:

Observation 3.1: All of the works in our survey used binary files as their raw data. Phase II in our survey had one similar and straightforward goal – extracting code sequences from the binary. The difference among them was that the code sequence was extracted directly from the binary file when solving problems in static program analysis, while it was extracted from the program execution when solving problems in dynamic program analysis.

*Observation 3.2: Most data representation methods generally took into account the domain knowledge.

Most data representation methods generally took into account the domain knowledge, i.e., what kind of information they wanted to preserve when processing their data. Note that the feature selection has a wide influence on Phase II and Phase III, for example, on embedding granularities and representation learning methods. Gemini ( Xu et al. 2017 ) selected function-level features and the other works in our survey selected instruction-level features. To be specific, all the works except Gemini ( Xu et al. 2017 ) vectorized the code sequence at the instruction level.

Observation 3.3: To better support data representation for high performance, some works adopted representation learning.

For instance, DEEPVSA ( Guo et al. 2019 ) employed a representation learning method, i.e., bi-directional LSTM, to learn data dependency within instructions. EKLAVYA ( Chua et al. 2017 ) adopted a representation learning method, i.e., the word2vec technique, to extract inter-instruction information. It is worth noting that Gemini ( Xu et al. 2017 ) adopts the Structure2vec embedding network in its siamese architecture in Phase IV (see details in Observation 3.7). The Structure2vec embedding network learned information from an attributed control flow graph.
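
As an illustration of the word2vec idea, one can treat each instruction as a “word” and each function's instruction stream as a “sentence”. The sketch below (with hypothetical tokenized instructions, not EKLAVYA's actual pipeline) assumes gensim 4.x:

```python
# Minimal sketch of learning instruction embeddings with word2vec.
from gensim.models import Word2Vec

# Hypothetical tokenized instruction streams, one list per function.
functions = [
    ["push rbp", "mov rbp,rsp", "sub rsp,0x10", "call foo", "leave", "ret"],
    ["push rbp", "mov rbp,rsp", "xor eax,eax", "pop rbp", "ret"],
]

model = Word2Vec(sentences=functions, vector_size=32, window=2, min_count=1, sg=1)
vec = model.wv["ret"]   # 32-dimensional embedding for the "ret" instruction
print(model.wv.most_similar("push rbp", topn=2))
```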

Observation 3.4: According to our taxonomy, most works in our survey were classified into class 4.

To compare the Phase III methods, we introduced a classification tree with three layers, as shown in Fig.  2 , to group different works into four categories. The classification tree grouped our surveyed works into four classes according to whether they considered representation learning, whether they adopted representation learning, and whether they compared their methods with others’, respectively, when designing their frameworks. According to our taxonomy, EKLAVYA ( Chua et al. 2017 ) and DEEPVSA ( Guo et al. 2019 ) were grouped into class 4 shown in Fig.  2 . Also, Gemini’s work ( Xu et al. 2017 ) and Shin et al.’s work ( Shin et al. 2015 ) belonged to class 1 and class 2 shown in Fig.  2 , respectively.

Observation 3.5: All the works in our survey explain why they adopted or did not adopt a representation learning algorithm.

Two works in our survey adopted representation learning for different reasons: to enhance the model’s ability to generalize ( Chua et al. 2017 ), and to learn the dependency within instructions ( Guo et al. 2019 ). It is worth noting that Shin et al. did not adopt representation learning because they wanted to preserve the “attractive” feature of neural networks over other machine learning methods – simplicity. As they stated, “first, neural networks can learn directly from the original representation with minimal preprocessing (or “feature engineering”) needed” and “second, neural networks can learn end-to-end, where each of its constituent stages are trained simultaneously in order to best solve the end goal.” Although Gemini ( Xu et al. 2017 ) did not adopt representation learning when processing its raw data, the Deep Learning models in its siamese structure consisted of two graph embedding networks and one cosine function.

*Observation 3.6: The analysis results showed that a suitable representation learning method could improve the accuracy of Deep Learning models.

DEEPVSA ( Guo et al. 2019 ) designed a series of experiments to evaluate the effectiveness of its representation learning method. By combining it with domain knowledge, EKLAVYA ( Chua et al. 2017 ) employed t-SNE plots and analogical reasoning to explain the effectiveness of its representation learning method in an intuitive way.

Observation 3.7: Various Phase IV methods were used.

In Phase IV, Gemini ( Xu et al. 2017 ) adopted a siamese architecture model which consisted of two Structure2vec embedding networks and one cosine function. The siamese architecture took two functions as its input and produced a similarity score as the output. The other three works ( Shin et al. 2015 ; Chua et al. 2017 ; Guo et al. 2019 ) adopted bi-directional RNN, RNN, and bi-directional LSTM, respectively. Shin et al. adopted bi-directional RNN because they wanted to combine both past and future information in making a prediction for the present instruction ( Shin et al. 2015 ). DEEPVSA ( Guo et al. 2019 ) adopted bi-directional LSTM to enable their model to infer memory regions in both forward and backward directions.

The above observations seem to lead to the following indications:

Indication 3.1: Phase III is not always necessary.

Not all authors regard representation learning as a good choice, even though some case experiments show that representation learning can improve the final results. They value the simplicity of Deep Learning methods more, and suppose that the adoption of representation learning weakens that simplicity.

Indication 3.2: Even though the ultimate objective of Phase III in the four surveyed works is to train a model with better accuracy, they have different specific motivations as described in Observation 3.5.

When authors choose representation learning, they usually try to convince people the effectiveness of their choice by empirical or theoretical analysis.

*Indication 3.3: Observation 3.7 indicates that authors usually refer to the domain knowledge when designing the architecture of the Deep Learning model.

For instance, the works we reviewed commonly adopt bi-directional RNN when their prediction is partly based on future information in the data sequence.

Despite the effectiveness and agility of deep learning-based methods, there are still challenges in developing a scheme with high accuracy, due to the hierarchical data structure, the large amount of noise, and the unbalanced data composition in program analysis. For instance, an instruction sequence, a typical data sample in program analysis, contains a three-level hierarchy: sequence–instruction–opcode/operand. To make things worse, each level may contain many different structures, e.g., one-operand instructions and multi-operand instructions, which makes it harder to encode the training data.

A closer look at applications of deep learning in defending ROP attacks

Return-oriented programming (ROP) attack is one of the most dangerous code reuse attacks, which allows attackers to launch control-flow hijacking attacks without injecting any malicious code. Rather, it leverages particular instruction sequences (called “gadgets”) widely existing in the program space to achieve Turing-complete attacks ( Shacham and et al. 2007 ). Gadgets are instruction sequences that end with a RET instruction. Therefore, they can be chained together by specifying the return addresses on the program stack. Many traditional techniques could be used to detect ROP attacks, such as control-flow integrity (CFI) ( Abadi et al. 2009 ), but many of them either have low detection rates or high runtime overhead. ROP payloads do not contain any code. In other words, analyzing a ROP payload without the context of the program’s memory dump is meaningless. Thus, the most popular way of detecting and preventing ROP attacks is control-flow integrity. The challenge after acquiring the instruction sequences is that it is hard to recognize whether the control flow is normal. Traditional methods use the control flow graph (CFG) to identify whether the control flow is normal, but attackers can design instruction sequences which follow the normal control flow defined by the CFG. In essence, it is very hard to design a CFG that excludes every single possible combination of instructions that can be used to launch ROP attacks. Therefore, using data-driven methods could help eliminate such problems.

In this section, we review three very recent representative works that use Deep Learning for defending ROP attacks: ROPNN ( Li et al. 2018 ), HeNet ( Chen et al. 2018 ) and DeepCheck ( Zhang et al. 2019 ). ROPNN ( Li et al. 2018 ) aims to detect ROP attacks, HeNet ( Chen et al. 2018 ) aims to detect malware using CFI, and DeepCheck ( Zhang et al. 2019 ) aims at detecting all kinds of code reuse attacks.

Specifically, ROPNN protects a single program at a time, and its training data are generated from real-world programs along with their execution. Firstly, it generates its benign and malicious data by “chaining up” normally executed instruction sequences and by “chaining up” gadgets with the help of a gadget generation tool, respectively, after the memory dumps of programs are created. Each data sample is a byte-level instruction sequence labeled as “benign” or “malicious”. Secondly, ROPNN is trained using both malicious and benign data. Thirdly, the trained model is deployed to a target machine. After the protected program starts, the executed instruction sequences are traced and fed into the trained model, and the protected program is terminated once the model finds that the instruction sequences are likely to be malicious.

HeNet is also proposed to protect a single program. Its malicious data and benign data are generated by collecting trace data through Intel PT from malware and normal software, respectively. Besides, HeNet preprocesses its dataset and shapes each data sample in the format of an image, so that it can implement transfer learning from a model pre-trained on ImageNet. Then, HeNet is trained and deployed on machines with Intel PT support to collect and classify the program’s execution trace online.

The training data for DeepCheck are acquired from CFGs, which are constructed by disassembling the programs and using the information from Intel PT. After the CFG for a protected program is constructed, the authors sample benign instruction sequences by chaining up basic blocks that are connected by edges, and sample malicious instruction sequences by chaining up those that are not connected by edges. Although a CFG is needed during training, there is no need to construct a CFG after the training phase. After deployment, instruction sequences are constructed by leveraging Intel PT on the protected program. Then the trained model classifies whether the instruction sequences are malicious or benign.

We observed that none of the works considered Phase III, so all of them belong to class 1 according to our taxonomy, as shown in Fig.  2 . The analysis results of ROPNN ( Li et al. 2018 ) and HeNet ( Chen et al. 2018 ) are shown in Table  2 . Also, we observed that the three works had different goals.

From a close look at the very recent applications using Deep Learning for defending return-oriented programming attacks, we observed the following:

Observation 4.1: All the works ( Li et al. 2018 ; Zhang et al. 2019 ; Chen et al. 2018 ) in this survey focused on data generation and acquisition.

In ROPNN ( Li et al. 2018 ), malicious samples (gadget chains) were generated using an automated gadget generator (i.e. ROPGadget ( Salwant 2015 )) and a CPU emulator (i.e. Unicorn ( Unicorn-The ultimate CPU emulator 2015 )). ROPGadget was used to extract instruction sequences that could be used as gadgets from a program, and Unicorn was used to validate the instruction sequences. Corresponding benign samples (gadget-chain-like instruction sequences) were generated by disassembling a set of programs. DeepCheck ( Zhang et al. 2019 ) refers to the key idea of control-flow integrity ( Abadi et al. 2009 ). It generates a program’s run-time control flow through a new feature of Intel CPUs (Intel Processor Tracing), then compares the run-time control flow with the program’s control-flow graph (CFG) generated through static analysis. Benign instruction sequences are those within the program’s CFG, and vice versa. In HeNet ( Chen et al. 2018 ), the program’s execution trace was extracted in a similar way as in DeepCheck. Then, each byte was transformed into a pixel with an intensity between 0-255. Known malware samples and benign software samples were used to generate malicious data and benign data, respectively.

Observation 4.2: None of the ROP works in this survey deployed Phase III.

Both ROPNN ( Li et al. 2018 ) and DeepCheck ( Zhang et al. 2019 ) used binary instruction sequences for training. In ROPNN ( Li et al. 2018 ), one byte was used as the very basic element for data pre-processing. Bytes were formed into one-hot matrices and flattened for a 1-dimensional convolutional layer. In DeepCheck ( Zhang et al. 2019 ), a half-byte was used as the basic unit. Each half-byte (4 bits) was transformed to decimal form, ranging from 0-15, as the basic element of the input vector, and then fed into a fully-connected input layer. On the other hand, HeNet ( Chen et al. 2018 ) used a different kind of data. By the time this survey was drafted, the source code of HeNet was not available to the public, and thus the details of the data pre-processing could not be investigated. However, it is still clear that HeNet used binary branch information collected from Intel PT rather than binary instructions. In HeNet, each byte was converted to one decimal number ranging from 0 to 255. Byte sequences were sliced and formed into image sequences (each pixel represented one byte) for a fully-connected input layer.
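
The byte-level one-hot encoding described for ROPNN can be sketched as follows; the fixed sequence length and the example gadget bytes are illustrative, not taken from the paper:

```python
# Minimal numpy sketch: each byte of an instruction sequence becomes a
# 256-dimensional one-hot row, yielding a (sequence_length, 256) matrix
# suitable for flattening or for a 1-D convolutional layer.
import numpy as np

def bytes_to_one_hot(seq: bytes, length: int = 128) -> np.ndarray:
    seq = seq[:length].ljust(length, b"\x00")   # truncate/pad to a fixed length
    mat = np.zeros((length, 256), dtype=np.float32)
    mat[np.arange(length), list(seq)] = 1.0     # set the column of each byte value
    return mat

sample = bytes.fromhex("5dc358c35fc3")  # e.g., pop/ret gadget bytes
x = bytes_to_one_hot(sample)
print(x.shape)   # (128, 256)
```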

Observation 4.3: Fully-connected neural network was widely used.

Only ROPNN ( Li et al. 2018 ) used a 1-dimensional convolutional neural network (CNN) for extracting features. Both HeNet ( Chen et al. 2018 ) and DeepCheck ( Zhang et al. 2019 ) used fully-connected neural networks (FCN). None of the works used recurrent neural networks (RNN) or their variants.

Indication 4.1: It seems that the most important factors in the ROP problem are feature selection and data generation.

All three works use very different methods to collect/generate data, and all the authors provide very strong evidence and/or arguments to justify their approaches. ROPNN ( Li et al. 2018 ) was trained on malicious and benign instruction sequences. However, there is no clear boundary between benign instruction sequences and malicious gadget chains. This weakness may impair the performance when applying ROPNN to real-world ROP attacks. As opposed to ROPNN, DeepCheck ( Zhang et al. 2019 ) utilizes the CFG to generate training basic-block sequences. However, since the malicious basic-block sequences are generated by randomly connecting nodes without edges, it is not guaranteed that all the malicious basic-block sequences are executable. HeNet ( Chen et al. 2018 ) generates its training data from malware. Technically, HeNet could be used to detect any binary exploit, but its experiment focuses on ROP attacks and achieves 100% accuracy. This shows that the source of data in the ROP problem does not need to be related to ROP attacks to produce very impressive results.

Indication 4.2: Representation learning seems not critical when solving ROP problems using Deep Learning.

Minimal processing on data in binary form seems to be enough to transform the data into a representation that is suitable for neural networks. Certainly, it is also possible to represent the binary instructions at a higher level, such as opcodes, or to use embedding learning. However, as stated in ( Li et al. 2018 ), it appears that the performance will not change much by doing so. The only benefit of representing the input data at a higher level is to reduce irrelevant information, but it seems that a neural network by itself is good enough at extracting features.

Indication 4.3: Different neural network architectures do not have much influence on the effectiveness of defending ROP attacks.

Both HeNet ( Chen et al. 2018 ) and DeepCheck ( Zhang et al. 2019 ) utilize standard DNNs and achieved comparable results on ROP problems. One can infer that the input data can be easily processed by neural networks, and the features can be easily detected after proper pre-processing.

It is not surprising that researchers are not very interested in representation learning for ROP problems, as stated in Observation 4.2. Since ROP attacks focus on gadget chains, it is straightforward for researchers to choose the gadgets as their training data directly. It is easy to map the data into a numerical representation with minimal processing. For example, one can map a binary executable to a hexadecimal ASCII representation, which could be a good representation for a neural network.

Instead, researchers focus more on data acquisition and generation. In ROP problems, the amount of data is very limited. Unlike malware and logs, ROP payloads normally only contain addresses rather than code, and these addresses carry no information without the instructions at the corresponding addresses. It is thus meaningless to collect only the payloads. To the best of our knowledge, all the previous works pick instruction sequences rather than payloads as their training data, even though they are hard to collect.

Even though Deep Learning based methods no longer face the challenge of designing a very complex, fine-grained CFG, they suffer from a limited number of data sources. Generally, Deep Learning based methods require lots of training data. However, real-world malicious data for ROP attacks are very hard to find, because, compared with benign data, malicious data need to be carefully crafted and there is no existing database that collects all the ROP attacks. Without a sufficiently representative training set, the accuracy of the trained model cannot be guaranteed.

A closer look at applications of deep learning in achieving CFI

The basic ideas of control-flow integrity (CFI) techniques, proposed by Abadi in 2005 ( Abadi et al. 2009 ), can be traced back to 2002, when Kiriansky and his fellow researchers proposed an idea called program shepherding ( Kiriansky et al. 2002 ), a method of monitoring the execution flow of a program at run time by enforcing some security policies. The goal of CFI is to detect and prevent control-flow hijacking attacks by restricting every critical control flow transfer to a set that can only appear in correct program executions, according to a pre-built CFG. Traditional CFI techniques typically leverage some knowledge, gained from either dynamic or static analysis of the target program, combined with code instrumentation methods, to ensure the program runs on a correct track.

However, the problems of traditional CFI are: (1) existing CFI implementations are not compatible with some important code features ( Xu et al. 2019 ); (2) CFGs generated by static, dynamic or combined analysis cannot always be precisely completed due to some open problems ( Horwitz 1997 ); (3) there always exists a certain level of compromise between accuracy, performance overhead, and other important properties ( Tan and Jaeger 2017 ; Wang and Liu 2019 ). Recent research has proposed to apply Deep Learning to detecting control flow violations. The results show that, compared with traditional CFI implementations, the security coverage and scalability are enhanced in such a fashion ( Yagemann et al. 2019 ). Therefore, we argue that Deep Learning could be another approach which deserves more attention from CFI researchers who aim at achieving control-flow integrity more efficiently and accurately.

In this section, we review three very recent representative papers that use Deep Learning for achieving CFI. Among the three, two representative papers ( Yagemann et al. 2019 ; Phan et al. 2017 ) are already summarized phase-by-phase in Table  2 . We refer interested readers to Table  2 for a concise overview of those two papers.

Our review will be centered around the three questions described in Section 3 . In the remainder of this section, we first provide a set of observations, then the indications, and finally some general remarks.

From a close look at the very recent applications using Deep Learning for achieving control-flow integrity, we observed the following:

Observation 5.1: None of the related works realizes preventive blocking of control flow violations.

After doing a thorough literature search, we observed that security researchers are quite behind the trend of applying Deep Learning techniques to solve security problems. Only one paper was found by us that uses Deep Learning techniques to directly enhance the performance of CFI ( Yagemann et al. 2019 ). This paper leveraged Deep Learning to detect document malware by checking a program’s execution traces generated by hardware. Specifically, the CFI violations were checked in an offline mode. So far, no works have realized Just-In-Time checking of a program’s control flow.

In order to provide more insightful results, in this section, we try not to narrow down our focus to CFI works detecting attacks at run-time, but to extend our scope to papers that make good use of control flow related data combined with Deep Learning techniques ( Phan et al. 2017 ; Nguyen et al. 2018 ). In one work, researchers used a self-constructed instruction-level CFG to detect program defects ( Phan et al. 2017 ). In another work, researchers used a lazy-binding CFG to detect sophisticated malware ( Nguyen et al. 2018 ).

Observation 5.2: Diverse raw data were used for evaluating CFI solutions.

In all surveyed papers, two kinds of control flow related data are used: program instruction sequences and CFGs. Barnum et al. ( Yagemann et al. 2019 ) employed statically and dynamically generated instruction sequences acquired by program disassembling and Intel ® Processor Trace. CNNoverCFG ( Phan et al. 2017 ) used a self-designed algorithm to construct an instruction-level control-flow graph. Minh Hai Nguyen et al. ( Nguyen et al. 2018 ) used their proposed lazy-binding CFG to reflect the behavior of malware DEC.

Observation 5.3: All the papers in our survey adopted Phase II.

All the related papers in our survey employed Phase II to process their raw data before sending them into Phase III. In Barnum ( Yagemann et al. 2019 ), the instruction sequences from program run-time tracing were sliced into basic blocks. Then, each basic block was assigned a unique basic-block ID (BBID). Finally, due to the nature of control-flow hijacking attacks, the sequences ending with an indirect branch instruction (e.g., indirect call/jump, return, and so on) were selected as the training data. In CNNoverCFG ( Phan et al. 2017 ), each instruction in the CFG was labeled with its attributes from multiple perspectives, such as opcode, operands, and the function it belongs to. The training data are sequences generated by traversing the attributed control-flow graph. Nguyen and others ( Nguyen et al. 2018 ) converted the lazy-binding CFG to the corresponding adjacency matrix and treated the matrix as an image, which served as their training data.
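
The CFG-to-image idea at the end of this observation can be sketched as follows; the toy CFG below is illustrative, not a lazy-binding CFG from the paper:

```python
# Minimal sketch: turn a control-flow graph into its adjacency matrix and
# treat that matrix as a single-channel "image" for a CNN.
import networkx as nx
import numpy as np

cfg = nx.DiGraph()
cfg.add_edges_from([("entry", "bb1"), ("bb1", "bb2"), ("bb1", "bb3"), ("bb3", "exit")])

adj = nx.to_numpy_array(cfg, nodelist=sorted(cfg.nodes()))  # (N, N) 0/1 matrix
img = adj[np.newaxis, ...].astype(np.float32)               # add a channel axis: (1, N, N)
print(img.shape)
```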

Observation 5.4: None of the papers in our survey adopted Phase III. We observed that all the papers we surveyed did not adopt Phase III. Instead, they directly used numerical representations as their training data. Specifically, Barnum ( Yagemann et al. 2019 ) grouped the instructions into basic blocks, then represented the basic blocks by uniquely assigned IDs. In CNNoverCFG ( Phan et al. 2017 ), each instruction in the CFG was represented by a vector associated with its attributes. Nguyen and others directly used the hashed value of a bit-string representation.

Observation 5.5: Various Phase IV models were used. Barnum ( Yagemann et al. 2019 ) utilized the BBID sequence to monitor the execution flow of the target program, which is sequence-type data. Therefore, they chose an LSTM architecture to better learn the relationship between instructions. In the other two papers ( Phan et al. 2017 ; Nguyen et al. 2018 ), the authors trained a CNN and a directed graph-based CNN to extract information from the control-flow graph and the image, respectively.

Indication 5.1: None of the existing works achieved Just-In-Time CFI violation detection.

It is still a challenge to tightly embed a Deep Learning model in program execution. All existing works adopted lazy checking – checking the program’s execution trace after its execution.

Indication 5.2: There is no unified opinion on how to generate malicious samples.

Data are hard to collect for control-flow hijacking attacks. Researchers must carefully craft malicious samples. It is not clear whether such “handcrafted” samples can reflect the nature of control-flow hijacking attacks.

Indication 5.3: The choice of methods in Phase II is based on researchers’ security domain knowledge.

The strength of using deep learning to solve CFI problems is that it can avoid the complicated process of developing algorithms to build acceptable CFGs for the protected programs. Compared with traditional approaches, a DL based method spares the CFI designer from studying the language features of the targeted program and also avoids an open problem (pointer analysis) in control flow analysis. Therefore, DL based CFI provides us a more generalized, scalable, and secure solution. However, since using DL for the CFI problem is still at an early stage, which kinds of control-flow related data are more effective is still unclear in this research area. Additionally, applying DL to real-time control-flow violation detection remains an untouched area and needs further research.

A closer look at applications of deep learning in defending network attacks

Network security is becoming more and more important as we depend more and more on networks for our daily lives, work, and research. Some common network attack types include probe, denial of service (DoS), remote-to-local (R2L), etc. Traditionally, people try to detect those attacks using signatures, rules, and unsupervised anomaly detection algorithms. However, signature based methods can be easily fooled by slightly changing the attack payload; rule based methods need experts to regularly update rules; and unsupervised anomaly detection algorithms tend to raise lots of false positives. Recently, people have been trying to apply Deep Learning methods to network attack detection.

In this section, we review seven very recent representative works that use Deep Learning for defending network attacks. Millar et al. (2018 ); Varenne et al. (2019 ); Ustebay et al. (2019 ) build neural networks for multi-class classification, whose class labels include one benign label and multiple malicious labels for different attack types. Zhang et al. (2019 ) ignores normal network activities and proposes a parallel cross convolutional neural network (PCCN) to classify the type of malicious network activities. Yuan et al. (2017 ) applies Deep Learning to detecting a specific attack type, the distributed denial of service (DDoS) attack. Yin et al. (2017 ); Faker and Dogdu (2019 ) explore both binary classification and multi-class classification for benign and malicious activities. Among these seven works, we select two representative works ( Millar et al. 2018 ; Zhang et al. 2019 ) and summarize the main aspects of their approaches regarding whether the four phases exist in their works, and what exactly is done in each phase if it exists. We direct interested readers to Table  2 for a concise overview of these two works.

From a close look at the very recent applications using Deep Learning for solving network attack challenges, we observed the following:

Observation 6.1: All the seven works in our survey used public datasets, such as UNSW-NB15 ( Moustafa and Slay 2015 ) and CICIDS2017 ( IDS 2017 Datasets 2019 ).

The public datasets were all generated in test-bed environments, with unbalanced simulated benign and attack activities. For attack activities, the dataset providers launched multiple types of attacks, and the numbers of malicious data for those attack activities were also unbalanced.

Observation 6.2: The public datasets were given in one of two data formats, i.e., PCAP and CSV.

One was raw PCAP or parsed CSV format, containing network packet level features; the other was also CSV format, containing network flow level features, which showed the statistical information of many network packets. Out of all the seven works, ( Yuan et al. 2017 ; Varenne et al. 2019 ) used packet information as raw inputs, ( Yin et al. 2017 ; Zhang et al. 2019 ; Ustebay et al. 2019 ; Faker and Dogdu 2019 ) used flow information as raw inputs, and ( Millar et al. 2018 ) explored both cases.

Observation 6.3: In order to parse the raw inputs, preprocessing methods, including one-hot vectors for categorical texts, normalization on numeric data, and removal of unused features/data samples, were commonly used.

Commonly removed features include IP addresses and timestamps. Faker and Dogdu (2019 ) also removed port numbers from the used features. By doing this, they claimed that they could “avoid over-fitting and let the neural network learn characteristics of packets themselves”. One outlier was that, when using packet level features in one experiment, ( Millar et al. 2018 ) blindly chose the first 50 bytes of each network packet without any feature extracting process and fed them into the neural network.
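
A minimal sketch of this kind of preprocessing, using pandas and scikit-learn on hypothetical flow columns (not a real dataset's schema), looks as follows:

```python
# Drop identifier-like columns, one-hot encode categorical fields, and
# min-max normalize numeric fields before feeding a neural network.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "src_ip": ["10.0.0.1", "10.0.0.2"],   # identifier: commonly removed
    "proto": ["tcp", "udp"],              # categorical text
    "duration": [1.2, 30.5],              # numeric features
    "bytes": [450, 98000],
})

df = df.drop(columns=["src_ip"])               # remove unused identifiers
df = pd.get_dummies(df, columns=["proto"])     # one-hot encode categorical text
num_cols = ["duration", "bytes"]
df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])
print(df)
```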

Observation 6.4: Using image representation improved the performance of security solutions using Deep Learning.

After preprocessing the raw data, while ( Zhang et al. 2019 ) transformed the data into an image representation, ( Yuan et al. 2017 ; Varenne et al. 2019 ; Faker and Dogdu 2019 ; Ustebay et al. 2019 ; Yin et al. 2017 ) directly used the original vectors as input data. Also, ( Millar et al. 2018 ) explored both cases and reported better performance using image representation.

Observation 6.5: None of the seven surveyed works considered representation learning.

All the seven surveyed works belonged to class 1 shown in Fig.  2 . They either directly used the processed vectors to feed into the neural networks, or changed the representation without explanation. One research work ( Millar et al. 2018 ) provided a comparison of two different representations (vectors and images) for the same type of raw input. However, the other works applied different preprocessing methods in Phase II. That is, since the different preprocessing methods generated different feature spaces, it was difficult to compare the experimental results.

Observation 6.6: Binary classification models showed better results in most experiments.

Among all the seven surveyed works, ( Yuan et al. 2017 ) focused on one specific attack type and only did binary classification to classify whether the network traffic was benign or malicious. Also, ( Millar et al. 2018 ; Ustebay et al. 2019 ; Zhang et al. 2019 ; Varenne et al. 2019 ) included more attack types and did multi-class classification to classify the type of malicious activities, and ( Yin et al. 2017 ; Faker and Dogdu 2019 ) explored both cases. As for multi-class classification, the accuracy for selective classes was good, while the accuracy for other classes, usually classes with much fewer data samples, suffered up to 20% degradation.

Observation 6.7: Data representation influenced the choice of neural network model.

Indication 6.1: All works in our survey adopt some kind of preprocessing method in Phase II, because the raw data provided in the public datasets are either not ready for neural networks, or of too low quality to be directly used as data samples.

Preprocessing methods can increase neural network performance by improving the quality of the data samples. Furthermore, by reducing the feature space, preprocessing can also improve the efficiency of neural network training and testing. Thus, Phase II should not be skipped; if it is, the performance of the neural network is expected to degrade considerably.

Indication 6.2: Although Phase III is not employed in any of the seven surveyed works, none of them explains why. None of them takes representation learning into consideration.

Indication 6.3: Because no work uses representation learning, its effectiveness is not well studied.

Among the other factors, the choice of preprocessing methods seems to have the largest impact, because it directly affects the data samples fed to the neural network.

Indication 6.4: There is no guarantee that CNN also works well on images converted from network features.

Some works that use the image data representation use a CNN in Phase IV. Although CNNs have proven to work well on image classification problems in recent years, there is no guarantee that a CNN also works well on images converted from network features.

From the observations and indications above, we present two recommendations. (1) Researchers can generate their own datasets for the specific network attack they want to detect. As stated, the public datasets have highly unbalanced numbers of samples across classes. Such imbalance is doubtless the nature of real-world network environments, in which normal activities are the majority, but it is not good for Deep Learning. (Varenne et al. 2019) tries to solve this problem by oversampling the malicious data, but it is better to start with a balanced dataset. (2) Representation learning should be taken into consideration. Possible ways to apply representation learning include: (a) applying the word2vec method to packet binaries and to categorical numbers and texts; (b) using K-means as a one-hot vector representation instead of randomly encoding texts. We suggest that any change of data representation be justified by explanations or comparison experiments.
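To illustrate the oversampling idea behind recommendation (1), here is a minimal sketch using scikit-learn's resample. It assumes binary labels with benign = 0 (majority) and malicious = 1 (minority), and is not the exact procedure of (Varenne et al. 2019).

```python
import numpy as np
from sklearn.utils import resample

def oversample_malicious(X, y):
    """Randomly upsample the malicious class (label 1) to match the benign class."""
    X_mal, X_ben = X[y == 1], X[y == 0]
    X_mal = resample(X_mal, replace=True, n_samples=len(X_ben), random_state=0)
    X_bal = np.vstack([X_ben, X_mal])
    y_bal = np.concatenate([np.zeros(len(X_ben)), np.ones(len(X_mal))])
    return X_bal, y_bal
```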

One critical challenge in this field is the lack of high-quality datasets suitable for applying deep learning. Also, there is no agreement on how to incorporate domain knowledge into training deep learning models for network security problems. Researchers have been using different preprocessing methods, data representations, and model types, but few of them sufficiently explain why such methods/representations/models were chosen, especially for data representation.

A closer look at applications of deep learning in malware classification

The goal of malware classification is to identify malicious behaviors in software using static and dynamic features such as control-flow graphs and system API calls. Malware and benign programs can be collected from open datasets and online websites. Both industry and academia have provided approaches to detect malware with static and dynamic analyses. Traditional methods such as behavior-based signatures, dynamic taint tracking, and static data-flow analysis require experts to manually investigate unknown files. However, such hand-crafted signatures are not sufficiently effective because attackers can rewrite and reorder the malware. Fortunately, neural networks can automatically detect large-scale malware variants with superior classification accuracy.

In this section, we review twelve recent representative works that use Deep Learning for malware classification (De La Rosa et al. 2018; Saxe and Berlin 2015; Kolosnjaji et al. 2017; McLaughlin et al. 2017; Tobiyama et al. 2016; Dahl et al. 2013; Nix and Zhang 2017; Kalash et al. 2018; Cui et al. 2018; David and Netanyahu 2015; Rosenberg et al. 2018; Xu et al. 2018). De La Rosa et al. (2018) select three different kinds of static features to classify malware. Saxe and Berlin (2015); Kolosnjaji et al. (2017); McLaughlin et al. (2017) also use static features from PE files to classify programs. (Tobiyama et al. 2016) extracts behavioral feature images using an RNN to represent the behaviors of the original programs. (Dahl et al. 2013) transforms malicious behaviors using representation learning without a neural network. Nix and Zhang (2017) explore an RNN model with API-call sequences as program features. Cui et al. (2018); Kalash et al. (2018) skip Phase II by directly transforming the binary file into an image to classify the file. (David and Netanyahu 2015; Rosenberg et al. 2018) apply dynamic features to analyze malicious behavior. Xu et al. (2018) combine static and dynamic features to represent programs. Among these works, we select two representative ones (De La Rosa et al. 2018; Rosenberg et al. 2018) and identify the four phases in their approaches, as shown in Table 2.

From a close look at the very recent applications using Deep Learning for solving malware classification challenges, we observed the following:

Observation 7.1: Features selected in malware classification were grouped into three categories: static features, dynamic features, and hybrid features.

Typical static features include metadata, PE import features, Byte/Entropy features, string features, and assembly opcode features derived from PE files (Kolosnjaji et al. 2017; McLaughlin et al. 2017; Saxe and Berlin 2015). De La Rosa et al. (2018) took three kinds of static features: byte-level features, basic features (strings in the file, the metadata table, and the import table of the PE header), and assembly-level features. Some works directly considered the binary code as static features (Cui et al. 2018; Kalash et al. 2018).

Different from static features, dynamic features were extracted by executing the files and recording their behaviors during execution. The behaviors of programs, including API function calls, their parameters, files created or deleted, and websites and ports accessed, were recorded by a sandbox as dynamic features (David and Netanyahu 2015). Process behaviors, including operation names and their result codes, were extracted in (Tobiyama et al. 2016). Process memory, tri-grams of system API calls, and one corresponding input parameter were chosen as dynamic features in (Dahl et al. 2013). An API-call sequence for an APK file was another representation of dynamic features (Nix and Zhang 2017; Rosenberg et al. 2018).

Static features and dynamic features were combined as hybrid features ( Xu et al. 2018 ). For static features, Xu and others in ( Xu et al. 2018 ) used permissions, networks, calls, and providers, etc. For dynamic features, they used system call sequences.

Observation 7.2: In most works, Phase II was inevitable because extracted features needed to be vectorized for Deep Learning models.

The one-hot encoding approach was frequently used to vectorize features (Kolosnjaji et al. 2017; McLaughlin et al. 2017; Rosenberg et al. 2018; Tobiyama et al. 2016; Nix and Zhang 2017). Bag-of-words (BoW) and n-grams were also considered to represent features (Nix and Zhang 2017). Some works brought the concept of word frequency from NLP to convert the sandbox file into fixed-size inputs (David and Netanyahu 2015). Hashing features into a fixed-size vector was used as an effective representation method (Saxe and Berlin 2015). A byte histogram and a byte-entropy histogram with a sliding-window method were considered in (De La Rosa et al. 2018). In the same work, De La Rosa and others embedded strings by hashing the ASCII strings to a fixed-size feature vector. For assembly features, they extracted four different levels of granularity: operation level (instruction flow graph), block level (control flow graph), function level (call graph), and global level (summarized graphs). Bigram, trigram, and four-gram vectors and n-gram graphs were used for the hybrid features (Xu et al. 2018).
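As an illustration of hashing features into a fixed-size vector, the following sketch maps a variable-sized set of strings (e.g., imported API names) into a fixed-length count vector. The bucket count and hash function are illustrative assumptions, not the exact configuration of Saxe and Berlin (2015).

```python
# A hedged sketch of the hashing trick for variable-sized string features.
import hashlib
import numpy as np

def hash_features(strings, n_buckets=256):
    """Map each string to a bucket via a hash and count occurrences."""
    vec = np.zeros(n_buckets, dtype=np.float32)
    for s in strings:
        idx = int(hashlib.md5(s.encode()).hexdigest(), 16) % n_buckets
        vec[idx] += 1.0
    return vec

v = hash_features(["CreateFileA", "WriteProcessMemory", "connect"])
```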

Observation 7.3: Most Phase III methods were classified into class 1.

Following the classification tree shown in Fig. 2, most works were classified into class 1, except two works (Dahl et al. 2013; Tobiyama et al. 2016), which belonged to class 3. To reduce the input dimension, Dahl et al. (2013) performed feature selection using mutual information and random projection. Tobiyama et al. generated behavioral feature images using an RNN (Tobiyama et al. 2016).

Observation 7.4: After extracting features, two kinds of neural network architectures, i.e., one single neural network and multiple neural networks with a combined loss function, were used.

Hierarchical structures, such as convolutional layers, fully connected layers, and classification layers, were used to classify programs (McLaughlin et al. 2017; Dahl et al. 2013; Nix and Zhang 2017; Saxe and Berlin 2015; Tobiyama et al. 2016; Cui et al. 2018; Kalash et al. 2018). A deep stack of denoising autoencoders was also introduced to learn program behaviors (David and Netanyahu 2015). De La Rosa and others (De La Rosa et al. 2018) trained three different models with different features to compare which static features are relevant for the classification model. Some works investigated LSTM models for sequential features (Nix and Zhang 2017; Rosenberg et al. 2018).

Two networks with different features as inputs were used for malware classification by combining their outputs with a dropout layer and an output layer (Kolosnjaji et al. 2017). In that work, one network transformed PE metadata and import features using feedforward neurons, while the other leveraged convolutional layers over opcode sequences. Lifan Xu et al. (Xu et al. 2018) constructed several networks and combined them using a two-level multiple kernel learning algorithm.
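The two-branch design can be sketched in PyTorch as follows: a feedforward branch for metadata-style features and a convolutional branch for opcode sequences, merged before a dropout layer and an output layer. All dimensions are illustrative assumptions rather than the architecture of (Kolosnjaji et al. 2017).

```python
import torch
import torch.nn as nn

class TwoBranchMalwareNet(nn.Module):
    def __init__(self, meta_dim=32, vocab=256, n_classes=2):
        super().__init__()
        # Branch 1: feedforward layer for fixed-size metadata features.
        self.meta = nn.Sequential(nn.Linear(meta_dim, 64), nn.ReLU())
        # Branch 2: embedding + 1-D convolution over opcode id sequences.
        self.embed = nn.Embedding(vocab, 16)
        self.conv = nn.Sequential(nn.Conv1d(16, 64, kernel_size=5),
                                  nn.ReLU(), nn.AdaptiveMaxPool1d(1))
        # Shared head: dropout + output layer over the concatenated branches.
        self.head = nn.Sequential(nn.Dropout(0.5), nn.Linear(128, n_classes))

    def forward(self, meta, opcodes):
        a = self.meta(meta)                                       # (B, 64)
        b = self.conv(self.embed(opcodes).transpose(1, 2))        # (B, 64, 1)
        return self.head(torch.cat([a, b.squeeze(-1)], dim=1))    # (B, n_classes)

logits = TwoBranchMalwareNet()(torch.rand(4, 32), torch.randint(0, 256, (4, 100)))
```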

Indication 7.1: Except for two works that transform binaries into images (Cui et al. 2018; Kalash et al. 2018), most surveyed works need methods to vectorize the extracted features.

The vectorization methods should not only preserve the syntactic and semantic information in the features, but also suit the input requirements of the Deep Learning model.

Indication 7.2: Only a few works have shown how to transform features using representation learning.

Because some works assume that dynamic and static sequences, such as API calls and instructions, have syntactic and semantic structure similar to natural language, representation learning techniques like word2vec may be useful in malware detection. In addition, for control-flow graphs, call graphs, and other graph representations, graph embedding is a potential method to transform those features.
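For instance, a hedged sketch of applying word2vec to API-call sequences with gensim might look as follows; the API names and hyperparameters are illustrative assumptions.

```python
# Treat each program's API-call trace as a "sentence" and learn embeddings.
from gensim.models import Word2Vec

api_sequences = [
    ["CreateFileA", "WriteFile", "CloseHandle"],
    ["socket", "connect", "send", "recv"],
]
model = Word2Vec(sentences=api_sequences, vector_size=32, window=3,
                 min_count=1, sg=1)  # skip-gram variant
vec = model.wv["connect"]  # 32-dimensional embedding of one API call
```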

Though several pieces of research have been done on malware detection using Deep Learning, it is hard to compare their methods and performance because of two uncertainties in their approaches. First, a Deep Learning model is a black box: researchers cannot detail which kinds of features the model learned or explain why their model works. Second, feature selection and representation affect the model's performance. Because they do not use the same datasets, researchers cannot prove that their approaches (including the selected features and Deep Learning model) are better than others. The reason why few researchers use open datasets is that existing open malware datasets are outdated and limited. Also, researchers need to crawl benign programs from app stores so that their raw programs are diverse.

A closer look at applications of Deep Learning in system-event-based anomaly detection

System logs record significant events at various critical points and can be used to debug a system's performance issues and failures. Moreover, log data are available in almost all computer systems and are a valuable resource for understanding system status. There are a few challenges in anomaly detection based on system logs. Firstly, the raw log data are unstructured, and their formats and semantics can vary significantly. Secondly, logs are produced by concurrently running tasks; such concurrency makes it hard to apply workflow-based anomaly detection methods. Thirdly, logs contain rich information of varied types, including text, real values, IP addresses, timestamps, and so on, and the information contained in each log also varies. Finally, every system produces massive numbers of logs, and each anomaly event usually involves a large number of logs generated over a long period.

Recently, many scholars have employed deep learning techniques (Du et al. 2017; Meng et al. 2019; Das et al. 2018; Brown et al. 2018; Zhang et al. 2019; Bertero et al. 2017) to detect anomaly events in system logs and diagnose system failures. To detect anomaly events, the raw logs usually must first be parsed into structured data; the parsed data can then be transformed into a representation that supports an effective deep learning model. Finally, anomaly events can be detected by a deep-learning-based classifier or predictor.

In this section, we review six recent representative papers that use deep learning for system-event-based anomaly detection (Du et al. 2017; Meng et al. 2019; Das et al. 2018; Brown et al. 2018; Zhang et al. 2019; Bertero et al. 2017). DeepLog (Du et al. 2017) utilizes an LSTM to model the system log as a natural-language sequence, which automatically learns log patterns from normal events and detects anomalies when log patterns deviate from the trained model. LogAnom (Meng et al. 2019) employs Word2vec to extract semantic and syntactic information from log templates; moreover, it uses sequential and quantitative features simultaneously. Das et al. (2018) use an LSTM to predict node failures in supercomputing systems from HPC logs. Brown et al. (2018) presented RNN language models augmented with attention for anomaly detection in system logs. LogRobust (Zhang et al. 2019) uses FastText to represent the semantic information of log events, which allows it to identify and handle unstable log events and sequences. Bertero et al. (2017) map log words to a high-dimensional metric space using Google's word2vec algorithm and use the result as features for classification. Among these six papers, we select two representative works (Du et al. 2017; Meng et al. 2019) and summarize the four phases of their approaches. We direct interested readers to Table 2 for a concise overview of these two works.

From a close look at the very recent applications using deep learning for solving system-event-based anomaly detection challenges, we observed the following:

Observation 8.1: Most of our surveyed works evaluated their performance using public datasets.

At the time of our survey, only two works (Das et al. 2018; Bertero et al. 2017) used private datasets.

Observation 8.2: Most works in this survey adopted Phase II when parsing the raw log data.

After reviewing the six works, we found that five (Du et al. 2017; Meng et al. 2019; Das et al. 2018; Brown et al. 2018; Zhang et al. 2019) employed parsing techniques, while only one (Bertero et al. 2017) did not.

DeepLog (Du et al. 2017) parsed the raw logs into different log types using Spell (Du and Li 2016), which is based on the longest common subsequence. Desh (Das et al. 2018) parsed the raw logs into constant messages and variable components. Loganom (Meng et al. 2019) parsed the raw logs into different log templates using FT-Tree (Zhang et al. 2017) according to the frequent combinations of log words. Andy Brown et al. (Brown et al. 2018) parsed the raw logs into word and character tokenizations. LogRobust (Zhang et al. 2019) extracted its log events by abstracting away the parameters in the messages. Bertero et al. (2017) considered logs as regular text without parsing.
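The essence of such parsing can be illustrated with a hedged regex-based sketch that abstracts variable parameters into placeholders; real parsers such as Spell and FT-Tree use more sophisticated algorithms, and the patterns below are illustrative assumptions.

```python
# Collapse raw log messages into templates by masking variable fields.
import re

def to_template(line):
    line = re.sub(r"blk_-?\d+", "<BLK>", line)                 # HDFS block ids
    line = re.sub(r"\d+\.\d+\.\d+\.\d+(:\d+)?", "<IP>", line)  # IP[:port]
    line = re.sub(r"\b\d+\b", "<NUM>", line)                   # other numbers
    return line

print(to_template("Received block blk_3587 of size 6710 from 10.0.0.7:50010"))
# -> "Received block <BLK> of size <NUM> from <IP>"
```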

Observation 8.3: Most works have considered and adopted Phase III.

Among these six works, only DeepLog represented the parsed data using one-hot vectors without learning. Moreover, Loganom (Meng et al. 2019) compared their results with DeepLog. That is, DeepLog belongs to class 1 and Loganom belongs to class 4 in Fig. 2, while the other four works fall into class 3.

Four works (Meng et al. 2019; Das et al. 2018; Zhang et al. 2019; Bertero et al. 2017) used word embedding techniques to represent the log data, while Andy Brown et al. (Brown et al. 2018) employed attention vectors to represent the log messages.

DeepLog (Du et al. 2017) employed one-hot vectors to represent the log types without learning. We conducted an experiment replacing the one-hot vectors with trained word embeddings.

Observation 8.4: Evaluation results were not compared using the same dataset.

DeepLog (Du et al. 2017) employed one-hot vectors to represent the log types without learning, i.e., it employed Phase II without Phase III. In contrast, Christophe Bertero et al. (Bertero et al. 2017) considered logs as regular text without parsing and used Phase III without Phase II. The precision of both methods is very high (greater than 95%). Unfortunately, the evaluations of the two methods used different datasets.

Observation 8.5: Most works employed LSTM in Phase IV.

Five works (Du et al. 2017; Meng et al. 2019; Das et al. 2018; Brown et al. 2018; Zhang et al. 2019) employed LSTM in Phase IV, while Bertero et al. (2017) tried different classifiers, including naive Bayes, neural networks, and random forests.

Indication 8.1: Phase II has a positive effect on accuracy if well designed.

Since Bertero et al. (2017) consider logs as regular text without parsing, we can say that Phase II is not strictly required. However, most of the scholars employed parsing techniques to extract structural information and remove useless noise.

Indication 8.2: Most of the recent works use trained representations to represent parsed data.

As shown in Table 3, Phase III is very useful and can improve detection accuracy.

Indication 8.3: Phase II and Phase III cannot be skipped simultaneously.

Neither Phase II nor Phase III is individually required. However, all methods employed Phase II or Phase III.

Indication 8.4: Observation 8.3 indicates that trained word embeddings can improve anomaly detection accuracy, as shown in Table 3.

Indication 8.5: Observation 8.5 indicates that most of the works adopt LSTM to detect anomaly events.

Most of the works adopt LSTM to detect anomaly events, since log data can be considered as sequences and there can be lags of unknown duration between important events in a time series. LSTM has feedback connections, so it can process not only single data points but also entire sequences of data.
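A hedged PyTorch sketch of this LSTM-based idea, in the spirit of DeepLog, is shown below: the model predicts the next log key from a window of previous keys, and an entry is flagged anomalous when the observed key is not among the top-k predictions. Dimensions and the top-k rule are illustrative assumptions.

```python
import torch
import torch.nn as nn

class NextLogKeyLSTM(nn.Module):
    def __init__(self, n_keys, emb=32, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(n_keys, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_keys)

    def forward(self, window):              # window: (B, T) ids of past log keys
        h, _ = self.lstm(self.embed(window))
        return self.out(h[:, -1, :])        # logits over the next log key

def is_anomalous(model, window, observed_key, k=5):
    """Flag the observed key if it is not among the model's top-k predictions."""
    topk = model(window).topk(k, dim=1).indices[0].tolist()
    return observed_key not in topk
```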

In our view, neither Phase II nor Phase III is individually required in system-event-based anomaly detection. However, Phase II can remove noise in the raw data, and Phase III can learn a proper representation of the data; both have a positive effect on anomaly detection accuracy. Since event logs are text data that cannot be fed into a deep learning model directly, Phase II and Phase III cannot be skipped simultaneously.

Deep learning can capture the potentially nonlinear and high-dimensional dependencies among log entries in the training data that correspond to abnormal events, and in that way it can alleviate the challenges mentioned above. However, it still faces several challenges of its own, for example, how to represent unstructured data accurately and automatically without human knowledge.

A closer look at applications of deep learning in solving memory forensics challenges

In the field of computer security, memory forensics is the security-oriented forensic analysis of a computer's memory dump. Memory forensics can be conducted against OS kernels, user-level applications, and mobile devices. Memory forensics outperforms traditional disk-based forensics because, although secrecy attacks can erase their footprints on disk, they still have to appear in memory (Song et al. 2018). A memory dump can be considered a sequence of bytes, so memory forensics usually needs to extract security-relevant semantic information from the raw memory dump to find attack traces.

Traditional memory forensic tools fall into two categories: signature scanning and data structure traversal. These traditional methods have several limitations. Firstly, they need expert knowledge of the related data structures to create signatures or traversal rules. Secondly, attackers may directly manipulate data and pointer values in kernel objects to evade detection, and it then becomes even more challenging to create signatures and traversal rules that cannot be easily violated by malicious manipulation, system updates, and random noise. Finally, the high-efficiency requirement often sacrifices robustness. For example, an efficient signature scanning tool usually skips large memory regions that are unlikely to contain the relevant objects and relies on simple but easily tamperable string constants, so an important clue may hide in an ignored region.

In this section, we review four recent representative works that use Deep Learning for memory forensics (Song et al. 2018; Petrik et al. 2018; Michalas and Murray 2017; Dai et al. 2018). DeepMem (Song et al. 2018) recognizes kernel objects from raw memory dumps by generating abstract representations of kernel objects with a graph-based Deep Learning approach. MDMF (Petrik et al. 2018) detects OS- and architecture-independent malware from memory snapshots with several preprocessing techniques, domain-unaware feature selection, and a suite of machine learning algorithms. MemTri (Michalas and Murray 2017) predicts the likelihood of criminal activity in a memory image using a Bayesian network, based on evidence data artefacts generated by several applications. Dai et al. (2018) monitor the malware process memory and classify malware according to memory dumps, transforming each memory dump into a grayscale image and adopting a multi-layer perceptron as the classifier.

Among these four works (Song et al. 2018; Petrik et al. 2018; Michalas and Murray 2017; Dai et al. 2018), two representative works (i.e., (Song et al. 2018; Petrik et al. 2018)) are already summarized phase-by-phase in Table 2. We direct interested readers to Table 2 for a concise overview of these two works.

Our review is centered around the three questions raised in Section 3. In the remainder of this section, we first provide a set of observations, then the indications, and finally some general remarks.

From a close look at the very recent applications using Deep Learning for solving memory forensics challenges, we observed the following:

Observation 9.1: All methods used their own datasets for performance evaluation; none of them used a public dataset.

DeepMem was evaluated on a dataset generated by the authors, who collected a large number of diverse memory dumps and labeled the kernel objects in them using existing memory forensics tools like Volatility. MDMF employed the MalRec dataset by Georgia Tech to generate malicious snapshots, and created a dataset of benign memory snapshots running normal software. MemTri ran several Windows 7 virtual machine instances with self-designed suspect activity scenarios to gather memory images. Dai et al. ran the Procdump program in the Cuckoo sandbox to extract malware memory dumps. Each of the four works in our survey thus generated its own dataset, and none was evaluated on a public dataset.

Observation 9.2: Among the four works (Song et al. 2018; Michalas and Murray 2017; Petrik et al. 2018; Dai et al. 2018), two (Song et al. 2018; Michalas and Murray 2017) employed Phase II while the other two (Petrik et al. 2018; Dai et al. 2018) did not.

DeepMem (Song et al. 2018) devised a graph representation for a sequence of bytes, taking into account both adjacency and points-to relations, to better model the contextual information in memory dumps. MemTri (Michalas and Murray 2017) first identified the running processes within the memory image that match the target applications, then employed regular expressions to locate evidence artefacts in the memory image. MDMF (Petrik et al. 2018) and Dai et al. (2018) transformed the memory dumps into images directly.
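The dump-to-image transformation can be sketched as follows; interpreting raw bytes as pixel intensities and the fixed width of 256 are illustrative assumptions rather than the exact settings of MDMF or Dai et al. (2018).

```python
# A minimal sketch: reshape a raw memory dump into a grayscale image.
import numpy as np

def dump_to_image(dump_bytes, width=256):
    arr = np.frombuffer(dump_bytes, dtype=np.uint8)   # bytes -> pixel values
    height = len(arr) // width
    return arr[: height * width].reshape(height, width)

img = dump_to_image(open("memory.dmp", "rb").read())  # hypothetical dump file
```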

Observation 9.3: Among the four works (Song et al. 2018; Michalas and Murray 2017; Petrik et al. 2018; Dai et al. 2018), only DeepMem (Song et al. 2018) employed Phase III, using an embedding method to represent a memory graph.

MDMF (Petrik et al. 2018) directly fed the generated memory images into the training of a CNN model. Dai et al. (2018) used the HOG feature descriptor for detecting objects, while MemTri (Michalas and Murray 2017) extracted evidence artefacts as the input of a Bayesian network. In summary, DeepMem belonged to class 3 shown in Fig. 2, while the other three works belonged to class 1.

Observation 9.4: The four works (Song et al. 2018; Petrik et al. 2018; Michalas and Murray 2017; Dai et al. 2018) employed different classifiers even when the types of input data were the same.

DeepMem chose a fully connected network (FCN) model with multiple hidden layers of ReLU-activated neurons, followed by a softmax layer as the last layer. MDMF (Petrik et al. 2018) evaluated performance both on traditional machine learning algorithms and on Deep Learning approaches including CNN and LSTM; their results showed that the accuracy of the different classifiers did not differ significantly. MemTri employed a Bayesian network model designed with three layers, i.e., a hypothesis layer, a sub-hypothesis layer, and an evidence layer. Dai et al. used a multi-layer perceptron model including an input layer, a hidden layer, and an output layer as the classifier.

Indication 9.1: There is a lack of public datasets for evaluating the performance of different Deep Learning methods in memory forensics.

From Observation 9.1, we find that none of the four surveyed works was evaluated on public datasets.

Indication 9.2: From Observation 9.2, we find that it is disputable whether one should employ Phase II when solving memory forensics problems.

Since both (Petrik et al. 2018) and (Dai et al. 2018) directly transformed a memory dump into an image, Phase II is not required in these two works. However, since a memory dump contains a large amount of useless information, we argue that appropriate preprocessing could improve the accuracy of the trained models.

Indication 9.3: From Observation 9.3, we find that Phase III has received little attention in memory forensics.

Most works did not employ Phase III. Among the four works, only DeepMem (Song et al. 2018) employed Phase III, using embeddings to represent a memory graph. The other three works (Petrik et al. 2018; Michalas and Murray 2017; Dai et al. 2018) did not learn any representations before training a Deep Learning model.

Indication 9.4: For Phase IV in memory forensics, different classifiers can be employed.

Which kind of classifier to use seems to be determined by the features used and their data structures. From Observation 9.4, we find that the four works employed different kinds of classifiers even when the types of input data were the same. Interestingly, MDMF obtained similar results with different classifiers, including traditional machine learning and Deep Learning models. However, the other three works did not discuss why they chose a particular kind of classifier.

Since a memory dump can be considered as a sequence of bytes, the data structure of a training data example is straightforward. If the memory dump is transformed into a simple form in Phase II, it can be directly fed into the training process of a Deep Learning model, and as a result Phase III can be ignored. However, if the memory dump is transformed into a complicated form in Phase II, Phase III could be quite useful in memory forensics.

Regarding the answer to Question 3 in the “Methodology for reviewing the existing works” section, it is very interesting that different classifiers can be employed in Phase IV of memory forensics. Moreover, MDMF (Petrik et al. 2018) has shown that similar results can be obtained with different kinds of classifiers, although the authors also admit that with a larger amount of training data, performance could be improved by Deep Learning.

An end-to-end deep learning model can learn a precise representation of a memory dump automatically, relaxing the requirement for expert knowledge. However, expert knowledge is still needed to represent the data and the attacker's behavior, and attackers may directly manipulate data and pointer values in kernel objects to evade detection.

A closer look at applications of deep learning in security-oriented fuzzing

Fuzzing is one of the state-of-the-art techniques used to detect software vulnerabilities. The goal of fuzzing is to find all the vulnerabilities in a program by testing as much program code as possible. Due to its nature, this technique works best on finding vulnerabilities in programs that take input files, like PDF viewers (Godefroid et al. 2017) or web browsers. A typical fuzzing workflow can be summarized as follows: given several seed input files, the fuzzer mutates (fuzzes) the seed inputs to obtain more input files, with the aim of expanding the overall code coverage of the target program as it executes the mutated files. Although various popular fuzzers already exist (Li et al. 2018), fuzzing still suffers from redundantly testing input files that cannot improve the code coverage rate (Shi and Pei 2019; Rajpal et al. 2017), and some input files mutated by the fuzzer cannot even pass the well-formed file structure test (Godefroid et al. 2017). Recent research has come up with ideas for applying Deep Learning to the fuzzing process to solve these problems.

In this section, we review three recent representative works that use Deep Learning for security-oriented fuzzing (Godefroid et al. 2017; Shi and Pei 2019; Rajpal et al. 2017). Among the three, two representative works (Godefroid et al. 2017; Shi and Pei 2019) are already summarized phase-by-phase in Table 2. We direct interested readers to Table 2 for a concise overview of those two works.

Observation 10.1: Deep Learning has only been applied in mutation-based fuzzing.

Even though various fuzzing techniques, including symbolic-execution-based fuzzing (Stephens et al. 2016), taint-analysis-based fuzzing (Bekrar et al. 2012), and hybrid fuzzing (Yun et al. 2018), have been proposed so far, all the works we surveyed employed Deep Learning to assist the primitive form of fuzzing, i.e., mutation-based fuzzing. Specifically, they adopted Deep Learning to assist the fuzzing tool's input mutation, commonly in two ways: 1) training Deep Learning models to tell how to efficiently mutate the input to trigger more execution paths (Shi and Pei 2019; Rajpal et al. 2017); and 2) training Deep Learning models to tell how to keep the mutated files compliant with the program's basic semantic requirements (Godefroid et al. 2017). Besides, all three works trained different Deep Learning models for different programs, which means that knowledge learned from one program cannot be applied to other programs.

Observation 10.2: All the works in our survey were similar in choosing the training samples in Phase I.

The works in this survey had a common practice, i.e., using the input files directly as training samples for the Deep Learning model. Learn&Fuzz (Godefroid et al. 2017) used character-level PDF object sequences as training samples. Neuzz (Shi and Pei 2019) regarded input files directly as byte sequences and fed them into the neural network model. Rajpal et al. (2017) also used byte-level representations of input files as training samples.

Observation 10.3: The works in our survey differed in assigning the training labels in Phase I.

Despite the similarity in training samples, there was a huge difference in the training labels each work chose to use. Learn&Fuzz (Godefroid et al. 2017) directly used the character sequences of PDF objects as labels, the same as the training samples but shifted by one position, a common generative-model technique already broadly used in speech and handwriting recognition. Unlike Learn&Fuzz, Neuzz (Shi and Pei 2019) and Rajpal's work (Rajpal et al. 2017) used a bitmap and a heatmap, respectively, as training labels, with the bitmap describing the code coverage achieved by a certain input, and the heatmap describing the efficacy of flipping one or more bytes of the input file. The bitmap, a term well known among fuzzing researchers, was gathered directly from the results of AFL. The heatmap used by Rajpal et al. was generated by comparing the code coverage of one seed file, given by its bitmap, with the code coverage of the mutated seed files, given by their bitmaps. If there was an acceptable level of code coverage expansion when executing the mutated seed files (more “1”s instead of “0”s in the corresponding bitmaps), the byte-level differences between the original seed file and the mutated seed files were highlighted; since those bytes should be the focus of later mutation, the heatmap was used to denote their locations.

The different label usage in each work was actually due to the different kinds of knowledge each work wanted to learn. For a better understanding, note that we can simply regard a Deep Learning model as a simulation of a “function”. Learn&Fuzz (Godefroid et al. 2017) wanted to learn valid mutations of a PDF file that were compliant with the syntax and semantic requirements of PDF objects. Their model can be seen as a simulation of f(x, θ) = y, where x denotes a sequence of characters in PDF objects and y represents the sequence obtained by shifting the input sequence by one position. Once the model was trained, they generated new PDF object character sequences given a starting prefix. In Neuzz (Shi and Pei 2019), a neural network (NN) model was used to do program smoothing, simulating a smooth surrogate function that approximated the discrete branching behaviors of the target program: f(x, θ) = y, where x denoted the program's byte-level input and y represented the corresponding edge-coverage bitmap. In this way, the gradient of the surrogate function was easily computed, thanks to NNs' support for efficient computation of gradients and higher-order derivatives; gradients could then be used to guide the direction of mutation toward greater code coverage. In Rajpal and others' work (Rajpal et al. 2017), they designed a model to predict good (and bad) locations to mutate in input files based on past mutations and the corresponding code coverage information; here, the x variable also denoted the program's byte-level input, but the y variable represented the corresponding heatmap.

Observation 10.4: Various lengths of input files were handled in Phase II.

Deep Learning models typically accept fixed-length inputs, whereas the input files for fuzzers often have different lengths. Two approaches were used among the three works we surveyed: splitting and padding. Learn&Fuzz (Godefroid et al. 2017) dealt with this mismatch by concatenating all the PDF object character sequences together and then splitting the large character sequence into multiple training samples of a fixed size. Neuzz (Shi and Pei 2019) solved this problem by setting a maximum input file size threshold and padding smaller input files with null bytes. From additional experiments, they also found that a modest threshold gave them the best result, and that enlarging the input file size did not grant additional accuracy. Aside from preprocessing training samples, Neuzz also preprocessed training labels, reducing the label dimension by merging the edges that always appear together into one edge, in order to prevent the multicollinearity problem that could keep the model from converging to a small loss value. Rajpal and others (Rajpal et al. 2017) used a splitting mechanism similar to Learn&Fuzz to split their input files into either 64-bit or 128-bit chunks; their chunk size was determined empirically and was considered a trainable parameter for their Deep Learning model, and their approach did not require sequence concatenation at the beginning.
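The padding strategy can be illustrated with a short sketch; the 10,000-byte threshold is an illustrative assumption, not Neuzz's actual setting.

```python
# Truncate inputs to a fixed threshold and pad shorter files with null bytes.
import numpy as np

def fix_length(file_bytes, max_len=10000):
    arr = np.frombuffer(file_bytes[:max_len], dtype=np.uint8)
    out = np.zeros(max_len, dtype=np.uint8)   # null-byte padding
    out[: len(arr)] = arr
    return out
```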

Observation 10.5: All the works in our survey skipped Phase III.

According to our definition of Phase III, none of the works in our survey considered representation learning. Therefore, all three works (Godefroid et al. 2017; Shi and Pei 2019; Rajpal et al. 2017) fell into class 1 shown in Fig. 2. That said, in Rajpal and others' work, they did consider the numerical representation of byte sequences: they claimed that since one byte of binary data does not always represent magnitude but may also represent state, representing one byte as a value ranging from 0 to 255 could be suboptimal, so they used a lower-level 8-bit representation.

Indication 10.1: Making no alteration to the input files seems to be the correct approach. In our view, this is due to the nature of fuzzing: since every bit of an input file matters, any slight alteration could either lose important information or add redundant information for the neural network model to learn.

Indication 10.2: Evaluation criteria should be chosen carefully when judging mutation.

Input files are always used as training samples when applying Deep Learning techniques to fuzzing problems. Through this common practice, researchers share the desire to let the neural network model learn what mutated input files should look like. But the criterion for judging an input file actually has two levels: on the one hand, a good input file should be correct in syntax and semantics; on the other hand, a good input file should be the product of a useful mutation, i.e., one that triggers the program to behave differently from previous execution paths. The idea that a fuzzer able to generate semantically correct input files could still be bad at triggering new execution paths was first brought up in Learn&Fuzz (Godefroid et al. 2017). Later works tried to solve this problem by using either different training labels (Rajpal et al. 2017) or a neural network that does program smoothing (Shi and Pei 2019). We encourage fuzzing researchers to keep this problem in mind when using Deep Learning techniques, in order to get better fuzzing results.

Indication 10.3: Works in our survey only focus on local knowledge. In brief, some of the existing works (Shi and Pei 2019; Rajpal et al. 2017) leveraged the Deep Learning model to learn the relation between a program's input and its behavior, and used the knowledge learned from history to guide future mutation. For better demonstration, we define knowledge that only applies to one program as local knowledge; in other words, local knowledge cannot direct fuzzing on other programs.

Corresponding to the problems conventional fuzzing has, the advantages of applying DL to fuzzing are that DL's learning ability can help mutated input files follow the designated grammar rules better, and that the ways in which input files are generated are more directed, therefore helping the fuzzer increase its code coverage with each mutation. However, even if the advantages are clearly demonstrated by the papers we discuss above, some challenges still exist, including mutation judgment challenges faced both by traditional fuzzing techniques and by fuzzing with DL, and the scalability of fuzzing approaches.

We would like to raise several interesting questions for future researchers: 1) Can the knowledge learned from the fuzzing history of one program be applied to direct testing on other programs? 2) If the answer to question one is positive, can we suppose that global knowledge across different programs exists, and can we train a model to extract that global knowledge? 3) Is it possible to combine global knowledge and local knowledge when fuzzing programs?

Using high-quality data in Deep Learning is as important as using well-structured deep neural network architectures. That is, obtaining quality data is an important step that should not be skipped, even when resolving security problems using Deep Learning. So far, this study has demonstrated how recent security papers using Deep Learning have adopted data conversion (Phase II) and data representation (Phase III) for different security problems. Our observations and indications give a clear understanding of how security experts generate quality data when using Deep Learning.

Since we did not review all the existing security papers using Deep Learning, the generality of our observations and indications is somewhat limited. Note that the papers selected for review were published recently at prestigious security and reliability conferences such as USENIX SECURITY and ACM CCS (Shin et al. 2015)-(Das et al. 2018), (Brown et al. 2018; Zhang et al. 2019), (Song et al. 2018; Petrik et al. 2018), (Wang et al. 2019)-(Rajpal et al. 2017). Thus, our observations and indications help in understanding how most security experts have used Deep Learning to solve the eight well-known security problems, from program analysis to fuzzing.

Our observations show that raw data should be transformed, through data cleaning, data augmentation, and so on, into formats ready for resolving security problems using Deep Learning. Specifically, we observe that Phase II and III methods have mainly been used for the following purposes:

To clean the raw data to make the neural network (NN) models easier to interpret

To reduce the dimensionality of data (e.g., principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE))

To scale input data (e.g., normalization)

To make NN models understand more complex relationships depending on security problems (e.g. memory graphs)

To simply change various raw data formats into a vector format for NN models (e.g. one-hot encoding and word2vec embedding)

In the following, we further discuss the question, “What if Phase II is skipped?”, rather than the question, “Is Phase III always necessary?”. This is because most of the selected papers do not consider Phase III methods (76%), or adopt them with no concrete reasoning (19%). Specifically, we demonstrate in depth how Phase II has been adopted according to the eight security problems, different types of data, various models of NN, and various outputs of NN models. Our key findings are summarized as follows:

How to fit security domain knowledge into raw data has not been well-studied yet.

While raw text data are commonly parsed after embedding, raw binary data are converted using various Phase II methods.

Raw data are commonly converted into a vector format to fit well to a specific NN model using various Phase II methods.

Various Phase II methods are used according to the relationship between output of security problem and output of NN models.

What if phase II is skipped?

From the analysis results of our selected papers for review, we roughly classify Phase II methods into the following four categories.

Embedding: The data conversion methods that intend to convert high-dimensional discrete variables into low-dimensional continuous vectors ( Google Developers 2016 ).

Parsing combined with embedding: Data conversion methods that decompose input data into syntactic components in order to test conformity, followed by embedding.

One-hot encoding: A simple encoding where each value belonging to a specific category is mapped to a vector of 0s and a single 1. Here, no low-dimensional transformed vector is learned.

Domain-specific data structures: A set of data conversion methods which generate data structures capturing domain-specific knowledge for different security problems, e.g., memory graphs ( Song et al. 2018 ).
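To make Categories 1 and 3 concrete, the following hedged sketch contrasts one-hot encoding with a learned embedding for a categorical field; the protocol vocabulary is an illustrative assumption.

```python
import torch
import torch.nn as nn

vocab = ["tcp", "udp", "icmp"]
idx = torch.tensor([vocab.index("udp")])

# Category 3: one-hot encoding, sparse and high-dimensional for large vocabularies.
one_hot = nn.functional.one_hot(idx, num_classes=len(vocab)).float()
# tensor([[0., 1., 0.]])

# Category 1: learned embedding, a low-dimensional continuous vector.
embed = nn.Embedding(len(vocab), 2)
dense = embed(idx)  # e.g., tensor([[ 0.31, -1.24]]), values depend on training
```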

Findings on eight security problems

We observe that over 93% of the papers use one of the above-classified Phase II methods; 7% of the papers do not use any of them, and these papers mostly solve a software fuzzing problem. Specifically, we observe that 35% of the papers use a Category 1 (i.e., embedding) method; 30% use a Category 2 (i.e., parsing combined with embedding) method; 15% use a Category 3 (i.e., one-hot encoding) method; and 13% use a Category 4 (i.e., domain-specific data structures) method. Regarding why one-hot encoding is not widely used, we found that most security data include categorical input values, which are not directly analyzable by Deep Learning models.

From Fig. 3, we also observe that different Phase II methods are used for different security problems. First, PA, ROP, and CFI convert raw data into a vector format using embedding because they commonly collect instruction sequences from binary data. Second, NA and SEAD use parsing combined with embedding because raw data such as network traffic and system logs consist of complex attributes with different formats, such as categorical and numerical input values. Third, MF uses various data structures because memory dumps of the memory layout are unstructured. Fourth, fuzzing generally uses no data conversion, since Deep Learning models are used to generate new input data with the same data format as the original raw data. Finally, MC commonly uses one-hot encoding and embedding because malware binaries and well-structured security log files generally include categorical, numerical, and unstructured data. These observations indicate that the type of data strongly influences the choice of Phase II methods. We also observe that, among the eight security problems, only MF commonly transforms raw data into well-structured data embedding specialized security domain knowledge. This indicates that methods for converting raw data into well-structured data that embed various kinds of security domain knowledge have not yet been studied in depth.

Figure 3: Statistics of Phase II methods for eight security problems

Findings on different data types

Note that, according to the type of data, one NN model may work better than others; for example, CNNs work well with images but not with text. From Fig. 4, for raw binary data, we observe that 51.9%, 22.3%, and 11.2% of security papers use embedding, one-hot encoding, and Others, respectively. Only 14.9% of security papers, especially those related to fuzzing, do not use one of the Phase II methods. This observation indicates that binary input data, which come in various binary formats, should be converted into an input data type that works well with a specific NN model. From Fig. 4, for raw text data, we also observe that 92.4% of papers use parsing with embedding as the Phase II method. Note that, compared with raw binary data, whose formats are unstructured, raw text data generally have a well-structured format. Raw text data collected from network traffic may also have various types of attribute values. Thus, raw text data are commonly parsed after embedding to reduce the redundancy and dimensionality of the data.

Figure 4: Statistics of Phase II methods on type of data

Findings on various models of NN

According to the type of the converted data, a specific NN model works better than others; for example, CNNs work well with images but not with raw text. From Fig. 6b, we observe that the use of embedding for DNN (42.9%), RNN (28.6%), and LSTM (14.3%) models sums to about 85%. This observation indicates that embedding methods are commonly used to generate sequential input data for DNN, RNN, and LSTM models. Also, we observe that one-hot encoded data are commonly used as input data for DNN (33.4%), CNN (33.4%), and LSTM (16.7%) models. This indicates that one-hot encoding is a common Phase II method for generating numerical values for image and sequential input data, because many raw input data for security problems have categorical features. We observe that the CNN model (66.7%) uses input data converted with the Others methods to express specific domain knowledge in the input data structure of the NN, because general vector formats, including graphs, matrices, and so on, can also be used as input values of a CNN model.

From Fig. 5b, we observe that DNN, RNN, and LSTM models commonly use embedding, one-hot encoding, and parsing combined with embedding; for example, 54.6%, 18.2%, and 18.2% of papers use embedding, one-hot encoding, and parsing combined with embedding, respectively. We also observe that the CNN model is used with various Phase II methods, because any vector format, such as an image, can generally be used as input data of a CNN model.

Figure 5: Statistics of Phase II methods for various types of NNs

Figure 6: Statistics of Phase II methods for various output of NN

Findings on output of NN models

According to the relationship between the output of the security problem and the output of the NN, a specific Phase II method may be used. For example, if the output of a security problem is a class (e.g., normal or abnormal), the output of the NN should also be a classification.

From Fig. 6a, we observe that embedding is commonly used to support security problems whose output is classification (100%). Parsing combined with embedding is used to support security problems with object detection (41.7%) and classification (58.3%) outputs. One-hot encoding is used only for classification (100%). These observations indicate that classification of given input data is the most common output obtained using Deep Learning under the various Phase II methods.

From Fig. 6b, we observe that security problems whose outputs are classification commonly use embedding (43.8%) and parsing combined with embedding (21.9%) as the Phase II method. We also observe that security problems whose outputs are object detection commonly use parsing combined with embedding (71.5%). However, security problems whose outputs are data generation commonly do not use Phase II methods. These observations indicate that a specific Phase II method is used according to the relationship between the output of the security problem and the use of NN models.

Further areas of investigation

Since Deep Learning models are stochastic, fitting the same Deep Learning model even on the same data might give different outcomes each time. This is because deep neural networks use random values, such as random initial weights. If we had all possible data for every security problem, predictions would not be random; but since in practice we have limited sample data, we need to get the best-effort prediction results from the given Deep Learning model that fits the given security problem.

How can we get the best-effort prediction results of Deep Learning models for different security problems? Let us begin by discussing the stability of evaluation results for our selected papers. Next, we elaborate on the influence of security domain knowledge on the prediction results of Deep Learning models. Finally, we discuss some common issues in these fields.

How stable are evaluation results?

When evaluating neural network models, Deep Learning studies commonly use three methods: train-test split, train-validation-test split, and k-fold cross validation. A train-test split method splits the data into two parts, i.e., training and test data. Even though a train-test split makes stable predictions with a large amount of data, predictions vary with a small amount of data. A train-validation-test split method splits the data into three parts: training, validation, and test data, where the validation data are used to estimate predictions on unknown data. k-fold cross validation obtains k different sets of predictions from k different folds of evaluation data. Since k-fold cross validation takes the average expected performance of the NN model over the k folds of validation data, the evaluation result is closer to the actual performance of the NN model.

From the analysis of our selected papers, we observe that 40.0% and 32.5% of the selected papers are evaluated using a train-test split method and a train-validation-test split method, respectively. Only 17.5% of the selected papers are evaluated using k-fold cross validation. This observation implies that even though the selected papers report almost more than 99% accuracy or 0.99 F1 score, most solutions using Deep Learning might not show the same performance on noisy data with randomness.

To get stable prediction results of Deep Learning models for different security problems, we might reduce the influence of the randomness of data on Deep Learning models. At least, it is recommended to consider the following methods:

Do experiments using the same data many times: to get a stable prediction with a small amount of sample data, we might control the randomness of the data by using the same data many times.

Use cross validation methods, e.g., k-fold cross validation: the expected average and variance from k-fold cross validation estimate how stable the proposed model is. A minimal example follows this list.
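For instance, a minimal k-fold cross validation sketch with scikit-learn looks as follows; the classifier and synthetic data are placeholders for any of the surveyed models and datasets.

```python
# Report the mean and spread of accuracy across 5 folds.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

X, y = np.random.rand(200, 10), np.random.randint(0, 2, 200)  # placeholder data
scores = cross_val_score(RandomForestClassifier(), X, y, cv=5)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```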

How does security domain knowledge influence the performance of security solutions using deep learning?

When selecting an NN model to analyze an application dataset, e.g., the MNIST dataset (LeCun and Cortes 2010), we should understand that the problem is to classify a handwritten digit in a 28×28 grayscale image. Also, to solve the problem with high classification accuracy, it is important to know which part of each handwritten digit mainly influences the outcome of the problem, i.e., to have domain knowledge.

While solving a security problem, knowing and using security domain knowledge for each security problem is also important, for the following reasons (we label the observations and indications related to domain knowledge with ‘∗’):

Firstly, dataset generation, preprocessing, and feature selection highly depend on domain knowledge. Different from image classification and natural language processing, raw data in the security domain cannot be fed into the NN model directly; researchers need to adopt strong domain knowledge to generate, extract, or clean the training set. Also, in some works, domain knowledge is adopted in data labeling because labels for data samples are not straightforward.

Secondly, domain knowledge helps with the selection of DL models and their hierarchical structure. For example, the neural network architecture (hierarchical and bi-directional LSTM) designed in DEEPVSA (Guo et al. 2019) is based on domain knowledge in instruction analysis.

Thirdly, domain knowledge helps to speed up the training process. For instance, adopting strong domain knowledge to clean the training set speeds up training while keeping the same performance. However, due to the influence of the randomness of data on Deep Learning models, domain knowledge should be carefully adopted to avoid potentially decreased accuracy.

Finally, domain knowledge helps with the interpretability of models' predictions. Recently, researchers have tried to explore the interpretability of deep learning models in security areas. For instance, LEMNA (Guo et al. 2018) and EKLAVYA (Chua et al. 2017) explain how predictions were made by models from different perspectives. By enhancing the trained models' interpretability, they can improve their approaches' accuracy and security. The explanation of the relation between input, hidden state, and final output is based on domain knowledge.

Common challenges

In this section, we discuss the common challenges in applying DL to security problems. These challenges are shared by at least the majority of the works, if not all of them. Generally, we observe the following common challenges in our survey:

The raw data collected from the software or system usually contains lots of noise.

The collected raw data is untidy. For instance, instruction traces are variable-length sequences that cannot be fed to a fixed-size model directly (see the sketch after this list).

Hierarchical data syntax/structure. As discussed in Section 3 , the information may not simply be encoded in a single layer; rather, it is encoded hierarchically, and the syntax is complex.

Dataset generation is challenging in some scenarios; therefore, the generated training data might be less representative or unbalanced.

Unlike the application of DL to image classification and natural language processing, where the relation between a data sample and its label is visible or understandable, in security the relation between a sample and its label is not intuitive and is hard to explain.
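The following minimal sketch (illustrative only, not taken from any surveyed work) shows one standard way to feed variable-length traces to a recurrent model: pad the sequences to a common length and pack them before the LSTM.

```python
# Illustrative sketch: feeding variable-length "instruction traces" to an LSTM
# by padding and packing. The random tensors stand in for trace embeddings.
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

# Three traces of different lengths; each step is an 8-dim embedding.
traces = [torch.randn(n, 8) for n in (5, 12, 7)]
lengths = torch.tensor([t.size(0) for t in traces])

padded = pad_sequence(traces, batch_first=True)          # shape (3, 12, 8)
packed = pack_padded_sequence(padded, lengths,
                              batch_first=True, enforce_sorted=False)

lstm = torch.nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
_, (h_n, _) = lstm(packed)   # h_n holds one summary vector per trace
print(h_n.shape)             # torch.Size([1, 3, 16])
```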

Availability of trained model and quality of dataset.

Finally, we investigate the availability of the trained models and the quality of the datasets. Generally, the availability of a trained model affects its adoption in practice, while the quality of the training and testing sets affects the credibility of the reported results and of comparisons between different works. We therefore collect the relevant information to answer the following four questions, with the statistics shown in Table 4:

Is the paper’s source code publicly available?

Is the raw data used to generate the dataset publicly available?

Is the dataset itself publicly available?

What is the quality of the dataset?

We observe that the percentages of open-source code and public datasets in our surveyed fields are both low, which makes it hard to reproduce the proposed schemes, compare different works, and adopt them in practice. Specifically, the statistics show that: 1) the percentage of open-source code is low, as only 6 out of 16 papers published their model’s source code; 2) the percentage of public datasets is low, as even though the raw data in half of the works is publicly available, only 4 out of 16 papers fully or partially published their dataset; and 3) the quality of the datasets is not guaranteed, as, for instance, most of them are unbalanced.
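A common mitigation for such unbalanced datasets is to reweight the classes during training. The sketch below (an illustration, not taken from any surveyed work) computes inverse-frequency class weights with scikit-learn; the resulting weights can be passed to most classifiers or used to scale a neural network's loss.

```python
# Illustrative sketch: counteracting class imbalance with inverse-frequency
# class weights (not taken from any surveyed work).
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 950 + [1] * 50)   # 95% benign, 5% malicious labels
weights = compute_class_weight(class_weight="balanced",
                               classes=np.unique(y), y=y)
print(dict(zip(np.unique(y), weights)))   # e.g. {0: ~0.53, 1: 10.0}
# Pass these via class_weight=... in sklearn estimators, or use them to
# weight the loss terms of a neural network during training.
```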

The performance of security solutions, even those using Deep Learning, can vary across datasets. Traditionally, when evaluating different NN models for image classification, standard datasets such as MNIST for recognizing the 10 handwritten digits and CIFAR10 ( Krizhevsky et al. 2010 ) for recognizing 10 object classes are used to compare the performance of different NN models. However, there are no standard datasets for evaluating NN models on the different security problems. Because of this limitation, we observe that most security papers using Deep Learning do not compare the performance of different security solutions, even when they address the same security problem. Thus, it is recommended to generate and use a standard dataset for each specific security problem to enable comparison. In conclusion, we think there are three aspects to improve in future research:

Developing standard datasets.

Publishing source code and datasets.

Improving the interpretability of models.

This paper provides a dedicated review of very recent research on using Deep Learning techniques to solve computer security challenges. In particular, the review covers eight computer security problems being addressed with Deep Learning: security-oriented program analysis, defending against ROP attacks, achieving CFI, defending against network attacks, malware classification, system-event-based anomaly detection, memory forensics, and fuzzing for software security. Our observations of the reviewed works indicate that the literature on using Deep Learning techniques to solve computer security challenges is still at an early stage of development.

Availability of data and materials

Not applicable.

We refer readers to ( Wang and Liu 2019 ), which systematizes the knowledge of protections provided by CFI schemes.

References

Abadi, M, Budiu M, Erlingsson Ú, Ligatti J (2009) Control-Flow Integrity Principles, Implementations, and Applications. ACM Trans Inf Syst Secur (TISSEC) 13(1):4.


Bao, T, Burket J, Woo M, Turner R, Brumley D (2014) BYTEWEIGHT: Learning to Recognize Functions in Binary Code In: 23rd USENIX Security Symposium (USENIX Security 14), 845–860.. USENIX Association, San Diego.


Bekrar, S, Bekrar C, Groz R, Mounier L (2012) A Taint Based Approach for Smart Fuzzing In: 2012 IEEE Fifth International Conference on Software Testing, Verification and Validation.. IEEE. https://doi.org/10.1109/icst.2012.182 .

Bengio, Y, Courville A, Vincent P (2013) Representation Learning: A Review and New Perspectives. IEEE Trans Pattern Anal Mach Intell 35(8):1798–1828.

Bertero, C, Roy M, Sauvanaud C, Tredan G (2017) Experience Report: Log Mining Using Natural Language Processing and Application to Anomaly Detection In: 2017 IEEE 28th International Symposium on Software Reliability Engineering (ISSRE).. IEEE. https://doi.org/10.1109/issre.2017.43 .

Brown, A, Tuor A, Hutchinson B, Nichols N (2018) Recurrent Neural Network Attention Mechanisms for Interpretable System Log Anomaly Detection In: Proceedings of the First Workshop on Machine Learning for Computing Systems, MLCS’18, 1:1–1:8.. ACM, New York.

Böttinger, K, Godefroid P, Singh R (2018) Deep Reinforcement Fuzzing In: 2018 IEEE Security and Privacy Workshops (SPW), pages 116–122.. IEEE. https://doi.org/10.1109/spw.2018.00026 .

Chen, L, Sultana S, Sahita R (2018) Henet: A Deep Learning Approach on Intel Ⓡ Processor Trace for Effective Exploit Detection In: 2018 IEEE Security and Privacy Workshops (SPW).. IEEE. https://doi.org/10.1109/spw.2018.00025 .

Chua, ZL, Shen S, Saxena P, Liang Z (2017) Neural Nets Can Learn Function Type Signatures from Binaries In: 26th USENIX Security Symposium (USENIX Security 17), 99–116.. USENIX Association. https://dl.acm.org/doi/10.5555/3241189.3241199 .

Cui, Z, Xue F, Cai X, Cao Y, Wang GG, Chen J (2018) Detection of Malicious Code Variants Based on Deep Learning. IEEE Trans Ind Inform 14(7):3187–3196.

Dahl, GE, Stokes JW, Deng L, Yu D (2013) Large-scale Malware Classification using Random Projections and Neural Networks In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).. IEEE. https://doi.org/10.1109/icassp.2013.6638293 .

Dai, Y, Li H, Qian Y, Lu X (2018) A Malware Classification Method Based on Memory Dump Grayscale Image. Digit Investig 27:30–37.

Das, A, Mueller F, Siegel C, Vishnu A (2018) Desh: Deep Learning for System Health Prediction of Lead Times to Failure in HPC In: Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing, HPDC ’18, 40–51.. ACM, New York.


David, OE, Netanyahu NS (2015) DeepSign: Deep Learning for Automatic Malware Signature Generation and Classification In: 2015 International Joint Conference on Neural Networks (IJCNN).. IEEE. https://doi.org/10.1109/ijcnn.2015.7280815 .

De La Rosa, L, Kilgallon S, Vanderbruggen T, Cavazos J (2018) Efficient Characterization and Classification of Malware Using Deep Learning In: 2018 Resilience Week (RWS).. IEEE. https://doi.org/10.1109/rweek.2018.8473556 .

Du, M, Li F (2016) Spell: Streaming Parsing of System Event Logs In: 2016 IEEE 16th International Conference on Data Mining (ICDM).. IEEE. https://doi.org/10.1109/icdm.2016.0103 .

Du, M, Li F, Zheng G, Srikumar V (2017) DeepLog: Anomaly Detection and Diagnosis from System Logs Through Deep Learning In: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS ’17, 1285–1298.. ACM, New York.

Faker, O, Dogdu E (2019) Intrusion Detection Using Big Data and Deep Learning Techniques In: Proceedings of the 2019 ACM Southeast Conference (ACM SE ’19), 86–93.. ACM. https://doi.org/10.1145/3299815.3314439 .

Ghosh, AK, Wanken J, Charron F (1998) Detecting Anomalous and Unknown Intrusions against Programs In: Proceedings 14th annual computer security applications conference (Cat. No. 98Ex217), 259–267.. IEEE, Washington, DC.

Godefroid, P, Peleg H, Singh R (2017) Learn&Fuzz: Machine Learning for Input Fuzzing In: 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE).. IEEE. https://doi.org/10.1109/ase.2017.8115618 .

Google Developers (2016) Embeddings . https://developers.google.com/machine-learning/crash-course/embeddings/video-lecture .

Guo, W, Mu D, Xu J, Su P, Wang G, Xing X (2018) Lemna: Explaining deep learning based security applications In: Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, pages 364–379. https://doi.org/10.1145/3243734.3243792 .

Guo, W, Mu D, Xing X, Du M, Song D (2019) DEEPVSA: Facilitating Value-set Analysis with Deep Learning for Postmortem Program Analysis In: 28th USENIX Security Symposium (USENIX Security 19), 1787–1804.. USENIX Association, Santa Clara, CA. https://www.usenix.org/conference/usenixsecurity19/presentation/guo .

Heller, KA, Svore KM, Keromytis AD, Stolfo SJ (2003) One Class Support Vector Machines for Detecting Anomalous Windows Registry Accesses In: Proceedings of the Workshop on Data Mining for Computer Security.. IEEE, Dallas, TX.

Horwitz, S (1997) Precise Flow-insensitive May-alias Analysis is NP-hard. ACM Trans Program Lang Syst 19(1):1–6.

Hu, W, Liao Y, Vemuri VR (2003) Robust Anomaly Detection using Support Vector Machines In: Proceedings of the international conference on machine learning, 282–289.. Citeseer, Washington, DC.

IDS 2017 Datasets (2019). https://www.unb.ca/cic/datasets/ids-2017.html .

Kalash, M, Rochan M, Mohammed N, Bruce NDB, Wang Y, Iqbal F (2018) Malware Classification with Deep Convolutional Neural Networks In: 2018 9th IFIP International Conference on New Technologies, Mobility and Security (NTMS), 1–5. https://doi.org/10.1109/NTMS.2018.8328749 .

Kiriansky, V, Bruening D, Amarasinghe SP, et al. (2002) Secure Execution via Program Shepherding In: USENIX Security Symposium, volume 92, page 84.. USENIX Association, Monterey, CA.

Kolosnjaji, B, Eraisha G, Webster G, Zarras A, Eckert C (2017) Empowering Convolutional Networks for Malware Classification and Analysis. Proc Int Jt Conf Neural Netw 2017-May:3838–3845.

Krizhevsky, A, Nair V, Hinton G (2010) CIFAR-10 (Canadian Institute for Advanced Research). https://www.cs.toronto.edu/~kriz/cifar.html .

LeCun, Y, Cortes C (2010) MNIST Handwritten Digit Database. http://yann.lecun.com/exdb/mnist/ .

Li, J, Zhao B, Zhang C (2018) Fuzzing: A Survey. Cybersecurity 1(1):6.

Li, X, Hu Z, Fu Y, Chen P, Zhu M, Liu P (2018) ROPNN: Detection of ROP Payloads Using Deep Neural Networks. arXiv preprint arXiv:1807.11110.

McLaughlin, N, Martinez Del Rincon J, Kang BJ, Yerima S, Miller P, Sezer S, Safaei Y, Trickel E, Zhao Z, Doupe A, Ahn GJ (2017) Deep Android Malware Detection In: Proceedings of the 7th ACM Conference on Data and Application Security and Privacy, 301–308. https://doi.org/10.1145/3029806.3029823 .

Meng, W, Liu Y, Zhu Y, Zhang S, Pei D, Liu Y, Chen Y, Zhang R, Tao S, Sun P, Zhou R (2019) Loganomaly: Unsupervised Detection of Sequential and Quantitative Anomalies in Unstructured Logs In: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence.. International Joint Conferences on Artificial Intelligence Organization. https://doi.org/10.24963/ijcai.2019/658 .

Michalas, A, Murray R (2017) MemTri: A Memory Forensics Triage Tool Using Bayesian Network and Volatility In: Proceedings of the 2017 International Workshop on Managing Insider Security Threats, MIST ’17, pages 57–66.. ACM, New York.

Millar, K, Cheng A, Chew HG, Lim C-C (2018) Deep Learning for Classifying Malicious Network Traffic In: Pacific-Asia Conference on Knowledge Discovery and Data Mining, 156–161.. Springer. https://doi.org/10.1007/978-3-030-04503-6_15 .

Moustafa, N, Slay J (2015) UNSW-NB15: A Comprehensive Data Set for Network Intrusion Detection Systems (UNSW-NB15 Network Data Set) In: 2015 Military Communications and Information Systems Conference (MilCIS).. IEEE. https://doi.org/10.1109/milcis.2015.7348942 .

Nguyen, MH, Nguyen DL, Nguyen XM, Quan TT (2018) Auto-Detection of Sophisticated Malware using Lazy-Binding Control Flow Graph and Deep Learning. Comput Secur 76:128–155.

Nix, R, Zhang J (2017) Classification of Android Apps and Malware using Deep Neural Networks. Proc Int Jt Conf Neural Netw 2017-May:1871–1878.

NSCAI Intern Report for Congress (2019). https://drive.google.com/file/d/153OrxnuGEjsUvlxWsFYauslwNeCEkvUb/view .

Petrik, R, Arik B, Smith JM (2018) Towards Architecture and OS-Independent Malware Detection via Memory Forensics In: Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, CCS ’18, pages 2267–2269.. ACM, New York.

Phan, AV, Nguyen ML, Bui LT (2017) Convolutional Neural Networks over Control Flow Graphs for Software defect prediction In: 2017 IEEE 29th International Conference on Tools with Artificial Intelligence (ICTAI), 45–52.. IEEE. https://doi.org/10.1109/ictai.2017.00019 .

Rajpal, M, Blum W, Singh R (2017) Not All Bytes are Equal: Neural Byte Sieve for Fuzzing. arXiv preprint arXiv:1711.04596.

Rosenberg, I, Shabtai A, Rokach L, Elovici Y (2018) Generic Black-box End-to-End Attack against State of the Art API Call based Malware Classifiers In: Research in Attacks, Intrusions, and Defenses, 490–510.. Springer. https://doi.org/10.1007/978-3-030-00470-5_23 .

Salwant, J (2015) ROPGadget. https://github.com/JonathanSalwan/ROPgadget .

Saxe, J, Berlin K (2015) Deep Neural Network based Malware Detection using Two Dimensional Binary Program Features In: 2015 10th International Conference on Malicious and Unwanted Software (MALWARE).. IEEE. https://doi.org/10.1109/malware.2015.7413680 .

Shacham, H, et al. (2007) The Geometry of Innocent Flesh on the Bone: Return-into-libc without Function Calls (on the x86) In: ACM conference on Computer and communications security, pages 552–561. https://doi.org/10.1145/1315245.1315313 .

She, D, Pei K (2019) NEUZZ: Efficient Fuzzing with Neural Program Smoothing. IEEE Secur Priv.

Shin, ECR, Song D, Moazzezi R (2015) Recognizing Functions in Binaries with Neural Networks In: 24th USENIX Security Symposium (USENIX Security 15).. USENIX Association. https://dl.acm.org/doi/10.5555/2831143.2831182 .

Sommer, R, Paxson V (2010) Outside the Closed World: On Using Machine Learning For Network Intrusion Detection In: 2010 IEEE Symposium on Security and Privacy (S&P).. IEEE. https://doi.org/10.1109/sp.2010.25 .

Song, W, Yin H, Liu C, Song D (2018) DeepMem: Learning Graph Neural Network Models for Fast and Robust Memory Forensic Analysis In: Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, CCS ’18, 606–618.. ACM, New York.

Stephens, N, Grosen J, Salls C, Dutcher A, Wang R, Corbetta J, Shoshitaishvili Y, Kruegel C, Vigna G (2016) Driller: Augmenting Fuzzing Through Selective Symbolic Execution In: Proceedings 2016 Network and Distributed System Security Symposium.. Internet Society. https://doi.org/10.14722/ndss.2016.23368 .

Tan, G, Jaeger T (2017) CFG Construction Soundness in Control-Flow Integrity In: Proceedings of the 2017 Workshop on Programming Languages and Analysis for Security - PLAS ’17.. ACM. https://doi.org/10.1145/3139337.3139339 .

Tobiyama, S, Yamaguchi Y, Shimada H, Ikuse T, Yagi T (2016) Malware Detection with Deep Neural Network Using Process Behavior. Proc Int Comput Softw Appl Conf 2:577–582.

Unicorn-The ultimate CPU emulator (2015). https://www.unicorn-engine.org/ .

Ustebay, S, Turgut Z, Aydin MA (2019) Cyber Attack Detection by Using Neural Network Approaches: Shallow Neural Network, Deep Neural Network and AutoEncoder In: Computer Networks, 144–155.. Springer. https://doi.org/10.1007/978-3-030-21952-9_11 .

Varenne, R, Delorme JM, Plebani E, Pau D, Tomaselli V (2019) Intelligent Recognition of TCP Intrusions for Embedded Micro-controllers In: International Conference on Image Analysis and Processing, 361–373.. Springer. https://doi.org/10.1007/978-3-030-30754-7_36 .

Wang, Z, Liu P (2019) GPT Conjecture: Understanding the Trade-offs between Granularity, Performance and Timeliness in Control-Flow Integrity. arXiv preprint arXiv:1911.07828.

Wang, Y, Wu Z, Wei Q, Wang Q (2019) NeuFuzz: Efficient Fuzzing with Deep Neural Network. IEEE Access 7:36340–36352.

Xu, W, Huang L, Fox A, Patterson D, Jordan MI (2009) Detecting Large-Scale System Problems by Mining Console Logs In: Proceedings of the ACM SIGOPS 22Nd Symposium on Operating Systems Principles SOSP ’09, 117–132.. ACM, New York.

Xu, X, Liu C, Feng Q, Yin H, Song L, Song D (2017) Neural Network-Based Graph Embedding for Cross-Platform Binary Code Similarity Detection In: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, 363–376.. ACM. https://doi.org/10.1145/3133956.3134018 .

Xu, L, Zhang D, Jayasena N, Cavazos J (2018) HADM: Hybrid Analysis for Detection of Malware 16:702–724.

Xu, X, Ghaffarinia M, Wang W, Hamlen KW, Lin Z (2019) CONFIRM: Evaluating Compatibility and Relevance of Control-flow Integrity Protections for Modern Software In: 28th USENIX Security Symposium (USENIX Security 19), pages 1805–1821.. USENIX Association, Santa Clara.

Yagemann, C, Sultana S, Chen L, Lee W (2019) Barnum: Detecting Document Malware via Control Flow Anomalies in Hardware Traces In: Lecture Notes in Computer Science, 341–359.. Springer. https://doi.org/10.1007/978-3-030-30215-3_17 .

Yin, C, Zhu Y, Fei J, He X (2017) A Deep Learning Approach for Intrusion Detection using Recurrent Neural Networks. IEEE Access 5:21954–21961.

Yuan, X, Li C, Li X (2017) DeepDefense: Identifying DDoS Attack via Deep Learning In: 2017 IEEE International Conference on Smart Computing (SMARTCOMP).. IEEE. https://doi.org/10.1109/smartcomp.2017.7946998 .

Yun, I, Lee S, Xu M, Jang Y, Kim T (2018) QSYM : A Practical Concolic Execution Engine Tailored for Hybrid Fuzzing In: 27th USENIX Security Symposium (USENIX Security 18), pages 745–761.. USENIX Association, Baltimore.

Zhang, S, Meng W, Bu J, Yang S, Liu Y, Pei D, Xu J, Chen Y, Dong H, Qu X, Song L (2017) Syslog Processing for Switch Failure Diagnosis and Prediction in Datacenter Networks In: 2017 IEEE/ACM 25th International Symposium on Quality of Service (IWQoS).. IEEE. https://doi.org/10.1109/iwqos.2017.7969130 .

Zhang, J, Chen W, Niu Y (2019) DeepCheck: A Non-intrusive Control-flow Integrity Checking based on Deep Learning. arXiv preprint arXiv:1905.01858.

Zhang, X, Xu Y, Lin Q, Qiao B, Zhang H, Dang Y, Xie C, Yang X, Cheng Q, Li Z, Chen J, He X, Yao R, Lou J-G, Chintalapati M, Shen F, Zhang D (2019) Robust Log-based Anomaly Detection on Unstable Log Data In: Proceedings of the 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2019, pages 807–817.. ACM, New York.

Zhang, Y, Chen X, Guo D, Song M, Teng Y, Wang X (2019) PCCN: Parallel Cross Convolutional Neural Network for Abnormal Network Traffic Flows Detection in Multi-Class Imbalanced Network Traffic Flows. IEEE Access 7:119904–119916.


Acknowledgments

We are grateful to the anonymous reviewers for their useful comments and suggestions.

This work was supported by ARO W911NF-13-1-0421 (MURI), NSF CNS-1814679, and ARO W911NF-15-1-0576.

Author information

Authors and Affiliations

The Pennsylvania State University, Pennsylvania, USA

Yoon-Ho Choi, Peng Liu, Zitong Shang, Haizhou Wang, Zhilong Wang, Lan Zhang & Qingtian Zou

Pusan National University, Busan, Republic of Korea

Yoon-Ho Choi

Wuhan University of Technology, Wuhan, China

Junwei Zhou


Contributions

All authors read and approved the final manuscript.

Corresponding author

Correspondence to Peng Liu .

Ethics declarations

Competing interests

PL is currently serving on the editorial board for Journal of Cybersecurity.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article

Choi, YH., Liu, P., Shang, Z. et al. Using deep learning to solve computer security challenges: a survey. Cybersecur 3 , 15 (2020). https://doi.org/10.1186/s42400-020-00055-5


Received : 11 March 2020

Accepted : 17 June 2020

Published : 10 August 2020

DOI : https://doi.org/10.1186/s42400-020-00055-5


Keywords

  • Deep learning
  • Security-oriented program analysis
  • Return-oriented programming attacks
  • Control-flow integrity
  • Network attacks
  • Malware classification
  • System-event-based anomaly detection
  • Memory forensics
  • Fuzzing for software security


Deep Learning and Machine Learning Algorithms Methods in Cyber Security

  • Conference paper
  • First Online: 28 July 2024


  • Mohammed Abdulhakim Al-Absi,
  • Hind R’Bigui &
  • Ahmed A. Al-Absi

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 914)

Included in the following conference series:

  • International Conference on Smart Computing and Cyber Security: Strategic Foresight, Security Challenges and Innovation


This study investigates the use of Machine Learning (ML) and Deep Learning (DL) algorithms in cyber security. By leveraging ML and DL technologies, organizations can enhance their ability to detect and prevent cyber threats, protecting sensitive data and critical systems. In this study, we explored different machine learning and deep learning techniques used in cyber security. We discussed the advantages and disadvantages of each algorithm. Additionally, we identified the top 12 cybersecurity datasets that can be valuable for your future machine learning and deep learning projects.



Author information

Authors and Affiliations

Department of Computer Engineering, Dongseo University, 47 Jurye-ro, Sasang-gu, Busan, 47011, Republic of Korea

Mohammed Abdulhakim Al-Absi

Digital Enterprise Department, Nsoft, CO., LTD, No. 407, 14, Tekeunosaneop-go 55beon gil, Nam-gu, Ulsan, 44776, Republic of Korea

Hind R’Bigui

Department of Smart Computing, Kyungdong University, 46 4-gil, Bongpo, Goseong-gun, Gangwon-do, 24764, Korea

Ahmed A. Al-Absi


Corresponding author

Correspondence to Mohammed Abdulhakim Al-Absi .

Editor information

Editors and Affiliations

School of Computer Engineering, KIIT Deemed University, Bhubaneswar, Odisha, India

Prasant Kumar Pattnaik

Department of Information and Communication Engineering, Dongseo University, Busan, Korea (Republic of)

Mangal Sain

Smart Computing Department, Kyungdong University Global Campus, Gangwondo, Korea (Republic of)


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper.

Al-Absi, M.A., R’Bigui, H., Al-Absi, A.A. (2024). Deep Learning and Machine Learning Algorithms Methods in Cyber Security. In: Pattnaik, P.K., Sain, M., Al-Absi, A.A. (eds) Proceedings of 3rd International Conference on Smart Computing and Cyber Security. SMARTCYBER 2023. Lecture Notes in Networks and Systems, vol 914. Springer, Singapore. https://doi.org/10.1007/978-981-97-0573-3_22


DOI : https://doi.org/10.1007/978-981-97-0573-3_22

Published : 28 July 2024

Publisher Name : Springer, Singapore

Print ISBN : 978-981-97-0572-6

Online ISBN : 978-981-97-0573-3

eBook Packages : Intelligent Technologies and Robotics (R0)


Search Results

Large Language Models for Cyber Security: A Systematic Literature Review

1 code implementation • 8 May 2024

Overall, our survey provides a comprehensive overview of the current state-of-the-art in LLM4Security and identifies several promising directions for future research.

Evaluating Shallow and Deep Neural Networks for Network Intrusion Detection Systems in Cyber Security

5 code implementations • International Conference on Computing, Communication and Networking Technologies (ICCCNT) 2018

In this paper, DNNs have been utilized to predict the attacks on Network Intrusion Detection System (N-IDS).


Automatic Labeling for Entity Extraction in Cyber Security

3 code implementations • 22 Aug 2013

Timely analysis of cyber-security information necessitates automated information extraction from unstructured text.

Structural Generalization in Autonomous Cyber Incident Response with Message-Passing Neural Networks and Reinforcement Learning

2 code implementations • 8 Jul 2024

We believe that agents for automated incident response based on machine learning need to handle changes in network structure.

The Path To Autonomous Cyber Defense

1 code implementation • 12 Apr 2024

Defenders are overwhelmed by the number and scale of attacks against their networks. This problem will only be exacerbated as attackers leverage artificial intelligence to automate their workflows.

Developing Optimal Causal Cyber-Defence Agents via Cyber Security Simulation

3 code implementations • 25 Jul 2022

In this paper we explore cyber security defence, through the unification of a novel cyber security simulator with models for (causal) decision-making through optimisation.

SROS2: Usable Cyber Security Tools for ROS 2

1 code implementation • 4 Aug 2022

Built upon DDS as its default communication middleware and used in safety-critical scenarios, adding security to robots and ROS computational graphs is increasingly becoming a concern.

Cryptography and Security; Distributed, Parallel, and Cluster Computing; Networking and Internet Architecture; Robotics; Software Engineering

Fundamental Challenges of Cyber-Physical Systems Security Modeling

1 code implementation • 30 Apr 2020

Systems modeling practice lacks security analysis tools that can interface with modeling languages to facilitate security by design.

Cryptography and Security; Systems and Control

A Model-Based Approach to Security Analysis for Cyber-Physical Systems

1 code implementation • 31 Oct 2017

To construct such a model we produce a taxonomy of attributes; that is, a generalized schema for system attributes.

Looking for a Black Cat in a Dark Room: Security Visualization for Cyber-Physical System Design and Analysis

1 code implementation • 24 Aug 2018

Today, there is a plethora of software security tools employing visualizations that enable the creation of useful and effective interactive security analyst dashboards.

Human-Computer Interaction; Cryptography and Security

Machine Learning in Cyber Security

6 Pages Posted: 17 Aug 2022

Mukesh Choubisa

Department of Computer Science & Engineering, Institute of Advance Research, Gandhinagar

Dhyani Upadhyay

Arizona State University; Department of Computer Science & Engineering, Institute of Advance Research, Gandhinagar

Date Written: August 6, 2022

Machine learning technology has become mainstream in a large number of domains, and cybersecurity applications of machine learning techniques are plentiful. Examples include malware analysis, especially for zero-day malware detection, threat analysis, and anomaly-based intrusion detection of conventional attacks on critical infrastructures, among others. Because signature-based methods are ineffective at detecting zero-day attacks or even mild variants of known attacks, machine-learning-based detection is being used by researchers in many cybersecurity products. In this review, we discuss several areas of cybersecurity where machine learning is used as a tool. We also provide some glimpses of adversarial attacks on machine learning algorithms that manipulate the training and test data of classifiers to render such tools ineffective.

Keywords: Machine Learning, cyber security, ML using security, network security, DDoS attack


  • Corpus ID: 271270843

Using LLMs to Automate Threat Intelligence Analysis Workflows in Security Operation Centers

  • PeiYu Tseng, ZihDwo Yeh, +1 author Peng Liu
  • Published 18 July 2024
  • Computer Science, Political Science


10 References

Threat intelligence.


TINKER: A framework for Open source Cyberthreat Intelligence

Enabling Efficient Cyber Threat Hunting with Cyber Threat Intelligence

TTPDrill: Automatic and Accurate Extraction of Threat Actions from Unstructured Text of CTI Sources

Using Supervised Machine Learning to Code Policy Issues: Can Classifiers Generalize across Contexts?



A Literature Review on Machine Learning for Cyber Security Issues

  • November 2022
  • International Journal of Scientific Research in Computer Science Engineering and Information Technology

Jay Kumar Jain (Maulana Azad National Institute of Technology, Bhopal)

Akhilesh A. Waoo (AKS University, Satna)

Dipti Chauhan (Prestige Institute of Engineering Management & Research)



Swin-Fake: A Consistency Learning Transformer-Based Deepfake Video Detector


1. Introduction

2. Related Work

2.1. Deepfake Detection

2.2. Swin Transformer

2.3. Data Augmentation

3. Proposed Methods

The workflow of Swin-Fake (algorithm summary; the mathematical symbols were lost in extraction): given a pretrained Swin Transformer and video sequences of X frames, group the frames into M sets, obtain 2-D features via forward propagation, compute the consistency between feature pairs using Equation (3), classify into 2 classes, and output a trained Swin-Fake model.
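Because Equation (3) itself did not survive extraction, the following is only a plausible sketch of how a pairwise consistency term over the M grouped frame features could look, assuming it is the mean cosine similarity between all feature pairs; it should not be read as the authors' exact loss:

```python
# Plausible sketch of a pairwise cosine-similarity consistency term over frame
# features. Equation (3) of the paper was lost in extraction, so this form is
# an assumption, not the authors' exact definition.
import torch
import torch.nn.functional as F

def consistency(features: torch.Tensor) -> torch.Tensor:
    """features: (M, D) tensor, one encoder output per frame group."""
    z = F.normalize(features, dim=1)          # unit-norm rows
    sim = z @ z.t()                           # (M, M) cosine similarities
    m = sim.size(0)
    off_diag = sim[~torch.eye(m, dtype=torch.bool)]
    return off_diag.mean()                    # high when frames are consistent

feats = torch.randn(4, 768)                   # e.g. 4 groups of Swin features
print(consistency(feats))
```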

3.1. Pre-Processing Work

3.2. Encoder Network

3.3. Loss Functions

4. Experiments

4.1. Datasets

4.2. Implementation Details

4.3. In-Dataset Study

4.4. Ablation Test Results

4.5. Cross-Dataset Study

5. Conclusions and Future Work

Author Contributions

Data Availability Statement

Conflicts of Interest

  • Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Wared-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Network. Commun. Acm 2018 , 63 , 139–144. [ Google Scholar ] [ CrossRef ]
  • Karras, T.; Laine, S.; Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [ Google Scholar ]
  • Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. In Proceedings of the 34th International Conference on Neural Information Processing System, Red Hook, NY, USA, 6–12 December 2020; pp. 6840–6851. [ Google Scholar ]
  • Nguyen, H.H.; Yamagishi, J.; Echizen, I. Use of a Capsule Network to Detect Fake Images and Videos. In Proceedings of the Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019. [ Google Scholar ]
  • Das, S.; Seferbekov, S.; Datta, A.; Islam, M.S.; Amin, M.R. Towards solving the deepfake problem: An analysis on improving deepfake detection using dynamic face augmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021. [ Google Scholar ]
  • Zhao, T.; Xu, X.; Xu, M.; Ding, H.; Xiong, Y.; Xia, W. Learning Self-Consistency for Deepfake Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021. [ Google Scholar ]
  • Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations, Virtual, 26 April–1 May 2020. [ Google Scholar ]
  • Khan, S.A.; Dai, H. Video Transformer for Deepfake Detection with Incremental Learning. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual Event, 20–24 October 2021; pp. 1821–1828. [ Google Scholar ]
  • Khormali, A.; Yuan, J. DFDT: An End-to-End DeepFake Detection Framework Using Vision Transformer. Appl. Sci. 2022 , 12 , 2953. [ Google Scholar ] [ CrossRef ]
  • Zhao, C.; Wang, C.; Hu, G.; Chen, H.; Liu, C.; Tang, J. ISTVT: Interpretable Spatial-Temporal Video Transformer for Deepfake Detection. IEEE Trans. Inf. Forensics Secur. 2023 , 18 , 1335–1348. [ Google Scholar ] [ CrossRef ]
  • Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021. [ Google Scholar ]
  • Zhong, Z.; Zheng, L.; Kang, G.; Li, S.; Yang, Y. Random Erasing Data Augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020. [ Google Scholar ]
  • DFDC Selim. Available online: https://github.com/selimsef/dfdc_deepfake_challenge (accessed on 15 May 2024).
  • Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [ Google Scholar ]
  • He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [ Google Scholar ]
  • Ni, Y.; Meng, D.; Yu, C.; Quan, C.B.; Ren, D.; Zhao, Y. CORE: Consistent Representation Learning for Face Forgery Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [ Google Scholar ]
  • Sun, K.; Yao, T.; Chen, S.; Ding, S.; Li, J.; Ji, R. Dual Contrastive Learning for General Face Forgery Detection. AAAI Conf. Artif. Intell. 2022 , 36 , 2316–2324. [ Google Scholar ] [ CrossRef ]
  • Zhang, J.; Cheng, K.; Sovernigo, G.; Lin, X. A Heterogeneous Feature Ensemble Learning based Deepfake Detection Method. In Proceedings of the ICC 2022—IEEE International Conference on Communications, Seoul, Republic of Korea, 16–20 May 2022; pp. 2084–2089. [ Google Scholar ]
  • Rana, M.S.; Sung, A.H. DeepfakeStack: A Deep Ensemble-based Learning Technique for Deepfake Detection. In Proceedings of the 2020 7th IEEE International Conference on Cyber Security and Cloud Computing (CSCloud)/2020 6th IEEE International Conference on Edge Computing and Scalable Cloud (EdgeCom), New York, NY, USA, 1–3 August 2020; pp. 70–75. [ Google Scholar ]
  • Ilyas, H.; Javed, A.; Malik, K.M. AVFakeNet: A unified end-to-end Dense Swin Transformer deep learning model for audio–visual deepfakes detection. Appl. Soft Comput. 2023 , 136 , 110124. [ Google Scholar ] [ CrossRef ]
  • Khalid, F.; Akbar, M.H.; Gul, S. SWYNT: Swin Y-Net Transformers for Deepfake Detection. In Proceedings of the 2023 International Conference on Robotics and Automation in Industry (ICRAI), Peshawar, Pakistan, 3–5 March 2023; pp. 1–6. [ Google Scholar ]
  • Guo, J.; Deng, J.; Lattas, A.; Zafeiriou, S. Sample and Computation Redistribution for Efficient Face Detection. In Proceedings of the International Conference on Learning Representation, Virtual Event, 3–7 May 2021. [ Google Scholar ]
  • Zhou, T.F.; Wang, W.G.; Liang, Z.Y.; Shen, J.B. Face Forensics in the Wild. In Proceedings of the Computer Vision and Pattern Recognition (CVPR), Virtual, 19–25 June 2021. [ Google Scholar ]
  • Kaggle. Available online: https://www.kaggle.com/c/deepfake-detection-challenge/overview (accessed on 12 December 2023).
  • Li, Y.Z.; Yang, X.; Sun, P.; Qi, H.G.; Lyu, S. Celeb-DF: A Large-scale Challenging Dataset for DeepFake Forensics. In Proceedings of the Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020. [ Google Scholar ]
  • Li, L.; Bao, J.; Yang, H.; Chen, D.; Wen, F. FaceShifter: Towards High Fidelity And Occlusion Aware Face Swapping. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [ Google Scholar ]
  • Loshchilov, I.; Hutter, F. Decoupled weight decay Regularization. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [ Google Scholar ]
  • Gu, Z.; Chen, Y.; Yao, T.; Ding, S.; Li, J.; Huang, F.; Ma, L. Spatiotemporal Inconsistency Learning for Deepfake Video Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021. [ Google Scholar ]
  • Li, L.; Bao, J.; Zhang, T.; Yang, H.; Chen, D.; Wen, F.; Guo, B. Face X-ray for More General Face Forgery Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [ Google Scholar ]
  • Zheng, Y.; Bao, J.; Chen, D.; Zeng, M.; Wen, F. Exploring temporal coherence for more general video face forgery detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 15044–15054. [ Google Scholar ]
  • Haliassos, A.; Vougioukas, K.; Petridis, S.; Pantic, M. Lips don’t lie: A generalisable and robust approach to face forgery detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 5039–5049. [ Google Scholar ]


Methods | FF++ (DF) | FF++ (F2F) | FF++ (FS) | FF++ (NT) | DFDC
Capsule Net | 98.5% | 96.7% | 94.5% | 91.8% | -
STIL | 99.6% | 99.3% | 97.1% | 95.4% | 89.8%
ISTVT | 99.6% | 99.6% | 100% | 96.8% | 92.1%
Our approach | (values lost in extraction)
Methods | FF++ (Binary Test) | DFDC
Swin Transformer | 89.4% | 91.4%
Our approach | 94.3% | 93.7%

Setting (row header lost in extraction) | 0.4 | 0.5 | 0.6 | 0.7 | 0.8
Accuracy | 92.6% | 93.9% | 94.1% | 94.3% | 93.7%
Models | Celeb-DF | FaceShifter | DFDC | Average
Capsule Net | 34.5% | 42.4% | - | 38.5%
STIL | 75.6% | - | - | 75.6%
ISTVT | 84.1% | ? | ? | 85.9%
Face X-ray | 79.5% | 92.8% | 65.6% | 79.3%
FTCN | 86.9% | 98.8% | 74.0% | ?
LipForensics | 82.4% | 97.1% | 73.5% | 84.3%
Our approach | ? | 98.3% | 71.8% | 86.3%

(“?” marks cells whose values were lost in extraction; “-” denotes values absent in the original.)
The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

Gong, L.Y.; Li, X.J.; Chong, P.H.J. Swin-Fake: A Consistency Learning Transformer-Based Deepfake Video Detector. Electronics 2024 , 13 , 3045. https://doi.org/10.3390/electronics13153045

Gong LY, Li XJ, Chong PHJ. Swin-Fake: A Consistency Learning Transformer-Based Deepfake Video Detector. Electronics . 2024; 13(15):3045. https://doi.org/10.3390/electronics13153045

Gong, Liang Yu, Xue Jun Li, and Peter Han Joo Chong. 2024. "Swin-Fake: A Consistency Learning Transformer-Based Deepfake Video Detector" Electronics 13, no. 15: 3045. https://doi.org/10.3390/electronics13153045





RELATED PAPERS AND RESOURCES

  1. A Survey on Machine Learning Techniques for Cyber Security in the Last

    Conventional security systems lack efficiency in detecting previously unseen and polymorphic security attacks. Machine learning (ML) techniques are playing a vital role in numerous applications of cyber security. ... It finally discusses the challenges of using ML techniques in cyber security. This paper provides the latest extensive ...

  2. Current trends in AI and ML for cybersecurity: A state-of

    The results of this survey suggest that the integration of artificial intelligence (AI) and machine learning (ML) into cybersecurity systems is a promising direction for future research and development. However, it is also important to address the ethical and legal implications of these technologies.

  3. The Role of Machine Learning in Cybersecurity

    Abstract. Machine Learning (ML) represents a pivotal technology for current and future information systems, and many domains already leverage the capabilities of ML. However, deployment of ML in cybersecurity is still at an early stage, revealing a significant discrepancy between research and practice.

  4. Machine learning in cybersecurity: a comprehensive survey

    The increase in the use of the internet and related services has also raised the number of cyber-attack events each year. For instance, in 2015, the United States Office of Personnel Management (OPM) was attacked and information for approximately 21.5 million government employees, such as names, social security numbers, and addresses, was stolen. Yahoo, the email provider, suffered ...

  5. Cybersecurity data science: an overview from machine learning

    In a computing context, cybersecurity is undergoing massive shifts in technology and operations, and data science is driving the change. Extracting security-incident patterns or insights from cybersecurity data and building a corresponding data-driven model is the key to making a security system automated and intelligent. To understand and analyze the actual phenomena with data ... (A schematic supervised pipeline in this spirit is sketched after this list.)

  6. Artificial intelligence for cybersecurity: Literature review and future

    Researchers are actively working in the field of sensor identification and authentication to ensure the security of cyber-physical systems and the automotive sector. Channel [97] and sensor [98, 99] imperfections are used to find the transient and steady-state parameters as input to the machine learning model for sensor identification.

  7. (PDF) The Role of Machine Learning in Cybersecurity

    Machine Learning (ML) represents a pivotal technology for current and future information systems, and many domains already leverage the capabilities of ML. However, deployment of ML in ...

  8. Machine Learning in Cybersecurity

    Published: 03 May 2024. Machine learning (ML) is transforming cybersecurity by enabling advanced detection, prevention and response mechanisms. This paper provides a comprehensive review of ML's ...

  9. Explainable machine learning in cybersecurity: A survey

    Machine learning (ML) refers to a series of methods that data processing systems use to make and improve predictions [1]. It is fundamental to recent advances in cybersecurity technology, and it has been successfully integrated into security-relevant systems, for example, malware detection [2, 3], intrusion detection [4, 5] and vulnerability discovery [6, 7]. Such ML processes in these ... (A model-agnostic explanation sketch appears after this list.)

  10. A holistic and proactive approach to forecasting cyber threats

    Recent research has introduced effective Machine Learning (ML) models for cyber-attack detection, promising to automate the task of detecting, tracking and blocking malware and intruders.

  11. [2206.09707] The Role of Machine Learning in Cybersecurity

    The Role of Machine Learning in Cybersecurity. Machine Learning (ML) represents a pivotal technology for current and future information systems, and many domains already leverage the capabilities of ML. However, deployment of ML in cybersecurity is still at an early stage, revealing a significant discrepancy between research and practice.

  12. Using deep learning to solve computer security challenges: a survey

    Although using machine learning techniques to solve computer security challenges is not a new idea, the rapidly emerging Deep Learning technology has recently triggered a substantial amount of interest in the computer security community. This paper seeks to provide a dedicated review of the very recent research works on using Deep Learning techniques to solve computer security challenges.

  13. (PDF) Machine Learning and Cyber Security

    On Dec 1, 2017, Rishabh Das and others published Machine Learning and Cyber Security. ... This paper is an up-to-date survey of ...

  14. Uniting cyber security and machine learning: Advantages ...

    Advantages of uniting cyber security and machine learning. Cyber security and machine learning are essential to each other and can mutually improve performance. Some of the advantages of uniting them are as follows. • Full proof security of ML models: As discussed earlier, ML models are vulnerable to various attacks. The ...

  15. Machine Learning and Deep Learning Methods for Cybersecurity

    With the development of the Internet, cyber-attacks are changing rapidly and the cyber security situation is not optimistic. This survey report describes key literature surveys on machine learning (ML) and deep learning (DL) methods for network analysis of intrusion detection and provides a brief tutorial description of each ML/DL method. Papers representing each method were indexed, read, and ...

  16. Machine Learning: The Cyber‐Security, Privacy, and Public Safety

    After a thorough review process, this special issue has selected a set of eight papers to provide new insights on the abovementioned research areas. The paper entitled "An Automatic Source Code Vulnerability Detection Approach Based on KELM" by Gaigai Tang, Lin Yang, Shuangyin Ren, Lianxiao Meng, Feng Yang, and Huiqiang Wang proposes to use ...

  17. Deep Learning and Machine Learning Algorithms Methods in Cyber Security

    For computational approaches, ML is a special term for methods that attempt to mimic human learning processes using computers in order to find and acquire information automatically [6, 7]. Statistics, psychology, computer science, and neurology are all part of this multidisciplinary field of study [8, 9]. Because of recent developments in processing speed and huge data, the learning algorithm has evolved ...

  18. Cyber security: State of the art, challenges and future directions

    However, there are anti-machine learning (ML) attacks. In this paper, we discuss the challenges of cyber security and future research directions, including AI, machine learning, and other state-of-the-art techniques used to combat cyber security challenges. ... The researcher discusses the use of deep learning for cyber security threat ...

  19. Machine Learning and Cyber Security

    The application of machine learning (ML) techniques in cyber-security is increasing more than ever before. From IP traffic classification to filtering malicious traffic for intrusion detection, ML is one of the promising answers that can be effective against zero-day threats. New research is being done using statistical traffic characteristics and ML techniques. This paper is a ...

  20. Machine Learning in Cyber-Security Threats by Alex Mathew

    Recent years have seen tremendous change and increase in machine learning (ML) algorithms that are solving four main cyber-security issues, such as IDS, Android malware detection, spam detection, and malware analysis. The study analyzed these algorithms using a knowledge base system and data mining of information collected from various sources.

  21. Search for Cyber Security

    We believe that agents for automated incident response based on machine learning need to handle changes in network structure. ... In this paper we explore cyber security defence through the unification of a novel cyber ...

  22. Machine Learning in Cyber Security

    Examples include malware analysis (mainly zero-day malware detection), threat evaluation, anomaly-based intrusion detection of conventional attacks on critical infrastructures, and many others. Because of the ineffectiveness of signature-based methods in detecting zero-day attacks or even mild variants of known attacks, machine learning ...

  23. Using LLMs to Automate Threat Intelligence Analysis Workflows in

    This project aims to develop an AI agent to replace the labor-intensive, repetitive tasks involved in analyzing CTI reports; it exploits the revolutionary capabilities of LLMs (e.g., GPT-4) but does not require any human intervention. SIEM systems are prevalent and play a critical role in a variety of analyst workflows in Security Operation Centers. However, modern SIEMs face a big ...

  24. A Parallel Approach to Enhance the Performance of Supervised Machine

    Machine learning models play a critical role in applications such as image recognition, natural language processing, and medical diagnosis, where accuracy and efficiency are paramount. As datasets grow in complexity, so too do the computational demands of classification techniques. Previous research has achieved high accuracy but required significant computational time. This paper proposes a ... (A parallel-evaluation sketch appears after this list.)

  25. A Literature Review on Machine Learning for Cyber Security Issues

    In this review paper, we bring together ML and cyber security and discuss these two distinct notions, along with the benefits, problems, and difficulties of combining them. In ...

  26. JCP

    The amount of data related to cyber threats and cyber attack incidents is rapidly increasing. The extracted information can provide security analysts with useful Cyber Threat Intelligence (CTI) to enhance their decision-making. However, because the data sources are heterogeneous, there is a lack of common representation of information, rendering the analysis of CTI complicated. With this work ...

  27. Electronics

    Deepfake has become an emerging technology affecting cyber-security through its illegal applications in recent years. Most deepfake detectors utilize CNN-based models such as the Xception Network to distinguish real from fake media; however, their cross-dataset performance is not ideal because they currently suffer from over-fitting. Therefore, this paper proposes a spatial ...

  28. Cyber Threat Detection Using Machine Learning

    Abstract: In this paper, we have designed and implemented a solution for the detection of cyber threats using supervised machine learning (ML). An effective software program was adapted and refined in the Python language for training our machines on a large cyber-security dataset in order to detect and classify various types of network intrusions, making use of features extracted from network ... (See the schematic pipeline after this list.)

  29. The Role of Machine Learning in Cybersecurity

    The Role of Machine Learning in Cybersecurity ... complemented with two real case studies describing industrial applications of ML as defense against cyber-threats. CCS Concepts: • Security and privacy; • Computing methodologies → Machine learning; ... research papers—commonly claiming to outperform previous work—often lead to ...

  30. Announcing Phi-3 fine-tuning, new generative AI models, and other Azure

    Given their small compute footprint, cloud and edge compatibility, Phi-3 models are well suited for fine-tuning to improve base model performance across a variety of scenarios including learning a new skill or a task (e.g. tutoring) or enhancing consistency and quality of the response (e.g. tone or style of responses in chat/Q&A).
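
Entries 5 and 28 both describe supervised pipelines that learn to classify security events from extracted features. As an illustration of that general technique only (synthetic features, hypothetical class names; not the code or data of either cited paper), a schematic version might look like this:

```python
# Schematic multi-class intrusion classifier (illustrative only; the
# features, labels, and model are stand-ins, not those of the cited papers).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

CLASSES = ["normal", "dos", "probe", "r2l"]   # hypothetical intrusion labels
rng = np.random.default_rng(0)
X = rng.random((2000, 10))                    # e.g. flow-level statistics
y = rng.integers(0, len(CLASSES), size=2000)  # synthetic ground truth

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te), target_names=CLASSES))
```

On a real dataset the features would come from network flows or host logs and the class frequencies would be heavily imbalanced, which is why the per-class precision/recall breakdown matters more than raw accuracy.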
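
For entry 9, one common model-agnostic way to explain what a fitted detector relies on is permutation importance: shuffle one feature at a time and measure how much the score drops. A minimal sketch with synthetic data (a generic technique, not the method of the cited survey):

```python
# Permutation importance as a model-agnostic explanation.
# Only feature 2 carries signal here, so its importance should dominate.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(1)
X = rng.random((500, 5))
y = (X[:, 2] > 0.5).astype(int)

clf = GradientBoostingClassifier(random_state=0).fit(X, y)
res = permutation_importance(clf, X, y, n_repeats=10, random_state=0)
for i, imp in enumerate(res.importances_mean):
    print(f"feature {i}: mean importance {imp:.3f}")
```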
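
Entry 24 concerns cutting the computational time of supervised learning. One standard, library-level form of parallelism (illustrative; not the specific approach proposed in that paper) is to spread ensemble members and cross-validation folds across CPU cores:

```python
# Coarse-grained parallelism for supervised learning: n_jobs=-1 uses
# all available cores for the forest's trees and for the CV folds.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
scores = cross_val_score(clf, X, y, cv=5, n_jobs=-1)
print(f"mean CV accuracy: {scores.mean():.3f}")
```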