Fundamentals of Artificial Neural Networks and Deep Learning

  • Open Access
  • First Online: 14 January 2022


  • Osval Antonio Montesinos López,
  • Abelardo Montesinos López &
  • Jose Crossa


In this chapter, we go through the fundamentals of artificial neural networks and deep learning methods. We describe the inspiration for artificial neural networks and how the methods of deep learning are built. We define the activation function and its role in capturing nonlinear patterns in the input data. We explain the universal approximation theorem for understanding the power and limitation of these methods and describe the main topologies of artificial neural networks that play an important role in the successful implementation of these methods. We also describe loss functions (and their penalized versions) and give details about in which circumstances each of them should be used or preferred. In addition to the Ridge, Lasso, and Elastic Net regularization methods, we provide details of the dropout and the early stopping methods. Finally, we provide the backpropagation method and illustrate it with two simple artificial neural networks.



  • Artificial neural networks
  • Deep learning
  • Activation functions
  • Loss functions
  • Backpropagation method

10.1 The Inspiration for the Neural Network Model

The inspiration for artificial neural networks (ANN), or simply neural networks, resulted from the admiration for how the human brain computes complex processes, which is entirely different from the way conventional digital computers do this. The power of the human brain is superior to many information-processing systems, since it can perform highly complex, nonlinear, and parallel processing by organizing its structural constituents (neurons) to perform such tasks as accurate predictions, pattern recognition, perception, motor control, etc. It is also many times faster than the fastest digital computer in existence today. An example is the sophisticated functioning of the information-processing task called human vision. This system helps us to understand and capture the key components of the environment and supplies us with the information we need to interact with the environment. That is, the brain very often performs perceptual recognition tasks (e.g., voice recognition embedded in a complex scene) in around 100–200 ms, whereas less complex tasks many times take longer even on a powerful computer (Haykin 2009 ).

Another interesting example is the sonar of a bat, since the sonar is an active echolocation system. The sonar provides information not only about how far away the target is located but also about the relative velocity of the target, its size, and the size of various features of the target, including its azimuth and elevation. The computations required to extract all this information from the target echo occur within a brain the size of a plum. It is also documented that an echolocating bat has a high rate of success when pursuing and capturing its target and, for this reason, is the envy of radar and sonar engineers (Haykin 2009). This capacity of the bat, which detects obstacles in its flight path with surprising speed and accuracy by emitting an ultrasonic wave and then receiving and processing its echo, inspired the development of radar, which is able to detect objects in its path without needing to see them (Francisco-Caicedo and López-Sotelo 2009).

In general, the functioning of the brains of humans and other animals is intriguing because they are able to perform very complex tasks in a very short time and with high efficiency. For example, signals from sensors in the body convey information related to sight, hearing, taste, smell, touch, balance, temperature, pain, etc. Then the brain’s neurons, which are autonomous units, transmit, process, and store this information so that we can respond successfully to external and internal stimuli (Dougherty 2013). The neurons of many animals transmit spikes of electrical activity through a long, thin strand called an axon. An axon is divided into thousands of terminals or branches, where, depending on the size of the signal, they synapse to dendrites of other neurons (Fig. 10.1). It is estimated that the brain is composed of around \( {10}^{11} \) neurons that work in parallel, since the processing done by the neurons and the memory captured by the synapses are distributed together over the network. The amount of information processed and stored depends on the threshold firing levels and also on the weight given by each neuron to each of its inputs (Dougherty 2013).

figure 1

A graphic representation of a biological neuron

One of the characteristics of biological neurons, to which they owe their great capacity to process and perform highly complex tasks, is that they are highly connected to other neurons, from which they receive stimuli from an event as it occurs, or hundreds of electrical signals with the information learned. When it reaches the body of the neuron, this information affects its behavior and can also affect a neighboring neuron or muscle (Francisco-Caicedo and López-Sotelo 2009). Francisco-Caicedo and López-Sotelo (2009) also point out that the communication between neurons goes through the so-called synapses. A synapse is a space that is occupied by chemicals called neurotransmitters. These neurotransmitters are responsible for blocking or passing on signals that come from other neurons. The neurons receive electrical signals from other neurons with which they are in contact. These signals accumulate in the body of the neuron and determine what to do. If the total electrical signal received by the neuron is sufficiently large, the firing threshold is overcome, which allows the neuron to be activated or, on the contrary, to remain inactive. When a neuron is activated, it is able to transmit an electrical impulse to the neurons with which it is in contact. This new impulse, for example, acts as an input to other neurons or as a stimulus in some muscles (Francisco-Caicedo and López-Sotelo 2009). The architecture of biological neural networks is still the subject of active research, but some parts of the brain have been mapped, and it seems that neurons are often organized in consecutive layers, as shown in Fig. 10.2.

figure 2

Multiple layers in a biological neural network of human cortex

ANN are machines designed to perform specific tasks by imitating how the human brain works, building a neural network made up of hundreds or even thousands of artificial neurons or processing units. The artificial neural network is implemented by developing a computational learning algorithm that does not need to be programmed with all the rules, since it is able to build up its own rules of behavior through what we usually refer to as “experience.” The practical implementation of neural networks is possible due to the fact that they are massively parallel computing systems made up of a huge number of basic processing units (neurons) that are interconnected and learn from their environment, and the synaptic weights capture and store the strengths of the interconnected neurons. The job of the learning algorithm consists of modifying the synaptic weights of the network in a sequential and supervised way to reach a specific objective (Haykin 2009). There is evidence that neurons working together are able to learn complex linear and nonlinear input–output relationships by using sequential training procedures. It is important to point out that even though the inspiration for these models was quite different from what inspired statistical models, the building blocks of both types of models are quite similar. Anderson et al. (1990) and Ripley (1993) pointed out that neural networks are simply no more than generalized nonlinear statistical models. However, Anderson et al. (1990) were more expressive in this sense and also pointed out that “ANN are statistics for amateurs since most neural networks conceal the statistics from the user.”

10.2 The Building Blocks of Artificial Neural Networks

To get a clear idea of the main elements used to construct ANN models, in Fig. 10.3 we provide a general artificial neural network model that contains the main components for this type of models.

figure 3

General artificial neural network model

\( {x}_1,\dots, {x}_p \) represent the information (input) that the neuron receives from the external sensory system or from other neurons with which it has a connection. \( \boldsymbol{w}=\left({w}_1,\dots, {w}_p\right) \) is the vector of synaptic weights that modifies the received information, emulating the synapse between the biological neurons. These weights can be interpreted as gains that can attenuate or amplify the values that they wish to propagate toward the neuron. The parameter \( b_j \) is known as the bias (intercept or threshold) of a neuron. Here in ANN, learning refers to the method of modifying the weights of connections between the nodes (neurons) of a specified network.

The different values that the neuron receives are modified by the synaptic weights and then added together to produce what is called the net input. In mathematical notation, the net input of neuron j is equal to

\( {v}_j=\sum_{i=1}^p{w}_i{x}_i+{b}_j \)

This net input ( \( v_j \) ) is what determines whether the neuron is activated or not. The activation of the neuron depends on what we call the activation function. The net input is evaluated in this function and we obtain the output of the network as shown next:

\( {y}_j=g\left({v}_j\right) \)

where g is the activation function. For example, if we define this function as a unit step (also called threshold), the output will be 1 if the net input is greater than zero; otherwise the output will be 0. Although there is no biological behavior indicating the presence of something similar to the neurons of the brain, the use of the activation function is an artifice that allows applying ANN to a great diversity of real problems. According to what has been mentioned, output \( y_j \) of the neuron is generated when evaluating the net input ( \( v_j \) ) in the activation function. We can propagate the output of the neuron to other neurons or it can be the output of the network, which, according to the application, will have an interpretation for the user. In general, the job of an artificial neural network model is done by simple elements called neurons. The signals are passed between neurons through connection links. Each connection link has an associated weight, which, in a typical neural network, multiplies the transmitted signal. Each neuron applies an activation function (usually nonlinear) to its net input (the weighted sum of the input signals) to determine its output signal. Later in this chapter, we describe the many options for activation functions and the context in which they can be used.
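As a minimal sketch of the computation just described, the following Python snippet evaluates a single neuron with a unit-step activation; the weights, bias, and inputs are illustrative values, not taken from the chapter:

```python
# Single artificial neuron: net input v_j followed by a unit-step activation.

def step(z):
    """Unit-step (threshold) activation: 1 if the net input is positive, else 0."""
    return 1.0 if z > 0 else 0.0

def neuron_output(x, w, b, g=step):
    """Compute v_j = sum_i w_i * x_i + b_j, then apply the activation g."""
    v = sum(wi * xi for wi, xi in zip(w, x)) + b
    return g(v)

x = [0.5, -1.0, 2.0]   # inputs x_1..x_p
w = [0.4, 0.3, 0.6]    # synaptic weights
b = -0.5               # bias (intercept)

# net input = 0.2 - 0.3 + 1.2 - 0.5 = 0.6, which is > 0, so the neuron fires
print(neuron_output(x, w, b))  # 1.0
```

Swapping `step` for another function `g` changes only the final mapping of the net input, which is exactly the role the activation function plays in the text.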

A unilayer ANN like that in Fig. 10.3 has a low processing capacity by itself and its level of applicability is low; its true power lies in the interconnection of many ANNs, as happens in the human brain. This has motivated different researchers to propose various topologies (architectures) to connect neurons to each other in the context of ANN. Next, we provide two definitions of ANN and one definition of deep learning:

Definition 1 . An artificial neural network is a system composed of many simple elements of processing which operate in parallel and whose function is determined by the structure of the network and the weight of connections, where the processing is done in each of the nodes or computing elements that has a low processing capacity (Francisco-Caicedo and López-Sotelo 2009 ).

Definition 2 . An artificial neural network is a structure containing simple elements that are interconnected in many ways with hierarchical organization, which tries to interact with objects in the real world in the same way as the biological nervous system does (Kohonen 2000 ).

Deep learning model. We define deep learning as a generalization of ANN where more than one hidden layer is used, which implies that more neurons are used for implementing the model. For this reason, an artificial neural network with multiple hidden layers is called a Deep Neural Network (DNN), and the practice of training this type of network is called deep learning (DL), which is a branch of statistical machine learning where a multilayered (deep) topology is used to map the relations between input variables (independent variables) and the response variable (outcome). Chollet and Allaire (2017) point out that DL puts the “emphasis on learning successive layers of increasingly meaningful representations.” The adjective “deep” applies not to the acquired knowledge, but to the way in which the knowledge is acquired (Lewis 2016), since it stands for the idea of successive layers of representations. The “depth” of the model refers to the number of layers that contribute to the model. For this reason, this field is also called layered representation learning and hierarchical representation learning (Chollet and Allaire 2017).

It is important to point out that DL as a subset of machine learning is an aspect of artificial intelligence (AI) that has more complex ways of connecting layers than conventional ANN, which uses more neurons than previous networks to capture nonlinear aspects of complex data better, but at the cost of more computing power required to automatically extract useful knowledge from complex data.

To have a more complete picture of ANN, we provide another model, which is a DL model since it has two hidden layers, as shown in Fig. 10.4 .

figure 4

Artificial deep neural network with a feedforward neural network with eight input variables ( x 1, … , x 8), four output variables ( y 1, y 2, y 3, y 4), and two hidden layers with three neurons each

From Fig. 10.4 we can see that an artificial neural network is a directed graph whose nodes correspond to neurons and whose edges correspond to links between them. Each neuron receives, as input, a weighted sum of the outputs of the neurons connected to its incoming edges (Shalev-Shwartz and Ben-David 2014). In the artificial deep neural network given in Fig. 10.4, there are four layers ( V 0 , V 1 , V 2 , and V 3 ): V 0 represents the input layer, V 1 and V 2 are the hidden layers, and V 3 denotes the output layer. In this artificial deep neural network, three is the number of layers of the network since V 0 , which contains the input information, is excluded. This is also called the “depth” of the network. The size of this network is \( \left|V\right|=\left|\bigcup \limits_{t=0}^{3}{V}_t\right|=9+4+4+4=21 \). Note that in the input layer and in each hidden layer we added +1 to the observed units to represent the node of the bias (or intercept). The width of the network is \( \max \left|{V}_t\right|=9 \).
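These counts can be verified with a few lines of arithmetic; the layer sizes below (including the +1 bias node in the input and hidden layers) follow Fig. 10.4:

```python
# Layer sizes for the network of Fig. 10.4, counting the bias node (+1)
# in the input layer and in each of the two hidden layers.
layer_sizes = [8 + 1, 3 + 1, 3 + 1, 4]  # V0, V1, V2, V3

size = sum(layer_sizes)       # |V| = 9 + 4 + 4 + 4 = 21
depth = len(layer_sizes) - 1  # the input layer V0 is excluded -> 3
width = max(layer_sizes)      # the widest layer -> 9

print(size, depth, width)  # 21 3 9
```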

The analytical form of the model given in Fig. 10.4 for output o , with d inputs, M 1 hidden neurons (units) in hidden layer 1, M 2 hidden units in hidden layer 2, and O output neurons is given by the following ( 10.1 )–( 10.3 ):

\( {z}_j^{(1)}={g}_1\left(\sum_{i=1}^d{w}_{ji}^{(1)}{x}_i+{b}_j^{(1)}\right),\quad j=1,\dots, {M}_1 \)    (10.1)

\( {z}_k^{(2)}={g}_2\left(\sum_{j=1}^{M_1}{w}_{kj}^{(2)}{z}_j^{(1)}+{b}_k^{(2)}\right),\quad k=1,\dots, {M}_2 \)    (10.2)

\( {y}_l={g}_3\left(\sum_{k=1}^{M_2}{w}_{lk}^{(3)}{z}_k^{(2)}+{b}_l^{(3)}\right),\quad l=1,\dots, O \)    (10.3)

where ( 10.1 ) produces the output of each of the neurons in the first hidden layer, ( 10.2 ) produces the output of each of the neurons in the second hidden layer, and finally ( 10.3 ) produces the output of each response variable of interest. The learning process is obtained with the weights ( \( {w}_{ji}^{(1)} \), \( {w}_{kj}^{(2)} \), and \( {w}_{lk}^{(3)} \) ), which are accommodated in the following vector: \( \boldsymbol{w}=\left({w}_{11}^{(1)},{w}_{12}^{(1)},\dots, {w}_{1d}^{(1)},{w}_{21}^{(2)},{w}_{22}^{(2)},\dots, {w}_{2{M}_1}^{(2)},{w}_{31}^{(3)},{w}_{32}^{(3)},\dots, {w}_{3{M}_2}^{(3)}\right) \). Here, \( g_1 \), \( g_2 \), and \( g_3 \) are the activation functions of hidden layers 1 and 2 and the output layer, respectively.
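A compact numerical sketch of this forward pass follows, with randomly initialized weights purely for illustration; the dimensions match Fig. 10.4, while the choice of ReLU for the hidden layers and the identity for the output layer is an arbitrary assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions matching Fig. 10.4: d inputs, M1 and M2 hidden units, O outputs.
d, M1, M2, O = 8, 3, 3, 4

# Randomly initialized weights and biases (illustrative only, not learned).
W1, b1 = rng.normal(size=(M1, d)), rng.normal(size=M1)
W2, b2 = rng.normal(size=(M2, M1)), rng.normal(size=M2)
W3, b3 = rng.normal(size=(O, M2)), rng.normal(size=O)

def relu(z):
    return np.maximum(0.0, z)

def forward(x, g1=relu, g2=relu, g3=lambda z: z):
    z1 = g1(W1 @ x + b1)     # Eq. (10.1): first hidden layer
    z2 = g2(W2 @ z1 + b2)    # Eq. (10.2): second hidden layer
    return g3(W3 @ z2 + b3)  # Eq. (10.3): output layer

y = forward(rng.normal(size=d))
print(y.shape)  # (4,)
```

Training would then consist of adjusting the entries of W1, b1, W2, b2, W3, and b3, which together play the role of the weight vector w above.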

The model given in Fig. 10.4 is organized as several interconnected layers: the input layer, hidden layers, and output layer, where each layer that performs nonlinear transformations is a collection of artificial neurons, and connections among these layers are made using weights (Fig. 10.4 ). When only one output variable is present in Fig. 10.4 , the model is called univariate DL model. Also, when only one hidden layer is present in Fig. 10.4 , the DL model is reduced to a conventional artificial neural network model, but when more than one hidden layer is included, it is possible to better capture complex interactions, nonlinearities, and nonadditive effects. To better understand the elements of the model depicted in Fig. 10.4 , it is important to distinguish between the types of layers and the types of neurons; for this reason, next we will explain the type of layers and then the type of neurons in more detail.

Input layer: It is the set of neurons that directly receives the information coming from the external sources of the network. In the context of Fig. 10.4, this information is x 1, … , x 8 (Francisco-Caicedo and López-Sotelo 2009). Therefore, the number of neurons in an input layer is usually the same as the number of input explanatory variables provided to the network. Input layers are usually followed by at least one hidden layer. Only in feedforward neural networks are input layers fully connected to the next hidden layer (Patterson and Gibson 2017).

Hidden layers: Consist of a set of internal neurons of the network that do not have direct contact with the outside; for this reason, they are called hidden layers. The number of hidden layers can be 0, 1, or more. In general, the neurons of each hidden layer share the same type of information. The neurons of the hidden layers can be interconnected in different ways; this determines, together with their number, the different topologies of ANN and DNN (Francisco-Caicedo and López-Sotelo 2009). The learned information extracted from the training data is stored and captured by the weight values of the connections between the layers of the artificial neural network. Also, it is important to point out that hidden layers are key components for capturing complex nonlinear behaviors of data more efficiently (Patterson and Gibson 2017).

Output layer: It is the set of neurons that transfers the information that the network has processed to the outside (Francisco-Caicedo and López-Sotelo 2009). In Fig. 10.4 the output neurons correspond to the output variables y 1, y 2, y 3, and y 4. This means that the output layer gives the answer or prediction of the artificial neural network model based on the input from the input layer. The final output can be continuous, binary, ordinal, or count, depending on the setup of the ANN, which is controlled by the activation function (the inverse link function, in statistical terms) specified on the neurons in the output layer (Patterson and Gibson 2017).

Next, we define the types of neurons: (1) input neuron . A neuron that receives external inputs from outside the network; (2) output neuron . A neuron that produces some of the outputs of the network; and (3) hidden neuron . A neuron that has no direct interaction with the “outside world” but only with other neurons within the network. Similar terminology is used at the layer level for multilayer neural networks .

As can be seen in Fig. 10.4, the distribution of neurons within an artificial neural network is done by forming levels of a certain number of neurons. If a set of artificial neurons simultaneously receives the same type of information, we call it a layer. We have already described a network with three types of such levels, called layers. Figure 10.5 shows another six networks with different numbers of layers; half of them (Fig. 10.5a, c, e) are univariate since there is only one response variable to predict, while the other half (Fig. 10.5b, d, f) are multivariate since the interest of the network is to predict two outputs. It is important to point out that subpanels a and b in Fig. 10.5 are networks with only one layer and no hidden layers; for this reason, these networks correspond to conventional regression or classification models.

figure 5

Different feedforward topologies with univariate and multivariate outputs and different number of layers. ( a ) Unilayer and univariate output. ( b ) Unilayer and multivariate output. ( c ) Three layer and univariate output. ( d ) Three layer and multivariate output. ( e ) Four layer univariate output. ( f ) Four layer multivariate output

Therefore, the topology of an artificial neural network is the way in which neurons are organized inside the network; it is closely linked to the learning algorithm used to train the network. Depending on the number of layers, we define the networks as monolayer and multilayer ; and if we take as a classification element the way information flows, we define the networks as feedforward or recurrent. Each type of topology will be described in another section.

In summary, an artificial (deep) neural network model is an information processing system that mimics the behavior of biological neural networks, which was developed as a generalization of mathematical models of human knowledge or neuronal biology.

10.3 Activation Functions

The mapping between inputs and a hidden layer in ANN and DNN is determined by activation functions. Activation functions propagate the output of one layer’s nodes forward to the next layer (up to and including the output layer). Activation functions are scalar-to-scalar functions that provide a specific output of the neuron. Activation functions allow nonlinearities to be introduced into the network’s modeling capabilities (Wiley 2016 ). The activation function of a neuron (node) defines the functional form for how a neuron gets activated. For example, if we define a linear activation function as g ( z ) =  z , in this case the value of the neuron would be the raw input, x , times the learned weight, that is, a linear model. Next, we describe the most popular activation functions.

10.3.1 Linear

Figure 10.6 shows a linear activation function that is basically the identity function. It is defined as g ( z ) =  Wz , where the dependent variable has a direct, proportional relationship with the independent variable. In practical terms, it means the function passes the signal through unchanged. The problem with making activation functions linear is that this does not permit any nonlinear functional forms to be learned (Patterson and Gibson 2017 ).

figure 6

Representation of a linear activation function

10.3.2 Rectifier Linear Unit (ReLU)

The rectifier linear unit (ReLU) activation function is one of the most popular. The ReLU activation function is flat below some threshold (usually zero) and linear above it. The ReLU activates a node only if the input is above a certain quantity. When the input is below zero, the output is zero, but when the input rises above the threshold, it has a linear relationship with the dependent variable, g ( z ) =  max (0,  z ), as shown in Fig. 10.7. Despite its simplicity, the ReLU activation function provides a nonlinear transformation, and enough linear rectifiers can be combined to approximate arbitrary nonlinear functions, unlike when only linear activation functions are used (Patterson and Gibson 2017). ReLUs are the current state of the art because they have proven to work in many different situations. Because the gradient of a ReLU is either zero or a constant, ReLUs help control the vanishing and exploding gradient issues; however, because the gradient is exactly zero for negative inputs, a neuron can stop updating altogether, a problem known as the “dying ReLU” issue. ReLU activation functions have been shown to train better in practice than sigmoid activation functions. This activation function is the most used in hidden layers and in output layers when the response variable is continuous and larger than zero.

figure 7

Representation of the ReLU activation function

10.3.3 Leaky ReLU

Leaky ReLUs are a strategy to mitigate the “dying ReLU” issue. As opposed to having the function be zero when z  < 0, the leaky ReLU instead has a small negative slope, α, where α is a value between 0 and 1 (Fig. 10.8). In practice, some success has been achieved with this ReLU variation, but results are not always consistent. This activation function is given here:

\( g(z)=\left\{\begin{array}{cc} z & \mathrm{if}\ z>0\\ \alpha z & \mathrm{if}\ z\le 0\end{array}\right. \)

figure 8

Representation of the Leaky ReLU activation function with α  = 0.1
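Both rectifier variants can be sketched in a few lines; the value α = 0.1 matches Fig. 10.8:

```python
def relu(z):
    """ReLU: zero below the threshold (0), linear above it."""
    return max(0.0, z)

def leaky_relu(z, alpha=0.1):
    """Leaky ReLU: a small negative slope alpha for z <= 0 instead of a flat zero."""
    return z if z > 0 else alpha * z

print(relu(-2.0), relu(3.0))              # 0.0 3.0
print(leaky_relu(-2.0), leaky_relu(3.0))  # -0.2 3.0
```

The only difference between the two is the nonzero slope for negative inputs, which keeps a small gradient flowing and is what mitigates the dying-ReLU problem.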

10.3.4 Sigmoid

A sigmoid activation function is a function that maps independent variables of essentially infinite range into simple probabilities between 0 and 1, and most of its output will be very close to 0 or 1. Like all logistic transformations, sigmoids can reduce extreme values or outliers in data without removing them. This activation function resembles an S (Wiley 2016; Patterson and Gibson 2017) and is defined as g ( z ) = (1 +  e − z ) −1 . This activation function is one of the most common types of activation functions used to construct ANNs and DNNs, where the outcome is a probability or binary outcome. This activation function is a strictly increasing function that exhibits a graceful balance between linear and nonlinear behavior but has the propensity to get “stuck,” i.e., the output values will be very close to 1 or 0 when the input values are strongly positive or negative (Fig. 10.9). By getting “stuck” we mean that the learning process is not improving due to the large or small values of the output of this activation function.

figure 9

Representation of the sigmoid activation function
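The definition above, and the saturation (“stuck”) behavior, can be illustrated directly; the sample inputs below are arbitrary:

```python
import math

def sigmoid(z):
    """Logistic sigmoid: maps any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

for z in (-10, -2, 0, 2, 10):
    print(z, round(sigmoid(z), 4))

# Near z = 0 the output changes quickly; for strongly positive or negative
# inputs the output saturates near 1 or 0, which is where learning gets "stuck".
```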

10.3.5 Softmax

Softmax is a generalization of the sigmoid activation function that handles multinomial labeling systems, that is, it is appropriate for categorical outcomes. Softmax is the function you will often find in the output layer of a classifier with more than two categories. The softmax activation function returns the probability distribution over mutually exclusive output classes. To further illustrate the idea of the softmax output layer and how to use it, let’s consider two types of uses. If we have a multiclass modeling problem in which we only care about the best score across these classes, we would use a softmax output layer with an argmax() function to get the highest score across all classes. For example, let us assume that our categorical response has ten classes; with this activation function we calculate a probability for each category (the probabilities of the ten categories sum to one) and we classify a particular individual into the class with the largest probability. It is important to recall that if we want to get binary classifications per output (e.g., “diseased and not diseased”), we do not want softmax as an output layer; instead, we would use the sigmoid activation function explained before. The softmax function is defined as

\( g{(z)}_c=\frac{\exp \left({z}_c\right)}{\sum_{k=1}^C\exp \left({z}_k\right)},\quad c=1,\dots, C \)

This activation function is a generalization of the sigmoid activation function that squeezes (forces) a C-dimensional vector of arbitrary real values into a C-dimensional vector of real values in the range [0, 1] that add up to 1. A strong prediction would have a single entry in the vector close to 1, while the remaining entries would be close to 0. A weak prediction would have multiple possible categories (labels) that are more or less equally likely. The sigmoid and softmax activation functions are suitable for probabilistic interpretation due to the fact that the output is a probability distribution over the classes. This activation function is mostly recommended for output layers when the response variable is categorical.
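A sketch of the softmax follows; the max-subtraction step is a standard numerical-stability trick not discussed in the chapter, and the class scores are illustrative:

```python
import math

def softmax(z):
    """Map a C-dimensional vector of reals to probabilities that sum to 1."""
    m = max(z)  # subtract the max before exponentiating, for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

scores = [2.0, 1.0, 0.1]  # illustrative class scores (logits) for C = 3 classes
probs = softmax(scores)
print(probs, probs.index(max(probs)))  # the argmax picks class index 0
```

Subtracting the maximum does not change the result, since the common factor cancels in the ratio; it only prevents overflow for large scores.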

10.3.6 Tanh

The hyperbolic tangent (Tanh) activation function is defined as \( \tanh \left(\mathrm{z}\right)=\sinh \left(\mathrm{z}\right)/\cosh \left(\mathrm{z}\right)=\frac{\exp (z)-\exp \left(-z\right)}{\exp (z)+\exp \left(-z\right)} \). The hyperbolic tangent works well in some cases and, like the sigmoid activation function, has a sigmoidal (“S”-shaped) output, with the advantage that it is less likely to get “stuck” than the sigmoid activation function, since its output values lie between −1 and 1, as shown in Fig. 10.10. For this reason, the Tanh activation function is often preferred over the sigmoid for hidden layers. Large negative inputs to the tanh function will give negative outputs, while large positive inputs will give positive outputs (Patterson and Gibson 2017). The advantage of tanh is that it can deal more easily with negative numbers.

figure 10

Representation of the tanh activation function

It is important to point out that there are more activation functions, like the threshold activation function introduced in the pioneering work on ANN by McCulloch and Pitts (1943), but the ones just mentioned are among the most used.

10.4 The Universal Approximation Theorem

The universal approximation theorem is at the heart of ANN since it provides the mathematical basis of why artificial neural networks work in practice for nonlinear input–output mapping. According to Haykin ( 2009 ), this theorem can be stated as follows.

Let g (·) be a nonconstant, bounded, and monotone-increasing continuous function. Let \( {I}_{m_0} \) denote the m 0 -dimensional unit hypercube \( {\left[0,1\right]}^{m_0} \) . The space of continuous functions on \( {I}_{m_0} \) is denoted by \( C\left({I}_{m_0}\right) \) . Then, given any function \( f\in C\left({I}_{m_0}\right) \) and ε  > 0, there is an integer m 1 and sets of real constants α i , b i , and w ij , where i  = 1,…,  m 1 and j  = 1,…,  m 0 , such that we may define

\( F\left({x}_1,\dots, {x}_{m_0}\right)=\sum_{i=1}^{m_1}{\alpha}_i g\left(\sum_{j=1}^{m_0}{w}_{ij}{x}_j+{b}_i\right) \)

as an approximate realization of the function f (·); that is,

\( \left|F\left({x}_1,\dots, {x}_{m_0}\right)-f\left({x}_1,\dots, {x}_{m_0}\right)\right|<\varepsilon \)

for all \( {x}_1,\dots, {x}_{m_0} \) that lie in the input space.

m 0 represents the number of input nodes of a multilayer perceptron with a single hidden layer. m 1 is the number of neurons in the single hidden layer, \( {x}_1,\dots, {x}_{m_0} \) are the inputs, w ij denotes the weight connecting input j to neuron i , b i denotes the bias corresponding to neuron i , and α i is the output-layer weight of neuron i .

This theorem states that any feedforward neural network containing a finite number of neurons is capable of approximating any continuous function of arbitrary complexity to arbitrary accuracy, provided there are enough neurons in even a single hidden layer, under mild assumptions on the activation function. In other words, this theorem says that any continuous function that maps intervals of real numbers to some output interval of real numbers can be approximated arbitrarily closely by a multilayer perceptron with just one hidden layer and a finite (possibly very large) number of neurons (Cybenko 1989; Hornik 1991). However, this theorem only guarantees that a reasonable approximation exists; for this reason, it is an existence theorem. This implies that simple ANNs are able to represent a wide variety of interesting functions if given enough neurons and appropriate parameters; but nothing is said about the algorithmic learnability of those parameters, nor about their time of learning, ease of implementation, generalization, or whether a single hidden layer is optimal. The first version of this theorem was given by Cybenko (1989) for sigmoid activation functions. Two years later, Hornik (1991) pointed out that the potential of “ANN of being universal approximators is not due to the specific choice of the activation function, but to the multilayer feedforward architecture itself.”

From this theorem, we can deduce that using more than two hidden layers in an artificial neural network will not always improve the prediction performance, since there is a higher risk of converging to a local minimum. However, using two hidden layers is recommended when the data has discontinuities. Although the proof of this theorem was done for only a single output, it is also valid for the multi-output scenario and can easily be deduced from the single-output case. It is important to point out that this theorem does not state that all activation functions will perform equally well in specific learning problems, since their performance depends on the data and on additional issues such as minimal redundancy, computational efficiency, etc.

10.5 Artificial Neural Network Topologies

In this subsection, we describe the most popular network topologies. An artificial neural network topology represents the way in which neurons are connected to form a network. In other words, the neural network topology can be seen as the relationship between the neurons by means of their connections. The topology of a neural network plays a fundamental role in its functionality and performance, as illustrated throughout this chapter. The generic terms structure and architecture are used as synonyms for network topology. However, caution should be exercised when using these terms since their meaning is not well defined and causes confusion in other domains where the same terms are used for other purposes.

More precisely, the topology of a neural network consists of its frame or framework of neurons, together with its interconnection structure or connectivity.

The next two subsections are devoted to these two components.

Artificial neural framework

Most neural networks, including many biological ones, have a layered topology. There are a few exceptions where the network is not explicitly layered, but those can usually be interpreted as having a layered topology; for example, some associative memory networks can be seen as one-layer neural networks in which all neurons function both as input and output units. At the framework level, neurons are considered abstract entities; therefore, possible differences between them are not considered. The framework of an artificial neural network can thus be described by the number of neurons, the number of layers (denoted by L ), and the size of each layer, that is, the number of neurons it contains.

Interconnection structure

The interconnection structure of an artificial neural network determines the way in which the neurons are linked. Based on a layered structure, several different kinds of connections can be distinguished (see Fig. 10.11 ): (a) Interlayer connection : This connects neurons in adjacent layers whose layer indices differ by one; (b) Intralayer connection : This is a connection between neurons in the same layer; (c) Self-connection : This is a special kind of intralayer connection that connects a neuron to itself; (d) Supralayer connection : This is a connection between neurons that are in distinct nonadjacent layers; in other words, these connections “cross” or “jump” at least one hidden layer.

figure 11

Network topology with two layers. (i) denotes the six interlayer connections, (s) denotes the four supralayer connections, and (a) denotes four intralayer connections, of which two are self-connections

With each connection (interconnection ), a weight (strength) is associated which is a weighting factor that reflects its importance. This weight is a scalar value (a number), which can be positive (excitatory) or negative (inhibitory). If a connection has zero weight, it is considered to be nonexistent at that point in time.

Note that the basic concept of layeredness is based on the presence of interlayer connections. In other words, every layered neural network has at least one interlayer connection between adjacent layers. If interlayer connections are absent between any two adjacent clusters in the network, a spatial reordering can be applied to the topology, after which certain connections become the interlayer connections of the transformed, layered network.

Now that we have described the two key components of an artificial neural network topology, we will present two of the most commonly used topologies.

Feedforward network

In this type of artificial neural network, the information flows in a single direction from the input neurons to the processing layer or layers (only interlayer connections) for monolayer and multilayer networks, respectively, until reaching the output layer of the neural network. This means that there are no connections between neurons in the same layer (no intralayer), and there are no connections that transmit data from a higher layer to a lower layer, that is, no supralayer connections (Fig. 10.12 ). This type of network is simple to analyze, but is not restricted to only one hidden layer.

figure 12

A simple two-layer feedforward artificial neural network
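As an illustration (a sketch, not the book's code; the layer sizes, sigmoid activations, and random weights are assumptions), the strictly forward flow of such a network can be written as:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sizes: 3 input neurons, 4 hidden neurons, 2 output neurons.
rng = np.random.default_rng(1)
W_h = rng.normal(size=(3, 4))   # interlayer weights: input -> hidden
W_o = rng.normal(size=(4, 2))   # interlayer weights: hidden -> output

def forward(x):
    # Information flows in a single direction: interlayer connections only,
    # with no intralayer and no supralayer connections.
    h = sigmoid(x @ W_h)        # hidden-layer activations
    return sigmoid(h @ W_o)     # output-layer activations

print(forward(np.array([0.5, -1.0, 2.0])))
```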

Recurrent networks

In this type of neural network, information does not always flow in one direction, since it can feed back into previous layers through synaptic connections. This type of neural network can be monolayer or multilayer. In this network, all the neurons have (1) incoming connections emanating from all the neurons in the previous layer, (2) outgoing connections leading to all the neurons in the subsequent layer, and (3) recurrent connections that propagate information between neurons of the same layer. Recurrent neural networks (RNNs) differ from feedforward neural networks in that they have at least one feedback loop, since the signals travel in both directions. This type of network is frequently used in time series prediction since short-term memory, or delay, increases the power of recurrent networks immensely. Here we present an example of a recurrent two-layer neural network, in which the output of each neuron is passed through a delay unit and then taken to all the neurons except itself. In Figs. 10.13 and 10.14 , we can see that only one input variable is presented to the input units, the feedforward flow is computed, and the outputs are fed back as auxiliary inputs. This leads to a different set of hidden unit activations, new output activations, and so on. Ultimately, the activations stabilize, and the final output values are used for predictions.

figure 13

A simple two-layer recurrent artificial neural network with univariate output

figure 14

A two-layer recurrent artificial neural network with multivariate outputs
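The settle-until-stable behavior just described can be sketched as follows (illustrative assumptions: one external input, sigmoid activations, and small random weights so the iteration converges):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical recurrent two-layer network: one external input plus the
# delayed outputs fed back as auxiliary inputs, as in Figs. 10.13-10.14.
rng = np.random.default_rng(2)
n_out = 2
W_in = rng.normal(0, 0.5, size=(1 + n_out, 4))   # [input, fed-back outputs] -> hidden
W_out = rng.normal(0, 0.5, size=(4, n_out))      # hidden -> output

def run_until_stable(x, tol=1e-6, max_iter=100):
    y = np.zeros(n_out)                 # delayed outputs start at zero
    for _ in range(max_iter):
        z = np.concatenate(([x], y))    # external input + fed-back outputs
        y_new = sigmoid(sigmoid(z @ W_in) @ W_out)
        if np.max(np.abs(y_new - y)) < tol:
            break
        y = y_new
    return y_new

print(run_until_stable(0.8))            # stabilized outputs used for prediction
```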

However, it is important to point out that despite the just-mentioned virtues of recurrent artificial neural networks, they are still largely theoretical and produce mixed results (good and bad) in real applications. On the other hand, feedforward networks are the most popular, since they have been successfully implemented in virtually all application domains; the multilayer perceptron (MLP; another name given to feedforward networks) is the de facto standard artificial neural network topology (Lantz 2015 ). There are other DNN topologies, like the convolutional neural networks presented in Chap. 13 , which can also be found in books specializing in deep learning.

10.6 Successful Applications of ANN and DL

The success of ANN and DL is due to remarkable results on perceptual problems such as seeing and hearing—problems involving skills that seem natural and intuitive to humans but have long been elusive for machines. Next, we provide some of these successful applications:

Near-human-level image classification, speech recognition, handwriting transcription, autonomous driving (Chollet and Allaire 2017 )

Automatic translation of text and images (LeCun et al. 2015 )

Improved text-to-speech conversion (Chollet and Allaire 2017 )

Digital assistants such as Google Now and Amazon Alexa

Improved ad targeting, as used by Google, Baidu, and Bing

Improved search results on the Web (Chollet and Allaire 2017 )

Ability to answer natural language questions (Goldberg 2016 )

Playing games like chess, Jeopardy, Go, and poker (Makridakis et al. 2018 )

Self-driving cars (Liu et al. 2017 )

Voice search and voice-activated intelligent assistants (LeCun et al. 2015 )

Automatically adding sound to silent movies (Chollet and Allaire 2017 )

Energy market price forecasting (Weron 2014 )

Image recognition (LeCun et al. 2015 )

Prediction of time series (Dingli and Fournier 2017 )

Predicting breast, brain (Cole et al. 2017 ), or skin cancer

Automatic image captioning (Chollet and Allaire 2017 )

Predicting earthquakes (Rouet-Leduc et al. 2017 )

Genomic prediction (Montesinos-López et al. 2018a , b )

It is important to point out that the applications of ANN and DL are not restricted to perception and natural language understanding; they extend to domains such as formal reasoning. There are also many successful applications in the biological sciences. For example, deep learning has been successfully applied for predicting univariate continuous traits (Montesinos-López et al. 2018a ), multivariate continuous traits (Montesinos-López et al. 2018b ), univariate ordinal traits (Montesinos-López et al. 2019a ), and multivariate traits with mixed outcomes (Montesinos-López et al. 2019b ) in the context of genomic-based prediction. Menden et al. ( 2013 ) applied a DL method to predict the viability of a cancer cell line exposed to a drug. Alipanahi et al. ( 2015 ) used DL with a convolutional network architecture (an ANN with convolutional operations; see Chap. 13 ) to predict specificities of DNA- and RNA-binding proteins. Tavanaei et al. ( 2017 ) used a DL method for predicting tumor suppressor genes and oncogenes. DL methods have also made accurate predictions of single-cell DNA methylation states (Angermueller et al. 2016 ). In the area of genomic selection, we mention only two reports: (a) McDowell and Grant ( 2016 ) found that DL methods performed similarly to several Bayesian and linear regression techniques that are commonly employed for phenotype prediction and genomic selection in plant breeding, and (b) Ma et al. ( 2017 ) used a DL method with a convolutional neural network architecture to predict phenotypes from genotypes in wheat and found that the DL method outperformed the GBLUP method. A review of DL applications to genomic selection is provided by Montesinos-López et al. ( 2021 ).

10.7 Loss Functions

A loss function (also known as an objective function) is, in general terms, a function that maps an event or the values of one or more variables onto a real number intuitively representing some “cost” associated with the event. An optimization problem seeks to minimize a loss function. An objective function is either a loss function or its negative (in specific domains, variously called a reward function, a profit function, a utility function, a fitness function, etc.), in which case the goal is maximization. In the statistical machine learning domain, a loss function quantifies how close the predicted values produced by an artificial neural network or DL model are to the true values. That is, the loss function measures the quality of the network’s output by computing a distance score between the observed and predicted values (Chollet and Allaire 2017 ). The basic idea is to calculate a metric based on the error between the true and predicted values to measure how well the artificial neural network model’s prediction matches what was expected. These errors are then averaged over the entire data set to provide a single number that represents how the artificial neural network is performing with regard to its ideal. In looking for this ideal, it is possible to find the parameters (weights and biases) of the artificial neural network that minimize the “loss” produced by the errors. Training ANN models with loss functions allows the use of optimization methods to estimate the required parameters. Although most of the time it is not possible to obtain an analytical solution for the parameters, good approximations can very often be obtained using iterative optimization algorithms like gradient descent (Patterson and Gibson 2017 ). Next, we provide the most used loss functions for each type of response variable.

10.7.1 Loss Functions for Continuous Outcomes

Sum of square error loss.

This loss function is appropriate for continuous response variables (outcomes), assuming that we want to predict L response variables. The error (the difference between the observed ( y ij ) and predicted ( \( {\hat{y}}_{ij} \) ) values) is squared and summed over all observations, since the training of the network is not local but global, so as to capture all possible trends in the training data. The expression used for the sum of square error (SSE) loss is \( L\left(\boldsymbol{w}\right)=\frac{1}{2}{\sum}_{i=1}^n{\sum}_{j=1}^L{\left({y}_{ij}-{\hat{y}}_{ij}\right)}^2 \).

Note that n is the size of the data set and L is the number of targets (outputs) the network has to predict. It is important to point out that when there is only one response variable, the sum over L is dropped. Also, the division by two is added for mathematical convenience (which will become clearer in the context of its gradient in backpropagation). One disadvantage of this loss function is that it is quite sensitive to outliers; for this reason, other loss functions have been proposed for continuous response variables. With the loss function, it is possible to calculate the loss score, which is used as a feedback signal to adjust the weights of the artificial neural network; this process of adjusting the weights in ANN is illustrated in Fig. 10.15 (Chollet and Allaire 2017 ). It is also common practice to use as a loss function the SSE divided by the training sample size ( n ) multiplied by the number of outputs ( L ).

figure 15

The loss score is used as a feedback signal to adjust the weights

Figure 10.15 shows that the learning process of an artificial neural network involves the interaction of the layers, the input data, the loss function, which defines the feedback signal used for learning, and the optimizer, which determines how the learning proceeds and uses the loss value to update the network’s weights. Initially, the weights of the network are assigned small random values, which produce outputs far from the ideal values and, therefore, a high loss score. At each iteration of the network process, the weights are adjusted a little to reduce the difference between the observed and predicted values and, of course, to decrease the loss score. This is the basic step of the training process of statistical machine learning models in general, and when this process is repeated a sufficient number of times (on the order of thousands of iterations), it yields weight values that minimize the loss function. A network with minimal loss is one in which the observed and predicted values are very close; it is called a trained network (Chollet and Allaire 2017 ). There are other options of loss functions for continuous data like the sum of absolute percentage error loss (SAPE): \( L\left(\boldsymbol{w}\right)={\sum}_{i=1}^n{\sum}_{j=1}^L\left|\frac{{\hat{y}}_{ij}-{y}_{ij}}{y_{ij}}\right| \) and the sum of squared log error loss (Patterson and Gibson 2017 ): \( L\left(\boldsymbol{w}\right)={\sum}_{i=1}^n{\sum}_{j=1}^L{\left(\log \left({\hat{y}}_{ij}\right)-\log \left({y}_{ij}\right)\right)}^2 \) , but the SSE is popular in ANN and DL models due to its nice mathematical properties.
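In code, these continuous-outcome losses are a few lines each (a sketch with made-up numbers; the SSE keeps the 1/2 convention discussed above):

```python
import numpy as np

y = np.array([[2.0, 1.0], [3.0, 0.5], [1.5, 2.5]])      # observed, n = 3, L = 2
y_hat = np.array([[2.1, 0.9], [2.7, 0.6], [1.4, 2.8]])  # predicted

n, L = y.shape
sse = 0.5 * np.sum((y - y_hat) ** 2)             # SSE with the 1/2 convention
mse = np.sum((y - y_hat) ** 2) / (n * L)         # SSE (without the 1/2) over n * L
sape = np.sum(np.abs((y_hat - y) / y))           # sum of absolute percentage error
ssle = np.sum((np.log(y_hat) - np.log(y)) ** 2)  # sum of squared log error

print(sse, mse, sape, ssle)
```

Note that SAPE requires nonzero observed values and SSLE requires positive observed and predicted values.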

10.7.2 Loss Functions for Binary and Ordinal Outcomes

Next, we provide two popular loss functions for binary data: the hinge loss and the cross-entropy loss.

Hinge loss

This loss function originated in the context of the support vector machine for “maximum-margin” classification, and is defined as \( L\left(\boldsymbol{w}\right)={\sum}_{i=1}^n\max \left(0,1-{y}_i{\hat{y}}_i\right) \), where \( {\hat{y}}_i \) is the raw predicted score for individual i .

It is important to point out that since this loss function is appropriate for binary data, the intended response variable output is denoted as +1 for success and −1 for failure.

Logistic loss

This loss function is defined as \( L\left(\boldsymbol{w}\right)=-{\sum}_{i=1}^n\left[{y}_i\log \left({\hat{y}}_i\right)+\left(1-{y}_i\right)\log \left(1-{\hat{y}}_i\right)\right] \), where the response is coded as y i  = 1 for success and y i  = 0 for failure, and \( {\hat{y}}_i \) is the predicted probability of success.

This loss function originated as the negative log-likelihood of the product of Bernoulli distributions. It is also known as cross-entropy loss since we arrive at the logistic loss by calculating the cross-entropy (difference between two probability distributions) loss, which is a measure of the divergence between the predicted probability distribution and the true distribution. Logistic loss functions are preferred over the hinge loss when the scientist is mostly interested in the probabilities of success rather than in just the hard classifications. For example, when a scientist is interested in the probability that a patient can get cancer as a function of a set of covariates, the logistic loss is preferred since it allows calculating true probabilities.
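Both binary losses can be computed directly (a sketch with made-up labels and scores; mapping scores to probabilities through a sigmoid is an assumption of this example):

```python
import numpy as np

# Hinge loss: labels coded +1 / -1, predictions are raw scores.
y_pm = np.array([1, -1, 1, -1])
scores = np.array([0.8, -1.2, -0.3, 0.4])
hinge = np.sum(np.maximum(0.0, 1.0 - y_pm * scores))

# Logistic (cross-entropy) loss: labels coded 1 / 0, predictions are
# probabilities, obtained here by passing the scores through a sigmoid.
y01 = (y_pm + 1) // 2
p = 1.0 / (1.0 + np.exp(-scores))
logistic = -np.sum(y01 * np.log(p) + (1 - y01) * np.log(1 - p))

print(hinge, logistic)
```

The hinge loss charges nothing for correctly classified points beyond the margin (the second observation here), while the logistic loss always charges something, which is what makes it informative about probabilities.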

When the number of classes, C , is more than two according to Patterson and Gibson ( 2017 ), that is, when we are in the presence of categorical data, the loss function is known as categorical cross-entropy and is equal to \( L\left(\boldsymbol{w}\right)=-{\sum}_{i=1}^n{\sum}_{j=1}^C{y}_{ij}\log \left({\hat{y}}_{ij}\right) \), where y ij  = 1 if individual i belongs to class j and 0 otherwise, and \( {\hat{y}}_{ij} \) is the predicted probability that individual i belongs to class j .

Poisson loss

This loss function is built as the negative log-likelihood of a Poisson distribution and is appropriate for predicting count outcomes. Up to an additive constant that does not depend on the weights, it is defined as \( L\left(\boldsymbol{w}\right)={\sum}_{i=1}^n\left({\hat{y}}_i-{y}_i\log \left({\hat{y}}_i\right)\right) \).

Also, for count data the loss function can be obtained under a negative binomial distribution, which can do a better job than the Poisson distribution when the assumption of equal mean and variance is hard to justify.

10.7.3 Regularized Loss Functions

Regularization is a method that helps to reduce the complexity of the model and significantly reduces the variance of statistical machine learning models without any substantial increase in their bias. For this reason, to prevent overfitting and improve the generalizability of our models, we use regularization (penalization), which is concerned with reducing testing errors so that the model performs well on new data as well as on training data. Regularized or penalized loss functions are those that, instead of minimizing the conventional loss function, L ( w ), minimize an augmented loss function that consists of the sum of the conventional loss function and a penalty (or regularization) term that is a function of the weights. This is defined as \( L\left(\boldsymbol{w},\lambda \right)=L\left(\boldsymbol{w}\right)+\lambda {E}_P \),

where L ( w ,  λ ) is the regularized (or penalized) loss function, λ is the degree or strength of the penalty term, and E P is the penalization proposed for the weights; this is known as the regularization term. The regularization term shrinks the weight estimates toward zero, which helps to reduce the variance of the estimates at the cost of some increase in the bias of the weights, which in turn helps to improve the out-of-sample predictions of statistical machine learning models (James et al. 2013 ). The penalization term is introduced using exactly the same logic as in Ridge regression in Chap. 3 . Depending on the form of E P , there is a name for the type of regularization. For example, when E P  =  w T w , it is called the Ridge penalty or weight decay penalty. This regularization is also called the L2 penalty and has the effect that larger weights (positive or negative) result in larger penalties. On the other hand, when \( {E}_P={\sum}_{p=1}^P\left|{w}_p\right| \) , that is, when the E P term is equal to the sum of the absolute weights, the name of this regularization is Least Absolute Shrinkage and Selection Operator (Lasso) or simply L1 regularization. The L1 penalty produces a sparse solution (more zero weights) because small and large weights are penalized equally, and it forces some weights to be exactly equal to zero when λ is considerably large (James et al. 2013 ; Wiley 2016 ); for this reason, the Lasso penalization also performs variable selection and provides a model that is more interpretable than under the Ridge penalty. By combining Ridge (L2) and Lasso (L1) regularization, we obtain Elastic Net regularization, where the loss function is defined as \( L\left(\boldsymbol{w},{\lambda}_1,{\lambda}_2\right)=L\left(\boldsymbol{w}\right)+0.5\times {\lambda}_1{\sum}_{p=1}^P\left|{w}_p\right|+0.5\times {\lambda}_2{\sum}_{p=1}^P{w}_p^2 \) , and where instead of one lambda parameter, two are needed.
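A minimal numeric sketch of these penalized losses (the data, weights, and lambda values are made up; the 0.5 factors follow the Elastic Net formula above):

```python
import numpy as np

def sse_loss(w, X, y):
    return 0.5 * np.sum((y - X @ w) ** 2)

def penalized_loss(w, X, y, lam1=0.0, lam2=0.0):
    # Elastic Net style penalty: lam1 weights the L1 (Lasso) term and
    # lam2 the L2 (Ridge) term; setting one of them to zero recovers
    # pure Lasso or pure Ridge regularization.
    return (sse_loss(w, X, y)
            + 0.5 * lam1 * np.sum(np.abs(w))
            + 0.5 * lam2 * np.sum(w ** 2))

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5])
w = np.ones(5)
print(penalized_loss(w, X, y, lam1=1.0, lam2=1.0))
```

With both lambdas set to zero the penalized loss reduces to the plain SSE, matching the λ = 0 case discussed below.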

It is important to point out that more than one penalty hyperparameter may be needed in ANN and DL models, since different degrees of penalty can be applied to different layers. This differential penalization is sometimes desirable to improve the predictions for new data, but it has the disadvantage that more hyperparameters need to be tuned, which increases the computational cost of the optimization process (Wiley 2016 ).

In all types of regularization, when λ  = 0 (or λ 1  =  λ 2  = 0), the penalty term has no effect, but as λ grows, the penalty becomes stronger and the weight estimates approach zero. The selection of an appropriate value of λ is challenging and critical; for this reason, λ is also treated as a hyperparameter that needs to be tuned and is usually optimized by evaluating a range of possible λ values through cross-validation. It is also important to point out that scaling the input data before implementing artificial neural networks is recommended, since the effect of the penalty depends on the size of the weights, and the size of the weights depends on the scale of the data. Also, recall from Chap. 3 , where Ridge regression was presented, that the shrinkage penalty is applied to all the weights except the intercept or bias terms (Wiley 2016 ).

Another type of regularization that is very popular in ANN and DL is dropout, which consists of setting a random fraction (or percentage) of the weights of the input or hidden neurons to zero. Suppose that our original topology is like the one given in Fig. 10.16a , where all the neurons are active (with weights different from zero); when a random fraction of neurons is dropped out, all their connections (weights) are set to zero, and the topology with the dropped-out neurons (with weights set to zero) is shown in Fig. 10.16b . The contribution of those dropped-out neurons to the activation of downstream neurons is temporarily removed on the forward pass, and no weight updates are applied to those neurons on the backward pass. Dropout is only used during the training of a model, not when evaluating the skill of the model; it prevents the units from co-adapting too much.
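The mechanics fit in a few lines (a sketch; the rescaling of the surviving activations by 1/(1 − rate), known as inverted dropout, is an implementation detail not discussed in the text):

```python
import numpy as np

rng = np.random.default_rng(4)

def dropout(activations, rate=0.2, training=True):
    # During training, a random fraction `rate` of the neurons is zeroed out.
    # The survivors are rescaled by 1/(1 - rate) ("inverted dropout") so that
    # the expected activation is unchanged. At evaluation time, dropout is off.
    if not training:
        return activations
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1.0 - rate)

h = np.ones(10)                       # activations of a hidden layer
print(dropout(h, rate=0.5))           # roughly half the entries become zero
print(dropout(h, training=False))     # unchanged when evaluating the model
```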

figure 16

Feedforward neural network with four layers. ( a ) Three input neurons, four neurons in hidden layers 1 and 3, and five neurons in hidden layer 2 without dropout and ( b ) the same network with dropout; dropping out one in the input neuron, three neurons in hidden layers 1–3

This type of regularization is very simple, and there is a lot of empirical evidence of its power to avoid overfitting. This regularization is quite new in the context of statistical machine learning and was proposed by Srivastava et al. ( 2014 ) in the paper Dropout: A Simple Way to Prevent Neural Networks from Overfitting . There are no unique rules for choosing the percentage of neurons that will be dropped out. Some tips for choosing the dropout percentage are given below:

Usually a good starting point is to use 20% dropout, but values between 20% and 50% are reasonable. A percentage that is too low has minimal effect and a value that is too high results in underfitting the network.

The larger the network, the more likely you are to get better performance when using the dropout method, because the model has more opportunity to learn independent representations.

Application of dropout is not restricted to hidden neurons; it can also be applied in the input layer. In both cases, there is evidence that it improves the performance of the ANN model.

When using dropout, increasing the learning rate (learning rate is a tuning parameter in an optimization algorithm that regulates the step size at each epoch (iteration) while moving toward a minimum (or maximum) of a loss function ) of the ANN algorithm by a factor of 10–100 is suggested, as well as increasing the momentum value (another tuning parameter useful for computing the gradient at each iteration), for example, from 0.90 to 0.99.

When dropout is used, it is also a good idea to constrain the size of the network weights, since the larger the learning rate, the larger the network weights. For this reason, constraining the size of the network weights to less than five in absolute value with max-norm regularization has been shown to improve results.

It is important to point out that all the loss functions described in the previous section can be converted to regularized (penalized) loss functions using the elements given in this section. The dropout method can also be implemented with any type of loss function.

10.7.4 Early Stopping Method of Training

During the training process, ANN and DL models learn in stages, from simple realizations to complex mapping functions. This process is captured by monitoring the behavior of the mean squared error between observed and predicted values, which starts decreasing rapidly as the number of epochs (an epoch is one cycle through the full training data set) increases, and then decreases slowly as the error surface approaches a local minimum. However, to attain the largest generalization power of a model, it is necessary to figure out when it is best to stop training, which is very challenging since a very early stopping point can produce underfitting, while a very late stopping point can produce overfitting of the training data. As mentioned in Chap. 4 , one way to avoid overfitting is to use a CV strategy, where the training set is split into a training-inner and a testing-inner set; with the training-inner set, the model is trained for each set of hyperparameters, and with the testing-inner (tuning) set, the power to predict out-of-sample data is evaluated, and in this way the optimal hyperparameters are obtained. However, we can incorporate the early stopping method into CV to fight overfitting better, by using the CV strategy in the usual way with a minor modification, which consists of stopping the training session periodically (i.e., every so many epochs) and testing the model on the validation subset after reaching the specified number of epochs (Haykin 2009 ). In other words, the early stopping method combined with the CV strategy, a periodic “estimation-followed-by-validation” process, basically proceeds as follows:

After a period of estimation (training)—every three epochs, for example—the weights and bias (intercept) parameters of the multilayer perceptron are all fixed, and the network is operated in its forward mode. Then the training and validation error are computed.

When the validation prediction performance is completed, the estimation (training) is started again for another period, and the process is repeated.

Due to this periodic nature (just described above), which is simple to understand and easy to implement in practice, this method is called the early stopping method of training. To better understand this method, Fig. 10.17 conceptualizes this approach with two learning curves, one for the training subset and the other for the validation subset. Figure 10.17 shows that the prediction error in terms of MSE is lower in the training set than in the validation set, which is expected. The estimation learning curve that corresponds to the training set decreases monotonically as the number of epochs increases, which is normal, while the validation learning curve decreases monotonically to a minimum and then, as the training continues, starts to increase. The estimation learning curve of the training set suggests that we can do better by going beyond the minimum point on the validation learning curve, but this is not really true since, in essence, what is learned beyond this point is the noise contained in the training data. For this reason, the minimum point on the validation learning curve can be used as a sensible criterion for stopping the training session. However, the validation sample error does not evolve as smoothly over the training epochs as the perfect curve shown in Fig. 10.17 , since it often exhibits a few local minima of its own before it starts to increase with an increasing number of epochs. For this reason, in the presence of two or more local minima, the selection of a “slower” stopping criterion (i.e., a criterion that stops later than other criteria) makes it possible to attain a small improvement in generalization performance (typically, about 4%, on average) at the cost of a much longer training period (about a factor of four, on average).
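The estimation-followed-by-validation loop can be sketched generically (the checkpoint interval, patience counter, and toy validation curve are illustrative assumptions, not from the book):

```python
import numpy as np

def train_with_early_stopping(step, val_error, max_epochs=300,
                              check_every=3, patience=5):
    # Generic early-stopping loop: train for `check_every` epochs, freeze the
    # weights, evaluate on the validation subset, and stop once the validation
    # error has failed to improve `patience` checks in a row (a "slower"
    # criterion that tolerates a few local minima of the validation curve).
    best_err, best_epoch, bad_checks = np.inf, 0, 0
    for epoch in range(check_every, max_epochs + 1, check_every):
        for _ in range(check_every):
            step()                     # one epoch of weight updates
        err = val_error()              # validation error at this checkpoint
        if err < best_err:
            best_err, best_epoch, bad_checks = err, epoch, 0
        else:
            bad_checks += 1
            if bad_checks >= patience:
                break
    return best_epoch, best_err

# Toy illustration: the validation error falls, then rises (overfitting),
# with its minimum near epoch 60.
state = {"epoch": 0}
def step():
    state["epoch"] += 1
def val_error():
    return (state["epoch"] - 60) ** 2 / 1000 + 0.1

print(train_with_early_stopping(step, val_error))  # stops well before max_epochs
```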

figure 17

Schematic representation of the early stopping rule based on cross-validation (Haykin 2009 )

10.8 The King Algorithm for Training Artificial Neural Networks: Backpropagation

The training process of ANN, which consists of adjusting the connection weights, requires a lot of computational resources. For this reason, although ANN had been studied for many decades, few real applications were available until the mid-to-late 1980s, when the backpropagation method arrived. This method is attributed to Rumelhart et al. ( 1986 ). It is important to point out that other research teams independently published the backpropagation algorithm around the same time, but the one previously mentioned is among the most cited. This algorithm led to the resurgence of ANN after the 1980s, although it is still considerably slower than other statistical machine learning algorithms. Some advantages of this algorithm are that (a) it is able to make predictions of categorical or continuous outcomes, (b) it does a better job of capturing complex patterns than nearly any other algorithm, and (c) it makes few assumptions about the underlying relationships in the data. However, this algorithm is not without weaknesses, some of which are that (a) it is very slow to train since it requires a lot of computational resources, because the more complex the network topology, the more computational resources are needed (although this is true not only for ANN but for any algorithm), (b) it is very susceptible to overfitting the training data, and (c) its results are difficult to interpret (Lantz 2015 ).
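Before the formal derivation, the following compact numerical sketch (illustrative assumptions, not the book's code: tanh hidden activation, identity output activation, SSE loss with the 1/2 factor, plain gradient descent, and biases folded in via a constant input of 1) shows the whole algorithm in a few lines:

```python
import numpy as np

rng = np.random.default_rng(5)
n_in, n_hidden, n_out = 8, 5, 3          # sizes echoing Fig. 10.18

Wh = rng.normal(0, 0.1, size=(n_in + 1, n_hidden))   # input -> hidden (bias row included)
Wo = rng.normal(0, 0.1, size=(n_hidden + 1, n_out))  # hidden -> output (bias row included)

X = rng.normal(size=(50, n_in))
Y = np.tanh(X @ rng.normal(size=(n_in, n_out)))      # synthetic targets

def add_bias(A):
    # Append a constant-1 column so biases act as ordinary weights.
    return np.hstack([A, np.ones((A.shape[0], 1))])

def mse():
    H = np.tanh(add_bias(X) @ Wh)
    return np.mean((add_bias(H) @ Wo - Y) ** 2)

eta = 0.05                                # learning rate
for _ in range(2000):
    # Forward pass
    H = np.tanh(add_bias(X) @ Wh)         # hidden activations
    Y_hat = add_bias(H) @ Wo              # identity output layer
    # Backward pass: propagate the error signal layer by layer
    delta_o = Y_hat - Y                   # dL/d(net input) at the output
    delta_h = (delta_o @ Wo[:-1].T) * (1 - H ** 2)   # tanh'(z) = 1 - tanh(z)^2
    # Gradient-descent updates (averaged over the batch)
    Wo -= eta * add_bias(H).T @ delta_o / len(X)
    Wh -= eta * add_bias(X).T @ delta_h / len(X)

print("final MSE:", mse())
```

The two `delta` lines are exactly the error signals that the derivation below obtains by differentiating the loss with respect to each layer's weights.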

Next, we provide the derivation of the backpropagation algorithm for the multilayer perceptron network shown in Fig. 10.18 .

figure 18

Schematic representation of a multilayer feedforward network with one hidden layer, eight input variables, and three output variables

As mentioned earlier, the goal of the backpropagation algorithm is to find the weights of a multilayered feedforward network. The multilayered feedforward network given in Fig. 10.18 is able to approximate any function to any degree of accuracy (Cybenko 1989 ) with enough hidden units, as stated in the universal approximation theorem (Sect. 10.4 ), which makes the multilayered feedforward network a powerful statistical machine learning tool. Suppose that we provide this network with n input patterns of the form \( {\boldsymbol{x}}_i=\left({x}_{i1},{x}_{i2},\dots, {x}_{iP}\right) \),

where x i denotes the input pattern of individual i , with i  = 1, …, n , and x ip denotes the p th input of x i . Let y ij denote the response variable of the i th individual for the j th output, which is associated with the input pattern x i . For this reason, to be able to train the neural network, we must learn the functional relationship between the inputs and outputs. To illustrate the learning process of this relationship, we use the SSE loss function (explained in the section on loss functions) to optimize the weights, which is defined as \( L\left(\boldsymbol{w}\right)=\frac{1}{2}{\sum}_{i=1}^n{\sum}_{j=1}^L{\left({y}_{ij}-{\hat{y}}_{ij}\right)}^2 \).

Now, to explain how the backpropagation algorithm works, we will explain how information is first passed forward through the network. Providing the input values to the input layer is the first step, but no operation is performed on this information since it is simply passed to the hidden units. Then the net input into the k th hidden neuron is calculated as

Here P is the total number of explanatory variables or input nodes, \( {w}_{kp}^{(h)} \) is the weight from input unit p to hidden unit k , the superscript h refers to the hidden layer, and x ip is the value of the p th input for pattern or individual i . It is important to point out that the bias term ( \( {b}_k^{(h)} \) ) of neuron k in the hidden layer has been excluded from ( 10.6 ) because the bias can be accounted for by adding an extra neuron to the input layer and fixing its value at 1. Then the output of neuron k , resulting from applying an activation function to its net input, is

where g ( h ) is the activation function that is applied to the net input of any neuron k of the hidden layer. In a similar vein, now with all the outputs of the neurons in the hidden layer, we can estimate the net input of the j th neuron of the output unit j as

where M is the number of neurons in the hidden layer and \( {w}_{jk}^{(l)} \) represents the weights from hidden unit k to output j . The superscript l refers to the output layer. Also, the bias term ( \( {b}_j^{(l)}\Big) \) of neuron j in the output layer was not included in ( 10.8 ) since it can be accounted for by adding an extra neuron to the hidden layer and fixing its value at 1. Now, by applying the activation function to the output of the j th neuron of the output layer, we get the predicted value of the j th output as

where \( {\hat{y}}_{ij} \) is the predicted value of individual i in output j and g ( l ) is the activation function of the output layer. We are interested in learning the weights ( \( {w}_{kp}^{(h)},{w}_{jk}^{(l)} \) ) that minimize the sum of squares loss function ( 10.5 ), which is a function of the unknown weights, as can be observed in ( 10.6 )–( 10.8 ). The partial derivatives of the loss function with respect to the weights represent the rate of change of the loss function with respect to the weights (the slope of the loss function), and the loss function decreases when the weights move down this slope. This is the intuition behind the iterative method called backpropagation for finding the optimal weights and biases. The method consists of evaluating the partial derivatives of the loss function with respect to the weights and then moving the weights down the slope, until the score of the loss function no longer decreases. If we make the change in the weights proportional to the negative of the gradient, the weights move in the right direction. The gradient of the loss function given in ( 10.5 ) with respect to the weights connecting the hidden units to the output units ( \( {w}_{jk}^{(l)}\Big) \) is given by

where η is the learning rate that scales the step size and is specified by the user. To calculate the adjustments for the weights connecting the hidden neurons to the outputs, \( {w}_{jk}^{(l)} \) , we first substitute ( 10.6 )–( 10.9 ) in ( 10.5 ), which yields

Then, by expanding ( 10.10 ) using the chain rule, we get

Next, we get each partial derivative

By substituting these partial derivatives in ( 10.10 ), we obtain the change in weights from the hidden units to the output units, \( \Delta {w}_{jk}^{(l)}, \) as

where \( {\delta}_{ij}=\left({y}_{ij}-{\hat{y}}_{ij}\right)\ {g}^{(l)\acute{\mkern6mu}}\left({z}_{ij}^{(l)}\right) \) . Therefore, the formula used to update the weights from the hidden units to the output units is

This equation reflects that the adjusted weights from ( 10.13 ) are added to the current estimate of the weights, \( {w}_{jk}^{(l)(t)} \) , to obtain the updated estimates, \( {w}_{jk}^{(l)\left(t+1\right)} \) .
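The update rule above is plain gradient descent. A toy one-parameter version (our own illustration, not code from the chapter) makes the mechanics concrete: the weight repeatedly moves a step proportional to the negative gradient of the loss.

```python
# Gradient descent on the toy loss E(w) = (w - 3)^2, whose minimizer is w = 3.
# The update w <- w - eta * dE/dw moves the weight down the slope of the loss.
eta = 0.1   # learning rate, as in the chapter's update rules
w = 0.0     # starting weight

for _ in range(100):
    grad = 2.0 * (w - 3.0)   # dE/dw
    w -= eta * grad          # step in the negative-gradient direction
```

Because each step is proportional to the negative gradient, w approaches 3 and the loss shrinks at every iteration for a small enough η; too large an η would instead make the iterates overshoot and diverge.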

Next, to update the weights connecting the input units to the hidden units, we follow a similar process as in ( 10.12 ). Thus

Using the chain rule, we get that

where \( \frac{\partial E}{\partial {\hat{y}}_{ij}} \) and \( \frac{\partial {\hat{y}}_{ij}}{\partial {z}_{ij}^{(l)}} \) are given in ( 10.11 ), while

Substituting back into ( 10.14 ), we obtain the change in the weights from the input units to the hidden units, \( \Delta {w}_{kp}^{(h)} \) , as

where \( {\psi}_{ik}={\sum}_{j=1}^L{\delta}_{ij}{w}_{jk}^{(l)}{g}^{(h)\acute{\mkern6mu}}\left({z}_{ik}^{(h)}\right) \) . The summation over the number of output units is because each hidden neuron is connected to all the output units. Therefore, all the outputs should be affected if the weight connecting an input unit to a hidden unit changes. In a similar way, the formula for updating the weights from the input units to the hidden units is

This equation also reflects that the adjusted weights from ( 10.17 ) are added to the current estimate of the weights, \( {w}_{kp}^{(h)(t)} \) , to obtain the updated estimates, \( {w}_{kp}^{(h)\left(t+1\right)} \) . Now we are able to put down the processing steps needed to compute the change in the network weights using the backpropagation algorithm. We define w as the entire collection of weights.

10.8.1 Backpropagation Algorithm: Online Version

10.8.1.1 Feedforward Part

Step 1. Initialize the weights to small random values, and define the learning rate ( η ) and the minimum expected loss score (tol). The tolerance tol is a small value such that, once the loss reaches it, the training process stops.

Step 2. If the stopping condition is false, perform steps 3–14.

Step 3. Select a pattern x i  = [ x i 1 , …,  x iP ] T as the input vector sequentially ( i  = 1 till the number of samples) or at random.

Step 4. The net inputs of the hidden layer are calculated: \( {z}_{ik}^{(h)}={\sum}_{p=0}^P{w}_{kp}^{(h)}{x}_{ip} \) , i  = 1, …, n and  k  = 0, …, M .

Step 5. The outputs of the hidden layer are calculated: \( {V}_{ik}^{(h)}={g}^{(h)}\left({z}_{ik}^{(h)}\right) \)

Step 6. The net inputs of the output layer are calculated: \( {z}_{ij}^{(l)}={\sum}_{k=0}^M{w}_{jk}^{(l)}{V}_{ik}^{(h)},j=1,\dots, L \)

Step 7. The predicted values (outputs) of the neural network are calculated: \( {\hat{y}}_{ij}={g}^{(l)}\left({z}_{ij}^{(l)}\right) \)

Step 8. Compute the contribution of pattern i to the mean squared error (loss function): \( {E}_i=\frac{1}{2 nL}{\sum}_{j=1}^L{\left({\hat{y}}_{ij}-{y}_{ij}\right)}^2 \) , and accumulate it into the global loss: E ( w ) =  E ( w ) +  E i , where E ( w ) is initialized to 0 at the first step of each epoch. Note that the value of the loss function is accumulated over all data pairs, that is, ( y ij , x i ).

10.8.1.2 Backpropagation Part

Step 9. The output errors are calculated: \( {\delta}_{ij}=\left({y}_{ij}-{\hat{y}}_{ij}\right)\ {g}^{(l)\acute{\mkern6mu}}\left({z}_{ij}^{(l)}\right) \)

Step 10. The hidden layer errors are calculated: \( {\psi}_{ik}={g}^{(h)\acute{\mkern6mu}}\left({z}_{ik}^{(h)}\right){\sum}_{j=1}^L{\delta}_{ij}{w}_{jk}^{(l)} \)

Step 11. The weights of the output layer are updated: \( {w}_{jk}^{(l)\left(t+1\right)}={w}_{jk}^{(l)(t)}+{\eta \delta}_{ij}{V}_{ik}^{(h)} \)

Step 12. The weights of the hidden layer are updated: \( {w}_{kp}^{(h)\left(t+1\right)}={w}_{kp}^{(h)(t)}+{\eta \psi}_{ik}{x}_{ip} \)

Step 13. If i  <  n , go to step 3; otherwise go to step 14.

Step 14. Once the learning of an epoch is complete ( i  =  n ), we check whether the global error satisfies the specified tolerance (tol). If E ( w ) < tol, we terminate the learning process, which means that the network has been trained satisfactorily. Otherwise, since E ( w ) ≥ tol, go to step 3 and start a new learning epoch with i  = 1.

The backpropagation algorithm is iterative. This means that the search process occurs over multiple discrete steps, each step hopefully improving the model parameters slightly. Each step involves using the model with the current set of internal parameters to make predictions for some samples, comparing those predictions to the real expected outcomes, calculating the error, and using the error to update the internal model parameters. This update procedure differs among algorithms but, in the case of ANN, as previously pointed out, the backpropagation update algorithm is used.
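Steps 1–14 above can be sketched compactly in NumPy. The following is a minimal illustration with our own toy data and variable names (the chapter itself gives no code); it uses sigmoid activations in both layers and handles the biases by appending a constant-1 unit to the input and hidden layers, as described in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_online(X, Y, M=3, eta=0.5, tol=1e-3, max_epochs=10_000):
    """Online backpropagation (steps 1-14) for one hidden layer.

    X : (n, P) inputs; Y : (n, L) targets in (0, 1).
    Biases are handled by appending a constant-1 unit to the
    input and hidden layers, as in the text.
    """
    n, P = X.shape
    L = Y.shape[1]
    # Step 1: small random starting weights (bias column included).
    W_h = rng.normal(scale=0.1, size=(M, P + 1))   # hidden weights
    W_l = rng.normal(scale=0.1, size=(L, M + 1))   # output weights

    for _ in range(max_epochs):
        E = 0.0
        for i in range(n):                              # steps 3-13, one pattern at a time
            x = np.concatenate(([1.0], X[i]))           # bias unit on the input layer
            z_h = W_h @ x                               # step 4: net inputs, hidden layer
            V = np.concatenate(([1.0], sigmoid(z_h)))   # step 5 (+ hidden bias unit)
            z_l = W_l @ V                               # step 6: net inputs, output layer
            y_hat = sigmoid(z_l)                        # step 7: predictions
            E += np.sum((y_hat - Y[i]) ** 2) / (2 * n * L)  # step 8: accumulate loss

            # Step 9: output errors, using sigmoid'(z) = g(z) * (1 - g(z)).
            delta = (Y[i] - y_hat) * y_hat * (1 - y_hat)
            # Step 10: hidden errors (drop the bias column of W_l).
            g_h = sigmoid(z_h)
            psi = g_h * (1 - g_h) * (W_l[:, 1:].T @ delta)
            # Steps 11-12: weight updates.
            W_l += eta * np.outer(delta, V)
            W_h += eta * np.outer(psi, x)
        if E < tol:                                     # step 14
            break
    return W_h, W_l, E

# Toy usage with made-up data: four patterns, targets inside (0, 1).
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = np.array([[0.1], [0.3], [0.7], [0.9]])
W_h, W_l, E = train_online(X, Y)
```

On this easy toy problem the accumulated loss E should fall well below the no-learning baseline of predicting 0.5 for every pattern; the learning rate, number of hidden units, and tolerance here are arbitrary choices for illustration.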

10.8.2 Illustrative Example 10.1: A Hand Computation

In this section, we provide a simple example that will be computed step by step by hand to fully understand how the training is done using the backpropagation method. The topology used for this example is given in Fig. 10.19 .

figure 19

A simple artificial neural network with one input, one hidden layer with one neuron, and one response variable (output)

The data set for this example is given in Table 10.1 , where we can see that the data collected consist of four observations, the response variable ( y ) takes values between 0 and 1, and the input information is for only one predictor ( x ). Additionally, Table 10.1 gives the starting values for the hidden weights ( \( {w}_{kp}^{(h)}\Big) \) and for the output weights ( \( {w}_{jk}^{(l)} \) ). Because the response variable lies in the interval between zero and one, we will use the sigmoid activation function for both the hidden layer and the output layer. A learning rate ( η ) equal to 0.1 and a tolerance equal to 0.025 were also used.
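Since the sigmoid is used in both layers, its derivative g′(z) = g(z)(1 − g(z)) will be needed repeatedly in the backpropagation part. This identity can be verified numerically with a few lines of our own code (not part of the example):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-3.0, 3.0, 13)

# Analytic derivative via the identity g'(z) = g(z) * (1 - g(z)).
analytic = sigmoid(z) * (1.0 - sigmoid(z))

# Independent check via a central finite difference.
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2.0 * h)
```

The two arrays agree to high precision, which is why the hand computations below can use g(z)(1 − g(z)) directly instead of differentiating the sigmoid each time.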

The backpropagation algorithm described before was given for one input pattern at a time; however, to simplify the calculations, we will implement this algorithm using the four patterns of data available simultaneously using matrix calculations. For this reason, first we build the design matrix of inputs and outputs:

We also define the vectors of the starting values of the hidden and output weights:

Here we can see that P  = 1 and M  = 2. Next we calculate the net inputs for the hidden layer as

Now the output for the hidden layer is calculated using the sigmoid activation function

where \( {V}_{ik}^{(h)}={g}^{(h)}\left({z}_{ik}^{(h)}\right),i=1,\dots, 4 \) and g ( h ) ( z ) = 1/(1 +  exp (− z )), which can be replaced by another desired activation function. Then the net inputs for the output layer are calculated as follows:

The predicted values (outputs) of the neural network are calculated as

where \( {\hat{y}}_i={g}^{(l)}\left({z}_{i1}^{(l)}\right),i=1,\dots, 4 \) and g ( l ) ( z ) = 1/(1 +  exp (− z )). Next the output errors are calculated using the Hadamard product, ∘, (element-wise matrix multiplication) as

The hidden layer errors are calculated as

where \( {\boldsymbol{w}}_1^{(l)} \) is w ( l ) without the weight of the intercept, that is, without the first element.
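In NumPy, the Hadamard product is simply the element-wise `*` operator, so the output-error step can be checked with toy numbers (our own, not those of Table 10.1):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical net inputs of the output layer and targets for four patterns.
z_l = np.array([0.2, -0.5, 1.0, -1.3])
y = np.array([0.7, 0.2, 0.9, 0.1])

y_hat = sigmoid(z_l)
# delta_i = (y_i - y_hat_i) o g'(z_i), with g'(z) = g(z)(1 - g(z)) for the
# sigmoid; the Hadamard product o is element-wise `*` in NumPy.
delta = (y - y_hat) * y_hat * (1.0 - y_hat)
```

Each entry of `delta` combines the prediction error of one pattern with the slope of the activation at that pattern's net input, exactly as in the matrix expression above.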

The weights of the output layer are updated:

where 2 denotes that the output weights are for epoch number 2. Then the weights for epoch 2 of the hidden layer are obtained with

We check to see if the global error is satisfied with the specified tolerance (tol). Since \( E\left(\boldsymbol{w}\right)=\frac{1}{2n}{\sum}_{i=1}^n{\left({\hat{y}}_i-{y}_i\right)}^2=0.03519>\mathrm{tol}=0.025, \) this means that we need to increase the number of epochs to satisfy the tol = 0.025 specified.

Epoch 2. Using the updated weights of epoch 1, we obtain the new weights after epoch 2. First for the output layer:

And next for the hidden layer:

Now the predicted values are \( {\hat{y}}_1=0.8233 \) , \( {\hat{y}}_2=0.3754 \) , \( {\hat{y}}_3=0.8480 \) , and \( {\hat{y}}_4=0.2459 \) , and again we find that \( E\left(\boldsymbol{w}\right)=\frac{1}{2n}{\sum}_{i=1}^n{\left({\hat{y}}_i-{y}_i\right)}^2=0.03412>\mathrm{tol}=0.025. \) This means that we need to continue training for more epochs to satisfy the specified tol = 0.025. The decrease in the MSE over the learning process is shown in Fig. 10.20 , where we can see that tol = 0.025 is reached in epoch number 13, with an MSE =  E ( w )  =  0.02425.

figure 20

Behavior of the learning process by monitoring the MSE for Example 10.1—a hand computation

10.8.3 Illustrative Example 10.2—By Hand Computation

Table 10.2 gives the information for this example; the data collected contain five observations, the response variable ( y ) takes values between −1 and 1, and there are three inputs (predictors). Table 10.2 also provides the starting values for the hidden weights ( \( {w}_{kp}^{(h)}\Big) \) ) and for the output weights ( \( {w}_{jk}^{(l)} \) ). Because the response variable lies in the interval between −1 and 1, we will use the hyperbolic tangent activation function (Tanh) for the hidden and output layers. Here we use a learning rate ( η ) equal to 0.05 and a tolerance equal to 0.008 (Fig. 10.21 ).

figure 21

A simple artificial neural network with three inputs, one hidden layer with two neurons, and one response variable (output)

Here the backpropagation algorithm was implemented using the five patterns of data simultaneously using matrix calculations. Again, first we represent the design matrix of inputs and outputs:

Then we define the vectors of starting values of the hidden ( w ( h ) ) and output ( w ( l ) ) weights:

Now P  = 3 and M  = 3. Next, we calculate the net inputs for the hidden layer as

Now with the tanh activation function, the output of the hidden layer is calculated:

Again \( {V}_{ik}^{(h)}={g}^{(h)}\left({z}_{ik}^{(h)}\right),i=1,\dots, 5;k=1,2 \) , and \( {g}^{(h)}(z)=\tanh (z)=\frac{\exp (z)-\exp \left(-z\right)}{\exp (z)+\exp \left(-z\right)} \) , which can also be replaced by another activation function. Then the net inputs for the output layer are calculated as

where \( {\hat{y}}_{i}={g}^{(l)}\left({z}_{i1}^{(l)}\right),i=1,\dots, 5 \) and \( {g}^{(l)}(z)=\tanh (z)=\frac{\exp (z)-\exp \left(-z\right)}{\exp (z)+\exp \left(-z\right)} \) . The output errors are calculated as

where \( {\boldsymbol{w}}_1^{(h)} \) is w ( h ) without the weights of the intercepts, that is, without the first row.
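With the Tanh activation, the derivative needed for these error terms is g′(z) = 1 − tanh²(z). As with the sigmoid earlier, this identity can be checked numerically (our own code, not part of the example):

```python
import numpy as np

z = np.linspace(-2.0, 2.0, 9)

# Analytic derivative of tanh: g'(z) = 1 - tanh(z)^2.
analytic = 1.0 - np.tanh(z) ** 2

# Independent check via a central finite difference.
h = 1e-6
numeric = (np.tanh(z + h) - np.tanh(z - h)) / (2.0 * h)
```

The agreement between the two confirms the derivative used in the hand computations of this example.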

Number 2 in w ( l )(2) indicates that output weights are for epoch number 2. The weights of the hidden layer in epoch 2 are obtained with

We check whether the global error satisfies the specified tolerance (tol). Since \( E\left(\boldsymbol{w}\right)=\frac{1}{2n}{\sum}_{i=1}^n{\left({\hat{y}}_i-{y}_i\right)}^2=0.01104>\mathrm{tol}=0.008, \) we have to continue with the next epoch by cycling through the training data again.

Epoch 2. Using the updated weights of epoch 1, we obtain the new weights for epoch 2.

For the output layer, these are

While for the hidden layer, they are

Now the predicted values are \( {\hat{y}}_1=0.6372 \) , \( {\hat{y}}_2=0.2620 \) , \( {\hat{y}}_3=-0.6573 \) , \( {\hat{y}}_4=0.6226, \) \( {\hat{y}}_5=-0.9612, \) and \( E\left(\boldsymbol{w}\right)=\frac{1}{2n}{\sum}_{i=1}^n{\left({\hat{y}}_i-{y}_i\right)}^2=0.01092>\mathrm{tol}=0.008, \) which means that we have to continue with the next epoch by cycling through the training data again. Figure 10.22 shows that E ( w ) < tol = 0.008 is not reached until epoch 83, where E ( w ) = 0.00799.

figure 22

Behavior of the learning process by monitoring the MSE for Example 10.2—a hand computation

In this algorithm, zero starting weights are not an option because each layer would then be symmetric in the weights flowing to the different neurons. Instead, the starting values should be close to zero and can be drawn from random uniform or Gaussian distributions (Efron and Hastie 2016 ). One disadvantage of the basic backpropagation algorithm described above is that the learning parameter η is fixed.
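The symmetry problem with zero starting weights can be demonstrated directly: hidden units that start with identical weights receive identical gradients and therefore never differentiate, no matter how long training runs. A minimal sketch of this (our own construction, following the update rules above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One input pattern, two hidden units, one output; all weights start at zero.
x = np.array([1.0, 0.5])          # bias unit + one input value
W_h = np.zeros((2, 2))            # hidden weights: 2 units x (bias + 1 input)
w_l = np.zeros(3)                 # output weights: bias + 2 hidden units
y, eta = 1.0, 0.5

for _ in range(10):               # ten backpropagation steps
    V = np.concatenate(([1.0], sigmoid(W_h @ x)))
    y_hat = sigmoid(w_l @ V)
    delta = (y - y_hat) * y_hat * (1 - y_hat)   # output error
    g_h = V[1:]
    psi = g_h * (1 - g_h) * (w_l[1:] * delta)   # hidden errors
    w_l += eta * delta * V
    W_h += eta * np.outer(psi, x)

# The two hidden units never differentiate: their weight rows stay equal.
print(np.allclose(W_h[0], W_h[1]))  # True
```

Starting instead from small random values breaks this symmetry from the first step, which is why random initialization near zero is the standard recommendation.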

Alipanahi B, Delong A, Weirauch MT, Frey BJ (2015) Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat Biotechnol 33:831–838


Anderson J, Pellionisz A, Rosenfeld E (1990) Neurocomputing 2: directions for research. MIT, Cambridge


Angermueller C, Pärnamaa T, Parts L, Stegle O (2016) Deep learning for computational biology. Mol Syst Biol 12(878):1–16

Chollet F, Allaire JJ (2017) Deep learning with R. Manning Publications, Manning Early Access Program (MEA), 1st edn

Cole JH, Rudra PK, Poudel DT, Matthan WA, Caan CS, Tim D, Spector GM (2017) Predicting brain age with deep learning from raw imaging data results in a reliable and heritable biomarker. NeuroImage 163(1):115–124. https://doi.org/10.1016/j.neuroimage.2017.07.059


Cybenko G (1989) Approximation by superpositions of a sigmoidal function. Math Control Signal Syst 2:303–314


Dingli A, Fournier KS (2017) Financial time series forecasting—a deep learning approach. Int J Mach Learn Comput 7(5):118–122

Dougherty G (2013) Pattern recognition and classification-an introduction. Springer Science + Business Media, New York


Efron B, Hastie T (2016) Computer age statistical inference. Algorithms, evidence, and data science. Cambridge University Press, New York

Francisco-Caicedo EF, López-Sotelo JA (2009) Una aproximación práctica a las redes neuronales artificiales. Universidad del Valle, Cali

Goldberg Y (2016) A primer on neural network models for natural language processing. J Artif Intell Res 57:345–420

Haykin S (2009) Neural networks and learning machines, 3rd edn. Pearson Prentice Hall, New York

Hornik K (1991) Approximation capabilities of multilayer feedforward networks. Neural Netw 4:251–257

James G, Witten D, Hastie T, Tibshirani R (2013) An introduction to statistical learning with applications in R. Springer, New York

Kohonen T (2000) Self-organizing maps. Springer, Berlin

Lantz B (2015) Machine learning with R, 2nd edn. Packt Publishing Ltd, Birmingham

LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444

Lewis ND (2016) Deep learning made easy with R. A gentle introduction for data science. CreateSpace Independent Publishing Platform

Liu S, Tang J, Zhang Z, Gaudiot JL (2017) CAAD: computer architecture for autonomous driving. arXiv preprint arXiv:1702.01894

Ma W, Qiu Z, Song J, Cheng Q, Ma C (2017) DeepGS: predicting phenotypes from genotypes using Deep Learning. bioRxiv 241414. https://doi.org/10.1101/241414

Makridakis S, Spiliotis E, Assimakopoulos V (2018) Statistical and Machine Learning forecasting methods: concerns and ways forward. PLoS One 13(3):e0194889. https://doi.org/10.1371/journal.pone.0194889


McCulloch WS, Pitts W (1943) A logical calculus of the ideas immanent in nervous activity. Bull Math Biophys 5:115–133

McDowell R, Grant D (2016) Genomic selection with deep neural networks. Graduate Theses and Dissertations, p 15973. https://lib.dr.iastate.edu/etd/15973

Menden MP, Iorio F, Garnett M, McDermott U, Benes CH et al (2013) Machine learning prediction of cancer cell sensitivity to drugs based on genomic and chemical properties. PLoS One 8:e61318

Montesinos-López A, Montesinos-López OA, Gianola D, Crossa J, Hernández-Suárez CM (2018a) Multi-environment genomic prediction of plant traits using deep learners with a dense architecture. G3: Genes, Genomes, Genetics 8(12):3813–3828. https://doi.org/10.1534/g3.118.200740

Montesinos-López OA, Montesinos-López A, Crossa J, Gianola D, Hernández-Suárez CM et al (2018b) Multi-trait, multi-environment deep learning modeling for genomic-enabled prediction of plant traits. G3: Genes, Genomes, Genetics 8(12):3829–3840. https://doi.org/10.1534/g3.118.200728

Montesinos-López OA, Vallejo M, Crossa J, Gianola D, Hernández-Suárez CM, Montesinos-López A, Juliana P, Singh R (2019a) A benchmarking between deep learning, support vector machine and bayesian threshold best linear unbiased prediction for predicting ordinal traits in plant breeding. G3: Genes, Genomes, Genetics 9(2):601–618

Montesinos-López OA, Martín-Vallejo J, Crossa J, Gianola D, Hernández-Suárez CM, Montesinos-López A, Juliana P, Singh R (2019b) New deep learning genomic prediction model for multi-traits with mixed binary, ordinal, and continuous phenotypes. G3: Genes, Genomes, Genetics 9(5):1545–1556

Montesinos-López OA, Montesinos-López A, Pérez-Rodríguez P, Barrón-López JA, Martini JWR, Fajardo-Flores SB, Gaytan-Lugo LS, Santana-Mancilla PC, Crossa J (2021) A review of deep learning applications for genomic selection. BMC Genomics 22:19

Patterson J, Gibson A (2017) Deep learning: a practitioner’s approach. O’Reilly Media

Ripley B (1993) Statistical aspects of neural networks. In: Bornndorff-Nielsen U, Jensen J, Kendal W (eds) Networks and chaos—statistical and probabilistic aspects. Chapman and Hall, London, pp 40–123


Rouet-Leduc B, Hulbert C, Lubbers N, Barros K, Humphreys CJ et al (2017) Machine learning predicts laboratory earthquakes. Geophys Res Lett 44(28):9276–9282

Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by back-propagating errors. Nature 323:533–536

Shalev-Shwartz S, Ben-David S (2014) Understanding machine learning: from theory to algorithms. Cambridge University Press, New York

Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15(6):1929–1958

Tavanaei A, Anandanadarajah N, Maida AS, Loganantharaj R (2017) A deep learning model for predicting tumor suppressor genes and oncogenes from PDB structure. bioRxiv 177378. https://doi.org/10.1101/177378

Weron R (2014) Electricity price forecasting: a review of the state-of-the-art with a look into the future. Int J Forecast 30(4):1030–1081

Wiley JF (2016) R deep learning essentials: build automatic classification and prediction models using unsupervised learning. Packt Publishing, Birmingham, Mumbai


Author information

Authors and Affiliations

Facultad de Telemática, University of Colima, Colima, México

Osval Antonio Montesinos López

Departamento de Matemáticas, University of Guadalajara, Guadalajara, México

Abelardo Montesinos López

Biometrics and Statistics Unit, CIMMYT, Edo de México, México

Jose Crossa

Colegio de Postgraduados, Edo de México, México


Rights and permissions

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License ( http://creativecommons.org/licenses/by/4.0/ ), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.


Copyright information

© 2022 The Author(s)

About this chapter

Montesinos López, O.A., Montesinos López, A., Crossa, J. (2022). Fundamentals of Artificial Neural Networks and Deep Learning. In: Multivariate Statistical Machine Learning Methods for Genomic Prediction. Springer, Cham. https://doi.org/10.1007/978-3-030-89010-0_10

Download citation

DOI : https://doi.org/10.1007/978-3-030-89010-0_10

Published : 14 January 2022

Publisher Name : Springer, Cham

Print ISBN : 978-3-030-89009-4

Online ISBN : 978-3-030-89010-0

eBook Packages : Biomedical and Life Sciences Biomedical and Life Sciences (R0)


  • Survey Paper
  • Open access
  • Published: 31 March 2021

Review of deep learning: concepts, CNN architectures, challenges, applications, future directions

  • Laith Alzubaidi   ORCID: orcid.org/0000-0002-7296-5413 1 , 5 ,
  • Jinglan Zhang 1 ,
  • Amjad J. Humaidi 2 ,
  • Ayad Al-Dujaili 3 ,
  • Ye Duan 4 ,
  • Omran Al-Shamma 5 ,
  • J. Santamaría 6 ,
  • Mohammed A. Fadhel 7 ,
  • Muthana Al-Amidie 4 &
  • Laith Farhan 8  

Journal of Big Data volume  8 , Article number:  53 ( 2021 ) Cite this article

452k Accesses

2789 Citations

37 Altmetric


In the last few years, the deep learning (DL) computing paradigm has been deemed the gold standard in the machine learning (ML) community. Moreover, it has gradually become the most widely used computational approach in the field of ML, achieving outstanding results on several complex cognitive tasks, matching or even beating those provided by human performance. One of the benefits of DL is the ability to learn from massive amounts of data. The DL field has grown quickly in the last few years and has been extensively used to successfully address a wide range of traditional applications. More importantly, DL has outperformed well-known ML techniques in many domains, e.g., cybersecurity, natural language processing, bioinformatics, robotics and control, and medical information processing, among many others. Although several works have reviewed the state of the art of DL, all of them tackled only one aspect of it, leading to an overall lack of knowledge about the field. Therefore, in this contribution, we propose using a more holistic approach in order to provide a more suitable starting point from which to develop a full understanding of DL. Specifically, this review attempts to provide a more comprehensive survey of the most important aspects of DL, including those enhancements recently added to the field. In particular, this paper outlines the importance of DL and presents the types of DL techniques and networks. It then presents convolutional neural networks (CNNs), the most utilized DL network type, and describes the development of CNN architectures together with their main features, starting with the AlexNet network and closing with the High-Resolution network (HR.Net). Finally, we further present the challenges and suggested solutions to help researchers understand the existing research gaps. This is followed by a list of the major DL applications.
Computational tools including FPGA, GPU, and CPU are summarized along with a description of their influence on DL. The paper ends with the evolution matrix, benchmark datasets, and summary and conclusion.

Introduction

Recently, machine learning (ML) has become very widespread in research and has been incorporated in a variety of applications, including text mining, spam detection, video recommendation, image classification, and multimedia concept retrieval [ 1 , 2 , 3 , 4 , 5 , 6 ]. Among the different ML algorithms, deep learning (DL) is very commonly employed in these applications [ 7 , 8 , 9 ]. Another name for DL is representation learning (RL). The continuing appearance of novel studies in the fields of deep and distributed learning is due to both the unpredictable growth in the ability to obtain data and the amazing progress made in the hardware technologies, e.g. High Performance Computing (HPC) [ 10 ].

DL is derived from the conventional neural network but considerably outperforms its predecessors. Moreover, DL employs transformations and graph technologies simultaneously in order to build up multi-layer learning models. The most recently developed DL techniques have obtained outstanding performance across a variety of applications, including audio and speech processing, visual data processing, and natural language processing (NLP), among others [ 11 , 12 , 13 , 14 ].

Usually, the effectiveness of an ML algorithm is highly dependent on the integrity of the input-data representation. It has been shown that a suitable data representation provides improved performance when compared to a poor data representation. Thus, a significant research trend in ML for many years has been feature engineering, which has informed numerous research studies. This approach aims at constructing features from raw data. In addition, it is extremely field-specific and frequently requires sizable human effort. For instance, several types of features were introduced and compared in the computer vision context, such as histogram of oriented gradients (HOG) [ 15 ], scale-invariant feature transform (SIFT) [ 16 ], and bag of words (BoW) [ 17 ]. As soon as a novel feature is introduced and found to perform well, it becomes a new research direction that is pursued over multiple decades.

By contrast, feature extraction is achieved automatically in DL algorithms. This encourages researchers to extract discriminative features using the smallest possible amount of human effort and field knowledge [ 18 ]. These algorithms have a multi-layer data representation architecture, in which the first layers extract low-level features while the last layers extract high-level features. Note that artificial intelligence (AI) originally inspired this type of architecture, which simulates the process that occurs in core sensorial regions within the human brain. Using different scenes, the human brain can automatically extract data representations. More specifically, the output of this process is the classified objects, while the received scene information represents the input. This process simulates the working methodology of the human brain. Thus, it emphasizes the main benefit of DL.

In the field of ML, DL, due to its considerable success, is currently one of the most prominent research trends. In this paper, an overview of DL is presented that adopts various perspectives, such as the main concepts, architectures, challenges, applications, computational tools, and evolution matrix. The convolutional neural network (CNN) is one of the most popular and widely used DL networks [ 19 , 20 ], and it is largely responsible for the current popularity of DL. The main advantage of CNN compared to its predecessors is that it automatically detects the significant features without any human supervision, which has made it the most widely used network. Therefore, we dig deep into CNN by presenting its main components. Furthermore, we elaborate in detail the most common CNN architectures, starting with the AlexNet network and ending with the High-Resolution network (HR.Net).

Several DL review papers have been published in the last few years. However, each of them addressed only one side of DL, focusing on a single application or topic, such as the review of CNN architectures [ 21 ], DL for classification of plant diseases [ 22 ], DL for object detection [ 23 ], and DL applications in medical image analysis [ 24 ]. Although these reviews cover worthwhile topics, they do not provide a full understanding of DL, including its concepts, detailed research gaps, computational tools, and applications. One must first understand these DL aspects before going deep into the applications, and doing so ordinarily requires extensive time and a large number of research papers. Therefore, we propose a deep review of DL to provide a more suitable starting point from which to develop a full understanding of DL from one review paper. The motivation behind our review is to cover the most important aspects of DL, including its open challenges, applications, and computational tools. Furthermore, our review can be the first step towards other DL topics.

The main aim of this review is to present the most important aspects of DL, making it easy for researchers and students to gain a clear image of DL from a single review paper. This review will further advance DL research by helping people discover more about recent developments in the field, allowing researchers to decide on the most suitable directions of work to pursue in order to provide more accurate alternatives to the field. Our contributions are outlined as follows:

To the best of our knowledge, this is the first review to provide a deep survey of the most important aspects of deep learning in a single paper, helping researchers and students reach a good understanding of the field.

We explain CNN, the most popular deep learning algorithm, in depth by describing its concepts, theory, and state-of-the-art architectures.

We review the current challenges (limitations) of deep learning, including lack of training data, imbalanced data, interpretability of data, uncertainty scaling, catastrophic forgetting, model compression, overfitting, the vanishing gradient problem, the exploding gradient problem, and underspecification. We additionally discuss the solutions proposed to tackle these issues.

We provide an exhaustive list of medical imaging applications of deep learning, categorizing them by task, starting with classification and ending with registration.

We discuss the computational approaches (CPU, GPU, FPGA) by comparing the influence of each tool on deep learning algorithms.

The rest of the paper is organized as follows: “ Survey methodology ” section describes the survey methodology. “ Background ” section presents the background. “ Classification of DL approaches ” section defines the classification of DL approaches. “ Types of DL networks ” section displays types of DL networks. “ CNN architectures ” section shows CNN architectures. “ Challenges (limitations) of deep learning and alternate solutions ” section details the challenges of DL and alternate solutions. “ Applications of deep learning ” section outlines the applications of DL. “ Computational approaches ” section explains the influence of computational approaches (CPU, GPU, FPGA) on DL. “ Evaluation metrics ” section presents the evaluation metrics. “ Frameworks and datasets ” section lists frameworks and datasets. “ Summary and conclusion ” section presents the summary and conclusion.

Survey methodology

We have reviewed the significant research papers in the field published during 2010–2020, mainly from 2019 and 2020, with some papers from 2021. The main focus was on papers from the most reputed publishers, such as IEEE, Elsevier, MDPI, Nature, ACM, and Springer; some papers were selected from arXiv. We have reviewed more than 300 papers on various DL topics: 108 papers from 2020, 76 from 2019, and 48 from 2018, which indicates that this review focuses on the latest publications in the field of DL. The selected papers were analyzed and reviewed to (1) list and define the DL approaches and network types, (2) list and explain CNN architectures, (3) present the challenges of DL and suggest alternate solutions, (4) assess the applications of DL, and (5) assess the computational approaches. The main keywords used as search criteria for this review were (“Deep Learning”), (“Machine Learning”), (“Convolution Neural Network”), (“Deep Learning” AND “Architectures”), ((“Deep Learning”) AND (“Image”) AND (“detection” OR “classification” OR “segmentation” OR “Localization”)), (“Deep Learning” AND “detection” OR “classification” OR “segmentation” OR “Localization”), (“Deep Learning” AND “CPU” OR “GPU” OR “FPGA”), (“Deep Learning” AND “Transfer Learning”), (“Deep Learning” AND “Imbalanced Data”), (“Deep Learning” AND “Interpretability of data”), (“Deep Learning” AND “Overfitting”), and (“Deep Learning” AND “Underspecification”). Figure  1 shows the search structure of the survey. Table  1 presents details of some of the journals cited in this review.

Figure 1. Search framework

Background

This section presents a background of DL. We begin with a quick introduction to DL, followed by the difference between DL and ML. We then describe the situations in which DL is required and, finally, the reasons for applying DL.

DL, a subset of ML (Fig.  2 ), is inspired by the information processing patterns found in the human brain. DL does not require any human-designed rules to operate; rather, it uses a large amount of data to map the given input to specific labels. DL is designed using numerous layers of algorithms (artificial neural networks, or ANNs), each of which provides a different interpretation of the data that has been fed to them [ 18 , 25 ].

Figure 2. Deep learning family

Achieving a classification task using conventional ML techniques requires several sequential steps, specifically pre-processing, feature extraction, careful feature selection, learning, and classification. Furthermore, feature selection has a great impact on the performance of ML techniques: biased feature selection may lead to incorrect discrimination between classes. Conversely, DL has the ability to automate the learning of feature sets for several tasks, unlike conventional ML methods [ 18 , 26 ], enabling learning and classification to be achieved in a single shot (Fig.  3 ). DL has become an incredibly popular type of ML algorithm in recent years due to the huge growth and evolution of the field of big data [ 27 , 28 ]. It is still under continuous development, delivering novel performance on several ML tasks [ 22 , 29 , 30 , 31 ], and has simplified the improvement of many learning fields [ 32 , 33 ], such as image super-resolution [ 34 ], object detection [ 35 , 36 ], and image recognition [ 30 , 37 ]. Recently, DL performance has come to exceed human performance on tasks such as image classification (Fig.  4 ).

Figure 3. The difference between deep learning and traditional machine learning

Figure 4. Deep learning performance compared to humans

Nearly all scientific fields have felt the impact of this technology. Most industries and businesses have already been disrupted and transformed through the use of DL, and the leading technology- and economy-focused companies around the world are racing to improve it. Even now, DL performance exceeds human-level capability in many areas, such as predicting the time taken to make car deliveries, deciding whether to certify loan requests, and predicting movie ratings [ 38 ]. The winners of the 2019 Turing Award, often called the “Nobel Prize” of computing, were three pioneers in the field of DL: Yann LeCun, Geoffrey Hinton, and Yoshua Bengio [ 39 ]. Although a large number of goals have been achieved, there is further progress to be made in the DL context. In fact, DL has the ability to enhance human lives by providing additional accuracy in diagnosis, including estimating natural disasters [ 40 ], discovering new drugs [ 41 ], and diagnosing cancer [ 42 , 43 , 44 ]. Esteva et al. [ 45 ] found that a DL network had the same ability to diagnose disease as twenty-one board-certified dermatologists, using 129,450 images of 2032 diseases. Furthermore, in grading prostate cancer, US board-certified general pathologists achieved an average accuracy of 61%, while the Google AI system [ 44 ] outperformed these specialists with an average accuracy of 70%. Since 2020, DL has played an increasingly vital role in the early diagnosis of the novel coronavirus (COVID-19) [ 29 , 46 , 47 , 48 ] and has become the main tool in many hospitals around the world for automatic COVID-19 classification and detection using chest X-ray or other types of images. We end this section with a remark attributed to AI pioneer Geoffrey Hinton: “Deep learning is going to be able to do everything”.

When to apply deep learning

Machine intelligence is useful in many situations, performing equal to or better than human experts in some cases [ 49 , 50 , 51 , 52 ], meaning that DL can be a solution to the following problems:

Cases where human experts are not available.

Cases where humans are unable to explain decisions made using their expertise (language understanding, medical decisions, and speech recognition).

Cases where the problem solution updates over time (price prediction, stock preference, weather prediction, and tracking).

Cases where solutions require adaptation based on specific cases (personalization, biometrics).

Cases where the size of the problem is extremely large and exceeds our inadequate reasoning abilities (sentiment analysis, matching ads on Facebook, calculating webpage ranks).

Why deep learning?

Several performance features of DL help answer this question:

Universal Learning Approach: Because DL has the ability to perform in approximately all application domains, it is sometimes referred to as universal learning.

Robustness: In general, precisely designed features are not required in DL techniques. Instead, the optimized features are learned in an automated fashion related to the task under consideration. Thus, robustness to the usual changes of the input data is attained.

Generalization: Different data types or different applications can use the same DL technique, an approach frequently referred to as transfer learning (TL), which is explained in a later section. Furthermore, it is a useful approach for problems where data is insufficient.

Scalability: DL is highly scalable. ResNet [ 37 ], which was invented by Microsoft, comprises 1202 layers and is frequently applied at a supercomputing scale. Lawrence Livermore National Laboratory (LLNL), a large enterprise working on evolving frameworks for networks, adopted a similar approach, where thousands of nodes can be implemented [ 53 ].

Classification of DL approaches

DL techniques are classified into three major categories: unsupervised, partially supervised (semi-supervised), and supervised. Furthermore, deep reinforcement learning (DRL), also known as RL, is another type of learning technique, mostly considered to fall into the category of partially supervised (and occasionally unsupervised) learning techniques.

Deep supervised learning

Deep semi-supervised learning

In this technique, the learning process is based on semi-labeled datasets. Occasionally, generative adversarial networks (GANs) and DRL are employed in the same way as this technique. In addition, RNNs, which include GRUs and LSTMs, are also employed for partially supervised learning. One advantage of this technique is that it minimizes the amount of labeled data needed; one disadvantage is that irrelevant input features present in the training data could lead to incorrect decisions. Text document classification is one of the most popular applications of semi-supervised learning: because obtaining a large amount of labeled text documents is difficult, semi-supervised learning is ideal for this task.
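The idea behind semi-supervised learning can be sketched with a toy self-training (pseudo-labeling) loop, a common semi-supervised scheme: a model fitted on the few labeled points repeatedly labels the unlabeled ones and is refitted on the enlarged set. The NumPy-only nearest-centroid classifier and all names below are illustrative assumptions, not part of this review:

```python
import numpy as np

def self_train_centroids(X_lab, y_lab, X_unlab, rounds=3):
    """Toy self-training: fit class centroids on labeled data, then
    repeatedly pseudo-label the unlabeled points and refit."""
    classes = np.unique(y_lab)
    X, y = X_lab.copy(), y_lab.copy()
    for _ in range(rounds):
        centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
        # pseudo-label every unlabeled point by its nearest centroid
        d = np.linalg.norm(X_unlab[:, None, :] - centroids[None, :, :], axis=2)
        pseudo = classes[d.argmin(axis=1)]
        # refit on labeled + pseudo-labeled data in the next round
        X = np.vstack([X_lab, X_unlab])
        y = np.concatenate([y_lab, pseudo])
    return centroids, pseudo

# two well-separated clusters, only one labeled point per class
X_lab = np.array([[0.0, 0.0], [10.0, 10.0]])
y_lab = np.array([0, 1])
X_unlab = np.vstack([np.random.RandomState(0).normal(0, 1, (20, 2)),
                     np.random.RandomState(1).normal(10, 1, (20, 2))])
centroids, pseudo = self_train_centroids(X_lab, y_lab, X_unlab)
```

On this toy data the pseudo-labels recover the true cluster memberships; in practice the quality of self-training hinges on how reliable the early pseudo-labels are, which echoes the disadvantage noted above.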

Deep unsupervised learning

This technique makes it possible to implement the learning process in the absence of labeled data (i.e., no labels are required). Here, the agent learns the significant features or interior representations required to discover the unidentified structure or relationships in the input data. Techniques for generative networks, dimensionality reduction, and clustering are frequently counted within the category of unsupervised learning. Several members of the DL family have performed well on non-linear dimensionality-reduction and clustering tasks; these include restricted Boltzmann machines, auto-encoders, and, as the most recently developed technique, GANs. Moreover, RNNs, which include GRUs and LSTM approaches, have also been employed for unsupervised learning in a wide range of applications. The main disadvantages of unsupervised learning are that it cannot provide accurate information concerning data sorting and that it is computationally complex. One of the most popular unsupervised learning approaches is clustering [ 54 ].

Deep reinforcement learning

For solving a task, the type of reinforcement learning to perform is selected based on the space or scope of the problem. For example, DRL works best for problems involving many parameters to be optimized, whereas derivative-free reinforcement learning performs well for problems with limited parameters. Some applications of reinforcement learning are business strategy planning and robotics for industrial automation. The main drawback of reinforcement learning is that its parameters may influence the speed of learning. The main motivations for utilizing reinforcement learning are:

It assists you in identifying which action produces the highest reward over a longer period.

It assists you in discovering which situations require action.

It helps you figure out the best approach for reaching large rewards.

Reinforcement Learning also gives the learning agent a reward function.

Reinforcement learning cannot be utilized in every situation, such as:

Cases where there is sufficient data to solve the problem with supervised learning techniques.

Cases where reinforcement learning would be too computing-heavy and time-consuming, especially when the workspace is large.
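The reward-driven learning described above can be sketched with tabular Q-learning on a toy problem. The five-state chain environment, the hyperparameters, and all names below are illustrative assumptions, not taken from this review:

```python
import numpy as np

# Tiny deterministic chain: states 0..4, actions 0 = left, 1 = right.
# Reaching state 4 yields reward 1 and ends the episode.
N_STATES, GAMMA, ALPHA = 5, 0.9, 0.5

def step(s, a):
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s2 == N_STATES - 1 else 0.0
    return s2, r, s2 == N_STATES - 1

rng = np.random.default_rng(0)
Q = np.zeros((N_STATES, 2))          # action-value table
for episode in range(200):
    s, done = 0, False
    while not done:
        # epsilon-greedy exploration (20% random actions)
        a = int(rng.integers(2)) if rng.random() < 0.2 else int(Q[s].argmax())
        s2, r, done = step(s, a)
        # tabular Q-learning update toward the Bellman target
        Q[s, a] += ALPHA * (r + GAMMA * Q[s2].max() * (not done) - Q[s, a])
        s = s2

policy = Q.argmax(axis=1)            # greedy policy per state
```

After training, the greedy policy moves right in every non-terminal state, i.e., the agent has identified the action sequence that produces the highest long-run reward.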

Types of DL networks

This section discusses the most famous types of deep learning networks: recursive neural networks (RvNNs), RNNs, and CNNs. RvNNs and RNNs are explained briefly, while CNNs are covered in depth because this type is the most important and the most widely used across applications.

Recursive neural networks

RvNNs can make predictions in a hierarchical structure and classify outputs using compositional vectors [ 57 ]. Recursive auto-associative memory (RAAM) [ 58 ] was the primary inspiration for RvNN development. The RvNN architecture is designed for processing objects that have arbitrarily shaped structures, such as graphs or trees, and it generates a fixed-width distributed representation from a variable-size recursive data structure. The network is trained using the back-propagation through structure (BTS) learning scheme [ 58 ], which follows the same technique as the general back-propagation algorithm but supports a tree-like structure. Auto-association trains the network to regenerate the input-layer pattern at the output layer. RvNNs are highly effective in the NLP context. Socher et al. [ 59 ] introduced an RvNN architecture designed to process inputs from a variety of modalities, demonstrating it on two applications: classifying natural language sentences, where each sentence is split into words, and classifying natural images, where each image is separated into segments of interest. The RvNN computes a likely pair of scores for merging and constructs a syntactic tree: for every pair of units, it calculates a score for the plausibility of the merge, and the pair with the largest score is then merged within a composition vector. Following every merge, the RvNN generates (a) a larger area of numerous units, (b) a compositional vector of the area, and (c) a class label (for instance, a noun phrase becomes the class label for the new area if two units are noun words). The compositional vector for the entire area is the root of the RvNN tree structure. An example RvNN tree is shown in Fig.  5 . RvNNs have been employed in several applications [ 60 , 61 , 62 ].

Figure 5. An example of an RvNN tree

Recurrent neural networks

RNNs are a widely employed and familiar algorithm in the discipline of DL [ 63 , 64 , 65 ], applied mainly in speech processing and NLP contexts [ 66 , 67 ]. Unlike conventional networks, an RNN uses sequential data in the network. Since the embedded structure in the sequence of the data delivers valuable information, this feature is fundamental to a range of different applications: for instance, understanding the context of a sentence is important for determining the meaning of a specific word in it. The RNN can thus be considered a unit of short-term memory, where x represents the input layer, y the output layer, and s the state (hidden) layer. For a given input sequence, a typical unfolded RNN diagram is illustrated in Fig.  6 . Pascanu et al. [ 68 ] introduced three different types of deep RNN techniques, namely “Hidden-to-Hidden”, “Hidden-to-Output”, and “Input-to-Hidden”, and based on these three techniques introduced a deep RNN that lessens the learning difficulty in the deep network and brings the benefits of greater depth.

Figure 6. Typical unfolded RNN diagram
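The unfolded recurrence in Fig. 6 can be sketched in a few lines of NumPy, following the x/s/y notation above. The tanh state update and all variable names below are assumptions for illustration:

```python
import numpy as np

def rnn_forward(xs, W, U, V, b, c):
    """Unrolled vanilla-RNN forward pass: the hidden state s carries
    information from earlier inputs to later time steps."""
    s = np.zeros(U.shape[0])
    ys = []
    for x in xs:                          # one step per sequence element
        s = np.tanh(W @ x + U @ s + b)    # state update (short-term memory)
        ys.append(V @ s + c)              # output read-out at this step
    return np.array(ys), s

rng = np.random.default_rng(0)
d_in, d_h, d_out, T = 3, 4, 2, 5
W = rng.normal(size=(d_h, d_in))          # input-to-hidden weights
U = rng.normal(size=(d_h, d_h))           # hidden-to-hidden (recurrent) weights
V = rng.normal(size=(d_out, d_h))         # hidden-to-output weights
b, c = np.zeros(d_h), np.zeros(d_out)
xs = rng.normal(size=(T, d_in))           # a length-T input sequence
ys, s_final = rnn_forward(xs, W, U, V, b, c)
```

The repeated multiplication by U in this loop is exactly what produces the vanishing and exploding gradient sensitivity discussed next.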

However, RNNs' sensitivity to the exploding and vanishing gradient problems represents one of the main issues with this approach [ 69 ]. More specifically, during the training process, repeated multiplication of several large or small derivatives may cause the gradients to exponentially explode or decay. As new inputs arrive, the network stops thinking about the initial ones; this sensitivity therefore decays over time. This issue can be handled using LSTM [ 70 ], which offers recurrent connections to memory blocks in the network. Every memory block contains a number of memory cells that can store the temporal states of the network, together with gated units for controlling the flow of information. In very deep networks [ 37 ], residual connections can also considerably reduce the impact of the vanishing gradient issue, as explained in later sections. Compared with the RNN, the CNN is considered more powerful and offers greater feature compatibility.

Convolutional neural networks

In the field of DL, the CNN is the most famous and most commonly employed algorithm [ 30 , 71 , 72 , 73 , 74 , 75 ]. The main benefit of the CNN compared to its predecessors is that it automatically identifies the relevant features without any human supervision [ 76 ]. CNNs have been extensively applied in a range of different fields, including computer vision [ 77 ], speech processing [ 78 ], and face recognition [ 79 ]. Like a conventional neural network, the structure of CNNs was inspired by neurons in human and animal brains; more specifically, the CNN simulates the complex sequence of cells that forms the visual cortex in a cat's brain [ 80 ]. Goodfellow et al. [ 28 ] identified three key benefits of the CNN: equivariant representations, sparse interactions, and parameter sharing. Unlike conventional fully connected (FC) networks, the CNN employs shared weights and local connections to make full use of 2D input-data structures like image signals. This operation utilizes an extremely small number of parameters, which both simplifies the training process and speeds up the network. This mirrors the visual cortex cells: notably, these cells sense only small regions of a scene rather than the whole scene (i.e., they spatially extract the local correlation available in the input, like local filters over the input).

A commonly used type of CNN, which is similar to the multi-layer perceptron (MLP), consists of numerous convolution layers preceding sub-sampling (pooling) layers, while the ending layers are FC layers. An example of CNN architecture for image classification is illustrated in Fig.  7 .

Figure 7. An example of CNN architecture for image classification

The input x of each layer in a CNN model is organized in three dimensions: height, width, and depth, or \(m \times m \times r\) , where the height (m) is equal to the width. The depth is also referred to as the channel number; for example, in an RGB image, the depth (r) is equal to three. The kernels (filters) available in each convolutional layer, denoted by k, also have three dimensions ( \(n \times n \times q\) ), similar to the input image; here, however, n must be smaller than m , while q is either equal to or smaller than r . In addition, the kernels are the basis of the local connections, which share similar parameters (bias \(b^{k}\) and weight \(W^{k}\) ) and are convolved with the input to generate k feature maps \(h^{k}\) , each of size ( \(m-n+1\) ). As in an MLP, the convolutional layer calculates a dot product between its input and the weights as in Eq. 1 , but the inputs are small areas of the initial image rather than the full input. Next, by applying the nonlinearity or an activation function to the convolution-layer output, we obtain the feature maps \(h^{k} = f(W^{k} * x + b^{k})\) .

The next step is down-sampling every feature map in the sub-sampling layers. This leads to a reduction in the network parameters, which accelerates the training process and in turn enables handling of the overfitting issue. For all feature maps, the pooling function (e.g. max or average) is applied to an adjacent area of size \(p \times p\) , where p is the kernel size. Finally, the FC layers receive the mid- and low-level features and create the high-level abstraction, which represents the last-stage layers as in a typical neural network. The classification scores are generated using the ending layer [e.g. support vector machines (SVMs) or softmax]. For a given instance, every score represents the probability of a specific class.

Benefits of employing CNNs

The benefits of using CNNs over other traditional neural networks in the computer vision environment are listed as follows:

The main reason to consider CNN is the weight sharing feature, which reduces the number of trainable network parameters and in turn helps the network to enhance generalization and to avoid overfitting.

Concurrently learning the feature extraction layers and the classification layer causes the model output to be both highly organized and highly reliant on the extracted features.

Large-scale network implementation is much easier with CNN than with other neural networks.

The CNN architecture consists of a number of layers (or so-called multi-building blocks). Each layer in the CNN architecture, including its function, is described in detail below.

Convolutional Layer: In CNN architecture, the most significant component is the convolutional layer. It consists of a collection of convolutional filters (so-called kernels). The input image, expressed as N-dimensional metrics, is convolved with these filters to generate the output feature map.

Kernel definition: A grid of discrete numbers or values describes the kernel. Each value is called a kernel weight. Random numbers are assigned as the kernel weights at the beginning of the CNN training process, and several different methods can be used to initialize them. These weights are then adjusted at each training epoch; thus, the kernel learns to extract significant features.

Convolutional Operation: Initially, the CNN input format is described. The input of a traditional neural network is in vector format, while the input of the CNN is a multi-channeled image; for instance, a gray-scale image is single-channel, while an RGB image is three-channeled. To understand the convolutional operation, let us take the example of a \(4 \times 4\) gray-scale image with a \(2 \times 2\) random weight-initialized kernel. First, the kernel slides over the whole image horizontally and vertically. At each position, the dot product between the input image and the kernel is determined: their corresponding values are multiplied and then summed up to create a single scalar value. The whole process is repeated until no further sliding is possible. The calculated dot product values form the output feature map. Figure  8 graphically illustrates the primary calculations executed at each step. In this figure, the light green color represents the \(2 \times 2\) kernel, while the light blue color represents the similar-size area of the input image. Both are multiplied; the end result after summing up the resulting product values (marked in a light orange color) represents an entry value in the output feature map.

Figure 8. The primary calculations executed at each step of the convolutional layer
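The sliding-window dot product described above can be sketched directly in NumPy. The helper `conv2d_valid`, the example image, and the kernel values are illustrative assumptions (stride 1, no padding):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and take
    the dot product at every position, as described in the text."""
    m, n = image.shape[0], kernel.shape[0]
    out = np.empty((m - n + 1, m - n + 1))
    for i in range(m - n + 1):
        for j in range(m - n + 1):
            # element-wise multiply the covered patch, then sum to a scalar
            out[i, j] = np.sum(image[i:i + n, j:j + n] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)  # a 4x4 "gray-scale image"
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])                  # a 2x2 kernel
fmap = conv2d_valid(image, kernel)                # 3x3 output feature map
```

For the 4 × 4 image and 2 × 2 kernel of the example, the feature map is 3 × 3, matching the \(m-n+1\) size given earlier.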

In the previous example, however, no padding is applied to the input image, while a stride of one (denoting the selected step size over vertical and horizontal locations) is applied to the kernel. Note that other stride values are also possible; increasing the stride value yields a feature map of lower dimensions.
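The effect of the stride (and of the zero-padding discussed next) on the feature-map size follows the standard output-size formula; this helper is an illustrative sketch, not taken from the review:

```python
def conv_output_size(m, n, padding=0, stride=1):
    """Spatial size of the output feature map for an m x m input and an
    n x n kernel; with no padding and stride 1 this reduces to m - n + 1."""
    return (m - n + 2 * padding) // stride + 1

# no padding, stride 1: the feature map shrinks (the 4x4 / 2x2 example)
assert conv_output_size(4, 2) == 3
# "same" padding for a 3x3 kernel keeps the spatial size
assert conv_output_size(32, 3, padding=1) == 32
# a larger stride yields a lower-dimensional feature map
assert conv_output_size(32, 3, padding=1, stride=2) == 16
```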

Padding, on the other hand, is highly significant for preserving border information from the input image: without it, the features at the border are washed out very quickly. Applying padding increases the size of the input image and, in turn, the size of the output feature map.

Core Benefits of Convolutional Layers

Sparse Connectivity: In FC neural networks, each neuron of a layer links to all neurons in the following layer. By contrast, in CNNs only a few weights exist between two adjacent layers. Thus, the number of required weights or connections is small, and the memory required to store them is also small; hence, this approach is memory-efficient. In addition, a full matrix operation is computationally much more costly than the local dot (.) operations in a CNN.

Weight Sharing: In a CNN, no dedicated weights are allocated between any two neurons of neighboring layers; instead, the same set of weights operates on each and every pixel of the input matrix. Learning a single group of weights for the whole input significantly decreases the required training time and various costs, as it is not necessary to learn additional weights for each neuron.

Pooling Layer: The main task of the pooling layer is the sub-sampling of the feature maps, which are generated by the preceding convolutional operations. In other words, this approach shrinks large feature maps into smaller ones while maintaining the majority of the dominant information (or features) at every step of the pooling stage. In a similar manner to the convolutional operation, both the stride and the kernel are assigned sizes before the pooling operation is executed. Several types of pooling methods are available for use in various pooling layers. These methods include tree pooling, gated pooling, average pooling, min pooling, max pooling, global average pooling (GAP), and global max pooling. The most familiar and frequently utilized pooling methods are max, min, and GAP pooling. Figure  9 illustrates these three pooling operations.

Figure 9. Three types of pooling operations
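The pooling operations of Fig. 9 can be sketched with a small NumPy helper; `pool2d` and the example feature map are illustrative assumptions (non-overlapping windows, stride equal to the kernel size p):

```python
import numpy as np

def pool2d(fmap, p=2, mode="max"):
    """Non-overlapping p x p pooling (stride = p) over a square feature map."""
    m = fmap.shape[0] // p
    # split the map into p x p tiles, then reduce each tile to one value
    tiles = fmap[:m * p, :m * p].reshape(m, p, m, p).swapaxes(1, 2)
    if mode == "max":
        return tiles.max(axis=(2, 3))
    if mode == "min":
        return tiles.min(axis=(2, 3))
    return tiles.mean(axis=(2, 3))    # average pooling

fmap = np.array([[1., 2., 5., 6.],
                 [3., 4., 7., 8.],
                 [9., 8., 3., 2.],
                 [7., 6., 1., 0.]])
pooled_max = pool2d(fmap, 2, "max")   # each 2x2 tile keeps its maximum
```

Each 2 × 2 tile of the 4 × 4 map collapses to a single value, so the map shrinks to 2 × 2 while the dominant value of every region is retained.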

The main shortfall of the pooling layer is that it sometimes decreases the overall CNN performance: this layer helps the CNN determine whether or not a certain feature is present in the particular input image, but not exactly where that feature is located. Thus, the CNN model may miss relevant spatial information.

Activation Function (Non-Linearity): Mapping the input to the output is the core function of all types of activation function in all types of neural network. The input value is determined by computing the weighted summation of the neuron's inputs along with its bias (if present). The activation function then decides whether or not to fire the neuron with reference to a particular input by creating the corresponding output.

Non-linear activation layers are employed after all layers with weights (so-called learnable layers, such as FC layers and convolutional layers) in the CNN architecture. This non-linear behavior of the activation layers means that the mapping of input to output is non-linear; moreover, these layers give the CNN the ability to learn more complex mappings. The activation function must also be differentiable, an extremely significant property, as it allows error back-propagation to be used to train the network. The following activation functions are the most commonly used in CNNs and other deep neural networks.

Sigmoid: The input of this activation function is real numbers, while the output is restricted to between zero and one. The sigmoid function curve is S-shaped and can be represented mathematically by Eq. 2 : \(f(x) = \frac{1}{1+e^{-x}}\) .

Tanh: This function is similar to the sigmoid, as its input is real numbers, but its output is restricted to between − 1 and 1. Its mathematical representation is Eq. 3 : \(f(x) = \frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}\) .

ReLU: The most commonly used function in the CNN context. It maps all negative input values to zero while leaving positive values unchanged. Lower computational load is the main benefit of ReLU over the others. Its mathematical representation is Eq. 4 : \(f(x) = \max(0, x)\) .

Occasionally, a few significant issues may occur during the use of ReLU. For instance, consider an error back-propagation algorithm with a large gradient flowing through it: passing this gradient through the ReLU function may update the weights in such a way that the neuron is never activated again. This issue is referred to as the “Dying ReLU” problem. Some ReLU alternatives exist to solve such issues; the following discusses some of them.

Leaky ReLU: Instead of setting negative inputs to zero as ReLU does, this activation function down-scales them, ensuring these inputs are never entirely ignored. It is employed to solve the Dying ReLU problem. Leaky ReLU can be represented mathematically as in Eq. 5 : \(f(x) = \max(mx, x)\) .

Note that the leak factor is denoted by m. It is commonly set to a very small value, such as 0.001.

Noisy ReLU: This function employs a Gaussian distribution to make the ReLU noisy. It can be represented mathematically as in Eq. 6 : \(f(x) = \max(0, x + Y)\) , with \(Y \sim \mathcal{N}(0, \sigma^{2})\) .

Parametric Linear Units: This is mostly the same as Leaky ReLU; the main difference is that the leak factor is updated during the model training process. The parametric linear unit can be represented mathematically as in Eq. 7 : \(f(x) = \max(ax, x)\) .

Note that the learnable weight is denoted as a.
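The activation functions above (Eqs. 2–7) can be sketched in a few lines of NumPy; the function names are assumptions, and the stochastic noisy ReLU is omitted for reproducibility:

```python
import numpy as np

def sigmoid(x):              # Eq. 2: squashes real inputs into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):                 # Eq. 3: squashes real inputs into (-1, 1)
    return np.tanh(x)

def relu(x):                 # Eq. 4: zero for negatives, identity otherwise
    return np.maximum(0.0, x)

def leaky_relu(x, m=0.001):  # Eq. 5: small slope m keeps negatives "alive"
    return np.where(x > 0, x, m * x)

def prelu(x, a):             # Eq. 7: like leaky ReLU, but a is learned
    return np.where(x > 0, x, a * x)

x = np.array([-2.0, 0.0, 3.0])
```

Evaluating the functions on a few points (e.g., sigmoid(0) = 0.5, relu(−2) = 0, leaky_relu(−2) = −0.002) makes the Dying-ReLU discussion concrete: ReLU's zero output for all negative inputs is exactly what the leaky and parametric variants avoid.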

Fully Connected Layer: This layer is commonly located at the end of each CNN architecture. Inside it, each neuron is connected to all neurons of the previous layer, the so-called fully connected (FC) approach, and it is utilized as the CNN classifier. It follows the basic method of the conventional multi-layer perceptron neural network, as it is a type of feed-forward ANN. The input of the FC layer comes from the last pooling or convolutional layer, in the form of a vector created by flattening the feature maps. The output of the FC layer represents the final CNN output, as illustrated in Fig.  10 .

Figure 10. Fully connected layer

Loss Functions: The previous sections have presented the various layer types of the CNN architecture. The final classification is produced by the output layer, the last layer of the CNN architecture, where loss functions are utilized to calculate the prediction error over the training samples. This error reveals the difference between the actual output and the predicted one; it is then optimized through the CNN learning process.

The loss function uses two parameters to calculate the error: the CNN's estimated output (referred to as the prediction) and the actual output (referred to as the label). Several types of loss function are employed for various problem types; the following concisely explains some of them.

Cross-Entropy or Softmax Loss Function: This function is commonly employed for measuring CNN model performance and is also referred to as the log loss function. Its output is a probability \(p \in [0, 1]\) . In addition, it is usually employed as a replacement for the square error loss function in multi-class classification problems. In the output layer, it employs softmax activations to generate the output as a probability distribution. The mathematical representation of the output class probability is Eq. 8 : \(p_{i} = \frac{e^{a_{i}}}{\sum_{k=1}^{N} e^{a_{k}}}\) .

Here, \(e^{a_{i}}\) represents the non-normalized output from the preceding layer, while N represents the number of neurons in the output layer. Finally, the mathematical representation of the cross-entropy loss function is Eq. 9 : \(H(p, y) = -\sum _{i} y_{i} \log (p_{i})\) , where \(y_{i}\) is the true (one-hot) label.

Euclidean Loss Function: This function is widely used in regression problems and is also known as the mean square error. The mathematical expression of the estimated Euclidean loss is Eq. 10 : \(H(p, y) = \frac{1}{2N}\sum _{i=1}^{N}(p_{i}-y_{i})^{2}\) .

Hinge Loss Function: This function is commonly employed in problems related to binary classification, specifically maximum-margin-based classification. It is most important for SVMs, which use the hinge loss function, wherein the optimizer attempts to maximize the margin around the two target classes. Its mathematical formula is Eq. 11 : \(H(p, y) = \sum _{i=1}^{N}\max (0,\, m - (2y_{i}-1)\,p_{i})\) .

The margin m is commonly set to 1. Moreover, the predicted output is denoted as \(p_{i}\) , while the desired output is denoted as \(y_{i}\) .
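The loss functions above can be sketched in NumPy. The helper names are assumptions; the softmax uses the common max-shift for numerical stability, and the hinge helper maps {0, 1} labels to {−1, +1} as in Eq. 11:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())           # shift for numerical stability
    return e / e.sum()                # Eq. 8: class probabilities

def cross_entropy(p, y):
    return -np.sum(y * np.log(p))     # Eq. 9: log loss, y is one-hot

def euclidean_loss(p, y):
    return 0.5 * np.mean((p - y) ** 2)   # Eq. 10: mean square error

def hinge_loss(p, y, m=1.0):
    # Eq. 11: binary labels y in {0, 1} are mapped to {-1, +1}
    return np.sum(np.maximum(0.0, m - (2 * y - 1) * p))

logits = np.array([2.0, 1.0, 0.1])    # non-normalized outputs a_i
p = softmax(logits)
y = np.array([1.0, 0.0, 0.0])         # the true class is the first one
loss = cross_entropy(p, y)
```

With these toy logits, softmax assigns the highest probability to the first class, so the cross-entropy against the matching one-hot label is small but positive; a wrong prediction would drive it up sharply.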

Regularization to CNN

For CNN models, over-fitting represents the central obstacle to obtaining well-behaved generalization. A model is said to be over-fitted when it performs especially well on training data but fails on test data (unseen data); this is explained further in a later section. An under-fitted model is the opposite: this case occurs when the model does not learn a sufficient amount from the training data. The model is referred to as “just-fitted” if it performs well on both training and testing data. These three types are illustrated in Fig.  11 . Various intuitive concepts are used by regularization to help avoid over-fitting; more details about over-fitting and under-fitting are discussed in later sections.

Dropout: This is a widely utilized technique for generalization. During each training epoch, neurons are randomly dropped. In doing this, the feature-selection power is distributed equally across the whole group of neurons, and the model is forced to learn different independent features. During the training process, a dropped neuron takes no part in back-propagation or forward-propagation. By contrast, the full-scale network is used to perform prediction during the testing process.
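A minimal sketch of "inverted" dropout, one common way to implement this behavior (the scaling at training time is what allows the full, unscaled network to be used at test time; the function name and signature are illustrative):

```python
import numpy as np

def dropout_forward(x, p_drop, training, rng):
    """Inverted dropout: randomly zero activations with probability p_drop
    during training, scaling the survivors by 1/(1 - p_drop) so that the
    expected activation is unchanged and test time needs no rescaling."""
    if not training or p_drop == 0.0:
        return x  # the full network is used for prediction
    mask = (rng.random(x.shape) >= p_drop) / (1.0 - p_drop)
    return x * mask
```

Each training pass thus sees a different thinned sub-network, while evaluation simply passes the activations through untouched.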

Drop-Weights: This method is highly similar to dropout. In each training epoch, the connections between neurons (weights) are dropped rather than dropping the neurons; this represents the only difference between drop-weights and dropout.

Data Augmentation: Training the model on a sizeable amount of data is the easiest way to avoid over-fitting. To achieve this, data augmentation is used. Several techniques are utilized to artificially expand the size of the training dataset. More details can be found in a later section, which describes the data augmentation techniques.

Batch Normalization: This method normalizes the output activations so that they follow a unit Gaussian distribution [81]. Subtracting the mean and dividing by the standard deviation normalizes the output at each layer. While it is possible to consider this a pre-processing task at each layer of the network, the operation is also differentiable, so it can be integrated with the rest of the network and trained jointly. In addition, it is employed to reduce the "internal covariance shift" of the activation layers, i.e., the variation in the activation distribution at each layer. This shift becomes very high due to the continuous weight updating through training, which may occur if the samples of the training data are gathered from numerous dissimilar sources (for example, day and night images). Thus, the model will consume extra time for convergence, and in turn, the time required for training will also increase. To resolve this issue, a layer representing the operation of batch normalization is applied in the CNN architecture.

The advantages of utilizing batch normalization are as follows:

It prevents the problem of vanishing gradient from arising.

It can effectively control the poor weight initialization.

It significantly reduces the time required for network convergence (for large-scale datasets, this will be extremely useful).

It reduces the training dependence on the choice of hyper-parameters.

Chances of over-fitting are reduced, since it has a slight regularization effect.
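The normalization step described above can be sketched as a training-time forward pass (running statistics for test time and the backward pass are omitted; names are illustrative):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch (zero mean, unit variance),
    then apply the learnable scale gamma and shift beta.
    x: (batch, features); gamma, beta: (features,) parameters."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)   # unit-Gaussian activations
    return gamma * x_hat + beta
```

With gamma = 1 and beta = 0 the layer outputs have (approximately) zero mean and unit variance regardless of the input distribution, which is exactly the effect the text describes.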

figure 11

Over-fitting and under-fitting issues

Optimizer selection

This section discusses the CNN learning process. Two major issues are included in the learning process: the first is the selection of the learning algorithm (optimizer), while the second is the use of several enhancements (such as AdaDelta, Adagrad, and momentum) along with the learning algorithm to improve the output.

The core purpose of all supervised learning algorithms is to minimize a loss function, which depends on numerous learnable parameters (e.g., biases, weights, etc.), i.e., to minimize the error (the variation between the actual and predicted output). Gradient-based learning techniques are the usual choice for a CNN network. The network parameters are updated through all training epochs, and in each training epoch the network searches for the locally optimized solution in order to minimize the error.

The learning rate is defined as the step size of the parameter updating. A training epoch represents one complete pass of the parameter-update process over the entire training dataset. Note that the learning rate is a hyper-parameter and needs to be selected wisely so that it does not adversely influence the learning process.

Gradient Descent or Gradient-based learning algorithm: To minimize the training error, this algorithm repetitively updates the network parameters through every training epoch. More specifically, to update the parameters correctly, it needs to compute the objective function gradient (slope) by applying a first-order derivative with respect to the network parameters. Next, the parameters are updated in the reverse direction of the gradient to reduce the error. The parameter updating process is performed through network back-propagation, in which the gradient at every neuron is back-propagated to all neurons in the preceding layer. The mathematical representation of this operation is given in Eq. 12.
\( w_{ij}^{t}=w_{ij}^{t-1}-\eta \, \frac{\partial E}{\partial w_{ij}} \)  (12)
The updated weight in the current training epoch is denoted by \(w_{ij}^{t}\), while the weight in the preceding \((t-1)\) training epoch is denoted by \(w_{ij}^{t-1}\). The learning rate is \(\eta \) and the prediction error is E. Several variants of the gradient-based learning algorithm are available and commonly employed; these include the following:

Batch Gradient Descent: With this technique [82], the network parameters are updated only once, after the whole training dataset has been passed through the network. In more depth, it calculates the gradient over the entire training set and subsequently uses this gradient to update the parameters. For a small-sized dataset, the CNN model converges faster and creates an extra-stable gradient using BGD. Since the parameters are changed only once per training epoch, it requires a substantial amount of resources. By contrast, for a large training dataset, additional time is required for convergence, and it may converge to a local optimum (for non-convex instances).

Stochastic Gradient Descent: In this technique [83], the parameters are updated after each training sample. It is preferable to randomly shuffle the training samples in every epoch before training. For a large-sized training dataset, this technique is both more memory-effective and much faster than BGD. However, because it updates frequently, it takes extremely noisy steps toward the solution, which in turn makes the convergence behavior highly unstable.

Mini-batch Gradient Descent: In this approach, the training samples are partitioned into several mini-batches, in which every mini-batch can be considered an under-sized collection of samples with no overlap between them [84]. Next, parameter updating is performed following gradient computation on every mini-batch. The advantage of this method comes from combining the advantages of both the BGD and SGD techniques. Thus, it has steadier convergence, more computational efficiency, and extra memory effectiveness. The following describes several enhancement techniques in gradient-based learning algorithms (usually in SGD), which further powerfully enhance the CNN training process.
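Before turning to those enhancements, the three variants can be sketched on a toy linear least-squares problem (all names here are illustrative, not from a specific library; batch_size = len(X) recovers BGD and batch_size = 1 recovers SGD):

```python
import numpy as np

def minibatch_gd(X, y, lr=0.05, batch_size=4, epochs=300, seed=0):
    """Fit w for the linear model y ~ X @ w with mini-batch gradient descent
    on the mean squared error."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = rng.permutation(len(X))           # shuffle each epoch
        for start in range(0, len(X), batch_size):
            b = idx[start:start + batch_size]
            grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)
            w -= lr * grad                      # step against the gradient
    return w
```

On noise-free data the iterates contract to the true weights; with batch_size = 1 the same loop exhibits the noisy per-sample steps described for SGD.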

Momentum: For neural networks, this technique is employed in the objective function. It enhances both the accuracy and the training speed by adding the gradient computed at the preceding training step, weighted by a factor \(\lambda \) (known as the momentum factor). A plain gradient-based learning algorithm, however, can simply become stuck in a local minimum rather than the global minimum; this represents its main disadvantage. Issues of this kind frequently occur if the problem has a non-convex surface (or solution space).

Together with the learning algorithm, momentum is used to solve this issue, which can be expressed mathematically as in Eq. 13 .
\( \Delta w_{ij}^{t}=-\eta \, \frac{\partial E}{\partial w_{ij}}+\lambda \, \Delta w_{ij}^{t-1} \)  (13)
The weight increment in the current \(t\)th training epoch is denoted as \(\Delta w_{ij}^{t}\), \(\eta \) is the learning rate, and \(\Delta w_{ij}^{t-1}\) is the weight increment in the preceding \((t-1)\)th training epoch. The momentum factor value is maintained within the range 0 to 1; in turn, the step size of the weight updating increases in the direction of the minimum, which reduces the error. When the momentum factor value is very low, the model loses its ability to escape local minima. By contrast, as the momentum factor value becomes high, the model converges much more rapidly. If a high value of the momentum factor is used together with a high learning rate, the model could miss the global minimum by crossing over it.

However, when the gradient continually varies its direction throughout the training process, a suitable value of the momentum factor (which is a hyper-parameter) smooths out the weight-update variations.
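One momentum update in the spirit of Eq. 13 can be sketched as follows (an illustrative helper of our own, using the convention that the increment is added to the weight):

```python
def momentum_step(w, grad, velocity, lr=0.1, lam=0.5):
    """One momentum update: the new increment combines the current gradient
    step with the previous increment weighted by the momentum factor lam."""
    velocity = -lr * grad + lam * velocity
    return w + velocity, velocity
```

Applied repeatedly to a simple quadratic, the accumulated velocity carries the iterate through small oscillations toward the minimum.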

Adaptive Moment Estimation (Adam): This is another widely used optimization technique or learning algorithm. Adam [85] represents the latest trend in deep learning optimization. Unlike second-order methods, which require the Hessian matrix of second-order derivatives, Adam relies only on first-order gradient information. It is a learning strategy that has been designed specifically for training deep neural networks. Greater memory efficiency and lower computational requirements are two of its advantages. The mechanism of Adam is to calculate an adaptive learning rate for each parameter in the model. It integrates the pros of both momentum and RMSprop: it utilizes the squared gradients to scale the learning rate, as in RMSprop, and it resembles momentum in using a moving average of the gradient. The update rule of Adam is represented in Eq. 14.
\( m_{t}=\beta _{1} m_{t-1}+\left( 1-\beta _{1}\right) g_{t}, \quad v_{t}=\beta _{2} v_{t-1}+\left( 1-\beta _{2}\right) g_{t}^{2}, \quad w_{t}=w_{t-1}-\eta \, \frac{\hat{m}_{t}}{\sqrt{\hat{v}_{t}}+\epsilon } \)  (14)

where \(g_{t}\) is the gradient, and \(\hat{m}_{t}=m_{t}/(1-\beta _{1}^{t})\) and \(\hat{v}_{t}=v_{t}/(1-\beta _{2}^{t})\) are the bias-corrected first- and second-moment estimates.
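A minimal sketch of one Adam step under these equations (our own helper, not the reference implementation of [85]; the default constants follow the common choices beta1 = 0.9, beta2 = 0.999):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: moving averages of the gradient (m) and its
    square (v), bias-corrected, set a per-parameter step size."""
    m = beta1 * m + (1 - beta1) * grad        # first moment (momentum-like)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment (RMSprop-like)
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

Note how the step size is roughly lr regardless of the raw gradient magnitude, which is the adaptive behavior described above.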
Design of algorithms (backpropagation)

Let’s start with a notation that refers to weights in the network unambiguously. We denote by \({\varvec{w}}_{ij}^{h}\) the weight of the connection from the \(i\)th input (or the \(i\)th neuron in layer \(h-1\)) to the \(j\)th neuron in the \(h\)th layer. So, Fig. 12 shows the weight on a connection from a neuron in the first layer to a neuron in the next layer of the network.

figure 12

MLP structure

Here, \(w_{11}^{2}\) represents the weight from the first neuron in the first layer to the first neuron in the second layer. Accordingly, the second weight arriving at the same neuron is \(w_{21}^{2}\), i.e., the weight coming from the second neuron in the previous layer to the first neuron in the next layer (the second layer in this net). Regarding the bias: since a bias is not a connection between neurons of two layers, it is handled separately; each neuron has its own bias (in some networks, each layer has a single bias). It can be seen from the above net that each layer has its own bias. Each network is described by parameters such as the number of layers, the number of neurons in each layer, and the number of weights (connections) between the layers. The number of connections can easily be determined from the number of neurons in each layer; for example, if ten inputs are fully connected to two neurons in the next layer, then the number of connections between them is \(10 \times 2 = 20\) (connections, i.e., weights). To show how the error is defined and how the weights are updated, we will imagine that there are two layers in our neural network,

where \(\text {d}\) is the label of the individual \(i\)th input and \(\text {y}\) is the output for the same individual input. Backpropagation is about understanding how to change the weights and biases in a network based on the changes of the cost function (error). Ultimately, this means computing the partial derivatives \(\partial \text {E} / \partial \text {w}_{\text {ij}}^{h}\) and \(\partial \text {E} / \partial \text {b}_{\text {j}}^{h}.\) To compute those, a local variable \(\delta _{j}^{h}\) is introduced, called the local error of the \(j\)th neuron in the \(h\)th layer. Based on that local error, backpropagation gives the procedure to compute \(\partial \text {E} / \partial \text {w}_{\text {ij}}^{h}\) and \(\partial \text {E} / \partial \text {b}_{\text {j}}^{h}\) for the two-layer neural network shown in Fig. 13.

figure 13

Neuron activation functions

The output error \(\delta _{j}(k)\) is computed for each \(j = 1, \ldots , L\), where \(L\) is the number of neurons in the output layer:

\( \delta _{j}(k)=e_{j}(k)\, \vartheta ^{\prime }\left( v_{j}(k)\right) \)  (15)
where \(\text {e}(\text {k})\) is the error of epoch \(\text {k}\), as shown in Eq. ( 2 ), and \(\varvec{\vartheta }^{\prime }\left( {\varvec{v}}_{j}({\varvec{k}})\right) \) is the derivative of the activation function for \(v_{j}\) at the output.

The error is then backpropagated through all the remaining layers (all layers except the output):

\( \delta _{j}^{h}(k)=\vartheta ^{\prime }\left( v_{j}^{h}(k)\right) \sum _{l} \delta _{l}^{h+1}(k)\, w_{jl}^{h+1}(k) \)  (16)
where \(\delta _{l}^{h+1}(k)\) is the error of the following layer and \(w_{jl}^{h+1}(k)\) represents the weights of the layer just after the one whose error is being computed.

After finding the error at each neuron in each layer, we can now update the weights in each layer based on Eqs. ( 16 ) and ( 17 ).
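The whole procedure (forward pass, local errors, weight updates) can be traced on a toy network with one hidden and one output sigmoid neuron (biases are omitted for brevity; all numeric values are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_two_layer(x, d, w1, w2, eta=0.5, epochs=1000):
    """Toy network: input -> one hidden sigmoid neuron -> one output
    sigmoid neuron. Returns trained weights and the final output."""
    for _ in range(epochs):
        v1 = w1 * x;  y1 = sigmoid(v1)            # forward pass
        v2 = w2 * y1; y = sigmoid(v2)
        e = d - y                                  # output error
        delta2 = e * y * (1.0 - y)                 # Eq. (15)-style local error
        delta1 = y1 * (1.0 - y1) * w2 * delta2     # Eq. (16): backpropagate
        w2 += eta * delta2 * y1                    # weight updates
        w1 += eta * delta1 * x
    return w1, w2, y
```

After a number of epochs, the output moves toward the label d, driven only by the locally computed errors.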

Improving performance of CNN

Based on our experiments in different DL applications [86, 87, 88], we can conclude that the most effective solutions for improving the performance of CNN are:

Expand the dataset with data augmentation or use transfer learning (explained in later sections).

Increase the training time.

Increase the depth (or width) of the model.

Add regularization.

Tune the hyperparameters more extensively.

CNN architectures

Over the last 10 years, several CNN architectures have been presented [21, 26]. Model architecture is a critical factor in improving the performance of different applications. Various modifications have been made to the CNN architecture from 1989 until today. Such modifications include structural reformulation, regularization, parameter optimization, etc. Conversely, it should be noted that the key upgrades in CNN performance occurred largely due to the reorganization of processing units, as well as the development of novel blocks. In particular, the most novel developments in CNN architectures concern the use of network depth. In this section, we review the most popular CNN architectures, beginning with the AlexNet model in 2012 and ending with the High-Resolution (HR) model in 2020. Studying the features of these architectures (such as input size, depth, and robustness) is the key to helping researchers choose the suitable architecture for their target task. Table 2 presents a brief overview of CNN architectures.

The history of deep CNNs began with the appearance of LeNet [89] (Fig. 14). At that time, CNNs were restricted to handwritten digit recognition tasks, which could not be scaled to all image classes. In deep CNN architecture, AlexNet is highly respected [30], as it achieved innovative results in the fields of image recognition and classification. Krizhevsky et al. [30] first proposed AlexNet and consequently improved the CNN learning ability by increasing its depth and implementing several parameter optimization strategies. Figure 15 illustrates the basic design of the AlexNet architecture.

figure 14

The architecture of LeNet

figure 15

The architecture of AlexNet

The learning ability of the deep CNN was limited at this time due to hardware restrictions. To overcome these hardware limitations, two GPUs (NVIDIA GTX 580) were used in parallel to train AlexNet. Moreover, in order to enhance the applicability of the CNN to different image categories, the number of feature extraction stages was increased from five in LeNet to seven in AlexNet. Although depth enhances generalization for several image resolutions, overfitting was in fact the main drawback of this depth. Krizhevsky et al. used Hinton's idea to address this problem [90, 91]. To ensure that the features learned by the algorithm were extra robust, their algorithm randomly skips several transformational units throughout the training stage. Moreover, ReLU [92] was utilized as a non-saturating activation function, which reduces the vanishing gradient problem and enhances the rate of convergence [93]. Local response normalization and overlapping subsampling were also performed to enhance the generalization by decreasing the overfitting. Further modifications over previous networks included the use of large-size filters \((5\times 5 \; \text{and}\; 11 \times 11)\) in the earlier layers. AlexNet has considerable significance for the recent CNN generations, as well as beginning an innovative research era in CNN applications.

Network-in-network

This network model, which has some slight differences from the preceding models, introduced two innovative concepts [94]. The first was employing multilayer perceptron convolution. These convolutions are executed using a 1×1 filter, which supports the addition of extra nonlinearity to the network. Moreover, this supports enlarging the network depth, which may later be regularized using dropout. For DL models, this idea is frequently employed in the bottleneck layer. The second novel concept is the use of global average pooling (GAP) as a substitute for the fully connected (FC) layer, which enables a significant reduction in the number of model parameters. In addition, GAP considerably simplifies the network architecture. Applying GAP to a large feature map generates a final low-dimensional feature vector without reducing the dimensionality of the feature maps [95, 96]. Figure 16 shows the structure of the network.

figure 16

The architecture of network-in-network
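The GAP operation described above reduces each feature map to a single value and needs no parameters; a sketch (shapes are illustrative):

```python
import numpy as np

def global_average_pooling(feature_maps):
    """Collapse each (H, W) feature map to its mean, giving one value per
    channel: a low-dimensional vector that can replace an FC layer's input.
    feature_maps: array of shape (channels, H, W)."""
    return feature_maps.mean(axis=(1, 2))
```

Because the output size equals the number of channels regardless of H and W, no weights are learned here, which is the source of the parameter savings.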

ZefNet

Before 2013, the CNN learning mechanism was basically constructed on a trial-and-error basis, which precluded an understanding of the precise purpose of each enhancement. This issue restricted deep CNN performance on convoluted images. In response, Zeiler and Fergus introduced DeconvNet (a multilayer de-convolutional neural network) in 2013 [97]. This method later became known as ZefNet, which was developed in order to quantitatively visualize the network. Monitoring the CNN performance via understanding the neuron activations was the purpose of the network activity visualization. However, Erhan et al. utilized this exact concept to optimize deep belief network (DBN) performance by visualizing the features of the hidden layers [98]. Moreover, Le et al. assessed the deep unsupervised auto-encoder (AE) performance by visualizing the created classes of the image using the output neurons [99]. By reversing the operation order of the convolutional and pooling layers, DeconvNet operates like a forward-pass CNN. Reverse mapping of this kind launches the convolutional layer output backward to create visually observable image shapes that accordingly give the neural interpretation of the internal feature representation learned at each layer [100]. Monitoring the learning schematic through the training stage was the key concept underlying ZefNet. In addition, it utilized the outcomes to identify capability issues associated with the model. This concept was experimentally proven on AlexNet by applying DeconvNet. This indicated that only certain neurons were working, while the others were out of action in the first two layers of the network. Furthermore, it indicated that the features extracted via the second layer contained aliasing artifacts. Thus, Zeiler and Fergus changed the CNN topology due to the existence of these outcomes.
In addition, they executed parameter optimization, and also exploited the CNN learning by decreasing the stride and the filter sizes in order to retain all features of the initial two convolutional layers. An improvement in performance was accordingly achieved due to this rearrangement in CNN topology. This rearrangement proposed that the visualization of the features could be employed to identify design weaknesses and conduct appropriate parameter alteration. Figure  17 shows the structure of the network.

figure 17

The architecture of ZefNet

Visual geometry group (VGG)

After CNN was determined to be effective in the field of image recognition, an easy and efficient design principle for CNN was proposed by Simonyan and Zisserman. This innovative design was called Visual Geometry Group (VGG). A multilayer model [101] with nineteen layers, it was deeper than ZefNet [97] and AlexNet [30], in order to examine the relation of network representational capacity to depth. Conversely, in the 2013-ILSVRC competition, ZefNet was the frontier network, which proposed that filters with small sizes could enhance CNN performance. With reference to these results, VGG inserted stacks of \(3\times 3\) filters in place of the \(5\times 5\) and 11 × 11 filters of ZefNet. It showed experimentally that the sequential stacking of these small-size filters could produce the same effect as the large-size filters. In other words, these small-size filters made the receptive field similarly efficient to the large-size filters \((7 \times 7 \; \text{and}\; 5 \times 5)\) . By decreasing the number of parameters, an extra advantage of reduced computational complexity was achieved by using small-size filters. These outcomes established a novel research trend of working with small-size filters in CNN. In addition, by inserting \(1\times 1\) convolutions in the middle of the convolutional layers, VGG regulates the network complexity. It learns a linear combination of the subsequent feature maps. With respect to network tuning, a max pooling layer [102] is inserted following the convolutional layer, while padding is implemented to maintain the spatial resolution. In general, VGG obtained significant results for localization problems and image classification. While it did not achieve first place in the 2014-ILSVRC competition, it acquired a reputation due to its enlarged depth, homogeneous topology, and simplicity.
However, VGG’s computational cost was excessive due to its use of around 140 million parameters, which represented its main shortcoming. Figure 18 shows the structure of the network.

figure 18

The architecture of VGG
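The receptive-field claim above is easy to verify: for stride-1 convolutions, each stacked layer adds (k − 1) to the effective field seen by one output unit. A small helper (the function is ours, for illustration):

```python
def stacked_receptive_field(num_layers, kernel_size=3):
    """Effective receptive field of num_layers stacked stride-1 conv layers:
    each layer adds (kernel_size - 1) to the field of one output unit."""
    return num_layers * (kernel_size - 1) + 1
```

Two stacked 3×3 layers thus see a 5×5 region and three see a 7×7 region, matching the large filters they replace while using fewer parameters.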

GoogleNet

In the 2014-ILSVRC competition, GoogleNet (also called Inception-V1) emerged as the winner [103]. Achieving high-level accuracy with decreased computational cost is the core aim of the GoogleNet architecture. It proposed the novel concept of the inception block (module) in the CNN context, which combines multiple-scale convolutional transformations by employing merge, transform, and split functions for feature extraction. Figure 19 illustrates the inception block architecture. This architecture incorporates filters of different sizes ( \(5\times 5, 3\times 3, \; \text{and} \; 1\times 1\) ) to capture channel information together with spatial information at diverse ranges of spatial resolution. The common convolutional layer of GoogLeNet is substituted by small blocks, following the network-in-network (NIN) architecture [94], which replaced each layer with a micro-neural network. The GoogLeNet concepts of merge, transform, and split were utilized to address the diverse variants that can exist within the same class across several images. The motivation of GoogLeNet was to improve the efficiency of the CNN parameters, as well as to enhance the learning capacity. In addition, it regulates the computation by inserting a \(1\times 1\) convolutional filter, as a bottleneck layer, ahead of the large-size kernels. GoogleNet employed sparse connections to overcome the redundant-information problem, and decreased cost by neglecting the irrelevant channels. It should be noted here that only some of the input channels are connected to some of the output channels. By employing a GAP layer as the end layer, rather than an FC layer, the density of connections was decreased. The number of parameters was also significantly decreased, from 40 to 5 million, due to these parameter tunings. Additional regularization factors included the employment of RMSprop as optimizer and batch normalization [104].
Furthermore, GoogleNet proposed the idea of auxiliary learners to speed up the rate of convergence. Conversely, the main shortcoming of GoogleNet was its heterogeneous topology, which requires adaptation from one module to another. Another shortcoming is the representation bottleneck, which substantially decreases the feature space in the following layer and in turn occasionally leads to the loss of valuable information.

figure 19

The basic structure of Google Block

Highway network

Increasing the network depth enhances its performance, mainly for complicated tasks. By contrast, the network training becomes difficult. The presence of several layers in deeper networks may result in small gradient values of the back-propagation of error at the lower layers. In 2015, Srivastava et al. [105] suggested a novel CNN architecture, called Highway Network, to overcome this issue. This approach is based on the cross-connectivity concept. The unhindered information flow in Highway Network is empowered by learning two gating units inside the layer. The gate mechanism concept was motivated by LSTM-based RNN [106, 107]. The information aggregation was conducted by merging the information of earlier layers with that of the current layer to generate a regularization impact, which makes the gradient-based training of the deeper network very simple. This empowers the training of networks with more than 100 layers, even a deeper network of 900 layers, with the SGD algorithm. A Highway Network with a depth of fifty layers presented an improved rate of convergence, better than that of architectures that are thin and deep at the same time [108]. By contrast, [69] empirically demonstrated that plain network performance declines when more than ten hidden layers are inserted. It should be noted that even a Highway Network 900 layers in depth converges much more rapidly than the plain network.

ResNet

He et al. [37] developed ResNet (Residual Network), which was the winner of ILSVRC 2015. Their objective was to design an ultra-deep network free of the vanishing gradient issue, as compared to the previous networks. Several types of ResNet were developed based on the number of layers (starting with 34 layers and going up to 1202 layers). The most common type was ResNet50, which comprised 49 convolutional layers plus a single FC layer. The overall number of network weights was 25.5 M, while the overall number of MACs was 3.9 G. The novel idea of ResNet is its use of the bypass pathway concept, as employed in Highway Nets, to address the problem of training a deeper network in 2015. This is illustrated in Fig. 20, which contains the fundamental ResNet block diagram: a conventional feedforward network plus a residual connection. The output of the preceding layer is \(x_{l-1}\). After executing different operations [such as convolution using variable-size filters, or batch normalization, before applying an activation function like ReLU on \(x_{l-1}\)], the branch output is \(F(x_{l-1})\). The final residual output is \(x_{l}\), which can be mathematically represented as in Eq. 18.
\( x_{l}=F\left( x_{l-1}\right) +x_{l-1} \)  (18)
There are numerous basic residual blocks included in the residual network. Based on the type of the residual network architecture, operations in the residual block are also changed [ 37 ].

figure 20

The block diagram for ResNet
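Equation 18 can be sketched with a fully connected layer standing in for the block's convolutional operations (names and shapes are illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def residual_block(x, w):
    """Minimal fully-connected analogue of Eq. 18: the block's
    transformation F(x) is added to the identity shortcut x."""
    f_x = relu(x @ w)          # stand-in for conv + batch norm + ReLU
    return f_x + x             # x_l = F(x_{l-1}) + x_{l-1}
```

Even when the learned transformation contributes nothing (all-zero weights), the identity shortcut passes the input through unchanged, which is why gradients propagate easily across many such blocks.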

In comparison to the highway network, ResNet presented shortcut connections inside layers to enable cross-layer connectivity; these connections are parameter-free and data-independent. Note that the layers characterize non-residual functions when a gated shortcut is closed in the highway network. By contrast, in ResNet, the identity shortcuts are never closed and the residual information is permanently passed. Furthermore, ResNet has the potential to prevent the vanishing gradient problem, as the shortcut connections (residual links) accelerate the deep network convergence. ResNet won the 2015-ILSVRC championship with 152 layers of depth; this represents 8 times the depth of VGG and 20 times the depth of AlexNet. In comparison with VGG, it has lower computational complexity, even with enlarged depth.

Inception: ResNet and Inception-V3/4

Szegedy et al. [103, 109, 110] proposed Inception-ResNet and Inception-V3/4 as upgraded versions of Inception-V1/2. The concept behind Inception-V3 was to minimize the computational cost with no effect on the deeper network's generalization. Thus, Szegedy et al. used asymmetric small-size filters ( \(1\times 5\) and \(1\times 7\) ) rather than large-size filters ( \( 7\times 7\) and \(5\times 5\) ); moreover, they utilized a bottleneck of \(1\times 1\) convolution prior to the large-size filters [110]. These changes make the operation of the traditional convolution very similar to cross-channel correlation. Previously, Lin et al. utilized the 1 × 1 filter potential in NIN architecture [94]. Subsequently, [110] utilized the same idea in an intelligent manner. By using the \(1\times 1\) convolutional operation in Inception-V3, the input data are mapped into three or four isolated spaces, which are smaller than the initial input spaces. Next, all of these correlations are mapped in these smaller spaces through common \(5\times 5\) or \(3\times 3\) convolutions. By contrast, in Inception-ResNet, Szegedy et al. brought together the inception block and the residual learning power by replacing the filter concatenation with the residual connection [111]. Szegedy et al. empirically demonstrated that Inception-ResNet (Inception-V4 with residual connections) can achieve a generalization power similar to that of Inception-V4 with enlarged width and depth and without residual connections. Thus, it is clearly illustrated that using residual connections in training significantly accelerates Inception network training. Figure 21 shows the basic block diagram for the Inception residual unit.

figure 21

The basic block diagram for Inception Residual unit

DenseNet

To solve the problem of the vanishing gradient, DenseNet was presented, following the same direction as ResNet and the Highway network [105, 111, 112]. One of the drawbacks of ResNet is that it explicitly conserves information by means of additive identity transformations, so several layers may contribute extremely little or no information. In addition, ResNet has a large number of weights, since each layer has an isolated group of weights. DenseNet employed cross-layer connectivity in an improved approach to address this problem [112, 113, 114]. It connected each layer to all subsequent layers in the network using a feed-forward approach. Therefore, the feature maps of all preceding layers were employed as inputs into all of the following layers. In a traditional CNN with l layers, there are l connections (one between each layer and the next), while in DenseNet, there are \(\frac{l(l+1)}{2}\) direct connections. DenseNet demonstrates the influence of cross-layer depth-wise convolutions. The network thereby gains the ability to discriminate clearly between the added and the preserved information, since DenseNet concatenates the features of the preceding layers rather than adding them. However, due to its narrow layer structure, DenseNet becomes parametrically expensive as the number of feature maps increases. The direct access of all layers to the gradients via the loss function enhances the information flow all across the network. In addition, this includes a regularizing impact, which minimizes overfitting on tasks with smaller training sets. Figure 22 shows the architecture of the DenseNet network.

figure 22

The architecture of the DenseNet network (adapted from [112])
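The connection count and the concatenation-based feature reuse described above can be sketched with plain vectors standing in for feature maps (all names are illustrative):

```python
import numpy as np

def dense_connections(l):
    """Number of direct connections in a DenseNet with l layers: l(l+1)/2."""
    return l * (l + 1) // 2

def dense_block(x, weight_list):
    """Each layer receives the concatenation of all preceding outputs
    (rather than their sum, as in ResNet)."""
    features = [x]
    for w in weight_list:
        inp = np.concatenate(features)          # reuse every earlier output
        features.append(np.maximum(0.0, w @ inp))
    return np.concatenate(features)
```

Note how each layer's input width grows with depth; this is the source of both the feature reuse and the parameter cost discussed above.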

ResNeXt

ResNeXt is an enhanced version of the Inception network [115]. It is also known as the Aggregated Residual Transform Network. Cardinality, a new term presented by [115], utilizes the split, transform, and merge topology in an easy and effective way. It denotes the size of the transformation set as an extra dimension [116, 117, 118]. The Inception network not only manages network resources more efficiently but also enhances the learning ability of the conventional CNN. In its transformation branches, different spatial embeddings (employing e.g. \(5\times 5\) , \(3\times 3\) , and \(1\times 1\) filters) are used; thus, each layer must be customized separately. By contrast, ResNeXt derives its characteristic features from ResNet, VGG, and Inception. It employed the deep homogeneous topology of VGG and the basic architecture of GoogleNet by setting \(3\times 3\) filters as the spatial resolution inside the blocks of split, transform, and merge. Figure 23 shows the ResNeXt building blocks. ResNeXt utilized multi-transformations inside the blocks of split, transform, and merge, as well as outlining such transformations in cardinality terms. As Xie et al. showed, the performance is significantly improved by increasing the cardinality. The complexity of ResNeXt was regulated by employing \(1\times 1\) filters (low embeddings) ahead of a \(3\times 3\) convolution, while skip connections were used for optimized training [115].

figure 23

The basic block diagram for the ResNext building blocks
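The split-transform-merge idea with cardinality can be sketched as follows. This is a hedged NumPy toy in which dense matrices stand in for the 1×1, 3×3, 1×1 convolutions of a real ResNext block; C parallel branches share the same topology, are merged by summation, and the identity skip is added:

```python
import numpy as np

def resnext_unit(x, weights_in, weights_mid, weights_out):
    """Toy aggregated-residual unit: split the input into C parallel
    branches (C = cardinality), transform each in a low-dimensional
    embedding, merge by summation, then add the identity skip."""
    branches = []
    for w_in, w_mid, w_out in zip(weights_in, weights_mid, weights_out):
        h = x @ w_in                   # 1x1 stand-in: embed to low dimension
        h = np.maximum(h @ w_mid, 0)   # 3x3 stand-in (ReLU)
        branches.append(h @ w_out)     # 1x1 stand-in: back to input width
    return x + sum(branches)           # merge over cardinality + skip

rng = np.random.default_rng(1)
C, d, d_low = 4, 8, 2                  # cardinality 4, width 8, embedding width 2
w_in  = [rng.standard_normal((d, d_low)) for _ in range(C)]
w_mid = [rng.standard_normal((d_low, d_low)) for _ in range(C)]
w_out = [rng.standard_normal((d_low, d)) for _ in range(C)]
y = resnext_unit(np.ones((3, d)), w_in, w_mid, w_out)
print(y.shape)   # (3, 8): the unit preserves the feature width
```

Increasing `C` while shrinking `d_low` keeps the parameter budget roughly constant, which is the trade-off the cardinality dimension exposes.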

The feature reuse problem is the core shortcoming of deep residual networks: certain feature blocks or transformations contribute very little to learning. Zagoruyko and Komodakis [ 119 ] proposed WideResNet to address this problem. They argued that the residual units convey the main learning ability of deep residual networks, while depth has only a supplementary effect. WideResNet exploits the power of the residual blocks by making ResNet wider rather than deeper [ 37 ]. It enlarges the width by introducing an extra factor, k, which controls the network width; in other words, widening the layers proved a highly effective route to performance gains compared with deepening the residual network. Although deep residual networks achieve enhanced representational capacity, they also have certain drawbacks, such as exploding and vanishing gradients, the feature reuse problem (inactivation of many feature maps), and time-intensive training. He et al. [ 37 ] tackled the feature reuse problem by including a dropout in each residual block to regularize the network efficiently. Similarly, using dropouts, Huang et al. [ 120 ] presented the stochastic depth concept to address slow learning and vanishing gradients. Earlier research focused on increasing depth, so even a small performance gain required adding many new layers. An experimental study showed that WideResNet has twice the number of parameters of ResNet, yet it offers a more efficient training method than very deep networks [ 119 ]. Note that most architectures prior to residual networks (including the highly effective VGG and Inception) were wider than ResNet, and wider residual networks were established once this was recognized.
However, inserting a dropout between the convolutional layers (as opposed to within the residual block) made learning in WideResNet more effective [ 121 , 122 ].
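The role of the widening factor k can be made concrete with the channel widths of the three convolutional groups in WideResNet (values per Zagoruyko and Komodakis; k = 1 recovers the "thin" ResNet widths):

```python
def wideresnet_channels(k, base=(16, 32, 64)):
    """Channel widths of the three WRN conv groups for widening
    factor k; k = 1 corresponds to the original thin ResNet."""
    return tuple(c * k for c in base)

print(wideresnet_channels(1))    # (16, 32, 64)   -- plain ResNet widths
print(wideresnet_channels(10))   # (160, 320, 640) -- e.g. WRN-28-10
```

The number of parameters per block grows roughly quadratically in k, which is why a modest k combined with a shallower depth can match or exceed a much deeper thin ResNet.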

Pyramidal Net

As shown in previous deep CNN architectures such as ResNet, VGG, and AlexNet, the depth of the feature map increases in the succeeding layers due to the deep stacking of convolutional layers, while the spatial dimension shrinks because a sub-sampling follows each convolutional layer. The enriched feature representation is thus obtained at the cost of a reduced feature-map size. The extreme expansion in feature-map depth, together with the loss of spatial information, interferes with the learning ability of deep CNNs. ResNet obtained notable results on image classification; however, deleting a convolutional block in which both the number of channels and the spatial dimensions change (channel depth increases while spatial dimension decreases) commonly degrades classifier performance. Accordingly, the stochastic ResNet improved performance by decreasing the information loss that accompanies dropping a residual unit. Han et al. [ 123 ] proposed Pyramidal Net to address this learning-interference problem of ResNet. To counter the abrupt depth enlargement and extreme spatial-width reduction of ResNet, Pyramidal Net slowly enlarges the width of the residual units to cover the most feasible locations, rather than keeping the same spatial dimension within all residual blocks until down-sampling occurs. It is called Pyramidal Net because of this gradual, top-down increase in feature-map depth. The feature-map depth \(d_{l}\) is regulated by Eq. 19:

\(d_{l}=\left\lfloor d_{l-1}+\frac{\lambda }{n}\right\rfloor \quad \text {for } 2\le l\le n+1\)   (19)

Here, the dimension of the l th residual unit is indicated by \(d_{l}\) ; moreover, n indicates the overall number of residual units, the step factor is indicated by \(\lambda \) , and the depth increase is regulated by the factor \(\frac{\lambda }{n}\) , which uniformly distributes the increase in width across the feature-map dimensions. Zero-padded identity mapping is used to insert the residual connections among the layers; in comparison to projection-based shortcut connections, zero-padded identity mapping requires fewer parameters, which in turn leads to enhanced generalization [ 124 ]. Multiplication- and addition-based widening are two different approaches used in Pyramidal Nets for network widening: the first (multiplication) enlarges the width geometrically, while the second (addition) enlarges it linearly [ 92 ]. The main problem associated with width enlargement is the quadratic growth in the required time and space.
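The additive widening rule can be traced numerically. The sketch below (a hedged illustration assuming Han et al.'s additive rule with a starting depth of 16) prints the feature-map depths of successive residual units, growing linearly by the step factor spread over the n units:

```python
import math

def pyramidal_widths(n, lam, d1=16):
    """Additive PyramidalNet widening: the feature-map depth grows
    by lam/n at every residual unit instead of doubling only at
    down-sampling points."""
    d = [d1]
    for _ in range(2, n + 2):
        d.append(math.floor(d[-1] + lam / n))
    return d

w = pyramidal_widths(n=4, lam=8)   # step factor lam spread over n units
print(w)                           # [16, 18, 20, 22, 24]
```

The depths increase smoothly (a "pyramid") rather than jumping at stage boundaries, which is exactly the property that reduces the information loss when a residual unit is dropped.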

Xception

Extreme Inception architecture is the main characteristic of Xception, whose key idea is the depthwise separable convolution [ 125 ]. The Xception model adjusts the original Inception block by making it wider and replacing the different spatial dimensions with a single dimension ( \(3 \times 3\) ) followed by a \(1 \times 1\) convolution to reduce computational complexity. Figure  24 shows the Xception block architecture. The Xception network becomes computationally more efficient by decoupling spatial and cross-channel correspondence. It first maps the convolved output to a low-dimensional embedding by applying \(1 \times 1\) convolutions, and then performs k spatial transformations, where k is the width-defining cardinality, given by the number of transformations in Xception. The computations are simplified in Xception by convolving each channel separately along the spatial axes; these outputs are subsequently processed by \(1 \times 1\) (pointwise) convolutions that perform the cross-channel correspondence. The \(1 \times 1\) convolution is also used in Xception to regulate the channel depth. The convolutional operation in Xception uses a number of transformation segments equal to the number of channels; Inception, by comparison, uses three transformation segments, while the traditional CNN architecture uses only one. The Xception transformation approach achieves extra learning efficiency and better performance, but it does not reduce the number of parameters [ 126 , 127 ].

figure 24

The basic block diagram for the Xception block architecture
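The per-layer saving from a depthwise separable convolution is easy to quantify: a standard convolution couples the spatial and channel dimensions, while the separable form factorizes them. A small parameter-count comparison (example sizes are illustrative):

```python
def standard_conv_params(k, c_in, c_out):
    # one k x k x c_in filter per output channel
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    # depthwise: one k x k filter per input channel,
    # pointwise: a 1 x 1 convolution mixing the channels
    return k * k * c_in + c_in * c_out

std = standard_conv_params(3, 128, 128)   # 147456
sep = separable_conv_params(3, 128, 128)  # 1152 + 16384 = 17536
print(std, sep, round(std / sep, 1))      # roughly 8x fewer parameters
```

Xception reinvests this per-layer saving in a wider, deeper model, which is why its total parameter count is not reduced even though each layer is cheaper.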

Residual attention neural network

To improve network feature representation, Wang et al. [ 128 ] proposed the Residual Attention Network (RAN). The main purpose of incorporating attention into the CNN is to enable the network to learn object-aware features. The RAN is a feed-forward CNN consisting of stacked residual blocks together with an attention module. The attention module is divided into two branches, namely the mask branch and the trunk branch, which adopt top-down and bottom-up learning strategies, respectively. Encapsulating the two strategies in one attention model supports top-down attention feedback and fast feed-forward processing in a single feed-forward pass. More specifically, the top-down architecture generates dense features to make inferences about every aspect of the input, while the bottom-up feed-forward architecture generates low-resolution feature maps with strong semantic information. A similar top-down, bottom-up strategy was employed with Restricted Boltzmann Machines in previously proposed studies [ 129 ]. During the training reconstruction phase, Goh et al. [ 130 ] used the top-down attention mechanism in deep Boltzmann machines (DBMs) as a regularizing factor. Note that the network can similarly be optimized globally using a top-down learning strategy, where the maps progressively propagate from the output back to the input throughout the learning process [ 129 , 130 , 131 , 132 ].

A previous study [ 133 ] incorporated the attention concept into convolutional blocks in a simple way via transformation networks. Unfortunately, their main drawback is inflexibility, along with an inability to adapt to varying surroundings. By contrast, stacking multiple attention modules has made the RAN very effective at recognizing noisy, complex, and cluttered images. The RAN's hierarchical organization gives it the ability to adaptively allocate a weight to every feature map depending on its importance within the layers. Furthermore, incorporating three distinct levels of attention (spatial, channel, and mixed) enables the model to capture object-aware features at each of these levels.

Convolutional block attention module

The importance of feature-map utilization and the attention mechanism is confirmed by SE-Network and RAN [ 128 , 134 , 135 ]. The convolutional block attention module (CBAM), a novel attention-based CNN, was first developed by Woo et al. [ 136 ]. This module is similar in design to SE-Network and simple in structure. SE-Network disregards the spatial locality of the object in the image and considers only the channels' contributions during image classification; for object detection, however, the spatial location of the object plays a significant role. CBAM infers the attention maps sequentially: it applies channel attention before spatial attention to obtain the refined feature maps. Spatial attention is performed using \(1 \times 1\) convolution and pooling functions, as in the literature. An effective feature descriptor can be generated by pooling features along the spatial axis, and a robust spatial attention map is possible because CBAM concatenates the max-pooling and average-pooling operations. Similarly, a combination of GAP and max-pooling operations is used to model the feature-map statistics. Woo et al. [ 136 ] demonstrated that using GAP alone returns a sub-optimal inference of channel attention, whereas max pooling provides an indication of the distinguishing object features; thus, using both max pooling and average pooling enhances the network's representational power. The refined feature maps improve the representational power and facilitate a focus on the significant portions of the chosen features. As Woo et al. [ 136 ] experimentally proved, expressing the 3D attention maps through a serial learning procedure helps decrease the computational cost and the number of parameters. Note that CBAM can be simply integrated with any CNN architecture.
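The "channel attention first, spatial attention second" ordering can be sketched in NumPy. This is a hedged toy: the shared MLP is a single random matrix and the 7×7 convolution of the real spatial branch is replaced by a simple sum of the pooled maps, but the pooling structure (average + max along each axis) and the multiplicative refinement follow the description above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cbam(x, w_mlp):
    """Toy CBAM on a (H, W, C) feature map: channel attention from
    GAP + max pooling through a shared MLP, then spatial attention
    from channel-wise avg + max pooling, each applied multiplicatively."""
    # --- channel attention ---
    avg = x.mean(axis=(0, 1))                  # (C,) global average pool
    mx  = x.max(axis=(0, 1))                   # (C,) global max pool
    ca  = sigmoid(avg @ w_mlp + mx @ w_mlp)    # shared MLP weights
    x   = x * ca                               # channel-wise refinement
    # --- spatial attention ---
    avg_s = x.mean(axis=-1, keepdims=True)     # (H, W, 1)
    max_s = x.max(axis=-1, keepdims=True)
    sa = sigmoid(avg_s + max_s)                # stand-in for conv([avg; max])
    return x * sa

rng = np.random.default_rng(2)
x = rng.standard_normal((5, 5, 8))
w = rng.standard_normal((8, 8))
y = cbam(x, w)
print(y.shape)   # (5, 5, 8): same shape, attention-refined
```

Because both attention maps are applied multiplicatively and preserve shape, the module can be dropped between any two convolutional blocks, which is the integration property noted above.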

Concurrent spatial and channel excitation mechanism

To make this work valid for segmentation tasks, Roy et al. [ 137 , 138 ] extended the effort of Hu et al. [ 134 ] by adding the influence of spatial information to the channel information. They presented three types of modules: (1) concurrent spatial and channel squeeze and excitation (scSE); (2) channel squeeze and spatial excitation (sSE); and (3) spatial squeeze and channel-wise excitation (cSE). For segmentation purposes, they employed auto-encoder-based CNNs and suggested inserting the modules after the encoder and decoder layers. In the first module (scSE), to specifically highlight the object-specific feature maps, they allocated attention to every channel by deriving a scaling factor from both the channel and spatial information. In the second module (sSE), the feature-map information has lower importance than the spatial locality, since spatial information plays a significant role during segmentation; therefore, several channel collections are spatially divided and exploited so that they can be employed in segmentation. The final module (cSE) uses a concept similar to the SE block, with the scaling factor derived based on the contribution of the feature maps to object detection [ 137 , 138 ].

CapsuleNet

CNNs are an efficient technique for detecting object features and achieve good recognition performance in comparison with innovative handcrafted feature detectors. Nevertheless, CNNs have a number of limitations: they do not take certain relations between features into account, such as orientation, size, and perspective. For instance, for a face image, a CNN does not account for the positions of the various face components (such as mouth, eyes, nose, etc.); its neurons will wrongly activate and recognize the face without taking specific relations (such as size and orientation) into account. Now consider a neuron that carries, in addition to its activation probability, feature properties such as size, orientation, and perspective. A specific neuron, or capsule, of this type can effectively detect the face along with these different types of information. A capsule network is therefore constructed from multiple layers of capsule nodes. The CapsuleNet, or CapsNet (the initial version of the capsule networks), consists of an encoding unit containing three layers of capsule nodes.

For example, for the \(28\times 28\) MNIST images, the first convolutional layer applies 256 filters of size \(9\times 9\) with stride 1, producing 256 feature maps of spatial size \(28-9+1=20\) . These outputs are fed into the first capsule layer, which is a modified convolution layer producing an 8D vector rather than a scalar: it employs \(9\times 9\) filters with stride 2, so the spatial output dimension is \(\lfloor (20-9)/2\rfloor +1=6\) . The primary capsules employ \(8\times 32\) filters, generating 32 × 8 × 6 × 6 outputs (32 capsule groups, 8 neurons each, with a 6 × 6 spatial size).
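The two spatial sizes quoted above follow from the standard convolution output-size formula, which can be checked directly:

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution (no dilation):
    floor((size + 2*padding - kernel) / stride) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

# First conv layer of the MNIST CapsNet: 9x9 kernels, stride 1
print(conv_out(28, 9, 1))   # 20
# Primary-capsule conv layer: 9x9 kernels, stride 2
print(conv_out(20, 9, 2))   # 6 -> 32 groups of 8D capsules on a 6x6 grid
```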

Figure  25 represents the complete CapsNet encoding and decoding processes. In the CNN context, a max-pooling layer is frequently employed to handle translation changes: it can detect a feature that moves, provided the feature remains within the max-pooling window. Since a capsule involves the weighted sum of features from the preceding layer, the capsule approach additionally has the ability to detect overlapped features, which is highly significant in detection and segmentation operations.

figure 25

The complete CapsNet encoding and decoding processes

In conventional CNNs, a particular cost function is employed to evaluate the global error that propagates backward throughout the training process; in such cases, however, the activation of a neuron stops propagating once the weight between two neurons becomes zero. In repetitive dynamic routing by agreement, instead of a single cost function over the complete network, the signal is routed based on the feature parameters. Sabour et al. [ 139 ] provide more details about this architecture. This innovative CNN architecture gives superior accuracy when recognizing handwritten digits on MNIST. From the application perspective, this architecture is more suitable for segmentation and detection approaches than for classification [ 140 , 141 , 142 ].

High-resolution network (HRNet)

High-resolution representations are necessary for position-sensitive vision tasks, such as semantic segmentation, object detection, and human pose estimation. In present state-of-the-art frameworks such as VGGNet and ResNet, the input image is encoded as a low-resolution representation using a subnetwork constructed as a connected series of high-to-low-resolution convolutions, and the low-resolution representation is then recovered into a high-resolution one. Alternatively, high-resolution representations can be maintained during the entire process using a novel network referred to as the High-Resolution Network (HRNet) [ 143 , 144 ]. This network has two principal features. First, the convolution streams from high to low resolution are connected in parallel. Second, information is repeatedly exchanged across the resolutions. The advantage achieved is a representation that is more accurate spatially and richer semantically. Moreover, HRNet has several applications in the fields of object detection, semantic segmentation, and human pose prediction, and it represents a more robust backbone for computer vision problems. Figure  26 illustrates the general architecture of HRNet.

figure 26

The general architecture of HRNet

Challenges (limitations) of deep learning and alternative solutions

Several difficulties must often be taken into consideration when employing DL. The most challenging ones are listed next, together with possible alternative solutions.

Training data

DL is extremely data-hungry, since it also involves representation learning [ 145 , 146 ]. DL demands an extensively large amount of data to achieve a well-performing model; as the data increase, an even better-performing model can be achieved (Fig.  27 ). In most cases, the available data are sufficient to obtain a good model, but sometimes there is a shortage of data for using DL directly [ 87 ]. Three methods are suggested to properly address this issue. The first involves employing the transfer-learning concept, using data collected from similar tasks. While the transferred data will not directly augment the actual data, it helps in terms of both enhancing the original input representation of the data and its mapping function [ 147 ], thus boosting model performance. A related technique is to take a model well trained on a similar task and fine-tune the last one or two layers using the limited original data; refer to [ 148 , 149 ] for a review of transfer-learning techniques applied in DL. In the second method, data augmentation is performed [ 150 ]. This is very helpful for augmenting image data, since translation, mirroring, and rotation commonly do not change the image label. Conversely, care must be taken when applying this technique to some data, such as bioinformatics data: for instance, mirroring an enzyme sequence yields output data that may not represent an actual enzyme sequence. In the third method, simulated data can be used to increase the volume of the training set. If the problem is well understood, it is sometimes possible to create simulators based on the physical process and then simulate as much data as needed. Ref. [ 151 ] provides an example of addressing the data requirements of DL via simulation.

figure 27

The performance of DL as a function of the amount of data

  • Transfer learning

Recent research has revealed a widespread use of deep CNNs, which offer ground-breaking support for answering many classification problems. Generally speaking, deep CNN models require a sizable volume of data to obtain good performance, and the common challenge associated with using such models is the lack of training data. Indeed, gathering a large volume of data is an exhausting job, and no universally successful solution is available at this time. The undersized-dataset problem is therefore currently addressed using the transfer learning (TL) technique [ 148 , 149 ], which is highly efficient in dealing with the lack of training data. The mechanism of TL involves first training the CNN model with a large volume of data; in the next step, the model is fine-tuned for training on a small target dataset.

The student-teacher relationship is a suitable analogy for clarifying TL. The teacher first gathers detailed knowledge of the subject [ 152 ] and then provides a "course", conveying the information within a series of lectures over time. Put simply, the teacher (expert) transfers the information (knowledge) to the student (learner). Similarly, a DL network is trained using a vast volume of data and learns the bias and weights during training. These weights are then transferred to a different network for retraining or testing a similar novel model; the novel model thus starts from pre-trained weights rather than requiring training from scratch. Figure  28 illustrates the conceptual diagram of the TL technique.

Pre-trained models: Many CNN models, e.g. AlexNet [ 30 ], GoogleNet [ 103 ], and ResNet [ 37 ], have been trained on large datasets such as ImageNet for image-recognition purposes. These models can then be employed to recognize a different task without the need to train from scratch: the weights remain the same apart from a few re-learned layers. Such models are very useful when data samples are lacking. There are several reasons for employing a pre-trained model. First, training large models on sizeable datasets requires expensive computational power. Second, training large models can be time-consuming, taking up to multiple weeks. Finally, a pre-trained model can assist with network generalization and speed up convergence.
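The "freeze the backbone, retrain the head" recipe can be sketched without any DL framework. In this hedged toy, a fixed random projection with ReLU plays the role of a pre-trained feature extractor (a stand-in, not a real pre-trained network), and fitting the new linear head in closed form via least squares stands in for gradient-based fine-tuning on the small target dataset:

```python
import numpy as np

def fine_tune_head(pretrained, x, y):
    """Keep the pre-trained feature extractor frozen and refit only
    the final linear layer on the (small) target dataset."""
    feats = pretrained(x)                          # frozen features
    w_out, *_ = np.linalg.lstsq(feats, y, rcond=None)
    return w_out

rng = np.random.default_rng(3)
# Hypothetical "pre-trained" extractor: fixed random projection + ReLU
w_frozen = rng.standard_normal((4, 16))
pretrained = lambda x: np.maximum(x @ w_frozen, 0)

x = rng.standard_normal((32, 4))                   # small target dataset
y = pretrained(x) @ rng.standard_normal(16)        # target linear in the features
w = fine_tune_head(pretrained, x, y)
err = np.abs(pretrained(x) @ w - y).mean()
print(err < 1e-6)   # the new head fits the small dataset exactly here
```

Only `w_out` is learned; the frozen weights `w_frozen` encode the transferred knowledge, which is why far less target data is needed than training from scratch.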

A research problem using pre-trained models: Training a DL approach requires a massive number of images, so obtaining good performance is a challenge under limited-data circumstances. If a huge amount of data is available, deep convolutional neural networks (DCNNs) comprising several layers can achieve excellent outcomes in image classification or recognition applications, with performance occasionally superior to that of a human [ 37 , 148 , 153 ]. However, avoiding overfitting in such applications requires sizable datasets and properly generalized DCNN models. Although the dataset size has no lower limit when training a DCNN model, the accuracy of the model becomes insufficient if the utilized model has fewer layers or if a small dataset is used for training, due to over- or under-fitting problems. Models with fewer layers have poor accuracy because they cannot exploit the hierarchical features of sizable datasets. Acquiring sufficient training data for DL models is difficult: in medical imaging and environmental science, for example, gathering labeled datasets is very costly [ 148 ]. Moreover, the majority of crowdsourcing workers are unable to annotate medical or biological images accurately due to their lack of medical or biological knowledge. ML researchers therefore often rely on field experts to label such images; however, this process is costly and time-consuming, and producing the large volume of labels required to develop flourishing deep networks turns out to be unfeasible. Recently, TL has been widely employed to address the latter issue. Nevertheless, although TL enhances the accuracy of several tasks in the fields of pattern recognition and computer vision [ 154 , 155 ], there is an essential issue concerning the type of source data used by the TL as compared to the target dataset.
For instance, the medical image classification performance of CNN models can be improved by pre-training the models on the ImageNet dataset, which contains natural images [ 153 ]. However, such natural images are completely dissimilar from raw medical images, in which case model performance is not enhanced. It has further been shown that TL from different domains does not significantly affect performance on medical imaging tasks, as lightweight models trained from scratch perform nearly as well as standard ImageNet-transferred models [ 156 ]. Therefore, there exist scenarios in which using pre-trained models is not a viable solution. In 2020, some researchers utilized same-domain TL and achieved excellent results [ 86 , 87 , 88 , 157 ]. Same-domain TL is an approach that trains on images similar to the target dataset: for example, X-ray images of different chest diseases are used to train the model, which is then fine-tuned and trained on chest X-ray images for COVID-19 diagnosis. More details about same-domain TL and how to implement the fine-tuning process can be found in [ 87 ].

figure 28

The conceptual diagram of the TL technique

Data augmentation techniques

If the goal is to increase the amount of available data while avoiding overfitting, data augmentation techniques are one possible solution [ 150 , 158 , 159 ]. These techniques are data-space solutions for any limited-data problem: they incorporate a collection of methods that improve the attributes and size of training datasets, so DL networks perform better when they are employed. Next, we list some data augmentation alternatives.

Flipping: Flipping on the vertical axis is a less common practice than flipping on the horizontal one. Flipping has been verified as valuable on datasets like ImageNet and CIFAR-10, and it is very simple to implement. However, it is not a label-preserving transformation on datasets that involve text recognition (such as SVHN and MNIST).

Color space: Digital image data are commonly encoded as a tensor of dimensions ( \(height \times width \times color channels\) ). Performing augmentations in the color space of the channels is an alternative technique that is very practical to implement. A very simple color augmentation involves isolating a channel of a particular color, such as red, green, or blue: an image can be quickly converted to its single-color-channel version by separating that matrix and adding two zero matrices for the remaining color channels. Furthermore, the image brightness can be increased or decreased by using straightforward matrix operations to manipulate the RGB values. Additional, more sophisticated color augmentations can be obtained by deriving a color histogram that describes the image; lighting alterations are then made possible by adjusting the intensity values in the histogram, similar to those employed in photo-editing applications.
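Both channel isolation and brightness adjustment are a few NumPy operations on the (height × width × channels) tensor; a minimal sketch on a tiny synthetic image:

```python
import numpy as np

def isolate_channel(img, c):
    """Keep one RGB channel and zero the other two (img: H x W x 3, uint8)."""
    out = np.zeros_like(img)
    out[..., c] = img[..., c]
    return out

def adjust_brightness(img, delta):
    """Shift all RGB intensities by `delta`, clipping to the [0, 255] range."""
    return np.clip(img.astype(np.int16) + delta, 0, 255).astype(np.uint8)

img = np.full((2, 2, 3), 100, dtype=np.uint8)   # tiny uniform gray image
red_only = isolate_channel(img, 0)
bright = adjust_brightness(img, 200)
print(red_only[0, 0].tolist())   # [100, 0, 0]
print(bright[0, 0].tolist())     # [255, 255, 255] (clipped)
```

Casting to a wider integer type before the shift avoids uint8 wrap-around, which is a common bug in hand-rolled brightness augmentation.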

Cropping: Cropping a dominant patch of every single image, with chosen height and width, is a technique employed as a specific processing step for image data. Furthermore, random cropping may be employed to produce an effect similar to translations. The difference between translations and random cropping is that translations conserve the spatial dimensions of the image, while random cropping reduces the input size [for example, from (256, 256) to (224, 224)]. Depending on the selected reduction threshold for cropping, the transformation may not be label-preserving.

Rotation: Rotation augmentations are obtained by rotating an image left or right by between 0 and 360 degrees around the axis. The rotation-degree parameter greatly determines the suitability of the rotation augmentations: in digit recognition tasks, small rotations (from 0 to 20 degrees) are very helpful, but as the rotation degree increases, the data label may no longer be preserved post-transformation.

Translation: To avoid positional bias within the image data, a very useful transformation is shifting the image up, down, left, or right. For instance, it is common for all images in a dataset to be centered, and a model tested on such a dataset would then only ever see centered images. Note that when translating the initial images in a particular direction, the remaining space should be filled with Gaussian or random noise, or with a constant value such as 255 or 0. This padding preserves the spatial dimensions of the image post-augmentation.
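A translation that preserves the spatial dimensions by filling the vacated region can be written with plain array slicing; a sketch with a constant fill (the fill could equally be random noise, as noted above):

```python
import numpy as np

def translate(img, dy, dx, fill=0):
    """Shift an image by (dy, dx); the vacated space is filled with a
    constant value, so the spatial dimensions are preserved."""
    out = np.full_like(img, fill)
    h, w = img.shape[:2]
    ys = slice(max(dy, 0), h + min(dy, 0))     # destination rows
    xs = slice(max(dx, 0), w + min(dx, 0))     # destination cols
    ys_src = slice(max(-dy, 0), h + min(-dy, 0))
    xs_src = slice(max(-dx, 0), w + min(-dx, 0))
    out[ys, xs] = img[ys_src, xs_src]
    return out

img = np.arange(16).reshape(4, 4)
shifted = translate(img, 1, 0, fill=0)   # shift down by one row
print(shifted.shape)         # (4, 4): dimensions preserved
print(shifted[0].tolist())   # top row is padding: [0, 0, 0, 0]
print(shifted[1].tolist())   # original first row: [0, 1, 2, 3]
```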

Noise injection: This approach involves injecting a matrix of arbitrary values, commonly drawn from a Gaussian distribution. Moreno-Barea et al. [ 160 ] tested noise injection on nine datasets taken from the UCI repository [ 161 ]. Injecting noise into images enables the CNN to learn more robust features.
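The injection itself is a one-line, shape-preserving addition of a Gaussian noise matrix to the input; a minimal sketch:

```python
import numpy as np

def inject_gaussian_noise(x, sigma, rng):
    """Add a matrix of Gaussian noise (mean 0, std sigma) to the input;
    the shape, and hence the label, is preserved."""
    return x + rng.normal(0.0, sigma, size=x.shape)

rng = np.random.default_rng(4)
x = np.zeros((3, 3))
noisy = inject_gaussian_noise(x, sigma=0.1, rng=rng)
print(noisy.shape)   # (3, 3)
```

Each call with a fresh random draw yields a distinct training sample from the same labeled example, which is what makes the learned features more robust.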

Geometric transformations provide very well-behaved solutions for the positional biases present in the training data. Several prospective sources of bias can separate the distribution of the testing data from that of the training data. For instance, when all faces are perfectly centered within the frames (as in facial recognition datasets), the problem of positional bias emerges, and geometric translations are the best solution. Geometric translations are helpful due to their simplicity of implementation, as well as their effective ability to disable the positional biases. Several image-processing libraries are available, which makes it easy to begin with simple operations such as rotation or horizontal flipping. However, geometric transformations have some shortcomings: additional training time, higher computational costs, and additional memory. Furthermore, a number of geometric transformations (such as arbitrary cropping or translation) must be manually checked to ensure that they do not change the image label. Finally, the biases that separate the test data from the training data are more complex than translational and positional changes; hence, it is not trivial to determine when and where geometric transformations are suitable.

Imbalanced data

Biological data commonly tend to be imbalanced, as negative samples are much more numerous than positive ones [ 162 , 163 , 164 ]. For example, compared to COVID-19-positive X-ray images, the volume of normal X-ray images is very large. It should be noted that undesirable results may be produced when training a DL model with imbalanced data. The following techniques are used to solve this issue. First, it is necessary to employ the correct criteria for evaluating the loss, as well as the prediction result: given imbalanced data, the model should perform well on small classes as well as larger ones, so the model should employ the area under the curve (AUC) as the resulting loss and criterion [ 165 ]. Second, if the cross-entropy loss is still preferred, the weighted cross-entropy loss should be employed, which ensures the model performs well on small classes. Simultaneously, during model training, it is possible either to down-sample the large classes or up-sample the small classes. Finally, to make the data balanced, as in Ref. [ 166 ], it is possible to construct models for every hierarchical level, since a biological system frequently has a hierarchical label space. The effect of imbalanced data on the performance of the DL model has been comprehensively investigated, and the most frequently used mitigation techniques have been compared; note, however, that these techniques are not specific to biological problems.
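The weighted cross-entropy idea can be illustrated numerically: up-weighting the rare positive class makes errors on it dominate the loss. A hedged NumPy sketch with hypothetical inverse-frequency weights:

```python
import numpy as np

def weighted_cross_entropy(y_true, p_pred, w_pos, w_neg, eps=1e-12):
    """Binary cross-entropy with per-class weights; w_pos = w_neg = 1
    recovers the plain (unweighted) cross-entropy."""
    p = np.clip(p_pred, eps, 1 - eps)
    losses = -(w_pos * y_true * np.log(p)
               + w_neg * (1 - y_true) * np.log(1 - p))
    return losses.mean()

y = np.array([1, 0, 0, 0, 0])            # positives are rare (1 in 5)
p = np.array([0.3, 0.1, 0.1, 0.1, 0.1])  # model underconfident on the positive
plain = weighted_cross_entropy(y, p, 1.0, 1.0)
# inverse-frequency weights: w_pos = n/(2*n_pos), w_neg = n/(2*n_neg)
weighted = weighted_cross_entropy(y, p, 2.5, 0.625)
print(weighted > plain)   # the rare positive's error now dominates
```

With the unweighted loss, the four easy negatives dilute the positive's error; with inverse-frequency weights the gradient signal on the minority class is amplified.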

Interpretability of data

DL techniques are sometimes regarded as black boxes; in fact, they are interpretable. A method of interpreting DL, used to obtain the valuable motifs and patterns recognized by the network, is needed in many fields, such as bioinformatics [ 167 ]. In the task of disease diagnosis, it is not enough to know the diagnosis or prediction results of a trained DL model; it is also necessary to verify the prediction outcomes, since the model makes its decisions based on these verifications [ 168 ]. To achieve this, an importance score can be assigned to every portion of a particular example. Within this solution, back-propagation-based techniques or perturbation-based approaches are used [ 169 ]. In the perturbation-based approaches, a portion of the input is changed and the effect of this change on the model output is observed [ 170 , 171 , 172 , 173 ]; this concept has high computational complexity, but it is simple to understand. By contrast, in the back-propagation-based techniques, the signal from the output propagates back to the input layer to assess the importance score of the various input portions. These techniques have been proven valuable in [ 174 ]. In different scenarios, various meanings can represent the model interpretability.
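The perturbation-based approach described above can be sketched generically: replace one input portion at a time with a baseline value and record how much the model output changes. The model here is a hypothetical stand-in (a simple linear function), not a trained network:

```python
import numpy as np

def perturbation_importance(model, x, baseline=0.0):
    """Perturbation-based interpretation: set one input feature at a
    time to `baseline` and score it by the change in model output;
    larger change = more important portion."""
    ref = model(x)
    scores = np.empty_like(x, dtype=float)
    for i in range(x.size):
        x_pert = x.copy()
        x_pert.flat[i] = baseline
        scores.flat[i] = abs(model(x_pert) - ref)
    return scores

# Hypothetical model: only the first two features matter
model = lambda x: 3.0 * x[0] + 1.0 * x[1] + 0.0 * x[2]
scores = perturbation_importance(model, np.array([1.0, 1.0, 1.0]))
print(scores.tolist())   # [3.0, 1.0, 0.0]
```

Note the cost: one forward pass per perturbed portion, which is the high computational complexity mentioned above; back-propagation-based techniques obtain all scores from a single backward pass instead.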

Uncertainty scaling

When DL techniques are employed for prediction, the final label is commonly not the only output required; a confidence score for every query is also desired. The confidence score is defined as how confident the model is in its prediction [ 175 ]. Since it prevents belief in unreliable and misleading predictions, the confidence score is a significant attribute regardless of the application scenario. In biology, it reduces the resources and time expended in verifying the outcomes of misleading predictions. Generally speaking, in healthcare and similar applications, uncertainty scaling is frequently very significant; it helps in evaluating automated clinical decisions and the reliability of machine-learning-based disease diagnosis [ 176 , 177 ]. Because different DL models can output over-confident predictions, the probability score (obtained from the softmax output of the direct DL model) is often not on the correct scale [ 178 ]; the softmax output requires post-scaling to yield a reliable probability score. Several techniques have been introduced for outputting the probability score on the correct scale, including Bayesian Binning into Quantiles (BBQ) [ 179 ], isotonic regression [ 180 ], histogram binning [ 181 ], and the classic Platt scaling [ 182 ]. More specifically for DL techniques, temperature scaling was recently introduced and achieves superior performance compared to the other techniques.
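A minimal sketch of temperature scaling follows; the logits and the temperature T = 2 are illustrative. In practice, T is fitted on a held-out validation set (e.g. by minimizing the negative log-likelihood) rather than chosen by hand.

```python
import math

def softmax(logits, T=1.0):
    """Softmax with temperature T; T > 1 softens over-confident outputs
    while leaving the predicted class (the argmax) unchanged."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 1.0, 0.5]
raw = softmax(logits)                # over-confident top-class probability
calibrated = softmax(logits, T=2.0)  # same argmax, reduced confidence
```

Because every logit is divided by the same T, the ranking of classes is preserved; only the confidence scale changes.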

Catastrophic forgetting

Catastrophic forgetting occurs when incorporating new information into a plain DL model interferes with, and overwrites, the previously learned information. For instance, consider a model trained to classify 1000 types of flowers; if a new type of flower is then introduced and the model is fine-tuned only on this new class, its performance on the older classes collapses [ 183 , 184 ]. Data are continually collected and renewed, which is a highly typical scenario in many fields, e.g. biology. A direct solution is to train an entirely new model from scratch on both the old and new data; however, this is time-consuming and computationally intensive, and it leads to an unstable state for the learned representation of the initial data. Currently, three types of ML techniques that avoid catastrophic forgetting are available, founded on neurophysiological theories of how the human brain solves this problem [ 185 , 186 ]. Techniques of the first type are founded on regularization, such as EWC [ 183 ]. Techniques of the second type employ rehearsal training and dynamic neural network architectures, like iCaRL [ 187 , 188 ]. Finally, techniques of the third type are founded on dual-memory learning systems [ 189 ]. Refer to [ 190 , 191 , 192 ] for more details.
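A minimal sketch of the first (regularization-based) type, the EWC-style quadratic penalty, follows; the parameter values, diagonal Fisher estimates, and strength lam are illustrative assumptions.

```python
def ewc_penalty(params, old_params, fisher, lam):
    """Elastic Weight Consolidation regularizer: moving a parameter is
    penalized in proportion to its (diagonal) Fisher information, i.e.
    its importance for the previously learned task."""
    return 0.5 * lam * sum(f * (p - p_old) ** 2
                           for p, p_old, f in zip(params, old_params, fisher))

# Moving an important weight (fisher = 5.0) by 1.0 costs far more than
# moving an unimportant one (fisher = 0.1) by the same amount:
cost_important = ewc_penalty([1.0, 0.0], [0.0, 0.0], [5.0, 0.1], lam=1.0)
cost_minor     = ewc_penalty([0.0, 1.0], [0.0, 0.0], [5.0, 0.1], lam=1.0)
```

During fine-tuning on the new class, this penalty is added to the task loss, anchoring the weights that mattered for the old classes.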

Model compression

Owing to their huge complexity and large numbers of parameters, well-trained DL models have intensive memory and computational requirements [ 193 , 194 ]. Healthcare and environmental science are data-intensive fields; the numerous methods of assessing human health, together with the heterogeneity of the data, have become far more complicated and vastly larger in scale [ 195 ], demanding additional computation [ 196 ]. These requirements hinder the deployment of DL on machines with limited computational power, particularly in the healthcare field. Novel hardware-based parallel-processing solutions such as FPGAs and GPUs [ 197 , 198 , 199 ] have been developed to address the computational demands of DL. Recently, numerous techniques for compressing DL models, designed to reduce the computational cost of the models from the outset, have also been introduced. These techniques can be classified into four classes. The first class, called parameter pruning, removes redundant parameters that have no significant impact on model performance, and includes the well-known deep compression method [ 200 ]. In the second class, knowledge distillation, a larger model's distilled knowledge is used to train a more compact model [ 201 , 202 ]. In the third class, compact convolution filters are used to reduce the number of parameters [ 203 ]. In the final class, low-rank factorization is used to estimate which informative parameters to preserve [ 204 ]. These classes represent the most representative techniques for model compression; a more comprehensive discussion of the topic is provided in [ 193 ].
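A minimal sketch of the first class, magnitude-based parameter pruning, follows; the weight values and the 50% sparsity level are illustrative assumptions.

```python
def magnitude_prune(weights, sparsity):
    """Parameter pruning: zero out the given fraction of weights with the
    smallest magnitudes, assuming they contribute least to the output."""
    k = int(len(weights) * sparsity)  # number of weights to remove
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    keep = set(order[k:])             # indices of surviving weights
    return [w if i in keep else 0.0 for i, w in enumerate(weights)]

pruned = magnitude_prune([0.01, -0.8, 0.05, 1.2, -0.02, 0.3], sparsity=0.5)
# The three smallest-magnitude weights are zeroed; the rest survive.
```

The resulting sparse weight tensor can then be stored and executed more cheaply, typically after a short fine-tuning step to recover accuracy.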

Overfitting

Due to the vast number of parameters involved, which are correlated in complex ways, DL models are highly prone to overfitting the data at the training stage. Such situations reduce the model's ability to achieve good performance on the test data [ 90 , 205 ]. This problem is not limited to a specific field but affects many different tasks; therefore, it should be fully considered and properly handled when proposing DL techniques. Recent studies suggest that in DL, the implicit bias of the training process enables the model to overcome severe overfitting problems [ 205 , 206 , 207 , 208 ]. Even so, it is still necessary to develop techniques that handle overfitting, and the available DL algorithms that ease the problem can be categorized into three classes. The first class acts on both the model architecture and model parameters and includes the most familiar approaches, such as weight decay [ 209 ], batch normalization [ 210 ], and dropout [ 90 ]. Weight decay [ 209 ] is the default technique in DL and is used extensively in almost all ML algorithms as a universal regularizer. The second class operates on the model inputs, e.g. data corruption and data augmentation [ 150 , 211 ]. One cause of overfitting is a lack of training data, which makes the learned distribution fail to mirror the real distribution; data augmentation enlarges the training data, while marginalized data corruption improves on augmenting the data alone. The final class operates on the model output: a recently proposed technique regularizes the model by penalizing over-confident outputs [ 178 ], and has demonstrated the ability to regularize both RNNs and CNNs.
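A minimal sketch of dropout from the first class follows, in its common "inverted" form, which rescales the surviving units at training time so that the test-time forward pass needs no change; the drop rate p = 0.5 is illustrative.

```python
import random

def dropout(activations, p, training=True):
    """Inverted dropout: during training, zero each unit with probability p
    and scale survivors by 1/(1-p); at test time, pass values through."""
    if not training or p == 0.0:
        return list(activations)
    return [0.0 if random.random() < p else a / (1.0 - p)
            for a in activations]

random.seed(0)  # for reproducibility of this sketch
train_out = dropout([1.0] * 1000, p=0.5)                # ~half zeroed
test_out = dropout([1.0, 2.0], p=0.5, training=False)   # unchanged
```

The 1/(1-p) rescaling keeps the expected activation the same in training and inference, which is why no correction is needed at test time.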

Vanishing gradient problem

In general, a problem called the vanishing gradient arises when backpropagation and gradient-based learning techniques are used to train ANNs, largely during the training stage [ 212 , 213 , 214 ]. More specifically, in each training iteration, every weight of the neural network is updated in proportion to the partial derivative of the error function with respect to the current weight. In some cases, however, this weight update effectively does not occur because the gradient is vanishingly small; in the worst case, no further training is possible and the neural network stops learning entirely. The sigmoid function, like several other activation functions, shrinks a large input space into a small output range; thus, a large variation at its input produces only a small variation at its output, and its derivative is small. In a shallow network, where only a few layers use these activations, this is not a significant issue; with more layers, however, the gradient becomes very small during training, and the network no longer trains efficiently. The back-propagation technique determines the gradients of the neural network by computing the derivatives of each layer in the reverse direction, starting from the last layer and progressing back to the first, and multiplying the derivatives of successive layers together down the network. With N hidden layers employing an activation function such as the sigmoid, N small derivatives are multiplied together, and the gradient declines exponentially while propagating back to the first layer. More specifically, the biases and weights of the first layers cannot be updated efficiently during the training stage because the gradient is small.
Moreover, this condition decreases the overall network accuracy, as these first layers are frequently critical to recognizing the essential elements of the input data. The problem can be avoided by employing activation functions that lack the squishing property, i.e., that do not squash the input space into a small range. The most popular choice is the ReLU [ 91 ], which maps x to max(0, x) and therefore does not yield a small derivative in its positive region. Another solution is the batch normalization layer [ 81 ]. As mentioned earlier, the problem occurs once a large input space is squashed into a small space, causing the derivative to vanish; batch normalization mitigates this by simply normalizing the input so that |x| does not reach the outer, saturating regions of the sigmoid. The normalization keeps most inputs within the unsaturated region, which ensures that the derivative is large enough for further training. Furthermore, faster hardware, e.g. that provided by GPUs, can tackle the issue by making standard back-propagation feasible for many more layers in the time previously required before the vanishing gradient problem becomes apparent [ 215 ].
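The exponential decay described above can be seen numerically. A minimal sketch follows, multiplying the sigmoid's derivative (which is at most 0.25, attained at x = 0) across 20 layers, against the ReLU's derivative of 1 in its positive region; the 20-layer depth is an illustrative assumption.

```python
import math

def sigmoid_grad(x):
    """Derivative of the sigmoid; its maximum value is 0.25, at x = 0."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

# Back-propagating through 20 sigmoid layers multiplies 20 such factors,
# even in the best case (pre-activation x = 0 at every layer):
sigmoid_chain = 1.0
for _ in range(20):
    sigmoid_chain *= sigmoid_grad(0.0)

# ReLU's derivative is 1 for positive inputs, so the chain does not decay:
relu_chain = 1.0 ** 20
```

After only 20 layers, the sigmoid chain is on the order of 1e-12, which is why the first layers of a deep sigmoid network receive essentially no learning signal.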

Exploding gradient problem

The exploding gradient problem is the opposite of the vanishing gradient problem: large error gradients accumulate during back-propagation [ 216 , 217 , 218 ], leading to extremely large updates to the network weights. The system becomes unstable, and the model loses its ability to learn effectively. Roughly speaking, as back-propagation moves backward through the network, the gradient grows exponentially through the repeated multiplication of gradients; the weight values can thus become incredibly large and may overflow into not-a-number (NaN) values. Some potential solutions include:

Using different weight regularization techniques.

Redesigning the architecture of the network model.
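A further widely used remedy, closely related to the regularization option above, is gradient clipping. A minimal sketch of clipping by global L2 norm follows; the gradient values and threshold are illustrative assumptions.

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale the gradient vector whenever its L2 norm exceeds max_norm,
    bounding the size of any single weight update."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)          # small gradients pass through untouched
    scale = max_norm / norm
    return [g * scale for g in grads]

clipped = clip_by_norm([300.0, 400.0], max_norm=5.0)  # norm 500 -> norm 5
small = clip_by_norm([0.3, 0.4], max_norm=5.0)        # left unchanged
```

Because the whole vector is rescaled, the update direction is preserved; only its magnitude is bounded.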

Underspecification

In 2020, a team of computer scientists at Google identified a new challenge called underspecification [ 219 ]. ML models, including DL models, often show surprisingly poor behavior when they are tested in real-world applications such as computer vision, medical imaging, natural language processing, and medical genomics; this weak performance is due to underspecification. It has been shown that small modifications can force a model towards a completely different solution and lead to different predictions in deployment domains. There are different techniques for addressing the underspecification issue. One of them is to design "stress tests" that examine how well a model works on real-world data and expose possible issues; however, this demands a reliable understanding of the ways in which the model can work inaccurately. The team stated that "Designing stress tests that are well-matched to applied requirements, and that provide good 'coverage' of potential failure modes is a major challenge". Underspecification puts major constraints on the credibility of ML predictions and may require some reconsideration of certain applications. Since ML now touches human lives through applications such as medical imaging and self-driving cars, this issue will require proper attention.

Applications of deep learning

Presently, various DL applications are widespread around the world. These applications include healthcare, social network analysis, audio and speech processing (like recognition and enhancement), visual data processing methods (such as multimedia data analysis and computer vision), and NLP (translation and sentence classification), among others (Fig. 29) [ 220 , 221 , 222 , 223 , 224 ]. These applications have been classified into five categories: classification, localization, detection, segmentation, and registration. Although each of these tasks has its own target, there is fundamental overlap in the pipeline implementation of these applications, as shown in Fig. 30. Classification categorizes a set of data into classes. Detection locates objects of interest in an image, with consideration given to the background; in detection, multiple objects, which could be from dissimilar classes, are surrounded by bounding boxes. Localization locates a single object, which is surrounded by a single bounding box. In segmentation (semantic segmentation), the target object's edges are surrounded by outlines, which also label them. Registration refers to fitting one image (which could be 2D or 3D) onto another. One of the most important and wide-ranging DL application areas is healthcare [ 225 , 226 , 227 , 228 , 229 , 230 ]. This area of research is critical due to its relation to human lives, and DL has shown tremendous performance in healthcare. We therefore take DL applications in the medical image analysis field as an example to describe the DL applications.

figure 29

Examples of DL applications

figure 30

Workflow of deep learning tasks

Classification

Computer-Aided Diagnosis (CADx) is another title sometimes used for classification. Bharati et al. [ 231 ] used a chest X-ray dataset for detecting lung diseases based on a CNN. Another study attempted to read X-ray images by employing a CNN [ 232 ]; in this modality, the comparative accessibility of these images has likely accelerated the progress of DL. The authors of [ 233 ] used an improved pre-trained GoogLeNet CNN, with more than 150,000 images for the training and testing processes; this dataset was augmented from 1850 chest X-rays. They classified the image orientation into lateral and frontal views and achieved approximately 100% accuracy. Although orientation classification has limited clinical use on its own, as part of an ultimately fully automated diagnosis workflow this work demonstrated the efficiency of data augmentation and pre-training in learning the metadata of relevant images. Chest infection, commonly referred to as pneumonia, is extremely treatable, as it is a commonly occurring health problem worldwide. Rajpurkar et al. [ 234 ] utilized CheXNet, an improved version of DenseNet [ 112 ] with 121 convolution layers, for classifying fourteen types of disease. These authors used the ChestX-ray14 dataset [ 235 ], which comprises 112,000 images. This network achieved excellent performance in recognizing the fourteen diseases; in particular, pneumonia classification attained a 0.7632 AUC score under receiver operating characteristic (ROC) analysis. In addition, the network's performance was better than or equal to that of both a three-radiologist panel and four individual radiologists. Zuo et al. [ 236 ] adopted CNNs for lung nodule candidate classification. Shen et al. [ 237 ] employed both Random Forest (RF) and SVM classifiers with CNNs to classify lung nodules; they used three parallel CNNs, each with two convolutional layers.
The LIDC-IDRI (Lung Image Database Consortium) dataset, which contained 1010 labeled CT lung scans, was used to classify the two types of lung nodules (malignant and benign). Each CNN extracted features from image patches at a different scale, and the output feature vector was constructed from the learned features. Next, these vectors were classified into malignant or benign using either the RF classifier or an SVM with a radial basis function (RBF) kernel. The model was robust to various levels of input noise and achieved an accuracy of 86% in nodule classification. The model of [ 238 ], in turn, interpolates the image data missing between PET and MRI images using 3D CNNs. The Alzheimer Disease Neuroimaging Initiative (ADNI) database, containing 830 PET and MRI patient scans, was utilized in their work. The paired PET and MRI images were used to train the 3D CNNs; for patients with no PET images, the trained networks then reconstructed the PET images. These rebuilt images approximately matched the actual disease recognition outcomes. However, this approach did not address the overfitting issues, which in turn restricted the technique's possible capacity for generalization. Diagnosing normal versus Alzheimer's disease patients has been achieved by several CNN models [ 239 , 240 ]. Hosseini-Asl et al. [ 241 ] attained a state-of-the-art 99% accuracy in diagnosing normal versus Alzheimer's disease patients. These authors applied an auto-encoder architecture using 3D CNNs. The generic brain features were pre-trained on the CADDementia dataset; the outcomes of these learned features then became inputs to higher layers, which differentiate between patient scans of Alzheimer's disease, mild cognitive impairment, or normal brains, based on the ADNI dataset and using fine-tuned deep supervision techniques.
The VoxCNN and ResNet models developed by Korolev et al. [ 242 ] were based on the architectures of VGGNet and residual networks, in that order. They also discriminated between Alzheimer's disease and normal patients using the ADNI database, with accuracies of 79% for VoxCNN and 80% for ResNet. Both models achieved lower accuracies than Hosseini-Asl's work; however, as Korolev et al. noted, their implementation was simpler and did not require hand-crafted features. In 2020, Mehmood et al. [ 240 ] trained a purpose-built CNN-based network called "SCNN" with MRI images for the task of Alzheimer's disease classification, achieving state-of-the-art results with an accuracy of 99.05%.

Recently, CNNs have taken several medical imaging classification tasks to a different level, from traditional diagnosis to automated diagnosis with tremendous performance. Examples of these tasks are diabetic foot ulcer (DFU) classification (into normal and abnormal (DFU) classes) [ 87 , 243 , 244 , 245 , 246 ], sickle cell anemia (SCA) classification (into normal, abnormal (SCA), and other blood components) [ 86 , 247 ], breast cancer classification of hematoxylin-eosin-stained breast biopsy images into four classes (invasive carcinoma, in-situ carcinoma, benign tumor, and normal tissue) [ 42 , 88 , 248 , 249 , 250 , 251 , 252 ], and multi-class skin cancer classification [ 253 , 254 , 255 ].

Since 2020, CNNs have also played a vital role in the early diagnosis of the novel coronavirus (COVID-19). CNNs have become the primary tool for automatic COVID-19 diagnosis in many hospitals around the world using chest X-ray images [ 256 , 257 , 258 , 259 , 260 ]. More details about the classification of medical imaging applications can be found in [ 226 , 261 , 262 , 263 , 264 , 265 ].

Localization

Although applications in anatomy education could increase, the practicing clinician is more likely to be interested in the localization of normal anatomy; localization could also be applied in completely automatic end-to-end applications, in which radiological images are examined and described without human intervention [ 266 , 267 , 268 ]. Zhao et al. [ 269 ] introduced a new deep-learning-based approach to localize pancreatic tumors in projection X-ray images for image-guided radiation therapy without the need for fiducials. Roth et al. [ 270 ] constructed and trained a CNN with five convolutional layers to classify around 4000 transverse-axial CT images into five categories: legs, pelvis, liver, lung, and neck. After applying data augmentation techniques, they achieved an AUC score of 0.998 with a classification error rate of 5.9%. For detecting the positions of the spleen, kidney, heart, and liver, Shin et al. [ 271 ] employed stacked auto-encoders on 78 contrast-enhanced MRI scans of the abdominal area containing the kidneys or liver. Hierarchical features were learned in the temporal and spatial domains, and depending on the organ, these approaches achieved detection accuracies of 62–79%. Sirazitdinov et al. [ 268 ] presented an aggregate of two convolutional neural networks, namely RetinaNet and Mask R-CNN, for pneumonia detection and localization.

Computer-Aided Detection (CADe) is another term used for detection. For both the clinician and the patient, overlooking a lesion on a scan may have dire consequences; thus, detection is a field of study requiring both accuracy and sensitivity [ 272 , 273 , 274 ]. Chouhan et al. [ 275 ] introduced an innovative deep learning framework for the detection of pneumonia by adopting the idea of transfer learning. Their approach obtained an accuracy of 96.4% with a recall of 99.62% on unseen data. In the area of COVID-19 and pulmonary disease, several convolutional neural network approaches have been proposed for automatic detection from X-ray images and have shown excellent performance [ 46 , 276 , 277 , 278 , 279 ].

In the area of skin cancer, several applications have been introduced for the detection task [ 280 , 281 , 282 ]. Thurnhofer-Hemsi et al. [ 283 ] introduced a deep learning approach for skin cancer detection by fine-tuning five state-of-the-art convolutional neural network models. They addressed the lack of training data by adopting transfer learning and data augmentation techniques. The DenseNet201 network showed superior results compared to the other models.

Another interesting area is that of histopathological images, which are progressively being digitized; several papers have been published in this field [ 284 , 285 , 286 , 287 , 288 , 289 , 290 ]. Human pathologists read these images laboriously, searching for malignancy markers such as a high index of cell proliferation measured with molecular markers (e.g. Ki-67), cellular necrosis signs, abnormal cellular architecture, enlarged numbers of mitotic figures denoting augmented cell replication, and enlarged nucleus-to-cytoplasm ratios. Note that a histopathological slide may contain a huge number of cells (up to the thousands); thus, the risk of overlooking abnormal neoplastic regions is high when wading through these cells at high levels of magnification. Ciresan et al. [ 291 ] employed CNNs of 11–13 layers for identifying mitotic figures, using fifty breast histology images from the MITOS dataset; their technique attained recall and precision scores of 0.7 and 0.88, respectively. Sirinukunwattana et al. [ 292 ] utilized 100 histology images of colorectal adenocarcinoma to detect cell nuclei using CNNs, with roughly 30,000 nuclei hand-labeled for training purposes. The novelty of this approach was the use of a Spatially Constrained CNN, which detects the center of nuclei using the surrounding spatial context and spatial regression. Xu et al. [ 293 ], by contrast, employed a stacked sparse auto-encoder (SSAE) to identify nuclei in histological slides of breast cancer, achieving recall and precision scores of 0.83 and 0.89, respectively, and showing that unsupervised learning techniques can also be utilized effectively in this field. Albarqouni et al. [ 294 ] investigated the problem of insufficient labeling in medical images by crowd-sourcing the labeling of actual mitoses in breast cancer histology images (from amateurs online).
Feeding these crowd-sourced labels into the CNN can solve the recurrent issue of inadequate labeling during the analysis of medical images, and this method represents a remarkable proof-of-concept effort. In 2020, Lei et al. [ 285 ] introduced the employment of deep convolutional neural networks for the automatic identification of mitotic candidates from histological sections for mitosis screening. They obtained state-of-the-art detection results on the dataset of the International Conference on Pattern Recognition (ICPR) 2012 Mitosis Detection Competition.

Segmentation

Although MRI and CT image segmentation research covers different organs such as knee cartilage, the prostate, and the liver, most work has concentrated on brain segmentation, particularly of tumors [ 295 , 296 , 297 , 298 , 299 , 300 ]. This task is highly significant in surgical planning, where precise tumor limits must be obtained for the smallest possible surgical resection: sacrificing too much of key brain regions during surgery may lead to neurological deficits including cognitive damage, emotionlessness, and limb difficulty. Conventionally, medical anatomical segmentation was done by hand, with the clinician drawing outlines slice by slice through the complete CT or MRI volume stack; a solution that automates this painstaking work is therefore highly desirable. Wadhwa et al. [ 301 ] presented a brief overview of brain tumor segmentation of MRI images. Akkus et al. [ 302 ] wrote a brilliant review of brain MRI segmentation that addressed the different metrics and CNN architectures employed; moreover, they explained several competitions and their datasets in detail, including Ischemic Stroke Lesion Segmentation (ISLES), Mild Traumatic brain injury Outcome Prediction (MTOP), and Brain Tumor Segmentation (BRATS).

Chen et al. [ 299 ] proposed convolutional neural networks for precise brain tumor segmentation. Their approach combines several techniques for better feature learning, including the DeepMedic model, a novel dual-force training scheme, a label-distribution-based loss function, and Multi-Layer Perceptron-based post-processing. They evaluated their method on the two most recent brain tumor segmentation datasets, i.e., the BRATS 2017 and BRATS 2015 datasets. Hu et al. [ 300 ] introduced a brain tumor segmentation method that adopts a multi-cascaded convolutional neural network (MCCNN) and fully connected conditional random fields (CRFs); the achieved results were excellent compared with the state-of-the-art methods.

Moeskops et al. [ 303 ] employed three parallel-running CNNs, each with a 2D input patch of a different size, for segmenting and classifying MRI brain images. These images, from 35 adults and 22 pre-term infants, were classified into tissue categories such as cerebrospinal fluid, grey matter, and white matter. The benefit of employing three different input patch sizes is that every patch concentrates on capturing different image aspects: the bigger sizes incorporate spatial features, while the smallest patch sizes concentrate on local textures. Overall, the algorithm achieved Dice coefficients in the range of 0.82–0.87, a satisfactory accuracy. Although 2D image slices are employed in the majority of segmentation research, Milletari et al. [ 304 ] implemented a 3D CNN for segmenting MRI prostate images, using the PROMISE2012 challenge dataset, from which fifty MRI scans were used for training and thirty for testing. Their V-Net was inspired by the U-Net architecture of Ronneberger et al. [ 305 ] and attained a 0.869 Dice coefficient score, matching the winning teams in the competition. To reduce overfitting while creating a deeper CNN of 11 convolutional layers, Pereira et al. [ 306 ] intentionally applied small filters of 3x3. Their model was trained on MRI scans of 274 gliomas (a type of brain tumor); they achieved first place in the 2013 BRATS challenge and second place in the 2015 BRATS challenge. Havaei et al. [ 307 ] also considered gliomas using the 2013 BRATS dataset. They investigated different 2D CNN architectures; compared to the winner of BRATS 2013, their algorithm worked better, requiring only 3 min to execute rather than 100 min. Their model is based on the concept of a cascaded architecture and is thus referred to as an InputCascadeCNN.
Chen et al. [ 308 ] introduced techniques employing fully connected Conditional Random Fields (CRFs), atrous spatial pyramid pooling, and up-sampled filters, aiming to enhance localization accuracy and enlarge the field of view of every filter at multiple scales. Their model, DeepLab, attained 79.7% mIOU (mean Intersection Over Union) on the PASCAL VOC-2012 image segmentation benchmark, an excellent performance.

Recently, automatic segmentation of COVID-19 lung infection from CT images, employing several deep learning techniques, has helped to track the development of COVID-19 infection [ 309 , 310 , 311 , 312 ].

Registration

Usually, given two input images, the four main stages of the canonical procedure of the image registration task are [ 313 , 314 ]:

Target Selection: it determines the reference input image onto which the second, counterpart input image must remain accurately superimposed.

Feature Extraction: it computes the set of features extracted from each input image.

Feature Matching: it allows finding similarities between the previously obtained features.

Pose Optimization: it aims to minimize the distance between both input images.

The result of the registration procedure is then the suitable geometric transformation (e.g. translation, rotation, scaling, etc.) that brings both input images into the same coordinate system such that the distance between them is minimal, i.e. their level of superimposition/overlapping is optimal. An extensive review of this topic is beyond the scope of this work; nevertheless, a short summary is introduced next.
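The Pose Optimization stage above can be sketched for the simplest case: with matched 2D point sets and a translation-only transformation, the least-squares optimal shift has a closed form, the difference of the two centroids. The point coordinates below are illustrative assumptions.

```python
def register_translation(target, moving):
    """Pose optimization, translation-only case: the shift that minimizes
    the summed squared distance between two matched point sets is the
    difference of their centroids (closed form)."""
    n = len(target)
    tx = sum(p[0] for p in target) / n - sum(p[0] for p in moving) / n
    ty = sum(p[1] for p in target) / n - sum(p[1] for p in moving) / n
    aligned = [(x + tx, y + ty) for x, y in moving]
    return aligned, (tx, ty)

target = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
moving = [(2.0, 3.0), (3.0, 3.0), (2.0, 4.0)]  # target shifted by (+2, +3)
aligned, shift = register_translation(target, moving)
```

Real registration pipelines additionally estimate rotation and scaling (e.g. via iterative optimization or learned regression), but this closed-form step illustrates what "minimizing the distance between both input images" means.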

Commonly, the input images for a DL-based registration approach can take various forms, e.g. point clouds, voxel grids, and meshes. Additionally, some techniques accept as input the result of the Feature Extraction or Matching steps of the canonical scheme; correspondingly, the outcome can be data in a particular form as well as the result of any step of the classical pipeline (a feature vector, matching vector, or transformation). With the newest DL-based methods, however, a novel conceptual type of ecosystem arises. It contains acquired characteristics about the target, the materials, and their behavior that can be registered with the input data. Such a conceptual ecosystem is formed by a neural network and its training manner, and it can be counted as an input to the registration approach; nevertheless, it is not an input that one can adopt in every registration situation, since it corresponds to an internal data representation.

From a DL viewpoint, this conceptual design allows differentiating the input data of a registration approach into defined and non-defined models. In particular, defined models depict particular spatial data (e.g. 2D or 3D), while a non-defined model is a generalization of a data set created by a learning system. Yumer et al. [ 315 ] developed a framework in which the model acquires characteristics of objects, learning, for example, what a sportier car or a more comfortable chair looks like, and adjusting a 3D model to fit those characteristics while maintaining the main characteristics of the primary data. Likewise, a fundamental perspective of the unsupervised learning method introduced by Ding et al. [ 316 ] is that there is no target for the registration approach; in this instance, the network is able to place each input point cloud in a global space, solving SLAM problems in which many point clouds have to be registered rigidly. On the other hand, Mahadevan [ 317 ] proposed combining two conceptual models, using the growth of Imagination Machines to yield flexible artificial intelligence systems and relationships between the learned phases through training schemes that are not based on labels and classifications. Another practical application of DL, especially CNNs, to image registration is the 3D reconstruction of objects. Wang et al. [ 318 ] applied an adversarial scheme using CNNs to rebuild a 3D model of an object from its 2D image; the network learns many objects and accomplishes the registration between the image and the conceptual model. Similarly, Hermoza et al. [ 319 ] also utilized a GAN network for predicting the missing geometry of damaged archaeological objects, providing the reconstructed object in a voxel grid format together with a label selecting its class.

DL has numerous applications in medical image registration, which have been surveyed in several review papers [ 320 , 321 , 322 ]. Yang et al. [ 323 ] implemented stacked convolutional layers as an encoder-decoder to predict the morphing of input pixels into their final formation, using MRI brain scans from the OASIS dataset. They employed the Large Deformation Diffeomorphic Metric Mapping (LDDMM) registration model and attained remarkable improvements in computation time. Miao et al. [ 324 ] used synthetic X-ray images to train a five-layer CNN to register 3D models of a trans-esophageal probe, a hand implant, and a knee implant onto 2D X-ray images for pose estimation. Their model achieved an execution time of 0.1 s, an important improvement over conventional intensity-based registration techniques, and produced successful registrations 79–99% of the time. Li et al. [ 325 ] introduced a neural-network-based approach for the non-rigid 2D–3D registration of lateral cephalograms and volumetric cone-beam CT (CBCT) images.

Computational approaches

For computationally intensive applications, complex ML and DL approaches have rapidly been adopted as the most significant techniques and are now widely used in many fields. Improved algorithms, combined with well-behaved computational performance and large datasets, make it possible to effectively execute applications that were previously impossible or too difficult to consider.

Currently, several standard DNN configurations are available; they differ mainly in the total number of layers and the interconnection patterns between layers. Table  2 illustrates the growth of the overall number of layers over time, which appears to be far faster than the growth rate of Moore's law: in typical DNNs, the number of layers grew by around 2.3× per year between 2012 and 2016. Investigations of recent ResNet versions reveal that the number of layers can be extended up to 1000. An SGD technique is typically employed to fit the weights (or parameters), while various optimization techniques are used to update the parameters during DNN training. Many repeated updates are required to improve network accuracy, with a gradually diminishing rate of improvement. For example, training ResNet on a large dataset such as ImageNet, which contains more than 14 million images, takes around 30K–40K iterations to converge to a stable solution. As a rough upper-level estimate, the overall computational load may exceed \(10^{20}\) FLOPS as both the training-set size and the DNN complexity increase.
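The repeated SGD weight updates described above can be illustrated with a minimal sketch on a toy least-squares problem; the learning rate, batch size, and problem dimensions below are illustrative assumptions, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: find w minimizing ||Xw - y||^2 (sizes assumed).
X = rng.normal(size=(256, 8))
w_true = rng.normal(size=8)
y = X @ w_true

w = np.zeros(8)
lr = 0.01        # illustrative learning rate
batch = 32       # illustrative mini-batch size

for step in range(2000):                         # many repeated updates
    idx = rng.integers(0, len(X), size=batch)    # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch      # gradient of the batch loss
    w -= lr * grad                               # SGD parameter update

print(np.abs(w - w_true).max())                  # remaining parameter error
```

Real DNN training replaces the closed-form gradient with backpropagation, but the update rule itself is exactly this loop repeated tens of thousands of times.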

GPU acceleration, adopted around 2008, boosted training to a satisfactory extent; even with GPU support, however, a training session usually takes days or weeks. Several optimization strategies have therefore been developed to reduce this extensive learning time. The computational requirements are expected to keep increasing as DNNs continue to grow in both complexity and size.

In addition to the computational load, memory bandwidth and capacity have a significant effect on overall training performance and, to a lesser extent, on inference. More specifically, in convolutional layers the parameters are shared across the whole input, the amount of data reuse is sizeable, and the computation exhibits a high computation-to-bandwidth ratio. By contrast, fully connected (FC) layers share no parameters, reuse very little data, and have an extremely small computation-to-bandwidth ratio. Table  3 compares different aspects of the candidate devices; it is intended to clarify the tradeoffs involved in choosing the optimal configuration among FPGA-, GPU-, or CPU-based systems. Each has its own strengths and weaknesses; accordingly, there is no clear one-size-fits-all solution.
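The contrast in computation-to-bandwidth ratios can be made concrete with a back-of-envelope sketch; the layer sizes below are illustrative assumptions, not values from the text.

```python
# Back-of-envelope computation-to-bandwidth ratios (all sizes assumed).

# Fully connected layer: every weight is read once and used once per sample.
m, n = 4096, 4096                  # assumed input/output widths
fc_flops = 2 * m * n               # one multiply-add per weight
fc_bytes = 4 * m * n               # fp32 weights streamed from memory
print(fc_flops / fc_bytes)         # ~0.5 FLOPs per byte: bandwidth-bound

# Convolutional layer: each weight is reused at every spatial position.
h = w = 56                         # assumed feature-map size
cin, cout, k = 256, 256, 3         # assumed channels and kernel size
conv_flops = 2 * cin * cout * k * k * h * w
conv_bytes = 4 * cin * cout * k * k
print(conv_flops / conv_bytes)     # thousands of FLOPs per byte: compute-bound
```

The conv ratio exceeds the FC ratio by a factor of h·w, which is why FC-heavy models saturate memory bandwidth long before they saturate the arithmetic units.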

Although GPU processing has enhanced the ability to address the computational challenges of such networks, the maximum GPU (or CPU) performance is rarely achieved, and several techniques and models have turned out to be strongly bandwidth-limited. In the worst cases, GPU efficiency is only 15–20% of the maximum theoretical performance. This issue has driven efforts to enlarge the memory bandwidth using high-bandwidth stacked memory. Next, approaches based on FPGA, GPU, and CPU devices are detailed.

CPU-based approach

The well-behaved performance of CPU nodes usually comes with robust network connectivity, storage capacity, and large memory. Although CPU nodes are more general-purpose than FPGA or GPU nodes, they cannot match them in raw computational capability, which would require greater network capacity and a larger memory bandwidth.

GPU-based approach

GPUs are extremely effective for several basic DL primitives, which include highly parallel operations such as activation functions, matrix multiplication, and convolutions [ 326 , 327 , 328 , 329 , 330 ]. Incorporating HBM stacked memory into up-to-date GPU models significantly enhances the bandwidth; this enhancement allows numerous primitives to efficiently utilize all available GPU computational resources. For dense linear algebra operations, the performance improvement of GPUs over CPUs is usually 10–20:1.

The GPU programming model is built around maximizing parallelism. For example, a GPU may contain up to sixty-four compute units, each with four SIMD engines, and each SIMD engine with sixteen floating-point computation lanes. The peak performance is 25 TFLOPS (fp16) and 10 TFLOPS (fp32) as utilization approaches 100%. Additional GPU performance may be achieved when the vector multiply and add functions are combined into inner-product instructions for the matching matrix-operation primitives.
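As a sanity check on the figures above, the fp32 peak implied by that lane layout follows from a simple product; the clock frequency below is an assumption introduced only to make the arithmetic concrete.

```python
# Peak fp32 throughput implied by the layout described above.
compute_units = 64
simd_per_cu = 4
lanes_per_simd = 16
clock_ghz = 1.5            # assumed clock; not stated in the text
flops_per_lane = 2         # a fused multiply-add counts as two FLOPs

peak_tflops = (compute_units * simd_per_cu * lanes_per_simd
               * flops_per_lane * clock_ghz * 1e9) / 1e12
print(peak_tflops)         # same order of magnitude as the quoted 10 TFLOPS
```

With 4096 lanes at the assumed 1.5 GHz this gives roughly 12 TFLOPS, consistent with the quoted fp32 peak.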

The GPU is usually considered an optimized design for DNN training, but it can also offer considerable performance improvements for inference operations.

FPGA-based approach

FPGAs are widely utilized in various tasks, including deep learning [ 199 , 247 , 331 , 332 , 333 , 334 ]; inference accelerators, in particular, are commonly implemented on FPGAs. An FPGA can be effectively configured to eliminate the unnecessary overhead functions involved in GPU systems. Compared with GPUs, FPGAs deliver weaker floating-point performance and are better suited to integer inference. The main FPGA strength is the capability to dynamically reconfigure the array characteristics at run time, as well as to configure the array through an effective design with little or no overhead.

As mentioned earlier, FPGAs offer performance-per-watt and latency advantages over GPUs and CPUs in DL inference operations. Three factors enable FPGAs to implement DL algorithms with this level of efficiency: custom high-performance hardware, pruned networks, and reduced arithmetic precision. For instance, FPGAs can implement CNN overlay engines with over 80% efficiency, eight-bit accuracy, and over 15 TOPS peak performance for a few conventional CNNs, as Xilinx and its partners recently demonstrated. By contrast, pruning techniques are mostly employed in the LSTM context; model sizes can be efficiently reduced by up to 20×, which provides an important benefit during implementation of the optimal solution, as demonstrated for MLP neural processing. Recent work on custom floating-point and fixed-point precision has shown that going below 8 bits is extremely promising and further advances the peak performance of FPGA implementations of DNN models.

Evaluation metrics

Evaluation metrics play a crucial role in achieving an optimized classifier within DL tasks [ 335 ]. They are utilized at two main stages of a typical data-classification procedure: training and testing. During the training stage, the evaluation metric is used to optimize the classification algorithm, i.e. as a discriminator that selects the optimized solution able to generate a more accurate forecast of upcoming evaluations for a specific classifier. During the testing stage, the evaluation metric is used to measure the efficiency of the created classifier on held-out data, i.e. as an evaluator. As used in Eq. 20 , TN and TP denote the numbers of correctly classified negative and positive instances, respectively, while FN and FP denote the numbers of misclassified positive and negative instances, respectively. Some of the most well-known evaluation metrics are listed below.

Accuracy: Calculates the ratio of correctly predicted samples to the total number of samples evaluated (Eq. 20 ).

Sensitivity or Recall: Utilized to calculate the fraction of positive patterns that are correctly classified (Eq. 21 ).

Specificity: Utilized to calculate the fraction of negative patterns that are correctly classified (Eq. 22 ).

Precision: Utilized to calculate the fraction of correctly predicted positive patterns among all patterns predicted as positive (Eq. 23 ).

F1-Score: Calculates the harmonic average between recall and precision rates (Eq. 24 ).

J Score: This metric is also called Youden's J statistic. It is given in Eq. 25 .

False Positive Rate (FPR): This metric refers to the probability of a false alarm; it is calculated in Eq. 26 .
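All of the metrics in Eqs. 20–26 follow directly from the four counts TP, TN, FP, and FN; the sketch below computes them for an assumed confusion matrix (the counts are illustrative, not from the text).

```python
# Confusion-matrix metrics corresponding to Eqs. 20-26 (counts assumed).
tp, tn, fp, fn = 80, 90, 10, 20

accuracy    = (tp + tn) / (tp + tn + fp + fn)                # Eq. 20
recall      = tp / (tp + fn)                                 # Eq. 21 (sensitivity)
specificity = tn / (tn + fp)                                 # Eq. 22
precision   = tp / (tp + fp)                                 # Eq. 23
f1          = 2 * precision * recall / (precision + recall)  # Eq. 24
j_score     = recall + specificity - 1                       # Eq. 25 (Youden's J)
fpr         = fp / (fp + tn)                                 # Eq. 26

print(accuracy, recall, specificity, precision, f1, j_score, fpr)
```

Note that FPR = 1 − specificity, so Youden's J can equivalently be written as recall − FPR.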

Area Under the ROC Curve: AUC is a common ranking-type metric. It is utilized to compare learning algorithms [ 336 , 337 , 338 ], as well as to construct an optimal learning model [ 339 , 340 ]. In contrast to probability and threshold metrics, the AUC value reflects the overall ranking performance of a classifier. For a two-class problem, the AUC value is calculated with the following formula [ 341 ] (Eq. 27 ):

Here, \(S_{p}\) represents the sum of the ranks of all positive samples, and \(n_{n}\) and \(n_{p}\) denote the numbers of negative and positive samples, respectively. Compared with accuracy metrics, the AUC has been verified both empirically and theoretically to be very helpful for identifying an optimized solution and for evaluating classifier performance during training.

The AUC performs brilliantly in both the discrimination and evaluation processes. However, for multiclass problems, the AUC computation is mainly cost-effective when discriminating among a large number of generated solutions. The time complexity of computing the AUC is \(O \left( |C|^{2} \; n\log n\right) \) for the Hand and Till AUC model [ 341 ] and \(O \left( |C| \; n\log n\right) \) for Provost and Domingos' AUC model [ 336 ].
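A rank-based form of the two-class AUC, consistent with the description of \(S_{p}\), \(n_{p}\), and \(n_{n}\) given above, can be sketched as follows; the scores and labels below are illustrative assumptions.

```python
import numpy as np

# Rank-based two-class AUC using the sum of the ranks of positive samples
# (S_p), with n_p positives and n_n negatives. Scores/labels are assumed.
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2])
labels = np.array([1,   1,   0,   1,   0,    1,   0,   0  ])  # 1 = positive

ranks = scores.argsort().argsort() + 1      # rank 1 = lowest score
n_p = int(labels.sum())
n_n = len(labels) - n_p
s_p = ranks[labels == 1].sum()              # sum of ranks of positive samples

auc = (s_p - n_p * (n_p + 1) / 2) / (n_p * n_n)
print(auc)
```

This equals the fraction of (positive, negative) pairs in which the positive sample is ranked above the negative one, which is the usual probabilistic reading of the AUC.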

Frameworks and datasets

Several DL frameworks and datasets have been developed over the last few years, and various frameworks and libraries have been used to expedite work and obtain good results. Their use has made the training process considerably easier. Table  4 lists the most widely utilized frameworks and libraries.

Based on star ratings on GitHub, as well as our own background in the field, TensorFlow is deemed the most effective and easiest to use, and it has the ability to work on several platforms. (GitHub is one of the biggest software hosting sites, and GitHub stars indicate how well-regarded a project is on the site.) Moreover, several other benchmark datasets are employed for different DL tasks; some of these are listed in Table  5 .

Summary and conclusion

Finally, we conclude with a brief discussion that gathers the most relevant findings of this extensive review. An itemized analysis is presented below to summarize the review and outline future directions.

DL still experiences difficulties in simultaneously modeling multiple complex data modalities. Multimodal DL has therefore become a common approach in recent DL developments.

DL requires sizeable datasets (preferably labeled) to train models and predict unseen data. This challenge becomes particularly difficult when real-time data processing is required or when the available datasets are limited (as in the case of healthcare data). To alleviate this issue, transfer learning (TL) and data augmentation have been actively researched over the last few years.

Although ML is slowly transitioning to semi-supervised and unsupervised learning to manage practical data without the need for manual human labeling, many current deep-learning models still rely on supervised learning.

CNN performance is greatly influenced by hyperparameter selection: any small change in hyperparameter values affects overall CNN performance. Careful hyperparameter selection is therefore an extremely significant issue that should be considered when developing optimization schemes.

Powerful and robust hardware resources such as GPUs are required for effective CNN training. They are also required for exploring the efficiency of CNNs in smart and embedded systems.

In the CNN context, ensemble learning [ 342 , 343 ] represents a promising research area. Combining multiple, diverse architectures can improve a model's generalizability across different image categories by extracting several levels of semantic image representation. Similarly, ideas such as new activation functions, dropout, and batch normalization also merit further investigation.

Exploiting depth and different structural adaptations significantly improves CNN learning capacity. Substituting blocks for the traditional layer configuration has yielded significant advances in CNN performance, as shown in the recent literature. Developing novel and efficient block architectures is currently the main trend in research on CNN architectures; HRNet is just one example showing that there are always ways to improve the architecture further.

It is expected that cloud-based platforms will play an essential role in the future development of computational DL applications. Utilizing cloud computing offers a solution for handling the enormous amounts of data involved; it also helps to increase efficiency and reduce costs, and offers the flexibility to train DL architectures.

With recent developments in computational tools, including neural-network chips and mobile GPUs, we will see more DL applications on mobile devices, making DL easier for users to access.

Regarding the lack of training data, it is expected that various transfer-learning techniques will be employed, such as training a DL model on a large unlabeled image dataset and then transferring that knowledge to train the model on a small number of labeled images for the same task.

Last, this overview provides a starting point for researchers interested in the field of DL and should help them choose the most suitable directions of work for providing more accurate alternatives in the field.

Availability of data and materials

Not applicable.

Rozenwald MB, Galitsyna AA, Sapunov GV, Khrameeva EE, Gelfand MS. A machine learning framework for the prediction of chromatin folding in Drosophila using epigenetic features. PeerJ Comput Sci. 2020;6:307.

Amrit C, Paauw T, Aly R, Lavric M. Identifying child abuse through text mining and machine learning. Expert Syst Appl. 2017;88:402–18.

Hossain E, Khan I, Un-Noor F, Sikander SS, Sunny MSH. Application of big data and machine learning in smart grid, and associated security concerns: a review. IEEE Access. 2019;7:13960–88.

Crawford M, Khoshgoftaar TM, Prusa JD, Richter AN, Al Najada H. Survey of review spam detection using machine learning techniques. J Big Data. 2015;2(1):23.

Deldjoo Y, Elahi M, Cremonesi P, Garzotto F, Piazzolla P, Quadrana M. Content-based video recommendation system based on stylistic visual features. J Data Semant. 2016;5(2):99–113.

Al-Dulaimi K, Chandran V, Nguyen K, Banks J, Tomeo-Reyes I. Benchmarking hep-2 specimen cells classification using linear discriminant analysis on higher order spectra features of cell shape. Pattern Recogn Lett. 2019;125:534–41.

Liu W, Wang Z, Liu X, Zeng N, Liu Y, Alsaadi FE. A survey of deep neural network architectures and their applications. Neurocomputing. 2017;234:11–26.

Pouyanfar S, Sadiq S, Yan Y, Tian H, Tao Y, Reyes MP, Shyu ML, Chen SC, Iyengar S. A survey on deep learning: algorithms, techniques, and applications. ACM Comput Surv (CSUR). 2018;51(5):1–36.

Alom MZ, Taha TM, Yakopcic C, Westberg S, Sidike P, Nasrin MS, Hasan M, Van Essen BC, Awwal AA, Asari VK. A state-of-the-art survey on deep learning theory and architectures. Electronics. 2019;8(3):292.

Potok TE, Schuman C, Young S, Patton R, Spedalieri F, Liu J, Yao KT, Rose G, Chakma G. A study of complex deep learning networks on high-performance, neuromorphic, and quantum computers. ACM J Emerg Technol Comput Syst (JETC). 2018;14(2):1–21.

Adeel A, Gogate M, Hussain A. Contextual deep learning-based audio-visual switching for speech enhancement in real-world environments. Inf Fusion. 2020;59:163–70.

Tian H, Chen SC, Shyu ML. Evolutionary programming based deep learning feature selection and network construction for visual data classification. Inf Syst Front. 2020;22(5):1053–66.

Young T, Hazarika D, Poria S, Cambria E. Recent trends in deep learning based natural language processing. IEEE Comput Intell Mag. 2018;13(3):55–75.

Koppe G, Meyer-Lindenberg A, Durstewitz D. Deep learning for small and big data in psychiatry. Neuropsychopharmacology. 2021;46(1):176–90.

Dalal N, Triggs B. Histograms of oriented gradients for human detection. In: 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR’05), vol. 1. IEEE; 2005. p. 886–93.

Lowe DG. Object recognition from local scale-invariant features. In: Proceedings of the seventh IEEE international conference on computer vision, vol. 2. IEEE; 1999. p. 1150–7.

Wu L, Hoi SC, Yu N. Semantics-preserving bag-of-words models and applications. IEEE Trans Image Process. 2010;19(7):1908–20.

LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521(7553):436–44.

Yao G, Lei T, Zhong J. A review of convolutional-neural-network-based action recognition. Pattern Recogn Lett. 2019;118:14–22.

Dhillon A, Verma GK. Convolutional neural network: a review of models, methodologies and applications to object detection. Prog Artif Intell. 2020;9(2):85–112.

Khan A, Sohail A, Zahoora U, Qureshi AS. A survey of the recent architectures of deep convolutional neural networks. Artif Intell Rev. 2020;53(8):5455–516.

Hasan RI, Yusuf SM, Alzubaidi L. Review of the state of the art of deep learning for plant diseases: a broad analysis and discussion. Plants. 2020;9(10):1302.

Xiao Y, Tian Z, Yu J, Zhang Y, Liu S, Du S, Lan X. A review of object detection based on deep learning. Multimed Tools Appl. 2020;79(33):23729–91.

Ker J, Wang L, Rao J, Lim T. Deep learning applications in medical image analysis. IEEE Access. 2017;6:9375–89.

Zhang Z, Cui P, Zhu W. Deep learning on graphs: a survey. IEEE Trans Knowl Data Eng. 2020. https://doi.org/10.1109/TKDE.2020.2981333 .

Shrestha A, Mahmood A. Review of deep learning algorithms and architectures. IEEE Access. 2019;7:53040–65.

Najafabadi MM, Villanustre F, Khoshgoftaar TM, Seliya N, Wald R, Muharemagic E. Deep learning applications and challenges in big data analytics. J Big Data. 2015;2(1):1.

Goodfellow I, Bengio Y, Courville A, Bengio Y. Deep learning, vol. 1. Cambridge: MIT press; 2016.

Shorten C, Khoshgoftaar TM, Furht B. Deep learning applications for COVID-19. J Big Data. 2021;8(1):1–54.

Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. Commun ACM. 2017;60(6):84–90.

Bhowmick S, Nagarajaiah S, Veeraraghavan A. Vision and deep learning-based algorithms to detect and quantify cracks on concrete surfaces from uav videos. Sensors. 2020;20(21):6299.

Goh GB, Hodas NO, Vishnu A. Deep learning for computational chemistry. J Comput Chem. 2017;38(16):1291–307.

Li Y, Zhang T, Sun S, Gao X. Accelerating flash calculation through deep learning methods. J Comput Phys. 2019;394:153–65.

Yang W, Zhang X, Tian Y, Wang W, Xue JH, Liao Q. Deep learning for single image super-resolution: a brief review. IEEE Trans Multimed. 2019;21(12):3106–21.

Tang J, Li S, Liu P. A review of lane detection methods based on deep learning. Pattern Recogn. 2020;111:107623.

Zhao ZQ, Zheng P, Xu ST, Wu X. Object detection with deep learning: a review. IEEE Trans Neural Netw Learn Syst. 2019;30(11):3212–32.

He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2016. p. 770–8.

Ng A. Machine learning yearning: technical strategy for AI engineers in the era of deep learning. 2019. https://www.mlyearning.org .

Metz C. Turing award won by 3 pioneers in artificial intelligence. The New York Times. 2019;27.

Nevo S, Anisimov V, Elidan G, El-Yaniv R, Giencke P, Gigi Y, Hassidim A, Moshe Z, Schlesinger M, Shalev G, et al. Ml for flood forecasting at scale; 2019. arXiv preprint arXiv:1901.09583 .

Chen H, Engkvist O, Wang Y, Olivecrona M, Blaschke T. The rise of deep learning in drug discovery. Drug Discov Today. 2018;23(6):1241–50.

Benhammou Y, Achchab B, Herrera F, Tabik S. Breakhis based breast cancer automatic diagnosis using deep learning: taxonomy, survey and insights. Neurocomputing. 2020;375:9–24.

Wulczyn E, Steiner DF, Xu Z, Sadhwani A, Wang H, Flament-Auvigne I, Mermel CH, Chen PHC, Liu Y, Stumpe MC. Deep learning-based survival prediction for multiple cancer types using histopathology images. PLoS ONE. 2020;15(6):e0233678.

Nagpal K, Foote D, Liu Y, Chen PHC, Wulczyn E, Tan F, Olson N, Smith JL, Mohtashamian A, Wren JH, et al. Development and validation of a deep learning algorithm for improving Gleason scoring of prostate cancer. NPJ Digit Med. 2019;2(1):1–10.

Esteva A, Kuprel B, Novoa RA, Ko J, Swetter SM, Blau HM, Thrun S. Dermatologist-level classification of skin cancer with deep neural networks. Nature. 2017;542(7639):115–8.

Brunese L, Mercaldo F, Reginelli A, Santone A. Explainable deep learning for pulmonary disease and coronavirus COVID-19 detection from X-rays. Comput Methods Programs Biomed. 2020;196(105):608.

Jamshidi M, Lalbakhsh A, Talla J, Peroutka Z, Hadjilooei F, Lalbakhsh P, Jamshidi M, La Spada L, Mirmozafari M, Dehghani M, et al. Artificial intelligence and COVID-19: deep learning approaches for diagnosis and treatment. IEEE Access. 2020;8:109581–95.

Shorfuzzaman M, Hossain MS. Metacovid: a siamese neural network framework with contrastive loss for n-shot diagnosis of COVID-19 patients. Pattern Recogn. 2020;113:107700.

Carvelli L, Olesen AN, Brink-Kjær A, Leary EB, Peppard PE, Mignot E, Sørensen HB, Jennum P. Design of a deep learning model for automatic scoring of periodic and non-periodic leg movements during sleep validated against multiple human experts. Sleep Med. 2020;69:109–19.

De Fauw J, Ledsam JR, Romera-Paredes B, Nikolov S, Tomasev N, Blackwell S, Askham H, Glorot X, O’Donoghue B, Visentin D, et al. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nat Med. 2018;24(9):1342–50.

Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med. 2019;25(1):44–56.

Kermany DS, Goldbaum M, Cai W, Valentim CC, Liang H, Baxter SL, McKeown A, Yang G, Wu X, Yan F, et al. Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell. 2018;172(5):1122–31.

Van Essen B, Kim H, Pearce R, Boakye K, Chen B. Lbann: livermore big artificial neural network HPC toolkit. In: Proceedings of the workshop on machine learning in high-performance computing environments; 2015. p. 1–6.

Saeed MM, Al Aghbari Z, Alsharidah M. Big data clustering techniques based on spark: a literature review. PeerJ Comput Sci. 2020;6:321.

Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidjeland AK, Ostrovski G, et al. Human-level control through deep reinforcement learning. Nature. 2015;518(7540):529–33.

Arulkumaran K, Deisenroth MP, Brundage M, Bharath AA. Deep reinforcement learning: a brief survey. IEEE Signal Process Mag. 2017;34(6):26–38.

Socher R, Perelygin A, Wu J, Chuang J, Manning CD, Ng AY, Potts C. Recursive deep models for semantic compositionality over a sentiment treebank. In: Proceedings of the 2013 conference on empirical methods in natural language processing; 2013. p. 1631–42.

Goller C, Kuchler A. Learning task-dependent distributed representations by backpropagation through structure. In: Proceedings of international conference on neural networks (ICNN’96), vol 1. IEEE; 1996. p. 347–52.

Socher R, Lin CCY, Ng AY, Manning CD. Parsing natural scenes and natural language with recursive neural networks. In: ICML; 2011.

Louppe G, Cho K, Becot C, Cranmer K. QCD-aware recursive neural networks for jet physics. J High Energy Phys. 2019;2019(1):57.

Sadr H, Pedram MM, Teshnehlab M. A robust sentiment analysis method based on sequential combination of convolutional and recursive neural networks. Neural Process Lett. 2019;50(3):2745–61.

Urban G, Subrahmanya N, Baldi P. Inner and outer recursive neural networks for chemoinformatics applications. J Chem Inf Model. 2018;58(2):207–11.

Hewamalage H, Bergmeir C, Bandara K. Recurrent neural networks for time series forecasting: current status and future directions. Int J Forecast. 2020;37(1):388–427.

Jiang Y, Kim H, Asnani H, Kannan S, Oh S, Viswanath P. Learn codes: inventing low-latency codes via recurrent neural networks. IEEE J Sel Areas Inf Theory. 2020;1(1):207–16.

John RA, Acharya J, Zhu C, Surendran A, Bose SK, Chaturvedi A, Tiwari N, Gao Y, He Y, Zhang KK, et al. Optogenetics inspired transition metal dichalcogenide neuristors for in-memory deep recurrent neural networks. Nat Commun. 2020;11(1):1–9.

Batur Dinler Ö, Aydin N. An optimal feature parameter set based on gated recurrent unit recurrent neural networks for speech segment detection. Appl Sci. 2020;10(4):1273.

Jagannatha AN, Yu H. Structured prediction models for RNN based sequence labeling in clinical text. In: Proceedings of the conference on empirical methods in natural language processing. conference on empirical methods in natural language processing, vol. 2016, NIH Public Access; 2016. p. 856.

Pascanu R, Gulcehre C, Cho K, Bengio Y. How to construct deep recurrent neural networks. In: Proceedings of the second international conference on learning representations (ICLR 2014); 2014.

Glorot X, Bengio Y. Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the thirteenth international conference on artificial intelligence and statistics; 2010. p. 249–56.

Gao C, Yan J, Zhou S, Varshney PK, Liu H. Long short-term memory-based deep recurrent neural networks for target tracking. Inf Sci. 2019;502:279–96.

Zhou DX. Theory of deep convolutional neural networks: downsampling. Neural Netw. 2020;124:319–27.

Jhong SY, Tseng PY, Siriphockpirom N, Hsia CH, Huang MS, Hua KL, Chen YY. An automated biometric identification system using CNN-based palm vein recognition. In: 2020 international conference on advanced robotics and intelligent systems (ARIS). IEEE; 2020. p. 1–6.

Al-Azzawi A, Ouadou A, Max H, Duan Y, Tanner JJ, Cheng J. Deepcryopicker: fully automated deep neural network for single protein particle picking in cryo-EM. BMC Bioinform. 2020;21(1):1–38.

Wang T, Lu C, Yang M, Hong F, Liu C. A hybrid method for heartbeat classification via convolutional neural networks, multilayer perceptrons and focal loss. PeerJ Comput Sci. 2020;6:324.

Li G, Zhang M, Li J, Lv F, Tong G. Efficient densely connected convolutional neural networks. Pattern Recogn. 2021;109:107610.

Gu J, Wang Z, Kuen J, Ma L, Shahroudy A, Shuai B, Liu T, Wang X, Wang G, Cai J, et al. Recent advances in convolutional neural networks. Pattern Recogn. 2018;77:354–77.

Fang W, Love PE, Luo H, Ding L. Computer vision for behaviour-based safety in construction: a review and future directions. Adv Eng Inform. 2020;43:100980.

Palaz D, Magimai-Doss M, Collobert R. End-to-end acoustic modeling using convolutional neural networks for hmm-based automatic speech recognition. Speech Commun. 2019;108:15–32.

Li HC, Deng ZY, Chiang HH. Lightweight and resource-constrained learning network for face recognition with performance optimization. Sensors. 2020;20(21):6114.

Hubel DH, Wiesel TN. Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. J Physiol. 1962;160(1):106.

Ioffe S, Szegedy C. Batch normalization: accelerating deep network training by reducing internal covariate shift; 2015. arXiv preprint arXiv:1502.03167 .

Ruder S. An overview of gradient descent optimization algorithms; 2016. arXiv preprint arXiv:1609.04747 .

Bottou L. Large-scale machine learning with stochastic gradient descent. In: Proceedings of COMPSTAT’2010. Springer; 2010. p. 177–86.

Hinton G, Srivastava N, Swersky K. Neural networks for machine learning lecture 6a overview of mini-batch gradient descent. Cited on. 2012;14(8).

Zhang Z. Improved Adam optimizer for deep neural networks. In: 2018 IEEE/ACM 26th international symposium on quality of service (IWQoS). IEEE; 2018. p. 1–2.

Alzubaidi L, Fadhel MA, Al-Shamma O, Zhang J, Duan Y. Deep learning models for classification of red blood cells in microscopy images to aid in sickle cell anemia diagnosis. Electronics. 2020;9(3):427.

Alzubaidi L, Fadhel MA, Al-Shamma O, Zhang J, Santamaría J, Duan Y, Oleiwi SR. Towards a better understanding of transfer learning for medical imaging: a case study. Appl Sci. 2020;10(13):4523.

Alzubaidi L, Al-Shamma O, Fadhel MA, Farhan L, Zhang J, Duan Y. Optimizing the performance of breast cancer classification by employing the same domain transfer learning from hybrid deep convolutional neural network model. Electronics. 2020;9(3):445.

LeCun Y, Jackel LD, Bottou L, Cortes C, Denker JS, Drucker H, Guyon I, Muller UA, Sackinger E, Simard P, et al. Learning algorithms for classification: a comparison on handwritten digit recognition. Neural Netw Stat Mech Perspect. 1995;261:276.

Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res. 2014;15(1):1929–58.

Dahl GE, Sainath TN, Hinton GE. Improving deep neural networks for LVCSR using rectified linear units and dropout. In: 2013 IEEE international conference on acoustics, speech and signal processing. IEEE; 2013. p. 8609–13.

Xu B, Wang N, Chen T, Li M. Empirical evaluation of rectified activations in convolutional network; 2015. arXiv preprint arXiv:1505.00853 .

Hochreiter S. The vanishing gradient problem during learning recurrent neural nets and problem solutions. Int J Uncertain Fuzziness Knowl Based Syst. 1998;6(02):107–16.

Lin M, Chen Q, Yan S. Network in network; 2013. arXiv preprint arXiv:1312.4400 .

Hsiao TY, Chang YC, Chou HH, Chiu CT. Filter-based deep-compression with global average pooling for convolutional networks. J Syst Arch. 2019;95:9–18.

Li Z, Wang SH, Fan RR, Cao G, Zhang YD, Guo T. Teeth category classification via seven-layer deep convolutional neural network with max pooling and global average pooling. Int J Imaging Syst Technol. 2019;29(4):577–83.

Zeiler MD, Fergus R. Visualizing and understanding convolutional networks. In: European conference on computer vision. Springer; 2014. p. 818–33.

Erhan D, Bengio Y, Courville A, Vincent P. Visualizing higher-layer features of a deep network. Univ Montreal. 2009;1341(3):1.

Le QV. Building high-level features using large scale unsupervised learning. In: 2013 IEEE international conference on acoustics, speech and signal processing. IEEE; 2013. p. 8595–8.

Grün F, Rupprecht C, Navab N, Tombari F. A taxonomy and library for visualizing learned features in convolutional neural networks; 2016. arXiv preprint arXiv:1606.07757 .

Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition; 2014. arXiv preprint arXiv:1409.1556 .

Ranzato M, Huang FJ, Boureau YL, LeCun Y. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In: 2007 IEEE conference on computer vision and pattern recognition. IEEE; 2007. p. 1–8.

Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A. Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2015. p. 1–9.

Bengio Y, et al. Rmsprop and equilibrated adaptive learning rates for nonconvex optimization; 2015. arXiv:1502.04390 corr abs/1502.04390

Srivastava RK, Greff K, Schmidhuber J. Highway networks; 2015. arXiv preprint arXiv:1505.00387 .

Kong W, Dong ZY, Jia Y, Hill DJ, Xu Y, Zhang Y. Short-term residential load forecasting based on LSTM recurrent neural network. IEEE Trans Smart Grid. 2017;10(1):841–51.

Ordóñez FJ, Roggen D. Deep convolutional and LSTM recurrent neural networks for multimodal wearable activity recognition. Sensors. 2016;16(1):115.

CireşAn D, Meier U, Masci J, Schmidhuber J. Multi-column deep neural network for traffic sign classification. Neural Netw. 2012;32:333–8.

Szegedy C, Ioffe S, Vanhoucke V, Alemi A. Inception-v4, inception-resnet and the impact of residual connections on learning; 2016. arXiv preprint arXiv:1602.07261 .

Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2016. p. 2818–26.

Wu S, Zhong S, Liu Y. Deep residual learning for image steganalysis. Multimed Tools Appl. 2018;77(9):10437–53.

Huang G, Liu Z, Van Der Maaten L, Weinberger KQ. Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. p. 4700–08.

Rubin J, Parvaneh S, Rahman A, Conroy B, Babaeizadeh S. Densely connected convolutional networks for detection of atrial fibrillation from short single-lead ECG recordings. J Electrocardiol. 2018;51(6):S18-21.

Kuang P, Ma T, Chen Z, Li F. Image super-resolution with densely connected convolutional networks. Appl Intell. 2019;49(1):125–36.

Xie S, Girshick R, Dollár P, Tu Z, He K. Aggregated residual transformations for deep neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. p. 1492–500.

Su A, He X, Zhao X. Jpeg steganalysis based on ResNeXt with gauss partial derivative filters. Multimed Tools Appl. 2020;80(3):3349–66.

Yadav D, Jalal A, Garlapati D, Hossain K, Goyal A, Pant G. Deep learning-based ResNeXt model in phycological studies for future. Algal Res. 2020;50:102018.

Han W, Feng R, Wang L, Gao L. Adaptive spatial-scale-aware deep convolutional neural network for high-resolution remote sensing imagery scene classification. In: IGARSS 2018-2018 IEEE international geoscience and remote sensing symposium. IEEE; 2018. p. 4736–9.

Zagoruyko S, Komodakis N. Wide residual networks; 2016. arXiv preprint arXiv:1605.07146 .

Huang G, Sun Y, Liu Z, Sedra D, Weinberger KQ. Deep networks with stochastic depth. In: European conference on computer vision. Springer; 2016. p. 646–61.

Huynh HT, Nguyen H. Joint age estimation and gender classification of Asian faces using wide ResNet. SN Comput Sci. 2020;1(5):1–9.

Takahashi R, Matsubara T, Uehara K. Data augmentation using random image cropping and patching for deep cnns. IEEE Trans Circuits Syst Video Technol. 2019;30(9):2917–31.

Han D, Kim J, Kim J. Deep pyramidal residual networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. p. 5927–35.

Wang Y, Wang L, Wang H, Li P. End-to-end image super-resolution via deep and shallow convolutional networks. IEEE Access. 2019;7:31959–70.

Chollet F. Xception: Deep learning with depthwise separable convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. p. 1251–8.

Lo WW, Yang X, Wang Y. An xception convolutional neural network for malware classification with transfer learning. In: 2019 10th IFIP international conference on new technologies, mobility and security (NTMS). IEEE; 2019. p. 1–5.

Rahimzadeh M, Attar A. A modified deep convolutional neural network for detecting COVID-19 and pneumonia from chest X-ray images based on the concatenation of xception and resnet50v2. Inform Med Unlocked. 2020;19:100360.

Wang F, Jiang M, Qian C, Yang S, Li C, Zhang H, Wang X, Tang X. Residual attention network for image classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. p. 3156–64.

Salakhutdinov R, Larochelle H. Efficient learning of deep boltzmann machines. In: Proceedings of the thirteenth international conference on artificial intelligence and statistics; 2010. p. 693–700.

Goh H, Thome N, Cord M, Lim JH. Top-down regularization of deep belief networks. Adv Neural Inf Process Syst. 2013;26:1878–86.

Guan J, Lai R, Xiong A, Liu Z, Gu L. Fixed pattern noise reduction for infrared images based on cascade residual attention CNN. Neurocomputing. 2020;377:301–13.

Bi Q, Qin K, Zhang H, Li Z, Xu K. RADC-Net: a residual attention based convolution network for aerial scene classification. Neurocomputing. 2020;377:345–59.

Jaderberg M, Simonyan K, Zisserman A, et al. Spatial transformer networks. In: Advances in neural information processing systems. San Mateo: Morgan Kaufmann Publishers; 2015. p. 2017–25.

Hu J, Shen L, Sun G. Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2018. p. 7132–41.

Mou L, Zhu XX. Learning to pay attention on spectral domain: a spectral attention module-based convolutional network for hyperspectral image classification. IEEE Trans Geosci Remote Sens. 2019;58(1):110–22.

Woo S, Park J, Lee JY, So Kweon I. CBAM: Convolutional block attention module. In: Proceedings of the European conference on computer vision (ECCV); 2018. p. 3–19.

Roy AG, Navab N, Wachinger C. Concurrent spatial and channel ‘squeeze & excitation’ in fully convolutional networks. In: International conference on medical image computing and computer-assisted intervention. Springer; 2018. p. 421–9.

Roy AG, Navab N, Wachinger C. Recalibrating fully convolutional networks with spatial and channel “squeeze and excitation’’ blocks. IEEE Trans Med Imaging. 2018;38(2):540–9.

Sabour S, Frosst N, Hinton GE. Dynamic routing between capsules. In: Advances in neural information processing systems. San Mateo: Morgan Kaufmann Publishers; 2017. p. 3856–66.

Arun P, Buddhiraju KM, Porwal A. Capsulenet-based spatial-spectral classifier for hyperspectral images. IEEE J Sel Topics Appl Earth Obs Remote Sens. 2019;12(6):1849–65.

Xinwei L, Lianghao X, Yi Y. Compact video fingerprinting via an improved capsule net. Syst Sci Control Eng. 2020;9:1–9.

Ma B, Li X, Xia Y, Zhang Y. Autonomous deep learning: a genetic DCNN designer for image classification. Neurocomputing. 2020;379:152–61.

Wang J, Sun K, Cheng T, Jiang B, Deng C, Zhao Y, Liu D, Mu Y, Tan M, Wang X, et al. Deep high-resolution representation learning for visual recognition. IEEE Trans Pattern Anal Mach Intell. 2020. https://doi.org/10.1109/TPAMI.2020.2983686 .

Cheng B, Xiao B, Wang J, Shi H, Huang TS, Zhang L. Higherhrnet: scale-aware representation learning for bottom-up human pose estimation. In: CVPR 2020; 2020. https://www.microsoft.com/en-us/research/publication/higherhrnet-scale-aware-representation-learning-for-bottom-up-human-pose-estimation/ .

Karimi H, Derr T, Tang J. Characterizing the decision boundary of deep neural networks; 2019. arXiv preprint arXiv:1912.11460 .

Li Y, Ding L, Gao X. On the decision boundary of deep neural networks; 2018. arXiv preprint arXiv:1808.05385 .

Yosinski J, Clune J, Bengio Y, Lipson H. How transferable are features in deep neural networks? In: Advances in neural information processing systems. San Mateo: Morgan Kaufmann Publishers; 2014. p. 3320–8.

Tan C, Sun F, Kong T, Zhang W, Yang C, Liu C. A survey on deep transfer learning. In: International conference on artificial neural networks. Springer; 2018. p. 270–9.

Weiss K, Khoshgoftaar TM, Wang D. A survey of transfer learning. J Big Data. 2016;3(1):9.

Shorten C, Khoshgoftaar TM. A survey on image data augmentation for deep learning. J Big Data. 2019;6(1):60.

Wang F, Wang H, Wang H, Li G, Situ G. Learning from simulation: an end-to-end deep-learning approach for computational ghost imaging. Opt Express. 2019;27(18):25560–72.

Pan W. A survey of transfer learning for collaborative recommendation with auxiliary data. Neurocomputing. 2016;177:447–53.

Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L. Imagenet: a large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. IEEE; 2009. p. 248–55.

Cook D, Feuz KD, Krishnan NC. Transfer learning for activity recognition: a survey. Knowl Inf Syst. 2013;36(3):537–56.

Cao X, Wang Z, Yan P, Li X. Transfer learning for pedestrian detection. Neurocomputing. 2013;100:51–7.

Raghu M, Zhang C, Kleinberg J, Bengio S. Transfusion: understanding transfer learning for medical imaging. In: Advances in neural information processing systems. San Mateo: Morgan Kaufmann Publishers; 2019. p. 3347–57.

Pham TN, Van Tran L, Dao SVT. Early disease classification of mango leaves using feed-forward neural network and hybrid metaheuristic feature selection. IEEE Access. 2020;8:189960–73.

Saleh AM, Hamoud T. Analysis and best parameters selection for person recognition based on gait model using CNN algorithm and image augmentation. J Big Data. 2021;8(1):1–20.

Hirahara D, Takaya E, Takahara T, Ueda T. Effects of data count and image scaling on deep learning training. PeerJ Comput Sci. 2020;6:312.

Moreno-Barea FJ, Strazzera F, Jerez JM, Urda D, Franco L. Forward noise adjustment scheme for data augmentation. In: 2018 IEEE symposium series on computational intelligence (SSCI). IEEE; 2018. p. 728–34.

Dua D, Karra Taniskidou E. Uci machine learning repository. Irvine: University of california. School of Information and Computer Science; 2017. http://archive.ics.uci.edu/ml

Johnson JM, Khoshgoftaar TM. Survey on deep learning with class imbalance. J Big Data. 2019;6(1):27.

Yang P, Zhang Z, Zhou BB, Zomaya AY. Sample subset optimization for classifying imbalanced biological data. In: Pacific-Asia conference on knowledge discovery and data mining. Springer; 2011. p. 333–44.

Yang P, Yoo PD, Fernando J, Zhou BB, Zhang Z, Zomaya AY. Sample subset optimization techniques for imbalanced and ensemble learning problems in bioinformatics applications. IEEE Trans Cybern. 2013;44(3):445–55.

Wang S, Sun S, Xu J. Auc-maximized deep convolutional neural fields for sequence labeling 2015. arXiv preprint arXiv:1511.05265 .

Li Y, Wang S, Umarov R, Xie B, Fan M, Li L, Gao X. Deepre: sequence-based enzyme EC number prediction by deep learning. Bioinformatics. 2018;34(5):760–9.

Li Y, Huang C, Ding L, Li Z, Pan Y, Gao X. Deep learning in bioinformatics: introduction, application, and perspective in the big data era. Methods. 2019;166:4–21.

Choi E, Bahadori MT, Sun J, Kulas J, Schuetz A, Stewart W. Retain: An interpretable predictive model for healthcare using reverse time attention mechanism. In: Advances in neural information processing systems. San Mateo: Morgan Kaufmann Publishers; 2016. p. 3504–12.

Ching T, Himmelstein DS, Beaulieu-Jones BK, Kalinin AA, Do BT, Way GP, Ferrero E, Agapow PM, Zietz M, Hoffman MM, et al. Opportunities and obstacles for deep learning in biology and medicine. J R Soc Interface. 2018;15(141):20170,387.

Zhou J, Troyanskaya OG. Predicting effects of noncoding variants with deep learning-based sequence model. Nat Methods. 2015;12(10):931–4.

Pokuri BSS, Ghosal S, Kokate A, Sarkar S, Ganapathysubramanian B. Interpretable deep learning for guided microstructure-property explorations in photovoltaics. NPJ Comput Mater. 2019;5(1):1–11.

Ribeiro MT, Singh S, Guestrin C. “Why should I trust you?” explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining; 2016. p. 1135–44.

Wang L, Nie R, Yu Z, Xin R, Zheng C, Zhang Z, Zhang J, Cai J. An interpretable deep-learning architecture of capsule networks for identifying cell-type gene expression programs from single-cell RNA-sequencing data. Nat Mach Intell. 2020;2(11):1–11.

Sundararajan M, Taly A, Yan Q. Axiomatic attribution for deep networks; 2017. arXiv preprint arXiv:1703.01365 .

Platt J, et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Adv Large Margin Classif. 1999;10(3):61–74.

Nair T, Precup D, Arnold DL, Arbel T. Exploring uncertainty measures in deep networks for multiple sclerosis lesion detection and segmentation. Med Image Anal. 2020;59:101557.

Herzog L, Murina E, Dürr O, Wegener S, Sick B. Integrating uncertainty in deep neural networks for MRI based stroke analysis. Med Image Anal. 2020;65:101790.

Pereyra G, Tucker G, Chorowski J, Kaiser Ł, Hinton G. Regularizing neural networks by penalizing confident output distributions; 2017. arXiv preprint arXiv:1701.06548 .

Naeini MP, Cooper GF, Hauskrecht M. Obtaining well calibrated probabilities using bayesian binning. In: Proceedings of the... AAAI conference on artificial intelligence. AAAI conference on artificial intelligence, vol. 2015. NIH Public Access; 2015. p. 2901.

Li M, Sethi IK. Confidence-based classifier design. Pattern Recogn. 2006;39(7):1230–40.

Zadrozny B, Elkan C. Obtaining calibrated probability estimates from decision trees and Naive Bayesian classifiers. In: ICML, vol. 1, Citeseer; 2001. p. 609–16.

Steinwart I. Consistency of support vector machines and other regularized kernel classifiers. IEEE Trans Inf Theory. 2005;51(1):128–42.

Lee K, Lee K, Shin J, Lee H. Overcoming catastrophic forgetting with unlabeled data in the wild. In: Proceedings of the IEEE international conference on computer vision; 2019. p. 312–21.

Shmelkov K, Schmid C, Alahari K. Incremental learning of object detectors without catastrophic forgetting. In: Proceedings of the IEEE international conference on computer vision; 2017. p. 3400–09.

Zenke F, Gerstner W, Ganguli S. The temporal paradox of Hebbian learning and homeostatic plasticity. Curr Opin Neurobiol. 2017;43:166–76.

Andersen N, Krauth N, Nabavi S. Hebbian plasticity in vivo: relevance and induction. Curr Opin Neurobiol. 2017;45:188–92.

Zheng R, Chakraborti S. A phase ii nonparametric adaptive exponentially weighted moving average control chart. Qual Eng. 2016;28(4):476–90.

Rebuffi SA, Kolesnikov A, Sperl G, Lampert CH. ICARL: Incremental classifier and representation learning. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. p. 2001–10.

Hinton GE, Plaut DC. Using fast weights to deblur old memories. In: Proceedings of the ninth annual conference of the cognitive science society; 1987. p. 177–86.

Parisi GI, Kemker R, Part JL, Kanan C, Wermter S. Continual lifelong learning with neural networks: a review. Neural Netw. 2019;113:54–71.

Soltoggio A, Stanley KO, Risi S. Born to learn: the inspiration, progress, and future of evolved plastic artificial neural networks. Neural Netw. 2018;108:48–67.

Parisi GI, Tani J, Weber C, Wermter S. Lifelong learning of human actions with deep neural network self-organization. Neural Netw. 2017;96:137–49.

Cheng Y, Wang D, Zhou P, Zhang T. Model compression and acceleration for deep neural networks: the principles, progress, and challenges. IEEE Signal Process Mag. 2018;35(1):126–36.

Wiedemann S, Kirchhoffer H, Matlage S, Haase P, Marban A, Marinč T, Neumann D, Nguyen T, Schwarz H, Wiegand T, et al. Deepcabac: a universal compression algorithm for deep neural networks. IEEE J Sel Topics Signal Process. 2020;14(4):700–14.

Mehta N, Pandit A. Concurrence of big data analytics and healthcare: a systematic review. Int J Med Inform. 2018;114:57–65.

Esteva A, Robicquet A, Ramsundar B, Kuleshov V, DePristo M, Chou K, Cui C, Corrado G, Thrun S, Dean J. A guide to deep learning in healthcare. Nat Med. 2019;25(1):24–9.

Shawahna A, Sait SM, El-Maleh A. Fpga-based accelerators of deep learning networks for learning and classification: a review. IEEE Access. 2018;7:7823–59.

Min Z. Public welfare organization management system based on FPGA and deep learning. Microprocess Microsyst. 2020;80:103333.

Al-Shamma O, Fadhel MA, Hameed RA, Alzubaidi L, Zhang J. Boosting convolutional neural networks performance based on fpga accelerator. In: International conference on intelligent systems design and applications. Springer; 2018. p. 509–17.

Han S, Mao H, Dally WJ. Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding; 2015. arXiv preprint arXiv:1510.00149 .

Chen Z, Zhang L, Cao Z, Guo J. Distilling the knowledge from handcrafted features for human activity recognition. IEEE Trans Ind Inform. 2018;14(10):4334–42.

Hinton G, Vinyals O, Dean J. Distilling the knowledge in a neural network; 2015. arXiv preprint arXiv:1503.02531 .

Lenssen JE, Fey M, Libuschewski P. Group equivariant capsule networks. In: Advances in neural information processing systems. San Mateo: Morgan Kaufmann Publishers; 2018. p. 8844–53.

Denton EL, Zaremba W, Bruna J, LeCun Y, Fergus R. Exploiting linear structure within convolutional networks for efficient evaluation. In: Advances in neural information processing systems. San Mateo: Morgan Kaufmann Publishers; 2014. p. 1269–77.

Xu Q, Zhang M, Gu Z, Pan G. Overfitting remedy by sparsifying regularization on fully-connected layers of CNNs. Neurocomputing. 2019;328:69–74.

Zhang C, Bengio S, Hardt M, Recht B, Vinyals O. Understanding deep learning requires rethinking generalization. Commun ACM. 2018;64(3):107–15.

Xu X, Jiang X, Ma C, Du P, Li X, Lv S, Yu L, Ni Q, Chen Y, Su J, et al. A deep learning system to screen novel coronavirus disease 2019 pneumonia. Engineering. 2020;6(10):1122–9.

Sharma K, Alsadoon A, Prasad P, Al-Dala’in T, Nguyen TQV, Pham DTH. A novel solution of using deep learning for left ventricle detection: enhanced feature extraction. Comput Methods Programs Biomed. 2020;197:105751.

Zhang G, Wang C, Xu B, Grosse R. Three mechanisms of weight decay regularization; 2018. arXiv preprint arXiv:1810.12281 .

Laurent C, Pereyra G, Brakel P, Zhang Y, Bengio Y. Batch normalized recurrent neural networks. In: 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP), IEEE; 2016. p. 2657–61.

Salamon J, Bello JP. Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Process Lett. 2017;24(3):279–83.

Wang X, Qin Y, Wang Y, Xiang S, Chen H. ReLTanh: an activation function with vanishing gradient resistance for SAE-based DNNs and its application to rotating machinery fault diagnosis. Neurocomputing. 2019;363:88–98.

Tan HH, Lim KH. Vanishing gradient mitigation with deep learning neural network optimization. In: 2019 7th international conference on smart computing & communications (ICSCC). IEEE; 2019. p. 1–4.

MacDonald G, Godbout A, Gillcash B, Cairns S. Volume-preserving neural networks: a solution to the vanishing gradient problem; 2019. arXiv preprint arXiv:1911.09576 .

Mittal S, Vaishay S. A survey of techniques for optimizing deep learning on GPUs. J Syst Arch. 2019;99:101635.

Kanai S, Fujiwara Y, Iwamura S. Preventing gradient explosions in gated recurrent units. In: Advances in neural information processing systems. San Mateo: Morgan Kaufmann Publishers; 2017. p. 435–44.

Hanin B. Which neural net architectures give rise to exploding and vanishing gradients? In: Advances in neural information processing systems. San Mateo: Morgan Kaufmann Publishers; 2018. p. 582–91.

Ribeiro AH, Tiels K, Aguirre LA, Schön T. Beyond exploding and vanishing gradients: analysing RNN training using attractors and smoothness. In: International conference on artificial intelligence and statistics, PMLR; 2020. p. 2370–80.

D’Amour A, Heller K, Moldovan D, Adlam B, Alipanahi B, Beutel A, Chen C, Deaton J, Eisenstein J, Hoffman MD, et al. Underspecification presents challenges for credibility in modern machine learning; 2020. arXiv preprint arXiv:2011.03395 .

Chea P, Mandell JC. Current applications and future directions of deep learning in musculoskeletal radiology. Skelet Radiol. 2020;49(2):1–15.

Wu X, Sahoo D, Hoi SC. Recent advances in deep learning for object detection. Neurocomputing. 2020;396:39–64.

Kuutti S, Bowden R, Jin Y, Barber P, Fallah S. A survey of deep learning applications to autonomous vehicle control. IEEE Trans Intell Transp Syst. 2020;22:712–33.

Yolcu G, Oztel I, Kazan S, Oz C, Bunyak F. Deep learning-based face analysis system for monitoring customer interest. J Ambient Intell Humaniz Comput. 2020;11(1):237–48.

Jiao L, Zhang F, Liu F, Yang S, Li L, Feng Z, Qu R. A survey of deep learning-based object detection. IEEE Access. 2019;7:128837–68.

Muhammad K, Khan S, Del Ser J, de Albuquerque VHC. Deep learning for multigrade brain tumor classification in smart healthcare systems: a prospective survey. IEEE Trans Neural Netw Learn Syst. 2020;32:507–22.

Litjens G, Kooi T, Bejnordi BE, Setio AAA, Ciompi F, Ghafoorian M, Van Der Laak JA, Van Ginneken B, Sánchez CI. A survey on deep learning in medical image analysis. Med Image Anal. 2017;42:60–88.

Mukherjee D, Mondal R, Singh PK, Sarkar R, Bhattacharjee D. Ensemconvnet: a deep learning approach for human activity recognition using smartphone sensors for healthcare applications. Multimed Tools Appl. 2020;79(41):31663–90.

Zeleznik R, Foldyna B, Eslami P, Weiss J, Alexander I, Taron J, Parmar C, Alvi RM, Banerji D, Uno M, et al. Deep convolutional neural networks to predict cardiovascular risk from computed tomography. Nature Commun. 2021;12(1):1–9.

Wang J, Liu Q, Xie H, Yang Z, Zhou H. Boosted efficientnet: detection of lymph node metastases in breast cancer using convolutional neural networks. Cancers. 2021;13(4):661.

Yu H, Yang LT, Zhang Q, Armstrong D, Deen MJ. Convolutional neural networks for medical image analysis: state-of-the-art, comparisons, improvement and perspectives. Neurocomputing. 2021. https://doi.org/10.1016/j.neucom.2020.04.157 .

Bharati S, Podder P, Mondal MRH. Hybrid deep learning for detecting lung diseases from X-ray images. Inform Med Unlocked. 2020;20:100391.

Dong Y, Pan Y, Zhang J, Xu W. Learning to read chest X-ray images from 16000+ examples using CNN. In: 2017 IEEE/ACM international conference on connected health: applications, systems and engineering technologies (CHASE). IEEE; 2017. p. 51–7.

Rajkomar A, Lingam S, Taylor AG, Blum M, Mongan J. High-throughput classification of radiographs using deep convolutional neural networks. J Digit Imaging. 2017;30(1):95–101.

Rajpurkar P, Irvin J, Zhu K, Yang B, Mehta H, Duan T, Ding D, Bagul A, Langlotz C, Shpanskaya K, et al. Chexnet: radiologist-level pneumonia detection on chest X-rays with deep learning; 2017. arXiv preprint arXiv:1711.05225 .

Wang X, Peng Y, Lu L, Lu Z, Bagheri M, Summers RM. ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. p. 2097–106.

Zuo W, Zhou F, Li Z, Wang L. Multi-resolution CNN and knowledge transfer for candidate classification in lung nodule detection. IEEE Access. 2019;7:32510–21.

Shen W, Zhou M, Yang F, Yang C, Tian J. Multi-scale convolutional neural networks for lung nodule classification. In: International conference on information processing in medical imaging. Springer; 2015. p. 588–99.

Li R, Zhang W, Suk HI, Wang L, Li J, Shen D, Ji S. Deep learning based imaging data completion for improved brain disease diagnosis. In: International conference on medical image computing and computer-assisted intervention. Springer; 2014. p. 305–12.

Wen J, Thibeau-Sutre E, Diaz-Melo M, Samper-González J, Routier A, Bottani S, Dormont D, Durrleman S, Burgos N, Colliot O, et al. Convolutional neural networks for classification of Alzheimer’s disease: overview and reproducible evaluation. Med Image Anal. 2020;63:101694.

Mehmood A, Maqsood M, Bashir M, Shuyuan Y. A deep siamese convolution neural network for multi-class classification of Alzheimer disease. Brain Sci. 2020;10(2):84.

Hosseini-Asl E, Ghazal M, Mahmoud A, Aslantas A, Shalaby A, Casanova M, Barnes G, Gimel’farb G, Keynton R, El-Baz A. Alzheimer’s disease diagnostics by a 3d deeply supervised adaptable convolutional network. Front Biosci. 2018;23:584–96.

Korolev S, Safiullin A, Belyaev M, Dodonova Y. Residual and plain convolutional neural networks for 3D brain MRI classification. In: 2017 IEEE 14th international symposium on biomedical imaging (ISBI 2017). IEEE; 2017. p. 835–8.

Alzubaidi L, Fadhel MA, Oleiwi SR, Al-Shamma O, Zhang J. DFU_QUTNet: diabetic foot ulcer classification using novel deep convolutional neural network. Multimed Tools Appl. 2020;79(21):15655–77.

Goyal M, Reeves ND, Davison AK, Rajbhandari S, Spragg J, Yap MH. Dfunet: convolutional neural networks for diabetic foot ulcer classification. IEEE Trans Emerg Topics Comput Intell. 2018;4(5):728–39.

Yap MH., Hachiuma R, Alavi A, Brungel R, Goyal M, Zhu H, Cassidy B, Ruckert J, Olshansky M, Huang X, et al. Deep learning in diabetic foot ulcers detection: a comprehensive evaluation; 2020. arXiv preprint arXiv:2010.03341 .

Tulloch J, Zamani R, Akrami M. Machine learning in the prevention, diagnosis and management of diabetic foot ulcers: a systematic review. IEEE Access. 2020;8:198977–9000.

Fadhel MA, Al-Shamma O, Alzubaidi L, Oleiwi SR. Real-time sickle cell anemia diagnosis based hardware accelerator. In: International conference on new trends in information and communications technology applications, Springer; 2020. p. 189–99.

Debelee TG, Kebede SR, Schwenker F, Shewarega ZM. Deep learning in selected cancers’ image analysis—a survey. J Imaging. 2020;6(11):121.

Khan S, Islam N, Jan Z, Din IU, Rodrigues JJC. A novel deep learning based framework for the detection and classification of breast cancer using transfer learning. Pattern Recogn Lett. 2019;125:1–6.

Alzubaidi L, Hasan RI, Awad FH, Fadhel MA, Alshamma O, Zhang J. Multi-class breast cancer classification by a novel two-branch deep convolutional neural network architecture. In: 2019 12th international conference on developments in eSystems engineering (DeSE). IEEE; 2019. p. 268–73.

Roy K, Banik D, Bhattacharjee D, Nasipuri M. Patch-based system for classification of breast histology images using deep learning. Comput Med Imaging Gr. 2019;71:90–103.

Hameed Z, Zahia S, Garcia-Zapirain B, Javier Aguirre J, María Vanegas A. Breast cancer histopathology image classification using an ensemble of deep learning models. Sensors. 2020;20(16):4373.

Hosny KM, Kassem MA, Foaud MM. Skin cancer classification using deep learning and transfer learning. In: 2018 9th Cairo international biomedical engineering conference (CIBEC). IEEE; 2018. p. 90–3.

Dorj UO, Lee KK, Choi JY, Lee M. The skin cancer classification using deep convolutional neural network. Multimed Tools Appl. 2018;77(8):9909–24.

Kassem MA, Hosny KM, Fouad MM. Skin lesions classification into eight classes for ISIC 2019 using deep convolutional neural network and transfer learning. IEEE Access. 2020;8:114822–32.

Heidari M, Mirniaharikandehei S, Khuzani AZ, Danala G, Qiu Y, Zheng B. Improving the performance of CNN to predict the likelihood of COVID-19 using chest X-ray images with preprocessing algorithms. Int J Med Inform. 2020;144:104284.

Al-Timemy AH, Khushaba RN, Mosa ZM, Escudero J. An efficient mixture of deep and machine learning models for COVID-19 and tuberculosis detection using X-ray images in resource limited settings 2020. arXiv preprint arXiv:2007.08223 .

Abraham B, Nair MS. Computer-aided detection of COVID-19 from X-ray images using multi-CNN and Bayesnet classifier. Biocybern Biomed Eng. 2020;40(4):1436–45.

Nour M, Cömert Z, Polat K. A novel medical diagnosis model for COVID-19 infection detection based on deep features and Bayesian optimization. Appl Soft Comput. 2020;97:106580.

Mallio CA, Napolitano A, Castiello G, Giordano FM, D’Alessio P, Iozzino M, Sun Y, Angeletti S, Russano M, Santini D, et al. Deep learning algorithm trained with COVID-19 pneumonia also identifies immune checkpoint inhibitor therapy-related pneumonitis. Cancers. 2021;13(4):652.

Fourcade A, Khonsari R. Deep learning in medical image analysis: a third eye for doctors. J Stomatol Oral Maxillofac Surg. 2019;120(4):279–88.

Guo Z, Li X, Huang H, Guo N, Li Q. Deep learning-based image segmentation on multimodal medical imaging. IEEE Trans Radiat Plasma Med Sci. 2019;3(2):162–9.


Acknowledgements

We would like to thank the professors from the Queensland University of Technology and the University of Information Technology and Communications who gave their feedback on the paper.

This research received no external funding.

Author information

Authors and Affiliations

School of Computer Science, Queensland University of Technology, Brisbane, QLD, 4000, Australia

Laith Alzubaidi & Jinglan Zhang

Control and Systems Engineering Department, University of Technology, Baghdad, 10001, Iraq

Amjad J. Humaidi

Electrical Engineering Technical College, Middle Technical University, Baghdad, 10001, Iraq

Ayad Al-Dujaili

Faculty of Electrical Engineering & Computer Science, University of Missouri, Columbia, MO, 65211, USA

Ye Duan & Muthana Al-Amidie

AlNidhal Campus, University of Information Technology & Communications, Baghdad, 10001, Iraq

Laith Alzubaidi & Omran Al-Shamma

Department of Computer Science, University of Jaén, 23071, Jaén, Spain

J. Santamaría

College of Computer Science and Information Technology, University of Sumer, Thi Qar, 64005, Iraq

Mohammed A. Fadhel

School of Engineering, Manchester Metropolitan University, Manchester, M1 5GD, UK

Laith Farhan


Contributions

Conceptualization: LA, and JZ; methodology: LA, JZ, and JS; software: LA, and MAF; validation: LA, JZ, MA, and LF; formal analysis: LA, JZ, YD, and JS; investigation: LA, and JZ; resources: LA, JZ, and MAF; data curation: LA, and OA.; writing–original draft preparation: LA, and OA; writing—review and editing: LA, JZ, AJH, AA, YD, OA, JS, MAF, MA, and LF; visualization: LA, and MAF; supervision: JZ, and YD; project administration: JZ, YD, and JS; funding acquisition: LA, AJH, AA, and YD. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Laith Alzubaidi .

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article

Alzubaidi, L., Zhang, J., Humaidi, A.J. et al. Review of deep learning: concepts, CNN architectures, challenges, applications, future directions. J Big Data 8, 53 (2021). https://doi.org/10.1186/s40537-021-00444-8


Received : 21 January 2021

Accepted : 22 March 2021

Published : 31 March 2021

DOI : https://doi.org/10.1186/s40537-021-00444-8


  • Deep learning
  • Machine learning
  • Convolution neural network (CNN)
  • Deep neural network architectures
  • Deep learning applications
  • Image classification
  • Medical image analysis
  • Supervised learning




Exploring the Advancements and Future Research Directions of Artificial Neural Networks: A Text Mining Approach


1. Introduction

2. Background

2.1. Artificial Neural Network

2.2. Applications of ANN-Based Approaches

2.3. Text Mining for ANN

3. Research Methodology

  • What do we learn about the most common areas of ANN directions from this data set?
  • During the sample period of January 2000–2020, what trends were seen in ANN directions?
  • What are the most important areas of study in ANN that will shape the field in the future?
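Answering the trend questions above reduces to counting keyword occurrences inside fixed publication windows. The sketch below illustrates that counting step only; the corpus, keywords, and windows are hypothetical stand-ins, not the paper's actual pipeline or data.

```python
# Hypothetical mini-corpus of (publication year, abstract text) pairs --
# illustrative only, not the ~10,000-article Springer corpus the study mined.
corpus = [
    (2008, "stability analysis of a recurrent neural network"),
    (2013, "deep learning for image classification"),
    (2017, "deep learning and machine learning for forecasting"),
    (2018, "convolutional neural network feature extraction"),
]

keywords = ["neural network", "deep learning", "machine learning"]
periods = [(2007, 2011), (2012, 2015), (2016, 2019)]

def keyword_trends(corpus, keywords, periods):
    """Count keyword occurrences per time window, producing one
    trend-table row per keyword."""
    trends = {kw: [0] * len(periods) for kw in keywords}
    for year, text in corpus:
        for i, (start, end) in enumerate(periods):
            if start <= year <= end:
                for kw in keywords:
                    trends[kw][i] += text.count(kw)
    return trends

trends = keyword_trends(corpus, keywords, periods)
print(trends["deep learning"])  # [0, 1, 1]: one hit in each later window
```

A real pipeline would add tokenization, stemming, and stop-word removal before counting, but the per-window tallying shown here is the core of any keyword trend table.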

4. Data Analysis and Discussion

4.1. Descriptive Analysis

4.2. Text Mining Analysis (Clustering)

4.2.1. Clustering Results

4.2.2. Word Frequency Distribution

5. Conclusions

Author Contributions

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest



| Advantages | Disadvantages | Limitations |
| --- | --- | --- |
| High accuracy in modeling complex systems | Overfitting | Sensitivity to noise and irrelevant inputs |
| Capable of handling large amounts of data | Slow convergence in training | Lack of interpretability |
| Can handle non-linear relationships | Prone to local minima | Requires careful selection of architecture and hyperparameters |
| Can perform unsupervised learning | Can be computationally intensive | Can be sensitive to the choice of activation function |
| Can perform feature extraction | Can be prone to over-fitting | Requires large amounts of computational resources |
| Can handle imbalanced data | Can be prone to over-training | Requires extensive preprocessing and feature engineering |
| Can model complex, non-linear relationships | Can be sensitive to the choice of loss function | Requires careful initialization of weights |
| Can perform online learning | Can have poor generalization performance | Can be sensitive to the choice of optimization algorithm |
| Journal Name | Total Articles |
| --- | --- |
| Neural Processing Letters | 1414 |
| Journal of Neural Transmission | 2429 |
| Neural Computing and Applications | 4747 |
| Artificial Intelligence Review | 782 |
| Optical Memory and Neural Networks | 440 |
| Journal of Computational Neuroscience | 799 |
Rank | Keyword | Keywords Count
1 | Neural network | 678
2 | Artificial neural network | 392
3 | Classification | 222
4 | Machine learning | 177
5 | Deep learning | 169
6 | Genetic algorithm | 163
7 | Support vector machine | 153
8 | Particle swarm optimization | 145
9 | Feature extraction | 128
10 | Optimization | 125
11 | Feature selection | 123
12 | Recurrent neural network | 110
13 | Extreme learning machine | 106
14 | Clustering | 103
15 | Dopamine | 101
16 | Face recognition | 98
17 | Pattern recognition | 90
18 | Data mining | 98
19 | Parkinson's disease | 87
20 | Schizophrenia | 85
21 | Support vector machines | 80
22 | Synchronization | 78
23 | Depression | 76
24 | Genetic algorithm | 75
25 | Fuzzy logic | 65
26 | Hippocampus | 64
27 | Prediction | 64
28 | Dimensionality reduction | 59
29 | Exponential stability | 58
30 | Cognition | 58
31 | Artificial intelligence | 57
32 | Adaptive control | 56
33 | Stability | 54
34 | Dementia | 53
35 | Deep brain stimulation | 52
36 | Oxidative stress | 52
37 | Electroencephalogram | 51
38 | Swarm intelligence | 51
39 | Support vector regression | 46
40 | Forecasting | 46
41 | Serotonin | 46
42 | Polymorphism | 46
43 | Linear matrix inequality | 45
44 | Time-varying delay | 45
45 | Reinforcement learning | 44
46 | Botulinum toxin | 43
47 | Parkinson's disease | 42
48 | ADHD | 42
49 | Modeling | 41
50 | Evolutionary algorithm | 41
Dimension | 2007–2011 | 2012–2015 | 2016–2019
Mental disease | 142 | 138 | 153
Mental curing | 57 | 37 | 35
Brain | 84 | 70 | 110
Deep learning | 544 | 702 | 1110
Algorithm | 91 | 218 | 392
Data mining | 195 | 278 | 501
Stability | 22 | 32 | 58
Reasoning | 29 | 32 | 62
Transmission | 53 | 47 | 47
Machine learning | 30 | 75 | 178

Share and Cite

Kariri, E.; Louati, H.; Louati, A.; Masmoudi, F. Exploring the Advancements and Future Research Directions of Artificial Neural Networks: A Text Mining Approach. Appl. Sci. 2023 , 13 , 3186. https://doi.org/10.3390/app13053186


Dement Neurocogn Disord, v.17(3); 2018 Sep


Artificial Neural Network: Understanding the Basic Concepts without Mathematics

Su-Hyun Han

1 Department of Neurology, Chung-Ang University College of Medicine, Seoul, Korea.

Ko Woon Kim

2 Department of Neurology, Chonbuk National University Hospital, Jeonju, Korea.

SangYun Kim

3 Department of Neurology, Seoul National University College of Medicine and Seoul National University Bundang Hospital, Seongnam, Korea.

Young Chul Youn

Machine learning is where a machine (i.e., computer) determines for itself how input data is processed and predicts outcomes when provided with new data. An artificial neural network is a machine learning algorithm based on the concept of a human neuron. The purpose of this review is to explain the fundamental concepts of artificial neural networks.

WHAT IS THE DIFFERENCE BETWEEN MACHINE LEARNING AND A COMPUTER PROGRAM?

A computer program takes input data, processes the data, and outputs a result. A programmer stipulates how the input data should be processed. In machine learning, input and output data are provided, and the machine determines the process by which the given input produces the given output. 1 , 2 , 3 The learned process can then predict the unknown output when new input data are provided. 1 , 2 , 3

So, how does the machine determine the process? Consider a linear model, y = Wx + b . If we are given the x and y values shown in Table 1 , we know intuitively that W = 2 and b = 0.

x | y
1 | 2
2 | 4
3 | 6
4 | 8
5 | 10
10 | 20
20 | 40
30 | 60

However, how does a computer know what W and b are? W and b are randomly generated initially. For example, if we initially say W = 1 and b = 0 and input the x values from Table 1 , the predicted y values do not match the y values in Table 1 . At this point, there is a difference between the correct y values and the predicted y values; the machine gradually adjusts the values of W and b to reduce this difference. The difference between the predicted values and the correct values is called the cost function. Minimizing the cost function makes predictions closer to the correct answers. 4 , 5 , 6 , 7

The costs that correspond to given W and b values are shown in Fig. 1 (obtained from https://rasbt.github.io/mlxtend/user_guide/general_concepts/gradient-optimization/ ). 8 , 9 , 10 Given the current values of W and b , the gradient is obtained. If the gradient is positive, the values of W and b are decreased. If the gradient is negative, the values of W and b are increased. In other words, the values of W and b are determined by the derivative of the cost function. 8 , 9 , 10
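This update rule can be sketched in a few lines of Python (an illustrative sketch using the data from Table 1; the learning rate and iteration count are arbitrary choices, not values from the article):

```python
# Gradient descent for the linear model y = W*x + b, using the data from
# Table 1. W and b start at the arbitrary values mentioned in the text.
xs = [1, 2, 3, 4, 5, 10, 20, 30]
ys = [2, 4, 6, 8, 10, 20, 40, 60]

W, b = 1.0, 0.0
lr = 0.001  # learning rate (an arbitrary illustrative choice)

for _ in range(5000):
    # Gradient of the mean squared cost with respect to W and b
    dW = sum(2 * (W * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    db = sum(2 * (W * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    # If the gradient is positive the value is decreased, and vice versa
    W -= lr * dW
    b -= lr * db

print(W, b)  # approaches W = 2, b = 0
```

After enough iterations, W and b settle at the values a human reads off the table at a glance.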

[Fig. 1]

ARTIFICIAL NEURAL NETWORK

The basic unit by which the brain works is a neuron. Neurons transmit electrical signals (action potentials) from one end to the other. 11 That is, electrical signals are transmitted from the dendrites to the axon terminals through the axon body. In this way, the electrical signals continue to be transmitted across the synapse from one neuron to another. The human brain has approximately 100 billion neurons. 11 , 12 It is difficult to imitate this level of complexity with existing computers. However, Drosophila have approximately 100,000 neurons, and they are able to find food, avoid danger, survive, and reproduce. 13 Nematodes have 302 neurons, and they survive well. 14 This is a level of complexity that can be replicated even with today's computers. Yet nematodes still perform far better than our computers at the tasks they need to survive.

Let's consider the operating principles of neurons. Neurons receive signals and generate other signals. That is, they receive input data, perform some processing, and give an output. 11 , 12 However, the output is not produced at a constant rate; it is generated when the input exceeds a certain threshold. 15 The function that receives an input signal and produces an output signal once a certain threshold is exceeded is called an activation function. 11 , 15 As shown in Fig. 2 , when the input value is small, the output value is 0, and once the input value rises above the threshold, a non-zero output value suddenly appears. In this respect, the responses of biological neurons and artificial neurons (nodes) are similar. In practice, however, artificial neural networks use various activation functions other than this step function, and most use the sigmoid function, 4 , 6 which is also called the logistic function. The sigmoid function has the advantage of being very simple to calculate compared with other functions. Artificial neural networks currently rely predominantly on weight modification in the learning process. 4 , 5 , 7 Modifying the weights requires an activation function that can be differentiated in every layer, which is why the step function cannot be used as it is. The sigmoid function is expressed as y = 1/(1 + e −x ). 6 , 16
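In code, the sigmoid and the simple derivative that makes it suitable for weight updates look like this (a minimal sketch; the function names are illustrative):

```python
import math

def sigmoid(x):
    # Logistic (sigmoid) function: y = 1 / (1 + e^(-x)), output in (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x)); this simple form is
    # one reason the sigmoid suits the weight-modification process
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid(0))             # 0.5, the midpoint of the curve
print(sigmoid_derivative(0))  # 0.25, the steepest slope
```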

[Fig. 2]

Biological neurons receive multiple inputs from pre-synaptic neurons. 11 Neurons in artificial neural networks (nodes) also receive multiple inputs, then they add them and process the sum with a sigmoid function. 5 , 7 The value processed by the sigmoid function then becomes the output value. As shown in Fig. 3 , if the sum of inputs A, B, and C exceeds the threshold and the sigmoid function works, this neuron generates an output value. 4
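The summing-then-squashing behavior of a single node can be sketched as follows (the input values and connection strengths are made up for illustration, not taken from the article):

```python
import math

def node_output(inputs, weights):
    # An artificial neuron: sum the weighted inputs, then apply the sigmoid
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-total))

# Three inputs (A, B, C) and their connection strengths; values are made up
out = node_output([1.0, 2.0, 0.5], [0.4, 0.3, 0.6])
print(out)  # roughly 0.79: the weighted sum pushed the sigmoid past its midpoint
```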

[Fig. 3]

A neuron receives input from multiple neurons and transmits signals to multiple neurons. 4 , 5 The models of biological neurons and an artificial neural network are shown in Fig. 4 . Neurons are located over several layers, and one neuron is considered to be connected to multiple neurons. In an artificial neural network, the first layer (input layer) has input neurons that transfer data via synapses to the second layer (hidden layer), and similarly, the hidden layer transfers this data to the third layer (output layer) via more synapses. 4 , 5 , 7 The hidden layer (node) is called the “black box” because we are unable to interpret how an artificial neural network derived a particular result. 4 , 5 , 6
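A forward pass through such a three-layer network can be sketched in a few lines (the weight values are made up for illustration):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights):
    # Each row of weights feeds one node in the next layer
    return [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in weights]

# A tiny 3-3-3 network: input -> hidden -> output (weights are made up)
w_input_hidden = [[0.9, 0.3, 0.4], [0.2, 0.8, 0.2], [0.1, 0.5, 0.6]]
w_hidden_output = [[0.3, 0.7, 0.5], [0.6, 0.5, 0.2], [0.8, 0.1, 0.9]]

signal = [0.9, 0.1, 0.8]                # input layer
hidden = layer(signal, w_input_hidden)  # hidden layer (the "black box")
output = layer(hidden, w_hidden_output) # output layer
print(output)
```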

[Fig. 4]

So, how does a neural network learn in this structure? Some variables must be updated in order to increase the accuracy of the output values during the learning process. 4 , 5 , 7 We could adjust the sum of the input values or modify the sigmoid function, but the practical method is to adjust the connection strengths between the neurons (nodes). 4 , 5 , 7 Fig. 5 is a reformulation of Fig. 4 with weights applied to the connection strengths. Low weights weaken the signal, and high weights enhance it. 17 Learning is the process by which the network adjusts the weights to improve the output. 4 , 5 , 7 The weights W 1,2 , W 1,3 , W 2,3 , W 2,2 , and W 3,2 are emphasized because their high values strengthen the signal. If a weight W is 0, the signal is not transmitted, and that connection has no influence on the network.

[Fig. 5]

Multiple nodes and connections could affect predicted values and their errors. In this case, how do we update the weights to get the correct output, and how does learning work?

The updating of the weights is determined by the error between the predicted output and the correct output. 4 , 5 , 7 The error is divided according to the ratio of the weights on the links, and the divided errors are propagated backward and recombined ( Fig. 6 ). However, in a hierarchical structure, it is extremely difficult to calculate all the weights mathematically. As an alternative, we can use the gradient descent method 8 , 9 , 10 to reach the correct answer even if we cannot carry out the complex mathematical calculations ( Fig. 1 ). The gradient descent method is a technique to find the lowest point of the cost function (the difference between the predicted value and the answer obtained with an arbitrary starting weight W ), where the cost is minimized. The machine can start with any W value and alter it gradually (moving down the graph) so that the cost is reduced and finally reaches a minimum. Without complicated mathematical calculations, this minimizes the error between the predicted value and the answer (or shows that the arbitrary starting weight was already correct). This is the learning process of artificial neural networks.

[Fig. 6]
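The whole learning loop described above (forward pass, error, backpropagated deltas, gradient-descent weight updates) can be sketched for a tiny 2-2-1 network; the input, target, learning rate, and iteration count are arbitrary illustrative choices, not the article's code:

```python
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

random.seed(0)
# 2-2-1 network with random starting weights
w1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
w2 = [random.uniform(-1, 1) for _ in range(2)]

x, target = [0.5, 0.9], 0.8   # one training example (made-up values)
lr = 0.5                      # learning rate

errors = []
for _ in range(200):
    # Forward pass
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    out = sigmoid(sum(w * h for w, h in zip(w2, hidden)))
    err = target - out
    errors.append(err ** 2)
    # Backpropagation: output delta, then hidden deltas split by the weights
    delta_out = err * out * (1 - out)
    deltas_hidden = [delta_out * w2[j] * hidden[j] * (1 - hidden[j]) for j in range(2)]
    # Gradient-descent weight updates
    for j in range(2):
        w2[j] += lr * delta_out * hidden[j]
        for i in range(2):
            w1[j][i] += lr * deltas_hidden[j] * x[i]

print(errors[0], errors[-1])  # the squared error shrinks as the weights adjust
```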

CONCLUSIONS

In conclusion, the learning process of an artificial neural network involves updating the connection strengths (weights) between nodes (neurons). By using the error between the predicted value and the correct value, the weights in the network are adjusted so that the error is minimized and an output close to the truth is obtained.

Funding: This work was supported by the Ministry of Education of the Republic of Korea and the National Research Foundation of Korea (NRF-2017S1A6A3A01078538).

Conflict of Interest: The authors have no financial conflicts of interest.

Author Contributions:

  • Conceptualization: Kim SY.
  • Supervision: Kim KW, Youn YC.
  • Writing - original draft: Han SH, Kim KW.
  • Writing - review & editing: Youn YC.


  • NEWS AND VIEWS
  • 07 August 2024

Physics solves a training problem for artificial neural networks

  • Damien Querlioz

Damien Querlioz is at Université Paris-Saclay, CNRS, Centre de Nanosciences et de Nanotechnologies, 91120 Palaiseau, France.


Huge changes are already under way in health care, industry and education as a result of advances in artificial intelligence (AI), but the costs could outweigh the benefits if AI’s enormous energy consumption is not reined in. The problem is that AI relies heavily on deep neural networks, which are layered algorithms that involve millions or even billions of computations, requiring massive, energy-hungry access to the memories in conventional computers. One possible solution is to replace the computers with systems that more closely reflect the physical structure of biological neural networks, but such systems are typically incapable of performing one of the main steps in training the network. In a paper in Nature , Xue et al . 1 report an ingenious workaround that uses physics to overcome the problem.


Nature 632 , 264-265 (2024)

doi: https://doi.org/10.1038/d41586-024-02392-8

1. Xue, Z. et al. Nature 632, 280–286 (2024).
2. McMahon, P. L. Nature Rev. Phys. 5, 717–734 (2023).
3. Ross, A. et al. Nature Nanotechnol. 18, 1273–1280 (2023).
4. Ambrogio, S. et al. Nature 620, 768–775 (2023).
5. Lee, R. H., Mulder, E. A. B. & Hopkins, J. B. Sci. Robot. 7, eabq7278 (2022).
6. Marković, D., Mizrahi, A., Querlioz, D. & Grollier, J. Nature Rev. Phys. 2, 499–510 (2020).
7. Wright, L. G. et al. Nature 601, 549–555 (2022).
8. Scellier, B., Ernoult, M., Kendall, J. & Kumar, S. Adv. Neural Inf. Proc. Syst. 36, 52705–52731 (2023); available at https://go.nature.com/4fmijnb.


Competing Interests

The author declares no competing interests.


A New Type of Neural Network Is More Interpretable

Kolmogorov-Arnold networks could point physicists to new hypotheses


Artificial neural networks —algorithms inspired by biological brains—are at the center of modern artificial intelligence , behind both chatbots and image generators. But with their many neurons, they can be black boxes , their inner workings uninterpretable to users.

Researchers have now created a fundamentally new way to make neural networks that in some ways surpasses traditional systems. These new networks are more interpretable and also more accurate, proponents say, even when they’re smaller. Their developers say the way they learn to represent physics data concisely could help scientists uncover new laws of nature.

“It’s great to see that there is a new architecture on the table.” —Brice Ménard, Johns Hopkins University

For the past decade or more, engineers have mostly tweaked neural-network designs through trial and error, says Brice Ménard, a physicist at Johns Hopkins University who studies how neural networks operate but was not involved in the new work, which was posted on arXiv in April. “It’s great to see that there is a new architecture on the table,” he says, especially one designed from first principles.

One way to think of neural networks is by analogy with neurons, or nodes, and synapses, or connections between those nodes. In traditional neural networks, called multi-layer perceptrons (MLPs), each synapse learns a weight—a number that determines how strong the connection is between those two neurons. The neurons are arranged in layers, such that a neuron from one layer takes input signals from the neurons in the previous layer, weighted by the strength of their synaptic connection. Each neuron then applies a simple function to the sum total of its inputs, called an activation function.

In the new architecture, the synapses play a more complex role. Instead of simply learning how strong the connection between two neurons is, they learn the full nature of that connection—the function that maps input to output. Unlike the activation function used by neurons in the traditional architecture, this function could be more complex—in fact a “spline” or combination of several functions—and is different in each instance. Neurons, on the other hand, become simpler—they just sum the outputs of all their preceding synapses. The new networks are called Kolmogorov-Arnold Networks (KANs), after two mathematicians who studied how functions could be combined. The idea is that KANs would provide greater flexibility when learning to represent data, while using fewer learned parameters.
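The contrast with MLPs can be made concrete with a toy sketch. The snippet below is not the authors' implementation: it replaces their learnable splines with a small polynomial basis, keeping only the core idea that each edge carries a learnable one-dimensional function while each neuron merely sums its incoming edge outputs (all coefficients are made up):

```python
# Toy sketch of a KAN-style connection: instead of a single learned weight,
# each edge carries a learnable 1-D function. Real KANs use splines; a small
# polynomial basis stands in for them here, and all coefficients are made up.

def edge_function(x, coeffs):
    # phi(x) = c0 + c1*x + c2*x^2 + c3*x^3, with learnable coefficients
    return sum(c * x ** k for k, c in enumerate(coeffs))

def kan_node(inputs, edge_coeffs):
    # A KAN neuron just sums the outputs of its incoming edge functions
    return sum(edge_function(x, c) for x, c in zip(inputs, edge_coeffs))

coeffs = [
    [0.0, 1.0, 0.5, 0.0],   # phi_1(x) = x + 0.5*x^2
    [0.1, -1.0, 0.0, 0.2],  # phi_2(x) = 0.1 - x + 0.2*x^3
]
print(kan_node([0.5, 2.0], coeffs))  # phi_1(0.5) + phi_2(2.0)
```

Training such a network means adjusting the coefficients of each edge function rather than a single weight per edge, which is where the extra flexibility per parameter comes from.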

“It’s like an alien life that looks at things from a different perspective but is also kind of understandable to humans.” —Ziming Liu, Massachusetts Institute of Technology

The researchers tested their KANs on relatively simple scientific tasks. In some experiments, they took simple physical laws, such as the velocity with which two relativistic-speed objects pass each other. They used these equations to generate input-output data points, then, for each physics function, trained a network on some of the data and tested it on the rest. They found that increasing the size of KANs improves their performance at a faster rate than increasing the size of MLPs did. When solving partial differential equations, a KAN was 100 times as accurate as an MLP that had 100 times as many parameters.

In another experiment, they trained networks to predict one attribute of topological knots, called their signature, based on other attributes of the knots. An MLP achieved 78 percent test accuracy using about 300,000 parameters, while a KAN achieved 81.6 percent test accuracy using only about 200 parameters.

What’s more, the researchers could visually map out the KANs and look at the shapes of the activation functions, as well as the importance of each connection. Either manually or automatically they could prune weak connections and replace some activation functions with simpler ones, like sine or exponential functions. Then they could summarize the entire KAN in an intuitive one-line function (including all the component activation functions), in some cases perfectly reconstructing the physics function that created the dataset.

“In the future, we hope that it can be a useful tool for everyday scientific research ,” says Ziming Liu, a computer scientist at the Massachusetts Institute of Technology and the paper’s first author. “Given a dataset we don’t know how to interpret, we just throw it to a KAN, and it can generate some hypothesis for you. You just stare at the brain [the KAN diagram] and you can even perform surgery on that if you want.” You might get a tidy function. “It’s like an alien life that looks at things from a different perspective but is also kind of understandable to humans.”

Dozens of papers have already cited the KAN preprint. “It seemed very exciting the moment that I saw it,” says Alexander Bodner, an undergraduate student of computer science at the University of San Andrés, in Argentina. Within a week, he and three classmates had combined KANs with convolutional neural networks, or CNNs, a popular architecture for processing images. They tested their Convolutional KANs on their ability to categorize handwritten digits or pieces of clothing. The best one approximately matched the performance of a traditional CNN (99 percent accuracy for both networks on digits, 90 percent for both on clothing) but using about 60 percent fewer parameters. The datasets were simple, but Bodner says other teams with more computing power have begun scaling up the networks. Other people are combining KANs with transformers, an architecture popular in large language models .

One downside of KANs is that they take longer per parameter to train—in part because they can’t take advantage of GPUs . But they need fewer parameters. Liu notes that even if KANs don’t replace giant CNNs and transformers for processing images and language, training time won’t be an issue at the smaller scale of many physics problems. He’s looking at ways for experts to insert their prior knowledge into KANs—by manually choosing activation functions, say—and to easily extract knowledge from them using a simple interface. Someday, he says, KANs could help physicists discover high-temperature superconductors or ways to control nuclear fusion.


Matthew Hutson is a freelance writer who covers science and technology, with specialties in psychology and AI. He’s written for Science, Nature, Wired, The Atlantic, The New Yorker, and The Wall Street Journal . He’s a former editor at Psychology Today and is the author of The 7 Laws of Magical Thinking . Follow him on Twitter at @SilverJacket.


AIMS Environmental Science


Monitoring river water quality through predictive modeling using artificial neural networks backpropagation

  • Muhammad Andang Novianta 1,2 , 
  • Syafrudin 1,3 ,
  • Budi Warsito 1,4 , 
  • Siti Rachmawati 5
  • 1. Department of Doctoral Environmental Science, Faculty of Postgraduate, Universitas Diponegoro, Semarang 50275, Indonesia
  • 2. Department of Electrical Engineering, Faculty of Engineering, Universitas AKPRIND Indonesia, Yogyakarta 55222, Indonesia
  • 3. Department of Environmental Engineering, Faculty of Engineering, Universitas Diponegoro, Semarang 50275, Indonesia
  • 4. Department of Statistics, Faculty of Science and Mathematics, Universitas Diponegoro, Semarang 50275, Indonesia
  • 5. Department of Environmental Science, Faculty of Mathematics and Natural Sciences, Universitas Sebelas Maret, Surakarta 57126, Indonesia
  • Received: 02 July 2024 Revised: 05 August 2024 Accepted: 07 August 2024 Published: 19 August 2024

Predicting river water quality in the Special Region of Yogyakarta (DIY) is crucial. In this research, we modeled a river water quality prediction system using the artificial neural network (ANN) backpropagation method. Backpropagation is one of the developments of the multilayer perceptron (MLP) network, which reduces the prediction error by adjusting the weights based on the difference between the output and the desired target. Water quality parameters included biochemical oxygen demand (BOD), chemical oxygen demand (COD), total suspended solids (TSS), dissolved oxygen (DO), total phosphate, fecal coliforms, and total coliforms. The research object was the upstream, middle, and downstream parts of the Oya River. The data source was secondary data from the DIY Environment and Forestry Service, in the form of time series for 2013–2023. Descriptive results showed that the water quality of the Oya River in 2020–2023 was better than in previous years. However, increasing community and industrial activities can reduce water quality. This was concluded from the prediction results of the ANN backpropagation method with a hidden layer size of 4. The predictions for period 3 of 2023 and period 1 of 2024 are that 1) BOD, fecal coliform, and total coliform concentrations will increase and exceed quality standards; 2) COD and TSS concentrations will increase but remain below quality standards; and 3) DO and total phosphate concentrations will remain constant, at the threshold of the quality standards. Since several water quality parameters may still rise above the quality standards, the potential for contamination of the Oya River remains high. Therefore, early prevention of river water pollution is necessary.
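The kind of model the abstract describes can be sketched generically. The snippet below is not the paper's code: it uses a synthetic series in place of the measured BOD, COD, TSS, DO, and coliform data, but it shows a backpropagation network with a hidden layer of 4 learning to predict the next value of a time series from a sliding window of past values:

```python
import math, random

random.seed(1)
# Synthetic stand-in series in (0, 1); the study used measured water quality
# concentrations (BOD, COD, TSS, DO, phosphate, coliforms) instead.
series = [0.5 + 0.4 * math.sin(0.4 * t) for t in range(40)]

window = 3
X = [series[t:t + window] for t in range(len(series) - window)]
y = [series[t + window] for t in range(len(series) - window)]

H = 4    # hidden nodes, echoing the hidden layer size of 4 reported above
lr = 0.2
w1 = [[random.uniform(-1, 1) for _ in range(window)] for _ in range(H)]
w2 = [random.uniform(-1, 1) for _ in range(H)]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def predict(x):
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return h, sum(w * hj for w, hj in zip(w2, h))  # linear output node

for _ in range(500):
    for x, target in zip(X, y):
        h, out = predict(x)
        err = target - out                             # output minus target
        for j in range(H):
            grad_hidden = err * w2[j] * h[j] * (1 - h[j])  # backpropagated
            w2[j] += lr * err * h[j]
            for i in range(window):
                w1[j][i] += lr * grad_hidden * x[i]

mse = sum((predict(x)[1] - t) ** 2 for x, t in zip(X, y)) / len(X)
print(mse)  # small compared with the variance of the series after training
```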

  • artificial neural network (ANN) backpropagation ,
  • system modeling ,

Citation: Muhammad Andang Novianta, Syafrudin, Budi Warsito, Siti Rachmawati. Monitoring river water quality through predictive modeling using artificial neural networks backpropagation[J]. AIMS Environmental Science, 2024, 11(4): 649-664. doi: 10.3934/environsci.2024032



  • © 2024 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/4.0 )



  • Figure 1. Map of the Oya River research location
  • Figure 2. Multilayer network architecture
  • Figure 3. Backpropagation ANN network architecture with one hidden layer
  • Figure 4. Flowchart of data prediction using the ANN backpropagation
  • Figure 5. Comparison of actual and predicted data on concentrations
  • DOI: 10.3390/proceedings2024105058
  • Corpus ID: 271840103

Predicting the Volumetric Mass Transfer Coefficient in a Double Coaxial Mixer: An Artificial Neural Network Approach

Forough Sharifi, E. Behzadfar, F. Ein-Mozaffari. Published in The 3rd International Electronic Conference on Processes - Green and Sustainable Process Engineering and Process Systems Engineering, 28 May 2024. Fields: Engineering, Computer Science.


Title: Neural Networks, Artificial Intelligence and the Computational Brain

Abstract: In recent years, several studies have provided insight into the functioning of the brain, which consists of neurons that form networks via synaptic interconnections. Neural networks are interconnected systems of neurons and are of two types: the Artificial Neural Network (ANN) and the Biological Neural Network (interconnected nerve cells). ANNs are computationally inspired by human neurons and are used in modelling neural systems. The reasoning foundations of ANNs have been useful in anomaly detection, in areas of medicine such as the instant physician, electronic noses, pattern recognition, and in modelling biological systems. Advancing research in artificial intelligence using the architecture of the human brain seeks to model systems by studying the brain rather than looking to technology for brain models. This study explores the concept of ANNs as a simulator of the biological neuron, and its areas of application. It also explores why brain-like intelligence is needed and how it differs from the conventional computational framework by comparing neural networks to contemporary computers and their modern-day implementation.
Subjects: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
Cite as: arXiv:2101.08635 [q-bio.NC]


EurekAlert! Science News


Engineers bring efficient optical neural networks into focus

EPFL researchers have published a programmable framework that overcomes a key computational bottleneck of optics-based artificial intelligence systems.

Ecole Polytechnique Fédérale de Lausanne

Image: The core of the optical processor. Credit: Alain Herzog

In a series of image classification experiments, the researchers used scattered light from a low-power laser to perform accurate, scalable computations using a fraction of the energy of electronics.

As digital artificial intelligence systems grow in size and impact, so does the energy required to train and deploy them, not to mention the associated carbon emissions. Recent research suggests that if current AI server production continues at its current pace, their annual energy consumption could outstrip that of a small country by 2027. Deep neural networks, inspired by the architecture of the human brain, are especially power-hungry due to the millions or even billions of connections between multiple layers of neuron-like processors.

To counteract this mushrooming energy demand, researchers have doubled down on efforts to implement optical computing systems, which have existed experimentally since the 1980s. These systems rely on photons to process data, and although light can theoretically be used to perform computations much faster and more efficiently than electrons, a key challenge has hindered optical systems' ability to surpass the electronic state of the art.

“In order to classify data in a neural network, each node, or ‘neuron’, must make a ‘decision’ to fire or not based on weighted input data. This decision leads to what is known as a nonlinear transformation of the data, meaning the output is not directly proportional to the input,” says Christophe Moser, head of the Laboratory of Applied Photonics Devices in EPFL’s School of Engineering.
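
The fire-or-not decision Moser describes can be illustrated by a single artificial neuron: a weighted sum of the inputs followed by a nonlinear function. This is a generic sketch with arbitrary weights, not the EPFL optical implementation:

```python
import numpy as np

# A single artificial neuron: weighted input data followed by a
# nonlinear "fire or not" decision. Weights and inputs are arbitrary
# placeholders for illustration.
def neuron(x, w, b):
    z = np.dot(w, x) + b      # weighted sum of the input data
    return max(0.0, z)        # ReLU nonlinearity: fire only if z > 0

w, b = np.array([0.5, -1.0]), 0.2
a = neuron(np.array([1.0, 0.1]), w, b)   # fires: weighted sum 0.4 plus bias 0.2
c = neuron(np.array([0.1, 1.0]), w, b)   # weighted sum below zero: output 0.0
d = neuron(np.array([2.0, 0.2]), w, b)   # doubled first input: output is not 2 * a
print(a, c, d)
```

Doubling the input does not double the output, which is exactly the "output not directly proportional to the input" property the quote refers to.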

Moser explains that while digital neural networks can easily perform nonlinear transformations with transistors, in optical systems, this step requires very powerful lasers. Moser worked with students Mustafa Yildirim, Niyazi Ulas Dinc, and Ilker Oguz, as well as Optics Laboratory head Demetri Psaltis, to develop an energy-efficient method for performing these nonlinear computations optically. Their new approach involves encoding data, like the pixels of an image, in the spatial modulation of a low-power laser beam. The beam reflects back on itself several times, leading to a nonlinear multiplication of the pixels.

“Our image classification experiments on three different datasets showed that our method is scalable, and up to 1,000 times more power-efficient than state-of-the-art deep digital networks, making it a promising platform for realizing optical neural networks,” says Psaltis.

The research, supported by a Sinergia grant from the Swiss National Science Foundation, has recently been published in Nature Photonics.

A simple structural solution

In nature, photons do not directly interact with each other the way charged electrons do. To achieve nonlinear transformations in optical systems, scientists have therefore had to ‘force’ photons to interact indirectly, for example by using a light intense enough to modify the optical properties of the glass or other material it passes through.

The scientists worked around this need for a high-power laser with an elegantly simple solution: they encoded the pixels of an image spatially on the surface of a low-power laser beam. By performing this encoding twice, via adjustment of the trajectory of the beam in the encoder, the pixels are multiplied by themselves, i.e., squared. Since squaring is a non-linear transformation, this structural modification achieves the non-linearity essential to neural network calculations, at a fraction of the energy cost. This encoding can be carried out two, three or even ten times, increasing the non-linearity of the transformation and the precision of the calculation.
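
Numerically, the effect of re-encoding the same data k times is an elementwise k-th power. The snippet below is an abstraction of that idea, not a simulation of the optical hardware:

```python
import numpy as np

# Sketch of the multipass-encoding idea: encoding the pixel values onto
# the beam k times multiplies the data by itself, i.e., raises each
# pixel elementwise to the k-th power. Illustrative abstraction only.
def multipass_encode(pixels, passes=2):
    out = np.ones_like(pixels)
    for _ in range(passes):   # each pass multiplies by the data again
        out = out * pixels
    return out

x = np.array([0.5, 1.0, 2.0])
sq = multipass_encode(x, passes=2)   # elementwise square
cu = multipass_encode(x, passes=3)   # elementwise cube
print(sq, cu)
```

Because squaring is nonlinear, inserting such a pass between otherwise linear optical layers supplies the nonlinearity that neural network calculations require.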

“We estimate that using our system, the energy required to optically compute a multiplication is eight orders of magnitude less than that required for an electronic system,” Psaltis says.

Moser and Psaltis emphasize that the scalability of their low-energy approach is a major advantage, as the ultimate goal would be to use hybrid electronic-optical systems to mitigate the energy consumption of digital neural networks. However, further engineering research is needed to achieve such scale-up. For example, because optical systems use different hardware than electronic systems, a next step that the researchers are already working on is developing a compiler to translate digital data into code that optical systems can use.

Nature Photonics

10.1038/s41566-024-01494-z

Method of Research

Experimental study

Subject of Research

Not applicable

Article Title

Nonlinear Processing with Linear Optics

Article Publication Date

31-Jul-2024



RELATED ARTICLES

  1. Artificial Neural Networks: An Overview

    This paper discusses the artificial neural network and its basic types. The article explains the ANN, outlining the fundamental neuron and the artificial computing model.

  2. Neural networks: An overview of early research, current frameworks and

    The paper also shows how, over time, artificial neural networks have contributed to fundamental concepts at the birth and development of other disciplines such as Computational Neuroscience, Neuro-engineering, Computational Intelligence and Machine Learning.

  3. [1404.7828] Deep Learning in Neural Networks: An Overview

    View a PDF of the paper titled Deep Learning in Neural Networks: An Overview, by Juergen Schmidhuber. In recent years, deep artificial neural networks (including recurrent ones) have won numerous contests in pattern recognition and machine learning. This historical survey compactly summarises relevant work, much of it from the previous millennium.

  4. Fundamentals of Artificial Neural Networks and Deep Learning

    In this chapter, we go through the fundamentals of artificial neural networks and deep learning methods. We describe the inspiration for artificial neural networks and how the methods of deep learning are built. We define the activation function and its role in capturing nonlinear patterns in the input data.

  5. Review of deep learning: concepts, CNN architectures, challenges

    It then presents convolutional neural networks (CNNs) which the most utilized DL network type and describes the development of CNNs architectures together with their main features, e.g., starting with the AlexNet network and closing with the High-Resolution network (HR.Net).

  6. Artificial Neural Networks for Neuroscientists: A Primer

    Summary. Artificial neural networks (ANNs) are essential tools in machine learning that have drawn increasing attention in neuroscience. Besides offering powerful techniques for data analysis, ANNs provide a new approach for neuroscientists to build models for complex behaviors, heterogeneous neural activity, and circuit connectivity, as well ...

  7. Neural Networks

    About the journal: the Official Journal of the Asia-Pacific Neural Network Society, the International Neural Network Society, and the Japanese Neural Network Society. Neural Networks provides a forum for developing and nurturing an international community of scholars and practitioners who are interested in all aspects of …

  8. Brain-inspired learning in artificial neural networks: a review

    This paper presents a comprehensive review of current brain-inspired learning representations in artificial neural networks. We investigate the integration of more biologically plausible mechanisms, such as synaptic plasticity, to enhance these networks' capabilities. Moreover, we delve into the potential advantages and challenges accompanying ...

  9. Analysis of Artificial Neural Network: Architecture, Types, and

    This paper illustrates the different artificial neural network architectures, types, merits, demerits, and applications. Therefore, this paper provides valuable information to students and researchers to enrich their knowledge about an artificial neural network and research it.

  10. Multimodal neurons in artificial neural networks

    Multimodal neurons in artificial neural networks. We've discovered neurons in CLIP that respond to the same concept whether presented literally, symbolically, or conceptually. This may explain CLIP's accuracy in classifying surprising visual renditions of concepts, and is also an important step toward understanding the associations and ...

  11. Deep learning

    This overview paper on the principles of end-to-end training of modular systems such as deep neural networks using gradient-based optimization showed how neural networks (and in particular ...

  12. Introduction to artificial neural networks

    Introduction to artificial neural networks. Enzo Grossi and Massimo Buscema. The coupling of computer science and theoretical bases such as nonlinear dynamics and chaos theory allows the ...

  13. Artificial Neural Networks Based Optimization Techniques: A Review

    In the last few years, intensive research has been done to enhance artificial intelligence (AI) using optimization techniques. In this paper, we present an extensive review of artificial neural networks (ANNs) based optimization algorithm techniques with some of the famous optimization techniques, e.g., genetic algorithm (GA), particle swarm optimization (PSO), artificial bee colony (ABC), and ...

  14. Exploring the Advancements and Future Research Directions of Artificial

    Artificial Neural Networks (ANNs) are machine learning algorithms inspired by the structure and function of the human brain. Their popularity has increased in recent years due to their ability to learn and improve through experience, making them suitable for a wide range of applications. ANNs are often used as part of deep learning, which enables them to learn, transfer knowledge, make ...

  15. [1511.08458] An Introduction to Convolutional Neural Networks

    View a PDF of the paper titled An Introduction to Convolutional Neural Networks, by Keiron O'Shea and Ryan Nash. The field of machine learning has taken a dramatic twist in recent times, with the rise of the Artificial Neural Network (ANN). These biologically inspired computational models are able to far exceed the performance of previous forms ...

  16. AN INTRODUCTION TO ARTIFICIAL NEURAL NETWORK

    This paper gives an introduction to ANN and the way it is used. Keywords: artificial neural network, ANN, back propagation algorithm, neuron, weights.

  17. Artificial Neural Network: Understanding the Basic Concepts without

    An artificial neural network is a machine learning algorithm based on the concept of a human neuron. The purpose of this review is to explain the fundamental concepts of artificial neural networks. Keywords: Machine Learning, Artificial Intelligence, Neural Networks, Deep Learning.

  18. Machine learning with neural networks

    The purpose of this early research on neural networks was to explain neuro-physiological mechanisms [8]. Perhaps the most significant advance was Hebb's learning principle, describing how neural networks ... perceptron learning was the paper by Rumelhart et al. [12]. The authors demonstrated in 1986 that perceptrons can be trained by ...

  19. A view of Artificial Neural Network

    In this paper, An Artificial Neural Network or ANN, its various characteristics and business applications. In this paper also show that "what are neural networks" and "Why they are so important in today's Artificial intelligence?" Because various advances have been made in developing intelligent system, some inspired by biological neural networks. ANN provides a very exciting ...

  20. Physics solves a training problem for artificial neural networks

    Read the paper: Fully forward mode training for optical neural networks 'Physical' neural networks can be designed using several platforms involving, for example, optics 2 , nanoelectronics 3 ...

  21. Kolmogorov-Arnold Neural Networks Shake Up How AI Is Done

    Artificial neural networks—algorithms inspired by ... scientific research," says Ziming Liu, a computer scientist at the Massachusetts Institute of Technology and the paper's ...

  22. Engineers bring efficient optical neural networks into focus

    Researchers have published a programmable framework that overcomes a key computational bottleneck of optics-based artificial intelligence systems. In a series of image classification experiments ...

  23. Artificial Neural Networks for Beginners

    3. Artificial neural networks. One type of network sees the nodes as 'artificial neurons'. These are called artificial neural networks (ANNs). An artificial neuron is a computational model inspired by natural neurons. Natural neurons receive signals through synapses located on the dendrites or membrane of the neuron.

  24. Monitoring river water quality through predictive modeling using

    Wu J, Ming H, Xu J, (2021) Research on intelligent disaster prevention and mitigation method in high flood risk area of river basin based on artificial neural network, 2021 7th International Conference on Hydraulic and Civil Engineering & Smart Water Conservancy and Intelligent Disaster Reduction Forum (ICHCE & SWIDR), 943-952. https://doi ...

  25. Artificial Neural Network Systems

    Artificial Neural Networks is a calculation method that builds several processing units based on interconnected connections. The network consists of an arbitrary number of cells or nodes or units ...

  26. Predicting the Volumetric Mass Transfer Coefficient ...

    DOI: 10.3390/proceedings2024105058; Corpus ID: 271840103. Predicting the Volumetric Mass Transfer Coefficient in a Double Coaxial Mixer: An Artificial Neural Network Approach, by Forough Sharifi, Ehsan Behzadfar, and Farhad Ein-Mozaffari.

  27. [2107.01185] Artificial Neural Network for Cybersecurity: A

    View a PDF of the paper titled Artificial Neural Network for Cybersecurity: A Comprehensive Review, by Prajoy Podder and 4 other authors. Cybersecurity is a very emerging field that protects systems, networks, and data from digital attacks. With the increase in the scale of the Internet and the evolution of cyber attacks, developing novel ...

  28. [2101.08635] Neural Networks, Artificial Intelligence and the

    Advancing research in artificial intelligence using the architecture of the human brain seeks to model systems by studying the brain rather than looking to technology for brain models. This study explores the concept of ANNs as a simulator of the biological neuron, and its area of applications.

  29. Engineers bring efficient optical neural networks into focus

    Deep neural networks, inspired by the architecture of the human brain, are especially power-hungry due to the millions or even billions of connections between multiple layers of neuron-like ...