Artificial Neural Networks



Note

  • Not all questions match the current exam questions!!
  • This year's questions are marked with (05-06)
  • On the exam you have to draw 1 question from each of 2 piles. The last question is about your lab work.
  • The exam is open book. You may bring whatever you want.
  • Do keep in mind that you only get 15 minutes of preparation...
  • The exam is oral, before both professors, and they don't make life easy for you. The emphasis is on understanding rather than on reciting from memory. Anything related to the answer to your question is fair game for follow-up questions. You don't have to write anything down: you may bring all the papers from the course and such.

Tips

Follow-up questions are asked as well, so make sure you know a bit more than just the answer to the question itself.

Questions preceded by a "*" have not been solved yet

The course text is a rather terse patchwork of copies from other books (edges often not copied along), smudged illustrations and handwritten lecture notes. You can therefore use the links below to get some more background information:

[link 1] English slides from a Chinese course; they also treat everything very mathematically, but with a few more intermediate steps. The main topics are covered here as well. They are not really publicly accessible, though, so I have also put them online as a zip file.

Original site: http://www.im.isu.edu.tw/faculty/ymchiang/

The slides can be found there at: http://www.im.isu.edu.tw/faculty/ymchiang/ann93_2/Ch1.pdf up to Ch15.pdf

The zip is available at: http://www.wina.be/~merel/ymchiang-ann93_2.zip (2,288 KB)

Neural networks FAQ: another introduction

Introduction to Artificial Neural Systems by Jacek M. Zurada was useful and can be found in the science library. Don't hesitate to go through the libraries in search of documentation.

Exam questions

We were given 50 sample questions, from which 2 will be asked on the exam.

Lecture 1

Explain the differences between digital computers, neural networks and the human brain. (05-06)(07-08)

Literally in the text, pp. 3-4

Digital Computer vs. Artificial Neural Network vs. Human Brain:

  • What is processed: a digital computer processes bits as described by a program (software); its correctness is based on mathematical logic and Boolean algebra. An ANN processes patterns (the examples in the data set), using non-linear mathematical functions of several variables. The human brain also processes patterns.
  • Parallelism: a digital computer generally processes data sequentially (disadvantage!). An ANN processes data in parallel (advantage!): the neurons of one layer can process data in parallel. The human brain processes data in parallel too (advantage!).
  • Operation: a digital computer needs software to operate properly. An ANN needs training to operate properly; training is crucial. The human brain likewise needs training to operate properly.
  • Robustness: a digital computer is rigid, not robust; it works with strict rules and algorithms, so a change in a single number can have major consequences (disadvantage!). An ANN is robust against errors & noise in the data (advantage!), and even has error-correcting capabilities. The human brain is robust against errors & noise as well (advantage!).

(See question 4 for the applicability of either ANNs or computers.) There seems to be a strong analogy between neural networks and the human brain (biological neural networks). Watch your step, though: this analogy is NOT strong enough to convince engineers and computer scientists of the correctness of an artificial neural network just because the corresponding property holds in a biological neural network. The correctness of an artificial network MUST follow from mathematical analysis of non-linear functions or dynamical systems and from computer simulations. Another comparison between the ANN and the human brain:

Artificial Neural Network vs. Human Brain:

  • Complexity: low for the ANN (max ±100,000 neurons); high for the brain (±100,000,000,000 neurons).
  • Processing speed: high for the ANN (30-200 million operations per second); low for the brain (reaction time ±1-2 ms).
  • Energetic efficiency: poor for the ANN (min. 10^-6 joule per operation per second); good for the brain (min. 10^-16 joule per operation per second).


Conclusion: Artificial Neural Networks must stay modest with respect to the human brain.

  • The gap in the number of neurons won't be bridged within a few decades.
  • The energetic efficiency of ANNs is much worse than that of the human brain.
  • The only advantage of ANNs is their capability for fast processing.

Explain the difference between learning, memorization and generalization.(05-06)(07-08)

Learning

Learning is the automatic process of discovering rules from a data set. This learning happens by presenting examples to the entity (a neural network, the human brain). In artificial neural networks, the information is stored in the weights of the connections in the network.

There is supervised and unsupervised learning. In supervised learning a "teacher" is present who guides the learning process and gives the network feedback on the decision it makes for a given example. Unsupervised learning is a learning process without the presence of a teacher.

The big advantage of learning is that no expert is needed to derive the rules of the problem domain: this can happen automatically through the learning process of a neural network.

Generalization

Generalization is the ability of a network to correctly classify unseen examples, i.e. examples that do not occur in the training set.

A neural network learns from a training set. The generality of the learned rules must, however, be monitored with a test set. At some point in the training process, only the result on the training set keeps improving while the result on the test set deteriorates. This is called overtraining or overfitting, and it degrades the generality of the learned rules. A stopping rule is therefore introduced, which trains the network only as long as the error on the test set is still decreasing.

The generalization ability of a neural net means that no large amount of input data has to be stored. After training on a limited number of examples, a network can still make good decisions when an unseen example is presented at the input.

Memorization

Memorization is learning the details of the training data "by heart". A neural network that only memorizes will fail to generalize and will not be able to handle unseen data.

Explain the operating principle of the neuron and the perceptron and the adaptation of the weights (05-06)

  • Neuron
    • 1 unit in a neural network
    • the inputs & outputs of a neuron connect it to the other neurons in the network
    • propagates an incoming signal if it is strong enough
    • weighted input
    • non-linear activation function F
  • Perceptron
    • simulates 1 neuron
    • propagates the signal if the sum of the products of the inputs with their respective weights (in other words, the weighted sum) is greater than a threshold T
    • output: the output of the non-linear activation function applied to the weighted sum of the inputs

Learning in an ANN: learning to associate inputs with outputs by adapting the various weights on the inputs of the perceptrons (e.g. via the back-propagation algorithm).

The task of the perceptron in the learning process is thus to adapt its weights so as to associate the best possible output with its inputs.
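The operating principle above can be sketched in a few lines of code. This is an illustrative sketch only (the data set, learning constant and number of epochs are made up, not taken from the course text): a hard-limiting perceptron and the classical perceptron weight update w <- w + c·(d - o)·x.

```python
# Illustrative sketch of a single perceptron (hard-limiting activation)
# and the perceptron learning rule: w <- w + c * (d - o) * x.

def perceptron_output(weights, bias, x):
    """Weighted sum of the inputs followed by a step activation (threshold 0)."""
    s = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if s > 0 else 0

def train_perceptron(samples, lr=0.1, epochs=20):
    """samples: list of (input vector, desired output) pairs."""
    n = len(samples[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, d in samples:
            o = perceptron_output(w, b, x)
            err = d - o                      # 0 when the example is classified correctly
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

# Learning the AND function (linearly separable, so the rule must converge):
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w, b = train_perceptron(data)
print([perceptron_output(w, b, x) for x, _ in data])  # [0, 0, 0, 1]
```

Because AND is linearly separable, the perceptron convergence theorem guarantees that this update finds a separating line in a finite number of mistakes.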

In which areas is it useful to apply neural networks and where is it not useful? (05-06)(07-08)

3 areas where ANNs are well applicable (see pp. 21 & 22):
  • Classification
    • Classifying examples from a large data set → data mining
    • Mainly expert systems and pattern recognition
    • Fraud detection, medical diagnosis, OCR, recognition of digits, letters and faces, speech recognition, speaker recognition, quality control
  • Neural prediction (long and short term)
    • e.g. Santa Fe
    • predicting e.g. gas consumption, stock prices, customer behaviour
    • simulation of a dynamical model (space shuttle)
    • BUT: insight is needed, as well as analyses and PC simulations
  • Optimizing and controlling mechanical, chemical and biochemical processes
    • traditionally: linear controllers
    • an ANN is non-linear → much better
    • e.g. the pendulum
An ANN is NOT suitable for:
  • fixed, exact computations
  • systems whose behaviour is perfectly specified (e.g. Office)

Sometimes all physical rules are known and one could work exactly, but the rules are so complex that an ANN is still more suitable.

An ANN can do many tasks, but don't ask it why it does something! It is hard to extract exactly what the ANN has learned, because of the conversion to the numerical values it works with: inputs, outputs and weights.

* Explain the early stopping rule and overtraining.(05-06)(07-08)

When training with a training set, the error on the training set keeps decreasing. Initially, the error will also decrease on a validation set. From the moment this error starts growing again (in the trend, not in the small fluctuations), the network is becoming overtrained. This means the network starts memorizing the training set (i.e. scoring well on the training set, but not on the rest). To avoid memorization, training must be stopped when the error on the validation set starts increasing again. The right moment to stop is rather hard to determine, because the error may rise locally for a while and then continue to decrease afterwards.
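A minimal sketch of this stopping rule, assuming a generic train_step/validation_error interface (both are hypothetical stand-ins for a real training loop). The patience parameter implements "in the trend, not in the small fluctuations": training only stops after several consecutive epochs without improvement.

```python
# A minimal sketch of the early-stopping rule described above.
# train_step and validation_error stand in for a real training loop;
# "patience" ignores small local increases and only stops on the trend.

def early_stopping(train_step, validation_error, max_epochs=100, patience=5):
    best_err, best_epoch, bad_epochs = float("inf"), 0, 0
    for epoch in range(1, max_epochs + 1):
        train_step()                       # one pass over the training set
        err = validation_error()           # error on the held-out validation set
        if err < best_err:
            best_err, best_epoch, bad_epochs = err, epoch, 0
        else:
            bad_epochs += 1                # validation error did not improve
            if bad_epochs >= patience:     # rising trend, not a local blip
                break
    return best_epoch, best_err

# Toy run: the validation error falls, then rises (overtraining sets in).
errs = iter([0.9, 0.6, 0.4, 0.35, 0.37, 0.4, 0.45, 0.5, 0.6, 0.7])
epoch, err = early_stopping(lambda: None, lambda: next(errs), max_epochs=10, patience=3)
print(epoch, err)  # 4 0.35: training stops shortly after the best epoch
```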

Lecture 2

Explain the perceptron and delta learning rule. What is the difference and why? (07-08)

Explain the backpropagation algorithm and explain the role of the derivative of the activation function.(07-08)

Which kind of activation functions can one use for neural networks? What are advantages and disadvantages with respect to learning?(05-06)(07-08)

The standard choice is the sigmoid function (or soft-limiting function), either in symmetric (bipolar: -1,1) or asymmetric (unipolar: 0,1) form. The sigmoid function is global in the sense that it divides the feature space into two halves, one where the response is approaching 1 and another where it is approaching 0 (-1). Hence it is very efficient for making sweeping cuts in the feature space.

The simplest activation function is a step function: if the total net input is less than 0 (or, more generally, less than some threshold T) then the output of the neuron is 0, otherwise it is 1.

Too bad that doesn't help us much here, but I found this interesting and relevant piece to read on the net:

Activation functions for the hidden units are needed to introduce nonlinearity into the network. Without nonlinearity, hidden units would not make nets more powerful than just plain perceptrons (which do not have any hidden units, just input and output units). The reason is that a linear function of linear functions is again a linear function. However, it is the nonlinearity (i.e, the capability to represent nonlinear functions) that makes multilayer networks so powerful. Almost any nonlinear function does the job, except for polynomials. For backpropagation learning, the activation function must be differentiable, and it helps if the function is bounded; the sigmoidal functions such as logistic and tanh and the Gaussian function are the most common choices. Functions such as tanh or arctan that produce both positive and negative values tend to yield faster training than functions that produce only positive values such as logistic, because of better numerical conditioning (see ftp://ftp.sas.com/pub/neural/illcond/illcond.html).

For hidden units, sigmoid activation functions are usually preferable to threshold activation functions. Networks with threshold units are difficult to train because the error function is stepwise constant, hence the gradient either does not exist or is zero, making it impossible to use backprop or more efficient gradient-based training methods. Even for training methods that do not use gradients--such as simulated annealing and genetic algorithms--sigmoid units are easier to train than threshold units. With sigmoid units, a small change in the weights will usually produce a change in the outputs, which makes it possible to tell whether that change in the weights is good or bad. With threshold units, a small change in the weights will often produce no change in the outputs.

For the output units, you should choose an activation function suited to the distribution of the target values:

  • For binary (0/1) targets, the logistic function is an excellent choice (Jordan, 1995).
  • For categorical targets using 1-of-C coding, the softmax activation function is the logical extension of the logistic function.
  • For continuous-valued targets with a bounded range, the logistic and tanh functions can be used, provided you either scale the outputs to the range of the targets or scale the targets to the range of the output activation function ("scaling" means multiplying by and adding appropriate constants).
  • If the target values are positive but have no known upper bound, you can use an exponential output activation function, but beware of overflow.
  • For continuous-valued targets with no known bounds, use the identity or "linear" activation function (which amounts to no activation function) unless you have a very good reason to do otherwise.
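The activation functions discussed above, sketched for reference (the shift by the maximum in softmax is the usual guard against the overflow mentioned for exponential outputs; this is an illustration, not code from the course text):

```python
import math

# The common activation functions mentioned above, as a quick sketch.

def logistic(x):                      # asymmetric / unipolar sigmoid, range (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def tanh_act(x):                      # symmetric / bipolar sigmoid, range (-1, 1)
    return math.tanh(x)

def step(x, threshold=0.0):           # hard limiter: zero gradient, unusable for backprop
    return 1 if x >= threshold else 0

def softmax(xs):                      # for 1-of-C output coding: positive, sums to 1
    exps = [math.exp(x - max(xs)) for x in xs]   # shift by max to avoid overflow
    total = sum(exps)
    return [e / total for e in exps]

print(logistic(0.0))                  # 0.5
print(tanh_act(0.0))                  # 0.0
print(sum(softmax([1.0, 2.0, 3.0])))  # 1.0
```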

Explain the difference between feedforward and recurrent nets. Give some examples.

In a feed-forward network, the connections between units do not form cycles. This means that they contain no feedback loops; signals of unit activation travel in one direction only, namely from the input layer, possibly via intermediate layers containing hidden units, to the output layer. Such networks usually produce a response to an input quickly.

  • generalized linear models

In a feedback network, or recurrent network, there are cycles in the connections. In some feedback networks, each time an input is presented the NN must iterate for a potentially long time before it produces a response. These NNs are usually more difficult to train than feed-forward networks.

  • Hopfield net

What is the difference between supervised and unsupervised learning? Give examples.(05-06)

The training set for the TLU consists of a set of pairs (v, t), where v is an input vector and t is the target class or output ('1' or '0') that v belongs to. This type of training is known as supervised training, since we tell the net what output is expected for each vector. In supervised learning there is a teacher who gives the desired output values for given input values. Algorithms can use the difference between the output and the desired output (the error) to adapt the weights so as to minimize the error. In unsupervised learning there is no teacher and no output values. Here, the task of the network is to find regularities in the data (important statistical features, clusters). Supervised examples: perceptrons. Unsupervised examples: competitive learning, self-organizing maps, PCA (Hebbian learning, Oja).

From http://www.faqs.org/faqs/ai-faq/neural-nets/part1/:

  • In supervised learning, the correct results (target values, desired outputs) are known and are given to the NN during training so that the NN can adjust its weights to try to match its outputs to the target values. After training, the NN is tested by giving it only input values, not target values, and seeing how close it comes to outputting the correct target values.
  • In unsupervised learning, the NN is not provided with the correct results during training. Unsupervised NN's usually perform some kind of data compression, such as dimensionality reduction or clustering. See "What does unsupervised learning learn?"

The distinction between supervised and unsupervised methods is not always clear-cut. An unsupervised method can learn a summary of a probability distribution, and that summarised distribution can then be used to make predictions. Furthermore, supervised methods come in two subvarieties: auto-associative and hetero-associative. In auto-associative learning, the target values are the same as the inputs, whereas in hetero-associative learning the targets are generally different from the inputs. Many unsupervised methods are equivalent to auto-associative supervised methods. For more details, see "What does unsupervised learning learn?" Examples: see question 1, also for supervised.

Unsupervised Learning (explaining)

  • Hebbian Learning
    • Extract redundancies from data
    • Maximize output when input is similar to earlier inputs
    • Applications
      • Principal component analysis
      • Clustering
      • Feature Mapping
  • Competitive learning
    • Only one output unit can be on
      • Unit that wins inhibits all others
      • Output units are called winner-take-all or grandmother cells
    • Applications are clustering or categorization
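As an illustration of Hebbian learning for principal component analysis, here is a sketch of Oja's rule (the data and parameters are invented for the demo): the update dw = c·y·(x - y·w) is plain Hebbian learning (c·y·x) plus a decay term that keeps the weight vector bounded and pulls it toward the first principal component of the input.

```python
import random

# Sketch of Oja's rule for a single linear unit: dw = lr * y * (x - y*w).
# The decay term -lr*y*y*w normalizes |w| and aligns w with the first
# principal component. Data and parameters are made up for the demo.

def oja_train(samples, n, lr=0.01, epochs=50):
    w = [random.uniform(-0.1, 0.1) for _ in range(n)]
    for _ in range(epochs):
        for x in samples:
            y = sum(wi * xi for wi, xi in zip(w, x))          # linear unit output
            w = [wi + lr * y * (xi - y * wi) for wi, xi in zip(w, x)]
    return w

random.seed(0)
# Zero-mean 2-D data with most variance along the direction (1, 1):
data = []
for _ in range(200):
    t = random.gauss(0, 1.0)     # large component along (1, 1)
    s = random.gauss(0, 0.1)     # small component along (1, -1)
    data.append((t + s, t - s))

w = oja_train(data, n=2)
norm = sum(wi * wi for wi in w) ** 0.5
# |w| converges to 1 (Oja's rule self-normalizes) and w points along
# (1, 1), the dominant direction of the data:
print(abs(norm - 1.0) < 0.1, w[0] * w[1] > 0)  # True True
```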

Explain the discrete and continuous perceptron and delta learning rule. What is the difference and why? (05-06)

Discrete perceptron algorithm: page 60. In step 3 the discrete version computes sgn(w^T y), which according to p. 61 has zero slope and is not differentiable on the decision surface.

As with the Perceptron, the delta rule compares desired output to actual output to compute weight adjustment. But the delta rule squares the errors and averages them to avoid negative errors cancelling-out positive ones. (http://www.benbest.com/computer/nn.html#perceptrons)

At the bottom of p. 62 of Chapter 3: It may be noted that the weight adjustment rule (3.53 = delta training rule) corrects the weights in the same direction as the discrete perceptron learning rule originally formulated in (3.38). The size, or simply length, of the weight correction vector is the main difference between these rules. Both rules involve adding or subtracting a fraction of the pattern vector y. The essential difference is the presence of the moderating factor 1 - (o^k)^2. This scaling factor is obviously always positive and smaller than 1.
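A numeric sketch of this comparison (weights, pattern and learning constant are made-up numbers). The bipolar activation f(net) = 2/(1+e^-net) - 1 follows Zurada, so its derivative (1 - o^2)/2 is the moderating factor:

```python
import math

# Numeric sketch: both rules correct the weights in the same direction, but
# the delta rule moderates the step by f'(net) = (1 - o**2)/2.

def bipolar_sigmoid(net):
    """Zurada's bipolar activation f(net) = 2/(1+exp(-net)) - 1, range (-1, 1)."""
    return 2.0 / (1.0 + math.exp(-net)) - 1.0

def discrete_update(w, y, d, c):
    """Discrete perceptron rule: w <- w + (c/2)*(d - sgn(w.y))*y."""
    net = sum(wi * yi for wi, yi in zip(w, y))
    o = 1.0 if net >= 0 else -1.0
    return [wi + 0.5 * c * (d - o) * yi for wi, yi in zip(w, y)]

def delta_update(w, y, d, c):
    """Delta rule for the continuous perceptron: w <- w + c*(d - o)*f'(net)*y."""
    net = sum(wi * yi for wi, yi in zip(w, y))
    o = bipolar_sigmoid(net)
    moderating = 0.5 * (1.0 - o * o)          # always positive and smaller than 1
    return [wi + c * (d - o) * moderating * yi for wi, yi in zip(w, y)]

w, y, d, c = [0.2, -0.5], [1.0, 1.0], 1.0, 0.5
wd = discrete_update(w, y, d, c)
wc = delta_update(w, y, d, c)
# Same direction, but a smaller step for the delta rule:
print(wd[0] > w[0], wc[0] > w[0], wd[0] - w[0] > wc[0] - w[0] > 0)  # True True True
```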

Explain the difference between linearly separable and linearly non-separable tasks. Which tasks can be handled by a single layer neural network or by a multilayer neural network?

Incomplete:

  • linearly separable -> 1 layer
  • linearly non-separable -> more layers
    • e.g. XOR

Explain that certain classification problems can be solved with two layer Neural Networks, that cannot be solved with one layer Neural Networks.(05-06)(07-08)

p 69

If you have linearly separable points, you can do this with 1 layer (i.e. a perceptron), because each neuron effectively draws a line. So when you are dealing with linearly non-separable data, you have to draw a number of lines and then evaluate them together in your 2nd layer, which then says "class A" or "class B". The 2nd layer is then also just 1 neuron.

So suppose you are in this situation: not linearly separable. Then you can draw 3 lines here => 3 neurons in the 1st layer: in plain words, they say whether a point lies to the left (above) or to the right (below) of their line. Then in the 2nd layer you look at the 3 'outputs'. Based on the position relative to the three lines you can then determine the class (e.g. above line 1, below line 2, above line 3 → class A).

And that is only possible with a multilayer network, because if the classes are linearly separable, 1 line is enough and left = "class A", right = "class B".
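The "draw lines in layer 1, combine them in layer 2" idea can be made concrete for XOR, the classic linearly non-separable task. The weights below are hand-picked for the demo, not taken from the course text:

```python
# The 'draw lines in layer 1, combine them in layer 2' idea applied to XOR,
# with hand-chosen weights (illustrative; not from the course text).

def step(net):
    return 1 if net > 0 else 0

def neuron(weights, bias, x):
    return step(sum(w * xi for w, xi in zip(weights, x)) + bias)

def two_layer_xor(x):
    # Layer 1 draws two lines in the input plane:
    h1 = neuron([1, 1], -0.5, x)    # fires when x1 + x2 > 0.5 (at least one input on)
    h2 = neuron([1, 1], -1.5, x)    # fires when x1 + x2 > 1.5 (both inputs on)
    # Layer 2 (a single neuron) combines the two half-plane decisions:
    return neuron([1, -1], -0.5, [h1, h2])   # class = 'between the two lines'

print([two_layer_xor(x) for x in ([0, 0], [0, 1], [1, 0], [1, 1])])  # [0, 1, 1, 0]
```

No single line separates {(0,1), (1,0)} from {(0,0), (1,1)}, which is exactly why a one-layer net cannot compute this function.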

* Explain the backpropagation algorithm and the choice of the parameters and early stopping rule.(05-06)

The Back-Propagation Algorithm (http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html#The%20Back-Propagation%20Algorithm)

In order to train a neural network to perform some task, we must adjust the weights of each unit in such a way that the error between the desired output and the actual output is reduced. This process requires that the neural network compute the error derivative of the weights (EW). In other words, it must calculate how the error changes as each weight is increased or decreased slightly. The back propagation algorithm is the most widely used method for determining the EW.

The back-propagation algorithm is easiest to understand if all the units in the network are linear. The algorithm computes each EW by first computing the EA, the rate at which the error changes as the activity level of a unit is changed. For output units, the EA is simply the difference between the actual and the desired output. To compute the EA for a hidden unit in the layer just before the output layer, we first identify all the weights between that hidden unit and the output units to which it is connected. We then multiply those weights by the EAs of those output units and add the products. This sum equals the EA for the chosen hidden unit. After calculating all the EAs in the hidden layer just before the output layer, we can compute in like fashion the EAs for other layers, moving from layer to layer in a direction opposite to the way activities propagate through the network. This is what gives back-propagation its name. Once the EA has been computed for a unit, it is straightforward to compute the EW for each incoming connection of the unit. The EW is the product of the EA and the activity through the incoming connection.

Note that for non-linear units, (see Appendix C) the back-propagation algorithm includes an extra step. Before back-propagating, the EA must be converted into the EI, the rate at which the error changes as the total input received by a unit is changed.
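The EA/EW computation described in the quoted passage, sketched for a tiny network of linear units (all numbers are made up). The finite-difference check at the end confirms that the backward pass really computes the derivative of the error with respect to a weight:

```python
# Numeric sketch of the EA/EW computation for a tiny network with linear
# units (one hidden layer). The error is E = 0.5*(o - t)**2 per output unit;
# weights and inputs are made-up numbers.

def forward(x, w_hid, w_out):
    hidden = [sum(w * xi for w, xi in zip(ws, x)) for ws in w_hid]
    outputs = [sum(w * hi for w, hi in zip(ws, hidden)) for ws in w_out]
    return hidden, outputs

def backprop(x, target, w_hid, w_out):
    hidden, outputs = forward(x, w_hid, w_out)
    # EA of an output unit: difference between actual and desired output.
    ea_out = [o - t for o, t in zip(outputs, target)]
    # EA of a hidden unit: weights to the output units times their EAs, summed.
    ea_hid = [sum(w_out[k][j] * ea_out[k] for k in range(len(ea_out)))
              for j in range(len(hidden))]
    # EW of a connection: EA of the unit times the activity through the connection.
    ew_out = [[ea * h for h in hidden] for ea in ea_out]
    ew_hid = [[ea * xi for xi in x] for ea in ea_hid]
    return ew_hid, ew_out

x, target = [1.0, 2.0], [1.0]
w_hid = [[0.5, -0.3], [0.2, 0.4]]
w_out = [[0.7, -0.1]]
ew_hid, ew_out = backprop(x, target, w_hid, w_out)

# Check one EW against a finite difference of the error:
def error(wh):
    _, outputs = forward(x, wh, w_out)
    return 0.5 * sum((o - t) ** 2 for o, t in zip(outputs, target))

eps = 1e-6
w_plus = [[0.5 + eps, -0.3], [0.2, 0.4]]
numeric = (error(w_plus) - error(w_hid)) / eps
print(abs(numeric - ew_hid[0][0]) < 1e-4)  # True: backprop matches the numeric gradient
```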


What is the learning constant, and which values should you give it ? Explain its role.(05-06)(07-08)

http://www.dacs.dtic.mil/techs/neural/neural5.html

The rate at which ANNs learn depends upon several controllable factors. In selecting the approach there are many trade-offs to consider. Obviously, a slower rate means a lot more time is spent in accomplishing the off-line learning to produce an adequately trained system. With the faster learning rates, however, the network may not be able to make the fine discriminations possible with a system that learns more slowly. Researchers are working on producing the best of both worlds.

Generally, several factors besides time have to be considered when discussing the off-line training task, which is often described as "tiresome." Network complexity, size, paradigm selection, architecture, type of learning rule or rules employed, and desired accuracy must all be considered. These factors play a significant role in determining how long it will take to train a network. Changing any one of these factors may either extend the training time to an unreasonable length or even result in an unacceptable accuracy.

Most learning functions have some provision for a learning rate, or learning constant. Usually this term is positive and between zero and one. If the learning rate is greater than one, it is easy for the learning algorithm to overshoot in correcting the weights, and the network will oscillate. Small values of the learning rate will not correct the current error as quickly, but if small steps are taken in correcting errors, there is a good chance of arriving at the best minimum convergence.
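The overshoot-versus-slow-convergence trade-off can be seen on the simplest possible error surface, E(w) = w^2 (a sketch, not from the course text):

```python
# Effect of the learning rate on gradient descent over E(w) = w**2
# (gradient 2*w), with the update w <- w - lr * gradient.

def descend(lr, steps=50, w=1.0):
    for _ in range(steps):
        w -= lr * 2.0 * w        # gradient of E(w) = w**2 is 2*w
    return w

small = descend(lr=0.1)   # factor (1 - 0.2) per step: smooth convergence toward w = 0
large = descend(lr=1.1)   # factor (1 - 2.2) = -1.2 per step: oscillates, amplitude grows
print(abs(small) < 1e-3, abs(large) > 1.0)  # True True
```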

* Explain discriminant functions in classifier systems and their relation to the perceptron. (05-06)(07-08)

Maybe there is some useful info here? Linear discriminant analysis

Chapter 3: single layer perceptron classifier (page 54): the discriminant functions g_i()

Lecture 3

What is an attractor neural network and how does it work? Give an example.(05-06)(07-08)

See pp. 1-7, from lectures 3-7 (the part by the other professor)

Does the Hebb learning rule guarantee pattern stability in the Hopfield model? Explain. (05-06)(07-08)

See pp. 6-7

Is the loading capacity of the Hopfield model bounded? Explain. (05-06)(07-08)

See pp. 7-10

The Hopfield model is used to store a set of p given patterns ξ^μ_i so that, when a new, unknown pattern ζ is presented to the network, the network responds by producing the stored pattern ξ^ν that most closely resembles this newly presented pattern ζ.

The loading capacity of the model (the maximum number of patterns that can be stored) depends on the maximum acceptable error, i.e. the number of bits that are corrupted when storing this maximum number of patterns, p_max. Clearly, this error increases as we increase the number of patterns to be stored.

The capacity p_max is proportional to N (but never higher than 0.138 N) if we are willing to accept a small percentage of errors in each pattern, with N being the number of units in the network.

It is proportional to N/log N if we want most or all patterns to be recalled perfectly.

So yes, the loading capacity is bounded.

Simple reasoning brings you to the same conclusion:

Obviously, the fewer patterns are stored for a given number of N units, the easier it is to recall which of the stored patterns most resembles a newly presented pattern.

If too many patterns are stored, the network tends to remember nothing (= "catastrophic forgetting"). Also, the larger the network (more units), the more patterns can be stored.

So it is clear that the storage capacity is in some way proportional to N, the number of units.
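A minimal sketch of storage and recall (network size and the number of corrupted bits are chosen for the demo, with p well below 0.138·N):

```python
import random

# Minimal Hopfield sketch: store a few random +/-1 patterns with the Hebb
# rule w_ij = (1/N) * sum_mu xi_i^mu * xi_j^mu, then recall from a corrupted
# probe by repeatedly applying S_i <- sgn(sum_j w_ij * S_j).

def hebb_weights(patterns):
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:                        # no self-connections
                    w[i][j] += p[i] * p[j] / n
    return w

def recall(w, state, sweeps=10):
    n = len(state)
    s = list(state)
    for _ in range(sweeps):
        for i in range(n):                        # asynchronous updates
            net = sum(w[i][j] * s[j] for j in range(n))
            s[i] = 1 if net >= 0 else -1
    return s

random.seed(1)
n = 50
patterns = [[random.choice([-1, 1]) for _ in range(n)] for _ in range(3)]  # p << 0.138*N
w = hebb_weights(patterns)

probe = list(patterns[0])
for i in random.sample(range(n), 5):              # flip 5 of the 50 bits
    probe[i] = -probe[i]
overlap = sum(a == b for a, b in zip(recall(w, probe), patterns[0])) / n
print(overlap)   # close to 1.0: the corrupted pattern is (almost) fully restored
```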

Explain the concept of energy function in neural networks? (05-06)(07-08)

For an extensive explanation of the energy function, see the Hopfield model in the course text, pages 11 to 13. Below I try, as the question asks, to sketch the importance of the energy function.

The energy function is a function that has local minima at the attractors of a Hopfield network. It always decreases as the system evolves according to the dynamical rule. When the system is started in a valley, the network converges to the minimum of that valley.

The energy function exists if the weights of the network are symmetric. For biological networks this is implausible, but for artificial neural networks it is a "clever" strategy, because then the energy function exists. It is easy to show that applying the dynamical rule can only make the energy function decrease (see p. 12 for a derivation).

We can thus regard the energy function as something that is minimized in the stable states of a neural net. This is especially useful for finding suitable weights w_ij for given patterns by means of the energy function. It is possible to write down an energy function with a minimum that corresponds to a particular problem; one can then determine the weights w_ij from the coefficients of S_i S_j. (See p. 13.)
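A small numeric sketch of this argument (random symmetric weights with zero diagonal, invented for the demo): tracking E = -1/2 Σ w_ij S_i S_j while applying the dynamical rule shows that the energy never increases.

```python
import random

# Sketch of the Hopfield energy function E = -1/2 * sum_ij w_ij * S_i * S_j
# (symmetric weights, no self-connections): applying the dynamical rule can
# only lower E, so the net settles into a local minimum (an attractor).

def energy(w, s):
    n = len(s)
    return -0.5 * sum(w[i][j] * s[i] * s[j] for i in range(n) for j in range(n))

def update_once(w, s, i):
    """Dynamical rule for one unit: S_i <- sgn(sum_j w_ij * S_j)."""
    net = sum(w[i][j] * s[j] for j in range(len(s)))
    s[i] = 1 if net >= 0 else -1

random.seed(2)
n = 20
# Random symmetric weights with a zero diagonal:
w = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        w[i][j] = w[j][i] = random.gauss(0, 1)

s = [random.choice([-1, 1]) for _ in range(n)]
energies = [energy(w, s)]
for _ in range(5):                     # a few asynchronous sweeps
    for i in range(n):
        update_once(w, s, i)
    energies.append(energy(w, s))

print(all(b <= a + 1e-9 for a, b in zip(energies, energies[1:])))  # True: E never increases
```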

*Explain the projection method. When is it useful?(05-06) (07-08)

Part II, section "Correlated patterns"

* What is cross-talk noise? What does it tell us about the functioning of a network? (05-06)(07-08)

Extensions of the Hopfield model, p. 1

* Can one describe a temporal sequence of patterns with the Hopfield model? Explain.

II, 7

Lecture 4

How can one solve optimization problems using the Hopfield model? Give an example.(05-06)(07-08)

As you know, a Hopfield model is a method of minimizing an energy function. You can, however, also view this as an optimization problem by working with cost functions instead of energy functions.

By representing such optimization problems with a Hopfield model, we effectively create a certain type of parallelizable algorithm to solve that specific optimization problem. That algorithm can then be implemented fairly easily by slightly adapting existing Hopfield implementations.

[Cf. 'Optimization Problems', p. 32 of Lecture 4]

Hopfield and Tank built a VLSI chip for such networks in 1986, and they indeed converge very quickly to a local minimum. Unfortunately there is no guarantee that the solution is optimal, but experience shows that it is usually still a fairly good one [cf. the Traveling Salesman Problem].

Explain the energy function of the travelling salesman problem.(05-06)(07-08)

Bottom of p. 39a and top of p. 39b: A, B and C are the Lagrange multipliers in 10.2. The distance between the cities i and k is denoted by d(ik). The first term measures the total length of the tour, which is to be minimized. The second term vanishes if at most one neuron n(ia) is active for each city i (n(ia) = 1), i.e. if each city occurs at most once on the tour. The third term vanishes only if no station a is occupied by two or more cities at once. The last term has its minimum when exactly N neurons are active during the trip. Obviously, the function E takes its global minimum for the best tour.
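The four terms can be written out directly for a small instance (the coefficients and the distance matrix are made up, and the formulation follows the Hopfield-Tank style only loosely, with n[i][a] = 1 meaning city i occupies tour position a):

```python
# Sketch of the four terms of the TSP energy described above, for a neuron
# matrix n[i][a] = 1 if city i occupies tour position a. Coefficients
# A, B, C, D and the distance matrix are made up for the demo.

def tsp_energy(n, dist, A=1.0, B=1.0, C=1.0, D=1.0):
    N = len(n)
    # Length term: distance between cities in consecutive tour positions.
    length = sum(dist[i][j] * n[i][a] * (n[j][(a + 1) % N] + n[j][(a - 1) % N])
                 for i in range(N) for j in range(N) if i != j for a in range(N))
    # Vanishes if each city occurs at most once on the tour:
    row = sum(n[i][a] * n[i][b]
              for i in range(N) for a in range(N) for b in range(N) if a != b)
    # Vanishes if no position is occupied by two or more cities:
    col = sum(n[i][a] * n[j][a]
              for a in range(N) for i in range(N) for j in range(N) if i != j)
    # Minimal when exactly N neurons are active:
    count = (sum(n[i][a] for i in range(N) for a in range(N)) - N) ** 2
    return D / 2 * length + A / 2 * row + B / 2 * col + C / 2 * count

dist = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]      # 3 cities on a line
valid = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]     # tour 0 -> 1 -> 2 (a permutation matrix)
invalid = [[1, 1, 0], [0, 0, 0], [0, 0, 1]]   # city 0 twice, city 1 missing
print(tsp_energy(valid, dist) < tsp_energy(invalid, dist))  # True: constraints are penalized
```

For the valid tour all constraint terms vanish and only the length term remains, which is exactly why minimizing E with a Hopfield net tends toward short, valid tours.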

* Describe the design of a neural network for the weighted matching problem.(05-06)(07-08)

See p. 33. It says "comparing with (1.1)", but the part with (1.1) is not given. Does anyone know which function that should be?

Explain the differences in reinforcement learning for class I, II and III problems. (05-06)

In supervised learning we have so far always had a "teacher" who gave the correct output: we could then compare the obtained output with the correct output.

In reinforcement learning we have no "teacher" but a "critic". The critic only indicates whether the obtained output is right or wrong.

In reinforcement learning we have to picture the network in an environment: the environment provides the input for the network.

The network computes the output and returns it to the environment.

The environment gives a reinforcement signal to the network (right or wrong).

We can distinguish 3 classes of reinforcement learning problems, depending on the nature of the environment.

Class 1:

In the simplest case, the reinforcement signal is always the same for a given input-output pair.

The input patterns are chosen in a random order by the environment, or according to a schedule, but independently of earlier outputs.

Class 2:

Here an input-output pair only determines the probability that the signal is positive. This probability is fixed, though. The signal is still "right" or "wrong", but for a given input-output pair it will not always be the same. Still, the probability that the signal is "right" is always equally large.

Here too, the input order does not depend on the past.

This kind of problem occurs frequently in modelling animal learning, in economic systems and in simple games.

It is not a trivial problem: how do we determine the output with the highest probability of a positive signal?

An example of a problem with only 1 output at a time, which can take only 2 values, is the two-armed bandit problem: each time we may pull 1 of the 2 arms, without knowing how much chance we have of "scoring". http://www.willamette.edu/~gorr/classes/cs449/Reinforcement/Bandit/bandit.html

Here you can play the three-armed bandit problem.

Class 3:

This is the most general case. The environment itself may be governed by a complicated dynamical process. Both the reinforcement signals and the input patterns may depend on the history of the output values.

A classic example of such an application is a game, where the opponent is the environment. Suppose we want the network to play chess; it then only gets a real signal (win or lose) after a whole series of moves. How then should we assign a signal to the intermediate moves? This is called the "credit assignment problem".

The exploration-exploitation problem:

(This is not in the course text, but it does appear on every site about reinforcement learning that I find.)

One of the challenges in reinforcement learning is the choice we have to make between exploration and exploitation: if we find an action that receives a high reward (reinforcement signal), we should use this action often (exploit), but there may be even better actions, so we should also try the other actions (explore).

When we work with a stochastic task (one that does not always give the same reward for an input-output pair), we have to apply each action a number of times in order to estimate well which reward we are likely to receive for it. (To appreciate this, it is best to play the above-mentioned three-armed bandit a few times, each time with new rewards.)
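The exploration-exploitation trade-off on the two-armed bandit can be sketched with the common epsilon-greedy strategy (not from the course text; the arm probabilities and epsilon are invented):

```python
import random

# Exploration vs. exploitation on a two-armed bandit, sketched with the
# epsilon-greedy strategy: with probability epsilon explore a random arm,
# otherwise exploit the arm whose estimated reward rate is currently highest.

def play_bandit(arm_probs, pulls=5000, epsilon=0.1):
    counts = [0] * len(arm_probs)        # times each arm was pulled
    values = [0.0] * len(arm_probs)      # running estimate of each arm's reward rate
    for _ in range(pulls):
        if random.random() < epsilon:
            arm = random.randrange(len(arm_probs))                     # explore
        else:
            arm = max(range(len(arm_probs)), key=lambda a: values[a])  # exploit
        reward = 1 if random.random() < arm_probs[arm] else 0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]   # incremental mean
    return counts, values

random.seed(3)
counts, values = play_bandit([0.3, 0.7])  # arm 1 pays off more often
print(counts[1] > counts[0])              # True: the better arm is pulled most
```

The occasional exploration is what keeps the estimate for the initially-ignored arm improving, so the player eventually locks onto the truly better arm.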

* Explain the recurrent back-propagation algorithm. (07-08)

See p. 33. It says "comparing with (1.1)", but the part containing (1.1) is not given. Does anyone know which function that should be? See also IV, p. 2 — is this the same as the ordinary back-propagation algorithm?


Recurrent Backpropagation is used for fixed-point learning. NeuroSolutions is one of the few software products having this ability.

As with static backpropagation, fixed-point learning maps a static input with a static output. The difference is that the mapping is not instantaneous. When data is fed to the input of the network, the network cycles the data through the recurrent connections until it reaches a fixed output. Training a network using fixed-point learning can be more difficult than with static backpropagation, but the added power of these networks can result in much smaller and more efficient implementations.

In recurrent backpropagation, activations are fed forward until a fixed value is achieved. After this relaxation period, the error is computed and propagated backwards. The error activations must be stable before the weights can be updated, so relaxation of the error is also needed.

source: http://www.nd.com/definitions/recback.htm

Explain the associative reward-penalty algorithm. (05-06)(07-08)

A better explanation than in the course notes: http://www.cs.indiana.edu/classes/b553/arp.html

* Explain learning with a critic. (Can it be applied for the control of a plant?) When can it be applied?(05-06)(07-08)

See question 20.

One of the earliest machine-learning algorithms in the class we now refer to as reinforcement learning was the learning-with-a-critic method by Widrow, Gupta, and Maitra. In their work, a punish and reward scheme is used to train a single adaptive neuron. The novel feature of this method is that an error signal is not used as is required with supervised-learning algorithms; neither is the algorithm entirely unsupervised. Instead, an external observer called the critic evaluates the neuron's performance in a qualitative way and decides to either reward or punish its behavior. The neuron is rewarded if its performance is good, and is punished if its performance is bad.

Lecture 5

What sort of general tasks can one perform with unsupervised learning?

See the course notes on unsupervised learning, pp. 14-15. Unsupervised learning:

  • no teacher
  • a network with inputs and outputs
  • no feedback from the environment about the correctness of the outputs
  • the network must discover patterns, properties, ... in the input by itself and encode them in the output

General tasks:

Familiarity

A single continuous-valued output can indicate how much a new input pattern resembles typical or average patterns from the past. The network would gradually learn what is typical.

Principal Component Analysis

To extend the previous case to several units, one must construct a multi-component basis, or set of axes, along which to measure the similarity to previous examples. A common approach from statistics, known as "principal component analysis", uses the dominant eigenvector directions of the correlation matrix of the input patterns.
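As an illustration (not from the course notes): the dominant eigenvector direction of the correlation matrix can be found by power iteration. The data below are made up, spread mostly along the diagonal:

```python
import random

def dominant_direction(patterns, iters=200, seed=0):
    """Power iteration: find the dominant eigenvector of the correlation
    matrix C = (1/P) * sum_p x_p x_p^T, i.e. the first principal direction."""
    n = len(patterns[0])
    C = [[sum(x[i] * x[j] for x in patterns) / len(patterns)
          for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    v = [rng.random() for _ in range(n)]
    for _ in range(iters):
        w = [sum(C[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(c * c for c in w) ** 0.5
        v = [c / norm for c in w]   # renormalize each step
    return v

# made-up patterns lying mostly along the (1, 1) diagonal
pts = [(t, t + 0.1 * ((i % 3) - 1))
       for i, t in enumerate([x / 10 - 1 for x in range(21)])]
v = dominant_direction(pts)   # roughly proportional to (1, 1)
```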

Clustering

A set of binary outputs, of which only one is active at a time, could tell us which of many categories an input pattern belongs to. The appropriate categories would have to be found by the network itself, on the basis of the correlations in the input patterns. Each cluster of similar or nearby patterns would then be classified as a single output class.

Prototyping

The network could form categories as in the previous case, but then output a prototype or exemplar of the appropriate class. It would then function as an associative memory, but the memories would be derived directly from the input patterns, not imposed from outside.

Encoding

The output could be an encoded version of the input in fewer bits, retaining as much relevant information as possible. This could be used for data compression before transmission over a channel of limited bandwidth, assuming a decoding network can also be constructed.

Feature mapping (global organization of the output)

If the output units had a fixed geometric arrangement, such as a two-dimensional array with only one unit active at a time, they could map input patterns onto different points of this arrangement. The idea is to create a topographic map of the input, so that similar input patterns always activate nearby outputs. We expect a global organization of the output units to emerge.

These cases are not necessarily distinct and can also be combined in various ways. For example, the encoding problem can be solved using "Principal Component Analysis" or "Clustering". Principal Component Analysis (PCA) can itself be used for dimensionality reduction of the data before applying "clustering" or feature mapping.

Explain the standard competitive learning rule.(05-06)(07-08)

http://www.cs.indiana.edu/classes/b553/unsup2.html

* Explain vector quantization and learning vector quantization. What is the difference?(05-06)(07-08)

See course notes IV, p. 14.

  • Vector quantization: categorize a given set into M classes and represent each vector by its class.
    • Unsupervised
    • Categorizes
  • Learning vector quantization: ( http://fuzzy.cs.uni-magdeburg.de/~borgelt/doc/lvqd/ ) : A neural network for learning vector quantization consists of two layers: an input layer and an output layer. It represents a set of reference vectors, the coordinates of which are the weights of the connections leading from the input neurons to an output neuron. Hence, one may also say that each output neuron corresponds to one reference vector. The learning method of learning vector quantization is often called competition learning, because it works as follows: For each training pattern the reference vector that is closest to it is determined. The corresponding output neuron is also called the winner neuron. The weights of the connections to this neuron - and this neuron only: the winner takes all - are then adapted. The direction of the adaptation depends on whether the class of the training pattern and the class assigned to the reference vector coincide or not. If they coincide, the reference vector is moved closer to the training pattern, otherwise it is moved farther away.
    • Supervised
    • Adjusts the boundaries of the categories
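A minimal LVQ1 sketch of the update described above (my own illustration; the data, labels and learning rate are made up):

```python
def lvq1_step(reference, ref_class, pattern, pattern_class, lr=0.1):
    """One LVQ1 update: move the winning reference vector towards the
    pattern if the classes match, away from it otherwise."""
    sign = 1.0 if ref_class == pattern_class else -1.0
    return [r + sign * lr * (p - r) for r, p in zip(reference, pattern)]

def lvq1_train(refs, classes, data, lr=0.1, epochs=10):
    """refs: reference vectors; classes: their class labels;
    data: list of (pattern, label) pairs."""
    for _ in range(epochs):
        for pattern, label in data:
            # winner-takes-all: the closest reference vector is adapted
            dists = [sum((p - r) ** 2 for p, r in zip(pattern, ref))
                     for ref in refs]
            w = dists.index(min(dists))
            refs[w] = lvq1_step(refs[w], classes[w], pattern, label, lr)
    return refs

refs = [[0.0, 0.0], [1.0, 1.0]]
data = [([0.1, 0.0], "A"), ([0.0, 0.2], "A"),
        ([0.9, 1.0], "B"), ([1.0, 0.8], "B")]
refs = lvq1_train(refs, ["A", "B"], data)
```

Note the supervised part: the labels of the training patterns decide the direction of each update, which is exactly what plain (unsupervised) vector quantization lacks.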

Explain the adaptive resonance theory algorithm.(05-06)(07-08)

http://www.cs.indiana.edu/classes/b553/art.html

* What is a self-organizing map? Give an example.(05-06)(07-08)

Self-organizing maps - also called Kohonen feature maps - are a special kind of neural networks that can be used for clustering tasks. They are an extension of the so-called learning vector quantization. Note, however, that this document cannot provide an exhaustive explanation of self-organizing maps (see any textbook on neural networks for that). Rather, it only briefly recalls the basic ideas underlying this kind of neural networks.

A self-organizing map consists of two layers of neurons: an input layer and a so-called competition layer. The weights of the connections from the input neurons to a single neuron in the competition layer are interpreted as a reference vector in the input space. That is, a self-organizing map basically represents a set of vectors in the input space: one vector for each neuron in the competition layer.

A self-organizing map is trained with a method that is called competition learning: When an input pattern is presented to the network, that neuron in the competition layer is determined, the reference vector of which is closest to the input pattern. This neuron is called the winner neuron and it is the focal point of the weight changes. In pure competition learning - as it is used in learning vector quantization - only the weights of the connections leading to the winner neuron are changed. The changes are made in such a way that the reference vector represented by these weights is moved closer to the input pattern.

In self-organizing maps, however, not only the weights of the connections to the winner neuron are adapted. Rather, there is a neighborhood relation defined on the competition layer which indicates which weights of other neurons should also be changed. This neighborhood relation is usually represented as a (usually two-dimensional) grid, the vertices of which are the neurons. This grid most often is rectangular or hexagonal. During the learning process, the weights of all neurons in the competition layer that lie within a certain radius around the winner neuron w.r.t. this grid are also adapted, although the strength of the adaption may depend on their distance from the winner neuron.

The effect of this learning method is that the grid, by which the neighborhood relation on the competition layer is defined, is "spread out" over the region of the input space that is covered by the training patterns. The programs xsom and wsom visualize this process for a very simple learning task.

source: http://fuzzy.cs.uni-magdeburg.de/~borgelt/doc/somd/

Self-organizing maps (SOMs) are a data visualization technique invented by Professor Teuvo Kohonen which reduce the dimensions of data through the use of self-organizing neural networks. The problem that data visualization attempts to solve is that humans simply cannot visualize high-dimensional data as is, so techniques are created to help us understand this high-dimensional data. Two other techniques for reducing the dimensions of data that have been presented in this course are N-Land and Multi-dimensional Scaling. The way SOMs go about reducing dimensions is by producing a map of usually 1 or 2 dimensions which plots the similarities of the data by grouping similar data items together. So SOMs accomplish two things: they reduce dimensions and display similarities.

http://davis.wpi.edu/~matt/courses/soms/final.jpg

Just to give you an idea of what a SOM looks like, here is an example of a SOM which I constructed. As you can see, like colors are grouped together such as the greens are all in the upper left hand corner and the purples are all grouped around the lower right and right hand side.

source: http://davis.wpi.edu/~matt/courses/soms/
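A minimal 1-D SOM sketch of the training procedure described above (my own illustration, not from either source; grid size, rates and radius schedule are arbitrary choices):

```python
import random

def train_som_1d(data, n_units=10, epochs=50, lr0=0.5, radius0=3, seed=0):
    """Train a 1-D self-organizing map on scalar inputs. Units live on a
    line grid; the winner AND its grid neighbours are pulled towards each
    input, with learning rate and neighbourhood radius shrinking over time."""
    rng = random.Random(seed)
    weights = [rng.random() for _ in range(n_units)]
    for epoch in range(epochs):
        lr = lr0 * (1 - epoch / epochs)                     # decaying rate
        radius = max(1, round(radius0 * (1 - epoch / epochs)))
        for x in rng.sample(data, len(data)):               # shuffled sweep
            winner = min(range(n_units), key=lambda i: abs(weights[i] - x))
            for i in range(n_units):
                if abs(i - winner) <= radius:               # grid neighbourhood
                    weights[i] += lr * (x - weights[i])
    return weights

data = [i / 100 for i in range(101)]   # uniform samples on [0, 1]
w = train_som_1d(data)                 # units spread out over [0, 1]
```

The only difference from the pure competition learning of LVQ is the neighbourhood loop: because neighbours on the grid are dragged along, the grid "spreads out" over the input region, as the quoted text describes.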

Are hybrid learning schemes useful? Explain with an example.

See p. 29 of the unsupervised learning part (fairly far towards the back).

What is an RBF (radial basis function) network? How does it work? (05-06)(07-08)

http://neuron-ai.tuke.sk/NCS/VOL1/P3_html/node37.html One of the misconceptions surrounding the value unit architecture is based upon its use of a Gaussian activation function. This is because another network architecture, the Radial Basis Function (RBF) network [75], uses a similar activation function. That, however, is where the similarities end. The RBF network is a three-layer feedforward network that uses a linear transfer function for the output units and a nonlinear transfer function (normally the Gaussian) for the hidden units. The input layer simply consists of n units connected by weighted connections to the hidden layer, possibly with a smoothing factor matrix. A hidden unit can be described as representing a point x in n-dimensional pattern space. Consequently, the net input to a hidden unit is a distance measure between some input x_p presented at the input layer and the point represented by the hidden unit; that is, net = ||x_p − x||. This means that the net input to a unit is a monotonic function, as opposed to the nonmonotonic activation function of the value unit. The Gaussian is then applied to the net input to produce a radial function of the distance between each pattern vector and each hidden unit weight vector. Hence, an RBF unit carves a hypersphere within a pattern space, whereas a value unit carves a hyperbar.

In general, an RBF network can be described as constructing global approximations to functions using combinations of basis functions centered around weight vectors. In fact, it has been shown that RBF networks are universal function approximators [38]. Practically, however, the approximated function must be smooth and piecewise continuous. Consequently, although RBF networks can be used for discrimination and classification tasks (see [64] for some examples), binary pattern classification functions that are not piecewise continuous (e.g., parity) pose problems for RBF networks [75]. Thus, RBF networks and value unit networks are not equivalent.

http://www.nd.com/models/rbf.htm
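To make this concrete (not from the course notes; the centers, width, learning rate and XOR task are my own arbitrary choices): a tiny RBF network with fixed Gaussian hidden units and a linear output trained by gradient descent:

```python
import math

def rbf_forward(x, centers, width):
    """Hidden activations: a Gaussian of the distance to each center,
    i.e. the net input to a hidden unit is ||x - c||."""
    return [math.exp(-sum((xi - ci) ** 2 for xi, ci in zip(x, c))
                     / (2 * width ** 2)) for c in centers]

def train_rbf(data, centers, width=0.5, lr=0.1, epochs=500):
    """Fit the linear output weights by per-sample gradient descent;
    the nonlinear hidden layer (centers, width) is kept fixed."""
    w = [0.0] * len(centers)
    b = 0.0
    for _ in range(epochs):
        for x, target in data:
            h = rbf_forward(x, centers, width)
            y = sum(wi * hi for wi, hi in zip(w, h)) + b   # linear output
            err = target - y
            w = [wi + lr * err * hi for wi, hi in zip(w, h)]
            b += lr * err
    return w, b

# XOR-like task: not linearly separable, but easy with Gaussian hidden units
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
        ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]
centers = [x for x, _ in data]   # one Gaussian per training point
w, b = train_rbf(data, centers)
```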

Lecture 6

Explain Oja's rule in the framework of unsupervised Hebbian learning.(05-06)(07-08)

http://www.cs.indiana.edu/classes/b553/unsup1.html

* Give the connection between Sanger's rule and principal component analysis.(05-06)(07-08)

Weight vectors converge to orthogonal unit vectors, the first M principal component directions (normalized eigenvectors of the correlation matrix)
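A minimal sketch of Sanger's rule (the generalized Hebbian rule: each unit does an Oja update on the input minus the reconstructions of the units before it). The 2-D data below are made up so that the first principal direction is the x-axis:

```python
import random

def sanger_step(W, x, lr=0.01):
    """One Sanger's-rule update. W: list of weight vectors (one per output
    unit), x: input pattern. Unit i converges to the i-th principal
    component because it subtracts the components of units 0..i."""
    ys = [sum(wi * xi for wi, xi in zip(w, x)) for w in W]
    newW = []
    for i, w in enumerate(W):
        resid = list(x)
        for j in range(i + 1):   # j = i gives Oja's self-normalizing term
            resid = [r - ys[j] * wj for r, wj in zip(resid, W[j])]
        newW.append([wi + lr * ys[i] * r for wi, r in zip(w, resid)])
    return newW

rng = random.Random(0)
W = [[0.5, 0.5], [0.5, -0.5]]
for _ in range(2000):
    # zero-mean data, much more variance along x than along y
    x = (2 * rng.uniform(-1, 1), 0.5 * rng.uniform(-1, 1))
    W = sanger_step(W, x)
# W[0] converges to (+-1, 0): the first principal direction
```

With `range(i + 1)` replaced by `range(1)` for a single unit, this reduces to plain Oja's rule; the extra subtractions are what deflate away the earlier components so the weight vectors become the ordered eigenvectors of the correlation matrix.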

Discuss the relevance of input representation and the number of hidden units in a network construction.(05-06)

See p. 45 → chapter "Generalization and network architecture".

What is the aim of pruning and construction algorithms? Give an example. (05-06)(07-08)

See section 6.6, pp. 60-63.

To obtain good generalization ability one has to build into the network as much knowledge about the problem as possible and limit the number of connections appropriately. We need algorithms that optimize not only the weights but also the architecture itself. (optimizing number of layers and units per layer)

There are of course several different criteria judging the "optimality" of the network. There is the generalization ability, learning time, number of units etc. Given so many different hardware restrictions, the cost function for the architecture itself might get pretty complicated. So we mainly focus on using as few units as possible, this should not only reduce computational costs and perhaps training time, but should also improve generalization.

We could mount a search over the space of possible architectures, training each candidate with back-propagation; such a search can be carried out by a so-called genetic algorithm. But other, more promising approaches construct or modify an architecture to suit a particular task, proceeding incrementally: starting with too many units and taking some away, or starting with too few and adding more.

  • Pruning: starts with too many units and takes some away
  • Construction algorithm: start with a too small network and gradually grow one of the appropriate size. (e.g.: tiling algorithm, cascade correlation algorithm, upstart algorithm)

Explain the tiling algorithm.(05-06)(07-08)

See pp. 63-66, figures 6.16 & 6.17.

  • Start with a single unit that tries to produce the correct output on as many training examples as possible.
  • Add units to take care of the examples that the first unit got wrong
  • Only adds as many units as needed to cover all examples
  • Similar to the decision tree learning algorithm (zie machine learning)
  • Cross-validation can be used to decide whether a network of the right size has been found.

Start at each layer with a master unit that does as well as possible on the target task, and then add further ancillary units until the representation on that layer is faithful. The next layer is constructed in just the same way, using the output of the previous layer as its input. Eventually a master unit itself classifies all patterns correctly, and is therefore the desired output unit.

The master unit in each layer is trained to produce the correct target output (±1) on as many of its input patterns as possible. This can be done with the pocket algorithm (a variant of the perceptron learning rule: when the data are not linearly separable, it searches weight space and stores the set of weights that has had the longest unmodified run of successes so far; the algorithm is stopped after a chosen time t). The ancillary units in each layer are also trained with the pocket algorithm, but only on subsets of the patterns (subsets whose patterns share the same representation in the layer but not all the same target, which makes the representation unfaithful → their representations should all be made different).
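The pocket algorithm itself can be sketched as follows (not from the course notes; the AND task, number of steps and seed are my own arbitrary choices):

```python
import random

def pocket_train(data, steps=2000, seed=0):
    """Pocket algorithm: run perceptron learning on randomly drawn examples,
    but keep ('pocket') the weights with the longest run of consecutive
    correct answers seen so far. data: list of (input vector, target in
    {-1, +1}); a bias input of 1 is appended internally."""
    rng = random.Random(seed)
    n = len(data[0][0]) + 1
    w = [0.0] * n
    pocket, best_run, run = list(w), 0, 0
    for _ in range(steps):
        x, t = rng.choice(data)
        xb = list(x) + [1.0]                      # bias input
        y = 1 if sum(wi * xi for wi, xi in zip(w, xb)) >= 0 else -1
        if y == t:
            run += 1
            if run > best_run:                    # new champion: pocket it
                best_run, pocket = run, list(w)
        else:
            run = 0
            w = [wi + t * xi for wi, xi in zip(w, xb)]   # perceptron update
    return pocket

data = [([0.0, 0.0], -1), ([0.0, 1.0], -1),
        ([1.0, 0.0], -1), ([1.0, 1.0], 1)]        # AND: linearly separable
w = pocket_train(data)
```

On non-separable data the plain perceptron rule cycles forever; the pocket keeps the best weights found along the way, which is exactly what the master and ancillary units of the tiling algorithm need.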

Lecture 7

What is the relation between the Bayesian and maximum likelihood approach? (05-06)(07-08)

For large numbers of observations, the Bayesian representation of the density approaches the maximum likelihood solution. See figure 2.5.

Explain: Bayesian learning treats the issue of model complexity differently than cross-validation does. (05-06)(07-08)

Both Bayesian learning and cross-validation are techniques that can be used to make a decision about the complexity of a neural network.

Bayesian Learning

From [1]:

Bayesian methods will provide solutions to such fundamental problems as:

  • How to judge the uncertainty of predictions. This can be solved by looking at the predictive distribution, as described above.
  • How to choose an appropriate network architecture (e.g., the number of hidden layers, the number of hidden units in each layer).
  • How to adapt to the characteristics of the data (e.g., the smoothness of the function, the degree to which different inputs are relevant).

More on choosing an architecture:

Selection of an appropriate network architecture is another place where prior knowledge plays a role. One approach is to use a very general architecture, with lots of hidden units, maybe in several layers or groups, controlled using hyperparameters. This approach is emphasized by Neal (1996), who argues that there is no statistical need to limit the complexity of the network architecture when using well-designed Bayesian methods. It is also possible to choose between architectures in a Bayesian fashion, using the "evidence" for an architecture, as discussed by Mackay (1992a, 1992b).

How does Bayesian learning solve this problem?

The result of Bayesian training is a posterior distribution over network weights. If the inputs of the network are set to the values for some new case, the posterior distribution over network weights will give rise to a distribution over the outputs of the network, which is known as the predictive distribution for this new case. If a single-valued prediction is needed, one might use the mean of the predictive distribution, but the full predictive distribution also tells you how uncertain this prediction is.

Cross Validation

Quoting from [2]:

Cross-validation is a method for estimating generalization error based on "resampling". The resulting estimates of generalization error are often used for choosing among various models, such as different network architectures.

A technical explanation of cross-validation can be found in [2].

How cross-validation goes about determining the complexity of a model:

Cross-validation can be used simply to estimate the generalization error of a given model, or it can be used for model selection by choosing one of several models that has the smallest estimated generalization error. For example, you might use cross-validation to choose the number of hidden units, or you could use cross-validation to choose a subset of the inputs (subset selection). A subset that contains all relevant inputs will be called a "good" subset, while the subset that contains all relevant inputs but no others will be called the "best" subset. Note that subsets are "good" and "best" in an asymptotic sense (as the number of training cases goes to infinity). With a small training set, it is possible that a subset that is smaller than the "best" subset may provide better generalization error.
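A minimal sketch of model selection by k-fold cross-validation (my own illustration with two toy models rather than neural networks; the data are made up):

```python
import random

def kfold_cv_error(data, fit, k=5):
    """Estimate generalization error by k-fold cross-validation:
    train on k-1 folds, test on the held-out fold, average the MSE."""
    folds = [data[i::k] for i in range(k)]
    total, count = 0.0, 0
    for i in range(k):
        train = [p for j, f in enumerate(folds) if j != i for p in f]
        model = fit(train)
        total += sum((model(x) - y) ** 2 for x, y in folds[i])
        count += len(folds[i])
    return total / count

def fit_mean(train):                      # model 1: constant predictor
    m = sum(y for _, y in train) / len(train)
    return lambda x: m

def fit_line(train):                      # model 2: least-squares line
    n = len(train)
    sx = sum(x for x, _ in train); sy = sum(y for _, y in train)
    sxx = sum(x * x for x, _ in train); sxy = sum(x * y for x, y in train)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return lambda x: a * x + b

rng = random.Random(1)
data = [(x, 2 * x + 1 + 0.1 * rng.gauss(0, 1))
        for x in [i / 10 for i in range(30)]]
err_mean = kfold_cv_error(data, fit_mean)
err_line = kfold_cv_error(data, fit_line)
# the model whose complexity matches the data wins the comparison
```

For a neural network, `fit` would be "train a net with h hidden units" and the loop would run over candidate values of h — the Bayesian alternative replaces this held-out-error comparison with the evidence for each architecture.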

References:

[1] http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-7.html

[2] http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-12.html

Why does one favour small values in Bayesian learning of network weights? Explain.

See p. 77.

* What are the basic ideas implemented by support vector machines? Explain.(05-06)(07-08)

* Explain how a network learns in the Bayesian approach.(05-06)(07-08)

* What are the motivations for fuzzifying the perceptron rule? How can one do this? (05-06)(07-08)

* What is the kernel trick in SVM’s and why is it useful? Explain. (05-06)

In machine learning, the kernel trick is a method for easily converting a linear classification learning algorithm into a non-linear one, by mapping the original observations into a higher-dimensional non-linear space so that linear classification in the new space is equivalent to non-linear classification in the original space.

This is done using Mercer's condition, which states that any positive semi-definite kernel K(x, y) can be expressed as a dot product in a high-dimensional space.

More specifically, if the arguments to the kernel are in a measurable space X, and if the kernel is positive semi-definite — i.e.

Σ_i Σ_j K(x_i, x_j) c_i c_j ≥ 0

for any finite subset {x1, ..., xn} of X and any set {c1, ..., cn} of real numbers — then there exists a function φ(x) whose range is in an inner product space of possibly high dimension, such that

K(x, y) = φ(x) · φ(y).

The kernel trick transforms any algorithm that solely depends on the dot product between two vectors. Wherever a dot product is used, it is replaced with the kernel function. Thus, a linear algorithm can easily be transformed into a non-linear algorithm. This non-linear algorithm is equivalent to the linear algorithm operating in the range space of φ. However, because kernels are used, the φ function is never explicitly computed. This is desirable, because the high-dimensional space may be infinite-dimensional (as is the case when the kernel is a Gaussian).

The kernel trick was first published in the 1964 paper Theoretical foundations of the potential function method in pattern recognition learning.

source: http://en.wikipedia.org/wiki/Kernel_trick
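A concrete check of this (not from the course notes): for the quadratic kernel K(x, y) = (x · y)² in two dimensions, the feature map φ is small enough to write down explicitly:

```python
def poly2_features(x):
    """Explicit feature map for K(x, y) = (x . y)^2 in 2-D:
    phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)."""
    r2 = 2 ** 0.5
    return (x[0] * x[0], r2 * x[0] * x[1], x[1] * x[1])

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def poly2_kernel(x, y):
    return dot(x, y) ** 2

x, y = (1.0, 2.0), (3.0, 0.5)
# the kernel equals the dot product in the higher-dimensional space,
# without ever computing phi explicitly
assert abs(poly2_kernel(x, y) - dot(poly2_features(x), poly2_features(y))) < 1e-9
```

For a Gaussian kernel the corresponding φ is infinite-dimensional, so only the kernel form is computable — which is exactly why the trick is useful.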

* What are slack variables? Explain their use. (05-06)

Lecture 8

Explain the problem of echo cancellation and the use of adaline networks to overcome it. (05-06)(07-08)

See pp. 3-9.

In long-range telephony (see picture p. 3):

Voice signal to be transmitted: s
Delayed Incoming Signal: n (= noise) 

There is always a distorted echo of n added to s, impossible to avoid (or so I gather)

Problem: given (s + n') and n, calculate y so that:

e = s + (n' – y) approaches s as best as possible.

Now, (<x> means average value of all x's) The mean square error is:

<e²> = <s²> + <(n' – y)²> + 2<s(n' – y)>

On the pages, it says that 2<s(n' – y)> = 0, because the correlation between the signal you receive and the signal you send is zero. So, simplified, the mean square error is:

<e²> = <s²> + <(n' – y)²>

So, the solution is to create an adaptive filter that, given n, calculates y.

After having stated this, they start talking about adalines (adaptive linear elements) which as far as I can see is just a perceptron with binary bipolar output. Presumably, this actually has some connection to the problem of echo cancellation, though I don't see how... How is n a vector? How then do we apply the single value y to a similar vector s + n'? How can we determine the desired output n' (this being the whole reason for the filter).

The only solution I see is this: we use a layer of adalines so that we have k outputs, with k being the dimension of the input vector n. Also, we drop the notion that adalines have binary bipolar output, and instead use unary bipolar output (in which case we can present the input n as a bit vector). As desired output we have s + n'. Since adalines attempt to minimize <(d – w'x)²>, according to the above formula this equals minimizing <(s + n' – y)²> = <s²> + <(n' – y)²>. This formula has, as minimum, <s²>.

Problem solved... I hope

From what I remember of DSP and adaptive filters, usually the last n samples of the incoming signal are used to calculate y. n is then called the number of taps of the adaptive filter and allows modeling of channels that can be described by polynomials with n terms. I expect the same setup is used here: n delay elements in a string with the outputs all going into the ANN, the single ANN output being the y that is subtracted from the recorded signal.

If they complain about 2<s(n' - y)> being 0, the best explanation I can come up with is this: presume y starts at a random positive value smaller than n'. This value will start increasing during training and therefore approach n'. I have no explanation as to why it should stop increasing afterwards (because it wouldn't... it can decrease <e²> further by increasing y...). An alternative solution would be to add the absolute value of 2<s(n' – y)>, but I'd be very careful mentioning that!
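The tapped-delay-line setup described above can be sketched as follows (my own illustration; the echo coefficients, the sine "speech" and all parameter values are made up). The filter sees only the far-end signal n and the microphone signal s + n', and learns y ≈ n' with the LMS rule, so the residual e = s + (n' − y) approaches s:

```python
import math, random

def lms_echo_canceller(far_end, mic, taps=4, lr=0.01):
    """LMS adaptive filter for echo cancellation. far_end: incoming signal n
    (the echo source); mic: s + n', where n' is a filtered echo of n.
    Returns the residual e = mic - y (the cleaned signal) and the weights."""
    w = [0.0] * taps
    buf = [0.0] * taps
    out = []
    for n_t, m_t in zip(far_end, mic):
        buf = [n_t] + buf[:-1]                   # delay line: last `taps` samples of n
        y = sum(wi * bi for wi, bi in zip(w, buf))
        e = m_t - y                              # e = s + (n' - y)
        w = [wi + lr * e * bi for wi, bi in zip(w, buf)]   # LMS update
        out.append(e)
    return out, w

rng = random.Random(0)
N = 5000
far = [rng.uniform(-1, 1) for _ in range(N)]
echo = [0.6 * far[t] + 0.3 * (far[t - 1] if t else 0) for t in range(N)]
speech = [0.5 * math.sin(2 * math.pi * t / 50) for t in range(N)]
mic = [s + e for s, e in zip(speech, echo)]
cleaned, w = lms_echo_canceller(far, mic)
# after convergence w is close to the echo path (0.6, 0.3, 0, 0)
```

This also shows why the cross-term <s(n' − y)> ≈ 0 matters: s is uncorrelated with the delay-line contents, so minimizing <e²> by gradient descent drives only the <(n' − y)²> part to zero and leaves s untouched.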

External links: a definition can be found at http://en.wikipedia.org/wiki/Artificial_neural_network#ADALINE

Matlab info: http://www.mathworks.com/access/helpdesk/help/toolbox/nnet/adaptiv6.html

The first 6 pages about noise cancellation and adaptive filters, explained in other slides: http://www.im.isu.edu.tw/faculty/ymchiang/ann93_2/Ch10.pdf

* Explain the Adaline network and the LMS learning rule. (05-06)(07-08)

* What is a cellular neural network and how does it work ? What are its main advantages? (05-06)(07-08)

For a definition see: http://www.ce.unipr.it/pardis/CNN/cnn.html

A well-written introduction to CNNs: http://www.cs.ubbcluj.ro/~lauras/Papers/CellularComputing.rtf

What are typical application areas of cellular neural networks?(05-06)(07-08)

  • Image processing, including greyscale image inputs:
    • feature extraction
    • motion detection and estimation
    • path tracking
    • collision avoidance
    • halftoning, including motion
    • object counting and size estimation
  • Analyzing 3D surfaces
    • detecting minima and maxima
    • detecting areas where the gradients exceed a given limit
  • Solving partial differential equations
  • Reducing non-visual problems to geometric maps
    • thermographic images
    • somatosensory maps (tactile sensing)
    • antenna array images
    • different medical maps and images
  • Modelling biological vision and other sensory-motor organs

What is a CNN template? (05-06)(07-08)

The CNN template defines the interconnection weights among the cells and determines the functionality of the CNN.

A template specifies the interaction between each cell and all its neighbouring cells in terms of their input, state, and output variables.

Lecture 9

What are strong points and drawbacks for applications of neural networks to time-series prediction? (05-06)

See also p. 131, pp. 135-138.

  • ANNs = no magic, but based on insight, mathematics, statistical analysis and simulations
  • Very general: independent of the kind of system (e.g. physical, biological, ...)
  • Can also be used when the physical laws of the system are known, but are too difficult to use in practice (e.g. turbulence)
  • An ANN models more than linear models do
    → better for chaotic systems and for systems with multiple equilibria
  • Predictions with NNs can be by far the best among all methods (mainly thanks to the nonlinearity), BUT there is no guarantee of success;
    use must be based on:
    • Insight
    • Mathematical / statistical analysis
    • Simulations
      E.g. Santa Fe → NNs won, but many NNs were also among the worst
  • Many parameters determine the degree of success (not always immediate success!)
    • Architecture: # neurons, type of network, activation functions
      • too many neurons → poor generalization
      • too few neurons → not expressive enough
    • training is VERY important
      • amount of data
      • a good test and training set
      • preprocessing → enough data (±5x the number of weights) and spread out enough
    • how to measure the error?
      • Fitting error on the training set
      • Generalization error on the test set

Used intelligently this can give the best results, but naive use can give very poor results. An ANN can also learn something other than what was intended!

Discuss time-series prediction competitions and results obtained by neural networks.(05-06)

See also p. 131, pp. 135-138, conclusions p. 143.

Context — time series: observations of some process that evolves over time.

Time series analysis: modelling a temporal relationship between the successive observations.

An alternative way to obtain a mathematical model of a system.

ANNs: can be trained to find nonlinear relationships between the observations without too much theoretical complexity.

Time series prediction competitions: e.g. the Santa Fe Institute competition.

  • predictions of the further course of the datasets
  • winner of the competition: neural networks

→ multilayered feed forward with back propagation outperforms all other methods!

Conclusion: "convincing real life applications for the use of neural networks in time series analysis"

Discuss the use of neural networks in the alvinn vehicle control system.

See pp. 189-190.

Alvinn system: automated truck
Uses: Back propagation algorithm
Each image: 30x32 pixels
Network Input comes from a camera mounted on the truck,
Network Output is the direction in which the vehicle is steered
ANN:
  * 960 input units
  * 4 hidden units
  * 30 output units

Training was obtained by observing and mimicking a human driver for 5 minutes. The vehicle then drove at 70 mph for 90 miles on a public highway.

Explain the use of backpropagation in control applications and what is backpropagation through the plant ?

See p. 76 & p. 178.

* Explain the difference between feedforward and feedback neural control.

* Explain the identification of nonlinear plants by neural networks and the 4 classes of models as used by Narendra and the difference between parallel and series-parallel identification. (05-06)

* Explain the use of neural networks for the forecasting of the electric load/electricity consumption. (05-06)

Lecture 10

Discuss different neural control strategies for controlling an inverted pendulum.

  1. use reinforcement learning
  2. treat it as an optimization problem whose cost function must be minimized

Discuss the design of a neural network for controlling a truck backer-upper.

See p. 182.

Discuss neural networks for use in ovarian cancer classification and prediction. (05-06)

See pp. 153-159.

Artificial neural network models for the preoperative discrimination between malignant and benign adnexal masses. (Predicting before the operation whether the tumour is benign or malignant.)

Short description:

The idea here is to use an ANN to predict whether a cancer is malignant on the basis of a few inputs (age, a bunch of other tests, ...) and to evaluate it against a few other prediction methods. Other prediction methods are RMI (risk of malignancy index), a logistic regression model, ...

Conclusion:

An ANN can be trained to make clinically accurate predictions. From the ROC curve + table 3 it appears that the ANN approach is roughly as good as the regression model and better than the single-parameter models such as age, ...

ROC curves (Receiver Operating Characteristic)

The output of a neural network is a real number between 0 and 1

A threshold should be used on the output in order to decide what is benign and what is malignant.

if output < threshold prediction negative
if output > threshold prediction positive

With any choice of the threshold there are 4 classes in the total population:

	TN	True negative:	negative prediction for benign case
	TP	True positive:	positive prediction for malignant case
	FN	False negative:	negative prediction for malignant case
	FP	False positive:	positive prediction for benign case

The sensitivity = TP / (TP + FN) = TP / # positives

The specificity = TN / (FP + TN) = TN / # negatives

ROC curve: sensitivity versus (1 – specificity).
Plotting this point for all possible thresholds yields a curve.
The area under the curve is a measure of the quality of the method, independent of the choice of threshold.

!!NEEDS FIGURE!!
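The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not from the course; the outputs, labels, and the function name `confusion_counts` are made up for the example (label 1 = malignant/positive, 0 = benign/negative).

```python
# Sketch: confusion counts plus sensitivity and specificity for one threshold.
# All data below are illustrative, not from the paper.

def confusion_counts(outputs, labels, threshold):
    """Count TP, TN, FP, FN for a given decision threshold.

    A case is predicted positive when the network output exceeds the threshold.
    """
    tp = sum(1 for o, y in zip(outputs, labels) if o > threshold and y == 1)
    tn = sum(1 for o, y in zip(outputs, labels) if o <= threshold and y == 0)
    fp = sum(1 for o, y in zip(outputs, labels) if o > threshold and y == 0)
    fn = sum(1 for o, y in zip(outputs, labels) if o <= threshold and y == 1)
    return tp, tn, fp, fn

outputs = [0.9, 0.8, 0.3, 0.6, 0.1, 0.2]   # illustrative network outputs
labels  = [1,   1,   1,   0,   0,   0]     # illustrative true classes

tp, tn, fp, fn = confusion_counts(outputs, labels, threshold=0.5)
sensitivity = tp / (tp + fn)   # TP / #positives
specificity = tn / (tn + fp)   # TN / #negatives
print(sensitivity, specificity)
```

Repeating this for many threshold values gives the points of the ROC curve.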

Discuss the use of neural networks in fraud detection applications. (05-06)

From the introductory text at the beginning of the course notes:

"One obvious way to tackle this problem is to design a collection of rules to detect the important types of fraud. The inconvenience of this approach is that the rules have to be explicitly drafted and that many rules may be needed. This is called the expert systems approach since the knowledge of experts is frozen into the rules.

The neural network approach on the other hand relies on its learning behaviour from a large collection of examples, which consist of customer behaviour and the corresponding type of fraud or no fraud. This set of examples is used to train the feed forward neural network. Once the network has learned to detect fraud one only has to apply the behaviour of a consumer and the neural network output will recognize whether fraud has happened. The remarkable characteristic of a neural network is that it is implicitly able to generalize i.e., it is able to detect patterns of fraud in the new data which are similar to those that have occurred in the training set.

The generalization property can be effectively observed and used in order to stop the training process as follows (early stopping rule). One divides randomly the input data in two sets: the training set and the test set. The training set is used to adapt the weights during the training. As one is processing the examples of the training set many times, the average accumulated squared error between the desired and the actual output for the entire training set decreases. This error for the test set usually decreases first along with that for the training set but from a certain epoch the test error starts to increase. This is called overtraining and should be avoided. One should continue to train until the average error in the test set is minimal."
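The early-stopping rule quoted above can be sketched as a training loop that keeps the weights with the lowest test error. This is only an outline under assumed names: `train_one_epoch` and `average_error` are hypothetical callables standing in for whatever training and evaluation routines are used.

```python
# Sketch of the early-stopping rule: keep the weights at which the
# average error on the test set is minimal.  `train_one_epoch` and
# `average_error` are hypothetical, supplied by the caller.

def train_with_early_stopping(network, train_set, test_set,
                              train_one_epoch, average_error, max_epochs=100):
    best_error = float("inf")
    best_weights = network.copy()
    for epoch in range(max_epochs):
        train_one_epoch(network, train_set)        # adapt weights on training set
        test_error = average_error(network, test_set)
        if test_error < best_error:                # test error still decreasing
            best_error = test_error
            best_weights = network.copy()
        # (a practical version would also stop once the test error has
        #  increased for several consecutive epochs)
    return best_weights, best_error

# Toy demonstration: a "network" that is just one number, whose test
# error is minimal when that number reaches 3.
net = [0]
w, e = train_with_early_stopping(net, None, None,
                                 lambda n, _: n.__setitem__(0, n[0] + 1),
                                 lambda n, _: abs(n[0] - 3),
                                 max_epochs=6)
print(w, e)  # the weights from the epoch with minimal test error
```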

Example from the slides: fraud detection for bank cards [near the beginning of the course notes]

They use a 3-layer feed-forward network trained as described above. During the night, all transactions of the previous day can be examined in order to detect fraud.

Example from the slides: fraud detection for mobile phones [at the very end of the course notes]. There are several fraud indicators [long calls to exotic destinations, several (shorter) calls to exotic destinations, repeatedly calling the same number (freephone fraud), calls made in very quick succession (cloning), etc.]

The toll tickets (= the data record created for every call, with information such as call duration, date, the number called, etc.) can be used to train a neural network.

There are two possible types of analysis:

  • Absolute analysis: as soon as usage exceeds a certain limit, the system assumes fraud is being committed. The main drawback is that some business users routinely have very high activity and would therefore be flagged as fraudulent.
  • Differential analysis: a profile of each user's 'normal' activity is maintained; a large deviation from this profile raises an alarm.
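The difference between the two analysis types can be illustrated with a toy differential check. This is a sketch only: the per-user profile (mean and standard deviation of daily call minutes) and the function name `is_suspicious` are made up for the example.

```python
# Toy sketch of differential analysis: compare today's usage against the
# user's OWN profile (mean and std of daily call minutes, both made up).

def is_suspicious(profile_mean, profile_std, todays_minutes, n_sigma=3.0):
    """Alarm when today's usage deviates more than n_sigma standard
    deviations from the user's normal profile."""
    if profile_std == 0:
        return todays_minutes != profile_mean
    return abs(todays_minutes - profile_mean) > n_sigma * profile_std

# A heavy business user is NOT flagged, because the comparison is
# against their own profile (the weakness of absolute analysis):
print(is_suspicious(400.0, 50.0, 480.0))  # False: within this user's normal range
print(is_suspicious(20.0, 5.0, 480.0))    # True: far above this user's normal range
```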

At the KUL they do research into this type of fraud detection with neural networks. At the following URL you can find a paper explaining in some detail how they proceed:

http://www.esat.kuleuven.ac.be/cosic/aspect/papers/WP22_MO6.htm

Explain the receiver operating curve, its use and its role in comparing classification systems. (05-06)

See p119-120, and question 51

When using classification systems with an output between 0 and 1, for a certain threshold t, there are always 4 cases: (answer is positive when output > t, otherwise it is negative)

	TN	True negative:	negative prediction for negative case
	TP	True positive:	positive prediction for positive case
	FN	False negative:	negative prediction for positive case
	FP	False positive:	positive prediction for negative case

These counts are a function of the chosen threshold.

Define:

Sensitivity = TP / (TP + FN) = TP / #positives
Specificity = TN / (TN + FP) = TN / #negatives

These values are, of course, also a function of the chosen threshold.

The ROC is the curve obtained by plotting the pairs (sensitivity, 1 – specificity) for all values of t. For an example, see p119.

It is desirable to have points in the upper left corner, meaning both high sensitivity and high specificity. A single numerical value expressing the power of a classification system is the area under the curve, as shown on p119.
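The whole procedure, sweeping the threshold t and integrating the resulting curve, can be sketched as follows. The scores and labels are illustrative, and the helper names `roc_points` and `auc` are made up; the area is computed with the trapezoidal rule.

```python
# Sketch: trace the ROC curve by sweeping the threshold over all
# distinct output values, then compute the area under the curve.

def roc_points(outputs, labels):
    """Return (1 - specificity, sensitivity) pairs for every threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    # sentinels 2.0 and -1.0 guarantee the endpoints (0, 0) and (1, 1)
    thresholds = [2.0] + sorted(set(outputs), reverse=True) + [-1.0]
    for t in thresholds:
        tp = sum(1 for o, y in zip(outputs, labels) if o >= t and y == 1)
        fp = sum(1 for o, y in zip(outputs, labels) if o >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

pts = roc_points([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])  # illustrative scores
print(auc(pts))  # perfectly separated scores give area 1.0
```

An area of 1.0 corresponds to a perfect classifier, 0.5 to random guessing; this single number allows comparing classification systems independently of the threshold choice.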

Explain K-means clustering.(05-06)

http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/kmeans.html
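In outline, K-means alternates an assignment step (each point joins its nearest centre) with an update step (each centre moves to the mean of its cluster) until the centres stop moving. A minimal sketch, using 1-D points and illustrative data for brevity:

```python
# Minimal K-means sketch (plain Python, 1-D points; data and k are illustrative).
import random

def kmeans(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # initialise with k random points
    for _ in range(iters):
        # assignment step: each point goes to its nearest centre
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        # update step: each centre moves to the mean of its cluster
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:           # converged
            break
        centers = new_centers
    return sorted(centers)

print(kmeans([1.0, 1.1, 0.9, 10.0, 10.2, 9.8], 2))  # two centres, near 1 and 10
```

Note that the result can depend on the random initialisation; in practice one often runs the algorithm several times and keeps the clustering with the lowest total within-cluster distance.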

* Explain the use of PCA, discriminant analysis, ANN and Bayesian networks for distinguishing different types of leukemia. (05-06)

Follow-up questions

(General follow-up questions can go here. If a question extends another one, it is best placed with that other question.)