Cognition for Surgical Robots

We consider cognition and the cognitive sciences from the point of view of surgical robots. The text is in the form that suits Robotic Surgepedia: in its present form it is a so-called “draft” that will be the subject of discussion before deciding who will write the final “article”. The article can then be surrounded and annexed by “commentaries” that present other views and support further thinking about this quickly growing field. Since Robotic Surgepedia combines the techniques of Wikipedia and Scholarpedia, our references, whenever appropriate, prefer links to these community-authored knowledge bases in order to take advantage of the related rich semantic information and the available text-mining technologies.

The concept of robotic surgery has two main lines.


 * Case (1)
 * The robot executes the surgery, e.g., on a battlefield, or under remote control of the human surgeon.


 * Case (2)
 * The robot accompanies the surgeon, and possibly her team, and actively takes part in the surgery. The necessity of collaboration with the human participant is the main difference between cases (1) and (2).

From the point of view of robotics, Case (1) is similar to certain other robotic problems, such as mine detection on the battlefield or robots on a production line. Some of these problems are easy, while others are hard. We will not discuss them here, but refer to the general literature on cognitive robotics. Even in this case it has been noted that “while traditional cognitive modelling approaches have assumed symbolic coding schemes as a means for depicting the world, translating the world into these kinds of symbolic representations has proven to be problematic if not untenable. Perception and action and the notion of symbolic representation are therefore core issues to be addressed in cognitive robotics.”

The issue of symbolic representation becomes unavoidable if we have human(s) in the loop, since we use language – i.e., symbols – for communication, and we have to touch on core issues that are not necessarily relevant to traditional cognitive robotics. In particular, we have to address the following issues:


 * Issue (1)
 * Communication happens via symbols and results in (co-operative) actions. Thus symbols should be grounded: we have to deal with the so-called symbol grounding problem. However, the complex neural representation of an object on the retina is also a (complex) symbol of that object. We will have to consider whether it is possible to define symbols at all, i.e., whether one can separate symbols from representations in a meaningful way.


 * Issue (2)
 * According to Issue (1), the symbol grounding problem is connected to the concept of representation and requires some resolution of the homunculus fallacy, the apparent infinite regress that comes into view when one tries to tell interpretation apart from representation. Interpretation is directly connected to questions of thinking and understanding, feedforward and feedback models of cognition, as well as to the questions of awareness, attention, and consciousness.


 * Issue (3)
 * From the point of view of the usability of a robot that ‘cognizes’, the central issue is the cognitive capability of that robot, which – in the context of usability – is best formulated via the so-called Turing test and its levels.

Resolution to the homunculus fallacy
There is the prevailing philosophical problem of the homunculus fallacy. In a nutshell, the fallacy starts by claiming that an internal representation of the outside world is still meaningless unless someone can ‘read’ or interpret it (see, e.g., Searle (1992) and references therein). Then, however, we have to find a place for this reader and define exactly how it ‘makes sense’ of, or ‘interprets’, the internal representations. The interpretation – according to the fallacy – is just a new level of abstraction, a new transformation, that is, a new representation. Eventually this line of thought seems to end in an infinite regress.

We can resolve the problem by noticing that it stems from the vaguely described procedure of ‘making sense’. One can turn the fallacy upside down by exchanging the roles: not the internal representation but the input should make sense, and it makes sense if it can be derived by means of the internal representation. According to this approach, the internal representation interprets the input by (re)constructing it. Reconstruction must concern a non-vanishing time interval, and the longer this interval, the greater the explanatory power of the representation.

This direction is in line with modern concepts of information theory and artificial general intelligence. Firstly, prediction is closely related to coding and compression. Secondly, artificial general intelligence research claims that the better the compression, the better the ‘understanding’ or ‘intelligence’.

In turn, the working hypothesis for the resolution of the homunculus fallacy is that interpretation occurs via a compressed internal representation that can reconstruct – possibly in a lossless manner – the actual input, which may be extended in time. Such an internal representation is capable of envisioning, dreaming, and predicting. The components of the compressed representation may be thought of as symbols, which takes us to the symbol grounding problem. As a further note, when the representations of two intelligent entities are ‘sufficiently similar’, communication may occur at the symbolic level. Otherwise, they face the symbol matching problem, which arguably corresponds to a graph matching problem. Optimal matching of different graphs is known to be NP-hard, not to mention the related implicit grounding problems in case of discrepancies.
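
To make this working hypothesis concrete, the following minimal sketch (in Python; not part of the original text, and the model, signals, and threshold are illustrative assumptions) treats a single fitted coefficient as the compressed internal representation and declares that an input ‘makes sense’ if that representation can reconstruct it over a time interval.

    import numpy as np

    def fit_ar1(signal):
        """Compress a time series into a single AR(1) coefficient."""
        x, y = signal[:-1], signal[1:]
        return np.dot(x, y) / np.dot(x, x)        # the 'compressed representation'

    def makes_sense(signal, a, tol=0.1):
        """The input is 'interpreted' if the representation can reconstruct it."""
        reconstruction = a * signal[:-1]          # derive each step from the past
        error = np.mean((signal[1:] - reconstruction) ** 2)
        return error < tol

    t = np.arange(200)
    familiar = 0.9 ** t                           # an input the model can explain
    noise = np.random.randn(200)                  # an input it cannot explain

    a = fit_ar1(familiar)
    print(makes_sense(familiar, a), makes_sense(noise, a))   # True False

A richer representation would reconstruct longer intervals and more kinds of input; the point here is only that interpretation is cast as reconstruction rather than as reading by an inner homunculus.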

The symbol grounding problem
Motivations:
 * 1) The European Robotic Surgery project intends to bring together a community of engineers, roboticists, scientists, social workers, and surgeons to promote the technological development of the field and its use for patients undergoing surgery. This goal requires translation between different disciplines and the creation of an underlying, supporting ontology. The units of this ontology are symbols that need to be grounded in everyday experience.
 * 2) Communication with the robot(s) taking part in the surgical event must occur at the symbolic, i.e., representational, level, and thus the symbols must be grounded to enable actions.

According to the symbolic model of the mind, we have an ‘autonomous’ symbolic level. The problem is to connect these symbols to sensory information, that is, to ground them in the physical world; this is a long-standing and enigmatic problem in cognitive science.

If the computer can ground the symbols in the real world, i.e., if it can create inputs like those created by the real world in its sensory system, then the computer can explain the input by means of its symbols. This way, the computer overcomes the homunculus fallacy. If we consider the symbolic level as the representation of the events that occur in the real world, then we end up with a list of events and their spatio-temporal graph structure, where connections represent transition probabilities. Under the highly restrictive assumption that such a graph exists in the real world and is finite, the symbol grounding problem becomes a graph matching problem, which is known to be NP-hard and is thus subject to combinatorial explosion in the number of symbols. Furthermore, one might say that the graph of real-world events – if it exists at all – is not finite and changes over time. We also note that the symbol grounding problem – at first sight – seems harder if the symbols are not prewired but have to be learned (but see Lorincz (2009) and the references therein).
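
The combinatorial nature of grounding-as-graph-matching can be illustrated with a toy sketch (Python; illustrative only, not the project’s method): the agent’s symbol graph is aligned to a small ‘world event’ graph by brute force over all node correspondences, a search that grows factorially with the number of symbols.

    from itertools import permutations
    import numpy as np

    def mismatch(world, symbols, perm):
        """Disagreement of edge weights under a candidate correspondence."""
        p = list(perm)
        return np.abs(world[np.ix_(p, p)] - symbols).sum()

    def ground_by_matching(world, symbols):
        n = symbols.shape[0]                      # factorially many candidates
        return min(permutations(range(n)), key=lambda p: mismatch(world, symbols, p))

    world = np.array([[0.0, 0.7, 0.3],            # transition graph of world events
                      [0.2, 0.0, 0.8],
                      [0.5, 0.5, 0.0]])
    symbols = world[np.ix_([2, 0, 1], [2, 0, 1])] # the same graph, nodes relabelled
    print(ground_by_matching(world, symbols))     # recovers the relabelling (2, 0, 1)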

Turing test levels
The emergence of Strong Artificial Intelligence that matches or exceeds human intelligence has long been predicted. The difficulties in achieving it are mainly related to the fact that, beyond our common-sense intuition, there is no straightforward definition of what intelligence is, so measuring its machine implementation remains elusive. Nonetheless, the Loebner Prize for the classical Turing Test (Alan Turing in 1950 defined machine intelligence as the ability to produce human-like conversation in keyboard-mediated ‘pen-pal’ communication with a human interlocutor) is expected to be awarded soon. This expectation is based on recent advances in tagged text corpora: the interpretation of texts becomes feasible by ‘grounding’ them in cross-annotated databases, e.g., Wikipedia.

Projects aiming at this goal have already proven their strength: IBM’s Watson, for example, has successfully competed with human participants in the quiz show ‘Jeopardy!’; see also Carnegie Mellon University’s ‘Never Ending Language Learning’ project.

It is intriguing that the grounding of images with the help of cross-annotated collections of image parts is also successful; see, e.g., MIT’s LabelMe and Stanford’s ImageNet projects. From an algorithmic standpoint, the grounding of images shares many similarities with text grounding.

Stevan Harnad has established a hierarchy of Turing tests: T1 is the level of toy tests, T2 is the level of the original ‘pen-pal’ Turing test, T3 is the level of total sensorimotor (robotic) function, T4 is the level of total micro-function, and T5 is the level at which the machine becomes indistinguishable from humans by any empirical or testable means. It seems that T2 can be satisfied by an approximate matching of the graph of symbols, e.g., by means of annotated databases. This leads us to the necessity of creating the graph structure of the symbols, i.e., the ontology of the European Robotic Surgery project, as well as its alignment with other ontologies that already exist. From the point of view of Strong AI, level T2 is sufficient for a robotic companion, provided that its symbolic system is grounded. However, level T2 may not be necessary if the robot’s sensorimotor system is less complex than ours and/or if the tasks of the robot are limited.

Searle’s Chinese Room argument gives an interesting insight into the Turing test: (a) if the robot passes the Turing test, we will not be able to tell whether it has a ‘mind’ or not, and (b) one can pass the Turing test without any understanding of the subject matter of the conversation, for example by merely having access to a high-quality question-answering system.

Feedforward, or Turing model of cognition – From representation to action
There is a long-standing problem in cognition, which is ultimately related to consciousness and ‘free will’. It asks the following question: “Are we simple (feedforward) input-output systems?” One answer to this question has been elaborated by Dennett:

"The model of decision making I am proposing has the following feature: when we are faced with an important decision, a consideration-generator whose output is to some degree undetermined produces a series of considerations, some of which may of course be immediately rejected as irrelevant by the agent (consciously or unconsciously). Those considerations that are selected by the agent as having a more than negligible bearing on the decision then figure in a reasoning process, and if the agent is in the main reasonable, those considerations ultimately serve as predictors and explicators of the agent's final decision."

From our point of view, this dilemma about cognition – whether we have free will, whether our decisions are deterministic, random, or something in between – raises relevant issues. A central point of the feedforward model of cognition is that feedback from the environment is also an input to the system. A related assumption is also implicit in the Turing Machine (TM), which bears some resemblance to Dennett’s ideas. Informally, a TM has a tape, which is the model of the environment. It has a head, which is the model of sensory information processing and of actions; it can read and write symbols received from and sent to the environment. It has a finite lookup table (also called the action table or transition function), which is the model of the mind. The head reads an input from the tape, derives an action from the lookup table, reads the following input, and so on.
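
A minimal sketch (Python; a toy construction, not from the source) of this reading of the Turing Machine: a dictionary plays the role of the finite lookup table (the ‘mind’), a list plays the role of the tape (the ‘environment’), and the head alternates between reading, consulting the table, and writing.

    def run_tm(tape, table, state="start", pos=0, max_steps=100):
        tape = dict(enumerate(tape))                      # the environment
        for _ in range(max_steps):
            symbol = tape.get(pos, "_")                   # 'perception'
            if (state, symbol) not in table:
                break                                     # halt: no rule applies
            write, move, state = table[(state, symbol)]   # consult the 'mind'
            tape[pos] = write                             # 'action' on the environment
            pos += {"R": 1, "L": -1}[move]
        return [tape[i] for i in sorted(tape)]

    # Lookup table of a machine that flips every bit until it reaches a blank.
    flip = {("start", "0"): ("1", "R", "start"),
            ("start", "1"): ("0", "R", "start")}
    print(run_tm(list("1011"), flip))                     # ['0', '1', '0', '0']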

The TM, like Dennett’s model, neglects processing time and the related synchronization problem of the perception-action loop. From the point of view of decision making, we consider the related mathematical and philosophical problem of the boundary of the agent, which separates it from its environment. For the TM, the agent could be the lookup table plus the head, or the lookup table alone. The same dilemma arises, e.g., in reinforcement learning (RL). We note that – as in the TM or in RL – if the agent ‘thinks’ that its actions are ‘under its control’, i.e., that its decision-to-action transformation is deterministic, then the following tacit assumption is also made: the decision-to-action transformation is infinitely fast. Such a statement is clear in the TM case and implicit in Dennett’s standpoint. Nonetheless, infinitely fast processing is not realistic for humans, for robots, or for telesurgery, not to mention cloud-based computations. This observation calls for some reservations and for further considerations.

Goal oriented behaviour – State-to-action mapping is subject to combinatorial explosion
We consider the reinforcement learning (RL) model of goal oriented behaviour. We neglect mathematical details and refer the interested reader to the literature.

If RL is built upon the framework of Markov Decision Processes (MDPs), then it has attractive, near-optimal polynomial-time convergence properties under certain conditions. This RL formulation has also been strongly motivated by psychology and neuroscience. It means that upon adopting this model, one has to cope with the limitations arising from the Markovian assumption.

An MDP has the following ingredients: 1) the agent, which includes a finite set of states and a finite set of actions, and 2) the environment, which includes a transition probability matrix that maps state-action pairs to (next) states and a reward function that maps state-action pairs to numbers (rewards). The learning objective is to find a state-to-action mapping – i.e., a policy – for the agent that optimizes the cumulative discounted reward the agent may expect to collect in the future.
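
The following minimal value-iteration sketch (Python; the two-state, two-action numbers are illustrative assumptions) spells out these ingredients and computes the state-to-action mapping that maximizes the expected discounted reward.

    import numpy as np

    n_states, n_actions, gamma = 2, 2, 0.9
    P = np.array([[[0.8, 0.2], [0.1, 0.9]],      # P[s, a, s']: transition probabilities
                  [[0.5, 0.5], [0.0, 1.0]]])
    R = np.array([[1.0, 0.0],                    # R[s, a]: rewards
                  [0.0, 2.0]])

    V = np.zeros(n_states)
    for _ in range(200):                         # value iteration
        Q = R + gamma * P.dot(V)                 # Q[s, a] = R[s, a] + gamma * E[V(s')]
        V = Q.max(axis=1)

    policy = Q.argmax(axis=1)                    # the state-to-action mapping
    print(policy, V.round(2))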

The core problem of RL is that even simple agents in simple environments need to detect a number of variables, such as objects, their shape or colour, other agents – including relatives, friends, and enemies – or the space itself, e.g., distance, speed, direction, and so on. The size of the state space grows exponentially with the number of variables. The base, which corresponds to the discretization of the variables, and the number of possible actions are less harmful. However, another source of combinatorial explosion comes from partial observability: agents need to maintain a history of sensory information, and this temporal depth multiplies the exponent again.
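
A back-of-the-envelope sketch (Python; the numbers are purely illustrative) shows how quickly the state count grows when the discretization enters as the base while the variables, the history depth, and (below) the number of agents multiply the exponent.

    levels, variables, history, agents = 10, 8, 3, 2
    states = levels ** (variables * history * agents)
    print(f"{states:.2e} states")                # 1.00e+48 states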

In so-called multi-agent systems, the problem becomes even harder, as the internal states of the other learning agents are hidden. This fact has serious consequences. First, the number of agents also multiplies the exponent. Second, the hidden nature of the internal states violates the central assumption of RL about the Markov property of the states.

The effect of hidden variables is striking. For example, it has been found that, for two agents with two hidden states (meanings) and two actions (signals to the other agent), some cost on communication is enough to prevent agreement on the association between signals and meanings. Nevertheless, there is a resolution if an agent can build an internal model of the other agent; in this case the agents can quickly come to an agreement. This situation is best described as ‘I know what you are thinking and I proceed accordingly’. However, if both agents build models of each other, then agreement is again hard, unless one of the agents assumes that the other agent also builds a model, as in ‘I know that you know what I am thinking’. Such observations highlight the relevance of model construction and the necessary adjustment of the estimated reward function in collaborative situations, such as robotic surgery.
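
The two-meaning, two-signal setting can be sketched as a Lewis-style signalling game with a per-signal cost (Python; this is only an illustration of the setup, and the model and the result in the cited work may differ).

    from itertools import product

    meanings, cost = (0, 1), 0.3                   # two meanings, a per-signal cost

    def expected_payoff(sender, receiver):
        """Average coordination reward minus the communication cost."""
        hits = sum(receiver[sender[m]] == m for m in meanings) / len(meanings)
        return hits - cost

    policies = list(product((0, 1), repeat=2))     # all mappings {0,1} -> {0,1}
    best = max(((s, r) for s in policies for r in policies),
               key=lambda sr: expected_payoff(*sr))
    print(best, expected_payoff(*best))            # a coordinated convention, payoff 0.7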

Factored reinforcement learning intends to escape the problem of combinatorial explosion.

Factored reinforcement learning to alleviate combinatorial explosion
Within the framework of RL, a factored description can decrease the exponent of the state space size. In the multi-agent scenario, for example, if it turns out that a given agent (say, agent B) does not influence the long-term goals in the actual events observed by another agent (agent A), then the variables corresponding to agent B can be dropped; only the relevant factors need to be considered by agent A at that time instant. Similarly, during robotic surgery, the state description can safely neglect information about events outside the operating room as long as they do not influence the conditions of the surgery.

Factored RL (fRL) assumes that the decision-making process can be limited to a few variables, and since the number of variables enters the exponent of the size of the state space, fRL may give rise to exponential gains. fRL makes other important assumptions as well, namely that the transition probabilities from state-action pairs to new states, as well as the reward function, also depend on only a few factors and not on the whole state space.
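
A minimal sketch (Python; the variable counts and parent sets are illustrative assumptions) contrasts the size of the full joint transition table with a factored model in which each next-state variable depends on only a few parent variables.

    import numpy as np

    levels, n_vars = 4, 6                                   # 6 state variables, 4 values each
    full_table = (levels ** n_vars) * (levels ** n_vars)    # joint next-state table

    # Factored model: each next-state variable has at most two parent variables.
    parents = {0: (0,), 1: (0, 1), 2: (2,), 3: (2, 3), 4: (4,), 5: (4, 5)}
    factored = sum(levels ** len(p) * levels for p in parents.values())

    print(full_table, factored)                             # 16777216 vs 240 parameters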

fRL has a number of attractive properties. The limitations of RL highlight that learning should find the relevant factors, also called features, for learning and model construction. This also means the following:
 * 1) fRL converges.
 * 2) Optimistic initialization and greediness lead to polynomial-time convergence.
 * 3) It is an approximation that some of the factors may be left unobserved in the state description. This is a ‘belief’ from the point of view of the agent, and partially observed MDPs might be needed to treat certain situations.
 * 4) Factors such as the position, direction, and speed of the knife, or a model-based formulation of the dynamics of a body part, constitute highly compressed symbolic descriptions and a highly compressed dynamical description of the interactions of the symbols, relative to the description on the retina of the robot.
 * 5) Neither the extraction of such features nor the learning of their dynamical properties looks trivial. This learning problem seems to form a major bottleneck for Artificial General Intelligence (AGI). Theoretical and experimental efforts have been made to embed the essence of this learning problem into an AGI architecture.

Factors as symbols and the problem of symbol learning
Position, direction, configuration, colour, form, etc. can be seen as factors or variables. They are abstractions, since they do not exist alone. For example, a colour belongs to an object or to a phenomenon that has a certain shape. Factors may assume different representations. Although these representations share a number of properties, we try to distinguish them from the point of view of reinforcement learning. We take colour as an example.
 * Type (1)
 * A colour is made of spectral components; white, for example, contains all colours with uniform strength. The strengths of the frequency components can form the representation in this case. The representation can be sparse (a single colour) or dense (e.g., white).


 * Type (2)
 * If colour is observed through three broad-bandwidth sensors (as in the human eye), then the activities of these sensors represent the colour. The representation is always dense since the sensors are broadband filters.


 * Type (3)
 * The space of the three colour sensors can be discretized in two dimensions by normalizing with respect to the overall light intensity. In this case the colour is represented by one of the indices of the discretized 2D space, so the representation is the sparsest possible.


 * Type (4)
 * One might place Gaussian sensors at all points of this discretized 2D space such that the sensors also sense the colours of their neighbours. The representation will then be blob-like in the 2D space and so it will be sparse.

Note that Markov Decision Processes (MDPs) require a tabulated state representation, and this condition is satisfied only by Type (3). The other types require function approximation methods, for which the attractive properties of MDPs may not hold.
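
The following small sketch (Python; the bin counts and sensor widths are arbitrary illustrative choices) computes Type (2), Type (3), and Type (4) representations for a single colour; only the Type (3) cell index is directly usable as a tabulated MDP state.

    import numpy as np

    rgb = np.array([0.8, 0.5, 0.2])                 # Type (2): dense 3-sensor activity

    bins = 8
    chroma = rgb[:2] / rgb.sum()                    # normalize away overall intensity
    index = tuple((chroma * bins).astype(int).tolist())   # Type (3): a single cell index,
                                                          # usable as a tabulated MDP state

    centres = np.stack(np.meshgrid(np.linspace(0, 1, bins),
                                   np.linspace(0, 1, bins)), axis=-1).reshape(-1, 2)
    blob = np.exp(-np.sum((centres - chroma) ** 2, axis=1) / 0.02)   # Type (4): blob-like
    print(index, blob.round(2).max())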

As argued before, factors are related, since the presence of an object involves shape and colour variables, among many others, such as its material properties, position, speed, etc. The tabulated description of a state assumes a huge table and the related indices, which (a) is not practical and (b) means that only a few entries of the table will ever occur during a lifetime. The actual task can select the relevant variables. For example, the colour and the form of a cup may be neglected when somebody is thirsty; the relevant variable is the amount of water in the cup. In turn, the learning task is to separate the variables and to select the ones relevant to the actual task.

The factored description, however, depends on spatial and temporal context. Let us take the example of a chair. From the five-letter representation alone, we do not know whether it denotes a piece of furniture or a person (the chair of a committee). A similar problem may occur in a pixel-based description, as shown in Figure 1.

There are certain differences between pixel-based and textual representations. A textual representation is made of letters; it neglects many details, but its entropy is low, since it may assume only a few forms, such as wooden chair, armchair, etc. It is also uncertain, since the meaning of the symbol string ‘chair’ depends on the context, as mentioned above.

A pixel-based description may take a great many different forms depending on shape, lighting conditions, position, etc., and thus the pixel-based description of the variable has high entropy. It is also uncertain, since it too depends on the context (Figure 1).

We say that symbol learning is the task of developing low-entropy variables from high-entropy ones. Is this task possible at all? Since we are interested in reinforcement learning (RL) and transition probabilities, our concern is whether it is possible to partition the observations such that the transition probabilities between the low-entropy variables are good approximations of the transition probabilities between the high-entropy variables.

The transition probabilities from state-action pairs to next state-action pairs form a graph. The first question is whether there is a partitioning of the graph of the transition probabilities of the high-entropy variables that yields a reasonable estimate of the graph of the transition probabilities of the low-entropy ones. The second question is whether such a partitioning can be learned.
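
A minimal sketch (Python; the chain and the partition are hand-made illustrations) of the question asked above: a partition maps high-entropy states onto a few low-entropy symbols, and we check how well the induced symbol-level transition probabilities summarize the original chain. Here the partition is given by hand; learning it is the ‘symbol learning’ problem.

    import numpy as np

    # Transition matrix over four 'pixel-level' states (rows sum to 1).
    T = np.array([[0.45, 0.45, 0.05, 0.05],
                  [0.40, 0.50, 0.05, 0.05],
                  [0.05, 0.05, 0.50, 0.40],
                  [0.05, 0.05, 0.45, 0.45]])
    partition = np.array([0, 0, 1, 1])            # states {0,1} -> symbol 0, {2,3} -> symbol 1

    def coarse_grain(T, partition, n_symbols=2):
        """Symbol-level transition probabilities under a uniform weighting of states."""
        S = np.zeros((n_symbols, n_symbols))
        for a in range(n_symbols):
            members = np.where(partition == a)[0]
            for b in range(n_symbols):
                S[a, b] = T[np.ix_(members, partition == b)].sum() / len(members)
        return S

    print(coarse_grain(T, partition).round(2))    # approximately [[0.9, 0.1], [0.1, 0.9]]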

Recent advances that extend extremal graph theory to the information-theoretic domain use the above terms and may provide a rigorous formulation of the conditions of the ‘symbol learning problem’ for the hardest case, the case of extremal graphs. This is of high relevance, since even in this hardest case symbol learning may be accomplished in polynomial time, because constructive polynomial-time algorithms exist for such problem types.

Symbol learning and symbol grounding – bottom-up and top-down processes
In reinforcement learning (RL) we are looking for transition probabilities from state-action pairs to states, or from state-action pairs to state-action pairs. In turn, learning to develop a representation is concerned either with the learning of this graph or with the matching of an existing graph to the graph of another agent. The latter occurs if the graph is inherited, or if two agents want to communicate and need to match their graphs. Optimal graph matching is known to be NP-hard, whereas graph learning is polynomial, as mentioned in Factors as symbols and the problem of symbol learning. However, graph matching is unavoidable for multi-agent systems. We identify the graph matching problem of agents with the symbol grounding problem. In turn, symbol grounding and symbol learning are not incompatible with each other; furthermore, these learning methods can rely on each other. Their combination may be efficient under certain conditions even if the observations of the agents originally differ.
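
The asymmetry mentioned above can be illustrated with a short sketch (Python; toy data): the transition graph of a single agent can be estimated in one polynomial-time pass by counting observed transitions, whereas matching two such graphs requires a search over correspondences, as in the brute-force grounding sketch given earlier.

    import numpy as np

    def learn_transition_graph(sequence, n_states):
        counts = np.zeros((n_states, n_states))
        for s, s_next in zip(sequence[:-1], sequence[1:]):
            counts[s, s_next] += 1                        # count observed transitions
        return counts / counts.sum(axis=1, keepdims=True)

    sequence = [0, 1, 0, 1, 2, 2, 0, 1, 2, 0]             # an observed state sequence
    print(learn_transition_graph(sequence, 3).round(2))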

Annotated databases and ontologies – Symbol grounding for robots
We consider Wikipedia as an annotated database. In this case, human annotation provides relations between the entries of the database or provides links to resources outside the database. Annotation may take different forms. For example, (a) the segmentation of an image is a human annotation of the pixels of the image, (b) linking a part-of-speech to a Wikipedia page is an explanatory or disambiguating annotation, and (c) the action-unit annotation of images or image series of a human face is a cross-modal annotation that connects visual and textual information.

An ontology relates the concepts of a knowledge domain. A special form of ontology derives from Topic Maps. A topic map represents information using (a) topics, (b) associations, i.e., hypergraph relationships between topics, and (c) occurrences, i.e., information resources relevant to a topic. Topics, associations, and occurrences can all be typed. The definitions of the allowed types of a topic map form the ontology of that topic map.
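
A minimal sketch (Python; the topic names, types, and URL are purely hypothetical placeholders) of the topic-map ingredients listed above, with the allowed types playing the role of the ontology.

    topic_map = {
        "topics": {
            "scalpel":      {"type": "instrument"},
            "incision":     {"type": "surgical_step"},
            "appendectomy": {"type": "procedure"},
        },
        "associations": [        # (hyper)graph relationships between topics
            {"type": "used_in", "members": ["scalpel", "incision"]},
            {"type": "step_of", "members": ["incision", "appendectomy"]},
        ],
        "occurrences": [         # information resources relevant to a topic
            {"topic": "appendectomy", "type": "definition",
             "resource": "https://example.org/appendectomy"},
        ],
    }
    # The set of allowed types ('instrument', 'used_in', 'definition', ...)
    # constitutes the ontology of this topic map.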

A potential method for grounding a symbol is to design associations from that symbol to grounded symbols of the same or other ontologies. This method is neither necessary nor sufficient. From the point of view of goal-oriented systems, we say that a symbol is perfectly grounded if it gives rise to the optimal response with respect to the goal(s). This definition belongs to the feedforward model of cognition and is very different from the line of thought that leads to the homunculus fallacy. It offers another route to resolving the fallacy: a representation ‘makes sense’ if it leads to an optimal response from the point of view of the responding entity.

Cognition for surgical robots – Goal oriented behaviour in grounded symbol space
Surgical robots have specific goals that can range from a simple task to a highly complex series of manoeuvres. They need to sense the environment, including – possibly – human partners, e.g., surgeon(s) and nurse(s). Reinforcement learning in partially observed situations is typically computationally intractable and, in turn, the best possible observation should be the aim. Such observations include the actual sub-goals the surgeon is about to accomplish, which might change very quickly subject to new observations during the surgery.

Information is provided to the robot via its sensors and via language-based communication, either between the human partners or between the humans and the robot. The communicated symbols then undergo symbolic manipulation, which (i) generates actions and (ii) compares the observed sensory information with the expectations. If the observed and the expected sensory information match, then we say that the symbolic representation ‘makes sense’ and that the learned or inherited symbols are grounded.
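
A minimal sketch (Python; the prediction values and the tolerance are hypothetical placeholders) of this check: the symbolic level produces an expectation about the next sensory input, and the symbols are taken to be grounded if observation and expectation agree.

    import numpy as np

    def symbols_make_sense(expected, observed, tol=0.05):
        """Grounded if the expectation explains the observation."""
        return np.mean((np.asarray(expected) - np.asarray(observed)) ** 2) < tol

    expected = [0.2, 0.8, 0.5]      # e.g., predicted positions of tracked instruments
    observed = [0.22, 0.79, 0.48]
    print(symbols_make_sense(expected, observed))   # True: the symbols 'make sense' here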

Consciousness as a synchronization problem – the feedforward-feedback model of cognition
Information from different sensory modalities arrives with different delays. These different pieces of perception are glued together by our brain into a single coherent flow of perception that seems synchronous with the events occurring in the world; the brain manages to compensate for the different delays, which can be a considerable fraction of a second. In turn, the brain has and utilizes a predictive system capable of predictions over a few hundred milliseconds or so. If something unpredicted happens during the span of the predicted time interval, then it will not reach the brain in time. In a purely feedforward system the synchronous perception of the events of the world should break down: a new symbolic representation should start to emerge only when the erroneously predicted input arrives, and we should be able to notice the delay. However, we do not experience such situations. One concludes that the feedforward model of cognition may need some adjustments.
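
A minimal sketch (Python; constant-velocity extrapolation and all numbers are illustrative assumptions) of such delay compensation: sensory input arrives late, and a simple forward model extrapolates it so that the internal estimate stays approximately synchronous with the present.

    import numpy as np

    dt, delay_steps = 0.01, 20                     # 10 ms steps, 200 ms sensory delay

    def compensate(delayed_samples, delay_steps):
        """Extrapolate the latest delayed samples up to the present moment."""
        velocity = (delayed_samples[-1] - delayed_samples[-2]) / dt
        return delayed_samples[-1] + velocity * delay_steps * dt

    t = np.arange(0, 1, dt)
    true_signal = np.sin(0.4 * np.pi * t)          # what is happening 'now'
    delayed = true_signal[:-delay_steps]           # what has reached the 'brain' so far

    estimate = compensate(delayed, delay_steps)
    print(round(float(delayed[-1]), 2),            # stale percept: 0.84
          round(float(estimate), 2),               # compensated estimate: 0.98
          round(float(true_signal[-1]), 2))        # actual present value: 0.95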

It is also possible that the sensory information corresponds to two different situations in the world, as in the case of the Necker cube. The intriguing feature is that a single percept of the two possibilities emerges at any time, the two percepts switch over time, and the switching time is on the order of a few hundred milliseconds. Another notable feature is that ordinary people cannot stop this switching: the brain brings about the alternative representation in spite of top-down efforts to focus on a single one. Switching is sudden; it can be measured at the neuronal level. Furthermore, the neurons representing the input and the perceived input are in close proximity to each other.

In short, our observation of the world is a single coherent one even if it is erroneous, including the two extremes: (i) when sensory information arrives too late to reach and change the representation in time, and (ii) when sensory information is not changing at all but more than one interpretation is possible.

This ‘problem’ is also known as the neural binding problem: the problem of how items encoded in distinct circuits of the brain are combined in perception, decision making, and action. A related question that one can pose here is why such an ‘error’, which may span a few hundred milliseconds, is necessary. We conjecture that the stability of cognition, decision making, and control can be (one of) the reason(s) that necessitates such an error in the presence of delays. In turn, the requirement of stable control seems to be the key requirement for synchronous perception, and thus the key to consciousness, and finally the key to considering cognition as a combined feedforward and feedback system.

Models of cognition
Considerable effort has been devoted to modelling cognition. Block-diagram-like models; mathematical models, including mathematical logic, Bayesian, fuzzy, artificial neural network, reinforcement learning, and dynamical system models; and computational cognitive neuroscience models have all been trying to capture the essence of cognition. The field of Strong AI (or Artificial General Intelligence) searches for the underlying general principles of cognition and tries to build cognitive architectures that can match human performance. A close-to-comprehensive list of cognitive models can be found in the literature.