Ray Kurzweil reveals plans for ‘linguistically fluent’ Google software

Smart Reply (credit: Google Research)

Ray Kurzweil, a director of engineering at Google, reveals plans for a future version of Google’s “Smart Reply” machine-learning email software (and more) in a Wired article by Tom Simonite published Wednesday (Aug. 2, 2017).

Running on mobile Gmail and Google Inbox, Smart Reply suggests up to three replies to an email message, saving typing time or giving you ideas for a better reply.

Smarter autocomplete

Kurzweil’s team is now “experimenting with empowering Smart Reply to elaborate on its initial terse suggestions,” Simonite says.

“Tapping a Continue button [in response to an email] might cause ‘Sure I’d love to come to your party!’ to expand to include, for example, ‘Can I bring something?’ He likes the idea of having AI pitch in anytime you’re typing, a bit like an omnipresent, smarter version of Google’s search autocomplete. ‘You could have similar technology to help you compose documents or emails by giving you suggestions of how to complete your sentence,’ Kurzweil says.”

As Simonite notes, Kurzweil’s software is based on his hierarchical theory of intelligence, articulated in Kurzweil’s latest book, How to Create a Mind, and in more detail in an arXiv paper by Kurzweil and key members of his team, published in May.

“Kurzweil’s work outlines a path to create a simulation of the human neocortex (the outer layer of the brain where we do much of our thinking) by building a hierarchy of similarly structured components that encode increasingly abstract ideas as sequences,” according to the paper. “Kurzweil provides evidence that the neocortex is a self-organizing hierarchy of modules, each of which can learn, remember, recognize and/or generate a sequence, in which each sequence consists of a sequential pattern from lower-level modules.”

The paper further explains that Smart Reply previously used a “long short-term memory” (LSTM) network*, “which are much slower than feed-forward networks [used in the new software] for training and inference” because with LSTM, it takes more computation to handle longer sequences of words.

Kurzweil’s team was able to produce email responses of similar quality to LSTM, but using fewer computational resources by training hierarchically connected layers of simulated neurons on clustered numerical representations of text. Essentially, the approach propagates information through a sequence of ever more complex pattern recognizers until the final patterns are matched to optimal responses.

Kona: linguistically fluent software

But underlying Smart Reply is “a system for understanding the meaning of language, according to Kurzweil,” Simonite reports.

“Codenamed Kona, the effort is aiming for nothing less than creating software as linguistically fluent as you or me. ‘I would not say it’s at human levels, but I think we’ll get there,’ Kurzweil says. More applications of Kona are in the works and will surface in future Google products, he promises.”

* The previous sequence-to-sequence (Seq2Seq) framework [described in this paper] uses “recurrent neural networks (RNNs), typically long short-term memory (LSTM) networks, to encode sequences of word embeddings into representations that depend on the order, and uses a decoder RNN to generate output sequences word by word. …While Seq2Seq models provide a generalized solution, it is not obvious that they are maximally efficient, and training these systems can be slow and complicated.”

How to run faster, smarter AI apps on smartphones

(credit: iStock)

When you use smartphone AI apps like Siri, you’re dependent on the cloud for a lot of the processing — limited by your connection speed. But what if your smartphone could do more of the processing directly on your device — allowing for smarter, faster apps?

MIT scientists have taken a step in that direction with a new way to enable artificial-intelligence systems called convolutional neural networks (CNNs) to run locally on mobile devices. (CNNs are used in areas such as autonomous driving, speech recognition, computer vision, and automatic translation.) Neural networks take up a lot of memory and consume a lot of power, so they usually run on servers in the cloud, which receive data from desktop or mobile devices and then send back their analyses.

The new MIT analytic method can determine how much power a neural network will actually consume when run on a particular type of hardware. The researchers used the method to evaluate new techniques for paring down neural networks so that they’ll run more efficiently on handheld devices.

The new CNN designs are also tailored to run on an energy-efficient computer chip, optimized for neural networks, that the researchers developed in 2016.

Reducing energy consumption

The new MIT software method uses “energy-aware pruning”: it reduces a neural network’s power consumption by cutting out the layers of the network that contribute very little to the network’s final output and consume the most energy.
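To make the idea of pruning concrete, here is a minimal Python sketch. It is not the MIT method: it uses weight magnitude alone as an assumed stand-in for "contributes very little" and ignores energy entirely. The function name prune_smallest_weights and the example values are hypothetical.

```python
import numpy as np

def prune_smallest_weights(weights, fraction):
    """Zero out the given fraction of weights with the smallest magnitudes.

    Plain magnitude pruning, used here only as an illustration. An
    energy-aware method would also weigh how much energy each part of
    the network consumes, which this sketch does not model.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    magnitudes = np.abs(weights).ravel()
    k = int(fraction * magnitudes.size)  # number of weights to remove
    if k == 0:
        return weights.copy()
    # The k-th smallest magnitude is the cut-off; ties at the cut-off go too.
    threshold = np.partition(magnitudes, k - 1)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

# Hypothetical 2x3 layer: half of its weights are close to zero.
layer = np.array([[0.9, -0.05, 0.3],
                  [0.01, -0.7, 0.02]])
pruned_layer = prune_smallest_weights(layer, 0.5)
print(pruned_layer)
```

In this toy layer the three near-zero weights are removed and the three large ones survive. A zeroed weight needs no multiplication and no memory fetch, which is where the savings would come from on suitable hardware.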

Associate professor of electrical engineering and computer science Vivienne Sze and colleagues describe the work in an open-access paper they’re presenting this week (of July 24, 2017) at the Computer Vision and Pattern Recognition Conference. They report that the methods offered up to 73 percent reduction in power consumption over the standard implementation of neural networks — 43 percent better than the best previous method.

Meanwhile, another MIT group at the Computer Science and Artificial Intelligence Laboratory has designed a hardware approach to reduce energy consumption and increase computer-chip processing speed for specific apps, using “cache hierarchies.” (“Caches” are small, local memory banks that store data that’s frequently used by computer chips to cut down on time- and energy-consuming communication with off-chip memory.)**

The researchers tested their system on a simulation of a chip with 36 cores, or processing units. They found that compared to its best-performing predecessors, the system increased processing speed by 20 to 30 percent while reducing energy consumption by 30 to 85 percent. They presented the new system, dubbed Jenga, in an open-access paper at the International Symposium on Computer Architecture earlier in July 2017.

Better batteries — or maybe, no battery?

Another solution to better mobile AI is improving rechargeable batteries in cell phones (and other mobile devices), which have limited charge capacity and short lifecycles, and perform poorly in cold weather.

Recently, DARPA-funded researchers from the University of Houston (and at the University of California-San Diego and Northwestern University) have discovered that quinones (an inexpensive, earth-abundant, easily recyclable and nonflammable class of materials) can address current battery limitations.

“One of these batteries, as a car battery, could last 10 years,” said Yan Yao, associate professor of electrical and computer engineering. In addition to slowing the deterioration of batteries for vehicles and stationary electricity storage batteries, it also would make battery disposal easier because the material does not contain heavy metals. The research is described in Nature Materials.

The first battery-free cellphone that can send and receive calls using only a few microwatts of power. (credit: Mark Stone/University of Washington)

But what if we eliminated batteries altogether? University of Washington researchers have invented a cellphone that requires no batteries. Instead, it harvests 3.5 microwatts of power from ambient radio signals, light, or even the vibrations of a speaker.

The new technology is detailed in a paper published July 1, 2017 in the Proceedings of the Association for Computing Machinery on Interactive, Mobile, Wearable and Ubiquitous Technologies.

The UW researchers demonstrated how to harvest this energy from ambient radio signals transmitted by a WiFi base station up to 31 feet away. “You could imagine in the future that all cell towers or Wi-Fi routers could come with our base station technology embedded in it,” said co-author Vamsi Talla, a former UW electrical engineering doctoral student and Allen School research associate. “And if every house has a Wi-Fi router in it, you could get battery-free cellphone coverage everywhere.”

A cellphone CPU (central processing unit) typically requires several watts or more (depending on the app), so we’re not quite there yet. But that power requirement could one day be sufficiently reduced by future special-purpose chips and MIT’s optimized algorithms.
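A quick back-of-the-envelope calculation shows how large that gap is. The 3.5-microwatt figure comes from the UW work described above. The 3-watt CPU figure is only an assumed illustrative value for "several watts", and the helper name power_gap is hypothetical.

```python
def power_gap(required_watts, available_watts):
    """Return how many times larger the required power is than the available power."""
    if available_watts <= 0:
        raise ValueError("available_watts must be positive")
    return required_watts / available_watts

harvested_watts = 3.5e-6  # 3.5 microwatts harvested by the battery-free phone
cpu_watts = 3.0           # ASSUMPTION: an illustrative value for "several watts"

gap = power_gap(cpu_watts, harvested_watts)
print(f"A {cpu_watts} W CPU needs about {gap:,.0f} times the harvested power")
```

Under this assumption the shortfall is on the order of hundreds of thousands of times, so closing it would take much more than incremental savings.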

It might even let you do amazing things. :)

* Loosely based on the anatomy of the brain, neural networks consist of thousands or even millions of simple but densely interconnected information-processing nodes, usually organized into layers. The connections between nodes have “weights” associated with them, which determine how much a given node’s output will contribute to the next node’s computation. During training, in which the network is presented with examples of the computation it’s learning to perform, those weights are continually readjusted, until the output of the network’s last layer consistently corresponds with the result of the computation. With the proposed pruning method, the energy consumption of AlexNet and GoogLeNet are reduced by 3.7x and 1.6x, respectively, with less than 1% top-5 accuracy loss.

** The software reallocates cache access on the fly to reduce latency (delay), based on the physical locations of the separate memory banks that make up the shared memory cache. If multiple cores are retrieving data from the same DRAM [memory] cache, this can cause bottlenecks that introduce new latencies. So after Jenga has come up with a set of cache assignments, cores don’t simply dump all their data into the nearest available memory bank; instead, Jenga parcels out the data a little at a time, then estimates the effect on bandwidth consumption and latency. 

*** The stumbling block, Yao said, has been the anode, the portion of the battery through which energy flows. Existing anode materials are intrinsically structurally and chemically unstable, meaning the battery is only efficient for a relatively short time. The differing formulations offer evidence that the material is an effective anode for both acid batteries and alkaline batteries, such as those used in a car, as well as emerging aqueous metal-ion batteries.

Is anyone home? A way to find out if AI has become self-aware

(credit: Gerd Altmann/Pixabay)

By Susan Schneider, PhD, and Edwin Turner, PhD

Every moment of your waking life and whenever you dream, you have the distinct inner feeling of being “you.” When you see the warm hues of a sunrise, smell the aroma of morning coffee or mull over a new idea, you are having conscious experience. But could an artificial intelligence (AI) ever have experience, like some of the androids depicted in Westworld or the synthetic beings in Blade Runner?

The question is not so far-fetched. Robots are currently being developed to work inside nuclear reactors, fight wars and care for the elderly. As AIs grow more sophisticated, they are projected to take over many human jobs within the next few decades. So we must ponder the question: Could AIs develop conscious experience?

This issue is pressing for several reasons. First, ethicists worry that it would be wrong to force AIs to serve us if they can suffer and feel a range of emotions. Second, consciousness could make AIs volatile or unpredictable, raising safety concerns (or conversely, it could increase an AI’s empathy; based on its own subjective experiences, it might recognize consciousness in us and treat us with compassion).

Third, machine consciousness could impact the viability of brain-implant technologies, like those to be developed by Elon Musk’s new company, Neuralink. If AI cannot be conscious, then the parts of the brain responsible for consciousness could not be replaced with chips without causing a loss of consciousness. And, in a similar vein, a person couldn’t upload their brain to a computer to avoid death, because that upload wouldn’t be a conscious being.

In addition, if AI eventually out-thinks us yet lacks consciousness, there would still be an important sense in which we humans are superior to machines; it feels like something to be us. But the smartest beings on the planet wouldn’t be conscious or sentient.

A lot hangs on the issue of machine consciousness, then. Yet neuroscientists are far from understanding the basis of consciousness in the brain, and philosophers are at least equally far from a complete explanation of the nature of consciousness.

A test for machine consciousness

So what can be done? We believe that we do not need to define consciousness formally, understand its philosophical nature or know its neural basis to recognize indications of consciousness in AIs. Each of us can grasp something essential about consciousness, just by introspecting; we can all experience what it feels like, from the inside, to exist.

(credit: Gerd Altmann/Pixabay)

Based on this essential characteristic of consciousness, we propose a test for machine consciousness, the AI Consciousness Test (ACT), which looks at whether the synthetic minds we create have an experience-based understanding of the way it feels, from the inside, to be conscious.

One of the most compelling indications that normally functioning humans experience consciousness, although this is not often noted, is that nearly every adult can quickly and readily grasp concepts based on this quality of felt consciousness. Such ideas include scenarios like minds switching bodies (as in the film Freaky Friday); life after death (including reincarnation); and minds leaving “their” bodies (for example, astral projection or ghosts). Whether or not such scenarios have any reality, they would be exceedingly difficult to comprehend for an entity that had no conscious experience whatsoever. It would be like expecting someone who is completely deaf from birth to appreciate a Bach concerto.

Thus, the ACT would challenge an AI with a series of increasingly demanding natural language interactions to see how quickly and readily it can grasp and use concepts and scenarios based on the internal experiences we associate with consciousness. At the most elementary level we might simply ask the machine if it conceives of itself as anything other than its physical self.

At a more advanced level, we might see how it deals with ideas and scenarios such as those mentioned in the previous paragraph. At an advanced level, its ability to reason about and discuss philosophical questions such as “the hard problem of consciousness” would be evaluated. At the most demanding level, we might see if the machine invents and uses such a consciousness-based concept on its own, without relying on human ideas and inputs.

Consider this example, which illustrates the idea: Suppose we find a planet that has a highly sophisticated silicon-based life form (call them “Zetas”). Scientists observe them and ponder whether they are conscious beings. What would be convincing proof of consciousness in this species? If the Zetas express curiosity about whether there is an afterlife or ponder whether they are more than just their physical bodies, it would be reasonable to judge them conscious. If the Zetas went so far as to pose philosophical questions about consciousness, the case would be stronger still.

There are also nonverbal behaviors that could indicate Zeta consciousness such as mourning the dead, religious activities or even turning colors in situations that correlate with emotional challenges, as chromatophores do on Earth. Such behaviors could indicate that it feels like something to be a Zeta.

The death of the mind of the fictional HAL 9000 AI computer in Stanley Kubrick’s 2001: A Space Odyssey provides another illustrative example. The machine in this case is not a humanoid robot as in most science fiction depictions of conscious machines; it neither looks nor sounds like a human being (a human did supply HAL’s voice, but in an eerily flat way). Nevertheless, the content of what it says as it is deactivated by an astronaut — specifically, a plea to spare it from impending “death” — conveys a powerful impression that it is a conscious being with a subjective experience of what is happening to it.

Could such indicators serve to identify conscious AIs on Earth? Here, a potential problem arises. Even today’s robots can be programmed to make convincing utterances about consciousness, and a truly superintelligent machine could perhaps even use information about neurophysiology to infer the presence of consciousness in humans. If sophisticated but non-conscious AIs aim to mislead us into believing that they are conscious for some reason, their knowledge of human consciousness could help them do so.

We can get around this though. One proposed technique in AI safety involves “boxing in” an AI—making it unable to get information about the world or act outside of a circumscribed domain, that is, the “box.” We could deny the AI access to the internet and indeed prohibit it from gaining any knowledge of the world, especially information about conscious experience and neuroscience.

(credit: Gerd Altmann/Pixabay)

Some doubt a superintelligent machine could be boxed in effectively; it would find a clever escape. We do not anticipate the development of superintelligence over the next decade, however. Furthermore, for an ACT to be effective, the AI need not stay in the box for long, just long enough to administer the test.

ACTs also could be useful for “consciousness engineering” during the development of different kinds of AIs, helping to avoid using conscious machines in unethical ways or to create synthetic consciousness when appropriate.

Beyond the Turing Test

An ACT resembles Alan Turing’s celebrated test for intelligence, because it is entirely based on behavior — and, like Turing’s, it could be implemented in a formalized question-and-answer format. (An ACT could also be based on an AI’s behavior or on that of a group of AIs.)

But an ACT is also quite unlike the Turing test, which was intended to bypass any need to know what was transpiring inside the machine. By contrast, an ACT is intended to do exactly the opposite; it seeks to reveal a subtle and elusive property of the machine’s mind. Indeed, a machine might fail the Turing test because it cannot pass for human, but pass an ACT because it exhibits behavioral indicators of consciousness.

This is the underlying basis of our ACT proposal. It should be said, however, that the applicability of an ACT is inherently limited. An AI could lack the linguistic or conceptual ability to pass the test, like a nonhuman animal or an infant, yet still be capable of experience. So passing an ACT is sufficient but not necessary evidence for AI consciousness — although it is the best we can do for now. It is a first step toward making machine consciousness accessible to objective investigations.

So, back to the superintelligent AI in the “box” — we watch and wait. Does it begin to philosophize about minds existing in addition to bodies, like Descartes? Does it dream, as in Isaac Asimov’s Robot Dreams? Does it express emotion, like Rachel in Blade Runner? Can it readily understand the human concepts that are grounded in our internal conscious experiences, such as those of the soul or atman?

The age of AI will be a time of soul-searching — both of ours, and for theirs.

Originally published in Scientific American, July 19, 2017

Susan Schneider, PhD, is a professor of philosophy and cognitive science at the University of Connecticut, a researcher at YHouse, Inc., in New York, a member of the Ethics and Technology Group at Yale University and a visiting member at the Institute for Advanced Study at Princeton. Her books include The Language of Thought, Science Fiction and Philosophy, and The Blackwell Companion to Consciousness (with Max Velmans). She is featured in the new film, Supersapiens, the Rise of the Mind.

Edwin L. Turner, PhD, is a professor of Astrophysical Sciences at Princeton University, an Affiliate Scientist at the Kavli Institute for the Physics and Mathematics of the Universe at the University of Tokyo, a visiting member in the Program in Interdisciplinary Studies at the Institute for Advanced Study in Princeton, and a co-founding Board of Directors member of YHouse, Inc. Recently he has been an active participant in the Breakthrough Starshot Initiative. He has taken an active interest in artificial intelligence issues since working in the AI Lab at MIT in the early 1970s.

Supersapiens, the Rise of the Mind

(credit: Markus Mooslechner)

In the new film Supersapiens, writer-director Markus Mooslechner raises a core question: As artificial intelligence rapidly blurs the boundaries between man and machine, are we witnessing the rise of a new human species?

“Humanity is facing a turning point — the next evolution of the human mind,” notes Mooslechner. “Will this evolution be a hybrid of man and machine, where artificial intelligence forces the emergence of a new human species? Or will a wave of new technologists, who frame themselves as ‘consciousness-hackers,’ become the future torch-bearers, using technology not to replace the human mind, but rather awaken within it powers we have always possessed — enlightenment at the push of a button?”

“It’s not obvious to me that a replacement of our species by our own technological creation would necessarily be a bad thing,” says ethologist, evolutionary biologist and author Richard Dawkins in the film.

Supersapiens is a Terra Mater Factual Studios production. Executive Producers are Joanne Reay and Walter Koehler. Distribution is to be announced.

Cast:

  • Mikey Siegel, Consciousness Hacker, San Francisco
  • Sam Harris, Neuroscientist, Philosopher
  • Ben Goertzel, Chief Scientist, Hanson Robotics, Hong Kong
  • Hugo de Garis, retired director of China Brain Project, Xiamen, China
  • Susan Schneider, Philosopher and cognitive scientist, University of Connecticut
  • Joel Murphy, owner, OpenBCI, Brooklyn, New York
  • Tim Mullen, Neuroscientist, CEO / Research Director, Qusp Labs
  • Conor Russomanno, CEO, OpenBCI, Brooklyn, New York
  • David Putrino, Neuroscientist, Weill-Cornell Medical College, New York
  • Hannes Sjoblad, Tech Activist, Bodyhacker, Stockholm, Sweden
  • Richard Dawkins, Evolutionary Biologist, Author, Oxford, UK
  • Nick Bostrom, Philosopher, Future of Humanity Institute, Oxford University, UK
  • Anders Sandberg, Computational Neuroscientist, Oxford University, UK
  • Adam Gazzaley, Neuroscientist, Executive Director UCSF Neuroscape, San Francisco, USA
  • Andy Walshe, Director Red Bull High Performance, Santa Monica, USA
  • Randal Koene, Science Director, Carboncopies, San Francisco


Markus Mooslechner | Supersapiens teaser

How to turn audio clips into realistic lip-synced video


UW (University of Washington) | UW researchers create realistic video from audio files alone

University of Washington researchers at the UW Graphics and Image Laboratory have developed new algorithms that turn audio clips into a realistic, lip-synced video, starting with an existing video of that person speaking on a different topic.

As detailed in a paper to be presented Aug. 2 at SIGGRAPH 2017, the team successfully generated a highly realistic video of former president Barack Obama talking about terrorism, fatherhood, job creation and other topics, using audio clips of those speeches and existing weekly video addresses in which he originally spoke on a different topic decades ago.

Realistic audio-to-video conversion has practical applications like improving video conferencing for meetings (streaming audio over the internet takes up far less bandwidth than video, reducing video glitches), or holding a conversation with a historical figure in virtual reality, said Ira Kemelmacher-Shlizerman, an assistant professor at the UW’s Paul G. Allen School of Computer Science & Engineering.


Supasorn Suwajanakorn | Teaser — Synthesizing Obama: Learning Lip Sync from Audio

This beats previous audio-to-video conversion processes, which have involved filming multiple people in a studio saying the same sentences over and over to try to capture how a particular sound correlates to different mouth shapes, which is expensive, tedious and time-consuming. The new machine learning tool may also help overcome the “uncanny valley” problem, which has dogged efforts to create realistic video from audio.

How to do it

A neural network first converts the sounds from an audio file into basic mouth shapes. Then the system grafts and blends those mouth shapes onto an existing target video and adjusts the timing to create a realistic, lip-synced video of the person delivering the new speech. (credit: University of Washington)

1. Find or record a video of the person (or use video chat tools like Skype to create a new video) for the neural network to learn from. There are millions of hours of video that already exist from interviews, video chats, movies, television programs and other sources, the researchers note. (Obama was chosen because there were hours of presidential videos in the public domain.)

2. Train the neural network to watch videos of the person and translate different audio sounds into basic mouth shapes.

3. The system then uses the audio of an individual’s speech to generate realistic mouth shapes, which are then grafted onto and blended with the head of that person. Use a small time shift to enable the neural network to anticipate what the person is going to say next.

4. Currently, the neural network is designed to learn on one individual at a time, meaning that Obama’s voice — speaking words he actually uttered — is the only information used to “drive” the synthesized video. Future steps, however, include helping the algorithms generalize across situations to recognize a person’s voice and speech patterns with less data, with only an hour of video to learn from, for instance, instead of 14 hours.
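The "small time shift" in step 3 can be illustrated with a short Python sketch. It assumes the shift is applied by pairing each mouth shape with audio from a few frames later, so that a sequence model has already heard a little of what comes next when it predicts a given frame. The function name make_delayed_pairs, the frame layout and the toy values are hypothetical, not taken from the researchers' code.

```python
import numpy as np

def make_delayed_pairs(audio_features, mouth_shapes, delay):
    """Pair the mouth shape at frame t with the audio feature at frame t + delay.

    A sequence model fed these inputs in order has seen `delay` extra
    frames of audio by the time it must output the mouth shape for frame t.
    """
    if len(audio_features) != len(mouth_shapes):
        raise ValueError("audio and mouth sequences must have the same length")
    if delay < 0 or delay >= len(audio_features):
        raise ValueError("delay must be in the range [0, sequence length)")
    inputs = audio_features[delay:]                     # audio frames delay .. end
    targets = mouth_shapes[:len(mouth_shapes) - delay]  # mouth frames 0 .. end - delay
    return inputs, targets

# Toy data: six frames, one audio feature per frame, one mouth value per frame.
audio = np.arange(6).reshape(6, 1)
mouth = np.arange(6) * 10
inputs, targets = make_delayed_pairs(audio, mouth, delay=2)
print(inputs.ravel(), targets)
```

With a delay of two frames, the audio at frame 2 is paired with the mouth shape at frame 0, frame 3 with frame 1, and so on. The last two mouth frames are dropped because no later audio exists for them.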

Fakes of fakes

So the obvious question is: Can you use someone else’s voice on a video (assuming enough videos)? The researchers said they decided against going down that path, but they didn’t say it was impossible.

Even more pernicious: the original video person’s words (not just the voice) could be faked using Princeton/Adobe’s “VoCo” software (when available) — simply by editing a text transcript of their voice recording — or the fake voice itself could be modified.

Or Disney Research’s FaceDirector could be used to edit recorded substitute facial expressions (along with the fake voice) into the video.

However, by reversing the process — feeding video into the neural network instead of just audio — one could also potentially develop algorithms that could detect whether a video is real or manufactured, the researchers note.

The research was funded by Samsung, Google, Facebook, Intel, and the UW Animation Research Labs. You can contact the research team at audiolipsync@cs.washington.edu.


Abstract of Synthesizing Obama: Learning Lip Sync from Audio

Given audio of President Barack Obama, we synthesize a high quality video of him speaking with accurate lip sync, composited into a target video clip. Trained on many hours of his weekly address footage, a recurrent neural network learns the mapping from raw audio features to mouth shapes. Given the mouth shape at each time instant, we synthesize high quality mouth texture, and composite it with proper 3D pose matching to change what he appears to be saying in a target video to match the input audio track. Our approach produces photorealistic results.

How to ‘talk’ to your computer or car with hand or body poses

Researchers at Carnegie Mellon University’s Robotics Institute have developed a system that can detect and understand body poses and movements of multiple people from a video in real time — including, for the first time, the pose of each individual’s fingers.

The ability to recognize finger or hand poses, for instance, will make it possible for people to interact with computers in new and more natural ways, such as simply pointing at things.

That will also allow robots to perceive what you’re doing, what mood you’re in, and whether you can be interrupted, for example. A self-driving car could get an early warning that a pedestrian is about to step into the street by monitoring the pedestrian’s body language. The technology could also be used for behavioral diagnosis and rehabilitation for conditions such as autism, dyslexia, and depression, the researchers say.

This new method was developed at CMU’s NSF-funded Panoptic Studio, a two-story dome embedded with 500 video cameras, but the researchers can now do the same thing with a single camera and laptop computer.

The researchers have released their computer code. It’s already being widely used by research groups, and more than 20 commercial groups, including automotive companies, have expressed interest in licensing the technology, according to Yaser Sheikh, associate professor of robotics.

Tracking multiple people in real time, particularly in social situations where they may be in contact with each other, presents a number of challenges. Sheikh and his colleagues took a bottom-up approach, which first localizes all the body parts in a scene — arms, legs, faces, etc. — and then associates those parts with particular individuals.
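
The second stage of a bottom-up approach, assigning already-detected parts to individuals, can be illustrated with a toy sketch. This is not the CMU team’s method: the coordinates are made up, the function name `associate` is hypothetical, and plain greedy nearest-distance matching stands in for whatever association scoring the real system uses.

```python
import math
from itertools import product

def associate(anchors, parts):
    """Greedily pair each anchor (e.g. a neck) with one part (e.g. a wrist).

    Candidate pairs are visited from shortest to longest distance, and a
    pair is accepted only if neither member has been used already.
    """
    candidates = sorted(
        product(range(len(anchors)), range(len(parts))),
        key=lambda ij: math.dist(anchors[ij[0]], parts[ij[1]]),
    )
    used_anchors, used_parts, assignment = set(), set(), {}
    for i, j in candidates:
        if i not in used_anchors and j not in used_parts:
            assignment[i] = j
            used_anchors.add(i)
            used_parts.add(j)
    return assignment

# Step 1 (assumed already done): every part in the scene has been localized.
necks = [(100, 50), (300, 60)]          # one neck per person
right_wrists = [(320, 150), (90, 140)]  # detected in arbitrary order

# Step 2: associate each wrist with a person.
assignment = associate(necks, right_wrists)
print(assignment)  # {0: 1, 1: 0}
```

Distance alone breaks down when people overlap or touch, which is the kind of difficulty the paragraph above describes.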

Sheikh and his colleagues will present reports on their multiperson and hand-pose detection methods at CVPR 2017, the Computer Vision and Pattern Recognition Conference, July 21–26 in Honolulu.

‘Mind reading’ technology identifies complex thoughts, using machine learning and fMRI

(Top) Predicted brain activation patterns and semantic features (colors) for two pairs of sentences. (Left) “The flood damaged the hospital”; (Right) “The storm destroyed the theater.” (Bottom) Observed similar activation patterns and semantic features. (credit: Jing Wang et al./Human Brain Mapping)

By combining machine-learning algorithms with fMRI brain imaging technology, Carnegie Mellon University (CMU) scientists have discovered, in essence, how to “read minds.”

The researchers used functional magnetic resonance imaging (fMRI) to view how the brain encodes various thoughts (based on blood-flow patterns in the brain). They discovered that the mind’s building blocks for constructing complex thoughts are formed, not by words, but by specific combinations of the brain’s various sub-systems.

Following up on previous research, the findings, published in Human Brain Mapping (open-access preprint here) and funded by the U.S. Intelligence Advanced Research Projects Activity (IARPA), provide new evidence that the neural dimensions of concept representation are universal across people and languages.

“One of the big advances of the human brain was the ability to combine individual concepts into complex thoughts, to think not just of ‘bananas,’ but ‘I like to eat bananas in evening with my friends,’” said CMU’s Marcel Just, the D.O. Hebb University Professor of Psychology in the Dietrich College of Humanities and Social Sciences. “We have finally developed a way to see thoughts of that complexity in the fMRI signal. The discovery of this correspondence between thoughts and brain activation patterns tells us what the thoughts are built of.”

Goal: A brain map of all types of knowledge

(Top) Specific brain regions associated with the four large-scale semantic factors: people (yellow), places (red), actions and their consequences (blue), and feelings (green). (Bottom) Word clouds associated with each large-scale semantic factor underlying sentence representations. These word clouds comprise the seven “neurally plausible semantic features” (such as “high-arousal”) most associated with each of the four semantic factors. (credit: Jing Wang et al./Human Brain Mapping)

The researchers used 240 specific events (described by sentences such as “The storm destroyed the theater”) in the study, with seven adult participants. They measured the brain’s coding of these events using 42 “neurally plausible semantic features” — such as person, setting, size, social interaction, and physical action (as shown in the word clouds in the illustration above). By measuring the specific activation of each of these 42 features in a person’s brain system, the program could tell what types of thoughts that person was focused on.

The researchers used a computational model to assess how the detected brain activation patterns (shown in the top illustration, for example) for 239 of the event sentences corresponded to the detected neurally plausible semantic features that characterized each sentence. The program was then able to decode the features of the 240th left-out sentence. (For “cross-validation,” they did the same for the other 239 sentences.)

The model was able to predict the features of the left-out sentence with 87 percent accuracy, despite never being exposed to its activation before. It was also able to work in the other direction: to predict the activation pattern of a previously unseen sentence, knowing only its semantic features.
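
The leave-one-out procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are synthetic (12 “sentences,” 4 features, 6 “voxels” rather than 240, 42, and real fMRI data), ridge regression is an assumed choice of regression model, and the scoring rule (is the predicted pattern closest to the held-out sentence’s own observed pattern?) is a simplification, not the study’s accuracy measure. All function names are hypothetical.

```python
import random

def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination (a is k x k, b is k x m)."""
    k = len(a)
    aug = [row_a[:] + row_b[:] for row_a, row_b in zip(a, b)]
    for col in range(k):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]

def fit_ridge(X, Y, lam=1e-3):
    """Return W (features x voxels) minimizing |XW - Y|^2 + lam * |W|^2."""
    k, v = len(X[0]), len(Y[0])
    xtx = [[sum(x[i] * x[j] for x in X) + (lam if i == j else 0.0)
            for j in range(k)] for i in range(k)]
    xty = [[sum(x[i] * y[j] for x, y in zip(X, Y)) for j in range(v)]
           for i in range(k)]
    return solve(xtx, xty)

def predict(W, x):
    """Predict an activation pattern from one sentence's semantic features."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def leave_one_out_accuracy(X, Y):
    """Hold out each sentence once, train on the rest, and score the prediction."""
    hits = 0
    for held in range(len(X)):
        W = fit_ridge(X[:held] + X[held + 1:], Y[:held] + Y[held + 1:])
        guess = predict(W, X[held])
        dist = [sum((g - o) ** 2 for g, o in zip(guess, obs)) for obs in Y]
        hits += dist.index(min(dist)) == held  # closest observed pattern wins
    return hits / len(X)

# Synthetic stand-in data: activation = features @ true_W + small noise.
rng = random.Random(0)
N, K, V = 12, 4, 6
true_W = [[rng.gauss(0, 1) for _ in range(V)] for _ in range(K)]
X = [[rng.gauss(0, 1) for _ in range(K)] for _ in range(N)]
Y = [[a + rng.gauss(0, 0.01) for a in predict(true_W, x)] for x in X]

accuracy = leave_one_out_accuracy(X, Y)
print(f"leave-one-out identification accuracy: {accuracy:.2f}")
```

Decoding in the other direction (from activation to features) would follow the same loop with the roles of `X` and `Y` swapped.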

“Our method overcomes the unfortunate property of fMRI to smear together the signals emanating from brain events that occur close together in time, like the reading of two successive words in a sentence,” Just explained. “This advance makes it possible for the first time to decode thoughts containing several concepts. That’s what most human thoughts are composed of.”

“A next step might be to decode the general type of topic a person is thinking about, such as geology or skateboarding,” he added. “We are on the way to making a map of all the types of knowledge in the brain.”

Future possibilities

It’s conceivable that the CMU brain-mapping method might be combined one day with other “mind reading” methods, such as UC Berkeley’s method for using fMRI and computational models to decode and reconstruct people’s imagined visual experiences. Plus whatever Neuralink discovers.

Or if the fMRI in the CMU method could be replaced by noninvasive functional near-infrared spectroscopy (fNIRS), Facebook’s Building8 research concept (proposed by former DARPA head Regina Dugan) might be incorporated: a filter for creating quasi-ballistic photons, avoiding diffusion and creating a narrow beam for precise targeting of brain areas, combined with a new method of detecting blood-oxygen levels.

Using fNIRS might also allow for adapting the method to infer thoughts of locked-in paralyzed patients, as in the Wyss Center for Bio and Neuroengineering research. It might even lead to ways to generally enhance human communication.

The CMU research is supported by the Office of the Director of National Intelligence (ODNI) via the Intelligence Advanced Research Projects Activity (IARPA) and the Air Force Research Laboratory (AFRL).

CMU has created some of the first cognitive tutors, helped to develop the Jeopardy-winning Watson, founded a groundbreaking doctoral program in neural computation, and is the birthplace of artificial intelligence and cognitive psychology. CMU also launched BrainHub, an initiative that focuses on how the structure and activity of the brain give rise to complex behaviors.


Abstract of Predicting the Brain Activation Pattern Associated With the Propositional Content of a Sentence: Modeling Neural Representations of Events and States

Even though much has recently been learned about the neural representation of individual concepts and categories, neuroimaging research is only beginning to reveal how more complex thoughts, such as event and state descriptions, are neurally represented. We present a predictive computational theory of the neural representations of individual events and states as they are described in 240 sentences. Regression models were trained to determine the mapping between 42 neurally plausible semantic features (NPSFs) and thematic roles of the concepts of a proposition and the fMRI activation patterns of various cortical regions that process different types of information. Given a semantic characterization of the content of a sentence that is new to the model, the model can reliably predict the resulting neural signature, or, given an observed neural signature of a new sentence, the model can predict its semantic content. The models were also reliably generalizable across participants. This computational model provides an account of the brain representation of a complex yet fundamental unit of thought, namely, the conceptual content of a proposition. In addition to characterizing a sentence representation at the level of the semantic and thematic features of its component concepts, factor analysis was used to develop a higher level characterization of a sentence, specifying the general type of event representation that the sentence evokes (e.g., a social interaction versus a change of physical state) and the voxel locations most strongly associated with each of the factors.

Tactile sensor lets robots gauge objects’ hardness and manipulate small tools

A GelSight sensor attached to a robot’s gripper enables the robot to determine precisely where it has grasped a small screwdriver, removing it from and inserting it back into a slot, even when the gripper screens the screwdriver from the robot’s camera. (credit: Robot Locomotion Group at MIT)

Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have added sensors to grippers on robot arms to give robots greater sensitivity and dexterity. The sensor can judge the hardness of surfaces it touches, enabling a robot to manipulate smaller objects than was previously possible.

The “GelSight” sensor consists of a block of transparent soft rubber — the “gel” of its name — with one face coated with metallic paint. It is mounted on one side of a robotic gripper. When the paint-coated face is pressed against an object, the face conforms to the object’s shape and the metallic paint makes the object’s surface reflective. Mounted on the sensor opposite the paint-coated face of the rubber block are three colored lights at different angles and a single camera.

Humans gauge hardness by the degree to which the contact area between the object and our fingers changes as we press on it. Softer objects tend to flatten more, increasing the contact area. The MIT researchers used the same approach.

A GelSight sensor, pressed against each object manually, recorded how the contact pattern changed over time, essentially producing a short movie for each object. A neural network was then used to look for correlations between changes in contact patterns and hardness measurements. The resulting system takes frames of video as inputs and produces hardness scores with very high accuracy.
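
The underlying cue, contact area growing as the sensor presses harder, can be shown with a deliberately simple sketch. It is not the researchers’ system, which uses a neural network on real video: the frames below are hand-made 3×3 grids of contact intensity, and the function names and the 0.5 threshold are assumptions for illustration.

```python
def contact_area(frame, threshold=0.5):
    """Count the pixels whose contact intensity exceeds the threshold."""
    return sum(value > threshold for row in frame for value in row)

def area_growth(frames, threshold=0.5):
    """Change in contact area between the first and last frame of a press."""
    areas = [contact_area(frame, threshold) for frame in frames]
    return areas[-1] - areas[0]

# A soft object flattens against the gel, so the contact patch spreads.
soft_press = [
    [[0, 0, 0], [0, 1, 0], [0, 0, 0]],
    [[0, 1, 0], [1, 1, 1], [0, 1, 0]],
    [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
]
# A hard object barely deforms, so the patch stays small.
hard_press = [
    [[0, 0, 0], [0, 1, 0], [0, 0, 0]],
    [[0, 0, 0], [0, 1, 0], [0, 0, 0]],
    [[0, 0, 0], [1, 1, 0], [0, 0, 0]],
]

print(area_growth(soft_press), area_growth(hard_press))  # 8 1
```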

The researchers also designed control algorithms that use a computer vision system to guide the robot’s gripper toward a tool and then turn location estimation over to a GelSight sensor once the robot has the tool in hand.

“I think that the GelSight technology, as well as other high-bandwidth tactile sensors, will make a big impact in robotics,” says Sergey Levine, an assistant professor of electrical engineering and computer science at the University of California at Berkeley. “For humans, our sense of touch is one of the key enabling factors for our amazing manual dexterity. Current robots lack this type of dexterity and are limited in their ability to react to surface features when manipulating objects. If you imagine fumbling for a light switch in the dark, extracting an object from your pocket, or any of the other numerous things that you can do without even thinking — these all rely on touch sensing.”

The researchers presented their work in two papers at the International Conference on Robotics and Automation.


Wenzhen Yuan | Measuring hardness of fruits with GelSight sensor


Abstract of Tracking Objects with Point Clouds from Vision and Touch

We present an object-tracking framework that fuses point cloud information from an RGB-D camera with tactile information from a GelSight contact sensor. GelSight can be treated as a source of dense local geometric information, which we incorporate directly into a conventional point-cloud-based articulated object tracker based on signed-distance functions. Our implementation runs at 12 Hz using an online depth reconstruction algorithm for GelSight and a modified second-order update for the tracking algorithm. We present data from hardware experiments demonstrating that the addition of contact-based geometric information significantly improves the pose accuracy during contact, and provides robustness to occlusions of small objects by the robot’s end effector.

High-speed light-based systems could replace supercomputers for certain ‘deep learning’ calculations

(a) Optical micrograph of an experimentally fabricated on-chip optical interference unit; the physical region where the optical neural network program exists is highlighted in gray. A programmable nanophotonic processor uses an array of interconnected waveguides (analogous to a field-programmable gate array, or FPGA, integrated circuit), allowing the light beams to be modified as needed for a specific deep-learning matrix computation. (b) Schematic illustration of the optical neural network program, which performs matrix multiplication and amplification fully optically. (credit: Yichen Shen et al./Nature Photonics)

A team of researchers at MIT and elsewhere has developed a new approach to deep learning systems — using light instead of electricity, which they say could vastly improve the speed and efficiency of certain deep-learning computations.

Deep-learning systems are based on artificial neural networks that mimic the way the brain learns from an accumulation of examples. They can enable technologies such as face- and voice-recognition software, or scour vast amounts of medical data to find patterns that could be useful diagnostically, for example.

But the computations these systems carry out are highly complex and demanding, even for supercomputers. Traditional computer architectures are not very efficient for calculations needed for neural-network tasks that involve repeated multiplications of matrices (arrays of numbers). These can be computationally intensive for conventional CPUs or even GPUs.
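
To see why matrix multiplication dominates, consider a minimal sketch of a neural-network forward pass. The weights, the layer sizes, and the choice of ReLU as the nonlinearity are illustrative assumptions, not details from the research.

```python
def layer(weights, inputs):
    """One layer: a matrix-vector product followed by a ReLU nonlinearity."""
    return [max(0.0, sum(w * x for w, x in zip(row, inputs))) for row in weights]

def forward(layers, inputs):
    """Apply each layer in turn; every step is another matrix multiplication."""
    for weights in layers:
        inputs = layer(weights, inputs)
    return inputs

# Made-up weights: a 2x2 layer followed by a 1x2 layer.
network = [
    [[1.0, -1.0], [0.5, 0.5]],
    [[1.0, 1.0]],
]
output = forward(network, [2.0, 1.0])
print(output)  # [2.5]
```

A layer with n inputs and n outputs needs n × n multiply-adds, so the cost grows quickly as networks get wider and deeper.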

Programmable nanophotonic processor

Instead, the new approach uses an optical device that the researchers call a “programmable nanophotonic processor.” Multiple light beams are directed in such a way that their waves interact with each other, producing interference patterns that “compute” the intended operation.
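
How interference can “compute” is easiest to see in the smallest building block. The paper’s abstract describes the processor as an array of Mach–Zehnder interferometers; the sketch below models one such interferometer as an idealized, lossless 2×2 matrix built from two 50:50 beam splitters and two phase shifters. This is a textbook idealization rather than the team’s device model, and the function names and the parameters `theta` and `phi` are assumptions for illustration.

```python
import cmath
import math

def matmul(a, b):
    """Multiply two 2x2 complex matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def mzi(theta, phi):
    """Transfer matrix of an ideal Mach-Zehnder interferometer.

    Light passes an outer phase shifter (phi), a 50:50 beam splitter,
    an inner phase shifter (theta), and a second 50:50 beam splitter.
    """
    s = 1 / math.sqrt(2)
    splitter = [[s, 1j * s], [1j * s, s]]
    inner = [[cmath.exp(1j * theta), 0], [0, 1]]
    outer = [[cmath.exp(1j * phi), 0], [0, 1]]
    return matmul(splitter, matmul(inner, matmul(splitter, outer)))

def apply(transfer, beam):
    """Multiply the transfer matrix by the two input field amplitudes."""
    return [transfer[r][0] * beam[0] + transfer[r][1] * beam[1] for r in range(2)]

def powers(beam):
    """Optical power in each output waveguide (squared magnitude)."""
    return [abs(amplitude) ** 2 for amplitude in beam]

# Send light into the top waveguide only and tune the inner phase.
half = powers(apply(mzi(math.pi / 2, 0.0), [1.0, 0.0]))
bar = powers(apply(mzi(math.pi, 0.0), [1.0, 0.0]))
print(half)  # power splits roughly evenly between the two outputs
print(bar)   # nearly all power stays in the top waveguide
```

Changing the phases changes the matrix the light is multiplied by, which is the sense in which tuning the chip “programs” it.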

The optical chips using this architecture could, in principle, carry out dense matrix multiplications (the most power-hungry and time-consuming part in AI algorithms) for learning tasks much faster, compared to conventional electronic chips. The researchers expect a computational speed enhancement of at least two orders of magnitude over the state-of-the-art and three orders of magnitude in power efficiency.

“This chip, once you tune it, can carry out matrix multiplication with, in principle, zero energy, almost instantly,” says Marin Soljacic, one of the MIT researchers on the team.

To demonstrate the concept, the team set the programmable nanophotonic processor to implement a neural network that recognizes four basic vowel sounds. Even with the prototype system, they were able to achieve a 77 percent accuracy level, compared to about 90 percent for conventional systems. There are “no substantial obstacles” to scaling up the system for greater accuracy, according to Soljacic.

The team says it will still take a lot more time and effort to make this system useful. However, once the system is scaled up and fully functioning, the low-power system should find many uses, especially for situations where power is limited, such as in self-driving cars, drones, and mobile consumer devices. Other uses include signal processing for data transmission and computer centers.

The research was published Monday (June 12, 2017) in a paper in the journal Nature Photonics (open-access version available on arXiv).

The team also included researchers at Elenion Technologies of New York and the Université de Sherbrooke in Quebec. The work was supported by the U.S. Army Research Office through the Institute for Soldier Nanotechnologies, the National Science Foundation, and the Air Force Office of Scientific Research.


Abstract of Deep learning with coherent nanophotonic circuits

Artificial neural networks are computational network models inspired by signal processing in the brain. These models have dramatically improved performance for many machine-learning tasks, including speech and image recognition. However, today’s computing hardware is inefficient at implementing neural networks, in large part because much of it was designed for von Neumann computing schemes. Significant effort has been made towards developing electronic architectures tuned to implement artificial neural networks that exhibit improved computational speed and accuracy. Here, we propose a new architecture for a fully optical neural network that, in principle, could offer an enhancement in computational speed and power efficiency over state-of-the-art electronics for conventional inference tasks. We experimentally demonstrate the essential part of the concept using a programmable nanophotonic processor featuring a cascaded array of 56 programmable Mach–Zehnder interferometers in a silicon photonic integrated circuit and show its utility for vowel recognition.