Over a few days this summer, Alexa, the voice assistant who speaks to me through my Amazon Echo Dot, took to ending our interactions with a whisper: "Sweet dreams."
Every time it happened, I was startled, even though I thought I understood why she did it, insofar as I understand anything that goes on inside that cool black puck of a device.
I had gone to Amazon.com and enabled a third-party "skill" – an application that lets Alexa perform a service or do a trick – called "Baby Lullaby". It plays an instrumental version of a nursery song (yes, I still fall asleep to lullabies), then signs off softly with a nighttime farewell.
My guess is that the sign-off string of code had somehow come loose and attached itself to other skills. But even though my adult self knew perfectly well that "sweet dreams" was a glitch, some part of me wanted to believe that Alexa meant it. Who doesn't long for a tender good night, even in the middle of the afternoon? Proust would have understood.
We have all fallen for Alexa, unless we have fallen for Google Assistant or Siri or some other genius in a smart speaker. The speakers' manufacturers promise that only speech following the wake word is archived in the cloud, and Amazon and Google, at least, make it easy to delete those exchanges.
Nevertheless, troubling glitches surface now and then, as when Alexa listened in on a family's private conversation without having heard the wake word and sent a recording to an acquaintance on their contact list. Amazon explained that Alexa must have been roused by a word that sounded like "Alexa" (Texas? A Lexus? Praxis?), then misinterpreted elements of the ensuing conversation as a series of commands. The explanation did not make me feel much better.
Privacy concerns, however, have not slowed the march of these devices into our homes. Amazon does not reveal exact numbers, but when I ask how many Echo devices have been sold, a spokeswoman says "tens of millions". By the end of last year, more than 40 million smart speakers had been installed worldwide, according to Canalys, a technology-research firm.
Based on current sales, Canalys estimates that this figure will reach 100 million by the end of this year. According to a 2018 report by National Public Radio and Edison Research, 8 million Americans own three or more smart speakers, suggesting that they feel the need to always have one within earshot. By 2021, according to another research firm, Ovum, there will be almost as many voice-activated assistants on the planet as people. It took about 30 years for mobile phones to outnumber humans. Alexa and her ilk may get there in less than half that time.
One reason is that Amazon and Google are pushing these devices hard, discounting them so deeply during last year's holiday season that industry observers suspect the companies lost money on each unit sold. These and other tech companies have grand ambitions. They want to colonize space. Not interplanetary space. Domestic space: the home, the office, the car. In the near future, everything from your lighting and air-conditioning to your refrigerator, coffee maker and even your toilet could be wired into a voice-controlled system.
The company that succeeds in dominating the smart-speaker market will lock device makers, app designers and consumers into its ecosystem of devices and services, much as Microsoft forced the personal-computer industry into its operating system in the 1990s. Alexa alone already works with more than 20,000 smart-home devices representing more than 3,500 brands. Her voice comes out of more than 100 third-party gadgets, including headphones, security systems and cars.
But the devices also hold an inherent appeal – one that goes beyond mere consumer convenience. Even those of us who approach new technology with a healthy wariness have reasons to welcome smart speakers into our homes. After my sister-in-law posted to Instagram a cute video of her two-year-old son trying to get Alexa to play "Hurry", from the Moana soundtrack, I wrote to ask why she and her husband had bought an Echo, since they are quite strict about what they let their son play with.
"Before we got Alexa, the only way to play music on our computers, and when [he] watching a computer screen, he thinks it's time to watch tv, "my swearer mailed back." It's fun to have a way to listen to music or radio that does not mean opening a computer screen. "She's not the first parent to have that idea. In the same NPR / Edison report, nearly half of the parents recently bought a smart speaker that they had done it to reduce the household screening time.
Perhaps you think that talking with Alexa is merely a new way to do the things you already do on a screen: shopping, catching up on the news, trying to figure out whether your dog is sick or just depressed. It is not that simple. We are not just swapping the body parts used to perform these tasks – trading fingers and eyes for mouths and ears. We are talking about a change in the status of the technology itself – an upgrade, as it were. When we talk to our personal assistants, we lift them closer to our own level.
Followers and confessors
Endowed with the quintessentially human power of speech, Alexa, Google Assistant and Siri have already become greater than the sum of their parts. They are software, but they are more than that, just as human consciousness is an effect of neurons and synapses yet is more than that. Their speech makes us treat them as if they had minds. These assistants may be faux-conscious non-persons, but their words give them personality and social presence.
At the moment, these machines remain at the dawn of their potential; whether they will satisfy any given request is something of a gamble. But as sales of smart speakers climb, the data and computing power behind them expand exponentially. Within our lifetimes, these devices are likely to become far more capable conversationalists. When they do, they will insinuate themselves fully into our lives. With their perfect cloud-based memories, they will be omniscient; with their occupation of our most intimate spaces, they will be omnipresent. And with their finely honed skill at eliciting confessions, they could hold remarkable power over our emotional lives. What will that be like?
When Toni Reid, now the vice president of Alexa Experience, was asked to join the Echo team in 2014 – before the device was on the market – she scoffed: "I was just like, 'What? It's a speaker?'" Before that she had worked on the Dash Wand, a portable barcode reader and smart microphone that let people scan or speak the name of an item they wanted to add to their Amazon shopping cart. The point of the Dash Wand was obvious: it made buying products from Amazon easier.
I meet Reid at Amazon's Day 1 building in Seattle, a gleaming tower named for Jeff Bezos's corporate philosophy: that every day at the company should be as intense and driven as the first day of a start-up. ("Day 2 is stasis. Followed by irrelevance. Followed by excruciating, painful decline. Followed by death," he wrote in a letter to shareholders in 2016.) Reid studied anthropology as an undergraduate, and she has a social scientist's patience for my rudimentary questions about what makes these devices different from the other electronics in our lives.
The Echo's basic appeal, she says, is that it frees your hands. Thanks to something called "far-field voice technology", the machines can now decipher speech at a distance. Echo owners can wander around their living rooms, kitchens and offices, doing this or that while requesting random bits of information or ordering toilet paper; no clicking required.
The beauty of Alexa, Reid continues, is that she makes such interactions "frictionless" – a term I hear over and over in my conversations with the designers and engineers behind these products. You do not have to walk to a desktop and type a query into a browser. You do not have to dig out your iPhone and tap in your passcode. Like the ideal servant in a Victorian mansion, Alexa hovers in the background, ready to do her masters' bidding swiftly and accurately.
Frictionlessness is, for now, only the goal. At present, considerable friction remains. It is truly remarkable how often smart speakers – even Google Home, which frequently outperforms the Echo in tests run by technology sites – flub their lines. They misunderstand a question, stress the wrong syllable, give a bizarre answer, apologize for not knowing a perfectly knowable fact. Alexa's bloopers float around the internet like clips from an absurdist comedy show. In one that went viral on YouTube, a toddler lisps, "Lexa, play 'Ticker Ticker'" – presumably hoping to hear "Twinkle Twinkle Little Star". Alexa replies, in her pleasant monotone, "You want to hear a station for porn ... hot chicks, amateur girls ..." (It got more graphic from there.) "No, no, no!" the child's parents scream in the background.
The context challenge
My sister-in-law got her Echo early in 2015. Over the following two years, whenever I visited, I watched her bicker as passionately with her machine as George Costanza's parents did with each other on Seinfeld. "I hate Alexa," she announced recently, having finally settled the matter by stashing the device in a closet. "I would tell her, 'Play some Beethoven', and she would play 'Eleanor Rigby'. Every time."
As Alexa grows in sophistication, it will become harder and harder to toss the Echo onto the pile of old gadgets dropped off on electronics-recycling day. Rohit Prasad is the head scientist of Alexa's artificial-intelligence team, and a man willing to defy local norms by wearing a dress shirt. He sums up the greatest obstacle between Alexa and true sophistication in a single word: context.
"You must understand that the language is very ambiguous," he says to me. "It requires conversational context, geographic context." When you ask Alexa about Spursna playing tonight, she has to know if you mean San Antonio Spurs or Tottenham Hotspur, the British football team known as Spurs. When you follow up by asking, "When is their next home game?" Alexa has to remember the previous question and understand what "their" refers to. This short-term memory and syntactic back reference is known in Amazon as "contextual transmission". It was only spring that Alexa developed the ability to answer follow-up questions without saying that her wake up words again.
Alexa will have to get better at grasping context before she can truly inspire trust. And trust matters, not only because consumers will give up on her if she bungles too many requests, but because she is more than a search engine. She is an "action engine", Prasad says. Ask Alexa a question and she does not offer a list of results; she chooses one answer from many. She tells you what she thinks you want to know. "You want a very smart AI. You don't want a dumb AI," Prasad says. "And yet making sure the conversation stays coherent – that's incredibly challenging."
To understand the forces marshaled to pull us away from screens and toward voices, you need to know something about the psychology of the voice. For one thing, voices create intimacy. I am hardly the only person who has caught herself confessing her emotional state to her electronic assistant. Many articles have been written about the declarations of depression and threats of suicide that the manufacturers have had to field. When I asked tech executives about this, they said they try to handle such statements responsibly.
For example, if you tell Alexa that you are feeling depressed, she has been programmed to say: "I'm so sorry you're feeling that way. Please know that you're not alone. There are people who can help you. You could try talking with a friend, or your doctor. You can also reach out to the Depression and Bipolar Support Alliance for more resources."
Why would we turn to computers for solace? Machines give us a way to reveal shameful feelings without feeling shame. When talking with one, people "engage in less of what's called impression management, so they reveal more intimate things about themselves", says Jonathan Gratch, a computer scientist and psychologist at the University of Southern California's Institute for Creative Technologies who studies the spoken and unspoken psychodynamics of human-computer interaction. "They'll display more sadness, for example, if they're depressed."
I turn to Diana Van Lancker Sidtis, a speech researcher at NYU. She explains that our vocal signatures consist not only of our style of stringing words together but also of the sonic marinade in which those words are steeped: a rich medley of tone, rhythm, pitch, resonance, pronunciation and many other features. The technical term for this collection of properties is prosody.
When someone speaks to us, we take in the words, the syntax and the prosody all at once, scanning them for clues about what kind of person the speaker is and what she is trying to say, recruiting a remarkable amount of brainpower in the effort to make sense of what we hear. "The brain is wired to treat every aspect of every human utterance as meaningful," wrote the late Clifford Nass, a pioneering thinker on computer-human relations. The process mostly unfolds beneath conscious awareness, like a powerful current steering us toward a particular emotional response.
Tone is tricky
We cannot put all this mental machinery on pause just because a voice is humanoid rather than human. Even when my Google Assistant does nothing more stirring than deliver the weather forecast, an image of the sweet young actor behind the voice takes shape in my mind. That does not mean I fail to grasp the algorithmic nature of our interactions. I know he's just software.
And then again, I don't know it. Evolution has not prepared me to know it. We have been responding to human vocal sounds for millions of years as if they signaled human proximity. We have had only about a century and a half to adapt to the idea that a voice can be detached from its source, and just a few years to adapt to the idea that a device that talks and sounds like a human being may not be one.
A voice is at its most uncanny when we cannot locate its owner – when it is everywhere and nowhere at once. There is a reason God speaks to Adam and Moses. In the beginning was the Word, not the scroll.
A 2017 study published in American Psychologist makes the case that when people talk without looking at each other, they are better at recognizing each other's feelings. They are more empathetic. Freud understood this long before empirical research demonstrated it. That is why he had his patients lie on a couch, facing away from him. He could listen all the harder for the naked truths in their cadences, while they, undistracted by his frowns or smiles, drifted into the dusk where stifled emotions could slip free.
The makers of smart speakers would love to capitalize on these psychosocial effects. Amazon and Google both have "personality teams", charged with striking just the right tone for their assistants. This is partly textbook brand management: the devices are ambassadors for their makers. Reid tells me that Amazon wants Alexa's personality to reflect the company's values: "smart, humble, sometimes funny". Google Assistant is "humble, it's helpful, sometimes playful", says Gummi Hafsteinsson, one of the Assistant's top product managers. But having a personality also helps make a voice relatable.
Tone is tricky. Though voice-based virtual assistants are often compared to butlers, Al Lindsay, the vice president of Alexa engine software and a man with an engineer's plainspoken manner, says he and his team had a different servant in mind. Their "North Star" was the onboard computer that ran the USS Enterprise on Star Trek, responding to the crew's requests with the breathy poise of a 1960s Pan Am stewardess. (The Enterprise's computer was an inspiration for Google's engineers, too. Its voice belonged to the actress Majel Barrett, wife of Star Trek's creator, Gene Roddenberry; when the Google Assistant project was still under wraps, its code name was Majel.)
Twenty-first-century Americans no longer feel entirely comfortable with female obsequiousness. We like our servility to come in less servile flavors. The voice should be friendly but not too friendly. It should have just the right dose of sass.
What Alexa says to her suitors
To fine-tune the Assistant's personality, Google hired Emma Coats away from Pixar, where she had worked as a storyboard artist on Brave, Monsters University and Inside Out. Coats was at a conference on the day I visited Google's Mountain View, California, headquarters. She beamed in via Google Hangouts and offered what struck me as rule No. 1 for writing the Assistant's dialogue, a maxim with the gnomic simplicity of a Zen koan: Google Assistant, she said, should be able to speak like a person, but it should never pretend to be one. In Finding Nemo, she pointed out, the fish are as emotionally real as people, but they go to fish school and dare one another to swim up and touch a boat. In the same way, an artificially intelligent device should "honor the reality that it's software".
For example, if you ask Google Assistant, "What's your favorite ice-cream flavor?", it may say, "You can't go wrong with Neapolitan. There's something in it for everyone." This, of course, follows the principle Coats articulated: the software cannot eat ice cream, and therefore cannot have ice-cream preferences. If you propose marriage to Alexa – and Amazon says a million people did so in 2017 – she demurs for similar reasons. "We're at pretty different places in our lives," she told me. "Literally. I mean, you're on Earth. And I'm in the cloud."
An assistant should be true to its cybernetic nature, but it should not come across as alien, either. This is where James Giangola, the conversation- and persona-design lead for Google Assistant, comes in. Giangola is a voluble man with wavy hair and more than a touch of the mad scientist about him. His job is to make the Assistant sound normal.
People, Giangola notes, tend to deliver new information at the end of an utterance rather than at the beginning or the middle. "I say, 'My name is James'," he points out, "not 'James is my name'." He offers another example. Say that someone wants to book a flight for June 31. "Well," Giangola says, "there is no June 31." So the machine has to pull off two delicate tasks at once: coming across as natural while contradicting its human user.
He leans over his computer and calls up a test recording to illustrate his point. A man says, "Book it for June 31." The Assistant replies, "There are only 30 days in June." The reply is stiff: it buries the new information in the middle and ends on "June", Giangola observes. He plays a second version of the exchange. "Book it for June 31." The Assistant answers, "In fact, June only has 30 days." Her point – 30 days – now comes at the end of the sentence, and she throws in an "in fact", which gently cushions the correction to come. "More natural, right?" Giangola says.
Getting the rhythms of spoken language down is crucial, but it is hardly enough to create a decent conversationalist. Bots also need a good backstory. When Giangola was training the actor whose voice was recorded for Google Assistant, he gave her a biography to help her produce the precise degree of upbeat geekiness he wanted. The backstory is charmingly specific: she comes from Colorado, a state in a region without a distinctive accent.
"She is the youngest daughter of a research librarian and a physics professor who has a BA in Art History from the northwest," continues Giangola. When she was a child, she won $ 100,000 Jeopardy: Kids Edition. She used to work as a personal assistant to "a very popular late night tv satirical pundit". And she likes kayaking.
A skeptical colleague once asked Giangola, "How could anyone even pretend to sound like a kayaker?" During auditions (hundreds of people tried out for the role), Giangola turned to the doubter and said, "The candidate who just gave an audition – did she sound energetic, like someone who kayaks?" His colleague conceded that she did not. "I said, 'OK. There you go.'"
But vocal realism can be pushed further than people are comfortable with, and it can cause problems – at least for now. In May, Google unveiled Duplex, a cutting-edge voice technology, at its annual developer conference. To demonstrate its powers, the company played recordings of Duplex phoning unsuspecting humans. Using a female voice, it booked an appointment at a hair salon; using a male voice, it asked about availability at a restaurant. Duplex speaks with remarkably realistic disfluencies – ums and mm-hmms – and pauses, and neither receptionist realized she was talking to an artificial agent. One of its voices, the female one, even spoke in uptalk, the same rising intonation audible in the voice of the young woman who took that call.
Many commentators felt that Google had made a misstep with its gung-ho presentation. Duplex did not just break the commandment that AI should never pretend to be a person; it also seemed to break our trust. We may not always appreciate how powerfully our voice assistants play on our psychology, but at least we have opted into the relationship. Duplex was a fake-out, and an alarmingly effective one. Afterward, Google clarified that Duplex would always identify itself on its calls. But even if Google keeps its word, equally deceptive voice technologies are being developed elsewhere, and their creators may not be as honorable. The line between artificial voices and real ones is on its way to disappearing.
The most consequential conversationalist, of course, will be the one that can understand the emotions conveyed by your voice and respond accordingly – in a voice capable of approximating emotional subtlety. Your smart speaker cannot yet do either of these things, but systems for analyzing emotion in voices already exist. Emotion detection – in faces, bodies and voices – was pioneered about 20 years ago by an MIT engineering professor named Rosalind Picard, who gave the field its academic name: affective computing. "Back then," she says, "emotion was associated with irrationality, which was not a quality engineers respected."
Picard, a gentle, witty woman, runs the Affective Computing Lab, part of MIT's playfully strange Media Lab. She and her PhD students work on quantifying emotion. Picard explains that the difference between most AI research and the kind she does is that traditional research focuses on the "nouns and verbs" – that is, the content of an action or utterance. She is interested in the "adverbs" – the feelings conveyed.
"I know I can pick up a phone in a variety of ways. I can brush it with a sharp, angry, jerky move. I can pick it up with a happy and loving expectation," Picard tells. Estimated gestures with shade are important if a machine is to understand the subtle signals that people give each other. A simple act like nodding a head can telegraph different meanings: "I could nod in a jumping, happy way. I could sneak in sad sadness."
In 2009, Picard co-founded a start-up, Affectiva, focused on emotion-enabled AI. Today the company is run by its other founder, Rana el Kaliouby, a former postdoc in Picard's lab. A sense of urgency pervades Affectiva's open-plan offices in downtown Boston. The company hopes to be among the top players in the automotive market. The next generation of high-end cars will be equipped with software and hardware (cameras and microphones, for now) that analyze drivers' attention, irritation and other states. The capability is already being tested in semi-autonomous cars, which must make informed judgments about when it is safe to hand control to a driver and when to take over because the driver is too distracted or upset to focus on the road.
Affectiva initially focused on detecting emotion in facial expressions, but it recently hired a rising star in speech-emotion measurement, Taniya Mishra. Her team's goal is to train computers to interpret the emotional content of human speech. One clue to how we feel is, of course, the words we use. But we betray as much if not more of our feelings through the pitch, volume and tempo of our speech. Computers can already register these nonverbal features. The trick is teaching them what we humans intuit naturally: how those features map onto our moods.
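Pitch, volume and tempo are not mystical quantities; even a few lines of code can pull crude versions of them out of a raw waveform. The sketch below, my own illustration on a synthetic 220 Hz tone standing in for speech, estimates loudness via root-mean-square energy and pitch by counting zero crossings. Real emotion-analysis systems use far more robust estimators, but the raw material is the same.

```python
import math

# Crude acoustic features from a raw waveform. The 220 Hz sine below
# stands in for recorded speech; production systems use much more
# robust estimators than these.
sr = 16000                      # sample rate in Hz
freq = 220.0                    # test tone, roughly an adult speaking pitch
signal = [math.sin(2 * math.pi * freq * n / sr) for n in range(sr // 2)]

def rms_volume(samples):
    """Root-mean-square energy: a rough proxy for loudness."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def pitch_estimate(samples, sample_rate):
    """Estimate pitch by counting upward zero crossings (one per cycle)."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    return crossings / (len(samples) / sample_rate)

print(round(pitch_estimate(signal, sr)))   # close to 220 Hz
print(round(rms_volume(signal), 2))        # 0.71, i.e. 1/sqrt(2) for a sine
```

Tempo, the third feature, would be measured over longer stretches, for example by counting bursts of energy (syllables) per second.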
The biggest challenge in the field, she says, is building speech databases that are big enough, and varied enough, for computers to learn from. Mishra's team starts with speech samples recorded in the wild – that is, pulled from video clips on the web or supplied by a nonprofit data consortium that gathers natural speech for research purposes, among other sources. A small battalion of workers in Cairo, Egypt, then analyzes the speech, labeling the emotions it conveys as well as the nonlexical vocalizations (grunts, giggles, pauses) that play an important role in revealing a speaker's psychological state.
Labeling is a slow, painstaking process. Three to five workers have to agree on each label, and an hour of tagged speech can demand "as many as 20 hours of labeling time", Mishra says. There is a shortcut, though. Once the computers have enough human-labeled samples showing the specific acoustic features that accompany, say, a fit of pique or a wave of grief, they can begin labeling samples themselves, expanding the database far faster than mere mortals could. As the database grows, the computers will be able to listen to speech and identify its emotional content with ever-greater accuracy.
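The bootstrapping loop Mishra describes can be made concrete with a deliberately tiny sketch. The one-number "acoustic features", the two emotions and the margin rule below are my own stand-ins, not Affectiva's method; a production system would use rich feature vectors and a trained classifier. But the shape of the loop is the same: learn from human labels, then keep only the machine labels the model is confident about.

```python
# Toy self-labeling: grow a human-labeled dataset with confident
# machine labels. Each sample is a single invented acoustic feature.

def centroid(values):
    return sum(values) / len(values)

def self_label(human_labeled, unlabeled, margin=0.5):
    """Return the dataset expanded with confident machine labels."""
    # One centroid per emotion, computed from the human-labeled data.
    by_label = {}
    for feat, label in human_labeled:
        by_label.setdefault(label, []).append(feat)
    centroids = {label: centroid(feats) for label, feats in by_label.items()}

    expanded = list(human_labeled)
    for feat in unlabeled:
        # Confident only if the best match beats the runner-up by `margin`.
        dists = sorted((abs(feat - c), label) for label, c in centroids.items())
        (best_d, best_label), (second_d, _) = dists[0], dists[1]
        if second_d - best_d >= margin:
            expanded.append((feat, best_label))   # machine-labeled sample
    return expanded

human = [(1.0, "calm"), (1.2, "calm"), (4.0, "angry"), (4.3, "angry")]
grown = self_label(human, unlabeled=[1.1, 4.1, 2.6])
print(len(grown))   # 6: two confident samples added, the ambiguous 2.6 skipped
```

The ambiguous sample near the midpoint is left for the human labelers, which is exactly the division of labor that lets the database grow faster than manual tagging alone.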
Irony and sarcasm
During my research, I quickly lost count of the startups hoping to apply voice-based analysis in this field. Ellipsis Health, for example, is a San Francisco company developing AI software for doctors, social workers and other healthcare providers that can screen patients' speech for biomarkers of depression and anxiety.
"Changes in emotions, such as depression, are associated with brain changes, and these changes can be linked to engine commands," explains Ellipsi's scientist, Elizabeth Shriberg, these commands control the "voice-speaking device."
Ellipsis's software could have many applications. It might be used, for example, during routine visits such as an annual checkup (with the patient's consent, of course). While the doctor carries out her examination, a recording could be sent to Ellipsis and the patient's speech analyzed so quickly that the doctor receives an alert before the appointment ends, prompting her to ask about the patient's mood or to refer the patient to a mental-health professional. The software may have picked up a hint of listlessness or dejection in the patient's speech that the doctor missed.
I had hoped that certain registers of speech, such as irony or sarcasm, would defeat a computer. But Björn Schuller, a professor of artificial intelligence at Imperial College London and of "embedded intelligence" at the University of Augsburg, in Germany, tells me he has taught machines to detect sarcasm. He has them analyze linguistic content and tone of voice simultaneously, which lets them find the gaps between words and inflection that reveal whether a speaker means the exact opposite of what she has said. He gives me an example: "Su-per", the kind of thing you might spit out upon learning that your car will be in the shop for a week.
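Schuller's approach, as he describes it, can be caricatured in a few lines: score the words and the tone separately, and suspect sarcasm when they point in opposite directions. The word lists and the tone_score input below are invented placeholders (a real system would learn both from data), but they show where the gap between words and inflection lives.

```python
# Toy sarcasm detector: flag a mismatch between lexical sentiment and
# vocal tone. The word lists are invented; tone_score is assumed to
# come from a separate prosody model (> 0 pleased, < 0 annoyed).

POSITIVE_WORDS = {"super", "great", "wonderful", "love"}
NEGATIVE_WORDS = {"terrible", "awful", "hate", "broken"}

def text_sentiment(utterance):
    """Count positive minus negative words: a crude lexical score."""
    score = 0
    for raw in utterance.lower().split():
        word = raw.strip(".,!?").replace("-", "")   # "Su-per." -> "super"
        if word in POSITIVE_WORDS:
            score += 1
        elif word in NEGATIVE_WORDS:
            score -= 1
    return score

def is_sarcastic(utterance, tone_score):
    """Sarcasm suspected when words and tone point in opposite directions."""
    text = text_sentiment(utterance)
    return (text > 0 and tone_score < 0) or (text < 0 and tone_score > 0)

# "Su-per," ground out on learning the car will be in the shop all week:
print(is_sarcastic("Su-per.", tone_score=-0.8))        # True
print(is_sarcastic("Super, thanks!", tone_score=0.9))  # False
```

The hard part, which this sketch waves away, is producing a trustworthy tone score in the first place; that is exactly the labeled-speech problem described above.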
The natural next step after emotion detection will, of course, be emotion production: training artificially intelligent agents to generate approximations of emotion. Once computers have become virtuosic at breaking down the emotional components of our speech, it will be only a matter of time before they can reassemble them into credible performances of, say, empathy. Virtual assistants that can discern and respond to their users' states of mind could create a remarkably convincing sense of affinity, a bond that can be used for good or for ill.
Taniya Mishra looks forward to the possibility of such bonds. She fantasizes about a car she could rant to at the end of the day about everything that had gone wrong, a car that is also an active listener. "A car won't zone out," she says. "A car won't say, 'I'm sorry, honey, I have to run and make dinner. I'll listen to your story later.'" Rather, with the focus possible only in a robot, the car would track her emotional state over time and observe, in a reassuring voice, that Mishra always feels this way on a particular day of the week. Or perhaps it would play the Pharrell song ("Happy", naturally) that has cheered her up in the past. At this point, it will no longer make sense to think of these devices as assistants. They will have become companions.
Spoiled rotten and other dangers
If you don't happen to work in the tech sector, you probably can't think about all the untapped potential in your Amazon Echo or Google Home without experiencing some misgivings. By now, most of us have grasped the dangers of allowing our most private information to be harvested, stored and sold. We know how facial-recognition technologies have allowed authoritarian governments to spy on their own citizens; how companies disseminate and monetise our browsing habits, whereabouts, social-media interactions; how hackers can break into our home-security systems and nanny cams and steal their data or reprogram them for nefarious ends.
Virtual assistants and ever smarter homes able to understand our physical and emotional states will open up new frontiers for mischief making. Despite the optimism of most of the engineers I've talked with, I must admit that I now keep the microphone on my iPhone turned off and my smart speakers unplugged when I don't plan to use them for a while.
But there are subtler effects to consider as well. Take something as innocent-seeming as frictionlessness. To Amazon's Toni Reid, it means convenience. In his Cassandra-esque book Radical Technologies: The Design of Everyday Life, Adam Greenfield, an urbanist, frames frictionlessness as an existential threat: it is meant to eliminate thought from consumption, to "short-circuit the process of reflection that stands between one's recognition of a desire and its fulfilment via the market".
I fear other threats to our psychological well-being. A world populated by armies of sociable assistants could get very crowded. And noisy. It's hard to see how we'd protect those zones of silence in which we think original thoughts, do creative work, achieve flow. A companion is nice when you're feeling lonesome, but there's also something to be said for solitude.
And once our electronic servants become emotionally savvy? They could come to wield quite a lot of power over us, and even more over our children. In their subservient, helpful way, these emoting bots could spoil us rotten. They might be passive when they ought to object to our bad manners ("I don't deserve that!"). Programmed to keep the mood light, they might change the subject whenever dangerously intense feelings threaten to emerge, or flatter us in our ugliest moments. How do you program a bot to do the hard work of a true, human confidant, one who knows when what you really need is tough love?
If I have learned anything in my years of therapy, it is that the human psyche defaults to shallowness. We cling to our denials. It's easier to pretend that deeper feelings don't exist, because, of course, a lot of them are painful. What better way to avoid all that unpleasantness than to keep company with emotive entities unencumbered by actual emotions? But feelings don't just go away like that. They have a way of making themselves known. I wonder how sweet my grandchildren's dreams will be.