Computer vision was long an esoteric field hidden in the depths of academic research. Scientists toiled to create mathematical algorithms that could extract meaning from the grid of light and dark pixels that make up an image. How can people look at these pixels and so easily distinguish between a cat and a toaster?
Recall that each pixel conveys nothing more than a single color; still, your phone’s camera generates millions of them every second. In combination, these pixels can describe literally anything the camera sees. Even a child’s eye can trivially integrate them to identify objects, but teaching a computer to do the same was long a significant computational challenge. Now that computers can see, the way people interact with things will never be the same.
Over the past decade, advancing computer platforms have finally unlocked the power of these techniques and catapulted computer vision into the vanguard of modern societal change. The intersection of machine learning and computer vision is the foundation for a major leap in technological capability. The coming revolution will be as transformative as any we’ve seen so far.
Reaching “Peak Screen”
In the early days of computer science, only nerdy programmers used computers. Then, the invention of the application program welcomed a broad range of business analysts, word processors, and video gamers. The always-on mobile internet, with its wide array of consumer products and everyday benefits, has finally brought the computer revolution to everyone. We love it so much that we’ve surrounded ourselves with more screens than any of those early users could ever have imagined. We interact with them at home, at work, in our cars, in our pockets, and on our wrists.
We have reached the point of “peak screen” where screens are virtually always present and demanding our attention. Why so many? Because all these interfaces are necessary to communicate with our computers. Computing is such an integral part of how we conduct our everyday lives that we don’t realize how much time we spend telling our machines what to do, what we want, or what we just did.
Fortunately, the easing of this communication burden has already begun, led by early machine learning companies like Nest and Pandora. We no longer need to decipher the maddeningly complex programmable thermostat or painstakingly assemble the ideal playlist. I can simply say that I want it a bit warmer right now or that I’m loving a particular song, and the appropriate algorithms hear me and adjust. These systems not only take literal action in the moment but go on to generalize my intent to improve the experience for me overall. Iterate a few times and life is good. The learning aspect of these systems allows them to intuit what I want without burdening me to instruct them in detail. Greater meaning is conveyed through fewer instructions, which is undeniably a welcome advance.
Operating in the Background
Computer vision is accelerating this process. With it we’ll simply go about our lives, doing what we like, and the computers will quietly keep track. We’ll no longer need to instruct them affirmatively about what just happened because they’ll have seen it at the same time we did and, more importantly, will be able to understand our intent. As computers can see more for themselves, the need for all those screens, keyboards, and indicator lights will recede.
Will it be difficult for us to adapt? No, these computer vision systems are passive and inherently easy to use. Precisely because there’s no need to instruct them, there’s similarly no need to learn how to use them. Each past technological revolution forced us to learn the new tools, to adapt to the new world, and to remake ourselves to fit in. This one is different, delightfully different. We’ll just do what’s natural and vision-based systems will manage the details.
The Rise of Automation
One common example of public automation is toll roads. Since the turn of the century, many toll booths across the country have been replaced by automated systems that “see” people drive by and understand that their intent is to use the road and pay the toll. This freedom would have felt impossible 20 years earlier but is now accepted and taken for granted by millions of motorists each day.
Another example is the one that I work on, Checkout-Free Shopping. It uses computer vision to see what goes into your shopping cart so the cashier can know what to charge without having to scan every item’s barcode. As with toll roads, this offers the consumer enormous savings in time and convenience. Much hyped, but probably still five to ten years away, are driverless cars. Also coming are automated parking meters, ticketless amusement parks, high-accuracy medical uses, and numerous industrial applications. Each has made, or soon will make, our lives better and will then quickly be taken for granted. Adoption will be swift as these systems simplify the experience by fluidly handling the technical bits.
Now that computers can see, the way people interact with things will never be the same. We are about to be freed from waiting in line, manually operating machines, and constantly entering obvious things into a computer. Past advances in computing have unsurprisingly led to more computers doing more things for us. Machine learning and computer vision are part of that trend of increasing capability and will similarly lead to computers doing ever more. The ironic difference this time is that they will not stand in front, demanding our attention, but in the background where they belong. We will be freed to interact as people and let the computers do the computing. This revolution will remove many of our screens and facilitate the rehumanizing of society.