tevfik's log

Voice Will Be the Default

When I use Codex, I often speak before I know exactly what I want to say. If I type, I tend to compress the thought too early. If I speak, I can explain the messy version first, then refine it together.

That feels like a small change, but it points to a larger one.

Most software still assumes that using a computer means looking at a screen and moving things around on it.

That made sense. The screen gave us a two-dimensional surface, and the mouse and keyboard let us manipulate it with surprising precision. Once graphical interfaces became mainstream, we learned to open windows, switch tabs, drag files, and operate increasingly complex software through the same basic pattern.

But this was never proof that the screen was the best possible interface. It was partly a consequence of what computers could understand.

For decades, speech was a hard problem. Many smart people worked on it through the 1970s, 1980s, 1990s, and 2000s. Computers could process audio signals and classify patterns, but ordinary conversation was still messy, contextual, and difficult to handle well.

That constraint shaped software. If computers could not reliably understand speech, developers had to build for clicks, taps, menus, and text boxes.

Now that is changing.

Speech recognition has been improving for years. The more recent shift is that models can increasingly turn speech into usable intent, not just text. That makes voice a serious interface again, not just a novelty.

Voice matters even more when the computer is no longer just waiting for commands, but helping carry out intentions.

It also opens a simple question: if we were designing computers from scratch today, would the screen still be the default way we interact with them?

I do not think so.

If the desktop GUI were the final form of computing, phones would not have replaced desktops for so much of daily life. If typing were the most natural form of communication, people would not keep switching to calls when the topic becomes important, emotional, or easy to misunderstand.

Text is useful because it is asynchronous. Voice is useful because it is not. When something matters, we often want to explain, interrupt, clarify, and finish the loop in one conversation instead of sending messages back and forth all day.

People have been imagining this future for a long time. When I was at Carnegie Mellon, I saw early work on ubiquitous computing: the idea that the computer should move with you and recede into the environment, instead of forcing you to sit down in front of a dedicated machine. If that is the goal, a mouse and keyboard are a strange endpoint. You do not want to carry the interface with you. You want the technology to be available when you speak, move, or need help.

I already feel this shift in my own work. Speaking is easier when I want to think out loud, give context, or describe a task before I fully know how to phrase it.

But we are still early. The systems I use today know only a small part of my context. They do not fully understand what I am doing, what I usually care about, or when it would be useful to interrupt me with a suggestion. The next step is not only better voice input. It is better understanding of context, timing, and intent.

Voice is powerful because it is natural. A three-year-old can use it before they can read or write. You can use it while walking, in the dark, or when your hands are busy. It works across many situations where a screen is awkward or impossible.

This is also why voice matters for delegation. Typing is fine when you are issuing precise commands. But when you are assigning work, you often need to explain the goal, the context, the tradeoffs, and what judgment should be used along the way. That is much closer to how we speak to another person than how we fill out a form.

Most people are not used to delegating to software yet. They ask AI questions instead of assigning it work. But as that changes, the interaction changes too. The computer stops being a place where you operate tools and starts becoming something you direct.

That does not mean screens disappear.

Some tasks are visual because the information itself is dense and spatial. If you are choosing an airplane seat, comparing design options, editing a spreadsheet, or scanning a map, seeing the options is better than hearing a list read aloud. Screens are excellent when layout, comparison, and spatial information matter.

But many tasks are not like that. For a large share of computing, we do not need to manipulate pixels. We need to express intent, answer a question, make a decision, or delegate a task.

The future is not screenless. It is that computers stop forcing every task through the screen. The screen dominated when computers needed us to adapt to them. Voice will matter more when computers can finally adapt to us.