How can we write better software? – Interview series, part 2 with Brian Warner

This is part 2 of a new Interview series here at Mozilla Hacks.

“How can we, as developers, write more superb software?”

A simple question without a simple answer. Writing good code is hard, even for developers with years of experience. Luckily, the Mozilla community is made up of some of the best development, QA and security folks in the industry.

This is part two in a series of interviews where I take on the role of an apprentice to learn from some of Mozilla’s finest.

Introducing Brian Warner

When my manager and I first discussed this project, Brian was the first person I wanted to interview. Brian probably doesn’t realize it, but he has been one of my unofficial mentors since I started at Mozilla. He is an exceptional teacher, and has the unique ability to make security approachable.

At Mozilla, Brian designed the pairing protocol for the “old” Firefox Sync, designed the protocol for the “new” Firefox Sync, and was instrumental in keeping Persona secure. Outside of Mozilla, Brian co-founded the Tahoe-LAFS project, and created Buildbot.

What do you do at Mozilla?

My title is staff security engineer in the Cloud Services group. I analyse and develop protocols for securely managing passwords and account data, and I implement those protocols in different fashions. I also review others’ code, I look at external projects to figure out whether it’s appropriate to incorporate them, and I try to stay on top of security failures like 0-days and problems in the wild that might affect us, and also tools and algorithms that we might be able to use.

UX vs Security: Is it a false dichotomy? Some people have the impression that for security to be good, it must be difficult to use.

There are times when I think that it’s a three-way tradeoff. Instead of just an x-axis, a y-axis, and a diagonal line that doesn’t touch zero, sometimes I think it’s a three-way thing where the third axis is how much work you want to put into it, or how clever you are, or how much user research and experimentation you are willing to do. Stuff that engineers are typically not focused on, but that UX folks and psychologists are. I believe, maybe it’s more of a hope than a belief, that if you put enough effort into that, then you can actually find something that is secure and usable at the same time, but you have to do a lot more work.

The trick is to figure out what people want to do and find a way of expressing whatever security decisions they have to make as a normal part of their workflow. It’s like when you lend your house key to a neighbour so they can water your plants while you are away on vacation: you’ve got a pretty good idea of what power you are handing over.

There are some social constructs surrounding that, like, “I don’t think you’re going to make a copy of that key, and so when I get it back from you, you no longer have that power that I granted to you.” There are patterns in normal life, with normal non-computer behaviours and objects, that we have developed social practices around. I think part of the trick is to use that, assume that people are going to expect something that works like that, and then find a way to make the computer stuff more like that.

Part of the problem is that we end up asking people to do very unnatural things because it is hard to imagine or hard to build something that’s better. Take passwords. Passwords are a lousy authentication technology for a lot of different reasons. One of them being that in most cases, to exercise the power, you have to give that power to whoever it is you are trying to prove to. It’s like, “let me prove to you I know a secret”…”ok, tell me the secret.” That introduces all these issues like knowing how to correctly identify who you are talking to, and making sure nobody else is listening.

In addition to that, the best passwords are going to be randomly generated by a computer and they are relatively long. It’s totally possible to memorize things like that but it takes a certain amount of exercise and practice and that is way more work than any one program deserves.

But, if you only have one such password and the only thing you use it on is your phone, then your phone is now your intermediary that manages all this stuff for you, and then it probably would be fair (to ask users to spend more energy managing that password). And it’s clear that your phone is sort of this extension of you, better at remembering things, and that the one password you need in this whole system is the bootstrap.

So some stuff like that, and other stuff like escalating effort in rare circumstances. There are a lot of cases where what you do on an everyday basis can be really easy and lightweight, and it’s only when you lose the phone that you have to go back to a more complicated thing. Just like you only carry so much cash in your wallet, and every once in a while you have to go to a bank and get more.

Stuff like that, I think, is totally possible to do, but it’s been really easy to fall into bad patterns like blaming the user, or pushing a lot of decisions onto the user when they don’t really have enough information to make a good choice, and a lot of the choices you are giving them aren’t very meaningful.

Do you think many users don’t understand the decisions and tradeoffs they are being asked to make?

I think that’s very true, and I think most of the time it’s an inappropriate question to ask. It’s kind of unfair. Walking up to somebody and putting them in this uncomfortable situation – do you like X or do you like Y – is a little bit cruel.

Another thing that comes to mind is permission dialogs, especially on Windows boxes. They show up all the time, even just to do really basic operations. It’s not like you’re trying to do something exotic or crazy. These dialogs purport to ask the user for permission, but they don’t explain the context or reasons or consequences enough to make it a real question. They’re more like a demand or an ultimatum. If you say “no” then you can’t get your work done, but if you say “yes” then the system is telling you that bad things will happen and it’s all going to be your fault.

It’s intended to give the user an informed choice, but it is this kind of blame the user, blame the victim pattern, where it’s like “something bad happened, but you clicked on the OK button, you’ve taken responsibility for that.” The user didn’t have enough information to do something and the system wasn’t well enough designed that they could do what they wanted to do without becoming vulnerable.

Months before “new” Sync ever saw the light of day, the protocol was hashed out in an extremely vocal and public forum. It was the exact opposite of security through obscurity. What did you hope to accomplish?

There were a couple of different things that I was hoping to get from that discussion. I pushed all that stuff to be described and discussed publicly because it’s the right thing to do; it’s the way we develop software, you know, it’s the open source way. And so I can’t really imagine doing it any other way.

The specific hopes that I had for publishing that stuff were to solicit feedback and get people to look for basic design flaws. I wanted to get people comfortable with the security properties, especially because new Sync changes some of them. We are switching away from pairing to something based on passwords. I wanted people to have time to feel they understood what those changes were and why we were making them. We put the design criteria and the constraints out there so people could see that we kind of have to switch to a password to meet all of the other goals, and see the best we can do given security based on passwords.

Then the other part is that having that kind of public discussion and getting as many experienced people involved as possible is the only way that I know of to develop confidence that we’re building something that’s correct and not broken.

So it is really just more eyeballs…

Before a protocol or API designer ever sits down and writes a spec or line of code, what should they be thinking about?

I’d say think about what your users need. Boil down what they are trying to accomplish into something minimal and pretty basic. Figure out the smallest amount of code, the smallest amount of power, that you can provide that will meet those needs.

This is like the agile version of developing a protocol.

Yeah. Minimalism is definitely useful. Once you have the basic API that enables you to do what needs to be done, then think about all of the bad things that could be done with that API. Try and work out how to prevent them, or make them too expensive to be worthwhile.

A big problem with security is that sometimes you ask, “what are the chances that problem X would happen?” Say you design something and there is a 1/1000 chance that a particular set of inputs will cause this one particular problem to happen. If it really is random, then 1/1000 may be OK, 1/1,000,000 may be OK, but if it is a situation where an attacker gets to control the inputs, then it’s no longer 1/1000; it’s 1 in however many times the attacker chooses to make it 1.

It’s a game of who is cleverer and who is more thorough. It’s frustrating to have to do this case analysis to figure out every possible thing that could happen, every state it could get into, but if somebody else out there is determined to find a hole, that’s the kind of analysis they are going to do. And if they are more thorough than you are, then they’ll find a problem that you failed to cover.

Is this what is meant by threat modelling?

Yeah, different people use the term in different ways. I think of it as when you are laying out the system: you are setting up the ground rules. You are saying there is going to be this game. In this game, Alice is going to choose a password and Bob is trying to guess her password, and whatever.

You are defining what the ground rules are. So sometimes the rules say things like … the attacker doesn’t get to run on the defending system, their only access is through this one API call, and that’s the API call that you provide for all of the good players as well, but you can’t tell the difference between the good guy and the bad guy, so they’re going to use that same API.

So then you figure out the security properties if the only thing the bad guy can do is make API calls, so maybe that means they are guessing passwords, or it means they are trying to overflow a buffer by giving you some input you didn’t expect.

Then you step back and say “OK, what assumptions are you making here, are those really valid assumptions?” You store passwords in the database with the assumption that the attacker won’t ever be able to see the database, and then some other part of the system fails, and whoops, now they can see the database. OK, roll back that assumption, now you assume that most attackers can’t see the database, but sometimes they can, how can you protect the stuff that’s in the database as best as possible?
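One standard answer to that last question, sketched very minimally here (this is my illustration, not Brian’s code, and the scrypt parameters are only placeholders), is to store salted, deliberately slow hashes rather than the passwords themselves, so that a leaked database is still expensive to attack:

    import hashlib
    import hmac
    import os

    def hash_password(password: str) -> tuple[bytes, bytes]:
        # A random salt per user, plus a slow key-derivation function (scrypt),
        # limits the damage if an attacker ever does get to see the database.
        salt = os.urandom(16)
        digest = hashlib.scrypt(password.encode("utf-8"), salt=salt, n=2**14, r=8, p=1)
        return salt, digest

    def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
        candidate = hashlib.scrypt(password.encode("utf-8"), salt=salt, n=2**14, r=8, p=1)
        # Constant-time comparison avoids leaking how much of the hash matched.
        return hmac.compare_digest(candidate, expected)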

Other stuff like, “what are all the different sorts of threats you are intending to defend against?” Sometimes you draw a line in the sand and say “we are willing to try and defend against everything up to this level, but beyond that you’re hosed.” Sometimes it’s a very practical distinction like “we could try to defend against that but it would cost us 5x as much.”

Sometimes what people do is try and estimate the value to the attacker versus the cost to the user, it’s kind of like insurance modelling with expected value. It will cost the attacker X to do something and they’ve got an expected gain of Y based on the risk they might get caught.

Sometimes the system can be rearranged so that incentives encourage them to do the good thing instead of the bad thing. Bitcoin was very carefully thought through in this space: there are these clear points where somebody could try and do a double spend, try and do something that is counter to the system, but it is very clear for everybody, including the attacker, that their effort would be better spent doing the mainstream good thing. They will clearly make more money doing the good thing than the bad thing. So, any rational attacker will not be an attacker anymore, they will be a good participant.

How can a system designer maximise their chances of developing a reasonably secure system?

I’d say the biggest guideline is the Principle of Least Authority. POLA is sometimes how that is expressed. Any component should have as little power as necessary to do the specific job that it needs to do. That has a bunch of implications and one of them is that your system should be built out of separate components, and those components should actually be isolated so that if one of them goes crazy or gets compromised or just misbehaves, has a bug, then at least the damage it can do is limited.

The example I like to use is a decompression routine. Something like gzip, where you’ve got bytes coming in over the wire, and you are trying to expand them before you do other processing. As a software component, it should be this isolated little bundle with two wires. One side should have a wire coming in with compressed bytes, and the other side should have decompressed data coming out. It’s gotta allocate memory and do all kinds of format processing and lookup tables and whatnot, but nothing that box does, no matter how weird the input or how badly the box misbehaves, can do anything other than spit bytes out the other side.
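To make that “two wires” idea concrete, here is a tiny sketch (mine, not Brian’s, using Python’s zlib as a stand-in for gzip) of a decompression component whose only authority is turning compressed bytes into a bounded number of plain bytes:

    import zlib

    MAX_OUTPUT = 10 * 1024 * 1024  # cap expansion so even a malicious "zip bomb" stays bounded

    def decompress(compressed: bytes) -> bytes:
        # One wire in (compressed bytes), one wire out (plain bytes).
        # No file handles, no network access, no global state to trash.
        d = zlib.decompressobj()
        out = d.decompress(compressed, MAX_OUTPUT)
        if d.unconsumed_tail:
            raise ValueError("input would expand beyond the allowed size")
        return out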

It’s a little bit like Unix process isolation, except that a process can do syscalls that can trash your entire disk, and do network traffic and do all kinds of stuff. This is just one pipe in and one pipe out, nothing else. It’s not always easy to write your code that way, but it’s usually better.

It’s a really good engineering practice because it means when you are trying to figure out what could possibly be influencing a bit of code you only have to look at that one bit of code. It’s the reason we discourage the use of global variables, it’s the reason we like object-oriented design in which class instances can protect their internal state or at least there is a strong convention that you don’t go around poking at the internal state of other objects. The ability to have private state is like the ability to have private property where it means that you can plan what you are doing without potential interference from things you can’t predict. And so the tractability of analysing your software goes way up if things are isolated. It also implies that you need a memory safe language…

Big, monolithic programs in a non memory safe language are really hard to develop confidence in. That’s why I go for higher level languages that have memory safety to them, even if that means they are not as fast. Most of the time you don’t really need that speed. If you do, it’s usually possible to isolate the thing that you need, into a single process.

What common problems do you see out on the web that violate these principles?

Well in particular, the web is an interesting space. We tend to use memory safe languages for the receiver.

You mean like Python and JavaScript.

Yeah, and we tend to use more object-oriented stuff, more isolation. The big problems that I tend to see on the web are failures to validate and sanitize your inputs, or failures to escape output, which leads to things like injection attacks.
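The standard fixes look something like this (a hypothetical Python sketch; the table name and markup are made up): let the database driver handle parameters instead of pasting strings into SQL, and escape anything user-supplied before it goes into HTML:

    import sqlite3
    from html import escape

    def find_user(db: sqlite3.Connection, name: str):
        # Parameterized query: the driver quotes `name`, so it cannot rewrite the SQL.
        return db.execute("SELECT id, name FROM users WHERE name = ?", (name,)).fetchone()

    def greeting_html(name: str) -> str:
        # Escape before interpolating, so any markup hidden in `name` stays inert.
        return "<p>Hello, %s!</p>" % escape(name)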

You have a lot of experience reviewing already written implementations, Persona is one example. What common problems do you see on each of the front and back ends?

It tends to be escaping things, or making assumptions about where data comes from, and how much an attacker gets control over if that assumption turns out to be faulty.

Is this why you advocated making it easy to trace how the data flows through the system?

Yeah, definitely. It’d be nice if you could kind of zoom out of the code and see a bunch of little connected components with little lines running between them, and say, “OK, how did this module come up with this name string? Oh, well, it came from this one. Where did it come from there?” Then you trace it back to the point where, HERE, that name string actually comes from a user-submitted parameter: it’s coming from the browser, and the browser is generating it as the sending domain of the postMessage. OK, how much control does the attacker have over one of those? What could they do that would be surprising to us? And then you work out at any given point what the type is, see where the transition is from one type to another, and notice if there are any points where you are failing to do that transformation, or are getting the type confused. Definitely, simplicity and visibility and tractable analysis are the keys.
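One way to make those transitions visible in code (a hypothetical illustration, not how Persona itself was written) is to give untrusted and validated data different types, so the one place where one becomes the other is easy to find and audit:

    import re
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class UntrustedDomain:
        raw: str          # straight from postMessage; the attacker may control this

    @dataclass(frozen=True)
    class ValidatedDomain:
        value: str        # only ever constructed by validate_domain() below

    _DOMAIN_RE = re.compile(r"^[a-z0-9.-]{1,253}$")

    def validate_domain(incoming: UntrustedDomain) -> ValidatedDomain:
        # The single point where untrusted input becomes trusted data.
        if not _DOMAIN_RE.match(incoming.raw):
            raise ValueError("unexpected domain format")
        return ValidatedDomain(incoming.raw)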

What can people do to make data flow auditing simpler?

I think, minimising interactions between different pieces of code is a really big thing. Isolate behaviour to specific small areas. Try and break up the overall functionality into pieces that make sense.

What is defence in depth and how can developers use it in a system?

“Belt and suspenders” is the classic phrase. If one thing goes wrong, the other thing will protect you. You look silly if you are wearing both a belt and suspenders because they are two independent tools that help you keep your pants on, but sometimes belts break, and sometimes suspenders break. Together they protect you from the embarrassment of having your pants fall off. So defence in depth usually means don’t depend upon perimeter security.

Does this mean you should be checking data throughout the system?

There is always a judgement call about the performance cost, or the complexity cost. If your code is filled with sanity checking, then that can distract the person who is reading your code from seeing what real functionality is taking place. That limits their ability to understand your code, which they need in order to use it correctly and satisfy its needs. So, it’s always this kind of judgement call and tension between being too verbose and not being verbose enough, or having too much checking.
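In practice, defence in depth past the perimeter often looks like a cheap internal re-check at a component boundary, even when the outer layer should already have caught the problem. A hypothetical example:

    def apply_discount(price_cents: int, percent: int) -> int:
        # The API layer is supposed to have validated these already, but a cheap
        # re-check here limits the damage if some other code path gets it wrong.
        if not 0 <= percent <= 100:
            raise ValueError("discount out of range: %r" % percent)
        if price_cents < 0:
            raise ValueError("negative price: %r" % price_cents)
        return price_cents - (price_cents * percent) // 100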

The notion of perimeter security, it’s really easy to fall into this trap of drawing this dotted line around the outside of your program and saying “the bad guys are out there, and everyone inside is good” and then implementing whatever defences you are going to do at that boundary and nothing further inside. I was talking with some folks and their opinion was that there are evolutionary biology and sociology reasons for this. Humans developed in these tribes where basically you are related to everyone else in the tribe and there are maybe 100 people, and you live far away from the next tribe. The rule was basically if you are related to somebody then you trust them, and if you aren’t related, you kill them on sight.

That worked for a while, but you can’t build any social structure larger than 100 people. We still think that way when it comes to computers. We think that there are “bad guys” and “good guys”, and I only have to defend against the bad guys. But, we can’t distinguish between the two of them on the internet, and the good guys make mistakes too. So, the principle of least authority and the idea of having separate software components that are all very independent and have very limited access to each other means that, if a component breaks because somebody compromised it, or somebody tricked it into behaving differently than you expected, or it’s just buggy, then the damage that it can do is limited because the next component is not going to be willing to do that much for it.

Do you have a snippet of code, from you or anybody else, that you think is particularly elegant that others could learn from?

I guess one thing to show off would be the core share-downloading loop I wrote for Tahoe-LAFS.

In Tahoe, files are uploaded into lots of partially-redundant “shares”, which are distributed to multiple servers. Later, when you want to download the file, you only need to get a subset of the shares, so you can tolerate some number of server failures.

The shares include a lot of integrity-protecting Merkle hash trees which help verify the data you’re downloading. The locations of these hashes aren’t always known ahead of time (we didn’t specify the layout precisely, so alternate implementations might arrange them differently). But we want a fast download with minimal round-trips, so we guess their location and fetch them speculatively: if it turns out we were wrong, we have to make a second pass and fetch more data.

This code tries very hard to fetch the bare minimum. It uses a set of compressed bitmaps that record which bytes we want to fetch (in the hopes that they’ll be the right ones), which ones we really need, and which ones we’ve already retrieved, and sends requests for just the right ones.
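As a toy version of that bookkeeping (this is not the actual Tahoe-LAFS code, which uses compressed span structures rather than plain Python sets), the idea is roughly:

    wanted = set(range(0, 64))     # bytes we guess we'll want (speculative hash fetches)
    needed = set(range(16, 48))    # bytes we know we need (data blocks)
    received = set(range(0, 24))   # bytes already fetched on earlier round-trips

    to_request = (wanted | needed) - received   # ask only for what is still missing

    def to_ranges(offsets):
        # Collapse scattered offsets into contiguous (start, length) read requests.
        ranges = []
        for off in sorted(offsets):
            if ranges and off == ranges[-1][0] + ranges[-1][1]:
                ranges[-1] = (ranges[-1][0], ranges[-1][1] + 1)
            else:
                ranges.append((off, 1))
        return ranges

    print(to_ranges(to_request))   # -> [(24, 40)]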

The thing that makes me giggle about this overly clever module is that the entire algorithm is designed around Rolling Stones lyrics. I think I started with “You can’t always get what you want, but sometimes … you get what you need”, and worked backwards from there.

The other educational thing about this algorithm is that it’s too clever: after we shipped it, we found out it was actually slower than the less-sophisticated code it had replaced. Turns out it’s faster to read a few large blocks (even if you fetch more data than you need) than a huge number of small chunks (with network and disk-IO overhead). I had to run a big set of performance tests to characterize the problem, and decided that next time, I’d find ways to measure the speed of a new algorithm before choosing which song lyrics to design it around. :).

What open source projects would you like to encourage people to get involved with?

Personally, I’m really interested in secure communication tools, so I’d encourage folks (especially designers and UI/UX people) to look into tools like Pond, TextSecure, and my own Petmail. I’m also excited about the variety of run-your-own-server-at-home systems like the GNU FreedomBox.

How can people keep up with what you are doing?

Following my commits on https://github.com/warner is probably a good approach, since most everything I publish winds up there.

Thank you Brian.

Transcript

Brian and I covered far more material than I could include in a single post. The full transcript, available on GitHub, also covers memory safe languages, implicit type conversion when working with HTML, and the Python tools that Brian commonly uses.

Next up!

Both Yvan Boiley and Peter deHaan are featured in the next article. Yvan leads the Cloud Services Security Assurance team and continues the security theme by discussing his team’s approach to security audits and which tools developers can use to self-audit their sites for common problems.

Peter, one of Mozilla’s incredible Quality Assurance engineers, is responsible for ensuring that Firefox Accounts doesn’t fall over. Peter talks about the warning signs, processes and tools he uses to assess a project, and how to give the smack down while making people laugh.

About Robert Nyman [Editor emeritus]

Technical Evangelist & Editor of Mozilla Hacks. Gives talks & blogs about HTML5, JavaScript & the Open Web. Robert is a strong believer in HTML5 and the Open Web and has been working since 1999 with Front End development for the web - in Sweden and in New York City. He regularly also blogs at http://robertnyman.com and loves to travel and meet people.
