Free Ideas for UI Frameworks, or How To Achieve Polished UI

Ever since the original iPhone came out, I’ve had several ideas about how they managed to achieve such fluidity with relatively mediocre hardware. I mean, it was good at the time, but Android still struggles on hardware that makes that look like a 486… It’s absolutely my fault that none of these have been implemented in any open-source framework I’m aware of, so instead of sitting on these ideas and trotting them out at the pub every few months as we reminisce over what could have been, I’m writing about them here. I’m hoping that either someone takes them and runs with them, or that they get thoroughly debunked and I’m made to look like an idiot. The third option is of course that they’re ignored, which I think would be a shame, but given I’ve not managed to get the opportunity to implement them over the last decade, that would hardly be surprising. I feel I should clarify that these aren’t all my ideas, but include a mix of observation of and conjecture about contemporary software. This somewhat follows on from the post I made 6 years ago(!). So let’s begin.

1. No main-thread UI

The UI should always be able to start drawing when necessary. As careful as you may be, it’s practically impossible to write software that will remain perfectly fluid when the UI can be blocked by arbitrary processing. This seems like an obvious one to me, but I suppose the problem is that legacy makes it very difficult to adopt this at a later date. That said, difficult but not impossible. All the major web browsers have adopted this policy, with caveats here and there. The trick is to switch from the idea of ‘painting’ to the idea of ‘assembling’ and then using a compositor to do the painting. Easier said than done, of course: most frameworks include the ability to extend painting in a way that would make it impossible to switch to a different thread without breaking things. But as long as it’s possible to block UI, it will inevitably happen.

2. Contextually-aware compositor

This follows on from the first point; what’s the use of having non-blocking UI if it can’t respond? Input also needs to be handled away from the main thread, and the compositor (or whatever you want to call the thread that is handling painting) needs to have enough context available that the first response to user input doesn’t need to travel to the main thread. Things like hover states, active states, animations, pinch-to-zoom and scrolling all need to be initiated without interaction on the main thread. Of course, main-thread interaction will likely eventually be required to update the view, but that initial response needs to be able to happen without it. This is another seemingly obvious one – how can you guarantee a response rate unless you have a thread dedicated to responding within that time? Most browsers are doing this, but not going far enough in my opinion. Scrolling and zooming are often catered for, but not hover/active states, or initialising animations (note: initialising animations – once they’ve been initialised, they are indeed run on the compositor, usually).
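As a rough sketch of the shape of this (the scene, target and queue objects below are hypothetical, not any particular framework’s API):

    # Sketch: the compositor thread owns enough scene state to respond to input
    # immediately, and only tells the main thread about it afterwards.
    def handle_input(event, scene, main_thread_queue):
        target = scene.hit_test(event.position)      # compositor-side hit test
        if event.kind == "scroll":
            target.scroll_offset += event.delta      # respond on this frame
        elif event.kind == "pointer-move":
            scene.set_hover(target)                  # hover state without the main thread
        main_thread_queue.post(event, target)        # main thread catches up asynchronously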

3. Memory bandwidth budget

This is one of the less obvious ideas and something I’ve really wanted to have a go at implementing, but never had the opportunity. A problem I saw a lot while working on the platform for both Firefox for Android and FirefoxOS is that given the work-load of a web browser (which is not entirely dissimilar to the work-load of any information-heavy UI), it was very easy to saturate memory bandwidth. And once you saturate memory bandwidth, you end up having to block somewhere, and painting gets delayed. We’re assuming UI updates are asynchronous (because of course – otherwise we’re blocking on the main thread). I suggest that it’s worth tracking frame time, and only allowing large asynchronous transfers (e.g. texture upload, scaling, format transforms) to take a certain amount of time. After that time has expired, it should wait on the next frame to be composited before resuming (assuming there is a composite scheduled). If the composited frame was delayed to the point that it skipped a frame compared to the last unladen composite, the amount of time dedicated to transfers should be reduced, or the transfer should be delayed until some arbitrary time (i.e. it should only be considered ok to skip a frame every X ms).
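Something like the following, to sketch the book-keeping (the numbers and the upload callback are purely illustrative):

    import time

    # Sketch: large asynchronous transfers draw from a per-frame time budget,
    # and the budget shrinks whenever the last composite missed a frame.
    class TransferBudget:
        def __init__(self, frame_interval=1/60, max_budget=0.004):
            self.frame_interval = frame_interval
            self.max_budget = max_budget
            self.budget = max_budget   # seconds of transfer work allowed per frame
            self.spent = 0.0

        def composite_finished(self, frame_time):
            # Halve the budget after an overrun frame, recover it slowly otherwise.
            if frame_time > self.frame_interval:
                self.budget = max(0.001, self.budget / 2)
            else:
                self.budget = min(self.max_budget, self.budget * 1.1)
            self.spent = 0.0           # new frame, fresh budget

        def try_transfer(self, upload_chunk):
            # Run the (hypothetical) upload only while there is budget left;
            # otherwise defer it until after the next composite.
            if self.spent >= self.budget:
                return False
            start = time.monotonic()
            upload_chunk()
            self.spent += time.monotonic() - start
            return True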

It’s interesting that you can see something very similar to this happening in early versions of iOS (I don’t know if it still happens or not) – when scrolling long lists with images that load in dynamically, none of the images will load while the list is animating. The user response was paramount, to the point that it was considered more important to present consistent response than it was to present complete UI. This priority, I think, is a lot of the reason the iPhone feels ‘magic’ and Android phones felt like junk up until around 4.0 (where it’s better, but still not as good as iOS).

4. Level-of-detail

This is something that I did get to partially implement while working on Firefox for Android, though I didn’t do such a great job of it, so its current implementation is heavily compromised compared to how I wanted it to work. This is another idea stolen from game development. There will be times, during certain interactions, where processing time will be necessarily limited. Quite often though, during these times, a user’s view of the UI will be compromised in some fashion. It’s important to understand that you don’t always need to present the full-detail view of a UI. In Firefox for Android, this took the form of rendering at half resolution when the user was scrolling fast enough that rendering couldn’t keep up. This let us render more, and faster, giving the impression of a consistent UI even when the hardware wasn’t quite capable of it. I’ve noticed Microsoft doing similar things since Windows 8; notice how the quality of image scaling reduces markedly while scrolling or animations are in progress. This idea is very implementation-specific. What can be dropped and what you want to drop will differ between platforms, form-factors, hardware, etc. Generally though, some things you can consider dropping: sub-pixel anti-aliasing, high-quality image scaling, render resolution, colour depth, animations. You may also want to consider showing partial UI if you know that it will very quickly be updated. The Android web browser during the Honeycomb years did this, and I attempted (with limited success, because it’s hard…) to do this with Firefox for Android many years ago.
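To sketch the sort of decision involved (the thresholds below are made up, and what you can drop will differ per platform):

    # Sketch: pick a cheaper rendering mode when the view is moving fast enough
    # that the extra detail wouldn't be perceived anyway.
    def choose_render_quality(scroll_velocity_px_per_s, missed_last_frame):
        if missed_last_frame or scroll_velocity_px_per_s > 2000:
            # Half-resolution render, upscaled by the compositor; also a good
            # point to skip sub-pixel anti-aliasing and high-quality image scaling.
            return {"scale": 0.5, "subpixel_aa": False, "hq_image_scaling": False}
        return {"scale": 1.0, "subpixel_aa": True, "hq_image_scaling": True}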

Pitfalls

I think it’s easy to read ideas like this and think it boils down to “do everything asynchronously”. Unfortunately, if you take a naïve approach to that, you just end up with something that can be inexplicably slow sometimes and the only way to fix it is via profiling and micro-optimisations. It’s very hard to guarantee a consistent experience if you don’t manage when things happen. Yes, do everything asynchronously, but make sure you do your book-keeping and you manage when it’s done. It’s not only about splitting work up, it’s about making sure it’s done when it’s smart to do so.

You also need to be careful about how you measure these improvements, and to be aware that sometimes results in synthetic tests will even correlate to the opposite of the experience you want. A great example of this, in my opinion, is page-load speed on desktop browsers. All the major desktop browsers concentrate on prioritising the I/O and computation required to get the page to 100%. For heavy desktop sites, however, this means the browser is often very clunky to use while pages are loading (yes, even with out-of-process tabs – see the point about bandwidth above). I highlight this specifically on desktop, because you’re quite likely to not only be browsing much heavier sites that trigger this behaviour, but also to have multiple tabs open. So as soon as you load a couple of heavy sites, your entire browsing experience is compromised. I wouldn’t mind the site taking a little longer to load if it didn’t make the whole browser chug while doing so.

Don’t lose sight of your goals. Don’t compromise. Things might take longer to complete, deadlines might be missed… But polish can’t be overrated. Polish is what people feel and what they remember, and the lack of it can have a devastating effect on someone’s perception. It’s not always conscious or obvious either, even when you’re the developer. Ask yourself “Am I fully satisfied with this?” before marking something as complete. You might still be able to ship if the answer is “No”, but make sure you don’t lose sight of that and make sure it gets the priority it deserves.

One last point I’ll make: I think that to really execute on all of this, it requires buy-in from everyone. Not just engineers, not just engineers and managers, but visual designers, user experience, leadership… Everyone. It’s too easy to do a job that’s good enough, and it’s too much responsibility to put it all on one person’s shoulders. You really need to be on the ball to produce the kind of software that Apple does almost routinely, but as much as they’d say otherwise, it isn’t magic.

Machine Learning Speech Recognition

Keeping up my yearly blogging cadence, it’s about time I wrote to let people know what I’ve been up to for the last year or so at Mozilla. People keeping up would have heard of the sad news regarding the Connected Devices team here. While I’m sad for my colleagues and quite disappointed in how this transition period has been handled as a whole, thankfully this hasn’t adversely affected the Vaani project. We recently moved to the Emerging Technologies team and have refocused on the technical side of things, a side that I think most would agree is far more interesting, and also far more suited to Mozilla and our core competence.

Project DeepSpeech

So, out with Project Vaani, and in with Project DeepSpeech (name will likely change…) – Project DeepSpeech is a machine learning speech-to-text engine based on the Baidu Deep Speech research paper. We use a particular layer configuration and initial parameters to train a neural network to translate from processed audio data to English text. You can see roughly how we’re progressing with that here. We’re aiming for a 10% Word Error Rate (WER) on English speech at the moment.
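For reference, WER is the word-level edit distance (substitutions, deletions and insertions) divided by the number of words in the reference transcript, so 10% is roughly one error per ten words:

    # Word Error Rate via a standard dynamic-programming edit distance over words.
    def word_error_rate(reference, hypothesis):
        ref, hyp = reference.split(), hypothesis.split()
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution (or match)
        return d[len(ref)][len(hyp)] / len(ref)

    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))  # ~0.17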

You may ask, why bother? Google and others provide state-of-the-art speech-to-text in multiple languages, and in many cases you can use it for free. There are multiple problems with existing solutions, however. First and foremost, most are not open-source/free software (at least none that could rival the error rate of Google). Secondly, you cannot use these solutions offline. Third, you cannot use these solutions for free in a commercial product. The reason a viable free software alternative hasn’t arisen is mostly down to the cost and restrictions around training data. This makes the project a great fit for Mozilla as not only can we use some of our resources to overcome those costs, but we can also use the power of our community and our expertise in open source to provide access to training data that can be used openly. We’re tackling this issue from multiple sides, some of which you should start hearing about Real Soon Now™.

The whole team has made contributions to the main code. In particular, I’ve been concentrating on exporting our models and writing clients so that the trained model can be used in a generic fashion. This lets us test and demo the project more easily, and also provides a lower barrier to entry for people who want to try out the project and perhaps make contributions. One of the great advantages of using TensorFlow is how relatively easy it makes it to both understand and change the make-up of the network. On the other hand, one of the great disadvantages of TensorFlow is that it’s an absolute beast to build and integrates very poorly with other open-source software projects. I’ve been trying to overcome this by writing straightforward documentation, and hopefully in the future we’ll be able to distribute binaries and trained models for multiple platforms.

Getting Involved

We’re still at a fairly early stage at the moment, which means there are many ways to get involved if you feel so inclined. The first thing to do, in any case, is to just check out the project and get it working. There are instructions provided in READMEs to get it going, and fairly extensive instructions on the TensorFlow site on installing TensorFlow. It can take a while to install all the dependencies correctly, but at least you only have to do it once! Once you have it installed, there are a number of scripts for training different models. You’ll need a powerful GPU (or several) with CUDA support (think GTX 1080 or Titan X), a lot of disk space and a lot of time to train with the larger datasets. You can, however, limit the number of samples, or use the single-sample dataset (LDC93S1) to test simple code changes or behaviour.

One of the fairly intractable problems about machine learning speech recognition (and machine learning in general) is that you need lots of CPU/GPU time to do training. This becomes a problem when there are so many initial variables to tweak that can have dramatic effects on the outcome. If you have the resources, this is an area that you can very easily help with. What kind of results do you get when you tweak dropout slightly? Or layer sizes? Or distributions? What about when you add or remove layers? We have fairly powerful hardware at our disposal, and we still don’t have conclusive results about the effects of many of the initial variables. Any testing is appreciated! The Deep Speech 2 paper is a great place to start for ideas if you’re already experienced in this field. Note that we already have a work-in-progress branch implementing some of these ideas.

Let’s say you don’t have those resources (and very few do) – what else can you do? Well, you can still test changes on the LDC93S1 dataset, which consists of a single sample. You won’t be able to effectively tweak initial parameters (as unsurprisingly, a dataset of a single sample does not represent the behaviour of a dataset with many thousands of samples), but you will be able to test optimisations. For example, we’re experimenting with model quantisation, which will likely be one of multiple optimisations necessary to make trained models usable on mobile platforms. It doesn’t particularly matter how effective the model is, as long as it produces consistent results before and after quantisation. Any optimisation that can be made to reduce the size or the processor requirement of training and using the model is very valuable. Even small optimisations can save lots of time when you start talking about days’ worth of training.
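The kind of check that matters here is simple (the transcribe calls below are hypothetical stand-ins for whichever inference path is being used):

    # Sketch: the quantised model should decode the single LDC93S1 sample to the
    # same text as the full-precision model, even if that text is imperfect.
    def check_quantisation(model, quantised_model, sample_audio):
        before = model.transcribe(sample_audio)
        after = quantised_model.transcribe(sample_audio)
        assert before == after, "quantisation changed the decoded output"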

Our clients are also in a fairly early state, and this is another place where contribution doesn’t require expensive hardware. We have two clients at the moment: one written in Python that takes advantage of TensorFlow Serving, and a second that uses TensorFlow’s native C++ API. This second client is the beginnings of what we hope to be able to run on embedded hardware, but it’s very early days right now.

And Finally

Imagine a future where state-of-the-art speech-to-text is available, for free (in cost and liberty), on even low-powered devices. It’s already looking like speech is going to be the next frontier of human-computer interaction, and currently it’s a space completely tied up by entities like Google, Amazon, Microsoft and IBM. Putting this power into everyone’s hands could be hugely transformative, and it’s great to be working towards this goal, even in a relatively modest capacity. This is the vision, and I look forward to helping make it a reality.

Open Source Speech Recognition

I’m currently working on the Vaani project at Mozilla, and part of my work on that allows me to do some exploration around the topic of speech recognition and speech assistants. After looking at some of the commercial offerings available, I thought that if we were going to do some kind of add-on API, we’d be best off aping the Amazon Alexa skills JS API. Amazon Echo appears to be doing quite well and people have written a number of skills with their API. There isn’t really any alternative right now, but I actually happen to think their API is quite well thought out and concise, and maps well to the sort of data structures you need to do reliable speech recognition.

So skipping forward a bit, I decided to prototype with Node.js and some existing open source projects to implement an offline version of the Alexa skills JS API. Today it’s gotten to the point where it’s actually usable (for certain values of usable) and I’ve just spent the last 5 minutes asking it to tell me Knock-Knock jokes, so rather than waste any more time on that, I thought I’d write this about it instead. If you want to try it out, check out this repository and run npm install in the usual way. You’ll need pocketsphinx installed for that to succeed (install sphinxbase and pocketsphinx from github), and you’ll need espeak installed and some skills for it to do anything interesting, so check out the Alexa sample skills and sym-link the ‘samples‘ directory as a directory called ‘skills‘ in your ferris checkout directory. After that, just run the included example file with node and talk to it via your default recording device (hint: say ‘launch wise guy‘).

Hopefully someone else finds this useful – I’ll be using this as a base to prototype further voice experiments, and I’ll likely be extending the Alexa API further in non-standard ways. What was quite neat about all this was just how easy it all was. The Alexa API is extremely well documented, Node.js is also extremely well documented and just as easy to use, and there are tons of libraries (of varying quality…) to do what you need to do. The only real stumbling block was pocketsphinx’s lack of documentation (there’s no documentation at all for the Node bindings and the C API documentation is pretty sparse, to say the least), but thankfully other members of my team are much more familiar with this codebase than I am and I could lean on them for support.

I’m reasonably impressed with the state of lightweight open source voice recognition. This is easily good enough to be useful if you can limit the scope of what you need to recognise, and I find the Alexa API is a great way of doing that. I’d be interested to know how close the internal implementation is to how I’ve gone about it if anyone has that insider knowledge.

Web Navigation Transitions

Wow, so it’s been over a year since I last blogged. Lots has happened in that time, but I suppose that’s a subject for another post. I’d like to write a bit about something I’ve been working on for the last week or so. You may have seen Google’s proposal for navigation transitions, and if not, I suggest reading the spec and watching the demonstration. This is something that I’ve thought about for a while previously, but never put into words. After reading Google’s proposal, I fear that it’s quite complex both to implement and to author, so this pushed me both to document my idea, and to implement a proof-of-concept.

I think Google’s proposal is based on Android’s Activity Transitions, and due to Android UI’s very different display model, I don’t think this maps well to the web. Just my opinion though, and I’d be interested in hearing people’s thoughts. What follows is my alternative proposal. If you like, you can just jump straight to a demo, or view the source. Note that the demo currently only works in Gecko-based browsers – this is mostly because I suck, but also because other browsers have slightly inscrutable behaviour when it comes to adding stylesheets to a document. This is likely fixable; patches are most welcome.


Navigation Transitions specification proposal

Abstract

An API is suggested that allows transitions to be performed between page navigations, requiring only CSS. The API is intended to be flexible enough to allow animations on different pages to be performed in synchronisation, and for particular transition states to be selected without it being necessary to interject with JavaScript.

Proposed API

Navigation transitions will be specified within a specialised stylesheet. These stylesheets will be included in the document as new link rel types. Transitions can be specified for entering and exiting the document. When the document is ready to transition, these stylesheets will be applied for the specified duration, after which they will stop applying.

Example syntax:
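(The attribute names here are illustrative.)

    <link rel="transition-exit" duration="0.25s" href="page-exit.css">
    <link rel="transition-enter" duration="0.25s" href="page-enter.css">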

When navigating to a new page, the current page’s ‘transition-exit‘ stylesheet will be referenced, and the new page’s ‘transition-enter‘ stylesheet will be referenced.

When navigation is operating in a backwards direction – by the user pressing the back button in browser chrome, or when initiated from JavaScript via manipulation of the location or history objects – animations will be run in reverse. That is, the current page’s ‘transition-enter’ stylesheet will be referenced and its animations run in reverse, and the old page’s ‘transition-exit’ stylesheet will be referenced, with its animations also run in reverse.

[Update]

Anne van Kesteren suggests that forcing this to be a separate stylesheet and putting the duration information in the tag is not desirable, and that it would be nicer to expose this as a media query, with the duration information available in an @-rule. Something like this:
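    /* Rough sketch – exact naming is illustrative. */
    @media (transition) {
      @transition-duration: 0.5s;

      /* Rules that only apply while the transition is running. */
    }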

I think this would indeed be nicer, though I think the exact naming might need some work.

Transitioning

When a navigation is initiated, the old page will stay at its current position and the new page will be overlaid over the old page, but hidden. Once the new page has finished loading it will be unhidden, the old page’s ‘transition-exit‘ stylesheet will be applied and the new page’s ‘transition-enter’ stylesheet will be applied, for the specified durations of each stylesheet.

When navigating backwards, the CSS animations timeline will be reversed. This will have the effect of modifying the meaning of animation-direction like so:
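    normal            → behaves as reverse
    reverse           → behaves as normal
    alternate         → behaves as alternate-reverse
    alternate-reverse → behaves as alternate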

and this will also alter the start time of the animation, depending on the declared total duration of the transition. For example, if a navigation stylesheet is declared to last 0.5s and an animation has a duration of 0.25s, when navigating backwards, that animation will effectively have an animation-delay of 0.25s and run in reverse. Similarly, if it already had an animation-delay of 0.1s, the animation-delay going backwards would become 0.15s, to reflect the time when the animation would have ended.
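In other words, the reversed delay is the total transition duration minus the point at which the animation would have finished:

    reversed delay = total duration - (animation-delay + animation-duration)
                   = 0.5s - (0.1s + 0.25s) = 0.15s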

Layer ordering will also be reversed when navigating backwards, that is, the page being navigated from will appear on top of the page being navigated backwards to.

Signals

When a transition starts, a ‘navigation-transition-start’ NavigationTransitionEvent will be fired on the destination page. When this event is fired, the document will have had the applicable stylesheet applied and it will be visible, but will not yet have been painted on the screen since the stylesheet was applied. When the navigation transition duration is met, a ‘navigation-transition-end’ event will be fired on the destination page. These signals can be used, amongst other things, to tidy up state and to initialise state. They can also be used to modify the DOM before the transition begins, allowing for customising the transition based on request data.
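For example, a destination page might use these events like so (a sketch, assuming nothing beyond the event names above):

    document.addEventListener('navigation-transition-start', function (e) {
      // The transition stylesheet is applied but nothing has been painted yet:
      // a good place to initialise state or tweak the DOM for this transition.
    });

    document.addEventListener('navigation-transition-end', function (e) {
      // Transition finished: tidy up any state that only existed for the animation.
    });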

JavaScript execution could potentially cause a navigation transition to run indefinitely; it is left to the user agent’s general-purpose JavaScript hang detection to mitigate this circumstance.

Considerations and limitations

Navigation transitions will not be applied if the new page does not finish loading within 1.5 seconds of its first paint. This can be mitigated by pre-loading documents, or by the use of service workers.

Stylesheet application duration will be timed from the first render after the stylesheets are applied. This should either synchronise exactly with CSS animation/transition timing, or it should be longer, but it should never be shorter.

Authors should be aware that using transitions will temporarily increase the memory footprint of their application during transitions. This can be mitigated by clear separation of UI and data, and/or by using JavaScript to manipulate the document and state when navigating to avoid keeping unused resources alive.

Navigation transitions will only be applied if both the navigating document has an exit transition and the target document has an enter transition. Similarly, when navigating backwards, the navigating document must have an enter transition and the target document must have an exit transition. Both documents must be on the same origin, or transitions will not apply. The exception to these rules is the first document load of the navigator. In this case, the enter transition will apply if all prior considerations are met.

Default transitions

It is possible for the user agent to specify default transitions, so that navigation within a particular origin will always include navigation transitions unless they are explicitly disabled by that origin. This can be done by specifying navigation transition stylesheets with no href attribute, or that have an empty href attribute.

Note that specifying default transitions in all situations may not be desirable due to the differing loading characteristics of pages on the web at large.

It is suggested that default transition stylesheets may be specified by extending the iframe element with custom ‘default-transition-enter‘ and ‘default-transition-exit‘ attributes.

Examples

Simple slide between two pages:
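(The markup and attribute names in the files below are illustrative.)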

[page-1.html]
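    <!-- Illustrative: the exiting page declares its exit transition stylesheet. -->
    <link rel="transition-exit" duration="0.25s" href="page-1-exit.css">
    <div id="content">
      <a href="page-2.html">Go to page 2</a>
    </div>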

[page-1-exit.css]
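    /* Slide the outgoing page's content off to the left for the duration. */
    #content {
      animation-name: slide-out;
      animation-duration: 0.25s;
    }

    @keyframes slide-out {
      from { transform: translateX(0); }
      to   { transform: translateX(-100%); }
    }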

[page-2.html]
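    <!-- Illustrative: the entering page declares its enter transition stylesheet. -->
    <link rel="transition-enter" duration="0.25s" href="page-2-enter.css">
    <div id="content">
      <a href="page-1.html">Back to page 1</a>
    </div>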

[page-2-enter.css]
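    /* Slide the incoming page's content in from the right for the same duration. */
    #content {
      animation-name: slide-in;
      animation-duration: 0.25s;
    }

    @keyframes slide-in {
      from { transform: translateX(100%); }
      to   { transform: translateX(0); }
    }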


I believe that this proposal is easier to understand and use for simpler transitions than Google’s; however, it becomes harder to express animations where one element is transitioning to a new position/size in a new page, and it’s also impossible to interleave contents between the two pages (as the pages will always draw separately, in the predefined order). I don’t believe this last limitation is a big issue, however, and I don’t think the cognitive load required to craft such a transition is considerably higher. In fact, you can see it demonstrated by visiting this link in a Gecko-based browser (recommended viewing in responsive design mode, Ctrl+Shift+m).

I would love to hear people’s thoughts on this. Am I actually just totally wrong, and Google’s proposal is superior? Are there huge limitations in this proposal that I’ve not considered? Are there security implications I’ve not considered? It’s highly likely that parts of all of these are true and I’d love to hear why. You can view the source for the examples in your browser’s developer tools, but if you’d like a way to check it out more easily and suggest changes, you can also view the git source repository.