Ewen Cheslack-Postava me@ewencp.org @ewencp

September 16 2013
Tags: email | web development | programming | tools

Recently I’ve been adding some features to StraightUp that require email. And of course we want those emails to look attractive, so we need to send multi-part text + HTML messages. This doesn’t sound so bad until you look at the sorry state of HTML support in email clients — expect to code like it’s 1999.

It actually isn’t that bad if you keep your emails simple, and there are quite a few posts out there covering the nitty-gritty of writing HTML emails (and a lot more that are useless, superficial overviews). But when I went to start implementing this, I found none of them actually explain how to implement and test your emails. They are just standard lists of what you can’t do. Even the very nice boilerplate templates, such as HTML Email Boilerplate and Emailology, don’t explain how you should implement a pipeline on top of their tools and recommendations.

If you’re sending one-off marketing emails, it might be fine to manually edit and test each email, writing all your styling inline. If you’re doing anything more complex, generating dynamic emails, or repeatedly creating lots of emails, you don’t actually want to write your email templates with a bunch of inline styles. This is one of the reasons services like MailChimp are handy. They do a bunch of this for you automatically and give you good starting templates. But if their service doesn’t fit your needs, what do you do?

To make your life easier, you need a few tools and a basic pipeline in place to let you test and iterate quickly. This is the basic pipeline I came up with:

  1. Start with a base HTML email template. Customize it to work with your template engine. Keep CSS styles in <style> tags.
  2. Run your template engine to generate each email, resulting in a raw email that will likely look horrible in a lot of clients since they’ll strip the styling.
  3. Run it through a CSS inliner. There are already a bunch of libraries out there to move your CSS styles into style attributes on the tags that reference them. For some reason, those other articles don’t let you know that these tools exist, let alone point you to them.
  4. Have an automatic mechanism for getting generated test emails out of your system and viewable in a browser. This can be as simple as logging them to files and having your web browser serve them up when you’re in testing mode.
  5. Run a script to filter the output. Ideally this would be identical to the filter executed by a web-based email client. This step could be done before or after you serve the file to the browser.

Not particularly complex, but I figure that documenting this will hopefully save others the extra couple of days of research on best practices and tools to make it all work in a larger application. And especially for the last step, there don’t seem to be many good tools to do this.

If you’ve already got the gist from this list, just jump to the section on CSS inliners for pointers to a few good ones, and to the last section for my half-hearted attempt at a bookmarklet that tries to give you a semi-realistic preview of what your email will end up looking like in the browser.

For me, this entire process is tied to our Django-based web application. Naturally, this means we selected some Python libraries, chose Django-specific solutions where it made life easier, and we already have infrastructure for templating and serving web pages ready. However I’ll try to describe most things generically (hopefully not even assuming you’re working on a web app) so the general tools/libraries will still be Google-able, then suggest specific solutions for Python/Django apps.

Steps 1 & 2: Base Templates

Don’t roll your own. Use templates like HTML Email Boilerplate, Emailology, or MailChimp’s Blueprints.1 They resolve so many issues with random email clients, including ones that I can barely believe people still use, that it would be crazy not to. They’ll also help you understand why you need all these settings. But remember to remove all the comments. You’ll avoid spam issues and use significantly less bandwidth.

Once you select an HTML template, you need to translate it into your system’s templating language (sorry, template is overloaded here…). I’m not going to suggest any since that’s a huge topic, especially if you aren’t already using one. We took one of MailChimp’s templates and converted it to use Django template tags rather than MailChimp’s merge tags. We also gave it sections so most of our emails could be written by extending the template to fill in just a small number of sections (Django’s version of template inheritance). Essentially all we do for a new email is specify the title text and body HTML.
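
For the Django case, the glue for step 2 ends up being small. Here’s a sketch, with hypothetical template names and context (the step 3 inlining happens separately):

from django.core.mail import EmailMultiAlternatives
from django.template.loader import render_to_string

context = {'title': 'Welcome to StraightUp'}
text_body = render_to_string('email/welcome.txt', context)
html_body = render_to_string('email/welcome.html', context)

msg = EmailMultiAlternatives(
    subject=context['title'],
    body=text_body,                    # plain-text part
    from_email='hello@example.com',
    to=['user@example.com'],
)
msg.attach_alternative(html_body, 'text/html')  # HTML part
msg.send()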

Step 3: Inlining

Email clients filter the HTML you feed them, often very aggressively. From what I’ve read and seen so far, GMail seems to be the most aggressive. And they have good reason. HTML emails need to be filtered to make sure they can’t do anything evil — run malicious scripts, modify or overlay elements in the UI, and probably other evil things I haven’t thought of. It’s especially important for web-based clients since they don’t have a good mechanism for isolating themselves from the email being displayed.

But it’s a pain if you’re the one authoring the HTML. You need to know the subset of functionality you’re allowed to use. The biggest issue, and the reason for this post, is that you need to provide all your styling inline as style attributes because <style> tags are removed entirely. This is mentioned with all the templates. However, they all put the style information in a style tag.

Eventually I figured out why — no sane person does the styling inline because it’s too repetitive and becomes unwieldy to modify. Exactly the reason stylesheets were created in the first place. But they never mention that the smart way of handling this is to use their template, then run an inliner which pulls the style for each tag and puts it as the style attribute of the tag, allowing you to completely remove the <style> tag.2 If nothing else, they really should put a warning in giant red letters somewhere near the beginning of their documentation saying that you ultimately need to inline them and that the templates are provided this way for ease of reading and editability.

Frustratingly, if you search for tools to do this and include ‘email’ as a search term, you’ll probably turn up a lot of forms that take a complete email and process it for you. That’s not very useful if you’re generating dynamic emails, especially if the dynamic parts contain elements that need styling.

But there are plenty of good libraries for doing this:

  • Generally, search for ‘CSS inliner’ and your language.
  • For Ruby: Premailer seems to be the right choice (code here, and available as a gem).
  • For Python: There’s a premailer port. inlinestyler also seems to be a good choice.

We’re using Django and its template language, so we went with an inliner that integrated easily and had a template tag that took care of everything for us: Django Inlinecss (which was forked from roverdotcom’s version).
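
If a template-tag integration doesn’t fit your setup, the library usage itself is tiny. A minimal sketch with the Python premailer port (the input is whatever HTML your template engine produced):

from premailer import transform

html = """<html>
<head><style>p.lead { color: #336699; }</style></head>
<body><p class="lead">Hello!</p></body>
</html>"""

# transform() moves the matching CSS rules onto style="" attributes.
print(transform(html))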

Step 4: Getting the Emails

Of course you’ll want to check the results before putting the new email into production. You might think that just emailing yourself is good enough. It might be, and that’s how I started. But it quickly becomes annoying: run the test code, wait for the email to arrive, then dig through GMail’s version in the DOM to figure out what’s going wrong and what’s been filtered. If you can get your emails right on the first try, then you can skip these last two steps. If you don’t, you probably want to read on…

So what’s a better way to test? Since web-based email clients have the most aggressive filters, previewing the email in the browser is probably the best idea. How you get them to the browser will vary depending on your setup. I think the easiest thing to do is log each email to a separate file and either make them accessible as a local file or via a web server (e.g. if you develop on a headless VM).

We already have a full web stack on our development VMs since we need it to test our app, so we just modified that setup. First we changed the email backend. Previously we were using the console, which was sufficient for text-only emails. A two-line config change converted us to the filesystem backend, which just spits each email out to a separate file. We also configured our development machines to serve the files at a special URL to make them easily browsable. Of course our unit tests can also access the emails so we can write proper tests of any email functionality we create.
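
For reference, those two lines are just Django’s filesystem email backend (the path is whatever suits your setup):

# settings.py (development only)
EMAIL_BACKEND = 'django.core.mail.backends.filebased.EmailBackend'
EMAIL_FILE_PATH = '/tmp/app-emails'  # each email is written here as its own file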

As a side note, I considered a different option: an additional, development-only middleware layer that, when a special query parameter was included, would emit the email back to the browser instead of the normal response. This works great for some other debugging tools, like adding CPU profiling for a request. But it broke down quickly for me. The first thing I wanted to use it for generated multiple emails, and it wasn’t obvious how to handle that well. Additionally, we’re going to generate emails in other processes (in a task queue), where we can’t hijack the response since there isn’t a response. The file-based approach handles all these cases and maintains a bit of history, which can be nice for debugging. It does, however, mean a bit more work in finding the email when you’re doing manual testing.

Step 5: Realistic Previews

Finally, even if you output only the HTML (versus the whole email, which will obviously not render as a regular HTML page), viewing it in a fully functional browser doesn’t really test anything: the browser will happily render markup and styling that an email client would filter out.

There are commercial products you could feed the output to, and these could be very helpful if you need to truly test the emails across a lot of email clients. We’re not particularly worried about ancient clients, so we mostly just want a sanity check (at least for now).

So what we really need is something that transforms the HTML in the same way that GMail or any other web-based email client would. After quite a bit of Googling, I didn’t turn anything up that performs this specific transformation. There are a lot of HTML sanitizers, designed to filter user-submitted HTML that will be included on a site. However, most of these seem to be even more restrictive than email clients (e.g. completely stripping heading tags). Others are huge chunks of code with dependencies that looked like they were going to be tough to build on (e.g. JsHtmlSanitizer from google-caja and htmlsanitizer.js from Google’s Closure library) and may not provide an easy way to customize their rules to match the filtering performed by email clients. I wanted something that would work in the browser (no additional backend support required) and wouldn’t take me too long to build. So I decided to hack something up that would do it for me well enough for my testing.

This Gist is the result. The approach is pretty simple. First, it tries to check if we’re actually looking at a raw, multi-part email and extract just the HTML part, overwriting the document if it finds it. This lets you use the raw email output dumped in the earlier step (which also makes it easy to inspect the plain-text version of the email).

Next, it tries to filter the page so it matches what would be displayed in an email client. It does so by inspecting and modifying the DOM, a simpler approach than requiring a full HTML parser. The filtering is by no means complete, but it takes care of a few key things. First, it removes the head, style, and link tags that email clients always filter out. Then, it tries to remove attributes it can’t trust. Creating an exhaustive list of good/bad attributes (which vary from tag to tag) would take too long, so I used a combined blacklist/whitelist approach. By default, all attributes are permitted and we explicitly specify tags where we want to apply filtering. These filters, however, are whitelists: for tags that have filters, everything but those attributes is removed. There are a few exceptions: some attributes we always permit, e.g. style, and some we always filter, e.g. id and class. The final step should be to filter CSS rules in the style attributes. Right now I haven’t implemented this step because we haven’t needed it yet. We started from a good template and avoided adding extra styles that might require this filtering.

If you want to use the script, try it as a bookmarklet:

Preview HTML Email

You can click it here to test it on this page (which isn’t supposed to be HTML email compatible, and so will look horrible after you do so). Note that the bookmarklet uses the code directly from the Gist, so if you want to use it regularly you should self-host it.

The small set of rules I implemented are enough for testing our emails based on the template we used. But ideally the CSS style filtering would be based on a complete reference table like this. I don’t know that there’s a similar reference for tags and attributes, but I’m sure the small set of rules currently provided could be upgraded (at worst with a set of valid tags provided by the HTML specs, which would at least be a better baseline). If anyone wants to contribute changes, I’d be happy to update the script. Alternatively, I’d love to see an HTML email-specific sanitizer (or set of sanitizer rules for the configurable implementations).

Conclusion

HTML emails don’t have to be particularly hard. But if you’re just getting started with them and trying to get a minimal set of functionality in place, it can be frustrating to find so little guidance about the actual practice of generating them. Hopefully this guide improves that situation, and I hope others will build on and extend the simple previewing bookmarklet. With the right guide and tools, I think we can get people up and running in an hour or two rather than the couple of days it took me to sort out the details.

Thanks

Thanks to Faolan Cheslack-Postava for comments and editing.

  1. Kudos to all of these folks for extremely valuable projects, and to MailChimp specifically for trying to improve the quality of emails even if it’s potentially at odds with their business.

  2. One thing I’ve yet to figure out is why the email clients don’t do this for you. They’re already dealing with potentially malicious input and filtering it. They have to parse and modify the HTML. Why not add that extra step of CSS inlining before the filtering? It would make email authors’ lives more sane, reduce the size of emails, and seems like it has little overhead when added to the steps they must already take. In fact, it could make some of them more efficient by removing the original CSS rule before it’s scattered into 72 different tags.

A Survey of Iterator (or Generator) Patterns in golang
September 07 2013
Tags: golang | go | programming | programming languages

(Updated: 9/8/2013 Added “Stateful Iterators” thanks to ryszard.)

I’ve been using Go to implement some performance-sensitive services recently and, so far, it’s been very nice. When writing Go code, it often feels like a (slightly verbose) scripting language, but with nice type-safety and good performance.

But I’ve realized that this feeling is really only true on a line-to-line basis. What I’ve noticed recently is that a lot of my code ends up being “collection munging”: creating, modifying, and transforming collections. Python excels at this – the built-in core functional working set (map, reduce, and friends) combined with iterators/generators and list/dict comprehensions allows for some very concise, readable code. And often it even has decent performance.
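
For example, the sort of Python one-liner that’s hard to give up (items is a stand-in for any collection):

# Filter, transform, and aggregate in one pass.
total = sum(item.value for item in items if item.active)
# Or reshape a list into a dict.
by_id = {item.id: item for item in items}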

But it isn’t always good enough performance, and that was a reason for moving to Go (or any higher performance, more systems-y language). But I found myself even missing C++98 because collections are actually pretty nice there. Iterators and templates allow code nearly as concise as the Python version, even with all those type declarations.

Unfortunately generics aren’t coming to Go soon.1 But iterators are just interfaces. They abstract away the underlying container to make it look like a sequence of values of some type with a standard interface (e.g. Next(), Prev(), etc). This allows you to write at least partially generic code, e.g. a sum function that assumes an iterator producing a specific type. If each container can generate an IntIterator, we can write one Sum() for all of them. Without generics we still need SumInt() vs. SumFloat(), but at least we don’t need SumIntSlice(), SumIntDictValues(), SumFloatSlice(), and SumFloatDictValues().

Once I realized that this is what I was missing, I wanted to know what the canonical iterator pattern in Go is: what’s the interface and the implementation? I ended up in this thread on the golang-nuts mailing list which covers the basic options for a forward-only iterator (which admittedly simplifies the problem, mainly focusing on implementation rather than the appropriate interface, but this is also the most common type of iterator we need).

But that discussion only covered the basics of the different options and some pitfalls; it didn’t address performance issues, and it focused primarily on the iterator’s implementation without much consideration for the caller’s code, which I think is the more important part. The whole idea is to provide a good abstraction to the caller and hopefully not compromise performance. So which of the iterator patterns gives the best balance? Or, a more realistic question – when I’m writing code, which version should I use in each case: a) I just want to get the code working b) I want to extract maximum performance or c) I want a balance between readability and performance?

Iterator Patterns

To start, let’s review both the implementer and calling code for each of the options, hopefully giving a clearer summary than that original mailing list thread. None of the patterns explicitly define an iterator interface. They either use a channel, which is a very natural stand-in for a forward-only iterator, or a function which produces values for the caller. The two key aspects I want to look at are a) how does the iterator implementation work and b) compared to ideal caller code that looks something like

for item := range Foo.Iterator() {
    do_ops_with(item)
}

what does the calling code look like? What, if any, hoops do they need to jump through to get things working with that pattern, and how does it affect readability/speed of coding?

In all the examples, I’ll use iterating over a []int to make the code as simple as possible. To show what the caller code looks like, we’ll implement sum() for each type of iterator.

Pattern 1: Iterator Function with Callback

The first approach looks the least like an iterator. Instead, we create a function with the body of the loop and the “iterator” gives a callback for each element:

func IntCallbackIterator(cb func(int)) {
    for _, val := range int_data {
        cb(val)
    }
}

This is clearly very easy to implement. The calling code needs to define the callback and pass it in, using closures to track data used across iterations:

var sum int = 0
cb := func(val int) {
    sum += val
}
IntCallbackIterator(cb)

I’ve written it out in steps to make it clear, but we can write it more “loop-like” by writing the function directly in the call:

var sum int = 0
IntCallbackIterator(func(val int) {
    sum += val
})

Because Go has anonymous functions with closures, this doesn’t look too bad. It feels a lot like writing JavaScript. To me, one of the major drawbacks is that we’ve lost type declaration elision: we have to explicitly define the full type of the function, including the type you’re iterating over.

You’ll notice there’s no return value on the callback. We can add control, e.g. break, by allowing the function to return a value. Unfortunately, that also then means it always must return a value, or you need two iterators if you want to have concise code when you don’t need break.

Pattern 2: Closures

The second pattern also uses anonymous functions with closures, but in the iterator implementation itself. The basic idea is that the returned iterator is a generator function that you call repeatedly:

func IntClosureIterator() (func() (int, bool), bool) {
    var idx int = 0
    var data_len = len(int_data)
    return func() (int, bool) {
        prev_idx := idx
        idx++
        return int_data[prev_idx], (idx < data_len)
    }, (idx < data_len)
}

The return value has two components: the iterated value followed by a bool indicating if there is another value, i.e. has_next. The initial call to get the iterator also returns this has_next so you can tell if there are any elements.

I’d hope there’s a better way of implementing that, but this is what I was able to come up with. There are two reasons it’s ugly. First, you can’t use a regular for loop because the iteration needs to happen during successive calls to the returned function. Second, you need to also indicate whether there is a next value so the caller can terminate the loop. We could also do this by returning a *int to indicate the validity of the current value, but then we need to dereference in the caller.

The caller for this style doesn’t look too pretty either:

var sum, val int = 0, 0
for it, has_next := IntClosureIterator(); has_next; {
    val, has_next = it()
    sum += val
}

Again, there might be a better way here. It looks similar to iterator interfaces in, e.g., Java. One major annoyance is that the value can’t be declared inside the for statement, so it has to be declared earlier and assigned together with has_next in the loop body. The entire loop is muddied up with extra control info – has_next needs to be stored and checked. This is downright ugly compared to a nice range for loop.

Finally, some people like to wrap this type of iterator function into an interface, i.e. they return an object with an interface like this:

type IntIterator interface {
    Next() (int,bool)
}

which directly wraps the function, similar to the style common in other languages like Java. I don’t think it much matters, but some people find that the it.Next() call helps clarify what the loop is doing.

Pattern 3: Channels

The final pattern uses channels and fits very nicely into Go. The basic idea is to fire off a goroutine which generates values and pushes them into a channel:

func IntChannelIterator() <-chan int {
    ch := make(chan int)
    go func() {
        for _, val := range int_data {
            ch <- val
        }
        close(ch) // Remember to close or the loop never ends!
    }()
    return ch
}

Control is returned to the caller who can then just read from the channel until it’s exhausted, which range does automatically:

var sum int = 0
for val := range IntChannelIterator() {
    sum += val
}

This is very Go-ish (or whatever the equivalent of “Pythonic” is for golang…). Because channels have direct support in the language through range, the calling code is nice, concise, and clear. And in theory shouldn’t be that expensive because goroutines are cheap (they aren’t “real” threads).

This is as good as it’s going to get in terms of simplicity from the caller’s side, barring the introduction of something like list comprehensions.

However, there is one issue with this approach. If we do something like this:

has_zero := false
for val := range IntChannelIterator() {
    if val == 0 {
        has_zero = true
        break
    }
}

we leave uncollectible garbage. The goroutine continues to “run” in the background, but it blocks as soon as it tries to send the next value after the calling loop has exited, since nothing will ever read from the channel again. I haven’t yet thought much about fixing this problem. I suspect using a bidirectional channel (or two channels) as a request/response mechanism would work, but you’d still need to make sure either side close()d the channel at the right time. Maybe a defer statement could handle that nicely even with break statements. Handling all the cases properly might not be trivial and probably requires the calling code to be more complicated than the simple for-range loop.

Buffered Channels

Although goroutines are cheaper than threads, context switches still cost something. If you’re only doing a small amount of work to move to the next item (as the iterator) and a small amount of work consuming the item (as the caller), the cost of dealing with the channel and switching between goroutines could be expensive.

Conveniently, Go provides buffered channels. They give the channel room for multiple values instead of forcing a rendezvous on every send. This means that the Go runtime can choose to fill the entire buffer before switching to a different goroutine, avoiding the overhead of switching so frequently. The only implementation change is choosing the buffer size and specifying it in the make() call for the channel. Instead of

ch := make(chan int)

we can use

ch := make(chan int, 10)

The trick is deciding how large that buffer should be. Unfortunately, this is completely dependent on the context.

Pattern 4: Stateful Iterators

A common pattern in other languages, as mentioned in the section about closure-based iterators, is to use an interface with some variant of HasNext() and Next() functions. ryszard contributed this implementation, which he refers to as stateful because it holds the iteration state in the iterator struct itself (as opposed to the implicit storage in the closure-based implementation). The iterator implementation:

type intStatefulIterator struct {
    current int
    data    []int
}

func (it *intStatefulIterator) Value() int {
    return it.data[it.current]
}
func (it *intStatefulIterator) Next() bool {
    it.current++
    if it.current >= len(it.data) {
        return false
    }
    return true
}
func NewIntStatefulIterator(data []int) *intStatefulIterator {
    return &intStatefulIterator{data: data, current: -1}
}

And here’s how it’s used:

var sum int = 0
it := NewIntStatefulIterator(int_data)
for it.Next() {
    sum += it.Value()
}

In Go, just adding an interface declaration lets us specify a generic interface for int iterators:

type StatefulIterator interface {
    Value() int
    Next() bool
}

which would allow us to write generic functions for any data structure that can generate an iterator with that interface.

I didn’t include this approach initially (beyond noting you could get it easily from the closure based approach) because this was part of what I set out to avoid. It may seem superficial, but I would much prefer an iteration technique that can make the actual iterator not appear in the calling code if possible – I just find that more readable. This is a place where I think C++ seriously fails. Not only do iterators show up, but often the meaning of expressions is very unclear. For example, any time you use a std::map, your code ends up littered with it->first and it->second. On the other hand, if you want more than a simple forward-only iterator, this is the right way to handle it since, e.g., bidirectional iteration can’t clearly be specified via a single function.

Finally, the other reason I don’t like some of the approaches that require additional variables is that I think it’s important to scope properly where possible. I really dislike holding on to the iterator outside the for loop it’s used in. I would probably change the calling code to this:

var sum int = 0
for it := NewIntStatefulIterator(int_data); it.Next(); {
    sum += it.Value()
}

although I think the for loop with missing post statement is awkward.

Evaluation

I wrote up versions of each of these with simple benchmarks, courtesy of the lovely Go testing framework. The code is available here.

I wrote two versions of each. One directly iterates over a []int, the other introduces a struct and uses a []*struct which it needs to extract the data from. The goal was to hopefully avoid benefits from locality. The results don’t vary much, so either we weren’t benefiting anyway or the allocated data is still laid out nicely. In either case, having both makes me feel more confident these are somewhat reflective of real world performance.

Here are the results, slightly modified for readability:

Callback Ints                500          3836683 ns/op
Callback Struct              500          4104532 ns/op

Closure Ints                 500          5391391 ns/op
Closure Struct               500          6420672 ns/op

Channel Ints                  10        207776217 ns/op
Channel Struct                10        212255270 ns/op

Buffered Channel Ints         20         96576062 ns/op
Buffered Channel Struct       20         99274603 ns/op

Stateful Ints               1000          2558344 ns/op
Stateful Struct              500          4706915 ns/op
Stateful Interface Ints      500          7887930 ns/op
Stateful Interface Structs   200          9286961 ns/op

The results aren’t too surprising, but the scale of the performance difference is interesting. The simple but noisier callback based approach wins on performance. The closure-based almost-an-iterator approach isn’t too slow in comparison, a bit under 50% performance hit. Using this approach instead is probably only an issue if it’s in an inner loop of performance critical code. Finally, the most natural channel based implementation is… slow. By a factor of nearly 50x. Buffering does help though, reducing that to a factor of 25x.

I’d probably choose the channel approach for most code, then use callbacks for performance sensitive code. I think the callback code is cleaner than the closure-based iterator and it only gets too ugly when you have too many nested “loops”.

Updated: The stateful iterator results are new. Not surprisingly, since all the state management is done manually, they are on par with the basic callback version. The two are essentially equivalent when you use the struct directly – the iterator struct takes the place of the closure, and both require one function call per iteration. The interface version is more expensive because it requires extra runtime checks and indirection to find the correct function to call.

Apparently, based on some examples from the standard packages, this style (or a version using a pointer return value to indicate both availability and value at once) is the “Gothic” way of implementing this. I went looking for other patterns because I assumed such a verbose approach wouldn’t be preferred.

Notes

Apparently channel performance has improved significantly in Go over the past couple of years, but clearly the overhead is still too much. My guess would be that the overhead of the goroutine scheduler is the source of the cost, although with only 2 goroutines it’s not clear why it’s as high as it is.

To me, the biggest drawback to all of these options is that they don’t all provide the same type, so they aren’t interchangeable. We can’t modify the iterator implementation independently of the calling code to get better performance.

The channel approach fits perfectly with the language because range directly supports iterating over channels. It would be great to be able to unify these all with a channel interface, but there’s no way to adapt the more efficient approaches without involving a goroutine. I’d guess that the overhead is with scheduling, not with the channel itself (which is effectively a 1-element buffer). It looks like the best option for unifying them for now is to write boilerplate wrapper code, i.e. define something like this:

type IntIterator interface {
    Next() (int,bool)
}

and expose a function that returns that interface. Unfortunately, this generator-like function also produces the ugliest, hardest to read (in my opinion) caller code. So for now, I just use the channel interface first for ease of implementation and assume I’ll have to pay the refactoring cost to uglify the calling code if I need better performance.

What I’d really like to see is some sort of range adapter: provide an interface in the builtin package that lets you use any iterable in a range expression. That should be the unifying interface and allow you to use any of these interchangeably and get really clean calling code. It could just be the usual HasNext()/Next() interface seen in other languages. I assume the issue with this is that there is no way to specify those generically.

Thanks

Thanks to Faolan Cheslack-Postava for editing this post. Thanks to ryszard for the Stateful Iterator implementations.

  1. I understand why generics have not appeared in Go yet. I’m glad they are taking their time to make sure they get it right since generics in many languages have some seriously ugly warts. On the other hand, I clearly disagree with the Go designers on how important they are. I think they are an obviously big win for the language, and the fact that I have a bunch of FooInt(), FooFloat(), FooString() functions is a testament to that.

Or, Why You Need to Understand Transaction Isolation Levels
August 25 2013
Tags: django | mysql | python | programming

When I went to deploy a Django + MySQL project a while ago, transitioning off of the Django test server started producing confusing results. The most obvious place I saw it was with logins. I would log in, and the returned page would appear logged in. But then on the next request it would appear as though I was not logged in. For a while I thought I had screwed something up when customizing the login flow for my site, but couldn’t find any issues.

It turned out to be a problem with the default settings of MySQL (specifically isolation level) combined with using multiple persistent database connections and Django’s (default) auto-commit mode. Surprisingly, this issue is never mentioned anywhere in the Django documentation, despite there being a section discussing isolation levels in PostgreSQL, where the defaults actually work with Django.

At the time, I found a post that explained the problem and solution very clearly. But since it’s no longer available, I’m going to write it up here.

Isolation Levels

The source of the problem is that by default, MySQL uses a higher isolation level than Django expects. What does isolation level mean? It indicates how isolated, or protected, transactions are from other, concurrent transactions. Those other transactions might make changes which affect queries in your transaction. The isolation level controls when these changes become visible to your transaction (or connection).

MySQL defaults to REPEATABLE-READ, which means that once a transaction reads some data, the same values will be returned (or used in evaluating a query) for the rest of the transaction. For example, if we SELECT a user’s entry from a session table, until the transaction is closed we will continue to read the same data. Even if another process connects and executes a transaction that changes the data, we’ll still see the same values. We should only see new values if we wrote them ourselves.

Alternatively, we can weaken the isolation level to READ-COMMITTED, which allows changes from other transactions to become visible mid-transaction. This is weaker than REPEATABLE-READ and could be inconvenient. For example, if you’re running a blog and SELECT the set of stories in a time range, then SELECT the set of topics in the same time range via a join of the entries and topics tables, you can see inconsistent results under READ-COMMITTED. For example, another transaction could have changed all entries’ topics to “none” and the second SELECT would see those changes, resulting in an empty list of topics.

(In fact, even REPEATABLE-READ is weaker than the ideal SERIALIZABLE setting. For example, REPEATABLE-READ will fail in this case: UPDATE all blog entries within a time range and then SELECT them to list the titles of the entries you updated. If another transaction adds a blog entry in that date range, the set of entries modified and the set listed can differ under REPEATABLE-READ. They won’t under SERIALIZABLE.)

The tradeoff between isolation levels is performance vs. simplicity of code using the database. SERIALIZABLE is what most people think of when they think of a traditional ACID database. But it’s not the default for most databases because it’s expensive – it requires either more/longer locking or multiple version concurrency control (i.e. copying data). SERIALIZABLE makes it very easy to reason about the logic in your application, but often has poor performance.

Django and Isolation Levels

So how does this relate to Django? Django’s code assumes you are using a particular isolation level: READ-COMMITTED. It assumes, for example, that if it SELECTs some data and then repeats the query, any matching data committed in the meantime will show up. get_or_create() relies on this property.
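
To make the mismatch concrete, here’s a sketch of what a single long-lived transaction might observe under each isolation level (illustrative only, using the stock User model):

from django.contrib.auth.models import User

# Inside one long-lived transaction in process A:
User.objects.filter(username='ada').exists()   # -> False
# ... process B commits a new user 'ada' in its own transaction ...
User.objects.filter(username='ada').exists()
# REPEATABLE-READ: still False (we keep reading our original snapshot)
# READ-COMMITTED:  True (the committed insert becomes visible)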

Just generally, the isolation level is critical to writing correct code, even within a single transaction. The examples above demonstrated why – depending on the isolation level, different orders of events can cause different types of errors.

However, since MySQL is in REPEATABLE-READ by default, Django misbehaves (at least in Django < 1.5 with pseudo-auto-commit mode; see the final notes section for details). Django puts all queries into an explicit transaction and reuses that transaction for multiple queries. Databases generally implement auto-commit, where any statement not in a transaction is treated as its own transaction. But Django, via Python’s DB-API, takes a different approach by default: transactions are reused (and kept open) until a write command (i.e. an INSERT or UPDATE) is issued.

Unfortunately, this means that when you’re just doing reads all the time (hopefully the majority of your requests) you’ll continue to see the same data if you’ve already read it in the past. Once data is read, since the transaction is held open, future reads repeat the same value.

But even that’s not a problem until you start using multiple connections (via multiple processes). If you only have one Django process, all writes will go through it and all reads will see the updated state. But when you switch to two or more processes, you’ll start seeing what seem like inconsistencies. Yet another example of why making your testing environment look as close as possible to your deployment environment is important (i.e. don’t use the Django test server).

OK, so how do we fix it? Unless you want to rewrite a lot of Django (and third-party apps), we need to set the database into the mode which Django expects. There are a couple of ways to do so. I prefer keeping my deployment as simple as possible, avoiding changes to default configurations where possible. The easiest fix, given this, is to adjust the setting on a per-connection basis in Django. Add something like this to your settings.py:

DATABASES = {
    'default' : {
        ...
        'OPTIONS' : {
            'init_command': 'SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED'
        }
    }
}

This just tells Django to run some raw SQL when initiating each connection; in this case we use MySQL’s support for setting the isolation level on a per-connection basis.

Alternatively, you can just run MySQL with the “correct” isolation level by default. Add the following to your my.cnf:

transaction-isolation = READ-COMMITTED

And of course make sure you restart MySQL.
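
Either way, you can sanity-check the result from a Django shell; MySQL 5.x exposes the session’s level as tx_isolation:

from django.db import connection

cursor = connection.cursor()
cursor.execute("SELECT @@session.tx_isolation")
print(cursor.fetchone())  # expect ('READ-COMMITTED',)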

Final Notes

You can learn more about the isolation level settings in MySQL here and generally about isolation levels (and some problems that can arise using different levels) here on Wikipedia.

This particular problem really is Django-specific, but you should be aware of isolation levels regardless. If you’re already writing your code in a way that assumes READ-COMMITTED, you might as well get the performance gain of putting the database in that mode. If you have additional services interacting with the database (e.g. cron jobs, task queues, other standalone services), you need to be especially careful since the different code may make different assumptions and can operate with different isolation levels. Ideally, you can choose one level and commit to it (har har), but if you’re using frameworks or third-party libraries you can’t always do that.

I considered submitting a documentation patch to Django, but it appears to be moot at this point. Apparently in Django 1.6 the approach to autocommit will change so it is enabled via the database layer rather than simulated. This should get rid of this particular issue. However, you should still be aware of the differences between isolation levels because they can still have an impact on how you write transactions that work with Django and the database. Conveniently, it looks like the entire Django database transaction management API is getting an upgrade that should make it much easier to customize it to your needs.

Finally, there’s another way to handle all of this: don’t use MySQL. PostgreSQL seems like it might be a better option anyway. However, given the number of Django+MySQL installations that I assume exist, I’m surprised how difficult it was to track down this problem and its solution. And regardless, it’s important to understand isolation level issues and what level you’re supposed to be operating at since you can rarely write code which behaves properly under different isolation levels unless you write the lowest-common-denominator, assumes-no-isolation version.

It's Easier Than You Think
June 07 2013
Tags: python | programming | programming languages

Python packaging is a mess — setuptools, easy_install, distribute, pip? — and that mess makes packaging and distributing a library seem intimidating. It turns out it’s really easy, but I’ve yet to find a concise explanation of how to get started. This post is going to fix that. It will quickly walk through the process of packaging a simple library, getting it onto PyPI, and using C extensions in the package. At the bottom you’ll also find links to the key resources I used to figure this out in case you need more detail.

The goal of this post isn’t to explain all the options. Rather, it describes how to build a package that others can install with pip. I’ll briefly try to define terms or explain things like PyPI, but this really expects that you’re already familiar with using packages and now want to create one.

Creating a Basic Package

First, create your basic package layout, including your module:

| setup.py
+ foo/
   | __init__.py
   | code.py

We’re packaging foo — fill in that package with whatever code you like. The only new thing here should be setup.py, which describes the package. Here’s a template:

from setuptools import find_packages, setup
setup(name="foo",
      version="0.1",
      description="A foo utility",
      author="Ewen Cheslack-Postava",
      author_email='me@ewencp.org',
      platforms=["any"],  # or more specific, e.g. "win32", "cygwin", "osx"
      license="BSD",
      url="http://github.com/ewencp/foo",
      packages=find_packages(),
      )

You can fill in more details, but these are sufficient. Notice that the only setting that doesn’t have an obvious value is packages, and it turns out that the packaging system can almost always figure out the right value for us. That’s all there is to it.

To test, use development mode. This links the current directory into your site-packages, where regular packages are installed via pip (e.g. in /usr/local/lib/python2.7/site-packages/ or in a virtualenv). It then works as if it were installed normally, but can be easily disconnected later. Run:

python setup.py develop

and you should now be able to run something like this without an error:

import foo
foo.fn()

This mode is very nice since you can run it once, make modifications, and test without having to re-install the package. When you’re ready to stop developing (e.g. later when you want to actually install the package), run

python setup.py develop --uninstall

Once you’re convinced your package is ready, generate the source tar (“source distribution”):

python setup.py sdist

You’ll find the tar file under the dist/ folder. I highly recommend looking at it to make sure everything you expect to be there is there.
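
For example, to list its contents:

tar tzf dist/foo-0.1.tar.gz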

Publishing to PyPI

PyPI is a centralized location that stores information about available packages so you can look them up and easily install them by name. When you run pip install foo, it’s looking up the instructions for installing the package on PyPI. If you register and upload your package to PyPI, other developers can install your package just as easily.

With the package ready, we just need to upload it to PyPI. You need to register a user account on PyPI if you don’t have one.

Next, you need to register your package with PyPI. Again, a setup.py command does just that:

python setup.py register

All the relevant information is pulled from your setup.py. You just need to enter your username and password. This step only reserves the name of the package; no files have been uploaded yet. You could check on PyPI that the package is now there without any files, but let’s just upload the source:

python setup.py sdist upload

That’s it, now running

pip install foo

should install your package. The upload command will upload whatever the current version is, so you can just bump the version number, build the tar, and issue the upload command again to release a new version.

Adding Dependencies

What if you require some other packages? Just specify them as dependencies:

setup(...,
      install_requires=["docutils>=0.3", "setuptools==0.5"],
      )

This is a nice trick if you already have a requirements.txt that specifies the full list of packages for pip to install (optionally with specific versions to use):

setup(...,
      install_requires=[i.strip() for i in open("requirements.txt").readlines()],
      )

Including a C Extension

The package I was building was just a wrapper around some C code. In my case, there wasn’t even any Python code. To include a C extension like this, first create a directory named for the module and put the C source code in it. For example, pyhashxx looks like this:

| setup.py
+ pyhashxx/
   | xxhash.h
   | xxhash.c
   | pycompat.h
   | pyhashxx.c

(Note that often you’ll have the C code in _pyhashxx with a Python wrapper module called pyhashxx that imports and wraps _pyhashxx. I wanted to expose the C module directly without any Python overhead, so I didn’t use this naming scheme.)

Next update your setup.py to add the extension (showing mostly changes here):

from setuptools import find_packages, setup, Extension

pyhashxx = Extension('pyhashxx',
                     sources=['pyhashxx/xxhash.c',
                              'pyhashxx/pyhashxx.c'],
                     depends=['pyhashxx/xxhash.h',
                              'pyhashxx/pycompat.h'])

setup(..., ext_modules=[pyhashxx],)

That should be it – the compilation steps should be taken care of as long as a compiler can be found. If you’re in development mode, you’ll need to run

python setup.py build

to build the extension. If you modify any C files, make sure you re-run this so you’re running the updated code.

Note that some documentation doesn’t cover the very important depends option: if you miss it, these files will not be included in the package. However, it’s easy to miss because the development mode discussed above works in your working directory rather than making a full copy. This is one of the reasons I suggested checking the contents of the tar file before submitting it to PyPI — I screwed up my first upload and had to bump the version number because that version was broken.

Extra Notes

Tests are, you know, good. You can specify the test suite in your setup.py:

setup(...,
      test_suite="tests"
      )

which lets you test easily using

python setup.py test
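
For completeness, a minimal sketch of that suite as a tests/ package next to setup.py (the assertion is just a placeholder):

# tests/__init__.py
import unittest

import foo

class TestFoo(unittest.TestCase):
    def test_fn(self):
        # Replace with assertions about what foo.fn() should actually do.
        self.assertTrue(hasattr(foo, 'fn'))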

There are also a lot of tweaks and convenient features you can use when building extensions that aren’t mentioned here. This just gets you up and running. If you’re serious about developing extensions, you should invest some time reading the resources below.

Resources

Thanks

Thanks to Jeff Terrace for reviewing this post.

April 14 2012
Tags: url shorteners | twitter | followup

I posted yesterday about URL shorteners wrapping already shortened URLs. I didn’t have any tweet data to work with, so I could only look at the current state of URL shortening using Twitter’s public stream. Jeff was kind enough to provide some sample tweets from 2009 so I can look at how URL shortening has changed in the past couple of years.

Although the 2009 data had many more tweets, I extracted about the same number of tweets (100,000) and links from those tweets (~20,000) as I collected for the previous post. Here’s a histogram of the number of redirects taken per link found in the tweets from 2009:

[Figure: Histogram of the number of redirects for links from 2009. The vast majority have zero or one wrappers.]

and here’s the histogram for 2012, repeated from the last post:

[Figure: Histogram of the number of redirects for links from 2012. The vast majority have one or two wrappers around the real link.]

The Y-axes differ a little, but the trend is clear: where links were previously almost all wrapped at most once (average 0.66 redirects), they are now wrapped twice or more nearly half the time (average 1.48 redirects).

I also checked the top shorteners (involved in any redirects) and re-shorteners (redirects to another shortened link) as in the previous post. The top shorteners aren’t surprising – bit.ly dominated in 2009 with about 78% of the links, and t.co dominates in 2012 with 72% of the links. But here’s the data showing the raw count of shortened links that redirected to another shortened link, for the top 10 reshortening services in 2009 and 2012, sorted by the 2012 numbers:

Site              2009   2012
t.co                 0   7929
bit.ly              19     31
j.mp                 1      5
ow.ly (owl.li)       3      5
fb.me                0      4
is.gd                2      4
mrkt.ms              0      3
apne.ws              0      2
w3e.us               0      2
lnk.ms               1      2
dzone.it             2      0
yit.me               2      0
migre.me             2      0
retwt.me             2      0
cnn.cn               1      0

The number of tweets in the two data sets aren’t exactly equal, but it doesn’t really matter because the numbers are so skewed. There might be a slight increase in reshortening among the services represented in both years, but t.co links dwarf all the other services. Twitter seems to be single-handedly increasing the degree of indirection to the destination sites linked on Twitter.

April 12 2012
Tags: url shorteners | twitter

This tweet asking how many URL shorteners an average tweeted link goes through piqued my interest, so I decided to find out. I wrote a couple of scripts to collect public tweets, look for links, follow them through redirections, and then do some quick and dirty analysis of the resulting data.

I was interested because I’ve noticed the browser hopping through multiple (slow) sites when clicking links on Twitter. But if you’re not familiar, you should also read Joshua Schachter’s excellent overview of URL shorteners and the problems they cause – it’s worse than just a longer wait for your page. But waiting for your page sucks too: compare this to that, which not only is slower, but has annoying clickthroughs.

But on to the data. The short version: the vast majority of links only go through 1 or 2 shorteners, but nearly as many go through 2 as go through 1. The average number of shorteners you have to go through is about 1.48. In other words, on average you’ll need to make 2.48 synchronous HTTP requests to start displaying a page linked on Twitter.

To determine this, I collected 101,668 tweets containing 20,868 URLs. Following the links through redirects to their destination and counting them up gives this histogram (note the log scale):

[Figure: Histogram of the number of redirects for links. The vast majority have one or two wrappers around the real link.]

So a few links manage to get through the tweeting process unscathed, most have one or two shorteners between you and the real page, and then there are some outliers. And one was wrapped 7 times!

But wait, there’s more…

When I created the link above to the Wikipedia page about URL shortening that goes through many shorteners, some of them rejected already shortened links. They are at least making some effort not to re-shorten links. This may have more to do with spam than any other motivation, but it indicates some URL shorteners are avoiding these nested shortened URLs.

So I wanted to look into who is generating rewrapped links. First I looked at the top host names in shortened links, including ones discovered by following redirects. (Again, log scale.)

[Figure: Picking out just the domain of the shortened links shows that Twitter now creates the majority of shortened links.]

This essentially just shows market share. Not surprisingly, t.co is the winner, with some other big names following them.

But I was really interested in re-shortened links: links that redirect to another shortened link that also needs to be resolved. If you cut out the last shortened link (the one that finally points to the real resource), you can see who the worst “re-shorteners” are.

[Figure: Twitter itself is by far the worst offender in reshortening already short links.]

Twitter is by far the worst offender. Of course, as the last hop between the tweeter and the tweet being published, this isn’t surprising, but it also suggests that Twitter is always or at least very aggressively re-shortening links. This definitely meshes with my experience where Twitter will unnecessarily wrap a short link in t.co form, even if my tweet was short enough already. I’m assuming this is because they want to collect analytics – the value in collecting data about where tweets spread outweighs the cost of making links brittle and slower to load.

If you compare the last two graphs, you can also see which services seem to be actively avoiding reshortening. Those that drop to a lower position in the second graph are either not reshortening links or don’t receive links to be reshortened (i.e. they generate them only for their own pages). Good examples, I think, are Facebook (which isn’t reshortening) and Tumblr or YouTube (which probably only short-link their own pages).

Methodology

My basic approach was pretty simple:

  1. Pull tweets from Twitter’s public streaming API. I pulled around 100,000 tweets over 40 minutes on April 12 2012 around 10 AM. If you’re curious, this runs at about 100 KB/s. The tweetstream library made this trivial.

  2. Unshorten links by following redirects. I found about 20,000 URLs in the tweets I collected. Since I didn’t have the analysis code ready, I just dumped the list of redirections and status codes to a file. Thanks to the very nice Requests library, this was trivial. It even holds onto the history of redirects for you (see the sketch after this list).

  3. Run the analysis. Classify requests as shortened URL redirects and count them up. Ignore anything that didn’t ultimately result in a successful page load to avoid noise due to issues like shortened links that were removed because they pointed at spam or random server failures.
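
The redirect-following really is that simple with Requests. A sketch (the short link is made up):

import requests

resp = requests.get('http://t.co/example', timeout=10)
for hop in resp.history:               # one entry per redirect followed
    print(hop.status_code, hop.url)
print(resp.status_code, resp.url)      # the final destination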

There were a couple of gotchas:

  • Extracting links is tricky. You obviously can’t rely on looking for properly formatted URLs. I used a simple regex I had developed previously to look for links in tweets, which I use for archiving tweets and making sure I have the real link URLs. It’s better to be liberal here because we can filter failures out later – we’ll get 404s or name resolution issues.
  • Classifying shortened/long links isn’t hard, but you need to tweak things a bit. Just looking for redirects (HTTP 301 or 302 status codes) isn’t sufficient since there are non-URL-shortening uses of these. An initial analysis looked a lot worse because these redirects were also counted. Ultimately I took into account the HTTP status code, the length of the host name portion of the URL, and the total length of the URL. It’s still not perfect, but false positives/negatives seem to be pretty rare in my data set.
  • This is simple, but you need async requests for accessing the pages or threading – some sites go down or are slow to respond and it would have taken too long without this. I ran 20 concurrent transfers and the whole process probably took about 30 minutes.

See the code for full details.