Machine translation, semantic HTML, and embedded code, oh my!

It’s hard for me to believe that in the not so distant past I worked as a web producer for O’Reilly’s now defunct Online Publishing Group. Basically I was a sort of HTML jockey, shuttling articles through their content management system while ensuring that they all had consistent markup.

Since the articles O’Reilly published online dealt predominantly with programming, the vast majority contained code snippets and samples. Thus one of my primary roles was to ensure that the embedded code displayed properly on the web. Our HTML “style guide” dictated that all code be wrapped in <pre><code></code></pre> tags. I remember at the time thinking it a bit redundant—I occasionally embed code on my blog, but for simplicity’s sake, I use <pre> alone.

I discovered today another value to the semantic use of HTML: machine translation. I was looking through my referrer logs, and noticed someone had translated one of my posts with code samples. And sure enough, Google translated everything within the <pre> tags, mangling the code. Gah!

This made me think all the way back to my O’Reilly days, and I wondered what Google would do if I also wrapped the code block in <code> tags. Sure enough, they respected the semantic “intent” of the <code> tag and left it untranslated—BUT they also stripped out the newlines, collapsing multiple lines of code to a single line. So if anyone from the Google Translate team is reading, consider this a bug report.

Google Translate Test: before — Before Google Translate

Google Translate Test: after — After Google Translate

Note: Apparently Google will also avoid translating tags with class="notranslate".

Care to Comment?