Easy Endian-ness
January 25th, 2006

A lot of my readers have noticed that this blog covers extremely geeky programming topics that often fly right over the heads of less technically-submerged people. It’s not an intelligence thing – it’s all just jargon and experience. So every once in a while I’d like to pull my head out of my assumptions about who is reading the blog, and tackle something considerably more “simple”: topics that you must understand if you’re a developer, and may want to understand as a curious non-developer.
Tim Gaden points out on his Hawk Wings blog that the term “endian-ness” has become widely used in the Mac community, and it’s leaking out of the labs and into common conversation with customers. He astutely observes that many people haven’t got the foggiest idea what it means.
In the context you’ve probably been hearing it lately, it has to do with differences between two computer chip architectures – particularly Intel and PowerPC.
PowerPC is a “big-endian” architecture, while Intel is “little-endian.”
A hexadecimal (hex, base 16) number uses the numerals 0-9 (like decimal) but adds the letters A-F, so it can represent larger numbers with fewer characters. Other than that, when written and talked about by humans, it’s constructed like a normal decimal number. For instance, here’s a hex number:
0x11ff
(The 0x at the beginning is common shorthand for “this is a hex number”).
On a PowerPC system, that is exactly how the number is stored in memory. The “big end” is the end farthest to the left. This is what you’re used to with decimal numbers. It’s why 3,000,000 dollars is a lot of money: the characters that “mean a lot” are on the left. On an Intel machine, the number is stored “with its little end in first” – so it looks backwards:
0xff11
As a very practical example, let’s say I run a program that stores a directory of all my friends. It was written for a PowerPC computer, and as part of its data format, writes the total count of friends as a number:
0x0001
You don’t have to be an expert with hex numbers to guess that the “logical value” of this number is ONE. Now, what happens when this program is run on an Intel machine? If it just reads it verbatim from disk, it ends up looking exactly the same in memory as it did on the PowerPC, but it means something completely different. Since the “little end” goes in first on an Intel machine, the number is interpreted backwards, and becomes equivalent to this PowerPC-based value:
0x0100
So instead of 1 friend, the program running on Intel thinks I have 256 friends. Hey, I like this byte-order problem! But when the program proceeds to try reading those 255 non-existent friends from disk, the data format is hosed, and the application either crashes or behaves very strangely.
The solution for most of these problems is something called “byte swapping.” This makes it the responsibility of the programmer to ensure that bytes that went to disk on whatever architecture come back into memory in the appropriate format for the current chip. A byte on a computer is exactly the amount of data that uses up two characters in a hex number. So using the above example, the bytes in question are “0x00” and “0x01”. If the bytes went to disk in big-endian format (0x0001), they need to “trade places” when they’re read back on Intel, so that they still mean ONE in little-endian (0x0100).
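For a 16-bit value, the swap itself is a one-liner. Here’s a sketch in C (the name `swap16` is my own; real systems ship equivalents):

```c
#include <stdint.h>

/* Exchange the two bytes of a 16-bit value, turning a big-endian
 * reading into a little-endian one and vice versa.
 * (swap16 is an illustrative name; systems provide equivalents.) */
uint16_t swap16(uint16_t v) {
    return (uint16_t)((v << 8) | (v >> 8));
}
```

Applying it twice gets you back where you started, which is why reading data on the same architecture that wrote it needs no swapping at all.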
Apple has done a lot of the work for developers in this transition. Thanks to the growing use of highly abstracted data formats, Apple was able to handle the grunt work for things like preferences storage automatically. But for developers with custom data types, who have not planned for endian-ness issues, the announcement that Apple would be moving to Intel was a major wake-up call. They’d have to revamp their data storage and retrieval strategy so that they were always capable of “doing the right thing” regardless of the architecture. For most Mac developers, this means “assume it’s always big-endian, and byte swap if necessary.” This assumption might change over time if little-endian processors like Intel’s end up being what the Mac sticks with. But for the time being, assuming big-endian means that the data formats can be passed seamlessly between existing applications and their Intel-savvy counterparts.
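One way to honor the “assume it’s always big-endian” convention is to never read multi-byte numbers verbatim at all, but to assemble them byte by byte, which behaves identically on both architectures. A sketch (`read_be16` is my own helper name; on Mac OS X, Apple ships swappers such as `CFSwapInt16BigToHost` for the same job):

```c
#include <stdint.h>

/* Assemble a 16-bit value from two bytes stored big-endian on disk.
 * Because we address each byte explicitly, the host's own byte
 * order never enters into it. (read_be16 is an illustrative name.) */
uint16_t read_be16(const uint8_t *bytes) {
    return (uint16_t)((bytes[0] << 8) | bytes[1]);
}
```

Given the friend-count bytes 0x00, 0x01 from the example above, this yields 1 on Intel and PowerPC alike.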
The length of even this “easy” overview of endian-ness proves that it’s actually a complex and difficult concept. I have barely scratched the surface but I hope this helps put things into a bit more perspective. The next time you hear geeks yammering on about endian-ness, perhaps you’ll have a quip or two to interject!
January 25th, 2006 at 7:37 am
[…] UPDATE: But I’m wrong. FastScripts developer Daniel Jalkut reveals all. Technorati Tags: endianness, Wikipedia, AppleScript […]
January 25th, 2006 at 7:56 am
Incidentally, i think the term “bigendian” originally came from Gulliver’s Travels, and the techies borrowed it.
January 25th, 2006 at 8:37 am
“Incidentally, i think the term ‘bigendian’ originally came from Gulliver’s Travels, and the techies borrowed it.”
Or vice versa, if you’re having endian trouble.
January 25th, 2006 at 10:28 am
I don’t think you need to make your non-technical readers understand about hexadecimal in order to understand this issue. It’s much simpler than that and can easily be explained in terms of ordinary decimal (base 10) numbers as long as you explain it by analogy instead of insisting on explaining the actual situation with bits and bytes.
When we write the number 13, we understand that the left digit is the “tens” column and the right digit is the “ones” column. But it is a mere convention that we write the tens column on the left and the ones column on the right. I.e. that we put the most important column (here, the “tens”) on the left. Some other society (“aliens”) might easily have adopted the reverse convention – that the larger columns proceed to the right. E.g. when that other society wants to indicate the unlucky number referred to above, they would write 31 meaning 3 “ones” and 1 “ten”. There would obviously be confusion and errors if our two societies exchanged written records containing numbers.
See also this interesting account of teaching base 2 (binary) to a grade 3 class using the Socratic method:
http://www.garlikov.com/Soc_Meth.html
January 25th, 2006 at 10:50 am
Cameron – it’s certainly not necessary, but I wanted to talk about bytes, for which hex is very convenient. Also – even in this article I assume a certain degree of technical sophistication – just less than the usual “hacking with GDB” type crowd.
January 25th, 2006 at 11:38 am
Regarding Jon Hendry’s comment about Gulliver’s Travels, well, that’s actually documented in Apple’s Universal Binary Programming Guidelines. :)
January 25th, 2006 at 2:16 pm
When I was a kid in France, the official program for kindergarten included “modern math”. I remember learning the concept of bases for counting. We were using wood pieces of different sizes. For instance, put one small square, then put another one next to it, then when you have 3, remove them and replace them with a large rectangle; do it again, do it again, then remove the 3 rectangles and replace them with a big square. Or something like that (for base 3, but then we did different bases). Of course, I only made the connection with bases much later, like high school or college, and looked back at that with awe. I happened to later develop a great taste for maths, so maybe I was particularly receptive to that kind of stuff, but it is still amazing that you can teach this kind of stuff to 4-5 year olds and make it all intuitive.
The “modern math” program as it was, however, was dropped; I’m not sure why, or whether it really had any impact on subsequent success at school.
January 25th, 2006 at 9:27 pm
Charles: were the little pieces of wood different colors? I remember using just such things in first grade (about 6 years old in the US). We called them “Rods” I think. I’m pretty sure we never learned different bases with them, though. Lucky kid!
A google search of “Math Rods” finds some similar tools, but when I was a kid there was definitely no writing or anything on them.
January 26th, 2006 at 8:09 am
Red Sweater Editorial: An otherwise thoughtful poster used a fake name and email in writing this comment. In the dialogue that continues in other comments, the poster is referred to as “Not Required” because of their defiant choice of this name for their entry:
January 26th, 2006 at 10:42 am
Daniel: yes, they were of different colors. Not sure how many different bases we did, but I know it was not just base 10. My wife told me this morning she remembers it too.
January 26th, 2006 at 11:12 am
daniel,
i like your technical blogs, even the “hacking with GDB” ones. although i don’t personally hack with gdb, i still enjoy the mental challenge your more technical articles put forward. as i began to read this article i thought to myself, “here we go again with _another_ endian explanation, ugh.” but by the end, i thought you’d done a good job of explaining the “real” problem developers face, and that’s why i read your blog, to see how macintosh developers tackle real problems! keep it up!
i also have to agree with “not required.” the current US date standard is insane. i too would like to see the ISO format more widely used. i too will be going bigendian for the date, and little endian for my cpu!
January 26th, 2006 at 11:47 am
I agree that hex is useful in explaining endianness. My personal flavor of OCD was triggered by a different aspect of this entry: “number” vs. “numeral”. Numerals are representations of numbers, and hex and decimal are schemes of numerals rather than numbers. For that matter, this whole endianness issue is about numerals rather than numbers. Because you’re not using the word “numeral”, you’re forced to resort to phrases such as ‘the “logical value” of this number’, which is not only expensive but redundant, when you could have said “this numeral’s number”. OK, so I’m a word geek, and maybe using the word “numeral” would have annoyed and/or confused your target audience. I don’t mean to sound critical. I do keep coming back for more, after all. I just had to vent my numeric-vocabularic spleen.
January 26th, 2006 at 11:57 am
“Not Required”: an interesting, if slightly weird pet peeve you’ve got there. You know my pet peeve? People who think their opinions are important enough to post on my blog, but won’t even leave their email address (only readable by me) to legitimize their person-ness. It’s especially annoying to have to refer to you as “Not Required” in a context that I like to think of as somewhat community-oriented.
I like the nerdiness of the analysis that it’s inappropriate to mix up the “endian-ness” of dates. I never thought of it, myself. But my knee-jerk reaction is to suggest that in the case of a format like “January 26, 2005”, there is simply a hierarchy of information, which is sub-organized by specificity. Similarly, it’s common (in the US, at least) to list an address as “123 Main Street Apt #2, Cityville, State, USA.” You could argue by your logic that the “Apt #2” in the middle of all that is out of place. I disagree. It’s subordinate to its section of the overall address, and allows the section to be meaningful on its own given the knowledge of everything after it. Similarly, a date like “January 26” is meaningful right now because we all know it’s 2005. The 2005 is there in case somebody gets confused or wants to make sure it is really today.
January 26th, 2006 at 12:01 pm
Keith: Thanks a lot for the kind words. It seems the blog is finally getting enough of a readership that I get negative comments to go with the positive ones. A comment like yours every once in a while keeps things in check!
Pete: You totally caught me on being thoughtless and ignorant with regard to “number” vs. “numeral”. I think if I thought about it I would figure it out on my own, but I confess that I’ve never really given a second thought to using “number” for both the quantity and the representation. Must be my colloquial habit. I do think that using the two side-by-side would only lead to headache and confusion by both writer and reader alike. Probably better would have been to clearly define “character” in the context of a “number representation” and thereafter use character or number with more thoughtfulness. (And I’m glad my English imperfections haven’t sent you away for good :) )
(Looking back over this entry, I see that I used the terms “number” and “numeral” in close proximity in a perfectly understandable form. I never knew I knew the difference but I guess I did. No excuses, then!)
January 26th, 2006 at 3:53 pm
The “rods” being discussed are Cuisenaire rods: http://en.wikipedia.org/wiki/Cuisenaire_rods
January 27th, 2006 at 3:39 am
I’m not really sure I understand that pet peeve. If I *did* think my opinion were important, you should expect me to attach my name; clearly that is not the “peeve” behind my post. Likewise, if I thought you needed to legitimize what I wrote based on my identity, then that would mean I expect you to commit a logical fallacy; also clearly not the case. I really just bopped on over from Stepwise and thought it was an on-topic point to make. I really don’t understand why I should be forced to be part of the community in order to make a point, and so I almost never post to sites that require registration (a loss, hopefully, to them and their community) and post anonymously unless there really is something to be gained by using my identity. It strikes me as odd that remaining a helpful stranger is frowned upon.
Back on the topic of endian, I *would* agree that a street address can also have mixed endian, but the justification in your disagreement is clearly wrong. The hierarchy you describe clearly favors not mixed endian, but little endian. Yes, with dates, we all can assume the current year, but by that logic we can also assume the current month as well, and often do when we say things like “I have a meeting on the 30th”. Likewise for an address, once you’re at the street location, all that *does* matter is the apartment number, and the representation commonly used does *not* allow you to discard everything after it and reduce it to what is immediately meaningful.
I’m not trying to get you to change your blog date format or “requirements” for commenting; I’m just riffing on endian, too. As we look at endian with computers I think it’s interesting to look at endian elsewhere and think about what things make sense and what things are just force of habit. Yes, there are many common examples of mixed endian, but they don’t really make a lot of sense when you actually think about them. It has been said that little endian is favored when writing things out (because you can simply stop if the remainder is “understood”), and that big endian is preferred when representing continual precision (as is often the case for numbers). I don’t know how true that is, but it somehow makes sense that an older architecture like x86, designed when we were moving from paper to electronics, would be little endian whereas newer architectures are usually big endian. I can’t point to a single mixed endian architecture! :-)
January 27th, 2006 at 1:03 pm
“Similarly, a date like ‘January 26’ is meaningful right now because we all know it’s 2005. The 2005 is there in case somebody gets confused or wants to make sure it is really today.”
And because most of us also know it is January, most of us Europeans start dates with the day ;-)
26 January 2005
We also start with a street’s name and then the number, and this seems designed for usability. Imagine you are trying to find an address by foot or car. What do you need to look for first? You find the street, then you look for numbers. Now let’s not even start with the funny numbering schemes they use all over the globe, some having the odd numbers on the left side of the street and the even numbers on the opposite side…
January 27th, 2006 at 1:06 pm
Yeah OK – Europe wins :) And let me point out before anybody else that my statement “because we all know it’s 2005” really needs work! Hmm. Duh!
January 28th, 2006 at 7:01 am
Not being a nit-picker or hair splitter, i surely took your statement as “because we all know what year it is”. This is a very interesting topic, and the world would be so much easier if we all could agree on standard notations – maybe even agree on the metric system ;-)
Nerdy Example:
The fine and free Google Analytics nicely tracks ecommerce values for us. According to their documentation, they use the period as the decimal separator – and when looking at ecommerce reports in Google Analytics, numbers are clearly formatted as 1,999.00. So although I submit Euro values to them – and I’d say the whole Euroland uses the comma as the decimal separator – I play nicely and instruct my application server to format numbers accordingly (just to keep Google happy). So my application server is being very kind and takes the period as the decimal separator and the comma as the thousands separator, so a number looks like 1,999.00 Euro.
Guess what happens? Google correctly takes these numbers for single items (skus) of a transaction but fails big time when it comes to the total of a transaction.
With the totals of transactions, it suddenly treats the comma used as the thousands separator as the decimal separator. So what in fact was one thousand nine hundred ninety-nine point zero is suddenly only one point ninety-nine (1,99). I end up seeing correct statistics for single items (they call it the product category report), but totally off values for the totals of transactions in Google Analytics.
Long story short: even the many genius Google engineers fail on a simple matter of notation.
Of course you’d think they’d at least implement the same logic across all their reports, and not get it right with one report but fail with others. You can’t really blame them, as it is absolutely not so easy to know upfront what kinds of numbers in what formats the whole world would start submitting. Clear documentation would have helped (their documentation only shows examples with amounts in the hundreds).
January 28th, 2006 at 10:09 am
Forgot to mention that, in Google’s honor, their Analytics product was not developed on their campus, as it was bought from Urchin. Another bit of “lack of usability due to notation/formatting habits” ramblings here.
February 3rd, 2006 at 3:13 pm
The Gulliver reference:
http://www.pbs.org/wgbh/cultureshock/flashpoints/literature/gulliver_a.html
Another technical explanation of the problem:
http://en.wikipedia.org/wiki/Endianness
February 7th, 2006 at 3:11 pm
I agree that, when talking about numbers, the endian-ness doesn’t really matter. But if you are not talking about numbers, it does matter. Little endian doesn’t scale, big endian does. Let me explain with an example.
Say you have a large bit field where each bit represents a boolean value. Say you want to find the largest stretch of cleared bits in the field. With a big endian machine you can optimize this process by reading the bitmap with the native size integers – 32 bit for a 32 bit processor, 64 bit for a 64 bit processor – at a time. When the 128 bit processor comes, the function doesn’t change, just start using 128 bit.
With a little endian machine you cannot do this, for the simple reason that the two 32 bit values will be different than the 64 bit value will be different than the 128 bit value.
Sure the way out is to reverse the loop – start at the end – but that is not always possible or desirable. For example, reading a file from disk backwards isn’t a very performant option. Extending an allocated piece of memory at the front is not a function I’ve heard of either.
IMHO little endian byte order for numbers is a bad idea because everything else is big endian: the sectors on disk, memory allocation functions, everything.
February 9th, 2006 at 5:00 pm
Cuisenaire Rods?
See Learn Fractions with Cuisenaire Rods or search google…
I liked the explanation – simple and clear.