Holy Bitcode Batman, You’re Writing a Compiler!

Recently I’ve taken an interest in programming language design and have started working on a compiler for a new language. The reasons for doing that are perhaps less practical than I’d like, because I’m a practical man, or sometimes pretend to be. At least I’m usually more interested in the application of mathematics than the beauty of mathematics itself. The same used to go for programming languages.

As the few readers of this blog may know, I’ve been experimenting with game programming now and then, but haven’t been driven enough to actually complete and publish a game. In early years I tried Pascal and C++. I did complete some stupid little games in Pascal, but C++ was more a hindrance to me than an enabler. At some point I learned UnrealScript while developing a Deus Ex mod and kind of liked it. Actually I think that was part of what lead me to Java and getting jobs in Enterprise Java programming. I never really thought to write games in Java, though.

When I discovered Scala I fell in love and thought it was the be-all and the end-all of programming languages and very applicable to games. Tried to write a couple of games in it, wrote some libraries for my own use and even ported a physics engine from Java and C++ to Scala. I loved doing it, loved the language features, the type system, the syntax, the preference of immutability over mutability. Perhaps most of all I loved Scala’s elegant mixture of object-oriented and functional programming.

Eventually something about making games in it started nagging me, though. The performance was good enough for the type of game I was making, but the game was also using a lot of resources for the type of game it was. It started to feel wasteful. The idea that games are a big part of what is pushing hardware forward and have to take the most out of it was somehow stuck in my head.

Scala runs on the JVM, which has the nice abstraction of a big heap, and the memory management is done for you. My boss Jevgeni has given a really awesome talk on that topic titled “Do you really get memory?”. But whatever tricks the JVM does to make that abstraction work well enough for most applications, it does produce some amount of waste — extra CPU cycles and garbage objects which need to be collected to free the memory for new objects. And that is a big part of what the JVM engineers are continuously improving on. They are probably the most efficient at collecting (virtual) garbage in the world! But there are cases where that kind of heap abstraction doesn’t seem to hold well, and high-performance games are one of those cases.

My game engine used lots of immutable, short-lived objects. Things soon to become garbage, in other words. Garbage, dying a slow death in the heap. Every small Vector2(x, y) tracked by the collector, maybe living in a separate heap region from its closest friends. And looking up bits from here and there in the heap is really expensive from the CPUs perspective. Even when the Sun Java VM started optimizing away heap allocation of very short-lived objects (enabled by escape analysis), that only gave me a small performance boost. The situation has improved, but back then I decided to try and avoid so much garbage being produced. I optimized some functions manually, doing scalar replacement of Vector and Matrix objects. That made the code look really ugly and unreadable because it hid the mathematical formulas.

I couldn’t stand it. Neither could I stand all these cycles wasted on GC. So I wrote an optimizer that plugged into the Scala compiler and did the scalar replacement automatically. It worked, and gave a more significant improvement than the JVM’s escape analysis optimizations at that time; garbage production was being reduced, I was going green! But it was hard for me to maintain the optimizer as I knew almost nothing about compilers and was just going by my nose. There were some corner cases that were hard to handle correctly. It only worked on code written against a very specific library, optimizing away well-known constructor and method calls.

Writing that optimizer got me somewhat interested in compilers, though. I remember saying to someone during a job interview a few years ago that I like complex problems, but am not interested in the really complex stuff like compilers. Sometimes things work out as the reverse of what you think.

Anyway, working on my game engine, I wanted to create a really powerful entity system. I wanted to use mix-ins and other nice Scala features. Reading some blog posts about game entity systems, I realized that most people seemed to be moving away from inheritance-based systems into component-based systems. Reading more about component-based systems made me run into the topic of data oriented design (PDF), which is all about thinking about data first, and how the program processes it. A couple of presentations on that left an impression on me and made me realize just how expensive it actually is to make the CPU churn through megabytes of random memory.

But I didn’t want to switch to C++ or C to be able to take advantage of the kinds of optimizations that data-oriented programming can give. I had the idea that maybe there should be a language that was object-functional like Scala, but compiled down to very CPU- and cache-friendly data structures and functions, the kind of structures one would use when doing data-oriented programming manually. And I have huge respect for people who do the latter. But I noticed that some of them seemed to be wanting a better language than C++ as well.

So, a language as expressive and type safe as Scala, similarly object-functional, but with more efficient memory access, CPU cache and parallelization friendly, enabling a data-oriented programming style with less hassle than C++. Perhaps one that could even run some functions on the GPU. What could that look like? I started thinking that maybe I should find out first-hand. [Update to clarify my goals] Well, at least I want to find out what it would be like to try to get there. I’m sure combining the type safety and power of Scala with the raw performance of C is way too ambitious for me. So I’m setting the goal way lower, but more about that in the next post.[/Update]

I’ve never really been a language geek. I’ve programmed in several different languages and at the moment am good friends with only two: Java and Scala. There are scores of other programming languages out there and I’m sure there is some language that is at least 2/3 of what I’m looking for.

But I think some bug bit me, because I couldn’t let go of the idea of creating one myself. And this blog post was a lengthy, boring preface to a series of hopefully less boring posts documenting my experiment. I have no illusions — I don’t think I’ll create the “next big language” or anything close — there are people much smarter than me who have been working on programming languages for years and decades. But this will be a fun learning experiment and maybe something useful will come out of it. Next time I will talk about the kind of language(s) I want to create. The (s) is because I want to create a really simple language first, to learn more about compilers during the process, and later expand from that base.

Python is great, but __

I haven’t used Python a lot, but from what I know about it, I consider it one of the good languages. A bonus is that most Python programmers seem pragmatic and humble, which is nice compared to the impression of Ruby community left by some of the more arrogant Ruby folks. But there’s one thing that really irks me about the language, more than the somewhat significant whitespace or some of the syntax quirks:

__init__

What the fuck is up with that? I have seen very few pieces of Python code which didn’t contain those double underscores. How was it decided that commonly used names will contain four underscore characters? I doubt that names such as init, main, iter, etc. were considered so precious that the language must not treat them specially. Perhaps in the early days it wasn’t apparent that the magic methods would be used as often as they are? Or maybe it was but the __ didn’t bother enough people? I would find that strange, but then again people get used to everything. Scala makes magic use of the underscore as well, but I’ve gotten used to the few cases where it’s part of names. Most of the time it’s used as a wildcard, and that doesn’t bother me at all.

Do you program in Python regularly? What do you think about the underscores?

Right Tool for the Job? Yeah, Right

Frankly, I’m a bit tired of hearing the hammer and nails metaphor and “right tool for the job” from developers when they talk about programming languages. While there’s nothing wrong with selecting the right tool when there clearly is one, I think the phrase is often misused and seems to be a cheap explanation to justify personal preferences and subjective decisions seemingly objectively. I think when that phrase is uttered, it mostly means something like: the right tool for me (or my company) at that particular time, considering my (our) experience, knowledge, skills and preferences at that time.

Most of the times there are always several “right tools” that are probably equally good “for the job” and the reasons for choosing one over another are usually not merely technical in nature but depend on many other variables from the larger context in which the choice is made. Actually, I don’t even think there’s ever a completely right choice. The complexity of the real world makes it impossible to know for sure, but nevertheless, choices must be made and we approximate. If someone else made a different choice in a similar situation that doesn’t mean it was the wrong tool for the job.

A programming language is not really just a tool anyway. In some ways it is, but the hammer analogy sucks. For building a house you might need hundreds of different tools, tens at the least. For building a large application you could get away with just one language without ending up with bad abstractions. If you count HTML, JavaScript, SQL in typical web apps for example, then maybe you need five or so, but if these are cases of selecting the right tool then they have long become common sense and are not interesting to discuss.

But usually you only need one general purpose language. There are of course some differences between them and some languages have grown into certain roles (forgive me some sweeping generalizations): C++ is for performance and hardware access; Java is for applications and information systems; PHP, Python, Ruby for web apps and scripting etc. (as a side note, I think Scala can play both of the latter roles) In some cases you might perhaps want two of them — programs in higher level languages sometimes “drop down” to C++ for performance, Grails mixes Groovy and Java — or maybe even three, but much more than that would be overkill. Note that I am not at all against the polyglot programming meme, but I do think the choice languages for any given project or organization should be limited to a few.

So programming languages are like tools in the sense that choosing a language somewhat limits what you can do with it. But actually, I would say that a general purpose programming language together with it’s libraries is more like a complete tool set than a single tool. You can choose any tool set you like, each will get the job done. One set might be missing a hammer, another might be missing a saw — if that turns out to be a problem for a few situations then you compensate. Perhaps by renting a specific tool for just the time when you need it, deciding on using two tool sets from the start, dropping a requirement, or finding a workaround that is doable with your current tools. Even if you have to do the latter, it still doesn’t mean you are looking for nails where there aren’t any — it might be a valid choice considering your constraints. However, if this keeps happening then you probably have chosen the wrong tool (or someone chose it before you).

But besides being a tool set, a programming language also forms an important base for communication between developers, just like natural language does. With this comes a whole new set of considerations for choosing a language that have nothing to do with how exactly it fits to solving a problem technically. It also determines what kind of community you have access to, what kind of cultural values you may be indirectly associated with, how many “compatible” developers there are in your area etc. And I think sometimes, but not always, these questions may be more important than the technical ones.

(functions == objects) => awesome!

One thing I increasingly love about Scala is this: functions are objects, and for many common objects it is true that objects are functions as well. Some examples are arrays, lists, maps etc. For example, lets say we have a map from numeric characters to English words corresponding to these numbers:

val n2s = Map(
      '0' -> "zero",
      '1' -> "one",
      '2' -> "two",
      '3' -> "three",
      '4' -> "four",
      '5' -> "five",
      '6' -> "six",
      '7' -> "seven",
      '8' -> "eight",
      '9' -> "nine")

This map moonlights as (okay, okay, “inherits from”) a function of the type (Char => String) and you can pass it where such a function is expected as parameter, like here:

n.toString.map(n2s).mkString(" ")

Quick explanation for those new to Scala:

  • n.toString converts n to its String representation
  • a String is also a Seq[Char] in Scala land, and the map method applies a function of Char => B (where B is any type) to each element in the sequence, thus producing a new Seq[B]
  • we pass our map n2s to this function, which means we will map the Seq[Char] to a Seq[String]
  • that sequence is then turned into a single String using mkString (and specifying space as the separator between elements)

This function is just an example, it probably isn’t actually useful for anything. I just saw something like it in Python and thought I’d try to make the same program in Scala before realizing that it didn’t do anything really interesting (like converting 123 into “one hundred twenty-three” instead of “one two three”).

Another awesome thing about functions in Scala are partial functions, which are functions taking a single argument that are defined only for some argument values (the domain of the function). Map is also a partial function — defined only for the keys it contains — and pattern matching, which is a very important part of Scala, is related to partial functions as well. I won’t go into more detail in this post, but Doug Pardee has written a more in-depth article about functions in Scala.