A mouthful of bytecode
December 4, 2007 7:57 AM

Bytecode-based virtual machines are the Next Big Thing in programming. You can run Lisp, Ruby, Python, OCaml, and yes even COBOL on the JVM. Or if you prefer your languages to be a bit more melodic there's J#, A#, P# and F#. Even C/C++ has a bytecode compiler now. That's not to mention languages that have their own VMs like Erlang or that are writing their own like Parrot or PyPy.

Some background for the confused: Computer programs are traditionally written in human-readable source code then compiled to computer-executable machine code. Bytecode is a sort of halfway point between the two. It's not human readable but neither is it tied to a specific architecture like compiled code is. Java is the canonical example: a program written in Java will run on any machine that has a Java Virtual Machine (JVM) from your desktop to your cellphone.

With bytecode-targeted compilers programmers can write in their favorite language without being limited by the libraries and compilers written for it and have those programs run anywhere there's an appropriate VM.
posted by Skorgu (61 comments total) 20 users marked this as a favorite

 
"Next Big Thing"? If you throw in the .NET CLR (which is as much a bytecode-based VM as the JVM), then "Bytecode-based virtual machines" encompasses a huge swath of development happening today.
posted by Slothrup at 8:01 AM on December 4, 2007


Also, the original CPython was based on a bytecode-based VM, albeit one specific to its implementation.
posted by Slothrup at 8:02 AM on December 4, 2007


Bytecode-based virtual machines have been the Next Big Thing since 1997.
posted by lodurr at 8:08 AM on December 4, 2007


Oh good, the beautiful elegance of COBOL and Perl at the overhead of Java. Just what the programming world was looking for.

With bytecode-targeted compilers programmers can write in their favorite language without being limited by the libraries and compilers written for it...

I don't understand this sentence fragment. That I can call Java bytecode libraries from a python bytecode program?
posted by DU at 8:11 AM on December 4, 2007 [2 favorites]


"That I can call Java bytecode libraries from a python bytecode program?"

Only if you compile your Python code with Jython, a Java based Python bytecode compiler.
posted by PenDevil at 8:13 AM on December 4, 2007 [1 favorite]


Ahh, sorry that wasn't clear. That second 'it' refers to the favorite programming language. I.e. you can call Java libraries from (at least) JRuby and Clojure.
posted by Skorgu at 8:13 AM on December 4, 2007


Also the next major version of Ruby (2.0) will be byte compiled instead of interpreted.
posted by PenDevil at 8:15 AM on December 4, 2007 [1 favorite]


Whoops, I meant to link to this ruby version shootout but I somehow forgot. YARV/Ruby 1.9 is a ruby-only bytecode while JRuby is on the JVM and Ruby.NET is on the CLR.
posted by Skorgu at 8:20 AM on December 4, 2007


How does a compiled LISP program even work? The whole awesomeness of LISP (and Tcl, which I don't see mentioned here) is building up executable statements at run time. These have to be interpreted since they didn't exist at compile time.

Guess: It compiles them at run time, then executes the bytecode. Don't let my (old-skool) boss hear you talk like that or you'll get an ear about theoretical wankery trumping performance.
posted by DU at 8:22 AM on December 4, 2007
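
DU's guess above (compile at run time, then execute the bytecode) can be sketched in Python rather than Lisp. This is only an illustration under that assumption: `make_adder_source` is a made-up helper, and the built-in `compile()` and `eval()` stand in for whatever a Lisp system would do internally.

```python
# Hypothetical example: the source text below does not exist until the
# program is already running, yet it is still compiled before it executes.
def make_adder_source(n):
    # Build an expression as a string at run time.
    return f"lambda x: x + {n}"

source = make_adder_source(5)

# compile() turns the freshly built source into a bytecode code object.
code_obj = compile(source, "<generated>", "eval")

# eval() runs that bytecode; here it yields a function we can call.
add_five = eval(code_obj)
result = add_five(10)
print(result)
```

The point is only that "built at run time" and "compiled" are not mutually exclusive: the compiler is simply available while the program runs.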


NERDS!

/still uses terrapin logo
posted by isopraxis at 8:22 AM on December 4, 2007


Also, the original CPython was based on a bytecode-based VM

CPython is very much alive, thank you, and is still based on a bytecode-based VM.

As for Python and Java, the Jython project recently developed a new compiler that can cross-compile CPython 2.5 bytecode to JVM bytecode, and run it with a Jython 2.2 runtime. That's slightly mind-boggling, if you're asking me.
posted by effbot at 8:26 AM on December 4, 2007 [1 favorite]


How does a compiled LISP program even work? The whole awesomeness of LISP (and Tcl, which I don't see mentioned here) is building up executable statements at run time.


"An Incremental Approach to Compiler Construction" [pdf] is a very accessible intro to compiler construction for a lisp-y language.

For a dynamically typed language like the lisps the compiler is basically compiling into something that can dispatch to the correct operation depending on the types. For performance there's all sorts of static analysis and type annotations that can upgrade that to machine operations (e.g. 'ADD') when possible.
posted by hupp at 8:49 AM on December 4, 2007 [4 favorites]
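
A rough sketch of the type dispatch hupp describes, written in Python for readability. All names here (`DISPATCH`, `generic_add`, the `add_*` handlers) are hypothetical and not taken from any real Lisp compiler; a real implementation would do this at a much lower level.

```python
# Illustrative only: a generic "+" that picks an operation from the
# run-time types of its operands.
def add_int(a, b):
    # With type information, a compiler could emit a single ADD here.
    return a + b

def add_float(a, b):
    return float(a) + float(b)

def add_str(a, b):
    return a + b

# Table keyed on the (left type, right type) pair.
DISPATCH = {
    (int, int): add_int,
    (int, float): add_float,
    (float, int): add_float,
    (float, float): add_float,
    (str, str): add_str,
}

def generic_add(a, b):
    handler = DISPATCH.get((type(a), type(b)))
    if handler is None:
        raise TypeError(f"cannot add {type(a).__name__} and {type(b).__name__}")
    return handler(a, b)

print(generic_add(2, 3))
print(generic_add(2, 0.5))
print(generic_add("byte", "code"))
```

When static analysis or an annotation proves both operands are integers, the lookup can be skipped and `add_int` used directly, which is the "upgrade to machine operations" case.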


Forgive an uninformed outsider's comments, but doesn't Flash fit into this category too? It's always struck me that probably the most widely-distributed virtual machine out there is usually left off these sorts of lists because it's "just Flash". It's turing-complete isn't it?
posted by bonehead at 9:04 AM on December 4, 2007


The idea of virtual machines/bytecode interpreters is a good one, but I've often wondered why they haven't gone the route of 68000 machine code. It's an extremely clean and straightforward architecture, and can be emulated at remarkable speed on modern machines. It also has the advantage of being fully implemented in hardware as well. I'm sure you could fit a 68000 chip on a grain of rice these days... so you could have a full hardware environment for things like phones, and a virtual one for things like computers, and (potentially) the ability to steal code from zillions of Amiga, Mac, and Atari ST applications.

There's been tons of emulation work...the interpreters for 68000 code are battle-tested and extremely, extremely fast.

I suppose it may come from the time when the idea was first invented... in 1997, emulating a 68000 was still fairly expensive, but with modern hardware, it costs almost nothing. I strongly suspect that a virtual 68000 would be far more capable than any of these bytecode VMs.
posted by Malor at 9:06 AM on December 4, 2007 [2 favorites]


Don't forget Iron Python and Iron Ruby.

I'm not an expert on Flash, but I've dabbled. ActionScript (the driving force behind anything useful in flash) is based on ECMAScript which is more commonly known as JavaScript. It's a late-bound, interpreted language not a JIT-ed language AFAIK.

It may be turing complete (or not?) but it ain't no real programming language... :)
posted by jeffamaphone at 9:15 AM on December 4, 2007 [1 favorite]


I strongly suspect that a virtual 68000 would be far more capable than any of these bytecode VMs.

Since they are all Turing machines, they are all equally capable.
posted by grouse at 9:23 AM on December 4, 2007 [1 favorite]


This Apple fanboi has to admit that they really haven't pulled the thumb out WRT architecture-neutral code infrastructure.

Since they knew they were moving from middling PPC to G5 to x86 earlier this decade -- thus having G3, G4, G5, PLUS x86 as target archs -- it really would have behooved everyone to get something like the CLR into the OS at the earliest possible opportunity.

The reason this didn't happen is that until the iPod zoomed away Apple's financial situation was dicey and they really couldn't throw a lot of bodies at the problem. (ca 2001 the entire compiler and low-level stuff team could eat at the same table at Caffe Macs).
posted by panamax at 9:31 AM on December 4, 2007


I strongly suspect that a virtual 68000 would be far more capable than any of these bytecode VMs.

WHO SAID THAT!?

That's my long-held thought too, how it's ironic that Apple abandoned the beauty of the 68K ISA when it could have gone the incrementalist x86 approach of "cracking" the front-end binary into intermediate machine-dependent code. If we're going to be stuck with a 70s-era programmer-visible ISA, why why why couldn't it have been 68K . . .
posted by panamax at 9:35 AM on December 4, 2007


The idea of virtual machines/bytecode interpreters is a good one, but I've often wondered why they haven't gone the route of 68000 machine code.

Because they're not bytecode interpreters, they're bytecode compilers. They compile bytecode down to real-time optimized native code. Also, the JVM is used and worked on far, far more than any other emulator. So I seriously doubt that it's gotten less attention than a 68k emulator.
posted by delmoi at 9:37 AM on December 4, 2007


Forgive an uninformed outsider's comments, but doesn't Flash fit into this category too? It's always struck me that probably the most widely-distributed virtual machine out there is usually left of these sorts of lists because it's "just Flash". It's turing-complete isn't it?

I think ActionScript is still just interpreted, like Javascript. I know ActionScript 3.0 is supposed to be 'more like java', but I don't know if that means it uses a bytecode compiler.
posted by delmoi at 9:43 AM on December 4, 2007


The JVM is hopelessly borked by its security model, which makes it impossible to implement tail call optimization, which prevents the efficient targeting of a huge swath of modern languages and constructs. Oh, well. Perhaps we'll get lucky in Java 7.
posted by cytherea at 9:47 AM on December 4, 2007


They compile bytecode down to real-time optimized native code.

And more importantly, keep re-compiling as the program runs. Some of the coolest HotSpot (the Java VM) optimizations come from run-time analysis of how the code is executing. For example, biased locking requires that the bytecode contains information about synchronization primitives, something that's likely lost if you compile all the way down to CPU instructions. This article talks about the VM replacing a virtual method call with a direct method call when it knows that there is only a single implementing class loaded... something you can't do without a reasonably high-level description of your program's structure, and something that you would be hard pressed to do if you're just emulating a 68k CPU.
posted by ny_scotsman at 9:50 AM on December 4, 2007


I think ActionScript is still just interpreted, like Javascript. I know ActionScript 3.0 is supposed to be 'more like java', but I don't know if that means it uses a bytecode compiler.

Yup. Adobe actually open-sourced the code, and their VM will be the foundation of Firefox 4, and, indirectly, Internet Explorer. And if you believe Brendan Eich, it's going to effectively be the future of software.
posted by gsteff at 10:00 AM on December 4, 2007


Er, it won't be the foundation of IE, but Tamarin will be able to be monkeypatched in.
posted by gsteff at 10:05 AM on December 4, 2007


I think ActionScript is still just interpreted, like Javascript. I know ActionScript 3.0 is supposed to be 'more like java', but I don't know if that means it uses a bytecode compiler.

Flash 9 uses a JIT compiler for everything except initialization functions, iirc. The change was made from Flash 8 to 9 which is why Flash 9 is often 10x faster than Flash 8. ActionScript 3 is based on ECMA 4, so it's basically Java-lite in syntax.
posted by ryoshu at 10:06 AM on December 4, 2007


I got a typewriter tied to a teevee, and I draw pictures on it!
posted by ba at 10:14 AM on December 4, 2007 [1 favorite]


There isn't anything new about this. But I thought that "bytecode" was called "PCode" (short for "pseudocode"). At least, that's what we used to call it when we used it to solve various problems back in the 1980's.
posted by Steven C. Den Beste at 10:53 AM on December 4, 2007 [1 favorite]


Nope, SCDB, bytecode != pseudocode. Bytecode is a low-level machine agnostic language. Pseudocode is a high-level language for describing algorithms which do not run on a computer.
posted by anomie at 11:02 AM on December 4, 2007


SCDB may have gotten the term slightly wrong, but back in the day there was the P-Code machine, a stack-based virtual machine with type-specific opcodes. Wikipedia says it was based on an even older (1966) "O-code".
posted by jepler at 11:31 AM on December 4, 2007 [1 favorite]
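
To make "stack-based virtual machine with type-specific opcodes" concrete, here is a toy interpreter in Python. The opcode names (`PUSHI`, `ADDI`, `MULI`, `PUSHR`, `ADDR`) are invented for this sketch and are not the actual P-code mnemonics; the only idea borrowed from the comment above is that integer and real operations get separate opcodes.

```python
# Toy stack machine: every instruction pushes to or pops from one stack.
def run(program):
    stack = []
    for op, *args in program:
        if op == "PUSHI":            # push an integer constant
            stack.append(int(args[0]))
        elif op == "PUSHR":          # push a real (float) constant
            stack.append(float(args[0]))
        elif op in ("ADDI", "ADDR"): # integer add / real add
            b = stack.pop()
            a = stack.pop()
            stack.append(a + b)
        elif op == "MULI":           # integer multiply
            b = stack.pop()
            a = stack.pop()
            stack.append(a * b)
        else:
            raise ValueError(f"unknown opcode: {op}")
    return stack.pop()

# (2 + 3) * 4, written in postfix order as a stack machine expects.
program = [("PUSHI", 2), ("PUSHI", 3), ("ADDI",), ("PUSHI", 4), ("MULI",)]
print(run(program))
```

Because the opcode itself says which type it operates on, the machine never has to inspect its operands, which keeps the interpreter loop simple.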


And don't forget the Z-machine, which is tied to many posts over time.
posted by jepler at 11:35 AM on December 4, 2007


JavaScript. It's a late-bound, interpreted language not a JIT-ed language AFAIK.
So.. just like Ruby, Python, etc?

JIT is an implementation detail, as is bytecode. Being late bound doesn't mean you can't JIT it, and using bytecode doesn't imply using JIT compilation. Nor does being interpreted imply bytecode for that matter.

Languages like JS, PHP, Ruby, etc, tend to go through three stages of implementation; the first are AST (Abstract Syntax Tree) walkers; they directly interpret a tree representation of code pretty much as it was written, meaning lots of jumping about a big pointer-heavy data structure.

The second generation, e.g. PHP 4+/Ruby 1.9+, instead translates the AST to bytecode, a stream of lower level instructions, and interprets that. This generally results in a big boost in performance because of things like being able to more easily optimize the simpler P-Code, and scanning a string instead of walking a tree makes somewhat better use of CPU caches and memory accesses.

JIT is a late-stage thing, because it's hard to do, especially in dynamic languages. But there's nothing stopping a suitably smart VM from doing it, even in scarily dynamic languages like JS, Ruby, Smalltalk, etc. You just don't see it very often because it's such a pain to do.

anomie: no. Words can have multiple meanings.
posted by Freaky at 11:38 AM on December 4, 2007 [1 favorite]
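
The first two stages Freaky describes can be shown side by side on a tiny arithmetic language. This is a sketch under simplifying assumptions: the AST is a nested tuple, only `add` and `mul` exist, and the function and opcode names are made up for the example.

```python
# AST for 1 + 2 * 3, as nested tuples: (kind, left, right) or ("num", value).
tree = ("add", ("num", 1), ("mul", ("num", 2), ("num", 3)))

# Stage 1: AST walker. Evaluates by recursing over the tree itself.
def walk(node):
    kind = node[0]
    if kind == "num":
        return node[1]
    left, right = walk(node[1]), walk(node[2])
    return left + right if kind == "add" else left * right

# Stage 2a: translate the tree into a flat list of instructions.
def compile_ast(node, out=None):
    out = [] if out is None else out
    if node[0] == "num":
        out.append(("PUSH", node[1]))
    else:
        compile_ast(node[1], out)
        compile_ast(node[2], out)
        out.append(("ADD",) if node[0] == "add" else ("MUL",))
    return out

# Stage 2b: interpret the flat instruction list with a stack.
def run_bytecode(code):
    stack = []
    for instr in code:
        if instr[0] == "PUSH":
            stack.append(instr[1])
        else:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b if instr[0] == "ADD" else a * b)
    return stack.pop()

bytecode = compile_ast(tree)
print(walk(tree), run_bytecode(bytecode))
```

Both paths give the same answer; the difference is that the second one runs a linear scan over a list instead of chasing pointers through a tree, which is where the cache-friendliness mentioned above comes from.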


Yeah, bad use of a comma by me. "Interpreted not jit-ed" was meant to be read separate from "late-bound." Lazy editing on my part. Thanks for clearing it up.

Also, I've only been using Flash MX (2002), which means Flash 6 and ActionScript 2.0. Glad to hear they've made it better. Someone should buy me CS3. :)
posted by jeffamaphone at 12:04 PM on December 4, 2007


Ah, wonderful P-Code, I remember you well... 1983/1984 summer Computer Camp at Virginia Tech, learning Pascal on a P-Code implementation. We made a 4x4 dungeon crawl and were dismayed that we couldn't take it home and play it because you needed a licensed P-Code VM to run the thing... Probably Turbo Pascal. Fond memories.

And I wish more people would get into Erlang. Developed for TeleComm switches, it's fault-tolerant, distributed, concurrent and allows hot-swapping of code. I've had Erlang applications run for 3+ years and then only restarted because we physically moved the server. Sigh.
posted by zengargoyle at 12:09 PM on December 4, 2007


This German guy seems to have reimplemented a p-code engine.
posted by meehawl at 1:04 PM on December 4, 2007


I can't let this thread go without a plug for CAL, a JVM-targeted Haskell-inspired lazy functional language developed by some of my colleagues at Business Objects and recently open-sourced. It has great two-way interop with Java code and some nice Eclipse tooling too. The performance is in the same ballpark as Scala and Nice, if the language shootout is any guide.
posted by pascal at 1:18 PM on December 4, 2007


So, I've got a long-standing question about bytecode: is there really much of an advantage in distributing bytecode and compiling it (like Java) versus distributing source and compiling it (like JavaScript)? Does doing the source-to-bytecode compilation ahead of time really improve performance? It seems like the real computational effort is going to be on the bytecode-to-machinecode end, not on the source-to-bytecode part. So why have the middle step at all? Why not just distribute source and do all the compilation on the user end?

It seems like that would give the most flexibility and allow for the greatest amount of optimization.
posted by Kadin2048 at 1:19 PM on December 4, 2007


It depends on how much optimization you're doing. Bytecode is certainly a more convenient format for a user. Source code can be fragile too. If you can compile it into bytecode, you have a guarantee that it works at least that much.

And, of course, some people don't want to share their source.
posted by grouse at 1:28 PM on December 4, 2007
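
grouse's point that compiling to bytecode guarantees the source "works at least that much" can be illustrated with Python's built-in `compile()`. The helper name `compiles` and the sample strings are hypothetical.

```python
# Sketch: compiling ahead of time rejects malformed source before shipping,
# but says nothing about errors that only appear when the code runs.
def compiles(source):
    try:
        compile(source, "<shipped>", "exec")
        return True
    except SyntaxError:
        return False

good = "print('hello')"
bad = "print('hello'"            # missing closing parenthesis
runtime_bug = "undefined_name + 1"

print(compiles(good))
print(compiles(bad))
print(compiles(runtime_bug))     # compiles fine, would fail when executed
```

So the guarantee is real but narrow: the syntax is valid, nothing more.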


anomie: no. Words can have multiple meanings.

I stand corrected. Sounds like bytecodes are at least as old as I am.
posted by anomie at 1:47 PM on December 4, 2007


"Bytecode-based virtual machines have been the Next Big Thing since 1997."

If by that you mean 1973 or 1966, yes.

http://en.wikipedia.org/wiki/P-code_machine
posted by muppetboy at 2:10 PM on December 4, 2007


jeffamaphone: Grab the Flex 2 SDK from Adobe and you can build Flash 9 swfs from the command line (or my favorite, using ant).

/derail
posted by ryoshu at 2:31 PM on December 4, 2007


It appears I phrased the initial sentence poorly. I meant to emphasize the diversity of languages that are now running on general bytecode VMs rather than the overall concept of compiling to bytecode. I guess I meant "targeting someone else's bytecode VM is the..."
posted by Skorgu at 4:13 PM on December 4, 2007


My job involves writing code for a hideous p-code generating language called dataflex. It's been about and in production since 1980.

I say it's hideous. Annoyingly, it's a fast language to code in. And it's rock solid. But nobody uses it and it's got a syntax so messed up you'd rather throttle your own grandmother than use it.

Rumour has it that it was used in early iterations of amazon.com, so in all probability it was used on the web way before Java became public.
posted by seanyboy at 4:47 PM on December 4, 2007 [1 favorite]


Wow, seanyboy. You must be one of the few people today who write code that could be run on CP/M.
posted by grouse at 4:57 PM on December 4, 2007


...ActionScript....
...going to effectively be the future of software...

Noooooooooo
posted by yoHighness at 5:09 PM on December 4, 2007


Hey! Why not write your own bytecode interpreter for a fictional machine, then use various compilers written in it (including a fake Unix command line) to solve various functional programming problems?
posted by blenderfish at 6:37 PM on December 4, 2007


JavaScript. It's a late-bound, interpreted language not a JIT-ed language AFAIK.

I don't know if you're aware of this, but JavaScript can be compiled into bytecode and run on the JVM. The project is called Rhino, and they're using it at Google.

In fact, one of the most interesting talks at the UIUC ACM conference this year was by Steve Yegge on this very subject. He wrote something he calls Rails on Rhino, which is the Ruby on Rails framework ported to JavaScript (minus the ActiveRecords part) and then run on the JVM via Rhino. He says the speed is simply amazing.

The video is on this page (scroll down). He talks a lot about targeting languages to the JVM, and in particular the problems they had with JavaScript.
posted by sbutler at 6:38 PM on December 4, 2007


It may be turing complete (or not?) but it ain't no real programming language... :)

I don't want to waste a lot of energy defending Actionscript, and I'd agree that -- though it definitely IS turing complete -- it's not a hardcore language on the level of lisp or C.

But (for those who care), if you formed your opinion of it prior to this year, you should take another look at it. It's now much closer to being Java than Javascript. It's strongly typed, contains packages, classes, interfaces, namespaces, blah, blah, blah.

Whatever it is, it's not the (inconsistent, clunky) toy it once was.
posted by grumblebee at 7:22 PM on December 4, 2007


Hey, Malor! Are there emulators for the 68010 (and its MMU)? Do they emulate it down to the level of barfing out internal state onto the kernel stack when there's an exception?

Is there an emulator for the Sun-2? (Or even the Sun-1?) That would rock!

I think the 68K instruction set is great, especially for assembly language programming. I wonder if something like a MIPS R3000 would also be a good target for this kind of stuff.
posted by Crabby Appleton at 8:07 PM on December 4, 2007


Scala is another java byte-code targeted functional language. I keep thinking I'll find the time to learn it.

re: Tail call optimizations: A discussion came up on LTU recently about the possibility of the JVM supporting it. It sounds like it could be coming, but not soon. Until then you have to trampoline, or something.
posted by Horselover Fat at 8:11 PM on December 4, 2007
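
For anyone unfamiliar with the trampolining mentioned above, here is a minimal sketch in Python (which, like the JVM, does not eliminate tail calls). The names `trampoline` and `count_down` are made up for the example. One caveat of this simple version: it assumes the final result is never itself callable.

```python
# Instead of making a tail call, a function returns a zero-argument thunk.
# The trampoline loop keeps calling thunks until a real value comes back,
# so the call stack never grows.
def trampoline(fn, *args):
    result = fn(*args)
    while callable(result):
        result = result()
    return result

def count_down(n, acc=0):
    if n == 0:
        return acc
    # Return a thunk describing the next step rather than recursing.
    return lambda: count_down(n - 1, acc + n)

# 100,000 "recursive" steps, far beyond Python's default recursion limit.
total = trampoline(count_down, 100_000)
print(total)
```

The cost is an extra allocation and an indirect call per step, which is why native tail call support in the VM would be preferable.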


In my youth, I wrote a Pascal-to-pcode compiler. I thought it was pretty neat. God, I'm old.
posted by SPrintF at 8:19 PM on December 4, 2007 [1 favorite]


...ActionScript....
...going to effectively be the future of software...
Noooooooooo


Hey, it could be Malbolge.
posted by ryoshu at 9:59 PM on December 4, 2007


grumblebee: I read someone's opinion a while ago (talking about Ruby) that people's "I can't tell you what it is but I know it when I see it" distinction between "real language" and "scripting tool" often coincides with whether the language's runtime is hardcoded in the interpreter/VM (like Ruby) vs. just being another library written in the language in question (like Java)...

and since nobody else has, I'll mention the Dis (also) virtual machine from Bell/Lucent/Vita Nuova's Inferno. The VM is register rather than stack based, and they've specifically pushed optimisation smarts into the compiler, so the VM can be dumb as a brick and still run nicely on embedded hardware.
posted by russm at 10:57 PM on December 4, 2007


russm: I would say weak typing (i.e., not having to declare locals) and automatic garbage collection are the two main things that differentiate a 'scripting' and 'real' language in my mind. (So, yes, that makes Java and C# half scripting languages to me.)
posted by blenderfish at 6:00 PM on December 5, 2007


What's wrong with GC'd languages? You could write a garbage collector for C++. In fact, it's been done.

Garbage Collection is awesome. If I never write another AddRef() / Release() call, my life will be bliss.
posted by jeffamaphone at 9:56 PM on December 5, 2007


What's wrong with GC'd languages?

Nothing at all, just like there's nothing "wrong" with cars with automatic transmissions. They're great for a lot--a majority, even, of people. Don't see racecar drivers using them, though. :)
posted by blenderfish at 1:55 PM on December 6, 2007


The computer:car analogy doesn't hold. All cars are generally the same. No two programs are even close to being alike, even if they purport to do the same thing. You have to consider the context. Write an operating system in Managed code? No. Write a web browser? Sure. Write today's random LOB app? Definitely.
posted by jeffamaphone at 2:00 PM on December 6, 2007


Yes, but I think what blenderfish is saying is that the hardcore languages make as few decisions for you as possible. Actionscript contains garbage collection, which is great for my purposes. But if I decide that I don't like its collection routines, I'm shit out of luck. I can't replace them with my own. Whereas in -- say -- C, you have a pretty barebones toolkit. You're always in the driver's seat.

I think a good analogy is cameras. Point-and-shoot cameras aren't bad. In fact, they're great. They're useful to thousands of people. But there's a certain type of photographer -- a real hardcore one -- that wants (and needs) total control. For his projects, he needs to be able to manipulate things that the rest of us don't touch.
posted by grumblebee at 2:06 PM on December 6, 2007


I think that is a much better analogy, yes.
posted by jeffamaphone at 2:09 PM on December 6, 2007


The camera analogy is a fine one, but it's not any better than the car analogy. Why do racers insist on manual transmissions? Because they want (and need) total control.

Garbage collection is a wonderful thing. Most of its early performance problems have been solved. These days, it's only a problem for real-time systems. But having GC doesn't mean you don't have to keep track of anything. You still have to keep track of references, or you can end up with a lot of memory that isn't eligible for GC.
posted by Crabby Appleton at 3:19 PM on December 6, 2007
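
Crabby Appleton's last point, that a forgotten reference keeps memory alive even under GC, can be demonstrated in a few lines of Python. The `Blob` class and `cache` list are hypothetical, and the behaviour shown assumes CPython, where reference counting reclaims an object as soon as its last reference goes away.

```python
import gc
import weakref

class Blob:
    """Stand-in for some large object."""

cache = []  # a long-lived container that is easy to forget about

def make_blob():
    blob = Blob()
    cache.append(blob)        # lingering strong reference
    return weakref.ref(blob)  # weak reference: does not keep blob alive

probe = make_blob()
gc.collect()
# Nothing else uses the blob, yet it cannot be collected: cache holds it.
alive_while_cached = probe() is not None

cache.clear()
gc.collect()
# With the stray reference gone, the collector is free to reclaim it.
alive_after_clear = probe() is not None

print(alive_while_cached, alive_after_clear)
```

The collector only frees what is unreachable, so deciding what stays reachable is still the programmer's job.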


jeffamaphone:
All cars are generally the same. No two programs are even close to being alike, even if they purport to do the same thing.

Just plain not true. Like at all. That's like (and last analogy, I PROMISE) saying "no two novels are even close to being alike, even if they're both mysteries." Computer science employs a remarkably small set of algorithms, constructs, paradigms, and idiosyncrasies. This is even more true as you start talking about particular problem domains. Neophytes and autodidacts reinvent/rediscover them on their own and think some combination of 1. 'gosh I'm really smart and special' and/or 2. 'if I can come up with this on my own, there must be infinitely many of these concepts, since so many people program'. Both not true. Things that are 'new' in programming are, 99 out of 100 times, just a gussied up version of something from twenty years ago. (See, for instance, p-code vs. Java, as discussed above.)

Write an operating system in Managed code? No. Write a web browser? Sure. Write today's random LOB app? Definitely.
Win the Indy 500 with an automatic transmission? No. Take a driving vacation along the east coast? Sure. Go to the grocery store? Definitely.

Anyway, I know analogies suck. Thanks to Grumblebee and Crabby Appleton for explaining things in a better way.
posted by blenderfish at 5:27 PM on December 6, 2007 [2 favorites]


All the "automatic transmission" analogies really assume some kind of lossy automatic, like the hydramatics we all grew up with or the current crop of CVTs. In a digital age it's easy to conceive of a more or less lossless "automatic" that is able to shift more optimally than 99.9% of drivers, as long as it knows what to optimize for (acceleration or efficiency).

There's probably a relevant further analogy in there, but I'm not going to reach for it.
posted by lodurr at 1:53 AM on December 7, 2007




This thread has been archived and is closed to new comments