# Migrating Bee Smalltalk to a Different Instruction Set Architecture

# An Experience Report on Porting a Dynamic Metacircular Runtime from x86 to AMD64

Javier Pimás Palantir Solutions SRL Buenos Aires, Argentina jpimas@palantirsolutions.com

# Abstract

We report our experience in porting the Bee Smalltalk Dynamic Metacircular Runtime (DMR) from the 32-bit Intel x86 instruction set architecture (ISA) to AMD64, as a first step towards a multi-platform Bee Smalltalk.

This port required subtle changes in most areas present in typical Virtual Machines (VMs): low-level object shape, JIT compiler, garbage collector, primitives and the foreign-function interface. We present a comprehensive analysis of the migration difficulties, and the key implementation and design decisions taken during our work in the context of Bee, which is implemented in terms of a Smalltalk DMR, in contrast to VMs written in languages like C/C++. Additionally, we describe the image-level mechanisms we devised to support the transition between 32-bit and 64-bit images, which can also be applied to traditional VM-based Smalltalks.

*Keywords* runtime, virtual machine, processor architecture, porting

# 1 Introduction

Bee [8] is an implementation of Smalltalk that runs without what is usually known as a VM. Instead, Bee is supported by a dynamic metacircular runtime library written in Smalltalk, which provides all the mechanisms that would usually be implemented by the VM: JIT-compiler, memory management, primitives and its foreign-function interface. Bee DMR is bootstrapped from a derivative of Digitalk Smalltalk running on top of a host VM. The first iteration of Bee was able to run on 32-bit Intel x86 Windows systems, and this paper presents our experience in porting it to 64-bit AMD64<sup>1</sup> Windows.

To Bee's development team, there were two main reasons for migrating from 32 to 64 bits: compatibility with 64-bit applications and the possibility of using more memory.

In this experience report we:

- Analyze which parts of Bee Smalltalk, as an archetype of Dynamic Metacircular Runtimes (DMRs), are dependent on or independent of the processor architecture.
- Provide the details that let Bee Smalltalk components which depend on the processor architecture vary accordingly.
- Present an iterative migration approach that was successfully applied to cross-compile it to another processor architecture, AMD64.

The process of porting from 32 to 64 bits involved solving multiple issues. We give a brief summary of them now:

- **Object format.** The layout of objects in memory in 64 bits does not need to be the same as the one in 32 bits. Our decisions regarding object format are explained in sections 4.1 and 4.5.
- **Bootstrapping.** To create a 64-bit system, it is necessary to generate a 64-bit image, packaged in a 64-bit executable. We detail these issues in sections 4.2 and 4.3.
- **Native-code generation.** Besides objects, a Smalltalk image contains code. In particular, the Bee kernel image contains native code. In order to create that native code, it was necessary to write an AMD64 assembler and to plug it into the Smalltalk-to-native compilers. Those issues are addressed in sections 4.6 and 4.7.
- **Integer representation.** The size of the word in the system affects how big the small integers can be, and required adapting code, as shown in section 4.8.
- **Foreign-function interface.** The differences in the 64-bit Windows calling-convention design affected the way Smalltalk code communicates with external code. It required changing how external functions are called from Smalltalk and how Bee Smalltalk handles external callbacks from C code. This is detailed in sections 4.10 and 4.11.
- **Chasing platform-dependent code.** A part of the migration work is chasing the remaining places where the code is dependent on the processor architecture. For example, this includes code that in traditional VMs is implemented as primitives and in the garbage collector. We describe this in sections 4.9, 4.12 and 5.

<sup>1</sup> Also known as x86-64 or, more succinctly, x64.

IWST'18, September 10–14, 2018, Cagliari, Italy

# 2 Bee Dynamic Metacircular Runtime Overview

Bee DMR is a Smalltalk implementation that runs without a traditional VM. It implements a just-in-time and an ahead-of-time compiler, neither of which is supported by a host VM. It is the Smalltalk runtime itself that supports the Smalltalk environment, in a way similar to how the Smalltalk metamodel is used to describe itself. This self-hosted runtime implementation approach is not new: it has previously been explored in projects like Klein [10], a Self implementation, and in the Jalapeño/Jikes [1] and Maxine [12] Java runtimes, to cite just some examples.

The main difference between the Bee runtime implementation and a traditional Smalltalk VM is that Bee has no primitives. Instead of using primitives, in Bee programmers can directly or indirectly alter the semantics of Smalltalk code. In particular, the semantics of message sends can be modified arbitrarily. This makes it possible, for example, to implement micro-operations known as underprimitives, which can be used to replace primitives. Additionally, the programmer can communicate directly with the native-code compiler to alter the shape of the emitted code. For example, it is possible to select which methods to optimize, which ones to inline, or which selectors to change from dynamic dispatch to static invocation. The code needed for replacing primitives is, for the most part, no different from the code used to write plain Smalltalk application code.

All of the Bee runtime, including its native-code compiler and memory manager, is written in Smalltalk. Consequently, the migration of Bee to the AMD64 platform only required working with either Smalltalk code or assembly code.

# 3 Differences between x86 and x86-64 platforms

AMD64 [5] is the 64-bit version of the 32-bit x86 architecture. It supports wider memory addresses (up to 64 bits in theory) and 64-bit registers and operations. The usual x86 registers have been extended with 64-bit versions. Additionally, 8 new 64-bit registers are available (R8 to R15), as shown in figure 1. The x86-64 instruction set is mostly backwards compatible with x86, in the sense that most instructions present in x86 are also present in x86-64 and are encoded in a similar way in both architectures. In many cases, x64 instructions analogous to their x86 counterparts are encoded exactly as in x86; in other cases, the x64 version requires adding prefixes. This is exemplified in figure 2.
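The prefix rule can be made concrete with a small sketch. The Python function below is an illustration only, not Bee's assembler: `encode_mov` and the two register tables are hypothetical names introduced here. It covers just the register-to-register form of `mov` (opcode `89`) and assumes the standard x86 register numbering. A REX prefix is emitted only when a 64-bit operand or one of the registers R8 to R15 is involved.

```python
# Register numbers follow the standard x86 encoding order.
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"] + \
        [f"r{i}d" for i in range(8, 16)]
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"] + \
        [f"r{i}" for i in range(8, 16)]


def encode_mov(dst, src):
    """Encode `mov dst, src` between two registers of the same width.

    Uses opcode 0x89 (MOV r/m, r): the ModRM `reg` field holds the source
    and the `rm` field holds the destination.
    """
    if dst in REG64 and src in REG64:
        wide, d, s = True, REG64.index(dst), REG64.index(src)
    elif dst in REG32 and src in REG32:
        wide, d, s = False, REG32.index(dst), REG32.index(src)
    else:
        raise ValueError("operands must be registers of the same width")

    # REX = 0100WRXB: W selects 64-bit operands, R and B carry the high
    # bit of the source and destination register numbers respectively.
    rex = 0x40 | (int(wide) << 3) | ((s >> 3) << 2) | (d >> 3)
    modrm = 0xC0 | ((s & 7) << 3) | (d & 7)

    # A bare 0x40 prefix adds nothing here, so it is omitted. R8..R15
    # (and therefore any REX prefix) only exist in 64-bit mode.
    prefix = [rex] if rex != 0x40 else []
    return bytes(prefix + [0x89, modrm])
```

With this sketch, `encode_mov("eax", "ecx")` yields the bytes `89 C8` and `encode_mov("rax", "rcx")` yields `48 89 C8`, matching the last two rows of figure 2.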

# 3.1 Calling Convention and Application Binary Interface Changes in Windows

In Windows-x86, there are two main calling conventions supported: *stdcall* and *cdecl* [7, 11]. They are similar, with EAX, ECX and EDX as caller-saved registers and arguments pushed onto the stack from right to left. The most notable difference is that in cdecl the caller is responsible for popping the arguments off the stack, while in stdcall this responsibility is assumed by the callee.

| x86 (32-bit) | AMD64 (64-bit) | AMD64 new (32-bit) | AMD64 new (64-bit) |
|--------------|----------------|--------------------|--------------------|
| eax          | rax            | r8d                | r8                 |
| ebx          | rbx            | r9d                | r9                 |
| …            | …              | …                  | …                  |
| esp          | rsp            | r14d               | r14                |
| ebp          | rbp            | r15d               | r15                |

**Figure 1.** To the left, the original x86 general purpose register set. To the right, the registers added in AMD64. For each x86 32-bit register, a 64-bit counterpart has been added; 8 new 64- and 32-bit registers were also added.

| instruction  | encoding x86 | encoding x64 |
|--------------|--------------|--------------|
| push ebp     | 55           | -            |
| push rbp     | -            | 55           |
| mov eax, ecx | 89 C8        | 89 C8        |
| mov rax, rcx | -            | 48 89 C8     |

**Figure 2.** Instruction encoding is kept similar. For some instructions like push, the same encoding is decoded differently in x86 and AMD64, to adapt to the new word size. For others like mov, specifying a 64-bit register requires adding a prefix.

On the other hand, in Windows-x64 only the cdecl convention is supported. The 64-bit version of cdecl, however, has a few differences compared to the 32-bit one: the caller-saved registers are RAX, RCX, RDX, R8, R9, R10 and R11; the first 4 arguments are passed in registers RCX, RDX, R8 and R9; there is a *shadow space* preallocated on the stack before the call; and the stack is 16-byte aligned, considering the arguments, at the instant before the call.<sup>2</sup> Calling conventions directly affect the foreign-function interface implementation, because they influence the native code needed to perform external calls, as explained later in section 4.10.
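As a rough illustration of what this convention asks of the caller, the following Python sketch plans a call with integer or pointer arguments only. The name `plan_call` is hypothetical, and the sketch makes two assumptions: the shadow space is the 32 bytes that Windows x64 reserves for the four register arguments, and the stack pointer is already 16-byte aligned before the space is reserved. Floating-point arguments, which use other registers, are left out.

```python
INTEGER_ARG_REGISTERS = ("rcx", "rdx", "r8", "r9")
SHADOW_SPACE = 32   # bytes the caller reserves for the four register arguments
WORD_SIZE = 8


def plan_call(arg_count):
    """Return (registers, stack_arg_count, reserved_bytes) for a call.

    `reserved_bytes` is how much the caller subtracts from the stack
    pointer so that the stack stays 16-byte aligned at the call.
    """
    registers = INTEGER_ARG_REGISTERS[:arg_count]
    stack_args = max(0, arg_count - len(INTEGER_ARG_REGISTERS))
    reserved = SHADOW_SPACE + stack_args * WORD_SIZE
    reserved = (reserved + 15) // 16 * 16   # round up to a multiple of 16
    return registers, stack_args, reserved
```

Under these assumptions, a call with two arguments still reserves the full 32 bytes of shadow space, and a call with five arguments reserves 48 bytes: 32 for the shadow space, 8 for the fifth argument and 8 of padding.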

In Windows, except for pointer data types, C types mostly keep their size. This is particularly true for the int and long data types, which remain 4 bytes wide. However, the change in pointer size affects the layout of structure fields in memory, as fields placed after a pointer in a C structure will see their offset increased. Data type size affects C structures and, as a consequence, also affects the foreign-function interface implementation.
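The effect on field offsets can be seen with a small layout calculator. This is a sketch under stated assumptions: `layout` is a hypothetical helper, every field is aligned to its own size, and the structure is padded to its largest alignment, which is the usual rule for simple C structures on Windows.

```python
def layout(fields, pointer_size):
    """Compute field offsets and total size of a C-like structure.

    `fields` is a list of (name, kind) pairs, where kind is 'int',
    'long' or 'pointer'. int and long stay 4 bytes wide on Windows.
    """
    sizes = {"int": 4, "long": 4, "pointer": pointer_size}
    offsets, offset, max_align = {}, 0, 1
    for name, kind in fields:
        size = sizes[kind]
        offset = (offset + size - 1) // size * size   # align the field
        offsets[name] = offset
        offset += size
        max_align = max(max_align, size)
    total = (offset + max_align - 1) // max_align * max_align
    return offsets, total


# A hypothetical structure: struct { int count; void *data; int flags; }
EXAMPLE = [("count", "int"), ("data", "pointer"), ("flags", "int")]
```

For `EXAMPLE`, a 4-byte pointer gives offsets 0, 4 and 8 with a total size of 12 bytes, while an 8-byte pointer gives offsets 0, 8 and 16 with a total size of 24 bytes. The `flags` field moves even though its own type did not change.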

Executable files and dynamically-linked libraries for Windows are stored in a format known as Portable Executable (PE). Files in this format contain not only the data and executable code of the program or library, but also a set of tables that describe what is stored in the file, such as exported function offsets, imported functions and relocation information. When 64-bit Windows was created, the PE format was expanded to support 64-bit executables. The new format, PE32+, is mostly equal to PE32, except for a few tables that store pointers, which were widened to 64 bits.

<sup>2</sup> Unlike in System-V/x64, where the stack is aligned before calls without considering the arguments.

# 4 Bee Execution Model and Migration to AMD64

The first step required to start the transition from x86 to AMD64 was to discover which parts of Bee were dependent on the processor architecture and which were not.

Bee has been designed and implemented in terms of an abstract register machine. This machine design is intended to be agnostic of the underlying processor architecture and word size. It consists of an abstract set of registers and operations that are converted to native code and data according to a concrete target architecture. Furthermore, Smalltalk images usually do their best to be independent of the running platform, improving portability.

However, in any Smalltalk system there are places where the processor architecture surfaces, as in the representation of numbers, in the foreign-function interface, and in calls to primitives. This happens in implementations that run on top of a typical VM, and Bee DMR adds to that list the set of components that implement the low-level interface to the platform. As an example, for x86, a set of compiler objects was implemented that allows native x86 code generation at different stages (inline assembly, baseline JIT, optimizing JIT). These objects create a layer of separation between architecture-specific and architecture-independent code. Yet, during migration we noticed that in practice there were other spots where x86 architecture details leaked into the runtime code. As Bee is a dynamic metacircular runtime, there is no clean separation between VM code and guest-language code, which in typical Smalltalks is mostly target-agnostic.

#### 4.1 Migration Approach for the AMD64 port

To facilitate the implementation of the AMD64 port, we chose to take an incremental development approach. We decided to make the minimal amount of modifications possible in order to have the AMD64 version fully functional in the shortest possible amount of time. These modifications were basically two:

- Widen the slots of objects from 32 to 64 bits.
- Implement AMD64 code generators.

In contrast, there was a set of design decisions *we chose not to change* at the same time, among them the object header format and the garbage collection algorithm [9]. We considered that, while widening to a 64-bit word size *lets* us implement more efficient algorithms in areas like GC, mixing these changes with the ones needed for the 64-bit port would lead to unnecessary instability of the system. We believe that those changes can be made in later stages. Limitations of our approach are described in section 5.2.

The two modifications we decided to carry out had diverse implications on the different components of Bee runtime. In the following paragraphs we provide a detailed description of them.

# 4.2 Bootstrapping

In order to establish the 64-bit system, we do not convert objects to 64 bits online; instead, we set up the bootstrap mechanism and the Smalltalk library writers to cross-compile a 64-bit system from the 32-bit one. This generates a new 64-bit executable that, when executed, already lives in the AMD64 world. It is not possible in Bee to use 32-bit and 64-bit objects at the same time. The 64-bit objects are serialized by the library writers at the cross-compilation step.

During this writing process a few objects need to be supervised:

- A set of Smalltalk globals are adapted according to the size of the target architecture. For example the global WordSize is set to 4 or 8 accordingly.
- SmallIntegers are migrated to the target word size.
- Methods for accessing external pointers are chosen and installed depending on the target word size.
- Classes representing external structures are updated to use the correct field offsets, as explained in 4.10.
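The word-size-dependent part of this writing step can be pictured with a minimal sketch. Bee's library writers are written in Smalltalk and are not shown in this paper; `write_slots` below is a hypothetical Python stand-in that only illustrates how the same list of slot values serializes to a different number of bytes depending on the target word size. Little-endian byte order is used because both x86 and AMD64 are little-endian.

```python
import struct

# struct format codes for an unsigned little-endian word of each size
WORD_FORMATS = {4: "<I", 8: "<Q"}


def write_slots(slots, word_size):
    """Serialize a sequence of unsigned slot values for the target word size."""
    fmt = WORD_FORMATS[word_size]
    return b"".join(struct.pack(fmt, slot) for slot in slots)
```

Calling `write_slots([1, 2], 4)` produces 8 bytes and `write_slots([1, 2], 8)` produces 16 bytes. The writer running on the 32-bit system can thus emit 64-bit data without 64-bit objects ever existing in its own heap.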

At the same time, a few initialization steps were added, so that constant objects are set to the appropriate values at launch time. For example, objects like the null ExternalAddress or ExternalHandle have to be created with the appropriate number of bytes.

#### 4.3 OS Executable File Format

In typical Smalltalk VMs, which are written in or translated to C/C++, the executable code of the VM is stored in PE files by the C/C++ compiler, which understands PE and can output executables in that format. In Bee, the executable runtime image is bootstrapped from a set of objects that represent code and data. Those objects are packed into a PE file by a mechanism that models the PE format in Smalltalk. For this reason, the migration of Bee DMR to the x64 platform started with the implementation of the PE32+ format.
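To give an idea of where PE32 and PE32+ files differ at the header level, the sketch below reads the two fields that identify the variant: the machine type in the COFF header and the magic number of the optional header. It is written in Python for illustration and is unrelated to Bee's Smalltalk model of the format; `describe_pe` and `minimal_header` are hypothetical names, and `minimal_header` builds only enough bytes to exercise the reader, not a loadable executable.

```python
import struct

PE32_MAGIC, PE32_PLUS_MAGIC = 0x10B, 0x20B
MACHINE_I386, MACHINE_AMD64 = 0x014C, 0x8664


def describe_pe(data):
    """Return (machine, optional_header_magic) for the bytes of a PE file."""
    if data[:2] != b"MZ":
        raise ValueError("missing DOS header")
    # The DOS header stores the offset of the PE signature at 0x3C.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    # The machine type is the first field of the COFF header.
    (machine,) = struct.unpack_from("<H", data, pe_offset + 4)
    # The optional header follows the 4-byte signature and the
    # 20-byte COFF header, and starts with its magic number.
    (magic,) = struct.unpack_from("<H", data, pe_offset + 24)
    return machine, magic


def minimal_header(machine, magic):
    """Build just enough bytes for describe_pe to read (illustration only)."""
    data = bytearray(0x40 + 26)
    data[:2] = b"MZ"
    struct.pack_into("<I", data, 0x3C, 0x40)
    data[0x40:0x44] = b"PE\0\0"
    struct.pack_into("<H", data, 0x40 + 4, machine)
    struct.pack_into("<H", data, 0x40 + 24, magic)
    return bytes(data)
```

A writer that targets both formats has to choose these two values consistently and then switch the width of the pointer-sized fields that follow them.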

#### 4.4 Change in Word Size

In Bee, as in any Smalltalk, the vast majority of the classes are independent of the word size. Only classes that implement lower-level components need adjusting to the system word size. The most notable ones in Bee were Process, Thread, Memory, StackFrame, ExternalHandle, ExternalAddress, FFIMethod, SmallInteger and LargeInteger.

In Bee, there are only two types of objects in the heap: byte objects and pointer objects. Pointer object slots are accessed indirectly via instance variable reads and writes (emitted by the JIT), or directly through the #\_basicAt: and #\_basicAt:put: underprimitives, explained in section 4.9. In both cases, the offsets calculated for object or stack slots are never written explicitly, but computed by the JIT compilers or by the implementation of #\_basicAt:(put:). This meant having single points of modification for adapting to the desired word size. The offset calculation for instance variables was adapted by touching only the JIT and a couple of low-level methods, and this was enough to adapt slot access in pointer objects. Byte objects, on the other hand, are mostly independent of the word size, except in two cases: when handling large integers, and for objects that represent addresses, like ExternalAddress. However, in those cases the changes needed were small: switching from the constant 4 (for the word size) to the global value WordSize or, in other cases, using a constant associated with the word size (e.g. the maximum SmallInteger). Those constants are stored in pool dictionaries, which can either be changed during bootstrapping or be initialized at start-up time.
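The idea of deriving every such constant from the word size can be sketched as follows. This is a Python illustration, not Bee code: `pool_constants` and the key names other than `WordSize` are hypothetical, and the sketch assumes that small integers give up a single tag bit, which this section does not state for Bee.

```python
def pool_constants(word_size, tag_bits=1):
    """Build the word-size-dependent constants for a target architecture.

    `tag_bits` is an assumption of this sketch: the number of bits of a
    word that are used to tag small integers.
    """
    value_bits = word_size * 8 - tag_bits
    return {
        "WordSize": word_size,
        "MaxSmallInteger": (1 << (value_bits - 1)) - 1,
        "MinSmallInteger": -(1 << (value_bits - 1)),
        # A null address is a byte object with one word of zero bytes.
        "NullAddressBytes": bytes(word_size),
    }
```

Code that reads these values from one place, instead of hard-coding 4 or a 31-bit limit, needs no further change when the target word size becomes 8.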

Finally, we implemented a set of objects that represent the different application binary interfaces (ABIs) in use: X86ABI and X64ABI. These objects provide the knowledge needed to adapt code generators to the underlying architecture. For example, they provide the mapping from abstract registers to concrete ones. There are 8 classes that use those objects, all used by the different native-code compilers: the baseline JIT, the register allocator, the assembly-code emitter, and other stages of code optimization.
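A minimal sketch of this arrangement, written in Python rather than Smalltalk, is shown below. Only the mapping of the abstract register R to eax and rax is taken from the paper's listing; the mappings chosen here for S, SP and FP, the `Emitter` class and its textual output are assumptions made for illustration.

```python
class X86ABI:
    word_size = 4
    # R -> eax comes from the paper; the other mappings are assumed.
    registers = {"R": "eax", "S": "esi", "SP": "esp", "FP": "ebp"}


class X64ABI:
    word_size = 8
    # R -> rax comes from the paper; the other mappings are assumed.
    registers = {"R": "rax", "S": "rsi", "SP": "rsp", "FP": "rbp"}


class Emitter:
    """Emits textual assembly in terms of abstract registers."""

    def __init__(self, abi):
        self.abi = abi
        self.lines = []

    def reg(self, abstract_name):
        return self.abi.registers[abstract_name]

    def compare_r_with_s_index(self, index):
        # Slot indices are 1-based; the ABI supplies the word size.
        displacement = (index - 1) * self.abi.word_size
        self.lines.append(
            f"cmp {self.reg('R')}, [{self.reg('S')}+{displacement}]"
        )
```

The emitter never names a concrete register or a word size itself. Swapping `X86ABI` for `X64ABI` changes both the register names and the slot displacement in the emitted instruction.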

#### 4.5 Bee Object Memory Format

In Bee's memory, objects were stored in a format designed for 32-bit architectures [8]. In that format, objects contain an 8-byte or 16-byte header that specifies the object size, its type and some other properties, such as whether it contains pointers or bytes. Our desire to allow for using more memory directly affected the format of objects in memory: the most straightforward way of allowing this is to widen object pointers to 64 bits, which in our case meant doubling the size of object slots in memory.

#### 4.6 Assembly Encoder

Bee contains two main native-code compilers: a baseline JIT and an optimizing compiler. Both of them use the same assembler as a back-end, which has been designed in terms of the abstract register machine previously mentioned. In this machine, there is an R register for passing a receiver or returning a value, an A register used mostly for storing an argument, a T register for a temporary, S for storing self, E for the current closure environment, and finally SP and FP for the stack top and frame pointers respectively. This results

| `X86ABI>>#regR` |
|---|
| `^eax` |
| |
| `X64ABI>>#regR` |
| `^rax` |
| |
| `BaseAssembler>>#and: op1 with: op2` |
| `self encode: 'and' with: op1 with: op2` |
| |
| `BytecodeAssembler>>#andRwithA` |
| `self encode: 'and' with: abi regR with: abi regA` |
| |
| `BytecodeAssembler>>#compareRwithSindex: index` |
| `pointer` |
| `reset;` |
| `length: abi addressLength;` |
| `base: abi regS;` |
| `displacement: index - 1 * abi wordSize.` |
| self encode: 'cmp' with: abi regR with: pointer                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      | 404                                                                                                   |
| <b>Figure 3.</b> X86ABI and X64ABI answer a different R register.<br>The assembler delegates slot indexing to the abi object as<br>much as possible.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 | 405<br>406<br>407<br>408<br>409                                                                       |
| in 7 registers which is almost the same amount of general purpose x86 registers.<br>In x64, the number or registers has been doubled, but Bee abstract machine has been kept without major modifications. The concrete registers used in x86 are replaced with their expanded 8-byte counterparts. The assembler instruction encoding interface works at two levels: on one hand, it provides methods to encode instructions passing concrete registers; on the other hand, it provides an API to pass abstract registers, which are mapped one-to-one to concrete ones according to the target platform. For both of those two levels, the assembler API does not directly expose x86 or x86-64 instructions. Instead, operations provided to the client of the assembler interface are more abstract. Examples of these operations are things like pushing and popping values into | 411<br>412<br>413<br>414<br>415<br>416<br>417<br>418<br>419<br>420<br>421<br>422<br>423<br>424<br>425 |
| and out of the stack, loading and storing values from and to                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         | 425                                                                                                   |
| memory, or performing arithmetic and logical operations in                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           | 427                                                                                                   |
| registers.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           | 428                                                                                                   |

Figure 3 shows the implementation of typical methods of the assembler interface, and how the assembler uses the ABI objects described in section 4.4 to abstract away the differences between x86 and x64. Those objects provide the set of registers available in the platform, a mapping from abstract registers to concrete ones, and the target word size. They also allow the assembler to transform pointer indexing operations into slot offsets according to the word size.
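The role of the ABI objects can be illustrated with a minimal Python sketch. It is not Bee code: the `ABI` class, its method names, and the concrete registers chosen for R and S are assumptions made for illustration only. The displacement follows the expression of Figure 3, where Smalltalk evaluates binary messages left to right, so `index - 1 * abi wordSize` means `(index - 1) * wordSize`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ABI:
    """Hypothetical stand-in for the X86ABI and X64ABI objects."""
    word_size: int   # bytes per slot
    registers: dict  # abstract register name -> concrete register name

    def reg(self, abstract_name):
        # Map an abstract register to the concrete one of this target.
        return self.registers[abstract_name]

    def slot_displacement(self, index):
        # 1-based slot index -> byte offset from the base register.
        return (index - 1) * self.word_size


# The concrete registers below are assumptions, not the mapping used by Bee.
X86 = ABI(word_size=4, registers={"R": "eax", "S": "esi"})
X64 = ABI(word_size=8, registers={"R": "rax", "S": "rsi"})


def compare_r_with_s_index(abi, index):
    # Textual rendering of the operand that compareRwithSindex: would encode.
    base = abi.reg("S")
    return f"cmp {abi.reg('R')}, [{base}+{abi.slot_displacement(index)}]"
```

With these assumed mappings, the same client call yields `cmp eax, [esi+8]` for `compare_r_with_s_index(X86, 3)` and `cmp rax, [rsi+16]` for `compare_r_with_s_index(X64, 3)`. The client never mentions a concrete register or a byte offset.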

x86 instruction encoding is a complex process. The assembler delegates this task to another object: the InstructionEncoder. This encoder is in charge of writing the instruction prefixes, opcode and operands into a machine code stream. Notably, the same encoder can write both x86 and AMD64 instructions, because their encoding is similar.

The new assembler interface. Implementing the new x86-64 assembly encoder was the most time-consuming task of the whole migration project. The main reason was that we did not do a direct port from the x86 assembler to an x86-64 one. The original x86 assembly encoder was targeted to be used mostly by the JIT compiler. This meant that its API only provided for instructions used by the JIT, which were directly translated to the bytes that encoded those instructions. For that reason, native-code stubs needed for callback and dispatch optimization mechanisms could not be written using our original x86 assembler, and were hard-coded as byte arrays. Those arrays were created by writing and assembling them with external tools. The new assembler was designed to be more generic so that we could change to dynamically generated stubs, written in terms of the Bee abstract machine, working in both architectures. The greatest challenge in this area was implementing the logic behind instruction encoding in x86-64, which is very complex. Additionally, the new assembler is capable of encoding instructions for both x86 and x86-64 modes.

Instructions with 64-bit immediates. x86-64 operand encoding presents a subtle but important limitation compared to x86. While in x86 it is possible to encode immediate values of the word size (32 bits), in x64 immediate operands are limited to only 4 bytes. The only exception to this limitation is the movabs instruction, which allows encoding 8-byte immediate values into 64-bit registers. This limitation makes it harder to manipulate pointer and small integer values in JIT-compiled code. In x64 we added an abstract V register, which is used to overcome this limitation. When needed, operations with 8-byte immediates (e.g. pushing a pointer in 64 bits) are done in two steps: first the immediate is moved to the V register using movabs, then V is pushed. The V register is mapped to R11 in x64. This approach is not the only possible one. Another solution is to store pointers separately from native code, and to only let native code manipulate them indirectly, through some base register. For example, in x64, it is possible to use RIP-relative addressing, so a pointer table could be stored after each method's native code. In native code, pointers could then be accessed via instructions like mov rax, [rip+k] or push [rip+k], where k is an offset from the instruction to an entry in the table. We did not implement that solution for two reasons: such a change would be against our minimal modifications approach (section 4.1), and RIP-relative addressing is not available in 32-bit x86, so we would have needed to use different pointer encoding techniques for x86 and x86-64. Yet another possibility would be to split pointer loading into smaller 32-bit pieces combined through bit-shift and bit-or operations. We discarded that technique as we considered it too complex. For example, the garbage collector would need logic to reconstruct each object pointer in JIT-compiled code, as the original pointers would be split across several instructions.
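The two-step push can be made concrete with a small Python sketch that emits the bytes of `movabs r11, imm64` followed by `push r11`. The function names are hypothetical and the sketch covers only these two instruction forms. It is not the InstructionEncoder of Bee.

```python
import struct


def encode_movabs(reg_number, imm64):
    """Encode `movabs r64, imm64`: REX.W prefix, opcode B8+rd, 8 immediate bytes."""
    # REX.W selects the 64-bit operand size; REX.B extends the register to r8..r15.
    rex = 0x48 | (0x01 if reg_number >= 8 else 0x00)
    opcode = 0xB8 + (reg_number & 7)
    return bytes([rex, opcode]) + struct.pack("<Q", imm64)


def encode_push(reg_number):
    """Encode `push r64`: opcode 50+rd, with a REX.B prefix for r8..r15."""
    prefix = bytes([0x41]) if reg_number >= 8 else b""
    return prefix + bytes([0x50 + (reg_number & 7)])


R11 = 11  # concrete register that the abstract V register is mapped to in x64


def push_pointer(imm64):
    # Step 1: load the 8-byte immediate into V. Step 2: push V.
    return encode_movabs(R11, imm64) + encode_push(R11)
```

For instance, `push_pointer(0x1122334455667788)` produces the 12 bytes `49 BB 88 77 66 55 44 33 22 11 41 53`. The pointer appears as one contiguous little-endian 8-byte field, which is what keeps it easy for a garbage collector to locate, in contrast to the split-load alternative discussed above.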

#### 4.7 Native-Code Compilers

The baseline JIT communicates with the assembler in terms of abstract instructions and registers. This makes its transition to 64 bits mostly transparent. Even the registers used are abstract, and the assembler translates them to concrete ones.

For the optimizing compiler, on the other hand, there is almost no mapping from abstract to concrete registers. This compiler talks directly to the target ABI objects described in section 4.4, to obtain the set of available registers in the architecture. It also asks for the concrete registers assigned to the receiver and the return value. With all that information in hand, the register allocator can assign registers to the values of its intermediate representation and delegate encoding to the assembler. Intermediate operations, on the other hand, are target-agnostic. As a last step, the machine-code emitter talks to the assembly encoder, which converts them to x86 or x64 instructions accordingly.

Figure 4 shows various snippets of the native-code compilers. The first example shows the assembler interface used by the baseline JIT, which abstracts away concrete register names. The assembler, according to its own configuration, is in charge of converting abstract registers to concrete ones. To the client, the assembler interface does not expose specific instructions of the x86 or x64 architectures. In fact, a single operation, from the client's point of view, could be implemented by the assembler as a list of concrete machine instructions. In the last example we see the code emitter for the optimizing compiler. In this case, the other level of the assembler is used. The intermediate operations are assigned two concrete registers, and finally the assembler is told to emit machine code to apply a bitwise or on them.

#### 4.8 Integer Representation

Bee integers are divided into SmallIntegers and LargeIntegers, as usual in many Smalltalks. Both types of integers are stored in a two's complement representation. The code of those classes is dependent on the word size but mostly independent of the processor architecture (especially because of the use of the abstract assembler). The methods had to be reviewed so that they would adapt to the word size. Examples of this were SmallInteger>>sizeInBytes, SmallInteger>>bitShift: and LargeInteger>>reduce.
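To illustrate why such methods depend on the word size, the following Python sketch computes the range of values that fits in a tagged small integer. The paper does not state the tagging scheme at this point, so the single tag bit is an assumption of the sketch, as are the function names.

```python
def small_integer_range(word_size, tag_bits=1):
    """Smallest and largest two's complement values that fit in one tagged word.

    word_size is in bytes; tag_bits=1 is an assumption, not a documented Bee fact.
    """
    value_bits = word_size * 8 - tag_bits
    return -(1 << (value_bits - 1)), (1 << (value_bits - 1)) - 1


def fits_small_integer(value, word_size):
    # Values outside the range would need a LargeInteger representation.
    low, high = small_integer_range(word_size)
    return low <= value <= high
```

Under that assumption, a value such as `2**40` needs a LargeInteger on a 4-byte word but fits in a SmallInteger on an 8-byte word, so any method that hard-codes the 32-bit limits changes behavior when the word size changes.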

#### 4.9 Primitives and Underprimitives

Unlike traditional Smalltalk VMs, Bee DMR has no primitives. The code for what is usually represented as a primitive is instead coded in Smalltalk, using *underprimitives*, which can be seen as very specific fragments of primitive operations like accessing raw pointers in memory, or doing low-level arithmetic and logical operations on registers. The only trick is that at specific points, for special message sends, the JIT compiler inserts special inline assembly code instead of doing a lookup dispatch. The methods that replace primitives end up being mostly independent of the architecture, and the low-level underprimitive details were the only things that required adapting. In this port, as the assembler already provided the abstract interface, the only requirement was to take into account that the word size could be 4 or 8 bytes.

<pre>
LoadArgumentBytecodeNativizer>>assemble
    assembler loadRwithFPindex: self argumentIndex

JumpBytecodeNativizer>>assemble
    | target |
    target := methodNativizer labelAt: self target.
    assembler jumpTo: target

LookupLinker>>emitSend: aSelector using: anAssembler
    | send |
    send := SendSite sending: aSelector with: lookup.
    anAssembler
        load: anAssembler regA withPointer: send oop;
        holdJustEmmitedReferenceTo: send;
        callIndirectReg: anAssembler regA

BareMetalCodeEmitter>>assembleBitOr: instruction
    | left right |
    left := allocation at: instruction left.
    right := allocation at: instruction right.
    assembler or: left with: right
</pre>

**Figure 4.** Typical snippets of the different code generators communicating with the assembler.

In the case of the object headers, as they remained unchanged, few modifications were needed. The access to the bits present in object headers is reified in the class ObjectHeader. This serves as a single point of modification if any change to the header format is made. For byte accesses (i.e. object flag accesses) there was nothing to do. For slot accesses, used for behavior and extended object size, care had to be taken to always use 32-bit wide reads and writes. Otherwise, when targeting 64 bits, the assembler would incorrectly use a slot size of 64 bits to access these 32-bit fields.

Figure 5 shows an example of a part of the at: primitive implementation. basicObjectAt: is the part used for accessing slot objects. It does not contain any platform-dependent code. basicObjectIndexOf: receives a slot index and returns another one (depending on the object type, it may differ from the one received). Finally, the \_basicAt: underprimitive is inlined by the JIT to actually access that slot in loadRwithRatA. The slot offset is calculated at execution time, by shifting the index 2 or 3 bits according to the word size.

<pre>
Object>>#basicObjectAt: grossIndex
    | index |
    index := self basicObjectIndexOf: grossIndex.
    ^self _basicAt: index

InlineMessageLinker>>#assembleBasicAt
    | nonInteger |
    nonInteger := assembler labeledIntegerNativizationOfA.
    assembler
        loadRwithRatA;
        @ nonInteger
</pre>

**Figure 5.** A part of the implementation of the at: primitive.
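The inlining mechanism can be pictured with a toy Python model in which the JIT either inlines an underprimitive or emits an ordinary send. The class name, the pseudo-instruction strings and the lookup call are all hypothetical. The sketch also ignores integer tagging and the index base, and only shows that the word size enters through the shift amount.

```python
WORD_SIZE_SHIFT = {4: 2, 8: 3}  # log2 of the word size in bytes


class ToyNativizer:
    """Hypothetical miniature of a JIT that inlines underprimitive sends."""

    def __init__(self, word_size):
        self.shift = WORD_SIZE_SHIFT[word_size]
        self.code = []  # pseudo-instructions, not real machine code
        self.inlined = {"_basicAt:": self.assemble_basic_at}

    def assemble_basic_at(self):
        # Scale the slot index held in A into a byte offset, then load the slot.
        self.code.append(f"shl A, {self.shift}")
        self.code.append("mov R, [R + A]")

    def emit_send(self, selector):
        assembler_method = self.inlined.get(selector)
        if assembler_method is not None:
            assembler_method()  # underprimitive: inline code, no lookup
        else:
            self.code.append(f"call lookup {selector}")  # ordinary message send
        return self.code
```

In this model the Smalltalk source of basicObjectAt: stays identical for both targets. Only the nativizer configured with a word size of 4 or 8 emits a different shift.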

#### 4.10 Foreign-Function Interface

The foreign-function interface of a VM usually comprises calling external C functions, accessing external C structures, and being called by external code through callbacks. In Bee DMR, all of those things are implemented in Smalltalk.

The calling convention for Windows x64 is cdecl, which works in a very similar way to its 32-bit x86 counterpart. The main difference is that the first 4 arguments (from left to right) are not passed on the stack but through registers. However, the convention demands a shadow stack space of the same size as those 4 registers. This fact helped us to reuse the 32-bit cdecl code, which passes all arguments through the stack. For the 64-bit version, we just add a final step before the call to the external function, to offload the contents of the topmost 4 stack slots to registers.
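The reuse of the stack-based code path can be sketched in Python as follows. The function and its list-based stack model are illustrative assumptions, not the Bee implementation. Index 0 of the list stands for the top of the stack at the moment of the call, and padding the shadow space to four slots reflects the Windows x64 rule that the shadow space is always reserved.

```python
ARG_REGISTERS = ("rcx", "rdx", "r8", "r9")  # Windows x64 integer argument registers


def prepare_call(args, word_size):
    """Model the argument area right before calling an external function.

    Returns (stack, registers), where stack[0] is the topmost slot.
    """
    stack = list(args)  # 32-bit path: every argument goes on the stack
    registers = {}
    if word_size == 8:
        # The shadow space always spans 4 slots, even with fewer arguments.
        while len(stack) < len(ARG_REGISTERS):
            stack.append(0)
        # Final extra step for x64: copy the topmost 4 slots into registers.
        for name, value in zip(ARG_REGISTERS, stack):
            registers[name] = value
    return stack, registers
```

The slots that were copied stay where they are and double as the shadow space, so the 64-bit path is the 32-bit path plus one register-loading step.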

Support for accessing external C structures in the different architectures did not present big obstacles, but required solving a subtle discrepancy: for the same C structure, the memory representation can vary depending on the processor architecture and the platform. This means that the size and offset of the fields in the structures can vary depending on the platform.

Bee resolves this problem mostly automatically. First, all C structures are represented with classes that implement the class-side method def. This method returns for each class the corresponding C structure definition, as a string. During development time, a parser reads those definitions and generates accessor methods for each field with the correct size and offset. The offsets are specified through pool variables. The parser generates two analogous pool dictionaries. The keys in those dictionaries are the names of the fields in the definition method, and the values are their corresponding offsets, for 32 bits in one dictionary and for 64 bits in the other. Field offsets are never specified using constants but using their corresponding automatically generated pool variable.

<pre>
typedef struct tagCOPYDATASTRUCT {
                       // off-32  off-64  size
    ULONG_PTR dwData;  // 0       0       4/8
    DWORD     cbData;  // 4       8       4
    PVOID     lpData;  // 8       16      4/8
} COPYDATASTRUCT, *PCOPYDATASTRUCT;

COPYDATASTRUCT>>cbData
    ^self longAtOffset: cbData

COPYDATASTRUCT>>dwData: anInteger
    self pointerAtOffset: dwData put: anInteger
</pre>

**Figure 6.** A C structure of Windows. The size of dwData is 4 or 8 bytes. The offset of cbData and lpData is different in 32 and 64 bits. The lpData offset is affected by padding in 64 bits. The total size of the struct is 12 bytes in 32 bits and 24 bytes in 64 bits. A parser calculates all this automatically and generates accessor methods.

When bootstrapping the system, the library generator iterates through all classes that correlate to external C structures and accordingly tweaks them to use the pool dictionaries that correspond to the target architecture. Figure 6 shows an example of a Windows C structure, with the field sizes and offsets. The snippet also shows two of the automatically generated methods to exhibit how the generated code adapts to the varying sizes and offsets.
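The offset computation behind the two pool dictionaries can be approximated with a short Python sketch. It is a simplification under stated assumptions: only the three scalar types of Figure 6 are known, every field is aligned to its own size, and the names `SIZES`, `layout` and `pool_for` are invented for the example.

```python
# Type name -> (size in bytes on 32 bits, size in bytes on 64 bits).
SIZES = {"ULONG_PTR": (4, 8), "DWORD": (4, 4), "PVOID": (4, 8)}

COPYDATASTRUCT = [("ULONG_PTR", "dwData"), ("DWORD", "cbData"), ("PVOID", "lpData")]


def layout(fields, bits):
    """Return ({field name: offset}, total size) for a 32- or 64-bit target."""
    column = 0 if bits == 32 else 1
    offsets, offset, max_align = {}, 0, 1
    for type_name, field_name in fields:
        size = SIZES[type_name][column]
        align = size  # assumption: scalars are aligned to their own size
        offset = (offset + align - 1) // align * align  # insert padding if needed
        offsets[field_name] = offset
        offset += size
        max_align = max(max_align, align)
    total = (offset + max_align - 1) // max_align * max_align  # tail padding
    return offsets, total


# One pool per architecture, chosen once when the target is known.
POOLS = {bits: layout(COPYDATASTRUCT, bits)[0] for bits in (32, 64)}


def pool_for(bits):
    return POOLS[bits]
```

Running `layout(COPYDATASTRUCT, 32)` gives offsets 0, 4 and 8 with a total size of 12, and `layout(COPYDATASTRUCT, 64)` gives offsets 0, 8 and 16 with a total size of 24, matching Figure 6. Accessors that refer to `cbData` by name rather than by constant pick up the right offset from whichever pool is selected.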

From the point of view of the clients of the foreign-function interface, little attention to processor architecture is needed when calling external functions. The generated structure bindings contain functionality to allocate them in the external C heap. Calculation of the correct structure size is abstracted from the client. A manual code review of the clients of the foreign-function interface was still required, to detect places where external pointers were incorrectly assumed to be 4 bytes.

#### 4.11 Callbacks

For C callbacks, adding support for x64 required only small changes. When a callback is received by Bee, a native-code stub saves the processor registers according to the calling convention and sends a Smalltalk message to the object that owns that stub. As the last step of the callback handling, the processor registers are restored and a return is issued. The original callback stub was written in assembly, and the encoding of the assembly instructions was stored in a byte array. We changed this to use the new abstract assembler interface. The code is generated using abstract register names, which makes most of the code independent of the processor architecture. Between x86 and x86-64, there are two main differences though: in x86-64 some arguments are passed in registers; and, as x86 defaults to the stdcall calling convention while x86-64 defaults to cdecl, stack clean-up has to be done slightly differently. The difference in argument passing between x86 and x86-64 is solved in a way analogous to what is done for C function calling. At the prologue of the callback stub, the registers RCX, RDX, R8 and R9 are copied back to the stack in the shadow space. This allows the callback handling code that follows the stub to read all arguments from the stack, independently of whether the platform is 32 or 64 bits. On callback exit, the only difference is that for 32 bits the callback clean-up code has to pop the arguments from the stack, while in x64 the arguments are popped by the caller.
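The prologue and exit steps just described can be modeled in a few lines of Python. The stack is represented as a list whose first element is the return address, followed by the argument slots. This representation and the three function names are assumptions of the sketch.

```python
SHADOW_REGISTERS = ("rcx", "rdx", "r8", "r9")


def callback_prologue(stack, registers, word_size):
    """Spill register arguments into the shadow space (64 bits only).

    stack[0] is the return address; stack[1:] are the argument slots.
    """
    if word_size == 8:
        for slot, name in enumerate(SHADOW_REGISTERS, start=1):
            stack[slot] = registers[name]
    return stack


def read_argument(stack, index):
    # Same code for both targets once the prologue has run (0-based index).
    return stack[1 + index]


def bytes_popped_by_callee(argument_count, word_size):
    # 32-bit stdcall: the callee removes its arguments; x64: the caller does.
    return argument_count * word_size if word_size == 4 else 0
```

After the prologue, `read_argument` does not need to know whether an argument arrived in a register or on the stack, and the only target-specific decision left at exit is the number of bytes to release.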

#### 4.12 Garbage Collection

Adjustments needed for Bee's garbage collector [9] have been minimal. Of the 83 methods that compose the garbage collection algorithm, only 5 needed any change. There were two types of modifications:

- Behavior Slot. Code that dealt with the behavior slot in object headers had to be tuned, because generic reads and writes of slots (which were used originally in 32-bit code) would become 64-bit memory accesses in x64. Instead, as the behavior slot in object headers is only 32 bits wide, a special 32-bit access has to be done both in x86 and x64.
- Forwarding Index. In different phases of the Bee GC copying algorithm, object addresses are converted to forwarding indexes. This is done by subtracting the base address of a GC space from the object address, and then shifting the result 2 or 3 bits, depending on whether the target is 32 or 64 bits. For that, the WordSizeShift global was used.
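Both adjustments can be sketched in Python. The helper names, the little-endian layout and the position of the behavior field inside the header are assumptions of the example. Only the subtraction, the shift by 2 or 3 bits and the fixed 32-bit access come from the description above.

```python
import struct

WORD_SIZE_SHIFT = {4: 2, 8: 3}  # value of the shift for each word size


def forwarding_index(address, space_base, word_size):
    # Distance from the start of the GC space, measured in words.
    return (address - space_base) >> WORD_SIZE_SHIFT[word_size]


def address_of(index, space_base, word_size):
    # Inverse conversion, from a forwarding index back to an address.
    return space_base + (index << WORD_SIZE_SHIFT[word_size])


def read_behavior(memory, behavior_offset):
    # Always a 32-bit read, regardless of the target word size.
    return struct.unpack_from("<I", memory, behavior_offset)[0]
```

A generic word-sized read at the same offset would also pull in the four bytes that follow the behavior field on a 64-bit target, which is the bug that the explicit 32-bit access avoids.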

# 5 Discussion

Bee DMR is fully written in Smalltalk, in contrast to typical Smalltalk VMs written in C/C++, and also different from Squeak's VM, which is written in Slang but then translated to C [4]. This poses a significant difference between Bee and others: in Bee, the majority of memory accesses are done in terms of type-less object pointers.

One of the most feared challenges predicted before the migration started was the move from a 32-bit/4-byte architecture to a 64-bit/8-byte one. The expected difficulty was that detecting all the places where Smalltalk code assumed a word size of 4 bytes would be hard. In practice, this problem did not turn out to be as tough as expected. Detection of the methods that required changes was done manually, looking at methods of classes that were suspicious, and also by searching for senders of very low-level messages. In total, around 120 methods in the whole system depend on the contents of the global WordSize. Less than a dozen methods use a similar variable called WordSizeShift, which is used when converting pointers to indexes and vice versa, via shifting operations. There are 7 classes that use a wordSize instance variable: about half of them are used for the assembler/disassembler, the rest for building code libraries.

#### 5.1 Debugging

Debugging the 64-bit version of Bee posed a challenge, especially at the beginning of the process. We used a combination of the IDA debugger and disassembler [2] and our own native-code debugger. With both of them, an initial effort had to be made to make debugging scripts portable to both x86 and AMD64, and to abstract away the differences that objects present in memory.

While most of the Smalltalk methods that had to be changed were detected during a Smalltalk code review process, a small number of them were not found initially and were only caught at run-time, when they would end up crashing the system.

#### 5.2 Limitations

The migration approach described in section 4.1 has limitations:

- The size field in the header of objects is limited to 32 bits, which in practice means that the biggest object can be 32GB.
- Pointers to behavior are 32 bits wide, so behavior objects have to live in the lower 4GB area of memory. This limitation was introduced to ease the migration from 32 to 64 bits, but added a series of drawbacks. As behaviors have to live in the lower memory, special care has to be taken when instantiating behaviors (putting them in a special GC space if needed). Smalltalk libraries also have to be treated specially, loading them below the 4GB limit or migrating their behaviors to the lower area after loading them.
- The GC algorithm has not been modified, and is not optimized for heaps with a size in the order of gigabytes or bigger. Currently, the only implemented approach is a scavenging copying collector (which is disabled by default in the old space). Enabling it in such scenarios may introduce undesirable pauses to the system.
- At present, it is only possible to cross compile from x86 to x64, but not the other way around. While going from x64 to x86 should be straightforward, we have not made any attempt to support that functionality yet.

#### 5.3 Lessons

The target-agnostic assembler interface proved useful to build a compilation framework that can freely change between processor architectures. The combination of an assembler API that accepts both abstract registers and concrete ones, with the help of the X86ABI and X64ABI objects, allowed us to share the code of all stages of compilation in both architectures. Smalltalk code is inherently independent of processor architecture, and this extended to the kernel that forms Bee DMR. Components like the assembler required modifications, not because they depended on the architecture, but because they modelled the architecture.

In a Smalltalk image, the foreign-function interface is the main platform-dependent piece of code of user applications. If the C bindings are generated automatically, the migration task is reduced to a minimum, only requiring the developer to deal with corner cases. This experience showed the value of adopting a systematic approach for handling code that communicates with external libraries: having a parser that reads C structure definitions and generates Smalltalk accessors frees the programmer from writing error-prone boilerplate code.

We did not do a comparable port of a traditional VM, so we cannot directly contrast how this work would apply in that situation. However, some points can still be evaluated. We do not see any obstacle to using the same assembler design in a traditional VM. Moreover, at image level, automatic generation of bindings for C structures is directly applicable to any Smalltalk. We still do daily development of Bee on top of a host VM. This had the advantage of letting us work freely on the native-code compilers and assembler, because they were not in use at the same time they were being modified.

# 6 Validation

To assess the performance of the 64-bit port we ran a series of benchmarks. In particular we used the Are We Fast Yet benchmarks [6], ranging from micro- to macro-benchmarks. DeltaBlue and Richards [13] are classic benchmarks evaluating the performance of object-oriented applications. Havlak [3] is an optimization algorithm for a compiler, but it is also representative of many application-level optimization problems. The Json benchmark parses a larger JSON document, which is relevant for the performance of many REST services used in today's web applications or microservices. The rest are a collection of numerical and OO benchmarks stressing particular aspects of the implementation.

The benchmarks were run on a machine with a 2.8 GHz 4-core Core i7 7700HQ with hyperthreading and 16GB of memory. The operating system is 64-bit Windows 10. Measurements were taken by collecting 50 iterations for each benchmark.

For the performance comparison, we consider peak performance only and discount start-up, warm-up, and JIT-compilation times.

The benchmarks were run with an initial heap size of 64 MB, to minimize noise introduced by the GC. The results are normalized to the 32-bit version of Bee DMR, which serves as the baseline for the performance comparison. We report averages and 95% confidence intervals ($\alpha = 0.05$).
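The normalization and interval computation can be sketched as below. The paper does not state how the intervals were derived, so this is one plausible reading under explicit assumptions: a normal approximation is used, and the baseline mean is treated as exact. The function name is hypothetical.

```python
from statistics import NormalDist, mean, stdev


def normalized_mean_with_ci(samples, baseline_samples, confidence=0.95):
    """Mean of samples relative to the baseline mean, with a confidence interval.

    Assumes a normal approximation and ignores the uncertainty
    of the baseline mean itself.
    """
    base = mean(baseline_samples)
    normalized = [s / base for s in samples]
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * stdev(normalized) / len(normalized) ** 0.5
    m = mean(normalized)
    return m, (m - half_width, m + half_width)
```

Here `samples` would hold the 64-bit iteration times of one benchmark and `baseline_samples` the 32-bit ones. A normalized mean below 1 means the 64-bit version was faster on that benchmark.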


**Figure 7.** Normalized macro- and micro-benchmarks execution times, relative to 32-bit Bee (lower is better).

The results are shown in Figure 7. We can observe that the performance of the 64-bit version is competitive with the 32-bit one. The results are mixed, ranging from taking 12% more time in the worst case (Havlak) for the 64-bit version, to being 16% faster in the best case (DeltaBlue). The algorithms used to implement the 64-bit version of Bee DMR are the same as the ones used in the 32-bit implementation. In 64-bit Bee, pointer objects can be almost double the size of their 32-bit counterparts. However, the AMD64 architecture seems to be optimized to keep up with the increased memory access demands. On the other hand, in 64 bits, small integers can represent bigger numbers than in 32 bits, which could help to improve the performance of integer arithmetic operations. Taking all these characteristics into account, the results are not surprising.

# 7 Future Work

The work presented here gave us strong evidence of the usefulness of the migration approach described in section 4.1. It gives us confidence in the potential to use that same approach to expand Bee to other platforms. It remains to be explored how this same design would hold up when porting to less similar processor architectures such as ARM or RISC-V. We would also like to establish what the obstacles are in bootstrapping back from a 64-bit system to a 32-bit one. Finally, now that we have full support for 64-bit environments, we have to discover how to manage the huge amounts of memory that the system can theoretically handle; this will mostly affect the design of Bee's garbage collector.

# Acknowledgments

The author wants to thank Leandro Caniglia, Valeria Murgia, Jan Vrany and the rest of the development team of Palantir Solutions for providing valuable ideas, discussions and reviews, and being in charge of the development and maintenance of Bee runtime libraries. This work was funded by Palantir Solutions.

### References

- [1] B. Alpern, C. R. Attanasio, J. J. Barton, M. G. Burke, P. Cheng, J.-D. Choi, A. Cocchi, S. J. Fink, D. Grove, M. Hind, S. F. Hummel, D. Lieber, V. Litvinov, M. F. Mergen, T. Ngo, J. R. Russell, V. Sarkar, M. J. Serrano, J. C. Shepherd, S. E. Smith, V. C. Sreedhar, H. Srinivasan, and J. Whaley. 2000. The Jalapeño virtual machine. *IBM Systems Journal* 39, 1 (2000), 211–238. https://doi.org/10.1147/sj.391.0211
- [2] Chris Eagle. 2011. The IDA pro book. No Starch Press.
- [3] Robert Hundt. 2011. Loop recognition in C++/Java/Go/Scala. Proceedings of Scala Days 2011 (2011), 38.
- [4] Dan Ingalls, Ted Kaehler, John Maloney, Scott Wallace, and Alan Kay. 1997. Back to the Future: The Story of Squeak, a Practical Smalltalk Written in Itself. In Proceedings of the 12th ACM SIGPLAN Conference on Object-oriented Programming, Systems, Languages, and Applications (OOPSLA '97). ACM, 318–326. https://doi.org/10.1145/263698.263754
- [5] Intel. 2018. Intel® 64 and IA-32 Architectures Software Developer's Manual. Volume 1: Basic Architecture 2 (2018).
- [6] Stefan Marr, Benoit Daloze, and Hanspeter Mössenböck. 2016. Cross-Language Compiler Benchmarking—Are We Fast Yet?. In Proceedings of the 12th Symposium on Dynamic Languages (DLS'16). ACM, 120–131. https://doi.org/10.1145/2989225.2989232
- [7] Microsoft Corporation. 2018. Argument Passing and Naming Conventions. (2018). https://msdn.microsoft.com/en-US/library/984x0h58.aspx [Online; accessed 20-July-2018].
- [8] Javier Pimás, Javier Burroni, and Gerardo Richarte. 2014. Design and implementation of Bee Smalltalk runtime. (2014).
- [9] Javier Pimás, Javier Burroni, Jean Baptiste Arnaud, and Stefan Marr. 2017. Garbage Collection and Efficiency in Dynamic Metacircular Runtimes. In Proceedings of the 13th ACM SIGPLAN International Symposium on Dynamic Languages (DLS'17). ACM, 12. https://doi.org/10.1145/3133841.3133845
- [10] David Ungar, Adam Spitz, and Alex Ausch. 2005. Constructing a metacircular Virtual machine in an exploratory programming environment. In OOPSLA '05: Companion to the 20th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications. ACM, 11–20. https://doi.org/10.1145/1094855.1094865
- [11] Wikipedia contributors. 2018. X86 calling conventions. (2018). https://en.wikipedia.org/w/index.php?title=X86\_calling\_conventions&oldid=850925564 [Online; accessed 20-July-2018].
- [12] Christian Wimmer, Michael Haupt, Michael L. Van De Vanter, Mick Jordan, Laurent Daynès, and Douglas Simon. 2013. Maxine: An Approachable Virtual Machine for, and in, Java. ACM Trans. Archit. Code Optim. 9, 4, Article 30 (Jan. 2013), 24 pages. https://doi.org/10.1145/2400682.2400689
- [13] Mario Wolczko. 1996. Benchmarking Java with Richards and Deltablue. (1996). http://www.wolczko.com/java\_benchmarking.html
