A New Mapfile Syntax for Solaris

Ali Bahrami — Thursday January 07, 2010

Surfing with the Linker-Aliens

In the previous entry, I discussed at length the problems and misfeatures of the original Solaris mapfile language that we inherited with System V Release 4 Unix. The original language was not designed to be extended, yet we've built on top of it for 20+ years. Although we could continue to do so, we have come to a point where a new language that retains the good features of the old, while addressing its shortcomings, would pay dividends.

My project to create a replacement mapfile language is in its final stages. I believe that the resulting syntax is simple, highly readable, and easily extended. Yet, the result is also highly evolutionary. I think anyone who knows the old language will have little difficulty understanding and quickly putting the new one to use. The implementation is complete, and I've used it to build a copy of the Solaris OSnet workspace with all of its mapfiles rewritten using the new syntax. Yesterday, the PSARC case for this work was approved, a significant milestone:

PSARC/2009/688 Human readable and extensible ld mapfile syntax

We're currently in a restricted build period leading up to the release of the next OpenSolaris, and this work will have to wait to integrate until after that, probably in the second half of February. However, the work is essentially done, and this seems like a good time to get some information about it into circulation.

The case materials for PSARC/2009/688 include a replacement mapfile chapter for the Solaris Linker and Libraries Guide. The old chapter will be preserved as an appendix for the benefit of those needing to decrypt existing mapfiles. Until this new material appears in the published manual, I hope you will find this HTML version helpful.

There is little reason to repeat the information in that document here. Instead, I would like to describe the underlying principles we used to design this new language, and to provide a series of examples in which a single item is expressed in both the old and new syntaxes. I think that these examples probably offer the fastest way for someone who already knows the old syntax to start using the new one. I will refer to the Linker and Libraries Manual frequently in this discussion, often using the abbreviation LLM.

Design, Testing, and Base Principles

The new syntax was developed in an iterative manner, starting with a paper design, written in the form of a replacement for the current LLM mapfile chapter, and progressing to implementation and testing with real mapfiles. With each iteration, I would take the lessons learned, debate and discuss the options with my fellow linker alien Rod Evans, and alter the design to address the issues and move forward. As might be expected, there were false starts, and surprises along the way, but eventually things solidified around the final design.

Once we had a final design and a working implementation of it, I modified our linker tests so that each test that uses a mapfile now does so twice, once with the old syntax, and once with the new. This has two important benefits:

I can ensure that the new syntax can do anything the old one can (modulo a few obscure features not taken forward), by comparing the two resulting objects to make sure they are identical.
The old syntax will continue to be used, so it will not fail due to bit rot.

As I iterated through the design process, I developed and refined the following list of requirements and observations that in turn guided following iterations. Listed in no particular order:

We must offer full support for mapfiles in the original syntax. There can be no abrupt translation and cutover, as there are too many of these files in existence. We hope people will convert in time because the new language is better and easier, and because new features will only appear in the new, but there will be no forced conversion. This implies that we must have the concept of mapfile syntax version, with the old syntax being version 1, and the new version 2. Unfortunately, the default must be version 1 for backward compatibility. The link-editor must be able to cheaply and unambiguously determine which version it is reading from a given file before it has to actually interpret a statement from the file.
A given mapfile must contain only version 1 or version 2 syntax (no mixing within a single file). However, a given link-editor invocation can have more than one mapfile, and each mapfile is free to use either syntax without regard to the syntax used by the others.
The mapfile version must be a characteristic of the file itself (i.e. determined by the file contents), and not require a different ld command line option. Hence, the -M option is used for mapfiles of either version.
My study of other mapfile languages, previous efforts within the linker group, our code, and the mapfiles in the OSnet, all convince me that the current mapfile language is semantically at the right level. It need not be higher or lower level than it is, and the basic concepts are fine. The problem we need to solve with this project is primarily one of syntax.
A user familiar with the old mapfiles, upon encountering a mapfile written in the new syntax, should immediately be able to recognize it as a linker mapfile and understand its contents.
The scope/version part of the old mapfile syntax is pretty good, well liked, and widely used. We believe that the vast majority of existing mapfiles only use this part of the old syntax. A new syntax built using this as a starting point would help with the familiarity requirement above.
We don't see terseness as an inherently bad thing. However, the old syntax is too terse, and we're willing to be a bit more verbose in order to be a lot more readable. It should be possible to read almost any mapfile and understand its meaning without resorting to a reference manual to decode things.
The syntax for all directives should follow a single standard generalized form, rather than being invented ad-hoc for each directive. There should be none of the "a $ prefix means this here, but something else over there" that characterizes the old syntax.
The magic character nature of the original language must not be carried forward. All mapfile directives should be identified via a unique and mnemonic keyword as their first token (e.g. LOAD_SEGMENT, SYMBOL_VERSION, CAPABILITY, etc).
Special characters (e.g. =, *, ;, {}, etc) can be used in the style of a programming language like C, to define the core syntax of the language. For example, ';' can terminate statements, {} can group items, and = can be used to assign a value to something. However, they must not be used to identify directives as was done in the old version. Special characters are part of the underlying language, not of any particular directive, and must express the same concept wherever they are used.
It should be simple and easy to add an absolutely huge number of new directives, and/or to add a vast number of new options to existing directives, in a backward compatible manner. This is not because we want a huge language (we don't), but because failing to plan for expansion was a key failing of the old language, and we're not going to let that happen again.
A linker mapfile language should be something that a programmer can comfortably edit with a standard text editor, just like code, and the other things that a programmer edits. I am not anti-IDE, but I am anti-required-IDE.
A couple of years ago, I did an XML based mapfile prototype, to determine if that would be a good direction for mapfiles. My conclusion is that it is not. XML is too verbose to be comfortably hand edited, and the XML boilerplate gets in the way. I don't think it is a good fit for a linker mapfile. However, XML does have some useful lessons to teach us, particularly, that the syntax should be simple, and regular. Although we will not use XML, it will be a good thing if the syntax is easily translatable to/from XML using nothing more than simple perl or python. This will serve to make sure we end up with a simple flexible language, and leave the door open to a future XML variant, should that prove interesting.
The new syntax should be able to produce an object identical to that produced by the old syntax, without going to extreme or confusing lengths to achieve it. However, it is acceptable to drop support for a small number of marginal features from the original (e.g. reserved segments), as long as it is possible to add them later should we miss them. The original syntax remains available for the few cases requiring dropped obscure features.
The translation from the old to the new syntax must be straightforward so that a programmer can convert their mapfiles to the new syntax without too much effort.
The internal concepts of segment, and entrance criteria list are good, and should be retained. However, unlike the version 1 syntax, this should all be done within the context of segment definition, rather than having separate segment definition, and section to segment assignment statements. Furthermore, there should be a separate segment directive for each type of supported segment, that only accepts attributes that make sense for segments of that type. This will eliminate a class of error possible in the old syntax, where you attempt to set attributes that are nonsensical for the segment type.
The link-editor contains a built in set of default segments with known names (text, data, bss, ...), and of entrance criteria that direct sections from input files to these segments in the output object. The version 1 syntax is not powerful enough to describe these built in items. As a matter of principle, the version 2 syntax should be able to do this. I view this as a matter of language completeness. (Note: The new Linker and Libraries manual referenced above contains an example of using the new syntax to define the built in segments and entrance criteria).
It would be nice to have a simple mechanism with the ability to conditionalize mapfile lines based on the target platform. In the old syntax, we've observed that frequently, there are multiple, largely identical, per-platform mapfiles that differ in minor ways. A common example is that of setting a different virtual address for a segment in 32 and 64-bit objects. Another is a symbol that only exists on one platform, for historic, or ABI related reasons. These multiple mapfiles represent needless clutter, and are an opportunity to introduce accidental inconsistencies into the varying objects.
Something along the lines of what the C preprocessor allows with #if/#endif would fit the bill. However, we have no desire to have a macro facility, or for most of what CPP does. Just conditional input. If you want more, you can use a real preprocessor (like m4, or even cpp), but in the mapfile language, we want something extremely simple that just solves this one little problem.

New Syntax Overview

The full definition of the version 2 mapfile language can be found online. As mentioned earlier, I won't be repeating that information here. Instead, I'll provide a high level overview, with an eye towards showing how the wish list from the previous section was fulfilled.

A version 2 mapfile can contain two types of directive:

Control directives, which all start with the '$' character, and which control how the mapfile itself is interpreted.
Regular directives, which specify information regarding the output object being linked. Regular directives all start with a mnemonic name that identifies them, such as LOAD_SEGMENT, or SYMBOL_VERSION, and they use a uniform syntax.

As with the version 1 syntax, '#' is the comment character. A # on a line, and everything following it, is ignored by the link-editor, as are empty lines.

The first non-comment, non-empty, line in a version 2 mapfile must be the control directive:

$mapfile_version 2

Any mapfile that does not start with this line is interpreted as a version 1 mapfile, in which case the full original syntax is supported.
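
The version rule above is cheap to apply because only the first significant line matters. As a rough illustration, here is a minimal Python sketch of that rule. The function name mapfile_version is hypothetical, and this is not the link-editor's actual implementation; it only assumes what is stated above: '#' starts a comment, empty lines are ignored, and anything other than the $mapfile_version control directive on the first real line means version 1.

```python
def mapfile_version(text: str) -> int:
    """Return 2 if the first significant line is '$mapfile_version 2', else 1."""
    for raw in text.splitlines():
        # '#' starts a comment in both syntaxes; drop it and surrounding blanks.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        # The first non-comment, non-empty line decides the version.
        return 2 if line.split() == ["$mapfile_version", "2"] else 1
    return 1  # nothing but comments: fall back to the compatible default


# Two small inputs: an old-style mapfile and a new-style one.
old = "# header comment\ntext = V0x80008000;\n"
new = "# header comment\n\n$mapfile_version 2\nLOAD_SEGMENT text { VADDR = 0x80008000 };\n"

print(mapfile_version(old), mapfile_version(new))
```

Because the decision is made before any statement is interpreted, the rest of the file can be handed to whichever parser matches.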

Control Directives

Aside from $mapfile_version, there are control directives that provide a conditional input facility that can be used to restrict specific mapfile lines to specific platforms:

$if expr
...
[$elif expr]
       ...
[$else]
...
$endif

The sole purpose of this facility is to allow you to write something like

$if _sparc && _ELF32
    32-bit sparc thing
$elif _x86 && _ELF64
    64-bit x86 thing
$else
    others
$endif

as a way to handle minor per-platform variations in an otherwise identical mapfile.

Users of C and related languages will instantly recognize this as being very similar to the C preprocessor, with '$' substituted for '#'. That is true, but the similarity is superficial:

Mapfiles have no macro concept.
The expressions evaluated by $if are purely logical (boolean true/false), with no concept of numeric values, and significantly simpler than those of CPP.

I had a few reasons for making '$' the character for control directives:

1. To give C programmers a strong visual hint that they're not using CPP, and should have different expectations. As I mentioned earlier, if you need a macro pre-processor, Unix has many available that you can use outside the link-editor.
2. To preserve '#' as the mapfile comment character.
3. '$' has no previous meaning at the start of a mapfile statement in the original version 1 syntax.

Reasons 2 and 3 both relate to the fact that the link-editor reads the mapfile to determine which version of syntax is being used. By keeping the same comment character, and using a character for control directives not already used at the start of a statement by the old syntax, the link-editor can safely read and discard opening header comments, locate the first statement in the file, and unambiguously determine if the mapfile is using version 1 or version 2 syntax.

There are a small number of predefined values available for use in $if/$elif expressions:

_ELF32   _ELF64
_sparc   _x86
true

I expect these to be sufficient for nearly any mapfile. However, the $add control directive exists to define new values, and $clear to remove them. $add might be used to define convenient shorthand for longer expressions. For example, if you were writing a mapfile that had a large number of special cases involving the 64-bit x86 architecture, a definition like the following might be convenient:

$if _ELF64 && _x86
$add amd64
$endif
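
To make the purely logical nature of these expressions concrete, here is a small Python sketch of an evaluator. It is an illustration under assumptions, not the link-editor's code: only '&&' appears in the examples above, so the '!', '||', and parenthesis handling is assumed C-like behavior, and the function names tokenize and evaluate are hypothetical. A name is simply true if it is in the set of defined values, so $add and $clear reduce to set insertion and removal.

```python
import re

# A token is '&&', '||', '!', '(', ')', or a name.
TOKEN = re.compile(r"\s*(&&|\|\||[!()]|[A-Za-z_][A-Za-z0-9_]*)")


def tokenize(expr):
    tokens, pos = [], 0
    expr = expr.rstrip()
    while pos < len(expr):
        m = TOKEN.match(expr, pos)
        if not m:
            raise ValueError(f"bad token at {expr[pos:]!r}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def evaluate(expr, defined):
    """Evaluate a boolean expression; a name is true if it is in 'defined'."""
    tokens = tokenize(expr)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def primary():
        tok = take()
        if tok == "!":
            return not primary()
        if tok == "(":
            value = or_expr()
            if take() != ")":
                raise ValueError("missing ')'")
            return value
        if tok is None or tok in ("&&", "||", ")"):
            raise ValueError("expected a name")
        return tok in defined

    def and_expr():
        value = primary()
        while peek() == "&&":
            take()
            value = primary() and value   # parse the operand first, then combine
        return value

    def or_expr():
        value = and_expr()
        while peek() == "||":
            take()
            value = and_expr() or value
        return value

    result = or_expr()
    if peek() is not None:
        raise ValueError("unexpected trailing tokens")
    return result


# Values assumed to be predefined when linking a 64-bit x86 object.
defined = {"_ELF64", "_x86", "true"}
if evaluate("_ELF64 && _x86", defined):
    defined.add("amd64")            # models: $add amd64
print(evaluate("amd64", defined))
```

There are no numbers and no macro expansion anywhere in this model, which is the point of the design: a set of names and three logical operators are enough.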

Lastly, the $error directive allows you to make your mapfiles safe against attempts to use them in an unexpected context. The text following the directive is issued as a fatal error by the link-editor, which then exits. I expect it to be used as follows:

$if _sparc
sparc thing
$elif _x86
x86 thing
$else
$error unknown platform
$endif

The error message includes the mapfile name, and the line number where the $error directive was encountered.
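
The way $if, $elif, $else, $endif, and $error interact can be sketched as a line filter. The following Python is a simplified model, not the real link-editor logic: select_lines and MapfileError are hypothetical names, expressions are limited to names joined by '&&', malformed input (such as a stray $endif) is not diagnosed, and the exact layout of the error text is an assumption beyond the fact that it carries the mapfile name and line number.

```python
class MapfileError(Exception):
    """Stands in for the fatal error that $error makes the link-editor issue."""


def select_lines(name, text, defined):
    """Return the regular lines that survive conditional processing."""
    defined = set(defined)   # copy, since $add and $clear modify it
    stack = []               # one frame per open $if
    kept = []

    def truth(expr):
        # Simplification: only '&&' between names is supported here.
        return all(tok.strip() in defined for tok in expr.split("&&"))

    for lineno, raw in enumerate(text.splitlines(), 1):
        line = raw.split("#", 1)[0].strip()
        parts = line.split(None, 1)
        word = parts[0] if parts else ""
        rest = parts[1] if len(parts) > 1 else ""
        active = stack[-1]["active"] if stack else True

        if word == "$if":
            hit = active and truth(rest)
            stack.append({"parent": active, "taken": hit, "active": hit})
        elif word in ("$elif", "$else"):
            frame = stack[-1]
            hit = (frame["parent"] and not frame["taken"]
                   and (word == "$else" or truth(rest)))
            frame["active"] = hit
            frame["taken"] = frame["taken"] or hit
        elif word == "$endif":
            stack.pop()
        elif not active or not line or word == "$mapfile_version":
            continue
        elif word == "$add":
            defined.update(rest.split())
        elif word == "$clear":
            defined.difference_update(rest.split())
        elif word == "$error":
            # Assumed message layout: mapfile name, line number, user text.
            raise MapfileError(f"{name}: {lineno}: {rest}")
        else:
            kept.append(line)
    return kept


EXAMPLE = """$mapfile_version 2
$if _sparc
sparc thing
$elif _x86
x86 thing
$else
$error unknown platform
$endif
"""

print(select_lines("mapfile", EXAMPLE, {"_x86", "_ELF64", "true"}))
```

Note that $error is only acted on when it sits in an active branch, which is what makes it usable as a guard in the final $else.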

Regular Directives

The regular directives all specify object-related information.

They all share a common underlying abstract syntax, based on the idea of name-value pairs, and the use of {} brackets for grouping, and to express sub-attributes.

All directives are terminated by the ';' character, as are attributes of directives.

Described informally, the simplest form is a directive name without a value:

directive;

The next form is a directive name with a value, or a whitespace separated list of values.

directive = value...;

The '=' operator is shown, which sets the given directive to the given value, or value list. The '+=' operator can be used, to specify that the value is to be added to the current value, and similarly, a '-=' operator is used to remove values.

More complex directives manipulate items that take multiple attributes enclosed within {...} brackets to group the attributes together as a unit:

directive [name] {
        attribute [= value];
        ...
} [name...];

Such a directive can have a name before the opening '{', which is used to name the result of the given statement. As an example, this may be a segment, or version name. One or more optional names may also be allowed following the closing '}', before the terminating ';'. These names are used to express that the named item being defined has a relationship with other named items. For example, the SYMBOL_VERSION directive uses this for inherited version names.

Note that the format for attributes within this form follows the same pattern as that of the simple directive form.

Some directives may have attributes that in turn have sub-attributes. In such cases, the sub-attributes are also grouped within nested { ... } brackets to reflect this hierarchy:

directive [name] {
        attribute {
                subattribute [= value];
                ...
        };
        ...
} [name...];

Such nesting can be carried out to arbitrary depth, as required to express the meaning of a given directive. In practice, 1-2 levels of nesting are sufficient for the directives currently defined. I don't anticipate very deep nesting being necessary, but the flexibility to do so gives me confidence that the new syntax is sufficiently flexible, and that we will be able to expand it as necessary going forward.
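
One way to see how regular this abstract syntax is: a generic parser can turn it into a nested structure without knowing anything about individual directives. The Python below is a sketch under assumptions, not the link-editor's parser. The names parse, parse_item, and trailing_names are hypothetical, control directives are simply skipped, the ';' before a closing '}' is treated as optional, and scope labels such as 'protected:' used by the symbol directives are not handled.

```python
import re

PUNCT = {"{", "}", ";", "=", "+=", "-="}
# Operators and brackets are their own tokens; anything else is a word.
TOKEN = re.compile(r"\+=|-=|[{};=]|(?:(?![+-]=)[^\s{};=])+")


def tokenize(text):
    lines = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0]            # drop comments
        if not line.lstrip().startswith("$"):  # skip control directives
            lines.append(line)
    return TOKEN.findall("\n".join(lines))


def parse_item(tokens, pos):
    """Parse one directive or attribute starting at tokens[pos]."""
    item = {"keyword": tokens[pos]}
    pos += 1
    if pos < len(tokens) and tokens[pos] not in PUNCT:
        item["name"] = tokens[pos]             # optional name before '{'
        pos += 1
    if pos < len(tokens) and tokens[pos] in ("=", "+=", "-="):
        item["op"] = tokens[pos]
        pos += 1
        item["values"] = []
        while pos < len(tokens) and tokens[pos] not in PUNCT:
            item["values"].append(tokens[pos])
            pos += 1
    elif pos < len(tokens) and tokens[pos] == "{":
        item["body"], pos = parse_items(tokens, pos + 1)
        pos += 1                               # skip the closing '}'
        item["trailing_names"] = []            # names after '}', if any
        while pos < len(tokens) and tokens[pos] not in PUNCT:
            item["trailing_names"].append(tokens[pos])
            pos += 1
    if pos < len(tokens) and tokens[pos] == ";":
        pos += 1
    return item, pos


def parse_items(tokens, pos=0):
    items = []
    while pos < len(tokens) and tokens[pos] != "}":
        item, pos = parse_item(tokens, pos)
        items.append(item)
    return items, pos


def parse(text):
    return parse_items(tokenize(text))[0]


SAMPLE = """$mapfile_version 2
LOAD_SEGMENT monkey {
        VADDR=0x80000000; MAX_SIZE=0x4000;
        ASSIGN_SECTION { TYPE=progbits; FLAGS=ALLOC EXECUTE };
        ASSIGN_SECTION { IS_NAME=.data };
};
"""

tree = parse(SAMPLE)
print(tree[0]["keyword"], tree[0]["name"], len(tree[0]["body"]))
```

The resulting nested dictionaries map directly onto elements and attributes, which is the kind of mechanical translation to and from XML that the design goals call for.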

Old and New Syntax Compared

I think that the best way to evaluate the new mapfile syntax is to show how one might express the same concepts in both syntaxes. In the subsections that follow, I will show examples in the old syntax and then re-write them using the new. This won't be a comprehensive demonstration of every possible option, but will touch on all of the main features.

Segments/Sections (Elephant, Monkey, and Donkey Ride Again)

The Linker and Libraries Manual contains the following example, which comes from the original AT&T documentation. This example shows how segments are created and sections assigned to them using the old syntax:

elephant : .data : peanuts.o *popcorn.o; 
monkey : $PROGBITS ?AX; 
monkey : .data; 
monkey = LOAD V0x80000000 L0x4000; 
donkey : .data; 
donkey = ?RX A0x1000; 
text = V0x80008000;

I have re-written this example for the new replacement mapfile chapter, as it provides a direct comparison between the old and new syntaxes. The old chapter, and my replacement, both contain a description of what each line means. I'll reproduce the new version here, omitting the explanations:

$mapfile_version 2
LOAD_SEGMENT elephant {
        ASSIGN_SECTION {
                IS_NAME=.data;
                FILE_PATH=peanuts.o;
        };
        ASSIGN_SECTION {
                IS_NAME=.data;
                FILE_OBJNAME=popcorn.o;
        };
};
LOAD_SEGMENT monkey {
        VADDR=0x80000000;
        MAX_SIZE=0x4000;
        ASSIGN_SECTION {
                TYPE=progbits;
                FLAGS=ALLOC EXECUTE;
        };
        ASSIGN_SECTION {
                IS_NAME=.data
        };
};
LOAD_SEGMENT donkey {
        FLAGS=READ EXECUTE;
        ALIGN=0x1000;
        ASSIGN_SECTION {
                IS_NAME=.data;
        };
};
LOAD_SEGMENT text {
        VADDR=0x80008000
};

The original is extremely compact, but also very cryptic. The new version is considerably longer, as it uses our recommended style of one item per line, with consistent indentation to show structure. The improvement in readability is substantial. I believe that most programmers can read this and follow its meaning without having to look up the syntax. I'm quite sure the same cannot be said of the old one.

Also note that the new version can be significantly compacted without losing much readability, though there's not much value in doing so:

$mapfile_version 2
LOAD_SEGMENT elephant {
        ASSIGN_SECTION { IS_NAME=.data; FILE_PATH=peanuts.o };
        ASSIGN_SECTION { IS_NAME=.data; FILE_OBJNAME=popcorn.o };
};
LOAD_SEGMENT monkey {
        VADDR=0x80000000; MAX_SIZE=0x4000;
        ASSIGN_SECTION { TYPE=progbits; FLAGS=ALLOC EXECUTE };
        ASSIGN_SECTION { IS_NAME=.data };
};
LOAD_SEGMENT donkey {
        FLAGS=READ EXECUTE; ALIGN=0x1000;
        ASSIGN_SECTION { IS_NAME=.data; };
};
LOAD_SEGMENT text { VADDR=0x80008000 };

Output Section Ordering

The version 1 syntax uses the '|' character to specify output section ordering. The LLM gives this example:

segment_name | section_name1;
segment_name | section_name2;
segment_name | section_name3;

In the version 2 syntax, this mapfile would be written as

$mapfile_version 2
LOAD_SEGMENT segment_name {
        OS_ORDER = section_name1 section_name2 section_name3;
};

Size Symbol Declarations

The version 1 syntax for creating a size symbol is:

segment_name @ symbol_name;

In the version 2 syntax, this is:

$mapfile_version 2
LOAD_SEGMENT segment_name { SIZE_SYMBOL = symbol_name };

File Control Directives

In the version 1 syntax, File Control Directives, indicated by the '-' character, are used to establish the versions that are available from shared objects linked to the object being created. In the new syntax, this is done using the DEPEND_VERSIONS directive.

For example, the following specifies that the version SUNW_1.20, as well as any version inherited by SUNW_1.20, is available for use by the object being created. It also forces SUNW_1.19 to be listed as a dependency, whether or not a symbol from SUNW_1.19 is actually used:

libc.so - SUNW_1.20 $ADDVERS=SUNW_1.19;

The same requirement can be expressed in the new syntax as:

$mapfile_version 2
DEPEND_VERSIONS {
        ALLOW =   SUNW_1.20;
        REQUIRE = SUNW_1.19;
};

Capabilities

Hardware and software capability directives are used to augment or replace the capabilities found in the input objects. For example consider the following statements in the version 1 syntax:

hwcap_1 = mmx;		    # Add MMX to existing hardware capabilities
hwcap_1 = mmx $OVERRIDE;    # Replace existing hardware capabilities with MMX

sfcap_1 = addr32;	    # Add ADDR32 to existing software capabilities
sfcap_1 = addr32 $OVERRIDE; # Replace existing software capabilities with ADDR32

Rewritten using the version 2 syntax:

$mapfile_version 2
CAPABILITY {
        HW += mmx;          # Add MMX to existing hardware capabilities
        HW = mmx;           # Replace existing hardware capabilities with MMX

        SF += addr32;       # Add ADDR32 to existing software capabilities
        SF = addr32;        # Replace existing software capabilities with ADDR32
};

Symbol Versions

The syntax for symbol scope and version definitions is the least changed. The following things are the same:

{} brackets are still used to group the symbols.
The version name precedes the opening '{'.
Inherited version names follow the closing '}'.
The syntax for current scope is unchanged.
The syntax for scope auto-reduction is unchanged.
The syntax for symbol names without attributes is unchanged.

The following things are different:

The 'SYMBOL_SCOPE' or 'SYMBOL_VERSION' keyword is added to the beginning, before the name and opening '{'.
The syntax for symbol attributes is changed.

For a large number of mapfiles, the only change necessary will be to add the $mapfile_version control directive to the file, and to put the keyword SYMBOL_SCOPE or SYMBOL_VERSION in front of each scope/version.

To show the difference in how symbol attributes are specified, consider the following directive in the old syntax that uses every possible symbol attribute. This is not a realistic example, as many of these options are not mutually compatible. However, it serves to highlight the full set of syntax differences:

VER_1.2 {
        foo = V0x12345678 S0x23
                FUNCTION DATA COMMON
                FILTER libfoo.so
                AUX libfoo.so
                PARENT EXTERN DIRECT NODIRECT INTERPOSE DYNSORT NODYNSORT;

        protected:
               *;
} VER_1.1;

Rewriting this in the version 2 syntax gives:

$mapfile_version 2
SYMBOL_VERSION VER_1.2 {
        foo {
                VALUE = 0x12345678; SIZE = 0x23;
                TYPE = FUNCTION;    TYPE = DATA;    TYPE=COMMON;
                FILTER = libfoo.so;
                AUX = libfoo.so;
                FLAGS = PARENT EXTERN DIRECT NODIRECT INTERPOSE DYNSORT NODYNSORT;
        };

        protected:
               *;
} VER_1.1;

Although the attribute syntax has changed, it is very similar.

Ordered Input Sections

The compiler usually places functions within a single source file together in a single text section in the resulting object. Such an object is an all or nothing proposition: to use any one of these functions, the link-editor must take the entire text section as a unit. The contents of such a section are fixed in place, and cannot be altered by the linker.

The Sun compilers support a command line flag, -xF, that causes each function to instead be placed in its own separate section. This gives the link-editor finer grained control, as it can omit unused functions while still pulling in the ones needed to complete the link. The link-editor also has the opportunity to arrange these functions in arbitrary order relative to each other, under user control, specified via the mapfile.

The documentation for the original version 1 syntax in the LLM gives this example:

text = LOAD ?RXO;
text : .text%foo;
text : .text%bar;
text : .text%main;

The result of using this mapfile will be for foo(), bar(), and main() to be placed adjacent to each other at the head of the segment, in that order. The ordering is implicit in the order in which the three section to segment statements (':' lines) are given in the mapfile.

The version 2 syntax accomplishes this reordering as follows:

$mapfile_version 2
LOAD_SEGMENT text {
        ASSIGN_SECTION bar  { IS_NAME = .text%bar };
        ASSIGN_SECTION main { IS_NAME = .text%main };
        ASSIGN_SECTION foo  { IS_NAME = .text%foo };
        IS_ORDER = foo bar main;
};

Conditional Input

This example comes from the linker tests. We have a test that sets an address for the text segment, and this test sets a different address for each of 32-bit sparc, 64-bit sparc, 32-bit x86, and 64-bit x86. As a result, we have four mapfiles:

mapfile-sparc
        text = V0x40000;
mapfile-sparcv9
        text = V0x100400000;
mapfile-i386
        text = V0x8080000;
mapfile-amd64
        text = V0x480000;

The version 2 syntax can employ conditional input to represent all of these differing values within a single mapfile, simplifying the test makefile. The $error control directive is used to catch cases where this test is run on a new, previously unknown platform, and to provide a meaningful error to the developer:

$mapfile_version 2

$if _sparc

$if _ELF64
LOAD_SEGMENT text { VADDR = 0x100400000 };
$else
LOAD_SEGMENT text { VADDR = 0x40000 };
$endif

$elif _x86

$if _ELF32
LOAD_SEGMENT text { VADDR = 0x8080000 };
$else
LOAD_SEGMENT text { VADDR = 0x480000 };
$endif

$else
$error unknown platform
$endif

Updates

21 February 2016

Chris Lent at Cooper Union pointed out that I had omitted a leading $ in my example:

$if _sparc && _ELF32
    32-bit sparc thing
elif _x86 && _ELF64
    64-bit x86 thing
$else
    others
$endif

I have added the missing $ to the elif. Thank you for your attention to detail!

Comments

Chris Quenelle — Monday January 11, 2010

Cool. I haven't gotten very far through the docs yet, but it seems that the CAPABILITY directive is specified so that: "FOO -= bar" followed by "FOO += bar" results in bar being omitted from the value of FOO. In other words, += and -= are order-independent. Thus it seems that if one mapfile turns off a flag, a later mapfile cannot turn it back on again (unless it resets the complete value of the flag). I've been involved in specifying the behavior of non-trivial compiler options in the past. It seems simple, but it's actually quite hard to get something that works for all the common cases.

Ali Bahrami — Wednesday January 13, 2010

Thanks Chris! I've made a small change to how this works that should address the issue of '-=' locking out the later ability to add a value back with '+='. As described in the doc, there are two bitmasks, 'value', and 'exclude'. When adding a bit to one of these masks, the same bit will be removed from the other. Hence, a later '+=' can undo the action of an earlier '-='.
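
The two-mask behavior described in this reply can be modeled in a few lines of Python. This is only a sketch of the idea as stated here: the class name CapMask, the resolve method, and the bit value chosen for MMX are hypothetical, and the way the masks are finally combined with the capabilities found in the input objects is an assumption.

```python
MMX = 0x1   # hypothetical bit value, for illustration only


class CapMask:
    """Tracks bits to add ('value') and bits to remove ('exclude')."""

    def __init__(self):
        self.value = 0
        self.exclude = 0

    def add(self, bits):       # models '+='
        self.value |= bits
        self.exclude &= ~bits  # adding a bit removes it from the other mask

    def remove(self, bits):    # models '-='
        self.exclude |= bits
        self.value &= ~bits

    def resolve(self, from_objects):
        # Assumed final combination: input capabilities plus 'value',
        # minus anything still listed in 'exclude'.
        return (from_objects | self.value) & ~self.exclude


mask = CapMask()
mask.remove(MMX)   # an earlier mapfile: HW -= mmx
mask.add(MMX)      # a later mapfile:    HW += mmx
print(mask.value == MMX, mask.exclude == 0)
```

Because each operation clears the bit from the opposite mask, the last statement wins and the order of mapfiles matters again.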

Published Elsewhere

https://blogs.sun.com/ali/entry/a_new_mapfile_syntax_for/
https://blogs.oracle.com/ali/entry/a_new_mapfile_syntax_for/
https://blogs.oracle.com/ali/a-new-mapfile-syntax-for-solaris/
