A New Mapfile Syntax for Solaris
Ali Bahrami Thursday January 07, 2010
In the previous entry, I discussed at length the problems and misfeatures of the original Solaris mapfile language that we inherited with System V Release 4 Unix. The original language was not designed to be extended, yet we've built on top of it for 20+ years. Although we could continue to do so, we have come to a point where a new language that retains the good features of the old, while addressing its shortcomings, would pay dividends.
My project to create a replacement mapfile language is in its final stages. I believe that the resulting syntax is simple, highly readable, and easily extended. Yet, the result is also highly evolutionary. I think anyone who knows the old language will have little difficulty understanding and quickly putting the new one to use. The implementation is complete, and I've used it to build a copy of the Solaris OSnet workspace with all of its mapfiles rewritten using the new syntax. Yesterday, the PSARC case for this work was approved, a significant milestone:
PSARC/2009/688 Human readable and extensible ld mapfile syntax
We're currently in a restricted build period leading up to the release of the next OpenSolaris, and this work will have to wait to integrate until after that, probably in the second half of February. However, the work is essentially done, and this seems like a good time to get some information about it into circulation.
The case materials for PSARC/2009/688 include a replacement mapfile chapter for the Solaris Linker and Libraries Guide. The old chapter will be preserved as an appendix for the benefit of those needing to decrypt existing mapfiles. Until this new material appears in the published manual, I hope you will find this HTML version helpful.
There is little reason to repeat the information in that document here. Instead, I would like to describe the underlying principles we used to design this new language, and to provide a series of examples in which a single item is expressed in both the old and new syntaxes. I think that these examples probably offer the fastest way for someone who already knows the old syntax to start using the new one. I will refer to the Linker and Libraries Manual frequently in this discussion, often using the abbreviation LLM.
Once we had a final design and a working implementation of it, I modified our linker tests so that each test that uses a mapfile now does so twice, once with the old syntax, and once with the new. This has two important benefits:
As I iterated through the design process, I developed and refined the following list of requirements and observations that in turn guided subsequent iterations. Listed in no particular order:
A couple of years ago, I did an XML-based mapfile prototype, to determine if that would be a good direction for mapfiles. My conclusion is that it is not. XML is too verbose to be comfortably hand edited, and the XML boilerplate gets in the way. I don't think it is a good fit for a linker mapfile. However, XML does have some useful lessons to teach us, particularly, that the syntax should be simple, and regular. Although we will not use XML, it will be a good thing if the syntax is easily translatable to/from XML using nothing more than simple perl or python. This will serve to make sure we end up with a simple flexible language, and leave the door open to a future XML variant, should that prove interesting.
Something along the lines of what the C preprocessor allows with #if/#endif would fit the bill. However, we have no desire to have a macro facility, or for most of what CPP does. Just conditional input. If you want more, you can use a real preprocessor (like m4, or even cpp), but in the mapfile language, we want something extremely simple that just solves this one little problem.
A version 2 mapfile can contain two types of directive:
As with the version 1 syntax, '#' is the comment character. A # on a line, and everything following it, is ignored by the link-editor, as are empty lines.
The first non-comment, non-empty, line in a version 2 mapfile must be the control directive:
    $mapfile_version 2
Any mapfile that does not start with this line is interpreted as a version 1 mapfile, in which case the full original syntax is supported.
    $if expr
    ...
    [$elif expr]
    ...
    [$else]
    ...
    $endif
The sole purpose of this facility is to allow you to write something like
    $if _sparc && _ELF32
    32-bit sparc thing
    $elif _x86 && _ELF64
    64-bit x86 thing
    $else
    others
    $endif
as a way to handle minor per-platform variations in an otherwise identical mapfile.
Users of C, and related, languages will instantly recognize this as being very similar to the C preprocessor, substituting '$' for '#'. That is true, but the similarity is very superficial:
I had a few reasons for making '$' the character for control directives:
There are a small number of predefined values available for use in $if/$elif expressions:
    _ELF32
    _ELF64
    _sparc
    _x86
    true
I expect these to be sufficient for nearly any mapfile. However, the $add control directive exists to define new values, and $clear to remove them. $add might be used to define convenient shorthand for longer expressions. For example, if you were writing a mapfile that had a large number of special cases involving the 64-bit x86 architecture, a definition like the following might be convenient:
    $if _ELF64 && _x86
    $add amd64
    $endif
Lastly, the $error directive allows you to make your mapfiles safe against attempts to use them in an unexpected context. The text following the directive is issued as a fatal error by the link-editor, which then exits. I expect it to be used as follows:
    $if _sparc
    sparc thing
    $elif _x86
    x86 thing
    $else
    $error unknown platform
    $endif
The error message includes the mapfile name, and the line number where the $error directive was encountered.
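As a rough illustration only, the control directives described above can be modelled in a few lines of Python. This is a sketch under stated assumptions, not the link-editor's implementation: the names preprocess and MapfileError are hypothetical, only the '&&' operator used in the examples is handled, and the layout of the error text (file name, line number, message) is assumed rather than taken from ld.

```python
class MapfileError(Exception):
    """Raised when an active $error directive is reached (hypothetical name)."""


def preprocess(text, defined, name="mapfile"):
    """Return the lines that survive conditional input.

    `defined` holds the names that evaluate as true, for example
    {"_sparc", "_ELF32", "true"}. Only '&&' is modelled here.
    """
    defined = set(defined)      # copy, so $add/$clear stay local to this call

    def is_true(expr):
        return all(term.strip() in defined for term in expr.split("&&"))

    out = []
    stack = []                  # frames: [parent_active, branch_taken, active]
    for lineno, line in enumerate(text.splitlines(), start=1):
        parts = line.split(None, 1)
        key = parts[0] if parts else ""
        rest = parts[1].strip() if len(parts) > 1 else ""
        active = stack[-1][2] if stack else True
        if key == "$if":
            cond = active and is_true(rest)
            stack.append([active, cond, cond])
        elif key == "$elif":
            frame = stack[-1]
            frame[2] = frame[0] and not frame[1] and is_true(rest)
            frame[1] = frame[1] or frame[2]
        elif key == "$else":
            frame = stack[-1]
            frame[2] = frame[0] and not frame[1]
            frame[1] = True
        elif key == "$endif":
            stack.pop()
        elif not active:
            continue            # everything inside a false branch is dropped
        elif key == "$add":
            defined.update(rest.split())
        elif key == "$clear":
            defined.difference_update(rest.split())
        elif key == "$error":
            raise MapfileError(f"{name}: {lineno}: {rest}")
        else:
            out.append(line)
    return out


SAMPLE = "\n".join([
    "$if _ELF64 && _x86",
    "$add amd64",
    "$endif",
    "$if amd64",
    "amd64 thing",
    "$elif _sparc",
    "sparc thing",
    "$else",
    "$error unknown platform",
    "$endif",
])
print(preprocess(SAMPLE, {"_x86", "_ELF64", "true"}))
```

Calling preprocess(SAMPLE, {"_x86", "_ELF32"}) reaches the $error line and raises MapfileError, which mirrors the way $error guards a mapfile against use on an unexpected platform.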
They all share a common underlying abstract syntax, based on the idea of name-value pairs, and the use of {} brackets for grouping, and to express sub-attributes.
All directives are terminated by the ';' character, as are attributes of directives.
Described informally, the simplest form is a directive name without a value:
    directive;
The next form is a directive name with a value, or a whitespace separated list of values:
    directive = value...;
The '=' operator is shown, which sets the given directive to the given value, or value list. The '+=' operator can be used, to specify that the value is to be added to the current value, and similarly, a '-=' operator is used to remove values.
More complex directives manipulate items that take multiple attributes enclosed within {...} brackets to group the attributes together as a unit:
    directive [name] {
        attribute [= value];
        ...
    } [name...];
Such a directive can have a name before the opening '{', which is used to name the result of the given statement. As an example, this may be a segment, or version name. One or more optional names may also be allowed following the closing '}', before the terminating ';'. These names are used to express that the named item being defined has a relationship with other named items. For example, the SYMBOL_VERSION directive uses this for inherited version names.
Note that the format for attributes within this form follows the same pattern as that of the simple directive form.
Some directives may have attributes that in turn have sub-attributes. In such cases, the sub-attributes are also grouped within nested { ... } brackets to reflect this hierarchy:
    directive [name] {
        attribute {
            subattribute [= value];
            ...
        };
        ...
    } [name...];
Such nesting can be carried out to arbitrary depth, as required to express the meaning of a given directive. In practice, 1-2 levels of nesting are sufficient for the directives currently defined. I don't anticipate very deep nesting being necessary, but the flexibility to do so gives me confidence that the new syntax is sufficiently flexible, and that we will be able to expand it as necessary going forward.
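One of the requirements listed earlier was that the syntax be easy to translate to XML with nothing more than simple perl or python. The following sketch suggests how little code the abstract syntax above needs. It is an illustration under assumptions, not part of the project: the function names, the enclosing mapfile element, and the name, op, and refs XML attributes are all made up here, control directives are simply skipped, and the scope labels used inside symbol version directives (such as "protected:") are not handled.

```python
import re
import xml.etree.ElementTree as ET

OPS = ("=", "+=", "-=")
PUNCT = ("{", "}", ";") + OPS
# Operators and punctuation are tried first; a word stops before '+=' or '-='.
TOKEN = re.compile(r"\+=|-=|[{};=]|(?:(?![+-]=)[^\s{};=])+")


def tokenize(text):
    tokens = []
    for line in text.splitlines():
        line = line.split("#", 1)[0]        # '#' starts a comment
        if line.lstrip().startswith("$"):   # control directives are skipped
            continue
        tokens.extend(TOKEN.findall(line))
    return tokens


def parse(tokens, pos=0):
    """Parse statements up to a closing '}' or the end of input."""
    elems = []
    while pos < len(tokens) and tokens[pos] != "}":
        elem = ET.Element(tokens[pos])      # directive or attribute name
        pos += 1
        if pos < len(tokens) and tokens[pos] not in PUNCT:
            elem.set("name", tokens[pos])   # optional name before '{'
            pos += 1
        if pos < len(tokens) and tokens[pos] in OPS:
            elem.set("op", tokens[pos])
            pos += 1
            start = pos
            while pos < len(tokens) and tokens[pos] not in (";", "}"):
                pos += 1
            elem.text = " ".join(tokens[start:pos])
        elif pos < len(tokens) and tokens[pos] == "{":
            children, pos = parse(tokens, pos + 1)
            elem.extend(children)
            pos += 1                        # step over the closing '}'
            start = pos
            while pos < len(tokens) and tokens[pos] not in (";", "}"):
                pos += 1
            if pos > start:                 # names after '}'
                elem.set("refs", " ".join(tokens[start:pos]))
        if pos < len(tokens) and tokens[pos] == ";":
            pos += 1                        # ';' may be omitted before '}'
        elems.append(elem)
    return elems, pos


def to_xml(text):
    root = ET.Element("mapfile")
    root.extend(parse(tokenize(text))[0])
    return ET.tostring(root, encoding="unicode")


SAMPLE = """
$mapfile_version 2
LOAD_SEGMENT monkey {
    VADDR = 0x80000000;
    ASSIGN_SECTION { IS_NAME = .data };
};
"""
print(to_xml(SAMPLE))
```

Because every directive, attribute, and sub-attribute follows the same name, operator, value, braces pattern, a single recursive function covers all nesting levels, which is the regularity the design was aiming for.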
    elephant : .data : peanuts.o *popcorn.o;
    monkey : $PROGBITS ?AX;
    monkey : .data;
    monkey = LOAD V0x80000000 L0x4000;
    donkey : .data;
    donkey = ?RX A0x1000;
    text = V0x80008000;
I have re-written this example for the new replacement mapfile chapter, as it provides a direct comparison between the old and new syntaxes. The old chapter, and my replacement, both contain a description of what each line means. I'll reproduce the new version here, omitting the explanations:
    $mapfile_version 2
    LOAD_SEGMENT elephant {
        ASSIGN_SECTION {
            IS_NAME=.data;
            FILE_PATH=peanuts.o;
        };
        ASSIGN_SECTION {
            IS_NAME=.data;
            FILE_OBJNAME=popcorn.o;
        };
    };
    LOAD_SEGMENT monkey {
        VADDR=0x80000000;
        MAX_SIZE=0x4000;
        ASSIGN_SECTION {
            TYPE=progbits;
            FLAGS=ALLOC EXECUTE;
        };
        ASSIGN_SECTION {
            IS_NAME=.data
        };
    };
    LOAD_SEGMENT donkey {
        FLAGS=READ EXECUTE;
        ALIGN=0x1000;
        ASSIGN_SECTION {
            IS_NAME=.data;
        };
    };
    LOAD_SEGMENT text {
        VADDR=0x80008000
    };
The original is extremely compact, but also very cryptic. The new version is considerably longer, as it uses our recommended style of one item per line, with consistent indentation to show structure. The improvement in readability is substantial. I believe that most programmers can read this and follow its meaning without having to look up the syntax. I'm quite sure the same cannot be said of the old one.
Also note that the new version can be significantly compacted without losing much readability, though there's not much value in doing so:
    $mapfile_version 2
    LOAD_SEGMENT elephant {
        ASSIGN_SECTION { IS_NAME=.data; FILE_PATH=peanuts.o };
        ASSIGN_SECTION { IS_NAME=.data; FILE_OBJNAME=popcorn.o };
    };
    LOAD_SEGMENT monkey {
        VADDR=0x80000000; MAX_SIZE=0x4000;
        ASSIGN_SECTION { TYPE=progbits; FLAGS=ALLOC EXECUTE };
        ASSIGN_SECTION { IS_NAME=.data };
    };
    LOAD_SEGMENT donkey {
        FLAGS=READ EXECUTE; ALIGN=0x1000;
        ASSIGN_SECTION { IS_NAME=.data; };
    };
    LOAD_SEGMENT text { VADDR=0x80008000 };
    segment_name | section_name1;
    segment_name | section_name2;
    segment_name | section_name3;
In the version 2 syntax, this mapfile would be written as
    $mapfile_version 2
    LOAD_SEGMENT segment_name {
        OS_ORDER = section_name1 section_name2 section_name3;
    };
    segment_name @ symbol_name;
In the version 2 syntax, this is:
    $mapfile_version 2
    LOAD_SEGMENT segment_name {
        SIZE_SYMBOL = symbol_name
    };
In the version 1 syntax, File Control Directives, indicated by the '-' character, are used to establish the versions that are available from shared objects linked to the object being created. In the new syntax, this is done using the DEPEND_VERSIONS directive.
For example, the following specifies that the version SUNW_1.20, as well as any version inherited by SUNW_1.20, is available for use by the object being created. It also forces SUNW_1.19 to be listed as a dependency, whether or not a symbol from SUNW_1.19 is actually used:
    libc.so - SUNW_1.20 $ADDVERS=SUNW_1.19;
The same requirement can be expressed in the new syntax as:
    $mapfile_version 2
    DEPEND_VERSIONS libc.so {
        ALLOW = SUNW_1.20;
        REQUIRE = SUNW_1.19;
    };
Hardware and software capability directives are used to augment or replace the capabilities found in the input objects. For example consider the following statements in the version 1 syntax:
    hwcap_1 = mmx;               # Add MMX to existing hardware capabilities
    hwcap_1 = mmx $OVERRIDE;     # Replace existing hardware capabilities with MMX
    sfcap_1 = addr32;            # Add ADDR32 to existing software capabilities
    sfcap_1 = addr32 $OVERRIDE;  # Replace existing software capabilities with ADDR32
Rewritten using the version 2 syntax:
    $mapfile_version 2
    CAPABILITY {
        HW += mmx;     # Add MMX to existing hardware capabilities
        HW = mmx;      # Replace existing hardware capabilities with MMX
        SF += addr32;  # Add ADDR32 to existing software capabilities
        SF = addr32;   # Replace existing software capabilities with ADDR32
    };
The syntax for symbol scope and version directives is the least changed:
- {} brackets are still used to group the symbols.
- The syntax for current scope is unchanged.
- The syntax for scope auto-reduction is unchanged.
- The syntax for symbol names without attributes is unchanged.
The following things are different:
For a large number of mapfiles, the only change necessary will be to add the $mapfile_version control directive to the file, and to put the keyword SYMBOL_SCOPE or SYMBOL_VERSION in front of each scope/version.
To show the difference in how symbol attributes are specified, consider the following directive in the old syntax that uses every possible symbol attribute. This is not a realistic example, as many of these options are not mutually compatible. However, it serves to highlight the full set of syntax differences:
    VER_1.2 {
            foo = V0x12345678 S0x23 FUNCTION DATA COMMON
                FILTER libfoo.so AUX libfoo.so PARENT EXTERN
                DIRECT NODIRECT INTERPOSE DYNSORT NODYNSORT;
        protected:
            *;
    } VER_1.1;
Rewriting this in the version 2 syntax gives:
    $mapfile_version 2
    SYMBOL_VERSION VER_1.2 {
            foo {
                VALUE = 0x12345678;
                SIZE = 0x23;
                TYPE = FUNCTION; TYPE = DATA; TYPE=COMMON;
                FILTER = libfoo.so;
                AUX = libfoo.so;
                FLAGS = PARENT EXTERN DIRECT NODIRECT INTERPOSE
                    DYNSORT NODYNSORT;
            };
        protected:
            *;
    } VER_1.1;
Although the attribute syntax has changed, it is very similar.
The Sun compilers support a command line flag, -xF, that causes each function to be placed in its own separate section, instead of sharing a single text section with the other functions in the object. This gives the link-editor finer grained control, as it can omit unused functions while still pulling in the ones needed to complete the link. The link-editor also has the opportunity to arrange these functions in arbitrary order relative to each other, under user control, specified via the mapfile.
The documentation for the original version 1 syntax in the LLM gives this example:
    text = LOAD ?RXO;
    text : .text%foo;
    text : .text%bar;
    text : .text%main;
The result of using this mapfile will be for foo(), bar(), and main() to be placed adjacent to each other at the head of the segment, in that order. The ordering is implicit in the order in which the three section to segment statements (':' lines) are given in the mapfile.
The version 2 syntax accomplishes this reordering as follows:
    $mapfile_version 2
    LOAD_SEGMENT text {
        ASSIGN_SECTION bar { IS_NAME = .text%bar };
        ASSIGN_SECTION main { IS_NAME = .text%main };
        ASSIGN_SECTION foo { IS_NAME = .text%foo };
        IS_ORDER = foo bar main;
    };
- mapfile-sparc
    text = V0x40000;
- mapfile-sparcv9
    text = V0x100400000;
- mapfile-i386
    text = V0x8080000;
- mapfile-amd64
    text = V0x480000;
The version 2 syntax can employ conditional input to represent all of these differing values within a single mapfile, simplifying the test makefile. The $error control directive is used to catch cases where this test is run on a new previously unknown platform, and provide a meaningful error to the developer:
    $mapfile_version 2
    $if _sparc
    $if _ELF64
    LOAD_SEGMENT text { VADDR = 0x100400000 };
    $else
    LOAD_SEGMENT text { VADDR = 0x40000 };
    $endif
    $elif _x86
    $if _ELF32
    LOAD_SEGMENT text { VADDR = 0x8080000 };
    $else
    LOAD_SEGMENT text { VADDR = 0x480000 };
    $endif
    $else
    $error unknown platform
    $endif
$if _sparc && _ELF32 32-bit sparc thing elif _x86 && _ELF64 64-bit x86 thing $else others $endif
Cool. I haven't gotten very far through the docs yet, but it seems that the CAPABILITY directive is specified so that: "FOO -= bar" followed by "FOO += bar" results in bar being omitted from the value of FOO. In other words, += and -= are order-independent. Thus it seems that if one mapfile turns off a flag, a later mapfile cannot turn it back on again (unless it resets the complete value of the flag). I've been involved in specifying the behavior of non-trivial compiler options in the past. It seems simple, but it's actually quite hard to get something that works for all the common cases.
Thanks Chris! I've made a small change to how this works that should address the issue of '-=' locking out the later ability to add a value back with '+='. As described in the doc, there are two bitmasks, 'value', and 'exclude'. When adding a bit to one of these masks, the same bit will be removed from the other. Hence, a later '+=' can undo the action of an earlier '-='.
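To make the two-mask behavior in the reply above concrete, here is a small Python model. It is a sketch under assumptions rather than the link-editor's code: the class name, the bit values for MMX and SSE, and the way apply() folds the masks into the capabilities gathered from input objects are all illustrative guesses, and the '=' (replace) operator is not modelled.

```python
# Hypothetical bit assignments, for illustration only.
MMX = 0x1
SSE = 0x2


class CapabilityMask:
    """Two bitmasks, 'value' and 'exclude', updated by '+=' and '-='."""

    def __init__(self):
        self.value = 0
        self.exclude = 0

    def add(self, bits):
        """Model '+=': set bits in 'value' and clear them from 'exclude'."""
        self.value |= bits
        self.exclude &= ~bits

    def remove(self, bits):
        """Model '-=': set bits in 'exclude' and clear them from 'value'."""
        self.exclude |= bits
        self.value &= ~bits

    def apply(self, from_objects):
        """Assumed combination rule: add 'value', then strip 'exclude'."""
        return (from_objects | self.value) & ~self.exclude


hw = CapabilityMask()
hw.remove(MMX)   # an earlier mapfile: HW -= mmx
hw.add(MMX)      # a later mapfile:    HW += mmx
print(hw.value == MMX, hw.exclude == 0)
```

Because each operation clears the opposite mask, the last statement for a given bit wins, so a later '+=' undoes an earlier '-=' and vice versa.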