S. A. MOORE
LANGUAGE HISTORY
The C language flowed, as the official white book lore goes, from the language BCPL. After that came the language B, followed by C. These two language names are from the first and second letters of BCPL. However, there were also strong influences on the language from what was going on in FORTRAN, as will be explained.
LANGUAGE PRINCIPLES
The idea of the C series languages is simple. If you restrict the data types available in the language strictly to types that can fit into a single machine word, the implementation effort goes down dramatically. At first this seems like a strange idea; what about records, arrays and other complex types? Well, even with languages that treat these as fundamental types, the actual IMPLEMENTATION on the target machine will translate references to these objects into terms of machine addresses or "pointers". With C, you get a rich set of pointer operations, and you essentially "roll your own" complex type accesses.
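To make that concrete, here is a small sketch of my own (not from the white book); struct point and the function names are invented, but the two routines show the same field access written in the typed view and in the raw address-plus-offset view that the machine actually performs.

#include <stddef.h>

struct point { int x; int y; };

int get_y(const struct point *p)
{
    return p->y;                               /* the typed view of the record */
}

int get_y_by_hand(const void *base)
{
    /* the raw view: an address plus an offset, computed with offsetof */
    return *(const int *)((const char *)base + offsetof(struct point, y));
}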
Now it is possible to take this to different extremes. Take a hypothetical language:

stringa[100];
stringb[100];
stridx[4];

for (stridx = 0; *(stringb+*stridx); stridx++)
    stringa+*stridx = *(stringb+*stridx);
Where:

label[x] - Allocates x bytes of storage and sets label to the constant address.
a = b    - Places a single machine word from the expression b into the location given by address a.
*a       - Gets the word at the address a.
a++      - Increments the word at the address a.
This language has the attraction that no complex typing is needed at all. The only thing the compiler does is allocate space and equate constant symbols to the start of that space, much as an assembler would. Of course, the notation for a single machine word:

stridx[4]

is rather machine dependent (4 bytes make a word), but that can be solved using macro constants:

stridx[WORD];
The two problems with this language are that it does nothing to help with the creation of complex data types, and that it is incredibly pedantic about forming references to objects (is it the address itself, the contents, what it points to?).
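For comparison, here is the same string copy written in real C, as I would sketch it (the names simply mirror the hypothetical example above): the compiler now knows what stringa, stringb and stridx actually are, instead of treating everything as a bare machine word.

char stringa[100];
char stringb[100];
int  stridx;

void copy_string(void)
{
    for (stridx = 0; stringb[stridx] != 0; stridx++)
        stringa[stridx] = stringb[stridx];
    stringa[stridx] = 0;    /* copy the terminating zero as well */
}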
KUDOS FOR C
C makes a good compromise solution for machine oriented language design. The compiler understands complex types and how to define them. It uses a novel scheme to unify pointers and arrays, and uses the pointer principle to eliminate the difference between arguments passed by reference and passed by value. The modularity of C is excellent, and the calling conventions so flexible that it was, and is, possible to rewrite a given compiler's low-level support functions.
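A short sketch of the two unifications just praised, under my own invented names: a[i] is defined as *(a + i), and "pass by reference" is nothing more than passing a pointer by value.

#include <stdio.h>

void bump(int *n)          /* the caller passes &x; the pointer itself */
{                          /* is passed by value                       */
    *n += 1;
}

int main(void)
{
    int a[3] = {10, 20, 30};
    printf("%d %d\n", a[1], *(a + 1));   /* prints "20 20"             */
    int x = 5;
    bump(&x);
    printf("%d\n", x);                   /* prints "6"                 */
    return 0;
}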
THE PROBLEMS
LACK OF REFERENCE CHECKING
Number one on my list of problems would be the lack of access bounding in C. From FORTRAN on, most languages were founded on the idea that storage was both defined and protected by the language. Indeed, one of the biggest advantages of programming in a high level language vs. assembly is the ability to control reference problems. With the inclusion of unbounded pointer accesses, C imports the worst problem of assembly language.
After two decades of C's ascendance to the most common computer language on Earth, I truly believe that the use of C has been a major contributor to unstable programming. Most software on the market crashes on uncommon tasks, or even just because its storage, or that of its operating system, becomes full or fragmented. The use of memory management hardware to contain access problems has only changed this situation from a wholesale system lockup to an error message (that is meaningless to the user) and lost files. A new language, Java, was created mostly to cover the fact that C's references cannot be contained.
C encourages lost references in a number of ways. Most programmers walk pointers through arrays, which makes where you are in the array impossible for the compiler to verify. Any item in C can be addressed by a pointer made up on the spot. This includes locals, allowing pointers to be made to variables that may or may not even exist!
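Here is a small sketch of my own showing both habits; all names are invented, every line compiles cleanly, and nothing in the language stops any of it.

#include <stdio.h>

int *pointer_to_nowhere(void)
{
    int local = 42;
    return &local;               /* address of a variable that ceases to exist */
}

void overrun(void)
{
    int a[10];
    int *p = a;
    int i;
    for (i = 0; i <= 10; i++)    /* classic off-by-one: writes past a[9]       */
        *p++ = 0;                /* the compiler cannot tell where p has gone  */
}

int main(void)
{
    overrun();
    printf("%p\n", (void *)pointer_to_nowhere());
    return 0;
}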
Today the most common error with C is something like "segment violation", which means one of your pointers went wild. It is the most meaningless of error messages: you get perhaps an address (in machine language terms) where the fault occurred and, if you are lucky, a prayer. To paraphrase another author, "this botch, if corrected, would make the language an order of magnitude more useful".
LACK OF TYPE CHECKING
Conversion of types is the standard pastime of C programmers. Since every language-manipulated object is by definition a word (except, strangely, floating point, which appears to live by a separate set of rules), they should be freely convertible. The conversion of types was left pretty much to the programmer in the first C version. The result was pointers left in integer variables and other horrors. The "fix" for this is hardly better. Now anything is ok, as long as you announce you are doing it with a cast. Despite what C programmers have told themselves, the new way does not really allow the compiler to enforce typing rules. Pointers are commonly passed as void *, and what a pointer actually points to is between the programmer and god.
I have lost track of the number of times I have been told that only bad programmers work this way. Yet the APIs for Windows, OS/2 and other operating systems feature call parameters that are routinely mangled (not just in the C++ sense). A passed parameter can contain a pointer to a structure; or be zero (NULL), which has meaning to the callee; or be a small number, based on the fact that pointers are generally large numbers.
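A sketch of that calling style, in my own invented terms (the names, and the 65536 cutoff, are made up; the pattern is the point): one parameter that is sometimes a pointer to a structure, sometimes NULL, and sometimes a small integer smuggled through the same slot.

#include <stdio.h>
#include <stdint.h>

struct item { int id; };

void handle_message(unsigned msg, void *param)
{
    if (param == NULL)
        printf("msg %u: no data\n", msg);
    else if ((uintptr_t)param < 65536)                /* "small number" case */
        printf("msg %u: code %lu\n", msg, (unsigned long)(uintptr_t)param);
    else
        printf("msg %u: item %d\n", msg, ((struct item *)param)->id);
}

int main(void)
{
    struct item it = { 7 };
    handle_message(1, NULL);
    handle_message(2, (void *)(uintptr_t)42);         /* integer as pointer  */
    handle_message(3, &it);
    return 0;
}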
MACROS
Nowhere are the reasons for a language feature more obscure than C's reliance on a macro preprocessor. The only real need for this hidden, extra pass is to define constants, a feature that could easily have been included in the language. I have always harbored a suspicion that macros hurt more than they help, and the C language turned this into a conviction. Macros shred the logical structure of the program. Where macros are extensively used, it becomes a challenge to figure out exactly what the actual code looks like. Errors become mysterious, and the line numbers become meaningless (some compilers attempt to minimize this problem). The reason macros are a bad idea is that they have no intelligence associated with them. They cut and slice program code with total disregard for how the program syntax is laid out.
When macros became popular in assembly language programming, some said that this amounted to high level language programming. But macros lack any ability to check the correctness of, or to optimize, the generated code.
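The classic demonstration of that lack of intelligence, sketched here with an invented macro name: the preprocessor pastes text with no regard for expression structure.

#include <stdio.h>

#define SQUARE(x) x * x

int main(void)
{
    printf("%d\n", SQUARE(2 + 1));   /* expands to 2 + 1 * 2 + 1, which is 5, not 9 */
    return 0;
}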
The inclusion of macros also betrays a link to FORTRAN. Because FORTRAN was considered an out of date language with bad syntax (it was), a macro preprocessor was one solution to make the language usable. But here we have C, supposedly designed without the blemishes of FORTRAN, fitted with this crutch from day one.
Most good C programmers don't use macros for anything but defines. The result is an extra, slow text-searching pass that can't be gotten rid of.
INTERPRETIVE I/O
It truly amazes me how people can preach about the "efficiency" of C when it uses interpretive I/O. The famous (infamous?) printf statement must scan and parse a format string at runtime in order to perform I/O. This design error is a direct copy of the FORMAT statement in FORTRAN!
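A sketch of what that means in practice: the format string is ordinary data that the library interprets at run time, and nothing in the language ties it to the arguments. (Many modern compilers add a warning for this as a courtesy, but it is not a language rule.)

#include <stdio.h>

int main(void)
{
    double pi = 3.14159;
    printf("pi = %d\n", pi);    /* wrong specifier; it compiles, and prints nonsense */
    return 0;
}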
After that, it is probably a minor issue that the I/O statements are misnamed and misformatted. printf (print formatted) may or may not actually go to a printer, but fprintf (file print formatted) is a total oxymoron. There is no printing going on, even to a user console screen (unless redirection is happening). If you are ready for that, you are ready for sprintf (string print formatted), which has nothing to do with any I/O device whatever!
The creators of C can justifiably pat themselves on the back for the fact that in C, at least, I/O statements are truly independent of the language. But this stupid pet trick had been performed elsewhere before (Algol). The syntax of printf means that even compilers that attempt to perform good parameter checking must give up and let everything through. Moreover, the system fails to really specify I/O in a completely portable way. The exact method to get at printf parameters is implementation defined. The standard workaround for this, the va_arg macros, forces C toward a stack-style parameter passing design, hardly the most efficient method. Some compilers just revert to a stack based format only when the variable argument type (...) is presented.
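A minimal sketch of that workaround (sum_ints is an invented example): the callee walks its own argument list with the standard va_arg macros, trusting the caller completely about how many arguments there are and what types they have.

#include <stdarg.h>
#include <stdio.h>

static int sum_ints(int count, ...)
{
    va_list ap;
    int i, total = 0;
    va_start(ap, count);
    for (i = 0; i < count; i++)
        total += va_arg(ap, int);    /* nothing checks that this type is right */
    va_end(ap);
    return total;
}

int main(void)
{
    printf("%d\n", sum_ints(3, 1, 2, 3));    /* prints 6 */
    return 0;
}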
Moving down in the I/O chain, we have the lovely tradition of get and unget. C addresses the legitimate need for lookahead with a pure hack. The programmer simply puts back what he does not want, or puts back anything else that comes to mind.
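Here is the usual shape of that idiom, sketched with an invented helper (read_number): read until a character that does not belong, then push it back with ungetc so the next reader sees it again.

#include <stdio.h>
#include <ctype.h>

static int read_number(FILE *f)
{
    int c, value = 0;
    while ((c = getc(f)) != EOF && isdigit(c))
        value = value * 10 + (c - '0');
    if (c != EOF)
        ungetc(c, f);    /* put back the character that ended the number */
    return value;
}

int main(void)
{
    printf("%d\n", read_number(stdin));
    return 0;
}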
Finally, the method for finding out if the end of a file has been reached is to go ahead and read past the end of the file! The ripples of this hack spread far and wide. An integer (not a character) must be used to read from a character file, because the end value (-1) is not representable as a character. Error checking becomes muddled. It obviously is not an error to read right through the end of the file, so what is? The second attempt to read EOF? Most commonly, the result of a missed EOF mishap is tons of garbage spilled to screen or file.
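The consequence in miniature, as I would sketch it: the variable holding the result of getchar must be an int, not a char, or the EOF value (-1) cannot be told apart from a legitimate character.

#include <stdio.h>

int main(void)
{
    int c;    /* must be int: EOF does not fit in a character */
    while ((c = getchar()) != EOF)
        putchar(c);
    return 0;
}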
Things get worse at the lower level. read (which, like it or not, is pretty much the de facto standard in C) hands back nothing but a byte count: zero when the end of the file has been reached, and -1 for everything else that can go wrong. So you got a -1 back from read. Is it a disk error? Was the call interrupted? Is the computer on fire? The program now descends to sorting out error codes from yet another facility, the error number, to find out.
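A sketch of that error-sorting dance, assuming the POSIX read call on file descriptor 0 (standard input): the return value alone says little, and the program must fall back on errno to learn what actually happened.

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char buf[512];
    ssize_t n;
    while ((n = read(0, buf, sizeof buf)) > 0)    /* 0 means end of file */
        fwrite(buf, 1, (size_t)n, stdout);
    if (n < 0)                                    /* -1 means... something */
        fprintf(stderr, "read failed: %s\n", strerror(errno));
    return n < 0;
}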
WHAT IS AN END OF LINE ?
Near the order of magnitude of the reference checking problem is the infamous "end of line" representation. Since C represents EVERYTHING as a special character (including EOF as shown above), it of course follows that EOL should also be a single character. Unfortunately, when C and Unix were created, it rarely was. In fact, the prevailing standard, both now and at the time C was created, was to use TWO characters for that job, carriage return and line feed, corresponding to the control movements required to get a teletype or printer back to the start of the next line. Rather than attempting to come up with a good abstraction for this, C (and Unix) simply decided that a new, single character standard would be set, the world be damned.
The general damage from this decision is incalculable. To this day, Unix systems cannot exchange files seamlessly with the rest of the world. The damage to C is also apparent. Since the rest of the computing world did not give up and change to fit C, C had to adapt to the standard ASCII EOL. This it was not designed to do, and it remains one of the biggest problems with C.
The most obvious way to adapt C to standard EOL conventions is to translate CR/LF sequences to and from the C EOL (which is a LF alone). But at what level does this occur? Having it take place at the standard FILE handling level means that this package is not available for reading and writing non-text (binary) files, which it is most certainly used for, despite the getchar and putchar names. Putting it deeper than that, in the read and write routines, is worse. Not only is translation involved, but the exact size and layout of the data is compromised, i.e., the data read no longer matches the data on disk.
The solution placed into the ANSI standard is to have the program specify what type of file, text or binary, is being opened. Since this was never required under Unix, the wall between Unix and other systems is raised, not lowered, and we now have a dependency entirely unrelated to the type of processor being used, a true backwards step in language design.
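A sketch of that ANSI fix ("image.dat" is an invented filename): the program itself must now declare whether a file is text or binary. On Unix the "b" is a no-op; on CR/LF systems it suppresses newline translation and changes what you read.

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("image.dat", "rb");   /* "rb": binary mode, no translation */
    int c;
    long count = 0;
    if (f == NULL)
        return 1;
    while ((c = getc(f)) != EOF)
        count++;
    printf("%ld bytes\n", count);
    fclose(f);
    return 0;
}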
The authors of C seem fully unrepentant for perpetrating this design flaw, sniffing that "under unix, the format of a binary file and a text file are identical" [from the whitebook].
TYPE SPECIFICATION
Even though type specification appears low on my gripe list, it is the one sin that even the authors admit to. Now we have programs to translate C type arcana into English. C declarations are supposed to look similar to their use. That makes sense. But why should it be any less arcane to define a type than to use it?
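The standard library supplies the classic exhibit: the declaration of signal(), which takes an int and a pointer to a handler function and returns a pointer to the previous handler. Few people can read it cold; untangled with a typedef it becomes almost civilized (handler_t is a name of convenience I made up, not a standard one).

void (*signal(int sig, void (*handler)(int)))(int);

typedef void (*handler_t)(int);
handler_t signal(int sig, handler_t handler);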
Before tearing into the ridiculous syntax of some C types, it is only fair to mention that the inclusion of formal type identifiers into C came late in C's life, in the form of typedef. Before that, the only name for a type was the one buried in the odd syntax of structures:
struct x {
...
...
} a, *b;
This introduces the type name (tag) x. The only apparent reason for it is to allow self-referential structures, a necessity. So structs have a name that may or may not be used anywhere. Further, the use of "*" can always be made to modify the type being declared.
Curiously, when actually using the type name of a structure, you have to tell the compiler what it should already know:
struct x {
struct x *next;
...
};
struct x onevar;
Within the structure, x could conceivably be another structure member, although the usual practice for programmers who reuse names this way is to get the current meaning. What need there is for this clarification in the outer scope escapes all reason. And "*" gives a means to create two completely different types in the same declarator list.
With the addition of typedef, things take a marked turn for the worse:
typedef struct x {
struct x *next;
...
} a, *b;
Voila! The meaning of a and b has changed dramatically. No longer are these objects; now they are types. Not only that, but

struct x

and

a

are aliases of each other. The way I tell new programmers to deal with this kind of fun is to copy an example from the book, and forget trying to understand it.
SYNTAX ARCANA
One of the most maligned features of Pascal is the way the ";" is used. What amazes me is that C is held up as an example of how semicolons should be used. Consider:
while (1) doit();
while (1) { doit(); dothat(); }
Even though { ... } is a direct substitution for a single statement, it does not need a trailing ";", even though C complains bitterly if any other statement lacks one. Moreover, in the identical looking:
char mystrings[] = {
"one",
"two"
};
the semicolon is most certainly required. For function predeclarations (admittedly a newer feature), things get worse:
void dothat(void);
and
void dothat(void)
are two completely different animals, with only a single character of difference. The first predeclares the function; the second is the actual function. Forget that semicolon on the predeclaration and you get a series of nonsensical errors from the compiler, as it cheerfully heads off to parse the rest of the program under the mistaken assumption that you are defining a strangely large function. Similarly ugly things happen when you add a semicolon after the actual function body.
For more syntax arcana, there is little to beat the "comma paradox":

myfunc(a+1, b+3);

looks normal, and the comma here is just an argument separator. But wrap the arguments in one more set of parentheses:

myfunc((a+1, b+3));

Oh yes! Now it means evaluate and discard a+1, then pass only the value of b+3, because inside those parentheses the comma is an operator! The same character is a separator or an operator depending on nothing but grammatical context. So:

thisfunc(a, b, c);
thisfunc((a, b), c);
thisfunc(a, (b, c));

are three nearly identical looking calls, the first passing three values and the other two passing only two, with the discarded value chosen purely by parenthesis placement.

In compiler/language textbooks, this is technically referred to as an "oops".
OPERATORS FROM HELL
I have never met anyone who did not like C operators. There are so many of them. And they do everything. And if you don't know the precedence by heart, you are so hosed:
a ? b = 1: b = 2;
Uh, oh. This gives a compile error. Since "=" has lower priority than ?:, we just told C to set b = 1 if a is nonzero, otherwise yield the value of b; the trailing "= 2" then tries to assign to the result of the whole conditional, which is not something that can be assigned to.
*a++;

looks like it means "increment what a points at". It does not: it parses as *(a++), which advances the pointer a and fetches what it used to point to; to increment the pointed-to word you must write (*a)++. And what does:

*--a;

mean? Predecrement the pointer a, then get what it points to.
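A small sketch of my own showing the three forms side by side; which one does what is settled entirely by the precedence and associativity rules.

#include <stdio.h>

int main(void)
{
    int data[3] = {10, 20, 30};
    int *a = &data[1];

    (*a)++;                  /* increments what a points at: data[1] becomes 21 */
    printf("%d\n", *a++);    /* prints 21, then advances the pointer            */
    printf("%d\n", *--a);    /* steps the pointer back, prints 21 again         */
    return 0;
}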
My favorite thing to do is to paste a Xerox of the precedence chart from the white book on my desk right by the terminal, so I can see it always, just to cover this sort of thing.
And how about:

a && b

Does this mean a logically anded with b? Or a bitwise-anded with the address of b? Your precedence chart won't settle that one; you have to know that the lexer always grabs "&&" as a single operator. To REALLY get a anded with the address of b, you must write:

a & & b

(whether the types then make any sense is another matter).
The real crux of the problem is: why do you need all those operators?
a = a+1;
and
a += 1;
and
a++;
All do the same thing. But are they really necessary? All but toy compilers know how to change an add by one into an increment operation at the machine level. Yet many C programmers are CONVINCED that they are writing more efficient code when they directly specify that an increment (say) should happen. When C was new, this shift of optimization work onto the programmer enabled smaller compilers to be built faster. But now, with compilers a major enterprise, C compilers must often contain huge amounts of code to deal with rewriting or even outguessing programmer statements. For example, the code:
a++;
a++;
a++;
a++;
b = *a;
is odd, but the programmer might have written it because he knew that four increments were more efficient than an add with a constant. But this is exactly the kind of calculation that changes when modern cached and superscalar processors are used. Because an add of a constant is now MORE efficient, the compiler can be in the position of having to UNOPTIMIZE code that the programmer cleverly "optimized" in the source.
ASSIGNMENTS EVERYWHERE
C makes assignment an operator. But it goes even further than that; any variable can be incremented or decremented at any time within an expression. So you have:
b = a = b++;
which (since b is modified twice without an intervening sequence point) is formally undefined; on most compilers the net effect is simply a = b, with the increment lost, hardly what it appears to say. Scattering assignments around an expression is a wonderful way to increase order dependence, which is why C has to be so careful about spelling out what evaluation order it does and does not guarantee.
WHAT VALUE BOOLEAN ?
As several languages have done, the check for boolean true is reduced to checking for a non-zero value, and there is no specific type for boolean. This has exactly the effect that might have been predicted; boolean truth checking extends far beyond boolean operations:

char *p;
while (*p) putchar(*p++);
Because optimizing compilers know perfectly well how to deal with a check against zero, there is no net difference between the above and:

while (*p != 0) putchar(*p++);

save that the first is more arcane. In fact, quite the opposite effect from efficiency is achieved.
For a check of:

if (x) ....;

a compiler would like to implement the truth check as a simple bit test against bit 0 of x, since the "official" values of true and false are 1 and 0 in C. But in fact x could be ANY non-zero value, because the programmer has been encouraged to stick all kinds of odd values into such conditionals (by many program examples, including those in the whitebook). Only zero versus non-zero matters, so the cheap test is out unless the compiler can prove that the value came from a genuine logical operation just beforehand.
This problem ripples through the language. It seems odd that C needs both a logical and a bitwise version of each operator (&& and & for the logical and bitwise "and"), since on the "official" 0 and 1 truth values the results would be identical. The reason the logical versions are required is that the values may not, in fact, conform to the 0 and 1 convention. Extra work is then required to bring the values BACK into conformance in order to perform a valid boolean operation. The white book says:
while (*a++ = *b++);
is a string copy. In fact the white book goes carefully through the steps required to reduce the program to this terse form, and I have had this poetic description mentioned to me many times as "proof" of the efficiency of C. But entirely unmentioned is the fact that extra work will probably have to be expended elsewhere to compensate for this language quirk.
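A sketch of that extra work, with an invented example (flags_equal): before a value can take part in an honest boolean comparison, it has to be squashed back down to the official 0 or 1.

int flags_equal(int a, int b)
{
    return !!a == !!b;    /* !! normalizes any nonzero value to exactly 1 */
}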
The ugly fact is that those "while (p)" and "if (p)" expressions, which made the original white book examples so sexy to programming newcomers, had a hidden price. They encouraged the misuse of boolean statements and operators on non-boolean values, which made boolean handling MORE complex and LESS efficient in C. This is a language feature being sold with all the honesty of a used car lot.
THE CASE FOR CASE
In C, unusually for a language of its era, case matters. This little quirk has implications beyond the apparent:
MyVar
Myvar
myvar
are all different names in C. Breaking with a few thousand years of the tradition of having capital letters serve simply as an alternate type style, they are now wholly separate letters. We are never given a good explanation for this. Was the world running out of good identifiers? Did the language authors just get a lower case terminal? The common explanation, and the one that the C authors vehemently deny, is that it was created for people who could not type well, and needed the extra mode to supplement the "hunt and peck" method. The legacy is that now people have to take great pains in the Unix and C world to specify the exact case used, and you must look carefully at your programs to see if the case of each identifier is correct. Thanks, guys.
In typographical terms, capital letters, small letters and later italics started life as styles or "fonts" of type. They stopped being individual typefaces (or writing styles) when it was realized that their underlying principles could be used in any typeface or style for emphasis, and rules then evolved for their use in language text.
But programming languages are not human language text! So why should the same principles apply to them? The answer is simple. If, after inventing the automobile, you were to decide that people should walk in the street and drive on the sidewalks, because, obviously, cars have little in common with horses and carriages, don't be shocked when you have confused the hell out of everyone (and seriously injured some). We use infix notation in most programming languages, and read them from left to right, not because there is something inherently better about those methods, but because we respect human conventions and want computers to be accepted without demanding huge paradigm shifts between programming and everyday life.
WHAT IS A STRING ?
Since C has no bounding, dynamic variables are basically wherever you find space for them. C leaves the issue of how to find the length of such items entirely unresolved, leaving it up to the programmer. The one exception is strings of characters, which have a zero automatically appended to them. This zero acts as the sentinel for finding the end of the string.
This scheme works quite well in the general case. But it has two built-in problems that become apparent with use. The first is that the length of the string can only be found by traversing it. This can get prohibitively expensive with very large strings, and most programs that handle such strings give up and use a length/data combination.
Secondly, reserving one of the character values can be a bigger problem than realized. In many cases, we need to handle characters "verbatim", including any embedded zeros, which a zero-terminated string simply cannot hold.
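A minimal sketch of the length/data combination such programs fall back on (the type and function names here are invented for illustration): the length is stored explicitly, so embedded zeros are legal and finding the length no longer requires a full traversal.

#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct lstring {
    size_t len;
    char  *data;
};

static struct lstring lstring_from(const char *bytes, size_t len)
{
    struct lstring s;
    s.len  = len;
    s.data = malloc(len);          /* data stays NULL if the allocation fails */
    if (s.data != NULL)
        memcpy(s.data, bytes, len);
    return s;
}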
MODULARITY
I have to give C high marks for having separate compilation as a goal since day one. Few languages had that feature at the time.
However, the modularity system under C is groaning under the weight of advanced programs. The standard method is to put all the headers for functions, and their data, into an include file. This means that two separate representations of each function, the header declaration and the actual function itself, must be created and maintained separately. This creates a constant source of errors during development.
Further, problems have arisen with the same file being included by multiple source files, and errors occurring thereby. A standard idiom has come about to deal with this: define a macro, then check whether it is already defined on successive inclusions, skipping the body of the include if it is a duplicate. All of these measures are clever, but the module system in C is still entirely macro driven, and consists of hacking together multiple sources to make a compile. The C language itself includes no facilities for modularity.
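A sketch of that include-guard idiom and of the double bookkeeping it papers over; the file names, macro name and function are all invented here, but the convention itself is exactly as described above.

/* mymodule.h -- the guard macro keeps a second inclusion from redeclaring things */
#ifndef MYMODULE_H
#define MYMODULE_H

int count_words(const char *s);   /* the declaration lives here...            */

#endif /* MYMODULE_H */

/* mymodule.c -- ...while the definition lives here, kept in step by hand */
#include "mymodule.h"

int count_words(const char *s)
{
    int n = 0, in_word = 0;
    for (; *s != 0; s++) {
        if (*s == ' ')
            in_word = 0;
        else if (!in_word) {
            in_word = 1;
            n++;
        }
    }
    return n;
}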
CONCLUSION
I bought my first book on C long before I was actually able to lay my hands on a compiler. C ran only on the DEC PDP-11 and a few other computers, none of which I had access to. This gave me plenty of time to study the language and dream of getting my hands on the real thing.
In the years since, C has gone from obscurity to vogue to necessity. After programming with it for 15 years, and using other languages at the same time, I can honestly say that C takes more time than other languages of comparable functionality to debug and finalize into a working program. In addition, the final product is less readable and less maintainable. C remains a good language. Its simplicity is its biggest virtue. But in no way, shape or form does it measure up to the hype surrounding it, which often ignores the (many) flaws in the language.