Letters from the Author

Occasional musings from the author of IP Pascal


Previous letters

Our New king        2006/04/09


There really >is< a difference: Pascal and C

2008/07/09

People look at me funny if I state that one of the goals for IP Pascal is to have equivalent functionality to C, but it is true. I don't mean in terms of having the ability to make pointers to everything, including locals, and change types at will. That is a programming style decision:

type intarr = array of integer;

var n: intarr;

procedure addone(var n: intarr);

var i: integer;

begin
   for i := 1 to max(n) do n[i] := n[i]+1
end;
 
and
 
void addone(int *n, int maxn)
{
    while (maxn--) *n++ += 1;
}
 

Accomplish the same thing. The second is more terse, but does not result in faster or more compact code. Why? Well, in writing the second example you would have (as all C programmers have) applied several optimizations off the cuff:

  1. The array indexing was replaced by a pointer that simply advances through the array, removing the index-to-address calculation from the loop.
  2. The loop counts down instead of up, so the termination check is a simple test of the counter against zero rather than a compare against a separate limit.

 

 

These are pretty common optimizations performed by good compilers nowadays (vs. the old "bad days" of the early 1980s for microprocessors). Professional C users know the difference between "polite" C and "dirty" C. The above in "polite" C is:

 

void addone(int *n, int maxn)
{
    int i;

    for (i = 0; i < maxn; i++) n[i] += 1;
}

Which, of course, looks more like the Pascal example. I could have put the redundant-looking "n[i] = n[i] + 1" back in, but it clarifies nothing. Hey, C has all those operators, they are fun.

You might be surprised to find that I prefer the first, "dirty" example, but here I digress. C was made for such tricks, and to those who really dislike C written like that, I would suggest they are using the wrong language. C is what C is. The point is that these are style decisions. Picking up a pointer to an array and walking forward with it is a perfectly reasonable technique that I use in Pascal code all the time (!). The difference is that I let the compiler do it for me.

So what do I mean by "the goals for IP Pascal include having equivalent functionality to C"? Alan Turing established that different computer languages can accomplish the same things (within reason), as long as each language forms a complete "system" (more about that in a moment). So we have to clearly separate style differences from functional differences, and that's not as easy as it sounds.

The C language actually has a lot of advantages, and always had from its inception. A short list would be:

  1. Simplicity, both of the language and of its implementation.
  2. Efficient code, even from simple compilers.
  3. Direct access to the underlying machine.

Of all of these advantages, IP Pascal has failed only one, that of simplicity. Of course, so has C, and if you will forgive me the observation, the computer industry has neither asked for nor delivered same. However, it is an advantage. In the late 1970s and early 1980s, getting a C compiler running on a microprocessor was a dramatically simpler proposition than for other languages, and using it was enjoyable as well.

So, ignoring the amazingly problematic variable length argument lists of C functions such as printf() for now, what I want to talk about is variable length data, specifically arrays. This is because, as I will describe, handling of variable length data goes to the heart of what is different between the two languages.

Variable vs. Dynamic

Now, of course, we have to define what variable length is. For IP Pascal, variable length arrays are arrays that can be allocated at any length at runtime. In both IP Pascal and C, once you allocate an array, you are stuck with that length unless you dispose of the data and start over again. With dynamic allocation, you can change the length of an array at any time during the run of the program.
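To make the distinction concrete, here is a minimal C sketch (my own illustration, not from the original): malloc() gives a variable length array, sized once at runtime; realloc() is the "dispose and start over" step done for you, and may move the data to an entirely new block.

#include <stdlib.h>

/* Variable length: the size is chosen at run time, but fixed once chosen. */
int *make_array(size_t n)
{
    return malloc(n * sizeof(int));
}

/* "Dynamic" length: effectively dispose and reallocate; realloc() may
   copy the contents to an entirely new block of the new length. */
int *grow_array(int *a, size_t m)
{
    return realloc(a, m * sizeof(int));
}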

There really isn't anything like a true dynamic type on a stored program computer[1]. There are plenty of ways to make it appear as if the data is truly dynamic, and perhaps that's what counts. An array can vary in what I'll call a "container". That is, either you or the compiler can assume that the array is 100 elements long, but you can ask for 10, 20, 50 or whatever number of array components you want and get that request satisfied. It works as long as you don't ask for more than 100 elements. This is the way that UCSD, Turbo and similar Pascals implement strings. There are other ways, such as implementation via linked lists, which is used in LISP, and you can even step outside of stored program computer design to things like memory that is inherently linked, but that's another story.
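In C terms, such a container might look like the following sketch (my own illustration; UCSD and Turbo Pascal build this into the string type itself): the capacity is fixed, and the length that "varies" within it is just a stored count.

/* A "container" string in the UCSD/Turbo Pascal style: the capacity is
   fixed at 100 characters, but the length in use can be 0 to 100. */
struct string100 {
    unsigned char len;    /* number of characters currently in use */
    char data[100];       /* the fixed container: the maximum ever allowed */
};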

What is a pointer in C?

Let's go back and look at the header for the addone function in C:

void addone(int *n, int maxn);

This very typical function header in C fairly screams at you about the nature of C programming. The first thing is obvious. The array and its length must be specified separately. Why is that? Let's expand the example a bit:

int a[10];
...
addone(a, 10);

Ok, we passed the static array "a" to addone, or more specifically, a pointer to it. The name "a" is an alias for a pointer to the base of the static array. Then we need to specify its length, which is not a major hardship. If you were not a C expert, you might be tempted to ask "why do we need to tell it the length of a, since the compiler could clearly tell how long it is from the context?"
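Part of the answer is that the compiler only knows the length where the array is declared; a minimal sketch (my own illustration):

#include <stddef.h>

int a[10];
size_t lena = sizeof(a) / sizeof(a[0]);  /* 10: known at the declaration */

void f(int *n)
{
    /* Inside the function, "a" has decayed to a bare pointer: sizeof(n)
       is just the size of a pointer, and the element count is gone.
       Hence the separate maxn parameter. */
    (void)n;
}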

Let's not go down that road (which leads to madness in any case). Instead, let's look at the heading for addone one more time:

void addone(int *n, int maxn);

In fact, let's get rid of the second parameter, so it won't distract us:

void addone(int *n);

Now, what does "n" (n the parameter) indicate? Not for this specific function, but in general, in all of the language that is C? The answer is that it could indicate any of:

  1. A pointer to a single integer (the equivalent of a VAR parameter).
  2. A pointer to the first element of an array of integers, of any length.
  3. A null (0) pointer, where null has a special meaning such as "this parameter is unused".

The argument here is not whether this is useful; it most certainly is. It allows C to accomplish a lot with very little notation. It simply means that the compiler does not really know exactly what you are doing when you call addone. A lot of very smart people have expended a lot of work trying to find out (automatically) just what you are doing with that pointer, including examining the source code with special tools, looking at all of the source files that make up the program, and even simulating the code in an abstract way (running the code without actually executing the program). All this time and cash gets expended on C in the name of automatic verification or "super lint" programs, designed to capture latent bugs in C. The upshot is that it is a very, very difficult problem to solve 100%, although many verifiers live happily with the 80% to 95% verification ratios that allow them to keep sucking cash like wild Hoover vacuum cleaners.
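To see the ambiguity in code, consider this sketch (the function takes() is hypothetical, my own illustration); all three calls hand the compiler exactly the same "int *n":

#include <stddef.h>

void takes(int *n)
{
    (void)n;  /* hypothetical: what happens here decides which use was meant */
}

void demo(void)
{
    int single = 42;
    int table[10];

    takes(&single);  /* 1: a pointer to a single integer (a VAR parameter) */
    takes(table);    /* 2: a pointer to the first element of an array */
    takes(NULL);     /* 3: null, meaning "no parameter" by convention */
}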

Now, let's step back and state some properties of a pointer in C:

  1. A pointer accesses a variable it points at, according to the type of the pointer.
  2. A pointer indexes a "series" of such variables, whose length is not defined.
  3. The variable "behind" any pointer is also a variable, so that p-- still points to a variable (even if it may not be accessible).
  4. The variable "ahead" of any pointer is a variable, such that p++ points to it (again, it may not be valid).

Ok, if you are a theorem purist, you have already caught a mistake in the above list, or perhaps an "excess". Rule 2 is redundant with rules 3 and 4. The reason why is that if you repeat rule 3 or 4 you get a series. I.e., if you decrement p, you get a valid element, but you could reapply that rule again and again, any number of times. So let's redo the rules list:

  1. A pointer accesses a variable it points at, according to the type of the pointer.
  2. The variable "behind" any pointer is also a variable, so that p-- still points to a variable (even if it may not be accessible).
  3. The variable "ahead" of any pointer is a variable, such that p++ points to it (again, it may not be valid).

Now, what about whether the next or last element is valid? Well, it may be a property of specific computers and memory management methods that the next or last element is not accessible, but it is not a property of stored program computers in general. What about the start and end of memory? Well, unless the hardware forbids it, that does not really apply either, since memory (addresses) in stored program computers wraps from top to bottom, and bottom to top. In fact, a very famous issue came up with the 80286-based IBM PCs, where the ability of the 8086 memory space to effect such an "infinite loop" was removed by Intel, who quite reasonably believed nobody in their right mind would ever depend on such a fact of life, and then promptly added back by designers at IBM, who quickly realized that someone out there did, indeed, assume that (someone with more money than god and whose name starts with "m").

So if you will please bear with me, and assume that stored program computer memory loops around in the general stored program model, we can simplify the rules still further:

  1. A pointer accesses a variable it points at, according to the type of the pointer.
  2. The variable "behind" any pointer is also a variable, so that p-- still points to a variable.
  3. The variable "ahead" of any pointer is a variable, such that p++ points to it.

Even though we have covered only increment and decrement operations, because a pointer points to a series of (potentially infinite) variables, the rules also apply to the formula for array access:

a[i]

is equivalent to:

*(a + i*s)

Where s is the scale factor for the array (the size of the variable a points to).
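In C source the scale factor is implicit; the compiler applies it behind the scenes. A minimal demonstration (my own sketch):

#include <stdio.h>

int main(void)
{
    int a[5] = {10, 20, 30, 40, 50};

    /* a[3] and *(a + 3) are by definition the same access; the
       compiler multiplies the index by sizeof(int) for you */
    printf("%d %d\n", a[3], *(a + 3));
    return 0;
}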

Now, this looks like a lot about a little, but the properties of a pointer explain a lot about the differences between Pascal and C, and between pointers and what are called "references" in Java and similar languages. If you take the rules for a C pointer, and take away all but the first rule:

  1. A pointer accesses a variable it points at, according to the type of the pointer.

You get pointers in Pascal, and effectively references in Java (although Java is not supposed to have an actual address for the variable pointed to, the effect is the same).
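For contrast, here is what a pointer restricted to rule 1 looks like, sketched in C (my own illustration; a real Pascal pointer can still be reassigned as a whole, the const below simply makes the compiler reject the arithmetic):

int x = 5;
int *const p = &x;   /* p cannot be stepped: p++ and p-- will not compile */

void bump(void)
{
    *p += 1;   /* rule 1: access the variable p points at */
}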

The field of mathematics

Ok, now let's run to the other side of the ship. In mathematics, there is the concept of a "field". The basic properties of a field are:

  1. A set of data.
  2. A set of operations on that data.

(sounds like an object, no?)

There are two kinds of fields, one that is "field complete" and one that is not. The definition of "field complete" is that:

Any operation in the field, performed on any data in the field, produces a result that is still within the field.

For example, linear algebra is nearly "field complete" for data arranged upon a line, hence the name, "linear algebra". Given any set of numbers on a line, and any operation performed on those numbers (add, subtract, multiply, divide, square root, etc.) you will end up with valid result data on the line.

Of course, I said "nearly". Namely, because the square root of a negative number yields one of a series of numbers called "imaginary" numbers, which neatly escape the field. Thus, algebra is pretty field complete, but not completely field complete.

What happens when a field is not complete? Well, unless you are willing to wrap your mind around the idea that no field can ever be complete, an alternate view is that a field that can "escape" is just not drawn large enough. In the case of linear algebra, the incidence of imaginary numbers is often pictured by extending linear algebra to a 2d plane with ordinary numbers on one axis, and imaginary numbers on the other axis. You might say that the larger field of "plane algebra" can contain linear algebra along with its little accidental leaks.

This idea of "jumping out" of a system occurs in nature as well. Life works with a very ordered system based on DNA, forming a field with the four letters of the DNA sequence as "data" and the construction of proteins as "operations" on that. This system is layered atop another system, that of molecular chemistry, so biochemistry is a system layered on top of molecular chemistry. But life does not stick to the rules, and occasionally it "jumps out" of the biochemical DNA rules and works in terms of molecular chemistry to get things done.

Fields are interesting to study because a field is a complete system, and the study of fields goes a long way towards understanding systems. If you look at it this way, a stored program computer is also a field. It defines a set of data, memory, along a line, and a set of operations on that data, the instructions executed by the processor. The computer is "field complete" because no instruction will cause it to leave the field, i.e., no instruction will cause the data in the computer to exist in terms other than 0 or 1 bits, and the processor will not jump out of its address space and start chewing up the floor tiles.

He was out standing in his field?

So when a programming language is applied to a stored program computer, you actually are layering two different fields, one atop the other:

   The field of the programming language
   The field of the stored program computer

Really this is not an amazing thought. If we wanted to draw a complete picture it would look like:

   The user
   The application
   The programming language
   The operating system
   The stored program computer
   The digital logic
   The transistors

Etc. We could probably draw this out to many more levels, or collapse a few, but the basic idea is that when you type on your computer, you are at the top of a series of fields or systems that represent abstractions engineers worked out over long periods of time to keep from losing their sanity. Most of them are pretty tightly constructed fields. The real world equivalent of the stored program computer model that "jumps out" of its field is bad data, smoke, or something equally unpleasant, and, we hope, not permanent. Engineers work pretty hard to make these levels "field complete".

Although they may not look it, programming languages as systems are pretty much fields as well. They constitute sets of data with a set of operations on them, and hopefully they are field complete. In fact, a large topic of debate today is whether the programming languages of today can be moved from sitting on top of one field, the stored program computer, to other fields, such as distributed multiprocessor systems or even directly to hardware, all of which differ in various degrees from the field represented by the stored program computer model.

A programming language acts like a field because that is what we want of it, and that is what it was designed for. If in the real world, an operation on the data in the program can result in an operation or data that is outside the field of the language, we most likely will see that as a crash. If you follow what I have been saying above, you will understand what I mean when I say that an accurate description of what occurs is that the program has "jumped out of the field". The CPU left the field formed by the programming language and dropped back to the field formed by the stored program computer. At this point, the program, its code, its data, other programs, system code, it all looks alike to the CPU, and it happily chews it up and (seemingly) will then take on a new purpose of finding and erasing your hard disk.
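A minimal sketch of such a jump (my own illustration):

void escape(void)
{
    /* p is aimed at an address the program was never given, so the
       language no longer describes what happens next; the stored
       program machine does, usually as a memory fault */
    int *p = (int *)0xDEADBEEF;

    *p = 1;   /* undefined behavior: the CPU has left the field of C */
}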

The field of C

C also represents a field, with data and operations on it. I think by now you have figured out that I won't term C as being "field complete", but the whole point of C is not to be field complete. In fact, that is where its power comes from. Let's step back to the C pointer rules we formed above:

  1. A pointer accesses a variable it points at, according to the type of the pointer.
  2. The variable "behind" any pointer is also a variable, so that p-- still points to a variable.
  3. The variable "ahead" of any pointer is a variable, such that p++ points to it.

Now we can explain why the declaration:

void addone(int *n, int maxn);

Works so well even if the declaration "int *n" has three possible meanings. To pound this nail into the ground, let's use a better example:

#include <stdio.h>

struct {
    int i;
    char c;
    float f;
} somestuff;

void printstuff(void)
{
    /* char * so that the pointer steps in bytes; arithmetic on void *
       is not standard C. Note this walk also assumes the compiler
       packs the members with no padding between them. */
    char *p = (char *)&somestuff;

    printf("somestuff.i = %i\n", *(int *)p);
    p = p + sizeof(int);
    printf("somestuff.c = %c\n", *(char *)p);
    p = p + sizeof(char);
    printf("somestuff.f = %f\n", *(float *)p);
}

Of course, this is really bad coding style, but that's not what we are getting at. The program makes use of the idea that a pointer indexes somewhere in an infinite series of variables, and even bends the first rule by changing exactly what p points to at each step. Because the size of the variable pointed to now varies, the pseudo-function sizeof() is used to abstract that. However, the function clearly relies on the fact that the structure elements are laid out in memory in the same order as declared (and that the compiler inserts no padding between them).
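For contrast, a "politer" version of the same walk is sketched below (my own illustration, using a named struct); the offsetof() macro from <stddef.h> asks the compiler where each member actually lives, padding included, instead of guessing from sizeof() alone:

#include <stddef.h>
#include <stdio.h>

struct stuff { int i; char c; float f; } somestuff2 = { 1, 'x', 2.5f };

void printstuff2(void)
{
    char *p = (char *)&somestuff2;

    printf("somestuff2.i = %i\n", *(int *)(p + offsetof(struct stuff, i)));
    printf("somestuff2.c = %c\n", *(char *)(p + offsetof(struct stuff, c)));
    printf("somestuff2.f = %f\n", *(float *)(p + offsetof(struct stuff, f)));
}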

So C is escaping the field of C, going into the field that is the stored program computer model, and returning. In fact, a typical C program does that many, many times while it is executing. That it is escaping to stored program concepts is obvious from the basic rules we have formulated for C pointers. These rules are, not by coincidence, the rules for a stored program computer. C is designed to escape to the field that is the stored program computer, and back.

If you think about it, this idea explains a lot about the simplicity of C (or at least original C). When C programmers say that "C lets you do what you want to do", a more accurate description would be "C lets you switch from the C field to the stored program computer field at will". C is simple because, although it implements the operations for you, the rules about how data is accessed are simply left up to the underlying model, the stored program computer.

Ready to travel?

If you look at it from this point of view, it becomes fairly simple to see why the language C presents problems when moving to computer models that are more and more unlike the stored program model, invented so many years ago. Since C is married to the stored program computer model, moving it off that model to another is going to cause issues. In fact, it is going to cause problems in direct proportion to how unlike the stored program computer model the underlying implementation is. Multitasking is a "little bit off" the stored program model, because that model wasn't originally designed with the idea of having multiple instruction processors (CPUs) roving around in the same memory. Implementing algorithms directly in hardware (such as SystemC, Handel-C and similar) is moving to a model very much unlike the stored program model, and it should not surprise anyone that large adjustments in the C language were required to make it work. This typically involves sharp restrictions on pointer use, but not always. C can be implemented in full in random hardware by -- you guessed it -- bringing a considerable amount of stored program concepts down and implementing them on the same hardware, such as placing all variables in the same byte-addressable RAM.

C's issues with multitasking come down to the basic reliance on the stored program model. A program that routinely escapes from the control of its programming language only has one real way to make multitasking work, which is to hand the basic tools of task management to the programmer in the form of a Pthreads library and a "volatile" keyword, and let 'em at it.
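A minimal sketch (my own illustration, using standard Pthreads calls) of what that hand-off looks like; the language itself contributes only "volatile", while the task and the lock both come from the library:

#include <pthread.h>
#include <stdio.h>

static int counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;

    /* the mutex, not the language, keeps the two tasks from colliding */
    pthread_mutex_lock(&lock);
    counter++;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t t;

    pthread_create(&t, NULL, worker, NULL);

    pthread_mutex_lock(&lock);
    counter++;
    pthread_mutex_unlock(&lock);

    pthread_join(t, NULL);
    printf("counter = %d\n", counter);   /* always 2, thanks to the lock */
    return 0;
}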

Now let's be fair. Pretty much all of the current languages in existence have stored program prejudices. An array is inherently a stored program concept, since it replicates the linear array of memory elements that is the basis of stored program design. And C is not reliant on all aspects of a stored program computer; for example, it does not depend on having code (instructions) in the same memory space as the data.

Realistically, there are going to be three basic methods going forward into advanced architectures:

  1. Make whatever new architecture (multicore, FPGA, etc.) look like a stored program computer, even if it is only virtually.
  2. Adapt existing program languages to new architectures.
  3. Start over again with a new programming language.

Of course, and perhaps unfortunately, the "direct to hardware" folks seem to believe that (3) is the only reasonable way forward, although I note that the "grand unifying language" that will unite all of hardware and software is proving as elusive as the "unified field theory", the similar holy grail of physics.

PascalBasicC/C++whatever

If you got nothing else from this article, I would sincerely hope you gained an appreciation for why "just adding a simple address operator" to Pascal is a real bad idea. In fact, many or even most Pascals have an operator like:

@variable

that converts the address of a variable to a pointer. It does not take much to import the power of the stored program computer model directly into your program. You have been "freed", but what you have been freed to do is tie the stored program computer model firmly around your neck. This is why amalgamations such as Delphi[2], which is basically Pascal with Basic and C/C++ added, are dead ends. Without requiring a complete rewrite of most programs, there is really no way for such systems to go forward to more advanced architectures.

The worst news about such "type escapes" is they tend to proliferate. Creating pointers to locals may have just been a programming style decision in the days of DOS, but the massive catalog of Windows, Mac OS X and Linux API calls written in C/C++ can force programmers to use these features just because C also uses them. For example, go back to the example that started this letter:

void addone(int *n, int maxn);

To call the addone function, you are pretty much forced to use the "address of" operator in Pascal because the "int *n" parameter can take many different formats, including a single integer passed as VAR, an array, or even 0 if the parameter being null has a special meaning such as "this parameter is unused".

[1]    In this document I refer to the "stored program computer" model, sometimes referred to as the "Von Neumann" model for reasons of historical respect that I won't get into here.

[2] Delphi is a trademark of Codegear.


For more information contact: Scott A. Moore samiam@moorecad.com
Note that empty subject lines are rejected as spam, please give your mail a subject line.