Welllllll... Let me get on my podium for a while to talk about Tcl arrays. The following comments aren't intended for you personally,
marsd. I don't know the overall structure of your application, and so can't comment as to whether lists or arrays (or more complex data structures provided by a Tcl extension) are a more appropriate choice. But I'd like for anyone who happens to be reading this thread to understand
why one would choose lists vs. arrays to manage data in a Tcl application.
Let's start by reiterating that all languages have their strengths and weaknesses, and arrays are one of Tcl's weaknesses. But their primary weakness is that they aren't "first class" objects in Tcl -- you can't pass the value of an array around in Tcl like you can strings, lists, etc., but have to resort to
upvar tricks to do pass-by-reference. Life would be easier if we could pass the value of an array to a procedure when we wanted.
But the more troublesome issue with Tcl arrays, in my opinion, is the name. When people hear "array," they immediate think of C-style, integer-indexed arrays, and try to use Tcl arrays in the same way. I think that if we had named these data structures "hashes" or "dictionaries," people wouldn't be so quick to fall into that trap. But once people latch onto that word "array," they start using these data structures in situations where a Tcl list is much better suited. But the primary purpose of Tcl arrays is to map arbitrary-string keys to arbitrary-string values.
What's the problem with using a Tcl array as an integer-indexed array? Nothing as such, but it's far less efficient than using a Tcl list. In C (and other languages that use true integer-indexed arrays), the values stored in an array are of a fixed and equal size. (Even in the case of an array of "strings," which turns out to be an array of character pointers to the actual string values.) Thus, when you use "a(327)" in a C program, the application simply has to multiply 327 by the fixed element size and add that to the address of the beginning of the array. You can even use a pointer into the array and increment or decrement the pointer to get to different array elements.
In contrast, every array access in Tcl is actually an access into a hash table. When you use a reference like "a(327)" in Tcl, it has to compute the hash of the string "327", then use that reference into the hash table (and then check to see if there were multiple entries stored at that hash location, and if so, pick out the right one). It's even worse if you're iterating through an array with a
for loop, because when you do your
incr of your index value, Tcl has to convert the index to a number, but when you then do your array reference, Tcl has to convert the index to a string (because Tcl uses strings for its hash values). Thus, you've got your loop index constantly "shimmering" between an integer and a string representation, adding to the overhead.
On the other hand, Tcl lists
are represented internally as simple, C-style arrays of pointers to the element values. Thus, when you say
lindex $list 327, Tcl can take the memory pointer to the list structure, add an offset of 327 times the size of an element pointer, and immediately grab the value of the list element.
As a demonstration of the efficiency of a Tcl list vs. a Tcl array, let's do some timing experiments on accessing random elements from the two different structures. I ran the following on my laptop, which is still using Tcl version 8.3.2 (I really should upgrade to 8.3.4):
[tt]%
for {set i 0} {$i < 100000} {incr i} {
lappend testlist $i
set testarray($i) $i
}
%
[ignore]proc test1 {iter} {
global testlist
for {set i 0} {$i < $iter} {incr i} {
set index [expr { int(rand()*100000) }]
set dummy [/ignore][ignore][lindex $testlist $index][/ignore]
}
}
%
[ignore]proc test2 {iter} {
global testarray
for {set i 0} {$i < $iter} {incr i} {
set index [expr { int(rand()*100000) }][/ignore]
set dummy $testarray($index)
}
}
%
time {test1 1000000}
5498000 microseconds per iteration
%
time {test2 1000000}
6799000 microseconds per iteration[/tt]
As you can see, each array element access was approximately 1.3 microseconds longer than each corresponding list element access.
Okay, enough of this soapbox. Let's look a bit more at your situation,
marsd. Once again, Tcl lists are superior for doing Unix-style "uniq" operations. Especially since Tcl version 8.3, when we added the
lsort -unique option, which removes duplicates from the sorted result. But let's assume that we need to create a
unique procedure that weeds out elements in an array that have duplicate values.
In regards to your question, it
is possible for a Tcl procedure to delete, modify, or create variables in the calling scope using
upvar. Once you've created your alias with
upvar, any action you perform on the alias is also performed on the variable in the calling scope. If you
unset an alias, the corresponding variable is deleted. And if you assign a value to an alias when the variable doesn't exist, Tcl creates the variable for you in the calling scope. As an example, let's create a simple procedure called
swap that modifies an existing array variable, swapping each array element's key and value:
[tt]%
[ignore]proc swap {var} {
upvar 1 $var x
foreach {key value} [array get x] {
set newx($value) $key
}
unset x
array set x [array get newx]
}[/ignore]
%
array set test {1 a 2 b 3 c}
%
parray test
test(1) = a
test(2) = b
test(3) = c
%
swap test
%
parray test
test(a) = 1
test(b) = 2
test(c) = 3[/tt]
So, we can apply the same technique to your
uniq procedure. But let's see how we can make that more efficient.
Currently, you've got nested
for loops doing explicit comparisons of all the array element values. As coded, it's an order N^2 algorithm, which is going to get bogged down with large arrays. I know you don't like array/list trickery,
marsd, but consider that every time you use a
foreach loop to process an array, you're doing a conversion to a list with the
array names command. If you
really wanted to maintain the data in array format only, you'd use the
array startsearch,
array anymore,
array nextelement, and
array donesearch commands to actually walk the array's hash table. (Don't feel bad. I don't think I've ever seen these array commands used to walk an array in any production code I've seen. It's
so much easier to do the
foreach/
array names trick.) So, let me show you a technique for doing a "uniq" on an array with some minimal list trickery, just to walk the array.
We're going to take advantage of the fact that all array elements must have unique keys. If you try to set the value of an array element and the key already exists, Tcl just overwrites the existing element with the new value. So, I'm going to perform a "swap" of our input array's keys and values. Thus, whenever a value in our original array is repeated, we'll end up overwriting the element in our new, swapped array. This has the effect of eleminating all duplicate values. Then, I'll perform another "swap" of the keys and values so that when I return our new, double-swapped array, the data is in the same format as originally. Here it is:
[tt]%
[ignore]proc uniq {arr} {
upvar 1 $arr x
foreach {key value} [array get x] {
set newx($value) $key
}
foreach {key value} [array get newx] {
set newx2($value) $key
}
unset x
array set x [array get newx2]
}[/ignore]
%
array set dup {a 1 b 2 c 3 d 1 e 1 f 2 g 3 h 2 i 1}
%
uniq dup
%
parray dup
dup(a) = 1
dup(b) = 2
dup(c) = 3[/tt]
Of course, it was pure chance that by removing the duplicates, we retained the first three keys. We just as easily could have ended up with "dup(i) = 1" instead of "dup(a) = 1".
And just to demonstrate once more how lists will be more efficient than arrays in a situation like this, lets compare the
uniq procedure to the
lsort -unique command. In the following example, I perform the same initialization sequence in each timing experiment so that differences in the setup of the runs aren't a factor:
[tt]%
[ignore]for {set i 0} {$i < 1000} {incr i} {
set val [expr { int(rand()*100) } ]
lappend origlist $val
set origarray($i) $val
}[/ignore]
%
[ignore]time {
set vallist $origlist
array set valarray [array get origarray]
uniq valarray
} 1000[/ignore]
10816 microseconds per iteration
%
[ignore]time {
set vallist $origlist
array set valarray [array get origarray]
lsort -unique $vallist
} 1000[/ignore]
6149 microseconds per iteration[/tt]
If we then time our setup code, we can then subtract that from each case to find out how much time was consumed by the actual "unique" operations:
[tt]%
[ignore]time {
set vallist $origlist
array set valarray [array get origarray]
} 1000[/ignore]
4677 microseconds per iteration[/tt]
Thus, our array version took 10816-4677=6139 microseconds to do the uniq, whereas our list version took 6149-4677=1472 microseconds.
As to your comments: "As always thanks for your help, and it is good to see that you are back with authoritative answers.

"
My pleasure. It's a good way for me to exercise my Tcl knowledge, and to explore areas that I might not otherwise on my own. Of course, those who actually pay for my knowledge get top priority, which is why I was absent from this forum for a while. And while I enjoy helping people out here, I must say that I need to be "absent with pay" more frequently so that I can make my mortgage. Let's hope that the economy continues to pick up. (And I will happily accept any training or consulting referrals that people might have...) As for the "authoritative answers" part, if I'd known that people were putting that much stock in my opinions, I would have put a bit more thought into my replies to make sure they were correct.

- Ken Jones, President
Avia Training and Consulting
866-TCL-HELP (866-825-4357) US Toll free
415-643-8692 Voice
415-643-8697 Fax