Well, lilith, Perl does have hash lookup tables!
They are a native structure of the language, are called hashes (or associative arrays) and are distinguished from the other types of structure (scalars and references, subs, arrays) by prefixing their name with a % sign.
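A minimal sketch of what a hash looks like in practice; the terms and numbers here are made up purely for illustration:

```perl
use strict;
use warnings;

# Made-up sample data: three terms mapped to index numbers.
my %term = (apple => 0, banana => 1, cherry => 2);

# A single element is a scalar, so it is read with $ and braces.
my $index = $term{banana};

# exists tests for a key without creating it.
my $known = exists $term{durian} ? 1 : 0;

# keys returns the keys in no particular order, so sort for stable output.
my @names = sort keys %term;
my $count = scalar @names;

print "banana has index $index\n";
print "terms: @names ($count in total)\n";
```

Note that the whole hash is written [tt]%term[/tt], while one element is written [tt]$term{banana}[/tt].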
If I understand your problem correctly, this seems to me the way to go:
-with your terms extractor, create a hash table [tt]%term[/tt] containing the terms (this is equivalent to your term.txt file) as the keys; the value of each key will be an incremental number ranging from 0 to the total number of unique terms minus 1, or t-1 with your symbols (this will be used as an index, called i or j below, identifying each unique term);
-an auxiliary array [tt]@term_values[/tt] will contain all the unique terms, the index in the array being i as above: this allows a term to be retrieved from its index;
-a two-dimensional array [tt]@cooccur[/tt] will contain, for each pair of indexes i, j, the count as defined by you; note that it is not necessary to increment the count for both i, j and j, i, as the matrix will be symmetrical; however, you need to decide whether you count one or two when two equal terms occur within the window (the option 'one' is adopted below);
-a second hash table [tt]%terms_in_window[/tt] will contain all the terms occurring within the window; this is an evolving table (see below) where the key is the term and the value is the number of occurrences of that term in the window;
-another (!) auxiliary (linear) array [tt]@window[/tt], handled as a circular list, will contain the i indexes of the terms occurring within the window, in the same order as they appear in the text; the positions occupied by words that are not terms contain -1.
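As a rough illustration of the first two structures, here is one way [tt]%term[/tt] and [tt]@term_values[/tt] could be filled together. The array [tt]@raw_terms[/tt] is a made-up stand-in for whatever your terms extractor (or term.txt) actually delivers:

```perl
use strict;
use warnings;

# Hypothetical input: in practice these would come from the terms
# extractor, or be read line by line from term.txt.
my @raw_terms = qw(cat dog cat bird dog);

my %term;          # term  => index
my @term_values;   # index => term

for my $t (@raw_terms) {
    next if exists $term{$t};          # keep only the first occurrence
    $term{$t} = scalar @term_values;   # next free index: 0, 1, 2, ...
    push @term_values, $t;
}

# Each index maps back to its term.
for my $i (0 .. $#term_values) {
    print "$i: $term_values[$i]\n";
}
```

Filling both at the same time keeps them consistent: [tt]$term_values[ $term{$t} ][/tt] always gives back [tt]$t[/tt].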
Now the pseudocode using the above data structures would be as follows (m is the index in the circular list and w is the window size):
Code:
m=0;
window[k]=-1 for k=0..w-1;
foreach word a in data.txt
    if window[m]>=0
        terms_in_window{term_values[window[m]]}--;
        if terms_in_window{term_values[window[m]]}==0
            delete terms_in_window{term_values[window[m]]};
        endif
    endif
    if a is a term
        i=term{a};
        foreach key b in %terms_in_window
            j=term{b};
            if i<=j
                cooccur[i,j]+=terms_in_window{b};
            else
                cooccur[j,i]+=terms_in_window{b};
            endif
        endfor
        window[m]=term{a};
        terms_in_window{a}++;
    else
        window[m]=-1;
    endif
    m++;
    m=0 if m>=w;
endfor
This is of course not Perl code! If you need it, come back after checking the above (though as you didn't know of the existence of hash tables in Perl, I wonder what use you could make of it...)
However, note that:
-the [tt]delete[/tt] operation above removes a key (and its value) from a hash table;
-executing something like [tt]terms_in_window{a}++;[/tt] (in Perl) creates the hash element with a value of one if it did not exist before, or increments its value by one if it already existed.
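To show both points in context, here is a rough Perl sketch of the pseudocode above. It is only an illustration under assumptions: the sub name [tt]count_cooccurrences[/tt], its argument list and the sample text are made up, and the words are assumed to be already split into an array.

```perl
use strict;
use warnings;

# Hypothetical helper: counts co-occurrences of terms within a sliding
# window of $w words. $words is an array ref holding the words of the
# text, $term a hash ref mapping each term to its index.
sub count_cooccurrences {
    my ($words, $term, $w) = @_;

    my @term_values;                       # index => term
    $term_values[ $term->{$_} ] = $_ for keys %$term;

    my @cooccur;                           # upper triangle only: i <= j
    my %terms_in_window;                   # term => occurrences in window
    my @window = (-1) x $w;                # circular list of term indexes
    my $m = 0;

    for my $word (@$words) {
        # Drop the entry that is about to leave the window.
        if ($window[$m] >= 0) {
            my $old = $term_values[ $window[$m] ];
            delete $terms_in_window{$old} if --$terms_in_window{$old} == 0;
        }

        if (exists $term->{$word}) {
            my $i = $term->{$word};
            for my $other (keys %terms_in_window) {
                my $j = $term->{$other};
                my ($lo, $hi) = $i <= $j ? ($i, $j) : ($j, $i);
                $cooccur[$lo][$hi] += $terms_in_window{$other};
            }
            $window[$m] = $i;
            $terms_in_window{$word}++;     # created with value 1 if absent
        }
        else {
            $window[$m] = -1;
        }

        $m = 0 if ++$m >= $w;
    }
    return \@cooccur;
}

# Made-up sample text and terms, with a window of 4 words.
my %term  = (cat => 0, dog => 1);
my @words = qw(the cat and the dog saw a cat);
my $c = count_cooccurrences(\@words, \%term, 4);
printf "cat/dog: %d\n", $c->[0][1] // 0;
```

Only the half of the matrix with i <= j is filled, as discussed above, so a lookup should always put the smaller index first.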
Franco