Data Structures Help 2

moria6 · Nov 20, 2003

In an effort to increase my Perl knowledge (almost nonexistent) and work on a project, I need help getting statistics from an e-mail log file. I would like to output a file with an entry for each unique sender and the number of times that sender appears in the log file.

I know enough to open and close the file and to match the field I'm looking for (it's a comma-delimited file). What I don't know is what type of data structure to use for the data.
BTW, the file has 266k lines.

Thanks in advance!

maurie

bluedragon2 · Nov 20, 2003

Build an array of the unique names in the log, then count the occurrences of each name.

Blue [dragon]

If I wasn't Blue, I would just be a Dragon...

raklet · Nov 20, 2003

Better than an array, you want to use a hash. An array is a list of elements accessed by numeric index. The elements need not be unique: array item 1 can be "red", and array item 50 can also be "red".

A hash, on the other hand, uses strings for its index values (keys), and each key is unique. So if you encounter the same string more than once, you don't add a new entry; you add one to the value stored under that key. That way the hash entry with the key "red" holds 2 once "red" has occurred twice.
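As a minimal sketch of that idea (the color list is made-up sample data, not anything from the log), counting duplicates with a hash might look like this:

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for one field of a log.
my @colors = ('red', 'blue', 'red', 'green', 'blue', 'red');

# An array keeps every element, duplicates included, by numeric index.
# A hash keeps one entry per distinct key, so its value can hold a count.
my %count;
$count{$_}++ for @colors;

# Sort the keys so the output order is predictable.
for my $color (sort keys %count) {
    print "$color: $count{$color}\n";
}
```

The first time a key is seen, Perl creates the entry and `++` takes it from undefined to 1, so no separate "have I seen this before?" check is needed.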

moria6 · Nov 20, 2003

Thanks to both of you for the help! I'll go back to my Perl Cookbook and read up on hashes and try to hack some code together. I'll post what I come up with.

Thanks again!

maurie

raklet · Nov 20, 2003

Great. And I will be glad to help you fix any problems you might have. Are you parsing Microsoft Exchange Tracking Logs by chance? If so, I have experience with that as well.

moria6 · Nov 20, 2003

No, we're using Novell's GroupWise 6.5. Not as susceptible to hacking but fewer people writing tools for it. I could buy what I need but it's horribly expensive and my perl skills wouldn't get any better.

maurie

raklet · Nov 20, 2003

Too true about expensive software and programming skills. We bought a program (a long time ago) that would parse Exchange tracking logs and return detailed statistics. Now that our logs have grown to several GB in size, the program takes over 24 hours to generate stats for four servers. I rewrote the whole thing in Perl and can now get the same stats in about 12 minutes.

moria6 · Nov 21, 2003

Way to go on the perl re-write!

I'm hoping to put together something in perl that will do what the expensive canned package can do and more. Along the way I'm hoping to hone what little perl skills I have (because, truth be told, it's fun!). My boss will be happy just to see some stats for nothing more than the cost of my time and most of the time I work on this at home, off hours.

I'm hoping once I get this first part up and running to be able to do: top 10 lists (not Letterman) of senders, recipients, domains, attachment sizes, etc. And somewhere down the road be able to identify spam based on subject lines and report on it. I don't expect too much out of myself. <g>
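A top-10 list like the ones described above usually comes down to sorting a hash of counts by value. Here is a rough sketch, assuming the counts are already in a hash; the addresses, the numbers, and the helper name `top_n` are all invented for illustration:

```perl
use strict;
use warnings;

# Hypothetical counts, as a sender-counting pass might produce them.
my %sent = (
    'alice@example.com' => 42,
    'bob@example.com'   => 7,
    'carol@example.com' => 19,
    'dave@example.com'  => 42,
);

# Return up to $n keys of the hash, highest count first.
# Ties are broken alphabetically so the order is repeatable.
sub top_n {
    my ($counts, $n) = @_;
    my @ranked = sort { $counts->{$b} <=> $counts->{$a} || $a cmp $b }
                 keys %$counts;
    $n = @ranked if $n > @ranked;    # fewer entries than requested
    return @ranked[0 .. $n - 1];
}

for my $sender (top_n(\%sent, 3)) {
    print "$sender: $sent{$sender}\n";
}
```

The same helper would work for recipients or domains, since it only needs a hash of name to count.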

Thanks again for your help!

maurie

raklet · Nov 21, 2003

I think that sounds like a great project. I am sure you will do well and find it a rewarding experience. Keep perling away and good luck.

bluedragon2 · Nov 21, 2003

raklet, Yes you are right, a hash would probably be better. I am a bit new to perl myself, but just to prove that you could do it my way:

#!/opt/bin/perl

# Read the whole log into memory, one line per array element.
@logfile = `cat logfile.txt`;

# First pass: build a list of the unique names (first comma-separated field).
foreach $line (@logfile) {
    @bits = split(/,/,$line);
    $name = $bits[0];
    $ck = 1;
    foreach $namecheck (@namelist) {
        if ($namecheck eq $name) {
            $ck = 0;
            last;
        }
    }
    if ($ck == 1) {
        push (@namelist,$name);
    }
}

# Second pass: for each unique name, rescan the log and count its lines.
foreach $n (@namelist) {
    $ct = 0;
    foreach $line (@logfile) {
        @bit = split(/,/,$line);
        $name = $bit[0];
        if ($n eq $name) {
            $ct += 1;
        }
    }
    print "$n: $ct\n";
}

This little snippet worked for me. I know it may be long and cumbersome...but I am still learning too...

Blue [dragon]

If I wasn't Blue, I would just be a Dragon...

raklet · Nov 21, 2003

Blue,

I have no dispute with your method. I agree that using arrays will work. It is just cumbersome and inefficient to do so. Your code would be a third that size using a hash and it will operate faster as well.

bluedragon2 · Nov 21, 2003

I'll look into that...

Always learning

Blue [dragon]

If I wasn't Blue, I would just be a Dragon...

raklet · Nov 21, 2003

Here is a page from Sams Teach Yourself Perl in 21 Days that talks about this subject:

Limitations of Array Variables
In the array variables you've seen so far, you can access an element of a stored list by specifying a subscript. For example, the following statement accesses the third element of the list stored in the array variable @array:

$scalar = $array[2];

The subscript 2 indicates that the third element of the array is to be referenced.

Although array variables are useful, they have one significant drawback: it's often difficult to remember which element of an array stores what. For example, suppose you want to write a program that counts the number of occurrences of each capitalized word in an input file. You can do this using array variables, but it's very difficult. Listing 10.1 shows you what you have to go through to do this.

--------------------------------------------------------------------------------

Listing 10.1. A program that uses array variables to keep track of capitalized words in an input file.

1: #!/usr/local/bin/perl
2:
3: while ($inputline = <STDIN>) {
4: while ($inputline =~ /\b[A-Z]\S+/g) {
5: $word = $&;
6: $word =~ s/[;.,:-]$//; # remove punctuation
7: for ($count = 1; $count <= @wordlist;
8: $count++) {
9: $found = 0;
10: if ($wordlist[$count-1] eq $word) {
11: $found = 1;
12: $wordcount[$count-1] += 1;
13: last;
14: }
15: }
16: if ($found == 0) {
17: $oldlength = @wordlist;
18: $wordlist[$oldlength] = $word;
19: $wordcount[$oldlength] = 1;
20: }
21: }
22: }
23: print ("Capitalized words and number of occurrences:\n");
24: for ($count = 1; $count <= @wordlist; $count++) {
25: print ("$wordlist[$count-1]: $wordcount[$count-1]\n");
26: }

--------------------------------------------------------------------------------

$ program10_1
Here is a line of Input.
This Input contains some Capitalized words.
^D
Capitalized words and number of occurrences:
Here: 1
Input: 2
This: 1
Capitalized: 1
$

This program reads one line of input at a time from the standard input file. The loop starting on line 4 matches each capitalized word in the line; the loop iterates once for each match, and it assigns the match being examined in this particular iteration to the scalar variable $word.

Once any closing punctuation has been removed by line 6, the program must then check whether this word has been seen before. Lines 7-15 do this by examining each element of the list @wordlist in turn. If an element of @wordlist is identical to the word stored in $word, the corresponding element of @wordcount is incremented.

If no element of @wordlist matches $word, lines 16-20 add a new element to @wordlist and @wordcount.

Definition
As you can see, using array variables creates several problems. First, it's not obvious which element of @wordlist in Listing 10.1 corresponds to which capitalized word. In the example shown, $wordlist[0] contains Here because this is the first capitalized word in the input file, but this is not obvious to the reader.

Worse still, the program has no way of knowing which element of @wordlist contains which word. This means that every time the program reads a new word, it has to check the entire list to see if the word has already been found. This becomes time-consuming as the list grows larger.

All of these problems with array variables exist because elements of array variables are accessed by numeric subscripts. To get around these problems, Perl defines another kind of array, which enables you to access array variables using any scalar value you like. These arrays are called associative arrays.

And here is the same word counting program using a hash:

#!/usr/local/bin/perl

while ($inputline = <STDIN>) {
    while ($inputline =~ /\b[A-Z]\S+/g) {
        $word = $&;
        $word =~ s/[;.,:-]$//; # remove punctuation
        $wordlist{$word} += 1;
    }
}
print ("Capitalized words and number of occurrences:\n");
foreach $capword (keys(%wordlist)) {
    print ("$capword: $wordlist{$capword}\n");
}

bluedragon2 · Nov 21, 2003

So right you are...I changed my original to:

#!/opt/bin/perl

@logfile = `cat logfile.txt`;

foreach $line (@logfile) {
    @bits = split(/,/,$line);
    $name = $bits[0];
    $namelist{$name} += 1;
}
foreach $name (keys(%namelist)) {
    print ("$name: $namelist{$name}\n");
}

for the same results...

Blue [dragon]

If I wasn't Blue, I would just be a Dragon...
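For a 266k-line log, the same count can also be done without loading the whole file through `cat`. The following is only a sketch under a few assumptions: the sender is the first comma-separated field (as in the code above), no field contains a quoted comma, and the helper name `count_field` and the sample addresses are made up so the example runs without a real log.

```perl
use strict;
use warnings;

# Count how often each value of one comma-separated field appears.
# Reads one line at a time, so memory grows with the number of distinct
# senders rather than with the number of lines in the log.
sub count_field {
    my ($fh, $index) = @_;
    my %count;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line eq '';                  # skip blank lines
        my @fields = split /,/, $line, -1;    # naive split: no quoted commas
        next unless defined $fields[$index];
        $count{ $fields[$index] }++;
    }
    return \%count;
}

# Hypothetical sample standing in for logfile.txt.
my $sample = join '',
    "maurie\@example.com,joe\@example.com,hello\n",
    "blue\@example.com,maurie\@example.com,re: hello\n",
    "maurie\@example.com,blue\@example.com,thanks\n";

# Open the string as an in-memory file; for a real log, open
# 'logfile.txt' for reading here instead.
open my $in, '<', \$sample or die "Cannot read sample: $!";
my $count = count_field($in, 0);
close $in;

for my $sender (sort keys %$count) {
    print "$sender: $count->{$sender}\n";
}
```

To produce the output file the original question asked for, one option is to open a second handle for writing and print to that handle instead of standard output.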
