
Parse Apache access.log Eliminate Spiders and Bots


(OP)
I've created a relatively simple function to import the Apache access.log file into MySQL. The plan is to run it just once per site (if it doesn't time out!) to populate the table; after that it won't need to run again, since other programming on the site will keep the table up to date automatically. This is needed because I do not have direct access to the Apache logs on the server, but the hosting company is willing to send me a copy of just my logs as a one-shot deal.

However, I cannot figure out how to stop it from recording the spider and bot hits. For my purposes they are irrelevant, so how can the function be changed to ignore them, or at least minimize them, to keep the database table a more manageable size?

Here is the function:

CODE --> PHP

function ParseToDatabase($path) {
	// Parses NCSA Combined Log Format lines:
	$pattern = '/^([^ ]+) ([^ ]+) ([^ ]+) (\[[^\]]+\]) "(.*) (.*) (.*)" ([0-9\-]+) ([0-9\-]+) "(.*)" "(.*)"$/';
	if (is_readable($path)) :
		$fh = fopen($path, 'r') or die("Cannot open $path");
		while (!feof($fh)) :
			$s = fgets($fh);
			// Only process lines that match; otherwise the stale values from
			// the previous line would be inserted again.
			if (preg_match($pattern, $s, $matches)) :
				list($whole_match, $remote_host, $logname, $user, $date_time, $method, $request,
					$protocol, $status, $bytes, $referer, $user_agent) = $matches;
				// Convert "[10/Oct/2023:13:55:36 -0700]" to "2023-10-10 13:55:36"
				$date_time = str_replace(array("[", "]"), "", $date_time);
				$date_time = date('Y-m-d H:i:s', strtotime($date_time));
				// NOTE: values are interpolated unescaped; unless DBConnect()
				// escapes them, quotes in the Referer or User-Agent fields
				// will break (or inject into) the query.
				$sqlInsert = sprintf("INSERT INTO accesslog (RemoteHost, IdentUser, AuthUser, TimeStamp, Method, RequestURI, RequestProtocol, Status, Bytes, Referer, UserAgent) 
									  VALUES ('%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s')",
										  $remote_host, 
										  $logname, 
										  $user,
										  $date_time, 
										  $method, 
										  $request, 
										  $protocol, 
										  $status, 
										  $bytes, 
										  $referer, 
										  $user_agent);
				DBConnect($sqlInsert, "Insert", "db_name");
			endif;
		endwhile;
		fclose($fh);
	else : 
		echo "Cannot access log file!";
	endif;
}
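If a rough cut is enough, a substring test on the parsed User-Agent inside the loop will drop the bulk of crawler traffic before the INSERT. A minimal sketch (the marker list is illustrative, not exhaustive, and `LooksLikeBot` is a made-up helper name):

```php
<?php
// Returns true when the User-Agent looks like a known crawler.
// The marker substrings are a rough, illustrative list -- extend as needed.
function LooksLikeBot($user_agent) {
	$markers = array('bot', 'spider', 'crawl', 'slurp', 'archiver');
	foreach ($markers as $marker) {
		if (stripos($user_agent, $marker) !== false) {
			return true;
		}
	}
	return false;
}

// Inside ParseToDatabase(), just before building $sqlInsert:
//     if (LooksLikeBot($user_agent)) continue;
```

Because stripos() is case-insensitive, the single 'bot' marker catches Googlebot, Bingbot, YandexBot and most of their relatives.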

RE: Parse Apache access.log Eliminate Spiders and Bots

The Apache access logs for your vhost should be (assuming a Linux OS) in /home/accountname/logs/, and your control panel account should provide you with access to /accountname/ and all folders below it.

You are not likely to be allowed access to /var/log/ of course, unless you are on a VPS or dedicated box.

However, back to the plot:

You can exclude known 'bots by matching information in the User-Agent string, for which you will need an up-to-date list of 'bot user agents.

http://www.useragentstring.com/pages/Crawlerlist/, http://www.robotstxt.org/db.html or http://www.user-agents.org/ (the lists do get updated from time to time)

And there is a JSON feed at https://github.com/monperrus/crawler-user-agents which is actively maintained.
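For something more thorough than a hand-kept list, that JSON feed can drive the filter directly. A sketch, assuming a local copy of the repo's crawler-user-agents.json, whose entries each carry a "pattern" regular-expression field (the function names here are invented for the example):

```php
<?php
// Load the "pattern" regex fragments from a local copy of
// crawler-user-agents.json (github.com/monperrus/crawler-user-agents).
function LoadBotPatterns($json_path) {
	$entries = json_decode(file_get_contents($json_path), true);
	$patterns = array();
	foreach ($entries as $entry) {
		$patterns[] = $entry['pattern'];
	}
	return $patterns;
}

// True when the User-Agent matches any known crawler pattern.
function MatchesBotPattern($user_agent, $patterns) {
	foreach ($patterns as $p) {
		if (preg_match('~' . $p . '~i', $user_agent)) {
			return true;
		}
	}
	return false;
}
```

Load the patterns once before the parse loop rather than per line; the list runs to hundreds of entries.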


Chris.

Indifference will be the downfall of mankind, but who cares?
Time flies like an arrow, however, fruit flies like a banana.
Webmaster Forum

RE: Parse Apache access.log Eliminate Spiders and Bots

(OP)
Yes, the logs are there, but only as .gz files compressed by month. Because it's a shared server with other clients, the actual logs are not available any other way, which is why the hosting company offered to send me a copy. I suppose I could write a script to extract them to a file, but I don't want to go that route unnecessarily. The logs for all my sites are in the same folder, so the script would have to pick out only those for a given site, extract them, and save them to a file that I could then parse to the database (or it could even do it directly), but I fear such a script would take a long, long time to run. I am open to the idea, though, if it's a better way of doing it.
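For what it's worth, PHP can read the .gz files in place via the compress.zlib:// stream wrapper, so no extraction script is needed; the existing fopen()/fgets() loop only needs the wrapped path. A small sketch (`CountGzLines` is just an illustrative name):

```php
<?php
// Stream a gzipped log line by line through the compress.zlib:// wrapper;
// no temporary extracted copy is needed. Returns the number of lines read.
function CountGzLines($gz_path) {
	$fh = fopen('compress.zlib://' . $gz_path, 'r');
	$count = 0;
	while (($line = fgets($fh)) !== false) {
		// ... the same per-line parsing as in ParseToDatabase() would go here ...
		$count++;
	}
	fclose($fh);
	return $count;
}
```

By the same token, handing ParseToDatabase() a path prefixed with 'compress.zlib://' should let it work on a compressed log directly.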

On the question above, is there some way to minimize them by simply filtering against "spider" or "bot", which most seem to have in the User-Agent? I'm not really familiar with string patterns or how to filter them unless I'm doing it through a database query, but this is just raw log text. It's okay if it misses some, but at least most would be eliminated that way.

RE: Parse Apache access.log Eliminate Spiders and Bots

Quote:

the actual logs are not available any other way

You could always use Google Analytics, which already has just about every method of filtering/analysing/reporting known to man.

(other analysis tools are available)

I did the "roll your own" PHP/MySQL site access logging many years ago and abandoned the project when Google made Urchin freely available. However, if you are determined to finish it, forget about filtering 'bots before saving to MySQL, because that information IS useful for 'proper' site analysis. Just make ignoring crawlers part of the reporting capabilities.

Having more information and not needing it is infinitely more valuable than needing it and realising you did not save it.
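Filtering at report time, as suggested here, can then be a WHERE clause against the stored UserAgent column (table and column names taken from the INSERT earlier in the thread; the LIKE patterns are a rough cut):

```php
<?php
// Report on human traffic only; crawler rows stay in the table for the
// occasions when they are wanted.
$sqlReport = "SELECT RequestURI, COUNT(*) AS Hits
              FROM accesslog
              WHERE UserAgent NOT LIKE '%bot%'
                AND UserAgent NOT LIKE '%spider%'
                AND UserAgent NOT LIKE '%crawl%'
              GROUP BY RequestURI
              ORDER BY Hits DESC";
// e.g. DBConnect($sqlReport, "Select", "db_name");
// (assuming the OP's helper also handles SELECTs)
```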





Chris.
