ASCII to Hex Conversion?

PumpgunMessiah (TechnicalUser) (OP)
17 Mar 12 17:53
I'm in the midst of writing a quick and dirty URL encoding function in AWK (replacing all special characters with their hex values, e.g. 'http://www.url.com/test text.html' -> 'http://www.url.com/test%20text.html') and am struggling with this rather simple task.

Well, it's rather easy achieving this by using printf via shell (notice the apostrophe in front of the ASCII character, in this case the [):
printf "%02x" "'["
> 5b

But I'm having trouble replicating the same effect in AWK. I've tried everything I could think of with awk's built-in printf and sprintf. I'm not even sure the apostrophe is necessary. I also had trouble concatenating it to the character to be converted, since even when escaped with a backslash I couldn't add the darn ' without errors. I worked around this in a rather ugly way by using 'awk -v' to bring the apostrophe in from "outside" of awk.

But still, I can't get awk's printf function to do the same as the "regular" shell printf.
PumpgunMessiah (TechnicalUser) (OP)
17 Mar 12 18:11
I forgot to add my small testing script (tested with OS X):

CODE

#!/bin/sh

teststr=$1

echo "$teststr" | awk -v dummy="'" '
function url_encode(rawURL) {
    cleanURL=""
    nonURLPos=match(rawURL,/[^[:alnum:]]/)
    while( nonURLPos > 0 ) {
        rawChar=substr(rawURL,nonURLPos,1)
        replaceChar=sprintf("%02x", dummy rawChar)
        cleanURL = cleanURL substr(rawURL,1,nonURLPos-1) "%" replaceChar
        rawURL = substr(rawURL,nonURLPos+1)
        nonURLPos=match(rawURL,/[^[:alnum:]]/)
    }
    cleanURL = cleanURL rawURL
    return cleanURL
}
{ print(url_encode($0))}
'

Running it:
urlencode.sh "abc ABC[123]456"
> abc%00ABC%00123%00456

Replacing sprintf with printf in 'replaceChar=sprintf("%02x", dummy rawChar)' results in this error:
awk: syntax error at source line 7 in function url_encode
 context is
         >>>     replaceChar=printf <<< ("%02x", dummy rawChar)
awk: illegal statement at source line 8 in function url_encode
Annihilannic (MIS)
18 Mar 12 21:18
I've never encountered that apostrophe character constant syntax before... is it a C thing?

As you say, awk certainly doesn't support it... awk has a very simple "typeless" variable system which tries to intelligently DWIM (do what I mean) rather than handling everything literally, which unfortunately is stumping you here.  I can't think of a way around it for now... does it need to be awk?

If it's going to be a pure awk script, you could avoid the shell part entirely and use a #!/usr/bin/awk -f shebang line to allow you to use the apostrophe directly in the script rather than dummy variables, but I tried that too and it didn't help the printf situation.
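
To make the two symptoms above concrete, here is a minimal sketch (the variable names quoted and numeric are only illustrative). As far as I can tell, awk treats a string that does not start with a number as 0 when a numeric format asks for it, which would explain the %00 output. The syntax error comes from printf being a statement in awk, while sprintf is the function that returns a string.

```shell
#!/bin/sh
# "\047[" is the two-character string '[ ("\047" is the octal escape for ').
# %02x wants a number; the string does not look like one, so awk uses 0.
quoted=$(awk 'BEGIN { printf "%02x", "\047[" }')

# sprintf returns the formatted string, so it can be assigned.
# printf is a statement, so "x = printf(...)" is a syntax error.
numeric=$(awk 'BEGIN { s = sprintf("%02x", 91); print s }')

echo "$quoted $numeric"    # prints: 00 5b
```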

Annihilannic
tgmlify - code syntax highlighting for your tek-tips posts

PumpgunMessiah (TechnicalUser) (OP)
19 Mar 12 15:22

Quote (Annihilannic):

I've never encountered that apostrophe character constant syntax before... is it a C thing?
I'm not sure myself where that weird apostrophe syntax comes from. I was just searching the web for a solution to my small problem and found this neat little trick, which sadly doesn't work with AWK.
But it seems it is indeed part of the specification of the "classic" printf. It just isn't that well documented:
http://pubs.opengroup.org/onlinepubs/009695399/utilities/printf.html
Where the important part is:

Quote:

If the leading character is a single-quote or double-quote, the value shall be the numeric value in the underlying codeset of the character following the single-quote or double-quote.
So it's not just the apostrophe: either a leading single quote or a leading double quote will do.
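
A quick sketch of the quoted rule with the shell's printf (the variable names are just for illustration):

```shell
#!/bin/sh
# A leading ' or " in a numeric argument makes printf use the code
# of the character that follows it.
hex_single=$(printf '%02x' "'[")   # leading single quote
hex_double=$(printf '%02x' '"[')   # leading double quote
dec_a=$(printf '%d' "'A")          # same rule with a decimal format

echo "$hex_single $hex_double $dec_a"    # prints: 5b 5b 65
```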

Quote (Annihilannic):

If it's going to be a pure awk script, you could avoid the shell part entirely and use a #!/usr/bin/awk -f shebang line to allow you to use the apostrophe directly in the script rather than dummy variables, but I tried that too and it didn't help the printf situation.

The reason I mix shell and AWK is that I just needed the URL encoding function for a half-written shell script (something with functionality similar to GetRight or Firefox's DownThemAll, but as a shell script). I didn't want to rewrite the rest in AWK too, but still wanted to use AWK for the trickier parts with intensive string operations, which would otherwise require countless 'expr' calls from the shell, or at least be more complicated (and thus slower) in pure shell.

And BTW I found another solution, that does the trick, but isn't that pretty either:

CODE --> AWK

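BEGIN {
    # Lookup table: character -> two-digit hex code
    for (i = 0; i <= 255; i++) {
        t = sprintf("%c", i)
        _ord_[t] = sprintf("%02x", i)
    }
}

# Return the hex code of the first character of str
function ord(str,    c)
{
    c = substr(str, 1, 1)
    return _ord_[c]
}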
 
PumpgunMessiah (TechnicalUser) (OP)
19 Mar 12 18:01
Oh, and BTW, I forgot to add the working URL encoding function. The script I want to use it for isn't finished yet, though.

Why isn't there an edit feature for already submitted posts, so you don't have to reply to your own posts to add something later?

CODE --> AWK

BEGIN {
    # Lookup table: character -> two-digit hex code
    for (i = 0; i <= 255; i++) {
        t = sprintf("%c", i)
        _ord_[t] = sprintf("%02x", i)
    }
}

# Return the hex code of the first character of str
function ord(str,    c)
{
    c = substr(str, 1, 1)
    return _ord_[c]
}

# Percent-encode every character outside the set a-zA-Z0-9_.:/
function url_encode(rawURL,    cleanURL, nonURLPos, rawChar, replaceChar) {
    cleanURL = ""
    do {
        nonURLPos = match(rawURL, /[^a-zA-Z0-9_.:\/]/)
        if (nonURLPos > 0) {
            rawChar = substr(rawURL, nonURLPos, 1)
            replaceChar = ord(rawChar)
            cleanURL = cleanURL substr(rawURL, 1, nonURLPos - 1) "%" replaceChar
            rawURL = substr(rawURL, nonURLPos + 1)
        }
    } while (nonURLPos > 0)

    cleanURL = cleanURL rawURL
    return cleanURL
}
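
For anyone who wants to call this from a shell script, here is a minimal self-contained sketch of the same lookup-table idea. The shell wrapper name url_encode, the single pass over the characters, the two-digit padding, and LC_ALL=C are my assumptions, not part of the function above.

```shell
#!/bin/sh
# url_encode STRING: percent-encode every character outside a small safe set.
# The safe set (letters, digits, _ . : /) mirrors the awk function above.
url_encode() {
    printf '%s\n' "$1" | LC_ALL=C awk '
    BEGIN {
        # Build the character -> two-digit hex lookup table once.
        for (i = 1; i <= 255; i++)
            hex[sprintf("%c", i)] = sprintf("%02x", i)
    }
    {
        out = ""
        n = length($0)
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)
            # Keep safe characters, replace the rest with % plus hex code.
            out = out (c ~ /[a-zA-Z0-9_.:\/]/ ? c : "%" hex[c])
        }
        print out
    }'
}

url_encode 'http://www.url.com/test text.html'
# prints: http://www.url.com/test%20text.html
```

LC_ALL=C is there so that the table is built per byte rather than per multibyte character.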
Annihilannic (MIS)
19 Mar 12 22:33
I think that's a good solution.

Any aversion to perl?

CODE --> Perl

echo "$teststr" | perl -nwe '
        foreach my $char (split //) {
                if ($char =~ /([[:alnum:]\/.:\n])/) {
                        print $char;
                } else {
                        printf "%%%02x",ord($char);
                }
        }
'

Or shorter, but more cryptic:

CODE --> Perl

echo "$teststr" | perl -nwe '
        print join "", map { $_ =~ /([[:alnum:]\/.:\n])/ ? $_ : sprintf("%%%02x",ord($_)) } split //;
'

Annihilannic
tgmlify - code syntax highlighting for your tek-tips posts

mrn (MIS)
22 Mar 12 8:59
From Shelldorado

CODE

:
##########################################################################
# Title      :    urlencode - encode URL data
# Author     :    Heiner Steven (heiner.steven@odn.de)
# Date       :    2000-03-15
# Requires   :    awk
# Categories :    File Conversion, WWW, CGI
# SCCS-Id.   :    @(#) urlencode    1.4 06/10/29
##########################################################################
# Description
#    Encode data according to
#        RFC 1738: "Uniform Resource Locators (URL)" and
#        RFC 1866: "Hypertext Markup Language - 2.0" (HTML)
#
#    This encoding is used i.e. for the MIME type
#    "application/x-www-form-urlencoded"
#
# Notes
#    o    The default behaviour is not to encode the line endings. This
#    may not be what was intended, because the result will be
#    multiple lines of output (which cannot be used in an URL or a
#    HTTP "POST" request). If the desired output should be one
#    line, use the "-l" option.
#
#    o    The "-l" option assumes, that the end-of-line is denoted by
#    the character LF (ASCII 10). This is not true for Windows or
#    Mac systems, where the end of a line is denoted by the two
#    characters CR LF (ASCII 13 10).
#    We use this for symmetry; data processed in the following way:
#        cat | urlencode -l | urldecode -l
#    should (and will) result in the original data
#
#    o    Large lines (or binary files) will break many AWK
#        implementations. If you get the message
#        awk: record `...' too long
#         record number xxx
#    consider using GNU AWK (gawk).
#
#    o    urlencode will always terminate it's output with an EOL
#        character
#
# Thanks to Stefan Brozinski for pointing out a bug related to non-standard
# locales.
#
# See also
#    urldecode
##########################################################################

PN=`basename "$0"`            # Program name
VER='1.4'

: ${AWK=awk}

Usage () {
    echo >&2 "$PN - encode URL data, $VER
usage: $PN [-l] [file ...]
    -l:  encode line endings (result will be one line of output)

The default is to encode each input line on its own."
    exit 1
}

Msg () {
    for MsgLine
    do echo "$PN: $MsgLine" >&2
    done
}

Fatal () { Msg "$@"; exit 1; }

set -- `getopt hl "$@" 2>/dev/null` || Usage
[ $# -lt 1 ] && Usage            # "getopt" detected an error

EncodeEOL=no
while [ $# -gt 0 ]
do
    case "$1" in
        -l)    EncodeEOL=yes;;
        --)    shift; break;;
        -h)    Usage;;
        -*)    Usage;;
        *)     break;;            # First file name
    esac
    shift
done

LANG=C    export LANG
$AWK '
    BEGIN {
        # We assume an awk implementation that is just plain dumb.
        # We will convert an character to its ASCII value with the
        # table ord[], and produce two-digit hexadecimal output
        # without the printf("%02X") feature.

        EOL = "%0A"        # "end of line" string (encoded)
        split ("1 2 3 4 5 6 7 8 9 A B C D E F", hextab, " ")
        hextab [0] = 0
        for ( i=1; i<=255; ++i ) ord [ sprintf ("%c", i) "" ] = i + 0
        if ("'"$EncodeEOL"'" == "yes") EncodeEOL = 1; else EncodeEOL = 0
    }
    {
        encoded = ""
        for ( i=1; i<=length ($0); ++i ) {
            c = substr ($0, i, 1)
            if ( c ~ /[a-zA-Z0-9.-]/ ) {
                encoded = encoded c        # safe character
            } else if ( c == " " ) {
                encoded = encoded "+"    # special handling
            } else {
                # unsafe character, encode it as a two-digit hex-number
                lo = ord [c] % 16
                hi = int (ord [c] / 16);
                encoded = encoded "%" hextab [hi] hextab [lo]
            }
        }
        if ( EncodeEOL ) {
            printf ("%s", encoded EOL)
        } else {
            print encoded
        }
    }
    END {
        #if ( EncodeEOL ) print ""
    }
' "$@"
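
The part of this script that avoids printf("%02X") can be tried on its own. A small sketch (the helper name to_hex is made up for illustration): the character code is split into its upper and lower four bits, and each half is looked up in hextab.

```shell
#!/bin/sh
# to_hex N: two-digit uppercase hex for N in 0..255 via table lookup,
# without relying on a hex format in printf.
to_hex() {
    awk -v n="$1" 'BEGIN {
        split("1 2 3 4 5 6 7 8 9 A B C D E F", hextab, " ")
        hextab[0] = 0
        hi = int(n / 16)    # upper four bits
        lo = n % 16         # lower four bits
        print hextab[hi] hextab[lo]
    }'
}

to_hex 10    # line feed, prints 0A
to_hex 91    # the [ character, prints 5B
```

Note that this script encodes a space as "+" (form encoding), whereas the functions earlier in the thread produce %20.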

Mike

"Whenever I dwell for any length of time on my own shortcomings, they gradually begin to seem mild, harmless, rather engaging little things, not at all like the staring defects in other people's characters."
 
