PHP Comparing Base 64 strings

Bear with me... I think my question really is about comparing base 64 encoded strings...

In a mail parsing app, I've come across incoming mail where the In-Reply-To and References headers have been stripped out and replaced by a Microsoft/Outlook Thread-Index header. I need to start generating my own Thread-Index headers on outgoing mail so that, when I receive a reply without the standard threading headers, I can still match the reply to a thread using Thread-Index.

I found a function which creates a valid Thread-Index header (https://gist.github.com/brettp/7e8e450b0d200279323...), and am storing that header in a MySQL table. According to the function's author:

* These headers are base64 encoded 22-byte binary strings in the format:
* 6 bytes: The first 6 significant bytes from a FILETIME timestamp.
* 16 bytes: A unique GUID in hex.

So... a Thread-Index header value apparently looks like this:

CODE -->

AdH1tsVUVHkXt/ZLS4eksRmXC4Q5Ig==

Outlook appends 5-byte suffixes to subsequent thread members, so a thread reply would be coded like so:

CODE -->

AdH1tsVUVHkXt/ZLS4eksRmXC4Q5IgAiTOHA

Note that the first 30 characters are the same (AdH1tsVUVHkXt/ZLS4eksRmXC4Q5Ig), but the reply has dropped the two original trailing equals signs and added the characters AiTOHA.
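To see what those two values contain, they can be decoded and measured. This is a minimal sketch using only the two example strings quoted above; the variable names are illustrative. It shows that the original decodes to 22 bytes, the reply to 27 bytes (22 plus one 5-byte suffix), and that the reply's first 22 raw bytes equal the original.

```php
<?php
// Example values quoted in this thread.
$original = 'AdH1tsVUVHkXt/ZLS4eksRmXC4Q5Ig==';
$reply    = 'AdH1tsVUVHkXt/ZLS4eksRmXC4Q5IgAiTOHA';

// Strict mode makes base64_decode() return false on invalid characters.
$originalBytes = base64_decode($original, true);
$replyBytes    = base64_decode($reply, true);

echo strlen($originalBytes), PHP_EOL; // 22 (6 timestamp bytes + 16 GUID bytes)
echo strlen($replyBytes), PHP_EOL;    // 27 (22 + one 5-byte suffix)

// The reply's first 22 raw bytes are the original header's bytes.
var_dump(substr($replyBytes, 0, 22) === $originalBytes); // bool(true)
```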

What I really need to be able to do is match up emails containing the original thread index using a MySQL query, but I don't understand what's going on with the base conversions and encoding in the PHP function.

Do you think it's safe to just match on the first 30 characters of the thread index in order to identify messages from the same thread? I'd be grateful for any advice or suggestions!

More examples of Thread-Index values:

CODE -->


RE: PHP Comparing Base 64 strings

In base64 encoding, the trailing == just pads the base64 representation; it has no meaning. That should be the main answer.
So yes, you can just skip the == part. Any = can be stripped off: in about the same sense that leading zeros don't change a number, trailing = characters don't change the encoded data.

Bye, Olaf.

RE: PHP Comparing Base 64 strings


CODE --> php -a

Interactive mode enabled

php > echo base64_encode('M'), PHP_EOL;
TQ==
php > echo base64_encode('Ma'), PHP_EOL;
TWE=
php > echo base64_encode('Man'), PHP_EOL;
TWFu

As you can see, the character before the equal signs can change when another character is added to the input string.


RE: PHP Comparing Base 64 strings

That's a very good point. Each base64 character carries 6 bits of the original data. If the data length is not a multiple of 3 bytes (24 bits, which divides evenly into four 6-bit groups), adding a character to the unencoded string changes the 6-bit groups at the end. That changes not only the = pad characters but also the last real character, as the Q changes to W in Feherke's example.

You would both remove the padding = chars and avoid the change of characters if your original string were padded to a multiple of 3 bytes. You say you have 22-byte binary strings. If you pad that to 24 bytes, you get a base64 result without =, and whatever Outlook then appends to the binary data (which is also encoded as base64) does not influence the last chars of the original base64 string. That may be the best option to solve this.

Well, the other obvious option to check whether the first 22 binary bytes are matching is to decode the base64 data.
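The decode-and-compare option could look roughly like this. It is a sketch only: `same_thread()` and `THREAD_ID_BYTES` are hypothetical names, and the 22-byte length is taken from the format description quoted in the first post.

```php
<?php
// 6-byte timestamp + 16-byte GUID, per the description quoted in the question.
const THREAD_ID_BYTES = 22;

/**
 * Hypothetical helper: true when both Thread-Index values start with
 * the same 22 raw bytes, i.e. appear to belong to the same thread.
 */
function same_thread(string $a, string $b): bool
{
    $rawA = base64_decode($a, true);
    $rawB = base64_decode($b, true);

    // Reject values that are not valid base64 or are too short.
    if ($rawA === false || $rawB === false) {
        return false;
    }
    if (strlen($rawA) < THREAD_ID_BYTES || strlen($rawB) < THREAD_ID_BYTES) {
        return false;
    }

    // Compare raw bytes, not base64 characters.
    return substr($rawA, 0, THREAD_ID_BYTES) === substr($rawB, 0, THREAD_ID_BYTES);
}

var_dump(same_thread(
    'AdH1tsVUVHkXt/ZLS4eksRmXC4Q5Ig==',
    'AdH1tsVUVHkXt/ZLS4eksRmXC4Q5IgAiTOHA'
)); // bool(true)
```

Comparing decoded bytes sidesteps the question of which base64 characters are stable, at the cost of decoding every candidate value.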

Bye, Olaf.

RE: PHP Comparing Base 64 strings

Just FYI, you can see how and why it works this way if you encode Feherke's sample strings, all padded to length 3 with chr(0):


echo base64_encode("M\000\000"),PHP_EOL; 
echo base64_encode("Ma\000"),PHP_EOL; 
echo base64_encode("Man"),PHP_EOL; 

That'll show TQAA, TWEA and TWFu, which differ from the earlier results by having A where the = was. So you see the "=" chars stand in for chr(0), but they also denote that these chr(0) bytes don't belong to the original data, so the decoded data has to be cut off.

Seeing that, it's clear you don't only get new chars: TQ also changes to TW, and TWE to TWF. In the first step a chr(0) is replaced by 'a', and in the second step the final chr(0) is replaced by 'n'. That influences not only the = positions but, in general, also the last character of the encoding of the previously shorter string.

Bye, Olaf.

RE: PHP Comparing Base 64 strings

Thanks, all.

As I understand it, the original value, i.e. AdH1tsVUVHkXt/ZLS4eksRmXC4Q5Ig==, becomes a pseudo-unique identifier and is never re-coded during the process. As the email thread grows, additional 5-byte values are appended to the original value to indicate subsequent mails' position in the thread, i.e. AdH1tsVUVHkXt/ZLS4eksRmXC4Q5IgAiTOHA, which seems to indicate thread id AdH1tsVUVHkXt/ZLS4eksRmXC4Q5Ig followed by message id AiTOHA.

If the two equals signs trailing the original value are just right-padding, it doesn't seem to matter that they're stripped from the original value when the first 5-byte value is appended, and since the original ID (sans padding characters) doesn't change throughout the thread, I think I can safely consider the 30-character base 64 string as a unique id.

RE: PHP Comparing Base 64 strings

In your example I think you are just lucky, the last character before the ==, in this case the g, could also change. The first few bits of the appended data influence it. Therefore you had better pad your identifier to 24 bytes. It doesn't matter by how many bytes the value grows. You could also take the leftmost 29 chars as an identifier, but to be 100% sure, pad your 22 bytes with two 0 bytes to 24 bytes. That results in a 32-character base64 string, which stays the same no matter what the first bits of the added data are.
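The effect on the 30th character can be checked directly. In this sketch the first suffix is the 5 bytes that the quoted reply value decodes to, and the second is a made-up variant whose first byte has its high bits set; both suffix variables are illustrative.

```php
<?php
// The 22-byte header quoted in this thread, decoded to raw bytes.
$base = base64_decode('AdH1tsVUVHkXt/ZLS4eksRmXC4Q5Ig==', true);

// Suffix from the quoted reply: its first byte is 0x00.
$lowSuffix  = "\x00\x22\x4C\xE1\xC0";
// Hypothetical suffix: same bytes, but the first byte is 0xF0.
$highSuffix = "\xF0\x22\x4C\xE1\xC0";

$a = base64_encode($base . $lowSuffix);
$b = base64_encode($base . $highSuffix);

echo substr($a, 0, 30), PHP_EOL; // AdH1tsVUVHkXt/ZLS4eksRmXC4Q5Ig
echo substr($b, 0, 30), PHP_EOL; // AdH1tsVUVHkXt/ZLS4eksRmXC4Q5Iv

var_dump(substr($a, 0, 29) === substr($b, 0, 29)); // bool(true): 29 chars are stable
var_dump($a[29] === $b[29]);                       // bool(false): the 30th char differs
```

The 30th character mixes the last 2 bits of byte 22 with the first 4 bits of whatever follows, which is why it stays g only when the appended data starts with four zero bits.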

Bye, Olaf.

RE: PHP Comparing Base 64 strings


Quote (Olaf)

In your example I think you are just lucky, the last character before the ==, in this case the g, could also change.

It seems like after the initial encoding, the base 64 value is just a string, to which other base 64 strings are appended as the thread grows. If the base 64 strings are never re-encoded but simply appended to other base 64 strings, how would the original string change?

I do intend to do much more testing with different values, though, and see if my assumptions hold up.

RE: PHP Comparing Base 64 strings

Quote (cmayo)

It seems like after the initial encoding, the base 64 value is just a string, to which other base64 strings are appended as the thread grows
Well, no. If that were the case, the == wouldn't go away or move to the end. Rather, the string is decoded, new bytes are appended, and the result is encoded again. Base64 is just a transfer encoding.

And re-encoding can change the last character of the previous encoding, as Feherke's examples show. An "M" is encoded one way, and adding an "a" (so "Ma") results in something that already differs at the second character. The last character changed again when he added the final "n" to make "Man", even disregarding the =, which went from two to one to none.

Bye, Olaf.

RE: PHP Comparing Base 64 strings

If that's the case, I'm going to have a problem searching for like values in the database. MySQL provides a FROM_BASE64() function for use in queries, but that would only partially decode the values.

I guess I'll try reversing the encoding process before I insert into MySQL and insert the unencoded value, then decode the search string before matching with MySQL.
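One way to sketch that approach is to derive a fixed-length key from the decoded bytes and store it next to the message. `thread_key()` is a hypothetical helper, and the table and column names in the comment are made up for illustration.

```php
<?php
/**
 * Hypothetical helper: derive a fixed-length lookup key from a
 * Thread-Index value. Returns 44 hex characters (the first 22 raw
 * bytes), or null if the value is not valid base64 or is too short.
 */
function thread_key(string $threadIndex): ?string
{
    $raw = base64_decode(trim($threadIndex), true);
    if ($raw === false || strlen($raw) < 22) {
        return null;
    }
    return bin2hex(substr($raw, 0, 22));
}

// Both values from this thread yield the same key.
echo thread_key('AdH1tsVUVHkXt/ZLS4eksRmXC4Q5Ig=='), PHP_EOL;
echo thread_key('AdH1tsVUVHkXt/ZLS4eksRmXC4Q5IgAiTOHA'), PHP_EOL;

// The key can then be matched with a plain equality test, for example:
// SELECT id FROM messages WHERE thread_key = ?   (hypothetical schema)
```

Storing the key in its own indexed column keeps the query a simple equality match, so no decoding is needed inside MySQL.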


RE: PHP Comparing Base 64 strings

Well, c'mon, the problem is no problem if your initial data is a multiple of 3 bytes. You just have to add 2 bytes, and then you have 24/3*4 = 32 base64 characters that never change, so you can compare the leftmost 32 chars; they remain the same thread id.

Simply add two 0 bytes in this line:


$thread_ascii = substr($ft_hex, 0, 12) . $guid . "\000\000"; 

Now you're set.

Bye, Olaf.

Edit: Actually it's 12 chars from $ft_hex, isn't it?
And md5 should be 32 chars, hex chars. All of these are hex digits rather than ASCII. In that case I think you simply need to add . "0000" for the 2 zero bytes.
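The padding idea can be checked on raw bytes, independent of the gist's hex handling. This sketch assumes the replying client appends its bytes to the 24-byte value as-is, which this thread has not verified; the variable names are illustrative.

```php
<?php
// The 22 raw bytes of the example header from this thread.
$id22 = base64_decode('AdH1tsVUVHkXt/ZLS4eksRmXC4Q5Ig==', true);

// Pad with two zero bytes so the length (24) is a multiple of 3.
$id24    = $id22 . "\x00\x00";
$encoded = base64_encode($id24);

echo strlen($encoded), PHP_EOL;            // 32
var_dump(strpos($encoded, '=') === false); // bool(true): no padding characters

// Bytes appended later start a fresh 4-character group, so the
// first 32 characters are unaffected. The suffix here is arbitrary.
$extended = base64_encode($id24 . "\xFF\xFF\xFF\xFF\xFF");
var_dump(substr($extended, 0, 32) === $encoded); // bool(true)
```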
