×
INTELLIGENT WORK FORUMS
FOR COMPUTER PROFESSIONALS

Contact US

Log In

Come Join Us!

Are you a
Computer / IT professional?
Join Tek-Tips Forums!
  • Talk With Other Members
  • Be Notified Of Responses
    To Your Posts
  • Keyword Search
  • One-Click Access To Your
    Favorite Forums
  • Automated Signatures
    On Your Posts
  • Best Of All, It's Free!

*Tek-Tips's functionality depends on members receiving e-mail. By joining you are opting in to receive e-mail.

Posting Guidelines

Promoting, selling, recruiting, coursework and thesis posting is forbidden.

Students Click Here

Summarising duplicated lists

Summarising duplicated lists

Summarising duplicated lists

(OP)
Hi,

The project I'm working on is a web service that provides a page that can be used to send an e-mail to our customer base. The e-mail addresses are gathered from 6 different sources in our system, and each source has 6 different categories. The user will choose a category, and then any combination of sources. I need to calculate and display the number of unique e-mail addresses the user's combination choice will result in. E-mail addresses may be duplicated across multiple sources, and even within a source's categories.

I'm trying to find an efficient way to provide summary data to the webpage that doesn't involve sending the 36 lists of up to 2000 e-mail addresses each and doing the de-duplication in javascript whenever the user changes their choice.

I can't just provide a count of e-mail addresses for each category in each source, because if the user selects multiple categories, and I add all the selected category counts together, the total will be too high due to it including duplicates.

One idea I had was to generate a master de-duplicated index of e-mails, then send 36 bit-strings where each bit is set if the e-mail address at that bit's position in the master list is present in that source/category list. So if the master de-duplicated list had 5000 e-mails in it, each bit-string would have 5000 characters in it, and character eg. 401 would be '0' if the e-mail in masterList[401] was present in that source/category. '1' if it was. That would cut down on the data sent, but still be a computational task in javascript.

Then I figured it would be simpler to just calculate every combination of category for each source. 6 sources and 6 categories = 384 summaries (6 * 2^6). What I don't like about this is that it has the potential to skyrocket if more categories are added (each category doubles the number of summaries that must be generated).

I reckon there's got to be a smart shortcut to all this that involves sending less data, and a faster algorithm to calculate the totals I'm after.

Any thoughts are appreciated.

Thanks.

RE: Summarising duplicated lists

As a database developer I see this as the task of a database.

Your idea of building a master index of all emails is what came to my mind at first, too. You just pull together the 6 sources into one new "data mart", doesn't have to be complicated, all you seem to need is three tables, since one email might be in several categories.

emails (ID, mailaddress)
categories (ID, categoryname)
categoriesofemails (mailID, categoryID)

When staging data here you already dedupe the emails and connect them with their categories. 5000 records are nothing even for MySQL, so you can easily get a count of any combination of categories (for example) by

CODE --> mysql

SELECT Count(DISTINCT mailaddress) mailsummarycount FROM emails E inner join categoriesofemails CoE on E.id = CoE.mailID And CoE.categoryID IN (1,3,5) 

The site would display the list of categories and offer a multi select choice, which in the end would result in the list of category IDs at the end of the query and that's it.

So in summary:
1. Setting up this small data mart with the three tables
2. Staging of data of the 6 sources into a data mart. This shouldn't take long and could be done multiple times a day, for example each hour.
3. Creating a script usable from the site via Ajax to return the count of mail adresses, main ingredient is the query given above.

You'll never need to show all mail addresses at the site, do you? If the choice is made you send out a newsletter, advert or any other type of mail at the server side. You get the mail addresses by the query without the Count, simply SELECT DISTINCT mailaddress...

Bye, Olaf.

Red Flag This Post

Please let us know here why this post is inappropriate. Reasons such as off-topic, duplicates, flames, illegal, vulgar, or students posting their homework.

Red Flag Submitted

Thank you for helping keep Tek-Tips Forums free from inappropriate posts.
The Tek-Tips staff will check this out and take appropriate action.

Reply To This Thread

Posting in the Tek-Tips forums is a member-only feature.

Click Here to join Tek-Tips and talk with other members! Already a Member? Login

Close Box

Join Tek-Tips® Today!

Join your peers on the Internet's largest technical computer professional community.
It's easy to join and it's free.

Here's Why Members Love Tek-Tips Forums:

Register now while it's still free!

Already a member? Close this window and log in.

Join Us             Close