johnherman and Hans63,
First, thank you very much for your very thoughtful responses. This is just the sort of dialogue I would like to pursue.
I would suggest that getting agreement on requirements well in advance of the development cycle is nearly impossible in all but the simplest data warehousing / DSS projects. The causes are many: complexity, lack of decision-maker understanding, poor knowledge transfer from the business to the DW team (or even simple requirement misinterpretations), resource re-allocation, budget fluctuations, mergers / acquisitions, and, as you mentioned, new business rules. These are precisely the reasons we must find ways to "test" for these discrepancies as early as possible in the process. I absolutely agree with the use of a POC, especially when new technologies are combined in new or unfamiliar ways, but not a POC in the generally accepted, mainstream DW sense. I will call it a "DW POC micro-cycle".
1. Prove the technology, architecture, and function first.
If unacceptable, rework and iterate until success is achieved.
2. Require integrated testing and quality feedback communication at every milestone of the process (as suggested by Hans63 above, too). Milestones may be defined from previous points of failure, purely from experience, or even to meet regulatory and/or compliance requirements (Sarbanes-Oxley, statutory, federal, Medicaid, copyright, etc.).
Testing means nothing if no one monitors and acts on the results, so test results must be published and viewable by the appropriate team members (DW developers, DW managers, source system developers/managers, etc.), and even summarized for the customer into an overall DW quality assessment. Every inbound atomic piece of data, every data stream, every transformation, every user access, every query, every data profiling run, every business rule change... where does it end? Indeed, how do we decide when "enough is enough" when it comes to testing? The cheap answer is: "when the desired quality reaches an acceptable level." So we should have quality thresholds integrated into our DW solutions to serve as early-warning bellwethers at the boundaries of, or even in the bowels of, the data warehousing machinery. Not just counts of inbound rows and summarized dollars, people, and transactions, but ways to move data into the data warehouse while still recording its quality scores (as described in "The Data Warehouse ETL Toolkit" by Kimball and Caserta). That practically calls for a data warehouse quality decision support system, eh? But I'm way off on a completely different rabbit trail now...
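To make the quality-threshold idea concrete, here is a minimal Python sketch. The metric names, the scoring formula, and the 0.98 warning threshold are all assumptions for illustration, not a prescription; the point is simply that a batch can be loaded and scored rather than just passed or rejected.

    from dataclasses import dataclass

    @dataclass
    class BatchQuality:
        row_count: int
        null_key_rows: int      # rows arriving with a null natural key
        duplicate_rows: int     # rows failing a uniqueness rule
        out_of_range_rows: int  # rows failing a domain/business rule

        def score(self):
            """Fraction of rows passing all rules -- a crude quality score."""
            if self.row_count == 0:
                return 0.0
            failed = self.null_key_rows + self.duplicate_rows + self.out_of_range_rows
            return max(0.0, 1.0 - failed / self.row_count)

    WARN_THRESHOLD = 0.98  # hypothetical threshold; tune per subject area

    def assess(batch):
        s = batch.score()
        status = "OK" if s >= WARN_THRESHOLD else "WARN"
        # In a real solution this result would land in an audit/quality table
        # keyed to the batch, so the data loads but carries its quality score.
        return "batch score=%.4f status=%s" % (s, status)

    print(assess(BatchQuality(row_count=10000, null_key_rows=12,
                              duplicate_rows=3, out_of_range_rows=240)))

The "WARN" this prints is exactly the early-warning bellwether I mean: the batch still flows in, but someone is told to look at it.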
If you can't test it, you shouldn't be doing it. As Hans63 pointed out, it probably isn't going to be easy or cheap to build a robust testing culture in a DW effort. As you point out, testing of ETL counts and hash totals is necessary, but insufficient. Agile purists with whom I've worked have said, "Testing should be done only to the level required by the customer" (translation: what they are willing to pay for in time and resources!). I am certain that professional testing (technical, performance, functional) consultant tiger teams for data warehouse architectures / solutions are a field ripe for the taking by the experienced practitioner!
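Even that "necessary but insufficient" count-and-hash-total check is worth automating early. Here is one small Python sketch of it; the row data, field names, and XOR-of-digests hash total are illustrative assumptions, not a standard.

    import hashlib

    def hash_total(rows, fields):
        """Order-independent hash total over selected fields of each row."""
        total = 0
        for row in rows:
            digest = hashlib.sha256(
                "|".join(str(row[f]) for f in fields).encode()).hexdigest()
            total ^= int(digest[:16], 16)  # XOR, so row order does not matter
        return total

    source = [{"id": 1, "amt": 100.0}, {"id": 2, "amt": 250.5}]
    target = [{"id": 2, "amt": 250.5}, {"id": 1, "amt": 100.0}]  # same rows, reordered

    assert len(source) == len(target), "row count mismatch"
    assert hash_total(source, ["id", "amt"]) == hash_total(target, ["id", "amt"])
    print("counts and hash totals reconcile -- business rules still need their own tests")

It passes here, but it says nothing about whether a transformation applied the right business rule, which is exactly the gap the rest of this discussion is about.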
There are many more items to be discussed here, but let's delve a bit into the POC before I ramble further...
PROOF OF CONCEPT
Prerequisites:
1. All system and application software and hardware have been installed, tested, and are functioning correctly. There can be a HUGE amount of elapsed time, in addition to hands-on effort, required to get to this point.
2. Point persons (names, phone, email) for all required data access and authentication needs have been identified.
3. Network, server and application credentials and accounts have been established, documented and shared as necessary.
A suggested POC...
I would suggest approaching the proof-of-concept phase of a new DW / BI effort with a traditional broad-but-shallow micro-cycle of 2, 4, or perhaps 6 weeks, typical of agile development.
1. Identify the highest-value goal attainable for the desired micro-cycle that will allow the team to:
a. Identify a single data source central to the effort
b. Confirm security credentials are accurate, complete, and provide the required permissions to this data source
c. Exercise the ETL tools, applications, adapters, and activity logging to retrieve, stage, and load the data source into a mock staging area, tracking each step (a minimal sketch follows this list).
d. Exercise the version control mechanism chosen for the data warehousing platform to "check-in" all ETL scripts, test scripts, report definitions, database DDL, and any/all other project documentation.
e. Deliver a simple data quality assessment of this single source of data.
f. Install and configure automated data warehousing communication / notification mechanisms and confirm they are functioning at least at some minimal level.
g. Design, develop, test, and deliver a simple but formatted master data report using the main information delivery tool chosen for the largest population of information consumers.
h. Configure the information delivery tool authorization scheme to allow a single information consumer to view the report and/or export the report into one of the tools (if any) required by power users or other downstream applications (predictive analytics, statistical package, or SOA-style apps) in scope for the project.
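To give step (c) some substance, here is a hedged Python sketch: pull one source extract, land it in a mock staging area, and log every step. SQLite stands in for the staging database, and the file name, table name, and two-column layout are assumptions for illustration only.

    import csv, logging, sqlite3

    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(message)s")
    log = logging.getLogger("poc_microcycle")

    def stage_source(csv_path, db_path="mock_staging.db"):
        log.info("extract: reading %s", csv_path)
        with open(csv_path, newline="") as f:
            rows = list(csv.DictReader(f))
        log.info("extract: %d rows retrieved", len(rows))

        con = sqlite3.connect(db_path)
        con.execute("CREATE TABLE IF NOT EXISTS stg_customer (id TEXT, name TEXT)")
        con.executemany("INSERT INTO stg_customer VALUES (:id, :name)", rows)
        con.commit()
        loaded = con.execute("SELECT COUNT(*) FROM stg_customer").fetchone()[0]
        con.close()
        log.info("load: staging table now holds %d rows", loaded)
        return loaded

    # stage_source("customer_extract.csv")  # hypothetical source extract file

Every log line above is the kind of step-tracking evidence the micro-cycle should capture, publish, and check into version control alongside the scripts themselves (step d).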
As you can see, we will have performed a relatively broad scope of functions to provide a credible POC. What else should be done (or left out) to improve this based on your experiences?
Resource Criteria:
Three (3) highly skilled, experienced resources trained on the tools, techniques, and applications must be able to complete these tasks in less than 4 weeks with ancillary assistance from supporting staff.
Alternatives:
If they can't do it in less than 4 weeks, make sure the scope hasn't crept, or split the process into two portions:
1. one internal set of deliverables to the team, and
2. one with functional deliverables the customer can identify.
This allows confidence in (or reassessment of) the infrastructure and architecture prior to investing many more hours required to deliver an entire subject area.
There are many more tasks to consider; these are only based on my experience.
I look forward to your responses.
Best regards,
dmcmunn