Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB
(OP)
Need to clean several large pipe-delimited text files prior to import into a newly created SQL Server 2012 database.

The plan is to load the Sql Server database and then append data to the database on a monthly basis.

Each text file contains 500,000 to 850,000 records with over 50 columns. Also, there are three date columns in the format YYYYMMDD.

Specifically, I need to review and clean each text file. Cleaning includes:

1. Extract records that have a zero in any of the three date columns.
2. Extract records that have zeroes in the Month and Day portion of the date (for example, a value like "20160000").

Initially, I opened the text files using MS Excel but then researched text editors that could or should be used instead.

I downloaded and installed the following text editors - GVIM, Sublime Text, and EmEditor. Upon opening the largest text file in each of the editors, I was not readily able to display the data in a columnar format that would facilitate a review of the records.

Based on my experience with MS Excel, it appears that editing the text files using MS Excel would be convenient and relatively easy compared to using the text editors.

I did also create the CREATE TABLE and BULK INSERT SQL scripts to load the database (prior to cleaning the text files). All of the tables were loaded except one. The table that was not loaded failed because of records that have zeroes in the Month and Day portion of the date (for example, a value like "20160000").


Any insight as to the preferred method to clean the text files so I can initially load the Sql Server database?

Any additional insight as to the use of SSIS or Bulk Insert to perform the initial loading of the database and the subsequent appending of data on a monthly basis, in a way that stores all records with errors in a "temporary" table and loads only the records that do not have errors?



RE: Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB

If I were you, I'd first address the source of the files to output clean text files.

>Did also create the Create Table and Bulk Insert Sql Scripts to load the database(prior to cleaning of the text files).
That's also what I would do; SQL Server should be able to import the file as text-only data.

>All of the tables were loaded except one. The one table that was not loaded failed because of records that have zeroes in the Month and Day portion of the date.
Well, why import as date? Import all fields as text, mend such errors or remove such lines, and then export the corrected data to import with the proper formats/types.
You may also go from the text-only staging tables into your real data tables by converting from text to the wanted types yourself, which saves another round trip to disk and back (see the sketch below).
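
A minimal sketch of that staging-then-convert idea; the table and column names (SaleStage, SaleData) are purely illustrative, not from the thread:

CODE -->

-- all-text staging table: bad values load without any conversion error
CREATE TABLE dbo.SaleStage (
    ItemNbr      varchar(60),
    RECEIPT_DATE varchar(60),   -- YYYYMMDD kept as raw text
    SALE_AMOUNT  varchar(60)
);

-- after the bulk load: keep only convertible rows and cast while copying
INSERT INTO dbo.SaleData (ItemNbr, RECEIPT_DATE, SALE_AMOUNT)
SELECT ItemNbr,
       CONVERT(date, RECEIPT_DATE, 112),   -- style 112 = YYYYMMDD
       CONVERT(int, SALE_AMOUNT)
FROM dbo.SaleStage
WHERE ISDATE(RECEIPT_DATE) = 1
  AND SALE_AMOUNT NOT LIKE '%[^0-9]%';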

Bye, Olaf.

RE: Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB

If you have access to Unix or Linux, awk would resolve this situation nicely. Also, the awk utility is available for Windows as an add-on.

==================================
advanced cognitive capabilities and other marketing buzzwords explained with sarcastic simplicity


RE: Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB

No need to use any external program.

Just load everything into a staging table - if required due to "bad data" load those fields that have bad data into a varchar field.
Then either delete or only select those records that meet the desired criteria and transform/load onto the final destination.

Having the records processed in SQL will most likely prove to be easier than doing that processing outside.

Regards

Frederico Fonseca
SysSoft Integrated Ltd
www.syssoft-int.com

FAQ219-2884: How Do I Get Great Answers To my Tek-Tips Questions?
FAQ181-2886: How can I maximize my chances of getting an answer?

RE: Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB

>Having the records being processed in SQL will most likely prove to be easier than do that process outside.

I wouldn't say so in any case, as T-SQL string functions are far behind what other languages offer.
But staging data has one big advantage in comparison to parsing the text file itself: Data is separated into columns already.
Admittedly, a parser also has an easy task in slicing a line of the text file at each pipe symbol, but what then?
a) The parser writes out each correct line into a secondary text file?
That's quite costly in time; it means you read, write, and then afterwards read a million rows again.
b) The parser keeps the lines in memory and finally puts the data into MSSQL?
That could work, too, but you need a tool capable of bulk loading while parsing; inserting single rows would be inefficient.

So finally, while it sounds like doing too much in the first place, loading dirty data into text/char/varchar fields and post-processing it can pay off performance-wise, as bulk loading takes full advantage of a uniformly structured bulk of data: no transaction log entry for each row, no locking, etc. It's cheaper to load too much and then throw away or mend the dirty data.

We can totally agree on staging data as the way to go instead of preprocessing the text file, because you really can't process a text file and remove single lines from it in place; you end up rewriting a secondary file, and that's where you waste the I/O.

Bye, Olaf.

RE: Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB

(OP)
Bottom line;

I am interested in performing the initial load of the data sourced from the various text files into the SQL Server 2012 database (version: Developer) as fast as possible, and then appending data monthly. I just want to ensure that I am setting this up in the most efficient manner at the outset.

Additional context: I initially installed SQL Server 2014 Express Edition and subsequently installed SQL Server 2012 as a second instance. I did not have Integration Services installed. My initial assumption was that the text data was clean and that all I would need to do was create the database tables and load the text files using Bulk Insert. It was only when I encountered errors running the Bulk Insert scripts that the focus shifted to the use of SSIS to perform the initial load and the subsequent monthly appending of the data.

I have read extensively over the last few days about various alternatives to SSIS such as PowerShell and VBScript. Other options include MS Access and/or Power Query and Power Pivot. However, it appears that I should focus on either using MS Excel to clean the data and then SSIS to perform the ETL and load the data into the SQL Server tables, or using SSIS to load the error-free records into the respective SQL Server tables and the records with errors into another table so I can review them at a later date or send them back to the department that prepared the text files.


With that being said, the following questions arise:

1) Considering that I have a Windows 7, 64-bit operating system running 64-bit MS Excel 2016 on an HP EliteBook 8440p laptop with 16 GB of memory, what version of SQL Server is ideal for installing and running SSIS successfully?



2) As previously stated, I have two installations of SQL Server on the laptop - SQL Server 2014 Express and SQL Server 2012 Developer. Will this present issues going forward? I am interested in installing and using the version of SQL Server that will allow me to efficiently use business intelligence functionality - it doesn't have to be the latest version. Any insight as to whether an upgrade to SQL Server 2016 is warranted at this point?


3) It appears that it may be possible to strictly use Bulk Insert within T-SQL to initially load the data from the text files into the respective SQL Server tables, perform the monthly appending of the data to the tables, and route the records with errors to a separate file (see the sketch below). However, I do think that the use of SSIS will enable a more efficient ETL process than Bulk Insert. Any thoughts/insights regarding strictly using Bulk Insert for the initial load and the monthly appending of data to the database tables?
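
For reference, BULK INSERT itself can divert rejected rows to an error file; a hedged sketch, assuming a pipe-delimited file (the paths and table name are hypothetical):

CODE -->

BULK INSERT dbo.SaleData
FROM 'C:\Data\201701SaleData.txt'
WITH
   (
      FIELDTERMINATOR = '|',
      ROWTERMINATOR   = '\n',
      FIRSTROW        = 2,
      ERRORFILE       = 'C:\Data\201701SaleData_errors.txt',  -- rows that fail land here
      MAXERRORS       = 1000                                  -- keep loading good rows instead of aborting
   );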

RE: Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB

If I am not mistaken, SSIS is a paid-for tool. It's included in Developer, but you are not licensed to use it in production.

Personally, I wouldn't spend the money on it for such a simple task as cleaning your data. As others have already said, it's best to import to a staging table that has all varchar columns, and then clean the data prior to copying to real tables.

From a performance perspective, you may get better performance by using SSIS, but all of your other suggestions are likely to be slower.

SQL 2012, 2014, and 2016 all have Business Intelligence, but again, there is a cost associated with it. If you're willing to bear the cost, then go for it.

-George
Microsoft SQL Server MVP
My Blogs
SQLCop
twitter
"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom

RE: Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB

The gist of what we said is not to clean the data before bulk loading. You do bulk load the data, but not as the final types; all columns are just loaded as text, so there is no value range or conversion error that way. The data also is not loaded into the final tables, but into staging tables having a similar structure with only text/char/varchar columns.

Data in the staging tables can be cleaned before being merged with the real data tables. For that you can use functions such as ISDATE() and CAST/CONVERT.

You don't need SSIS, nor Excel, nor PowerShell. For the bulk load you have the BULK INSERT command and/or the bcp.exe tool with format files; BULK INSERT lets you specify a format file via the FORMATFILE option, and you can specify all fields as text/char/varchar in there. As the text file is text, you can't get any conversion errors that way.

I already mentioned the big advantage you have this way: you get the data separated into columns and can address the individual columns to detect problematic values and either mend them or skip that data in the daily ETL process.

Your ETL process thereby runs within SQL Server.

Bye, Olaf.

RE: Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB

Although I know it is sacrilege to discuss non-Microsoft solutions in a Microsoft forum, awk will be the fastest. Awk is written in C and will run against the delimited sequential file, producing an output delimited sequential file that can be loaded using bcp. I know discussing Unix/Linux utilities in an MS environment will not be well received, but awk against the flat file will be faster than any database procedure.

==================================
advanced cognitive capabilities and other marketing buzzwords explained with sarcastic simplicity


RE: Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB

We're not talking about small files here, John.

Simply do the I/O calculation for, e.g., a 1 GB file.

Solution with awk:
1. awk reads original file and writes clean file
2. bcp reads clean file and uploads into tables

all data is read 2x and written 2x

Solution with staging tables
1. bcp reads dirty file into staging tables
2. cleaning within staging tables
3. merging staging with production data

All data is read 1x and written 1x alone in step 1
Cleaning the data also needs to read it back 1x, but since it's done in MSSQL it profits from caching. Also, you only write the changes, not all the data as in step 2 of the awk solution.
And the final merging step again profits from cached data and only does the necessary inserts and updates.

Overall you can save a lot of I/O with the solution not using awk.

Bye, Olaf.

RE: Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB

Awk free (gawk)

http://unxutils.sourceforge.net/

==================================
advanced cognitive capabilities and other marketing buzzwords explained with sarcastic simplicity


RE: Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB

I have worked on data warehouses for credit cards. Millions of customers with 30+ million transactions daily.


==================================
advanced cognitive capabilities and other marketing buzzwords explained with sarcastic simplicity

RE: Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB

John, my logic still holds anyway. I'd say you can also do it in a way that is not the most performant once drives and RAM are sufficiently fast, and from the day that is true it will stay true; that's why you never had a bad experience with it.

Bye, Olaf.

RE: Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB

Olaf, your reasoning is fine. And if timing doesn't require "the fastest", then I wholly endorse loading the raw data into a table. It makes analysis of the errors much easier. Our experience with credit card data on fast Unix and Linux servers across a range of databases has shown that system utilities written in C are fastest. As you mention, the input and output files should be on separate storage directly attached to the server for maximum optimization of the awk and bulk load processes.

==================================
advanced cognitive capabilities and other marketing buzzwords explained with sarcastic simplicity


RE: Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB

You don't mention whether you have a programming background; however, if you do (and your data is pretty constant) this would be a great place for a Python program. It is very easy to pick up, and just parsing a line is not that hard. A few YouTube videos would probably get you on your way. It is fast, free, and runs on pretty much any platform.

Just a thought

Simi

RE: Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB

I don't want to cause a blood rush, but actually I'm saying you are not using the fastest approach, simply looking at how many times the data needs to be read and written. RAM and CPU are so fast today that it won't matter much, yet even an SSD is always a magnitude behind RAM. Caching mechanisms not only include hdd hardware caching and RAM caching; the OS and the SQL Server process also contribute. And MSSQL can act on its data with multiple cores, while parsing lines of a sequentially read file is a sequential process that is not parallelizable.

Let's make some assumptions
1GB file, cleansable to .8GB

Your way reads in 1GB, writes .8GB on a separate drive to not have one controller as the bottleneck, fine. Then read in the .8GB into MSSQL tables. Total read: 1.8 GB, total write 1.6GB, once to file and once into MSSQL, just roughly speaking.

Mine reads in 1GB, writes 1GB into staging tables, reads 1GB and cleans it to merge into .8GB of production table data. That seems like the worse case, as in simple terms it means overall reading 2GB and writing 1.8GB, but the big advantage is that the process doing the first read, the write, and the read back is one and the same process: the MSSQL process doing a BULK INSERT. It can profit from its caching. Involving two tools always has at least this disadvantage.

Bye, Olaf.

RE: Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB

One option that requires a single pass at the data is to use a C# bit of code, either as a standalone application, or within SSIS as a script.

Read the file, parse each record as string data and transform the data as required - if a record is considered valid, pass it to the feed to the database (either add a row to a DataTable object or add it to the SSIS pipeline) - invalid records can be subject to the same process but sent to a different destination for further processing if required.

1GB read - 1GB or less written to the database - still a single process.

Regards

Frederico Fonseca
SysSoft Integrated Ltd
www.syssoft-int.com

FAQ219-2884: How Do I Get Great Answers To my Tek-Tips Questions?
FAQ181-2886: How can I maximize my chances of getting an answer?

RE: Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB

Frederico, in this case the C# process and the MSSQL process are involved, so again two processes.

I am talking about MSSQL using BULK INSERT and then more or less SQL along these lines:

CODE

CREATE DATABASE staging ...several options;
CREATE TABLE dbo.stagetable (...) WITH (MEMORY_OPTIMIZED = ON);
BULK INSERT dbo.stagetable FROM 'datafile' WITH (FORMATFILE = 'formatfile', ...);
-- some SQL cleaning the data, e.g. DELETE FROM dbo.stagetable WHERE ISDATE(...) = 0
MERGE INTO proddb.dbo.prodtable AS target
USING staging.dbo.stagetable AS source
ON <match_condition>
WHEN MATCHED THEN UPDATE ...
WHEN NOT MATCHED THEN INSERT ...;
DROP TABLE dbo.stagetable;
DROP DATABASE staging;

There is no other process involved, only MSSQL, not even bcp.exe or sqlcmd.exe. The MSSQL process reads the data into the memory-optimized dbo.stagetable; nothing is written to the hdd at this stage. You only read 1GB into the staging table in memory, so the 1GB read from the flat file is written into RAM, not into any MDF file, neither yourdb.mdf nor tempdb.mdf, nor even staging.mdf. The CREATE DATABASE surely creates a small staging.mdf/.ldf file pair, but they are dropped in the end. There is more detail about this, e.g. the staging db should be created with its recovery model set to simple. But roughly, you only read from the original flat file, do some things in RAM, and MERGE into the final production db table. There is minimal writing to slower devices, and only at the end of all the processing in RAM.

Bye, Olaf.

RE: Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB

Olaf, a simple rule of thumb is that a SELECT costs 1, an INSERT or DELETE costs 2, and UPDATE costs 3. Ignoring the bcp load that is part of both processes, the difference will be the cost of running awk across the entire flat file versus running UPDATE across the subset of records that have errors. A flat file read will be considerably faster than a database SELECT. I presume that if over 30% of the rows have errors, then awk will be faster. As the size of the flat file grows, the threshold percent that favors the awk solution will drop.

The Python suggestion is a consideration, but should not modify the original file because an audit trail of what was changed should be kept.

==================================
advanced cognitive capabilities and other marketing buzzwords explained with sarcastic simplicity


RE: Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB

1, 2, 3 of what? If the table is in RAM, any SQL will run against RAM and take negligible time compared to the initial read of the file via BULK INSERT. You are still thinking of normal tables stored on hdd, John, but that's not the case here.

Bye, Olaf.

RE: Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB

(OP)
Still at this.

Question - Is there not an implicit assumption that the use of T-SQL is a more effective method to clean the data than the use of formulas, IF statements, etc. within MS Excel?

So, upon the creation and use of T-SQL scripts to "export" the error-free data from the staging tables to the production tables, what remains in the staging tables are the records with the errors.

Therefore, isn't there still a need to identify all possible types of errors for the records that remain in the staging table - requiring a review of each record? Am I thinking about this correctly?

Also, considering that all errors may not be identifiable at the same time, wouldn't this preclude the use of a temporary table as the staging table?

Any examples of SQL scripts and/or stored procedures that identify the errors that remain within the staging table?

Is there a method, such as a stored procedure, that "identifies" the various types of errors and stores the records with like errors in separate tables?

For example, a table for duplicate records, a table for records with nonsensical dates, a table for records that have misspelled fields, etc. Then again, maybe this is already taken care of by the various SQL scripts that will be developed to identify all possible errors for the records that remain within the staging table...






RE: Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB

Duplicate records would be recognized by 'primary key' errors when attempting to insert. Dates can be checked a few ways. As far as 'misspellings', well, that's a whole different subject. You would need to have a dictionary or list of valid values. Your recent post opens a whole new series of questions and scenarios.

@Olaf: 1,2,3 are relative costs. So Update(3) is more efficient than Delete followed by Insert (2+2) for example.

==================================
advanced cognitive capabilities and other marketing buzzwords explained with sarcastic simplicity


RE: Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB

BxWill,

Whatever you need and want. The nature of the data might not make it necessary to post-process invalid records; e.g. think of measurement data taken every ms, where losing some of them might not be an issue.
If you look at my code, I don't keep the staging data; I don't even keep the staging database. More importantly, I create a separate database that I drop in the end. With staging data inside your production database you have to think about log file growth for the import/merge process, which should be kept to a minimum; that is also not achieved by using temp tables, but by a separate database you can finally drop. Therefore many of your questions have no basis to stand upon: the staging table finally is not only empty, but gone.

In my short example I simply suggested DELETEing all rows with an invalid date, but you can of course also move such rows into tables used for manual inspection. John might suggest you can also process your initial flat file with awk in a way that sorts each row - the good ones go into the pot, the bad ones go into your crop - like Cinderella. Indeed, having the first read go into an MSSQL table, I think one of the advantages is being able to apply any SQL, and perhaps you know this better than any other tooling. If you're also a developer, Frederico has a point in using a frontend process in C# or, as Simian suggests, Python; whatever flavor of language you speak, you surely know your way around acting on data with it, too.

It's totally up to you what you do between BULK INSERT and MERGE in my script. I would just not suggest loading all the data into a staging table or tables or a staging schema of your production database, or using tempdb, but rather a separate db on the same instance, maybe on a separate filegroup/partition. Since staging.dbo.stagetable is an in-memory table in MSSQL Server you can also act upon it from outside with SQL commands, as long as you don't pull out all the data for processing, because then you would be better off reading it from the flat file into such an outside process in the first place.

In the end you always have to imagine where you read from and write to: the flow of the data from hdd to RAM, within LAN, HDD and network controllers and between processes. The flow of data through the slower devices costs most; data in RAM can be processed fastest, no matter whether the code doing so is assembler or interpreted commands like ad hoc queries - the parsing or compiling of code takes negligible time. Its overall execution time, including the data processing it does, will be negligible even in comparison to SSD reads and writes, and of course even more so in comparison with network transfers. As said, RAM still is 1-2 magnitudes faster than SSD, and that comparison will always hold true; SSDs need non-volatile RAM and that is always flashed, a process that takes more time. Once data is in RAM, anything can act on it comparably fast; all the languages know the same core basic data types and they mostly mean the same bytes in the same order. Anyway, you very rarely have processing of data so complex that it becomes the bottleneck of your whole ETL process; what determines the overall execution time is mainly I/O - how many times you need to read and write from/to/through slower devices.

The roughly one million rows you need to process each day are in the GB range, I assume. This once was a huge amount of data; today it's quite small, fitting into physical RAM multiple times on a modern server, so you might also consider a trade-off between performance and the maintainability and ease of modification of the data cleansing code, but indeed SQL is fine for data processing. I already said T-SQL string functions are not the very best - there are a few new ones making things better - but if you're a C# pro or Python pro or even a Linux pro with Perl, awk or grep or anything, you might go that route; you'll see what fits your needs best. It should always be a concern to ask your data source for better quality initial flat files, so you can keep your preprocessing as simple as possible. What's most important in my experience is looking for ways to minimize the pressure on the logs, i.e. you have or set the recovery model to simple or bulk logged. This even plays a role for the staging db you only create temporarily. If not, you write out all the data twice: once into the data file and once into the log file MSSQL maintains for any db. If you have no deep comprehension of transaction logging you should read up on that to understand the implications.

Bye, Olaf.

RE: Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB

>So Update(3) is more efficient than Delete followed by Insert (2+2) for example

I was suggesting MERGE for the final merge of cleansed data, neither Delete+Insert, nor Insert nor Delete nor Update alone. Do you know T-SQL MERGE?
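
For illustration, a hedged sketch of such a final merge, with hypothetical table, column and key names (and assuming SQL 2012's TRY_CONVERT for the inline conversion):

CODE -->

MERGE INTO proddb.dbo.SaleData AS tgt
USING (SELECT ItemNbr,
              TRY_CONVERT(date, RECEIPT_DATE) AS RECEIPT_DATE,
              TRY_CONVERT(int, SALE_AMOUNT)   AS SALE_AMOUNT
       FROM staging.dbo.stagetable
       WHERE ISDATE(RECEIPT_DATE) = 1) AS src
   ON tgt.ItemNbr = src.ItemNbr
  AND tgt.RECEIPT_DATE = src.RECEIPT_DATE
WHEN MATCHED THEN
   UPDATE SET tgt.SALE_AMOUNT = src.SALE_AMOUNT
WHEN NOT MATCHED BY TARGET THEN
   INSERT (ItemNbr, RECEIPT_DATE, SALE_AMOUNT)
   VALUES (src.ItemNbr, src.RECEIPT_DATE, src.SALE_AMOUNT);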

Bye, Olaf.

RE: Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB

(OP)
I am currently reviewing T-SQL Merge.

As previously stated, I hypothesized that using MS Excel, rather than a text editor, to review and clean a text file prior to using Bulk Insert to load a SQL Server 2012 database might be feasible.

I did proceed to use MS Excel to open each text file and "marked" each cell within a DATE column that had an error by replacing the erroneous date entries with "11111111." An interesting observation was that there were errors that had "00" in the Day portion of the date (formatted as YYYYMMDD) and errors that had "0000" in the MMDD portion of the date.

Truncation errors were "resolved" by modifying the length of the field within the CREATE TABLE SQL script. Other errors such as asterisks (i.e. ********) were also replaced with "11111111."

I do realize that opening the text file in MS Excel, "cleaning" and modifying the data there, and then saving as a tab-delimited text file prior to using Bulk Insert to load a table within SQL Server may not have been the most efficient method, because it appears that MS Excel may actually alter the formatting of some of the data. But I believe it was imperative to go through the process of identifying the errors within the text file initially via MS Excel before constructing the T-SQL statements to identify the specific errors that will still reside in the staging tables after exporting the "good" records to the production tables.

Successfully loaded the table within my Test database with all of the records - records that contained "bad" fields as well as records that had all "good" fields. After fine tuning this process, I plan to perform the initial load of the records into the staging tables and then load the "good" records into the tables within the Production database. Then, on a monthly basis, I need to import all of the records into the staging tables and then append the good records to the tables within the Production database and identify the "bad" records so that I can return them to the Department who sourced the text file.

Now, I will revisit the creation of staging tables... Still, I need to determine if a Text editor can perform the above-mentioned editing/cleaning tasks.

So far, I have not been able to open a text file in a Text Editor and display the data in a columnar fashion so that I can even perform the above-mentioned modifications or similar modifications if I desired to do so in the future.

Any additional insight regarding the use of a Text editor vs. MS Excel to open a text file and "clean" the data prior to import into Sql Server is appreciated.

Note, I would like this to be a seamless automated process whereby the "good" records are loaded and the "bad" records are identified as quickly as possible. I am interested in basically using Bulk Insert to perform this.
However, I may have to use SSIS if the learning curve is not too steep.

Also, I would appreciate insight as to the process that is planned to append the data on a monthly basis.





RE: Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB

As you seem to have already found out: never, never use Excel to look at bad data - Excel will do formatting of its own that may either completely change the data or not process it at all.

Loading it as raw data into a SQL Server table and then looking at it with SQL will enable you to do all types of analysis of the possible errors you may have and devise a parsing method for each individual type of error if so required.

For situations like yours where the data is already delimited, loading the data into individual fields will in most cases be acceptable - even that may not work well in particular cases - but assuming it does, once you have the data loaded into individual columns you can analyse each one of them and see which ones need specific processing based on the desired datatype - that may be dealing with invalid/partial dates, numbers with spaces/invalid characters in them, and so on. All this is done easily with SQL.

Regards

Frederico Fonseca
SysSoft Integrated Ltd
www.syssoft-int.com

FAQ219-2884: How Do I Get Great Answers To my Tek-Tips Questions?
FAQ181-2886: How can I maximize my chances of getting an answer?

RE: Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB

You still don't understand what I suggest, or you disregard it.

All you need is to know the number of columns in your text file, ideally also the names and expected data formats of the textual data, e.g. the already mentioned YYYYMMDD. Then you need neither Excel, nor a text editor, nor awk, nor bcp; the BULK INSERT command will import this.

The only thing you need to avoid conversion problems with bulk insert is to prepare a format file and specify it in the BULK INSERT.
This format file only has to be done once and remains the same for all future bulk loads, as long as the text file format doesn't change.

The main point for avoiding any conversion errors is to treat all the text file columns as text, as char/varchar fields, so your format file will state that all columns are text columns. No matter whether you have YYYYMM00 or YYYY0000 values within the dates in the text or not, the bulk insert will import them, because this way nothing is automatically converted. You reach your main goal of having the data in a table and can look at it in the columns of the staging table you BULK INSERT into. Tada! That's what you want, isn't it?
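
For illustration, a non-XML format file that treats every column as text might look roughly like this - a sketch only, showing just three of the 50+ columns, with hypothetical column names and a pipe terminator:

CODE -->

11.0
3
1   SQLCHAR   0   60   "|"      1   ItemNbr        SQL_Latin1_General_CP1_CI_AS
2   SQLCHAR   0   60   "|"      2   RECEIPT_DATE   SQL_Latin1_General_CP1_CI_AS
3   SQLCHAR   0   60   "\r\n"   3   SALE_AMOUNT    SQL_Latin1_General_CP1_CI_AS

The first line is the format file version (11.0 corresponds to SQL Server 2012), the second is the number of fields, and each following row gives the host field order, host data type, prefix length, length, terminator, server column order, server column name, and collation.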

Take a look at format file specifications - https://msdn.microsoft.com/en-us/library/ms178129....
Especially https://msdn.microsoft.com/en-us/library/ms178129.... or https://msdn.microsoft.com/en-us/library/ms178129....

Whether the textual data - which still is textual data within the staging table - is valid or not, can now be checked via T-SQL.

All your ingredients are SQL Server and your text file, no more, no less. You may use the bcp tool to generate format files for your staging tables, which you design to have only char/varchar/nchar/nvarchar columns of the necessary widths, whatever suits the text file best. You can't have a conversion error from text to text, can you? Your only enemy is a wrong charset/encoding; you need to know whether your text file is ANSI, Unicode, UTF-8/UTF-16 or whatever encoding.

You have experienced that the automatic conversion fails on dirty data; you either clean it and then let the conversion do its work, or you take it into your own hands and convert the textual data yourself via CAST/CONVERT/FORMAT and check it via LIKE, ISDATE and other T-SQL constructs. T-SQL MERGE comes later; it is not your concern at this stage of filling the staging tables. T-SQL MERGE is the final run from staging to production, letting this be done in a single query and with the least effect on the production data's transaction log, but that's the last step. Whether you do this from the initial purely textual staging tables or from secondary staging tables already holding the cleaned and converted data depends on convenience, performance, maintainability, your understanding of T-SQL, etc. It's possible to do the necessary conversions within the MERGE, as part of the merge command is a query against the staging table data, which can contain the conversion expressions/functions/calls.

We're only dealing with these lines of my proposed T-SQL script:

CODE -->

CREATE DATABASE staging ...several options;
CREATE TABLE dbo.stagetable (...) WITH (MEMORY_OPTIMIZED = ON);
BULK INSERT dbo.stagetable FROM 'datafile' WITH (FORMATFILE = 'formatfile', ...);

There are already enough tasks before running it: define the structure you need for your staging table or tables and create a format file to specify in the BULK INSERT. That's the part that reads in the text file without any complications, because where there is no conversion there are no conversion errors. The final state then is a memory-optimized SQL Server table holding the text file 1:1, but within usual table columns you can act on with SQL; that's the big advantage over the pure text data file - you have your columns already.

Bye, Olaf.

RE: Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB

(OP)
Appreciate the insight.

Will review the data within the links regarding format file specifications and pursue the use of staging tables and post back...

RE: Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB

(OP)
So far, I am not able to populate the table in production, "Data2."

Basically, converting from varchar to a date field when the date is formatted as YYYYMMDD presents a problem.

Specifically, the error is "Conversion failed when converting date and/or time from character string."

What modifications are needed to successfully load the production table, Data2, using the data from the staging table, #Data1?

As previously stated, it appears that the majority of issues with the text file data are records where one or more date fields end in two or four zeroes. The dates are formatted as YYYYMMDD. There was also one record that had double quotes in the Sale Amount column within the text file (i.e. ""). Is it possible to filter this out using a CASE statement as well?

Also, any additional insight as to the feasibility of additional fields within the production table, Data2, such as:

•TableId int identity(1,1)
•CreatedBy varchar(255) default system_user,
•CreatedAt datetime default getdate()?


Note, an item can be sold more than once during an annual period. Therefore, I do not believe that the Item number should be the primary key.
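
For what it's worth, such audit columns can be declared with column defaults, and the identity column can double as a surrogate primary key since ItemNbr is not unique; a hedged sketch (the table name is illustrative):

CODE -->

CREATE TABLE dbo.Data2_sketch (
    TableId    int IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- surrogate key
    ItemNbr    varchar(12)  NULL,
    CreatedBy  varchar(255) NOT NULL DEFAULT SYSTEM_USER,
    CreatedAt  datetime     NOT NULL DEFAULT GETDATE()
);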


The code that I have created so far is as follows;


CODE

DECLARE @LOADDIR VARCHAR(255) = 'C:\Data\';
--DECLARE @SQL NVARCHAR(MAX);              <------ IS THIS LINE NEEDED?


CREATE TABLE 	[#Data1](		
		[ItemNbr]   [varchar] (60)  NULL,
		[LATITUDE]   [varchar] (60),
		[LONGITUDE]  [varchar] (60),
		[LAST NAME]   [varchar] (60)  NULL,
		[FIRST NAME]   [varchar] (100)  NULL,
		[RECEIPT_DATE] [varchar] (60) NULL,                  
		[SALE_DATE] [varchar] (60) NULL,                       
		[SALE_AMOUNT]  [varchar] (60) NULL,
		[DELIVERY_DATE] [varchar] (60) NULL,       
		[ORDER_DATE] [varchar] (60) NULL,       
	
)	


CREATE TABLE [Data2](		
		[ItemNbr]   [varchar] (12)  NULL,
		[LATITUDE]   decimal (10,6) ,
		[LONGITUDE]  decimal (10,6),
		[LAST NAME]   [varchar] (25)  NULL,
		[FIRST NAME]   [varchar] (25)  NULL,
		[RECEIPT_DATE] [DATE] (10) NULL,                  
		[SALE_DATE] [DATE] (10) NULL,                      
		[SALE_AMOUNT]  [INT] NULL,
		[DELIVERY_DATE] [DATE] (10) NULL,       
		[ORDER_DATE] [DATE] (10) NULL       
)


BULK INSERT #Data1
   	FROM 'C:\Data\201701SaleData.txt'
   WITH
      (
         FIELDTERMINATOR = '\t',
		 ROWTERMINATOR = '\n',
		 FIRSTROW = 2
	 )




--ERRORS RECEIVED AT THIS POINT !

--Iteration 1: Error: Conversion failed when converting date and/or time from character string.

INSERT INTO Data2
SELECT * FROM #Data1
WHERE
   CASE WHEN RIGHT([RECEIPT_DATE],2) <> '00'						THEN 1 ELSE 0 END = 1
   OR CASE WHEN RIGHT([SALE_DATE],2) <> '00'						THEN 1 ELSE 0 END = 1
   OR CASE WHEN RIGHT([DELIVERY_DATE],2) <> '00'					THEN 1 ELSE 0 END = 1
   OR CASE WHEN RIGHT([ORDER_DATE],2) <> '00'						THEN 1 ELSE 0 END = 1
   OR CASE WHEN ISNUMERIC[SALE_AMOUNT] 							THEN 1 ELSE 0 END = 1


--Iteration 2: Error: Invalid column name "DATETIME2"

INSERT INTO Data2
SELECT * FROM #Data1
WHERE
   CASE WHEN RIGHT(CONVERT([RECEIPT_DATE],DATETIME2),2) <> '00'				THEN 1 ELSE 0 END = 1
   OR CASE WHEN RIGHT(CONVERT([SALE_DATE],DATETIME2),2) <> '00'				THEN 1 ELSE 0 END = 1
   OR CASE WHEN RIGHT(CONVERT([DELIVERY_DATE],DATETIME2),2) <> '00'			THEN 1 ELSE 0 END = 1
   OR CASE WHEN RIGHT(CONVERT([ORDER_DATE],DATETIME2),2) <> '00'			THEN 1 ELSE 0 END = 1
   OR CASE WHEN ISNUMERIC [SALE_AMOUNT]							THEN 1 ELSE 0 END = 1



--Iteration 3; Error: Incorrect syntax near 'CAST', expected 'AS'.

INSERT INTO Data2
    select * FROM #Data1
	WHERE
           (case when CAST(RECEIPT_DATE) as datetime) and RIGHT(SALE_DATE),2 <> '00' then 1 ELSE 0 END = 1)
           (case when CAST(SALE_DATE) as datetime) and RIGHT(SALE_DATE),2 <> '00' then 1 ELSE 0 END = 1)
           (case when CAST(DELIVERY_DATE) as datetime) and RIGHT(SALE_DATE),2 <> '00' then 1 ELSE 0 END = 1)
	   (case when CAST(ORDER_DATE) as datetime) and RIGHT(SALE_DATE),2 <> '00' then 1 ELSE 0 END = 1) 

RE: Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB

There are a few things that you will need to take into consideration and learn how to use.

First, the errors you have in your code:
you are using datetime - note that if you are on 2008 or higher you should be using datetime2 with the desired precision,
and if your field is just a date then use DATE as the datatype - do not use datetimes if not required.

CAST - select cast(receipt_date as datetime) is the correct form

convert - select convert(datetime, receipt_date) or select convert(datetime, receipt_date, format) see online for available formats

Regarding the double quotes - if that is the only type of bad data you can just use the replace function on the select itself
e.g. select convert(int, replace(SALE_AMOUNT, '"','')) as SALE_AMOUNT

Regarding isnumeric and isdate: be aware that they may return valid when the data is invalid - do search the net for examples and use them with care and extra validation (see the example below).
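
A small illustration of the isnumeric caveat, with assumed values:

CODE -->

-- ISNUMERIC can be misleading: each of these returns 1,
-- yet none of the strings converts cleanly to INT
SELECT ISNUMERIC('$')   AS dollar_sign,   -- 1
       ISNUMERIC('1e4') AS scientific,    -- 1
       ISNUMERIC(',')   AS comma_only;    -- 1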

As for dates.

You said that the input format will always be YYYYMMDD - if that is the case it makes life easier.

CODE -->

set dateformat ymd -- force format of date to specific format as to avoid invalid conversions - normally not required if input is yyyymmdd

insert into Data2
-- remove if adding values as null is acceptable
select *
from (
-- end remove
select case
       when len(ltrim(rtrim(receipt_date))) <> 8 then null -- invalid date as size is not 8 
       when ltrim(rtrim(receipt_date)) not like '[0-9][0-9][0-9][0-9][0-1][0-9][0-3][0-9]' then null -- not a valid format for date -- note the 0-1 and 0-3 to do a tiny bit of validation at this point
       when right(ltrim(rtrim(receipt_date)), 2) = '00' then null -- not valid as day part is zeros
       when substring(ltrim(rtrim(receipt_date)),5 , 2) = '00' then null -- not valid as month part is zeros
       when isdate(ltrim(rtrim(receipt_date))) = 1 then convert(date, ltrim(rtrim(receipt_date)), 112) -- now use isdate to verify that string is a valid date and convert if so using specific format 112
       else null
       end as receipt_date
from #Data1
-- remove if adding values as null is acceptable
) t
where receipt_date is not null
-- end remove 

If the input formats can vary then the code above would need to be expanded - and eventually moved into a function, as it may become too cumbersome to code inline.

Similar validation can be done for the numeric fields, although for those you may also need to use patindex to see if there are any non-numeric characters, as sketched below.
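
A possible patindex check along those lines (illustrative; it allows digits only, so extend the pattern if decimals or signs are expected):

CODE -->

-- flag SALE_AMOUNT values containing anything other than digits
SELECT SALE_AMOUNT
FROM #Data1
WHERE PATINDEX('%[^0-9]%', LTRIM(RTRIM(SALE_AMOUNT))) > 0;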

Regards

Frederico Fonseca
SysSoft Integrated Ltd
www.syssoft-int.com

FAQ219-2884: How Do I Get Great Answers To my Tek-Tips Questions?
FAQ181-2886: How can I maximize my chances of getting an answer?

RE: Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB

I recommend not using temp tables, but aside from that, you can't expect CAST to cast just any string to a datetime.

Which version of MSSQL server do you use? How fluent are you with T-SQL and do you learn about advanced features and new functions regularly?

Ideally you use SQL2012 and can use TRY_CONVERT.

Here's a result of different conversions:
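
A few assumed values illustrate the behaviour:

CODE -->

SELECT TRY_CONVERT(date, '20160131') AS try_valid,     -- 2016-01-31
       TRY_CONVERT(date, '20160000') AS try_invalid,   -- NULL instead of an error
       CONVERT(date, '20160131')     AS convert_valid; -- 2016-01-31
-- CONVERT(date, '20160000') would raise a conversion error instead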


You can see TRY_CONVERT has the advantage of resulting in NULL if the date is invalid. Your date checks only care about 00, but you would still accept the 31st of months with fewer days, etc.; forget about programming such things yourself. SQL 2012 also adds PARSE, TRY_CAST and TRY_PARSE besides TRY_CONVERT, as well as DATETIMEFROMPARTS.

Bye, Olaf.

RE: Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB


Quote:


I recommend to not use temp, but aside of that, you can't expect CAST to cast any string to datetimes.

Why not? select cast('20010101' as date) and convert(date, '20010101') work just fine, and so do other strings.

And why not use temp? A permanent table for staging is not necessarily the best option - and neither is a declared table variable, if that is what you were thinking of.

As for try_convert, like all date manipulation functions in SQL Server it suffers from the same issues - select try_convert(date, '200702') will return a valid date (2020-07-02) and select try_convert(date, '2007') will return 2007-01-01, although neither would normally be considered valid when dealing with 8-digit dates as input.

So although the OP is using 2012 and could use that function, the extra validation would still be needed to prevent errors.

Regards

Frederico Fonseca
SysSoft Integrated Ltd
www.syssoft-int.com

FAQ219-2884: How Do I Get Great Answers To my Tek-Tips Questions?
FAQ181-2886: How can I maximize my chances of getting an answer?

RE: Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB

I was speaking to BxWill, mostly, but if you care:

The issue with using temp tables for staging is the load on the transaction log (with a million rows a day that's tough even with bulk-logged recovery mode), and I addressed this in more detail above. It is not by accident but on very good purpose that I wrote about creating a staging database and dropping it after the merge process. In short: bad experience, simply bad experience, and good experience with the suggested solution, provided you know your options.

On the topic of conversion: I showed what you get if you cast or convert. But I also showed what happens for a wrong date: no record. If I had added the messages tab, you'd see the errors. TRY_CONVERT does not error, but results in NULL instead, which is also a nice way to handle or detect wrong dates, though there also is ISDATE() for that purpose. No need to write lengthy CASE statements anymore.

So your proposed solution can also be written very shortly as TRY_CONVERT(date, receipt_date) AS receipt_date, and if you'd like a default other than NULL, as CASE WHEN ISDATE(textdate) = 1 THEN CONVERT(datetime, textdate) ELSE otherdatetime END AS adatetime. Or you can sort such rows out via WHERE ISDATE(...) = 1.

Bye, Olaf.



RE: Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB

No Olaf,

my proposed solution cannot be written with a single try_convert unless the input always comes in fixed, known blocks of data, where blocks shorter than expected are always considered invalid data, and blocks of the expected size that contain invalid data (for something that should represent a valid date) are rejected.

Looking at the following blocks, for example:

CODE -->

select dt1
     , try_convert(date, dt1) as date
     , try_convert(datetime, dt1) as datetime
from (select '20:00'  union
      select '2000'  union
      select '20.00'  union 
      select '200001'  union 
      select '2000010'  union 
      select '20000101'  union 
      select '20000000'  union 
      select '20:00:00'  union 
      select '20010001'
      ) t (dt1) 
Although the input only contains 1 valid date, the output shows otherwise:

CODE -->

dt1	 date       datetime
20.00    NULL       NULL
20:00    1900-01-01 1900-01-01 20:00:00.000
20:00:00 1900-01-01 1900-01-01 20:00:00.000
2000     2000-01-01 2000-01-01 00:00:00.000
20000000 NULL       NULL
200001   NULL       NULL
2000010  NULL       NULL
20000101 2000-01-01 2000-01-01 00:00:00.000
20010001 NULL       NULL 

As for the temp table - the reasons you gave do not, in my opinion and experience, warrant that someone would "not recommend" its use - they only warrant that other options are given, with the usual "try them and use the one that performs better for each case".

Tempdb is indeed a very viable option in many cases, for several reasons, some of which are
(assuming that the server memory is not enough to hold the data and it does have to spill physically to disk):

it is already sized taking the load on the system into consideration - if not, your DBAs aren't doing their work well.
it is split into different files/disks/LUNs, hence reducing possible IO contention.
on modern versions of SQL Server and modern servers it is even located on SSD, making it faster to load onto.
there is no absolute requirement to explicitly drop the table as part of the process, even if it may be advisable.

Regards

Frederico Fonseca
SysSoft Integrated Ltd
www.syssoft-int.com

FAQ219-2884: How Do I Get Great Answers To my Tek-Tips Questions?
FAQ181-2886: How can I maximize my chances of getting an answer?

RE: Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB

Well,

Indeed I already assume YYYYMMDD-formatted data and errors like BxWill already reported, with 00 as the day or month portion. Obviously there are also wrong dates with a wrong month or day. I have advised loading into a varchar field, as BxWill did, which takes in everything, also dots and other separators, etc. All these cases can be inspected with ISDATE() = 0, and acting on other wrong formats would be the task of other queries. A simple check for all digits, for example, could be done by testing whether replacing all digits leaves an empty field, as sketched below.
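
A literal spelling-out of that replace-all-digits check (illustrative; the staging table and column names are assumed):

CODE -->

-- rows where the date column contains any non-digit character:
-- replacing the ten digits with nothing should leave an empty string
SELECT receipt_date
FROM staging.dbo.stagetable
WHERE REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
      REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(receipt_date,
      '0',''),'1',''),'2',''),'3',''),'4',''),
      '5',''),'6',''),'7',''),'8',''),'9','') <> '';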

In regard to the temp table issue, I can tell more than just one story. Yes, it may be a DBA problem, too, but in the end, to avoid a disk overflow and a tempdb state that is not easily shrunk again, I simply work with separate new databases that I drop after each staging process, and that just works fine no matter how clever the DBAs are or are not. The speed of the drive is not of the essence if you use memory tables, as I suggest; they are still faster than SSDs. RAM is always the fastest in any system, unless you bundle the newest SSD model with a very old server, but it would be a wonder to get that working, and even then the SSD would most probably not reach 100% of its performance, while an appropriate server would again have faster RAM.

But speed may not be the biggest concern overall; long-term stability is most important. Google Brent Ozar and tempdb and you will get nice tips on many aspects, too, but the process of staging data doesn't belong in the production data logs; what belongs there is merely the last merging step, so using only tempdb is fine to me just for test purposes. It's not only my experience: a colleague of mine came to a customer having problems with an unshrinkable tempdb, which simply resulted from loading a copy of user data into temp for a self-written login process. You can slowly kill your tempdb; you can obviously simply restart the SQL Server to get a fresh one, but why allow anything to slowly "kill" your system - or even just get it into a troublesome state - if you can do without?

BxWill already stated the outset is in the magnitude of a million rows of staging data (more precisely 500,000 to 850,000 records), so that is a high volume to care about; performance depends not only on that but of course also on the frequency of this ETL process. To me, a sure case for avoiding tempdb. MERGE, like any T-SQL query command, can access all databases on the same server instance; with a separate staging db you can create it on the drive or filegroup best fit for the staging purpose and make better use of multiple drives, controllers or RAID systems as necessary, and alter that for any staging process and data volume, whereas tempdb is where it is and is already involved in many operations of production use. Should I continue enumerating advantages?

Bye, Olaf.


RE: Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB

Differences of opinion. 1 million rows is a tiny volume for me - I deal with over 50 million rows daily in one of my processes (of many). And tempdb on my main prod server during ETL has well over 10k transactions per second - with a max latency of 5 milliseconds (not SSD yet).

Regards

Frederico Fonseca
SysSoft Integrated Ltd
www.syssoft-int.com

FAQ219-2884: How Do I Get Great Answers To my Tek-Tips Questions?
FAQ181-2886: How can I maximize my chances of getting an answer?

RE: Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB

(OP)
Appreciate the insight!

Bottom line - I need to ensure that the data is initially loaded into SQL Server 2012 as efficiently and quickly as possible, that the monthly receipts of data are loaded into the respective tables just as efficiently and quickly, and that any problems whatsoever with the data are quickly identified and definitely not loaded into production.

As a novice at actually loading and managing the SQL Server database going forward, I just want to ensure that I am setting everything up using "best practices", assuming that the volume of the data that will reside in
the database 6 months down the road, 12 months down the road, etc. will increase. There will be a lot of data on which we will have to perform detailed multi-year trend analyses, pattern analyses of sales transactions, historical correlations among multi-year data, etc. I definitely want to spend the majority of my time actually analyzing the data, but I understand that if there is invalid data within the database, then the analysis is in vain no matter how quick it is...

I currently have SQL Server 2012 Developer installed in a Windows 7 Professional environment and am just trying to ensure that, as we scale with ever-increasing data, we will not have any issues with the way the data is set up.

The discussion regarding the use of staging tables within the same database vs. the use of a separate database for staging is noted. However, I will at first correct the present code and then later explore the other alternatives...

Will try the various suggestions and post back.

RE: Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB

How much of a load one million rows puts on the transaction log is also a matter of frequency, as I also said.

Yes, it's a low volume in itself, as I also already said:

Quote (myself)

The about a million rows you need to process each day are in the GB range, I assume, this once was a huge amount of data, today it's quite small, fitting into physical RAM multiple times on a modern server

No matter how fast your storage system is, RAM is always a magnitude faster, and no matter how small a million rows is, the transaction log load is permanent in the backups, unless you don't care to keep log backups.

If this is a monthly process, as you now say, BxWill, then this indeed is a low volume overall, too. Speed may not be of the essence in this case, and growing logs are also not a general problem, as they can be shrunk after log backups, and frequent log backups are a best practice anyway.

Scalability is also an aspect for which MSSQL Server has several options. Besides, with your main usage of the database being BI, you mostly read data. If you aim for that you should also consider learning MDX and perhaps even loading this data into OLAP cubes right after feeding your normal OLTP database.

Bye, Olaf.

RE: Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB

(OP)
Using the suggested code as displayed below;

CODE

set dateformat ymd -- force format of date to specific format as to avoid invalid conversions - normally not required if input is yyyymmdd

insert into Data2
-- remove if adding values as null is acceptable
select *
from (
-- end remove
select case
       when len(ltrim(rtrim(receipt_date))) <> 8 then null -- invalid date as size is not 8 
       when ltrim(rtrim(receipt_date)) not like '[0-9][0-9][0-9][0-9][0-1][0-9][0-3][0-9]' then null -- not a valid format for date -- note the 0-1 and 0-3 to do a tiny bit of validation at this point
       when right(ltrim(rtrim(receipt_date)), 2) = '00' then null -- not valid as day part is zeros
       when substring(ltrim(rtrim(receipt_date)),5 , 2) = '00' then null -- not valid as month part is zeros
       when isdate(ltrim(rtrim(receipt_date))) = 1 then convert(date, ltrim(rtrim(receipt_date)), 112) -- now use isdate to verify that string is a valid date and convert if so using specific format 112
       else null
       end as receipt_date
from #Data1
-- remove if adding values as null is acceptable
) t
where receipt_date is not null
-- end remove 

So far, I am not able to validate the data that is within the staging table and populate the table in production.

For the example given above, it appears that I am just validating the receipt_date field. I would like to validate all the date fields and, in the near future, other fields as well.

Therefore, I assume that I can just stack the CASE statements - do something like the following to perform validation on two or more fields. Is this correct, or is there a more preferred route?


CODE

CREATE TABLE [#Data1](
		[ItemNbr]   [varchar] (60)  NULL,
		[LATITUDE]   [varchar] (60),
		[LONGITUDE]  [varchar] (60),
		[LAST NAME]   [varchar] (60)  NULL,
		[FIRST NAME]   [varchar] (100)  NULL,
		[RECEIPT_DATE] [varchar] (60) NULL,
		[SALE_DATE] [varchar] (60) NULL,
		[SALE_AMOUNT]  [varchar] (60) NULL,
		[DELIVERY_DATE] [varchar] (60) NULL,
		[ORDER_DATE] [varchar] (60) NULL
)

CREATE TABLE [Data2](
		[ItemNbr]   [varchar] (12)  NULL,
		[LATITUDE]   decimal (10,6),
		[LONGITUDE]  decimal (10,6),
		[LAST NAME]   [varchar] (25)  NULL,
		[FIRST NAME]   [varchar] (25)  NULL,
		[RECEIPT_DATE] [date] NULL,
		[SALE_DATE] [date] NULL,
		[SALE_AMOUNT]  [int] NULL,
		[DELIVERY_DATE] [date] NULL,
		[ORDER_DATE] [date] NULL
)

BULK INSERT #Data1
   	FROM 'C:\Data\201701SaleData.txt'
   WITH
      (
         FIELDTERMINATOR = '\t',
		 ROWTERMINATOR = '\n',
		 FIRSTROW = 2
	 )



insert into Data2
-- remove if adding values as null is acceptable
select *
from (
-- end remove
select 
[ItemNbr],
[LATITUDE],
[LONGITUDE],
[LAST NAME],
[FIRST NAME],

case
       when len(ltrim(rtrim(RECEIPT_DATE))) <> 8 then null -- invalid date as size is not 8 
       when ltrim(rtrim(RECEIPT_DATE)) not like '[0-9][0-9][0-9][0-9][0-1][0-9][0-3][0-9]' then null -- not a valid format for date -- note the 0-1 and 0-3 to do a tiny bit of validation at this point
       when right(ltrim(rtrim(RECEIPT_DATE)), 2) = '00' then null -- not valid as day part is zeros
       when substring(ltrim(rtrim(RECEIPT_DATE)),5 , 2) = '00' then null -- not valid as month part is zeros
       when isdate(ltrim(rtrim(RECEIPT_DATE))) = 1 then convert(date, ltrim(rtrim(RECEIPT_DATE)), 112) -- now use isdate to verify that string is a valid date and convert if so using specific format 112
       else null
end as RECEIPT_DATE,
case
       when len(ltrim(rtrim(SALE_DATE))) <> 8 then null -- invalid date as size is not 8 
       when ltrim(rtrim(SALE_DATE)) not like '[0-9][0-9][0-9][0-9][0-1][0-9][0-3][0-9]' then null -- not a valid format for date -- note the 0-1 and 0-3 to do a tiny bit of validation at this point
       when right(ltrim(rtrim(SALE_DATE)), 2) = '00' then null -- not valid as day part is zeros
       when substring(ltrim(rtrim(SALE_DATE)),5 , 2) = '00' then null -- not valid as month part is zeros
       when isdate(ltrim(rtrim(SALE_DATE))) = 1 then convert(date, ltrim(rtrim(SALE_DATE)), 112) -- now use isdate to verify that string is a valid date and convert if so using specific format 112
       else null
end as SALE_DATE,
[SALE_AMOUNT],
[DELIVERY_DATE], 
[ORDER_DATE]

from #Data1
-- remove if adding values as null is acceptable
) t
where RECEIPT_DATE is not null
or SALE_DATE is not null
-- end remove 


I did initially use Select * but then reverted to listing all of the fields within the staging table that I wanted to insert into the production table because I received the following error:

CODE

"Column name or number of supplied values does not match table definition." 


Appreciate any additional insight as I am trying to understand the most efficient method to accomplish the objective.

RE: Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB

Yes, stacking is what you need, as per your second block of code.

My example was just that - an example - to be expanded by yourself as required for all the fields with issues, and with all the remaining fields you need to insert into the destination table.

What matters in my example is the approach, which can be used for any field - some fields may require extra validation and others less, but that is up to you to determine based on the data you have.

Regards

Frederico Fonseca
SysSoft Integrated Ltd
www.syssoft-int.com

FAQ219-2884: How Do I Get Great Answers To my Tek-Tips Questions?
FAQ181-2886: How can I maximize my chances of getting an answer?

RE: Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB

Bill, do you really need this complex verification? Do you really have cases of text dates, as Frederico suggested, like '20:00' or '20:00:00'? As far as you have told us, you had cases like '20170100' and '20170000', i.e. 00 as month or day. If so, TRY_CONVERT would suffice to find those and turn them into NULLs, so you could replace the month and day positions with '01' to fix those dates, or skip them and categorize them as bad data.

Admittedly, Frederico's case statement is more thorough (turning more wrong values into NULL) than TRY_CONVERT: you can check the length of trimmed values to detect too-short dates, you can check for ':' contained in the data, etc., to also sort those out, if they ever occur, even though they currently don't.
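
A minimal sketch of that TRY_CONVERT approach (available from SQL Server 2012 on), reusing the staging column names from your earlier post - adjust the names and the patch rule to your own data:

CODE

select
    -- TRY_CONVERT returns NULL for anything it cannot read as yyyymmdd (style 112)
    try_convert(date, ltrim(rtrim(RECEIPT_DATE)), 112) as RECEIPT_DATE,
    -- variant that first patches a '00' day to '01'; a '00' month still comes back as NULL
    try_convert(date,
        case when right(ltrim(rtrim(RECEIPT_DATE)), 2) = '00'
             then stuff(ltrim(rtrim(RECEIPT_DATE)), 7, 2, '01')
             else ltrim(rtrim(RECEIPT_DATE))
        end, 112) as RECEIPT_DATE_patched
from #Data1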

To apply the same checks to many fields, I would also start designing some scalar-valued user-defined functions, so you can apply them more easily.

Overall, if you're not so well versed in T-SQL, you could consider a data cleansing method using your preferred language; sorry to say so, but you don't give the impression that you gain much from having your data in MSSQL tables.

Bye, Olaf.

RE: Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB

(OP)
Appreciate the insight.

Proficient for the most part with T-SQL, but not advanced relative to my expertise with MS Excel, MS Access, and Power Query/Power Pivot. Not a programmer, although I am comfortable with creating some VBA within MS Excel and MS Access.

As this is the first Sql Server database that I will create, the slope of the learning curve has decreased quite significantly over the last two to three months. Of course, there are some intricacies and nuances related to the creation of a Sql Server database that I am determined to master to ensure that the database is set up as efficiently as possible.

Finalizing the case statements and have begun the review of try_convert and scalar valued user defined functions.

Considering that the data I receive usually arrives as large text files, the goal is to review, validate, and load the "good" data as quickly as possible so that we can perform detailed trend analytics.

I realize that there are various methods to accomplish the goal, and I am interested in learning how to perform the validation task in at least two or three different ways - using case statements, try_convert, scalar-valued user-defined functions, staging tables in another database, etc.

Any additional insight is appreciated.

RE: Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB

A scalar-valued function is just a function you define - for example, a more generic and general CASE statement that processes a parameter value passed in and either returns true/false depending on whether the text is a valid date by your rules (perhaps different from what TRY_CONVERT says), or returns the converted date and NULL for non-convertible dates, just like TRY_CONVERT does, only with your rules.

This just helps you write less code: you don't need to copy the CASE statement over for each field, you just call dbo.yourdateconvert(stagingtable.field) as newdatetime.

Calling a function instead of putting the logic directly into the SQL statement has both pros and cons in performance, but you'll see how it behaves. It certainly shortens your code and effort, and it gives you one central place to change the date validation instead of needing to mend or change all copies of the case statements if necessary.
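
As a rough sketch of what such a function could look like - the name dbo.TryConvertYyyymmdd and the exact rules are placeholders to be tailored to your data:

CODE

CREATE FUNCTION dbo.TryConvertYyyymmdd (@value varchar(60))
RETURNS date
AS
BEGIN
    DECLARE @s varchar(60) = ltrim(rtrim(@value))
    IF @s IS NULL OR len(@s) <> 8
        RETURN NULL                                   -- wrong length
    IF @s NOT LIKE '[0-9][0-9][0-9][0-9][0-1][0-9][0-3][0-9]'
        RETURN NULL                                   -- not all digits / impossible month or day digit
    IF substring(@s, 5, 2) = '00' OR right(@s, 2) = '00'
        RETURN NULL                                   -- zero month or zero day
    RETURN try_convert(date, @s, 112)                 -- NULL if still not a real calendar date
END

The insert then simply becomes select dbo.TryConvertYyyymmdd(RECEIPT_DATE) as RECEIPT_DATE, dbo.TryConvertYyyymmdd(SALE_DATE) as SALE_DATE, ... instead of repeating the CASE block for every date column.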

Bye, Olaf.

RE: Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB

(OP)
Apparently, there are asterisks in several records for quite a few columns of the text data.

Encountered the error

CODE

Conversion failed when converting the varchar value '********' to data type int. 

After downloading and installing a third-party tool (ApexSqlSearch), I was not able to search for the instances where there are asterisks. Therefore, I wrote a SQL query and searched the integer fields.

CODE

Select count([SALE_AMOUNT])
FROM #Data1
where [SALE_AMOUNT] = '********' 

As a result of the query, there were three fields that had asterisks in some of the records.

Created Case statements similar to the following;

CODE

Case
   When [SALE_AMOUNT] = '********'
   ELSE [SALE_AMOUNT] * 1
end as [SALE_AMOUNT] 

Now, upon trying to insert the data from the staging table into the test production table, another error results;

CODE

Error converting data type varchar to numeric 

It appears that I have to repeat the process - reviewing any decimal fields for asterisks and/or non-numeric characters and then constructing the associated Case statements.

Did search the internet for an example of a scalar-valued function. So far, I have not found an example...

Is it possible to provide an example of a scalar valued function that validates at least two of the fields below?

CODE

[LATITUDE]   decimal (10,6),
[LONGITUDE]  decimal (10,6),
[LAST NAME]   [varchar] (25)  NULL,
[FIRST NAME]   [varchar] (25)  NULL,
[RECEIPT_DATE] [date] NULL,
[SALE_DATE] [date] NULL,
[SALE_AMOUNT]  [int] NULL,
[DELIVERY_DATE] [date] NULL,
[ORDER_DATE] [date] NULL


Any additional insight is appreciated.

RE: Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB

CODE --> T-SQL?

Case
   When [SALE_AMOUNT] = '********'
   ELSE [SALE_AMOUNT] * 1
end as [SALE_AMOUNT] 

Please refer to Books Online: https://technet.microsoft.com/en-US/library/ms1817...
This is incomplete: it is missing THEN and a value to return in the case of the asterisks.

CODE --> T-SQL

--Searched CASE expression:
CASE
     WHEN Boolean_expression THEN result_expression [ ...n ] 
     [ ELSE else_result_expression ] 
END 
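
A completed version of your expression might look like this - just a sketch; whether a masked amount should become NULL or 0 is for you to decide:

CODE --> T-SQL

Case
   When [SALE_AMOUNT] = '********' Then NULL   -- or 0, if masked amounts should count as zero
   ELSE Cast([SALE_AMOUNT] as int)
end as [SALE_AMOUNT]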

Quote (BxWill)

Did search the internet for an example of a scalar-valued function. So far, I have not found an example...

Are you sure? Just start up MSSQL Management Studio and take a look at everything you have below the database node of the Object Explorer tree.


Scalar-valued functions are among the simplest programming concepts in any language - nothing at the level of designing OOP classes or more complex things. They return a scalar value, which just means a simple value of a single type, like a double float or a char(20), in contrast to a vector of many values, an array, or a table-valued function, which you find in the node above scalar-valued functions.

Are you a developer at all? If not, this is not your job, you should hire one.

Bye, Olaf.

RE: Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB

sample function for integer validation https://blog.sqlauthority.com/2007/08/11/sql-serve...

This function is limited but contains some rather important aspects.
patindex - to determine if there is any invalid character in the string. I would add a "." to it so as to allow for decimals.
Checking to see whether the leftmost char is a "-" sign. I would also look at the rightmost char for the same purpose, and also allow a "+" sign.

The function does not check for trailing/leading spaces; at least the trailing ones could be removed during validation so that a string like " 1235" is considered valid.

This particular function only checks whether the string is a possibly valid integer - it would be easy for you to modify it to return the converted value, but in that case, although it works fine for integers, a similar one for decimals would be more limiting because of the possibly varying number of decimal positions.

You can also use the sample function as a template, add the date validation you were given, and create a function to validate a date.
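
A rough sketch of such a function along those lines - the name is a placeholder and the linked sqlauthority version differs in its details:

CODE

CREATE FUNCTION dbo.IsValidInteger (@value varchar(60))
RETURNS bit
AS
BEGIN
    DECLARE @s varchar(60) = ltrim(rtrim(@value))     -- drop leading/trailing spaces
    IF @s IS NULL OR len(@s) = 0
        RETURN 0
    IF left(@s, 1) IN ('-', '+')                      -- allow a sign at the front
        SET @s = substring(@s, 2, len(@s))
    IF len(@s) = 0 OR patindex('%[^0-9]%', @s) > 0    -- any remaining non-digit is invalid
        RETURN 0
    RETURN 1
END

Returning the converted value instead of a flag (as discussed above) is a small change: declare RETURNS int and finish with RETURN try_convert(int, ltrim(rtrim(@value))).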

Regards

Frederico Fonseca
SysSoft Integrated Ltd
www.syssoft-int.com

FAQ219-2884: How Do I Get Great Answers To my Tek-Tips Questions?
FAQ181-2886: How can I maximize my chances of getting an answer?

RE: Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB

(OP)
Had to step away from this but now returning...

Just to recap,

Have large text files - from 500,000 to 850,000 records - to be loaded into Sql Server monthly. The format for the date fields is YYYYMMDD. Latitude and Longitude fields do not have decimal points. Also, there is an alpha character at the end of the longitude data (for example, "C" or "T").


Created the staging table, #Data1, using T-SQL

CODE

CREATE TABLE #Data1 (
[Latitude]  [varchar] (60) NULL,
[Longitude]  [varchar] (60) NULL,
[Sale Amount] [varchar] (60) NULL,
[Total Amount] [varchar] (60) NULL,
[Sale Date] [varchar] (60) NULL,
[Receiving Date] [varchar] (60) NULL,
.
.
.


Created the test production table, Data2

CODE

CREATE TABLE Data2 (
[Latitude]  decimal (8,6) NULL,
[Longitude]  decimal (9,6) NULL,
[Sale Amount] int NULL,
[Total Amount] int NULL,
[Sale Date] date NULL,
[Receiving Date] date NULL,
.
.
.

Upon trying to load the data from the staging table to the test production table, I receive the following error;

CODE

Arithmetic overflow error converting varchar to data type numeric.
The statement has been terminated 

I am using the T-SQL below. What revisions are recommended so that the error no longer appears?
Also, any insight as to how I should cast the longitude data, considering that it has an alpha character at the end?

--********************************************************************

CODE

insert into Data2
select
Case 
  WHEN [Latitude] LIKE N'%[^.0-9*-]%' OR LEN([Latitude]) > 11
                THEN 'Invalid Latitude (bad character OR too long). '
  WHEN [Latitude] NOT LIKE N'%*%' AND ISNUMERIC([Latitude]) = 0 AND LEN([Latitude]) > 0
                THEN 'Invalid Latitude (not a number). '
  WHEN ISNUMERIC([Latitude]) = 1 AND CHARINDEX('.', [Latitude]) = 0
                THEN Cast([Latitude]/1000000.0 as decimal(8,6))
  ELSE [Latitude]
END as [Latitude],

Case 
  WHEN [Longitude] LIKE N'%[^.0-9*-]%' OR LEN([Longitude]) > 11
                THEN 'Invalid Longitude (bad character OR too long). '
  WHEN [Latitude] NOT LIKE N'%*%' AND ISNUMERIC([Longitude]) = 0 AND LEN([Longitude]) > 0
                THEN 'Invalid Longitude (not a number). '
  WHEN ISNUMERIC([Longitude]) = 1 AND CHARINDEX('.', [Longitude]) = 0
                THEN Cast(isnull((Left(LTRIM([Longitude]),7)),0) as decimal(9,6))
  ELSE [Longitude]
END as [Longitude],

Case
  When IsNumeric([Sale Amount]) = 0
       Then Null -- Non Numeric data
  When Len([Sale Amount]) > 11
       Then Null --Invalid length
  ELSE [Sale Amount]
END as [Sale Amount],


Case
  When IsNumeric([Total Amount]) = 0
       Then Null -- Non Numeric data
  When Len([Total Amount]) > 11
       Then Null --Invalid length
  ELSE [Total Amount]
END as [Total Amount],

Case
  when len(ltrim(rtrim([Sale Date]))) <> 8 then null -- invalid date as size is not 8 
  when ltrim(rtrim([Sale Date])) not like '[0-9][0-9][0-9][0-9][0-1][0-9][0-3][0-9]' then null -- not a valid format for date 
  when right(ltrim(rtrim([Sale Date])), 2) = '00' then null -- not valid as day part is zeros
  when substring(ltrim(rtrim([Sale Date])),5 , 2) = '00' then null -- not valid as month part is zeros
  when isdate(ltrim(rtrim([Sale Date]))) = 1 then convert(date, ltrim(rtrim([Sale Date])), 112) -- now use isdate to verify that string is a valid date and convert if so using specific format 112
  else null
end as [Sale Date],

Case
  when len(ltrim(rtrim([Receiving Date]))) <> 8 then null -- invalid date as size is not 8 
  when ltrim(rtrim([Receiving Date])) not like '[0-9][0-9][0-9][0-9][0-1][0-9][0-3][0-9]' then null -- not a valid format for date -- note the 0-1 and 0-3 to do a tiny bit of validation at this point
       when right(ltrim(rtrim([Receiving Date])), 2) = '00' then null -- not valid as day part is zeros
       when substring(ltrim(rtrim([Receiving Date])),5 , 2) = '00' then null -- not valid as month part is zeros
       when isdate(ltrim(rtrim([Receiving Date]))) = 1 then convert(date, ltrim(rtrim([Receiving Date])), 112) -- now use isdate to verify that string is a valid date and convert if so using specific format 112
       else null
end as [Receiving Date]

from #Data1
where [Latitude] is not null
or [Longitude] is not null
or [Sale Amount] is not null
or [Total Amount] is not null
or [Sale Date] is not null
or [Receiving Date] is not null 

My assumption is that I should check the fields that have a decimal data type even though the data within the source file does not contain any decimal points.

I am currently reviewing the maximum length of the values in each field that has a decimal data type and will then manually revise the precision for those fields.
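
For reference, a quick sketch of that review in T-SQL, using two of the staging columns from above (extend the list as needed):

CODE

-- Longest trimmed value per column, as a guide for choosing the decimal precision
select max(len(ltrim(rtrim([Latitude]))))  as MaxLenLatitude,
       max(len(ltrim(rtrim([Longitude])))) as MaxLenLongitude
from #Data1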

Any additional insight as to the resolution of this error is appreciated.



RE: Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB

The error points out that the varchar value needs more digits, either before or after the decimal point, than the numeric field allows. Raise your decimal(p,s) definitions; don't let them be too tight.
I would define Latitude and Longitude as float (double precision) anyway; that would do away with such overflows in both magnitude and precision.

decimal(8,6) gives enough room for -90.000000 to +90.000000, but it overflows as soon as the integer part needs more than two digits - which is what happens when one of your undecimalized latitude values is converted before being divided by 1,000,000. You don't necessarily have wrong data, just data in a form more precise, or differently scaled, than you can store directly.
You could also validate further than ISNUMERIC does: define user-defined functions that check the range (between -90 and 90, or -180 and 180, respectively) and round to the wanted precision.
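
For latitude, such a function might be sketched like this; the name, the division by 1,000,000 for your undecimalized values, and the decimal(8,6) return type are assumptions, and an analogous function for longitude would use -180/180 and decimal(9,6):

CODE

CREATE FUNCTION dbo.TryConvertLatitude (@value varchar(60))
RETURNS decimal(8,6)
AS
BEGIN
    DECLARE @s varchar(60) = ltrim(rtrim(@value))
    IF @s IS NULL OR len(@s) = 0 OR patindex('%[^0-9.+-]%', @s) > 0
        RETURN NULL                                   -- asterisks or other junk
    DECLARE @d decimal(18,6) = try_convert(decimal(18,6), @s)
    IF @d IS NULL
        RETURN NULL                                   -- not a number after all
    IF charindex('.', @s) = 0
        SET @d = @d / 1000000.0                       -- source values carry no decimal point
    IF @d < -90 OR @d > 90
        RETURN NULL                                   -- outside the valid latitude range
    RETURN round(@d, 6)
END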

PS: also take a careful look at your case statements for Latitude and Longitude - you got those mixed up (the Longitude case, for example, still tests [Latitude]).

Bye, Olaf.
