Search This Blog

Monday, September 30, 2013

Data Mining: What is Data Mining?

Overview

Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cuts costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.

Continuous Innovation

Although data mining is a relatively new term, the technology is not. Companies have used powerful computers to sift through volumes of supermarket scanner data and analyze market research reports for years. However, continuous innovations in computer processing power, disk storage, and statistical software are dramatically increasing the accuracy of analysis while driving down the cost.

Example

For example, one Midwest grocery chain used the data mining capacity of Oracle software to analyze local buying patterns. They discovered that when men bought diapers on Thursdays and Saturdays, they also tended to buy beer. Further analysis showed that these shoppers typically did their weekly grocery shopping on Saturdays. On Thursdays, however, they only bought a few items. The retailer concluded that they purchased the beer to have it available for the upcoming weekend. The grocery chain could use this newly discovered information in various ways to increase revenue. For example, they could move the beer display closer to the diaper display. And, they could make sure beer and diapers were sold at full price on Thursdays.

Thursday, September 26, 2013

Import data from Excel to SQLServer

Many times we come across a situation where we have readily available excel sheets having enormous data that needs to be imported in database system for enhanced querying capabilities or as a backend datasource for some sort of software development.

We could achieve this goal either by copy-pasting the excel rows directly to the destined database table in SQL Server Management Studio or by querying directly on the excel sheet from SQL Server Management Studio itself!

The former way (copy-pasting) looks pretty simple but consider a situation where you are having 10K records in a single excel sheet and wanted to fetch some selective/filtered records only! At that time, the earlier way is much efficient and ideal.

I've created this sample excel file to import/retrieve the data in SQL server, and is also the file I will be using in this article. Here is the screenshot for the same.

Excel sheet - Cars

OPENROWSET() to query excel files

Here, I've used the OPENROWSET() function to query excel files. This is a T-SQL function that can be used to access any OLE DB data source. All you need is the right OLE DB driver.

The following are the queries to retrieve data from Excel files.

--Excel 2007-2010
SELECT * --INTO #Cars
FROM OPENROWSET('Microsoft.ACE.OLEDB.12.0',
'Excel 12.0 Xml;HDR=YES;Database=C:\cars.xlsx','SELECT * FROM [cars$]');

--Excel 97-2003
SELECT * --INTO #Cars
FROM OPENROWSET('Microsoft.ACE.OLEDB.4.0',
'Excel 8.0;HDR=YES;Database=C:\cars.xls','SELECT * FROM [cars$]');

SQL Server blocked access to STATEMENT OpenRowset OpenDatasource of component Ad Hoc Distributed Queries



Error :


SQL Server blocked access to STATEMENT 'OpenRowset/OpenDatasource' of component 'Ad Hoc Distributed Queries' because this component is turned off as part of the security configuration for this server. A system administrator can enable the use of 'Ad Hoc Distributed Queries' by using sp_configure. For more information about enabling 'Ad Hoc Distributed Queries', see "Surface Area Configuration" in SQL Server Books Online.

Solution :


Solution for this common error in Sql Server is to enable the OpenRowset in Sql Server. We can do it by enabling theShowAdvanced Options and Ad hoc Distributed Queries. this error also appears when you execute some Adhoc queries in Sql Server Query analyzer. If you enable this for the first time, The further statements using this option will go fine without flaw.

To enable the ShowAdvanced Options, Please follow/use the below TSQL statements



SP_CONFIGURE 'SHOW ADVANCED OPTIONS', 1 
GO 
RECONFIGURE 
GO



After executing the above staement if you try to execute or use OpenRowSet or any TSQL commands that requires the Ad Hoc Queries option to be enabled, you would get the below error

Msg 15281, Level 16, State 1, Line 1 SQL Server blocked access to STATEMENT 'OpenRowset/OpenDatasource' of component 'Ad Hoc Distributed Queries' because this component is turned off as part of the security configuration for this server. A system administrator can enable the use of 'Ad Hoc Distributed Queries' by using sp_configure. For more information about enabling 'Ad Hoc Distributed Queries', see "Surface Area Configuration" in SQL Server Books Online.

We can get Rid of the above error using the below TSQL commands


SP_CONFIGURE 'Ad Hoc Distributed Queries', 1 
GO 
RECONFIGURE 
GO 



Once the above TSQL statements are getting executed, you would be able to use the OpenRowSet Statement successfully. To Enable these two options you must have the sysadmin role. If you do not have this role, please contact your database administrator for doing this operation

Monday, September 16, 2013

Understanding SQL Server Configuration Manager

By , 15 Sep 2013 on codeproject

Introduction
SQL Server Configuration Manager is a tool to configure the network protocols used by the SQL Server, and to manage network connectivity.
We will find SQL Server Configuration Manager API in the below path:
Start -> Microsoft SQL Server 2008 R2 -> Configuration Tools -> SQL Server Configuration Manager
SQL Server Configuration Manager gets information for WMI (Windows Management Instrumentation) Provider which takes information for “SQLServerManager10.msc” which is located in “C:\Windows\System32\” path. Generally SQL Server Configuration Manager will use to Start, Stop, resume and Configure services of SQL Server.

Left pane has list of configurations we can do for SQL Server.
  • SQL Server Services
  • SQL Server Network Configuration
  • SQL Server Native Client Configuration

Find Which Column in Which Table Updated from logfile

Introduction

Sometimes, database developers need to work with a database in which they don't have any permission to add any procedure or trigger, because that database is for someone else but they need change tracking. In this tip, I will show a solution for this problem.

Background

First of all, I want to appreciate Muhammad Imran for the great information on his site:http://raresql.com/2012/02/01/how-to-recover-modified-records-from-sql-server-part-1.
Then, as a brief history and description: In SQLServer, the log file (LDF) is a Microsoft solution for implementing transactions, and rollback action uses the log file for rolling back, so we can use the LDF file to track data.

Using the Code

SQL Server has undocumented T-function sys.fn_dblog that is used for reading log file and procedure DBCC PAGE that is used for details of data page like allocation_UnitIdFieldsValuesSlotId.
Create PROCEDURE Find_UpdatedColumn_In_Table
@Database_Name NVARCHAR(MAX),
@SchemaName_n_TableName NVARCHAR(MAX),
@Date_From datetime='1900/01/01',
@Date_To datetime ='9999/12/31',
@FeildName nvarchar(max)
AS
begin
DECLARE @parms nvarchar(1024)
DECLARE @Fileid INT
DECLARE @Pageid INT
DECLARE @Slotid INT

DECLARE @ConsolidatedPageID VARCHAR(MAX)
Declare @AllocUnitID as bigint
Declare @TransactionID as VARCHAR(MAX)
Declare @Operation as VARCHAR(MAX)
Declare @DatabaseCollation VARCHAR(MAX)

/*  Pick The actual data
*/
declare @temppagedata table 
(
[ParentObject] sysname,
[Object] sysname,
[Field] sysname,
[Value] sysname)

declare @pagedata table 
(
[Page ID] sysname,
[AllocUnitId] bigint,
[ParentObject] sysname,
[Object] sysname,
[Field] sysname,
[Value] sysname)

    DECLARE Page_Data_Cursor CURSOR FOR 
    /*We need to filter LOP_MODIFY_ROW,LOP_MODIFY_COLUMNS from log 
    for modified records & Get its Slot No, Page ID & AllocUnit ID*/
    SELECT [PAGE ID],[Slot ID],[AllocUnitId]
    FROM    sys.fn_dblog(NULL, NULL)  
    WHERE    
    AllocUnitId IN 
    (Select [Allocation_unit_id] from sys.allocation_units allocunits
    INNER JOIN sys.partitions partitions ON (allocunits.type IN (1, 3)  
    AND partitions.hobt_id = allocunits.container_id) OR (allocunits.type = 2 
    AND partitions.partition_id = allocunits.container_id)  
    Where object_id=object_ID('' + @SchemaName_n_TableName + ''))
    AND Operation IN ('LOP_MODIFY_ROW','LOP_MODIFY_COLUMNS')  _
    AND [Context] IN ('LCX_HEAP','LCX_CLUSTERED')
    /*Use this subquery to filter the date*/

    AND [TRANSACTION ID] IN (SELECT DISTINCT _
    [TRANSACTION ID] FROM    sys.fn_dblog(NULL, NULL) 
    WHERE Context IN ('LCX_NULL') AND Operation in ('LOP_BEGIN_XACT')  
    --AND [Transaction Name]='UPDATE'
    AND  CONVERT(NVARCHAR(11),[Begin Time]) BETWEEN @Date_From AND @Date_To)

    /****************************************/

    GROUP BY [PAGE ID],[Slot ID],[AllocUnitId]
    ORDER BY [Slot ID]    
 
    OPEN Page_Data_Cursor

    FETCH NEXT FROM Page_Data_Cursor INTO @ConsolidatedPageID, @Slotid,@AllocUnitID

    WHILE @@FETCH_STATUS = 0
    BEGIN
        DECLARE @hex_pageid AS VARCHAR(Max)
        /*Page ID contains File Number and page number It looks like 0001:00000130.
          In this example 0001 is file Number &  
          00000130 is Page Number & These numbers are in Hex format*/
        SET @Fileid=SUBSTRING(@ConsolidatedPageID,0,CHARINDEX_
        (':',@ConsolidatedPageID)) -- Separate File ID from Page ID
        SET @hex_pageid ='0x'+ SUBSTRING(@ConsolidatedPageID,CHARINDEX_
        (':',@ConsolidatedPageID)+1,Len(@ConsolidatedPageID))  ---Separate the page ID
       
        SELECT @Pageid=Convert(INT,cast('' AS XML).value('xs:hexBinary_
        (substring(sql:variable("@hex_pageid"),sql:column("t.pos")) )', _
        'varbinary(max)')) -- Convert Page ID from hex to integer
        FROM (SELECT CASE substring(@hex_pageid, 1, 2) WHEN '0x' THEN 3 ELSE 0 END) AS t(pos) 
                    
        DELETE @temppagedata
        -- Now we need to get the actual data (After modification) from the page
        INSERT INTO @temppagedata EXEC( 'DBCC PAGE(' + @DataBase_Name + ', ' + _
        @fileid + ', ' + @pageid + ', 3) with tableresults,no_infomsgs;'); 
        -- Add Page Number and allocUnit ID in data to identity which one page it belongs to.                    
        INSERT INTO @pagedata SELECT @ConsolidatedPageID,@AllocUnitID,_
        [ParentObject],[Object],[Field] ,[Value] FROM @temppagedata

        FETCH NEXT FROM Page_Data_Cursor INTO  @ConsolidatedPageID, @Slotid,@AllocUnitID
    END

CLOSE Page_Data_Cursor
DEALLOCATE Page_Data_Cursor
DECLARE @Newhexstring VARCHAR(MAX);

--select * from  @PageData
--select * from  @temppagedata
DECLARE @ModifiedRawData TABLE
(
  [ID] INT IDENTITY(1,1),
  [PAGE ID] VARCHAR(MAX),
  [Slot ID] INT,
  [AllocUnitId] BIGINT,
  [RowLog Contents 0_var] VARCHAR(MAX),
  [RowLog Contents 0] VARBINARY(8000)
)
--The modified data is in multiple rows in the page, 
--so we need to convert it into one row as a single hex value.
--This hex value is in string format
INSERT INTO @ModifiedRawData ([PAGE ID],[Slot ID],[AllocUnitId]
,[RowLog Contents 0_var])
SELECT B.[PAGE ID],A.[Slot ID],A.[AllocUnitId]
,
(
SELECT REPLACE(STUFF((SELECT REPLACE(SUBSTRING([VALUE],CHARINDEX(':',[Value])+1,48),'†','')
FROM @pagedata C  
WHERE B.[Page ID]= C.[Page ID] And A.[Slot ID] =LTRIM(RTRIM_
(SUBSTRING(C.[ParentObject],5,3))) And [Object] Like '%Memory Dump%' 
Group By [Value] FOR XML PATH('') ),1,1,'') ,' ','')
) AS [Value]

FROM sys.fn_dblog(NULL, NULL) A
INNER JOIN @pagedata B On A.[PAGE ID]=B.[PAGE ID]
AND A.[AllocUnitId]=B.[AllocUnitId]
AND A.[Slot ID] =LTRIM(RTRIM(Substring(B.[ParentObject],5,3)))
AND B.[Object] Like '%Memory Dump%'
WHERE A.AllocUnitId IN 
(Select [Allocation_unit_id] from sys.allocation_units allocunits
INNER JOIN sys.partitions partitions ON (allocunits.type IN (1, 3)
AND partitions.hobt_id = allocunits.container_id) OR (allocunits.type = 2 
AND partitions.partition_id = allocunits.container_id)  
Where object_id=object_ID('' + @SchemaName_n_TableName + ''))
AND Operation in ('LOP_MODIFY_ROW','LOP_MODIFY_COLUMNS') _
AND [Context] IN ('LCX_HEAP','LCX_CLUSTERED')
/*Use this subquery to filter the date*/

AND [TRANSACTION ID] IN (Select DISTINCT [TRANSACTION ID] FROM sys.fn_dblog(NULL, NULL) 
Where Context IN ('LCX_NULL') AND Operation IN ('LOP_BEGIN_XACT')  
AND [Transaction Name]='UPDATE'
AND  CONVERT(NVARCHAR(11),[Begin Time]) BETWEEN @Date_From AND @Date_To)

/****************************************/
GROUP BY B.[PAGE ID],A.[Slot ID],A.[AllocUnitId]--,[Transaction ID]
ORDER BY [Slot ID]

-- Convert the hex value data in string, convert it into Hex value as well. 
UPDATE @ModifiedRawData  SET [RowLog Contents 0] = cast('' AS XML).value_
('xs:hexBinary(substring(sql:column("[RowLog Contents 0_var]"), 0) )', 'varbinary(max)')
FROM @ModifiedRawData

-- 
select Field, Value  from @temppagedata t
inner join @ModifiedRawData m on LTRIM(RTRIM(SUBSTRING_
(t.[ParentObject],5,3)))  = cast(m.[Slot ID] as nvarchar(1000))
where t.Field =  @FeildName

end

EXEC Find_UpdatedColumn_In_Table  'Test','dbo.Test','2000/01/01','9999/12/31',N'id'

Note

Something to note is that this proc just works with the current log, and if we need to work with log back up, we need to restore.
By 13 Sep 2013

Thursday, August 1, 2013

How To Create an ADO.NET Data Access Utility Class for SQL Server


By:  

Problem

I am a .NET developer and I typically write applications that use a SQL Server database.  I'm looking for a really simple, reusable class that encapsulates my ADO.NET database access code for create, read, update and delete (CRUD).  As I see it I need two methods in the class: one that executes a stored procedure that returns a result set and another that executes a stored procedure that does an insert, update or a delete.  Can you provide an example of how to do it?

Solution

While there are many code samples readily available to encapsulate ADO.NET database access, I prefer the simple, bare-bones approach that satisfies your requirements of a method that executes a query and another that executes a command.  In this tip I will review a solution that has a class library for the database utility and a console application that uses the class library.
I will assume that the reader is familiar with creating .NET applications using Visual Studio.
Connection Strings
When you write ADO.NET code to access a database, you need a connection string to specify the database that you want to access.  The connection string can be stored in your application's app.config file or web.config file (for a web application).  The following is an example of an app.config file:
<?xml version="1.0" encoding="utf-8" ?>
<configuration>
    <startup> 
        <supportedRuntime version="v4.0" sku=".NETFramework,Version=v4.5" />
    </startup>
  <connectionStrings>
    <add name="mssqltips"
         connectionString="data source=localhost;initial catalog=mssqltips;Integrated Security=SSPI;"
         providerName="System.Data.SqlClient" />
  </connectionStrings>
</configuration>
The following are the main points about the app.config file:
  • mssqltips is the name of the connection string; we will use the name mssqltips to access the connection string
  • Data source is the server name of the SQL Server database instance
  • Initial catalog is the database name
  • Integrated Security=SSPI means we are using windows authentication to connect to the database
  • Provider name is the ADO.NET data provider for SQL Server
You can specify many more settings in the connection string than I have shown here.  Take a look atSqlConnection.ConnectionString Property for the details.
Database Utility Class
I will use a class library project named DataAccessUtility to implement the database access utility class.  When a class library is compiled it generates a dynamic link library (.DLL) which can then be referenced from any .NET application.  The class library will have a single class named SqlDatabaseUtility with the following methods:
  • GetConnection() opens a database connection
  • ExecuteQuery() executes a stored procedure that performs a query
  • ExecuteCommand() executes a stored procedure that performs an insert, update or delete
Before I get to reviewing the code in the methods, here are a couple of things that need to be done:
  • Add a reference to System.Configuration to the class library project; I need this to access the connection string in the app.config file
  • Add using statements to the SqlDatabaseUtility class for the namespaces System.Configuration, System.Data and System.Data.SqlClient; I am using classes from these namespaces
The GetConnection() method has the following code:
public SqlConnection GetConnection(string connectionName)
{
  string cnstr = ConfigurationManager.ConnectionStrings[connectionName].ConnectionString;
  SqlConnection cn = new SqlConnection(cnstr);
  cn.Open();
  return cn;
}
The main points about the GetConnection() method are:
  • Reads the connection string from the app.config (or web.config) file
  • Creates an instance of a SqlConnection object passing the connection string into the constructor
  • Calls the Open() method on the SqlConnection object which "opens" a database connection
  • Returns the SqlConnection object to the caller
The ExecuteQuery() method has the following code:
 public DataSet ExecuteQuery(
  string connectionName,
  string storedProcName,
  Dictionary procParameters
)
{
  DataSet ds = new DataSet();
  using(SqlConnection cn = GetConnection(connectionName))
  {
      using(SqlCommand cmd = cn.CreateCommand())
      {
          cmd.CommandType = CommandType.StoredProcedure;
          cmd.CommandText = storedProcName;
          // assign parameters passed in to the command
          foreach (var procParameter in procParameters)
          {
            cmd.Parameters.Add(procParameter.Value);
          }
          using (SqlDataAdapter da = new SqlDataAdapter(cmd))
          {
            da.Fill(ds);
          }
      }
 }
 return ds;
}
The main points about the ExecuteQuery() method are:
  • Creates a Dataset that will be used to return the query results to the caller
  • Calls the GetConnection() method to open a database connection
  • Creates a SqlCommand object from the Connection, and sets the CommandType and CommandText properties
  • Adds any parameters passed in to the SqlCommand parameter collection
  • Creates a SqlDataAdapter for the SqlCommand, and calls the Fill method to execute the query and populate a dataset
  • Returns the Dataset to the caller
  • The SqlConnection, SqlCommand, and SqlDataAdapter objects are wrapped with a "using" statement which ensures that the objects are disposed; the caller is not responsible for "freeing" these objects
The ExeuteCommand() method has the following code:
public int ExecuteCommand(
  string connectionName,
  string storedProcName,
  Dictionary<string, SqlParameter> procParameters
)
{
  int rc;

  using (SqlConnection cn = GetConnection(connectionName))
  {
    // create a SQL command to execute the stored procedure
    using (SqlCommand cmd = cn.CreateCommand())
    {
      cmd.CommandType = CommandType.StoredProcedure;
      cmd.CommandText = storedProcName;

      // assign parameters passed in to the command
      foreach (var procParameter in procParameters)
      {
        cmd.Parameters.Add(procParameter.Value);
      }

      rc = cmd.ExecuteNonQuery();
    }
  }

  return rc;
}
The main points about the ExecuteCommand() method are:
  • Calls the GetConnection() method to open a database connection; the using construct is used to close the database connection automatically
  • Creates a SqlCommand object from the Connection and sets the CommandType and CommandText properties
  • Adds any parameters passed in to the SqlCommand parameter collection
  • Calls the SqlCommand ExecuteNonQuery method to call the stored procedure; the return value is the number of rows affected; e.g. the number of rows inserted, update or deleted by the command
  • The SqlConnection and SqlCommand objects are wrapped with a "using" statement which ensures that the objects are disposed; the caller is not responsible for "freeing" these objects
Console Application
In this section I will review a .NET console application that will access a SQL Server database by using the SqlDatabaseUtility  class.  Here is a T-SQL script that creates a table, and two stored procedures - one that inserts a row and another that performs a query:
use mssqltips
go

create table [dbo].[customer] (
 [id] [int] identity(1,1) NOT NULL,
 [name] [varchar](50) NOT NULL,
 [state] [varchar](2) NOT NULL,
 constraint [pk_customer] primary key clustered ([id] asc)
)
go

create procedure dbo.AddCustomer
 @name  varchar(50)
,@state char(2)
as
begin
 insert into dbo.customer
  ([name], [state])
 values
  (@name, @state)
end
go

create procedure dbo.GetCustomerList
as
begin
 select [id], [name], [state]
 from dbo.customer
end
go
Before I get to reviewing the code in the console application, here are a couple of things that need to be done:
  • Add a reference to the DataAccessUtility class library to the console application; I need this to call the methods in the SqlDatabaseUtility class
  • Add a using statement for the DataAccessUtility, System.Data and System.Data.SqlClient namespaces
  • Create an mssqltips database and run the above T-SQL script in it
  • Put the connectionStrings element (shown in the Connection Strings section above) into the app.config file in the console application project
Here is the code to call the AddCustomer stored procedure:
SqlDatabaseUtility dbutility = new SqlDatabaseUtility();
            
// add a customer
Dictionary cmdParameters = new Dictionary();
cmdParameters["name"] = new SqlParameter("name", "Smith");
cmdParameters["state"] = new SqlParameter("state", "MD");
dbutility.ExecuteCommand("mssqltips", "dbo.AddCustomer", cmdParameters);
The main points about the above code are:
  • Create an instance of the SqlDatabaseUtility class
  • Create a Dictionary collection for parameters; it's like a name-value pair
  • Add parameters to the collection; parameter names must match the stored procedure parameters
  • Call the SqlDatabaseUtility ExecuteCommand method passing in the connection name, stored procedure name, and the parameter collection
Here is the code to call the GetCustomerList stored procedure:
Dictionary<string, SqlParameter> queryParameters = new Dictionary<string, SqlParameter>();
DataSet ds = dbutility.ExecuteQuery("mssqltips", "dbo.GetCustomerList", queryParameters);
DataTable t = ds.Tables[0];
foreach(DataRow r in t.Rows)
{
  Console.WriteLine(string.Format("{0}\t{1}\t{2}", 
    r[0].ToString(), 
    r[1].ToString(), 
    r[2].ToString()));
}
The main points about the above code are:
  • Create a Dictionary collection for parameters; even though the GetCustomerList does not take any parameters, you still have to pass an empty collection
  • Call the SqlDatabaseUtility ExecuteQuery method passing the connection name, stored procedure name, and empty parameter collection
  • ExecuteQuery returns a Dataset which is a collection of DataTables
  • Get the first Datatable from the Dataset, iterate through the rows and print the column values to the console

Next Steps

  • The above code is an example of a very simple approach to calling stored procedures from .NET code.
  • To keep the code as simple as possible, there is no exception handling shown.  You do need try/catch blocks around your database calls.  Take a look at the SqlCommand methods to see the kinds of exceptions that you need to catch. 
  • Download the sample code here to experiment on your own.

Tuesday, July 30, 2013

SSIS Interview Questions - Part 1


By: 

Problem

When preparing for a SQL Server interview, it is helpful to understand what questions may be asked related to SSIS.  In this tip series,  I will try to cover as much as I can to help you prepare.

Solution

What is SQL Server Integration Services (SSIS)?

  • SQL Server Integration Services (SSIS) is component of SQL Server 2005 and later versions. SSIS is an enterprise scale ETL (Extraction, Transformation and Load) tool which allows you to develop data integration and workflow solutions. Apart from data integration, SSIS can be used to define workflows to automate updating multi-dimensional cubes and automating maintenance tasks for SQL Server databases.

How does SSIS differ from DTS?

  • SSIS is a successor to DTS (Data Transformation Services) and has been completely re-written from scratch to overcome the limitations of DTS which was available in SQL Server 2000 and earlier versions. A significant improvement is the segregation of the control/work flow from the data flow and the ability to use a buffer/memory oriented architecture for data flows and transformations which improve performance.

What is the Control Flow?

  • When you start working with SSIS, you first create a package which is nothing but a collection of tasks or package components.  The control flow allows you to order the workflow, so you can ensure tasks/components get executed in the appropriate order.

What is the Data Flow Engine?

  • The Data Flow Engine, also called the SSIS pipeline engine, is responsible for managing the flow of data from the source to the destination and performing transformations (lookups, data cleansing etc.).  Data flow uses memory oriented architecture, called buffers, during the data flow and transformations which allows it to execute extremely fast. This means the SSIS pipeline engine pulls data from the source, stores it in buffers (in-memory), does the requested transformations in the buffers and writes to the destination. The benefit is that it provides the fastest transformation as it happens in memory and we don't need to stage the data for transformations in most cases.

What is a Transformation?

  •  A transformation simply means bringing in the data in a desired format. For example you are pulling data from the source and want to ensure only distinct records are written to the destination, so duplicates are  removed.  Anther example is if you have master/reference data and want to pull only related data from the source and hence you need some sort of lookup. There are around 30 transformation tasks available and this can be extended further with custom built tasks if needed.

What is a Task?

  • A task is very much like a method of any programming language which represents or carries out an individual unit of work. There are broadly two categories of tasks in SSIS, Control Flow tasks and Database Maintenance tasks. All Control Flow tasks are operational in nature except Data Flow tasks. Although there are around 30 control flow tasks which you can use in your package you can also develop your own custom tasks with your choice of .NET programming language.

What is a Precedence Constraint and what types of Precedence Constraint are there?

  • SSIS allows you to place as many as tasks you want to be placed in control flow. You can connect all these tasks using connectors called Precedence Constraints. Precedence Constraints allow you to define the logical sequence of tasks in the order they should be executed. You can also specify a condition to be evaluated before the next task in the flow is executed.
  • These are the types of precedence constraints and the condition could be either a constraint, an expression or both 
    • Success (next task will be executed only when the last task completed successfully) or
    • Failure (next task will be executed only when the last task failed) or
    • Complete (next task will be executed no matter the last task was completed or failed).

What is a container and how many types of containers are there?

  • A container is a logical grouping of tasks which allows you to manage the scope of the tasks together.
  • These are the types of containers in SSIS:
    • Sequence Container - Used for grouping logically related tasks together
    • For Loop Container - Used when you want to have repeating flow in package
    • For Each Loop Container - Used for enumerating each object in a collection; for example a record set or a list of files.
  • Apart from the above mentioned containers, there is one more container called the Task Host Container which is not visible from the IDE, but every task is contained in it (the default container for all the tasks).

What are variables and what is variable scope?

  • A variable is used to store values. There are basically two types of variables, System Variable (like ErrorCode, ErrorDescription, PackageName etc) whose values you can use but cannot change and User Variable which you create, assign values and read as needed. A variable can hold a value of the data type you have chosen when you defined the variable.
  • Variables can have a different scope depending on where it was defined. For example you can have package level variables which are accessible to all the tasks in the package and there could also be container level variables which are accessible only to those tasks that are within the container.

Monday, July 29, 2013

Import Excel unicode data with SQL Server Integration Services


By:    |   Read Comments (41)   |   Related Tips: 1 | 2 | 3 | 4 | 5 | More > Integration Services Excel

Problem

One task that most people are faced with at some point in time is the need to import data into SQL Server from an Excel spreadsheet.  We have talked about different approaches to doing this in previous tips using OPENROWSET, OPENQUERY, Link Servers, etc... These options are great, but they may not necessarily give you as much control as you may need during the import process.
Another approach to doing this is using SQL Server Integration Services (SSIS).  With SSIS you can import different types of data as well as apply other logic during the importing process.  One problem though that I have faced with importing data from Excel into a SQL Server table is the issue of having to convert data types from Unicode to non-Unicode.  SSIS treats data in an Excel file as Unicode, but my database tables are defined as non-Unicode, because I don't have the need to store other code sets and therefore I don't want to waste additional storage space.  Is there any simple way to do this in SSIS?

Solution

If you have used SSIS to import Excel data into SQL Server you may have run into the issue of having to convert data from Unicode to non-Unicode.  By default Excel data is treated as Unicode and also by default when you create new tables SQL Server will make your character type columns Unicode as well (nchar, nvarchar,etc...)  If you don't have the need to store Unicode data, you probably always use non-Unicode datatypes such as char and varchar when creating your tables, so what is the easiest way to import my Excel data into non-Unicode columns?
The following shows two different examples of importing data from Excel into SQL Server.  The first example uses Unicode datatypes and the second does not.
Here is what the data in Excel looks like.

Example 1 - Unicode data types in SQL Server

Our table 'unicode" is defined as follows:
CREATE TABLE [dbo].[unicode](
[firstName] [nvarchar](50) NULL,
[lastName] [nvarchar](50) NULL
) ON [PRIMARY]
If we create a simple Data Flow Task and an Excel Source and an OLE DB Destination mapping firstname to firstname and lastname to lastname the import works great as shown below.

Example 2- non-Unicode data types in SQL Server

Our table 'non_unicode" is defined as follows:
CREATE TABLE [dbo].[non_unicode](
[firstName] [varchar](50) NULL,
[lastName] [varchar](50) NULL
) ON [PRIMARY]
If we map the columns firstname to firstname and lastname to lastname we automatically get the following error in the OLE DB Destination.
Columns "firstname" and "firstname" cannot convert between unicode and non-unicode data types...
If we execute the task we get the following error dialog box which gives us additional information.

Solving the Problem

So based on the error we need to convert the data types so they are the same types.
If you right click on the OLE Destination and select "Show Advanced Editor" you have the option of changing the DataType from string [DT_STR] to Unicode string [DT_WSTR].  But once you click on OK it looks like the changed was saved, but if you open the editor again the change is gone and back to the original value.  This makes sense since you can not change the data type in the actual table.

If you right click on the Excel Source and select "Show Advanced Editor" you have the option of changing the DataType from Unicode string [DT_WSTR] to string [DT_STR] and the change is saved. 
If you click OK the change is saved, but now you get the error in the Excel Source that you can not convert between unicode and non-unicode as shown below.  So this did not solve the problem either.

Using the Data Conversion Task

So to get around this problem we have to also use a Data Conversion task.  This will allow us to convert data types so we can get the import completed.  The following picture shows the "Data Conversion" task in between the Excel Source and the OLE DB Destination.
If you right click on "Data Conversion" and select properties you will get a dialog box such as the following.  In here we created an Output Alias for each column.
Our firstname column becomes firstname_nu (this could be any name you want) and we are making the output be a non-unicode string.  In addition we do the same thing for the lastname column.
If we save this and change the mapping as shown to use our new output columns and then execute the task we can see that the import was successful.

As you can see this is pretty simple to do once you know that you need to use the Data Conversion task to convert the data types.