May 2012 - Artículos

After data generation in Excel explained in the first part of the article, in this second installment we'll show how to insert that information into a SQL Server database.

 

Database creation

After spreadsheet creation, we'll transfer its contents into a SQL Server database running the following script from SQL Server Management Studio.

USE master
GO
CREATE DATABASE Poblacion
GO

USE Poblacion
GO

CREATE TABLE DatosPoblacion (
Fila_ID int NOT NULL,
Edad_ID int NULL,
Sexo_ID char(1) NULL,
CCAA_ID int NULL,
Pais_ID int NULL,
Fecha_Alta datetime NULL,
CONSTRAINT PK_DatosPoblacion PRIMARY KEY CLUSTERED (Fila_ID ASC))
GO

CREATE TABLE Sexo (
Sexo_ID char(1) NOT NULL,
Sexo_DS varchar(10) NULL,
CONSTRAINT PK_Sexo PRIMARY KEY CLUSTERED (Sexo_ID ASC))
GO

CREATE TABLE CCAA (
CCAA_ID int NOT NULL,
CCAA_DS varchar(50) NULL,
CONSTRAINT PK_CCAA PRIMARY KEY CLUSTERED (CCAA_ID ASC))
GO

CREATE TABLE Pais (
Pais_ID int NOT NULL,
Pais_DS varchar(50) NULL,
CONSTRAINT PK_Paises PRIMARY KEY CLUSTERED (Pais_ID ASC))
GO

 

As we have seen, in addition to the table that will house the data generated from Excel, we will also create the catalog tables, which contain descriptions of some fields of code existing in DatosPoblacion table with which establish the necessary relationships.

 

Import Excel data from SQL Server

To insert data into DatosPoblacion table, we'll use Transact-SQL's OPENROWSET function as follows.

INSERT INTO DatosPoblacion
SELECT Fila_ID,Edad_ID,Sexo_ID,CCAA_ID,Pais_ID,Fecha_Alta
FROM OPENROWSET('Microsoft.ACE.OLEDB.12.0',
'Excel 8.0;Database=C:\DatosOrigen\GenerarDatosPoblacion.xlsx',
'SELECT * FROM [Hoja1$]')

 

However, it's possible we get an error when trying to execute this insert sentence, which informs us that SQL Server is not configured to query in this way, called 'Ad Hoc Distributed Queries'.

If we want to check the SQL Server settings, we must run the system stored procedure sp_configure. However, it is likely that among the options listed, we do not see the distributed queries configuration. If we are in this case, we must enable the advanced options display using the following sentences.

EXECUTE sp_configure 'show advanced options',1
GO
RECONFIGURE
GO

 

Now we see the value of 'Ad Hoc Distributed Queries' option by executing sp_configure.To enable it execute the following.

EXECUTE sp_configure 'Ad Hoc Distributed Queries',1
GO
RECONFIGURE
GO

 

When we run sp_configure back, we will already see activated the ability to run distributed queries.

 

Thus, the previous sentence with OPENROWSET will work properly, filling the DatosPoblacion table with the content of the GenerarDatosPoblacion.xlsx file.

 

Optimizing data import from Excel

However, at the current point we may face a performance problem, because if we followed the above steps for creating the Excel file, we have a spreadsheet with one million rows, which can take about fifteen minutes to load into SQL Server table. For the sample developed in this article, we have used a virtual machine with Windows 7 operating system and 1.5 GB RAM, so that the above times may vary depending on the configuration used for these tests.

If we want to reduce these load times we may opt an alternative technique for transferring data which explain below.

In first place we will open back the file GenerarDatosPoblacion.xlsx, saving it as "CSV (Comma delimited)" type.

 

In this way we'll get a text file with fields delimited by the semicolon character. We can see its contents by opening it with Notepad.

 

Then we will create a new table in the database with the following structure.

CREATE TABLE DatosPoblacionExcel (
Fila_ID varchar(20) NULL,
Edad_ID varchar(20) NULL,
Sexo_ID varchar(20) NULL,
CCAA_ID varchar(20) NULL,
Pais_ID varchar(20) NULL,
Anualidad varchar(20) NULL,
Mes varchar(20) NULL,
Dia varchar(20) NULL,
Fecha_Alta varchar(20) NULL)

 

Into this new table we will import the contents of GenerarDatosPoblacion.csv, using the Transact-SQL statement BULK INSERT. Through the FIELDTERMINATOR option, we'll specify the character used as field separator, while the option FIRSTROW indicate that the reading of data starts in the second row of the file, because the first contains the column names.

BULK INSERT DatosPoblacionExcel
FROM 'C:\DatosOrigen\GenerarDatosPoblacion.csv'
WITH (FIELDTERMINATOR =';', FIRSTROW=2)

 

The time consumed by the bulk insert operation in the table DatosPoblacionExcel be about thirty seconds.

To finish this process, we will insert in the table DatosPoblacion the records from DatosPoblacionExcel table, excluding the unnecessary fields, as shown in the following statement, whose execution will take approximately 15 seconds.

INSERT INTO DatosPoblacion
SELECT Fila_ID,Edad_ID,Sexo_ID,CCAA_ID,Pais_ID,Fecha_Alta
FROM DatosPoblacionExcel

 

As we have seen, this inserting data technique, but requires us to perform an additional step, is a significant gain in time, employing less than a minute in the transfer of data to the table DatosPoblacion, compared to the fifteen minutes used by the OPENROWSET function.

 

Importing the other catalog tables

For tables CCAA and Pais, we will use two Excel files containing, respectively, the official classification of regions and countries .These files are available in compressed format on the website of the Madrid Community Statistics Institute. Once downloaded and unzipped we'll run the following SQL statements for import into our database.

INSERT INTO CCAA
SELECT DISTINCT ccaa, liteccaa
FROM OPENROWSET('Microsoft.Jet.OLEDB.4.0',
'Excel 8.0;Database=C:\DatosOrigen\ccaaprov.xls',
'SELECT * FROM [ccaaprov$]')


INSERT INTO Pais
SELECT DISTINCT isopais,lpaisc
FROM OPENROWSET('Microsoft.Jet.OLEDB.4.0',
'Excel 8.0;Database=C:\DatosOrigen\cozonu.xls',
'SELECT * FROM [cozonu$]')

 

Note that depending on the version of Excel file to import, in OPENROWSET function we'll use a different provider to get the data.If the file is for Excel 2007-2010 will use 'Microsoft.ACE.OLEDB.12.0 ', while for earlier versions will be "Microsoft.Jet.OLEDB.4.0 '.

 

Setting the country code in the table DatosPoblacion

A closer look at the records of the Pais table will unveil that Pais_ID field values ​​are uncorrelated, being also the lowest value 4 and the highest 894.

 

This contrasts with existing data in the field of the same name for the table DatosPoblacion because, although the maximum and minimum value of this field is also between 4 and 894, we find a lot of records where the field Pais_ID not correspond to any value in the Pais table.

To solve this problem we resort to a couple of techniques, of which the first will be taken from DatosPoblacion table, each Pais_ID field values ​​that do not exist in the Pais table, adding one until we reach a value which does exist in the countries catalog table mentioned. We will implement this process in the SQL Server function below.

CREATE FUNCTION dbo.ObtenerPais(@nPais_ID int)
RETURNS int
AS
BEGIN

WHILE (SELECT COUNT(*) FROM Pais WHERE Pais_ID = @nPais_ID) = 0
BEGIN
SET @nPais_ID = @nPais_ID + 1
END

RETURN @nPais_ID

END
GO

 

The update of field Pais_ID in table DatosPoblacion the take out with the next statement.

UPDATE DatosPoblacion
SET Pais_ID = dbo.ObtenerPais(Pais_ID)
WHERE Pais_ID NOT IN (SELECT Pais_ID FROM Pais)

 

The second technique is more direct, as it avoids the use of the search function of the field Pais_ID in the countries table.What we do here is an update of the field Pais_ID for the entire table DatosPoblacion, looking in the Pais table, the value of Pais_ID closest to that exists in the same field in the table DatosPoblacion.

UPDATE DatosPoblacion
SET Pais_ID = (SELECT TOP 1 Pais.Pais_ID FROM Pais
WHERE Pais.Pais_ID >= DatosPoblacion.Pais_ID
ORDER BY Pais.Pais_ID)

 

In both cases, we get that all the records in the DatosPoblacion table join correctly with the Pais table through Pais_ID field.

 

Manual data entry

After the above operations, the only remaining empty table is Sexo, so we'll run the following statements, which will create the necessary records.

INSERT INTO Sexo VALUES ('H','Hombre')
INSERT INTO Sexo VALUES ('M','Mujer')

 

Establishing relationships between tables

To finish the database creation, we will establish the appropriate relationships between fields in the table DatosPoblacion and other catalog tables using the following sentences.

ALTER TABLE DatosPoblacion WITH CHECK ADD
CONSTRAINT FK_DatosPoblacion_Sexo FOREIGN KEY(Sexo_ID) REFERENCES Sexo (Sexo_ID),
CONSTRAINT FK_DatosPoblacion_CCAA FOREIGN KEY(CCAA_ID) REFERENCES CCAA (CCAA_ID),
CONSTRAINT FK_DatosPoblacion_Pais FOREIGN KEY(Pais_ID) REFERENCES Pais (Pais_ID)

 

After this operation we finish this article, which explained the different parts of a process for generating, from Excel, a considerable amount of test data that can be used later from SQL Server.We hope you find useful.

Regards.

Creating sample data

During the application development stages, in most of the time, we are faced with the need to have at our disposal a set of test data to use in the different processes that are developing. A similar situation occurs if we are building an information system based on OLAP data cubes using SQL Server Analysis Services, because in these cases we will need also a large volume of data, to perform analysis simulations.

Suppose we have to generate a database of population data, with a table containing data of individuals such as age, code region of residence, national origin, sex, registration date, etc. On the other hand, also need a series of catalog tables of countries, regions, and other values ​​related to the fields in the table of individuals.

Among the full range of utilities, tricks, and more, which exist to carry out this task, in this article we will use Excel as a tool for generating the dummy data set for individuals that later we'll dump in a SQL Server database, which could be used as a data source for the application or information system.

 

Creating data with Excel

After starting Excel 2010 (earlier versions may also use), our first task will be to create a column with the values ​​that will serve to identify the rows in the table. The simplest way to generate is to introduce a pair of consecutive numbers in individual cells in a column of the worksheet, select both cells and drag the fill handle to the last row for which we want to generate the numbers.

 

However, the generation of values ​​using this technique can be somewhat cumbersome in the case that we must to produce a large number of rows and / or columns, so that for the creation of data for all columns of the sheet we'll use a more flexible yet powerful solution: Excel macros.

 

Creating a Macro

To create a macro we will click on the View tab of the Excel ribbon, and within the Macros group  we will click on the option of the same name, which opens the Macro dialog box, where we'll write the macro name: CrearDatosPoblacion.

 

Clicking the Create button opens the Visual Basic for Applications (VBA) window editor, so that we can start writing the code of the macro.

 

Within the procedure body CrearDatosPoblación, write the following code block, where first, clean up the cells of the spreadsheet where we are currently positioned. Then ask the user to enter through an InputBox dialog box, the number of rows to generate. After inserting the column title, we will introduce the first two values ​​that start the series, which generate using the method  Selection.AutoFill.

Option Explicit

Sub CrearDatosPoblacion()

Dim nFilaDestino As Long
Dim sCeldaOrigen As String
Dim sCeldaDestino As String

'clean worksheet cells
Cells.Select
Selection.ClearContents

nFilaDestino = InputBox("Number of records to generate")
nFilaDestino = nFilaDestino + 1

'fila_id column
'---------------

'column caption
Range("A1").Select
ActiveCell.FormulaR1C1 = "Fila_ID"

'init values of series to generate
Range("A2").Select
ActiveCell.FormulaR1C1 = "1"
Range("A3").Select
ActiveCell.FormulaR1C1 = "2"

'select init values range
Range("A2:A3").Select

'fill total range of cells
sCeldaOrigen = "A2"
sCeldaDestino = "A" & nFilaDestino
Selection.AutoFill Destination:=Range(sCeldaOrigen & ":" & sCeldaDestino), Type:=xlFillDefault

End Sub

 

 

To run the macro will select the menu "Run | Run Sub / UserForm", or press the key F5, filling the first column of the sheet with the number of values ​​indicated in the InputBox.

 

Before continuing we'll save our work in VBA or Excel, bearing in mind that we must do it in a file of type "Excel Macro-Enabled Workbook (.xlsm)", what we will be notified by a dialog box in time to save. We will click "No" in that dialog and save our worksheet in a file named GenerarDatosPoblacion.xlsm.

 

Generating random data

The next column to create correspond to the age of individuals, in it, need to generate random values ​​within a range of numbers, representing the maximum and minimum age that a person can have, for example between 0 and 120.

To create a random value between two numbers in a cell of Excel, we can use a formula with RANDBETWEEN function, that receives as parameter the above numbers, returning as a result the number generated.

 

Moving this functionality to the macro that we developed, we will add to it the next code block, which assign the formula expression RANDBETWEEN (0,120) to a cell in the second column. We will select that cell, and repeat the formula over a range of cells using the method ActiveCell.AutoFill.

'age
'----
Range("B1").Select
ActiveCell.FormulaR1C1 = "Edad_ID"

sCeldaOrigen = "B2"
sCeldaDestino = "B" & nFilaDestino
Range(sCeldaOrigen).Select
ActiveCell.FormulaR1C1 = "=RANDBETWEEN(0,120)"
ActiveCell.AutoFill Destination:=Range(sCeldaOrigen & ":" & sCeldaDestino), Type:=xlFillDefault

 

Running the macro again, we'll get the new column with the age data.

 

Our next step is for the column of sex of individuals. Here we could have used numbers 1 and 2, also generated randomly, to identify male and female respectively, but let's get a little this operation, using instead the letters H and M as values ​​for the column. So the problem now is how to randomly generate these letters in the cells of the column, since RANDBETWEEN only receives and returns numerical results.

The solution is very simple, as it also involves using RANDBETWEEN function, but combining it with the decision expression IF. We will pass to RANDBETWEEN the numbers 1 and 2 as parameters, and according to the result obtained, IF return the letter H or M. Below is the code block for this column, we will add to the macro.

'sex
'---
Range("C1").Select
ActiveCell.FormulaR1C1 = "Sexo_ID"

sCeldaOrigen = "C2"
sCeldaDestino = "C" & nFilaDestino
Range(sCeldaOrigen).Select
ActiveCell.FormulaR1C1 = "=IF(RANDBETWEEN(1,2)=1,""H"",""M"")"
ActiveCell.AutoFill Destination:=Range(sCeldaOrigen & ":" & sCeldaDestino), Type:=xlFillDefault

 

 

For the next two columns: code region of residence and country of origin code, we'll follow the same mechanics as the column ages, although using different numerical ranges.

'code region of residence
'------------------------
Range("D1").Select
ActiveCell.FormulaR1C1 = "CCAA_ID"

sCeldaOrigen = "D2"
sCeldaDestino = "D" & nFilaDestino
Range(sCeldaOrigen).Select
ActiveCell.FormulaR1C1 = "=RANDBETWEEN(1,19)"
ActiveCell.AutoFill Destination:=Range(sCeldaOrigen & ":" & sCeldaDestino), Type:=xlFillDefault

'country
'-------
Range("E1").Select
ActiveCell.FormulaR1C1 = "Pais_ID"

sCeldaOrigen = "E2"
sCeldaDestino = "E" & nFilaDestino
Range(sCeldaOrigen).Select
ActiveCell.FormulaR1C1 = "=RANDBETWEEN(4,894)"
ActiveCell.AutoFill Destination:=Range(sCeldaOrigen & ":" & sCeldaDestino), Type:=xlFillDefault

 

 

Our next step is to create the data of an alleged registration date of individuals in this population system, a task that will make in two phases. First, we create each part of the date in separate columns.

'date elements:
'==============
'year
'---
Range("F1").Select
ActiveCell.FormulaR1C1 = "Anualidad"

sCeldaOrigen = "F2"
sCeldaDestino = "F" & nFilaDestino
Range(sCeldaOrigen).Select
ActiveCell.FormulaR1C1 = "=RANDBETWEEN(2008,2010)"
ActiveCell.AutoFill Destination:=Range(sCeldaOrigen & ":" & sCeldaDestino), Type:=xlFillDefault

'month
'-----
Range("G1").Select
ActiveCell.FormulaR1C1 = "Mes"

sCeldaOrigen = "G2"
sCeldaDestino = "G" & nFilaDestino
Range(sCeldaOrigen).Select
ActiveCell.FormulaR1C1 = "=RANDBETWEEN(1,12)"
ActiveCell.AutoFill Destination:=Range(sCeldaOrigen & ":" & sCeldaDestino), Type:=xlFillDefault

'day
'---
Range("H1").Select
ActiveCell.FormulaR1C1 = "Dia"

sCeldaOrigen = "H2"
sCeldaDestino = "H" & nFilaDestino
Range(sCeldaOrigen).Select
ActiveCell.FormulaR1C1 = "=IF(RC[-1]=2, RANDBETWEEN(1,28), IF(OR(RC[-1]=4, RC[-1]=6, RC[-1]=9, RC[-1]=11), RANDBETWEEN(1,30), RANDBETWEEN(1,31)))"
ActiveCell.AutoFill Destination:=Range(sCeldaOrigen & ":" & sCeldaDestino), Type:=xlFillDefault

 

As we have seen, the day is the part of the date that requires further work, as we must be careful to avoid the generation, for example, of a day 31 if in the month column we have the value 2. We will solve this checking the month column in first place, and according this we'll use a different range to generate the day. So that in the formula we'll use several IF expressions combined with RANDBETWEEN functions.

 

In the second phase of this operation, we will create a final column in the spreadsheet with a date in a format understandable to SQL Server, as a result of the concatenation of the above columns: Anualidad, Mes and Dia.

It is necessary to take into account when composing the date in this way, we must add a zero to the month and day when these values ​​are only one digit.

'date composition
'----------------
Range("I1").Select
ActiveCell.FormulaR1C1 = "Fecha_Alta"

sCeldaOrigen = "I2"
sCeldaDestino = "I" & nFilaDestino
Range(sCeldaOrigen).Select
ActiveCell.FormulaR1C1 = "=RC[-3] & IF(LEN(RC[-2])=1,""0"" & RC[-2],RC[-2]) & " & _
"IF(LEN(RC[-1])=1,""0"" & RC[-1],RC[-1]) "

ActiveCell.AutoFill Destination:=Range(sCeldaOrigen & ":" & sCeldaDestino), Type:=xlFillDefault

 

When we rerun the macro, we'll get the correctly formatted date in the last column. This time also establish the number of records to generate in a million, that way we will test the data creation power of our process.

 

To complete operations in Excel we'll save the file as "Excel Workbook (*.xlsx)". Dialog box appears again warning that we couldn't save macros in a .xlsx file, which we'll reply by clicking Yes.

At this point we conclude the first part of the article, in the second installment will explain how to transfer the data just generated to SQL Server.

Regards,