This article addresses the challenge of develop a process to create a database with demographic information, which serves as a starting point for creating a data model in PowerPivot, used in the construction of the population pyramid shown in Population pyramids with PowerPivot. Preparing the data and Population pyramids with PowerPivot. Chart development.

Unlike the article «Generate SQL Server test data from Excel» (part 1 and part 2) also published in this blog, where we used Excel to generate dummy data population, this time we will work with real data, which represent a larger volume of information to manage with respect the above-mentioned article.

The current aim is to create a database containing a population table, in which each record represents an individual, with its corresponding age, sex and ​​healthcare area.

Before proceeding, I want to express my gratitude to Ricard Gènova Maleras, demographer specializing in population health analysis, member of the Health Service Reports and Studies (Directorate of Health Promotion and Prevention, DG Primary Care, Health Ministry CM) for their valuable advice in those key demographics for the development of this and the all related population pyramids published in this blog. Equally, I want to thank to the rest members of that Service: Jenaro Astray Mochales, Felicitas Dominguez Berjón and María Dolores Esteban Vasallo, for their support and assistance in all health concepts about population necessary for the proper focus of this article.

Last but not least, thanks to Enrique Barceló, technical fellow in the BI team, for their support and advise about matters relating to the OLAP data cubes development.

 

Data collection and preparation

There are several Web sites belonging to different agencies, where we will find demographic information needed to perform our analysis, organized by criteria such as sex, age, geographic region, etc. Of all these agencies to highlight the relevant United Nations , National Statistical Institute (Spain), Institute of Statistics of the CAM (Madrid) and CELADE (demographics of Latin America and Caribbean). In all cases, in addition to direct consultation in the website, we have the ability to download information in various formats such as Excel, CSV, etc.

For the development of the examples in this article we’ll use the registered population data from the Community of Madrid for the year 2010, classified by age, sex and health zoning, effective as of that year, found in the following link (T10Z2_SALUD10) from the website of the Statistical Institute of CAM.

 

Data extraction strategy

The Excel file pc10t10z2_salud10.xls obtained from the above link, organizes the data of population around three worksheets containing respectively, the values ​​of total population, men and women.

Each worksheet has in its first two columns the code and name of the ​​health care area, while the remaining columns contain population figures classified in five-year groups, where each group has a column with the total five-year and several columns representing the detail by simple age of the ages that make up the group.

DemographicDataGenerationSQLServer_01

 

The data we need extract are, first, the codes and names of health zoning, for which we can use any of the worksheets in the Excel file. Moreover, we have to get the total five-year age group in both sheets: male and female population.

​​We will bring all these values to a new worksheet in which we copy each combination of health area codes, age group and population type (male / female) in order of steps similar to the following figure.

DemographicDataGenerationSQLServer_02

 

Here we see an approximation of what would be the resulting spreadsheet.

DemographicDataGenerationSQLServer_03

 

Using select, copy and paste operations by hand to transfer the needed data to a new sheet can be a cumbersome and heavy task. That is why we propose to use a macro that automates all this work, allowing us to repeat the process as many times as we want, and apply it to other populations whose data is structured the same way.

To create the macro, on the Excel Ribbon will click on the «Macros» option, belonging to the same name group, located on the «View» tab.

DemographicDataGenerationSQLServer_05

 

This option is also available on the «Developer» tab, within «Code» group.

DemographicDataGenerationSQLServer_04

 

In the event that the «Developer» tab is not visible, we will click on the «File» tab, and within this on «Options». In «Excel Options» window will click «Customize Ribbon» and displaying the list on the right select the item «Main Tabs». In the panel below the list will check «Developer», which will make this tab visible in the ribbon.

DemographicDataGenerationSQLServer_06

 

Regardless the way chosen for its creation, the «Macro» window will appear, we’ll give the name «TraspasarDatosPoblacion» to our macro and click «Create», which will lead to the VBA (Visual Basic for Applications) editor, where we write the code for the macro.

We will divide the work to carry out the macro into three parts: data creation for health area code column, age range column, and finally, the columns of numbers of population by sex.

In the next block of code we can see the instructions responsible of creating a new worksheet, which assign a name and column headings. Later we’ll select the area codes from a source worksheet, which will paste in the new sheet many times as there age ranges.

 

Option Explicit
Private Const RANGOS_COLUMNAS As Integer = 20

Sub TraspasarDatosPoblacion()

' create new sheet and assign name
Dim sHojaDestino As String
sHojaDestino = "DatosBasePoblacion"

Dim oWorksheet As Worksheet
Set oWorksheet = Sheets.Add(After:=Sheets(Sheets.Count))
oWorksheet.Name = sHojaDestino

' assign column titles
Range("A1").Select
ActiveCell.FormulaR1C1 = "Zona"
Range("B1").Select
ActiveCell.FormulaR1C1 = "Rango_Edad"
Range("C1").Select
ActiveCell.FormulaR1C1 = "Poblacion_H"
Range("D1").Select
ActiveCell.FormulaR1C1 = "Poblacion_M"

' back to source data sheet
Sheets("Hombres").Select

' set start and end data rows
Dim sFilaInicial As String
Dim sFilaFinal As String
sFilaInicial = "14"
sFilaFinal = "299"

' calculate number of existing zones
Dim nCantidadZonas As Integer
nCantidadZonas = (CInt(sFilaFinal) - CInt(sFilaInicial))

' zone codes
' ----------
Dim sColumna As String
sColumna = "A"

' select codes, copy to clipboard and paste in a new sheet
Range(sColumna & sFilaInicial & ":" & sColumna & sFilaFinal).Select
Selection.Copy
Sheets(sHojaDestino).Select

Dim nContador As Integer
Dim nFilaVacia As Long

' do paste for each age group
For nContador = 0 To RANGOS_COLUMNAS
  ' obtain next empty row number from column
  nFilaVacia = FilaVaciaEnColumna(1)

  ' place in cell
  Range(sColumna & CStr(nFilaVacia)).Select

  ' paste data
  ActiveSheet.Paste
Next

' assign white color background to column
Columns("A:A").Select
Selection.Interior.Pattern = xlNone
Selection.Interior.TintAndShade = 0
Selection.Interior.PatternTintAndShade = 0

 

 

Before each assignment of values ​​in cells, we need to know the empty row from which paste the data. For this, we’ll use the auxiliary function «FilaVaciaEnColumna» to iterate the collection of rows in the sheet to find the row containing an empty cell in that column on which we are working.

 

Function FilaVaciaEnColumna(ByVal nPosicionCelda As Integer) As Long
Dim vFila As Variant
Dim nFila As Long

For Each vFila In ActiveSheet.Rows
  nFila = vFila.Row

  If vFila.Cells(nPosicionCelda).Text = "" Then
    Exit For
  End If
Next

FilaVaciaEnColumna = nFila

End Function

 

The next phase in creating the macro will be to assign values ​​to the column age ranges. So we’ll create an array containing these values, and to cross it, in each iteration, the current range value will be copied the same number of times than the amount of existing health zones. Note that before assigning values ​​to this column we will apply the text format (property Selection.NumberFormat), since otherwise, the default format may cause some values ​​of the age ranges are interpreted as month-year.

 

' age ranges
' --------------
Dim aRangosEdad(RANGOS_COLUMNAS) As String
aRangosEdad(0) = "0-4"
aRangosEdad(1) = "5-9"
aRangosEdad(2) = "10-14"
aRangosEdad(3) = "15-19"
aRangosEdad(4) = "20-24"
aRangosEdad(5) = "25-29"
aRangosEdad(6) = "30-34"
aRangosEdad(7) = "35-39"
aRangosEdad(8) = "40-44"
aRangosEdad(9) = "45-49"
aRangosEdad(10) = "50-54"
aRangosEdad(11) = "55-59"
aRangosEdad(12) = "60-64"
aRangosEdad(13) = "65-69"
aRangosEdad(14) = "70-74"
aRangosEdad(15) = "75-79"
aRangosEdad(16) = "80-84"
aRangosEdad(17) = "85-89"
aRangosEdad(18) = "90-94"
aRangosEdad(19) = "95-99"
aRangosEdad(20) = "100+"

' assign format of age column to text
sColumna = "B"
Columns(sColumna & ":" & sColumna).Select
Selection.NumberFormat = "@"

' repeat each value of aRangosEdad array
' as times as number of existing zones
Dim vRangoEdad As Variant

For Each vRangoEdad In aRangosEdad
  ' obtain next empty row number from column
  nFilaVacia = FilaVaciaEnColumna(2)

  ' assign age range value in next two cells down
  Range(sColumna & CStr(nFilaVacia)).Select
  ActiveCell.FormulaR1C1 = vRangoEdad

  Range(sColumna & CStr(nFilaVacia + 1)).Select
  ActiveCell.FormulaR1C1 = vRangoEdad

  ' set cell range and fill with value of above cells
  Range(sColumna & CStr(nFilaVacia) & ":" & sColumna & CStr(nFilaVacia + 1)).Select
  Selection.AutoFill Destination:=Range(sColumna & CStr(nFilaVacia) & ":" & sColumna & CStr(nFilaVacia + nCantidadZonas)), Type:=xlFillDefault
Next

 

And we end the macro with the assignment of the columns dedicated to the population figures by sex, where this time use two combined arrays containing the column headings that correspond to the five-year total population and population type. When touring both in a nested way, for a certain type of population, we’ll extract their numbers of individuals, assigning them to the destination worksheet.

 

' population figures by sex
' ----------------------------
Dim aColumnasPoblacion(RANGOS_COLUMNAS) As String
aColumnasPoblacion(0) = "D"
aColumnasPoblacion(1) = "J"
aColumnasPoblacion(2) = "P"
aColumnasPoblacion(3) = "V"
aColumnasPoblacion(4) = "AB"
aColumnasPoblacion(5) = "AH"
aColumnasPoblacion(6) = "AN"
aColumnasPoblacion(7) = "AT"
aColumnasPoblacion(8) = "AZ"
aColumnasPoblacion(9) = "BF"
aColumnasPoblacion(10) = "BL"
aColumnasPoblacion(11) = "BR"
aColumnasPoblacion(12) = "BX"
aColumnasPoblacion(13) = "CD"
aColumnasPoblacion(14) = "CJ"
aColumnasPoblacion(15) = "CP"
aColumnasPoblacion(16) = "CV"
aColumnasPoblacion(17) = "DB"
aColumnasPoblacion(18) = "DH"
aColumnasPoblacion(19) = "DN"
aColumnasPoblacion(20) = "DT"

Dim aTiposPoblacionColumnas(1) As String
aTiposPoblacionColumnas(0) = "Hombres,C,3"
aTiposPoblacionColumnas(1) = "Mujeres,D,4"

Dim vTipoPoblacionColumna As Variant
Dim sTipoPoblacion As String
Dim nPosicionColumna As Integer
Dim vColumnaPoblacion As Variant

' for each population type
For Each vTipoPoblacionColumna In aTiposPoblacionColumnas
  ' obtain information of population type:
  ' sex, target column, numeric position in target column
  sTipoPoblacion = Split(vTipoPoblacionColumna, ",")(0)
  sColumna = Split(vTipoPoblacionColumna, ",")(1)
  nPosicionColumna = Split(vTipoPoblacionColumna, ",")(2)

  ' iterate the columns with population figures
  ' and paste in new sheet
  For Each vColumnaPoblacion In aColumnasPoblacion
    ' place in sheet with source population data
    Sheets(sTipoPoblacion).Select

    ' select and copy cells
    Range(vColumnaPoblacion & sFilaInicial & ":" & vColumnaPoblacion & sFilaFinal).Select
    Selection.Copy

    ' place in target sheet
    Sheets(sHojaDestino).Select

    ' obtain next empty row, place in cell and paste data
    nFilaVacia = FilaVaciaEnColumna(nPosicionColumna)
    Range(sColumna & CStr(nFilaVacia)).Select
    ActiveSheet.Paste
  Next
Next

' change column style to display cell borders
Columns("C:D").Select
Selection.Style = "Normal"

Range("A1").Select

End Sub

 

Then create a new macro with the name «TraspasarDatosZonificacion» whereby we’ll move to a new worksheet codes and names of the healthcare areas. These data will be used to load a table in the database we’ll create later.

 

Sub TraspasarDatosZonificacion()
' create new sheet and assign name
Dim sHojaDestino As String
sHojaDestino = "DatosZonificacion"

Dim oWorksheet As Worksheet
Set oWorksheet = Sheets.Add(After:=Sheets(Sheets.Count))
oWorksheet.Name = sHojaDestino

' set column titles
Range("A1").Select
ActiveCell.FormulaR1C1 = "Zona_ID"
Range("B1").Select
ActiveCell.FormulaR1C1 = "Zona_DS"

' back to data sheet
Sheets("Hombres").Select

' copy zone codes and names to clipboard
Range("A14:B299").Select
Selection.Copy

' back to target sheet and paste data
Sheets(sHojaDestino).Select
Range("A2").Select
ActiveSheet.Paste

' remove cell background color
Columns("A:B").Select
Selection.Interior.Pattern = xlNone
Selection.Interior.TintAndShade = 0
Selection.Interior.PatternTintAndShade = 0

Range("A1").Select

End Sub

 

When we copy the codes and names of the health zones to new sheets within the Excel file, the color configuration of the cells corresponding to these values is copied too. To achieve white background color in these cells, we use the following lines of code.

 

Selection.Interior.TintAndShade = 0
Selection.Interior.PatternTintAndShade = 0

 

But if our version of Excel is prior to 2007 these lines will fail at runtime, so we must comment them to avoid being executed.

Finally, from the macro management window, we’ll select each one and click on «Run».

DemographicDataGenerationSQLServer_07

 

As a result we will obtain two new sheets with the data generated by the macros.

DemographicDataGenerationSQLServer_08

 

Database creation

Our next step is to create the database PiramidePoblacion, to accommodate the information we have just prepared in Excel. The following code block shows the Transact-SQL script that will be executed in our SQL Server instance (in this article we have used SQL Server 2008 R2).

 

CREATE DATABASE PiramidePoblacion
GO
USE PiramidePoblacion
GO

CREATE TABLE DatosBasePoblacion
(
Fila_ID int IDENTITY(1,1) NOT NULL,
Zona varchar(3) NULL,
Rango_Edad varchar(10) NULL,
Poblacion_H int NULL,
Poblacion_M int NULL
)
GO

CREATE TABLE Poblacion
(
Fila_ID int IDENTITY(1,1) NOT NULL,
Zona_ID varchar(3) NULL,
Edad_ID int NULL,
Sexo_ID char(1) NULL,
CONSTRAINT PK_Poblacion PRIMARY KEY CLUSTERED (Fila_ID ASC)
)
GO

CREATE TABLE Zona
(
Zona_ID varchar(3) NOT NULL,
Zona_DS varchar(30) NULL,
CONSTRAINT PK_Zona PRIMARY KEY CLUSTERED (Zona_ID ASC)
)
GO

CREATE TABLE Edad
(
Edad_ID int NOT NULL,
Edad_Grupo varchar(20) NULL,
CONSTRAINT PK_Edad PRIMARY KEY CLUSTERED (Edad_ID ASC)
)
GO

CREATE TABLE Sexo
(
Sexo_ID char(1) NOT NULL,
Sexo_DS varchar(10) NULL,
CONSTRAINT PK_Sexo PRIMARY KEY CLUSTERED (Sexo_ID ASC)
)
GO

 

The table Poblacion will contain the main data of our population, while tables Zona, Edad and Sexo contain catalog information (code / descriptor).

Surely, in the script we will have noticed the lack of foreign keys between Poblacion table and the rest. This is an intentional forgetting that we’ll fix in the articles Population pyramids with PowerPivot. Preparing the data and Population pyramids with PowerPivot. Chart development, where we demonstrate how using PowerPivot we can also create relationships between tables in the model.

After running the script, the first action that will take place in the new database will be transfer data from sheet DatosBasePoblacion of file pc10t10z2_salud10.xls into the table of the same name, using the following statement.

 

INSERT INTO DatosBasePoblacion
SELECT * FROM OPENROWSET('Microsoft.ACE.OLEDB.12.0',
'Excel 8.0;Database=C:DatosOrigenpc10t10z2_salud10.xls',
'SELECT * FROM [DatosBasePoblacion$]')

 

Generating population records in the database

The purpose of the process that we are doing so far is to generate, for table Poblacion, a number of records that represent the population with which we work, and that will equal the sum of columns Poblacion_H and Poblacion_M from table DatosBasePoblacion.

DemographicDataGenerationSQLServer_09

 

For this volume of data, using a loop to create one by one the records is totally unfeasible due to time and performance reasons, so we must find an alternative technique that allows us to work using sets of results, to create the necessary number of records, using a small number of operations.

The technique that we will opt is shown in the book Microsoft SQL Server 2008: T-SQL Querying , by Itzik Ben-Gan and other great names in SQL Server: Lubor Kollar, Dejan Sarka and Steve Kass; and is use several «Common Table Expressions» or CTE nested, through which generate a result set of appropiate size, that we will combine with an INSERT INTO statement, to add to Poblacion table a determined number of records. All this will include in the following stored procedure.

 

CREATE PROCEDURE GenerarRegistrosPoblacion
@nZona_ID AS int,
@sRango_Edad AS varchar(10),
@sSexo_ID AS char(1),
@nPoblacion AS int
AS
BEGIN
DECLARE @nEdad_ID AS int =
CASE
  WHEN CHARINDEX('-',@sRango_Edad) = 0 THEN 100
  WHEN CHARINDEX('-',@sRango_Edad) = 2 THEN CAST(LEFT(@sRango_Edad,1) AS int)
  WHEN CHARINDEX('-',@sRango_Edad) = 3 THEN CAST(LEFT(@sRango_Edad,2) AS int)
END;

WITH
Numeros0 AS (SELECT 1 AS Numero UNION ALL SELECT 1),
Numeros1 AS (SELECT 1 AS Numero FROM Numeros0 AS TblPrim CROSS JOIN Numeros0 AS TblSeg),
Numeros2 AS (SELECT 1 AS Numero FROM Numeros1 AS TblPrim CROSS JOIN Numeros1 AS TblSeg),
Numeros3 AS (SELECT 1 AS Numero FROM Numeros2 AS TblPrim CROSS JOIN Numeros2 AS TblSeg),
Numeros4 AS (SELECT 1 AS Numero FROM Numeros3 AS TblPrim CROSS JOIN Numeros3 AS TblSeg),
Numeros5 AS (SELECT 1 AS Numero FROM Numeros4 AS TblPrim CROSS JOIN Numeros4 AS TblSeg),
NumerosTotal AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS NumeroFila FROM Numeros5)
INSERT INTO Poblacion
SELECT @nZona_ID, @nEdad_ID, @sSexo_ID
FROM NumerosTotal
WHERE NumeroFila <= @nPoblacion;

END

GO

 

Then write a script which will cross the DatosBasePoblacion table, and each of their records will take the value of Poblacion_H and Poblacion_M fields, inserting in Poblacion table, by executing the stored procedure GenerarRegistrosPoblacion, a number of rows equal to value of the aforementioned fields.

 

DECLARE @nContador AS int = 1;
DECLARE @nDatosBaseFilaFinal AS int = (SELECT MAX(Fila_ID) FROM DatosBasePoblacion);
DECLARE @nZona AS int;
DECLARE @sRango_Edad AS varchar(10);
DECLARE @nPoblacion_H AS int;
DECLARE @nPoblacion_M AS int;
WHILE (@nContador <= @nDatosBaseFilaFinal)
BEGIN
SELECT @nZona = Zona,
@sRango_Edad = Rango_Edad,
@nPoblacion_H = Poblacion_H,
@nPoblacion_M = Poblacion_M
FROM DatosBasePiramidePoblacion
WHERE Fila_ID = @nContador;

EXECUTE GenerarRegistrosPoblacion @nZona, @sRango_Edad, 'H',@nPoblacion_H
EXECUTE GenerarRegistrosPoblacion @nZona, @sRango_Edad, 'M',@nPoblacion_M

SET @nContador += 1
END

 

The time spent in the execution of this script was 1 minute and 37 seconds in a virtual machine equipped with a Core 2 Duo CPU and 1.5 GB of RAM, which represents excellent performance given the large number of records added to Poblacion table.

We ended the preparing database operations with the statements used to add data in tables that serve as catalogs for the table Poblacion. As we see below, for the Zona table we also get the records from pc10t10z2_salud10.xls file.

 

--////
INSERT INTO Zona
SELECT * FROM OPENROWSET('Microsoft.ACE.OLEDB.12.0',
'Excel 8.0;Database=C:DatosOrigenpc10t10z2_salud10.xls', 
'SELECT * FROM [DatosZonificacion$]')
--////
DECLARE @nContador AS int = 0;
WHILE (@nContador <= 120)
BEGIN
INSERT INTO Edad
SELECT @nContador,
CASE
  WHEN @nContador BETWEEN 0 AND 4 THEN '000-004'
  WHEN @nContador BETWEEN 5 AND 9 THEN '005-009'
  WHEN @nContador BETWEEN 10 AND 14 THEN '010-014'
  WHEN @nContador BETWEEN 15 AND 19 THEN '015-019'
  WHEN @nContador BETWEEN 20 AND 24 THEN '020-024'
  WHEN @nContador BETWEEN 25 AND 29 THEN '025-029'
  WHEN @nContador BETWEEN 30 AND 34 THEN '030-034'
  WHEN @nContador BETWEEN 35 AND 39 THEN '035-039'
  WHEN @nContador BETWEEN 40 AND 44 THEN '040-044'
  WHEN @nContador BETWEEN 45 AND 49 THEN '045-049'
  WHEN @nContador BETWEEN 50 AND 54 THEN '050-054'
  WHEN @nContador BETWEEN 55 AND 59 THEN '055-059'
  WHEN @nContador BETWEEN 60 AND 64 THEN '060-064'
  WHEN @nContador BETWEEN 65 AND 69 THEN '065-069'
  WHEN @nContador BETWEEN 70 AND 74 THEN '070-074'
  WHEN @nContador BETWEEN 75 AND 79 THEN '075-079'
  WHEN @nContador BETWEEN 80 AND 84 THEN '080-084'
  WHEN @nContador BETWEEN 85 AND 89 THEN '085-089'
  WHEN @nContador BETWEEN 90 AND 94 THEN '090-094'
  WHEN @nContador BETWEEN 95 AND 99 THEN '095-099'
  WHEN @nContador >= 100 THEN '100+'
END

SET @nContador += 1;
END

--////
INSERT INTO Sexo VALUES ('H','Hombre')
INSERT INTO Sexo VALUES ('M','Mujer')

 

Data loading completed

At this point we conclude the process of creation and loading of demographic information in our database. In the articles pointed out at the beginning, we will use this database as a starting point for the construction of population pyramids with PowerPivot.