April 2012 - Artículos
This article addresses the challenge of develop a process to create a database with demographic information, which serves as a starting point for creating a data model in PowerPivot, used in the construction of the population pyramid shown in Population pyramids with PowerPivot. Preparing the data and Population pyramids with PowerPivot. Chart development.
Unlike the article "Generate SQL Server test data from Excel" (part 1 and part 2) also published in this blog, where we used Excel to generate dummy data population, this time we will work with real data, which represent a larger volume of information to manage with respect the above-mentioned article.
The current aim is to create a database containing a population table, in which each record represents an individual, with its corresponding age, sex and healthcare area.
Before proceeding, I want to express my gratitude to Ricard Gènova Maleras, a demographer specializing in population health analysis, part of the Health Service Reports and Studies (Directorate of Health Promotion and Prevention, DG Primary Care, Health Ministry CM) for their valuable advice in those key demographics for the development of this and the all related population pyramids published in this blog; and the other members of that Service: Jenaro Astray Mochales, Felicitas Dominguez Berjón, María Dolores Esteban Vasallo and Beatriz Elvira Rodriguez, for their support and assistance in all health concepts about population necessary for the proper focus of this article.
Last but not least, thanks to Enrique Barceló, fellow sufferer in this work of BI, for the interesting times we chatting about matters relating to the development of information systems in general and data cubes in particular, and for sharing his amazing knowledge of OLAP with the rest of the team.
Data collection and preparation
There are several Web sites belonging to different agencies, where we will find demographic information needed to perform our analysis, organized by criteria such as sex, age, geographic region, etc. Of all these agencies to highlight the relevant United Nations , National Statistical Institute (Spain), Institute of Statistics of the CAM (Madrid) and CELADE (demographics of Latin America and Caribbean). In all cases, in addition to direct consultation in the website, we have the ability to download information in various formats such as Excel, CSV, etc.
For the development of the examples in this article we'll use the registered population data from the Community of Madrid for the year 2010, classified by age, sex and health zoning, effective as of that year, found in the following link (T10Z2_SALUD10) from the website of the Statistical Institute of CAM.
Data extraction strategy
The Excel file pc10t10z2_salud10.xls obtained from the above link, organizes the data of population around three worksheets containing respectively, the values of total population, men and women.
Each worksheet has in its first two columns the code and name of the health care area, while the remaining columns contain population figures classified in five-year groups, where each group has a column with the total five-year and several columns representing the detail by simple age of the ages that make up the group.

The data we need extract are, first, the codes and names of health zoning, for which we can use any of the worksheets in the Excel file. Moreover, we have to get the total five-year age group in both sheets: male and female population.
We will bring all these values to a new worksheet in which we copy each combination of health area codes, age group and population type (male / female) in order of steps similar to the following figure.

Here we see an approximation of what would be the resulting spreadsheet.
Using select, copy and paste operations by hand to transfer the needed data to a new sheet can be a cumbersome and heavy task. That is why we propose to use a macro that automates all this work, allowing us to repeat the process as many times as we want, and apply it to other populations whose data is structured the same way.
To create the macro, on the Excel Ribbon will click on the "Macros" option, belonging to the same name group, located on the "View" tab.
This option is also available on the "Developer" tab, within "Code" group.
In the event that the "Developer" tab is not visible, we will click on the "File" tab, and within this on "Options". In "Excel Options" window will click "Customize Ribbon" and displaying the list on the right select the item "Main Tabs". In the panel below the list will check "Developer", which will make this tab visible in the ribbon.
Regardless the way chosen for its creation, the "Macro" window will appear, we'll give the name "TraspasarDatosPoblacion" to our macro and click "Create", which will lead to the VBA (Visual Basic for Applications) editor, where we write the code for the macro.
We will divide the work to carry out the macro into three parts: data creation for health area code column, age range column, and finally, the columns of numbers of population by sex.
In the next block of code we can see the instructions responsible of creating a new worksheet, which assign a name and column headings. Later we'll select the area codes from a source worksheet, which will paste in the new sheet many times as there age ranges.
Option Explicit
Private Const RANGOS_COLUMNAS As Integer = 20
Sub TraspasarDatosPoblacion()
' create new sheet and assign name
Dim sHojaDestino As String
sHojaDestino = "DatosBasePoblacion"
Dim oWorksheet As Worksheet
Set oWorksheet = Sheets.Add(After:=Sheets(Sheets.Count))
oWorksheet.Name = sHojaDestino
' assign column titles
Range("A1").Select
ActiveCell.FormulaR1C1 = "Zona"
Range("B1").Select
ActiveCell.FormulaR1C1 = "Rango_Edad"
Range("C1").Select
ActiveCell.FormulaR1C1 = "Poblacion_H"
Range("D1").Select
ActiveCell.FormulaR1C1 = "Poblacion_M"
' back to source data sheet
Sheets("Hombres").Select
' set start and end data rows
Dim sFilaInicial As String
Dim sFilaFinal As String
sFilaInicial = "14"
sFilaFinal = "299"
' calculate number of existing zones
Dim nCantidadZonas As Integer
nCantidadZonas = (CInt(sFilaFinal) - CInt(sFilaInicial))
' zone codes
' ---------------
Dim sColumna As String
sColumna = "A"
' select codes, copy to clipboard and paste in a new sheet
Range(sColumna & sFilaInicial & ":" & sColumna & sFilaFinal).Select
Selection.Copy
Sheets(sHojaDestino).Select
Dim nContador As Integer
Dim nFilaVacia As Long
' do paste for each age group
For nContador = 0 To RANGOS_COLUMNAS
' obtain next empty row number from column
nFilaVacia = FilaVaciaEnColumna(1)
' place in cell
Range(sColumna & CStr(nFilaVacia)).Select
' paste data
ActiveSheet.Paste
Next
' assign white color background to column
Columns("A:A").Select
Selection.Interior.Pattern = xlNone
Selection.Interior.TintAndShade = 0
Selection.Interior.PatternTintAndShade = 0
Before each assignment of values in cells, we need to know the empty row from which paste the data. For this, we'll use the auxiliary function "FilaVaciaEnColumna" to iterate the collection of rows in the sheet to find the row containing an empty cell in that column on which we are working.
Function FilaVaciaEnColumna(ByVal nPosicionCelda As Integer) As Long
Dim vFila As Variant
Dim nFila As Long
For Each vFila In ActiveSheet.Rows
nFila = vFila.Row
If vFila.Cells(nPosicionCelda).Text = "" Then
Exit For
End If
Next
FilaVaciaEnColumna = nFila
End Function
The next phase in creating the macro will be to assign values to the column age ranges. So we'll create an array containing these values, and to cross it, in each iteration, the current range value will be copied the same number of times than the amount of existing health zones. Note that before assigning values to this column we will apply the text format (property Selection.NumberFormat), since otherwise, the default format may cause some values of the age ranges are interpreted as month-year.
' age ranges
' --------------
Dim aRangosEdad(RANGOS_COLUMNAS) As String
aRangosEdad(0) = "0-4"
aRangosEdad(1) = "5-9"
aRangosEdad(2) = "10-14"
aRangosEdad(3) = "15-19"
aRangosEdad(4) = "20-24"
aRangosEdad(5) = "25-29"
aRangosEdad(6) = "30-34"
aRangosEdad(7) = "35-39"
aRangosEdad(8) = "40-44"
aRangosEdad(9) = "45-49"
aRangosEdad(10) = "50-54"
aRangosEdad(11) = "55-59"
aRangosEdad(12) = "60-64"
aRangosEdad(13) = "65-69"
aRangosEdad(14) = "70-74"
aRangosEdad(15) = "75-79"
aRangosEdad(16) = "80-84"
aRangosEdad(17) = "85-89"
aRangosEdad(18) = "90-94"
aRangosEdad(19) = "95-99"
aRangosEdad(20) = "100+"
' assign format of age column to text
sColumna = "B"
Columns(sColumna & ":" & sColumna).Select
Selection.NumberFormat = "@"
' repeat each value of aRangosEdad array
' as times as number of existing zones
Dim vRangoEdad As Variant
For Each vRangoEdad In aRangosEdad
' obtain next empty row number from column
nFilaVacia = FilaVaciaEnColumna(2)
' assign age range value in next two cells down
Range(sColumna & CStr(nFilaVacia)).Select
ActiveCell.FormulaR1C1 = vRangoEdad
Range(sColumna & CStr(nFilaVacia + 1)).Select
ActiveCell.FormulaR1C1 = vRangoEdad
' set cell range and fill with value of above cells
Range(sColumna & CStr(nFilaVacia) & ":" & sColumna & CStr(nFilaVacia + 1)).Select
Selection.AutoFill Destination:=Range(sColumna & CStr(nFilaVacia) & ":" & sColumna & CStr(nFilaVacia + nCantidadZonas)), Type:=xlFillDefault
Next
And we end the macro with the assignment of the columns dedicated to the population figures by sex, where this time use two combined arrays containing the column headings that correspond to the five-year total population and population type. When touring both in a nested way, for a certain type of population, we'll extract their numbers of individuals, assigning them to the destination worksheet.
' population figures by sex
' ----------------------------
Dim aColumnasPoblacion(RANGOS_COLUMNAS) As String
aColumnasPoblacion(0) = "D"
aColumnasPoblacion(1) = "J"
aColumnasPoblacion(2) = "P"
aColumnasPoblacion(3) = "V"
aColumnasPoblacion(4) = "AB"
aColumnasPoblacion(5) = "AH"
aColumnasPoblacion(6) = "AN"
aColumnasPoblacion(7) = "AT"
aColumnasPoblacion(8) = "AZ"
aColumnasPoblacion(9) = "BF"
aColumnasPoblacion(10) = "BL"
aColumnasPoblacion(11) = "BR"
aColumnasPoblacion(12) = "BX"
aColumnasPoblacion(13) = "CD"
aColumnasPoblacion(14) = "CJ"
aColumnasPoblacion(15) = "CP"
aColumnasPoblacion(16) = "CV"
aColumnasPoblacion(17) = "DB"
aColumnasPoblacion(18) = "DH"
aColumnasPoblacion(19) = "DN"
aColumnasPoblacion(20) = "DT"
Dim aTiposPoblacionColumnas(1) As String
aTiposPoblacionColumnas(0) = "Hombres,C,3"
aTiposPoblacionColumnas(1) = "Mujeres,D,4"
Dim vTipoPoblacionColumna As Variant
Dim sTipoPoblacion As String
Dim nPosicionColumna As Integer
Dim vColumnaPoblacion As Variant
' for each population type
For Each vTipoPoblacionColumna In aTiposPoblacionColumnas
' obtain information of population type:
' sex, target column, numeric position in target column
sTipoPoblacion = Split(vTipoPoblacionColumna, ",")(0)
sColumna = Split(vTipoPoblacionColumna, ",")(1)
nPosicionColumna = Split(vTipoPoblacionColumna, ",")(2)
' iterate the columns with population figures
' and paste in new sheet
For Each vColumnaPoblacion In aColumnasPoblacion
' place in sheet with source population data
Sheets(sTipoPoblacion).Select
' select and copy cells
Range(vColumnaPoblacion & sFilaInicial & ":" & vColumnaPoblacion & sFilaFinal).Select
Selection.Copy
' place in target sheet
Sheets(sHojaDestino).Select
' obtain next empty row, place in cell and paste data
nFilaVacia = FilaVaciaEnColumna(nPosicionColumna)
Range(sColumna & CStr(nFilaVacia)).Select
ActiveSheet.Paste
Next
Next
' change column style to display cell borders
Columns("C:D").Select
Selection.Style = "Normal"
Range("A1").Select
End Sub
Then create a new macro with the name "TraspasarDatosZonificacion" whereby we'll move to a new worksheet codes and names of the healthcare areas. These data will be used to load a table in the database we'll create later.
Sub TraspasarDatosZonificacion()
' create new sheet and assign name
Dim sHojaDestino As String
sHojaDestino = "DatosZonificacion"
Dim oWorksheet As Worksheet
Set oWorksheet = Sheets.Add(After:=Sheets(Sheets.Count))
oWorksheet.Name = sHojaDestino
' set column titles
Range("A1").Select
ActiveCell.FormulaR1C1 = "Zona_ID"
Range("B1").Select
ActiveCell.FormulaR1C1 = "Zona_DS"
' back to data sheet
Sheets("Hombres").Select
' copy zone codes and names to clipboard
Range("A14:B299").Select
Selection.Copy
' back to target sheet and paste data
Sheets(sHojaDestino).Select
Range("A2").Select
ActiveSheet.Paste
' remove cell background color
Columns("A:B").Select
Selection.Interior.Pattern = xlNone
Selection.Interior.TintAndShade = 0
Selection.Interior.PatternTintAndShade = 0
Range("A1").Select
End Sub
When we copy the codes and names of the health zones to new sheets within the Excel file, the color configuration of the cells corresponding to these values is copied too. To achieve white background color in these cells, we use the following lines of code.
Selection.Interior.TintAndShade = 0
Selection.Interior.PatternTintAndShade = 0
But if our version of Excel is prior to 2007 these lines will fail at runtime, so we must comment them to avoid being executed.
Finally, from the macro management window, we'll select each one and click on "Run".
As a result we will obtain two new sheets with the data generated by the macros.

Database creation
Our next step is to create the database PiramidePoblacion, to accommodate the information we have just prepared in Excel. The following code block shows the Transact-SQL script that will be executed in our SQL Server instance (in this article we have used SQL Server 2008 R2).
CREATE DATABASE PiramidePoblacion
GO
USE PiramidePoblacion
GO
CREATE TABLE DatosBasePoblacion
(
Fila_ID int IDENTITY(1,1) NOT NULL,
Zona varchar(3) NULL,
Rango_Edad varchar(10) NULL,
Poblacion_H int NULL,
Poblacion_M int NULL
)
GO
CREATE TABLE Poblacion
(
Fila_ID int IDENTITY(1,1) NOT NULL,
Zona_ID varchar(3) NULL,
Edad_ID int NULL,
Sexo_ID char(1) NULL,
CONSTRAINT PK_Poblacion PRIMARY KEY CLUSTERED (Fila_ID ASC)
)
GO
CREATE TABLE Zona
(
Zona_ID varchar(3) NOT NULL,
Zona_DS varchar(30) NULL,
CONSTRAINT PK_Zona PRIMARY KEY CLUSTERED (Zona_ID ASC)
)
GO
CREATE TABLE Edad
(
Edad_ID int NOT NULL,
Edad_Grupo varchar(20) NULL,
CONSTRAINT PK_Edad PRIMARY KEY CLUSTERED (Edad_ID ASC)
)
GO
CREATE TABLE Sexo
(
Sexo_ID char(1) NOT NULL,
Sexo_DS varchar(10) NULL,
CONSTRAINT PK_Sexo PRIMARY KEY CLUSTERED (Sexo_ID ASC)
)
GO
The table Poblacion will contain the main data of our population, while tables Zona, Edad and Sexo contain catalog information (code / descriptor).
Surely, in the script we will have noticed the lack of foreign keys between Poblacion table and the rest. This is an intentional forgetting that we'll fix in the articles Population pyramids with PowerPivot. Preparing the data and Population pyramids with PowerPivot. Chart development, where we demonstrate how using PowerPivot we can also create relationships between tables in the model.
After running the script, the first action that will take place in the new database will be transfer data from sheet DatosBasePoblacion of file pc10t10z2_salud10.xls into the table of the same name, using the following statement.
INSERT INTO DatosBasePoblacion
SELECT * FROM OPENROWSET('Microsoft.ACE.OLEDB.12.0',
'Excel 8.0;Database=C:\DatosOrigen\pc10t10z2_salud10.xls',
'SELECT * FROM [DatosBasePoblacion$]')
Generating population records in the database
The purpose of the process that we are doing so far is to generate, for table Poblacion, a number of records that represent the population with which we work, and that will equal the sum of columns Poblacion_H and Poblacion_M from table DatosBasePoblacion.
For this volume of data, using a loop to create one by one the records is totally unfeasible due to time and performance reasons, so we must find an alternative technique that allows us to work using sets of results, to create the necessary number of records, using a small number of operations.
The technique that we will opt is shown in the book Microsoft SQL Server 2008: T-SQL Querying , by Itzik Ben-Gan and other great names in SQL Server: Lubor Kollar, Dejan Sarka and Steve Kass; and is use several "Common Table Expressions" or CTE nested, through which generate a result set of appropiate size, that we will combine with an INSERT INTO statement, to add to Poblacion table a determined number of records. All this will include in the following stored procedure.
CREATE PROCEDURE GenerarRegistrosPoblacion
@nZona_ID AS int,
@sRango_Edad AS varchar(10),
@sSexo_ID AS char(1),
@nPoblacion AS int
AS
BEGIN
DECLARE @nEdad_ID AS int =
CASE
WHEN CHARINDEX('-',@sRango_Edad) = 0 THEN 100
WHEN CHARINDEX('-',@sRango_Edad) = 2 THEN CAST(LEFT(@sRango_Edad,1) AS int)
WHEN CHARINDEX('-',@sRango_Edad) = 3 THEN CAST(LEFT(@sRango_Edad,2) AS int)
END;
WITH
Numeros0 AS (SELECT 1 AS Numero UNION ALL SELECT 1),
Numeros1 AS (SELECT 1 AS Numero FROM Numeros0 AS TblPrim CROSS JOIN Numeros0 AS TblSeg),
Numeros2 AS (SELECT 1 AS Numero FROM Numeros1 AS TblPrim CROSS JOIN Numeros1 AS TblSeg),
Numeros3 AS (SELECT 1 AS Numero FROM Numeros2 AS TblPrim CROSS JOIN Numeros2 AS TblSeg),
Numeros4 AS (SELECT 1 AS Numero FROM Numeros3 AS TblPrim CROSS JOIN Numeros3 AS TblSeg),
Numeros5 AS (SELECT 1 AS Numero FROM Numeros4 AS TblPrim CROSS JOIN Numeros4 AS TblSeg),
NumerosTotal AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS NumeroFila FROM Numeros5)
INSERT INTO Poblacion
SELECT @nZona_ID, @nEdad_ID, @sSexo_ID
FROM NumerosTotal
WHERE NumeroFila <= @nPoblacion;
END
GO
Then write a script which will cross the DatosBasePoblacion table, and each of their records will take the value of Poblacion_H and Poblacion_M fields, inserting in Poblacion table, by executing the stored procedure GenerarRegistrosPoblacion, a number of rows equal to value of the aforementioned fields.
DECLARE @nContador AS int = 1;
DECLARE @nDatosBaseFilaFinal AS int = (SELECT MAX(Fila_ID) FROM DatosBasePoblacion);
DECLARE @nZona AS int;
DECLARE @sRango_Edad AS varchar(10);
DECLARE @nPoblacion_H AS int;
DECLARE @nPoblacion_M AS int;
WHILE (@nContador <= @nDatosBaseFilaFinal)
BEGIN
SELECT @nZona = Zona,
@sRango_Edad = Rango_Edad,
@nPoblacion_H = Poblacion_H,
@nPoblacion_M = Poblacion_M
FROM DatosBasePiramidePoblacion
WHERE Fila_ID = @nContador;
EXECUTE GenerarRegistrosPoblacion @nZona, @sRango_Edad, 'H',@nPoblacion_H
EXECUTE GenerarRegistrosPoblacion @nZona, @sRango_Edad, 'M',@nPoblacion_M
SET @nContador += 1
END
The time spent in the execution of this script was 1 minute and 37 seconds in a virtual machine equipped with a Core 2 Duo CPU and 1.5 GB of RAM, which represents excellent performance given the large number of records added to Poblacion table.
We ended the preparing database operations with the statements used to add data in tables that serve as catalogs for the table Poblacion. As we see below, for the Zona table we also get the records from pc10t10z2_salud10.xls file.
--////
INSERT INTO Zona
SELECT * FROM OPENROWSET('Microsoft.ACE.OLEDB.12.0',
'Excel 8.0;Database=C:\DatosOrigen\pc10t10z2_salud10.xls',
'SELECT * FROM [DatosZonificacion$]')
--////
DECLARE @nContador AS int = 0;
WHILE (@nContador <= 120)
BEGIN
INSERT INTO Edad
SELECT @nContador,
CASE
WHEN @nContador BETWEEN 0 AND 4 THEN '000-004'
WHEN @nContador BETWEEN 5 AND 9 THEN '005-009'
WHEN @nContador BETWEEN 10 AND 14 THEN '010-014'
WHEN @nContador BETWEEN 15 AND 19 THEN '015-019'
WHEN @nContador BETWEEN 20 AND 24 THEN '020-024'
WHEN @nContador BETWEEN 25 AND 29 THEN '025-029'
WHEN @nContador BETWEEN 30 AND 34 THEN '030-034'
WHEN @nContador BETWEEN 35 AND 39 THEN '035-039'
WHEN @nContador BETWEEN 40 AND 44 THEN '040-044'
WHEN @nContador BETWEEN 45 AND 49 THEN '045-049'
WHEN @nContador BETWEEN 50 AND 54 THEN '050-054'
WHEN @nContador BETWEEN 55 AND 59 THEN '055-059'
WHEN @nContador BETWEEN 60 AND 64 THEN '060-064'
WHEN @nContador BETWEEN 65 AND 69 THEN '065-069'
WHEN @nContador BETWEEN 70 AND 74 THEN '070-074'
WHEN @nContador BETWEEN 75 AND 79 THEN '075-079'
WHEN @nContador BETWEEN 80 AND 84 THEN '080-084'
WHEN @nContador BETWEEN 85 AND 89 THEN '085-089'
WHEN @nContador BETWEEN 90 AND 94 THEN '090-094'
WHEN @nContador BETWEEN 95 AND 99 THEN '095-099'
WHEN @nContador >= 100 THEN '100+'
END
SET @nContador += 1;
END
--////
INSERT INTO Sexo VALUES ('H','Hombre')
INSERT INTO Sexo VALUES ('M','Mujer')
Data loading completed
At this point we conclude the process of creation and loading of demographic information in our database. In the articles pointed out at the beginning, we will use this database as a starting point for the construction of population pyramids with PowerPivot.
Una característica esencial de los modelos de datos
tabulares pertenecientes a BISM (Business Intelligence Semantic Model), el nuevo
paradigma de desarrollo BI incorporado en SQL Server 2012, lo constituyen las
relaciones entre las tablas integrantes de un determinado modelo, ya que sin
ellas resultaría imposible realizar un adecuado análisis de la información
contenida en el mismo.
Pongamos como ejemplo que tenemos una base de datos
con el nombre AdvWksDW, creada a partir de algunas de las tablas de la base de
datos AdventureWorksDW2012,
pero sin las correspondientes relaciones entre las mismas. El siguiente bloque
de código muestra las sentencias SQL necesarias para su creación.
CREATE DATABASE AdvWksDW
GO
USE AdvWksDW
GO
SELECT * INTO FactInternetSales FROM AdventureWorksDW2012.dbo.FactInternetSales
GO
SELECT * INTO DimDate FROM AdventureWorksDW2012.dbo.DimDate
GO
Para aquellos lectores más familiarizados con el
trabajo con modelos de análisis multidimensional, si tuviéramos que
desarrollar un cubo OLAP empleando esta base de datos, la tabla FactInternetSales
representaría a la tabla de hechos,
ya que contiene las columnas a partir de las cuales podemos obtener las medidas
o resultados numéricos a analizar; mientras que la tabla DimDate representaría
a una tabla de dimensión, empleada para
filtrar la información por las fechas existentes en el modelo.
A continuación crearemos un proyecto de análisis de
tipo tabular con SQL Server Data Tools (SSDT); el entorno de desarrollo para
SQL Server 2012 basado en Visual Studio 2010, conocido en anteriores versiones
como Business Intelligence Development Studio.

En el modelo de datos de este proyecto incluiremos las
tablas de la base de datos que acabamos de crear, añadiendo también una medida que
sume los valores de la columna SalesAmount.

Para analizar este modelo desde Excel
seleccionaremos la opción de menú de SSDT "Model | Analyze in Excel", o bien
haremos clic en el botón de la barra de herramientas que desempeña el mismo
propósito.

Esta acción tendrá como resultado la apertura de una
nueva hoja de cálculo que mostrará el modelo de datos en una tabla dinámica,
dentro de la cual situaremos la medida "Sum of SalesAmount" de la tabla
FactInternetSales en el bloque Valores, y el campo CalendarYear de la tabla
DimDate en el bloque "Etiquetas de fila".
El resultado que necesitaríamos obtener de esta tabla
dinámica sería la suma del campo SalesAmount, perteneciente a la tabla
FactInternetSales, agrupada por los años de venta, es decir, por el año del
campo OrderDateKey. Puesto que contamos con la tabla DimDate como catálogo de
fechas, si combinamos en una consulta SQL ambas tablas por los campos de fecha adecuados,
obtendríamos las cifras esperadas.

Sin embargo el resultado obtenido en la tabla
dinámica es muy distinto, ya que todas sus celdas muestran el mismo valor: el total
de la suma del campo SalesAmount. Esto es debido a la inexistencia de una relación
entre los campos OrderDateKey y DateKey de las tablas FactInternetSales y
DimDate respectivamente, lo cual produce que el motor del modelo tabular no
sepa cómo aplicar la medida "Sum of SalesAmount".

Relaciones
en el modelo tabular
En escenarios como el que acabamos de describir es donde
nos damos cuenta de la importancia que cobra un adecuado establecimiento de
relaciones entre las tablas de nuestro modelo, a fin de conseguir un óptimo
análisis de la información.
Un modelo tabular obtiene, en el caso de que
existan, las relaciones directamente de la fuente de datos origen durante la
ejecución del asistente de importación de datos. A las relaciones obtenidas de
esta forma las denominaremos relaciones
automáticas.
Si el modelo carece de relaciones, o bien
necesitamos relaciones adicionales a las ya existentes, podemos crearlas
mediante la ventana de administración disponible a tal efecto en el entorno de
desarrollo de Visual Studio 2010; llamaremos a este tipo relaciones manuales.
En nuestra situación actual, para que la información
del modelo resulte coherente, necesitamos, como hemos indicado antes, crear una
relación entre las tablas FactInternetSales y DimDate, por los campos OrderDateKey
y DateKey; para ello seleccionaremos la opción de menú "Table | Create
Relationships", que abrirá la ventana en la que realizaremos dicha operación,
seleccionando las tablas y campos integrantes de la relación.

Una vez creada la relación volveremos a la tabla
dinámica de Excel en la que estamos analizando el modelo, y después de hacer
clic en la opción Actualizar (grupo Datos de la pestaña Opciones, contenida en
la pestaña de nivel superior "Herramientas de tabla dinámica") se producirá una
nueva lectura de los datos del modelo por parte de la tabla dinámica,
refrescando su contenido, y haciendo que las cifras de ventas por año se
muestren ya correctamente.

El estado
de la relación. Relaciones activas e inactivas
En un modelo de datos puede suceder
que una tabla en la que hemos definido medidas, varios de sus campos se
relacionen con un mismo campo de otra tabla que vamos a utilizar para filtrar
los datos de la primera. Cuando esto ocurre, ¿cómo podemos saber qué campo de
la tabla de medidas se está relacionando con la tabla de filtro al analizar el
modelo por ambas tablas? Para aquellos lectores con experiencia en el
desarrollo de cubos OLAP, la pregunta a formular sería: ¿qué campo de la tabla
de hechos se está relacionando con la tabla de dimensión?
La respuesta a esta pregunta la hallamos en el
estado de la relación (activa o inactiva), y la mejor manera de ilustrar este
concepto es a través de un ejemplo. Por consiguiente, vamos a crear un nuevo
proyecto tabular en SSDT, cuya fuente de datos sea esta vez la base de datos
AdventureWorksDW2012, importando en el modelo, al igual que en el ejemplo
anterior, las tablas FactInternetSales y DimDate; y definiendo igualmente una
medida con la suma de la columna SalesAmount.
Al analizar el modelo en Excel, situando el campo
CalendarYear en las etiquetas de fila, los números resultantes corresponderán a
las ventas realizadas por fecha de factura.

Que serán las mismas cifras que obteníamos de la
consulta SQL presentada en un apartado anterior.

Llegados a este punto es donde el lector se
preguntará con toda seguridad: "¿Y cómo sabe el modelo que yo quiero obtener la
suma del campo SalesAmount en base al campo OrderDateKey, y no en base a
DueDateKey o ShipDateKey, que también son de tipo fecha?"
Realmente, el modelo no lo sabe, sino que se limita
a utilizar la relación activa del modelo. Expliquemos esto con mayor detalle:
Si observamos el diseñador del modelo en vista de
diagrama, veremos que desde la tabla FactInternetSales parten tres relaciones
hacia la tabla DimDate. De las flechas que indican visualmente las relaciones,
la que tiene el trazo de línea continuo corresponde a la relación activa, y por
lo tanto, será la que se utilizará de forma predeterminada cuando se consulte
el modelo empleando el campo CalendarYear de la tabla DimDate. Al hacer clic en
dicha relación, la ventana de propiedades del entorno de desarrollo de Visual
Studio mostrará el detalle de la misma. Nótese que la propiedad Active tiene el
valor True.

También podemos averiguar cuál es la relación activa,
y por ende, las inactivas, seleccionando en Visual Studio la opción de menú
"Table | Manage Relationships", que abrirá la ventana de administración de las
relaciones del modelo, donde la columna
Active nos informará de este aspecto en particular.

Consultando
el modelo a través de una relación inactiva
La cuestión a plantear ahora sería
cómo podríamos interrogar al modelo, de forma que nos proporcionara la suma del campo SalesAmount, pero utilizando alguno de los otros
dos campos de fecha (DueDateKey o ShipDateKey) que también están relacionados con
el campo DateKey de la tabla DimDate. Dicho con otras palabras, lo que
necesitamos es alterar el contexto de filtro predeterminado actualmente activo.
La solución
pasa por crear una nueva medida, que también sume el campo SalesAmount, pero
indicando de alguna manera que el campo de agrupación va a ser uno distinto del
predeterminado. Por ejemplo, si queremos obtener los resultados de ventas por
el campo ShipDateKey, usaremos la siguiente expresión DAX:
VentasPorFechaEnvio:=CALCULATE( SUM([SalesAmount]) ;
USERELATIONSHIP(FactInternetSales[ShipDateKey]; DimDate[DateKey]) )
Para añadir
esta medida al modelo de datos, dentro del área reservada a las medidas haremos
clic en una celda vacía debajo de la columna SalesAmount, y escribiremos la
anterior expresión en el panel de introducción de fórmulas/expresiones.

Analicemos
esta expresión por partes: debemos emplear la función SUM(), ya que nuestro
objetivo principal es sumar el campo SalesAmount, pero como queremos que al
agrupar la suma, se utilice el campo ShipDateKey, mediante la relación de éste
con el campo DateKey de la tabla DimDate, tenemos que especificar
explícitamente dicha relación porque no es la activa, y esto es lo que hacemos
con la función USERELATIONSHIP(). Ahora bien, precisamente por el hecho de
utilizar una relación que no es la activa, necesitamos alterar el contexto de
filtro actual, lo que conseguimos usando la función CALCULATE().

Como hemos
hecho en ejemplos anteriores, si trasladamos esta situación a una consulta SQL,
para comprobar que los valores obtenidos son los mismos, la sentencia quedaría
tal y como vemos a continuación.

Conclusiones
En el presente artículo hemos realizado una introducción
a las relaciones en los modelos tabulares del nuevo modelo semántico para
Business Intelligence incorporado a SQL Server 2012. Dicha característica
representa una pieza fundamental en la construcción de sistemas de análisis
empleando esta tecnología, por lo que confiamos en que este artículo sirva de
ayuda al lector para iniciarse en el desarrollo de sus propios modelos de
análisis utilizando BISM. Los ejemplos elaborados a lo largo del artículo están
disponibles en este enlace.
In the first part of this article we developed a data model in PowerPivot representing population figures by age and sex. In this second installment will shape these figures on a chart shaped population pyramid.
Pyramid chart. First approach
In its current state, the pivot table already have enough information (figures by population, age and sex) to try to create a chart that represents a population pyramid, but we advance the reader that in this first approach we will not get the desired effect.
Positioned in the pivot table, from the Excel ribbon, select the option "PivotChart" belonging to "Tools" group from "Options" tab, which in turn is contained in the top-level tab "PivotTable Tools".
This selection opens "Insert Chart" window, which contains all available chart types. Here we will realize that there is no specific template for creating a pyramid chart, therefore, of all bid at our disposal we choose, within "Bar" category, the "Clustered Bar" type, which as we shall see later, it will be the best adapted to the result we want achieve.

Accepting this window, the chart will be created from the PivotTable data, and as we anticipated, the result won't look like the image presented in the first part of the article.

However, the main difference lies in the orientation of the men bar population, which should be left. All other aspects are basically visual configuration issues, which explain how to solve in short.
Solving the trajectory of the population bars
Focusing on the male population bar, the solution for achieve to draw itself in the opposite direction to the current, is to put in negative the values of the cells in the pivot table for this segment of the population.
If we were in a simple spreadsheet with no connection to PowerPivot, the solution is as simple as editing the cells in column Hombre, changing their values to negative, but in this scenario data is being obtained from the PowerPivot data model, so that it's not possible directly edit the values in the PivotTable.
To solve this kind of problems we have to resort to the creation of columns and / or calculated measures, which through DAX expressions provide the results we need. In the transition to negative the values of the Hombre column, we will open the PowerPivot window, and placing ourselves in the first available empty column in the Poblacion table, write the following expression in the formula bar:
=IF([Sexo_ID] = "M", 1, -1)
We just created a calculated column that will be evaluated for each row of the Poblacion table, checking if Sexo_ID field value is equal to the letter "M" (Mujer), if so, the column value in that row will be 1, otherwise, when the field contains "H" (Hombre), the returned value will be -1.
Then we double-click its header to assign Sexo_Codigo as the column's name it. We can also give name by right clicking on the header and choosing "Rename Column".

Turning again to the Excel window, we'll remove the chart we created in the worksheet and uncheck the RecuentoPoblacion measure, leaving empty the area of the PivotTable values.
The next step will be to create a new measure named SumaPoblacion, based on the following DAX expression:
=SUM([Sexo_Codigo])
By applying this measure to the PivotTable, the SUM function takes the sum of column values passed as parameter, so the population figures of men now appear negative. This means that when re-creating the chart in the manner explained above, the indicator bars of the population values by sex are now drawn in opposite directions. As an additional feature, on "PivotTable Tools" tab, into "Design" tab, at the "Layout" group, we'll drop "Grand Totals", selecting "On for Columns Only", that hides the row totals column, because its presence in this context is irrelevant.

Visual configuration of population bars
Although the chart bars are now shown with the effect we wanted, it would be desirable a few tweaks in its visual appearance to improve the presentation quality.
First we right click on the age ranges labels, selecting "Format axis". In the window of the same name, under "Axis Options", assign the value "Low" to "Axis Labels" property, which align the labels column to the left of the chart.

Then we right click on any of the bars in the graph, choosing "Format Data Series". In the configuration series window, into "Series Options," on property "Series Overlap" will move the position marker to the right end position (fully overlapped), while in property "Gap Width" will move the position marker to the left end (no gap at all). In this way we will accomplish the bars increase their thickness and eliminate the space between them, being completely joined to form the population pyramid.

Calculating the population percentages
So far, the data representation obtained, both in the pivot table as in the chart pyramid is based on absolute population numbers. However, the usual practice is that such representation is made as a proportion of each age group and sex on total population.
For example, in our pivot table, the women population aged between 55 and 59 is 184,888 people; to get the percentage that this population group is in relation to all individuals with whom we are working (6,458,684), we divide the group by the total, and format the result as a percentage, gaining 2.86%.
If we want the PivotTable to perform this operation for all population groups, we'll have to add additional calculations in the form of measures, but before that we'll remove the current population chart, because we will build it again from one of the new measures.
We start out the existence of a measure, SumaPoblacion, which as we know, it returns the number of people adding the field Sexo_Codigo. The next step is to create a new measure, which included in the pivot table, provide the total population in all cells.
Our first reaction might be to reuse the RecuentoPoblacion measure, created in the early stages of our example, but soon we realize that it is useless for this purpose, because although this measure counts the Poblacion table rows, the cells result are affected by the fields used in rows and columns, as well as other active filters in the PivotTable.
For a measure always count all rows in a table, regardless of the filters may be active, we'll use the CALCULATE function, which as the first parameter pass the operation to perform, in this case the count of rows in the Poblacion table using COUNTROWS function. Then pass as many parameters as filters we want to remove, using the ALL(TableName) function for each table that is, somehow, acting as a filter.
Under such assumptions create a new measure called TotalGlobalPoblacion, with the following DAX expression:
=CALCULATE (COUNTROWS (Poblacion), ALL (Edad), ALL (Sexo))
When applied this measure to the PivotTable, all the cells show the same value: the total population.

Now we need a third measure that makes the division between the two previous and display the result in percentage format. This new measure will have the name PorcentajePoblacion and use the following formula:
=[SumaPoblacion] / [TotalGlobalPoblacion]
As we saw in the first installment of the article, to apply the formatting to this measure will right click on one of its cells choosing "Number Format". In the format window this time select the "Custom" category, through which we introduce in the "Type" field the following format string.
0,00%;0,00%

This string will format the number as a percentage, and show the male population columns values without the negative sign, but internally, these values will remain negative.

Next deactivate all PivotTable measures except PorcentajePoblación, which will be the only one that remains visible. Then again add a clustered bar chart, using the configuration steps outlined above, and adding new format features to improve their presentation.
Apply percentage format to the horizontal axis
Firstly we will right click on the horizontal axis labels by selecting "Format axis", which will open the format window. In "Number" section, select the format category "Custom", and in the "Format Code" field, write the following format string:
0%;0%
By clicking the "Add" button, the string is added to the list of custom strings. Accepting the window, the format will apply to the horizontal axis labels.

Emphasizing the population bar edges
Then we will right click on one of the bars in the chart, choosing again "Format Data Series". This time, in "Border Color" section, we will click "Solid Line", selecting black color; while in "Border Styles" assign the value "2 pt." in property "Width". This operation will make for both groups of bars in the chart.

Then we right click on the labels of the age ranges, selecting "Format axis". In this format window assign the same values for the color and border style properties that just used to the chart bars.
As a result of these actions, the chart will show the edges with an outline clearly highlighted.

Relocating the legend position
At the time of chart creation, Excel puts the legend (Sexo_DS field) on the right side by default. However, it is possible to change the location of this element if we want to provide more space to the drawing of population bars. To do this, we right-click the legend and select "Format Legend" in the format window, under "Legend Options", we will click "Top".

As we can see, the chart has won drawing surface, but the legend indicators have been placed in reverse order with respect to the bars. To solve this problem we will click on the legend (Sexo_DS), dropping a menu of filter options, which will select "Sort Z to A".

With this action, the legend indicators shall be placed properly, but now we will find that the colors of the bars have been reversed, and lost the edge of the chart bars.
We will restore the edges of the bars in the manner explained above, whereas in terms of colors, for each side of the pyramid do a right click on one bar and select "Shape Fill", changing the current color which originally had the chart.

To complete the adjustments we are making about the legend, drag it until it is situated at the same level of the upper element of the pyramid, and will increase its width, so that the indicators are further apart.

Clear the fields and add title
Then we right click on any of the fields of the chart buttons and select the option "Hide All Field Buttons on Chart". In this way will prevent the user to apply filters on the horizontal and / or vertical pyramid axis, keeping a solid structure thus avoiding the possibility, for example, to hide age ranges either sex. However, this filter feature will still exist from the PivotTable.

Moreover, in the tab group "PivotChart Tools" we will click on "Layout", and within the group "Labels" will click on "Chart Title", which will drop several items where we will choose "Above Chart", adding a text box to the chart, we will edit to give it a title. At this point we have completed the development of our pyramid.

Energizing the data pyramid data through slicers
Although we have achieved the stated goal of creating a population pyramid, it would be interesting to enrich the information that actually offers as we do in this section.
Looking at the data model in the PowerPivot window, we shall realize that we haven't use health zoning information yet, so we can use these data to construct a filter that displays the pyramid based on the population concerning one or more of these health areas.
PowerPivot pivot tables, in addition to traditional filter, incorporating a new filter type called "slicer", in addition to the usual filtering functionality provides a more flexible user interface for handle the values to work with.
Let's create one slicer on the pyramid based on zone information. To this end, on the field list panel, drag the field Zona_DS from Zona table to "Slicers Horizontal" block, resulting in a slice located above both the pivot table and pyramid chart.

To filter data by slicer simply have to select the name of the health area we want to use as a filter. It is also possible to filter several areas simultaneously holding down the Ctrl key while clicking the areas composing the filter (as shown in the figure below).To remove all active filters will click on the icon located in the upper right corner of the slicer.

And at this point we concluded the article, in which its two parts we have shown how to construct a population pyramid in Excel 2010, using PowerPivot as a management tool for population data. However, the power of this technology goes beyond the mere treatment of demographic information, covering its scope to any environment where we have to make an analysis with large volumes of data.