
Step by step Pentaho

In this tutorial you'll see how we built our transformations in Pentaho to create our dimension tables and our fact table. We'll cover only the dimension Date so that this tutorial doesn't get too long.

Dimension Date

To create the dimension Date you'll need to use the following steps:

  • CSV file input
  • Split fields
  • Sort rows
  • Unique rows
  • Dimension lookup/update

You'll use two CSV file input steps, each one followed by a Split fields step. Then you route the main output of both Split fields steps into a single Sort rows step, followed by a Unique rows step and, finally, a Dimension lookup/update step. The image below shows how it looks:

dim_date

1. Opening the .CSV

The first thing we need to do is open our .csv in the CSV file input step.

date-csv-input

You must specify the path of the file you want to open in the Filename field. Keep the Delimiter as a , so the fields are detected properly when you click Get fields, and uncheck the Lazy conversion? box. You can also rename the step so it's easier to tell which field you're dealing with.

As said before, you'll have to do this step twice: once for the Date Reported field and once for the Date Occurred field.

Remove all the fields you won't use so that your step looks just like the image above and then hit OK.
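What the CSV file input step does here can be sketched with Python's standard csv module. The sample row and column names below are hypothetical stand-ins for the real file:

```python
import csv
import io

# Hypothetical sample mirroring the .csv used in the tutorial; the real
# file path and column set may differ.
sample = io.StringIO(
    "DR Number,Date Reported,Date Occurred\n"
    "1208575,2013-09-20T00:00:00,2013-09-18T00:00:00\n"
)

# Equivalent of the "CSV file input" step: comma delimiter, keep only
# the one field this copy of the step is responsible for.
reader = csv.DictReader(sample, delimiter=",")
dates_reported = [row["Date Reported"] for row in reader]
print(dates_reported)  # ['2013-09-20T00:00:00']
```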

2. Splitting the data

The reason you have to do this step is that in the .csv we're using, the fields with a date format (Date Reported and Date Occurred) come like this:

2013-09-20T00:00:00

The values after T don't change throughout the data in this .csv, so we did a split in this field to get only the useful information, which is the date itself.

With that in mind, you must set the Field to split according to the field you used in the input step. Set the Delimiter as T and then create a New field using the name of your choice (here we used date). IMPORTANT: The name of the new field must be the same in both Split fields steps.

Set the Type as Date and the Format as yyyy-MM-dd. Hit OK.

date-split

Repeat the steps above to get the Date Occurred dates.
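What the Split fields step does can be sketched in a few lines of Python. The function name is ours; Pentaho performs the split on T and the yyyy-MM-dd parse internally:

```python
from datetime import datetime, date

def split_date(value: str) -> date:
    """Sketch of the "Split fields" step: split on 'T', keep the first
    part, and parse it with the yyyy-MM-dd format set in the step."""
    date_part, _time_part = value.split("T", 1)
    return datetime.strptime(date_part, "%Y-%m-%d").date()

print(split_date("2013-09-20T00:00:00"))  # 2013-09-20
```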

3. Sorting the rows

You must use this step because the Unique rows step that follows only compares consecutive rows, so the input has to be sorted first.

Here all you have to do is click on Get fields and hit OK.

date-sort-rows

4. Getting only unique rows

The fields Date Reported and Date Occurred contain repeated dates. In the previous step you merged both into a field called date and sorted it. Now you must remove the repeated dates so each date is stored only once in the dimension. You'll do this by using the Unique rows step.

All you have to do is click on Get again and then hit OK.

date-unique-rows
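The Sort rows + Unique rows pair can be sketched like this (the sample dates are made up). It also shows why the sort is required: duplicates are only detected by comparing each row with the previous one:

```python
# Sketch of "Sort rows" followed by "Unique rows".
dates = ["2013-09-20", "2013-09-18", "2013-09-20", "2013-09-18"]

unique_dates = []
for d in sorted(dates):  # Sort rows
    # Unique rows: keep the row only if it differs from the previous one,
    # which is why unsorted input would let duplicates through.
    if not unique_dates or unique_dates[-1] != d:
        unique_dates.append(d)

print(unique_dates)  # ['2013-09-18', '2013-09-20']
```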

5. Creating the dimension Date

Now you're in the final part. Set the main output of the previous step to the Dimension lookup/update step. Here the first thing you have to do is create a new Connection to the DBMS you're using. In our project we used PostgreSQL and the connection looks like this:

conexao

After that, check the Update the dimension? box and set the table where the data will be stored. Under Keys, set the Dimension field with the name of your choice and put date in Field in stream. The last thing to do is set the Technical key field; you can use any name you want for it. The technical key holds the surrogate ID for each row inserted into the table you specified.

date-dimension-lookup

Hit OK and run the transformation.
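Conceptually, Dimension lookup/update does a lookup-or-insert on the natural key, handing out a new technical key when the row doesn't exist yet. A minimal sketch, with made-up field names:

```python
# Sketch of what "Dimension lookup/update" does for dim_date.
dim_date = {}   # natural key (date) -> technical key; stands in for the table
next_key = 1

def lookup_or_insert(d: str) -> int:
    """Look the date up; insert it with a fresh technical key if missing
    (this is what checking "Update the dimension?" enables)."""
    global next_key
    if d not in dim_date:
        dim_date[d] = next_key
        next_key += 1
    return dim_date[d]

print(lookup_or_insert("2013-09-18"))  # 1
print(lookup_or_insert("2013-09-20"))  # 2
print(lookup_or_insert("2013-09-18"))  # 1  (existing row, same key)
```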

Other Dimensions

The steps you'll follow to create the other dimensions are very similar to the ones you saw above. The differences are:

  • The dimension Location needs a sequence of three Split fields so you can get the Latitude and the Longitude correctly. This is how the field containing both comes from the .csv: "{'longitude': '-118.2574', 'human_address': '{""address"":"""",""city"":"""",""state"":"""",""zip"":""""}', 'needs_recoding': False, 'latitude': '34.0277'}"
  • The other dimensions don't need a Split fields step. The transformation will be very similar: instead of passing the values through a Split fields step, we simply route the main output of the CSV file input step straight into the Sort rows step.
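As a sanity check for the Location split: once the CSV reader has undone the doubled-quote escaping, the field reads like a Python dict literal, so the three chained Split fields can be mimicked with ast.literal_eval (the exact cleanup Pentaho needs may differ):

```python
import ast

# The raw field after CSV unescaping, copied from the tutorial's example.
raw = ("{'longitude': '-118.2574', "
       "'human_address': '{\"address\":\"\",\"city\":\"\",\"state\":\"\",\"zip\":\"\"}', "
       "'needs_recoding': False, 'latitude': '34.0277'}")

# One parse replaces the three chained "Split fields" steps.
loc = ast.literal_eval(raw)
latitude, longitude = float(loc["latitude"]), float(loc["longitude"])
print(latitude, longitude)  # 34.0277 -118.2574
```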

Here is an image showing how our Location transformation looks: dim_location

All the transformations we made in this project can be found in the transformations folder.

Fact Occurrence

For the fact Occurrence you must read the .csv and apply the same steps to filter the Date Reported, Date Occurred, Longitude and Latitude fields that you used before on the dimensions, i.e., you have to split the data coming from the .csv just like you did before. You can copy the previous Split fields steps and paste them into the fact transformation; they'll work just fine.

To make the fact Occurrence you'll use:

  • Split fields (as mentioned before)
  • Dimension lookup/update
  • Table output

Here's how your fact transformation should look when you're done:

occurrence

1. Getting the IDs of the dimensions

The fact table holds only the IDs of the other dimensions and any metrics you think are important to your project. In our case the fact holds only the dimension IDs; our metrics will be generated in our reports.

To get the IDs you'll chain the Dimension lookup/update steps before sending the data to the table in your Table output step. There'll be one Dimension lookup/update step for each dimensional table, except for Date, which needs two steps so that both Date Reported and Date Occurred end up in the fact table.

This is how the step for Date Occurred will look:

fact-date-occurred

You must repeat the step above for the Date Reported field, changing only the Field in stream.
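The two dim_date lookups can be sketched as two passes over the same mapping; the field and key names below are illustrative:

```python
# Both lookups resolve against the same dimension, producing the two
# stream fields the Table output step will later see.
dim_date = {"2013-09-18": 1, "2013-09-20": 2}  # loaded by the dimension transformation

row = {"DR Number": "1208575",
       "date_reported": "2013-09-20",
       "date_occurred": "2013-09-18"}

row["id_dim_date"] = dim_date[row["date_reported"]]    # first Dimension lookup/update
row["id_dim_date_1"] = dim_date[row["date_occurred"]]  # second one; only Field in stream changes

print(row["id_dim_date"], row["id_dim_date_1"])  # 2 1
```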

For the other dimensional tables you'll feed all the incoming data of each table into its respective step. It is essential that you specify each table's Technical key field for the transformation to succeed. You can check how these steps were made in our img folder or in the transformation file fact.ktr.

Creating the fact table

To create the fact Occurrence you must route the main output of the previous steps into the Table output step.

Open the Table output step. Click on Database fields and then on Get fields. Delete all the fields, leaving only the IDs you got as technical keys in the previous steps and the DR Number field. Pay attention: you'll have two id_dim_date entries in the Stream field column, id_dim_date and id_dim_date_1. The first holds the IDs for Date Reported and the second holds the IDs for Date Occurred. Also remember to rename the fields so they won't contain spaces or illegal characters.

This is how our Table output step looks:

fact-table-output
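The field selection and renaming in Table output amounts to the mapping below; the target column names are only examples, so pick whatever fits your schema:

```python
# One incoming stream row after the chained dimension lookups
# (values and the location key are made up).
stream_row = {"DR Number": "1208575", "id_dim_date": 2,
              "id_dim_date_1": 1, "id_dim_location": 7}

# Stream field -> table column; renames avoid spaces and illegal
# characters, and disambiguate the two date keys.
field_map = {
    "DR Number": "dr_number",
    "id_dim_date": "id_date_reported",    # Date Reported IDs
    "id_dim_date_1": "id_date_occurred",  # Date Occurred IDs
    "id_dim_location": "id_dim_location",
}

fact_row = {col: stream_row[field] for field, col in field_map.items()}
print(fact_row)
```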

Once you're done with the steps specified above, hit OK and run the transformation.

In your DBMS you should see something like this:

facto-select