PDI Best Practices


Isolation / Separation Process Of Applications

A transformation should perform a single, well-defined task. If your transformation starts to become complicated or convoluted, think about splitting it up or creating a sub-transformation. Doing this will make your transformations clearer and also allow re-use.

Make Work Portable / Utilize Central Environment File

Making work portable means that moving your work to another machine or location requires minimal alterations to retain functionality.
To achieve this, avoid fixed names wherever possible and use variables instead. Environment variables can be held in the kettle.properties file.
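For example, environment-specific values can be kept in kettle.properties in the user's .kettle directory. The entry names below are illustrative, following the naming convention described later in this document:

```properties
# $HOME/.kettle/kettle.properties -- illustrative entries, one file per environment
PROJECTNAME_PROJECT_DB_HOST=dbserver01
PROJECTNAME_PROJECT_DB_NAME=project_db
PROJECTNAME_PROJECT_DB_USERNAME=etl_user
```

Steps then reference these as ${PROJECTNAME_PROJECT_DB_HOST} and so on, so only the properties file changes when work moves between environments.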

Do not hard code file/folder locations

For names of transformations and jobs, use relative paths (using the ${Internal.Job.Filename.Directory} and ${Internal.Transformation.Filename.Directory} variables).
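For example, a Transformation job entry could reference a sibling file relative to the job's own location (the folder and file names here are illustrative):

```
${Internal.Job.Filename.Directory}/TRANSFORMATIONS/LOAD_CUSTOMERS.ktr
```

Because the path is resolved from wherever the job file lives, the whole directory tree can be moved or checked out elsewhere without editing the job.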

Fully Documented PDI Components

As well as detailed support documentation for all transformations and jobs created, a simplified description of the functionality should also be included in the settings. Within the PDI canvas, right click and select Transformation/Job settings and use the description (and extended description, if required) fields to summarize the component's functionality.
In addition, notes should be added to the canvas to describe what function a step or group of steps performs; this is required for ongoing support purposes.

Naming Conventions

Folder names

Uppercase
No Spaces
Use directory structure to split jobs and transformations into distinct process stages
Subdirectories should be used to provide logical structure, splitting jobs into stages and sub-stages

File Names

No spaces
Name to indicate the purpose/action of the job/transformation

Transformation/Job Steps

All steps within transformations and jobs should be named to accurately and concisely reflect their action. This will aid in quickly understanding what a transformation is doing.

Properties

Prefixed with the portfolio/application shortcode
Upper case
Clearly indicate what the parameter is
Good example: PROJECTNAME_PROJECT_DB_USERNAME
Bad example: DB_USR


Parameters

DB Connections

Same name as target database

Canvas

To avoid confusion, lay out steps on the canvas as clearly as possible. There are numerous approaches to structured layout, but to standardize, it is suggested that inputs are placed at the top and the process flows down the page, resulting in an output at the bottom of the canvas.

Use of Mapping Steps

Wherever possible, build transformations and jobs which can be reused across other files/applications. The easiest way of doing this is with arguments, variables and named parameters. If a task is identified which is going to be used in several situations, create a sub-transformation.
A mapping step behaves like a macro: the sub-transformation is expanded into the calling transformation at runtime.
Avoid renaming or removing fields inside mappings.

PDI Settings

Performance

Transformations are networks
Network speed is limited to the slowest part
The slowest step is indicated while running in Spoon
Slow steps have a full input and empty output buffer
First re-write, re-think, re-organize
Parallelize work
End-to-end data pipe-lining
http://help.pentaho.com/Documentation/5.4/0L0/0Y0/070/030

Folders

Logical separation of jobs and transformations using folder hierarchies

Documentation

Jobs and transformations should be "self-documenting". This is achieved by ensuring that the name of each step clearly states what the step is doing. For more complex tasks, notes should be added to the canvas to explain why certain steps are being performed.

Logging

Instead of calling kitchen.sh and pan.sh directly, wrapper scripts have been created which log the time, user, and the exact command and parameters that were executed to a log file before passing through to kitchen.sh and pan.sh.
The log files are stored locally; note that the location may change for each server.
To do: send these logs to a central log server.

See attached files: panw.sh and kitchenw.sh
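The attached scripts are not reproduced here, but the idea can be sketched as follows. This is a minimal sketch of a kitchenw.sh-style wrapper; the install path of kitchen.sh, the default log location, and the override variables (PDI_WRAPPER_LOG, KITCHEN_CMD) are assumptions to adjust per server:

```shell
#!/bin/sh
# Sketch of kitchenw.sh, a logging wrapper for kitchen.sh.
# Paths and variable names are assumptions; adjust per server.

LOG_FILE="${PDI_WRAPPER_LOG:-./kitchenw.log}"                 # log location may change per server
KITCHEN="${KITCHEN_CMD:-/opt/pentaho/data-integration/kitchen.sh}"

# Record the time, user, and exact command line before passing through.
printf '%s %s kitchen.sh %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$(id -un)" "$*" >> "$LOG_FILE"

# Delegate to the real kitchen.sh when it is present.
if [ -x "$KITCHEN" ]; then
    exec "$KITCHEN" "$@"
fi
```

An equivalent panw.sh would differ only in the delegated command. Because exec replaces the wrapper process, kitchen.sh's exit code is passed straight back to the caller.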

Tips and Tricks

PDI Performance Tuning Tips

http://help.pentaho.com/Documentation/5.4/0L0/0Y0/070/030

Avoid using "Copy rows to results" and "Get rows from results"

Using these steps to pass data from one transformation to the next for large data sets is inefficient and can cause out of memory errors.
  • This method loads the full data set into memory. Whilst it enables transformations to be broken into distinct steps and helps reusability and maintainability, it causes performance issues.
  • Loading the data from source and saving to the destination in the same transformation is more efficient and saves system resources as only the current data being processed is held in memory.
There is also a bug with this method, described at http://jira.pentaho.com/browse/PDI-13913, which results in the data in memory being duplicated if "Copy rows to results" is used in multiple transformations.
As an example, in the Wealth ETL project 2 million rows of data were loaded and passed through an ETL job using this method, which resulted in an out-of-memory exception on the production server.
To resolve the issue, all of the individual transformation steps were merged into one transformation, removing the copy step. This also resulted in a massive performance increase (from 2 hours to 20 minutes!).

Sorting

Avoid using the sort step for large data sets and use a temporary database table instead.
  • It has been observed that the sort step is inefficient, and massive time savings can be made by inserting the data into a temporary database table and then performing an ordered select in a subsequent transformation.
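The temporary-table approach can be sketched as below; the table and column names are illustrative. The insert would typically be done with a Table output step, and the ordered read with a Table input step in a subsequent transformation:

```sql
-- Staging table written by a Table output step (unsorted).
CREATE TABLE tmp_sort_stage (
    customer_id INTEGER,
    event_date  DATE,
    amount      NUMERIC(12,2)
);

-- Subsequent transformation reads the rows back in order via a
-- Table input step, letting the database do the sort instead of
-- PDI's Sort rows step.
SELECT customer_id, event_date, amount
FROM tmp_sort_stage
ORDER BY customer_id, event_date;
```

This trades some I/O for memory: the database sorts on disk with its own optimised machinery, rather than PDI buffering the full data set in the JVM.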
