Isolation / Separation of Application Processes
A transformation should perform a single, precise task. If your transformation starts to become complicated or convoluted, think about splitting it up or creating a sub-transformation. Doing this will make your transformations clearer and also allow re-usability.
Make Work Portable / Utilize Central Environment File
Making work portable means that your work can be moved to another machine/location with minimal alterations needed to retain functionality. To do this, wherever possible don't use fixed names but variables. Environment variables can be held in the kettle.properties file.
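As a minimal sketch, variables might be defined in kettle.properties like this (the file lives in $HOME/.kettle/ by default; the variable names below are illustrative, following the shortcode-prefix convention described under Naming Conventions):

    # $HOME/.kettle/kettle.properties
    WEALTH_INPUT_DIR=/data/wealth/input
    WEALTH_ARCHIVE_DIR=/data/wealth/archive
    WEALTH_DB_HOST=dbserver01

Steps can then reference these values as ${WEALTH_INPUT_DIR}, so only kettle.properties needs to change when the work moves to another machine.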
Do not hard code file/folder locations
For the names of transformations and jobs, use relative paths (using the ${Internal.Job.Filename.Directory} and ${Internal.Transformation.Filename.Directory} variables).
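For example, a job entry that calls a transformation might reference it relative to the job's own location (the sub-folder and file name here are hypothetical):

    ${Internal.Job.Filename.Directory}/transformations/load_customers.ktr

Because the path is resolved relative to wherever the job file sits, the whole folder tree can be moved or checked out elsewhere without edits.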
Fully Documented PDI Components
As well as detailed support documentation for all transformations and jobs created, a simplified description of the functionality should also be included in the settings. Within the PDI canvas, right-click, select Transformation/Job settings, and use the description (and extended description if required) fields to summarize the component's functionality.
In addition, notes should be added to the canvas to describe what function a step or group of steps is performing; this is also required for ongoing support purposes.
Naming Conventions
Folder names
Uppercase
No spaces
Use directory structure to split jobs and transformations into distinct process stages
Sub-directories should be used to provide logical structure, splitting jobs into stages and sub-stages
File Names
No spaces
Name should indicate the purpose/action of the job/transformation
Transformation/Job Steps
All steps within transformations and jobs should be named to accurately and concisely reflect their actions. This will aid in quickly understanding what a transformation is doing.
Properties
Prefixed with the portfolio/application shortcode
Upper case
Clearly indicate what the parameter is
Parameters
DB Connections
Same name as target database
Canvas
To avoid confusion, the layout of steps on the canvas must be as clear as possible. There are numerous schools of thought on structured layout, but to standardize the layout of a canvas it is suggested that inputs sit at the top and the process flows down the page, resulting in an output at the bottom of the canvas.
Use of Mapping Steps
Wherever possible, build transformations and jobs which can be reused across other files/applications. The easiest way of doing this is with the use of arguments, variables and named parameters (see the sketch after this list). If tasks are identified which are going to be used in several situations, create sub-transformations.
Is a macro
Avoid renaming or removing fields
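As an illustration of parameterized reuse, the same transformation can be invoked with different values per run via a named parameter (the file path and parameter name here are assumptions):

    # Run the same transformation against different inputs via a named parameter
    ./pan.sh -file=/opt/etl/transformations/load_customers.ktr \
             -param:WEALTH_INPUT_DIR=/data/wealth/input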
PDI Settings
Performance
Transformations are networks
Network speed is limited to the slowest part
The slowest step is indicated while running in Spoon
Slow steps have a full input and empty output buffer
First re-write, re-think, re-organize
Parallelize work
End-to-end data pipelining
http://help.pentaho.com/Documentation/5.4/0L0/0Y0/070/030
Folders
Logical separation of jobs and transformations using folder hierarchies
Documentation
Jobs and transformations should be "self-documenting". This is achieved by ensuring that the name of each step clearly states what the step is doing. For more complex tasks, notes should be added to the canvas to annotate why certain steps are being performed.
Logging
Instead of calling kitchen.sh and pan.sh directly, wrapper scripts have been created which log the time, user, and the exact command and parameters that were executed to a log file before passing through to kitchen.sh and pan.sh. The log files are located
See attached files: panw.sh and kitchenw.sh
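The attached scripts are the authoritative versions; as a rough sketch only, such a wrapper might look like the following (the log file path and PDI install directory are assumptions):

    #!/bin/sh
    # kitchenw.sh -- illustrative wrapper, not the attached script itself
    LOG=/var/log/pdi/kitchen.log                        # assumed log location
    echo "$(date '+%F %T') user=$(whoami) cmd: kitchen.sh $*" >> "$LOG"
    exec /opt/pentaho/data-integration/kitchen.sh "$@"  # assumed install path

A panw.sh equivalent would do the same before delegating to pan.sh.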
Tips and Tricks
PDI Performance Tuning Tips
http://help.pentaho.com/Documentation/5.4/0L0/0Y0/070/030
Avoid using "Copy rows to result" and "Get rows from result"
Using these steps to pass data from one transformation to the next for large data sets is inefficient and can cause out-of-memory errors.
- This method loads the full data set into memory. Whilst it enables transformations to be broken into distinct steps, which helps reusability and maintainability, it causes performance issues.
- Loading the data from source and saving it to the destination in the same transformation is more efficient and saves system resources, as only the data currently being processed is held in memory.
For example, in the Wealth ETL project, 2 million rows of data were loaded and passed through an ETL job using this method, resulting in an out-of-memory exception on the production server. To resolve the issue, all of the individual transformation steps were merged into one transformation, removing the copy step; this also resulted in a massive performance increase (2 hours down to 20 minutes!).
Sorting
Avoid using the sort step for large data sets and use a temporary database table instead.
- The sort step has been observed to be inefficient; massive time savings can be made by inserting the data into a temporary database table and then performing an ordered select in a subsequent transformation.
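As an illustrative sketch (table and column names are hypothetical), the first transformation would write the rows to a staging table via a Table output step, and the next transformation's Table input step would read them back ordered, letting the database do the sorting:

    -- Table input query in the follow-up transformation
    SELECT customer_id, trade_date, amount
    FROM tmp_wealth_trades
    ORDER BY customer_id, trade_date;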