ETL DataStage Implementation V13





3.4.3    Load







3.5    Expired File Handling



3.6    Expired Data Handling



Chapter 4    ETL Implementation

4.1    Overall Implementation


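Following the design, first settle where the ETL interface files, temporary files, intermediate files, programs, and logs are stored, and how the parameters used in DS JOBs are stored and passed. Then develop the required programs and DS objects.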


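In addition, the Scratch directory holds the temporary files that DS generates itself during processing; when the files are large, operations such as sorting can make this directory take up a great deal of space. The Datasets directory holds the files generated by the DS Data Set Stage. Both directories therefore need special attention.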

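Parameters used in DS JOBs are generally handled in one of three ways. The first is to define them in DS Administrator and reference them in the JOB. The second is to define them in a parameter file or parameter table and have a program read them and assign them to the corresponding JOB. The third mixes the two: parameters that rarely change (such as the ETL root path, database user name, and password) are set in DS Administrator, while parameters that change often (such as the interface time) are defined in a parameter file or parameter table.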
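The mixed approach can be sketched in Python as below. This is only an illustration under stated assumptions: the parameter names (`ETL_ROOT`, `DB_USER`, `INTERFACE_DATE`), the `NAME=VALUE` file layout, and the rule that file values override fixed ones are all hypothetical and not taken from the original design. The only DataStage-specific detail is the `-param name=value` form accepted by the `dsjob` command line.

```python
# Hypothetical stable values, standing in for what would be set in DS Administrator.
FIXED_PARAMS = {"ETL_ROOT": "/data/etl", "DB_USER": "etl_user"}


def parse_param_file(text):
    """Parse NAME=VALUE lines; blank lines and lines starting with '#' are skipped."""
    params = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"malformed parameter line: {raw!r}")
        params[name.strip()] = value.strip()
    return params


def merge_params(fixed, variable):
    """Combine both sources; values from the parameter file win (an assumption)."""
    merged = dict(fixed)
    merged.update(variable)
    return merged


def to_dsjob_args(params):
    """Render parameters as repeated '-param NAME=VALUE' arguments, sorted by name."""
    args = []
    for name in sorted(params):
        args += ["-param", f"{name}={params[name]}"]
    return args
```

The resulting argument list is meant to be spliced into a `dsjob -run` command by whatever program assigns the parameters to the JOB. Passwords are deliberately left out of the sketch.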

The parameters defined in DS Administrator are as follows

4.2    Scheduling


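For simple scheduling, the related JOBs can be chained together with a Sequence Job. Dependencies between different Sequence Jobs can be handled with message files or database records, and the schedule time is then configured directly in Director. Where an execution log is required, add one more layer of generic Job: in that Job's Job Control, write generic BASIC code that runs the ETL Jobs beneath it and logs each run, taking the Job name, time, and so on as parameters. In effect, this kind of Job calls the ETL Jobs, and Sequence Jobs in turn call these Jobs.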
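The message-file style of dependency between Sequence Jobs can be sketched as follows. The flag directory, the `<sequence>.<run date>.done` naming scheme, and the function names are assumptions made for this illustration; the original does not specify a file layout.

```python
from pathlib import Path


def flag_path(flag_dir, sequence_name, run_date):
    """Location of the message file for one Sequence Job and one run date."""
    return Path(flag_dir) / f"{sequence_name}.{run_date}.done"


def mark_done(flag_dir, sequence_name, run_date):
    """Called after a Sequence Job finishes: create its message file."""
    path = flag_path(flag_dir, sequence_name, run_date)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.touch()
    return path


def upstream_finished(flag_dir, upstream_names, run_date):
    """True only when every upstream Sequence Job has left its message file."""
    return all(flag_path(flag_dir, name, run_date).exists() for name in upstream_names)
```

A downstream Sequence Job would check `upstream_finished(...)` before starting and call `mark_done(...)` when it completes. Keying the file name by run date keeps yesterday's flag from satisfying today's dependency.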

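More complex scheduling calls for a separately developed, standalone program. Each JOB's run time, dependency conditions, priority, ordered or unordered execution, and similar information can be stored in a configuration table. A program (Shell, Python, Java, etc.) then runs the JOBs in the manner each situation requires, controls them through the dsjob interface that DS provides, and records the important steps in a log table. Operations such as retransmission can be triggered by changing the status in the log table. A configured JOB can be either an ETL Job or a Sequence Job.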
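A minimal sketch of such a standalone scheduler is shown below, under several assumptions. The configuration table is modelled as a plain dict, the log table as a `status` dict, and the status values `OK`, `FAILED`, and `BLOCKED` are hypothetical. The command uses `dsjob -run -jobstatus` with `-param`; treating exit codes 1 (finished OK) and 2 (finished with warnings) as success is an assumption that should be checked against the installed DataStage version. The `runner` argument is injected so the ordering logic can be exercised without a DataStage server.

```python
import heapq


def execution_order(jobs):
    """Order jobs so dependencies run first; among ready jobs, lower priority number first."""
    pending = {name: set(cfg["depends_on"]) for name, cfg in jobs.items()}
    ready = [(jobs[name]["priority"], name) for name, deps in pending.items() if not deps]
    heapq.heapify(ready)
    order = []
    while ready:
        _, name = heapq.heappop(ready)
        order.append(name)
        for other, deps in pending.items():
            if name in deps:
                deps.remove(name)
                if not deps:
                    heapq.heappush(ready, (jobs[other]["priority"], other))
    if len(order) != len(jobs):
        raise ValueError("dependency cycle or unknown dependency in job configuration")
    return order


def dsjob_command(project, job, params):
    """Build the dsjob command line for one job (project and job names are placeholders)."""
    cmd = ["dsjob", "-run", "-jobstatus"]
    for name, value in sorted(params.items()):
        cmd += ["-param", f"{name}={value}"]
    return cmd + [project, job]


def run_pending(jobs, project, params, status, runner, ok_codes=(1, 2)):
    """Run every job not yet marked OK; 'status' stands in for the log table."""
    for name in execution_order(jobs):
        if status.get(name) == "OK":
            continue  # already done; reset the status to force a rerun
        if any(status.get(dep) != "OK" for dep in jobs[name]["depends_on"]):
            status[name] = "BLOCKED"  # a dependency has not succeeded
            continue
        code = runner(dsjob_command(project, name, params))
        status[name] = "OK" if code in ok_codes else "FAILED"
    return status
```

In real use, `runner` could be something like `lambda cmd: subprocess.run(cmd).returncode`. Rerunning after a failure only requires calling `run_pending` again with the same `status`; changing an entry from `OK` to anything else reproduces the retransmission-by-status idea described above.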

4.3    Parallel Job VS Server Job

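ETL Jobs come in three types: Server Job, Parallel Job, and Mainframe Job (used only on mainframes). In practice the choice is usually between Server Job and Parallel Job. Moving from one type of Job to the other feels as unfamiliar as moving from one development tool to another. Which suits a project better, Server Job or Parallel Job? Below is a comparison of the two (copied directly from the web).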

1) The basic difference between server and parallel Jobs is the degree of parallelism. Server Job Stages do not have a built-in partitioning and parallelism mechanism for extracting and loading data between different Stages.
• All you can do to enhance speed and performance in server Jobs is to enable inter-process row buffering through the Administrator. This helps Stages exchange data as soon as it is available on the link.
• You could also use the IPC Stage, which helps one passive Stage read data from another as soon as data is available. In other words, Stages do not have to wait for the entire set of records to be read first and then transferred to the next Stage. Link Partitioner and Link Collector Stages can be used to achieve a certain degree of partitioning parallelism.
• All of the above features, which have to be explored in server Jobs, are built into DataStage PX.
2) The PX engine runs on a multiprocessor system and takes full advantage of the processing nodes defined in the configuration file. Both SMP and MPP architectures are supported by DataStage PX.
3) PX takes advantage of both pipeline parallelism and partitioning parallelism. Pipeline parallelism means that as soon as data is available between Stages (in pipes or links), it can be exchanged between them without waiting for the entire record set to be read. Partitioning parallelism means that the entire record set is partitioned into small sets and processed on different nodes (logical processors). For example, if there are 100 records and 4 logical nodes, each node would process 25 records. This enhances the speed at which loading takes place to an amazing degree. Imagine situations where billions of records have to be loaded daily. This is where DataStage PX comes as a boon for the ETL process and surpasses all other ETL tools in the market.
4) In parallel we have Dataset, which acts as the intermediate data storage in the linked list; it is the best storage option, as it stores the data in DataStage internal format.
5) In parallel we can choose to display OSH, which gives information about how the Job works.
6) In the Parallel Transformer there is no reference link possibility; in the server Stage a reference could be given to the Transformer. Parallel Stage can use both basic
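
The two kinds of parallelism described in item 3) can be illustrated with a small Python sketch. It is not how the PX engine is implemented: round-robin is just one possible partitioning rule, chosen here as an assumption, and the generator pipeline runs in a single thread, so it only shows rows flowing downstream one at a time instead of waiting for the whole record set.

```python
def round_robin_partition(records, node_count):
    """Partitioning: deal records out to logical nodes in turn."""
    partitions = [[] for _ in range(node_count)]
    for index, record in enumerate(records):
        partitions[index % node_count].append(record)
    return partitions


def extract(rows, events):
    """First stage: hand each row downstream as soon as it is read."""
    for row in rows:
        events.append(("extract", row))
        yield row


def transform(rows, events):
    """Middle stage: process a row the moment the upstream stage yields it."""
    for row in rows:
        events.append(("transform", row))
        yield row * 10


def load(rows, events):
    """Last stage: consume transformed rows one by one."""
    loaded = []
    for row in rows:
        events.append(("load", row))
        loaded.append(row)
    return loaded


# 100 records over 4 logical nodes, as in the example in item 3).
partitions = round_robin_partition(list(range(100)), 4)
print([len(part) for part in partitions])

# Pipeline: the event log interleaves stages row by row.
events = []
load(transform(extract([1, 2], events), events), events)
print(events)
```

The first print shows four partitions of 25 records each. The second shows each row passing through extract, transform, and load before the next row is read, which is the behaviour the quoted text calls pipeline parallelism.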





