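Data merging: equivalent to MERGE in a database.

Row/column pivoting: some data needs to be turned from columns into rows, or from rows into columns.

Referential integrity check: where the data has referential-integrity relationships, verify them before loading, for example by joining against the referenced table.

Uniqueness check: remove duplicate records from the data.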
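The four transformations listed above can be illustrated with a minimal, standard-library sketch. Everything here is an assumption made for illustration: rows are plain dictionaries, and the function names (`merge_rows`, `pivot_rows_to_columns`, `find_orphans`, `deduplicate`) are hypothetical, not part of any ETL product.

```python
from collections import defaultdict


def merge_rows(target, incoming, key):
    """Merge (upsert): incoming rows update matching rows or are appended."""
    merged = {row[key]: dict(row) for row in target}
    for row in incoming:
        merged.setdefault(row[key], {}).update(row)
    return list(merged.values())


def pivot_rows_to_columns(rows, key, name_col, value_col):
    """Pivot (key, name, value) rows into one wide row per key."""
    wide = defaultdict(dict)
    for row in rows:
        wide[row[key]][key] = row[key]
        wide[row[key]][row[name_col]] = row[value_col]
    return list(wide.values())


def find_orphans(child_rows, fk, parent_rows, pk):
    """Referential integrity: return child rows whose foreign key has no parent."""
    parent_keys = {row[pk] for row in parent_rows}
    return [row for row in child_rows if row[fk] not in parent_keys]


def deduplicate(rows, key):
    """Uniqueness: keep only the first row seen for each key value."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique
```

In a real project these steps would more likely run inside the database or the ETL tool; the sketch only shows what each check has to decide.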

3.4.3 Load

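Load simply means writing the data into the database. If all the preceding processing has been done, the data can be loaded directly. When loading, consider which mode applies:

Update: update existing records in the database.

Insert: write the data directly into the table.

Refresh: empty the table, then load.

Partial refresh: clear part of the table's data, then load.

For performance and similar reasons, some processing may be needed before and after loading, such as temporarily disabling indexes or temporarily disabling primary-key and foreign-key constraints.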
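The four load modes can be sketched as follows. This is a minimal illustration that uses SQLite as a stand-in database; the `load` function, the two-column `(id, value)` layout, and the mode names are assumptions, not an API of DataStage or of any particular database.

```python
import sqlite3

MODES = ("insert", "update", "refresh", "partial_refresh")


def load(conn, table, rows, mode="insert", where=None):
    """Load (id, value) rows into `table` using one of the four modes.

    `table` and `where` are interpolated into SQL, so they must come from
    trusted configuration, never from user input.
    """
    if mode not in MODES:
        raise ValueError(f"unknown load mode: {mode}")
    cur = conn.cursor()
    if mode == "refresh":
        # Refresh: empty the whole table before loading.
        cur.execute(f"DELETE FROM {table}")
    elif mode == "partial_refresh":
        # Partial refresh: clear only the slice that is about to be reloaded.
        if not where:
            raise ValueError("partial_refresh needs a where clause")
        cur.execute(f"DELETE FROM {table} WHERE {where}")
    if mode == "update":
        # Update: change existing records, matched by id.
        cur.executemany(
            f"UPDATE {table} SET value = ? WHERE id = ?",
            [(value, row_id) for row_id, value in rows],
        )
    else:
        # Insert, refresh and partial refresh all end with a plain insert.
        cur.executemany(f"INSERT INTO {table} (id, value) VALUES (?, ?)", rows)
    conn.commit()
```

For example, `load(conn, "t", rows, mode="partial_refresh", where="id = 2")` clears only the rows matching the condition before inserting the new ones.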

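3.5 Handling Expired Files

Interface files and files generated during intermediate ETL processing keep piling up, and over time even the largest file system will fill up, so these files have to be dealt with. Once interface files or intermediate files expire, some need to be archived in case they are needed later, and some can simply be deleted (my favorite).

It is best to write one unified program that processes all expired files; that also makes it easy to retrieve expired data when it is needed.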
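A unified clean-up program of this kind might look like the sketch below. The rule that decides between archiving and deleting (by file suffix), the age threshold, and the function name `process_expired_files` are all assumptions for illustration; a real project would take them from its own configuration.

```python
import shutil
import time
from pathlib import Path


def process_expired_files(src_dir, archive_dir, max_age_days,
                          archive_suffixes=(".dat",), now=None):
    """Archive or delete files in `src_dir` older than `max_age_days`.

    Files whose suffix is in `archive_suffixes` are moved to `archive_dir`
    so they can be retrieved later; all other expired files are deleted.
    Returns two lists of file names: (archived, deleted).
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    archived, deleted = [], []
    for path in sorted(Path(src_dir).iterdir()):
        # Skip directories and files that have not expired yet.
        if not path.is_file() or path.stat().st_mtime >= cutoff:
            continue
        if path.suffix in archive_suffixes:
            shutil.move(str(path), str(archive_dir / path.name))
            archived.append(path.name)
        else:
            path.unlink()
            deleted.append(path.name)
    return archived, deleted
```

Keeping the archive in one known directory is what makes later retrieval straightforward: the file name alone is enough to find it again.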

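3.6 Handling Expired Data

Likewise, data has a life cycle. Once the vast majority of it is no longer used, it has to be dealt with, or performance, management, and storage will all become serious problems. For very large tables, even deleting that data takes a startling amount of time. Fortunately, the major database vendors offer partitioned tables, and detaching a table partition is quite fast. Of course, partitioning a table brings its own management, maintenance, and performance considerations. Make full use of the bulk-data operations the database provides to archive and delete expired data.

While doing this, stop other operations on the table as far as possible to avoid deadlocks or severe lock waits. This, too, is best implemented as a program.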
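Partition detach is database-specific, so it is not shown here. Where it is not available, a common fallback is to archive and delete in small batches so that each transaction stays short and holds locks only briefly. The sketch below shows that fallback with SQLite as a stand-in; the function name, table names, and the assumption that the archive table has the same columns as the source table are all hypothetical.

```python
import sqlite3


def archive_and_purge(conn, table, archive_table, date_col, cutoff,
                      batch_size=500):
    """Copy rows older than `cutoff` to `archive_table`, then delete them.

    Works one batch at a time and commits after each batch.
    Table and column names are interpolated into SQL and must be trusted.
    Returns the number of rows moved.
    """
    total = 0
    while True:
        ids = [row[0] for row in conn.execute(
            f"SELECT rowid FROM {table} WHERE {date_col} < ? LIMIT ?",
            (cutoff, batch_size),
        )]
        if not ids:
            break
        marks = ",".join("?" * len(ids))
        # Archive first, delete second, commit both together.
        conn.execute(
            f"INSERT INTO {archive_table} "
            f"SELECT * FROM {table} WHERE rowid IN ({marks})", ids)
        conn.execute(f"DELETE FROM {table} WHERE rowid IN ({marks})", ids)
        conn.commit()
        total += len(ids)
    return total
```

Because the insert and the delete for a batch are committed together, an interruption between batches leaves no row both archived and still present.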

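Chapter 4 ETL Implementation

4.1 Overall Implementation

With the design finished, implementation begins. During design you need not care which product or technology is used; during implementation you have to care a great deal about their strengths and weaknesses.

Now, following the design, set up how ETL interface files, temporary files, intermediate files, programs, and logs are stored, and decide how the parameters used in DS Jobs are stored and passed. Then develop the required programs and DS components.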

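A Job usually connects to the database through a database client. That requires configuring the client connection in the dsenv file under the Engine directory, and also in the .profile file of the DS user (dsadm by default).

In addition, the Scratch directory holds temporary files that DS generates itself while processing; when files are large, operations such as sorting can take up a lot of space there. The Datasets directory holds the files generated by the DS Data Set Stage. Both directories therefore need particular attention.

Parameters used in DS Jobs are generally handled in one of three ways. The first is to define them in DS Administrator and reference them in the Job. The second is to define them in a parameter file or parameter table and have a program read them and assign them to the corresponding Job. The third mixes the two: parameters that rarely change (such as the ETL root path, database user name, and password) are set in DS Administrator, while parameters that change often (such as the interface date) are defined in a parameter file or parameter table.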
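The mixed approach can be sketched as follows: frequently changing values are read from a parameter file and passed on the `dsjob` command line, while the fixed values stay in DS Administrator and are not passed at all. The `NAME=VALUE` file format, the parameter names, and both function names are assumptions for illustration; `-run`, `-param name=value`, and `-jobstatus` are standard `dsjob` options.

```python
def load_param_file(text):
    """Parse NAME=VALUE lines; blank lines and # comments are skipped."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        params[name.strip()] = value.strip()
    return params


def build_dsjob_command(project, job, run_params):
    """Build the dsjob argument list for one run.

    Only the changing parameters are passed; fixed ones are assumed to be
    defined at project level in DS Administrator.
    """
    cmd = ["dsjob", "-run", "-jobstatus"]
    for name, value in run_params.items():
        cmd += ["-param", f"{name}={value}"]
    return cmd + [project, job]
```

The resulting list can be handed to `subprocess.run` on a machine where the DataStage engine is installed.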

The parameters defined in DS Administrator are as follows: [screenshot not preserved]

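4.2 Scheduling

How complex the scheduling is determines how it should be done.

For simple scheduling, use a Sequence Job to chain the related Jobs together. Dependencies between different Sequence Jobs can be handled with message files or database records, and the schedule time is then configured directly in Director. Where execution logs need to be recorded, add one more layer: a generic Job whose Job Control uses BASIC to call the ETL Jobs beneath it in a generic way and log the results, taking the Job name, the time, and so on as parameters. In effect, this generic Job calls the ETL Job, and a Sequence Job then calls these generic Jobs.

More complex scheduling calls for a separately developed, standalone program. A configuration table can hold each Job's execution time, dependencies, priority, whether it runs in sequence or in any order, and similar information. A program (in Shell, Python, Java, and so on) then runs the Jobs in the appropriate way for each case, controls them through dsjob, the interface DS provides, and records the important steps in a log table. Operations such as reruns can be triggered by changing the status in the log table. The configured Jobs can be ETL Jobs or Sequence Jobs.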
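A standalone scheduler of this kind can be sketched in a few functions. The configuration layout (a dictionary with `deps` and `priority` per Job), the in-memory log list standing in for a log table, and all function names are assumptions for illustration. Treating exit code 0 from `dsjob -run -wait` as success is also an assumption; check the documented exit codes for your DataStage version.

```python
import subprocess


def plan_order(jobs):
    """Return a run order that respects dependencies.

    jobs: {name: {"deps": [names], "priority": int}}; among the Jobs that
    are ready, the one with the lowest priority number runs first.
    """
    done, order = set(), []
    pending = dict(jobs)
    while pending:
        ready = [n for n, cfg in pending.items() if set(cfg["deps"]) <= done]
        if not ready:
            raise ValueError(
                "circular or missing dependency: " + ", ".join(sorted(pending)))
        ready.sort(key=lambda n: (pending[n]["priority"], n))
        name = ready[0]
        order.append(name)
        done.add(name)
        del pending[name]
    return order


def run_all(jobs, runner, log):
    """Run Jobs in planned order; skip a Job if any dependency did not succeed."""
    status = {}
    for name in plan_order(jobs):
        if any(status.get(dep) != "ok" for dep in jobs[name]["deps"]):
            status[name] = "skipped"
        else:
            status[name] = "ok" if runner(name) else "failed"
        log.append((name, status[name]))  # stand-in for a log table insert
    return status


def dsjob_runner(project):
    """Return a runner that starts a Job with dsjob and waits for it."""
    def run(job):
        result = subprocess.run(["dsjob", "-run", "-wait", project, job])
        return result.returncode == 0  # assumption, see lead-in
    return run
```

A rerun then amounts to resetting the status of the failed and skipped Jobs and calling `run_all` again on that subset.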

4.3 Parallel Job VS Server Job

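ETL Jobs come in three types: Server Job, Parallel Job, and Mainframe Job (used only on mainframes); in practice it is usually a Server Job or a Parallel Job. Moving from one Job type to another feels as unfamiliar as switching from one development tool to another. Which suits a project better, Server Job or Parallel Job? Below is a comparison of the two (copied directly from the web):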

1) The basic difference between server and parallel Jobs is the degree of parallelism. Server Job Stages do not have a built-in partitioning and parallelism mechanism for extracting and loading data between different Stages.
• All you can do to enhance speed and performance in server Jobs is to enable inter-process row buffering through the Administrator. This helps Stages exchange data as soon as it is available on the link.
• You could also use the IPC Stage, which helps one passive Stage read data from another as soon as data is available. In other words, Stages do not have to wait for the entire set of records to be read first and then transferred to the next Stage. Link Partitioner and Link Collector Stages can be used to achieve a certain degree of partitioning parallelism.
• All of the above features, which have to be set up explicitly in server Jobs, are built into DataStage PX.
2) The PX engine runs on a multiprocessor system and takes full advantage of the processing nodes defined in the configuration file. Both SMP and MPP architectures are supported by DataStage PX.
3) PX takes advantage of both pipeline parallelism and partitioning parallelism. Pipeline parallelism means that as soon as data is available between Stages (in pipes or links), it can be exchanged between them without waiting for the entire record set to be read. Partitioning parallelism means that the entire record set is partitioned into small sets and processed on different nodes (logical processors). For example, if there are 100 records and 4 logical nodes, each node would process 25 records. This enhances the speed at which loading takes place to an amazing degree. Imagine situations where billions of records have to be loaded daily. This is where DataStage PX comes as a boon for the ETL process and surpasses all other ETL tools in the market.
4) In parallel Jobs we have the Data Set, which acts as the intermediate data storage in the linked list; it is the best storage option, as it stores the data in DataStage's internal format.
5) In parallel Jobs we can choose to display the OSH, which gives information about how the Job works.
6) In the Parallel Transformer there is no reference link possibility; in a server Stage a reference could be given to the Transformer. Parallel Stage can use both basic
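The partitioning example in point 3 (100 records over 4 logical nodes, 25 records each) can be reproduced with a small sketch. This only illustrates the idea of partitioning parallelism: threads stand in for logical nodes, round-robin is one of several possible partitioning methods, and none of this is how the PX engine is actually implemented. The function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def round_robin_partition(records, nodes):
    """Deal records out to `nodes` partitions in turn."""
    return [records[i::nodes] for i in range(nodes)]


def process_partitioned(records, nodes, transform):
    """Apply `transform` to each partition on its own worker.

    Returns (partitions, results); results[i] holds the transformed
    records of partitions[i].
    """
    parts = round_robin_partition(records, nodes)
    with ThreadPoolExecutor(max_workers=nodes) as pool:
        results = list(pool.map(lambda part: [transform(r) for r in part], parts))
    return parts, results
```

Calling `process_partitioned(list(range(100)), 4, str)` splits the input into four partitions of 25 records, each handled by a separate worker.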
