Simplifying Data Extraction with Linux Text Utilities (Part 1)

Published: 2006-11-03 10:23:37  Source: 红联  Author: xcoolo
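  Many Linux® system administrators have to do the tedious work of sorting through plain-text configuration files. Fortunately, Linux has many data extraction tools inherited from UNIX®, including head, tail, grep, egrep, fgrep, cut, paste, join, and awk. This article gives several real-world examples that show how these simple command-line tools can serve system administration work. It introduces each tool and its operation in turn, applies the tools to typical files used in everyday work, and explains why they matter for extracting data from those files.

  A Linux system holds many kinds of files: configuration files, text files, documentation files, log files, user files, and the list keeps growing. These files usually contain information you need to reach in order to find important data. Standard tools such as cat and more can print the contents of most files to the screen, but the system also offers better-suited tools that filter and process text so that you see only the content you care about.

  As you read this article, you can open a shell and try the examples for each tool.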

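  Regular expressions

  Before we start, we need to understand what a regular expression is and how to use one.

  In its simplest form, a regular expression is a search criterion used to locate text in a file. For example, to find every line that contains the word "admin", we search for "admin"; "admin" is therefore a regular expression. If we also want to replace "admin" with "root", we can use the appropriate command in a tool that supports substitution. In that case "admin" is the regular expression being matched, and "root" is the replacement text.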
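
  The search and the search-and-replace described above might look like the following minimal sketch. The file sample.conf and its contents are hypothetical, and sed is only one of several tools that could perform the substitution.

```shell
# Hypothetical sample data; the file name and contents are assumptions.
tmpdir=$(mktemp -d)
printf '%s\n' 'user=admin' 'shell=/bin/sh' 'owner=admin' > "$tmpdir/sample.conf"

# Search: keep only the lines that match the regular expression "admin".
matches=$(grep 'admin' "$tmpdir/sample.conf")

# Search and replace: sed's s command swaps every match for "root".
replaced=$(sed 's/admin/root/g' "$tmpdir/sample.conf")

printf 'Matching lines:\n%s\n' "$matches"
printf 'After substitution:\n%s\n' "$replaced"
```

  Neither command changes sample.conf itself; both write their result to standard output.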

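  Some basic rules of regular expressions are as follows:

  ● Any single character or string of characters matches itself, as in the "admin" example above.

  ● The caret (^) marks the start of a line; the dollar sign ($) marks the end of a line.

  ● To search for a special character such as the dollar sign, put a backslash (\) in front of it. For example, \$ searches for a literal $ rather than the end of a line.

  ● A dot (.) stands for any single character. For example, ad..n matches a five-character string whose first two characters are "ad" and whose last character is "n". The two characters in between can be anything, but there must be exactly two of them.

  ● In editors such as ed and vi, a regular expression enclosed in slashes (for example, /re/) searches forward through the file, and one enclosed in question marks (for example, ?re?) searches backward. grep takes the pattern as a command-line argument and does not use these delimiters.

  ● Square brackets ([]) denote a set of values, and a hyphen (-) denotes a range. For example, [0-9] is the same as [0123456789], and [a-z] matches any lowercase letter. If the first character of the list is a caret (^), the expression matches any character that is not in the list. Table 1 shows how these rules match in practice.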

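  Table 1. Sample regular expressions

Quote:
Example        Description
[abc]          Matches one of "a", "b", or "c"
[a-z]          Matches any one lowercase letter from "a" to "z"
[A-Z]          Matches any one uppercase letter from "A" to "Z"
[0-9]          Matches any one digit from 0 to 9
[^0-9]         Matches any character other than a digit from 0 to 9
[-0-9]         Matches any digit from 0 to 9, or a hyphen (-)
[0-9-]         Matches any digit from 0 to 9, or a hyphen (-)
[^-0-9]        Matches any character other than a digit from 0 to 9 or a hyphen (-)
[a-zA-Z0-9]    Matches any letter or digit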
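
  To see the rules and the bracket expressions above in action, the following sketch counts matching lines in a small file. The file rules.txt and its contents are invented for illustration only.

```shell
# Hypothetical sample data; the file name and contents are assumptions.
tmpdir=$(mktemp -d)
cat > "$tmpdir/rules.txt" <<'EOF'
admin
sysadmin
adwin
cost: $5
room 42
no-digits
EOF

# grep -c prints the number of matching lines instead of the lines themselves.
starts=$(grep -c '^admin' "$tmpdir/rules.txt")       # lines that begin with "admin"
ends=$(grep -c 'admin$' "$tmpdir/rules.txt")         # lines that end with "admin"
dot=$(grep -c 'ad..n' "$tmpdir/rules.txt")           # "ad", any two characters, then "n"
dollar=$(grep -c '\$' "$tmpdir/rules.txt")           # a literal dollar sign
digit=$(grep -c '[0-9]' "$tmpdir/rules.txt")         # at least one digit
dash_or_digit=$(grep -c '[-0-9]' "$tmpdir/rules.txt") # a digit or a hyphen
no_digit=$(grep -c '^[^0-9]*$' "$tmpdir/rules.txt")  # no digit anywhere on the line

echo "$starts $ends $dot $dollar $digit $dash_or_digit $no_digit"
```

  The single quotes keep the shell from touching the $ and the brackets, so grep receives each pattern exactly as written.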


  With this background in place, let's look at the tools themselves.
Comments

  2. xcoolo posted on 2006-11-03 10:26:20:

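      Notice that there are two matches: the "market" we were looking for, and "marketing". If the file also contained words such as "marketable" or "marketed", the command above would print the lines containing those words as well.

      grep accepts wildcards and metacharacters in the pattern. I strongly recommend enclosing them in quotes so that the shell passes them to grep unchanged instead of interpreting them itself.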
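
      A small sketch of why the quotes matter follows. It assumes a scratch directory that happens to contain a file named 7; the file names and contents are hypothetical. Without quotes, the shell treats [0-9] as a file name pattern and replaces it before grep ever runs.

```shell
# Hypothetical scratch directory; file names and contents are assumptions.
tmpdir=$(mktemp -d)
orig_dir=$PWD
cd "$tmpdir" || exit 1
printf '%s\n' 'line one' 'line 2' > notes.txt
: > 7    # an empty file whose name matches the shell glob [0-9]

# echo shows the command line that grep would actually receive.
unquoted=$(echo grep [0-9] notes.txt)    # the shell expands [0-9] to the file name 7
quoted=$(echo grep "[0-9]" notes.txt)    # the pattern is passed through untouched

# The quoted pattern finds the line with a digit; the unquoted one searches for "7".
found=$(grep "[0-9]" notes.txt)
missed=$(grep [0-9] notes.txt || true)

cd "$orig_dir" || exit 1
printf '%s\n' "$unquoted" "$quoted" "found: $found" "missed: $missed"
```

      The unquoted form only misbehaves when a matching file name exists in the current directory, which is what makes the problem easy to overlook.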

      To find all lines that contain a digit, use the following command:

    Quote:
    # grep "[0-9]" memo
    1. RICKY PONTING
    2. GREEME SMITH
    3. STEPHEN FLEMING
    4. BORIS BAKER
    5. SACHIN TENDULKAR
    6. BRIAN LARA
    7. SHANE WARNE


      To find all lines that contain "the", use the following command:

    Quote:
    # grep the memo
    In order to better serve the needs of our mass
    market customers, ABC Publishing is
    integrating the groups selling to this channel
    for ABC General Reference and ABC Computer
    Publishing. This change will allow us to
    better coordinate our selling and marketing
    efforts, as well as simplify ABC's
    relationships with these customers in the
    areas of customer service, co-op management,
    and credit and collection. Two national
    account managers, Ricky Ponting and Greeme
    Smith, have joined the sales team as a result
    of these changes.

    To achieve this goal, we have also organized
    the new mass sales group into three distinct
    teams reporting to our current sales
    directors, Stephen Fleming and Boris Baker. I
    have outlined below the national account
    managers and their respective accounts in each
    of the teams. We have also hired two new
    national account managers and a new sales
    administrator to complete our account
    coverage. They include:


      As you may have noticed, the output includes the word "these" as well as exact matches of the word "the".
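
      If only the standalone word is wanted, one possible approach is the -w option, which GNU grep and the BSD grep implementations provide (it is not part of the POSIX grep specification, so treat its availability as an assumption). The sample file below is hypothetical.

```shell
# Hypothetical sample data; the file name and contents are assumptions.
tmpdir=$(mktemp -d)
printf '%s\n' 'serve the needs' 'with these customers' 'their accounts' > "$tmpdir/words.txt"

# Plain search: "the" also matches inside "these" and "their".
plain=$(grep -c 'the' "$tmpdir/words.txt")

# -w restricts matches to whole words, so only the first line counts.
whole=$(grep -cw 'the' "$tmpdir/words.txt")

echo "substring matches: $plain, whole-word matches: $whole"
```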

      Like many other UNIX/Linux tools, grep is case sensitive, which means that searching for "The" and searching for "the" give completely different results.

    Quote:
    # grep The memo
    To achieve this goal, we have also organized
    the new mass sales group into three distinct
    teams reporting to our current sales
    directors, Stephen Fleming and Boris Baker. I
    have outlined below the national account
    managers and their respective accounts in each
    of the teams. We have also hired two new
    national account managers and a new sales
    administrator to complete our account
    coverage. They include:


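      If you want to find a particular word or phrase but do not care about its case, there are two ways to do it. The first is to use square brackets to search for both "The" and "the", as shown below. A bracket expression lists its characters with no separators; a comma or space placed inside the brackets would itself be treated as a character to match.

    Quote:
    # grep "[Tt]he" memo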
    In order to better serve the needs of our mass
    market customers, ABC Publishing is
    integrating the groups selling to this channel
    for ABC General Reference and ABC Computer
    Publishing. This change will allow us to
    better coordinate our selling and marketing
    efforts, as well as simplify ABC's
    relationships with these customers in the
    areas of customer service, co-op management,
    and credit and collection. Two national
    account managers, Ricky Ponting and Greeme
    Smith, have joined the sales team as a result
    of these changes.

    To achieve this goal, we have also organized
    the new mass sales group into three distinct
    teams reporting to our current sales
    directors, Stephen Fleming and Boris Baker. I
    have outlined below the national account
    managers and their respective accounts in each
    of the teams. We have also hired two new
    national account managers and a new sales
    administrator to complete our account
    coverage. They include:


      The second way is to use the -i option, which tells grep to ignore case.

    Quote:
    # grep -i the memo
    In order to better serve the needs of our mass
    market customers, ABC Publishing is
    integrating the groups selling to this channel
    for ABC General Reference and ABC Computer
    Publishing. This change will allow us to
    better coordinate our selling and marketing
    efforts, as well as simplify ABC's
    relationships with these customers in the
    areas of customer service, co-op management,
    and credit and collection. Two national
    account managers, Ricky Ponting and Greeme
    Smith, have joined the sales team as a result
    of these changes.

    To achieve this goal, we have also organized
    the new mass sales group into three distinct
    teams reporting to our current sales
    directors, Stephen Fleming and Boris Baker. I
    have outlined below the national account
    managers and their respective accounts in each
    of the teams. We have also hired two new
    national account managers and a new sales
    administrator to complete our account
    coverage. They include:


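    Besides -i, several other command-line options change what grep prints. The most common ones are:

    ● -c ---- Suppress normal output; instead, print a count of matching lines for each input file.

    ● -l ---- Suppress normal output; instead, print the name of each input file that contains at least one match.

    ● -n ---- Prefix each line of output with its line number in the input file.

    ● -v ---- Invert the sense of matching, that is, select the lines that do not match the search pattern.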
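
    The four options listed above can be tried on two small files, as in the sketch below. The files a.txt and b.txt and their contents are hypothetical.

```shell
# Hypothetical sample data; the file names and contents are assumptions.
tmpdir=$(mktemp -d)
printf '%s\n' 'the first line' 'a second line' 'the third line' > "$tmpdir/a.txt"
printf '%s\n' 'unrelated text' > "$tmpdir/b.txt"

count=$(grep -c 'the' "$tmpdir/a.txt")                 # -c: number of matching lines
files=$(cd "$tmpdir" && grep -l 'the' a.txt b.txt)     # -l: names of files with a match
numbered=$(grep -n 'the' "$tmpdir/a.txt")              # -n: matching lines with line numbers
inverted=$(grep -v 'the' "$tmpdir/a.txt")              # -v: lines that do not match

printf '%s\n' "count: $count" "files: $files" "$numbered" "inverted: $inverted"
```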

  3. xcoolo posted on 2006-11-03 10:24:33:

      grep

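      grep works by searching each line of a file for the first occurrence of a given string. If the string is found, the line is printed; otherwise the line is not printed. The file below, which I call "memo", is used to illustrate how grep is used and what it returns.

    Quote: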
    To: All Employees

    From: Human Resources

    In order to better serve the needs of our mass market customers,
    ABC Publishing is integrating the groups selling to this channel
    for ABC General Reference and ABC Computer Publishing. This change
    will allow us to better coordinate our selling and marketing efforts,
    as well as simplify ABC's relationships with these customers in the areas
    of customer service, co-op management, and credit and collection.
    Two national account managers, Ricky Ponting and Greeme Smith,
    have joined the sales team as a result of these changes.
    To achieve this goal, we have also organized the new mass sales
    group into three distinct teams reporting to our current sales
    directors, Stephen Fleming and Boris Baker. I have outlined below
    the national account managers and their respective accounts in
    each of the teams. We have also hired two new national account
    managers and a new sales administrator to complete our account
    coverage. They include:

    Sachin Tendulkar, who joins us from XYZ Consumer Electronics
    as a national account manager covering traditional mass merchants.

    Brian Lara, who comes to us via PQR Company and will be responsible
    for managing our West Coast territory.

    Shane Warne, who will become an account administrator for our
    warehouse clubs business and joins us from DEF division.

    Effectively, we have seven new faces on board:

    1. RICKY PONTING
    2. GREEME SMITH
    3. STEPHEN FLEMING
    4. BORIS BAKER
    5. SACHIN TENDULKAR
    6. BRIAN LARA
    7. SHANE WARNE

    Please join me in welcoming each of our new team members.


      As a simple example, the best way to find the line that contains the word "welcoming" is the following command line:

    Quote:
    # grep welcoming memo
    Please join me in welcoming each of our new team members.
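
      The line-by-line behavior described for grep can be approximated with a plain shell loop. This is only an illustration of the idea for a fixed string, not how grep is implemented; the function name search_lines and the two-line sample file are hypothetical.

```shell
# Hypothetical two-line sample standing in for the memo file.
tmpdir=$(mktemp -d)
printf '%s\n' 'To: All Employees' \
    'Please join me in welcoming each of our new team members.' > "$tmpdir/memo_sample"

# Read the file one line at a time and print each line that contains the string.
search_lines() {
    needle=$1
    file=$2
    while IFS= read -r line; do
        case $line in
            *"$needle"*) printf '%s\n' "$line" ;;   # string found: print the whole line
        esac                                        # otherwise print nothing
    done < "$file"
}

loop_out=$(search_lines welcoming "$tmpdir/memo_sample")
grep_out=$(grep welcoming "$tmpdir/memo_sample")
printf '%s\n' "$loop_out"
```

      For a plain string such as "welcoming", the loop and grep select the same line; grep additionally understands regular expressions, which the loop does not.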


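      If you search for the word "market", the result is quite different, as shown below. grep prints each matching line in full, and the output here is an entire paragraph, which suggests that each paragraph of memo is stored as one long line (wrapped here for display).

    Quote: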
    # grep market memo
    In order to better serve the needs of our mass
    market customers, ABC Publishing is
    integrating the groups selling to this channel
    for ABC General Reference and ABC Computer
    Publishing. This change will allow us to
    better coordinate our selling and marketing
    efforts, as well as simplify ABC's
    relationships with these customers in the
    areas of customer service, co-op management,
    and credit and collection. Two national
    account managers, Ricky Ponting and Greeme
    Smith, have joined the sales team as a result
    of these changes.