Parsing e-mail addresses and URLs from text
Extracting specific pieces of text from a file is a common task in text processing. Items such as e-mail addresses and URLs can be located with suitable regular expressions. A typical case is pulling e-mail addresses out of an e-mail client's contact list, which is cluttered with unwanted characters and words, or out of an HTML web page.
How to do it...
The regular expression pattern to match an e-mail address is as follows:
[A-Za-z0-9._]+@[A-Za-z0-9.]+\.[a-zA-Z]{2,4}
For example:
$ cat url_email.txt
this is a line of text contains,<email> #[email protected]. </email> and email address, blog "http://www.google.com", [email protected] dfdfdfdddfdf;[email protected]<br /> <a href="http://code.google.com"><h1>Heading</h1>
As we are using extended regular expressions (+, for instance), we should use egrep.
$ egrep -o '[A-Za-z0-9._]+@[A-Za-z0-9.]+\.[a-zA-Z]{2,4}' url_email.txt
[email protected]
test@yahoo...
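The extraction can also be tried end to end with a small self-contained script. This is only a sketch: the sample text and the addresses in it are made-up placeholders (they are not the contents of url_email.txt), and it uses grep -E, which is equivalent to egrep and is the spelling that current GNU grep releases recommend.

```shell
#!/bin/sh
# Hypothetical sample input; these addresses are placeholders for illustration.
sample=$(mktemp)
cat > "$sample" <<'EOF'
contact <email>#alice@example.com.</email> or blog "http://www.example.com",
bob.smith@example.org junk;carol_01@mail.example.net<br />
EOF

# The same pattern as in the recipe above.
pattern='[A-Za-z0-9._]+@[A-Za-z0-9.]+\.[a-zA-Z]{2,4}'

# -E enables extended regular expressions (needed for + and {2,4});
# -o prints only the matched text, one match per line.
matches=$(grep -E -o "$pattern" "$sample")
printf '%s\n' "$matches"

# Clean up the temporary file.
rm -f "$sample"
```

Each address is printed on its own line, without the surrounding markup or punctuation. Note that the {2,4} bound limits the final domain label to four letters, so an address ending in a longer top-level domain would be matched only partially.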