Script to extract domain URLs from email (Source Not Found error)

Script command:

python dynurl.py beta-missing.txt "url=" "," a.txt
TODO: turn this into a general-purpose file extractor
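import sys

# Usage: python dynurl.py <input_file> <start_pattern> <end_pattern> <output_file>
input_filename = sys.argv[1]
start_pattern = sys.argv[2]
end_pattern = sys.argv[3]
output_filename = sys.argv[4]

KNOWN_TLDS = ('com', 'biz', 'ca', 'info', 'net', 'org', 'uk', 'us')
d = []

with open(input_filename, "r") as f, open(output_filename, "a+") as f1:
    for line_feed in f:
        start = line_feed.find(start_pattern)
        if start == -1:
            continue
        # Keep the text between start_pattern and end_pattern.
        content = line_feed[start + len(start_pattern):]
        end = content.find(end_pattern)
        if end > -1:
            content = content[:end]
        # Drop everything up to the first dot (scheme and first host label).
        domain = content[content.find(".") + 1:]
        for x in KNOWN_TLDS:
            # Accept the TLD only when it follows a dot and is not
            # the start of a longer label (e.g. "com" in "community").
            pos = domain.find("." + x)
            after = domain[pos + len(x) + 1:pos + len(x) + 2]
            if pos > -1 and not after.isalnum():
                ps = "www." + domain[:pos] + "." + x
                d.append(ps)
                f1.write(ps + "\n")  # one domain per line
                break
        else:
            # for/else: runs only when no known TLD matched.
            print("Unknown TLDs")

# Print each distinct domain once, in sorted order.
for x in sorted(set(d)):
    print(x)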
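
For the TODO, here is a minimal sketch of what a more general extractor could look like. It only pulls out the text between a start pattern and an end pattern and leaves the domain handling to the caller. The names `extract_between` and `extract_all` are hypothetical, and the sample lines are made up to match the shape of the command above.

```python
def extract_between(line, start_pattern, end_pattern):
    """Return the text between the first start_pattern and the next
    end_pattern, or None when start_pattern is not in the line."""
    start = line.find(start_pattern)
    if start == -1:
        return None
    rest = line[start + len(start_pattern):]
    end = rest.find(end_pattern)
    # If the end pattern is missing, keep the rest of the line.
    return (rest if end == -1 else rest[:end]).strip()


def extract_all(lines, start_pattern, end_pattern):
    """Collect the distinct extracted values from an iterable of lines."""
    found = set()
    for line in lines:
        value = extract_between(line, start_pattern, end_pattern)
        if value:
            found.add(value)
    return sorted(found)


# Made-up sample lines (assumption: the real input looks roughly like this).
sample = [
    "id=1,url=http://www.example.com/a,status=404\n",
    "no match on this line\n",
    "id=2,url=http://www.example.org/b\n",
    "id=3,url=http://www.example.com/a,status=404\n",
]
for value in extract_all(sample, "url=", ","):
    print(value)
```

Since `extract_all` accepts any iterable of lines, an open file object can be passed in directly instead of the sample list.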