Regex Problems

1. Efficiently Get a list of all complete URLs that use https protocol



In [1]:

    
urls = ['http://www.domain.com',
        'https://somedomain.com',
        'http://my-domain-123.net',
        'https://google.com',
        'http://www.foo.com',
        'https://bar-baz3.com',
        'ftp://domain2.com']



In [3]:

    
import re



In [28]:

    
# A complete match checking for the presence of some alphanumeric after the // followed
# by at least one group of .alphanumeric could be 'https://\w+(.\w+)+'
# I don't check if it ends with .com, .net or others but there are too many options here

[m.string for u in urls for m in [re.search('https://', u)] if m]









    Out[28]:





['https://somedomain.com', 'https://google.com', 'https://bar-baz3.com']

2. Get domains (without protocols) (including extension, e.g. .com) for URLs with both http and https protocols.



In [31]:

    
[m.group(1) for u in urls for m in [re.search('https?://(\w+(.\w+)+)', u)] if m]









    Out[31]:





['www.domain.com',
 'somedomain.com',
 'my-domain-123.net',
 'google.com',
 'www.foo.com',
 'bar-baz3.com']

3. Below is a list of language codes. Determine how many are some form of English

English codes will start with En, en, or EN



In [32]:

    
languages = ['Ar','It','it','En','En_gb','jp','en_GB','EN_IE','en-NZ','en','es','ES-es']



In [37]:

    
sum([1 for c in languages for m in [re.search('^en', c, re.IGNORECASE)] if m])









    Out[37]:





6

4. Fix all langauge codes so that they are formatted as such:

first two letters are lower case
codes with region endings use hyphen and not underscore
the region endings are upper case



In [67]:

    
[m.group(1).lower() + ('-' + m.group(4).upper() if m.group(2) else '')\
 for c in languages for m in [re.search('(^[A-Za-z]{2})((.)([A-Za-z]{2}$))?', c)] if m]









    Out[67]:





['ar',
 'it',
 'it',
 'en',
 'en-GB',
 'jp',
 'en-GB',
 'en-IE',
 'en-NZ',
 'en',
 'es',
 'es-ES']