Regular Expression Syntax

2020. 3. 1. 14:58ㆍPython programming

1. re module: built-in module for regular expressions

2. re.search() function

- Takes two required arguments: the regex pattern, and the string we want to search for that pattern

- We create a set by placing the characters we want to match in square brackets [ ]
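A minimal sketch of what re.search() returns, assuming two made-up sample strings (they are not from the original notes): a match object when the pattern is found, and None otherwise.

```python
import re

# Hypothetical sample strings, chosen only to illustrate the two outcomes.
hit = re.search(r"[msb]end", "Please send the file")
miss = re.search(r"[msb]end", "Nothing here")

print(hit.group())  # the text that matched
print(hit.span())   # (start, end) indices of the match
print(miss)         # None when the pattern is not found
```

Because None is falsy and a match object is truthy, the result can be used directly in an if statement.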

Example)

import re

string_list = ["Julie's favorite color is Blue.",
               "Keli's favorite color is Green.",
               "Craig's favorite colors are blue and red."]

blue_mentions = 0
pattern = "[Bb]lue"  # matches "Blue" or "blue"

# Count the strings that contain the pattern
for s in string_list:
    if re.search(pattern, s):
        blue_mentions += 1

print(blue_mentions)

>> 2

3. Series.str.contains() method

Compared with loops like the one in the re.search() example above, pandas vectorized methods are often faster and require less code.

In addition, the result is a boolean mask: a Series of True/False values, so the matches can be counted by summing it.

Example 1)

import pandas as pd

eg_list = ["Julie's favorite color is green.",
           "Keli's favorite color is Blue.",
           "Craig's favorite colors are blue and red."]

eg_series = pd.Series(eg_list)

pattern = "[Bb]lue"

pattern_contained = eg_series.str.contains(pattern)
print(pattern_contained)

>> 0    False
1     True
2     True
dtype: bool

Example 2) Keep only the rows of the title column in the hn DataFrame that match the pattern

titles = hn['title']

pattern = "[Rr]uby"

ruby_titles = titles[titles.str.contains(pattern)]

4. quantifier
- A quantifier specifies how many of the previous character our pattern requires, which helps when we want to match substrings of specific lengths

Example 1) Match the numbers in text from 1000 to 2999

Example 2) Keep only the rows that contain e-mail or email

pattern = "e-?mail"

email_bool = titles.str.contains(pattern)

email_count = email_bool.sum()

email_titles = titles[email_bool]
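For the 1000 to 2999 case mentioned above, the notes do not preserve the pattern itself. One pattern that should work is sketched below; the sample text is made up, and the word boundary anchors (\b, covered later in these notes) are an added assumption so that longer numbers are not partially matched.

```python
import re

# Hypothetical sample text for illustration.
text = "Founded in 1999, relaunched in 2024, sold 3500 units and 15000 tickets."

# [1-2] matches the first digit, [0-9]{3} requires exactly three more digits.
year_like = re.findall(r"\b[1-2][0-9]{3}\b", text)
print(year_like)

# ? makes the preceding character optional (zero or one occurrence).
with_hyphen = bool(re.search(r"e-?mail", "e-mail"))
without_hyphen = bool(re.search(r"e-?mail", "email"))
print(with_hyphen, without_hyphen)
```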

5. regex character classes

Ranges can be used for letters as well as numbers.
Sets and ranges can be combined.

- In order to match word characters between our brackets, we can combine the word character class (\w) with the 'one or more' quantifier (+), giving us a combined pattern of \w+.

- use backslashes to escape the [ and ] characters.

Example) "[pdf]" matches a single character ("p" or "d" or "f"), whereas "\[pdf\]" matches the literal text "[pdf]"

- Summary

We can use a backslash to escape characters that have special meaning in regular expressions (e.g. \[ will match an open bracket character).
Character classes let us match certain groups of characters (e.g. \w will match any word character).
Character classes can be combined with quantifiers when we want to match different numbers of characters.
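The difference between a set and escaped brackets can be checked with a short sketch. It reuses one of the titles that appears later in these notes; the variable names are only illustrative.

```python
import re

title = "File indexing and searching for Plan 9 [pdf]"

# Unescaped brackets define a set: this finds the first single
# character in the title that is p, d, or f.
set_match = re.search(r"[pdf]", title).group()

# Escaped brackets are literal, and \w+ matches the word characters between them.
tag_match = re.search(r"\[\w+\]", title).group()

print(set_match)
print(tag_match)
```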

6. escape sequence

- \b : In an ordinary Python string, the escape sequence \b represents a backspace, so when the string is printed the letter before it is erased.

ex) print('hello\b world') >> hell world

- Raw strings: we create one by prefixing the string with the r character, which stops Python from interpreting backslash escape sequences.

We strongly recommend using raw strings for every regex you write, rather than remembering which sequences are escape sequences and using raw strings selectively. That way, you'll never encounter a situation where you forget or overlook something that causes your regex to break.

ex) print(r'hello\b world') >> hello\b world
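A small sketch of why this matters for regexes, using the \b word boundary anchor covered later in these notes. The sample text is made up for illustration.

```python
import re

plain = "\bJava\b"   # Python turns each \b into a backspace character
raw = r"\bJava\b"    # the backslashes are kept for the regex engine

print(len(plain), len(raw))

text = "I like Java a lot"  # hypothetical sample text
plain_result = re.search(plain, text)  # looks for literal backspace characters
raw_result = re.search(raw, text)      # looks for the whole word Java

print(plain_result)
print(raw_result.group())
```

Without the r prefix the pattern silently searches for backspace characters, so it fails without raising any error.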

- capture groups : to specify one or more groups within our match that we can access separately

Example) Original data

67      Analysis of 114 propaganda sources from ISIS, Jabhat al-Nusra, al-Qaeda [pdf]
101                                Munich Gunman Got Weapon from the Darknet [German]
160                                      File indexing and searching for Plan 9 [pdf]
163    Attack on Kunduz Trauma Centre, Afghanistan  Initial MSF Internal Review [pdf]
196                                            [Beta] Speedtest.net  HTML5 Speed Test

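pattern = r"(\[\w+\])"  # capture the square brackets together with the text inside, which must be one or more word characters
tag_5_matches = tag_5.str.extract(pattern, expand=False)  # expand=False returns a Series, as in the output below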

>> 67        [pdf]
101    [German]
160       [pdf]
163       [pdf]
196      [Beta]
Name: title, dtype: object

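pattern = r"\[(\w+)\]"  # capture only the text inside the square brackets, which must be one or more word characters
tag_5_matches = tag_5.str.extract(pattern, expand=False)  # expand=False returns a Series, as in the output below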

>> 67        pdf
101    German
160       pdf
163       pdf
196      Beta
Name: title, dtype: object

7. RegExr
: A tool for building regular expressions; it includes syntax highlighting, instant matches, and a regex syntax reference.

8. Negative character classes

: Character classes that match every character except those in the specified set or class (for example, [^abc] matches any character other than a, b, or c).
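A brief sketch of two common forms, a negated set and a negated shorthand class. The sample text is made up for illustration.

```python
import re

text = "Room 42, Floor 7"  # hypothetical sample text

# A negated set: runs of anything except digits, whitespace, and commas.
words = re.findall(r"[^\d\s,]+", text)

# \D is the negation of \d: runs of non-digit characters.
non_digits = re.findall(r"\D+", text)

print(words)
print(non_digits)
```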

9. word boundary anchor

- using the syntax \b. A word boundary matches the position between a word character and a non-word character, or a word character and the start/end of a string.

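ex)

pattern_2 = r"\bJava\b"

m2 = re.search(pattern_2, string)  # string is the text being searched; its definition is not shown in these notes
print(m2)

>> <_sre.SRE_Match object; span=(41, 45), match='Java'>

ex) Same idea as above: because the anchors mark a boundary at the left and right edges of the word, words such as JavaScript or JavaOwner are not matched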

pattern = r"\b[Jj]ava\b"

java_titles = titles[titles.str.contains(pattern)]
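The effect of the anchor can be seen by running the same search with and without it. The sample string below is an assumption, since the notes do not show how string was defined; it was chosen because it gives the same span as the output above.

```python
import re

# Assumed sample string (not shown in the original notes).
string = "Sometimes people confuse JavaScript with Java"

# Without anchors, the first hit is the "Java" inside "JavaScript".
unanchored = re.search(r"Java", string).span()

# With word boundaries on both sides, only the standalone word matches.
anchored = re.search(r"\bJava\b", string).span()

print(unanchored)
print(anchored)
```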

10. Beginning anchor (^) and end anchor ($), which represent the start and the end of the string, respectively.
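A minimal sketch of both anchors, checking whether a bracketed tag sits at the very start or the very end of a title. The two titles are shortened, made-up variants of titles in these notes.

```python
import re

# Shortened, hypothetical titles used only for illustration.
titles = ["[Beta] Speedtest.net", "File indexing for Plan 9 [pdf]"]

results = []
for t in titles:
    starts_with_tag = bool(re.search(r"^\[\w+\]", t))  # tag at the very start
    ends_with_tag = bool(re.search(r"\[\w+\]$", t))    # tag at the very end
    results.append((starts_with_tag, ends_with_tag))
    print(t, starts_with_tag, ends_with_tag)
```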

11. flags

- re.search() has an optional flags argument. It accepts one or more flags, which are special variables in the re module that modify the behavior of the regex interpreter.

- The most commonly used one is the re.IGNORECASE flag (usually written re.I). We use re.I, the ignorecase flag, to make our pattern case insensitive. There are often too many capitalization variants to spell out in a regex one by one, so this lets the engine account for them automatically.

ex)

email_tests = pd.Series(['email', 'Email', 'eMail', 'EMAIL'])
email_tests.str.contains(r"email")

>> 0     True
1    False
2    False
3    False
dtype: bool

However, with the flags option, the pattern matches regardless of uppercase or lowercase.

import re
email_tests.str.contains(r"email",flags=re.I)

>> 0    True
1    True
2    True
3    True
dtype: bool

ex)

import re

email_tests = pd.Series(['email', 'Email', 'e Mail', 'e mail', 'E-mail',
                         'e-mail', 'eMail', 'E-Mail', 'EMAIL', 'emails', 'Emails',
                         'E-Mails'])
pattern = r"\be[\-\s]?mails?"
email_mentions = titles.str.contains(pattern, flags=re.I).sum()
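As a quick check, the same pattern can be run against the test values with plain re, without pandas. This is only a sketch: the list copies email_tests from above, and "female" is an added, made-up counterexample.

```python
import re

email_tests = ['email', 'Email', 'e Mail', 'e mail', 'E-mail',
               'e-mail', 'eMail', 'E-Mail', 'EMAIL', 'emails', 'Emails',
               'E-Mails']
pattern = r"\be[\-\s]?mails?"

# Keep the variants that the pattern finds when case is ignored.
matched = [s for s in email_tests if re.search(pattern, s, flags=re.I)]
print(len(matched), "of", len(email_tests))

# The leading \b stops the pattern from matching inside another word.
inside_word = re.search(pattern, "female", flags=re.I)
print(inside_word)
```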

12. Extracting matches with extract

Method 1 > Use the Series.str.extract() method.

Method 2 > Use a regex capture group by wrapping the part of our pattern we want to capture in parentheses. If we want to capture the whole pattern, we just wrap the whole pattern in a pair of parentheses:

ex 1) To extract the three consecutive letters SQL (and only those), rather than matching s or q or l individually, wrap the start and end of the pattern in parentheses

pattern = r"(SQL)"

sql_capitalizations = titles.str.extract(pattern, flags=re.I)  # re.I picks up both uppercase and lowercase variants

ex 2) To extract every word that contains SQL (one or more word characters followed by SQL)
pattern = r"(\w+SQL)"

sql_flavors = titles.str.extract(pattern, flags=re.I, expand=False)  # expand=False returns a Series

sql_flavors_freq = sql_flavors.value_counts()

print(sql_flavors_freq)

>> PostgreSQL    27
NoSQL         16
MySQL         12
CloudSQL       1
SparkSQL       1
MemSQL         1
nosql          1
mySql          1
Name: title, dtype: int64

ex 3) Original data
Developing a computational pipeline using the asyncio module in Python 3
Python 3 on Google App Engine flexible environment now in beta
Python 3.6 proposal, PEP 525: Asynchronous Generators
How async/await works in Python 3.5.0
Ubuntu Drops Python 2.7 from the Default Install in 16

# Find the Python versions

# The regular expression should contain a capture group for the digit and period characters (the Python versions)

pattern = r"[Pp]ython ([\d\.]+)"

# Extract the Python versions

test = titles.str.extract(pattern, expand=False)  # expand=False returns a Series

# to create a dictionary frequency table of the extracted Python versions

py_versions_freq = dict(test.value_counts())

ex 4) Extracting mentions of the "C" language. The original data is below.

14                                    Custom Deleters for C++ Smart Pointers
221                                           Lisp, C++: Sadness in my heart
222                                     MemSQL (YC W11) Raises $36M Series C
354     VW C.E.O. Personally Apologized to President Obama in Plea for Mercy
366                                         The new C standards are worth it
445                              Moz raises $10m Series C from Foundry Group
509              BDE 3.0 (Bloomberg's core C++ library): Open Source Release
522                             Fuchsia: Micro kernel written in C by Google
550               How to Become a C.E.O.? The Quickest Path Is a Winding One
1283                      A lightweight C++ signals and slots implementation
Name: title, dtype: object

# Let's use a negative set to prevent matches for the + character and the . character. This excludes:

# - mentions of C++, a distinct language from C.

# - cases where the letter C is followed by a period, like in the substring C.E.O.

pattern = r"\b[Cc]\b[^.+]"
first_ten = first_10_matches(pattern)  # first_10_matches is a helper whose definition is not shown in these notes

The new C standards are worth it

Moz raises $10m Series C from Foundry Group

Fuchsia: Micro kernel written in C by Google

13. lookarounds

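As seen above, we were trying to find the C language, but results such as "Series C" got mixed in. The negative set also has to match a character, so a C at the very end of a title is missed. Neither of these can be avoided using negative sets, which are used to allow multiple matches for a single character.

- Lookarounds let us define a character or sequence of characters that either must or must not come before or after our regex match.

(Is abc preceded by zzz? / Is abc not preceded by zzz? / Is abc followed by zzz? / Is abc not followed by zzz?)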

Inside the parentheses, the first character of a lookaround is always ?.
If the lookaround is a lookbehind, the next character will be <, which you can think of as an arrow head pointing behind the match.
The next character indicates whether the lookaround is positive (=) or negative (!).

Example)

test_cases = ['Red_Green_Blue',
              'Yellow_Green_Red',
              'Red_Green_Red',
              'Yellow_Green_Blue',
              'Green']

def run_test_cases(pattern):
    for tc in test_cases:
        result = re.search(pattern, tc)
        print(result or "NO MATCH")

run_test_cases(r"Green(?=_Blue)")  # Is Green followed by _Blue?

>> <_sre.SRE_Match object; span=(4, 9), match='Green'>
NO MATCH
NO MATCH
<_sre.SRE_Match object; span=(7, 12), match='Green'>
NO MATCH

run_test_cases(r"(?<!Yellow_)Green")  # Is Green not preceded by Yellow_?

>> <_sre.SRE_Match object; span=(4, 9), match='Green'>
NO MATCH
<_sre.SRE_Match object; span=(4, 9), match='Green'>
NO MATCH
<_sre.SRE_Match object; span=(0, 5), match='Green'>

run_test_cases(r"Green(?=.{5})")  # Is Green followed by at least five characters?

>> <_sre.SRE_Match object; span=(4, 9), match='Green'>
NO MATCH
NO MATCH
<_sre.SRE_Match object; span=(7, 12), match='Green'>
NO MATCH
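The test cases above exercise a lookahead and a negative lookbehind. As a rough sketch, all four lookaround variants can be tried on a single string from the test cases; the variable names are only illustrative.

```python
import re

s = "Red_Green_Blue"  # one of the test strings used above

pos_lookahead = bool(re.search(r"Green(?=_Blue)", s))   # followed by _Blue
neg_lookahead = bool(re.search(r"Green(?!_Blue)", s))   # not followed by _Blue
pos_lookbehind = bool(re.search(r"(?<=Red_)Green", s))  # preceded by Red_
neg_lookbehind = bool(re.search(r"(?<!Red_)Green", s))  # not preceded by Red_

print(pos_lookahead, neg_lookahead, pos_lookbehind, neg_lookbehind)

# The lookaround text is checked but is not part of the match itself.
matched_text = re.search(r"Green(?=_Blue)", s).group()
print(matched_text)
```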

Example 2) A harder one

Write a regular expression and assign it to pattern. The regular expression should:

Match instances of C or c where they are not preceded or followed by another word character.
Exclude instances where the match is followed by a . or + character, without removing instances where the match occurs at the end of the string.
Exclude instances where the word 'Series' immediately precedes the match.

The answer is pattern = r"(?<!Series\s)\b[Cc]\b(?![\.\+])"
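A sketch that checks this pattern against a few of the titles listed earlier. The last title, "Learning C", is a made-up addition to cover the end-of-string case.

```python
import re

pattern = r"(?<!Series\s)\b[Cc]\b(?![\.\+])"

titles = [
    "Custom Deleters for C++ Smart Pointers",
    "MemSQL (YC W11) Raises $36M Series C",
    "VW C.E.O. Personally Apologized to President Obama in Plea for Mercy",
    "The new C standards are worth it",
    "Fuchsia: Micro kernel written in C by Google",
    "Learning C",  # hypothetical title: C at the very end of the string
]

# Keep only the titles where the pattern finds a standalone C.
matches = [t for t in titles if re.search(pattern, t)]
for t in matches:
    print(t)
```

The negative lookahead succeeds at the end of the string because there is no following character to reject, which the earlier negative set could not handle.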

14. backreferences

When the same characters repeat consecutively, we can specify a capture group and then repeat it with a backreference.

Example 1)

test_cases = [
              "I'm going to read a book.",
              "Green is my favorite color.",
              "My name is Aaron.",
              "No doubles here.",
              "I have a pet eel."
             ]

for tc in test_cases:
    print(re.search(r"(\w)\1", tc))  # capture any single word character as group 1, then require the same character again

>> <_sre.SRE_Match object; span=(21, 23), match='oo'>
<_sre.SRE_Match object; span=(2, 4), match='ee'>
None
None
<_sre.SRE_Match object; span=(13, 15), match='ee'>

Note) Why was "Aaron" not matched?
-> The uppercase and lowercase "a" are two different characters, so the backreference does not match.
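If matching regardless of case is wanted here, combining the backreference with the re.I flag from the flags section should work. This is a sketch, not part of the original exercise.

```python
import re

name = "My name is Aaron."

# Case sensitive: "A" and "a" are different characters, so there is no match.
strict = re.search(r"(\w)\1", name)

# With re.I the backreference is compared case-insensitively.
relaxed = re.search(r"(\w)\1", name, flags=re.I)

print(strict)
print(relaxed.group())
```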

Example 2)

# a regular expression to match cases of repeated words:
# 1) a word as a series of one or more word characters that are preceded and followed by a boundary anchor.
# 2) repeated words as the same word repeated twice, separated by a whitespace character.

pattern = r"\b(\w+)\s\1\b"

repeated_words = titles[titles.str.contains(pattern)]

>>> 3102 Silicon Valley Has a Problem Problem
3176 Wire Wire: A West African Cyber Threat
3178 Flexbox Cheatsheet Cheatsheet

15. re.sub() function

The regex equivalent of the str.replace() method

1> re.sub(pattern, repl, string, flags=0)

* The repl parameter = the text that you would like to substitute for the match

ex1 ) string = "aBcDEfGHIj"

print(re.sub(r"[A-Z]", "-", string))

>> a-c--f---j

2> Series.str.replace(pattern, repl, flags=0)

ex2) sql_variations = pd.Series(["SQL", "Sql", "sql"])

sql_uniform = sql_variations.str.replace(r"sql", "SQL", flags=re.I, regex=True)  # regex=True is needed in recent pandas, where regex matching is no longer the default
print(sql_uniform)

>> 0    SQL
1    SQL
2    SQL
dtype: object

ex 3) Build a single regular expression that covers all the variations in email_variations, then use that one regular expression to clean the data

email_variations = pd.Series(['email', 'Email', 'e Mail',
'e mail', 'E-mail', 'e-mail',
'eMail', 'E-Mail', 'EMAIL'])

# to replace each of the matches in 'email_variations' with "email"
pattern = r"e[\-\s]?mail"

# ? = the preceding character or set, zero or one times

email_uniform = email_variations.str.replace(pattern, "email", flags=re.I, regex=True)  # regex=True for recent pandas

>>
0 email
1 email
2 email
3 email
4 email
5 email
6 email
7 email
8 email

# Use the same syntax to replace all mentions of email in "titles" with "email"
titles_clean = titles.str.replace(pattern, "email", flags=re.I, regex=True)  # regex=True for recent pandas

ex 4) Extract the domain from the URLs in "test_urls", which begin with a protocol such as https://

test_urls = pd.Series([
'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
'http://www.interactivedynamicvideo.com/',
'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
'http://evonomics.com/advertising-cannot-maintain-internet-heres-solution/',
'HTTPS://github.com/keppel/pinn',
'Http://phys.org/news/2015-09-scale-solar-youve.html',
'https://iot.seeed.cc',
'http://www.bfilipek.com/2016/04/custom-deleters-for-c-smart-pointers.html',
'http://beta.crowdfireapp.com/?beta=agnipath',
'https://www.valid.ly?param',
'http://css-cursor.techstream.org'
])

protocol = r"https?://([\w\.\-]+)"
# ? = the preceding character zero or one times
# \. = a literal . character

test_urls_clean = test_urls.str.extract(protocol, flags=re.I, expand=False)  # expand=False returns a Series

>> 0 www.amazon.com
1 www.interactivedynamicvideo.com
2 www.nytimes.com
3 evonomics.com
4 github.com
5 phys.org
6 iot.seeed.cc
7 www.bfilipek.com
8 beta.crowdfireapp.com
9 www.valid.ly
10 css-cursor.techstream.org

domains = hn["url"].str.extract(protocol, expand=False)  # expand=False returns a Series

>> 0        www.interactivedynamicvideo.com
1                        www.thewire.com
2                         www.amazon.com
3                        www.nytimes.com
4                        arstechnica.com

top_domains=(domains.value_counts()).head(5)

>> github.com 1008
medium.com 825
www.nytimes.com 525
www.theguardian.com 248
techcrunch.com 245

ex 5) Split a single creation timestamp into two columns

- Original data

pattern = r"(.+)\s(.+)"  # the whitespace character separates the two columns

dates_times = created_at.str.extract(pattern)
print(dates_times)

           0      1
0   8/4/2016  11:52
1  1/26/2016  19:30
2  6/23/2016  22:20
3  6/17/2016   0:01
4  9/30/2015   4:12

ex 6) A harder one. Split one column into three parts

pattern = r"(https?)://([\w\-\.]+)/?(.*)"

# Each pair of parentheses ( ) delimits one group
# The first capture group should include the protocol text, up to but not including ://
# The second group should contain the domain, from after :// up to but not including /
# The third group should contain the page path, from after / to the end of the string. It may be empty: .* matches any character (.) zero or more times (*)

test_url_parts=test_urls.str.extract(pattern, flags = re.I)
>>      0      1                                2
0       https  www.amazon.com                   Technology-Ventures-Enterprise-Thomas-Byers/dp...
1       http   www.interactivedynamicvideo.com
2       http   www.nytimes.com                  2007/11/07/movies/07stein.html?_r=0
3       http   evonomics.com                    advertising-cannot-maintain-internet-heres-sol...
4       HTTPS  github.com                       keppel/pinn
5       Http   phys.org                         news/2015-09-scale-solar-youve.html
6       https  iot.seeed.cc
7       http   www.bfilipek.com                 2016/04/custom-deleters-for-c-smart-pointers.html
8       http   beta.crowdfireapp.com            ?beta=agnipath
9       https  www.valid.ly                     ?param

url_parts=hn["url"].str.extract(pattern, flags = re.I)
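The three groups can also be inspected one URL at a time with plain re. This sketch uses two of the test URLs from above; the variable names are only illustrative.

```python
import re

pattern = r"(https?)://([\w\-\.]+)/?(.*)"

# groups() returns the captured protocol, domain, and path as a tuple.
with_path = re.search(pattern, "HTTPS://github.com/keppel/pinn", flags=re.I).groups()

# No slash after the domain: /? matches nothing and the rest goes to group 3.
query_only = re.search(pattern, "https://www.valid.ly?param", flags=re.I).groups()

print(with_path)
print(query_only)
```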

16. named capture groups

In order to name a capture group we use the syntax ?P<name>, where name is the name of our capture group. This syntax goes after the open parenthesis, but before the regex syntax that defines the capture group:

ex)

pattern = r"(?P<date>.+) (?P<time>.+)"
dates_times = created_at.str.extract(pattern)
print(dates_times)

        date   time
0   8/4/2016  11:52
1  1/26/2016  19:30
2  6/23/2016  22:20
3  6/17/2016   0:01
4  9/30/2015   4:12
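Outside pandas, named groups can be read back by name from a match object. A minimal sketch using the first timestamp from the output above:

```python
import re

pattern = r"(?P<date>.+) (?P<time>.+)"
m = re.search(pattern, "8/4/2016 11:52")

# Access each group by its name, or get all named groups as a dictionary.
print(m.group("date"), m.group("time"))
parts = m.groupdict()
print(parts)
```

In Series.str.extract(), the same names become the column labels of the result, as the date and time columns above show.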

ex 2 )
pattern = r"(?P<protocol>https?)://(?P<domain>[\w\-\.]+)/?(?P<path>.*)"  # the third group is named path to match the column in the output below
url_parts = hn["url"].str.extract(pattern, flags = re.I)

>>     protocol  domain                           path
0      http      www.interactivedynamicvideo.com
1      http      www.thewire.com                  entertainment/2013/04/florida-djs-april-fools-...
2      https     www.amazon.com                   Technology-Ventures-Enterprise-Thomas-Byers/dp...
3      http      www.nytimes.com                  2007/11/07/movies/07stein.html?_r=0
4      http      arstechnica.com                  business/2015/10/comcast-and-other-isps-boost-...
...    ...       ...                              ...
20094  https     puri.sm                          philosophy/how-purism-avoids-intels-active-man...
20095  https     medium.com                       @zreitano/the-yc-application-broken-down-and-t...
20096  http      blog.darknedgy.net               technology/2016/01/01/0/
20097  https     medium.com                       @benjiwheeler/how-product-hunt-really-works-d8...
20098  https     github.com                       jmcarp/robobrowser
