2020. 3. 1. 14:58ㆍPython programming
1. re module: built-in module for regular expressions
2. re.search() function
- It takes two arguments: the regex pattern, and the string we want to search for that pattern.
- We create a set by placing the characters we want to match in square brackets [ ].
Example 1)
Example 2)
import re

string_list = ["Julie's favorite color is Blue.",
               "Keli's favorite color is Green.",
               "Craig's favorite colors are blue and red."]
blue_mentions = 0
pattern = "[Bb]lue"
for s in string_list:
    # re.search() returns a match object (truthy) or None (falsy)
    if re.search(pattern, s):
        blue_mentions += 1
print(blue_mentions)
>> 2
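The loop above relies on what re.search() returns. A minimal sketch of that behavior, using two of the sentences from the example (the variable names hit and miss are only illustrative):

```python
import re

pattern = r"[Bb]lue"

# re.search() returns a match object when the pattern is found, otherwise None.
hit = re.search(pattern, "Julie's favorite color is Blue.")
miss = re.search(pattern, "Keli's favorite color is Green.")

print(hit.group())  # Blue
print(miss)         # None
```

Because a match object is truthy and None is falsy, the result can be used directly in an if statement.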
3. Series.str.contains() method
Compared with loops like Example 2 above, the vectorized methods in pandas are often faster and require less code.
In addition, the result is a boolean mask (a series of True/False values), so it can be summed to count the matches.
Example 1)
import pandas as pd

eg_list = ["Julie's favorite color is green.",
           "Keli's favorite color is Blue.",
           "Craig's favorite colors are blue and red."]
eg_series = pd.Series(eg_list)
pattern = "[Bb]lue"
pattern_contained = eg_series.str.contains(pattern)
print(pattern_contained)
>> 0    False
1     True
2     True
dtype: bool
Example 2) Keep only the rows of the title column in the hn dataframe that match the pattern
titles = hn['title']
pattern = "[Rr]uby"
ruby_titles = titles[titles.str.contains(pattern)]
4. quantifier
- A quantifier specifies how many of the previous character our pattern requires, which can help us when we want to match substrings of specific lengths
Example) A pattern that matches the numbers in text from 1000 to 2999
Example 2)
Example 3) Keep only the rows that contain e-mail or email
pattern = "e-?mail"
email_bool = titles.str.contains(pattern)
email_count = email_bool.sum()
email_titles = titles[email_bool]
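The two quantifier ideas in this section can be sketched with the plain re module. The pattern for 1000 to 2999 and the sample sentences are assumptions made for illustration, since the original example is not shown:

```python
import re

# [1-2] matches the first digit (1 or 2); [0-9]{3} requires exactly three more digits.
four_digit = r"[1-2][0-9]{3}"
years = re.findall(four_digit, "Founded in 1998, relaunched in 2016, aiming for 3000 users.")
print(years)  # ['1998', '2016']

# -? makes the hyphen optional: zero or one occurrence.
email_forms = re.findall(r"e-?mail", "email e-mail e--mail")
print(email_forms)  # ['email', 'e-mail']
```

Here 3000 is skipped because its first digit is outside [1-2], and e--mail is skipped because ? allows at most one hyphen.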
5. regex character classes
- Ranges can be used for letters as well as numbers.
- Sets and ranges can be combined.
- In order to match word characters between our brackets, we can combine the word character class (\w) with the 'one or more' quantifier (+), giving us a combined pattern of \w+.
- use backslashes to escape the [ and ] characters.
Example) Searching for [pdf] (the set "p" or "d" or "f") vs. searching for the literal text "[pdf]", which needs the escaped pattern \[pdf\]
- Summary
- We can use a backslash to escape characters that have special meaning in regular expressions (e.g. \[ will match an open bracket character).
- Character classes let us match certain groups of characters (e.g. \w will match any word character).
- Character classes can be combined with quantifiers when we want to match different numbers of characters.
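The difference between a set and escaped brackets can be checked quickly. This is a small sketch using one of the titles quoted later in these notes; the variable names are only illustrative:

```python
import re

title = "File indexing and searching for Plan 9 [pdf]"

# Unescaped brackets define a set: a single character that is p, d or f.
set_match = re.search(r"[pdf]", title).group()
print(set_match)  # d  (the first p, d or f in the title, from "indexing")

# Escaped brackets match the literal [ and ] characters.
literal_match = re.search(r"\[pdf\]", title).group()
print(literal_match)  # [pdf]

# \w+ between escaped brackets matches any tag, not just pdf.
any_tag = re.search(r"\[\w+\]", "[Beta] Speedtest.net HTML5 Speed Test").group()
print(any_tag)  # [Beta]
```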
6. escape sequence
- \b : removes the last character. In an ordinary string the escape sequence \b represents a backspace, so the final letter before it is removed when the string is printed.
ex) print('hello\b world') >> hell world
- We can create raw strings by prefixing our string with the r character.
We strongly recommend using raw strings for every regex you write, rather than remembering which sequences are escape sequences and using raw strings selectively. That way, you'll never encounter a situation where you forget or overlook something which causes your regex to break.
ex) print(r'hello\b world') >> hello\b world
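Why this matters for regular expressions can be seen with \b, which means backspace in an ordinary string but word boundary in a regex. A minimal sketch (the sample text is made up for illustration):

```python
import re

# In an ordinary string "\b" is one backspace character; in a raw string it is two characters.
plain_len = len("\b")
raw_len = len(r"\b")
print(plain_len, raw_len)  # 1 2

text = "I like Java"

# Without the r prefix, the regex engine receives backspace characters, which never occur in text.
no_raw = re.search("\bJava\b", text)
# With the r prefix, the regex engine receives \b and treats it as a word boundary.
with_raw = re.search(r"\bJava\b", text)

print(no_raw)            # None
print(with_raw.group())  # Java
```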
- capture groups : to specify one or more groups within our match that we can access separately
Example) Original data
67 Analysis of 114 propaganda sources from ISIS, Jabhat al-Nusra, al-Qaeda [pdf]
101 Munich Gunman Got Weapon from the Darknet [German]
160 File indexing and searching for Plan 9 [pdf]
163 Attack on Kunduz Trauma Centre, Afghanistan Initial MSF Internal Review [pdf]
196 [Beta] Speedtest.net HTML5 Speed Test
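pattern = r"(\[\w+\])"  # capture the square brackets together with the characters inside; there must be one or more characters
# expand=False returns a Series for a single capture group (recent pandas versions return a DataFrame by default)
tag_5_matches = tag_5.str.extract(pattern, expand=False)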
>> 67 [pdf]
101 [German]
160 [pdf]
163 [pdf]
196 [Beta]
Name: title, dtype: object
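pattern = r"\[(\w+)\]"  # capture only the characters inside the square brackets; there must be one or more characters
# expand=False returns a Series for a single capture group (recent pandas versions return a DataFrame by default)
tag_5_matches = tag_5.str.extract(pattern, expand=False)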
>> 67 pdf
101 German
160 pdf
163 pdf
196 Beta
Name: title, dtype: object
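The same two results can be reproduced on a single string with the re module, where group(0) is the whole match and group(1) is the first capture group. A short sketch using one of the titles above:

```python
import re

title = "Munich Gunman Got Weapon from the Darknet [German]"

m = re.search(r"\[(\w+)\]", title)

whole_match = m.group(0)  # everything the pattern matched, brackets included
captured = m.group(1)     # only the part inside the parentheses

print(whole_match)  # [German]
print(captured)     # German
```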
7. RegExr
: an online tool for building regular expressions; it includes syntax highlighting, instant matches, and a regex syntax reference.
8. Negative character classes
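: character classes that match every character except those in a given character class (for example, \D matches anything that is not a digit, and [^abc] matches anything except a, b or c).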
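A brief sketch of negative classes in action. The sample text is invented for illustration:

```python
import re

text = "Python 3.6, PEP 525!"

# \d matches digits; \D is its negation and matches everything else.
digits = re.findall(r"\d", text)
non_digit_runs = re.findall(r"\D+", text)
print(digits)          # ['3', '6', '5', '2', '5']
print(non_digit_runs)  # ['Python ', '.', ', PEP ', '!']

# A negative set: any character that is neither a word character nor whitespace.
punctuation = re.findall(r"[^\w\s]", text)
print(punctuation)     # ['.', ',', '!']
```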
9. word boundary anchor
- written with the syntax \b. A word boundary matches the position between a word character and a non-word character, or a word character and the start/end of a string.
ex)
pattern_2 = r"\bJava\b"
m2 = re.search(pattern_2, string)
print(m2)
>> <_sre.SRE_Match object; span=(41, 45), match='Java'>
ex) Same idea as above: the anchors block both the left and right edges of the word, so words like JavaScript or JavaOwner are not matched.
pattern = r"\b[Jj]ava\b"
java_titles = titles[titles.str.contains(pattern)]
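To see what the boundary anchors change, compare the same search with and without them. The sentence below is an assumed sample, chosen so that JavaScript appears before the standalone word Java:

```python
import re

# Assumed sample sentence (not taken from the dataset).
string = "Sometimes people confuse JavaScript with Java"

# Without anchors, the first hit is the "Java" inside "JavaScript".
loose_span = re.search(r"Java", string).span()
# With \b on both sides, only the standalone word matches.
strict_span = re.search(r"\bJava\b", string).span()

print(loose_span)   # (25, 29)
print(strict_span)  # (41, 45)
```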
10. beginning anchor (^) and end anchor ($), which represent the start and the end of the string, respectively.
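A small sketch of both anchors, finding bracketed tags only at the start or only at the end of a title. The list sample_titles is hypothetical:

```python
import re

# Hypothetical titles, used only to illustrate the anchors.
sample_titles = [
    "[pdf] Notes on regex",
    "Notes on regex [pdf]",
    "Notes [pdf] on regex",
]

# ^ anchors the match to the start of the string, $ to the end.
tag_at_start = [t for t in sample_titles if re.search(r"^\[\w+\]", t)]
tag_at_end = [t for t in sample_titles if re.search(r"\[\w+\]$", t)]

print(tag_at_start)  # ['[pdf] Notes on regex']
print(tag_at_end)    # ['Notes on regex [pdf]']
```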
11. flags
- re.search() has an optional flags argument. It accepts one or more flags, which are special variables in the re module that modify the behavior of the regex interpreter.
- The most commonly used one is the re.IGNORECASE flag (usually written re.I). Use re.I, the ignorecase flag, to make our pattern case insensitive. There are too many capitalization variants to spell out by hand every time we write a regex, so let the flag account for them automatically.
ex )
email_tests = pd.Series(['email', 'Email', 'eMail', 'EMAIL'])
email_tests.str.contains(r"email")
>> 0 True
1 False
2 False
3 False
dtype: bool
With the flags option, however, both the lowercase and uppercase forms of every letter are considered.
import re
email_tests.str.contains(r"email",flags=re.I)
>> 0 True
1 True
2 True
3 True
dtype: bool
ex) import re
email_tests = pd.Series(['email', 'Email', 'e Mail', 'e mail', 'E-mail',
'e-mail', 'eMail', 'E-Mail', 'EMAIL', 'emails', 'Emails',
'E-Mails'])
pattern = r"\be[\-\s]?mails?"
email_mentions = titles.str.contains(pattern, flags=re.I).sum()
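12. extract
Method 1 > Use the Series.str.extract() method.
Method 2 > Use a regex capture group by wrapping the part of our pattern we want to capture in parentheses. If we want to capture the whole pattern, we just wrap the whole pattern in a pair of parentheses:
ex 1) To extract the three consecutive letters SQL (and only those), rather than matching s or q or l individually, wrap the start and end of the pattern in parentheses
pattern = r"(SQL)"
# flags=re.I picks up every capitalization; expand=False returns a Series (recent pandas versions return a DataFrame by default)
sql_capitalizations = titles.str.extract(pattern, flags=re.I, expand=False)
ex 2) To extract every word in which SQL is preceded by one or more word characters
pattern = r"(\w+SQL)"
sql_flavors = titles.str.extract(pattern, flags=re.I, expand=False)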
sql_flavors_freq = sql_flavors.value_counts()
print(sql_flavors_freq)
>> PostgreSQL 27
NoSQL 16
MySQL 12
CloudSQL 1
SparkSQL 1
MemSQL 1
nosql 1
mySql 1
Name: title, dtype: int64
ex 3) Original data
Developing a computational pipeline using the asyncio module in Python 3
Python 3 on Google App Engine flexible environment now in beta
Python 3.6 proposal, PEP 525: Asynchronous Generators
How async/await works in Python 3.5.0
Ubuntu Drops Python 2.7 from the Default Install in 16
# Find the Python versions
# The regular expression should contain a capture group for the digit and period characters (the Python versions)
pattern = r"[Pp]ython ([\d\.]+)"
# Extract the Python versions (expand=False returns a Series for a single capture group)
test = titles.str.extract(pattern, expand=False)
# Create a dictionary frequency table of the extracted Python versions
py_versions_freq = dict(test.value_counts())
ex 4) Pull out mentions of the "C" language. The original data is below.
14 Custom Deleters for C++ Smart Pointers
221 Lisp, C++: Sadness in my heart
222 MemSQL (YC W11) Raises $36M Series C
354 VW C.E.O. Personally Apologized to President Obama in Plea for Mercy
366 The new C standards are worth it
445 Moz raises $10m Series C from Foundry Group
509 BDE 3.0 (Bloomberg's core C++ library): Open Source Release
522 Fuchsia: Micro kernel written in C by Google
550 How to Become a C.E.O.? The Quickest Path Is a Winding One
1283 A lightweight C++ signals and slots implementation
Name: title, dtype: object
# Use a negative set to prevent matches where C is followed by the + character or the . character:
# - mentions of C++, a distinct language from C
# - cases where the letter C is followed by a period, like in the substring C.E.O.
pattern = r"\b[Cc]\b[^.+]"
first_ten = first_10_matches(pattern)
>>
The new C standards are worth it
Moz raises $10m Series C from Foundry Group
Fuchsia: Micro kernel written in C by Google
13. lookarounds
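Above we were trying to find the C language, but results like "Series C" got mixed in, and a C at the very end of a title is missed because the negative set still has to match a character. Neither of these can be avoided using negative sets, which are used to allow multiple matches for a single character.
- Lookarounds define a character or sequence of characters that either must or must not come before or after our regex match.
(is abc preceded by zzz / is abc not preceded by zzz / is abc followed by zzz / is abc not followed by zzz)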
- Inside the parentheses, the first character of a lookaround is always ?.
- If the lookaround is a lookbehind, the next character will be <, which you can think of as an arrow head pointing behind the match.
- The next character indicates whether the lookaround is positive (=) or negative (!).
Example)
import re

test_cases = ['Red_Green_Blue',
              'Yellow_Green_Red',
              'Red_Green_Red',
              'Yellow_Green_Blue',
              'Green']
def run_test_cases(pattern):
    for tc in test_cases:
        result = re.search(pattern, tc)
        print(result or "NO MATCH")
run_test_cases(r"Green(?=_Blue)")  # is Green followed by _Blue?
>> <_sre.SRE_Match object; span=(4, 9), match='Green'>
NO MATCH
NO MATCH
<_sre.SRE_Match object; span=(7, 12), match='Green'>
NO MATCH
run_test_cases(r"(?<!Yellow_)Green")  # is Green not preceded by Yellow_?
>> <_sre.SRE_Match object; span=(4, 9), match='Green'>
NO MATCH
<_sre.SRE_Match object; span=(4, 9), match='Green'>
NO MATCH
<_sre.SRE_Match object; span=(0, 5), match='Green'>
run_test_cases(r"Green(?=.{5})")  # is Green followed by at least five characters?
>> <_sre.SRE_Match object; span=(4, 9), match='Green'>
NO MATCH
NO MATCH
<_sre.SRE_Match object; span=(7, 12), match='Green'>
NO MATCH
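The examples above cover a positive lookahead and a negative lookbehind. For comparison, here is a sketch that runs all four lookaround types over the same test cases; the two extra patterns (negative lookahead and positive lookbehind) are added here for illustration:

```python
import re

test_cases = ['Red_Green_Blue',
              'Yellow_Green_Red',
              'Red_Green_Red',
              'Yellow_Green_Blue',
              'Green']

patterns = {
    "positive lookahead": r"Green(?=_Blue)",      # Green followed by _Blue
    "negative lookahead": r"Green(?!_Blue)",      # Green not followed by _Blue
    "positive lookbehind": r"(?<=Red_)Green",     # Green preceded by Red_
    "negative lookbehind": r"(?<!Yellow_)Green",  # Green not preceded by Yellow_
}

# For each pattern, record which test cases match (True) or not (False).
results = {
    name: [bool(re.search(p, tc)) for tc in test_cases]
    for name, p in patterns.items()
}

for name, hits in results.items():
    print(name, hits)
```

In every case the match itself is only Green: the text inside a lookaround is checked but never becomes part of the match.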
Example 2) A hard one
Write a regular expression and assign it to pattern. The regular expression should:
- Match instances of C or c where they are not preceded or followed by another word character.
- Exclude instances where the match is followed by a . or + character, without removing instances where the match occurs at the end of the string.
- Exclude instances where the word 'Series' immediately precedes the match.
The answer is pattern = r"(?<!Series\s)\b[Cc]\b(?![\.\+])"
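One way to check this answer is to run it against the titles quoted in section 12. The last title below is hypothetical, added to cover the case where C sits at the end of the string:

```python
import re

pattern = r"(?<!Series\s)\b[Cc]\b(?![\.\+])"

# Title -> whether it should count as a mention of the C language.
expected = {
    "The new C standards are worth it": True,
    "Fuchsia: Micro kernel written in C by Google": True,
    "Custom Deleters for C++ Smart Pointers": False,
    "VW C.E.O. Personally Apologized to President Obama in Plea for Mercy": False,
    "Moz raises $10m Series C from Foundry Group": False,
    "Programming in C": True,  # hypothetical title: C at the end of the string
}

actual = {title: bool(re.search(pattern, title)) for title in expected}

for title, matched in actual.items():
    print(matched, title)
```

The negative lookahead succeeds at the end of the string because nothing follows the C, which is what the earlier negative set [^.+] could not handle.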
14. backreferences
When the same text appears twice in a row, we can specify a capture group and then repeat it with a backreference such as \1.
Example 1)
test_cases = [
"I'm going to read a book.",
"Green is my favorite color.",
"My name is Aaron.",
"No doubles here.",
"I have a pet eel."
]
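for tc in test_cases:
    # (\w) captures any single word character; \1 requires that same character again
    print(re.search(r"(\w)\1", tc))
>> <_sre.SRE_Match object; span=(21, 23), match='oo'>
<_sre.SRE_Match object; span=(2, 4), match='ee'>
None
None
<_sre.SRE_Match object; span=(13, 15), match='ee'>
Note) Why was "Aaron" left out? The backreference \1 must repeat exactly the text that was captured, and "A" and "a" differ in case.
Example 2)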
# a regular expression to match cases of repeated words:
# 1) a word as a series of one or more word characters that are preceded and followed by a boundary anchor.
#2) repeated words as the same word repeated twice, separated by a whitespace character.
pattern = r"\b(\w+)\s\1\b"
repeated_words = titles[titles.str.contains(pattern)]
>>> 3102 Silicon Valley Has a Problem Problem
3176 Wire Wire: A West African Cyber Threat
3178 Flexbox Cheatsheet Cheatsheet
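Two quick checks of backreference behavior with the re module: the Aaron case from Example 1, with and without the re.I flag, and the repeated-word pattern on one of the titles above. Treat this as an illustrative sketch rather than part of the original exercise:

```python
import re

name = "My name is Aaron."

# Case-sensitive: "A" followed by "a" is not a repeat of the captured character.
strict = re.search(r"(\w)\1", name)
# With re.I the backreference is compared case-insensitively, so "Aa" matches.
relaxed = re.search(r"(\w)\1", name, flags=re.I)

print(strict)           # None
print(relaxed.group())  # Aa

# The repeated-word pattern from Example 2 on a single title.
doubled = re.search(r"\b(\w+)\s\1\b", "Silicon Valley Has a Problem Problem")
print(doubled.group())  # Problem Problem
```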
15. re.sub() function
= the regex counterpart of the str.replace() method
1> re.sub(pattern, repl, string, flags=0)
* The repl parameter = the text that you would like to substitute for the match
ex1 ) string = "aBcDEfGHIj"
print(re.sub(r"[A-Z]", "-", string))
>> a-c--f---j
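2> Series.str.replace(pattern, repl, flags=0, regex=True)
ex2) sql_variations = pd.Series(["SQL", "Sql", "sql"])
# regex=True is needed from pandas 2.0 on, where the pattern is treated as a literal string by default
sql_uniform = sql_variations.str.replace(r"sql", "SQL", flags=re.I, regex=True)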
print(sql_uniform)
>> 0 SQL
1 SQL
2 SQL
dtype: object
ex 3) Build a single regular expression that covers all the variations in email_variations, then use that one regular expression to clean the data
email_variations = pd.Series(['email', 'Email', 'e Mail',
'e mail', 'E-mail', 'e-mail',
'eMail', 'E-Mail', 'EMAIL'])
# to replace each of the matches in 'email_variations' with "email"
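pattern = r"e[\-\s]?mail"
# ? = the preceding character (here, a hyphen or whitespace) zero or one times
# regex=True is needed from pandas 2.0 on, where the pattern is treated as a literal string by default
email_uniform = email_variations.str.replace(pattern, "email", flags=re.I, regex=True)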
>>
0 email
1 email
2 email
3 email
4 email
5 email
6 email
7 email
8 email
# Use the same syntax to replace all mentions of email in "titles" with "email"
titles_clean = titles.str.replace(pattern, "email", flags=re.I, regex=True)
ex4) Extract the domain from the URLs in "test_urls", which start with a protocol such as https://
test_urls = pd.Series([
'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
'http://www.interactivedynamicvideo.com/',
'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
'http://evonomics.com/advertising-cannot-maintain-internet-heres-solution/',
'HTTPS://github.com/keppel/pinn',
'Http://phys.org/news/2015-09-scale-solar-youve.html',
'https://iot.seeed.cc',
'http://www.bfilipek.com/2016/04/custom-deleters-for-c-smart-pointers.html',
'http://beta.crowdfireapp.com/?beta=agnipath',
'https://www.valid.ly?param',
'http://css-cursor.techstream.org'
])
protocol = r"https?://([\w\.\-]+)"
# ? = the preceding character zero or one times
# \. = a literal . character
# expand=False returns a Series for a single capture group
test_urls_clean = test_urls.str.extract(protocol, flags=re.I, expand=False)
>> 0 www.amazon.com
1 www.interactivedynamicvideo.com
2 www.nytimes.com
3 evonomics.com
4 github.com
5 phys.org
6 iot.seeed.cc
7 www.bfilipek.com
8 beta.crowdfireapp.com
9 www.valid.ly
10 css-cursor.techstream.org
# expand=False returns a Series, so value_counts() below counts the domains directly
domains = hn["url"].str.extract(protocol, expand=False)
>> 0 www.interactivedynamicvideo.com
1 www.thewire.com
2 www.amazon.com
3 www.nytimes.com
4 arstechnica.com
top_domains=(domains.value_counts()).head(5)
>> github.com 1008
medium.com 825
www.nytimes.com 525
www.theguardian.com 248
techcrunch.com 245
ex 5) Split a single creation timestamp into two columns
- Original data
pattern = r"(.+)\s(.+)"  # the white space separates the two columns
dates_times = created_at.str.extract(pattern)
print(dates_times)
>>
           0      1
0   8/4/2016  11:52
1  1/26/2016  19:30
2  6/23/2016  22:20
3  6/17/2016   0:01
4  9/30/2015   4:12
ex 6) A hard one. Split a single column into three parts
pattern = r"(https?)://([\w\-\.]+)/?(.*)"
# Each pair of parentheses ( ) marks off one group
# The first capture group should include the protocol text, up to but not including ://
# The second group should contain the domain, from after :// up to but not including /
# The third group should contain the page path, from after / to the end of the string; .* matches any character zero or more times, so this group may be empty
test_url_parts = test_urls.str.extract(pattern, flags=re.I)
>> 0 1 2
0 https www.amazon.com Technology-Ventures-Enterprise-Thomas-Byers/dp...
1 http www.interactivedynamicvideo.com
2 http www.nytimes.com 2007/11/07/movies/07stein.html?_r=0
3 http evonomics.com advertising-cannot-maintain-internet-heres-sol...
4 HTTPS github.com keppel/pinn
5 Http phys.org news/2015-09-scale-solar-youve.html
6 https iot.seeed.cc
7 http www.bfilipek.com 2016/04/custom-deleters-for-c-smart-pointers.html
8 http beta.crowdfireapp.com ?beta=agnipath
9 https www.valid.ly ?param
url_parts=hn["url"].str.extract(pattern, flags = re.I)
16. named capture groups
In order to name a capture group we use the syntax ?P<name>, where name is the name of our capture group. This syntax goes after the open parentheses, but before the regex syntax that defines the capture group:
ex)
pattern = r"(?P<date>.+) (?P<time>.+)"
dates_times = created_at.str.extract(pattern)
>>
        date   time
0   8/4/2016  11:52
1  1/26/2016  19:30
2  6/23/2016  22:20
3  6/17/2016   0:01
4  9/30/2015   4:12
ex 2 )
pattern = r"(?P<protocol>https?)://(?P<domain>[\w\-\.]+)/?(?P<path>.*)"
url_parts = hn["url"].str.extract(pattern, flags = re.I)
>> protocol domain path
0 http www.interactivedynamicvideo.com
1 http www.thewire.com entertainment/2013/04/florida-djs-april-fools-...
2 https www.amazon.com Technology-Ventures-Enterprise-Thomas-Byers/dp...
3 http www.nytimes.com 2007/11/07/movies/07stein.html?_r=0
4 http arstechnica.com business/2015/10/comcast-and-other-isps-boost-...
... ... ... ...
20094 https puri.sm philosophy/how-purism-avoids-intels-active-man...
20095 https medium.com @zreitano/the-yc-application-broken-down-and-t...
20096 http blog.darknedgy.net technology/2016/01/01/0/
20097 https medium.com @benjiwheeler/how-product-hunt-really-works-d8...
20098 https github.com jmcarp/robobrowser
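Named groups work the same way with the plain re module, where they can be read by name or collected into a dictionary. A minimal sketch using two URLs from test_urls in section 15, and assuming the third group is named path to match the column header in the output above:

```python
import re

pattern = r"(?P<protocol>https?)://(?P<domain>[\w\-\.]+)/?(?P<path>.*)"

m = re.search(pattern, "HTTPS://github.com/keppel/pinn", flags=re.I)

# Access a single group by name, or all of them at once.
domain = m.group("domain")
parts = m.groupdict()
print(domain)  # github.com
print(parts)   # {'protocol': 'HTTPS', 'domain': 'github.com', 'path': 'keppel/pinn'}

# When nothing follows the domain, the path group is an empty string.
no_path = re.search(pattern, "https://iot.seeed.cc", flags=re.I).group("path")
print(repr(no_path))  # ''
```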