★★pandas - pd.concat() and pd.merge(), Working With Strings In Pandas (useful for data cleaning)

2019. 11. 27. 17:12ㆍPython programming

(1) pd.concat( [ df1, df2 ] , ignore_index=True )
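A minimal sketch of the call above. The two sample frames and their column names are hypothetical, made up only for illustration.

```python
import pandas as pd

# Hypothetical sample frames (not from the original post)
df1 = pd.DataFrame({"country": ["Korea", "Japan"], "score": [5.9, 6.0]})
df2 = pd.DataFrame({"country": ["France", "Brazil"], "score": [6.6, 7.0]})

# Stack the rows of df2 below df1.
# ignore_index=True discards the old row labels and renumbers from 0.
combined = pd.concat([df1, df2], ignore_index=True)
print(combined)
```

Without ignore_index=True, the row labels 0 and 1 would each appear twice in the result.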

(2) pd.merge( left = , right= , on = 'column_name', how = 'left' )

Inner: only includes elements that appear in both dataframes with a common key [default]
Outer: includes all data from both dataframes
Left: includes all of the rows from the "left" dataframe along with any rows from the "right" dataframe with a common key; the result retains all columns from both of the original dataframes
Right: includes all of the rows from the "right" dataframe along with any rows from the "left" dataframe with a common key; the result retains all columns from both of the original dataframes
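The four join types can be compared side by side on a tiny example. This is only a sketch: the frames, the "key" column, and the value columns are hypothetical.

```python
import pandas as pd

# Hypothetical frames sharing a "key" column
left = pd.DataFrame({"key": ["a", "b", "c"], "left_val": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "right_val": [20, 30, 40]})

inner = pd.merge(left=left, right=right, on="key", how="inner")       # keys in both
outer = pd.merge(left=left, right=right, on="key", how="outer")       # keys in either
left_join = pd.merge(left=left, right=right, on="key", how="left")    # all left keys
right_join = pd.merge(left=left, right=right, on="key", how="right")  # all right keys

for name, result in [("inner", inner), ("outer", outer),
                     ("left", left_join), ("right", right_join)]:
    print(name, sorted(result["key"]))
```

Rows that have no partner in the other frame get missing values in the columns that come from that frame, for example right_val for key "a" in the left join.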

(3) pd.concat() vs pd.merge()

====================================================

Working With Strings In Pandas

====================================================

(4) DataFrame.rename( )

To rename column A to B

col_rename = {'A':'B'}

df = df.rename( col_rename, axis=1 )
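The same rename, run on a small hypothetical frame (the column "C" and the values are invented for illustration):

```python
import pandas as pd

# Hypothetical frame with columns A and C
df = pd.DataFrame({"A": [1, 2], "C": [3, 4]})

col_rename = {"A": "B"}

# axis=1 applies the mapping to column labels; columns not in the mapping are untouched
df = df.rename(col_rename, axis=1)
print(list(df.columns))
```

Writing df.rename(columns=col_rename) is an equivalent spelling of the same call.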

(5) Series.str.split() method

- Function: splits each string into individual words
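A short sketch of splitting on whitespace with the vectorized Series.str.split(). The sample strings are hypothetical.

```python
import pandas as pd

# Hypothetical text column with one missing value
regions = pd.Series(["East Asia & Pacific", "Latin America", None])

# With no argument, split() breaks each string on whitespace and returns a list per row
words = regions.str.split()

# .str.get(0) picks the first word of each list
first_word = regions.str.split().str.get(0)

print(words)
print(first_word)
```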

(6) Besides split, there are many other vectorized string methods

- When working with text data, prefer the built-in vectorized methods over Series.apply() for performance

- Advantages of using vectorized string methods

    1. Better performance
    2. Code that is easier to read and write
    3. Automatically excludes missing values
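The third advantage in the list above can be seen by comparing a vectorized method with Series.apply() on a column that contains a missing value. This is a sketch with hypothetical data.

```python
import pandas as pd

# Hypothetical column with a missing value in the middle
cities = pd.Series(["seoul", None, "busan"])

# Vectorized method: the missing value is skipped and stays missing
upper_vec = cities.str.upper()

# apply(): the function is also called on the missing value,
# which has no .upper() method, so it raises AttributeError
try:
    cities.apply(lambda value: value.upper())
    apply_failed = False
except AttributeError:
    apply_failed = True

print(upper_vec)
print("apply failed:", apply_failed)
```

To make the apply() version work, the function would need its own check for missing values, which is the extra code the vectorized methods avoid.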

- Reference: Working with text data (pandas user guide)

(7) Regular expressions (regex) for finding patterns in strings
https://docs.python.org/3.4/library/re.html

(8) Series.str.contains() method
- Checks whether a specific phrase appears in a Series; the result is expressed as True, False, or missing values

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html
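A small sketch of Series.str.contains() used as a row filter. The sample notes are hypothetical; na is a real parameter of the method that sets the value used for missing entries.

```python
import pandas as pd

# Hypothetical notes column; the third entry is missing
notes = pd.Series(["Fiscal year end: March 31", "National accounts data", None])

# One result per row: does the text contain "year"?
mask = notes.str.contains("year")

# na=False treats missing entries as "no match",
# which gives a clean boolean mask for filtering
mask_filled = notes.str.contains("year", na=False)
matches = notes[mask_filled]

print(mask)
print(matches)
```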


(9) Series.str.extract() method

Example) pattern = r"([1-2][0-9]{3})"

years = merged['SpecialNotes'].str.extract(pattern, expand=True)
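To see what the extract call returns, here is a sketch on a hypothetical stand-in for the SpecialNotes column (the original merged dataframe is not shown in this post):

```python
import pandas as pd

# Hypothetical stand-in for merged['SpecialNotes']
notes = pd.Series([
    "Fiscal year end: March 31; base year 2005",
    "Census 1998, survey 2012",
    "No year here",
])

pattern = r"([1-2][0-9]{3})"  # one capture group: a four-digit year starting with 1 or 2

# expand=True returns a DataFrame with one column per capture group
years = notes.str.extract(pattern, expand=True)

# expand=False returns a Series when the pattern has a single group
years_series = notes.str.extract(pattern, expand=False)

print(years)
```

Only the first match in each row is returned, so the second row yields 1998 and not 2012; rows without a match become missing values.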

(10) Series.str.extractall()

- Comparison: the Series.str.extract() method will only extract the first match of the pattern.

- If we wanted to extract all of the matches, we can use the Series.str.extractall() method.

Example) pattern = r"(?P<Years>[1-2][0-9]{3})"

# Sample data
# Integrated household survey (IHS), 2012
# Integrated household survey (IHS), 2010/11

years = merged['IESurvey'].str.extractall(pattern)

value_counts= years['Years'].value_counts()
value_counts

2012    33
2010    28
2011    22
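The counts above come from the author's full dataset. As a self-contained sketch, the same steps can be run on just the two sample strings, assuming the named group is called Years (as the years['Years'] lookup implies):

```python
import pandas as pd

# The two sample strings from the post, as a stand-in for merged['IESurvey']
surveys = pd.Series([
    "Integrated household survey (IHS), 2012",
    "Integrated household survey (IHS), 2010/11",
])

# (?P<Years>...) names the capture group, which becomes the column name
pattern = r"(?P<Years>[1-2][0-9]{3})"

# extractall returns one row per match, indexed by (original row, match number)
years = surveys.str.extractall(pattern)
value_counts = years["Years"].value_counts()

print(years)
print(value_counts)
```

With this pattern, "2010/11" contributes only 2010, because "11" is not a four-digit year.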

Example 2)

pattern = r"(?P<First_Year>[1-2][0-9]{3})/?(?P<Second_Year>[0-9]{2})?"

# A question mark, ?, follows the slash and the second group to indicate that matching them is optional.

# So both yyyy and yyyy/yy are matched.

years = merged['IESurvey'].str.extractall(pattern)

first_two_year = years['First_Year'].str[0:2]  # take the leading two digits, e.g. '20'

years['Second_Year'] = first_two_year + years['Second_Year']  # join '20' and '19' to make '2019'
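Put together, Example 2 can be sketched end to end on the two sample strings. The group names First_Year and Second_Year are taken from the column lookups in the code; the Series itself is a stand-in for merged['IESurvey'].

```python
import pandas as pd

# Stand-in for merged['IESurvey'], using the sample strings from the post
surveys = pd.Series([
    "Integrated household survey (IHS), 2012",
    "Integrated household survey (IHS), 2010/11",
])

# Four-digit year, then an optional slash and an optional two-digit year
pattern = r"(?P<First_Year>[1-2][0-9]{3})/?(?P<Second_Year>[0-9]{2})?"

years = surveys.str.extractall(pattern)

# Take the century digits from the first year and prepend them to the second
first_two_year = years["First_Year"].str[0:2]
years["Second_Year"] = first_two_year + years["Second_Year"]

print(years)
```

Rows with no second year stay missing in Second_Year, since adding a string to a missing value yields a missing value.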

- Additional notes

1. The regex must be wrapped in parentheses to be extracted. If part of the regex isn't grouped using parentheses, (), it won't be extracted.

2. When we add a string to a column using the plus sign, +, pandas will add that string to every value in the column. Note that the strings will be added together without any spaces.
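Both notes can be checked with a short sketch. The sample strings and the "Year:" prefix are hypothetical.

```python
import pandas as pd

# Hypothetical sample strings
s = pd.Series(["survey 2012", "survey 2010"])

# Note 1: "survey " is part of the match but sits outside the parentheses,
# so only the grouped year is extracted
years = s.str.extract(r"survey ([1-2][0-9]{3})", expand=False)

# Note 2: + applies the string to every value, with no space inserted
labeled = "Year:" + years

print(list(years))
print(list(labeled))
```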
