★★pandas - pd.concat() and pd.merge(), Working With Strings In Pandas (useful for data cleansing)

2019. 11. 27. 17:12 Python programming

(1) pd.concat( [ df1, df2 ] , ignore_index=True )
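
A minimal sketch of the call above. The two frames and their columns are made up for illustration; only the names df1 and df2 come from the note.

```python
import pandas as pd

# Hypothetical sample frames (the contents are assumptions, not from the note).
df1 = pd.DataFrame({"country": ["Norway", "Denmark"], "score": [7.5, 7.6]})
df2 = pd.DataFrame({"country": ["Iceland"], "score": [7.4]})

# Stack the rows of df2 under df1.
# ignore_index=True discards the original row labels and renumbers from 0.
combined = pd.concat([df1, df2], ignore_index=True)
print(combined)
```

Without ignore_index=True the result would keep the labels 0, 1, 0, which usually gets in the way of later row lookups.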

 

(2) pd.merge( left = , right= , on = 'column_name', how = 'left' )

  1. Inner: only includes elements that appear in both dataframes with a common key [default]
  2. Outer: includes all data from both dataframes
  3. Left: includes all of the rows from the "left" dataframe along with any rows from the "right" dataframe with a common key; the result retains all columns from both of the original dataframes
  4. Right: includes all of the rows from the "right" dataframe along with any rows from the "left" dataframe with a common key; the result retains all columns from both of the original dataframes
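
The four join types can be compared on a small example. This is only a sketch: the frames, the column name "key", and the dict keys_by_how are hypothetical names picked for illustration.

```python
import pandas as pd

# Hypothetical frames that share the keys "b" and "c" only.
left = pd.DataFrame({"key": ["a", "b", "c"], "left_val": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "right_val": [20, 30, 40]})

# Collect which keys survive under each join type.
keys_by_how = {}
for how in ["inner", "outer", "left", "right"]:
    merged = pd.merge(left=left, right=right, on="key", how=how)
    keys_by_how[how] = merged["key"].tolist()
    print(how, keys_by_how[how])
```

Rows that have no partner on the other side are kept only by the outer, left, or right variants, and their missing columns are filled with NaN.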

 

(3) pd.concat() vs pd.merge()

 

 

====================================================

  Working With Strings In Pandas

====================================================

(4) df.rename( )

When renaming column A to B

col_rename = {'A':'B'}

df = df.rename( col_rename, axis=1 )

 

 

(5) string.split() method

- Purpose: split a string into individual words
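
A small sketch, assuming the note means the vectorized pandas version, Series.str.split(). The sample strings are made up.

```python
import pandas as pd

# Hypothetical text column with one missing value.
s = pd.Series(["Integrated household survey", "Labor force survey", None])

# With no arguments, each value is split on whitespace into a list of words.
words = s.str.split()

# .str.get(0) picks the first word of every list; the missing value stays missing.
first_word = words.str.get(0)
print(first_word)
```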

 

(6) Besides split, there are many more vectorized string methods

- When working with text data, it is better to use the built-in vectorized methods than Series.apply(), for performance

- Advantages of using vectorized string methods
     
    1. Better performance 
    2. Code that is easier to read and write 
    3. Automatically excludes missing values 

- Reference: Working with text data
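
The third advantage (missing values are skipped automatically) can be sketched as follows. The Series contents are hypothetical, and the lambda is just one possible way to write the apply() version.

```python
import numpy as np
import pandas as pd

# Hypothetical column containing a missing value.
s = pd.Series(["Norway", np.nan, "Denmark"])

# Vectorized method: the missing value is skipped and stays missing in the result.
upper_vectorized = s.str.upper()

# Series.apply() with a plain Python string method needs an explicit guard;
# calling .upper() directly on the float NaN would raise AttributeError.
upper_apply = s.apply(lambda x: x.upper() if isinstance(x, str) else x)

print(upper_vectorized)
print(upper_apply)
```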

 

(7)  Regular expressions (regex) for finding patterns inside strings
 https://docs.python.org/3.4/library/re.html
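
A quick sketch with the standard re module, using the same year pattern that appears later in these notes. The variable names are arbitrary.

```python
import re

# [1-2] matches a leading 1 or 2, [0-9]{3} matches exactly three more digits.
pattern = r"[1-2][0-9]{3}"
text = "Integrated household survey (IHS), 2010/11"

# re.search scans the whole string and returns the first match (or None).
match = re.search(pattern, text)
year = match.group() if match else None
print(year)
```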

    
(8)  Series.str.contains() method
- Checks whether a specific phrase appears in each value of a series; the result is expressed as True, False, or a missing value

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html 
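
A minimal sketch of the three possible results. The sample notes are made up, and dtype=object is set explicitly on the assumption that the note describes the classic object-dtype behavior, where a missing value stays missing.

```python
import numpy as np
import pandas as pd

# Hypothetical notes column; object dtype is an assumption made for this sketch.
notes = pd.Series(
    ["National accounts data", "Fiscal year end: March 31", np.nan],
    dtype=object,
)

# The pattern is treated as a regex by default.
has_accounts = notes.str.contains(r"[Nn]ational accounts")

# na=False turns missing results into False, which is handy for boolean indexing.
has_accounts_filled = notes.str.contains(r"[Nn]ational accounts", na=False)

print(has_accounts)
print(notes[has_accounts_filled])
```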

 


 

(9) Series.str.extract() method

Example) pattern = r"([1-2][0-9]{3})"

years = merged['SpecialNotes'].str.extract(pattern, expand=True)
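
A runnable version of the same idea. The merged dataframe is not shown in these notes, so a small hypothetical Series stands in for merged['SpecialNotes'].

```python
import pandas as pd

# Hypothetical stand-in for merged['SpecialNotes'].
notes = pd.Series([
    "Fiscal year end: March 31, 2014",
    "No year here",
    "Census 2011 and 2016",
])

# The parentheses form a capture group; only the grouped part is extracted.
pattern = r"([1-2][0-9]{3})"

# expand=True returns a DataFrame with one column per capture group.
years = notes.str.extract(pattern, expand=True)
print(years)
```

Only the first match per row is returned, so the last row gives 2011 and drops 2016. Rows without a match become missing values.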

 

(10) Series.str.extractall()

- Comparison: the Series.str.extract() method will only extract the first match of the pattern.

 - If we wanted to extract all of the matches, we can use the Series.str.extractall() method.

 Example) pattern = r"(?P<Years>[1-2][0-9]{3})"

# Sample data
# Integrated household survey (IHS), 2012
# Integrated household survey (IHS), 2010/11


years = merged['IESurvey'].str.extractall(pattern)

value_counts= years['Years'].value_counts()
value_counts

2012    33
2010    28
2011    22
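
A self-contained sketch of the steps above. The first two strings are the sample rows from these notes; the third row is hypothetical and was added only to show more than one match in a single value.

```python
import pandas as pd

# Stand-in for merged['IESurvey']; the last row is a made-up example.
surveys = pd.Series([
    "Integrated household survey (IHS), 2012",
    "Integrated household survey (IHS), 2010/11",
    "Hypothetical survey, 2008 and 2012",
])

# (?P<Years>...) is a named capture group; the name becomes the column name.
pattern = r"(?P<Years>[1-2][0-9]{3})"

# extractall returns one row per match, indexed by (original row, match number).
years = surveys.str.extractall(pattern)
value_counts = years["Years"].value_counts()
print(years)
print(value_counts)
```

Note that this pattern takes only 2010 from "2010/11", because the trailing 11 is not a four-digit year.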

Example 2)

      pattern = r"(?P<First_Year>[1-2][0-9]{3})/?(?P<Second_Year>[0-9]{2})?"

    # A question mark, ?, after the slash and after the second group indicates that a match for each of them is optional.

    # So both yyyy and yyyy/yy are matched


   years = merged['IESurvey'].str.extractall(pattern)

   first_two_year = years['First_Year'].str[0:2]  # take the leading two digits, e.g. '20'

   years['Second_Year'] = first_two_year + years['Second_Year']  # e.g. joining '20' and '19' gives '2019'
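
Example 2 as one runnable sketch. The two sample rows from these notes stand in for merged['IESurvey'].

```python
import pandas as pd

# Stand-in for merged['IESurvey'], using the sample rows from the notes.
surveys = pd.Series([
    "Integrated household survey (IHS), 2012",
    "Integrated household survey (IHS), 2010/11",
])

# The slash and the two-digit second year are both optional.
pattern = r"(?P<First_Year>[1-2][0-9]{3})/?(?P<Second_Year>[0-9]{2})?"
years = surveys.str.extractall(pattern)

# Take the century digits from the first year and prepend them to the second year.
first_two_year = years["First_Year"].str[0:2]
years["Second_Year"] = first_two_year + years["Second_Year"]
print(years)
```

Rows with no second year stay missing in Second_Year, because adding a string to a missing value gives a missing value.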

  

- Additional notes

1. The regex must be wrapped in parentheses to be captured. If part of the regex isn't grouped using parentheses, (), it won't be extracted.

2. When we add a string to a column using the plus sign, +, pandas will add that string to every value in the column. Note that the strings will be added together without any spaces.
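
Note 2 in a tiny sketch; the Series values are made up.

```python
import pandas as pd

# Hypothetical column of two-digit year suffixes stored as strings.
codes = pd.Series(["10", "11", "12"])

# The string "20" is prepended to every value, with no space in between.
full_years = "20" + codes
print(full_years.tolist())
```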

 

 

 
