Strings in Python 3



In [116]:

    
text='하나 둘 셋'

String length



In [117]:

    
len(text)









    Out[117]:





6

Indexing



In [118]:

    
text[3]









    Out[118]:





'둘'



In [119]:

    
text[-1]









    Out[119]:





'셋'

Slicing



In [120]:

    
text[0:2]









    Out[120]:





'하나'



In [121]:

    
text[3:4]









    Out[121]:





'둘'



In [122]:

    
text[-3:]









    Out[122]:





'둘 셋'

Slicing with a step



In [112]:

    
text = '1-2-3-4-5'
text[::2]









    Out[112]:





'12345'



In [114]:

    
text[::-1]









    Out[114]:





'5-4-3-2-1'

String comparison



In [92]:

    
"파이썬" == "파이썬"









    Out[92]:





True



In [93]:

    
"파이썬" != "자바스크립트"









    Out[93]:





True



In [140]:

    
'python' > 'Python'









    Out[140]:





True



In [94]:

    
"파이썬" > "자바스크립트"









    Out[94]:





True



In [138]:

    
"파이썬" > "하이썬"









    Out[138]:





False
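
The comparison operators compare strings character by character using Unicode code points, which is why 'python' sorts after 'Python' in the cells above. A minimal sketch with the built-in `ord`; the variable names are illustrative only:

```python
# Strings compare lexicographically by Unicode code point.
lower_p, upper_p = ord('p'), ord('P')   # 112 and 80
lowercase_wins = 'python' > 'Python'    # True because 112 > 80

# The same rule applies to Hangul: the first differing character decides.
first_pair = ord('파') > ord('자')       # decides "파이썬" > "자바스크립트"
second_pair = ord('파') > ord('하')      # decides "파이썬" > "하이썬"

print(lower_p, upper_p, lowercase_wins, first_pair, second_pair)
```

Because the ordering follows code points rather than dictionary rules, it is not a locale-aware sort.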

String concatenation



In [6]:

    
text + ' 넷'









    Out[6]:





'하나 둘 셋 넷'



In [141]:

    
text+4









    



---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-141-80d3634e74e7> in <module>()
----> 1 text+4

TypeError: Can't convert 'int' object to str implicitly



In [20]:

    
text + str(4)









    Out[20]:





'하나 둘 셋4'
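
As an alternative to calling `str()` by hand, string formatting converts the number during interpolation. A small sketch, assuming `text` holds the same value as in the first cell:

```python
text = '하나 둘 셋'

# str.format and f-strings convert the number to text for you.
with_format = '{}{}'.format(text, 4)
with_fstring = f'{text}{4}'   # f-strings need Python 3.6 or later

print(with_format)
print(with_fstring)
```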

Joining many strings



In [1]:

    
texts = ['하나', '둘', '셋', '넷']



In [2]:

    
" ".join(texts)









    Out[2]:





'하나 둘 셋 넷'



In [4]:

    
",".join(texts)









    Out[4]:





'하나,둘,셋,넷'
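
`str.join` accepts only strings, so non-string items have to be converted first. A short sketch; the `numbers` list is made up for illustration:

```python
numbers = [1, 2, 3, 4]

# Convert each item to str before joining.
joined = ','.join(str(n) for n in numbers)
print(joined)

# Passing the integers directly raises TypeError.
error_name = None
try:
    ','.join(numbers)
except TypeError as exc:
    error_name = type(exc).__name__
print(error_name)
```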

Searching strings



In [143]:

    
text = '파이썬 프로그래밍을 즐기는 파이썬 프로그래머'



In [125]:

    
'파이썬' in text









    Out[125]:





True



In [142]:

    
'자바스크립트' in text









    Out[142]:





False



In [126]:

    
text.index('파이썬')









    Out[126]:





0



In [128]:

    
text.find('파이썬')









    Out[128]:





0



In [144]:

    
text.index('파이썬', 3)









    Out[144]:





15



In [145]:

    
text.find('파이썬', 3)









    Out[145]:





15



In [154]:

    
text.rindex('파이썬', 0, 15)









    Out[154]:





0



In [129]:

    
text.rfind('파이썬')









    Out[129]:





15



In [156]:

    
text.rfind('파이썬', 3, 15)









    Out[156]:





-1




The main difference between str.index and str.find



In [115]:

    
text.index('스칼라') # ValueError









    



---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-115-2e2054672ed4> in <module>()
----> 1 text.index('스칼라') # ValueError

ValueError: substring not found



In [38]:

    
text.find('스칼라')









    Out[38]:





-1
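
Which of the two to use depends on how a missing substring should be handled. The sketch below wraps `index` in a hypothetical helper, `position_or_none` (not part of the standard library), and shows the explicit `-1` check that `find` requires:

```python
text = '파이썬 프로그래밍을 즐기는 파이썬 프로그래머'

def position_or_none(haystack, needle):
    """Return the first index of needle, or None when it is absent."""
    try:
        return haystack.index(needle)
    except ValueError:
        return None

found = position_or_none(text, '파이썬')
missing = position_or_none(text, '스칼라')

# With find, compare against -1 explicitly: -1 is also a valid index
# (the last character), so using the result unchecked hides the failure.
find_missed = text.find('스칼라') == -1

print(found, missing, find_missed)
```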



In [52]:

    
for fn in ['my_picture.jpg', 'family_picture.jpeg', 'logo.png']:
    if fn.lower().endswith(('.jpg', '.jpeg')):
        print(fn, " 파일은 JPEG 이미지")









    



my_picture.jpg  파일은 JPEG 이미지
family_picture.jpeg  파일은 JPEG 이미지

Splitting strings

str.split()

str.partition() / str.rpartition()



In [157]:

    
text = '파이썬은 데이터 처리에 적합합니다.'



In [158]:

    
text.split()









    Out[158]:





['파이썬은', '데이터', '처리에', '적합합니다.']



In [159]:

    
data = '이름, 이메일, 전화번호'
data.split(',')









    Out[159]:





['이름', ' 이메일', ' 전화번호']
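
Splitting on ',' keeps the space that follows each comma. One way to clean the fields, sketched here, is to strip each piece after splitting:

```python
data = '이름, 이메일, 전화번호'

# Split on commas, then remove surrounding whitespace from every field.
fields = [field.strip() for field in data.split(',')]
print(fields)
```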



In [46]:

    
filepath = 'c:\\users\\seongjoo\\documents'



In [47]:

    
filepath.partition("\\")









    Out[47]:





('c:', '\\', 'users\\seongjoo\\documents')



In [48]:

    
filepath.rpartition("\\")









    Out[48]:





('c:\\users\\seongjoo', '\\', 'documents')
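
For real file paths, the standard library's `pathlib` may be a more robust choice than partitioning by hand. The sketch below uses `PureWindowsPath`, which parses Windows-style paths on any platform, and also shows what `partition` and `rpartition` return when the separator is absent:

```python
from pathlib import PureWindowsPath

filepath = 'c:\\users\\seongjoo\\documents'
path = PureWindowsPath(filepath)

last_part = path.name          # final path component
parent = str(path.parent)      # everything before it

# When the separator is missing, partition keeps the text in the first slot
# and rpartition keeps it in the last slot.
no_sep = 'documents'.partition('\\')
no_sep_right = 'documents'.rpartition('\\')

print(last_part, parent, no_sep, no_sep_right)
```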

Stripping strings



In [53]:

    
text = "\t 문자열 정리      "



In [54]:

    
text.lstrip()









    Out[54]:





'문자열 정리      '



In [55]:

    
text.rstrip()









    Out[55]:





'\t 문자열 정리'



In [56]:

    
text.strip()









    Out[56]:





'문자열 정리'



In [57]:

    
"<[(알맹이)]>".strip("[](){}<>")









    Out[57]:





'알맹이'
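
The argument to `strip` is treated as a set of characters, not as a prefix or suffix, and only the two ends of the string are trimmed. A small sketch with made-up input:

```python
wrapped = '(())a(b)()'

# Every leading and trailing '(' or ')' is removed; the inner '(' stays.
trimmed = wrapped.strip('()')

# Characters that are not in the set stop the trimming immediately.
partly = '<[(알맹이)]>'.strip('<>')

print(trimmed, partly)
```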

String replacement



In [58]:

    
birth_date = '1982/12/27'
birth_date.replace('/','-')









    Out[58]:





'1982-12-27'

Counting substrings



In [55]:

    
text = 'big python, python for big data. Python is the best!'
text.lower().count('python')









    Out[55]:





3

String formatting



In [7]:

    
'내 이름은 {}입니다. 이메일은 {}입니다.'.format('이성주', 'seongjoo@codebasic.co')









    Out[7]:





'내 이름은 이성주입니다. 이메일은 seongjoo@codebasic.co입니다.'



In [8]:

    
'{2} {0} {1}'.format('하나', '둘', '셋')









    Out[8]:





'셋 하나 둘'

Thousands separators



In [21]:

    
'{:,}'.format(123456789)









    Out[21]:





'123,456,789'

Format specifiers

https://docs.python.org/3.4/library/string.html



In [19]:

    
'{0:f} {1:.1f} {2:.2f} {3:.2e} {4:.1%}'.format(3.15, 3.15, 3.0, 1/3, 1/3)









    Out[19]:





'3.150000 3.1 3.00 3.33e-01 33.3%'
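
The same specifier syntax also controls width, padding and alignment, and it works inside f-strings. A brief sketch; the sample values are arbitrary:

```python
value = 3.14159

padded = '{:08.3f}'.format(value)   # zero-padded to a total width of 8
right = '{:>8}'.format('abc')       # right-aligned in 8 columns
centered = '{:^7}'.format('abc')    # centered in 7 columns
as_fstring = f'{1/3:.1%}'           # same specifier inside an f-string (3.6+)

print(repr(padded), repr(right), repr(centered), repr(as_fstring))
```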



In [1]:

    
import string

String constants



In [5]:

    
string.ascii_letters









    Out[5]:





'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'



In [3]:

    
string.ascii_lowercase









    Out[3]:





'abcdefghijklmnopqrstuvwxyz'



In [4]:

    
string.ascii_uppercase









    Out[4]:





'ABCDEFGHIJKLMNOPQRSTUVWXYZ'



In [6]:

    
string.digits









    Out[6]:





'0123456789'

Template



In [21]:

    
from string import Template



In [24]:

    
template = Template('$who likes $what')
template.substitute(who='SJ', what='Python')









    Out[24]:





'SJ likes Python'
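
`substitute` raises `KeyError` when a placeholder has no value, while `safe_substitute` leaves the placeholder in place. A minimal sketch of both behaviours:

```python
from string import Template

template = Template('$who likes $what')

# substitute insists on a value for every placeholder.
missing_key = None
try:
    template.substitute(who='SJ')
except KeyError as exc:
    missing_key = exc.args[0]

# safe_substitute fills in what it can and keeps the rest untouched.
partial = template.safe_substitute(who='SJ')

print(missing_key, partial)
```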

By default, strings are split on whitespace

Splitting on commas



In [34]:

    
text = '데이터는 파이썬, 파이썬은 데이터'

if text.find('파이썬') == text.rfind('파이썬'):
    verdict = '유일한 단어'
else:
    verdict = '해당 단어가 두 개 이상'

print(verdict)









    



해당 단어가 두 개 이상
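
Comparing `find` with `rfind` cannot distinguish "exactly once" from "not present", because both calls return -1 when the word is missing. A sketch of one alternative based on `str.count`; the helper name `describe_occurrences` and its English labels are hypothetical:

```python
def describe_occurrences(text, word):
    """Classify how often word occurs in text (non-overlapping matches)."""
    count = text.count(word)
    if count == 0:
        return 'absent'
    if count == 1:
        return 'unique'
    return 'multiple'

sentence = '데이터는 파이썬, 파이썬은 데이터'
verdict_python = describe_occurrences(sentence, '파이썬')
verdict_scala = describe_occurrences(sentence, '스칼라')

print(verdict_python, verdict_scala)
```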



In [63]:

    
text = 'big python, python for big data. Python is the best!'
text_kr = u'빅 파이썬, 빅데이터를 위한 파이썬. 파이썬이 최고'



In [39]:

    
'HELLO'.lower()









    Out[39]:





'hello'



In [46]:

    
text.find('javascript')









    Out[46]:





-1



In [65]:

    
text_kr.find('파이썬')









    Out[65]:





2



In [49]:

    
text = 'big python, python for big data. Python is the best!'
text.find('python', 18)









    Out[49]:





-1



In [53]:

    
text = 'big python, python for big data. Python is the best!'

# Lowercase once so the search is case-insensitive.
text_lower = text.lower()

# Collect the start index of every non-overlapping occurrence of 'python'.
start_index_list = []
start_idx = 0
while start_idx < len(text_lower):
    idx = text_lower.find('python', start_idx)
    if idx == -1:
        break
    start_index_list.append(idx)
    start_idx = idx + len('python')

print(start_index_list)









    



[4, 12, 33]
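
The same list of start positions can also be collected with the `re` module. This sketch uses `re.finditer` with `re.IGNORECASE` in place of lowercasing the text first:

```python
import re

text = 'big python, python for big data. Python is the best!'

# Each match object reports where its non-overlapping match begins.
starts = [m.start() for m in re.finditer('python', text, flags=re.IGNORECASE)]
print(starts)
```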

Date and time




Regular expressions

Problem to solve: finding and extracting specific patterns in text



In [63]:

    
text = """이성주의 전화번호는 010-1234-5678입니다. 
이메일은 seongjoo@codebasic.co입니다. 김성주의 전화번호는 010-4321-8765입니다."""



In [64]:

    
import re
phonePattern = re.compile(r'\d{3}-\d{4}-\d{4}')
phonePattern.findall(text)









    Out[64]:





['010-1234-5678', '010-4321-8765']
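
When the pattern contains capture groups, `findall` returns one tuple of groups per match, which helps when the parts of a phone number are needed separately. A sketch reusing the sample text; the names `grouped` and `digits_only` are illustrative:

```python
import re

text = """이성주의 전화번호는 010-1234-5678입니다.
이메일은 seongjoo@codebasic.co입니다. 김성주의 전화번호는 010-4321-8765입니다."""

# Parentheses mark the three parts of the number as separate groups.
grouped = re.compile(r'(\d{3})-(\d{4})-(\d{4})')
parts = grouped.findall(text)

# Join the groups to get each number without hyphens.
digits_only = [''.join(p) for p in parts]

print(parts)
print(digits_only)
```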




Unicode and encodings



In [35]:

    
text = 'Python'

The same Unicode text can be represented by different byte values depending on the encoding.



In [42]:

    
'Python'.encode('utf8')









    Out[42]:





b'Python'



In [43]:

    
'Python'.encode('utf16')









    Out[43]:





b'\xff\xfeP\x00y\x00t\x00h\x00o\x00n\x00'



In [44]:

    
'Python'.encode('utf32')









    Out[44]:





b'\xff\xfe\x00\x00P\x00\x00\x00y\x00\x00\x00t\x00\x00\x00h\x00\x00\x00o\x00\x00\x00n\x00\x00\x00'

Converting bytes into Unicode code points is called decoding.



In [49]:

    
text_encoded = '빅파이'.encode('utf8')
text_encoded









    Out[49]:





b'\xeb\xb9\x85\xed\x8c\x8c\xec\x9d\xb4'



In [50]:

    
text_encoded.decode('utf8')









    Out[50]:





'빅파이'

When decoding, the encoding must match the one that was used to encode.



In [51]:

    
text_encoded = '빅파이'.encode('utf8')
text_encoded.decode('cp949')









    



---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-51-fa22ca7aa465> in <module>()
      1 text_encoded = '빅파이'.encode('utf8')
----> 2 text_encoded.decode('cp949')

UnicodeDecodeError: 'cp949' codec can't decode byte 0xec in position 6: illegal multibyte sequence
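
When the source encoding is uncertain, the `errors` argument of `decode` controls what happens to bytes that cannot be decoded. A minimal sketch; the result of the lossy decode is garbled text and is shown only to illustrate that no exception is raised:

```python
text_encoded = '빅파이'.encode('utf8')

# Strict decoding (the default) raises on the first invalid sequence.
try:
    text_encoded.decode('cp949')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors='replace' substitutes a replacement character instead of raising.
lossy = text_encoded.decode('cp949', errors='replace')

# Encoding and decoding with the same codec round-trips cleanly.
round_trip = '빅파이'.encode('cp949').decode('cp949')

print(strict_failed, repr(lossy), round_trip)
```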

Getting a Unicode code point value



In [86]:

    
ord('파이썬'[0])









    Out[86]:





54028



In [87]:

    
hex(ord('파이썬'[0]))









    Out[87]:





'0xd30c'

Converting a code point value back to a Unicode character



In [88]:

    
chr(54028)









    Out[88]:





'파'



In [90]:

    
chr(0xd30c)









    Out[90]:





'파'

bytes and bytearray



In [55]:

    
text_encoded = '빅파이(bigpy)'.encode('utf8')
text_encoded









    Out[55]:





b'\xeb\xb9\x85\xed\x8c\x8c\xec\x9d\xb4(bigpy)'



In [56]:

    
text_encoded[0]









    Out[56]:





235



In [57]:

    
text_encoded[:1]









    Out[57]:





b'\xeb'
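
The cells above cover `bytes` only. As a sketch of the difference from `bytearray`: a `bytes` object is immutable, while a `bytearray` can be changed in place. The sample value is arbitrary:

```python
data = b'bigpy'

# bytes does not support item assignment.
immutable = False
try:
    data[0] = 66
except TypeError:
    immutable = True

# bytearray holds the same values but allows in-place edits.
buffer = bytearray(data)
buffer[0] = ord('B')      # replace the first byte
buffer.extend(b'!')       # append more bytes
result = bytes(buffer)    # convert back to an immutable bytes object

print(immutable, result)
```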




Vectorized string processing with pandas



In [131]:

    
from pandas import Series



In [132]:

    
data = {'이성주': 'seongjoo@email.com', '튜링': ''}



In [134]:

    
data = Series(data)
data









    Out[134]:





이성주    seongjoo@email.com
튜링                       
dtype: object



In [135]:

    
data.str.contains('email.com', regex=False)









    Out[135]:





이성주     True
튜링     False
dtype: bool



In [136]:

    
data.str.count(r'email\.com')









    Out[136]:





이성주    1
튜링     0
dtype: int64

Challenges

Extract the string inside the b tag from the string '프로그래밍 <b>파이썬</b>'.

>>> extract_tag('프로그래밍 <b>파이썬</b>', "b")

'파이썬'
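
One possible approach to this challenge, sketched with a regular expression. It assumes a simple, non-nested tag without attributes and is not a general HTML parser; only the function name `extract_tag` comes from the challenge, the rest is an assumption:

```python
import re

def extract_tag(text, tag):
    """Return the text inside the first <tag>...</tag> pair, or None."""
    # re.escape guards against tag names containing regex metacharacters.
    pattern = re.compile(r'<{0}>(.*?)</{0}>'.format(re.escape(tag)), re.DOTALL)
    match = pattern.search(text)
    return match.group(1) if match else None

inner = extract_tag('프로그래밍 <b>파이썬</b>', 'b')
print(inner)
```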

The string '데이터는 파이썬, 파이썬은 데이터' is given. Implement logic that determines whether the word '파이썬' appears exactly once in the sentence or two or more times.

For the string '앨런 매티슨 튜링(Alan Mathison Turing, OBE, FRS, 1912년 6월 23일 ~ 1954년 6월 7일)은 영국의 수학자, 암호학자, 논리학자이자 컴퓨터 과학의 선구적 인물이다.', do the following.

1) Extract the dates of birth and death and print them as follows.

출생: 1912-06-23 사망: 1954-06-07

2) Compute his age at death.



In [82]:

    
import re
from datetime import datetime

alan_turing =  """앨런 매티슨 튜링(Alan Mathison Turing, OBE, FRS, 1912년 6월 23일 ~ 1954년 6월 7일)은 
영국의 수학자, 암호학자, 논리학자이자 컴퓨터 과학의 선구적 인물이다."""

date_str_pattern = re.compile(r"\d{4}년 \d{1,2}월 \d{1,2}일")
birth_str, death_str = date_str_pattern.findall(alan_turing)

birth_date = datetime.strptime(birth_str, '%Y년 %m월 %d일')
death_date = datetime.strptime(death_str, '%Y년 %m월 %d일')
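# Count full years lived: subtract one if the final birthday was not reached.
had_birthday = (death_date.month, death_date.day) >= (birth_date.month, birth_date.day)
death_age = death_date.year - birth_date.year - (0 if had_birthday else 1)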

print("출생: {}".format(birth_date.strftime('%Y-%m-%d')))
print("사망: {}".format(death_date.strftime('%Y-%m-%d')))
print("사망 시 나이: {}세".format(death_age))









    



출생: 1912-06-23
사망: 1954-06-07
사망 시 나이: 41세


