Edit:
To close everything completely, it is better to append this:
# close the document
doc.Close(False)
# quit Word
word.Quit()
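To make sure both calls run even if something fails in between, the cleanup can be wrapped in a context manager. This is only a sketch: `word_session` is a hypothetical helper name, and `word` and `doc` are assumed to be the Word application and document objects created earlier (any objects with `Quit` and `Close` methods would do).

```python
from contextlib import contextmanager


@contextmanager
def word_session(word, doc):
    """Yield doc, then always close the document and quit Word.

    word_session is a hypothetical helper; word and doc are assumed to be
    the Word.Application object and the opened document.
    """
    try:
        yield doc
    finally:
        try:
            # close the document without saving changes
            doc.Close(False)
        finally:
            # quit Word even if closing the document failed
            word.Quit()
```

It would be used as `with word_session(word, doc) as d:` followed by whatever reads the document, so Word does not stay running in the background after an exception.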
Also, note that you should use an absolute path for your .doc file, not a relative one. Use this to get the absolute path:
import os
# for example, ``rel_path`` could be './myfile.doc'
full_path = os.path.abspath(rel_path)
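As far as I understand, a relative path can fail because Word may resolve it against its own working directory rather than the script's. A small sketch that also checks the file exists before handing the path to Word; `resolve_doc_path` is just an illustrative helper name, not part of any library:

```python
import os


def resolve_doc_path(rel_path):
    """Return the absolute path for rel_path, or raise if no such file exists.

    resolve_doc_path is a hypothetical helper for illustration.
    """
    full_path = os.path.abspath(rel_path)
    if not os.path.isfile(full_path):
        raise FileNotFoundError(full_path)
    return full_path
```

Failing early here gives a clearer error than whatever Word reports when it cannot find the file.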
The answer from Shivam Kotwalia works perfectly. However, the result is returned as bytes. Sometimes you may need it as a string, for example to run a regex on it.
I recommend the following code (two lines from Shivam Kotwalia's answer):
import textract
text = textract.process("path/to/file.extension")
text = text.decode("utf-8")
The last line converts text from bytes to a string.
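To show why the decode step matters for regex, here is a minimal sketch. The bytes below are a made-up stand-in for what `textract.process()` returns, so the example runs without textract installed.

```python
import re

# Stand-in for the bytes returned by textract.process(); the content is made up.
raw = b"Order 1042 shipped, order 2077 pending"

# Decode to str; errors="replace" keeps going if a byte is not valid UTF-8.
text = raw.decode("utf-8", errors="replace")

# A str pattern can only be applied to str, not to bytes.
numbers = re.findall(r"\d+", text)
print(numbers)  # ['1042', '2077']
```

Applying the same str pattern to `raw` directly raises a TypeError, which is the problem the extra line avoids.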
I agree with Shivam's answer, except that textract doesn't exist for Windows. Also, for some reason antiword fails to read some '.doc' files and gives an error:
'filename.doc' is not a word document. # This happens when the file wasn't generated via MS Office. Eg: Web-pages may be stored in .doc format offline.
So, I've got the following workaround to extract the text:
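from bs4 import BeautifulSoup as bs

# filename is the path to the web page that was saved with a .doc extension
with open(filename) as f:
    soup = bs(f.read(), 'html.parser')
# drop style and script elements so only the visible text remains
for s in soup(['style', 'script']):
    s.extract()
tmpText = soup.get_text()
# remove tabs and newlines
text = "".join("".join(tmpText.split('\t')).split('\n')).strip()
print(text)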
This script will work with most kinds of files. Have fun!
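To decide up front which route to take, you can look at the first bytes of the file. Real binary .doc files are OLE2 compound files and start with a fixed 8-byte signature, while a web page saved as .doc starts with markup. A rough sketch: `sniff_doc` is a hypothetical helper, and the HTML check is only a heuristic.

```python
OLE_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # signature of OLE2 compound files


def sniff_doc(path):
    """Return 'ole' for a binary Word file, 'html' for a saved web page, else 'unknown'.

    sniff_doc is a hypothetical helper; the HTML test is a heuristic only.
    """
    with open(path, "rb") as f:
        head = f.read(1024)
    if head.startswith(OLE_MAGIC):
        return "ole"
    lowered = head.lower()
    if b"<html" in lowered or b"<!doctype html" in lowered:
        return "html"
    return "unknown"
```

Files reported as 'ole' are candidates for antiword, and files reported as 'html' can go through the BeautifulSoup workaround.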
Prerequisites:
install antiword: sudo apt-get install antiword
install docx: pip install docx
from subprocess import Popen, PIPE
# opendocx and getdocumenttext are for reading .docx files; they are unused below
from docx import opendocx, getdocumenttext

def document_to_text(filename, file_path):
    # run antiword on the .doc file and capture what it prints
    cmd = ['antiword', file_path]
    p = Popen(cmd, stdout=PIPE)
    stdout, stderr = p.communicate()
    return stdout.decode('ascii', 'ignore')

print(document_to_text('your_file_name', 'your_file_path'))
Note: newer versions of python-docx removed the opendocx and getdocumenttext functions. Make sure to pip install docx and not the new python-docx.
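A slightly more defensive variant of the antiword call could look like the sketch below. It assumes antiword signals failure through a non-zero exit status and writes its message to stderr, which I have not verified for every version. The `tool` parameter is a hypothetical addition so the function can be tried with another command.

```python
import shutil
import subprocess


def document_to_text(file_path, tool="antiword"):
    """Run the converter on file_path and return its output as a string.

    The tool parameter is a hypothetical addition; it defaults to antiword.
    """
    tool_path = shutil.which(tool)
    if tool_path is None:
        raise RuntimeError(f"{tool} is not installed or not on PATH")
    result = subprocess.run(
        [tool_path, file_path],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    if result.returncode != 0:
        # assumption: the converter reports its error message on stderr
        raise RuntimeError(result.stderr.decode("utf-8", "ignore").strip())
    return result.stdout.decode("utf-8", "ignore")
```

This way a missing antiword or an unreadable file raises an error instead of silently returning an empty string.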
I looked for a solution for a long time. There is not much material about .doc files. I finally solved this problem by converting the .doc file to .docx:
import os
from win32com import client as wc

w = wc.Dispatch('Word.Application')
# Or use the following method to start a separate process:
# w = wc.DispatchEx('Word.Application')
doc = w.Documents.Open(os.path.abspath('test.doc'))
# 16 is wdFormatDocumentDefault, i.e. the .docx format
doc.SaveAs("test_docx.docx", 16)
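Once the file is a .docx, its text can be read without Word, because a .docx is a zip archive with the body stored in `word/document.xml`. A minimal standard-library sketch: `docx_to_text` is a hypothetical helper, and it only reads the main body, not headers, footers or footnotes.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the body text of a .docx file, one paragraph per line.

    docx_to_text is a hypothetical helper for illustration.
    """
    with zipfile.ZipFile(path) as z:
        root = ET.fromstring(z.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(W_NS + "p"):
        # a paragraph's text is spread over one or more <w:t> elements
        paragraphs.append("".join(t.text or "" for t in p.iter(W_NS + "t")))
    return "\n".join(paragraphs)
```

For example, `docx_to_text("test_docx.docx")` would return the text of the file saved above, provided you pass the location Word actually saved it to.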
I had to do the same to search through a ton of *.doc files for a specific number and came up with:
special_chars = {
    "b'\\t'": '\t',
    "b'\\r'": '\n',
    "b'\\x07'": '|',
    "b'\\xc4'": 'Ä',
    "b'\\xe4'": 'ä',
    "b'\\xdc'": 'Ü',
    "b'\\xfc'": 'ü',
    "b'\\xd6'": 'Ö',
    "b'\\xf6'": 'ö',
    "b'\\xdf'": 'ß',
    "b'\\xa7'": '§',
    "b'\\xb0'": '°',
    "b'\\x82'": '‚',
    "b'\\x84'": '„',
    "b'\\x91'": '‘',
    "b'\\x93'": '“',
    "b'\\x96'": '-',
    "b'\\xb4'": '´'
}

def get_string(path):
    string = ''
    with open(path, 'rb') as stream:
        stream.seek(2560)  # Offset - text starts after byte 2560
        current_stream = stream.read(1)
        # stop at the 0xfa marker, or at end of file so the loop cannot run forever
        while current_stream and not (str(current_stream) == "b'\\xfa'"):
            if str(current_stream) in special_chars.keys():
                string += special_chars[str(current_stream)]
            else:
                try:
                    char = current_stream.decode('UTF-8')
                    if char.isalnum():
                        string += char
                except UnicodeDecodeError:
                    # skip bytes that cannot be decoded
                    string += ''
            current_stream = stream.read(1)
    return string
I'm not sure how 'clean' this solution is, but it works well with regex.
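Most entries in `special_chars` match the Windows-1252 (cp1252) code page, so Python's built-in codec may be able to replace much of the table. This is an observation about the table above, not a claim about how Word stores text in general, and `decode_cp1252` is a hypothetical helper.

```python
def decode_cp1252(raw):
    """Decode bytes as Windows-1252, replacing bytes that code page leaves undefined.

    decode_cp1252 is a hypothetical helper for illustration.
    """
    return raw.decode("cp1252", errors="replace")


print(decode_cp1252(b"Stra\xdfe \xa7 5, 20\xb0"))  # Straße § 5, 20°
```

The custom mappings in the table (0x07 to '|', '\r' to a newline, 0x96 to a plain hyphen where cp1252 gives an en dash) would still need to be handled separately.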
If you are looking for how to read a .doc file in Python, this code will work. Install all the related packages first, then run it and see the result.
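import os
from io import BytesIO

import pythoncom
import requests
import win32com.client as win32

# doc_file and request are defined elsewhere (not shown)
if doc_file:
    # download the attachment and save it to disk
    _file = requests.get(request.values['MediaUrl0'])
    doc_file_link = BytesIO(_file.content)
    file_path = os.path.join(os.getcwd(), 'data.doc')
    with open(file_path, 'wb') as E:
        E.write(doc_file_link.getbuffer())
    # initialize COM for this thread, then open the file in Word
    pythoncom.CoInitialize()
    word = win32.gencache.EnsureDispatch('Word.Application')
    doc = word.Documents.Open(file_path)
    doc.Activate()
    doc_data = doc.Range().Text
    print(doc_data)
    doc.Close(False)
    # delete the downloaded file again
    if os.path.exists(file_path):
        os.remove(file_path)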
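If Word raises an error halfway through, the downloaded file is left behind. One way around that is to write to a temporary file and delete it in a `finally` block. In this sketch `with_temp_doc` and `handler` are hypothetical names; `handler` stands for whatever opens the path in Word and returns the text.

```python
import os
import tempfile


def with_temp_doc(content, handler, suffix=".doc"):
    """Write content (bytes) to a temp file, call handler(path), always delete the file.

    with_temp_doc and handler are hypothetical names for illustration.
    """
    fd, file_path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(content)
        # the file is closed at this point, so another program can open it
        return handler(file_path)
    finally:
        if os.path.exists(file_path):
            os.remove(file_path)
```

`tempfile.mkstemp` returns an absolute path, which also fits the advice about absolute paths given earlier.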
!pip install python-docx
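import docx

# Creating a word file object
doc = open("file.docx", "rb")
# creating word reader object
document = docx.Document(doc)
# collect the text of all paragraphs
text = "\n".join(p.text for p in document.paragraphs)
doc.close()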